Cloudflare's AI Bot Shield: One-Click Protection

Main features of AI Bot Shield

Easy to use

Website operators can navigate to the “Security > Bots” section of the Cloudflare dashboard and enable the toggle labeled “AI Scrapers and Crawlers” to block all AI bots with one click.

Wide range of applications

This feature is available to all Cloudflare customers, including free users, at no additional charge.

Automatic Updates

The feature automatically updates based on Cloudflare’s analysis of network traffic and newly discovered AI bot characteristics, continuously identifying and blocking new malicious crawlers and scrapers.

Full network coverage

Cloudflare analyzes global web traffic to identify and flag a large number of common AI bots to ensure comprehensive protection. The most common crawlers currently include Bytespider, Amazonbot, ClaudeBot, and GPTBot.

Machine Learning Support

Cloudflare uses machine learning models and globally aggregated signals to identify and block AI bots that spoof their user agent to pass as real browsers.

Enhanced content protection

By blocking unauthorized AI bots from accessing website content, the feature protects the original work of content creators and prevents it from being used for training or inference by unauthorized AI models.

Steps to enable the AI Bot Shield

Navigate to the Security > Bots section of the Cloudflare dashboard.

Find and enable the AI Scrapers and Crawlers toggle.

More details: https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click

To help preserve a safe Internet for content creators, we have just launched a new “easy button” that blocks all AI bots. It’s available to all customers, including those on the Free plan.

The popularity of generative AI has dramatically increased the demand for content to train models or run inference, and while some AI companies clearly identify their web crawlers, not all of them are transparent. Google reportedly pays $60 million per year to license user-generated content from Reddit, Scarlett Johansson claims OpenAI used her voice without her consent to develop a new personal assistant, and more recently, Perplexity was accused of impersonating legitimate visitors to scrape website content. The value of large quantities of original content has never been higher.

Last year, Cloudflare announced that customers could easily block well-behaved AI bots. These bots follow robots.txt and do not use unlicensed content to train their models or run inference for RAG applications that use website data. Although these AI bots follow the rules, Cloudflare’s customers generally choose to block them.

We’ve heard loud and clear that customers don’t want AI bots accessing their sites, especially those that are dishonest. To that end, we’ve added a new “Block All AI Bots in One Click” feature. It’s available to all customers, including Free Tier users. To enable it, simply navigate to the Security > Bots section of the Cloudflare dashboard and click the toggle labeled AI Scrapers and Crawlers.

This feature will update automatically as we discover new fingerprints of bots that crawl the web extensively for model training. To ensure we have a comprehensive view of all AI crawler activity, we surveyed the traffic on our network.

AI Bot Activity Today

The following chart shows the most popular AI bots by request volume on the Cloudflare network. We looked at common AI crawler user agents and aggregated the number of requests these AI user agents have made on our platform over the past year:

[Figure: request volume on the Cloudflare network by AI crawler user agent over the past year]

When looking at the number of requests sent to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers. Bytespider is operated by ByteDance, the Chinese company that owns TikTok, and is reportedly used to collect training data for its large language models (LLMs), including the model that powers its ChatGPT competitor Doubao. Amazonbot and ClaudeBot follow closely behind. Amazonbot, reportedly used to index content for Alexa’s question answering, ranks second by request volume, and ClaudeBot, used to train the Claude chatbot, has recently increased its request volume.
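A site operator can run the same kind of aggregation over their own access logs. The following is a minimal sketch, not Cloudflare's pipeline: the function name and the sample User-Agent strings are hypothetical, and the token list contains only the four crawlers named above. It counts only crawlers that identify themselves in the User-Agent header.

```python
from collections import Counter

# User-agent tokens named in this article; extend the tuple as needed.
AI_CRAWLER_TOKENS = ("Bytespider", "Amazonbot", "ClaudeBot", "GPTBot")


def count_ai_crawler_requests(user_agents):
    """Count requests per AI crawler from an iterable of User-Agent strings."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


# Hypothetical, simplified User-Agent headers standing in for log entries.
sample_user_agents = [
    "Mozilla/5.0 (compatible; Bytespider)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0 Safari/537.36",
    "Mozilla/5.0 (compatible; Bytespider)",
]

crawler_counts = count_ai_crawler_requests(sample_user_agents)
for name, requests in crawler_counts.most_common():
    print(f"{name}: {requests}")
```

Requests from an ordinary browser user agent are left out of the tally, which is also why this approach misses crawlers that disguise themselves.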

Of the top AI bots we looked at, _Bytespider_ leads not only in number of requests, but also in the breadth of Internet properties it crawls and how often it is blocked. It is followed by _GPTBot_, which ranks second in both crawling and being blocked. _GPTBot_ is managed by OpenAI and collects training data for its LLMs, which underpin AI-powered products like ChatGPT. The “percentage of websites visited” in the table below refers to the share of Cloudflare-protected sites visited by the named AI bot.

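| AI Bot | Percentage of websites visited |
| --- | --- |
| Bytespider | 40.40% |
| GPTBot | 35.46% |
| ClaudeBot | 11.17% |
| ImagesiftBot | 8.75% |
| CCBot | 2.14% |
| ChatGPT-User | 1.84% |
| omgili | 0.10% |
| Diffbot | 0.08% |
| Claude-Web | 0.04% |
| PerplexityBot | 0.01% |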

While our analysis identifies the most popular crawlers in terms of request volume and number of Internet properties visited, many customers may not be aware that these most popular AI crawlers are actively crawling their sites. Our Radar team analyzed the robots.txt entries of the top 10,000 Internet domains to identify the most commonly used AI bots, and then looked at how often we saw these bots on sites protected by Cloudflare.

In the image below, which shows the bots these sites disallow, we see that customers most commonly reference _GPTBot_, _CCBot_, and _Google_ in their robots.txt, but do not specifically disallow popular AI bots like _Bytespider_ and _ClaudeBot_.

[Figure: distribution of user agents disallowed in robots.txt]

As these AI bots flood the Internet, we were curious about how website operators have responded. In June, AI bots visited approximately 39% of the top million Internet properties using Cloudflare, but only 2.98% of these properties took steps to block or challenge these requests. Furthermore, the higher the ranking (and more popular) an Internet property is, the more likely it is to be targeted by AI bots, and accordingly, the more likely it is to block these requests.

We have seen website operators use robots.txt to completely block access from these AI crawlers. However, these blocks depend on crawler operators respecting robots.txt and following RFC 9309 (ensuring that all variations of the user agent match the product token) so that they identify themselves honestly when accessing Internet properties. User agents are easy for crawler operators to change.
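The limitation is easy to see with a robots.txt parser. The sketch below uses Python's standard-library `urllib.robotparser`, which may not follow RFC 9309 in every detail. The robots.txt content, the `is_allowed` helper, and the example URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows two AI crawlers named in this article.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url="https://example.com/articles/1"):
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that announces itself is disallowed.
print(is_allowed("GPTBot/1.0"))
# A crawler that the file does not name falls through to the wildcard rule.
print(is_allowed("Bytespider"))
# The same crawler sending a browser-like user agent is also allowed.
print(is_allowed("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

The rules only bind a crawler that both reads the file and sends a user agent matching one of the listed tokens. Nothing in robots.txt itself can enforce either condition.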

How AI Bot Shield Spots AI Bots Pretending to Be Real Browsers

Unfortunately, we have observed bot operators attempting to pass as real browsers by using spoofed user agents. We have monitored this activity over time, and we are proud to say that our global machine learning model has consistently recognized it as bot traffic, even when operators lie about their user agent.

Take, for example, a specific bot that others have observed hiding its activity. We ran an analysis to see how our machine learning model would score this bot’s traffic. In the image below, you can see that all bot scores are firmly below 30, indicating that our scoring considers this activity likely to come from a bot.

This graph reflects the results of scoring requests using our latest model, with “hot” colors indicating that more requests fall into a range and “cold” colors indicating fewer. The vast majority of requests fall into the lowest two ranges, meaning Cloudflare’s model scores the offending bot at 9 or below. Changing the user agent has no effect on the score, because that is the first thing we expect a bot operator to do.

Any customer who has set up WAF rules to challenge bot scores below 30 (our recommended value) has all of this AI bot traffic blocked automatically, with no new action required. The same will be true for future AI bots that use similar techniques to hide their activity.
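As a rough illustration of that threshold, the sketch below models the decision such a rule makes. It assumes bot scores run from 1 to 99 and that the rule issues a challenge. The expression string uses Cloudflare's `cf.bot_management.score` field, which assumes a plan that exposes bot scores to rules. The function name and return values are hypothetical.

```python
# Rule expression for the recommended threshold, written in Cloudflare's
# Rules language. Pairing it with a challenge action is an assumption here.
CHALLENGE_EXPRESSION = "(cf.bot_management.score lt 30)"

RECOMMENDED_THRESHOLD = 30


def action_for(bot_score, threshold=RECOMMENDED_THRESHOLD):
    """Return the action a 'challenge scores below threshold' rule would take."""
    if not 1 <= bot_score <= 99:
        raise ValueError("bot scores are assumed to range from 1 to 99")
    return "challenge" if bot_score < threshold else "allow"


# A request scored 9, like the disguised bot discussed above, is challenged.
print(action_for(9))
# A request scored exactly 30 is not below the threshold, so it is allowed.
print(action_for(30))
```

Because the decision depends only on the score, a crawler that changes its user agent but keeps the same underlying behavior receives the same treatment.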

We leverage Cloudflare global signals to calculate our bot scores, and for the AI bot above, we correctly identified and scored it as “likely bot.”

When bad actors attempt to scrape websites at scale, they often use tools and frameworks that we are able to fingerprint. For each fingerprint we see, we use Cloudflare’s network (which handles over 57 million requests per second on average) to understand how much we should trust that fingerprint. To power our model, we compute global aggregates across multiple signals. Based on these signals, our model correctly labels the evasive AI bot traffic in the example above as a bot.

The benefit of this globally aggregated data is that we can immediately detect new crawlers and their behaviors without having to manually fingerprint bots, ensuring customers are protected from the latest wave of bot activity.
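To make the idea of a global aggregate concrete, here is a toy sketch. It is not Cloudflare's model: the single signal (whether a request looked automated), the fingerprint and zone names, and the data are all hypothetical. It shows only how observations from many sites can be pooled per fingerprint.

```python
from collections import defaultdict


def fingerprint_bot_ratio(observations):
    """Pool observations from all zones into one ratio per fingerprint.

    observations: iterable of (fingerprint, zone, looked_automated) tuples.
    Returns {fingerprint: share of its requests that looked automated}.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for fingerprint, _zone, looked_automated in observations:
        totals[fingerprint] += 1
        if looked_automated:
            flagged[fingerprint] += 1
    return {fp: flagged[fp] / totals[fp] for fp in totals}


# Hypothetical observations: one fingerprint seen on three different zones.
observations = [
    ("fp-scraper-lib", "zone-a", True),
    ("fp-scraper-lib", "zone-b", True),
    ("fp-scraper-lib", "zone-c", True),
    ("fp-browser", "zone-a", False),
    ("fp-browser", "zone-a", False),
    ("fp-browser", "zone-b", True),
    ("fp-browser", "zone-c", False),
]

ratios = fingerprint_bot_ratio(observations)
for fp, ratio in sorted(ratios.items()):
    print(f"{fp}: {ratio:.2f}")
```

A site that sees a fingerprint for the first time can still benefit from what other sites have already observed about it, which is the property the paragraph above describes.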

If you have a tip about a misbehaving AI bot, we’d love to investigate. You can report a misbehaving AI crawler in two ways:

Enterprise Bot Management customers can submit a False Negative Feedback Loop report through Bot Analytics by selecting the segment of traffic where they noticed the misbehavior.

We’ve also set up a reporting tool where any Cloudflare customer can submit reports of AI bots crawling their site without permission.

We are concerned that some AI companies intent on bypassing the rules to access content will keep adapting to evade bot detection. We will continue to monitor closely, add more bot blocks to our AI Scrapers and Crawlers rule, and improve our machine learning models, to help keep the Internet a place where content creators can thrive and retain full control over which models their content is used to train or run inference on.
