TL;DR

On June 3, 2026, six major US publishers (AP, New York Times, NBC, Bloomberg, NPR, Fox), acting through Digital Content Next, sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop archiving their content and delete what it had already collected. For a small business, the real question isn't "block or not" but "how do I get cited by AI rather than merely scraped." For most companies, blocking AI crawlers usually means vanishing from the answers in ChatGPT and Google AI.

On June 3, 2026, Digital Content Next (DCN) — the consortium that includes the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR and Fox — sent a cease-and-desist letter to the Common Crawl Foundation, demanding it stop collecting, retaining and sharing their protected content, and delete what already sits in its datasets. DCN CEO Jason Kint made the letter public the next day, as Search Engine Land reported on June 10.

Common Crawl isn't a household name, but it's one of the most consequential foundations in the AI ecosystem. This non-profit has archived the public web since 2008 and publishes its data for free. The result: its archives have become raw material for AI. According to the New York Times' 2023 lawsuit against OpenAI, they made up roughly 60% of GPT-3's training data.

The legal argument: permission, not opt-out

The heart of the cease-and-desist isn't technical, it's legal. DCN argues that "copyright law is not an opt-out system": Common Crawl should obtain permission before including content, rather than publishers having to ask to be excluded after the fact. Jason Kint summed it up in a blog post: the notice "challenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetized simply because it is technically accessible."

It's a reversal of the web's historical norm: "crawl by default, honor opt-outs when asked" becomes "permission first, inclusion second." If that principle took hold, it would reshape the entire generative-AI supply chain.

For its part, Common Crawl executive director Rich Skrenta denied bypassing paywalls and stated: "When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset." As of mid-June 2026 this is a cease-and-desist, not yet a lawsuit. It follows an earlier letter from the News/Media Alliance in April 2026.

Why this matters to you: the content publishers are trying to protect is exactly what feeds the answers in ChatGPT, Perplexity and Google AI Mode. The fight over content access quietly decides who gets cited by AI tomorrow — and who is left out.

For a small business, the math is reversed

A major publisher protects a paywalled catalog and negotiates multi-million-dollar licenses with AI labs, as in the licensing deals OpenAI has signed with publishers. Its logic is defensive: scarcity in exchange for payment.

For a small business, the math is the opposite. Your content isn't a treasure to lock away, it's your acquisition channel. Blocking AI crawlers, whether via robots.txt or a firewall, usually means making yourself invisible right where prospects now ask their questions. And blocking isn't even as simple as it looks: llms.txt is a proposed guide to your content rather than an access control, what Google actually does with it stays blurry, and AI crawlers reportedly now outnumber humans in traffic volume.

Training ≠ citation: the distinction that changes everything

Key figures:

  • 60%: share of GPT-3's training data that came from Common Crawl (NYT lawsuit, 2023)
  • 6: major US publishers behind the June 3, 2026 cease-and-desist
  • 2008: year the Common Crawl archive was created

The costliest confusion is mixing up two very different things:

  • Training data (Common Crawl, GPTBot, ClaudeBot…) builds a model's general knowledge, once, at a fixed point in time. It shapes what the AI "knows" about your brand, with no clickable link.
  • Real-time citation (retrieval) happens when the AI fetches pages at the moment of answering and shows a link to the source. That's what decides whether you're cited and clicked today.

Blocking training-data collection doesn't stop real-time citation — and vice versa. The two are controlled separately, crawler by crawler, use case by use case.
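
To make "crawler by crawler" concrete, here is a minimal sketch in Python. It assumes a hypothetical robots.txt that opts out of training crawlers while staying open to retrieval, and checks it with the standard library's urllib.robotparser. The split of user agents into "training" and "retrieval" groups is our assumption (OAI-SearchBot, ChatGPT-User and PerplexityBot are not named in this article); check each vendor's documentation before relying on it. The example.com URL and the audit function are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training collection, stay open to retrieval.
ROBOTS_TXT = """\
# Training-data collection: opted out
User-agent: CCBot
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

# Everyone else, including real-time retrieval agents: allowed
User-agent: *
Allow: /
"""

# Assumed grouping of user agents by use case (verify with each vendor).
TRAINING_AGENTS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_AGENTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def audit(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    page = "https://example.com/pricing"  # placeholder URL
    report = audit(ROBOTS_TXT, page, TRAINING_AGENTS + RETRIEVAL_AGENTS)
    for agent, allowed in report.items():
        print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Under this sketch the four training agents come back blocked and the three retrieval agents allowed. Two caveats: robots.txt is advisory, so it only affects crawlers that choose to honor it, and Python's parser matches user agents by substring, which may differ from how a given crawler interprets the file.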

What to do now

  1. Audit who accesses your content. List the crawlers allowed and blocked in your robots.txt (CCBot, GPTBot, ClaudeBot, Google-Extended, PerplexityBot…). Many sites block some of them without realizing it, for example through a CDN or security-plugin default.
  2. Decide deliberately. Protecting paid premium content? Blocking is defensible. Selling a service to SMBs or consumers? Visibility almost always wins.
  3. Optimize for citation, not just for scraping. Direct answers, hard numbers, clean structure and schema.org: that's what gets an AI to cite you in its answers (GEO) instead of merely reading you.
  4. Measure your AI presence. The new AI reports in Search Console finally let you track impressions inside Google's generative features.
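
For step 3, schema.org markup is the part that is easiest to show. The sketch below builds a FAQPage object in JSON-LD from question and answer pairs, using the real schema.org types FAQPage, Question and Answer. The faq_jsonld helper is a hypothetical name of ours, and nothing here guarantees a citation: structured data only makes your answers easier for a machine to parse.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


if __name__ == "__main__":
    faq = faq_jsonld([
        (
            "What is Common Crawl?",
            "Common Crawl is a non-profit foundation that archives the "
            "public web and publishes its data for free.",
        ),
    ])
    # Paste the output into a <script type="application/ld+json"> tag.
    print(json.dumps(faq, indent=2, ensure_ascii=False))
```

The markup should mirror questions and answers that are actually visible on the page, so the direct answers and hard numbers still have to exist in the content itself.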

Our take

This fight is legitimate for publishers whose business is paid content. But it sends the wrong signal if a small business concludes it should "barricade" itself against AI. In a web where a growing share of content is already machine-written, the risk isn't being copied: it's being absent. The right stance isn't to block blindly, nor to leave everything open — it's to choose, page by page, and to build content distinctive enough to be cited rather than diluted.

What this article doesn't cover

This is not legal advice: the opt-out vs permission-first debate depends on the applicable jurisdiction and remains, at this stage, a cease-and-desist not ruled on by any court. We also don't get into the detailed technical configuration of robots.txt or llms.txt for your CMS — it varies too much across stacks. And the situation can move fast: Common Crawl, DCN and the AI labs could announce a deal or a lawsuit at any time.

Frequently asked questions

What is Common Crawl?
Common Crawl is a non-profit foundation that archives the public web and publishes its data for free. Those archives have become a primary training source for large AI models: according to the New York Times' 2023 lawsuit against OpenAI, they made up roughly 60% of GPT-3's training data.
Does blocking Common Crawl actually protect my content from AI?
Only partly. Blocking the CCBot crawler stops Common Crawl from archiving your pages, but other crawlers (OpenAI's GPTBot, Anthropic's ClaudeBot, Google-Extended) collect independently. Blocking training-data collection also does nothing to stop the real-time retrieval (RAG) that powers citations in ChatGPT Search or AI Overviews. Blocking must be handled crawler by crawler, and use case by use case.
Should a small business block AI crawlers?
Rarely. A major publisher protects a paywalled catalog and negotiates multi-million-dollar licenses. A small business depends on visibility: blocking AI crawlers usually means disappearing from the answers in ChatGPT, Perplexity or Google AI Mode, exactly where prospects now ask their questions. For most SMBs, the goal is not to block but to be cited correctly.
What's the difference between training data and real-time citation?
Training data (like Common Crawl) builds a model's general knowledge, once, at a fixed point in time. Real-time citation (retrieval) happens when the AI fetches pages at the moment of answering a question and shows a link to the source. The first shapes what the AI knows about you; the second decides whether you get cited and clicked today.

Alexis Dollé
CEO & Founder

I'm a growth and SEO content strategist, and I founded Cicéro to help businesses build lasting organic visibility, on Google and in AI-generated answers alike. Every piece of content we produce is designed to convert, not just to exist.
