The 20-second version

  • The fact: on June 23, 2026, Google clarified in its documentation how many bytes of textual content Googlebot crawls.
  • The number: 2MB for HTML and text files, 64MB for PDFs — measured on uncompressed data, HTTP headers included.
  • The risk: past 2MB, Googlebot stops reading. Content and structured data placed after the threshold are never indexed.
  • The fix: slim down the HTML, externalize CSS/JS, and move useful content to the top of the file.

On June 23, 2026, Google updated the "What Is Googlebot" page of its Search Central documentation to clarify how many bytes of textual content, HTML in particular, Googlebot actually downloads from a page. The answer fits in two numbers: Googlebot crawls the first 2MB of an HTML file for Google Search, and the first 64MB of a PDF.

This isn't a new limit. It's a clarification of the rule introduced in March 2026 in the "Inside Googlebot" post, which for the first time separated two phases — fetching and indexing. The June update spells out exactly what the 2MB threshold covers and resolves an ambiguity that had lingered in the SEO community since spring.

What the documentation actually says

2MB of HTML crawled by Googlebot for Search
64MB for a PDF file
0 bytes indexed past the threshold

Three points are worth pausing on, because they change how you build a page:

  • The limit applies to uncompressed data. Your server may ship 300KB gzipped, but it's the weight after decompression that counts. HTML that's "light" over the wire can be heavy on arrival.
  • HTTP request headers are included in the count. Marginal on most sites, but real.
  • Going over isn't a rejection, it's a truncation. Once 2MB is reached, Googlebot stops the fetch and only sends the already-downloaded part for indexing consideration. The rest doesn't exist for Google.

Each resource the page calls (CSS, JavaScript) is fetched separately and bound by the same limit. The HTML, however, is one file: everything inside it — menus, inline scripts, base64-encoded images — eats into your 2MB budget before your article even begins.
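
To make the truncation rule concrete, here is a minimal Python sketch. It rests on assumptions that are not in Google's text: "2MB" is read as 2 × 1024 × 1024 bytes, headers are charged against the same budget, and the name `visible_to_crawler` and the 500-byte header figure are purely illustrative.

```python
import gzip

LIMIT_BYTES = 2 * 1024 * 1024  # assumption: "2MB" read as 2 MiB


def visible_to_crawler(body: bytes, header_bytes: int = 0,
                       limit: int = LIMIT_BYTES) -> bytes:
    """Return the part of an UNCOMPRESSED body that fits in the byte budget.

    header_bytes is a hypothetical figure for the HTTP headers that the
    documentation says are included in the count.
    """
    budget = max(limit - header_bytes, 0)
    return body[:budget]


# A very repetitive page of about 3.25 MB compresses extremely well on the wire.
html = b"<p>filler</p>" * 250_000
wire = gzip.compress(html)
kept = visible_to_crawler(html, header_bytes=500)

print("bytes on the wire (gzip):", len(wire))
print("bytes after decompression:", len(html))
print("bytes kept under the limit:", len(kept))
```

The point of the comparison is that the gzip figure says nothing about the budget: only the decompressed length matters, and everything after the cut is simply absent.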

Why this matters for small businesses

On paper, 2MB of raw HTML is enormous: a healthy page rarely exceeds 100KB of markup. In practice, several modern architectures inflate the HTML without anyone noticing:

  • base64-encoded images embedded straight in the HTML, each weighing hundreds of kilobytes;
  • massive blocks of inline CSS and JavaScript, common on sites built with page builders or misconfigured frameworks;
  • mega-menus and duplicated data at the top of the document, pushing real content downward;
  • JS framework output that serializes the full app state (hydration payloads) into the HTML.
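
One way to see which of these culprits weighs on a given page is to tally the bytes they occupy. The sketch below uses Python's standard `html.parser`; the class and function names are made up for illustration, and it counts every `<script>` body (JSON-LD and hydration payloads included) under "script".

```python
from html.parser import HTMLParser


class InlineWeight(HTMLParser):
    """Tally bytes spent on inline <script>/<style> bodies and data: URIs."""

    def __init__(self):
        super().__init__()
        self.totals = {"script": 0, "style": 0, "data_uri": 0}
        self._current = None  # "script" or "style" while inside such a tag

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = tag
        # Attribute values such as src="data:image/png;base64,..." are inline payloads.
        for _name, value in attrs:
            if value and value.startswith("data:"):
                self.totals["data_uri"] += len(value.encode("utf-8"))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.totals[self._current] += len(data.encode("utf-8"))


def inline_weight(html: str) -> dict:
    """Return byte totals for inline scripts, inline styles and data: URIs."""
    parser = InlineWeight()
    parser.feed(html)
    parser.close()
    return parser.totals


if __name__ == "__main__":
    sample = ('<style>a{}</style><script>var x=1;</script>'
              '<img src="data:image/png;base64,AAAA"><p>hi</p>')
    print(inline_weight(sample))
```

Run against the saved source of a real page, the three totals give a rough idea of how much of the budget is spent before any article text appears.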

The danger is silent: the page renders perfectly for the user, but content sitting past the 2MB mark — or your JSON-LD structured data if it's in the footer — is never read. This ties directly into how Google chunks and reads content for Search and AI: if the crawler doesn't reach your text, neither Google nor generative engines can cite it.
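
A quick way to test for this silent failure is to locate each JSON-LD block in the raw bytes and compare its position with the threshold. This is only a sketch: the regular expression is a simplification rather than a full HTML parser, the limit is assumed to be 2 × 1024 × 1024 bytes, header bytes are ignored, and `json_ld_past_limit` is a hypothetical helper name.

```python
import re

LIMIT_BYTES = 2 * 1024 * 1024  # assumption: "2MB" read as 2 MiB

# Simplified pattern for <script type="application/ld+json"> ... </script>
JSON_LD = re.compile(
    rb'<script\b[^>]*type=["\']application/ld\+json["\'][^>]*>.*?</script>',
    re.IGNORECASE | re.DOTALL,
)


def json_ld_past_limit(html: bytes, limit: int = LIMIT_BYTES) -> list[int]:
    """Return the end offsets of JSON-LD blocks not fully inside the limit."""
    return [m.end() for m in JSON_LD.finditer(html) if m.end() > limit]


if __name__ == "__main__":
    block = b'<script type="application/ld+json">{"a":1}</script>'
    page = block + b"x" * LIMIT_BYTES + block  # one block early, one in the "footer"
    print(json_ld_past_limit(page))
```

An empty list means every JSON-LD block ends before the assumed cut; any offset returned marks a block that would be lost or cut in half.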

The GEO (generative engine optimization) angle. The same logic applies to visibility in AI answers. Generative engines rely on content that's accessible and readable server-side. Bloated HTML that buries your expertise under kilobytes of code isn't just a classic SEO problem: it also keeps you from being cited by AI.

What to do now

  1. Measure your raw HTML. Not the rendered weight in the browser: the source HTML, uncompressed. From the command line: curl -s https://yoursite.com/page | wc -c. If you're nearing a megabyte, act.
  2. Externalize CSS and JavaScript. Move them out of the HTML into dedicated .css and .js files. They're crawled separately, with their own limit.
  3. Ban base64 images from the HTML. Serve them as .webp files with real URLs. Inline base64 is the worst enemy of your byte budget.
  4. Move useful content to the top of the file. Your main text and structured data should appear early in the HTML, not after 1.8MB of menus and scripts.
  5. Check your structured data. JSON-LD placed late in the document is the first thing to drop on a heavy page. Put it in the <head>.
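
Step 1 can be scripted to check several URLs at once. The following Python sketch mirrors the curl one-liner using only the standard library. The thresholds are assumptions: the limit is read as 2 × 1024 × 1024 bytes and the warning level as half of that (the "nearing a megabyte" rule of thumb). The function names are illustrative, and HTTP headers are not counted here.

```python
import sys
import urllib.request

LIMIT_BYTES = 2 * 1024 * 1024   # assumption: "2MB" read as 2 MiB
WARN_BYTES = LIMIT_BYTES // 2   # assumption: "nearing a megabyte" means act


def budget_status(size: int) -> str:
    """Classify a raw HTML size against the assumed thresholds."""
    if size > LIMIT_BYTES:
        return "truncated"
    if size >= WARN_BYTES:
        return "act"
    return "ok"


def raw_html_size(url: str) -> int:
    """Download a page without compression and return the body size in bytes."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return len(response.read())


if __name__ == "__main__":
    # Usage: python html_budget.py https://yoursite.com/page [more URLs...]
    for url in sys.argv[1:]:
        size = raw_html_size(url)
        print(f"{size:>10}  {budget_status(size):<9}  {url}")
```

Asking for `identity` encoding keeps the measurement on uncompressed bytes, which is the figure the limit applies to.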

This technical-hygiene reflex goes hand in hand with editorial quality: Google now builds entity profiles from what it reads on your site, as shown by a recent patent on how AI learns from websites. But the crawler has to reach that content first.

Our take

This update revolutionizes nothing; it puts an old discipline back at the center. For years, HTML weight was a performance issue; it's becoming an indexability issue. The lesson is plain: in the source file, your content must come before the bulk of the code. Sites that pile on frameworks without watching their raw weight risk having Google, and the AIs that lean on it, never read the essentials. 2MB is generous. The problem isn't the limit: it's everything you put in front of the content.

Frequently asked questions

How many bytes of an HTML page does Googlebot crawl?
For Google Search, Googlebot crawls the first 2MB of an HTML file (and supported text files), and the first 64MB of a PDF. The limit applies to uncompressed data and includes the HTTP request headers. Once that threshold is reached, Googlebot stops the fetch and only sends the already-downloaded part for indexing consideration.

Is my page rejected if it's larger than 2MB?
No. Google does not reject the page. It stops reading at 2MB. The risk isn't rejection but truncation: if your real content or structured data sit past the threshold, they are never considered for indexing.

How do I check my page's HTML weight?
Measure the raw, un-rendered, uncompressed HTML, for example with curl -s URL | wc -c, or via the URL Inspection tool in Search Console. Aim well under the 2MB threshold: most healthy pages weigh under 100KB of HTML, excluding images and external resources.

What this article doesn't cover

This piece is about the textual-content byte limit (HTML and text files) Googlebot applies for Search. It does not cover the 15MB fetching limit that applies to generic crawlers, the overall crawl budget (frequency and number of URLs crawled), or the JavaScript rendering limits of the Web Rendering Service (WRS), which follow other mechanisms. The figures cited are those published by Google and may change, so always check the current source documentation.

Alexis Dollé
CEO & Founder

A growth and SEO content strategist, I founded Cicéro to help businesses build lasting organic visibility, on Google and in AI-generated answers alike. Every piece of content we produce is designed to convert, not just to exist.