Crawl Budget
Crawl budget refers to the number of URLs Googlebot will crawl on a site within a given period. Google allocates a finite crawl capacity across the entire web, and each site receives a share of it based on its size, authority, and server performance. When a site has more URLs than budget allows, some pages will be crawled infrequently or not at all.
When does crawl budget matter?
For most sites, crawl budget is not a meaningful constraint. A site with a few thousand pages will have all its important content crawled regularly regardless of crawl efficiency.
Crawl budget becomes a practical concern when a site has hundreds of thousands of indexable URLs. E-commerce sites with large product catalogues, news publishers with extensive archives, and sites with significant parameterised URL spaces are the most likely to be affected. For these sites, crawl budget management is not an optimisation exercise but a requirement for consistent indexation.
How does Google determine crawl budget?
Google determines how much of a site to crawl based on two factors:
Crawl demand reflects how much Google wants to crawl a site. Signals include the number and quality of inbound links, how frequently content changes, and how often users click on results from the site. High-authority sites with frequently updated content receive higher crawl demand.
Crawl rate limit reflects how fast Googlebot can crawl without overloading the server. Google monitors server response times and backs off if pages are slow or returning errors. A slow server reduces effective crawl budget even if demand is high. The Search Console Crawl Rate Limiter tool was deprecated on January 8, 2024.1 Googlebot now manages the rate automatically, reducing crawling when servers return errors or respond slowly. For unusual crawling activity, Google’s Googlebot report form is the available option.
The effective crawl budget is the lower of these two: Google wants to crawl X pages, but can only crawl Y without hurting the server, so actual crawl volume is whichever of X and Y is smaller.
What causes crawl waste?
Crawl waste occurs when Googlebot spends its allocated budget on URLs that have no ranking value. Common sources:
Faceted navigation URLs are generated by filter combinations on category pages. A clothing site with 5 colour options, 6 size options, and 4 sort orders generates a few hundred distinct URLs from a single category when each filter takes one value, and thousands once shoppers can select several colours or sizes at the same time. Most of these are near-duplicates of each other and of the base category URL, and they should not be indexed.
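The arithmetic behind the clothing example depends on how the filters combine. The sketch below is illustrative only: it assumes every facet is optional, counts the unfiltered category URL as one of the combinations, and compares single-value filters with multi-select colour and size filters. The function names are hypothetical.

```python
def single_select_urls(colours: int, sizes: int, sorts: int) -> int:
    """Each facet is either unset or set to exactly one option."""
    return (colours + 1) * (sizes + 1) * (sorts + 1)


def multi_select_urls(colours: int, sizes: int, sorts: int) -> int:
    """Colour and size accept any subset of options; sort stays single-valued."""
    return (2 ** colours) * (2 ** sizes) * (sorts + 1)


# Option counts taken from the clothing example above.
print(single_select_urls(5, 6, 4))  # 210
print(multi_select_urls(5, 6, 4))   # 10240
```

If the same filters can also appear in a different order in the URL, the totals grow further, which is why one category can dominate a crawl.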
URL parameters that do not change the content of a page create new URLs for the same content. Tracking parameters (?utm_source=...), session IDs, affiliate codes, and internal sort parameters are typical examples. Google’s Search Console URL parameters tool was deprecated and removed in April 2022; Google now handles parameter detection automatically. The reliable approach is to canonicalise parameter URLs to the clean equivalent.
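As a rough sketch of what canonicalising a parameter URL can look like, the function below strips parameters that are assumed not to change content and returns the clean URL, which could then be emitted in a rel="canonical" tag or used as a redirect target. The parameter names in IGNORED_PARAMS and IGNORED_PREFIXES are assumptions for illustration; a real list has to come from the site's own URL inventory.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameters that never change page content; adjust per site.
IGNORED_PARAMS = {"sessionid", "ref", "sort"}
IGNORED_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Drop content-neutral parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    kept.sort()  # stable parameter order so equivalent URLs compare equal
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/shoes?utm_source=news&colour=red&sessionid=abc"))
# https://example.com/shoes?colour=red
```

Sorting the remaining parameters is a design choice rather than a requirement: it means ?size=9&colour=red and ?colour=red&size=9 resolve to one canonical form.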
Soft 404 pages return a 200 status code but display “no results found” or similar messages. Search engines treat these as valid pages, crawl them, and eventually index thin, valueless content. Such pages should return a 404 status or 301-redirect to a relevant category.
Paginated URLs waste budget when deep pagination (page 20 of a category, for example) serves content that is sparse or near-identical to earlier pages.
Redirect chains add unnecessary crawl hops. Googlebot follows each redirect in a chain, consuming budget for each step. Redirect chains should be collapsed to a single hop wherever possible.
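One way to collapse chains is to resolve every redirect source to its final destination and rewrite the redirect map accordingly. The sketch below assumes the redirects are available as a simple source-to-target dictionary (for example, exported from server configuration); the function names and the hop limit are illustrative.

```python
from __future__ import annotations


def resolve_final(redirects: dict[str, str], url: str, max_hops: int = 10) -> str:
    """Follow the redirect map from url until a URL with no further redirect."""
    seen = {url}
    current = url
    for _ in range(max_hops):
        if current not in redirects:
            return current
        current = redirects[current]
        if current in seen:
            raise ValueError(f"redirect loop involving {current}")
        seen.add(current)
    raise ValueError(f"more than {max_hops} hops starting at {url}")


def collapse_chains(redirects: dict[str, str]) -> dict[str, str]:
    """Return a map in which every source points straight at its final target."""
    return {source: resolve_final(redirects, source) for source in redirects}


chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(collapse_chains(chain))  # {'/a': '/d', '/b': '/d', '/c': '/d'}
```

Internal links that still point at /a or /b should be updated to /d as well, so that Googlebot does not need the redirect at all.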
Orphan pages with no internal links receive low crawl priority and consume budget without contributing to the site’s authority structure.
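Orphan pages can be approximated by comparing the URLs a site declares (for instance in its XML sitemap) with the URLs that receive at least one internal link. The sketch below assumes both inputs already exist as plain Python data, such as a crawler export; the function name and data shapes are hypothetical.

```python
def find_orphans(sitemap_urls, internal_links):
    """Return sitemap URLs that no other page links to.

    internal_links maps a source URL to the URLs it links to.
    Self-links are ignored because they do not help discovery.
    """
    linked = {
        target
        for source, targets in internal_links.items()
        for target in targets
        if target != source
    }
    return sorted(set(sitemap_urls) - linked)


sitemap = ["/", "/shoes", "/shoes/red", "/old-landing"]
links = {
    "/": ["/shoes"],
    "/shoes": ["/", "/shoes/red"],
    "/shoes/red": ["/shoes/red", "/"],
}
print(find_orphans(sitemap, links))  # ['/old-landing']
```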
How do you audit crawl budget?
The two primary data sources are:
Google Search Console Crawl Stats (Settings > Crawl stats) shows how many pages Google crawled per day over the past 90 days, response codes, and file types. A decline in daily crawl volume can indicate server problems, reduced crawl demand, or a configuration change blocking access.
Server access logs are the most accurate source. Logs record every request Googlebot makes, including URLs that GSC does not report, and allow analysis of which URLs are being crawled most frequently versus which important URLs are being crawled rarely. Log file analysis for SEO is covered in more depth in the log file analysis cluster.
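A minimal sketch of that kind of log analysis follows. It assumes logs in the common "combined" access log format and identifies Googlebot by user-agent string alone; because user agents can be spoofed, a real audit would also verify the requesting IP addresses. The regular expression and function name are illustrative.

```python
import re
from collections import Counter

# Assumed "combined" access log format; adjust the pattern for other layouts.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per (path, status) where the user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), match.group("status"))] += 1
    return hits


def summarise(path_to_log, top=20):
    """Print the most frequently crawled (path, status) pairs from a log file."""
    with open(path_to_log, encoding="utf-8", errors="replace") as handle:
        for (path, status), count in googlebot_hits(handle).most_common(top):
            print(f"{count:6d}  {status}  {path}")
```

Sorting by count shows where crawl activity is concentrated; comparing that list with the URLs that matter commercially shows which important pages are being crawled rarely.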
For a full crawl budget audit checklist, see the Technical SEO Audit Checklist.
How do you fix crawl waste?
The priority order for fixing crawl waste:
- Disallow in robots.txt for URLs that should never be crawled: admin areas, internal search results, parameterised duplicates that cannot be canonicalised at the server level.
- Return correct status codes for pages that do not exist: 404 for missing pages, 410 for permanently removed content.
- Consolidate duplicate URLs via canonicals or 301 redirects so Googlebot crawls one version rather than many.
- Fix redirect chains by updating links and redirects to point directly to the final destination URL.
- Improve server response times to increase the effective crawl rate limit.
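Before deploying robots.txt rules for the first item in that list, it helps to check them against sample URLs. The sketch below uses Python's standard-library parser with a hypothetical rule set; the paths are assumptions for illustration. Note that this parser does simple prefix matching and is not a full emulation of Googlebot's matching (Google additionally supports * and $ wildcards), so it is only suitable for plain path-prefix rules like these.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the admin area and internal search results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""


def is_crawlable(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the sample rules allow the agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


for sample in ["/admin/users", "/search?q=boots", "/shoes/red-trainers"]:
    print(sample, is_crawlable(sample))
```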
Crawl budget improvements take time to show up in Search Console. After changes are implemented, allow four to eight weeks for Google to recrawl affected URLs and update its crawl patterns.
What is index bloat?
Index bloat describes a site having a large proportion of low-value pages in Google’s index relative to its pages with genuine ranking potential. It frequently accompanies crawl waste, but the problems are distinct. Many of the same URL types that burn crawl budget also generate indexed pages that should never have been indexed. A site can have minimal crawl waste and still carry significant index bloat if historically generated content was indexed before clean-up processes were in place.
The concern with index bloat is not primarily the crawl cost of those pages. Google evaluates site quality in part by examining what it has indexed. A site where a significant portion of the index is thin, near-duplicate, or low-effort content sends a weaker quality signal, which can suppress the performance of the genuinely valuable pages elsewhere.
Common sources:
- Faceted navigation and parameter URLs indexed before canonicalisation was in place
- Auto-generated tag, category, or date archive pages with little standalone content
- Expired product pages, past events, or discontinued content still returning 200 responses
- Paginated sequences where later pages are sparse or near-identical to earlier ones
- Previously published thin content that passed indexing thresholds before the site had editorial standards
To assess the scale: compare the number of pages in GSC’s Pages report (indexed) against the number of pages you would consider meaningfully worth indexing. A large gap, or a significant volume of URLs in “Discovered - currently not indexed” or “Crawled - currently not indexed” states, suggests Google has reservations about a portion of the site.
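A rough way to put a number on that gap is to classify an export of indexed URLs by pattern and measure the share that falls outside the pages considered worth indexing. The sketch below is illustrative only: the path prefixes and category names are assumptions, and a real classification has to reflect the site's own URL structure.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed low-value path prefixes; replace with the site's own patterns.
ARCHIVE_PREFIXES = ("/tag/", "/author/")


def classify(url: str) -> str:
    """Label a URL as 'parameterised', 'archive', or 'core'."""
    parts = urlsplit(url)
    if parse_qsl(parts.query):
        return "parameterised"
    if parts.path.startswith(ARCHIVE_PREFIXES):
        return "archive"
    return "core"


def bloat_share(indexed_urls) -> float:
    """Fraction of indexed URLs that are not classified as 'core'."""
    urls = list(indexed_urls)
    if not urls:
        return 0.0
    low_value = sum(1 for url in urls if classify(url) != "core")
    return low_value / len(urls)


indexed = [
    "https://example.com/shoes",
    "https://example.com/shoes?colour=red",
    "https://example.com/tag/sale/",
    "https://example.com/shoes/red-trainers",
]
print(bloat_share(indexed))  # 0.5
```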
Remediation follows the same pattern as crawl waste: noindex on navigational pages without standalone value, 404 or 410 for removed content, canonicals or robots.txt rules for duplicate URL patterns. The difference from crawl waste remediation is that the priority order is determined by what is currently indexed, not what is being crawled.
Index bloat is worth addressing even on smaller sites where crawl budget is not a constraint, because quality evaluation applies regardless of site size.
Crawl budget and indexation
Crawl budget affects when pages get discovered and recrawled, not whether they can rank. A page that is crawled infrequently can still rank well once indexed. However, if important pages are being crawled only monthly rather than daily, content updates and technical fixes take much longer to propagate into search results, which has compounding effects on both freshness and ranking velocity.