Log File Analysis & Crawl Budget

Most SEO tools tell you what could happen to your site. Server logs tell you what did happen. Every time Googlebot fetches a URL, your server writes a line to a log file — the real URL, the real status code, the real timestamp. No sampling, no estimates, no third-party crawler pretending to be Google. If you want to know what search engines actually do on your site, the logs are the only source of truth.

This guide is about reading that truth and acting on it. The lens is crawl budget: the finite amount of crawling Google is willing to spend on you. On a small site, the budget is effectively infinite and you can stop reading after the next section. On a large site — an ecommerce catalog, a marketplace, a programmatic site with hundreds of thousands of URLs — crawl budget is a hard constraint, and logs are how you find where it leaks.

🧑‍💻 Developer view: a log file is just an append-only event stream keyed by (timestamp, ip, method, url, status, user_agent). Everything in this guide is a GROUP BY or a WHERE clause over that stream. If you’ve ever debugged a service from its access logs, you already have the muscle for log-based SEO.

What crawl budget is

“Crawl budget” is not a number Google publishes. It’s the emergent result of two forces Google balances for every site.

Crawl rate limit (Google’s current documentation calls it the crawl capacity limit) is the ceiling on how hard Google will hit your server. Googlebot is deliberately polite: if your responses get slow or you start returning 5xx errors, it backs off automatically to avoid taking your site down. A fast, healthy server raises the ceiling; a slow or flaky one lowers it. This is the supply side: how much crawling your infrastructure can absorb.

Crawl demand is how much Google wants to crawl you. It’s driven by popularity (URLs with more links and traffic get crawled more) and staleness (Google re-crawls pages it thinks have changed). A news site with constantly updated articles has high demand; a static brochure site has low demand. This is the demand side — how much crawling your content actually merits.

Effective crawl budget is roughly min(rate limit, demand). You optimize it by raising both: serve fast (lift the limit) and concentrate Google’s attention on pages that matter (shape the demand).

Site size (indexable URLs)	Does crawl budget matter?
Under ~10k	Almost never — Google crawls it all easily
10k–100k	Sometimes — watch it if you have URL parameters or faceted nav
Over 100k	Yes — this is a primary lever
Any size with crawl traps	Yes — a calendar or filter explosion can sink a small site

⚠️ Note: the most common crawl-budget disaster isn’t a big site — it’s a small site with a trap that generates infinite URLs (a date picker, an unbounded filter combination). Google can burn its entire budget crawling 400,000 useless filter permutations while your 200 real product pages go stale.

Server logs 101

A web server access log is one line per request. The classic format is the NCSA “combined log format”, which Nginx emits by default and most Apache configurations use:

66.249.66.1 - - [22/Jun/2026:08:14:32 +0000] "GET /products/widget-42?color=red HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Decoded, the fields you care about for SEO are:

Field	Example	Why it matters
Client IP	`66.249.66.1`	Verifying real Googlebot (reverse DNS)
Timestamp	`22/Jun/2026:08:14:32`	Crawl frequency, freshness
Method + URL	`GET /products/widget-42?color=red`	What got crawled; parameter waste
Status code	`200`	404s, redirects, 5xx errors
Bytes	`5123`	Payload size, response weight
User agent	`...Googlebot/2.1...`	Which crawler hit the URL
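The timestamp is the field that needs the most care before you can compute crawl frequency, because it arrives as text with a UTC offset. A minimal sketch of turning it into a timezone-aware value, assuming the combined-format layout shown above (the helper name `parse_log_time` is illustrative, not part of any library):

```python
from datetime import datetime, timezone

# Layout of the combined-log timestamp, e.g. 22/Jun/2026:08:14:32 +0000.
# %b is locale-dependent; this assumes English month abbreviations.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_log_time(raw: str) -> datetime:
    """Parse a combined-log timestamp and normalise it to UTC."""
    return datetime.strptime(raw, LOG_TIME_FORMAT).astimezone(timezone.utc)


stamp = parse_log_time("22/Jun/2026:08:14:32 +0000")
print(stamp.isoformat())
```

Normalising to UTC up front keeps logs from servers in different time zones comparable when you later group hits by day.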

Getting the logs is the real hurdle. Where they live depends on your stack:

Self-hosted Nginx/Apache: typically /var/log/nginx/access.log (or /var/log/apache2/ or /var/log/httpd/ for Apache), rotated by logrotate. Easiest case; just scp them.
Cloud load balancers: AWS ALB writes access logs to S3; GCP load balancers write to Cloud Logging. You enable them per load balancer on AWS and per backend service on GCP.
Behind a CDN: this is the trap. If you’re behind Cloudflare, Fastly, or CloudFront, Googlebot hits the CDN edge, not your origin. Your origin logs see only the requests the edge could not answer from cache, and unless you restore the client IP from a forwarded header they record the CDN’s IP rather than Googlebot’s. You need the CDN’s logs.
Cloudflare specifically: use Logpush to stream HTTP request logs to R2 (or S3, or an external sink) continuously. This is the canonical setup for a Cloudflare-hosted site and we walk through it in the Cloudflare section below.
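Once the files are on your machine, rotation usually leaves you with a mix of plain and gzip-compressed logs. The sketch below shows one way to read them as a single stream; it assumes logrotate-style names such as access.log, access.log.1 and access.log.2.gz, and the function name `iter_log_lines` is made up for this example.

```python
import glob
import gzip
from typing import Iterator


def iter_log_lines(pattern: str) -> Iterator[str]:
    """Yield lines from every file matching pattern, plain or gzip-compressed."""
    # Sorted by file name, which is not necessarily chronological order.
    for name in sorted(glob.glob(pattern)):
        opener = gzip.open if name.endswith(".gz") else open
        with opener(name, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                yield line.rstrip("\n")
```

A call such as `iter_log_lines("/var/log/nginx/access.log*")` would then feed every rotated file into the same analysis. Since the counts in this guide do not depend on line order, the file-name ordering is good enough here.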

💡 Tip: collect at least 30 days of logs before drawing conclusions. Googlebot’s crawl pattern over a single day is noisy; over a month, the waste patterns become unmistakable. Aim for a rolling window you can re-query.

Analyzing Googlebot

Before you trust a single log line attributed to “Googlebot,” verify it. The user-agent string is trivially spoofable — scrapers and competitors routinely claim to be Googlebot to bypass rate limits. Counting fake Googlebot hits as crawl budget will send you chasing ghosts.

Verify with reverse + forward DNS. Real Googlebot IPs reverse-resolve to googlebot.com or google.com, and that hostname must forward-resolve back to the same IP. Spoofers control the user agent but not Google’s DNS.

# Reverse lookup: does the IP claim a Google hostname?
host 66.249.66.1
# -> 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward lookup: does that hostname resolve back to the same IP?
host crawl-66-249-66-1.googlebot.com
# -> crawl-66-249-66-1.googlebot.com has address 66.249.66.1   ✓ verified
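The same two-step check can be scripted when you have many IPs to verify. This is a sketch under a few assumptions: it covers IPv4 only (`socket.gethostbyname_ex` does not return IPv6 addresses), it treats any DNS failure as "not verified", and the resolver arguments are injectable purely so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Hostname suffixes named in the text above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """True if ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

DNS lookups are slow, so in practice you would run this once per distinct IP and cache the result rather than once per log line.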

If you don’t want to do per-IP DNS, Google publishes its crawler IP ranges as JSON at https://developers.google.com/static/search/apis/ipranges/googlebot.json — match the client IP against those CIDRs offline.
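A sketch of the offline match using only the standard library follows. It assumes the published file has a top-level `prefixes` list whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key. The ranges in `SAMPLE_DOC` are illustrative placeholders in that shape, not a copy of Google's list, so load the real file before relying on the result.

```python
import ipaddress
import json

# Placeholder document in the same shape as googlebot.json (NOT the real list).
SAMPLE_DOC = json.loads("""
{"prefixes": [
  {"ipv4Prefix": "66.249.64.0/19"},
  {"ipv6Prefix": "2001:4860:4801::/48"}
]}
""")


def load_networks(doc: dict) -> list:
    """Turn the prefixes list into ipaddress network objects."""
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def in_ranges(ip: str, networks: list) -> bool:
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_DOC)
print(in_ranges("66.249.66.1", networks))
```

The published ranges change over time, so a recurring pipeline should re-download the file on a schedule instead of hard-coding it.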

Once you trust the data, ask three questions of it.

What is getting crawled? Group hits by URL and look at the top of the list. On a healthy site the most-crawled URLs are your most important pages. If your top-crawled URLs are ?sessionid= variants or filter permutations, that’s leak number one.

How often are key pages crawled? Pull the last-crawl timestamp for your money pages. If your top category page was last fetched 18 days ago, Google is treating it as low-priority or rarely changing, and anything you changed in those 18 days has not been picked up. That’s a freshness problem.
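The last-crawl check reduces to "latest timestamp per path". The sketch below assumes you already have verified Googlebot hits as (path, timestamp) pairs; the paths, dates and function names are hypothetical sample values chosen to mirror the 18-day example.

```python
from datetime import datetime, timezone


def last_crawled(hits, key_paths):
    """Return {path: latest crawl time, or None if never seen} for key paths."""
    latest = {path: None for path in key_paths}
    for path, when in hits:
        if path in latest and (latest[path] is None or when > latest[path]):
            latest[path] = when
    return latest


def days_since(when, now):
    """Whole days between a crawl time and now, or None if never crawled."""
    return None if when is None else (now - when).days


# Hypothetical sample data, not real log output.
hits = [
    ("/category/widgets", datetime(2026, 6, 4, 8, 0, tzinfo=timezone.utc)),
    ("/category/widgets", datetime(2026, 6, 1, 8, 0, tzinfo=timezone.utc)),
    ("/products/widget-42", datetime(2026, 6, 21, 9, 30, tzinfo=timezone.utc)),
]
now = datetime(2026, 6, 22, 8, 0, tzinfo=timezone.utc)

report = last_crawled(hits, ["/category/widgets", "/products/widget-42", "/about"])
for path, when in report.items():
    print(path, days_since(when, now))
```

Pages that come back as `None` were never fetched inside your log window, which is often a more urgent finding than a page crawled 18 days ago.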

Where is budget wasted? This is the heart of log analysis. The usual suspects:

Waste pattern	What it looks like in logs	Typical cause
Parameter URLs	Many hits to `?sort=`, `?utm_`, `?sessionid=`	Tracking params, sorting, session IDs
Duplicate content	Same content under `/p/123` and `/products/widget`	Multiple routes to one page
Redirect chains	`301 -> 301 -> 200` sequences	Migrations layered over migrations
404 / 410 storms	High volume of `404` to bots	Dead links, removed products still linked
Soft 404s	`200` on pages that are really “not found”	Empty search results, sold-out items
Faceted nav	Combinatorial `?color=&size=&brand=` URLs	Filters with no crawl controls

A quick triage you can run on a raw log with nothing but the shell:

# Top 20 URLs crawled by anything claiming to be Googlebot
grep -i googlebot access.log \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn | head -20

# Status-code distribution for Googlebot — 4xx/5xx ratio is your waste signal
grep -i googlebot access.log \
  | awk '{print $9}' \
  | sort | uniq -c | sort -rn

If 30% of Googlebot’s requests return 404 or hit ?-laden URLs, roughly a third of your crawl budget is going in the bin.
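That share is easy to compute once hits are parsed. The following sketch uses the same crude definition as the sentence above (any 4xx/5xx response or any URL with a query string counts as waste), which is an assumption you may want to refine if some parameters on your site are legitimate. The sample hits are invented for illustration.

```python
from urllib.parse import urlsplit


def is_waste(url: str, status: int) -> bool:
    """Flag a hit as waste if it errored or carried a query string."""
    return status >= 400 or bool(urlsplit(url).query)


def waste_share(hits) -> float:
    """Fraction of (url, status) hits flagged as waste; 0.0 for no hits."""
    hits = list(hits)
    if not hits:
        return 0.0
    return sum(is_waste(url, status) for url, status in hits) / len(hits)


# Hypothetical sample: 10 Googlebot hits, 3 of them wasteful.
sample_hits = [
    ("/", 200),
    ("/products/widget-42", 200),
    ("/products/widget-42?color=red", 200),   # parameter URL
    ("/old-page", 404),                        # dead link
    ("/category/widgets?sort=price", 200),     # parameter URL
    ("/category/widgets", 200),
    ("/products/widget-7", 200),
    ("/products/widget-8", 200),
    ("/about", 200),
    ("/contact", 200),
]
print(f"{waste_share(sample_hits):.0%}")
```

A hit that is both parameterised and an error is counted once, so the share never exceeds 100%.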

Optimizing crawl budget

Once the logs tell you where the leaks are, you plug them. The toolkit is small but each tool has a specific job — using the wrong one is how people accidentally deindex their site.

Block at the source with robots.txt. Disallowing a path stops Google from crawling it at all, which is exactly what you want for crawl-budget waste like internal search, sort orders, and tracking params. Note that Disallow does not remove a URL from the index if it’s linked elsewhere — it only stops the fetch.

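User-agent: *
Disallow: /search
# /*?*name= matches the parameter anywhere in the query string,
# not only when it comes directly after the ?
Disallow: /*?*sort=
Disallow: /*?*sessionid=
Disallow: /cart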
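Wildcard rules are easy to get subtly wrong, so it can help to test them against URLs from your logs before deploying. The sketch below is a simplified matcher for Google-style patterns (`*` wildcard, optional trailing `$`). It deliberately ignores Allow rules, longest-match precedence and user-agent groups, so treat it as a sanity check and not a full robots.txt implementation. It shows that a pattern written as `/*?sort=` only matches when sort is the first parameter, while `/*?*sort=` matches it in any position.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal parts and let each * match any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_disallowed(url_path: str, disallow_patterns) -> bool:
    """True if any Disallow pattern matches the path (with query string)."""
    return any(robots_pattern_to_regex(p).match(url_path) for p in disallow_patterns)


first_only = ["/*?sort="]
any_position = ["/*?*sort="]

print(is_disallowed("/products?sort=price", first_only))
print(is_disallowed("/products?color=red&sort=price", first_only))
print(is_disallowed("/products?color=red&sort=price", any_position))
```

Running the patterns over a sample of real parameter URLs from your logs is a quick way to confirm a rule blocks what you intend and nothing else.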

Consolidate duplicates with rel=canonical. When the same content lives at multiple URLs (parameterized variants, tracking-tagged links), point them all at one canonical URL. Google will fold the signals together and prefer crawling the canonical.

<link rel="canonical" href="https://example.com/products/widget-42" />
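Server-side, the canonical URL is usually derived by stripping the parameters that do not change the content. A minimal sketch of that normalisation follows, assuming `utm_*`, `sessionid` and `sort` are the noise parameters on your site; that list is hypothetical, and whether something like `color` belongs in it is a per-site decision.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed noise parameters; adjust to what your own logs show.
NOISE_PREFIXES = ("utm_",)
NOISE_NAMES = {"sessionid", "sort"}


def canonical_url(url: str) -> str:
    """Drop noise parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_NAMES
        and not key.lower().startswith(NOISE_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/products/widget-42?utm_source=mail&sessionid=abc"))
```

The returned value is what you would place in the `href` of the canonical link element for every variant of the page.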

Keep low-value pages out of the index with noindex. For pages that must stay crawlable for users but shouldn’t compete in search (thin tag archives, paginated tails), use a noindex robots meta tag, <meta name="robots" content="noindex">, or the equivalent X-Robots-Tag: noindex response header. Important: don’t combine noindex with a robots.txt Disallow, because if Google can’t crawl the page it can’t see the noindex.

The two highest-impact structural fixes:

Kill crawl traps. Faceted navigation and infinite calendars are the classic budget incinerators. A filter UI with 6 facets of 10 values each generates a million URL combinations, none of them worth indexing. Fix it by serving filter links that bots won’t follow (POST forms, or rel=nofollow plus Disallow on the parameter), and cap the infinite calendar so /events/2099/12 returns a 404 past a sane horizon.
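The arithmetic behind that figure is worth spelling out, because the count grows further when filters are optional. A small worked example, assuming each facet contributes one URL parameter and ignoring parameter order:

```python
FACETS = 6
VALUES_PER_FACET = 10

# Every facet always set to exactly one of its values.
exactly_one_each = VALUES_PER_FACET ** FACETS

# Each facet may also be left unset, adding one extra choice per facet.
optional_each = (VALUES_PER_FACET + 1) ** FACETS

print(exactly_one_each)  # 1000000
print(optional_each)     # 1771561
```

If the site also emits the same filters in different parameter orders, each combination multiplies again, which is why a modest filter UI can outgrow the real catalog by several orders of magnitude.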

Repair redirect chains and soft 404s. Every hop in a 301 -> 301 -> 200 chain is a wasted fetch; collapse chains so each old URL points directly at the final destination in one hop. For soft 404s (pages that look empty but return 200, like permanently discontinued products or no-result searches), return a real 404 or 410 so Google stops re-crawling them as if they were live content. Products that are only temporarily out of stock should keep returning 200.
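Collapsing chains is mechanical once you have the redirect rules as a source-to-target mapping. The sketch below assumes such a mapping has been exported from your server or CMS (the paths are hypothetical) and rewrites every source to point at its final destination, raising an error if it finds a loop.

```python
def collapse_redirects(redirects: dict) -> dict:
    """Map every redirect source straight to its final destination."""
    collapsed = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        collapsed[start] = target
    return collapsed


# Hypothetical rules left behind by two migrations.
rules = {
    "/p/123": "/products/123",
    "/products/123": "/products/widget-42",
}
print(collapse_redirects(rules))
```

After deploying the collapsed rules, each old URL should answer with a single 301 to a URL that returns 200, which you can confirm in the next batch of logs.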

💡 Tip: after you ship fixes, the logs are also your verification tool. Re-pull a week of Googlebot hits and confirm the parameter URLs and 404s have dropped, and that crawls of your money pages have risen. Crawl budget freed from junk gets redistributed to your good pages — you can watch it happen.

Tooling

You can go a long way with grep, awk, and sort, but at scale you’ll want something queryable.

BigQuery is the workhorse for large sites. Load logs into a table and crawl analysis becomes plain SQL over billions of rows:

SELECT
  REGEXP_EXTRACT(url, r'^[^?]+') AS path,   -- strip query string
  COUNT(*)                       AS hits,
  COUNTIF(status >= 400)         AS errors
FROM `logs.googlebot`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY path
ORDER BY hits DESC
LIMIT 50;

Screaming Frog Log File Analyser is the no-code option: drag in raw log files, point it at your crawl, and it cross-references the two so you can see crawled-but-not-in-crawl (“orphan”) URLs and important-but-rarely-crawled pages. Great for one-off audits.

A self-built parser is worth it for recurring analysis you want to own. A combined-log line is regular enough to parse with one pattern:

import re
from collections import Counter

# Matches: ip ... "METHOD url HTTP/x" status bytes "ref" "ua"
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "[^"]*" "(?P<ua>[^"]*)"'
)

paths, statuses = Counter(), Counter()

with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LINE.match(line)
        if not m or "googlebot" not in m["ua"].lower():
            continue
        path = m["url"].split("?", 1)[0]   # group by path, ignore params
        paths[path] += 1
        statuses[m["status"]] += 1

print("Top crawled paths:")
for path, n in paths.most_common(20):
    print(f"{n:6d}  {path}")

print("\nStatus distribution:", dict(statuses))

Run that against a month of logs and you have, in seconds, the same answers an expensive tool gives you — top crawled paths and the error ratio that quantifies your waste.

🧑‍💻 Cloudflare angle

If your site is on Cloudflare, your origin logs miss every request the edge answers from cache, so the more of your site you cache, the less crawling the origin sees. The fix is Logpush: a managed pipeline that streams HTTP request logs from Cloudflare’s edge to a sink of your choice. A cheap, SEO-friendly sink is R2, Cloudflare’s S3-compatible object storage, with no egress fees.

The shape of the setup:

In the Cloudflare dashboard, go to Analytics & Logs -> Logpush and create a job for the HTTP requests dataset.
Pick R2 as the destination and select the fields you need: ClientIP, ClientRequestURI, EdgeResponseStatus, ClientRequestUserAgent, EdgeStartTimestamp, CacheCacheStatus.
Logpush drops gzipped JSON batches into your R2 bucket every few minutes.

Then query the logs directly, for example from a DuckDB session over the exported files (reading r2:// paths assumes DuckDB’s httpfs extension and an R2 secret are configured):

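-- Top URIs fetched by Googlebot in the files matched below (June 2026 here), with error counts
-- The bucket name and glob are placeholders; match them to your Logpush path layout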
SELECT
  ClientRequestURI                                     AS uri,
  COUNT(*)                                             AS hits,
  SUM(CASE WHEN EdgeResponseStatus >= 400 THEN 1 ELSE 0 END) AS errors
FROM read_json_auto('r2://my-logs/http/2026/06/*.json.gz')
WHERE lower(ClientRequestUserAgent) LIKE '%googlebot%'
GROUP BY uri
ORDER BY hits DESC
LIMIT 50;

Because Logpush includes CacheCacheStatus, you also get a crawl-budget superpower the origin can never give you: you can see which crawled URLs are served from cache (hit) versus forcing an origin fetch (miss/expired). High Googlebot miss rates on important pages mean your edge caching isn’t helping the crawler — a tuning opportunity that ties directly back to the Build layer, where caching and edge configuration live.
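The cache breakdown can be computed from a downloaded batch with the standard library alone. This sketch assumes the file is gzipped, newline-delimited JSON with the field names selected above, and that `CacheCacheStatus` carries lowercase values such as `hit` and `miss`; check a real batch from your own job before trusting those assumptions. The function names are illustrative.

```python
import gzip
import json
from collections import Counter, defaultdict


def cache_status_by_uri(path: str) -> dict:
    """Count CacheCacheStatus values per URI for Googlebot hits in one file."""
    counts = defaultdict(Counter)
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            user_agent = record.get("ClientRequestUserAgent") or ""
            if "googlebot" not in user_agent.lower():
                continue
            uri = record.get("ClientRequestURI") or ""
            status = record.get("CacheCacheStatus") or "unknown"
            counts[uri][status] += 1
    return counts


def miss_rate(statuses: Counter) -> float:
    """Share of hits that were not served from cache (anything but 'hit')."""
    total = sum(statuses.values())
    return 0.0 if total == 0 else 1 - statuses["hit"] / total
```

Sorting URIs by `miss_rate` and filtering to your important pages gives a short list of places where edge caching is not absorbing crawler traffic. The user-agent filter here is unverified, so combine it with one of the verification methods described earlier.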

💡 Tip: set the Logpush job to also capture verified-bot metadata if you have the Bot Management add-on — Cloudflare verifies Googlebot for you, so you can skip the reverse-DNS dance and filter on a trusted flag instead.

Key takeaways

✅ Decide if crawl budget even applies: under ~10k clean URLs, skip it; over 100k or with any crawl trap, treat logs as a primary lever.
✅ Get real edge logs — if you’re behind a CDN, ship Cloudflare Logpush to R2 rather than trusting origin logs that miss cached crawls.
✅ Verify Googlebot by reverse + forward DNS (or IP ranges) before trusting any “Googlebot” line in your logs.
✅ Quantify the waste: measure the share of Googlebot hits going to parameter URLs, 404s, redirect chains, and soft 404s.
✅ Plug leaks with the right tool — robots.txt to stop crawling, canonical to consolidate, noindex to deindex, and structural fixes for faceted nav and calendar traps.
✅ Close the loop: re-pull logs after shipping fixes and confirm waste fell while crawls of your money pages rose.