Files

Kunthawat Greethong 69f7d8bdda Initial: pi-skill — 68 skills, 43 extensions, 11 themes for Pi

2026-05-25 16:41:08 +07:00

17 KiB

Raw Blame History

Crawlability — Robots.txt, Sitemaps, Crawl Budget & Log Analysis

A comprehensive reference for ensuring search engine crawlers can discover, access, render, and efficiently crawl all important pages on a website. Crawlability is the foundation of technical SEO — if search engines cannot crawl a page, it cannot rank.

Robots.txt

Purpose

The robots.txt file tells search engine crawlers which URLs they are allowed or disallowed from requesting. It is a crawl directive, not an indexation directive — pages blocked by robots.txt can still appear in search results if other pages link to them (they will show as "URL is blocked by robots.txt" in GSC).

Syntax Reference

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /internal-search?
Allow: /api/public/

User-agent: Googlebot
Disallow: /staging/
Crawl-delay: 1

Sitemap: https://example.com/sitemap-index.xml

Directives:

User-agent: Specifies which crawler the rules apply to. * means all crawlers
Disallow: Blocks crawling of the specified path. Empty value (Disallow:) means allow everything
Allow: Explicitly permits crawling of a path within a broader Disallow. Googlebot supports Allow; some crawlers do not
Crawl-delay: Requests a delay (in seconds) between requests. Google ignores this — use GSC crawl rate settings instead. Bing respects it
Sitemap: Points to the XML sitemap. Can list multiple Sitemap directives

Pattern matching (Googlebot-specific):

* matches any sequence of characters: Disallow: /*.pdf$ blocks all PDF files
$ anchors to end of URL: Disallow: /page$ blocks /page but allows /page/subpage
Path matching is case-sensitive

Common Robots.txt Mistakes

Mistake	Impact	Fix
Blocking CSS/JS files	Googlebot cannot render the page; mobile-first indexing fails	Allow all CSS and JS: `Allow: /.css` and `Allow: /.js`
Blocking entire site accidentally (`Disallow: /`)	No pages crawled; entire site deindexed over time	Audit robots.txt after every deployment
Blocking parameterized URLs that have unique content	Valuable pages never crawled	Use noindex instead of Disallow for pages that should not be indexed but can be crawled
No Sitemap directive	Crawlers must discover sitemap through other means	Always include `Sitemap:` directive
Using robots.txt to prevent indexation	Pages can still be indexed if linked externally	Use meta noindex or X-Robots-Tag for indexation control
Different robots.txt on staging vs production	Staging robots.txt (Disallow: /) deployed to production	Add robots.txt validation to deployment checklist
Blocking the robots.txt file itself via server config	Crawlers assume everything is allowed	Ensure robots.txt returns 200 status code

Testing Robots.txt

Google Search Console > Robots.txt Tester: Validates syntax and tests specific URLs against rules
Bing Webmaster Tools: Similar testing functionality for Bingbot rules
Robots.txt must be served at the root of the domain: https://example.com/robots.txt
Must return HTTP 200. If it returns 5xx, Google treats it as a temporary allow-all. If 4xx, Google treats it as no restrictions
Maximum file size: 500KB (Google ignores rules beyond this limit)

XML Sitemaps

Purpose

XML sitemaps tell search engines which URLs exist and are worth crawling. They supplement natural crawl discovery through links. Sitemaps are especially important for large sites, new sites with few inbound links, sites with deep architecture, and pages with limited internal linking.

Structure

Basic sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2025-11-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Sitemap index (for large sites):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-11-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2025-11-10</lastmod>
  </sitemap>
</sitemapindex>

Limits and Requirements

Constraint	Limit
URLs per sitemap file	50,000
Sitemap file size (uncompressed)	50 MB
Sitemaps per sitemap index	50,000
Maximum total URLs (via index)	2.5 billion (50K x 50K)
Encoding	UTF-8
Compression	gzip supported and recommended for large sitemaps

Sitemap Best Practices

Only include canonical, indexable URLs: Do not include URLs with noindex, non-canonical URLs, redirected URLs, or 4xx/5xx pages
Use accurate lastmod dates: Google uses lastmod to prioritize crawling. Only update the date when content meaningfully changes. Inaccurate dates (auto-updating to today) cause Google to ignore lastmod entirely
Segment sitemaps by content type: Separate sitemaps for products, blog posts, category pages, and editorial content. This makes GSC reporting more useful and helps diagnose crawl issues by section
Keep sitemaps current: Dynamically generate sitemaps or update them on content publish/update. Stale sitemaps with dead URLs waste crawl budget
Submit sitemaps in GSC: Submit via Google Search Console AND reference in robots.txt
Gzip large sitemaps: Compress sitemaps to reduce server bandwidth and speed up crawler downloads
Monitor sitemap status in GSC: Check "Sitemaps" report for errors, warnings, and coverage

Specialized Sitemaps

Image Sitemap:

<url>
  <loc>https://example.com/product-page</loc>
  <image:image>
    <image:loc>https://example.com/images/product.jpg</image:loc>
    <image:title>Product Name</image:title>
    <image:caption>Description of the product image</image:caption>
  </image:image>
</url>

Video Sitemap:

<url>
  <loc>https://example.com/video-page</loc>
  <video:video>
    <video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
    <video:title>Video Title</video:title>
    <video:description>Video description</video:description>
    <video:content_loc>https://example.com/video.mp4</video:content_loc>
    <video:duration>600</video:duration>
  </video:video>
</url>

News Sitemap (for Google News publishers):

URLs must be less than 2 days old
Requires <news:publication>, <news:publication_date>, and <news:title> elements
Submit only articles, not index pages or category pages

Crawl Budget

What It Is

Crawl budget is the number of URLs Googlebot will crawl on a site within a given time period. It is determined by two factors:

Crawl rate limit: The maximum crawling speed Googlebot uses to avoid overloading the server. Determined by server responsiveness and health
Crawl demand: How much Google wants to crawl based on the site's popularity, freshness signals, and perceived size

When Crawl Budget Matters

Crawl budget is primarily a concern for:

Sites with 10,000+ pages
Sites that generate new URLs rapidly (ecommerce, classified, UGC platforms)
Sites with slow server response times (< 200ms TTFB is ideal for crawl efficiency)
Sites where important pages are changing frequently and need quick recrawling

For small sites (under 10,000 pages), crawl budget is rarely a limiting factor.

What Wastes Crawl Budget

Waste Source	Description	Fix
Faceted navigation URLs	Filtering/sorting creates millions of parameter combinations	Block low-value facets in robots.txt; canonicalize to base category
Internal search result pages	`/search?q=xyz` indexed and crawled for every query	Block `/search` in robots.txt; add noindex to search results
Session ID URLs	Same page with different session parameters	Remove session IDs from URLs; use cookies instead
Infinite scroll/pagination traps	Calendar widgets, infinite pagination generating unlimited URLs	Cap pagination depth; use `rel="canonical"` on component pages
Soft 404 pages	Pages returning 200 but showing "no results" or empty content	Return proper 404 or 410 status codes
Duplicate content from parameters	Sort orders, tracking parameters, currency selectors	Canonicalize to the parameter-free version
Orphan pages	Pages with no internal links — only reachable through sitemap	Either add internal links or remove from sitemap if not valuable
Redirect chains	Each redirect consumes a crawl, and Google may stop following after 5 hops	Resolve chains to direct 301 redirects

Crawl Budget Optimization Strategies

Improve server response time: TTFB under 200ms allows Googlebot to crawl more URLs per session
Block crawling of low-value URLs via robots.txt (search results, filtered views, admin pages, API endpoints)
Clean up redirect chains: Resolve to direct single-hop 301s
Return proper status codes: 404 for not-found, 410 for permanently removed, 503 for temporary downtime
Keep XML sitemaps clean: Only canonical, indexable, 200-status URLs
Use internal linking to signal priority: Pages with more internal links get crawled more frequently
Update lastmod accurately: Helps Googlebot prioritize recently changed URLs
Monitor crawl stats in GSC: Crawl Stats report shows pages crawled per day, average response time, and crawl response breakdowns

JavaScript Rendering and Crawling

How Googlebot Handles JavaScript

Googlebot uses a two-phase process:

Crawl phase: Downloads HTML, discovers links and resources in the raw HTML
Render phase: Executes JavaScript using a headless Chromium instance, discovers additional content and links in the rendered DOM

The render phase is resource-intensive and happens in a separate queue. During peak load, rendering can be delayed by seconds to days. Content and links that exist only in JavaScript-rendered DOM may be discovered late.

Rendering Strategies and SEO Impact

Strategy	Initial HTML	SEO Risk	Best For
Static HTML	Complete content	None	Blogs, marketing sites, documentation
Server-Side Rendering (SSR)	Complete content	None	Dynamic content that changes per request
Static Site Generation (SSG)	Complete content	None	Content that changes infrequently
Incremental Static Regeneration (ISR)	Complete content (stale-while-revalidate)	Very low	High-traffic dynamic content
Client-Side Rendering (CSR)	Empty shell or skeleton	High	Authenticated dashboards (not for SEO pages)
Hybrid (SSR + CSR)	Critical content server-rendered; interactive parts client-rendered	Low	Modern web apps with SEO requirements

JavaScript SEO Checklist

Critical content visible in the raw HTML source (View Source, not Inspect Element)
Internal links are standard <a href="..."> tags, not JavaScript-triggered navigation
Meta tags (title, description, canonical, robots) are in the initial HTML, not injected by JS
Structured data (JSON-LD) is in the initial HTML response
URL Inspection tool in GSC shows rendered HTML matches what users see
No critical rendering errors in GSC's URL Inspection "More Info" section
Client-side routing uses History API (pushState), not hash-based routing (#/page)
Server returns proper HTTP status codes (404, 301) rather than handling them client-side

Log File Analysis

What to Analyze

Server logs record every request made to the server, including search engine crawlers. Analyzing these logs reveals how crawlers actually behave on the site, which may differ significantly from what you expect.

Key Metrics from Log Files

Metric	What It Tells You	Healthy Range
Crawl frequency by URL	How often Googlebot visits each URL	Important pages: daily; low-value: weekly or less
Crawl frequency by section	Which site sections get the most crawler attention	Should align with business value of each section
Response code distribution	Percentage of 200, 301, 404, 5xx responses served to bots	> 90% should be 200; < 1% should be 5xx
Average response time for bots	Server performance under crawler load	< 200ms ideal; > 500ms is a problem
Crawl of non-indexable URLs	How much crawl budget is wasted on noindex, blocked, or redirected URLs	< 20% of total bot requests
Crawl of orphan pages	Pages crawled that have no internal links	Should be near 0 for important content
Bot identification	Which bots are crawling and their behavior	Verify Googlebot, Bingbot; watch for scraper bots

Log Analysis Workflow

Extract bot requests from access logs (filter by user-agent containing "Googlebot", "bingbot", "Yandex", etc.)
Verify bot identity: Googlebot IPs resolve to *.googlebot.com or *.google.com via reverse DNS. Fake Googlebots are common
Segment by URL pattern: Group crawled URLs by directory/template (product pages, blog posts, category pages, etc.)
Calculate crawl distribution: What percentage of crawls goes to each section? Does it match the site's priority?
Identify crawl waste: URLs returning 3xx, 4xx, 5xx to bots; non-indexable URLs being crawled repeatedly
Check response times: Are any URL patterns consistently slow for bots?
Compare to sitemap: Are all sitemap URLs being crawled? Are non-sitemap URLs being crawled more than sitemap URLs?
Track over time: Weekly log analysis to detect crawl behavior changes after site updates

Tools for Log Analysis

Screaming Frog Log File Analyser: Dedicated tool for SEO log analysis. Parses common log formats, segments by bot, visualizes crawl patterns
Custom scripts (Python/pandas): For large log files or custom analysis needs. Parse with regex, aggregate in pandas
ELK Stack (Elasticsearch, Logstash, Kibana): For continuous log monitoring and dashboarding at scale
BigQuery or Athena: For querying very large log files stored in cloud storage
Botify, OnCrawl, JetOctopus: Enterprise SEO platforms with built-in log file analysis

Orphan Page Detection

What Are Orphan Pages

Orphan pages are URLs that exist on the server and may be indexed but have zero internal links pointing to them. They are only discoverable through:

XML sitemaps
External backlinks
Direct URL entry
Previously cached crawl data

Why Orphan Pages Matter

Crawl inefficiency: If the page is valuable, it is being starved of crawl frequency and PageRank
Index bloat: If the page is low-value, it is consuming index space without contributing
Missed SEO opportunity: Pages with no internal links signal low importance to search engines

Detection Method

Crawl the site with a tool like Screaming Frog, Sitebulb, or a custom crawler starting from the homepage
Export the list of discovered URLs (reachable through internal links)
Compare against: XML sitemap URLs, GSC indexed URLs, server log URLs (pages Googlebot actually crawled)
Any URL in the sitemap, GSC, or logs that was NOT found by the internal crawl is an orphan

Resolution

Valuable orphan pages: Add internal links from relevant parent pages. Include in navigation or related-content sections
Low-value orphan pages: Remove from sitemap, add noindex, or return 410 Gone if truly obsolete
Orphan pages with backlinks: High priority to rescue — add internal links to capture that external link equity

URL Parameter Handling

The Problem

URL parameters (query strings) create multiple URLs pointing to the same or similar content:

example.com/shoes (base)
example.com/shoes?color=red (filtered)
example.com/shoes?sort=price (sorted)
example.com/shoes?color=red&sort=price&page=2 (combined)

For a site with 50 categories, 10 filters, 5 sort options, and 10 pages of pagination, the combinatorial explosion produces 250,000 URL variations from 50 base categories.

Resolution Strategies

Strategy	When to Use	Implementation
Canonical to base URL	Parameter does not create unique, valuable content (sort, session, tracking)	`<link rel="canonical" href="base-url">` on parameterized pages
Robots.txt block	High-volume parameter URLs that waste crawl budget	`Disallow: /*?sort=` in robots.txt
Noindex, follow	Parameter pages have some link value but should not rank	`<meta name="robots" content="noindex, follow">`
Allow indexation	Parameter creates genuinely unique, search-valuable content (e.g., `/shoes?color=red` targets "red shoes")	Ensure unique title, description, and content; self-referencing canonical
AJAX-based filtering	Prevent parameter URLs from being generated at all	Filtering updates content via JavaScript without changing the URL; use History API for shareable state

Note: Google deprecated its URL Parameters tool in Google Search Console in 2022. Parameter handling must now be managed entirely through on-site signals (canonicals, robots, noindex).

17 KiB Raw Blame History

Crawlability — Robots.txt, Sitemaps, Crawl Budget & Log Analysis

Robots.txt

Purpose

Syntax Reference

Common Robots.txt Mistakes

Testing Robots.txt

XML Sitemaps

Purpose

Structure

Limits and Requirements

Sitemap Best Practices

Specialized Sitemaps

Crawl Budget

What It Is

When Crawl Budget Matters

What Wastes Crawl Budget

Crawl Budget Optimization Strategies

JavaScript Rendering and Crawling

How Googlebot Handles JavaScript

Rendering Strategies and SEO Impact

JavaScript SEO Checklist

Log File Analysis

What to Analyze

Key Metrics from Log Files

Log Analysis Workflow

Tools for Log Analysis

Orphan Page Detection

What Are Orphan Pages

Why Orphan Pages Matter

Detection Method

Resolution

URL Parameter Handling

The Problem

Resolution Strategies

17 KiB

Raw Blame History