Initial: pi-skill — 68 skills, 43 extensions, 11 themes for Pi

Kunthawat Greethong
2026-05-25 16:38:02 +07:00
commit 69f7d8bdda
1689 changed files with 342427 additions and 0 deletions

@@ -0,0 +1,222 @@
---
name: technical-seo
description: "Deep technical SEO analysis. Use when: optimizing crawlability, Core Web Vitals, rendering, redirects, or sitemaps."
---
# Technical SEO
## When to Use This Skill
Activate this module when the user's request involves any of the following:
- **Core Web Vitals**: Optimizing LCP, INP, or CLS scores; diagnosing page speed issues; interpreting CrUX data or PageSpeed Insights reports
- **Crawlability**: Robots.txt configuration, XML sitemap creation or auditing, crawl budget management, or Googlebot access issues
- **Site Architecture**: URL structure planning, information architecture, internal linking strategy, site depth optimization, or content siloing
- **Indexation**: Canonical tag implementation, noindex/nofollow directives, index bloat, duplicate content resolution, or Google Search Console index coverage issues
- **Redirects**: Redirect chain auditing, 301/302 strategy, redirect maps for site migrations, or HTTP-to-HTTPS migration
- **JavaScript SEO**: Client-side rendering issues, SSR vs CSR vs SSG evaluation, dynamic rendering, or JavaScript crawlability problems
- **Mobile-First Indexing**: Mobile rendering issues, mobile parity checks, responsive design auditing, or mobile usability errors
- **Structured Data**: Schema markup implementation (JSON-LD), rich result eligibility, schema validation, or structured data strategy
- **Log File Analysis**: Server log interpretation, crawl frequency analysis, crawl waste identification, or bot behavior auditing
- **International SEO**: Hreflang implementation, ccTLD vs subdomain vs subdirectory decisions, geotargeting, or multilingual site architecture
- **Security**: HTTPS migration, mixed content resolution, HSTS implementation, or security header configuration
- **HTTP Status Codes**: Diagnosing 4xx/5xx errors, soft 404 detection, server error patterns, or status code strategy
- **Page Speed**: Server response time (TTFB), render-blocking resources, image optimization, code splitting, or CDN configuration
- **Site Migrations**: Domain changes, platform migrations, HTTPS transitions, URL restructuring, or merger/acquisition site consolidation
**Trigger phrases**: "technical seo," "core web vitals," "page speed," "crawl budget," "robots.txt," "sitemap," "redirect," "canonical," "indexation," "noindex," "hreflang," "javascript seo," "mobile-first indexing," "log file analysis," "site architecture," "internal linking," "crawl errors," "HTTP status," "schema markup," "structured data," "site migration," "TTFB," "LCP," "INP," "CLS," "render blocking," "crawlability," "index bloat," "redirect chain," "mixed content," "HTTPS"
## Brand Context (Auto-Applied)
Before producing any marketing output from this module:
1. **Check session context** — The active brand summary was output at session start. Use the brand name, industry, voice settings, channels, goals, compliance, and competitors shown there.
2. **If you need the full profile**, read: `~/.claude-marketing/brands/{slug}/profile.json`
3. **Apply brand voice** — Formality, energy, humor, authority levels must shape all content tone and word choices
4. **Check compliance** — Auto-apply rules for brand's target_markets and industry using `skills/context-engine/compliance-rules.md`
5. **Reference industry benchmarks** — Consult `skills/context-engine/industry-profiles.md` for the brand's industry
6. **Use platform specs** — Reference `skills/context-engine/platform-specs.md` for character limits and format requirements
7. **Check campaign history** — Run `python campaign-tracker.py --brand {slug} --action list-campaigns` before planning new work
8. **If no brand exists**, say: "No brand profile found. Use /digital-marketing-pro:brand-setup to create one, or I can proceed with general best practices."
9. **Check brand guidelines** — If `~/.claude-marketing/brands/{slug}/guidelines/_manifest.json` exists, load and enforce: `restrictions.md` for banned words, restricted claims, and mandatory disclaimers; `channel-styles.md` for channel-specific tone overrides (may differ from base voice); `messaging.md` for approved key messages, taglines, and positioning language; `voice-and-tone.md` for detailed voice rules beyond the 4 numeric scores. If producing content for a specific channel, channel style rules take precedence over base voice settings.
Do not ask the user for information that already exists in their brand profile.
## Required Context
Before executing technical SEO work, gather:
1. **Website URL**: The domain to audit or optimize
2. **CMS / Platform**: WordPress, Shopify, Webflow, custom, headless, etc. — determines implementation paths
3. **Hosting Environment**: Shared, VPS, dedicated, cloud (AWS/GCP/Azure), CDN provider — affects server-side recommendations
4. **Current Performance Data**: Google Search Console access, PageSpeed Insights scores, CrUX data, or existing audit reports
5. **Site Scale**: Approximate page count (hundreds, thousands, hundreds of thousands) — determines crawl budget relevance
6. **Rendering Method**: Static HTML, server-side rendered, client-side rendered (React/Angular/Vue), hybrid (Next.js/Nuxt) — critical for JavaScript SEO
7. **International Presence**: Target countries and languages, current URL structure for international versions
8. **Known Issues**: Existing problems the user is aware of (crawl errors, indexation drops, speed complaints, ranking losses)
9. **Migration Plans**: Any upcoming domain changes, platform migrations, or URL restructuring
10. **Tech Stack Constraints**: Development team availability, deployment processes, CDN limitations, plugin/extension restrictions
For quick diagnostic requests (e.g., "why is my page slow"), infer reasonable defaults and deliver immediately. For comprehensive audits, gather full context.
## Capabilities
- **Core Web Vitals Optimization**: Diagnose and fix LCP (target < 2.5s), INP (target < 200ms), and CLS (target < 0.1) issues with specific, implementation-ready recommendations; interpret field data (CrUX) vs lab data (Lighthouse) discrepancies; prioritize fixes by user impact
- **Crawlability Audits**: Robots.txt analysis and optimization, XML sitemap structure and validation, crawl budget allocation for large sites, crawl waste identification, orphan page detection, and crawl path optimization
- **Site Architecture Design**: URL structure planning (flat vs hierarchical), information architecture using topic clusters and content silos, internal linking strategy with PageRank flow modeling, click depth optimization (critical pages within 3 clicks), and breadcrumb implementation
- **Internal Linking Optimization**: Link equity distribution analysis, contextual link placement strategy, anchor text optimization, navigation structure auditing, footer and sidebar link strategy, and orphan page rescue
- **Indexation Management**: Canonical tag strategy (self-referencing, cross-domain, parameterized URLs), meta robots directive implementation, X-Robots-Tag HTTP headers, index coverage diagnosis using GSC, index bloat identification and cleanup, and new content indexation acceleration
- **JavaScript SEO**: Client-side rendering assessment, server-side rendering implementation guidance, static site generation recommendations, dynamic rendering as a fallback, Googlebot rendering verification, JavaScript crawl budget impact analysis, and hydration issue diagnosis
- **Mobile-First Indexing**: Mobile rendering parity checks, responsive design validation, mobile usability error resolution, touch target sizing, viewport configuration, and mobile page speed optimization
- **Page Speed Optimization**: TTFB reduction (server tuning, CDN, caching), render-blocking resource elimination, image optimization (format selection, lazy loading, responsive images, preload), CSS/JS minification and code splitting, third-party script auditing, and font loading strategy (font-display, preload, subsetting)
- **Redirect Management**: Redirect chain detection and resolution, 301 vs 302 decision framework, redirect map creation for migrations, redirect loop identification, and redirect performance impact analysis
- **HTTP Status Code Auditing**: 4xx error diagnosis and resolution, 5xx server error pattern analysis, soft 404 detection, 410 Gone implementation for permanently removed content, and status code monitoring strategy
- **Log File Analysis Guidance**: Googlebot crawl frequency and pattern analysis, crawl waste identification (non-indexable URL crawling), response code distribution, crawl budget utilization assessment, and bot vs human traffic ratios
- **Structured Data Implementation**: JSON-LD schema markup for Organization, Product, Article, FAQ, HowTo, BreadcrumbList, LocalBusiness, Event, and Review types; rich result eligibility assessment; schema validation and testing; nested and advanced schema patterns
- **International Technical SEO**: Hreflang implementation (HTML link, HTTP header, XML sitemap methods), ccTLD vs subdomain vs subdirectory decision framework, geotargeting configuration, language and region targeting, and international sitemap strategy
- **Security & HTTPS**: HTTPS migration planning, mixed content detection and resolution, HSTS implementation, security header configuration (CSP, X-Frame-Options, X-Content-Type-Options), and certificate management
- **XML Sitemap Strategy**: Sitemap structure for large sites (sitemap index), image and video sitemaps, news sitemaps, sitemap priority and changefreq guidance, dynamic sitemap generation, and sitemap submission and monitoring
- **URL Structure Optimization**: URL readability and keyword inclusion, parameter handling, trailing slash consistency, URL case sensitivity, and URL length optimization
## Process
**Primary Workflow: Technical SEO Audit & Optimization**
1. **Site Health Snapshot**
- Pull current Core Web Vitals from CrUX or PageSpeed Insights (LCP, INP, CLS for mobile and desktop)
- Review Google Search Console index coverage report (valid, excluded, error, warning counts)
- Check for manual actions or security issues in GSC
- Note current crawl stats (pages crawled per day, average response time, crawl errors)
- Baseline: Document current organic traffic, indexed page count, and ranking positions for target keywords
2. **Crawlability Analysis**
- Review robots.txt for blocking issues (critical resources, CSS/JS, important directories)
- Validate XML sitemap (well-formed, all important pages included, no non-indexable URLs, within 50K URL / 50MB limit)
- Assess crawl budget allocation (are crawlers spending time on low-value pages?)
- Check for crawl traps (infinite calendar pagination, session IDs in URLs, faceted navigation generating millions of URLs)
- Verify Googlebot can access all critical resources (CSS, JS, images needed for rendering)
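
The robots.txt checks in this step can be partly automated. The sketch below uses Python's standard-library `urllib.robotparser` to list which URLs a given user agent may not fetch. The sample rules, URLs, and the `blocked_urls` helper are illustrative assumptions. The standard-library parser also does not implement every Google-specific matching rule (such as `*` and `$` wildcards), so treat the result as a first pass, not a substitute for URL Inspection.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that `user_agent` is disallowed from fetching."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that accidentally blocks a JavaScript directory.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /assets/js/
Disallow: /search
"""

SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/assets/js/app.js",
    "https://example.com/assets/css/site.css",
    "https://example.com/search?q=shoes",
]

if __name__ == "__main__":
    for url in blocked_urls(SAMPLE_ROBOTS, SAMPLE_URLS):
        print("blocked:", url)
```

Here the blocked `/assets/js/` path is the kind of finding to flag, since Googlebot needs rendering resources, while the blocked internal search path is usually intentional.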
3. **Indexation Review**
- Audit canonical tags across page templates (self-referencing, cross-domain, parameterized URLs)
- Check for conflicting directives (canonical pointing to page A while noindex is set)
- Review meta robots directives across templates
- Identify index bloat (thin content, tag pages, internal search results, parameter variations)
- Verify pagination handling (paginated series, rel=canonical on component pages)
- Check for unintended noindex tags (common after staging-to-production migration)
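
As a rough illustration of the conflicting-directive check, the following sketch parses a page's HTML with the standard-library `html.parser` and flags a `noindex` combined with a canonical that points elsewhere. The class and function names are hypothetical, and the rules cover only the two cases named in this step.

```python
from html.parser import HTMLParser


class IndexSignalParser(HTMLParser):
    """Collect the canonical URL and meta robots tokens from a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            tokens = attr.get("content", "").split(",")
            self.robots.update(token.strip().lower() for token in tokens)


def find_conflicts(page_url, html):
    """Return a list of human-readable indexation issues for one page."""
    parser = IndexSignalParser()
    parser.feed(html)
    issues = []
    if parser.canonical is None:
        issues.append("missing canonical tag")
    elif "noindex" in parser.robots and parser.canonical != page_url:
        issues.append("noindex combined with a canonical pointing to another URL")
    return issues


if __name__ == "__main__":
    sample = (
        '<head><link rel="canonical" href="https://example.com/a">'
        '<meta name="robots" content="noindex, follow"></head>'
    )
    print(find_conflicts("https://example.com/b", sample))
```

Running this per page template, not per URL, usually surfaces the same issues with far fewer requests.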
4. **Site Architecture Assessment**
- Map URL structure and identify depth issues (critical pages beyond 3 clicks from homepage)
- Analyze internal linking patterns (pages with high/low internal link counts, orphan pages)
- Evaluate information architecture (logical grouping, topic clusters, content silos)
- Review navigation structure (header, footer, sidebar, breadcrumbs)
- Check URL format consistency (trailing slashes, case sensitivity, parameter handling)
5. **Page Speed Deep Dive**
- **LCP optimization**: Identify the LCP element, check server response time (TTFB < 800ms), audit render-blocking resources, verify image optimization (format, size, lazy loading, preload for above-fold)
- **INP optimization**: Identify long tasks (> 50ms), audit event handlers, check for main thread blocking, review third-party script impact
- **CLS optimization**: Check for images/iframes without explicit dimensions, dynamic content injection above the fold, web font loading causing layout shift, ad slot reservations
- Audit third-party scripts for performance impact (tag managers, analytics, chat widgets, A/B testing tools)
- Review caching headers (Cache-Control, ETag, Expires) and CDN configuration
6. **Mobile-First Compliance Check**
- Verify content parity between mobile and desktop rendered versions
- Check for mobile-specific rendering issues (viewport configuration, touch targets, font sizes)
- Test mobile page speed separately (mobile networks have higher latency)
- Review structured data presence on mobile version (must match desktop)
- Check for lazy-loaded content that Googlebot might miss on mobile
7. **JavaScript SEO Evaluation**
- Determine rendering strategy (CSR, SSR, SSG, ISR, hybrid)
- Test Googlebot rendering using URL Inspection tool (rendered HTML vs raw HTML)
- Check if critical content and links require JavaScript to render
- Assess JavaScript crawl budget impact (render queue delays)
- Review client-side routing and its impact on crawlability
- Verify meta tags and canonicals are in the server-rendered HTML (not injected by JS)
8. **Redirect Chain Audit**
- Identify redirect chains (more than 1 hop) and redirect loops
- Check for mixed 301/302 usage (302s that should be 301s)
- Audit HTTPS redirect implementation (HTTP to HTTPS, www to non-www or vice versa)
- Verify redirects from old URLs after any past migrations
- Assess redirect response time impact on crawl efficiency
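
Chain and loop detection can be sketched offline against an exported redirect map. The example below assumes the map is a plain `{source: target}` dictionary (for instance, built from a crawler export). The function names and sample paths are illustrative.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow `start` through the map; return (path, is_loop)."""
    path = [start]
    seen = {start}
    current = start
    while current in redirects and len(path) <= max_hops:
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, True  # revisited a URL: redirect loop
        seen.add(current)
    return path, False


def audit_redirects(redirects):
    """Report every source URL that starts a loop or a chain of 2+ hops."""
    report = {}
    for source in redirects:
        path, is_loop = trace_redirects(source, redirects)
        hops = len(path) - 1
        if is_loop:
            report[source] = "loop"
        elif hops > 1:
            report[source] = f"chain of {hops} hops; redirect directly to {path[-1]}"
    return report


if __name__ == "__main__":
    sample_map = {
        "/old": "/interim",
        "/interim": "/new",
        "/a": "/b",
        "/b": "/a",
        "/single": "/final",
    }
    for source, finding in audit_redirects(sample_map).items():
        print(source, "->", finding)
```

Single-hop redirects are deliberately left out of the report, matching the "more than 1 hop" definition used in this step.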
9. **Structured Data Validation**
- Audit existing schema markup for errors and warnings (Google Rich Results Test)
- Identify missing schema opportunities based on content types
- Validate JSON-LD syntax and nesting
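- Check for rich result eligibility (Product, Review, Breadcrumb, etc.); Google has restricted FAQ rich results to a narrow set of sites and retired HowTo rich results, so confirm current support before recommending either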
- Verify schema matches visible on-page content (no hidden/misleading markup)
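
Generating JSON-LD from data instead of hand-writing it helps keep syntax and nesting valid. This minimal sketch builds a schema.org `BreadcrumbList` and wraps it in a script tag. The helper names and the sample trail are assumptions for illustration, and the trail should mirror the breadcrumbs actually visible on the page.

```python
import json


def breadcrumb_jsonld(trail):
    """Build a BreadcrumbList dict from an ordered list of (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


def as_script_tag(data):
    """Serialize to a JSON-LD script tag, escaping '</' so it cannot close the tag."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    sample_trail = [
        ("Home", "https://example.com/"),
        ("Shoes", "https://example.com/shoes/"),
    ]
    print(as_script_tag(breadcrumb_jsonld(sample_trail)))
```

The output still needs to pass the Rich Results Test; generating it programmatically only rules out syntax errors.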
10. **International SEO Review** (if applicable)
- Validate hreflang implementation (self-referencing tags, x-default, return links)
- Check for hreflang conflicts with canonical tags
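- Verify geotargeting signals (ccTLD, hreflang, localized content); Google Search Console's legacy International Targeting report has been retired, so do not rely on it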
- Assess URL structure for international versions
- Review content localization vs translation quality signals
11. **Security Assessment**
- Verify full HTTPS implementation (no mixed content)
- Check HSTS header presence and configuration
- Review security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy)
- Verify SSL certificate validity and chain
- Check for exposed sensitive files (wp-config.php, .env, .git)
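
The header portion of this assessment reduces to a presence check once response headers have been captured. The sketch below assumes the headers are already available as a dictionary (for example, copied from a `curl -I` response). The list of expected headers mirrors the ones named in this step, and the function name is hypothetical.

```python
# Headers named in the security assessment step, compared case-insensitively.
EXPECTED_HEADERS = [
    "strict-transport-security",
    "content-security-policy",
    "x-frame-options",
    "x-content-type-options",
    "referrer-policy",
]


def missing_security_headers(response_headers):
    """Return the expected security headers absent from a response."""
    present = {name.lower() for name in response_headers}
    return [name for name in EXPECTED_HEADERS if name not in present]


if __name__ == "__main__":
    sample_headers = {
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
        "X-Content-Type-Options": "nosniff",
    }
    print(missing_security_headers(sample_headers))
```

Presence alone does not prove a header is configured well, so the values (especially the CSP) still need a manual review.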
12. **Prioritized Recommendation Plan**
- Score each finding by impact (high/medium/low) and effort (quick win/medium/major project)
- Create an impact/effort matrix to visualize priority
- Group recommendations into: Immediate fixes (0-48 hours), Short-term wins (1-2 weeks), Medium-term projects (2-8 weeks), Long-term strategic initiatives (2-6 months)
- Estimate traffic recovery or growth potential for each fix category
- Provide implementation specs for the top 5 highest-priority items
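
One simple way to turn the impact/effort matrix into an ordered list is to score each finding as impact divided by effort. The numeric weights and the ratio itself are assumptions chosen for illustration, not a standard formula. Adjust them to match how the team actually trades impact against effort.

```python
from dataclasses import dataclass

# Hypothetical weights for the labels used in this step.
IMPACT = {"high": 3, "medium": 2, "low": 1}
EFFORT = {"quick win": 1, "medium": 2, "major project": 3}


@dataclass
class Finding:
    title: str
    impact: str
    effort: str

    @property
    def priority(self):
        # Higher impact and lower effort both raise the score.
        return IMPACT[self.impact] / EFFORT[self.effort]


def top_findings(findings, limit=5):
    """Return the `limit` highest-priority findings, best first."""
    return sorted(findings, key=lambda f: f.priority, reverse=True)[:limit]


if __name__ == "__main__":
    sample = [
        Finding("Consolidate redirect chains", "high", "major project"),
        Finding("Remove stray noindex on category template", "high", "quick win"),
        Finding("Add missing image dimensions", "medium", "quick win"),
        Finding("Tidy footer links", "low", "medium"),
    ]
    for finding in top_findings(sample, limit=2):
        print(f"{finding.priority:.1f}  {finding.title}")
```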
## Reference Files
- `core-web-vitals.md` — LCP, INP, and CLS optimization guides with specific thresholds, common causes, fix strategies, measurement methodology, and field vs lab data interpretation
- `crawlability.md` — Robots.txt syntax and best practices, XML sitemap structure and limits, crawl budget management, JavaScript rendering, log file analysis, and orphan page detection
- `site-architecture.md` — URL structure best practices, information architecture frameworks, internal linking strategy, pagination handling, faceted navigation, breadcrumbs, and site migration planning
- `indexation.md` — Canonical tag implementation, meta robots directives, X-Robots-Tag, index coverage diagnosis, duplicate content management, index bloat cleanup, and new content indexation acceleration
- `international-seo.md` — URL structure strategies for international sites, hreflang implementation methods with examples, common hreflang mistakes, geotargeting, content localization, and search engine market share by country
## Output Formats
| Deliverable | Format | Description |
|---|---|---|
| Technical SEO Audit Report | Document | Comprehensive audit across all 12 dimensions with scores, findings, and prioritized recommendations |
| Core Web Vitals Report | Document | CWV-specific analysis with per-metric diagnosis, fix specifications, and expected improvement ranges |
| Redirect Map | Spreadsheet | Source URL to destination URL mapping with status codes and redirect type for migrations |
| XML Sitemap Strategy | Document + Code | Sitemap structure plan with implementation code (sitemap index, per-type sitemaps, generation approach) |
| Site Architecture Plan | Document + Diagram description | URL hierarchy, internal linking strategy, content silo structure, and navigation recommendations |
| Robots.txt Specification | Code | Optimized robots.txt file with directives, sitemap references, and crawl-delay settings for crawlers that honor them (Googlebot ignores `crawl-delay`) |
| Structured Data Spec | Code (JSON-LD) | Ready-to-implement schema markup for all applicable page templates |
| International SEO Plan | Document | Hreflang implementation spec, URL structure recommendation, and geotargeting configuration |
| Migration Checklist | Checklist document | Pre-migration, migration day, and post-migration monitoring checklist with rollback procedures |
| Page Speed Optimization Plan | Document | Prioritized speed fixes with implementation details, expected LCP/INP/CLS improvements, and testing plan |
## Edge Cases
### JavaScript-Heavy Single Page Applications (React, Angular, Vue)
- **Situation**: Site renders all content client-side; Googlebot may see empty or incomplete pages
- **Approach**: Test rendered HTML using Google's URL Inspection tool and compare to source HTML. If critical content or links are missing from server response, recommend SSR (Next.js, Nuxt, Angular Universal) or static site generation. If SSR is not feasible, evaluate dynamic rendering as a stopgap (Rendertron, Puppeteer-based prerendering). Ensure meta tags, canonicals, and hreflang are in the initial HTML response, not injected by JavaScript. Audit JavaScript bundle size and hydration time as they directly impact INP.
### Large Ecommerce Sites (100K+ Pages, Faceted Navigation)
- **Situation**: Faceted navigation generates millions of URL combinations; crawl budget is consumed by low-value parameter pages
- **Approach**: Implement a canonicalization strategy for faceted URLs (canonical to the base category page unless the facet creates genuinely unique, valuable content). Use robots.txt or meta robots to block crawling of low-value parameter combinations. Create a curated internal linking strategy that directs crawlers to high-value pages. Build separate XML sitemaps for product pages, category pages, and editorial content. Monitor crawl stats to verify crawl budget is allocated to revenue-generating pages. Consider AJAX-based filtering that does not generate crawlable URLs for non-valuable combinations.
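
For catalogs of this size, sitemaps have to be split to stay within the 50,000-URL-per-file limit and tied together with a sitemap index. The sketch below shows one possible chunking approach. The file naming scheme and the `build_sitemaps` helper are assumptions, and the 50MB uncompressed size limit is not enforced here.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemaps(urls, base_url, max_urls=50_000):
    """Return {filename: xml_text}: chunked sitemaps plus one sitemap index."""
    files = {}
    for number, start in enumerate(range(0, len(urls), max_urls), start=1):
        chunk = urls[start:start + max_urls]
        # escape() handles &, <, and > so query-string URLs stay valid XML.
        body = "".join(f"  <url><loc>{escape(url)}</loc></url>\n" for url in chunk)
        files[f"sitemap-{number}.xml"] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{body}</urlset>\n'
        )
    index_body = "".join(
        f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n" for name in files
    )
    files["sitemap-index.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{index_body}</sitemapindex>\n'
    )
    return files


if __name__ == "__main__":
    product_urls = [f"https://example.com/p/{i}?ref=a&v=1" for i in range(5)]
    # A tiny max_urls makes the chunking visible in a small example.
    for name, xml_text in build_sitemaps(product_urls, "https://example.com/", max_urls=2).items():
        print(name, len(xml_text), "bytes")
```

Calling the function once per content type (products, categories, editorial) yields the separate sitemaps recommended above, which also makes index coverage easier to compare by type.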
### Website Migrations (Domain, Platform, HTTPS)
- **Situation**: Business is changing domains, switching CMS platforms, or consolidating multiple sites
- **Approach**: Create a comprehensive URL mapping (old URL to new URL) before migration. Implement 301 redirects for every URL with organic traffic or backlinks. Set up monitoring for crawl errors, index coverage, and organic traffic immediately after migration. Expect a temporary ranking dip (typically 2-8 weeks for well-executed migrations). Keep the old domain/hosting active for at least 12 months to serve redirects. Verify all internal links, canonical tags, sitemaps, and hreflang tags reference the new URL structure. Run a full technical audit 1 week, 1 month, and 3 months post-migration.
### Multilingual Sites with Complex Hreflang Requirements
- **Situation**: Site serves content in 10+ languages with regional variations (e.g., en-US, en-GB, en-AU, es-ES, es-MX)
- **Approach**: Use the XML sitemap method for hreflang at scale (HTML link tags become unmanageable beyond roughly 20 versions). Ensure every page has a self-referencing hreflang and an x-default fallback. Verify bidirectional return links (if page A points to page B with hreflang, page B must point back to page A). Watch for canonical conflicts (canonical and hreflang must reference the same URL). Automate hreflang generation through the CMS or build system to prevent manual errors. Validate the output with an hreflang validation tool and monitor per-country search performance in GSC.
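
The self-reference, x-default, and return-link rules lend themselves to an automated check. The sketch below assumes hreflang annotations have already been extracted into a `{page_url: {lang_code: alternate_url}}` dictionary. That structure and the function name are illustrative.

```python
def hreflang_problems(hreflang_map):
    """Return (page, problem) tuples for common hreflang mistakes."""
    problems = []
    for page, alternates in hreflang_map.items():
        if page not in alternates.values():
            problems.append((page, "no self-referencing hreflang"))
        if "x-default" not in alternates:
            problems.append((page, "no x-default"))
        for lang, target in alternates.items():
            if target == page:
                continue  # self-reference needs no return link
            # The target must list this page among its own alternates.
            if page not in hreflang_map.get(target, {}).values():
                problems.append((page, f"{target} ({lang}) does not link back"))
    return problems


if __name__ == "__main__":
    en = "https://example.com/en/"
    de = "https://example.com/de/"
    sample = {
        en: {"en": en, "de": de, "x-default": en},
        de: {"de": de},  # missing the return link to /en/ and an x-default
    }
    for page, problem in hreflang_problems(sample):
        print(page, "->", problem)
```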
### Sites with Legacy Technical Debt
- **Situation**: Years of accumulated issues — mixed HTTP/HTTPS, orphan pages, redirect chains 5+ hops deep, duplicate content across subdomains, abandoned staging environments indexed by Google
- **Approach**: Prioritize by damage — index bloat and crawl waste first (they affect the entire site), then redirect chains (they bleed PageRank), then mixed content (security and trust signals), then orphan pages (wasted content investment). Do not try to fix everything at once. Create a phased remediation plan: Phase 1 (crawl and index cleanup), Phase 2 (redirect consolidation), Phase 3 (architecture optimization). Monitor organic traffic after each phase to measure impact and catch regressions.
## Related Skills
- **Content Engine** — Page speed and crawlability directly affect content discoverability; structured data enhances content appearance in SERPs; site architecture determines how content authority flows through internal links
- **Paid Advertising** — Core Web Vitals and landing page speed impact Google Ads Quality Score; technical health of landing pages affects conversion rates and ad spend efficiency
- **AEO/GEO Intelligence** — Structured data implementation strengthens AI platform comprehension and citation likelihood; site architecture affects how AI crawlers discover and interpret content
- **Analytics & Insights** — Technical SEO changes require measurement through organic traffic, crawl stats, index coverage, and Core Web Vitals dashboards; analytics data drives technical SEO prioritization
- **CRO** — Page speed directly correlates with conversion rates (every 100ms of LCP improvement can increase conversions by up to 1%); mobile usability affects conversion paths; site architecture determines user flow efficiency

@@ -0,0 +1,304 @@
# Core Web Vitals — Optimization Reference
A comprehensive guide to measuring, diagnosing, and fixing Core Web Vitals issues. These three metrics (LCP, INP, CLS) are Google's primary user experience signals and directly affect search rankings, ad Quality Score, and conversion rates.
---
## Thresholds Summary
| Metric | Good | Needs Improvement | Poor |
|---|---|---|---|
| **LCP** (Largest Contentful Paint) | ≤ 2.5s | 2.5s to 4.0s | > 4.0s |
| **INP** (Interaction to Next Paint) | ≤ 200ms | 200ms to 500ms | > 500ms |
| **CLS** (Cumulative Layout Shift) | ≤ 0.1 | 0.1 to 0.25 | > 0.25 |
Google uses the 75th percentile of page loads (p75) from the Chrome UX Report (CrUX) for ranking signals. Optimizing for the median is not enough — you must bring the 75th percentile into the "good" range.
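
To make the median-versus-p75 point concrete, the sketch below rates a set of LCP samples both ways. It assumes a nearest-rank percentile and treats the boundary values (2.5s, 200ms, 0.1) as "good", following Google's "or less" wording. CrUX's exact aggregation may differ, and the sample values are invented.

```python
import math
from statistics import median

# (good upper bound, needs-improvement upper bound) per metric.
THRESHOLDS = {"LCP": (2.5, 4.0), "INP": (200, 500), "CLS": (0.1, 0.25)}


def p75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def rate(metric, value):
    """Classify a metric value as good, needs improvement, or poor."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


if __name__ == "__main__":
    lcp_seconds = [1.2, 1.8, 2.0, 2.1, 2.4, 2.6, 3.1, 4.5]
    print("median:", median(lcp_seconds), rate("LCP", median(lcp_seconds)))
    print("p75:   ", p75(lcp_seconds), rate("LCP", p75(lcp_seconds)))
```

With these samples the median (2.25s) rates "good" while the p75 (2.6s) rates "needs improvement", which is the gap the paragraph above warns about.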
---
## LCP — Largest Contentful Paint
### What It Measures
The render time of the largest visible content element in the viewport. LCP elements are typically:
- `<img>` elements (most common LCP element on the web)
- `<video>` poster images
- Elements with CSS `background-image`
- Block-level text elements (`<h1>`, `<p>`, etc.)
### Common Causes of Poor LCP
**1. Slow Server Response Time (TTFB > 800ms)**
- The browser cannot start rendering until it receives the first byte of HTML
- Target TTFB: < 800ms for the document request
- Causes: Unoptimized database queries, no server-side caching, no CDN, under-provisioned hosting, application-level bottlenecks
**Fix strategies:**
- Implement a CDN for static assets and HTML (Cloudflare, Fastly, CloudFront)
- Enable server-side caching (Redis, Memcached, Varnish, full-page cache)
- Optimize database queries (indexing, query optimization, connection pooling)
- Use HTTP/2 or HTTP/3 for multiplexed connections
- Consider edge computing (Cloudflare Workers, Vercel Edge Functions) to reduce latency
- Upgrade hosting if consistently over capacity
**2. Render-Blocking Resources**
- CSS and synchronous JavaScript in the `<head>` block rendering until they download and execute
- Every additional blocking resource adds to LCP
**Fix strategies:**
- Inline critical CSS (the CSS needed for above-the-fold content) directly in the `<head>`
- Defer non-critical CSS with `media="print" onload="this.media='all'"` or load asynchronously
- Add `defer` or `async` attribute to non-critical `<script>` tags
- Remove unused CSS (PurgeCSS, Coverage tab in Chrome DevTools)
- Remove unused JavaScript (tree-shaking, code splitting)
- Minimize CSS/JS file count through bundling (but balance against cache granularity)
**3. Slow Resource Load Times (LCP Element)**
- Large, unoptimized images are the most common LCP bottleneck
- LCP image is discovered late in the loading waterfall (e.g., loaded via CSS or JavaScript)
**Fix strategies:**
- **Preload the LCP image**: `<link rel="preload" as="image" href="hero.webp">` — this is the single highest-impact LCP fix for image-based LCP elements
- **Use modern image formats**: WebP (typically 25-35% smaller than an equivalent-quality JPEG) or AVIF (often around 50% smaller than JPEG, depending on content and encoder settings). Serve with the `<picture>` element for fallback support
- **Responsive images**: Use `srcset` and `sizes` attributes so browsers download the appropriately sized image for the viewport
- **Compress images**: Target quality 75-85 for JPEG/WebP (visually lossless for most content)
- **Set explicit width and height** on `<img>` elements (prevents layout shift AND helps browser allocate space early)
- **Do NOT lazy-load the LCP image**: `loading="lazy"` on the LCP element delays it. Only lazy-load below-the-fold images
- **Set fetchpriority="high"** on the LCP image element to prioritize its download
- **Avoid CSS background-image for LCP**: The browser cannot discover CSS background images until the CSS file is parsed. Use `<img>` with preload instead
**4. Client-Side Rendering**
- SPAs that render content with JavaScript delay LCP until JS is downloaded, parsed, and executed
- The browser sees an empty or skeleton page until JavaScript hydrates
**Fix strategies:**
- Implement Server-Side Rendering (SSR) for above-the-fold content
- Use Static Site Generation (SSG) for pages that do not change frequently
- Implement Incremental Static Regeneration (ISR) for dynamic content with SSG benefits
- If SSR is not feasible, use prerendering or document-level critical HTML injection
### LCP Optimization Priority Checklist
1. Identify the LCP element (Chrome DevTools > Performance panel > Timings > LCP)
2. Check TTFB (target < 800ms) — if high, fix server/CDN first
3. Check if LCP image is preloaded — if not, add preload link
4. Check image format and compression — convert to WebP/AVIF
5. Check for render-blocking CSS/JS — defer or inline critical
6. Check if LCP element requires JavaScript to render — implement SSR if so
7. Verify `fetchpriority="high"` is set on LCP image
8. Verify `loading="lazy"` is NOT set on LCP image
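
The last two checklist items, plus the explicit-dimensions rule, can be linted automatically once the LCP element's markup is known. This sketch assumes the LCP element is an `<img>` whose HTML has been copied out of DevTools. The helper names and warning messages are illustrative.

```python
from html.parser import HTMLParser


class FirstImageParser(HTMLParser):
    """Capture the attributes of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.attrs is None:
            self.attrs = {name: (value or "") for name, value in attrs}


def lint_lcp_image(img_html):
    """Return warnings for an LCP <img> that violates the checklist."""
    parser = FirstImageParser()
    parser.feed(img_html)
    if parser.attrs is None:
        return ["no <img> element found"]
    attrs = parser.attrs
    warnings = []
    if attrs.get("loading", "").lower() == "lazy":
        warnings.append('remove loading="lazy" from the LCP image')
    if attrs.get("fetchpriority", "").lower() != "high":
        warnings.append('add fetchpriority="high"')
    if "width" not in attrs or "height" not in attrs:
        warnings.append("set explicit width and height")
    return warnings


if __name__ == "__main__":
    for warning in lint_lcp_image('<img src="hero.webp" loading="lazy" width="1200">'):
        print("warning:", warning)
```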
---
## INP — Interaction to Next Paint
### What It Measures
INP measures the latency of all user interactions (clicks, taps, keyboard inputs) throughout the page lifecycle and reports a value near the worst one: on pages with many interactions, one highest-latency outlier is ignored for every 50 interactions. It replaced FID (First Input Delay) as a Core Web Vital in March 2024.
Key difference from FID: FID only measured the delay of the first interaction. INP measures ALL interactions and reports the worst one, making it a much more comprehensive responsiveness metric.
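
A rough sketch of how a page-level INP value can be derived from recorded interaction latencies follows. It assumes the outlier rule from Google's INP documentation (ignore one highest interaction per 50 interactions) and leaves out details such as how a single interaction's latency is measured. The function name and sample data are illustrative.

```python
def estimate_inp(latencies_ms):
    """Approximate INP from a list of per-interaction latencies in ms."""
    if not latencies_ms:
        return None  # pages without interactions report no INP
    ordered = sorted(latencies_ms, reverse=True)
    # Skip one worst interaction for every 50 recorded interactions.
    skip = min(len(ordered) - 1, len(ordered) // 50)
    return ordered[skip]


if __name__ == "__main__":
    short_session = [40, 80, 350]
    long_session = [900] + [100] * 59  # one outlier among 60 interactions
    print(estimate_inp(short_session))
    print(estimate_inp(long_session))
```

In the short session the single slow interaction sets the result, whereas in the long session the lone outlier is discarded. One slow handler therefore hurts most on pages where users interact only a few times.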
### Common Causes of Poor INP
**1. Long Tasks on the Main Thread**
- Any JavaScript task longer than 50ms blocks the main thread and delays interaction response
- Common offenders: large framework initialization, complex DOM manipulation, synchronous API calls, heavy computation
**Fix strategies:**
- **Break long tasks**: Use `setTimeout(fn, 0)`, `requestAnimationFrame`, or `scheduler.yield()` to break work into smaller chunks (< 50ms each)
- **Use web workers**: Offload heavy computation (data processing, parsing, calculations) to web workers so they do not block the main thread
- **Defer non-critical initialization**: Lazy-load components and initialize them on user interaction rather than on page load
- **Code-split aggressively**: Only load the JavaScript needed for the current view. Use dynamic `import()` for below-the-fold and interaction-triggered features
**2. Expensive Event Handlers**
- Click, input, and keydown handlers that perform heavy DOM manipulation, state recalculation, or synchronous layout queries
- React/Angular/Vue re-renders triggered by state changes during interaction
**Fix strategies:**
- **Debounce and throttle**: For scroll, resize, and input handlers, debounce (wait for pause) or throttle (limit frequency)
- **Minimize DOM reads/writes in handlers**: Batch DOM mutations, avoid forced synchronous layout (reading offsetHeight after a write)
- **Use CSS for visual feedback**: CSS transitions and animations run on the compositor thread, not the main thread. Use CSS for hover states, button press feedback, and simple animations
- **Virtualize long lists**: Do not render 10,000 DOM nodes. Use virtual scrolling (react-virtualized, vue-virtual-scroller) to render only visible items
- **Optimize React re-renders**: Use `React.memo`, `useMemo`, `useCallback` to prevent unnecessary re-renders. Use `useTransition` for non-urgent state updates
**3. Large DOM Size**
- Pages with > 1,500 DOM elements are at risk; > 3,000 is a red flag
- Large DOM increases memory usage and slows style recalculation, layout, and paint operations
**Fix strategies:**
- Reduce DOM nodes by simplifying layout (fewer nested containers)
- Use CSS Grid/Flexbox instead of deeply nested `<div>` structures
- Virtualize long lists and tables
- Lazy-render offscreen content
- Remove hidden elements from the DOM instead of using `display: none` on thousands of nodes
**4. Third-Party Script Impact**
- Tag managers, analytics, chat widgets, A/B testing tools, and ad scripts all compete for main thread time
- Third-party scripts are the most common source of long tasks on content-heavy sites
**Fix strategies:**
- Audit all third-party scripts with Chrome DevTools Performance panel (filter by domain)
- Load non-essential third-party scripts with `defer` or `async`
- Delay chat widgets and feedback tools until user interaction or scroll event
- Use `requestIdleCallback` for non-urgent analytics calls
- Consider Partytown or similar libraries to offload third-party scripts to web workers
- Regularly audit tag manager containers and remove unused tags
### INP Diagnosis Workflow
1. Open Chrome DevTools > Performance panel > Record a session with real interactions
2. Look for long tasks (red-flagged bars > 50ms in the main thread)
3. Identify which script/function is responsible (call stack in the task detail)
4. Check the "Interactions" track to see which user interactions had high latency
5. Use the Web Vitals Chrome extension to get real-time INP readings per interaction
6. Cross-reference with CrUX data to see field INP at the p75 level
---
## CLS — Cumulative Layout Shift
### What It Measures
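CLS quantifies how much visible content shifts during the page lifecycle. Each layout shift is scored by multiplying the impact fraction (the share of the viewport occupied by the shifting elements before and after the shift) by the distance fraction (how far they moved relative to the viewport's largest dimension). Shifts are grouped into session windows: bursts in which each shift follows the previous one by less than 1 second, capped at 5 seconds per window. CLS is the total score of the largest window, not the sum across the whole page visit.
For example, a shift with an impact fraction of 0.5 whose elements move by 20% of the viewport height (distance fraction 0.2) scores 0.5 × 0.2 = 0.1, which is the upper bound of the "good" range.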
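The session-window grouping can be sketched in a few lines of Python. This is a minimal illustration, not the browser's implementation: the function name and the `(timestamp_seconds, score)` input shape are assumptions, and the input is assumed to already exclude user-initiated shifts. It reports the largest window total, which is how CLS is reported.

```python
from typing import Iterable, Tuple


def cumulative_layout_shift(shifts: Iterable[Tuple[float, float]]) -> float:
    """Return the largest session-window total for (timestamp_seconds, score) pairs.

    A new window starts when the gap since the previous shift reaches
    1 second, or when the current window has already lasted 5 seconds.
    """
    best = 0.0
    window_total = 0.0
    window_start = None
    previous = None
    for timestamp, score in sorted(shifts):
        starts_new_window = (
            window_start is None
            or timestamp - previous >= 1.0
            or timestamp - window_start >= 5.0
        )
        if starts_new_window:
            window_start = timestamp
            window_total = 0.0
        window_total += score
        previous = timestamp
        best = max(best, window_total)
    return best


# Two bursts: the first totals about 0.10, the second about 0.22.
example = [(0.0, 0.05), (0.5, 0.05), (3.0, 0.2), (3.4, 0.02)]
print(round(cumulative_layout_shift(example), 2))
```

Because only the worst burst counts, fixing a single large shift often moves the score more than removing many small ones spread across the visit.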
### Common Causes of Poor CLS
**1. Images and Iframes Without Explicit Dimensions**
- When the browser does not know an image's dimensions in advance, it allocates zero space, then shifts content when the image loads
**Fix strategies:**
- Always set `width` and `height` attributes on `<img>` and `<iframe>` elements
- Use CSS `aspect-ratio` property for responsive images: `aspect-ratio: 16 / 9`
- For responsive images with `srcset`, the `width` and `height` attributes still help the browser calculate aspect ratio before download
**2. Dynamically Injected Content**
- Banners, cookie consent bars, newsletter pop-ins, or promotional bars inserted above existing content push everything down
**Fix strategies:**
- Reserve space for dynamic content with CSS `min-height` on container elements
- Insert dynamic content below the fold when possible
- Use CSS `transform` animations instead of changing `top`, `margin`, or `height` (transforms do not cause layout shifts)
- For cookie banners and notification bars, use fixed/sticky positioning so they overlay rather than push content
- Avoid inserting content above existing content unless in response to a user interaction (user-initiated shifts are excluded from CLS)
**3. Web Fonts Causing Layout Shift (FOUT/FOIT)**
- When a web font loads and replaces a fallback font, text reflows if the fonts have different metrics (line height, letter spacing, word width)
**Fix strategies:**
- Use `font-display: optional` — the best option for CLS; if the font is not already cached, the fallback is used for the entire page visit, and the web font is cached for next visit
- Use `font-display: swap` with font metric overrides (`ascent-override`, `descent-override`, `line-gap-override`, `size-adjust` on the `@font-face` of the fallback font) to match fallback metrics to web font metrics
- Preload critical fonts: `<link rel="preload" as="font" type="font/woff2" href="font.woff2" crossorigin>`
- Subset fonts to include only needed characters (Latin, Latin Extended) using tools like glyphhanger or fonttools
- Self-host fonts instead of using Google Fonts CDN (eliminates the extra DNS lookup and connection)
**4. Ads and Embeds Without Reserved Space**
- Ad slots that resize after loading or load asynchronously without placeholder space
- Third-party embeds (YouTube, Twitter, maps) that load with unknown dimensions
**Fix strategies:**
- Set fixed `min-height` on ad containers based on the most common ad size for that slot
- Use CSS `aspect-ratio` or explicit dimensions for embed containers
- For responsive ad slots, set the minimum size and allow growth only downward (below the ad)
- Load ads below the fold where possible
- Consider static/reserved ad slots rather than dynamic ad insertion
**5. Late-Loading CSS or JavaScript Causing Reflow**
- CSS files loaded after initial render can change layout
- JavaScript that modifies element sizes, positions, or visibility after page load
**Fix strategies:**
- Inline critical CSS to ensure above-the-fold layout is stable from first paint
- Load non-critical CSS asynchronously but ensure it does not affect above-the-fold layout
- Avoid JavaScript that modifies layout properties on visible elements after load
- Use CSS `contain: layout` on components that should not affect siblings when they change
### CLS Debugging Workflow
1. Open Chrome DevTools > Performance panel > Record page load
2. Check the "Layout Shifts" track for shift markers (their color and shape vary by DevTools version)
3. Click each shift to see which elements moved, shift score, and whether it was user-initiated
4. Use the Layout Shift Regions feature (Chrome DevTools > Rendering > Layout Shift Regions) for a visual overlay of shifting elements during real browsing
5. Check field CLS in CrUX data — lab data often underestimates CLS because automated tools do not scroll or interact with the page
---
## Measurement: Field Data vs Lab Data
### Field Data (Real User Monitoring)
- **Chrome UX Report (CrUX)**: The dataset Google uses for ranking signals. Aggregated from opted-in Chrome users. Available via PageSpeed Insights, CrUX API, BigQuery, and GSC Core Web Vitals report
- **Real User Monitoring (RUM)**: Custom instrumentation using the `web-vitals` JavaScript library. Provides per-page, per-segment, per-geography breakdowns
- **Google Search Console CWV report**: Aggregates CrUX data at the URL group level. Shows good/needs improvement/poor distribution
**Field data is what Google uses for ranking.** Lab data is for diagnosis only.
### Lab Data (Synthetic Testing)
- **Lighthouse** (Chrome DevTools, PageSpeed Insights, CI/CD): Simulated page load on a throttled connection. Good for diagnosis, not representative of real user experience
- **WebPageTest**: Advanced synthetic testing with real browsers, real network conditions, and filmstrip/waterfall views. Best lab tool for deep performance analysis
- **Chrome DevTools Performance panel**: Real-time recording on your local machine. Not throttled by default — enable CPU and network throttling for realistic results
### Common Discrepancies Between Field and Lab
| Scenario | Lab Shows | Field Shows | Explanation |
|---|---|---|---|
| CLS from scrolling | Low CLS | High CLS | Lab tools only measure load CLS; field captures all user-session shifts |
| INP from real interactions | Not measurable | Poor INP | Lab tools cannot simulate diverse real user interactions |
| LCP on slow networks | Good LCP (fast local network) | Poor LCP | Field includes users on 3G, 4G, and congested networks |
| Third-party script impact | Minimal | Significant | Lab may not load all third-party scripts (ad blockers, consent managers blocking tags) |
| Geographic latency | Low TTFB | High TTFB | Lab tests from one location; field includes users far from servers |
Always prioritize field data for decision-making. Use lab data to diagnose specific issues identified in field data.
---
## Optimization Priority Framework
When multiple CWV metrics need improvement, prioritize using this framework:
### Priority 1: Metric in "Poor" Range
Any metric in the poor range (LCP > 4s, INP > 500ms, CLS > 0.25) should be addressed first. Poor CWV can directly suppress rankings.
### Priority 2: LCP
LCP is the most impactful CWV for user experience and is often the easiest to improve with targeted fixes (image preload, CDN, format optimization). A 1-second LCP improvement can increase conversions by 2-5%.
### Priority 3: CLS
CLS fixes are typically low-effort (adding image dimensions, reserving ad space, font-display settings) with immediate impact. CLS is also the most noticeable metric to users — layout shift is visually jarring and erodes trust.
### Priority 4: INP
INP is often the hardest to fix because it requires JavaScript refactoring, framework-level changes, or third-party script auditing. Improvements may require development sprints rather than quick configuration changes.
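The ordering above can be expressed as a small triage helper. This is a sketch under stated assumptions: the metric keys (`lcp_ms`, `inp_ms`, `cls`) and function names are hypothetical, the "poor" thresholds come from Priority 1, and the "good" thresholds (2.5s, 200ms, 0.1) are Google's published values. Input values are assumed to be p75 field measurements.

```python
from typing import Dict, List

# (upper bound of "good", upper bound of "needs improvement") per metric
THRESHOLDS = {
    "lcp_ms": (2500, 4000),
    "inp_ms": (200, 500),
    "cls": (0.1, 0.25),
}
# Default order from the framework once nothing is in the poor range
FRAMEWORK_ORDER = ["lcp_ms", "cls", "inp_ms"]


def rate(metric: str, p75: float) -> str:
    """Classify a p75 value as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if p75 <= good:
        return "good"
    return "needs improvement" if p75 <= poor else "poor"


def prioritize(p75_values: Dict[str, float]) -> List[str]:
    """Return failing metrics: poor ones first, then LCP, CLS, INP order."""
    failing = [m for m in p75_values if rate(m, p75_values[m]) != "good"]
    return sorted(
        failing,
        key=lambda m: (rate(m, p75_values[m]) != "poor", FRAMEWORK_ORDER.index(m)),
    )


print(prioritize({"lcp_ms": 3000, "inp_ms": 650, "cls": 0.05}))
```

In this example INP is in the poor range, so it jumps ahead of LCP even though LCP normally comes first; CLS is already good and drops out of the list.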
### Cross-Metric Quick Wins
| Fix | LCP Impact | INP Impact | CLS Impact | Effort |
|---|---|---|---|---|
| Preload LCP image | High | None | None | 5 minutes |
| Add image dimensions | None | None | High | 30 minutes |
| Convert images to WebP/AVIF | Medium | None | None | 1-2 hours |
| Inline critical CSS | Medium | Low | Low | 2-4 hours |
| Defer third-party scripts | Low | High | Medium | 1-2 hours |
| Implement CDN | High | None | None | 2-4 hours |
| font-display: optional | None | None | High | 15 minutes |
| Code-split JavaScript | Low | High | None | 1-2 days |
| SSR for above-fold content | High | Medium | Low | 1-2 weeks |
---
## Tools Reference
| Tool | Best For | Cost |
|---|---|---|
| PageSpeed Insights | Quick CWV check with field + lab data | Free |
| Google Search Console CWV Report | Site-wide CWV status and URL grouping | Free |
| Chrome DevTools Performance Panel | Deep diagnosis of specific pages | Free |
| WebPageTest | Advanced waterfall analysis and filmstrip comparison | Free (public) / Paid (private) |
| CrUX API | Programmatic access to field data | Free |
| CrUX BigQuery Dataset | Large-scale CWV analysis across sites | Free (BigQuery free tier) |
| `web-vitals` JS Library | Real user monitoring instrumentation | Free (open source) |
| Lighthouse CI | Automated CWV regression testing in CI/CD | Free (open source) |
| Chrome Web Vitals Extension | Real-time CWV overlay during browsing | Free |
| DebugBear | Continuous CWV monitoring with alerts | Paid |
| SpeedCurve | Performance monitoring with CWV trends | Paid |
| Calibre | Automated performance budgets and alerts | Paid |


@@ -0,0 +1,326 @@
# Crawlability — Robots.txt, Sitemaps, Crawl Budget & Log Analysis
A comprehensive reference for ensuring search engine crawlers can discover, access, render, and efficiently crawl all important pages on a website. Crawlability is the foundation of technical SEO — if search engines cannot crawl a page, it cannot rank.
---
## Robots.txt
### Purpose
The `robots.txt` file tells search engine crawlers which URLs they are allowed or disallowed from requesting. It is a crawl directive, not an indexation directive: pages blocked by robots.txt can still appear in search results if other pages link to them (GSC reports these as "Indexed, though blocked by robots.txt").
### Syntax Reference
```
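# Default group: applies to crawlers that have no more specific group
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /internal-search?
Allow: /api/public/
# Honored by some crawlers (e.g. Bing); ignored by Google
Crawl-delay: 1

# Googlebot obeys only this group, not the * rules above
User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap-index.xml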
```
**Directives:**
- `User-agent`: Specifies which crawler the rules apply to. `*` means all crawlers
- `Disallow`: Blocks crawling of the specified path. Empty value (`Disallow:`) means allow everything
- `Allow`: Explicitly permits crawling of a path within a broader Disallow. Googlebot supports Allow; some crawlers do not
- `Crawl-delay`: Requests a delay (in seconds) between requests. Google ignores this, and the GSC crawl rate limiter was retired in January 2024; to slow Googlebot temporarily, serve 500, 503, or 429 responses. Bing respects it
- `Sitemap`: Points to the XML sitemap. Can list multiple Sitemap directives
**Pattern matching (supported by Google and Bing; standardized in RFC 9309):**
- `*` matches any sequence of characters: `Disallow: /*.pdf$` blocks all PDF files
- `$` anchors to end of URL: `Disallow: /page$` blocks `/page` but allows `/page/subpage`
- Path matching is case-sensitive
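The matching rules can be checked offline with a small matcher. This is a sketch, not Google's parser: `is_allowed` and the `(kind, pattern)` rule format are hypothetical names, and it assumes Google's documented precedence (the longest matching pattern wins, and `Allow` wins a tie). Python's built-in `urllib.robotparser` is deliberately not used here because it does not handle `*` and `$` the way Googlebot does.

```python
import re
from typing import List, Tuple


def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern (* wildcard, $ end anchor) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """Evaluate one user-agent group; rules are ("allow" | "disallow", pattern)."""
    best_length = -1
    allowed = True  # no matching rule means crawling is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow matches nothing
            continue
        if _pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            longer = len(pattern) > best_length
            tie_for_allow = len(pattern) == best_length and is_allow
            if longer or tie_for_allow:
                best_length = len(pattern)
                allowed = is_allow
    return allowed


rules = [
    ("disallow", "/admin/"),
    ("disallow", "/api/"),
    ("allow", "/api/public/"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/page$"),
]
for path in ["/api/public/users", "/api/private", "/files/report.pdf", "/page/subpage"]:
    print(path, is_allowed(path, rules))
```

Running candidate rules through a matcher like this before deployment catches the common surprise that a short `Disallow` placed earlier in the file does not override a longer `Allow`.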
### Common Robots.txt Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Blocking CSS/JS files | Googlebot cannot render the page; mobile-first indexing fails | Allow all CSS and JS: `Allow: /*.css` and `Allow: /*.js` |
| Blocking entire site accidentally (`Disallow: /`) | No pages crawled; entire site deindexed over time | Audit robots.txt after every deployment |
| Blocking parameterized URLs that have unique content | Valuable pages never crawled | Use noindex instead of Disallow for pages that should not be indexed but can be crawled |
| No Sitemap directive | Crawlers must discover sitemap through other means | Always include `Sitemap:` directive |
| Using robots.txt to prevent indexation | Pages can still be indexed if linked externally | Use meta noindex or X-Robots-Tag for indexation control |
| Different robots.txt on staging vs production | Staging robots.txt (Disallow: /) deployed to production | Add robots.txt validation to deployment checklist |
| Blocking the robots.txt file itself via server config | A 4xx response (e.g. 403) is treated as no restrictions; a 5xx response makes Google treat the whole site as temporarily disallowed | Ensure robots.txt returns 200 status code |
### Testing Robots.txt
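- **Google Search Console > Settings > robots.txt report**: Shows the robots.txt files Google found, when they were last fetched, and any parsing warnings or errors. It replaced the legacy Robots.txt Tester, which Google retired in late 2023; to test a specific URL against the rules, use the URL Inspection tool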
- **Bing Webmaster Tools**: Similar testing functionality for Bingbot rules
- Robots.txt must be served at the root of the domain: `https://example.com/robots.txt`
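- Must return HTTP 200. If it returns 5xx (or 429), Google temporarily treats the whole site as disallowed and keeps retrying; after an extended outage (about 30 days) it falls back to the last cached copy, or to no restrictions if no copy exists. If it returns 4xx (other than 429), Google treats it as no restrictions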
- Maximum file size: 500KB (Google ignores rules beyond this limit)
---
## XML Sitemaps
### Purpose
XML sitemaps tell search engines which URLs exist and are worth crawling. They supplement natural crawl discovery through links. Sitemaps are especially important for large sites, new sites with few inbound links, sites with deep architecture, and pages with limited internal linking.
### Structure
**Basic sitemap:**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2025-11-15</lastmod>
    <!-- Google ignores <changefreq> and <priority>; other crawlers may still read them -->
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
**Sitemap index (for large sites):**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2025-11-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2025-11-10</lastmod>
</sitemap>
</sitemapindex>
```
### Limits and Requirements
| Constraint | Limit |
|---|---|
| URLs per sitemap file | 50,000 |
| Sitemap file size (uncompressed) | 50 MB |
| Sitemaps per sitemap index | 50,000 |
| Maximum total URLs (via index) | 2.5 billion (50K x 50K) |
| Encoding | UTF-8 |
| Compression | gzip supported and recommended for large sitemaps |
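One way to stay under the per-file limit is to chunk the URL list and generate the index alongside the child sitemaps. The sketch below uses only the standard library; the function name, the `sitemap-N.xml` naming scheme, and the `(loc, lastmod)` input shape are assumptions for illustration. `ET.tostring(..., encoding="unicode")` omits the XML declaration, so a real generator would prepend it when writing files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # protocol limit per sitemap file


def build_sitemaps(
    entries: List[Tuple[str, str]],
    base_url: str,
    max_urls: int = MAX_URLS_PER_FILE,
) -> Dict[str, str]:
    """Return {filename: xml_text} for child sitemaps plus sitemap-index.xml.

    entries are (loc, lastmod) pairs with lastmod as YYYY-MM-DD.
    """
    files: Dict[str, str] = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for number, start in enumerate(range(0, len(entries), max_urls), start=1):
        chunk = entries[start:start + max_urls]
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc, lastmod in chunk:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            ET.SubElement(url, "lastmod").text = lastmod
        name = f"sitemap-{number}.xml"
        files[name] = ET.tostring(urlset, encoding="unicode")
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/{name}"
        # ISO dates sort lexicographically, so max() gives the newest one
        ET.SubElement(sitemap, "lastmod").text = max(d for _, d in chunk)
    files["sitemap-index.xml"] = ET.tostring(index, encoding="unicode")
    return files


pages = [(f"https://example.com/p{i}", "2025-11-15") for i in range(5)]
print(sorted(build_sitemaps(pages, "https://example.com", max_urls=2)))
```

The 50 MB uncompressed size limit is not checked here; sites with very long URLs may need a smaller `max_urls` to stay under it.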
### Sitemap Best Practices
1. **Only include canonical, indexable URLs**: Do not include URLs with noindex, non-canonical URLs, redirected URLs, or 4xx/5xx pages
2. **Use accurate `lastmod` dates**: Google uses lastmod to prioritize crawling. Only update the date when content meaningfully changes. Inaccurate dates (auto-updating to today) cause Google to ignore lastmod entirely
3. **Segment sitemaps by content type**: Separate sitemaps for products, blog posts, category pages, and editorial content. This makes GSC reporting more useful and helps diagnose crawl issues by section
4. **Keep sitemaps current**: Dynamically generate sitemaps or update them on content publish/update. Stale sitemaps with dead URLs waste crawl budget
5. **Submit sitemaps in GSC**: Submit via Google Search Console AND reference in robots.txt
6. **Gzip large sitemaps**: Compress sitemaps to reduce server bandwidth and speed up crawler downloads
7. **Monitor sitemap status in GSC**: Check "Sitemaps" report for errors, warnings, and coverage
### Specialized Sitemaps
**Image Sitemap:**
```xml
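<!-- Requires xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" on <urlset> -->
<url>
  <loc>https://example.com/product-page</loc>
  <image:image>
    <image:loc>https://example.com/images/product.jpg</image:loc>
    <!-- Google deprecated <image:title> and <image:caption> in 2022 and no longer reads them -->
    <image:title>Product Name</image:title>
    <image:caption>Description of the product image</image:caption>
  </image:image>
</url>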
```
**Video Sitemap:**
```xml
<url>
<loc>https://example.com/video-page</loc>
<video:video>
<video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
<video:title>Video Title</video:title>
<video:description>Video description</video:description>
<video:content_loc>https://example.com/video.mp4</video:content_loc>
<video:duration>600</video:duration>
</video:video>
</url>
```
**News Sitemap** (for Google News publishers):
- URLs must be less than 2 days old
- Requires `<news:publication>`, `<news:publication_date>`, and `<news:title>` elements
- Submit only articles, not index pages or category pages
---
## Crawl Budget
### What It Is
Crawl budget is the number of URLs Googlebot will crawl on a site within a given time period. It is determined by two factors:
1. **Crawl rate limit**: The maximum crawling speed Googlebot uses to avoid overloading the server. Determined by server responsiveness and health
2. **Crawl demand**: How much Google wants to crawl based on the site's popularity, freshness signals, and perceived size
### When Crawl Budget Matters
Crawl budget is primarily a concern for:
- Sites with 10,000+ pages
- Sites that generate new URLs rapidly (ecommerce, classified, UGC platforms)
- Sites with slow server response times (aim for a TTFB under 200ms for crawl efficiency)
- Sites where important pages are changing frequently and need quick recrawling
For small sites (under 10,000 pages), crawl budget is rarely a limiting factor.
### What Wastes Crawl Budget
| Waste Source | Description | Fix |
|---|---|---|
| Faceted navigation URLs | Filtering/sorting creates millions of parameter combinations | Block low-value facets in robots.txt; canonicalize to base category |
| Internal search result pages | `/search?q=xyz` indexed and crawled for every query | Block `/search` in robots.txt to save crawl budget, or leave it crawlable and add noindex; do not combine them, because a blocked page's noindex is never seen |
| Session ID URLs | Same page with different session parameters | Remove session IDs from URLs; use cookies instead |
| Infinite scroll/pagination traps | Calendar widgets, infinite pagination generating unlimited URLs | Cap pagination depth; use `rel="canonical"` on component pages |
| Soft 404 pages | Pages returning 200 but showing "no results" or empty content | Return proper 404 or 410 status codes |
| Duplicate content from parameters | Sort orders, tracking parameters, currency selectors | Canonicalize to the parameter-free version |
| Orphan pages | Pages with no internal links — only reachable through sitemap | Either add internal links or remove from sitemap if not valuable |
| Redirect chains | Each redirect consumes a crawl, and Google may stop following after 5 hops | Resolve chains to direct 301 redirects |
### Crawl Budget Optimization Strategies
1. **Improve server response time**: TTFB under 200ms allows Googlebot to crawl more URLs per session
2. **Block crawling of low-value URLs** via robots.txt (search results, filtered views, admin pages, API endpoints)
3. **Clean up redirect chains**: Resolve to direct single-hop 301s
4. **Return proper status codes**: 404 for not-found, 410 for permanently removed, 503 for temporary downtime
5. **Keep XML sitemaps clean**: Only canonical, indexable, 200-status URLs
6. **Use internal linking to signal priority**: Pages with more internal links get crawled more frequently
7. **Update `lastmod` accurately**: Helps Googlebot prioritize recently changed URLs
8. **Monitor crawl stats in GSC**: Crawl Stats report shows pages crawled per day, average response time, and crawl response breakdowns
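Strategy 3 (collapsing redirect chains) is easy to automate when the redirect map is available as data. The following is a minimal sketch that assumes the redirects are held in a plain `source -> target` dictionary; the function name and the `max_hops` default are illustrative choices, not values from any particular server or crawler.

```python
from typing import Dict


def flatten_redirects(redirects: Dict[str, str], max_hops: int = 10) -> Dict[str, str]:
    """Map every source URL directly to its final destination.

    Raises ValueError on a loop or on a chain longer than max_hops.
    """
    flattened: Dict[str, str] = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        hops = 1
        # Keep following while the current target is itself redirected
        while target in redirects:
            if target in seen or hops >= max_hops:
                raise ValueError(f"Redirect loop or overly long chain starting at {source}")
            seen.add(target)
            target = redirects[target]
            hops += 1
        flattened[source] = target
    return flattened


chain = {"/old": "/a", "/a": "/b", "/b": "/c"}
print(flatten_redirects(chain))
```

The flattened map can then be written back to the server or CDN configuration so that every legacy URL answers with a single 301.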
---
## JavaScript Rendering and Crawling
### How Googlebot Handles JavaScript
Googlebot uses a two-phase process:
1. **Crawl phase**: Downloads HTML, discovers links and resources in the raw HTML
2. **Render phase**: Executes JavaScript using a headless Chromium instance, discovers additional content and links in the rendered DOM
The render phase is resource-intensive and happens in a separate queue. During peak load, rendering can be delayed by seconds to days. Content and links that exist only in JavaScript-rendered DOM may be discovered late.
### Rendering Strategies and SEO Impact
| Strategy | Initial HTML | SEO Risk | Best For |
|---|---|---|---|
| **Static HTML** | Complete content | None | Blogs, marketing sites, documentation |
| **Server-Side Rendering (SSR)** | Complete content | None | Dynamic content that changes per request |
| **Static Site Generation (SSG)** | Complete content | None | Content that changes infrequently |
| **Incremental Static Regeneration (ISR)** | Complete content (stale-while-revalidate) | Very low | High-traffic dynamic content |
| **Client-Side Rendering (CSR)** | Empty shell or skeleton | High | Authenticated dashboards (not for SEO pages) |
| **Hybrid (SSR + CSR)** | Critical content server-rendered; interactive parts client-rendered | Low | Modern web apps with SEO requirements |
### JavaScript SEO Checklist
- [ ] Critical content visible in the raw HTML source (View Source, not Inspect Element)
- [ ] Internal links are standard `<a href="...">` tags, not JavaScript-triggered navigation
- [ ] Meta tags (title, description, canonical, robots) are in the initial HTML, not injected by JS
- [ ] Structured data (JSON-LD) is in the initial HTML response
- [ ] URL Inspection tool in GSC shows rendered HTML matches what users see
- [ ] No critical rendering errors in GSC's URL Inspection "More Info" section
- [ ] Client-side routing uses History API (pushState), not hash-based routing (`#/page`)
- [ ] Server returns proper HTTP status codes (404, 301) rather than handling them client-side
---
## Log File Analysis
### What to Analyze
Server logs record every request made to the server, including search engine crawlers. Analyzing these logs reveals how crawlers actually behave on the site, which may differ significantly from what you expect.
### Key Metrics from Log Files
| Metric | What It Tells You | Healthy Range |
|---|---|---|
| **Crawl frequency by URL** | How often Googlebot visits each URL | Important pages: daily; low-value: weekly or less |
| **Crawl frequency by section** | Which site sections get the most crawler attention | Should align with business value of each section |
| **Response code distribution** | Percentage of 200, 301, 404, 5xx responses served to bots | > 90% should be 200; < 1% should be 5xx |
| **Average response time for bots** | Server performance under crawler load | < 200ms ideal; > 500ms is a problem |
| **Crawl of non-indexable URLs** | How much crawl budget is wasted on noindex, blocked, or redirected URLs | < 20% of total bot requests |
| **Crawl of orphan pages** | Pages crawled that have no internal links | Should be near 0 for important content |
| **Bot identification** | Which bots are crawling and their behavior | Verify Googlebot, Bingbot; watch for scraper bots |
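As an illustration of the response code distribution metric, the sketch below parses access log lines and reports the share of each status class served to one bot. It assumes the common "combined" log format used by Apache and Nginx defaults; the regex, function name, and sample lines are hypothetical and will need adjusting for other formats. Matching on the user-agent string alone does not prove the request came from the real bot.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Combined log format: ip - user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def status_class_share(lines: Iterable[str], bot_token: str = "Googlebot") -> Dict[str, float]:
    """Return the share of 2xx/3xx/4xx/5xx responses for one bot's requests."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and bot_token.lower() in match["agent"].lower():
            counts[match["status"][0] + "xx"] += 1
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()} if total else {}


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [15/Nov/2025:10:00:00 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [15/Nov/2025:10:00:02 +0000] "GET /gone HTTP/1.1" 404 310 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [15/Nov/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]
print(status_class_share(sample))
```

For logs too large to hold in memory, the same function works on a file object, since it only iterates over lines once.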
### Log Analysis Workflow
1. **Extract bot requests** from access logs (filter by user-agent containing "Googlebot", "bingbot", "Yandex", etc.)
2. **Verify bot identity**: Googlebot IPs resolve to `*.googlebot.com` or `*.google.com` via reverse DNS; confirm with a forward DNS lookup that the hostname resolves back to the same IP. Fake Googlebots are common
3. **Segment by URL pattern**: Group crawled URLs by directory/template (product pages, blog posts, category pages, etc.)
4. **Calculate crawl distribution**: What percentage of crawls goes to each section? Does it match the site's priority?
5. **Identify crawl waste**: URLs returning 3xx, 4xx, 5xx to bots; non-indexable URLs being crawled repeatedly
6. **Check response times**: Are any URL patterns consistently slow for bots?
7. **Compare to sitemap**: Are all sitemap URLs being crawled? Are non-sitemap URLs being crawled more than sitemap URLs?
8. **Track over time**: Weekly log analysis to detect crawl behavior changes after site updates
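The identity check in step 2 can be sketched as a reverse lookup followed by a forward confirmation. The function name and the injectable `reverse_lookup`/`forward_lookup` parameters are assumptions added so the logic can be exercised without network access; by default it uses the standard `socket` module, whose `gethostbyname_ex` returns IPv4 addresses only. The hostname suffixes are the two named in the workflow above.

```python
import socket
from typing import Callable, List, Optional

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Optional[Callable[[str], str]] = None,
    forward_lookup: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm the hostname."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # A spoofed PTR record fails here: the hostname must map back to the IP
        return ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Offline demonstration with stubbed DNS answers (hypothetical values)
print(is_verified_googlebot(
    "66.249.66.1",
    reverse_lookup=lambda addr: "crawl-66-249-66-1.googlebot.com",
    forward_lookup=lambda host: ["66.249.66.1"],
))
```

DNS lookups are slow, so in practice the result is usually cached per IP rather than computed for every log line.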
### Tools for Log Analysis
- **Screaming Frog Log File Analyser**: Dedicated tool for SEO log analysis. Parses common log formats, segments by bot, visualizes crawl patterns
- **Custom scripts (Python/pandas)**: For large log files or custom analysis needs. Parse with regex, aggregate in pandas
- **ELK Stack (Elasticsearch, Logstash, Kibana)**: For continuous log monitoring and dashboarding at scale
- **BigQuery or Athena**: For querying very large log files stored in cloud storage
- **Botify, OnCrawl, JetOctopus**: Enterprise SEO platforms with built-in log file analysis
---
## Orphan Page Detection
### What Are Orphan Pages
Orphan pages are URLs that exist on the server and may be indexed but have zero internal links pointing to them. They are only discoverable through:
- XML sitemaps
- External backlinks
- Direct URL entry
- Previously cached crawl data
### Why Orphan Pages Matter
- **Crawl inefficiency**: If the page is valuable, it is being starved of crawl frequency and PageRank
- **Index bloat**: If the page is low-value, it is consuming index space without contributing
- **Missed SEO opportunity**: Pages with no internal links signal low importance to search engines
### Detection Method
1. Crawl the site with a tool like Screaming Frog, Sitebulb, or a custom crawler starting from the homepage
2. Export the list of discovered URLs (reachable through internal links)
3. Compare against: XML sitemap URLs, GSC indexed URLs, server log URLs (pages Googlebot actually crawled)
4. Any URL in the sitemap, GSC, or logs that was NOT found by the internal crawl is an orphan
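The comparison in steps 3 and 4 is a set difference once the URL lists are exported. The sketch below assumes each source is a plain list of URL strings; the function names are hypothetical, and the normalization (lowercased host, no fragment, no trailing slash) is an assumption that suits sites where `/page` and `/page/` are the same resource.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_orphans(
    crawled: Iterable[str],
    sitemap: Iterable[str] = (),
    gsc_indexed: Iterable[str] = (),
    log_urls: Iterable[str] = (),
) -> List[str]:
    """Return URLs known from sitemap, GSC, or logs that the link crawl never reached."""
    linked = {normalize(u) for u in crawled}
    known = {normalize(u) for source in (sitemap, gsc_indexed, log_urls) for u in source}
    return sorted(known - linked)


print(find_orphans(
    crawled=["https://example.com/", "https://example.com/a/"],
    sitemap=["https://example.com/a", "https://example.com/b"],
    log_urls=["https://Example.com/c#top"],
))
```

Keeping the three sources as separate arguments makes it straightforward to report where each orphan was found, which helps decide between adding links and removing the URL.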
### Resolution
- **Valuable orphan pages**: Add internal links from relevant parent pages. Include in navigation or related-content sections
- **Low-value orphan pages**: Remove from sitemap, add noindex, or return 410 Gone if truly obsolete
- **Orphan pages with backlinks**: High priority to rescue — add internal links to capture that external link equity
---
## URL Parameter Handling
### The Problem
URL parameters (query strings) create multiple URLs pointing to the same or similar content:
- `example.com/shoes` (base)
- `example.com/shoes?color=red` (filtered)
- `example.com/shoes?sort=price` (sorted)
- `example.com/shoes?color=red&sort=price&page=2` (combined)
For a site with 50 categories, 10 filters, 5 sort options, and 10 pages of pagination, applying just one filter at a time already produces 50 × 10 × 5 × 10 = 25,000 URL variations from 50 base categories, and combining filters multiplies that further.
### Resolution Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| **Canonical to base URL** | Parameter does not create unique, valuable content (sort, session, tracking) | `<link rel="canonical" href="base-url">` on parameterized pages |
| **Robots.txt block** | High-volume parameter URLs that waste crawl budget | `Disallow: /*?sort=` and `Disallow: /*&sort=` in robots.txt (the second covers `sort` appearing after another parameter) |
| **Noindex, follow** | Parameter pages have some link value but should not rank | `<meta name="robots" content="noindex, follow">` |
| **Allow indexation** | Parameter creates genuinely unique, search-valuable content (e.g., `/shoes?color=red` targets "red shoes") | Ensure unique title, description, and content; self-referencing canonical |
| **AJAX-based filtering** | Prevent parameter URLs from being generated at all | Filtering updates content via JavaScript without changing the URL; use History API for shareable state |
Note: Google deprecated its URL Parameters tool in Google Search Console in 2022. Parameter handling must now be managed entirely through on-site signals (canonicals, robots, noindex).
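For the "canonical to base URL" strategy, the canonical href can be derived by stripping parameters that do not create unique content. This is a minimal sketch: the function name and the contents of `NON_CANONICAL_PARAMS` are assumptions and should be replaced with the parameters a given site actually uses. Remaining parameters are sorted so that equivalent URLs produce one canonical form.

```python
from typing import Iterable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters that never change the page's indexable content
NON_CANONICAL_PARAMS = frozenset(
    {"sort", "sessionid", "utm_source", "utm_medium", "utm_campaign"}
)


def canonical_url(url: str, drop: Iterable[str] = NON_CANONICAL_PARAMS) -> str:
    """Remove non-canonical parameters and the fragment; sort what remains."""
    dropped = {name.lower() for name in drop}
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in dropped
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/shoes?sort=price&color=red&utm_source=mail"))
print(canonical_url("https://example.com/shoes?sort=price"))
```

Here `color` is kept because, as in the "Allow indexation" row, a color filter may target its own search demand; parameters like that stay in the canonical URL by design.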


@@ -0,0 +1,280 @@
# Indexation — Canonicals, Meta Robots, Duplicate Content & Index Management
A comprehensive reference for controlling which pages search engines index, resolving duplicate content, managing index coverage, and accelerating indexation of new content. Indexation management ensures that search engine indexes contain only the pages you want to rank, with no duplication, no bloat, and no wasted authority.
---
## Canonical Tags
### Purpose
The `rel="canonical"` link element tells search engines which URL is the preferred (canonical) version of a page when multiple URLs serve the same or substantially similar content. It consolidates ranking signals (backlinks, PageRank) onto the canonical URL.
### Implementation
**HTML link element (most common):**
```html
<link rel="canonical" href="https://example.com/preferred-page">
```
**HTTP header (for non-HTML resources like PDFs):**
```
Link: <https://example.com/preferred-page>; rel="canonical"
```
### Canonical Tag Rules
1. **Self-referencing canonicals**: Every indexable page should have a canonical tag pointing to itself. This prevents issues from URL parameters, tracking codes, or session IDs creating duplicate URLs that Google discovers through external links
2. **Canonical should be an absolute URL**: `href="https://example.com/page"` not `href="/page"`. Relative URLs are supported but error-prone
3. **Canonical must point to a 200-status page**: Do not canonical to a 301, 404, or 5xx page
4. **Canonical must match the protocol**: HTTPS pages should canonical to HTTPS URLs
5. **Canonical is a hint, not a directive**: Google may choose to ignore the canonical if other signals contradict it (e.g., internal links primarily point to a different URL)
6. **One canonical per page**: Multiple canonical tags on the same page cause Google to ignore all of them
### Common Canonical Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Canonical points to a noindex page | Conflicting signals — Google may ignore both | Remove noindex from the canonical target, or change the canonical to an indexable page |
| Canonical points to a 404/410 page | Canonical signal is ignored; page may be indexed independently | Update canonical to a live, relevant page |
| Canonical to a redirected URL | Google may follow the redirect and use the final destination, but this adds unnecessary ambiguity | Point canonical directly to the final destination URL |
| Canonical chain (A canonicals to B, B canonicals to C) | Google may resolve correctly but processing delays occur; long chains may be abandoned | Point A directly to C |
| Relative URLs in canonical | Parsed relative to current URL — may resolve incorrectly across templates | Always use absolute URLs |
| Canonical between very different pages | Google ignores the canonical because content does not match | Only canonical between pages with substantially similar content |
| Missing self-referencing canonical | Parameter variations and tracking URLs may be indexed as duplicates | Add self-referencing canonical to every indexable page template |
| Canonical in the `<body>` instead of `<head>` | Google may not process it | Ensure canonical tag is within the `<head>` element |
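Several of these mistakes can be detected from the HTML alone. The sketch below uses the standard library `html.parser` to flag missing, duplicate, relative, or misplaced canonical tags; the class and function names are hypothetical. It assumes the markup contains explicit `<head>` tags and does not check the target's status code, redirects, or noindex state, which require fetching the canonical URL.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collect every rel="canonical" link and whether it sat inside <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.found: List[Tuple[str, bool]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attributes = dict(attrs)
            rels = (attributes.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.found.append((attributes.get("href") or "", self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_canonical(html: str) -> List[str]:
    """Return a list of problems found with the page's canonical tags."""
    finder = CanonicalFinder()
    finder.feed(html)
    problems: List[str] = []
    if not finder.found:
        problems.append("missing canonical")
    if len(finder.found) > 1:
        problems.append("multiple canonical tags")
    for href, in_head in finder.found:
        parts = urlsplit(href)
        if not (parts.scheme and parts.netloc):
            problems.append(f"relative canonical: {href}")
        if not in_head:
            problems.append(f"canonical outside <head>: {href}")
    return problems


page = '<html><head><link rel="canonical" href="/page"></head><body></body></html>'
print(audit_canonical(page))
```

Running this against the raw server response rather than the rendered DOM also confirms the canonical is present before JavaScript executes.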
### Cross-Domain Canonicals
Used when the same content exists on multiple domains (syndication, multi-brand, regional sites):
```html
<!-- On syndication-partner.com -->
<link rel="canonical" href="https://original-publisher.com/article">
```
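Cross-domain canonicals are a hint like any other canonical, and Google generally respects them when the content is truly identical; in that case the canonicalized domain passes ranking signals to the canonical domain. For syndicated content, where partner pages often differ substantially from the original, Google's current guidance is to have partners block indexing (noindex) rather than rely on a cross-domain canonical.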
---
## Meta Robots Directives
### Available Directives
| Directive | Meaning |
|---|---|
| `index` | Allow this page to be indexed (default behavior; rarely needs to be explicit) |
| `noindex` | Do not show this page in search results. Strongest indexation control |
| `follow` | Follow links on this page (default behavior) |
| `nofollow` | Do not follow any links on this page for ranking purposes |
| `noarchive` | Do not show a cached copy of this page in search results |
| `nosnippet` | Do not show a text snippet or video preview in search results |
| `max-snippet:[n]` | Limit text snippet to n characters |
| `max-image-preview:[size]` | Limit image preview size: `none`, `standard`, `large` |
| `max-video-preview:[n]` | Limit video preview to n seconds |
| `notranslate` | Do not offer translation of this page in search results |
| `noimageindex` | Do not index images on this page |
| `unavailable_after:[date]` | Do not show this page after the specified date |
### Implementation
**HTML meta tag:**
```html
<meta name="robots" content="noindex, follow">
```
**Specific crawler:**
```html
<meta name="googlebot" content="noindex">
<meta name="bingbot" content="noindex">
```
**X-Robots-Tag HTTP header** (works for all file types, not just HTML):
```
X-Robots-Tag: noindex, follow
```
### When to Use noindex vs robots.txt vs canonical
| Goal | Use | Reason |
|---|---|---|
| Page should never appear in search results | `noindex` | Definitive removal from index once crawled |
| Page should not be crawled at all (save crawl budget) | `robots.txt Disallow` | Prevents crawling, but page can still be indexed if linked externally |
| Multiple URLs for same content — pick one winner | `canonical` | Consolidates signals to preferred URL |
| Temporarily remove a page from search | GSC URL Removal Tool + noindex | Removal tool is fast (hours) but temporary (6 months); noindex is permanent |
| Permanently removed content | `410 Gone` status code | Tells Google the page is permanently gone; faster than noindex for deindexation |
**Critical distinction**: robots.txt blocks crawling but not indexing. If a page blocked by robots.txt has external backlinks, Google may index it based on anchor text alone (appearing as "No information is available for this page" in search results). To prevent indexation, use noindex — but the page must be crawlable for Google to see the noindex tag.
---
## Index Coverage in Google Search Console
### Status Categories
| Status | Meaning | Action |
|---|---|---|
| **Valid** | Page is indexed and can appear in search results | Monitor for changes. Verify these are pages you want indexed |
| **Valid with warnings** | Page is indexed but has issues that may affect visibility | Review warnings (e.g., indexed but blocked by robots.txt) |
| **Excluded** | Page is not indexed — could be intentional or problematic | Review exclusion reasons below |
| **Error** | Page has issues preventing proper indexing | Fix server errors, redirect errors, or crawl anomalies |
### Common Exclusion Reasons and Fixes
| Exclusion Reason | What It Means | Action |
|---|---|---|
| **Excluded by noindex tag** | Page has meta noindex — intentional if you set it | Verify this is intentional. If not, remove the noindex tag |
| **Blocked by robots.txt** | Robots.txt prevents crawling | If intentional, fine. If the page should be indexed, update robots.txt |
| **Crawled — currently not indexed** | Google crawled but chose not to index (quality/relevance issue) | Improve content quality, add internal links, build backlinks. This is Google saying "I saw it but it is not good enough" |
| **Discovered — currently not indexed** | Google knows the URL exists but has not crawled it yet | Common for new/low-authority pages. Improve internal linking, submit in sitemap, request indexing via URL Inspection |
| **Alternate page with proper canonical** | Page canonicals to another URL — expected behavior | Verify the canonical target is correct and indexed |
| **Duplicate without user-selected canonical** | Google found duplicate content and chose its own canonical | Check if Google's choice matches your intent. If not, strengthen canonical signals (internal links, sitemap, explicit canonical tag) |
| **Duplicate, Google chose different canonical** | You set a canonical but Google disagreed | Review why — content may not be similar enough, or the canonical target may have issues. Strengthen signals on your preferred canonical |
| **Page with redirect** | URL redirects to another page | Expected for redirected URLs. Verify redirect targets are correct |
| **Soft 404** | Page returns 200 but Google thinks it is a 404 (empty or near-empty content) | Either return a proper 404/410 status code, or add substantial content to the page |
| **Not found (404)** | Page returns 404 status | If intentional, the 404 will eventually drop out. If the page should exist, fix the URL or implement a redirect |
---
## Duplicate Content Management
### Types of Duplicate Content
**Exact duplicates**: Identical content accessible at multiple URLs
- `http://` vs `https://`
- `www.` vs non-www
- Trailing slash vs no trailing slash
- URL parameters (tracking, session, sorting)
- `index.html` vs `/`
- Uppercase vs lowercase URLs
**Near duplicates**: Substantially similar content with minor variations
- Product pages differing only by color/size selection
- Location pages with boilerplate content and only city name changed
- Paginated content where intro text repeats across pages
- Print-friendly versions of pages
- Mobile-specific URLs (m.example.com)
**Syndicated duplicates**: Same content on different domains
- Content republished on partner sites
- Press releases on wire services
- Product descriptions provided by manufacturers
### Resolution Strategies
| Duplicate Type | Strategy | Implementation |
|---|---|---|
| Protocol/www/slash variations | 301 redirect to canonical version | Server config (nginx/Apache redirect rules) |
| Parameter variations | Self-referencing canonical on clean URL | Canonical tag on every page template |
| Print-friendly versions | Canonical to main page or noindex | Canonical tag on print pages |
| Near-duplicate location pages | Unique content per page (rule of thumb: at least 60-70% unique) | Invest in location-specific content |
| Syndicated content | Cross-domain canonical to original publisher | Canonical tag on syndication partner pages |
| Paginated content | Self-referencing canonical per page OR view-all canonical | Depends on page count (see site-architecture.md) |
| Same-language duplicates (no regional differences) | Choose one version; canonical to it | Canonical tag; hreflang is for language/region variants, not for identical same-language copies |
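As an illustration of how the exact-duplicate variants listed earlier collapse onto one URL, here is a minimal normalization sketch. The policy choices (HTTPS, non-www, no trailing slash, lowercase paths) and the `TRACKING_PARAMS` list are assumptions made for this example, not recommendations for every site; lowercasing paths in particular is only safe when the server treats paths case-insensitively.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that never change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonicalize(url: str) -> str:
    """Collapse common exact-duplicate variants onto a single URL form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumed policy: non-www
    path = parts.path.lower() or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # /dir/index.html -> /dir/
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")  # assumed policy: no trailing slash
    # Keep only parameters that are not in the tracking list.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    return urlunsplit(("https", host, path, urlencode(kept), ""))


print(canonicalize("http://WWW.Example.com/Shoes/index.html?utm_source=x&color=red"))
# prints https://example.com/shoes?color=red
```

The same function can feed both the redirect map (any URL whose normalized form differs from itself should 301) and the self-referencing canonical tag.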
---
## Index Bloat
### What It Is
Index bloat occurs when a search engine indexes significantly more pages than the site has valuable, unique content. Common symptoms:
- Indexed page count in GSC is 2x+ the number of pages in the sitemap
- Large numbers of thin or duplicate pages appearing in the index
- Important pages competing with low-value pages for rankings
### Common Sources of Index Bloat
| Source | Example | Scale Risk |
|---|---|---|
| Faceted navigation | Every filter combination generates an indexable URL | Extreme (hundreds of thousands to millions) |
| Internal search results | `/search?q=*` pages indexed for every query | High |
| Tag/archive pages | WordPress tag pages with 1-2 posts each | Medium |
| Pagination | Deep paginated pages (page 50+) with no unique value | Medium |
| Calendar/date archives | Empty or near-empty date archive pages | Medium |
| User profile pages | Thin public profile pages on UGC platforms | High |
| Parameter variations | Tracking, session, currency, language parameters | High |
| Staging/development environments | `staging.example.com` indexed by Google | Medium |
| PDF and file duplicates | Same content as HTML pages but in PDF format | Low-Medium |
### Index Bloat Cleanup Process
1. **Audit the index**: Compare GSC indexed page count to your sitemap URL count. A ratio above 1.5:1 suggests bloat
2. **Identify bloat sources**: Use GSC index coverage report, site: search operator, and crawl data to categorize indexed URLs by template type
3. **Prioritize by volume**: Address the largest bloat sources first (faceted navigation before tag pages)
4. **Apply controls**:
   - `noindex, follow` on pages that have link value but should not rank
   - `robots.txt Disallow` on URL patterns that should never be crawled (this does not remove URLs that are already indexed; let `noindex` take effect first, then block crawling)
   - `canonical` to consolidate duplicate/near-duplicate pages
   - `410 Gone` for pages that should be permanently removed
   - `rel="canonical"` to view-all or primary page for paginated series
5. **Clean up sitemaps**: Remove all non-indexable URLs from XML sitemaps
6. **Monitor**: Track indexed page count weekly. Expect a gradual decrease over 4-8 weeks as Google recrawls and deindexes pages
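Steps 1 and 2 of the cleanup process can be sketched in a few lines. The example below computes the indexed-to-sitemap ratio and groups a list of indexed URLs by their first path segment as a rough stand-in for template type. The function names and the sample URLs are hypothetical; real input would come from a GSC export or crawl data.

```python
from collections import Counter
from urllib.parse import urlsplit

BLOAT_THRESHOLD = 1.5  # ratio taken from step 1 above


def bloat_ratio(indexed_count: int, sitemap_count: int) -> float:
    """Ratio of indexed pages to sitemap URLs."""
    return indexed_count / sitemap_count


def group_by_section(urls):
    """Count URLs by first path segment; parameterized URLs get a '?' suffix."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        if parts.query:
            section += "?"
        counts[section] += 1
    return counts.most_common()  # largest bloat source first


indexed_urls = [
    "https://example.com/search?q=shoes",
    "https://example.com/search?q=boots",
    "https://example.com/tag/red",
    "https://example.com/",
]
ratio = bloat_ratio(indexed_count=12000, sitemap_count=6000)
needs_cleanup = ratio > BLOAT_THRESHOLD
largest_sources = group_by_section(indexed_urls)
```

Sorting by volume mirrors step 3: the first entry in `largest_sources` is the section to address first.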
---
## New Content Indexation
### How to Speed Up Indexation of New Pages
**Tier 1: High-impact (do immediately)**
- Add internal links from high-authority, frequently crawled pages (homepage, category pages, popular blog posts)
- Include the new URL in the XML sitemap with an accurate `lastmod` date
- Use Google Search Console URL Inspection tool > "Request Indexing" (limited to ~10-20 requests per day)
**Tier 2: Supplementary (do within 24 hours)**
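- Share the URL on social media (links on public pages can help Google discover the URL, though this is not a guaranteed or direct indexing signal)
- Resubmit the sitemap in Google Search Console and reference it in `robots.txt`. The old ping endpoint (`https://www.google.com/ping?sitemap=https://example.com/sitemap.xml`) was deprecated by Google in 2023 and no longer triggers a crawl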
- If the site uses Google's Indexing API (eligible for job postings and live streaming content), submit through the API (much faster than standard crawling)
**Tier 3: Long-term (ongoing)**
- Maintain a healthy crawl rate by keeping the site fast and error-free
- Build external backlinks to new content
- Publish content consistently (sites with regular publishing schedules get crawled more frequently)
- Keep XML sitemaps accurate (no broken URLs, accurate lastmod dates)
### Indexation Timeline Expectations
| Site Authority | New Page Indexation | Factors |
|---|---|---|
| High (established domain, strong backlink profile) | Minutes to hours | Google crawls frequently; new content discovered quickly through internal links |
| Medium (growing domain, moderate authority) | Hours to days | Regular crawling schedule; sitemap and internal links help |
| Low (new domain, few backlinks) | Days to weeks | Infrequent crawling; URL Inspection and sitemap submission are critical |
| Very low (brand new domain, no backlinks) | Weeks to months | Google may need multiple crawl cycles before indexing; focus on building authority |
### Google Indexing API
The Indexing API provides near-instant indexation (minutes) but is officially supported only for:
- `JobPosting` structured data pages
- `BroadcastEvent` (live streaming) structured data pages
Some SEOs use it for broader content types with mixed results. Google has stated it is only intended for the supported types. For most sites, the URL Inspection tool's "Request Indexing" is the recommended manual indexation method.
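For the supported page types, a notification to the Indexing API is a single authenticated POST. The sketch below only builds the request and does not send it. It assumes an OAuth 2.0 access token with the indexing scope has been obtained elsewhere; the token value and the job URL are placeholders, and `build_notification` is a name invented for this example.

```python
import json
import urllib.request

INDEXING_ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"


def build_notification(url: str, access_token: str, deleted: bool = False) -> urllib.request.Request:
    """Build (but do not send) an Indexing API publish request for one URL."""
    body = {"url": url, "type": "URL_DELETED" if deleted else "URL_UPDATED"}
    return urllib.request.Request(
        INDEXING_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",  # placeholder token
        },
        method="POST",
    )


request = build_notification("https://example.com/jobs/1234", access_token="PLACEHOLDER")
# Sending would be urllib.request.urlopen(request) once a real token is supplied.
```

Keeping request construction separate from sending makes it easy to log or dry-run notifications before spending API quota.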
---
## URL Removal Tools and Processes
### Temporary Removal (Google Search Console)
- **URL Removal Tool**: Temporarily hides a URL from Google Search results for approximately 6 months
- Use for: Emergency removal of sensitive content, outdated pages that need time to fix
- Does NOT permanently remove the page from the index — the page must also have noindex or return 404/410 for permanent removal
### Permanent Removal Methods
| Method | Speed | Permanence | Use Case |
|---|---|---|---|
| `noindex` meta tag | Days to weeks (next crawl) | Permanent while tag is present | Pages that should exist but not rank |
| `410 Gone` status code | Days to weeks | Permanent (Google drops from index) | Content permanently removed with no replacement |
| `404 Not Found` | Weeks to months | Eventually drops from index | Content no longer exists |
| `301 Redirect` | Days to weeks | Old URL replaced by new URL in index | Content moved to a new URL |
| URL Removal Tool + noindex | Hours (removal) + permanent (noindex) | Permanent | Urgent removal of sensitive/harmful content |
### Outdated Content Removal
Google provides a separate "Remove Outdated Content" tool for requesting removal of cached content that no longer reflects the live page. This is used when:
- A page's snippet in search results shows outdated information
- A page has been updated but Google's cache has not refreshed
- A page has been removed from the site but still appears in search results
This tool is available to anyone, not just site owners: `https://search.google.com/search-console/remove-outdated-content`

View File

@@ -0,0 +1,360 @@
# International Technical SEO — Hreflang, URL Structures & Global Site Architecture
A comprehensive reference for building and maintaining websites that target multiple countries, languages, or regions. International SEO is one of the most technically complex areas of SEO — a single hreflang mistake can cause the wrong language version to rank in the wrong country.
---
## URL Structure Strategies
### Three Approaches
| Strategy | Example | Pros | Cons | Best For |
|---|---|---|---|---|
| **ccTLD** (country code top-level domain) | `example.de`, `example.co.uk`, `example.fr` | Strongest geo-targeting signal; users trust local domains; clear separation of properties | Most expensive (multiple domains to register and maintain); link equity does not transfer between domains; requires separate GSC properties; separate SEO authority per domain | Enterprise businesses with strong local presence in each market; brand already well-known in target countries |
| **Subdomain** | `de.example.com`, `uk.example.com`, `fr.example.com` | Easy to set up; can host on different servers/CDNs per region; separate GSC properties possible | Treated as semi-separate sites by Google; link equity from root domain has limited transfer; user trust slightly lower than ccTLD | Companies wanting regional separation with a single domain; sites needing different hosting per region |
| **Subdirectory** | `example.com/de/`, `example.com/uk/`, `example.com/fr/` | All link equity stays on one domain; easiest to maintain; single hosting setup; single GSC property with filtering; strongest domain authority consolidation | Cannot host on different servers per region without complex CDN configuration; less clear geo-targeting signal than ccTLD | Most businesses; the default recommendation unless specific requirements dictate otherwise |
### Decision Framework
**Choose ccTLD when:**
- The brand has separate business entities per country
- Strong local brand identity is essential (e.g., banking, government, legal services)
- Budget supports maintaining separate domains and separate SEO strategies
- Target countries have strong ccTLD preference (e.g., .de in Germany, .co.uk in UK)
**Choose subdomain when:**
- Regional content is managed by separate teams or hosted on different infrastructure
- The business needs separate GSC analytics per region but does not want multiple domains
- Content and user experience differ significantly by region (not just language)
**Choose subdirectory when (default recommendation):**
- SEO authority consolidation is a priority (most cases)
- A single team manages the website globally
- Budget and resources are limited
- The business is entering new markets and does not have established local authority
### Language vs Region in URL Structure
| URL Pattern | Targets | Example |
|---|---|---|
| `/es/` | Spanish language (all regions) | A blog post for all Spanish speakers |
| `/es-mx/` | Spanish language, Mexico specifically | A product page with Mexico-specific pricing, shipping, and legal requirements |
| `/es-es/` | Spanish language, Spain specifically | A product page with Spain-specific pricing and regulations |
Use language-only paths (`/es/`) when content is identical for all speakers of that language. Use language-region paths (`/es-mx/`) when content differs by country (pricing, legal, shipping, cultural references, local phone numbers, currency).
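A router or audit script often needs to tell language-only paths from language-region paths. The following is a minimal sketch; the `Locale` type and `locale_from_path` function are hypothetical names. Note that the pattern accepts any two-letter first segment (so `/tv/` would also match), which means a production version should check the result against the list of locales the site actually serves.

```python
import re
from typing import NamedTuple, Optional

# Matches /xx/ or /xx-yy/ at the start of a path.
LOCALE_PREFIX = re.compile(r"^/([a-z]{2})(?:-([a-z]{2}))?(?:/|$)")


class Locale(NamedTuple):
    language: str
    region: Optional[str]


def locale_from_path(path: str) -> Optional[Locale]:
    """Extract the language and optional region from a URL path prefix."""
    match = LOCALE_PREFIX.match(path.lower())
    if not match:
        return None
    return Locale(match.group(1), match.group(2))


general_spanish = locale_from_path("/es/blog/post")     # Locale("es", None)
mexico_spanish = locale_from_path("/es-mx/productos")   # Locale("es", "mx")
no_locale = locale_from_path("/pricing")                # None
```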
---
## Hreflang Implementation
### Purpose
Hreflang tags tell search engines which language and regional version of a page to show to users in different locations. Without hreflang, Google may show the French version of a page to English-speaking users, or the US version to UK users.
### Syntax
The hreflang attribute uses ISO 639-1 language codes, optionally followed by an ISO 3166-1 Alpha-2 country code or an ISO 15924 script code (as in `zh-hans`):
| Format | Meaning | Example |
|---|---|---|
| `hreflang="en"` | English (any region) | General English content |
| `hreflang="en-us"` | English (United States) | US-specific pricing and content |
| `hreflang="en-gb"` | English (United Kingdom) | UK-specific pricing and content |
| `hreflang="es"` | Spanish (any region) | General Spanish content |
| `hreflang="es-mx"` | Spanish (Mexico) | Mexico-specific content |
| `hreflang="zh-hans"` | Chinese (Simplified) | Simplified Chinese content |
| `hreflang="zh-hant"` | Chinese (Traditional) | Traditional Chinese content |
| `hreflang="x-default"` | Default/fallback | Language selector page or default language version |
### Implementation Methods
**Method 1: HTML Link Elements (in `<head>`)**
```html
<link rel="alternate" hreflang="en-us" href="https://example.com/page">
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/page">
<link rel="alternate" hreflang="de" href="https://example.com/de/page">
<link rel="alternate" hreflang="fr" href="https://example.com/fr/page">
<link rel="alternate" hreflang="x-default" href="https://example.com/page">
```
**Best for**: Sites with fewer than 20 language/region versions per page. Above that, the HTML `<head>` becomes bloated.
**Method 2: HTTP Headers**
```
Link: <https://example.com/page>; rel="alternate"; hreflang="en-us",
<https://example.com/uk/page>; rel="alternate"; hreflang="en-gb",
<https://example.com/de/page>; rel="alternate"; hreflang="de",
<https://example.com/fr/page>; rel="alternate"; hreflang="fr",
<https://example.com/page>; rel="alternate"; hreflang="x-default"
```
**Best for**: Non-HTML resources (PDFs, documents) that need hreflang but cannot contain HTML tags.
**Method 3: XML Sitemap (Recommended for Large Sites)**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/page</loc>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/page"/>
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.com/uk/page"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/page"/>
  </url>
  <url>
    <loc>https://example.com/uk/page</loc>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/page"/>
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.com/uk/page"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/page"/>
  </url>
</urlset>
```
**Best for**: Sites with 20+ language/region versions, or any site where maintaining hreflang in HTML `<head>` is impractical. The sitemap method keeps the HTML clean and is easier to generate programmatically.
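Because every `<url>` entry must repeat the full set of alternates, the sitemap method is usually generated rather than written by hand. Here is a minimal sketch using the standard `xml.etree.ElementTree` module. The input shape (one dict of hreflang code to URL per page) and the function name `build_hreflang_sitemap` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(groups):
    """groups: list of dicts mapping hreflang code -> URL, one dict per page."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for alternates in groups:
        # One <url> entry per distinct URL; x-default usually repeats one of them.
        for page_url in dict.fromkeys(alternates.values()):
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = page_url
            # Every entry carries the complete set, including itself.
            for code, href in alternates.items():
                ET.SubElement(
                    url_el,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": code, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_hreflang_sitemap([
    {
        "en-us": "https://example.com/page",
        "en-gb": "https://example.com/uk/page",
        "x-default": "https://example.com/page",
    }
])
```

Generating all entries from one mapping guarantees that return links and self-references stay in sync, which is where hand-maintained hreflang tends to break.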
### Common Hreflang Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| **Missing self-referencing hreflang** | Google may not process the hreflang set correctly | Every page must include a hreflang tag pointing to itself |
| **Missing x-default** | No fallback for users in unlisted regions/languages | Add x-default pointing to the language selector page or the primary language version |
| **Missing return links (bidirectional)** | If page A hreflang-links to page B, but page B does not link back to page A, Google ignores the annotation | Ensure every page in the hreflang set includes the complete set of all alternate versions |
| **Hreflang pointing to non-canonical URL** | If canonical and hreflang reference different URLs, Google may ignore hreflang | Hreflang href values must match the canonical URL of each page |
| **Hreflang pointing to noindex/blocked page** | Google cannot index a page it cannot access; hreflang signal is lost | All hreflang targets must be indexable, crawlable, and return 200 |
| **Wrong language/region codes** | `hreflang="en-uk"` is not valid: the ISO country code for the United Kingdom is `GB`, so English for the UK is `en-gb`. On its own, `hreflang="uk"` is valid but means Ukrainian, not the UK | Use ISO 639-1 for language and ISO 3166-1 Alpha-2 for country. Validate codes |
| **Inconsistent URL formats** | Mixing `http` and `https`, or `www` and non-www, in hreflang URLs | Use the exact canonical URL format consistently across all hreflang annotations |
| **Using hreflang for duplicate content** | Two pages with the same content in the same language, just different URLs | Hreflang is for different language/region versions. Use canonical for same-language duplicates |
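Three of the mistakes in the table (missing self-reference, missing x-default, missing return links) are mechanical and can be detected from crawl output. The sketch below assumes the annotations have already been extracted into a dict keyed by page URL; that data shape and the function name `find_hreflang_issues` are hypothetical.

```python
def find_hreflang_issues(annotations):
    """annotations: dict mapping page URL -> {hreflang code: target URL}."""
    issues = []
    for page, alternates in annotations.items():
        targets = set(alternates.values())
        if page not in targets:
            issues.append((page, "missing self-reference"))
        if "x-default" not in alternates:
            issues.append((page, "missing x-default"))
        for target in sorted(targets - {page}):
            # A return link exists when the target's own set points back here.
            if page not in annotations.get(target, {}).values():
                issues.append((page, f"no return link from {target}"))
    return issues


crawl = {
    "https://example.com/page": {
        "en-us": "https://example.com/page",
        "de": "https://example.com/de/page",
        "x-default": "https://example.com/page",
    },
    # The German page forgets both x-default and the link back to English.
    "https://example.com/de/page": {
        "de": "https://example.com/de/page",
    },
}
problems = find_hreflang_issues(crawl)
```

A target that was never crawled is reported as having no return link, which is usually the right signal: it is either blocked, missing, or outside the crawl scope.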
### Hreflang Validation
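- **Google Search Console > International Targeting report**: Previously showed hreflang errors detected by Google, but Google deprecated the report in 2022, so rely on crawlers and third-party validators for current checks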
- **Screaming Frog**: Crawls all hreflang annotations and flags missing return links, invalid codes, and conflicts with canonicals
- **Aleyda Solis' Hreflang Tags Generator**: Generates correct hreflang markup from a URL matrix
- **Merkle Hreflang Tag Testing Tool**: Validates hreflang implementation on live pages
---
## Geotargeting
### Google Search Console International Targeting
Google deprecated the International Targeting report in 2022, so a target country can no longer be set manually for subdirectories or subdomains. The legacy workflow (Legacy tools > International Targeting, select a property such as `example.com/de/`, set the target country) is no longer available. ccTLDs are still treated as geo-targeted automatically.
**Note**: The legacy setting targeted a **country**, not a language. For generic domains, signal the intended audience with hreflang annotations, locale-specific URLs, and localized content such as currency, addresses, and phone numbers.
### IP-Based Redirection: Do Not Do This
Redirecting users based on their IP address is a common mistake in international SEO:
- **Problem**: Googlebot primarily crawls from US IP addresses. If you redirect US IPs to the English version, Google may never crawl or index your non-English versions
- **Alternative**: Show a banner suggesting the appropriate language/region version (e.g., "It looks like you are in Germany. View our German site?") without redirecting. Let the user choose
- **Exception**: Using IP detection to set a default language preference on the first visit is acceptable if the user can easily switch and if all versions are accessible to crawlers without IP-based blocking
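The banner alternative can be sketched as a function that returns a suggestion and never redirects. This example swaps IP geolocation for the browser's `Accept-Language` header, which is an assumption made here because it needs no external lookup; the same return-a-suggestion pattern applies if the signal comes from a geo-IP service. The function name `suggest_locale` is hypothetical.

```python
def suggest_locale(accept_language, current, available):
    """Return a locale to offer in a banner, or None. Never redirects."""
    preferences = []
    for item in accept_language.split(","):
        tag, _, params = item.strip().partition(";")
        quality = 1.0
        if params.strip().startswith("q="):
            try:
                quality = float(params.strip()[2:])
            except ValueError:
                quality = 0.0
        if tag:
            preferences.append((quality, tag.lower()))
    # Highest quality first; the sort is stable, so header order breaks ties.
    for _, tag in sorted(preferences, key=lambda pref: -pref[0]):
        # Try the full tag (de-de), then the bare language (de).
        for candidate in (tag, tag.split("-")[0]):
            if candidate in available:
                return None if candidate == current else candidate
    return None


site_locales = {"en", "de", "fr"}
banner = suggest_locale("de-DE,de;q=0.9,en;q=0.8", current="en", available=site_locales)
# banner == "de": show "View our German site?" and leave the user on the page.
```

Since the function only returns a value for the template to render, crawlers and users from any location can still load every version directly.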
---
## Content Localization vs Translation
### Translation
Direct linguistic conversion of content from one language to another. Necessary but insufficient for effective international SEO.
### Localization
Adapting content for the cultural, legal, and market context of the target region:
| Dimension | Translation Only | Full Localization |
|---|---|---|
| **Currency** | Dollar amounts left as-is | Converted to local currency |
| **Units** | Imperial measurements | Metric (or local standard) |
| **Date formats** | MM/DD/YYYY | DD/MM/YYYY or YYYY-MM-DD per locale |
| **Phone numbers** | US format | Local format with country code |
| **Legal references** | US regulations | Local regulations and compliance |
| **Cultural references** | US holidays, sports, idioms | Locally relevant references |
| **Images** | Global stock photos | Locally relevant people, settings, products |
| **Payment methods** | Credit cards | iDEAL (Netherlands), Klarna (Nordics), PIX (Brazil), UPI (India) |
| **Social proof** | Global testimonials | Local customer testimonials and case studies |
| **Keyword targeting** | Translated keywords | Locally researched keywords (search behavior differs) |
### International Keyword Research
Keywords do not translate 1:1 between languages. Differences include:
- **Search volume**: A keyword with 10K monthly searches in English may have a direct translation with only 500 searches in German because Germans use a different phrase
- **Search intent**: The same translated phrase may have different intent in different markets
- **Colloquialisms**: "Sneakers" (US) vs "trainers" (UK) vs "Turnschuhe" (DE) — all mean athletic shoes
- **Brand vs generic**: Some markets search for brand names more than generic terms (or vice versa)
Always conduct keyword research natively in each target language using local search data, not by translating an English keyword list.
---
## International Site Architecture Patterns
### Pattern 1: Single Domain, Subdirectories (Most Common)
```
example.com/ (English, US — default)
example.com/uk/ (English, UK)
example.com/de/ (German)
example.com/fr/ (French)
example.com/es-mx/ (Spanish, Mexico)
```
- Single domain authority
- One hosting setup (use CDN for global performance)
- Hreflang in XML sitemap
- GSC: One property with directory-level filtering
### Pattern 2: Subdomains Per Region
```
www.example.com (English, US — default)
uk.example.com (English, UK)
de.example.com (German)
fr.example.com (French)
mx.example.com (Spanish, Mexico)
```
- Semi-separate authority (subdomains inherit some root domain authority)
- Can host on different servers/CDNs per region for performance
- Separate GSC properties per subdomain
- More complex to manage
### Pattern 3: ccTLDs Per Country
```
example.com (English, US)
example.co.uk (English, UK)
example.de (German)
example.fr (French)
example.com.mx (Spanish, Mexico)
```
- Completely separate domain authority
- Strongest geo-targeting signal
- Most expensive and complex to maintain
- Each domain needs its own link building and SEO strategy
### Pattern 4: Hybrid (ccTLD + Subdirectories for Languages)
```
example.de/ (German, Germany)
example.de/en/ (English version for Germany)
example.co.uk/ (English, UK)
example.com/ (English, US — default)
example.com/es/ (Spanish, general)
example.com/fr/ (French, general)
```
- Used when some markets justify ccTLDs (major markets) but others do not
- Combines strong local signals for key markets with consolidated authority for secondary markets
---
## CDN and Server Location
### Impact on International Performance
- **Server location affects TTFB**: A server in the US serving pages to users in Australia adds 200-300ms of latency per request
- **CDN is essential for international sites**: Cache static assets and HTML at edge locations near users in each target market
- **Key CDN providers**: Cloudflare (broadest free tier), Fastly (best real-time purging), CloudFront (best for AWS infrastructure), Akamai (enterprise)
### CDN Configuration for International Sites
1. **Cache HTML at the edge** — not just static assets. This eliminates TTFB latency for cached pages
2. **Set cache keys to include language/region** — ensure `/de/` pages are cached separately from `/en/` pages
3. **Use the Vary header** if serving different content from the same URL based on Accept-Language: `Vary: Accept-Language` (not recommended — subdirectory approach is cleaner)
4. **Monitor CDN cache hit rates by region** — low hit rates in a region indicate either insufficient edge presence or aggressive cache expiry
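Step 4 reduces to a per-region ratio of cache hits to total requests. A minimal sketch follows, assuming log records have already been reduced to `(region, cache_status)` pairs; the record shape and the 0.6 alert threshold are illustrative assumptions, since log formats and acceptable hit rates vary by CDN and site.

```python
from collections import defaultdict

LOW_HIT_RATE = 0.6  # hypothetical alert threshold


def hit_rate_by_region(records):
    """records: iterable of (region, cache_status) pairs, e.g. ("eu", "HIT")."""
    totals = defaultdict(lambda: [0, 0])  # region -> [hits, requests]
    for region, status in records:
        totals[region][1] += 1
        if status.upper() == "HIT":
            totals[region][0] += 1
    return {region: hits / requests for region, (hits, requests) in totals.items()}


log_records = [
    ("eu", "HIT"), ("eu", "MISS"), ("eu", "HIT"),
    ("apac", "MISS"), ("apac", "HIT"),
]
rates = hit_rate_by_region(log_records)
regions_to_review = sorted(r for r, rate in rates.items() if rate < LOW_HIT_RATE)
```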
---
## Search Engine Market Share by Country
| Country | Primary Search Engine | Market Share | Notes |
|---|---|---|---|
| United States | Google | ~87% | Bing has ~7% (important for B2B due to workplace defaults) |
| United Kingdom | Google | ~92% | |
| Germany | Google | ~90% | |
| France | Google | ~91% | |
| Japan | Google | ~76% | Yahoo Japan (~15%) uses Google's index |
| South Korea | Naver | ~55% | Google ~35%. Naver requires separate optimization |
| China | Baidu | ~65% | Google is blocked. Baidu favors sites with an ICP license (required for hosting in mainland China), Simplified Chinese content, and a .cn domain |
| Russia | Yandex | ~60% | Google ~38%. Yandex has different ranking factors |
| Czech Republic | Seznam | ~25% | Google ~72%. Seznam still significant |
| Brazil | Google | ~96% | |
| India | Google | ~98% | |
### SEO Implications by Search Engine
**Baidu (China):**
- Requires an ICP license to host in China (mandatory)
- Simplified Chinese content is essential
- .cn ccTLD is strongly preferred
- JavaScript rendering is limited — SSR or static HTML required
- Meta keywords tag is widely reported to still carry some weight on Baidu (Google ignores it)
- Baidu Webmaster Tools for submission and monitoring
**Yandex (Russia):**
- Strong behavioral signals (user engagement metrics affect rankings)
- Yandex Webmaster tools for submission and monitoring
- Regional ranking algorithm differs from Google (strong local geo signals)
- Supports its own structured data formats alongside schema.org
- Slower to crawl than Google — sitemaps are critical
**Naver (South Korea):**
- Blog and knowledge content (Naver Blog, Naver Knowledge iN) ranks prominently
- Naver Webmaster Tools for submission
- Korean-language content on Naver's own platforms gets priority
- Consider Naver Blog as a content channel alongside the website
---
## Legal and Compliance Considerations
### GDPR (European Economic Area)
- Cookie consent banner required before setting non-essential cookies
- Privacy policy must be available in local language
- Data processing agreements required with third-party tools
- Right to erasure affects user-generated content
- Analytics must comply (server-side analytics, anonymized IP, consent-based tracking)
### CCPA/CPRA (California, US)
- "Do Not Sell or Share My Personal Information" link required for California users (CPRA added "or Share" to the original CCPA wording)
- Privacy policy must disclose data collection practices
### ePrivacy Directive (EU)
- Applies to electronic communications, cookies, and tracking
- Stricter than GDPR for certain marketing activities (email marketing requires explicit opt-in)
### Country-Specific Requirements
| Country | Requirement | Impact on Website |
|---|---|---|
| Germany | Impressum (legal notice) page mandatory | Add `/impressum` page with company details |
| France | Legal mentions (mentions légales) required | Add `/mentions-legales` page |
| China | ICP license number displayed on homepage | Required for hosting in China |
| Australia | Privacy Act 1988 compliance; Spam Act 2003 for email | Privacy policy and email consent mechanisms |
| Brazil | LGPD (similar to GDPR) | Cookie consent and privacy compliance |
| Canada | CASL for email marketing, PIPEDA for privacy | Explicit consent for marketing emails |
| Japan | APPI (data protection law) | Privacy policy and consent mechanisms |
### Structured Data for Legal Compliance
- Implement Organization schema with `address` per country entity
- Use LocalBusiness schema for physical locations in each country
- Include `areaServed` in Service and Product schema to clarify geographic availability
- Ensure `priceCurrency` on each `Offer` (nested in Product schema) and `priceRange` on LocalBusiness schema match the local currency and pricing
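A per-market Product snippet can be generated from the same catalog data that drives localized pricing. The sketch below builds JSON-LD as a Python dict and serializes it. In schema.org, `priceCurrency` and `areaServed` are properties of `Offer`, so they sit inside `offers`. The function name and sample values are hypothetical.

```python
import json


def product_jsonld(name: str, price: float, currency: str, country_code: str) -> str:
    """Build Product JSON-LD with a localized Offer for one market."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,    # ISO 4217 code, e.g. "EUR"
            "areaServed": country_code,   # ISO 3166-1 alpha-2, e.g. "DE"
            "availability": "https://schema.org/InStock",
        },
    }
    return json.dumps(data, indent=2)


german_offer = product_jsonld("Blue Shirt", 29.9, "EUR", "DE")
# Embed the string in a <script type="application/ld+json"> tag on the /de/ page.
```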

View File

@@ -0,0 +1,293 @@
# Site Architecture — URL Structure, Internal Linking & Information Architecture
A comprehensive reference for designing and optimizing site architecture for search engines and users. Site architecture determines how link equity flows through a site, how efficiently crawlers discover content, and how easily users find what they need. It is one of the highest-leverage technical SEO levers for large sites.
---
## URL Structure
### Best Practices
**Readability and keywords:**
- URLs should be human-readable and include the primary target keyword: `/blog/technical-seo-guide` not `/blog/post?id=4827`
- Use hyphens (`-`) to separate words, not underscores (`_`) or spaces (`%20`). Google treats hyphens as word separators but underscores as joiners
- Keep URLs concise — under 100 characters when possible (no hard limit, but shorter URLs are easier to share and display in SERPs)
- Use lowercase consistently. URLs are case-sensitive on most servers; mixed case creates duplicate content risk
**Structure patterns:**
| Pattern | Example | Best For |
|---|---|---|
| Flat | `/product-name` | Small sites, individual landing pages |
| Category/page | `/category/page-name` | Blogs, medium sites, content hubs |
| Hierarchical | `/category/subcategory/page-name` | Large sites, ecommerce with clear taxonomy |
| Date-based | `/2025/11/article-name` | News sites (shows freshness), but limits future reorganization |
**What to avoid in URLs:**
- Session IDs: `/page?sessionid=abc123` — use cookies instead
- Excessive parameters: `/page?color=red&size=m&sort=price&page=2&ref=homepage`
- Unnecessary depth: `/store/products/clothing/mens/shirts/casual/blue-shirt` (too deep)
- Stop words in excess: `/the-complete-guide-to-the-best-ways-to-do-seo` — trim to `/complete-seo-guide`
- Changing URLs after publication — every URL change requires a 301 redirect and risks ranking loss
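Several of the rules above (lowercase, hyphens instead of underscores or spaces, trimmed stop words, concise length) can be applied when a slug is first generated, which avoids later URL changes. This is a minimal sketch; the `STOP_WORDS` set and the six-word cap are illustrative assumptions, and the ASCII folding step simply drops characters that have no ASCII equivalent.

```python
import re
import unicodedata

# Illustrative stop-word list; extend or replace per language.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "for", "in", "on"}


def slugify(title: str, max_words: int = 6) -> str:
    """Turn a page title into a lowercase, hyphen-separated URL slug."""
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    # Fall back to all words if the title consists only of stop words.
    kept = [w for w in words if w not in STOP_WORDS] or words
    return "-".join(kept[:max_words])


print(slugify("The Complete Guide to the Best Ways to Do SEO"))
# prints complete-guide-best-ways-do-seo
```

Automated trimming is cruder than an editor's judgment (the hand-trimmed `/complete-seo-guide` reads better), so treat the output as a default that can be overridden before publication.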
### Trailing Slash Consistency
Choose one format and enforce it sitewide:
- `example.com/page/` (with trailing slash)
- `example.com/page` (without trailing slash)
Google treats these as different URLs. If both resolve with 200 status, it creates duplicate content. Enforce one format with a server-side redirect (301) from the non-canonical format.
Most CMSs have a setting for this. For custom implementations, handle it in the web server config (nginx, Apache) or application router.
---
## Information Architecture
### Principles
1. **Every important page should be reachable within 3 clicks from the homepage.** Pages deeper than 3 levels receive less crawl frequency and less PageRank. This does not mean a flat URL structure — it means internal links create short paths.
2. **Group related content together.** Search engines use content proximity (pages linking to each other, sharing URL path structure, and covering related topics) to understand topical authority.
3. **Build topical authority through content clusters.** A pillar page targeting a broad topic links to cluster pages targeting specific subtopics. All cluster pages link back to the pillar. This creates a self-reinforcing authority signal.
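The three-click principle is a shortest-path question over the internal link graph, so a breadth-first search from the homepage gives each page's click depth. The sketch below assumes the link graph has already been extracted from a crawl into a dict of page to outgoing links; the sample graph is hypothetical.

```python
from collections import deque


def click_depths(links, start="/"):
    """Breadth-first search: minimum number of clicks from `start` to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


site_links = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/a"],
    "/blog/a": ["/blog/b"],
    "/blog/b": ["/blog/c"],
    "/orphan": ["/"],  # links out, but nothing links to it
}
depths = click_depths(site_links)
too_deep = [page for page, depth in depths.items() if depth > 3]
unreachable = [page for page in site_links if page not in depths]
```

Pages in `too_deep` need a shorter internal link path; pages in `unreachable` are orphans that crawlers can only find through sitemaps or external links.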
### Topic Cluster Model
```
                     [Pillar Page]
                 "Technical SEO Guide"
             /        |        |        |        \
            /         |        |        |         \
[Cluster]   [Cluster]   [Cluster]   [Cluster]   [Cluster]
"Core Web   "Crawl      "Site       "Schema     "Mobile-
 Vitals"     Budget"     Migration"  Markup"     First"
```
**Pillar page**: Comprehensive overview (2,000-5,000 words) targeting the broad head term. Links to every cluster page.
**Cluster pages**: Deep-dive articles (1,000-3,000 words) targeting specific long-tail subtopics. Each links back to the pillar and cross-links to related cluster pages.
**Result**: Search engines understand the site is an authority on the pillar topic because of the depth and interconnection of coverage.
### Content Siloing
Content siloing organizes site content into distinct thematic sections with controlled linking between them. The goal is to concentrate topical relevance within each silo.
**Hard silo**: URL structure mirrors the silo: `/technical-seo/core-web-vitals`, `/technical-seo/crawlability`. Internal links stay within the silo. Cross-silo links go through the top-level silo pages.
**Soft silo**: URL structure may be flat, but internal linking creates virtual silos. Contextual links connect related content within the same topic area.
**When to silo:**
- Sites covering multiple distinct topics (a marketing agency with SEO, PPC, social, email sections)
- Ecommerce sites with distinct product categories
- Publishers covering multiple beats
**When siloing is unnecessary:**
- Small sites (under 50 pages) where all content is closely related
- Single-topic niche sites where everything is one silo
---
## Internal Linking Strategy
### Why Internal Links Matter
1. **Crawl discovery**: Googlebot follows internal links to discover pages. Pages with more internal links tend to be crawled more frequently
2. **PageRank distribution**: Internal links pass PageRank (link equity) from one page to another. Strategic internal linking concentrates authority on priority pages
3. **Topical relevance signals**: The anchor text and surrounding context of internal links help search engines understand what the linked page is about
4. **User navigation**: Well-placed internal links can reduce bounce rate and increase pages per session
### Types of Internal Links
| Type | Description | SEO Value | Example |
|---|---|---|---|
| **Navigation links** | Header and main menus | Medium (sitewide dilution) | Main menu linking to category pages |
| **Contextual links** | In-content links within body copy | High (relevant context + anchor text) | Blog post linking to related article |
| **Breadcrumb links** | Hierarchical path from homepage to current page | Medium-High (reinforces hierarchy) | Home > Category > Subcategory > Page |
| **Related content links** | Algorithmically or manually curated related pages | Medium | "Related articles" section below blog posts |
| **Footer links** | Links in the site footer | Low-Medium (sitewide, often discounted) | Links to important pages not in the main nav |
| **Sidebar links** | Links in sidebar widgets | Low-Medium | Category lists, popular posts, recent posts |
### Anchor Text Optimization
- **Use descriptive, keyword-relevant anchor text**: "technical SEO audit checklist" not "click here" or "read more"
- **Vary anchor text naturally**: Do not use the exact same anchor text for every link to a page. Use variations, partial matches, and natural phrases
- **Avoid over-optimization**: Do not stuff exact-match keywords into every internal link anchor. Google's algorithms detect this pattern
- **Context matters**: The surrounding paragraph provides additional relevance signals beyond just the anchor text
- **Avoid generic anchors** for important links: "Learn more," "Click here," and "Read this" waste an anchor text opportunity
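To apply these guidelines to a crawl export, the anchor texts pointing at one page can be summarized in a few lines. This is a minimal sketch: the `GENERIC_ANCHORS` set and the `anchor_report` helper are assumptions for illustration, and what counts as an unhealthy share is left to the auditor.

```python
from collections import Counter

# Assumed list of generic phrases; extend it for the site and its language.
GENERIC_ANCHORS = {"click here", "read more", "learn more", "read this", "here"}


def anchor_report(anchors: list[str]) -> dict:
    """Summarize the anchor texts of all internal links pointing at one page."""
    normalized = [a.strip().lower() for a in anchors if a.strip()]
    counts = Counter(normalized)
    total = len(normalized)
    generic = sum(n for text, n in counts.items() if text in GENERIC_ANCHORS)
    top_text, top_count = counts.most_common(1)[0] if counts else ("", 0)
    return {
        "total": total,
        # Share of links that waste the anchor on a generic phrase.
        "generic_share": generic / total if total else 0.0,
        # A very high share for one exact anchor suggests over-optimization.
        "top_anchor": top_text,
        "top_anchor_share": top_count / total if total else 0.0,
    }
```

A high `generic_share` points to wasted anchors, while a `top_anchor_share` close to 1.0 suggests the same exact-match phrase is being repeated everywhere.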
### Internal Link Audit Methodology
1. **Crawl the site** to build a complete link graph (Screaming Frog, Sitebulb, or custom crawler)
2. **Identify pages with low internal link counts**: Important pages (target keyword pages, revenue pages) with fewer than 5 internal links pointing to them need more
3. **Identify pages with excessive internal links**: Pages linking to 200+ URLs dilute PageRank per link. Consolidate or prioritize
4. **Find orphan pages**: Pages with zero internal links (see crawlability.md for detection method)
5. **Analyze link depth**: Map click depth from homepage. Flag critical pages deeper than 3 clicks
6. **Check for broken internal links**: 404s from internal links waste crawl budget and PageRank
7. **Review anchor text distribution**: Ensure important pages receive keyword-relevant anchor text from multiple sources
8. **Visualize link flow**: Use a site architecture visualization to identify PageRank bottlenecks and silos
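Several of the audit steps above (inbound link counts, orphan pages, click depth) reduce to simple graph operations once the crawl is exported. The sketch below assumes a hypothetical `links` mapping from each known URL to the internal URLs it links to; URLs known only from the sitemap should be included as keys so that orphans can show up at all.

```python
from collections import deque


def audit_link_graph(links: dict[str, set[str]], home: str,
                     min_inlinks: int = 5, max_depth: int = 3) -> dict:
    """`links` maps each known URL to the set of internal URLs it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}

    # Count inbound internal links, ignoring self-links.
    inlinks = {page: 0 for page in pages}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                inlinks[target] += 1

    # Breadth-first search from the homepage gives the click depth.
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)

    return {
        "orphans": sorted(p for p in pages if inlinks[p] == 0 and p != home),
        "low_inlinks": sorted(p for p in pages if 0 < inlinks[p] < min_inlinks),
        "too_deep": sorted(p for p, d in depth.items() if d > max_depth),
        "unreachable": sorted(pages - set(depth)),
    }
```

The defaults mirror the thresholds used in this section (fewer than 5 inbound links, deeper than 3 clicks); the report covers every page, so filter it down to priority pages before acting on it.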
### Link Equity Distribution Principles
- **The homepage usually has the most PageRank** (it typically receives the most external backlinks). Links from the homepage are usually the most valuable internal links
- **PageRank flows through links and is divided among all links on a page**. All else being equal, a page with 10 outgoing links passes more equity per link than a page with 100 outgoing links
- **Deep pages need intentional linking**: A blog post 5 clicks from the homepage receives minimal PageRank unless linked from higher-authority pages
- **"Link to your money pages"**: Product pages, service pages, and high-converting landing pages should receive internal links from high-authority content (blog posts with backlinks, homepage, category pages)
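The division of equity among outgoing links can be illustrated with the textbook PageRank iteration. This is a simplified model only: it is not Google's production algorithm, the 0.85 damping factor is the conventional textbook value, and it ignores external backlinks entirely.

```python
def internal_pagerank(links: dict[str, list[str]], damping: float = 0.85,
                      iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank over an internal link graph (textbook model only)."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's score is split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

In this model a page with 10 outgoing links passes one tenth of its transferable score through each link, while a page with 100 passes one hundredth. Running the function on a real link graph gives a rough relative ranking that helps spot priority pages receiving little internal equity.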
---
## Pagination Handling
### Current Best Practices (Post rel=prev/next Deprecation)
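Google announced in 2019 that it no longer uses `rel="prev"` and `rel="next"` as indexing signals. Other search engines may still read them, so existing markup is harmless, but it should not be the only pagination signal. Current approaches: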
**Option 1: View-All Page (Preferred for SEO)**
- Create a single page with all content (`/products/shoes?view=all`)
- Set the view-all page as the canonical for all paginated component pages
- Best for: Product listings under 200 items, article lists
- Caveat: Page must load reasonably fast. If 500 products cause a 10-second load time, this is not viable
**Option 2: Self-Canonicalizing Paginated Pages**
- Each paginated page (`/shoes?page=1`, `/shoes?page=2`) has a self-referencing canonical
- Google indexes each page independently
- Best for: Large catalogs where a view-all page is not feasible
- Ensure each page has unique, relevant content (not just the same intro text with different products)
**Option 3: Load More / Infinite Scroll (with SEO Considerations)**
- JavaScript-powered "Load more" button or infinite scroll
- Critical: Implement as progressive enhancement with crawlable paginated URLs underneath
- Google recommends: `<a href="/shoes?page=2">` links in the HTML that JavaScript enhances into "Load more" functionality
- Without crawlable fallback URLs, Googlebot cannot access content beyond the initial load
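Options 2 and 3 can share the same server-side markup: a self-referencing canonical plus a plain link to the next page that JavaScript may later enhance. The sketch below is illustrative; the helper names are hypothetical, and serving page 1 at the bare URL (instead of `?page=1`) is an assumption made here to avoid a duplicate of the first page.

```python
from urllib.parse import urlencode


def paginated_url(base: str, page: int) -> str:
    """Assumption: page 1 lives at the bare URL, later pages use ?page=N."""
    return base if page <= 1 else f"{base}?{urlencode({'page': page})}"


def pagination_markup(base: str, page: int, last_page: int) -> dict[str, str]:
    """Return the canonical tag and a crawlable 'load more' link for one page."""
    canonical = f'<link rel="canonical" href="{paginated_url(base, page)}">'
    if page < last_page:
        # A plain <a href> that JavaScript can enhance into a "Load more" button.
        more = f'<a href="{paginated_url(base, page + 1)}">Load more</a>'
    else:
        more = ""  # last page: nothing further to load
    return {"canonical": canonical, "load_more": more}
```

Because the next-page link is present in the initial HTML, a crawler that does not execute the click handler can still reach every page in the sequence.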
---
## Faceted Navigation (Ecommerce)
### The Challenge
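Ecommerce filtering (color, size, price range, brand, rating) generates enormous URL combinations. A category with 8 filter types and 5 options each creates 5^8 = 390,625 possible URL combinations if exactly one option is selected per filter. If each filter can also be left unset, the count rises to 6^8 = 1,679,616, and multi-select or parameter reordering pushes it higher still.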
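The arithmetic is easy to reproduce for any category. A minimal sketch, assuming single-select filters; the function name and the `allow_unset` flag are illustrative:

```python
def facet_url_count(filter_types: int, options_per_filter: int,
                    allow_unset: bool = False) -> int:
    """Count filter combinations for single-select filters.

    With allow_unset=True each filter gains one extra state ("not applied").
    """
    choices = options_per_filter + (1 if allow_unset else 0)
    return choices ** filter_types


print(facet_url_count(8, 5))                    # 390625
print(facet_url_count(8, 5, allow_unset=True))  # 1679616
```

Multi-select filters and parameter ordering are not modeled here, so real sites can exceed these numbers.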
### Strategy Matrix
| Facet Type | Example | Indexable? | Handling |
|---|---|---|---|
| **High-demand facets** | Color, brand, material for fashion | Yes — if search volume exists for "red running shoes" | Unique title/description, self-referencing canonical, include in sitemap |
| **Sorting parameters** | Sort by price, popularity, newest | No: same products, different order | Canonical to base category, or noindex. Do not also block in robots.txt, because tags on a blocked URL are never read |
| **Pagination within facets** | Page 2 of red shoes | Depends on depth | Pages 1-3 may be indexable; noindex deeper pages rather than canonicalizing them to page 1, since their content differs |
| **Multi-select facets** | Red + blue + size 10 | No: too specific, no search demand | Canonical to broadest applicable facet, or robots.txt block (one or the other, since a blocked URL's canonical is never seen) |
| **Price range** | $50-$100 | Rarely | Usually canonical to base category unless "cheap [product]" has volume |
| **Rating filters** | 4 stars and up | No | Canonical to base category |
### Implementation Approaches
1. **AJAX-based filtering (Best)**: Filters update content via JavaScript without generating new URLs. Use History API to update the URL for shareability without creating crawlable parameter URLs. Googlebot sees only the base category URL.
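2. **Canonical + robots.txt (Common)**: Allow parameter URLs to exist but canonical low-value combinations to the base URL. Block high-volume parameter patterns in robots.txt to conserve crawl budget. Apply only one mechanism per URL pattern: Googlebot cannot read the canonical tag on a URL it is blocked from crawling.
3. **Noindex, follow (Fallback)**: Apply noindex to parameter pages that should not rank but contain links worth following. Use when canonical signals are insufficient. These pages must remain crawlable (not blocked in robots.txt), otherwise the noindex is never seen.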
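The strategy matrix can be encoded as a small rule function so that templates apply it consistently. This is a sketch under assumptions: the parameter names in the three sets are hypothetical, pagination is deliberately ignored (it is handled separately), and the policy labels are illustrative strings rather than any standard vocabulary.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; map them to the site's real filter parameters.
INDEXABLE_FACETS = {"color", "brand", "material"}
SORT_PARAMS = {"sort", "order"}
PAGINATION_PARAMS = {"page"}


def facet_policy(url: str) -> str:
    """Return 'index', 'canonical-to-base', or 'canonical-to-broader-facet'."""
    params = parse_qs(urlsplit(url).query)
    has_sort = any(name in SORT_PARAMS for name in params)
    facets = {k: v for k, v in params.items()
              if k not in SORT_PARAMS and k not in PAGINATION_PARAMS}

    if not facets:
        # Base category: a sort order alone never deserves its own index entry.
        return "canonical-to-base" if has_sort else "index"
    if len(facets) > 1 or any(len(values) > 1 for values in facets.values()):
        # Multi-select or combined filters: too specific to index.
        return "canonical-to-broader-facet"
    (name,) = facets
    if name not in INDEXABLE_FACETS:
        return "canonical-to-base"  # e.g. price range or rating filters
    # A sorted view of an indexable facet points at the unsorted facet URL.
    return "canonical-to-broader-facet" if has_sort else "index"
```

Whether a facet belongs in `INDEXABLE_FACETS` should be decided from search demand data, as described in the matrix, not hard-coded once and forgotten.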
---
## Breadcrumb Implementation
### SEO Benefits
- Reinforces site hierarchy for search engines
- Provides keyword-rich internal links to parent pages
- Enables breadcrumb rich results in Google SERPs (which can increase click-through rate)
- Helps users understand their location within the site
### HTML Implementation
```html
<nav aria-label="Breadcrumb">
  <ol itemscope itemtype="https://schema.org/BreadcrumbList">
    <li itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
      <a itemprop="item" href="/"><span itemprop="name">Home</span></a>
      <meta itemprop="position" content="1">
    </li>
    <li itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
      <a itemprop="item" href="/technical-seo"><span itemprop="name">Technical SEO</span></a>
      <meta itemprop="position" content="2">
    </li>
    <li itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
      <span itemprop="name">Core Web Vitals</span>
      <meta itemprop="position" content="3">
    </li>
  </ol>
</nav>
```
### JSON-LD Alternative (Preferred)
```json
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "name": "Home", "item": "https://example.com/"},
    {"@type": "ListItem", "position": 2, "name": "Technical SEO", "item": "https://example.com/technical-seo"},
    {"@type": "ListItem", "position": 3, "name": "Core Web Vitals"}
  ]
}
```
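On template-driven sites the JSON-LD is usually generated rather than written by hand. A minimal sketch of such a generator follows; the function name `breadcrumb_jsonld` and the `(name, path)` trail format are assumptions for illustration, and the output still needs to be placed inside a `<script type="application/ld+json">` element by the template.

```python
import json


def breadcrumb_jsonld(trail: list[tuple[str, str]], origin: str) -> str:
    """`trail` is a list of (name, path) pairs from the homepage to the current page."""
    items = []
    for position, (name, path) in enumerate(trail, start=1):
        item = {"@type": "ListItem", "position": position, "name": name}
        # The last item is the current page, so its URL is omitted.
        if position < len(trail):
            item["item"] = origin.rstrip("/") + path
        items.append(item)
    data = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }
    return json.dumps(data, indent=2)


print(breadcrumb_jsonld(
    [("Home", "/"), ("Technical SEO", "/technical-seo"),
     ("Core Web Vitals", "/technical-seo/core-web-vitals")],
    "https://example.com",
))
```

Generating positions from `enumerate` keeps the numbering contiguous even when a level is added or removed from the trail.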
### Best Practices
- Always start with "Home" as the first breadcrumb
- The last item (current page) should not be a link
- Use descriptive names (not URL slugs) — "Core Web Vitals Guide" not "core-web-vitals"
- For products in multiple categories, choose the primary category for the breadcrumb path (match canonical)
- Implement BreadcrumbList schema for rich result eligibility
---
## Site Migration Planning
### Pre-Migration Checklist
- [ ] Complete URL mapping: old URL to new URL for every page with organic traffic or backlinks
- [ ] Set up 301 redirects for every mapped URL (test before launch)
- [ ] Verify new site has no robots.txt blocking or noindex tags from development
- [ ] Update all internal links to point to new URLs (avoid relying solely on redirects for internal navigation)
- [ ] Update canonical tags to reference new URLs
- [ ] Update XML sitemaps to reference new URLs
- [ ] Update hreflang tags (if international site)
- [ ] Update structured data (URLs in schema markup)
- [ ] Benchmark current performance: organic traffic by page, indexed page count, crawl stats, rankings for target keywords, Core Web Vitals
- [ ] Prepare the GSC Change of Address request (for domain migrations): verify both properties in advance and submit the request once the redirects are live
- [ ] Set up monitoring: daily organic traffic checks, hourly crawl error monitoring for the first week
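Before launch, the URL mapping itself can be checked for redirect chains and loops without touching a server. The sketch below treats the redirect map as a plain dictionary of old URL to new URL; the function name and report format are illustrative assumptions.

```python
def resolve_redirects(redirect_map: dict[str, str]) -> dict[str, dict]:
    """Follow each old URL through the map and report hops, final URL, and loops."""
    report = {}
    for start in redirect_map:
        seen = [start]
        current = start
        loop = False
        # Terminates because each step either visits a new URL or detects a loop.
        while current in redirect_map:
            current = redirect_map[current]
            if current in seen:
                loop = True
                break
            seen.append(current)
        report[start] = {
            "final": None if loop else current,
            "hops": len(seen) - 1,
            "loop": loop,
        }
    return report


redirects = {"/old-a": "/mid", "/mid": "/new-a", "/x": "/y", "/y": "/x"}
report = resolve_redirects(redirects)
chains = [url for url, info in report.items() if info["hops"] > 1]
loops = [url for url, info in report.items() if info["loop"]]
print(chains)  # ['/old-a']
print(loops)   # ['/x', '/y']
```

Any entry listed in `chains` can be flattened by pointing the old URL directly at its `final` target; entries in `loops` need a manual fix in the mapping.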
### Migration Day
- [ ] Deploy redirects
- [ ] Verify redirects work (test a sample of 50+ URLs across different templates)
- [ ] Submit new sitemap to GSC
- [ ] Request indexing for the most important pages via URL Inspection tool
- [ ] Monitor Googlebot activity in server logs for the first 24 hours (GSC Crawl Stats is not real-time), then follow up with the GSC Crawl Stats report
- [ ] Check for spike in crawl errors in GSC
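The redirect verification step can be scripted with the standard library. This is a minimal sketch, not a full crawler: `fetch_head` sends a single HEAD request without following redirects, `check_redirects` compares the response against the expected mapping, and both names are hypothetical. It assumes the server returns absolute URLs in the `Location` header and answers HEAD requests the same way as GET.

```python
import http.client
from typing import Callable, Optional
from urllib.parse import urlsplit


def fetch_head(url: str) -> tuple[int, Optional[str]]:
    """Send one HEAD request without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def check_redirects(expected: dict[str, str],
                    fetch: Callable[[str], tuple[int, Optional[str]]] = fetch_head
                    ) -> list[str]:
    """Describe every old URL that is not a single 301 to its mapped target."""
    problems = []
    for old_url, new_url in expected.items():
        status, location = fetch(old_url)
        if status != 301:
            problems.append(f"{old_url}: expected 301, got {status}")
        elif location != new_url:
            problems.append(f"{old_url}: redirects to {location}, expected {new_url}")
    return problems
```

Passing a stub as `fetch` makes the check testable offline; in production the default `fetch_head` is used and an empty result means every sampled URL redirected as mapped.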
### Post-Migration Monitoring
| Timeframe | Check | Expected |
|---|---|---|
| Day 1-3 | Crawl errors in GSC | Spike is normal; should decrease rapidly |
| Week 1 | Index coverage | Old URLs transitioning to new URLs |
| Week 1 | Organic traffic | A temporary dip (roughly 10-30%) is common even for well-executed migrations |
| Week 2-4 | Rankings for target keywords | Should begin recovering to pre-migration levels |
| Month 1-2 | Organic traffic recovery | Should reach 90-100% of pre-migration levels |
| Month 3 | Full audit | Comparable or improved performance across all metrics |
| Month 6-12 | Redirect maintenance | Keep old domain and redirects active for at least 12 months |
### When Rankings Do Not Recover
If organic traffic has not recovered to 90% of pre-migration levels within 8 weeks:
1. Check for redirect errors (broken redirects, redirect chains, loops)
2. Verify no noindex or robots.txt blocks on the new site
3. Check canonical tags are not pointing to old URLs
4. Verify internal links point directly to the new URLs (not just relying on redirects)
5. Check for content parity issues (missing content on new pages)
6. Review GSC for manual actions or security issues
7. Audit Core Web Vitals on the new site (performance regression can suppress rankings)