16 KiB
Indexation — Canonicals, Meta Robots, Duplicate Content & Index Management
A comprehensive reference for controlling which pages search engines index, resolving duplicate content, managing index coverage, and accelerating indexation of new content. Indexation management ensures that search engine indexes contain only the pages you want to rank, with no duplication, no bloat, and no wasted authority.
Canonical Tags
Purpose
The rel="canonical" link element tells search engines which URL is the preferred (canonical) version of a page when multiple URLs serve the same or substantially similar content. It consolidates ranking signals (backlinks, PageRank) onto the canonical URL.
Implementation
HTML link element (most common):
<link rel="canonical" href="https://example.com/preferred-page">
HTTP header (for non-HTML resources like PDFs):
Link: <https://example.com/preferred-page>; rel="canonical"
Canonical Tag Rules
- Self-referencing canonicals: Every indexable page should have a canonical tag pointing to itself. This prevents issues from URL parameters, tracking codes, or session IDs creating duplicate URLs that Google discovers through external links
- Canonical must be an absolute URL:
href="https://example.com/page"nothref="/page" - Canonical must point to a 200-status page: Do not canonical to a 301, 404, or 5xx page
- Canonical must match the protocol: HTTPS pages should canonical to HTTPS URLs
- Canonical is a hint, not a directive: Google may choose to ignore the canonical if other signals contradict it (e.g., internal links primarily point to a different URL)
- One canonical per page: Multiple canonical tags on the same page cause Google to ignore all of them
Common Canonical Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Canonical points to a noindex page | Conflicting signals — Google may ignore both | Remove noindex from the canonical target, or change the canonical to an indexable page |
| Canonical points to a 404/410 page | Canonical signal is ignored; page may be indexed independently | Update canonical to a live, relevant page |
| Canonical to a redirected URL | Google may follow the redirect and use the final destination, but this adds unnecessary ambiguity | Point canonical directly to the final destination URL |
| Canonical chain (A canonicals to B, B canonicals to C) | Google may resolve correctly but processing delays occur; long chains may be abandoned | Point A directly to C |
| Relative URLs in canonical | Parsed relative to current URL — may resolve incorrectly across templates | Always use absolute URLs |
| Canonical between very different pages | Google ignores the canonical because content does not match | Only canonical between pages with substantially similar content |
| Missing self-referencing canonical | Parameter variations and tracking URLs may be indexed as duplicates | Add self-referencing canonical to every indexable page template |
Canonical in the <body> instead of <head> |
Google may not process it | Ensure canonical tag is within the <head> element |
Cross-Domain Canonicals
Used when the same content exists on multiple domains (syndication, multi-brand, regional sites):
<!-- On syndication-partner.com -->
<link rel="canonical" href="https://original-publisher.com/article">
Cross-domain canonicals are a stronger hint than same-domain, and Google generally respects them when the content is truly identical. The canonicalized domain passes ranking signals to the canonical domain.
Meta Robots Directives
Available Directives
| Directive | Meaning |
|---|---|
index |
Allow this page to be indexed (default behavior; rarely needs to be explicit) |
noindex |
Do not show this page in search results. Strongest indexation control |
follow |
Follow links on this page (default behavior) |
nofollow |
Do not follow any links on this page for ranking purposes |
noarchive |
Do not show a cached copy of this page in search results |
nosnippet |
Do not show a text snippet or video preview in search results |
max-snippet:[n] |
Limit text snippet to n characters |
max-image-preview:[size] |
Limit image preview size: none, standard, large |
max-video-preview:[n] |
Limit video preview to n seconds |
notranslate |
Do not offer translation of this page in search results |
noimageindex |
Do not index images on this page |
unavailable_after:[date] |
Do not show this page after the specified date |
Implementation
HTML meta tag:
<meta name="robots" content="noindex, follow">
Specific crawler:
<meta name="googlebot" content="noindex">
<meta name="bingbot" content="noindex">
X-Robots-Tag HTTP header (works for all file types, not just HTML):
X-Robots-Tag: noindex, follow
When to Use noindex vs robots.txt vs canonical
| Goal | Use | Reason |
|---|---|---|
| Page should never appear in search results | noindex |
Definitive removal from index once crawled |
| Page should not be crawled at all (save crawl budget) | robots.txt Disallow |
Prevents crawling, but page can still be indexed if linked externally |
| Multiple URLs for same content — pick one winner | canonical |
Consolidates signals to preferred URL |
| Temporarily remove a page from search | GSC URL Removal Tool + noindex | Removal tool is fast (hours) but temporary (6 months); noindex is permanent |
| Permanently removed content | 410 Gone status code |
Tells Google the page is permanently gone; faster than noindex for deindexation |
Critical distinction: robots.txt blocks crawling but not indexing. If a page blocked by robots.txt has external backlinks, Google may index it based on anchor text alone (appearing as "No information is available for this page" in search results). To prevent indexation, use noindex — but the page must be crawlable for Google to see the noindex tag.
Index Coverage in Google Search Console
Status Categories
| Status | Meaning | Action |
|---|---|---|
| Valid | Page is indexed and can appear in search results | Monitor for changes. Verify these are pages you want indexed |
| Valid with warnings | Page is indexed but has issues that may affect visibility | Review warnings (e.g., indexed but blocked by robots.txt) |
| Excluded | Page is not indexed — could be intentional or problematic | Review exclusion reasons below |
| Error | Page has issues preventing proper indexing | Fix server errors, redirect errors, or crawl anomalies |
Common Exclusion Reasons and Fixes
| Exclusion Reason | What It Means | Action |
|---|---|---|
| Excluded by noindex tag | Page has meta noindex — intentional if you set it | Verify this is intentional. If not, remove the noindex tag |
| Blocked by robots.txt | Robots.txt prevents crawling | If intentional, fine. If the page should be indexed, update robots.txt |
| Crawled — currently not indexed | Google crawled but chose not to index (quality/relevance issue) | Improve content quality, add internal links, build backlinks. This is Google saying "I saw it but it is not good enough" |
| Discovered — currently not indexed | Google knows the URL exists but has not crawled it yet | Common for new/low-authority pages. Improve internal linking, submit in sitemap, request indexing via URL Inspection |
| Alternate page with proper canonical | Page canonicals to another URL — expected behavior | Verify the canonical target is correct and indexed |
| Duplicate without user-selected canonical | Google found duplicate content and chose its own canonical | Check if Google's choice matches your intent. If not, strengthen canonical signals (internal links, sitemap, explicit canonical tag) |
| Duplicate, Google chose different canonical | You set a canonical but Google disagreed | Review why — content may not be similar enough, or the canonical target may have issues. Strengthen signals on your preferred canonical |
| Page with redirect | URL redirects to another page | Expected for redirected URLs. Verify redirect targets are correct |
| Soft 404 | Page returns 200 but Google thinks it is a 404 (empty or near-empty content) | Either return a proper 404/410 status code, or add substantial content to the page |
| Not found (404) | Page returns 404 status | If intentional, the 404 will eventually drop out. If the page should exist, fix the URL or implement a redirect |
Duplicate Content Management
Types of Duplicate Content
Exact duplicates: Identical content accessible at multiple URLs
http://vshttps://www.vs non-www- Trailing slash vs no trailing slash
- URL parameters (tracking, session, sorting)
index.htmlvs/- Uppercase vs lowercase URLs
Near duplicates: Substantially similar content with minor variations
- Product pages differing only by color/size selection
- Location pages with boilerplate content and only city name changed
- Paginated content where intro text repeats across pages
- Print-friendly versions of pages
- Mobile-specific URLs (m.example.com)
Syndicated duplicates: Same content on different domains
- Content republished on partner sites
- Press releases on wire services
- Product descriptions provided by manufacturers
Resolution Strategies
| Duplicate Type | Strategy | Implementation |
|---|---|---|
| Protocol/www/slash variations | 301 redirect to canonical version | Server config (nginx/Apache redirect rules) |
| Parameter variations | Self-referencing canonical on clean URL | Canonical tag on every page template |
| Print-friendly versions | Canonical to main page or noindex | Canonical tag on print pages |
| Near-duplicate location pages | Unique content per page (minimum 60-70% unique) | Invest in location-specific content |
| Syndicated content | Cross-domain canonical to original publisher | Canonical tag on syndication partner pages |
| Paginated content | Self-referencing canonical per page OR view-all canonical | Depends on page count (see site-architecture.md) |
| Translated content (same language) | Choose one version; canonical to it | Canonical tag; do not use hreflang for same-language duplicates |
Index Bloat
What It Is
Index bloat occurs when a search engine indexes significantly more pages than the site has valuable, unique content. Common symptoms:
- Indexed page count in GSC is 2x+ the number of pages in the sitemap
- Large numbers of thin or duplicate pages appearing in the index
- Important pages competing with low-value pages for rankings
Common Sources of Index Bloat
| Source | Example | Scale Risk |
|---|---|---|
| Faceted navigation | Every filter combination generates an indexable URL | Extreme (hundreds of thousands to millions) |
| Internal search results | /search?q=* pages indexed for every query |
High |
| Tag/archive pages | WordPress tag pages with 1-2 posts each | Medium |
| Pagination | Deep paginated pages (page 50+) with no unique value | Medium |
| Calendar/date archives | Empty or near-empty date archive pages | Medium |
| User profile pages | Thin public profile pages on UGC platforms | High |
| Parameter variations | Tracking, session, currency, language parameters | High |
| Staging/development environments | Staging.example.com indexed by Google | Medium |
| PDF and file duplicates | Same content as HTML pages but in PDF format | Low-Medium |
Index Bloat Cleanup Process
- Audit the index: Compare GSC indexed page count to your sitemap URL count. A ratio above 1.5:1 suggests bloat
- Identify bloat sources: Use GSC index coverage report, site: search operator, and crawl data to categorize indexed URLs by template type
- Prioritize by volume: Address the largest bloat sources first (faceted navigation before tag pages)
- Apply controls:
noindex, followon pages that have link value but should not rankrobots.txt Disallowon URL patterns that should never be crawledcanonicalto consolidate duplicate/near-duplicate pages410 Gonefor pages that should be permanently removedrel="canonical"to view-all or primary page for paginated series
- Clean up sitemaps: Remove all non-indexable URLs from XML sitemaps
- Monitor: Track indexed page count weekly. Expect a gradual decrease over 4-8 weeks as Google recrawls and deindexes pages
New Content Indexation
How to Speed Up Indexation of New Pages
Tier 1: High-impact (do immediately)
- Add internal links from high-authority, frequently crawled pages (homepage, category pages, popular blog posts)
- Include the new URL in the XML sitemap with an accurate
lastmoddate - Use Google Search Console URL Inspection tool > "Request Indexing" (limited to ~10-20 requests per day)
Tier 2: Supplementary (do within 24 hours)
- Share the URL on social media (Google discovers URLs through social platforms)
- Ping the sitemap:
https://www.google.com/ping?sitemap=https://example.com/sitemap.xml - If the site uses Google's Indexing API (eligible for job postings and live streaming content), submit through the API (much faster than standard crawling)
Tier 3: Long-term (ongoing)
- Maintain a healthy crawl rate by keeping the site fast and error-free
- Build external backlinks to new content
- Publish content consistently (sites with regular publishing schedules get crawled more frequently)
- Keep XML sitemaps accurate (no broken URLs, accurate lastmod dates)
Indexation Timeline Expectations
| Site Authority | New Page Indexation | Factors |
|---|---|---|
| High (established domain, strong backlink profile) | Minutes to hours | Google crawls frequently; new content discovered quickly through internal links |
| Medium (growing domain, moderate authority) | Hours to days | Regular crawling schedule; sitemap and internal links help |
| Low (new domain, few backlinks) | Days to weeks | Infrequent crawling; URL Inspection and sitemap submission are critical |
| Very low (brand new domain, no backlinks) | Weeks to months | Google may need multiple crawl cycles before indexing; focus on building authority |
Google Indexing API
The Indexing API provides near-instant indexation (minutes) but is officially supported only for:
JobPostingstructured data pagesBroadcastEvent(live streaming) structured data pages
Some SEOs use it for broader content types with mixed results. Google has stated it is only intended for the supported types. For most sites, the URL Inspection tool's "Request Indexing" is the recommended manual indexation method.
URL Removal Tools and Processes
Temporary Removal (Google Search Console)
- URL Removal Tool: Temporarily hides a URL from Google Search results for approximately 6 months
- Use for: Emergency removal of sensitive content, outdated pages that need time to fix
- Does NOT permanently remove the page from the index — the page must also have noindex or return 404/410 for permanent removal
Permanent Removal Methods
| Method | Speed | Permanence | Use Case |
|---|---|---|---|
noindex meta tag |
Days to weeks (next crawl) | Permanent while tag is present | Pages that should exist but not rank |
410 Gone status code |
Days to weeks | Permanent (Google drops from index) | Content permanently removed with no replacement |
404 Not Found |
Weeks to months | Eventually drops from index | Content no longer exists |
301 Redirect |
Days to weeks | Old URL replaced by new URL in index | Content moved to a new URL |
| URL Removal Tool + noindex | Hours (removal) + permanent (noindex) | Permanent | Urgent removal of sensitive/harmful content |
Outdated Content Removal
Google provides a separate "Remove Outdated Content" tool for requesting removal of cached content that no longer reflects the live page. This is used when:
- A page's snippet in search results shows outdated information
- A page has been updated but Google's cache has not refreshed
- A removed page still appears in search results for other users to request removal
This tool is available to anyone, not just site owners: https://search.google.com/search-console/remove-outdated-content