fix(scanner): reliable cookie discovery, auto-categorisation, and scan scheduling UI (#7)

Scanner fixes:
- Remove conflicting ``path`` from consent pre-seed cookie (Playwright
  rejects cookies with both ``url`` and ``path``).
- Switch to waiting for ``networkidle`` plus a 5s settle delay, then run a
  second capture pass 2s later, for reliable cookie capture.
- Check sitemap Content-Type to skip SPA HTML fallbacks.
- Propagate ``auto_category`` from scan results to the cookies table
  during sync (was silently dropped).
- Add ``_gcl_ls`` to the Open Cookie Database CSV.
- Classify ``_consentos_*`` cookies as necessary directly in the
  classification engine.
- Add ``seed_known_cookies`` to the bootstrap init container command.
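The two-pass capture implies a merge step: cookies seen in either pass must
be deduplicated, with the later pass winning so refreshed values survive. A
minimal sketch of that merge (the function name and the dict shape are
assumptions for illustration, not the scanner's actual code):

```python
def merge_cookie_passes(first_pass, second_pass):
    """Merge cookies from two capture passes, deduping by identity.

    A cookie's identity is taken as (name, domain, path); when both
    passes saw the same cookie, the second pass wins so values set
    late (e.g. after consent scripts finish) are the ones kept.
    """
    merged = {}
    for cookie in first_pass + second_pass:
        key = (
            cookie["name"],
            cookie.get("domain", ""),
            cookie.get("path", "/"),
        )
        merged[key] = cookie  # later entries overwrite earlier ones
    return list(merged.values())
```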

Admin UI:
- Add scan schedule control to the Scans tab — preset options
  (disabled/daily/weekly/fortnightly/monthly) plus custom cron input.
  Saves ``scan_schedule_cron`` on the site config.
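The presets map naturally onto cron expressions. A hedged sketch of that
mapping (the expressions and the 03:00 run time are illustrative
assumptions, not the UI's actual values; standard cron cannot express a
true fortnight, so the 1st and 15th are used as an approximation):

```python
# Illustrative preset-to-cron mapping; the strings actually saved to
# scan_schedule_cron by the admin UI may differ.
SCHEDULE_PRESETS = {
    "disabled": None,               # no scheduled scans
    "daily": "0 3 * * *",           # every day at 03:00
    "weekly": "0 3 * * 1",          # Mondays at 03:00
    "fortnightly": "0 3 1,15 * *",  # approximation: 1st and 15th
    "monthly": "0 3 1 * *",         # first of the month
}

def resolve_schedule(preset, custom_cron=None):
    """Return the cron string for a preset, or the custom expression."""
    if preset == "custom":
        return custom_cron
    return SCHEDULE_PRESETS[preset]
```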
Author: James Cottrill
Date: 2026-04-18 20:14:32 +01:00
Committed by: GitHub
Parent: 80dfc15319
Commit: e0f1dd43e8
11 changed files with 297 additions and 15 deletions


@@ -75,6 +75,13 @@ async def _fetch_sitemap(
    if resp.status_code != 200:
        return []
    # SPAs with catch-all nginx/Caddy rules return 200 + text/html
    # for /sitemap.xml. Don't try to parse HTML as XML.
    content_type = resp.headers.get("content-type", "")
    if "html" in content_type and "xml" not in content_type:
        logger.debug("Sitemap %s returned HTML, skipping", url)
        return []
    root = ElementTree.fromstring(resp.text)
    # Check if it's a sitemap index
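The Content-Type guard in the hunk above can be lifted into a standalone
predicate for unit testing. This is an illustrative extraction, not the
file's actual code; the lowercasing is an added hardening assumption, since
header values from a server may arrive in any case:

```python
def is_html_fallback(content_type: str) -> bool:
    """True when a sitemap URL returned an SPA's catch-all HTML page.

    Checking for "xml" as well means types such as
    application/xhtml+xml are not mistakenly skipped.
    """
    ct = content_type.lower()
    return "html" in ct and "xml" not in ct
```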