fix(scanner): reliable cookie discovery, auto-categorisation, and scan scheduling UI (#7)

Scanner fixes:
- Remove conflicting ``path`` from consent pre-seed cookie (Playwright
  rejects cookies with both ``url`` and ``path``).
- Switch to waiting for ``networkidle`` plus a 5s settle delay, then run a
  second capture pass 2s later, for reliable cookie capture.
- Check sitemap Content-Type to skip SPA HTML fallbacks.
- Propagate ``auto_category`` from scan results to the cookies table
  during sync (was silently dropped).
- Add ``_gcl_ls`` to the Open Cookie Database CSV.
- Classify ``_consentos_*`` cookies as necessary directly in the
  classification engine.
- Add ``seed_known_cookies`` to the bootstrap init container command.
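The two-pass capture implies a merge step: cookies seen in either pass must
be deduplicated, with the later pass winning so refreshed values survive. A
minimal sketch of that merge (the function name and the dict shape are
assumptions for illustration, not the scanner's actual code):

```python
def merge_cookie_passes(first_pass, second_pass):
    """Merge cookies from two capture passes, deduping by identity.

    A cookie's identity is taken as (name, domain, path); when both
    passes saw the same cookie, the second pass wins so values set
    late (e.g. after consent scripts finish) are the ones kept.
    """
    merged = {}
    for cookie in first_pass + second_pass:
        key = (
            cookie["name"],
            cookie.get("domain", ""),
            cookie.get("path", "/"),
        )
        merged[key] = cookie  # later entries overwrite earlier ones
    return list(merged.values())
```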

Admin UI:
- Add scan schedule control to the Scans tab — preset options
  (disabled/daily/weekly/fortnightly/monthly) plus custom cron input.
  Saves ``scan_schedule_cron`` on the site config.
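The presets map naturally onto cron expressions. A hedged sketch of that
mapping (the expressions and the 03:00 run time are illustrative
assumptions, not the UI's actual values; standard cron cannot express a
true fortnight, so the 1st and 15th are used as an approximation):

```python
# Illustrative preset-to-cron mapping; the strings actually saved to
# scan_schedule_cron by the admin UI may differ.
SCHEDULE_PRESETS = {
    "disabled": None,               # no scheduled scans
    "daily": "0 3 * * *",           # every day at 03:00
    "weekly": "0 3 * * 1",          # Mondays at 03:00
    "fortnightly": "0 3 1,15 * *",  # approximation: 1st and 15th
    "monthly": "0 3 1 * *",         # first of the month
}

def resolve_schedule(preset, custom_cron=None):
    """Return the cron string for a preset, or the custom expression."""
    if preset == "custom":
        return custom_cron
    return SCHEDULE_PRESETS[preset]
```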
Author: James Cottrill
Date: 2026-04-18 20:14:32 +01:00
Committed by: GitHub
Parent: 80dfc15319
Commit: e0f1dd43e8
11 changed files with 297 additions and 15 deletions


@@ -75,6 +75,13 @@ async def _fetch_sitemap(
    if resp.status_code != 200:
        return []
    # SPAs with catch-all nginx/Caddy rules return 200 + text/html
    # for /sitemap.xml. Don't try to parse HTML as XML.
    content_type = resp.headers.get("content-type", "")
    if "html" in content_type and "xml" not in content_type:
        logger.debug("Sitemap %s returned HTML, skipping", url)
        return []
    root = ElementTree.fromstring(resp.text)
    # Check if it's a sitemap index
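The Content-Type guard in the hunk above can be lifted into a standalone
predicate for unit testing. This is an illustrative extraction, not the
file's actual code; the lowercasing is an added hardening assumption, since
header values from a server may arrive in any case:

```python
def is_html_fallback(content_type: str) -> bool:
    """True when a sitemap URL returned an SPA's catch-all HTML page.

    Checking for "xml" as well means types such as
    application/xhtml+xml are not mistakenly skipped.
    """
    ct = content_type.lower()
    return "html" in ct and "xml" not in ct
```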