fix(scanner): reliable cookie discovery, auto-categorisation, and scan scheduling UI (#7)

Scanner fixes:
- Remove conflicting ``path`` from consent pre-seed cookie (Playwright
  rejects cookies with both ``url`` and ``path``).
- Switch to ``networkidle`` + 5s + 2s delayed second-pass for reliable
  cookie capture.
- Check sitemap Content-Type to skip SPA HTML fallbacks.
- Propagate ``auto_category`` from scan results to the cookies table
  during sync (was silently dropped).
- Add ``_gcl_ls`` to the Open Cookie Database CSV.
- Classify ``_consentos_*`` cookies as necessary directly in the
  classification engine.
- Add ``seed_known_cookies`` to the bootstrap init container command.

Admin UI:
- Add scan schedule control to the Scans tab: preset options
  (disabled/daily/weekly/fortnightly/monthly) plus custom cron input.
  Saves ``scan_schedule_cron`` on the site config.
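
The first scanner fix concerns how the consent cookie is scoped when it is pre-seeded. A minimal sketch of the idea, assuming Playwright's documented rule that a cookie passed to ``add_cookies`` carries either ``url`` or a ``domain``/``path`` pair. The helper names, the cookie name and the value below are hypothetical and are not taken from the scanner code:

```python
from typing import Dict


def build_preseed_cookie(name: str, value: str, site_url: str) -> Dict[str, str]:
    """Build a cookie dict scoped by ``url`` only (hypothetical helper)."""
    # Deliberately no "path" or "domain" key: the url already scopes the cookie.
    return {"name": name, "value": value, "url": site_url}


def has_conflicting_scope(cookie: Dict[str, str]) -> bool:
    """Return True if the cookie mixes ``url`` with ``path`` or ``domain``."""
    return "url" in cookie and ("path" in cookie or "domain" in cookie)


# Hypothetical cookie name and value, for illustration only.
cookie = build_preseed_cookie("_consentos_consent", "granted", "https://example.com")
```

A dict built this way could then be handed to ``context.add_cookies([cookie])``. A check like ``has_conflicting_scope`` is one way to catch the mixed form before it reaches the browser.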
@@ -75,6 +75,13 @@ async def _fetch_sitemap(
     if resp.status_code != 200:
         return []
 
+    # SPAs with catch-all nginx/Caddy rules return 200 + text/html
+    # for /sitemap.xml. Don't try to parse HTML as XML.
+    content_type = resp.headers.get("content-type", "")
+    if "html" in content_type and "xml" not in content_type:
+        logger.debug("Sitemap %s returned HTML, skipping", url)
+        return []
+
     root = ElementTree.fromstring(resp.text)
 
     # Check if it's a sitemap index
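
The Content-Type check in the hunk above can be exercised on its own. The following is a small sketch, not the project's code: the function name ``is_html_fallback`` is hypothetical, and the ``lower()`` call is an added assumption (the diff compares the header value as received).

```python
def is_html_fallback(content_type: str) -> bool:
    """Return True when a Content-Type value looks like HTML rather than XML."""
    # Assumption: normalise case first; the original check does not do this.
    ct = content_type.lower()
    # Same rule as the diff: mentions "html" and does not mention "xml".
    return "html" in ct and "xml" not in ct


print(is_html_fallback("text/html; charset=utf-8"))  # SPA catch-all response
print(is_html_fallback("application/xml"))           # a real sitemap
```

One consequence of this rule is that ``application/xhtml+xml`` is not skipped, because the value contains both substrings. A missing header is not skipped either, so parsing is still attempted in that case.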