Files
ALwrity/docs/BACKEND_LOG_RCA_TRACKER.md
2026-03-10 17:01:36 +05:30

362 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Backend Log RCA Tracker
## Purpose
This document is the working catalog for backend issues observed in runtime logs.
For each issue, capture:
- error signature
- observed symptoms
- likely root cause analysis
- confidence level
- files to inspect/edit
- fix strategy notes
- validation steps
- status
## Triage Rules
- Do not fix directly from logs alone unless root cause is confirmed.
- Prefer grouping repeated log lines under one issue.
- Track the first failing subsystem, then downstream effects.
- Separate configuration problems from code defects.
- Keep this document updated before and after each fix.
## Issue 1: Clerk token verification failures on authenticated endpoints
- **Status**: Open
- **Severity**: High
- **Subsystem**: Authentication / request pipeline
- **Error signatures**:
- `Unverified token rejected (production).`
- `AUTHENTICATION ERROR: Token verification failed for endpoint: GET /api/...`
- **Observed endpoints in logs**:
- `/api/content-planning/monitoring/lightweight-stats`
- `/api/content-planning/monitoring/health`
- `/api/subscription/dashboard/...`
- `/api/subscription/alerts/...`
- `/api/subscription/status/...`
- **Observed behavior**:
- Requests reach authenticated endpoints.
- Clerk verification fails.
- Fallback unverified decode path is attempted.
- Production mode rejects the token.
- **Primary RCA hypothesis**:
- The backend is receiving bearer tokens that do not successfully validate against the resolved Clerk JWKS/issuer configuration.
- The middleware then falls back to unverified decode, but production mode explicitly rejects that path.
- **Secondary RCA hypotheses**:
- Frontend token/audience/issuer mismatch.
- Wrong Clerk environment variables loaded in backend.
- Issuer-derived JWKS URL resolution is inconsistent with actual Clerk instance.
- Requests may be sent before a valid session token is available.
- **Evidence in code**:
- `backend/middleware/auth_middleware.py`
- `ClerkAuthMiddleware.__init__`
- `ClerkAuthMiddleware.verify_token`
- `get_current_user`
- Relevant logic:
- derives JWKS URL from token issuer or cached publishable key instance
- falls back to `jwt.decode(..., verify_signature=False)`
- rejects unverified tokens when `ALLOW_UNVERIFIED_JWT_DEV` is false
- **Likely files to inspect/edit later**:
- `backend/middleware/auth_middleware.py`
- possibly frontend auth/session request layer if token attachment is inconsistent
- **Confidence**: Medium
- **Root-cause questions to answer**:
- Are `CLERK_SECRET_KEY` and publishable key values from the same Clerk instance?
- Is the token issuer exactly matching the intended Clerk environment?
- Are failing requests sent with stale, dev, or cross-environment tokens?
- Are these requests triggered before Clerk session hydration on the frontend?
- **Validation after fix**:
- Authenticated endpoints return 200 with verified user context.
- No `Unverified token rejected (production)` log spam for healthy requests.
## Issue 2: Hugging Face structured JSON generation failing with model not found
- **Status**: Open
- **Severity**: High
- **Subsystem**: LLM provider / workflow generation
- **Error signatures**:
- `HF structured model not found: %s. Trying fallback model.`
- `Hugging Face API call failed: Not Found`
- `HF structured model not found (no response_format path): %s`
- `Hugging Face structured JSON generation failed: NotFoundError: Not Found`
- `[llm_text_gen] Provider huggingface failed: RetryError[...]`
- **Observed behavior**:
- Structured JSON call tries primary model.
- Fallback model sequence also fails.
- Retry without `response_format` still fails with `NotFound`.
- Upstream caller falls through to another provider or fallback path.
- **Primary RCA hypothesis**:
- The configured Hugging Face model identifier is invalid, unavailable to the account/provider, or incompatible with the current OpenAI-compatible Hugging Face endpoint.
- **Secondary RCA hypotheses**:
- Base URL/API key/provider configuration is wrong.
- Fallback model list contains provider-specific model ids not available in the current account/region.
- Structured generation path assumes chat completions support for models that only exist on a different inference route.
- **Evidence in code**:
- `backend/services/llm_providers/huggingface_provider.py`
- `_fallback_model_sequence`
- `huggingface_structured_json_response`
- The code retries:
- with `response_format={"type": "json_object"}`
- then again without `response_format`
- Both paths still fail with `NotFoundError`, which points more strongly to model/base-url availability than schema formatting.
- **Likely files to inspect/edit later**:
- `backend/services/llm_providers/huggingface_provider.py`
- provider selection/orchestration file calling Hugging Face as primary for structured JSON
- environment/config file for HF model names and API base URL
- **Confidence**: High
- **Root-cause questions to answer**:
- Which exact model string is being passed as the primary model in the failing call?
- What base URL and API key are being used for the OpenAI client?
- Are the fallback model ids valid for the currently configured Hugging Face inference provider?
- **Validation after fix**:
- A structured JSON test request succeeds with the intended model or a verified fallback.
- No `NotFoundError` for the chosen model list.
## Issue 3: txtai indexing attempted before service initialization completes
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: Semantic indexing / background tasks
- **Error signatures**:
- `Cannot index content - service not initialized for user ...`
- **Observed behavior**:
- Background indexing is triggered.
- `TxtaiIntelligenceService.index_content` calls `_ensure_initialized()`.
- `_ensure_initialized()` starts background initialization and returns immediately.
- `index_content` then checks `_initialized`, sees false, and fails fast.
- **Primary RCA hypothesis**:
- There is a race condition between lazy background initialization and immediate indexing/search calls.
- `SIF_FAIL_FAST=true` (default) causes operations to raise RuntimeError instead of gracefully deferring.
- **Evidence in code**:
- `backend/services/intelligence/txtai_service.py`:
- Line 57: `self.fail_fast = str(os.getenv("SIF_FAIL_FAST", "true")).lower() in {"1", "true", "yes", "on"}`
- Lines 234-235: `index_content` raises RuntimeError if `fail_fast` and not initialized
- Lines 284-285: `search` raises RuntimeError if `fail_fast` and not initialized
- Lines 319-320: `get_similarity` raises RuntimeError if `fail_fast` and not initialized
- `_ensure_initialized` is intentionally non-blocking (starts background thread)
- `backend/api/today_workflow.py`:
- `_index_tasks_to_sif` triggers indexing in background after workflow actions
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- any other callers that assume initialization is synchronous
- **Confidence**: High
- **Potential downstream impact**:
- workflow/task indexing silently fails
- semantic search quality degrades
- noisy logs obscure higher-priority failures
- **Root-cause questions to answer**:
- Should `index_content` await `_ensure_initialized_async()` instead of using the non-blocking path?
- Should callers tolerate deferred indexing instead of fail-fast behavior?
- Is `SIF_FAIL_FAST=true` appropriate for background indexing operations?
- Should `SIF_FAIL_FAST` default to `false` for background operations?
- **Validation after fix**:
- First indexing call after startup succeeds or is gracefully deferred without error spam.
## Issue 4: Today workflow endpoint reload observed during active debugging
- **Status**: Observed
- **Severity**: Low
- **Subsystem**: Development reload / workflow API
- **Log signature**:
- `StatReload detected changes in 'api\today_workflow.py'. Reloading...`
- **Observed behavior**:
- Development server reloads due to file edits.
- **RCA**:
- Expected dev-server behavior, not itself a product bug.
- **Files involved**:
- `backend/api/today_workflow.py`
- **Confidence**: High
- **Action**:
- No fix needed; keep separate from actual runtime defects.
## Cross-Issue Notes
- The auth failures and the workflow/indexing issues may be independent.
- The Hugging Face failure may trigger fallback task generation, which can still create workflows while hiding the upstream provider problem.
- txtai indexing failures appear to be a post-generation side effect, not the root cause of generation failure.
- **LiteLLM was investigated and dropped as a false herring** no project-level SIF/txtai wiring to LiteLLM was found.
- The SIF agent local-model path is **separate** from txtai embeddings and may be the source of the "local model used to work" feedback.
## Candidate Investigation Order
1. Authentication verification mismatch
2. Hugging Face model/provider availability mismatch
3. txtai initialization race (with `SIF_FAIL_FAST` behavior)
4. SIF agent local-model defaults (Qwen 1.5B vs lighter alternatives)
5. Any downstream workflow symptoms after the above are stabilized
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race):
- **Option A**: Change `SIF_FAIL_FAST` default to `false` for background operations
- Allows graceful deferral instead of RuntimeError
- Minimal code change, no logic changes
- **Option B**: Use `_ensure_initialized_async()` in `index_content`/`search`/`get_similarity`
- Awaits initialization before proceeding
- More robust but requires async refactoring
- **Option C**: Add initialization state callbacks to callers
- More complex, may not be necessary
### For Issue 5 (SIF agent local-model drift):
- **Option A**: Change default `model_name` in `SIFBaseAgent.__init__` to lighter model
- Example: `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`
- Single-line change, immediate effect
- **Option B**: Add env/config override for default agent local model
- More flexible, requires config wiring
- Allows runtime tuning without code changes
- **Option C**: Keep current default and rely on existing fallback chain
- The fallback chain already tries lighter models if memory fails
- May be sufficient if memory detection works correctly
## Current Evidence Sources
- Runtime logs from terminal `python` process `22056`
- `backend/middleware/auth_middleware.py`
- `backend/services/llm_providers/huggingface_provider.py`
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- `backend/services/today_workflow_service.py`
## Issue 5: SIF agent local-model drift (distinct from txtai embeddings)
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: SIF agents / local LLM wrappers
- **Error signatures**:
- (No direct log signature yet; this is a hypothesis from user feedback that "local model used to work")
- **Observed behavior**:
- User reports that a local model used to work for SIF agents now seems heavier or less responsive.
- The SIF agent path is **separate** from txtai embeddings.
- **Primary RCA hypothesis**:
- The SIF agent local LLM wrapper path uses a 1.5B parameter model by default, which may be heavier than the previous local model.
- This is distinct from txtai embeddings, which still use `sentence-transformers/all-MiniLM-L6-v2`.
- **Evidence in code**:
- `backend/services/intelligence/sif_agents.py`:
- Lines 47-51: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Lines 53-139: `LocalLLMWrapper` tries models in order, with memory issue detection and automatic fallback to smaller models
- Line 141: `SIFBaseAgent.__init__` default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- `backend/services/intelligence/txtai_service.py`:
- Line 48: Still uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/sif_agents.py`
- `backend/services/intelligence/agents/specialized/base.py`
- any config/env that controls default agent local model
- **Confidence**: Medium
- **Root-cause questions to answer**:
- What was the previous local model default for SIF agents?
- Is `Qwen/Qwen2.5-1.5B-Instruct` actually too heavy for the users laptop?
- Should the default be changed to `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`?
- Are there any env/config overrides that could make this configurable?
- **Validation after fix**:
- SIF agents use a CPU-friendly local model (e.g., smaller Qwen variant or TinyLlama).
- Agent generation completes without excessive CPU/memory pressure.
## Issue 6: Model initialization blocking and module unification
- **Status**: Open
- **Severity**: High
- **Subsystem**: Startup / model loading / module architecture
- **Error signatures**:
- (No direct log signature; architectural issue)
- **Observed behavior**:
- `start_alwrity_backend.py` pre-downloads `Qwen/Qwen2.5-3B-Instruct` **synchronously** before server starts (line 122).
- `sif_agents.py` defaults to `Qwen/Qwen2.5-1.5B-Instruct` and uses lazy loading via `LocalLLMWrapper`.
- `txtai_service.py` uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
- Three separate modules handle model loading, creating confusion.
- User wants fail-fast semantics (catch bugs, avoid silent failures) AND proper fallback.
- User wants non-blocking model downloads for SIF/agents.
- **Primary RCA hypothesis**:
- Startup script blocks on model download, contradicting non-blocking requirement.
- Model size mismatch: startup downloads 3B, agents default to 1.5B.
- Fail-fast in `txtai_service.py` prevents fallback from working.
- Module separation (`txtai_service.py`, `sif_agents.py`, `start_alwrity_backend.py`) creates confusion.
- **Evidence in code**:
- `start_alwrity_backend.py`:
- Line 122: `target_model = "Qwen/Qwen2.5-3B-Instruct"`
- Lines 127-131: `snapshot_download()` is **blocking** call
- Lines 117-120: Skips on Render/Railway but **not** on local dev
- `sif_agents.py`:
- Line 48: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Line 141: Default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- Line 150: Uses `LocalLLMWrapper` for lazy loading
- Lines 94-130: Has fallback logic with memory issue detection
- `txtai_service.py`:
- Line 57: `SIF_FAIL_FAST=true` (default) causes RuntimeError
- Lines 234-235, 284-285, 319-320: Fail-fast prevents fallback
- **Likely files to inspect/edit later**:
- `start_alwrity_backend.py` (remove blocking download)
- `services/intelligence/sif_agents.py` (unify model defaults)
- `services/intelligence/txtai_service.py` (fix fail-fast with fallback)
- Create unified `services/intelligence/model_registry.py` or similar
- **Confidence**: High
- **Root-cause questions to answer**:
- Should model download be truly non-blocking (background thread)?
- Should fail-fast be conditional (e.g., only for critical paths, not background ops)?
- Should module unification create a single `ModelRegistry` or `ModelManager`?
- How to ensure JSON/response structure compatibility across fallback chain?
- **Validation after fix**:
- Server starts without blocking on model download.
- SIF agents use consistent model defaults.
- Fail-fast catches bugs but allows fallback for non-critical ops.
- Single module handles all model loading logic.
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race) - REVISED:
- **Option A**: Change `SIF_FAIL_FAST` to be **conditional** (not global)
- Keep fail-fast for critical paths (user-initiated ops)
- Allow graceful deferral for background ops (indexing, clustering)
- Requires distinguishing operation types
- **Option B**: Use `_ensure_initialized_async()` for **blocking ops only**
- Keep non-blocking for background ops
- Awaits init for user-facing ops
- More robust but requires async refactoring
- **Option C**: Add operation-type-aware fail-fast
- Pass `critical=True/False` to operations
- Fail-fast only when `critical=True`
- Most aligned with user requirements
### For Issue 5 (SIF agent local-model drift) - REVISED:
- **Option A**: Change default to lighter model AND improve fallback chain
- Default: `Qwen/Qwen2.5-0.5B-Instruct` (lighter)
- Fallback: `0.5B → TinyLlama 1.1B`
- Ensure JSON/response structure compatibility
- **Option B**: Add env/config override + keep fallback chain
- `SIF_AGENT_MODEL` env var
- Fallback chain remains as-is
- More flexible
- **Option C**: Keep current default and rely on existing fallback chain
- **RECOMMENDED**: Already has memory detection and fallback
- Just need to ensure JSON compatibility
### For Issue 6 (Model blocking + module unification):
- **Option A**: Remove blocking download from startup script
- Delete `bootstrap_local_llm_models()` call
- Let `LocalLLMWrapper` handle lazy loading
- Minimal change, immediate non-blocking
- **Option B**: Make download non-blocking (background thread)
- Keep pre-download but in background
- Server starts immediately
- More complex
- **Option C**: Create unified `ModelRegistry` module
- Single source of truth for model defaults
- Centralized download/cache logic
- Eliminates confusion between modules
- **RECOMMENDED for long-term**
## Session Update Log
### 2026-03-10
- Created initial RCA tracker document.
- Seeded first three concrete issues from supplied logs.
- No fixes applied from this document yet.
- Added Issue 5: SIF agent local-model drift (LiteLLM dropped as false herring).
- Refined Issue 3 with `SIF_FAIL_FAST` behavior details.
- Added minimal fix paths for Issues 3 and 5.
- Added Issue 6: Model initialization blocking and module unification.
- Updated minimal fix paths based on user requirements (fail-fast + fallback, non-blocking, unification).