Commit_all_local_changes_after_PR_406_merge

2026-03-10 17:01:36 +05:30
parent f78b5f1e04
commit 8c2d88efb9
17 changed files with 936 additions and 412 deletions
--- a/docs/BACKEND_LOG_RCA_TRACKER.md
+++ b/docs/BACKEND_LOG_RCA_TRACKER.md
@@ -0,0 +1,361 @@
+# Backend Log RCA Tracker
+
+## Purpose
+
+This document is the working catalog for backend issues observed in runtime logs.
+
+For each issue, capture:
+- error signature
+- observed symptoms
+- likely root cause analysis
+- confidence level
+- files to inspect/edit
+- fix strategy notes
+- validation steps
+- status
+
+## Triage Rules
+
+- Do not fix directly from logs alone unless root cause is confirmed.
+- Prefer grouping repeated log lines under one issue.
+- Track the first failing subsystem, then downstream effects.
+- Separate configuration problems from code defects.
+- Keep this document updated before and after each fix.
+
+## Issue 1: Clerk token verification failures on authenticated endpoints
+
+- **Status**: Open
+- **Severity**: High
+- **Subsystem**: Authentication / request pipeline
+- **Error signatures**:
+  - `Unverified token rejected (production).`
+  - `AUTHENTICATION ERROR: Token verification failed for endpoint: GET /api/...`
+- **Observed endpoints in logs**:
+  - `/api/content-planning/monitoring/lightweight-stats`
+  - `/api/content-planning/monitoring/health`
+  - `/api/subscription/dashboard/...`
+  - `/api/subscription/alerts/...`
+  - `/api/subscription/status/...`
+- **Observed behavior**:
+  - Requests reach authenticated endpoints.
+  - Clerk verification fails.
+  - Fallback unverified decode path is attempted.
+  - Production mode rejects the token.
+- **Primary RCA hypothesis**:
+  - The backend is receiving bearer tokens that do not successfully validate against the resolved Clerk JWKS/issuer configuration.
+  - The middleware then falls back to unverified decode, but production mode explicitly rejects that path.
+- **Secondary RCA hypotheses**:
+  - Frontend token/audience/issuer mismatch.
+  - Wrong Clerk environment variables loaded in backend.
+  - Issuer-derived JWKS URL resolution is inconsistent with actual Clerk instance.
+  - Requests may be sent before a valid session token is available.
+- **Evidence in code**:
+  - `backend/middleware/auth_middleware.py`
+    - `ClerkAuthMiddleware.__init__`
+    - `ClerkAuthMiddleware.verify_token`
+    - `get_current_user`
+  - Relevant logic:
+    - derives JWKS URL from token issuer or cached publishable key instance
+    - falls back to `jwt.decode(..., verify_signature=False)`
+    - rejects unverified tokens when `ALLOW_UNVERIFIED_JWT_DEV` is false
+- **Likely files to inspect/edit later**:
+  - `backend/middleware/auth_middleware.py`
+  - possibly frontend auth/session request layer if token attachment is inconsistent
+- **Confidence**: Medium
+- **Root-cause questions to answer**:
+  - Are `CLERK_SECRET_KEY` and publishable key values from the same Clerk instance?
+  - Is the token issuer exactly matching the intended Clerk environment?
+  - Are failing requests sent with stale, dev, or cross-environment tokens?
+  - Are these requests triggered before Clerk session hydration on the frontend?
+- **Validation after fix**:
+  - Authenticated endpoints return 200 with verified user context.
+  - No `Unverified token rejected (production)` log spam for healthy requests.
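Reviewer sketch (not part of the committed file): one way to answer the issuer question above is to read the claims of a failing token without verifying it and compare `iss` with the issuer the backend expects. The helper names `unverified_claims` and `issuer_mismatch` are hypothetical, and the decode is for log diagnostics only; it must never be used to authenticate a request.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying it. For diagnostics only."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_mismatch(token: str, expected_issuer: str) -> bool:
    """True when the token's `iss` claim differs from the expected issuer."""
    issuer = str(unverified_claims(token).get("iss", ""))
    # Ignore a trailing slash so equivalent issuer URLs compare equal.
    return issuer.rstrip("/") != expected_issuer.rstrip("/")
```

Logging the result of `issuer_mismatch` next to the existing authentication error would show whether the failing requests carry cross-environment tokens.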
+
+## Issue 2: Hugging Face structured JSON generation failing with model not found
+
+- **Status**: Open
+- **Severity**: High
+- **Subsystem**: LLM provider / workflow generation
+- **Error signatures**:
+  - `HF structured model not found: %s. Trying fallback model.`
+  - `Hugging Face API call failed: Not Found`
+  - `HF structured model not found (no response_format path): %s`
+  - `Hugging Face structured JSON generation failed: NotFoundError: Not Found`
+  - `[llm_text_gen] Provider huggingface failed: RetryError[...]`
+- **Observed behavior**:
+  - Structured JSON call tries primary model.
+  - Fallback model sequence also fails.
+  - Retry without `response_format` still fails with `NotFound`.
+  - Upstream caller falls through to another provider or fallback path.
+- **Primary RCA hypothesis**:
+  - The configured Hugging Face model identifier is invalid, unavailable to the account/provider, or incompatible with the current OpenAI-compatible Hugging Face endpoint.
+- **Secondary RCA hypotheses**:
+  - Base URL/API key/provider configuration is wrong.
+  - Fallback model list contains provider-specific model ids not available in the current account/region.
+  - Structured generation path assumes chat completions support for models that only exist on a different inference route.
+- **Evidence in code**:
+  - `backend/services/llm_providers/huggingface_provider.py`
+    - `_fallback_model_sequence`
+    - `huggingface_structured_json_response`
+  - The code retries:
+    - with `response_format={"type": "json_object"}`
+    - then again without `response_format`
+  - Both paths still fail with `NotFoundError`, which points more strongly to model/base-url availability than schema formatting.
+- **Likely files to inspect/edit later**:
+  - `backend/services/llm_providers/huggingface_provider.py`
+  - provider selection/orchestration file calling Hugging Face as primary for structured JSON
+  - environment/config file for HF model names and API base URL
+- **Confidence**: High
+- **Root-cause questions to answer**:
+  - Which exact model string is being passed as the primary model in the failing call?
+  - What base URL and API key are being used for the OpenAI client?
+  - Are the fallback model ids valid for the currently configured Hugging Face inference provider?
+- **Validation after fix**:
+  - A structured JSON test request succeeds with the intended model or a verified fallback.
+  - No `NotFoundError` for the chosen model list.
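Reviewer sketch (not part of the committed file): the retry order described above (each model with `response_format`, then without) can be reused as a small probe that reports which model id, if any, is reachable. `ModelNotFound` and the `call` adapter are hypothetical stand-ins; the real provider raises its own `NotFoundError`, which the adapter would have to translate.

```python
from typing import Callable, Optional, Sequence


class ModelNotFound(Exception):
    """Stand-in for the provider's 404 error (hypothetical type)."""


def first_available_model(
    models: Sequence[str],
    call: Callable[[str, bool], str],
) -> Optional[str]:
    """Return the first model that answers, or None if every attempt 404s.

    `call(model, use_response_format)` is a hypothetical adapter around the
    real client; it must raise ModelNotFound when the model id is unknown.
    """
    for model in models:
        for use_response_format in (True, False):
            try:
                call(model, use_response_format)
                return model
            except ModelNotFound:
                # A 404 on both variants points at the model id or base URL,
                # not at JSON-schema formatting.
                continue
    return None
```

If this returns `None` for the whole configured list, the base URL or account access is the more likely culprit than any single model id.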
+
+## Issue 3: txtai indexing attempted before service initialization completes
+
+- **Status**: Open
+- **Severity**: Medium
+- **Subsystem**: Semantic indexing / background tasks
+- **Error signatures**:
+  - `Cannot index content - service not initialized for user ...`
+- **Observed behavior**:
+  - Background indexing is triggered.
+  - `TxtaiIntelligenceService.index_content` calls `_ensure_initialized()`.
+  - `_ensure_initialized()` starts background initialization and returns immediately.
+  - `index_content` then checks `_initialized`, sees false, and fails fast.
+- **Primary RCA hypothesis**:
+  - There is a race condition between lazy background initialization and immediate indexing/search calls.
+  - `SIF_FAIL_FAST=true` (default) causes operations to raise RuntimeError instead of gracefully deferring.
+- **Evidence in code**:
+  - `backend/services/intelligence/txtai_service.py`:
+    - Line 57: `self.fail_fast = str(os.getenv("SIF_FAIL_FAST", "true")).lower() in {"1", "true", "yes", "on"}`
+    - Lines 234-235: `index_content` raises RuntimeError if `fail_fast` and not initialized
+    - Lines 284-285: `search` raises RuntimeError if `fail_fast` and not initialized
+    - Lines 319-320: `get_similarity` raises RuntimeError if `fail_fast` and not initialized
+    - `_ensure_initialized` is intentionally non-blocking (starts background thread)
+  - `backend/api/today_workflow.py`:
+    - `_index_tasks_to_sif` triggers indexing in background after workflow actions
+- **Likely files to inspect/edit later**:
+  - `backend/services/intelligence/txtai_service.py`
+  - `backend/api/today_workflow.py`
+  - any other callers that assume initialization is synchronous
+- **Confidence**: High
+- **Potential downstream impact**:
+  - workflow/task indexing fails without any user-visible error
+  - semantic search quality degrades
+  - noisy logs obscure higher-priority failures
+- **Root-cause questions to answer**:
+  - Should `index_content` await `_ensure_initialized_async()` instead of using the non-blocking path?
+  - Should callers tolerate deferred indexing instead of fail-fast behavior?
+  - Is `SIF_FAIL_FAST=true` appropriate for background indexing operations?
+  - Should `SIF_FAIL_FAST` default to `false` for background operations?
+- **Validation after fix**:
+  - First indexing call after startup succeeds or is gracefully deferred without error spam.
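Reviewer sketch (not part of the committed file): a toy reproduction of the race described above. `LazyService` is an illustrative stand-in, not the real `TxtaiIntelligenceService`; only the `SIF_FAIL_FAST` parsing expression is taken from the tracker, and the sleep simulates slow model or index loading.

```python
import os
import threading
import time


def env_flag(name: str, default: str = "true") -> bool:
    """Parse a boolean env var the way the tracker quotes for SIF_FAIL_FAST."""
    return str(os.getenv(name, default)).lower() in {"1", "true", "yes", "on"}


class LazyService:
    """Toy stand-in for the real service; names here are illustrative."""

    def __init__(self, init_seconds: float = 0.05) -> None:
        self.fail_fast = env_flag("SIF_FAIL_FAST")
        self._initialized = False
        self._ready = threading.Event()
        self._init_seconds = init_seconds
        self._started = False
        self._lock = threading.Lock()

    def _initialize(self) -> None:
        time.sleep(self._init_seconds)  # simulate slow model/index loading
        self._initialized = True
        self._ready.set()

    def _ensure_initialized(self) -> None:
        """Start initialization in the background and return immediately."""
        with self._lock:
            if self._started:
                return
            self._started = True
        threading.Thread(target=self._initialize, daemon=True).start()

    def index_content(self, item: str) -> bool:
        self._ensure_initialized()
        if not self._initialized:
            if self.fail_fast:
                raise RuntimeError("service not initialized")
            return False  # deferred: the caller may retry later
        return True
```

The first `index_content` call always loses the race here, because the background thread has only just started when `_initialized` is checked.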
+
+## Issue 4: Today workflow endpoint reload observed during active debugging
+
+- **Status**: Observed
+- **Severity**: Low
+- **Subsystem**: Development reload / workflow API
+- **Log signature**:
+  - `StatReload detected changes in 'api\today_workflow.py'. Reloading...`
+- **Observed behavior**:
+  - Development server reloads due to file edits.
+- **RCA**:
+  - Expected dev-server behavior, not itself a product bug.
+- **Files involved**:
+  - `backend/api/today_workflow.py`
+- **Confidence**: High
+- **Action**:
+  - No fix needed; keep separate from actual runtime defects.
+
+## Cross-Issue Notes
+
+- The auth failures and the workflow/indexing issues may be independent.
+- The Hugging Face failure may trigger fallback task generation, which can still create workflows while hiding the upstream provider problem.
+- txtai indexing failures appear to be a post-generation side effect, not the root cause of generation failure.
+- **LiteLLM was investigated and dropped as a red herring** – no project-level SIF/txtai wiring to LiteLLM was found.
+- The SIF agent local-model path is **separate** from txtai embeddings and may be the source of the "local model used to work" feedback.
+
+## Candidate Investigation Order
+
+1. Authentication verification mismatch
+2. Hugging Face model/provider availability mismatch
+3. txtai initialization race (with `SIF_FAIL_FAST` behavior)
+4. SIF agent local-model defaults (Qwen 1.5B vs lighter alternatives)
+5. Any downstream workflow symptoms after the above are stabilized
+
+## Minimal Fix Paths (Pre-Implementation)
+
+### For Issue 3 (txtai init race):
+- **Option A**: Change `SIF_FAIL_FAST` default to `false` for background operations
+  - Allows graceful deferral instead of RuntimeError
+  - Minimal code change, no logic changes
+- **Option B**: Use `_ensure_initialized_async()` in `index_content`/`search`/`get_similarity`
+  - Awaits initialization before proceeding
+  - More robust but requires async refactoring
+- **Option C**: Add initialization state callbacks to callers
+  - More complex, may not be necessary
+
+### For Issue 5 (SIF agent local-model drift):
+- **Option A**: Change default `model_name` in `SIFBaseAgent.__init__` to lighter model
+  - Example: `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`
+  - Single-line change, immediate effect
+- **Option B**: Add env/config override for default agent local model
+  - More flexible, requires config wiring
+  - Allows runtime tuning without code changes
+- **Option C**: Keep current default and rely on existing fallback chain
+  - The fallback chain already tries lighter models if memory fails
+  - May be sufficient if memory detection works correctly
+
+## Current Evidence Sources
+
+- Runtime logs from terminal `python` process `22056`
+- `backend/middleware/auth_middleware.py`
+- `backend/services/llm_providers/huggingface_provider.py`
+- `backend/services/intelligence/txtai_service.py`
+- `backend/api/today_workflow.py`
+- `backend/services/today_workflow_service.py`
+
+## Issue 5: SIF agent local-model drift (distinct from txtai embeddings)
+
+- **Status**: Open
+- **Severity**: Medium
+- **Subsystem**: SIF agents / local LLM wrappers
+- **Error signatures**:
+  - (No direct log signature yet; this is a hypothesis from user feedback that "local model used to work")
+- **Observed behavior**:
+  - User reports that a local model that used to work for SIF agents now seems heavier or less responsive.
+  - The SIF agent path is **separate** from txtai embeddings.
+- **Primary RCA hypothesis**:
+  - The SIF agent local LLM wrapper path uses a 1.5B parameter model by default, which may be heavier than the previous local model.
+  - This is distinct from txtai embeddings, which still use `sentence-transformers/all-MiniLM-L6-v2`.
+- **Evidence in code**:
+  - `backend/services/intelligence/sif_agents.py`:
+    - Lines 47-51: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
+    - Lines 53-139: `LocalLLMWrapper` tries models in order, with memory issue detection and automatic fallback to smaller models
+    - Line 141: `SIFBaseAgent.__init__` default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
+  - `backend/services/intelligence/txtai_service.py`:
+    - Line 48: Still uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings
+- **Likely files to inspect/edit later**:
+  - `backend/services/intelligence/sif_agents.py`
+  - `backend/services/intelligence/agents/specialized/base.py`
+  - any config/env that controls default agent local model
+- **Confidence**: Medium
+- **Root-cause questions to answer**:
+  - What was the previous local model default for SIF agents?
+  - Is `Qwen/Qwen2.5-1.5B-Instruct` actually too heavy for the user’s laptop?
+  - Should the default be changed to `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`?
+  - Are there any env/config overrides that could make this configurable?
+- **Validation after fix**:
+  - SIF agents use a CPU-friendly local model (e.g., smaller Qwen variant or TinyLlama).
+  - Agent generation completes without excessive CPU/memory pressure.
+
+## Issue 6: Model initialization blocking and module unification
+
+- **Status**: Open
+- **Severity**: High
+- **Subsystem**: Startup / model loading / module architecture
+- **Error signatures**:
+  - (No direct log signature; architectural issue)
+- **Observed behavior**:
+  - `start_alwrity_backend.py` pre-downloads `Qwen/Qwen2.5-3B-Instruct` **synchronously** before server starts (line 122).
+  - `sif_agents.py` defaults to `Qwen/Qwen2.5-1.5B-Instruct` and uses lazy loading via `LocalLLMWrapper`.
+  - `txtai_service.py` uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
+  - Three separate modules handle model loading, creating confusion.
+  - User wants fail-fast semantics (catch bugs, avoid silent failures) AND proper fallback.
+  - User wants non-blocking model downloads for SIF/agents.
+- **Primary RCA hypothesis**:
+  - Startup script blocks on model download, contradicting non-blocking requirement.
+  - Model size mismatch: startup downloads 3B, agents default to 1.5B.
+  - Fail-fast in `txtai_service.py` prevents fallback from working.
+  - Module separation (`txtai_service.py`, `sif_agents.py`, `start_alwrity_backend.py`) creates confusion.
+- **Evidence in code**:
+  - `start_alwrity_backend.py`:
+    - Line 122: `target_model = "Qwen/Qwen2.5-3B-Instruct"`
+    - Lines 127-131: `snapshot_download()` is a **blocking** call
+    - Lines 117-120: Skips on Render/Railway but **not** on local dev
+  - `sif_agents.py`:
+    - Line 48: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
+    - Line 141: Default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
+    - Line 150: Uses `LocalLLMWrapper` for lazy loading
+    - Lines 94-130: Has fallback logic with memory issue detection
+  - `txtai_service.py`:
+    - Line 57: `SIF_FAIL_FAST=true` (default) causes RuntimeError
+    - Lines 234-235, 284-285, 319-320: Fail-fast prevents fallback
+- **Likely files to inspect/edit later**:
+  - `start_alwrity_backend.py` (remove blocking download)
+  - `services/intelligence/sif_agents.py` (unify model defaults)
+  - `services/intelligence/txtai_service.py` (fix fail-fast with fallback)
+  - Create unified `services/intelligence/model_registry.py` or similar
+- **Confidence**: High
+- **Root-cause questions to answer**:
+  - Should model download be truly non-blocking (background thread)?
+  - Should fail-fast be conditional (e.g., only for critical paths, not background ops)?
+  - Should module unification create a single `ModelRegistry` or `ModelManager`?
+  - How can JSON/response structure compatibility be ensured across the fallback chain?
+- **Validation after fix**:
+  - Server starts without blocking on model download.
+  - SIF agents use consistent model defaults.
+  - Fail-fast catches bugs but allows fallback for non-critical ops.
+  - Single module handles all model loading logic.
+
+## Minimal Fix Paths (Pre-Implementation, Revised)
+
+### For Issue 3 (txtai init race) - REVISED:
+- **Option A**: Change `SIF_FAIL_FAST` to be **conditional** (not global)
+  - Keep fail-fast for critical paths (user-initiated ops)
+  - Allow graceful deferral for background ops (indexing, clustering)
+  - Requires distinguishing operation types
+- **Option B**: Use `_ensure_initialized_async()` for **blocking ops only**
+  - Keep non-blocking for background ops
+  - Awaits init for user-facing ops
+  - More robust but requires async refactoring
+- **Option C**: Add operation-type-aware fail-fast
+  - Pass `critical=True/False` to operations
+  - Fail-fast only when `critical=True`
+  - Most aligned with user requirements
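Reviewer sketch (not part of the committed file): Option C could look roughly like the following, where background work is queued until initialization finishes and only `critical=True` calls raise. `DeferredIndexer`, `pending`, and `mark_initialized` are hypothetical names, and the in-memory queue is an assumption about how deferral might be implemented.

```python
from collections import deque
from typing import Deque, List


class DeferredIndexer:
    """Illustrative only: fail fast for critical calls, defer the rest."""

    def __init__(self, fail_fast: bool = True) -> None:
        self.fail_fast = fail_fast
        self.initialized = False
        self.pending: Deque[str] = deque()
        self.indexed: List[str] = []

    def index_content(self, item: str, critical: bool = False) -> bool:
        if not self.initialized:
            if critical and self.fail_fast:
                # User-initiated work surfaces the problem immediately.
                raise RuntimeError("service not initialized")
            # Background work is parked until initialization completes.
            self.pending.append(item)
            return False
        self.indexed.append(item)
        return True

    def mark_initialized(self) -> None:
        """Flip the ready flag and drain everything that was deferred."""
        self.initialized = True
        while self.pending:
            self.indexed.append(self.pending.popleft())
```

The trade-off is that deferred items live only in memory, so a restart before `mark_initialized` runs would drop them.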
+
+### For Issue 5 (SIF agent local-model drift) - REVISED:
+- **Option A**: Change default to lighter model AND improve fallback chain
+  - Default: `Qwen/Qwen2.5-0.5B-Instruct` (lighter)
+  - Fallback: `0.5B → TinyLlama 1.1B`
+  - Ensure JSON/response structure compatibility
+- **Option B**: Add env/config override + keep fallback chain
+  - `SIF_AGENT_MODEL` env var
+  - Fallback chain remains as-is
+  - More flexible
+- **Option C**: Keep current default and rely on existing fallback chain
+  - **RECOMMENDED**: Already has memory detection and fallback
+  - Just need to ensure JSON compatibility
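Reviewer sketch (not part of the committed file): Options B and C can be combined by putting an optional override in front of the existing chain. The `SIF_AGENT_MODEL` variable is only proposed in Option B above and is not known to exist in the code; `candidate_models` is a hypothetical helper, and the fallback list is copied from the tracker's quote of `sif_agents.py`.

```python
import os
from typing import List, Sequence

# Copied from the tracker's quote of sif_agents.py.
LOCAL_LLM_FALLBACKS = [
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
]


def candidate_models(
    fallbacks: Sequence[str] = LOCAL_LLM_FALLBACKS,
    env_var: str = "SIF_AGENT_MODEL",  # proposed name, not an existing setting
) -> List[str]:
    """Return the models to try, with the env override (if any) first."""
    override = os.getenv(env_var, "").strip()
    ordered = ([override] if override else []) + list(fallbacks)
    seen = set()
    result: List[str] = []
    for model in ordered:
        if model not in seen:  # keep first occurrence, drop duplicates
            seen.add(model)
            result.append(model)
    return result
```

With the variable unset, the order is exactly the existing fallback chain, so the current default behavior would be unchanged.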
+
+### For Issue 6 (Model blocking + module unification):
+- **Option A**: Remove blocking download from startup script
+  - Delete `bootstrap_local_llm_models()` call
+  - Let `LocalLLMWrapper` handle lazy loading
+  - Minimal change, immediate non-blocking
+- **Option B**: Make download non-blocking (background thread)
+  - Keep pre-download but in background
+  - Server starts immediately
+  - More complex
+- **Option C**: Create unified `ModelRegistry` module
+  - Single source of truth for model defaults
+  - Centralized download/cache logic
+  - Eliminates confusion between modules
+  - **RECOMMENDED for long-term**
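Reviewer sketch (not part of the committed file): Options B and C together might take the shape below, with one registry holding the defaults and starting downloads on a background thread so the server can boot immediately. The class layout, the role keys, and the injected `download` callable are all assumptions; in the real project the callable would presumably wrap the `snapshot_download()` call cited above.

```python
import threading
from typing import Callable, Dict, Optional


class ModelRegistry:
    """Hypothetical single source of truth for model defaults and prefetch."""

    def __init__(
        self,
        defaults: Dict[str, str],
        download: Callable[[str], None],
    ) -> None:
        self._defaults = dict(defaults)
        self._download = download
        self._done: Dict[str, threading.Event] = {}
        self._errors: Dict[str, BaseException] = {}
        self._lock = threading.Lock()

    def default_for(self, role: str) -> str:
        return self._defaults[role]

    def prefetch(self, role: str) -> threading.Event:
        """Start a background download; the Event is set when it finishes."""
        model = self._defaults[role]
        with self._lock:
            if model in self._done:
                return self._done[model]  # already started: reuse the Event
            done = self._done[model] = threading.Event()

        def worker() -> None:
            try:
                self._download(model)
            except Exception as exc:
                self._errors[model] = exc  # recorded for the caller to inspect
            finally:
                done.set()

        threading.Thread(target=worker, daemon=True).start()
        return done

    def error_for(self, role: str) -> Optional[BaseException]:
        return self._errors.get(self._defaults[role])
```

Recording the exception instead of swallowing it keeps the fail-fast intent: a caller that needs the model can wait on the Event and then raise whatever `error_for` returns.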
+
+## Session Update Log
+
+### 2026-03-10
+
+- Created initial RCA tracker document.
+- Seeded first three concrete issues from supplied logs.
+- No fixes applied from this document yet.
+- Added Issue 5: SIF agent local-model drift (LiteLLM dropped as a red herring).
+- Refined Issue 3 with `SIF_FAIL_FAST` behavior details.
+- Added minimal fix paths for Issues 3 and 5.
+- Added Issue 6: Model initialization blocking and module unification.
+- Updated minimal fix paths based on user requirements (fail-fast + fallback, non-blocking, unification).