Commit_all_local_changes_after_PR_406_merge

This commit is contained in:
ajaysi
2026-03-10 17:01:36 +05:30
parent f78b5f1e04
commit 8c2d88efb9
17 changed files with 936 additions and 412 deletions

View File

@@ -0,0 +1,361 @@
# Backend Log RCA Tracker
## Purpose
This document is the working catalog for backend issues observed in runtime logs.
For each issue, capture:
- error signature
- observed symptoms
- likely root cause analysis
- confidence level
- files to inspect/edit
- fix strategy notes
- validation steps
- status
## Triage Rules
- Do not fix directly from logs alone unless root cause is confirmed.
- Prefer grouping repeated log lines under one issue.
- Track the first failing subsystem, then downstream effects.
- Separate configuration problems from code defects.
- Keep this document updated before and after each fix.
## Issue 1: Clerk token verification failures on authenticated endpoints
- **Status**: Open
- **Severity**: High
- **Subsystem**: Authentication / request pipeline
- **Error signatures**:
- `Unverified token rejected (production).`
- `AUTHENTICATION ERROR: Token verification failed for endpoint: GET /api/...`
- **Observed endpoints in logs**:
- `/api/content-planning/monitoring/lightweight-stats`
- `/api/content-planning/monitoring/health`
- `/api/subscription/dashboard/...`
- `/api/subscription/alerts/...`
- `/api/subscription/status/...`
- **Observed behavior**:
- Requests reach authenticated endpoints.
- Clerk verification fails.
- Fallback unverified decode path is attempted.
- Production mode rejects the token.
- **Primary RCA hypothesis**:
- The backend is receiving bearer tokens that do not successfully validate against the resolved Clerk JWKS/issuer configuration.
- The middleware then falls back to unverified decode, but production mode explicitly rejects that path.
- **Secondary RCA hypotheses**:
- Frontend token/audience/issuer mismatch.
- Wrong Clerk environment variables loaded in backend.
- Issuer-derived JWKS URL resolution is inconsistent with actual Clerk instance.
- Requests may be sent before a valid session token is available.
- **Evidence in code**:
- `backend/middleware/auth_middleware.py`
- `ClerkAuthMiddleware.__init__`
- `ClerkAuthMiddleware.verify_token`
- `get_current_user`
- Relevant logic:
- derives JWKS URL from token issuer or cached publishable key instance
- falls back to `jwt.decode(..., verify_signature=False)`
- rejects unverified tokens when `ALLOW_UNVERIFIED_JWT_DEV` is false
- **Likely files to inspect/edit later**:
- `backend/middleware/auth_middleware.py`
- possibly frontend auth/session request layer if token attachment is inconsistent
- **Confidence**: Medium
- **Root-cause questions to answer**:
- Are `CLERK_SECRET_KEY` and publishable key values from the same Clerk instance?
- Is the token issuer exactly matching the intended Clerk environment?
- Are failing requests sent with stale, dev, or cross-environment tokens?
- Are these requests triggered before Clerk session hydration on the frontend?
- **Validation after fix**:
- Authenticated endpoints return 200 with verified user context.
- No `Unverified token rejected (production)` log spam for healthy requests.
## Issue 2: Hugging Face structured JSON generation failing with model not found
- **Status**: Open
- **Severity**: High
- **Subsystem**: LLM provider / workflow generation
- **Error signatures**:
- `HF structured model not found: %s. Trying fallback model.`
- `Hugging Face API call failed: Not Found`
- `HF structured model not found (no response_format path): %s`
- `Hugging Face structured JSON generation failed: NotFoundError: Not Found`
- `[llm_text_gen] Provider huggingface failed: RetryError[...]`
- **Observed behavior**:
- Structured JSON call tries primary model.
- Fallback model sequence also fails.
- Retry without `response_format` still fails with `NotFound`.
- Upstream caller falls through to another provider or fallback path.
- **Primary RCA hypothesis**:
- The configured Hugging Face model identifier is invalid, unavailable to the account/provider, or incompatible with the current OpenAI-compatible Hugging Face endpoint.
- **Secondary RCA hypotheses**:
- Base URL/API key/provider configuration is wrong.
- Fallback model list contains provider-specific model ids not available in the current account/region.
- Structured generation path assumes chat completions support for models that only exist on a different inference route.
- **Evidence in code**:
- `backend/services/llm_providers/huggingface_provider.py`
- `_fallback_model_sequence`
- `huggingface_structured_json_response`
- The code retries:
- with `response_format={"type": "json_object"}`
- then again without `response_format`
- Both paths still fail with `NotFoundError`, which points more strongly to model/base-url availability than schema formatting.
- **Likely files to inspect/edit later**:
- `backend/services/llm_providers/huggingface_provider.py`
- provider selection/orchestration file calling Hugging Face as primary for structured JSON
- environment/config file for HF model names and API base URL
- **Confidence**: High
- **Root-cause questions to answer**:
- Which exact model string is being passed as the primary model in the failing call?
- What base URL and API key are being used for the OpenAI client?
- Are the fallback model ids valid for the currently configured Hugging Face inference provider?
- **Validation after fix**:
- A structured JSON test request succeeds with the intended model or a verified fallback.
- No `NotFoundError` for the chosen model list.
## Issue 3: txtai indexing attempted before service initialization completes
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: Semantic indexing / background tasks
- **Error signatures**:
- `Cannot index content - service not initialized for user ...`
- **Observed behavior**:
- Background indexing is triggered.
- `TxtaiIntelligenceService.index_content` calls `_ensure_initialized()`.
- `_ensure_initialized()` starts background initialization and returns immediately.
- `index_content` then checks `_initialized`, sees false, and fails fast.
- **Primary RCA hypothesis**:
- There is a race condition between lazy background initialization and immediate indexing/search calls.
- `SIF_FAIL_FAST=true` (default) causes operations to raise RuntimeError instead of gracefully deferring.
- **Evidence in code**:
- `backend/services/intelligence/txtai_service.py`:
- Line 57: `self.fail_fast = str(os.getenv("SIF_FAIL_FAST", "true")).lower() in {"1", "true", "yes", "on"}`
- Lines 234-235: `index_content` raises RuntimeError if `fail_fast` and not initialized
- Lines 284-285: `search` raises RuntimeError if `fail_fast` and not initialized
- Lines 319-320: `get_similarity` raises RuntimeError if `fail_fast` and not initialized
- `_ensure_initialized` is intentionally non-blocking (starts background thread)
- `backend/api/today_workflow.py`:
- `_index_tasks_to_sif` triggers indexing in background after workflow actions
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- any other callers that assume initialization is synchronous
- **Confidence**: High
- **Potential downstream impact**:
- workflow/task indexing silently fails
- semantic search quality degrades
- noisy logs obscure higher-priority failures
- **Root-cause questions to answer**:
- Should `index_content` await `_ensure_initialized_async()` instead of using the non-blocking path?
- Should callers tolerate deferred indexing instead of fail-fast behavior?
- Is `SIF_FAIL_FAST=true` appropriate for background indexing operations?
- Should `SIF_FAIL_FAST` default to `false` for background operations?
- **Validation after fix**:
- First indexing call after startup succeeds or is gracefully deferred without error spam.
## Issue 4: Today workflow endpoint reload observed during active debugging
- **Status**: Observed
- **Severity**: Low
- **Subsystem**: Development reload / workflow API
- **Log signature**:
- `StatReload detected changes in 'api\today_workflow.py'. Reloading...`
- **Observed behavior**:
- Development server reloads due to file edits.
- **RCA**:
- Expected dev-server behavior, not itself a product bug.
- **Files involved**:
- `backend/api/today_workflow.py`
- **Confidence**: High
- **Action**:
- No fix needed; keep separate from actual runtime defects.
## Cross-Issue Notes
- The auth failures and the workflow/indexing issues may be independent.
- The Hugging Face failure may trigger fallback task generation, which can still create workflows while hiding the upstream provider problem.
- txtai indexing failures appear to be a post-generation side effect, not the root cause of generation failure.
- **LiteLLM was investigated and dropped as a false herring** no project-level SIF/txtai wiring to LiteLLM was found.
- The SIF agent local-model path is **separate** from txtai embeddings and may be the source of the "local model used to work" feedback.
## Candidate Investigation Order
1. Authentication verification mismatch
2. Hugging Face model/provider availability mismatch
3. txtai initialization race (with `SIF_FAIL_FAST` behavior)
4. SIF agent local-model defaults (Qwen 1.5B vs lighter alternatives)
5. Any downstream workflow symptoms after the above are stabilized
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race):
- **Option A**: Change `SIF_FAIL_FAST` default to `false` for background operations
- Allows graceful deferral instead of RuntimeError
- Minimal code change, no logic changes
- **Option B**: Use `_ensure_initialized_async()` in `index_content`/`search`/`get_similarity`
- Awaits initialization before proceeding
- More robust but requires async refactoring
- **Option C**: Add initialization state callbacks to callers
- More complex, may not be necessary
### For Issue 5 (SIF agent local-model drift):
- **Option A**: Change default `model_name` in `SIFBaseAgent.__init__` to lighter model
- Example: `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`
- Single-line change, immediate effect
- **Option B**: Add env/config override for default agent local model
- More flexible, requires config wiring
- Allows runtime tuning without code changes
- **Option C**: Keep current default and rely on existing fallback chain
- The fallback chain already tries lighter models if memory fails
- May be sufficient if memory detection works correctly
## Current Evidence Sources
- Runtime logs from terminal `python` process `22056`
- `backend/middleware/auth_middleware.py`
- `backend/services/llm_providers/huggingface_provider.py`
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- `backend/services/today_workflow_service.py`
## Issue 5: SIF agent local-model drift (distinct from txtai embeddings)
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: SIF agents / local LLM wrappers
- **Error signatures**:
- (No direct log signature yet; this is a hypothesis from user feedback that "local model used to work")
- **Observed behavior**:
- User reports that a local model used to work for SIF agents now seems heavier or less responsive.
- The SIF agent path is **separate** from txtai embeddings.
- **Primary RCA hypothesis**:
- The SIF agent local LLM wrapper path uses a 1.5B parameter model by default, which may be heavier than the previous local model.
- This is distinct from txtai embeddings, which still use `sentence-transformers/all-MiniLM-L6-v2`.
- **Evidence in code**:
- `backend/services/intelligence/sif_agents.py`:
- Lines 47-51: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Lines 53-139: `LocalLLMWrapper` tries models in order, with memory issue detection and automatic fallback to smaller models
- Line 141: `SIFBaseAgent.__init__` default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- `backend/services/intelligence/txtai_service.py`:
- Line 48: Still uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/sif_agents.py`
- `backend/services/intelligence/agents/specialized/base.py`
- any config/env that controls default agent local model
- **Confidence**: Medium
- **Root-cause questions to answer**:
- What was the previous local model default for SIF agents?
- Is `Qwen/Qwen2.5-1.5B-Instruct` actually too heavy for the users laptop?
- Should the default be changed to `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`?
- Are there any env/config overrides that could make this configurable?
- **Validation after fix**:
- SIF agents use a CPU-friendly local model (e.g., smaller Qwen variant or TinyLlama).
- Agent generation completes without excessive CPU/memory pressure.
## Issue 6: Model initialization blocking and module unification
- **Status**: Open
- **Severity**: High
- **Subsystem**: Startup / model loading / module architecture
- **Error signatures**:
- (No direct log signature; architectural issue)
- **Observed behavior**:
- `start_alwrity_backend.py` pre-downloads `Qwen/Qwen2.5-3B-Instruct` **synchronously** before server starts (line 122).
- `sif_agents.py` defaults to `Qwen/Qwen2.5-1.5B-Instruct` and uses lazy loading via `LocalLLMWrapper`.
- `txtai_service.py` uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
- Three separate modules handle model loading, creating confusion.
- User wants fail-fast semantics (catch bugs, avoid silent failures) AND proper fallback.
- User wants non-blocking model downloads for SIF/agents.
- **Primary RCA hypothesis**:
- Startup script blocks on model download, contradicting non-blocking requirement.
- Model size mismatch: startup downloads 3B, agents default to 1.5B.
- Fail-fast in `txtai_service.py` prevents fallback from working.
- Module separation (`txtai_service.py`, `sif_agents.py`, `start_alwrity_backend.py`) creates confusion.
- **Evidence in code**:
- `start_alwrity_backend.py`:
- Line 122: `target_model = "Qwen/Qwen2.5-3B-Instruct"`
- Lines 127-131: `snapshot_download()` is **blocking** call
- Lines 117-120: Skips on Render/Railway but **not** on local dev
- `sif_agents.py`:
- Line 48: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Line 141: Default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- Line 150: Uses `LocalLLMWrapper` for lazy loading
- Lines 94-130: Has fallback logic with memory issue detection
- `txtai_service.py`:
- Line 57: `SIF_FAIL_FAST=true` (default) causes RuntimeError
- Lines 234-235, 284-285, 319-320: Fail-fast prevents fallback
- **Likely files to inspect/edit later**:
- `start_alwrity_backend.py` (remove blocking download)
- `services/intelligence/sif_agents.py` (unify model defaults)
- `services/intelligence/txtai_service.py` (fix fail-fast with fallback)
- Create unified `services/intelligence/model_registry.py` or similar
- **Confidence**: High
- **Root-cause questions to answer**:
- Should model download be truly non-blocking (background thread)?
- Should fail-fast be conditional (e.g., only for critical paths, not background ops)?
- Should module unification create a single `ModelRegistry` or `ModelManager`?
- How to ensure JSON/response structure compatibility across fallback chain?
- **Validation after fix**:
- Server starts without blocking on model download.
- SIF agents use consistent model defaults.
- Fail-fast catches bugs but allows fallback for non-critical ops.
- Single module handles all model loading logic.
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race) - REVISED:
- **Option A**: Change `SIF_FAIL_FAST` to be **conditional** (not global)
- Keep fail-fast for critical paths (user-initiated ops)
- Allow graceful deferral for background ops (indexing, clustering)
- Requires distinguishing operation types
- **Option B**: Use `_ensure_initialized_async()` for **blocking ops only**
- Keep non-blocking for background ops
- Awaits init for user-facing ops
- More robust but requires async refactoring
- **Option C**: Add operation-type-aware fail-fast
- Pass `critical=True/False` to operations
- Fail-fast only when `critical=True`
- Most aligned with user requirements
### For Issue 5 (SIF agent local-model drift) - REVISED:
- **Option A**: Change default to lighter model AND improve fallback chain
- Default: `Qwen/Qwen2.5-0.5B-Instruct` (lighter)
- Fallback: `0.5B → TinyLlama 1.1B`
- Ensure JSON/response structure compatibility
- **Option B**: Add env/config override + keep fallback chain
- `SIF_AGENT_MODEL` env var
- Fallback chain remains as-is
- More flexible
- **Option C**: Keep current default and rely on existing fallback chain
- **RECOMMENDED**: Already has memory detection and fallback
- Just need to ensure JSON compatibility
### For Issue 6 (Model blocking + module unification):
- **Option A**: Remove blocking download from startup script
- Delete `bootstrap_local_llm_models()` call
- Let `LocalLLMWrapper` handle lazy loading
- Minimal change, immediate non-blocking
- **Option B**: Make download non-blocking (background thread)
- Keep pre-download but in background
- Server starts immediately
- More complex
- **Option C**: Create unified `ModelRegistry` module
- Single source of truth for model defaults
- Centralized download/cache logic
- Eliminates confusion between modules
- **RECOMMENDED for long-term**
## Session Update Log
### 2026-03-10
- Created initial RCA tracker document.
- Seeded first three concrete issues from supplied logs.
- No fixes applied from this document yet.
- Added Issue 5: SIF agent local-model drift (LiteLLM dropped as false herring).
- Refined Issue 3 with `SIF_FAIL_FAST` behavior details.
- Added minimal fix paths for Issues 3 and 5.
- Added Issue 6: Model initialization blocking and module unification.
- Updated minimal fix paths based on user requirements (fail-fast + fallback, non-blocking, unification).