18 KiB
18 KiB
Backend Log RCA Tracker
Purpose
This document is the working catalog for backend issues observed in runtime logs.
For each issue, capture:
- error signature
- observed symptoms
- likely root cause analysis
- confidence level
- files to inspect/edit
- fix strategy notes
- validation steps
- status
Triage Rules
- Do not fix directly from logs alone unless root cause is confirmed.
- Prefer grouping repeated log lines under one issue.
- Track the first failing subsystem, then downstream effects.
- Separate configuration problems from code defects.
- Keep this document updated before and after each fix.
Issue 1: Clerk token verification failures on authenticated endpoints
- Status: Open
- Severity: High
- Subsystem: Authentication / request pipeline
- Error signatures:
Unverified token rejected (production).AUTHENTICATION ERROR: Token verification failed for endpoint: GET /api/...
- Observed endpoints in logs:
/api/content-planning/monitoring/lightweight-stats/api/content-planning/monitoring/health/api/subscription/dashboard/.../api/subscription/alerts/.../api/subscription/status/...
- Observed behavior:
- Requests reach authenticated endpoints.
- Clerk verification fails.
- Fallback unverified decode path is attempted.
- Production mode rejects the token.
- Primary RCA hypothesis:
- The backend is receiving bearer tokens that do not successfully validate against the resolved Clerk JWKS/issuer configuration.
- The middleware then falls back to unverified decode, but production mode explicitly rejects that path.
- Secondary RCA hypotheses:
- Frontend token/audience/issuer mismatch.
- Wrong Clerk environment variables loaded in backend.
- Issuer-derived JWKS URL resolution is inconsistent with actual Clerk instance.
- Requests may be sent before a valid session token is available.
- Evidence in code:
backend/middleware/auth_middleware.pyClerkAuthMiddleware.__init__ClerkAuthMiddleware.verify_tokenget_current_user
- Relevant logic:
- derives JWKS URL from token issuer or cached publishable key instance
- falls back to
jwt.decode(..., verify_signature=False) - rejects unverified tokens when
ALLOW_UNVERIFIED_JWT_DEVis false
- Likely files to inspect/edit later:
backend/middleware/auth_middleware.py- possibly frontend auth/session request layer if token attachment is inconsistent
- Confidence: Medium
- Root-cause questions to answer:
- Are
CLERK_SECRET_KEYand publishable key values from the same Clerk instance? - Is the token issuer exactly matching the intended Clerk environment?
- Are failing requests sent with stale, dev, or cross-environment tokens?
- Are these requests triggered before Clerk session hydration on the frontend?
- Are
- Validation after fix:
- Authenticated endpoints return 200 with verified user context.
- No
Unverified token rejected (production)log spam for healthy requests.
Issue 2: Hugging Face structured JSON generation failing with model not found
- Status: Open
- Severity: High
- Subsystem: LLM provider / workflow generation
- Error signatures:
HF structured model not found: %s. Trying fallback model.Hugging Face API call failed: Not FoundHF structured model not found (no response_format path): %sHugging Face structured JSON generation failed: NotFoundError: Not Found[llm_text_gen] Provider huggingface failed: RetryError[...]
- Observed behavior:
- Structured JSON call tries primary model.
- Fallback model sequence also fails.
- Retry without
response_formatstill fails withNotFound. - Upstream caller falls through to another provider or fallback path.
- Primary RCA hypothesis:
- The configured Hugging Face model identifier is invalid, unavailable to the account/provider, or incompatible with the current OpenAI-compatible Hugging Face endpoint.
- Secondary RCA hypotheses:
- Base URL/API key/provider configuration is wrong.
- Fallback model list contains provider-specific model ids not available in the current account/region.
- Structured generation path assumes chat completions support for models that only exist on a different inference route.
- Evidence in code:
backend/services/llm_providers/huggingface_provider.py_fallback_model_sequencehuggingface_structured_json_response
- The code retries:
- with
response_format={"type": "json_object"} - then again without
response_format
- with
- Both paths still fail with
NotFoundError, which points more strongly to model/base-url availability than schema formatting.
- Likely files to inspect/edit later:
backend/services/llm_providers/huggingface_provider.py- provider selection/orchestration file calling Hugging Face as primary for structured JSON
- environment/config file for HF model names and API base URL
- Confidence: High
- Root-cause questions to answer:
- Which exact model string is being passed as the primary model in the failing call?
- What base URL and API key are being used for the OpenAI client?
- Are the fallback model ids valid for the currently configured Hugging Face inference provider?
- Validation after fix:
- A structured JSON test request succeeds with the intended model or a verified fallback.
- No
NotFoundErrorfor the chosen model list.
Issue 3: txtai indexing attempted before service initialization completes
- Status: Open
- Severity: Medium
- Subsystem: Semantic indexing / background tasks
- Error signatures:
Cannot index content - service not initialized for user ...
- Observed behavior:
- Background indexing is triggered.
TxtaiIntelligenceService.index_contentcalls_ensure_initialized()._ensure_initialized()starts background initialization and returns immediately.index_contentthen checks_initialized, sees false, and fails fast.
- Primary RCA hypothesis:
- There is a race condition between lazy background initialization and immediate indexing/search calls.
SIF_FAIL_FAST=true(default) causes operations to raise RuntimeError instead of gracefully deferring.
- Evidence in code:
backend/services/intelligence/txtai_service.py:- Line 57:
self.fail_fast = str(os.getenv("SIF_FAIL_FAST", "true")).lower() in {"1", "true", "yes", "on"} - Lines 234-235:
index_contentraises RuntimeError iffail_fastand not initialized - Lines 284-285:
searchraises RuntimeError iffail_fastand not initialized - Lines 319-320:
get_similarityraises RuntimeError iffail_fastand not initialized _ensure_initializedis intentionally non-blocking (starts background thread)
- Line 57:
backend/api/today_workflow.py:_index_tasks_to_siftriggers indexing in background after workflow actions
- Likely files to inspect/edit later:
backend/services/intelligence/txtai_service.pybackend/api/today_workflow.py- any other callers that assume initialization is synchronous
- Confidence: High
- Potential downstream impact:
- workflow/task indexing silently fails
- semantic search quality degrades
- noisy logs obscure higher-priority failures
- Root-cause questions to answer:
- Should
index_contentawait_ensure_initialized_async()instead of using the non-blocking path? - Should callers tolerate deferred indexing instead of fail-fast behavior?
- Is
SIF_FAIL_FAST=trueappropriate for background indexing operations? - Should
SIF_FAIL_FASTdefault tofalsefor background operations?
- Should
- Validation after fix:
- First indexing call after startup succeeds or is gracefully deferred without error spam.
Issue 4: Today workflow endpoint reload observed during active debugging
- Status: Observed
- Severity: Low
- Subsystem: Development reload / workflow API
- Log signature:
StatReload detected changes in 'api\today_workflow.py'. Reloading...
- Observed behavior:
- Development server reloads due to file edits.
- RCA:
- Expected dev-server behavior, not itself a product bug.
- Files involved:
backend/api/today_workflow.py
- Confidence: High
- Action:
- No fix needed; keep separate from actual runtime defects.
Cross-Issue Notes
- The auth failures and the workflow/indexing issues may be independent.
- The Hugging Face failure may trigger fallback task generation, which can still create workflows while hiding the upstream provider problem.
- txtai indexing failures appear to be a post-generation side effect, not the root cause of generation failure.
- LiteLLM was investigated and dropped as a false herring – no project-level SIF/txtai wiring to LiteLLM was found.
- The SIF agent local-model path is separate from txtai embeddings and may be the source of the "local model used to work" feedback.
Candidate Investigation Order
- Authentication verification mismatch
- Hugging Face model/provider availability mismatch
- txtai initialization race (with
SIF_FAIL_FASTbehavior) - SIF agent local-model defaults (Qwen 1.5B vs lighter alternatives)
- Any downstream workflow symptoms after the above are stabilized
Minimal Fix Paths (Pre-Implementation)
For Issue 3 (txtai init race):
- Option A: Change
SIF_FAIL_FASTdefault tofalsefor background operations- Allows graceful deferral instead of RuntimeError
- Minimal code change, no logic changes
- Option B: Use
_ensure_initialized_async()inindex_content/search/get_similarity- Awaits initialization before proceeding
- More robust but requires async refactoring
- Option C: Add initialization state callbacks to callers
- More complex, may not be necessary
For Issue 5 (SIF agent local-model drift):
- Option A: Change default
model_nameinSIFBaseAgent.__init__to lighter model- Example:
Qwen/Qwen2.5-0.5B-InstructorTinyLlama/TinyLlama-1.1B-Chat-v1.0 - Single-line change, immediate effect
- Example:
- Option B: Add env/config override for default agent local model
- More flexible, requires config wiring
- Allows runtime tuning without code changes
- Option C: Keep current default and rely on existing fallback chain
- The fallback chain already tries lighter models if memory fails
- May be sufficient if memory detection works correctly
Current Evidence Sources
- Runtime logs from terminal
pythonprocess22056 backend/middleware/auth_middleware.pybackend/services/llm_providers/huggingface_provider.pybackend/services/intelligence/txtai_service.pybackend/api/today_workflow.pybackend/services/today_workflow_service.py
Issue 5: SIF agent local-model drift (distinct from txtai embeddings)
- Status: Open
- Severity: Medium
- Subsystem: SIF agents / local LLM wrappers
- Error signatures:
- (No direct log signature yet; this is a hypothesis from user feedback that "local model used to work")
- Observed behavior:
- User reports that a local model used to work for SIF agents now seems heavier or less responsive.
- The SIF agent path is separate from txtai embeddings.
- Primary RCA hypothesis:
- The SIF agent local LLM wrapper path uses a 1.5B parameter model by default, which may be heavier than the previous local model.
- This is distinct from txtai embeddings, which still use
sentence-transformers/all-MiniLM-L6-v2.
- Evidence in code:
backend/services/intelligence/sif_agents.py:- Lines 47-51:
LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"] - Lines 53-139:
LocalLLMWrappertries models in order, with memory issue detection and automatic fallback to smaller models - Line 141:
SIFBaseAgent.__init__defaultmodel_name="Qwen/Qwen2.5-1.5B-Instruct"
- Lines 47-51:
backend/services/intelligence/txtai_service.py:- Line 48: Still uses
sentence-transformers/all-MiniLM-L6-v2for embeddings
- Line 48: Still uses
- Likely files to inspect/edit later:
backend/services/intelligence/sif_agents.pybackend/services/intelligence/agents/specialized/base.py- any config/env that controls default agent local model
- Confidence: Medium
- Root-cause questions to answer:
- What was the previous local model default for SIF agents?
- Is
Qwen/Qwen2.5-1.5B-Instructactually too heavy for the user’s laptop? - Should the default be changed to
Qwen/Qwen2.5-0.5B-InstructorTinyLlama/TinyLlama-1.1B-Chat-v1.0? - Are there any env/config overrides that could make this configurable?
- Validation after fix:
- SIF agents use a CPU-friendly local model (e.g., smaller Qwen variant or TinyLlama).
- Agent generation completes without excessive CPU/memory pressure.
Issue 6: Model initialization blocking and module unification
- Status: Open
- Severity: High
- Subsystem: Startup / model loading / module architecture
- Error signatures:
- (No direct log signature; architectural issue)
- Observed behavior:
start_alwrity_backend.pypre-downloadsQwen/Qwen2.5-3B-Instructsynchronously before server starts (line 122).sif_agents.pydefaults toQwen/Qwen2.5-1.5B-Instructand uses lazy loading viaLocalLLMWrapper.txtai_service.pyusessentence-transformers/all-MiniLM-L6-v2for embeddings.- Three separate modules handle model loading, creating confusion.
- User wants fail-fast semantics (catch bugs, avoid silent failures) AND proper fallback.
- User wants non-blocking model downloads for SIF/agents.
- Primary RCA hypothesis:
- Startup script blocks on model download, contradicting non-blocking requirement.
- Model size mismatch: startup downloads 3B, agents default to 1.5B.
- Fail-fast in
txtai_service.pyprevents fallback from working. - Module separation (
txtai_service.py,sif_agents.py,start_alwrity_backend.py) creates confusion.
- Evidence in code:
start_alwrity_backend.py:- Line 122:
target_model = "Qwen/Qwen2.5-3B-Instruct" - Lines 127-131:
snapshot_download()is blocking call - Lines 117-120: Skips on Render/Railway but not on local dev
- Line 122:
sif_agents.py:- Line 48:
LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"] - Line 141: Default
model_name="Qwen/Qwen2.5-1.5B-Instruct" - Line 150: Uses
LocalLLMWrapperfor lazy loading - Lines 94-130: Has fallback logic with memory issue detection
- Line 48:
txtai_service.py:- Line 57:
SIF_FAIL_FAST=true(default) causes RuntimeError - Lines 234-235, 284-285, 319-320: Fail-fast prevents fallback
- Line 57:
- Likely files to inspect/edit later:
start_alwrity_backend.py(remove blocking download)services/intelligence/sif_agents.py(unify model defaults)services/intelligence/txtai_service.py(fix fail-fast with fallback)- Create unified
services/intelligence/model_registry.pyor similar
- Confidence: High
- Root-cause questions to answer:
- Should model download be truly non-blocking (background thread)?
- Should fail-fast be conditional (e.g., only for critical paths, not background ops)?
- Should module unification create a single
ModelRegistryorModelManager? - How to ensure JSON/response structure compatibility across fallback chain?
- Validation after fix:
- Server starts without blocking on model download.
- SIF agents use consistent model defaults.
- Fail-fast catches bugs but allows fallback for non-critical ops.
- Single module handles all model loading logic.
Minimal Fix Paths (Pre-Implementation)
For Issue 3 (txtai init race) - REVISED:
- Option A: Change
SIF_FAIL_FASTto be conditional (not global)- Keep fail-fast for critical paths (user-initiated ops)
- Allow graceful deferral for background ops (indexing, clustering)
- Requires distinguishing operation types
- Option B: Use
_ensure_initialized_async()for blocking ops only- Keep non-blocking for background ops
- Awaits init for user-facing ops
- More robust but requires async refactoring
- Option C: Add operation-type-aware fail-fast
- Pass
critical=True/Falseto operations - Fail-fast only when
critical=True - Most aligned with user requirements
- Pass
For Issue 5 (SIF agent local-model drift) - REVISED:
- Option A: Change default to lighter model AND improve fallback chain
- Default:
Qwen/Qwen2.5-0.5B-Instruct(lighter) - Fallback:
0.5B → TinyLlama 1.1B - Ensure JSON/response structure compatibility
- Default:
- Option B: Add env/config override + keep fallback chain
SIF_AGENT_MODELenv var- Fallback chain remains as-is
- More flexible
- Option C: Keep current default and rely on existing fallback chain
- RECOMMENDED: Already has memory detection and fallback
- Just need to ensure JSON compatibility
For Issue 6 (Model blocking + module unification):
- Option A: Remove blocking download from startup script
- Delete
bootstrap_local_llm_models()call - Let
LocalLLMWrapperhandle lazy loading - Minimal change, immediate non-blocking
- Delete
- Option B: Make download non-blocking (background thread)
- Keep pre-download but in background
- Server starts immediately
- More complex
- Option C: Create unified
ModelRegistrymodule- Single source of truth for model defaults
- Centralized download/cache logic
- Eliminates confusion between modules
- RECOMMENDED for long-term
Session Update Log
2026-03-10
- Created initial RCA tracker document.
- Seeded first three concrete issues from supplied logs.
- No fixes applied from this document yet.
- Added Issue 5: SIF agent local-model drift (LiteLLM dropped as false herring).
- Refined Issue 3 with
SIF_FAIL_FASTbehavior details. - Added minimal fix paths for Issues 3 and 5.
- Added Issue 6: Model initialization blocking and module unification.
- Updated minimal fix paths based on user requirements (fail-fast + fallback, non-blocking, unification).