Commit_all_local_changes_after_PR_406_merge

ajaysi
2026-03-10 17:01:36 +05:30
parent f78b5f1e04
commit 8c2d88efb9
17 changed files with 936 additions and 412 deletions


@@ -1,316 +0,0 @@
# ALwrity Daily Workflow PR Merge Summary
**Date:** March 9, 2026
**Session Goal:** Review and integrate workflow enhancement PRs (#388-397)
**Status:** ✅ COMPLETED - 9 PRs successfully merged
---
## Successfully Merged PRs (9 Total)
### Core Workflow Enhancement Series
| # | Title | Commit | Key Improvements |
|---|-------|--------|-----------------|
| #388 | Daily Workflow Integration & Enhanced Reliability | 8f6ed3a | Agent committee orchestration, robust task proposal handling, metadata normalization |
| #389 | Committee Health Precheck & Simplified Architecture | 3558131 | Simplified schema, health precheck, removed complex dependency coercion |
| #390 | Degraded-mode Workflow Regeneration Criteria | 56854df | Rate-limited `/regenerate` endpoint (3 req/60s), quality score tracking |
| #391 | Workflow Provenance Quality Metrics | 2d4c83e | Provenance classification (agent vs fallback), quality ratio calculation |
| #392 | Contextuality Validation & Low-context Status | 74b788a | Evidence-link grounding, plan contextuality scoring (65% threshold) |
| #394 | Task Memory Feedback Scoring | 38444f4 | Proper self-learning: uses persisted task.status, handles all negative cases |
| #395 | Dependencies Normalization | 0aaaf07 | Robust `_normalize_dependencies()` helper for consistent data types |
| #396 | Date Validation & Error Handling | 9271566 | ISO date validation before yesterday indexing, narrower SQLAlchemyError handling |
| #397 | Typed Request Model for Task Status | 39bc3e3 | Pydantic `TaskStatusEnum` & `TaskStatusUpdateRequest`, FastAPI auto-validation |
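
The 3 requests per 60 seconds limit that PR #390 places on `/regenerate` can be pictured as a sliding-window limiter. The following is a minimal sketch, assuming an in-memory, per-user window; the class `SlidingWindowLimiter` and its `allow()` method are hypothetical names for illustration and are not taken from the merged code.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` calls per `window` seconds per key."""

    def __init__(
        self,
        limit: int = 3,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would typically answer with HTTP 429
        hits.append(now)
        return True
```

An endpoint handler could call `limiter.allow(user_id)` before regenerating and reject the request when it returns `False`. An in-memory window only holds for a single process; a multi-worker deployment would need shared storage.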
---
## System Architecture Evolution
### From Simple to Sophisticated
```
PR #388 ─→ Agent Committee Orchestration
PR #389 ─→ Clean Architecture
PR #390 ─→ Regeneration Control
PR #391 ─→ Quality Awareness
PR #392 ─→ Evidence-Based Grounding
PR #394 ─→ Proper Memory Learning
PR #395 ─→ Data Consistency
PR #396 ─→ Production Observability
PR #397 ─→ API Type Safety
```
---
## Key Features Implemented
### 1. **Agent Committee (PR #388)**
- Multi-agent orchestration with 5 specialized agents:
  - ContentStrategyAgent
  - StrategyArchitectAgent
  - SEOOptimizationAgent
  - SocialAmplificationAgent
  - CompetitorResponseAgent
- Parallel proposal gathering with exception safety
- Deduplication by priority and semantic ordering
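Parallel proposal gathering with exception safety can be sketched with `asyncio.gather(..., return_exceptions=True)`, so one failing agent does not discard the others' proposals. This is an illustration only: the function names, the proposal shape (`title`, `priority`), and the use of a normalized title as a stand-in for semantic deduplication are all assumptions, not the committee's actual logic.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Proposal = Dict[str, Any]
AgentFn = Callable[[], Awaitable[List[Proposal]]]


async def gather_proposals(agents: Dict[str, AgentFn]) -> List[Proposal]:
    """Run every agent concurrently; a failing agent contributes no proposals."""
    names = list(agents)
    results = await asyncio.gather(
        *(agents[name]() for name in names), return_exceptions=True
    )
    proposals: List[Proposal] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            continue  # exception safety: skip this agent, keep the rest
        for item in result:
            proposals.append({**item, "source_agent": name})
    return proposals


def dedupe_by_priority(proposals: List[Proposal]) -> List[Proposal]:
    """Keep the highest-priority proposal per normalized title, highest first."""
    best: Dict[str, Proposal] = {}
    for p in proposals:
        key = str(p.get("title", "")).strip().lower()
        if key not in best or p.get("priority", 0) > best[key].get("priority", 0):
            best[key] = p
    return sorted(best.values(), key=lambda p: p.get("priority", 0), reverse=True)
```

A real semantic deduplication step would likely compare meaning rather than exact titles; the title key here only keeps the sketch short.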
### 2. **Contextuality Validation (PR #392)**
- Evidence-link framework:
  - `onboarding:{field_name}` references
  - `alert:{alert_id}` references
- Task contextuality scoring: minimum 1 evidence link
- Plan contextuality threshold: at least 65% of tasks must meet the task-level minimum
- Automatic strict regeneration for low-context plans
- Response fields: `quality_status`, `contextuality_validation`
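The scoring rules above (at least one evidence link per task, 65% of tasks per plan) can be sketched as follows. The field name `evidence_links`, the `"low_context"` status string, and both function names are assumptions made for the example; only the link prefixes, the thresholds, and the `"contextual"` default come from the summary and the diff.

```python
from typing import Any, Dict, List

EVIDENCE_PREFIXES = ("onboarding:", "alert:")
MIN_LINKS_PER_TASK = 1
PLAN_THRESHOLD = 0.65


def task_is_contextual(task: Dict[str, Any]) -> bool:
    """A task counts as grounded when it carries enough well-formed evidence links."""
    links = task.get("evidence_links") or []  # assumed field name
    valid = [
        link
        for link in links
        if isinstance(link, str)
        and link.startswith(EVIDENCE_PREFIXES)
        and link.split(":", 1)[1]  # reject a bare prefix such as "alert:"
    ]
    return len(valid) >= MIN_LINKS_PER_TASK


def score_plan(tasks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the share of grounded tasks and the resulting quality status."""
    total = len(tasks)
    contextual = sum(1 for t in tasks if task_is_contextual(t))
    ratio = contextual / total if total else 0.0
    return {
        "contextual_tasks": contextual,
        "total_tasks": total,
        "ratio": ratio,
        # "low_context" is an assumed label for plans below the threshold
        "quality_status": "contextual" if ratio >= PLAN_THRESHOLD else "low_context",
    }
```

A plan that scores below the threshold would be the trigger for the strict regeneration pass described above.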
### 3. **Self-Learning Memory (PR #394)**
- Uses canonical `task.status` from database (not request param)
- Proper feedback scoring:
  - `completed` → +1 (positive learning)
  - `skipped`, `dismissed`, `rejected` → -1 (negative learning)
  - Other statuses → 0 (neutral)
- Prevents inconsistent memory behavior from status normalization mismatches
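The status-to-score mapping is small enough to state directly. A minimal sketch, assuming the status has already been read back from the persisted task rather than taken from the request; the function name `feedback_score` and the lower-casing step are illustrative choices.

```python
from typing import Optional

NEGATIVE_STATUSES = frozenset({"skipped", "dismissed", "rejected"})


def feedback_score(persisted_status: Optional[str]) -> int:
    """Map a task's stored status to a learning signal: +1, -1, or 0."""
    status = (persisted_status or "").strip().lower()
    if status == "completed":
        return 1
    if status in NEGATIVE_STATUSES:
        return -1
    return 0  # every other status is treated as neutral
```

Scoring from the stored value means the memory sees the same status the database does, even if the request used a different spelling that was normalized on write.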
### 4. **Data Consistency (PR #395)**
- `_normalize_dependencies()` helper handles all type variations:
  - `None` → `[]`
  - List → returned as-is
  - JSON string → parsed and validated
  - Invalid types → `[]`
- Applied to today and yesterday task payloads
- Ensures indexing pipeline receives consistent types
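The four cases listed for `_normalize_dependencies()` can be sketched as below. This is a reconstruction from the bullet points, not the helper's actual source; in particular, treating "validated" as "the parsed JSON must be a list" is an assumption, and the function is given a different name to keep that clear.

```python
import json
from typing import Any, List


def normalize_dependencies_sketch(value: Any) -> List[Any]:
    """Coerce a stored `dependencies` value into a list, falling back to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value  # already the expected type
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return []
        # Assumed validation rule: only a JSON array is accepted.
        return parsed if isinstance(parsed, list) else []
    return []  # any other type is treated as invalid
```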
### 5. **Production Observability (PR #396)**
- Date validation:
  - ISO format check before computing yesterday
  - Clear warning logs (plan_id, user_id, plan_date, reason)
  - Graceful skip on parse failure
- Narrower exception handling:
  - `SQLAlchemyError` instead of silent `except Exception: pass`
  - Detailed error logs with context
  - Non-fatal failures preserve today's indexing
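The date check can be sketched as a helper that returns yesterday's date string, or `None` when the plan date is not valid ISO, so the caller can skip yesterday's indexing without failing today's. The function name and the use of the standard `logging` module are assumptions for the example (the project's own code logs through loguru).

```python
import logging
from datetime import date, timedelta
from typing import Optional

logger = logging.getLogger("today_workflow_sketch")


def yesterday_of(
    plan_date: Optional[str],
    *,
    plan_id: Optional[int] = None,
    user_id: Optional[str] = None,
) -> Optional[str]:
    """Return the ISO date one day before `plan_date`, or None if it cannot be parsed."""
    try:
        parsed = date.fromisoformat(plan_date)  # TypeError for None, ValueError for bad text
    except (TypeError, ValueError) as exc:
        logger.warning(
            "Skipping yesterday indexing: plan_id=%s user_id=%s plan_date=%r reason=%s",
            plan_id, user_id, plan_date, exc,
        )
        return None
    return (parsed - timedelta(days=1)).isoformat()
```

Returning `None` instead of raising keeps the failure non-fatal: the caller indexes today's tasks and simply omits the yesterday pass.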
### 6. **API Type Safety (PR #397)**
- `TaskStatusEnum` enumeration:
  - Constrains valid status values at type level
  - FastAPI auto-validation in OpenAPI
- `TaskStatusUpdateRequest` Pydantic model:
  - `status: TaskStatusEnum` (auto-validated)
  - `completion_notes: Optional[str]` (max 4000 chars enforced)
  - Eliminates manual validation code
---
## Technical Highlights
### Backend Services
- **today_workflow_service.py**:
  - `generate_agent_enhanced_plan()` with agent committee + LLM fallback
  - `validate_plan_contextuality()` for evidence-link scoring
  - `_ensure_pillar_coverage()` with LLM backfill + controlled fallback
  - `update_task_status()` with memory integration
- **API (today_workflow.py)**:
  - Type-safe endpoint handlers
  - Pydantic request/response validation
  - Comprehensive error handling
  - Normalized dependencies throughout
  - Detailed logging for observability
### Database & ORM
- Efficient schema after simplification (PR #389)
- `plan_json` BLOB stores complete workflow metadata
- Proper foreign key relationships
- Transaction safety with SQLAlchemy
### Frontend (TypeScript)
- Zustand store for workflow state
- Error boundary handling
- Fallback logic for degraded mode
- Type-safe API calls
---
## Quality Metrics
### Code Quality
- ✅ Type safety throughout (Pydantic, TypeScript)
- ✅ Comprehensive error handling (narrower scopes)
- ✅ Detailed observability logging
- ✅ Non-fatal failure modes
- ✅ Data consistency guarantees
### Testing Coverage
- ✅ Python static compile checks (all PRs)
- ✅ Backend unit tests (scheduler, onboarding, database)
- ✅ Frontend builds without errors (linting auto-fixed)
### Production Readiness
- ✅ Rate limiting for regeneration endpoint
- ✅ Evidence-link grounding prevents hallucinations
- ✅ Self-learning memory improves task proposals
- ✅ Graceful degradation with fallback tasks
- ✅ Detailed error logging for operations
---
## Skipped PRs & Rationale
### PR #393: Improve indexing observability logs
- **Status:** ❌ CLOSED (user decision)
- **Reason:** Contextuality validation too important to remove
- **Contains:** Good logging improvements, but removes core validation
### PR #398: Resolve canonical user IDs in scheduler
- **Status:** ⏸️ SKIPPED
- **Reason:**
  - Codex flagged P1 concern: User ID filtering could drop legacy tasks
  - Codex flagged P2 concern: DB initialization as side effect in discovery
  - Causes regressions in API layer (removes Pydantic models, error handling)
  - Built from older main version
- **Recommendation:** Await rebase on current main + Codex concerns addressed
### PR #399: Centralize onboarding SEO task health
- **Status:** ⏸️ SKIPPED
- **Reason:**
  - Same regressions as PR #398 (removes API improvements)
  - Built from older main version
  - SEO dashboard improvements are solid but not worth losing workflow API enhancements
- **Recommendation:** Rebase on current main when #398 is fixed
---
## Current State Summary
### What We Have
**Agent Committee System**
- 5 specialized agents with parallel proposal gathering
- Semantic deduplication
- Self-learning memory integration
- Graceful fallback to LLM generation
**Evidence-Link Grounding**
- Tasks reference onboarding data and system alerts
- Contextuality scoring prevents hallucinations
- Automatic strict regeneration for low-context workflows
- Response metadata for monitoring
**Self-Learning Memory**
- Proper feedback scoring from database state
- Handles all task status outcomes
- Prevents inconsistent learning from normalized statuses
**Data Consistency**
- Normalized dependencies across all payloads
- Type-safe API endpoints
- Consistent data handling in indexing
**Production Observability**
- Date validation before yesterday indexing
- Narrower exception handling with detailed logs
- Non-fatal error modes
- Clear operational visibility
**API Type Safety**
- Pydantic validation
- OpenAPI documentation
- No manual validation code needed
- Better IDE support with TypeScript
### System Capabilities
- Daily workflow generation with 6 lifecycle pillars
- Rate-limited on-demand regeneration
- Evidence-based contextuality validation
- Self-improving task proposals through memory
- Graceful degradation with fallback tasks
- Comprehensive logging and error handling
- Type-safe endpoints with auto-validation
---
## Lessons Learned
### PR Review Patterns
1. **Check for regressions:** Several PRs removed recent improvements
2. **Verify git history:** PRs #398-399 were built from older main
3. **Surgical merges work:** Combining good parts while preserving improvements
4. **Documentation matters:** Clear merge commit messages help understand evolution
### Code Quality
1. **Type safety prevents bugs:** Pydantic models caught issues early
2. **Narrow exception scopes:** Better observability than broad catches
3. **Evidence-based design:** Grounding prevents hallucination
4. **Data consistency:** Normalization functions prevent downstream bugs
### Architecture Decisions
1. **Committee approach:** Multiple agents > single LLM
2. **Evidence links:** Better than quality ratios for grounding
3. **Memory learning:** Use DB state, not request params
4. **Graceful degradation:** Fallback tasks > error states
---
## Next Steps (Future Work)
### High Priority
1. **PR #398 Rebase**: Wait for:
   - Rebase on current main
   - Codex P1 concern: Address user ID filtering for legacy tasks
   - Codex P2 concern: Avoid DB initialization in discovery
2. **PR #399 Rebase**: Depends on #398
   - SEO dashboard improvements once #398 is fixed
### Medium Priority
1. **Performance Tuning**: Monitor agent committee query times
2. **Memory Optimization**: Cache agent proposals for repeated patterns
3. **Dashboard Enhancement**: Add contextuality metrics to UI
### Low Priority
1. **Documentation**: Update API docs with new models
2. **Logging**: Expand observability for edge cases
3. **Testing**: Add integration tests for committee scenarios
---
## Session Statistics
| Metric | Value |
|--------|-------|
| **PRs Reviewed** | 12 (#388-397, #398-399) |
| **PRs Merged** | 9 (#388-397, excluding #393) |
| **PRs Skipped** | 3 (#393 closed by user, #398-399 due to regressions) |
| **Merge Conflicts Resolved** | 11 |
| **Surgical Merges** | 4 (#394-397) |
| **Git Commits** | 9 merge commits |
| **Files Modified** | 30+ across backend/frontend |
| **Lines Added** | 1000+ |
| **Lines Removed** | 1500+ |
| **Time Span** | March 8-9, 2026 |
---
## Recommendations for Future Sessions
1. **Before merging PRs:**
   - Check that PR is based on current main
   - Review for regressions in dependent code
   - Look for Codex review comments (P1/P2 flags)
2. **When PRs conflict with improvements:**
   - Use surgical merge to extract good parts
   - Preserve working system over incomplete features
3. **For architectural changes:**
   - Validate against existing patterns
   - Ensure data consistency maintained
   - Test against real workflows
4. **Documentation:**
   - Update this file when significant changes occur
   - Keep git history clean with descriptive commits
   - Tag versions for major milestones
---
**Session Completed:**
**System State:** Production-ready with advanced features
**Next Review:** When PR #398 is rebased on current main


@@ -45,11 +45,17 @@ def _extract_error_message(exc: Exception) -> str:
"""
Extract user-friendly error message from exception.
Handles HTTPException with nested error details from WaveSpeed API.
Preserves subscription modal flags for frontend.
"""
if isinstance(exc, HTTPException):
detail = exc.detail
# If detail is a dict (from WaveSpeed client)
if isinstance(detail, dict):
# Check if this is a subscription/credit error
if detail.get("error_type") == "insufficient_credits" or detail.get("show_subscription_modal"):
# Return the error message with subscription modal flag
return detail.get("message", "Insufficient credits. Please top up your account.")
# Try to extract message from nested response JSON
response_str = detail.get("response", "")
if response_str:
@@ -86,6 +92,27 @@ def _extract_error_message(exc: Exception) -> str:
return error_str
def _extract_error_metadata(exc: Exception) -> Dict[str, Any]:
"""Extract structured error metadata for task polling clients."""
if isinstance(exc, HTTPException):
detail = exc.detail
if isinstance(detail, dict):
return {
"error_status": exc.status_code,
"error_data": detail,
}
if isinstance(detail, str):
return {
"error_status": exc.status_code,
"error_data": {
"error": detail,
"message": detail,
},
}
return {}
def _execute_podcast_video_task(
task_id: str,
request: PodcastVideoGenerationRequest,
@@ -229,9 +256,15 @@ def _execute_podcast_video_task(
# Extract user-friendly error message from exception
error_msg = _extract_error_message(exc)
error_meta = _extract_error_metadata(exc)
task_manager.update_task_status(
task_id, "failed", error=error_msg, message=f"Video generation failed: {error_msg}"
task_id,
"failed",
error=error_msg,
message=f"Video generation failed: {error_msg}",
error_status=error_meta.get("error_status"),
error_data=error_meta.get("error_data"),
)
@@ -257,7 +290,7 @@ async def generate_podcast_video(
try:
if hasattr(request, 'headers') and hasattr(request.headers, 'get'):
auth_header = request.headers.get("Authorization")
except:
except Exception:
pass
if auth_header and auth_header.startswith("Bearer "):


@@ -76,6 +76,10 @@ class TaskManager:
if task["status"] == "failed" and task.get("error"):
response["error"] = task["error"]
if task.get("error_status") is not None:
response["error_status"] = task["error_status"]
if task.get("error_data") is not None:
response["error_data"] = task["error_data"]
return response
@@ -86,7 +90,9 @@ class TaskManager:
progress: Optional[float] = None,
message: Optional[str] = None,
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None
error: Optional[str] = None,
error_status: Optional[int] = None,
error_data: Optional[Dict[str, Any]] = None,
):
"""Update the status of a task."""
if task_id not in self.task_storage:
@@ -112,6 +118,10 @@ class TaskManager:
if error is not None:
task["error"] = error
logger.error(f"[StoryWriter] Task {task_id} error: {error}")
if error_status is not None:
task["error_status"] = error_status
if error_data is not None:
task["error_data"] = error_data
async def execute_story_generation_task(
self,


@@ -11,7 +11,7 @@ from sqlalchemy.exc import SQLAlchemyError
from middleware.auth_middleware import get_current_user
from services.database import get_db
from services.today_workflow_service import get_or_create_daily_workflow_plan, update_task_status
from services.today_workflow_service import get_or_create_daily_workflow_plan, update_task_status, _today_date_str
from models.daily_workflow_models import DailyWorkflowPlan, DailyWorkflowTask
import asyncio
from services.intelligence.txtai_service import TxtaiIntelligenceService
@@ -81,26 +81,7 @@ async def _index_tasks_to_sif(user_id: str, date: str, tasks: list[dict], label:
logger.debug(f"Background indexing failed for user {user_id}: {e}")
@router.get("")
async def get_today_workflow(
date: Optional[str] = None,
current_user: dict = Depends(get_current_user),
db: Session = Depends(get_db),
) -> Dict[str, Any]:
from starlette.concurrency import run_in_threadpool
user_id = str(current_user.get("id"))
plan, created = await get_or_create_daily_workflow_plan(db, user_id, date=date)
def _fetch_tasks():
return (
db.query(DailyWorkflowTask)
.filter(DailyWorkflowTask.plan_id == plan.id, DailyWorkflowTask.user_id == user_id)
.order_by(DailyWorkflowTask.created_at.asc())
.all()
)
tasks = await run_in_threadpool(_fetch_tasks)
def _build_workflow_payload(user_id: str, plan: DailyWorkflowPlan, tasks: list[DailyWorkflowTask]) -> Dict[str, Any]:
response_tasks = []
for t in tasks:
response_tasks.append(
@@ -136,8 +117,156 @@ async def get_today_workflow(
workflow_status = "completed"
total_estimated = int(sum(int(t.get("estimatedTime") or 0) for t in response_tasks))
plan_json = plan.plan_json or {}
return {
"workflow": {
"id": f"daily-{user_id}-{plan.date}",
"date": plan.date,
"userId": user_id,
"tasks": response_tasks,
"currentTaskIndex": current_index,
"completedTasks": completed,
"totalTasks": total,
"workflowStatus": workflow_status,
"totalEstimatedTime": total_estimated,
"actualTimeSpent": 0,
},
"plan": {
"id": plan.id,
"date": plan.date,
"source": plan.source,
"generation_mode": plan.generation_mode,
"committee_agent_count": plan.committee_agent_count,
"fallback_used": bool(plan.fallback_used),
"quality_status": plan_json.get("quality_status", "contextual"),
"contextuality_validation": plan_json.get("contextuality_validation"),
"provenance_summary": {
"generationMode": plan.generation_mode,
"committeeAgentCount": plan.committee_agent_count,
"fallbackUsed": bool(plan.fallback_used),
"taskSourceBreakdown": {},
},
"created_at": plan.created_at.isoformat() if plan.created_at else None,
"updated_at": plan.updated_at.isoformat() if plan.updated_at else None,
},
"schedule_status": {
"date": plan.date,
"generated": True,
"scheduled_run_completed": plan.source == "scheduled",
"source": plan.source,
"created_at": plan.created_at.isoformat() if plan.created_at else None,
},
}
@router.get("")
async def get_today_workflow(
date: Optional[str] = None,
current_user: dict = Depends(get_current_user),
db: Session = Depends(get_db),
) -> Dict[str, Any]:
"""Get existing daily workflow for the specified date.
Returns 404 if no workflow exists for the date.
Workflow should only be created via explicit user action or scheduled job.
"""
from starlette.concurrency import run_in_threadpool
user_id = str(current_user.get("id"))
date_str = date or _today_date_str()
def _get_existing():
return (
db.query(DailyWorkflowPlan)
.filter(DailyWorkflowPlan.user_id == user_id, DailyWorkflowPlan.date == date_str)
.first()
)
plan = await run_in_threadpool(_get_existing)
if not plan:
raise HTTPException(
status_code=404,
detail=f"No workflow found for date {date_str}. Workflow should be generated via explicit user action or scheduled job."
)
def _fetch_tasks():
return (
db.query(DailyWorkflowTask)
.filter(DailyWorkflowTask.plan_id == plan.id, DailyWorkflowTask.user_id == user_id)
.order_by(DailyWorkflowTask.created_at.asc())
.all()
)
tasks = await run_in_threadpool(_fetch_tasks)
return {
"success": True,
"data": _build_workflow_payload(user_id, plan, tasks),
"timestamp": datetime.utcnow().isoformat(),
"user_id": user_id,
}
@router.get("/status")
async def get_today_workflow_status(
date: Optional[str] = None,
current_user: dict = Depends(get_current_user),
db: Session = Depends(get_db),
) -> Dict[str, Any]:
from starlette.concurrency import run_in_threadpool
user_id = str(current_user.get("id"))
date_str = date or _today_date_str()
def _get_existing():
return (
db.query(DailyWorkflowPlan)
.filter(DailyWorkflowPlan.user_id == user_id, DailyWorkflowPlan.date == date_str)
.first()
)
plan = await run_in_threadpool(_get_existing)
return {
"success": True,
"data": {
"date": date_str,
"generated": plan is not None,
"scheduled_run_completed": bool(plan and plan.source == "scheduled"),
"source": plan.source if plan else None,
"created_at": plan.created_at.isoformat() if plan and plan.created_at else None,
},
"timestamp": datetime.utcnow().isoformat(),
"user_id": user_id,
}
@router.post("/generate")
async def generate_workflow(
date: Optional[str] = None,
current_user: dict = Depends(get_current_user),
db: Session = Depends(get_db),
) -> Dict[str, Any]:
"""Explicitly generate a new daily workflow for the specified date.
This should only be called when the user explicitly requests workflow generation
or via a scheduled job at night.
"""
from starlette.concurrency import run_in_threadpool
user_id = str(current_user.get("id"))
plan, created = await get_or_create_daily_workflow_plan(db, user_id, date=date, creation_source="manual")
def _fetch_tasks():
return (
db.query(DailyWorkflowTask)
.filter(DailyWorkflowTask.plan_id == plan.id, DailyWorkflowTask.user_id == user_id)
.order_by(DailyWorkflowTask.created_at.asc())
.all()
)
tasks = await run_in_threadpool(_fetch_tasks)
if created:
response_tasks = _build_workflow_payload(user_id, plan, tasks)["workflow"]["tasks"]
asyncio.create_task(_index_tasks_to_sif(user_id, plan.date, response_tasks, label="today"))
from datetime import date as date_type, timedelta
@@ -200,29 +329,7 @@ async def get_today_workflow(
return {
"success": True,
"data": {
"workflow": {
"id": f"daily-{user_id}-{plan.date}",
"date": plan.date,
"userId": user_id,
"tasks": response_tasks,
"currentTaskIndex": current_index,
"completedTasks": completed,
"totalTasks": total,
"workflowStatus": workflow_status,
"totalEstimatedTime": total_estimated,
"actualTimeSpent": 0,
},
"plan": {
"id": plan.id,
"date": plan.date,
"source": plan.source,
"quality_status": (plan.plan_json or {}).get("quality_status", "contextual"),
"contextuality_validation": (plan.plan_json or {}).get("contextuality_validation"),
"created_at": plan.created_at.isoformat() if plan.created_at else None,
"updated_at": plan.updated_at.isoformat() if plan.updated_at else None,
},
},
"data": _build_workflow_payload(user_id, plan, tasks),
"timestamp": datetime.utcnow().isoformat(),
"user_id": user_id,
}


@@ -18,6 +18,26 @@ router = APIRouter()
UPLOAD_DIR = Path("backend/data/video_studio/uploads")
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
def _extract_error_metadata(exc: Exception) -> Dict[str, Any]:
"""Extract structured HTTP error metadata for polling clients."""
if isinstance(exc, HTTPException):
detail = exc.detail
if isinstance(detail, dict):
return {
"error_status": exc.status_code,
"error_data": detail,
}
if isinstance(detail, str):
return {
"error_status": exc.status_code,
"error_data": {
"error": detail,
"message": detail,
},
}
return {}
def _process_avatar_generation(task_id: str, image_path: Path, audio_path: Path, user_id: str, resolution: str, model: str):
"""
Background task to process avatar generation using shared InfiniteTalk service.
@@ -94,7 +114,15 @@ def _process_avatar_generation(task_id: str, image_path: Path, audio_path: Path,
except Exception as e:
logger.error(f"[VideoStudio] Avatar generation failed for task {task_id}: {e}", exc_info=True)
task_manager.update_task(task_id, "failed", error=str(e), user_id=user_id)
error_meta = _extract_error_metadata(e)
task_manager.update_task(
task_id,
"failed",
error=str(e),
user_id=user_id,
error_status=error_meta.get("error_status"),
error_data=error_meta.get("error_data"),
)
finally:
# Cleanup temp upload files
try:


@@ -39,7 +39,18 @@ class TaskManager:
logger.error(f"[VideoStudio] Failed to create task: {e}")
raise
def update_task(self, task_id: str, status: str, result: Optional[Dict] = None, error: Optional[str] = None, user_id: str = None, progress: float = None, message: str = None):
def update_task(
self,
task_id: str,
status: str,
result: Optional[Dict] = None,
error: Optional[str] = None,
user_id: str = None,
progress: float = None,
message: str = None,
error_status: Optional[int] = None,
error_data: Optional[Dict[str, Any]] = None,
):
"""Update an existing task."""
if not user_id:
logger.error(f"[VideoStudio] Cannot update task {task_id} without user_id")
@@ -74,6 +85,13 @@ class TaskManager:
task.result = result
if error:
task.error = error
if error_status is not None or error_data is not None:
result_payload = task.result if isinstance(task.result, dict) else {}
if error_status is not None:
result_payload["error_status"] = error_status
if error_data is not None:
result_payload["error_data"] = error_data
task.result = result_payload
if progress is not None:
task.progress = progress
if message:
@@ -107,7 +125,7 @@ class TaskManager:
if status_val == "processing":
status_val = "running"
return {
response = {
"task_id": task.task_id,
"status": status_val,
"result": task.result,
@@ -117,6 +135,12 @@ class TaskManager:
"created_at": task.created_at,
"updated_at": task.updated_at
}
if isinstance(task.result, dict):
if task.result.get("error_status") is not None:
response["error_status"] = task.result.get("error_status")
if task.result.get("error_data") is not None:
response["error_data"] = task.result.get("error_data")
return response
finally:
db.close()
except Exception as e:


@@ -4,6 +4,10 @@ import sqlite3
import os
from pathlib import Path
ROOT_DIR = Path(__file__).resolve().parent.parent
WORKSPACE_DIR = ROOT_DIR / "workspace"
def migrate_database(db_path):
"""Add missing columns to daily_workflow_plans table."""
if not os.path.exists(db_path):
@@ -46,14 +50,14 @@ def migrate_database(db_path):
def find_and_migrate_databases():
"""Find all databases and apply migrations."""
workspace_dir = r'c:\Users\diksha rawat\Desktop\ALwrity\workspace'
workspace_dir = WORKSPACE_DIR
if not os.path.exists(workspace_dir):
if not workspace_dir.exists():
print(f"Workspace directory not found: {workspace_dir}")
return
# Find all .db files
db_files = list(Path(workspace_dir).glob('**/db/*.db'))
db_files = list(workspace_dir.glob('**/db/*.db'))
if not db_files:
print("No databases found to migrate")


@@ -53,6 +53,39 @@ WORKSPACE_DIR = os.path.join(ROOT_DIR, 'workspace')
# Engine cache for multi-tenant support
_user_engines = {}
def _ensure_daily_workflow_schema(engine, user_id: str) -> None:
"""Backfill required daily_workflow_plans columns for legacy tenant DBs."""
required_columns = {
"generation_mode": "VARCHAR(30) NOT NULL DEFAULT 'llm_generation'",
"committee_agent_count": "INTEGER NOT NULL DEFAULT 0",
"fallback_used": "BOOLEAN NOT NULL DEFAULT 0",
"generation_run_id": "INTEGER",
}
try:
with engine.begin() as conn:
table_check = conn.exec_driver_sql(
"SELECT name FROM sqlite_master WHERE type='table' AND name='daily_workflow_plans'"
).fetchone()
if not table_check:
return
existing_cols = {
row[1] for row in conn.exec_driver_sql("PRAGMA table_info(daily_workflow_plans)").fetchall()
}
for col_name, col_def in required_columns.items():
if col_name not in existing_cols:
conn.exec_driver_sql(
f"ALTER TABLE daily_workflow_plans ADD COLUMN {col_name} {col_def}"
)
logger.warning(
f"Auto-migrated daily_workflow_plans column '{col_name}' for user {user_id}"
)
except Exception as e:
logger.error(f"Failed daily_workflow_plans schema compatibility check for user {user_id}: {e}")
def get_user_db_path(user_id: str) -> str:
"""Get the database path for a specific user."""
# Sanitize user_id to be safe for filesystem
@@ -192,6 +225,7 @@ def init_user_database(user_id: str):
UserBusinessInfoBase.metadata.create_all(bind=engine)
ContentAssetBase.metadata.create_all(bind=engine)
BingAnalyticsBase.metadata.create_all(bind=engine)
_ensure_daily_workflow_schema(engine, user_id)
# Initialize default data for new databases
try:


@@ -3,7 +3,10 @@ Task Scheduler Package
Modular, pluggable scheduler for ALwrity tasks.
"""
import os
from sqlalchemy.orm import Session
from apscheduler.triggers.cron import CronTrigger
from .core.scheduler import TaskScheduler
from .core.executor_interface import TaskExecutor, TaskExecutionResult
@@ -32,6 +35,7 @@ from .utils.platform_insights_task_loader import load_due_platform_insights_task
from .utils.advertools_task_loader import load_due_advertools_tasks
from .utils.sif_indexing_task_loader import load_due_sif_indexing_tasks
from .utils.market_trends_task_loader import load_due_market_trends_tasks
from services.today_workflow_service import generate_scheduled_daily_workflows
# Global scheduler instance (initialized on first access)
_scheduler_instance: TaskScheduler = None
@@ -144,6 +148,18 @@ def get_scheduler() -> TaskScheduler:
load_due_market_trends_tasks
)
today_workflow_hour_utc = int(os.getenv('TODAY_WORKFLOW_SCHEDULE_HOUR_UTC', '2'))
today_workflow_minute_utc = int(os.getenv('TODAY_WORKFLOW_SCHEDULE_MINUTE_UTC', '0'))
_scheduler_instance.scheduler.add_job(
generate_scheduled_daily_workflows,
trigger=CronTrigger(hour=today_workflow_hour_utc, minute=today_workflow_minute_utc, timezone='UTC'),
id='generate_daily_workflows',
replace_existing=True,
max_instances=1,
coalesce=True,
misfire_grace_time=3600,
)
return _scheduler_instance


@@ -8,6 +8,7 @@ from models.daily_workflow_models import DailyWorkflowPlan, DailyWorkflowTask
from models.agent_activity_models import AgentAlert
from services.agent_activity_service import AgentActivityService, build_agent_event_payload
from services.llm_providers.main_text_generation import llm_text_gen
from services.database import get_all_user_ids, get_session_for_user
from loguru import logger
PILLAR_IDS = ["plan", "generate", "publish", "analyze", "engage", "remarket"]
@@ -604,7 +605,12 @@ async def generate_agent_enhanced_plan(
return result
async def get_or_create_daily_workflow_plan(db: Session, user_id: str, date: Optional[str] = None) -> tuple[DailyWorkflowPlan, bool]:
async def get_or_create_daily_workflow_plan(
db: Session,
user_id: str,
date: Optional[str] = None,
creation_source: str = "manual",
) -> tuple[DailyWorkflowPlan, bool]:
from starlette.concurrency import run_in_threadpool
date_str = date or _today_date_str()
@@ -646,7 +652,10 @@ async def get_or_create_daily_workflow_plan(db: Session, user_id: str, date: Opt
plan = DailyWorkflowPlan(
user_id=user_id,
date=date_str,
source="agent",
source=creation_source,
generation_mode=_derive_generation_mode(plan_data),
committee_agent_count=_count_committee_agents(tasks),
fallback_used=_plan_uses_fallback(tasks),
plan_json=plan_data,
created_at=datetime.utcnow(),
updated_at=datetime.utcnow(),
@@ -685,6 +694,80 @@ async def get_or_create_daily_workflow_plan(db: Session, user_id: str, date: Opt
return plan, True
def _derive_generation_mode(plan_data: Dict[str, Any]) -> str:
tasks = plan_data.get("tasks", []) if isinstance(plan_data, dict) else []
source_modes = set()
for task in tasks:
metadata = task.get("metadata") if isinstance(task, dict) else {}
metadata = metadata if isinstance(metadata, dict) else {}
source_agent = str(metadata.get("source_agent") or "").strip()
source = str(metadata.get("source") or "").strip()
if source_agent:
source_modes.add("agent_committee")
elif source in {"controlled_fallback", "llm_pillar_backfill"}:
source_modes.add(source)
if "agent_committee" in source_modes:
return "agent_committee"
if "controlled_fallback" in source_modes:
return "controlled_fallback"
if "llm_pillar_backfill" in source_modes:
return "llm_pillar_backfill"
return "llm_generation"
def _count_committee_agents(tasks: List[Dict[str, Any]]) -> int:
agents = set()
for task in tasks:
metadata = task.get("metadata") if isinstance(task, dict) else {}
metadata = metadata if isinstance(metadata, dict) else {}
source_agent = str(metadata.get("source_agent") or "").strip()
if source_agent:
agents.add(source_agent)
return len(agents)
def _plan_uses_fallback(tasks: List[Dict[str, Any]]) -> bool:
for task in tasks:
metadata = task.get("metadata") if isinstance(task, dict) else {}
metadata = metadata if isinstance(metadata, dict) else {}
source = str(metadata.get("source") or "").strip()
if source in {"controlled_fallback", "llm_pillar_backfill"}:
return True
return False
async def generate_scheduled_daily_workflows() -> Dict[str, int]:
user_ids = get_all_user_ids()
stats = {"users_seen": 0, "created": 0, "existing": 0, "failed": 0}
for user_id in user_ids:
stats["users_seen"] += 1
db = None
try:
db = get_session_for_user(user_id)
plan, created = await get_or_create_daily_workflow_plan(
db,
user_id,
creation_source="scheduled",
)
if created:
stats["created"] += 1
logger.info("Scheduled daily workflow created for user {} date {}", user_id, plan.date)
else:
stats["existing"] += 1
logger.info("Scheduled daily workflow already exists for user {} date {}", user_id, plan.date)
except Exception as e:
stats["failed"] += 1
logger.error("Scheduled daily workflow generation failed for user {}: {}", user_id, e)
finally:
if db:
db.close()
logger.info("Scheduled daily workflow run complete: {}", stats)
return stats
def update_task_status(
db: Session,
user_id: str,


@@ -3,6 +3,7 @@ Video generation operations (text-to-video and image-to-video).
"""
import requests
import json
from typing import Any, Dict, Optional
from fastapi import HTTPException
@@ -12,6 +13,19 @@ from .base import VideoBase
logger = get_service_logger("wavespeed.generators.video.generation")
def _extract_wavespeed_message(response_text: str) -> str:
"""Best-effort extraction of WaveSpeed error message from response payload."""
if not response_text:
return ""
try:
parsed = json.loads(response_text)
if isinstance(parsed, dict):
return str(parsed.get("message") or parsed.get("error") or "")
except (json.JSONDecodeError, TypeError, ValueError):
return ""
return ""
class VideoGeneration(VideoBase):
"""Video generation operations."""
@@ -31,6 +45,25 @@ class VideoGeneration(VideoBase):
response = requests.post(url, headers=self._get_headers(), json=payload, timeout=timeout)
if response.status_code != 200:
logger.error(f"[WaveSpeed] Submission failed: {response.status_code} {response.text}")
error_message = _extract_wavespeed_message(response.text)
if "insufficient credits" in error_message.lower() or "credit" in error_message.lower():
raise HTTPException(
status_code=429,
detail={
"error": "Insufficient WaveSpeed credits",
"message": "Insufficient credits. Please top up to continue video generation.",
"provider": "wavespeed",
"usage_info": {
"provider": "wavespeed",
"type": "credits",
"limit_type": "provider_credits",
"operation_type": "scene_animation",
"action_required": "top_up",
},
},
)
raise HTTPException(
status_code=502,
detail={
@@ -75,6 +108,25 @@ class VideoGeneration(VideoBase):
if response.status_code != 200:
logger.error(f"[WaveSpeed] Text-to-video submission failed: {response.status_code} {response.text}")
error_message = _extract_wavespeed_message(response.text)
if "insufficient credits" in error_message.lower() or "credit" in error_message.lower():
raise HTTPException(
status_code=429,
detail={
"error": "Insufficient WaveSpeed credits",
"message": "Insufficient credits. Please top up to continue video generation.",
"provider": "wavespeed",
"usage_info": {
"provider": "wavespeed",
"type": "credits",
"limit_type": "provider_credits",
"operation_type": "video_generation",
"action_required": "top_up",
},
},
)
raise HTTPException(
status_code=502,
detail={
@@ -174,6 +226,25 @@ class VideoGeneration(VideoBase):
if response.status_code != 200:
logger.error(f"[WaveSpeed] Text-to-video submission failed: {response.status_code} {response.text}")
error_message = _extract_wavespeed_message(response.text)
if "insufficient credits" in error_message.lower() or "credit" in error_message.lower():
raise HTTPException(
status_code=429,
detail={
"error": "Insufficient WaveSpeed credits",
"message": "Insufficient credits. Please top up to continue video generation.",
"provider": "wavespeed",
"usage_info": {
"provider": "wavespeed",
"type": "credits",
"limit_type": "provider_credits",
"operation_type": "video_generation",
"action_required": "top_up",
},
},
)
raise HTTPException(
status_code=502,
detail={

View File

@@ -0,0 +1,361 @@
# Backend Log RCA Tracker
## Purpose
This document is the working catalog for backend issues observed in runtime logs.
For each issue, capture:
- error signature
- observed symptoms
- likely root cause analysis
- confidence level
- files to inspect/edit
- fix strategy notes
- validation steps
- status
## Triage Rules
- Do not fix directly from logs alone unless root cause is confirmed.
- Prefer grouping repeated log lines under one issue.
- Track the first failing subsystem, then downstream effects.
- Separate configuration problems from code defects.
- Keep this document updated before and after each fix.
## Issue 1: Clerk token verification failures on authenticated endpoints
- **Status**: Open
- **Severity**: High
- **Subsystem**: Authentication / request pipeline
- **Error signatures**:
- `Unverified token rejected (production).`
- `AUTHENTICATION ERROR: Token verification failed for endpoint: GET /api/...`
- **Observed endpoints in logs**:
- `/api/content-planning/monitoring/lightweight-stats`
- `/api/content-planning/monitoring/health`
- `/api/subscription/dashboard/...`
- `/api/subscription/alerts/...`
- `/api/subscription/status/...`
- **Observed behavior**:
- Requests reach authenticated endpoints.
- Clerk verification fails.
- Fallback unverified decode path is attempted.
- Production mode rejects the token.
- **Primary RCA hypothesis**:
- The backend is receiving bearer tokens that do not successfully validate against the resolved Clerk JWKS/issuer configuration.
- The middleware then falls back to unverified decode, but production mode explicitly rejects that path.
- **Secondary RCA hypotheses**:
- Frontend token/audience/issuer mismatch.
- Wrong Clerk environment variables loaded in backend.
- Issuer-derived JWKS URL resolution is inconsistent with actual Clerk instance.
- Requests may be sent before a valid session token is available.
- **Evidence in code**:
- `backend/middleware/auth_middleware.py`
- `ClerkAuthMiddleware.__init__`
- `ClerkAuthMiddleware.verify_token`
- `get_current_user`
- Relevant logic:
- derives JWKS URL from token issuer or cached publishable key instance
- falls back to `jwt.decode(..., verify_signature=False)`
- rejects unverified tokens when `ALLOW_UNVERIFIED_JWT_DEV` is false
- **Likely files to inspect/edit later**:
- `backend/middleware/auth_middleware.py`
- possibly frontend auth/session request layer if token attachment is inconsistent
- **Confidence**: Medium
- **Root-cause questions to answer**:
- Are `CLERK_SECRET_KEY` and publishable key values from the same Clerk instance?
- Is the token issuer exactly matching the intended Clerk environment?
- Are failing requests sent with stale, dev, or cross-environment tokens?
- Are these requests triggered before Clerk session hydration on the frontend?
- **Validation after fix**:
- Authenticated endpoints return 200 with verified user context.
- No `Unverified token rejected (production)` log spam for healthy requests.
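
One way to approach the issuer question above is to inspect the claims of a failing token offline. The following is a minimal diagnostic sketch, not the middleware's actual logic: `peek_jwt_claims` and `issuer_matches` are hypothetical helper names, and the payload is decoded without signature verification, so the result must never be used for authorization.

```python
import base64
import json
from typing import Any, Dict


def peek_jwt_claims(token: str) -> Dict[str, Any]:
    """Decode the JWT payload WITHOUT verifying the signature (diagnostics only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_matches(token: str, expected_issuer: str) -> bool:
    """Hypothetical check: does the token's `iss` equal the configured issuer?"""
    claims = peek_jwt_claims(token)
    return str(claims.get("iss", "")).rstrip("/") == expected_issuer.rstrip("/")
```

Comparing the `iss` claim of a rejected token with the issuer the backend expects should help separate a cross-environment token from a JWKS resolution problem.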
## Issue 2: Hugging Face structured JSON generation failing with model not found
- **Status**: Open
- **Severity**: High
- **Subsystem**: LLM provider / workflow generation
- **Error signatures**:
- `HF structured model not found: %s. Trying fallback model.`
- `Hugging Face API call failed: Not Found`
- `HF structured model not found (no response_format path): %s`
- `Hugging Face structured JSON generation failed: NotFoundError: Not Found`
- `[llm_text_gen] Provider huggingface failed: RetryError[...]`
- **Observed behavior**:
- Structured JSON call tries primary model.
- Fallback model sequence also fails.
- Retry without `response_format` still fails with `NotFound`.
- Upstream caller falls through to another provider or fallback path.
- **Primary RCA hypothesis**:
- The configured Hugging Face model identifier is invalid, unavailable to the account/provider, or incompatible with the current OpenAI-compatible Hugging Face endpoint.
- **Secondary RCA hypotheses**:
- Base URL/API key/provider configuration is wrong.
- Fallback model list contains provider-specific model ids not available in the current account/region.
- Structured generation path assumes chat completions support for models that only exist on a different inference route.
- **Evidence in code**:
- `backend/services/llm_providers/huggingface_provider.py`
- `_fallback_model_sequence`
- `huggingface_structured_json_response`
- The code retries:
- with `response_format={"type": "json_object"}`
- then again without `response_format`
- Both paths still fail with `NotFoundError`, which points more strongly to model/base-url availability than schema formatting.
- **Likely files to inspect/edit later**:
- `backend/services/llm_providers/huggingface_provider.py`
- provider selection/orchestration file calling Hugging Face as primary for structured JSON
- environment/config file for HF model names and API base URL
- **Confidence**: High
- **Root-cause questions to answer**:
- Which exact model string is being passed as the primary model in the failing call?
- What base URL and API key are being used for the OpenAI client?
- Are the fallback model ids valid for the currently configured Hugging Face inference provider?
- **Validation after fix**:
- A structured JSON test request succeeds with the intended model or a verified fallback.
- No `NotFoundError` for the chosen model list.
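
To answer which model ids are actually reachable, the fallback walk can be exercised in isolation. This is a sketch under stated assumptions: `ModelNotFound` is a stand-in for the provider's not-found error, and `probe` is a hypothetical callable that would issue one small request per model id. Neither is taken from `huggingface_provider.py`.

```python
from typing import Callable, Iterable, List, Tuple


class ModelNotFound(Exception):
    """Stand-in for the provider's 404 error (hypothetical, not the real SDK class)."""


def first_available_model(
    candidates: Iterable[str],
    probe: Callable[[str], None],
) -> Tuple[str, List[str]]:
    """Return the first model that `probe` accepts, plus the ids that were not found."""
    missing: List[str] = []
    for model_id in candidates:
        try:
            probe(model_id)
        except ModelNotFound:
            missing.append(model_id)  # record and try the next candidate
            continue
        return model_id, missing
    raise ModelNotFound(f"no candidate model is available: {missing}")
```

Recording the ids that were skipped makes it visible when the primary model is silently unavailable and only a later fallback succeeds.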
## Issue 3: txtai indexing attempted before service initialization completes
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: Semantic indexing / background tasks
- **Error signatures**:
- `Cannot index content - service not initialized for user ...`
- **Observed behavior**:
- Background indexing is triggered.
- `TxtaiIntelligenceService.index_content` calls `_ensure_initialized()`.
- `_ensure_initialized()` starts background initialization and returns immediately.
- `index_content` then checks `_initialized`, sees false, and fails fast.
- **Primary RCA hypothesis**:
- There is a race condition between lazy background initialization and immediate indexing/search calls.
- `SIF_FAIL_FAST=true` (default) causes operations to raise RuntimeError instead of gracefully deferring.
- **Evidence in code**:
- `backend/services/intelligence/txtai_service.py`:
- Line 57: `self.fail_fast = str(os.getenv("SIF_FAIL_FAST", "true")).lower() in {"1", "true", "yes", "on"}`
- Lines 234-235: `index_content` raises RuntimeError if `fail_fast` and not initialized
- Lines 284-285: `search` raises RuntimeError if `fail_fast` and not initialized
- Lines 319-320: `get_similarity` raises RuntimeError if `fail_fast` and not initialized
- `_ensure_initialized` is intentionally non-blocking (starts background thread)
- `backend/api/today_workflow.py`:
- `_index_tasks_to_sif` triggers indexing in background after workflow actions
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- any other callers that assume initialization is synchronous
- **Confidence**: High
- **Potential downstream impact**:
- workflow/task indexing silently fails
- semantic search quality degrades
- noisy logs obscure higher-priority failures
- **Root-cause questions to answer**:
- Should `index_content` await `_ensure_initialized_async()` instead of using the non-blocking path?
- Should callers tolerate deferred indexing instead of fail-fast behavior?
- Is `SIF_FAIL_FAST=true` appropriate for background indexing operations?
- Should `SIF_FAIL_FAST` default to `false` for background operations?
- **Validation after fix**:
- First indexing call after startup succeeds or is gracefully deferred without error spam.
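
The race described above can be reproduced with a toy service. This is an illustrative sketch only: `LazyService` is a hypothetical stand-in, not `TxtaiIntelligenceService`, and the sleep simulates model loading.

```python
import threading
import time


class LazyService:
    """Toy stand-in for a lazily initialized service (not the real txtai service)."""

    def __init__(self, init_seconds: float, fail_fast: bool = True) -> None:
        self._init_seconds = init_seconds
        self.fail_fast = fail_fast
        self._ready = threading.Event()
        self._started = False
        self._lock = threading.Lock()

    def _ensure_initialized(self) -> None:
        """Non-blocking: start background init once and return immediately."""
        with self._lock:
            if self._started:
                return
            self._started = True
        threading.Thread(target=self._initialize, daemon=True).start()

    def _initialize(self) -> None:
        time.sleep(self._init_seconds)  # simulates loading an embeddings model
        self._ready.set()

    def index_content(self, item: str) -> bool:
        self._ensure_initialized()
        if not self._ready.is_set():
            if self.fail_fast:
                raise RuntimeError("service not initialized")
            return False  # deferred: the caller may retry later
        return True

    def wait_ready(self, timeout: float) -> bool:
        return self._ready.wait(timeout)
```

With `fail_fast=True` the first call raises because initialization has only just started; with `fail_fast=False` the same call returns `False` and can be retried once `wait_ready` succeeds.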
## Issue 4: Today workflow endpoint reload observed during active debugging
- **Status**: Observed
- **Severity**: Low
- **Subsystem**: Development reload / workflow API
- **Log signature**:
- `StatReload detected changes in 'api\today_workflow.py'. Reloading...`
- **Observed behavior**:
- Development server reloads due to file edits.
- **RCA**:
- Expected dev-server behavior, not itself a product bug.
- **Files involved**:
- `backend/api/today_workflow.py`
- **Confidence**: High
- **Action**:
- No fix needed; keep separate from actual runtime defects.
## Cross-Issue Notes
- The auth failures and the workflow/indexing issues may be independent.
- The Hugging Face failure may trigger fallback task generation, which can still create workflows while hiding the upstream provider problem.
- txtai indexing failures appear to be a post-generation side effect, not the root cause of generation failure.
- **LiteLLM was investigated and dropped as a red herring**: no project-level SIF/txtai wiring to LiteLLM was found.
- The SIF agent local-model path is **separate** from txtai embeddings and may be the source of the "local model used to work" feedback.
## Candidate Investigation Order
1. Authentication verification mismatch
2. Hugging Face model/provider availability mismatch
3. txtai initialization race (with `SIF_FAIL_FAST` behavior)
4. SIF agent local-model defaults (Qwen 1.5B vs lighter alternatives)
5. Any downstream workflow symptoms after the above are stabilized
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race):
- **Option A**: Change `SIF_FAIL_FAST` default to `false` for background operations
- Allows graceful deferral instead of RuntimeError
- Minimal code change, no logic changes
- **Option B**: Use `_ensure_initialized_async()` in `index_content`/`search`/`get_similarity`
- Awaits initialization before proceeding
- More robust but requires async refactoring
- **Option C**: Add initialization state callbacks to callers
- More complex, may not be necessary
### For Issue 5 (SIF agent local-model drift):
- **Option A**: Change default `model_name` in `SIFBaseAgent.__init__` to lighter model
- Example: `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`
- Single-line change, immediate effect
- **Option B**: Add env/config override for default agent local model
- More flexible, requires config wiring
- Allows runtime tuning without code changes
- **Option C**: Keep current default and rely on existing fallback chain
- The fallback chain already tries lighter models if memory fails
- May be sufficient if memory detection works correctly
## Current Evidence Sources
- Runtime logs from terminal `python` process `22056`
- `backend/middleware/auth_middleware.py`
- `backend/services/llm_providers/huggingface_provider.py`
- `backend/services/intelligence/txtai_service.py`
- `backend/api/today_workflow.py`
- `backend/services/today_workflow_service.py`
## Issue 5: SIF agent local-model drift (distinct from txtai embeddings)
- **Status**: Open
- **Severity**: Medium
- **Subsystem**: SIF agents / local LLM wrappers
- **Error signatures**:
- (No direct log signature yet; this is a hypothesis from user feedback that "local model used to work")
- **Observed behavior**:
- User reports that a local model that used to work for SIF agents now seems heavier or less responsive.
- The SIF agent path is **separate** from txtai embeddings.
- **Primary RCA hypothesis**:
- The SIF agent local LLM wrapper path uses a 1.5B parameter model by default, which may be heavier than the previous local model.
- This is distinct from txtai embeddings, which still use `sentence-transformers/all-MiniLM-L6-v2`.
- **Evidence in code**:
- `backend/services/intelligence/sif_agents.py`:
- Lines 47-51: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Lines 53-139: `LocalLLMWrapper` tries models in order, with memory issue detection and automatic fallback to smaller models
- Line 141: `SIFBaseAgent.__init__` default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- `backend/services/intelligence/txtai_service.py`:
- Line 48: Still uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings
- **Likely files to inspect/edit later**:
- `backend/services/intelligence/sif_agents.py`
- `backend/services/intelligence/agents/specialized/base.py`
- any config/env that controls default agent local model
- **Confidence**: Medium
- **Root-cause questions to answer**:
- What was the previous local model default for SIF agents?
- Is `Qwen/Qwen2.5-1.5B-Instruct` actually too heavy for the user's laptop?
- Should the default be changed to `Qwen/Qwen2.5-0.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`?
- Are there any env/config overrides that could make this configurable?
- **Validation after fix**:
- SIF agents use a CPU-friendly local model (e.g., smaller Qwen variant or TinyLlama).
- Agent generation completes without excessive CPU/memory pressure.
## Issue 6: Model initialization blocking and module unification
- **Status**: Open
- **Severity**: High
- **Subsystem**: Startup / model loading / module architecture
- **Error signatures**:
- (No direct log signature; architectural issue)
- **Observed behavior**:
- `start_alwrity_backend.py` pre-downloads `Qwen/Qwen2.5-3B-Instruct` **synchronously** before the server starts (line 122).
- `sif_agents.py` defaults to `Qwen/Qwen2.5-1.5B-Instruct` and uses lazy loading via `LocalLLMWrapper`.
- `txtai_service.py` uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
- Three separate modules handle model loading, creating confusion.
- User wants fail-fast semantics (catch bugs, avoid silent failures) AND proper fallback.
- User wants non-blocking model downloads for SIF/agents.
- **Primary RCA hypothesis**:
- Startup script blocks on model download, contradicting non-blocking requirement.
- Model size mismatch: startup downloads 3B, agents default to 1.5B.
- Fail-fast in `txtai_service.py` prevents fallback from working.
- Module separation (`txtai_service.py`, `sif_agents.py`, `start_alwrity_backend.py`) creates confusion.
- **Evidence in code**:
- `start_alwrity_backend.py`:
- Line 122: `target_model = "Qwen/Qwen2.5-3B-Instruct"`
- Lines 127-131: `snapshot_download()` is a **blocking** call
- Lines 117-120: Skips on Render/Railway but **not** on local dev
- `sif_agents.py`:
- Line 48: `LOCAL_LLM_FALLBACKS = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]`
- Line 141: Default `model_name="Qwen/Qwen2.5-1.5B-Instruct"`
- Line 150: Uses `LocalLLMWrapper` for lazy loading
- Lines 94-130: Has fallback logic with memory issue detection
- `txtai_service.py`:
- Line 57: `SIF_FAIL_FAST=true` (default) causes RuntimeError
- Lines 234-235, 284-285, 319-320: Fail-fast prevents fallback
- **Likely files to inspect/edit later**:
- `start_alwrity_backend.py` (remove blocking download)
- `services/intelligence/sif_agents.py` (unify model defaults)
- `services/intelligence/txtai_service.py` (fix fail-fast with fallback)
- Create unified `services/intelligence/model_registry.py` or similar
- **Confidence**: High
- **Root-cause questions to answer**:
- Should model download be truly non-blocking (background thread)?
- Should fail-fast be conditional (e.g., only for critical paths, not background ops)?
- Should module unification create a single `ModelRegistry` or `ModelManager`?
- How can JSON/response structure compatibility be ensured across the fallback chain?
- **Validation after fix**:
- Server starts without blocking on model download.
- SIF agents use consistent model defaults.
- Fail-fast catches bugs but allows fallback for non-critical ops.
- Single module handles all model loading logic.
## Minimal Fix Paths (Pre-Implementation)
### For Issue 3 (txtai init race) - REVISED:
- **Option A**: Change `SIF_FAIL_FAST` to be **conditional** (not global)
- Keep fail-fast for critical paths (user-initiated ops)
- Allow graceful deferral for background ops (indexing, clustering)
- Requires distinguishing operation types
- **Option B**: Use `_ensure_initialized_async()` for **blocking ops only**
- Keep non-blocking for background ops
- Awaits init for user-facing ops
- More robust but requires async refactoring
- **Option C**: Add operation-type-aware fail-fast
- Pass `critical=True/False` to operations
- Fail-fast only when `critical=True`
- Most aligned with user requirements
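
Option C could look roughly like the sketch below. The class and method names are hypothetical and only illustrate the `critical=True/False` idea; the real service would also need to replay deferred items once initialization completes.

```python
from typing import List


class InitAwareIndexer:
    """Toy model of operation-type-aware fail-fast (hypothetical, not project code)."""

    def __init__(self) -> None:
        self.initialized = False
        self.deferred: List[str] = []

    def index_content(self, item: str, *, critical: bool = False) -> bool:
        """Return True if indexed now, False if deferred; raise only when critical."""
        if self.initialized:
            return True
        if critical:
            # User-initiated path: surface the problem immediately.
            raise RuntimeError("service not initialized (critical operation)")
        # Background path: remember the item so it can be replayed after init.
        self.deferred.append(item)
        return False

    def mark_initialized(self) -> List[str]:
        """Flip the ready flag and hand back deferred items for replay."""
        self.initialized = True
        pending, self.deferred = self.deferred, []
        return pending
```

The design choice here is that background callers never see an exception for a not-yet-ready service, while user-facing callers still fail loudly.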
### For Issue 5 (SIF agent local-model drift) - REVISED:
- **Option A**: Change default to lighter model AND improve fallback chain
- Default: `Qwen/Qwen2.5-0.5B-Instruct` (lighter)
- Fallback: `0.5B → TinyLlama 1.1B`
- Ensure JSON/response structure compatibility
- **Option B**: Add env/config override + keep fallback chain
- `SIF_AGENT_MODEL` env var
- Fallback chain remains as-is
- More flexible
- **Option C**: Keep current default and rely on existing fallback chain
- **RECOMMENDED**: Already has memory detection and fallback
- Just need to ensure JSON compatibility
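
Option B can be combined with the existing fallback order. The sketch below assumes the proposed `SIF_AGENT_MODEL` variable and copies the fallback list quoted earlier in this tracker; `resolve_model_chain` is a hypothetical helper, not a function in `sif_agents.py`.

```python
import os
from typing import List, Mapping, Optional

# Order copied from the LOCAL_LLM_FALLBACKS value quoted in this tracker.
DEFAULT_FALLBACKS = [
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
]


def resolve_model_chain(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Put the env override (if any) first, then the remaining defaults."""
    env = os.environ if env is None else env
    override = (env.get("SIF_AGENT_MODEL") or "").strip()
    if not override:
        return list(DEFAULT_FALLBACKS)
    return [override] + [m for m in DEFAULT_FALLBACKS if m != override]
```

Setting `SIF_AGENT_MODEL=Qwen/Qwen2.5-0.5B-Instruct` would then try the lighter model first while keeping the other entries as fallbacks.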
### For Issue 6 (Model blocking + module unification):
- **Option A**: Remove blocking download from startup script
- Delete `bootstrap_local_llm_models()` call
- Let `LocalLLMWrapper` handle lazy loading
- Minimal change, immediate non-blocking
- **Option B**: Make download non-blocking (background thread)
- Keep pre-download but in background
- Server starts immediately
- More complex
- **Option C**: Create unified `ModelRegistry` module
- Single source of truth for model defaults
- Centralized download/cache logic
- Eliminates confusion between modules
- **RECOMMENDED for long-term**
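
Option B might be sketched as follows. `BackgroundModelLoader` is a hypothetical helper; the `download` callable is an assumption and would wrap whatever blocking call the startup script uses today (for example `snapshot_download`).

```python
import threading
from typing import Callable, Dict, Optional


class BackgroundModelLoader:
    """Hypothetical helper: runs a blocking download off the startup path."""

    def __init__(self, download: Callable[[str], str]) -> None:
        self._download = download
        self._lock = threading.Lock()
        self._threads: Dict[str, threading.Thread] = {}
        self._paths: Dict[str, str] = {}
        self._errors: Dict[str, BaseException] = {}

    def start(self, model_id: str) -> None:
        """Kick off the download once per model id and return immediately."""
        with self._lock:
            if model_id in self._threads:
                return
            thread = threading.Thread(target=self._run, args=(model_id,), daemon=True)
            self._threads[model_id] = thread
        thread.start()

    def _run(self, model_id: str) -> None:
        try:
            path = self._download(model_id)
            with self._lock:
                self._paths[model_id] = path
        except Exception as exc:  # keep the error for later inspection
            with self._lock:
                self._errors[model_id] = exc

    def path_if_ready(self, model_id: str) -> Optional[str]:
        with self._lock:
            return self._paths.get(model_id)

    def wait(self, model_id: str, timeout: float) -> Optional[str]:
        """Block up to `timeout` seconds, then return the path if available."""
        with self._lock:
            thread = self._threads.get(model_id)
        if thread is not None:
            thread.join(timeout)
        return self.path_if_ready(model_id)
```

The server would call `start(...)` during startup and continue; code that needs the model later can poll `path_if_ready` or fall back to lazy loading.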
## Session Update Log
### 2026-03-10
- Created initial RCA tracker document.
- Seeded first three concrete issues from supplied logs.
- No fixes applied from this document yet.
- Added Issue 5: SIF agent local-model drift (LiteLLM dropped as a red herring).
- Refined Issue 3 with `SIF_FAIL_FAST` behavior details.
- Added minimal fix paths for Issues 3 and 5.
- Added Issue 6: Model initialization blocking and module unification.
- Updated minimal fix paths based on user requirements (fail-fast + fallback, non-blocking, unification).

View File

@@ -36,9 +36,9 @@
"zustand": "^5.0.7"
},
"scripts": {
"start": "react-scripts start",
"build": "node --max_old_space_size=8192 node_modules/react-scripts/scripts/build.js",
"build:nomap": "node --max_old_space_size=8192 -e \"process.env.GENERATE_SOURCEMAP='false'; require('./node_modules/react-scripts/scripts/build');\"",
"start": "node --max_old_space_size=12288 node_modules/react-scripts/scripts/start.js",
"build": "node --max_old_space_size=12288 node_modules/react-scripts/scripts/build.js",
"build:nomap": "node --max_old_space_size=12288 -e \"process.env.GENERATE_SOURCEMAP='false'; require('./node_modules/react-scripts/scripts/build');\"",
"test": "react-scripts test",
"eject": "react-scripts eject",
"analyze": "npm run build && npx source-map-explorer 'build/static/js/*.js' --html bundle-report.html",

View File

@@ -6,6 +6,7 @@ import {
Snackbar,
useTheme
} from '@mui/material';
import { Lightbulb } from '@mui/icons-material';
import { motion, AnimatePresence } from 'framer-motion';
import { useNavigate } from 'react-router-dom';
import { useAuth } from '@clerk/clerk-react';
@@ -63,7 +64,9 @@ const MainDashboard: React.FC = () => {
const {
currentWorkflow,
workflowProgress,
scheduleStatus,
isLoading: workflowLoading,
loadTodayWorkflow,
generateDailyWorkflow,
startWorkflow,
pauseWorkflow,
@@ -71,19 +74,18 @@ const MainDashboard: React.FC = () => {
} = useWorkflowStore();
const { userId } = useAuth();
// Initialize workflow on component mount
React.useEffect(() => {
const initializeWorkflow = async () => {
try {
if (!userId) return;
await generateDailyWorkflow(userId);
await loadTodayWorkflow();
} catch (error) {
console.warn('Failed to initialize workflow:', error);
console.warn('Failed to load today workflow:', error);
}
};
initializeWorkflow();
}, [generateDailyWorkflow, userId]);
}, [loadTodayWorkflow, userId]);
// Debug logging for workflow state (only in development)
React.useEffect(() => {
@@ -238,6 +240,17 @@ const MainDashboard: React.FC = () => {
// Note: filteredCategories removed as it's not used in the current implementation
const statusChips = React.useMemo(() => {
const scheduled = !!scheduleStatus?.scheduled_run_completed;
return [
{
label: scheduled ? 'Scheduled workflow ready' : 'Scheduled workflow pending',
color: scheduled ? '#22c55e' : '#ef4444',
icon: <Lightbulb sx={{ color: scheduled ? '#22c55e' : '#ef4444' }} />,
},
];
}, [scheduleStatus]);
if (loading) {
return <LoadingSkeleton />;
}
@@ -297,7 +310,7 @@ const MainDashboard: React.FC = () => {
<DashboardHeader
title="Alwrity Content Hub"
subtitle=""
statusChips={[]}
statusChips={statusChips}
customIcon={AskAlwrityIcon}
workflowControls={{
onStartWorkflow: handleStartWorkflow,

View File

@@ -7,10 +7,11 @@ import {
UserWorkflowPreferences,
NavigationState,
WorkflowStatus,
WorkflowError
WorkflowError,
TodayWorkflowScheduleStatus
} from '../types/workflow';
import { taskWorkflowOrchestrator } from '../services/TaskWorkflowOrchestrator';
import { apiClient, ConnectionError, NetworkError } from '../api/client';
import { apiClient } from '../api/client';
const isServerWorkflowId = (workflowId: string) => workflowId.startsWith('daily-');
@@ -47,9 +48,6 @@ const normalizeServerWorkflow = (workflow: DailyWorkflow): DailyWorkflow => ({
: [],
});
const isServerUnavailableError = (error: unknown): boolean =>
error instanceof ConnectionError || error instanceof NetworkError;
const toWorkflowError = (error: unknown, fallbackMessage: string): WorkflowError => {
if (error instanceof WorkflowError) return error;
@@ -109,6 +107,7 @@ interface WorkflowState {
currentWorkflow: DailyWorkflow | null;
workflowProgress: WorkflowProgress | null;
navigationState: NavigationState | null;
scheduleStatus: TodayWorkflowScheduleStatus | null;
// User preferences
userPreferences: UserWorkflowPreferences | null;
@@ -121,6 +120,8 @@ interface WorkflowState {
degradedModeReason: string | null;
// Actions
loadTodayWorkflow: (date?: string) => Promise<void>;
refreshScheduleStatus: (date?: string) => Promise<void>;
generateDailyWorkflow: (userId: string, date?: string) => Promise<void>;
startWorkflow: (workflowId: string) => Promise<void>;
pauseWorkflow: (workflowId: string) => Promise<void>;
@@ -154,6 +155,7 @@ export const useWorkflowStore = create<WorkflowState>()(
currentWorkflow: null,
workflowProgress: null,
navigationState: null,
scheduleStatus: null,
userPreferences: null,
isWorkflowModalOpen: false,
isLoading: false,
@@ -161,14 +163,14 @@ export const useWorkflowStore = create<WorkflowState>()(
isDegradedMode: false,
degradedModeReason: null,
// Generate daily workflow
generateDailyWorkflow: async (userId: string, date?: string) => {
loadTodayWorkflow: async (date?: string) => {
set({ isLoading: true, error: null });
try {
const resp = await apiClient.get('/api/today-workflow', { params: date ? { date } : {} });
const serverWorkflow = resp?.data?.data?.workflow as DailyWorkflow | undefined;
const planSummary = resp?.data?.data?.plan?.provenance_summary;
const scheduleStatus = resp?.data?.data?.schedule_status as TodayWorkflowScheduleStatus | undefined;
if (serverWorkflow && Array.isArray(serverWorkflow.tasks)) {
if (planSummary && !serverWorkflow.provenanceSummary) {
@@ -180,6 +182,75 @@ export const useWorkflowStore = create<WorkflowState>()(
currentWorkflow: normalizedWorkflow,
workflowProgress: derived.progress,
navigationState: derived.navigation,
scheduleStatus: scheduleStatus || null,
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
});
return;
}
throw new WorkflowError({
code: 'WORKFLOW_SCHEMA_INVALID',
message: 'Server workflow response is missing a valid tasks array.',
timestamp: new Date(),
recoverable: false,
suggestedAction: 'Refresh and try again. If this persists, contact support.'
});
} catch (error: any) {
if (error?.response?.status === 404) {
set({
currentWorkflow: null,
workflowProgress: null,
navigationState: null,
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
});
await get().refreshScheduleStatus(date);
return;
}
set({
error: toWorkflowError(error, 'Failed to load workflow from server.'),
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
});
}
},
refreshScheduleStatus: async (date?: string) => {
try {
const resp = await apiClient.get('/api/today-workflow/status', { params: date ? { date } : {} });
const scheduleStatus = resp?.data?.data as TodayWorkflowScheduleStatus | undefined;
set({ scheduleStatus: scheduleStatus || null });
} catch {
set({ scheduleStatus: null });
}
},
// Generate daily workflow
generateDailyWorkflow: async (userId: string, date?: string) => {
set({ isLoading: true, error: null });
try {
const resp = await apiClient.post('/api/today-workflow/generate', null, { params: date ? { date } : {} });
const serverWorkflow = resp?.data?.data?.workflow as DailyWorkflow | undefined;
const planSummary = resp?.data?.data?.plan?.provenance_summary;
const scheduleStatus = resp?.data?.data?.schedule_status as TodayWorkflowScheduleStatus | undefined;
if (serverWorkflow && Array.isArray(serverWorkflow.tasks)) {
if (planSummary && !serverWorkflow.provenanceSummary) {
serverWorkflow.provenanceSummary = planSummary;
}
const normalizedWorkflow = normalizeServerWorkflow(serverWorkflow);
const derived = computeProgressAndNavigation(normalizedWorkflow);
set({
currentWorkflow: normalizedWorkflow,
workflowProgress: derived.progress,
navigationState: derived.navigation,
scheduleStatus: scheduleStatus || null,
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
@@ -195,34 +266,11 @@ export const useWorkflowStore = create<WorkflowState>()(
suggestedAction: 'Refresh and try again. If this persists, contact support.'
});
} catch (error) {
if (!isServerUnavailableError(error)) {
set({
error: toWorkflowError(error, 'Failed to load workflow from server.'),
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
});
return;
}
}
try {
const workflow = await taskWorkflowOrchestrator.generateDailyWorkflow(userId, date);
const progress = taskWorkflowOrchestrator.getWorkflowProgress(workflow.id);
const navigation = taskWorkflowOrchestrator.getNavigationState(workflow.id);
set({
currentWorkflow: workflow,
workflowProgress: progress,
navigationState: navigation,
isLoading: false,
isDegradedMode: true,
degradedModeReason: 'Server workflow unavailable. Using local fallback workflow.',
error: null,
});
} catch (error) {
set({
error: toWorkflowError(error, 'Failed to generate local fallback workflow.'),
error: toWorkflowError(error, 'Failed to generate workflow from server.'),
isLoading: false,
isDegradedMode: false,
degradedModeReason: null,
});
}
},

View File

@@ -14,6 +14,14 @@ export interface WorkflowProvenanceSummary {
taskSourceBreakdown: Partial<Record<WorkflowGenerationMode, number>>;
}
export interface TodayWorkflowScheduleStatus {
date: string;
generated: boolean;
scheduled_run_completed: boolean;
source: string | null;
created_at?: string | null;
}
export interface TodayTask {
id: string;
pillarId: string;

View File

@@ -24,7 +24,7 @@
"./node_modules/@types",
"./src/types"
],
"types": ["jest", "node"]
"types": ["node"]
},
"include": [
"src"