# 🚨 CRITICAL: Onboarding Data Must Use Database ## Issue Summary **Severity:** 🔴 CRITICAL **Impact:** User isolation, data persistence, security **Status:** ⚠️ NEEDS IMMEDIATE FIX AFTER DEPLOYMENT STABILIZES ## Problem Description The onboarding system currently saves all user data to a JSON file (`.onboarding_progress.json`) instead of using the database. This causes multiple critical issues: ### 1. **No User Isolation** 🔴 - All users share the same JSON file - User data can be overwritten by other users - Privacy violation - users can see each other's data - **Line:** `backend/services/api_key_manager.py:45` - **Code:** `self.progress_file = progress_file or ".onboarding_progress.json"` ### 2. **Data Loss on Deployment** 🔴 - Render uses ephemeral filesystem - File is deleted on every deployment/restart - Users lose all onboarding progress - Have to restart onboarding after each deployment ### 3. **No Scalability** 🔴 - Won't work with multiple backend instances - File locking issues - Race conditions - Performance bottleneck ### 4. **Security Risk** 🔴 - API keys stored in plain text JSON file - No encryption - File accessible with filesystem access - Should be in database with proper security ## Current Architecture ``` User completes step → OnboardingProgress.mark_step_completed() → save_progress() (line 214) → json.dump(progress_data, ".onboarding_progress.json") ``` **File Location:** `backend/.onboarding_progress.json` **Affected Code:** - `backend/services/api_key_manager.py` (OnboardingProgress class) - `backend/api/onboarding_utils/endpoints_core.py` - `backend/api/onboarding_utils/step_management_service.py` ## Database Models Available ✅ **Good News:** Proper database models already exist! **File:** `backend/models/onboarding.py` ```python - OnboardingSession (user_id, current_step, progress, started_at, updated_at) - APIKey (session_id, provider, key, created_at, updated_at) - WebsiteAnalysis (session_id, website_url, analysis_date, writing_style, etc.) - ResearchPreferences (session_id, research_depth, content_types, etc.) ``` **Database Schema:** - ✅ User isolation via `user_id` and `session_id` - ✅ Proper relationships and foreign keys - ✅ Timestamps for audit trail - ✅ JSON fields for complex data - ✅ Cascade deletes for cleanup ## Required Changes ### Phase 1: Database Layer (Priority 1) **File:** `backend/services/onboarding_database_service.py` (NEW) ```python class OnboardingDatabaseService: """Database-backed onboarding service replacing JSON file storage.""" def get_or_create_session(self, user_id: str) -> OnboardingSession: """Get existing session or create new one.""" def get_progress(self, user_id: str) -> OnboardingProgress: """Load progress from database.""" def save_step_data(self, user_id: str, step_number: int, data: Dict): """Save step data to database.""" def mark_step_completed(self, user_id: str, step_number: int): """Mark step as completed in database.""" def get_step_data(self, user_id: str, step_number: int) -> Dict: """Retrieve step data from database.""" ``` ### Phase 2: Refactor API Key Manager (Priority 1) **File:** `backend/services/api_key_manager.py` **Changes:** 1. Remove JSON file operations (lines 214-242) 2. Add database dependency injection 3. Replace `save_progress()` with database calls 4. Replace `load_progress()` with database queries 5. Add user_id parameter to all methods **Before:** ```python def mark_step_completed(self, step_number: int, data: Optional[Dict] = None): # ... update in-memory state ... self.save_progress() # Saves to JSON file ``` **After:** ```python def mark_step_completed(self, user_id: str, step_number: int, data: Optional[Dict] = None): # ... update database ... db_service.save_step_data(user_id, step_number, data) db_service.mark_step_completed(user_id, step_number) ``` ### Phase 3: Update Endpoints (Priority 2) **Files to Update:** - `backend/api/onboarding_utils/endpoints_core.py` - `backend/api/onboarding_utils/step_management_service.py` - `backend/api/onboarding_utils/step3_routes.py` - `backend/api/onboarding_utils/step4_persona_routes.py` **Changes:** 1. Pass `user_id` from `get_current_user` to all service calls 2. Remove file-based caching 3. Use database queries for progress retrieval ### Phase 4: Migration Script (Priority 3) **File:** `backend/scripts/migrate_onboarding_to_database.py` (NEW) ```python def migrate_json_to_database(): """ Migrate existing .onboarding_progress.json to database. Only needed if production has existing data in JSON files. """ # Read JSON file # Create database records # Backup JSON file # Delete JSON file ``` ## Implementation Plan ### Step 1: Create Database Service (1-2 hours) - [ ] Create `onboarding_database_service.py` - [ ] Implement CRUD operations - [ ] Add user isolation checks - [ ] Write unit tests ### Step 2: Refactor API Key Manager (2-3 hours) - [ ] Remove JSON file operations - [ ] Add database calls - [ ] Update method signatures with user_id - [ ] Test with database ### Step 3: Update Endpoints (1-2 hours) - [ ] Pass user_id to service calls - [ ] Remove file-based logic - [ ] Test each endpoint ### Step 4: Testing (2-3 hours) - [ ] Test user isolation - [ ] Test data persistence across deployments - [ ] Test concurrent users - [ ] Test error handling ### Step 5: Deployment (1 hour) - [ ] Deploy to staging - [ ] Run migration script if needed - [ ] Deploy to production - [ ] Monitor for issues **Total Estimated Time:** 8-12 hours ## Temporary Mitigation Until this is fixed, we must: 1. ✅ Add `.onboarding_progress.json` to `.gitignore` 2. ✅ Document that onboarding data will be lost on deployment 3. ⚠️ Warn users that onboarding must be completed in one session 4. ⚠️ Consider using Render's persistent disk (expensive workaround) ## Testing Checklist After migration: - [ ] User A completes onboarding - [ ] User B completes onboarding - [ ] Verify User A and User B data are separate - [ ] Redeploy backend - [ ] Verify both users' data persists - [ ] User C starts onboarding - [ ] Verify User C doesn't see User A or B data - [ ] Test concurrent onboarding (multiple users at once) - [ ] Verify API keys are stored securely - [ ] Test onboarding restart (partial completion) ## Security Considerations ### Current (Insecure): ```json { "steps": [ { "step_number": 1, "data": { "api_keys": { "gemini": "ACTUAL_API_KEY_HERE", "exa": "ACTUAL_API_KEY_HERE" } } } ] } ``` ### After Migration (Secure): - API keys in database with user isolation - Encrypted at rest (if database supports it) - Access controlled by user_id - Audit trail via timestamps ## References - Database Models: `backend/models/onboarding.py` - Current Implementation: `backend/services/api_key_manager.py` - Endpoints: `backend/api/onboarding_utils/` - Issue tracking: GitHub Issue #XXX (to be created) ## Priority **This must be fixed before:** - ❌ Going to production with real users - ❌ Accepting paying customers - ❌ Handling sensitive data - ❌ Scaling to multiple instances **Acceptable to delay if:** - ✅ Still in alpha/beta with limited users - ✅ Users aware of data loss on deployment - ✅ Not handling production workloads yet ## Conclusion This is a critical architectural flaw that violates basic principles: - User data isolation - Data persistence - Security best practices - Scalability **Must be fixed immediately after current deployment stabilizes.**