# Research Engine Codebase Review & Understanding **Date**: 2025-01-29 **Status**: Comprehensive Codebase Review Summary --- ## 📋 Executive Summary The ALwrity Research Engine is a **fully functional, production-ready intent-driven research system** that has evolved from a traditional keyword-based search to an AI-powered research assistant. The system uses a unified analyzer approach to reduce LLM calls by 50% while providing hyper-personalized research experiences based on user onboarding data. --- ## 🏗️ Architecture Overview ### Current Architecture (Intent-Driven) ``` User Input → UnifiedResearchAnalyzer (Single AI Call) ├── Intent Inference ├── Query Generation (4-8 queries) └── Parameter Optimization (Exa/Tavily) ↓ Research Execution (Exa → Tavily → Google) ↓ IntentAwareAnalyzer (Result Analysis) ↓ Structured Deliverables (Statistics, Quotes, Case Studies, etc.) ``` ### Key Architectural Principles 1. **Unified Analysis**: Single LLM call for intent + queries + params (50% reduction) 2. **Intent-Driven**: Understand user goals before searching 3. **Hyper-Personalization**: Leverage research persona from onboarding data 4. **Provider Priority**: Exa → Tavily → Google (semantic → real-time → fallback) 5. **Subscription-Aware**: All AI calls go through `llm_text_gen` with `user_id` --- ## 📁 Code Structure ### Backend Structure ``` backend/services/research/ ├── core/ │ ├── research_engine.py # Main orchestrator (standalone) │ ├── research_context.py # Unified input schema │ └── parameter_optimizer.py # DEPRECATED (use unified analyzer) │ ├── intent/ │ ├── unified_research_analyzer.py # ⭐ Unified AI analyzer (intent + queries + params) │ ├── intent_aware_analyzer.py # Result analysis based on intent │ ├── unified_prompt_builder.py # LLM prompt builders │ ├── unified_schema_builder.py # JSON schema builders │ ├── unified_result_parser.py # Result parsing utilities │ ├── query_deduplicator.py # Query deduplication logic │ ├── research_intent_inference.py # Legacy (use unified) │ └── intent_query_generator.py # Legacy (use unified) │ ├── trends/ │ ├── google_trends_service.py # Google Trends integration │ └── rate_limiter.py # Rate limiting for Trends API │ ├── research_persona_service.py # Research persona generation/retrieval ├── research_persona_prompt_builder.py # Persona generation prompts ├── exa_service.py # Exa API integration ├── tavily_service.py # Tavily API integration └── google_search_service.py # Google/Gemini grounding backend/api/research/ ├── router.py # Main router └── handlers/ ├── providers.py # Provider status endpoints ├── research.py # Traditional research endpoints ├── intent.py # Intent-driven endpoints └── projects.py # My Projects endpoints ``` ### Frontend Structure ``` frontend/src/components/Research/ ├── ResearchWizard.tsx # Main wizard orchestrator (3 steps) ├── steps/ │ ├── ResearchInput.tsx # Step 1: Input + Intent & Options │ ├── StepProgress.tsx # Step 2: Progress/polling │ ├── StepResults.tsx # Step 3: Results display │ ├── components/ │ │ ├── ResearchInputHeader.tsx # Header with Advanced toggle │ │ ├── ResearchInputContainer.tsx # Main input with Intent & Options button │ │ ├── IntentConfirmationPanel.tsx # Intent display/edit panel │ │ ├── IntentResultsDisplay.tsx # Tabbed results (Summary, Deliverables, Sources, Analysis) │ │ ├── AdvancedOptionsSection.tsx # Exa/Tavily options │ │ ├── ProviderChips.tsx # Provider availability display │ │ ├── PersonalizationIndicator.tsx # UI indicator for personalization │ │ ├── PersonalizationBadge.tsx # Badge-style indicator │ │ └── ... (other components) │ ├── hooks/ │ │ ├── useResearchConfig.ts # Config + persona loading │ │ ├── useKeywordExpansion.ts # Keyword expansion with persona │ │ └── useResearchAngles.ts # Research angles generation │ └── utils/ │ ├── placeholders.ts # Personalized placeholders │ └── industryDefaults.ts # Industry-specific defaults └── hooks/ ├── useResearchWizard.ts # Wizard state management ├── useResearchExecution.ts # Research execution orchestration └── useIntentResearch.ts # Intent research flow ``` --- ## 🔑 Key Components ### 1. UnifiedResearchAnalyzer ⭐ **Location**: `backend/services/research/intent/unified_research_analyzer.py` **Purpose**: Single AI call that performs: - Intent inference (what user wants) - Query generation (4-8 targeted queries) - Parameter optimization (Exa/Tavily settings with justifications) **Key Features**: - Reduces LLM calls from 2-3 to 1 (50% reduction) - Provides justifications for all parameter decisions - Uses research persona for context - Returns structured `ResearchIntent`, `ResearchQuery[]`, and `OptimizedConfig` **Usage Pattern**: ```python from services.research.intent.unified_research_analyzer import UnifiedResearchAnalyzer analyzer = UnifiedResearchAnalyzer() result = await analyzer.analyze( user_input=user_input, keywords=keywords, research_persona=research_persona, competitor_data=competitor_data, industry=industry, target_audience=target_audience, user_id=user_id, # Required for subscription checks ) ``` ### 2. IntentAwareAnalyzer **Location**: `backend/services/research/intent/intent_aware_analyzer.py` **Purpose**: Analyzes raw research results based on user intent to extract specific deliverables **Key Features**: - Extracts statistics, quotes, case studies, trends, comparisons - Structures results by deliverable type - Provides credibility scores for sources - Identifies gaps and follow-up queries **Usage Pattern**: ```python from services.research.intent.intent_aware_analyzer import IntentAwareAnalyzer analyzer = IntentAwareAnalyzer() result = await analyzer.analyze( raw_results=exa_tavily_results, intent=research_intent, research_persona=research_persona, user_id=user_id, # Required for subscription checks ) ``` ### 3. ResearchEngine **Location**: `backend/services/research/core/research_engine.py` **Purpose**: Orchestrates provider calls with priority order **Provider Priority**: 1. **Exa** (Primary): Semantic understanding, academic papers, competitor research 2. **Tavily** (Secondary): Real-time news, trending topics, quick facts 3. **Google** (Fallback): Basic factual queries via Gemini grounding ### 4. ResearchPersonaService **Location**: `backend/services/research/research_persona_service.py` **Purpose**: Generates and retrieves research persona from onboarding data **Persona Sources**: - Core persona (onboarding step 1) - Website analysis (onboarding step 2): `writing_style`, `content_characteristics`, `content_type`, `style_patterns`, `crawl_result` - Competitor analysis (onboarding step 3) **Features**: - Caches persona (7-day TTL) - Provides persona defaults for UI pre-filling - Generates personalized presets, keywords, and research angles --- ## 🔌 API Endpoints ### Intent-Driven Endpoints (Current - Recommended) 1. **POST `/api/research/intent/analyze`** - Analyzes user input to understand intent - Generates queries and optimizes parameters - Returns intent, queries, and optimized config - **Performance**: 2-5 seconds (single LLM call) 2. **POST `/api/research/intent/research`** - Executes research based on confirmed intent - Returns structured deliverables - **Performance**: 10-30 seconds (depends on provider and query count) ### Traditional Endpoints (Fallback) 3. **POST `/api/research/execute`** - Synchronous research execution 4. **POST `/api/research/start`** - Asynchronous research execution 5. **GET `/api/research/status/{task_id}`** - Poll async research status ### Configuration Endpoints 6. **GET `/api/research/config`** - Provider availability + persona defaults 7. **GET `/api/research/providers/status`** - Provider availability only 8. **GET `/api/research/persona-defaults`** - Persona defaults only --- ## 🔄 Research Flow ### Intent-Driven Research Flow (Current) ``` 1. User Input User enters: "AI marketing tools" ↓ 2. Intent Analysis (UnifiedResearchAnalyzer) POST /api/research/intent/analyze ├── Fetches Research Persona (if enabled) ├── Fetches Competitor Data (if enabled) └── Single LLM Call: ├── Intent Inference ├── Query Generation (4-8 queries) └── Parameter Optimization (Exa/Tavily) ↓ 3. Intent Confirmation (Frontend) IntentConfirmationPanel displays: ├── Inferred intent (editable) ├── Suggested queries (selectable) └── AI-optimized settings with justifications ↓ 4. Research Execution POST /api/research/intent/research ├── ResearchEngine executes queries (Exa → Tavily → Google) └── Returns raw results ↓ 5. Intent-Aware Analysis IntentAwareAnalyzer analyzes results: ├── Extracts statistics, quotes, case studies ├── Structures by deliverable type └── Returns IntentDrivenResearchResult ↓ 6. Results Display IntentResultsDisplay shows: ├── Summary Tab ├── Deliverables Tab ├── Sources Tab └── Analysis Tab ``` --- ## 🎯 Key Features Implemented ### ✅ Completed Features 1. **Intent-Driven Research Architecture** - UnifiedResearchAnalyzer (single AI call) - IntentAwareAnalyzer (result analysis) - 3-Step Wizard (ResearchInput → StepProgress → StepResults) - IntentConfirmationPanel (review/edit intent) 2. **Google Trends Integration** - Phase 1: Core Google Trends service - Phase 2: Hybrid approach (automatic + on-demand) - Phase 3: Enhanced UI with charts, export functionality - Integrated into intent-driven research flow 3. **Research Persona System** - Persona generation from onboarding data - Persona defaults for UI pre-filling - Caching (7-day TTL) - UI indicators showing personalization 4. **My Projects Feature** - Auto-save research projects upon completion - Asset Library integration - Restore functionality with full state persistence 5. **UI/UX Enhancements** - QueryEditor redesign - Google Trends keywords with chip-based UI - Industry-specific placeholders - Time-sensitive query handling - Personalization indicators --- ## 📊 Data Models ### ResearchIntent ```python class ResearchIntent: primary_question: str secondary_questions: List[str] purpose: ResearchPurpose # learn, create_content, make_decision, etc. content_output: ContentOutput # blog, podcast, video, etc. expected_deliverables: List[ExpectedDeliverable] depth: ResearchDepthLevel # overview, detailed, expert focus_areas: List[str] perspective: Optional[str] time_sensitivity: str confidence: float confidence_reason: Optional[str] great_example: Optional[str] needs_clarification: bool clarifying_questions: List[str] ``` ### ResearchQuery ```python class ResearchQuery: query: str purpose: ExpectedDeliverable provider: str # "exa" | "tavily" priority: int # 1-5 expected_results: str justification: Optional[str] ``` ### IntentDrivenResearchResult ```python class IntentDrivenResearchResult: primary_answer: str secondary_answers: Dict[str, str] statistics: List[StatisticWithCitation] expert_quotes: List[ExpertQuote] case_studies: List[CaseStudySummary] trends: List[TrendAnalysis] comparisons: List[ComparisonTable] best_practices: List[str] step_by_step: List[str] pros_cons: Optional[ProsCons] definitions: Dict[str, str] examples: List[str] predictions: List[str] executive_summary: str key_takeaways: List[str] suggested_outline: List[str] sources: List[SourceWithRelevance] confidence: float gaps_identified: List[str] follow_up_queries: List[str] ``` --- ## 🎨 UI Components ### ResearchWizard **Purpose**: Main wizard orchestrator **Steps**: 1. **ResearchInput**: Input + Intent & Options button 2. **StepProgress**: Progress/polling for async research 3. **StepResults**: Tabbed results display ### IntentConfirmationPanel **Purpose**: Shows inferred intent and allows editing **Features**: - Displays inferred intent (editable) - Shows suggested queries (selectable) - Displays AI-optimized settings with justifications - Advanced options for manual override ### IntentResultsDisplay **Purpose**: Tabbed results display **Tabs**: - **Summary**: AI-generated overview - **Deliverables**: Extracted statistics, quotes, case studies, etc. - **Sources**: Citations with credibility scores - **Analysis**: Deep insights based on intent --- ## 🔐 Security & Subscription ### Authentication All endpoints require JWT authentication via `get_current_user` dependency. ### Subscription Checks All LLM calls must pass `user_id` for subscription and pre-flight validation: ```python result = llm_text_gen( prompt=prompt, json_struct=schema, user_id=user_id # Required ) ``` ### Rate Limiting - Subject to subscription tier limits - Provider APIs (Exa/Tavily/Google) have their own rate limits --- ## 📈 Performance ### Intent Analysis - **Typical Time**: 2-5 seconds - **LLM Calls**: 1 (unified analyzer) - **Caching**: Research persona cached (7-day TTL) ### Research Execution - **Typical Time**: 10-30 seconds - **Depends On**: Provider, query count, result count - **Async Support**: Yes (via `/api/research/start`) ### Result Analysis - **Typical Time**: 5-10 seconds - **LLM Calls**: 1 (intent-aware analyzer) --- ## 🔗 Integration Points ### Blog Writer Integration Research Engine can be imported by Blog Writer: ```python from services.research.core.research_engine import ResearchEngine from services.research.core.research_context import ResearchContext context = ResearchContext( query=blog_topic, keywords=blog_keywords, goal=ResearchGoal.FACTUAL, depth=ResearchDepth.COMPREHENSIVE, ) engine = ResearchEngine() result = await engine.research(context, user_id=user_id) ``` ### Frontend Integration Research Wizard can be reused in other tools: ```tsx import { ResearchWizard } from '@/components/Research/ResearchWizard'; { // Use results in blog/video generation }} initialKeywords={blogTopic} initialIndustry={userIndustry} /> ``` --- ## ✅ Best Practices 1. **Always use UnifiedResearchAnalyzer** for new intent-driven research 2. **Always pass user_id** to all LLM calls 3. **Always use IntentAwareAnalyzer** for result analysis 4. **Check provider availability** before using providers 5. **Provide justifications** for all AI-driven settings 6. **Allow user overrides** in Advanced Options 7. **Never fallback to "General"** - always use persona defaults --- ## 🚫 Common Pitfalls to Avoid 1. ❌ **Rule-Based Parameter Optimization**: Always use AI-driven optimization via `UnifiedResearchAnalyzer` 2. ❌ **Missing `user_id`**: Always pass `user_id` to `llm_text_gen` for subscription checks 3. ❌ **Breaking Changes**: Never modify Research Engine in a way that breaks existing tools (Blog Writer, etc.) 4. ❌ **Hardcoded Defaults**: Always use persona defaults, never hardcode "General" values 5. ❌ **Multiple LLM Calls**: Use unified analyzer instead of separate intent + query + params calls 6. ❌ **Ignoring Provider Availability**: Always check provider availability before using 7. ❌ **Missing Justifications**: Every AI-driven setting must have a justification for UI display --- ## 📋 Pending Items & TODOs ### From Code Review 1. **File Upload Logic** (ResearchInput.tsx:396) - TODO: Implement file upload logic for research input - Status: Not started (low priority) ### Documentation Gaps 1. **Intent-Driven Research Documentation** - ✅ Comprehensive guide created (`INTENT_DRIVEN_RESEARCH_GUIDE.md`) - ✅ API reference created (`INTENT_RESEARCH_API_REFERENCE.md`) - ✅ Architecture overview created (`CURRENT_ARCHITECTURE_OVERVIEW.md`) 2. **Outdated Documentation** - ⚠️ Some docs still reference old 4-step wizard - ⚠️ Need to update implementation guides - See `DOCUMENTATION_REVIEW_AND_UPDATE_PLAN.md` for details --- ## 🎯 Suggested Next Steps ### Priority 1: Documentation Updates (High Value, Low Effort) 1. Update outdated implementation documentation 2. Create integration examples 3. Update component documentation ### Priority 2: Dashboard Alert System Integration (Medium Value, Medium Effort) 1. Research cost alerts 2. Research efficiency alerts 3. Integration with billing dashboard alerts ### Priority 3: Feature Enhancements (Variable Value, Variable Effort) 1. File upload for research input 2. Research templates 3. Research comparison 4. Advanced export options ### Priority 4: Performance & Optimization (Low Value, High Effort) 1. Research result caching 2. Batch research operations --- ## 📚 Related Documentation ### Current & Accurate - ✅ **CURRENT_ARCHITECTURE_OVERVIEW.md** - Single source of truth - ✅ **INTENT_DRIVEN_RESEARCH_GUIDE.md** - Comprehensive guide - ✅ **INTENT_RESEARCH_API_REFERENCE.md** - Complete API docs - ✅ **.cursor/rules/researcher-architecture.mdc** - Authoritative rules - ✅ **PHASE2_IMPLEMENTATION_SUMMARY.md** - Persona enhancements - ✅ **PHASE3_AND_UI_INDICATORS_IMPLEMENTATION.md** - Phase 3 features - ✅ **RESEARCH_PERSONA_DATA_SOURCES.md** - Persona data sources ### Outdated (Historical Reference Only) - ⚠️ **RESEARCH_WIZARD_IMPLEMENTATION.md** - Describes old 4-step wizard - ⚠️ **RESEARCH_COMPONENT_INTEGRATION.md** - Mentions old architecture - ⚠️ **PHASE1_IMPLEMENTATION_REVIEW.md** - Missing intent-driven research - ⚠️ **RESEARCH_IMPROVEMENTS_SUMMARY.md** - Missing intent-driven research - ⚠️ **COMPLETE_IMPLEMENTATION_SUMMARY.md** - Missing intent-driven research --- ## ✅ Conclusion The Research Engine is **fully functional and production-ready**. The system has evolved from a traditional keyword-based search to an AI-powered intent-driven research assistant with: - **50% reduction in LLM calls** (unified analyzer) - **Hyper-personalization** based on onboarding data - **Structured deliverables** (statistics, quotes, case studies, etc.) - **Provider optimization** (Exa → Tavily → Google) - **UI indicators** showing personalization - **My Projects** integration with Asset Library **Main Gaps**: 1. Documentation updates (some outdated docs) 2. Alert system integration (cost/efficiency alerts) 3. Feature enhancements (file upload, templates, etc.) **Recommended Focus**: Start with documentation updates (high value, low effort) followed by alert system integration (improves user experience and cost transparency). --- **Status**: Codebase Review Complete - System is Production-Ready 🚀