Research Engine Codebase Review & Understanding
Date: 2025-01-29
Status: Comprehensive Codebase Review Summary
📋 Executive Summary
The ALwrity Research Engine is a fully functional, production-ready intent-driven research system that has evolved from a traditional keyword-based search to an AI-powered research assistant. The system uses a unified analyzer approach to reduce LLM calls by 50% while providing hyper-personalized research experiences based on user onboarding data.
🏗️ Architecture Overview
Current Architecture (Intent-Driven)
User Input → UnifiedResearchAnalyzer (Single AI Call)
    ├── Intent Inference
    ├── Query Generation (4-8 queries)
    └── Parameter Optimization (Exa/Tavily)
    ↓
Research Execution (Exa → Tavily → Google)
    ↓
IntentAwareAnalyzer (Result Analysis)
    ↓
Structured Deliverables (Statistics, Quotes, Case Studies, etc.)
Key Architectural Principles
- Unified Analysis: Single LLM call for intent + queries + params (50% reduction)
- Intent-Driven: Understand user goals before searching
- Hyper-Personalization: Leverage research persona from onboarding data
- Provider Priority: Exa → Tavily → Google (semantic → real-time → fallback)
- Subscription-Aware: All AI calls go through llm_text_gen with user_id
📁 Code Structure
Backend Structure
backend/services/research/
├── core/
│ ├── research_engine.py # Main orchestrator (standalone)
│ ├── research_context.py # Unified input schema
│ └── parameter_optimizer.py # DEPRECATED (use unified analyzer)
│
├── intent/
│ ├── unified_research_analyzer.py # ⭐ Unified AI analyzer (intent + queries + params)
│ ├── intent_aware_analyzer.py # Result analysis based on intent
│ ├── unified_prompt_builder.py # LLM prompt builders
│ ├── unified_schema_builder.py # JSON schema builders
│ ├── unified_result_parser.py # Result parsing utilities
│ ├── query_deduplicator.py # Query deduplication logic
│ ├── research_intent_inference.py # Legacy (use unified)
│ └── intent_query_generator.py # Legacy (use unified)
│
├── trends/
│ ├── google_trends_service.py # Google Trends integration
│ └── rate_limiter.py # Rate limiting for Trends API
│
├── research_persona_service.py # Research persona generation/retrieval
├── research_persona_prompt_builder.py # Persona generation prompts
├── exa_service.py # Exa API integration
├── tavily_service.py # Tavily API integration
└── google_search_service.py # Google/Gemini grounding
backend/api/research/
├── router.py # Main router
└── handlers/
    ├── providers.py # Provider status endpoints
    ├── research.py # Traditional research endpoints
    ├── intent.py # Intent-driven endpoints
    └── projects.py # My Projects endpoints
Frontend Structure
frontend/src/components/Research/
├── ResearchWizard.tsx # Main wizard orchestrator (3 steps)
├── steps/
│ ├── ResearchInput.tsx # Step 1: Input + Intent & Options
│ ├── StepProgress.tsx # Step 2: Progress/polling
│ ├── StepResults.tsx # Step 3: Results display
│ ├── components/
│ │ ├── ResearchInputHeader.tsx # Header with Advanced toggle
│ │ ├── ResearchInputContainer.tsx # Main input with Intent & Options button
│ │ ├── IntentConfirmationPanel.tsx # Intent display/edit panel
│ │ ├── IntentResultsDisplay.tsx # Tabbed results (Summary, Deliverables, Sources, Analysis)
│ │ ├── AdvancedOptionsSection.tsx # Exa/Tavily options
│ │ ├── ProviderChips.tsx # Provider availability display
│ │ ├── PersonalizationIndicator.tsx # UI indicator for personalization
│ │ ├── PersonalizationBadge.tsx # Badge-style indicator
│ │ └── ... (other components)
│ ├── hooks/
│ │ ├── useResearchConfig.ts # Config + persona loading
│ │ ├── useKeywordExpansion.ts # Keyword expansion with persona
│ │ └── useResearchAngles.ts # Research angles generation
│ └── utils/
│ ├── placeholders.ts # Personalized placeholders
│ └── industryDefaults.ts # Industry-specific defaults
└── hooks/
    ├── useResearchWizard.ts # Wizard state management
    ├── useResearchExecution.ts # Research execution orchestration
    └── useIntentResearch.ts # Intent research flow
🔑 Key Components
1. UnifiedResearchAnalyzer ⭐
Location: backend/services/research/intent/unified_research_analyzer.py
Purpose: Single AI call that performs:
- Intent inference (what user wants)
- Query generation (4-8 targeted queries)
- Parameter optimization (Exa/Tavily settings with justifications)
Key Features:
- Reduces LLM calls from 2-3 to 1 (a reduction of 50% or more)
- Provides justifications for all parameter decisions
- Uses research persona for context
- Returns structured ResearchIntent, ResearchQuery[], and OptimizedConfig
Usage Pattern:
from services.research.intent.unified_research_analyzer import UnifiedResearchAnalyzer

analyzer = UnifiedResearchAnalyzer()
result = await analyzer.analyze(
    user_input=user_input,
    keywords=keywords,
    research_persona=research_persona,
    competitor_data=competitor_data,
    industry=industry,
    target_audience=target_audience,
    user_id=user_id,  # Required for subscription checks
)
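Because a single call returns three sections, it can help to sanity-check the combined payload before using it. The sketch below is illustrative only: the top-level key names (`intent`, `queries`, `optimized_config`) and the `validate_unified_response` helper are assumptions, not the analyzer's actual schema. Only the 4-8 query range comes from this document.

```python
from typing import Any, Dict, List

# Assumed top-level sections of the unified response (hypothetical key names).
REQUIRED_SECTIONS = ("intent", "queries", "optimized_config")
MIN_QUERIES, MAX_QUERIES = 4, 8  # range stated in this document


def validate_unified_response(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems: List[str] = []
    for section in REQUIRED_SECTIONS:
        if section not in payload:
            problems.append(f"missing section: {section}")
    queries = payload.get("queries")
    # Only check the count when a query list is actually present.
    if isinstance(queries, list) and not MIN_QUERIES <= len(queries) <= MAX_QUERIES:
        problems.append(
            f"expected {MIN_QUERIES}-{MAX_QUERIES} queries, got {len(queries)}"
        )
    return problems
```

A check like this would run once, right after the LLM response is parsed, so that a malformed response fails early rather than during research execution.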
2. IntentAwareAnalyzer
Location: backend/services/research/intent/intent_aware_analyzer.py
Purpose: Analyzes raw research results based on user intent to extract specific deliverables
Key Features:
- Extracts statistics, quotes, case studies, trends, comparisons
- Structures results by deliverable type
- Provides credibility scores for sources
- Identifies gaps and follow-up queries
Usage Pattern:
from services.research.intent.intent_aware_analyzer import IntentAwareAnalyzer

analyzer = IntentAwareAnalyzer()
result = await analyzer.analyze(
    raw_results=exa_tavily_results,
    intent=research_intent,
    research_persona=research_persona,
    user_id=user_id,  # Required for subscription checks
)
3. ResearchEngine
Location: backend/services/research/core/research_engine.py
Purpose: Orchestrates provider calls with priority order
Provider Priority:
- Exa (Primary): Semantic understanding, academic papers, competitor research
- Tavily (Secondary): Real-time news, trending topics, quick facts
- Google (Fallback): Basic factual queries via Gemini grounding
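The priority order above can be read as a fallback chain. The following is a minimal sketch of that idea, not the ResearchEngine's actual code: the `Provider` callable type, the `search_with_fallback` function, and the `available` flags are all hypothetical, and the real engine may instead route each query to the provider named on it.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical provider signature: takes a query, returns a list of result dicts.
Provider = Callable[[str], List[dict]]

PRIORITY = ["exa", "tavily", "google"]  # order taken from the list above


def search_with_fallback(
    query: str,
    providers: Dict[str, Provider],
    available: Dict[str, bool],
) -> Optional[dict]:
    """Try providers in priority order; return the first non-empty result set."""
    for name in PRIORITY:
        if not available.get(name) or name not in providers:
            continue  # skip providers that are unavailable or not registered
        try:
            results = providers[name](query)
        except Exception:
            continue  # a failing provider should not abort the whole search
        if results:
            return {"provider": name, "results": results}
    return None  # every provider was skipped, failed, or returned nothing
```

Passing availability in explicitly keeps the "check provider availability before using providers" rule visible at the call site.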
4. ResearchPersonaService
Location: backend/services/research/research_persona_service.py
Purpose: Generates and retrieves research persona from onboarding data
Persona Sources:
- Core persona (onboarding step 1)
- Website analysis (onboarding step 2): writing_style, content_characteristics, content_type, style_patterns, crawl_result
- Competitor analysis (onboarding step 3)
Features:
- Caches persona (7-day TTL)
- Provides persona defaults for UI pre-filling
- Generates personalized presets, keywords, and research angles
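The 7-day TTL can be pictured with a small in-memory cache. This is a sketch under assumptions: `PersonaCache` and its methods are hypothetical names, and the real service may well persist personas in a database rather than in process memory.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL from this document


class PersonaCache:
    """In-memory persona cache keyed by user_id with a fixed time-to-live."""

    def __init__(
        self,
        ttl_seconds: int = SEVEN_DAYS_SECONDS,
        clock: Callable[[], float] = time.time,  # injectable for testing
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, user_id: str, persona: Any) -> None:
        self._entries[user_id] = (self._clock(), persona)

    def get(self, user_id: str) -> Optional[Any]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, persona = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[user_id]  # expired: evict so the caller regenerates
            return None
        return persona
```

A `None` from `get` would be the signal to regenerate the persona from onboarding data and `put` it back.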
🔌 API Endpoints
Intent-Driven Endpoints (Current - Recommended)
- POST /api/research/intent/analyze
  - Analyzes user input to understand intent
  - Generates queries and optimizes parameters
  - Returns intent, queries, and optimized config
  - Performance: 2-5 seconds (single LLM call)
- POST /api/research/intent/research
  - Executes research based on confirmed intent
  - Returns structured deliverables
  - Performance: 10-30 seconds (depends on provider and query count)
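As a rough illustration of calling the analyze endpoint, the sketch below only builds the HTTP request and does not send it. The base URL, the request body field names (`user_input`, `keywords`), and the `Bearer` header scheme are assumptions; the document states only that endpoints require JWT authentication, so check the API reference for the real request shape.

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://localhost:8000"  # assumption: local development server


def build_analyze_request(
    user_input: str, keywords: List[str], jwt_token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the intent analysis endpoint."""
    # Body field names are hypothetical; they mirror the analyzer's keyword arguments.
    body = json.dumps({"user_input": user_input, "keywords": keywords}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/research/intent/analyze",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {jwt_token}",  # assumed bearer-token scheme
        },
        method="POST",
    )
```

The request could then be sent with `urllib.request.urlopen(request)`, with a timeout comfortably above the 2-5 second typical response time.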
Traditional Endpoints (Fallback)
- POST /api/research/execute - Synchronous research execution
- POST /api/research/start - Asynchronous research execution
- GET /api/research/status/{task_id} - Poll async research status
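The start-then-poll pattern for the asynchronous endpoints might look like the sketch below. The `fetch_status` callable stands in for a GET to `/api/research/status/{task_id}`, and the status values (`"completed"`, `"failed"`) and the `status` field are assumptions, not the documented response format.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = ("completed", "failed")  # hypothetical status values


def poll_until_done(
    fetch_status: Callable[[str], Dict],
    task_id: str,
    interval_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll a task until it reaches a terminal state or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status(task_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout_seconds}s")
        sleep(interval_seconds)  # wait before asking again
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.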
Configuration Endpoints
- GET /api/research/config - Provider availability + persona defaults
- GET /api/research/providers/status - Provider availability only
- GET /api/research/persona-defaults - Persona defaults only
🔄 Research Flow
Intent-Driven Research Flow (Current)
1. User Input
   User enters: "AI marketing tools"
   ↓
2. Intent Analysis (UnifiedResearchAnalyzer)
   POST /api/research/intent/analyze
   ├── Fetches Research Persona (if enabled)
   ├── Fetches Competitor Data (if enabled)
   └── Single LLM Call:
       ├── Intent Inference
       ├── Query Generation (4-8 queries)
       └── Parameter Optimization (Exa/Tavily)
   ↓
3. Intent Confirmation (Frontend)
   IntentConfirmationPanel displays:
   ├── Inferred intent (editable)
   ├── Suggested queries (selectable)
   └── AI-optimized settings with justifications
   ↓
4. Research Execution
   POST /api/research/intent/research
   ├── ResearchEngine executes queries (Exa → Tavily → Google)
   └── Returns raw results
   ↓
5. Intent-Aware Analysis
   IntentAwareAnalyzer analyzes results:
   ├── Extracts statistics, quotes, case studies
   ├── Structures by deliverable type
   └── Returns IntentDrivenResearchResult
   ↓
6. Results Display
   IntentResultsDisplay shows:
   ├── Summary Tab
   ├── Deliverables Tab
   ├── Sources Tab
   └── Analysis Tab
🎯 Key Features Implemented
✅ Completed Features
- Intent-Driven Research Architecture
  - UnifiedResearchAnalyzer (single AI call)
  - IntentAwareAnalyzer (result analysis)
  - 3-Step Wizard (ResearchInput → StepProgress → StepResults)
  - IntentConfirmationPanel (review/edit intent)
- Google Trends Integration
  - Phase 1: Core Google Trends service
  - Phase 2: Hybrid approach (automatic + on-demand)
  - Phase 3: Enhanced UI with charts, export functionality
  - Integrated into intent-driven research flow
- Research Persona System
  - Persona generation from onboarding data
  - Persona defaults for UI pre-filling
  - Caching (7-day TTL)
  - UI indicators showing personalization
- My Projects Feature
  - Auto-save research projects upon completion
  - Asset Library integration
  - Restore functionality with full state persistence
- UI/UX Enhancements
  - QueryEditor redesign
  - Google Trends keywords with chip-based UI
  - Industry-specific placeholders
  - Time-sensitive query handling
  - Personalization indicators
📊 Data Models
ResearchIntent
class ResearchIntent:
    primary_question: str
    secondary_questions: List[str]
    purpose: ResearchPurpose  # learn, create_content, make_decision, etc.
    content_output: ContentOutput  # blog, podcast, video, etc.
    expected_deliverables: List[ExpectedDeliverable]
    depth: ResearchDepthLevel  # overview, detailed, expert
    focus_areas: List[str]
    perspective: Optional[str]
    time_sensitivity: str
    confidence: float
    confidence_reason: Optional[str]
    great_example: Optional[str]
    needs_clarification: bool
    clarifying_questions: List[str]
ResearchQuery
class ResearchQuery:
    query: str
    purpose: ExpectedDeliverable
    provider: str  # "exa" | "tavily"
    priority: int  # 1-5
    expected_results: str
    justification: Optional[str]
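The file tree lists a query_deduplicator.py, but its logic is not shown in this review. One plausible approach, sketched here purely as an assumption, is to drop queries whose text matches an earlier one after normalizing case and whitespace. `QuerySketch` is a trimmed, hypothetical stand-in for ResearchQuery, and `dedupe_queries` is not the project's actual function.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class QuerySketch:
    """Trimmed, hypothetical stand-in for ResearchQuery."""
    query: str
    provider: str  # "exa" | "tavily"
    priority: int  # 1-5


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so near-identical queries match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe_queries(queries: List[QuerySketch]) -> List[QuerySketch]:
    """Keep the first occurrence of each normalized query text, preserving order."""
    seen: Set[str] = set()
    unique: List[QuerySketch] = []
    for item in queries:
        key = normalize(item.query)
        if key in seen:
            continue  # duplicate of an earlier query
        seen.add(key)
        unique.append(item)
    return unique
```

This only catches exact matches after normalization; detecting semantically similar queries would need a different technique.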
IntentDrivenResearchResult
class IntentDrivenResearchResult:
    primary_answer: str
    secondary_answers: Dict[str, str]
    statistics: List[StatisticWithCitation]
    expert_quotes: List[ExpertQuote]
    case_studies: List[CaseStudySummary]
    trends: List[TrendAnalysis]
    comparisons: List[ComparisonTable]
    best_practices: List[str]
    step_by_step: List[str]
    pros_cons: Optional[ProsCons]
    definitions: Dict[str, str]
    examples: List[str]
    predictions: List[str]
    executive_summary: str
    key_takeaways: List[str]
    suggested_outline: List[str]
    sources: List[SourceWithRelevance]
    confidence: float
    gaps_identified: List[str]
    follow_up_queries: List[str]
🎨 UI Components
ResearchWizard
Purpose: Main wizard orchestrator
Steps:
- ResearchInput: Input + Intent & Options button
- StepProgress: Progress/polling for async research
- StepResults: Tabbed results display
IntentConfirmationPanel
Purpose: Shows inferred intent and allows editing
Features:
- Displays inferred intent (editable)
- Shows suggested queries (selectable)
- Displays AI-optimized settings with justifications
- Advanced options for manual override
IntentResultsDisplay
Purpose: Tabbed results display
Tabs:
- Summary: AI-generated overview
- Deliverables: Extracted statistics, quotes, case studies, etc.
- Sources: Citations with credibility scores
- Analysis: Deep insights based on intent
🔐 Security & Subscription
Authentication
All endpoints require JWT authentication via get_current_user dependency.
Subscription Checks
All LLM calls must pass user_id for subscription and pre-flight validation:
result = llm_text_gen(
    prompt=prompt,
    json_struct=schema,
    user_id=user_id,  # Required
)
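One way to keep the user_id requirement from being forgotten is a thin guard around the call. The sketch below is hypothetical: `call_llm_checked` and `MissingUserIdError` are illustrative names, and the LLM function is passed in as a parameter so the example does not depend on where `llm_text_gen` is actually imported from.

```python
from typing import Any, Callable, Optional


class MissingUserIdError(ValueError):
    """Raised when an LLM call is attempted without a user_id."""


def call_llm_checked(
    llm_call: Callable[..., Any],
    prompt: str,
    json_struct: dict,
    user_id: Optional[str],
) -> Any:
    """Refuse to call the LLM unless a non-empty user_id is supplied."""
    if not user_id:
        raise MissingUserIdError("user_id is required for subscription checks")
    # Forward using the same keyword arguments shown in the example above.
    return llm_call(prompt=prompt, json_struct=json_struct, user_id=user_id)
```

In practice `llm_text_gen` would be passed as `llm_call`, so a missing user_id fails loudly before any subscription or pre-flight validation runs.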
Rate Limiting
- Subject to subscription tier limits
- Provider APIs (Exa/Tavily/Google) have their own rate limits
📈 Performance
Intent Analysis
- Typical Time: 2-5 seconds
- LLM Calls: 1 (unified analyzer)
- Caching: Research persona cached (7-day TTL)
Research Execution
- Typical Time: 10-30 seconds
- Depends On: Provider, query count, result count
- Async Support: Yes (via /api/research/start)
Result Analysis
- Typical Time: 5-10 seconds
- LLM Calls: 1 (intent-aware analyzer)
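Adding the three stages together gives a rough end-to-end budget. The arithmetic below assumes the stages run sequentially and that the quoted ranges do not overlap (for example, that the 10-30 second execution figure excludes result analysis). The document does not confirm either assumption, so treat the total as an upper-level estimate.

```python
from typing import Dict, Tuple

# Typical (min, max) seconds per stage, copied from the figures above.
STAGES: Dict[str, Tuple[int, int]] = {
    "intent_analysis": (2, 5),
    "research_execution": (10, 30),
    "result_analysis": (5, 10),
}


def total_range(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the per-stage minimums and maximums, assuming sequential execution."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_range(STAGES)
print(f"End-to-end estimate: {best}-{worst} seconds")  # End-to-end estimate: 17-45 seconds
```

Under these assumptions a full intent-driven run uses 2 LLM calls: one for the unified analyzer and one for the intent-aware analyzer.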
🔗 Integration Points
Blog Writer Integration
Research Engine can be imported by Blog Writer:
from services.research.core.research_engine import ResearchEngine
# ResearchGoal and ResearchDepth are assumed to be exported by research_context
from services.research.core.research_context import (
    ResearchContext,
    ResearchDepth,
    ResearchGoal,
)

context = ResearchContext(
    query=blog_topic,
    keywords=blog_keywords,
    goal=ResearchGoal.FACTUAL,
    depth=ResearchDepth.COMPREHENSIVE,
)
engine = ResearchEngine()
result = await engine.research(context, user_id=user_id)
Frontend Integration
Research Wizard can be reused in other tools:
import { ResearchWizard } from '@/components/Research/ResearchWizard';

<ResearchWizard
  onComplete={(results) => {
    // Use results in blog/video generation
  }}
  initialKeywords={blogTopic}
  initialIndustry={userIndustry}
/>
✅ Best Practices
- Always use UnifiedResearchAnalyzer for new intent-driven research
- Always pass user_id to all LLM calls
- Always use IntentAwareAnalyzer for result analysis
- Check provider availability before using providers
- Provide justifications for all AI-driven settings
- Allow user overrides in Advanced Options
- Never fall back to "General"; always use persona defaults
🚫 Common Pitfalls to Avoid
- ❌ Rule-Based Parameter Optimization: Always use AI-driven optimization via UnifiedResearchAnalyzer
- ❌ Missing user_id: Always pass user_id to llm_text_gen for subscription checks
- ❌ Breaking Changes: Never modify Research Engine in a way that breaks existing tools (Blog Writer, etc.)
- ❌ Hardcoded Defaults: Always use persona defaults, never hardcode "General" values
- ❌ Multiple LLM Calls: Use unified analyzer instead of separate intent + query + params calls
- ❌ Ignoring Provider Availability: Always check provider availability before using
- ❌ Missing Justifications: Every AI-driven setting must have a justification for UI display
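The justification rule lends itself to an automated check. The sketch below assumes each optimized setting is a dict with `value` and `justification` keys. That shape, the example setting names, and the `settings_missing_justification` helper are hypothetical, not the project's actual config format.

```python
from typing import Any, Dict, List


def settings_missing_justification(settings: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of settings whose justification is absent or blank."""
    missing: List[str] = []
    for name, setting in settings.items():
        justification = setting.get("justification")
        if not isinstance(justification, str) or not justification.strip():
            missing.append(name)
    return missing


# Hypothetical setting names, used only for illustration.
example = {
    "exa_result_count": {"value": 10, "justification": "Broad topic needs more sources"},
    "tavily_depth": {"value": "advanced", "justification": "   "},
    "time_range": {"value": "month"},
}
print(settings_missing_justification(example))  # ['tavily_depth', 'time_range']
```

A check like this could run before the config is sent to the frontend, so the IntentConfirmationPanel never shows a setting without an explanation.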
📋 Pending Items & TODOs
From Code Review
- File Upload Logic (ResearchInput.tsx:396)
  - TODO: Implement file upload logic for research input
  - Status: Not started (low priority)
Documentation Gaps
- Intent-Driven Research Documentation
  - ✅ Comprehensive guide created (INTENT_DRIVEN_RESEARCH_GUIDE.md)
  - ✅ API reference created (INTENT_RESEARCH_API_REFERENCE.md)
  - ✅ Architecture overview created (CURRENT_ARCHITECTURE_OVERVIEW.md)
- Outdated Documentation
  - ⚠️ Some docs still reference old 4-step wizard
  - ⚠️ Need to update implementation guides
  - See DOCUMENTATION_REVIEW_AND_UPDATE_PLAN.md for details
🎯 Suggested Next Steps
Priority 1: Documentation Updates (High Value, Low Effort)
- Update outdated implementation documentation
- Create integration examples
- Update component documentation
Priority 2: Dashboard Alert System Integration (Medium Value, Medium Effort)
- Research cost alerts
- Research efficiency alerts
- Integration with billing dashboard alerts
Priority 3: Feature Enhancements (Variable Value, Variable Effort)
- File upload for research input
- Research templates
- Research comparison
- Advanced export options
Priority 4: Performance & Optimization (Low Value, High Effort)
- Research result caching
- Batch research operations
📚 Related Documentation
Current & Accurate
- ✅ CURRENT_ARCHITECTURE_OVERVIEW.md - Single source of truth
- ✅ INTENT_DRIVEN_RESEARCH_GUIDE.md - Comprehensive guide
- ✅ INTENT_RESEARCH_API_REFERENCE.md - Complete API docs
- ✅ .cursor/rules/researcher-architecture.mdc - Authoritative rules
- ✅ PHASE2_IMPLEMENTATION_SUMMARY.md - Persona enhancements
- ✅ PHASE3_AND_UI_INDICATORS_IMPLEMENTATION.md - Phase 3 features
- ✅ RESEARCH_PERSONA_DATA_SOURCES.md - Persona data sources
Outdated (Historical Reference Only)
- ⚠️ RESEARCH_WIZARD_IMPLEMENTATION.md - Describes old 4-step wizard
- ⚠️ RESEARCH_COMPONENT_INTEGRATION.md - Mentions old architecture
- ⚠️ PHASE1_IMPLEMENTATION_REVIEW.md - Missing intent-driven research
- ⚠️ RESEARCH_IMPROVEMENTS_SUMMARY.md - Missing intent-driven research
- ⚠️ COMPLETE_IMPLEMENTATION_SUMMARY.md - Missing intent-driven research
✅ Conclusion
The Research Engine is fully functional and production-ready. The system has evolved from a traditional keyword-based search to an AI-powered intent-driven research assistant with:
- 50% reduction in LLM calls (unified analyzer)
- Hyper-personalization based on onboarding data
- Structured deliverables (statistics, quotes, case studies, etc.)
- Provider optimization (Exa → Tavily → Google)
- UI indicators showing personalization
- My Projects integration with Asset Library
Main Gaps:
- Documentation updates (some outdated docs)
- Alert system integration (cost/efficiency alerts)
- Feature enhancements (file upload, templates, etc.)
Recommended Focus: Start with documentation updates (high value, low effort) followed by alert system integration (improves user experience and cost transparency).
Status: Codebase Review Complete - System is Production-Ready 🚀