# Research Engine Codebase Review & Understanding

**Date**: 2025-01-29
**Status**: Comprehensive Codebase Review Summary

---

## 📋 Executive Summary

The ALwrity Research Engine is a **fully functional, production-ready intent-driven research system** that has evolved from traditional keyword-based search into an AI-powered research assistant. The system uses a unified analyzer to cut the LLM calls needed for intent analysis by at least 50% (from 2-3 calls to 1) while providing hyper-personalized research experiences based on user onboarding data.

---

## 🏗️ Architecture Overview

### Current Architecture (Intent-Driven)

```
User Input → UnifiedResearchAnalyzer (Single AI Call)
    ├── Intent Inference
    ├── Query Generation (4-8 queries)
    └── Parameter Optimization (Exa/Tavily)
    ↓
Research Execution (Exa → Tavily → Google)
    ↓
IntentAwareAnalyzer (Result Analysis)
    ↓
Structured Deliverables (Statistics, Quotes, Case Studies, etc.)
```

### Key Architectural Principles

1. **Unified Analysis**: Single LLM call for intent + queries + params (50% reduction)
2. **Intent-Driven**: Understand user goals before searching
3. **Hyper-Personalization**: Leverage research persona from onboarding data
4. **Provider Priority**: Exa → Tavily → Google (semantic → real-time → fallback)
5. **Subscription-Aware**: All AI calls go through `llm_text_gen` with `user_id`

---

## 📁 Code Structure

### Backend Structure

```
backend/services/research/
├── core/
│   ├── research_engine.py                 # Main orchestrator (standalone)
│   ├── research_context.py                # Unified input schema
│   └── parameter_optimizer.py             # DEPRECATED (use unified analyzer)
│
├── intent/
│   ├── unified_research_analyzer.py       # ⭐ Unified AI analyzer (intent + queries + params)
│   ├── intent_aware_analyzer.py           # Result analysis based on intent
│   ├── unified_prompt_builder.py          # LLM prompt builders
│   ├── unified_schema_builder.py          # JSON schema builders
│   ├── unified_result_parser.py           # Result parsing utilities
│   ├── query_deduplicator.py              # Query deduplication logic
│   ├── research_intent_inference.py       # Legacy (use unified)
│   └── intent_query_generator.py          # Legacy (use unified)
│
├── trends/
│   ├── google_trends_service.py           # Google Trends integration
│   └── rate_limiter.py                    # Rate limiting for Trends API
│
├── research_persona_service.py            # Research persona generation/retrieval
├── research_persona_prompt_builder.py     # Persona generation prompts
├── exa_service.py                         # Exa API integration
├── tavily_service.py                      # Tavily API integration
└── google_search_service.py               # Google/Gemini grounding

backend/api/research/
├── router.py                              # Main router
└── handlers/
    ├── providers.py                       # Provider status endpoints
    ├── research.py                        # Traditional research endpoints
    ├── intent.py                          # Intent-driven endpoints
    └── projects.py                        # My Projects endpoints
```

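The tree lists `query_deduplicator.py` as holding the query deduplication logic, but this review does not describe how it works. As a rough illustration only, deduplication of generated queries could look like the sketch below. The normalization rule (case and whitespace folding) and both function names are assumptions, not the module's actual API.

```python
import re
from typing import Iterable, List


def normalize_query(query: str) -> str:
    """Fold case and collapse runs of whitespace so near-identical queries compare equal."""
    return re.sub(r"\s+", " ", query.strip().lower())


def deduplicate_queries(queries: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct query, preserving input order."""
    seen = set()
    unique: List[str] = []
    for query in queries:
        key = normalize_query(query)
        if key and key not in seen:  # skip empty strings and repeats
            seen.add(key)
            unique.append(query.strip())
    return unique


generated = [
    "AI marketing tools",
    "ai  marketing TOOLS ",   # duplicate after normalization
    "AI marketing tools pricing",
    "",                       # empty query is dropped
]
deduped = deduplicate_queries(generated)
```

A real implementation may well use semantic similarity rather than string normalization; the sketch only shows where such a step sits between query generation and execution.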
### Frontend Structure

```
frontend/src/components/Research/
├── ResearchWizard.tsx                     # Main wizard orchestrator (3 steps)
├── steps/
│   ├── ResearchInput.tsx                  # Step 1: Input + Intent & Options
│   ├── StepProgress.tsx                   # Step 2: Progress/polling
│   ├── StepResults.tsx                    # Step 3: Results display
│   ├── components/
│   │   ├── ResearchInputHeader.tsx        # Header with Advanced toggle
│   │   ├── ResearchInputContainer.tsx     # Main input with Intent & Options button
│   │   ├── IntentConfirmationPanel.tsx    # Intent display/edit panel
│   │   ├── IntentResultsDisplay.tsx       # Tabbed results (Summary, Deliverables, Sources, Analysis)
│   │   ├── AdvancedOptionsSection.tsx     # Exa/Tavily options
│   │   ├── ProviderChips.tsx              # Provider availability display
│   │   ├── PersonalizationIndicator.tsx   # UI indicator for personalization
│   │   ├── PersonalizationBadge.tsx       # Badge-style indicator
│   │   └── ... (other components)
│   ├── hooks/
│   │   ├── useResearchConfig.ts           # Config + persona loading
│   │   ├── useKeywordExpansion.ts         # Keyword expansion with persona
│   │   └── useResearchAngles.ts           # Research angles generation
│   └── utils/
│       ├── placeholders.ts                # Personalized placeholders
│       └── industryDefaults.ts            # Industry-specific defaults
└── hooks/
    ├── useResearchWizard.ts               # Wizard state management
    ├── useResearchExecution.ts            # Research execution orchestration
    └── useIntentResearch.ts               # Intent research flow
```

---

## 🔑 Key Components

### 1. UnifiedResearchAnalyzer ⭐

**Location**: `backend/services/research/intent/unified_research_analyzer.py`

**Purpose**: Single AI call that performs:
- Intent inference (what the user wants)
- Query generation (4-8 targeted queries)
- Parameter optimization (Exa/Tavily settings with justifications)

**Key Features**:
- Reduces LLM calls from 2-3 to 1 (a reduction of at least 50%)
- Provides justifications for all parameter decisions
- Uses research persona for context
- Returns structured `ResearchIntent`, `ResearchQuery[]`, and `OptimizedConfig`

**Usage Pattern**:
```python
from services.research.intent.unified_research_analyzer import UnifiedResearchAnalyzer

analyzer = UnifiedResearchAnalyzer()
result = await analyzer.analyze(
    user_input=user_input,
    keywords=keywords,
    research_persona=research_persona,
    competitor_data=competitor_data,
    industry=industry,
    target_audience=target_audience,
    user_id=user_id,  # Required for subscription checks
)
```

### 2. IntentAwareAnalyzer

**Location**: `backend/services/research/intent/intent_aware_analyzer.py`

**Purpose**: Analyzes raw research results based on user intent to extract specific deliverables

**Key Features**:
- Extracts statistics, quotes, case studies, trends, comparisons
- Structures results by deliverable type
- Provides credibility scores for sources
- Identifies gaps and follow-up queries

**Usage Pattern**:
```python
from services.research.intent.intent_aware_analyzer import IntentAwareAnalyzer

analyzer = IntentAwareAnalyzer()
result = await analyzer.analyze(
    raw_results=exa_tavily_results,
    intent=research_intent,
    research_persona=research_persona,
    user_id=user_id,  # Required for subscription checks
)
```

### 3. ResearchEngine

**Location**: `backend/services/research/core/research_engine.py`

**Purpose**: Orchestrates provider calls with priority order

**Provider Priority**:
1. **Exa** (Primary): Semantic understanding, academic papers, competitor research
2. **Tavily** (Secondary): Real-time news, trending topics, quick facts
3. **Google** (Fallback): Basic factual queries via Gemini grounding

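One plausible reading of the Exa → Tavily → Google priority is a fallback chain: try the highest-priority available provider and move on only if it is unavailable, fails, or returns nothing. The sketch below illustrates that reading with stand-in callables. `run_with_fallback`, the `Provider` signature, and the demo providers are all hypothetical and do not reflect `ResearchEngine`'s real interface.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical provider signature: takes a query, returns a list of result dicts.
Provider = Callable[[str], List[dict]]

PROVIDER_ORDER = ("exa", "tavily", "google")  # priority order stated in this review


def run_with_fallback(
    query: str,
    providers: Dict[str, Provider],
    available: Dict[str, bool],
) -> Tuple[Optional[str], List[dict]]:
    """Try providers in priority order; return (provider_name, results)."""
    for name in PROVIDER_ORDER:
        if not available.get(name, False) or name not in providers:
            continue  # provider not configured or reported unavailable
        try:
            results = providers[name](query)
        except Exception:
            continue  # provider call failed; try the next one
        if results:
            return name, results
    return None, []  # every provider was skipped, failed, or came back empty


def _simulated_outage(query: str) -> List[dict]:
    raise RuntimeError("simulated Exa outage")


demo_providers: Dict[str, Provider] = {
    "exa": _simulated_outage,
    "tavily": lambda q: [{"title": f"news about {q}"}],
    "google": lambda q: [{"title": f"fallback for {q}"}],
}
all_up = {"exa": True, "tavily": True, "google": True}
used, hits = run_with_fallback("AI marketing tools", demo_providers, all_up)
```

In this demo Exa raises, so the call falls through to Tavily. Passing the availability map explicitly mirrors the guidance elsewhere in this review to check provider availability before use.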
### 4. ResearchPersonaService

**Location**: `backend/services/research/research_persona_service.py`

**Purpose**: Generates and retrieves research persona from onboarding data

**Persona Sources**:
- Core persona (onboarding step 1)
- Website analysis (onboarding step 2): `writing_style`, `content_characteristics`, `content_type`, `style_patterns`, `crawl_result`
- Competitor analysis (onboarding step 3)

**Features**:
- Caches persona (7-day TTL)
- Provides persona defaults for UI pre-filling
- Generates personalized presets, keywords, and research angles

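The 7-day TTL means a cached persona is reused until it is seven days old and regenerated afterwards. The sketch below shows that expiry rule with an in-memory dictionary. The `PersonaCache` class, its methods, and the in-memory storage are assumptions made for illustration; the review does not say where or how `ResearchPersonaService` actually stores cached personas.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds


class PersonaCache:
    """Hypothetical in-memory persona cache with time-based expiry."""

    def __init__(
        self,
        ttl_seconds: int = SEVEN_DAYS_SECONDS,
        clock: Callable[[], float] = time.time,  # injectable so expiry can be tested
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, user_id: str, persona: Any) -> None:
        """Store a persona together with the time it was cached."""
        self._entries[user_id] = (self._clock(), persona)

    def get(self, user_id: str) -> Optional[Any]:
        """Return the cached persona, or None if missing or older than the TTL."""
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, persona = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[user_id]  # expired: caller should regenerate
            return None
        return persona


# Demo with a fake clock so the expiry boundary is visible without waiting.
fake_now = [0.0]
cache = PersonaCache(clock=lambda: fake_now[0])
cache.put("user-123", {"industry": "SaaS"})
fresh = cache.get("user-123")
fake_now[0] = SEVEN_DAYS_SECONDS
expired = cache.get("user-123")
```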
---

## 🔌 API Endpoints

### Intent-Driven Endpoints (Current - Recommended)

1. **POST `/api/research/intent/analyze`**
   - Analyzes user input to understand intent
   - Generates queries and optimizes parameters
   - Returns intent, queries, and optimized config
   - **Performance**: 2-5 seconds (single LLM call)

2. **POST `/api/research/intent/research`**
   - Executes research based on confirmed intent
   - Returns structured deliverables
   - **Performance**: 10-30 seconds (depends on provider and query count)

### Traditional Endpoints (Fallback)

3. **POST `/api/research/execute`** - Synchronous research execution
4. **POST `/api/research/start`** - Asynchronous research execution
5. **GET `/api/research/status/{task_id}`** - Poll async research status

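The asynchronous pair (`/api/research/start` followed by repeated calls to `/api/research/status/{task_id}`) implies a client-side polling loop. Below is a minimal sketch of such a loop. The `fetch_status` callable stands in for whatever HTTP client issues the GET request, and the `status` field with its `completed` / `failed` values is an assumption, since the response schema is not given in this review.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = ("completed", "failed")  # assumed status values


def poll_research_status(
    fetch_status: Callable[[str], Dict],      # hypothetical wrapper around GET /api/research/status/{task_id}
    task_id: str,
    interval_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> Dict:
    """Poll until the task reaches a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if waited >= timeout_seconds:
            raise TimeoutError(f"Task {task_id} still running after {waited:.0f}s")
        sleep(interval_seconds)
        waited += interval_seconds


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "result": {"sources": 12}},
])
recorded_sleeps = []
final = poll_research_status(lambda task_id: next(canned), "task-1", sleep=recorded_sleeps.append)
```

The 2-second interval and 120-second timeout are illustrative defaults chosen to sit comfortably around the 10-30 second execution time quoted in this review.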
### Configuration Endpoints

6. **GET `/api/research/config`** - Provider availability + persona defaults
7. **GET `/api/research/providers/status`** - Provider availability only
8. **GET `/api/research/persona-defaults`** - Persona defaults only

---

## 🔄 Research Flow

### Intent-Driven Research Flow (Current)

```
1. User Input
   User enters: "AI marketing tools"
   ↓

2. Intent Analysis (UnifiedResearchAnalyzer)
   POST /api/research/intent/analyze
   ├── Fetches Research Persona (if enabled)
   ├── Fetches Competitor Data (if enabled)
   └── Single LLM Call:
       ├── Intent Inference
       ├── Query Generation (4-8 queries)
       └── Parameter Optimization (Exa/Tavily)
   ↓

3. Intent Confirmation (Frontend)
   IntentConfirmationPanel displays:
   ├── Inferred intent (editable)
   ├── Suggested queries (selectable)
   └── AI-optimized settings with justifications
   ↓

4. Research Execution
   POST /api/research/intent/research
   ├── ResearchEngine executes queries (Exa → Tavily → Google)
   └── Returns raw results
   ↓

5. Intent-Aware Analysis
   IntentAwareAnalyzer analyzes results:
   ├── Extracts statistics, quotes, case studies
   ├── Structures by deliverable type
   └── Returns IntentDrivenResearchResult
   ↓

6. Results Display
   IntentResultsDisplay shows:
   ├── Summary Tab
   ├── Deliverables Tab
   ├── Sources Tab
   └── Analysis Tab
```

---

## 🎯 Key Features Implemented

### ✅ Completed Features

1. **Intent-Driven Research Architecture**
   - UnifiedResearchAnalyzer (single AI call)
   - IntentAwareAnalyzer (result analysis)
   - 3-Step Wizard (ResearchInput → StepProgress → StepResults)
   - IntentConfirmationPanel (review/edit intent)

2. **Google Trends Integration**
   - Phase 1: Core Google Trends service
   - Phase 2: Hybrid approach (automatic + on-demand)
   - Phase 3: Enhanced UI with charts, export functionality
   - Integrated into intent-driven research flow

3. **Research Persona System**
   - Persona generation from onboarding data
   - Persona defaults for UI pre-filling
   - Caching (7-day TTL)
   - UI indicators showing personalization

4. **My Projects Feature**
   - Auto-save research projects upon completion
   - Asset Library integration
   - Restore functionality with full state persistence

5. **UI/UX Enhancements**
   - QueryEditor redesign
   - Google Trends keywords with chip-based UI
   - Industry-specific placeholders
   - Time-sensitive query handling
   - Personalization indicators

---

## 📊 Data Models

### ResearchIntent

```python
class ResearchIntent:
    primary_question: str
    secondary_questions: List[str]
    purpose: ResearchPurpose  # learn, create_content, make_decision, etc.
    content_output: ContentOutput  # blog, podcast, video, etc.
    expected_deliverables: List[ExpectedDeliverable]
    depth: ResearchDepthLevel  # overview, detailed, expert
    focus_areas: List[str]
    perspective: Optional[str]
    time_sensitivity: str
    confidence: float
    confidence_reason: Optional[str]
    great_example: Optional[str]
    needs_clarification: bool
    clarifying_questions: List[str]
```

### ResearchQuery

```python
class ResearchQuery:
    query: str
    purpose: ExpectedDeliverable
    provider: str  # "exa" | "tavily"
    priority: int  # 1-5
    expected_results: str
    justification: Optional[str]
```

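To make the `provider` and `priority` fields concrete, the sketch below mirrors `ResearchQuery` as a standalone dataclass and groups queries per provider in priority order. It is an illustration under assumptions: `QuerySketch` and `plan_by_provider` are hypothetical names, `purpose` is simplified to a plain string, and the review does not say whether 1 or 5 is the highest priority (1 is treated as highest here).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

ALLOWED_PROVIDERS = ("exa", "tavily")  # values listed in the model's comment


@dataclass
class QuerySketch:
    """Standalone stand-in for ResearchQuery (purpose simplified to str)."""
    query: str
    purpose: str
    provider: str
    priority: int
    expected_results: str = ""
    justification: Optional[str] = None

    def __post_init__(self) -> None:
        if self.provider not in ALLOWED_PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider!r}")
        if not 1 <= self.priority <= 5:
            raise ValueError(f"priority must be 1-5, got {self.priority}")


def plan_by_provider(queries: Iterable[QuerySketch]) -> Dict[str, List[QuerySketch]]:
    """Group queries by provider, lowest priority number first (assumed = most important)."""
    plan: Dict[str, List[QuerySketch]] = defaultdict(list)
    for q in sorted(queries, key=lambda q: q.priority):
        plan[q.provider].append(q)
    return dict(plan)


plan = plan_by_provider([
    QuerySketch("AI marketing tools market size", "statistics", "exa", 2),
    QuerySketch("AI marketing tools news this week", "trends", "tavily", 1),
    QuerySketch("AI marketing tools case studies", "case_studies", "exa", 1),
])
exa_order = [q.query for q in plan["exa"]]
```

Validating the two constrained fields at construction time keeps malformed LLM output from reaching the provider calls.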
### IntentDrivenResearchResult

```python
class IntentDrivenResearchResult:
    primary_answer: str
    secondary_answers: Dict[str, str]
    statistics: List[StatisticWithCitation]
    expert_quotes: List[ExpertQuote]
    case_studies: List[CaseStudySummary]
    trends: List[TrendAnalysis]
    comparisons: List[ComparisonTable]
    best_practices: List[str]
    step_by_step: List[str]
    pros_cons: Optional[ProsCons]
    definitions: Dict[str, str]
    examples: List[str]
    predictions: List[str]
    executive_summary: str
    key_takeaways: List[str]
    suggested_outline: List[str]
    sources: List[SourceWithRelevance]
    confidence: float
    gaps_identified: List[str]
    follow_up_queries: List[str]
```

---

## 🎨 UI Components

### ResearchWizard

**Purpose**: Main wizard orchestrator

**Steps**:
1. **ResearchInput**: Input + Intent & Options button
2. **StepProgress**: Progress/polling for async research
3. **StepResults**: Tabbed results display

### IntentConfirmationPanel

**Purpose**: Shows inferred intent and allows editing

**Features**:
- Displays inferred intent (editable)
- Shows suggested queries (selectable)
- Displays AI-optimized settings with justifications
- Advanced options for manual override

### IntentResultsDisplay

**Purpose**: Tabbed results display

**Tabs**:
- **Summary**: AI-generated overview
- **Deliverables**: Extracted statistics, quotes, case studies, etc.
- **Sources**: Citations with credibility scores
- **Analysis**: Deep insights based on intent

---

## 🔐 Security & Subscription

### Authentication

All endpoints require JWT authentication via `get_current_user` dependency.

### Subscription Checks

All LLM calls must pass `user_id` for subscription and pre-flight validation:

```python
result = llm_text_gen(
    prompt=prompt,
    json_struct=schema,
    user_id=user_id  # Required
)
```

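Because a missing `user_id` is called out in this review as a common pitfall, a guard that fails fast can be useful during development. The decorator below is a sketch of that idea: `require_user_id` and `guarded_llm_call` are hypothetical names, and the wrapped function is a stand-in with the same keyword arguments as the call above, not the real `llm_text_gen`.

```python
import functools
from typing import Any, Callable, Optional


def require_user_id(func: Callable[..., Any]) -> Callable[..., Any]:
    """Reject calls that do not pass a non-empty user_id keyword argument."""

    @functools.wraps(func)
    def wrapper(*args: Any, user_id: Optional[str] = None, **kwargs: Any) -> Any:
        if not user_id:
            raise ValueError(
                f"{func.__name__} requires user_id for subscription checks"
            )
        return func(*args, user_id=user_id, **kwargs)

    return wrapper


@require_user_id
def guarded_llm_call(prompt: str, json_struct: Optional[dict] = None, user_id: Optional[str] = None) -> dict:
    """Stand-in for an LLM call; echoes its inputs instead of contacting a model."""
    return {"prompt": prompt, "user_id": user_id}


ok = guarded_llm_call("Summarize AI marketing tools", user_id="user-123")
```

Note that the guard only recognizes `user_id` when it is passed by keyword, which matches the calling style used throughout this document.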
### Rate Limiting

- Subject to subscription tier limits
- Provider APIs (Exa/Tavily/Google) have their own rate limits

---

## 📈 Performance

### Intent Analysis
- **Typical Time**: 2-5 seconds
- **LLM Calls**: 1 (unified analyzer)
- **Caching**: Research persona cached (7-day TTL)

### Research Execution
- **Typical Time**: 10-30 seconds
- **Depends On**: Provider, query count, result count
- **Async Support**: Yes (via `/api/research/start`)

### Result Analysis
- **Typical Time**: 5-10 seconds
- **LLM Calls**: 1 (intent-aware analyzer)

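Adding up the three stage timings gives a rough end-to-end budget. The arithmetic below assumes the stages run strictly one after another, that the quoted ranges do not overlap (for example, that result analysis is not already counted inside research execution), and it ignores the time the user spends confirming intent in the UI.

```python
# (low, high) typical seconds per stage, taken from the figures in this section
STAGE_SECONDS = {
    "intent_analysis": (2, 5),
    "research_execution": (10, 30),
    "result_analysis": (5, 10),
}

best_case = sum(low for low, _ in STAGE_SECONDS.values())     # 2 + 10 + 5 = 17
worst_case = sum(high for _, high in STAGE_SECONDS.values())  # 5 + 30 + 10 = 45

# One LLM call for the unified analyzer plus one for the intent-aware analyzer
llm_calls_per_run = 1 + 1

print(f"End-to-end: {best_case}-{worst_case} seconds, {llm_calls_per_run} LLM calls")
```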
---

## 🔗 Integration Points

### Blog Writer Integration

Research Engine can be imported by Blog Writer:

```python
from services.research.core.research_engine import ResearchEngine
# Assumption: ResearchGoal and ResearchDepth are defined alongside ResearchContext.
from services.research.core.research_context import (
    ResearchContext,
    ResearchDepth,
    ResearchGoal,
)

context = ResearchContext(
    query=blog_topic,
    keywords=blog_keywords,
    goal=ResearchGoal.FACTUAL,
    depth=ResearchDepth.COMPREHENSIVE,
)

engine = ResearchEngine()
# Must run inside an async function
result = await engine.research(context, user_id=user_id)
```

### Frontend Integration

Research Wizard can be reused in other tools:

```tsx
import { ResearchWizard } from '@/components/Research/ResearchWizard';

<ResearchWizard
  onComplete={(results) => {
    // Use results in blog/video generation
  }}
  initialKeywords={blogTopic}
  initialIndustry={userIndustry}
/>
```

---

## ✅ Best Practices

1. **Always use UnifiedResearchAnalyzer** for new intent-driven research
2. **Always pass `user_id`** to all LLM calls
3. **Always use IntentAwareAnalyzer** for result analysis
4. **Check provider availability** before using providers
5. **Provide justifications** for all AI-driven settings
6. **Allow user overrides** in Advanced Options
7. **Never fall back to "General"** - always use persona defaults

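The last rule can be expressed as a small resolution order: an explicit user value wins, the persona default comes next, and a generic "General" value is never substituted. The helper below is a sketch of that order. `resolve_research_default` and the shape of `persona_defaults` are assumptions for illustration, and raising an error when no default exists is one possible policy rather than the codebase's documented behavior.

```python
from typing import Mapping, Optional


def resolve_research_default(
    field: str,
    user_value: Optional[str],
    persona_defaults: Mapping[str, str],
) -> str:
    """Pick the user's value, else the persona default; never invent 'General'."""
    if user_value and user_value.strip():
        return user_value.strip()  # explicit user choice always wins
    persona_value = persona_defaults.get(field, "")
    if persona_value and persona_value.strip():
        return persona_value.strip()  # fall back to onboarding-derived default
    # No value from either source: surface the gap instead of hiding it
    raise LookupError(f"No persona default for {field!r}; refusing to use 'General'")


persona_defaults = {"industry": "B2B SaaS", "target_audience": "Marketing managers"}
from_user = resolve_research_default("industry", "Healthcare", persona_defaults)
from_persona = resolve_research_default("industry", None, persona_defaults)
```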
---

## 🚫 Common Pitfalls to Avoid

1. ❌ **Rule-Based Parameter Optimization**: Always use AI-driven optimization via `UnifiedResearchAnalyzer`
2. ❌ **Missing `user_id`**: Always pass `user_id` to `llm_text_gen` for subscription checks
3. ❌ **Breaking Changes**: Never modify Research Engine in a way that breaks existing tools (Blog Writer, etc.)
4. ❌ **Hardcoded Defaults**: Always use persona defaults, never hardcode "General" values
5. ❌ **Multiple LLM Calls**: Use unified analyzer instead of separate intent + query + params calls
6. ❌ **Ignoring Provider Availability**: Always check provider availability before using
7. ❌ **Missing Justifications**: Every AI-driven setting must have a justification for UI display

---

## 📋 Pending Items & TODOs

### From Code Review

1. **File Upload Logic** (ResearchInput.tsx:396)
   - TODO: Implement file upload logic for research input
   - Status: Not started (low priority)

### Documentation Gaps

1. **Intent-Driven Research Documentation**
   - ✅ Comprehensive guide created (`INTENT_DRIVEN_RESEARCH_GUIDE.md`)
   - ✅ API reference created (`INTENT_RESEARCH_API_REFERENCE.md`)
   - ✅ Architecture overview created (`CURRENT_ARCHITECTURE_OVERVIEW.md`)

2. **Outdated Documentation**
   - ⚠️ Some docs still reference old 4-step wizard
   - ⚠️ Need to update implementation guides
   - See `DOCUMENTATION_REVIEW_AND_UPDATE_PLAN.md` for details

---

## 🎯 Suggested Next Steps

### Priority 1: Documentation Updates (High Value, Low Effort)

1. Update outdated implementation documentation
2. Create integration examples
3. Update component documentation

### Priority 2: Dashboard Alert System Integration (Medium Value, Medium Effort)

1. Research cost alerts
2. Research efficiency alerts
3. Integration with billing dashboard alerts

### Priority 3: Feature Enhancements (Variable Value, Variable Effort)

1. File upload for research input
2. Research templates
3. Research comparison
4. Advanced export options

### Priority 4: Performance & Optimization (Low Value, High Effort)

1. Research result caching
2. Batch research operations

---

## 📚 Related Documentation

### Current & Accurate

- ✅ **CURRENT_ARCHITECTURE_OVERVIEW.md** - Single source of truth
- ✅ **INTENT_DRIVEN_RESEARCH_GUIDE.md** - Comprehensive guide
- ✅ **INTENT_RESEARCH_API_REFERENCE.md** - Complete API docs
- ✅ **.cursor/rules/researcher-architecture.mdc** - Authoritative rules
- ✅ **PHASE2_IMPLEMENTATION_SUMMARY.md** - Persona enhancements
- ✅ **PHASE3_AND_UI_INDICATORS_IMPLEMENTATION.md** - Phase 3 features
- ✅ **RESEARCH_PERSONA_DATA_SOURCES.md** - Persona data sources

### Outdated (Historical Reference Only)

- ⚠️ **RESEARCH_WIZARD_IMPLEMENTATION.md** - Describes old 4-step wizard
- ⚠️ **RESEARCH_COMPONENT_INTEGRATION.md** - Mentions old architecture
- ⚠️ **PHASE1_IMPLEMENTATION_REVIEW.md** - Missing intent-driven research
- ⚠️ **RESEARCH_IMPROVEMENTS_SUMMARY.md** - Missing intent-driven research
- ⚠️ **COMPLETE_IMPLEMENTATION_SUMMARY.md** - Missing intent-driven research

---

## ✅ Conclusion

The Research Engine is **fully functional and production-ready**. The system has evolved from a traditional keyword-based search to an AI-powered intent-driven research assistant with:

- **50% reduction in LLM calls** (unified analyzer)
- **Hyper-personalization** based on onboarding data
- **Structured deliverables** (statistics, quotes, case studies, etc.)
- **Provider optimization** (Exa → Tavily → Google)
- **UI indicators** showing personalization
- **My Projects** integration with Asset Library

**Main Gaps**:
1. Documentation updates (some outdated docs)
2. Alert system integration (cost/efficiency alerts)
3. Feature enhancements (file upload, templates, etc.)

**Recommended Focus**: Start with documentation updates (high value, low effort) followed by alert system integration (improves user experience and cost transparency).

---

**Status**: Codebase Review Complete - System is Production-Ready 🚀