Research Engine Codebase Review & Understanding

Date: 2025-01-29
Status: Comprehensive Codebase Review Summary


📋 Executive Summary

The ALwrity Research Engine is a fully functional, production-ready, intent-driven research system that has evolved from traditional keyword-based search into an AI-powered research assistant. A unified analyzer handles intent inference, query generation, and parameter optimization in a single LLM call, cutting LLM calls for that step by at least 50%, while research personas built from user onboarding data personalize the experience.


🏗️ Architecture Overview

Current Architecture (Intent-Driven)

User Input → UnifiedResearchAnalyzer (Single AI Call)
           ├── Intent Inference
           ├── Query Generation (4-8 queries)
           └── Parameter Optimization (Exa/Tavily)
           ↓
Research Execution (Exa → Tavily → Google)
           ↓
IntentAwareAnalyzer (Result Analysis)
           ↓
Structured Deliverables (Statistics, Quotes, Case Studies, etc.)

Key Architectural Principles

  1. Unified Analysis: Single LLM call for intent + queries + params (50% reduction)
  2. Intent-Driven: Understand user goals before searching
  3. Hyper-Personalization: Leverage research persona from onboarding data
  4. Provider Priority: Exa → Tavily → Google (semantic → real-time → fallback)
  5. Subscription-Aware: All AI calls go through llm_text_gen with user_id

📁 Code Structure

Backend Structure

backend/services/research/
├── core/
│   ├── research_engine.py           # Main orchestrator (standalone)
│   ├── research_context.py          # Unified input schema
│   └── parameter_optimizer.py     # DEPRECATED (use unified analyzer)
│
├── intent/
│   ├── unified_research_analyzer.py # ⭐ Unified AI analyzer (intent + queries + params)
│   ├── intent_aware_analyzer.py     # Result analysis based on intent
│   ├── unified_prompt_builder.py   # LLM prompt builders
│   ├── unified_schema_builder.py   # JSON schema builders
│   ├── unified_result_parser.py    # Result parsing utilities
│   ├── query_deduplicator.py       # Query deduplication logic
│   ├── research_intent_inference.py # Legacy (use unified)
│   └── intent_query_generator.py   # Legacy (use unified)
│
├── trends/
│   ├── google_trends_service.py    # Google Trends integration
│   └── rate_limiter.py              # Rate limiting for Trends API
│
├── research_persona_service.py      # Research persona generation/retrieval
├── research_persona_prompt_builder.py # Persona generation prompts
├── exa_service.py                  # Exa API integration
├── tavily_service.py                # Tavily API integration
└── google_search_service.py         # Google/Gemini grounding

backend/api/research/
├── router.py                        # Main router
└── handlers/
    ├── providers.py                 # Provider status endpoints
    ├── research.py                  # Traditional research endpoints
    ├── intent.py                    # Intent-driven endpoints
    └── projects.py                  # My Projects endpoints

Frontend Structure

frontend/src/components/Research/
├── ResearchWizard.tsx               # Main wizard orchestrator (3 steps)
├── steps/
│   ├── ResearchInput.tsx            # Step 1: Input + Intent & Options
│   ├── StepProgress.tsx             # Step 2: Progress/polling
│   ├── StepResults.tsx              # Step 3: Results display
│   ├── components/
│   │   ├── ResearchInputHeader.tsx  # Header with Advanced toggle
│   │   ├── ResearchInputContainer.tsx # Main input with Intent & Options button
│   │   ├── IntentConfirmationPanel.tsx # Intent display/edit panel
│   │   ├── IntentResultsDisplay.tsx # Tabbed results (Summary, Deliverables, Sources, Analysis)
│   │   ├── AdvancedOptionsSection.tsx # Exa/Tavily options
│   │   ├── ProviderChips.tsx        # Provider availability display
│   │   ├── PersonalizationIndicator.tsx # UI indicator for personalization
│   │   ├── PersonalizationBadge.tsx # Badge-style indicator
│   │   └── ... (other components)
│   ├── hooks/
│   │   ├── useResearchConfig.ts     # Config + persona loading
│   │   ├── useKeywordExpansion.ts   # Keyword expansion with persona
│   │   └── useResearchAngles.ts     # Research angles generation
│   └── utils/
│       ├── placeholders.ts          # Personalized placeholders
│       └── industryDefaults.ts     # Industry-specific defaults
└── hooks/
    ├── useResearchWizard.ts        # Wizard state management
    ├── useResearchExecution.ts      # Research execution orchestration
    └── useIntentResearch.ts         # Intent research flow

🔑 Key Components

1. UnifiedResearchAnalyzer

Location: backend/services/research/intent/unified_research_analyzer.py

Purpose: Single AI call that performs:

  • Intent inference (what user wants)
  • Query generation (4-8 targeted queries)
  • Parameter optimization (Exa/Tavily settings with justifications)

Key Features:

  • Reduces LLM calls from 2-3 to 1 (at least a 50% reduction)
  • Provides justifications for all parameter decisions
  • Uses research persona for context
  • Returns structured ResearchIntent, ResearchQuery[], and OptimizedConfig

Usage Pattern:

from services.research.intent.unified_research_analyzer import UnifiedResearchAnalyzer

analyzer = UnifiedResearchAnalyzer()
result = await analyzer.analyze(
    user_input=user_input,
    keywords=keywords,
    research_persona=research_persona,
    competitor_data=competitor_data,
    industry=industry,
    target_audience=target_audience,
    user_id=user_id,  # Required for subscription checks
)
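
Because one LLM call returns all three parts, the response has to be split into its sections before use. The following is a minimal sketch of that step, not the project's actual parser (which lives in unified_result_parser.py and is not shown in this review). The key names `intent`, `queries`, and `optimized_config` and the helper `split_unified_response` are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical top-level keys; the real schema is built by unified_schema_builder.py.
REQUIRED_SECTIONS = ("intent", "queries", "optimized_config")


def split_unified_response(
    raw: str,
) -> Tuple[Dict[str, Any], List[Dict[str, Any]], Dict[str, Any]]:
    """Parse one combined LLM response and return its three sections."""
    payload = json.loads(raw)
    missing = [key for key in REQUIRED_SECTIONS if key not in payload]
    if missing:
        # Fail loudly rather than continue with a partial analysis.
        raise ValueError(f"Unified response is missing sections: {missing}")
    return payload["intent"], payload["queries"], payload["optimized_config"]


# Example payload with made-up values, standing in for a real LLM response.
raw_response = json.dumps({
    "intent": {"primary_question": "Which AI marketing tools are worth adopting?"},
    "queries": [
        {"query": "AI marketing tools comparison", "provider": "exa", "priority": 1}
    ],
    "optimized_config": {
        "provider": "exa",
        "justification": "semantic search suits comparisons",
    },
})
intent, queries, config = split_unified_response(raw_response)
```

Validating that all three sections are present in one place keeps a malformed response from surfacing later as a confusing error in the confirmation panel.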

2. IntentAwareAnalyzer

Location: backend/services/research/intent/intent_aware_analyzer.py

Purpose: Analyzes raw research results based on user intent to extract specific deliverables

Key Features:

  • Extracts statistics, quotes, case studies, trends, comparisons
  • Structures results by deliverable type
  • Provides credibility scores for sources
  • Identifies gaps and follow-up queries

Usage Pattern:

from services.research.intent.intent_aware_analyzer import IntentAwareAnalyzer

analyzer = IntentAwareAnalyzer()
result = await analyzer.analyze(
    raw_results=exa_tavily_results,
    intent=research_intent,
    research_persona=research_persona,
    user_id=user_id,  # Required for subscription checks
)

3. ResearchEngine

Location: backend/services/research/core/research_engine.py

Purpose: Orchestrates provider calls with priority order

Provider Priority:

  1. Exa (Primary): Semantic understanding, academic papers, competitor research
  2. Tavily (Secondary): Real-time news, trending topics, quick facts
  3. Google (Fallback): Basic factual queries via Gemini grounding
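
One plausible reading of this priority order is a fallback chain: try each provider in turn and stop at the first one that returns results. The sketch below illustrates that idea only. The function `search_with_fallback`, the `Provider` signature, and the stub providers are hypothetical and do not reflect the actual ResearchEngine internals.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical provider signature: takes a query, returns a list of result dicts.
Provider = Callable[[str], List[dict]]


def search_with_fallback(
    query: str,
    providers: Sequence[Tuple[str, Optional[Provider]]],
) -> Tuple[str, List[dict]]:
    """Try providers in priority order; return the first non-empty result set."""
    errors: Dict[str, str] = {}
    for name, provider in providers:
        if provider is None:  # provider not configured or unavailable
            errors[name] = "unavailable"
            continue
        try:
            results = provider(query)
        except Exception as exc:  # one failing provider should not abort the chain
            errors[name] = str(exc)
            continue
        if results:
            return name, results
        errors[name] = "no results"
    raise RuntimeError(f"All providers failed for {query!r}: {errors}")


# Stub providers used only to demonstrate the fallback behaviour.
def _exa_down(query: str) -> List[dict]:
    raise TimeoutError("exa timed out")


def _tavily_ok(query: str) -> List[dict]:
    return [{"title": f"News about {query}", "url": "https://example.com"}]


used, results = search_with_fallback(
    "AI marketing tools",
    [("exa", _exa_down), ("tavily", _tavily_ok), ("google", None)],
)
```

Here the first provider raises, so the second one answers and the third is never consulted.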

4. ResearchPersonaService

Location: backend/services/research/research_persona_service.py

Purpose: Generates and retrieves research persona from onboarding data

Persona Sources:

  • Core persona (onboarding step 1)
  • Website analysis (onboarding step 2): writing_style, content_characteristics, content_type, style_patterns, crawl_result
  • Competitor analysis (onboarding step 3)

Features:

  • Caches persona (7-day TTL)
  • Provides persona defaults for UI pre-filling
  • Generates personalized presets, keywords, and research angles
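
The 7-day TTL can be illustrated with a small in-memory cache. This is a sketch under assumptions: the class `PersonaCache` and its methods are hypothetical, and the real service may well persist personas in a database rather than in memory.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60


class PersonaCache:
    """In-memory cache whose entries expire after a fixed TTL."""

    def __init__(
        self,
        ttl_seconds: float = SEVEN_DAYS_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, user_id: str) -> Optional[Any]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, persona = entry
        if self._clock() - stored_at >= self._ttl:
            # Entry is stale: drop it so the caller regenerates the persona.
            del self._entries[user_id]
            return None
        return persona

    def set(self, user_id: str, persona: Any) -> None:
        self._entries[user_id] = (self._clock(), persona)


# Demonstration with a fake clock that can be moved forward manually.
now = [0.0]
cache = PersonaCache(clock=lambda: now[0])
cache.set("user-1", {"industry": "SaaS"})
fresh = cache.get("user-1")       # still within the TTL
now[0] = SEVEN_DAYS_SECONDS + 1   # jump past the 7-day window
expired = cache.get("user-1")     # entry has expired
```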

🔌 API Endpoints

Intent-Driven Endpoints

  1. POST /api/research/intent/analyze

    • Analyzes user input to understand intent
    • Generates queries and optimizes parameters
    • Returns intent, queries, and optimized config
    • Performance: 2-5 seconds (single LLM call)
  2. POST /api/research/intent/research

    • Executes research based on confirmed intent
    • Returns structured deliverables
    • Performance: 10-30 seconds (depends on provider and query count)

Traditional Endpoints (Fallback)

  1. POST /api/research/execute - Synchronous research execution
  2. POST /api/research/start - Asynchronous research execution
  3. GET /api/research/status/{task_id} - Poll async research status
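
The start-then-poll pattern behind the asynchronous endpoints can be sketched as a loop that calls the status endpoint until the task reaches a terminal state. The response shape used here (a `status` field with values such as `running`, `completed`, and `failed`) is an assumption; consult the API reference for the actual schema. `fetch_status` stands in for an HTTP GET to /api/research/status/{task_id}.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; the real API may use different names.
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until the task finishes or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(task_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} did not finish in {timeout_seconds}s")
        sleep(interval_seconds)


# Canned responses standing in for successive calls to the status endpoint.
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "result": {"sources": 3}},
])
final = poll_until_done(
    lambda task_id: next(_responses),
    "task-123",
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```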

Configuration Endpoints

  1. GET /api/research/config - Provider availability + persona defaults
  2. GET /api/research/providers/status - Provider availability only
  3. GET /api/research/persona-defaults - Persona defaults only

🔄 Research Flow

Intent-Driven Research Flow (Current)

1. User Input
   User enters: "AI marketing tools"
   ↓

2. Intent Analysis (UnifiedResearchAnalyzer)
   POST /api/research/intent/analyze
   ├── Fetches Research Persona (if enabled)
   ├── Fetches Competitor Data (if enabled)
   └── Single LLM Call:
       ├── Intent Inference
       ├── Query Generation (4-8 queries)
       └── Parameter Optimization (Exa/Tavily)
   ↓

3. Intent Confirmation (Frontend)
   IntentConfirmationPanel displays:
   ├── Inferred intent (editable)
   ├── Suggested queries (selectable)
   └── AI-optimized settings with justifications
   ↓

4. Research Execution
   POST /api/research/intent/research
   ├── ResearchEngine executes queries (Exa → Tavily → Google)
   └── Returns raw results
   ↓

5. Intent-Aware Analysis
   IntentAwareAnalyzer analyzes results:
   ├── Extracts statistics, quotes, case studies
   ├── Structures by deliverable type
   └── Returns IntentDrivenResearchResult
   ↓

6. Results Display
   IntentResultsDisplay shows:
   ├── Summary Tab
   ├── Deliverables Tab
   ├── Sources Tab
   └── Analysis Tab

🎯 Key Features Implemented

Completed Features

  1. Intent-Driven Research Architecture

    • UnifiedResearchAnalyzer (single AI call)
    • IntentAwareAnalyzer (result analysis)
    • 3-Step Wizard (ResearchInput → StepProgress → StepResults)
    • IntentConfirmationPanel (review/edit intent)
  2. Google Trends Integration

    • Phase 1: Core Google Trends service
    • Phase 2: Hybrid approach (automatic + on-demand)
    • Phase 3: Enhanced UI with charts, export functionality
    • Integrated into intent-driven research flow
  3. Research Persona System

    • Persona generation from onboarding data
    • Persona defaults for UI pre-filling
    • Caching (7-day TTL)
    • UI indicators showing personalization
  4. My Projects Feature

    • Auto-save research projects upon completion
    • Asset Library integration
    • Restore functionality with full state persistence
  5. UI/UX Enhancements

    • QueryEditor redesign
    • Google Trends keywords with chip-based UI
    • Industry-specific placeholders
    • Time-sensitive query handling
    • Personalization indicators

📊 Data Models

ResearchIntent

class ResearchIntent:
    primary_question: str
    secondary_questions: List[str]
    purpose: ResearchPurpose  # learn, create_content, make_decision, etc.
    content_output: ContentOutput  # blog, podcast, video, etc.
    expected_deliverables: List[ExpectedDeliverable]
    depth: ResearchDepthLevel  # overview, detailed, expert
    focus_areas: List[str]
    perspective: Optional[str]
    time_sensitivity: str
    confidence: float
    confidence_reason: Optional[str]
    great_example: Optional[str]
    needs_clarification: bool
    clarifying_questions: List[str]

ResearchQuery

class ResearchQuery:
    query: str
    purpose: ExpectedDeliverable
    provider: str  # "exa" | "tavily"
    priority: int  # 1-5
    expected_results: str
    justification: Optional[str]
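
The code structure lists query_deduplicator.py, but this review does not show its logic. As a rough illustration of what deduplicating ResearchQuery records could involve, the sketch below drops queries whose text matches an earlier one after case and whitespace normalization. The normalization rule and the `deduplicate` helper are assumptions, and `purpose` is simplified to a plain string.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ResearchQuery:
    query: str
    purpose: str  # ExpectedDeliverable in the real model; a plain str here
    provider: str  # "exa" | "tavily"
    priority: int  # 1-5
    expected_results: str
    justification: Optional[str] = None


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(queries: List[ResearchQuery]) -> List[ResearchQuery]:
    """Keep the first occurrence of each normalized query text, preserving order."""
    seen: Set[str] = set()
    unique: List[ResearchQuery] = []
    for candidate in queries:
        key = _normalize(candidate.query)
        if key in seen:
            continue
        seen.add(key)
        unique.append(candidate)
    return unique


candidates = [
    ResearchQuery("AI marketing tools 2025", "statistics", "exa", 1, "market data"),
    ResearchQuery("ai  marketing tools 2025", "trends", "tavily", 2, "recent news"),
    ResearchQuery("AI marketing case studies", "case_studies", "exa", 3, "examples"),
]
unique_queries = deduplicate(candidates)
```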

IntentDrivenResearchResult

class IntentDrivenResearchResult:
    primary_answer: str
    secondary_answers: Dict[str, str]
    statistics: List[StatisticWithCitation]
    expert_quotes: List[ExpertQuote]
    case_studies: List[CaseStudySummary]
    trends: List[TrendAnalysis]
    comparisons: List[ComparisonTable]
    best_practices: List[str]
    step_by_step: List[str]
    pros_cons: Optional[ProsCons]
    definitions: Dict[str, str]
    examples: List[str]
    predictions: List[str]
    executive_summary: str
    key_takeaways: List[str]
    suggested_outline: List[str]
    sources: List[SourceWithRelevance]
    confidence: float
    gaps_identified: List[str]
    follow_up_queries: List[str]
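
Most deliverable fields in this model can be empty for a given request, so a consumer typically needs to know which ones were actually filled. A minimal sketch, assuming the result has been serialized to a dict; the choice of which fields count as deliverables and the helper `populated_deliverables` are illustrative assumptions.

```python
from typing import Any, Dict, List

# Assumed subset of IntentDrivenResearchResult fields treated as deliverables.
DELIVERABLE_FIELDS = (
    "statistics",
    "expert_quotes",
    "case_studies",
    "trends",
    "comparisons",
    "best_practices",
    "step_by_step",
    "pros_cons",
    "definitions",
    "examples",
    "predictions",
)


def populated_deliverables(result: Dict[str, Any]) -> List[str]:
    """Return the deliverable field names that hold non-empty values."""
    return [name for name in DELIVERABLE_FIELDS if result.get(name)]


# Example result with made-up values.
sample_result = {
    "primary_answer": "Several tools stand out.",
    "statistics": [{"value": "example figure", "source": "https://example.com"}],
    "expert_quotes": [],
    "pros_cons": None,
    "best_practices": ["Start with a pilot project"],
}
available = populated_deliverables(sample_result)
```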

🎨 UI Components

ResearchWizard

Purpose: Main wizard orchestrator

Steps:

  1. ResearchInput: Input + Intent & Options button
  2. StepProgress: Progress/polling for async research
  3. StepResults: Tabbed results display

IntentConfirmationPanel

Purpose: Shows inferred intent and allows editing

Features:

  • Displays inferred intent (editable)
  • Shows suggested queries (selectable)
  • Displays AI-optimized settings with justifications
  • Advanced options for manual override

IntentResultsDisplay

Purpose: Tabbed results display

Tabs:

  • Summary: AI-generated overview
  • Deliverables: Extracted statistics, quotes, case studies, etc.
  • Sources: Citations with credibility scores
  • Analysis: Deep insights based on intent

🔐 Security & Subscription

Authentication

All endpoints require JWT authentication via get_current_user dependency.

Subscription Checks

All LLM calls must pass user_id for subscription and pre-flight validation:

result = llm_text_gen(
    prompt=prompt,
    json_struct=schema,
    user_id=user_id  # Required
)
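
One way to make the user_id requirement hard to forget is a thin wrapper that rejects calls without it. This is a sketch only: `call_llm_for_user` is a hypothetical helper, and `_fake_llm` stands in for llm_text_gen, whose real import path and return type are not shown in this review.

```python
from typing import Any, Callable, Dict, Optional


def call_llm_for_user(
    llm_call: Callable[..., Any],
    prompt: str,
    json_struct: Dict[str, Any],
    user_id: Optional[str],
) -> Any:
    """Forward to the LLM function only when a user_id is supplied."""
    if not user_id:
        raise ValueError("user_id is required for subscription checks")
    return llm_call(prompt=prompt, json_struct=json_struct, user_id=user_id)


# Stand-in for llm_text_gen so the sketch runs without the real backend.
def _fake_llm(prompt: str, json_struct: Dict[str, Any], user_id: str) -> Dict[str, Any]:
    return {"echo": prompt, "user_id": user_id}


ok = call_llm_for_user(_fake_llm, "Summarize the sources", {"type": "object"}, "user-42")

try:
    call_llm_for_user(_fake_llm, "Summarize the sources", {"type": "object"}, None)
    rejected = False
except ValueError:
    rejected = True
```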

Rate Limiting

  • Subject to subscription tier limits
  • Provider APIs (Exa/Tavily/Google) have their own rate limits

📈 Performance

Intent Analysis

  • Typical Time: 2-5 seconds
  • LLM Calls: 1 (unified analyzer)
  • Caching: Research persona cached (7-day TTL)

Research Execution

  • Typical Time: 10-30 seconds
  • Depends On: Provider, query count, result count
  • Async Support: Yes (via /api/research/start)

Result Analysis

  • Typical Time: 5-10 seconds
  • LLM Calls: 1 (intent-aware analyzer)

🔗 Integration Points

Blog Writer Integration

Research Engine can be imported by Blog Writer:

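from services.research.core.research_engine import ResearchEngine
# Assumption: ResearchGoal and ResearchDepth are exported alongside ResearchContext;
# adjust the import path if they are defined elsewhere.
from services.research.core.research_context import (
    ResearchContext,
    ResearchDepth,
    ResearchGoal,
)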

context = ResearchContext(
    query=blog_topic,
    keywords=blog_keywords,
    goal=ResearchGoal.FACTUAL,
    depth=ResearchDepth.COMPREHENSIVE,
)

engine = ResearchEngine()
result = await engine.research(context, user_id=user_id)

Frontend Integration

Research Wizard can be reused in other tools:

import { ResearchWizard } from '@/components/Research/ResearchWizard';

<ResearchWizard
  onComplete={(results) => {
    // Use results in blog/video generation
  }}
  initialKeywords={blogTopic}
  initialIndustry={userIndustry}
/>

Best Practices

  1. Always use UnifiedResearchAnalyzer for new intent-driven research
  2. Always pass user_id to all LLM calls
  3. Always use IntentAwareAnalyzer for result analysis
  4. Check provider availability before using providers
  5. Provide justifications for all AI-driven settings
  6. Allow user overrides in Advanced Options
  7. Never fall back to "General" - always use persona defaults

🚫 Common Pitfalls to Avoid

  1. Rule-Based Parameter Optimization: Always use AI-driven optimization via UnifiedResearchAnalyzer
  2. Missing user_id: Always pass user_id to llm_text_gen for subscription checks
  3. Breaking Changes: Never modify Research Engine in a way that breaks existing tools (Blog Writer, etc.)
  4. Hardcoded Defaults: Always use persona defaults, never hardcode "General" values
  5. Multiple LLM Calls: Use unified analyzer instead of separate intent + query + params calls
  6. Ignoring Provider Availability: Always check provider availability before using
  7. Missing Justifications: Every AI-driven setting must have a justification for UI display

📋 Pending Items & TODOs

From Code Review

  1. File Upload Logic (ResearchInput.tsx:396)
    • TODO: Implement file upload logic for research input
    • Status: Not started (low priority)

Documentation Gaps

  1. Intent-Driven Research Documentation (addressed)

    • Comprehensive guide created (INTENT_DRIVEN_RESEARCH_GUIDE.md)
    • API reference created (INTENT_RESEARCH_API_REFERENCE.md)
    • Architecture overview created (CURRENT_ARCHITECTURE_OVERVIEW.md)
  2. Outdated Documentation

    • ⚠️ Some docs still reference old 4-step wizard
    • ⚠️ Need to update implementation guides
    • See DOCUMENTATION_REVIEW_AND_UPDATE_PLAN.md for details

🎯 Suggested Next Steps

Priority 1: Documentation Updates (High Value, Low Effort)

  1. Update outdated implementation documentation
  2. Create integration examples
  3. Update component documentation

Priority 2: Dashboard Alert System Integration (Medium Value, Medium Effort)

  1. Research cost alerts
  2. Research efficiency alerts
  3. Integration with billing dashboard alerts

Priority 3: Feature Enhancements (Variable Value, Variable Effort)

  1. File upload for research input
  2. Research templates
  3. Research comparison
  4. Advanced export options

Priority 4: Performance & Optimization (Low Value, High Effort)

  1. Research result caching
  2. Batch research operations

Current & Accurate

  • CURRENT_ARCHITECTURE_OVERVIEW.md - Single source of truth
  • INTENT_DRIVEN_RESEARCH_GUIDE.md - Comprehensive guide
  • INTENT_RESEARCH_API_REFERENCE.md - Complete API docs
  • .cursor/rules/researcher-architecture.mdc - Authoritative rules
  • PHASE2_IMPLEMENTATION_SUMMARY.md - Persona enhancements
  • PHASE3_AND_UI_INDICATORS_IMPLEMENTATION.md - Phase 3 features
  • RESEARCH_PERSONA_DATA_SOURCES.md - Persona data sources

Outdated (Historical Reference Only)

  • ⚠️ RESEARCH_WIZARD_IMPLEMENTATION.md - Describes old 4-step wizard
  • ⚠️ RESEARCH_COMPONENT_INTEGRATION.md - Mentions old architecture
  • ⚠️ PHASE1_IMPLEMENTATION_REVIEW.md - Missing intent-driven research
  • ⚠️ RESEARCH_IMPROVEMENTS_SUMMARY.md - Missing intent-driven research
  • ⚠️ COMPLETE_IMPLEMENTATION_SUMMARY.md - Missing intent-driven research

Conclusion

The Research Engine is fully functional and production-ready. The system has evolved from a traditional keyword-based search to an AI-powered intent-driven research assistant with:

  • 50% reduction in LLM calls (unified analyzer)
  • Hyper-personalization based on onboarding data
  • Structured deliverables (statistics, quotes, case studies, etc.)
  • Provider optimization (Exa → Tavily → Google)
  • UI indicators showing personalization
  • My Projects integration with Asset Library

Main Gaps:

  1. Documentation updates (some outdated docs)
  2. Alert system integration (cost/efficiency alerts)
  3. Feature enhancements (file upload, templates, etc.)

Recommended Focus: Start with documentation updates (high value, low effort) followed by alert system integration (improves user experience and cost transparency).


Status: Codebase Review Complete - System is Production-Ready 🚀