# Research Persona Data Sources & Generated Fields
## Overview
The Research Persona is an AI-generated profile that supplies personalized research defaults, suggestions, and provider configurations derived from a user's onboarding data. This document lists the data used to generate the persona and the fields it produces.
---
## Data Sources Used for Generation
### 1. **Website Analysis** (`website_analysis`)
**Source**: Onboarding Step 2 - Website Analysis
**Location**: `WebsiteAnalysis` table in database
**Key Fields Used**:
- `website_url`: User's website URL
- `writing_style`: Tone, voice, complexity, engagement level
- `content_characteristics`: Sentence structure, vocabulary, paragraph organization
- `target_audience`: Demographics, expertise level, industry focus
- `content_type`: Primary type, secondary types, purpose
- `recommended_settings`: Writing tone, target audience, content type
- `style_patterns`: Writing patterns analysis
- `style_guidelines`: Generated guidelines
**Usage**: Extracts industry focus, target audience, content preferences, and writing style patterns to inform research defaults.
### 2. **Core Persona** (`core_persona`)
**Source**: Onboarding Step 4 - Persona Generation
**Location**: `PersonaData.core_persona` JSON field
**Key Fields Used**:
- `industry`: User's primary industry
- `target_audience`: Detailed audience description
- `interests`: User's content interests and focus areas
- `pain_points`: Challenges and needs
- `content_goals`: What the user wants to achieve with content
**Usage**: Primary source for industry, audience, and content strategy insights.
### 3. **Research Preferences** (`research_preferences`)
**Source**: Onboarding Step 3 - Research Preferences
**Location**: `ResearchPreferences` table
**Key Fields Used**:
- `research_depth`: "standard", "comprehensive", "basic"
- `content_types`: Array of content types (e.g., ["blog", "social", "video"])
- `auto_research`: Whether to auto-enable research
- `factual_content`: Preference for factual vs. opinion-based content
- `writing_style`: Inherited from website analysis
- `content_characteristics`: Inherited from website analysis
- `target_audience`: Inherited from website analysis
**Usage**: Determines default research mode, provider preferences, and content type focus.
### 4. **Business Information** (`business_info`)
**Source**: Constructed from persona data and website analysis
**Key Fields Used**:
- `industry`: Extracted from `core_persona.industry` or `website_analysis.target_audience.industry_focus`
- `target_audience`: Extracted from `core_persona.target_audience` or `website_analysis.target_audience.demographics`
**Usage**: Fallback and inference source when core persona data is minimal.
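
The fallback order above can be sketched as follows. This is a minimal illustration, not the service's actual code: it assumes `persona_data` is a dict holding the core persona under a `core_persona` key and that `website_analysis` is a plain dict; the real `construct_business_info` may differ.

```python
from typing import Any, Optional


def construct_business_info(
    persona_data: Optional[dict[str, Any]],
    website_analysis: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Build business_info, preferring core persona values over website analysis."""
    # Assumption: the core persona lives under the "core_persona" key.
    core = (persona_data or {}).get("core_persona") or {}
    audience = (website_analysis or {}).get("target_audience") or {}
    return {
        "industry": core.get("industry") or audience.get("industry_focus"),
        "target_audience": core.get("target_audience") or audience.get("demographics"),
    }


info = construct_business_info(
    {"core_persona": {"industry": "Healthcare"}},
    {"target_audience": {"industry_focus": "Technology", "demographics": "Clinicians"}},
)
print(info)  # {'industry': 'Healthcare', 'target_audience': 'Clinicians'}
```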
### 5. **Competitor Analysis** (Future Enhancement)
**Source**: Onboarding Step 3 - Competitor Discovery
**Location**: `CompetitorAnalysis` table
**Status**: Currently not used in persona generation, but available for future enhancements
**Potential Usage**: Could inform industry context, competitive landscape insights, and domain suggestions.
---
## Generated Research Persona Fields
### **1. Smart Defaults**
| Field | Type | Description | Source Priority |
|-------|------|-------------|-----------------|
| `default_industry` | string | User's primary industry | 1. `core_persona.industry`<br>2. `business_info.industry`<br>3. `website_analysis.target_audience.industry_focus`<br>4. Inferred from `content_types` |
| `default_target_audience` | string | Detailed audience description | 1. `core_persona.target_audience`<br>2. `website_analysis.target_audience`<br>3. `business_info.target_audience`<br>4. Default: "Professionals and content consumers" |
| `default_research_mode` | string | "basic" \| "comprehensive" \| "targeted" | Based on `research_preferences.research_depth` and content type preferences |
| `default_provider` | string | "exa" \| "tavily" \| "google" | Based on the user's typical research needs:<br>- Academic/research: "exa"<br>- News/current events: "tavily"<br>- General business: "exa"<br>- Default: "exa" |
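
The source priority for `default_target_audience` amounts to "first non-empty value wins, else the documented default". In practice the LLM applies this ordering from the prompt; the sketch below is only a deterministic equivalent, and the helper names `first_present` and `resolve_default_target_audience` are hypothetical.

```python
from typing import Any, Optional

DEFAULT_AUDIENCE = "Professionals and content consumers"


def first_present(*candidates: Any) -> Optional[Any]:
    """Return the first candidate that is neither None nor empty."""
    for value in candidates:
        if value not in (None, "", [], {}):
            return value
    return None


def resolve_default_target_audience(data: dict[str, Any]) -> Any:
    """Walk the documented priority order, falling back to the default audience."""
    core = data.get("core_persona") or {}
    website = data.get("website_analysis") or {}
    business = data.get("business_info") or {}
    return first_present(
        core.get("target_audience"),
        website.get("target_audience"),
        business.get("target_audience"),
    ) or DEFAULT_AUDIENCE
```

The same helper would cover `default_industry` by swapping in its four sources.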
### **2. Keyword Intelligence**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `suggested_keywords` | string[] | 8-12 relevant keywords | Generated from:<br>- User's industry<br>- Core persona interests<br>- Content goals<br>- Research preferences |
| `keyword_expansion_patterns` | Dict | Mapping of keywords to expanded terms | 10-15 patterns like:<br>`{"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}`<br>Focuses on industry-specific terminology |
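
To show how a consumer might apply `keyword_expansion_patterns`, here is a small sketch. The function `expand_keywords` is hypothetical, and it assumes exact, case-sensitive matching of keywords against pattern keys; the real matching rule is not specified in this document.

```python
def expand_keywords(keywords: list[str], patterns: dict[str, list[str]]) -> list[str]:
    """Return each keyword followed by its expansions, without duplicates."""
    expanded: list[str] = []
    for keyword in keywords:
        # Exact-match lookup (assumption); unknown keywords pass through unchanged.
        for term in [keyword, *patterns.get(keyword, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


patterns = {"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}
result = expand_keywords(["AI", "tools", "pricing"], patterns)
print(result)
# ['AI', 'healthcare AI', 'medical AI', 'tools', 'medical devices', 'pricing']
```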
### **3. Exa Provider Optimization**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `suggested_exa_domains` | string[] | 4-6 authoritative domains | Industry-specific authoritative sources:<br>- Healthcare: ["pubmed.gov", "nejm.org"]<br>- Finance: ["sec.gov", "bloomberg.com"]<br>- Tech: ["github.com", "stackoverflow.com"] |
| `suggested_exa_category` | string? | Exa content category | Based on industry:<br>- Healthcare/Science: "research paper"<br>- Finance: "financial report"<br>- Tech/Business: "company" or "news"<br>- Social/Marketing: "tweet" or "linkedin profile"<br>- Default: null (all categories) |
| `suggested_exa_search_type` | string? | Exa search algorithm | Based on content needs:<br>- Academic/research: "neural"<br>- Current news/trends: "fast"<br>- General research: "auto"<br>- Code/technical: "neural" |
### **4. Tavily Provider Optimization**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `suggested_tavily_topic` | string? | "general" \| "news" \| "finance" | Based on content type:<br>- Financial content: "finance"<br>- News/current events: "news"<br>- General research: "general" |
| `suggested_tavily_search_depth` | string? | "basic" \| "advanced" \| "fast" \| "ultra-fast" | Based on research needs:<br>- Quick overview: "basic"<br>- In-depth analysis: "advanced"<br>- Breaking news: "fast" |
| `suggested_tavily_include_answer` | string? | "false" \| "basic" \| "advanced" | Based on query type:<br>- Factual queries: "advanced"<br>- Research summaries: "basic"<br>- Custom content: "false" |
| `suggested_tavily_time_range` | string? | "day" \| "week" \| "month" \| "year" \| null | Based on recency needs:<br>- Breaking news: "day"<br>- Recent developments: "week"<br>- Industry analysis: "month"<br>- Historical: null |
| `suggested_tavily_raw_content_format` | string? | "false" \| "markdown" \| "text" | Based on use case:<br>- Blog content: "markdown"<br>- Text extraction: "text"<br>- No raw content: "false" |
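
Because every `suggested_tavily_*` field is nullable, a consumer will probably want to drop the nulls before using them. The sketch below collects the non-null fields into a plain dict keyed without the prefix. The function `tavily_options` is hypothetical, and how these keys map onto actual Tavily request parameters is deliberately left out, since this document does not specify it.

```python
from typing import Any

TAVILY_PREFIX = "suggested_tavily_"


def tavily_options(persona: dict[str, Any]) -> dict[str, Any]:
    """Collect the non-null suggested_tavily_* fields, keyed without the prefix."""
    return {
        key[len(TAVILY_PREFIX):]: value
        for key, value in persona.items()
        if key.startswith(TAVILY_PREFIX) and value is not None
    }


persona = {
    "default_provider": "tavily",
    "suggested_tavily_topic": "news",
    "suggested_tavily_search_depth": "advanced",
    "suggested_tavily_time_range": None,  # "Historical": no recency filter
}
options = tavily_options(persona)
print(options)  # {'topic': 'news', 'search_depth': 'advanced'}
```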
### **5. Provider Selection Logic**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `provider_recommendations` | Dict | Use case → provider mapping | Example:<br>`{"trends": "tavily", "deep_research": "exa", "factual": "google", "news": "tavily", "academic": "exa"}` |
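
One way to consume the mapping is a lookup that falls back to `default_provider`, and then to the documented `"exa"` default, when a use case is not listed. The function name `pick_provider` is hypothetical and the fallback chain is an assumption about how the two fields might be combined.

```python
from typing import Any


def pick_provider(use_case: str, persona: dict[str, Any]) -> str:
    """Choose a provider for a use case, falling back to the persona default."""
    recommendations = persona.get("provider_recommendations") or {}
    return recommendations.get(use_case) or persona.get("default_provider") or "exa"


persona = {
    "default_provider": "exa",
    "provider_recommendations": {"trends": "tavily", "factual": "google", "academic": "exa"},
}
print(pick_provider("trends", persona))      # tavily
print(pick_provider("interviews", persona))  # exa (not listed, so the default applies)
```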
### **6. Research Intelligence**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `research_angles` | string[] | 5-8 alternative research angles | Generated from:<br>- User's pain points<br>- Industry trends<br>- Content goals<br>- Audience interests<br>Examples: "Compare {topic} tools", "{topic} ROI analysis" |
| `query_enhancement_rules` | Dict | Templates for improving vague queries | 5-8 enhancement patterns:<br>`{"vague_ai": "Research: AI applications in {industry} for {audience}", "vague_tools": "Compare top {industry} tools"}` |
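
The enhancement templates use `{industry}` and `{audience}` placeholders, so filling one in is a plain `str.format` call. The sketch below assumes the caller already knows which rule key applies; how a vague query is matched to a key such as `vague_ai` is not described here, and `enhance_query` is a hypothetical name.

```python
from typing import Optional


def enhance_query(
    rule_key: str, rules: dict[str, str], industry: str, audience: str
) -> Optional[str]:
    """Fill the template for rule_key, or return None if no such rule exists."""
    template = rules.get(rule_key)
    if template is None:
        return None
    # Unused keyword arguments are ignored by str.format; a template with any
    # other placeholder would raise KeyError.
    return template.format(industry=industry, audience=audience)


rules = {
    "vague_ai": "Research: AI applications in {industry} for {audience}",
    "vague_tools": "Compare top {industry} tools",
}
print(enhance_query("vague_ai", rules, "healthcare", "clinicians"))
# Research: AI applications in healthcare for clinicians
```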
### **7. Research Presets**
| Field | Type | Description | Generation Logic |
|-------|------|-------------|------------------|
| `recommended_presets` | ResearchPreset[] | 3-5 personalized preset templates | Each preset includes:<br>- `name`: Descriptive name<br>- `keywords`: Research query<br>- `industry`: User's industry<br>- `target_audience`: User's audience<br>- `research_mode`: "basic" \| "comprehensive" \| "targeted"<br>- `config`: Complete ResearchConfig object<br>- `description`: Brief explanation |
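
The preset shape can be pictured with a small standard-library stand-in. The real `ResearchPreset` is presumably a Pydantic model in `research_persona_models.py`; the dataclass below is only an illustration, `config` is typed as a plain dict because the `ResearchConfig` fields are not listed here, and the example `config` keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

RESEARCH_MODES = ("basic", "comprehensive", "targeted")


@dataclass
class ResearchPreset:
    name: str
    keywords: str  # the research query for this preset
    industry: str
    target_audience: str
    research_mode: str
    config: dict[str, Any] = field(default_factory=dict)  # stand-in for ResearchConfig
    description: str = ""

    def __post_init__(self) -> None:
        # Reject modes outside the three documented values.
        if self.research_mode not in RESEARCH_MODES:
            raise ValueError(f"unknown research_mode: {self.research_mode!r}")


preset = ResearchPreset(
    name="Healthcare AI trends",
    keywords="AI applications in clinical diagnostics",
    industry="Healthcare",
    target_audience="Clinicians",
    research_mode="comprehensive",
    config={"provider": "exa"},  # hypothetical config key
    description="Deep dive into recent diagnostic AI research",
)
```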
### **8. Research Preferences (Structured)**
| Field | Type | Description | Source |
|-------|------|-------------|--------|
| `research_preferences` | Dict | Structured research preferences | Extracted from onboarding:<br>- `research_depth`: From `research_preferences.research_depth`<br>- `content_types`: From `research_preferences.content_types`<br>- `auto_research`: From `research_preferences.auto_research`<br>- `factual_content`: From `research_preferences.factual_content` |
### **9. Metadata**
| Field | Type | Description |
|-------|------|-------------|
| `generated_at` | string? | ISO timestamp of generation |
| `confidence_score` | float? | Confidence score 0-1 (higher = richer data) |
| `version` | string? | Schema version (e.g., "1.0") |
---
## Data Collection Process
### Step 1: Collect Onboarding Data
```python
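# Fetch each source once so business_info can reuse the results.
website_analysis = get_website_analysis(user_id)
persona_data = get_persona_data(user_id)
onboarding_data = {
    "website_analysis": website_analysis,
    "persona_data": persona_data,
    "research_preferences": get_research_preferences(user_id),
    "business_info": construct_business_info(persona_data, website_analysis),
}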
```
### Step 2: Build AI Prompt
The prompt includes:
- All onboarding data (JSON formatted)
- Detailed instructions for each field
- Examples and use cases
- Rules for handling minimal data scenarios
### Step 3: LLM Generation
- Uses structured JSON response format
- Validates against `ResearchPersona` Pydantic model
- Adds metadata (generated_at, confidence_score)
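
The validate-then-stamp step might look roughly like this. It is a standard-library stand-in for the Pydantic validation: the set of required fields is an assumption, `finalize_persona` is a hypothetical name, and the confidence score is passed in because this document does not say how it is computed.

```python
import json
from datetime import datetime, timezone
from typing import Any

# Assumption: the four smart defaults are treated as mandatory.
REQUIRED_FIELDS = (
    "default_industry",
    "default_target_audience",
    "default_research_mode",
    "default_provider",
)


def finalize_persona(raw_response: str, confidence_score: float) -> dict[str, Any]:
    """Parse the LLM's JSON, check required fields, and attach metadata."""
    persona = json.loads(raw_response)
    missing = [name for name in REQUIRED_FIELDS if name not in persona]
    if missing:
        raise ValueError(f"persona is missing fields: {missing}")
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be between 0 and 1")
    persona["generated_at"] = datetime.now(timezone.utc).isoformat()
    persona["confidence_score"] = confidence_score
    return persona
```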
### Step 4: Save to Database
- Stored in `PersonaData.research_persona` JSON field
- Cached with 7-day TTL
- Timestamp stored in `PersonaData.research_persona_generated_at`
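
A 7-day TTL check against the stored timestamp could be sketched as below. The function `is_persona_stale` is hypothetical, and it assumes `research_persona_generated_at` is loaded as a timezone-aware `datetime`; treating a missing timestamp as stale is also an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL = timedelta(days=7)


def is_persona_stale(generated_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Return True when no persona exists yet or the cached one has outlived the TTL."""
    if generated_at is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - generated_at >= CACHE_TTL


generated = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(is_persona_stale(generated, now=datetime(2025, 1, 5, tzinfo=timezone.utc)))  # False
print(is_persona_stale(generated, now=datetime(2025, 1, 9, tzinfo=timezone.utc)))  # True
```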
---
## Handling Minimal Data Scenarios
When onboarding data is incomplete, the prompt directs the model to infer missing values from whatever context is available:
1. **Industry Inference**:
   - From `content_types`: "blog" → "Content Marketing", "video" → "Video Content Creation"
   - From `website_analysis.content_characteristics`: Patterns suggest industry
   - Default: "Technology" or "Business Consulting"
2. **Target Audience Inference**:
   - From `writing_style`: Complexity level suggests audience
   - From `content_goals`: Purpose suggests audience
   - Default: "Professionals and content consumers"
3. **Provider Defaults**:
   - Defaults to "exa" for content creators
   - Uses "tavily" only when the focus is news or current events
4. **Never Uses "General"**:
   - The prompt explicitly instructs the model never to output "General"
   - It always infers specific categories based on available context
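
The industry inference from `content_types` is done by the LLM, but its documented mapping can be expressed deterministically for illustration. The sketch below covers only the two mappings listed above and picks "Technology" as the fallback (the text also allows "Business Consulting"); `infer_industry` is a hypothetical name.

```python
from typing import Optional

CONTENT_TYPE_TO_INDUSTRY = {
    "blog": "Content Marketing",
    "video": "Video Content Creation",
}
FALLBACK_INDUSTRY = "Technology"  # never "General"


def infer_industry(content_types: Optional[list[str]]) -> str:
    """Return the industry for the first recognised content type, else the fallback."""
    for content_type in content_types or []:
        industry = CONTENT_TYPE_TO_INDUSTRY.get(content_type)
        if industry:
            return industry
    return FALLBACK_INDUSTRY


print(infer_industry(["social", "video"]))  # Video Content Creation
print(infer_industry([]))                   # Technology
```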
---
## Frontend Display
### Currently Displayed Fields:
- ✅ Default Settings (industry, audience, mode, provider)
- ✅ Suggested Keywords
- ✅ Research Angles
- ✅ Recommended Presets
- ✅ Metadata (generated_at, confidence_score, version)
### Recently Added Fields (Enhanced Display):
- ✅ Keyword Expansion Patterns
- ✅ Exa Provider Settings (domains, category, search_type)
- ✅ Tavily Provider Settings (topic, depth, answer, time_range, format)
- ✅ Provider Recommendations
- ✅ Query Enhancement Rules
- ✅ Research Preferences (structured)
---
## Future Enhancements
1. **Competitor Analysis Integration**: Use competitor data to inform industry context and domain suggestions
2. **Research History**: Learn from past research queries to improve suggestions
3. **A/B Testing**: Test different persona generation strategies
4. **User Feedback Loop**: Allow users to rate and improve persona suggestions
5. **Multi-Industry Support**: Handle users with multiple industries/niches
---
## API Endpoints
- `GET /api/research/persona-defaults`: Get persona defaults (cached only)
- `GET /api/research/research-persona`: Get or generate research persona
- `POST /api/research/research-persona?force_refresh=true`: Force regenerate persona
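
As a client-side illustration, the two `research-persona` calls can be built with the standard library. The base URL is a placeholder, authentication is omitted because it is not described here, and the helper `persona_request` is hypothetical. The requests are only constructed; pass one to `urllib.request.urlopen` to send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # placeholder; use your deployment's address
PERSONA_PATH = "/api/research/research-persona"


def persona_request(force_refresh: bool = False) -> Request:
    """Build a GET (get or generate) or POST (force regenerate) request."""
    if force_refresh:
        query = urlencode({"force_refresh": "true"})
        return Request(f"{BASE_URL}{PERSONA_PATH}?{query}", method="POST")
    return Request(f"{BASE_URL}{PERSONA_PATH}", method="GET")


request = persona_request(force_refresh=True)
print(request.get_method(), request.full_url)
# POST http://localhost:8000/api/research/research-persona?force_refresh=true
```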
---
## Related Files
- **Backend**: `backend/services/research/research_persona_service.py`
- **Prompt Builder**: `backend/services/research/research_persona_prompt_builder.py`
- **Models**: `backend/models/research_persona_models.py`
- **API**: `backend/api/research_config.py`
- **Frontend**: `frontend/src/pages/ResearchTest.tsx` (Persona Details Modal)