Research Persona Data Sources & Generated Fields
Overview
The Research Persona is an AI-generated profile that provides hyper-personalized research defaults, suggestions, and configurations based on a user's onboarding data. This document details what data is used to generate the persona and what fields are produced.
Data Sources Used for Generation
1. Website Analysis (website_analysis)
Source: Onboarding Step 2 - Website Analysis
Location: WebsiteAnalysis table in database
Key Fields Used:
- website_url: User's website URL
- writing_style: Tone, voice, complexity, engagement level
- content_characteristics: Sentence structure, vocabulary, paragraph organization
- target_audience: Demographics, expertise level, industry focus
- content_type: Primary type, secondary types, purpose
- recommended_settings: Writing tone, target audience, content type
- style_patterns: Writing patterns analysis
- style_guidelines: Generated guidelines
Usage: Extracts industry focus, target audience, content preferences, and writing style patterns to inform research defaults.
2. Core Persona (core_persona)
Source: Onboarding Step 4 - Persona Generation
Location: PersonaData.core_persona JSON field
Key Fields Used:
- industry: User's primary industry
- target_audience: Detailed audience description
- interests: User's content interests and focus areas
- pain_points: Challenges and needs
- content_goals: What the user wants to achieve with content
Usage: Primary source for industry, audience, and content strategy insights.
3. Research Preferences (research_preferences)
Source: Onboarding Step 3 - Research Preferences
Location: ResearchPreferences table
Key Fields Used:
- research_depth: "standard", "comprehensive", "basic"
- content_types: Array of content types (e.g., ["blog", "social", "video"])
- auto_research: Whether to auto-enable research
- factual_content: Preference for factual vs. opinion-based content
- writing_style: Inherited from website analysis
- content_characteristics: Inherited from website analysis
- target_audience: Inherited from website analysis
Usage: Determines default research mode, provider preferences, and content type focus.
4. Business Information (business_info)
Source: Constructed from persona data and website analysis
Key Fields Used:
- industry: Extracted from core_persona.industry or website_analysis.target_audience.industry_focus
- target_audience: Extracted from core_persona.target_audience or website_analysis.target_audience.demographics
Usage: Fallback and inference source when core persona data is minimal.
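The fallback order above could be sketched as follows. This is a minimal illustration, assuming both inputs are plain dictionaries shaped like the fields listed in this document and that persona_data carries a core_persona key. The function name matches the one used in the data collection step, but the body is an assumption rather than the actual implementation.

```python
from typing import Any, Optional


def construct_business_info(
    persona_data: Optional[dict[str, Any]],
    website_analysis: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Build business_info, preferring core persona values over website analysis."""
    core = (persona_data or {}).get("core_persona") or {}
    audience = (website_analysis or {}).get("target_audience") or {}

    return {
        # core_persona.industry, else website_analysis.target_audience.industry_focus
        "industry": core.get("industry") or audience.get("industry_focus"),
        # core_persona.target_audience, else website_analysis.target_audience.demographics
        "target_audience": core.get("target_audience") or audience.get("demographics"),
    }
```

Missing sources are treated as empty dictionaries, so the function returns None values instead of raising when onboarding is incomplete.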
5. Competitor Analysis (Future Enhancement)
Source: Onboarding Step 3 - Competitor Discovery
Location: CompetitorAnalysis table
Status: Currently not used in persona generation, but available for future enhancements
Potential Usage: Could inform industry context, competitive landscape insights, and domain suggestions.
Generated Research Persona Fields
1. Smart Defaults
| Field | Type | Description | Source Priority |
|---|---|---|---|
| default_industry | string | User's primary industry | 1. core_persona.industry; 2. business_info.industry; 3. website_analysis.target_audience.industry_focus; 4. Inferred from content_types |
| default_target_audience | string | Detailed audience description | 1. core_persona.target_audience; 2. website_analysis.target_audience; 3. business_info.target_audience; 4. Default: "Professionals and content consumers" |
| default_research_mode | string | "basic", "comprehensive", or "targeted" | Based on research_preferences.research_depth and content_type preferences |
| default_provider | string | "exa", "tavily", or "google" | Based on the user's typical research needs. Academic/research: "exa"; News/current events: "tavily"; General business: "exa"; Default: "exa" |
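The source priority for default_industry can be read as a "first non-empty value wins" chain. The sketch below illustrates that reading; the helper names and the shape of the onboarding dictionary are assumptions, and in the real system the LLM prompt may apply these priorities instead of application code.

```python
from typing import Any, Optional


def first_non_empty(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is a non-blank string, else None."""
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def resolve_default_industry(onboarding: dict[str, Any]) -> Optional[str]:
    """Apply priorities 1-3 from the table; priority 4 (inference) is not covered here."""
    core = onboarding.get("core_persona") or {}
    business = onboarding.get("business_info") or {}
    audience = (onboarding.get("website_analysis") or {}).get("target_audience") or {}
    return first_non_empty(
        core.get("industry"),
        business.get("industry"),
        audience.get("industry_focus"),
    )
```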
2. Keyword Intelligence
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| suggested_keywords | string[] | 8-12 relevant keywords | Generated from: user's industry; core persona interests; content goals; research preferences |
| keyword_expansion_patterns | Dict<string, string[]> | Mapping of keywords to expanded terms | 10-15 patterns like: {"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}. Focuses on industry-specific terminology |
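As an illustration of how keyword_expansion_patterns might be consumed, the sketch below expands a keyword list using the example mapping from the table. The function name is hypothetical, and how the application actually applies the patterns is not specified in this document.

```python
def expand_keywords(keywords: list[str], patterns: dict[str, list[str]]) -> list[str]:
    """Return each keyword followed by its expansions, without duplicates."""
    expanded: list[str] = []
    for keyword in keywords:
        # Keywords with no pattern pass through unchanged.
        for term in [keyword, *patterns.get(keyword, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


patterns = {"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}
print(expand_keywords(["AI", "tools", "pricing"], patterns))
# ['AI', 'healthcare AI', 'medical AI', 'tools', 'medical devices', 'pricing']
```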
3. Exa Provider Optimization
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| suggested_exa_domains | string[] | 4-6 authoritative domains | Industry-specific authoritative sources. Healthcare: ["pubmed.gov", "nejm.org"]; Finance: ["sec.gov", "bloomberg.com"]; Tech: ["github.com", "stackoverflow.com"] |
| suggested_exa_category | string? | Exa content category | Based on industry. Healthcare/Science: "research paper"; Finance: "financial report"; Tech/Business: "company" or "news"; Social/Marketing: "tweet" or "linkedin profile"; Default: null (all categories) |
| suggested_exa_search_type | string? | Exa search algorithm | Based on content needs. Academic/research: "neural"; Current news/trends: "fast"; General research: "auto"; Code/technical: "neural" |
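These values are produced by the LLM, but the industry mapping can be illustrated with a plain lookup, which might also serve as a fallback when generation fails. The dictionaries below copy only the unambiguous rows from the table; the function and constant names are assumptions.

```python
# Values copied from the table above; the lookup structure itself is an assumption.
EXA_DOMAINS_BY_INDUSTRY = {
    "healthcare": ["pubmed.gov", "nejm.org"],
    "finance": ["sec.gov", "bloomberg.com"],
    "tech": ["github.com", "stackoverflow.com"],
}
EXA_CATEGORY_BY_INDUSTRY = {
    "healthcare": "research paper",
    "science": "research paper",
    "finance": "financial report",
}


def suggest_exa_settings(industry: str) -> dict:
    """Look up Exa suggestions for an industry, case-insensitively."""
    key = industry.strip().lower()
    return {
        "suggested_exa_domains": EXA_DOMAINS_BY_INDUSTRY.get(key, []),
        # None corresponds to the documented default of "all categories".
        "suggested_exa_category": EXA_CATEGORY_BY_INDUSTRY.get(key),
    }
```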
4. Tavily Provider Optimization
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| suggested_tavily_topic | string? | "general", "news", or "finance" | Based on content type. Financial content: "finance"; News/current events: "news"; General research: "general" |
| suggested_tavily_search_depth | string? | "basic", "advanced", "fast", or "ultra-fast" | Based on research needs. Quick overview: "basic"; In-depth analysis: "advanced"; Breaking news: "fast" |
| suggested_tavily_include_answer | string? | "false", "basic", or "advanced" | Based on query type. Factual queries: "advanced"; Research summaries: "basic"; Custom content: "false" |
| suggested_tavily_time_range | string? | "day", "week", "month", "year", or null | Based on recency needs. Breaking news: "day"; Recent developments: "week"; Industry analysis: "month"; Historical: null |
| suggested_tavily_raw_content_format | string? | "false", "markdown", or "text" | Based on use case. Blog content: "markdown"; Text extraction: "text"; No raw content: "false" |
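Because each Tavily field accepts a small fixed set of values, generated output can be checked before it is used. The following sketch shows one way to do that; the function name and the choice to replace invalid values with None are assumptions, not behavior described in this document.

```python
# Allowed values as listed in the table above.
ALLOWED_TAVILY_VALUES = {
    "suggested_tavily_topic": {"general", "news", "finance"},
    "suggested_tavily_search_depth": {"basic", "advanced", "fast", "ultra-fast"},
    "suggested_tavily_include_answer": {"false", "basic", "advanced"},
    "suggested_tavily_time_range": {"day", "week", "month", "year"},
    "suggested_tavily_raw_content_format": {"false", "markdown", "text"},
}


def sanitize_tavily_suggestions(persona: dict) -> dict:
    """Keep allowed string values; replace anything else with None (fields are optional)."""
    cleaned = {}
    for field_name, allowed in ALLOWED_TAVILY_VALUES.items():
        value = persona.get(field_name)
        cleaned[field_name] = value if isinstance(value, str) and value in allowed else None
    return cleaned
```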
5. Provider Selection Logic
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| provider_recommendations | Dict<string, string> | Use case → provider mapping | Example: {"trends": "tavily", "deep_research": "exa", "factual": "google", "news": "tavily", "academic": "exa"} |
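A consumer of provider_recommendations might resolve a provider as sketched below. The function name is hypothetical; the "exa" fallback follows the default stated elsewhere in this document.

```python
KNOWN_PROVIDERS = {"exa", "tavily", "google"}


def pick_provider(use_case: str, recommendations: dict[str, str], default: str = "exa") -> str:
    """Return the recommended provider for a use case, falling back to the default."""
    provider = recommendations.get(use_case, default)
    # Guard against a generated value that is not a supported provider.
    return provider if provider in KNOWN_PROVIDERS else default
```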
6. Research Intelligence
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| research_angles | string[] | 5-8 alternative research angles | Generated from: user's pain points; industry trends; content goals; audience interests. Examples: "Compare {topic} tools", "{topic} ROI analysis" |
| query_enhancement_rules | Dict<string, string> | Templates for improving vague queries | 5-8 enhancement patterns: {"vague_ai": "Research: AI applications in {industry} for {audience}", "vague_tools": "Compare top {industry} tools"} |
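The {topic}, {industry}, and {audience} placeholders in these templates suggest simple string substitution. A minimal sketch, assuming the placeholders follow Python's str.format syntax and that unknown placeholders should be left untouched (both are assumptions):

```python
class _KeepMissing(dict):
    """Dictionary that leaves unknown placeholders in the template as-is."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def apply_rule(template: str, **context: str) -> str:
    """Fill a research angle or enhancement template with the given context."""
    return template.format_map(_KeepMissing(context))


rules = {
    "vague_ai": "Research: AI applications in {industry} for {audience}",
    "vague_tools": "Compare top {industry} tools",
}
print(apply_rule(rules["vague_ai"], industry="Healthcare", audience="clinicians"))
# Research: AI applications in Healthcare for clinicians
```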
7. Research Presets
| Field | Type | Description | Generation Logic |
|---|---|---|---|
| recommended_presets | ResearchPreset[] | 3-5 personalized preset templates | Each preset includes: name (descriptive name); keywords (research query); industry (user's industry); target_audience (user's audience); research_mode ("basic", "comprehensive", or "targeted"); config (complete ResearchConfig object); description (brief explanation) |
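The preset shape described above can be sketched with a standard-library dataclass. The actual models are Pydantic classes; this stand-in only mirrors the listed fields, and the plain dict used for config is a placeholder for the ResearchConfig object, whose fields are not documented here.

```python
from dataclasses import dataclass, field
from typing import Any

RESEARCH_MODES = {"basic", "comprehensive", "targeted"}


@dataclass
class ResearchPreset:
    name: str
    keywords: str
    industry: str
    target_audience: str
    research_mode: str
    config: dict[str, Any] = field(default_factory=dict)  # placeholder for ResearchConfig
    description: str = ""

    def __post_init__(self) -> None:
        # Reject modes outside the three documented values.
        if self.research_mode not in RESEARCH_MODES:
            raise ValueError(f"unsupported research_mode: {self.research_mode!r}")
```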
8. Research Preferences (Structured)
| Field | Type | Description | Source |
|---|---|---|---|
| research_preferences | Dict<string, any> | Structured research preferences | Extracted from onboarding: research_depth (from research_preferences.research_depth); content_types (from research_preferences.content_types); auto_research (from research_preferences.auto_research); factual_content (from research_preferences.factual_content) |
9. Metadata
| Field | Type | Description |
|---|---|---|
| generated_at | string? | ISO timestamp of generation |
| confidence_score | float? | Confidence score 0-1 (higher = richer data) |
| version | string? | Schema version (e.g., "1.0") |
Data Collection Process
Step 1: Collect Onboarding Data
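# Fetch each source once so business_info can be derived from the results.
website_analysis = get_website_analysis(user_id)
persona_data = get_persona_data(user_id)

onboarding_data = {
    "website_analysis": website_analysis,
    "persona_data": persona_data,
    "research_preferences": get_research_preferences(user_id),
    "business_info": construct_business_info(persona_data, website_analysis),
}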
Step 2: Build AI Prompt
The prompt includes:
- All onboarding data (JSON formatted)
- Detailed instructions for each field
- Examples and use cases
- Rules for handling minimal data scenarios
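A heavily simplified sketch of the prompt assembly is shown below. The real prompt lives in the prompt builder module and is considerably more detailed; the function name and wording here are illustrative assumptions only.

```python
import json


def build_persona_prompt(onboarding_data: dict) -> str:
    """Combine instructions and JSON-formatted onboarding data into one prompt."""
    sections = [
        "Generate a research persona as a JSON object for the user described below.",
        # default=str keeps non-JSON values such as datetimes from raising.
        "Onboarding data:\n" + json.dumps(onboarding_data, indent=2, default=str),
        'Never use "General" for industry or audience; infer a specific value from context.',
    ]
    return "\n\n".join(sections)
```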
Step 3: LLM Generation
- Uses structured JSON response format
- Validates against ResearchPersona Pydantic model
- Adds metadata (generated_at, confidence_score)
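The parse, validate, and annotate sequence can be sketched with the standard library alone. The real service validates against the ResearchPersona Pydantic model; the required-field check below is a simplified stand-in, and the function name and the way confidence_score is supplied are assumptions.

```python
import json
from datetime import datetime, timezone

# Stand-in for the Pydantic model: only the Smart Defaults fields are checked.
REQUIRED_FIELDS = (
    "default_industry",
    "default_target_audience",
    "default_research_mode",
    "default_provider",
)


def parse_persona_response(raw: str, confidence_score: float) -> dict:
    """Parse the LLM's JSON output, check required fields, and add metadata."""
    persona = json.loads(raw)
    if not isinstance(persona, dict):
        raise ValueError("persona response must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in persona]
    if missing:
        raise ValueError(f"persona response is missing fields: {missing}")
    persona["generated_at"] = datetime.now(timezone.utc).isoformat()
    persona["confidence_score"] = confidence_score
    return persona
```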
Step 4: Save to Database
- Stored in PersonaData.research_persona JSON field
- Cached with 7-day TTL
- Timestamp stored in PersonaData.research_persona_generated_at
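The 7-day TTL implies a freshness check against the stored timestamp. A minimal sketch, assuming the timestamp is available as a timezone-aware datetime (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL = timedelta(days=7)


def is_persona_fresh(generated_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Return True if the persona was generated less than 7 days ago.

    Both datetimes must be timezone-aware; a missing timestamp counts as stale.
    """
    if generated_at is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - generated_at < CACHE_TTL
```

A stale or missing persona would then trigger regeneration, which is also what the force_refresh option requests explicitly.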
Handling Minimal Data Scenarios
When onboarding data is incomplete, the prompt instructs the model to infer missing values from whatever context is available:
- Industry Inference:
  - From content_types: "blog" → "Content Marketing", "video" → "Video Content Creation"
  - From website_analysis.content_characteristics: Patterns suggest industry
  - Default: "Technology" or "Business Consulting"
- Target Audience Inference:
  - From writing_style: Complexity level suggests audience
  - From content_goals: Purpose suggests audience
  - Default: "Professionals and content consumers"
- Provider Defaults:
  - Always defaults to "exa" for content creators
  - Uses "tavily" only for news/current events focus
- Never Uses "General":
  - The prompt explicitly instructs to never use "General"
  - Always infers specific categories based on available context
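The content_types rule for industry inference could look like the following in code. In practice the inference is performed by the LLM according to the prompt; this deterministic version only illustrates the documented mapping, and choosing "Technology" over "Business Consulting" as the default is an arbitrary assumption.

```python
# Mapping taken from the examples above.
CONTENT_TYPE_TO_INDUSTRY = {
    "blog": "Content Marketing",
    "video": "Video Content Creation",
}


def infer_industry(content_types: list[str], default: str = "Technology") -> str:
    """Return the industry for the first recognized content type, else the default."""
    for content_type in content_types:
        industry = CONTENT_TYPE_TO_INDUSTRY.get(content_type.strip().lower())
        if industry:
            return industry
    # The default is always a specific category, never "General".
    return default
```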
Frontend Display
Currently Displayed Fields:
✅ Default Settings (industry, audience, mode, provider)
✅ Suggested Keywords
✅ Research Angles
✅ Recommended Presets
✅ Metadata (generated_at, confidence_score, version)
Recently Added Fields (Enhanced Display):
✅ Keyword Expansion Patterns
✅ Exa Provider Settings (domains, category, search_type)
✅ Tavily Provider Settings (topic, depth, answer, time_range, format)
✅ Provider Recommendations
✅ Query Enhancement Rules
✅ Research Preferences (structured)
Future Enhancements
- Competitor Analysis Integration: Use competitor data to inform industry context and domain suggestions
- Research History: Learn from past research queries to improve suggestions
- A/B Testing: Test different persona generation strategies
- User Feedback Loop: Allow users to rate and improve persona suggestions
- Multi-Industry Support: Handle users with multiple industries/niches
API Endpoints
- GET /api/research/persona-defaults: Get persona defaults (cached only)
- GET /api/research/research-persona: Get or generate research persona
- POST /api/research/research-persona?force_refresh=true: Force regenerate persona
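A client could build requests for the research-persona endpoint as sketched below, using only the standard library. The base URL is an assumption, and any authentication headers the API requires are not covered in this document, so they are omitted.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: adjust to the actual backend host
PERSONA_PATH = "/api/research/research-persona"


def build_persona_request(force_refresh: bool = False) -> Request:
    """Build a GET request, or a POST with force_refresh=true to regenerate."""
    url = urljoin(BASE_URL, PERSONA_PATH)
    if force_refresh:
        query = urlencode({"force_refresh": "true"})
        return Request(f"{url}?{query}", method="POST")
    return Request(url, method="GET")
```

The returned object can be passed to urllib.request.urlopen once the backend is reachable.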
Related Files
- Backend: backend/services/research/research_persona_service.py
- Prompt Builder: backend/services/research/research_persona_prompt_builder.py
- Models: backend/models/research_persona_models.py
- API: backend/api/research_config.py
- Frontend: frontend/src/pages/ResearchTest.tsx (Persona Details Modal)