2026-01-01 17:56:25 +05:30

Research Persona Data Sources & Generated Fields

Overview

The Research Persona is an AI-generated profile that supplies personalized research defaults, suggestions, and provider configurations derived from a user's onboarding data. This document lists the data used to generate the persona and the fields it produces.

Data Sources Used for Generation

1. Website Analysis (`website_analysis`)

Source: Onboarding Step 2 - Website Analysis
Location: WebsiteAnalysis table in database
Key Fields Used:

website_url: User's website URL
writing_style: Tone, voice, complexity, engagement level
content_characteristics: Sentence structure, vocabulary, paragraph organization
target_audience: Demographics, expertise level, industry focus
content_type: Primary type, secondary types, purpose
recommended_settings: Writing tone, target audience, content type
style_patterns: Writing patterns analysis
style_guidelines: Generated guidelines

Usage: Extracts industry focus, target audience, content preferences, and writing style patterns to inform research defaults.

2. Core Persona (`core_persona`)

Source: Onboarding Step 4 - Persona Generation
Location: PersonaData.core_persona JSON field
Key Fields Used:

industry: User's primary industry
target_audience: Detailed audience description
interests: User's content interests and focus areas
pain_points: Challenges and needs
content_goals: What the user wants to achieve with content

Usage: Primary source for industry, audience, and content strategy insights.

3. Research Preferences (`research_preferences`)

Source: Onboarding Step 3 - Research Preferences
Location: ResearchPreferences table
Key Fields Used:

research_depth: One of "basic", "standard", or "comprehensive"
content_types: Array of content types (e.g., ["blog", "social", "video"])
auto_research: Whether to auto-enable research
factual_content: Preference for factual vs. opinion-based content
writing_style: Inherited from website analysis
content_characteristics: Inherited from website analysis
target_audience: Inherited from website analysis

Usage: Determines default research mode, provider preferences, and content type focus.

4. Business Information (`business_info`)

Source: Constructed from persona data and website analysis
Key Fields Used:

industry: Extracted from core_persona.industry or website_analysis.target_audience.industry_focus
target_audience: Extracted from core_persona.target_audience or website_analysis.target_audience.demographics

Usage: Fallback and inference source when core persona data is minimal.
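
The fallback order described above could be sketched as follows. This is a minimal illustration, not the actual service code: the body of `construct_business_info` is assumed, as is the shape of `persona_data` (a dict holding a `core_persona` key).

```python
from typing import Any, Optional


def construct_business_info(
    persona_data: Optional[dict[str, Any]],
    website_analysis: Optional[dict[str, Any]],
) -> dict[str, Optional[str]]:
    """Build the fallback business_info dict (hypothetical implementation)."""
    core = (persona_data or {}).get("core_persona") or {}
    audience = (website_analysis or {}).get("target_audience") or {}

    return {
        # Prefer the core persona, then fall back to the website analysis.
        "industry": core.get("industry") or audience.get("industry_focus"),
        "target_audience": core.get("target_audience") or audience.get("demographics"),
    }
```

Either argument may be `None`, in which case the corresponding values come back as `None` and later inference steps take over.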

5. Competitor Analysis (Future Enhancement)

Source: Onboarding Step 3 - Competitor Discovery
Location: CompetitorAnalysis table
Status: Not currently used in persona generation; the data is available for future enhancements

Potential Usage: Could inform industry context, competitive landscape insights, and domain suggestions.

Generated Research Persona Fields

1. Smart Defaults

| Field | Type | Description | Source Priority |
|---|---|---|---|
| `default_industry` | string | User's primary industry | 1. core_persona.industry; 2. business_info.industry; 3. website_analysis.target_audience.industry_focus; 4. Inferred from content_types |
| `default_target_audience` | string | Detailed audience description | 1. core_persona.target_audience; 2. website_analysis.target_audience; 3. business_info.target_audience; 4. Default: "Professionals and content consumers" |
| `default_research_mode` | string | "basic" \| "comprehensive" \| "targeted" | Based on research_preferences.research_depth and content_type preferences |
| `default_provider` | string | "exa" \| "tavily" \| "google" | Based on the user's typical research needs. Academic/research: "exa"; News/current events: "tavily"; General business: "exa"; Default: "exa" |
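
The source-priority column for `default_industry` amounts to a "first non-empty value wins" chain. The sketch below shows that idea only. The helper names `first_present` and `resolve_default_industry` are hypothetical, and in the real pipeline the LLM applies this priority through the prompt rather than through code like this.

```python
from typing import Any, Iterable, Optional


def first_present(candidates: Iterable[Any]) -> Optional[str]:
    """Return the first candidate that is a non-blank string, else None."""
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def resolve_default_industry(
    onboarding_data: dict[str, Any], inferred: Optional[str] = None
) -> Optional[str]:
    """Walk the documented priority list for default_industry."""
    persona = onboarding_data.get("persona_data") or {}
    core = persona.get("core_persona") or {}
    business = onboarding_data.get("business_info") or {}
    website = onboarding_data.get("website_analysis") or {}
    audience = website.get("target_audience") or {}
    return first_present([
        core.get("industry"),            # 1. core persona
        business.get("industry"),        # 2. business info
        audience.get("industry_focus"),  # 3. website analysis
        inferred,                        # 4. value inferred from content_types
    ])
```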

2. Keyword Intelligence

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `suggested_keywords` | string[] | 8-12 relevant keywords | Generated from: user's industry; core persona interests; content goals; research preferences |
| `keyword_expansion_patterns` | Dict<string, string[]> | Mapping of keywords to expanded terms | 10-15 patterns like: `{"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}`. Focuses on industry-specific terminology |
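
One way a consumer might apply `keyword_expansion_patterns` to a list of keywords is shown below. The function `expand_keywords` is a hypothetical helper, and the exact-key lookup and case-insensitive de-duplication are assumptions made for the example.

```python
def expand_keywords(keywords: list[str], patterns: dict[str, list[str]]) -> list[str]:
    """Append expansion terms for each keyword, keeping order and dropping duplicates."""
    expanded: list[str] = []
    seen: set[str] = set()
    for keyword in keywords:
        # Pattern keys are matched exactly; unknown keywords expand to themselves.
        for term in [keyword, *patterns.get(keyword, [])]:
            key = term.lower()
            if key not in seen:
                seen.add(key)
                expanded.append(term)
    return expanded


patterns = {"AI": ["healthcare AI", "medical AI"], "tools": ["medical devices"]}
print(expand_keywords(["AI", "tools"], patterns))
```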

3. Exa Provider Optimization

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `suggested_exa_domains` | string[] | 4-6 authoritative domains | Industry-specific authoritative sources. Healthcare: ["pubmed.gov", "nejm.org"]; Finance: ["sec.gov", "bloomberg.com"]; Tech: ["github.com", "stackoverflow.com"] |
| `suggested_exa_category` | string? | Exa content category | Based on industry. Healthcare/Science: "research paper"; Finance: "financial report"; Tech/Business: "company" or "news"; Social/Marketing: "tweet" or "linkedin profile"; Default: null (all categories) |
| `suggested_exa_search_type` | string? | Exa search algorithm | Based on content needs. Academic/research: "neural"; Current news/trends: "fast"; General research: "auto"; Code/technical: "neural" |
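
The mappings in this table are generated by the LLM, but a deterministic lookup like the one below could serve as a sanity check or fallback. This is a sketch under assumptions: the lookup keys (`"healthcare"`, `"academic"`, and so on) are hypothetical labels, and falling back to `"auto"` for an unrecognized need is a choice made for the example.

```python
from typing import Any

# Values copied from the table above; the dictionary keys are hypothetical labels.
EXA_DOMAINS_BY_INDUSTRY: dict[str, list[str]] = {
    "healthcare": ["pubmed.gov", "nejm.org"],
    "finance": ["sec.gov", "bloomberg.com"],
    "tech": ["github.com", "stackoverflow.com"],
}
EXA_SEARCH_TYPE_BY_NEED: dict[str, str] = {
    "academic": "neural",
    "news": "fast",
    "general": "auto",
    "code": "neural",
}


def suggest_exa_settings(industry: str, need: str) -> dict[str, Any]:
    """Look up example Exa settings; unknown inputs get an empty list and "auto"."""
    return {
        "suggested_exa_domains": list(
            EXA_DOMAINS_BY_INDUSTRY.get(industry.strip().lower(), [])
        ),
        # Assumption: "auto" is a reasonable fallback for unrecognized needs.
        "suggested_exa_search_type": EXA_SEARCH_TYPE_BY_NEED.get(
            need.strip().lower(), "auto"
        ),
    }
```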

4. Tavily Provider Optimization

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `suggested_tavily_topic` | string? | "general" \| "news" \| "finance" | Based on content type. Financial content: "finance"; News/current events: "news"; General research: "general" |
| `suggested_tavily_search_depth` | string? | "basic" \| "advanced" \| "fast" \| "ultra-fast" | Based on research needs. Quick overview: "basic"; In-depth analysis: "advanced"; Breaking news: "fast" |
| `suggested_tavily_include_answer` | string? | "false" \| "basic" \| "advanced" | Based on query type. Factual queries: "advanced"; Research summaries: "basic"; Custom content: "false" |
| `suggested_tavily_time_range` | string? | "day" \| "week" \| "month" \| "year" \| null | Based on recency needs. Breaking news: "day"; Recent developments: "week"; Industry analysis: "month"; Historical: null |
| `suggested_tavily_raw_content_format` | string? | "false" \| "markdown" \| "text" | Based on use case. Blog content: "markdown"; Text extraction: "text"; No raw content: "false" |
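
Because the persona stores these options as strings (including the literal `"false"`), a consumer will probably need a small translation step before building a search request. The sketch below assumes the downstream client wants a real boolean `False` and uses the parameter names `topic`, `search_depth`, `include_answer`, `time_range`, and `include_raw_content`. Both points are assumptions to check against the client actually in use.

```python
from typing import Any, Optional, Union


def _flag_or_mode(value: Optional[str]) -> Union[bool, str, None]:
    """Turn the persona's "false" string into a boolean; pass other modes through."""
    if value is None:
        return None
    return False if value == "false" else value


def tavily_params_from_persona(persona: dict[str, Any]) -> dict[str, Any]:
    """Map suggested_tavily_* fields to request parameters (names are assumed)."""
    mapping = {
        "topic": persona.get("suggested_tavily_topic"),
        "search_depth": persona.get("suggested_tavily_search_depth"),
        "include_answer": _flag_or_mode(persona.get("suggested_tavily_include_answer")),
        "time_range": persona.get("suggested_tavily_time_range"),
        "include_raw_content": _flag_or_mode(
            persona.get("suggested_tavily_raw_content_format")
        ),
    }
    # Drop unset fields so the caller's own defaults apply.
    return {key: value for key, value in mapping.items() if value is not None}
```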

5. Provider Selection Logic

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `provider_recommendations` | Dict<string, string> | Use case → provider mapping | Example: `{"trends": "tavily", "deep_research": "exa", "factual": "google", "news": "tavily", "academic": "exa"}` |
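
A lookup against `provider_recommendations` might fall back to `default_provider` when a use case has no entry. The helper `pick_provider` below is hypothetical and only illustrates that fallback.

```python
def pick_provider(
    use_case: str,
    recommendations: dict[str, str],
    default_provider: str = "exa",
) -> str:
    """Return the recommended provider for a use case, or the persona default."""
    return recommendations.get(use_case, default_provider)


recommendations = {
    "trends": "tavily",
    "deep_research": "exa",
    "factual": "google",
    "news": "tavily",
    "academic": "exa",
}
print(pick_provider("news", recommendations))
```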

6. Research Intelligence

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `research_angles` | string[] | 5-8 alternative research angles | Generated from: user's pain points; industry trends; content goals; audience interests. Examples: "Compare {topic} tools", "{topic} ROI analysis" |
| `query_enhancement_rules` | Dict<string, string> | Templates for improving vague queries | 5-8 enhancement patterns: `{"vague_ai": "Research: AI applications in {industry} for {audience}", "vague_tools": "Compare top {industry} tools"}` |
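
Both `research_angles` and `query_enhancement_rules` use `{placeholder}` templates. A minimal sketch of filling them is shown below. How a vague query is matched to a rule key such as `vague_ai` is not described here, so the example covers only the substitution step, and `apply_rule` is a hypothetical helper.

```python
class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def apply_rule(template: str, **context: str) -> str:
    """Fill the placeholders for which a value was supplied."""
    return template.format_map(_KeepMissing(context))


rule = "Research: AI applications in {industry} for {audience}"
print(apply_rule(rule, industry="Healthcare", audience="clinicians"))
```

Placeholders without a supplied value stay in the output, which makes partially filled templates easy to spot.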

7. Research Presets

| Field | Type | Description | Generation Logic |
|---|---|---|---|
| `recommended_presets` | ResearchPreset[] | 3-5 personalized preset templates | Each preset includes: `name` (descriptive name); `keywords` (research query); `industry` (user's industry); `target_audience` (user's audience); `research_mode` ("basic" \| "comprehensive" \| "targeted"); `config` (complete ResearchConfig object); `description` (brief explanation) |
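
The preset shape listed above can be pictured as a small record type. The sketch uses a standard-library dataclass as a stand-in so that it runs without dependencies. The project's own `ResearchPreset` is not shown here and may differ, and `config` is typed as a plain dict because the `ResearchConfig` schema is not given.

```python
from dataclasses import dataclass, field
from typing import Any

RESEARCH_MODES = ("basic", "comprehensive", "targeted")


@dataclass
class ResearchPreset:
    """Stand-in for the preset model; field names follow the table above."""

    name: str
    keywords: str
    industry: str
    target_audience: str
    research_mode: str
    config: dict[str, Any] = field(default_factory=dict)  # ResearchConfig stand-in
    description: str = ""

    def __post_init__(self) -> None:
        # Reject modes outside the three documented values.
        if self.research_mode not in RESEARCH_MODES:
            raise ValueError(f"unknown research_mode: {self.research_mode!r}")
```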

8. Research Preferences (Structured)

| Field | Type | Description | Source |
|---|---|---|---|
| `research_preferences` | Dict<string, any> | Structured research preferences | Extracted from onboarding: `research_depth` from research_preferences.research_depth; `content_types` from research_preferences.content_types; `auto_research` from research_preferences.auto_research; `factual_content` from research_preferences.factual_content |

9. Metadata

| Field | Type | Description |
|---|---|---|
| `generated_at` | string? | ISO timestamp of generation |
| `confidence_score` | float? | Confidence score 0-1 (higher = richer data) |
| `version` | string? | Schema version (e.g., "1.0") |
`version`	string?	Schema version (e.g., "1.0")

Data Collection Process

Step 1: Collect Onboarding Data

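# Fetch each source once so business_info can reuse the results.
website_analysis = get_website_analysis(user_id)
persona_data = get_persona_data(user_id)

onboarding_data = {
    "website_analysis": website_analysis,
    "persona_data": persona_data,
    "research_preferences": get_research_preferences(user_id),
    # Derived from the two sources above; fallback for industry and audience.
    "business_info": construct_business_info(persona_data, website_analysis),
}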

Step 2: Build AI Prompt

The prompt includes:

All onboarding data (JSON formatted)
Detailed instructions for each field
Examples and use cases
Rules for handling minimal data scenarios
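
As a rough picture of how these parts could be assembled, the sketch below concatenates an instruction header, the JSON-formatted onboarding data, and a short rules section. The wording is illustrative only and is not the prompt used by the prompt builder, and `build_persona_prompt` is a hypothetical function name.

```python
import json
from typing import Any


def build_persona_prompt(onboarding_data: dict[str, Any]) -> str:
    """Assemble an illustrative prompt from onboarding data (wording is assumed)."""
    sections = [
        "Generate a research persona as JSON from the onboarding data below.",
        "ONBOARDING DATA:",
        # default=str keeps non-JSON values such as datetimes from raising.
        json.dumps(onboarding_data, indent=2, default=str),
        "RULES:",
        '- Never use "General" for industry or audience; infer a specific value.',
        "- When data is minimal, infer from content_types and writing_style.",
    ]
    return "\n".join(sections)
```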

Step 3: LLM Generation

Uses structured JSON response format
Validates against ResearchPersona Pydantic model
Adds metadata (generated_at, confidence_score)
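
The post-processing in this step might look like the sketch below. Two things are assumptions: the required-field check is a plain-Python stand-in for the Pydantic validation, and the confidence formula (the share of onboarding sources that contain data) is invented for illustration. The document only states that a higher score reflects richer data.

```python
import json
from datetime import datetime, timezone
from typing import Any

REQUIRED_FIELDS = (
    "default_industry",
    "default_target_audience",
    "default_research_mode",
    "default_provider",
)
SOURCES = ("website_analysis", "persona_data", "research_preferences")


def finalize_persona(raw_response: str, onboarding_data: dict[str, Any]) -> dict[str, Any]:
    """Parse the LLM's JSON, check required fields, and attach metadata."""
    persona = json.loads(raw_response)
    missing = [name for name in REQUIRED_FIELDS if not persona.get(name)]
    if missing:
        raise ValueError(f"persona is missing required fields: {missing}")

    # Hypothetical scoring: fraction of onboarding sources that are populated.
    populated = sum(1 for key in SOURCES if onboarding_data.get(key))
    persona["generated_at"] = datetime.now(timezone.utc).isoformat()
    persona["confidence_score"] = round(populated / len(SOURCES), 2)
    return persona
```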

Step 4: Save to Database

Stored in PersonaData.research_persona JSON field
Cached with 7-day TTL
Timestamp stored in PersonaData.research_persona_generated_at
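
A freshness check against the 7-day TTL could be as simple as the sketch below. It assumes that `research_persona_generated_at` is loaded as a timezone-aware `datetime`. The function name `is_persona_fresh` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL = timedelta(days=7)


def is_persona_fresh(
    generated_at: Optional[datetime], now: Optional[datetime] = None
) -> bool:
    """Return True while the stored persona is younger than the TTL."""
    if generated_at is None:
        return False  # nothing has been generated yet
    now = now or datetime.now(timezone.utc)
    return now - generated_at < CACHE_TTL
```

Passing `now` explicitly keeps the check deterministic in tests.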

Handling Minimal Data Scenarios

When onboarding data is incomplete, the prompt instructs the model to infer missing values from whatever context is available:

Industry Inference:
- From content_types: "blog" → "Content Marketing", "video" → "Video Content Creation"
- From website_analysis.content_characteristics: vocabulary and structure patterns that suggest an industry
- Default: "Technology" or "Business Consulting"
Target Audience Inference:
- From writing_style: Complexity level suggests audience
- From content_goals: Purpose suggests audience
- Default: "Professionals and content consumers"
Provider Defaults:
- Always defaults to "exa" for content creators
- Uses "tavily" only for news/current events focus
Never Uses "General":
- The prompt explicitly instructs to never use "General"
- Always infers specific categories based on available context
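
These rules are enforced through the prompt, but a deterministic guard along the same lines can be useful when checking generated output. The sketch below is such a guard, not the service's logic: the function names are hypothetical, and choosing "Technology" over "Business Consulting" as the final fallback is an arbitrary pick between the two documented defaults.

```python
from typing import Optional

# Mappings taken from the inference rules above.
INDUSTRY_BY_CONTENT_TYPE = {
    "blog": "Content Marketing",
    "video": "Video Content Creation",
}
FALLBACK_INDUSTRY = "Technology"  # assumption: one of the two documented defaults


def infer_industry(content_types: list[str]) -> str:
    """Infer an industry from the first recognized content type."""
    for content_type in content_types:
        industry = INDUSTRY_BY_CONTENT_TYPE.get(content_type.lower())
        if industry:
            return industry
    return FALLBACK_INDUSTRY


def ensure_specific_industry(value: Optional[str], content_types: list[str]) -> str:
    """Replace an empty or "General" industry with an inferred one."""
    if not value or value.strip().lower() == "general":
        return infer_industry(content_types)
    return value
```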

Frontend Display

Currently Displayed Fields:

✅ Default Settings (industry, audience, mode, provider)
✅ Suggested Keywords
✅ Research Angles
✅ Recommended Presets
✅ Metadata (generated_at, confidence_score, version)

Recently Added Fields (Enhanced Display):

✅ Keyword Expansion Patterns
✅ Exa Provider Settings (domains, category, search_type)
✅ Tavily Provider Settings (topic, depth, answer, time_range, format)
✅ Provider Recommendations
✅ Query Enhancement Rules
✅ Research Preferences (structured)

Future Enhancements

Competitor Analysis Integration: Use competitor data to inform industry context and domain suggestions
Research History: Learn from past research queries to improve suggestions
A/B Testing: Test different persona generation strategies
User Feedback Loop: Allow users to rate and improve persona suggestions
Multi-Industry Support: Handle users with multiple industries/niches

API Endpoints

GET /api/research/persona-defaults: Get persona defaults (cached only)
GET /api/research/research-persona: Get or generate research persona
POST /api/research/research-persona?force_refresh=true: Force regenerate persona
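
A client could build requests for the last two endpoints as sketched below. The base URL is a placeholder, and any authentication headers the API may require are omitted because they are not described here. The requests are only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def persona_request(base_url: str, force_refresh: bool = False) -> Request:
    """Build a GET (get or generate) or POST (force regenerate) request."""
    url = f"{base_url.rstrip('/')}/api/research/research-persona"
    if force_refresh:
        query = urlencode({"force_refresh": "true"})
        return Request(f"{url}?{query}", method="POST")
    return Request(url, method="GET")


# Placeholder host; replace with the real backend address.
request = persona_request("http://localhost:8000", force_refresh=True)
print(request.get_method(), request.full_url)
```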

Backend: backend/services/research/research_persona_service.py
Prompt Builder: backend/services/research/research_persona_prompt_builder.py
Models: backend/models/research_persona_models.py
API: backend/api/research_config.py
Frontend: frontend/src/pages/ResearchTest.tsx (Persona Details Modal)
