# Step 2 (Website Analysis) - Complete Data Flow Analysis

## Overview

Step 2 performs comprehensive website analysis including crawling, style detection, pattern analysis, and guideline generation. This document maps the complete data flow from frontend to database.

## API Endpoints Called

### 1. `/api/onboarding/style-detection/complete` (PRIMARY)

**Purpose**: Main analysis endpoint that performs the complete workflow

**Request** (`POST`):
```typescript
{
  url: string,
  include_patterns: true,
  include_guidelines: true
}
```

**Response**:
```typescript
{
  success: boolean,
  crawl_result: {
    content: string,
    success: boolean,
    timestamp: string
  },
  style_analysis: {
    writing_style: {...},
    content_characteristics: {...},
    target_audience: {...},
    content_type: {...},
    recommended_settings: {...},
    brand_analysis: {...},            // ← Rich brand insights
    content_strategy_insights: {...}  // ← SWOT analysis
  },
  style_patterns: {
    style_consistency: {...},
    unique_elements: {...}
  },
  style_guidelines: {
    guidelines: [...],
    best_practices: [...],
    avoid_elements: [...],
    content_strategy: [...],
    ai_generation_tips: [...],
    competitive_advantages: [...],
    content_calendar_suggestions: [...]
  },
  analysis_id: number,
  warning?: string
}
```

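As a rough illustration of the request shape above, the following Python sketch builds (but does not send) the `POST` request with the standard library only. The base URL `http://localhost:8000` and the helper name `build_complete_request` are assumptions made for this example; authentication headers, which the real endpoint may require, are omitted.

```python
import json
import urllib.request

# Assumed base URL of a locally running backend (hypothetical).
BASE_URL = "http://localhost:8000"


def build_complete_request(url: str) -> urllib.request.Request:
    """Build the POST request for the primary analysis endpoint."""
    payload = {
        "url": url,
        "include_patterns": True,
        "include_guidelines": True,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/onboarding/style-detection/complete",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_complete_request("https://example.com")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without a live backend.
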
### 2. `/api/onboarding/style-detection/check-existing/{url}` (OPTIONAL)

**Purpose**: Check if analysis already exists for this URL

**Response**:
```typescript
{
  exists: boolean,
  analysis_id?: number,
  analysis?: {...}  // Full analysis data if exists
}
```

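Because `{url}` is itself a URL placed inside a path segment, the caller presumably has to percent-encode it. The sketch below shows one way to do that; whether the backend expects exactly this encoding is an assumption, and `check_existing_path` is a hypothetical helper name.

```python
from urllib.parse import quote


def check_existing_path(website_url: str) -> str:
    """Return the request path with the website URL as one encoded segment."""
    # safe="" also encodes ":" and "/" so the URL cannot split the path.
    encoded = quote(website_url, safe="")
    return f"/api/onboarding/style-detection/check-existing/{encoded}"


path = check_existing_path("https://example.com/blog")
print(path)
```
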
### 3. `/api/onboarding/style-detection/analysis/{id}` (OPTIONAL)

**Purpose**: Load existing analysis by ID

### 4. `/api/onboarding/style-detection/session-analyses` (OPTIONAL)

**Purpose**: Get last analysis from session for pre-filling

## Complete Data Structure Collected

### 1. **Writing Style** (`writing_style`)
```json
{
  "tone": "Professional, Informative",
  "voice": "Active, Direct",
  "complexity": "Moderate",
  "engagement_level": "High",
  "brand_personality": "Trustworthy, Expert",
  "formality_level": "Semi-formal",
  "emotional_appeal": "Rational with emotional hooks"
}
```

### 2. **Content Characteristics** (`content_characteristics`)
```json
{
  "sentence_structure": "Mix of short and medium sentences",
  "vocabulary_level": "Professional/Business",
  "paragraph_organization": "Clear topic sentences",
  "content_flow": "Logical progression",
  "readability_score": "8th-10th grade",
  "content_density": "Information-rich",
  "visual_elements_usage": "Moderate"
}
```

### 3. **Target Audience** (`target_audience`)
```json
{
  "demographics": ["B2B", "Enterprise clients", "IT professionals"],
  "expertise_level": "Intermediate to Advanced",
  "industry_focus": "Technology/SaaS",
  "geographic_focus": "Global, US-focused",
  "psychographic_profile": "Innovation-driven, ROI-focused",
  "pain_points": ["Efficiency", "Scalability"],
  "motivations": ["Business growth", "Competitive advantage"]
}
```

### 4. **Content Type** (`content_type`)
```json
{
  "primary_type": "Educational/Thought Leadership",
  "secondary_types": ["Case Studies", "Product Descriptions"],
  "purpose": "Inform and convert",
  "call_to_action": "Demo request, Free trial",
  "conversion_focus": "Lead generation",
  "educational_value": "High"
}
```

### 5. **Brand Analysis** (`brand_analysis`) ⭐ **IMPORTANT**
```json
{
  "brand_voice": "Authoritative yet approachable",
  "brand_values": ["Innovation", "Reliability", "Customer success"],
  "brand_positioning": "Premium solution provider",
  "competitive_differentiation": "AI-powered automation",
  "trust_signals": ["Case studies", "Testimonials", "Security badges"],
  "authority_indicators": ["Industry certifications", "Expert team"]
}
```

### 6. **Content Strategy Insights** (`content_strategy_insights`) ⭐ **IMPORTANT**
```json
{
  "strengths": [
    "Clear value proposition",
    "Strong technical authority",
    "Engaging storytelling"
  ],
  "weaknesses": [
    "Limited social proof",
    "Technical jargon overuse"
  ],
  "opportunities": [
    "Video content",
    "Interactive demos",
    "Industry thought leadership"
  ],
  "threats": [
    "Competitor content marketing",
    "Market saturation"
  ],
  "recommended_improvements": [
    "Add more case studies",
    "Simplify technical explanations",
    "Increase content frequency"
  ],
  "content_gaps": [
    "Beginner-level tutorials",
    "Comparison guides",
    "Industry trend analysis"
  ]
}
```

### 7. **Recommended Settings** (`recommended_settings`)
```json
{
  "writing_tone": "Professional yet conversational",
  "target_audience": "B2B decision makers",
  "content_type": "Educational with conversion focus",
  "creativity_level": "Balanced",
  "geographic_location": "US/Global",
  "industry_context": "B2B SaaS"
}
```

### 8. **Crawl Result** (`crawl_result`)
```json
{
  "content": "Full crawled text content...",
  "success": true,
  "timestamp": "2025-10-11T12:00:00Z"
}
```

### 9. **Style Patterns** (`style_patterns`)
```json
{
  "style_consistency": {
    "consistency_score": 0.85,
    "common_patterns": ["Data-driven claims", "Action-oriented CTAs"],
    "variations": ["Blog vs landing page tone"]
  },
  "unique_elements": [
    "Custom terminology",
    "Brand-specific phrases",
    "Signature formatting"
  ]
}
```

### 10. **Style Guidelines** (`style_guidelines`)
```json
{
  "guidelines": [
    "Use active voice",
    "Start with benefit statements",
    "Support claims with data"
  ],
  "best_practices": [
    "Lead with customer pain points",
    "Include social proof",
    "Clear CTAs"
  ],
  "avoid_elements": [
    "Passive voice",
    "Overly technical jargon",
    "Generic claims"
  ],
  "content_strategy": [
    "Focus on thought leadership",
    "Build trust through expertise",
    "Address buyer journey stages"
  ],
  "ai_generation_tips": [
    "Emphasize ROI and metrics",
    "Use industry-specific examples",
    "Balance technical depth with clarity"
  ],
  "competitive_advantages": [
    "Unique positioning statement",
    "Differentiating features",
    "Customer success stories"
  ],
  "content_calendar_suggestions": [
    "Weekly blog posts",
    "Monthly case studies",
    "Quarterly industry reports"
  ]
}
```

## Current Database Storage (OnboardingDatabaseService)

### What's Saved to `onboarding_sessions.website_analyses` Table:

**File**: `backend/services/onboarding_database_service.py` (Line 173)

```python
WebsiteAnalysis(
    session_id=session.id,
    website_url=analysis_data.get('website_url'),
    writing_style=analysis_data.get('writing_style'),                      # ✅
    content_characteristics=analysis_data.get('content_characteristics'),  # ✅
    target_audience=analysis_data.get('target_audience'),                  # ✅
    content_type=analysis_data.get('content_type'),                        # ✅
    recommended_settings=analysis_data.get('recommended_settings'),        # ✅
    crawl_result=analysis_data.get('crawl_result'),                        # ✅
    style_patterns=analysis_data.get('style_patterns'),                    # ✅
    style_guidelines=analysis_data.get('style_guidelines'),                # ✅
    status='completed'
)
```

### ❌ What's MISSING from Database Storage:

1. **brand_analysis** - NOT saved by `OnboardingDatabaseService`
2. **content_strategy_insights** - NOT saved by `OnboardingDatabaseService`

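The gap can be checked mechanically by comparing the keys of an analysis payload against the fields that the save call actually passes through. The sketch below is illustrative only: `SAVED_FIELDS` is transcribed from the `WebsiteAnalysis(...)` call in `onboarding_database_service.py` quoted in this document, and `dropped_fields` is a hypothetical helper, not project code.

```python
# Analysis fields passed to WebsiteAnalysis(...) by OnboardingDatabaseService
# (transcribed from the constructor call quoted in this document).
SAVED_FIELDS = {
    "writing_style",
    "content_characteristics",
    "target_audience",
    "content_type",
    "recommended_settings",
    "crawl_result",
    "style_patterns",
    "style_guidelines",
}

# Bookkeeping keys that are not analysis content.
IGNORED_KEYS = {"website_url", "status"}


def dropped_fields(analysis_data: dict) -> list[str]:
    """Return the payload keys that the save call would silently discard."""
    return sorted(
        key
        for key in analysis_data
        if key not in SAVED_FIELDS and key not in IGNORED_KEYS
    )


payload = {
    "website_url": "https://example.com",
    "writing_style": {"tone": "Professional"},
    "brand_analysis": {"brand_voice": "Authoritative yet approachable"},
    "content_strategy_insights": {"strengths": ["Clear value proposition"]},
}
print(dropped_fields(payload))
```
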
### ✅ What's Saved to `website_analyses` Table (via WebsiteAnalysisService):

**File**: `backend/services/website_analysis_service.py` (Lines 44-87)

This service saves to a DIFFERENT table: `website_analyses`, not `onboarding_sessions.website_analyses`.

```python
# Saves to: website_analyses table
WebsiteAnalysis(
    session_id=session_id,  # Integer session ID
    website_url=website_url,
    writing_style=style_analysis.get('writing_style'),
    content_characteristics=style_analysis.get('content_characteristics'),
    target_audience=style_analysis.get('target_audience'),
    content_type=style_analysis.get('content_type'),
    recommended_settings=style_analysis.get('recommended_settings'),
    brand_analysis=style_analysis.get('brand_analysis'),  # ✅ SAVED HERE!
    content_strategy_insights=style_analysis.get('content_strategy_insights'),  # ✅ SAVED HERE!
    crawl_result=analysis_data.get('crawl_result'),
    style_patterns=analysis_data.get('style_patterns'),
    style_guidelines=analysis_data.get('style_guidelines'),
    status='completed'
)
```

## The Problem: Dual Database Persistence

Step 2 currently triggers **TWO separate database save operations**:

### 1. `/style-detection/complete` endpoint (component_logic.py)
- Saves to `website_analyses` table via `WebsiteAnalysisService`
- Uses **Integer session_id** (converted from Clerk ID via SHA256)
- Saves **ALL fields** including `brand_analysis` and `content_strategy_insights`

### 2. `OnboardingProgress.save_progress()` (api_key_manager.py)
- Saves to `onboarding_sessions.website_analyses` table via `OnboardingDatabaseService`
- Uses **String user_id** (Clerk ID)
- **MISSING** `brand_analysis` and `content_strategy_insights`

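The two paths key their rows differently, which makes it hard to reconcile them later. To make the "converted from Clerk ID via SHA256" step concrete, here is a minimal sketch of one way such a conversion could work. The truncation scheme (first 8 hex digits, masked to 31 bits) and the function name are assumptions for illustration; the actual logic in `component_logic.py` is not shown in this document and may differ.

```python
import hashlib


def clerk_id_to_session_id(clerk_user_id: str) -> int:
    """Derive a stable, non-negative integer from a string user ID."""
    digest = hashlib.sha256(clerk_user_id.encode("utf-8")).hexdigest()
    # Keep 32 bits of the hash, then clear the top bit so the value
    # fits a signed 32-bit INTEGER column (assumed column type).
    return int(digest[:8], 16) & 0x7FFFFFFF


session_id = clerk_id_to_session_id("user_example123")
print(session_id)
```

The mapping is deterministic but one-way, so a row written with the integer key cannot be traced back to its Clerk ID without recomputing the hash from a known ID.
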
## Current Frontend Data Structure

**File**: `frontend/src/components/OnboardingWizard/WebsiteStep.tsx` (Line 386)

```typescript
const stepData = {
  website: fixedUrl,                   // ← Should be "website_url"
  domainName: domainName,
  analysis: {                          // ← Nested structure
    writing_style: {...},
    content_characteristics: {...},
    target_audience: {...},
    content_type: {...},
    brand_analysis: {...},             // ✅ Present
    content_strategy_insights: {...},  // ✅ Present
    recommended_settings: {...},
    // ... ALL the fields from API response
    guidelines: [...],
    best_practices: [...],
    avoid_elements: [...],
    style_patterns: {...},
    // etc.
  },
  useAnalysisForGenAI: true
};
```

## Solution Required

### 1. Fix Data Transformation (COMPLETED ✅)

**File**: `backend/services/api_key_manager.py` (Line 278)

Already fixed to flatten the structure:

```python
elif step.step_number == 2:  # Website Analysis
    # Transform frontend data structure to match database schema
    analysis_for_db = {
        'website_url': step.data.get('website', ''),
        'status': 'completed'
    }
    # Merge analysis fields if they exist
    if 'analysis' in step.data and step.data['analysis']:
        analysis_for_db.update(step.data['analysis'])

    self.db_service.save_website_analysis(self.user_id, analysis_for_db, db)
```

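To see what the flattening produces, the same logic can be exercised in isolation. The following is a standalone sketch, not the project code: `flatten_step_data` is a hypothetical function that mirrors the branch above, and the sample `step_data` is made up to resemble the frontend structure.

```python
def flatten_step_data(step_data: dict) -> dict:
    """Mirror the Step 2 branch: lift nested analysis fields to the top level."""
    analysis_for_db = {
        "website_url": step_data.get("website", ""),
        "status": "completed",
    }
    analysis = step_data.get("analysis")
    if analysis:
        # Keys inside "analysis" win over the defaults set above.
        analysis_for_db.update(analysis)
    return analysis_for_db


step_data = {
    "website": "https://example.com",
    "domainName": "example.com",
    "analysis": {
        "writing_style": {"tone": "Professional"},
        "brand_analysis": {"brand_voice": "Authoritative yet approachable"},
    },
    "useAnalysisForGenAI": True,
}
flat = flatten_step_data(step_data)
print(sorted(flat))
```

After flattening, `brand_analysis` sits at the top level where `analysis_data.get('brand_analysis')` can find it, while frontend-only keys such as `domainName` are left behind. Note that a `website_url` or `status` key inside `analysis` would overwrite the defaults, because `update` runs last.
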
### 2. Update OnboardingDatabaseService to Save ALL Fields

**File**: `backend/services/onboarding_database_service.py`

**NEEDED**: Add `brand_analysis` and `content_strategy_insights` to the save operation.

First confirm that the `WebsiteAnalysis` model has these columns, then add the two marked lines to the existing save call:

```python
# Line 206-213 (existing code)
website_url=analysis_data.get('website_url', ''),
writing_style=analysis_data.get('writing_style'),
content_characteristics=analysis_data.get('content_characteristics'),
target_audience=analysis_data.get('target_audience'),
content_type=analysis_data.get('content_type'),
recommended_settings=analysis_data.get('recommended_settings'),
brand_analysis=analysis_data.get('brand_analysis'),  # ← ADD THIS
content_strategy_insights=analysis_data.get('content_strategy_insights'),  # ← ADD THIS
crawl_result=analysis_data.get('crawl_result'),
style_patterns=analysis_data.get('style_patterns'),
style_guidelines=analysis_data.get('style_guidelines'),
```

### 3. Verify Database Model Supports These Fields

**File**: `backend/models/onboarding.py`

Check `WebsiteAnalysis` model for:
- `brand_analysis` column (JSON)
- `content_strategy_insights` column (JSON)

If either column is missing, add a migration.

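The check-then-migrate step could look roughly like the sketch below. It assumes a SQLite database and uses an in-memory table as a stand-in; the table layout, `missing_columns`, and `add_missing_columns` are all hypothetical. A project using another database or a migration tool such as Alembic would express the same idea differently.

```python
import sqlite3

REQUIRED_JSON_COLUMNS = ("brand_analysis", "content_strategy_insights")


def missing_columns(conn: sqlite3.Connection, table: str) -> list[str]:
    """List required columns that the table does not have yet."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    existing = {row[1] for row in rows}  # row[1] is the column name
    return [col for col in REQUIRED_JSON_COLUMNS if col not in existing]


def add_missing_columns(conn: sqlite3.Connection, table: str) -> list[str]:
    """Add each missing column and return the names that were added."""
    added = missing_columns(conn, table)
    for col in added:
        # The table name comes from trusted code, never from user input.
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} JSON")
    conn.commit()
    return added


# Stand-in table that already has one of the two columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE website_analyses "
    "(id INTEGER PRIMARY KEY, website_url TEXT, brand_analysis JSON)"
)
added = add_missing_columns(conn, "website_analyses")
print(added)  # ['content_strategy_insights']
print(missing_columns(conn, "website_analyses"))  # []
```

Running the helper a second time is a no-op, which makes it safe to call at startup while a proper migration is being prepared.
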
## Recommendation

1. ✅ **Data transformation fix is complete** (api_key_manager.py updated)
2. ⏳ **Check WebsiteAnalysis model** for brand_analysis and content_strategy_insights columns
3. ⏳ **Update OnboardingDatabaseService.save_website_analysis()** to include these fields
4. ⏳ **Restart backend** to apply changes
5. ⏳ **Re-run Step 2** to save complete data
6. ⏳ **Verify Step 6** displays all fields

## Benefits of Complete Data Storage

With `brand_analysis` and `content_strategy_insights` saved:

1. **Better Content Generation**: AI can align with brand values
2. **Strategic Insights**: SWOT analysis informs content strategy
3. **Competitive Intelligence**: Differentiation factors for positioning
4. **Content Planning**: Recommendations and calendar suggestions
5. **Quality Assurance**: Consistency checking against brand guidelines

## Status

- ✅ API endpoint returns complete data
- ✅ Frontend receives and displays complete data
- ✅ Data transformation fix applied (flattening structure)
- ⏳ Database model verification needed
- ⏳ OnboardingDatabaseService update needed
- ⏳ Testing required

---

**Next Action**: Check `WebsiteAnalysis` model and update `OnboardingDatabaseService` to save ALL fields.