Files
ALwrity/docs/STEP_2_COMPLETE_DATA_FLOW_ANALYSIS.md

436 lines
13 KiB
Markdown

# Step 2 (Website Analysis) - Complete Data Flow Analysis
## Overview
Step 2 performs comprehensive website analysis including crawling, style detection, pattern analysis, and guideline generation. This document maps the complete data flow from frontend to database.
## API Endpoints Called
### 1. `/api/onboarding/style-detection/complete` (PRIMARY)
**Purpose**: Main analysis endpoint that performs the complete workflow
**Request** (`POST`):
```typescript
{
url: string,
include_patterns: true,
include_guidelines: true
}
```
**Response**:
```typescript
{
success: boolean,
crawl_result: {
content: string,
success: boolean,
timestamp: string
},
style_analysis: {
writing_style: {...},
content_characteristics: {...},
target_audience: {...},
content_type: {...},
recommended_settings: {...},
brand_analysis: {...}, // ← Rich brand insights
content_strategy_insights: {...} // ← SWOT analysis
},
style_patterns: {
style_consistency: {...},
unique_elements: {...}
},
style_guidelines: {
guidelines: [...],
best_practices: [...],
avoid_elements: [...],
content_strategy: [...],
ai_generation_tips: [...],
competitive_advantages: [...],
content_calendar_suggestions: [...]
},
analysis_id: number,
warning?: string
}
```
### 2. `/api/onboarding/style-detection/check-existing/{url}` (OPTIONAL)
**Purpose**: Check if analysis already exists for this URL
**Response**:
```typescript
{
exists: boolean,
analysis_id?: number,
analysis?: {...} // Full analysis data if exists
}
```
### 3. `/api/onboarding/style-detection/analysis/{id}` (OPTIONAL)
**Purpose**: Load existing analysis by ID
### 4. `/api/onboarding/style-detection/session-analyses` (OPTIONAL)
**Purpose**: Get last analysis from session for pre-filling
## Complete Data Structure Collected
### 1. **Writing Style** (`writing_style`)
```json
{
"tone": "Professional, Informative",
"voice": "Active, Direct",
"complexity": "Moderate",
"engagement_level": "High",
"brand_personality": "Trustworthy, Expert",
"formality_level": "Semi-formal",
"emotional_appeal": "Rational with emotional hooks"
}
```
### 2. **Content Characteristics** (`content_characteristics`)
```json
{
"sentence_structure": "Mix of short and medium sentences",
"vocabulary_level": "Professional/Business",
"paragraph_organization": "Clear topic sentences",
"content_flow": "Logical progression",
"readability_score": "8th-10th grade",
"content_density": "Information-rich",
"visual_elements_usage": "Moderate"
}
```
### 3. **Target Audience** (`target_audience`)
```json
{
"demographics": ["B2B", "Enterprise clients", "IT professionals"],
"expertise_level": "Intermediate to Advanced",
"industry_focus": "Technology/SaaS",
"geographic_focus": "Global, US-focused",
"psychographic_profile": "Innovation-driven, ROI-focused",
"pain_points": ["Efficiency", "Scalability"],
"motivations": ["Business growth", "Competitive advantage"]
}
```
### 4. **Content Type** (`content_type`)
```json
{
"primary_type": "Educational/Thought Leadership",
"secondary_types": ["Case Studies", "Product Descriptions"],
"purpose": "Inform and convert",
"call_to_action": "Demo request, Free trial",
"conversion_focus": "Lead generation",
"educational_value": "High"
}
```
### 5. **Brand Analysis** (`brand_analysis`) ⭐ **IMPORTANT**
```json
{
"brand_voice": "Authoritative yet approachable",
"brand_values": ["Innovation", "Reliability", "Customer success"],
"brand_positioning": "Premium solution provider",
"competitive_differentiation": "AI-powered automation",
"trust_signals": ["Case studies", "Testimonials", "Security badges"],
"authority_indicators": ["Industry certifications", "Expert team"]
}
```
### 6. **Content Strategy Insights** (`content_strategy_insights`) ⭐ **IMPORTANT**
```json
{
"strengths": [
"Clear value proposition",
"Strong technical authority",
"Engaging storytelling"
],
"weaknesses": [
"Limited social proof",
"Technical jargon overuse"
],
"opportunities": [
"Video content",
"Interactive demos",
"Industry thought leadership"
],
"threats": [
"Competitor content marketing",
"Market saturation"
],
"recommended_improvements": [
"Add more case studies",
"Simplify technical explanations",
"Increase content frequency"
],
"content_gaps": [
"Beginner-level tutorials",
"Comparison guides",
"Industry trend analysis"
]
}
```
### 7. **Recommended Settings** (`recommended_settings`)
```json
{
"writing_tone": "Professional yet conversational",
"target_audience": "B2B decision makers",
"content_type": "Educational with conversion focus",
"creativity_level": "Balanced",
"geographic_location": "US/Global",
"industry_context": "B2B SaaS"
}
```
### 8. **Crawl Result** (`crawl_result`)
```json
{
"content": "Full crawled text content...",
"success": true,
"timestamp": "2025-10-11T12:00:00Z"
}
```
### 9. **Style Patterns** (`style_patterns`)
```json
{
"style_consistency": {
"consistency_score": 0.85,
"common_patterns": ["Data-driven claims", "Action-oriented CTAs"],
"variations": ["Blog vs landing page tone"]
},
"unique_elements": [
"Custom terminology",
"Brand-specific phrases",
"Signature formatting"
]
}
```
### 10. **Style Guidelines** (`style_guidelines`)
```json
{
"guidelines": [
"Use active voice",
"Start with benefit statements",
"Support claims with data"
],
"best_practices": [
"Lead with customer pain points",
"Include social proof",
"Clear CTAs"
],
"avoid_elements": [
"Passive voice",
"Overly technical jargon",
"Generic claims"
],
"content_strategy": [
"Focus on thought leadership",
"Build trust through expertise",
"Address buyer journey stages"
],
"ai_generation_tips": [
"Emphasize ROI and metrics",
"Use industry-specific examples",
"Balance technical depth with clarity"
],
"competitive_advantages": [
"Unique positioning statement",
"Differentiating features",
"Customer success stories"
],
"content_calendar_suggestions": [
"Weekly blog posts",
"Monthly case studies",
"Quarterly industry reports"
]
}
```
## Current Database Storage (OnboardingDatabaseService)
### What's Saved to `onboarding_sessions.website_analyses` Table:
**File**: `backend/services/onboarding_database_service.py` (Line 173)
```python
WebsiteAnalysis(
session_id=session.id,
website_url=analysis_data.get('website_url'),
writing_style=analysis_data.get('writing_style'), # ✅
content_characteristics=analysis_data.get('content_characteristics'), # ✅
target_audience=analysis_data.get('target_audience'), # ✅
content_type=analysis_data.get('content_type'), # ✅
recommended_settings=analysis_data.get('recommended_settings'),# ✅
crawl_result=analysis_data.get('crawl_result'), # ✅
style_patterns=analysis_data.get('style_patterns'), # ✅
style_guidelines=analysis_data.get('style_guidelines'), # ✅
status='completed'
)
```
### ❌ What's MISSING from Database Storage:
1. **brand_analysis** - NOT saved to `onboarding_database_service`
2. **content_strategy_insights** - NOT saved to `onboarding_database_service`
### ✅ What's Saved to `website_analyses` Table (via WebsiteAnalysisService):
**File**: `backend/services/website_analysis_service.py` (Lines 44-87)
This service saves to a DIFFERENT table (`website_analyses` not `onboarding_sessions.website_analyses`).
```python
# Saves to: website_analyses table
WebsiteAnalysis(
session_id=session_id, # Integer session ID
website_url=website_url,
writing_style=style_analysis.get('writing_style'),
content_characteristics=style_analysis.get('content_characteristics'),
target_audience=style_analysis.get('target_audience'),
content_type=style_analysis.get('content_type'),
recommended_settings=style_analysis.get('recommended_settings'),
brand_analysis=style_analysis.get('brand_analysis'), # ✅ SAVED HERE!
content_strategy_insights=style_analysis.get('content_strategy_insights'), # ✅ SAVED HERE!
crawl_result=analysis_data.get('crawl_result'),
style_patterns=analysis_data.get('style_patterns'),
style_guidelines=analysis_data.get('style_guidelines'),
status='completed'
)
```
## The Problem: Dual Database Persistence
We have **TWO separate database save operations** happening:
### 1. `/style-detection/complete` endpoint (component_logic.py)
- Saves to `website_analyses` table via `WebsiteAnalysisService`
- Uses **Integer session_id** (converted from Clerk ID via SHA256)
- Saves **ALL fields** including `brand_analysis` and `content_strategy_insights`
### 2. `OnboardingProgress.save_progress()` (api_key_manager.py)
- Saves to `onboarding_sessions.website_analyses` table via `OnboardingDatabaseService`
- Uses **String user_id** (Clerk ID)
- **MISSING** `brand_analysis` and `content_strategy_insights`
## Current Frontend Data Structure
**File**: `frontend/src/components/OnboardingWizard/WebsiteStep.tsx` (Line 386)
```typescript
const stepData = {
website: fixedUrl, // ← Should be "website_url"
domainName: domainName,
analysis: { // ← Nested structure
writing_style: {...},
content_characteristics: {...},
target_audience: {...},
content_type: {...},
brand_analysis: {...}, // ✅ Present
content_strategy_insights: {...}, // ✅ Present
recommended_settings: {...},
// ... ALL the fields from API response
guidelines: [...],
best_practices: [...],
avoid_elements: [...],
style_patterns: {...},
// etc.
},
useAnalysisForGenAI: true
};
```
## Solution Required
### 1. Fix Data Transformation (COMPLETED ✅)
**File**: `backend/services/api_key_manager.py` (Line 278)
Already fixed to flatten the structure:
```python
elif step.step_number == 2: # Website Analysis
# Transform frontend data structure to match database schema
analysis_for_db = {
'website_url': step.data.get('website', ''),
'status': 'completed'
}
# Merge analysis fields if they exist
if 'analysis' in step.data and step.data['analysis']:
analysis_for_db.update(step.data['analysis'])
self.db_service.save_website_analysis(self.user_id, analysis_for_db, db)
```
### 2. Update OnboardingDatabaseService to Save ALL Fields
**File**: `backend/services/onboarding_database_service.py`
**NEEDED**: Add `brand_analysis` and `content_strategy_insights` to the save operation.
Check if `WebsiteAnalysis` model has these columns:
```python
# Line 206-213 (existing code)
website_url=analysis_data.get('website_url', ''),
writing_style=analysis_data.get('writing_style'),
content_characteristics=analysis_data.get('content_characteristics'),
target_audience=analysis_data.get('target_audience'),
content_type=analysis_data.get('content_type'),
recommended_settings=analysis_data.get('recommended_settings'),
brand_analysis=analysis_data.get('brand_analysis'), # ← ADD THIS
content_strategy_insights=analysis_data.get('content_strategy_insights'), # ← ADD THIS
crawl_result=analysis_data.get('crawl_result'),
style_patterns=analysis_data.get('style_patterns'),
style_guidelines=analysis_data.get('style_guidelines'),
```
### 3. Verify Database Model Supports These Fields
**File**: `backend/models/onboarding.py`
Check `WebsiteAnalysis` model for:
- `brand_analysis` column (JSON)
- `content_strategy_insights` column (JSON)
If missing, add migration.
## Recommendation
1.**Data transformation fix is complete** (api_key_manager.py updated)
2.**Check WebsiteAnalysis model** for brand_analysis and content_strategy_insights columns
3.**Update OnboardingDatabaseService.save_website_analysis()** to include these fields
4.**Restart backend** to apply changes
5.**Re-run Step 2** to save complete data
6.**Verify Step 6** displays all fields
## Benefits of Complete Data Storage
With `brand_analysis` and `content_strategy_insights` saved:
1. **Better Content Generation**: AI can align with brand values
2. **Strategic Insights**: SWOT analysis informs content strategy
3. **Competitive Intelligence**: Differentiation factors for positioning
4. **Content Planning**: Recommendations and calendar suggestions
5. **Quality Assurance**: Consistency checking against brand guidelines
## Status
- ✅ API endpoint returns complete data
- ✅ Frontend receives and displays complete data
- ✅ Data transformation fix applied (flattening structure)
- ⏳ Database model verification needed
- ⏳ OnboardingDatabaseService update needed
- ⏳ Testing required
---
**Next Action**: Check `WebsiteAnalysis` model and update `OnboardingDatabaseService` to save ALL fields.