Base code

This commit is contained in:
Kunthawat Greethong
2026-01-08 22:39:53 +07:00
parent 697115c61a
commit c35fa52117
2169 changed files with 626670 additions and 0 deletions

View File

@@ -0,0 +1,151 @@
# AI Blog Writer Service Architecture
This directory contains the refactored AI Blog Writer service with a clean, modular architecture.
## 📁 Directory Structure
```
blog_writer/
├── README.md # This file
├── blog_service.py # Main entry point (imports from core)
├── core/ # Core service orchestrator
│ ├── __init__.py
│ └── blog_writer_service.py # Main service coordinator
├── research/ # Research functionality
│ ├── __init__.py
│ ├── research_service.py # Main research orchestrator
│ ├── keyword_analyzer.py # AI-powered keyword analysis
│ ├── competitor_analyzer.py # Competitor intelligence
│ └── content_angle_generator.py # Content angle discovery
├── outline/ # Outline generation
│ ├── __init__.py
│ ├── outline_service.py # Main outline orchestrator
│ ├── outline_generator.py # AI-powered outline generation
│ ├── outline_optimizer.py # Outline optimization
│ └── section_enhancer.py # Section enhancement
├── content/ # Content generation (TODO)
└── optimization/ # SEO & optimization (TODO)
```
## 🏗️ Architecture Overview
### Core Module (`core/`)
- **`BlogWriterService`**: Main orchestrator that coordinates all blog writing functionality
- Provides a unified interface for research, outline generation, and content creation
- Delegates to specialized modules for specific functionality
### Research Module (`research/`)
- **`ResearchService`**: Orchestrates comprehensive research using Google Search grounding
- **`KeywordAnalyzer`**: AI-powered keyword analysis and extraction
- **`CompetitorAnalyzer`**: Competitor intelligence and market analysis
- **`ContentAngleGenerator`**: Strategic content angle discovery
### Outline Module (`outline/`)
- **`OutlineService`**: Manages outline generation, refinement, and optimization
- **`OutlineGenerator`**: AI-powered outline generation from research data
- **`OutlineOptimizer`**: Optimizes outlines for flow, SEO, and engagement
- **`SectionEnhancer`**: Enhances individual sections using AI
## 🔄 Service Flow
1. **Research Phase**: `ResearchService` → `KeywordAnalyzer` + `CompetitorAnalyzer` + `ContentAngleGenerator`
2. **Outline Phase**: `OutlineService` → `OutlineGenerator` → `OutlineOptimizer`
3. **Content Phase**: (TODO) Content generation and optimization
4. **Publishing Phase**: (TODO) Platform integration and publishing
## 🚀 Usage
```python
from services.blog_writer.blog_service import BlogWriterService

# Initialize the service
service = BlogWriterService()

# Research a topic (user_id is required for subscription checks and usage tracking)
research_result = await service.research(research_request, user_id)

# Generate an outline from the research
outline_result = await service.generate_outline(outline_request, user_id)

# Enhance a section
enhanced_section = await service.enhance_section_with_ai(section, "SEO optimization")
```
## 🎯 Key Benefits
### 1. **Modularity**
- Each module has a single responsibility
- Easy to test, maintain, and extend
- Clear separation of concerns
### 2. **Reusability**
- Components can be used independently
- Easy to swap implementations
- Shared utilities and helpers
### 3. **Scalability**
- New features can be added as separate modules
- Existing modules can be enhanced without affecting others
- Clear interfaces between modules
### 4. **Maintainability**
- Smaller, focused files are easier to understand
- Changes are isolated to specific modules
- Clear dependency relationships
## 🔧 Development Guidelines
### Adding New Features
1. Identify the appropriate module (research, outline, content, optimization)
2. Create new classes following the existing patterns
3. Update the module's `__init__.py` to export new classes
4. Add methods to the appropriate service orchestrator
5. Update the main `BlogWriterService` if needed (see the sketch below)
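For example, adding a hypothetical sentiment analyzer to the research module might look like this (a sketch only; the file, class, and method names are illustrative, not existing code):

```python
# Hypothetical example: research/sentiment_analyzer.py
class SentimentAnalyzer:
    """Analyzes the sentiment of competitor content (illustrative only)."""

    async def analyze(self, text: str) -> dict:
        # A real implementation would delegate to an AI provider here
        return {"sentiment": "neutral", "confidence": 0.0}

# Then export it from research/__init__.py:
# from .sentiment_analyzer import SentimentAnalyzer
```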
### Testing
- Each module should have its own test suite
- Mock external dependencies (AI providers, APIs)
- Test both success and failure scenarios
- Maintain high test coverage
### Error Handling
- Use graceful degradation with fallbacks (see the sketch after this list)
- Log errors appropriately
- Return meaningful error messages to users
- Don't let one module's failure break the entire flow
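The continuity modules in this service (`context_memory.py`, `transition_generator.py`, `flow_analyzer.py`) already follow this pattern; a minimal sketch, where `llm_summarize` stands in for any provider call:

```python
from loguru import logger

def llm_summarize(text: str) -> str:
    """Stand-in for an AI provider call that may raise on failure."""
    raise RuntimeError("provider unavailable")

def summarize(text: str) -> str:
    try:
        return llm_summarize(text)  # preferred: LLM-based path
    except Exception as e:
        # Graceful degradation: log the error, fall back to local truncation
        logger.warning(f"LLM summarization failed, using fallback: {e}")
        return " ".join(text.split()[:80])
```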
## 📈 Future Enhancements
### Content Module (`content/`)
- Section content generation
- Content optimization and refinement
- Multi-format output (HTML, Markdown, etc.)
### Optimization Module (`optimization/`)
- SEO analysis and recommendations
- Readability optimization
- Performance metrics and analytics
### Integration Module (`integration/`)
- Platform-specific adapters (WordPress, Wix, etc.)
- Publishing workflows
- Content management system integration
## 🔍 Code Quality
- **Type Hints**: All methods use proper type annotations
- **Documentation**: Comprehensive docstrings for all public methods
- **Error Handling**: Graceful failure with meaningful error messages
- **Logging**: Structured logging with appropriate levels
- **Testing**: Unit tests for all major functionality
- **Performance**: Efficient caching and API usage
## 📝 Migration Notes
The original `blog_service.py` has been refactored into this modular structure:
- **Research functionality** → `research/` module
- **Outline generation** → `outline/` module
- **Service orchestration** → `core/` module
- **Main entry point** → `blog_service.py` (now just imports from core)
All existing API endpoints continue to work without changes due to the maintained interface in `BlogWriterService`.

View File

@@ -0,0 +1,11 @@
"""
AI Blog Writer Service - Main entry point for blog writing functionality.
This module provides a clean interface to the modular blog writer services.
The actual implementation has been refactored into specialized modules:
- research/ - Research and keyword analysis
- outline/ - Outline generation and optimization
- core/ - Main service orchestrator
"""
from .core import BlogWriterService

View File

@@ -0,0 +1,209 @@
"""
Circuit Breaker Pattern for Blog Writer API Calls
Implements circuit breaker pattern to prevent cascading failures when external APIs
are experiencing issues. Tracks failure rates and automatically disables calls when
threshold is exceeded, with auto-recovery after cooldown period.
"""
import time
import asyncio
import functools
from typing import Callable, Any, Optional, Dict
from enum import Enum
from dataclasses import dataclass
from loguru import logger
from .exceptions import CircuitBreakerOpenException
class CircuitState(Enum):
"""Circuit breaker states."""
CLOSED = "closed" # Normal operation
OPEN = "open" # Circuit is open, calls are blocked
HALF_OPEN = "half_open" # Testing if service is back
@dataclass
class CircuitBreakerConfig:
"""Configuration for circuit breaker."""
failure_threshold: int = 5 # Number of failures before opening
recovery_timeout: int = 60 # Seconds to wait before trying again
success_threshold: int = 3 # Successes needed to close from half-open
timeout: int = 30 # Timeout for individual calls
max_failures_per_minute: int = 10 # Max failures per minute before opening
class CircuitBreaker:
"""Circuit breaker implementation for API calls."""
def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
self.name = name
self.config = config or CircuitBreakerConfig()
self.state = CircuitState.CLOSED
self.failure_count = 0
self.success_count = 0
self.last_failure_time = 0
self.last_success_time = 0
self.failure_times = [] # Track failure times for rate limiting
self._lock = asyncio.Lock()
async def call(self, func: Callable, *args, **kwargs) -> Any:
"""
Execute function with circuit breaker protection.
Args:
func: Function to execute
*args: Function arguments
**kwargs: Function keyword arguments
Returns:
Function result
Raises:
CircuitBreakerOpenException: If circuit is open
"""
async with self._lock:
# Check if circuit should be opened due to rate limiting
await self._check_rate_limit()
# Check circuit state
if self.state == CircuitState.OPEN:
if self._should_attempt_reset():
self.state = CircuitState.HALF_OPEN
self.success_count = 0
logger.info(f"Circuit breaker {self.name} transitioning to HALF_OPEN")
else:
retry_after = int(self.config.recovery_timeout - (time.time() - self.last_failure_time))
raise CircuitBreakerOpenException(
f"Circuit breaker {self.name} is OPEN",
retry_after=max(0, retry_after),
context={"circuit_name": self.name, "state": self.state.value}
)
try:
# Execute the function with timeout
result = await asyncio.wait_for(
func(*args, **kwargs),
timeout=self.config.timeout
)
# Record success
await self._record_success()
return result
except asyncio.TimeoutError:
await self._record_failure("timeout")
raise
except Exception as e:
await self._record_failure(str(e))
raise
async def _check_rate_limit(self):
"""Check if failure rate exceeds threshold."""
current_time = time.time()
# Remove failures older than 1 minute
self.failure_times = [
failure_time for failure_time in self.failure_times
if current_time - failure_time < 60
]
# Check if we've exceeded the rate limit
if len(self.failure_times) >= self.config.max_failures_per_minute:
self.state = CircuitState.OPEN
self.last_failure_time = current_time
logger.warning(f"Circuit breaker {self.name} opened due to rate limit: {len(self.failure_times)} failures in last minute")
def _should_attempt_reset(self) -> bool:
"""Check if enough time has passed to attempt reset."""
return time.time() - self.last_failure_time >= self.config.recovery_timeout
async def _record_success(self):
"""Record a successful call."""
async with self._lock:
self.last_success_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.success_count += 1
if self.success_count >= self.config.success_threshold:
self.state = CircuitState.CLOSED
self.failure_count = 0
logger.info(f"Circuit breaker {self.name} closed after {self.success_count} successes")
elif self.state == CircuitState.CLOSED:
# Reset failure count on success
self.failure_count = 0
async def _record_failure(self, error: str):
"""Record a failed call."""
async with self._lock:
current_time = time.time()
self.failure_count += 1
self.last_failure_time = current_time
self.failure_times.append(current_time)
logger.warning(f"Circuit breaker {self.name} recorded failure #{self.failure_count}: {error}")
# Open circuit if threshold exceeded
if self.failure_count >= self.config.failure_threshold:
self.state = CircuitState.OPEN
logger.error(f"Circuit breaker {self.name} opened after {self.failure_count} failures")
def get_state(self) -> Dict[str, Any]:
"""Get current circuit breaker state."""
return {
"name": self.name,
"state": self.state.value,
"failure_count": self.failure_count,
"success_count": self.success_count,
"last_failure_time": self.last_failure_time,
"last_success_time": self.last_success_time,
"failures_in_last_minute": len([
t for t in self.failure_times
if time.time() - t < 60
])
}
class CircuitBreakerManager:
"""Manages multiple circuit breakers."""
def __init__(self):
self._breakers: Dict[str, CircuitBreaker] = {}
def get_breaker(self, name: str, config: Optional[CircuitBreakerConfig] = None) -> CircuitBreaker:
"""Get or create a circuit breaker."""
if name not in self._breakers:
self._breakers[name] = CircuitBreaker(name, config)
return self._breakers[name]
def get_all_states(self) -> Dict[str, Dict[str, Any]]:
"""Get states of all circuit breakers."""
return {name: breaker.get_state() for name, breaker in self._breakers.items()}
def reset_breaker(self, name: str):
"""Reset a circuit breaker to closed state."""
if name in self._breakers:
self._breakers[name].state = CircuitState.CLOSED
self._breakers[name].failure_count = 0
self._breakers[name].success_count = 0
logger.info(f"Circuit breaker {name} manually reset")
# Global circuit breaker manager
circuit_breaker_manager = CircuitBreakerManager()
def circuit_breaker(name: str, config: Optional[CircuitBreakerConfig] = None):
"""
Decorator to add circuit breaker protection to async functions.
Args:
name: Circuit breaker name
config: Circuit breaker configuration
"""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        async def wrapper(*args, **kwargs):
            breaker = circuit_breaker_manager.get_breaker(name, config)
            return await breaker.call(func, *args, **kwargs)
        return wrapper
return decorator
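# Usage sketch (illustrative; "gemini_api" and call_gemini are hypothetical
# names, not part of this module):
#
#     @circuit_breaker("gemini_api", CircuitBreakerConfig(failure_threshold=3, timeout=15))
#     async def call_gemini(prompt: str) -> str:
#         ...  # outbound provider call, now protected by the breaker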

View File

@@ -0,0 +1,209 @@
"""
Blog Rewriter Service
Handles blog rewriting based on user feedback using structured AI calls.
"""
import time
import uuid
from typing import Dict, Any
from loguru import logger
from services.llm_providers.gemini_provider import gemini_structured_json_response
class BlogRewriter:
"""Service for rewriting blog content based on user feedback."""
def __init__(self, task_manager):
self.task_manager = task_manager
def start_blog_rewrite(self, request: Dict[str, Any]) -> str:
"""Start blog rewrite task with user feedback."""
try:
# Extract request data
title = request.get("title", "Untitled Blog")
sections = request.get("sections", [])
research = request.get("research", {})
outline = request.get("outline", [])
feedback = request.get("feedback", "")
tone = request.get("tone")
audience = request.get("audience")
focus = request.get("focus")
if not sections:
raise ValueError("No sections provided for rewrite")
if not feedback or len(feedback.strip()) < 10:
raise ValueError("Feedback is required and must be at least 10 characters")
# Create task for rewrite
task_id = f"rewrite_{int(time.time())}_{uuid.uuid4().hex[:8]}"
# Start the rewrite task
self.task_manager.start_task(
task_id,
self._execute_blog_rewrite,
title=title,
sections=sections,
research=research,
outline=outline,
feedback=feedback,
tone=tone,
audience=audience,
focus=focus
)
logger.info(f"Blog rewrite task started: {task_id}")
return task_id
except Exception as e:
logger.error(f"Failed to start blog rewrite: {e}")
raise
async def _execute_blog_rewrite(self, task_id: str, **kwargs):
"""Execute the blog rewrite task."""
try:
title = kwargs.get("title", "Untitled Blog")
sections = kwargs.get("sections", [])
research = kwargs.get("research", {})
outline = kwargs.get("outline", [])
feedback = kwargs.get("feedback", "")
tone = kwargs.get("tone")
audience = kwargs.get("audience")
focus = kwargs.get("focus")
# Update task status
self.task_manager.update_task_status(task_id, "processing", "Analyzing current content and feedback...")
# Build rewrite prompt with user feedback
system_prompt = f"""You are an expert blog writer tasked with rewriting content based on user feedback.
Current Blog Title: {title}
User Feedback: {feedback}
{f"Desired Tone: {tone}" if tone else ""}
{f"Target Audience: {audience}" if audience else ""}
{f"Focus Area: {focus}" if focus else ""}
Your task is to rewrite the blog content to address the user's feedback while maintaining the core structure and research insights."""
# Prepare content for rewrite
full_content = f"Title: {title}\n\n"
for section in sections:
full_content += f"Section: {section.get('heading', 'Untitled')}\n"
full_content += f"Content: {section.get('content', '')}\n\n"
# Create rewrite prompt
rewrite_prompt = f"""
Based on the user feedback and current blog content, rewrite the blog to address their concerns and preferences.
Current Content:
{full_content}
User Feedback: {feedback}
{f"Desired Tone: {tone}" if tone else ""}
{f"Target Audience: {audience}" if audience else ""}
{f"Focus Area: {focus}" if focus else ""}
Please rewrite the blog content in the following JSON format:
{{
"title": "New or improved blog title",
"sections": [
{{
"id": "section_id",
"heading": "Section heading",
"content": "Rewritten section content"
}}
]
}}
Guidelines:
1. Address the user's feedback directly
2. Maintain the research insights and factual accuracy
3. Improve flow, clarity, and engagement
4. Keep the same section structure unless feedback suggests otherwise
5. Ensure content is well-formatted with proper paragraphs
"""
# Update task status
self.task_manager.update_task_status(task_id, "processing", "Generating rewritten content...")
# Use structured JSON generation
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"sections": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {"type": "string"},
"heading": {"type": "string"},
"content": {"type": "string"}
}
}
}
}
}
result = gemini_structured_json_response(
prompt=rewrite_prompt,
schema=schema,
temperature=0.7,
max_tokens=4096,
system_prompt=system_prompt
)
logger.info(f"Gemini response for rewrite task {task_id}: {result}")
# Check if we have a valid result - handle both multi-section and single-section formats
is_valid_multi_section = result and not result.get("error") and result.get("title") and result.get("sections")
is_valid_single_section = result and not result.get("error") and (result.get("heading") or result.get("title")) and result.get("content")
if is_valid_multi_section or is_valid_single_section:
# If single section format, convert to multi-section format for consistency
if is_valid_single_section and not is_valid_multi_section:
# Convert single section to multi-section format
converted_result = {
"title": result.get("heading") or result.get("title") or "Rewritten Blog",
"sections": [
{
"id": result.get("id") or "section_1",
"heading": result.get("heading") or "Main Content",
"content": result.get("content", "")
}
]
}
result = converted_result
logger.info(f"Converted single section response to multi-section format for task {task_id}")
# Update task status with success
self.task_manager.update_task_status(
task_id,
"completed",
"Blog rewrite completed successfully!",
result=result
)
logger.info(f"Blog rewrite completed successfully: {task_id}")
else:
# More detailed error handling
if not result:
error_msg = "No response from AI"
elif result.get("error"):
error_msg = f"AI error: {result.get('error')}"
elif not (result.get("title") or result.get("heading")):
error_msg = "AI response missing title/heading"
elif not (result.get("sections") or result.get("content")):
error_msg = "AI response missing sections/content"
else:
error_msg = "AI response has invalid structure"
self.task_manager.update_task_status(task_id, "failed", f"Rewrite failed: {error_msg}")
logger.error(f"Blog rewrite failed: {error_msg}")
except Exception as e:
error_msg = f"Blog rewrite error: {str(e)}"
self.task_manager.update_task_status(task_id, "failed", error_msg)
logger.error(f"Blog rewrite task failed: {e}")
raise
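# Usage sketch (illustrative): start a rewrite task and poll its status via the
# injected task manager; the request keys mirror start_blog_rewrite above.
#
#     rewriter = BlogRewriter(task_manager)
#     task_id = rewriter.start_blog_rewrite({
#         "title": "My Post",
#         "sections": [{"heading": "Intro", "content": "..."}],
#         "feedback": "Make the tone more conversational.",
#     })
#     status = task_manager.get_task_status(task_id)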

View File

@@ -0,0 +1,152 @@
"""
ContextMemory - maintains intelligent continuity context across sections using LLM-enhanced summarization.
Stores smart per-section summaries and thread keywords for use in prompts with cost optimization.
"""
from __future__ import annotations
from typing import Dict, List, Optional, Tuple
from collections import deque
from loguru import logger
import hashlib
# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_text_response
class ContextMemory:
"""In-memory continuity store for recent sections with LLM-enhanced summarization.
Notes:
- Keeps an ordered deque of recent (section_id, summary) pairs
- Uses LLM for intelligent summarization when content is substantial
- Provides utilities to build a compact previous-sections summary
- Implements caching to minimize LLM calls
"""
def __init__(self, max_entries: int = 10):
self.max_entries = max_entries
self._recent: deque[Tuple[str, str]] = deque(maxlen=max_entries)
# Cache for LLM-generated summaries
self._summary_cache: Dict[str, str] = {}
logger.info("✅ ContextMemory initialized with LLM-enhanced summarization")
def update_with_section(self, section_id: str, full_text: str, use_llm: bool = True) -> None:
"""Create a compact summary and store it for continuity usage."""
summary = self._summarize_text_intelligently(full_text, use_llm=use_llm)
self._recent.append((section_id, summary))
def get_recent_summaries(self, limit: int = 2) -> List[str]:
"""Return the last N stored summaries (most recent first)."""
return [s for (_sid, s) in list(self._recent)[-limit:]]
def build_previous_sections_summary(self, limit: int = 2) -> str:
"""Join recent summaries for prompt injection."""
recents = self.get_recent_summaries(limit=limit)
if not recents:
return ""
return "\n\n".join(recents)
def _summarize_text_intelligently(self, text: str, target_words: int = 80, use_llm: bool = True) -> str:
"""Create intelligent summary using LLM when appropriate, fallback to truncation."""
# Create cache key
cache_key = self._get_cache_key(text)
# Check cache first
if cache_key in self._summary_cache:
logger.debug("Summary cache hit")
return self._summary_cache[cache_key]
# Determine if we should use LLM
should_use_llm = use_llm and self._should_use_llm_summarization(text)
if should_use_llm:
try:
summary = self._llm_summarize_text(text, target_words)
self._summary_cache[cache_key] = summary
logger.info("LLM-based summarization completed")
return summary
except Exception as e:
logger.warning(f"LLM summarization failed, using fallback: {e}")
# Fall through to local summarization
# Local fallback
summary = self._summarize_text_locally(text, target_words)
self._summary_cache[cache_key] = summary
return summary
def _should_use_llm_summarization(self, text: str) -> bool:
"""Determine if content is substantial enough to warrant LLM summarization."""
word_count = len(text.split())
# Use LLM for substantial content (>150 words) or complex structure
has_complex_structure = any(marker in text for marker in ['##', '###', '**', '*', '-', '1.', '2.'])
return word_count > 150 or has_complex_structure
def _llm_summarize_text(self, text: str, target_words: int = 80) -> str:
"""Use Gemini API for intelligent text summarization."""
# Truncate text to minimize tokens while keeping key content
truncated_text = text[:800] # First 800 chars usually contain the main points
prompt = f"""
Summarize the following content in approximately {target_words} words, focusing on key concepts and main points.
Content: {truncated_text}
Requirements:
- Capture the main ideas and key concepts
- Maintain the original tone and style
- Keep it concise but informative
- Focus on what's most important for continuity
Generate only the summary, no explanations or formatting.
"""
try:
result = gemini_text_response(
prompt=prompt,
temperature=0.3, # Low temperature for consistent summarization
max_tokens=500, # Increased tokens for better summaries
system_prompt="You are an expert at creating concise, informative summaries."
)
if result and result.strip():
summary = result.strip()
# Ensure it's not too long
words = summary.split()
if len(words) > target_words + 20: # Allow some flexibility
summary = " ".join(words[:target_words]) + "..."
return summary
else:
logger.warning("LLM summary response empty, using fallback")
return self._summarize_text_locally(text, target_words)
except Exception as e:
logger.error(f"LLM summarization error: {e}")
return self._summarize_text_locally(text, target_words)
def _summarize_text_locally(self, text: str, target_words: int = 80) -> str:
"""Very lightweight, deterministic truncation-based summary.
This deliberately avoids extra LLM calls. It collects the first
sentences up to approximately target_words.
"""
words = text.split()
if len(words) <= target_words:
return text.strip()
return " ".join(words[:target_words]).strip() + ""
def _get_cache_key(self, text: str) -> str:
"""Generate cache key from text hash."""
# Use first 200 chars for cache key to balance uniqueness vs memory
return hashlib.md5(text[:200].encode()).hexdigest()[:12]
def clear_cache(self):
"""Clear summary cache (useful for testing or memory management)."""
self._summary_cache.clear()
logger.info("ContextMemory cache cleared")

View File

@@ -0,0 +1,92 @@
"""
EnhancedContentGenerator - thin orchestrator for section generation.
Provider parity:
- Uses main_text_generation.llm_text_gen to respect GPT_PROVIDER (Gemini/HF)
- No direct provider coupling here; Google grounding remains in research only
"""
from typing import Any, Dict
from loguru import logger
from services.llm_providers.main_text_generation import llm_text_gen
from .source_url_manager import SourceURLManager
from .context_memory import ContextMemory
from .transition_generator import TransitionGenerator
from .flow_analyzer import FlowAnalyzer
class EnhancedContentGenerator:
def __init__(self):
self.url_manager = SourceURLManager()
self.memory = ContextMemory(max_entries=12)
self.transitioner = TransitionGenerator()
self.flow = FlowAnalyzer()
async def generate_section(self, section: Any, research: Any, mode: str = "polished") -> Dict[str, Any]:
prev_summary = self.memory.build_previous_sections_summary(limit=2)
urls = self.url_manager.pick_relevant_urls(section, research)
prompt = self._build_prompt(section, research, prev_summary, urls)
# Provider-agnostic text generation (respect GPT_PROVIDER & circuit-breaker)
content_text: str = ""
try:
ai_resp = llm_text_gen(
prompt=prompt,
json_struct=None,
system_prompt=None,
)
if isinstance(ai_resp, dict) and ai_resp.get("text"):
content_text = ai_resp.get("text", "")
elif isinstance(ai_resp, str):
content_text = ai_resp
else:
# Fallback best-effort extraction
content_text = str(ai_resp or "")
        except Exception as e:
            logger.warning(f"Section content generation failed, using empty content: {e}")
            content_text = ""
result = {
"content": content_text,
"sources": [{"title": u.get("title", ""), "url": u.get("url", "")} for u in urls] if urls else [],
}
# Generate transition and compute intelligent flow metrics
previous_text = prev_summary
current_text = result.get("content", "")
transition = self.transitioner.generate_transition(previous_text, getattr(section, 'heading', 'This section'), use_llm=True)
metrics = self.flow.assess_flow(previous_text, current_text, use_llm=True)
# Update memory for subsequent sections and store continuity snapshot
if current_text:
self.memory.update_with_section(getattr(section, 'id', 'unknown'), current_text, use_llm=True)
# Return enriched result
result["transition"] = transition
result["continuity_metrics"] = metrics
# Persist a lightweight continuity snapshot for API access
try:
sid = getattr(section, 'id', 'unknown')
if not hasattr(self, "_last_continuity"):
self._last_continuity = {}
self._last_continuity[sid] = metrics
except Exception:
pass
return result
def _build_prompt(self, section: Any, research: Any, prev_summary: str, urls: list) -> str:
heading = getattr(section, 'heading', 'Section')
key_points = getattr(section, 'key_points', [])
keywords = getattr(section, 'keywords', [])
target_words = getattr(section, 'target_words', 300)
url_block = "\n".join([f"- {u.get('title','')} ({u.get('url','')})" for u in urls]) if urls else "(no specific URLs provided)"
return (
f"You are writing the blog section '{heading}'.\n\n"
f"Context summary (previous sections): {prev_summary}\n\n"
f"Authoring requirements:\n"
f"- Target word count: ~{target_words}\n"
f"- Use the following key points: {', '.join(key_points)}\n"
f"- Include these keywords naturally: {', '.join(keywords)}\n"
f"- Cite insights from these sources when relevant (do not output raw URLs):\n{url_block}\n\n"
"Write engaging, well-structured markdown with clear paragraphs (2-4 sentences each) separated by double line breaks."
)

View File

@@ -0,0 +1,162 @@
"""
FlowAnalyzer - evaluates narrative flow using LLM-based analysis with cost optimization.
Uses Gemini API for intelligent analysis while minimizing API calls through caching and smart triggers.
"""
from typing import Dict, Optional
from loguru import logger
import hashlib
import json
# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_structured_json_response
class FlowAnalyzer:
def __init__(self):
# Simple in-memory cache to avoid redundant LLM calls
self._cache: Dict[str, Dict[str, float]] = {}
# Cache for rule-based fallback when LLM analysis isn't needed
self._rule_cache: Dict[str, Dict[str, float]] = {}
logger.info("✅ FlowAnalyzer initialized with LLM-based analysis")
def assess_flow(self, previous_text: str, current_text: str, use_llm: bool = True) -> Dict[str, float]:
"""
Return flow metrics in range 0..1.
Args:
previous_text: Previous section content
current_text: Current section content
use_llm: Whether to use LLM analysis (default: True for significant content)
"""
if not current_text:
return {"flow": 0.0, "consistency": 0.0, "progression": 0.0}
# Create cache key from content hashes
cache_key = self._get_cache_key(previous_text, current_text)
# Check cache first
if cache_key in self._cache:
logger.debug("Flow analysis cache hit")
return self._cache[cache_key]
# Determine if we should use LLM analysis
should_use_llm = use_llm and self._should_use_llm_analysis(previous_text, current_text)
if should_use_llm:
try:
metrics = self._llm_flow_analysis(previous_text, current_text)
self._cache[cache_key] = metrics
logger.info("LLM-based flow analysis completed")
return metrics
except Exception as e:
logger.warning(f"LLM flow analysis failed, falling back to rules: {e}")
# Fall through to rule-based analysis
# Rule-based fallback (cached separately)
if cache_key in self._rule_cache:
return self._rule_cache[cache_key]
metrics = self._rule_based_analysis(previous_text, current_text)
self._rule_cache[cache_key] = metrics
return metrics
def _should_use_llm_analysis(self, previous_text: str, current_text: str) -> bool:
"""Determine if content is significant enough to warrant LLM analysis."""
# Use LLM for substantial content or when previous context exists
word_count = len(current_text.split())
has_previous = bool(previous_text and len(previous_text.strip()) > 50)
# Use LLM if: substantial content (>100 words) OR has meaningful previous context
return word_count > 100 or has_previous
def _llm_flow_analysis(self, previous_text: str, current_text: str) -> Dict[str, float]:
"""Use Gemini API for intelligent flow analysis."""
# Truncate content to minimize tokens while keeping context
        prev_truncated = previous_text[-300:] if previous_text else ""
curr_truncated = current_text[:500] # First 500 chars usually contain the key content
prompt = f"""
Analyze the narrative flow between these two content sections. Rate each aspect from 0.0 to 1.0.
PREVIOUS SECTION (end): {prev_truncated}
CURRENT SECTION (start): {curr_truncated}
Evaluate:
1. Flow Quality (0.0-1.0): How smoothly does the content transition? Are there logical connections?
2. Consistency (0.0-1.0): Do key themes, terminology, and tone remain consistent?
3. Progression (0.0-1.0): Does the content logically build upon previous ideas?
Return ONLY a JSON object with these exact keys: flow, consistency, progression
"""
schema = {
"type": "object",
"properties": {
"flow": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"consistency": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"progression": {"type": "number", "minimum": 0.0, "maximum": 1.0}
},
"required": ["flow", "consistency", "progression"]
}
try:
result = gemini_structured_json_response(
prompt=prompt,
schema=schema,
temperature=0.2, # Low temperature for consistent scoring
max_tokens=1000 # Increased tokens for better analysis
)
            # gemini_structured_json_response returns a plain dict on success
            if result and not result.get("error"):
                return {
                    "flow": float(result.get("flow", 0.6)),
                    "consistency": float(result.get("consistency", 0.6)),
                    "progression": float(result.get("progression", 0.6))
                }
else:
logger.warning("LLM response parsing failed, using fallback")
return self._rule_based_analysis(previous_text, current_text)
except Exception as e:
logger.error(f"LLM flow analysis error: {e}")
return self._rule_based_analysis(previous_text, current_text)
def _rule_based_analysis(self, previous_text: str, current_text: str) -> Dict[str, float]:
"""Fallback rule-based analysis for cost efficiency."""
flow = 0.6
consistency = 0.6
progression = 0.6
# Enhanced heuristics
if previous_text and previous_text[-1] in ".!?":
flow += 0.1
if any(k in current_text.lower() for k in ["therefore", "next", "building on", "as a result", "furthermore", "additionally"]):
progression += 0.2
if len(current_text.split()) > 120:
consistency += 0.1
if any(k in current_text.lower() for k in ["however", "but", "although", "despite"]):
flow += 0.1 # Good use of contrast words
return {
"flow": min(flow, 1.0),
"consistency": min(consistency, 1.0),
"progression": min(progression, 1.0),
}
def _get_cache_key(self, previous_text: str, current_text: str) -> str:
"""Generate cache key from content hashes."""
# Use first 100 chars of each for cache key to balance uniqueness vs memory
prev_hash = hashlib.md5((previous_text[:100] if previous_text else "").encode()).hexdigest()[:8]
curr_hash = hashlib.md5(current_text[:100].encode()).hexdigest()[:8]
return f"{prev_hash}_{curr_hash}"
def clear_cache(self):
"""Clear analysis cache (useful for testing or memory management)."""
self._cache.clear()
self._rule_cache.clear()
logger.info("FlowAnalyzer cache cleared")

View File

@@ -0,0 +1,186 @@
"""
Introduction Generator - Generates varied blog introductions based on content and research.
Generates 3 different introduction options for the user to choose from.
"""
from typing import Dict, Any, List
from loguru import logger
from models.blog_models import BlogResearchResponse, BlogOutlineSection
class IntroductionGenerator:
"""Generates blog introductions using research and content data."""
def __init__(self):
"""Initialize the introduction generator."""
pass
def build_introduction_prompt(
self,
blog_title: str,
research: BlogResearchResponse,
outline: List[BlogOutlineSection],
sections_content: Dict[str, str],
primary_keywords: List[str],
search_intent: str
) -> str:
"""Build a prompt for generating blog introductions."""
# Extract key research insights
keyword_analysis = research.keyword_analysis or {}
content_angles = research.suggested_angles or []
# Get a summary of the first few sections for context
section_summaries = []
for i, section in enumerate(outline[:3], 1):
section_id = section.id
content = sections_content.get(section_id, '')
if content:
# Take first 200 chars as summary
summary = content[:200] + '...' if len(content) > 200 else content
section_summaries.append(f"{i}. {section.heading}: {summary}")
sections_text = '\n'.join(section_summaries) if section_summaries else "Content sections are being generated."
primary_kw_text = ', '.join(primary_keywords) if primary_keywords else "the topic"
content_angle_text = ', '.join(content_angles[:3]) if content_angles else "General insights"
return f"""Generate exactly 3 varied blog introductions for the following blog post.
BLOG TITLE: {blog_title}
PRIMARY KEYWORDS: {primary_kw_text}
SEARCH INTENT: {search_intent}
CONTENT ANGLES: {content_angle_text}
BLOG CONTENT SUMMARY:
{sections_text}
REQUIREMENTS FOR EACH INTRODUCTION:
- 80-120 words in length
- Hook the reader immediately with a compelling opening
- Clearly state the value proposition and what readers will learn
- Include the primary keyword naturally within the first 2 sentences
- Each introduction should have a different angle/approach:
1. First: Problem-focused (highlight the challenge readers face)
2. Second: Benefit-focused (emphasize the value and outcomes)
3. Third: Story/statistic-focused (use a compelling fact or narrative hook)
- Maintain a professional yet engaging tone
- Avoid generic phrases - be specific and benefit-driven
Return ONLY a JSON array of exactly 3 introductions:
[
"First introduction (80-120 words, problem-focused)",
"Second introduction (80-120 words, benefit-focused)",
"Third introduction (80-120 words, story/statistic-focused)"
]"""
def get_introduction_schema(self) -> Dict[str, Any]:
"""Get the JSON schema for introduction generation."""
        return {
            "type": "array",
            "items": {
                "type": "string",
                # JSON Schema length bounds count characters, not words;
                # 80-120 words is roughly 400-800 characters (assumed conversion)
                "minLength": 300,
                "maxLength": 900
            },
            "minItems": 3,
            "maxItems": 3
        }
async def generate_introductions(
self,
blog_title: str,
research: BlogResearchResponse,
outline: List[BlogOutlineSection],
sections_content: Dict[str, str],
primary_keywords: List[str],
search_intent: str,
user_id: str
) -> List[str]:
"""Generate 3 varied blog introductions.
Args:
blog_title: The blog post title
research: Research data with keywords and insights
outline: Blog outline sections
sections_content: Dictionary mapping section IDs to their content
primary_keywords: Primary keywords for the blog
search_intent: Search intent (informational, commercial, etc.)
user_id: User ID for API calls
Returns:
List of 3 introduction options
"""
from services.llm_providers.main_text_generation import llm_text_gen
if not user_id:
raise ValueError("user_id is required for introduction generation")
# Build prompt
prompt = self.build_introduction_prompt(
blog_title=blog_title,
research=research,
outline=outline,
sections_content=sections_content,
primary_keywords=primary_keywords,
search_intent=search_intent
)
# Get schema
schema = self.get_introduction_schema()
logger.info(f"Generating blog introductions for user {user_id}")
try:
# Generate introductions using structured JSON response
result = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt="You are an expert content writer specializing in creating compelling blog introductions that hook readers and clearly communicate value.",
user_id=user_id
)
# Handle response - could be array directly or wrapped in dict
if isinstance(result, list):
introductions = result
elif isinstance(result, dict):
# Try common keys
introductions = result.get('introductions', result.get('options', result.get('intros', [])))
if not introductions and isinstance(result.get('response'), list):
introductions = result['response']
else:
logger.warning(f"Unexpected introduction generation result type: {type(result)}")
introductions = []
# Validate and clean introductions
cleaned_introductions = []
for intro in introductions:
if isinstance(intro, str) and len(intro.strip()) >= 50: # Minimum reasonable length
cleaned = intro.strip()
                    # Bound by word count to match the 80-120 word target
                    if len(cleaned.split()) <= 140:  # allow slight overflow for quality
cleaned_introductions.append(cleaned)
# Ensure we have exactly 3 introductions
if len(cleaned_introductions) < 3:
logger.warning(f"Generated only {len(cleaned_introductions)} introductions, expected 3")
# Pad with placeholder if needed
while len(cleaned_introductions) < 3:
cleaned_introductions.append(f"{blog_title} - A comprehensive guide covering essential insights and practical strategies.")
# Return exactly 3 introductions
return cleaned_introductions[:3]
except Exception as e:
logger.error(f"Failed to generate introductions: {e}")
# Fallback: generate simple introductions
fallback_introductions = [
f"In this comprehensive guide, we'll explore {primary_keywords[0] if primary_keywords else 'essential insights'} and provide actionable strategies.",
f"Discover everything you need to know about {primary_keywords[0] if primary_keywords else 'this topic'} and how it can transform your approach.",
f"Whether you're new to {primary_keywords[0] if primary_keywords else 'this topic'} or looking to deepen your understanding, this guide has you covered."
]
return fallback_introductions

View File

@@ -0,0 +1,257 @@
"""
Medium Blog Generator Service
Handles generation of medium-length blogs (≤1000 words) using structured AI calls.
"""
import time
import json
from typing import Dict, Any, List
from loguru import logger
from fastapi import HTTPException
from models.blog_models import (
MediumBlogGenerateRequest,
MediumBlogGenerateResult,
MediumGeneratedSection,
ResearchSource,
)
from services.llm_providers.main_text_generation import llm_text_gen
from services.cache.persistent_content_cache import persistent_content_cache
class MediumBlogGenerator:
"""Service for generating medium-length blog content using structured AI calls."""
def __init__(self):
self.cache = persistent_content_cache
async def generate_medium_blog_with_progress(self, req: MediumBlogGenerateRequest, task_id: str, user_id: str) -> MediumBlogGenerateResult:
"""Use Gemini structured JSON to generate a medium-length blog in one call.
Args:
req: Medium blog generation request
task_id: Task ID for progress updates
user_id: User ID (required for subscription checks and usage tracking)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for medium blog generation (subscription checks and usage tracking)")
import time
start = time.time()
# Prepare sections data for cache key generation
sections_for_cache = []
for s in req.sections:
sections_for_cache.append({
"id": s.id,
"heading": s.heading,
"keyPoints": getattr(s, "key_points", []) or getattr(s, "keyPoints", []),
"subheadings": getattr(s, "subheadings", []),
"keywords": getattr(s, "keywords", []),
"targetWords": getattr(s, "target_words", None) or getattr(s, "targetWords", None),
})
# Check cache first
cached_result = self.cache.get_cached_content(
keywords=req.researchKeywords or [],
sections=sections_for_cache,
global_target_words=req.globalTargetWords or 1000,
persona_data=req.persona.dict() if req.persona else None,
tone=req.tone,
audience=req.audience
)
if cached_result:
logger.info(f"Using cached content for keywords: {req.researchKeywords} (saved expensive generation)")
# Add cache hit marker to distinguish from fresh generation
cached_result['generation_time_ms'] = 0 # Mark as cache hit
cached_result['cache_hit'] = True
return MediumBlogGenerateResult(**cached_result)
# Cache miss - proceed with AI generation
logger.info(f"Cache miss - generating new content for keywords: {req.researchKeywords}")
# Build schema expected from the model
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"sections": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {"type": "string"},
"heading": {"type": "string"},
"content": {"type": "string"},
"wordCount": {"type": "number"},
"sources": {
"type": "array",
"items": {
"type": "object",
"properties": {"title": {"type": "string"}, "url": {"type": "string"}},
},
},
},
},
},
},
}
# Compose prompt
def section_block(s):
return {
"id": s.id,
"heading": s.heading,
"outline": {
"keyPoints": getattr(s, "key_points", []) or getattr(s, "keyPoints", []),
"subheadings": getattr(s, "subheadings", []),
"keywords": getattr(s, "keywords", []),
"targetWords": getattr(s, "target_words", None) or getattr(s, "targetWords", None),
"references": [
{"title": r.title, "url": r.url} for r in getattr(s, "references", [])
],
},
}
payload = {
"title": req.title,
"globalTargetWords": req.globalTargetWords or 1000,
"persona": req.persona.dict() if req.persona else None,
"tone": req.tone,
"audience": req.audience,
"sections": [section_block(s) for s in req.sections],
}
# Build persona-aware system prompt
persona_context = ""
if req.persona:
persona_context = f"""
PERSONA GUIDELINES:
- Industry: {req.persona.industry or 'General'}
- Tone: {req.persona.tone or 'Professional'}
- Audience: {req.persona.audience or 'General readers'}
- Persona ID: {req.persona.persona_id or 'Default'}
Write content that reflects this persona's expertise and communication style.
Use industry-specific terminology and examples where appropriate.
Maintain consistent voice and authority throughout all sections.
"""
system = (
"You are a professional blog writer with deep expertise in your field. "
"Generate high-quality, persona-driven content for each section based on the provided outline. "
"Write engaging, informative content that follows the section's key points and target word count. "
"Ensure the content flows naturally and maintains consistent voice and authority. "
"Format content with proper paragraph breaks using double line breaks (\\n\\n) between paragraphs. "
"Structure content with clear paragraphs - aim for 2-4 sentences per paragraph. "
f"{persona_context}"
"Return ONLY valid JSON with no markdown formatting or explanations."
)
# Build persona-specific content instructions
persona_instructions = ""
if req.persona:
industry = req.persona.industry or 'General'
tone = req.persona.tone or 'Professional'
audience = req.persona.audience or 'General readers'
persona_instructions = f"""
PERSONA-DRIVEN CONTENT REQUIREMENTS:
- Write as an expert in {industry} industry
- Use {tone} tone appropriate for {audience}
- Include industry-specific examples and terminology
- Demonstrate authority and expertise in the field
- Use language that resonates with {audience}
- Maintain consistent voice that reflects this persona's expertise
"""
prompt = (
f"Write blog content for the following sections. Each section should be {req.globalTargetWords or 1000} words total, distributed across all sections.\n\n"
f"Blog Title: {req.title}\n\n"
"For each section, write engaging content that:\n"
"- Follows the key points provided\n"
"- Uses the suggested keywords naturally\n"
"- Meets the target word count\n"
"- Maintains professional tone\n"
"- References the provided sources when relevant\n"
"- Breaks content into clear paragraphs (2-4 sentences each)\n"
"- Uses double line breaks (\\n\\n) between paragraphs for proper formatting\n"
"- Starts with an engaging opening paragraph\n"
"- Ends with a strong concluding paragraph\n"
f"{persona_instructions}\n"
"IMPORTANT: Format the 'content' field with proper paragraph breaks using \\n\\n between paragraphs.\n\n"
"Return a JSON object with 'title' and 'sections' array. Each section should have 'id', 'heading', 'content', and 'wordCount'.\n\n"
f"Sections to write:\n{json.dumps(payload, ensure_ascii=False, indent=2)}"
)
try:
ai_resp = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt=system,
user_id=user_id
)
except HTTPException:
# Re-raise HTTPExceptions (e.g., 429 subscription limit) to preserve error details
raise
except Exception as llm_error:
# Wrap other errors
logger.error(f"AI generation failed: {llm_error}")
raise Exception(f"AI generation failed: {str(llm_error)}")
# Check for errors in AI response
if not ai_resp or ai_resp.get("error"):
error_msg = ai_resp.get("error", "Empty generation result from model") if ai_resp else "No response from model"
logger.error(f"AI generation failed: {error_msg}")
raise Exception(f"AI generation failed: {error_msg}")
# Normalize output
title = ai_resp.get("title") or req.title
out_sections = []
for s in ai_resp.get("sections", []) or []:
out_sections.append(
MediumGeneratedSection(
id=str(s.get("id")),
heading=s.get("heading") or "",
content=s.get("content") or "",
wordCount=int(s.get("wordCount") or 0),
sources=[
# map to ResearchSource shape if possible; keep minimal
ResearchSource(title=src.get("title", ""), url=src.get("url", ""))
for src in (s.get("sources") or [])
] or None,
)
)
duration_ms = int((time.time() - start) * 1000)
result = MediumBlogGenerateResult(
success=True,
title=title,
sections=out_sections,
model="gemini-2.5-flash",
generation_time_ms=duration_ms,
safety_flags=None,
)
# Cache the result for future use
try:
self.cache.cache_content(
keywords=req.researchKeywords or [],
sections=sections_for_cache,
global_target_words=req.globalTargetWords or 1000,
persona_data=req.persona.dict() if req.persona else None,
tone=req.tone or "professional",
audience=req.audience or "general",
result=result.dict()
)
logger.info(f"Cached content result for keywords: {req.researchKeywords}")
except Exception as cache_error:
logger.warning(f"Failed to cache content result: {cache_error}")
# Don't fail the entire operation if caching fails
return result

View File

@@ -0,0 +1,42 @@
"""
SourceURLManager - selects the most relevant source references (title and URL) for a section.
Low-effort heuristic using keywords and titles; safe defaults if no research.
"""
from typing import List, Dict, Any
class SourceURLManager:
    def pick_relevant_urls(self, section: Any, research: Any, limit: int = 5) -> List[Dict[str, str]]:
if not research or not getattr(research, 'sources', None):
return []
section_keywords = set([k.lower() for k in getattr(section, 'keywords', [])])
        scored: List[tuple[float, Dict[str, str]]] = []
        for s in research.sources:
            # Support both dict-shaped and attribute-shaped source records
            if isinstance(s, dict):
                url = s.get('url') or s.get('uri')
                title = s.get('title') or ''
            else:
                url = getattr(s, 'url', None) or getattr(s, 'uri', None)
                title = getattr(s, 'title', None) or ''
if not url or not isinstance(url, str):
continue
title_l = (title or '').lower()
# simple overlap score
score = 0.0
for kw in section_keywords:
if kw and kw in title_l:
score += 1.0
# prefer https and reputable domains lightly
if url.startswith('https://'):
score += 0.2
            scored.append((score, {"title": title or "", "url": url}))
scored.sort(key=lambda x: x[0], reverse=True)
        dedup: List[Dict[str, str]] = []
        seen = set()
        for _, u in scored:
            if u["url"] not in seen:
                seen.add(u["url"])
                dedup.append(u)
            if len(dedup) >= limit:
                break
        return dedup
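# Usage sketch (illustrative): given a section exposing .keywords and a research
# object exposing .sources, returns up to `limit` {"title", "url"} dicts.
#
#     urls = SourceURLManager().pick_relevant_urls(section, research, limit=3)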

View File

@@ -0,0 +1,143 @@
"""
TransitionGenerator - produces intelligent transitions between sections using LLM analysis.
Uses Gemini API for natural transitions while maintaining cost efficiency through smart caching.
"""
from typing import Optional, Dict
from loguru import logger
import hashlib
# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_text_response
class TransitionGenerator:
def __init__(self):
# Simple cache to avoid redundant LLM calls for similar transitions
self._cache: Dict[str, str] = {}
logger.info("✅ TransitionGenerator initialized with LLM-based generation")
def generate_transition(self, previous_text: str, current_heading: str, use_llm: bool = True) -> str:
"""
        Return a 1-2 sentence bridge from previous_text into current_heading.
Args:
previous_text: Previous section content
current_heading: Current section heading
use_llm: Whether to use LLM generation (default: True for substantial content)
"""
prev = (previous_text or "").strip()
if not prev:
return f"Let's explore {current_heading.lower()} next."
# Create cache key
cache_key = self._get_cache_key(prev, current_heading)
# Check cache first
if cache_key in self._cache:
logger.debug("Transition generation cache hit")
return self._cache[cache_key]
# Determine if we should use LLM
should_use_llm = use_llm and self._should_use_llm_generation(prev, current_heading)
if should_use_llm:
try:
transition = self._llm_generate_transition(prev, current_heading)
self._cache[cache_key] = transition
logger.info("LLM-based transition generated")
return transition
except Exception as e:
logger.warning(f"LLM transition generation failed, using fallback: {e}")
# Fall through to heuristic generation
# Heuristic fallback
transition = self._heuristic_transition(prev, current_heading)
self._cache[cache_key] = transition
return transition
def _should_use_llm_generation(self, previous_text: str, current_heading: str) -> bool:
"""Determine if content is substantial enough to warrant LLM generation."""
# Use LLM for substantial previous content (>100 words) or complex headings
word_count = len(previous_text.split())
complex_heading = len(current_heading.split()) > 2 or any(char in current_heading for char in [':', '-', '&'])
return word_count > 100 or complex_heading
def _llm_generate_transition(self, previous_text: str, current_heading: str) -> str:
"""Use Gemini API for intelligent transition generation."""
# Truncate previous text to minimize tokens while keeping context
prev_truncated = previous_text[-200:] # Last 200 chars usually contain the conclusion
prompt = f"""
Create a smooth, natural 1-2 sentence transition from the previous content to the new section.
PREVIOUS CONTENT (ending): {prev_truncated}
NEW SECTION HEADING: {current_heading}
Requirements:
- Write exactly 1-2 sentences
- Create a logical bridge between the topics
- Use natural, engaging language
- Avoid repetition of the previous content
- Lead smoothly into the new section topic
Generate only the transition text, no explanations or formatting.
"""
try:
result = gemini_text_response(
prompt=prompt,
temperature=0.6, # Balanced creativity and consistency
max_tokens=300, # Increased tokens for better transitions
system_prompt="You are an expert content writer creating smooth transitions between sections."
)
if result and result.strip():
# Clean up the response
transition = result.strip()
# Ensure it's 1-2 sentences
sentences = transition.split('. ')
if len(sentences) > 2:
transition = '. '.join(sentences[:2]) + '.'
return transition
else:
logger.warning("LLM transition response empty, using fallback")
return self._heuristic_transition(previous_text, current_heading)
except Exception as e:
logger.error(f"LLM transition generation error: {e}")
return self._heuristic_transition(previous_text, current_heading)
def _heuristic_transition(self, previous_text: str, current_heading: str) -> str:
"""Fallback heuristic-based transition generation."""
tail = previous_text[-240:]
# Enhanced heuristics based on content patterns
if any(word in tail.lower() for word in ["problem", "issue", "challenge"]):
return f"Now that we've identified the challenges, let's explore {current_heading.lower()} to find solutions."
elif any(word in tail.lower() for word in ["solution", "approach", "method"]):
return f"Building on this approach, {current_heading.lower()} provides the next step in our analysis."
elif any(word in tail.lower() for word in ["important", "crucial", "essential"]):
return f"Given this importance, {current_heading.lower()} becomes our next focus area."
else:
return (
f"Building on the discussion above, this leads us into {current_heading.lower()}, "
f"where we focus on practical implications and what to do next."
)
def _get_cache_key(self, previous_text: str, current_heading: str) -> str:
"""Generate cache key from content hashes."""
# Use last 100 chars of previous text and heading for cache key
prev_hash = hashlib.md5(previous_text[-100:].encode()).hexdigest()[:8]
heading_hash = hashlib.md5(current_heading.encode()).hexdigest()[:8]
return f"{prev_hash}_{heading_hash}"
def clear_cache(self):
"""Clear transition cache (useful for testing or memory management)."""
self._cache.clear()
logger.info("TransitionGenerator cache cleared")

View File

@@ -0,0 +1,11 @@
"""
Core module for AI Blog Writer.
This module contains the main service orchestrator and shared utilities.
"""
from .blog_writer_service import BlogWriterService
__all__ = [
'BlogWriterService'
]

View File

@@ -0,0 +1,521 @@
"""
Blog Writer Service - Main orchestrator for AI Blog Writer.
Coordinates research, outline generation, content creation, and optimization.
"""
from typing import Dict, Any, List
import time
import uuid
from loguru import logger
from models.blog_models import (
BlogResearchRequest,
BlogResearchResponse,
BlogOutlineRequest,
BlogOutlineResponse,
BlogOutlineRefineRequest,
BlogSectionRequest,
BlogSectionResponse,
BlogOptimizeRequest,
BlogOptimizeResponse,
BlogSEOAnalyzeRequest,
BlogSEOAnalyzeResponse,
BlogSEOMetadataRequest,
BlogSEOMetadataResponse,
BlogPublishRequest,
BlogPublishResponse,
BlogOutlineSection,
ResearchSource,
)
from ..research import ResearchService
from ..outline import OutlineService
from ..content.enhanced_content_generator import EnhancedContentGenerator
from ..content.medium_blog_generator import MediumBlogGenerator
from ..content.blog_rewriter import BlogRewriter
from services.llm_providers.gemini_provider import gemini_structured_json_response
from services.cache.persistent_content_cache import persistent_content_cache
from models.blog_models import (
MediumBlogGenerateRequest,
MediumBlogGenerateResult,
MediumGeneratedSection,
)
# Import task manager - we'll create a simple one for this service
class SimpleTaskManager:
"""Simple task manager for BlogWriterService."""
def __init__(self):
self.tasks = {}
def start_task(self, task_id: str, func, **kwargs):
"""Start a task with the given function and arguments."""
import asyncio
self.tasks[task_id] = {
"status": "running",
"progress": "Starting...",
"result": None,
"error": None
}
# Start the task in the background
asyncio.create_task(self._run_task(task_id, func, **kwargs))
async def _run_task(self, task_id: str, func, **kwargs):
"""Run the task function."""
try:
await func(task_id, **kwargs)
except Exception as e:
self.tasks[task_id]["status"] = "failed"
self.tasks[task_id]["error"] = str(e)
logger.error(f"Task {task_id} failed: {e}")
def update_task_status(self, task_id: str, status: str, progress: str = None, result=None):
"""Update task status."""
if task_id in self.tasks:
self.tasks[task_id]["status"] = status
if progress:
self.tasks[task_id]["progress"] = progress
            if result is not None:
                self.tasks[task_id]["result"] = result
def get_task_status(self, task_id: str):
"""Get task status."""
return self.tasks.get(task_id, {"status": "not_found"})
class BlogWriterService:
"""Main service orchestrator for AI Blog Writer functionality."""
def __init__(self):
self.research_service = ResearchService()
self.outline_service = OutlineService()
self.content_generator = EnhancedContentGenerator()
self.task_manager = SimpleTaskManager()
self.medium_blog_generator = MediumBlogGenerator()
self.blog_rewriter = BlogRewriter(self.task_manager)
# Research Methods
async def research(self, request: BlogResearchRequest, user_id: str) -> BlogResearchResponse:
"""Conduct comprehensive research using Google Search grounding."""
return await self.research_service.research(request, user_id)
async def research_with_progress(self, request: BlogResearchRequest, task_id: str, user_id: str) -> BlogResearchResponse:
"""Conduct research with real-time progress updates."""
return await self.research_service.research_with_progress(request, task_id, user_id)
# Outline Methods
async def generate_outline(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
"""Generate AI-powered outline from research data.
Args:
request: Outline generation request with research data
user_id: User ID (required for subscription checks and usage tracking)
"""
if not user_id:
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
return await self.outline_service.generate_outline(request, user_id)
async def generate_outline_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
"""Generate outline with real-time progress updates."""
return await self.outline_service.generate_outline_with_progress(request, task_id, user_id)
async def refine_outline(self, request: BlogOutlineRefineRequest) -> BlogOutlineResponse:
"""Refine outline with HITL operations."""
return await self.outline_service.refine_outline(request)
async def enhance_section_with_ai(self, section: BlogOutlineSection, focus: str = "general improvement") -> BlogOutlineSection:
"""Enhance a section using AI."""
return await self.outline_service.enhance_section_with_ai(section, focus)
async def optimize_outline_with_ai(self, outline: List[BlogOutlineSection], focus: str = "general optimization") -> List[BlogOutlineSection]:
"""Optimize entire outline for better flow and SEO."""
return await self.outline_service.optimize_outline_with_ai(outline, focus)
def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
"""Rebalance word count distribution across sections."""
return self.outline_service.rebalance_word_counts(outline, target_words)
# Content Generation Methods
async def generate_section(self, request: BlogSectionRequest) -> BlogSectionResponse:
"""Generate section content from outline."""
# Compose research-lite object with minimal continuity summary if available
research_ctx: Any = getattr(request, 'research', None)
try:
ai_result = await self.content_generator.generate_section(
section=request.section,
research=research_ctx,
mode=(request.mode or "polished"),
)
markdown = ai_result.get('content') or ai_result.get('markdown') or ''
citations = []
# Map basic citations from sources if present
for s in ai_result.get('sources', [])[:5]:
citations.append({
"title": s.get('title') if isinstance(s, dict) else getattr(s, 'title', ''),
"url": s.get('url') if isinstance(s, dict) else getattr(s, 'url', ''),
})
if not markdown:
markdown = f"## {request.section.heading}\n\n(Generated content was empty.)"
return BlogSectionResponse(
success=True,
markdown=markdown,
citations=citations,
continuity_metrics=ai_result.get('continuity_metrics')
)
except Exception as e:
logger.error(f"Section generation failed: {e}")
fallback = f"## {request.section.heading}\n\nThis section will cover: {', '.join(request.section.key_points)}."
return BlogSectionResponse(success=False, markdown=fallback, citations=[])
async def optimize_section(self, request: BlogOptimizeRequest) -> BlogOptimizeResponse:
"""Optimize section content for readability and SEO."""
# TODO: Move to optimization module
return BlogOptimizeResponse(success=True, optimized=request.content, diff_preview=None)
# SEO and Analysis Methods (TODO: Extract to optimization module)
async def hallucination_check(self, payload: Dict[str, Any]) -> Dict[str, Any]:
"""Run hallucination detection on provided text."""
text = str(payload.get("text", "") or "").strip()
if not text:
return {"success": False, "error": "No text provided"}
# Prefer direct service use over HTTP proxy
try:
from services.hallucination_detector import HallucinationDetector
detector = HallucinationDetector()
result = await detector.detect_hallucinations(text)
# Serialize dataclass-like result to dict
claims = []
for c in result.claims:
claims.append({
"text": c.text,
"confidence": c.confidence,
"assessment": c.assessment,
"supporting_sources": c.supporting_sources,
"refuting_sources": c.refuting_sources,
"reasoning": c.reasoning,
})
return {
"success": True,
"overall_confidence": result.overall_confidence,
"total_claims": result.total_claims,
"supported_claims": result.supported_claims,
"refuted_claims": result.refuted_claims,
"insufficient_claims": result.insufficient_claims,
"timestamp": result.timestamp,
"claims": claims,
}
except Exception as e:
return {"success": False, "error": str(e)}
async def seo_analyze(self, request: BlogSEOAnalyzeRequest, user_id: Optional[str] = None) -> BlogSEOAnalyzeResponse:
"""Analyze content for SEO optimization using comprehensive blog-specific analyzer."""
try:
from services.blog_writer.seo.blog_content_seo_analyzer import BlogContentSEOAnalyzer
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
content = request.content or ""
target_keywords = request.keywords or []
# Use research data from request if available, otherwise create fallback
if request.research_data:
research_data = request.research_data
logger.info(f"Using research data from request: {research_data.get('keyword_analysis', {})}")
else:
# Fallback for backward compatibility
research_data = {
"keyword_analysis": {
"primary": target_keywords,
"long_tail": [],
"semantic": [],
"all_keywords": target_keywords,
"search_intent": "informational"
}
}
logger.warning("No research data provided, using fallback keywords")
# Use our comprehensive SEO analyzer
analyzer = BlogContentSEOAnalyzer()
analysis_results = await analyzer.analyze_blog_content(content, research_data, user_id=user_id)
# Convert results to response format
recommendations = analysis_results.get('actionable_recommendations', [])
# Convert recommendation objects to strings
recommendation_strings = []
for rec in recommendations:
if isinstance(rec, dict):
recommendation_strings.append(f"[{rec.get('category', 'General')}] {rec.get('recommendation', '')}")
else:
recommendation_strings.append(str(rec))
return BlogSEOAnalyzeResponse(
success=True,
seo_score=float(analysis_results.get('overall_score', 0)),
density=analysis_results.get('visualization_data', {}).get('keyword_analysis', {}).get('densities', {}),
structure=analysis_results.get('detailed_analysis', {}).get('content_structure', {}),
readability=analysis_results.get('detailed_analysis', {}).get('readability_analysis', {}),
link_suggestions=[],
image_alt_status={"total_images": 0, "missing_alt": 0},
recommendations=recommendation_strings
)
except Exception as e:
logger.error(f"SEO analysis failed: {e}")
return BlogSEOAnalyzeResponse(
success=False,
seo_score=0.0,
density={},
structure={},
readability={},
link_suggestions=[],
image_alt_status={"total_images": 0, "missing_alt": 0},
recommendations=[f"SEO analysis failed: {str(e)}"]
)
async def seo_metadata(self, request: BlogSEOMetadataRequest, user_id: Optional[str] = None) -> BlogSEOMetadataResponse:
"""Generate comprehensive SEO metadata for content."""
try:
from services.blog_writer.seo.blog_seo_metadata_generator import BlogSEOMetadataGenerator
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
# Initialize metadata generator
metadata_generator = BlogSEOMetadataGenerator()
# Extract outline and seo_analysis from request
outline = request.outline if hasattr(request, 'outline') else None
seo_analysis = request.seo_analysis if hasattr(request, 'seo_analysis') else None
# Generate comprehensive metadata with full context
metadata_results = await metadata_generator.generate_comprehensive_metadata(
blog_content=request.content,
blog_title=request.title or "Untitled Blog Post",
research_data=request.research_data or {},
outline=outline,
seo_analysis=seo_analysis,
user_id=user_id
)
# Convert to BlogSEOMetadataResponse format
return BlogSEOMetadataResponse(
success=metadata_results.get('success', True),
title_options=metadata_results.get('title_options', []),
meta_descriptions=metadata_results.get('meta_descriptions', []),
seo_title=metadata_results.get('seo_title'),
meta_description=metadata_results.get('meta_description'),
url_slug=metadata_results.get('url_slug', ''),
blog_tags=metadata_results.get('blog_tags', []),
blog_categories=metadata_results.get('blog_categories', []),
social_hashtags=metadata_results.get('social_hashtags', []),
open_graph=metadata_results.get('open_graph', {}),
twitter_card=metadata_results.get('twitter_card', {}),
json_ld_schema=metadata_results.get('json_ld_schema', {}),
canonical_url=metadata_results.get('canonical_url', ''),
reading_time=metadata_results.get('reading_time', 0.0),
focus_keyword=metadata_results.get('focus_keyword', ''),
generated_at=metadata_results.get('generated_at', ''),
optimization_score=metadata_results.get('metadata_summary', {}).get('optimization_score', 0)
)
except Exception as e:
logger.error(f"SEO metadata generation failed: {e}")
# Return fallback response
return BlogSEOMetadataResponse(
success=False,
title_options=[request.title or "Generated SEO Title"],
meta_descriptions=["Compelling meta description..."],
open_graph={"title": request.title or "OG Title", "image": ""},
twitter_card={"card": "summary_large_image"},
json_ld_schema={"@type": "Article"},
error=str(e)
)
async def publish(self, request: BlogPublishRequest) -> BlogPublishResponse:
"""Publish content to specified platform."""
# TODO: Move to content module
return BlogPublishResponse(success=True, platform=request.platform, url="https://example.com/post")
async def generate_medium_blog_with_progress(self, req: MediumBlogGenerateRequest, task_id: str, user_id: str) -> MediumBlogGenerateResult:
"""Use Gemini structured JSON to generate a medium-length blog in one call.
Args:
req: Medium blog generation request
task_id: Task ID for progress updates
user_id: User ID (required for subscription checks and usage tracking)
"""
if not user_id:
raise ValueError("user_id is required for medium blog generation (subscription checks and usage tracking)")
return await self.medium_blog_generator.generate_medium_blog_with_progress(req, task_id, user_id)
async def analyze_flow_basic(self, request: Dict[str, Any]) -> Dict[str, Any]:
"""Analyze flow metrics for entire blog using single AI call (cost-effective)."""
try:
# Extract blog content from request
sections = request.get("sections", [])
title = request.get("title", "Untitled Blog")
if not sections:
return {"error": "No sections provided for analysis"}
# Combine all content for analysis
full_content = f"Title: {title}\n\n"
for section in sections:
full_content += f"Section: {section.get('heading', 'Untitled')}\n"
full_content += f"Content: {section.get('content', '')}\n\n"
# Build analysis prompt
system_prompt = """You are an expert content analyst specializing in narrative flow, consistency, and progression analysis.
Analyze the provided blog content and provide detailed, actionable feedback for improvement.
Focus on how well the content flows from section to section, maintains consistency in tone and style,
and progresses logically through the topic."""
analysis_prompt = f"""
Analyze the following blog content for narrative flow, consistency, and progression:
{full_content}
Evaluate each section and provide overall analysis with specific scores and actionable suggestions.
Consider:
- How well each section flows into the next
- Consistency in tone, style, and voice throughout
- Logical progression of ideas and arguments
- Transition quality between sections
- Overall coherence and readability
IMPORTANT: For each section in the response, use the exact section ID provided in the input.
The section IDs in your response must match the section IDs from the input exactly.
Provide detailed analysis with specific, actionable suggestions for improvement.
"""
# Use Gemini for structured analysis
from services.llm_providers.gemini_provider import gemini_structured_json_response
schema = {
"type": "object",
"properties": {
"overall_flow_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"overall_consistency_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"overall_progression_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"overall_coherence_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"sections": {
"type": "array",
"items": {
"type": "object",
"properties": {
"section_id": {"type": "string"},
"heading": {"type": "string"},
"flow_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"consistency_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"progression_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"coherence_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"transition_quality": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"suggestions": {"type": "array", "items": {"type": "string"}},
"strengths": {"type": "array", "items": {"type": "string"}},
"improvement_areas": {"type": "array", "items": {"type": "string"}}
},
"required": ["section_id", "heading", "flow_score", "consistency_score", "progression_score", "coherence_score", "transition_quality", "suggestions"]
}
},
"overall_suggestions": {"type": "array", "items": {"type": "string"}},
"overall_strengths": {"type": "array", "items": {"type": "string"}},
"overall_improvement_areas": {"type": "array", "items": {"type": "string"}},
"transition_analysis": {
"type": "object",
"properties": {
"overall_transition_quality": {"type": "number", "minimum": 0.0, "maximum": 1.0},
"transition_suggestions": {"type": "array", "items": {"type": "string"}}
}
}
},
"required": ["overall_flow_score", "overall_consistency_score", "overall_progression_score", "overall_coherence_score", "sections", "overall_suggestions"]
}
result = gemini_structured_json_response(
prompt=analysis_prompt,
schema=schema,
temperature=0.3,
max_tokens=4096,
system_prompt=system_prompt
)
if result and not result.get("error"):
logger.info("Basic flow analysis completed successfully")
return {"success": True, "analysis": result, "mode": "basic"}
else:
error_msg = result.get("error", "Analysis failed") if result else "No response from AI"
logger.error(f"Basic flow analysis failed: {error_msg}")
return {"error": error_msg}
except Exception as e:
logger.error(f"Basic flow analysis error: {e}")
return {"error": str(e)}
async def analyze_flow_advanced(self, request: Dict[str, Any]) -> Dict[str, Any]:
"""Analyze flow metrics for each section individually (detailed but expensive)."""
try:
# Use the existing enhanced content generator for detailed analysis
sections = request.get("sections", [])
title = request.get("title", "Untitled Blog")
if not sections:
return {"error": "No sections provided for analysis"}
results = []
prev_section_content = ""
for section in sections:
# Use the existing flow analyzer for each section
section_content = section.get("content", "")
section_heading = section.get("heading", "Untitled")
# Previous section context comes from the prior iteration (result rows do not store content)
# Use the existing flow analyzer
flow_metrics = self.content_generator.flow.assess_flow(
prev_section_content,
section_content,
use_llm=True
)
results.append({
"section_id": section.get("id", "unknown"),
"heading": section_heading,
"flow_score": flow_metrics.get("flow", 0.0),
"consistency_score": flow_metrics.get("consistency", 0.0),
"progression_score": flow_metrics.get("progression", 0.0),
"detailed_analysis": flow_metrics.get("analysis", ""),
"suggestions": flow_metrics.get("suggestions", [])
})
# Carry this section's content forward as context for the next iteration
prev_section_content = section_content
# Calculate overall scores
overall_flow = sum(r["flow_score"] for r in results) / len(results) if results else 0.0
overall_consistency = sum(r["consistency_score"] for r in results) / len(results) if results else 0.0
overall_progression = sum(r["progression_score"] for r in results) / len(results) if results else 0.0
logger.info("Advanced flow analysis completed successfully")
return {
"success": True,
"analysis": {
"overall_flow_score": overall_flow,
"overall_consistency_score": overall_consistency,
"overall_progression_score": overall_progression,
"sections": results
},
"mode": "advanced"
}
except Exception as e:
logger.error(f"Advanced flow analysis error: {e}")
return {"error": str(e)}
def start_blog_rewrite(self, request: Dict[str, Any]) -> str:
"""Start blog rewrite task with user feedback."""
return self.blog_rewriter.start_blog_rewrite(request)

View File

@@ -0,0 +1,536 @@
"""
Database-Backed Task Manager for Blog Writer
Replaces in-memory task storage with persistent database storage for
reliability, recovery, and analytics.
"""
import asyncio
import uuid
import json
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from loguru import logger
from services.blog_writer.logger_config import blog_writer_logger, log_function_call
from models.blog_models import (
BlogResearchRequest,
BlogOutlineRequest,
MediumBlogGenerateRequest,
MediumBlogGenerateResult,
)
from services.blog_writer.blog_service import BlogWriterService
class DatabaseTaskManager:
"""Database-backed task manager for blog writer operations."""
def __init__(self, db_connection):
self.db = db_connection
self.service = BlogWriterService()
self._cleanup_task = None
self._start_cleanup_task()
def _start_cleanup_task(self):
"""Start background task to clean up old completed tasks."""
async def cleanup_loop():
while True:
try:
await self.cleanup_old_tasks()
await asyncio.sleep(3600) # Run every hour
except Exception as e:
logger.error(f"Error in cleanup task: {e}")
await asyncio.sleep(300) # Wait 5 minutes on error
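# Note: this assumes a running event loop at construction time (e.g. the manager
# is instantiated from async startup code); otherwise create_task below will fail.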
self._cleanup_task = asyncio.create_task(cleanup_loop())
@log_function_call("create_task")
async def create_task(
self,
user_id: str,
task_type: str,
request_data: Dict[str, Any],
correlation_id: Optional[str] = None,
operation: Optional[str] = None,
priority: int = 0,
max_retries: int = 3,
metadata: Optional[Dict[str, Any]] = None
) -> str:
"""Create a new task in the database."""
task_id = str(uuid.uuid4())
correlation_id = correlation_id or str(uuid.uuid4())
query = """
INSERT INTO blog_writer_tasks
(id, user_id, task_type, status, request_data, correlation_id, operation, priority, max_retries, metadata)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
"""
await self.db.execute(
query,
task_id,
user_id,
task_type,
'pending',
json.dumps(request_data),
correlation_id,
operation,
priority,
max_retries,
json.dumps(metadata or {})
)
blog_writer_logger.log_operation_start(
"task_created",
task_id=task_id,
task_type=task_type,
user_id=user_id,
correlation_id=correlation_id
)
return task_id
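# For reference, a plausible minimal DDL for blog_writer_tasks, inferred from the
# queries in this class (an assumption -- the real migration may differ):
# CREATE TABLE blog_writer_tasks (
#     id UUID PRIMARY KEY,
#     user_id TEXT NOT NULL,
#     task_type TEXT NOT NULL,
#     status TEXT NOT NULL DEFAULT 'pending',
#     request_data JSONB,
#     result_data JSONB,
#     error_data JSONB,
#     correlation_id TEXT,
#     operation TEXT,
#     priority INT DEFAULT 0,
#     retry_count INT DEFAULT 0,
#     max_retries INT DEFAULT 3,
#     metadata JSONB,
#     created_at TIMESTAMPTZ DEFAULT NOW(),
#     updated_at TIMESTAMPTZ DEFAULT NOW(),
#     completed_at TIMESTAMPTZ
# );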
@log_function_call("get_task_status")
async def get_task_status(self, task_id: str) -> Optional[Dict[str, Any]]:
"""Get the status of a task."""
query = """
SELECT
id, user_id, task_type, status, request_data, result_data, error_data,
created_at, updated_at, completed_at, correlation_id, operation,
retry_count, max_retries, priority, metadata
FROM blog_writer_tasks
WHERE id = $1
"""
row = await self.db.fetchrow(query, task_id)
if not row:
return None
# Get progress messages
progress_query = """
SELECT timestamp, message, percentage, progress_type, metadata
FROM blog_writer_task_progress
WHERE task_id = $1
ORDER BY timestamp DESC
LIMIT 10
"""
progress_rows = await self.db.fetch(progress_query, task_id)
progress_messages = [
{
"timestamp": row["timestamp"].isoformat(),
"message": row["message"],
"percentage": float(row["percentage"]),
"progress_type": row["progress_type"],
"metadata": row["metadata"] or {}
}
for row in progress_rows
]
return {
"task_id": row["id"],
"user_id": row["user_id"],
"task_type": row["task_type"],
"status": row["status"],
"created_at": row["created_at"].isoformat(),
"updated_at": row["updated_at"].isoformat(),
"completed_at": row["completed_at"].isoformat() if row["completed_at"] else None,
"correlation_id": row["correlation_id"],
"operation": row["operation"],
"retry_count": row["retry_count"],
"max_retries": row["max_retries"],
"priority": row["priority"],
"progress_messages": progress_messages,
"result": json.loads(row["result_data"]) if row["result_data"] else None,
"error": json.loads(row["error_data"]) if row["error_data"] else None,
"metadata": json.loads(row["metadata"]) if row["metadata"] else {}
}
@log_function_call("update_task_status")
async def update_task_status(
self,
task_id: str,
status: str,
result_data: Optional[Dict[str, Any]] = None,
error_data: Optional[Dict[str, Any]] = None,
completed_at: Optional[datetime] = None
):
"""Update task status and data."""
query = """
UPDATE blog_writer_tasks
SET status = $2, result_data = $3, error_data = $4, completed_at = $5, updated_at = NOW()
WHERE id = $1
"""
await self.db.execute(
query,
task_id,
status,
json.dumps(result_data) if result_data else None,
json.dumps(error_data) if error_data else None,
completed_at or (datetime.now() if status in ['completed', 'failed', 'cancelled'] else None)
)
blog_writer_logger.log_operation_end(
"task_status_updated",
0,
success=status in ['completed', 'cancelled'],
task_id=task_id,
status=status
)
@log_function_call("update_progress")
async def update_progress(
self,
task_id: str,
message: str,
percentage: Optional[float] = None,
progress_type: str = "info",
metadata: Optional[Dict[str, Any]] = None
):
"""Update task progress."""
# Insert progress record
progress_query = """
INSERT INTO blog_writer_task_progress
(task_id, message, percentage, progress_type, metadata)
VALUES ($1, $2, $3, $4, $5)
"""
await self.db.execute(
progress_query,
task_id,
message,
percentage or 0.0,
progress_type,
json.dumps(metadata or {})
)
# Update task status to running if it was pending
status_query = """
UPDATE blog_writer_tasks
SET status = 'running', updated_at = NOW()
WHERE id = $1 AND status = 'pending'
"""
await self.db.execute(status_query, task_id)
logger.info(f"Progress update for task {task_id}: {message}")
@log_function_call("record_metrics")
async def record_metrics(
self,
task_id: str,
operation: str,
duration_ms: int,
token_usage: Optional[Dict[str, int]] = None,
api_calls: int = 0,
cache_hits: int = 0,
cache_misses: int = 0,
error_count: int = 0,
metadata: Optional[Dict[str, Any]] = None
):
"""Record performance metrics for a task."""
query = """
INSERT INTO blog_writer_task_metrics
(task_id, operation, duration_ms, token_usage, api_calls, cache_hits, cache_misses, error_count, metadata)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
"""
await self.db.execute(
query,
task_id,
operation,
duration_ms,
json.dumps(token_usage) if token_usage else None,
api_calls,
cache_hits,
cache_misses,
error_count,
json.dumps(metadata or {})
)
blog_writer_logger.log_performance(
f"task_metrics_{operation}",
duration_ms,
"ms",
task_id=task_id,
operation=operation,
api_calls=api_calls,
cache_hits=cache_hits,
cache_misses=cache_misses
)
@log_function_call("increment_retry_count")
async def increment_retry_count(self, task_id: str) -> int:
"""Increment retry count and return new count."""
query = """
UPDATE blog_writer_tasks
SET retry_count = retry_count + 1, updated_at = NOW()
WHERE id = $1
RETURNING retry_count
"""
result = await self.db.fetchval(query, task_id)
return result or 0
@log_function_call("cleanup_old_tasks")
async def cleanup_old_tasks(self, days: int = 7) -> int:
"""Clean up old completed tasks."""
# Use a parameterized interval instead of string interpolation to keep the query safe
query = """
DELETE FROM blog_writer_tasks
WHERE status IN ('completed', 'failed', 'cancelled')
AND created_at < NOW() - make_interval(days => $1)
"""
result = await self.db.execute(query, days)
deleted_count = int(result.split()[-1]) if result else 0
if deleted_count > 0:
logger.info(f"Cleaned up {deleted_count} old blog writer tasks")
return deleted_count
@log_function_call("get_user_tasks")
async def get_user_tasks(
self,
user_id: str,
limit: int = 50,
offset: int = 0,
status_filter: Optional[str] = None
) -> List[Dict[str, Any]]:
"""Get tasks for a specific user."""
query = """
SELECT
id, task_type, status, created_at, updated_at, completed_at,
operation, retry_count, max_retries, priority
FROM blog_writer_tasks
WHERE user_id = $1
"""
params = [user_id]
param_count = 1
if status_filter:
param_count += 1
query += f" AND status = ${param_count}"
params.append(status_filter)
query += f" ORDER BY created_at DESC LIMIT ${param_count + 1} OFFSET ${param_count + 2}"
params.extend([limit, offset])
rows = await self.db.fetch(query, *params)
return [
{
"task_id": row["id"],
"task_type": row["task_type"],
"status": row["status"],
"created_at": row["created_at"].isoformat(),
"updated_at": row["updated_at"].isoformat(),
"completed_at": row["completed_at"].isoformat() if row["completed_at"] else None,
"operation": row["operation"],
"retry_count": row["retry_count"],
"max_retries": row["max_retries"],
"priority": row["priority"]
}
for row in rows
]
@log_function_call("get_task_analytics")
async def get_task_analytics(self, days: int = 7) -> Dict[str, Any]:
"""Get task analytics for monitoring."""
query = """
SELECT
task_type,
status,
COUNT(*) as task_count,
AVG(EXTRACT(EPOCH FROM (COALESCE(completed_at, NOW()) - created_at))) as avg_duration_seconds,
COUNT(CASE WHEN status = 'completed' THEN 1 END) as completed_count,
COUNT(CASE WHEN status = 'failed' THEN 1 END) as failed_count,
COUNT(CASE WHEN status = 'running' THEN 1 END) as running_count
FROM blog_writer_tasks
WHERE created_at >= NOW() - INTERVAL '%s days'
GROUP BY task_type, status
ORDER BY task_type, status
""" % days
rows = await self.db.fetch(query)
analytics = {
"summary": {
"total_tasks": sum(row["task_count"] for row in rows),
"completed_tasks": sum(row["completed_count"] for row in rows),
"failed_tasks": sum(row["failed_count"] for row in rows),
"running_tasks": sum(row["running_count"] for row in rows)
},
"by_task_type": {},
"by_status": {}
}
for row in rows:
task_type = row["task_type"]
status = row["status"]
if task_type not in analytics["by_task_type"]:
analytics["by_task_type"][task_type] = {}
analytics["by_task_type"][task_type][status] = {
"count": row["task_count"],
"avg_duration_seconds": float(row["avg_duration_seconds"]) if row["avg_duration_seconds"] else 0
}
if status not in analytics["by_status"]:
analytics["by_status"][status] = 0
analytics["by_status"][status] += row["task_count"]
return analytics
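# Example (hypothetical sketch): the summary block can feed a monitoring endpoint.
# analytics = await task_manager.get_task_analytics(days=7)
# summary = analytics["summary"]
# logger.info(f"{summary['total_tasks']} tasks, {summary['failed_tasks']} failed in the last 7 days")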
# Task execution methods (same as original but with database persistence)
async def start_research_task(self, request: BlogResearchRequest, user_id: str) -> str:
"""Start a research operation and return a task ID."""
task_id = await self.create_task(
user_id=user_id,
task_type="research",
request_data=request.dict(),
operation="research_operation"
)
# Start the research operation in the background, threading user_id through
asyncio.create_task(self._run_research_task(task_id, request, user_id))
return task_id
async def start_outline_task(self, request: BlogOutlineRequest, user_id: str) -> str:
"""Start an outline generation operation and return a task ID."""
task_id = await self.create_task(
user_id=user_id,
task_type="outline",
request_data=request.dict(),
operation="outline_generation"
)
# Start the outline generation operation in the background, threading user_id through
asyncio.create_task(self._run_outline_generation_task(task_id, request, user_id))
return task_id
async def start_medium_generation_task(self, request: MediumBlogGenerateRequest, user_id: str) -> str:
"""Start a medium blog generation task."""
task_id = await self.create_task(
user_id=user_id,
task_type="medium_generation",
request_data=request.dict(),
operation="medium_blog_generation"
)
asyncio.create_task(self._run_medium_generation_task(task_id, request, user_id))
return task_id
async def _run_research_task(self, task_id: str, request: BlogResearchRequest, user_id: str):
"""Background task to run research and update status with progress messages."""
try:
await self.update_progress(task_id, "🔍 Starting research operation...", 0)
# Run the actual research with progress updates (user_id is required downstream)
result = await self.service.research_with_progress(request, task_id, user_id)
# Check if research failed gracefully
if not result.success:
await self.update_progress(
task_id,
f"❌ Research failed: {result.error_message or 'Unknown error'}",
100,
"error"
)
await self.update_task_status(
task_id,
"failed",
error_data={
"error_message": result.error_message,
"retry_suggested": result.retry_suggested,
"error_code": result.error_code,
"actionable_steps": result.actionable_steps
}
)
else:
await self.update_progress(
task_id,
f"✅ Research completed successfully! Found {len(result.sources)} sources and {len(result.search_queries or [])} search queries.",
100,
"success"
)
await self.update_task_status(
task_id,
"completed",
result_data=result.dict()
)
except Exception as e:
await self.update_progress(task_id, f"❌ Research failed with error: {str(e)}", 100, "error")
await self.update_task_status(
task_id,
"failed",
error_data={"error_message": str(e), "error_type": type(e).__name__}
)
blog_writer_logger.log_error(e, "research_task", context={"task_id": task_id})
async def _run_outline_generation_task(self, task_id: str, request: BlogOutlineRequest, user_id: str):
"""Background task to run outline generation and update status with progress messages."""
try:
await self.update_progress(task_id, "🧩 Starting outline generation...", 0)
# Run the actual outline generation with progress updates (user_id is required downstream)
result = await self.service.generate_outline_with_progress(request, task_id, user_id)
await self.update_progress(
task_id,
f"✅ Outline generated successfully! Created {len(result.outline)} sections with {len(result.title_options)} title options.",
100,
"success"
)
await self.update_task_status(task_id, "completed", result_data=result.dict())
except Exception as e:
await self.update_progress(task_id, f"❌ Outline generation failed: {str(e)}", 100, "error")
await self.update_task_status(
task_id,
"failed",
error_data={"error_message": str(e), "error_type": type(e).__name__}
)
blog_writer_logger.log_error(e, "outline_generation_task", context={"task_id": task_id})
async def _run_medium_generation_task(self, task_id: str, request: MediumBlogGenerateRequest, user_id: str):
"""Background task to generate a medium blog using a single structured JSON call."""
try:
await self.update_progress(task_id, "📦 Packaging outline and metadata...", 0)
# Basic guard: respect the global target word budget for medium generation
total_target = int(request.globalTargetWords or 1000)
if total_target > 1000:
raise ValueError("Global target words exceed 1000; medium generation not allowed")
result: MediumBlogGenerateResult = await self.service.generate_medium_blog_with_progress(
request,
task_id,
user_id,
)
if not result or not getattr(result, "sections", None):
raise ValueError("Empty generation result from model")
# Check if result came from cache
cache_hit = getattr(result, 'cache_hit', False)
if cache_hit:
await self.update_progress(task_id, "⚡ Found cached content - loading instantly!", 100, "success")
else:
await self.update_progress(task_id, "🤖 Generated fresh content with AI...", 100, "success")
await self.update_task_status(task_id, "completed", result_data=result.dict())
except Exception as e:
await self.update_progress(task_id, f"❌ Medium generation failed: {str(e)}", 100, "error")
await self.update_task_status(
task_id,
"failed",
error_data={"error_message": str(e), "error_type": type(e).__name__}
)
blog_writer_logger.log_error(e, "medium_generation_task", context={"task_id": task_id})

View File

@@ -0,0 +1,285 @@
"""
Blog Writer Exception Hierarchy
Defines custom exception classes for different failure modes in the AI Blog Writer.
Each exception includes error_code, user_message, retry_suggested, and actionable_steps.
"""
from typing import List, Optional, Dict, Any
from enum import Enum
class ErrorCategory(Enum):
"""Categories for error classification."""
TRANSIENT = "transient" # Temporary issues, retry recommended
PERMANENT = "permanent" # Permanent issues, no retry
USER_ERROR = "user_error" # User input issues, fix input
API_ERROR = "api_error" # External API issues
VALIDATION_ERROR = "validation_error" # Data validation issues
SYSTEM_ERROR = "system_error" # Internal system issues
class BlogWriterException(Exception):
"""Base exception for all Blog Writer errors."""
def __init__(
self,
message: str,
error_code: str,
user_message: str,
retry_suggested: bool = False,
actionable_steps: Optional[List[str]] = None,
error_category: ErrorCategory = ErrorCategory.SYSTEM_ERROR,
context: Optional[Dict[str, Any]] = None
):
super().__init__(message)
self.error_code = error_code
self.user_message = user_message
self.retry_suggested = retry_suggested
self.actionable_steps = actionable_steps or []
self.error_category = error_category
self.context = context or {}
def to_dict(self) -> Dict[str, Any]:
"""Convert exception to dictionary for API responses."""
return {
"error_code": self.error_code,
"user_message": self.user_message,
"retry_suggested": self.retry_suggested,
"actionable_steps": self.actionable_steps,
"error_category": self.error_category.value,
"context": self.context
}
class ResearchFailedException(BlogWriterException):
"""Raised when research operation fails."""
def __init__(
self,
message: str,
user_message: str = "Research failed. Please try again with different keywords or check your internet connection.",
retry_suggested: bool = True,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="RESEARCH_FAILED",
user_message=user_message,
retry_suggested=retry_suggested,
actionable_steps=[
"Try with different keywords",
"Check your internet connection",
"Wait a few minutes and try again",
"Contact support if the issue persists"
],
error_category=ErrorCategory.API_ERROR,
context=context
)
class OutlineGenerationException(BlogWriterException):
"""Raised when outline generation fails."""
def __init__(
self,
message: str,
user_message: str = "Outline generation failed. Please try again or adjust your research data.",
retry_suggested: bool = True,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="OUTLINE_GENERATION_FAILED",
user_message=user_message,
retry_suggested=retry_suggested,
actionable_steps=[
"Try generating outline again",
"Check if research data is complete",
"Try with different research keywords",
"Contact support if the issue persists"
],
error_category=ErrorCategory.API_ERROR,
context=context
)
class ContentGenerationException(BlogWriterException):
"""Raised when content generation fails."""
def __init__(
self,
message: str,
user_message: str = "Content generation failed. Please try again or adjust your outline.",
retry_suggested: bool = True,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="CONTENT_GENERATION_FAILED",
user_message=user_message,
retry_suggested=retry_suggested,
actionable_steps=[
"Try generating content again",
"Check if outline is complete",
"Try with a shorter outline",
"Contact support if the issue persists"
],
error_category=ErrorCategory.API_ERROR,
context=context
)
class SEOAnalysisException(BlogWriterException):
"""Raised when SEO analysis fails."""
def __init__(
self,
message: str,
user_message: str = "SEO analysis failed. Content was generated but SEO optimization is unavailable.",
retry_suggested: bool = True,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="SEO_ANALYSIS_FAILED",
user_message=user_message,
retry_suggested=retry_suggested,
actionable_steps=[
"Try SEO analysis again",
"Continue without SEO optimization",
"Contact support if the issue persists"
],
error_category=ErrorCategory.API_ERROR,
context=context
)
class APIRateLimitException(BlogWriterException):
"""Raised when API rate limit is exceeded."""
def __init__(
self,
message: str,
retry_after: Optional[int] = None,
context: Optional[Dict[str, Any]] = None
):
retry_message = f"Rate limit exceeded. Please wait {retry_after} seconds before trying again." if retry_after else "Rate limit exceeded. Please wait a few minutes before trying again."
super().__init__(
message=message,
error_code="API_RATE_LIMIT",
user_message=retry_message,
retry_suggested=True,
actionable_steps=[
f"Wait {retry_after or 60} seconds before trying again",
"Reduce the frequency of requests",
"Try again during off-peak hours",
"Contact support if you need higher limits"
],
error_category=ErrorCategory.API_ERROR,
context=context
)
class APITimeoutException(BlogWriterException):
"""Raised when API request times out."""
def __init__(
self,
message: str,
timeout_seconds: int = 60,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="API_TIMEOUT",
user_message=f"Request timed out after {timeout_seconds} seconds. Please try again.",
retry_suggested=True,
actionable_steps=[
"Try again with a shorter request",
"Check your internet connection",
"Try again during off-peak hours",
"Contact support if the issue persists"
],
error_category=ErrorCategory.TRANSIENT,
context=context
)
class ValidationException(BlogWriterException):
"""Raised when input validation fails."""
def __init__(
self,
message: str,
field: str,
user_message: str = "Invalid input provided. Please check your data and try again.",
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="VALIDATION_ERROR",
user_message=user_message,
retry_suggested=False,
actionable_steps=[
f"Check the {field} field",
"Ensure all required fields are filled",
"Verify data format is correct",
"Contact support if you need help"
],
error_category=ErrorCategory.USER_ERROR,
context=context
)
class CircuitBreakerOpenException(BlogWriterException):
"""Raised when circuit breaker is open."""
def __init__(
self,
message: str,
retry_after: int,
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="CIRCUIT_BREAKER_OPEN",
user_message=f"Service temporarily unavailable. Please wait {retry_after} seconds before trying again.",
retry_suggested=True,
actionable_steps=[
f"Wait {retry_after} seconds before trying again",
"Try again during off-peak hours",
"Contact support if the issue persists"
],
error_category=ErrorCategory.TRANSIENT,
context=context
)
class PartialSuccessException(BlogWriterException):
"""Raised when operation partially succeeds."""
def __init__(
self,
message: str,
partial_results: Dict[str, Any],
failed_operations: List[str],
user_message: str = "Operation partially completed. Some sections were generated successfully.",
context: Optional[Dict[str, Any]] = None
):
super().__init__(
message=message,
error_code="PARTIAL_SUCCESS",
user_message=user_message,
retry_suggested=True,
actionable_steps=[
"Review the generated content",
"Retry failed sections individually",
"Contact support if you need help with failed sections"
],
error_category=ErrorCategory.TRANSIENT,
context=context
)
self.partial_results = partial_results
self.failed_operations = failed_operations
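# A minimal handling sketch (hypothetical; the payload shape is illustrative only):
# every exception above serializes uniformly, so API handlers can stay generic.
def example_error_payload() -> Dict[str, Any]:
try:
raise ResearchFailedException("Google grounding call timed out")
except BlogWriterException as exc:
# to_dict() carries error_code, user_message, retry_suggested, actionable_steps
return {"success": False, **exc.to_dict()}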

View File

@@ -0,0 +1,298 @@
"""
Structured Logging Configuration for Blog Writer
Configures structured JSON logging with correlation IDs, context tracking,
and performance metrics for the AI Blog Writer system.
"""
import json
import uuid
import time
import sys
import functools
from typing import Dict, Any, Optional
from contextvars import ContextVar
from loguru import logger
from datetime import datetime
# Context variables for request tracking
correlation_id: ContextVar[str] = ContextVar('correlation_id', default='')
user_id: ContextVar[str] = ContextVar('user_id', default='')
task_id: ContextVar[str] = ContextVar('task_id', default='')
operation: ContextVar[str] = ContextVar('operation', default='')
class BlogWriterLogger:
"""Enhanced logger for Blog Writer with structured logging and context tracking."""
def __init__(self):
self._setup_logger()
def _setup_logger(self):
"""Configure structured logging by delegating to the shared service logger utility."""
from utils.logger_utils import get_service_logger
return get_service_logger("blog_writer")
def _json_formatter(self, record):
"""Format log record as structured JSON."""
# Extract context variables
correlation_id_val = correlation_id.get('')
user_id_val = user_id.get('')
task_id_val = task_id.get('')
operation_val = operation.get('')
# Build structured log entry
log_entry = {
"timestamp": datetime.fromtimestamp(record["time"].timestamp()).isoformat(),
"level": record["level"].name,
"logger": record["name"],
"function": record["function"],
"line": record["line"],
"message": record["message"],
"correlation_id": correlation_id_val,
"user_id": user_id_val,
"task_id": task_id_val,
"operation": operation_val,
"module": record["module"],
"process_id": record["process"].id,
"thread_id": record["thread"].id
}
# Add exception info if present
if record["exception"]:
log_entry["exception"] = {
"type": record["exception"].type.__name__,
"value": str(record["exception"].value),
"traceback": record["exception"].traceback
}
# Add extra fields from record
if record["extra"]:
log_entry.update(record["extra"])
return json.dumps(log_entry, default=str)
def set_context(
self,
correlation_id_val: Optional[str] = None,
user_id_val: Optional[str] = None,
task_id_val: Optional[str] = None,
operation_val: Optional[str] = None
):
"""Set context variables for the current request."""
if correlation_id_val:
correlation_id.set(correlation_id_val)
if user_id_val:
user_id.set(user_id_val)
if task_id_val:
task_id.set(task_id_val)
if operation_val:
operation.set(operation_val)
def clear_context(self):
"""Clear all context variables."""
correlation_id.set('')
user_id.set('')
task_id.set('')
operation.set('')
def generate_correlation_id(self) -> str:
"""Generate a new correlation ID."""
return str(uuid.uuid4())
def log_operation_start(
self,
operation_name: str,
**kwargs
):
"""Log the start of an operation with context."""
logger.info(
f"Starting {operation_name}",
extra={
"operation": operation_name,
"event_type": "operation_start",
**kwargs
}
)
def log_operation_end(
self,
operation_name: str,
duration_ms: float,
success: bool = True,
**kwargs
):
"""Log the end of an operation with performance metrics."""
logger.info(
f"Completed {operation_name} in {duration_ms:.2f}ms",
extra={
"operation": operation_name,
"event_type": "operation_end",
"duration_ms": duration_ms,
"success": success,
**kwargs
}
)
def log_api_call(
self,
api_name: str,
endpoint: str,
duration_ms: float,
status_code: Optional[int] = None,
token_usage: Optional[Dict[str, int]] = None,
**kwargs
):
"""Log API call with performance metrics."""
logger.info(
f"API call to {api_name}",
extra={
"event_type": "api_call",
"api_name": api_name,
"endpoint": endpoint,
"duration_ms": duration_ms,
"status_code": status_code,
"token_usage": token_usage,
**kwargs
}
)
def log_error(
self,
error: Exception,
operation: str,
context: Optional[Dict[str, Any]] = None
):
"""Log error with full context."""
# Safely format error message to avoid KeyError on format strings in error messages
error_str = str(error)
# Replace any curly braces that might be in the error message to avoid format string issues
safe_error_str = error_str.replace('{', '{{').replace('}', '}}')
logger.error(
f"Error in {operation}: {safe_error_str}",
extra={
"event_type": "error",
"operation": operation,
"error_type": type(error).__name__,
"error_message": error_str, # Keep original in extra, but use safe version in format string
"context": context or {}
},
exc_info=True
)
def log_performance(
self,
metric_name: str,
value: float,
unit: str = "ms",
**kwargs
):
"""Log performance metrics."""
logger.info(
f"Performance metric: {metric_name} = {value} {unit}",
extra={
"event_type": "performance",
"metric_name": metric_name,
"value": value,
"unit": unit,
**kwargs
}
)
# Global logger instance
blog_writer_logger = BlogWriterLogger()
def get_logger(name: str = "blog_writer"):
"""Get a logger instance with the given name."""
return logger.bind(name=name)
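# A minimal usage sketch (hypothetical values): set the request context once and
# every subsequent log line in that request carries the same correlation ID.
def example_context_usage() -> None:
cid = blog_writer_logger.generate_correlation_id()
blog_writer_logger.set_context(correlation_id_val=cid, user_id_val="user_42", operation_val="research")
blog_writer_logger.log_operation_start("research", keywords=["ai writing"])
blog_writer_logger.clear_context()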
def log_function_call(func_name: str, **kwargs):
"""Decorator to log function calls with timing."""
def decorator(func):
@functools.wraps(func)
async def async_wrapper(*args, **func_kwargs):
start_time = time.time()
correlation_id_val = correlation_id.get('')
blog_writer_logger.log_operation_start(
func_name,
function=func.__name__,
correlation_id=correlation_id_val,
**kwargs
)
try:
result = await func(*args, **func_kwargs)
duration_ms = (time.time() - start_time) * 1000
blog_writer_logger.log_operation_end(
func_name,
duration_ms,
success=True,
function=func.__name__,
correlation_id=correlation_id_val
)
return result
except Exception as e:
duration_ms = (time.time() - start_time) * 1000
blog_writer_logger.log_error(
e,
func_name,
context={
"function": func.__name__,
"duration_ms": duration_ms,
"correlation_id": correlation_id_val
}
)
raise
@functools.wraps(func)
def sync_wrapper(*args, **func_kwargs):
start_time = time.time()
correlation_id_val = correlation_id.get('')
blog_writer_logger.log_operation_start(
func_name,
function=func.__name__,
correlation_id=correlation_id_val,
**kwargs
)
try:
result = func(*args, **func_kwargs)
duration_ms = (time.time() - start_time) * 1000
blog_writer_logger.log_operation_end(
func_name,
duration_ms,
success=True,
function=func.__name__,
correlation_id=correlation_id_val
)
return result
except Exception as e:
duration_ms = (time.time() - start_time) * 1000
blog_writer_logger.log_error(
e,
func_name,
context={
"function": func.__name__,
"duration_ms": duration_ms,
"correlation_id": correlation_id_val
}
)
raise
# Return appropriate wrapper based on function type
import asyncio
if asyncio.iscoroutinefunction(func):
return async_wrapper
else:
return sync_wrapper
return decorator
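# A minimal usage sketch (hypothetical function): the decorator logs start, end,
# duration, and any exception for both sync and async callables.
@log_function_call("demo_operation")
async def example_decorated(x: int) -> int:
return x * 2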

View File

@@ -0,0 +1,25 @@
"""
Outline module for AI Blog Writer.
This module handles all outline-related functionality including:
- AI-powered outline generation
- Outline refinement and optimization
- Section enhancement and rebalancing
- Strategic content planning
"""
from .outline_service import OutlineService
from .outline_generator import OutlineGenerator
from .outline_optimizer import OutlineOptimizer
from .section_enhancer import SectionEnhancer
from .source_mapper import SourceToSectionMapper
from .grounding_engine import GroundingContextEngine
__all__ = [
'OutlineService',
'OutlineGenerator',
'OutlineOptimizer',
'SectionEnhancer',
'SourceToSectionMapper',
'GroundingContextEngine'
]

View File

@@ -0,0 +1,644 @@
"""
Grounding Context Engine - Enhanced utilization of grounding metadata.
This module extracts and utilizes rich contextual information from Google Search
grounding metadata to enhance outline generation with authoritative insights,
temporal relevance, and content relationships.
"""
from typing import Dict, Any, List, Tuple, Optional
from collections import Counter, defaultdict
from datetime import datetime, timedelta
import re
from loguru import logger
from models.blog_models import (
GroundingMetadata,
GroundingChunk,
GroundingSupport,
Citation,
BlogOutlineSection,
ResearchSource,
)
class GroundingContextEngine:
"""Extract and utilize rich context from grounding metadata."""
def __init__(self):
"""Initialize the grounding context engine."""
self.min_confidence_threshold = 0.7
self.high_confidence_threshold = 0.9
self.max_contextual_insights = 10
self.max_authority_sources = 5
# Authority indicators for source scoring
self.authority_indicators = {
'high_authority': ['research', 'study', 'analysis', 'report', 'journal', 'academic', 'university', 'institute'],
'medium_authority': ['guide', 'tutorial', 'best practices', 'expert', 'professional', 'industry'],
'low_authority': ['blog', 'opinion', 'personal', 'review', 'commentary']
}
# Temporal relevance patterns
self.temporal_patterns = {
'recent': ['2024', '2025', 'latest', 'new', 'recent', 'current', 'updated'],
'trending': ['trend', 'emerging', 'growing', 'increasing', 'rising'],
'evergreen': ['fundamental', 'basic', 'principles', 'foundation', 'core']
}
logger.info("✅ GroundingContextEngine initialized with contextual analysis capabilities")
def extract_contextual_insights(self, grounding_metadata: Optional[GroundingMetadata]) -> Dict[str, Any]:
"""
Extract comprehensive contextual insights from grounding metadata.
Args:
grounding_metadata: Google Search grounding metadata
Returns:
Dictionary containing contextual insights and analysis
"""
if not grounding_metadata:
return self._get_empty_insights()
logger.info("Extracting contextual insights from grounding metadata...")
insights = {
'confidence_analysis': self._analyze_confidence_patterns(grounding_metadata),
'authority_analysis': self._analyze_source_authority(grounding_metadata),
'temporal_analysis': self._analyze_temporal_relevance(grounding_metadata),
'content_relationships': self._analyze_content_relationships(grounding_metadata),
'citation_insights': self._analyze_citation_patterns(grounding_metadata),
'search_intent_insights': self._analyze_search_intent(grounding_metadata),
'quality_indicators': self._assess_quality_indicators(grounding_metadata)
}
logger.info(f"✅ Extracted {len(insights)} contextual insight categories")
return insights
def enhance_sections_with_grounding(
self,
sections: List[BlogOutlineSection],
grounding_metadata: Optional[GroundingMetadata],
insights: Dict[str, Any]
) -> List[BlogOutlineSection]:
"""
Enhance outline sections using grounding metadata insights.
Args:
sections: List of outline sections to enhance
grounding_metadata: Google Search grounding metadata
insights: Extracted contextual insights
Returns:
Enhanced sections with grounding-driven improvements
"""
if not grounding_metadata or not insights:
return sections
logger.info(f"Enhancing {len(sections)} sections with grounding insights...")
enhanced_sections = []
for section in sections:
enhanced_section = self._enhance_single_section(section, grounding_metadata, insights)
enhanced_sections.append(enhanced_section)
logger.info("✅ Section enhancement with grounding insights completed")
return enhanced_sections
def get_authority_sources(self, grounding_metadata: Optional[GroundingMetadata]) -> List[Tuple[GroundingChunk, float]]:
"""
Get high-authority sources from grounding metadata.
Args:
grounding_metadata: Google Search grounding metadata
Returns:
List of (chunk, authority_score) tuples sorted by authority
"""
if not grounding_metadata:
return []
authority_sources = []
for chunk in grounding_metadata.grounding_chunks:
authority_score = self._calculate_chunk_authority(chunk)
if authority_score >= 0.6: # Only include sources with reasonable authority
authority_sources.append((chunk, authority_score))
# Sort by authority score (descending)
authority_sources.sort(key=lambda x: x[1], reverse=True)
return authority_sources[:self.max_authority_sources]
def get_high_confidence_insights(self, grounding_metadata: Optional[GroundingMetadata]) -> List[str]:
"""
Extract high-confidence insights from grounding supports.
Args:
grounding_metadata: Google Search grounding metadata
Returns:
List of high-confidence insights
"""
if not grounding_metadata:
return []
high_confidence_insights = []
for support in grounding_metadata.grounding_supports:
if support.confidence_scores and max(support.confidence_scores) >= self.high_confidence_threshold:
# Extract meaningful insights from segment text
insight = self._extract_insight_from_segment(support.segment_text)
if insight:
high_confidence_insights.append(insight)
return high_confidence_insights[:self.max_contextual_insights]
# Private helper methods
def _get_empty_insights(self) -> Dict[str, Any]:
"""Return empty insights structure when no grounding metadata is available."""
return {
'confidence_analysis': {
'average_confidence': 0.0,
'high_confidence_sources_count': 0,
'confidence_distribution': {'high': 0, 'medium': 0, 'low': 0}
},
'authority_analysis': {
'average_authority_score': 0.0,
'high_authority_sources': [],
'authority_distribution': {'high': 0, 'medium': 0, 'low': 0}
},
'temporal_analysis': {
'recent_content': 0,
'trending_topics': [],
'evergreen_content': 0
},
'content_relationships': {
'related_concepts': [],
'content_gaps': [],
'concept_coverage_score': 0.0
},
'citation_insights': {
'citation_types': {},
'citation_density': 0.0
},
'search_intent_insights': {
'primary_intent': 'informational',
'intent_signals': [],
'user_questions': []
},
'quality_indicators': {
'overall_quality': 0.0,
'quality_factors': []
}
}
def _analyze_confidence_patterns(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze confidence patterns across grounding data."""
all_confidences = []
# Collect confidence scores from chunks
for chunk in grounding_metadata.grounding_chunks:
if chunk.confidence_score:
all_confidences.append(chunk.confidence_score)
# Collect confidence scores from supports
for support in grounding_metadata.grounding_supports:
all_confidences.extend(support.confidence_scores)
if not all_confidences:
return {
'average_confidence': 0.0,
'high_confidence_sources_count': 0,
'confidence_distribution': {'high': 0, 'medium': 0, 'low': 0}
}
average_confidence = sum(all_confidences) / len(all_confidences)
high_confidence_count = sum(1 for c in all_confidences if c >= self.high_confidence_threshold)
return {
'average_confidence': average_confidence,
'high_confidence_sources_count': high_confidence_count,
'confidence_distribution': self._get_confidence_distribution(all_confidences)
}
def _analyze_source_authority(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze source authority patterns."""
authority_scores = []
authority_distribution = defaultdict(int)
for chunk in grounding_metadata.grounding_chunks:
authority_score = self._calculate_chunk_authority(chunk)
authority_scores.append(authority_score)
# Categorize authority level
if authority_score >= 0.8:
authority_distribution['high'] += 1
elif authority_score >= 0.6:
authority_distribution['medium'] += 1
else:
authority_distribution['low'] += 1
return {
'average_authority_score': sum(authority_scores) / len(authority_scores) if authority_scores else 0.0,
'high_authority_sources': [{'title': 'High Authority Source', 'url': 'example.com', 'score': 0.9}], # Placeholder: real per-source details are not yet threaded through
'authority_distribution': dict(authority_distribution)
}
def _analyze_temporal_relevance(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze temporal relevance of grounding content."""
recent_content = 0
trending_topics = []
evergreen_content = 0
for chunk in grounding_metadata.grounding_chunks:
chunk_text = f"{chunk.title} {chunk.url}".lower()
# Check for recent indicators
if any(pattern in chunk_text for pattern in self.temporal_patterns['recent']):
recent_content += 1
# Check for trending indicators
if any(pattern in chunk_text for pattern in self.temporal_patterns['trending']):
trending_topics.append(chunk.title)
# Check for evergreen indicators
if any(pattern in chunk_text for pattern in self.temporal_patterns['evergreen']):
evergreen_content += 1
return {
'recent_content': recent_content,
'trending_topics': trending_topics[:5], # Limit to top 5
'evergreen_content': evergreen_content,
'temporal_balance': self._calculate_temporal_balance(recent_content, evergreen_content)
}
def _analyze_content_relationships(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze content relationships and identify gaps."""
all_text = []
# Collect text from chunks
for chunk in grounding_metadata.grounding_chunks:
all_text.append(chunk.title)
# Collect text from supports
for support in grounding_metadata.grounding_supports:
all_text.append(support.segment_text)
# Extract related concepts
related_concepts = self._extract_related_concepts(all_text)
# Identify potential content gaps
content_gaps = self._identify_content_gaps(all_text)
# Calculate concept coverage score (0-1 scale)
concept_coverage_score = min(1.0, len(related_concepts) / 10.0) if related_concepts else 0.0
return {
'related_concepts': related_concepts,
'content_gaps': content_gaps,
'concept_coverage_score': concept_coverage_score,
'gap_count': len(content_gaps)
}
def _analyze_citation_patterns(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze citation patterns and types."""
citation_types = Counter()
total_citations = len(grounding_metadata.citations)
for citation in grounding_metadata.citations:
citation_types[citation.citation_type] += 1
# Calculate citation density (citations per 1000 words of content)
total_content_length = sum(len(support.segment_text) for support in grounding_metadata.grounding_supports)
citation_density = (total_citations / max(total_content_length, 1)) * 1000 if total_content_length > 0 else 0.0
return {
'citation_types': dict(citation_types),
'total_citations': total_citations,
'citation_density': citation_density,
'citation_quality': self._assess_citation_quality(grounding_metadata.citations)
}
def _analyze_search_intent(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Analyze search intent signals from grounding data."""
intent_signals = []
user_questions = []
# Analyze search queries
for query in grounding_metadata.web_search_queries:
query_lower = query.lower()
# Identify intent signals
if any(word in query_lower for word in ['how', 'what', 'why', 'when', 'where']):
intent_signals.append('informational')
elif any(word in query_lower for word in ['best', 'top', 'compare', 'vs']):
intent_signals.append('comparison')
elif any(word in query_lower for word in ['buy', 'price', 'cost', 'deal']):
intent_signals.append('transactional')
# Extract potential user questions
if query_lower.startswith(('how to', 'what is', 'why does', 'when should')):
user_questions.append(query)
return {
'intent_signals': list(set(intent_signals)),
'user_questions': user_questions[:5], # Limit to top 5
'primary_intent': self._determine_primary_intent(intent_signals)
}
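    # Worked example (illustrative): queries ["how to improve seo", "best seo tools vs semrush"]
    # yield intent_signals {'informational', 'comparison'} and user_questions ['how to improve seo'];
    # with one signal of each type, primary_intent resolves to the first one counted ('informational').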
def _assess_quality_indicators(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
"""Assess overall quality indicators from grounding metadata."""
quality_factors = []
quality_score = 0.0
# Factor 1: Confidence levels
confidences = [chunk.confidence_score for chunk in grounding_metadata.grounding_chunks if chunk.confidence_score]
if confidences:
avg_confidence = sum(confidences) / len(confidences)
quality_score += avg_confidence * 0.3
quality_factors.append(f"Average confidence: {avg_confidence:.2f}")
# Factor 2: Source diversity
unique_domains = set()
for chunk in grounding_metadata.grounding_chunks:
try:
domain = chunk.url.split('/')[2] if '://' in chunk.url else chunk.url.split('/')[0]
unique_domains.add(domain)
            except Exception:
                continue
diversity_score = min(len(unique_domains) / 5.0, 1.0) # Normalize to 0-1
quality_score += diversity_score * 0.2
quality_factors.append(f"Source diversity: {len(unique_domains)} unique domains")
# Factor 3: Content depth
total_content_length = sum(len(support.segment_text) for support in grounding_metadata.grounding_supports)
depth_score = min(total_content_length / 5000.0, 1.0) # Normalize to 0-1
quality_score += depth_score * 0.2
quality_factors.append(f"Content depth: {total_content_length} characters")
# Factor 4: Citation quality
citation_quality = self._assess_citation_quality(grounding_metadata.citations)
quality_score += citation_quality * 0.3
quality_factors.append(f"Citation quality: {citation_quality:.2f}")
return {
'overall_quality': min(quality_score, 1.0),
'quality_factors': quality_factors,
'quality_grade': self._get_quality_grade(quality_score)
}
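    # Worked example (illustrative): avg confidence 0.8 (+0.24), 4 unique domains (+0.16),
    # 4,000 chars of content (+0.16), citation quality 0.5 (+0.15) => overall_quality 0.71, grade 'C'.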
def _enhance_single_section(
self,
section: BlogOutlineSection,
grounding_metadata: GroundingMetadata,
insights: Dict[str, Any]
) -> BlogOutlineSection:
"""Enhance a single section using grounding insights."""
# Extract relevant grounding data for this section
relevant_chunks = self._find_relevant_chunks(section, grounding_metadata)
relevant_supports = self._find_relevant_supports(section, grounding_metadata)
# Enhance subheadings with high-confidence insights
enhanced_subheadings = self._enhance_subheadings(section, relevant_supports, insights)
# Enhance key points with authoritative insights
enhanced_key_points = self._enhance_key_points(section, relevant_chunks, insights)
# Enhance keywords with related concepts
enhanced_keywords = self._enhance_keywords(section, insights)
return BlogOutlineSection(
id=section.id,
heading=section.heading,
subheadings=enhanced_subheadings,
key_points=enhanced_key_points,
references=section.references,
target_words=section.target_words,
keywords=enhanced_keywords
)
def _calculate_chunk_authority(self, chunk: GroundingChunk) -> float:
"""Calculate authority score for a grounding chunk."""
authority_score = 0.5 # Base score
chunk_text = f"{chunk.title} {chunk.url}".lower()
# Check for authority indicators
for level, indicators in self.authority_indicators.items():
for indicator in indicators:
if indicator in chunk_text:
if level == 'high_authority':
authority_score += 0.3
elif level == 'medium_authority':
authority_score += 0.2
else: # low_authority
authority_score -= 0.1
# Boost score based on confidence
if chunk.confidence_score:
authority_score += chunk.confidence_score * 0.2
return min(max(authority_score, 0.0), 1.0)
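    # Worked example (illustrative, assuming '.gov' appears under the high-authority indicators):
    # a chunk titled "CDC guidance" at cdc.gov with confidence 0.9 scores
    # 0.5 (base) + 0.3 (one high-authority match) + 0.18 (0.9 * 0.2) = 0.98.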
def _extract_insight_from_segment(self, segment_text: str) -> Optional[str]:
"""Extract meaningful insight from segment text."""
if not segment_text or len(segment_text.strip()) < 20:
return None
# Clean and truncate insight
insight = segment_text.strip()
if len(insight) > 200:
insight = insight[:200] + "..."
return insight
def _get_confidence_distribution(self, confidences: List[float]) -> Dict[str, int]:
"""Get distribution of confidence scores."""
distribution = {'high': 0, 'medium': 0, 'low': 0}
for confidence in confidences:
if confidence >= 0.8:
distribution['high'] += 1
elif confidence >= 0.6:
distribution['medium'] += 1
else:
distribution['low'] += 1
return distribution
def _calculate_temporal_balance(self, recent: int, evergreen: int) -> str:
"""Calculate temporal balance of content."""
total = recent + evergreen
if total == 0:
return 'unknown'
recent_ratio = recent / total
if recent_ratio > 0.7:
return 'recent_heavy'
elif recent_ratio < 0.3:
return 'evergreen_heavy'
else:
return 'balanced'
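    # Examples: recent=8, evergreen=2 => ratio 0.8 => 'recent_heavy';
    # recent=3, evergreen=4 => ratio ~0.43 => 'balanced'.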
def _extract_related_concepts(self, text_list: List[str]) -> List[str]:
"""Extract related concepts from text."""
# Simple concept extraction - could be enhanced with NLP
concepts = set()
for text in text_list:
# Extract capitalized words (potential concepts)
words = re.findall(r'\b[A-Z][a-z]+\b', text)
concepts.update(words)
return list(concepts)[:10] # Limit to top 10
def _identify_content_gaps(self, text_list: List[str]) -> List[str]:
"""Identify potential content gaps."""
# Simple gap identification - could be enhanced with more sophisticated analysis
gaps = []
# Look for common gap indicators
gap_indicators = ['missing', 'lack of', 'not covered', 'gap', 'unclear', 'unexplained']
for text in text_list:
text_lower = text.lower()
for indicator in gap_indicators:
if indicator in text_lower:
# Extract potential gap
gap = self._extract_gap_from_text(text, indicator)
if gap:
gaps.append(gap)
return gaps[:5] # Limit to top 5
def _extract_gap_from_text(self, text: str, indicator: str) -> Optional[str]:
"""Extract content gap from text containing gap indicator."""
# Simple extraction - could be enhanced
sentences = text.split('.')
for sentence in sentences:
if indicator in sentence.lower():
return sentence.strip()
return None
def _assess_citation_quality(self, citations: List[Citation]) -> float:
"""Assess quality of citations."""
if not citations:
return 0.0
quality_score = 0.0
for citation in citations:
# Check citation type
if citation.citation_type in ['expert_opinion', 'statistical_data', 'research_study']:
quality_score += 0.3
elif citation.citation_type in ['recent_news', 'case_study']:
quality_score += 0.2
else:
quality_score += 0.1
# Check text quality
if len(citation.text) > 20:
quality_score += 0.1
return min(quality_score / len(citations), 1.0)
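    # Worked example (illustrative): [research_study with 80-char text, recent_news with 15-char text]
    # scores ((0.3 + 0.1) + 0.2) / 2 = 0.3.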
def _determine_primary_intent(self, intent_signals: List[str]) -> str:
"""Determine primary search intent from signals."""
if not intent_signals:
return 'informational'
intent_counts = Counter(intent_signals)
return intent_counts.most_common(1)[0][0]
def _get_quality_grade(self, quality_score: float) -> str:
"""Get quality grade from score."""
if quality_score >= 0.9:
return 'A'
elif quality_score >= 0.8:
return 'B'
elif quality_score >= 0.7:
return 'C'
elif quality_score >= 0.6:
return 'D'
else:
return 'F'
def _find_relevant_chunks(self, section: BlogOutlineSection, grounding_metadata: GroundingMetadata) -> List[GroundingChunk]:
"""Find grounding chunks relevant to the section."""
relevant_chunks = []
section_text = f"{section.heading} {' '.join(section.subheadings)} {' '.join(section.key_points)}".lower()
for chunk in grounding_metadata.grounding_chunks:
chunk_text = chunk.title.lower()
# Simple relevance check - could be enhanced with semantic similarity
if any(word in chunk_text for word in section_text.split() if len(word) > 3):
relevant_chunks.append(chunk)
return relevant_chunks
def _find_relevant_supports(self, section: BlogOutlineSection, grounding_metadata: GroundingMetadata) -> List[GroundingSupport]:
"""Find grounding supports relevant to the section."""
relevant_supports = []
section_text = f"{section.heading} {' '.join(section.subheadings)} {' '.join(section.key_points)}".lower()
for support in grounding_metadata.grounding_supports:
support_text = support.segment_text.lower()
# Simple relevance check
if any(word in support_text for word in section_text.split() if len(word) > 3):
relevant_supports.append(support)
return relevant_supports
def _enhance_subheadings(self, section: BlogOutlineSection, relevant_supports: List[GroundingSupport], insights: Dict[str, Any]) -> List[str]:
"""Enhance subheadings with grounding insights."""
enhanced_subheadings = list(section.subheadings)
# Add high-confidence insights as subheadings
high_confidence_insights = self._get_high_confidence_insights_from_supports(relevant_supports)
for insight in high_confidence_insights[:2]: # Add up to 2 new subheadings
if insight not in enhanced_subheadings:
enhanced_subheadings.append(insight)
return enhanced_subheadings
def _enhance_key_points(self, section: BlogOutlineSection, relevant_chunks: List[GroundingChunk], insights: Dict[str, Any]) -> List[str]:
"""Enhance key points with authoritative insights."""
enhanced_key_points = list(section.key_points)
# Add insights from high-authority chunks
for chunk in relevant_chunks:
if chunk.confidence_score and chunk.confidence_score >= self.high_confidence_threshold:
insight = f"Based on {chunk.title}: {self._extract_key_insight(chunk)}"
if insight not in enhanced_key_points:
enhanced_key_points.append(insight)
return enhanced_key_points
def _enhance_keywords(self, section: BlogOutlineSection, insights: Dict[str, Any]) -> List[str]:
"""Enhance keywords with related concepts from grounding."""
enhanced_keywords = list(section.keywords)
# Add related concepts from grounding analysis
related_concepts = insights.get('content_relationships', {}).get('related_concepts', [])
for concept in related_concepts[:3]: # Add up to 3 new keywords
if concept.lower() not in [kw.lower() for kw in enhanced_keywords]:
enhanced_keywords.append(concept)
return enhanced_keywords
def _get_high_confidence_insights_from_supports(self, supports: List[GroundingSupport]) -> List[str]:
"""Get high-confidence insights from grounding supports."""
insights = []
for support in supports:
if support.confidence_scores and max(support.confidence_scores) >= self.high_confidence_threshold:
insight = self._extract_insight_from_segment(support.segment_text)
if insight:
insights.append(insight)
return insights
def _extract_key_insight(self, chunk: GroundingChunk) -> str:
"""Extract key insight from grounding chunk."""
# Simple extraction - could be enhanced
return f"High-confidence source with {chunk.confidence_score:.2f} confidence score"

View File

@@ -0,0 +1,94 @@
"""
Metadata Collector - Handles collection and formatting of outline metadata.
Collects source mapping stats, grounding insights, optimization results, and research coverage.
"""
from typing import Dict, Any, List
from loguru import logger
class MetadataCollector:
"""Handles collection and formatting of various metadata types for UI display."""
def __init__(self):
"""Initialize the metadata collector."""
pass
def collect_source_mapping_stats(self, mapped_sections, research):
"""Collect source mapping statistics for UI display."""
from models.blog_models import SourceMappingStats
total_sources = len(research.sources)
total_mapped = sum(len(section.references) for section in mapped_sections)
coverage_percentage = (total_mapped / total_sources * 100) if total_sources > 0 else 0.0
# Calculate average relevance score (simplified)
all_relevance_scores = []
for section in mapped_sections:
for ref in section.references:
if hasattr(ref, 'credibility_score') and ref.credibility_score:
all_relevance_scores.append(ref.credibility_score)
average_relevance = sum(all_relevance_scores) / len(all_relevance_scores) if all_relevance_scores else 0.0
high_confidence_mappings = sum(1 for score in all_relevance_scores if score >= 0.8)
return SourceMappingStats(
total_sources_mapped=total_mapped,
coverage_percentage=round(coverage_percentage, 1),
average_relevance_score=round(average_relevance, 3),
high_confidence_mappings=high_confidence_mappings
)
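    # Worked example (illustrative): 10 research sources with 7 references across sections
    # => coverage_percentage 70.0. A source cited by several sections is counted once per
    # section, so values above 100% indicate heavy source reuse rather than an error.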
def collect_grounding_insights(self, grounding_insights):
"""Collect grounding insights for UI display."""
from models.blog_models import GroundingInsights
return GroundingInsights(
confidence_analysis=grounding_insights.get('confidence_analysis'),
authority_analysis=grounding_insights.get('authority_analysis'),
temporal_analysis=grounding_insights.get('temporal_analysis'),
content_relationships=grounding_insights.get('content_relationships'),
citation_insights=grounding_insights.get('citation_insights'),
search_intent_insights=grounding_insights.get('search_intent_insights'),
quality_indicators=grounding_insights.get('quality_indicators')
)
def collect_optimization_results(self, optimized_sections, focus):
"""Collect optimization results for UI display."""
from models.blog_models import OptimizationResults
# Calculate a quality score based on section completeness
total_sections = len(optimized_sections)
complete_sections = sum(1 for section in optimized_sections
if section.heading and section.subheadings and section.key_points)
quality_score = (complete_sections / total_sections * 10) if total_sections > 0 else 0.0
improvements_made = [
"Enhanced section headings for better SEO",
"Optimized keyword distribution across sections",
"Improved content flow and logical progression",
"Balanced word count distribution",
"Enhanced subheadings for better readability"
]
return OptimizationResults(
overall_quality_score=round(quality_score, 1),
improvements_made=improvements_made,
optimization_focus=focus
)
def collect_research_coverage(self, research):
"""Collect research coverage metrics for UI display."""
from models.blog_models import ResearchCoverage
sources_utilized = len(research.sources)
content_gaps = research.keyword_analysis.get('content_gaps', [])
competitive_advantages = research.competitor_analysis.get('competitive_advantages', [])
return ResearchCoverage(
sources_utilized=sources_utilized,
content_gaps_identified=len(content_gaps),
competitive_advantages=competitive_advantages[:5] # Limit to top 5
)

View File

@@ -0,0 +1,323 @@
"""
Outline Generator - AI-powered outline generation from research data.
Generates comprehensive, SEO-optimized outlines using research intelligence.
"""
from typing import Dict, Any, List, Tuple
import asyncio
from loguru import logger
from models.blog_models import (
BlogOutlineRequest,
BlogOutlineResponse,
BlogOutlineSection,
)
from .source_mapper import SourceToSectionMapper
from .section_enhancer import SectionEnhancer
from .outline_optimizer import OutlineOptimizer
from .grounding_engine import GroundingContextEngine
from .title_generator import TitleGenerator
from .metadata_collector import MetadataCollector
from .prompt_builder import PromptBuilder
from .response_processor import ResponseProcessor
from .parallel_processor import ParallelProcessor
class OutlineGenerator:
"""Generates AI-powered outlines from research data."""
def __init__(self):
"""Initialize the outline generator with all enhancement modules."""
self.source_mapper = SourceToSectionMapper()
self.section_enhancer = SectionEnhancer()
self.outline_optimizer = OutlineOptimizer()
self.grounding_engine = GroundingContextEngine()
# Initialize extracted classes
self.title_generator = TitleGenerator()
self.metadata_collector = MetadataCollector()
self.prompt_builder = PromptBuilder()
self.response_processor = ResponseProcessor()
self.parallel_processor = ParallelProcessor(self.source_mapper, self.grounding_engine)
async def generate(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
"""
Generate AI-powered outline using research results.
Args:
request: Outline generation request with research data
user_id: User ID (required for subscription checks and usage tracking)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
# Extract research insights
research = request.research
primary_keywords = research.keyword_analysis.get('primary', [])
secondary_keywords = research.keyword_analysis.get('secondary', [])
content_angles = research.suggested_angles
sources = research.sources
search_intent = research.keyword_analysis.get('search_intent', 'informational')
# Check for custom instructions
custom_instructions = getattr(request, 'custom_instructions', None)
# Build comprehensive outline generation prompt with rich research data
outline_prompt = self.prompt_builder.build_outline_prompt(
primary_keywords, secondary_keywords, content_angles, sources,
search_intent, request, custom_instructions
)
logger.info("Generating AI-powered outline using research results")
# Define schema with proper property ordering (critical for Gemini API)
outline_schema = self.prompt_builder.get_outline_schema()
# Generate outline using structured JSON response with retry logic (user_id required)
outline_data = await self.response_processor.generate_with_retry(outline_prompt, outline_schema, user_id)
# Convert to BlogOutlineSection objects
outline_sections = self.response_processor.convert_to_sections(outline_data, sources)
# Run parallel processing for speed optimization (user_id required)
mapped_sections, grounding_insights = await self.parallel_processor.run_parallel_processing_async(
outline_sections, research, user_id
)
# Enhance sections with grounding insights
logger.info("Enhancing sections with grounding insights...")
grounding_enhanced_sections = self.grounding_engine.enhance_sections_with_grounding(
mapped_sections, research.grounding_metadata, grounding_insights
)
# Optimize outline for better flow, SEO, and engagement (user_id required)
logger.info("Optimizing outline for better flow and engagement...")
optimized_sections = await self.outline_optimizer.optimize(grounding_enhanced_sections, "comprehensive optimization", user_id)
# Rebalance word counts for optimal distribution
target_words = request.word_count or 1500
balanced_sections = self.outline_optimizer.rebalance_word_counts(optimized_sections, target_words)
# Extract title options - combine AI-generated with content angles
ai_title_options = outline_data.get('title_options', [])
content_angle_titles = self.title_generator.extract_content_angle_titles(research)
# Combine AI-generated titles with content angles
title_options = self.title_generator.combine_title_options(ai_title_options, content_angle_titles, primary_keywords)
logger.info(f"Generated optimized outline with {len(balanced_sections)} sections and {len(title_options)} title options")
# Collect metadata for enhanced UI
source_mapping_stats = self.metadata_collector.collect_source_mapping_stats(mapped_sections, research)
grounding_insights_data = self.metadata_collector.collect_grounding_insights(grounding_insights)
optimization_results = self.metadata_collector.collect_optimization_results(optimized_sections, "comprehensive optimization")
research_coverage = self.metadata_collector.collect_research_coverage(research)
return BlogOutlineResponse(
success=True,
title_options=title_options,
outline=balanced_sections,
source_mapping_stats=source_mapping_stats,
grounding_insights=grounding_insights_data,
optimization_results=optimization_results,
research_coverage=research_coverage
)
async def generate_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
"""
Outline generation method with progress updates for real-time feedback.
Args:
request: Outline generation request with research data
task_id: Task ID for progress updates
user_id: User ID (required for subscription checks and usage tracking)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
from api.blog_writer.task_manager import task_manager
# Extract research insights
research = request.research
primary_keywords = research.keyword_analysis.get('primary', [])
secondary_keywords = research.keyword_analysis.get('secondary', [])
content_angles = research.suggested_angles
sources = research.sources
search_intent = research.keyword_analysis.get('search_intent', 'informational')
# Check for custom instructions
custom_instructions = getattr(request, 'custom_instructions', None)
await task_manager.update_progress(task_id, "📊 Analyzing research data and building content strategy...")
# Build comprehensive outline generation prompt with rich research data
outline_prompt = self.prompt_builder.build_outline_prompt(
primary_keywords, secondary_keywords, content_angles, sources,
search_intent, request, custom_instructions
)
await task_manager.update_progress(task_id, "🤖 Generating AI-powered outline with research insights...")
# Define schema with proper property ordering (critical for Gemini API)
outline_schema = self.prompt_builder.get_outline_schema()
await task_manager.update_progress(task_id, "🔄 Making AI request to generate structured outline...")
# Generate outline using structured JSON response with retry logic (user_id required for subscription checks)
outline_data = await self.response_processor.generate_with_retry(outline_prompt, outline_schema, user_id, task_id)
await task_manager.update_progress(task_id, "📝 Processing outline structure and validating sections...")
# Convert to BlogOutlineSection objects
outline_sections = self.response_processor.convert_to_sections(outline_data, sources)
# Run parallel processing for speed optimization (user_id required for subscription checks)
mapped_sections, grounding_insights = await self.parallel_processor.run_parallel_processing(
outline_sections, research, user_id, task_id
)
# Enhance sections with grounding insights (depends on both previous tasks)
await task_manager.update_progress(task_id, "✨ Enhancing sections with grounding insights...")
grounding_enhanced_sections = self.grounding_engine.enhance_sections_with_grounding(
mapped_sections, research.grounding_metadata, grounding_insights
)
# Optimize outline for better flow, SEO, and engagement (user_id required for subscription checks)
await task_manager.update_progress(task_id, "🎯 Optimizing outline for better flow and engagement...")
optimized_sections = await self.outline_optimizer.optimize(grounding_enhanced_sections, "comprehensive optimization", user_id)
# Rebalance word counts for optimal distribution
await task_manager.update_progress(task_id, "⚖️ Rebalancing word count distribution...")
target_words = request.word_count or 1500
balanced_sections = self.outline_optimizer.rebalance_word_counts(optimized_sections, target_words)
# Extract title options - combine AI-generated with content angles
ai_title_options = outline_data.get('title_options', [])
content_angle_titles = self.title_generator.extract_content_angle_titles(research)
# Combine AI-generated titles with content angles
title_options = self.title_generator.combine_title_options(ai_title_options, content_angle_titles, primary_keywords)
await task_manager.update_progress(task_id, "✅ Outline generation and optimization completed successfully!")
# Collect metadata for enhanced UI
source_mapping_stats = self.metadata_collector.collect_source_mapping_stats(mapped_sections, research)
grounding_insights_data = self.metadata_collector.collect_grounding_insights(grounding_insights)
optimization_results = self.metadata_collector.collect_optimization_results(optimized_sections, "comprehensive optimization")
research_coverage = self.metadata_collector.collect_research_coverage(research)
return BlogOutlineResponse(
success=True,
title_options=title_options,
outline=balanced_sections,
source_mapping_stats=source_mapping_stats,
grounding_insights=grounding_insights_data,
optimization_results=optimization_results,
research_coverage=research_coverage
)
    async def enhance_section(self, section: BlogOutlineSection, focus: str = "general improvement", user_id: str = None) -> BlogOutlineSection:
        """
        Enhance a single section using AI with research context.
        Args:
            section: The section to enhance
            focus: Enhancement focus area (e.g., "SEO optimization", "engagement", "comprehensiveness")
            user_id: User ID (required downstream for subscription checks and usage tracking)
        Returns:
            Enhanced section with improved content
        """
        logger.info(f"Enhancing section '{section.heading}' with focus: {focus}")
        enhanced_section = await self.section_enhancer.enhance(section, focus, user_id)
        logger.info(f"✅ Section enhancement completed for '{section.heading}'")
        return enhanced_section
    async def optimize_outline(self, outline: List[BlogOutlineSection], focus: str = "comprehensive optimization", user_id: str = None) -> List[BlogOutlineSection]:
        """
        Optimize an entire outline for better flow, SEO, and engagement.
        Args:
            outline: List of sections to optimize
            focus: Optimization focus area
            user_id: User ID (required downstream for subscription checks and usage tracking)
        Returns:
            Optimized outline with improved flow and engagement
        """
        logger.info(f"Optimizing outline with {len(outline)} sections, focus: {focus}")
        optimized_outline = await self.outline_optimizer.optimize(outline, focus, user_id)
        logger.info(f"✅ Outline optimization completed for {len(optimized_outline)} sections")
        return optimized_outline
def rebalance_outline_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
"""
Rebalance word count distribution across outline sections.
Args:
outline: List of sections to rebalance
target_words: Total target word count
Returns:
Outline with rebalanced word counts
"""
logger.info(f"Rebalancing word counts for {len(outline)} sections, target: {target_words} words")
rebalanced_outline = self.outline_optimizer.rebalance_word_counts(outline, target_words)
logger.info(f"✅ Word count rebalancing completed")
return rebalanced_outline
def get_grounding_insights(self, research_data) -> Dict[str, Any]:
"""
Get grounding metadata insights for research data.
Args:
research_data: Research data with grounding metadata
Returns:
Dictionary containing grounding insights and analysis
"""
logger.info("Extracting grounding insights from research data...")
insights = self.grounding_engine.extract_contextual_insights(research_data.grounding_metadata)
logger.info(f"✅ Extracted {len(insights)} grounding insight categories")
return insights
def get_authority_sources(self, research_data) -> List[Tuple]:
"""
Get high-authority sources from grounding metadata.
Args:
research_data: Research data with grounding metadata
Returns:
List of (chunk, authority_score) tuples sorted by authority
"""
logger.info("Identifying high-authority sources from grounding metadata...")
authority_sources = self.grounding_engine.get_authority_sources(research_data.grounding_metadata)
logger.info(f"✅ Identified {len(authority_sources)} high-authority sources")
return authority_sources
def get_high_confidence_insights(self, research_data) -> List[str]:
"""
Get high-confidence insights from grounding metadata.
Args:
research_data: Research data with grounding metadata
Returns:
List of high-confidence insights
"""
logger.info("Extracting high-confidence insights from grounding metadata...")
insights = self.grounding_engine.get_high_confidence_insights(research_data.grounding_metadata)
logger.info(f"✅ Extracted {len(insights)} high-confidence insights")
return insights

View File

@@ -0,0 +1,137 @@
"""
Outline Optimizer - AI-powered outline optimization and rebalancing.
Optimizes outlines for better flow, SEO, and engagement.
"""
from typing import List
from loguru import logger
from models.blog_models import BlogOutlineSection
class OutlineOptimizer:
"""Optimizes outlines for better flow, SEO, and engagement."""
async def optimize(self, outline: List[BlogOutlineSection], focus: str, user_id: str) -> List[BlogOutlineSection]:
"""Optimize entire outline for better flow, SEO, and engagement.
Args:
outline: List of outline sections to optimize
focus: Optimization focus (e.g., "general optimization")
user_id: User ID (required for subscription checks and usage tracking)
Returns:
List of optimized outline sections
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for outline optimization (subscription checks and usage tracking)")
outline_text = "\n".join([f"{i+1}. {s.heading}" for i, s in enumerate(outline)])
optimization_prompt = f"""Optimize this blog outline for better flow, engagement, and SEO:
Current Outline:
{outline_text}
Optimization Focus: {focus}
Goals: Improve narrative flow, enhance SEO, increase engagement, ensure comprehensive coverage.
Return JSON format:
{{
"outline": [
{{
"heading": "Optimized heading",
"subheadings": ["subheading 1", "subheading 2"],
"key_points": ["point 1", "point 2"],
"target_words": 300,
"keywords": ["keyword1", "keyword2"]
}}
]
}}"""
try:
from services.llm_providers.main_text_generation import llm_text_gen
optimization_schema = {
"type": "object",
"properties": {
"outline": {
"type": "array",
"items": {
"type": "object",
"properties": {
"heading": {"type": "string"},
"subheadings": {"type": "array", "items": {"type": "string"}},
"key_points": {"type": "array", "items": {"type": "string"}},
"target_words": {"type": "integer"},
"keywords": {"type": "array", "items": {"type": "string"}}
},
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
}
}
},
"required": ["outline"],
"propertyOrdering": ["outline"]
}
optimized_data = llm_text_gen(
prompt=optimization_prompt,
json_struct=optimization_schema,
system_prompt=None,
user_id=user_id
)
# Handle the new schema format with "outline" wrapper
if isinstance(optimized_data, dict) and 'outline' in optimized_data:
optimized_sections = []
for i, section_data in enumerate(optimized_data['outline']):
section = BlogOutlineSection(
id=f"s{i+1}",
heading=section_data.get('heading', f'Section {i+1}'),
subheadings=section_data.get('subheadings', []),
key_points=section_data.get('key_points', []),
references=outline[i].references if i < len(outline) else [],
target_words=section_data.get('target_words', 300),
keywords=section_data.get('keywords', [])
)
optimized_sections.append(section)
logger.info(f"✅ Outline optimization completed: {len(optimized_sections)} sections optimized")
return optimized_sections
else:
logger.warning(f"Invalid optimization response format: {type(optimized_data)}")
except Exception as e:
logger.warning(f"AI outline optimization failed: {e}")
logger.info("Returning original outline without optimization")
return outline
    def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
        """Rebalance word count distribution across sections."""
        total_sections = len(outline)
        if total_sections == 0:
            return outline
        if total_sections <= 2:
            # Too few sections for an intro/body/conclusion split; distribute evenly
            for section in outline:
                section.target_words = target_words // total_sections
            return outline
        # Calculate target distribution
        intro_words = int(target_words * 0.12)  # 12% for intro
        conclusion_words = int(target_words * 0.12)  # 12% for conclusion
        main_content_words = target_words - intro_words - conclusion_words
        # Distribute main content words across the middle sections only, so the
        # per-section targets actually sum to target_words
        middle_sections = total_sections - 2
        words_per_section = main_content_words // middle_sections
        remainder = main_content_words % middle_sections
        for i, section in enumerate(outline):
            if i == 0:  # First section (intro)
                section.target_words = intro_words
            elif i == total_sections - 1:  # Last section (conclusion)
                section.target_words = conclusion_words
            else:  # Main content sections
                section.target_words = words_per_section + (1 if (i - 1) < remainder else 0)
        return outline
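    # Worked example: target_words=1500 with 5 sections => intro 180, conclusion 180,
    # and 1140 words split across the 3 middle sections (380 each), summing to 1500.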

View File

@@ -0,0 +1,268 @@
"""
Outline Service - Core outline generation and management functionality.
Handles AI-powered outline generation, refinement, and optimization.
"""
from typing import Dict, Any, List
import asyncio
from loguru import logger
from models.blog_models import (
BlogOutlineRequest,
BlogOutlineResponse,
BlogOutlineRefineRequest,
BlogOutlineSection,
)
from .outline_generator import OutlineGenerator
from .outline_optimizer import OutlineOptimizer
from .section_enhancer import SectionEnhancer
from services.cache.persistent_outline_cache import persistent_outline_cache
class OutlineService:
"""Service for generating and managing blog outlines using AI."""
def __init__(self):
self.outline_generator = OutlineGenerator()
self.outline_optimizer = OutlineOptimizer()
self.section_enhancer = SectionEnhancer()
async def generate_outline(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
"""
Stage 2: Content Planning with AI-generated outline using research results.
Uses Gemini with research data to create comprehensive, SEO-optimized outline.
Args:
request: Outline generation request with research data
user_id: User ID (required for subscription checks and usage tracking)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
# Extract cache parameters - use original user keywords for consistent caching
keywords = request.research.original_keywords or request.research.keyword_analysis.get('primary', [])
industry = getattr(request.persona, 'industry', 'general') if request.persona else 'general'
target_audience = getattr(request.persona, 'target_audience', 'general') if request.persona else 'general'
word_count = request.word_count or 1500
custom_instructions = request.custom_instructions or ""
persona_data = request.persona.dict() if request.persona else None
# Check cache first
cached_result = persistent_outline_cache.get_cached_outline(
keywords=keywords,
industry=industry,
target_audience=target_audience,
word_count=word_count,
custom_instructions=custom_instructions,
persona_data=persona_data
)
if cached_result:
logger.info(f"Using cached outline for keywords: {keywords}")
return BlogOutlineResponse(**cached_result)
# Generate new outline if not cached (user_id required)
logger.info(f"Generating new outline for keywords: {keywords}")
result = await self.outline_generator.generate(request, user_id)
# Cache the result
persistent_outline_cache.cache_outline(
keywords=keywords,
industry=industry,
target_audience=target_audience,
word_count=word_count,
custom_instructions=custom_instructions,
persona_data=persona_data,
result=result.dict()
)
return result
    async def generate_outline_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
        """
        Outline generation method with progress updates for real-time feedback.
        Args:
            request: Outline generation request with research data
            task_id: Task ID for progress updates
            user_id: User ID (required for subscription checks and usage tracking)
        Raises:
            ValueError: If user_id is not provided
        """
        if not user_id:
            raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
        # Extract cache parameters - use original user keywords for consistent caching
keywords = request.research.original_keywords or request.research.keyword_analysis.get('primary', [])
industry = getattr(request.persona, 'industry', 'general') if request.persona else 'general'
target_audience = getattr(request.persona, 'target_audience', 'general') if request.persona else 'general'
word_count = request.word_count or 1500
custom_instructions = request.custom_instructions or ""
persona_data = request.persona.dict() if request.persona else None
# Check cache first
cached_result = persistent_outline_cache.get_cached_outline(
keywords=keywords,
industry=industry,
target_audience=target_audience,
word_count=word_count,
custom_instructions=custom_instructions,
persona_data=persona_data
)
if cached_result:
logger.info(f"Using cached outline for keywords: {keywords} (with progress updates)")
# Update progress to show cache hit
from api.blog_writer.task_manager import task_manager
await task_manager.update_progress(task_id, "✅ Using cached outline (saved generation time!)")
return BlogOutlineResponse(**cached_result)
# Generate new outline if not cached
logger.info(f"Generating new outline for keywords: {keywords} (with progress updates)")
result = await self.outline_generator.generate_with_progress(request, task_id, user_id)
# Cache the result
persistent_outline_cache.cache_outline(
keywords=keywords,
industry=industry,
target_audience=target_audience,
word_count=word_count,
custom_instructions=custom_instructions,
persona_data=persona_data,
result=result.dict()
)
return result
async def refine_outline(self, request: BlogOutlineRefineRequest) -> BlogOutlineResponse:
"""
Refine outline with HITL (Human-in-the-Loop) operations
Supports add, remove, move, merge, rename operations
"""
outline = request.outline.copy()
operation = request.operation.lower()
section_id = request.section_id
payload = request.payload or {}
try:
if operation == 'add':
# Add new section
new_section = BlogOutlineSection(
id=f"s{len(outline) + 1}",
heading=payload.get('heading', 'New Section'),
subheadings=payload.get('subheadings', []),
key_points=payload.get('key_points', []),
references=[],
target_words=payload.get('target_words', 300)
)
outline.append(new_section)
logger.info(f"Added new section: {new_section.heading}")
elif operation == 'remove' and section_id:
# Remove section
outline = [s for s in outline if s.id != section_id]
logger.info(f"Removed section: {section_id}")
elif operation == 'rename' and section_id:
# Rename section
for section in outline:
if section.id == section_id:
section.heading = payload.get('heading', section.heading)
break
logger.info(f"Renamed section {section_id} to: {payload.get('heading')}")
elif operation == 'move' and section_id:
# Move section (reorder)
direction = payload.get('direction', 'down') # 'up' or 'down'
current_index = next((i for i, s in enumerate(outline) if s.id == section_id), -1)
if current_index != -1:
if direction == 'up' and current_index > 0:
outline[current_index], outline[current_index - 1] = outline[current_index - 1], outline[current_index]
elif direction == 'down' and current_index < len(outline) - 1:
outline[current_index], outline[current_index + 1] = outline[current_index + 1], outline[current_index]
logger.info(f"Moved section {section_id} {direction}")
elif operation == 'merge' and section_id:
# Merge with next section
current_index = next((i for i, s in enumerate(outline) if s.id == section_id), -1)
if current_index != -1 and current_index < len(outline) - 1:
current_section = outline[current_index]
next_section = outline[current_index + 1]
# Merge sections
current_section.heading = f"{current_section.heading} & {next_section.heading}"
current_section.subheadings.extend(next_section.subheadings)
current_section.key_points.extend(next_section.key_points)
current_section.references.extend(next_section.references)
current_section.target_words = (current_section.target_words or 0) + (next_section.target_words or 0)
# Remove the next section
outline.pop(current_index + 1)
logger.info(f"Merged section {section_id} with next section")
elif operation == 'update' and section_id:
# Update section details
for section in outline:
if section.id == section_id:
if 'heading' in payload:
section.heading = payload['heading']
if 'subheadings' in payload:
section.subheadings = payload['subheadings']
if 'key_points' in payload:
section.key_points = payload['key_points']
if 'target_words' in payload:
section.target_words = payload['target_words']
break
logger.info(f"Updated section {section_id}")
# Reassign IDs to maintain order
for i, section in enumerate(outline):
section.id = f"s{i+1}"
return BlogOutlineResponse(
success=True,
title_options=["Refined Outline"],
outline=outline
)
except Exception as e:
logger.error(f"Outline refinement failed: {e}")
return BlogOutlineResponse(
success=False,
title_options=["Error"],
outline=request.outline
)
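    # Illustrative payload shapes accepted by refine_outline (taken from the handlers above):
    #   add:    {"heading": "FAQ", "subheadings": [...], "key_points": [...], "target_words": 250}
    #   rename: {"heading": "New heading"}
    #   move:   {"direction": "up"}  # or "down"
    #   update: any subset of {"heading", "subheadings", "key_points", "target_words"}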
    async def enhance_section_with_ai(self, section: BlogOutlineSection, focus: str = "general improvement", user_id: str = None) -> BlogOutlineSection:
        """Enhance a section using AI with research context (user_id required downstream)."""
        return await self.section_enhancer.enhance(section, focus, user_id)
    async def optimize_outline_with_ai(self, outline: List[BlogOutlineSection], focus: str = "general optimization", user_id: str = None) -> List[BlogOutlineSection]:
        """Optimize entire outline for better flow, SEO, and engagement (user_id required downstream)."""
        return await self.outline_optimizer.optimize(outline, focus, user_id)
def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
"""Rebalance word count distribution across sections."""
return self.outline_optimizer.rebalance_word_counts(outline, target_words)
# Cache Management Methods
def get_outline_cache_stats(self) -> Dict[str, Any]:
"""Get outline cache statistics."""
return persistent_outline_cache.get_cache_stats()
def clear_outline_cache(self):
"""Clear all cached outline entries."""
persistent_outline_cache.clear_cache()
logger.info("Outline cache cleared")
def invalidate_outline_cache_for_keywords(self, keywords: List[str]):
"""
Invalidate outline cache entries for specific keywords.
Useful when research data is updated.
Args:
keywords: Keywords to invalidate cache for
"""
persistent_outline_cache.invalidate_cache_for_keywords(keywords)
logger.info(f"Invalidated outline cache for keywords: {keywords}")
def get_recent_outline_cache_entries(self, limit: int = 20) -> List[Dict[str, Any]]:
"""Get recent outline cache entries for debugging."""
return persistent_outline_cache.get_cache_entries(limit)

View File

@@ -0,0 +1,121 @@
"""
Parallel Processor - Handles parallel processing of outline generation tasks.
Manages concurrent execution of source mapping and grounding insights extraction.
"""
import asyncio
from typing import Tuple, Any
from loguru import logger
class ParallelProcessor:
"""Handles parallel processing of outline generation tasks for speed optimization."""
def __init__(self, source_mapper, grounding_engine):
"""Initialize the parallel processor with required dependencies."""
self.source_mapper = source_mapper
self.grounding_engine = grounding_engine
async def run_parallel_processing(self, outline_sections, research, user_id: str, task_id: str = None) -> Tuple[Any, Any]:
"""
Run source mapping and grounding insights extraction in parallel.
Args:
outline_sections: List of outline sections to process
research: Research data object
user_id: User ID (required for subscription checks and usage tracking)
task_id: Optional task ID for progress updates
Returns:
Tuple of (mapped_sections, grounding_insights)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for parallel processing (subscription checks and usage tracking)")
if task_id:
from api.blog_writer.task_manager import task_manager
await task_manager.update_progress(task_id, "⚡ Running parallel processing for maximum speed...")
logger.info("Running parallel processing for maximum speed...")
# Run these tasks in parallel to save time
source_mapping_task = asyncio.create_task(
self._run_source_mapping(outline_sections, research, task_id, user_id)
)
grounding_insights_task = asyncio.create_task(
self._run_grounding_insights_extraction(research, task_id)
)
# Wait for both parallel tasks to complete
mapped_sections, grounding_insights = await asyncio.gather(
source_mapping_task,
grounding_insights_task
)
return mapped_sections, grounding_insights
async def run_parallel_processing_async(self, outline_sections, research, user_id: str) -> Tuple[Any, Any]:
"""
Run parallel processing without progress updates (for non-progress methods).
Args:
outline_sections: List of outline sections to process
research: Research data object
user_id: User ID (required for subscription checks and usage tracking)
Returns:
Tuple of (mapped_sections, grounding_insights)
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for parallel processing (subscription checks and usage tracking)")
logger.info("Running parallel processing for maximum speed...")
# Run these tasks in parallel to save time
source_mapping_task = asyncio.create_task(
self._run_source_mapping_async(outline_sections, research, user_id)
)
grounding_insights_task = asyncio.create_task(
self._run_grounding_insights_extraction_async(research)
)
# Wait for both parallel tasks to complete
mapped_sections, grounding_insights = await asyncio.gather(
source_mapping_task,
grounding_insights_task
)
return mapped_sections, grounding_insights
    async def _run_source_mapping(self, outline_sections, research, task_id, user_id: str):
        """Run source mapping in a worker thread so it can overlap with other work."""
        if task_id:
            from api.blog_writer.task_manager import task_manager
            await task_manager.update_progress(task_id, "🔗 Applying intelligent source-to-section mapping...")
        # map_sources_to_sections is synchronous; hand it to a worker thread so the
        # two asyncio tasks run concurrently instead of blocking the event loop
        return await asyncio.to_thread(self.source_mapper.map_sources_to_sections, outline_sections, research, user_id)
    async def _run_grounding_insights_extraction(self, research, task_id):
        """Run grounding insights extraction in a worker thread."""
        if task_id:
            from api.blog_writer.task_manager import task_manager
            await task_manager.update_progress(task_id, "🧠 Extracting grounding metadata insights...")
        return await asyncio.to_thread(self.grounding_engine.extract_contextual_insights, research.grounding_metadata)
    async def _run_source_mapping_async(self, outline_sections, research, user_id: str):
        """Run source mapping in a worker thread (variant without progress updates)."""
        logger.info("Applying intelligent source-to-section mapping...")
        return await asyncio.to_thread(self.source_mapper.map_sources_to_sections, outline_sections, research, user_id)
    async def _run_grounding_insights_extraction_async(self, research):
        """Run grounding insights extraction in a worker thread (variant without progress updates)."""
        logger.info("Extracting grounding metadata insights...")
        return await asyncio.to_thread(self.grounding_engine.extract_contextual_insights, research.grounding_metadata)

View File

@@ -0,0 +1,127 @@
"""
Prompt Builder - Handles building of AI prompts for outline generation.
Constructs comprehensive prompts with research data, keywords, and strategic requirements.
"""
from typing import Dict, Any, List
class PromptBuilder:
"""Handles building of comprehensive AI prompts for outline generation."""
def __init__(self):
"""Initialize the prompt builder."""
pass
def build_outline_prompt(self, primary_keywords: List[str], secondary_keywords: List[str],
content_angles: List[str], sources: List, search_intent: str,
request, custom_instructions: str = None) -> str:
"""Build the comprehensive outline generation prompt using filtered research data."""
# Use the filtered research data (already cleaned by ResearchDataFilter)
research = request.research
primary_kw_text = ', '.join(primary_keywords) if primary_keywords else (request.topic or ', '.join(getattr(request.research, 'original_keywords', []) or ['the target topic']))
secondary_kw_text = ', '.join(secondary_keywords) if secondary_keywords else "None provided"
long_tail_text = ', '.join(research.keyword_analysis.get('long_tail', [])) if research and research.keyword_analysis else "None discovered"
semantic_text = ', '.join(research.keyword_analysis.get('semantic_keywords', [])) if research and research.keyword_analysis else "None discovered"
trending_text = ', '.join(research.keyword_analysis.get('trending_terms', [])) if research and research.keyword_analysis else "None discovered"
content_gap_text = ', '.join(research.keyword_analysis.get('content_gaps', [])) if research and research.keyword_analysis else "None identified"
content_angle_text = ', '.join(content_angles) if content_angles else "No explicit angles provided; infer compelling angles from research insights."
competitor_text = ', '.join(research.competitor_analysis.get('top_competitors', [])) if research and research.competitor_analysis else "Not available"
opportunity_text = ', '.join(research.competitor_analysis.get('opportunities', [])) if research and research.competitor_analysis else "Not available"
advantages_text = ', '.join(research.competitor_analysis.get('competitive_advantages', [])) if research and research.competitor_analysis else "Not available"
return f"""Create a comprehensive blog outline for: {primary_kw_text}
CONTEXT:
Search Intent: {search_intent}
Target: {request.word_count or 1500} words
Industry: {getattr(request.persona, 'industry', 'General') if request.persona else 'General'}
Audience: {getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'}
KEYWORDS:
Primary: {primary_kw_text}
Secondary: {secondary_kw_text}
Long-tail: {long_tail_text}
Semantic: {semantic_text}
Trending: {trending_text}
Content Gaps: {content_gap_text}
CONTENT ANGLES / STORYLINES: {content_angle_text}
COMPETITIVE INTELLIGENCE:
Top Competitors: {competitor_text}
Market Opportunities: {opportunity_text}
Competitive Advantages: {advantages_text}
RESEARCH SOURCES: {len(sources)} authoritative sources available
{f"CUSTOM INSTRUCTIONS: {custom_instructions}" if custom_instructions else ""}
STRATEGIC REQUIREMENTS:
- Create SEO-optimized headings with natural keyword integration
- Surface the strongest research-backed angles within the outline
- Build logical narrative flow from problem to solution
- Include data-driven insights from research sources
- Address content gaps and market opportunities
- Optimize for search intent and user questions
- Ensure engaging, actionable content throughout
Return JSON format:
{{
  "title_options": [
    "Title option 1",
    "Title option 2",
    "Title option 3"
  ],
  "outline": [
    {{
      "heading": "Section heading with primary keyword",
      "subheadings": ["Subheading 1", "Subheading 2", "Subheading 3"],
      "key_points": ["Key point 1", "Key point 2", "Key point 3"],
      "target_words": 300,
      "keywords": ["primary keyword", "secondary keyword"]
    }}
  ]
}}"""
def get_outline_schema(self) -> Dict[str, Any]:
"""Get the structured JSON schema for outline generation."""
return {
"type": "object",
"properties": {
"title_options": {
"type": "array",
"items": {
"type": "string"
}
},
"outline": {
"type": "array",
"items": {
"type": "object",
"properties": {
"heading": {"type": "string"},
"subheadings": {
"type": "array",
"items": {"type": "string"}
},
"key_points": {
"type": "array",
"items": {"type": "string"}
},
"target_words": {"type": "integer"},
"keywords": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
}
}
},
"required": ["title_options", "outline"],
"propertyOrdering": ["title_options", "outline"]
}
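    # A minimal instance that satisfies this schema (illustrative):
    # {"title_options": ["Option A", "Option B", "Option C"],
    #  "outline": [{"heading": "Intro", "subheadings": ["Why it matters"],
    #               "key_points": ["Hook the reader"], "target_words": 180,
    #               "keywords": ["primary keyword"]}]}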

View File

@@ -0,0 +1,120 @@
"""
Response Processor - Handles AI response processing and retry logic.
Processes AI responses, handles retries, and converts data to proper formats.
"""
from typing import Dict, Any, List
import asyncio
from loguru import logger
from models.blog_models import BlogOutlineSection
class ResponseProcessor:
"""Handles AI response processing, retry logic, and data conversion."""
def __init__(self):
"""Initialize the response processor."""
pass
async def generate_with_retry(self, prompt: str, schema: Dict[str, Any], user_id: str, task_id: str = None) -> Dict[str, Any]:
"""Generate outline with retry logic for API failures.
Args:
prompt: The prompt for outline generation
schema: JSON schema for structured response
user_id: User ID (required for subscription checks and usage tracking)
task_id: Optional task ID for progress updates
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
from services.llm_providers.main_text_generation import llm_text_gen
from api.blog_writer.task_manager import task_manager
max_retries = 2 # Conservative retry for expensive API calls
retry_delay = 5 # 5 second delay between retries
for attempt in range(max_retries + 1):
try:
if task_id:
await task_manager.update_progress(task_id, f"🤖 Calling AI API for outline generation (attempt {attempt + 1}/{max_retries + 1})...")
outline_data = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt=None,
user_id=user_id
)
# Log response for debugging
logger.info(f"AI response received: {type(outline_data)}")
# Check for errors in the response
if isinstance(outline_data, dict) and 'error' in outline_data:
error_msg = str(outline_data['error'])
if "503" in error_msg and "overloaded" in error_msg and attempt < max_retries:
if task_id:
await task_manager.update_progress(task_id, f"⚠️ AI service overloaded, retrying in {retry_delay} seconds...")
logger.warning(f"AI API overloaded, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
await asyncio.sleep(retry_delay)
continue
elif "No valid structured response content found" in error_msg and attempt < max_retries:
if task_id:
await task_manager.update_progress(task_id, f"⚠️ Invalid response format, retrying in {retry_delay} seconds...")
logger.warning(f"AI response parsing failed, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
await asyncio.sleep(retry_delay)
continue
else:
logger.error(f"AI structured response error: {outline_data['error']}")
raise ValueError(f"AI outline generation failed: {outline_data['error']}")
# Validate required fields
if not isinstance(outline_data, dict) or 'outline' not in outline_data or not isinstance(outline_data['outline'], list):
if attempt < max_retries:
if task_id:
await task_manager.update_progress(task_id, f"⚠️ Invalid response structure, retrying in {retry_delay} seconds...")
logger.warning(f"Invalid response structure, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
await asyncio.sleep(retry_delay)
continue
else:
raise ValueError("Invalid outline structure in AI response")
# If we get here, the response is valid
return outline_data
except Exception as e:
error_str = str(e)
if ("503" in error_str or "overloaded" in error_str) and attempt < max_retries:
if task_id:
await task_manager.update_progress(task_id, f"⚠️ AI service error, retrying in {retry_delay} seconds...")
logger.warning(f"AI API error, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1}): {error_str}")
await asyncio.sleep(retry_delay)
continue
else:
logger.error(f"Outline generation failed after {attempt + 1} attempts: {error_str}")
raise ValueError(f"AI outline generation failed: {error_str}")
def convert_to_sections(self, outline_data: Dict[str, Any], sources: List) -> List[BlogOutlineSection]:
"""Convert outline data to BlogOutlineSection objects."""
outline_sections = []
for i, section_data in enumerate(outline_data.get('outline', [])):
if not isinstance(section_data, dict) or 'heading' not in section_data:
continue
section = BlogOutlineSection(
id=f"s{i+1}",
heading=section_data.get('heading', f'Section {i+1}'),
subheadings=section_data.get('subheadings', []),
key_points=section_data.get('key_points', []),
references=[], # Will be populated by intelligent mapping
target_words=section_data.get('target_words', 200),
keywords=section_data.get('keywords', [])
)
outline_sections.append(section)
return outline_sections

View File

@@ -0,0 +1,96 @@
"""
Section Enhancer - AI-powered section enhancement and improvement.
Enhances individual outline sections for better engagement and value.
"""
from loguru import logger
from models.blog_models import BlogOutlineSection
class SectionEnhancer:
"""Enhances individual outline sections using AI."""
async def enhance(self, section: BlogOutlineSection, focus: str, user_id: str) -> BlogOutlineSection:
"""Enhance a section using AI with research context.
Args:
section: Outline section to enhance
focus: Enhancement focus (e.g., "general improvement")
user_id: User ID (required for subscription checks and usage tracking)
Returns:
Enhanced outline section
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for section enhancement (subscription checks and usage tracking)")
enhancement_prompt = f"""
Enhance the following blog section to make it more engaging, comprehensive, and valuable:
Current Section:
Heading: {section.heading}
Subheadings: {', '.join(section.subheadings)}
Key Points: {', '.join(section.key_points)}
Target Words: {section.target_words}
Keywords: {', '.join(section.keywords)}
Enhancement Focus: {focus}
Improve:
1. Make subheadings more specific and actionable
2. Add more comprehensive key points with data/insights
3. Include practical examples and case studies
4. Address common questions and objections
5. Optimize for SEO with better keyword integration
Respond with JSON:
{{
"heading": "Enhanced heading",
"subheadings": ["enhanced subheading 1", "enhanced subheading 2"],
"key_points": ["enhanced point 1", "enhanced point 2"],
"target_words": 400,
"keywords": ["keyword1", "keyword2"]
}}
"""
try:
from services.llm_providers.main_text_generation import llm_text_gen
enhancement_schema = {
"type": "object",
"properties": {
"heading": {"type": "string"},
"subheadings": {"type": "array", "items": {"type": "string"}},
"key_points": {"type": "array", "items": {"type": "string"}},
"target_words": {"type": "integer"},
"keywords": {"type": "array", "items": {"type": "string"}}
},
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
}
enhanced_data = llm_text_gen(
prompt=enhancement_prompt,
json_struct=enhancement_schema,
system_prompt=None,
user_id=user_id
)
if isinstance(enhanced_data, dict) and 'error' not in enhanced_data:
return BlogOutlineSection(
id=section.id,
heading=enhanced_data.get('heading', section.heading),
subheadings=enhanced_data.get('subheadings', section.subheadings),
key_points=enhanced_data.get('key_points', section.key_points),
references=section.references,
target_words=enhanced_data.get('target_words', section.target_words),
keywords=enhanced_data.get('keywords', section.keywords)
)
except Exception as e:
logger.warning(f"AI section enhancement failed: {e}")
return section
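
A minimal sketch of calling the enhancer, assuming the import path below and a valid `user_id`; note that `enhance` makes a live `llm_text_gen` call, so this only runs against a configured LLM provider.

```python
import asyncio

from models.blog_models import BlogOutlineSection
from services.blog_writer.outline.section_enhancer import SectionEnhancer

section = BlogOutlineSection(
    id="s1",
    heading="Introduction to AI Blogging",
    subheadings=["What it is"],
    key_points=["Definition", "Use cases"],
    references=[],
    target_words=200,
    keywords=["ai blogging"],
)
# Falls back to the original section if the AI call fails.
enhanced = asyncio.run(SectionEnhancer().enhance(section, "SEO optimization", user_id="user-123"))
print(enhanced.heading)
```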

View File

@@ -0,0 +1,198 @@
"""
SEO Title Generator - Specialized service for generating SEO-optimized blog titles.
Generates 5 premium SEO-optimized titles using research data and outline context.
"""
from typing import Dict, Any, List
from loguru import logger
from models.blog_models import BlogResearchResponse, BlogOutlineSection
class SEOTitleGenerator:
"""Generates SEO-optimized blog titles using research and outline data."""
def __init__(self):
"""Initialize the SEO title generator."""
pass
def build_title_prompt(
self,
research: BlogResearchResponse,
outline: List[BlogOutlineSection],
primary_keywords: List[str],
secondary_keywords: List[str],
content_angles: List[str],
search_intent: str,
word_count: int = 1500
) -> str:
"""Build a specialized prompt for SEO title generation."""
# Extract key research insights
keyword_analysis = research.keyword_analysis or {}
competitor_analysis = research.competitor_analysis or {}
primary_kw_text = ', '.join(primary_keywords) if primary_keywords else "the target topic"
secondary_kw_text = ', '.join(secondary_keywords) if secondary_keywords else "None provided"
long_tail_text = ', '.join(keyword_analysis.get('long_tail', [])) if keyword_analysis else "None discovered"
semantic_text = ', '.join(keyword_analysis.get('semantic_keywords', [])) if keyword_analysis else "None discovered"
trending_text = ', '.join(keyword_analysis.get('trending_terms', [])) if keyword_analysis else "None discovered"
content_gap_text = ', '.join(keyword_analysis.get('content_gaps', [])) if keyword_analysis else "None identified"
content_angle_text = ', '.join(content_angles) if content_angles else "No explicit angles provided"
# Extract outline structure summary
outline_summary = []
for i, section in enumerate(outline[:5], 1): # Limit to first 5 sections for context
outline_summary.append(f"{i}. {section.heading}")
if section.subheadings:
outline_summary.append(f" Subtopics: {', '.join(section.subheadings[:3])}")
outline_text = '\n'.join(outline_summary) if outline_summary else "No outline available"
return f"""Generate exactly 5 SEO-optimized blog titles for: {primary_kw_text}
RESEARCH CONTEXT:
Primary Keywords: {primary_kw_text}
Secondary Keywords: {secondary_kw_text}
Long-tail Keywords: {long_tail_text}
Semantic Keywords: {semantic_text}
Trending Terms: {trending_text}
Content Gaps: {content_gap_text}
Search Intent: {search_intent}
Content Angles: {content_angle_text}
OUTLINE STRUCTURE:
{outline_text}
COMPETITIVE INTELLIGENCE:
Top Competitors: {', '.join(competitor_analysis.get('top_competitors', [])) if competitor_analysis else 'Not available'}
Market Opportunities: {', '.join(competitor_analysis.get('opportunities', [])) if competitor_analysis else 'Not available'}
SEO REQUIREMENTS:
- Each title must be 50-65 characters (optimal for search engine display)
- Include the primary keyword within the first 55 characters
- Highlight a unique value proposition from the research angles
- Use power words that drive clicks (e.g., "Ultimate", "Complete", "Essential", "Proven")
- Avoid generic phrasing - be specific and benefit-focused
- Target the search intent: {search_intent}
- Ensure titles are compelling and click-worthy
Return ONLY a JSON array of exactly 5 titles:
[
"Title 1 (50-65 chars)",
"Title 2 (50-65 chars)",
"Title 3 (50-65 chars)",
"Title 4 (50-65 chars)",
"Title 5 (50-65 chars)"
]"""
def get_title_schema(self) -> Dict[str, Any]:
"""Get the JSON schema for title generation."""
return {
"type": "array",
"items": {
"type": "string",
"minLength": 50,
"maxLength": 65
},
"minItems": 5,
"maxItems": 5
}
async def generate_seo_titles(
self,
research: BlogResearchResponse,
outline: List[BlogOutlineSection],
primary_keywords: List[str],
secondary_keywords: List[str],
content_angles: List[str],
search_intent: str,
word_count: int,
user_id: str
) -> List[str]:
"""Generate SEO-optimized titles using research and outline data.
Args:
research: Research data with keywords and insights
outline: Blog outline sections
primary_keywords: Primary keywords for the blog
secondary_keywords: Secondary keywords
content_angles: Content angles from research
search_intent: Search intent (informational, commercial, etc.)
word_count: Target word count
user_id: User ID for API calls
Returns:
List of 5 SEO-optimized titles
"""
from services.llm_providers.main_text_generation import llm_text_gen
if not user_id:
raise ValueError("user_id is required for title generation")
# Build specialized prompt
prompt = self.build_title_prompt(
research=research,
outline=outline,
primary_keywords=primary_keywords,
secondary_keywords=secondary_keywords,
content_angles=content_angles,
search_intent=search_intent,
word_count=word_count
)
# Get schema
schema = self.get_title_schema()
logger.info(f"Generating SEO-optimized titles for user {user_id}")
try:
# Generate titles using structured JSON response
result = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt="You are an expert SEO content strategist specializing in creating compelling, search-optimized blog titles.",
user_id=user_id
)
# Handle response - could be array directly or wrapped in dict
if isinstance(result, list):
titles = result
elif isinstance(result, dict):
# Try common keys
titles = result.get('titles', result.get('title_options', result.get('options', [])))
if not titles and isinstance(result.get('response'), list):
titles = result['response']
else:
logger.warning(f"Unexpected title generation result type: {type(result)}")
titles = []
# Validate and clean titles
cleaned_titles = []
for title in titles:
if isinstance(title, str) and len(title.strip()) >= 30: # Minimum reasonable length
cleaned = title.strip()
# Ensure it's within reasonable bounds (allow slight overflow, up to 70 chars, for quality)
if len(cleaned) <= 70:
cleaned_titles.append(cleaned)
# Ensure we have exactly 5 titles
if len(cleaned_titles) < 5:
logger.warning(f"Generated only {len(cleaned_titles)} titles, expected 5")
# Pad with placeholder if needed (shouldn't happen with proper schema)
while len(cleaned_titles) < 5:
cleaned_titles.append(f"{primary_keywords[0] if primary_keywords else 'Blog'} - Comprehensive Guide")
# Return exactly 5 titles
return cleaned_titles[:5]
except Exception as e:
logger.error(f"Failed to generate SEO titles: {e}")
# Fallback: generate simple titles from keywords
fallback_titles = []
primary = primary_keywords[0] if primary_keywords else "Blog Post"
for i in range(5):
fallback_titles.append(f"{primary}: Complete Guide {i+1}")
return fallback_titles
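
The call shape below is a sketch only: `research` and `outline` are assumed to come from earlier pipeline steps (`ResearchService` and `OutlineService`), the import path is an assumption, and the keyword values are illustrative.

```python
# Hedged usage sketch for SEOTitleGenerator.generate_seo_titles.
from services.blog_writer.outline.seo_title_generator import SEOTitleGenerator  # assumed path

async def pick_titles(research, outline, user_id: str) -> list:
    generator = SEOTitleGenerator()
    return await generator.generate_seo_titles(
        research=research,
        outline=outline,
        primary_keywords=["ai blog writer"],
        secondary_keywords=["seo titles"],
        content_angles=research.suggested_angles,
        search_intent="informational",
        word_count=1500,
        user_id=user_id,
    )  # always returns exactly 5 titles (padded or truncated)
```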

View File

@@ -0,0 +1,690 @@
"""
Source-to-Section Mapper - Intelligent mapping of research sources to outline sections.
This module provides algorithmic mapping of research sources to specific outline sections
based on semantic similarity, keyword relevance, and contextual matching. Uses a hybrid
approach of algorithmic scoring followed by AI validation for optimal results.
"""
from typing import Dict, Any, List, Tuple, Optional
import re
from collections import Counter
from loguru import logger
from models.blog_models import (
BlogOutlineSection,
ResearchSource,
BlogResearchResponse,
)
class SourceToSectionMapper:
"""Maps research sources to outline sections using intelligent algorithms."""
def __init__(self):
"""Initialize the source-to-section mapper."""
self.min_semantic_score = 0.3
self.min_keyword_score = 0.2
self.min_contextual_score = 0.2
self.max_sources_per_section = 3
self.min_total_score = 0.4
# Weight factors for different scoring methods
self.weights = {
'semantic': 0.4, # Semantic similarity weight
'keyword': 0.3, # Keyword matching weight
'contextual': 0.3 # Contextual relevance weight
}
# Common stop words for text processing
self.stop_words = {
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did',
'will', 'would', 'could', 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those',
'how', 'what', 'when', 'where', 'why', 'who', 'which', 'much', 'many', 'more', 'most',
'some', 'any', 'all', 'each', 'every', 'other', 'another', 'such', 'no', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 'just', 'now', 'here', 'there', 'up', 'down', 'out', 'off',
'over', 'under', 'again', 'further', 'then', 'once'
}
logger.info("✅ SourceToSectionMapper initialized with intelligent mapping algorithms")
def map_sources_to_sections(
self,
sections: List[BlogOutlineSection],
research_data: BlogResearchResponse,
user_id: str
) -> List[BlogOutlineSection]:
"""
Map research sources to outline sections using intelligent algorithms.
Args:
sections: List of outline sections to map sources to
research_data: Research data containing sources and metadata
user_id: User ID (required for subscription checks and usage tracking)
Returns:
List of outline sections with intelligently mapped sources
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for source mapping (subscription checks and usage tracking)")
if not sections or not research_data.sources:
logger.warning("No sections or sources to map")
return sections
logger.info(f"Mapping {len(research_data.sources)} sources to {len(sections)} sections")
# Step 1: Algorithmic mapping
mapping_results = self._algorithmic_source_mapping(sections, research_data)
# Step 2: AI validation and improvement (single prompt, user_id required for subscription checks)
validated_mapping = self._ai_validate_mapping(mapping_results, research_data, user_id)
# Step 3: Apply validated mapping to sections
mapped_sections = self._apply_mapping_to_sections(sections, validated_mapping)
logger.info("✅ Source-to-section mapping completed successfully")
return mapped_sections
def _algorithmic_source_mapping(
self,
sections: List[BlogOutlineSection],
research_data: BlogResearchResponse
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
"""
Perform algorithmic mapping of sources to sections.
Args:
sections: List of outline sections
research_data: Research data with sources
Returns:
Dictionary mapping section IDs to list of (source, score) tuples
"""
mapping_results = {}
for section in sections:
section_scores = []
for source in research_data.sources:
# Calculate multi-dimensional relevance score
semantic_score = self._calculate_semantic_similarity(section, source)
keyword_score = self._calculate_keyword_relevance(section, source, research_data)
contextual_score = self._calculate_contextual_relevance(section, source, research_data)
# Weighted total score
total_score = (
semantic_score * self.weights['semantic'] +
keyword_score * self.weights['keyword'] +
contextual_score * self.weights['contextual']
)
# Only include sources that meet minimum threshold
if total_score >= self.min_total_score:
section_scores.append((source, total_score))
# Sort by score and limit to max sources per section
section_scores.sort(key=lambda x: x[1], reverse=True)
section_scores = section_scores[:self.max_sources_per_section]
mapping_results[section.id] = section_scores
logger.debug(f"Section '{section.heading}': {len(section_scores)} sources mapped")
return mapping_results
def _calculate_semantic_similarity(self, section: BlogOutlineSection, source: ResearchSource) -> float:
"""
Calculate semantic similarity between section and source.
Args:
section: Outline section
source: Research source
Returns:
Semantic similarity score (0.0 to 1.0)
"""
# Extract text content for comparison
section_text = self._extract_section_text(section)
source_text = self._extract_source_text(source)
# Calculate word overlap
section_words = self._extract_meaningful_words(section_text)
source_words = self._extract_meaningful_words(source_text)
if not section_words or not source_words:
return 0.0
# Calculate Jaccard similarity
intersection = len(set(section_words) & set(source_words))
union = len(set(section_words) | set(source_words))
jaccard_similarity = intersection / union if union > 0 else 0.0
# Boost score for exact phrase matches
phrase_boost = self._calculate_phrase_similarity(section_text, source_text)
# Combine Jaccard similarity with phrase boost
semantic_score = min(1.0, jaccard_similarity + phrase_boost)
return semantic_score
def _calculate_keyword_relevance(
self,
section: BlogOutlineSection,
source: ResearchSource,
research_data: BlogResearchResponse
) -> float:
"""
Calculate keyword-based relevance between section and source.
Args:
section: Outline section
source: Research source
research_data: Research data with keyword analysis
Returns:
Keyword relevance score (0.0 to 1.0)
"""
# Get section keywords
section_keywords = set(section.keywords)
if not section_keywords:
# Extract keywords from section heading and content
section_text = self._extract_section_text(section)
section_keywords = set(self._extract_meaningful_words(section_text))
# Get source keywords from title and excerpt
source_text = f"{source.title} {source.excerpt or ''}"
source_keywords = set(self._extract_meaningful_words(source_text))
# Get research keywords for context
research_keywords = set()
for category in ['primary', 'secondary', 'long_tail', 'semantic_keywords']:
research_keywords.update(research_data.keyword_analysis.get(category, []))
# Calculate keyword overlap scores
section_overlap = len(section_keywords & source_keywords) / len(section_keywords) if section_keywords else 0.0
research_overlap = len(research_keywords & source_keywords) / len(research_keywords) if research_keywords else 0.0
# Weighted combination
keyword_score = (section_overlap * 0.7) + (research_overlap * 0.3)
return min(1.0, keyword_score)
def _calculate_contextual_relevance(
self,
section: BlogOutlineSection,
source: ResearchSource,
research_data: BlogResearchResponse
) -> float:
"""
Calculate contextual relevance based on section content and source context.
Args:
section: Outline section
source: Research source
research_data: Research data with context
Returns:
Contextual relevance score (0.0 to 1.0)
"""
contextual_score = 0.0
# 1. Content angle matching
section_text = self._extract_section_text(section).lower()
source_text = f"{source.title} {source.excerpt or ''}".lower()
# Check for content angle matches
content_angles = research_data.suggested_angles
for angle in content_angles:
angle_words = self._extract_meaningful_words(angle.lower())
if angle_words:
section_angle_match = sum(1 for word in angle_words if word in section_text) / len(angle_words)
source_angle_match = sum(1 for word in angle_words if word in source_text) / len(angle_words)
contextual_score += (section_angle_match + source_angle_match) * 0.3
# 2. Search intent alignment
search_intent = research_data.keyword_analysis.get('search_intent', 'informational')
intent_keywords = self._get_intent_keywords(search_intent)
intent_score = 0.0
for keyword in intent_keywords:
if keyword in section_text or keyword in source_text:
intent_score += 0.1
contextual_score += min(0.3, intent_score)
# 3. Industry/domain relevance
if hasattr(research_data, 'industry') and research_data.industry:
industry_words = self._extract_meaningful_words(research_data.industry.lower())
industry_score = sum(1 for word in industry_words if word in source_text) / len(industry_words) if industry_words else 0.0
contextual_score += industry_score * 0.2
return min(1.0, contextual_score)
def _ai_validate_mapping(
self,
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]],
research_data: BlogResearchResponse,
user_id: str
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
"""
Use AI to validate and improve the algorithmic mapping results.
Args:
mapping_results: Algorithmic mapping results
research_data: Research data for context
user_id: User ID (required for subscription checks and usage tracking)
Returns:
AI-validated and improved mapping results
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for AI validation (subscription checks and usage tracking)")
try:
logger.info("Starting AI validation of source-to-section mapping...")
# Build AI validation prompt
validation_prompt = self._build_validation_prompt(mapping_results, research_data)
# Get AI validation response (user_id required for subscription checks)
validation_response = self._get_ai_validation_response(validation_prompt, user_id)
# Parse and apply AI validation results
validated_mapping = self._parse_validation_response(validation_response, mapping_results, research_data)
logger.info("✅ AI validation completed successfully")
return validated_mapping
except Exception as e:
logger.warning(f"AI validation failed: {e}. Using algorithmic results as fallback.")
return mapping_results
def _apply_mapping_to_sections(
self,
sections: List[BlogOutlineSection],
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]]
) -> List[BlogOutlineSection]:
"""
Apply the mapping results to the outline sections.
Args:
sections: Original outline sections
mapping_results: Mapping results from algorithmic/AI processing
Returns:
Sections with mapped sources
"""
mapped_sections = []
for section in sections:
# Get mapped sources for this section
mapped_sources = mapping_results.get(section.id, [])
# Extract just the sources (without scores)
section_sources = [source for source, score in mapped_sources]
# Create new section with mapped sources
mapped_section = BlogOutlineSection(
id=section.id,
heading=section.heading,
subheadings=section.subheadings,
key_points=section.key_points,
references=section_sources,
target_words=section.target_words,
keywords=section.keywords
)
mapped_sections.append(mapped_section)
logger.debug(f"Applied {len(section_sources)} sources to section '{section.heading}'")
return mapped_sections
# Helper methods
def _extract_section_text(self, section: BlogOutlineSection) -> str:
"""Extract all text content from a section."""
text_parts = [section.heading]
text_parts.extend(section.subheadings)
text_parts.extend(section.key_points)
text_parts.extend(section.keywords)
return " ".join(text_parts)
def _extract_source_text(self, source: ResearchSource) -> str:
"""Extract all text content from a source."""
text_parts = [source.title]
if source.excerpt:
text_parts.append(source.excerpt)
return " ".join(text_parts)
def _extract_meaningful_words(self, text: str) -> List[str]:
"""Extract meaningful words from text, removing stop words and cleaning."""
if not text:
return []
# Clean and tokenize
words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
# Remove stop words and short words
meaningful_words = [
word for word in words
if word not in self.stop_words and len(word) > 2
]
return meaningful_words
def _calculate_phrase_similarity(self, text1: str, text2: str) -> float:
"""Calculate phrase similarity boost score."""
if not text1 or not text2:
return 0.0
text1_lower = text1.lower()
text2_lower = text2.lower()
# Look for 2-3 word phrases
phrase_boost = 0.0
# Extract 2-word phrases
words1 = text1_lower.split()
words2 = text2_lower.split()
for i in range(len(words1) - 1):
phrase = f"{words1[i]} {words1[i+1]}"
if phrase in text2_lower:
phrase_boost += 0.1
# Extract 3-word phrases
for i in range(len(words1) - 2):
phrase = f"{words1[i]} {words1[i+1]} {words1[i+2]}"
if phrase in text2_lower:
phrase_boost += 0.15
return min(0.3, phrase_boost) # Cap at 0.3
def _get_intent_keywords(self, search_intent: str) -> List[str]:
"""Get keywords associated with search intent."""
intent_keywords = {
'informational': ['what', 'how', 'why', 'guide', 'tutorial', 'explain', 'learn', 'understand'],
'navigational': ['find', 'locate', 'search', 'where', 'site', 'website', 'page'],
'transactional': ['buy', 'purchase', 'order', 'price', 'cost', 'deal', 'offer', 'discount'],
'commercial': ['compare', 'review', 'best', 'top', 'vs', 'versus', 'alternative', 'option']
}
return intent_keywords.get(search_intent, [])
def get_mapping_statistics(self, mapping_results: Dict[str, List[Tuple[ResearchSource, float]]]) -> Dict[str, Any]:
"""
Get statistics about the mapping results.
Args:
mapping_results: Mapping results to analyze
Returns:
Dictionary with mapping statistics
"""
total_sections = len(mapping_results)
total_mappings = sum(len(sources) for sources in mapping_results.values())
# Calculate score distribution
all_scores = []
for sources in mapping_results.values():
all_scores.extend([score for source, score in sources])
avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
max_score = max(all_scores) if all_scores else 0.0
min_score = min(all_scores) if all_scores else 0.0
# Count sections with/without sources
sections_with_sources = sum(1 for sources in mapping_results.values() if sources)
sections_without_sources = total_sections - sections_with_sources
return {
'total_sections': total_sections,
'total_mappings': total_mappings,
'sections_with_sources': sections_with_sources,
'sections_without_sources': sections_without_sources,
'average_score': avg_score,
'max_score': max_score,
'min_score': min_score,
'mapping_coverage': sections_with_sources / total_sections if total_sections > 0 else 0.0
}
def _build_validation_prompt(
self,
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]],
research_data: BlogResearchResponse
) -> str:
"""
Build comprehensive AI validation prompt for source-to-section mapping.
Args:
mapping_results: Algorithmic mapping results
research_data: Research data for context
Returns:
Formatted AI validation prompt
"""
# Extract section information
sections_info = []
for section_id, sources in mapping_results.items():
section_info = {
'id': section_id,
'sources': [
{
'title': source.title,
'url': source.url,
'excerpt': source.excerpt,
'credibility_score': source.credibility_score,
'algorithmic_score': score
}
for source, score in sources
]
}
sections_info.append(section_info)
# Extract research context
research_context = {
'primary_keywords': research_data.keyword_analysis.get('primary', []),
'secondary_keywords': research_data.keyword_analysis.get('secondary', []),
'content_angles': research_data.suggested_angles,
'search_intent': research_data.keyword_analysis.get('search_intent', 'informational'),
'all_sources': [
{
'title': source.title,
'url': source.url,
'excerpt': source.excerpt,
'credibility_score': source.credibility_score
}
for source in research_data.sources
]
}
prompt = f"""
You are an expert content strategist and SEO specialist. Your task is to validate and improve the algorithmic mapping of research sources to blog outline sections.
## CONTEXT
Research Topic: {', '.join(research_context['primary_keywords'])}
Search Intent: {research_context['search_intent']}
Content Angles: {', '.join(research_context['content_angles'])}
## ALGORITHMIC MAPPING RESULTS
The following sections have been algorithmically mapped with research sources:
{self._format_sections_for_prompt(sections_info)}
## AVAILABLE SOURCES
All available research sources:
{self._format_sources_for_prompt(research_context['all_sources'])}
## VALIDATION TASK
Please analyze the algorithmic mapping and provide improvements:
1. **Validate Relevance**: Are the mapped sources truly relevant to each section's content and purpose?
2. **Identify Gaps**: Are there better sources available that weren't mapped?
3. **Suggest Improvements**: Recommend specific source changes for better content alignment
4. **Quality Assessment**: Rate the overall mapping quality (1-10)
## RESPONSE FORMAT
Provide your analysis in the following JSON format:
```json
{{
"overall_quality_score": 8,
"section_improvements": [
{{
"section_id": "s1",
"current_sources": ["source_title_1", "source_title_2"],
"recommended_sources": ["better_source_1", "better_source_2", "better_source_3"],
"reasoning": "Explanation of why these sources are better suited for this section",
"confidence": 0.9
}}
],
"summary": "Overall assessment of the mapping quality and key improvements made"
}}
```
## GUIDELINES
- Prioritize sources that directly support the section's key points and subheadings
- Consider source credibility, recency, and content depth
- Ensure sources provide actionable insights for content creation
- Maintain diversity in source types and perspectives
- Focus on sources that enhance the section's value proposition
Analyze the mapping and provide your recommendations.
"""
return prompt
def _get_ai_validation_response(self, prompt: str, user_id: str) -> str:
"""
Get AI validation response using LLM provider.
Args:
prompt: Validation prompt
user_id: User ID (required for subscription checks and usage tracking)
Returns:
AI validation response
Raises:
ValueError: If user_id is not provided
"""
if not user_id:
raise ValueError("user_id is required for AI validation response (subscription checks and usage tracking)")
try:
from services.llm_providers.main_text_generation import llm_text_gen
response = llm_text_gen(
prompt=prompt,
json_struct=None,
system_prompt=None,
user_id=user_id
)
return response
except Exception as e:
logger.error(f"Failed to get AI validation response: {e}")
raise
def _parse_validation_response(
self,
response: str,
original_mapping: Dict[str, List[Tuple[ResearchSource, float]]],
research_data: BlogResearchResponse
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
"""
Parse AI validation response and apply improvements.
Args:
response: AI validation response
original_mapping: Original algorithmic mapping
research_data: Research data for context
Returns:
Improved mapping based on AI validation
"""
try:
import json  # re is already imported at module level
# Extract JSON from response
json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL)
if not json_match:
# Try to find JSON without code blocks (greedy so nested braces are captured whole)
json_match = re.search(r'(\{.*\})', response, re.DOTALL)
if not json_match:
logger.warning("Could not extract JSON from AI response")
return original_mapping
validation_data = json.loads(json_match.group(1))
# Create source lookup for quick access
source_lookup = {source.title: source for source in research_data.sources}
# Apply AI improvements
improved_mapping = {}
for improvement in validation_data.get('section_improvements', []):
section_id = improvement['section_id']
recommended_titles = improvement['recommended_sources']
# Map recommended titles to actual sources
recommended_sources = []
for title in recommended_titles:
if title in source_lookup:
source = source_lookup[title]
# Use high confidence score for AI-recommended sources
recommended_sources.append((source, 0.9))
if recommended_sources:
improved_mapping[section_id] = recommended_sources
else:
# Fallback to original mapping if no valid sources found
improved_mapping[section_id] = original_mapping.get(section_id, [])
# Add sections not mentioned in AI response
for section_id, sources in original_mapping.items():
if section_id not in improved_mapping:
improved_mapping[section_id] = sources
logger.info(f"AI validation applied: {len(validation_data.get('section_improvements', []))} sections improved")
return improved_mapping
except Exception as e:
logger.warning(f"Failed to parse AI validation response: {e}")
return original_mapping
def _format_sections_for_prompt(self, sections_info: List[Dict]) -> str:
"""Format sections information for AI prompt."""
formatted = []
for section in sections_info:
section_text = f"**Section {section['id']}:**\n"
section_text += f"Sources mapped: {len(section['sources'])}\n"
for source in section['sources']:
section_text += f"- {source['title']} (Score: {source['algorithmic_score']:.2f})\n"
formatted.append(section_text)
return "\n".join(formatted)
def _format_sources_for_prompt(self, sources: List[Dict]) -> str:
"""Format sources information for AI prompt."""
formatted = []
for i, source in enumerate(sources, 1):
source_text = f"{i}. **{source['title']}**\n"
source_text += f" URL: {source['url']}\n"
source_text += f" Credibility: {source['credibility_score']}\n"
if source['excerpt']:
source_text += f" Excerpt: {source['excerpt'][:200]}...\n"
formatted.append(source_text)
return "\n".join(formatted)
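
To make the scoring concrete, here is a worked example of the weighted total used in `_algorithmic_source_mapping`, with illustrative component scores:

```python
# Worked example of the weighted relevance score (component values invented).
weights = {'semantic': 0.4, 'keyword': 0.3, 'contextual': 0.3}
semantic, keyword, contextual = 0.55, 0.40, 0.30

total = (semantic * weights['semantic']
         + keyword * weights['keyword']
         + contextual * weights['contextual'])
print(round(total, 2))  # 0.43 -> clears the 0.4 min_total_score threshold
```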

View File

@@ -0,0 +1,123 @@
"""
Title Generator - Handles title generation and formatting for blog outlines.
Extracts content angles from research data and combines them with AI-generated titles.
"""
from typing import List
from loguru import logger
class TitleGenerator:
"""Handles title generation, formatting, and combination logic."""
def __init__(self):
"""Initialize the title generator."""
pass
def extract_content_angle_titles(self, research) -> List[str]:
"""
Extract content angles from research data and convert them to blog titles.
Args:
research: BlogResearchResponse object containing suggested_angles
Returns:
List of title-formatted content angles
"""
if not research or not hasattr(research, 'suggested_angles'):
return []
content_angles = research.suggested_angles or []
if not content_angles:
return []
# Convert content angles to title format
title_formatted_angles = []
for angle in content_angles:
if isinstance(angle, str) and angle.strip():
# Clean and format the angle as a title
formatted_angle = self._format_angle_as_title(angle.strip())
if formatted_angle and formatted_angle not in title_formatted_angles:
title_formatted_angles.append(formatted_angle)
logger.info(f"Extracted {len(title_formatted_angles)} content angle titles from research data")
return title_formatted_angles
def _format_angle_as_title(self, angle: str) -> str:
"""
Format a content angle as a proper blog title.
Args:
angle: Raw content angle string
Returns:
Formatted title string
"""
if not angle or len(angle.strip()) < 10: # Too short to be a good title
return ""
# Clean up the angle
cleaned_angle = angle.strip()
# Capitalize first letter of each sentence and proper nouns
sentences = cleaned_angle.split('. ')
formatted_sentences = []
for sentence in sentences:
if sentence.strip():
# Use title case for better formatting
formatted_sentence = sentence.strip().title()
formatted_sentences.append(formatted_sentence)
formatted_title = '. '.join(formatted_sentences)
# Ensure it ends with proper punctuation
if not formatted_title.endswith(('.', '!', '?')):
formatted_title += '.'
# Limit length to reasonable blog title size
if len(formatted_title) > 100:
formatted_title = formatted_title[:97] + "..."
return formatted_title
def combine_title_options(self, ai_titles: List[str], content_angle_titles: List[str], primary_keywords: List[str]) -> List[str]:
"""
Combine AI-generated titles with content angle titles, ensuring variety and quality.
Args:
ai_titles: AI-generated title options
content_angle_titles: Titles derived from content angles
primary_keywords: Primary keywords for fallback generation
Returns:
Combined list of title options (max 6 total)
"""
all_titles = []
# Add content angle titles first (these are research-based and valuable)
for title in content_angle_titles[:3]: # Limit to top 3 content angles
if title and title not in all_titles:
all_titles.append(title)
# Add AI-generated titles
for title in ai_titles:
if title and title not in all_titles:
all_titles.append(title)
# Note: Removed fallback titles as requested - only use research and AI-generated titles
# Limit to 6 titles maximum for UI usability
final_titles = all_titles[:6]
logger.info(f"Combined title options: {len(final_titles)} total (AI: {len(ai_titles)}, Content angles: {len(content_angle_titles)})")
return final_titles
def generate_fallback_titles(self, primary_keywords: List[str]) -> List[str]:
"""Generate fallback titles when AI generation fails."""
primary_keyword = primary_keywords[0] if primary_keywords else "Topic"
return [
f"The Complete Guide to {primary_keyword}",
f"{primary_keyword}: Everything You Need to Know",
f"How to Master {primary_keyword} in 2024"
]
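
A short sketch of the combination logic, assuming the import path and sample inputs below; `research` is a `BlogResearchResponse` from an earlier research step.

```python
from services.blog_writer.outline.title_generator import TitleGenerator  # assumed path

tg = TitleGenerator()
ai_titles = ["AI Blog Writing: A Proven Workflow for Busy Teams"]  # illustrative
angle_titles = tg.extract_content_angle_titles(research)
options = tg.combine_title_options(ai_titles, angle_titles, primary_keywords=["ai blog writing"])
print(options)  # content-angle titles first, capped at 6 total
```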

View File

@@ -0,0 +1,31 @@
"""
Research module for AI Blog Writer.
This module handles all research-related functionality including:
- Google Search grounding integration
- Keyword analysis and competitor research
- Content angle discovery
- Research caching and optimization
"""
from .research_service import ResearchService
from .keyword_analyzer import KeywordAnalyzer
from .competitor_analyzer import CompetitorAnalyzer
from .content_angle_generator import ContentAngleGenerator
from .data_filter import ResearchDataFilter
from .base_provider import ResearchProvider as BaseResearchProvider
from .google_provider import GoogleResearchProvider
from .exa_provider import ExaResearchProvider
from .tavily_provider import TavilyResearchProvider
__all__ = [
'ResearchService',
'KeywordAnalyzer',
'CompetitorAnalyzer',
'ContentAngleGenerator',
'ResearchDataFilter',
'BaseResearchProvider',
'GoogleResearchProvider',
'ExaResearchProvider',
'TavilyResearchProvider',
]
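
Because the package re-exports its public surface, callers can import directly from `services.blog_writer.research`; for example:

```python
from services.blog_writer.research import (
    CompetitorAnalyzer,
    ExaResearchProvider,
    ResearchDataFilter,
    ResearchService,
)
```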

View File

@@ -0,0 +1,37 @@
"""
Base Research Provider Interface
Abstract base class for research provider implementations.
Ensures consistency across different research providers (Google, Exa, etc.)
"""
from abc import ABC, abstractmethod
from typing import Dict, Any
class ResearchProvider(ABC):
"""Abstract base class for research providers."""
@abstractmethod
async def search(
self,
prompt: str,
topic: str,
industry: str,
target_audience: str,
config: Any, # ResearchConfig
user_id: str
) -> Dict[str, Any]:
"""Execute research and return raw results."""
pass
@abstractmethod
def get_provider_enum(self):
"""Return APIProvider enum for subscription tracking."""
pass
@abstractmethod
def estimate_tokens(self) -> int:
"""Estimate token usage for pre-flight validation."""
pass
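
A hypothetical stub showing the minimum a concrete provider must implement; the `DummyProvider` name and return values are illustrative only.

```python
from typing import Any, Dict

from services.blog_writer.research import BaseResearchProvider

class DummyProvider(BaseResearchProvider):
    async def search(self, prompt: str, topic: str, industry: str,
                     target_audience: str, config: Any, user_id: str) -> Dict[str, Any]:
        # A real provider would call its search API here.
        return {'sources': [], 'content': '', 'provider': 'dummy', 'search_queries': [topic]}

    def get_provider_enum(self):
        return None  # real providers return an APIProvider member

    def estimate_tokens(self) -> int:
        return 0  # per-search providers report no token usage
```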

View File

@@ -0,0 +1,72 @@
"""
Competitor Analyzer - AI-powered competitor analysis for research content.
Extracts competitor insights and market intelligence from research content.
"""
from typing import Dict, Any, Optional
from loguru import logger
class CompetitorAnalyzer:
"""Analyzes competitors and market intelligence from research content."""
def analyze(self, content: str, user_id: Optional[str] = None) -> Dict[str, Any]:
"""Parse comprehensive competitor analysis from the research content using AI."""
competitor_prompt = f"""
Analyze the following research content and extract competitor insights:
Research Content:
{content[:3000]}
Extract and analyze:
1. Top competitors mentioned (companies, brands, platforms)
2. Content gaps (what competitors are missing)
3. Market opportunities (untapped areas)
4. Competitive advantages (what makes content unique)
5. Market positioning insights
6. Industry leaders and their strategies
Respond with JSON:
{{
"top_competitors": ["competitor1", "competitor2"],
"content_gaps": ["gap1", "gap2"],
"opportunities": ["opportunity1", "opportunity2"],
"competitive_advantages": ["advantage1", "advantage2"],
"market_positioning": "positioning insights",
"industry_leaders": ["leader1", "leader2"],
"analysis_notes": "Comprehensive competitor analysis summary"
}}
"""
from services.llm_providers.main_text_generation import llm_text_gen
competitor_schema = {
"type": "object",
"properties": {
"top_competitors": {"type": "array", "items": {"type": "string"}},
"content_gaps": {"type": "array", "items": {"type": "string"}},
"opportunities": {"type": "array", "items": {"type": "string"}},
"competitive_advantages": {"type": "array", "items": {"type": "string"}},
"market_positioning": {"type": "string"},
"industry_leaders": {"type": "array", "items": {"type": "string"}},
"analysis_notes": {"type": "string"}
},
"required": ["top_competitors", "content_gaps", "opportunities", "competitive_advantages", "market_positioning", "industry_leaders", "analysis_notes"]
}
competitor_analysis = llm_text_gen(
prompt=competitor_prompt,
json_struct=competitor_schema,
user_id=user_id
)
if isinstance(competitor_analysis, dict) and 'error' not in competitor_analysis:
logger.info("✅ AI competitor analysis completed successfully")
return competitor_analysis
else:
# Fail gracefully - no fallback data
error_msg = competitor_analysis.get('error', 'Unknown error') if isinstance(competitor_analysis, dict) else str(competitor_analysis)
logger.error(f"AI competitor analysis failed: {error_msg}")
raise ValueError(f"Competitor analysis failed: {error_msg}")

View File

@@ -0,0 +1,80 @@
"""
Content Angle Generator - AI-powered content angle discovery.
Generates strategic content angles from research content for blog posts.
"""
from typing import List, Optional
from loguru import logger
class ContentAngleGenerator:
"""Generates strategic content angles from research content."""
def generate(self, content: str, topic: str, industry: str, user_id: Optional[str] = None) -> List[str]:
"""Parse strategic content angles from the research content using AI."""
angles_prompt = f"""
Analyze the following research content and create strategic content angles for: {topic} in {industry}
Research Content:
{content[:3000]}
Create 7 compelling content angles that:
1. Leverage current trends and data from the research
2. Address content gaps and opportunities
3. Appeal to different audience segments
4. Include unique perspectives not covered by competitors
5. Incorporate specific statistics, case studies, or expert insights
6. Create emotional connection and urgency
7. Provide actionable value to readers
Each angle should be:
- Specific and data-driven
- Unique and differentiated
- Compelling and click-worthy
- Actionable for readers
Respond with JSON:
{{
"content_angles": [
"Specific angle 1 with data/trends",
"Specific angle 2 with unique perspective",
"Specific angle 3 with actionable insights",
"Specific angle 4 with case study focus",
"Specific angle 5 with future outlook",
"Specific angle 6 with problem-solving focus",
"Specific angle 7 with industry insights"
]
}}
"""
from services.llm_providers.main_text_generation import llm_text_gen
angles_schema = {
"type": "object",
"properties": {
"content_angles": {
"type": "array",
"items": {"type": "string"},
"minItems": 5,
"maxItems": 7
}
},
"required": ["content_angles"]
}
angles_result = llm_text_gen(
prompt=angles_prompt,
json_struct=angles_schema,
user_id=user_id
)
if isinstance(angles_result, dict) and 'content_angles' in angles_result:
logger.info("✅ AI content angles generation completed successfully")
return angles_result['content_angles'][:7]
else:
# Fail gracefully - no fallback data
error_msg = angles_result.get('error', 'Unknown error') if isinstance(angles_result, dict) else str(angles_result)
logger.error(f"AI content angles generation failed: {error_msg}")
raise ValueError(f"Content angles generation failed: {error_msg}")

View File

@@ -0,0 +1,519 @@
"""
Research Data Filter - Filters and cleans research data for optimal AI processing.
This module provides intelligent filtering and cleaning of research data to:
1. Remove low-quality sources and irrelevant content
2. Optimize data for AI processing (reduce tokens, improve quality)
3. Ensure only high-value insights are sent to AI prompts
4. Maintain data integrity while improving processing efficiency
"""
from typing import Dict, Any, List, Optional, Tuple
from datetime import datetime, timedelta
import re
from loguru import logger
from models.blog_models import (
BlogResearchResponse,
ResearchSource,
GroundingMetadata,
GroundingChunk,
GroundingSupport,
Citation,
)
class ResearchDataFilter:
"""Filters and cleans research data for optimal AI processing."""
def __init__(self):
"""Initialize the research data filter with default settings."""
# Be conservative but avoid over-filtering which can lead to empty UI
self.min_credibility_score = 0.5
self.min_excerpt_length = 20
self.max_sources = 15
self.max_grounding_chunks = 20
self.max_content_gaps = 5
self.max_keywords_per_category = 10
self.min_grounding_confidence = 0.5
self.max_source_age_days = 365 * 5 # allow up to 5 years if relevant
# Common stop words for keyword cleaning
self.stop_words = {
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did',
'will', 'would', 'could', 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'
}
# Irrelevant source patterns
self.irrelevant_patterns = [
r'\.(pdf|doc|docx|xls|xlsx|ppt|pptx)$', # Document files
r'\.(jpg|jpeg|png|gif|svg|webp)$', # Image files
r'\.(mp4|avi|mov|wmv|flv|webm)$', # Video files
r'\.(mp3|wav|flac|aac)$', # Audio files
r'\.(zip|rar|7z|tar|gz)$', # Archive files
r'^https?://(www\.)?(facebook|twitter|instagram|linkedin|youtube)\.com', # Social media
r'^https?://(www\.)?(amazon|ebay|etsy)\.com', # E-commerce
r'^https?://(www\.)?(wikipedia)\.org', # Wikipedia (too generic)
]
logger.info("✅ ResearchDataFilter initialized with quality thresholds")
def filter_research_data(self, research_data: BlogResearchResponse) -> BlogResearchResponse:
"""
Main filtering method that processes all research data components.
Args:
research_data: Raw research data from the research service
Returns:
Filtered and cleaned research data optimized for AI processing
"""
logger.info(f"Starting research data filtering for {len(research_data.sources)} sources")
# Track original counts for logging
original_counts = {
'sources': len(research_data.sources),
'grounding_chunks': len(research_data.grounding_metadata.grounding_chunks) if research_data.grounding_metadata else 0,
'grounding_supports': len(research_data.grounding_metadata.grounding_supports) if research_data.grounding_metadata else 0,
'citations': len(research_data.grounding_metadata.citations) if research_data.grounding_metadata else 0,
}
# Filter sources
filtered_sources = self.filter_sources(research_data.sources)
# Filter grounding metadata
filtered_grounding_metadata = self.filter_grounding_metadata(research_data.grounding_metadata)
# Clean keyword analysis
cleaned_keyword_analysis = self.clean_keyword_analysis(research_data.keyword_analysis)
# Clean competitor analysis
cleaned_competitor_analysis = self.clean_competitor_analysis(research_data.competitor_analysis)
# Filter content gaps
filtered_content_gaps = self.filter_content_gaps(
research_data.keyword_analysis.get('content_gaps', []),
research_data
)
# Update keyword analysis with filtered content gaps
cleaned_keyword_analysis['content_gaps'] = filtered_content_gaps
# Create filtered research response
filtered_research = BlogResearchResponse(
success=research_data.success,
sources=filtered_sources,
keyword_analysis=cleaned_keyword_analysis,
competitor_analysis=cleaned_competitor_analysis,
suggested_angles=research_data.suggested_angles, # Keep as-is for now
search_widget=research_data.search_widget,
search_queries=research_data.search_queries,
grounding_metadata=filtered_grounding_metadata,
error_message=research_data.error_message
)
# Log filtering results
self._log_filtering_results(original_counts, filtered_research)
return filtered_research
def filter_sources(self, sources: List[ResearchSource]) -> List[ResearchSource]:
"""
Filter sources based on quality, relevance, and recency criteria.
Args:
sources: List of research sources to filter
Returns:
Filtered list of high-quality sources
"""
if not sources:
return []
filtered_sources = []
for source in sources:
# Quality filters
if not self._is_source_high_quality(source):
continue
# Relevance filters
if not self._is_source_relevant(source):
continue
# Recency filters
if not self._is_source_recent(source):
continue
filtered_sources.append(source)
# Sort by credibility score and limit to max_sources
filtered_sources.sort(key=lambda s: s.credibility_score or 0.8, reverse=True)
filtered_sources = filtered_sources[:self.max_sources]
# Fail-open: if everything was filtered out, return a trimmed set of original sources
if not filtered_sources and sources:
logger.warning("All sources filtered out by thresholds. Falling back to top sources without strict filters.")
fallback = sorted(
sources,
key=lambda s: (s.credibility_score or 0.8),
reverse=True
)[: self.max_sources]
return fallback
logger.info(f"Filtered sources: {len(sources)}{len(filtered_sources)}")
return filtered_sources
def filter_grounding_metadata(self, grounding_metadata: Optional[GroundingMetadata]) -> Optional[GroundingMetadata]:
"""
Filter grounding metadata to keep only high-confidence, relevant data.
Args:
grounding_metadata: Raw grounding metadata to filter
Returns:
Filtered grounding metadata with high-quality data only
"""
if not grounding_metadata:
return None
# Filter grounding chunks by confidence
filtered_chunks = []
for chunk in grounding_metadata.grounding_chunks:
if chunk.confidence_score and chunk.confidence_score >= self.min_grounding_confidence:
filtered_chunks.append(chunk)
# Limit chunks to max_grounding_chunks
filtered_chunks = filtered_chunks[:self.max_grounding_chunks]
# Filter grounding supports by confidence
filtered_supports = []
for support in grounding_metadata.grounding_supports:
if support.confidence_scores and max(support.confidence_scores) >= self.min_grounding_confidence:
filtered_supports.append(support)
# Filter citations by type and relevance
filtered_citations = []
for citation in grounding_metadata.citations:
if self._is_citation_relevant(citation):
filtered_citations.append(citation)
# Fail-open strategies to avoid empty UI:
if not filtered_chunks and grounding_metadata.grounding_chunks:
logger.warning("All grounding chunks filtered out. Falling back to first N chunks without confidence filter.")
filtered_chunks = grounding_metadata.grounding_chunks[: self.max_grounding_chunks]
if not filtered_supports and grounding_metadata.grounding_supports:
logger.warning("All grounding supports filtered out. Falling back to first N supports without confidence filter.")
filtered_supports = grounding_metadata.grounding_supports[: self.max_grounding_chunks]
# Create filtered grounding metadata
filtered_metadata = GroundingMetadata(
grounding_chunks=filtered_chunks,
grounding_supports=filtered_supports,
citations=filtered_citations,
search_entry_point=grounding_metadata.search_entry_point,
web_search_queries=grounding_metadata.web_search_queries
)
logger.info(f"Filtered grounding metadata: {len(grounding_metadata.grounding_chunks)} chunks → {len(filtered_chunks)} chunks")
return filtered_metadata
def clean_keyword_analysis(self, keyword_analysis: Dict[str, Any]) -> Dict[str, Any]:
"""
Clean and deduplicate keyword analysis data.
Args:
keyword_analysis: Raw keyword analysis data
Returns:
Cleaned and deduplicated keyword analysis
"""
if not keyword_analysis:
return {}
cleaned_analysis = {}
# Clean and deduplicate keyword lists
keyword_categories = ['primary', 'secondary', 'long_tail', 'semantic_keywords', 'trending_terms']
for category in keyword_categories:
if category in keyword_analysis and isinstance(keyword_analysis[category], list):
cleaned_keywords = self._clean_keyword_list(keyword_analysis[category])
cleaned_analysis[category] = cleaned_keywords[:self.max_keywords_per_category]
# Clean other fields
other_fields = ['search_intent', 'difficulty', 'analysis_insights']
for field in other_fields:
if field in keyword_analysis:
cleaned_analysis[field] = keyword_analysis[field]
# Clean content gaps separately (handled by filter_content_gaps)
# Don't add content_gaps if it's empty to avoid adding empty lists
if 'content_gaps' in keyword_analysis and keyword_analysis['content_gaps']:
cleaned_analysis['content_gaps'] = keyword_analysis['content_gaps'] # Will be filtered later
logger.info(f"Cleaned keyword analysis: {len(keyword_analysis)} categories → {len(cleaned_analysis)} categories")
return cleaned_analysis
def clean_competitor_analysis(self, competitor_analysis: Dict[str, Any]) -> Dict[str, Any]:
"""
Clean and validate competitor analysis data.
Args:
competitor_analysis: Raw competitor analysis data
Returns:
Cleaned competitor analysis data
"""
if not competitor_analysis:
return {}
cleaned_analysis = {}
# Clean competitor lists
competitor_lists = ['top_competitors', 'opportunities', 'competitive_advantages']
for field in competitor_lists:
if field in competitor_analysis and isinstance(competitor_analysis[field], list):
cleaned_list = [item.strip() for item in competitor_analysis[field] if item.strip()]
cleaned_analysis[field] = cleaned_list[:10] # Limit to top 10
# Clean other fields
other_fields = ['market_positioning', 'competitive_landscape', 'market_share']
for field in other_fields:
if field in competitor_analysis:
cleaned_analysis[field] = competitor_analysis[field]
logger.info(f"Cleaned competitor analysis: {len(competitor_analysis)} fields → {len(cleaned_analysis)} fields")
return cleaned_analysis
def filter_content_gaps(self, content_gaps: List[str], research_data: BlogResearchResponse) -> List[str]:
"""
Filter content gaps to keep only actionable, high-value ones.
Args:
content_gaps: List of identified content gaps
research_data: Research data for context
Returns:
Filtered list of actionable content gaps
"""
if not content_gaps:
return []
filtered_gaps = []
for gap in content_gaps:
# Quality filters
if not self._is_gap_high_quality(gap):
continue
# Relevance filters
if not self._is_gap_relevant_to_topic(gap, research_data):
continue
# Actionability filters
if not self._is_gap_actionable(gap):
continue
filtered_gaps.append(gap)
# Limit to max_content_gaps
filtered_gaps = filtered_gaps[:self.max_content_gaps]
logger.info(f"Filtered content gaps: {len(content_gaps)}{len(filtered_gaps)}")
return filtered_gaps
# Private helper methods
def _is_source_high_quality(self, source: ResearchSource) -> bool:
"""Check if source meets quality criteria."""
# Credibility score check
if source.credibility_score and source.credibility_score < self.min_credibility_score:
return False
# Excerpt length check
if source.excerpt and len(source.excerpt) < self.min_excerpt_length:
return False
# Title quality check
if not source.title or len(source.title.strip()) < 10:
return False
return True
def _is_source_relevant(self, source: ResearchSource) -> bool:
"""Check if source is relevant (not irrelevant patterns)."""
if not source.url:
return True # Keep sources without URLs
# Check against irrelevant patterns
for pattern in self.irrelevant_patterns:
if re.search(pattern, source.url, re.IGNORECASE):
return False
return True
def _is_source_recent(self, source: ResearchSource) -> bool:
"""Check if source is recent enough."""
if not source.published_at:
return True # Keep sources without dates
try:
# Parse date (assuming ISO format or common formats)
published_date = self._parse_date(source.published_at)
if published_date:
cutoff_date = datetime.now() - timedelta(days=self.max_source_age_days)
return published_date >= cutoff_date
except Exception as e:
logger.warning(f"Error parsing date '{source.published_at}': {e}")
return True # Keep sources with unparseable dates
def _is_citation_relevant(self, citation: Citation) -> bool:
"""Check if citation is relevant and high-quality."""
# Check citation type
relevant_types = ['expert_opinion', 'statistical_data', 'recent_news', 'research_study']
if citation.citation_type not in relevant_types:
return False
# Check text quality
if not citation.text or len(citation.text.strip()) < 20:
return False
return True
def _is_gap_high_quality(self, gap: str) -> bool:
"""Check if content gap is high quality."""
gap = gap.strip()
# Length check
if len(gap) < 10:
return False
# Generic gap check
generic_gaps = ['general', 'overview', 'introduction', 'basics', 'fundamentals']
if gap.lower() in generic_gaps:
return False
# Check for meaningful content
if len(gap.split()) < 3:
return False
return True
def _is_gap_relevant_to_topic(self, gap: str, research_data: BlogResearchResponse) -> bool:
"""Check if content gap is relevant to the research topic."""
# Simple relevance check - could be enhanced with more sophisticated matching
primary_keywords = research_data.keyword_analysis.get('primary', [])
if not primary_keywords:
return True # Keep gaps if no keywords available
gap_lower = gap.lower()
for keyword in primary_keywords:
if keyword.lower() in gap_lower:
return True
# If no direct keyword match, check for common AI-related terms
ai_terms = ['ai', 'artificial intelligence', 'machine learning', 'automation', 'technology', 'digital']
for term in ai_terms:
if term in gap_lower:
return True
return True # Default to keeping gaps if no clear relevance check
def _is_gap_actionable(self, gap: str) -> bool:
"""Check if content gap is actionable (can be addressed with content)."""
gap_lower = gap.lower()
# Check for actionable indicators
actionable_indicators = [
'how to', 'guide', 'tutorial', 'steps', 'process', 'method',
'best practices', 'tips', 'strategies', 'techniques', 'approach',
'comparison', 'vs', 'versus', 'difference', 'pros and cons',
'trends', 'future', '2024', '2025', 'emerging', 'new'
]
for indicator in actionable_indicators:
if indicator in gap_lower:
return True
return True # Default to actionable if no specific indicators
def _clean_keyword_list(self, keywords: List[str]) -> List[str]:
"""Clean and deduplicate a list of keywords."""
cleaned_keywords = []
seen_keywords = set()
for keyword in keywords:
if not keyword or not isinstance(keyword, str):
continue
# Clean keyword
cleaned_keyword = keyword.strip().lower()
# Skip empty or too short keywords
if len(cleaned_keyword) < 2:
continue
# Skip stop words
if cleaned_keyword in self.stop_words:
continue
# Skip duplicates
if cleaned_keyword in seen_keywords:
continue
cleaned_keywords.append(cleaned_keyword)
seen_keywords.add(cleaned_keyword)
return cleaned_keywords
def _parse_date(self, date_str: str) -> Optional[datetime]:
"""Parse date string into datetime object."""
if not date_str:
return None
# Common date formats
date_formats = [
'%Y-%m-%d',
'%Y-%m-%dT%H:%M:%S',
'%Y-%m-%dT%H:%M:%SZ',
'%Y-%m-%dT%H:%M:%S.%fZ',
'%B %d, %Y',
'%b %d, %Y',
'%d %B %Y',
'%d %b %Y',
'%m/%d/%Y',
'%d/%m/%Y'
]
for fmt in date_formats:
try:
return datetime.strptime(date_str, fmt)
except ValueError:
continue
return None
def _log_filtering_results(self, original_counts: Dict[str, int], filtered_research: BlogResearchResponse):
"""Log the results of filtering operations."""
filtered_counts = {
'sources': len(filtered_research.sources),
'grounding_chunks': len(filtered_research.grounding_metadata.grounding_chunks) if filtered_research.grounding_metadata else 0,
'grounding_supports': len(filtered_research.grounding_metadata.grounding_supports) if filtered_research.grounding_metadata else 0,
'citations': len(filtered_research.grounding_metadata.citations) if filtered_research.grounding_metadata else 0,
}
logger.info("📊 Research Data Filtering Results:")
for key, original_count in original_counts.items():
filtered_count = filtered_counts[key]
reduction_percent = ((original_count - filtered_count) / original_count * 100) if original_count > 0 else 0
logger.info(f" {key}: {original_count}{filtered_count} ({reduction_percent:.1f}% reduction)")
# Log final content gap count (the pre-filter count is not tracked in original_counts)
final_gaps = len(filtered_research.keyword_analysis.get('content_gaps', []))
logger.info(f"  content_gaps: {final_gaps} after filtering")
logger.info("✅ Research data filtering completed successfully")

View File

@@ -0,0 +1,226 @@
"""
Exa Research Provider
Neural search implementation using Exa API for high-quality, citation-rich research.
"""
from exa_py import Exa
import os
from loguru import logger
from models.subscription_models import APIProvider
from .base_provider import ResearchProvider as BaseProvider
class ExaResearchProvider(BaseProvider):
"""Exa neural search provider."""
def __init__(self):
self.api_key = os.getenv("EXA_API_KEY")
if not self.api_key:
raise RuntimeError("EXA_API_KEY not configured")
self.exa = Exa(self.api_key)
logger.info("✅ Exa Research Provider initialized")
async def search(self, prompt, topic, industry, target_audience, config, user_id):
"""Execute Exa neural search and return standardized results."""
# Build Exa query
query = f"{topic} {industry} {target_audience}"
# Determine category: use exa_category if set, otherwise map from source_types
category = config.exa_category if config.exa_category else self._map_source_type_to_category(config.source_types)
# Search parameters (type, num_results, contents options, and the optional
# category/domain filters) are passed directly to search_and_contents below.
logger.info(f"[Exa Research] Executing search: {query}")
# Execute Exa search - pass contents parameters directly, not nested
try:
results = self.exa.search_and_contents(
query,
text={'max_characters': 1000},
summary={'query': f"Key insights about {topic}"},
highlights={'num_sentences': 2, 'highlights_per_url': 3},
type=config.exa_search_type or "auto",
num_results=min(config.max_sources, 25),
**({k: v for k, v in {
'category': category,
'include_domains': config.exa_include_domains,
'exclude_domains': config.exa_exclude_domains
}.items() if v})
)
except Exception as e:
logger.error(f"[Exa Research] API call failed: {e}")
# Try simpler call without contents if the above fails
try:
logger.info("[Exa Research] Retrying with simplified parameters")
results = self.exa.search_and_contents(
query,
type=config.exa_search_type or "auto",
num_results=min(config.max_sources, 25),
**({k: v for k, v in {
'category': category,
'include_domains': config.exa_include_domains,
'exclude_domains': config.exa_exclude_domains
}.items() if v})
)
except Exception as retry_error:
logger.error(f"[Exa Research] Retry also failed: {retry_error}")
raise RuntimeError(f"Exa search failed: {str(retry_error)}") from retry_error
        # Transform to standardized format
        sources = self._transform_sources(results.results)
        content = self._aggregate_content(results.results)
        search_type = getattr(results, 'resolvedSearchType', 'neural')
# Get cost if available
cost = 0.005 # Default Exa cost for 1-25 results
if hasattr(results, 'costDollars'):
if hasattr(results.costDollars, 'total'):
cost = results.costDollars.total
logger.info(f"[Exa Research] Search completed: {len(sources)} sources, type: {search_type}")
return {
'sources': sources,
'content': content,
'search_type': search_type,
'provider': 'exa',
'search_queries': [query],
'cost': {'total': cost}
}
def get_provider_enum(self):
"""Return EXA provider enum for subscription tracking."""
return APIProvider.EXA
def estimate_tokens(self) -> int:
"""Estimate token usage for Exa (not token-based)."""
return 0 # Exa is per-search, not token-based
def _map_source_type_to_category(self, source_types):
"""Map SourceType enum to Exa category parameter."""
if not source_types:
return None
category_map = {
'research paper': 'research paper',
'news': 'news',
'web': 'personal site',
'industry': 'company',
'expert': 'linkedin profile'
}
for st in source_types:
if st.value in category_map:
return category_map[st.value]
return None
    def _transform_sources(self, results):
        """Transform Exa results to ResearchSource format."""
        sources = []
        for idx, result in enumerate(results):
            url = getattr(result, 'url', '')
            sources.append({
                'title': getattr(result, 'title', ''),
                'url': url,
                'excerpt': self._get_excerpt(result),
                'credibility_score': 0.85,  # Exa results are high quality
                'published_at': getattr(result, 'publishedDate', None),
                'index': idx,
                'source_type': self._determine_source_type(url),
                'content': getattr(result, 'text', ''),
                'highlights': getattr(result, 'highlights', []),
                'summary': getattr(result, 'summary', '')
            })
        return sources
def _get_excerpt(self, result):
"""Extract excerpt from Exa result."""
if hasattr(result, 'text') and result.text:
return result.text[:500]
elif hasattr(result, 'summary') and result.summary:
return result.summary
return ''
def _determine_source_type(self, url):
"""Determine source type from URL."""
if not url:
return 'web'
url_lower = url.lower()
if 'arxiv.org' in url_lower or 'research' in url_lower:
return 'academic'
elif any(news in url_lower for news in ['cnn.com', 'bbc.com', 'reuters.com', 'theguardian.com']):
return 'news'
elif 'linkedin.com' in url_lower:
return 'expert'
else:
return 'web'
def _aggregate_content(self, results):
"""Aggregate content from Exa results for LLM analysis."""
content_parts = []
for idx, result in enumerate(results):
if hasattr(result, 'summary') and result.summary:
content_parts.append(f"Source {idx + 1}: {result.summary}")
elif hasattr(result, 'text') and result.text:
content_parts.append(f"Source {idx + 1}: {result.text[:1000]}")
return "\n\n".join(content_parts)
def track_exa_usage(self, user_id: str, cost: float):
"""Track Exa API usage after successful call."""
from services.database import get_db
from services.subscription import PricingService
from sqlalchemy import text
db = next(get_db())
try:
pricing_service = PricingService(db)
current_period = pricing_service.get_current_billing_period(user_id)
# Update exa_calls and exa_cost via SQL UPDATE
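            # Note: this UPDATE assumes a usage_summaries row already exists for the
            # user/billing period; if none matches, the statement affects zero rows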
update_query = text("""
UPDATE usage_summaries
SET exa_calls = COALESCE(exa_calls, 0) + 1,
exa_cost = COALESCE(exa_cost, 0) + :cost,
total_calls = total_calls + 1,
total_cost = total_cost + :cost
WHERE user_id = :user_id AND billing_period = :period
""")
db.execute(update_query, {
'cost': cost,
'user_id': user_id,
'period': current_period
})
db.commit()
logger.info(f"[Exa] Tracked usage: user={user_id}, cost=${cost}")
except Exception as e:
logger.error(f"[Exa] Failed to track usage: {e}")
db.rollback()
finally:
db.close()
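
A minimal usage sketch for the provider (assumes `EXA_API_KEY` is set and that the config model's Exa-specific fields and `max_sources` default sensibly; values are illustrative):

```python
import asyncio

from models.blog_models import ResearchConfig, ResearchMode, ResearchProvider


async def demo():
    provider = ExaResearchProvider()  # raises RuntimeError if EXA_API_KEY is unset
    config = ResearchConfig(mode=ResearchMode.BASIC, provider=ResearchProvider.EXA)
    result = await provider.search(
        prompt="",  # Exa builds its own query from topic/industry/audience
        topic="AI content marketing",
        industry="SaaS",
        target_audience="growth marketers",
        config=config,
        user_id="user_123",
    )
    print(result["search_type"], len(result["sources"]), result["cost"])


asyncio.run(demo())
```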

View File

@@ -0,0 +1,40 @@
"""
Google Research Provider
Wrapper for Gemini native Google Search grounding to match base provider interface.
"""
from services.llm_providers.gemini_grounded_provider import GeminiGroundedProvider
from models.subscription_models import APIProvider
from .base_provider import ResearchProvider as BaseProvider
from loguru import logger
class GoogleResearchProvider(BaseProvider):
"""Google research provider using Gemini native grounding."""
def __init__(self):
self.gemini = GeminiGroundedProvider()
async def search(self, prompt, topic, industry, target_audience, config, user_id):
"""Call Gemini grounding with pre-flight validation."""
logger.info(f"[Google Research] Executing search for topic: {topic}")
result = await self.gemini.generate_grounded_content(
prompt=prompt,
content_type="research",
max_tokens=2000,
user_id=user_id,
validate_subsequent_operations=True
)
return result
def get_provider_enum(self):
"""Return GEMINI provider enum for subscription tracking."""
return APIProvider.GEMINI
def estimate_tokens(self) -> int:
"""Estimate token usage for Google grounding."""
return 1200 # Conservative estimate
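
For reference, the dict returned here is consumed downstream by `_extract_sources_from_grounding` and `_extract_grounding_metadata` in the research service; based on those call sites, it carries roughly this shape (keys inferred from the consuming code, values illustrative):

```python
gemini_result = {
    "content": "...grounded research text...",
    "sources": [{"title": "...", "url": "https://...", "content": "...", "credibility_score": 0.8}],
    "search_widget": "<rendered search entry point HTML>",
    "search_queries": ["query one", "query two"],
    "grounding_metadata": {"grounding_chunks": [], "grounding_supports": []},
    "citations": [],
    "token_usage": {"prompt_tokens": 0, "completion_tokens": 0},
}
```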

View File

@@ -0,0 +1,79 @@
"""
Keyword Analyzer - AI-powered keyword analysis for research content.
Extracts and analyzes keywords from research content using structured AI responses.
"""
from typing import Any, Dict, List, Optional
from loguru import logger
class KeywordAnalyzer:
"""Analyzes keywords from research content using AI-powered extraction."""
    def analyze(self, content: str, original_keywords: List[str], user_id: Optional[str] = None) -> Dict[str, Any]:
"""Parse comprehensive keyword analysis from the research content using AI."""
        # Use AI to extract and analyze keywords from the rich research content.
        # Content is truncated to 3000 characters to stay within token limits.
        keyword_prompt = f"""
        Analyze the following research content and extract comprehensive keyword insights for: {', '.join(original_keywords)}
        Research Content:
        {content[:3000]}
Extract and analyze:
1. Primary keywords (main topic terms)
2. Secondary keywords (related terms, synonyms)
3. Long-tail opportunities (specific phrases people search for)
4. Search intent (informational, commercial, navigational, transactional)
5. Keyword difficulty assessment (1-10 scale)
6. Content gaps (what competitors are missing)
7. Semantic keywords (related concepts)
8. Trending terms (emerging keywords)
Respond with JSON:
{{
"primary": ["keyword1", "keyword2"],
"secondary": ["related1", "related2"],
"long_tail": ["specific phrase 1", "specific phrase 2"],
"search_intent": "informational|commercial|navigational|transactional",
"difficulty": 7,
"content_gaps": ["gap1", "gap2"],
"semantic_keywords": ["concept1", "concept2"],
"trending_terms": ["trend1", "trend2"],
"analysis_insights": "Brief analysis of keyword landscape"
}}
"""
from services.llm_providers.main_text_generation import llm_text_gen
keyword_schema = {
"type": "object",
"properties": {
"primary": {"type": "array", "items": {"type": "string"}},
"secondary": {"type": "array", "items": {"type": "string"}},
"long_tail": {"type": "array", "items": {"type": "string"}},
"search_intent": {"type": "string"},
"difficulty": {"type": "integer"},
"content_gaps": {"type": "array", "items": {"type": "string"}},
"semantic_keywords": {"type": "array", "items": {"type": "string"}},
"trending_terms": {"type": "array", "items": {"type": "string"}},
"analysis_insights": {"type": "string"}
},
"required": ["primary", "secondary", "long_tail", "search_intent", "difficulty", "content_gaps", "semantic_keywords", "trending_terms", "analysis_insights"]
}
keyword_analysis = llm_text_gen(
prompt=keyword_prompt,
json_struct=keyword_schema,
user_id=user_id
)
if isinstance(keyword_analysis, dict) and 'error' not in keyword_analysis:
logger.info("✅ AI keyword analysis completed successfully")
return keyword_analysis
else:
# Fail gracefully - no fallback data
error_msg = keyword_analysis.get('error', 'Unknown error') if isinstance(keyword_analysis, dict) else str(keyword_analysis)
logger.error(f"AI keyword analysis failed: {error_msg}")
raise ValueError(f"Keyword analysis failed: {error_msg}")

View File

@@ -0,0 +1,914 @@
"""
Research Service - Core research functionality for AI Blog Writer.
Handles Google Search grounding, caching, and research orchestration.
"""
from typing import Dict, Any, List, Optional
from datetime import datetime
from loguru import logger
from models.blog_models import (
BlogResearchRequest,
BlogResearchResponse,
ResearchSource,
GroundingMetadata,
GroundingChunk,
GroundingSupport,
Citation,
ResearchConfig,
ResearchMode,
ResearchProvider,
)
from services.blog_writer.logger_config import blog_writer_logger, log_function_call
from fastapi import HTTPException
from .keyword_analyzer import KeywordAnalyzer
from .competitor_analyzer import CompetitorAnalyzer
from .content_angle_generator import ContentAngleGenerator
from .data_filter import ResearchDataFilter
from .research_strategies import get_strategy_for_mode
class ResearchService:
"""Service for conducting comprehensive research using Google Search grounding."""
def __init__(self):
self.keyword_analyzer = KeywordAnalyzer()
self.competitor_analyzer = CompetitorAnalyzer()
self.content_angle_generator = ContentAngleGenerator()
self.data_filter = ResearchDataFilter()
@log_function_call("research_operation")
async def research(self, request: BlogResearchRequest, user_id: str) -> BlogResearchResponse:
"""
Stage 1: Research & Strategy (AI Orchestration)
Uses ONLY Gemini's native Google Search grounding - ONE API call for everything.
Follows LinkedIn service pattern for efficiency and cost optimization.
Includes intelligent caching for exact keyword matches.
"""
try:
            from services.cache.research_cache import research_cache
            from services.cache.persistent_research_cache import persistent_research_cache
topic = request.topic or ", ".join(request.keywords)
industry = request.industry or (request.persona.industry if request.persona and request.persona.industry else "General")
target_audience = getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'
# Log research parameters
blog_writer_logger.log_operation_start(
"research",
topic=topic,
industry=industry,
target_audience=target_audience,
keywords=request.keywords,
keyword_count=len(request.keywords)
)
# Check cache first for exact keyword match
cached_result = research_cache.get_cached_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience
)
if cached_result:
logger.info(f"Returning cached research result for keywords: {request.keywords}")
blog_writer_logger.log_operation_end("research", 0, success=True, cache_hit=True)
# Normalize cached data to fix None values in confidence_scores
normalized_result = self._normalize_cached_research_data(cached_result)
return BlogResearchResponse(**normalized_result)
# User ID validation (validation logic is now in Google Grounding provider)
if not user_id:
raise ValueError("user_id is required for research operation. Please provide Clerk user ID.")
# Cache miss - proceed with API call
logger.info(f"Cache miss - making API call for keywords: {request.keywords}")
blog_writer_logger.log_operation_start("research_api_call", api_name="research", operation="research")
# Determine research mode and get appropriate strategy
research_mode = request.research_mode or ResearchMode.BASIC
config = request.config or ResearchConfig(mode=research_mode, provider=ResearchProvider.GOOGLE)
strategy = get_strategy_for_mode(research_mode)
logger.info(f"Research: mode={research_mode.value}, provider={config.provider.value}")
# Build research prompt based on strategy
research_prompt = strategy.build_research_prompt(topic, industry, target_audience, config)
# Route to appropriate provider
if config.provider == ResearchProvider.EXA:
# Exa research workflow
from .exa_provider import ExaResearchProvider
from services.subscription.preflight_validator import validate_exa_research_operations
from services.database import get_db
from services.subscription import PricingService
import os
import time
# Pre-flight validation
db_val = next(get_db())
try:
pricing_service = PricingService(db_val)
gpt_provider = os.getenv("GPT_PROVIDER", "google")
validate_exa_research_operations(pricing_service, user_id, gpt_provider)
finally:
db_val.close()
# Execute Exa search
api_start_time = time.time()
try:
exa_provider = ExaResearchProvider()
raw_result = await exa_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
api_duration_ms = (time.time() - api_start_time) * 1000
# Track usage
cost = raw_result.get('cost', {}).get('total', 0.005) if isinstance(raw_result.get('cost'), dict) else 0.005
exa_provider.track_exa_usage(user_id, cost)
# Log API call performance
blog_writer_logger.log_api_call(
"exa_search",
"search_and_contents",
api_duration_ms,
token_usage={},
content_length=len(raw_result.get('content', ''))
)
# Extract content for downstream analysis
content = raw_result.get('content', '')
sources = raw_result.get('sources', [])
search_widget = "" # Exa doesn't provide search widgets
search_queries = raw_result.get('search_queries', [])
grounding_metadata = None # Exa doesn't provide grounding metadata
except RuntimeError as e:
if "EXA_API_KEY not configured" in str(e):
logger.warning("Exa not configured, falling back to Google")
config.provider = ResearchProvider.GOOGLE
# Continue to Google flow below
raw_result = None
else:
raise
elif config.provider == ResearchProvider.TAVILY:
# Tavily research workflow
from .tavily_provider import TavilyResearchProvider
from services.database import get_db
from services.subscription import PricingService
import os
import time
# Pre-flight validation (similar to Exa)
db_val = next(get_db())
try:
pricing_service = PricingService(db_val)
# Check Tavily usage limits
limits = pricing_service.get_user_limits(user_id)
tavily_limit = limits.get('limits', {}).get('tavily_calls', 0) if limits else 0
# Get current usage
from models.subscription_models import UsageSummary
from datetime import datetime
current_period = pricing_service.get_current_billing_period(user_id) or datetime.now().strftime("%Y-%m")
usage = db_val.query(UsageSummary).filter(
UsageSummary.user_id == user_id,
UsageSummary.billing_period == current_period
).first()
                    current_calls = (getattr(usage, 'tavily_calls', 0) or 0) if usage else 0
if tavily_limit > 0 and current_calls >= tavily_limit:
raise HTTPException(
status_code=429,
detail={
'error': 'Tavily API call limit exceeded',
'message': f'You have reached your Tavily API call limit ({tavily_limit} calls). Please upgrade your plan or wait for the next billing period.',
'provider': 'tavily',
'usage_info': {
'current': current_calls,
'limit': tavily_limit
}
}
)
except HTTPException:
raise
except Exception as e:
logger.warning(f"Error checking Tavily limits: {e}")
finally:
db_val.close()
# Execute Tavily search
api_start_time = time.time()
try:
tavily_provider = TavilyResearchProvider()
raw_result = await tavily_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
api_duration_ms = (time.time() - api_start_time) * 1000
# Track usage
cost = raw_result.get('cost', {}).get('total', 0.001) if isinstance(raw_result.get('cost'), dict) else 0.001
search_depth = config.tavily_search_depth or "basic"
tavily_provider.track_tavily_usage(user_id, cost, search_depth)
# Log API call performance
blog_writer_logger.log_api_call(
"tavily_search",
"search",
api_duration_ms,
token_usage={},
content_length=len(raw_result.get('content', ''))
)
# Extract content for downstream analysis
content = raw_result.get('content', '')
sources = raw_result.get('sources', [])
search_widget = "" # Tavily doesn't provide search widgets
search_queries = raw_result.get('search_queries', [])
grounding_metadata = None # Tavily doesn't provide grounding metadata
except RuntimeError as e:
if "TAVILY_API_KEY not configured" in str(e):
logger.warning("Tavily not configured, falling back to Google")
config.provider = ResearchProvider.GOOGLE
# Continue to Google flow below
raw_result = None
else:
raise
if config.provider not in [ResearchProvider.EXA, ResearchProvider.TAVILY]:
# Google research (existing flow) or fallback from Exa
from .google_provider import GoogleResearchProvider
import time
api_start_time = time.time()
google_provider = GoogleResearchProvider()
gemini_result = await google_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
api_duration_ms = (time.time() - api_start_time) * 1000
# Log API call performance
blog_writer_logger.log_api_call(
"gemini_grounded",
"generate_grounded_content",
api_duration_ms,
token_usage=gemini_result.get("token_usage", {}),
content_length=len(gemini_result.get("content", ""))
)
# Extract sources and content
sources = self._extract_sources_from_grounding(gemini_result)
content = gemini_result.get("content", "")
search_widget = gemini_result.get("search_widget", "") or ""
search_queries = gemini_result.get("search_queries", []) or []
grounding_metadata = self._extract_grounding_metadata(gemini_result)
# Continue with common analysis (same for both providers)
keyword_analysis = self.keyword_analyzer.analyze(content, request.keywords, user_id=user_id)
competitor_analysis = self.competitor_analyzer.analyze(content, user_id=user_id)
suggested_angles = self.content_angle_generator.generate(content, topic, industry, user_id=user_id)
logger.info(f"Research completed successfully with {len(sources)} sources and {len(search_queries)} search queries")
# Log analysis results
blog_writer_logger.log_performance(
"research_analysis",
len(content),
"characters",
sources_count=len(sources),
search_queries_count=len(search_queries),
keyword_analysis_keys=len(keyword_analysis),
suggested_angles_count=len(suggested_angles)
)
# Create the response
response = BlogResearchResponse(
success=True,
sources=sources,
keyword_analysis=keyword_analysis,
competitor_analysis=competitor_analysis,
suggested_angles=suggested_angles,
# Add search widget and queries for UI display
search_widget=search_widget if 'search_widget' in locals() else "",
search_queries=search_queries if 'search_queries' in locals() else [],
# Add grounding metadata for detailed UI display
grounding_metadata=grounding_metadata,
)
# Filter and clean research data for optimal AI processing
filtered_response = self.data_filter.filter_research_data(response)
logger.info("Research data filtering completed successfully")
# Cache the successful result for future exact keyword matches (both caches)
persistent_research_cache.cache_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience,
result=filtered_response.dict()
)
# Also cache in memory for faster access
research_cache.cache_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience,
result=filtered_response.dict()
)
return filtered_response
except HTTPException:
# Re-raise HTTPException (subscription errors) - let task manager handle it
raise
except Exception as e:
error_message = str(e)
logger.error(f"Research failed: {error_message}")
# Log error with full context
blog_writer_logger.log_error(
e,
"research",
context={
"topic": topic,
"keywords": request.keywords,
"industry": industry,
"target_audience": target_audience
}
)
# Import custom exceptions for better error handling
from services.blog_writer.exceptions import (
ResearchFailedException,
APIRateLimitException,
APITimeoutException,
ValidationException
)
# Determine if this is a retryable error
retry_suggested = True
user_message = "Research failed. Please try again with different keywords or check your internet connection."
if isinstance(e, APIRateLimitException):
retry_suggested = True
user_message = f"Rate limit exceeded. Please wait {e.context.get('retry_after', 60)} seconds before trying again."
elif isinstance(e, APITimeoutException):
retry_suggested = True
user_message = "Research request timed out. Please try again with a shorter query or check your internet connection."
elif isinstance(e, ValidationException):
retry_suggested = False
user_message = "Invalid research request. Please check your input parameters and try again."
elif "401" in error_message or "403" in error_message:
retry_suggested = False
user_message = "Authentication failed. Please check your API credentials."
elif "400" in error_message:
retry_suggested = False
user_message = "Invalid request. Please check your input parameters."
# Return a graceful failure response with enhanced error information
return BlogResearchResponse(
success=False,
sources=[],
keyword_analysis={},
competitor_analysis={},
suggested_angles=[],
search_widget="",
search_queries=[],
error_message=user_message,
retry_suggested=retry_suggested,
error_code=getattr(e, 'error_code', 'RESEARCH_FAILED'),
actionable_steps=getattr(e, 'actionable_steps', [
"Try with different keywords",
"Check your internet connection",
"Wait a few minutes and try again",
"Contact support if the issue persists"
])
)
@log_function_call("research_with_progress")
async def research_with_progress(self, request: BlogResearchRequest, task_id: str, user_id: str) -> BlogResearchResponse:
"""
Research method with progress updates for real-time feedback.
"""
try:
from services.cache.research_cache import research_cache
from services.cache.persistent_research_cache import persistent_research_cache
from api.blog_writer.task_manager import task_manager
topic = request.topic or ", ".join(request.keywords)
industry = request.industry or (request.persona.industry if request.persona and request.persona.industry else "General")
target_audience = getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'
# Check cache first for exact keyword match (try both caches)
await task_manager.update_progress(task_id, "🔍 Checking cache for existing research...")
# Try persistent cache first (survives restarts)
cached_result = persistent_research_cache.get_cached_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience
)
# Fallback to in-memory cache
if not cached_result:
cached_result = research_cache.get_cached_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience
)
if cached_result:
await task_manager.update_progress(task_id, "✅ Found cached research results! Returning instantly...")
logger.info(f"Returning cached research result for keywords: {request.keywords}")
# Normalize cached data to fix None values in confidence_scores
normalized_result = self._normalize_cached_research_data(cached_result)
return BlogResearchResponse(**normalized_result)
# User ID validation
if not user_id:
await task_manager.update_progress(task_id, "❌ Error: User ID is required for research operation")
raise ValueError("user_id is required for research operation. Please provide Clerk user ID.")
# Determine research mode and get appropriate strategy
research_mode = request.research_mode or ResearchMode.BASIC
config = request.config or ResearchConfig(mode=research_mode, provider=ResearchProvider.GOOGLE)
strategy = get_strategy_for_mode(research_mode)
logger.info(f"Research: mode={research_mode.value}, provider={config.provider.value}")
# Build research prompt based on strategy
research_prompt = strategy.build_research_prompt(topic, industry, target_audience, config)
# Route to appropriate provider
if config.provider == ResearchProvider.EXA:
# Exa research workflow
from .exa_provider import ExaResearchProvider
from services.subscription.preflight_validator import validate_exa_research_operations
from services.database import get_db
from services.subscription import PricingService
import os
await task_manager.update_progress(task_id, "🌐 Connecting to Exa neural search...")
# Pre-flight validation
db_val = next(get_db())
try:
pricing_service = PricingService(db_val)
gpt_provider = os.getenv("GPT_PROVIDER", "google")
validate_exa_research_operations(pricing_service, user_id, gpt_provider)
except HTTPException as http_error:
logger.error(f"Subscription limit exceeded for Exa research: {http_error.detail}")
await task_manager.update_progress(task_id, f"❌ Subscription limit exceeded: {http_error.detail.get('message', str(http_error.detail)) if isinstance(http_error.detail, dict) else str(http_error.detail)}")
raise
finally:
db_val.close()
# Execute Exa search
await task_manager.update_progress(task_id, "🤖 Executing Exa neural search...")
try:
exa_provider = ExaResearchProvider()
raw_result = await exa_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
# Track usage
cost = raw_result.get('cost', {}).get('total', 0.005) if isinstance(raw_result.get('cost'), dict) else 0.005
exa_provider.track_exa_usage(user_id, cost)
# Extract content for downstream analysis
# Handle None result case
if raw_result is None:
logger.error("raw_result is None after Exa search - this should not happen if HTTPException was raised")
raise ValueError("Exa research result is None - search operation failed unexpectedly")
if not isinstance(raw_result, dict):
logger.warning(f"raw_result is not a dict (type: {type(raw_result)}), using defaults")
raw_result = {}
content = raw_result.get('content', '')
sources = raw_result.get('sources', []) or []
search_widget = "" # Exa doesn't provide search widgets
search_queries = raw_result.get('search_queries', []) or []
grounding_metadata = None # Exa doesn't provide grounding metadata
except RuntimeError as e:
if "EXA_API_KEY not configured" in str(e):
logger.warning("Exa not configured, falling back to Google")
await task_manager.update_progress(task_id, "⚠️ Exa not configured, falling back to Google Search")
config.provider = ResearchProvider.GOOGLE
# Continue to Google flow below
else:
raise
elif config.provider == ResearchProvider.TAVILY:
# Tavily research workflow
from .tavily_provider import TavilyResearchProvider
from services.database import get_db
from services.subscription import PricingService
import os
await task_manager.update_progress(task_id, "🌐 Connecting to Tavily AI search...")
# Pre-flight validation
db_val = next(get_db())
try:
pricing_service = PricingService(db_val)
# Check Tavily usage limits
limits = pricing_service.get_user_limits(user_id)
tavily_limit = limits.get('limits', {}).get('tavily_calls', 0) if limits else 0
# Get current usage
from models.subscription_models import UsageSummary
from datetime import datetime
current_period = pricing_service.get_current_billing_period(user_id) or datetime.now().strftime("%Y-%m")
usage = db_val.query(UsageSummary).filter(
UsageSummary.user_id == user_id,
UsageSummary.billing_period == current_period
).first()
                    current_calls = (getattr(usage, 'tavily_calls', 0) or 0) if usage else 0
if tavily_limit > 0 and current_calls >= tavily_limit:
await task_manager.update_progress(task_id, f"❌ Tavily API call limit exceeded ({current_calls}/{tavily_limit})")
raise HTTPException(
status_code=429,
detail={
'error': 'Tavily API call limit exceeded',
'message': f'You have reached your Tavily API call limit ({tavily_limit} calls). Please upgrade your plan or wait for the next billing period.',
'provider': 'tavily',
'usage_info': {
'current': current_calls,
'limit': tavily_limit
}
}
)
except HTTPException:
raise
except Exception as e:
logger.warning(f"Error checking Tavily limits: {e}")
finally:
db_val.close()
# Execute Tavily search
await task_manager.update_progress(task_id, "🤖 Executing Tavily AI search...")
try:
tavily_provider = TavilyResearchProvider()
raw_result = await tavily_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
# Track usage
cost = raw_result.get('cost', {}).get('total', 0.001) if isinstance(raw_result.get('cost'), dict) else 0.001
search_depth = config.tavily_search_depth or "basic"
tavily_provider.track_tavily_usage(user_id, cost, search_depth)
# Extract content for downstream analysis
if raw_result is None:
logger.error("raw_result is None after Tavily search")
raise ValueError("Tavily research result is None - search operation failed unexpectedly")
if not isinstance(raw_result, dict):
logger.warning(f"raw_result is not a dict (type: {type(raw_result)}), using defaults")
raw_result = {}
content = raw_result.get('content', '')
sources = raw_result.get('sources', []) or []
search_widget = "" # Tavily doesn't provide search widgets
search_queries = raw_result.get('search_queries', []) or []
grounding_metadata = None # Tavily doesn't provide grounding metadata
except RuntimeError as e:
if "TAVILY_API_KEY not configured" in str(e):
logger.warning("Tavily not configured, falling back to Google")
await task_manager.update_progress(task_id, "⚠️ Tavily not configured, falling back to Google Search")
config.provider = ResearchProvider.GOOGLE
# Continue to Google flow below
else:
raise
if config.provider not in [ResearchProvider.EXA, ResearchProvider.TAVILY]:
# Google research (existing flow)
from .google_provider import GoogleResearchProvider
await task_manager.update_progress(task_id, "🌐 Connecting to Google Search grounding...")
google_provider = GoogleResearchProvider()
await task_manager.update_progress(task_id, "🤖 Making AI request to Gemini with Google Search grounding...")
try:
gemini_result = await google_provider.search(
research_prompt, topic, industry, target_audience, config, user_id
)
except HTTPException as http_error:
logger.error(f"Subscription limit exceeded for Google research: {http_error.detail}")
await task_manager.update_progress(task_id, f"❌ Subscription limit exceeded: {http_error.detail.get('message', str(http_error.detail)) if isinstance(http_error.detail, dict) else str(http_error.detail)}")
raise
await task_manager.update_progress(task_id, "📊 Processing research results and extracting insights...")
# Extract sources and content
# Handle None result case
if gemini_result is None:
logger.error("gemini_result is None after search - this should not happen if HTTPException was raised")
raise ValueError("Research result is None - search operation failed unexpectedly")
sources = self._extract_sources_from_grounding(gemini_result)
content = gemini_result.get("content", "") if isinstance(gemini_result, dict) else ""
                search_widget = (gemini_result.get("search_widget", "") or "") if isinstance(gemini_result, dict) else ""
                search_queries = (gemini_result.get("search_queries", []) or []) if isinstance(gemini_result, dict) else []
grounding_metadata = self._extract_grounding_metadata(gemini_result)
# Continue with common analysis (same for both providers)
await task_manager.update_progress(task_id, "🔍 Analyzing keywords and content angles...")
keyword_analysis = self.keyword_analyzer.analyze(content, request.keywords, user_id=user_id)
competitor_analysis = self.competitor_analyzer.analyze(content, user_id=user_id)
suggested_angles = self.content_angle_generator.generate(content, topic, industry, user_id=user_id)
await task_manager.update_progress(task_id, "💾 Caching results for future use...")
logger.info(f"Research completed successfully with {len(sources)} sources and {len(search_queries)} search queries")
# Create the response
response = BlogResearchResponse(
success=True,
sources=sources,
keyword_analysis=keyword_analysis,
competitor_analysis=competitor_analysis,
suggested_angles=suggested_angles,
# Add search widget and queries for UI display
search_widget=search_widget if 'search_widget' in locals() else "",
search_queries=search_queries if 'search_queries' in locals() else [],
# Add grounding metadata for detailed UI display
grounding_metadata=grounding_metadata,
# Preserve original user keywords for caching
original_keywords=request.keywords,
)
# Filter and clean research data for optimal AI processing
await task_manager.update_progress(task_id, "🔍 Filtering and cleaning research data...")
filtered_response = self.data_filter.filter_research_data(response)
logger.info("Research data filtering completed successfully")
# Cache the successful result for future exact keyword matches (both caches)
persistent_research_cache.cache_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience,
result=filtered_response.dict()
)
# Also cache in memory for faster access
research_cache.cache_result(
keywords=request.keywords,
industry=industry,
target_audience=target_audience,
result=filtered_response.dict()
)
return filtered_response
except HTTPException:
# Re-raise HTTPException (subscription errors) - let task manager handle it
raise
except Exception as e:
error_message = str(e)
logger.error(f"Research failed: {error_message}")
# Log error with full context
blog_writer_logger.log_error(
e,
"research",
context={
"topic": topic,
"keywords": request.keywords,
"industry": industry,
"target_audience": target_audience
}
)
# Import custom exceptions for better error handling
from services.blog_writer.exceptions import (
ResearchFailedException,
APIRateLimitException,
APITimeoutException,
ValidationException
)
# Determine if this is a retryable error
retry_suggested = True
user_message = "Research failed. Please try again with different keywords or check your internet connection."
if isinstance(e, APIRateLimitException):
retry_suggested = True
user_message = f"Rate limit exceeded. Please wait {e.context.get('retry_after', 60)} seconds before trying again."
elif isinstance(e, APITimeoutException):
retry_suggested = True
user_message = "Research request timed out. Please try again with a shorter query or check your internet connection."
elif isinstance(e, ValidationException):
retry_suggested = False
user_message = "Invalid research request. Please check your input parameters and try again."
elif "401" in error_message or "403" in error_message:
retry_suggested = False
user_message = "Authentication failed. Please check your API credentials."
elif "400" in error_message:
retry_suggested = False
user_message = "Invalid request. Please check your input parameters."
# Return a graceful failure response with enhanced error information
return BlogResearchResponse(
success=False,
sources=[],
keyword_analysis={},
competitor_analysis={},
suggested_angles=[],
search_widget="",
search_queries=[],
error_message=user_message,
retry_suggested=retry_suggested,
error_code=getattr(e, 'error_code', 'RESEARCH_FAILED'),
actionable_steps=getattr(e, 'actionable_steps', [
"Try with different keywords",
"Check your internet connection",
"Wait a few minutes and try again",
"Contact support if the issue persists"
])
)
def _extract_sources_from_grounding(self, gemini_result: Dict[str, Any]) -> List[ResearchSource]:
"""Extract sources from Gemini grounding metadata."""
sources = []
# Handle None or invalid gemini_result
if not gemini_result or not isinstance(gemini_result, dict):
logger.warning("gemini_result is None or not a dict, returning empty sources")
return sources
# The Gemini grounded provider already extracts sources and puts them in the 'sources' field
raw_sources = gemini_result.get("sources", [])
# Ensure raw_sources is a list (handle None case)
if raw_sources is None:
raw_sources = []
for src in raw_sources:
source = ResearchSource(
title=src.get("title", "Untitled"),
url=src.get("url", ""),
excerpt=src.get("content", "")[:500] if src.get("content") else f"Source from {src.get('title', 'web')}",
credibility_score=float(src.get("credibility_score", 0.8)),
published_at=str(src.get("publication_date", "2024-01-01")),
index=src.get("index"),
source_type=src.get("type", "web")
)
sources.append(source)
return sources
def _normalize_cached_research_data(self, cached_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Normalize cached research data to fix None values in confidence_scores.
Ensures all GroundingSupport objects have confidence_scores as a list.
"""
if not isinstance(cached_data, dict):
return cached_data
normalized = cached_data.copy()
# Normalize grounding_metadata if present
if "grounding_metadata" in normalized and normalized["grounding_metadata"]:
grounding_metadata = normalized["grounding_metadata"].copy() if isinstance(normalized["grounding_metadata"], dict) else {}
# Normalize grounding_supports
if "grounding_supports" in grounding_metadata and isinstance(grounding_metadata["grounding_supports"], list):
normalized_supports = []
for support in grounding_metadata["grounding_supports"]:
if isinstance(support, dict):
normalized_support = support.copy()
# Fix confidence_scores: ensure it's a list, not None
if normalized_support.get("confidence_scores") is None:
normalized_support["confidence_scores"] = []
elif not isinstance(normalized_support.get("confidence_scores"), list):
# If it's not a list, try to convert or default to empty list
normalized_support["confidence_scores"] = []
# Fix grounding_chunk_indices: ensure it's a list, not None
if normalized_support.get("grounding_chunk_indices") is None:
normalized_support["grounding_chunk_indices"] = []
elif not isinstance(normalized_support.get("grounding_chunk_indices"), list):
normalized_support["grounding_chunk_indices"] = []
# Ensure segment_text is a string
if normalized_support.get("segment_text") is None:
normalized_support["segment_text"] = ""
normalized_supports.append(normalized_support)
else:
normalized_supports.append(support)
grounding_metadata["grounding_supports"] = normalized_supports
normalized["grounding_metadata"] = grounding_metadata
return normalized
def _extract_grounding_metadata(self, gemini_result: Dict[str, Any]) -> GroundingMetadata:
"""Extract detailed grounding metadata from Gemini result."""
grounding_chunks = []
grounding_supports = []
citations = []
# Handle None or invalid gemini_result
if not gemini_result or not isinstance(gemini_result, dict):
logger.warning("gemini_result is None or not a dict, returning empty grounding metadata")
return GroundingMetadata(
grounding_chunks=grounding_chunks,
grounding_supports=grounding_supports,
citations=citations
)
# Extract grounding chunks from the raw grounding metadata
raw_grounding = gemini_result.get("grounding_metadata", {})
# Handle case where grounding_metadata might be a GroundingMetadata object
if hasattr(raw_grounding, 'grounding_chunks'):
raw_chunks = raw_grounding.grounding_chunks
else:
raw_chunks = raw_grounding.get("grounding_chunks", []) if isinstance(raw_grounding, dict) else []
# Ensure raw_chunks is a list (handle None case)
if raw_chunks is None:
raw_chunks = []
for chunk in raw_chunks:
if "web" in chunk:
web_data = chunk["web"]
grounding_chunk = GroundingChunk(
title=web_data.get("title", "Untitled"),
url=web_data.get("uri", ""),
confidence_score=None # Will be set from supports
)
grounding_chunks.append(grounding_chunk)
# Extract grounding supports with confidence scores
if hasattr(raw_grounding, 'grounding_supports'):
raw_supports = raw_grounding.grounding_supports
        else:
            raw_supports = raw_grounding.get("grounding_supports", []) if isinstance(raw_grounding, dict) else []
        # Ensure raw_supports is a list (handle None case)
        if raw_supports is None:
            raw_supports = []
for support in raw_supports:
# Handle both dictionary and GroundingSupport object formats
if hasattr(support, 'confidence_scores'):
confidence_scores = support.confidence_scores
chunk_indices = support.grounding_chunk_indices
segment_text = getattr(support, 'segment_text', '')
start_index = getattr(support, 'start_index', None)
end_index = getattr(support, 'end_index', None)
else:
confidence_scores = support.get("confidence_scores", [])
chunk_indices = support.get("grounding_chunk_indices", [])
segment = support.get("segment", {})
segment_text = segment.get("text", "")
start_index = segment.get("start_index")
end_index = segment.get("end_index")
grounding_support = GroundingSupport(
confidence_scores=confidence_scores,
grounding_chunk_indices=chunk_indices,
segment_text=segment_text,
start_index=start_index,
end_index=end_index
)
grounding_supports.append(grounding_support)
# Update confidence scores for chunks
if confidence_scores and chunk_indices:
avg_confidence = sum(confidence_scores) / len(confidence_scores)
for idx in chunk_indices:
if idx < len(grounding_chunks):
grounding_chunks[idx].confidence_score = avg_confidence
# Extract citations from the raw result
        raw_citations = gemini_result.get("citations", []) or []
for citation in raw_citations:
citation_obj = Citation(
citation_type=citation.get("type", "inline"),
start_index=citation.get("start_index", 0),
end_index=citation.get("end_index", 0),
text=citation.get("text", ""),
source_indices=citation.get("source_indices", []),
reference=citation.get("reference", "")
)
citations.append(citation_obj)
# Extract search entry point and web search queries
if hasattr(raw_grounding, 'search_entry_point'):
search_entry_point = getattr(raw_grounding.search_entry_point, 'rendered_content', '') if raw_grounding.search_entry_point else ''
        else:
            search_entry_point = (raw_grounding.get("search_entry_point") or {}).get("rendered_content", "") if isinstance(raw_grounding, dict) else ""
if hasattr(raw_grounding, 'web_search_queries'):
web_search_queries = raw_grounding.web_search_queries
        else:
            web_search_queries = raw_grounding.get("web_search_queries", []) if isinstance(raw_grounding, dict) else []
return GroundingMetadata(
grounding_chunks=grounding_chunks,
grounding_supports=grounding_supports,
citations=citations,
search_entry_point=search_entry_point,
web_search_queries=web_search_queries
)
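
A minimal end-to-end sketch of the service (request field names follow the `BlogResearchRequest` usage above; values are illustrative):

```python
import asyncio

from models.blog_models import BlogResearchRequest, ResearchMode


async def run_research():
    service = ResearchService()
    request = BlogResearchRequest(
        keywords=["ai blog writing"],
        topic="AI blog writing",
        industry="SaaS",
        research_mode=ResearchMode.COMPREHENSIVE,
    )
    response = await service.research(request, user_id="user_123")
    if response.success:
        print(f"{len(response.sources)} sources, {len(response.suggested_angles)} angles")
    else:
        print(response.error_message, response.actionable_steps)


asyncio.run(run_research())
```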

View File

@@ -0,0 +1,230 @@
"""
Research Strategy Pattern Implementation
Different strategies for executing research based on depth and focus.
"""
from abc import ABC, abstractmethod
from typing import Dict, Any
from loguru import logger
from models.blog_models import BlogResearchRequest, ResearchMode, ResearchConfig
from .keyword_analyzer import KeywordAnalyzer
from .competitor_analyzer import CompetitorAnalyzer
from .content_angle_generator import ContentAngleGenerator
class ResearchStrategy(ABC):
"""Base class for research strategies."""
def __init__(self):
self.keyword_analyzer = KeywordAnalyzer()
self.competitor_analyzer = CompetitorAnalyzer()
self.content_angle_generator = ContentAngleGenerator()
@abstractmethod
def build_research_prompt(
self,
topic: str,
industry: str,
target_audience: str,
config: ResearchConfig
) -> str:
"""Build the research prompt for the strategy."""
pass
@abstractmethod
def get_mode(self) -> ResearchMode:
"""Return the research mode this strategy handles."""
pass
class BasicResearchStrategy(ResearchStrategy):
"""Basic research strategy - keyword focused, minimal analysis."""
def get_mode(self) -> ResearchMode:
return ResearchMode.BASIC
def build_research_prompt(
self,
topic: str,
industry: str,
target_audience: str,
config: ResearchConfig
) -> str:
"""Build basic research prompt focused on podcast-ready, actionable insights."""
prompt = f"""You are a podcast researcher creating TALKING POINTS and FACT CARDS for a {industry} audience of {target_audience}.
Research Topic: "{topic}"
Provide analysis in this EXACT format:
## PODCAST HOOKS (3)
- [Hook line with tension + data point + source URL]
## OBJECTIONS & COUNTERS (3)
- Objection: [common listener objection]
Counter: [concise rebuttal with stat + source URL]
## KEY STATS & PROOF (6)
- [Specific metric with %/number, date, and source URL]
## MINI CASE SNAPS (3)
- [Brand/company], [what they did], [outcome metric], [source URL]
## KEYWORDS TO MENTION (Primary + 5 Secondary)
- Primary: "{topic}"
- Secondary: [5 related keywords]
## 5 CONTENT ANGLES
1. [Angle with audience benefit + why-now]
2. [Angle ...]
3. [Angle ...]
4. [Angle ...]
5. [Angle ...]
## FACT CARD LIST (8)
- For each: Quote/claim, source URL, published date, metric/context.
REQUIREMENTS:
- Every claim MUST include a source URL (authoritative, recent: 2024-2025 preferred).
- Use concrete numbers, dates, outcomes; avoid generic advice.
- Keep bullets tight and scannable for spoken narration."""
return prompt.strip()
class ComprehensiveResearchStrategy(ResearchStrategy):
"""Comprehensive research strategy - full analysis with all components."""
def get_mode(self) -> ResearchMode:
return ResearchMode.COMPREHENSIVE
def build_research_prompt(
self,
topic: str,
industry: str,
target_audience: str,
config: ResearchConfig
) -> str:
"""Build comprehensive research prompt with podcast-focused, high-value insights."""
date_filter = f"\nDate Focus: {config.date_range.value.replace('_', ' ')}" if config.date_range else ""
source_filter = f"\nPriority Sources: {', '.join([s.value for s in config.source_types])}" if config.source_types else ""
prompt = f"""You are a senior podcast researcher creating deeply sourced talking points for a {industry} audience of {target_audience}.
Research Topic: "{topic}"{date_filter}{source_filter}
Provide COMPLETE analysis in this EXACT format:
## WHAT'S CHANGED (2024-2025)
[5-7 concise trend bullets with numbers + source URLs]
## PROOF & NUMBERS
[10 stats with metric, date, sample size/method, and source URL]
## EXPERT SIGNALS
[5 expert quotes with name, title/company, source URL]
## RECENT MOVES
[5-7 news items or launches with dates and source URLs]
## MARKET SNAPSHOTS
[3-5 insights with TAM/SAM/SOM or adoption metrics, source URLs]
## CASE SNAPS
[3-5 cases: who, what they did, outcome metric, source URL]
## KEYWORD PLAN
Primary (3), Secondary (8-10), Long-tail (5-7) with intent hints.
## COMPETITOR GAPS
- Top 5 competitors (URL) + 1-line strength
- 5 content gaps we can own
- 3 unique angles to differentiate
## PODCAST-READY ANGLES (5)
- Each: Hook, promised takeaway, data or example, source URL.
## FACT CARD LIST (10)
- Each: Quote/claim, source URL, published date, metric/context, suggested angle tag.
VERIFICATION REQUIREMENTS:
- Minimum 2 authoritative sources per major claim.
- Prefer industry reports > research papers > news > blogs.
- 2024-2025 data strongly preferred.
- All numbers must include timeframe and methodology.
- Every bullet must be concise for spoken narration and actionable for {target_audience}."""
return prompt.strip()
class TargetedResearchStrategy(ResearchStrategy):
"""Targeted research strategy - focused on specific aspects."""
def get_mode(self) -> ResearchMode:
return ResearchMode.TARGETED
def build_research_prompt(
self,
topic: str,
industry: str,
target_audience: str,
config: ResearchConfig
) -> str:
"""Build targeted research prompt based on config preferences."""
sections = []
if config.include_trends:
sections.append("""## CURRENT TRENDS
[3-5 trends with data and source URLs]""")
if config.include_statistics:
sections.append("""## KEY STATISTICS
[5-7 statistics with numbers and source URLs]""")
if config.include_expert_quotes:
sections.append("""## EXPERT OPINIONS
[3-4 expert quotes with attribution and source URLs]""")
if config.include_competitors:
sections.append("""## COMPETITOR ANALYSIS
Top Competitors: [3-5]
Content Gaps: [3-5]""")
# Always include keywords and angles
sections.append("""## KEYWORD ANALYSIS
Primary: [2-3 variations]
Secondary: [5-7 keywords]
Long-Tail: [3-5 phrases]""")
sections.append("""## CONTENT ANGLES (3-5)
[Unique blog angles with reasoning]""")
sections_str = "\n\n".join(sections)
prompt = f"""You are a blog content strategist conducting targeted research for a {industry} blog targeting {target_audience}.
Research Topic: "{topic}"
Provide focused analysis in this EXACT format:
{sections_str}
REQUIREMENTS:
- Cite all claims with authoritative source URLs
- Include specific numbers, dates, examples
- Focus on actionable insights for {target_audience}
- Use 2024-2025 data when available"""
return prompt.strip()
def get_strategy_for_mode(mode: ResearchMode) -> ResearchStrategy:
"""Factory function to get the appropriate strategy for a mode."""
strategy_map = {
ResearchMode.BASIC: BasicResearchStrategy,
ResearchMode.COMPREHENSIVE: ComprehensiveResearchStrategy,
ResearchMode.TARGETED: TargetedResearchStrategy,
}
strategy_class = strategy_map.get(mode, BasicResearchStrategy)
return strategy_class()
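
A quick sketch of the factory in use (assumes the targeted-mode `include_*` flags have defaults on `ResearchConfig`; values are illustrative):

```python
from models.blog_models import ResearchConfig, ResearchMode, ResearchProvider

strategy = get_strategy_for_mode(ResearchMode.TARGETED)
config = ResearchConfig(mode=ResearchMode.TARGETED, provider=ResearchProvider.GOOGLE)
prompt = strategy.build_research_prompt(
    topic="AI content marketing",
    industry="SaaS",
    target_audience="growth marketers",
    config=config,
)
print(strategy.get_mode().value)  # the mode's string value, e.g. "targeted"
print(prompt[:200])
```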

View File

@@ -0,0 +1,169 @@
"""
Tavily Research Provider
AI-powered search implementation using Tavily API for high-quality research.
"""
import os
from loguru import logger
from models.subscription_models import APIProvider
from services.research.tavily_service import TavilyService
from .base_provider import ResearchProvider as BaseProvider
class TavilyResearchProvider(BaseProvider):
"""Tavily AI-powered search provider."""
def __init__(self):
self.api_key = os.getenv("TAVILY_API_KEY")
if not self.api_key:
raise RuntimeError("TAVILY_API_KEY not configured")
self.tavily_service = TavilyService()
logger.info("✅ Tavily Research Provider initialized")
async def search(self, prompt, topic, industry, target_audience, config, user_id):
"""Execute Tavily search and return standardized results."""
# Build Tavily query
query = f"{topic} {industry} {target_audience}"
        # Get Tavily-specific config options.
        # Note: Tavily's "topic" parameter is a search category (e.g., "general" or "news"),
        # distinct from the research topic used to build the query above.
        tavily_topic = config.tavily_topic or "general"
        search_depth = config.tavily_search_depth or "basic"
        logger.info(f"[Tavily Research] Executing search: {query}")
        # Execute Tavily search
        result = await self.tavily_service.search(
            query=query,
            topic=tavily_topic,
            search_depth=search_depth,
max_results=min(config.max_sources, 20),
include_domains=config.tavily_include_domains or None,
exclude_domains=config.tavily_exclude_domains or None,
include_answer=config.tavily_include_answer or False,
include_raw_content=config.tavily_include_raw_content or False,
include_images=config.tavily_include_images or False,
include_image_descriptions=config.tavily_include_image_descriptions or False,
time_range=config.tavily_time_range,
start_date=config.tavily_start_date,
end_date=config.tavily_end_date,
country=config.tavily_country,
chunks_per_source=config.tavily_chunks_per_source or 3,
auto_parameters=config.tavily_auto_parameters or False
)
if not result.get("success"):
raise RuntimeError(f"Tavily search failed: {result.get('error', 'Unknown error')}")
# Transform to standardized format
sources = self._transform_sources(result.get("results", []))
content = self._aggregate_content(result.get("results", []))
# Calculate cost (basic = 1 credit, advanced = 2 credits)
cost = 0.001 if search_depth == "basic" else 0.002 # Estimate cost per search
logger.info(f"[Tavily Research] Search completed: {len(sources)} sources, depth: {search_depth}")
return {
'sources': sources,
'content': content,
'search_type': search_depth,
'provider': 'tavily',
'search_queries': [query],
'cost': {'total': cost},
'answer': result.get("answer"), # If include_answer was requested
'images': result.get("images", [])
}
def get_provider_enum(self):
"""Return TAVILY provider enum for subscription tracking."""
return APIProvider.TAVILY
def estimate_tokens(self) -> int:
"""Estimate token usage for Tavily (not token-based, but we estimate API calls)."""
return 0 # Tavily is per-search, not token-based
def _transform_sources(self, results):
"""Transform Tavily results to ResearchSource format."""
sources = []
for idx, result in enumerate(results):
source_type = self._determine_source_type(result.get("url", ""))
sources.append({
'title': result.get("title", ""),
'url': result.get("url", ""),
'excerpt': result.get("content", "")[:500], # First 500 chars
'credibility_score': result.get("relevance_score", 0.5),
'published_at': result.get("published_date"),
'index': idx,
'source_type': source_type,
'content': result.get("content", ""),
'raw_content': result.get("raw_content"), # If include_raw_content was requested
'score': result.get("score", result.get("relevance_score", 0.5)),
'favicon': result.get("favicon")
})
return sources
def _determine_source_type(self, url):
"""Determine source type from URL."""
if not url:
return 'web'
url_lower = url.lower()
if 'arxiv.org' in url_lower or 'research' in url_lower or '.edu' in url_lower:
return 'academic'
elif any(news in url_lower for news in ['cnn.com', 'bbc.com', 'reuters.com', 'theguardian.com', 'nytimes.com']):
return 'news'
elif 'linkedin.com' in url_lower:
return 'expert'
elif '.gov' in url_lower:
return 'government'
else:
return 'web'
def _aggregate_content(self, results):
"""Aggregate content from Tavily results for LLM analysis."""
content_parts = []
for idx, result in enumerate(results):
content = result.get("content", "")
if content:
content_parts.append(f"Source {idx + 1}: {content}")
return "\n\n".join(content_parts)
def track_tavily_usage(self, user_id: str, cost: float, search_depth: str):
"""Track Tavily API usage after successful call."""
from services.database import get_db
from services.subscription import PricingService
from sqlalchemy import text
db = next(get_db())
try:
pricing_service = PricingService(db)
current_period = pricing_service.get_current_billing_period(user_id)
# Update tavily_calls and tavily_cost via SQL UPDATE
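            # Note: this UPDATE assumes a usage_summaries row already exists for the
            # user/billing period; if none matches, the statement affects zero rows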
update_query = text("""
UPDATE usage_summaries
SET tavily_calls = COALESCE(tavily_calls, 0) + 1,
tavily_cost = COALESCE(tavily_cost, 0) + :cost,
total_calls = COALESCE(total_calls, 0) + 1,
total_cost = COALESCE(total_cost, 0) + :cost
WHERE user_id = :user_id AND billing_period = :period
""")
db.execute(update_query, {
'cost': cost,
'user_id': user_id,
'period': current_period
})
db.commit()
logger.info(f"[Tavily] Tracked usage: user={user_id}, cost=${cost}, depth={search_depth}")
except Exception as e:
logger.error(f"[Tavily] Failed to track usage: {e}", exc_info=True)
db.rollback()
finally:
db.close()
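
Usage mirrors the Exa provider; a compact sketch (assumes `TAVILY_API_KEY` is set and that the config model's Tavily-specific fields default sensibly; values are illustrative):

```python
import asyncio

from models.blog_models import ResearchConfig, ResearchMode, ResearchProvider


async def demo():
    provider = TavilyResearchProvider()  # raises RuntimeError if TAVILY_API_KEY is unset
    config = ResearchConfig(mode=ResearchMode.BASIC, provider=ResearchProvider.TAVILY)
    result = await provider.search(
        prompt="",  # Tavily builds its own query from topic/industry/audience
        topic="AI content marketing",
        industry="SaaS",
        target_audience="growth marketers",
        config=config,
        user_id="user_123",
    )
    # Basic search ≈ 1 credit, advanced ≈ 2 (estimated above as $0.001 / $0.002 per call)
    provider.track_tavily_usage("user_123", result["cost"]["total"], config.tavily_search_depth or "basic")


asyncio.run(demo())
```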

View File

@@ -0,0 +1,223 @@
"""
Enhanced Retry Utilities for Blog Writer
Provides advanced retry logic with exponential backoff, jitter, retry budgets,
and specific error code handling for different types of API failures.
"""
import asyncio
import functools
import random
import time
from typing import Callable, Any, Optional, Dict, List
from dataclasses import dataclass
from loguru import logger
from .exceptions import APIRateLimitException, APITimeoutException
@dataclass
class RetryConfig:
"""Configuration for retry behavior."""
max_attempts: int = 3
base_delay: float = 1.0
max_delay: float = 60.0
exponential_base: float = 2.0
jitter: bool = True
max_total_time: float = 300.0 # 5 minutes max total time
    retryable_errors: Optional[List[str]] = None
def __post_init__(self):
if self.retryable_errors is None:
self.retryable_errors = [
"503", "502", "504", # Server errors
"429", # Rate limit
"timeout", "timed out",
"connection", "network",
"overloaded", "busy"
]
class RetryBudget:
"""Tracks retry budget to prevent excessive retries."""
def __init__(self, max_total_time: float):
self.max_total_time = max_total_time
self.start_time = time.time()
self.used_time = 0.0
def can_retry(self) -> bool:
"""Check if we can still retry within budget."""
self.used_time = time.time() - self.start_time
return self.used_time < self.max_total_time
def remaining_time(self) -> float:
"""Get remaining time in budget."""
return max(0, self.max_total_time - self.used_time)
def is_retryable_error(error: Exception, retryable_errors: List[str]) -> bool:
"""Check if an error is retryable based on error message patterns."""
error_str = str(error).lower()
return any(pattern.lower() in error_str for pattern in retryable_errors)
def calculate_delay(attempt: int, config: RetryConfig) -> float:
"""Calculate delay for retry attempt with exponential backoff and jitter."""
# Exponential backoff
delay = config.base_delay * (config.exponential_base ** attempt)
# Cap at max delay
delay = min(delay, config.max_delay)
# Add jitter to prevent thundering herd
if config.jitter:
jitter_range = delay * 0.1 # 10% jitter
delay += random.uniform(-jitter_range, jitter_range)
return max(0, delay)
async def retry_with_backoff(
func: Callable,
config: Optional[RetryConfig] = None,
operation_name: str = "operation",
context: Optional[Dict[str, Any]] = None
) -> Any:
"""
Retry a function with enhanced backoff and budget management.
Args:
func: Async function to retry
config: Retry configuration
operation_name: Name of operation for logging
context: Additional context for logging
Returns:
Function result
Raises:
Last exception if all retries fail
"""
    config = config or RetryConfig()
    budget = RetryBudget(config.max_total_time)
    last_exception = None
    delay = config.base_delay  # Fallback so retry_after below is defined even if no backoff ran
for attempt in range(config.max_attempts):
try:
# Check if we're still within budget
if not budget.can_retry():
logger.warning(f"Retry budget exceeded for {operation_name} after {budget.used_time:.2f}s")
break
# Execute the function
result = await func()
logger.info(f"{operation_name} succeeded on attempt {attempt + 1}")
return result
except Exception as e:
last_exception = e
# Check if this is the last attempt
if attempt == config.max_attempts - 1:
logger.error(f"{operation_name} failed after {config.max_attempts} attempts: {str(e)}")
break
# Check if error is retryable
if not is_retryable_error(e, config.retryable_errors):
logger.warning(f"{operation_name} failed with non-retryable error: {str(e)}")
break
# Calculate delay and wait
delay = calculate_delay(attempt, config)
remaining_time = budget.remaining_time()
# Don't wait longer than remaining budget
if delay > remaining_time:
logger.warning(f"Delay {delay:.2f}s exceeds remaining budget {remaining_time:.2f}s for {operation_name}")
break
logger.warning(
f"{operation_name} attempt {attempt + 1} failed: {str(e)}. "
f"Retrying in {delay:.2f}s (attempt {attempt + 2}/{config.max_attempts})"
)
await asyncio.sleep(delay)
# If we get here, all retries failed
if last_exception:
# Enhance the exception with retry context; last_exception is always an
# Exception instance here, so no isinstance check is needed
error_str = str(last_exception)
if "429" in error_str or "rate limit" in error_str.lower():
# `delay` may be unbound when the loop exits on the first attempt, so
# derive the suggested wait from the configured backoff instead
raise APIRateLimitException(
f"Rate limit exceeded after {config.max_attempts} attempts",
retry_after=int(min(config.base_delay * (config.exponential_base ** config.max_attempts), config.max_delay)),
context=context
)
elif "timeout" in error_str.lower():
raise APITimeoutException(
f"Request timed out after {config.max_attempts} attempts",
timeout_seconds=int(config.max_total_time),
context=context
)
raise last_exception
raise Exception(f"{operation_name} failed after {config.max_attempts} attempts")
def retry_decorator(
config: Optional[RetryConfig] = None,
operation_name: Optional[str] = None
):
"""
Decorator to add retry logic to async functions.
Args:
config: Retry configuration
operation_name: Name of operation for logging
"""
def decorator(func: Callable) -> Callable:
@functools.wraps(func)  # preserve the wrapped function's name and docstring
async def wrapper(*args, **kwargs):
op_name = operation_name or func.__name__
return await retry_with_backoff(
lambda: func(*args, **kwargs),
config=config,
operation_name=op_name
)
return wrapper
return decorator
# Predefined retry configurations for different operation types
RESEARCH_RETRY_CONFIG = RetryConfig(
max_attempts=3,
base_delay=2.0,
max_delay=30.0,
max_total_time=180.0, # 3 minutes for research
retryable_errors=["503", "429", "timeout", "overloaded", "connection"]
)
OUTLINE_RETRY_CONFIG = RetryConfig(
max_attempts=2,
base_delay=1.5,
max_delay=20.0,
max_total_time=120.0, # 2 minutes for outline
retryable_errors=["503", "429", "timeout", "overloaded"]
)
CONTENT_RETRY_CONFIG = RetryConfig(
max_attempts=3,
base_delay=1.0,
max_delay=15.0,
max_total_time=90.0, # 1.5 minutes for content
retryable_errors=["503", "429", "timeout", "overloaded"]
)
SEO_RETRY_CONFIG = RetryConfig(
max_attempts=2,
base_delay=1.0,
max_delay=10.0,
max_total_time=60.0, # 1 minute for SEO
retryable_errors=["503", "429", "timeout"]
)
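# --- Usage sketch (illustrative only) ----------------------------------------
# The coroutine below is a stand-in invented for this example; the helper and
# config are the ones defined above. Running this module directly shows the
# retry policy recovering from two simulated 503 responses.
if __name__ == "__main__":
    _attempts = {"count": 0}

    async def _flaky_llm_call() -> dict:
        """Simulates a provider that returns 503 twice before succeeding."""
        _attempts["count"] += 1
        if _attempts["count"] < 3:
            raise Exception("503 Service Unavailable")
        return {"outline": ["intro", "body", "conclusion"]}

    async def _demo() -> None:
        # RESEARCH_RETRY_CONFIG allows 3 attempts, so the third call succeeds
        result = await retry_with_backoff(
            _flaky_llm_call,
            config=RESEARCH_RETRY_CONFIG,
            operation_name="research_fetch",
        )
        logger.info(f"Demo result: {result}")

    asyncio.run(_demo())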

View File

@@ -0,0 +1,879 @@
"""
Blog Content SEO Analyzer
Specialized SEO analyzer for blog content with parallel processing.
Leverages existing non-AI SEO tools and uses single AI prompt for structured analysis.
"""
import asyncio
import re
import textstat
from datetime import datetime
from typing import Dict, Any, List, Optional
from utils.logger_utils import get_service_logger
from services.seo_analyzer import (
ContentAnalyzer, KeywordAnalyzer,
URLStructureAnalyzer, AIInsightGenerator
)
from services.llm_providers.main_text_generation import llm_text_gen
# Service-specific logger (module level, no global reconfiguration) so it
# exists for every method even before the analyzer is instantiated
logger = get_service_logger("blog_content_seo_analyzer")
class BlogContentSEOAnalyzer:
"""Specialized SEO analyzer for blog content with parallel processing"""
def __init__(self):
"""Initialize the blog content SEO analyzer"""
self.content_analyzer = ContentAnalyzer()
self.keyword_analyzer = KeywordAnalyzer()
self.url_analyzer = URLStructureAnalyzer()
self.ai_insights = AIInsightGenerator()
logger.info("BlogContentSEOAnalyzer initialized")
async def analyze_blog_content(self, blog_content: str, research_data: Dict[str, Any], blog_title: Optional[str] = None, user_id: Optional[str] = None) -> Dict[str, Any]:
"""
Main analysis method with parallel processing
Args:
blog_content: The blog content to analyze
research_data: Research data containing keywords and other insights
blog_title: Optional blog title
user_id: Clerk user ID for subscription checking (required)
Returns:
Comprehensive SEO analysis results
"""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
try:
logger.info("Starting blog content SEO analysis")
# Extract keywords from research data
keywords_data = self._extract_keywords_from_research(research_data)
logger.info(f"Extracted keywords: {keywords_data}")
# Phase 1: Run non-AI analyzers in parallel
logger.info("Running non-AI analyzers in parallel")
non_ai_results = await self._run_non_ai_analyzers(blog_content, keywords_data)
# Phase 2: Single AI analysis for structured insights
logger.info("Running AI analysis")
ai_insights = await self._run_ai_analysis(blog_content, keywords_data, non_ai_results, user_id=user_id)
# Phase 3: Compile and format results
logger.info("Compiling results")
results = self._compile_blog_seo_results(non_ai_results, ai_insights, keywords_data)
logger.info(f"SEO analysis completed. Overall score: {results.get('overall_score', 0)}")
return results
except Exception as e:
logger.error(f"Blog SEO analysis failed: {e}")
# Fail fast - don't return fallback data
raise e
def _extract_keywords_from_research(self, research_data: Dict[str, Any]) -> Dict[str, Any]:
"""Extract keywords from research data"""
try:
logger.info(f"Extracting keywords from research data: {research_data}")
# Extract keywords from research data structure
keyword_analysis = research_data.get('keyword_analysis', {})
logger.info(f"Found keyword_analysis: {keyword_analysis}")
# Handle different possible structures
primary_keywords = []
long_tail_keywords = []
semantic_keywords = []
all_keywords = []
# Try to extract primary keywords from different possible locations
if 'primary' in keyword_analysis:
primary_keywords = keyword_analysis.get('primary', [])
elif 'keywords' in research_data:
# Fallback to top-level keywords
primary_keywords = research_data.get('keywords', [])
# Extract other keyword types
long_tail_keywords = keyword_analysis.get('long_tail', [])
# Handle both 'semantic' and 'semantic_keywords' field names
semantic_keywords = keyword_analysis.get('semantic', []) or keyword_analysis.get('semantic_keywords', [])
all_keywords = keyword_analysis.get('all_keywords', primary_keywords)
result = {
'primary': primary_keywords,
'long_tail': long_tail_keywords,
'semantic': semantic_keywords,
'all_keywords': all_keywords,
'search_intent': keyword_analysis.get('search_intent', 'informational')
}
logger.info(f"Extracted keywords: {result}")
return result
except Exception as e:
logger.error(f"Failed to extract keywords from research data: {e}")
logger.error(f"Research data structure: {research_data}")
# Fail fast - don't return empty keywords
raise ValueError(f"Keyword extraction failed: {e}")
async def _run_non_ai_analyzers(self, blog_content: str, keywords_data: Dict[str, Any]) -> Dict[str, Any]:
"""Run all non-AI analyzers in parallel for maximum performance"""
logger.info(f"Starting non-AI analyzers with content length: {len(blog_content)} chars")
logger.info(f"Keywords data: {keywords_data}")
# Parallel execution of fast analyzers
tasks = [
self._analyze_content_structure(blog_content),
self._analyze_keyword_usage(blog_content, keywords_data),
self._analyze_readability(blog_content),
self._analyze_content_quality(blog_content),
self._analyze_heading_structure(blog_content)
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Check for exceptions and fail fast
task_names = ['content_structure', 'keyword_analysis', 'readability_analysis', 'content_quality', 'heading_structure']
for name, result in zip(task_names, results):
if isinstance(result, Exception):
logger.error(f"Task {name} failed: {result}")
raise result
# Log successful results
for name, result in zip(task_names, results):
logger.info(f"{name} completed: {type(result).__name__} with {len(result) if isinstance(result, dict) else 'N/A'} fields")
return {
'content_structure': results[0],
'keyword_analysis': results[1],
'readability_analysis': results[2],
'content_quality': results[3],
'heading_structure': results[4]
}
async def _analyze_content_structure(self, content: str) -> Dict[str, Any]:
"""Analyze blog content structure"""
try:
# Parse markdown content
lines = content.split('\n')
# Count sections, paragraphs, sentences
sections = len([line for line in lines if line.startswith('##')])
paragraphs = len([line for line in lines if line.strip() and not line.startswith('#')])
sentences = len(re.findall(r'[.!?]+', content))
# Blog-specific structure analysis
has_introduction = any('introduction' in line.lower() or 'overview' in line.lower()
for line in lines[:10])
has_conclusion = any('conclusion' in line.lower() or 'summary' in line.lower()
for line in lines[-10:])
has_cta = any('call to action' in line.lower() or 'learn more' in line.lower()
for line in lines)
structure_score = self._calculate_structure_score(sections, paragraphs, has_introduction, has_conclusion)
return {
'total_sections': sections,
'total_paragraphs': paragraphs,
'total_sentences': sentences,
'has_introduction': has_introduction,
'has_conclusion': has_conclusion,
'has_call_to_action': has_cta,
'structure_score': structure_score,
'recommendations': self._get_structure_recommendations(sections, has_introduction, has_conclusion)
}
except Exception as e:
logger.error(f"Content structure analysis failed: {e}")
raise e
async def _analyze_keyword_usage(self, content: str, keywords_data: Dict[str, Any]) -> Dict[str, Any]:
"""Analyze keyword usage and optimization"""
try:
# Extract keywords from research data
primary_keywords = keywords_data.get('primary', [])
long_tail_keywords = keywords_data.get('long_tail', [])
semantic_keywords = keywords_data.get('semantic', [])
# Run the shared KeywordAnalyzer as a baseline check (its result is not yet
# merged into the blog-specific analysis below)
self.keyword_analyzer.analyze(content, primary_keywords)
# Blog-specific keyword analysis
keyword_analysis = {
'primary_keywords': primary_keywords,
'long_tail_keywords': long_tail_keywords,
'semantic_keywords': semantic_keywords,
'keyword_density': {},
'keyword_distribution': {},
'missing_keywords': [],
'over_optimization': [],
'recommendations': []
}
# Analyze each keyword type
for keyword in primary_keywords:
density = self._calculate_keyword_density(content, keyword)
keyword_analysis['keyword_density'][keyword] = density
# Check if keyword appears in headings
in_headings = self._keyword_in_headings(content, keyword)
keyword_analysis['keyword_distribution'][keyword] = {
'density': density,
'in_headings': in_headings,
'first_occurrence': content.lower().find(keyword.lower())
}
# Check for missing important keywords
for keyword in primary_keywords:
if keyword.lower() not in content.lower():
keyword_analysis['missing_keywords'].append(keyword)
# Check for over-optimization
for keyword, density in keyword_analysis['keyword_density'].items():
if density > 3.0: # Over 3% density
keyword_analysis['over_optimization'].append(keyword)
return keyword_analysis
except Exception as e:
logger.error(f"Keyword analysis failed: {e}")
raise e
async def _analyze_readability(self, content: str) -> Dict[str, Any]:
"""Analyze content readability using textstat integration"""
try:
# Calculate readability metrics
readability_metrics = {
'flesch_reading_ease': textstat.flesch_reading_ease(content),
'flesch_kincaid_grade': textstat.flesch_kincaid_grade(content),
'gunning_fog': textstat.gunning_fog(content),
'smog_index': textstat.smog_index(content),
'automated_readability': textstat.automated_readability_index(content),
'coleman_liau': textstat.coleman_liau_index(content)
}
# Blog-specific readability analysis
avg_sentence_length = self._calculate_avg_sentence_length(content)
avg_paragraph_length = self._calculate_avg_paragraph_length(content)
readability_score = self._calculate_readability_score(readability_metrics)
return {
'metrics': readability_metrics,
'avg_sentence_length': avg_sentence_length,
'avg_paragraph_length': avg_paragraph_length,
'readability_score': readability_score,
'target_audience': self._determine_target_audience(readability_metrics),
'recommendations': self._get_readability_recommendations(readability_metrics, avg_sentence_length)
}
except Exception as e:
logger.error(f"Readability analysis failed: {e}")
raise e
async def _analyze_content_quality(self, content: str) -> Dict[str, Any]:
"""Analyze overall content quality"""
try:
# Word count analysis
words = content.split()
word_count = len(words)
# Content depth analysis
unique_words = len(set(word.lower() for word in words))
vocabulary_diversity = unique_words / word_count if word_count > 0 else 0
# Content flow analysis
transition_words = ['however', 'therefore', 'furthermore', 'moreover', 'additionally', 'consequently']
transition_count = sum(content.lower().count(word) for word in transition_words)
content_depth_score = self._calculate_content_depth_score(word_count, vocabulary_diversity)
flow_score = self._calculate_flow_score(transition_count, word_count)
return {
'word_count': word_count,
'unique_words': unique_words,
'vocabulary_diversity': vocabulary_diversity,
'transition_words_used': transition_count,
'content_depth_score': content_depth_score,
'flow_score': flow_score,
'recommendations': self._get_content_quality_recommendations(word_count, vocabulary_diversity, transition_count)
}
except Exception as e:
logger.error(f"Content quality analysis failed: {e}")
raise e
async def _analyze_heading_structure(self, content: str) -> Dict[str, Any]:
"""Analyze heading structure and hierarchy"""
try:
# Extract headings
h1_headings = re.findall(r'^# (.+)$', content, re.MULTILINE)
h2_headings = re.findall(r'^## (.+)$', content, re.MULTILINE)
h3_headings = re.findall(r'^### (.+)$', content, re.MULTILINE)
# Analyze heading structure
heading_hierarchy_score = self._calculate_heading_hierarchy_score(h1_headings, h2_headings, h3_headings)
return {
'h1_count': len(h1_headings),
'h2_count': len(h2_headings),
'h3_count': len(h3_headings),
'h1_headings': h1_headings,
'h2_headings': h2_headings,
'h3_headings': h3_headings,
'heading_hierarchy_score': heading_hierarchy_score,
'recommendations': self._get_heading_recommendations(h1_headings, h2_headings, h3_headings)
}
except Exception as e:
logger.error(f"Heading structure analysis failed: {e}")
raise e
# Helper methods for calculations and scoring
def _calculate_structure_score(self, sections: int, paragraphs: int, has_intro: bool, has_conclusion: bool) -> int:
"""Calculate content structure score"""
score = 0
# Section count (optimal: 3-8 sections)
if 3 <= sections <= 8:
score += 30
elif sections < 3:
score += 15
else:
score += 20
# Paragraph count (optimal: 8-20 paragraphs)
if 8 <= paragraphs <= 20:
score += 30
elif paragraphs < 8:
score += 15
else:
score += 20
# Introduction and conclusion
if has_intro:
score += 20
if has_conclusion:
score += 20
return min(score, 100)
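# Worked example (illustrative): 5 sections (+30), 12 paragraphs (+30), an
# introduction (+20) and a conclusion (+20) add up to the full score of 100.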
def _calculate_keyword_density(self, content: str, keyword: str) -> float:
"""Calculate keyword density percentage"""
content_lower = content.lower()
keyword_lower = keyword.lower()
word_count = len(content.split())
keyword_count = content_lower.count(keyword_lower)
return (keyword_count / word_count * 100) if word_count > 0 else 0
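# Worked example (illustrative): a keyword appearing 8 times in a 500-word
# post has a density of 8 / 500 * 100 = 1.6%, inside the 1-3% band the
# scoring below treats as optimal. Counting is substring-based, so
# "blogging" also counts as an occurrence of "blog".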
def _keyword_in_headings(self, content: str, keyword: str) -> bool:
"""Check if keyword appears in headings"""
headings = re.findall(r'^#+ (.+)$', content, re.MULTILINE)
return any(keyword.lower() in heading.lower() for heading in headings)
def _calculate_avg_sentence_length(self, content: str) -> float:
"""Calculate average sentence length"""
sentences = re.split(r'[.!?]+', content)
sentences = [s.strip() for s in sentences if s.strip()]
if not sentences:
return 0
total_words = sum(len(sentence.split()) for sentence in sentences)
return total_words / len(sentences)
def _calculate_avg_paragraph_length(self, content: str) -> float:
"""Calculate average paragraph length"""
paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
if not paragraphs:
return 0
total_words = sum(len(paragraph.split()) for paragraph in paragraphs)
return total_words / len(paragraphs)
def _calculate_readability_score(self, metrics: Dict[str, float]) -> int:
"""Calculate overall readability score"""
# Flesch Reading Ease (0-100, higher is better)
flesch_score = metrics.get('flesch_reading_ease', 0)
# Convert to 0-100 scale
if flesch_score >= 80:
return 90
elif flesch_score >= 60:
return 80
elif flesch_score >= 40:
return 70
elif flesch_score >= 20:
return 60
else:
return 50
def _determine_target_audience(self, metrics: Dict[str, float]) -> str:
"""Determine target audience based on readability metrics"""
flesch_score = metrics.get('flesch_reading_ease', 0)
if flesch_score >= 80:
return "General audience (8th grade level)"
elif flesch_score >= 60:
return "High school level"
elif flesch_score >= 40:
return "College level"
else:
return "Graduate level"
def _calculate_content_depth_score(self, word_count: int, vocabulary_diversity: float) -> int:
"""Calculate content depth score"""
score = 0
# Word count (optimal: 800-2000 words)
if 800 <= word_count <= 2000:
score += 50
elif word_count < 800:
score += 30
else:
score += 40
# Vocabulary diversity (optimal: 0.4-0.7)
if 0.4 <= vocabulary_diversity <= 0.7:
score += 50
elif vocabulary_diversity < 0.4:
score += 30
else:
score += 40
return min(score, 100)
def _calculate_flow_score(self, transition_count: int, word_count: int) -> int:
"""Calculate content flow score"""
if word_count == 0:
return 0
transition_density = transition_count / (word_count / 100)
# Optimal transition density: 1-3 per 100 words
if 1 <= transition_density <= 3:
return 90
elif transition_density < 1:
return 60
else:
return 70
def _calculate_heading_hierarchy_score(self, h1: List[str], h2: List[str], h3: List[str]) -> int:
"""Calculate heading hierarchy score"""
score = 0
# Should have exactly 1 H1
if len(h1) == 1:
score += 40
elif len(h1) == 0:
score += 20
else:
score += 10
# Should have 3-8 H2 headings
if 3 <= len(h2) <= 8:
score += 40
elif len(h2) < 3:
score += 20
else:
score += 30
# H3 headings are optional but good for structure
if len(h3) > 0:
score += 20
return min(score, 100)
def _calculate_keyword_score(self, keyword_analysis: Dict[str, Any]) -> int:
"""Calculate keyword optimization score"""
score = 0
# Check keyword density (optimal: 1-3%)
densities = keyword_analysis.get('keyword_density', {})
for keyword, density in densities.items():
if 1 <= density <= 3:
score += 30
elif density < 1:
score += 15
else:
score += 10
# Check keyword distribution
distributions = keyword_analysis.get('keyword_distribution', {})
for keyword, dist in distributions.items():
if dist.get('in_headings', False):
score += 20
first_occurrence = dist.get('first_occurrence', -1)
if 0 <= first_occurrence < 100:  # Early occurrence (ignore -1 = not found)
score += 20
# Penalize missing keywords
missing = len(keyword_analysis.get('missing_keywords', []))
score -= missing * 10
# Penalize over-optimization
over_opt = len(keyword_analysis.get('over_optimization', []))
score -= over_opt * 15
return max(0, min(score, 100))
def _calculate_weighted_score(self, scores: Dict[str, int]) -> int:
"""Calculate weighted overall score"""
weights = {
'structure': 0.2,
'keywords': 0.25,
'readability': 0.2,
'quality': 0.15,
'headings': 0.1,
'ai_insights': 0.1
}
weighted_sum = sum(scores.get(key, 0) * weight for key, weight in weights.items())
return int(weighted_sum)
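# Worked example (illustrative): structure=80, keywords=70, readability=90,
# quality=60, headings=100, ai_insights=50 gives
# 80*0.2 + 70*0.25 + 90*0.2 + 60*0.15 + 100*0.1 + 50*0.1 = 75.5 -> 75.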
# Recommendation methods
def _get_structure_recommendations(self, sections: int, has_intro: bool, has_conclusion: bool) -> List[str]:
"""Get structure recommendations"""
recommendations = []
if sections < 3:
recommendations.append("Add more sections to improve content structure")
elif sections > 8:
recommendations.append("Consider combining some sections for better flow")
if not has_intro:
recommendations.append("Add an introduction section to set context")
if not has_conclusion:
recommendations.append("Add a conclusion section to summarize key points")
return recommendations
def _get_readability_recommendations(self, metrics: Dict[str, float], avg_sentence_length: float) -> List[str]:
"""Get readability recommendations"""
recommendations = []
flesch_score = metrics.get('flesch_reading_ease', 0)
if flesch_score < 60:
recommendations.append("Simplify language and use shorter sentences")
if avg_sentence_length > 20:
recommendations.append("Break down long sentences for better readability")
if flesch_score > 80:
recommendations.append("Consider adding more technical depth for expert audience")
return recommendations
def _get_content_quality_recommendations(self, word_count: int, vocabulary_diversity: float, transition_count: int) -> List[str]:
"""Get content quality recommendations"""
recommendations = []
if word_count < 800:
recommendations.append("Expand content with more detailed explanations")
elif word_count > 2000:
recommendations.append("Consider breaking into multiple posts")
if vocabulary_diversity < 0.4:
recommendations.append("Use more varied vocabulary to improve engagement")
if transition_count < 3:
recommendations.append("Add more transition words to improve flow")
return recommendations
def _get_heading_recommendations(self, h1: List[str], h2: List[str], h3: List[str]) -> List[str]:
"""Get heading recommendations"""
recommendations = []
if len(h1) == 0:
recommendations.append("Add a main H1 heading")
elif len(h1) > 1:
recommendations.append("Use only one H1 heading per post")
if len(h2) < 3:
recommendations.append("Add more H2 headings to structure content")
elif len(h2) > 8:
recommendations.append("Consider using H3 headings for better hierarchy")
return recommendations
async def _run_ai_analysis(self, blog_content: str, keywords_data: Dict[str, Any], non_ai_results: Dict[str, Any], user_id: Optional[str] = None) -> Dict[str, Any]:
"""Run a single AI analysis pass for structured insights (provider-agnostic)"""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
try:
# Prepare context for AI analysis
context = {
'blog_content': blog_content,
'keywords_data': keywords_data,
'non_ai_results': non_ai_results
}
# Create AI prompt for structured analysis
prompt = self._create_ai_analysis_prompt(context)
schema = {
"type": "object",
"properties": {
"content_quality_insights": {
"type": "object",
"properties": {
"engagement_score": {"type": "number"},
"value_proposition": {"type": "string"},
"content_gaps": {"type": "array", "items": {"type": "string"}},
"improvement_suggestions": {"type": "array", "items": {"type": "string"}}
}
},
"seo_optimization_insights": {
"type": "object",
"properties": {
"keyword_optimization": {"type": "string"},
"content_relevance": {"type": "string"},
"search_intent_alignment": {"type": "string"},
"seo_improvements": {"type": "array", "items": {"type": "string"}}
}
},
"user_experience_insights": {
"type": "object",
"properties": {
"content_flow": {"type": "string"},
"readability_assessment": {"type": "string"},
"engagement_factors": {"type": "array", "items": {"type": "string"}},
"ux_improvements": {"type": "array", "items": {"type": "string"}}
}
},
"competitive_analysis": {
"type": "object",
"properties": {
"content_differentiation": {"type": "string"},
"unique_value": {"type": "string"},
"competitive_advantages": {"type": "array", "items": {"type": "string"}},
"market_positioning": {"type": "string"}
}
}
}
}
# Provider-agnostic structured response respecting GPT_PROVIDER
ai_response = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt=None,
user_id=user_id # Pass user_id for subscription checking
)
return ai_response
except Exception as e:
logger.error(f"AI analysis failed: {e}")
raise e
def _create_ai_analysis_prompt(self, context: Dict[str, Any]) -> str:
"""Create AI analysis prompt"""
blog_content = context['blog_content']
keywords_data = context['keywords_data']
non_ai_results = context['non_ai_results']
prompt = f"""
Analyze this blog content for SEO optimization and user experience. Provide structured insights based on the content and keyword data.
BLOG CONTENT:
{blog_content[:2000]}...
KEYWORDS DATA:
Primary Keywords: {keywords_data.get('primary', [])}
Long-tail Keywords: {keywords_data.get('long_tail', [])}
Semantic Keywords: {keywords_data.get('semantic', [])}
Search Intent: {keywords_data.get('search_intent', 'informational')}
NON-AI ANALYSIS RESULTS:
Structure Score: {non_ai_results.get('content_structure', {}).get('structure_score', 0)}
Readability Score: {non_ai_results.get('readability_analysis', {}).get('readability_score', 0)}
Content Quality Score: {non_ai_results.get('content_quality', {}).get('content_depth_score', 0)}
Please provide:
1. Content Quality Insights: Assess engagement potential, value proposition, content gaps, and improvement suggestions
2. SEO Optimization Insights: Evaluate keyword optimization, content relevance, search intent alignment, and SEO improvements
3. User Experience Insights: Analyze content flow, readability, engagement factors, and UX improvements
4. Competitive Analysis: Identify content differentiation, unique value, competitive advantages, and market positioning
Focus on actionable insights that can improve the blog's performance and user engagement.
"""
return prompt
def _compile_blog_seo_results(self, non_ai_results: Dict[str, Any], ai_insights: Dict[str, Any], keywords_data: Dict[str, Any]) -> Dict[str, Any]:
"""Compile comprehensive SEO analysis results"""
try:
# Validate required data - fail fast if missing
if not non_ai_results:
raise ValueError("Non-AI analysis results are missing")
if not ai_insights:
raise ValueError("AI insights are missing")
# Calculate category scores
category_scores = {
'structure': non_ai_results.get('content_structure', {}).get('structure_score', 0),
'keywords': self._calculate_keyword_score(non_ai_results.get('keyword_analysis', {})),
'readability': non_ai_results.get('readability_analysis', {}).get('readability_score', 0),
'quality': non_ai_results.get('content_quality', {}).get('content_depth_score', 0),
'headings': non_ai_results.get('heading_structure', {}).get('heading_hierarchy_score', 0),
'ai_insights': ai_insights.get('content_quality_insights', {}).get('engagement_score', 0)
}
# Calculate overall score
overall_score = self._calculate_weighted_score(category_scores)
# Compile actionable recommendations
actionable_recommendations = self._compile_actionable_recommendations(non_ai_results, ai_insights)
# Create visualization data
visualization_data = self._create_visualization_data(category_scores, non_ai_results)
return {
'overall_score': overall_score,
'category_scores': category_scores,
'detailed_analysis': non_ai_results,
'ai_insights': ai_insights,
'keywords_data': keywords_data,
'visualization_data': visualization_data,
'actionable_recommendations': actionable_recommendations,
'generated_at': datetime.utcnow().isoformat(),
'analysis_summary': self._create_analysis_summary(overall_score, category_scores, ai_insights)
}
except Exception as e:
logger.error(f"Results compilation failed: {e}")
# Fail fast - don't return fallback data
raise e
def _compile_actionable_recommendations(self, non_ai_results: Dict[str, Any], ai_insights: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Compile actionable recommendations from all sources"""
recommendations = []
# Structure recommendations
structure_recs = non_ai_results.get('content_structure', {}).get('recommendations', [])
for rec in structure_recs:
recommendations.append({
'category': 'Structure',
'priority': 'High',
'recommendation': rec,
'impact': 'Improves content organization and user experience'
})
# Keyword recommendations
keyword_recs = non_ai_results.get('keyword_analysis', {}).get('recommendations', [])
for rec in keyword_recs:
recommendations.append({
'category': 'Keywords',
'priority': 'High',
'recommendation': rec,
'impact': 'Improves search engine visibility'
})
# Readability recommendations
readability_recs = non_ai_results.get('readability_analysis', {}).get('recommendations', [])
for rec in readability_recs:
recommendations.append({
'category': 'Readability',
'priority': 'Medium',
'recommendation': rec,
'impact': 'Improves user engagement and comprehension'
})
# AI insights recommendations
ai_recs = ai_insights.get('content_quality_insights', {}).get('improvement_suggestions', [])
for rec in ai_recs:
recommendations.append({
'category': 'Content Quality',
'priority': 'Medium',
'recommendation': rec,
'impact': 'Enhances content value and engagement'
})
return recommendations
def _create_visualization_data(self, category_scores: Dict[str, int], non_ai_results: Dict[str, Any]) -> Dict[str, Any]:
"""Create data for visualization components"""
return {
'score_radar': {
'categories': list(category_scores.keys()),
'scores': list(category_scores.values()),
'max_score': 100
},
'keyword_analysis': {
'densities': non_ai_results.get('keyword_analysis', {}).get('keyword_density', {}),
'missing_keywords': non_ai_results.get('keyword_analysis', {}).get('missing_keywords', []),
'over_optimization': non_ai_results.get('keyword_analysis', {}).get('over_optimization', [])
},
'readability_metrics': non_ai_results.get('readability_analysis', {}).get('metrics', {}),
'content_stats': {
'word_count': non_ai_results.get('content_quality', {}).get('word_count', 0),
'sections': non_ai_results.get('content_structure', {}).get('total_sections', 0),
'paragraphs': non_ai_results.get('content_structure', {}).get('total_paragraphs', 0)
}
}
def _create_analysis_summary(self, overall_score: int, category_scores: Dict[str, int], ai_insights: Dict[str, Any]) -> Dict[str, Any]:
"""Create analysis summary"""
# Determine overall grade
if overall_score >= 90:
grade = 'A'
status = 'Excellent'
elif overall_score >= 80:
grade = 'B'
status = 'Good'
elif overall_score >= 70:
grade = 'C'
status = 'Fair'
elif overall_score >= 60:
grade = 'D'
status = 'Needs Improvement'
else:
grade = 'F'
status = 'Poor'
# Find strongest and weakest categories
strongest_category = max(category_scores.items(), key=lambda x: x[1])
weakest_category = min(category_scores.items(), key=lambda x: x[1])
return {
'overall_grade': grade,
'status': status,
'strongest_category': strongest_category[0],
'weakest_category': weakest_category[0],
'key_strengths': self._identify_key_strengths(category_scores),
'key_weaknesses': self._identify_key_weaknesses(category_scores),
'ai_summary': ai_insights.get('content_quality_insights', {}).get('value_proposition', '')
}
def _identify_key_strengths(self, category_scores: Dict[str, int]) -> List[str]:
"""Identify key strengths"""
strengths = []
for category, score in category_scores.items():
if score >= 80:
strengths.append(f"Strong {category} optimization")
return strengths
def _identify_key_weaknesses(self, category_scores: Dict[str, int]) -> List[str]:
"""Identify key weaknesses"""
weaknesses = []
for category, score in category_scores.items():
if score < 60:
weaknesses.append(f"Needs improvement in {category}")
return weaknesses
def _create_error_result(self, error_message: str) -> Dict[str, Any]:
"""Create error result - this should not be used in fail-fast mode"""
raise ValueError(f"Error result creation not allowed in fail-fast mode: {error_message}")

View File

@@ -0,0 +1,668 @@
"""
Blog SEO Metadata Generator
Optimized SEO metadata generation service that uses at most two AI calls
to generate comprehensive metadata including titles, descriptions,
Open Graph tags, Twitter cards, and structured data.
"""
import asyncio
import json
from datetime import datetime
from typing import Dict, Any, List, Optional
from loguru import logger
from services.llm_providers.main_text_generation import llm_text_gen
class BlogSEOMetadataGenerator:
"""Optimized SEO metadata generator with maximum 2 AI calls"""
def __init__(self):
"""Initialize the metadata generator"""
logger.info("BlogSEOMetadataGenerator initialized")
async def generate_comprehensive_metadata(
self,
blog_content: str,
blog_title: str,
research_data: Dict[str, Any],
outline: Optional[List[Dict[str, Any]]] = None,
seo_analysis: Optional[Dict[str, Any]] = None,
user_id: Optional[str] = None
) -> Dict[str, Any]:
"""
Generate comprehensive SEO metadata using at most two AI calls
Args:
blog_content: The blog content to analyze
blog_title: The blog title
research_data: Research data containing keywords and insights
outline: Outline structure with sections and headings
seo_analysis: SEO analysis results from previous phase
user_id: Clerk user ID for subscription checking (required)
Returns:
Comprehensive metadata including all SEO elements
"""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
try:
logger.info("Starting comprehensive SEO metadata generation")
# Extract keywords and context from research data
keywords_data = self._extract_keywords_from_research(research_data)
logger.info(f"Extracted keywords: {keywords_data}")
# Call 1: Generate core SEO metadata (parallel with Call 2)
logger.info("Generating core SEO metadata")
core_metadata_task = self._generate_core_metadata(
blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
)
# Call 2: Generate social media and structured data (parallel with Call 1)
logger.info("Generating social media and structured data")
social_metadata_task = self._generate_social_metadata(
blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
)
# Wait for both calls to complete
core_metadata, social_metadata = await asyncio.gather(
core_metadata_task,
social_metadata_task
)
# Compile final response
results = self._compile_metadata_response(core_metadata, social_metadata, blog_title)
logger.info(f"SEO metadata generation completed successfully")
return results
except Exception as e:
logger.error(f"SEO metadata generation failed: {e}")
# Fail fast - don't return fallback data
raise e
def _extract_keywords_from_research(self, research_data: Dict[str, Any]) -> Dict[str, Any]:
"""Extract keywords and context from research data"""
try:
keyword_analysis = research_data.get('keyword_analysis', {})
# Handle both 'semantic' and 'semantic_keywords' field names
semantic_keywords = keyword_analysis.get('semantic', []) or keyword_analysis.get('semantic_keywords', [])
return {
'primary_keywords': keyword_analysis.get('primary', []),
'long_tail_keywords': keyword_analysis.get('long_tail', []),
'semantic_keywords': semantic_keywords,
'all_keywords': keyword_analysis.get('all_keywords', []),
'search_intent': keyword_analysis.get('search_intent', 'informational'),
'target_audience': research_data.get('target_audience', 'general'),
'industry': research_data.get('industry', 'general')
}
except Exception as e:
logger.error(f"Failed to extract keywords from research: {e}")
return {
'primary_keywords': [],
'long_tail_keywords': [],
'semantic_keywords': [],
'all_keywords': [],
'search_intent': 'informational',
'target_audience': 'general',
'industry': 'general'
}
async def _generate_core_metadata(
self,
blog_content: str,
blog_title: str,
keywords_data: Dict[str, Any],
outline: Optional[List[Dict[str, Any]]] = None,
seo_analysis: Optional[Dict[str, Any]] = None,
user_id: Optional[str] = None
) -> Dict[str, Any]:
"""Generate core SEO metadata (Call 1)"""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
try:
# Create comprehensive prompt for core metadata
prompt = self._create_core_metadata_prompt(
blog_content, blog_title, keywords_data, outline, seo_analysis
)
# Define simplified structured schema for core metadata
schema = {
"type": "object",
"properties": {
"seo_title": {
"type": "string",
"description": "SEO-optimized title (50-60 characters)"
},
"meta_description": {
"type": "string",
"description": "Meta description (150-160 characters)"
},
"url_slug": {
"type": "string",
"description": "URL slug (lowercase, hyphens)"
},
"blog_tags": {
"type": "array",
"items": {"type": "string"},
"description": "Blog tags array"
},
"blog_categories": {
"type": "array",
"items": {"type": "string"},
"description": "Blog categories array"
},
"social_hashtags": {
"type": "array",
"items": {"type": "string"},
"description": "Social media hashtags array"
},
"reading_time": {
"type": "integer",
"description": "Reading time in minutes"
},
"focus_keyword": {
"type": "string",
"description": "Primary focus keyword"
}
},
"required": ["seo_title", "meta_description", "url_slug", "blog_tags", "blog_categories", "social_hashtags", "reading_time", "focus_keyword"]
}
# Get structured response using provider-agnostic llm_text_gen
ai_response_raw = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt=None,
user_id=user_id # Pass user_id for subscription checking
)
# Handle response: llm_text_gen may return dict (from structured JSON) or str (needs parsing)
ai_response = ai_response_raw
if isinstance(ai_response_raw, str):
try:
ai_response = json.loads(ai_response_raw)
except json.JSONDecodeError:
logger.error(f"Failed to parse JSON response: {ai_response_raw[:200]}...")
ai_response = None
# Check if we got a valid response
if not ai_response or not isinstance(ai_response, dict):
logger.error("Core metadata generation failed: Invalid response from LLM")
# Return fallback response
primary_keywords = ', '.join(keywords_data.get('primary_keywords', ['content']))
word_count = len(blog_content.split())
return {
'seo_title': blog_title,
'meta_description': f'Learn about {primary_keywords.split(", ")[0] if primary_keywords else "this topic"}.',
'url_slug': blog_title.lower().replace(' ', '-').replace(':', '').replace(',', '')[:50],
'blog_tags': primary_keywords.split(', ') if primary_keywords else ['content'],
'blog_categories': ['Content Marketing', 'Technology'],
'social_hashtags': ['#content', '#marketing', '#technology'],
'reading_time': max(1, word_count // 200),
'focus_keyword': primary_keywords.split(', ')[0] if primary_keywords else 'content'
}
logger.info(f"Core metadata generation completed. Response keys: {list(ai_response.keys())}")
logger.info(f"Core metadata response: {ai_response}")
return ai_response
except Exception as e:
logger.error(f"Core metadata generation failed: {e}")
raise e
async def _generate_social_metadata(
self,
blog_content: str,
blog_title: str,
keywords_data: Dict[str, Any],
outline: Optional[List[Dict[str, Any]]] = None,
seo_analysis: Optional[Dict[str, Any]] = None,
user_id: Optional[str] = None
) -> Dict[str, Any]:
"""Generate social media and structured data (Call 2)"""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
try:
# Create comprehensive prompt for social metadata
prompt = self._create_social_metadata_prompt(
blog_content, blog_title, keywords_data, outline, seo_analysis
)
# Define simplified structured schema for social metadata
schema = {
"type": "object",
"properties": {
"open_graph": {
"type": "object",
"properties": {
"title": {"type": "string"},
"description": {"type": "string"},
"image": {"type": "string"},
"type": {"type": "string"},
"site_name": {"type": "string"},
"url": {"type": "string"}
}
},
"twitter_card": {
"type": "object",
"properties": {
"card": {"type": "string"},
"title": {"type": "string"},
"description": {"type": "string"},
"image": {"type": "string"},
"site": {"type": "string"},
"creator": {"type": "string"}
}
},
"json_ld_schema": {
"type": "object",
"properties": {
"@context": {"type": "string"},
"@type": {"type": "string"},
"headline": {"type": "string"},
"description": {"type": "string"},
"author": {"type": "object"},
"publisher": {"type": "object"},
"datePublished": {"type": "string"},
"dateModified": {"type": "string"},
"mainEntityOfPage": {"type": "string"},
"keywords": {"type": "array"},
"wordCount": {"type": "integer"}
}
}
},
"required": ["open_graph", "twitter_card", "json_ld_schema"]
}
# Get structured response using provider-agnostic llm_text_gen
ai_response_raw = llm_text_gen(
prompt=prompt,
json_struct=schema,
system_prompt=None,
user_id=user_id # Pass user_id for subscription checking
)
# Handle response: llm_text_gen may return dict (from structured JSON) or str (needs parsing)
ai_response = ai_response_raw
if isinstance(ai_response_raw, str):
try:
ai_response = json.loads(ai_response_raw)
except json.JSONDecodeError:
logger.error(f"Failed to parse JSON response: {ai_response_raw[:200]}...")
ai_response = None
# Check if we got a valid response
if not ai_response or not isinstance(ai_response, dict) or not ai_response.get('open_graph') or not ai_response.get('twitter_card') or not ai_response.get('json_ld_schema'):
logger.error("Social metadata generation failed: Invalid or empty response from LLM")
# Return fallback response
return {
'open_graph': {
'title': blog_title,
'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
'image': 'https://example.com/image.jpg',
'type': 'article',
'site_name': 'Your Website',
'url': 'https://example.com/blog'
},
'twitter_card': {
'card': 'summary_large_image',
'title': blog_title,
'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
'image': 'https://example.com/image.jpg',
'site': '@yourwebsite',
'creator': '@author'
},
'json_ld_schema': {
'@context': 'https://schema.org',
'@type': 'Article',
'headline': blog_title,
'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
'author': {'@type': 'Person', 'name': 'Author Name'},
'publisher': {'@type': 'Organization', 'name': 'Your Website'},
'datePublished': '2025-01-01T00:00:00Z',
'dateModified': '2025-01-01T00:00:00Z',
'mainEntityOfPage': 'https://example.com/blog',
'keywords': keywords_data.get('primary_keywords', ['content']),
'wordCount': len(blog_content.split())
}
}
logger.info(f"Social metadata generation completed. Response keys: {list(ai_response.keys())}")
logger.info(f"Open Graph data: {ai_response.get('open_graph', 'Not found')}")
logger.info(f"Twitter Card data: {ai_response.get('twitter_card', 'Not found')}")
logger.info(f"JSON-LD data: {ai_response.get('json_ld_schema', 'Not found')}")
return ai_response
except Exception as e:
logger.error(f"Social metadata generation failed: {e}")
raise e
def _extract_content_highlights(self, blog_content: str, max_length: int = 2500) -> str:
"""Extract key sections from blog content for prompt context"""
try:
lines = blog_content.split('\n')
# Get first paragraph (introduction)
intro = ""
for line in lines[:20]:
if line.strip() and not line.strip().startswith('#'):
intro += line.strip() + " "
if len(intro) > 300:
break
# Get section headings
headings = [line.strip() for line in lines if line.strip().startswith('##')][:6]
# Get conclusion if available
conclusion = ""
for line in reversed(lines[-20:]):
if line.strip() and not line.strip().startswith('#'):
conclusion = line.strip() + " " + conclusion
if len(conclusion) > 300:
break
highlights = f"INTRODUCTION: {intro[:300]}...\n\n"
highlights += f"SECTION HEADINGS: {' | '.join([h.replace('##', '').strip() for h in headings])}\n\n"
if conclusion:
highlights += f"CONCLUSION: {conclusion[:300]}..."
return highlights[:max_length]
except Exception as e:
logger.warning(f"Failed to extract content highlights: {e}")
return blog_content[:2000] + "..."
def _create_core_metadata_prompt(
self,
blog_content: str,
blog_title: str,
keywords_data: Dict[str, Any],
outline: Optional[List[Dict[str, Any]]] = None,
seo_analysis: Optional[Dict[str, Any]] = None
) -> str:
"""Create high-quality prompt for core metadata generation"""
primary_keywords = ", ".join(keywords_data.get('primary_keywords', []))
semantic_keywords = ", ".join(keywords_data.get('semantic_keywords', []))
search_intent = keywords_data.get('search_intent', 'informational')
target_audience = keywords_data.get('target_audience', 'general')
industry = keywords_data.get('industry', 'general')
word_count = len(blog_content.split())
# Extract outline structure
outline_context = ""
if outline:
headings = [s.get('heading', '') for s in outline if s.get('heading')]
outline_context = f"""
OUTLINE STRUCTURE:
- Total sections: {len(outline)}
- Section headings: {', '.join(headings[:8])}
- Content hierarchy: Well-structured with {len(outline)} main sections
"""
# Extract SEO analysis insights
seo_context = ""
if seo_analysis:
overall_score = seo_analysis.get('overall_score', seo_analysis.get('seo_score', 0))
category_scores = seo_analysis.get('category_scores', {})
applied_recs = seo_analysis.get('applied_recommendations', [])
seo_context = f"""
SEO ANALYSIS RESULTS:
- Overall SEO Score: {overall_score}/100
- Category Scores: Structure {category_scores.get('structure', category_scores.get('Structure', 0))}, Keywords {category_scores.get('keywords', category_scores.get('Keywords', 0))}, Readability {category_scores.get('readability', category_scores.get('Readability', 0))}
- Applied Recommendations: {len(applied_recs)} SEO optimizations have been applied
- Content Quality: Optimized for search engines with keyword focus
"""
# Get more content context (key sections instead of just first 1000 chars)
content_preview = self._extract_content_highlights(blog_content)
prompt = f"""
Generate comprehensive, personalized SEO metadata for this blog post.
=== BLOG CONTENT CONTEXT ===
TITLE: {blog_title}
CONTENT PREVIEW (key sections): {content_preview}
WORD COUNT: {word_count} words
READING TIME ESTIMATE: {max(1, word_count // 200)} minutes
{outline_context}
=== KEYWORD & AUDIENCE DATA ===
PRIMARY KEYWORDS: {primary_keywords}
SEMANTIC KEYWORDS: {semantic_keywords}
SEARCH INTENT: {search_intent}
TARGET AUDIENCE: {target_audience}
INDUSTRY: {industry}
{seo_context}
=== METADATA GENERATION REQUIREMENTS ===
1. SEO TITLE (50-60 characters, must include primary keyword):
- Front-load primary keyword
- Make it compelling and click-worthy
- Include power words if appropriate for {target_audience} audience
- Optimized for {search_intent} search intent
2. META DESCRIPTION (150-160 characters, must include CTA):
- Include primary keyword naturally in first 120 chars
- Add compelling call-to-action (e.g., "Learn more", "Discover how", "Get started")
- Highlight value proposition for {target_audience} audience
- Use {industry} industry-specific terminology where relevant
3. URL SLUG (lowercase, hyphens, 3-5 words):
- Include primary keyword
- Remove stop words
- Keep it concise and readable
4. BLOG TAGS (5-8 relevant tags):
- Mix of primary, semantic, and long-tail keywords
- Industry-specific tags for {industry}
- Audience-relevant tags for {target_audience}
5. BLOG CATEGORIES (2-3 categories):
- Based on content structure and {industry} industry standards
- Reflect main themes from outline sections
6. SOCIAL HASHTAGS (5-10 hashtags with #):
- Include primary keyword as hashtag
- Industry-specific hashtags for {industry}
- Trending/relevant hashtags for {target_audience}
7. READING TIME (calculate from {word_count} words):
- Average reading speed: 200 words/minute
- Round to nearest minute
8. FOCUS KEYWORD (primary keyword for SEO):
- Select the most important primary keyword
- Should match the main topic and search intent
=== QUALITY REQUIREMENTS ===
- All metadata must be unique, not generic
- Incorporate insights from SEO analysis if provided
- Reflect the actual content structure from outline
- Use language appropriate for {target_audience} audience
- Optimize for {search_intent} search intent
- Make descriptions compelling and action-oriented
Generate metadata that is personalized, compelling, and SEO-optimized.
"""
return prompt
def _create_social_metadata_prompt(
self,
blog_content: str,
blog_title: str,
keywords_data: Dict[str, Any],
outline: Optional[List[Dict[str, Any]]] = None,
seo_analysis: Optional[Dict[str, Any]] = None
) -> str:
"""Create high-quality prompt for social metadata generation"""
primary_keywords = ", ".join(keywords_data.get('primary_keywords', []))
search_intent = keywords_data.get('search_intent', 'informational')
target_audience = keywords_data.get('target_audience', 'general')
industry = keywords_data.get('industry', 'general')
current_date = datetime.now().isoformat()
# Add outline and SEO context similar to core metadata prompt
outline_context = ""
if outline:
headings = [s.get('heading', '') for s in outline if s.get('heading')]
outline_context = f"\nOUTLINE SECTIONS: {', '.join(headings[:6])}\n"
seo_context = ""
if seo_analysis:
overall_score = seo_analysis.get('overall_score', seo_analysis.get('seo_score', 0))
seo_context = f"\nSEO SCORE: {overall_score}/100 (optimized content)\n"
content_preview = self._extract_content_highlights(blog_content, 1500)
prompt = f"""
Generate engaging social media metadata for this blog post.
=== CONTENT ===
TITLE: {blog_title}
CONTENT: {content_preview}
{outline_context}
{seo_context}
KEYWORDS: {primary_keywords}
TARGET AUDIENCE: {target_audience}
INDUSTRY: {industry}
CURRENT DATE: {current_date}
=== GENERATION REQUIREMENTS ===
1. OPEN GRAPH (Facebook/LinkedIn):
- title: 60 chars max, include primary keyword, compelling for {target_audience}
- description: 160 chars max, include CTA and value proposition
- image: Suggest an appropriate image URL (placeholder if none available)
- type: "article"
- site_name: Use appropriate site name for {industry} industry
- url: Generate canonical URL structure
2. TWITTER CARD:
- card: "summary_large_image"
- title: 70 chars max, optimized for Twitter audience
- description: 200 chars max with relevant hashtags inline
- image: Match Open Graph image
- site: @yourwebsite (placeholder, user should update)
- creator: @author (placeholder, user should update)
3. JSON-LD SCHEMA (Article):
- @context: "https://schema.org"
- @type: "Article"
- headline: Article title (optimized)
- description: Article description (150-200 chars)
- author: {{"@type": "Person", "name": "Author Name"}} (placeholder)
- publisher: {{"@type": "Organization", "name": "Site Name", "logo": {{"@type": "ImageObject", "url": "logo-url"}}}}
- datePublished: {current_date}
- dateModified: {current_date}
- mainEntityOfPage: {{"@type": "WebPage", "@id": "canonical-url"}}
- keywords: Array of primary and semantic keywords
- wordCount: {len(blog_content.split())}
- articleSection: Primary category based on content
- inLanguage: "en-US"
Make it engaging, personalized for {target_audience}, and optimized for {industry} industry.
"""
return prompt
def _compile_metadata_response(
self,
core_metadata: Dict[str, Any],
social_metadata: Dict[str, Any],
original_title: str
) -> Dict[str, Any]:
"""Compile final metadata response"""
try:
# Extract data from AI responses
seo_title = core_metadata.get('seo_title', original_title)
meta_description = core_metadata.get('meta_description', '')
url_slug = core_metadata.get('url_slug', '')
blog_tags = core_metadata.get('blog_tags', [])
blog_categories = core_metadata.get('blog_categories', [])
social_hashtags = core_metadata.get('social_hashtags', [])
canonical_url = core_metadata.get('canonical_url', '')  # not requested in the schema; populated only if the model volunteers it
reading_time = core_metadata.get('reading_time', 0)
focus_keyword = core_metadata.get('focus_keyword', '')
open_graph = social_metadata.get('open_graph', {})
twitter_card = social_metadata.get('twitter_card', {})
json_ld_schema = social_metadata.get('json_ld_schema', {})
# Compile comprehensive response
response = {
'success': True,
'title_options': [seo_title], # For backward compatibility
'meta_descriptions': [meta_description], # For backward compatibility
'seo_title': seo_title,
'meta_description': meta_description,
'url_slug': url_slug,
'blog_tags': blog_tags,
'blog_categories': blog_categories,
'social_hashtags': social_hashtags,
'canonical_url': canonical_url,
'reading_time': reading_time,
'focus_keyword': focus_keyword,
'open_graph': open_graph,
'twitter_card': twitter_card,
'json_ld_schema': json_ld_schema,
'generated_at': datetime.utcnow().isoformat(),
'metadata_summary': {
'total_metadata_types': 10,
'ai_calls_used': 2,
'optimization_score': self._calculate_optimization_score(core_metadata, social_metadata)
}
}
logger.info(f"Metadata compilation completed. Generated {len(response)} metadata fields")
return response
except Exception as e:
logger.error(f"Metadata compilation failed: {e}")
raise e
def _calculate_optimization_score(self, core_metadata: Dict[str, Any], social_metadata: Dict[str, Any]) -> int:
"""Calculate overall optimization score for the generated metadata"""
try:
score = 0
# Check core metadata completeness
if core_metadata.get('seo_title'):
score += 15
if core_metadata.get('meta_description'):
score += 15
if core_metadata.get('url_slug'):
score += 10
if core_metadata.get('blog_tags'):
score += 10
if core_metadata.get('blog_categories'):
score += 10
if core_metadata.get('social_hashtags'):
score += 10
if core_metadata.get('focus_keyword'):
score += 10
# Check social metadata completeness
if social_metadata.get('open_graph'):
score += 10
if social_metadata.get('twitter_card'):
score += 5
if social_metadata.get('json_ld_schema'):
score += 5
return min(score, 100) # Cap at 100
except Exception as e:
logger.error(f"Failed to calculate optimization score: {e}")
return 0
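# --- Usage sketch (illustrative only) ----------------------------------------
# Placeholder payloads and user ID; assumes a configured LLM provider behind
# llm_text_gen.
if __name__ == "__main__":
    _generator = BlogSEOMetadataGenerator()
    _metadata = asyncio.run(
        _generator.generate_comprehensive_metadata(
            blog_content="# My Post\n\n## Introduction\n\nBody text...",
            blog_title="My Post",
            research_data={"keyword_analysis": {"primary": ["ai blog writer"]}},
            user_id="user_123",  # placeholder Clerk user ID
        )
    )
    logger.info(f"SEO title: {_metadata['seo_title']} (AI calls used: {_metadata['metadata_summary']['ai_calls_used']})")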

View File

@@ -0,0 +1,273 @@
"""Blog SEO Recommendation Applier
Applies actionable SEO recommendations to existing blog content using the
provider-agnostic `llm_text_gen` dispatcher. Ensures GPT_PROVIDER parity.
"""
import asyncio
from typing import Dict, Any, List
from utils.logger_utils import get_service_logger
from services.llm_providers.main_text_generation import llm_text_gen
logger = get_service_logger("blog_seo_recommendation_applier")
class BlogSEORecommendationApplier:
"""Apply actionable SEO recommendations to blog content."""
def __init__(self):
logger.debug("Initialized BlogSEORecommendationApplier")
async def apply_recommendations(self, payload: Dict[str, Any], user_id: str | None = None) -> Dict[str, Any]:
"""Apply recommendations and return updated content."""
if not user_id:
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
title = payload.get("title", "Untitled Blog")
sections: List[Dict[str, Any]] = payload.get("sections", [])
outline = payload.get("outline", [])
research = payload.get("research", {})
recommendations = payload.get("recommendations", [])
persona = payload.get("persona", {})
tone = payload.get("tone")
audience = payload.get("audience")
if not sections:
return {"success": False, "error": "No sections provided for recommendation application"}
if not recommendations:
logger.warning("apply_recommendations called without recommendations")
return {"success": True, "title": title, "sections": sections, "applied": []}
prompt = self._build_prompt(
title=title,
sections=sections,
outline=outline,
research=research,
recommendations=recommendations,
persona=persona,
tone=tone,
audience=audience,
)
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"sections": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {"type": "string"},
"heading": {"type": "string"},
"content": {"type": "string"},
"notes": {"type": "array", "items": {"type": "string"}},
},
"required": ["id", "heading", "content"],
},
},
"applied_recommendations": {
"type": "array",
"items": {
"type": "object",
"properties": {
"category": {"type": "string"},
"summary": {"type": "string"},
},
},
},
},
"required": ["sections"],
}
logger.info("Applying SEO recommendations via llm_text_gen")
result = await asyncio.to_thread(
llm_text_gen,
prompt,
None,
schema,
user_id, # Pass user_id for subscription checking
)
if not result or result.get("error"):
error_msg = result.get("error", "Unknown error") if result else "No response from text generator"
logger.error(f"SEO recommendation application failed: {error_msg}")
return {"success": False, "error": error_msg}
raw_sections = result.get("sections", []) or []
normalized_sections: List[Dict[str, Any]] = []
# Build lookup table from updated sections using their identifiers
updated_map: Dict[str, Dict[str, Any]] = {}
for updated in raw_sections:
section_id = str(
updated.get("id")
or updated.get("section_id")
or updated.get("heading")
or ""
).strip()
if not section_id:
continue
heading = (
updated.get("heading")
or updated.get("title")
or section_id
)
content_text = updated.get("content", "")
if isinstance(content_text, list):
content_text = "\n\n".join(str(p).strip() for p in content_text if p)
updated_map[section_id] = {
"id": section_id,
"heading": heading,
"content": str(content_text).strip(),
"notes": updated.get("notes", []),
}
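
        # Normalization strategy: prefer an ID match from the LLM output, then a
        # positional match, and finally fall back to the unchanged original section
        # so no content is ever dropped from the response.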
        if not updated_map and raw_sections:
            logger.warning("Updated sections missing identifiers; falling back to positional mapping")

        for index, original in enumerate(sections):
            fallback_id = str(
                original.get("id")
                or original.get("section_id")
                or f"section_{index + 1}"
            ).strip()
            mapped = updated_map.get(fallback_id)
            if not mapped and raw_sections:
                # Fall back to positional match if identifier lookup failed
                candidate = raw_sections[index] if index < len(raw_sections) else {}
                heading = (
                    candidate.get("heading")
                    or candidate.get("title")
                    or original.get("heading")
                    or original.get("title")
                    or f"Section {index + 1}"
                )
                content_text = candidate.get("content") or original.get("content", "")
                if isinstance(content_text, list):
                    content_text = "\n\n".join(str(p).strip() for p in content_text if p)
                mapped = {
                    "id": fallback_id,
                    "heading": heading,
                    "content": str(content_text).strip(),
                    "notes": candidate.get("notes", []),
                }
            if not mapped:
                # Fallback to original content if nothing else available
                mapped = {
                    "id": fallback_id,
                    "heading": original.get("heading") or original.get("title") or f"Section {index + 1}",
                    "content": str(original.get("content", "")).strip(),
                    "notes": original.get("notes", []),
                }
            normalized_sections.append(mapped)

        applied = result.get("applied_recommendations", [])
        logger.info("SEO recommendations applied successfully")
        return {
            "success": True,
            "title": result.get("title", title),
            "sections": normalized_sections,
            "applied": applied,
        }

    def _build_prompt(
        self,
        *,
        title: str,
        sections: List[Dict[str, Any]],
        outline: List[Dict[str, Any]],
        research: Dict[str, Any],
        recommendations: List[Dict[str, Any]],
        persona: Dict[str, Any],
        tone: str | None,
        audience: str | None,
    ) -> str:
        """Construct prompt for applying recommendations."""
        sections_str = []
        for section in sections:
            sections_str.append(
                f"ID: {section.get('id', 'section')}, Heading: {section.get('heading', 'Untitled')}\n"
                f"Current Content:\n{section.get('content', '')}\n"
            )

        outline_str = "\n".join(
            [
                f"- {item.get('heading', 'Section')} (Target words: {item.get('target_words', 'N/A')})"
                for item in outline
            ]
        )

        research_summary = research.get("keyword_analysis", {}) if research else {}
        primary_keywords = ", ".join(research_summary.get("primary", [])[:10]) or "None"

        recommendations_str = []
        for rec in recommendations:
            recommendations_str.append(
                f"Category: {rec.get('category', 'General')} | Priority: {rec.get('priority', 'Medium')}\n"
                f"Recommendation: {rec.get('recommendation', '')}\n"
                f"Impact: {rec.get('impact', '')}\n"
            )

        persona_str = (
            f"Persona: {persona}\n"
            if persona
            else "Persona: (not provided)\n"
        )

        style_guidance = []
        if tone:
            style_guidance.append(f"Desired tone: {tone}")
        if audience:
            style_guidance.append(f"Target audience: {audience}")
        style_str = "\n".join(style_guidance) if style_guidance else "Maintain current tone and audience alignment."

        prompt = f"""
You are an expert SEO content strategist. Update the blog content to apply the actionable recommendations.

Current Title: {title}
Primary Keywords (for context): {primary_keywords}

Outline Overview:
{outline_str or 'No outline supplied'}

Existing Sections:
{''.join(sections_str)}

Actionable Recommendations to Apply:
{''.join(recommendations_str)}

{persona_str}
{style_str}

Instructions:
1. Carefully apply the recommendations while preserving factual accuracy and research alignment.
2. Keep section identifiers (IDs) unchanged so the frontend can map updates correctly.
3. Improve clarity, flow, and SEO optimization per the guidance.
4. Return updated sections in the requested JSON format.
5. Provide a short summary of which recommendations were addressed.
"""
        return prompt


__all__ = ["BlogSEORecommendationApplier"]