Base code
151
backend/services/blog_writer/README.md
Normal file
@@ -0,0 +1,151 @@
# AI Blog Writer Service Architecture

This directory contains the refactored AI Blog Writer service with a clean, modular architecture.

## 📁 Directory Structure

```
blog_writer/
├── README.md                       # This file
├── blog_service.py                 # Main entry point (imports from core)
├── core/                           # Core service orchestrator
│   ├── __init__.py
│   └── blog_writer_service.py      # Main service coordinator
├── research/                       # Research functionality
│   ├── __init__.py
│   ├── research_service.py         # Main research orchestrator
│   ├── keyword_analyzer.py         # AI-powered keyword analysis
│   ├── competitor_analyzer.py      # Competitor intelligence
│   └── content_angle_generator.py  # Content angle discovery
├── outline/                        # Outline generation
│   ├── __init__.py
│   ├── outline_service.py          # Main outline orchestrator
│   ├── outline_generator.py        # AI-powered outline generation
│   ├── outline_optimizer.py        # Outline optimization
│   └── section_enhancer.py         # Section enhancement
├── content/                        # Content generation (TODO)
└── optimization/                   # SEO & optimization (TODO)
```

## 🏗️ Architecture Overview

### Core Module (`core/`)
- **`BlogWriterService`**: Main orchestrator that coordinates all blog writing functionality
- Provides a unified interface for research, outline generation, and content creation
- Delegates to specialized modules for specific functionality

### Research Module (`research/`)
- **`ResearchService`**: Orchestrates comprehensive research using Google Search grounding
- **`KeywordAnalyzer`**: AI-powered keyword analysis and extraction
- **`CompetitorAnalyzer`**: Competitor intelligence and market analysis
- **`ContentAngleGenerator`**: Strategic content angle discovery

### Outline Module (`outline/`)
- **`OutlineService`**: Manages outline generation, refinement, and optimization
- **`OutlineGenerator`**: AI-powered outline generation from research data
- **`OutlineOptimizer`**: Optimizes outlines for flow, SEO, and engagement
- **`SectionEnhancer`**: Enhances individual sections using AI

## 🔄 Service Flow

1. **Research Phase**: `ResearchService` → `KeywordAnalyzer` + `CompetitorAnalyzer` + `ContentAngleGenerator`
2. **Outline Phase**: `OutlineService` → `OutlineGenerator` → `OutlineOptimizer`
3. **Content Phase**: (TODO) Content generation and optimization
4. **Publishing Phase**: (TODO) Platform integration and publishing

## 🚀 Usage

```python
from services.blog_writer.blog_service import BlogWriterService

# Initialize the service
service = BlogWriterService()

# Research a topic
research_result = await service.research(research_request)

# Generate outline from research
outline_result = await service.generate_outline(outline_request)

# Enhance sections
enhanced_section = await service.enhance_section_with_ai(section, "SEO optimization")
```

## 🎯 Key Benefits

### 1. **Modularity**
- Each module has a single responsibility
- Easy to test, maintain, and extend
- Clear separation of concerns

### 2. **Reusability**
- Components can be used independently
- Easy to swap implementations
- Shared utilities and helpers

### 3. **Scalability**
- New features can be added as separate modules
- Existing modules can be enhanced without affecting others
- Clear interfaces between modules

### 4. **Maintainability**
- Smaller, focused files are easier to understand
- Changes are isolated to specific modules
- Clear dependency relationships

## 🔧 Development Guidelines

### Adding New Features
1. Identify the appropriate module (research, outline, content, optimization)
2. Create new classes following the existing patterns
3. Update the module's `__init__.py` to export the new classes
4. Add methods to the appropriate service orchestrator
5. Update the main `BlogWriterService` if needed

### Testing
- Each module should have its own test suite
- Mock external dependencies (AI providers, APIs)
- Test both success and failure scenarios
- Maintain high test coverage

### Error Handling
- Use graceful degradation with fallbacks
- Log errors appropriately
- Return meaningful error messages to users
- Don't let one module's failure break the entire flow

## 📈 Future Enhancements

### Content Module (`content/`)
- Section content generation
- Content optimization and refinement
- Multi-format output (HTML, Markdown, etc.)

### Optimization Module (`optimization/`)
- SEO analysis and recommendations
- Readability optimization
- Performance metrics and analytics

### Integration Module (`integration/`)
- Platform-specific adapters (WordPress, Wix, etc.)
- Publishing workflows
- Content management system integration

## 🔍 Code Quality

- **Type Hints**: All methods use proper type annotations
- **Documentation**: Comprehensive docstrings for all public methods
- **Error Handling**: Graceful failure with meaningful error messages
- **Logging**: Structured logging with appropriate levels
- **Testing**: Unit tests for all major functionality
- **Performance**: Efficient caching and API usage

## 📝 Migration Notes

The original `blog_service.py` has been refactored into this modular structure:
- **Research functionality** → `research/` module
- **Outline generation** → `outline/` module
- **Service orchestration** → `core/` module
- **Main entry point** → `blog_service.py` (now just imports from core)

All existing API endpoints continue to work without changes due to the maintained interface in `BlogWriterService`.
11
backend/services/blog_writer/blog_service.py
Normal file
@@ -0,0 +1,11 @@
"""
AI Blog Writer Service - Main entry point for blog writing functionality.

This module provides a clean interface to the modular blog writer services.
The actual implementation has been refactored into specialized modules:
- research/ - Research and keyword analysis
- outline/  - Outline generation and optimization
- core/     - Main service orchestrator
"""

from .core import BlogWriterService
209
backend/services/blog_writer/circuit_breaker.py
Normal file
@@ -0,0 +1,209 @@
"""
Circuit Breaker Pattern for Blog Writer API Calls

Implements the circuit breaker pattern to prevent cascading failures when external APIs
are experiencing issues. Tracks failure rates and automatically disables calls when a
threshold is exceeded, with auto-recovery after a cooldown period.
"""

import time
import asyncio
from typing import Callable, Any, Optional, Dict
from enum import Enum
from dataclasses import dataclass
from loguru import logger

from .exceptions import CircuitBreakerOpenException


class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Circuit is open, calls are blocked
    HALF_OPEN = "half_open"  # Testing if service is back


@dataclass
class CircuitBreakerConfig:
    """Configuration for circuit breaker."""
    failure_threshold: int = 5         # Number of failures before opening
    recovery_timeout: int = 60         # Seconds to wait before trying again
    success_threshold: int = 3         # Successes needed to close from half-open
    timeout: int = 30                  # Timeout for individual calls (seconds)
    max_failures_per_minute: int = 10  # Max failures per minute before opening


class CircuitBreaker:
    """Circuit breaker implementation for API calls."""

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0
        self.last_success_time = 0
        self.failure_times = []  # Track failure times for rate limiting
        self._lock = asyncio.Lock()

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """
        Execute function with circuit breaker protection.

        Args:
            func: Function to execute
            *args: Function arguments
            **kwargs: Function keyword arguments

        Returns:
            Function result

        Raises:
            CircuitBreakerOpenException: If circuit is open
        """
        async with self._lock:
            # Check if circuit should be opened due to rate limiting
            await self._check_rate_limit()

            # Check circuit state
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    logger.info(f"Circuit breaker {self.name} transitioning to HALF_OPEN")
                else:
                    retry_after = int(self.config.recovery_timeout - (time.time() - self.last_failure_time))
                    raise CircuitBreakerOpenException(
                        f"Circuit breaker {self.name} is OPEN",
                        retry_after=max(0, retry_after),
                        context={"circuit_name": self.name, "state": self.state.value}
                    )

        # Execute outside the state lock so _record_success/_record_failure can
        # re-acquire it (asyncio.Lock is not reentrant).
        try:
            # Execute the function with timeout
            result = await asyncio.wait_for(
                func(*args, **kwargs),
                timeout=self.config.timeout
            )

            # Record success
            await self._record_success()
            return result

        except asyncio.TimeoutError:
            await self._record_failure("timeout")
            raise
        except Exception as e:
            await self._record_failure(str(e))
            raise

    async def _check_rate_limit(self):
        """Check if failure rate exceeds threshold."""
        current_time = time.time()

        # Remove failures older than 1 minute
        self.failure_times = [
            failure_time for failure_time in self.failure_times
            if current_time - failure_time < 60
        ]

        # Check if we've exceeded the rate limit
        if len(self.failure_times) >= self.config.max_failures_per_minute:
            self.state = CircuitState.OPEN
            self.last_failure_time = current_time
            logger.warning(f"Circuit breaker {self.name} opened due to rate limit: {len(self.failure_times)} failures in last minute")

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to attempt reset."""
        return time.time() - self.last_failure_time >= self.config.recovery_timeout

    async def _record_success(self):
        """Record a successful call."""
        async with self._lock:
            self.last_success_time = time.time()

            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    logger.info(f"Circuit breaker {self.name} closed after {self.success_count} successes")
            elif self.state == CircuitState.CLOSED:
                # Reset failure count on success
                self.failure_count = 0

    async def _record_failure(self, error: str):
        """Record a failed call."""
        async with self._lock:
            current_time = time.time()
            self.failure_count += 1
            self.last_failure_time = current_time
            self.failure_times.append(current_time)

            logger.warning(f"Circuit breaker {self.name} recorded failure #{self.failure_count}: {error}")

            # Open circuit if threshold exceeded
            if self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN
                logger.error(f"Circuit breaker {self.name} opened after {self.failure_count} failures")

    def get_state(self) -> Dict[str, Any]:
        """Get current circuit breaker state."""
        return {
            "name": self.name,
            "state": self.state.value,
            "failure_count": self.failure_count,
            "success_count": self.success_count,
            "last_failure_time": self.last_failure_time,
            "last_success_time": self.last_success_time,
            "failures_in_last_minute": len([
                t for t in self.failure_times
                if time.time() - t < 60
            ])
        }


class CircuitBreakerManager:
    """Manages multiple circuit breakers."""

    def __init__(self):
        self._breakers: Dict[str, CircuitBreaker] = {}

    def get_breaker(self, name: str, config: Optional[CircuitBreakerConfig] = None) -> CircuitBreaker:
        """Get or create a circuit breaker."""
        if name not in self._breakers:
            self._breakers[name] = CircuitBreaker(name, config)
        return self._breakers[name]

    def get_all_states(self) -> Dict[str, Dict[str, Any]]:
        """Get states of all circuit breakers."""
        return {name: breaker.get_state() for name, breaker in self._breakers.items()}

    def reset_breaker(self, name: str):
        """Reset a circuit breaker to closed state."""
        if name in self._breakers:
            self._breakers[name].state = CircuitState.CLOSED
            self._breakers[name].failure_count = 0
            self._breakers[name].success_count = 0
            logger.info(f"Circuit breaker {name} manually reset")


# Global circuit breaker manager
circuit_breaker_manager = CircuitBreakerManager()


def circuit_breaker(name: str, config: Optional[CircuitBreakerConfig] = None):
    """
    Decorator to add circuit breaker protection to async functions.

    Args:
        name: Circuit breaker name
        config: Circuit breaker configuration
    """
    def decorator(func: Callable) -> Callable:
        async def wrapper(*args, **kwargs):
            breaker = circuit_breaker_manager.get_breaker(name, config)
            return await breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator
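A minimal usage sketch of the decorator above. The wrapped function `call_gemini` and its import path are hypothetical stand-ins for whichever outbound API call the blog writer protects; only the decorator and config come from this module.

```python
# Hypothetical example: protect an outbound LLM call with the circuit breaker above.
from services.blog_writer.circuit_breaker import circuit_breaker, CircuitBreakerConfig

@circuit_breaker("gemini_api", CircuitBreakerConfig(failure_threshold=3, recovery_timeout=30))
async def call_gemini(prompt: str) -> str:
    # Placeholder body - the real provider call would go here.
    ...

# Callers simply `await call_gemini(...)`. After three consecutive failures the
# breaker opens and raises CircuitBreakerOpenException immediately for ~30 seconds,
# then transitions to HALF_OPEN and probes the service again.
```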
209
backend/services/blog_writer/content/blog_rewriter.py
Normal file
@@ -0,0 +1,209 @@
"""
Blog Rewriter Service

Handles blog rewriting based on user feedback using structured AI calls.
"""

import time
import uuid
from typing import Dict, Any
from loguru import logger

from services.llm_providers.gemini_provider import gemini_structured_json_response


class BlogRewriter:
    """Service for rewriting blog content based on user feedback."""

    def __init__(self, task_manager):
        self.task_manager = task_manager

    def start_blog_rewrite(self, request: Dict[str, Any]) -> str:
        """Start blog rewrite task with user feedback."""
        try:
            # Extract request data
            title = request.get("title", "Untitled Blog")
            sections = request.get("sections", [])
            research = request.get("research", {})
            outline = request.get("outline", [])
            feedback = request.get("feedback", "")
            tone = request.get("tone")
            audience = request.get("audience")
            focus = request.get("focus")

            if not sections:
                raise ValueError("No sections provided for rewrite")

            if not feedback or len(feedback.strip()) < 10:
                raise ValueError("Feedback is required and must be at least 10 characters")

            # Create task for rewrite
            task_id = f"rewrite_{int(time.time())}_{uuid.uuid4().hex[:8]}"

            # Start the rewrite task
            self.task_manager.start_task(
                task_id,
                self._execute_blog_rewrite,
                title=title,
                sections=sections,
                research=research,
                outline=outline,
                feedback=feedback,
                tone=tone,
                audience=audience,
                focus=focus
            )

            logger.info(f"Blog rewrite task started: {task_id}")
            return task_id

        except Exception as e:
            logger.error(f"Failed to start blog rewrite: {e}")
            raise

    async def _execute_blog_rewrite(self, task_id: str, **kwargs):
        """Execute the blog rewrite task."""
        try:
            title = kwargs.get("title", "Untitled Blog")
            sections = kwargs.get("sections", [])
            research = kwargs.get("research", {})
            outline = kwargs.get("outline", [])
            feedback = kwargs.get("feedback", "")
            tone = kwargs.get("tone")
            audience = kwargs.get("audience")
            focus = kwargs.get("focus")

            # Update task status
            self.task_manager.update_task_status(task_id, "processing", "Analyzing current content and feedback...")

            # Build rewrite prompt with user feedback
            system_prompt = f"""You are an expert blog writer tasked with rewriting content based on user feedback.

Current Blog Title: {title}
User Feedback: {feedback}
{f"Desired Tone: {tone}" if tone else ""}
{f"Target Audience: {audience}" if audience else ""}
{f"Focus Area: {focus}" if focus else ""}

Your task is to rewrite the blog content to address the user's feedback while maintaining the core structure and research insights."""

            # Prepare content for rewrite
            full_content = f"Title: {title}\n\n"
            for section in sections:
                full_content += f"Section: {section.get('heading', 'Untitled')}\n"
                full_content += f"Content: {section.get('content', '')}\n\n"

            # Create rewrite prompt
            rewrite_prompt = f"""
Based on the user feedback and current blog content, rewrite the blog to address their concerns and preferences.

Current Content:
{full_content}

User Feedback: {feedback}
{f"Desired Tone: {tone}" if tone else ""}
{f"Target Audience: {audience}" if audience else ""}
{f"Focus Area: {focus}" if focus else ""}

Please rewrite the blog content in the following JSON format:
{{
    "title": "New or improved blog title",
    "sections": [
        {{
            "id": "section_id",
            "heading": "Section heading",
            "content": "Rewritten section content"
        }}
    ]
}}

Guidelines:
1. Address the user's feedback directly
2. Maintain the research insights and factual accuracy
3. Improve flow, clarity, and engagement
4. Keep the same section structure unless feedback suggests otherwise
5. Ensure content is well-formatted with proper paragraphs
"""

            # Update task status
            self.task_manager.update_task_status(task_id, "processing", "Generating rewritten content...")

            # Use structured JSON generation
            schema = {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "sections": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "id": {"type": "string"},
                                "heading": {"type": "string"},
                                "content": {"type": "string"}
                            }
                        }
                    }
                }
            }

            result = gemini_structured_json_response(
                prompt=rewrite_prompt,
                schema=schema,
                temperature=0.7,
                max_tokens=4096,
                system_prompt=system_prompt
            )

            logger.info(f"Gemini response for rewrite task {task_id}: {result}")

            # Check if we have a valid result - handle both multi-section and single-section formats
            is_valid_multi_section = result and not result.get("error") and result.get("title") and result.get("sections")
            is_valid_single_section = result and not result.get("error") and (result.get("heading") or result.get("title")) and result.get("content")

            if is_valid_multi_section or is_valid_single_section:
                # If single-section format, convert to multi-section format for consistency
                if is_valid_single_section and not is_valid_multi_section:
                    converted_result = {
                        "title": result.get("heading") or result.get("title") or "Rewritten Blog",
                        "sections": [
                            {
                                "id": result.get("id") or "section_1",
                                "heading": result.get("heading") or "Main Content",
                                "content": result.get("content", "")
                            }
                        ]
                    }
                    result = converted_result
                    logger.info(f"Converted single-section response to multi-section format for task {task_id}")

                # Update task status with success
                self.task_manager.update_task_status(
                    task_id,
                    "completed",
                    "Blog rewrite completed successfully!",
                    result=result
                )
                logger.info(f"Blog rewrite completed successfully: {task_id}")
            else:
                # More detailed error handling
                if not result:
                    error_msg = "No response from AI"
                elif result.get("error"):
                    error_msg = f"AI error: {result.get('error')}"
                elif not (result.get("title") or result.get("heading")):
                    error_msg = "AI response missing title/heading"
                elif not (result.get("sections") or result.get("content")):
                    error_msg = "AI response missing sections/content"
                else:
                    error_msg = "AI response has invalid structure"

                self.task_manager.update_task_status(task_id, "failed", f"Rewrite failed: {error_msg}")
                logger.error(f"Blog rewrite failed: {error_msg}")

        except Exception as e:
            error_msg = f"Blog rewrite error: {str(e)}"
            self.task_manager.update_task_status(task_id, "failed", error_msg)
            logger.error(f"Blog rewrite task failed: {e}")
            raise
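A hedged sketch of how a caller might drive `BlogRewriter`. The `task_manager` object is an assumption: whatever task-tracking component the service normally injects, expected to expose `start_task` and `update_task_status` as used above.

```python
# Hypothetical wiring; task_manager is assumed to provide start_task/update_task_status.
rewriter = BlogRewriter(task_manager)

task_id = rewriter.start_blog_rewrite({
    "title": "My Draft Post",
    "sections": [{"id": "s1", "heading": "Intro", "content": "Original intro text..."}],
    "feedback": "Make the tone more conversational and shorten the introduction.",
    "tone": "conversational",
})
# Progress and the rewritten sections are then read back through the task manager.
```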
152
backend/services/blog_writer/content/context_memory.py
Normal file
@@ -0,0 +1,152 @@
"""
ContextMemory - maintains intelligent continuity context across sections using LLM-enhanced summarization.

Stores smart per-section summaries and thread keywords for use in prompts, with cost optimization.
"""

from __future__ import annotations

from typing import Dict, List, Optional, Tuple
from collections import deque
from loguru import logger
import hashlib

# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_text_response


class ContextMemory:
    """In-memory continuity store for recent sections with LLM-enhanced summarization.

    Notes:
    - Keeps an ordered deque of recent (section_id, summary) pairs
    - Uses the LLM for intelligent summarization when content is substantial
    - Provides utilities to build a compact previous-sections summary
    - Implements caching to minimize LLM calls
    """

    def __init__(self, max_entries: int = 10):
        self.max_entries = max_entries
        self._recent: deque[Tuple[str, str]] = deque(maxlen=max_entries)
        # Cache for LLM-generated summaries
        self._summary_cache: Dict[str, str] = {}
        logger.info("✅ ContextMemory initialized with LLM-enhanced summarization")

    def update_with_section(self, section_id: str, full_text: str, use_llm: bool = True) -> None:
        """Create a compact summary and store it for continuity usage."""
        summary = self._summarize_text_intelligently(full_text, use_llm=use_llm)
        self._recent.append((section_id, summary))

    def get_recent_summaries(self, limit: int = 2) -> List[str]:
        """Return the last N stored summaries, in chronological order."""
        return [s for (_sid, s) in list(self._recent)[-limit:]]

    def build_previous_sections_summary(self, limit: int = 2) -> str:
        """Join recent summaries for prompt injection."""
        recents = self.get_recent_summaries(limit=limit)
        if not recents:
            return ""
        return "\n\n".join(recents)

    def _summarize_text_intelligently(self, text: str, target_words: int = 80, use_llm: bool = True) -> str:
        """Create an intelligent summary using the LLM when appropriate, falling back to truncation."""

        # Create cache key
        cache_key = self._get_cache_key(text)

        # Check cache first
        if cache_key in self._summary_cache:
            logger.debug("Summary cache hit")
            return self._summary_cache[cache_key]

        # Determine if we should use the LLM
        should_use_llm = use_llm and self._should_use_llm_summarization(text)

        if should_use_llm:
            try:
                summary = self._llm_summarize_text(text, target_words)
                self._summary_cache[cache_key] = summary
                logger.info("LLM-based summarization completed")
                return summary
            except Exception as e:
                logger.warning(f"LLM summarization failed, using fallback: {e}")
                # Fall through to local summarization

        # Local fallback
        summary = self._summarize_text_locally(text, target_words)
        self._summary_cache[cache_key] = summary
        return summary

    def _should_use_llm_summarization(self, text: str) -> bool:
        """Determine if content is substantial enough to warrant LLM summarization."""
        word_count = len(text.split())
        # Use the LLM for substantial content (>150 words) or complex structure
        has_complex_structure = any(marker in text for marker in ['##', '###', '**', '*', '-', '1.', '2.'])

        return word_count > 150 or has_complex_structure

    def _llm_summarize_text(self, text: str, target_words: int = 80) -> str:
        """Use the Gemini API for intelligent text summarization."""

        # Truncate text to minimize tokens while keeping key content
        truncated_text = text[:800]  # First 800 chars usually contain the main points

        prompt = f"""
Summarize the following content in approximately {target_words} words, focusing on key concepts and main points.

Content: {truncated_text}

Requirements:
- Capture the main ideas and key concepts
- Maintain the original tone and style
- Keep it concise but informative
- Focus on what's most important for continuity

Generate only the summary, no explanations or formatting.
"""

        try:
            result = gemini_text_response(
                prompt=prompt,
                temperature=0.3,  # Low temperature for consistent summarization
                max_tokens=500,   # Increased tokens for better summaries
                system_prompt="You are an expert at creating concise, informative summaries."
            )

            if result and result.strip():
                summary = result.strip()
                # Ensure it's not too long
                words = summary.split()
                if len(words) > target_words + 20:  # Allow some flexibility
                    summary = " ".join(words[:target_words]) + "..."
                return summary
            else:
                logger.warning("LLM summary response empty, using fallback")
                return self._summarize_text_locally(text, target_words)

        except Exception as e:
            logger.error(f"LLM summarization error: {e}")
            return self._summarize_text_locally(text, target_words)

    def _summarize_text_locally(self, text: str, target_words: int = 80) -> str:
        """Very lightweight, deterministic truncation-based summary.

        This deliberately avoids extra LLM calls: it keeps the first words of the
        text up to approximately target_words.
        """
        words = text.split()
        if len(words) <= target_words:
            return text.strip()
        return " ".join(words[:target_words]).strip() + " …"

    def _get_cache_key(self, text: str) -> str:
        """Generate cache key from text hash."""
        # Use first 200 chars for the cache key to balance uniqueness vs memory
        return hashlib.md5(text[:200].encode()).hexdigest()[:12]

    def clear_cache(self):
        """Clear summary cache (useful for testing or memory management)."""
        self._summary_cache.clear()
        logger.info("ContextMemory cache cleared")
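A small usage sketch of the memory store above; the section text is invented purely for illustration, and `use_llm=False` keeps it on the cheap local-truncation path.

```python
memory = ContextMemory(max_entries=12)

# After each section is generated, feed it back into the memory...
memory.update_with_section("intro", "Blog introductions set expectations for the reader...", use_llm=False)
memory.update_with_section("body_1", "The first body section argues that modular services...", use_llm=False)

# ...and inject a compact recap of the last two sections into the next prompt.
previous_context = memory.build_previous_sections_summary(limit=2)
```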
@@ -0,0 +1,92 @@
"""
EnhancedContentGenerator - thin orchestrator for section generation.

Provider parity:
- Uses main_text_generation.llm_text_gen to respect GPT_PROVIDER (Gemini/HF)
- No direct provider coupling here; Google grounding remains in research only
"""

from typing import Any, Dict

from loguru import logger

from services.llm_providers.main_text_generation import llm_text_gen
from .source_url_manager import SourceURLManager
from .context_memory import ContextMemory
from .transition_generator import TransitionGenerator
from .flow_analyzer import FlowAnalyzer


class EnhancedContentGenerator:
    def __init__(self):
        self.url_manager = SourceURLManager()
        self.memory = ContextMemory(max_entries=12)
        self.transitioner = TransitionGenerator()
        self.flow = FlowAnalyzer()

    async def generate_section(self, section: Any, research: Any, mode: str = "polished") -> Dict[str, Any]:
        prev_summary = self.memory.build_previous_sections_summary(limit=2)
        urls = self.url_manager.pick_relevant_urls(section, research)
        prompt = self._build_prompt(section, research, prev_summary, urls)

        # Provider-agnostic text generation (respects GPT_PROVIDER & circuit breaker)
        content_text: str = ""
        try:
            ai_resp = llm_text_gen(
                prompt=prompt,
                json_struct=None,
                system_prompt=None,
            )
            if isinstance(ai_resp, dict) and ai_resp.get("text"):
                content_text = ai_resp.get("text", "")
            elif isinstance(ai_resp, str):
                content_text = ai_resp
            else:
                # Fallback best-effort extraction
                content_text = str(ai_resp or "")
        except Exception as e:
            logger.error(f"Section generation failed for '{getattr(section, 'heading', 'Section')}': {e}")
            content_text = ""

        result = {
            "content": content_text,
            "sources": [{"title": u.get("title", ""), "url": u.get("url", "")} for u in urls] if urls else [],
        }

        # Generate transition and compute intelligent flow metrics
        previous_text = prev_summary
        current_text = result.get("content", "")
        transition = self.transitioner.generate_transition(previous_text, getattr(section, 'heading', 'This section'), use_llm=True)
        metrics = self.flow.assess_flow(previous_text, current_text, use_llm=True)

        # Update memory for subsequent sections and store continuity snapshot
        if current_text:
            self.memory.update_with_section(getattr(section, 'id', 'unknown'), current_text, use_llm=True)

        # Return enriched result
        result["transition"] = transition
        result["continuity_metrics"] = metrics
        # Persist a lightweight continuity snapshot for API access
        try:
            sid = getattr(section, 'id', 'unknown')
            if not hasattr(self, "_last_continuity"):
                self._last_continuity = {}
            self._last_continuity[sid] = metrics
        except Exception:
            pass
        return result

    def _build_prompt(self, section: Any, research: Any, prev_summary: str, urls: list) -> str:
        heading = getattr(section, 'heading', 'Section')
        key_points = getattr(section, 'key_points', [])
        keywords = getattr(section, 'keywords', [])
        target_words = getattr(section, 'target_words', 300)
        url_block = "\n".join([f"- {u.get('title', '')} ({u.get('url', '')})" for u in urls]) if urls else "(no specific URLs provided)"

        return (
            f"You are writing the blog section '{heading}'.\n\n"
            f"Context summary (previous sections): {prev_summary}\n\n"
            f"Authoring requirements:\n"
            f"- Target word count: ~{target_words}\n"
            f"- Use the following key points: {', '.join(key_points)}\n"
            f"- Include these keywords naturally: {', '.join(keywords)}\n"
            f"- Cite insights from these sources when relevant (do not output raw URLs):\n{url_block}\n\n"
            "Write engaging, well-structured markdown with clear paragraphs (2-4 sentences each) separated by double line breaks."
        )
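A hedged example of driving the orchestrator above from an async context. The `SimpleNamespace` section is a stand-in for the real outline section model, and `research=None` simply exercises the "no sources" path; none of these objects are part of the module itself.

```python
import asyncio
from types import SimpleNamespace

# Hypothetical stand-in for the real outline section model.
section = SimpleNamespace(
    id="s1",
    heading="Why modular services matter",
    key_points=["single responsibility", "easier testing"],
    keywords=["modular architecture"],
    target_words=300,
)

async def main():
    generator = EnhancedContentGenerator()
    result = await generator.generate_section(section, research=None)
    print(result["content"][:200], result["continuity_metrics"])

# asyncio.run(main())
```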
162
backend/services/blog_writer/content/flow_analyzer.py
Normal file
@@ -0,0 +1,162 @@
"""
FlowAnalyzer - evaluates narrative flow using LLM-based analysis with cost optimization.

Uses the Gemini API for intelligent analysis while minimizing API calls through caching and smart triggers.
"""

from typing import Dict, Optional
from loguru import logger
import hashlib
import json

# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_structured_json_response


class FlowAnalyzer:
    def __init__(self):
        # Simple in-memory cache to avoid redundant LLM calls
        self._cache: Dict[str, Dict[str, float]] = {}
        # Cache for rule-based fallback when LLM analysis isn't needed
        self._rule_cache: Dict[str, Dict[str, float]] = {}
        logger.info("✅ FlowAnalyzer initialized with LLM-based analysis")

    def assess_flow(self, previous_text: str, current_text: str, use_llm: bool = True) -> Dict[str, float]:
        """
        Return flow metrics in the range 0..1.

        Args:
            previous_text: Previous section content
            current_text: Current section content
            use_llm: Whether to use LLM analysis (default: True for significant content)
        """
        if not current_text:
            return {"flow": 0.0, "consistency": 0.0, "progression": 0.0}

        # Create cache key from content hashes
        cache_key = self._get_cache_key(previous_text, current_text)

        # Check cache first
        if cache_key in self._cache:
            logger.debug("Flow analysis cache hit")
            return self._cache[cache_key]

        # Determine if we should use LLM analysis
        should_use_llm = use_llm and self._should_use_llm_analysis(previous_text, current_text)

        if should_use_llm:
            try:
                metrics = self._llm_flow_analysis(previous_text, current_text)
                self._cache[cache_key] = metrics
                logger.info("LLM-based flow analysis completed")
                return metrics
            except Exception as e:
                logger.warning(f"LLM flow analysis failed, falling back to rules: {e}")
                # Fall through to rule-based analysis

        # Rule-based fallback (cached separately)
        if cache_key in self._rule_cache:
            return self._rule_cache[cache_key]

        metrics = self._rule_based_analysis(previous_text, current_text)
        self._rule_cache[cache_key] = metrics
        return metrics

    def _should_use_llm_analysis(self, previous_text: str, current_text: str) -> bool:
        """Determine if content is significant enough to warrant LLM analysis."""
        # Use the LLM for substantial content or when previous context exists
        word_count = len(current_text.split())
        has_previous = bool(previous_text and len(previous_text.strip()) > 50)

        # Use LLM if: substantial content (>100 words) OR meaningful previous context
        return word_count > 100 or has_previous

    def _llm_flow_analysis(self, previous_text: str, current_text: str) -> Dict[str, float]:
        """Use the Gemini API for intelligent flow analysis."""

        # Truncate content to minimize tokens while keeping context
        prev_truncated = previous_text[-300:] if previous_text else ""
        curr_truncated = current_text[:500]  # First 500 chars usually contain the key content

        prompt = f"""
Analyze the narrative flow between these two content sections. Rate each aspect from 0.0 to 1.0.

PREVIOUS SECTION (end): {prev_truncated}
CURRENT SECTION (start): {curr_truncated}

Evaluate:
1. Flow Quality (0.0-1.0): How smoothly does the content transition? Are there logical connections?
2. Consistency (0.0-1.0): Do key themes, terminology, and tone remain consistent?
3. Progression (0.0-1.0): Does the content logically build upon previous ideas?

Return ONLY a JSON object with these exact keys: flow, consistency, progression
"""

        schema = {
            "type": "object",
            "properties": {
                "flow": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                "consistency": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                "progression": {"type": "number", "minimum": 0.0, "maximum": 1.0}
            },
            "required": ["flow", "consistency", "progression"]
        }

        try:
            result = gemini_structured_json_response(
                prompt=prompt,
                schema=schema,
                temperature=0.2,  # Low temperature for consistent scoring
                max_tokens=1000   # Increased tokens for better analysis
            )

            # gemini_structured_json_response returns a plain dict elsewhere in this
            # service, so validate the dict directly instead of a `.parsed` attribute.
            if result and not result.get("error"):
                return {
                    "flow": float(result.get("flow", 0.6)),
                    "consistency": float(result.get("consistency", 0.6)),
                    "progression": float(result.get("progression", 0.6))
                }
            else:
                logger.warning("LLM response parsing failed, using fallback")
                return self._rule_based_analysis(previous_text, current_text)

        except Exception as e:
            logger.error(f"LLM flow analysis error: {e}")
            return self._rule_based_analysis(previous_text, current_text)

    def _rule_based_analysis(self, previous_text: str, current_text: str) -> Dict[str, float]:
        """Fallback rule-based analysis for cost efficiency."""
        flow = 0.6
        consistency = 0.6
        progression = 0.6

        # Enhanced heuristics
        if previous_text and previous_text[-1] in ".!?":
            flow += 0.1
        if any(k in current_text.lower() for k in ["therefore", "next", "building on", "as a result", "furthermore", "additionally"]):
            progression += 0.2
        if len(current_text.split()) > 120:
            consistency += 0.1
        if any(k in current_text.lower() for k in ["however", "but", "although", "despite"]):
            flow += 0.1  # Good use of contrast words

        return {
            "flow": min(flow, 1.0),
            "consistency": min(consistency, 1.0),
            "progression": min(progression, 1.0),
        }

    def _get_cache_key(self, previous_text: str, current_text: str) -> str:
        """Generate cache key from content hashes."""
        # Use first 100 chars of each for the cache key to balance uniqueness vs memory
        prev_hash = hashlib.md5((previous_text[:100] if previous_text else "").encode()).hexdigest()[:8]
        curr_hash = hashlib.md5(current_text[:100].encode()).hexdigest()[:8]
        return f"{prev_hash}_{curr_hash}"

    def clear_cache(self):
        """Clear analysis cache (useful for testing or memory management)."""
        self._cache.clear()
        self._rule_cache.clear()
        logger.info("FlowAnalyzer cache cleared")
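A short, hedged illustration of the analyzer's two paths, using invented snippets. With `use_llm=False` only the rule-based heuristics run; with the default `True`, substantial content is scored by Gemini and cached by content hash.

```python
analyzer = FlowAnalyzer()

prev = "We closed the last section by outlining the research pipeline."
curr = "Building on that pipeline, the outline phase turns raw findings into structure."

scores = analyzer.assess_flow(prev, curr, use_llm=False)
print(scores)  # e.g. {"flow": 0.7, "consistency": 0.6, "progression": 0.8}
```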
186
backend/services/blog_writer/content/introduction_generator.py
Normal file
@@ -0,0 +1,186 @@
"""
Introduction Generator - Generates varied blog introductions based on content and research.

Generates 3 different introduction options for the user to choose from.
"""

from typing import Dict, Any, List
from loguru import logger

from models.blog_models import BlogResearchResponse, BlogOutlineSection


class IntroductionGenerator:
    """Generates blog introductions using research and content data."""

    def __init__(self):
        """Initialize the introduction generator."""
        pass

    def build_introduction_prompt(
        self,
        blog_title: str,
        research: BlogResearchResponse,
        outline: List[BlogOutlineSection],
        sections_content: Dict[str, str],
        primary_keywords: List[str],
        search_intent: str
    ) -> str:
        """Build a prompt for generating blog introductions."""

        # Extract key research insights
        keyword_analysis = research.keyword_analysis or {}
        content_angles = research.suggested_angles or []

        # Get a summary of the first few sections for context
        section_summaries = []
        for i, section in enumerate(outline[:3], 1):
            section_id = section.id
            content = sections_content.get(section_id, '')
            if content:
                # Take first 200 chars as summary
                summary = content[:200] + '...' if len(content) > 200 else content
                section_summaries.append(f"{i}. {section.heading}: {summary}")

        sections_text = '\n'.join(section_summaries) if section_summaries else "Content sections are being generated."

        primary_kw_text = ', '.join(primary_keywords) if primary_keywords else "the topic"
        content_angle_text = ', '.join(content_angles[:3]) if content_angles else "General insights"

        return f"""Generate exactly 3 varied blog introductions for the following blog post.

BLOG TITLE: {blog_title}

PRIMARY KEYWORDS: {primary_kw_text}
SEARCH INTENT: {search_intent}
CONTENT ANGLES: {content_angle_text}

BLOG CONTENT SUMMARY:
{sections_text}

REQUIREMENTS FOR EACH INTRODUCTION:
- 80-120 words in length
- Hook the reader immediately with a compelling opening
- Clearly state the value proposition and what readers will learn
- Include the primary keyword naturally within the first 2 sentences
- Each introduction should have a different angle/approach:
1. First: Problem-focused (highlight the challenge readers face)
2. Second: Benefit-focused (emphasize the value and outcomes)
3. Third: Story/statistic-focused (use a compelling fact or narrative hook)
- Maintain a professional yet engaging tone
- Avoid generic phrases - be specific and benefit-driven

Return ONLY a JSON array of exactly 3 introductions:
[
  "First introduction (80-120 words, problem-focused)",
  "Second introduction (80-120 words, benefit-focused)",
  "Third introduction (80-120 words, story/statistic-focused)"
]"""

    def get_introduction_schema(self) -> Dict[str, Any]:
        """Get the JSON schema for introduction generation."""
        return {
            "type": "array",
            "items": {
                "type": "string",
                # Character bounds chosen to roughly cover the 80-120 word requirement.
                "minLength": 300,
                "maxLength": 1000
            },
            "minItems": 3,
            "maxItems": 3
        }

    async def generate_introductions(
        self,
        blog_title: str,
        research: BlogResearchResponse,
        outline: List[BlogOutlineSection],
        sections_content: Dict[str, str],
        primary_keywords: List[str],
        search_intent: str,
        user_id: str
    ) -> List[str]:
        """Generate 3 varied blog introductions.

        Args:
            blog_title: The blog post title
            research: Research data with keywords and insights
            outline: Blog outline sections
            sections_content: Dictionary mapping section IDs to their content
            primary_keywords: Primary keywords for the blog
            search_intent: Search intent (informational, commercial, etc.)
            user_id: User ID for API calls

        Returns:
            List of 3 introduction options
        """
        from services.llm_providers.main_text_generation import llm_text_gen

        if not user_id:
            raise ValueError("user_id is required for introduction generation")

        # Build prompt
        prompt = self.build_introduction_prompt(
            blog_title=blog_title,
            research=research,
            outline=outline,
            sections_content=sections_content,
            primary_keywords=primary_keywords,
            search_intent=search_intent
        )

        # Get schema
        schema = self.get_introduction_schema()

        logger.info(f"Generating blog introductions for user {user_id}")

        try:
            # Generate introductions using structured JSON response
            result = llm_text_gen(
                prompt=prompt,
                json_struct=schema,
                system_prompt="You are an expert content writer specializing in creating compelling blog introductions that hook readers and clearly communicate value.",
                user_id=user_id
            )

            # Handle response - could be an array directly or wrapped in a dict
            if isinstance(result, list):
                introductions = result
            elif isinstance(result, dict):
                # Try common keys
                introductions = result.get('introductions', result.get('options', result.get('intros', [])))
                if not introductions and isinstance(result.get('response'), list):
                    introductions = result['response']
            else:
                logger.warning(f"Unexpected introduction generation result type: {type(result)}")
                introductions = []

            # Validate and clean introductions
            cleaned_introductions = []
            for intro in introductions:
                if isinstance(intro, str) and len(intro.strip()) >= 50:  # Minimum reasonable length
                    cleaned = intro.strip()
                    # Allow slight overflow over the 120-word target for quality
                    if len(cleaned.split()) <= 160:
                        cleaned_introductions.append(cleaned)

            # Ensure we have exactly 3 introductions
            if len(cleaned_introductions) < 3:
                logger.warning(f"Generated only {len(cleaned_introductions)} introductions, expected 3")
                # Pad with placeholder if needed
                while len(cleaned_introductions) < 3:
                    cleaned_introductions.append(f"{blog_title} - A comprehensive guide covering essential insights and practical strategies.")

            # Return exactly 3 introductions
            return cleaned_introductions[:3]

        except Exception as e:
            logger.error(f"Failed to generate introductions: {e}")
            # Fallback: generate simple introductions
            fallback_introductions = [
                f"In this comprehensive guide, we'll explore {primary_keywords[0] if primary_keywords else 'essential insights'} and provide actionable strategies.",
                f"Discover everything you need to know about {primary_keywords[0] if primary_keywords else 'this topic'} and how it can transform your approach.",
                f"Whether you're new to {primary_keywords[0] if primary_keywords else 'this topic'} or looking to deepen your understanding, this guide has you covered."
            ]
            return fallback_introductions
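A hedged call sketch for the generator above. The `research` and `outline` objects are whichever models the pipeline already has in hand; the title, keywords, and user ID are invented for illustration.

```python
# Inside an async pipeline step, assuming `research` and `outline` are already built:
intro_gen = IntroductionGenerator()
intros = await intro_gen.generate_introductions(
    blog_title="How Modular AI Services Scale",
    research=research,
    outline=outline,
    sections_content={"s1": "First section draft..."},
    primary_keywords=["modular AI services"],
    search_intent="informational",
    user_id="user_123",
)
# Always returns exactly three options; on failure it falls back to simple templates.
```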
257
backend/services/blog_writer/content/medium_blog_generator.py
Normal file
@@ -0,0 +1,257 @@
|
||||
"""
|
||||
Medium Blog Generator Service
|
||||
|
||||
Handles generation of medium-length blogs (≤1000 words) using structured AI calls.
|
||||
"""
|
||||
|
||||
import time
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
from loguru import logger
|
||||
from fastapi import HTTPException
|
||||
|
||||
from models.blog_models import (
|
||||
MediumBlogGenerateRequest,
|
||||
MediumBlogGenerateResult,
|
||||
MediumGeneratedSection,
|
||||
ResearchSource,
|
||||
)
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
from services.cache.persistent_content_cache import persistent_content_cache
|
||||
|
||||
|
||||
class MediumBlogGenerator:
|
||||
"""Service for generating medium-length blog content using structured AI calls."""
|
||||
|
||||
def __init__(self):
|
||||
self.cache = persistent_content_cache
|
||||
|
||||
async def generate_medium_blog_with_progress(self, req: MediumBlogGenerateRequest, task_id: str, user_id: str) -> MediumBlogGenerateResult:
|
||||
"""Use Gemini structured JSON to generate a medium-length blog in one call.
|
||||
|
||||
Args:
|
||||
req: Medium blog generation request
|
||||
task_id: Task ID for progress updates
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for medium blog generation (subscription checks and usage tracking)")
|
||||
|
||||
import time
|
||||
start = time.time()
|
||||
|
||||
# Prepare sections data for cache key generation
|
||||
sections_for_cache = []
|
||||
for s in req.sections:
|
||||
sections_for_cache.append({
|
||||
"id": s.id,
|
||||
"heading": s.heading,
|
||||
"keyPoints": getattr(s, "key_points", []) or getattr(s, "keyPoints", []),
|
||||
"subheadings": getattr(s, "subheadings", []),
|
||||
"keywords": getattr(s, "keywords", []),
|
||||
"targetWords": getattr(s, "target_words", None) or getattr(s, "targetWords", None),
|
||||
})
|
||||
|
||||
# Check cache first
|
||||
cached_result = self.cache.get_cached_content(
|
||||
keywords=req.researchKeywords or [],
|
||||
sections=sections_for_cache,
|
||||
global_target_words=req.globalTargetWords or 1000,
|
||||
persona_data=req.persona.dict() if req.persona else None,
|
||||
tone=req.tone,
|
||||
audience=req.audience
|
||||
)
|
||||
|
||||
if cached_result:
|
||||
logger.info(f"Using cached content for keywords: {req.researchKeywords} (saved expensive generation)")
|
||||
# Add cache hit marker to distinguish from fresh generation
|
||||
cached_result['generation_time_ms'] = 0 # Mark as cache hit
|
||||
cached_result['cache_hit'] = True
|
||||
return MediumBlogGenerateResult(**cached_result)
|
||||
|
||||
# Cache miss - proceed with AI generation
|
||||
logger.info(f"Cache miss - generating new content for keywords: {req.researchKeywords}")
|
||||
|
||||
# Build schema expected from the model
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"title": {"type": "string"},
|
||||
"sections": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": {"type": "string"},
|
||||
"heading": {"type": "string"},
|
||||
"content": {"type": "string"},
|
||||
"wordCount": {"type": "number"},
|
||||
"sources": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {"title": {"type": "string"}, "url": {"type": "string"}},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
# Compose prompt
|
||||
def section_block(s):
|
||||
return {
|
||||
"id": s.id,
|
||||
"heading": s.heading,
|
||||
"outline": {
|
||||
"keyPoints": getattr(s, "key_points", []) or getattr(s, "keyPoints", []),
|
||||
"subheadings": getattr(s, "subheadings", []),
|
||||
"keywords": getattr(s, "keywords", []),
|
||||
"targetWords": getattr(s, "target_words", None) or getattr(s, "targetWords", None),
|
||||
"references": [
|
||||
{"title": r.title, "url": r.url} for r in getattr(s, "references", [])
|
||||
],
|
||||
},
|
||||
}
|
||||
|
||||
payload = {
|
||||
"title": req.title,
|
||||
"globalTargetWords": req.globalTargetWords or 1000,
|
||||
"persona": req.persona.dict() if req.persona else None,
|
||||
"tone": req.tone,
|
||||
"audience": req.audience,
|
||||
"sections": [section_block(s) for s in req.sections],
|
||||
}
|
||||
|
||||
# Build persona-aware system prompt
|
||||
persona_context = ""
|
||||
if req.persona:
|
||||
persona_context = f"""
|
||||
PERSONA GUIDELINES:
|
||||
- Industry: {req.persona.industry or 'General'}
|
||||
- Tone: {req.persona.tone or 'Professional'}
|
||||
- Audience: {req.persona.audience or 'General readers'}
|
||||
- Persona ID: {req.persona.persona_id or 'Default'}
|
||||
|
||||
Write content that reflects this persona's expertise and communication style.
|
||||
Use industry-specific terminology and examples where appropriate.
|
||||
Maintain consistent voice and authority throughout all sections.
|
||||
"""
|
||||
|
||||
system = (
|
||||
"You are a professional blog writer with deep expertise in your field. "
|
||||
"Generate high-quality, persona-driven content for each section based on the provided outline. "
|
||||
"Write engaging, informative content that follows the section's key points and target word count. "
|
||||
"Ensure the content flows naturally and maintains consistent voice and authority. "
|
||||
"Format content with proper paragraph breaks using double line breaks (\\n\\n) between paragraphs. "
|
||||
"Structure content with clear paragraphs - aim for 2-4 sentences per paragraph. "
|
||||
f"{persona_context}"
|
||||
"Return ONLY valid JSON with no markdown formatting or explanations."
|
||||
)
|
||||
|
||||
# Build persona-specific content instructions
|
||||
persona_instructions = ""
|
||||
if req.persona:
|
||||
industry = req.persona.industry or 'General'
|
||||
tone = req.persona.tone or 'Professional'
|
||||
audience = req.persona.audience or 'General readers'
|
||||
|
||||
persona_instructions = f"""
|
||||
PERSONA-DRIVEN CONTENT REQUIREMENTS:
|
||||
- Write as an expert in {industry} industry
|
||||
- Use {tone} tone appropriate for {audience}
|
||||
- Include industry-specific examples and terminology
|
||||
- Demonstrate authority and expertise in the field
|
||||
- Use language that resonates with {audience}
|
||||
- Maintain consistent voice that reflects this persona's expertise
|
||||
"""
|
||||
|
||||
prompt = (
|
||||
f"Write blog content for the following sections. Each section should be {req.globalTargetWords or 1000} words total, distributed across all sections.\n\n"
|
||||
f"Blog Title: {req.title}\n\n"
|
||||
"For each section, write engaging content that:\n"
|
||||
"- Follows the key points provided\n"
|
||||
"- Uses the suggested keywords naturally\n"
|
||||
"- Meets the target word count\n"
|
||||
"- Maintains professional tone\n"
|
||||
"- References the provided sources when relevant\n"
|
||||
"- Breaks content into clear paragraphs (2-4 sentences each)\n"
|
||||
"- Uses double line breaks (\\n\\n) between paragraphs for proper formatting\n"
|
||||
"- Starts with an engaging opening paragraph\n"
|
||||
"- Ends with a strong concluding paragraph\n"
|
||||
f"{persona_instructions}\n"
|
||||
"IMPORTANT: Format the 'content' field with proper paragraph breaks using \\n\\n between paragraphs.\n\n"
|
||||
"Return a JSON object with 'title' and 'sections' array. Each section should have 'id', 'heading', 'content', and 'wordCount'.\n\n"
|
||||
f"Sections to write:\n{json.dumps(payload, ensure_ascii=False, indent=2)}"
|
||||
)
|
||||
|
||||
try:
|
||||
ai_resp = llm_text_gen(
|
||||
prompt=prompt,
|
||||
json_struct=schema,
|
||||
system_prompt=system,
|
||||
user_id=user_id
|
||||
)
|
||||
except HTTPException:
|
||||
# Re-raise HTTPExceptions (e.g., 429 subscription limit) to preserve error details
|
||||
raise
|
||||
except Exception as llm_error:
|
||||
# Wrap other errors
|
||||
logger.error(f"AI generation failed: {llm_error}")
|
||||
raise Exception(f"AI generation failed: {str(llm_error)}")
|
||||
|
||||
# Check for errors in AI response
|
||||
if not ai_resp or ai_resp.get("error"):
|
||||
error_msg = ai_resp.get("error", "Empty generation result from model") if ai_resp else "No response from model"
|
||||
logger.error(f"AI generation failed: {error_msg}")
|
||||
raise Exception(f"AI generation failed: {error_msg}")
|
||||
|
||||
# Normalize output
|
||||
title = ai_resp.get("title") or req.title
|
||||
out_sections = []
|
||||
for s in ai_resp.get("sections", []) or []:
|
||||
out_sections.append(
|
||||
MediumGeneratedSection(
|
||||
id=str(s.get("id")),
|
||||
heading=s.get("heading") or "",
|
||||
content=s.get("content") or "",
|
||||
wordCount=int(s.get("wordCount") or 0),
|
||||
sources=[
|
||||
# map to ResearchSource shape if possible; keep minimal
|
||||
ResearchSource(title=src.get("title", ""), url=src.get("url", ""))
|
||||
for src in (s.get("sources") or [])
|
||||
] or None,
|
||||
)
|
||||
)
|
||||
|
||||
duration_ms = int((time.time() - start) * 1000)
|
||||
result = MediumBlogGenerateResult(
|
||||
success=True,
|
||||
title=title,
|
||||
sections=out_sections,
|
||||
model="gemini-2.5-flash",
|
||||
generation_time_ms=duration_ms,
|
||||
safety_flags=None,
|
||||
)
|
||||
|
||||
# Cache the result for future use
|
||||
try:
|
||||
self.cache.cache_content(
|
||||
keywords=req.researchKeywords or [],
|
||||
sections=sections_for_cache,
|
||||
global_target_words=req.globalTargetWords or 1000,
|
||||
persona_data=req.persona.dict() if req.persona else None,
|
||||
tone=req.tone or "professional",
|
||||
audience=req.audience or "general",
|
||||
result=result.dict()
|
||||
)
|
||||
logger.info(f"Cached content result for keywords: {req.researchKeywords}")
|
||||
except Exception as cache_error:
|
||||
logger.warning(f"Failed to cache content result: {cache_error}")
|
||||
# Don't fail the entire operation if caching fails
|
||||
|
||||
return result
|
||||
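For reference, a minimal sketch of the structured response this prompt asks the model to return. The field names come from the prompt above; the values are illustrative only:

```python
# Illustrative only: shape of the JSON that llm_text_gen is expected to return here.
ai_resp_example = {
    "title": "Example Blog Title",
    "sections": [
        {
            "id": "s1",
            "heading": "Introduction",
            "content": "Opening paragraph...\n\nSecond paragraph...",
            "wordCount": 180,
            "sources": [{"title": "Example Source", "url": "https://example.com"}],
        }
    ],
}
```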
42
backend/services/blog_writer/content/source_url_manager.py
Normal file
@@ -0,0 +1,42 @@
"""
SourceURLManager - selects the most relevant source URLs for a section.

Low-effort heuristic using keywords and titles; safe defaults if no research.
"""

from typing import Any, List


class SourceURLManager:
    def pick_relevant_urls(self, section: Any, research: Any, limit: int = 5) -> List[str]:
        if not research or not getattr(research, 'sources', None):
            return []

        section_keywords = set(k.lower() for k in getattr(section, 'keywords', []))
        scored: List[tuple[float, str]] = []
        for s in research.sources:
            # Support both dict-style and object-style sources.
            if isinstance(s, dict):
                url = s.get('url') or s.get('uri')
                title = s.get('title') or ''
            else:
                url = getattr(s, 'url', None) or getattr(s, 'uri', None)
                title = getattr(s, 'title', None) or ''
            if not url or not isinstance(url, str):
                continue
            title_l = (title or '').lower()
            # simple overlap score
            score = 0.0
            for kw in section_keywords:
                if kw and kw in title_l:
                    score += 1.0
            # prefer https and reputable domains lightly
            if url.startswith('https://'):
                score += 0.2
            scored.append((score, url))

        scored.sort(key=lambda x: x[0], reverse=True)
        dedup: List[str] = []
        for _, u in scored:
            if u not in dedup:
                dedup.append(u)
            if len(dedup) >= limit:
                break
        return dedup
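A minimal usage sketch for this heuristic; the section and research objects below are stand-ins, since the real ones come from the outline and research services:

```python
# Hypothetical stand-ins: any objects exposing `keywords` and `sources` work.
from types import SimpleNamespace

from services.blog_writer.content.source_url_manager import SourceURLManager

section = SimpleNamespace(keywords=["keyword research", "seo"])
research = SimpleNamespace(sources=[
    {"title": "Keyword Research Guide", "url": "https://example.com/guide"},
    {"title": "Unrelated Post", "url": "http://example.com/other"},
])

urls = SourceURLManager().pick_relevant_urls(section, research, limit=3)
# The https source whose title matches a section keyword scores highest and comes first.
```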
143
backend/services/blog_writer/content/transition_generator.py
Normal file
@@ -0,0 +1,143 @@
"""
TransitionGenerator - produces intelligent transitions between sections using LLM analysis.

Uses Gemini API for natural transitions while maintaining cost efficiency through smart caching.
"""

from typing import Dict
from loguru import logger
import hashlib

# Import the common gemini provider
from services.llm_providers.gemini_provider import gemini_text_response


class TransitionGenerator:
    def __init__(self):
        # Simple cache to avoid redundant LLM calls for similar transitions
        self._cache: Dict[str, str] = {}
        logger.info("✅ TransitionGenerator initialized with LLM-based generation")

    def generate_transition(self, previous_text: str, current_heading: str, use_llm: bool = True) -> str:
        """
        Return a 1-2 sentence bridge from previous_text into current_heading.

        Args:
            previous_text: Previous section content
            current_heading: Current section heading
            use_llm: Whether to use LLM generation (default: True for substantial content)
        """
        prev = (previous_text or "").strip()
        if not prev:
            return f"Let's explore {current_heading.lower()} next."

        # Create cache key
        cache_key = self._get_cache_key(prev, current_heading)

        # Check cache first
        if cache_key in self._cache:
            logger.debug("Transition generation cache hit")
            return self._cache[cache_key]

        # Determine if we should use LLM
        should_use_llm = use_llm and self._should_use_llm_generation(prev, current_heading)

        if should_use_llm:
            try:
                transition = self._llm_generate_transition(prev, current_heading)
                self._cache[cache_key] = transition
                logger.info("LLM-based transition generated")
                return transition
            except Exception as e:
                logger.warning(f"LLM transition generation failed, using fallback: {e}")
                # Fall through to heuristic generation

        # Heuristic fallback
        transition = self._heuristic_transition(prev, current_heading)
        self._cache[cache_key] = transition
        return transition

    def _should_use_llm_generation(self, previous_text: str, current_heading: str) -> bool:
        """Determine if content is substantial enough to warrant LLM generation."""
        # Use LLM for substantial previous content (>100 words) or complex headings
        word_count = len(previous_text.split())
        complex_heading = len(current_heading.split()) > 2 or any(char in current_heading for char in [':', '-', '&'])

        return word_count > 100 or complex_heading

    def _llm_generate_transition(self, previous_text: str, current_heading: str) -> str:
        """Use Gemini API for intelligent transition generation."""

        # Truncate previous text to minimize tokens while keeping context
        prev_truncated = previous_text[-200:]  # Last 200 chars usually contain the conclusion

        prompt = f"""
Create a smooth, natural 1-2 sentence transition from the previous content to the new section.

PREVIOUS CONTENT (ending): {prev_truncated}
NEW SECTION HEADING: {current_heading}

Requirements:
- Write exactly 1-2 sentences
- Create a logical bridge between the topics
- Use natural, engaging language
- Avoid repetition of the previous content
- Lead smoothly into the new section topic

Generate only the transition text, no explanations or formatting.
"""

        try:
            result = gemini_text_response(
                prompt=prompt,
                temperature=0.6,  # Balanced creativity and consistency
                max_tokens=300,  # Increased tokens for better transitions
                system_prompt="You are an expert content writer creating smooth transitions between sections."
            )

            if result and result.strip():
                # Clean up the response
                transition = result.strip()
                # Ensure it's 1-2 sentences
                sentences = transition.split('. ')
                if len(sentences) > 2:
                    transition = '. '.join(sentences[:2]) + '.'
                return transition
            else:
                logger.warning("LLM transition response empty, using fallback")
                return self._heuristic_transition(previous_text, current_heading)

        except Exception as e:
            logger.error(f"LLM transition generation error: {e}")
            return self._heuristic_transition(previous_text, current_heading)

    def _heuristic_transition(self, previous_text: str, current_heading: str) -> str:
        """Fallback heuristic-based transition generation."""
        tail = previous_text[-240:]

        # Enhanced heuristics based on content patterns
        if any(word in tail.lower() for word in ["problem", "issue", "challenge"]):
            return f"Now that we've identified the challenges, let's explore {current_heading.lower()} to find solutions."
        elif any(word in tail.lower() for word in ["solution", "approach", "method"]):
            return f"Building on this approach, {current_heading.lower()} provides the next step in our analysis."
        elif any(word in tail.lower() for word in ["important", "crucial", "essential"]):
            return f"Given this importance, {current_heading.lower()} becomes our next focus area."
        else:
            return (
                f"Building on the discussion above, this leads us into {current_heading.lower()}, "
                f"where we focus on practical implications and what to do next."
            )

    def _get_cache_key(self, previous_text: str, current_heading: str) -> str:
        """Generate cache key from content hashes."""
        # Use last 100 chars of previous text and heading for cache key
        prev_hash = hashlib.md5(previous_text[-100:].encode()).hexdigest()[:8]
        heading_hash = hashlib.md5(current_heading.encode()).hexdigest()[:8]
        return f"{prev_hash}_{heading_hash}"

    def clear_cache(self):
        """Clear transition cache (useful for testing or memory management)."""
        self._cache.clear()
        logger.info("TransitionGenerator cache cleared")
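A short usage sketch, assuming Gemini credentials are configured; without them the call falls back to the heuristic transitions:

```python
from services.blog_writer.content.transition_generator import TransitionGenerator

generator = TransitionGenerator()
bridge = generator.generate_transition(
    previous_text="...so keyword research remains the biggest challenge for new bloggers.",
    current_heading="Building a Content Calendar",
)
print(bridge)  # a 1-2 sentence bridge into the next section
```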
11
backend/services/blog_writer/core/__init__.py
Normal file
@@ -0,0 +1,11 @@
"""
Core module for AI Blog Writer.

This module contains the main service orchestrator and shared utilities.
"""

from .blog_writer_service import BlogWriterService

__all__ = [
    'BlogWriterService'
]
521
backend/services/blog_writer/core/blog_writer_service.py
Normal file
@@ -0,0 +1,521 @@
"""
Blog Writer Service - Main orchestrator for AI Blog Writer.

Coordinates research, outline generation, content creation, and optimization.
"""

from typing import Dict, Any, List
import time
import uuid
from loguru import logger

from models.blog_models import (
    BlogResearchRequest,
    BlogResearchResponse,
    BlogOutlineRequest,
    BlogOutlineResponse,
    BlogOutlineRefineRequest,
    BlogSectionRequest,
    BlogSectionResponse,
    BlogOptimizeRequest,
    BlogOptimizeResponse,
    BlogSEOAnalyzeRequest,
    BlogSEOAnalyzeResponse,
    BlogSEOMetadataRequest,
    BlogSEOMetadataResponse,
    BlogPublishRequest,
    BlogPublishResponse,
    BlogOutlineSection,
    ResearchSource,
)

from ..research import ResearchService
from ..outline import OutlineService
from ..content.enhanced_content_generator import EnhancedContentGenerator
from ..content.medium_blog_generator import MediumBlogGenerator
from ..content.blog_rewriter import BlogRewriter
from services.llm_providers.gemini_provider import gemini_structured_json_response
from services.cache.persistent_content_cache import persistent_content_cache
from models.blog_models import (
    MediumBlogGenerateRequest,
    MediumBlogGenerateResult,
    MediumGeneratedSection,
)


# Import task manager - we'll create a simple one for this service
class SimpleTaskManager:
    """Simple task manager for BlogWriterService."""

    def __init__(self):
        self.tasks = {}

    def start_task(self, task_id: str, func, **kwargs):
        """Start a task with the given function and arguments."""
        import asyncio
        self.tasks[task_id] = {
            "status": "running",
            "progress": "Starting...",
            "result": None,
            "error": None
        }
        # Start the task in the background
        asyncio.create_task(self._run_task(task_id, func, **kwargs))

    async def _run_task(self, task_id: str, func, **kwargs):
        """Run the task function."""
        try:
            await func(task_id, **kwargs)
        except Exception as e:
            self.tasks[task_id]["status"] = "failed"
            self.tasks[task_id]["error"] = str(e)
            logger.error(f"Task {task_id} failed: {e}")

    def update_task_status(self, task_id: str, status: str, progress: str = None, result=None):
        """Update task status."""
        if task_id in self.tasks:
            self.tasks[task_id]["status"] = status
            if progress:
                self.tasks[task_id]["progress"] = progress
            if result:
                self.tasks[task_id]["result"] = result

    def get_task_status(self, task_id: str):
        """Get task status."""
        return self.tasks.get(task_id, {"status": "not_found"})


class BlogWriterService:
    """Main service orchestrator for AI Blog Writer functionality."""

    def __init__(self):
        self.research_service = ResearchService()
        self.outline_service = OutlineService()
        self.content_generator = EnhancedContentGenerator()
        self.task_manager = SimpleTaskManager()
        self.medium_blog_generator = MediumBlogGenerator()
        self.blog_rewriter = BlogRewriter(self.task_manager)

    # Research Methods
    async def research(self, request: BlogResearchRequest, user_id: str) -> BlogResearchResponse:
        """Conduct comprehensive research using Google Search grounding."""
        return await self.research_service.research(request, user_id)

    async def research_with_progress(self, request: BlogResearchRequest, task_id: str, user_id: str) -> BlogResearchResponse:
        """Conduct research with real-time progress updates."""
        return await self.research_service.research_with_progress(request, task_id, user_id)

    # Outline Methods
    async def generate_outline(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
        """Generate AI-powered outline from research data.

        Args:
            request: Outline generation request with research data
            user_id: User ID (required for subscription checks and usage tracking)
        """
        if not user_id:
            raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
        return await self.outline_service.generate_outline(request, user_id)

    async def generate_outline_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
        """Generate outline with real-time progress updates."""
        return await self.outline_service.generate_outline_with_progress(request, task_id, user_id)

    async def refine_outline(self, request: BlogOutlineRefineRequest) -> BlogOutlineResponse:
        """Refine outline with HITL operations."""
        return await self.outline_service.refine_outline(request)

    async def enhance_section_with_ai(self, section: BlogOutlineSection, focus: str = "general improvement") -> BlogOutlineSection:
        """Enhance a section using AI."""
        return await self.outline_service.enhance_section_with_ai(section, focus)

    async def optimize_outline_with_ai(self, outline: List[BlogOutlineSection], focus: str = "general optimization") -> List[BlogOutlineSection]:
        """Optimize entire outline for better flow and SEO."""
        return await self.outline_service.optimize_outline_with_ai(outline, focus)

    def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
        """Rebalance word count distribution across sections."""
        return self.outline_service.rebalance_word_counts(outline, target_words)

    # Content Generation Methods
    async def generate_section(self, request: BlogSectionRequest) -> BlogSectionResponse:
        """Generate section content from outline."""
        # Compose research-lite object with minimal continuity summary if available
        research_ctx: Any = getattr(request, 'research', None)
        try:
            ai_result = await self.content_generator.generate_section(
                section=request.section,
                research=research_ctx,
                mode=(request.mode or "polished"),
            )
            markdown = ai_result.get('content') or ai_result.get('markdown') or ''
            citations = []
            # Map basic citations from sources if present
            for s in ai_result.get('sources', [])[:5]:
                citations.append({
                    "title": s.get('title') if isinstance(s, dict) else getattr(s, 'title', ''),
                    "url": s.get('url') if isinstance(s, dict) else getattr(s, 'url', ''),
                })
            if not markdown:
                markdown = f"## {request.section.heading}\n\n(Generated content was empty.)"
            return BlogSectionResponse(
                success=True,
                markdown=markdown,
                citations=citations,
                continuity_metrics=ai_result.get('continuity_metrics')
            )
        except Exception as e:
            logger.error(f"Section generation failed: {e}")
            fallback = f"## {request.section.heading}\n\nThis section will cover: {', '.join(request.section.key_points)}."
            return BlogSectionResponse(success=False, markdown=fallback, citations=[])

    async def optimize_section(self, request: BlogOptimizeRequest) -> BlogOptimizeResponse:
        """Optimize section content for readability and SEO."""
        # TODO: Move to optimization module
        return BlogOptimizeResponse(success=True, optimized=request.content, diff_preview=None)

    # SEO and Analysis Methods (TODO: Extract to optimization module)
    async def hallucination_check(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Run hallucination detection on provided text."""
        text = str(payload.get("text", "") or "").strip()
        if not text:
            return {"success": False, "error": "No text provided"}

        # Prefer direct service use over HTTP proxy
        try:
            from services.hallucination_detector import HallucinationDetector
            detector = HallucinationDetector()
            result = await detector.detect_hallucinations(text)

            # Serialize dataclass-like result to dict
            claims = []
            for c in result.claims:
                claims.append({
                    "text": c.text,
                    "confidence": c.confidence,
                    "assessment": c.assessment,
                    "supporting_sources": c.supporting_sources,
                    "refuting_sources": c.refuting_sources,
                    "reasoning": c.reasoning,
                })

            return {
                "success": True,
                "overall_confidence": result.overall_confidence,
                "total_claims": result.total_claims,
                "supported_claims": result.supported_claims,
                "refuted_claims": result.refuted_claims,
                "insufficient_claims": result.insufficient_claims,
                "timestamp": result.timestamp,
                "claims": claims,
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    async def seo_analyze(self, request: BlogSEOAnalyzeRequest, user_id: str = None) -> BlogSEOAnalyzeResponse:
        """Analyze content for SEO optimization using comprehensive blog-specific analyzer."""
        try:
            from services.blog_writer.seo.blog_content_seo_analyzer import BlogContentSEOAnalyzer

            if not user_id:
                raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")

            content = request.content or ""
            target_keywords = request.keywords or []

            # Use research data from request if available, otherwise create fallback
            if request.research_data:
                research_data = request.research_data
                logger.info(f"Using research data from request: {research_data.get('keyword_analysis', {})}")
            else:
                # Fallback for backward compatibility
                research_data = {
                    "keyword_analysis": {
                        "primary": target_keywords,
                        "long_tail": [],
                        "semantic": [],
                        "all_keywords": target_keywords,
                        "search_intent": "informational"
                    }
                }
                logger.warning("No research data provided, using fallback keywords")

            # Use our comprehensive SEO analyzer
            analyzer = BlogContentSEOAnalyzer()
            analysis_results = await analyzer.analyze_blog_content(content, research_data, user_id=user_id)

            # Convert results to response format
            recommendations = analysis_results.get('actionable_recommendations', [])
            # Convert recommendation objects to strings
            recommendation_strings = []
            for rec in recommendations:
                if isinstance(rec, dict):
                    recommendation_strings.append(f"[{rec.get('category', 'General')}] {rec.get('recommendation', '')}")
                else:
                    recommendation_strings.append(str(rec))

            return BlogSEOAnalyzeResponse(
                success=True,
                seo_score=float(analysis_results.get('overall_score', 0)),
                density=analysis_results.get('visualization_data', {}).get('keyword_analysis', {}).get('densities', {}),
                structure=analysis_results.get('detailed_analysis', {}).get('content_structure', {}),
                readability=analysis_results.get('detailed_analysis', {}).get('readability_analysis', {}),
                link_suggestions=[],
                image_alt_status={"total_images": 0, "missing_alt": 0},
                recommendations=recommendation_strings
            )

        except Exception as e:
            logger.error(f"SEO analysis failed: {e}")
            return BlogSEOAnalyzeResponse(
                success=False,
                seo_score=0.0,
                density={},
                structure={},
                readability={},
                link_suggestions=[],
                image_alt_status={"total_images": 0, "missing_alt": 0},
                recommendations=[f"SEO analysis failed: {str(e)}"]
            )

    async def seo_metadata(self, request: BlogSEOMetadataRequest, user_id: str = None) -> BlogSEOMetadataResponse:
        """Generate comprehensive SEO metadata for content."""
        try:
            from services.blog_writer.seo.blog_seo_metadata_generator import BlogSEOMetadataGenerator

            if not user_id:
                raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")

            # Initialize metadata generator
            metadata_generator = BlogSEOMetadataGenerator()

            # Extract outline and seo_analysis from request
            outline = request.outline if hasattr(request, 'outline') else None
            seo_analysis = request.seo_analysis if hasattr(request, 'seo_analysis') else None

            # Generate comprehensive metadata with full context
            metadata_results = await metadata_generator.generate_comprehensive_metadata(
                blog_content=request.content,
                blog_title=request.title or "Untitled Blog Post",
                research_data=request.research_data or {},
                outline=outline,
                seo_analysis=seo_analysis,
                user_id=user_id
            )

            # Convert to BlogSEOMetadataResponse format
            return BlogSEOMetadataResponse(
                success=metadata_results.get('success', True),
                title_options=metadata_results.get('title_options', []),
                meta_descriptions=metadata_results.get('meta_descriptions', []),
                seo_title=metadata_results.get('seo_title'),
                meta_description=metadata_results.get('meta_description'),
                url_slug=metadata_results.get('url_slug', ''),
                blog_tags=metadata_results.get('blog_tags', []),
                blog_categories=metadata_results.get('blog_categories', []),
                social_hashtags=metadata_results.get('social_hashtags', []),
                open_graph=metadata_results.get('open_graph', {}),
                twitter_card=metadata_results.get('twitter_card', {}),
                json_ld_schema=metadata_results.get('json_ld_schema', {}),
                canonical_url=metadata_results.get('canonical_url', ''),
                reading_time=metadata_results.get('reading_time', 0.0),
                focus_keyword=metadata_results.get('focus_keyword', ''),
                generated_at=metadata_results.get('generated_at', ''),
                optimization_score=metadata_results.get('metadata_summary', {}).get('optimization_score', 0)
            )

        except Exception as e:
            logger.error(f"SEO metadata generation failed: {e}")
            # Return fallback response
            return BlogSEOMetadataResponse(
                success=False,
                title_options=[request.title or "Generated SEO Title"],
                meta_descriptions=["Compelling meta description..."],
                open_graph={"title": request.title or "OG Title", "image": ""},
                twitter_card={"card": "summary_large_image"},
                json_ld_schema={"@type": "Article"},
                error=str(e)
            )

    async def publish(self, request: BlogPublishRequest) -> BlogPublishResponse:
        """Publish content to specified platform."""
        # TODO: Move to content module
        return BlogPublishResponse(success=True, platform=request.platform, url="https://example.com/post")

    async def generate_medium_blog_with_progress(self, req: MediumBlogGenerateRequest, task_id: str, user_id: str) -> MediumBlogGenerateResult:
        """Use Gemini structured JSON to generate a medium-length blog in one call.

        Args:
            req: Medium blog generation request
            task_id: Task ID for progress updates
            user_id: User ID (required for subscription checks and usage tracking)
        """
        if not user_id:
            raise ValueError("user_id is required for medium blog generation (subscription checks and usage tracking)")
        return await self.medium_blog_generator.generate_medium_blog_with_progress(req, task_id, user_id)

    async def analyze_flow_basic(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Analyze flow metrics for entire blog using single AI call (cost-effective)."""
        try:
            # Extract blog content from request
            sections = request.get("sections", [])
            title = request.get("title", "Untitled Blog")

            if not sections:
                return {"error": "No sections provided for analysis"}

            # Combine all content for analysis
            full_content = f"Title: {title}\n\n"
            for section in sections:
                full_content += f"Section: {section.get('heading', 'Untitled')}\n"
                full_content += f"Content: {section.get('content', '')}\n\n"

            # Build analysis prompt
            system_prompt = """You are an expert content analyst specializing in narrative flow, consistency, and progression analysis.
Analyze the provided blog content and provide detailed, actionable feedback for improvement.
Focus on how well the content flows from section to section, maintains consistency in tone and style,
and progresses logically through the topic."""

            analysis_prompt = f"""
Analyze the following blog content for narrative flow, consistency, and progression:

{full_content}

Evaluate each section and provide overall analysis with specific scores and actionable suggestions.
Consider:
- How well each section flows into the next
- Consistency in tone, style, and voice throughout
- Logical progression of ideas and arguments
- Transition quality between sections
- Overall coherence and readability

IMPORTANT: For each section in the response, use the exact section ID provided in the input.
The section IDs in your response must match the section IDs from the input exactly.

Provide detailed analysis with specific, actionable suggestions for improvement.
"""

            # Use Gemini for structured analysis
            from services.llm_providers.gemini_provider import gemini_structured_json_response

            schema = {
                "type": "object",
                "properties": {
                    "overall_flow_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                    "overall_consistency_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                    "overall_progression_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                    "overall_coherence_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                    "sections": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "section_id": {"type": "string"},
                                "heading": {"type": "string"},
                                "flow_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                                "consistency_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                                "progression_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                                "coherence_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                                "transition_quality": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                                "suggestions": {"type": "array", "items": {"type": "string"}},
                                "strengths": {"type": "array", "items": {"type": "string"}},
                                "improvement_areas": {"type": "array", "items": {"type": "string"}}
                            },
                            "required": ["section_id", "heading", "flow_score", "consistency_score", "progression_score", "coherence_score", "transition_quality", "suggestions"]
                        }
                    },
                    "overall_suggestions": {"type": "array", "items": {"type": "string"}},
                    "overall_strengths": {"type": "array", "items": {"type": "string"}},
                    "overall_improvement_areas": {"type": "array", "items": {"type": "string"}},
                    "transition_analysis": {
                        "type": "object",
                        "properties": {
                            "overall_transition_quality": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                            "transition_suggestions": {"type": "array", "items": {"type": "string"}}
                        }
                    }
                },
                "required": ["overall_flow_score", "overall_consistency_score", "overall_progression_score", "overall_coherence_score", "sections", "overall_suggestions"]
            }

            result = gemini_structured_json_response(
                prompt=analysis_prompt,
                schema=schema,
                temperature=0.3,
                max_tokens=4096,
                system_prompt=system_prompt
            )

            if result and not result.get("error"):
                logger.info("Basic flow analysis completed successfully")
                return {"success": True, "analysis": result, "mode": "basic"}
            else:
                error_msg = result.get("error", "Analysis failed") if result else "No response from AI"
                logger.error(f"Basic flow analysis failed: {error_msg}")
                return {"error": error_msg}

        except Exception as e:
            logger.error(f"Basic flow analysis error: {e}")
            return {"error": str(e)}

    async def analyze_flow_advanced(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Analyze flow metrics for each section individually (detailed but expensive)."""
        try:
            # Use the existing enhanced content generator for detailed analysis
            sections = request.get("sections", [])
            title = request.get("title", "Untitled Blog")

            if not sections:
                return {"error": "No sections provided for analysis"}

            results = []
            prev_section_content = ""
            for section in sections:
                # Use the existing flow analyzer for each section
                section_content = section.get("content", "")
                section_heading = section.get("heading", "Untitled")

                # Use the existing flow analyzer, passing the previous section as context
                flow_metrics = self.content_generator.flow.assess_flow(
                    prev_section_content,
                    section_content,
                    use_llm=True
                )

                results.append({
                    "section_id": section.get("id", "unknown"),
                    "heading": section_heading,
                    "flow_score": flow_metrics.get("flow", 0.0),
                    "consistency_score": flow_metrics.get("consistency", 0.0),
                    "progression_score": flow_metrics.get("progression", 0.0),
                    "detailed_analysis": flow_metrics.get("analysis", ""),
                    "suggestions": flow_metrics.get("suggestions", [])
                })

                # Keep this section's content as context for the next iteration
                prev_section_content = section_content

            # Calculate overall scores
            overall_flow = sum(r["flow_score"] for r in results) / len(results) if results else 0.0
            overall_consistency = sum(r["consistency_score"] for r in results) / len(results) if results else 0.0
            overall_progression = sum(r["progression_score"] for r in results) / len(results) if results else 0.0

            logger.info("Advanced flow analysis completed successfully")
            return {
                "success": True,
                "analysis": {
                    "overall_flow_score": overall_flow,
                    "overall_consistency_score": overall_consistency,
                    "overall_progression_score": overall_progression,
                    "sections": results
                },
                "mode": "advanced"
            }

        except Exception as e:
            logger.error(f"Advanced flow analysis error: {e}")
            return {"error": str(e)}

    def start_blog_rewrite(self, request: Dict[str, Any]) -> str:
        """Start blog rewrite task with user feedback."""
        return self.blog_rewriter.start_blog_rewrite(request)
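A hedged sketch of calling the single-call flow analysis from the orchestrator. The section dicts are illustrative, and constructing BlogWriterService assumes the underlying research, outline, and content services are configured:

```python
import asyncio

from services.blog_writer.core.blog_writer_service import BlogWriterService

async def main():
    service = BlogWriterService()
    report = await service.analyze_flow_basic({
        "title": "Example Post",
        "sections": [
            {"id": "s1", "heading": "Intro", "content": "..."},
            {"id": "s2", "heading": "Details", "content": "..."},
        ],
    })
    # Either {"success": True, "analysis": {...}, "mode": "basic"} or {"error": "..."}
    print(report)

asyncio.run(main())
```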
536
backend/services/blog_writer/database_task_manager.py
Normal file
@@ -0,0 +1,536 @@
"""
Database-Backed Task Manager for Blog Writer

Replaces in-memory task storage with persistent database storage for
reliability, recovery, and analytics.
"""

import asyncio
import uuid
import json
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from loguru import logger

from services.blog_writer.logger_config import blog_writer_logger, log_function_call
from models.blog_models import (
    BlogResearchRequest,
    BlogOutlineRequest,
    MediumBlogGenerateRequest,
    MediumBlogGenerateResult,
)
from services.blog_writer.blog_service import BlogWriterService


class DatabaseTaskManager:
    """Database-backed task manager for blog writer operations."""

    def __init__(self, db_connection):
        self.db = db_connection
        self.service = BlogWriterService()
        self._cleanup_task = None
        self._start_cleanup_task()

    def _start_cleanup_task(self):
        """Start background task to clean up old completed tasks."""
        async def cleanup_loop():
            while True:
                try:
                    await self.cleanup_old_tasks()
                    await asyncio.sleep(3600)  # Run every hour
                except Exception as e:
                    logger.error(f"Error in cleanup task: {e}")
                    await asyncio.sleep(300)  # Wait 5 minutes on error

        self._cleanup_task = asyncio.create_task(cleanup_loop())

    @log_function_call("create_task")
    async def create_task(
        self,
        user_id: str,
        task_type: str,
        request_data: Dict[str, Any],
        correlation_id: Optional[str] = None,
        operation: Optional[str] = None,
        priority: int = 0,
        max_retries: int = 3,
        metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        """Create a new task in the database."""
        task_id = str(uuid.uuid4())
        correlation_id = correlation_id or str(uuid.uuid4())

        query = """
            INSERT INTO blog_writer_tasks
            (id, user_id, task_type, status, request_data, correlation_id, operation, priority, max_retries, metadata)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
        """

        await self.db.execute(
            query,
            task_id,
            user_id,
            task_type,
            'pending',
            json.dumps(request_data),
            correlation_id,
            operation,
            priority,
            max_retries,
            json.dumps(metadata or {})
        )

        blog_writer_logger.log_operation_start(
            "task_created",
            task_id=task_id,
            task_type=task_type,
            user_id=user_id,
            correlation_id=correlation_id
        )

        return task_id

    @log_function_call("get_task_status")
    async def get_task_status(self, task_id: str) -> Optional[Dict[str, Any]]:
        """Get the status of a task."""
        query = """
            SELECT
                id, user_id, task_type, status, request_data, result_data, error_data,
                created_at, updated_at, completed_at, correlation_id, operation,
                retry_count, max_retries, priority, metadata
            FROM blog_writer_tasks
            WHERE id = $1
        """

        row = await self.db.fetchrow(query, task_id)
        if not row:
            return None

        # Get progress messages
        progress_query = """
            SELECT timestamp, message, percentage, progress_type, metadata
            FROM blog_writer_task_progress
            WHERE task_id = $1
            ORDER BY timestamp DESC
            LIMIT 10
        """

        progress_rows = await self.db.fetch(progress_query, task_id)
        progress_messages = [
            {
                "timestamp": progress_row["timestamp"].isoformat(),
                "message": progress_row["message"],
                "percentage": float(progress_row["percentage"]),
                "progress_type": progress_row["progress_type"],
                "metadata": progress_row["metadata"] or {}
            }
            for progress_row in progress_rows
        ]

        return {
            "task_id": row["id"],
            "user_id": row["user_id"],
            "task_type": row["task_type"],
            "status": row["status"],
            "created_at": row["created_at"].isoformat(),
            "updated_at": row["updated_at"].isoformat(),
            "completed_at": row["completed_at"].isoformat() if row["completed_at"] else None,
            "correlation_id": row["correlation_id"],
            "operation": row["operation"],
            "retry_count": row["retry_count"],
            "max_retries": row["max_retries"],
            "priority": row["priority"],
            "progress_messages": progress_messages,
            "result": json.loads(row["result_data"]) if row["result_data"] else None,
            "error": json.loads(row["error_data"]) if row["error_data"] else None,
            "metadata": json.loads(row["metadata"]) if row["metadata"] else {}
        }

    @log_function_call("update_task_status")
    async def update_task_status(
        self,
        task_id: str,
        status: str,
        result_data: Optional[Dict[str, Any]] = None,
        error_data: Optional[Dict[str, Any]] = None,
        completed_at: Optional[datetime] = None
    ):
        """Update task status and data."""
        query = """
            UPDATE blog_writer_tasks
            SET status = $2, result_data = $3, error_data = $4, completed_at = $5, updated_at = NOW()
            WHERE id = $1
        """

        await self.db.execute(
            query,
            task_id,
            status,
            json.dumps(result_data) if result_data else None,
            json.dumps(error_data) if error_data else None,
            completed_at or (datetime.now() if status in ['completed', 'failed', 'cancelled'] else None)
        )

        blog_writer_logger.log_operation_end(
            "task_status_updated",
            0,
            success=status in ['completed', 'cancelled'],
            task_id=task_id,
            status=status
        )

    @log_function_call("update_progress")
    async def update_progress(
        self,
        task_id: str,
        message: str,
        percentage: Optional[float] = None,
        progress_type: str = "info",
        metadata: Optional[Dict[str, Any]] = None
    ):
        """Update task progress."""
        # Insert progress record
        progress_query = """
            INSERT INTO blog_writer_task_progress
            (task_id, message, percentage, progress_type, metadata)
            VALUES ($1, $2, $3, $4, $5)
        """

        await self.db.execute(
            progress_query,
            task_id,
            message,
            percentage or 0.0,
            progress_type,
            json.dumps(metadata or {})
        )

        # Update task status to running if it was pending
        status_query = """
            UPDATE blog_writer_tasks
            SET status = 'running', updated_at = NOW()
            WHERE id = $1 AND status = 'pending'
        """

        await self.db.execute(status_query, task_id)

        logger.info(f"Progress update for task {task_id}: {message}")

    @log_function_call("record_metrics")
    async def record_metrics(
        self,
        task_id: str,
        operation: str,
        duration_ms: int,
        token_usage: Optional[Dict[str, int]] = None,
        api_calls: int = 0,
        cache_hits: int = 0,
        cache_misses: int = 0,
        error_count: int = 0,
        metadata: Optional[Dict[str, Any]] = None
    ):
        """Record performance metrics for a task."""
        query = """
            INSERT INTO blog_writer_task_metrics
            (task_id, operation, duration_ms, token_usage, api_calls, cache_hits, cache_misses, error_count, metadata)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        """

        await self.db.execute(
            query,
            task_id,
            operation,
            duration_ms,
            json.dumps(token_usage) if token_usage else None,
            api_calls,
            cache_hits,
            cache_misses,
            error_count,
            json.dumps(metadata or {})
        )

        blog_writer_logger.log_performance(
            f"task_metrics_{operation}",
            duration_ms,
            "ms",
            task_id=task_id,
            operation=operation,
            api_calls=api_calls,
            cache_hits=cache_hits,
            cache_misses=cache_misses
        )

    @log_function_call("increment_retry_count")
    async def increment_retry_count(self, task_id: str) -> int:
        """Increment retry count and return new count."""
        query = """
            UPDATE blog_writer_tasks
            SET retry_count = retry_count + 1, updated_at = NOW()
            WHERE id = $1
            RETURNING retry_count
        """

        result = await self.db.fetchval(query, task_id)
        return result or 0

    @log_function_call("cleanup_old_tasks")
    async def cleanup_old_tasks(self, days: int = 7) -> int:
        """Clean up old completed tasks."""
        query = """
            DELETE FROM blog_writer_tasks
            WHERE status IN ('completed', 'failed', 'cancelled')
            AND created_at < NOW() - INTERVAL '%s days'
        """ % days

        result = await self.db.execute(query)
        deleted_count = int(result.split()[-1]) if result else 0

        if deleted_count > 0:
            logger.info(f"Cleaned up {deleted_count} old blog writer tasks")

        return deleted_count

    @log_function_call("get_user_tasks")
    async def get_user_tasks(
        self,
        user_id: str,
        limit: int = 50,
        offset: int = 0,
        status_filter: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Get tasks for a specific user."""
        query = """
            SELECT
                id, task_type, status, created_at, updated_at, completed_at,
                operation, retry_count, max_retries, priority
            FROM blog_writer_tasks
            WHERE user_id = $1
        """

        params = [user_id]
        param_count = 1

        if status_filter:
            param_count += 1
            query += f" AND status = ${param_count}"
            params.append(status_filter)

        query += f" ORDER BY created_at DESC LIMIT ${param_count + 1} OFFSET ${param_count + 2}"
        params.extend([limit, offset])

        rows = await self.db.fetch(query, *params)

        return [
            {
                "task_id": row["id"],
                "task_type": row["task_type"],
                "status": row["status"],
                "created_at": row["created_at"].isoformat(),
                "updated_at": row["updated_at"].isoformat(),
                "completed_at": row["completed_at"].isoformat() if row["completed_at"] else None,
                "operation": row["operation"],
                "retry_count": row["retry_count"],
                "max_retries": row["max_retries"],
                "priority": row["priority"]
            }
            for row in rows
        ]

    @log_function_call("get_task_analytics")
    async def get_task_analytics(self, days: int = 7) -> Dict[str, Any]:
        """Get task analytics for monitoring."""
        query = """
            SELECT
                task_type,
                status,
                COUNT(*) as task_count,
                AVG(EXTRACT(EPOCH FROM (COALESCE(completed_at, NOW()) - created_at))) as avg_duration_seconds,
                COUNT(CASE WHEN status = 'completed' THEN 1 END) as completed_count,
                COUNT(CASE WHEN status = 'failed' THEN 1 END) as failed_count,
                COUNT(CASE WHEN status = 'running' THEN 1 END) as running_count
            FROM blog_writer_tasks
            WHERE created_at >= NOW() - INTERVAL '%s days'
            GROUP BY task_type, status
            ORDER BY task_type, status
        """ % days

        rows = await self.db.fetch(query)

        analytics = {
            "summary": {
                "total_tasks": sum(row["task_count"] for row in rows),
                "completed_tasks": sum(row["completed_count"] for row in rows),
                "failed_tasks": sum(row["failed_count"] for row in rows),
                "running_tasks": sum(row["running_count"] for row in rows)
            },
            "by_task_type": {},
            "by_status": {}
        }

        for row in rows:
            task_type = row["task_type"]
            status = row["status"]

            if task_type not in analytics["by_task_type"]:
                analytics["by_task_type"][task_type] = {}

            analytics["by_task_type"][task_type][status] = {
                "count": row["task_count"],
                "avg_duration_seconds": float(row["avg_duration_seconds"]) if row["avg_duration_seconds"] else 0
            }

            if status not in analytics["by_status"]:
                analytics["by_status"][status] = 0
            analytics["by_status"][status] += row["task_count"]

        return analytics

    # Task execution methods (same as original but with database persistence)
    async def start_research_task(self, request: BlogResearchRequest, user_id: str) -> str:
        """Start a research operation and return a task ID."""
        task_id = await self.create_task(
            user_id=user_id,
            task_type="research",
            request_data=request.dict(),
            operation="research_operation"
        )

        # Start the research operation in the background, threading user_id through
        asyncio.create_task(self._run_research_task(task_id, request, user_id))

        return task_id

    async def start_outline_task(self, request: BlogOutlineRequest, user_id: str) -> str:
        """Start an outline generation operation and return a task ID."""
        task_id = await self.create_task(
            user_id=user_id,
            task_type="outline",
            request_data=request.dict(),
            operation="outline_generation"
        )

        # Start the outline generation operation in the background
        asyncio.create_task(self._run_outline_generation_task(task_id, request, user_id))

        return task_id

    async def start_medium_generation_task(self, request: MediumBlogGenerateRequest, user_id: str) -> str:
        """Start a medium blog generation task."""
        task_id = await self.create_task(
            user_id=user_id,
            task_type="medium_generation",
            request_data=request.dict(),
            operation="medium_blog_generation"
        )

        asyncio.create_task(self._run_medium_generation_task(task_id, request, user_id))
        return task_id

    async def _run_research_task(self, task_id: str, request: BlogResearchRequest, user_id: str):
        """Background task to run research and update status with progress messages."""
        try:
            await self.update_progress(task_id, "🔍 Starting research operation...", 0)

            # Run the actual research with progress updates (user_id is required by the service)
            result = await self.service.research_with_progress(request, task_id, user_id)

            # Check if research failed gracefully
            if not result.success:
                await self.update_progress(
                    task_id,
                    f"❌ Research failed: {result.error_message or 'Unknown error'}",
                    100,
                    "error"
                )
                await self.update_task_status(
                    task_id,
                    "failed",
                    error_data={
                        "error_message": result.error_message,
                        "retry_suggested": result.retry_suggested,
                        "error_code": result.error_code,
                        "actionable_steps": result.actionable_steps
                    }
                )
            else:
                await self.update_progress(
                    task_id,
                    f"✅ Research completed successfully! Found {len(result.sources)} sources and {len(result.search_queries or [])} search queries.",
                    100,
                    "success"
                )
                await self.update_task_status(
                    task_id,
                    "completed",
                    result_data=result.dict()
                )

        except Exception as e:
            await self.update_progress(task_id, f"❌ Research failed with error: {str(e)}", 100, "error")
            await self.update_task_status(
                task_id,
                "failed",
                error_data={"error_message": str(e), "error_type": type(e).__name__}
            )
            blog_writer_logger.log_error(e, "research_task", context={"task_id": task_id})

    async def _run_outline_generation_task(self, task_id: str, request: BlogOutlineRequest, user_id: str):
        """Background task to run outline generation and update status with progress messages."""
        try:
            await self.update_progress(task_id, "🧩 Starting outline generation...", 0)

            # Run the actual outline generation with progress updates
            result = await self.service.generate_outline_with_progress(request, task_id, user_id)

            await self.update_progress(
                task_id,
                f"✅ Outline generated successfully! Created {len(result.outline)} sections with {len(result.title_options)} title options.",
                100,
                "success"
            )
            await self.update_task_status(task_id, "completed", result_data=result.dict())

        except Exception as e:
            await self.update_progress(task_id, f"❌ Outline generation failed: {str(e)}", 100, "error")
            await self.update_task_status(
                task_id,
                "failed",
                error_data={"error_message": str(e), "error_type": type(e).__name__}
            )
            blog_writer_logger.log_error(e, "outline_generation_task", context={"task_id": task_id})

    async def _run_medium_generation_task(self, task_id: str, request: MediumBlogGenerateRequest, user_id: str):
        """Background task to generate a medium blog using a single structured JSON call."""
        try:
            await self.update_progress(task_id, "📦 Packaging outline and metadata...", 0)

            # Basic guard: respect global target words
            total_target = int(request.globalTargetWords or 1000)
            if total_target > 1000:
                raise ValueError("Global target words exceed 1000; medium generation not allowed")

            result: MediumBlogGenerateResult = await self.service.generate_medium_blog_with_progress(
                request,
                task_id,
                user_id,
            )

            if not result or not getattr(result, "sections", None):
                raise ValueError("Empty generation result from model")

            # Check if result came from cache
            cache_hit = getattr(result, 'cache_hit', False)
            if cache_hit:
                await self.update_progress(task_id, "⚡ Found cached content - loading instantly!", 100, "success")
            else:
                await self.update_progress(task_id, "🤖 Generated fresh content with AI...", 100, "success")

            await self.update_task_status(task_id, "completed", result_data=result.dict())

        except Exception as e:
            await self.update_progress(task_id, f"❌ Medium generation failed: {str(e)}", 100, "error")
            await self.update_task_status(
                task_id,
                "failed",
                error_data={"error_message": str(e), "error_type": type(e).__name__}
            )
            blog_writer_logger.log_error(e, "medium_generation_task", context={"task_id": task_id})
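A sketch of wiring the task manager up. The `$1`-style placeholders and the execute/fetch/fetchrow/fetchval calls above suggest an asyncpg-style connection, which is an assumption here, as are the DSN and the BlogResearchRequest fields:

```python
import asyncio

import asyncpg  # assumption: an asyncpg-compatible connection object

from services.blog_writer.database_task_manager import DatabaseTaskManager
from models.blog_models import BlogResearchRequest

async def main():
    db = await asyncpg.connect("postgresql://user:pass@localhost/alwrity")  # hypothetical DSN
    manager = DatabaseTaskManager(db)
    request = BlogResearchRequest(keywords=["ai blog writing"])  # field name is an assumption
    task_id = await manager.start_research_task(request, user_id="user_123")
    print(await manager.get_task_status(task_id))

asyncio.run(main())
```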
285
backend/services/blog_writer/exceptions.py
Normal file
@@ -0,0 +1,285 @@
"""
Blog Writer Exception Hierarchy

Defines custom exception classes for different failure modes in the AI Blog Writer.
Each exception includes error_code, user_message, retry_suggested, and actionable_steps.
"""

from typing import List, Optional, Dict, Any
from enum import Enum


class ErrorCategory(Enum):
    """Categories for error classification."""
    TRANSIENT = "transient"  # Temporary issues, retry recommended
    PERMANENT = "permanent"  # Permanent issues, no retry
    USER_ERROR = "user_error"  # User input issues, fix input
    API_ERROR = "api_error"  # External API issues
    VALIDATION_ERROR = "validation_error"  # Data validation issues
    SYSTEM_ERROR = "system_error"  # Internal system issues


class BlogWriterException(Exception):
    """Base exception for all Blog Writer errors."""

    def __init__(
        self,
        message: str,
        error_code: str,
        user_message: str,
        retry_suggested: bool = False,
        actionable_steps: Optional[List[str]] = None,
        error_category: ErrorCategory = ErrorCategory.SYSTEM_ERROR,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(message)
        self.error_code = error_code
        self.user_message = user_message
        self.retry_suggested = retry_suggested
        self.actionable_steps = actionable_steps or []
        self.error_category = error_category
        self.context = context or {}

    def to_dict(self) -> Dict[str, Any]:
        """Convert exception to dictionary for API responses."""
        return {
            "error_code": self.error_code,
            "user_message": self.user_message,
            "retry_suggested": self.retry_suggested,
            "actionable_steps": self.actionable_steps,
            "error_category": self.error_category.value,
            "context": self.context
        }


class ResearchFailedException(BlogWriterException):
    """Raised when research operation fails."""

    def __init__(
        self,
        message: str,
        user_message: str = "Research failed. Please try again with different keywords or check your internet connection.",
        retry_suggested: bool = True,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="RESEARCH_FAILED",
            user_message=user_message,
            retry_suggested=retry_suggested,
            actionable_steps=[
                "Try with different keywords",
                "Check your internet connection",
                "Wait a few minutes and try again",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.API_ERROR,
            context=context
        )


class OutlineGenerationException(BlogWriterException):
    """Raised when outline generation fails."""

    def __init__(
        self,
        message: str,
        user_message: str = "Outline generation failed. Please try again or adjust your research data.",
        retry_suggested: bool = True,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="OUTLINE_GENERATION_FAILED",
            user_message=user_message,
            retry_suggested=retry_suggested,
            actionable_steps=[
                "Try generating outline again",
                "Check if research data is complete",
                "Try with different research keywords",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.API_ERROR,
            context=context
        )


class ContentGenerationException(BlogWriterException):
    """Raised when content generation fails."""

    def __init__(
        self,
        message: str,
        user_message: str = "Content generation failed. Please try again or adjust your outline.",
        retry_suggested: bool = True,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="CONTENT_GENERATION_FAILED",
            user_message=user_message,
            retry_suggested=retry_suggested,
            actionable_steps=[
                "Try generating content again",
                "Check if outline is complete",
                "Try with a shorter outline",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.API_ERROR,
            context=context
        )


class SEOAnalysisException(BlogWriterException):
    """Raised when SEO analysis fails."""

    def __init__(
        self,
        message: str,
        user_message: str = "SEO analysis failed. Content was generated but SEO optimization is unavailable.",
        retry_suggested: bool = True,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="SEO_ANALYSIS_FAILED",
            user_message=user_message,
            retry_suggested=retry_suggested,
            actionable_steps=[
                "Try SEO analysis again",
                "Continue without SEO optimization",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.API_ERROR,
            context=context
        )


class APIRateLimitException(BlogWriterException):
    """Raised when API rate limit is exceeded."""

    def __init__(
        self,
        message: str,
        retry_after: Optional[int] = None,
        context: Optional[Dict[str, Any]] = None
    ):
        retry_message = f"Rate limit exceeded. Please wait {retry_after} seconds before trying again." if retry_after else "Rate limit exceeded. Please wait a few minutes before trying again."

        super().__init__(
            message=message,
            error_code="API_RATE_LIMIT",
            user_message=retry_message,
            retry_suggested=True,
            actionable_steps=[
                f"Wait {retry_after or 60} seconds before trying again",
                "Reduce the frequency of requests",
                "Try again during off-peak hours",
                "Contact support if you need higher limits"
            ],
            error_category=ErrorCategory.API_ERROR,
            context=context
        )


class APITimeoutException(BlogWriterException):
    """Raised when API request times out."""

    def __init__(
        self,
        message: str,
        timeout_seconds: int = 60,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="API_TIMEOUT",
            user_message=f"Request timed out after {timeout_seconds} seconds. Please try again.",
            retry_suggested=True,
            actionable_steps=[
                "Try again with a shorter request",
                "Check your internet connection",
                "Try again during off-peak hours",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.TRANSIENT,
            context=context
        )


class ValidationException(BlogWriterException):
    """Raised when input validation fails."""

    def __init__(
        self,
        message: str,
        field: str,
        user_message: str = "Invalid input provided. Please check your data and try again.",
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="VALIDATION_ERROR",
            user_message=user_message,
            retry_suggested=False,
            actionable_steps=[
                f"Check the {field} field",
                "Ensure all required fields are filled",
                "Verify data format is correct",
                "Contact support if you need help"
            ],
            error_category=ErrorCategory.USER_ERROR,
            context=context
        )


class CircuitBreakerOpenException(BlogWriterException):
    """Raised when circuit breaker is open."""

    def __init__(
        self,
        message: str,
        retry_after: int,
        context: Optional[Dict[str, Any]] = None
    ):
        super().__init__(
            message=message,
            error_code="CIRCUIT_BREAKER_OPEN",
            user_message=f"Service temporarily unavailable. Please wait {retry_after} seconds before trying again.",
            retry_suggested=True,
            actionable_steps=[
                f"Wait {retry_after} seconds before trying again",
                "Try again during off-peak hours",
                "Contact support if the issue persists"
            ],
            error_category=ErrorCategory.TRANSIENT,
            context=context
        )
class PartialSuccessException(BlogWriterException):
|
||||
"""Raised when operation partially succeeds."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str,
|
||||
partial_results: Dict[str, Any],
|
||||
failed_operations: List[str],
|
||||
user_message: str = "Operation partially completed. Some sections were generated successfully.",
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
):
|
||||
super().__init__(
|
||||
message=message,
|
||||
error_code="PARTIAL_SUCCESS",
|
||||
user_message=user_message,
|
||||
retry_suggested=True,
|
||||
actionable_steps=[
|
||||
"Review the generated content",
|
||||
"Retry failed sections individually",
|
||||
"Contact support if you need help with failed sections"
|
||||
],
|
||||
error_category=ErrorCategory.TRANSIENT,
|
||||
context=context
|
||||
)
|
||||
self.partial_results = partial_results
|
||||
self.failed_operations = failed_operations
|
||||
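These exception classes standardize the user-facing payload (error code, message, retry hint, actionable steps), so an API layer can forward `to_dict()` directly to the client. A minimal sketch of that wiring, assuming a FastAPI app; the import path, route registration, and status-code mapping are illustrative guesses, not part of this commit:

```python
# Hypothetical wiring: module path, app object, and status-code mapping are assumptions.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from services.blog_writer.exceptions import BlogWriterException, ErrorCategory  # path assumed

app = FastAPI()

@app.exception_handler(BlogWriterException)
async def blog_writer_exception_handler(request: Request, exc: BlogWriterException):
    # User errors map to 400; transient/API failures suggest a retry via 503.
    status_code = 400 if exc.error_category == ErrorCategory.USER_ERROR else 503
    return JSONResponse(status_code=status_code, content=exc.to_dict())
```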
298
backend/services/blog_writer/logger_config.py
Normal file
@@ -0,0 +1,298 @@
|
||||
"""
|
||||
Structured Logging Configuration for Blog Writer
|
||||
|
||||
Configures structured JSON logging with correlation IDs, context tracking,
|
||||
and performance metrics for the AI Blog Writer system.
|
||||
"""
|
||||
|
||||
import json
|
||||
import uuid
|
||||
import time
|
||||
import sys
|
||||
from typing import Dict, Any, Optional
|
||||
from contextvars import ContextVar
|
||||
from loguru import logger
|
||||
from datetime import datetime
|
||||
|
||||
# Context variables for request tracking
|
||||
correlation_id: ContextVar[str] = ContextVar('correlation_id', default='')
|
||||
user_id: ContextVar[str] = ContextVar('user_id', default='')
|
||||
task_id: ContextVar[str] = ContextVar('task_id', default='')
|
||||
operation: ContextVar[str] = ContextVar('operation', default='')
|
||||
|
||||
|
||||
class BlogWriterLogger:
|
||||
"""Enhanced logger for Blog Writer with structured logging and context tracking."""
|
||||
|
||||
def __init__(self):
|
||||
self._setup_logger()
|
||||
|
||||
def _setup_logger(self):
|
||||
"""Configure loguru with structured JSON output."""
|
||||
from utils.logger_utils import get_service_logger
|
||||
return get_service_logger("blog_writer")
|
||||
|
||||
def _json_formatter(self, record):
|
||||
"""Format log record as structured JSON."""
|
||||
# Extract context variables
|
||||
correlation_id_val = correlation_id.get('')
|
||||
user_id_val = user_id.get('')
|
||||
task_id_val = task_id.get('')
|
||||
operation_val = operation.get('')
|
||||
|
||||
# Build structured log entry
|
||||
log_entry = {
|
||||
"timestamp": datetime.fromtimestamp(record["time"].timestamp()).isoformat(),
|
||||
"level": record["level"].name,
|
||||
"logger": record["name"],
|
||||
"function": record["function"],
|
||||
"line": record["line"],
|
||||
"message": record["message"],
|
||||
"correlation_id": correlation_id_val,
|
||||
"user_id": user_id_val,
|
||||
"task_id": task_id_val,
|
||||
"operation": operation_val,
|
||||
"module": record["module"],
|
||||
"process_id": record["process"].id,
|
||||
"thread_id": record["thread"].id
|
||||
}
|
||||
|
||||
# Add exception info if present
|
||||
if record["exception"]:
|
||||
log_entry["exception"] = {
|
||||
"type": record["exception"].type.__name__,
|
||||
"value": str(record["exception"].value),
|
||||
"traceback": record["exception"].traceback
|
||||
}
|
||||
|
||||
# Add extra fields from record
|
||||
if record["extra"]:
|
||||
log_entry.update(record["extra"])
|
||||
|
||||
return json.dumps(log_entry, default=str)
|
||||
|
||||
def set_context(
|
||||
self,
|
||||
correlation_id_val: Optional[str] = None,
|
||||
user_id_val: Optional[str] = None,
|
||||
task_id_val: Optional[str] = None,
|
||||
operation_val: Optional[str] = None
|
||||
):
|
||||
"""Set context variables for the current request."""
|
||||
if correlation_id_val:
|
||||
correlation_id.set(correlation_id_val)
|
||||
if user_id_val:
|
||||
user_id.set(user_id_val)
|
||||
if task_id_val:
|
||||
task_id.set(task_id_val)
|
||||
if operation_val:
|
||||
operation.set(operation_val)
|
||||
|
||||
def clear_context(self):
|
||||
"""Clear all context variables."""
|
||||
correlation_id.set('')
|
||||
user_id.set('')
|
||||
task_id.set('')
|
||||
operation.set('')
|
||||
|
||||
def generate_correlation_id(self) -> str:
|
||||
"""Generate a new correlation ID."""
|
||||
return str(uuid.uuid4())
|
||||
|
||||
def log_operation_start(
|
||||
self,
|
||||
operation_name: str,
|
||||
**kwargs
|
||||
):
|
||||
"""Log the start of an operation with context."""
|
||||
logger.info(
|
||||
f"Starting {operation_name}",
|
||||
extra={
|
||||
"operation": operation_name,
|
||||
"event_type": "operation_start",
|
||||
**kwargs
|
||||
}
|
||||
)
|
||||
|
||||
def log_operation_end(
|
||||
self,
|
||||
operation_name: str,
|
||||
duration_ms: float,
|
||||
success: bool = True,
|
||||
**kwargs
|
||||
):
|
||||
"""Log the end of an operation with performance metrics."""
|
||||
logger.info(
|
||||
f"Completed {operation_name} in {duration_ms:.2f}ms",
|
||||
extra={
|
||||
"operation": operation_name,
|
||||
"event_type": "operation_end",
|
||||
"duration_ms": duration_ms,
|
||||
"success": success,
|
||||
**kwargs
|
||||
}
|
||||
)
|
||||
|
||||
def log_api_call(
|
||||
self,
|
||||
api_name: str,
|
||||
endpoint: str,
|
||||
duration_ms: float,
|
||||
status_code: Optional[int] = None,
|
||||
token_usage: Optional[Dict[str, int]] = None,
|
||||
**kwargs
|
||||
):
|
||||
"""Log API call with performance metrics."""
|
||||
logger.info(
|
||||
f"API call to {api_name}",
|
||||
extra={
|
||||
"event_type": "api_call",
|
||||
"api_name": api_name,
|
||||
"endpoint": endpoint,
|
||||
"duration_ms": duration_ms,
|
||||
"status_code": status_code,
|
||||
"token_usage": token_usage,
|
||||
**kwargs
|
||||
}
|
||||
)
|
||||
|
||||
def log_error(
|
||||
self,
|
||||
error: Exception,
|
||||
operation: str,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
):
|
||||
"""Log error with full context."""
|
||||
# Safely format error message to avoid KeyError on format strings in error messages
|
||||
error_str = str(error)
|
||||
# Replace any curly braces that might be in the error message to avoid format string issues
|
||||
safe_error_str = error_str.replace('{', '{{').replace('}', '}}')
|
||||
|
||||
logger.error(
|
||||
f"Error in {operation}: {safe_error_str}",
|
||||
extra={
|
||||
"event_type": "error",
|
||||
"operation": operation,
|
||||
"error_type": type(error).__name__,
|
||||
"error_message": error_str, # Keep original in extra, but use safe version in format string
|
||||
"context": context or {}
|
||||
},
|
||||
exc_info=True
|
||||
)
|
||||
|
||||
def log_performance(
|
||||
self,
|
||||
metric_name: str,
|
||||
value: float,
|
||||
unit: str = "ms",
|
||||
**kwargs
|
||||
):
|
||||
"""Log performance metrics."""
|
||||
logger.info(
|
||||
f"Performance metric: {metric_name} = {value} {unit}",
|
||||
extra={
|
||||
"event_type": "performance",
|
||||
"metric_name": metric_name,
|
||||
"value": value,
|
||||
"unit": unit,
|
||||
**kwargs
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
# Global logger instance
|
||||
blog_writer_logger = BlogWriterLogger()
|
||||
|
||||
|
||||
def get_logger(name: str = "blog_writer"):
|
||||
"""Get a logger instance with the given name."""
|
||||
return logger.bind(name=name)
|
||||
|
||||
|
||||
def log_function_call(func_name: str, **kwargs):
|
||||
"""Decorator to log function calls with timing."""
|
||||
def decorator(func):
|
||||
async def async_wrapper(*args, **func_kwargs):
|
||||
start_time = time.time()
|
||||
correlation_id_val = correlation_id.get('')
|
||||
|
||||
blog_writer_logger.log_operation_start(
|
||||
func_name,
|
||||
function=func.__name__,
|
||||
correlation_id=correlation_id_val,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
try:
|
||||
result = await func(*args, **func_kwargs)
|
||||
duration_ms = (time.time() - start_time) * 1000
|
||||
|
||||
blog_writer_logger.log_operation_end(
|
||||
func_name,
|
||||
duration_ms,
|
||||
success=True,
|
||||
function=func.__name__,
|
||||
correlation_id=correlation_id_val
|
||||
)
|
||||
|
||||
return result
|
||||
except Exception as e:
|
||||
duration_ms = (time.time() - start_time) * 1000
|
||||
|
||||
blog_writer_logger.log_error(
|
||||
e,
|
||||
func_name,
|
||||
context={
|
||||
"function": func.__name__,
|
||||
"duration_ms": duration_ms,
|
||||
"correlation_id": correlation_id_val
|
||||
}
|
||||
)
|
||||
raise
|
||||
|
||||
def sync_wrapper(*args, **func_kwargs):
|
||||
start_time = time.time()
|
||||
correlation_id_val = correlation_id.get('')
|
||||
|
||||
blog_writer_logger.log_operation_start(
|
||||
func_name,
|
||||
function=func.__name__,
|
||||
correlation_id=correlation_id_val,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
try:
|
||||
result = func(*args, **func_kwargs)
|
||||
duration_ms = (time.time() - start_time) * 1000
|
||||
|
||||
blog_writer_logger.log_operation_end(
|
||||
func_name,
|
||||
duration_ms,
|
||||
success=True,
|
||||
function=func.__name__,
|
||||
correlation_id=correlation_id_val
|
||||
)
|
||||
|
||||
return result
|
||||
except Exception as e:
|
||||
duration_ms = (time.time() - start_time) * 1000
|
||||
|
||||
blog_writer_logger.log_error(
|
||||
e,
|
||||
func_name,
|
||||
context={
|
||||
"function": func.__name__,
|
||||
"duration_ms": duration_ms,
|
||||
"correlation_id": correlation_id_val
|
||||
}
|
||||
)
|
||||
raise
|
||||
|
||||
# Return appropriate wrapper based on function type
|
||||
import asyncio
|
||||
if asyncio.iscoroutinefunction(func):
|
||||
return async_wrapper
|
||||
else:
|
||||
return sync_wrapper
|
||||
|
||||
return decorator
|
||||
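The logger above combines context variables (correlation, user, task, operation) with structured `extra` fields and a timing decorator. A short usage sketch with made-up IDs; the import path follows this file's location and may need adjusting:

```python
# Usage sketch only; IDs are example values and the module path is assumed.
import asyncio

from services.blog_writer.logger_config import blog_writer_logger, log_function_call

@log_function_call("research_topic", component="research")
async def research_topic(keywords):
    # Real research work would happen here.
    return {"keywords": keywords}

async def main():
    blog_writer_logger.set_context(
        correlation_id_val=blog_writer_logger.generate_correlation_id(),
        user_id_val="user-123",
        operation_val="research",
    )
    try:
        await research_topic(["ai blog writer"])
    finally:
        blog_writer_logger.clear_context()

asyncio.run(main())
```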
25
backend/services/blog_writer/outline/__init__.py
Normal file
@@ -0,0 +1,25 @@
"""
Outline module for AI Blog Writer.

This module handles all outline-related functionality including:
- AI-powered outline generation
- Outline refinement and optimization
- Section enhancement and rebalancing
- Strategic content planning
"""

from .outline_service import OutlineService
from .outline_generator import OutlineGenerator
from .outline_optimizer import OutlineOptimizer
from .section_enhancer import SectionEnhancer
from .source_mapper import SourceToSectionMapper
from .grounding_engine import GroundingContextEngine

__all__ = [
    'OutlineService',
    'OutlineGenerator',
    'OutlineOptimizer',
    'SectionEnhancer',
    'SourceToSectionMapper',
    'GroundingContextEngine'
]
644
backend/services/blog_writer/outline/grounding_engine.py
Normal file
@@ -0,0 +1,644 @@
|
||||
"""
|
||||
Grounding Context Engine - Enhanced utilization of grounding metadata.
|
||||
|
||||
This module extracts and utilizes rich contextual information from Google Search
|
||||
grounding metadata to enhance outline generation with authoritative insights,
|
||||
temporal relevance, and content relationships.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List, Tuple, Optional
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime, timedelta
|
||||
import re
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
GroundingMetadata,
|
||||
GroundingChunk,
|
||||
GroundingSupport,
|
||||
Citation,
|
||||
BlogOutlineSection,
|
||||
ResearchSource,
|
||||
)
|
||||
|
||||
|
||||
class GroundingContextEngine:
|
||||
"""Extract and utilize rich context from grounding metadata."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the grounding context engine."""
|
||||
self.min_confidence_threshold = 0.7
|
||||
self.high_confidence_threshold = 0.9
|
||||
self.max_contextual_insights = 10
|
||||
self.max_authority_sources = 5
|
||||
|
||||
# Authority indicators for source scoring
|
||||
self.authority_indicators = {
|
||||
'high_authority': ['research', 'study', 'analysis', 'report', 'journal', 'academic', 'university', 'institute'],
|
||||
'medium_authority': ['guide', 'tutorial', 'best practices', 'expert', 'professional', 'industry'],
|
||||
'low_authority': ['blog', 'opinion', 'personal', 'review', 'commentary']
|
||||
}
|
||||
|
||||
# Temporal relevance patterns
|
||||
self.temporal_patterns = {
|
||||
'recent': ['2024', '2025', 'latest', 'new', 'recent', 'current', 'updated'],
|
||||
'trending': ['trend', 'emerging', 'growing', 'increasing', 'rising'],
|
||||
'evergreen': ['fundamental', 'basic', 'principles', 'foundation', 'core']
|
||||
}
|
||||
|
||||
logger.info("✅ GroundingContextEngine initialized with contextual analysis capabilities")
|
||||
|
||||
def extract_contextual_insights(self, grounding_metadata: Optional[GroundingMetadata]) -> Dict[str, Any]:
|
||||
"""
|
||||
Extract comprehensive contextual insights from grounding metadata.
|
||||
|
||||
Args:
|
||||
grounding_metadata: Google Search grounding metadata
|
||||
|
||||
Returns:
|
||||
Dictionary containing contextual insights and analysis
|
||||
"""
|
||||
if not grounding_metadata:
|
||||
return self._get_empty_insights()
|
||||
|
||||
logger.info("Extracting contextual insights from grounding metadata...")
|
||||
|
||||
insights = {
|
||||
'confidence_analysis': self._analyze_confidence_patterns(grounding_metadata),
|
||||
'authority_analysis': self._analyze_source_authority(grounding_metadata),
|
||||
'temporal_analysis': self._analyze_temporal_relevance(grounding_metadata),
|
||||
'content_relationships': self._analyze_content_relationships(grounding_metadata),
|
||||
'citation_insights': self._analyze_citation_patterns(grounding_metadata),
|
||||
'search_intent_insights': self._analyze_search_intent(grounding_metadata),
|
||||
'quality_indicators': self._assess_quality_indicators(grounding_metadata)
|
||||
}
|
||||
|
||||
logger.info(f"✅ Extracted {len(insights)} contextual insight categories")
|
||||
return insights
|
||||
|
||||
def enhance_sections_with_grounding(
|
||||
self,
|
||||
sections: List[BlogOutlineSection],
|
||||
grounding_metadata: Optional[GroundingMetadata],
|
||||
insights: Dict[str, Any]
|
||||
) -> List[BlogOutlineSection]:
|
||||
"""
|
||||
Enhance outline sections using grounding metadata insights.
|
||||
|
||||
Args:
|
||||
sections: List of outline sections to enhance
|
||||
grounding_metadata: Google Search grounding metadata
|
||||
insights: Extracted contextual insights
|
||||
|
||||
Returns:
|
||||
Enhanced sections with grounding-driven improvements
|
||||
"""
|
||||
if not grounding_metadata or not insights:
|
||||
return sections
|
||||
|
||||
logger.info(f"Enhancing {len(sections)} sections with grounding insights...")
|
||||
|
||||
enhanced_sections = []
|
||||
for section in sections:
|
||||
enhanced_section = self._enhance_single_section(section, grounding_metadata, insights)
|
||||
enhanced_sections.append(enhanced_section)
|
||||
|
||||
logger.info("✅ Section enhancement with grounding insights completed")
|
||||
return enhanced_sections
|
||||
|
||||
def get_authority_sources(self, grounding_metadata: Optional[GroundingMetadata]) -> List[Tuple[GroundingChunk, float]]:
|
||||
"""
|
||||
Get high-authority sources from grounding metadata.
|
||||
|
||||
Args:
|
||||
grounding_metadata: Google Search grounding metadata
|
||||
|
||||
Returns:
|
||||
List of (chunk, authority_score) tuples sorted by authority
|
||||
"""
|
||||
if not grounding_metadata:
|
||||
return []
|
||||
|
||||
authority_sources = []
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
authority_score = self._calculate_chunk_authority(chunk)
|
||||
if authority_score >= 0.6: # Only include sources with reasonable authority
|
||||
authority_sources.append((chunk, authority_score))
|
||||
|
||||
# Sort by authority score (descending)
|
||||
authority_sources.sort(key=lambda x: x[1], reverse=True)
|
||||
|
||||
return authority_sources[:self.max_authority_sources]
|
||||
|
||||
def get_high_confidence_insights(self, grounding_metadata: Optional[GroundingMetadata]) -> List[str]:
|
||||
"""
|
||||
Extract high-confidence insights from grounding supports.
|
||||
|
||||
Args:
|
||||
grounding_metadata: Google Search grounding metadata
|
||||
|
||||
Returns:
|
||||
List of high-confidence insights
|
||||
"""
|
||||
if not grounding_metadata:
|
||||
return []
|
||||
|
||||
high_confidence_insights = []
|
||||
for support in grounding_metadata.grounding_supports:
|
||||
if support.confidence_scores and max(support.confidence_scores) >= self.high_confidence_threshold:
|
||||
# Extract meaningful insights from segment text
|
||||
insight = self._extract_insight_from_segment(support.segment_text)
|
||||
if insight:
|
||||
high_confidence_insights.append(insight)
|
||||
|
||||
return high_confidence_insights[:self.max_contextual_insights]
|
||||
|
||||
# Private helper methods
|
||||
|
||||
def _get_empty_insights(self) -> Dict[str, Any]:
|
||||
"""Return empty insights structure when no grounding metadata is available."""
|
||||
return {
|
||||
'confidence_analysis': {
|
||||
'average_confidence': 0.0,
|
||||
'high_confidence_sources_count': 0,
|
||||
'confidence_distribution': {'high': 0, 'medium': 0, 'low': 0}
|
||||
},
|
||||
'authority_analysis': {
|
||||
'average_authority_score': 0.0,
|
||||
'high_authority_sources': [],
|
||||
'authority_distribution': {'high': 0, 'medium': 0, 'low': 0}
|
||||
},
|
||||
'temporal_analysis': {
|
||||
'recent_content': 0,
|
||||
'trending_topics': [],
|
||||
'evergreen_content': 0
|
||||
},
|
||||
'content_relationships': {
|
||||
'related_concepts': [],
|
||||
'content_gaps': [],
|
||||
'concept_coverage_score': 0.0
|
||||
},
|
||||
'citation_insights': {
|
||||
'citation_types': {},
|
||||
'citation_density': 0.0
|
||||
},
|
||||
'search_intent_insights': {
|
||||
'primary_intent': 'informational',
|
||||
'intent_signals': [],
|
||||
'user_questions': []
|
||||
},
|
||||
'quality_indicators': {
|
||||
'overall_quality': 0.0,
|
||||
'quality_factors': []
|
||||
}
|
||||
}
|
||||
|
||||
def _analyze_confidence_patterns(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Analyze confidence patterns across grounding data."""
|
||||
all_confidences = []
|
||||
|
||||
# Collect confidence scores from chunks
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
if chunk.confidence_score:
|
||||
all_confidences.append(chunk.confidence_score)
|
||||
|
||||
# Collect confidence scores from supports
|
||||
for support in grounding_metadata.grounding_supports:
|
||||
all_confidences.extend(support.confidence_scores)
|
||||
|
||||
if not all_confidences:
|
||||
return {
|
||||
'average_confidence': 0.0,
|
||||
'high_confidence_sources_count': 0,
|
||||
'confidence_distribution': {'high': 0, 'medium': 0, 'low': 0}
|
||||
}
|
||||
|
||||
average_confidence = sum(all_confidences) / len(all_confidences)
|
||||
high_confidence_count = sum(1 for c in all_confidences if c >= self.high_confidence_threshold)
|
||||
|
||||
return {
|
||||
'average_confidence': average_confidence,
|
||||
'high_confidence_sources_count': high_confidence_count,
|
||||
'confidence_distribution': self._get_confidence_distribution(all_confidences)
|
||||
}
|
||||
|
||||
def _analyze_source_authority(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
    """Analyze source authority patterns."""
    authority_scores = []
    authority_distribution = defaultdict(int)
    high_authority_sources = []

    for chunk in grounding_metadata.grounding_chunks:
        authority_score = self._calculate_chunk_authority(chunk)
        authority_scores.append(authority_score)

        # Categorize authority level and keep the actual high-authority entries
        if authority_score >= 0.8:
            authority_distribution['high'] += 1
            high_authority_sources.append({
                'title': chunk.title,
                'url': chunk.url,
                'score': round(authority_score, 2)
            })
        elif authority_score >= 0.6:
            authority_distribution['medium'] += 1
        else:
            authority_distribution['low'] += 1

    return {
        'average_authority_score': sum(authority_scores) / len(authority_scores) if authority_scores else 0.0,
        'high_authority_sources': high_authority_sources[:self.max_authority_sources],
        'authority_distribution': dict(authority_distribution)
    }
|
||||
|
||||
def _analyze_temporal_relevance(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Analyze temporal relevance of grounding content."""
|
||||
recent_content = 0
|
||||
trending_topics = []
|
||||
evergreen_content = 0
|
||||
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
chunk_text = f"{chunk.title} {chunk.url}".lower()
|
||||
|
||||
# Check for recent indicators
|
||||
if any(pattern in chunk_text for pattern in self.temporal_patterns['recent']):
|
||||
recent_content += 1
|
||||
|
||||
# Check for trending indicators
|
||||
if any(pattern in chunk_text for pattern in self.temporal_patterns['trending']):
|
||||
trending_topics.append(chunk.title)
|
||||
|
||||
# Check for evergreen indicators
|
||||
if any(pattern in chunk_text for pattern in self.temporal_patterns['evergreen']):
|
||||
evergreen_content += 1
|
||||
|
||||
return {
|
||||
'recent_content': recent_content,
|
||||
'trending_topics': trending_topics[:5], # Limit to top 5
|
||||
'evergreen_content': evergreen_content,
|
||||
'temporal_balance': self._calculate_temporal_balance(recent_content, evergreen_content)
|
||||
}
|
||||
|
||||
def _analyze_content_relationships(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Analyze content relationships and identify gaps."""
|
||||
all_text = []
|
||||
|
||||
# Collect text from chunks
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
all_text.append(chunk.title)
|
||||
|
||||
# Collect text from supports
|
||||
for support in grounding_metadata.grounding_supports:
|
||||
all_text.append(support.segment_text)
|
||||
|
||||
# Extract related concepts
|
||||
related_concepts = self._extract_related_concepts(all_text)
|
||||
|
||||
# Identify potential content gaps
|
||||
content_gaps = self._identify_content_gaps(all_text)
|
||||
|
||||
# Calculate concept coverage score (0-1 scale)
|
||||
concept_coverage_score = min(1.0, len(related_concepts) / 10.0) if related_concepts else 0.0
|
||||
|
||||
return {
|
||||
'related_concepts': related_concepts,
|
||||
'content_gaps': content_gaps,
|
||||
'concept_coverage_score': concept_coverage_score,
|
||||
'gap_count': len(content_gaps)
|
||||
}
|
||||
|
||||
def _analyze_citation_patterns(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Analyze citation patterns and types."""
|
||||
citation_types = Counter()
|
||||
total_citations = len(grounding_metadata.citations)
|
||||
|
||||
for citation in grounding_metadata.citations:
|
||||
citation_types[citation.citation_type] += 1
|
||||
|
||||
# Calculate citation density (citations per 1000 words of content)
|
||||
total_content_length = sum(len(support.segment_text) for support in grounding_metadata.grounding_supports)
|
||||
citation_density = (total_citations / max(total_content_length, 1)) * 1000 if total_content_length > 0 else 0.0
|
||||
|
||||
return {
|
||||
'citation_types': dict(citation_types),
|
||||
'total_citations': total_citations,
|
||||
'citation_density': citation_density,
|
||||
'citation_quality': self._assess_citation_quality(grounding_metadata.citations)
|
||||
}
|
||||
|
||||
def _analyze_search_intent(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Analyze search intent signals from grounding data."""
|
||||
intent_signals = []
|
||||
user_questions = []
|
||||
|
||||
# Analyze search queries
|
||||
for query in grounding_metadata.web_search_queries:
|
||||
query_lower = query.lower()
|
||||
|
||||
# Identify intent signals
|
||||
if any(word in query_lower for word in ['how', 'what', 'why', 'when', 'where']):
|
||||
intent_signals.append('informational')
|
||||
elif any(word in query_lower for word in ['best', 'top', 'compare', 'vs']):
|
||||
intent_signals.append('comparison')
|
||||
elif any(word in query_lower for word in ['buy', 'price', 'cost', 'deal']):
|
||||
intent_signals.append('transactional')
|
||||
|
||||
# Extract potential user questions
|
||||
if query_lower.startswith(('how to', 'what is', 'why does', 'when should')):
|
||||
user_questions.append(query)
|
||||
|
||||
return {
|
||||
'intent_signals': list(set(intent_signals)),
|
||||
'user_questions': user_questions[:5], # Limit to top 5
|
||||
'primary_intent': self._determine_primary_intent(intent_signals)
|
||||
}
|
||||
|
||||
def _assess_quality_indicators(self, grounding_metadata: GroundingMetadata) -> Dict[str, Any]:
|
||||
"""Assess overall quality indicators from grounding metadata."""
|
||||
quality_factors = []
|
||||
quality_score = 0.0
|
||||
|
||||
# Factor 1: Confidence levels
|
||||
confidences = [chunk.confidence_score for chunk in grounding_metadata.grounding_chunks if chunk.confidence_score]
|
||||
if confidences:
|
||||
avg_confidence = sum(confidences) / len(confidences)
|
||||
quality_score += avg_confidence * 0.3
|
||||
quality_factors.append(f"Average confidence: {avg_confidence:.2f}")
|
||||
|
||||
# Factor 2: Source diversity
|
||||
unique_domains = set()
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
try:
|
||||
domain = chunk.url.split('/')[2] if '://' in chunk.url else chunk.url.split('/')[0]
|
||||
unique_domains.add(domain)
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
diversity_score = min(len(unique_domains) / 5.0, 1.0) # Normalize to 0-1
|
||||
quality_score += diversity_score * 0.2
|
||||
quality_factors.append(f"Source diversity: {len(unique_domains)} unique domains")
|
||||
|
||||
# Factor 3: Content depth
|
||||
total_content_length = sum(len(support.segment_text) for support in grounding_metadata.grounding_supports)
|
||||
depth_score = min(total_content_length / 5000.0, 1.0) # Normalize to 0-1
|
||||
quality_score += depth_score * 0.2
|
||||
quality_factors.append(f"Content depth: {total_content_length} characters")
|
||||
|
||||
# Factor 4: Citation quality
|
||||
citation_quality = self._assess_citation_quality(grounding_metadata.citations)
|
||||
quality_score += citation_quality * 0.3
|
||||
quality_factors.append(f"Citation quality: {citation_quality:.2f}")
|
||||
|
||||
return {
|
||||
'overall_quality': min(quality_score, 1.0),
|
||||
'quality_factors': quality_factors,
|
||||
'quality_grade': self._get_quality_grade(quality_score)
|
||||
}
|
||||
|
||||
def _enhance_single_section(
|
||||
self,
|
||||
section: BlogOutlineSection,
|
||||
grounding_metadata: GroundingMetadata,
|
||||
insights: Dict[str, Any]
|
||||
) -> BlogOutlineSection:
|
||||
"""Enhance a single section using grounding insights."""
|
||||
# Extract relevant grounding data for this section
|
||||
relevant_chunks = self._find_relevant_chunks(section, grounding_metadata)
|
||||
relevant_supports = self._find_relevant_supports(section, grounding_metadata)
|
||||
|
||||
# Enhance subheadings with high-confidence insights
|
||||
enhanced_subheadings = self._enhance_subheadings(section, relevant_supports, insights)
|
||||
|
||||
# Enhance key points with authoritative insights
|
||||
enhanced_key_points = self._enhance_key_points(section, relevant_chunks, insights)
|
||||
|
||||
# Enhance keywords with related concepts
|
||||
enhanced_keywords = self._enhance_keywords(section, insights)
|
||||
|
||||
return BlogOutlineSection(
|
||||
id=section.id,
|
||||
heading=section.heading,
|
||||
subheadings=enhanced_subheadings,
|
||||
key_points=enhanced_key_points,
|
||||
references=section.references,
|
||||
target_words=section.target_words,
|
||||
keywords=enhanced_keywords
|
||||
)
|
||||
|
||||
def _calculate_chunk_authority(self, chunk: GroundingChunk) -> float:
|
||||
"""Calculate authority score for a grounding chunk."""
|
||||
authority_score = 0.5 # Base score
|
||||
|
||||
chunk_text = f"{chunk.title} {chunk.url}".lower()
|
||||
|
||||
# Check for authority indicators
|
||||
for level, indicators in self.authority_indicators.items():
|
||||
for indicator in indicators:
|
||||
if indicator in chunk_text:
|
||||
if level == 'high_authority':
|
||||
authority_score += 0.3
|
||||
elif level == 'medium_authority':
|
||||
authority_score += 0.2
|
||||
else: # low_authority
|
||||
authority_score -= 0.1
|
||||
|
||||
# Boost score based on confidence
|
||||
if chunk.confidence_score:
|
||||
authority_score += chunk.confidence_score * 0.2
|
||||
|
||||
return min(max(authority_score, 0.0), 1.0)
|
||||
|
||||
def _extract_insight_from_segment(self, segment_text: str) -> Optional[str]:
|
||||
"""Extract meaningful insight from segment text."""
|
||||
if not segment_text or len(segment_text.strip()) < 20:
|
||||
return None
|
||||
|
||||
# Clean and truncate insight
|
||||
insight = segment_text.strip()
|
||||
if len(insight) > 200:
|
||||
insight = insight[:200] + "..."
|
||||
|
||||
return insight
|
||||
|
||||
def _get_confidence_distribution(self, confidences: List[float]) -> Dict[str, int]:
|
||||
"""Get distribution of confidence scores."""
|
||||
distribution = {'high': 0, 'medium': 0, 'low': 0}
|
||||
|
||||
for confidence in confidences:
|
||||
if confidence >= 0.8:
|
||||
distribution['high'] += 1
|
||||
elif confidence >= 0.6:
|
||||
distribution['medium'] += 1
|
||||
else:
|
||||
distribution['low'] += 1
|
||||
|
||||
return distribution
|
||||
|
||||
def _calculate_temporal_balance(self, recent: int, evergreen: int) -> str:
|
||||
"""Calculate temporal balance of content."""
|
||||
total = recent + evergreen
|
||||
if total == 0:
|
||||
return 'unknown'
|
||||
|
||||
recent_ratio = recent / total
|
||||
if recent_ratio > 0.7:
|
||||
return 'recent_heavy'
|
||||
elif recent_ratio < 0.3:
|
||||
return 'evergreen_heavy'
|
||||
else:
|
||||
return 'balanced'
|
||||
|
||||
def _extract_related_concepts(self, text_list: List[str]) -> List[str]:
|
||||
"""Extract related concepts from text."""
|
||||
# Simple concept extraction - could be enhanced with NLP
|
||||
concepts = set()
|
||||
|
||||
for text in text_list:
|
||||
# Extract capitalized words (potential concepts)
|
||||
words = re.findall(r'\b[A-Z][a-z]+\b', text)
|
||||
concepts.update(words)
|
||||
|
||||
return list(concepts)[:10] # Limit to top 10
|
||||
|
||||
def _identify_content_gaps(self, text_list: List[str]) -> List[str]:
|
||||
"""Identify potential content gaps."""
|
||||
# Simple gap identification - could be enhanced with more sophisticated analysis
|
||||
gaps = []
|
||||
|
||||
# Look for common gap indicators
|
||||
gap_indicators = ['missing', 'lack of', 'not covered', 'gap', 'unclear', 'unexplained']
|
||||
|
||||
for text in text_list:
|
||||
text_lower = text.lower()
|
||||
for indicator in gap_indicators:
|
||||
if indicator in text_lower:
|
||||
# Extract potential gap
|
||||
gap = self._extract_gap_from_text(text, indicator)
|
||||
if gap:
|
||||
gaps.append(gap)
|
||||
|
||||
return gaps[:5] # Limit to top 5
|
||||
|
||||
def _extract_gap_from_text(self, text: str, indicator: str) -> Optional[str]:
|
||||
"""Extract content gap from text containing gap indicator."""
|
||||
# Simple extraction - could be enhanced
|
||||
sentences = text.split('.')
|
||||
for sentence in sentences:
|
||||
if indicator in sentence.lower():
|
||||
return sentence.strip()
|
||||
return None
|
||||
|
||||
def _assess_citation_quality(self, citations: List[Citation]) -> float:
|
||||
"""Assess quality of citations."""
|
||||
if not citations:
|
||||
return 0.0
|
||||
|
||||
quality_score = 0.0
|
||||
|
||||
for citation in citations:
|
||||
# Check citation type
|
||||
if citation.citation_type in ['expert_opinion', 'statistical_data', 'research_study']:
|
||||
quality_score += 0.3
|
||||
elif citation.citation_type in ['recent_news', 'case_study']:
|
||||
quality_score += 0.2
|
||||
else:
|
||||
quality_score += 0.1
|
||||
|
||||
# Check text quality
|
||||
if len(citation.text) > 20:
|
||||
quality_score += 0.1
|
||||
|
||||
return min(quality_score / len(citations), 1.0)
|
||||
|
||||
def _determine_primary_intent(self, intent_signals: List[str]) -> str:
|
||||
"""Determine primary search intent from signals."""
|
||||
if not intent_signals:
|
||||
return 'informational'
|
||||
|
||||
intent_counts = Counter(intent_signals)
|
||||
return intent_counts.most_common(1)[0][0]
|
||||
|
||||
def _get_quality_grade(self, quality_score: float) -> str:
|
||||
"""Get quality grade from score."""
|
||||
if quality_score >= 0.9:
|
||||
return 'A'
|
||||
elif quality_score >= 0.8:
|
||||
return 'B'
|
||||
elif quality_score >= 0.7:
|
||||
return 'C'
|
||||
elif quality_score >= 0.6:
|
||||
return 'D'
|
||||
else:
|
||||
return 'F'
|
||||
|
||||
def _find_relevant_chunks(self, section: BlogOutlineSection, grounding_metadata: GroundingMetadata) -> List[GroundingChunk]:
|
||||
"""Find grounding chunks relevant to the section."""
|
||||
relevant_chunks = []
|
||||
section_text = f"{section.heading} {' '.join(section.subheadings)} {' '.join(section.key_points)}".lower()
|
||||
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
chunk_text = chunk.title.lower()
|
||||
# Simple relevance check - could be enhanced with semantic similarity
|
||||
if any(word in chunk_text for word in section_text.split() if len(word) > 3):
|
||||
relevant_chunks.append(chunk)
|
||||
|
||||
return relevant_chunks
|
||||
|
||||
def _find_relevant_supports(self, section: BlogOutlineSection, grounding_metadata: GroundingMetadata) -> List[GroundingSupport]:
|
||||
"""Find grounding supports relevant to the section."""
|
||||
relevant_supports = []
|
||||
section_text = f"{section.heading} {' '.join(section.subheadings)} {' '.join(section.key_points)}".lower()
|
||||
|
||||
for support in grounding_metadata.grounding_supports:
|
||||
support_text = support.segment_text.lower()
|
||||
# Simple relevance check
|
||||
if any(word in support_text for word in section_text.split() if len(word) > 3):
|
||||
relevant_supports.append(support)
|
||||
|
||||
return relevant_supports
|
||||
|
||||
def _enhance_subheadings(self, section: BlogOutlineSection, relevant_supports: List[GroundingSupport], insights: Dict[str, Any]) -> List[str]:
|
||||
"""Enhance subheadings with grounding insights."""
|
||||
enhanced_subheadings = list(section.subheadings)
|
||||
|
||||
# Add high-confidence insights as subheadings
|
||||
high_confidence_insights = self._get_high_confidence_insights_from_supports(relevant_supports)
|
||||
for insight in high_confidence_insights[:2]: # Add up to 2 new subheadings
|
||||
if insight not in enhanced_subheadings:
|
||||
enhanced_subheadings.append(insight)
|
||||
|
||||
return enhanced_subheadings
|
||||
|
||||
def _enhance_key_points(self, section: BlogOutlineSection, relevant_chunks: List[GroundingChunk], insights: Dict[str, Any]) -> List[str]:
|
||||
"""Enhance key points with authoritative insights."""
|
||||
enhanced_key_points = list(section.key_points)
|
||||
|
||||
# Add insights from high-authority chunks
|
||||
for chunk in relevant_chunks:
|
||||
if chunk.confidence_score and chunk.confidence_score >= self.high_confidence_threshold:
|
||||
insight = f"Based on {chunk.title}: {self._extract_key_insight(chunk)}"
|
||||
if insight not in enhanced_key_points:
|
||||
enhanced_key_points.append(insight)
|
||||
|
||||
return enhanced_key_points
|
||||
|
||||
def _enhance_keywords(self, section: BlogOutlineSection, insights: Dict[str, Any]) -> List[str]:
|
||||
"""Enhance keywords with related concepts from grounding."""
|
||||
enhanced_keywords = list(section.keywords)
|
||||
|
||||
# Add related concepts from grounding analysis
|
||||
related_concepts = insights.get('content_relationships', {}).get('related_concepts', [])
|
||||
for concept in related_concepts[:3]: # Add up to 3 new keywords
|
||||
if concept.lower() not in [kw.lower() for kw in enhanced_keywords]:
|
||||
enhanced_keywords.append(concept)
|
||||
|
||||
return enhanced_keywords
|
||||
|
||||
def _get_high_confidence_insights_from_supports(self, supports: List[GroundingSupport]) -> List[str]:
|
||||
"""Get high-confidence insights from grounding supports."""
|
||||
insights = []
|
||||
for support in supports:
|
||||
if support.confidence_scores and max(support.confidence_scores) >= self.high_confidence_threshold:
|
||||
insight = self._extract_insight_from_segment(support.segment_text)
|
||||
if insight:
|
||||
insights.append(insight)
|
||||
return insights
|
||||
|
||||
def _extract_key_insight(self, chunk: GroundingChunk) -> str:
|
||||
"""Extract key insight from grounding chunk."""
|
||||
# Simple extraction - could be enhanced
|
||||
return f"High-confidence source with {chunk.confidence_score:.2f} confidence score"
|
||||
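A hedged usage sketch for the engine above: `research` is assumed to be a research result that carries `grounding_metadata` (as `outline_generator.py` uses it), and `sections` a list of `BlogOutlineSection` produced earlier in the pipeline.

```python
# Sketch only; `research` and `sections` are assumed inputs from the research/outline steps.
from services.blog_writer.outline import GroundingContextEngine  # re-exported in outline/__init__.py

def summarize_and_enhance(research, sections):
    engine = GroundingContextEngine()

    insights = engine.extract_contextual_insights(research.grounding_metadata)
    grade = insights["quality_indicators"].get("quality_grade", "n/a")
    print(f"Grounding quality grade: {grade}")

    # Authority sources come back as (chunk, score) tuples, highest first.
    for chunk, score in engine.get_authority_sources(research.grounding_metadata):
        print(f"{chunk.title} -> authority {score:.2f}")

    # Feed the insights back into the outline sections.
    return engine.enhance_sections_with_grounding(sections, research.grounding_metadata, insights)
```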
94
backend/services/blog_writer/outline/metadata_collector.py
Normal file
@@ -0,0 +1,94 @@
|
||||
"""
|
||||
Metadata Collector - Handles collection and formatting of outline metadata.
|
||||
|
||||
Collects source mapping stats, grounding insights, optimization results, and research coverage.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class MetadataCollector:
|
||||
"""Handles collection and formatting of various metadata types for UI display."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the metadata collector."""
|
||||
pass
|
||||
|
||||
def collect_source_mapping_stats(self, mapped_sections, research):
|
||||
"""Collect source mapping statistics for UI display."""
|
||||
from models.blog_models import SourceMappingStats
|
||||
|
||||
total_sources = len(research.sources)
|
||||
total_mapped = sum(len(section.references) for section in mapped_sections)
|
||||
coverage_percentage = (total_mapped / total_sources * 100) if total_sources > 0 else 0.0
|
||||
|
||||
# Calculate average relevance score (simplified)
|
||||
all_relevance_scores = []
|
||||
for section in mapped_sections:
|
||||
for ref in section.references:
|
||||
if hasattr(ref, 'credibility_score') and ref.credibility_score:
|
||||
all_relevance_scores.append(ref.credibility_score)
|
||||
|
||||
average_relevance = sum(all_relevance_scores) / len(all_relevance_scores) if all_relevance_scores else 0.0
|
||||
high_confidence_mappings = sum(1 for score in all_relevance_scores if score >= 0.8)
|
||||
|
||||
return SourceMappingStats(
|
||||
total_sources_mapped=total_mapped,
|
||||
coverage_percentage=round(coverage_percentage, 1),
|
||||
average_relevance_score=round(average_relevance, 3),
|
||||
high_confidence_mappings=high_confidence_mappings
|
||||
)
|
||||
|
||||
def collect_grounding_insights(self, grounding_insights):
|
||||
"""Collect grounding insights for UI display."""
|
||||
from models.blog_models import GroundingInsights
|
||||
|
||||
return GroundingInsights(
|
||||
confidence_analysis=grounding_insights.get('confidence_analysis'),
|
||||
authority_analysis=grounding_insights.get('authority_analysis'),
|
||||
temporal_analysis=grounding_insights.get('temporal_analysis'),
|
||||
content_relationships=grounding_insights.get('content_relationships'),
|
||||
citation_insights=grounding_insights.get('citation_insights'),
|
||||
search_intent_insights=grounding_insights.get('search_intent_insights'),
|
||||
quality_indicators=grounding_insights.get('quality_indicators')
|
||||
)
|
||||
|
||||
def collect_optimization_results(self, optimized_sections, focus):
|
||||
"""Collect optimization results for UI display."""
|
||||
from models.blog_models import OptimizationResults
|
||||
|
||||
# Calculate a quality score based on section completeness
|
||||
total_sections = len(optimized_sections)
|
||||
complete_sections = sum(1 for section in optimized_sections
|
||||
if section.heading and section.subheadings and section.key_points)
|
||||
|
||||
quality_score = (complete_sections / total_sections * 10) if total_sections > 0 else 0.0
|
||||
|
||||
improvements_made = [
|
||||
"Enhanced section headings for better SEO",
|
||||
"Optimized keyword distribution across sections",
|
||||
"Improved content flow and logical progression",
|
||||
"Balanced word count distribution",
|
||||
"Enhanced subheadings for better readability"
|
||||
]
|
||||
|
||||
return OptimizationResults(
|
||||
overall_quality_score=round(quality_score, 1),
|
||||
improvements_made=improvements_made,
|
||||
optimization_focus=focus
|
||||
)
|
||||
|
||||
def collect_research_coverage(self, research):
|
||||
"""Collect research coverage metrics for UI display."""
|
||||
from models.blog_models import ResearchCoverage
|
||||
|
||||
sources_utilized = len(research.sources)
|
||||
content_gaps = research.keyword_analysis.get('content_gaps', [])
|
||||
competitive_advantages = research.competitor_analysis.get('competitive_advantages', [])
|
||||
|
||||
return ResearchCoverage(
|
||||
sources_utilized=sources_utilized,
|
||||
content_gaps_identified=len(content_gaps),
|
||||
competitive_advantages=competitive_advantages[:5] # Limit to top 5
|
||||
)
|
||||
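Note that `collect_source_mapping_stats` counts every section reference against the research source list, so a source cited by several sections is counted more than once and coverage can exceed 100%. A tiny worked example of the arithmetic, with invented numbers:

```python
# Invented numbers, purely to illustrate the coverage computation above.
total_sources = 8    # len(research.sources)
total_mapped = 6     # sum(len(section.references) for section in mapped_sections)
coverage_percentage = (total_mapped / total_sources * 100) if total_sources > 0 else 0.0
print(round(coverage_percentage, 1))  # 75.0
```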
323
backend/services/blog_writer/outline/outline_generator.py
Normal file
@@ -0,0 +1,323 @@
|
||||
"""
|
||||
Outline Generator - AI-powered outline generation from research data.
|
||||
|
||||
Generates comprehensive, SEO-optimized outlines using research intelligence.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List, Tuple
|
||||
import asyncio
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
BlogOutlineRequest,
|
||||
BlogOutlineResponse,
|
||||
BlogOutlineSection,
|
||||
)
|
||||
|
||||
from .source_mapper import SourceToSectionMapper
|
||||
from .section_enhancer import SectionEnhancer
|
||||
from .outline_optimizer import OutlineOptimizer
|
||||
from .grounding_engine import GroundingContextEngine
|
||||
from .title_generator import TitleGenerator
|
||||
from .metadata_collector import MetadataCollector
|
||||
from .prompt_builder import PromptBuilder
|
||||
from .response_processor import ResponseProcessor
|
||||
from .parallel_processor import ParallelProcessor
|
||||
|
||||
|
||||
class OutlineGenerator:
|
||||
"""Generates AI-powered outlines from research data."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the outline generator with all enhancement modules."""
|
||||
self.source_mapper = SourceToSectionMapper()
|
||||
self.section_enhancer = SectionEnhancer()
|
||||
self.outline_optimizer = OutlineOptimizer()
|
||||
self.grounding_engine = GroundingContextEngine()
|
||||
|
||||
# Initialize extracted classes
|
||||
self.title_generator = TitleGenerator()
|
||||
self.metadata_collector = MetadataCollector()
|
||||
self.prompt_builder = PromptBuilder()
|
||||
self.response_processor = ResponseProcessor()
|
||||
self.parallel_processor = ParallelProcessor(self.source_mapper, self.grounding_engine)
|
||||
|
||||
async def generate(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
|
||||
"""
|
||||
Generate AI-powered outline using research results.
|
||||
|
||||
Args:
|
||||
request: Outline generation request with research data
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
|
||||
|
||||
# Extract research insights
|
||||
research = request.research
|
||||
primary_keywords = research.keyword_analysis.get('primary', [])
|
||||
secondary_keywords = research.keyword_analysis.get('secondary', [])
|
||||
content_angles = research.suggested_angles
|
||||
sources = research.sources
|
||||
search_intent = research.keyword_analysis.get('search_intent', 'informational')
|
||||
|
||||
# Check for custom instructions
|
||||
custom_instructions = getattr(request, 'custom_instructions', None)
|
||||
|
||||
# Build comprehensive outline generation prompt with rich research data
|
||||
outline_prompt = self.prompt_builder.build_outline_prompt(
|
||||
primary_keywords, secondary_keywords, content_angles, sources,
|
||||
search_intent, request, custom_instructions
|
||||
)
|
||||
|
||||
logger.info("Generating AI-powered outline using research results")
|
||||
|
||||
# Define schema with proper property ordering (critical for Gemini API)
|
||||
outline_schema = self.prompt_builder.get_outline_schema()
|
||||
|
||||
# Generate outline using structured JSON response with retry logic (user_id required)
|
||||
outline_data = await self.response_processor.generate_with_retry(outline_prompt, outline_schema, user_id)
|
||||
|
||||
# Convert to BlogOutlineSection objects
|
||||
outline_sections = self.response_processor.convert_to_sections(outline_data, sources)
|
||||
|
||||
# Run parallel processing for speed optimization (user_id required)
|
||||
mapped_sections, grounding_insights = await self.parallel_processor.run_parallel_processing_async(
|
||||
outline_sections, research, user_id
|
||||
)
|
||||
|
||||
# Enhance sections with grounding insights
|
||||
logger.info("Enhancing sections with grounding insights...")
|
||||
grounding_enhanced_sections = self.grounding_engine.enhance_sections_with_grounding(
|
||||
mapped_sections, research.grounding_metadata, grounding_insights
|
||||
)
|
||||
|
||||
# Optimize outline for better flow, SEO, and engagement (user_id required)
|
||||
logger.info("Optimizing outline for better flow and engagement...")
|
||||
optimized_sections = await self.outline_optimizer.optimize(grounding_enhanced_sections, "comprehensive optimization", user_id)
|
||||
|
||||
# Rebalance word counts for optimal distribution
|
||||
target_words = request.word_count or 1500
|
||||
balanced_sections = self.outline_optimizer.rebalance_word_counts(optimized_sections, target_words)
|
||||
|
||||
# Extract title options - combine AI-generated with content angles
|
||||
ai_title_options = outline_data.get('title_options', [])
|
||||
content_angle_titles = self.title_generator.extract_content_angle_titles(research)
|
||||
|
||||
# Combine AI-generated titles with content angles
|
||||
title_options = self.title_generator.combine_title_options(ai_title_options, content_angle_titles, primary_keywords)
|
||||
|
||||
logger.info(f"Generated optimized outline with {len(balanced_sections)} sections and {len(title_options)} title options")
|
||||
|
||||
# Collect metadata for enhanced UI
|
||||
source_mapping_stats = self.metadata_collector.collect_source_mapping_stats(mapped_sections, research)
|
||||
grounding_insights_data = self.metadata_collector.collect_grounding_insights(grounding_insights)
|
||||
optimization_results = self.metadata_collector.collect_optimization_results(optimized_sections, "comprehensive optimization")
|
||||
research_coverage = self.metadata_collector.collect_research_coverage(research)
|
||||
|
||||
return BlogOutlineResponse(
|
||||
success=True,
|
||||
title_options=title_options,
|
||||
outline=balanced_sections,
|
||||
source_mapping_stats=source_mapping_stats,
|
||||
grounding_insights=grounding_insights_data,
|
||||
optimization_results=optimization_results,
|
||||
research_coverage=research_coverage
|
||||
)
|
||||
|
||||
async def generate_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
|
||||
"""
|
||||
Outline generation method with progress updates for real-time feedback.
|
||||
|
||||
Args:
|
||||
request: Outline generation request with research data
|
||||
task_id: Task ID for progress updates
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
|
||||
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
|
||||
# Extract research insights
|
||||
research = request.research
|
||||
primary_keywords = research.keyword_analysis.get('primary', [])
|
||||
secondary_keywords = research.keyword_analysis.get('secondary', [])
|
||||
content_angles = research.suggested_angles
|
||||
sources = research.sources
|
||||
search_intent = research.keyword_analysis.get('search_intent', 'informational')
|
||||
|
||||
# Check for custom instructions
|
||||
custom_instructions = getattr(request, 'custom_instructions', None)
|
||||
|
||||
await task_manager.update_progress(task_id, "📊 Analyzing research data and building content strategy...")
|
||||
|
||||
# Build comprehensive outline generation prompt with rich research data
|
||||
outline_prompt = self.prompt_builder.build_outline_prompt(
|
||||
primary_keywords, secondary_keywords, content_angles, sources,
|
||||
search_intent, request, custom_instructions
|
||||
)
|
||||
|
||||
await task_manager.update_progress(task_id, "🤖 Generating AI-powered outline with research insights...")
|
||||
|
||||
# Define schema with proper property ordering (critical for Gemini API)
|
||||
outline_schema = self.prompt_builder.get_outline_schema()
|
||||
|
||||
await task_manager.update_progress(task_id, "🔄 Making AI request to generate structured outline...")
|
||||
|
||||
# Generate outline using structured JSON response with retry logic (user_id required for subscription checks)
|
||||
outline_data = await self.response_processor.generate_with_retry(outline_prompt, outline_schema, user_id, task_id)
|
||||
|
||||
await task_manager.update_progress(task_id, "📝 Processing outline structure and validating sections...")
|
||||
|
||||
# Convert to BlogOutlineSection objects
|
||||
outline_sections = self.response_processor.convert_to_sections(outline_data, sources)
|
||||
|
||||
# Run parallel processing for speed optimization (user_id required for subscription checks)
|
||||
mapped_sections, grounding_insights = await self.parallel_processor.run_parallel_processing(
|
||||
outline_sections, research, user_id, task_id
|
||||
)
|
||||
|
||||
# Enhance sections with grounding insights (depends on both previous tasks)
|
||||
await task_manager.update_progress(task_id, "✨ Enhancing sections with grounding insights...")
|
||||
grounding_enhanced_sections = self.grounding_engine.enhance_sections_with_grounding(
|
||||
mapped_sections, research.grounding_metadata, grounding_insights
|
||||
)
|
||||
|
||||
# Optimize outline for better flow, SEO, and engagement (user_id required for subscription checks)
|
||||
await task_manager.update_progress(task_id, "🎯 Optimizing outline for better flow and engagement...")
|
||||
optimized_sections = await self.outline_optimizer.optimize(grounding_enhanced_sections, "comprehensive optimization", user_id)
|
||||
|
||||
# Rebalance word counts for optimal distribution
|
||||
await task_manager.update_progress(task_id, "⚖️ Rebalancing word count distribution...")
|
||||
target_words = request.word_count or 1500
|
||||
balanced_sections = self.outline_optimizer.rebalance_word_counts(optimized_sections, target_words)
|
||||
|
||||
# Extract title options - combine AI-generated with content angles
|
||||
ai_title_options = outline_data.get('title_options', [])
|
||||
content_angle_titles = self.title_generator.extract_content_angle_titles(research)
|
||||
|
||||
# Combine AI-generated titles with content angles
|
||||
title_options = self.title_generator.combine_title_options(ai_title_options, content_angle_titles, primary_keywords)
|
||||
|
||||
await task_manager.update_progress(task_id, "✅ Outline generation and optimization completed successfully!")
|
||||
|
||||
# Collect metadata for enhanced UI
|
||||
source_mapping_stats = self.metadata_collector.collect_source_mapping_stats(mapped_sections, research)
|
||||
grounding_insights_data = self.metadata_collector.collect_grounding_insights(grounding_insights)
|
||||
optimization_results = self.metadata_collector.collect_optimization_results(optimized_sections, "comprehensive optimization")
|
||||
research_coverage = self.metadata_collector.collect_research_coverage(research)
|
||||
|
||||
return BlogOutlineResponse(
|
||||
success=True,
|
||||
title_options=title_options,
|
||||
outline=balanced_sections,
|
||||
source_mapping_stats=source_mapping_stats,
|
||||
grounding_insights=grounding_insights_data,
|
||||
optimization_results=optimization_results,
|
||||
research_coverage=research_coverage
|
||||
)
|
||||
|
||||
|
||||
|
||||
async def enhance_section(self, section: BlogOutlineSection, focus: str = "general improvement", user_id: str = None) -> BlogOutlineSection:
"""
Enhance a single section using AI with research context.

Args:
section: The section to enhance
focus: Enhancement focus area (e.g., "SEO optimization", "engagement", "comprehensiveness")
user_id: User ID, required by the section enhancer for subscription checks and usage tracking

Returns:
Enhanced section with improved content
"""
logger.info(f"Enhancing section '{section.heading}' with focus: {focus}")
enhanced_section = await self.section_enhancer.enhance(section, focus, user_id)
logger.info(f"✅ Section enhancement completed for '{section.heading}'")
return enhanced_section

async def optimize_outline(self, outline: List[BlogOutlineSection], focus: str = "comprehensive optimization", user_id: str = None) -> List[BlogOutlineSection]:
"""
Optimize an entire outline for better flow, SEO, and engagement.

Args:
outline: List of sections to optimize
focus: Optimization focus area
user_id: User ID, required by the optimizer for subscription checks and usage tracking

Returns:
Optimized outline with improved flow and engagement
"""
logger.info(f"Optimizing outline with {len(outline)} sections, focus: {focus}")
optimized_outline = await self.outline_optimizer.optimize(outline, focus, user_id)
logger.info(f"✅ Outline optimization completed for {len(optimized_outline)} sections")
return optimized_outline

def rebalance_outline_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
|
||||
"""
|
||||
Rebalance word count distribution across outline sections.
|
||||
|
||||
Args:
|
||||
outline: List of sections to rebalance
|
||||
target_words: Total target word count
|
||||
|
||||
Returns:
|
||||
Outline with rebalanced word counts
|
||||
"""
|
||||
logger.info(f"Rebalancing word counts for {len(outline)} sections, target: {target_words} words")
|
||||
rebalanced_outline = self.outline_optimizer.rebalance_word_counts(outline, target_words)
|
||||
logger.info(f"✅ Word count rebalancing completed")
|
||||
return rebalanced_outline
|
||||
|
||||
def get_grounding_insights(self, research_data) -> Dict[str, Any]:
|
||||
"""
|
||||
Get grounding metadata insights for research data.
|
||||
|
||||
Args:
|
||||
research_data: Research data with grounding metadata
|
||||
|
||||
Returns:
|
||||
Dictionary containing grounding insights and analysis
|
||||
"""
|
||||
logger.info("Extracting grounding insights from research data...")
|
||||
insights = self.grounding_engine.extract_contextual_insights(research_data.grounding_metadata)
|
||||
logger.info(f"✅ Extracted {len(insights)} grounding insight categories")
|
||||
return insights
|
||||
|
||||
def get_authority_sources(self, research_data) -> List[Tuple]:
|
||||
"""
|
||||
Get high-authority sources from grounding metadata.
|
||||
|
||||
Args:
|
||||
research_data: Research data with grounding metadata
|
||||
|
||||
Returns:
|
||||
List of (chunk, authority_score) tuples sorted by authority
|
||||
"""
|
||||
logger.info("Identifying high-authority sources from grounding metadata...")
|
||||
authority_sources = self.grounding_engine.get_authority_sources(research_data.grounding_metadata)
|
||||
logger.info(f"✅ Identified {len(authority_sources)} high-authority sources")
|
||||
return authority_sources
|
||||
|
||||
def get_high_confidence_insights(self, research_data) -> List[str]:
|
||||
"""
|
||||
Get high-confidence insights from grounding metadata.
|
||||
|
||||
Args:
|
||||
research_data: Research data with grounding metadata
|
||||
|
||||
Returns:
|
||||
List of high-confidence insights
|
||||
"""
|
||||
logger.info("Extracting high-confidence insights from grounding metadata...")
|
||||
insights = self.grounding_engine.get_high_confidence_insights(research_data.grounding_metadata)
|
||||
logger.info(f"✅ Extracted {len(insights)} high-confidence insights")
|
||||
return insights
|
||||
|
||||
|
||||
|
||||
137
backend/services/blog_writer/outline/outline_optimizer.py
Normal file
137
backend/services/blog_writer/outline/outline_optimizer.py
Normal file
@@ -0,0 +1,137 @@
|
||||
"""
|
||||
Outline Optimizer - AI-powered outline optimization and rebalancing.
|
||||
|
||||
Optimizes outlines for better flow, SEO, and engagement.
|
||||
"""
|
||||
|
||||
from typing import List
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import BlogOutlineSection
|
||||
|
||||
|
||||
class OutlineOptimizer:
|
||||
"""Optimizes outlines for better flow, SEO, and engagement."""
|
||||
|
||||
async def optimize(self, outline: List[BlogOutlineSection], focus: str, user_id: str) -> List[BlogOutlineSection]:
|
||||
"""Optimize entire outline for better flow, SEO, and engagement.
|
||||
|
||||
Args:
|
||||
outline: List of outline sections to optimize
|
||||
focus: Optimization focus (e.g., "general optimization")
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
List of optimized outline sections
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for outline optimization (subscription checks and usage tracking)")
|
||||
|
||||
outline_text = "\n".join([f"{i+1}. {s.heading}" for i, s in enumerate(outline)])
|
||||
|
||||
optimization_prompt = f"""Optimize this blog outline for better flow, engagement, and SEO:
|
||||
|
||||
Current Outline:
|
||||
{outline_text}
|
||||
|
||||
Optimization Focus: {focus}
|
||||
|
||||
Goals: Improve narrative flow, enhance SEO, increase engagement, ensure comprehensive coverage.
|
||||
|
||||
Return JSON format:
|
||||
{{
|
||||
"outline": [
|
||||
{{
|
||||
"heading": "Optimized heading",
|
||||
"subheadings": ["subheading 1", "subheading 2"],
|
||||
"key_points": ["point 1", "point 2"],
|
||||
"target_words": 300,
|
||||
"keywords": ["keyword1", "keyword2"]
|
||||
}}
|
||||
]
|
||||
}}"""
|
||||
|
||||
try:
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
optimization_schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"outline": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"heading": {"type": "string"},
|
||||
"subheadings": {"type": "array", "items": {"type": "string"}},
|
||||
"key_points": {"type": "array", "items": {"type": "string"}},
|
||||
"target_words": {"type": "integer"},
|
||||
"keywords": {"type": "array", "items": {"type": "string"}}
|
||||
},
|
||||
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["outline"],
|
||||
"propertyOrdering": ["outline"]
|
||||
}
|
||||
|
||||
optimized_data = llm_text_gen(
|
||||
prompt=optimization_prompt,
|
||||
json_struct=optimization_schema,
|
||||
system_prompt=None,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
# Handle the new schema format with "outline" wrapper
|
||||
if isinstance(optimized_data, dict) and 'outline' in optimized_data:
|
||||
optimized_sections = []
|
||||
for i, section_data in enumerate(optimized_data['outline']):
|
||||
section = BlogOutlineSection(
|
||||
id=f"s{i+1}",
|
||||
heading=section_data.get('heading', f'Section {i+1}'),
|
||||
subheadings=section_data.get('subheadings', []),
|
||||
key_points=section_data.get('key_points', []),
|
||||
references=outline[i].references if i < len(outline) else [],
|
||||
target_words=section_data.get('target_words', 300),
|
||||
keywords=section_data.get('keywords', [])
|
||||
)
|
||||
optimized_sections.append(section)
|
||||
logger.info(f"✅ Outline optimization completed: {len(optimized_sections)} sections optimized")
|
||||
return optimized_sections
|
||||
else:
|
||||
logger.warning(f"Invalid optimization response format: {type(optimized_data)}")
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"AI outline optimization failed: {e}")
|
||||
logger.info("Returning original outline without optimization")
|
||||
|
||||
return outline
|
||||
|
||||
def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
|
||||
"""Rebalance word count distribution across sections."""
|
||||
total_sections = len(outline)
|
||||
if total_sections == 0:
|
||||
return outline
|
||||
|
||||
# Calculate target distribution
|
||||
intro_words = int(target_words * 0.12) # 12% for intro
|
||||
conclusion_words = int(target_words * 0.12) # 12% for conclusion
|
||||
main_content_words = target_words - intro_words - conclusion_words
|
||||
|
||||
# Distribute main content words across the middle sections only
# (the first and last sections receive the intro/conclusion allocations below)
middle_sections = max(1, total_sections - 2)
words_per_section = main_content_words // middle_sections
remainder = main_content_words % middle_sections

for i, section in enumerate(outline):
    if i == 0:  # First section (intro)
        section.target_words = intro_words
    elif i == total_sections - 1:  # Last section (conclusion)
        section.target_words = conclusion_words
    else:  # Main content sections; spread the remainder over the earliest ones
        section.target_words = words_per_section + (1 if (i - 1) < remainder else 0)
|
||||
|
||||
return outline
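
# --- Illustrative sketch (not part of the module): how the 12%/12% split plays out --
# Assuming a 1500-word target and 6 sections: intro and conclusion each get
# int(1500 * 0.12) = 180 words, leaving 1140 words for the 4 middle sections
# (285 words each). The helper below reproduces just that arithmetic; the section
# counts and values are hypothetical.
def _demo_word_split(target_words: int = 1500, total_sections: int = 6) -> dict:
    intro = int(target_words * 0.12)
    conclusion = int(target_words * 0.12)
    middle_sections = max(1, total_sections - 2)
    per_middle = (target_words - intro - conclusion) // middle_sections
    return {"intro": intro, "conclusion": conclusion, "per_middle_section": per_middle}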
|
||||
268
backend/services/blog_writer/outline/outline_service.py
Normal file
268
backend/services/blog_writer/outline/outline_service.py
Normal file
@@ -0,0 +1,268 @@
|
||||
"""
|
||||
Outline Service - Core outline generation and management functionality.
|
||||
|
||||
Handles AI-powered outline generation, refinement, and optimization.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
import asyncio
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
BlogOutlineRequest,
|
||||
BlogOutlineResponse,
|
||||
BlogOutlineRefineRequest,
|
||||
BlogOutlineSection,
|
||||
)
|
||||
|
||||
from .outline_generator import OutlineGenerator
|
||||
from .outline_optimizer import OutlineOptimizer
|
||||
from .section_enhancer import SectionEnhancer
|
||||
from services.cache.persistent_outline_cache import persistent_outline_cache
|
||||
|
||||
|
||||
class OutlineService:
|
||||
"""Service for generating and managing blog outlines using AI."""
|
||||
|
||||
def __init__(self):
|
||||
self.outline_generator = OutlineGenerator()
|
||||
self.outline_optimizer = OutlineOptimizer()
|
||||
self.section_enhancer = SectionEnhancer()
|
||||
|
||||
async def generate_outline(self, request: BlogOutlineRequest, user_id: str) -> BlogOutlineResponse:
|
||||
"""
|
||||
Stage 2: Content Planning with AI-generated outline using research results.
|
||||
Uses Gemini with research data to create comprehensive, SEO-optimized outline.
|
||||
|
||||
Args:
|
||||
request: Outline generation request with research data
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
|
||||
|
||||
# Extract cache parameters - use original user keywords for consistent caching
|
||||
keywords = request.research.original_keywords or request.research.keyword_analysis.get('primary', [])
|
||||
industry = getattr(request.persona, 'industry', 'general') if request.persona else 'general'
|
||||
target_audience = getattr(request.persona, 'target_audience', 'general') if request.persona else 'general'
|
||||
word_count = request.word_count or 1500
|
||||
custom_instructions = request.custom_instructions or ""
|
||||
persona_data = request.persona.dict() if request.persona else None
|
||||
|
||||
# Check cache first
|
||||
cached_result = persistent_outline_cache.get_cached_outline(
|
||||
keywords=keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
word_count=word_count,
|
||||
custom_instructions=custom_instructions,
|
||||
persona_data=persona_data
|
||||
)
|
||||
|
||||
if cached_result:
|
||||
logger.info(f"Using cached outline for keywords: {keywords}")
|
||||
return BlogOutlineResponse(**cached_result)
|
||||
|
||||
# Generate new outline if not cached (user_id required)
|
||||
logger.info(f"Generating new outline for keywords: {keywords}")
|
||||
result = await self.outline_generator.generate(request, user_id)
|
||||
|
||||
# Cache the result
|
||||
persistent_outline_cache.cache_outline(
|
||||
keywords=keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
word_count=word_count,
|
||||
custom_instructions=custom_instructions,
|
||||
persona_data=persona_data,
|
||||
result=result.dict()
|
||||
)
|
||||
|
||||
return result
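
# --- Illustrative usage sketch (comment only, not part of the service) --------------
# How a caller might drive the cache-aware flow above; the request object and user id
# are hypothetical, the class and model names come from this module.
#
#   service = OutlineService()
#   response = await service.generate_outline(request, user_id="demo-user")
#   if response.success:
#       for section in response.outline:
#           print(f"{section.heading}: ~{section.target_words} words")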
|
||||
|
||||
async def generate_outline_with_progress(self, request: BlogOutlineRequest, task_id: str, user_id: str) -> BlogOutlineResponse:
|
||||
"""
|
||||
Outline generation method with progress updates for real-time feedback.
|
||||
"""
|
||||
# Extract cache parameters - use original user keywords for consistent caching
|
||||
keywords = request.research.original_keywords or request.research.keyword_analysis.get('primary', [])
|
||||
industry = getattr(request.persona, 'industry', 'general') if request.persona else 'general'
|
||||
target_audience = getattr(request.persona, 'target_audience', 'general') if request.persona else 'general'
|
||||
word_count = request.word_count or 1500
|
||||
custom_instructions = request.custom_instructions or ""
|
||||
persona_data = request.persona.dict() if request.persona else None
|
||||
|
||||
# Check cache first
|
||||
cached_result = persistent_outline_cache.get_cached_outline(
|
||||
keywords=keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
word_count=word_count,
|
||||
custom_instructions=custom_instructions,
|
||||
persona_data=persona_data
|
||||
)
|
||||
|
||||
if cached_result:
|
||||
logger.info(f"Using cached outline for keywords: {keywords} (with progress updates)")
|
||||
# Update progress to show cache hit
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
await task_manager.update_progress(task_id, "✅ Using cached outline (saved generation time!)")
|
||||
return BlogOutlineResponse(**cached_result)
|
||||
|
||||
# Generate new outline if not cached
|
||||
logger.info(f"Generating new outline for keywords: {keywords} (with progress updates)")
|
||||
result = await self.outline_generator.generate_with_progress(request, task_id, user_id)
|
||||
|
||||
# Cache the result
|
||||
persistent_outline_cache.cache_outline(
|
||||
keywords=keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
word_count=word_count,
|
||||
custom_instructions=custom_instructions,
|
||||
persona_data=persona_data,
|
||||
result=result.dict()
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
async def refine_outline(self, request: BlogOutlineRefineRequest) -> BlogOutlineResponse:
|
||||
"""
|
||||
Refine outline with HITL (Human-in-the-Loop) operations
|
||||
Supports add, remove, move, merge, rename operations
|
||||
"""
|
||||
outline = request.outline.copy()
|
||||
operation = request.operation.lower()
|
||||
section_id = request.section_id
|
||||
payload = request.payload or {}
|
||||
|
||||
try:
|
||||
if operation == 'add':
|
||||
# Add new section
|
||||
new_section = BlogOutlineSection(
|
||||
id=f"s{len(outline) + 1}",
|
||||
heading=payload.get('heading', 'New Section'),
|
||||
subheadings=payload.get('subheadings', []),
|
||||
key_points=payload.get('key_points', []),
|
||||
references=[],
|
||||
target_words=payload.get('target_words', 300)
|
||||
)
|
||||
outline.append(new_section)
|
||||
logger.info(f"Added new section: {new_section.heading}")
|
||||
|
||||
elif operation == 'remove' and section_id:
|
||||
# Remove section
|
||||
outline = [s for s in outline if s.id != section_id]
|
||||
logger.info(f"Removed section: {section_id}")
|
||||
|
||||
elif operation == 'rename' and section_id:
|
||||
# Rename section
|
||||
for section in outline:
|
||||
if section.id == section_id:
|
||||
section.heading = payload.get('heading', section.heading)
|
||||
break
|
||||
logger.info(f"Renamed section {section_id} to: {payload.get('heading')}")
|
||||
|
||||
elif operation == 'move' and section_id:
|
||||
# Move section (reorder)
|
||||
direction = payload.get('direction', 'down') # 'up' or 'down'
|
||||
current_index = next((i for i, s in enumerate(outline) if s.id == section_id), -1)
|
||||
|
||||
if current_index != -1:
|
||||
if direction == 'up' and current_index > 0:
|
||||
outline[current_index], outline[current_index - 1] = outline[current_index - 1], outline[current_index]
|
||||
elif direction == 'down' and current_index < len(outline) - 1:
|
||||
outline[current_index], outline[current_index + 1] = outline[current_index + 1], outline[current_index]
|
||||
logger.info(f"Moved section {section_id} {direction}")
|
||||
|
||||
elif operation == 'merge' and section_id:
|
||||
# Merge with next section
|
||||
current_index = next((i for i, s in enumerate(outline) if s.id == section_id), -1)
|
||||
if current_index != -1 and current_index < len(outline) - 1:
|
||||
current_section = outline[current_index]
|
||||
next_section = outline[current_index + 1]
|
||||
|
||||
# Merge sections
|
||||
current_section.heading = f"{current_section.heading} & {next_section.heading}"
|
||||
current_section.subheadings.extend(next_section.subheadings)
|
||||
current_section.key_points.extend(next_section.key_points)
|
||||
current_section.references.extend(next_section.references)
|
||||
current_section.target_words = (current_section.target_words or 0) + (next_section.target_words or 0)
|
||||
|
||||
# Remove the next section
|
||||
outline.pop(current_index + 1)
|
||||
logger.info(f"Merged section {section_id} with next section")
|
||||
|
||||
elif operation == 'update' and section_id:
|
||||
# Update section details
|
||||
for section in outline:
|
||||
if section.id == section_id:
|
||||
if 'heading' in payload:
|
||||
section.heading = payload['heading']
|
||||
if 'subheadings' in payload:
|
||||
section.subheadings = payload['subheadings']
|
||||
if 'key_points' in payload:
|
||||
section.key_points = payload['key_points']
|
||||
if 'target_words' in payload:
|
||||
section.target_words = payload['target_words']
|
||||
break
|
||||
logger.info(f"Updated section {section_id}")
|
||||
|
||||
# Reassign IDs to maintain order
|
||||
for i, section in enumerate(outline):
|
||||
section.id = f"s{i+1}"
|
||||
|
||||
return BlogOutlineResponse(
|
||||
success=True,
|
||||
title_options=["Refined Outline"],
|
||||
outline=outline
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Outline refinement failed: {e}")
|
||||
return BlogOutlineResponse(
|
||||
success=False,
|
||||
title_options=["Error"],
|
||||
outline=request.outline
|
||||
)
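
# --- Illustrative sketch (comment only): a HITL refinement request ------------------
# Example of the request shape the operations above expect; the field names follow
# BlogOutlineRefineRequest as used in this method, the concrete values are hypothetical.
#
#   move_request = BlogOutlineRefineRequest(
#       outline=current_outline,          # sections returned by generate_outline
#       operation="move",
#       section_id="s3",
#       payload={"direction": "up"},
#   )
#   refined = await service.refine_outline(move_request)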
|
||||
|
||||
async def enhance_section_with_ai(self, section: BlogOutlineSection, focus: str = "general improvement", user_id: str = None) -> BlogOutlineSection:
"""Enhance a section using AI with research context (user_id is required by the enhancer)."""
return await self.section_enhancer.enhance(section, focus, user_id)

async def optimize_outline_with_ai(self, outline: List[BlogOutlineSection], focus: str = "general optimization", user_id: str = None) -> List[BlogOutlineSection]:
"""Optimize entire outline for better flow, SEO, and engagement (user_id is required by the optimizer)."""
return await self.outline_optimizer.optimize(outline, focus, user_id)
|
||||
|
||||
def rebalance_word_counts(self, outline: List[BlogOutlineSection], target_words: int) -> List[BlogOutlineSection]:
|
||||
"""Rebalance word count distribution across sections."""
|
||||
return self.outline_optimizer.rebalance_word_counts(outline, target_words)
|
||||
|
||||
# Cache Management Methods
|
||||
|
||||
def get_outline_cache_stats(self) -> Dict[str, Any]:
|
||||
"""Get outline cache statistics."""
|
||||
return persistent_outline_cache.get_cache_stats()
|
||||
|
||||
def clear_outline_cache(self):
|
||||
"""Clear all cached outline entries."""
|
||||
persistent_outline_cache.clear_cache()
|
||||
logger.info("Outline cache cleared")
|
||||
|
||||
def invalidate_outline_cache_for_keywords(self, keywords: List[str]):
|
||||
"""
|
||||
Invalidate outline cache entries for specific keywords.
|
||||
Useful when research data is updated.
|
||||
|
||||
Args:
|
||||
keywords: Keywords to invalidate cache for
|
||||
"""
|
||||
persistent_outline_cache.invalidate_cache_for_keywords(keywords)
|
||||
logger.info(f"Invalidated outline cache for keywords: {keywords}")
|
||||
|
||||
def get_recent_outline_cache_entries(self, limit: int = 20) -> List[Dict[str, Any]]:
|
||||
"""Get recent outline cache entries for debugging."""
|
||||
return persistent_outline_cache.get_cache_entries(limit)
|
||||
121
backend/services/blog_writer/outline/parallel_processor.py
Normal file
121
backend/services/blog_writer/outline/parallel_processor.py
Normal file
@@ -0,0 +1,121 @@
|
||||
"""
|
||||
Parallel Processor - Handles parallel processing of outline generation tasks.
|
||||
|
||||
Manages concurrent execution of source mapping and grounding insights extraction.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from typing import Tuple, Any
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class ParallelProcessor:
|
||||
"""Handles parallel processing of outline generation tasks for speed optimization."""
|
||||
|
||||
def __init__(self, source_mapper, grounding_engine):
|
||||
"""Initialize the parallel processor with required dependencies."""
|
||||
self.source_mapper = source_mapper
|
||||
self.grounding_engine = grounding_engine
|
||||
|
||||
async def run_parallel_processing(self, outline_sections, research, user_id: str, task_id: str = None) -> Tuple[Any, Any]:
|
||||
"""
|
||||
Run source mapping and grounding insights extraction in parallel.
|
||||
|
||||
Args:
|
||||
outline_sections: List of outline sections to process
|
||||
research: Research data object
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
task_id: Optional task ID for progress updates
|
||||
|
||||
Returns:
|
||||
Tuple of (mapped_sections, grounding_insights)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for parallel processing (subscription checks and usage tracking)")
|
||||
|
||||
if task_id:
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
await task_manager.update_progress(task_id, "⚡ Running parallel processing for maximum speed...")
|
||||
|
||||
logger.info("Running parallel processing for maximum speed...")
|
||||
|
||||
# Run these tasks in parallel to save time
|
||||
source_mapping_task = asyncio.create_task(
|
||||
self._run_source_mapping(outline_sections, research, task_id, user_id)
|
||||
)
|
||||
|
||||
grounding_insights_task = asyncio.create_task(
|
||||
self._run_grounding_insights_extraction(research, task_id)
|
||||
)
|
||||
|
||||
# Wait for both parallel tasks to complete
|
||||
mapped_sections, grounding_insights = await asyncio.gather(
|
||||
source_mapping_task,
|
||||
grounding_insights_task
|
||||
)
|
||||
|
||||
return mapped_sections, grounding_insights
|
||||
|
||||
async def run_parallel_processing_async(self, outline_sections, research, user_id: str) -> Tuple[Any, Any]:
|
||||
"""
|
||||
Run parallel processing without progress updates (for non-progress methods).
|
||||
|
||||
Args:
|
||||
outline_sections: List of outline sections to process
|
||||
research: Research data object
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
Tuple of (mapped_sections, grounding_insights)
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for parallel processing (subscription checks and usage tracking)")
|
||||
|
||||
logger.info("Running parallel processing for maximum speed...")
|
||||
|
||||
# Run these tasks in parallel to save time
|
||||
source_mapping_task = asyncio.create_task(
|
||||
self._run_source_mapping_async(outline_sections, research, user_id)
|
||||
)
|
||||
|
||||
grounding_insights_task = asyncio.create_task(
|
||||
self._run_grounding_insights_extraction_async(research)
|
||||
)
|
||||
|
||||
# Wait for both parallel tasks to complete
|
||||
mapped_sections, grounding_insights = await asyncio.gather(
|
||||
source_mapping_task,
|
||||
grounding_insights_task
|
||||
)
|
||||
|
||||
return mapped_sections, grounding_insights
|
||||
|
||||
async def _run_source_mapping(self, outline_sections, research, task_id, user_id: str):
|
||||
"""Run source mapping in parallel."""
|
||||
if task_id:
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
await task_manager.update_progress(task_id, "🔗 Applying intelligent source-to-section mapping...")
|
||||
return self.source_mapper.map_sources_to_sections(outline_sections, research, user_id)
|
||||
|
||||
async def _run_grounding_insights_extraction(self, research, task_id):
|
||||
"""Run grounding insights extraction in parallel."""
|
||||
if task_id:
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
await task_manager.update_progress(task_id, "🧠 Extracting grounding metadata insights...")
|
||||
return self.grounding_engine.extract_contextual_insights(research.grounding_metadata)
|
||||
|
||||
async def _run_source_mapping_async(self, outline_sections, research, user_id: str):
|
||||
"""Run source mapping in parallel (async version without progress updates)."""
|
||||
logger.info("Applying intelligent source-to-section mapping...")
|
||||
return self.source_mapper.map_sources_to_sections(outline_sections, research, user_id)
|
||||
|
||||
async def _run_grounding_insights_extraction_async(self, research):
|
||||
"""Run grounding insights extraction in parallel (async version without progress updates)."""
|
||||
logger.info("Extracting grounding metadata insights...")
|
||||
return self.grounding_engine.extract_contextual_insights(research.grounding_metadata)
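
# --- Illustrative sketch (not part of the module): the fan-out/join pattern above ---
# Two independent awaitables are scheduled with asyncio.create_task() and joined with
# asyncio.gather(), so total latency is bounded by the slower task rather than the sum.
# The coroutines below are stand-ins for source mapping and insight extraction.
async def _mock_source_mapping() -> str:
    await asyncio.sleep(0.1)
    return "mapped-sections"

async def _mock_grounding_insights() -> str:
    await asyncio.sleep(0.1)
    return "grounding-insights"

async def _demo_parallel_fanout() -> tuple:
    mapping_task = asyncio.create_task(_mock_source_mapping())
    insights_task = asyncio.create_task(_mock_grounding_insights())
    return tuple(await asyncio.gather(mapping_task, insights_task))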
|
||||
127
backend/services/blog_writer/outline/prompt_builder.py
Normal file
127
backend/services/blog_writer/outline/prompt_builder.py
Normal file
@@ -0,0 +1,127 @@
|
||||
"""
|
||||
Prompt Builder - Handles building of AI prompts for outline generation.
|
||||
|
||||
Constructs comprehensive prompts with research data, keywords, and strategic requirements.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
|
||||
|
||||
class PromptBuilder:
|
||||
"""Handles building of comprehensive AI prompts for outline generation."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the prompt builder."""
|
||||
pass
|
||||
|
||||
def build_outline_prompt(self, primary_keywords: List[str], secondary_keywords: List[str],
|
||||
content_angles: List[str], sources: List, search_intent: str,
|
||||
request, custom_instructions: str = None) -> str:
|
||||
"""Build the comprehensive outline generation prompt using filtered research data."""
|
||||
|
||||
# Use the filtered research data (already cleaned by ResearchDataFilter)
|
||||
research = request.research
|
||||
|
||||
primary_kw_text = ', '.join(primary_keywords) if primary_keywords else (request.topic or ', '.join(getattr(request.research, 'original_keywords', []) or ['the target topic']))
|
||||
secondary_kw_text = ', '.join(secondary_keywords) if secondary_keywords else "None provided"
|
||||
long_tail_text = ', '.join(research.keyword_analysis.get('long_tail', [])) if research and research.keyword_analysis else "None discovered"
|
||||
semantic_text = ', '.join(research.keyword_analysis.get('semantic_keywords', [])) if research and research.keyword_analysis else "None discovered"
|
||||
trending_text = ', '.join(research.keyword_analysis.get('trending_terms', [])) if research and research.keyword_analysis else "None discovered"
|
||||
content_gap_text = ', '.join(research.keyword_analysis.get('content_gaps', [])) if research and research.keyword_analysis else "None identified"
|
||||
content_angle_text = ', '.join(content_angles) if content_angles else "No explicit angles provided; infer compelling angles from research insights."
|
||||
competitor_text = ', '.join(research.competitor_analysis.get('top_competitors', [])) if research and research.competitor_analysis else "Not available"
|
||||
opportunity_text = ', '.join(research.competitor_analysis.get('opportunities', [])) if research and research.competitor_analysis else "Not available"
|
||||
advantages_text = ', '.join(research.competitor_analysis.get('competitive_advantages', [])) if research and research.competitor_analysis else "Not available"
|
||||
|
||||
return f"""Create a comprehensive blog outline for: {primary_kw_text}
|
||||
|
||||
CONTEXT:
|
||||
Search Intent: {search_intent}
|
||||
Target: {request.word_count or 1500} words
|
||||
Industry: {getattr(request.persona, 'industry', 'General') if request.persona else 'General'}
|
||||
Audience: {getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'}
|
||||
|
||||
KEYWORDS:
|
||||
Primary: {primary_kw_text}
|
||||
Secondary: {secondary_kw_text}
|
||||
Long-tail: {long_tail_text}
|
||||
Semantic: {semantic_text}
|
||||
Trending: {trending_text}
|
||||
Content Gaps: {content_gap_text}
|
||||
|
||||
CONTENT ANGLES / STORYLINES: {content_angle_text}
|
||||
|
||||
COMPETITIVE INTELLIGENCE:
|
||||
Top Competitors: {competitor_text}
|
||||
Market Opportunities: {opportunity_text}
|
||||
Competitive Advantages: {advantages_text}
|
||||
|
||||
RESEARCH SOURCES: {len(sources)} authoritative sources available
|
||||
|
||||
{f"CUSTOM INSTRUCTIONS: {custom_instructions}" if custom_instructions else ""}
|
||||
|
||||
STRATEGIC REQUIREMENTS:
|
||||
- Create SEO-optimized headings with natural keyword integration
|
||||
- Surface the strongest research-backed angles within the outline
|
||||
- Build logical narrative flow from problem to solution
|
||||
- Include data-driven insights from research sources
|
||||
- Address content gaps and market opportunities
|
||||
- Optimize for search intent and user questions
|
||||
- Ensure engaging, actionable content throughout
|
||||
|
||||
Return JSON format:
{{
"title_options": [
"Title option 1",
"Title option 2",
"Title option 3"
],
"outline": [
{{
"heading": "Section heading with primary keyword",
"subheadings": ["Subheading 1", "Subheading 2", "Subheading 3"],
"key_points": ["Key point 1", "Key point 2", "Key point 3"],
"target_words": 300,
"keywords": ["primary keyword", "secondary keyword"]
}}
]
}}"""
|
||||
|
||||
def get_outline_schema(self) -> Dict[str, Any]:
|
||||
"""Get the structured JSON schema for outline generation."""
|
||||
return {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"title_options": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"outline": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"heading": {"type": "string"},
|
||||
"subheadings": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
},
|
||||
"key_points": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
},
|
||||
"target_words": {"type": "integer"},
|
||||
"keywords": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
}
|
||||
},
|
||||
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["title_options", "outline"],
|
||||
"propertyOrdering": ["title_options", "outline"]
|
||||
}
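
# --- Illustrative sketch (not part of the module): a payload matching the schema ----
# Shown only to make the expected response shape concrete; every value is invented.
# Note that "propertyOrdering" above is a provider-side ordering hint rather than a
# standard JSON Schema keyword.
_EXAMPLE_OUTLINE_PAYLOAD = {
    "title_options": [
        "Primary Keyword: A Practical Guide",
        "How to Get Results with Primary Keyword",
        "Primary Keyword Mistakes to Avoid",
    ],
    "outline": [
        {
            "heading": "What Primary Keyword Means for Your Audience",
            "subheadings": ["Definition", "Why it matters"],
            "key_points": ["Core concept", "Common misconception"],
            "target_words": 300,
            "keywords": ["primary keyword", "secondary keyword"],
        },
    ],
}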
|
||||
120
backend/services/blog_writer/outline/response_processor.py
Normal file
120
backend/services/blog_writer/outline/response_processor.py
Normal file
@@ -0,0 +1,120 @@
|
||||
"""
|
||||
Response Processor - Handles AI response processing and retry logic.
|
||||
|
||||
Processes AI responses, handles retries, and converts data to proper formats.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
import asyncio
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import BlogOutlineSection
|
||||
|
||||
|
||||
class ResponseProcessor:
|
||||
"""Handles AI response processing, retry logic, and data conversion."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the response processor."""
|
||||
pass
|
||||
|
||||
async def generate_with_retry(self, prompt: str, schema: Dict[str, Any], user_id: str, task_id: str = None) -> Dict[str, Any]:
|
||||
"""Generate outline with retry logic for API failures.
|
||||
|
||||
Args:
|
||||
prompt: The prompt for outline generation
|
||||
schema: JSON schema for structured response
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
task_id: Optional task ID for progress updates
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for outline generation (subscription checks and usage tracking)")
|
||||
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
|
||||
max_retries = 2 # Conservative retry for expensive API calls
|
||||
retry_delay = 5 # 5 second delay between retries
|
||||
|
||||
for attempt in range(max_retries + 1):
|
||||
try:
|
||||
if task_id:
|
||||
await task_manager.update_progress(task_id, f"🤖 Calling AI API for outline generation (attempt {attempt + 1}/{max_retries + 1})...")
|
||||
|
||||
outline_data = llm_text_gen(
|
||||
prompt=prompt,
|
||||
json_struct=schema,
|
||||
system_prompt=None,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
# Log response for debugging
|
||||
logger.info(f"AI response received: {type(outline_data)}")
|
||||
|
||||
# Check for errors in the response
|
||||
if isinstance(outline_data, dict) and 'error' in outline_data:
|
||||
error_msg = str(outline_data['error'])
|
||||
if "503" in error_msg and "overloaded" in error_msg and attempt < max_retries:
|
||||
if task_id:
|
||||
await task_manager.update_progress(task_id, f"⚠️ AI service overloaded, retrying in {retry_delay} seconds...")
|
||||
logger.warning(f"AI API overloaded, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
|
||||
await asyncio.sleep(retry_delay)
|
||||
continue
|
||||
elif "No valid structured response content found" in error_msg and attempt < max_retries:
|
||||
if task_id:
|
||||
await task_manager.update_progress(task_id, f"⚠️ Invalid response format, retrying in {retry_delay} seconds...")
|
||||
logger.warning(f"AI response parsing failed, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
|
||||
await asyncio.sleep(retry_delay)
|
||||
continue
|
||||
else:
|
||||
logger.error(f"AI structured response error: {outline_data['error']}")
|
||||
raise ValueError(f"AI outline generation failed: {outline_data['error']}")
|
||||
|
||||
# Validate required fields
|
||||
if not isinstance(outline_data, dict) or 'outline' not in outline_data or not isinstance(outline_data['outline'], list):
|
||||
if attempt < max_retries:
|
||||
if task_id:
|
||||
await task_manager.update_progress(task_id, f"⚠️ Invalid response structure, retrying in {retry_delay} seconds...")
|
||||
logger.warning(f"Invalid response structure, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1})")
|
||||
await asyncio.sleep(retry_delay)
|
||||
continue
|
||||
else:
|
||||
raise ValueError("Invalid outline structure in AI response")
|
||||
|
||||
# If we get here, the response is valid
|
||||
return outline_data
|
||||
|
||||
except Exception as e:
|
||||
error_str = str(e)
|
||||
if ("503" in error_str or "overloaded" in error_str) and attempt < max_retries:
|
||||
if task_id:
|
||||
await task_manager.update_progress(task_id, f"⚠️ AI service error, retrying in {retry_delay} seconds...")
|
||||
logger.warning(f"AI API error, retrying in {retry_delay} seconds (attempt {attempt + 1}/{max_retries + 1}): {error_str}")
|
||||
await asyncio.sleep(retry_delay)
|
||||
continue
|
||||
else:
|
||||
logger.error(f"Outline generation failed after {attempt + 1} attempts: {error_str}")
|
||||
raise ValueError(f"AI outline generation failed: {error_str}")
|
||||
|
||||
def convert_to_sections(self, outline_data: Dict[str, Any], sources: List) -> List[BlogOutlineSection]:
|
||||
"""Convert outline data to BlogOutlineSection objects."""
|
||||
outline_sections = []
|
||||
for i, section_data in enumerate(outline_data.get('outline', [])):
|
||||
if not isinstance(section_data, dict) or 'heading' not in section_data:
|
||||
continue
|
||||
|
||||
section = BlogOutlineSection(
|
||||
id=f"s{i+1}",
|
||||
heading=section_data.get('heading', f'Section {i+1}'),
|
||||
subheadings=section_data.get('subheadings', []),
|
||||
key_points=section_data.get('key_points', []),
|
||||
references=[], # Will be populated by intelligent mapping
|
||||
target_words=section_data.get('target_words', 200),
|
||||
keywords=section_data.get('keywords', [])
|
||||
)
|
||||
outline_sections.append(section)
|
||||
|
||||
return outline_sections
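
# --- Illustrative sketch (not part of the module): the fixed-delay retry shape ------
# A stripped-down version of the loop in generate_with_retry above: try the call, and
# on failure sleep for a constant delay before retrying, re-raising after the final
# attempt. The real method additionally inspects provider error strings ("503",
# "overloaded") before deciding whether a retry is worthwhile.
async def _retry_fixed_delay(operation, max_retries: int = 2, retry_delay: float = 5.0):
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                raise
            await asyncio.sleep(retry_delay)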
|
||||
96
backend/services/blog_writer/outline/section_enhancer.py
Normal file
96
backend/services/blog_writer/outline/section_enhancer.py
Normal file
@@ -0,0 +1,96 @@
|
||||
"""
|
||||
Section Enhancer - AI-powered section enhancement and improvement.
|
||||
|
||||
Enhances individual outline sections for better engagement and value.
|
||||
"""
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import BlogOutlineSection
|
||||
|
||||
|
||||
class SectionEnhancer:
|
||||
"""Enhances individual outline sections using AI."""
|
||||
|
||||
async def enhance(self, section: BlogOutlineSection, focus: str, user_id: str) -> BlogOutlineSection:
|
||||
"""Enhance a section using AI with research context.
|
||||
|
||||
Args:
|
||||
section: Outline section to enhance
|
||||
focus: Enhancement focus (e.g., "general improvement")
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
Enhanced outline section
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for section enhancement (subscription checks and usage tracking)")
|
||||
|
||||
enhancement_prompt = f"""
|
||||
Enhance the following blog section to make it more engaging, comprehensive, and valuable:
|
||||
|
||||
Current Section:
|
||||
Heading: {section.heading}
|
||||
Subheadings: {', '.join(section.subheadings)}
|
||||
Key Points: {', '.join(section.key_points)}
|
||||
Target Words: {section.target_words}
|
||||
Keywords: {', '.join(section.keywords)}
|
||||
|
||||
Enhancement Focus: {focus}
|
||||
|
||||
Improve:
|
||||
1. Make subheadings more specific and actionable
|
||||
2. Add more comprehensive key points with data/insights
|
||||
3. Include practical examples and case studies
|
||||
4. Address common questions and objections
|
||||
5. Optimize for SEO with better keyword integration
|
||||
|
||||
Respond with JSON:
|
||||
{{
|
||||
"heading": "Enhanced heading",
|
||||
"subheadings": ["enhanced subheading 1", "enhanced subheading 2"],
|
||||
"key_points": ["enhanced point 1", "enhanced point 2"],
|
||||
"target_words": 400,
|
||||
"keywords": ["keyword1", "keyword2"]
|
||||
}}
|
||||
"""
|
||||
|
||||
try:
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
enhancement_schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"heading": {"type": "string"},
|
||||
"subheadings": {"type": "array", "items": {"type": "string"}},
|
||||
"key_points": {"type": "array", "items": {"type": "string"}},
|
||||
"target_words": {"type": "integer"},
|
||||
"keywords": {"type": "array", "items": {"type": "string"}}
|
||||
},
|
||||
"required": ["heading", "subheadings", "key_points", "target_words", "keywords"]
|
||||
}
|
||||
|
||||
enhanced_data = llm_text_gen(
|
||||
prompt=enhancement_prompt,
|
||||
json_struct=enhancement_schema,
|
||||
system_prompt=None,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
if isinstance(enhanced_data, dict) and 'error' not in enhanced_data:
|
||||
return BlogOutlineSection(
|
||||
id=section.id,
|
||||
heading=enhanced_data.get('heading', section.heading),
|
||||
subheadings=enhanced_data.get('subheadings', section.subheadings),
|
||||
key_points=enhanced_data.get('key_points', section.key_points),
|
||||
references=section.references,
|
||||
target_words=enhanced_data.get('target_words', section.target_words),
|
||||
keywords=enhanced_data.get('keywords', section.keywords)
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"AI section enhancement failed: {e}")
|
||||
|
||||
return section
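
# --- Illustrative usage sketch (not part of the module) -----------------------------
# How a caller might enhance a single section. BlogOutlineSection and SectionEnhancer
# are imported/defined above; the field values and user id are hypothetical.
async def _demo_enhance_section() -> BlogOutlineSection:
    section = BlogOutlineSection(
        id="s1",
        heading="Why AI-Assisted Outlining Matters",
        subheadings=["Time savings", "Consistency"],
        key_points=["Faster drafting", "Better SEO alignment"],
        references=[],
        target_words=300,
        keywords=["ai outlining"],
    )
    return await SectionEnhancer().enhance(section, focus="SEO optimization", user_id="demo-user")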
|
||||
198
backend/services/blog_writer/outline/seo_title_generator.py
Normal file
198
backend/services/blog_writer/outline/seo_title_generator.py
Normal file
@@ -0,0 +1,198 @@
|
||||
"""
|
||||
SEO Title Generator - Specialized service for generating SEO-optimized blog titles.
|
||||
|
||||
Generates 5 premium SEO-optimized titles using research data and outline context.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import BlogResearchResponse, BlogOutlineSection
|
||||
|
||||
|
||||
class SEOTitleGenerator:
|
||||
"""Generates SEO-optimized blog titles using research and outline data."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the SEO title generator."""
|
||||
pass
|
||||
|
||||
def build_title_prompt(
|
||||
self,
|
||||
research: BlogResearchResponse,
|
||||
outline: List[BlogOutlineSection],
|
||||
primary_keywords: List[str],
|
||||
secondary_keywords: List[str],
|
||||
content_angles: List[str],
|
||||
search_intent: str,
|
||||
word_count: int = 1500
|
||||
) -> str:
|
||||
"""Build a specialized prompt for SEO title generation."""
|
||||
|
||||
# Extract key research insights
|
||||
keyword_analysis = research.keyword_analysis or {}
|
||||
competitor_analysis = research.competitor_analysis or {}
|
||||
|
||||
primary_kw_text = ', '.join(primary_keywords) if primary_keywords else "the target topic"
|
||||
secondary_kw_text = ', '.join(secondary_keywords) if secondary_keywords else "None provided"
|
||||
long_tail_text = ', '.join(keyword_analysis.get('long_tail', [])) if keyword_analysis else "None discovered"
|
||||
semantic_text = ', '.join(keyword_analysis.get('semantic_keywords', [])) if keyword_analysis else "None discovered"
|
||||
trending_text = ', '.join(keyword_analysis.get('trending_terms', [])) if keyword_analysis else "None discovered"
|
||||
content_gap_text = ', '.join(keyword_analysis.get('content_gaps', [])) if keyword_analysis else "None identified"
|
||||
content_angle_text = ', '.join(content_angles) if content_angles else "No explicit angles provided"
|
||||
|
||||
# Extract outline structure summary
|
||||
outline_summary = []
|
||||
for i, section in enumerate(outline[:5], 1): # Limit to first 5 sections for context
|
||||
outline_summary.append(f"{i}. {section.heading}")
|
||||
if section.subheadings:
|
||||
outline_summary.append(f" Subtopics: {', '.join(section.subheadings[:3])}")
|
||||
|
||||
outline_text = '\n'.join(outline_summary) if outline_summary else "No outline available"
|
||||
|
||||
return f"""Generate exactly 5 SEO-optimized blog titles for: {primary_kw_text}
|
||||
|
||||
RESEARCH CONTEXT:
|
||||
Primary Keywords: {primary_kw_text}
|
||||
Secondary Keywords: {secondary_kw_text}
|
||||
Long-tail Keywords: {long_tail_text}
|
||||
Semantic Keywords: {semantic_text}
|
||||
Trending Terms: {trending_text}
|
||||
Content Gaps: {content_gap_text}
|
||||
Search Intent: {search_intent}
|
||||
Content Angles: {content_angle_text}
|
||||
|
||||
OUTLINE STRUCTURE:
|
||||
{outline_text}
|
||||
|
||||
COMPETITIVE INTELLIGENCE:
|
||||
Top Competitors: {', '.join(competitor_analysis.get('top_competitors', [])) if competitor_analysis else 'Not available'}
|
||||
Market Opportunities: {', '.join(competitor_analysis.get('opportunities', [])) if competitor_analysis else 'Not available'}
|
||||
|
||||
SEO REQUIREMENTS:
|
||||
- Each title must be 50-65 characters (optimal for search engine display)
|
||||
- Include the primary keyword within the first 55 characters
|
||||
- Highlight a unique value proposition from the research angles
|
||||
- Use power words that drive clicks (e.g., "Ultimate", "Complete", "Essential", "Proven")
|
||||
- Avoid generic phrasing - be specific and benefit-focused
|
||||
- Target the search intent: {search_intent}
|
||||
- Ensure titles are compelling and click-worthy
|
||||
|
||||
Return ONLY a JSON array of exactly 5 titles:
|
||||
[
|
||||
"Title 1 (50-65 chars)",
|
||||
"Title 2 (50-65 chars)",
|
||||
"Title 3 (50-65 chars)",
|
||||
"Title 4 (50-65 chars)",
|
||||
"Title 5 (50-65 chars)"
|
||||
]"""
|
||||
|
||||
def get_title_schema(self) -> Dict[str, Any]:
|
||||
"""Get the JSON schema for title generation."""
|
||||
return {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string",
|
||||
"minLength": 50,
|
||||
"maxLength": 65
|
||||
},
|
||||
"minItems": 5,
|
||||
"maxItems": 5
|
||||
}
|
||||
|
||||
async def generate_seo_titles(
|
||||
self,
|
||||
research: BlogResearchResponse,
|
||||
outline: List[BlogOutlineSection],
|
||||
primary_keywords: List[str],
|
||||
secondary_keywords: List[str],
|
||||
content_angles: List[str],
|
||||
search_intent: str,
|
||||
word_count: int,
|
||||
user_id: str
|
||||
) -> List[str]:
|
||||
"""Generate SEO-optimized titles using research and outline data.
|
||||
|
||||
Args:
|
||||
research: Research data with keywords and insights
|
||||
outline: Blog outline sections
|
||||
primary_keywords: Primary keywords for the blog
|
||||
secondary_keywords: Secondary keywords
|
||||
content_angles: Content angles from research
|
||||
search_intent: Search intent (informational, commercial, etc.)
|
||||
word_count: Target word count
|
||||
user_id: User ID for API calls
|
||||
|
||||
Returns:
|
||||
List of 5 SEO-optimized titles
|
||||
"""
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for title generation")
|
||||
|
||||
# Build specialized prompt
|
||||
prompt = self.build_title_prompt(
|
||||
research=research,
|
||||
outline=outline,
|
||||
primary_keywords=primary_keywords,
|
||||
secondary_keywords=secondary_keywords,
|
||||
content_angles=content_angles,
|
||||
search_intent=search_intent,
|
||||
word_count=word_count
|
||||
)
|
||||
|
||||
# Get schema
|
||||
schema = self.get_title_schema()
|
||||
|
||||
logger.info(f"Generating SEO-optimized titles for user {user_id}")
|
||||
|
||||
try:
|
||||
# Generate titles using structured JSON response
|
||||
result = llm_text_gen(
|
||||
prompt=prompt,
|
||||
json_struct=schema,
|
||||
system_prompt="You are an expert SEO content strategist specializing in creating compelling, search-optimized blog titles.",
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
# Handle response - could be array directly or wrapped in dict
|
||||
if isinstance(result, list):
|
||||
titles = result
|
||||
elif isinstance(result, dict):
|
||||
# Try common keys
|
||||
titles = result.get('titles', result.get('title_options', result.get('options', [])))
|
||||
if not titles and isinstance(result.get('response'), list):
|
||||
titles = result['response']
|
||||
else:
|
||||
logger.warning(f"Unexpected title generation result type: {type(result)}")
|
||||
titles = []
|
||||
|
||||
# Validate and clean titles
|
||||
cleaned_titles = []
|
||||
for title in titles:
|
||||
if isinstance(title, str) and len(title.strip()) >= 30: # Minimum reasonable length
|
||||
cleaned = title.strip()
|
||||
# Ensure it's within reasonable bounds (allow slight overflow for quality)
|
||||
if len(cleaned) <= 70: # Allow slight overflow for quality
|
||||
cleaned_titles.append(cleaned)
|
||||
|
||||
# Ensure we have exactly 5 titles
|
||||
if len(cleaned_titles) < 5:
|
||||
logger.warning(f"Generated only {len(cleaned_titles)} titles, expected 5")
|
||||
# Pad with placeholder if needed (shouldn't happen with proper schema)
|
||||
while len(cleaned_titles) < 5:
|
||||
cleaned_titles.append(f"{primary_keywords[0] if primary_keywords else 'Blog'} - Comprehensive Guide")
|
||||
|
||||
# Return exactly 5 titles
|
||||
return cleaned_titles[:5]
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to generate SEO titles: {e}")
|
||||
# Fallback: generate simple titles from keywords
|
||||
fallback_titles = []
|
||||
primary = primary_keywords[0] if primary_keywords else "Blog Post"
|
||||
for i in range(5):
|
||||
fallback_titles.append(f"{primary}: Complete Guide {i+1}")
|
||||
return fallback_titles
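
# --- Illustrative sketch (not part of the module): the title-length rule above ------
# Titles target 50-65 characters for full SERP display; during cleaning anything from
# 30 up to 70 characters is accepted as a quality allowance. This standalone helper
# mirrors that filter.
def _is_display_friendly_title(title: str) -> bool:
    cleaned = title.strip()
    return 30 <= len(cleaned) <= 70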
|
||||
|
||||
690
backend/services/blog_writer/outline/source_mapper.py
Normal file
690
backend/services/blog_writer/outline/source_mapper.py
Normal file
@@ -0,0 +1,690 @@
|
||||
"""
|
||||
Source-to-Section Mapper - Intelligent mapping of research sources to outline sections.
|
||||
|
||||
This module provides algorithmic mapping of research sources to specific outline sections
|
||||
based on semantic similarity, keyword relevance, and contextual matching. Uses a hybrid
|
||||
approach of algorithmic scoring followed by AI validation for optimal results.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List, Tuple, Optional
|
||||
import re
|
||||
from collections import Counter
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
BlogOutlineSection,
|
||||
ResearchSource,
|
||||
BlogResearchResponse,
|
||||
)
|
||||
|
||||
|
||||
class SourceToSectionMapper:
|
||||
"""Maps research sources to outline sections using intelligent algorithms."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the source-to-section mapper."""
|
||||
self.min_semantic_score = 0.3
|
||||
self.min_keyword_score = 0.2
|
||||
self.min_contextual_score = 0.2
|
||||
self.max_sources_per_section = 3
|
||||
self.min_total_score = 0.4
|
||||
|
||||
# Weight factors for different scoring methods
|
||||
self.weights = {
|
||||
'semantic': 0.4, # Semantic similarity weight
|
||||
'keyword': 0.3, # Keyword matching weight
|
||||
'contextual': 0.3 # Contextual relevance weight
|
||||
}
|
||||
|
||||
# Common stop words for text processing
|
||||
self.stop_words = {
|
||||
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
|
||||
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did',
|
||||
'will', 'would', 'could', 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those',
|
||||
'how', 'what', 'when', 'where', 'why', 'who', 'which', 'much', 'many', 'more', 'most',
|
||||
'some', 'any', 'all', 'each', 'every', 'other', 'another', 'such', 'no', 'not', 'only', 'own',
|
||||
'same', 'so', 'than', 'too', 'very', 'just', 'now', 'here', 'there', 'up', 'down', 'out', 'off',
|
||||
'over', 'under', 'again', 'further', 'then', 'once'
|
||||
}
|
||||
|
||||
logger.info("✅ SourceToSectionMapper initialized with intelligent mapping algorithms")
|
||||
|
||||
def map_sources_to_sections(
|
||||
self,
|
||||
sections: List[BlogOutlineSection],
|
||||
research_data: BlogResearchResponse,
|
||||
user_id: str
|
||||
) -> List[BlogOutlineSection]:
|
||||
"""
|
||||
Map research sources to outline sections using intelligent algorithms.
|
||||
|
||||
Args:
|
||||
sections: List of outline sections to map sources to
|
||||
research_data: Research data containing sources and metadata
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
List of outline sections with intelligently mapped sources
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for source mapping (subscription checks and usage tracking)")
|
||||
|
||||
if not sections or not research_data.sources:
|
||||
logger.warning("No sections or sources to map")
|
||||
return sections
|
||||
|
||||
logger.info(f"Mapping {len(research_data.sources)} sources to {len(sections)} sections")
|
||||
|
||||
# Step 1: Algorithmic mapping
|
||||
mapping_results = self._algorithmic_source_mapping(sections, research_data)
|
||||
|
||||
# Step 2: AI validation and improvement (single prompt, user_id required for subscription checks)
|
||||
validated_mapping = self._ai_validate_mapping(mapping_results, research_data, user_id)
|
||||
|
||||
# Step 3: Apply validated mapping to sections
|
||||
mapped_sections = self._apply_mapping_to_sections(sections, validated_mapping)
|
||||
|
||||
logger.info("✅ Source-to-section mapping completed successfully")
|
||||
return mapped_sections
|
||||
|
||||
def _algorithmic_source_mapping(
|
||||
self,
|
||||
sections: List[BlogOutlineSection],
|
||||
research_data: BlogResearchResponse
|
||||
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
|
||||
"""
|
||||
Perform algorithmic mapping of sources to sections.
|
||||
|
||||
Args:
|
||||
sections: List of outline sections
|
||||
research_data: Research data with sources
|
||||
|
||||
Returns:
|
||||
Dictionary mapping section IDs to list of (source, score) tuples
|
||||
"""
|
||||
mapping_results = {}
|
||||
|
||||
for section in sections:
|
||||
section_scores = []
|
||||
|
||||
for source in research_data.sources:
|
||||
# Calculate multi-dimensional relevance score
|
||||
semantic_score = self._calculate_semantic_similarity(section, source)
|
||||
keyword_score = self._calculate_keyword_relevance(section, source, research_data)
|
||||
contextual_score = self._calculate_contextual_relevance(section, source, research_data)
|
||||
|
||||
# Weighted total score
|
||||
total_score = (
|
||||
semantic_score * self.weights['semantic'] +
|
||||
keyword_score * self.weights['keyword'] +
|
||||
contextual_score * self.weights['contextual']
|
||||
)
|
||||
|
||||
# Only include sources that meet minimum threshold
|
||||
if total_score >= self.min_total_score:
|
||||
section_scores.append((source, total_score))
|
||||
|
||||
# Sort by score and limit to max sources per section
|
||||
section_scores.sort(key=lambda x: x[1], reverse=True)
|
||||
section_scores = section_scores[:self.max_sources_per_section]
|
||||
|
||||
mapping_results[section.id] = section_scores
|
||||
|
||||
logger.debug(f"Section '{section.heading}': {len(section_scores)} sources mapped")
|
||||
|
||||
return mapping_results
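
# --- Illustrative sketch (comment only): the weighted relevance score above ---------
# total = 0.4 * semantic + 0.3 * keyword + 0.3 * contextual. For example, hypothetical
# scores of (0.6, 0.5, 0.3) combine to 0.24 + 0.15 + 0.09 = 0.48, which clears the
# min_total_score threshold of 0.4, so that source would be kept for the section
# (subject to the max_sources_per_section cap of 3).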
|
||||
|
||||
def _calculate_semantic_similarity(self, section: BlogOutlineSection, source: ResearchSource) -> float:
|
||||
"""
|
||||
Calculate semantic similarity between section and source.
|
||||
|
||||
Args:
|
||||
section: Outline section
|
||||
source: Research source
|
||||
|
||||
Returns:
|
||||
Semantic similarity score (0.0 to 1.0)
|
||||
"""
|
||||
# Extract text content for comparison
|
||||
section_text = self._extract_section_text(section)
|
||||
source_text = self._extract_source_text(source)
|
||||
|
||||
# Calculate word overlap
|
||||
section_words = self._extract_meaningful_words(section_text)
|
||||
source_words = self._extract_meaningful_words(source_text)
|
||||
|
||||
if not section_words or not source_words:
|
||||
return 0.0
|
||||
|
||||
# Calculate Jaccard similarity
|
||||
intersection = len(set(section_words) & set(source_words))
|
||||
union = len(set(section_words) | set(source_words))
|
||||
|
||||
jaccard_similarity = intersection / union if union > 0 else 0.0
|
||||
|
||||
# Boost score for exact phrase matches
|
||||
phrase_boost = self._calculate_phrase_similarity(section_text, source_text)
|
||||
|
||||
# Combine Jaccard similarity with phrase boost
|
||||
semantic_score = min(1.0, jaccard_similarity + phrase_boost)
|
||||
|
||||
return semantic_score
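
# --- Illustrative sketch (comment only): Jaccard similarity on meaningful words -----
# With hypothetical word sets {"ai", "blog", "tools"} and {"ai", "tools", "seo"}, the
# intersection has 2 words and the union has 4, so the Jaccard similarity is
# 2 / 4 = 0.5 before any exact-phrase boost is added.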
|
||||
|
||||
def _calculate_keyword_relevance(
|
||||
self,
|
||||
section: BlogOutlineSection,
|
||||
source: ResearchSource,
|
||||
research_data: BlogResearchResponse
|
||||
) -> float:
|
||||
"""
|
||||
Calculate keyword-based relevance between section and source.
|
||||
|
||||
Args:
|
||||
section: Outline section
|
||||
source: Research source
|
||||
research_data: Research data with keyword analysis
|
||||
|
||||
Returns:
|
||||
Keyword relevance score (0.0 to 1.0)
|
||||
"""
|
||||
# Get section keywords
|
||||
section_keywords = set(section.keywords)
|
||||
if not section_keywords:
|
||||
# Extract keywords from section heading and content
|
||||
section_text = self._extract_section_text(section)
|
||||
section_keywords = set(self._extract_meaningful_words(section_text))
|
||||
|
||||
# Get source keywords from title and excerpt
|
||||
source_text = f"{source.title} {source.excerpt or ''}"
|
||||
source_keywords = set(self._extract_meaningful_words(source_text))
|
||||
|
||||
# Get research keywords for context
|
||||
research_keywords = set()
|
||||
for category in ['primary', 'secondary', 'long_tail', 'semantic_keywords']:
|
||||
research_keywords.update(research_data.keyword_analysis.get(category, []))
|
||||
|
||||
# Calculate keyword overlap scores
|
||||
section_overlap = len(section_keywords & source_keywords) / len(section_keywords) if section_keywords else 0.0
|
||||
research_overlap = len(research_keywords & source_keywords) / len(research_keywords) if research_keywords else 0.0
|
||||
|
||||
# Weighted combination
|
||||
keyword_score = (section_overlap * 0.7) + (research_overlap * 0.3)
|
||||
|
||||
return min(1.0, keyword_score)
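    # Worked example (comment only): with 4 section keywords of which 2 appear in
    # the source (section_overlap = 0.5) and 10 research keywords of which 3 appear
    # in the source (research_overlap = 0.3), the combined score is
    # 0.5 * 0.7 + 0.3 * 0.3 = 0.44.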
|
||||
|
||||
def _calculate_contextual_relevance(
|
||||
self,
|
||||
section: BlogOutlineSection,
|
||||
source: ResearchSource,
|
||||
research_data: BlogResearchResponse
|
||||
) -> float:
|
||||
"""
|
||||
Calculate contextual relevance based on section content and source context.
|
||||
|
||||
Args:
|
||||
section: Outline section
|
||||
source: Research source
|
||||
research_data: Research data with context
|
||||
|
||||
Returns:
|
||||
Contextual relevance score (0.0 to 1.0)
|
||||
"""
|
||||
contextual_score = 0.0
|
||||
|
||||
# 1. Content angle matching
|
||||
section_text = self._extract_section_text(section).lower()
|
||||
source_text = f"{source.title} {source.excerpt or ''}".lower()
|
||||
|
||||
# Check for content angle matches
|
||||
content_angles = research_data.suggested_angles
|
||||
for angle in content_angles:
|
||||
angle_words = self._extract_meaningful_words(angle.lower())
|
||||
if angle_words:
|
||||
section_angle_match = sum(1 for word in angle_words if word in section_text) / len(angle_words)
|
||||
source_angle_match = sum(1 for word in angle_words if word in source_text) / len(angle_words)
|
||||
contextual_score += (section_angle_match + source_angle_match) * 0.3
|
||||
|
||||
# 2. Search intent alignment
|
||||
search_intent = research_data.keyword_analysis.get('search_intent', 'informational')
|
||||
intent_keywords = self._get_intent_keywords(search_intent)
|
||||
|
||||
intent_score = 0.0
|
||||
for keyword in intent_keywords:
|
||||
if keyword in section_text or keyword in source_text:
|
||||
intent_score += 0.1
|
||||
|
||||
contextual_score += min(0.3, intent_score)
|
||||
|
||||
# 3. Industry/domain relevance
|
||||
if hasattr(research_data, 'industry') and research_data.industry:
|
||||
industry_words = self._extract_meaningful_words(research_data.industry.lower())
|
||||
industry_score = sum(1 for word in industry_words if word in source_text) / len(industry_words) if industry_words else 0.0
|
||||
contextual_score += industry_score * 0.2
|
||||
|
||||
return min(1.0, contextual_score)
|
||||
|
||||
def _ai_validate_mapping(
|
||||
self,
|
||||
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]],
|
||||
research_data: BlogResearchResponse,
|
||||
user_id: str
|
||||
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
|
||||
"""
|
||||
Use AI to validate and improve the algorithmic mapping results.
|
||||
|
||||
Args:
|
||||
mapping_results: Algorithmic mapping results
|
||||
research_data: Research data for context
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
AI-validated and improved mapping results
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for AI validation (subscription checks and usage tracking)")
|
||||
|
||||
try:
|
||||
logger.info("Starting AI validation of source-to-section mapping...")
|
||||
|
||||
# Build AI validation prompt
|
||||
validation_prompt = self._build_validation_prompt(mapping_results, research_data)
|
||||
|
||||
# Get AI validation response (user_id required for subscription checks)
|
||||
validation_response = self._get_ai_validation_response(validation_prompt, user_id)
|
||||
|
||||
# Parse and apply AI validation results
|
||||
validated_mapping = self._parse_validation_response(validation_response, mapping_results, research_data)
|
||||
|
||||
logger.info("✅ AI validation completed successfully")
|
||||
return validated_mapping
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"AI validation failed: {e}. Using algorithmic results as fallback.")
|
||||
return mapping_results
|
||||
|
||||
def _apply_mapping_to_sections(
|
||||
self,
|
||||
sections: List[BlogOutlineSection],
|
||||
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]]
|
||||
) -> List[BlogOutlineSection]:
|
||||
"""
|
||||
Apply the mapping results to the outline sections.
|
||||
|
||||
Args:
|
||||
sections: Original outline sections
|
||||
mapping_results: Mapping results from algorithmic/AI processing
|
||||
|
||||
Returns:
|
||||
Sections with mapped sources
|
||||
"""
|
||||
mapped_sections = []
|
||||
|
||||
for section in sections:
|
||||
# Get mapped sources for this section
|
||||
mapped_sources = mapping_results.get(section.id, [])
|
||||
|
||||
# Extract just the sources (without scores)
|
||||
section_sources = [source for source, score in mapped_sources]
|
||||
|
||||
# Create new section with mapped sources
|
||||
mapped_section = BlogOutlineSection(
|
||||
id=section.id,
|
||||
heading=section.heading,
|
||||
subheadings=section.subheadings,
|
||||
key_points=section.key_points,
|
||||
references=section_sources,
|
||||
target_words=section.target_words,
|
||||
keywords=section.keywords
|
||||
)
|
||||
|
||||
mapped_sections.append(mapped_section)
|
||||
|
||||
logger.debug(f"Applied {len(section_sources)} sources to section '{section.heading}'")
|
||||
|
||||
return mapped_sections
|
||||
|
||||
# Helper methods
|
||||
|
||||
def _extract_section_text(self, section: BlogOutlineSection) -> str:
|
||||
"""Extract all text content from a section."""
|
||||
text_parts = [section.heading]
|
||||
text_parts.extend(section.subheadings)
|
||||
text_parts.extend(section.key_points)
|
||||
text_parts.extend(section.keywords)
|
||||
return " ".join(text_parts)
|
||||
|
||||
def _extract_source_text(self, source: ResearchSource) -> str:
|
||||
"""Extract all text content from a source."""
|
||||
text_parts = [source.title]
|
||||
if source.excerpt:
|
||||
text_parts.append(source.excerpt)
|
||||
return " ".join(text_parts)
|
||||
|
||||
def _extract_meaningful_words(self, text: str) -> List[str]:
|
||||
"""Extract meaningful words from text, removing stop words and cleaning."""
|
||||
if not text:
|
||||
return []
|
||||
|
||||
# Clean and tokenize
|
||||
words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
|
||||
|
||||
# Remove stop words and short words
|
||||
meaningful_words = [
|
||||
word for word in words
|
||||
if word not in self.stop_words and len(word) > 2
|
||||
]
|
||||
|
||||
return meaningful_words
|
||||
|
||||
def _calculate_phrase_similarity(self, text1: str, text2: str) -> float:
|
||||
"""Calculate phrase similarity boost score."""
|
||||
if not text1 or not text2:
|
||||
return 0.0
|
||||
|
||||
text1_lower = text1.lower()
|
||||
text2_lower = text2.lower()
|
||||
|
||||
# Look for 2-3 word phrases
|
||||
phrase_boost = 0.0
|
||||
|
||||
# Extract 2-word phrases
|
||||
words1 = text1_lower.split()
|
||||
words2 = text2_lower.split()
|
||||
|
||||
for i in range(len(words1) - 1):
|
||||
phrase = f"{words1[i]} {words1[i+1]}"
|
||||
if phrase in text2_lower:
|
||||
phrase_boost += 0.1
|
||||
|
||||
# Extract 3-word phrases
|
||||
for i in range(len(words1) - 2):
|
||||
phrase = f"{words1[i]} {words1[i+1]} {words1[i+2]}"
|
||||
if phrase in text2_lower:
|
||||
phrase_boost += 0.15
|
||||
|
||||
return min(0.3, phrase_boost) # Cap at 0.3
|
||||
|
||||
def _get_intent_keywords(self, search_intent: str) -> List[str]:
|
||||
"""Get keywords associated with search intent."""
|
||||
intent_keywords = {
|
||||
'informational': ['what', 'how', 'why', 'guide', 'tutorial', 'explain', 'learn', 'understand'],
|
||||
'navigational': ['find', 'locate', 'search', 'where', 'site', 'website', 'page'],
|
||||
'transactional': ['buy', 'purchase', 'order', 'price', 'cost', 'deal', 'offer', 'discount'],
|
||||
'commercial': ['compare', 'review', 'best', 'top', 'vs', 'versus', 'alternative', 'option']
|
||||
}
|
||||
|
||||
return intent_keywords.get(search_intent, [])
|
||||
|
||||
def get_mapping_statistics(self, mapping_results: Dict[str, List[Tuple[ResearchSource, float]]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Get statistics about the mapping results.
|
||||
|
||||
Args:
|
||||
mapping_results: Mapping results to analyze
|
||||
|
||||
Returns:
|
||||
Dictionary with mapping statistics
|
||||
"""
|
||||
total_sections = len(mapping_results)
|
||||
total_mappings = sum(len(sources) for sources in mapping_results.values())
|
||||
|
||||
# Calculate score distribution
|
||||
all_scores = []
|
||||
for sources in mapping_results.values():
|
||||
all_scores.extend([score for source, score in sources])
|
||||
|
||||
avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
|
||||
max_score = max(all_scores) if all_scores else 0.0
|
||||
min_score = min(all_scores) if all_scores else 0.0
|
||||
|
||||
# Count sections with/without sources
|
||||
sections_with_sources = sum(1 for sources in mapping_results.values() if sources)
|
||||
sections_without_sources = total_sections - sections_with_sources
|
||||
|
||||
return {
|
||||
'total_sections': total_sections,
|
||||
'total_mappings': total_mappings,
|
||||
'sections_with_sources': sections_with_sources,
|
||||
'sections_without_sources': sections_without_sources,
|
||||
'average_score': avg_score,
|
||||
'max_score': max_score,
|
||||
'min_score': min_score,
|
||||
'mapping_coverage': sections_with_sources / total_sections if total_sections > 0 else 0.0
|
||||
}
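    # Hypothetical usage sketch (the `mapper` variable and `mapping_results` are
    # assumptions for illustration; only get_mapping_statistics is defined here):
    #
    #   stats = mapper.get_mapping_statistics(mapping_results)
    #   logger.info(f"Coverage: {stats['mapping_coverage']:.0%}, "
    #               f"avg score: {stats['average_score']:.2f}")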
|
||||
|
||||
def _build_validation_prompt(
|
||||
self,
|
||||
mapping_results: Dict[str, List[Tuple[ResearchSource, float]]],
|
||||
research_data: BlogResearchResponse
|
||||
) -> str:
|
||||
"""
|
||||
Build comprehensive AI validation prompt for source-to-section mapping.
|
||||
|
||||
Args:
|
||||
mapping_results: Algorithmic mapping results
|
||||
research_data: Research data for context
|
||||
|
||||
Returns:
|
||||
Formatted AI validation prompt
|
||||
"""
|
||||
# Extract section information
|
||||
sections_info = []
|
||||
for section_id, sources in mapping_results.items():
|
||||
section_info = {
|
||||
'id': section_id,
|
||||
'sources': [
|
||||
{
|
||||
'title': source.title,
|
||||
'url': source.url,
|
||||
'excerpt': source.excerpt,
|
||||
'credibility_score': source.credibility_score,
|
||||
'algorithmic_score': score
|
||||
}
|
||||
for source, score in sources
|
||||
]
|
||||
}
|
||||
sections_info.append(section_info)
|
||||
|
||||
# Extract research context
|
||||
research_context = {
|
||||
'primary_keywords': research_data.keyword_analysis.get('primary', []),
|
||||
'secondary_keywords': research_data.keyword_analysis.get('secondary', []),
|
||||
'content_angles': research_data.suggested_angles,
|
||||
'search_intent': research_data.keyword_analysis.get('search_intent', 'informational'),
|
||||
'all_sources': [
|
||||
{
|
||||
'title': source.title,
|
||||
'url': source.url,
|
||||
'excerpt': source.excerpt,
|
||||
'credibility_score': source.credibility_score
|
||||
}
|
||||
for source in research_data.sources
|
||||
]
|
||||
}
|
||||
|
||||
prompt = f"""
|
||||
You are an expert content strategist and SEO specialist. Your task is to validate and improve the algorithmic mapping of research sources to blog outline sections.
|
||||
|
||||
## CONTEXT
|
||||
Research Topic: {', '.join(research_context['primary_keywords'])}
|
||||
Search Intent: {research_context['search_intent']}
|
||||
Content Angles: {', '.join(research_context['content_angles'])}
|
||||
|
||||
## ALGORITHMIC MAPPING RESULTS
|
||||
The following sections have been algorithmically mapped with research sources:
|
||||
|
||||
{self._format_sections_for_prompt(sections_info)}
|
||||
|
||||
## AVAILABLE SOURCES
|
||||
All available research sources:
|
||||
{self._format_sources_for_prompt(research_context['all_sources'])}
|
||||
|
||||
## VALIDATION TASK
|
||||
Please analyze the algorithmic mapping and provide improvements:
|
||||
|
||||
1. **Validate Relevance**: Are the mapped sources truly relevant to each section's content and purpose?
|
||||
2. **Identify Gaps**: Are there better sources available that weren't mapped?
|
||||
3. **Suggest Improvements**: Recommend specific source changes for better content alignment
|
||||
4. **Quality Assessment**: Rate the overall mapping quality (1-10)
|
||||
|
||||
## RESPONSE FORMAT
|
||||
Provide your analysis in the following JSON format:
|
||||
|
||||
```json
|
||||
{{
|
||||
"overall_quality_score": 8,
|
||||
"section_improvements": [
|
||||
{{
|
||||
"section_id": "s1",
|
||||
"current_sources": ["source_title_1", "source_title_2"],
|
||||
"recommended_sources": ["better_source_1", "better_source_2", "better_source_3"],
|
||||
"reasoning": "Explanation of why these sources are better suited for this section",
|
||||
"confidence": 0.9
|
||||
}}
|
||||
],
|
||||
"summary": "Overall assessment of the mapping quality and key improvements made"
|
||||
}}
|
||||
```
|
||||
|
||||
## GUIDELINES
|
||||
- Prioritize sources that directly support the section's key points and subheadings
|
||||
- Consider source credibility, recency, and content depth
|
||||
- Ensure sources provide actionable insights for content creation
|
||||
- Maintain diversity in source types and perspectives
|
||||
- Focus on sources that enhance the section's value proposition
|
||||
|
||||
Analyze the mapping and provide your recommendations.
|
||||
"""
|
||||
|
||||
return prompt
|
||||
|
||||
def _get_ai_validation_response(self, prompt: str, user_id: str) -> str:
|
||||
"""
|
||||
Get AI validation response using LLM provider.
|
||||
|
||||
Args:
|
||||
prompt: Validation prompt
|
||||
user_id: User ID (required for subscription checks and usage tracking)
|
||||
|
||||
Returns:
|
||||
AI validation response
|
||||
|
||||
Raises:
|
||||
ValueError: If user_id is not provided
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for AI validation response (subscription checks and usage tracking)")
|
||||
|
||||
try:
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
response = llm_text_gen(
|
||||
prompt=prompt,
|
||||
json_struct=None,
|
||||
system_prompt=None,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
return response
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get AI validation response: {e}")
|
||||
raise
|
||||
|
||||
def _parse_validation_response(
|
||||
self,
|
||||
response: str,
|
||||
original_mapping: Dict[str, List[Tuple[ResearchSource, float]]],
|
||||
research_data: BlogResearchResponse
|
||||
) -> Dict[str, List[Tuple[ResearchSource, float]]]:
|
||||
"""
|
||||
Parse AI validation response and apply improvements.
|
||||
|
||||
Args:
|
||||
response: AI validation response
|
||||
original_mapping: Original algorithmic mapping
|
||||
research_data: Research data for context
|
||||
|
||||
Returns:
|
||||
Improved mapping based on AI validation
|
||||
"""
|
||||
try:
|
||||
import json
|
||||
import re
|
||||
|
||||
# Extract JSON from response
|
||||
json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL)
|
||||
if not json_match:
|
||||
# Try to find JSON without code blocks
|
||||
json_match = re.search(r'(\{.*?\})', response, re.DOTALL)
|
||||
|
||||
if not json_match:
|
||||
logger.warning("Could not extract JSON from AI response")
|
||||
return original_mapping
|
||||
|
||||
validation_data = json.loads(json_match.group(1))
|
||||
|
||||
# Create source lookup for quick access
|
||||
source_lookup = {source.title: source for source in research_data.sources}
|
||||
|
||||
# Apply AI improvements
|
||||
improved_mapping = {}
|
||||
|
||||
for improvement in validation_data.get('section_improvements', []):
|
||||
section_id = improvement['section_id']
|
||||
recommended_titles = improvement['recommended_sources']
|
||||
|
||||
# Map recommended titles to actual sources
|
||||
recommended_sources = []
|
||||
for title in recommended_titles:
|
||||
if title in source_lookup:
|
||||
source = source_lookup[title]
|
||||
# Use high confidence score for AI-recommended sources
|
||||
recommended_sources.append((source, 0.9))
|
||||
|
||||
if recommended_sources:
|
||||
improved_mapping[section_id] = recommended_sources
|
||||
else:
|
||||
# Fallback to original mapping if no valid sources found
|
||||
improved_mapping[section_id] = original_mapping.get(section_id, [])
|
||||
|
||||
# Add sections not mentioned in AI response
|
||||
for section_id, sources in original_mapping.items():
|
||||
if section_id not in improved_mapping:
|
||||
improved_mapping[section_id] = sources
|
||||
|
||||
logger.info(f"AI validation applied: {len(validation_data.get('section_improvements', []))} sections improved")
|
||||
return improved_mapping
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to parse AI validation response: {e}")
|
||||
return original_mapping
|
||||
|
||||
def _format_sections_for_prompt(self, sections_info: List[Dict]) -> str:
|
||||
"""Format sections information for AI prompt."""
|
||||
formatted = []
|
||||
for section in sections_info:
|
||||
section_text = f"**Section {section['id']}:**\n"
|
||||
section_text += f"Sources mapped: {len(section['sources'])}\n"
|
||||
for source in section['sources']:
|
||||
section_text += f"- {source['title']} (Score: {source['algorithmic_score']:.2f})\n"
|
||||
formatted.append(section_text)
|
||||
return "\n".join(formatted)
|
||||
|
||||
def _format_sources_for_prompt(self, sources: List[Dict]) -> str:
|
||||
"""Format sources information for AI prompt."""
|
||||
formatted = []
|
||||
for i, source in enumerate(sources, 1):
|
||||
source_text = f"{i}. **{source['title']}**\n"
|
||||
source_text += f" URL: {source['url']}\n"
|
||||
source_text += f" Credibility: {source['credibility_score']}\n"
|
||||
if source['excerpt']:
|
||||
source_text += f" Excerpt: {source['excerpt'][:200]}...\n"
|
||||
formatted.append(source_text)
|
||||
return "\n".join(formatted)
|
||||
123
backend/services/blog_writer/outline/title_generator.py
Normal file
@@ -0,0 +1,123 @@
|
||||
"""
|
||||
Title Generator - Handles title generation and formatting for blog outlines.
|
||||
|
||||
Extracts content angles from research data and combines them with AI-generated titles.
|
||||
"""
|
||||
|
||||
from datetime import datetime
from typing import List
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class TitleGenerator:
|
||||
"""Handles title generation, formatting, and combination logic."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the title generator."""
|
||||
pass
|
||||
|
||||
def extract_content_angle_titles(self, research) -> List[str]:
|
||||
"""
|
||||
Extract content angles from research data and convert them to blog titles.
|
||||
|
||||
Args:
|
||||
research: BlogResearchResponse object containing suggested_angles
|
||||
|
||||
Returns:
|
||||
List of title-formatted content angles
|
||||
"""
|
||||
if not research or not hasattr(research, 'suggested_angles'):
|
||||
return []
|
||||
|
||||
content_angles = research.suggested_angles or []
|
||||
if not content_angles:
|
||||
return []
|
||||
|
||||
# Convert content angles to title format
|
||||
title_formatted_angles = []
|
||||
for angle in content_angles:
|
||||
if isinstance(angle, str) and angle.strip():
|
||||
# Clean and format the angle as a title
|
||||
formatted_angle = self._format_angle_as_title(angle.strip())
|
||||
if formatted_angle and formatted_angle not in title_formatted_angles:
|
||||
title_formatted_angles.append(formatted_angle)
|
||||
|
||||
logger.info(f"Extracted {len(title_formatted_angles)} content angle titles from research data")
|
||||
return title_formatted_angles
|
||||
|
||||
def _format_angle_as_title(self, angle: str) -> str:
|
||||
"""
|
||||
Format a content angle as a proper blog title.
|
||||
|
||||
Args:
|
||||
angle: Raw content angle string
|
||||
|
||||
Returns:
|
||||
Formatted title string
|
||||
"""
|
||||
if not angle or len(angle.strip()) < 10: # Too short to be a good title
|
||||
return ""
|
||||
|
||||
# Clean up the angle
|
||||
cleaned_angle = angle.strip()
|
||||
|
||||
        # Apply title case to each sentence (a simple heuristic: str.title() also
        # re-cases acronyms such as "AI")
|
||||
sentences = cleaned_angle.split('. ')
|
||||
formatted_sentences = []
|
||||
for sentence in sentences:
|
||||
if sentence.strip():
|
||||
# Use title case for better formatting
|
||||
formatted_sentence = sentence.strip().title()
|
||||
formatted_sentences.append(formatted_sentence)
|
||||
|
||||
formatted_title = '. '.join(formatted_sentences)
|
||||
|
||||
# Ensure it ends with proper punctuation
|
||||
if not formatted_title.endswith(('.', '!', '?')):
|
||||
formatted_title += '.'
|
||||
|
||||
# Limit length to reasonable blog title size
|
||||
if len(formatted_title) > 100:
|
||||
formatted_title = formatted_title[:97] + "..."
|
||||
|
||||
return formatted_title
|
||||
|
||||
def combine_title_options(self, ai_titles: List[str], content_angle_titles: List[str], primary_keywords: List[str]) -> List[str]:
|
||||
"""
|
||||
Combine AI-generated titles with content angle titles, ensuring variety and quality.
|
||||
|
||||
Args:
|
||||
ai_titles: AI-generated title options
|
||||
content_angle_titles: Titles derived from content angles
|
||||
primary_keywords: Primary keywords for fallback generation
|
||||
|
||||
Returns:
|
||||
Combined list of title options (max 6 total)
|
||||
"""
|
||||
all_titles = []
|
||||
|
||||
# Add content angle titles first (these are research-based and valuable)
|
||||
for title in content_angle_titles[:3]: # Limit to top 3 content angles
|
||||
if title and title not in all_titles:
|
||||
all_titles.append(title)
|
||||
|
||||
# Add AI-generated titles
|
||||
for title in ai_titles:
|
||||
if title and title not in all_titles:
|
||||
all_titles.append(title)
|
||||
|
||||
# Note: Removed fallback titles as requested - only use research and AI-generated titles
|
||||
|
||||
# Limit to 6 titles maximum for UI usability
|
||||
final_titles = all_titles[:6]
|
||||
|
||||
logger.info(f"Combined title options: {len(final_titles)} total (AI: {len(ai_titles)}, Content angles: {len(content_angle_titles)})")
|
||||
return final_titles
|
||||
|
||||
def generate_fallback_titles(self, primary_keywords: List[str]) -> List[str]:
|
||||
"""Generate fallback titles when AI generation fails."""
|
||||
primary_keyword = primary_keywords[0] if primary_keywords else "Topic"
|
||||
return [
|
||||
f"The Complete Guide to {primary_keyword}",
|
||||
f"{primary_keyword}: Everything You Need to Know",
|
||||
f"How to Master {primary_keyword} in 2024"
|
||||
]
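

# Minimal manual smoke test (assumption: the module is run directly, e.g.
# `python title_generator.py`; this block is not used by the blog writer service).
if __name__ == "__main__":
    generator = TitleGenerator()
    combined = generator.combine_title_options(
        ai_titles=["Why Email Automation Beats Manual Campaigns"],
        content_angle_titles=["How Small Teams Scale Email Marketing With AI."],
        primary_keywords=["email marketing"],
    )
    print(combined)
    print(generator.generate_fallback_titles(["email marketing"]))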
|
||||
31
backend/services/blog_writer/research/__init__.py
Normal file
@@ -0,0 +1,31 @@
"""
Research module for AI Blog Writer.

This module handles all research-related functionality including:
- Google Search grounding integration
- Keyword analysis and competitor research
- Content angle discovery
- Research caching and optimization
"""

from .research_service import ResearchService
from .keyword_analyzer import KeywordAnalyzer
from .competitor_analyzer import CompetitorAnalyzer
from .content_angle_generator import ContentAngleGenerator
from .data_filter import ResearchDataFilter
from .base_provider import ResearchProvider as BaseResearchProvider
from .google_provider import GoogleResearchProvider
from .exa_provider import ExaResearchProvider
from .tavily_provider import TavilyResearchProvider

__all__ = [
    'ResearchService',
    'KeywordAnalyzer',
    'CompetitorAnalyzer',
    'ContentAngleGenerator',
    'ResearchDataFilter',
    'BaseResearchProvider',
    'GoogleResearchProvider',
    'ExaResearchProvider',
    'TavilyResearchProvider',
]
37
backend/services/blog_writer/research/base_provider.py
Normal file
@@ -0,0 +1,37 @@
"""
Base Research Provider Interface

Abstract base class for research provider implementations.
Ensures consistency across different research providers (Google, Exa, etc.)
"""

from abc import ABC, abstractmethod
from typing import Dict, Any


class ResearchProvider(ABC):
    """Abstract base class for research providers."""

    @abstractmethod
    async def search(
        self,
        prompt: str,
        topic: str,
        industry: str,
        target_audience: str,
        config: Any,  # ResearchConfig
        user_id: str
    ) -> Dict[str, Any]:
        """Execute research and return raw results."""
        pass

    @abstractmethod
    def get_provider_enum(self):
        """Return APIProvider enum for subscription tracking."""
        pass

    @abstractmethod
    def estimate_tokens(self) -> int:
        """Estimate token usage for pre-flight validation."""
        pass
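

# Sketch of a minimal concrete provider (comment only; the class and enum value
# below are illustrative, real implementations live in google_provider.py,
# exa_provider.py and tavily_provider.py):
#
#   class DummyResearchProvider(ResearchProvider):
#       async def search(self, prompt, topic, industry, target_audience, config, user_id):
#           return {"sources": [], "content": "", "provider": "dummy", "search_queries": [topic]}
#
#       def get_provider_enum(self):
#           return APIProvider.DUMMY  # hypothetical enum value for illustration
#
#       def estimate_tokens(self) -> int:
#           return 0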
|
||||
|
||||
72
backend/services/blog_writer/research/competitor_analyzer.py
Normal file
@@ -0,0 +1,72 @@
|
||||
"""
|
||||
Competitor Analyzer - AI-powered competitor analysis for research content.
|
||||
|
||||
Extracts competitor insights and market intelligence from research content.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class CompetitorAnalyzer:
|
||||
"""Analyzes competitors and market intelligence from research content."""
|
||||
|
||||
def analyze(self, content: str, user_id: str = None) -> Dict[str, Any]:
|
||||
"""Parse comprehensive competitor analysis from the research content using AI."""
|
||||
competitor_prompt = f"""
|
||||
Analyze the following research content and extract competitor insights:
|
||||
|
||||
Research Content:
|
||||
{content[:3000]}
|
||||
|
||||
Extract and analyze:
|
||||
1. Top competitors mentioned (companies, brands, platforms)
|
||||
2. Content gaps (what competitors are missing)
|
||||
3. Market opportunities (untapped areas)
|
||||
4. Competitive advantages (what makes content unique)
|
||||
5. Market positioning insights
|
||||
6. Industry leaders and their strategies
|
||||
|
||||
Respond with JSON:
|
||||
{{
|
||||
"top_competitors": ["competitor1", "competitor2"],
|
||||
"content_gaps": ["gap1", "gap2"],
|
||||
"opportunities": ["opportunity1", "opportunity2"],
|
||||
"competitive_advantages": ["advantage1", "advantage2"],
|
||||
"market_positioning": "positioning insights",
|
||||
"industry_leaders": ["leader1", "leader2"],
|
||||
"analysis_notes": "Comprehensive competitor analysis summary"
|
||||
}}
|
||||
"""
|
||||
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
competitor_schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"top_competitors": {"type": "array", "items": {"type": "string"}},
|
||||
"content_gaps": {"type": "array", "items": {"type": "string"}},
|
||||
"opportunities": {"type": "array", "items": {"type": "string"}},
|
||||
"competitive_advantages": {"type": "array", "items": {"type": "string"}},
|
||||
"market_positioning": {"type": "string"},
|
||||
"industry_leaders": {"type": "array", "items": {"type": "string"}},
|
||||
"analysis_notes": {"type": "string"}
|
||||
},
|
||||
"required": ["top_competitors", "content_gaps", "opportunities", "competitive_advantages", "market_positioning", "industry_leaders", "analysis_notes"]
|
||||
}
|
||||
|
||||
competitor_analysis = llm_text_gen(
|
||||
prompt=competitor_prompt,
|
||||
json_struct=competitor_schema,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
if isinstance(competitor_analysis, dict) and 'error' not in competitor_analysis:
|
||||
logger.info("✅ AI competitor analysis completed successfully")
|
||||
return competitor_analysis
|
||||
else:
|
||||
# Fail gracefully - no fallback data
|
||||
error_msg = competitor_analysis.get('error', 'Unknown error') if isinstance(competitor_analysis, dict) else str(competitor_analysis)
|
||||
logger.error(f"AI competitor analysis failed: {error_msg}")
|
||||
raise ValueError(f"Competitor analysis failed: {error_msg}")
|
||||
|
||||
80
backend/services/blog_writer/research/content_angle_generator.py
Normal file
@@ -0,0 +1,80 @@
|
||||
"""
|
||||
Content Angle Generator - AI-powered content angle discovery.
|
||||
|
||||
Generates strategic content angles from research content for blog posts.
|
||||
"""
|
||||
|
||||
from typing import List
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class ContentAngleGenerator:
|
||||
"""Generates strategic content angles from research content."""
|
||||
|
||||
def generate(self, content: str, topic: str, industry: str, user_id: str = None) -> List[str]:
|
||||
"""Parse strategic content angles from the research content using AI."""
|
||||
angles_prompt = f"""
|
||||
Analyze the following research content and create strategic content angles for: {topic} in {industry}
|
||||
|
||||
Research Content:
|
||||
{content[:3000]}
|
||||
|
||||
Create 7 compelling content angles that:
|
||||
1. Leverage current trends and data from the research
|
||||
2. Address content gaps and opportunities
|
||||
3. Appeal to different audience segments
|
||||
4. Include unique perspectives not covered by competitors
|
||||
5. Incorporate specific statistics, case studies, or expert insights
|
||||
6. Create emotional connection and urgency
|
||||
7. Provide actionable value to readers
|
||||
|
||||
Each angle should be:
|
||||
- Specific and data-driven
|
||||
- Unique and differentiated
|
||||
- Compelling and click-worthy
|
||||
- Actionable for readers
|
||||
|
||||
Respond with JSON:
|
||||
{{
|
||||
"content_angles": [
|
||||
"Specific angle 1 with data/trends",
|
||||
"Specific angle 2 with unique perspective",
|
||||
"Specific angle 3 with actionable insights",
|
||||
"Specific angle 4 with case study focus",
|
||||
"Specific angle 5 with future outlook",
|
||||
"Specific angle 6 with problem-solving focus",
|
||||
"Specific angle 7 with industry insights"
|
||||
]
|
||||
}}
|
||||
"""
|
||||
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
angles_schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content_angles": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"minItems": 5,
|
||||
"maxItems": 7
|
||||
}
|
||||
},
|
||||
"required": ["content_angles"]
|
||||
}
|
||||
|
||||
angles_result = llm_text_gen(
|
||||
prompt=angles_prompt,
|
||||
json_struct=angles_schema,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
if isinstance(angles_result, dict) and 'content_angles' in angles_result:
|
||||
logger.info("✅ AI content angles generation completed successfully")
|
||||
return angles_result['content_angles'][:7]
|
||||
else:
|
||||
# Fail gracefully - no fallback data
|
||||
error_msg = angles_result.get('error', 'Unknown error') if isinstance(angles_result, dict) else str(angles_result)
|
||||
logger.error(f"AI content angles generation failed: {error_msg}")
|
||||
raise ValueError(f"Content angles generation failed: {error_msg}")
|
||||
|
||||
519
backend/services/blog_writer/research/data_filter.py
Normal file
@@ -0,0 +1,519 @@
|
||||
"""
|
||||
Research Data Filter - Filters and cleans research data for optimal AI processing.
|
||||
|
||||
This module provides intelligent filtering and cleaning of research data to:
|
||||
1. Remove low-quality sources and irrelevant content
|
||||
2. Optimize data for AI processing (reduce tokens, improve quality)
|
||||
3. Ensure only high-value insights are sent to AI prompts
|
||||
4. Maintain data integrity while improving processing efficiency
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List, Optional, Tuple
|
||||
from datetime import datetime, timedelta
|
||||
import re
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
BlogResearchResponse,
|
||||
ResearchSource,
|
||||
GroundingMetadata,
|
||||
GroundingChunk,
|
||||
GroundingSupport,
|
||||
Citation,
|
||||
)
|
||||
|
||||
|
||||
class ResearchDataFilter:
|
||||
"""Filters and cleans research data for optimal AI processing."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the research data filter with default settings."""
|
||||
# Be conservative but avoid over-filtering which can lead to empty UI
|
||||
self.min_credibility_score = 0.5
|
||||
self.min_excerpt_length = 20
|
||||
self.max_sources = 15
|
||||
self.max_grounding_chunks = 20
|
||||
self.max_content_gaps = 5
|
||||
self.max_keywords_per_category = 10
|
||||
self.min_grounding_confidence = 0.5
|
||||
self.max_source_age_days = 365 * 5 # allow up to 5 years if relevant
|
||||
|
||||
# Common stop words for keyword cleaning
|
||||
self.stop_words = {
|
||||
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
|
||||
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did',
|
||||
'will', 'would', 'could', 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'
|
||||
}
|
||||
|
||||
# Irrelevant source patterns
|
||||
self.irrelevant_patterns = [
|
||||
r'\.(pdf|doc|docx|xls|xlsx|ppt|pptx)$', # Document files
|
||||
r'\.(jpg|jpeg|png|gif|svg|webp)$', # Image files
|
||||
r'\.(mp4|avi|mov|wmv|flv|webm)$', # Video files
|
||||
r'\.(mp3|wav|flac|aac)$', # Audio files
|
||||
r'\.(zip|rar|7z|tar|gz)$', # Archive files
|
||||
r'^https?://(www\.)?(facebook|twitter|instagram|linkedin|youtube)\.com', # Social media
|
||||
r'^https?://(www\.)?(amazon|ebay|etsy)\.com', # E-commerce
|
||||
r'^https?://(www\.)?(wikipedia)\.org', # Wikipedia (too generic)
|
||||
]
|
||||
|
||||
logger.info("✅ ResearchDataFilter initialized with quality thresholds")
|
||||
|
||||
def filter_research_data(self, research_data: BlogResearchResponse) -> BlogResearchResponse:
|
||||
"""
|
||||
Main filtering method that processes all research data components.
|
||||
|
||||
Args:
|
||||
research_data: Raw research data from the research service
|
||||
|
||||
Returns:
|
||||
Filtered and cleaned research data optimized for AI processing
|
||||
"""
|
||||
logger.info(f"Starting research data filtering for {len(research_data.sources)} sources")
|
||||
|
||||
# Track original counts for logging
|
||||
original_counts = {
|
||||
'sources': len(research_data.sources),
|
||||
'grounding_chunks': len(research_data.grounding_metadata.grounding_chunks) if research_data.grounding_metadata else 0,
|
||||
'grounding_supports': len(research_data.grounding_metadata.grounding_supports) if research_data.grounding_metadata else 0,
|
||||
'citations': len(research_data.grounding_metadata.citations) if research_data.grounding_metadata else 0,
|
||||
}
|
||||
|
||||
# Filter sources
|
||||
filtered_sources = self.filter_sources(research_data.sources)
|
||||
|
||||
# Filter grounding metadata
|
||||
filtered_grounding_metadata = self.filter_grounding_metadata(research_data.grounding_metadata)
|
||||
|
||||
# Clean keyword analysis
|
||||
cleaned_keyword_analysis = self.clean_keyword_analysis(research_data.keyword_analysis)
|
||||
|
||||
# Clean competitor analysis
|
||||
cleaned_competitor_analysis = self.clean_competitor_analysis(research_data.competitor_analysis)
|
||||
|
||||
# Filter content gaps
|
||||
filtered_content_gaps = self.filter_content_gaps(
|
||||
research_data.keyword_analysis.get('content_gaps', []),
|
||||
research_data
|
||||
)
|
||||
|
||||
# Update keyword analysis with filtered content gaps
|
||||
cleaned_keyword_analysis['content_gaps'] = filtered_content_gaps
|
||||
|
||||
# Create filtered research response
|
||||
filtered_research = BlogResearchResponse(
|
||||
success=research_data.success,
|
||||
sources=filtered_sources,
|
||||
keyword_analysis=cleaned_keyword_analysis,
|
||||
competitor_analysis=cleaned_competitor_analysis,
|
||||
suggested_angles=research_data.suggested_angles, # Keep as-is for now
|
||||
search_widget=research_data.search_widget,
|
||||
search_queries=research_data.search_queries,
|
||||
grounding_metadata=filtered_grounding_metadata,
|
||||
error_message=research_data.error_message
|
||||
)
|
||||
|
||||
# Log filtering results
|
||||
self._log_filtering_results(original_counts, filtered_research)
|
||||
|
||||
return filtered_research
|
||||
|
||||
def filter_sources(self, sources: List[ResearchSource]) -> List[ResearchSource]:
|
||||
"""
|
||||
Filter sources based on quality, relevance, and recency criteria.
|
||||
|
||||
Args:
|
||||
sources: List of research sources to filter
|
||||
|
||||
Returns:
|
||||
Filtered list of high-quality sources
|
||||
"""
|
||||
if not sources:
|
||||
return []
|
||||
|
||||
filtered_sources = []
|
||||
|
||||
for source in sources:
|
||||
# Quality filters
|
||||
if not self._is_source_high_quality(source):
|
||||
continue
|
||||
|
||||
# Relevance filters
|
||||
if not self._is_source_relevant(source):
|
||||
continue
|
||||
|
||||
# Recency filters
|
||||
if not self._is_source_recent(source):
|
||||
continue
|
||||
|
||||
filtered_sources.append(source)
|
||||
|
||||
# Sort by credibility score and limit to max_sources
|
||||
filtered_sources.sort(key=lambda s: s.credibility_score or 0.8, reverse=True)
|
||||
filtered_sources = filtered_sources[:self.max_sources]
|
||||
|
||||
# Fail-open: if everything was filtered out, return a trimmed set of original sources
|
||||
if not filtered_sources and sources:
|
||||
logger.warning("All sources filtered out by thresholds. Falling back to top sources without strict filters.")
|
||||
fallback = sorted(
|
||||
sources,
|
||||
key=lambda s: (s.credibility_score or 0.8),
|
||||
reverse=True
|
||||
)[: self.max_sources]
|
||||
return fallback
|
||||
|
||||
logger.info(f"Filtered sources: {len(sources)} → {len(filtered_sources)}")
|
||||
return filtered_sources
|
||||
|
||||
def filter_grounding_metadata(self, grounding_metadata: Optional[GroundingMetadata]) -> Optional[GroundingMetadata]:
|
||||
"""
|
||||
Filter grounding metadata to keep only high-confidence, relevant data.
|
||||
|
||||
Args:
|
||||
grounding_metadata: Raw grounding metadata to filter
|
||||
|
||||
Returns:
|
||||
Filtered grounding metadata with high-quality data only
|
||||
"""
|
||||
if not grounding_metadata:
|
||||
return None
|
||||
|
||||
# Filter grounding chunks by confidence
|
||||
filtered_chunks = []
|
||||
for chunk in grounding_metadata.grounding_chunks:
|
||||
if chunk.confidence_score and chunk.confidence_score >= self.min_grounding_confidence:
|
||||
filtered_chunks.append(chunk)
|
||||
|
||||
# Limit chunks to max_grounding_chunks
|
||||
filtered_chunks = filtered_chunks[:self.max_grounding_chunks]
|
||||
|
||||
# Filter grounding supports by confidence
|
||||
filtered_supports = []
|
||||
for support in grounding_metadata.grounding_supports:
|
||||
if support.confidence_scores and max(support.confidence_scores) >= self.min_grounding_confidence:
|
||||
filtered_supports.append(support)
|
||||
|
||||
# Filter citations by type and relevance
|
||||
filtered_citations = []
|
||||
for citation in grounding_metadata.citations:
|
||||
if self._is_citation_relevant(citation):
|
||||
filtered_citations.append(citation)
|
||||
|
||||
# Fail-open strategies to avoid empty UI:
|
||||
if not filtered_chunks and grounding_metadata.grounding_chunks:
|
||||
logger.warning("All grounding chunks filtered out. Falling back to first N chunks without confidence filter.")
|
||||
filtered_chunks = grounding_metadata.grounding_chunks[: self.max_grounding_chunks]
|
||||
if not filtered_supports and grounding_metadata.grounding_supports:
|
||||
logger.warning("All grounding supports filtered out. Falling back to first N supports without confidence filter.")
|
||||
filtered_supports = grounding_metadata.grounding_supports[: self.max_grounding_chunks]
|
||||
|
||||
# Create filtered grounding metadata
|
||||
filtered_metadata = GroundingMetadata(
|
||||
grounding_chunks=filtered_chunks,
|
||||
grounding_supports=filtered_supports,
|
||||
citations=filtered_citations,
|
||||
search_entry_point=grounding_metadata.search_entry_point,
|
||||
web_search_queries=grounding_metadata.web_search_queries
|
||||
)
|
||||
|
||||
logger.info(f"Filtered grounding metadata: {len(grounding_metadata.grounding_chunks)} chunks → {len(filtered_chunks)} chunks")
|
||||
return filtered_metadata
|
||||
|
||||
def clean_keyword_analysis(self, keyword_analysis: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Clean and deduplicate keyword analysis data.
|
||||
|
||||
Args:
|
||||
keyword_analysis: Raw keyword analysis data
|
||||
|
||||
Returns:
|
||||
Cleaned and deduplicated keyword analysis
|
||||
"""
|
||||
if not keyword_analysis:
|
||||
return {}
|
||||
|
||||
cleaned_analysis = {}
|
||||
|
||||
# Clean and deduplicate keyword lists
|
||||
keyword_categories = ['primary', 'secondary', 'long_tail', 'semantic_keywords', 'trending_terms']
|
||||
|
||||
for category in keyword_categories:
|
||||
if category in keyword_analysis and isinstance(keyword_analysis[category], list):
|
||||
cleaned_keywords = self._clean_keyword_list(keyword_analysis[category])
|
||||
cleaned_analysis[category] = cleaned_keywords[:self.max_keywords_per_category]
|
||||
|
||||
# Clean other fields
|
||||
other_fields = ['search_intent', 'difficulty', 'analysis_insights']
|
||||
for field in other_fields:
|
||||
if field in keyword_analysis:
|
||||
cleaned_analysis[field] = keyword_analysis[field]
|
||||
|
||||
# Clean content gaps separately (handled by filter_content_gaps)
|
||||
# Don't add content_gaps if it's empty to avoid adding empty lists
|
||||
if 'content_gaps' in keyword_analysis and keyword_analysis['content_gaps']:
|
||||
cleaned_analysis['content_gaps'] = keyword_analysis['content_gaps'] # Will be filtered later
|
||||
|
||||
logger.info(f"Cleaned keyword analysis: {len(keyword_analysis)} categories → {len(cleaned_analysis)} categories")
|
||||
return cleaned_analysis
|
||||
|
||||
def clean_competitor_analysis(self, competitor_analysis: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Clean and validate competitor analysis data.
|
||||
|
||||
Args:
|
||||
competitor_analysis: Raw competitor analysis data
|
||||
|
||||
Returns:
|
||||
Cleaned competitor analysis data
|
||||
"""
|
||||
if not competitor_analysis:
|
||||
return {}
|
||||
|
||||
cleaned_analysis = {}
|
||||
|
||||
# Clean competitor lists
|
||||
competitor_lists = ['top_competitors', 'opportunities', 'competitive_advantages']
|
||||
for field in competitor_lists:
|
||||
if field in competitor_analysis and isinstance(competitor_analysis[field], list):
|
||||
cleaned_list = [item.strip() for item in competitor_analysis[field] if item.strip()]
|
||||
cleaned_analysis[field] = cleaned_list[:10] # Limit to top 10
|
||||
|
||||
# Clean other fields
|
||||
other_fields = ['market_positioning', 'competitive_landscape', 'market_share']
|
||||
for field in other_fields:
|
||||
if field in competitor_analysis:
|
||||
cleaned_analysis[field] = competitor_analysis[field]
|
||||
|
||||
logger.info(f"Cleaned competitor analysis: {len(competitor_analysis)} fields → {len(cleaned_analysis)} fields")
|
||||
return cleaned_analysis
|
||||
|
||||
def filter_content_gaps(self, content_gaps: List[str], research_data: BlogResearchResponse) -> List[str]:
|
||||
"""
|
||||
Filter content gaps to keep only actionable, high-value ones.
|
||||
|
||||
Args:
|
||||
content_gaps: List of identified content gaps
|
||||
research_data: Research data for context
|
||||
|
||||
Returns:
|
||||
Filtered list of actionable content gaps
|
||||
"""
|
||||
if not content_gaps:
|
||||
return []
|
||||
|
||||
filtered_gaps = []
|
||||
|
||||
for gap in content_gaps:
|
||||
# Quality filters
|
||||
if not self._is_gap_high_quality(gap):
|
||||
continue
|
||||
|
||||
# Relevance filters
|
||||
if not self._is_gap_relevant_to_topic(gap, research_data):
|
||||
continue
|
||||
|
||||
# Actionability filters
|
||||
if not self._is_gap_actionable(gap):
|
||||
continue
|
||||
|
||||
filtered_gaps.append(gap)
|
||||
|
||||
# Limit to max_content_gaps
|
||||
filtered_gaps = filtered_gaps[:self.max_content_gaps]
|
||||
|
||||
logger.info(f"Filtered content gaps: {len(content_gaps)} → {len(filtered_gaps)}")
|
||||
return filtered_gaps
|
||||
|
||||
# Private helper methods
|
||||
|
||||
def _is_source_high_quality(self, source: ResearchSource) -> bool:
|
||||
"""Check if source meets quality criteria."""
|
||||
# Credibility score check
|
||||
if source.credibility_score and source.credibility_score < self.min_credibility_score:
|
||||
return False
|
||||
|
||||
# Excerpt length check
|
||||
if source.excerpt and len(source.excerpt) < self.min_excerpt_length:
|
||||
return False
|
||||
|
||||
# Title quality check
|
||||
if not source.title or len(source.title.strip()) < 10:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _is_source_relevant(self, source: ResearchSource) -> bool:
|
||||
"""Check if source is relevant (not irrelevant patterns)."""
|
||||
if not source.url:
|
||||
return True # Keep sources without URLs
|
||||
|
||||
# Check against irrelevant patterns
|
||||
for pattern in self.irrelevant_patterns:
|
||||
if re.search(pattern, source.url, re.IGNORECASE):
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _is_source_recent(self, source: ResearchSource) -> bool:
|
||||
"""Check if source is recent enough."""
|
||||
if not source.published_at:
|
||||
return True # Keep sources without dates
|
||||
|
||||
try:
|
||||
# Parse date (assuming ISO format or common formats)
|
||||
published_date = self._parse_date(source.published_at)
|
||||
if published_date:
|
||||
cutoff_date = datetime.now() - timedelta(days=self.max_source_age_days)
|
||||
return published_date >= cutoff_date
|
||||
except Exception as e:
|
||||
logger.warning(f"Error parsing date '{source.published_at}': {e}")
|
||||
|
||||
return True # Keep sources with unparseable dates
|
||||
|
||||
def _is_citation_relevant(self, citation: Citation) -> bool:
|
||||
"""Check if citation is relevant and high-quality."""
|
||||
# Check citation type
|
||||
relevant_types = ['expert_opinion', 'statistical_data', 'recent_news', 'research_study']
|
||||
if citation.citation_type not in relevant_types:
|
||||
return False
|
||||
|
||||
# Check text quality
|
||||
if not citation.text or len(citation.text.strip()) < 20:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _is_gap_high_quality(self, gap: str) -> bool:
|
||||
"""Check if content gap is high quality."""
|
||||
gap = gap.strip()
|
||||
|
||||
# Length check
|
||||
if len(gap) < 10:
|
||||
return False
|
||||
|
||||
# Generic gap check
|
||||
generic_gaps = ['general', 'overview', 'introduction', 'basics', 'fundamentals']
|
||||
if gap.lower() in generic_gaps:
|
||||
return False
|
||||
|
||||
# Check for meaningful content
|
||||
if len(gap.split()) < 3:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _is_gap_relevant_to_topic(self, gap: str, research_data: BlogResearchResponse) -> bool:
|
||||
"""Check if content gap is relevant to the research topic."""
|
||||
# Simple relevance check - could be enhanced with more sophisticated matching
|
||||
primary_keywords = research_data.keyword_analysis.get('primary', [])
|
||||
|
||||
if not primary_keywords:
|
||||
return True # Keep gaps if no keywords available
|
||||
|
||||
gap_lower = gap.lower()
|
||||
for keyword in primary_keywords:
|
||||
if keyword.lower() in gap_lower:
|
||||
return True
|
||||
|
||||
# If no direct keyword match, check for common AI-related terms
|
||||
ai_terms = ['ai', 'artificial intelligence', 'machine learning', 'automation', 'technology', 'digital']
|
||||
for term in ai_terms:
|
||||
if term in gap_lower:
|
||||
return True
|
||||
|
||||
return True # Default to keeping gaps if no clear relevance check
|
||||
|
||||
def _is_gap_actionable(self, gap: str) -> bool:
|
||||
"""Check if content gap is actionable (can be addressed with content)."""
|
||||
gap_lower = gap.lower()
|
||||
|
||||
# Check for actionable indicators
|
||||
actionable_indicators = [
|
||||
'how to', 'guide', 'tutorial', 'steps', 'process', 'method',
|
||||
'best practices', 'tips', 'strategies', 'techniques', 'approach',
|
||||
'comparison', 'vs', 'versus', 'difference', 'pros and cons',
|
||||
'trends', 'future', '2024', '2025', 'emerging', 'new'
|
||||
]
|
||||
|
||||
for indicator in actionable_indicators:
|
||||
if indicator in gap_lower:
|
||||
return True
|
||||
|
||||
return True # Default to actionable if no specific indicators
|
||||
|
||||
def _clean_keyword_list(self, keywords: List[str]) -> List[str]:
|
||||
"""Clean and deduplicate a list of keywords."""
|
||||
cleaned_keywords = []
|
||||
seen_keywords = set()
|
||||
|
||||
for keyword in keywords:
|
||||
if not keyword or not isinstance(keyword, str):
|
||||
continue
|
||||
|
||||
# Clean keyword
|
||||
cleaned_keyword = keyword.strip().lower()
|
||||
|
||||
# Skip empty or too short keywords
|
||||
if len(cleaned_keyword) < 2:
|
||||
continue
|
||||
|
||||
# Skip stop words
|
||||
if cleaned_keyword in self.stop_words:
|
||||
continue
|
||||
|
||||
# Skip duplicates
|
||||
if cleaned_keyword in seen_keywords:
|
||||
continue
|
||||
|
||||
cleaned_keywords.append(cleaned_keyword)
|
||||
seen_keywords.add(cleaned_keyword)
|
||||
|
||||
return cleaned_keywords
|
||||
|
||||
def _parse_date(self, date_str: str) -> Optional[datetime]:
|
||||
"""Parse date string into datetime object."""
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
# Common date formats
|
||||
date_formats = [
|
||||
'%Y-%m-%d',
|
||||
'%Y-%m-%dT%H:%M:%S',
|
||||
'%Y-%m-%dT%H:%M:%SZ',
|
||||
'%Y-%m-%dT%H:%M:%S.%fZ',
|
||||
'%B %d, %Y',
|
||||
'%b %d, %Y',
|
||||
'%d %B %Y',
|
||||
'%d %b %Y',
|
||||
'%m/%d/%Y',
|
||||
'%d/%m/%Y'
|
||||
]
|
||||
|
||||
for fmt in date_formats:
|
||||
try:
|
||||
return datetime.strptime(date_str, fmt)
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
return None
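    # Examples of strings the formats above accept (comment only):
    #   "2024-06-30", "2024-06-30T12:00:00Z", "June 30, 2024", "30 Jun 2024", "06/30/2024"
    # Anything else (e.g. relative dates like "yesterday") returns None, and the
    # source is then kept by _is_source_recent's fail-open default.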
|
||||
|
||||
def _log_filtering_results(self, original_counts: Dict[str, int], filtered_research: BlogResearchResponse):
|
||||
"""Log the results of filtering operations."""
|
||||
filtered_counts = {
|
||||
'sources': len(filtered_research.sources),
|
||||
'grounding_chunks': len(filtered_research.grounding_metadata.grounding_chunks) if filtered_research.grounding_metadata else 0,
|
||||
'grounding_supports': len(filtered_research.grounding_metadata.grounding_supports) if filtered_research.grounding_metadata else 0,
|
||||
'citations': len(filtered_research.grounding_metadata.citations) if filtered_research.grounding_metadata else 0,
|
||||
}
|
||||
|
||||
logger.info("📊 Research Data Filtering Results:")
|
||||
for key, original_count in original_counts.items():
|
||||
filtered_count = filtered_counts[key]
|
||||
reduction_percent = ((original_count - filtered_count) / original_count * 100) if original_count > 0 else 0
|
||||
logger.info(f" {key}: {original_count} → {filtered_count} ({reduction_percent:.1f}% reduction)")
|
||||
|
||||
# Log content gaps filtering
|
||||
        final_gaps = len(filtered_research.keyword_analysis.get('content_gaps', []))
        logger.info(f"  content_gaps (after filtering): {final_gaps}")
|
||||
|
||||
logger.info("✅ Research data filtering completed successfully")
|
||||
226
backend/services/blog_writer/research/exa_provider.py
Normal file
@@ -0,0 +1,226 @@
|
||||
"""
|
||||
Exa Research Provider
|
||||
|
||||
Neural search implementation using Exa API for high-quality, citation-rich research.
|
||||
"""
|
||||
|
||||
from exa_py import Exa
|
||||
import os
|
||||
from loguru import logger
|
||||
from models.subscription_models import APIProvider
|
||||
from .base_provider import ResearchProvider as BaseProvider
|
||||
|
||||
|
||||
class ExaResearchProvider(BaseProvider):
|
||||
"""Exa neural search provider."""
|
||||
|
||||
def __init__(self):
|
||||
self.api_key = os.getenv("EXA_API_KEY")
|
||||
if not self.api_key:
|
||||
raise RuntimeError("EXA_API_KEY not configured")
|
||||
self.exa = Exa(self.api_key)
|
||||
logger.info("✅ Exa Research Provider initialized")
|
||||
|
||||
async def search(self, prompt, topic, industry, target_audience, config, user_id):
|
||||
"""Execute Exa neural search and return standardized results."""
|
||||
# Build Exa query
|
||||
query = f"{topic} {industry} {target_audience}"
|
||||
|
||||
# Determine category: use exa_category if set, otherwise map from source_types
|
||||
category = config.exa_category if config.exa_category else self._map_source_type_to_category(config.source_types)
|
||||
|
||||
# Build search kwargs - use correct Exa API format
|
||||
search_kwargs = {
|
||||
'type': config.exa_search_type or "auto",
|
||||
'num_results': min(config.max_sources, 25),
|
||||
'text': {'max_characters': 1000},
|
||||
'summary': {'query': f"Key insights about {topic}"},
|
||||
'highlights': {
|
||||
'num_sentences': 2,
|
||||
'highlights_per_url': 3
|
||||
}
|
||||
}
|
||||
|
||||
# Add optional filters
|
||||
if category:
|
||||
search_kwargs['category'] = category
|
||||
if config.exa_include_domains:
|
||||
search_kwargs['include_domains'] = config.exa_include_domains
|
||||
if config.exa_exclude_domains:
|
||||
search_kwargs['exclude_domains'] = config.exa_exclude_domains
|
||||
|
||||
logger.info(f"[Exa Research] Executing search: {query}")
|
||||
|
||||
        # Execute Exa search using the kwargs assembled above; contents options
        # (text/summary/highlights) are passed directly to search_and_contents,
        # not nested under a separate 'contents' key.
        try:
            results = self.exa.search_and_contents(query, **search_kwargs)
|
||||
except Exception as e:
|
||||
logger.error(f"[Exa Research] API call failed: {e}")
|
||||
# Try simpler call without contents if the above fails
|
||||
try:
|
||||
logger.info("[Exa Research] Retrying with simplified parameters")
|
||||
results = self.exa.search_and_contents(
|
||||
query,
|
||||
type=config.exa_search_type or "auto",
|
||||
num_results=min(config.max_sources, 25),
|
||||
**({k: v for k, v in {
|
||||
'category': category,
|
||||
'include_domains': config.exa_include_domains,
|
||||
'exclude_domains': config.exa_exclude_domains
|
||||
}.items() if v})
|
||||
)
|
||||
except Exception as retry_error:
|
||||
logger.error(f"[Exa Research] Retry also failed: {retry_error}")
|
||||
raise RuntimeError(f"Exa search failed: {str(retry_error)}") from retry_error
|
||||
|
||||
# Transform to standardized format
|
||||
sources = self._transform_sources(results.results)
|
||||
content = self._aggregate_content(results.results)
|
||||
        search_type = getattr(results, 'resolvedSearchType', 'neural')
|
||||
|
||||
# Get cost if available
|
||||
cost = 0.005 # Default Exa cost for 1-25 results
|
||||
if hasattr(results, 'costDollars'):
|
||||
if hasattr(results.costDollars, 'total'):
|
||||
cost = results.costDollars.total
|
||||
|
||||
logger.info(f"[Exa Research] Search completed: {len(sources)} sources, type: {search_type}")
|
||||
|
||||
return {
|
||||
'sources': sources,
|
||||
'content': content,
|
||||
'search_type': search_type,
|
||||
'provider': 'exa',
|
||||
'search_queries': [query],
|
||||
'cost': {'total': cost}
|
||||
}
|
||||
|
||||
def get_provider_enum(self):
|
||||
"""Return EXA provider enum for subscription tracking."""
|
||||
return APIProvider.EXA
|
||||
|
||||
def estimate_tokens(self) -> int:
|
||||
"""Estimate token usage for Exa (not token-based)."""
|
||||
return 0 # Exa is per-search, not token-based
|
||||
|
||||
def _map_source_type_to_category(self, source_types):
|
||||
"""Map SourceType enum to Exa category parameter."""
|
||||
if not source_types:
|
||||
return None
|
||||
|
||||
category_map = {
|
||||
'research paper': 'research paper',
|
||||
'news': 'news',
|
||||
'web': 'personal site',
|
||||
'industry': 'company',
|
||||
'expert': 'linkedin profile'
|
||||
}
|
||||
|
||||
for st in source_types:
|
||||
if st.value in category_map:
|
||||
return category_map[st.value]
|
||||
|
||||
return None
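    # Example (comment only): for source types whose values are "news" and "web"
    # (in that order), the first value found in category_map wins, so the Exa
    # category becomes "news"; if no value matches (or source_types is empty),
    # no category filter is applied.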
|
||||
|
||||
def _transform_sources(self, results):
|
||||
"""Transform Exa results to ResearchSource format."""
|
||||
sources = []
|
||||
for idx, result in enumerate(results):
|
||||
source_type = self._determine_source_type(result.url if hasattr(result, 'url') else '')
|
||||
|
||||
sources.append({
|
||||
'title': result.title if hasattr(result, 'title') else '',
|
||||
'url': result.url if hasattr(result, 'url') else '',
|
||||
'excerpt': self._get_excerpt(result),
|
||||
'credibility_score': 0.85, # Exa results are high quality
|
||||
'published_at': result.publishedDate if hasattr(result, 'publishedDate') else None,
|
||||
'index': idx,
|
||||
'source_type': source_type,
|
||||
'content': result.text if hasattr(result, 'text') else '',
|
||||
'highlights': result.highlights if hasattr(result, 'highlights') else [],
|
||||
'summary': result.summary if hasattr(result, 'summary') else ''
|
||||
})
|
||||
|
||||
return sources
|
||||
|
||||
def _get_excerpt(self, result):
|
||||
"""Extract excerpt from Exa result."""
|
||||
if hasattr(result, 'text') and result.text:
|
||||
return result.text[:500]
|
||||
elif hasattr(result, 'summary') and result.summary:
|
||||
return result.summary
|
||||
return ''
|
||||
|
||||
def _determine_source_type(self, url):
|
||||
"""Determine source type from URL."""
|
||||
if not url:
|
||||
return 'web'
|
||||
|
||||
url_lower = url.lower()
|
||||
if 'arxiv.org' in url_lower or 'research' in url_lower:
|
||||
return 'academic'
|
||||
elif any(news in url_lower for news in ['cnn.com', 'bbc.com', 'reuters.com', 'theguardian.com']):
|
||||
return 'news'
|
||||
elif 'linkedin.com' in url_lower:
|
||||
return 'expert'
|
||||
else:
|
||||
return 'web'
|
||||
|
||||
def _aggregate_content(self, results):
|
||||
"""Aggregate content from Exa results for LLM analysis."""
|
||||
content_parts = []
|
||||
|
||||
for idx, result in enumerate(results):
|
||||
if hasattr(result, 'summary') and result.summary:
|
||||
content_parts.append(f"Source {idx + 1}: {result.summary}")
|
||||
elif hasattr(result, 'text') and result.text:
|
||||
content_parts.append(f"Source {idx + 1}: {result.text[:1000]}")
|
||||
|
||||
return "\n\n".join(content_parts)
|
||||
|
||||
def track_exa_usage(self, user_id: str, cost: float):
|
||||
"""Track Exa API usage after successful call."""
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
from sqlalchemy import text
|
||||
|
||||
db = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db)
|
||||
current_period = pricing_service.get_current_billing_period(user_id)
|
||||
|
||||
# Update exa_calls and exa_cost via SQL UPDATE
|
||||
update_query = text("""
|
||||
UPDATE usage_summaries
|
||||
SET exa_calls = COALESCE(exa_calls, 0) + 1,
|
||||
exa_cost = COALESCE(exa_cost, 0) + :cost,
|
||||
total_calls = total_calls + 1,
|
||||
total_cost = total_cost + :cost
|
||||
WHERE user_id = :user_id AND billing_period = :period
|
||||
""")
|
||||
db.execute(update_query, {
|
||||
'cost': cost,
|
||||
'user_id': user_id,
|
||||
'period': current_period
|
||||
})
|
||||
db.commit()
|
||||
|
||||
logger.info(f"[Exa] Tracked usage: user={user_id}, cost=${cost}")
|
||||
except Exception as e:
|
||||
logger.error(f"[Exa] Failed to track usage: {e}")
|
||||
db.rollback()
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
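# Minimal usage sketch (illustrative only, not part of the committed module). Assumes
# EXA_API_KEY is configured and that the caller, normally ResearchService, supplies a
# ResearchConfig and a Clerk user_id; the placeholder topic/industry/audience values
# below are assumptions for illustration.
async def _example_exa_search(config, user_id: str):
    """Hypothetical helper showing how ResearchService drives this provider."""
    provider = ExaResearchProvider()
    result = await provider.search(
        "Research prompt built by a ResearchStrategy",  # prompt placeholder
        "AI content marketing",                         # topic
        "SaaS",                                         # industry
        "growth marketers",                             # target_audience
        config,
        user_id,
    )
    # Record the per-search cost returned in the standardized result dict.
    provider.track_exa_usage(user_id, result['cost']['total'])
    return result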
40
backend/services/blog_writer/research/google_provider.py
Normal file
@@ -0,0 +1,40 @@
|
||||
"""
|
||||
Google Research Provider
|
||||
|
||||
Wrapper for Gemini native Google Search grounding to match base provider interface.
|
||||
"""
|
||||
|
||||
from services.llm_providers.gemini_grounded_provider import GeminiGroundedProvider
|
||||
from models.subscription_models import APIProvider
|
||||
from .base_provider import ResearchProvider as BaseProvider
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class GoogleResearchProvider(BaseProvider):
|
||||
"""Google research provider using Gemini native grounding."""
|
||||
|
||||
def __init__(self):
|
||||
self.gemini = GeminiGroundedProvider()
|
||||
|
||||
async def search(self, prompt, topic, industry, target_audience, config, user_id):
|
||||
"""Call Gemini grounding with pre-flight validation."""
|
||||
logger.info(f"[Google Research] Executing search for topic: {topic}")
|
||||
|
||||
result = await self.gemini.generate_grounded_content(
|
||||
prompt=prompt,
|
||||
content_type="research",
|
||||
max_tokens=2000,
|
||||
user_id=user_id,
|
||||
validate_subsequent_operations=True
|
||||
)
|
||||
|
||||
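# The grounded result is a dict whose 'content', 'sources', 'grounding_metadata',
# 'search_widget', 'search_queries', and 'token_usage' keys are consumed downstream
# by ResearchService when building the BlogResearchResponse.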
return result
|
||||
|
||||
def get_provider_enum(self):
|
||||
"""Return GEMINI provider enum for subscription tracking."""
|
||||
return APIProvider.GEMINI
|
||||
|
||||
def estimate_tokens(self) -> int:
|
||||
"""Estimate token usage for Google grounding."""
|
||||
return 1200 # Conservative estimate
|
||||
|
||||
79
backend/services/blog_writer/research/keyword_analyzer.py
Normal file
@@ -0,0 +1,79 @@
|
||||
"""
|
||||
Keyword Analyzer - AI-powered keyword analysis for research content.
|
||||
|
||||
Extracts and analyzes keywords from research content using structured AI responses.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List
|
||||
from loguru import logger
|
||||
|
||||
|
||||
class KeywordAnalyzer:
|
||||
"""Analyzes keywords from research content using AI-powered extraction."""
|
||||
|
||||
def analyze(self, content: str, original_keywords: List[str], user_id: str = None) -> Dict[str, Any]:
|
||||
"""Parse comprehensive keyword analysis from the research content using AI."""
|
||||
# Use AI to extract and analyze keywords from the rich research content
|
||||
keyword_prompt = f"""
|
||||
Analyze the following research content and extract comprehensive keyword insights for: {', '.join(original_keywords)}
|
||||
|
||||
Research Content:
|
||||
{content[:3000]}
|
||||
|
||||
Extract and analyze:
|
||||
1. Primary keywords (main topic terms)
|
||||
2. Secondary keywords (related terms, synonyms)
|
||||
3. Long-tail opportunities (specific phrases people search for)
|
||||
4. Search intent (informational, commercial, navigational, transactional)
|
||||
5. Keyword difficulty assessment (1-10 scale)
|
||||
6. Content gaps (what competitors are missing)
|
||||
7. Semantic keywords (related concepts)
|
||||
8. Trending terms (emerging keywords)
|
||||
|
||||
Respond with JSON:
|
||||
{{
|
||||
"primary": ["keyword1", "keyword2"],
|
||||
"secondary": ["related1", "related2"],
|
||||
"long_tail": ["specific phrase 1", "specific phrase 2"],
|
||||
"search_intent": "informational|commercial|navigational|transactional",
|
||||
"difficulty": 7,
|
||||
"content_gaps": ["gap1", "gap2"],
|
||||
"semantic_keywords": ["concept1", "concept2"],
|
||||
"trending_terms": ["trend1", "trend2"],
|
||||
"analysis_insights": "Brief analysis of keyword landscape"
|
||||
}}
|
||||
"""
|
||||
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
keyword_schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"primary": {"type": "array", "items": {"type": "string"}},
|
||||
"secondary": {"type": "array", "items": {"type": "string"}},
|
||||
"long_tail": {"type": "array", "items": {"type": "string"}},
|
||||
"search_intent": {"type": "string"},
|
||||
"difficulty": {"type": "integer"},
|
||||
"content_gaps": {"type": "array", "items": {"type": "string"}},
|
||||
"semantic_keywords": {"type": "array", "items": {"type": "string"}},
|
||||
"trending_terms": {"type": "array", "items": {"type": "string"}},
|
||||
"analysis_insights": {"type": "string"}
|
||||
},
|
||||
"required": ["primary", "secondary", "long_tail", "search_intent", "difficulty", "content_gaps", "semantic_keywords", "trending_terms", "analysis_insights"]
|
||||
}
|
||||
|
||||
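# Structured generation: json_struct asks llm_text_gen for a JSON response matching
# keyword_schema, so the result can be consumed directly as a dict.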
keyword_analysis = llm_text_gen(
|
||||
prompt=keyword_prompt,
|
||||
json_struct=keyword_schema,
|
||||
user_id=user_id
|
||||
)
|
||||
|
||||
if isinstance(keyword_analysis, dict) and 'error' not in keyword_analysis:
|
||||
logger.info("✅ AI keyword analysis completed successfully")
|
||||
return keyword_analysis
|
||||
else:
|
||||
# No silent fallback data: raise so the caller can handle the failure.
|
||||
error_msg = keyword_analysis.get('error', 'Unknown error') if isinstance(keyword_analysis, dict) else str(keyword_analysis)
|
||||
logger.error(f"AI keyword analysis failed: {error_msg}")
|
||||
raise ValueError(f"Keyword analysis failed: {error_msg}")
|
||||
|
||||
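# Illustrative usage sketch (assumes `content` holds aggregated research text, as produced
# by ResearchService, and that a Clerk user_id is available):
#
#     analyzer = KeywordAnalyzer()
#     analysis = analyzer.analyze(content, ["ai blog writer"], user_id=user_id)
#     primary_terms = analysis["primary"]          # keys follow keyword_schema above
#     intent = analysis["search_intent"]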
914
backend/services/blog_writer/research/research_service.py
Normal file
@@ -0,0 +1,914 @@
|
||||
"""
|
||||
Research Service - Core research functionality for AI Blog Writer.
|
||||
|
||||
Handles Google Search grounding, caching, and research orchestration.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any, List, Optional
|
||||
from datetime import datetime
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import (
|
||||
BlogResearchRequest,
|
||||
BlogResearchResponse,
|
||||
ResearchSource,
|
||||
GroundingMetadata,
|
||||
GroundingChunk,
|
||||
GroundingSupport,
|
||||
Citation,
|
||||
ResearchConfig,
|
||||
ResearchMode,
|
||||
ResearchProvider,
|
||||
)
|
||||
from services.blog_writer.logger_config import blog_writer_logger, log_function_call
|
||||
from fastapi import HTTPException
|
||||
|
||||
from .keyword_analyzer import KeywordAnalyzer
|
||||
from .competitor_analyzer import CompetitorAnalyzer
|
||||
from .content_angle_generator import ContentAngleGenerator
|
||||
from .data_filter import ResearchDataFilter
|
||||
from .research_strategies import get_strategy_for_mode
|
||||
|
||||
|
||||
class ResearchService:
|
||||
"""Service for conducting comprehensive research using Google Search grounding."""
|
||||
|
||||
def __init__(self):
|
||||
self.keyword_analyzer = KeywordAnalyzer()
|
||||
self.competitor_analyzer = CompetitorAnalyzer()
|
||||
self.content_angle_generator = ContentAngleGenerator()
|
||||
self.data_filter = ResearchDataFilter()
|
||||
|
||||
@log_function_call("research_operation")
|
||||
async def research(self, request: BlogResearchRequest, user_id: str) -> BlogResearchResponse:
|
||||
"""
|
||||
Stage 1: Research & Strategy (AI Orchestration)
|
||||
Uses ONLY Gemini's native Google Search grounding - ONE API call for everything.
|
||||
Follows LinkedIn service pattern for efficiency and cost optimization.
|
||||
Includes intelligent caching for exact keyword matches.
|
||||
"""
|
||||
try:
|
||||
from services.cache.research_cache import research_cache
from services.cache.persistent_research_cache import persistent_research_cache
|
||||
|
||||
topic = request.topic or ", ".join(request.keywords)
|
||||
industry = request.industry or (request.persona.industry if request.persona and request.persona.industry else "General")
|
||||
target_audience = getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'
|
||||
|
||||
# Log research parameters
|
||||
blog_writer_logger.log_operation_start(
|
||||
"research",
|
||||
topic=topic,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
keywords=request.keywords,
|
||||
keyword_count=len(request.keywords)
|
||||
)
|
||||
|
||||
# Check cache first for exact keyword match
|
||||
cached_result = research_cache.get_cached_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience
|
||||
)
|
||||
|
||||
if cached_result:
|
||||
logger.info(f"Returning cached research result for keywords: {request.keywords}")
|
||||
blog_writer_logger.log_operation_end("research", 0, success=True, cache_hit=True)
|
||||
# Normalize cached data to fix None values in confidence_scores
|
||||
normalized_result = self._normalize_cached_research_data(cached_result)
|
||||
return BlogResearchResponse(**normalized_result)
|
||||
|
||||
# User ID validation (validation logic is now in Google Grounding provider)
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for research operation. Please provide Clerk user ID.")
|
||||
|
||||
# Cache miss - proceed with API call
|
||||
logger.info(f"Cache miss - making API call for keywords: {request.keywords}")
|
||||
blog_writer_logger.log_operation_start("research_api_call", api_name="research", operation="research")
|
||||
|
||||
# Determine research mode and get appropriate strategy
|
||||
research_mode = request.research_mode or ResearchMode.BASIC
|
||||
config = request.config or ResearchConfig(mode=research_mode, provider=ResearchProvider.GOOGLE)
|
||||
strategy = get_strategy_for_mode(research_mode)
|
||||
|
||||
logger.info(f"Research: mode={research_mode.value}, provider={config.provider.value}")
|
||||
|
||||
# Build research prompt based on strategy
|
||||
research_prompt = strategy.build_research_prompt(topic, industry, target_audience, config)
|
||||
|
||||
# Route to appropriate provider
|
||||
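# Exa and Tavily are optional providers; if their API keys are missing, the except
# blocks below rewrite config.provider to GOOGLE so the final branch falls back to
# Gemini grounding.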
if config.provider == ResearchProvider.EXA:
|
||||
# Exa research workflow
|
||||
from .exa_provider import ExaResearchProvider
|
||||
from services.subscription.preflight_validator import validate_exa_research_operations
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
import os
|
||||
import time
|
||||
|
||||
# Pre-flight validation
|
||||
db_val = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db_val)
|
||||
gpt_provider = os.getenv("GPT_PROVIDER", "google")
|
||||
validate_exa_research_operations(pricing_service, user_id, gpt_provider)
|
||||
finally:
|
||||
db_val.close()
|
||||
|
||||
# Execute Exa search
|
||||
api_start_time = time.time()
|
||||
try:
|
||||
exa_provider = ExaResearchProvider()
|
||||
raw_result = await exa_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
api_duration_ms = (time.time() - api_start_time) * 1000
|
||||
|
||||
# Track usage
|
||||
cost = raw_result.get('cost', {}).get('total', 0.005) if isinstance(raw_result.get('cost'), dict) else 0.005
|
||||
exa_provider.track_exa_usage(user_id, cost)
|
||||
|
||||
# Log API call performance
|
||||
blog_writer_logger.log_api_call(
|
||||
"exa_search",
|
||||
"search_and_contents",
|
||||
api_duration_ms,
|
||||
token_usage={},
|
||||
content_length=len(raw_result.get('content', ''))
|
||||
)
|
||||
|
||||
# Extract content for downstream analysis
|
||||
content = raw_result.get('content', '')
|
||||
sources = raw_result.get('sources', [])
|
||||
search_widget = "" # Exa doesn't provide search widgets
|
||||
search_queries = raw_result.get('search_queries', [])
|
||||
grounding_metadata = None # Exa doesn't provide grounding metadata
|
||||
|
||||
except RuntimeError as e:
|
||||
if "EXA_API_KEY not configured" in str(e):
|
||||
logger.warning("Exa not configured, falling back to Google")
|
||||
config.provider = ResearchProvider.GOOGLE
|
||||
# Continue to Google flow below
|
||||
raw_result = None
|
||||
else:
|
||||
raise
|
||||
|
||||
elif config.provider == ResearchProvider.TAVILY:
|
||||
# Tavily research workflow
|
||||
from .tavily_provider import TavilyResearchProvider
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
import os
|
||||
import time
|
||||
|
||||
# Pre-flight validation (similar to Exa)
|
||||
db_val = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db_val)
|
||||
# Check Tavily usage limits
|
||||
limits = pricing_service.get_user_limits(user_id)
|
||||
tavily_limit = limits.get('limits', {}).get('tavily_calls', 0) if limits else 0
|
||||
|
||||
# Get current usage
|
||||
from models.subscription_models import UsageSummary
|
||||
from datetime import datetime
|
||||
current_period = pricing_service.get_current_billing_period(user_id) or datetime.now().strftime("%Y-%m")
|
||||
usage = db_val.query(UsageSummary).filter(
|
||||
UsageSummary.user_id == user_id,
|
||||
UsageSummary.billing_period == current_period
|
||||
).first()
|
||||
|
||||
current_calls = getattr(usage, 'tavily_calls', 0) or 0 if usage else 0
|
||||
|
||||
if tavily_limit > 0 and current_calls >= tavily_limit:
|
||||
raise HTTPException(
|
||||
status_code=429,
|
||||
detail={
|
||||
'error': 'Tavily API call limit exceeded',
|
||||
'message': f'You have reached your Tavily API call limit ({tavily_limit} calls). Please upgrade your plan or wait for the next billing period.',
|
||||
'provider': 'tavily',
|
||||
'usage_info': {
|
||||
'current': current_calls,
|
||||
'limit': tavily_limit
|
||||
}
|
||||
}
|
||||
)
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.warning(f"Error checking Tavily limits: {e}")
|
||||
finally:
|
||||
db_val.close()
|
||||
|
||||
# Execute Tavily search
|
||||
api_start_time = time.time()
|
||||
try:
|
||||
tavily_provider = TavilyResearchProvider()
|
||||
raw_result = await tavily_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
api_duration_ms = (time.time() - api_start_time) * 1000
|
||||
|
||||
# Track usage
|
||||
cost = raw_result.get('cost', {}).get('total', 0.001) if isinstance(raw_result.get('cost'), dict) else 0.001
|
||||
search_depth = config.tavily_search_depth or "basic"
|
||||
tavily_provider.track_tavily_usage(user_id, cost, search_depth)
|
||||
|
||||
# Log API call performance
|
||||
blog_writer_logger.log_api_call(
|
||||
"tavily_search",
|
||||
"search",
|
||||
api_duration_ms,
|
||||
token_usage={},
|
||||
content_length=len(raw_result.get('content', ''))
|
||||
)
|
||||
|
||||
# Extract content for downstream analysis
|
||||
content = raw_result.get('content', '')
|
||||
sources = raw_result.get('sources', [])
|
||||
search_widget = "" # Tavily doesn't provide search widgets
|
||||
search_queries = raw_result.get('search_queries', [])
|
||||
grounding_metadata = None # Tavily doesn't provide grounding metadata
|
||||
|
||||
except RuntimeError as e:
|
||||
if "TAVILY_API_KEY not configured" in str(e):
|
||||
logger.warning("Tavily not configured, falling back to Google")
|
||||
config.provider = ResearchProvider.GOOGLE
|
||||
# Continue to Google flow below
|
||||
raw_result = None
|
||||
else:
|
||||
raise
|
||||
|
||||
if config.provider not in [ResearchProvider.EXA, ResearchProvider.TAVILY]:
|
||||
# Google research (existing flow) or fallback from Exa
|
||||
from .google_provider import GoogleResearchProvider
|
||||
import time
|
||||
|
||||
api_start_time = time.time()
|
||||
google_provider = GoogleResearchProvider()
|
||||
gemini_result = await google_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
api_duration_ms = (time.time() - api_start_time) * 1000
|
||||
|
||||
# Log API call performance
|
||||
blog_writer_logger.log_api_call(
|
||||
"gemini_grounded",
|
||||
"generate_grounded_content",
|
||||
api_duration_ms,
|
||||
token_usage=gemini_result.get("token_usage", {}),
|
||||
content_length=len(gemini_result.get("content", ""))
|
||||
)
|
||||
|
||||
# Extract sources and content
|
||||
sources = self._extract_sources_from_grounding(gemini_result)
|
||||
content = gemini_result.get("content", "")
|
||||
search_widget = gemini_result.get("search_widget", "") or ""
|
||||
search_queries = gemini_result.get("search_queries", []) or []
|
||||
grounding_metadata = self._extract_grounding_metadata(gemini_result)
|
||||
|
||||
# Continue with common analysis (same for both providers)
|
||||
keyword_analysis = self.keyword_analyzer.analyze(content, request.keywords, user_id=user_id)
|
||||
competitor_analysis = self.competitor_analyzer.analyze(content, user_id=user_id)
|
||||
suggested_angles = self.content_angle_generator.generate(content, topic, industry, user_id=user_id)
|
||||
|
||||
logger.info(f"Research completed successfully with {len(sources)} sources and {len(search_queries)} search queries")
|
||||
|
||||
# Log analysis results
|
||||
blog_writer_logger.log_performance(
|
||||
"research_analysis",
|
||||
len(content),
|
||||
"characters",
|
||||
sources_count=len(sources),
|
||||
search_queries_count=len(search_queries),
|
||||
keyword_analysis_keys=len(keyword_analysis),
|
||||
suggested_angles_count=len(suggested_angles)
|
||||
)
|
||||
|
||||
# Create the response
|
||||
response = BlogResearchResponse(
|
||||
success=True,
|
||||
sources=sources,
|
||||
keyword_analysis=keyword_analysis,
|
||||
competitor_analysis=competitor_analysis,
|
||||
suggested_angles=suggested_angles,
|
||||
# Add search widget and queries for UI display
|
||||
search_widget=search_widget if 'search_widget' in locals() else "",
|
||||
search_queries=search_queries if 'search_queries' in locals() else [],
|
||||
# Add grounding metadata for detailed UI display
|
||||
grounding_metadata=grounding_metadata,
|
||||
)
|
||||
|
||||
# Filter and clean research data for optimal AI processing
|
||||
filtered_response = self.data_filter.filter_research_data(response)
|
||||
logger.info("Research data filtering completed successfully")
|
||||
|
||||
# Cache the successful result for future exact keyword matches (both caches)
|
||||
persistent_research_cache.cache_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
result=filtered_response.dict()
|
||||
)
|
||||
|
||||
# Also cache in memory for faster access
|
||||
research_cache.cache_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
result=filtered_response.dict()
|
||||
)
|
||||
|
||||
return filtered_response
|
||||
|
||||
except HTTPException:
|
||||
# Re-raise HTTPException (subscription errors) - let task manager handle it
|
||||
raise
|
||||
except Exception as e:
|
||||
error_message = str(e)
|
||||
logger.error(f"Research failed: {error_message}")
|
||||
|
||||
# Log error with full context
|
||||
blog_writer_logger.log_error(
|
||||
e,
|
||||
"research",
|
||||
context={
|
||||
"topic": topic,
|
||||
"keywords": request.keywords,
|
||||
"industry": industry,
|
||||
"target_audience": target_audience
|
||||
}
|
||||
)
|
||||
|
||||
# Import custom exceptions for better error handling
|
||||
from services.blog_writer.exceptions import (
|
||||
ResearchFailedException,
|
||||
APIRateLimitException,
|
||||
APITimeoutException,
|
||||
ValidationException
|
||||
)
|
||||
|
||||
# Determine if this is a retryable error
|
||||
retry_suggested = True
|
||||
user_message = "Research failed. Please try again with different keywords or check your internet connection."
|
||||
|
||||
if isinstance(e, APIRateLimitException):
|
||||
retry_suggested = True
|
||||
user_message = f"Rate limit exceeded. Please wait {e.context.get('retry_after', 60)} seconds before trying again."
|
||||
elif isinstance(e, APITimeoutException):
|
||||
retry_suggested = True
|
||||
user_message = "Research request timed out. Please try again with a shorter query or check your internet connection."
|
||||
elif isinstance(e, ValidationException):
|
||||
retry_suggested = False
|
||||
user_message = "Invalid research request. Please check your input parameters and try again."
|
||||
elif "401" in error_message or "403" in error_message:
|
||||
retry_suggested = False
|
||||
user_message = "Authentication failed. Please check your API credentials."
|
||||
elif "400" in error_message:
|
||||
retry_suggested = False
|
||||
user_message = "Invalid request. Please check your input parameters."
|
||||
|
||||
# Return a graceful failure response with enhanced error information
|
||||
return BlogResearchResponse(
|
||||
success=False,
|
||||
sources=[],
|
||||
keyword_analysis={},
|
||||
competitor_analysis={},
|
||||
suggested_angles=[],
|
||||
search_widget="",
|
||||
search_queries=[],
|
||||
error_message=user_message,
|
||||
retry_suggested=retry_suggested,
|
||||
error_code=getattr(e, 'error_code', 'RESEARCH_FAILED'),
|
||||
actionable_steps=getattr(e, 'actionable_steps', [
|
||||
"Try with different keywords",
|
||||
"Check your internet connection",
|
||||
"Wait a few minutes and try again",
|
||||
"Contact support if the issue persists"
|
||||
])
|
||||
)
|
||||
|
||||
@log_function_call("research_with_progress")
|
||||
async def research_with_progress(self, request: BlogResearchRequest, task_id: str, user_id: str) -> BlogResearchResponse:
|
||||
"""
|
||||
Research method with progress updates for real-time feedback.
|
||||
"""
|
||||
try:
|
||||
from services.cache.research_cache import research_cache
|
||||
from services.cache.persistent_research_cache import persistent_research_cache
|
||||
from api.blog_writer.task_manager import task_manager
|
||||
|
||||
topic = request.topic or ", ".join(request.keywords)
|
||||
industry = request.industry or (request.persona.industry if request.persona and request.persona.industry else "General")
|
||||
target_audience = getattr(request.persona, 'target_audience', 'General') if request.persona else 'General'
|
||||
|
||||
# Check cache first for exact keyword match (try both caches)
|
||||
await task_manager.update_progress(task_id, "🔍 Checking cache for existing research...")
|
||||
|
||||
# Try persistent cache first (survives restarts)
|
||||
cached_result = persistent_research_cache.get_cached_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience
|
||||
)
|
||||
|
||||
# Fallback to in-memory cache
|
||||
if not cached_result:
|
||||
cached_result = research_cache.get_cached_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience
|
||||
)
|
||||
|
||||
if cached_result:
|
||||
await task_manager.update_progress(task_id, "✅ Found cached research results! Returning instantly...")
|
||||
logger.info(f"Returning cached research result for keywords: {request.keywords}")
|
||||
# Normalize cached data to fix None values in confidence_scores
|
||||
normalized_result = self._normalize_cached_research_data(cached_result)
|
||||
return BlogResearchResponse(**normalized_result)
|
||||
|
||||
# User ID validation
|
||||
if not user_id:
|
||||
await task_manager.update_progress(task_id, "❌ Error: User ID is required for research operation")
|
||||
raise ValueError("user_id is required for research operation. Please provide Clerk user ID.")
|
||||
|
||||
# Determine research mode and get appropriate strategy
|
||||
research_mode = request.research_mode or ResearchMode.BASIC
|
||||
config = request.config or ResearchConfig(mode=research_mode, provider=ResearchProvider.GOOGLE)
|
||||
strategy = get_strategy_for_mode(research_mode)
|
||||
|
||||
logger.info(f"Research: mode={research_mode.value}, provider={config.provider.value}")
|
||||
|
||||
# Build research prompt based on strategy
|
||||
research_prompt = strategy.build_research_prompt(topic, industry, target_audience, config)
|
||||
|
||||
# Route to appropriate provider
|
||||
if config.provider == ResearchProvider.EXA:
|
||||
# Exa research workflow
|
||||
from .exa_provider import ExaResearchProvider
|
||||
from services.subscription.preflight_validator import validate_exa_research_operations
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
import os
|
||||
|
||||
await task_manager.update_progress(task_id, "🌐 Connecting to Exa neural search...")
|
||||
|
||||
# Pre-flight validation
|
||||
db_val = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db_val)
|
||||
gpt_provider = os.getenv("GPT_PROVIDER", "google")
|
||||
validate_exa_research_operations(pricing_service, user_id, gpt_provider)
|
||||
except HTTPException as http_error:
|
||||
logger.error(f"Subscription limit exceeded for Exa research: {http_error.detail}")
|
||||
await task_manager.update_progress(task_id, f"❌ Subscription limit exceeded: {http_error.detail.get('message', str(http_error.detail)) if isinstance(http_error.detail, dict) else str(http_error.detail)}")
|
||||
raise
|
||||
finally:
|
||||
db_val.close()
|
||||
|
||||
# Execute Exa search
|
||||
await task_manager.update_progress(task_id, "🤖 Executing Exa neural search...")
|
||||
try:
|
||||
exa_provider = ExaResearchProvider()
|
||||
raw_result = await exa_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
|
||||
# Track usage
|
||||
cost = raw_result.get('cost', {}).get('total', 0.005) if isinstance(raw_result.get('cost'), dict) else 0.005
|
||||
exa_provider.track_exa_usage(user_id, cost)
|
||||
|
||||
# Extract content for downstream analysis
|
||||
# Handle None result case
|
||||
if raw_result is None:
|
||||
logger.error("raw_result is None after Exa search - this should not happen if HTTPException was raised")
|
||||
raise ValueError("Exa research result is None - search operation failed unexpectedly")
|
||||
|
||||
if not isinstance(raw_result, dict):
|
||||
logger.warning(f"raw_result is not a dict (type: {type(raw_result)}), using defaults")
|
||||
raw_result = {}
|
||||
|
||||
content = raw_result.get('content', '')
|
||||
sources = raw_result.get('sources', []) or []
|
||||
search_widget = "" # Exa doesn't provide search widgets
|
||||
search_queries = raw_result.get('search_queries', []) or []
|
||||
grounding_metadata = None # Exa doesn't provide grounding metadata
|
||||
|
||||
except RuntimeError as e:
|
||||
if "EXA_API_KEY not configured" in str(e):
|
||||
logger.warning("Exa not configured, falling back to Google")
|
||||
await task_manager.update_progress(task_id, "⚠️ Exa not configured, falling back to Google Search")
|
||||
config.provider = ResearchProvider.GOOGLE
|
||||
# Continue to Google flow below
|
||||
else:
|
||||
raise
|
||||
|
||||
elif config.provider == ResearchProvider.TAVILY:
|
||||
# Tavily research workflow
|
||||
from .tavily_provider import TavilyResearchProvider
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
import os
|
||||
|
||||
await task_manager.update_progress(task_id, "🌐 Connecting to Tavily AI search...")
|
||||
|
||||
# Pre-flight validation
|
||||
db_val = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db_val)
|
||||
# Check Tavily usage limits
|
||||
limits = pricing_service.get_user_limits(user_id)
|
||||
tavily_limit = limits.get('limits', {}).get('tavily_calls', 0) if limits else 0
|
||||
|
||||
# Get current usage
|
||||
from models.subscription_models import UsageSummary
|
||||
from datetime import datetime
|
||||
current_period = pricing_service.get_current_billing_period(user_id) or datetime.now().strftime("%Y-%m")
|
||||
usage = db_val.query(UsageSummary).filter(
|
||||
UsageSummary.user_id == user_id,
|
||||
UsageSummary.billing_period == current_period
|
||||
).first()
|
||||
|
||||
current_calls = getattr(usage, 'tavily_calls', 0) or 0 if usage else 0
|
||||
|
||||
if tavily_limit > 0 and current_calls >= tavily_limit:
|
||||
await task_manager.update_progress(task_id, f"❌ Tavily API call limit exceeded ({current_calls}/{tavily_limit})")
|
||||
raise HTTPException(
|
||||
status_code=429,
|
||||
detail={
|
||||
'error': 'Tavily API call limit exceeded',
|
||||
'message': f'You have reached your Tavily API call limit ({tavily_limit} calls). Please upgrade your plan or wait for the next billing period.',
|
||||
'provider': 'tavily',
|
||||
'usage_info': {
|
||||
'current': current_calls,
|
||||
'limit': tavily_limit
|
||||
}
|
||||
}
|
||||
)
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.warning(f"Error checking Tavily limits: {e}")
|
||||
finally:
|
||||
db_val.close()
|
||||
|
||||
# Execute Tavily search
|
||||
await task_manager.update_progress(task_id, "🤖 Executing Tavily AI search...")
|
||||
try:
|
||||
tavily_provider = TavilyResearchProvider()
|
||||
raw_result = await tavily_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
|
||||
# Track usage
|
||||
cost = raw_result.get('cost', {}).get('total', 0.001) if isinstance(raw_result.get('cost'), dict) else 0.001
|
||||
search_depth = config.tavily_search_depth or "basic"
|
||||
tavily_provider.track_tavily_usage(user_id, cost, search_depth)
|
||||
|
||||
# Extract content for downstream analysis
|
||||
if raw_result is None:
|
||||
logger.error("raw_result is None after Tavily search")
|
||||
raise ValueError("Tavily research result is None - search operation failed unexpectedly")
|
||||
|
||||
if not isinstance(raw_result, dict):
|
||||
logger.warning(f"raw_result is not a dict (type: {type(raw_result)}), using defaults")
|
||||
raw_result = {}
|
||||
|
||||
content = raw_result.get('content', '')
|
||||
sources = raw_result.get('sources', []) or []
|
||||
search_widget = "" # Tavily doesn't provide search widgets
|
||||
search_queries = raw_result.get('search_queries', []) or []
|
||||
grounding_metadata = None # Tavily doesn't provide grounding metadata
|
||||
|
||||
except RuntimeError as e:
|
||||
if "TAVILY_API_KEY not configured" in str(e):
|
||||
logger.warning("Tavily not configured, falling back to Google")
|
||||
await task_manager.update_progress(task_id, "⚠️ Tavily not configured, falling back to Google Search")
|
||||
config.provider = ResearchProvider.GOOGLE
|
||||
# Continue to Google flow below
|
||||
else:
|
||||
raise
|
||||
|
||||
if config.provider not in [ResearchProvider.EXA, ResearchProvider.TAVILY]:
|
||||
# Google research (existing flow)
|
||||
from .google_provider import GoogleResearchProvider
|
||||
|
||||
await task_manager.update_progress(task_id, "🌐 Connecting to Google Search grounding...")
|
||||
google_provider = GoogleResearchProvider()
|
||||
|
||||
await task_manager.update_progress(task_id, "🤖 Making AI request to Gemini with Google Search grounding...")
|
||||
try:
|
||||
gemini_result = await google_provider.search(
|
||||
research_prompt, topic, industry, target_audience, config, user_id
|
||||
)
|
||||
except HTTPException as http_error:
|
||||
logger.error(f"Subscription limit exceeded for Google research: {http_error.detail}")
|
||||
await task_manager.update_progress(task_id, f"❌ Subscription limit exceeded: {http_error.detail.get('message', str(http_error.detail)) if isinstance(http_error.detail, dict) else str(http_error.detail)}")
|
||||
raise
|
||||
|
||||
await task_manager.update_progress(task_id, "📊 Processing research results and extracting insights...")
|
||||
# Extract sources and content
|
||||
# Handle None result case
|
||||
if gemini_result is None:
|
||||
logger.error("gemini_result is None after search - this should not happen if HTTPException was raised")
|
||||
raise ValueError("Research result is None - search operation failed unexpectedly")
|
||||
|
||||
sources = self._extract_sources_from_grounding(gemini_result)
|
||||
content = gemini_result.get("content", "") if isinstance(gemini_result, dict) else ""
|
||||
search_widget = gemini_result.get("search_widget", "") or "" if isinstance(gemini_result, dict) else ""
|
||||
search_queries = gemini_result.get("search_queries", []) or [] if isinstance(gemini_result, dict) else []
|
||||
grounding_metadata = self._extract_grounding_metadata(gemini_result)
|
||||
|
||||
# Continue with common analysis (same for both providers)
|
||||
await task_manager.update_progress(task_id, "🔍 Analyzing keywords and content angles...")
|
||||
keyword_analysis = self.keyword_analyzer.analyze(content, request.keywords, user_id=user_id)
|
||||
competitor_analysis = self.competitor_analyzer.analyze(content, user_id=user_id)
|
||||
suggested_angles = self.content_angle_generator.generate(content, topic, industry, user_id=user_id)
|
||||
|
||||
await task_manager.update_progress(task_id, "💾 Caching results for future use...")
|
||||
logger.info(f"Research completed successfully with {len(sources)} sources and {len(search_queries)} search queries")
|
||||
|
||||
# Create the response
|
||||
response = BlogResearchResponse(
|
||||
success=True,
|
||||
sources=sources,
|
||||
keyword_analysis=keyword_analysis,
|
||||
competitor_analysis=competitor_analysis,
|
||||
suggested_angles=suggested_angles,
|
||||
# Add search widget and queries for UI display
|
||||
search_widget=search_widget if 'search_widget' in locals() else "",
|
||||
search_queries=search_queries if 'search_queries' in locals() else [],
|
||||
# Add grounding metadata for detailed UI display
|
||||
grounding_metadata=grounding_metadata,
|
||||
# Preserve original user keywords for caching
|
||||
original_keywords=request.keywords,
|
||||
)
|
||||
|
||||
# Filter and clean research data for optimal AI processing
|
||||
await task_manager.update_progress(task_id, "🔍 Filtering and cleaning research data...")
|
||||
filtered_response = self.data_filter.filter_research_data(response)
|
||||
logger.info("Research data filtering completed successfully")
|
||||
|
||||
# Cache the successful result for future exact keyword matches (both caches)
|
||||
persistent_research_cache.cache_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
result=filtered_response.dict()
|
||||
)
|
||||
|
||||
# Also cache in memory for faster access
|
||||
research_cache.cache_result(
|
||||
keywords=request.keywords,
|
||||
industry=industry,
|
||||
target_audience=target_audience,
|
||||
result=filtered_response.dict()
|
||||
)
|
||||
|
||||
return filtered_response
|
||||
|
||||
except HTTPException:
|
||||
# Re-raise HTTPException (subscription errors) - let task manager handle it
|
||||
raise
|
||||
except Exception as e:
|
||||
error_message = str(e)
|
||||
logger.error(f"Research failed: {error_message}")
|
||||
|
||||
# Log error with full context
|
||||
blog_writer_logger.log_error(
|
||||
e,
|
||||
"research",
|
||||
context={
|
||||
"topic": topic,
|
||||
"keywords": request.keywords,
|
||||
"industry": industry,
|
||||
"target_audience": target_audience
|
||||
}
|
||||
)
|
||||
|
||||
# Import custom exceptions for better error handling
|
||||
from services.blog_writer.exceptions import (
|
||||
ResearchFailedException,
|
||||
APIRateLimitException,
|
||||
APITimeoutException,
|
||||
ValidationException
|
||||
)
|
||||
|
||||
# Determine if this is a retryable error
|
||||
retry_suggested = True
|
||||
user_message = "Research failed. Please try again with different keywords or check your internet connection."
|
||||
|
||||
if isinstance(e, APIRateLimitException):
|
||||
retry_suggested = True
|
||||
user_message = f"Rate limit exceeded. Please wait {e.context.get('retry_after', 60)} seconds before trying again."
|
||||
elif isinstance(e, APITimeoutException):
|
||||
retry_suggested = True
|
||||
user_message = "Research request timed out. Please try again with a shorter query or check your internet connection."
|
||||
elif isinstance(e, ValidationException):
|
||||
retry_suggested = False
|
||||
user_message = "Invalid research request. Please check your input parameters and try again."
|
||||
elif "401" in error_message or "403" in error_message:
|
||||
retry_suggested = False
|
||||
user_message = "Authentication failed. Please check your API credentials."
|
||||
elif "400" in error_message:
|
||||
retry_suggested = False
|
||||
user_message = "Invalid request. Please check your input parameters."
|
||||
|
||||
# Return a graceful failure response with enhanced error information
|
||||
return BlogResearchResponse(
|
||||
success=False,
|
||||
sources=[],
|
||||
keyword_analysis={},
|
||||
competitor_analysis={},
|
||||
suggested_angles=[],
|
||||
search_widget="",
|
||||
search_queries=[],
|
||||
error_message=user_message,
|
||||
retry_suggested=retry_suggested,
|
||||
error_code=getattr(e, 'error_code', 'RESEARCH_FAILED'),
|
||||
actionable_steps=getattr(e, 'actionable_steps', [
|
||||
"Try with different keywords",
|
||||
"Check your internet connection",
|
||||
"Wait a few minutes and try again",
|
||||
"Contact support if the issue persists"
|
||||
])
|
||||
)
|
||||
|
||||
def _extract_sources_from_grounding(self, gemini_result: Dict[str, Any]) -> List[ResearchSource]:
|
||||
"""Extract sources from Gemini grounding metadata."""
|
||||
sources = []
|
||||
|
||||
# Handle None or invalid gemini_result
|
||||
if not gemini_result or not isinstance(gemini_result, dict):
|
||||
logger.warning("gemini_result is None or not a dict, returning empty sources")
|
||||
return sources
|
||||
|
||||
# The Gemini grounded provider already extracts sources and puts them in the 'sources' field
|
||||
raw_sources = gemini_result.get("sources", [])
|
||||
# Ensure raw_sources is a list (handle None case)
|
||||
if raw_sources is None:
|
||||
raw_sources = []
|
||||
|
||||
for src in raw_sources:
|
||||
source = ResearchSource(
|
||||
title=src.get("title", "Untitled"),
|
||||
url=src.get("url", ""),
|
||||
excerpt=src.get("content", "")[:500] if src.get("content") else f"Source from {src.get('title', 'web')}",
|
||||
credibility_score=float(src.get("credibility_score", 0.8)),
|
||||
published_at=str(src.get("publication_date", "2024-01-01")),
|
||||
index=src.get("index"),
|
||||
source_type=src.get("type", "web")
|
||||
)
|
||||
sources.append(source)
|
||||
|
||||
return sources
|
||||
|
||||
def _normalize_cached_research_data(self, cached_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Normalize cached research data to fix None values in confidence_scores.
|
||||
Ensures all GroundingSupport objects have confidence_scores as a list.
|
||||
"""
|
||||
if not isinstance(cached_data, dict):
|
||||
return cached_data
|
||||
|
||||
normalized = cached_data.copy()
|
||||
|
||||
# Normalize grounding_metadata if present
|
||||
if "grounding_metadata" in normalized and normalized["grounding_metadata"]:
|
||||
grounding_metadata = normalized["grounding_metadata"].copy() if isinstance(normalized["grounding_metadata"], dict) else {}
|
||||
|
||||
# Normalize grounding_supports
|
||||
if "grounding_supports" in grounding_metadata and isinstance(grounding_metadata["grounding_supports"], list):
|
||||
normalized_supports = []
|
||||
for support in grounding_metadata["grounding_supports"]:
|
||||
if isinstance(support, dict):
|
||||
normalized_support = support.copy()
|
||||
# Fix confidence_scores: ensure it's a list, not None
|
||||
if normalized_support.get("confidence_scores") is None:
|
||||
normalized_support["confidence_scores"] = []
|
||||
elif not isinstance(normalized_support.get("confidence_scores"), list):
|
||||
# Not a list: default to an empty list
|
||||
normalized_support["confidence_scores"] = []
|
||||
# Fix grounding_chunk_indices: ensure it's a list, not None
|
||||
if normalized_support.get("grounding_chunk_indices") is None:
|
||||
normalized_support["grounding_chunk_indices"] = []
|
||||
elif not isinstance(normalized_support.get("grounding_chunk_indices"), list):
|
||||
normalized_support["grounding_chunk_indices"] = []
|
||||
# Ensure segment_text is a string
|
||||
if normalized_support.get("segment_text") is None:
|
||||
normalized_support["segment_text"] = ""
|
||||
normalized_supports.append(normalized_support)
|
||||
else:
|
||||
normalized_supports.append(support)
|
||||
grounding_metadata["grounding_supports"] = normalized_supports
|
||||
|
||||
normalized["grounding_metadata"] = grounding_metadata
|
||||
|
||||
return normalized
|
||||
|
||||
def _extract_grounding_metadata(self, gemini_result: Dict[str, Any]) -> GroundingMetadata:
|
||||
"""Extract detailed grounding metadata from Gemini result."""
|
||||
grounding_chunks = []
|
||||
grounding_supports = []
|
||||
citations = []
|
||||
|
||||
# Handle None or invalid gemini_result
|
||||
if not gemini_result or not isinstance(gemini_result, dict):
|
||||
logger.warning("gemini_result is None or not a dict, returning empty grounding metadata")
|
||||
return GroundingMetadata(
|
||||
grounding_chunks=grounding_chunks,
|
||||
grounding_supports=grounding_supports,
|
||||
citations=citations
|
||||
)
|
||||
|
||||
# Extract grounding chunks from the raw grounding metadata
|
||||
raw_grounding = gemini_result.get("grounding_metadata", {})
|
||||
|
||||
# Handle case where grounding_metadata might be a GroundingMetadata object
|
||||
if hasattr(raw_grounding, 'grounding_chunks'):
|
||||
raw_chunks = raw_grounding.grounding_chunks
|
||||
else:
|
||||
raw_chunks = raw_grounding.get("grounding_chunks", []) if isinstance(raw_grounding, dict) else []
|
||||
|
||||
# Ensure raw_chunks is a list (handle None case)
|
||||
if raw_chunks is None:
|
||||
raw_chunks = []
|
||||
|
||||
for chunk in raw_chunks:
|
||||
if "web" in chunk:
|
||||
web_data = chunk["web"]
|
||||
grounding_chunk = GroundingChunk(
|
||||
title=web_data.get("title", "Untitled"),
|
||||
url=web_data.get("uri", ""),
|
||||
confidence_score=None # Will be set from supports
|
||||
)
|
||||
grounding_chunks.append(grounding_chunk)
|
||||
|
||||
# Extract grounding supports with confidence scores
|
||||
if hasattr(raw_grounding, 'grounding_supports'):
|
||||
raw_supports = raw_grounding.grounding_supports
|
||||
else:
|
||||
raw_supports = (raw_grounding.get("grounding_supports") or []) if isinstance(raw_grounding, dict) else []
|
||||
for support in raw_supports:
|
||||
# Handle both dictionary and GroundingSupport object formats
|
||||
if hasattr(support, 'confidence_scores'):
|
||||
confidence_scores = support.confidence_scores
|
||||
chunk_indices = support.grounding_chunk_indices
|
||||
segment_text = getattr(support, 'segment_text', '')
|
||||
start_index = getattr(support, 'start_index', None)
|
||||
end_index = getattr(support, 'end_index', None)
|
||||
else:
|
||||
confidence_scores = support.get("confidence_scores", [])
|
||||
chunk_indices = support.get("grounding_chunk_indices", [])
|
||||
segment = support.get("segment", {})
|
||||
segment_text = segment.get("text", "")
|
||||
start_index = segment.get("start_index")
|
||||
end_index = segment.get("end_index")
|
||||
|
||||
grounding_support = GroundingSupport(
|
||||
confidence_scores=confidence_scores,
|
||||
grounding_chunk_indices=chunk_indices,
|
||||
segment_text=segment_text,
|
||||
start_index=start_index,
|
||||
end_index=end_index
|
||||
)
|
||||
grounding_supports.append(grounding_support)
|
||||
|
||||
# Update confidence scores for chunks
|
||||
if confidence_scores and chunk_indices:
|
||||
avg_confidence = sum(confidence_scores) / len(confidence_scores)
|
||||
for idx in chunk_indices:
|
||||
if idx < len(grounding_chunks):
|
||||
grounding_chunks[idx].confidence_score = avg_confidence
|
||||
|
||||
# Extract citations from the raw result
|
||||
raw_citations = gemini_result.get("citations", [])
|
||||
for citation in raw_citations:
|
||||
citation_obj = Citation(
|
||||
citation_type=citation.get("type", "inline"),
|
||||
start_index=citation.get("start_index", 0),
|
||||
end_index=citation.get("end_index", 0),
|
||||
text=citation.get("text", ""),
|
||||
source_indices=citation.get("source_indices", []),
|
||||
reference=citation.get("reference", "")
|
||||
)
|
||||
citations.append(citation_obj)
|
||||
|
||||
# Extract search entry point and web search queries
|
||||
if hasattr(raw_grounding, 'search_entry_point'):
|
||||
search_entry_point = getattr(raw_grounding.search_entry_point, 'rendered_content', '') if raw_grounding.search_entry_point else ''
|
||||
else:
|
||||
search_entry_point = (raw_grounding.get("search_entry_point") or {}).get("rendered_content", "") if isinstance(raw_grounding, dict) else ""
|
||||
|
||||
if hasattr(raw_grounding, 'web_search_queries'):
|
||||
web_search_queries = raw_grounding.web_search_queries
|
||||
else:
|
||||
web_search_queries = (raw_grounding.get("web_search_queries") or []) if isinstance(raw_grounding, dict) else []
|
||||
|
||||
return GroundingMetadata(
|
||||
grounding_chunks=grounding_chunks,
|
||||
grounding_supports=grounding_supports,
|
||||
citations=citations,
|
||||
search_entry_point=search_entry_point,
|
||||
web_search_queries=web_search_queries
|
||||
)
|
||||
230
backend/services/blog_writer/research/research_strategies.py
Normal file
@@ -0,0 +1,230 @@
|
||||
"""
|
||||
Research Strategy Pattern Implementation
|
||||
|
||||
Different strategies for executing research based on depth and focus.
|
||||
"""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any
|
||||
from loguru import logger
|
||||
|
||||
from models.blog_models import BlogResearchRequest, ResearchMode, ResearchConfig
|
||||
from .keyword_analyzer import KeywordAnalyzer
|
||||
from .competitor_analyzer import CompetitorAnalyzer
|
||||
from .content_angle_generator import ContentAngleGenerator
|
||||
|
||||
|
||||
class ResearchStrategy(ABC):
|
||||
"""Base class for research strategies."""
|
||||
|
||||
def __init__(self):
|
||||
self.keyword_analyzer = KeywordAnalyzer()
|
||||
self.competitor_analyzer = CompetitorAnalyzer()
|
||||
self.content_angle_generator = ContentAngleGenerator()
|
||||
|
||||
@abstractmethod
|
||||
def build_research_prompt(
|
||||
self,
|
||||
topic: str,
|
||||
industry: str,
|
||||
target_audience: str,
|
||||
config: ResearchConfig
|
||||
) -> str:
|
||||
"""Build the research prompt for the strategy."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def get_mode(self) -> ResearchMode:
|
||||
"""Return the research mode this strategy handles."""
|
||||
pass
|
||||
|
||||
|
||||
class BasicResearchStrategy(ResearchStrategy):
|
||||
"""Basic research strategy - keyword focused, minimal analysis."""
|
||||
|
||||
def get_mode(self) -> ResearchMode:
|
||||
return ResearchMode.BASIC
|
||||
|
||||
def build_research_prompt(
|
||||
self,
|
||||
topic: str,
|
||||
industry: str,
|
||||
target_audience: str,
|
||||
config: ResearchConfig
|
||||
) -> str:
|
||||
"""Build basic research prompt focused on podcast-ready, actionable insights."""
|
||||
prompt = f"""You are a podcast researcher creating TALKING POINTS and FACT CARDS for a {industry} audience of {target_audience}.
|
||||
|
||||
Research Topic: "{topic}"
|
||||
|
||||
Provide analysis in this EXACT format:
|
||||
|
||||
## PODCAST HOOKS (3)
|
||||
- [Hook line with tension + data point + source URL]
|
||||
|
||||
## OBJECTIONS & COUNTERS (3)
|
||||
- Objection: [common listener objection]
|
||||
Counter: [concise rebuttal with stat + source URL]
|
||||
|
||||
## KEY STATS & PROOF (6)
|
||||
- [Specific metric with %/number, date, and source URL]
|
||||
|
||||
## MINI CASE SNAPS (3)
|
||||
- [Brand/company], [what they did], [outcome metric], [source URL]
|
||||
|
||||
## KEYWORDS TO MENTION (Primary + 5 Secondary)
|
||||
- Primary: "{topic}"
|
||||
- Secondary: [5 related keywords]
|
||||
|
||||
## 5 CONTENT ANGLES
|
||||
1. [Angle with audience benefit + why-now]
|
||||
2. [Angle ...]
|
||||
3. [Angle ...]
|
||||
4. [Angle ...]
|
||||
5. [Angle ...]
|
||||
|
||||
## FACT CARD LIST (8)
|
||||
- For each: Quote/claim, source URL, published date, metric/context.
|
||||
|
||||
REQUIREMENTS:
|
||||
- Every claim MUST include a source URL (authoritative, recent: 2024-2025 preferred).
|
||||
- Use concrete numbers, dates, outcomes; avoid generic advice.
|
||||
- Keep bullets tight and scannable for spoken narration."""
|
||||
return prompt.strip()
|
||||
|
||||
|
||||
class ComprehensiveResearchStrategy(ResearchStrategy):
|
||||
"""Comprehensive research strategy - full analysis with all components."""
|
||||
|
||||
def get_mode(self) -> ResearchMode:
|
||||
return ResearchMode.COMPREHENSIVE
|
||||
|
||||
def build_research_prompt(
|
||||
self,
|
||||
topic: str,
|
||||
industry: str,
|
||||
target_audience: str,
|
||||
config: ResearchConfig
|
||||
) -> str:
|
||||
"""Build comprehensive research prompt with podcast-focused, high-value insights."""
|
||||
date_filter = f"\nDate Focus: {config.date_range.value.replace('_', ' ')}" if config.date_range else ""
|
||||
source_filter = f"\nPriority Sources: {', '.join([s.value for s in config.source_types])}" if config.source_types else ""
|
||||
|
||||
prompt = f"""You are a senior podcast researcher creating deeply sourced talking points for a {industry} audience of {target_audience}.
|
||||
|
||||
Research Topic: "{topic}"{date_filter}{source_filter}
|
||||
|
||||
Provide COMPLETE analysis in this EXACT format:
|
||||
|
||||
## WHAT'S CHANGED (2024-2025)
|
||||
[5-7 concise trend bullets with numbers + source URLs]
|
||||
|
||||
## PROOF & NUMBERS
|
||||
[10 stats with metric, date, sample size/method, and source URL]
|
||||
|
||||
## EXPERT SIGNALS
|
||||
[5 expert quotes with name, title/company, source URL]
|
||||
|
||||
## RECENT MOVES
|
||||
[5-7 news items or launches with dates and source URLs]
|
||||
|
||||
## MARKET SNAPSHOTS
|
||||
[3-5 insights with TAM/SAM/SOM or adoption metrics, source URLs]
|
||||
|
||||
## CASE SNAPS
|
||||
[3-5 cases: who, what they did, outcome metric, source URL]
|
||||
|
||||
## KEYWORD PLAN
|
||||
Primary (3), Secondary (8-10), Long-tail (5-7) with intent hints.
|
||||
|
||||
## COMPETITOR GAPS
|
||||
- Top 5 competitors (URL) + 1-line strength
|
||||
- 5 content gaps we can own
|
||||
- 3 unique angles to differentiate
|
||||
|
||||
## PODCAST-READY ANGLES (5)
|
||||
- Each: Hook, promised takeaway, data or example, source URL.
|
||||
|
||||
## FACT CARD LIST (10)
|
||||
- Each: Quote/claim, source URL, published date, metric/context, suggested angle tag.
|
||||
|
||||
VERIFICATION REQUIREMENTS:
|
||||
- Minimum 2 authoritative sources per major claim.
|
||||
- Prefer industry reports > research papers > news > blogs.
|
||||
- 2024-2025 data strongly preferred.
|
||||
- All numbers must include timeframe and methodology.
|
||||
- Every bullet must be concise for spoken narration and actionable for {target_audience}."""
|
||||
return prompt.strip()
|
||||
|
||||
|
||||
class TargetedResearchStrategy(ResearchStrategy):
|
||||
"""Targeted research strategy - focused on specific aspects."""
|
||||
|
||||
def get_mode(self) -> ResearchMode:
|
||||
return ResearchMode.TARGETED
|
||||
|
||||
def build_research_prompt(
|
||||
self,
|
||||
topic: str,
|
||||
industry: str,
|
||||
target_audience: str,
|
||||
config: ResearchConfig
|
||||
) -> str:
|
||||
"""Build targeted research prompt based on config preferences."""
|
||||
sections = []
|
||||
|
||||
if config.include_trends:
|
||||
sections.append("""## CURRENT TRENDS
|
||||
[3-5 trends with data and source URLs]""")
|
||||
|
||||
if config.include_statistics:
|
||||
sections.append("""## KEY STATISTICS
|
||||
[5-7 statistics with numbers and source URLs]""")
|
||||
|
||||
if config.include_expert_quotes:
|
||||
sections.append("""## EXPERT OPINIONS
|
||||
[3-4 expert quotes with attribution and source URLs]""")
|
||||
|
||||
if config.include_competitors:
|
||||
sections.append("""## COMPETITOR ANALYSIS
|
||||
Top Competitors: [3-5]
|
||||
Content Gaps: [3-5]""")
|
||||
|
||||
# Always include keywords and angles
|
||||
sections.append("""## KEYWORD ANALYSIS
|
||||
Primary: [2-3 variations]
|
||||
Secondary: [5-7 keywords]
|
||||
Long-Tail: [3-5 phrases]""")
|
||||
|
||||
sections.append("""## CONTENT ANGLES (3-5)
|
||||
[Unique blog angles with reasoning]""")
|
||||
|
||||
sections_str = "\n\n".join(sections)
|
||||
|
||||
prompt = f"""You are a blog content strategist conducting targeted research for a {industry} blog targeting {target_audience}.
|
||||
|
||||
Research Topic: "{topic}"
|
||||
|
||||
Provide focused analysis in this EXACT format:
|
||||
|
||||
{sections_str}
|
||||
|
||||
REQUIREMENTS:
|
||||
- Cite all claims with authoritative source URLs
|
||||
- Include specific numbers, dates, examples
|
||||
- Focus on actionable insights for {target_audience}
|
||||
- Use 2024-2025 data when available"""
|
||||
return prompt.strip()
|
||||
|
||||
|
||||
def get_strategy_for_mode(mode: ResearchMode) -> ResearchStrategy:
|
||||
"""Factory function to get the appropriate strategy for a mode."""
|
||||
strategy_map = {
|
||||
ResearchMode.BASIC: BasicResearchStrategy,
|
||||
ResearchMode.COMPREHENSIVE: ComprehensiveResearchStrategy,
|
||||
ResearchMode.TARGETED: TargetedResearchStrategy,
|
||||
}
|
||||
|
||||
strategy_class = strategy_map.get(mode, BasicResearchStrategy)
|
||||
return strategy_class()
|
||||
|
||||
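# Minimal usage sketch (illustrative only): build and print a comprehensive research prompt
# with placeholder topic/industry/audience values.
if __name__ == "__main__":
    from models.blog_models import ResearchProvider  # ResearchConfig/ResearchMode imported above

    _config = ResearchConfig(mode=ResearchMode.COMPREHENSIVE, provider=ResearchProvider.GOOGLE)
    _strategy = get_strategy_for_mode(ResearchMode.COMPREHENSIVE)
    print(_strategy.build_research_prompt(
        topic="AI content marketing",
        industry="SaaS",
        target_audience="growth marketers",
        config=_config,
    ))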
169
backend/services/blog_writer/research/tavily_provider.py
Normal file
@@ -0,0 +1,169 @@
|
||||
"""
|
||||
Tavily Research Provider
|
||||
|
||||
AI-powered search implementation using Tavily API for high-quality research.
|
||||
"""
|
||||
|
||||
import os
|
||||
from loguru import logger
|
||||
from models.subscription_models import APIProvider
|
||||
from services.research.tavily_service import TavilyService
|
||||
from .base_provider import ResearchProvider as BaseProvider
|
||||
|
||||
|
||||
class TavilyResearchProvider(BaseProvider):
|
||||
"""Tavily AI-powered search provider."""
|
||||
|
||||
def __init__(self):
|
||||
self.api_key = os.getenv("TAVILY_API_KEY")
|
||||
if not self.api_key:
|
||||
raise RuntimeError("TAVILY_API_KEY not configured")
|
||||
self.tavily_service = TavilyService()
|
||||
logger.info("✅ Tavily Research Provider initialized")
|
||||
|
||||
async def search(self, prompt, topic, industry, target_audience, config, user_id):
|
||||
"""Execute Tavily search and return standardized results."""
|
||||
# Build Tavily query
|
||||
query = f"{topic} {industry} {target_audience}"
|
||||
|
||||
# Get Tavily-specific config options (distinct name so the `topic` argument is not shadowed)
tavily_topic = config.tavily_topic or "general"
search_depth = config.tavily_search_depth or "basic"
|
||||
|
||||
logger.info(f"[Tavily Research] Executing search: {query}")
|
||||
|
||||
# Execute Tavily search
|
||||
result = await self.tavily_service.search(
|
||||
query=query,
|
||||
topic=tavily_topic,
|
||||
search_depth=search_depth,
|
||||
max_results=min(config.max_sources, 20),
|
||||
include_domains=config.tavily_include_domains or None,
|
||||
exclude_domains=config.tavily_exclude_domains or None,
|
||||
include_answer=config.tavily_include_answer or False,
|
||||
include_raw_content=config.tavily_include_raw_content or False,
|
||||
include_images=config.tavily_include_images or False,
|
||||
include_image_descriptions=config.tavily_include_image_descriptions or False,
|
||||
time_range=config.tavily_time_range,
|
||||
start_date=config.tavily_start_date,
|
||||
end_date=config.tavily_end_date,
|
||||
country=config.tavily_country,
|
||||
chunks_per_source=config.tavily_chunks_per_source or 3,
|
||||
auto_parameters=config.tavily_auto_parameters or False
|
||||
)
|
||||
|
||||
if not result.get("success"):
|
||||
raise RuntimeError(f"Tavily search failed: {result.get('error', 'Unknown error')}")
|
||||
|
||||
# Transform to standardized format
|
||||
sources = self._transform_sources(result.get("results", []))
|
||||
content = self._aggregate_content(result.get("results", []))
|
||||
|
||||
# Estimate cost per search (basic = 1 credit, advanced = 2 credits)
cost = 0.001 if search_depth == "basic" else 0.002
|
||||
|
||||
logger.info(f"[Tavily Research] Search completed: {len(sources)} sources, depth: {search_depth}")
|
||||
|
||||
return {
|
||||
'sources': sources,
|
||||
'content': content,
|
||||
'search_type': search_depth,
|
||||
'provider': 'tavily',
|
||||
'search_queries': [query],
|
||||
'cost': {'total': cost},
|
||||
'answer': result.get("answer"), # If include_answer was requested
|
||||
'images': result.get("images", [])
|
||||
}
|
||||
|
||||
def get_provider_enum(self):
|
||||
"""Return TAVILY provider enum for subscription tracking."""
|
||||
return APIProvider.TAVILY
|
||||
|
||||
def estimate_tokens(self) -> int:
|
||||
"""Estimate token usage for Tavily (not token-based, but we estimate API calls)."""
|
||||
return 0 # Tavily is per-search, not token-based
|
||||
|
||||
def _transform_sources(self, results):
|
||||
"""Transform Tavily results to ResearchSource format."""
|
||||
sources = []
|
||||
for idx, result in enumerate(results):
|
||||
source_type = self._determine_source_type(result.get("url", ""))
|
||||
|
||||
sources.append({
|
||||
'title': result.get("title", ""),
|
||||
'url': result.get("url", ""),
|
||||
'excerpt': result.get("content", "")[:500], # First 500 chars
|
||||
'credibility_score': result.get("relevance_score", 0.5),
|
||||
'published_at': result.get("published_date"),
|
||||
'index': idx,
|
||||
'source_type': source_type,
|
||||
'content': result.get("content", ""),
|
||||
'raw_content': result.get("raw_content"), # If include_raw_content was requested
|
||||
'score': result.get("score", result.get("relevance_score", 0.5)),
|
||||
'favicon': result.get("favicon")
|
||||
})
|
||||
|
||||
return sources
|
||||
|
||||
def _determine_source_type(self, url):
|
||||
"""Determine source type from URL."""
|
||||
if not url:
|
||||
return 'web'
|
||||
|
||||
url_lower = url.lower()
|
||||
if 'arxiv.org' in url_lower or 'research' in url_lower or '.edu' in url_lower:
|
||||
return 'academic'
|
||||
elif any(news in url_lower for news in ['cnn.com', 'bbc.com', 'reuters.com', 'theguardian.com', 'nytimes.com']):
|
||||
return 'news'
|
||||
elif 'linkedin.com' in url_lower:
|
||||
return 'expert'
|
||||
elif '.gov' in url_lower:
|
||||
return 'government'
|
||||
else:
|
||||
return 'web'
|
||||
|
||||
def _aggregate_content(self, results):
|
||||
"""Aggregate content from Tavily results for LLM analysis."""
|
||||
content_parts = []
|
||||
|
||||
for idx, result in enumerate(results):
|
||||
content = result.get("content", "")
|
||||
if content:
|
||||
content_parts.append(f"Source {idx + 1}: {content}")
|
||||
|
||||
return "\n\n".join(content_parts)
|
||||
|
||||
def track_tavily_usage(self, user_id: str, cost: float, search_depth: str):
|
||||
"""Track Tavily API usage after successful call."""
|
||||
from services.database import get_db
|
||||
from services.subscription import PricingService
|
||||
from sqlalchemy import text
|
||||
|
||||
db = next(get_db())
|
||||
try:
|
||||
pricing_service = PricingService(db)
|
||||
current_period = pricing_service.get_current_billing_period(user_id)
|
||||
|
||||
# Update tavily_calls and tavily_cost via SQL UPDATE
|
||||
update_query = text("""
|
||||
UPDATE usage_summaries
|
||||
SET tavily_calls = COALESCE(tavily_calls, 0) + 1,
|
||||
tavily_cost = COALESCE(tavily_cost, 0) + :cost,
|
||||
total_calls = COALESCE(total_calls, 0) + 1,
|
||||
total_cost = COALESCE(total_cost, 0) + :cost
|
||||
WHERE user_id = :user_id AND billing_period = :period
|
||||
""")
|
||||
db.execute(update_query, {
|
||||
'cost': cost,
|
||||
'user_id': user_id,
|
||||
'period': current_period
|
||||
})
|
||||
db.commit()
|
||||
|
||||
logger.info(f"[Tavily] Tracked usage: user={user_id}, cost=${cost}, depth={search_depth}")
|
||||
except Exception as e:
|
||||
logger.error(f"[Tavily] Failed to track usage: {e}", exc_info=True)
|
||||
db.rollback()
|
||||
finally:
|
||||
db.close()
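
A usage sketch of the provider follows. The config fields mirror the `config.tavily_*` attributes read by `search()` above; a `SimpleNamespace` stands in for the real `ResearchConfig` object here, the import path is assumed from the file location, and `TAVILY_API_KEY` must be set in the environment.

```python
# Sketch only: the import path is an assumption; SimpleNamespace stands in for ResearchConfig.
import asyncio
from types import SimpleNamespace

from services.blog_writer.research.tavily_provider import TavilyResearchProvider

config = SimpleNamespace(
    tavily_topic="general", tavily_search_depth="basic", max_sources=5,
    tavily_include_domains=None, tavily_exclude_domains=None,
    tavily_include_answer=True, tavily_include_raw_content=False,
    tavily_include_images=False, tavily_include_image_descriptions=False,
    tavily_time_range=None, tavily_start_date=None, tavily_end_date=None,
    tavily_country=None, tavily_chunks_per_source=3, tavily_auto_parameters=False,
)

async def main():
    provider = TavilyResearchProvider()  # raises RuntimeError if TAVILY_API_KEY is unset
    result = await provider.search(
        prompt="",  # unused by this provider; the query is built from the fields below
        topic="AI blog writing",
        industry="marketing technology",
        target_audience="content marketers",
        config=config,
        user_id="user_123",
    )
    print(result["provider"], len(result["sources"]), result["cost"]["total"])

asyncio.run(main())
```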
|
||||
|
||||
223
backend/services/blog_writer/retry_utils.py
Normal file
@@ -0,0 +1,223 @@
|
||||
"""
|
||||
Enhanced Retry Utilities for Blog Writer
|
||||
|
||||
Provides advanced retry logic with exponential backoff, jitter, retry budgets,
|
||||
and specific error code handling for different types of API failures.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import random
|
||||
import time
|
||||
from typing import Callable, Any, Optional, Dict, List
|
||||
from dataclasses import dataclass
|
||||
from loguru import logger
|
||||
|
||||
from .exceptions import APIRateLimitException, APITimeoutException
|
||||
|
||||
|
||||
@dataclass
|
||||
class RetryConfig:
|
||||
"""Configuration for retry behavior."""
|
||||
max_attempts: int = 3
|
||||
base_delay: float = 1.0
|
||||
max_delay: float = 60.0
|
||||
exponential_base: float = 2.0
|
||||
jitter: bool = True
|
||||
max_total_time: float = 300.0 # 5 minutes max total time
|
||||
retryable_errors: List[str] = None
|
||||
|
||||
def __post_init__(self):
|
||||
if self.retryable_errors is None:
|
||||
self.retryable_errors = [
|
||||
"503", "502", "504", # Server errors
|
||||
"429", # Rate limit
|
||||
"timeout", "timed out",
|
||||
"connection", "network",
|
||||
"overloaded", "busy"
|
||||
]
|
||||
|
||||
|
||||
class RetryBudget:
|
||||
"""Tracks retry budget to prevent excessive retries."""
|
||||
|
||||
def __init__(self, max_total_time: float):
|
||||
self.max_total_time = max_total_time
|
||||
self.start_time = time.time()
|
||||
self.used_time = 0.0
|
||||
|
||||
def can_retry(self) -> bool:
|
||||
"""Check if we can still retry within budget."""
|
||||
self.used_time = time.time() - self.start_time
|
||||
return self.used_time < self.max_total_time
|
||||
|
||||
def remaining_time(self) -> float:
|
||||
"""Get remaining time in budget."""
|
||||
return max(0, self.max_total_time - self.used_time)
|
||||
|
||||
|
||||
def is_retryable_error(error: Exception, retryable_errors: List[str]) -> bool:
|
||||
"""Check if an error is retryable based on error message patterns."""
|
||||
error_str = str(error).lower()
|
||||
return any(pattern.lower() in error_str for pattern in retryable_errors)
|
||||
|
||||
|
||||
def calculate_delay(attempt: int, config: RetryConfig) -> float:
|
||||
"""Calculate delay for retry attempt with exponential backoff and jitter."""
|
||||
# Exponential backoff
|
||||
delay = config.base_delay * (config.exponential_base ** attempt)
|
||||
|
||||
# Cap at max delay
|
||||
delay = min(delay, config.max_delay)
|
||||
|
||||
# Add jitter to prevent thundering herd
|
||||
if config.jitter:
|
||||
jitter_range = delay * 0.1 # 10% jitter
|
||||
delay += random.uniform(-jitter_range, jitter_range)
|
||||
|
||||
return max(0, delay)
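
# Worked example (not part of the module): with the default RetryConfig
# (base_delay=1.0, exponential_base=2.0, max_delay=60.0) and jitter disabled,
# calculate_delay(0, cfg) -> 1.0s, calculate_delay(1, cfg) -> 2.0s,
# calculate_delay(2, cfg) -> 4.0s; jitter then shifts each value by up to +/-10%.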
|
||||
|
||||
|
||||
async def retry_with_backoff(
|
||||
func: Callable,
|
||||
config: Optional[RetryConfig] = None,
|
||||
operation_name: str = "operation",
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> Any:
|
||||
"""
|
||||
Retry a function with enhanced backoff and budget management.
|
||||
|
||||
Args:
|
||||
func: Async function to retry
|
||||
config: Retry configuration
|
||||
operation_name: Name of operation for logging
|
||||
context: Additional context for logging
|
||||
|
||||
Returns:
|
||||
Function result
|
||||
|
||||
Raises:
|
||||
Last exception if all retries fail
|
||||
"""
|
||||
config = config or RetryConfig()
|
||||
budget = RetryBudget(config.max_total_time)
|
||||
last_exception = None
|
||||
|
||||
for attempt in range(config.max_attempts):
|
||||
try:
|
||||
# Check if we're still within budget
|
||||
if not budget.can_retry():
|
||||
logger.warning(f"Retry budget exceeded for {operation_name} after {budget.used_time:.2f}s")
|
||||
break
|
||||
|
||||
# Execute the function
|
||||
result = await func()
|
||||
logger.info(f"{operation_name} succeeded on attempt {attempt + 1}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
last_exception = e
|
||||
|
||||
# Check if this is the last attempt
|
||||
if attempt == config.max_attempts - 1:
|
||||
logger.error(f"{operation_name} failed after {config.max_attempts} attempts: {str(e)}")
|
||||
break
|
||||
|
||||
# Check if error is retryable
|
||||
if not is_retryable_error(e, config.retryable_errors):
|
||||
logger.warning(f"{operation_name} failed with non-retryable error: {str(e)}")
|
||||
break
|
||||
|
||||
# Calculate delay and wait
|
||||
delay = calculate_delay(attempt, config)
|
||||
remaining_time = budget.remaining_time()
|
||||
|
||||
# Don't wait longer than remaining budget
|
||||
if delay > remaining_time:
|
||||
logger.warning(f"Delay {delay:.2f}s exceeds remaining budget {remaining_time:.2f}s for {operation_name}")
|
||||
break
|
||||
|
||||
logger.warning(
|
||||
f"{operation_name} attempt {attempt + 1} failed: {str(e)}. "
|
||||
f"Retrying in {delay:.2f}s (attempt {attempt + 2}/{config.max_attempts})"
|
||||
)
|
||||
|
||||
await asyncio.sleep(delay)
|
||||
|
||||
# If we get here, all retries failed
|
||||
if last_exception:
|
||||
# Enhance exception with retry context
|
||||
if isinstance(last_exception, Exception):
|
||||
error_str = str(last_exception)
|
||||
if "429" in error_str or "rate limit" in error_str.lower():
|
||||
raise APIRateLimitException(
|
||||
f"Rate limit exceeded after {config.max_attempts} attempts",
|
||||
retry_after=int(calculate_delay(config.max_attempts - 1, config) * 2),  # suggest waiting longer than the final backoff
|
||||
context=context
|
||||
)
|
||||
elif "timeout" in error_str.lower():
|
||||
raise APITimeoutException(
|
||||
f"Request timed out after {config.max_attempts} attempts",
|
||||
timeout_seconds=int(config.max_total_time),
|
||||
context=context
|
||||
)
|
||||
|
||||
raise last_exception
|
||||
|
||||
raise Exception(f"{operation_name} failed after {config.max_attempts} attempts")
|
||||
|
||||
|
||||
def retry_decorator(
|
||||
config: Optional[RetryConfig] = None,
|
||||
operation_name: Optional[str] = None
|
||||
):
|
||||
"""
|
||||
Decorator to add retry logic to async functions.
|
||||
|
||||
Args:
|
||||
config: Retry configuration
|
||||
operation_name: Name of operation for logging
|
||||
"""
|
||||
def decorator(func: Callable) -> Callable:
|
||||
async def wrapper(*args, **kwargs):
|
||||
op_name = operation_name or func.__name__
|
||||
return await retry_with_backoff(
|
||||
lambda: func(*args, **kwargs),
|
||||
config=config,
|
||||
operation_name=op_name
|
||||
)
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
|
||||
# Predefined retry configurations for different operation types
|
||||
RESEARCH_RETRY_CONFIG = RetryConfig(
|
||||
max_attempts=3,
|
||||
base_delay=2.0,
|
||||
max_delay=30.0,
|
||||
max_total_time=180.0, # 3 minutes for research
|
||||
retryable_errors=["503", "429", "timeout", "overloaded", "connection"]
|
||||
)
|
||||
|
||||
OUTLINE_RETRY_CONFIG = RetryConfig(
|
||||
max_attempts=2,
|
||||
base_delay=1.5,
|
||||
max_delay=20.0,
|
||||
max_total_time=120.0, # 2 minutes for outline
|
||||
retryable_errors=["503", "429", "timeout", "overloaded"]
|
||||
)
|
||||
|
||||
CONTENT_RETRY_CONFIG = RetryConfig(
|
||||
max_attempts=3,
|
||||
base_delay=1.0,
|
||||
max_delay=15.0,
|
||||
max_total_time=90.0, # 1.5 minutes for content
|
||||
retryable_errors=["503", "429", "timeout", "overloaded"]
|
||||
)
|
||||
|
||||
SEO_RETRY_CONFIG = RetryConfig(
|
||||
max_attempts=2,
|
||||
base_delay=1.0,
|
||||
max_delay=10.0,
|
||||
max_total_time=60.0, # 1 minute for SEO
|
||||
retryable_errors=["503", "429", "timeout"]
|
||||
)
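
The sketch below shows both call styles under the predefined research configuration. `flaky_call` is a hypothetical coroutine standing in for an LLM or search API call, and the import path is assumed from the file location above.

```python
# Sketch only: flaky_call is hypothetical; the import path is an assumption.
import asyncio

from services.blog_writer.retry_utils import (
    RESEARCH_RETRY_CONFIG, retry_decorator, retry_with_backoff,
)

attempts = {"count": 0}

async def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("503 service overloaded")  # retryable per RESEARCH_RETRY_CONFIG
    return "ok"

# Explicit call style:
result = asyncio.run(
    retry_with_backoff(flaky_call, config=RESEARCH_RETRY_CONFIG, operation_name="research demo")
)
print(result)  # "ok" on the third attempt

# Decorator style:
@retry_decorator(config=RESEARCH_RETRY_CONFIG, operation_name="research step")
async def research_step() -> str:
    return await flaky_call()
```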
|
||||
879
backend/services/blog_writer/seo/blog_content_seo_analyzer.py
Normal file
@@ -0,0 +1,879 @@
|
||||
"""
|
||||
Blog Content SEO Analyzer
|
||||
|
||||
Specialized SEO analyzer for blog content with parallel processing.
|
||||
Leverages existing non-AI SEO tools and uses single AI prompt for structured analysis.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import re
|
||||
import textstat
|
||||
from datetime import datetime
|
||||
from typing import Dict, Any, List, Optional
|
||||
from utils.logger_utils import get_service_logger
|
||||
|
||||
from services.seo_analyzer import (
|
||||
ContentAnalyzer, KeywordAnalyzer,
|
||||
URLStructureAnalyzer, AIInsightGenerator
|
||||
)
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
|
||||
class BlogContentSEOAnalyzer:
|
||||
"""Specialized SEO analyzer for blog content with parallel processing"""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the blog content SEO analyzer"""
|
||||
# Service-specific logger (no global reconfiguration)
|
||||
global logger
|
||||
logger = get_service_logger("blog_content_seo_analyzer")
|
||||
self.content_analyzer = ContentAnalyzer()
|
||||
self.keyword_analyzer = KeywordAnalyzer()
|
||||
self.url_analyzer = URLStructureAnalyzer()
|
||||
self.ai_insights = AIInsightGenerator()
|
||||
|
||||
logger.info("BlogContentSEOAnalyzer initialized")
|
||||
|
||||
async def analyze_blog_content(self, blog_content: str, research_data: Dict[str, Any], blog_title: Optional[str] = None, user_id: str = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Main analysis method with parallel processing
|
||||
|
||||
Args:
|
||||
blog_content: The blog content to analyze
|
||||
research_data: Research data containing keywords and other insights
|
||||
blog_title: Optional blog title
|
||||
user_id: Clerk user ID for subscription checking (required)
|
||||
|
||||
Returns:
|
||||
Comprehensive SEO analysis results
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
|
||||
try:
|
||||
logger.info("Starting blog content SEO analysis")
|
||||
|
||||
# Extract keywords from research data
|
||||
keywords_data = self._extract_keywords_from_research(research_data)
|
||||
logger.info(f"Extracted keywords: {keywords_data}")
|
||||
|
||||
# Phase 1: Run non-AI analyzers in parallel
|
||||
logger.info("Running non-AI analyzers in parallel")
|
||||
non_ai_results = await self._run_non_ai_analyzers(blog_content, keywords_data)
|
||||
|
||||
# Phase 2: Single AI analysis for structured insights
|
||||
logger.info("Running AI analysis")
|
||||
ai_insights = await self._run_ai_analysis(blog_content, keywords_data, non_ai_results, user_id=user_id)
|
||||
|
||||
# Phase 3: Compile and format results
|
||||
logger.info("Compiling results")
|
||||
results = self._compile_blog_seo_results(non_ai_results, ai_insights, keywords_data)
|
||||
|
||||
logger.info(f"SEO analysis completed. Overall score: {results.get('overall_score', 0)}")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Blog SEO analysis failed: {e}")
|
||||
# Fail fast - don't return fallback data
|
||||
raise e
|
||||
|
||||
def _extract_keywords_from_research(self, research_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Extract keywords from research data"""
|
||||
try:
|
||||
logger.info(f"Extracting keywords from research data: {research_data}")
|
||||
|
||||
# Extract keywords from research data structure
|
||||
keyword_analysis = research_data.get('keyword_analysis', {})
|
||||
logger.info(f"Found keyword_analysis: {keyword_analysis}")
|
||||
|
||||
# Handle different possible structures
|
||||
primary_keywords = []
|
||||
long_tail_keywords = []
|
||||
semantic_keywords = []
|
||||
all_keywords = []
|
||||
|
||||
# Try to extract primary keywords from different possible locations
|
||||
if 'primary' in keyword_analysis:
|
||||
primary_keywords = keyword_analysis.get('primary', [])
|
||||
elif 'keywords' in research_data:
|
||||
# Fallback to top-level keywords
|
||||
primary_keywords = research_data.get('keywords', [])
|
||||
|
||||
# Extract other keyword types
|
||||
long_tail_keywords = keyword_analysis.get('long_tail', [])
|
||||
# Handle both 'semantic' and 'semantic_keywords' field names
|
||||
semantic_keywords = keyword_analysis.get('semantic', []) or keyword_analysis.get('semantic_keywords', [])
|
||||
all_keywords = keyword_analysis.get('all_keywords', primary_keywords)
|
||||
|
||||
result = {
|
||||
'primary': primary_keywords,
|
||||
'long_tail': long_tail_keywords,
|
||||
'semantic': semantic_keywords,
|
||||
'all_keywords': all_keywords,
|
||||
'search_intent': keyword_analysis.get('search_intent', 'informational')
|
||||
}
|
||||
|
||||
logger.info(f"Extracted keywords: {result}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to extract keywords from research data: {e}")
|
||||
logger.error(f"Research data structure: {research_data}")
|
||||
# Fail fast - don't return empty keywords
|
||||
raise ValueError(f"Keyword extraction failed: {e}")
|
||||
|
||||
async def _run_non_ai_analyzers(self, blog_content: str, keywords_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Run all non-AI analyzers in parallel for maximum performance"""
|
||||
|
||||
logger.info(f"Starting non-AI analyzers with content length: {len(blog_content)} chars")
|
||||
logger.info(f"Keywords data: {keywords_data}")
|
||||
|
||||
# Parallel execution of fast analyzers
|
||||
tasks = [
|
||||
self._analyze_content_structure(blog_content),
|
||||
self._analyze_keyword_usage(blog_content, keywords_data),
|
||||
self._analyze_readability(blog_content),
|
||||
self._analyze_content_quality(blog_content),
|
||||
self._analyze_heading_structure(blog_content)
|
||||
]
|
||||
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
# Check for exceptions and fail fast
|
||||
for i, result in enumerate(results):
|
||||
if isinstance(result, Exception):
|
||||
task_names = ['content_structure', 'keyword_analysis', 'readability_analysis', 'content_quality', 'heading_structure']
|
||||
logger.error(f"Task {task_names[i]} failed: {result}")
|
||||
raise result
|
||||
|
||||
# Log successful results
|
||||
task_names = ['content_structure', 'keyword_analysis', 'readability_analysis', 'content_quality', 'heading_structure']
|
||||
for name, result in zip(task_names, results):
|
||||
logger.info(f"✅ {name} completed: {type(result).__name__} with {len(result) if isinstance(result, dict) else 'N/A'} fields")
|
||||
|
||||
return {
|
||||
'content_structure': results[0],
|
||||
'keyword_analysis': results[1],
|
||||
'readability_analysis': results[2],
|
||||
'content_quality': results[3],
|
||||
'heading_structure': results[4]
|
||||
}
|
||||
|
||||
async def _analyze_content_structure(self, content: str) -> Dict[str, Any]:
|
||||
"""Analyze blog content structure"""
|
||||
try:
|
||||
# Parse markdown content
|
||||
lines = content.split('\n')
|
||||
|
||||
# Count sections, paragraphs, sentences
|
||||
sections = len([line for line in lines if line.startswith('##')])
|
||||
paragraphs = len([line for line in lines if line.strip() and not line.startswith('#')])
|
||||
sentences = len(re.findall(r'[.!?]+', content))
|
||||
|
||||
# Blog-specific structure analysis
|
||||
has_introduction = any('introduction' in line.lower() or 'overview' in line.lower()
|
||||
for line in lines[:10])
|
||||
has_conclusion = any('conclusion' in line.lower() or 'summary' in line.lower()
|
||||
for line in lines[-10:])
|
||||
has_cta = any('call to action' in line.lower() or 'learn more' in line.lower()
|
||||
for line in lines)
|
||||
|
||||
structure_score = self._calculate_structure_score(sections, paragraphs, has_introduction, has_conclusion)
|
||||
|
||||
return {
|
||||
'total_sections': sections,
|
||||
'total_paragraphs': paragraphs,
|
||||
'total_sentences': sentences,
|
||||
'has_introduction': has_introduction,
|
||||
'has_conclusion': has_conclusion,
|
||||
'has_call_to_action': has_cta,
|
||||
'structure_score': structure_score,
|
||||
'recommendations': self._get_structure_recommendations(sections, has_introduction, has_conclusion)
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Content structure analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
async def _analyze_keyword_usage(self, content: str, keywords_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Analyze keyword usage and optimization"""
|
||||
try:
|
||||
# Extract keywords from research data
|
||||
primary_keywords = keywords_data.get('primary', [])
|
||||
long_tail_keywords = keywords_data.get('long_tail', [])
|
||||
semantic_keywords = keywords_data.get('semantic', [])
|
||||
|
||||
# Use existing KeywordAnalyzer
|
||||
keyword_result = self.keyword_analyzer.analyze(content, primary_keywords)
|
||||
|
||||
# Blog-specific keyword analysis
|
||||
keyword_analysis = {
|
||||
'primary_keywords': primary_keywords,
|
||||
'long_tail_keywords': long_tail_keywords,
|
||||
'semantic_keywords': semantic_keywords,
|
||||
'keyword_density': {},
|
||||
'keyword_distribution': {},
|
||||
'missing_keywords': [],
|
||||
'over_optimization': [],
|
||||
'recommendations': []
|
||||
}
|
||||
|
||||
# Analyze each keyword type
|
||||
for keyword in primary_keywords:
|
||||
density = self._calculate_keyword_density(content, keyword)
|
||||
keyword_analysis['keyword_density'][keyword] = density
|
||||
|
||||
# Check if keyword appears in headings
|
||||
in_headings = self._keyword_in_headings(content, keyword)
|
||||
keyword_analysis['keyword_distribution'][keyword] = {
|
||||
'density': density,
|
||||
'in_headings': in_headings,
|
||||
'first_occurrence': content.lower().find(keyword.lower())
|
||||
}
|
||||
|
||||
# Check for missing important keywords
|
||||
for keyword in primary_keywords:
|
||||
if keyword.lower() not in content.lower():
|
||||
keyword_analysis['missing_keywords'].append(keyword)
|
||||
|
||||
# Check for over-optimization
|
||||
for keyword, density in keyword_analysis['keyword_density'].items():
|
||||
if density > 3.0: # Over 3% density
|
||||
keyword_analysis['over_optimization'].append(keyword)
|
||||
|
||||
return keyword_analysis
|
||||
except Exception as e:
|
||||
logger.error(f"Keyword analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
async def _analyze_readability(self, content: str) -> Dict[str, Any]:
|
||||
"""Analyze content readability using textstat integration"""
|
||||
try:
|
||||
# Calculate readability metrics
|
||||
readability_metrics = {
|
||||
'flesch_reading_ease': textstat.flesch_reading_ease(content),
|
||||
'flesch_kincaid_grade': textstat.flesch_kincaid_grade(content),
|
||||
'gunning_fog': textstat.gunning_fog(content),
|
||||
'smog_index': textstat.smog_index(content),
|
||||
'automated_readability': textstat.automated_readability_index(content),
|
||||
'coleman_liau': textstat.coleman_liau_index(content)
|
||||
}
|
||||
|
||||
# Blog-specific readability analysis
|
||||
avg_sentence_length = self._calculate_avg_sentence_length(content)
|
||||
avg_paragraph_length = self._calculate_avg_paragraph_length(content)
|
||||
|
||||
readability_score = self._calculate_readability_score(readability_metrics)
|
||||
|
||||
return {
|
||||
'metrics': readability_metrics,
|
||||
'avg_sentence_length': avg_sentence_length,
|
||||
'avg_paragraph_length': avg_paragraph_length,
|
||||
'readability_score': readability_score,
|
||||
'target_audience': self._determine_target_audience(readability_metrics),
|
||||
'recommendations': self._get_readability_recommendations(readability_metrics, avg_sentence_length)
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Readability analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
async def _analyze_content_quality(self, content: str) -> Dict[str, Any]:
|
||||
"""Analyze overall content quality"""
|
||||
try:
|
||||
# Word count analysis
|
||||
words = content.split()
|
||||
word_count = len(words)
|
||||
|
||||
# Content depth analysis
|
||||
unique_words = len(set(word.lower() for word in words))
|
||||
vocabulary_diversity = unique_words / word_count if word_count > 0 else 0
|
||||
|
||||
# Content flow analysis
|
||||
transition_words = ['however', 'therefore', 'furthermore', 'moreover', 'additionally', 'consequently']
|
||||
transition_count = sum(content.lower().count(word) for word in transition_words)
|
||||
|
||||
content_depth_score = self._calculate_content_depth_score(word_count, vocabulary_diversity)
|
||||
flow_score = self._calculate_flow_score(transition_count, word_count)
|
||||
|
||||
return {
|
||||
'word_count': word_count,
|
||||
'unique_words': unique_words,
|
||||
'vocabulary_diversity': vocabulary_diversity,
|
||||
'transition_words_used': transition_count,
|
||||
'content_depth_score': content_depth_score,
|
||||
'flow_score': flow_score,
|
||||
'recommendations': self._get_content_quality_recommendations(word_count, vocabulary_diversity, transition_count)
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Content quality analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
async def _analyze_heading_structure(self, content: str) -> Dict[str, Any]:
|
||||
"""Analyze heading structure and hierarchy"""
|
||||
try:
|
||||
# Extract headings
|
||||
h1_headings = re.findall(r'^# (.+)$', content, re.MULTILINE)
|
||||
h2_headings = re.findall(r'^## (.+)$', content, re.MULTILINE)
|
||||
h3_headings = re.findall(r'^### (.+)$', content, re.MULTILINE)
|
||||
|
||||
# Analyze heading structure
|
||||
heading_hierarchy_score = self._calculate_heading_hierarchy_score(h1_headings, h2_headings, h3_headings)
|
||||
|
||||
return {
|
||||
'h1_count': len(h1_headings),
|
||||
'h2_count': len(h2_headings),
|
||||
'h3_count': len(h3_headings),
|
||||
'h1_headings': h1_headings,
|
||||
'h2_headings': h2_headings,
|
||||
'h3_headings': h3_headings,
|
||||
'heading_hierarchy_score': heading_hierarchy_score,
|
||||
'recommendations': self._get_heading_recommendations(h1_headings, h2_headings, h3_headings)
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Heading structure analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
# Helper methods for calculations and scoring
|
||||
def _calculate_structure_score(self, sections: int, paragraphs: int, has_intro: bool, has_conclusion: bool) -> int:
|
||||
"""Calculate content structure score"""
|
||||
score = 0
|
||||
|
||||
# Section count (optimal: 3-8 sections)
|
||||
if 3 <= sections <= 8:
|
||||
score += 30
|
||||
elif sections < 3:
|
||||
score += 15
|
||||
else:
|
||||
score += 20
|
||||
|
||||
# Paragraph count (optimal: 8-20 paragraphs)
|
||||
if 8 <= paragraphs <= 20:
|
||||
score += 30
|
||||
elif paragraphs < 8:
|
||||
score += 15
|
||||
else:
|
||||
score += 20
|
||||
|
||||
# Introduction and conclusion
|
||||
if has_intro:
|
||||
score += 20
|
||||
if has_conclusion:
|
||||
score += 20
|
||||
|
||||
return min(score, 100)
|
||||
|
||||
def _calculate_keyword_density(self, content: str, keyword: str) -> float:
|
||||
"""Calculate keyword density percentage"""
|
||||
content_lower = content.lower()
|
||||
keyword_lower = keyword.lower()
|
||||
|
||||
word_count = len(content.split())
|
||||
keyword_count = content_lower.count(keyword_lower)
|
||||
|
||||
return (keyword_count / word_count * 100) if word_count > 0 else 0
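
# Worked example (not part of the class): in a 1,000-word post where the focus
# keyword appears 12 times, density = 12 / 1000 * 100 = 1.2%. That sits inside the
# 1-3% band rewarded by _calculate_keyword_score and below the >3% threshold that
# _analyze_keyword_usage flags as over-optimization.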
|
||||
|
||||
def _keyword_in_headings(self, content: str, keyword: str) -> bool:
|
||||
"""Check if keyword appears in headings"""
|
||||
headings = re.findall(r'^#+ (.+)$', content, re.MULTILINE)
|
||||
return any(keyword.lower() in heading.lower() for heading in headings)
|
||||
|
||||
def _calculate_avg_sentence_length(self, content: str) -> float:
|
||||
"""Calculate average sentence length"""
|
||||
sentences = re.split(r'[.!?]+', content)
|
||||
sentences = [s.strip() for s in sentences if s.strip()]
|
||||
|
||||
if not sentences:
|
||||
return 0
|
||||
|
||||
total_words = sum(len(sentence.split()) for sentence in sentences)
|
||||
return total_words / len(sentences)
|
||||
|
||||
def _calculate_avg_paragraph_length(self, content: str) -> float:
|
||||
"""Calculate average paragraph length"""
|
||||
paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
|
||||
|
||||
if not paragraphs:
|
||||
return 0
|
||||
|
||||
total_words = sum(len(paragraph.split()) for paragraph in paragraphs)
|
||||
return total_words / len(paragraphs)
|
||||
|
||||
def _calculate_readability_score(self, metrics: Dict[str, float]) -> int:
|
||||
"""Calculate overall readability score"""
|
||||
# Flesch Reading Ease (0-100, higher is better)
|
||||
flesch_score = metrics.get('flesch_reading_ease', 0)
|
||||
|
||||
# Convert to 0-100 scale
|
||||
if flesch_score >= 80:
|
||||
return 90
|
||||
elif flesch_score >= 60:
|
||||
return 80
|
||||
elif flesch_score >= 40:
|
||||
return 70
|
||||
elif flesch_score >= 20:
|
||||
return 60
|
||||
else:
|
||||
return 50
|
||||
|
||||
def _determine_target_audience(self, metrics: Dict[str, float]) -> str:
|
||||
"""Determine target audience based on readability metrics"""
|
||||
flesch_score = metrics.get('flesch_reading_ease', 0)
|
||||
|
||||
if flesch_score >= 80:
|
||||
return "General audience (8th grade level)"
|
||||
elif flesch_score >= 60:
|
||||
return "High school level"
|
||||
elif flesch_score >= 40:
|
||||
return "College level"
|
||||
else:
|
||||
return "Graduate level"
|
||||
|
||||
def _calculate_content_depth_score(self, word_count: int, vocabulary_diversity: float) -> int:
|
||||
"""Calculate content depth score"""
|
||||
score = 0
|
||||
|
||||
# Word count (optimal: 800-2000 words)
|
||||
if 800 <= word_count <= 2000:
|
||||
score += 50
|
||||
elif word_count < 800:
|
||||
score += 30
|
||||
else:
|
||||
score += 40
|
||||
|
||||
# Vocabulary diversity (optimal: 0.4-0.7)
|
||||
if 0.4 <= vocabulary_diversity <= 0.7:
|
||||
score += 50
|
||||
elif vocabulary_diversity < 0.4:
|
||||
score += 30
|
||||
else:
|
||||
score += 40
|
||||
|
||||
return min(score, 100)
|
||||
|
||||
def _calculate_flow_score(self, transition_count: int, word_count: int) -> int:
|
||||
"""Calculate content flow score"""
|
||||
if word_count == 0:
|
||||
return 0
|
||||
|
||||
transition_density = transition_count / (word_count / 100)
|
||||
|
||||
# Optimal transition density: 1-3 per 100 words
|
||||
if 1 <= transition_density <= 3:
|
||||
return 90
|
||||
elif transition_density < 1:
|
||||
return 60
|
||||
else:
|
||||
return 70
|
||||
|
||||
def _calculate_heading_hierarchy_score(self, h1: List[str], h2: List[str], h3: List[str]) -> int:
|
||||
"""Calculate heading hierarchy score"""
|
||||
score = 0
|
||||
|
||||
# Should have exactly 1 H1
|
||||
if len(h1) == 1:
|
||||
score += 40
|
||||
elif len(h1) == 0:
|
||||
score += 20
|
||||
else:
|
||||
score += 10
|
||||
|
||||
# Should have 3-8 H2 headings
|
||||
if 3 <= len(h2) <= 8:
|
||||
score += 40
|
||||
elif len(h2) < 3:
|
||||
score += 20
|
||||
else:
|
||||
score += 30
|
||||
|
||||
# H3 headings are optional but good for structure
|
||||
if len(h3) > 0:
|
||||
score += 20
|
||||
|
||||
return min(score, 100)
|
||||
|
||||
def _calculate_keyword_score(self, keyword_analysis: Dict[str, Any]) -> int:
|
||||
"""Calculate keyword optimization score"""
|
||||
score = 0
|
||||
|
||||
# Check keyword density (optimal: 1-3%)
|
||||
densities = keyword_analysis.get('keyword_density', {})
|
||||
for keyword, density in densities.items():
|
||||
if 1 <= density <= 3:
|
||||
score += 30
|
||||
elif density < 1:
|
||||
score += 15
|
||||
else:
|
||||
score += 10
|
||||
|
||||
# Check keyword distribution
|
||||
distributions = keyword_analysis.get('keyword_distribution', {})
|
||||
for keyword, dist in distributions.items():
|
||||
if dist.get('in_headings', False):
|
||||
score += 20
|
||||
if 0 <= dist.get('first_occurrence', -1) < 100:  # early occurrence (str.find returns -1 when absent)
|
||||
score += 20
|
||||
|
||||
# Penalize missing keywords
|
||||
missing = len(keyword_analysis.get('missing_keywords', []))
|
||||
score -= missing * 10
|
||||
|
||||
# Penalize over-optimization
|
||||
over_opt = len(keyword_analysis.get('over_optimization', []))
|
||||
score -= over_opt * 15
|
||||
|
||||
return max(0, min(score, 100))
|
||||
|
||||
def _calculate_weighted_score(self, scores: Dict[str, int]) -> int:
|
||||
"""Calculate weighted overall score"""
|
||||
weights = {
|
||||
'structure': 0.2,
|
||||
'keywords': 0.25,
|
||||
'readability': 0.2,
|
||||
'quality': 0.15,
|
||||
'headings': 0.1,
|
||||
'ai_insights': 0.1
|
||||
}
|
||||
|
||||
weighted_sum = sum(scores.get(key, 0) * weight for key, weight in weights.items())
|
||||
return int(weighted_sum)
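
# Worked example (not part of the class): with category scores
# {'structure': 80, 'keywords': 70, 'readability': 75, 'quality': 60, 'headings': 90, 'ai_insights': 65}
# the weighted sum is 80*0.2 + 70*0.25 + 75*0.2 + 60*0.15 + 90*0.1 + 65*0.1 = 73.0,
# so the overall score reported by _compile_blog_seo_results is 73 (grade "C" in _create_analysis_summary).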
|
||||
|
||||
# Recommendation methods
|
||||
def _get_structure_recommendations(self, sections: int, has_intro: bool, has_conclusion: bool) -> List[str]:
|
||||
"""Get structure recommendations"""
|
||||
recommendations = []
|
||||
|
||||
if sections < 3:
|
||||
recommendations.append("Add more sections to improve content structure")
|
||||
elif sections > 8:
|
||||
recommendations.append("Consider combining some sections for better flow")
|
||||
|
||||
if not has_intro:
|
||||
recommendations.append("Add an introduction section to set context")
|
||||
|
||||
if not has_conclusion:
|
||||
recommendations.append("Add a conclusion section to summarize key points")
|
||||
|
||||
return recommendations
|
||||
|
||||
def _get_readability_recommendations(self, metrics: Dict[str, float], avg_sentence_length: float) -> List[str]:
|
||||
"""Get readability recommendations"""
|
||||
recommendations = []
|
||||
|
||||
flesch_score = metrics.get('flesch_reading_ease', 0)
|
||||
|
||||
if flesch_score < 60:
|
||||
recommendations.append("Simplify language and use shorter sentences")
|
||||
|
||||
if avg_sentence_length > 20:
|
||||
recommendations.append("Break down long sentences for better readability")
|
||||
|
||||
if flesch_score > 80:
|
||||
recommendations.append("Consider adding more technical depth for expert audience")
|
||||
|
||||
return recommendations
|
||||
|
||||
def _get_content_quality_recommendations(self, word_count: int, vocabulary_diversity: float, transition_count: int) -> List[str]:
|
||||
"""Get content quality recommendations"""
|
||||
recommendations = []
|
||||
|
||||
if word_count < 800:
|
||||
recommendations.append("Expand content with more detailed explanations")
|
||||
elif word_count > 2000:
|
||||
recommendations.append("Consider breaking into multiple posts")
|
||||
|
||||
if vocabulary_diversity < 0.4:
|
||||
recommendations.append("Use more varied vocabulary to improve engagement")
|
||||
|
||||
if transition_count < 3:
|
||||
recommendations.append("Add more transition words to improve flow")
|
||||
|
||||
return recommendations
|
||||
|
||||
def _get_heading_recommendations(self, h1: List[str], h2: List[str], h3: List[str]) -> List[str]:
|
||||
"""Get heading recommendations"""
|
||||
recommendations = []
|
||||
|
||||
if len(h1) == 0:
|
||||
recommendations.append("Add a main H1 heading")
|
||||
elif len(h1) > 1:
|
||||
recommendations.append("Use only one H1 heading per post")
|
||||
|
||||
if len(h2) < 3:
|
||||
recommendations.append("Add more H2 headings to structure content")
|
||||
elif len(h2) > 8:
|
||||
recommendations.append("Consider using H3 headings for better hierarchy")
|
||||
|
||||
return recommendations
|
||||
|
||||
async def _run_ai_analysis(self, blog_content: str, keywords_data: Dict[str, Any], non_ai_results: Dict[str, Any], user_id: str = None) -> Dict[str, Any]:
|
||||
"""Run single AI analysis for structured insights (provider-agnostic)"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
|
||||
try:
|
||||
# Prepare context for AI analysis
|
||||
context = {
|
||||
'blog_content': blog_content,
|
||||
'keywords_data': keywords_data,
|
||||
'non_ai_results': non_ai_results
|
||||
}
|
||||
|
||||
# Create AI prompt for structured analysis
|
||||
prompt = self._create_ai_analysis_prompt(context)
|
||||
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content_quality_insights": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"engagement_score": {"type": "number"},
|
||||
"value_proposition": {"type": "string"},
|
||||
"content_gaps": {"type": "array", "items": {"type": "string"}},
|
||||
"improvement_suggestions": {"type": "array", "items": {"type": "string"}}
|
||||
}
|
||||
},
|
||||
"seo_optimization_insights": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"keyword_optimization": {"type": "string"},
|
||||
"content_relevance": {"type": "string"},
|
||||
"search_intent_alignment": {"type": "string"},
|
||||
"seo_improvements": {"type": "array", "items": {"type": "string"}}
|
||||
}
|
||||
},
|
||||
"user_experience_insights": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content_flow": {"type": "string"},
|
||||
"readability_assessment": {"type": "string"},
|
||||
"engagement_factors": {"type": "array", "items": {"type": "string"}},
|
||||
"ux_improvements": {"type": "array", "items": {"type": "string"}}
|
||||
}
|
||||
},
|
||||
"competitive_analysis": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content_differentiation": {"type": "string"},
|
||||
"unique_value": {"type": "string"},
|
||||
"competitive_advantages": {"type": "array", "items": {"type": "string"}},
|
||||
"market_positioning": {"type": "string"}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Provider-agnostic structured response respecting GPT_PROVIDER
|
||||
ai_response = llm_text_gen(
|
||||
prompt=prompt,
|
||||
json_struct=schema,
|
||||
system_prompt=None,
|
||||
user_id=user_id # Pass user_id for subscription checking
|
||||
)
|
||||
|
||||
return ai_response
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"AI analysis failed: {e}")
|
||||
raise e
|
||||
|
||||
def _create_ai_analysis_prompt(self, context: Dict[str, Any]) -> str:
|
||||
"""Create AI analysis prompt"""
|
||||
blog_content = context['blog_content']
|
||||
keywords_data = context['keywords_data']
|
||||
non_ai_results = context['non_ai_results']
|
||||
|
||||
prompt = f"""
|
||||
Analyze this blog content for SEO optimization and user experience. Provide structured insights based on the content and keyword data.
|
||||
|
||||
BLOG CONTENT:
|
||||
{blog_content[:2000]}...
|
||||
|
||||
KEYWORDS DATA:
|
||||
Primary Keywords: {keywords_data.get('primary', [])}
|
||||
Long-tail Keywords: {keywords_data.get('long_tail', [])}
|
||||
Semantic Keywords: {keywords_data.get('semantic', [])}
|
||||
Search Intent: {keywords_data.get('search_intent', 'informational')}
|
||||
|
||||
NON-AI ANALYSIS RESULTS:
|
||||
Structure Score: {non_ai_results.get('content_structure', {}).get('structure_score', 0)}
|
||||
Readability Score: {non_ai_results.get('readability_analysis', {}).get('readability_score', 0)}
|
||||
Content Quality Score: {non_ai_results.get('content_quality', {}).get('content_depth_score', 0)}
|
||||
|
||||
Please provide:
|
||||
1. Content Quality Insights: Assess engagement potential, value proposition, content gaps, and improvement suggestions
|
||||
2. SEO Optimization Insights: Evaluate keyword optimization, content relevance, search intent alignment, and SEO improvements
|
||||
3. User Experience Insights: Analyze content flow, readability, engagement factors, and UX improvements
|
||||
4. Competitive Analysis: Identify content differentiation, unique value, competitive advantages, and market positioning
|
||||
|
||||
Focus on actionable insights that can improve the blog's performance and user engagement.
|
||||
"""
|
||||
|
||||
return prompt
|
||||
|
||||
def _compile_blog_seo_results(self, non_ai_results: Dict[str, Any], ai_insights: Dict[str, Any], keywords_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Compile comprehensive SEO analysis results"""
|
||||
try:
|
||||
# Validate required data - fail fast if missing
|
||||
if not non_ai_results:
|
||||
raise ValueError("Non-AI analysis results are missing")
|
||||
|
||||
if not ai_insights:
|
||||
raise ValueError("AI insights are missing")
|
||||
|
||||
# Calculate category scores
|
||||
category_scores = {
|
||||
'structure': non_ai_results.get('content_structure', {}).get('structure_score', 0),
|
||||
'keywords': self._calculate_keyword_score(non_ai_results.get('keyword_analysis', {})),
|
||||
'readability': non_ai_results.get('readability_analysis', {}).get('readability_score', 0),
|
||||
'quality': non_ai_results.get('content_quality', {}).get('content_depth_score', 0),
|
||||
'headings': non_ai_results.get('heading_structure', {}).get('heading_hierarchy_score', 0),
|
||||
'ai_insights': ai_insights.get('content_quality_insights', {}).get('engagement_score', 0)
|
||||
}
|
||||
|
||||
# Calculate overall score
|
||||
overall_score = self._calculate_weighted_score(category_scores)
|
||||
|
||||
# Compile actionable recommendations
|
||||
actionable_recommendations = self._compile_actionable_recommendations(non_ai_results, ai_insights)
|
||||
|
||||
# Create visualization data
|
||||
visualization_data = self._create_visualization_data(category_scores, non_ai_results)
|
||||
|
||||
return {
|
||||
'overall_score': overall_score,
|
||||
'category_scores': category_scores,
|
||||
'detailed_analysis': non_ai_results,
|
||||
'ai_insights': ai_insights,
|
||||
'keywords_data': keywords_data,
|
||||
'visualization_data': visualization_data,
|
||||
'actionable_recommendations': actionable_recommendations,
|
||||
'generated_at': datetime.utcnow().isoformat(),
|
||||
'analysis_summary': self._create_analysis_summary(overall_score, category_scores, ai_insights)
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Results compilation failed: {e}")
|
||||
# Fail fast - don't return fallback data
|
||||
raise e
|
||||
|
||||
def _compile_actionable_recommendations(self, non_ai_results: Dict[str, Any], ai_insights: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Compile actionable recommendations from all sources"""
|
||||
recommendations = []
|
||||
|
||||
# Structure recommendations
|
||||
structure_recs = non_ai_results.get('content_structure', {}).get('recommendations', [])
|
||||
for rec in structure_recs:
|
||||
recommendations.append({
|
||||
'category': 'Structure',
|
||||
'priority': 'High',
|
||||
'recommendation': rec,
|
||||
'impact': 'Improves content organization and user experience'
|
||||
})
|
||||
|
||||
# Keyword recommendations
|
||||
keyword_recs = non_ai_results.get('keyword_analysis', {}).get('recommendations', [])
|
||||
for rec in keyword_recs:
|
||||
recommendations.append({
|
||||
'category': 'Keywords',
|
||||
'priority': 'High',
|
||||
'recommendation': rec,
|
||||
'impact': 'Improves search engine visibility'
|
||||
})
|
||||
|
||||
# Readability recommendations
|
||||
readability_recs = non_ai_results.get('readability_analysis', {}).get('recommendations', [])
|
||||
for rec in readability_recs:
|
||||
recommendations.append({
|
||||
'category': 'Readability',
|
||||
'priority': 'Medium',
|
||||
'recommendation': rec,
|
||||
'impact': 'Improves user engagement and comprehension'
|
||||
})
|
||||
|
||||
# AI insights recommendations
|
||||
ai_recs = ai_insights.get('content_quality_insights', {}).get('improvement_suggestions', [])
|
||||
for rec in ai_recs:
|
||||
recommendations.append({
|
||||
'category': 'Content Quality',
|
||||
'priority': 'Medium',
|
||||
'recommendation': rec,
|
||||
'impact': 'Enhances content value and engagement'
|
||||
})
|
||||
|
||||
return recommendations
|
||||
|
||||
def _create_visualization_data(self, category_scores: Dict[str, int], non_ai_results: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Create data for visualization components"""
|
||||
return {
|
||||
'score_radar': {
|
||||
'categories': list(category_scores.keys()),
|
||||
'scores': list(category_scores.values()),
|
||||
'max_score': 100
|
||||
},
|
||||
'keyword_analysis': {
|
||||
'densities': non_ai_results.get('keyword_analysis', {}).get('keyword_density', {}),
|
||||
'missing_keywords': non_ai_results.get('keyword_analysis', {}).get('missing_keywords', []),
|
||||
'over_optimization': non_ai_results.get('keyword_analysis', {}).get('over_optimization', [])
|
||||
},
|
||||
'readability_metrics': non_ai_results.get('readability_analysis', {}).get('metrics', {}),
|
||||
'content_stats': {
|
||||
'word_count': non_ai_results.get('content_quality', {}).get('word_count', 0),
|
||||
'sections': non_ai_results.get('content_structure', {}).get('total_sections', 0),
|
||||
'paragraphs': non_ai_results.get('content_structure', {}).get('total_paragraphs', 0)
|
||||
}
|
||||
}
|
||||
|
||||
def _create_analysis_summary(self, overall_score: int, category_scores: Dict[str, int], ai_insights: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Create analysis summary"""
|
||||
# Determine overall grade
|
||||
if overall_score >= 90:
|
||||
grade = 'A'
|
||||
status = 'Excellent'
|
||||
elif overall_score >= 80:
|
||||
grade = 'B'
|
||||
status = 'Good'
|
||||
elif overall_score >= 70:
|
||||
grade = 'C'
|
||||
status = 'Fair'
|
||||
elif overall_score >= 60:
|
||||
grade = 'D'
|
||||
status = 'Needs Improvement'
|
||||
else:
|
||||
grade = 'F'
|
||||
status = 'Poor'
|
||||
|
||||
# Find strongest and weakest categories
|
||||
strongest_category = max(category_scores.items(), key=lambda x: x[1])
|
||||
weakest_category = min(category_scores.items(), key=lambda x: x[1])
|
||||
|
||||
return {
|
||||
'overall_grade': grade,
|
||||
'status': status,
|
||||
'strongest_category': strongest_category[0],
|
||||
'weakest_category': weakest_category[0],
|
||||
'key_strengths': self._identify_key_strengths(category_scores),
|
||||
'key_weaknesses': self._identify_key_weaknesses(category_scores),
|
||||
'ai_summary': ai_insights.get('content_quality_insights', {}).get('value_proposition', '')
|
||||
}
|
||||
|
||||
def _identify_key_strengths(self, category_scores: Dict[str, int]) -> List[str]:
|
||||
"""Identify key strengths"""
|
||||
strengths = []
|
||||
|
||||
for category, score in category_scores.items():
|
||||
if score >= 80:
|
||||
strengths.append(f"Strong {category} optimization")
|
||||
|
||||
return strengths
|
||||
|
||||
def _identify_key_weaknesses(self, category_scores: Dict[str, int]) -> List[str]:
|
||||
"""Identify key weaknesses"""
|
||||
weaknesses = []
|
||||
|
||||
for category, score in category_scores.items():
|
||||
if score < 60:
|
||||
weaknesses.append(f"Needs improvement in {category}")
|
||||
|
||||
return weaknesses
|
||||
|
||||
def _create_error_result(self, error_message: str) -> Dict[str, Any]:
|
||||
"""Create error result - this should not be used in fail-fast mode"""
|
||||
raise ValueError(f"Error result creation not allowed in fail-fast mode: {error_message}")
|
||||
668
backend/services/blog_writer/seo/blog_seo_metadata_generator.py
Normal file
@@ -0,0 +1,668 @@
|
||||
"""
|
||||
Blog SEO Metadata Generator
|
||||
|
||||
Optimized SEO metadata generation service that uses maximum 2 AI calls
|
||||
to generate comprehensive metadata including titles, descriptions,
|
||||
Open Graph tags, Twitter cards, and structured data.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import re
|
||||
from datetime import datetime
|
||||
from typing import Dict, Any, List, Optional
|
||||
from loguru import logger
|
||||
|
||||
from services.llm_providers.main_text_generation import llm_text_gen
|
||||
|
||||
|
||||
class BlogSEOMetadataGenerator:
|
||||
"""Optimized SEO metadata generator with maximum 2 AI calls"""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the metadata generator"""
|
||||
logger.info("BlogSEOMetadataGenerator initialized")
|
||||
|
||||
async def generate_comprehensive_metadata(
|
||||
self,
|
||||
blog_content: str,
|
||||
blog_title: str,
|
||||
research_data: Dict[str, Any],
|
||||
outline: Optional[List[Dict[str, Any]]] = None,
|
||||
seo_analysis: Optional[Dict[str, Any]] = None,
|
||||
user_id: str = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate comprehensive SEO metadata using maximum 2 AI calls
|
||||
|
||||
Args:
|
||||
blog_content: The blog content to analyze
|
||||
blog_title: The blog title
|
||||
research_data: Research data containing keywords and insights
|
||||
outline: Outline structure with sections and headings
|
||||
seo_analysis: SEO analysis results from previous phase
|
||||
user_id: Clerk user ID for subscription checking (required)
|
||||
|
||||
Returns:
|
||||
Comprehensive metadata including all SEO elements
|
||||
"""
|
||||
if not user_id:
|
||||
raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
|
||||
try:
|
||||
logger.info("Starting comprehensive SEO metadata generation")
|
||||
|
||||
# Extract keywords and context from research data
|
||||
keywords_data = self._extract_keywords_from_research(research_data)
|
||||
logger.info(f"Extracted keywords: {keywords_data}")
|
||||
|
||||
# Call 1: Generate core SEO metadata (parallel with Call 2)
|
||||
logger.info("Generating core SEO metadata")
|
||||
core_metadata_task = self._generate_core_metadata(
|
||||
blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
|
||||
)
|
||||
|
||||
# Call 2: Generate social media and structured data (parallel with Call 1)
|
||||
logger.info("Generating social media and structured data")
|
||||
social_metadata_task = self._generate_social_metadata(
|
||||
blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
|
||||
)
|
||||
|
||||
# Wait for both calls to complete
|
||||
core_metadata, social_metadata = await asyncio.gather(
|
||||
core_metadata_task,
|
||||
social_metadata_task
|
||||
)
|
||||
|
||||
# Compile final response
|
||||
results = self._compile_metadata_response(core_metadata, social_metadata, blog_title)
|
||||
|
||||
logger.info(f"SEO metadata generation completed successfully")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"SEO metadata generation failed: {e}")
|
||||
# Fail fast - don't return fallback data
|
||||
raise e
|
||||
|
||||
def _extract_keywords_from_research(self, research_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Extract keywords and context from research data"""
|
||||
try:
|
||||
keyword_analysis = research_data.get('keyword_analysis', {})
|
||||
|
||||
# Handle both 'semantic' and 'semantic_keywords' field names
|
||||
semantic_keywords = keyword_analysis.get('semantic', []) or keyword_analysis.get('semantic_keywords', [])
|
||||
|
||||
return {
|
||||
'primary_keywords': keyword_analysis.get('primary', []),
|
||||
'long_tail_keywords': keyword_analysis.get('long_tail', []),
|
||||
'semantic_keywords': semantic_keywords,
|
||||
'all_keywords': keyword_analysis.get('all_keywords', []),
|
||||
'search_intent': keyword_analysis.get('search_intent', 'informational'),
|
||||
'target_audience': research_data.get('target_audience', 'general'),
|
||||
'industry': research_data.get('industry', 'general')
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to extract keywords from research: {e}")
|
||||
return {
|
||||
'primary_keywords': [],
|
||||
'long_tail_keywords': [],
|
||||
'semantic_keywords': [],
|
||||
'all_keywords': [],
|
||||
'search_intent': 'informational',
|
||||
'target_audience': 'general',
|
||||
'industry': 'general'
|
||||
}
|
||||
|
||||
async def _generate_core_metadata(
|
||||
self,
|
||||
blog_content: str,
|
||||
blog_title: str,
|
||||
keywords_data: Dict[str, Any],
|
||||
outline: Optional[List[Dict[str, Any]]] = None,
|
||||
seo_analysis: Optional[Dict[str, Any]] = None,
|
||||
        user_id: str = None
    ) -> Dict[str, Any]:
        """Generate core SEO metadata (Call 1)"""
        if not user_id:
            raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
        try:
            # Create comprehensive prompt for core metadata
            prompt = self._create_core_metadata_prompt(
                blog_content, blog_title, keywords_data, outline, seo_analysis
            )

            # Define simplified structured schema for core metadata
            schema = {
                "type": "object",
                "properties": {
                    "seo_title": {
                        "type": "string",
                        "description": "SEO-optimized title (50-60 characters)"
                    },
                    "meta_description": {
                        "type": "string",
                        "description": "Meta description (150-160 characters)"
                    },
                    "url_slug": {
                        "type": "string",
                        "description": "URL slug (lowercase, hyphens)"
                    },
                    "blog_tags": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Blog tags array"
                    },
                    "blog_categories": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Blog categories array"
                    },
                    "social_hashtags": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Social media hashtags array"
                    },
                    "reading_time": {
                        "type": "integer",
                        "description": "Reading time in minutes"
                    },
                    "focus_keyword": {
                        "type": "string",
                        "description": "Primary focus keyword"
                    }
                },
                "required": ["seo_title", "meta_description", "url_slug", "blog_tags", "blog_categories", "social_hashtags", "reading_time", "focus_keyword"]
            }

            # Get structured response using provider-agnostic llm_text_gen
            ai_response_raw = llm_text_gen(
                prompt=prompt,
                json_struct=schema,
                system_prompt=None,
                user_id=user_id  # Pass user_id for subscription checking
            )

            # Handle response: llm_text_gen may return dict (from structured JSON) or str (needs parsing)
            ai_response = ai_response_raw
            if isinstance(ai_response_raw, str):
                try:
                    import json
                    ai_response = json.loads(ai_response_raw)
                except json.JSONDecodeError:
                    logger.error(f"Failed to parse JSON response: {ai_response_raw[:200]}...")
                    ai_response = None

            # Check if we got a valid response
            if not ai_response or not isinstance(ai_response, dict):
                logger.error("Core metadata generation failed: Invalid response from LLM")
                # Return fallback response
                primary_keywords = ', '.join(keywords_data.get('primary_keywords', ['content']))
                word_count = len(blog_content.split())
                return {
                    'seo_title': blog_title,
                    'meta_description': f'Learn about {primary_keywords.split(", ")[0] if primary_keywords else "this topic"}.',
                    'url_slug': blog_title.lower().replace(' ', '-').replace(':', '').replace(',', '')[:50],
                    'blog_tags': primary_keywords.split(', ') if primary_keywords else ['content'],
                    'blog_categories': ['Content Marketing', 'Technology'],
                    'social_hashtags': ['#content', '#marketing', '#technology'],
                    'reading_time': max(1, word_count // 200),
                    'focus_keyword': primary_keywords.split(', ')[0] if primary_keywords else 'content'
                }

            logger.info(f"Core metadata generation completed. Response keys: {list(ai_response.keys())}")
            logger.info(f"Core metadata response: {ai_response}")

            return ai_response

        except Exception as e:
            logger.error(f"Core metadata generation failed: {e}")
            raise e

    async def _generate_social_metadata(
        self,
        blog_content: str,
        blog_title: str,
        keywords_data: Dict[str, Any],
        outline: Optional[List[Dict[str, Any]]] = None,
        seo_analysis: Optional[Dict[str, Any]] = None,
        user_id: str = None
    ) -> Dict[str, Any]:
        """Generate social media and structured data (Call 2)"""
        if not user_id:
            raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")
        try:
            # Create comprehensive prompt for social metadata
            prompt = self._create_social_metadata_prompt(
                blog_content, blog_title, keywords_data, outline, seo_analysis
            )

            # Define simplified structured schema for social metadata
            schema = {
                "type": "object",
                "properties": {
                    "open_graph": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "description": {"type": "string"},
                            "image": {"type": "string"},
                            "type": {"type": "string"},
                            "site_name": {"type": "string"},
                            "url": {"type": "string"}
                        }
                    },
                    "twitter_card": {
                        "type": "object",
                        "properties": {
                            "card": {"type": "string"},
                            "title": {"type": "string"},
                            "description": {"type": "string"},
                            "image": {"type": "string"},
                            "site": {"type": "string"},
                            "creator": {"type": "string"}
                        }
                    },
                    "json_ld_schema": {
                        "type": "object",
                        "properties": {
                            "@context": {"type": "string"},
                            "@type": {"type": "string"},
                            "headline": {"type": "string"},
                            "description": {"type": "string"},
                            "author": {"type": "object"},
                            "publisher": {"type": "object"},
                            "datePublished": {"type": "string"},
                            "dateModified": {"type": "string"},
                            "mainEntityOfPage": {"type": "string"},
                            "keywords": {"type": "array"},
                            "wordCount": {"type": "integer"}
                        }
                    }
                },
                "required": ["open_graph", "twitter_card", "json_ld_schema"]
            }

            # Get structured response using provider-agnostic llm_text_gen
            ai_response_raw = llm_text_gen(
                prompt=prompt,
                json_struct=schema,
                system_prompt=None,
                user_id=user_id  # Pass user_id for subscription checking
            )

            # Handle response: llm_text_gen may return dict (from structured JSON) or str (needs parsing)
            ai_response = ai_response_raw
            if isinstance(ai_response_raw, str):
                try:
                    import json
                    ai_response = json.loads(ai_response_raw)
                except json.JSONDecodeError:
                    logger.error(f"Failed to parse JSON response: {ai_response_raw[:200]}...")
                    ai_response = None

            # Check if we got a valid response
            if not ai_response or not isinstance(ai_response, dict) or not ai_response.get('open_graph') or not ai_response.get('twitter_card') or not ai_response.get('json_ld_schema'):
                logger.error("Social metadata generation failed: Invalid or empty response from LLM")
                # Return fallback response
                return {
                    'open_graph': {
                        'title': blog_title,
                        'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
                        'image': 'https://example.com/image.jpg',
                        'type': 'article',
                        'site_name': 'Your Website',
                        'url': 'https://example.com/blog'
                    },
                    'twitter_card': {
                        'card': 'summary_large_image',
                        'title': blog_title,
                        'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
                        'image': 'https://example.com/image.jpg',
                        'site': '@yourwebsite',
                        'creator': '@author'
                    },
                    'json_ld_schema': {
                        '@context': 'https://schema.org',
                        '@type': 'Article',
                        'headline': blog_title,
                        'description': f'Learn about {keywords_data.get("primary_keywords", ["this topic"])[0] if keywords_data.get("primary_keywords") else "this topic"}.',
                        'author': {'@type': 'Person', 'name': 'Author Name'},
                        'publisher': {'@type': 'Organization', 'name': 'Your Website'},
                        'datePublished': '2025-01-01T00:00:00Z',
                        'dateModified': '2025-01-01T00:00:00Z',
                        'mainEntityOfPage': 'https://example.com/blog',
                        'keywords': keywords_data.get('primary_keywords', ['content']),
                        'wordCount': len(blog_content.split())
                    }
                }

            logger.info(f"Social metadata generation completed. Response keys: {list(ai_response.keys())}")
            logger.info(f"Open Graph data: {ai_response.get('open_graph', 'Not found')}")
            logger.info(f"Twitter Card data: {ai_response.get('twitter_card', 'Not found')}")
            logger.info(f"JSON-LD data: {ai_response.get('json_ld_schema', 'Not found')}")

            return ai_response

        except Exception as e:
            logger.error(f"Social metadata generation failed: {e}")
            raise e

    def _extract_content_highlights(self, blog_content: str, max_length: int = 2500) -> str:
        """Extract key sections from blog content for prompt context"""
        try:
            lines = blog_content.split('\n')

            # Get first paragraph (introduction)
            intro = ""
            for line in lines[:20]:
                if line.strip() and not line.strip().startswith('#'):
                    intro += line.strip() + " "
                    if len(intro) > 300:
                        break

            # Get section headings
            headings = [line.strip() for line in lines if line.strip().startswith('##')][:6]

            # Get conclusion if available
            conclusion = ""
            for line in reversed(lines[-20:]):
                if line.strip() and not line.strip().startswith('#'):
                    conclusion = line.strip() + " " + conclusion
                    if len(conclusion) > 300:
                        break

            highlights = f"INTRODUCTION: {intro[:300]}...\n\n"
            highlights += f"SECTION HEADINGS: {' | '.join([h.replace('##', '').strip() for h in headings])}\n\n"
            if conclusion:
                highlights += f"CONCLUSION: {conclusion[:300]}..."

            return highlights[:max_length]
        except Exception as e:
            logger.warning(f"Failed to extract content highlights: {e}")
            return blog_content[:2000] + "..."

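    # Illustrative output of _extract_content_highlights (values are made up; only
    # the labels and layout come from the f-strings above):
    #
    #     INTRODUCTION: Opening paragraph of the post, trimmed to roughly 300 characters...
    #
    #     SECTION HEADINGS: What Is X | Why X Matters | Getting Started
    #
    #     CONCLUSION: Closing paragraph of the post, trimmed to roughly 300 characters...
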
    def _create_core_metadata_prompt(
        self,
        blog_content: str,
        blog_title: str,
        keywords_data: Dict[str, Any],
        outline: Optional[List[Dict[str, Any]]] = None,
        seo_analysis: Optional[Dict[str, Any]] = None
    ) -> str:
        """Create high-quality prompt for core metadata generation"""

        primary_keywords = ", ".join(keywords_data.get('primary_keywords', []))
        semantic_keywords = ", ".join(keywords_data.get('semantic_keywords', []))
        search_intent = keywords_data.get('search_intent', 'informational')
        target_audience = keywords_data.get('target_audience', 'general')
        industry = keywords_data.get('industry', 'general')
        word_count = len(blog_content.split())

        # Extract outline structure
        outline_context = ""
        if outline:
            headings = [s.get('heading', '') for s in outline if s.get('heading')]
            outline_context = f"""
OUTLINE STRUCTURE:
- Total sections: {len(outline)}
- Section headings: {', '.join(headings[:8])}
- Content hierarchy: Well-structured with {len(outline)} main sections
"""

        # Extract SEO analysis insights
        seo_context = ""
        if seo_analysis:
            overall_score = seo_analysis.get('overall_score', seo_analysis.get('seo_score', 0))
            category_scores = seo_analysis.get('category_scores', {})
            applied_recs = seo_analysis.get('applied_recommendations', [])

            seo_context = f"""
SEO ANALYSIS RESULTS:
- Overall SEO Score: {overall_score}/100
- Category Scores: Structure {category_scores.get('structure', category_scores.get('Structure', 0))}, Keywords {category_scores.get('keywords', category_scores.get('Keywords', 0))}, Readability {category_scores.get('readability', category_scores.get('Readability', 0))}
- Applied Recommendations: {len(applied_recs)} SEO optimizations have been applied
- Content Quality: Optimized for search engines with keyword focus
"""

        # Get more content context (key sections instead of just first 1000 chars)
        content_preview = self._extract_content_highlights(blog_content)

        prompt = f"""
Generate comprehensive, personalized SEO metadata for this blog post.

=== BLOG CONTENT CONTEXT ===
TITLE: {blog_title}
CONTENT PREVIEW (key sections): {content_preview}
WORD COUNT: {word_count} words
READING TIME ESTIMATE: {max(1, word_count // 200)} minutes

{outline_context}

=== KEYWORD & AUDIENCE DATA ===
PRIMARY KEYWORDS: {primary_keywords}
SEMANTIC KEYWORDS: {semantic_keywords}
SEARCH INTENT: {search_intent}
TARGET AUDIENCE: {target_audience}
INDUSTRY: {industry}

{seo_context}

=== METADATA GENERATION REQUIREMENTS ===
1. SEO TITLE (50-60 characters, must include primary keyword):
   - Front-load primary keyword
   - Make it compelling and click-worthy
   - Include power words if appropriate for {target_audience} audience
   - Optimized for {search_intent} search intent

2. META DESCRIPTION (150-160 characters, must include CTA):
   - Include primary keyword naturally in first 120 chars
   - Add compelling call-to-action (e.g., "Learn more", "Discover how", "Get started")
   - Highlight value proposition for {target_audience} audience
   - Use {industry} industry-specific terminology where relevant

3. URL SLUG (lowercase, hyphens, 3-5 words):
   - Include primary keyword
   - Remove stop words
   - Keep it concise and readable

4. BLOG TAGS (5-8 relevant tags):
   - Mix of primary, semantic, and long-tail keywords
   - Industry-specific tags for {industry}
   - Audience-relevant tags for {target_audience}

5. BLOG CATEGORIES (2-3 categories):
   - Based on content structure and {industry} industry standards
   - Reflect main themes from outline sections

6. SOCIAL HASHTAGS (5-10 hashtags with #):
   - Include primary keyword as hashtag
   - Industry-specific hashtags for {industry}
   - Trending/relevant hashtags for {target_audience}

7. READING TIME (calculate from {word_count} words):
   - Average reading speed: 200 words/minute
   - Round to nearest minute

8. FOCUS KEYWORD (primary keyword for SEO):
   - Select the most important primary keyword
   - Should match the main topic and search intent

=== QUALITY REQUIREMENTS ===
- All metadata must be unique, not generic
- Incorporate insights from SEO analysis if provided
- Reflect the actual content structure from outline
- Use language appropriate for {target_audience} audience
- Optimize for {search_intent} search intent
- Make descriptions compelling and action-oriented

Generate metadata that is personalized, compelling, and SEO-optimized.
"""
        return prompt

    def _create_social_metadata_prompt(
        self,
        blog_content: str,
        blog_title: str,
        keywords_data: Dict[str, Any],
        outline: Optional[List[Dict[str, Any]]] = None,
        seo_analysis: Optional[Dict[str, Any]] = None
    ) -> str:
        """Create high-quality prompt for social metadata generation"""

        primary_keywords = ", ".join(keywords_data.get('primary_keywords', []))
        search_intent = keywords_data.get('search_intent', 'informational')
        target_audience = keywords_data.get('target_audience', 'general')
        industry = keywords_data.get('industry', 'general')
        current_date = datetime.now().isoformat()

        # Add outline and SEO context similar to core metadata prompt
        outline_context = ""
        if outline:
            headings = [s.get('heading', '') for s in outline if s.get('heading')]
            outline_context = f"\nOUTLINE SECTIONS: {', '.join(headings[:6])}\n"

        seo_context = ""
        if seo_analysis:
            overall_score = seo_analysis.get('overall_score', seo_analysis.get('seo_score', 0))
            seo_context = f"\nSEO SCORE: {overall_score}/100 (optimized content)\n"

        content_preview = self._extract_content_highlights(blog_content, 1500)

        prompt = f"""
Generate engaging social media metadata for this blog post.

=== CONTENT ===
TITLE: {blog_title}
CONTENT: {content_preview}
{outline_context}
{seo_context}
KEYWORDS: {primary_keywords}
TARGET AUDIENCE: {target_audience}
INDUSTRY: {industry}
CURRENT DATE: {current_date}

=== GENERATION REQUIREMENTS ===

1. OPEN GRAPH (Facebook/LinkedIn):
   - title: 60 chars max, include primary keyword, compelling for {target_audience}
   - description: 160 chars max, include CTA and value proposition
   - image: Suggest an appropriate image URL (placeholder if none available)
   - type: "article"
   - site_name: Use appropriate site name for {industry} industry
   - url: Generate canonical URL structure

2. TWITTER CARD:
   - card: "summary_large_image"
   - title: 70 chars max, optimized for Twitter audience
   - description: 200 chars max with relevant hashtags inline
   - image: Match Open Graph image
   - site: @yourwebsite (placeholder, user should update)
   - creator: @author (placeholder, user should update)

3. JSON-LD SCHEMA (Article):
   - @context: "https://schema.org"
   - @type: "Article"
   - headline: Article title (optimized)
   - description: Article description (150-200 chars)
   - author: {{"@type": "Person", "name": "Author Name"}} (placeholder)
   - publisher: {{"@type": "Organization", "name": "Site Name", "logo": {{"@type": "ImageObject", "url": "logo-url"}}}}
   - datePublished: {current_date}
   - dateModified: {current_date}
   - mainEntityOfPage: {{"@type": "WebPage", "@id": "canonical-url"}}
   - keywords: Array of primary and semantic keywords
   - wordCount: {len(blog_content.split())}
   - articleSection: Primary category based on content
   - inLanguage: "en-US"

Make it engaging, personalized for {target_audience}, and optimized for {industry} industry.
"""
        return prompt

    def _compile_metadata_response(
        self,
        core_metadata: Dict[str, Any],
        social_metadata: Dict[str, Any],
        original_title: str
    ) -> Dict[str, Any]:
        """Compile final metadata response"""
        try:
            # Extract data from AI responses
            seo_title = core_metadata.get('seo_title', original_title)
            meta_description = core_metadata.get('meta_description', '')
            url_slug = core_metadata.get('url_slug', '')
            blog_tags = core_metadata.get('blog_tags', [])
            blog_categories = core_metadata.get('blog_categories', [])
            social_hashtags = core_metadata.get('social_hashtags', [])
            canonical_url = core_metadata.get('canonical_url', '')
            reading_time = core_metadata.get('reading_time', 0)
            focus_keyword = core_metadata.get('focus_keyword', '')

            open_graph = social_metadata.get('open_graph', {})
            twitter_card = social_metadata.get('twitter_card', {})
            json_ld_schema = social_metadata.get('json_ld_schema', {})

            # Compile comprehensive response
            response = {
                'success': True,
                'title_options': [seo_title],  # For backward compatibility
                'meta_descriptions': [meta_description],  # For backward compatibility
                'seo_title': seo_title,
                'meta_description': meta_description,
                'url_slug': url_slug,
                'blog_tags': blog_tags,
                'blog_categories': blog_categories,
                'social_hashtags': social_hashtags,
                'canonical_url': canonical_url,
                'reading_time': reading_time,
                'focus_keyword': focus_keyword,
                'open_graph': open_graph,
                'twitter_card': twitter_card,
                'json_ld_schema': json_ld_schema,
                'generated_at': datetime.utcnow().isoformat(),
                'metadata_summary': {
                    'total_metadata_types': 10,
                    'ai_calls_used': 2,
                    'optimization_score': self._calculate_optimization_score(core_metadata, social_metadata)
                }
            }

            logger.info(f"Metadata compilation completed. Generated {len(response)} metadata fields")
            return response

        except Exception as e:
            logger.error(f"Metadata compilation failed: {e}")
            raise e

    def _calculate_optimization_score(self, core_metadata: Dict[str, Any], social_metadata: Dict[str, Any]) -> int:
        """Calculate overall optimization score for the generated metadata"""
        try:
            score = 0

            # Check core metadata completeness
            if core_metadata.get('seo_title'):
                score += 15
            if core_metadata.get('meta_description'):
                score += 15
            if core_metadata.get('url_slug'):
                score += 10
            if core_metadata.get('blog_tags'):
                score += 10
            if core_metadata.get('blog_categories'):
                score += 10
            if core_metadata.get('social_hashtags'):
                score += 10
            if core_metadata.get('focus_keyword'):
                score += 10

            # Check social metadata completeness
            if social_metadata.get('open_graph'):
                score += 10
            if social_metadata.get('twitter_card'):
                score += 5
            if social_metadata.get('json_ld_schema'):
                score += 5

            return min(score, 100)  # Cap at 100

        except Exception as e:
            logger.error(f"Failed to calculate optimization score: {e}")
            return 0
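
    # Illustrative composition of the helpers above (a sketch only; the public
    # method that orchestrates the two LLM calls is not part of this excerpt, so
    # the wiring below is an assumption about how these private helpers compose).
    # Argument order mirrors the signatures defined above:
    #
    #     core_metadata = await self._generate_core_metadata(
    #         blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
    #     )
    #     social_metadata = await self._generate_social_metadata(
    #         blog_content, blog_title, keywords_data, outline, seo_analysis, user_id=user_id
    #     )
    #     result = self._compile_metadata_response(core_metadata, social_metadata, blog_title)
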
@@ -0,0 +1,273 @@
"""Blog SEO Recommendation Applier

Applies actionable SEO recommendations to existing blog content using the
provider-agnostic `llm_text_gen` dispatcher. Ensures GPT_PROVIDER parity.
"""

import asyncio
from typing import Dict, Any, List
from utils.logger_utils import get_service_logger

from services.llm_providers.main_text_generation import llm_text_gen


logger = get_service_logger("blog_seo_recommendation_applier")


class BlogSEORecommendationApplier:
    """Apply actionable SEO recommendations to blog content."""

    def __init__(self):
        logger.debug("Initialized BlogSEORecommendationApplier")

    async def apply_recommendations(self, payload: Dict[str, Any], user_id: str = None) -> Dict[str, Any]:
        """Apply recommendations and return updated content."""

        if not user_id:
            raise ValueError("user_id is required for subscription checking. Please provide Clerk user ID.")

        title = payload.get("title", "Untitled Blog")
        sections: List[Dict[str, Any]] = payload.get("sections", [])
        outline = payload.get("outline", [])
        research = payload.get("research", {})
        recommendations = payload.get("recommendations", [])
        persona = payload.get("persona", {})
        tone = payload.get("tone")
        audience = payload.get("audience")

        if not sections:
            return {"success": False, "error": "No sections provided for recommendation application"}

        if not recommendations:
            logger.warning("apply_recommendations called without recommendations")
            return {"success": True, "title": title, "sections": sections, "applied": []}

        prompt = self._build_prompt(
            title=title,
            sections=sections,
            outline=outline,
            research=research,
            recommendations=recommendations,
            persona=persona,
            tone=tone,
            audience=audience,
        )

        schema = {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "sections": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "string"},
                            "heading": {"type": "string"},
                            "content": {"type": "string"},
                            "notes": {"type": "array", "items": {"type": "string"}},
                        },
                        "required": ["id", "heading", "content"],
                    },
                },
                "applied_recommendations": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "category": {"type": "string"},
                            "summary": {"type": "string"},
                        },
                    },
                },
            },
            "required": ["sections"],
        }

        logger.info("Applying SEO recommendations via llm_text_gen")

        result = await asyncio.to_thread(
            llm_text_gen,
            prompt,
            system_prompt=None,
            json_struct=schema,
            user_id=user_id,  # Pass user_id for subscription checking
        )

        if not result or result.get("error"):
            error_msg = result.get("error", "Unknown error") if result else "No response from text generator"
            logger.error(f"SEO recommendation application failed: {error_msg}")
            return {"success": False, "error": error_msg}

        raw_sections = result.get("sections", []) or []
        normalized_sections: List[Dict[str, Any]] = []

        # Build lookup table from updated sections using their identifiers
        updated_map: Dict[str, Dict[str, Any]] = {}
        for updated in raw_sections:
            section_id = str(
                updated.get("id")
                or updated.get("section_id")
                or updated.get("heading")
                or ""
            ).strip()

            if not section_id:
                continue

            heading = (
                updated.get("heading")
                or updated.get("title")
                or section_id
            )

            content_text = updated.get("content", "")
            if isinstance(content_text, list):
                content_text = "\n\n".join(str(p).strip() for p in content_text if p)

            updated_map[section_id] = {
                "id": section_id,
                "heading": heading,
                "content": str(content_text).strip(),
                "notes": updated.get("notes", []),
            }

        if not updated_map and raw_sections:
            logger.warning("Updated sections missing identifiers; falling back to positional mapping")

        for index, original in enumerate(sections):
            fallback_id = str(
                original.get("id")
                or original.get("section_id")
                or f"section_{index + 1}"
            ).strip()

            mapped = updated_map.get(fallback_id)

            if not mapped and raw_sections:
                # Fall back to positional match if identifier lookup failed
                candidate = raw_sections[index] if index < len(raw_sections) else {}
                heading = (
                    candidate.get("heading")
                    or candidate.get("title")
                    or original.get("heading")
                    or original.get("title")
                    or f"Section {index + 1}"
                )
                content_text = candidate.get("content") or original.get("content", "")
                if isinstance(content_text, list):
                    content_text = "\n\n".join(str(p).strip() for p in content_text if p)
                mapped = {
                    "id": fallback_id,
                    "heading": heading,
                    "content": str(content_text).strip(),
                    "notes": candidate.get("notes", []),
                }

            if not mapped:
                # Fallback to original content if nothing else available
                mapped = {
                    "id": fallback_id,
                    "heading": original.get("heading") or original.get("title") or f"Section {index + 1}",
                    "content": str(original.get("content", "")).strip(),
                    "notes": original.get("notes", []),
                }

            normalized_sections.append(mapped)

        applied = result.get("applied_recommendations", [])

        logger.info("SEO recommendations applied successfully")

        return {
            "success": True,
            "title": result.get("title", title),
            "sections": normalized_sections,
            "applied": applied,
        }

    def _build_prompt(
        self,
        *,
        title: str,
        sections: List[Dict[str, Any]],
        outline: List[Dict[str, Any]],
        research: Dict[str, Any],
        recommendations: List[Dict[str, Any]],
        persona: Dict[str, Any],
        tone: str | None,
        audience: str | None,
    ) -> str:
        """Construct prompt for applying recommendations."""

        sections_str = []
        for section in sections:
            sections_str.append(
                f"ID: {section.get('id', 'section')}, Heading: {section.get('heading', 'Untitled')}\n"
                f"Current Content:\n{section.get('content', '')}\n"
            )

        outline_str = "\n".join(
            [
                f"- {item.get('heading', 'Section')} (Target words: {item.get('target_words', 'N/A')})"
                for item in outline
            ]
        )

        research_summary = research.get("keyword_analysis", {}) if research else {}
        primary_keywords = ", ".join(research_summary.get("primary", [])[:10]) or "None"

        recommendations_str = []
        for rec in recommendations:
            recommendations_str.append(
                f"Category: {rec.get('category', 'General')} | Priority: {rec.get('priority', 'Medium')}\n"
                f"Recommendation: {rec.get('recommendation', '')}\n"
                f"Impact: {rec.get('impact', '')}\n"
            )

        persona_str = (
            f"Persona: {persona}\n"
            if persona
            else "Persona: (not provided)\n"
        )

        style_guidance = []
        if tone:
            style_guidance.append(f"Desired tone: {tone}")
        if audience:
            style_guidance.append(f"Target audience: {audience}")
        style_str = "\n".join(style_guidance) if style_guidance else "Maintain current tone and audience alignment."

        prompt = f"""
You are an expert SEO content strategist. Update the blog content to apply the actionable recommendations.

Current Title: {title}

Primary Keywords (for context): {primary_keywords}

Outline Overview:
{outline_str or 'No outline supplied'}

Existing Sections:
{''.join(sections_str)}

Actionable Recommendations to Apply:
{''.join(recommendations_str)}

{persona_str}
{style_str}

Instructions:
1. Carefully apply the recommendations while preserving factual accuracy and research alignment.
2. Keep section identifiers (IDs) unchanged so the frontend can map updates correctly.
3. Improve clarity, flow, and SEO optimization per the guidance.
4. Return updated sections in the requested JSON format.
5. Provide a short summary of which recommendations were addressed.
"""

        return prompt


__all__ = ["BlogSEORecommendationApplier"]
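
# Illustrative usage (a sketch only; the payload keys and return shape follow
# apply_recommendations above, while the concrete values and caller context are
# hypothetical):
#
#     applier = BlogSEORecommendationApplier()
#     result = await applier.apply_recommendations(
#         {
#             "title": "Example Post",
#             "sections": [{"id": "s1", "heading": "Intro", "content": "..."}],
#             "recommendations": [
#                 {"category": "Keywords", "priority": "High",
#                  "recommendation": "Work the primary keyword into the introduction."},
#             ],
#         },
#         user_id="<clerk-user-id>",
#     )
#     if result["success"]:
#         updated_sections = result["sections"]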