Made changes to Getting started with ALwrity and added a lot of details on API keys
181
lib/utils/website_analyzer/README.md
Normal file
@@ -0,0 +1,181 @@
# Website Analyzer Module

A comprehensive website analysis toolkit that provides detailed insights into website performance, SEO metrics, and content quality. This module combines traditional web analysis techniques with AI-powered content evaluation to deliver actionable recommendations.

## Features

### 1. Comprehensive Website Analysis
- Basic website information extraction
- SSL/TLS certificate validation
- DNS record analysis
- WHOIS information retrieval
- Content analysis and structure evaluation
- Performance metrics assessment
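
As an illustration of the certificate check, the sketch below turns a certificate's `notAfter` string into a number of days remaining. It is a minimal example, not the module's implementation: the helper name `days_until_expiry` and the sample date are assumptions, and only the standard-library `ssl.cert_time_to_seconds` call is relied on.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return whole days between `now` and a certificate's notAfter time.

    `not_after` uses the format found in getpeercert()["notAfter"],
    for example "Jan  1 00:00:00 2030 GMT".
    """
    expires_ts = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    expires = datetime.fromtimestamp(expires_ts, tz=timezone.utc)
    return (expires - now).days


# Hypothetical notAfter value; no network access is needed for this part.
remaining = days_until_expiry(
    "Jan  1 00:00:00 2030 GMT",
    now=datetime(2029, 12, 22, tzinfo=timezone.utc),
)
print(f"Certificate expires in {remaining} days")
```

A caller could flag certificates whose remaining lifetime drops below some threshold, such as 30 days.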

### 2. Advanced SEO Analysis
- Meta tag optimization analysis
- Content quality evaluation
- Keyword density analysis
- Readability scoring
- Heading structure analysis
- AI-powered content recommendations
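
Keyword density can be estimated without any AI call. The following is a rough standard-library sketch; the stop-word list, the function name `keyword_density`, and the rounding are illustrative assumptions rather than what `seo_analyzer.py` does (there, density values come from the model's response).

```python
import re
from collections import Counter
from typing import Dict

# Tiny illustrative stop-word list (an assumption, not the module's list).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def keyword_density(text: str, top_n: int = 5) -> Dict[str, float]:
    """Return the top_n non-stop-words with their share of counted words, in percent."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: round(100 * n / total, 2) for word, n in counts.most_common(top_n)}


sample = "SEO tools help writers. Good SEO tools save time for writers."
density = keyword_density(sample, top_n=3)
print(density)
```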

### 3. Technical Infrastructure
- Asynchronous web crawling
- Multi-threaded analysis
- Robust error handling
- Comprehensive logging
- Type-safe data models

## Module Structure

### 1. `analyzer.py`
The main analysis engine that provides comprehensive website analysis.

#### Key Components:
- `WebsiteAnalyzer` class
- URL validation
- Basic website information extraction
- SSL/TLS certificate checking
- DNS record analysis
- WHOIS information retrieval
- Content analysis
- Performance metrics assessment
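
URL validation in `analyzer.py` checks that a URL has both a scheme and a network location. The sketch below shows a slightly stricter variant that also restricts the scheme to HTTP(S); the function name `is_valid_url` and the scheme restriction are assumptions added for illustration.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    try:
        parts = urlparse(url)
    except ValueError:  # raised for some malformed inputs, e.g. bad IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ("https://example.com/page", "example.com", "ftp://example.com"):
    print(candidate, is_valid_url(candidate))
```

Rejecting a bare `example.com` early avoids passing an empty domain to the SSL, DNS, and WHOIS checks.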

#### Features:
- Concurrent analysis using ThreadPoolExecutor
- Robust error handling and logging
- User-agent simulation for reliable scraping
- Timeout handling for requests
- Comprehensive result formatting
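
For the checks to overlap in time, all tasks have to be submitted to the executor before any result is awaited. The sketch below illustrates that pattern with stand-in checks; `run_checks`, `slow_check`, and `failing_check` are hypothetical names, and the sleep stands in for a network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], dict]], max_workers: int = 4) -> Dict[str, dict]:
    """Run every check in a worker thread and collect the results by name."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything first so the checks run side by side.
        futures = {name: executor.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # one failing check should not abort the rest
                results[name] = {"error": str(exc)}
    return results


def slow_check() -> dict:
    time.sleep(0.2)  # stand-in for a network call
    return {"ok": True}


def failing_check() -> dict:
    raise RuntimeError("lookup failed")


report = run_checks({"ssl": slow_check, "dns": slow_check, "whois": failing_check})
print(report)
```

Calling `.result()` immediately after each `submit()` would instead wait for every task in turn, which makes the pool behave sequentially.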

### 2. `seo_analyzer.py`
Specialized SEO analysis module with AI integration.

#### Key Components:
- `extract_content()`: Fetches and parses webpage content
- `analyze_meta_tags()`: Evaluates meta tags and SEO elements
- `analyze_content_with_ai()`: AI-powered content analysis
- `analyze_seo()`: Main SEO analysis function

#### Features:
- Meta tag optimization analysis
- Content quality scoring
- Keyword density analysis
- Readability evaluation
- AI-powered recommendations
- Weighted scoring system
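
The weighted score in `seo_analyzer.py` combines a meta-tag score (five pass/fail checks worth 20 points each) with the readability and content-quality scores at weights of 30%, 30%, and 40%. The sketch below restates that arithmetic in isolation; the function names and the sample inputs are illustrative assumptions.

```python
def meta_score(title_ok: bool, description_ok: bool, keywords_ok: bool,
               has_robots: bool, has_sitemap: bool) -> int:
    """Five pass/fail checks worth 20 points each, giving a 0-100 score."""
    return 20 * sum([title_ok, description_ok, keywords_ok, has_robots, has_sitemap])


def overall_score(meta: float, readability: float, quality: float) -> float:
    """Weighted blend: 30% meta tags, 30% readability, 40% content quality."""
    return meta * 0.3 + readability * 0.3 + quality * 0.4


m = meta_score(True, True, False, True, False)  # 3 of 5 checks pass
total = overall_score(m, readability=70, quality=80)
print(m, total)
```

With these inputs the meta score is 60 and the blend is 18 + 21 + 32 = 71.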

### 3. `models.py`
Data models for structured analysis results.

#### Key Components:
- `SEORecommendation`: Individual SEO recommendations
- `MetaTagAnalysis`: Meta tag analysis results
- `ContentAnalysis`: Content analysis metrics
- `SEOAnalysisResult`: Complete analysis results

#### Features:
- Type-safe data structures
- Clear data organization
- Easy serialization/deserialization
- Comprehensive documentation
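
Dataclasses convert to plain dictionaries with `dataclasses.asdict`, which makes a JSON round trip straightforward once datetimes are handled. The sketch below uses simplified, hypothetical stand-ins (`Recommendation`, `Result`) rather than the real models, so the field set is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Recommendation:  # simplified, hypothetical stand-in for SEORecommendation
    priority: str
    issue: str


@dataclass
class Result:  # simplified, hypothetical stand-in for SEOAnalysisResult
    url: str
    analyzed_at: datetime
    overall_score: float
    recommendations: List[Recommendation] = field(default_factory=list)


def _default(obj):
    """Teach json.dumps how to encode datetimes."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def to_json(result: Result) -> str:
    """Serialize nested dataclasses; datetimes become ISO 8601 strings."""
    return json.dumps(asdict(result), default=_default)


def from_json(payload: str) -> Result:
    """Rebuild the dataclasses from the JSON produced by to_json()."""
    data = json.loads(payload)
    data["analyzed_at"] = datetime.fromisoformat(data["analyzed_at"])
    data["recommendations"] = [Recommendation(**r) for r in data["recommendations"]]
    return Result(**data)


original = Result("https://example.com", datetime(2025, 1, 1, 12, 0), 71.0,
                  [Recommendation("high", "Missing meta description")])
restored = from_json(to_json(original))
print(restored == original)
```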

## Usage Examples

### Basic Website Analysis
```python
from website_analyzer import analyze_website

# Analyze a website
results = analyze_website("https://example.com")

# Access analysis results
if results["success"]:
    data = results["data"]
    print(f"Domain: {data['domain']}")
    print(f"SSL Info: {data['analysis']['ssl_info']}")
    print(f"Content Info: {data['analysis']['content_info']}")
else:
    print(f"Analysis failed: {results['error']}")
```

### SEO Analysis
```python
from website_analyzer.seo_analyzer import analyze_seo

# Perform SEO analysis.
# analyze_seo() takes only the URL; the OpenAI API key is read from the
# OPENAI_API_KEY environment variable (a .env file is also loaded).
seo_results = analyze_seo("https://example.com")

# Access SEO results
if seo_results.success:
    print(f"Overall Score: {seo_results.overall_score}")
    print(f"Meta Tags: {seo_results.meta_tags}")
    print(f"Content Analysis: {seo_results.content}")
    print(f"Recommendations: {seo_results.recommendations}")
else:
    print(f"Errors: {seo_results.errors}")
```
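
Because the key comes from the environment, it can help to fail early with a clear message when it is missing. The helper below is a small sketch of that idea; `require_api_key`, `MissingAPIKeyError`, and `mask` are hypothetical names that are not part of the module.

```python
import os


class MissingAPIKeyError(RuntimeError):
    """Raised when the expected environment variable is unset or empty."""


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment or fail with a clear message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise MissingAPIKeyError(
            f"{var_name} is not set; export it or add it to a .env file"
        )
    return key


def mask(key: str) -> str:
    """Hide all but the last four characters, e.g. for log lines."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


print(mask("sk-test-1234"))
```

Masking the key before logging keeps the full secret out of log files.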

## Dependencies

- `requests`: HTTP requests
- `beautifulsoup4`: HTML parsing
- `python-whois`: WHOIS information
- `dnspython`: DNS record analysis
- `openai`: AI-powered analysis
- `python-dotenv`: Loading `OPENAI_API_KEY` from a `.env` file
- `loguru`: Logging
- `typing`: Type hints (standard library)
- `dataclasses`: Data models (standard library)

## Error Handling

The module handles the following error conditions:
- URL validation
- Request timeouts
- Connection errors
- Parsing errors
- API errors
- DNS resolution errors
- SSL/TLS errors

All errors are logged and returned in a structured format for easy handling.
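
One way to produce such a structured result is to wrap each operation so that exceptions are logged and converted into a `success`/`error` envelope like the one `analyze_website()` returns. The decorator below is a sketch under that assumption; `structured_errors`, `parse_port`, and the logger name are hypothetical.

```python
import functools
import logging
from typing import Any, Callable, Dict

log = logging.getLogger("website_analyzer.example")  # hypothetical logger name


def structured_errors(func: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap func so it returns a success/error envelope instead of raising."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            return {"success": True, "data": func(*args, **kwargs)}
        except Exception as exc:
            log.error("%s failed: %s", func.__name__, exc, exc_info=True)
            return {"success": False, "error": str(exc)}
    return wrapper


@structured_errors
def parse_port(value: str) -> int:
    port = int(value)  # raises ValueError for non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


good = parse_port("443")
bad = parse_port("https")
print(good, bad["success"])
```

Callers then only need to branch on `result["success"]` rather than catch library-specific exceptions.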

## Logging

The module uses `loguru` for logging with the following features:
- File rotation (500 MB)
- 10-day retention
- Debug level logging
- Structured log format
- Both file and stdout output
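
Note that `analyzer.py` in this commit configures the standard-library `logging` module with a stream handler and a file handler. For code on that path, a comparable setup can be sketched with `RotatingFileHandler`. This is a stand-in, not the `loguru` configuration described above: the function name `build_logger`, the logger name, and `backupCount=10` are assumptions, and the standard library rotates by size only, so time-based retention is not reproduced.

```python
import logging
import os
import sys
import tempfile
from logging.handlers import RotatingFileHandler

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def build_logger(name: str, log_path: str) -> logging.Logger:
    """Return a DEBUG logger that writes to stdout and a size-rotated file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger
    formatter = logging.Formatter(LOG_FORMAT)
    file_handler = RotatingFileHandler(
        log_path,
        maxBytes=500 * 1024 * 1024,  # rotate at roughly 500 MB
        backupCount=10,              # assumption: keep 10 old files (not 10 days)
        encoding="utf-8",
    )
    for handler in (logging.StreamHandler(sys.stdout), file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log_file = os.path.join(tempfile.mkdtemp(), "website_analyzer.log")
log = build_logger("website_analyzer.demo", log_file)
log.debug("logger configured")
```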

## Best Practices

1. **API Key Management**
   - Store API keys securely
   - Use environment variables
   - Implement rate limiting

2. **Error Handling**
   - Always check success status
   - Handle errors gracefully
   - Log errors appropriately

3. **Performance**
   - Use concurrent analysis
   - Implement timeouts
   - Cache results when possible

4. **Rate Limiting**
   - Respect website robots.txt
   - Implement delays between requests
   - Use appropriate user agents
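
The rate-limiting advice can be sketched with the standard-library `urllib.robotparser`. The example parses robots.txt text that has already been fetched, filters a list of URLs, and picks a delay; the user-agent token, the helper names, and the sample rules are assumptions for illustration.

```python
from typing import List, Tuple
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAnalyzerBot"  # hypothetical user-agent token


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser: RobotFileParser, urls: List[str],
                      default_delay: float = 1.0) -> Tuple[List[str], float]:
    """Return the URLs the rules allow and the delay to wait between requests."""
    allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    return allowed, float(delay)


rules = "User-agent: *\nDisallow: /admin/\nCrawl-delay: 2\n"
allowed, delay = polite_fetch_plan(
    build_parser(rules),
    ["https://example.com/blog/post", "https://example.com/admin/users"],
)
print(allowed, delay)
```

A crawler would then call `time.sleep(delay)` between requests to the allowed URLs.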

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

This module is part of the ALwrity project and is licensed under the MIT License.
7
lib/utils/website_analyzer/__init__.py
Normal file
@@ -0,0 +1,7 @@
"""Website analyzer module for AI-powered website analysis."""

from .analyzer import analyze_website
from .seo_analyzer import analyze_seo
from .models import SEOAnalysisResult

__all__ = ['analyze_seo', 'SEOAnalysisResult', 'analyze_website']
323
lib/utils/website_analyzer/analyzer.py
Normal file
@@ -0,0 +1,323 @@
"""Website scraping and AI analysis module."""

import asyncio
from typing import Dict, List, Optional
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import streamlit as st
import re
from ...web_crawlers.async_web_crawler import AsyncWebCrawlerService
from ...gpt_providers.text_generation.main_text_generation import llm_text_gen
import os
import sys
import logging
import json
from datetime import datetime
import requests
import ssl
import socket
import whois
import dns.resolver
from requests.exceptions import RequestException
from concurrent.futures import ThreadPoolExecutor

# Configure logging.
# This module logs through the standard-library logger; the calls below
# rely on its `exc_info=True` keyword.
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler('website_analyzer.log')
    ]
)
logger = logging.getLogger(__name__)

def analyze_website(url: str) -> Dict:
    """
    Analyze a website and return comprehensive results.

    Args:
        url (str): The URL to analyze

    Returns:
        Dict: {"success": True, "data": <analysis results>} on success,
        or {"success": False, "error": <message>} on failure
    """
    logger.info(f"Starting website analysis for URL: {url}")
    try:
        analyzer = WebsiteAnalyzer()
        results = analyzer.analyze_website(url)

        # Report analyzer-level errors as a failed result
        if "error" in results:
            return {
                "success": False,
                "error": results["error"]
            }

        # Add success status and wrap results
        return {
            "success": True,
            "data": results
        }
    except Exception as e:
        logger.error(f"Error in analyze_website: {str(e)}", exc_info=True)
        return {
            "success": False,
            "error": str(e)
        }

class WebsiteAnalyzer:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })
        logger.info("WebsiteAnalyzer initialized")

    def analyze_website(self, url: str) -> Dict:
        """
        Perform comprehensive analysis of a website.

        Args:
            url (str): The URL to analyze

        Returns:
            Dict: Analysis results including various metrics and checks
        """
        logger.info(f"Starting analysis for URL: {url}")
        try:
            # Validate URL
            if not self._validate_url(url):
                logger.error(f"Invalid URL format: {url}")
                return {"error": "Invalid URL format"}

            # Basic URL parsing
            parsed_url = urlparse(url)
            domain = parsed_url.netloc
            logger.debug(f"Parsed domain: {domain}")

            # Initialize results dictionary
            results = {
                "url": url,
                "domain": domain,
                "timestamp": datetime.now().isoformat(),
                "analysis": {}
            }

            # Perform the analyses concurrently: submit every task first,
            # then collect the results, so the workers actually overlap.
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = {
                    # Basic website info
                    "basic_info": executor.submit(self._get_basic_info, url),
                    # SSL/TLS info
                    "ssl_info": executor.submit(self._check_ssl, domain),
                    # DNS info
                    "dns_info": executor.submit(self._check_dns, domain),
                    # WHOIS info
                    "whois_info": executor.submit(self._get_whois_info, domain),
                    # Content analysis
                    "content_info": executor.submit(self._analyze_content, url),
                    # Performance metrics
                    "performance": executor.submit(self._check_performance, url),
                }
                for name, future in futures.items():
                    results["analysis"][name] = future.result()

            logger.info(f"Analysis completed successfully for {url}")
            return results

        except Exception as e:
            logger.error(f"Error during website analysis: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _validate_url(self, url: str) -> bool:
        """Validate URL format."""
        try:
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except Exception as e:
            logger.error(f"URL validation error: {str(e)}")
            return False

    def _get_basic_info(self, url: str) -> Dict:
        """Get basic website information."""
        logger.debug(f"Getting basic info for {url}")
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            return {
                "status_code": response.status_code,
                "content_type": response.headers.get('content-type', ''),
                # .string is None when <title> is empty or contains nested tags
                "title": (soup.title.string or '') if soup.title else '',
                "meta_description": self._get_meta_description(soup),
                "headers": dict(response.headers),
                "robots_txt": self._get_robots_txt(url),
                "sitemap": self._get_sitemap(url)
            }
        except Exception as e:
            logger.error(f"Error getting basic info: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _check_ssl(self, domain: str) -> Dict:
        """Check SSL/TLS certificate information."""
        logger.debug(f"Checking SSL for {domain}")
        try:
            context = ssl.create_default_context()
            # A timeout keeps an unreachable host from blocking the analysis
            with socket.create_connection((domain, 443), timeout=10) as sock:
                with context.wrap_socket(sock, server_hostname=domain) as ssock:
                    cert = ssock.getpeercert()
                    return {
                        "has_ssl": True,
                        "issuer": dict(x[0] for x in cert['issuer']),
                        "expiry": datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z').isoformat(),
                        "version": cert['version'],
                        "subject": dict(x[0] for x in cert['subject'])
                    }
        except Exception as e:
            logger.error(f"SSL check error: {str(e)}", exc_info=True)
            return {"has_ssl": False, "error": str(e)}

    def _check_dns(self, domain: str) -> Dict:
        """Check DNS records."""
        logger.debug(f"Checking DNS for {domain}")
        try:
            records = {}
            for record_type in ['A', 'AAAA', 'MX', 'NS', 'TXT']:
                try:
                    answers = dns.resolver.resolve(domain, record_type)
                    records[record_type] = [str(rdata) for rdata in answers]
                except dns.resolver.NoAnswer:
                    records[record_type] = []
                except Exception as e:
                    logger.warning(f"Error resolving {record_type} record: {str(e)}")
                    records[record_type] = []
            return records
        except Exception as e:
            logger.error(f"DNS check error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _get_whois_info(self, domain: str) -> Dict:
        """Get WHOIS information for a domain."""
        try:
            w = whois.whois(domain)

            def format_date(date_value):
                # WHOIS fields may hold a single value or a list of values
                if isinstance(date_value, list):
                    date_value = date_value[0] if date_value else None
                if not date_value:
                    return 'Unknown'
                # Fall back to str() for values that are not datetime objects
                return date_value.isoformat() if hasattr(date_value, 'isoformat') else str(date_value)

            return {
                'registrar': w.registrar if hasattr(w, 'registrar') else 'Unknown',
                'creation_date': format_date(w.creation_date),
                'expiration_date': format_date(w.expiration_date),
                'updated_date': format_date(w.updated_date) if hasattr(w, 'updated_date') else 'Unknown',
                'name_servers': w.name_servers if hasattr(w, 'name_servers') else [],
                'domain_name': w.domain_name if hasattr(w, 'domain_name') else domain,
                'text': w.text if hasattr(w, 'text') else ''
            }
        except Exception as e:
            logger.error(f"WHOIS check error: {str(e)}")
            return {
                'registrar': 'Unknown',
                'creation_date': 'Unknown',
                'expiration_date': 'Unknown',
                'updated_date': 'Unknown',
                'name_servers': [],
                'domain_name': domain,
                'text': ''
            }

    def _analyze_content(self, url: str) -> Dict:
        """Analyze website content."""
        logger.debug(f"Analyzing content for {url}")
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Get all text content
            text_content = soup.get_text()

            # Count words
            words = re.findall(r'\w+', text_content.lower())
            word_count = len(words)

            # Count headings
            headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

            # Count images
            images = soup.find_all('img')

            # Count links
            links = soup.find_all('a')

            return {
                "word_count": word_count,
                "heading_count": len(headings),
                "image_count": len(images),
                "link_count": len(links),
                "has_meta_description": bool(self._get_meta_description(soup)),
                "has_robots_txt": bool(self._get_robots_txt(url)),
                "has_sitemap": bool(self._get_sitemap(url))
            }
        except Exception as e:
            logger.error(f"Content analysis error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _check_performance(self, url: str) -> Dict:
        """Check website performance metrics."""
        logger.debug(f"Checking performance for {url}")
        try:
            start_time = datetime.now()
            response = self.session.get(url, timeout=10)
            end_time = datetime.now()

            load_time = (end_time - start_time).total_seconds()

            return {
                "load_time": load_time,
                "status_code": response.status_code,
                "content_length": len(response.content),
                "headers": dict(response.headers)
            }
        except Exception as e:
            logger.error(f"Performance check error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _get_meta_description(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract meta description from HTML."""
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        return meta_desc.get('content') if meta_desc else None

    def _get_robots_txt(self, url: str) -> Optional[str]:
        """Get robots.txt content."""
        try:
            # robots.txt lives at the site root, regardless of the page path
            robots_url = urljoin(url, '/robots.txt')
            response = self.session.get(robots_url, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            logger.warning(f"Error fetching robots.txt: {str(e)}")
        return None

    def _get_sitemap(self, url: str) -> Optional[str]:
        """Get sitemap.xml content."""
        try:
            # Look for the sitemap at the conventional site-root location
            sitemap_url = urljoin(url, '/sitemap.xml')
            response = self.session.get(sitemap_url, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            logger.warning(f"Error fetching sitemap.xml: {str(e)}")
        return None
45
lib/utils/website_analyzer/models.py
Normal file
@@ -0,0 +1,45 @@
"""Data models for website analysis results."""

from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime

@dataclass
class SEORecommendation:
    """A single SEO recommendation."""
    priority: str  # 'high', 'medium', 'low'
    category: str  # 'content', 'technical', 'meta', etc.
    issue: str
    recommendation: str
    impact: str

@dataclass
class MetaTagAnalysis:
    """Analysis of meta tags."""
    title: Dict[str, str]  # {'status': 'good', 'value': 'actual title', 'recommendation': 'suggestion'}
    description: Dict[str, str]
    keywords: Dict[str, str]
    has_robots: bool
    has_sitemap: bool

@dataclass
class ContentAnalysis:
    """Analysis of page content."""
    word_count: int
    headings_structure: Dict[str, int]  # {'h1': 1, 'h2': 3, etc}
    keyword_density: Dict[str, float]
    readability_score: float
    content_quality_score: float

@dataclass
class SEOAnalysisResult:
    """Complete SEO analysis result."""
    url: str
    analyzed_at: datetime
    overall_score: float  # 0-100
    meta_tags: Optional[MetaTagAnalysis]  # None when the analysis fails
    content: Optional[ContentAnalysis]  # None when the analysis fails
    recommendations: List[SEORecommendation]
    errors: List[str]
    warnings: List[str]
    success: bool
233
lib/utils/website_analyzer/seo_analyzer.py
Normal file
@@ -0,0 +1,233 @@
"""SEO analyzer module with AI integration."""

import json
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from typing import Dict, List, Tuple, Optional
from urllib.parse import urlparse
import openai
from loguru import logger
import os
from dotenv import load_dotenv
from .models import (
    SEOAnalysisResult,
    MetaTagAnalysis,
    ContentAnalysis,
    SEORecommendation
)

def extract_content(url: str) -> Tuple[Optional[str], Optional[BeautifulSoup], List[str]]:
    """Extract content from URL."""
    errors = []
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        return response.text, soup, errors
    except requests.RequestException as e:
        error_msg = f"Error fetching URL: {str(e)}"
        logger.error(error_msg)
        errors.append(error_msg)
        return None, None, errors

def analyze_meta_tags(soup: BeautifulSoup) -> MetaTagAnalysis:
    """Analyze meta tags using BeautifulSoup."""
    # Title analysis
    title = soup.title.string if soup.title else ""
    title_analysis = {
        'status': 'good' if title and 30 <= len(title) <= 60 else 'needs_improvement',
        'value': title,
        'recommendation': '' if title and 30 <= len(title) <= 60 else 'Title should be between 30-60 characters'
    }

    # Meta description analysis
    meta_desc = soup.find('meta', attrs={'name': 'description'})
    desc = meta_desc.get('content', '') if meta_desc else ""
    desc_analysis = {
        'status': 'good' if desc and 120 <= len(desc) <= 160 else 'needs_improvement',
        'value': desc,
        'recommendation': '' if desc and 120 <= len(desc) <= 160 else 'Description should be between 120-160 characters'
    }

    # Keywords analysis
    meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
    keywords = meta_keywords.get('content', '') if meta_keywords else ""
    keywords_analysis = {
        'status': 'good' if keywords else 'needs_improvement',
        'value': keywords,
        'recommendation': '' if keywords else 'Add relevant keywords meta tag'
    }

    return MetaTagAnalysis(
        title=title_analysis,
        description=desc_analysis,
        keywords=keywords_analysis,
        has_robots=bool(soup.find('meta', attrs={'name': 'robots'})),
        has_sitemap=bool(soup.find('link', attrs={'rel': 'sitemap'}))
    )

def analyze_content_with_ai(content: str) -> Tuple[ContentAnalysis, List[SEORecommendation]]:
    """Analyze content using AI."""
    try:
        # Load environment variables
        load_dotenv()

        # Get API key from environment
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            raise ValueError("OpenAI API key not found in environment variables")

        # Initialize OpenAI client
        client = openai.OpenAI(api_key=api_key)

        # Prepare prompt for content analysis.
        # The content is truncated to avoid token limits.
        prompt = f"""Analyze the following webpage content for SEO and provide a structured analysis:
        Content: {content[:4000]}...

        Provide analysis in the following format:
        1. Word count
        2. Heading structure analysis
        3. Keyword density for main topics
        4. Readability score (0-100)
        5. Content quality score (0-100)
        6. List of SEO recommendations with priority (high/medium/low), category, issue, recommendation, and impact

        Format the response as a JSON object with the keys: word_count, heading_structure, keyword_density, readability_score, content_quality_score, recommendations."""

        # Get AI analysis.
        # Note: JSON mode (response_format) is only accepted by models that
        # support it; the model name may need adjusting accordingly.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an SEO expert analyzing website content."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}
        )

        # Parse AI response: the message content is a JSON string
        analysis = json.loads(response.choices[0].message.content)

        # Create ContentAnalysis object
        content_analysis = ContentAnalysis(
            word_count=len(content.split()),
            headings_structure=analysis.get('heading_structure', {}),
            keyword_density=analysis.get('keyword_density', {}),
            readability_score=analysis.get('readability_score', 0),
            content_quality_score=analysis.get('content_quality_score', 0)
        )

        # Create recommendations
        recommendations = [
            SEORecommendation(
                priority=rec['priority'],
                category=rec['category'],
                issue=rec['issue'],
                recommendation=rec['recommendation'],
                impact=rec['impact']
            )
            for rec in analysis.get('recommendations', [])
        ]

        return content_analysis, recommendations

    except Exception as e:
        logger.error(f"Error in AI analysis: {str(e)}")
        return ContentAnalysis(
            word_count=len(content.split()),
            headings_structure={},
            keyword_density={},
            readability_score=0,
            content_quality_score=0
        ), []

def analyze_seo(url: str) -> SEOAnalysisResult:
    """Main function to analyze website SEO."""
    errors = []
    warnings = []

    # Validate URL
    try:
        parsed_url = urlparse(url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            # The handler below records the error, so it is reported once
            raise ValueError("Invalid URL format")
    except Exception as e:
        errors.append(f"URL parsing error: {str(e)}")
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )

    # Extract content
    content, soup, extract_errors = extract_content(url)
    errors.extend(extract_errors)

    if not content or not soup:
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )

    try:
        # Analyze meta tags
        meta_analysis = analyze_meta_tags(soup)

        # Analyze content with AI
        content_analysis, recommendations = analyze_content_with_ai(content)

        # Calculate overall score
        meta_score = sum([
            1 if meta_analysis.title['status'] == 'good' else 0,
            1 if meta_analysis.description['status'] == 'good' else 0,
            1 if meta_analysis.keywords['status'] == 'good' else 0,
            1 if meta_analysis.has_robots else 0,
            1 if meta_analysis.has_sitemap else 0
        ]) * 20  # Scale to 100

        overall_score = (
            meta_score * 0.3 +  # 30% weight for meta tags
            content_analysis.readability_score * 0.3 +  # 30% weight for readability
            content_analysis.content_quality_score * 0.4  # 40% weight for content quality
        )

        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=overall_score,
            meta_tags=meta_analysis,
            content=content_analysis,
            recommendations=recommendations,
            errors=errors,
            warnings=warnings,
            success=True
        )

    except Exception as e:
        error_msg = f"Error in SEO analysis: {str(e)}"
        logger.error(error_msg)
        errors.append(error_msg)
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )