Made changes to Getting started with ALwrity and added a lot of details on API keys

ajaysi
2025-04-01 13:11:40 +05:30
committed by ي
parent 367f9bac2c
commit 7d6ea91e6a
68 changed files with 8384 additions and 823 deletions


@@ -0,0 +1,181 @@
# Website Analyzer Module
A comprehensive website analysis toolkit that provides detailed insights into website performance, SEO metrics, and content quality. This module combines traditional web analysis techniques with AI-powered content evaluation to deliver actionable recommendations.
## Features
### 1. Comprehensive Website Analysis
- Basic website information extraction
- SSL/TLS certificate validation
- DNS record analysis
- WHOIS information retrieval
- Content analysis and structure evaluation
- Performance metrics assessment
### 2. Advanced SEO Analysis
- Meta tag optimization analysis
- Content quality evaluation
- Keyword density analysis
- Readability scoring
- Heading structure analysis
- AI-powered content recommendations
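In this module, keyword density is reported by the AI analysis step. For readers who want a feel for the metric itself, here is a minimal non-AI sketch; the function name `keyword_density` and the `top_n` parameter are illustrative assumptions, not part of the module.

```python
import re
from collections import Counter
from typing import Dict


def keyword_density(text: str, top_n: int = 5) -> Dict[str, float]:
    """Return the top_n most frequent words with their share of all words (percent)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    # Density = occurrences of the word / total number of words, as a percentage
    return {word: round(100 * count / len(words), 2)
            for word, count in counts.most_common(top_n)}
```

A real implementation would likely drop stop words before counting, otherwise words such as "the" dominate the result.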
### 3. Technical Infrastructure
- Asynchronous web crawling
- Multi-threaded analysis
- Robust error handling
- Comprehensive logging
- Type-safe data models
## Module Structure
### 1. `analyzer.py`
The main analysis engine that provides comprehensive website analysis.
#### Key Components:
- `WebsiteAnalyzer` class
  - URL validation
  - Basic website information extraction
  - SSL/TLS certificate checking
  - DNS record analysis
  - WHOIS information retrieval
  - Content analysis
  - Performance metrics assessment
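The URL validation step can be sketched as follows. It mirrors the scheme-and-host check in `WebsiteAnalyzer._validate_url`; the standalone name `is_valid_url` is an assumption made for illustration.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True when the URL has both a scheme and a network location."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)
```

Note that this check accepts any scheme (for example `ftp://host`), so a caller that only wants web pages may also want to test for `http` or `https`.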
#### Features:
- Concurrent analysis using ThreadPoolExecutor
- Robust error handling and logging
- User-agent simulation for reliable scraping
- Timeout handling for requests
- Comprehensive result formatting
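One way to get real concurrency out of `ThreadPoolExecutor` is to submit every check before waiting on any of them. The sketch below shows that pattern under stated assumptions: `run_checks` and its `checks` mapping are hypothetical names, and each check is assumed to be an independent zero-argument callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], dict]], max_workers: int = 4) -> Dict[str, dict]:
    """Run independent checks in parallel and isolate failures per check."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything first so the checks overlap in time
        futures = {name: executor.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing check should not discard the others
                results[name] = {"error": str(exc)}
    return results
```

Calling `.result()` immediately after each `submit()` would block until that task finishes, which makes the checks run one after another.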
### 2. `seo_analyzer.py`
Specialized SEO analysis module with AI integration.
#### Key Components:
- `extract_content()`: Fetches and parses webpage content
- `analyze_meta_tags()`: Evaluates meta tags and SEO elements
- `analyze_content_with_ai()`: AI-powered content analysis
- `analyze_seo()`: Main SEO analysis function
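`analyze_meta_tags()` grades the title and description by length (30-60 characters for the title, 120-160 for the description). A minimal sketch of that rule; the helper name `length_status` is an assumption and does not exist in the module.

```python
def length_status(value: str, low: int, high: int) -> str:
    """Return 'good' when value is non-empty and its length is within [low, high]."""
    return "good" if value and low <= len(value) <= high else "needs_improvement"


# Thresholds taken from analyze_meta_tags()
title_status = length_status("A reasonably descriptive page title", 30, 60)
description_status = length_status("Too short", 120, 160)
```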
#### Features:
- Meta tag optimization analysis
- Content quality scoring
- Keyword density analysis
- Readability evaluation
- AI-powered recommendations
- Weighted scoring system
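The weighted score combines three inputs, using the weights found in `analyze_seo()`: 30% meta tags, 30% readability, 40% content quality. The sketch below restates that arithmetic; the function name `overall_seo_score` is an assumption for illustration.

```python
def overall_seo_score(meta_checks_passed: int, readability: float, quality: float) -> float:
    """Combine the three component scores into a 0-100 overall score.

    meta_checks_passed counts how many of the five meta checks passed
    (title, description, keywords, robots, sitemap); each is worth 20 points.
    """
    meta_score = meta_checks_passed * 20
    return meta_score * 0.3 + readability * 0.3 + quality * 0.4


score = overall_seo_score(meta_checks_passed=3, readability=70, quality=80)
```

Because the AI step returns 0 for readability and quality when it fails, a page with perfect meta tags can score at most 30 in that case.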
### 3. `models.py`
Data models for structured analysis results.
#### Key Components:
- `SEORecommendation`: Individual SEO recommendations
- `MetaTagAnalysis`: Meta tag analysis results
- `ContentAnalysis`: Content analysis metrics
- `SEOAnalysisResult`: Complete analysis results
#### Features:
- Type-safe data structures
- Clear data organization
- Easy serialization/deserialization
- Comprehensive documentation
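Serialization of these dataclasses can go through `dataclasses.asdict`. The sketch below uses a stand-in class, `Recommendation`, that copies the fields of `SEORecommendation` so the example runs on its own; it is not the module's class.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Recommendation:  # hypothetical stand-in with the same fields as SEORecommendation
    priority: str
    category: str
    issue: str
    recommendation: str
    impact: str


rec = Recommendation("high", "meta", "Missing description",
                     "Add a meta description", "Better search snippets")
payload = json.dumps(asdict(rec))                  # dataclass -> dict -> JSON string
restored = Recommendation(**json.loads(payload))   # JSON string -> dict -> dataclass
```

`SEOAnalysisResult` carries a `datetime` in `analyzed_at`, which `json.dumps` cannot encode by default; passing `default=str` (or converting with `isoformat()` first) is one way around that.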
## Usage Examples
### Basic Website Analysis
```python
from website_analyzer import analyze_website

# Analyze a website
results = analyze_website("https://example.com")

# Access analysis results
if results["success"]:
    data = results["data"]
    print(f"Domain: {data['domain']}")
    print(f"SSL Info: {data['analysis']['ssl_info']}")
    print(f"Content Info: {data['analysis']['content_info']}")
else:
    print(f"Analysis failed: {results['error']}")
```
### SEO Analysis
```python
from website_analyzer.seo_analyzer import analyze_seo

# Perform SEO analysis; the OpenAI API key is read from the
# OPENAI_API_KEY environment variable (or a .env file), not passed as an argument
seo_results = analyze_seo("https://example.com")

# Access SEO results
if seo_results.success:
    print(f"Overall Score: {seo_results.overall_score}")
    print(f"Meta Tags: {seo_results.meta_tags}")
    print(f"Content Analysis: {seo_results.content}")
    print(f"Recommendations: {seo_results.recommendations}")
else:
    print(f"Errors: {seo_results.errors}")
```
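Since `analyze_content_with_ai()` looks up `OPENAI_API_KEY` in the environment, a caller may want to fail early when the key is missing rather than receive zeroed scores. A minimal sketch; `require_api_key` is a hypothetical helper, not part of the module.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`, or raise."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file")
    return key
```

Calling `require_api_key()` before `analyze_seo(...)` turns a silent fallback into an explicit error.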
## Dependencies
- `requests`: HTTP requests
- `beautifulsoup4`: HTML parsing
- `python-whois`: WHOIS information
- `dnspython`: DNS record analysis
- `openai`: AI-powered analysis
- `loguru`: Logging
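- `python-dotenv`: Loading environment variables from a `.env` file
- `typing`: Type hints (standard library)
- `dataclasses`: Data models (standard library)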
## Error Handling
The module implements comprehensive error handling:
- URL validation
- Request timeouts
- Connection errors
- Parsing errors
- API errors
- DNS resolution errors
- SSL/TLS errors
All errors are logged and returned in a structured format for easy handling.
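The structured format follows the `success` / `data` / `error` shape returned by `analyze_website()`. A minimal sketch of a wrapper that produces it; `safe_call` is a hypothetical name used only for illustration.

```python
from typing import Any, Callable, Dict


def safe_call(fn: Callable[[], Any]) -> Dict[str, Any]:
    """Run fn and wrap its outcome in a success/error envelope."""
    try:
        return {"success": True, "data": fn()}
    except Exception as exc:
        # Callers inspect "success" instead of catching exceptions themselves
        return {"success": False, "error": str(exc)}
```

Callers can then branch on `result["success"]` and read either `result["data"]` or `result["error"]`.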
## Logging
The module uses `loguru` for logging with the following features:
- File rotation (500 MB)
- 10-day retention
- Debug level logging
- Structured log format
- Both file and stdout output
## Best Practices
1. **API Key Management**
   - Store API keys securely
   - Use environment variables
   - Implement rate limiting
2. **Error Handling**
   - Always check success status
   - Handle errors gracefully
   - Log errors appropriately
3. **Performance**
   - Use concurrent analysis
   - Implement timeouts
   - Cache results when possible
4. **Rate Limiting**
   - Respect website robots.txt
   - Implement delays between requests
   - Use appropriate user agents
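Respecting `robots.txt` can be done with the standard library's `urllib.robotparser`. The sketch below checks a URL against rules that have already been downloaded; the function name `allowed_by_robots` and the `MyBot` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True when the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
can_crawl = allowed_by_robots(rules, "MyBot", "https://example.com/public")
```

Pairing this check with a short `time.sleep()` between requests covers the first two points of the rate-limiting list.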
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License
This module is part of the ALwrity project and is licensed under the MIT License.


@@ -0,0 +1,7 @@
"""Website analyzer module for AI-powered website analysis."""
from .analyzer import analyze_website
from .seo_analyzer import analyze_seo
from .models import SEOAnalysisResult
__all__ = ['analyze_seo', 'SEOAnalysisResult', 'analyze_website']


@@ -0,0 +1,323 @@
"""Website scraping and AI analysis module."""
import asyncio
from typing import Dict, List, Optional
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import streamlit as st
import re
from loguru import logger
from ...web_crawlers.async_web_crawler import AsyncWebCrawlerService
from ...gpt_providers.text_generation.main_text_generation import llm_text_gen
import os
import sys
import logging
import json
from datetime import datetime
import requests
import ssl
import socket
import whois
import dns.resolver
from requests.exceptions import RequestException
from concurrent.futures import ThreadPoolExecutor
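# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler('website_analyzer.log')
    ]
)
# NOTE: this rebinds `logger`, so the loguru logger imported above is shadowed
# and the rest of this module logs through the standard library.
logger = logging.getLogger(__name__)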


def analyze_website(url: str) -> Dict:
    """
    Analyze a website and return comprehensive results.

    Args:
        url (str): The URL to analyze

    Returns:
        Dict: Analysis results including various metrics and checks
    """
    logger.info(f"Starting website analysis for URL: {url}")
    try:
        analyzer = WebsiteAnalyzer()
        results = analyzer.analyze_website(url)

        # Report failure when the analyzer returned an error
        if "error" in results:
            return {
                "success": False,
                "error": results["error"]
            }

        # Add success status and wrap results
        return {
            "success": True,
            "data": results
        }
    except Exception as e:
        logger.error(f"Error in analyze_website: {str(e)}", exc_info=True)
        return {
            "success": False,
            "error": str(e)
        }


class WebsiteAnalyzer:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })
        logger.info("WebsiteAnalyzer initialized")

    def analyze_website(self, url: str) -> Dict:
        """
        Perform comprehensive analysis of a website.

        Args:
            url (str): The URL to analyze

        Returns:
            Dict: Analysis results including various metrics and checks
        """
        logger.info(f"Starting analysis for URL: {url}")
        try:
            # Validate URL
            if not self._validate_url(url):
                logger.error(f"Invalid URL format: {url}")
                return {"error": "Invalid URL format"}

            # Basic URL parsing
            parsed_url = urlparse(url)
            domain = parsed_url.netloc
            logger.debug(f"Parsed domain: {domain}")

            # Initialize results dictionary
            results = {
                "url": url,
                "domain": domain,
                "timestamp": datetime.now().isoformat(),
                "analysis": {}
            }

            # Submit every check before collecting any result so they run
            # concurrently; calling .result() right after each submit() would
            # serialize them. Note that the shared requests.Session is then
            # used from several worker threads at once.
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = {
                    "basic_info": executor.submit(self._get_basic_info, url),
                    "ssl_info": executor.submit(self._check_ssl, domain),
                    "dns_info": executor.submit(self._check_dns, domain),
                    "whois_info": executor.submit(self._get_whois_info, domain),
                    "content_info": executor.submit(self._analyze_content, url),
                    "performance": executor.submit(self._check_performance, url),
                }
                for name, future in futures.items():
                    results["analysis"][name] = future.result()

            logger.info(f"Analysis completed successfully for {url}")
            return results
        except Exception as e:
            logger.error(f"Error during website analysis: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _validate_url(self, url: str) -> bool:
        """Validate URL format."""
        try:
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except Exception as e:
            logger.error(f"URL validation error: {str(e)}")
            return False

    def _get_basic_info(self, url: str) -> Dict:
        """Get basic website information."""
        logger.debug(f"Getting basic info for {url}")
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            return {
                "status_code": response.status_code,
                "content_type": response.headers.get('content-type', ''),
                "title": soup.title.string if soup.title else '',
                "meta_description": self._get_meta_description(soup),
                "headers": dict(response.headers),
                "robots_txt": self._get_robots_txt(url),
                "sitemap": self._get_sitemap(url)
            }
        except Exception as e:
            logger.error(f"Error getting basic info: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _check_ssl(self, domain: str) -> Dict:
        """Check SSL/TLS certificate information."""
        logger.debug(f"Checking SSL for {domain}")
        try:
            context = ssl.create_default_context()
            # A timeout keeps an unresponsive host from blocking the whole analysis
            with socket.create_connection((domain, 443), timeout=10) as sock:
                with context.wrap_socket(sock, server_hostname=domain) as ssock:
                    cert = ssock.getpeercert()
                    return {
                        "has_ssl": True,
                        "issuer": dict(x[0] for x in cert['issuer']),
                        "expiry": datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z').isoformat(),
                        "version": cert['version'],
                        "subject": dict(x[0] for x in cert['subject'])
                    }
        except Exception as e:
            logger.error(f"SSL check error: {str(e)}", exc_info=True)
            return {"has_ssl": False, "error": str(e)}

    def _check_dns(self, domain: str) -> Dict:
        """Check DNS records."""
        logger.debug(f"Checking DNS for {domain}")
        try:
            records = {}
            for record_type in ['A', 'AAAA', 'MX', 'NS', 'TXT']:
                try:
                    answers = dns.resolver.resolve(domain, record_type)
                    records[record_type] = [str(rdata) for rdata in answers]
                except dns.resolver.NoAnswer:
                    records[record_type] = []
                except Exception as e:
                    logger.warning(f"Error resolving {record_type} record: {str(e)}")
                    records[record_type] = []
            return records
        except Exception as e:
            logger.error(f"DNS check error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _get_whois_info(self, domain: str) -> Dict:
        """Get WHOIS information for a domain."""
        try:
            w = whois.whois(domain)

            def format_date(date_value):
                # WHOIS libraries may return a single date or a list of dates
                if isinstance(date_value, list):
                    return date_value[0].isoformat() if date_value else 'Unknown'
                return date_value.isoformat() if date_value else 'Unknown'

            return {
                'registrar': w.registrar if hasattr(w, 'registrar') else 'Unknown',
                'creation_date': format_date(w.creation_date),
                'expiration_date': format_date(w.expiration_date),
                'updated_date': format_date(w.updated_date) if hasattr(w, 'updated_date') else 'Unknown',
                'name_servers': w.name_servers if hasattr(w, 'name_servers') else [],
                'domain_name': w.domain_name if hasattr(w, 'domain_name') else domain,
                'text': w.text if hasattr(w, 'text') else ''
            }
        except Exception as e:
            logger.error(f"WHOIS check error: {str(e)}")
            return {
                'registrar': 'Unknown',
                'creation_date': 'Unknown',
                'expiration_date': 'Unknown',
                'updated_date': 'Unknown',
                'name_servers': [],
                'domain_name': domain,
                'text': ''
            }

    def _analyze_content(self, url: str) -> Dict:
        """Analyze website content."""
        logger.debug(f"Analyzing content for {url}")
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Get all text content
            text_content = soup.get_text()

            # Count words
            words = re.findall(r'\w+', text_content.lower())
            word_count = len(words)

            # Count headings
            headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

            # Count images
            images = soup.find_all('img')

            # Count links
            links = soup.find_all('a')

            return {
                "word_count": word_count,
                "heading_count": len(headings),
                "image_count": len(images),
                "link_count": len(links),
                "has_meta_description": bool(self._get_meta_description(soup)),
                "has_robots_txt": bool(self._get_robots_txt(url)),
                "has_sitemap": bool(self._get_sitemap(url))
            }
        except Exception as e:
            logger.error(f"Content analysis error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _check_performance(self, url: str) -> Dict:
        """Check website performance metrics."""
        logger.debug(f"Checking performance for {url}")
        try:
            start_time = datetime.now()
            response = self.session.get(url, timeout=10)
            end_time = datetime.now()
            load_time = (end_time - start_time).total_seconds()
            return {
                "load_time": load_time,
                "status_code": response.status_code,
                "content_length": len(response.content),
                "headers": dict(response.headers)
            }
        except Exception as e:
            logger.error(f"Performance check error: {str(e)}", exc_info=True)
            return {"error": str(e)}

    def _get_meta_description(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract meta description from HTML."""
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        return meta_desc.get('content') if meta_desc else None

    def _get_robots_txt(self, url: str) -> Optional[str]:
        """Get robots.txt content."""
        try:
            # robots.txt lives at the site root, so resolve it against the host
            # rather than appending it to whatever path the URL carries
            robots_url = urljoin(url, '/robots.txt')
            response = self.session.get(robots_url, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            logger.warning(f"Error fetching robots.txt: {str(e)}")
        return None

    def _get_sitemap(self, url: str) -> Optional[str]:
        """Get sitemap.xml content."""
        try:
            # Look for the sitemap at the conventional root location
            sitemap_url = urljoin(url, '/sitemap.xml')
            response = self.session.get(sitemap_url, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            logger.warning(f"Error fetching sitemap.xml: {str(e)}")
        return None


@@ -0,0 +1,45 @@
"""Data models for website analysis results."""
from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime


@dataclass
class SEORecommendation:
    """A single SEO recommendation."""
    priority: str  # 'high', 'medium', 'low'
    category: str  # 'content', 'technical', 'meta', etc.
    issue: str
    recommendation: str
    impact: str


@dataclass
class MetaTagAnalysis:
    """Analysis of meta tags."""
    title: Dict[str, str]  # {'status': 'good', 'value': 'actual title', 'recommendation': 'suggestion'}
    description: Dict[str, str]
    keywords: Dict[str, str]
    has_robots: bool
    has_sitemap: bool


@dataclass
class ContentAnalysis:
    """Analysis of page content."""
    word_count: int
    headings_structure: Dict[str, int]  # {'h1': 1, 'h2': 3, etc}
    keyword_density: Dict[str, float]
    readability_score: float
    content_quality_score: float


@dataclass
class SEOAnalysisResult:
    """Complete SEO analysis result."""
    url: str
    analyzed_at: datetime
    overall_score: float  # 0-100
    meta_tags: Optional[MetaTagAnalysis]  # None when the analysis fails
    content: Optional[ContentAnalysis]  # None when the analysis fails
    recommendations: List[SEORecommendation]
    errors: List[str]
    warnings: List[str]
    success: bool


@@ -0,0 +1,233 @@
"""SEO analyzer module with AI integration."""
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from typing import Dict, List, Tuple, Optional
from urllib.parse import urlparse
import openai
from loguru import logger
import os
from dotenv import load_dotenv
from .models import (
    SEOAnalysisResult,
    MetaTagAnalysis,
    ContentAnalysis,
    SEORecommendation
)


def extract_content(url: str) -> Tuple[Optional[str], Optional[BeautifulSoup], List[str]]:
    """Extract content from URL."""
    errors = []
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        return response.text, soup, errors
    except requests.RequestException as e:
        error_msg = f"Error fetching URL: {str(e)}"
        logger.error(error_msg)
        errors.append(error_msg)
        return None, None, errors


def analyze_meta_tags(soup: BeautifulSoup) -> MetaTagAnalysis:
    """Analyze meta tags using BeautifulSoup."""
    # Title analysis
    title = soup.title.string if soup.title else ""
    title_analysis = {
        'status': 'good' if title and 30 <= len(title) <= 60 else 'needs_improvement',
        'value': title,
        'recommendation': '' if title and 30 <= len(title) <= 60 else 'Title should be between 30-60 characters'
    }

    # Meta description analysis
    meta_desc = soup.find('meta', attrs={'name': 'description'})
    desc = meta_desc.get('content', '') if meta_desc else ""
    desc_analysis = {
        'status': 'good' if desc and 120 <= len(desc) <= 160 else 'needs_improvement',
        'value': desc,
        'recommendation': '' if desc and 120 <= len(desc) <= 160 else 'Description should be between 120-160 characters'
    }

    # Keywords analysis
    meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
    keywords = meta_keywords.get('content', '') if meta_keywords else ""
    keywords_analysis = {
        'status': 'good' if keywords else 'needs_improvement',
        'value': keywords,
        'recommendation': '' if keywords else 'Add relevant keywords meta tag'
    }

    return MetaTagAnalysis(
        title=title_analysis,
        description=desc_analysis,
        keywords=keywords_analysis,
        has_robots=bool(soup.find('meta', attrs={'name': 'robots'})),
        has_sitemap=bool(soup.find('link', attrs={'rel': 'sitemap'}))
    )


def analyze_content_with_ai(content: str) -> Tuple[ContentAnalysis, List[SEORecommendation]]:
    """Analyze content using AI."""
    import json  # local import: needed to decode the model's JSON reply

    try:
        # Load environment variables
        load_dotenv()

        # Get API key from environment
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            raise ValueError("OpenAI API key not found in environment variables")

        # Initialize OpenAI client
        client = openai.OpenAI(api_key=api_key)

        # Prepare prompt for content analysis; the content is truncated to avoid
        # token limits, and the keys named here are the ones read back below
        prompt = f"""Analyze the following webpage content for SEO and provide a structured analysis:

Content: {content[:4000]}...

Respond with a JSON object containing these keys:
- "word_count": integer
- "heading_structure": object mapping heading tags to counts
- "keyword_density": object mapping main topics to density values
- "readability_score": number from 0 to 100
- "content_quality_score": number from 0 to 100
- "recommendations": list of objects with "priority" (high/medium/low), "category", "issue", "recommendation", and "impact"
"""

        # Get AI analysis
        # NOTE: JSON mode (`response_format`) only works with models that support it;
        # confirm that the configured model does.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an SEO expert analyzing website content."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}
        )

        # Parse AI response: the message content is a JSON string, not a dict
        analysis = json.loads(response.choices[0].message.content)

        # Create ContentAnalysis object
        content_analysis = ContentAnalysis(
            word_count=len(content.split()),
            headings_structure=analysis.get('heading_structure', {}),
            keyword_density=analysis.get('keyword_density', {}),
            readability_score=analysis.get('readability_score', 0),
            content_quality_score=analysis.get('content_quality_score', 0)
        )

        # Create recommendations
        recommendations = [
            SEORecommendation(
                priority=rec['priority'],
                category=rec['category'],
                issue=rec['issue'],
                recommendation=rec['recommendation'],
                impact=rec['impact']
            )
            for rec in analysis.get('recommendations', [])
        ]
        return content_analysis, recommendations
    except Exception as e:
        logger.error(f"Error in AI analysis: {str(e)}")
        # Fall back to zeroed scores so the caller still gets a usable object
        return ContentAnalysis(
            word_count=len(content.split()),
            headings_structure={},
            keyword_density={},
            readability_score=0,
            content_quality_score=0
        ), []


def analyze_seo(url: str) -> SEOAnalysisResult:
    """Main function to analyze website SEO."""
    errors = []
    warnings = []

    # Validate URL (the error is recorded once, in the except branch)
    try:
        parsed_url = urlparse(url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            raise ValueError("Invalid URL format")
    except Exception as e:
        errors.append(f"URL parsing error: {str(e)}")
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )

    # Extract content
    content, soup, extract_errors = extract_content(url)
    errors.extend(extract_errors)
    if not content or not soup:
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )

    try:
        # Analyze meta tags
        meta_analysis = analyze_meta_tags(soup)

        # Analyze content with AI
        content_analysis, recommendations = analyze_content_with_ai(content)

        # Calculate overall score: five meta checks worth 20 points each
        meta_score = sum([
            1 if meta_analysis.title['status'] == 'good' else 0,
            1 if meta_analysis.description['status'] == 'good' else 0,
            1 if meta_analysis.keywords['status'] == 'good' else 0,
            1 if meta_analysis.has_robots else 0,
            1 if meta_analysis.has_sitemap else 0
        ]) * 20  # Scale to 100

        overall_score = (
            meta_score * 0.3 +  # 30% weight for meta tags
            content_analysis.readability_score * 0.3 +  # 30% weight for readability
            content_analysis.content_quality_score * 0.4  # 40% weight for content quality
        )

        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=overall_score,
            meta_tags=meta_analysis,
            content=content_analysis,
            recommendations=recommendations,
            errors=errors,
            warnings=warnings,
            success=True
        )
    except Exception as e:
        error_msg = f"Error in SEO analysis: {str(e)}"
        logger.error(error_msg)
        errors.append(error_msg)
        return SEOAnalysisResult(
            url=url,
            analyzed_at=datetime.now(),
            overall_score=0,
            meta_tags=None,
            content=None,
            recommendations=[],
            errors=errors,
            warnings=warnings,
            success=False
        )