Updated "Getting started with ALwrity" and added a lot of detail on API keys

ajaysi
2025-04-01 13:11:40 +05:30
parent 367f9bac2c
commit 6c833e2773
68 changed files with 8384 additions and 823 deletions

lib/web_crawlers/README.md Normal file

@@ -0,0 +1,151 @@
# Web Crawler Guide for Content Creators
## What is a Web Crawler?
A web crawler is a powerful tool that helps content creators gather, analyze, and understand content from websites. It's like having a digital assistant that can quickly scan websites and extract valuable information to help you create better content.
## Key Features
### 1. Content Extraction
- **Main Content**: Extracts the primary content from web pages
- **Meta Information**: Captures titles, descriptions, and meta tags
- **Structure Analysis**: Identifies headings and content hierarchy
- **Media Elements**: Collects links and images with their descriptions
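
As a rough illustration of the extraction step, the sketch below collects a page's title, headings, and links. It is not the crawler's own code: the module in this commit uses BeautifulSoup, whereas this version uses only the standard library so that it runs without extra packages. The names `PageOutline` and `outline` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageOutline(HTMLParser):
    """Collects the title, headings, and links of one HTML page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.headings = []
        self.links = []
        self._current = None  # title/heading tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.title = text
            elif text:
                self.headings.append(text)
            self._current = None


def outline(html: str, base_url: str) -> dict:
    """Parse an HTML string and return its title, headings, and links."""
    parser = PageOutline(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "headings": parser.headings, "links": parser.links}
```

Calling `outline(html, "https://example.com/blog/")` returns a dictionary with the same kind of fields the feature list describes, with relative links resolved to absolute URLs.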
### 2. AI-Powered Analysis
- **Topic Identification**: Automatically identifies main topics
- **Content Quality Assessment**: Evaluates readability and engagement
- **SEO Analysis**: Provides SEO scores and recommendations
- **Content Gap Analysis**: Identifies missing information
- **Opportunity Detection**: Suggests areas for improvement
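
The analysis is requested from a language model as a JSON object with keys such as `topics`, `seo_score`, and `content_gaps`. Model replies often wrap that JSON in extra text, so the reply has to be trimmed before parsing. The following is a minimal sketch of that idea, assuming the reply contains at most one JSON object; `parse_analysis` is a hypothetical helper, not a function from the crawler module.

```python
import json


def parse_analysis(raw: str) -> dict:
    """Return the JSON object embedded in a model reply, or {} if none parses."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is a usable analysis; lists or scalars are rejected.
    return parsed if isinstance(parsed, dict) else {}
```

Returning an empty dictionary on failure lets calling code treat "no analysis" and "unparseable analysis" the same way.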
### 3. Smart Processing
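- **Fast Performance**: Fetches pages with asynchronous HTTP requests (`aiohttp`), so the crawler does not block while waiting on the network
- **Error Handling**: Returns an error result instead of failing outright when a page cannot be fetched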
- **Content Cleaning**: Removes unnecessary elements for clean analysis
- **Multiple Page Support**: Can analyze multiple pages efficiently
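
Fast, multi-page processing of this kind usually comes from running several requests concurrently while capping how many are in flight. The sketch below shows one way to do that with `asyncio`. It is an assumption about how batching could work, not the crawler's implementation: `crawl_many` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP request so the example runs offline.

```python
import asyncio


async def crawl_many(urls, fetch, max_concurrent: int = 3) -> dict:
    """Run fetch(url) for every URL, at most max_concurrent at a time."""
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with limit:
            try:
                return url, {"success": True, "content": await fetch(url)}
            except Exception as exc:  # one bad page should not stop the batch
                return url, {"success": False, "error": str(exc)}

    pairs = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical, for illustration only)."""
    await asyncio.sleep(0)
    if "missing" in url:
        raise RuntimeError("Status code 404")
    return f"<html>{url}</html>"
```

Each URL gets its own success or error entry, so a single unreachable page does not discard the results for the others.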
## Use Cases for Content Creators
### 1. Content Research
- **Competitor Analysis**: Study competitor content and strategies
- **Topic Research**: Gather information for new content ideas
- **Industry Trends**: Track industry developments and updates
- **Content Inspiration**: Find inspiration from successful content
### 2. Content Optimization
- **SEO Improvement**: Identify SEO opportunities
- **Content Structure**: Analyze and improve content organization
- **Readability Enhancement**: Get suggestions for better readability
- **Engagement Optimization**: Improve content engagement
### 3. Content Strategy
- **Gap Analysis**: Identify content gaps in your niche
- **Topic Planning**: Plan content topics and themes
- **Audience Understanding**: Better understand target audience needs
- **Performance Tracking**: Monitor content performance
## How to Use the Web Crawler
### 1. Basic Usage
1. **Enter URL**: Provide the website URL you want to analyze
2. **Start Crawling**: The crawler will automatically extract content
3. **Review Results**: Get comprehensive analysis of the content
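
The first and last of these steps can be sketched in a few lines. The result dictionary below follows the shape returned by `crawl_website` in this commit (`success`, `content`, `error`); the helpers `is_crawlable_url` and `summarize_result` are hypothetical and only illustrate how the input might be checked and the output reviewed.

```python
from urllib.parse import urlparse


def is_crawlable_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def summarize_result(result: dict) -> str:
    """Turn a crawl result dictionary into a one-line summary for review."""
    if not result.get("success"):
        return f"Crawl failed: {result.get('error', 'unknown error')}"
    content = result.get("content", {})
    return (
        f"{content.get('title') or '(no title)'}: "
        f"{len(content.get('headings', []))} headings, "
        f"{len(content.get('links', []))} links, "
        f"{len(content.get('main_content', '').split())} words"
    )
```

Checking the URL before crawling avoids a network round trip for input such as `example.com/post`, which lacks a scheme.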
### 2. Advanced Features
- **Custom Analysis**: Set specific parameters for content analysis
- **Batch Processing**: Analyze multiple pages at once
- **Detailed Reports**: Get in-depth content analysis reports
- **Export Options**: Export results in various formats
### 3. Analysis Options
- **Content Quality**: Evaluate writing style and structure
- **SEO Metrics**: Check SEO performance
- **Engagement Factors**: Analyze reader engagement potential
- **Improvement Suggestions**: Get actionable recommendations
## Benefits for Content Creators
### 1. Time Savings
- Quick content research
- Automated analysis
- Efficient data gathering
- Streamlined workflow
### 2. Quality Improvement
- Better content structure
- Enhanced readability
- Improved SEO performance
- Higher engagement potential
### 3. Strategic Advantage
- Data-driven decisions
- Competitive insights
- Content optimization
- Performance tracking
## Best Practices
### 1. Before Crawling
- Identify clear objectives
- Select relevant websites
- Set analysis parameters
- Prepare for data collection
### 2. During Analysis
- Review extracted content
- Validate information
- Check for accuracy
- Note important insights
### 3. After Analysis
- Apply findings to content
- Track improvements
- Update content strategy
- Monitor results
## Common Applications
### 1. Blog Content
- Topic research
- Content structure analysis
- SEO optimization
- Engagement improvement
### 2. Article Writing
- Research gathering
- Fact verification
- Source analysis
- Content enhancement
### 3. Website Content
- Page optimization
- Content audit
- Structure improvement
- SEO enhancement
### 4. Social Media Content
- Trend analysis
- Content ideas
- Engagement optimization
- Performance tracking
## Tips for Optimal Results
1. **Be Specific**: Clearly define your analysis goals
2. **Choose Quality Sources**: Select reliable websites for analysis
3. **Review Results**: Always verify extracted information
4. **Apply Insights**: Use findings to improve your content
5. **Track Progress**: Monitor improvements over time
## Need Help with ALwrity?
If you encounter any issues or need assistance:
1. Check the documentation
2. Review error messages
3. Verify website accessibility
4. Contact support if needed
---
*Note: This tool is designed to help content creators gather and analyze web content efficiently. Always respect website terms of service and robots.txt files when crawling websites.*
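
One way to honor robots.txt is to check each URL against the site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser`. The helper name `allowed_by_robots` and the `ExampleBot` user agent are assumptions for illustration; the crawler code in this commit does not perform this check itself.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "ExampleBot", "https://example.com/blog/post"))
print(allowed_by_robots(rules, "ExampleBot", "https://example.com/private/page"))
```

In practice the rules text would be downloaded from the site's `/robots.txt` once and reused for every URL on that host.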


@@ -0,0 +1,246 @@
"""Web crawler module using aiohttp and BeautifulSoup."""
from typing import Dict, List
import json
from loguru import logger
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from pydantic import BaseModel, Field
from ..gpt_providers.text_generation.main_text_generation import llm_text_gen


class WebsiteContent(BaseModel):
    """Model for website content analysis."""
    title: str = Field("", description="Title of the webpage")
    description: str = Field("", description="Meta description of the webpage")
    main_content: str = Field("", description="Main content of the webpage")
    headings: List[str] = Field([], description="All headings on the page")
    links: List[Dict[str, str]] = Field([], description="All links on the page")
    images: List[Dict[str, str]] = Field([], description="All images on the page")
    meta_tags: Dict[str, str] = Field({}, description="Meta tags from the page")


class AsyncWebCrawlerService:
    """Service for crawling websites."""

    def __init__(self):
        """Initialize the crawler service."""
        logger.info("[AsyncWebCrawlerService.__init__] Initializing crawler service")
        self.visited_urls = set()
        self.base_url = None
        self.domain = None
        self.session = None
        self.max_pages = 10  # Limit the number of pages to crawl
        self.timeout = 30  # Timeout in seconds for requests
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    async def __aenter__(self):
        """Create aiohttp session when entering context."""
        logger.debug("[AsyncWebCrawlerService.__aenter__] Creating aiohttp session")
        self.session = aiohttp.ClientSession(headers=self.headers)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Close aiohttp session when exiting context."""
        logger.debug("[AsyncWebCrawlerService.__aexit__] Closing aiohttp session")
        if self.session:
            await self.session.close()
            # Reset so fetch_url creates a fresh session instead of reusing a closed one
            self.session = None

    async def fetch_url(self, url: str) -> str:
        """
        Fetch URL content asynchronously.

        Args:
            url (str): URL to fetch

        Returns:
            str: HTML content

        Raises:
            Exception: If the response status is not 200
        """
        logger.debug(f"[AsyncWebCrawlerService.fetch_url] Fetching URL: {url}")
        if not self.session:
            logger.debug("[AsyncWebCrawlerService.fetch_url] Creating new session")
            self.session = aiohttp.ClientSession(headers=self.headers)

        # Apply the configured timeout (in seconds) to the whole request
        timeout = aiohttp.ClientTimeout(total=self.timeout)
        async with self.session.get(url, timeout=timeout) as response:
            if response.status == 200:
                logger.debug(f"[AsyncWebCrawlerService.fetch_url] Successfully fetched URL: {url}")
                return await response.text()
            error_msg = f"Failed to fetch URL: Status code {response.status}"
            logger.error(f"[AsyncWebCrawlerService.fetch_url] {error_msg}")
            raise Exception(error_msg)

    async def crawl_website(self, url: str) -> Dict:
        """
        Crawl a website and extract its content.

        Args:
            url (str): The URL to crawl

        Returns:
            Dict: Extracted website content and metadata
        """
        try:
            logger.info(f"[AsyncWebCrawlerService.crawl_website] Starting crawl for URL: {url}")

            # Fetch the page content
            try:
                html_content = await self.fetch_url(url)
                logger.debug("[AsyncWebCrawlerService.crawl_website] Successfully fetched HTML content")
            except Exception as e:
                error_msg = f"Failed to fetch content from {url}: {str(e)}"
                logger.error(f"[AsyncWebCrawlerService.crawl_website] {error_msg}")
                return {
                    'success': False,
                    'error': error_msg
                }

            # Parse HTML with BeautifulSoup
            logger.debug("[AsyncWebCrawlerService.crawl_website] Parsing HTML content")
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract main content (focusing on article-like content)
            main_content_elements = soup.find_all(['article', 'main', 'div'], class_=['content', 'main-content', 'article', 'post'])
            if not main_content_elements:
                main_content_elements = soup.find_all(['p', 'article', 'section'])
            main_content = ' '.join([elem.get_text(strip=True) for elem in main_content_elements])

            # If still no content, get all paragraph text
            if not main_content:
                main_content = ' '.join([p.get_text(strip=True) for p in soup.find_all('p')])
            logger.debug(f"[AsyncWebCrawlerService.crawl_website] Extracted {len(main_content)} characters of main content")

            # soup.title.string is None when <title> is empty or has nested tags
            title = soup.title.string.strip() if soup.title and soup.title.string else ''
            description_tag = soup.find('meta', {'name': 'description'})
            description = description_tag.get('content', '').strip() if description_tag else ''

            # Extract content
            content = {
                'title': title,
                'description': description,
                'main_content': main_content,
                'headings': [h.get_text(strip=True) for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])],
                'links': [{'text': a.get_text(strip=True), 'href': urljoin(url, a.get('href', ''))} for a in soup.find_all('a', href=True)],
                'images': [{'alt': img.get('alt', '').strip(), 'src': urljoin(url, img.get('src', ''))} for img in soup.find_all('img', src=True)],
                'meta_tags': {
                    meta.get('name', meta.get('property', '')): meta.get('content', '').strip()
                    for meta in soup.find_all('meta')
                    if (meta.get('name') or meta.get('property')) and meta.get('content')
                }
            }
            logger.debug(f"[AsyncWebCrawlerService.crawl_website] Extracted {len(content['links'])} links and {len(content['images'])} images")

            logger.info("[AsyncWebCrawlerService.crawl_website] Successfully completed website crawl")
            return {
                'success': True,
                'content': content,
                'url': url
            }
        except Exception as e:
            error_msg = f"Error crawling {url}: {str(e)}"
            logger.error(f"[AsyncWebCrawlerService.crawl_website] {error_msg}")
            return {
                'success': False,
                'error': error_msg
            }
        finally:
            # Close the session on every exit path, including a failed fetch
            if self.session:
                logger.debug("[AsyncWebCrawlerService.crawl_website] Closing session")
                await self.session.close()
                self.session = None

    async def analyze_content_with_llm(self, content: Dict, api_key: str, gpt_provider: str) -> Dict:
        """
        Analyze content using LLM.

        Args:
            content (Dict): Content to analyze
            api_key (str): API key for the LLM service
            gpt_provider (str): Provider to use (openai/google)

        Returns:
            Dict: Analysis results, or an empty dict if the analysis fails
        """
        try:
            logger.info(f"[AsyncWebCrawlerService.analyze_content_with_llm] Starting content analysis with {gpt_provider}")

            # Prepare the content for analysis
            main_content = content.get("main_content", "")
            if isinstance(main_content, dict):
                main_content = main_content.get("text", "")
            logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Prepared {len(main_content)} characters for analysis")

            # Limit content length for API (kept out of the prompt text itself)
            excerpt = main_content[:4000]

            # Construct the prompt for analysis
            prompt = f"""Analyze the following website content and provide a comprehensive analysis:

Content:
{excerpt}

Please provide analysis in the following JSON format:
{{
    "topics": ["topic1", "topic2", ...],
    "key_insights": ["insight1", "insight2", ...],
    "content_quality": {{
        "readability": "score",
        "engagement": "score",
        "completeness": "score"
    }},
    "recommendations": ["rec1", "rec2", ...],
    "seo_score": "score",
    "content_gaps": ["gap1", "gap2", ...],
    "opportunities": ["opp1", "opp2", ...],
    "priority_areas": ["area1", "area2", ...]
}}

Ensure the response is valid JSON."""

            # Call the LLM function.
            # Note: api_key and gpt_provider are not forwarded; llm_text_gen receives
            # only the prompt and is called without await, i.e. as a synchronous call.
            logger.debug("[AsyncWebCrawlerService.analyze_content_with_llm] Calling llm_text_gen with prompt")
            response = llm_text_gen(prompt)
            if not response:
                logger.error("[AsyncWebCrawlerService.analyze_content_with_llm] No response from LLM")
                return {}

            # Clean up the response before parsing
            logger.debug("[AsyncWebCrawlerService.analyze_content_with_llm] Cleaning response for JSON parsing")
            try:
                # Remove any leading/trailing whitespace
                cleaned_response = response.strip()

                # Keep only the text between the first '{' and the last '}'
                start_idx = cleaned_response.find('{')
                end_idx = cleaned_response.rfind('}')
                if start_idx != -1 and end_idx != -1:
                    cleaned_response = cleaned_response[start_idx:end_idx + 1]

                # Fix any line breaks within strings
                cleaned_response = cleaned_response.replace('\n', ' ')
                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Attempting to parse cleaned response: {cleaned_response[:100]}...")

                # Parse the cleaned response
                analysis_result = json.loads(cleaned_response)
                logger.info("[AsyncWebCrawlerService.analyze_content_with_llm] Successfully parsed LLM response")
                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Analysis result keys: {analysis_result.keys()}")
                return analysis_result
            except json.JSONDecodeError as e:
                logger.error(f"[AsyncWebCrawlerService.analyze_content_with_llm] Failed to parse LLM response as JSON: {str(e)}")
                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Raw response: {response[:100]}...")
                return {}
        except Exception as e:
            logger.error(f"[AsyncWebCrawlerService.analyze_content_with_llm] Error analyzing content with LLM: {str(e)}")
            return {}


@@ -0,0 +1,94 @@
"""Web crawler for ALwrity style analysis."""
import asyncio
from typing import Optional

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from loguru import logger


def _text_metrics(text: str) -> dict:
    """Compute basic style metrics; sentences are split on periods only."""
    words = text.split()
    # Ignore empty fragments, e.g. the one after a trailing period
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1)
    }


async def analyze_website_style(url: Optional[str] = None, sample_text: Optional[str] = None) -> dict:
    """
    Analyze website content or sample text for style analysis.

    Args:
        url: Website URL to analyze
        sample_text: Optional sample text to analyze instead of website

    Returns:
        dict: Analysis results including content style metrics
    """
    try:
        if sample_text:
            # Analyze sample text directly
            return {
                "success": True,
                "content": sample_text,
                "metrics": _text_metrics(sample_text)
            }

        if not url:
            return {
                "success": False,
                "error": "Either url or sample_text must be provided"
            }

        browser_config = BrowserConfig()  # Default browser configuration
        run_config = CrawlerRunConfig()  # Default crawl run configuration

        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url=url,
                config=run_config
            )
            logger.debug(f"Crawl finished for {url}: success={result.success}")

            if result.success:
                # Process content for style analysis
                content = result.markdown or ""
                metrics = _text_metrics(content)
                metrics["internal_links"] = len(result.links.get("internal", []))
                metrics["images"] = len(result.media.get("images", []))
                return {
                    "success": True,
                    "content": content,
                    "metrics": metrics
                }
            else:
                return {
                    "success": False,
                    "error": result.error_message
                }
    except Exception as e:
        logger.error(f"Error in style analysis: {str(e)}")
        return {
            "success": False,
            "error": str(e)
        }


def analyze_style(url: Optional[str] = None, sample_text: Optional[str] = None) -> dict:
    """
    Synchronous wrapper for style analysis.

    Args:
        url: Website URL to analyze
        sample_text: Optional sample text to analyze

    Returns:
        dict: Analysis results
    """
    return asyncio.run(analyze_website_style(url, sample_text))

# Deep Crawling
# One of Crawl4AI's most powerful features is its ability to perform
# configurable deep crawling that can explore websites beyond a single page.
# With fine-tuned control over crawl depth, domain boundaries,
# and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.