Made changes to Getting started with ALwrity and added lot of details on API keys

2025-04-01 13:11:40 +05:30
parent 367f9bac2c
commit 6c833e2773
68 changed files with 8384 additions and 823 deletions
--- a/lib/web_crawlers/README.md
+++ b/lib/web_crawlers/README.md
@@ -0,0 +1,151 @@
+# Web Crawler Guide for Content Creators
+
+## What is a Web Crawler?
+
+A web crawler is a powerful tool that helps content creators gather, analyze, and understand content from websites. It's like having a digital assistant that can quickly scan websites and extract valuable information to help you create better content.
+
+## Key Features
+
+### 1. Content Extraction
+- **Main Content**: Extracts the primary content from web pages
+- **Meta Information**: Captures titles, descriptions, and meta tags
+- **Structure Analysis**: Identifies headings and content hierarchy
+- **Media Elements**: Collects links and images with their descriptions
+
+### 2. AI-Powered Analysis
+- **Topic Identification**: Automatically identifies main topics
+- **Content Quality Assessment**: Evaluates readability and engagement
+- **SEO Analysis**: Provides SEO scores and recommendations
+- **Content Gap Analysis**: Identifies missing information
+- **Opportunity Detection**: Suggests areas for improvement
+
+### 3. Smart Processing
+- **Fast Performance**: Uses advanced async technology for quick results
+- **Error Handling**: Gracefully handles website access issues
+- **Content Cleaning**: Removes unnecessary elements for clean analysis
+- **Multiple Page Support**: Can analyze multiple pages efficiently
+
+## Use Cases for Content Creators
+
+### 1. Content Research
+- **Competitor Analysis**: Study competitor content and strategies
+- **Topic Research**: Gather information for new content ideas
+- **Industry Trends**: Track industry developments and updates
+- **Content Inspiration**: Find inspiration from successful content
+
+### 2. Content Optimization
+- **SEO Improvement**: Identify SEO opportunities
+- **Content Structure**: Analyze and improve content organization
+- **Readability Enhancement**: Get suggestions for better readability
+- **Engagement Optimization**: Improve content engagement
+
+### 3. Content Strategy
+- **Gap Analysis**: Identify content gaps in your niche
+- **Topic Planning**: Plan content topics and themes
+- **Audience Understanding**: Better understand target audience needs
+- **Performance Tracking**: Monitor content performance
+
+## How to Use the Web Crawler
+
+### 1. Basic Usage
+1. **Enter URL**: Provide the website URL you want to analyze
+2. **Start Crawling**: The crawler will automatically extract content
+3. **Review Results**: Get comprehensive analysis of the content
+
+### 2. Advanced Features
+- **Custom Analysis**: Set specific parameters for content analysis
+- **Batch Processing**: Analyze multiple pages at once
+- **Detailed Reports**: Get in-depth content analysis reports
+- **Export Options**: Export results in various formats
+
+### 3. Analysis Options
+- **Content Quality**: Evaluate writing style and structure
+- **SEO Metrics**: Check SEO performance
+- **Engagement Factors**: Analyze reader engagement potential
+- **Improvement Suggestions**: Get actionable recommendations
+
+## Benefits for Content Creators
+
+### 1. Time Savings
+- Quick content research
+- Automated analysis
+- Efficient data gathering
+- Streamlined workflow
+
+### 2. Quality Improvement
+- Better content structure
+- Enhanced readability
+- Improved SEO performance
+- Higher engagement potential
+
+### 3. Strategic Advantage
+- Data-driven decisions
+- Competitive insights
+- Content optimization
+- Performance tracking
+
+## Best Practices
+
+### 1. Before Crawling
+- Identify clear objectives
+- Select relevant websites
+- Set analysis parameters
+- Prepare for data collection
+
+### 2. During Analysis
+- Review extracted content
+- Validate information
+- Check for accuracy
+- Note important insights
+
+### 3. After Analysis
+- Apply findings to content
+- Track improvements
+- Update content strategy
+- Monitor results
+
+## Common Applications
+
+### 1. Blog Content
+- Topic research
+- Content structure analysis
+- SEO optimization
+- Engagement improvement
+
+### 2. Article Writing
+- Research gathering
+- Fact verification
+- Source analysis
+- Content enhancement
+
+### 3. Website Content
+- Page optimization
+- Content audit
+- Structure improvement
+- SEO enhancement
+
+### 4. Social Media Content
+- Trend analysis
+- Content ideas
+- Engagement optimization
+- Performance tracking
+
+## Tips for Optimal Results
+
+1. **Be Specific**: Clearly define your analysis goals
+2. **Choose Quality Sources**: Select reliable websites for analysis
+3. **Review Results**: Always verify extracted information
+4. **Apply Insights**: Use findings to improve your content
+5. **Track Progress**: Monitor improvements over time
+
+## ALwrity, Need Help?
+
+If you encounter any issues or need assistance:
+1. Check the documentation
+2. Review error messages
+3. Verify website accessibility
+4. Contact support if needed
+
+---
+
+*Note: This tool is designed to help content creators gather and analyze web content efficiently. Always respect website terms of service and robots.txt files when crawling websites.* 
--- a/lib/web_crawlers/async_web_crawler.py
+++ b/lib/web_crawlers/async_web_crawler.py
@@ -0,0 +1,246 @@
+"""Web crawler module using requests and BeautifulSoup."""
+
+from typing import Dict, List, Optional
+import json
+from loguru import logger
+import requests
+import aiohttp
+import asyncio
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin, urlparse
+from pydantic import BaseModel, Field
+import os
+from ..gpt_providers.text_generation.main_text_generation import llm_text_gen
+
+class WebsiteContent(BaseModel):
+    """Model for website content analysis."""
+    title: str = Field("", description="Title of the webpage")
+    description: str = Field("", description="Meta description of the webpage")
+    main_content: str = Field("", description="Main content of the webpage")
+    headings: List[str] = Field([], description="All headings on the page")
+    links: List[Dict[str, str]] = Field([], description="All links on the page")
+    images: List[Dict[str, str]] = Field([], description="All images on the page")
+    meta_tags: Dict[str, str] = Field({}, description="Meta tags from the page")
+
+class AsyncWebCrawlerService:
+    """Service for crawling websites."""
+    
+    def __init__(self):
+        """Initialize the crawler service."""
+        logger.info("[AsyncWebCrawlerService.__init__] Initializing crawler service")
+        self.visited_urls = set()
+        self.base_url = None
+        self.domain = None
+        self.session = None
+        self.max_pages = 10  # Limit the number of pages to crawl
+        self.timeout = 30  # Timeout in seconds for requests
+        self.headers = {
+            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
+        }
+    
+    async def __aenter__(self):
+        """Create aiohttp session when entering context."""
+        logger.debug("[AsyncWebCrawlerService.__aenter__] Creating aiohttp session")
+        self.session = aiohttp.ClientSession(headers=self.headers)
+        return self
+        
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        """Close aiohttp session when exiting context."""
+        logger.debug("[AsyncWebCrawlerService.__aexit__] Closing aiohttp session")
+        if self.session:
+            await self.session.close()
+    
+    async def fetch_url(self, url: str) -> str:
+        """
+        Fetch URL content asynchronously.
+        
+        Args:
+            url (str): URL to fetch
+            
+        Returns:
+            str: HTML content
+        """
+        logger.debug(f"[AsyncWebCrawlerService.fetch_url] Fetching URL: {url}")
+        if not self.session:
+            logger.debug("[AsyncWebCrawlerService.fetch_url] Creating new session")
+            self.session = aiohttp.ClientSession(headers=self.headers)
+            
+        async with self.session.get(url) as response:
+            if response.status == 200:
+                logger.debug(f"[AsyncWebCrawlerService.fetch_url] Successfully fetched URL: {url}")
+                return await response.text()
+            else:
+                error_msg = f"Failed to fetch URL: Status code {response.status}"
+                logger.error(f"[AsyncWebCrawlerService.fetch_url] {error_msg}")
+                raise Exception(error_msg)
+    
+    async def crawl_website(self, url: str) -> Dict:
+        """
+        Crawl a website and extract its content.
+        
+        Args:
+            url (str): The URL to crawl
+            
+        Returns:
+            Dict: Extracted website content and metadata
+        """
+        try:
+            logger.info(f"[AsyncWebCrawlerService.crawl_website] Starting crawl for URL: {url}")
+            
+            # Fetch the page content
+            try:
+                html_content = await self.fetch_url(url)
+                logger.debug("[AsyncWebCrawlerService.crawl_website] Successfully fetched HTML content")
+            except Exception as e:
+                error_msg = f"Failed to fetch content from {url}: {str(e)}"
+                logger.error(f"[AsyncWebCrawlerService.crawl_website] {error_msg}")
+                return {
+                    'success': False,
+                    'error': error_msg
+                }
+            
+            # Parse HTML with BeautifulSoup
+            logger.debug("[AsyncWebCrawlerService.crawl_website] Parsing HTML content")
+            soup = BeautifulSoup(html_content, 'html.parser')
+            
+            # Extract main content (focusing on article-like content)
+            main_content_elements = soup.find_all(['article', 'main', 'div'], class_=['content', 'main-content', 'article', 'post'])
+            if not main_content_elements:
+                main_content_elements = soup.find_all(['p', 'article', 'section'])
+            
+            main_content = ' '.join([elem.get_text(strip=True) for elem in main_content_elements])
+            
+            # If still no content, get all paragraph text
+            if not main_content:
+                main_content = ' '.join([p.get_text(strip=True) for p in soup.find_all('p')])
+            
+            logger.debug(f"[AsyncWebCrawlerService.crawl_website] Extracted {len(main_content)} characters of main content")
+            
+            # Extract content
+            content = {
+                'title': soup.title.string.strip() if soup.title else '',
+                'description': soup.find('meta', {'name': 'description'}).get('content', '').strip() if soup.find('meta', {'name': 'description'}) else '',
+                'main_content': main_content,
+                'headings': [h.get_text(strip=True) for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])],
+                'links': [{'text': a.get_text(strip=True), 'href': urljoin(url, a.get('href', ''))} for a in soup.find_all('a', href=True)],
+                'images': [{'alt': img.get('alt', '').strip(), 'src': urljoin(url, img.get('src', ''))} for img in soup.find_all('img', src=True)],
+                'meta_tags': {
+                    meta.get('name', meta.get('property', '')): meta.get('content', '').strip()
+                    for meta in soup.find_all('meta')
+                    if (meta.get('name') or meta.get('property')) and meta.get('content')
+                }
+            }
+            
+            logger.debug(f"[AsyncWebCrawlerService.crawl_website] Extracted {len(content['links'])} links and {len(content['images'])} images")
+            
+            # Close the session if it exists
+            if self.session:
+                logger.debug("[AsyncWebCrawlerService.crawl_website] Closing session")
+                await self.session.close()
+                self.session = None
+            
+            logger.info("[AsyncWebCrawlerService.crawl_website] Successfully completed website crawl")
+            return {
+                'success': True,
+                'content': content,
+                'url': url
+            }
+                
+        except Exception as e:
+            error_msg = f"Error crawling {url}: {str(e)}"
+            logger.error(f"[AsyncWebCrawlerService.crawl_website] {error_msg}")
+            # Ensure session is closed even if there's an error
+            if self.session:
+                logger.debug("[AsyncWebCrawlerService.crawl_website] Closing session after error")
+                await self.session.close()
+                self.session = None
+            return {
+                'success': False,
+                'error': str(e)
+            }
+    
+    async def analyze_content_with_llm(self, content: Dict, api_key: str, gpt_provider: str) -> Dict:
+        """
+        Analyze content using LLM.
+        
+        Args:
+            content (Dict): Content to analyze
+            api_key (str): API key for the LLM service
+            gpt_provider (str): Provider to use (openai/google)
+            
+        Returns:
+            Dict: Analysis results
+        """
+        try:
+            logger.info(f"[AsyncWebCrawlerService.analyze_content_with_llm] Starting content analysis with {gpt_provider}")
+            
+            # Prepare the content for analysis
+            main_content = content.get("main_content", "")
+            if isinstance(main_content, dict):
+                main_content = main_content.get("text", "")
+            
+            logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Prepared {len(main_content)} characters for analysis")
+            
+            # Construct the prompt for analysis
+            prompt = f"""Analyze the following website content and provide a comprehensive analysis:
+
+Content:
+{main_content[:4000]}  # Limit content length for API
+
+Please provide analysis in the following JSON format:
+{{
+    "topics": ["topic1", "topic2", ...],
+    "key_insights": ["insight1", "insight2", ...],
+    "content_quality": {{
+        "readability": "score",
+        "engagement": "score",
+        "completeness": "score"
+    }},
+    "recommendations": ["rec1", "rec2", ...],
+    "seo_score": "score",
+    "content_gaps": ["gap1", "gap2", ...],
+    "opportunities": ["opp1", "opp2", ...],
+    "priority_areas": ["area1", "area2", ...]
+}}
+
+Ensure the response is valid JSON."""
+
+            # Call the LLM function
+            logger.debug("[AsyncWebCrawlerService.analyze_content_with_llm] Calling llm_text_gen with prompt")
+            response = llm_text_gen(prompt)
+            
+            if not response:
+                logger.error("[AsyncWebCrawlerService.analyze_content_with_llm] No response from LLM")
+                return {}
+            
+            # Clean up the response before parsing
+            logger.debug("[AsyncWebCrawlerService.analyze_content_with_llm] Cleaning response for JSON parsing")
+            try:
+                # Remove any leading/trailing whitespace
+                cleaned_response = response.strip()
+                
+                # If response starts with a newline or other characters before {, clean it
+                start_idx = cleaned_response.find('{')
+                end_idx = cleaned_response.rfind('}')
+                if start_idx != -1 and end_idx != -1:
+                    cleaned_response = cleaned_response[start_idx:end_idx + 1]
+                
+                # Fix any line breaks within strings
+                cleaned_response = cleaned_response.replace('\n', ' ')
+                
+                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Attempting to parse cleaned response: {cleaned_response[:100]}...")
+                
+                # Parse the cleaned response
+                analysis_result = json.loads(cleaned_response)
+                logger.info("[AsyncWebCrawlerService.analyze_content_with_llm] Successfully parsed LLM response")
+                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Analysis result keys: {analysis_result.keys()}")
+                return analysis_result
+                
+            except json.JSONDecodeError as e:
+                logger.error(f"[AsyncWebCrawlerService.analyze_content_with_llm] Failed to parse LLM response as JSON: {str(e)}")
+                logger.debug(f"[AsyncWebCrawlerService.analyze_content_with_llm] Raw response: {response[:100]}...")
+                return {}
+                
+        except Exception as e:
+            logger.error(f"[AsyncWebCrawlerService.analyze_content_with_llm] Error analyzing content with LLM: {str(e)}")
+            return {} 
--- a/lib/web_crawlers/crawl4ai_web_crawler.py
+++ b/lib/web_crawlers/crawl4ai_web_crawler.py
@@ -0,0 +1,94 @@
+"""Web crawler for ALwrity style analysis."""
+
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
+from loguru import logger
+
+async def analyze_website_style(url: str, sample_text: str = None) -> dict:
+    """
+    Analyze website content or sample text for style analysis.
+    
+    Args:
+        url: Website URL to analyze
+        sample_text: Optional sample text to analyze instead of website
+        
+    Returns:
+        dict: Analysis results including content style metrics
+    """
+    try:
+        if sample_text:
+            # Analyze sample text directly
+            return {
+                "success": True,
+                "content": sample_text,
+                "metrics": {
+                    "word_count": len(sample_text.split()),
+                    "sentence_count": len(sample_text.split('.')),
+                    "avg_sentence_length": len(sample_text.split()) / max(len(sample_text.split('.')), 1)
+                }
+            }
+        browser_config = BrowserConfig()  # Default browser configuration
+        run_config = CrawlerRunConfig()   # Default crawl run configuration
+        
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result = await crawler.arun(
+                url=url,
+                config=run_config
+            )
+            print(result.markdown)  # Print clean markdown content
+
+            logger.debug(f"Crawl result: {result}")
+            if result.success:
+                # Process content for style analysis
+                content = result.markdown
+                sentences = [s.strip() for s in content.split('.') if s.strip()]
+                
+                return {
+                    "success": True,
+                    "content": content,
+                    "metrics": {
+                        "word_count": len(content.split()),
+                        "sentence_count": len(sentences),
+                        "avg_sentence_length": len(content.split()) / max(len(sentences), 1),
+                        "internal_links": len(result.links["internal"]),
+                        "images": len(result.media["images"])
+                    }
+                }
+            else:
+                return {
+                    "success": False,
+                    "error": result.error_message
+                }
+
+    except Exception as e:
+        logger.error(f"Error in style analysis: {str(e)}")
+        return {
+            "success": False,
+            "error": str(e)
+        }
+
+def analyze_style(url: str = None, sample_text: str = None) -> dict:
+    """
+    Synchronous wrapper for style analysis.
+    
+    Args:
+        url: Website URL to analyze
+        sample_text: Optional sample text to analyze
+        
+    Returns:
+        dict: Analysis results
+    """
+    return asyncio.run(analyze_website_style(url, sample_text))
+
+
+# Deep Crawling
+# One of Crawl4AI's most powerful features is its ability to perform 
+# configurable deep crawling that can explore websites beyond a single page.
+#  With fine-tuned control over crawl depth, domain boundaries,
+#  and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.
+# 
+#
+#
+#
+#