Add AI marketing and writing tools from PRs #220, #310

New tools added to ToBeMigrated/ directory:

ai_marketing_tools/:
- ai_backlinker: AI-powered backlink generation
- ai_google_ads_generator: Google Ads generation with templates

ai_writers/:
- ai_blog_faqs_writer: FAQ generation for blogs
- ai_copywriter: Multiple copywriter frameworks (AIDA, PAS, 4C, 4R, etc.)
- ai_finance_report_generator: Financial report generation
- ai_story_illustrator: Story illustration
- ai_story_video_generator: Story video generation
- ai_story_writer: AI story writing
- github_blogs: GitHub blog integration
- speech_to_blog: Audio to blog conversion
- twitter_writers: Twitter/X content generation
- youtube_writers: YouTube content generation

These tools are in ToBeMigrated/ for future migration to the main backend.
ajaysi
2026-03-22 11:47:21 +05:30
parent 1fd9720dac
commit 3c58fd555b
91 changed files with 26451 additions and 0 deletions

@@ -0,0 +1,259 @@
# GitHub Blog Generator
An AI-powered content generator that creates documentation, tutorials, and guides from GitHub repositories. The module scrapes repository data (metadata, README, file contents, topics, contributors) and passes it to an LLM to produce several types of technical content.
## Features
### 1. Content Generation Types
The system can generate the following types of content from GitHub repositories:
- **Getting Started Guides**
  - Introduction and Overview
  - Prerequisites and Setup
  - Installation Instructions
  - Basic Usage Examples
  - Common Use Cases
  - Best Practices
  - Next Steps and Resources
- **Technical Documentation**
  - Architecture Overview
  - Core Components
  - Technical Specifications
  - Integration Points
  - Performance Considerations
  - Security Features
  - API Documentation
  - Configuration Options
  - Deployment Guidelines
  - Troubleshooting Guide
- **Tutorial Series**
  - Beginner Tutorials
    - Basic concepts
    - Simple examples
    - Step-by-step instructions
  - Intermediate Tutorials
    - Advanced features
    - Real-world examples
    - Best practices
  - Advanced Tutorials
    - Complex use cases
    - Performance optimization
    - Integration patterns
- **Comparison Analysis**
  - Feature Comparison
  - Performance Analysis
  - Use Case Suitability
  - Community and Support
  - Learning Curve
  - Integration Capabilities
  - Future Prospects
- **Case Studies**
  - Problem Statement
  - Solution Implementation
  - Technical Challenges
  - Results and Benefits
  - Lessons Learned
  - Future Improvements
- **Contribution Guides**
  - Development Setup
  - Code Style Guidelines
  - Testing Requirements
  - Documentation Standards
  - Pull Request Process
  - Review Guidelines
  - Community Guidelines
- **Security Guides**
  - Security Architecture
  - Authentication & Authorization
  - Data Protection
  - Secure Configuration
  - Vulnerability Management
  - Incident Response
  - Compliance Requirements
- **Performance Guides**
  - Performance Metrics
  - Optimization Techniques
  - Benchmarking Guidelines
  - Resource Management
  - Scaling Strategies
  - Monitoring Setup
  - Troubleshooting
### 2. GitHub Content Scraping
The module includes a GitHub content scraper with the following capabilities:
- **Rate Limiting**
  - Configurable API call limit (`calls_per_minute`, default: 30)
  - Automatic request throttling
  - Concurrent request management
- **Caching System**
  - Configurable cache duration (TTL)
  - Automatic invalidation of expired entries on read
  - On-disk storage of scraped content
- **Content Extraction**
  - Repository metadata
  - README content
  - File contents
  - Repository topics
  - Contributor information
  - License information
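
The two settings above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic, not the module's own code; the helper names `min_interval_seconds` and `is_expired` are hypothetical and only mirror the calculation used by the scraper's rate limiter and cache.

```python
from datetime import datetime, timedelta


def min_interval_seconds(calls_per_minute: int) -> float:
    """Seconds to wait between calls so the per-minute budget is respected."""
    return 60 / calls_per_minute


def is_expired(stored_at: datetime, ttl_hours: int, now: datetime) -> bool:
    """True once a cached entry is older than the configured TTL."""
    return now - stored_at > timedelta(hours=ttl_hours)


stored = datetime(2026, 1, 1, 12, 0)
print(min_interval_seconds(30))                                   # 2.0
print(is_expired(stored, 24, now=datetime(2026, 1, 2, 11, 59)))   # False
print(is_expired(stored, 24, now=datetime(2026, 1, 2, 12, 1)))    # True
```

With the defaults (30 calls per minute, 24-hour TTL), requests are spaced at least two seconds apart and a cached entry is dropped the first time it is read after more than 24 hours.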
### 3. Content Enhancement
- **Online Research Integration**
  - Automatic topic research
  - Related content discovery
  - Industry trend analysis
- **FAQ Generation**
  - Automatic FAQ creation
  - Common question identification
  - Comprehensive answers
- **Metadata Generation**
  - SEO-optimized titles
  - Meta descriptions
  - Tags and categories
  - Content structuring
## Usage Examples
### Basic Usage
```python
import asyncio
from lib.ai_writers.github_blogs import GitHubBlogGenerator


async def main():
    # Initialize the generator
    generator = GitHubBlogGenerator()
    # Generate content for a GitHub repository
    content = await generator.generate_content(
        github_url="https://github.com/owner/repo",
        content_types=["getting_started", "technical_docs", "tutorials"]
    )
    # Save the generated content
    generator.save_content(content, "my_repository")


# `await` is only valid inside a coroutine, so run the example with asyncio.run()
asyncio.run(main())
```
### Advanced Usage
```python
import asyncio
from lib.ai_writers.github_blogs import GitHubBlogGenerator

# Generate all content types
content_types = [
    "getting_started",
    "technical_docs",
    "tutorials",
    "comparison",
    "case_studies",
    "contribution",
    "security",
    "performance"
]
# Generate content for multiple repositories
urls = [
    "https://github.com/owner/repo1",
    "https://github.com/owner/repo2"
]


async def main():
    # Initialize with custom settings
    generator = GitHubBlogGenerator(
        cache_dir=".custom_cache",
        ttl_hours=48
    )
    for url in urls:
        content = await generator.generate_content(url, content_types)
        # Use the repository name (last URL segment) as the base filename
        generator.save_content(content, url.split("/")[-1])


asyncio.run(main())
```
## Configuration Options
### GitHubBlogGenerator
- `cache_dir` (str): Directory for caching scraped content (default: ".github_cache")
- `ttl_hours` (int): Time-to-live for cached content in hours (default: 24)
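### Content Generation
- `github_url` (str): URL of the GitHub repository
- `content_types` (List[str]): Types of content to generate (default: `["getting_started", "technical_docs", "tutorials"]`)
- `gpt_provider` (str): Choice of AI provider ("gemini" or "openai", default: "gemini"). This is a parameter of the individual `generate_*` functions; `generate_content` does not currently expose it.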
## Output Format
All generated content is saved in Markdown format with the following structure:
```markdown
# [Title]
[Generated content based on content type]
## Metadata
- Title: [SEO-optimized title]
- Description: [Meta description]
- Tags: [Generated tags]
- Categories: [Generated categories]
```
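
As an illustration of the layout above, here is a minimal sketch that assembles such a file from its parts. The function `render_markdown` and its parameters are hypothetical; the module's actual writer is the `save_blog_to_file` helper, whose implementation is not shown here.

```python
from typing import List


def render_markdown(title: str, body: str, description: str,
                    tags: List[str], categories: List[str]) -> str:
    """Assemble content and metadata into the documented Markdown layout."""
    lines = [
        f"# {title}",
        body.strip(),
        "## Metadata",
        f"- Title: {title}",
        f"- Description: {description}",
        f"- Tags: {', '.join(tags)}",
        f"- Categories: {', '.join(categories)}",
    ]
    return "\n".join(lines) + "\n"


print(render_markdown(
    title="Getting Started with repo",
    body="Install the package, then run the quick-start example.",
    description="A short setup guide.",
    tags=["python", "tutorial"],
    categories=["Guides"],
))
```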
## Best Practices
1. **Rate Limiting**
   - Configure appropriate rate limits based on your GitHub API quota
   - Use caching to minimize API calls
   - Implement proper error handling for rate limit exceeded scenarios
2. **Content Generation**
   - Start with basic content types before generating advanced content
   - Review generated content for accuracy and completeness
   - Customize prompts for specific repository types
3. **Caching**
   - Set appropriate TTL based on repository update frequency
   - Clear cache when repository content changes significantly
   - Monitor cache size and performance
4. **Error Handling**
   - Implement proper error handling for API failures
   - Log errors for debugging
   - Provide fallback mechanisms for failed content generation
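
One way to handle rate-limit errors is to retry with exponential backoff. The sketch below is an assumption about how a caller might wrap a request, not part of this module: `RateLimitExceeded` and `with_backoff` are hypothetical names, and the scraper itself does not raise this exception.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitExceeded(Exception):
    """Hypothetical error a caller might raise on a rate-limit response."""


async def with_backoff(call: Callable[[], Awaitable[T]],
                       retries: int = 3, base_delay: float = 1.0) -> T:
    """Retry `call` on RateLimitExceeded, doubling the delay each attempt."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except RateLimitExceeded:
            if attempt == retries:
                raise  # retries exhausted: let the caller apply its fallback
            # Exponential delay plus a little jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
    raise ValueError("retries must be non-negative")


async def demo() -> str:
    attempts = {"count": 0}

    async def flaky() -> str:
        # Fails twice, then succeeds, to exercise the retry path
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitExceeded("quota exhausted")
        return f"succeeded after {attempts['count']} attempts"

    return await with_backoff(flaky, retries=3, base_delay=0.01)


print(asyncio.run(demo()))
```

When the retries are exhausted the exception propagates, which is where a fallback (for example, serving stale cached content) could be applied.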
## Dependencies
- Python 3.8+
- aiohttp
- beautifulsoup4
- loguru
- pydantic
- requests
- pandas
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License
[Your License Here]
## Support
For support, please [create an issue](https://github.com/your-repo/issues) or contact the maintainers.

@@ -0,0 +1,254 @@
"""
Enhanced GitHub Content Generator
This module provides various content generation capabilities from GitHub repository data,
including getting started guides, technical documentation, tutorials, and more.
"""
import sys
from typing import Dict, List, Optional
from loguru import logger
from lib.gpt_providers.text_generation.main_text_generation import llm_text_gen
logger.remove()
logger.add(sys.stdout,
colorize=True,
format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
def generate_technical_documentation(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate comprehensive technical documentation from repository data."""
prompt = f"""As an expert technical writer, create detailed technical documentation for the following GitHub repository:
Repository Data:
{repo_data}
Please create a comprehensive technical documentation that includes:
1. Architecture Overview
2. Core Components
3. Technical Specifications
4. Integration Points
5. Performance Considerations
6. Security Features
7. API Documentation (if applicable)
8. Configuration Options
9. Deployment Guidelines
10. Troubleshooting Guide
Format the documentation in markdown with appropriate headers, code blocks, and diagrams.
Include real-world examples and best practices.
"""
return _get_llm_response(prompt, gpt_provider)
def generate_getting_started_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a beginner-friendly getting started guide."""
prompt = f"""As an expert programmer and teacher, create a comprehensive getting started guide for the following GitHub repository:
Repository Data:
{repo_data}
Create a step-by-step guide that includes:
1. Introduction and Overview
2. Prerequisites and Setup
3. Installation Instructions
4. Basic Usage Examples
5. Common Use Cases
6. Best Practices
7. Next Steps and Resources
Make the guide:
- Beginner-friendly with clear explanations
- Include practical examples with code snippets
- Add emojis for better readability
- Include troubleshooting tips
- Provide links to additional resources
"""
return _get_llm_response(prompt, gpt_provider)
def generate_tutorial_series(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a series of tutorials for different skill levels."""
prompt = f"""As an expert educator, create a series of tutorials for the following GitHub repository:
Repository Data:
{repo_data}
Create a structured tutorial series that includes:
1. Beginner Tutorial
- Basic concepts
- Simple examples
- Step-by-step instructions
2. Intermediate Tutorial
- Advanced features
- Real-world examples
- Best practices
3. Advanced Tutorial
- Complex use cases
- Performance optimization
- Integration patterns
Each tutorial should:
- Be self-contained
- Include practical examples
- Have clear learning objectives
- Include exercises and challenges
"""
return _get_llm_response(prompt, gpt_provider)
def generate_comparison_analysis(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a comparison analysis with similar tools/frameworks."""
prompt = f"""As a technical analyst, create a comprehensive comparison analysis for the following GitHub repository:
Repository Data:
{repo_data}
Create a detailed comparison that includes:
1. Feature Comparison
2. Performance Analysis
3. Use Case Suitability
4. Community and Support
5. Learning Curve
6. Integration Capabilities
7. Future Prospects
Include:
- Pros and Cons
- Real-world use cases
- Industry adoption
- Community feedback
- Future roadmap
"""
return _get_llm_response(prompt, gpt_provider)
def generate_case_studies(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate real-world case studies and success stories."""
prompt = f"""As a technical writer, create compelling case studies for the following GitHub repository:
Repository Data:
{repo_data}
Create detailed case studies that include:
1. Problem Statement
2. Solution Implementation
3. Technical Challenges
4. Results and Benefits
5. Lessons Learned
6. Future Improvements
Make the case studies:
- Based on real-world scenarios
- Include technical details
- Show measurable results
- Provide actionable insights
"""
return _get_llm_response(prompt, gpt_provider)
def generate_contribution_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a comprehensive contribution guide."""
prompt = f"""As an open-source maintainer, create a detailed contribution guide for the following GitHub repository:
Repository Data:
{repo_data}
Create a contribution guide that includes:
1. Development Setup
2. Code Style Guidelines
3. Testing Requirements
4. Documentation Standards
5. Pull Request Process
6. Review Guidelines
7. Community Guidelines
Make the guide:
- Clear and concise
- Include examples
- Cover all contribution types
- Provide templates
"""
return _get_llm_response(prompt, gpt_provider)
def generate_security_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a security best practices guide."""
prompt = f"""As a security expert, create a comprehensive security guide for the following GitHub repository:
Repository Data:
{repo_data}
Create a security guide that includes:
1. Security Architecture
2. Authentication & Authorization
3. Data Protection
4. Secure Configuration
5. Vulnerability Management
6. Incident Response
7. Compliance Requirements
Make the guide:
- Practical and actionable
- Include security checklists
- Provide code examples
- Cover common vulnerabilities
"""
return _get_llm_response(prompt, gpt_provider)
def generate_performance_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
"""Generate a performance optimization guide."""
prompt = f"""As a performance optimization expert, create a detailed performance guide for the following GitHub repository:
Repository Data:
{repo_data}
Create a performance guide that includes:
1. Performance Metrics
2. Optimization Techniques
3. Benchmarking Guidelines
4. Resource Management
5. Scaling Strategies
6. Monitoring Setup
7. Troubleshooting
Make the guide:
- Data-driven
- Include benchmarks
- Provide optimization tips
- Cover different scales
"""
return _get_llm_response(prompt, gpt_provider)
def _get_llm_response(prompt: str, gpt_provider: str) -> str:
    """Get response from the specified LLM provider.

    Note: gpt_provider is only used in the error log below; it is not
    forwarded to llm_text_gen.
    """
    system_prompt = """You are an expert technical writer and GitHub repository analyst with deep expertise in software development, documentation, and technical communication.
Your role is to create high-quality, accurate, and engaging content based on GitHub repository data. You should:
1. **Technical Accuracy**
- Ensure all technical information is precise and up-to-date
- Verify code examples and configurations
- Cross-reference documentation and source code
- Maintain consistency with repository standards
2. **Content Structure**
- Use clear hierarchical organization
- Include appropriate code blocks and examples
- Add relevant diagrams and visual aids
- Break complex topics into digestible sections
3. **Writing Style**
- Maintain a professional yet approachable tone
- Use active voice and clear language
- Include practical examples and use cases
- Add relevant emojis for better readability
4. **Best Practices**
- Follow industry-standard documentation practices
- Include troubleshooting sections
- Add performance considerations
- Address security implications
"""
    try:
        llm_response = llm_text_gen(prompt, system_prompt=system_prompt)
        # Return the generated text; without this the function returned None
        return llm_response
    except Exception as err:
        logger.error(f"Failed to get response from {gpt_provider}: {err}")
        raise
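
Each prompt above interpolates `{repo_data}` directly, so the model receives the Python `repr` of the dictionary, including the full README text. A minimal sketch of an alternative, assuming the dictionary carries the README under a `content` key as in `GitHubContent`; the helper `format_repo_data` and its `max_content_chars` parameter are hypothetical and not part of this module:

```python
import json
from typing import Any, Dict


def format_repo_data(repo_data: Dict[str, Any], max_content_chars: int = 4000) -> str:
    """Serialize repository data as JSON, trimming an oversized 'content' field."""
    trimmed = dict(repo_data)  # shallow copy so the caller's dict is untouched
    content = trimmed.get("content")
    if isinstance(content, str) and len(content) > max_content_chars:
        trimmed["content"] = content[:max_content_chars] + "\n[truncated]"
    # default=str keeps serialization from failing on non-JSON values such as dates
    return json.dumps(trimmed, indent=2, ensure_ascii=False, default=str)


example = {"title": "repo", "stars": 12, "content": "# README\n" + "x" * 100}
print(format_repo_data(example, max_content_chars=20))
```

The resulting string could be interpolated into the prompt in place of the raw dictionary, which keeps the prompt size bounded for large READMEs.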

@@ -0,0 +1,157 @@
"""
Enhanced GitHub Blog Generator
This module provides comprehensive content generation from GitHub repositories,
including technical documentation, tutorials, case studies, and more.
"""
import asyncio  # required by asyncio.run(main()) at the bottom of this file
import os
import sys
import datetime
import json
from typing import Dict, List, Optional
from pathlib import Path
from loguru import logger
logger.remove()
logger.add(sys.stdout,
           colorize=True,
           format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
# Only GitHubScraper and GitHubContent are imported: the scraper module in this
# commit does not define get_gh_details_vision, get_readme_content,
# research_github_topics or check_if_already_written, and none of them is used here.
from .scrape_github_readme import GitHubScraper, GitHubContent
# NOTE: do_online_research, generate_blog_faq, blog_metadata and save_blog_to_file
# are called below but are not imported in this file; import them from their
# defining modules before running.
from .github_getting_started import (
    generate_technical_documentation,
    generate_getting_started_guide,
    generate_tutorial_series,
    generate_comparison_analysis,
    generate_case_studies,
    generate_contribution_guide,
    generate_security_guide,
    generate_performance_guide
)
class GitHubBlogGenerator:
"""Generator for various types of GitHub-related content."""
def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24):
"""Initialize the blog generator."""
self.cache_dir = Path(cache_dir)
self.scraper = GitHubScraper(cache_dir, ttl_hours)
self.output_dir = Path("generated_content")
self.output_dir.mkdir(exist_ok=True)
    async def generate_content(self, github_url: str, content_types: Optional[List[str]] = None) -> Dict[str, str]:
        """Generate various types of content from a GitHub repository."""
        if content_types is None:
            content_types = ["getting_started", "technical_docs", "tutorials"]
        try:
            # Scrape GitHub content. The scraper creates its aiohttp session only
            # inside its async context manager, so it must be entered first.
            async with self.scraper:
                repo_content = await self.scraper.scrape_github_content(github_url)
            repo_data = repo_content.dict()
            # Generate different types of content
            generated_content = {}
            for content_type in content_types:
                if content_type == "getting_started":
                    content = generate_getting_started_guide(repo_data)
                elif content_type == "technical_docs":
                    content = generate_technical_documentation(repo_data)
                elif content_type == "tutorials":
                    content = generate_tutorial_series(repo_data)
                elif content_type == "comparison":
                    content = generate_comparison_analysis(repo_data)
                elif content_type == "case_studies":
                    content = generate_case_studies(repo_data)
                elif content_type == "contribution":
                    content = generate_contribution_guide(repo_data)
                elif content_type == "security":
                    content = generate_security_guide(repo_data)
                elif content_type == "performance":
                    content = generate_performance_guide(repo_data)
                else:
                    logger.warning(f"Unknown content type: {content_type}")
                    continue
                generated_content[content_type] = content
            # Generate FAQs from online research (failure here is logged, not fatal)
            try:
                research_report = do_online_research(repo_content.title, "gemini", github_url)
                faqs = generate_blog_faq(research_report, "gemini")
                generated_content["faqs"] = faqs
            except Exception as err:
                logger.error(f"Failed to generate FAQs: {err}")
            return generated_content
        except Exception as err:
            logger.error(f"Failed to generate content: {err}")
            raise
def save_content(self, content: Dict[str, str], base_filename: str):
"""Save generated content to files."""
try:
for content_type, content_text in content.items():
# Generate metadata for each content type
title, meta_desc, tags, categories = blog_metadata(content_text, "gemini")
# Create filename with content type
filename = f"{base_filename}_{content_type}.md"
# Save content to file
save_blog_to_file(
content_text,
title,
meta_desc,
tags,
categories,
None # No image path for now
)
logger.info(f"Saved {content_type} content to {filename}")
except Exception as err:
logger.error(f"Failed to save content: {err}")
raise
async def main():
"""Example usage of the GitHub blog generator."""
generator = GitHubBlogGenerator()
# Example GitHub URLs
urls = [
"https://github.com/owner/repo",
"https://github.com/owner/another-repo"
]
content_types = [
"getting_started",
"technical_docs",
"tutorials",
"comparison",
"case_studies",
"contribution",
"security",
"performance"
]
for url in urls:
try:
# Generate content
content = await generator.generate_content(url, content_types)
# Create base filename from URL
base_filename = url.split("/")[-1]
# Save content
generator.save_content(content, base_filename)
except Exception as e:
logger.error(f"Error processing {url}: {e}")
if __name__ == "__main__":
asyncio.run(main())
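
The `if`/`elif` chain in `generate_content` could also be expressed as a lookup table. This is a minimal sketch of that pattern under stated assumptions, not the module's implementation: `build_content` is a hypothetical helper, and the lambdas are stand-ins for the real `generate_*` functions so the example runs without an LLM.

```python
from typing import Callable, Dict, List


def build_content(repo_data: dict, content_types: List[str],
                  generators: Dict[str, Callable[[dict], str]]) -> Dict[str, str]:
    """Run the generator registered for each requested content type."""
    generated: Dict[str, str] = {}
    for content_type in content_types:
        generator = generators.get(content_type)
        if generator is None:
            # Unknown types are skipped, mirroring the warn-and-continue branch
            print(f"Unknown content type: {content_type}")
            continue
        generated[content_type] = generator(repo_data)
    return generated


# Stand-in generators (hypothetical placeholders for the generate_* functions)
generators = {
    "getting_started": lambda data: f"Getting started with {data['title']}",
    "security": lambda data: f"Security guide for {data['title']}",
}

result = build_content({"title": "repo"}, ["getting_started", "unknown", "security"], generators)
print(result)
```

Adding a content type then means registering one more entry in the table instead of extending the chain.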

@@ -0,0 +1,427 @@
"""
Enhanced GitHub Content Scraper with Rate Limiting and Caching
This module provides functionality to scrape GitHub repositories, READMEs, and code files
for content marketing purposes. It includes async support, rate limiting, caching,
and comprehensive metadata collection.
"""
import os
import sys
import json
import asyncio
import aiohttp
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Union
from urllib.parse import urljoin, urlparse
import pandas as pd
from bs4 import BeautifulSoup
from loguru import logger
import requests
from pydantic import BaseModel, Field
import time
import pickle
from pathlib import Path
# Configure logging
logger.remove()
logger.add(sys.stdout,
colorize=True,
format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
class RateLimiter:
"""Rate limiter for GitHub API requests."""
def __init__(self, calls_per_minute: int = 30):
self.calls_per_minute = calls_per_minute
self.interval = 60 / calls_per_minute # seconds between calls
self.last_call_time = 0
self.lock = asyncio.Lock()
async def acquire(self):
"""Acquire rate limit token."""
async with self.lock:
current_time = time.time()
time_since_last_call = current_time - self.last_call_time
if time_since_last_call < self.interval:
await asyncio.sleep(self.interval - time_since_last_call)
self.last_call_time = time.time()
import hashlib  # used for stable cache file names


class Cache:
    """Cache for GitHub content."""
    def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.ttl = timedelta(hours=ttl_hours)
        self.cache_dir.mkdir(exist_ok=True)
    def _get_cache_path(self, key: str) -> Path:
        """Get cache file path for a key."""
        # The built-in hash() of a str is salted per process, so the same key
        # would map to a different file on every run. A SHA-256 digest is stable.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.cache"
    def get(self, key: str) -> Optional[Dict]:
        """Get cached value for key, or None if missing or expired."""
        cache_path = self._get_cache_path(key)
        if not cache_path.exists():
            return None
        try:
            with open(cache_path, 'rb') as f:
                data = pickle.load(f)
            # Drop entries older than the TTL
            if datetime.now() - data['timestamp'] > self.ttl:
                cache_path.unlink()
                return None
            return data['value']
        except Exception as e:
            logger.warning(f"Cache read error for {key}: {e}")
            return None
    def set(self, key: str, value: Dict):
        """Set cache value for key."""
        cache_path = self._get_cache_path(key)
        try:
            with open(cache_path, 'wb') as f:
                pickle.dump({
                    'timestamp': datetime.now(),
                    'value': value
                }, f)
        except Exception as e:
            logger.warning(f"Cache write error for {key}: {e}")
class GitHubContent(BaseModel):
"""Model for GitHub content analysis."""
title: str = Field("", description="Title of the content")
description: str = Field("", description="Description of the content")
content: str = Field("", description="Main content")
language: str = Field("", description="Programming language")
stars: int = Field(0, description="Number of stars")
forks: int = Field(0, description="Number of forks")
watchers: int = Field(0, description="Number of watchers")
last_updated: str = Field("", description="Last update date")
topics: List[str] = Field([], description="Repository topics")
contributors: List[str] = Field([], description="Contributor usernames")
readme_url: str = Field("", description="URL of the README")
raw_content_url: str = Field("", description="URL for raw content")
license: str = Field("", description="Repository license")
dependencies: List[str] = Field([], description="Project dependencies")
metadata: Dict = Field({}, description="Additional metadata")
class GitHubScraper:
"""Service for scraping GitHub content with rate limiting and caching."""
def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24, calls_per_minute: int = 30):
"""Initialize the scraper service."""
self.session = None
self.headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Accept': 'application/vnd.github.v3+json'
}
self.rate_limiter = RateLimiter(calls_per_minute)
self.cache = Cache(cache_dir, ttl_hours)
async def __aenter__(self):
"""Create aiohttp session when entering context."""
self.session = aiohttp.ClientSession(headers=self.headers)
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Close aiohttp session when exiting context."""
if self.session:
await self.session.close()
async def fetch_url(self, url: str, use_cache: bool = True) -> str:
"""Fetch URL content asynchronously with rate limiting and caching."""
if use_cache:
cached_content = self.cache.get(url)
if cached_content:
logger.debug(f"Cache hit for {url}")
return cached_content
await self.rate_limiter.acquire()
try:
async with self.session.get(url) as response:
if response.status == 200:
content = await response.text()
if use_cache:
self.cache.set(url, content)
return content
else:
error_msg = f"Failed to fetch URL: Status code {response.status}"
logger.error(error_msg)
raise Exception(error_msg)
except Exception as e:
logger.error(f"Error fetching URL {url}: {e}")
raise
def parse_github_url(self, url: str) -> Dict[str, str]:
"""Parse GitHub URL to extract repository information."""
parsed = urlparse(url)
path_parts = parsed.path.strip('/').split('/')
if len(path_parts) < 2:
raise ValueError("Invalid GitHub URL format")
return {
'owner': path_parts[0],
'repo': path_parts[1],
'branch': path_parts[3] if len(path_parts) > 3 else 'main',
'path': '/'.join(path_parts[4:]) if len(path_parts) > 4 else ''
}
async def get_repo_metadata(self, owner: str, repo: str) -> Dict:
"""Get repository metadata from GitHub API with caching."""
cache_key = f"metadata_{owner}_{repo}"
cached_metadata = self.cache.get(cache_key)
if cached_metadata:
return cached_metadata
await self.rate_limiter.acquire()
api_url = f"https://api.github.com/repos/{owner}/{repo}"
try:
async with self.session.get(api_url) as response:
if response.status == 200:
metadata = await response.json()
self.cache.set(cache_key, metadata)
return metadata
else:
logger.error(f"Failed to fetch repo metadata: {response.status}")
return {}
except Exception as e:
logger.error(f"Error fetching repo metadata: {e}")
return {}
async def get_readme_content(self, owner: str, repo: str, branch: str = 'main') -> Dict:
"""Get README content from GitHub with caching."""
cache_key = f"readme_{owner}_{repo}_{branch}"
cached_content = self.cache.get(cache_key)
if cached_content:
return cached_content
try:
# Try to get README from API first
await self.rate_limiter.acquire()
api_url = f"https://api.github.com/repos/{owner}/{repo}/readme"
async with self.session.get(api_url) as response:
if response.status == 200:
readme_data = await response.json()
content = {
'content': readme_data.get('content', ''),
'encoding': readme_data.get('encoding', 'base64'),
'url': readme_data.get('html_url', '')
}
self.cache.set(cache_key, content)
return content
# Fallback to scraping if API fails
readme_url = f"https://github.com/{owner}/{repo}/blob/{branch}/README.md"
html_content = await self.fetch_url(readme_url, use_cache=True)
soup = BeautifulSoup(html_content, 'html.parser')
# Find the README content
readme_content = soup.find('div', {'class': 'markdown-body'})
if readme_content:
content = {
'content': readme_content.get_text(),
'encoding': 'text',
'url': readme_url
}
self.cache.set(cache_key, content)
return content
return {}
except Exception as e:
logger.error(f"Error fetching README: {e}")
return {}
async def get_file_content(self, owner: str, repo: str, path: str, branch: str = 'main') -> Dict:
"""Get content of a specific file from GitHub with caching."""
cache_key = f"file_{owner}_{repo}_{path}_{branch}"
cached_content = self.cache.get(cache_key)
if cached_content:
return cached_content
try:
# Try to get file content from API first
await self.rate_limiter.acquire()
api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}"
async with self.session.get(api_url) as response:
if response.status == 200:
file_data = await response.json()
content = {
'content': file_data.get('content', ''),
'encoding': file_data.get('encoding', 'base64'),
'url': file_data.get('html_url', '')
}
self.cache.set(cache_key, content)
return content
# Fallback to scraping if API fails
file_url = f"https://github.com/{owner}/{repo}/blob/{branch}/{path}"
html_content = await self.fetch_url(file_url, use_cache=True)
soup = BeautifulSoup(html_content, 'html.parser')
# Find the file content
file_content = soup.find('div', {'class': 'file-content'})
if file_content:
content = {
'content': file_content.get_text(),
'encoding': 'text',
'url': file_url
}
self.cache.set(cache_key, content)
return content
return {}
except Exception as e:
logger.error(f"Error fetching file content: {e}")
return {}
async def get_repo_topics(self, owner: str, repo: str) -> List[str]:
"""Get repository topics with caching."""
cache_key = f"topics_{owner}_{repo}"
cached_topics = self.cache.get(cache_key)
if cached_topics:
return cached_topics
try:
await self.rate_limiter.acquire()
api_url = f"https://api.github.com/repos/{owner}/{repo}/topics"
async with self.session.get(api_url, headers={'Accept': 'application/vnd.github.mercy-preview+json'}) as response:
if response.status == 200:
data = await response.json()
topics = data.get('names', [])
self.cache.set(cache_key, topics)
return topics
return []
except Exception as e:
logger.error(f"Error fetching topics: {e}")
return []
async def get_contributors(self, owner: str, repo: str) -> List[str]:
"""Get repository contributors with caching."""
cache_key = f"contributors_{owner}_{repo}"
cached_contributors = self.cache.get(cache_key)
if cached_contributors:
return cached_contributors
try:
await self.rate_limiter.acquire()
api_url = f"https://api.github.com/repos/{owner}/{repo}/contributors"
async with self.session.get(api_url) as response:
if response.status == 200:
contributors = await response.json()
contributor_list = [contributor['login'] for contributor in contributors]
self.cache.set(cache_key, contributor_list)
return contributor_list
return []
except Exception as e:
logger.error(f"Error fetching contributors: {e}")
return []
    async def scrape_github_content(self, url: str) -> GitHubContent:
        """Main function to scrape GitHub content with caching."""
        cache_key = f"content_{url}"
        cached_content = self.cache.get(cache_key)
        if cached_content:
            return GitHubContent(**cached_content)
        try:
            # Parse the GitHub URL
            repo_info = self.parse_github_url(url)
            # Get repository metadata
            metadata = await self.get_repo_metadata(repo_info['owner'], repo_info['repo'])
            # Get content based on URL type
            if not repo_info['path'] or repo_info['path'].lower() == 'readme.md':
                content_data = await self.get_readme_content(
                    repo_info['owner'],
                    repo_info['repo'],
                    repo_info['branch']
                )
            else:
                content_data = await self.get_file_content(
                    repo_info['owner'],
                    repo_info['repo'],
                    repo_info['path'],
                    repo_info['branch']
                )
            # Get additional metadata
            topics = await self.get_repo_topics(repo_info['owner'], repo_info['repo'])
            contributors = await self.get_contributors(repo_info['owner'], repo_info['repo'])
            # Create GitHubContent object. The API returns null for description,
            # language and license when they are unset, so fall back explicitly:
            # dict.get() only applies its default when the key is missing.
            content = GitHubContent(
                title=metadata.get('name', ''),
                description=metadata.get('description') or '',
                content=content_data.get('content', ''),
                language=metadata.get('language') or '',
                stars=metadata.get('stargazers_count', 0),
                forks=metadata.get('forks_count', 0),
                watchers=metadata.get('watchers_count', 0),
                last_updated=metadata.get('updated_at', ''),
                topics=topics,
                contributors=contributors,
                readme_url=content_data.get('url', ''),
                raw_content_url=metadata.get('html_url', ''),
                license=(metadata.get('license') or {}).get('name', ''),
                metadata={
                    'size': metadata.get('size', 0),
                    'open_issues': metadata.get('open_issues_count', 0),
                    'default_branch': metadata.get('default_branch', 'main'),
                    'created_at': metadata.get('created_at', ''),
                    'pushed_at': metadata.get('pushed_at', '')
                }
            )
            # Cache the complete content
            self.cache.set(cache_key, content.dict())
            return content
        except Exception as e:
            logger.error(f"Error scraping GitHub content: {e}")
            raise
async def main():
"""Example usage of the GitHub scraper with rate limiting and caching."""
scraper = GitHubScraper(
cache_dir=".github_cache",
ttl_hours=24,
calls_per_minute=30
)
async with scraper:
# Example URLs
urls = [
"https://github.com/owner/repo",
"https://github.com/owner/repo/blob/main/README.md",
"https://github.com/owner/repo/blob/main/src/main.py"
]
for url in urls:
try:
content = await scraper.scrape_github_content(url)
print(f"Scraped content from {url}:")
print(json.dumps(content.dict(), indent=2))
except Exception as e:
print(f"Error scraping {url}: {e}")
if __name__ == "__main__":
asyncio.run(main())
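
`get_readme_content` and `get_file_content` return the API payload as is, so `content` is Base64 text whenever `encoding` is `"base64"`, and plain text when the HTML fallback was used. A minimal sketch of decoding it before further use; `decode_content` is a hypothetical helper, not part of the scraper:

```python
import base64
from typing import Dict


def decode_content(content_data: Dict[str, str]) -> str:
    """Return readable text from a {'content', 'encoding', 'url'} dictionary."""
    raw = content_data.get("content", "")
    if content_data.get("encoding") == "base64":
        # The payload is wrapped with newlines; b64decode ignores them by default
        return base64.b64decode(raw).decode("utf-8", errors="replace")
    return raw


# Simulate an API-style payload: Base64 text broken into lines
api_style = {
    "content": base64.encodebytes(b"# Hello\nThis is a README.\n").decode("ascii"),
    "encoding": "base64",
    "url": "https://github.com/owner/repo/blob/main/README.md",
}
scraped_style = {"content": "# Hello", "encoding": "text", "url": ""}

print(decode_content(api_style))
print(decode_content(scraped_style))
```

Applying such a step before building `GitHubContent` would give the content generators the README text instead of its Base64 form.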