AI Image and Audio Generation Improvements.
AI Video Generation Pre-Flight Checklist. Cost Estimate Improvements.
@@ -1,259 +0,0 @@
# GitHub Blog Generator

An AI-powered content generation system that creates documentation, tutorials, and guides from GitHub repositories. This module transforms GitHub repository data into several types of technical content.

## Features

### 1. Content Generation Types

The system can generate the following types of content from GitHub repositories:

- **Getting Started Guides**
  - Introduction and Overview
  - Prerequisites and Setup
  - Installation Instructions
  - Basic Usage Examples
  - Common Use Cases
  - Best Practices
  - Next Steps and Resources

- **Technical Documentation**
  - Architecture Overview
  - Core Components
  - Technical Specifications
  - Integration Points
  - Performance Considerations
  - Security Features
  - API Documentation
  - Configuration Options
  - Deployment Guidelines
  - Troubleshooting Guide

- **Tutorial Series**
  - Beginner Tutorials
    - Basic concepts
    - Simple examples
    - Step-by-step instructions
  - Intermediate Tutorials
    - Advanced features
    - Real-world examples
    - Best practices
  - Advanced Tutorials
    - Complex use cases
    - Performance optimization
    - Integration patterns

- **Comparison Analysis**
  - Feature Comparison
  - Performance Analysis
  - Use Case Suitability
  - Community and Support
  - Learning Curve
  - Integration Capabilities
  - Future Prospects

- **Case Studies**
  - Problem Statement
  - Solution Implementation
  - Technical Challenges
  - Results and Benefits
  - Lessons Learned
  - Future Improvements

- **Contribution Guides**
  - Development Setup
  - Code Style Guidelines
  - Testing Requirements
  - Documentation Standards
  - Pull Request Process
  - Review Guidelines
  - Community Guidelines

- **Security Guides**
  - Security Architecture
  - Authentication & Authorization
  - Data Protection
  - Secure Configuration
  - Vulnerability Management
  - Incident Response
  - Compliance Requirements

- **Performance Guides**
  - Performance Metrics
  - Optimization Techniques
  - Benchmarking Guidelines
  - Resource Management
  - Scaling Strategies
  - Monitoring Setup
  - Troubleshooting

### 2. GitHub Content Scraping

The module includes a GitHub content scraper with the following capabilities:

- **Rate Limiting**
  - Configurable API call limits
  - Automatic request throttling
  - Concurrent request management

- **Caching System**
  - Configurable cache duration (TTL)
  - Automatic cache invalidation
  - Efficient storage of scraped content

- **Content Extraction**
  - Repository metadata
  - README content
  - File contents
  - Repository topics
  - Contributor information
  - License information
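
The caching behaviour listed above (a configurable TTL with automatic invalidation) can be sketched as follows. This is a minimal illustration, not the module's own cache: the class name `TTLFileCache`, the JSON storage format, and the SHA-256 file naming are all assumptions. A digest is used for file names because Python's built-in `hash()` for strings is randomized per process, so it cannot name files that must be found again on a later run.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Optional


class TTLFileCache:
    """File-backed cache whose entries expire after `ttl_seconds` (illustrative)."""

    def __init__(self, cache_dir: str, ttl_seconds: float) -> None:
        self.cache_dir = Path(cache_dir)
        self.ttl_seconds = ttl_seconds
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # A SHA-256 digest yields the same file name in every process.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["timestamp"] > self.ttl_seconds:
            path.unlink()  # expired entry: invalidate it on read
            return None
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        entry = {"timestamp": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache = TTLFileCache(tmp, ttl_seconds=3600)
    cache.set("metadata_owner_repo", {"stars": 42})
    print(cache.get("metadata_owner_repo"))  # {'stars': 42}
    print(cache.get("missing"))              # None
```

JSON keeps the cache files human-readable and avoids unpickling data from disk, at the cost of only supporting JSON-serializable values.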

### 3. Content Enhancement

- **Online Research Integration**
  - Automatic topic research
  - Related content discovery
  - Industry trend analysis

- **FAQ Generation**
  - Automatic FAQ creation
  - Common question identification
  - Comprehensive answers

- **Metadata Generation**
  - SEO-optimized titles
  - Meta descriptions
  - Tags and categories
  - Content structuring

## Usage Examples

### Basic Usage

```python
import asyncio

from lib.ai_writers.github_blogs import GitHubBlogGenerator


async def main():
    # Initialize the generator
    generator = GitHubBlogGenerator()

    # Generate content for a GitHub repository
    content = await generator.generate_content(
        github_url="https://github.com/owner/repo",
        content_types=["getting_started", "technical_docs", "tutorials"],
    )

    # Save the generated content
    generator.save_content(content, "my_repository")


# `await` is only valid inside a coroutine, so run it with asyncio.run()
asyncio.run(main())
```
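
`generate_content` returns a dictionary keyed by content type (plus a `faqs` entry when FAQ generation succeeds), and `save_content` builds a name of the form `{base_filename}_{content_type}.md` for each entry. The sketch below only illustrates that naming scheme; the helper `planned_filenames` is hypothetical and not part of the module.

```python
from typing import Dict, List


def planned_filenames(content: Dict[str, str], base_filename: str) -> List[str]:
    """Return one '<base>_<content_type>.md' name per generated entry."""
    return [f"{base_filename}_{content_type}.md" for content_type in content]


# Stand-in for the dictionary returned by generate_content().
content = {"getting_started": "...", "technical_docs": "...", "faqs": "..."}
print(planned_filenames(content, "my_repository"))
# ['my_repository_getting_started.md', 'my_repository_technical_docs.md', 'my_repository_faqs.md']
```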

### Advanced Usage

```python
import asyncio

from lib.ai_writers.github_blogs import GitHubBlogGenerator


async def main():
    # Initialize with custom settings
    generator = GitHubBlogGenerator(
        cache_dir=".custom_cache",
        ttl_hours=48,
    )

    # Generate all content types
    content_types = [
        "getting_started",
        "technical_docs",
        "tutorials",
        "comparison",
        "case_studies",
        "contribution",
        "security",
        "performance",
    ]

    # Generate content for multiple repositories
    urls = [
        "https://github.com/owner/repo1",
        "https://github.com/owner/repo2",
    ]

    for url in urls:
        content = await generator.generate_content(url, content_types)
        # Use the last URL segment (the repository name) as the base filename
        generator.save_content(content, url.split("/")[-1])


asyncio.run(main())
```
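
The loop above takes the base filename from the last URL segment with `url.split("/")[-1]`, which yields an empty string for a URL with a trailing slash and keeps a `.git` suffix if one is present. A more tolerant helper might look like the sketch below; the name `repo_name_from_url` is an assumption made for illustration and does not exist in the module.

```python
from urllib.parse import urlparse


def repo_name_from_url(github_url: str) -> str:
    """Return the repository name from a GitHub URL (hypothetical helper)."""
    # Keep only non-empty path segments, so a trailing slash is harmless.
    parts = [segment for segment in urlparse(github_url).path.split("/") if segment]
    if len(parts) < 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>, got {github_url!r}")
    repo = parts[1]
    # Clone URLs often end in ".git"; drop it for a cleaner filename.
    return repo[:-4] if repo.endswith(".git") else repo


print(repo_name_from_url("https://github.com/owner/repo1/"))  # repo1
```

Taking the second path segment rather than the last one also gives the repository name for deeper links such as `.../owner/repo/tree/main/docs`.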

## Configuration Options

### GitHubBlogGenerator

- `cache_dir` (str): Directory for caching scraped content (default: ".github_cache")
- `ttl_hours` (int): Time-to-live for cached content in hours (default: 24)

### Content Generation

- `gpt_provider` (str): Choice of AI provider ("gemini" or "openai")
- `content_types` (List[str]): Types of content to generate (default: `["getting_started", "technical_docs", "tutorials"]`)
- `github_url` (str): URL of the GitHub repository
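
In the generator code, an unrecognized `content_types` entry is logged and skipped rather than rejected. If failing fast on typos is preferred, the requested list can be checked before calling the generator. The sketch below is one way to do that; `KNOWN_CONTENT_TYPES` and `split_content_types` are hypothetical names, and the identifiers are the eight listed in the advanced usage example.

```python
from typing import Iterable, List, Tuple

# The eight identifiers from the advanced usage example.
KNOWN_CONTENT_TYPES = (
    "getting_started", "technical_docs", "tutorials", "comparison",
    "case_studies", "contribution", "security", "performance",
)


def split_content_types(requested: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split identifiers into (known, unknown), keeping order and dropping duplicates."""
    known: List[str] = []
    unknown: List[str] = []
    seen = set()
    for name in requested:
        if name in seen:
            continue
        seen.add(name)
        (known if name in KNOWN_CONTENT_TYPES else unknown).append(name)
    return known, unknown


known, unknown = split_content_types(["tutorials", "secuirty", "tutorials", "security"])
print(known)    # ['tutorials', 'security']
print(unknown)  # ['secuirty']
```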

## Output Format

All generated content is saved in Markdown format with the following structure:

```markdown
# [Title]

[Generated content based on content type]

## Metadata
- Title: [SEO-optimized title]
- Description: [Meta description]
- Tags: [Generated tags]
- Categories: [Generated categories]
```
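
As a rough illustration of the layout above, the sketch below assembles such a document from its parts. The helper `render_markdown` is hypothetical, and joining tags and categories with commas is an assumption; the module's own writer (`save_blog_to_file`) is not shown in this commit and may format these fields differently.

```python
from typing import Sequence


def render_markdown(title: str, body: str, description: str,
                    tags: Sequence[str], categories: Sequence[str]) -> str:
    """Assemble a document in the layout shown above (hypothetical helper)."""
    lines = [
        f"# {title}",
        "",
        body.strip(),
        "",
        "## Metadata",
        f"- Title: {title}",
        f"- Description: {description}",
        f"- Tags: {', '.join(tags)}",              # comma-separated: an assumption
        f"- Categories: {', '.join(categories)}",
    ]
    return "\n".join(lines) + "\n"


print(render_markdown(
    "Getting Started with repo",
    "Install it, then run it.",
    "A short setup guide.",
    ["python", "setup"],
    ["Guides"],
))
```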

## Best Practices

1. **Rate Limiting**
   - Configure appropriate rate limits based on your GitHub API quota
   - Use caching to minimize API calls
   - Implement proper error handling for rate limit exceeded scenarios

2. **Content Generation**
   - Start with basic content types before generating advanced content
   - Review generated content for accuracy and completeness
   - Customize prompts for specific repository types

3. **Caching**
   - Set appropriate TTL based on repository update frequency
   - Clear cache when repository content changes significantly
   - Monitor cache size and performance

4. **Error Handling**
   - Implement proper error handling for API failures
   - Log errors for debugging
   - Provide fallback mechanisms for failed content generation
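
One common way to handle rate-limit errors, as recommended above, is to retry with exponential backoff. The sketch below is a generic pattern rather than the module's behaviour: `RateLimitExceeded` and `retry_with_backoff` are hypothetical names, and the scraper in this commit raises a generic exception on any non-200 status, so mapping a rate-limit response to a dedicated error type is an assumption.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitExceeded(Exception):
    """Hypothetical error raised when the API reports an exhausted quota."""


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Retry `operation` on RateLimitExceeded, doubling the delay each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except RateLimitExceeded:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller apply its fallback
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("unreachable")


async def demo() -> str:
    calls = {"count": 0}

    async def flaky_fetch() -> str:
        # Fails twice with a rate-limit error, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise RateLimitExceeded("quota exhausted")
        return "ok"

    return await retry_with_backoff(flaky_fetch, base_delay=0.01)


print(asyncio.run(demo()))  # ok
```

Re-raising after the last attempt keeps the failure visible, so the caller can log it and fall back, for example by skipping that content type.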

## Dependencies

- Python 3.8+
- aiohttp
- beautifulsoup4
- loguru
- pydantic
- requests
- pandas

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

[Your License Here]

## Support

For support, please [create an issue](https://github.com/your-repo/issues) or contact the maintainers.
@@ -1,254 +0,0 @@
|
||||
"""
|
||||
Enhanced GitHub Content Generator
|
||||
|
||||
This module provides various content generation capabilities from GitHub repository data,
|
||||
including getting started guides, technical documentation, tutorials, and more.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from typing import Dict, List, Optional
|
||||
from loguru import logger
|
||||
|
||||
from lib.gpt_providers.text_generation.main_text_generation import llm_text_gen
|
||||
|
||||
logger.remove()
|
||||
logger.add(sys.stdout,
|
||||
colorize=True,
|
||||
format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
|
||||
|
||||
def generate_technical_documentation(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate comprehensive technical documentation from repository data."""
|
||||
prompt = f"""As an expert technical writer, create detailed technical documentation for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Please create a comprehensive technical documentation that includes:
|
||||
1. Architecture Overview
|
||||
2. Core Components
|
||||
3. Technical Specifications
|
||||
4. Integration Points
|
||||
5. Performance Considerations
|
||||
6. Security Features
|
||||
7. API Documentation (if applicable)
|
||||
8. Configuration Options
|
||||
9. Deployment Guidelines
|
||||
10. Troubleshooting Guide
|
||||
|
||||
Format the documentation in markdown with appropriate headers, code blocks, and diagrams.
|
||||
Include real-world examples and best practices.
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_getting_started_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a beginner-friendly getting started guide."""
|
||||
prompt = f"""As an expert programmer and teacher, create a comprehensive getting started guide for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a step-by-step guide that includes:
|
||||
1. Introduction and Overview
|
||||
2. Prerequisites and Setup
|
||||
3. Installation Instructions
|
||||
4. Basic Usage Examples
|
||||
5. Common Use Cases
|
||||
6. Best Practices
|
||||
7. Next Steps and Resources
|
||||
|
||||
Make the guide:
|
||||
- Beginner-friendly with clear explanations
|
||||
- Include practical examples with code snippets
|
||||
- Add emojis for better readability
|
||||
- Include troubleshooting tips
|
||||
- Provide links to additional resources
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_tutorial_series(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a series of tutorials for different skill levels."""
|
||||
prompt = f"""As an expert educator, create a series of tutorials for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a structured tutorial series that includes:
|
||||
1. Beginner Tutorial
|
||||
- Basic concepts
|
||||
- Simple examples
|
||||
- Step-by-step instructions
|
||||
|
||||
2. Intermediate Tutorial
|
||||
- Advanced features
|
||||
- Real-world examples
|
||||
- Best practices
|
||||
|
||||
3. Advanced Tutorial
|
||||
- Complex use cases
|
||||
- Performance optimization
|
||||
- Integration patterns
|
||||
|
||||
Each tutorial should:
|
||||
- Be self-contained
|
||||
- Include practical examples
|
||||
- Have clear learning objectives
|
||||
- Include exercises and challenges
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_comparison_analysis(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a comparison analysis with similar tools/frameworks."""
|
||||
prompt = f"""As a technical analyst, create a comprehensive comparison analysis for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a detailed comparison that includes:
|
||||
1. Feature Comparison
|
||||
2. Performance Analysis
|
||||
3. Use Case Suitability
|
||||
4. Community and Support
|
||||
5. Learning Curve
|
||||
6. Integration Capabilities
|
||||
7. Future Prospects
|
||||
|
||||
Include:
|
||||
- Pros and Cons
|
||||
- Real-world use cases
|
||||
- Industry adoption
|
||||
- Community feedback
|
||||
- Future roadmap
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_case_studies(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate real-world case studies and success stories."""
|
||||
prompt = f"""As a technical writer, create compelling case studies for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create detailed case studies that include:
|
||||
1. Problem Statement
|
||||
2. Solution Implementation
|
||||
3. Technical Challenges
|
||||
4. Results and Benefits
|
||||
5. Lessons Learned
|
||||
6. Future Improvements
|
||||
|
||||
Make the case studies:
|
||||
- Based on real-world scenarios
|
||||
- Include technical details
|
||||
- Show measurable results
|
||||
- Provide actionable insights
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_contribution_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a comprehensive contribution guide."""
|
||||
prompt = f"""As an open-source maintainer, create a detailed contribution guide for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a contribution guide that includes:
|
||||
1. Development Setup
|
||||
2. Code Style Guidelines
|
||||
3. Testing Requirements
|
||||
4. Documentation Standards
|
||||
5. Pull Request Process
|
||||
6. Review Guidelines
|
||||
7. Community Guidelines
|
||||
|
||||
Make the guide:
|
||||
- Clear and concise
|
||||
- Include examples
|
||||
- Cover all contribution types
|
||||
- Provide templates
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_security_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a security best practices guide."""
|
||||
prompt = f"""As a security expert, create a comprehensive security guide for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a security guide that includes:
|
||||
1. Security Architecture
|
||||
2. Authentication & Authorization
|
||||
3. Data Protection
|
||||
4. Secure Configuration
|
||||
5. Vulnerability Management
|
||||
6. Incident Response
|
||||
7. Compliance Requirements
|
||||
|
||||
Make the guide:
|
||||
- Practical and actionable
|
||||
- Include security checklists
|
||||
- Provide code examples
|
||||
- Cover common vulnerabilities
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def generate_performance_guide(repo_data: Dict, gpt_provider: str = "gemini") -> str:
|
||||
"""Generate a performance optimization guide."""
|
||||
prompt = f"""As a performance optimization expert, create a detailed performance guide for the following GitHub repository:
|
||||
|
||||
Repository Data:
|
||||
{repo_data}
|
||||
|
||||
Create a performance guide that includes:
|
||||
1. Performance Metrics
|
||||
2. Optimization Techniques
|
||||
3. Benchmarking Guidelines
|
||||
4. Resource Management
|
||||
5. Scaling Strategies
|
||||
6. Monitoring Setup
|
||||
7. Troubleshooting
|
||||
|
||||
Make the guide:
|
||||
- Data-driven
|
||||
- Include benchmarks
|
||||
- Provide optimization tips
|
||||
- Cover different scales
|
||||
"""
|
||||
return _get_llm_response(prompt, gpt_provider)
|
||||
|
||||
def _get_llm_response(prompt: str, gpt_provider: str) -> str:
    """Get response from the specified LLM provider."""
    system_prompt = """You are an expert technical writer and GitHub repository analyst with deep expertise in software development, documentation, and technical communication.

Your role is to create high-quality, accurate, and engaging content based on GitHub repository data. You should:

1. **Technical Accuracy**
- Ensure all technical information is precise and up-to-date
- Verify code examples and configurations
- Cross-reference documentation and source code
- Maintain consistency with repository standards

2. **Content Structure**
- Use clear hierarchical organization
- Include appropriate code blocks and examples
- Add relevant diagrams and visual aids
- Break complex topics into digestible sections

3. **Writing Style**
- Maintain a professional yet approachable tone
- Use active voice and clear language
- Include practical examples and use cases
- Add relevant emojis for better readability

4. **Best Practices**
- Follow industry-standard documentation practices
- Include troubleshooting sections
- Add performance considerations
- Address security implications
"""
    try:
        # Hand the generated text back to the caller.
        return llm_text_gen(prompt, system_prompt=system_prompt)
    except Exception as err:
        logger.error(f"Failed to get response from {gpt_provider}: {err}")
        raise
|
||||
@@ -1,157 +0,0 @@
|
||||
"""
|
||||
Enhanced GitHub Blog Generator
|
||||
|
||||
This module provides comprehensive content generation from GitHub repositories,
|
||||
including technical documentation, tutorials, case studies, and more.
|
||||
"""
|
||||
|
||||
import asyncio
import os
|
||||
import sys
|
||||
import datetime
|
||||
import json
|
||||
from typing import Dict, List, Optional
|
||||
from pathlib import Path
|
||||
|
||||
from loguru import logger
|
||||
logger.remove()
|
||||
logger.add(sys.stdout,
|
||||
colorize=True,
|
||||
format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
|
||||
|
||||
from .scrape_github_readme import GitHubScraper, GitHubContent
|
||||
from .scrape_github_readme import get_gh_details_vision, get_readme_content
|
||||
from .scrape_github_readme import research_github_topics, check_if_already_written
|
||||
from .github_getting_started import (
|
||||
generate_technical_documentation,
|
||||
generate_getting_started_guide,
|
||||
generate_tutorial_series,
|
||||
generate_comparison_analysis,
|
||||
generate_case_studies,
|
||||
generate_contribution_guide,
|
||||
generate_security_guide,
|
||||
generate_performance_guide
|
||||
)
|
||||
|
||||
|
||||
class GitHubBlogGenerator:
|
||||
"""Generator for various types of GitHub-related content."""
|
||||
|
||||
def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24):
|
||||
"""Initialize the blog generator."""
|
||||
self.cache_dir = Path(cache_dir)
|
||||
self.scraper = GitHubScraper(cache_dir, ttl_hours)
|
||||
self.output_dir = Path("generated_content")
|
||||
self.output_dir.mkdir(exist_ok=True)
|
||||
|
||||
async def generate_content(self, github_url: str, content_types: Optional[List[str]] = None) -> Dict[str, str]:
|
||||
"""Generate various types of content from a GitHub repository."""
|
||||
if content_types is None:
|
||||
content_types = ["getting_started", "technical_docs", "tutorials"]
|
||||
|
||||
try:
|
||||
# Scrape GitHub content
|
||||
repo_content = await self.scraper.scrape_github_content(github_url)
|
||||
|
||||
# Generate different types of content
|
||||
generated_content = {}
|
||||
|
||||
for content_type in content_types:
|
||||
if content_type == "getting_started":
|
||||
content = generate_getting_started_guide(repo_content.dict())
|
||||
elif content_type == "technical_docs":
|
||||
content = generate_technical_documentation(repo_content.dict())
|
||||
elif content_type == "tutorials":
|
||||
content = generate_tutorial_series(repo_content.dict())
|
||||
elif content_type == "comparison":
|
||||
content = generate_comparison_analysis(repo_content.dict())
|
||||
elif content_type == "case_studies":
|
||||
content = generate_case_studies(repo_content.dict())
|
||||
elif content_type == "contribution":
|
||||
content = generate_contribution_guide(repo_content.dict())
|
||||
elif content_type == "security":
|
||||
content = generate_security_guide(repo_content.dict())
|
||||
elif content_type == "performance":
|
||||
content = generate_performance_guide(repo_content.dict())
|
||||
else:
|
||||
logger.warning(f"Unknown content type: {content_type}")
|
||||
continue
|
||||
|
||||
generated_content[content_type] = content
|
||||
|
||||
# Generate FAQs from online research
|
||||
try:
|
||||
research_report = do_online_research(repo_content.title, "gemini", github_url)
|
||||
faqs = generate_blog_faq(research_report, "gemini")
|
||||
generated_content["faqs"] = faqs
|
||||
except Exception as err:
|
||||
logger.error(f"Failed to generate FAQs: {err}")
|
||||
|
||||
return generated_content
|
||||
|
||||
except Exception as err:
|
||||
logger.error(f"Failed to generate content: {err}")
|
||||
raise
|
||||
|
||||
def save_content(self, content: Dict[str, str], base_filename: str):
|
||||
"""Save generated content to files."""
|
||||
try:
|
||||
for content_type, content_text in content.items():
|
||||
# Generate metadata for each content type
|
||||
title, meta_desc, tags, categories = blog_metadata(content_text, "gemini")
|
||||
|
||||
# Create filename with content type
|
||||
filename = f"{base_filename}_{content_type}.md"
|
||||
|
||||
# Save content to file
|
||||
save_blog_to_file(
|
||||
content_text,
|
||||
title,
|
||||
meta_desc,
|
||||
tags,
|
||||
categories,
|
||||
None # No image path for now
|
||||
)
|
||||
|
||||
logger.info(f"Saved {content_type} content to {filename}")
|
||||
|
||||
except Exception as err:
|
||||
logger.error(f"Failed to save content: {err}")
|
||||
raise
|
||||
|
||||
async def main():
|
||||
"""Example usage of the GitHub blog generator."""
|
||||
generator = GitHubBlogGenerator()
|
||||
|
||||
# Example GitHub URLs
|
||||
urls = [
|
||||
"https://github.com/owner/repo",
|
||||
"https://github.com/owner/another-repo"
|
||||
]
|
||||
|
||||
content_types = [
|
||||
"getting_started",
|
||||
"technical_docs",
|
||||
"tutorials",
|
||||
"comparison",
|
||||
"case_studies",
|
||||
"contribution",
|
||||
"security",
|
||||
"performance"
|
||||
]
|
||||
|
||||
for url in urls:
|
||||
try:
|
||||
# Generate content
|
||||
content = await generator.generate_content(url, content_types)
|
||||
|
||||
# Create base filename from URL
|
||||
base_filename = url.split("/")[-1]
|
||||
|
||||
# Save content
|
||||
generator.save_content(content, base_filename)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing {url}: {e}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -1,427 +0,0 @@
|
||||
"""
|
||||
Enhanced GitHub Content Scraper with Rate Limiting and Caching
|
||||
|
||||
This module provides functionality to scrape GitHub repositories, READMEs, and code files
|
||||
for content marketing purposes. It includes async support, rate limiting, caching,
|
||||
and comprehensive metadata collection.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import asyncio
|
||||
import aiohttp
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Optional, Union
|
||||
from urllib.parse import urljoin, urlparse
|
||||
import pandas as pd
|
||||
from bs4 import BeautifulSoup
|
||||
from loguru import logger
|
||||
import requests
|
||||
from pydantic import BaseModel, Field
|
||||
import time
|
||||
import pickle
|
||||
from pathlib import Path
|
||||
|
||||
# Configure logging
|
||||
logger.remove()
|
||||
logger.add(sys.stdout,
|
||||
colorize=True,
|
||||
format="<level>{level}</level>|<green>{file}:{line}:{function}</green>| {message}")
|
||||
|
||||
class RateLimiter:
|
||||
"""Rate limiter for GitHub API requests."""
|
||||
|
||||
def __init__(self, calls_per_minute: int = 30):
|
||||
self.calls_per_minute = calls_per_minute
|
||||
self.interval = 60 / calls_per_minute # seconds between calls
|
||||
self.last_call_time = 0
|
||||
self.lock = asyncio.Lock()
|
||||
|
||||
async def acquire(self):
|
||||
"""Acquire rate limit token."""
|
||||
async with self.lock:
|
||||
current_time = time.time()
|
||||
time_since_last_call = current_time - self.last_call_time
|
||||
|
||||
if time_since_last_call < self.interval:
|
||||
await asyncio.sleep(self.interval - time_since_last_call)
|
||||
|
||||
self.last_call_time = time.time()
|
||||
|
||||
class Cache:
    """Cache for GitHub content."""

    def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.ttl = timedelta(hours=ttl_hours)
        self.cache_dir.mkdir(exist_ok=True)

    def _get_cache_path(self, key: str) -> Path:
        """Get cache file path for a key."""
        # Built-in hash() of a str is randomized per process, so it cannot name
        # files that must be found again on a later run; use a stable digest.
        import hashlib
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.cache"

    def get(self, key: str) -> Optional[Dict]:
        """Get cached value for key."""
        cache_path = self._get_cache_path(key)

        if not cache_path.exists():
            return None

        try:
            with open(cache_path, 'rb') as f:
                data = pickle.load(f)
            # Check expiry after the file is closed so it can be deleted safely.
            if datetime.now() - data['timestamp'] > self.ttl:
                cache_path.unlink()
                return None
            return data['value']
        except Exception as e:
            logger.warning(f"Cache read error for {key}: {e}")
            return None

    def set(self, key: str, value: Dict):
        """Set cache value for key."""
        cache_path = self._get_cache_path(key)

        try:
            with open(cache_path, 'wb') as f:
                pickle.dump({
                    'timestamp': datetime.now(),
                    'value': value
                }, f)
        except Exception as e:
            logger.warning(f"Cache write error for {key}: {e}")
|
||||
|
||||
class GitHubContent(BaseModel):
|
||||
"""Model for GitHub content analysis."""
|
||||
title: str = Field("", description="Title of the content")
|
||||
description: str = Field("", description="Description of the content")
|
||||
content: str = Field("", description="Main content")
|
||||
language: str = Field("", description="Programming language")
|
||||
stars: int = Field(0, description="Number of stars")
|
||||
forks: int = Field(0, description="Number of forks")
|
||||
watchers: int = Field(0, description="Number of watchers")
|
||||
last_updated: str = Field("", description="Last update date")
|
||||
topics: List[str] = Field([], description="Repository topics")
|
||||
contributors: List[str] = Field([], description="Contributor usernames")
|
||||
readme_url: str = Field("", description="URL of the README")
|
||||
raw_content_url: str = Field("", description="URL for raw content")
|
||||
license: str = Field("", description="Repository license")
|
||||
dependencies: List[str] = Field([], description="Project dependencies")
|
||||
metadata: Dict = Field({}, description="Additional metadata")
|
||||
|
||||
class GitHubScraper:
|
||||
"""Service for scraping GitHub content with rate limiting and caching."""
|
||||
|
||||
def __init__(self, cache_dir: str = ".github_cache", ttl_hours: int = 24, calls_per_minute: int = 30):
|
||||
"""Initialize the scraper service."""
|
||||
self.session = None
|
||||
self.headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
|
||||
'Accept': 'application/vnd.github.v3+json'
|
||||
}
|
||||
self.rate_limiter = RateLimiter(calls_per_minute)
|
||||
self.cache = Cache(cache_dir, ttl_hours)
|
||||
|
||||
async def __aenter__(self):
|
||||
"""Create aiohttp session when entering context."""
|
||||
self.session = aiohttp.ClientSession(headers=self.headers)
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
"""Close aiohttp session when exiting context."""
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
async def fetch_url(self, url: str, use_cache: bool = True) -> str:
|
||||
"""Fetch URL content asynchronously with rate limiting and caching."""
|
||||
if use_cache:
|
||||
cached_content = self.cache.get(url)
|
||||
if cached_content:
|
||||
logger.debug(f"Cache hit for {url}")
|
||||
return cached_content
|
||||
|
||||
await self.rate_limiter.acquire()
|
||||
|
||||
try:
|
||||
async with self.session.get(url) as response:
|
||||
if response.status == 200:
|
||||
content = await response.text()
|
||||
if use_cache:
|
||||
self.cache.set(url, content)
|
||||
return content
|
||||
else:
|
||||
error_msg = f"Failed to fetch URL: Status code {response.status}"
|
||||
logger.error(error_msg)
|
||||
raise Exception(error_msg)
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching URL {url}: {e}")
|
||||
raise
|
||||
|
||||
def parse_github_url(self, url: str) -> Dict[str, str]:
|
||||
"""Parse GitHub URL to extract repository information."""
|
||||
parsed = urlparse(url)
|
||||
path_parts = parsed.path.strip('/').split('/')
|
||||
|
||||
if len(path_parts) < 2:
|
||||
raise ValueError("Invalid GitHub URL format")
|
||||
|
||||
return {
|
||||
'owner': path_parts[0],
|
||||
'repo': path_parts[1],
|
||||
'branch': path_parts[3] if len(path_parts) > 3 else 'main',
|
||||
'path': '/'.join(path_parts[4:]) if len(path_parts) > 4 else ''
|
||||
}
|
||||
|
||||
async def get_repo_metadata(self, owner: str, repo: str) -> Dict:
|
||||
"""Get repository metadata from GitHub API with caching."""
|
||||
cache_key = f"metadata_{owner}_{repo}"
|
||||
cached_metadata = self.cache.get(cache_key)
|
||||
if cached_metadata:
|
||||
return cached_metadata
|
||||
|
||||
await self.rate_limiter.acquire()
|
||||
|
||||
api_url = f"https://api.github.com/repos/{owner}/{repo}"
|
||||
try:
|
||||
async with self.session.get(api_url) as response:
|
||||
if response.status == 200:
|
||||
metadata = await response.json()
|
||||
self.cache.set(cache_key, metadata)
|
||||
return metadata
|
||||
else:
|
||||
logger.error(f"Failed to fetch repo metadata: {response.status}")
|
||||
return {}
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching repo metadata: {e}")
|
||||
return {}
|
||||
|
||||
async def get_readme_content(self, owner: str, repo: str, branch: str = 'main') -> Dict:
|
||||
"""Get README content from GitHub with caching."""
|
||||
cache_key = f"readme_{owner}_{repo}_{branch}"
|
||||
cached_content = self.cache.get(cache_key)
|
||||
if cached_content:
|
||||
return cached_content
|
||||
|
||||
try:
|
||||
# Try to get README from API first
|
||||
await self.rate_limiter.acquire()
|
||||
api_url = f"https://api.github.com/repos/{owner}/{repo}/readme"
|
||||
async with self.session.get(api_url) as response:
|
||||
if response.status == 200:
|
||||
readme_data = await response.json()
|
||||
content = {
|
||||
'content': readme_data.get('content', ''),
|
||||
'encoding': readme_data.get('encoding', 'base64'),
|
||||
'url': readme_data.get('html_url', '')
|
||||
}
|
||||
self.cache.set(cache_key, content)
|
||||
return content
|
||||
|
||||
# Fallback to scraping if API fails
|
||||
readme_url = f"https://github.com/{owner}/{repo}/blob/{branch}/README.md"
|
||||
html_content = await self.fetch_url(readme_url, use_cache=True)
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
|
||||
# Find the README content
|
||||
readme_content = soup.find('div', {'class': 'markdown-body'})
|
||||
if readme_content:
|
||||
content = {
|
||||
'content': readme_content.get_text(),
|
||||
'encoding': 'text',
|
||||
'url': readme_url
|
||||
}
|
||||
self.cache.set(cache_key, content)
|
||||
return content
|
||||
|
||||
return {}
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching README: {e}")
|
||||
return {}
|
||||
|
||||
    async def get_file_content(self, owner: str, repo: str, path: str, branch: str = 'main') -> Dict:
        """Get content of a specific file from GitHub with caching."""
        cache_key = f"file_{owner}_{repo}_{path}_{branch}"
        cached_content = self.cache.get(cache_key)
        if cached_content:
            return cached_content

        try:
            # Try to get file content from API first
            await self.rate_limiter.acquire()
            api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}"
            async with self.session.get(api_url) as response:
                if response.status == 200:
                    file_data = await response.json()
                    content = {
                        'content': file_data.get('content', ''),
                        'encoding': file_data.get('encoding', 'base64'),
                        'url': file_data.get('html_url', '')
                    }
                    self.cache.set(cache_key, content)
                    return content

            # Fallback to scraping if API fails
            file_url = f"https://github.com/{owner}/{repo}/blob/{branch}/{path}"
            html_content = await self.fetch_url(file_url, use_cache=True)
            soup = BeautifulSoup(html_content, 'html.parser')

            # Find the file content
            file_content = soup.find('div', {'class': 'file-content'})
            if file_content:
                content = {
                    'content': file_content.get_text(),
                    'encoding': 'text',
                    'url': file_url
                }
                self.cache.set(cache_key, content)
                return content

            return {}
        except Exception as e:
            logger.error(f"Error fetching file content: {e}")
            return {}

    async def get_repo_topics(self, owner: str, repo: str) -> List[str]:
        """Get repository topics with caching."""
        cache_key = f"topics_{owner}_{repo}"
        cached_topics = self.cache.get(cache_key)
        if cached_topics:
            return cached_topics

        try:
            await self.rate_limiter.acquire()
            api_url = f"https://api.github.com/repos/{owner}/{repo}/topics"
            # The topics endpoint no longer needs the "mercy-preview" media type;
            # the standard GitHub JSON media type is used instead.
            async with self.session.get(api_url, headers={'Accept': 'application/vnd.github+json'}) as response:
                if response.status == 200:
                    data = await response.json()
                    topics = data.get('names', [])
                    self.cache.set(cache_key, topics)
                    return topics
                return []
        except Exception as e:
            logger.error(f"Error fetching topics: {e}")
            return []

    async def get_contributors(self, owner: str, repo: str) -> List[str]:
        """Get repository contributors with caching."""
        cache_key = f"contributors_{owner}_{repo}"
        cached_contributors = self.cache.get(cache_key)
        if cached_contributors:
            return cached_contributors

        try:
            await self.rate_limiter.acquire()
            api_url = f"https://api.github.com/repos/{owner}/{repo}/contributors"
            async with self.session.get(api_url) as response:
                if response.status == 200:
                    contributors = await response.json()
                    contributor_list = [contributor['login'] for contributor in contributors]
                    self.cache.set(cache_key, contributor_list)
                    return contributor_list
                return []
        except Exception as e:
            logger.error(f"Error fetching contributors: {e}")
            return []

    async def scrape_github_content(self, url: str) -> GitHubContent:
        """Main function to scrape GitHub content with caching."""
        cache_key = f"content_{url}"
        cached_content = self.cache.get(cache_key)
        if cached_content:
            return GitHubContent(**cached_content)

        try:
            # Parse the GitHub URL
            repo_info = self.parse_github_url(url)

            # Get repository metadata
            metadata = await self.get_repo_metadata(repo_info['owner'], repo_info['repo'])

            # Get content based on URL type
            if not repo_info['path'] or repo_info['path'].lower() == 'readme.md':
                content_data = await self.get_readme_content(
                    repo_info['owner'],
                    repo_info['repo'],
                    repo_info['branch']
                )
            else:
                content_data = await self.get_file_content(
                    repo_info['owner'],
                    repo_info['repo'],
                    repo_info['path'],
                    repo_info['branch']
                )

            # Get additional metadata
            topics = await self.get_repo_topics(repo_info['owner'], repo_info['repo'])
            contributors = await self.get_contributors(repo_info['owner'], repo_info['repo'])

            # Create GitHubContent object.
            # Note: content fetched via the API is passed through as returned,
            # i.e. still encoded per content_data['encoding'] (typically base64).
            content = GitHubContent(
                title=metadata.get('name', ''),
                description=metadata.get('description', ''),
                content=content_data.get('content', ''),
                language=metadata.get('language', ''),
                stars=metadata.get('stargazers_count', 0),
                forks=metadata.get('forks_count', 0),
                watchers=metadata.get('watchers_count', 0),
                last_updated=metadata.get('updated_at', ''),
                topics=topics,
                contributors=contributors,
                readme_url=content_data.get('url', ''),
                raw_content_url=metadata.get('html_url', ''),
                # The API returns "license": null for unlicensed repos, so guard
                # against None before reading the license name.
                license=(metadata.get('license') or {}).get('name', ''),
                metadata={
                    'size': metadata.get('size', 0),
                    'open_issues': metadata.get('open_issues_count', 0),
                    'default_branch': metadata.get('default_branch', 'main'),
                    'created_at': metadata.get('created_at', ''),
                    'pushed_at': metadata.get('pushed_at', '')
                }
            )

            # Cache the complete content
            self.cache.set(cache_key, content.dict())

            return content

        except Exception as e:
            logger.error(f"Error scraping GitHub content: {e}")
            raise

async def main():
    """Example usage of the GitHub scraper with rate limiting and caching."""
    scraper = GitHubScraper(
        cache_dir=".github_cache",
        ttl_hours=24,
        calls_per_minute=30
    )

    async with scraper:
        # Example URLs
        urls = [
            "https://github.com/owner/repo",
            "https://github.com/owner/repo/blob/main/README.md",
            "https://github.com/owner/repo/blob/main/src/main.py"
        ]

        for url in urls:
            try:
                content = await scraper.scrape_github_content(url)
                print(f"Scraped content from {url}:")
                print(json.dumps(content.dict(), indent=2))
            except Exception as e:
                print(f"Error scraping {url}: {e}")


if __name__ == "__main__":
    asyncio.run(main())