Blog Scraper: Technical Description
Overview
The Blog Scraper is designed to automatically collect and process blog posts from the Metis.io website. It uses a lightweight HTML parsing approach to extract relevant information efficiently and store it in a structured format suitable for consumption by Alith AI agents.
Architecture
The blog scraper follows a straightforward pipeline architecture (sketched in code after the list):
- Fetch HTML: Retrieve the blog listing page
- Extract Post Metadata: Parse HTML to identify blog posts and their metadata
- Fetch Individual Posts: Retrieve the full content of each blog post
- Process and Structure: Transform the raw data into a structured format
- Store Results: Save the processed data as JSON
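Read as code, the pipeline amounts to a short sequence of calls. The sketch below is illustrative rather than a copy of the script: fetch_blog_posts(), parse_blog_item(), and generate_date_hash_id() appear later in this document, while load_existing(), scrape_post_content(), and save_knowledge() are assumed helper names.
def run_pipeline():
    knowledge = load_existing(JSON_PATH)              # assumed helper: read blog.json if present
    known_urls = {b["url"] for b in knowledge["blogs"]}
    new_count = 0
    # 1. + 2. Fetch the listing page and parse per-post metadata.
    for item in fetch_blog_posts():
        blog = parse_blog_item(item)
        if blog["url"] in known_urls or new_count >= MAX_BLOGS:
            continue                                  # already stored, or batch limit reached
        # 3. + 4. Fetch the full post and append the structured record.
        blog["id"] = generate_date_hash_id(blog)
        blog["content"] = scrape_post_content(blog["url"])   # assumed helper
        knowledge["blogs"].append(blog)
        knowledge["latest_id"] = blog["id"]
        new_count += 1
    # 5. Store the updated collection.
    save_knowledge(knowledge, JSON_PATH)              # assumed helper: write JSON back to disk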
Implementation Details
Core Components
scripts/addknowledge_blog.py # Main scraper implementation
.github/workflows/addknowledge_blog.yaml # Automation workflow
knowledge/metis/blog.json # Output knowledge file
Dependencies
The blog scraper has minimal dependencies to ensure reliability (the matching import block is shown after the list):
- requests: For making HTTP requests to the blog website
- beautifulsoup4: For parsing HTML content
- hashlib: For generating unique identifiers
- Standard library modules (json, datetime, os)
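For reference, a minimal import block consistent with the snippets in this document would look like the following (the script's actual ordering may differ):
import hashlib
import json
import os
from datetime import datetime

import requests
from bs4 import BeautifulSoup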
Key Functions
Fetching Blog Posts
def fetch_blog_posts():
response = requests.get(BLOG_URL)
if response.status_code != 200:
raise Exception("Failed to fetch blog page.")
soup = BeautifulSoup(response.text, "html.parser")
return soup.find_all(
"div", {"role": "listitem", "class": "collection-item tech w-dyn-item"}
)
Parsing Blog Items
def parse_blog_item(item):
link_tag = item.find("a", {"aria-label": "link-article"})
link = link_tag["href"]
date = item.find("div", class_="text-1-pc").text.strip()
    author_tag = item.find("div", class_="autor-tag")
    author = author_tag.text.strip() if author_tag else "Unknown"
title = item.find("div", {"fs-cmsfilter-field": "title"}).text.strip()
summary = item.find("div", class_="text-intro-pc").text.strip()
return {
"url": f"https://metis.io{link}",
"date": date,
"author": author,
"title": title,
"summary": summary,
}
Generating Unique IDs
def generate_date_hash_id(blog):
try:
pub_date = datetime.strptime(blog["date"], "%b %d, %Y")
date_str = pub_date.strftime("%d%m%Y")
hash_str = hashlib.sha256(blog["url"].encode()).hexdigest()[:8]
return f"{date_str}{hash_str}"
except (ValueError, KeyError):
print(f"Warning: Invalid date format for blog: '{blog['url']}'. Using url hash instead.")
url_hash = hashlib.sha256(blog["url"].encode()).hexdigest()[:16]
return url_hash
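For a post dated Mar 12, 2025 the resulting ID is the date in DDMMYYYY form followed by the first eight hex characters of the SHA-256 hash of the URL (the hash shown below is a placeholder, not a real value):
blog = {"url": "https://metis.io/blog-post-url", "date": "Mar 12, 2025"}
generate_date_hash_id(blog)   # "12032025" + 8 hex chars of the URL hash, e.g. "12032025abcd1234"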
Configuration Options
The blog scraper offers several configurable parameters:
# Constants
BLOG_URL = "https://metis.io/blog" # Source URL for blog posts
MAX_BLOGS = 10 # Maximum number of new blogs to process at once
DAYS_TO_KEEP = 99999 # Retention period in days (set high for minimal maintenance)
JSON_PATH = "knowledge/metis/blog.json" # Output file path
Data Flow
- The script checks for existing blog data in the JSON_PATH
- It fetches the latest blog posts from the Metis blog
- For each new post (not already in the database):
  - Parse the metadata
  - Generate a unique ID
  - Scrape the full content
  - Add to the collection
- Remove any posts older than DAYS_TO_KEEP, if configured (a sketch of this pruning step follows the list)
- Save the updated collection back to the JSON file
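The retention step is not covered by the snippets above. A minimal sketch of how it might look, assuming posts use the same "%b %d, %Y" date format as generate_date_hash_id() and that unparseable dates are kept rather than dropped; prune_old_posts() is an assumed name, not necessarily what the script uses:
from datetime import datetime, timedelta

def prune_old_posts(blogs, days_to_keep=DAYS_TO_KEEP):
    # Keep posts newer than the cutoff; posts whose dates cannot be parsed are kept.
    cutoff = datetime.now() - timedelta(days=days_to_keep)
    kept = []
    for blog in blogs:
        try:
            if datetime.strptime(blog["date"], "%b %d, %Y") >= cutoff:
                kept.append(blog)
        except (ValueError, KeyError):
            kept.append(blog)
    return kept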
Output Structure
The blog scraper produces a JSON file with the following structure:
{
"latest_id": "12032025abcd1234",
"blogs": [
{
"id": "12032025abcd1234",
"url": "https://metis.io/blog-post-url",
"date": "Mar 12, 2025",
"author": "Author Name",
"title": "Blog Post Title",
"summary": "A brief summary of the blog post...",
"content": "The full text content of the blog post..."
},
// Additional blog posts...
]
}
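Downstream consumers, such as an Alith agent loading this knowledge file, need nothing beyond the standard library to read it:
import json

with open("knowledge/metis/blog.json", encoding="utf-8") as f:
    knowledge = json.load(f)

print(knowledge["latest_id"])
for blog in knowledge["blogs"]:
    print(blog["id"], blog["title"])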
Error Handling
The scraper implements several error handling mechanisms:
- Connection failures are caught and reported
- Invalid date formats are handled gracefully with fallback ID generation
- File operations use try/except blocks to prevent crashes (see the sketch after this list)
- Invalid HTML structures are handled with conditional checks
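As an illustration of the file-handling pattern, a defensive loader might look like this (load_existing() is an assumed name; the script's actual error handling may differ):
import json

def load_existing(json_path=JSON_PATH):
    # Return the existing knowledge file, or an empty structure if the file
    # is missing or unreadable, so a first run or a corrupted file does not crash.
    try:
        with open(json_path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"latest_id": None, "blogs": []}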
Automation with GitHub Actions
The blog scraper is automated through a GitHub Actions workflow that:
- Runs daily at midnight UTC
- Sets up the necessary Python environment
- Installs required dependencies
- Executes the scraping script
- Commits and pushes any changes to the repository
name: "Knowledge Scraping - blogs"
on:
workflow_dispatch:
schedule:
- cron: '0 0 * * *' # Runs daily at midnight UTC
jobs:
scrape-blogs:
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.x'
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install requests beautifulsoup4
- name: Run Your Script
run: |
echo "Running the scraping task"
python scripts/addknowledge_blog.py
- name: Commit and push if content changed
run: |
git config user.name "Automated"
git config user.email "actions@users.noreply.github.com"
git add -A
timestamp=$(date -u)
git commit -m "Latest data: ${timestamp}" || exit 0
git push
Performance Considerations
The blog scraper is designed to be lightweight and efficient:
- It processes only new content, avoiding redundant operations
- The HTML parsing is targeted to specific elements, minimizing memory usage
- The script runs quickly, typically completing in a few seconds
- Error handling ensures that temporary failures don’t disrupt the knowledge base
Customization Guide
To adapt the blog scraper for other sources (an illustrative example follows the list):
- Modify the BLOG_URL constant to point to the new source
- Update the HTML selectors in the fetch_blog_posts() and parse_blog_item() functions
- Adjust the date format handling in generate_date_hash_id() if necessary
- Consider changing the retention period via DAYS_TO_KEEP based on the source’s update frequency
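For example, adapting the scraper to a hypothetical other blog might only require changing the constants and the listing selector; every value below (URL, path, class name) is a placeholder, not a real configuration:
# Hypothetical adaptation: all values are placeholders for another source.
BLOG_URL = "https://example.com/blog"
JSON_PATH = "knowledge/example/blog.json"
DAYS_TO_KEEP = 90   # shorter retention for a frequently updated source

def fetch_blog_posts():
    response = requests.get(BLOG_URL)
    if response.status_code != 200:
        raise Exception("Failed to fetch blog page.")
    soup = BeautifulSoup(response.text, "html.parser")
    # Selector updated to match the other site's post-listing markup.
    return soup.find_all("article", class_="post-preview")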