Blog Scraper: Technical Description
Overview
The Blog Scraper is designed to automatically collect and process blog posts from the Metis.io website. It uses a lightweight HTML parsing approach to extract relevant information efficiently and store it in a structured format suitable for consumption by Alith AI agents.
Architecture
The blog scraper follows a straightforward pipeline architecture (sketched in code after the list):
- Fetch HTML: Retrieve the blog listing page
- Extract Post Metadata: Parse HTML to identify blog posts and their metadata
- Fetch Individual Posts: Retrieve the full content of each blog post
- Process and Structure: Transform the raw data into a structured format
- Store Results: Save the processed data as JSON
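Read as code, the pipeline amounts to a short sequence of calls. The sketch below is illustrative rather than a copy of the script: fetch_blog_posts(), parse_blog_item(), and generate_date_hash_id() appear later in this document, while load_existing(), scrape_post_content(), and save_knowledge() are assumed helper names.
def run_pipeline():
    knowledge = load_existing(JSON_PATH)              # assumed helper: read blog.json if present
    known_urls = {b["url"] for b in knowledge["blogs"]}
    new_count = 0
    # 1. + 2. Fetch the listing page and parse per-post metadata.
    for item in fetch_blog_posts():
        blog = parse_blog_item(item)
        if blog["url"] in known_urls or new_count >= MAX_BLOGS:
            continue                                  # already stored, or batch limit reached
        # 3. + 4. Fetch the full post and append the structured record.
        blog["id"] = generate_date_hash_id(blog)
        blog["content"] = scrape_post_content(blog["url"])   # assumed helper
        knowledge["blogs"].append(blog)
        knowledge["latest_id"] = blog["id"]
        new_count += 1
    # 5. Store the updated collection.
    save_knowledge(knowledge, JSON_PATH)              # assumed helper: write JSON back to disk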
Implementation Details
Core Components
scripts/addknowledge_blog.py # Main scraper implementation
.github/workflows/addknowledge_blog.yaml # Automation workflow
knowledge/metis/blog.json # Output knowledge file
Dependencies
The blog scraper has minimal dependencies to ensure reliability (the matching import block is shown after the list):
- requests: For making HTTP requests to the blog website
- beautifulsoup4: For parsing HTML content
- hashlib: For generating unique identifiers
- Standard library modules (json, datetime, os)
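For reference, a minimal import block consistent with the snippets in this document would look like the following (the script's actual ordering may differ):
import hashlib
import json
import os
from datetime import datetime

import requests
from bs4 import BeautifulSoup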
Key Functions
Fetching Blog Posts
def fetch_blog_posts():
response = requests.get(BLOG_URL)
if response.status_code != 200:
raise Exception("Failed to fetch blog page.")
soup = BeautifulSoup(response.text, "html.parser")
return soup.find_all(
"div", {"role": "listitem", "class": "collection-item tech w-dyn-item"}
)
Parsing Blog Items
def parse_blog_item(item):
link_tag = item.find("a", {"aria-label": "link-article"})
link = link_tag["href"]
date = item.find("div", class_="text-1-pc").text.strip()
    author_tag = item.find("div", class_="autor-tag")
    author = author_tag.text.strip() if author_tag else "Unknown"
title = item.find("div", {"fs-cmsfilter-field": "title"}).text.strip()
summary = item.find("div", class_="text-intro-pc").text.strip()
return {
"url": f"https://metis.io{link}",
"date": date,
"author": author,
"title": title,
"summary": summary,
}
Generating Unique IDs
def generate_date_hash_id(blog):
try:
pub_date = datetime.strptime(blog["date"], "%b %d, %Y")
date_str = pub_date.strftime("%d%m%Y")
hash_str = hashlib.sha256(blog["url"].encode()).hexdigest()[:8]
return f"{date_str}{hash_str}"
except (ValueError, KeyError):
print(f"Warning: Invalid date format for blog: '{blog['url']}'. Using url hash instead.")
url_hash = hashlib.sha256(blog["url"].encode()).hexdigest()[:16]
return url_hash
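For a post dated Mar 12, 2025 the resulting ID is the date in DDMMYYYY form followed by the first eight hex characters of the SHA-256 hash of the URL (the hash shown below is a placeholder, not a real value):
blog = {"url": "https://metis.io/blog-post-url", "date": "Mar 12, 2025"}
generate_date_hash_id(blog)   # "12032025" + 8 hex chars of the URL hash, e.g. "12032025abcd1234"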
Configuration Options
The blog scraper offers several configurable parameters:
# Constants
BLOG_URL = "https://metis.io/blog" # Source URL for blog posts
MAX_BLOGS = 10 # Maximum number of new blogs to process at once
DAYS_TO_KEEP = 99999 # Retention period in days (set high for minimal maintenance)
JSON_PATH = "knowledge/metis/blog.json" # Output file path
Data Flow
- The script checks for existing blog data in the JSON_PATH
- It fetches the latest blog posts from the Metis blog
- For each new post (not already in the database):
  - Parse the metadata
  - Generate a unique ID
  - Scrape the full content
  - Add to the collection
- Remove any posts older than DAYS_TO_KEEP, if configured (a sketch of this pruning step follows the list)
- Save the updated collection back to the JSON file
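The retention step is not covered by the snippets above. A minimal sketch of how it might look, assuming posts use the same "%b %d, %Y" date format as generate_date_hash_id() and that unparseable dates are kept rather than dropped; prune_old_posts() is an assumed name, not necessarily what the script uses:
from datetime import datetime, timedelta

def prune_old_posts(blogs, days_to_keep=DAYS_TO_KEEP):
    # Keep posts newer than the cutoff; posts whose dates cannot be parsed are kept.
    cutoff = datetime.now() - timedelta(days=days_to_keep)
    kept = []
    for blog in blogs:
        try:
            if datetime.strptime(blog["date"], "%b %d, %Y") >= cutoff:
                kept.append(blog)
        except (ValueError, KeyError):
            kept.append(blog)
    return kept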
Output Structure
The blog scraper produces a JSON file with the following structure:
{
"latest_id": "12032025abcd1234",
"blogs": [
{
"id": "12032025abcd1234",
"url": "https://metis.io/blog-post-url",
"date": "Mar 12, 2025",
"author": "Author Name",
"title": "Blog Post Title",
"summary": "A brief summary of the blog post...",
"content": "The full text content of the blog post..."
},
// Additional blog posts...
]
}
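Downstream consumers, such as an Alith agent loading this knowledge file, need nothing beyond the standard library to read it:
import json

with open("knowledge/metis/blog.json", encoding="utf-8") as f:
    knowledge = json.load(f)

print(knowledge["latest_id"])
for blog in knowledge["blogs"]:
    print(blog["id"], blog["title"])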
Error Handling
The scraper implements several error handling mechanisms:
- Connection failures are caught and reported
- Invalid date formats are handled gracefully with fallback ID generation
- File operations use try/except blocks to prevent crashes (see the sketch after this list)
- Invalid HTML structures are handled with conditional checks
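As an illustration of the file-handling pattern, a defensive loader might look like this (load_existing() is an assumed name; the script's actual error handling may differ):
import json

def load_existing(json_path=JSON_PATH):
    # Return the existing knowledge file, or an empty structure if the file
    # is missing or unreadable, so a first run or a corrupted file does not crash.
    try:
        with open(json_path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"latest_id": None, "blogs": []}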
Automation with GitHub Actions
The blog scraper is automated through a GitHub Actions workflow that:
- Runs daily at midnight UTC
- Sets up the necessary Python environment
- Installs required dependencies
- Executes the scraping script
- Commits and pushes any changes to the repository
name: "Knowledge Scraping - blogs"
on:
workflow_dispatch:
schedule:
- cron: '0 0 * * *' # Runs daily at midnight UTC
jobs:
scrape-blogs:
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.x'
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install requests beautifulsoup4
- name: Run Your Script
run: |
echo "Running the scraping task"
python scripts/addknowledge_blog.py
- name: Commit and push if content changed
run: |
git config user.name "Automated"
git config user.email "actions@users.noreply.github.com"
git add -A
timestamp=$(date -u)
git commit -m "Latest data: ${timestamp}" || exit 0
git push
Performance Considerations
The blog scraper is designed to be lightweight and efficient:
- It processes only new content, avoiding redundant operations
- The HTML parsing is targeted to specific elements, minimizing memory usage
- The script runs quickly, typically completing in a few seconds
- Error handling ensures that temporary failures don’t disrupt the knowledge base
Customization Guide
To adapt the blog scraper for other sources (an illustrative example follows the list):
- Modify the BLOG_URL constant to point to the new source
- Update the HTML selectors in the fetch_blog_posts() and parse_blog_item() functions
- Adjust the date format handling in generate_date_hash_id() if necessary
- Consider changing the retention period via DAYS_TO_KEEP based on the source’s update frequency
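For example, adapting the scraper to a hypothetical other blog might only require changing the constants and the listing selector; every value below (URL, path, class name) is a placeholder, not a real configuration:
# Hypothetical adaptation: all values are placeholders for another source.
BLOG_URL = "https://example.com/blog"
JSON_PATH = "knowledge/example/blog.json"
DAYS_TO_KEEP = 90   # shorter retention for a frequently updated source

def fetch_blog_posts():
    response = requests.get(BLOG_URL)
    if response.status_code != 200:
        raise Exception("Failed to fetch blog page.")
    soup = BeautifulSoup(response.text, "html.parser")
    # Selector updated to match the other site's post-listing markup.
    return soup.find_all("article", class_="post-preview")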