Blog Scraper: Technical Description
Overview
The Blog Scraper is designed to automatically collect and process blog posts from the Metis.io website. It uses a lightweight HTML parsing approach to extract relevant information efficiently and store it in a structured format suitable for consumption by Alith AI agents.
Architecture
The blog scraper follows a straightforward pipeline architecture (a code sketch of the end-to-end flow appears after this list):
- Fetch HTML: Retrieve the blog listing page
- Extract Post Metadata: Parse HTML to identify blog posts and their metadata
- Fetch Individual Posts: Retrieve the full content of each blog post
- Process and Structure: Transform the raw data into a structured format
- Store Results: Save the processed data as JSON
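At a high level, these stages chain together as shown below. This is only a sketch that reuses the function names documented later in this section; fetch_full_content() and save_collection() are hypothetical placeholders for the content-fetching and storage steps.

def run_scraper():
    # fetch_full_content() and save_collection() are hypothetical placeholders
    blogs = []
    for item in fetch_blog_posts():                         # 1. Fetch HTML (listing page)
        blog = parse_blog_item(item)                        # 2. Extract post metadata
        blog["content"] = fetch_full_content(blog["url"])   # 3. Fetch the individual post
        blog["id"] = generate_date_hash_id(blog)            # 4. Process and structure
        blogs.append(blog)
    save_collection(blogs)                                  # 5. Store results as JSON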
Implementation Details
Core Components
scripts/addknowledge_blog.py     # Main scraper implementation
.github/workflows/addknowledge_blog.yaml  # Automation workflow
knowledge/metis/blog.json        # Output knowledge file
Dependencies
The blog scraper keeps its dependencies minimal to ensure reliability (the corresponding import block is sketched below):
- requests: For making HTTP requests to the blog website
- beautifulsoup4: For parsing HTML content
- hashlib (standard library): For generating unique identifiers
- Other standard library modules (json, datetime, os)
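Put together, the import block implied by this list looks roughly like the following (a sketch; the exact imports depend on the implementation):

import hashlib
import json
import os
from datetime import datetime, timedelta

import requests
from bs4 import BeautifulSoup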
Key Functions
Fetching Blog Posts
def fetch_blog_posts():
    # Download the blog listing page and return the post cards found on it
    response = requests.get(BLOG_URL)
    if response.status_code != 200:
        raise Exception("Failed to fetch blog page.")
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.find_all(
        "div", {"role": "listitem", "class": "collection-item tech w-dyn-item"}
    )
Parsing Blog Items
def parse_blog_item(item):
    # Extract the metadata fields from a single post card on the listing page
    link_tag = item.find("a", {"aria-label": "link-article"})
    link = link_tag["href"]
    date = item.find("div", class_="text-1-pc").text.strip()
    author = (
        item.find("div", class_="autor-tag").text.strip()
        if item.find("div", class_="autor-tag")
        else "Unknown"
    )
    title = item.find("div", {"fs-cmsfilter-field": "title"}).text.strip()
    summary = item.find("div", class_="text-intro-pc").text.strip()
    return {
        "url": f"https://metis.io{link}",
        "date": date,
        "author": author,
        "title": title,
        "summary": summary,
    }
Generating Unique IDs
def generate_date_hash_id(blog):
    # ID format: publication date as DDMMYYYY followed by the first 8 hex
    # characters of the SHA-256 hash of the post URL
    try:
        pub_date = datetime.strptime(blog["date"], "%b %d, %Y")
        date_str = pub_date.strftime("%d%m%Y")
        hash_str = hashlib.sha256(blog["url"].encode()).hexdigest()[:8]
        return f"{date_str}{hash_str}"
    except (ValueError, KeyError):
        print(f"Warning: Invalid date format for blog: '{blog['url']}'. Using url hash instead.")
        url_hash = hashlib.sha256(blog["url"].encode()).hexdigest()[:16]
        return url_hash
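For example, a post published on Mar 12, 2025 yields an ID that starts with 12032025 followed by the first eight hex characters of the SHA-256 hash of its URL (the hash shown below is illustrative, not the real digest):

blog = {"date": "Mar 12, 2025", "url": "https://metis.io/some-post"}
generate_date_hash_id(blog)  # -> "12032025" + 8-character URL hash, e.g. "12032025abcd1234"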
Configuration Options
The blog scraper offers several configurable parameters:
# Constants
BLOG_URL = "https://metis.io/blog"  # Source URL for blog posts
MAX_BLOGS = 10  # Maximum number of new blogs to process at once
DAYS_TO_KEEP = 99999  # Retention period in days (set very high to effectively keep all posts)
JSON_PATH = "knowledge/metis/blog.json"  # Output file pathData Flow
- The script checks for existing blog data in the JSON_PATH
- It fetches the latest blog posts from the Metis blog
- For each new post (not already in the database):
- Parse the metadata
- Generate a unique ID
- Scrape the full content
- Add to the collection
 
- Remove any posts older than DAYS_TO_KEEP (if configured)
- Save the updated collection back to the JSON file
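A minimal sketch of this update loop, using the documented constants and functions. Here update_collection() and scrape_full_content() are hypothetical names, the file layout follows the output structure shown below, and newest-first ordering is assumed when setting latest_id:

# Illustrative sketch of the update loop; scrape_full_content() is a hypothetical helper
def update_collection():
    # Load the existing collection, or start fresh if the file does not exist yet
    if os.path.exists(JSON_PATH):
        with open(JSON_PATH) as f:
            data = json.load(f)
    else:
        data = {"latest_id": None, "blogs": []}
    known_urls = {b["url"] for b in data["blogs"]}

    # Process at most MAX_BLOGS posts that are not already in the collection
    new_blogs = []
    for item in fetch_blog_posts():
        if len(new_blogs) >= MAX_BLOGS:
            break
        blog = parse_blog_item(item)
        if blog["url"] in known_urls:
            continue
        blog["id"] = generate_date_hash_id(blog)
        blog["content"] = scrape_full_content(blog["url"])  # hypothetical helper
        new_blogs.append(blog)

    # Retention pruning: drop posts older than DAYS_TO_KEEP days
    cutoff = datetime.now() - timedelta(days=DAYS_TO_KEEP)
    blogs = [
        b for b in new_blogs + data["blogs"]
        if datetime.strptime(b["date"], "%b %d, %Y") >= cutoff
    ]

    with open(JSON_PATH, "w") as f:
        json.dump({"latest_id": blogs[0]["id"] if blogs else None, "blogs": blogs}, f, indent=2)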
Output Structure
The blog scraper produces a JSON file with the following structure:
{
  "latest_id": "12032025abcd1234",
  "blogs": [
    {
      "id": "12032025abcd1234",
      "url": "https://metis.io/blog-post-url",
      "date": "Mar 12, 2025",
      "author": "Author Name",
      "title": "Blog Post Title",
      "summary": "A brief summary of the blog post...",
      "content": "The full text content of the blog post..."
    }
    // Additional blog posts...
  ]
}
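Downstream consumers (for example, an Alith agent's knowledge loader) can read the file with the standard library. A minimal sketch:

import json

with open("knowledge/metis/blog.json") as f:
    knowledge = json.load(f)

print(knowledge["latest_id"])            # ID of the most recently added post
for blog in knowledge["blogs"]:
    print(blog["date"], blog["title"])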
Error Handling
The scraper implements several error handling mechanisms (the first of these is sketched after the list):
- Connection failures are caught and reported
- Invalid date formats are handled gracefully with fallback ID generation
- File operations use try/except blocks to prevent crashes
- Invalid HTML structures are handled with conditional checks
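For instance, the connection-failure handling could take the following form. This is a sketch rather than the script's actual code; fetch_blog_posts_safe() is a hypothetical guarded variant of the fetch function shown earlier:

def fetch_blog_posts_safe():
    # Hypothetical guarded variant of fetch_blog_posts(): catch network-level
    # failures (DNS errors, timeouts, refused connections) and report them
    # instead of crashing the whole run
    try:
        response = requests.get(BLOG_URL, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Warning: failed to fetch {BLOG_URL}: {exc}")
        return []
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.find_all(
        "div", {"role": "listitem", "class": "collection-item tech w-dyn-item"}
    )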
Automation with GitHub Actions
The blog scraper is automated through a GitHub Actions workflow that:
- Runs daily at midnight UTC
- Sets up the necessary Python environment
- Installs required dependencies
- Executes the scraping script
- Commits and pushes any changes to the repository
The workflow definition:
name: "Knowledge Scraping - blogs"
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *" # Runs daily at midnight UTC
jobs:
  scrape-blogs:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests beautifulsoup4
      - name: Run Your Script
        run: |
          echo "Running the scraping task"
          python scripts/addknowledge_blog.py
      - name: Commit and push if content changed
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
Performance Considerations
The blog scraper is designed to be lightweight and efficient:
- It processes only new content, avoiding redundant operations
- The HTML parsing is targeted to specific elements, minimizing memory usage
- The script runs quickly, typically completing in a few seconds
- Error handling ensures that temporary failures don’t disrupt the knowledge base
Customization Guide
To adapt the blog scraper for other sources (a hypothetical adaptation is sketched below):
- Modify the BLOG_URL constant to point to the new source
- Update the HTML selectors in the fetch_blog_posts() and parse_blog_item() functions
- Adjust the date format handling in generate_date_hash_id() if necessary
- Consider changing the retention period via DAYS_TO_KEEP based on the source’s update frequency
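As an illustration, an adaptation to a hypothetical source might look like the following. Every URL, selector, and date format here is a placeholder invented for the example and must be replaced with values taken from the real target site:

# Hypothetical adaptation (all values below are placeholders)
BLOG_URL = "https://example.com/blog"
DAYS_TO_KEEP = 365  # keep one year of posts for a frequently updated source

def parse_blog_item(item):
    # Placeholder selectors for the hypothetical site's markup
    link = item.find("a", class_="post-link")["href"]
    date = item.find("span", class_="post-date").text.strip()   # e.g. "2025-03-12"
    title = item.find("h2", class_="post-title").text.strip()
    summary = item.find("p", class_="post-excerpt").text.strip()
    return {"url": f"https://example.com{link}", "date": date, "title": title, "summary": summary}

# If the new source uses ISO dates, adjust the parsing in generate_date_hash_id():
#     pub_date = datetime.strptime(blog["date"], "%Y-%m-%d")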