Website Scout Artifact System & Blog Pipeline

Built youtube_scout-style artifact storage for website_scout and shipped a frontmatter script that automates the blog publishing workflow end to end.

2026-01-06 By Sean Weldon

Atlas Development Log — Website Scout Artifact System & Blog Pipeline

Overview

Extended website_scout with a proper artifact storage system mirroring youtube_scout's pattern, created an automated frontmatter injection script for sean_weldon_site blog publishing, and ran the complete workflow on Claude Code documentation. The session focused on standardizing intermediate output storage across scout products and streamlining the content-to-blog pipeline.

1. Objectives

Analyze youtube_scout's artifact storage patterns and implement them for website_scout
Create automated frontmatter script for sean_weldon_site integration
Update project-level CLAUDE.md with standardized patterns
Document all CLI commands and flags
Test complete workflow on real content

Success looks like: Running website_scout synthesize followed by add_frontmatter.py --copy-to-blog produces a properly formatted blog post with YAML frontmatter in the correct directory.

2. Key Developments

Technical Progress:

Implemented per-URL artifact directories (./artifacts/{url-slug}/)
Added artifact saving to fetch stage (raw_html.html, fetch_metadata.json)
Added artifact saving to extract stage (clean_text.txt, extracted_content.json)
Created scripts/add_frontmatter.py for automated blog frontmatter injection
Added --ignore-robots flag to bypass robots.txt restrictions

System / Agent Improvements:

All pipeline stages now accept artifact_dir parameter
Stages return artifacts dict with file paths for traceability
CLI displays artifact directory location on completion
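
The stage contract described above can be sketched roughly as follows. This is a minimal illustration, not the actual website_scout code: the helper names `_persist_raw_html()` and `_persist_metadata()` and the file names come from this log, but `run_fetch_stage()`, its signature, and the metadata fields are assumptions. The HTML is passed in directly so the sketch runs without a network call.

```python
import json
from pathlib import Path
from typing import Optional


def _persist_raw_html(html: str, artifact_dir: Path) -> Path:
    """Write the fetched HTML to raw_html.html and return its path."""
    path = artifact_dir / "raw_html.html"
    path.write_text(html, encoding="utf-8")
    return path


def _persist_metadata(metadata: dict, artifact_dir: Path) -> Path:
    """Write fetch metadata to fetch_metadata.json and return its path."""
    path = artifact_dir / "fetch_metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def run_fetch_stage(url: str, html: str, artifact_dir: Optional[Path] = None) -> dict:
    """Hypothetical fetch stage: returns its result plus an artifacts dict.

    When artifact_dir is None, nothing is written and the dict stays empty.
    """
    result = {"url": url, "html": html, "artifacts": {}}
    if artifact_dir is not None:
        artifact_dir.mkdir(parents=True, exist_ok=True)
        # The metadata fields here are illustrative assumptions.
        metadata = {"url": url, "content_length": len(html)}
        result["artifacts"] = {
            "raw_html": str(_persist_raw_html(html, artifact_dir)),
            "fetch_metadata": str(_persist_metadata(metadata, artifact_dir)),
        }
    return result
```

Returning the paths in the result, rather than having callers rebuild them, lets the CLI print the artifact location without knowing each stage's file names.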

Integrations Added:

Frontmatter script auto-detects blog posts directory
--copy-to-blog flag copies with date-prefixed filename
Auto-extracts title, description, key_takeaways, and tags from content
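
As a rough idea of what the frontmatter step involves, the sketch below pulls a title and description out of a markdown document and prepends a YAML block. It is not the contents of scripts/add_frontmatter.py: the function names, the extraction heuristics, the field set, and the exact `YYYY-MM-DD-slug.md` filename format are all assumptions, and key_takeaways and tags extraction is left out.

```python
import json
import re
from datetime import date


def extract_title(markdown: str) -> str:
    """Use the first level-1 heading as the title (assumed heuristic)."""
    match = re.search(r"^#\s+(.+)$", markdown, flags=re.MULTILINE)
    return match.group(1).strip() if match else "Untitled"


def extract_description(markdown: str, limit: int = 160) -> str:
    """Use the first non-heading paragraph, truncated to `limit` characters."""
    for block in re.split(r"\n\s*\n", markdown):
        text = " ".join(block.split())
        if text and not text.startswith("#"):
            return text[:limit]
    return ""


def add_frontmatter(markdown: str, published: date,
                    category: str = "Education and Research") -> str:
    """Prepend a YAML frontmatter block to the document."""
    fields = {
        "title": extract_title(markdown),
        "description": extract_description(markdown),
        "date": published.isoformat(),
        "category": category,
    }
    # json.dumps yields double-quoted strings, which are valid YAML scalars.
    lines = ["---"] + [f"{key}: {json.dumps(value)}" for key, value in fields.items()] + ["---"]
    return "\n".join(lines) + "\n\n" + markdown


def blog_filename(slug: str, published: date) -> str:
    """Build a date-prefixed filename (the exact format is an assumption)."""
    return f"{published.isoformat()}-{slug}.md"
```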

3. Design Decisions

Artifact Directory Naming

Decision: Use a URL-based slug (example-com-blog-post) instead of a title-based name
Rationale: URLs are available before content extraction; titles require parsing
Alternative considered: Using content hash or timestamp
Trade-off: Less human-readable than title-based names
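
One way a URL could be turned into such a slug is sketched below. The function name `sanitize_url_for_dirname()` appears in this log, but the body, the `max_length` parameter, and the "untitled" fallback are assumptions for illustration only.

```python
import re
from urllib.parse import urlparse


def sanitize_url_for_dirname(url: str, max_length: int = 80) -> str:
    """Turn a URL into a filesystem-safe slug built from host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    # Truncate long URLs; fall back to a placeholder if nothing is left.
    return slug[:max_length].rstrip("-") or "untitled"
```

Because the slug depends only on the URL, the artifact directory can be created before the fetch stage runs, which is the rationale given above.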

Default Category for Website Content

Decision: Default to "Education and Research" category
Rationale: Website scout targets documentation/educational content, matching youtube_scout's --research mode
Alternative considered: "AI" or auto-detection based on content
Trade-off: May need override for non-research content

Flat Module Layout

Decision: Documented preference for flat layout (product_scout/) over src/ layout
Rationale: Simpler CLI invocation without PYTHONPATH manipulation
Alternative considered: Standardizing on src/ layout
Trade-off: Less conventional Python packaging structure

4. Challenges & Solutions

robots.txt Blocking Documentation Sites

Problem: Claude Code docs at code.claude.com were blocked by robots.txt
Root cause: Site restricts automated crawling
Solution: Added --ignore-robots CLI flag to bypass for authorized use cases
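
A bypass like this can be sketched with the standard library's `urllib.robotparser`. The function `is_fetch_allowed()`, its parameters, and the "website_scout" user agent string are hypothetical; how website_scout actually checks robots.txt is not shown in this log. The robots.txt body is passed in as a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(url: str, robots_txt: str,
                     user_agent: str = "website_scout",
                     ignore_robots: bool = False) -> bool:
    """Return True if the URL may be fetched under the given robots.txt.

    ignore_robots mirrors the --ignore-robots flag: it skips the check
    entirely and should only be used where crawling is authorized.
    """
    if ignore_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keeping the bypass as an explicit opt-in means the default behavior still respects the site's crawling rules.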

5. Code Changes

| File | Change |
| --- | --- |
| `config/website_scout.yaml` | Added `artifacts:` config section |
| `website_scout/cli.py` | Added `sanitize_url_for_dirname()`, artifact dir creation, `--ignore-robots` flag |
| `website_scout/fetch.py` | Added `artifact_dir` param, `_persist_raw_html()`, `_persist_metadata()` |
| `website_scout/extract.py` | Added `artifact_dir` param, `_persist_clean_text()`, `_persist_extracted()` |
| `scripts/add_frontmatter.py` | New file: frontmatter injection for sean_weldon_site |
| `README.md` | Added JS rendering docs, Artifacts section, updated CLI options |
| `products/.claude/CLAUDE.md` | Added Artifact Storage section, flat layout docs, dependencies |
| `products/.claude/commands.md` | New file: complete CLI command reference |

6. Next Steps

Add --copy-to-blog integration directly to website_scout CLI
Create unified scout CLI that wraps both youtube_scout and website_scout
Add artifact cleanup command for old/failed runs
Consider auto-detecting research vs summary content type

7. Session Notes

The artifact storage pattern is now consistent across scout products:

youtube_scout: artifacts/{Video-Title}/ with 7 files
website_scout: artifacts/{url-slug}/ with 6 files

Key insight: Separating artifact storage from final output allows re-running individual stages without full pipeline execution. This is valuable for debugging LLM synthesis or adjusting frontmatter without re-fetching content.
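
To make the re-run idea concrete, here is a small sketch of a stage that reuses a saved artifact when one exists. `load_or_extract()` is a hypothetical helper, not part of website_scout; only the clean_text.txt file name comes from this log.

```python
from pathlib import Path
from typing import Callable


def load_or_extract(artifact_dir: Path, extract: Callable[[], str]) -> str:
    """Reuse clean_text.txt if present; otherwise run extraction and save it."""
    cached = artifact_dir / "clean_text.txt"
    if cached.exists():
        # An earlier run already produced this artifact, so skip the stage.
        return cached.read_text(encoding="utf-8")
    text = extract()
    artifact_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

With a helper along these lines, a later stage such as synthesis could be retried many times while the fetch and extract work happens only once.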

Complete workflow tested successfully:

python -m website_scout.cli synthesize URL --render-js --ignore-robots
python scripts/add_frontmatter.py output/slug.md --copy-to-blog

Generated a blog post from the Claude Code plugin documentation as a proof of concept.