Website Scout Artifact System & Blog Pipeline
By Sean Weldon
Atlas Development Log: Website Scout Artifact System & Blog Pipeline
Overview
Extended website_scout with a proper artifact storage system mirroring youtube_scout's pattern, created an automated frontmatter injection script for sean_weldon_site blog publishing, and ran the complete workflow on Claude Code documentation. The session focused on standardizing intermediate output storage across scout products and streamlining the content-to-blog pipeline.
1. Objectives
- Analyze youtube_scout artifact storage patterns and implement for website_scout
- Create automated frontmatter script for sean_weldon_site integration
- Update project-level CLAUDE.md with standardized patterns
- Document all CLI commands and flags
- Test complete workflow on real content
Success looks like:
Running website_scout synthesize followed by add_frontmatter.py --copy-to-blog produces a properly formatted blog post with YAML frontmatter in the correct directory.
2. Key Developments
Technical Progress:
- Implemented per-URL artifact directories (./artifacts/{url-slug}/)
- Added artifact saving to fetch stage (raw_html.html, fetch_metadata.json)
- Added artifact saving to extract stage (clean_text.txt, extracted_content.json)
- Created scripts/add_frontmatter.py for automated blog frontmatter injection
- Added --ignore-robots flag to bypass robots.txt restrictions
System / Agent Improvements:
- All pipeline stages now accept artifact_dir parameter
- Stages return artifacts dict with file paths for traceability
- CLI displays artifact directory location on completion
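A minimal sketch of what a stage-level save helper could look like, to illustrate the artifact_dir parameter and the returned artifacts dict. The file names raw_html.html and fetch_metadata.json come from this log; the helper name save_fetch_artifacts and the dict keys are hypothetical, and the real fetch stage may be organized differently.

```python
import json
from pathlib import Path
from typing import Optional


def save_fetch_artifacts(html: str, metadata: dict, artifact_dir: Optional[Path]) -> dict:
    """Persist fetch-stage outputs and return their paths (hypothetical helper)."""
    artifacts: dict = {}
    if artifact_dir is None:
        # Artifact storage is optional; the stage still works without it.
        return artifacts
    artifact_dir.mkdir(parents=True, exist_ok=True)

    # Raw page source, kept so later stages can be re-run without re-fetching.
    html_path = artifact_dir / "raw_html.html"
    html_path.write_text(html, encoding="utf-8")
    artifacts["raw_html"] = str(html_path)

    # Fetch metadata, serialized as JSON.
    meta_path = artifact_dir / "fetch_metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    artifacts["fetch_metadata"] = str(meta_path)
    return artifacts
```

Returning the paths instead of the content keeps the dict small and lets the CLI print exactly where each file landed.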
Integrations Added:
- Frontmatter script auto-detects blog posts directory
- --copy-to-blog flag copies with date-prefixed filename
- Auto-extracts title, description, key_takeaways, and tags from content
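The frontmatter step can be sketched roughly as follows. This is an illustration under assumptions, not the contents of scripts/add_frontmatter.py: the function names, the title and description heuristics, and the exact date-prefix format are all hypothetical, and key_takeaways is left out for brevity.

```python
import json
from datetime import date


def build_frontmatter(markdown: str, tags: list[str]) -> str:
    """Prepend YAML frontmatter to a markdown document (hypothetical heuristics)."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    # Assumption: the first "# " heading is the title.
    title = next((ln[2:].strip() for ln in lines if ln.startswith("# ")), "Untitled")
    # Assumption: the first non-heading line serves as the description.
    description = next((ln for ln in lines if not ln.startswith("#")), "")
    header = [
        "---",
        # json.dumps yields double-quoted strings and bracketed lists,
        # both of which are also valid YAML.
        f"title: {json.dumps(title)}",
        f"description: {json.dumps(description)}",
        f"tags: {json.dumps(tags)}",
        "---",
        "",
    ]
    return "\n".join(header) + markdown


def blog_filename(slug: str, today: date) -> str:
    """Build a date-prefixed filename (assumed format: YYYY-MM-DD-slug.md)."""
    return f"{today.isoformat()}-{slug}.md"
```

Passing the date in explicitly, instead of calling date.today() inside the function, keeps the filename logic easy to test.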
3. Design Decisions
Artifact Directory Naming
- Decision: Use URL-based slug (example-com-blog-post) instead of title-based
- Rationale: URLs are available before content extraction; titles require parsing
- Alternative considered: Using content hash or timestamp
- Trade-off: Less human-readable than title-based names
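A minimal sketch of how such a slug might be derived. The function name sanitize_url_for_dirname() appears in the Code Changes table; the body below is an assumption about its behavior, and the max_len cap is hypothetical.

```python
import re
from urllib.parse import urlparse


def sanitize_url_for_dirname(url: str, max_len: int = 80) -> str:
    """Turn a URL into a filesystem-safe slug (hypothetical implementation)."""
    parsed = urlparse(url)
    # Combine host and path; query strings and fragments are dropped.
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    # max_len is an assumed cap to keep directory names manageable.
    return slug[:max_len].rstrip("-") or "unnamed"


print(sanitize_url_for_dirname("https://example.com/blog/post"))
# prints "example-com-blog-post"
```

Because the slug depends only on the URL, the artifact directory can be created before the fetch stage runs.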
Default Category for Website Content
- Decision: Default to "Education and Research" category
- Rationale: Website scout targets documentation/educational content, matching youtube_scout's --research mode
- Alternative considered: "AI" or auto-detection based on content
- Trade-off: May need override for non-research content
Flat Module Layout
- Decision: Documented preference for flat layout (product_scout/) over src/ layout
- Rationale: Simpler CLI invocation without PYTHONPATH manipulation
- Alternative considered: Standardizing on src/ layout
- Trade-off: Less conventional Python packaging structure
4. Challenges & Solutions
robots.txt Blocking Documentation Sites
- Problem: Claude Code docs at code.claude.com blocked by robots.txt
- Root cause: Site restricts automated crawling
- Solution: Added --ignore-robots CLI flag to bypass for authorized use cases
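One way the check and its bypass could fit together, sketched with the standard library's urllib.robotparser. The helper name is_fetch_allowed and its parameters are hypothetical; website_scout may well perform this check differently.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, url: str, user_agent: str = "*",
                     ignore_robots: bool = False) -> bool:
    """Return True if the URL may be fetched (hypothetical helper)."""
    if ignore_robots:
        # Mirrors the --ignore-robots flag: skip the check entirely.
        return True
    parser = RobotFileParser()
    # parse() takes the robots.txt body as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /docs/"
print(is_fetch_allowed(rules, "https://example.com/docs/page"))
# prints False
print(is_fetch_allowed(rules, "https://example.com/docs/page", ignore_robots=True))
# prints True
```

Keeping the bypass as an explicit opt-in argument means the default behavior still respects robots.txt.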
5. Code Changes
| File | Change |
|---|---|
| config/website_scout.yaml | Added artifacts: config section |
| website_scout/cli.py | Added sanitize_url_for_dirname(), artifact dir creation, --ignore-robots flag |
| website_scout/fetch.py | Added artifact_dir param, _persist_raw_html(), _persist_metadata() |
| website_scout/extract.py | Added artifact_dir param, _persist_clean_text(), _persist_extracted() |
| scripts/add_frontmatter.py | New file: frontmatter injection for sean_weldon_site |
| README.md | Added JS rendering docs, Artifacts section, updated CLI options |
| products/.claude/CLAUDE.md | Added Artifact Storage section, flat layout docs, dependencies |
| products/.claude/commands.md | New file: complete CLI command reference |
6. Next Steps
- Add --copy-to-blog integration directly to website_scout CLI
- Create unified scout CLI that wraps both youtube_scout and website_scout
- Add artifact cleanup command for old/failed runs
- Consider auto-detecting research vs summary content type
7. Session Notes
The artifact storage pattern is now consistent across scout products:
- youtube_scout: artifacts/{Video-Title}/ with 7 files
- website_scout: artifacts/{url-slug}/ with 6 files
Key insight: Separating artifact storage from final output allows re-running individual stages without full pipeline execution. This is valuable for debugging LLM synthesis or adjusting frontmatter without re-fetching content.
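As a rough illustration of that re-run benefit, a stage could check for its saved artifact before doing any work. The file name extracted_content.json comes from this log; the helper load_or_extract and its callback signature are hypothetical and not part of website_scout.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_extract(artifact_dir: Path, extract: Callable[[], dict]) -> dict:
    """Reuse extracted_content.json when present; otherwise run the extract stage."""
    cached = artifact_dir / "extracted_content.json"
    if cached.exists():
        # A previous run already produced this artifact, so fetch and
        # extract can be skipped and synthesis can start right away.
        return json.loads(cached.read_text(encoding="utf-8"))
    content = extract()
    artifact_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(json.dumps(content), encoding="utf-8")
    return content
```

Deleting a single artifact file is then enough to force just that stage to run again.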
Complete workflow tested successfully:
python -m website_scout.cli synthesize URL --render-js --ignore-robots
python scripts/add_frontmatter.py output/slug.md --copy-to-blog
Generated blog post from Claude Code plugin documentation as proof of concept.