website-cloner · davidbmar.com

What it is

website-cloner is a Node.js utility that creates static replicas of dynamic websites. It runs in distinct phases: URL enumeration via breadth-first search (BFS), asset extraction (HTML, CSS, JS, images, and fonts), link rewriting from absolute to relative paths, dynamic content detection (marking API calls, forms, and WebSockets for LLM review), and optional deployment to AWS S3 with static website hosting. It ships with both a CLI for automation and a Web UI for interactive use.

Features

Breadth-First Search URL enumeration without downloading content
Asset extraction for HTML, CSS, JS, images, and fonts
Automatic link rewriting from absolute to relative paths
Dynamic content detection marking APIs, forms, and WebSockets
One-click deployment to AWS S3 with static website hosting
Web UI and CLI interfaces with real-time progress tracking
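
The breadth-first enumeration listed above can be illustrated with a minimal sketch. This is not the project's Enumerator: `enumerateUrls`, `getLinks`, and `maxPages` are hypothetical names, and link discovery is passed in as a callback so the example runs without network access.

```javascript
// Hypothetical sketch of breadth-first URL enumeration; not the project's Enumerator.
// `getLinks(url)` is an assumed callback that returns the hrefs found on a page.
async function enumerateUrls(startUrl, getLinks, maxPages = 100) {
  const origin = new URL(startUrl).origin;
  const start = normalize(startUrl);
  const visited = new Set([start]);
  const queue = [start]; // FIFO queue gives breadth-first order
  const order = [];

  while (queue.length > 0 && order.length < maxPages) {
    const current = queue.shift();
    order.push(current);

    for (const href of await getLinks(current)) {
      let next;
      try {
        next = normalize(new URL(href, current).href); // resolve relative hrefs
      } catch {
        continue; // skip malformed hrefs
      }
      // Stay on the starting origin and never enqueue a URL twice.
      if (new URL(next).origin !== origin || visited.has(next)) continue;
      visited.add(next);
      queue.push(next);
    }
  }
  return order;
}

// Drop the fragment so /page and /page#top count as one URL.
function normalize(url) {
  const u = new URL(url);
  u.hash = '';
  return u.href;
}
```

Because pages are taken from the front of the queue, every URL one link away from the start is recorded before any URL two links away, which is the property that makes the crawl breadth-first.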

Quickstart

bash setup.sh
cp config.example.json mysite-config.json
node clone-website.js --config=mysite-config.json --full

Architecture

flowchart TD
    User[User] -->|Config| CLI[clone-website.js]
    CLI --> Enum[Enumerator]
    CLI --> Down[Downloader]
    CLI --> Rewriter[LinkRewriter]
    CLI --> Dyn[DynamicDetector]
    CLI --> S3[S3Uploader]
    
    Enum -->|manifest.json| Down
    Down -->|Local Files| Rewriter
    Rewriter -->|Modified Files| Dyn
    Dyn -->|Marked Files| S3
    S3 -->|Upload| AWS[AWS S3 Bucket]
    
    subgraph Libs
    Axios[axios]
    Cheerio[cheerio]
    PQueue[p-queue]
    end
    
    Down --> Axios
    Rewriter --> Cheerio
    Down --> PQueue
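
The flowchart routes the Downloader through p-queue for concurrency control. The idea behind a concurrency limit can be sketched with plain promises. This is a stand-in written for illustration, not p-queue's API and not the project's code; `runWithConcurrency` is a hypothetical name.

```javascript
// Hypothetical stand-in for the concurrency control that p-queue provides.
// Runs async task functions with at most `limit` in flight at once.
async function runWithConcurrency(tasks, limit) {
  const results = new Array(tasks.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unclaimed task until none remain.
  async function worker() {
    while (nextIndex < tasks.length) {
      const i = nextIndex++;
      results[i] = await tasks[i]();
    }
  }

  // Start up to `limit` workers; a rejected task rejects the whole run.
  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the order of `tasks`
}
```

In a downloader, each task would wrap one HTTP request, so the limit caps how many requests hit the target server simultaneously.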

How it's built

website-cloner is built on Node.js with ES modules. Key libraries include `axios` for HTTP requests, `cheerio` for HTML parsing and manipulation, `commander` for CLI argument parsing, `p-queue` for concurrency control, and `@aws-sdk/client-s3` for cloud deployment. The architecture separates concerns into modular classes (Enumerator, Downloader, LinkRewriter, DynamicDetector, and S3Uploader), orchestrated by a main CLI entry point.
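
One way to picture an entry point orchestrating separate phase modules is a small runner that passes accumulated state from phase to phase. The sketch below is an assumption about the general shape only: `runPipeline`, the `{ name, run }` phase objects, and the state fields are hypothetical and do not reflect the real class interfaces.

```javascript
// Hypothetical sketch of phase-based orchestration with an injected logger and config.
// The phase objects are placeholders, not the project's real classes.
async function runPipeline(phases, config, logger = console) {
  let state = { config };
  for (const phase of phases) {
    logger.log(`[${phase.name}] starting`);
    // Each phase receives the accumulated state and returns the fields it adds.
    const output = await phase.run(state, logger);
    state = { ...state, ...output };
  }
  return state;
}
```

Passing the logger and config in, rather than importing them inside each phase, keeps the phases easy to test with stubs.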

How it runs

sequenceDiagram
    participant U as User
    participant C as CLI
    participant E as Enumerator
    participant D as Downloader
    participant R as LinkRewriter
    participant DD as DynamicDetector
    participant S as S3Uploader
    
    U->>C: Run with --full flag
    C->>E: Execute Phase 2 (Enumerate)
    E->>E: BFS Crawl Target URL
    E-->>C: Generate manifest.json
    
    C->>D: Execute Phase 3 (Download)
    D->>D: Load manifest.json
    D->>D: Download HTML & Assets
    D-->>C: Save to output directory
    
    C->>R: Execute Phase 4 (Rewrite)
    R->>R: Parse HTML with Cheerio
    R->>R: Convert Absolute to Relative Links
    R-->>C: Save modified files
    
    C->>DD: Execute Phase 5 (Detect)
    DD->>DD: Scan for API/Form/WebSocket usage
    DD->>DD: Add data-marker attributes
    DD-->>C: Save marked files
    
    C->>S: Execute Phase 6 (Deploy)
    S->>S: Configure S3 Bucket
    S->>S: Upload Static Assets
    S-->>U: Return S3 Website URL
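
The rewrite phase in the sequence above converts absolute links to relative ones. A minimal sketch of that conversion follows, using only the built-in `URL` class. `toRelativeLink` is a hypothetical helper, not the project's LinkRewriter, and it does not cover details such as mapping directory URLs to `index.html`.

```javascript
// Hypothetical helper: turn an absolute same-origin URL into a path relative to the
// page that links to it. Cross-origin URLs are returned unchanged.
function toRelativeLink(pageUrl, targetUrl) {
  const page = new URL(pageUrl);
  const target = new URL(targetUrl, page); // also resolves root-relative hrefs
  if (target.origin !== page.origin) return targetUrl;

  // Directory segments of the page (file name dropped) and all target segments.
  const fromDirs = page.pathname.split('/').slice(1, -1);
  const toParts = target.pathname.split('/').slice(1);

  // Skip the shared leading directories, but always keep the target's last segment.
  let common = 0;
  while (
    common < fromDirs.length &&
    common < toParts.length - 1 &&
    fromDirs[common] === toParts[common]
  ) {
    common++;
  }

  const up = fromDirs.slice(common).map(() => '..'); // climb out of unshared dirs
  const down = toParts.slice(common);                // descend to the target
  const relative = [...up, ...down].join('/') || './';
  return relative + target.search + target.hash;
}
```

For example, a page at `https://example.com/blog/post.html` linking to `https://example.com/css/site.css` would get `../css/site.css`, which keeps working when the clone is opened from disk or served from a different host.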

How to apply & reuse

Use this tool to archive legacy sites, create static backups of dynamic content, or migrate simple websites to serverless hosting on S3. It is particularly useful for developers needing to inspect how a site's assets are structured or to prepare a site for LLM-based refactoring by marking dynamic elements.
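
To make the idea of marking dynamic elements concrete, here is a rough sketch of a detector that reports what it finds and tags forms for later review. It is an illustration under assumptions: plain regular expressions stand in for Cheerio-based parsing, and `detectDynamicContent`, the pattern set, and the `data-marker="form"` value are hypothetical rather than the project's actual output format.

```javascript
// Hypothetical detector: regular expressions stand in for Cheerio-based parsing,
// and the marker value ("form") is an assumption for illustration.
const DYNAMIC_PATTERNS = {
  api: /\bfetch\s*\(|\bXMLHttpRequest\b/,
  websocket: /\bnew\s+WebSocket\s*\(/,
  form: /<form\b/i,
};

function detectDynamicContent(html) {
  // Record which kinds of dynamic behavior appear anywhere in the page source.
  const findings = Object.entries(DYNAMIC_PATTERNS)
    .filter(([, pattern]) => pattern.test(html))
    .map(([kind]) => kind);

  // Tag each <form> that does not already carry a marker so a reviewer can find it.
  const marked = html.replace(
    /<form\b(?![^>]*\bdata-marker=)/gi,
    '<form data-marker="form"'
  );

  return { findings, marked };
}
```

A reviewer, human or LLM, could then search the cloned files for the marker attribute to find the parts of a page that will not work as static content.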

At a glance

Capabilities: Static Site Generation, Web Scraping, AWS S3 Deployment, Link Rewriting, Dynamic Content Analysis

Components: CLI Interface, Web UI Server, URL Enumerator, Asset Downloader, Link Rewriter, Dynamic Detector, S3 Uploader, 404 Page Generator

Tech: Node.js, JavaScript, ES Modules, Cheerio, Axios, AWS SDK v3, Commander.js, P-Queue

Depends on: Node.js runtime, npm package manager, AWS Account (for S3 deployment), EC2 IAM Role or AWS Credentials

Integrates with: AWS S3, Nginx (via bootstrap script), Systemd (via bootstrap script)

Patterns: Command Line Interface, Phase-based Processing, Breadth-First Search, Static Site Hosting, Dependency Injection (Logger/Config)

Reuse tags: web-scraping, static-site-generator, aws-s3, nodejs-tool, site-migration, archiving