A six-phase static site cloning tool that enumerates URLs, downloads assets, rewrites links, detects dynamic content, and deploys to AWS S3.
https://github.com/davidbmar/website-cloner · public · shipped
website-cloner is a Node.js utility that creates static replicas of dynamic websites. It runs as a sequence of phases: URL enumeration via breadth-first search (BFS), asset download (HTML, CSS, JS, images), rewriting of absolute links to relative paths, dynamic content detection (marking API calls, forms, and WebSocket usage for LLM review), and optional deployment to AWS S3 with static website hosting configured. It ships with both a CLI for automation and a Web UI for interactive use.
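The BFS enumeration step can be illustrated with a minimal sketch. This is not the project's Enumerator: the `linkGraph` object, the `enumerateUrls` function, the `maxDepth` parameter, and the shape of the manifest entries are all hypothetical, and an in-memory link graph stands in for real HTTP fetches so the example runs without a network.

```javascript
// Hypothetical in-memory link graph standing in for real page fetches.
const linkGraph = {
  "https://example.com/": [
    "https://example.com/about",
    "https://example.com/blog",
    "https://other.org/",
  ],
  "https://example.com/about": ["https://example.com/", "https://example.com/team"],
  "https://example.com/blog": ["https://example.com/blog/post-1"],
  "https://example.com/team": [],
  "https://example.com/blog/post-1": ["https://example.com/about"],
};

// Breadth-first enumeration: visit pages level by level, stay on the
// start URL's origin, and stop expanding links at maxDepth (an assumed option).
function enumerateUrls(startUrl, getLinks, maxDepth = 2) {
  const origin = new URL(startUrl).origin;
  const seen = new Set([startUrl]);
  const queue = [{ url: startUrl, depth: 0 }];
  const manifest = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    manifest.push({ url, depth });
    if (depth >= maxDepth) continue;

    for (const link of getLinks(url)) {
      if (new URL(link).origin !== origin) continue; // skip external sites
      if (seen.has(link)) continue; // never enqueue a URL twice
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return manifest;
}

const manifest = enumerateUrls("https://example.com/", (u) => linkGraph[u] ?? []);
console.log(manifest.map((entry) => `${entry.depth} ${entry.url}`).join("\n"));
```

The `seen` set is what keeps cyclic links (for example, every page linking back to the home page) from causing an endless crawl, and the FIFO queue is what makes the traversal breadth-first rather than depth-first.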
bash setup.sh
cp config.example.json mysite-config.json
node clone-website.js --config=mysite-config.json --full
flowchart TD
User[User] -->|Config| CLI[clone-website.js]
CLI --> Enum[Enumerator]
CLI --> Down[Downloader]
CLI --> Rewriter[LinkRewriter]
CLI --> Dyn[DynamicDetector]
CLI --> S3[S3Uploader]
Enum -->|manifest.json| Down
Down -->|Local Files| Rewriter
Rewriter -->|Modified Files| Dyn
Dyn -->|Marked Files| S3
S3 -->|Upload| AWS[AWS S3 Bucket]
subgraph Libs
Axios[axios]
Cheerio[cheerio]
PQueue[p-queue]
end
Down --> Axios
Rewriter --> Cheerio
Down --> PQueue
Built with Node.js using ES modules. Key libraries include `axios` for HTTP requests, `cheerio` for HTML parsing and manipulation, `commander` for CLI argument parsing, `p-queue` for concurrency control, and `@aws-sdk/client-s3` for cloud deployment. The architecture separates concerns into modular classes: Enumerator, Downloader, LinkRewriter, DynamicDetector, and S3Uploader, orchestrated by a main CLI entry point.
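To show what concurrency control means for the download phase, here is a dependency-free sketch of a concurrency-limited queue. It is a hand-rolled stand-in for the role `p-queue` plays, not the project's code: `createLimiter`, `fakeDownload`, and the asset paths are hypothetical, and a timer replaces the real HTTP request.

```javascript
// Minimal concurrency limiter: at most `concurrency` tasks run at once.
function createLimiter(concurrency) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= concurrency || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // a slot freed up, so start the next waiting task
      });
  };

  // add() returns a promise that settles with the task's result.
  return function add(task) {
    return new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
  };
}

// Hypothetical download task: resolves after a short delay instead of doing HTTP.
let running = 0;
let peak = 0;
function fakeDownload(url) {
  running += 1;
  peak = Math.max(peak, running);
  return new Promise((resolve) =>
    setTimeout(() => {
      running -= 1;
      resolve(`saved ${url}`);
    }, 10)
  );
}

const add = createLimiter(2);
const urls = ["/a.css", "/b.js", "/c.png", "/d.html"];
const done = Promise.all(urls.map((url) => add(() => fakeDownload(url))));
done.then((results) => console.log(results, "peak concurrency:", peak));
```

With `p-queue` itself, the equivalent setup would be `new PQueue({ concurrency: 2 })` and `queue.add(() => download(url))`. Capping concurrency keeps a crawl from overwhelming the target server while still downloading assets in parallel.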
sequenceDiagram
participant U as User
participant C as CLI
participant E as Enumerator
participant D as Downloader
participant R as LinkRewriter
participant DD as DynamicDetector
participant S as S3Uploader
U->>C: Run with --full flag
C->>E: Execute Phase 2 (Enumerate)
E->>E: BFS Crawl Target URL
E-->>C: Generate manifest.json
C->>D: Execute Phase 3 (Download)
D->>D: Load manifest.json
D->>D: Download HTML & Assets
D-->>C: Save to output directory
C->>R: Execute Phase 4 (Rewrite)
R->>R: Parse HTML with Cheerio
R->>R: Convert Absolute to Relative Links
R-->>C: Save modified files
C->>DD: Execute Phase 5 (Detect)
DD->>DD: Scan for API/Form/WebSocket usage
DD->>DD: Add data-marker attributes
DD-->>C: Save marked files
C->>S: Execute Phase 6 (Deploy)
S->>S: Configure S3 Bucket
S->>S: Upload Static Assets
S-->>U: Return S3 Website URL
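The rewrite phase in the diagram above converts absolute links to relative ones. A minimal sketch of that conversion follows; it uses only the built-in `URL` class rather than Cheerio, and the function names and the rule that directory URLs map to `index.html` are assumptions, not the project's actual LinkRewriter behavior. Query strings are dropped in this sketch.

```javascript
// Map a URL to the local file path a downloader might have saved it under.
// Assumption: URLs ending in "/" are stored as index.html in that directory.
function urlToLocalPath(url) {
  const { pathname } = new URL(url);
  return pathname.endsWith("/") ? `${pathname}index.html` : pathname;
}

// Compute a POSIX-style relative path from one local file to another.
function relativePath(fromFile, toFile) {
  const from = fromFile.split("/").slice(1, -1); // directories containing fromFile
  const to = toFile.split("/").slice(1); // directories plus the target file name
  let common = 0;
  while (
    common < from.length &&
    common < to.length - 1 &&
    from[common] === to[common]
  ) {
    common += 1;
  }
  const up = from.slice(common).map(() => ".."); // climb out of unshared directories
  const down = to.slice(common); // descend to the target
  return [...up, ...down].join("/");
}

// Rewrite one href found on pageUrl; links to other origins are left untouched.
function rewriteLink(pageUrl, href) {
  const target = new URL(href, pageUrl); // resolves both relative and absolute hrefs
  if (target.origin !== new URL(pageUrl).origin) return href;
  const relative = relativePath(urlToLocalPath(pageUrl), urlToLocalPath(target.href));
  return relative + target.hash;
}

const page = "https://example.com/blog/post.html";
console.log(rewriteLink(page, "https://example.com/css/site.css")); // ../css/site.css
console.log(rewriteLink(page, "https://example.com/blog/")); // index.html
console.log(rewriteLink(page, "https://cdn.example.net/lib.js")); // unchanged
```

Relative links matter here because the cloned site may be served from a different domain or an S3 bucket path, where links pointing back at the original host would either break or send visitors to the live site.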
Use this tool to archive legacy sites, create static backups of dynamic content, or migrate simple websites to serverless hosting on S3. It is particularly useful for developers needing to inspect how a site's assets are structured or to prepare a site for LLM-based refactoring by marking dynamic elements.
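As an illustration of what marking dynamic elements could look like, the sketch below tags forms and inline scripts that make API or WebSocket calls. It is a simplified, regex-based approximation, not the project's DynamicDetector: the `markDynamicContent` function, the pattern list, and the `data-marker` values are hypothetical, and a real implementation would more likely walk a parsed DOM (for example with Cheerio) than rely on regular expressions.

```javascript
// Patterns a detector might look for in inline scripts (illustrative list).
const SCRIPT_PATTERNS = [
  { type: "api-call", regex: /\bfetch\s*\(|\bXMLHttpRequest\b/ },
  { type: "websocket", regex: /\bnew\s+WebSocket\s*\(/ },
];

function markDynamicContent(html) {
  const findings = [];

  // Tag every <form ...> opening tag that is not already marked.
  let marked = html.replace(
    /<form\b(?![^>]*\bdata-marker=)([^>]*)>/gi,
    (match, attrs) => {
      findings.push("form");
      return `<form data-marker="form"${attrs}>`;
    }
  );

  // Tag <script> elements whose inline body matches a known pattern.
  marked = marked.replace(
    /<script\b([^>]*)>([\s\S]*?)<\/script>/gi,
    (match, attrs, body) => {
      const hit = SCRIPT_PATTERNS.find((pattern) => pattern.regex.test(body));
      if (!hit) return match; // purely static script: leave as-is
      findings.push(hit.type);
      return `<script data-marker="${hit.type}"${attrs}>${body}</script>`;
    }
  );

  return { html: marked, findings };
}

const sample =
  '<form action="/login"><input name="u"></form>' +
  '<script>fetch("/api/items")</script>' +
  '<script>console.log("static")</script>';
const result = markDynamicContent(sample);
console.log(result.findings); // [ 'form', 'api-call' ]
console.log(result.html);
```

Markers like these give a later reviewer, human or LLM, a concrete list of places where the static copy will not behave like the original site.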
✓ all on main — nothing unmerged.