I have a project for creating and archiving RSS feeds, keeping the full history from the time the crawler starts. I need to clean up a bit, then will open source it soon.
Docker is designed to be undetectable by default, the best way I have found is to set env IN_DOCKER=True manually in your Dockerfile + check that there is no $DISPLAY configured + that you're on linux. Usually if all/most of those are true you can safely add --no-sandbox --disable-setuid-sandbox --disable-dev-shm-usage etc. all the docker-specific flags. Thats what we do in https://github.com/ArchiveBox/ArchiveBox/blob/dev/Dockerfile...
Making docs available offline was one of my main motivations for building this tool. I will try Apple Docs too.
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
This brings back memories. Around twenty years ago, internet was still expensive dial-up, so I used to go to an internet cafe, run HTTrack to download websites and manga, copy everything onto my tiny 128MB USB stick (felt very large at that time), then bring it home and read offline ;))
You could use python -m http.server instead. I haven't tried it yet, but it should work.
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
I have a bunch of opinionated/personal-use binaries like this in my $HOME/bin/, like delete-all-npm, clean-rust-cache, download-youtube-playlist, and get-markdown <url>. It feels good, and I don't need to remember any commands. Sometimes my coding agent can figure out how to call some of those tools too ;))
Submitting this to Hacker News is the right place! Thanks for your idea. I will consider implementing that :)
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
I think the zim flow was perfect for offline use. I know I will be making use of it as soon as I can figure out how to pass chrome the cookies so I can be signed into the site. Didn't see it in the page, but I didn't look closely yet.
Not yet supporting cookies, since I created this tool for shadowing public websites first. I will add options to pass cookies later. It will pass them to the underlying Chrome/Chromium process, so it should not be hard to do.
I did something like that a very long time ago (Of course, I have forgotten)
reply