man pages are truly the best software documentation out there, with famously zero problems with formatting, content, or navigability.
In fact, haven't we all at some point wished we could access the man pages in a form factor more suitable for cozy reading, so we might curl up by the fire with a mug of cocoa and the documentation for xargs(1)? Good news: in this post you can join me as I figure out how to accomplish just that, by compiling the man pages on my own machine into an epub with nice formatting.
What's that, no one has ever wanted this and I'm just a very strange person? Well tough tapioca, I'm doing it anyway. This is my website and if you don't like it, you can just click here to travel to a different web page.
Part 1: Scraping and Converting
I spent a fair bit of time working on the problem of getting the actual page contents, but only for the pages I actually would want to read. I had some ideas for scraping the Debian package archive and using the Debian popularity contest metrics to fetch only the most used packages' docs.
Then I realized, a bit embarrassed, that I was doing all this on a Debian machine that had all the software I considered essential, so all that scraping and sorting was moot. If I wanted to learn the details of a program enough to peruse all its docs, I would also have to have the actual program installed. The files are inside the computer!
So, the first step is to convert all the local man files to ...something. Probably some kind of XML eventually, but I'm hoping I won't need to manually figure that part out. For now, markdown is a decent choice.
Actually, the zeroth step is collecting all the page files:
$ find /usr/share/man/man* -type f | sort | head -n 5
/usr/share/man/man1/7z.1.gz
/usr/share/man/man1/aa-enabled.1.gz
/usr/share/man/man1/aa-exec.1.gz
/usr/share/man/man1/aa-features-abi.1.gz
/usr/share/man/man1/acorn.1.gz
The vast majority are gzip compressed, some are uncompressed, and a tiny handful are bzip2 compressed. Let's sort that out.
function handle_decompress {
    local page="$1"
    local extension="${page##*.}"
    case "$extension" in
        gz)
            gzip -dc "$page"
            ;;
        bz2)
            bzip2 -dc "$page"
            ;;
        *)
            cat "$page"
            ;;
    esac
}
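To check that breakdown on your own machine, a rough tally by file extension might look like the sketch below. The tally_compression name is made up for illustration, and it trusts the extension rather than inspecting file contents.

```shell
# Count man page files per compression type, going by file extension only.
# tally_compression is a hypothetical helper; pass it one or more directories.
tally_compression() {
    find "$@" -type f | while IFS= read -r page; do
        case "${page##*.}" in
            gz)  echo "gzip" ;;
            bz2) echo "bzip2" ;;
            *)   echo "uncompressed" ;;
        esac
    done | sort | uniq -c
}

# Same directories as the find command above.
if [ -d /usr/share/man ]; then
    tally_compression /usr/share/man/man*
fi
```

Each output line is a count followed by the compression type, so the "tiny handful" of bzip2 pages should show up as a small number next to bzip2.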
OK, now those of you who've been shouting "PANDOC" at your screen - well first of all, you're worrying your neighbors. But you were correct, that's where we're going.
for page in $(find /usr/share/man/man* -type f); do
    handle_decompress "$page" | pandoc -f man -t markdown
done
This just prints the markdown to stdout so we can verify it's working. It does work, but it's pretty slow. With a small refactor we can use GNU parallel to speed things up a lot.
BOOK_PAGES="./book"
export BOOK_PAGES
# ..snip..
export -f handle_decompress
function process_manpage {
    local section_dir="$BOOK_PAGES/$(basename "$(dirname "$1")")"
    mkdir -p "$section_dir"
    local clean_filename=$(basename "$1" .gz)
    # the few bzip2 pages need their suffix stripped too
    clean_filename="${clean_filename%.bz2}"
    local output_file="$section_dir/$clean_filename.md"
    handle_decompress "$1" | pandoc -f man -t markdown > "$output_file"
}
export -f process_manpage
for man_dir in /usr/share/man/man*; do
    find "$man_dir" -type f | parallel process_manpage {}
done
I also made the outputs redirect to a specified directory while maintaining the original subdirectory structure. The export -f lines are needed to make the functions available in each parallel sub-process. Now we've got the pages, let's make the book!
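A quick aside on those export -f lines before moving on: the sketch below shows the effect in isolation, assuming bash is the shell on both sides. The show_export_effect and greet names are invented for the demo, and plain bash -c stands in for the sub-processes that parallel starts.

```shell
# Demonstrates why `export -f` matters: a child bash process only sees
# functions that were exported by its parent.
# The body runs in a subshell ( ... ) so the export does not leak out.
show_export_effect() (
    greet() { echo "hello from $1"; }

    # Not exported yet: the child shell does not know what `greet` is.
    bash -c 'greet child' 2>/dev/null || echo "greet is not visible yet"

    export -f greet
    # Exported: the function definition travels through the environment.
    bash -c 'greet child'
)

show_export_effect
```

The first child fails to find greet, and the second one runs it, which is the same reason process_manpage and handle_decompress both need exporting.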
Part 2: Making the Book
I expected this to be more or less a pandoc -> calibre pipeline, with pandoc for file conversion on the individual pages and calibre for the overall book structure. But it turns out pandoc can just output epub directly. Markdown was a lucky choice for the initial conversion[1], since pandoc loves working with markdown. We can just supply a list of markdown files in the correct chapter order.
pandoc $(find "$BOOK_PAGES" -type f | sort) -o manpages.epub
It works! But the chapter divisions are all borked; each subheader within a page is counted as a new chapter. To fix this we can use sed to bump all the headers down by one level.
# ...
handle_decompress "$1" | pandoc -f man -t markdown | sed 's/^\(#\+\)\s*\([^#]\+\)/#\1 \2/' > "$output_file"
#...
This just adds one extra '#' to all headers. So that each page becomes a chapter, we can prepend the page's name as an h1 header at the top:
# ...
echo "$clean_filename" | sed 's/^\(.\+\)\.\([^.]\+\)$/# \1(\2)\n/' > "$output_file"
handle_decompress "$1" | \
pandoc -f man -t markdown | \
sed 's/^\(#\+\)\s*\([^#]\+\)/#\1 \2/' >> "$output_file"
# ...
...and then generate a table of contents based on those inserted headers:
pandoc $(find "$BOOK_PAGES" -type f | sort ) \
--toc --toc-depth=1 \
--metadata title="Man Pages" \
-o manpages.epub
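To see what the two sed steps do without running the whole pipeline, here is a small sketch on sample input. The bump_headers and title_line names are made up for illustration. bump_headers reuses the header expression from above and relies on GNU sed extensions (\+ and \s). title_line uses a variant of the title expression that treats everything after the last dot as the section, which is an assumption about how the page files are named.

```shell
# Push every markdown header down one level (GNU sed syntax).
bump_headers() {
    sed 's/^\(#\+\)\s*\([^#]\+\)/#\1 \2/'
}

# Turn a file name such as "xargs.1" into an h1 title line "# xargs(1)".
# Assumption: the section is whatever follows the last dot in the name.
title_line() {
    printf '%s\n' "$1" | sed 's/^\(.*\)\.\([^.]*\)$/# \1(\2)/'
}

# Sample page: an h1 and an h2 become an h2 and an h3; plain text is untouched.
printf '# NAME\n\nxargs - build and execute command lines\n\n## Exit status\n' | bump_headers

title_line "xargs.1"
title_line "SSL_read.3ssl"
```

pandoc also ships a --shift-heading-level-by option, which may be able to replace the header-bumping sed step; that is untested against this pipeline.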
Voila! Here's the full script for completeness.
#!/bin/bash
BOOK_PAGES="./book"
export BOOK_PAGES
# stop on error
set -eu
# print decompressed file contents to stdout
function handle_decompress {
    local page="$1"
    local extension="${page##*.}"
    case "$extension" in
        gz)
            gzip -dc "$page"
            ;;
        bz2)
            bzip2 -dc "$page"
            ;;
        *)
            cat "$page"
            ;;
    esac
}
export -f handle_decompress
function process_manpage {
    local section_dir="$BOOK_PAGES/$(basename "$(dirname "$1")")"
    mkdir -p "$section_dir"
    local clean_filename=$(basename "$1" .gz)
    # the few bzip2 pages need their suffix stripped too
    clean_filename="${clean_filename%.bz2}"
    local output_file="$section_dir/$clean_filename.md"
    # make the header
    echo "$clean_filename" | sed 's/^\(.\+\)\.\([^.]\+\)$/# \1(\2)\n/' > "$output_file"
    # decompress, transform, edit headers, and write to output file
    handle_decompress "$1" | \
        pandoc -f man -t markdown | \
        sed 's/^\(#\+\)\s*\([^#]\+\)/#\1 \2/' >> "$output_file"
}
export -f process_manpage
for man_dir in /usr/share/man/man*; do
    find "$man_dir" -type f | parallel process_manpage {}
done
# actually compile the pages into a book
pandoc $(find "$BOOK_PAGES" -type f | sort) \
    --toc --toc-depth=1 \
    --metadata title="Man Pages" \
    -o manpages.epub
It works quite well! There are a few tweaks I might make in the future:
- The chapters are ordered correctly by section, and alphabetically within each section (so all the man 1 pages come before the man 2 pages, etc), but there's no "official" subdivision based on section; if you want to view only the pages in section 8, you need to scroll past all of sections 1-7 in the table of contents. Probably I can change the regex for headers to make this work.
- The actual compilation step takes a long time since it's not (easily) made parallel. I don't know a good solution to this that also keeps this project within the realm of "easy breezy afternoon one-off". That said, it's Good Enough for me.
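For the first tweak, one possible direction (a sketch, not something the script above does) is to generate a heading file per section and list it ahead of that section's pages. The list_book_files function and the SECTION.md file name are hypothetical. This also assumes the page titles would move down to h2, with the remaining headers shifted to match, and that the table of contents depth is raised to 2.

```shell
# Print the markdown files in book order, with a generated heading file
# in front of each section's pages. SECTION.md is a hypothetical name.
list_book_files() {
    local book_dir="$1" section_dir heading
    for section_dir in "$book_dir"/man*/; do
        [ -d "$section_dir" ] || continue
        section_dir="${section_dir%/}"
        heading="$section_dir/SECTION.md"
        # e.g. ./book/man8 gets a heading file containing "# Section 8"
        printf '# Section %s\n\n' "${section_dir##*/man}" > "$heading"
        printf '%s\n' "$heading"
        find "$section_dir" -type f -name '*.md' ! -name 'SECTION.md' | sort
    done
}
```

The final step would then become something like pandoc $(list_book_files "$BOOK_PAGES") --toc --toc-depth=2 --metadata title="Man Pages" -o manpages.epub.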
[1] Or it appears that way to you anyway, through the magic of "writing the post after completing the project".