I use Anki a lot. In a nutshell, it's a spaced repetition program that works almost eerily well for memorizing certain kinds of data. Lately I've been using it to help study for the CompTIA A+ certification exam, which requires you to know, among other things, approximately a million facts about wireless communication standards -- exactly the sort of thing Anki is great at. Anki has a web sharing feature called AnkiWeb where you can find pre-made decks on everything from Aramaic to Zoology, but I've found that there's no substitute for making your own cards; you get a deck tailored to exactly your needs, plus it's a great study exercise in itself.
I wrote a small Rust program to convert the markdown notes I take as I read or watch video lectures into Anki cards in the .apkg
format; It's not especially complex since most of the actual conversion work is offloaded to the genanki crate. What's relevant here is that it takes a "deck id" parameter, which defines the deck where the generated cards will be placed once added to a user's collection. The deck id isn't normally shown to the user, though, it's an internal implementation detail without a matching UI element. My workaround so far has been to generate the cards assigned to a dummy deck, then manually move them to the actual deck, but why spend one minute on a task when you could instead waste a whole morning automating it?
Anyway, that was a lot of prelude to say this post is about cracking open a .apkg
file to find where the deck id is hidden, and also just to generally see what's in there. I exported my main study deck (including media - this will come up later) to a file, and I invite you to join me on a tour of its internals. Let's go!
Part 1: Taking Inventory
Let's start by figuring out what a .apkg
file actually is.
$ file COMPTIA\ A+\ 1.apkg
COMPTIA A+ 1.apkg: Zip archive data, at least v2.0 to extract, compression method=store
$ 7z x COMPTIA\ A+\ 1.apkg
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (806EA),ASM,AES-NI)
Scanning the drive for archives:
1 file, 3076466 bytes (3005 KiB)
Extracting archive: COMPTIA A+ 1.apkg
--
Path = COMPTIA A+ 1.apkg
Type = zip
Physical Size = 3076466
Everything is Ok
Files: 183
Size: 3891960
Compressed: 3076466
$ ls -alh # all files, with all data, human readable sizes
total 7.3M
drwxr-xr-x 2 k k 4.0K Mar 8 11:24 .
drwxr-xr-x 6 k k 4.0K Mar 8 11:23 ..
-rw-r--r-- 1 k k 213 Mar 8 2024 0
-rw-r--r-- 1 k k 375 Mar 8 2024 1
-rw-r--r-- 1 k k 1.4K Mar 8 2024 10
-rw-r--r-- 1 k k 371 Mar 8 2024 100
-rw-r--r-- 1 k k 1.1K Mar 8 2024 101
-rw-r--r-- 1 k k 216 Mar 8 2024 102
-rw-r--r-- 1 k k 782 Mar 8 2024 103
-rw-r--r-- 1 k k 226 Mar 8 2024 104
(...)
-rw-r--r-- 1 k k 1.3K Mar 8 2024 98
-rw-r--r-- 1 k k 1.8K Mar 8 2024 99
-rw-r--r-- 1 k k 49K Mar 8 2024 collection.anki2
-rw-r--r-- 1 k k 1.2M Mar 8 2024 collection.anki21
-rw------- 1 k k 3.0M Mar 8 11:24 'COMPTIA A+ 1.apkg'
-rw-r--r-- 1 k k 8.7K Mar 8 2024 media
-rw-r--r-- 1 k k 2 Mar 8 2024 meta
It's a zip archive with a whole lot of stuff:
collection.anki2
andcollection.anki21
, which I suspect are where we'll find the actual deck data.- a lil tiny
meta
file - a beefier
media
file - 99 extensionless files ranging in size from a few hundred to a few thousand KiB
Let's start with the collection files
$ file collection.anki2
collection.anki2: SQLite 3.x database, last written using SQLite version 3044000, page size 512, file counter 5, database pages 98, 1st free page 59, free pages 55, cookie 0x2f, schema 4, UTF-8, version-valid-for 5
$ file collection.anki21
collection.anki21: SQLite 3.x database, last written using SQLite version 3044000, file counter 3, database pages 303, 1st free page 300, free pages 27, cookie 0x2e, schema 4, UTF-8, version-valid-for 3
Aha! Sqlite! We can work with sqlite! First, though, I want to try identifying the other files
$ file meta
meta: data
Oh, it's data. That clears everything up.
$ cat meta
$ hexdump meta
0000000 0208
0000002
Checksum? ¯\_(ツ)_/¯ Anyhow, on to media
:
$ file media
media: JSON text data
$ jq '.' media
{
"90": "f3327b4a32294533a064d02a593d1bf8-ao-5-A.svg",
"46": "8136b7a2cf124cc98436eb041f6ac446-ao-2-A.svg",
"51": "ab18887cee264ca78e187ce85eb58c8a-ao-4-Q.svg",
(...)
"130": "raid.png",
(...)
"105": "8136b7a2cf124cc98436eb041f6ac446-ao-2-Q.svg",
"86": "e7078dd0ffd042c0a4266a828f47b3a2-oa-12-Q.svg"
}
It's a JSON mapping between integers and image file names, some PNG but mostly SVG. Maybe this is connected to all those files that are just numbers.
$ file 130
130: PNG image data, 847 x 324, 8-bit/color RGB, non-interlaced
$ du -hs 130 # size of single file, human readable
148K 130
So there's the mystery of the wildly varying file sizes solved. The big ones are PNG, and the small ones are SVG. Anki lets you embed images in your cards; the PNG images with normal file names are ones I've added manually. raid.png
is, predictably, a diagram of RAID configuration types. The SVG files are programmatically generated compliments to those images2.
Ok, enough stalling, it's time to look into the actual database(s).
Part 2: The database
$ sqlite3 collection.anki2
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> .tables
cards col graves notes revlog
hacker voice I'm in.
I'll spare you the full account of how I stumbled through this database through trial and error, and just link this comprehensive overview of the schema from the AnkiDroid app repo (I only found this after expending a fair chunk of time on my own reverse engineering attempts, but it's about the journey, right?).
There's five tables - cards
, col
, graves
, notes
, and revlog
3. graves
and revlog
contain scheduling information we don't care about for now. col
is short for "collections" and has metadata on the database as a whole. That leaves cards
and notes
; to be honest I'm not sure which one would have the deck id, so I'll check both.
sqlite> pragma table_info(notes);
0|id|INTEGER|0||1
1|guid|TEXT|1||0
2|mid|INTEGER|1||0
3|mod|INTEGER|1||0
4|usn|INTEGER|1||0
5|tags|TEXT|1||0
6|flds|TEXT|1||0
7|sfld|INTEGER|1||0
8|csum|INTEGER|1||0
9|flags|INTEGER|1||0
10|data|TEXT|1||0
sqlite> pragma table_info(cards);
0|id|INTEGER|0||1
1|nid|INTEGER|1||0
2|did|INTEGER|1||0
3|ord|INTEGER|1||0
4|mod|INTEGER|1||0
5|usn|INTEGER|1||0
6|type|INTEGER|1||0
7|queue|INTEGER|1||0
8|due|INTEGER|1||0
9|ivl|INTEGER|1||0
10|factor|INTEGER|1||0
11|reps|INTEGER|1||0
12|lapses|INTEGER|1||0
13|left|INTEGER|1||0
14|odue|INTEGER|1||0
15|odid|INTEGER|1||0
16|flags|INTEGER|1||0
17|data|TEXT|1||0
That did
field should be what we're after (I now know, after reading the linked page). Let's get it.
sqlite> select did from cards;
1
Huh? There should be a row for each card in the deck, and I'm pretty sure the deck id is not 1. There's two database files in the archive though, let's check the other one.
$ sqlite3 collection.anki21
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> select * from cards limit 5;
1706642722425|1706642722424|1706642737485|0|1707681587|94|2|2|42|29|2650|5|0|0|0|0|0|{"pos":0}
1706642722427|1706642722426|1706642737485|0|1708459261|164|2|2|80|58|2950|5|0|0|0|0|0|{"pos":0}
1706642722428|1706642722426|1706642737485|1|1709324046|242|2|2|113|81|2950|5|0|0|0|0|0|{"pos":0}
1706642722429|1706642722426|1706642737485|2|1708277757|151|2|2|61|41|2800|5|0|0|0|0|0|{"pos":0}
1706642722431|1706642722430|1706642737485|0|1708126546|141|2|2|50|32|2650|6|0|0|0|0|0|{"pos":0}
That did it. My hypothesis is that collection.anki2
has the default deck configuration, and collection.anki21
has the actual user-supplied content, but I'm not positive. Anyway, all these cards ought to have the same deck id if we didn't mess up.
sqlite> select distinct did from cards;
1706642737485
Jackpot! 4
Final Thoughts
The .apkg
format is pretty comprehensible, actually. It's just a zip archive containing sqlite database files with the actual note data, plus any associated media files. Finding the schema online was a big time save, but honestly I feel like I could've figured it out on my own eventually. My main takeaway is a desire to write an Anki add-on, now that I know it's just sqlite under the hood.
I love how 7zip reassures you that everything is OK. I wish more software provided positive affirmations.
Specifically, they're part of the Image Occlusion Enhanced plugin, which I highly recommend. In a nutshell, it lets you make cards by blocking out part of an image and guessing what's hidden. This is especially useful for learning spatial information like the regions of the cortex, or, if you're lazy like me, screenshotting a table from a PDF and blocking out cells to memorize. The SVG files are applied as a mask to accomplish this.
If you're not already familiar with Anki you might be wondering why "cards" and "notes" are different entities with different tables. A note is a set of "fields" (which can be anything - text, images, audio, whatever) that are somehow conceptually linked. A card is a view over that set of fields. A note produces at least one card, but usually more. For example, a note with fields containing "Capital of France" and "Paris" could produce 2 separate cards, one where you're given the prompt "Capital of France" and must answer "Paris", and one where you're given "Paris" and must answer "Capital of France."
I kind of misled you earlier. I did know the deck id at one point, because my Rust program is a rewrite of a much simpler Python script which I used to create the deck in the first place. That script needed to put something in the id field for the new deck, so I declared it by fiat to be an arbitrary unique value, the epoch timestamp at the time of creation. But then, I lost the original script in a drive failure and with it the id for the deck, necessitating both a rewrite and all this spelunking. But! One of the facts I discovered in this project was that Anki itself assigns deck ids as the current epoch timestamp anyway! I was accidentally correct all along!