../anki-apkg

Shell Adventures: What's Inside an Anki Collection File?

I use Anki a lot. In a nutshell, it's a spaced repetition program that works almost eerily well for memorizing certain kinds of data. Lately I've been using it to help study for the CompTIA A+ certification exam, which requires you to know, among other things, approximately a million facts about wireless communication standards -- exactly the sort of thing Anki is great at. Anki has a web sharing feature called AnkiWeb where you can find pre-made decks on everything from Aramaic to Zoology, but I've found that there's no substitute for making your own cards; you get a deck tailored to exactly your needs, plus it's a great study exercise in itself.

I wrote a small Rust program to convert the markdown notes I take as I read or watch video lectures into Anki cards in the .apkg format; It's not especially complex since most of the actual conversion work is offloaded to the genanki crate. What's relevant here is that it takes a "deck id" parameter, which defines the deck where the generated cards will be placed once added to a user's collection. The deck id isn't normally shown to the user, though, it's an internal implementation detail without a matching UI element. My workaround so far has been to generate the cards assigned to a dummy deck, then manually move them to the actual deck, but why spend one minute on a task when you could instead waste a whole morning automating it?

Anyway, that was a lot of prelude to say this post is about cracking open a .apkg file to find where the deck id is hidden, and also just to generally see what's in there. I exported my main study deck (including media - this will come up later) to a file, and I invite you to join me on a tour of its internals. Let's go!

Part 1: Taking Inventory

Let's start by figuring out what a .apkg file actually is.

$ file COMPTIA\ A+\ 1.apkg 

COMPTIA A+ 1.apkg: Zip archive data, at least v2.0 to extract, compression method=store

$ 7z x COMPTIA\ A+\ 1.apkg 

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (806EA),ASM,AES-NI)

Scanning the drive for archives:
1 file, 3076466 bytes (3005 KiB)

Extracting archive: COMPTIA A+ 1.apkg
--
Path = COMPTIA A+ 1.apkg
Type = zip
Physical Size = 3076466

Everything is Ok

Files: 183
Size:       3891960
Compressed: 3076466

$ ls -alh # all files, with all data, human readable sizes

total 7.3M
drwxr-xr-x 2 k k 4.0K Mar  8 11:24  .
drwxr-xr-x 6 k k 4.0K Mar  8 11:23  ..
-rw-r--r-- 1 k k  213 Mar  8  2024  0
-rw-r--r-- 1 k k  375 Mar  8  2024  1
-rw-r--r-- 1 k k 1.4K Mar  8  2024  10
-rw-r--r-- 1 k k  371 Mar  8  2024  100
-rw-r--r-- 1 k k 1.1K Mar  8  2024  101
-rw-r--r-- 1 k k  216 Mar  8  2024  102
-rw-r--r-- 1 k k  782 Mar  8  2024  103
-rw-r--r-- 1 k k  226 Mar  8  2024  104
(...)
-rw-r--r-- 1 k k 1.3K Mar  8  2024  98
-rw-r--r-- 1 k k 1.8K Mar  8  2024  99
-rw-r--r-- 1 k k  49K Mar  8  2024  collection.anki2
-rw-r--r-- 1 k k 1.2M Mar  8  2024  collection.anki21
-rw------- 1 k k 3.0M Mar  8 11:24 'COMPTIA A+ 1.apkg'
-rw-r--r-- 1 k k 8.7K Mar  8  2024  media
-rw-r--r-- 1 k k    2 Mar  8  2024  meta

1

It's a zip archive with a whole lot of stuff:

Let's start with the collection files

$ file collection.anki2 
collection.anki2: SQLite 3.x database, last written using SQLite version 3044000, page size 512, file counter 5, database pages 98, 1st free page 59, free pages 55, cookie 0x2f, schema 4, UTF-8, version-valid-for 5

$ file collection.anki21
collection.anki21: SQLite 3.x database, last written using SQLite version 3044000, file counter 3, database pages 303, 1st free page 300, free pages 27, cookie 0x2e, schema 4, UTF-8, version-valid-for 3

Aha! Sqlite! We can work with sqlite! First, though, I want to try identifying the other files

$ file meta 
meta: data

Oh, it's data. That clears everything up.

$ cat meta
$ hexdump meta
0000000 0208                                   
0000002

Checksum? ¯\_(ツ)_/¯ Anyhow, on to media:

$ file media 
media: JSON text data

$ jq '.' media 
{
  "90": "f3327b4a32294533a064d02a593d1bf8-ao-5-A.svg",
  "46": "8136b7a2cf124cc98436eb041f6ac446-ao-2-A.svg",
  "51": "ab18887cee264ca78e187ce85eb58c8a-ao-4-Q.svg",
   (...)
  "130": "raid.png",
   (...)
 "105": "8136b7a2cf124cc98436eb041f6ac446-ao-2-Q.svg",
  "86": "e7078dd0ffd042c0a4266a828f47b3a2-oa-12-Q.svg"
}

It's a JSON mapping between integers and image file names, some PNG but mostly SVG. Maybe this is connected to all those files that are just numbers.

$ file 130
130: PNG image data, 847 x 324, 8-bit/color RGB, non-interlaced

$ du -hs 130 # size of single file, human readable
148K    130

So there's the mystery of the wildly varying file sizes solved. The big ones are PNG, and the small ones are SVG. Anki lets you embed images in your cards; the PNG images with normal file names are ones I've added manually. raid.png is, predictably, a diagram of RAID configuration types. The SVG files are programmatically generated compliments to those images2.

Ok, enough stalling, it's time to look into the actual database(s).

Part 2: The database

$ sqlite3 collection.anki2
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> .tables
cards   col     graves  notes   revlog

hacker voice I'm in.

I'll spare you the full account of how I stumbled through this database through trial and error, and just link this comprehensive overview of the schema from the AnkiDroid app repo (I only found this after expending a fair chunk of time on my own reverse engineering attempts, but it's about the journey, right?).

There's five tables - cards, col, graves, notes, and revlog3. graves and revlog contain scheduling information we don't care about for now. col is short for "collections" and has metadata on the database as a whole. That leaves cards and notes; to be honest I'm not sure which one would have the deck id, so I'll check both.

sqlite> pragma table_info(notes);
0|id|INTEGER|0||1
1|guid|TEXT|1||0
2|mid|INTEGER|1||0
3|mod|INTEGER|1||0
4|usn|INTEGER|1||0
5|tags|TEXT|1||0
6|flds|TEXT|1||0
7|sfld|INTEGER|1||0
8|csum|INTEGER|1||0
9|flags|INTEGER|1||0
10|data|TEXT|1||0

sqlite> pragma table_info(cards);
0|id|INTEGER|0||1
1|nid|INTEGER|1||0
2|did|INTEGER|1||0
3|ord|INTEGER|1||0
4|mod|INTEGER|1||0
5|usn|INTEGER|1||0
6|type|INTEGER|1||0
7|queue|INTEGER|1||0
8|due|INTEGER|1||0
9|ivl|INTEGER|1||0
10|factor|INTEGER|1||0
11|reps|INTEGER|1||0
12|lapses|INTEGER|1||0
13|left|INTEGER|1||0
14|odue|INTEGER|1||0
15|odid|INTEGER|1||0
16|flags|INTEGER|1||0
17|data|TEXT|1||0

That did field should be what we're after (I now know, after reading the linked page). Let's get it.

sqlite> select did from cards;
1

Huh? There should be a row for each card in the deck, and I'm pretty sure the deck id is not 1. There's two database files in the archive though, let's check the other one.

$ sqlite3 collection.anki21
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> select * from cards limit 5;
1706642722425|1706642722424|1706642737485|0|1707681587|94|2|2|42|29|2650|5|0|0|0|0|0|{"pos":0}
1706642722427|1706642722426|1706642737485|0|1708459261|164|2|2|80|58|2950|5|0|0|0|0|0|{"pos":0}
1706642722428|1706642722426|1706642737485|1|1709324046|242|2|2|113|81|2950|5|0|0|0|0|0|{"pos":0}
1706642722429|1706642722426|1706642737485|2|1708277757|151|2|2|61|41|2800|5|0|0|0|0|0|{"pos":0}
1706642722431|1706642722430|1706642737485|0|1708126546|141|2|2|50|32|2650|6|0|0|0|0|0|{"pos":0}

That did it. My hypothesis is that collection.anki2 has the default deck configuration, and collection.anki21 has the actual user-supplied content, but I'm not positive. Anyway, all these cards ought to have the same deck id if we didn't mess up.

sqlite> select distinct did from cards;
1706642737485

Jackpot! 4

Final Thoughts

The .apkg format is pretty comprehensible, actually. It's just a zip archive containing sqlite database files with the actual note data, plus any associated media files. Finding the schema online was a big time save, but honestly I feel like I could've figured it out on my own eventually. My main takeaway is a desire to write an Anki add-on, now that I know it's just sqlite under the hood.


1

I love how 7zip reassures you that everything is OK. I wish more software provided positive affirmations.

2

Specifically, they're part of the Image Occlusion Enhanced plugin, which I highly recommend. In a nutshell, it lets you make cards by blocking out part of an image and guessing what's hidden. This is especially useful for learning spatial information like the regions of the cortex, or, if you're lazy like me, screenshotting a table from a PDF and blocking out cells to memorize. The SVG files are applied as a mask to accomplish this.

3

If you're not already familiar with Anki you might be wondering why "cards" and "notes" are different entities with different tables. A note is a set of "fields" (which can be anything - text, images, audio, whatever) that are somehow conceptually linked. A card is a view over that set of fields. A note produces at least one card, but usually more. For example, a note with fields containing "Capital of France" and "Paris" could produce 2 separate cards, one where you're given the prompt "Capital of France" and must answer "Paris", and one where you're given "Paris" and must answer "Capital of France."

4

I kind of misled you earlier. I did know the deck id at one point, because my Rust program is a rewrite of a much simpler Python script which I used to create the deck in the first place. That script needed to put something in the id field for the new deck, so I declared it by fiat to be an arbitrary unique value, the epoch timestamp at the time of creation. But then, I lost the original script in a drive failure and with it the id for the deck, necessitating both a rewrite and all this spelunking. But! One of the facts I discovered in this project was that Anki itself assigns deck ids as the current epoch timestamp anyway! I was accidentally correct all along!