Part 1 - Compiled Dictionary Format - Aspell Developer's Manual

Next: Part 2 - Quickly Finding Similar Soundslike, Previous: How It All Works, Up: How It All Works

15.1 Part 1 - The Compiled Dictionary Format

In this part you will see how the data is laid out in the compiled dictionary for Aspell 0.60. See source file readonly_ws.cpp.

Aspell's main compiled wordlist dictionary file is made as follows:

header
jump table for editdist 1
jump table for editdist 2
data block
hash table

There is nothing particularly interesting about the header. Just a bunch of meta information.

The jump tables are described in the next section ...

Words in the data block are grouped based on the soundslike. Each group is as follows:

     <8 bit: offset to next item><8 bit: soundslike size><soundslike>
     <null><words>

Each word group is as follows:

     <8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
     [<affix info><null>]

The offset for the final word in each group points to the next word in the following group. The offset for the final word and soundslike group in the dictionary is 0.

There is some provisions for additional info to be stored with the word but for simplicity, it's left out here. If soundslike data is not used then the soundslike block it not used.

This format makes it easy to iterate over the data without using the hash table.

Each soundslike group can be a maximum of about 256 bytes. If this limit is reached then the soundslike group is split. Using 2 bytes for the soundslike offset would of solved this problem however 256 bytes is normally sufficient, thus I would of wasted some space by using an extra byte. More importantly, Using 2 bytes means I would of had to worry about alignment issues.

The soundslike groups are sorted in more or less alphabetic order.

The hash table is a simple open address hash table. The key is the dictionary word in all lowercase form with all accents removed (what is known as the "clean" form of the word). The value stored in the table is a 32-bit offset to the beginning of the word. A 32-bit integer offset is used rather than a pointer so that the compiled dictionary can be mmaped to make loading the dictionary very fast and so that the memory can be shared between processed, and on 64 bit platforms using pointers would have doubled the size of the hash table.

Additional information for each word can be derived from each offset:

     word size: offset - 1
     offset to next word: offset - 2
     flags: offset - 3

I use helper functions for getting this information. Doing it this way rather than having a data structure is slightly evil, but admittedly, I have my reasons.

In the next section, you will see how Aspell uses the jump tables to search the list for soundslike with edit-distance of 1 or 2.