
7.6 Affix Compression

As of version 0.60, Aspell supports affix compression. The code is derived from MySpell, the spell checker found in OpenOffice.

To add support for affix compression, add the following lines to the language data file.

     affix          lang
     affix-compress true

The line `affix lang' adds support for recognizing affix information, and the line `affix-compress true' enables affix compression.

The affix file is expected to be named lang_affix.dat. It uses exactly the same format as the affix files used by MySpell. More information can be found in the myspell/ directory of the distribution or at http://lingucomponent.openoffice.org/dictionary.html.

Affix compression can also be used with soundslike lookup. Aspell does this by only storing the soundslike for the root word. When a word is misspelled it will search for a soundslike close to all possible roots of the misspelled word.

When no soundslike information, or only the simple soundslike, is used, it may be beneficial to specify the partially-expand option, which partially expands a word with affix information so that the affix flags do not affect the first 3 letters of the word. This allows Aspell to get more accurate results when scanning the list for near misses, since the full word can be used and not just the root. Specifying this option, however, also effectively expands any prefixes, so it should not be used for prefix-heavy languages such as Hebrew.

An existing word list, without affix info, can be affix compressed using aspell munch-list.

7.6.1 Format of the Affix File

An affix is either a prefix or a suffix attached to root words to make other words. For example, supply becomes supplied by dropping the "y" and adding the suffix "ied".

Here is an example of how to define one specific suffix borrowed from the English affix file.

     SFX D Y 4
     SFX D   0     d          e
     SFX D   y     ied        [^aeiou]y
     SFX D   0     ed         [^ey]
     SFX D   0     ed         [aeiou]y

This file is space delimited and case sensitive. So this information can be interpreted as follows:

The first line has 4 fields:

1 SFX indicates this is a suffix
2 D is the name of the character which represents this suffix
3 Y indicates it can be combined with prefixes (cross product)
4 4 indicates that a sequence of 4 affix entries is needed to properly store the affix information

The remaining lines describe the unique information for the 4 affix entries that make up this affix. Each line can be interpreted as follows: (note fields 1 and 2 are used as a check against line 1 info)

1 SFX indicates this is a suffix
2 D is the name of the character which represents this affix
3 y the string of chars to strip off before adding affix (a 0 here indicates the NULL string)
4 ied the string of affix characters to add (a 0 here indicates the NULL string)
5 [^aeiou]y the conditions which must be met before the affix can be applied

Field 5 is interesting. Since this is a suffix, field 5 tells us that there are 2 conditions that must be met, both tested against the end of the word. The first condition is that the next to last character in the word must not be any of "a", "e", "i", "o" or "u". The second condition is that the last character of the word must be "y".
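The interpretation above can be sketched in a few lines of Python. This is only an illustration, not Aspell's or MySpell's actual code: the names AFFIX_TEXT, parse_entries and apply_suffix are made up for this example, and it assumes that a condition can be treated as a regular expression anchored at the end of the word.

```python
import re

# The suffix definition from the example above, copied verbatim.
AFFIX_TEXT = """\
SFX D Y 4
SFX D   0     d          e
SFX D   y     ied        [^aeiou]y
SFX D   0     ed         [^ey]
SFX D   0     ed         [aeiou]y
"""

def parse_entries(text):
    """Return (strip, add, condition) triples for one affix definition."""
    header, *rows = text.splitlines()
    kind, flag, _cross, count = header.split()
    entries = []
    for row in rows[:int(count)]:
        row_kind, row_flag, strip, add, condition = row.split()
        # Fields 1 and 2 are checked against the header line.
        if (row_kind, row_flag) != (kind, flag):
            raise ValueError(f"entry does not match header: {row!r}")
        # A 0 in the strip or add field stands for the empty string.
        entries.append(("" if strip == "0" else strip,
                        "" if add == "0" else add,
                        condition))
    return entries

def apply_suffix(word, entries):
    """Return the forms produced by every entry whose condition matches."""
    forms = []
    for strip, add, condition in entries:
        # Assumption: a suffix condition is matched at the end of the word.
        if not re.search("(?:" + condition + ")$", word):
            continue
        if not word.endswith(strip):
            continue
        stem = word[:len(word) - len(strip)]
        forms.append(stem + add)
    return forms

ENTRIES = parse_entries(AFFIX_TEXT)
for root in ("create", "imply", "cross", "convey"):
    print(root, "->", apply_suffix(root, ENTRIES))
```

For each of the four roots exactly one entry matches, so every root yields a single past-tense form.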

FIXME: Add info in two-fold suffix stripping and CIRCUMFIX flag.

7.6.2 When Compared With Ispell

Now for comparison purposes, here is the same information from the Ispell english.aff compression file which was used as the basis for the OOo one.

     flag *D:
         E           >       D               # As in create > created
         [^AEIOU]Y   >       -Y,IED          # As in imply > implied
         [^EY]       >       ED              # As in cross > crossed
         [AEIOU]Y    >       ED              # As in convey > conveyed

The Ispell entry contains exactly the same information, but in a slightly different (case-insensitive) format.

The Ispell .aff format maps to our OOo format as follows:

  1. The Ispell english.aff has flag D under the "suffix" section so you know it is a suffix.
  2. The D is the character assigned to this suffix
  3. `*' indicates that it can be combined with prefixes
  4. Each line following the : describes the affix entries needed to define this suffix

In addition all chars in Ispell aff files are in uppercase.
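As a rough illustration of this mapping, the following Python sketch turns one Ispell rule line into the corresponding MySpell entry line. The function ispell_rule_to_myspell is hypothetical and not part of Aspell or Ispell, and it only handles the two rule shapes that appear in the example above ("COND > ADD" and "COND > -STRIP,ADD").

```python
def ispell_rule_to_myspell(flag, rule, kind="SFX"):
    """Convert one Ispell rule line into a MySpell affix entry line."""
    rule = rule.split("#", 1)[0]                  # drop the trailing comment
    condition, change = (part.strip() for part in rule.split(">", 1))
    condition = "".join(condition.split())        # remove inner whitespace
    strip, add = "0", change                      # 0 means nothing to strip
    if change.startswith("-"):
        # "-Y,IED" means: strip "Y", then add "IED".
        strip, add = change[1:].split(",", 1)
    # Ispell files are uppercase; the MySpell example uses lowercase.
    return f"{kind} {flag} {strip.lower()} {add.lower()} {condition.lower()}"

print(ispell_rule_to_myspell("D", "E           >       D      # create > created"))
print(ispell_rule_to_myspell("D", "[^AEIOU]Y   >       -Y,IED # imply > implied"))
```

The header line (SFX D Y 4) is not produced here; it would be built from the flag name, the `*' marker and the number of rules.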

7.6.3 Specifying Affix Flags

Affix flags are specified in the word list by specifying them after the `/' character:

     word/flags

For example:

     create/DG

will associate the `D' and `G' flags with the word create.
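A word list entry of this form is easy to take apart. The Python sketch below is only an illustration (split_entry is a made-up helper, not an Aspell function) and assumes that each flag is a single character, as in the example above.

```python
def split_entry(entry):
    """Split a word list entry into the word and its list of flags."""
    word, _sep, flags = entry.partition("/")
    # Each flag is one character; an entry without "/" has no flags.
    return word, list(flags)

print(split_entry("create/DG"))
print(split_entry("create"))
```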