Aspell's International Plans

Last Updated November 14, 2000


Current Plans | Notes | Current Language Data | Back to Aspell's Home

Current Plans and Status

Here are my current plans for international support in Aspell and the current status as of Aspell .32.6. Your feedback is more than appreciated. Please send feedback to the Aspell mailing list (aspell@franklin.oit.unc.edu), or directly to me (kevina at users sourceforge net), as I am sure I am still overlooking things.

Notes on 8-bit Characters

There is a very good reason I use 8-bit characters in Aspell: speed and simplicity. Many parts of my code can fairly easily be converted to some sort of wide character, as my code is clean. Other parts cannot be.

One of the reasons is that in many, many places I use a direct lookup to find out various information about characters. With 8-bit characters this is very feasible because there are only 256 of them. With 16-bit wide characters this will waste a LOT of space. With 32-bit characters this is just plain impossible. Converting the lookup tables to some other form, while certainly possible, will degrade performance significantly.
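To make the direct-lookup argument concrete, here is a minimal sketch in Python. It is not Aspell's code: the flag names, the `CHAR_INFO` table and the choice of ISO-8859-1 are assumptions made for illustration. It builds one byte of flags per character and shows how many entries the same kind of table would need at each character width.

```python
# Illustrative only: flag names and table layout are assumptions, not Aspell's.
IS_LETTER = 1
IS_UPPER = 2

def build_table():
    """Build a 256-entry flag table for ISO-8859-1, one byte per character."""
    table = bytearray(256)
    for code in range(256):
        ch = bytes([code]).decode("latin-1")
        flags = 0
        if ch.isalpha():
            flags |= IS_LETTER
        if ch.isupper():
            flags |= IS_UPPER
        table[code] = flags
    return bytes(table)

CHAR_INFO = build_table()

def is_letter(byte_value):
    """Answer with a single index operation into the table."""
    return bool(CHAR_INFO[byte_value] & IS_LETTER)

# Entries a direct table needs for each character width, in bits.
ENTRIES = {bits: 2 ** bits for bits in (8, 16, 32)}
```

With one byte per entry, the 8-bit table takes 256 bytes, the 16-bit table 64 KB, and the 32-bit table 4 GB, and Aspell would need several such tables.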

Furthermore, some of my algorithms rely on words consisting of only a small number of distinct characters (often around 30 when case and accents are not considered). When a word can contain any Unicode character, this number becomes several thousand, if not more. In order for these algorithms to still be used, some sort of limit will need to be placed on the characters a word can contain. If I impose that limit, I might as well use some sort of 8-bit character set, which will automatically place the limit on what the characters can be.
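As a rough illustration of what "when case and accents are not considered" means, the sketch below folds words down to a small alphabet. It uses Unicode normalization from Python's standard library as a stand-in; Aspell itself uses its own 8-bit tables, and the `fold` and `alphabet` helpers are hypothetical names.

```python
import unicodedata

def fold(word):
    """Drop accents and case (a stand-in for a per-charset folding table)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def alphabet(words):
    """Return the distinct characters left after folding every word."""
    return sorted({ch for word in words for ch in fold(word)})
```

For example, `fold("Éléphant")` gives `"elephant"`, so accented and capitalized forms do not enlarge the alphabet the algorithms have to handle.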

There is also the issue of how I should store the word lists in memory. As a string of 32-bit wide characters? That uses 4 times more memory than 8-bit characters would, and for languages that can fit within an 8-bit character set that is, in my view, a gross waste of memory. So maybe I should store them in some variable-width format such as UTF-8. Unfortunately, way, way too many of my algorithms will simply not work with variable-width characters without significant modification, which will very likely degrade performance. So the solution would be to work with the characters as 32-bit wide characters and then convert them to a shorter representation when storing them in the lookup tables. Now that can lead to inefficiency. I could also use 16-bit wide characters; however, that may not be good enough to hold all future versions of Unicode, and it has the same problems.
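The storage trade-off can be checked with a few lines of Python. This is only a sketch: `storage_bytes` and `byte_offset` are hypothetical helper names, and Python's codecs stand in for whatever in-memory representation would actually be used.

```python
def storage_bytes(word):
    """Bytes needed to store one word in each representation."""
    return {
        "latin-1": len(word.encode("latin-1")),
        "utf-8": len(word.encode("utf-8")),
        "utf-32": len(word.encode("utf-32-le")),
    }

def byte_offset(word, index, encoding):
    """Byte position at which character number `index` starts."""
    return len(word[:index].encode(encoding))
```

For "café" this gives 4 bytes in Latin-1, 5 in UTF-8 and 16 in UTF-32. The second helper shows why variable-width storage breaks simple indexing: in "héllo" the third character starts at byte 2 in Latin-1 but at byte 3 in UTF-8.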

In response to the space wasted by storing word lists in some sort of wide format, someone asked:

Since hard drives are cheaper and cheaper, you could store the dictionary in a usable (uncompressed) form and use it directly with memory mapping. Then the efficiency would directly depend on the disk caching method, and only the used part of the dictionaries would really be loaded into memory. You would no longer have to load plain dictionaries into main memory; you would just compute some indexes (or something like that) after mapping.

However, the fact of the matter is that most of the dictionary will be read into memory anyway if the memory is available. If it is not available then there will be a good deal of disk swapping. Making characters 32 bits wide will increase the chance that there are more disk swaps. So the bottom line is that it will be cheaper to convert the characters from something like UTF-8 into some sort of wide character. I could also use some sort of disk-based lookup table such as the Berkeley Database. However, this will DEFINITELY degrade performance.
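For readers unfamiliar with the memory-mapping idea raised in the quoted suggestion, here is a small sketch of looking words up directly in a mapped file. The file layout (sorted, NUL-padded, fixed-width Latin-1 records) and all names are assumptions for illustration; this is not Aspell's compiled word list format.

```python
import mmap
import os
import tempfile

RECORD = 16  # hypothetical fixed record width, in bytes

def write_wordlist(path, words):
    """Write sorted, NUL-padded, fixed-width Latin-1 records."""
    records = sorted(word.encode("latin-1") for word in words)
    with open(path, "wb") as f:
        for raw in records:
            if len(raw) > RECORD:
                raise ValueError("word too long for record")
            f.write(raw.ljust(RECORD, b"\0"))

def open_mapped(path):
    """Map the whole file read-only; pages are loaded on demand by the OS."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def contains(mapped, word):
    """Binary search directly over the mapped bytes."""
    target = word.encode("latin-1")
    lo, hi = 0, len(mapped) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        rec = mapped[mid * RECORD:(mid + 1) * RECORD].rstrip(b"\0")
        if rec == target:
            return True
        if rec < target:
            lo = mid + 1
        else:
            hi = mid
    return False

def demo():
    """Build a tiny word list in a temporary file and query it."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "words.dat")
        write_wordlist(path, ["pear", "apple", "café", "zebra"])
        mapped = open_mapped(path)
        try:
            return contains(mapped, "café"), contains(mapped, "grape")
        finally:
            mapped.close()
```

Note that the record width scales with the character width: the same layout with 32-bit characters would need 64-byte records, which is the extra paging pressure described above.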

The bottom line is that keeping Aspell 8-bit internally is a very well thought out decision that is not likely to change any time soon. Feel free to challenge me on it, but don't expect me to change my mind unless you can bring up some point that I have not thought of before, and quite possibly a patch that cleanly converts Aspell to Unicode internally without a serious performance loss OR a serious increase in memory usage.

Notes on Affix Compression

Due to the large amount of affixation in many non-English languages, the most natural way to store a word list in a compact fashion is with affix compression. Affix compression basically stores the root word and then a list of prefixes and suffixes that the word can take. Affix compression will save space for almost any language, and general affix knowledge will also help to improve Aspell's suggestion intelligence.
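The idea of storing a root plus the affixes it can take can be sketched as follows. The `root/FLAGS` notation, the flag letters and the rule tables are hypothetical and far simpler than real affix rules, which also carry conditions on the root.

```python
# Hypothetical rules: flag letter -> affix text.
PREFIXES = {"U": "un", "R": "re"}
SUFFIXES = {"S": "s", "D": "ed", "G": "ing"}

def expand(entry):
    """Expand one 'root/FLAGS' entry into the full set of words it stands for."""
    root, _, flags = entry.partition("/")
    stems = [root] + [PREFIXES[f] + root for f in flags if f in PREFIXES]
    endings = [""] + [SUFFIXES[f] for f in flags if f in SUFFIXES]
    return {stem + ending for stem in stems for ending in endings}

def expand_all(entries):
    """Expand a whole compressed list."""
    words = set()
    for entry in entries:
        words |= expand(entry)
    return words
```

Here `expand("walk/SDG")` yields walk, walks, walked and walking, so one stored entry stands for four words.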

Affix compression will save a lot of space for languages with a lot of affixation; however, it is not vital. The reason is that without affix compression all you have to do is list all of the possible combinations. I realize that this wastes space; however, it is doable.
For example, the word list that comes with Aspell has 70,598 words. After running it through the munchlist script it has 30,953 words, which leads to a ratio of 2.3.

Now a Polish word list has these numbers:
  Expanded words: 1,041,430
  Base words: 146,626
  Ratio: 7.1

This means that for the Polish language affix compression saves about 3.1 times more space than it would for the English dictionary. Not that big of a deal.

Also, notice that this dictionary is mighty large, especially considering that the largest English word list I have has these numbers:
  Expanded words: 120,361
  Base words: 73,358
  Ratio: 1.6

So my question is: do you really need that large of a dictionary? The original poster who sent me these figures agrees that a lot of those 146,626 base words are not needed. So, let's say that we reduce the base list down to 35,000. Then the numbers are:
  Expanded words: 248,000
  Base words: 35,000
  Ratio: 7.1

This has about 3.5 times as many words as Aspell's English dictionary. True, this is large and wasteful; however, it is certainly manageable.
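The ratios quoted above are easy to recompute from the word counts given in the text. The variable names below are my own.

```python
def ratio(expanded, base):
    """Expanded word count divided by base (affix-compressed) word count."""
    return expanded / base

english = ratio(70_598, 30_953)           # Aspell's English list
polish = ratio(1_041_430, 146_626)        # the Polish list
large_english = ratio(120_361, 73_358)    # the largest English list

# How much more Polish gains from affix compression than English does.
relative_gain = polish / english

# A 35,000-word Polish base list expanded at the rounded ratio of 7.1.
projected = round(35_000 * 7.1)

# Size of the roughly 248,000-word list relative to Aspell's English list.
size_vs_english = 248_000 / 70_598
```

Rounded to one decimal place these come out as 2.3, 7.1, 1.6, 3.1 and 3.5. The projected expanded list is 248,500 words, which the text gives as roughly 248,000.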

So once again, affix compression will save space; however, the expanded word lists are manageable without affix compression, especially since Aspell now mmaps the word lists in.

Also, affix compression will not likely work well with Aspell's current suggestion strategy of finding all words with a similar soundslike. This is because I will still need to store the soundslike data, and each soundslike will still need to point to all words with that soundslike. Nevertheless it is still doable; it is just not likely to save enough space to make it worth it. Please see my email "Compiled Word List Compression Notes" posted to the Aspell mailing list for some more light on the subject as well as alternate compression strategies.
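To show why the soundslike data keeps every expanded word alive, here is a toy index. The `toy_soundslike` function is a deliberately crude stand-in that I made up for illustration (keep the first letter, drop later vowels, collapse doubled letters); it is not Aspell's soundslike algorithm.

```python
from collections import defaultdict

def toy_soundslike(word):
    """Crude phonetic key: NOT Aspell's algorithm, just an illustration."""
    out = []
    for i, ch in enumerate(word.lower()):
        if i > 0 and ch in "aeiouy":
            continue  # drop vowels after the first letter
        if out and out[-1] == ch:
            continue  # collapse doubled letters
        out.append(ch)
    return "".join(out)

def build_index(words):
    """Map each soundslike key to every word that shares it."""
    index = defaultdict(list)
    for word in words:
        index[toy_soundslike(word)].append(word)
    return index

def suggest(index, misspelling):
    """Return all known words whose key matches the misspelling's key."""
    return sorted(index.get(toy_soundslike(misspelling), []))

INDEX = build_index(["receive", "letter", "later", "walk", "walked"])
```

Each key has to list full words such as "walked", not just the root "walk", so storing roots plus affix flags does not remove the need to reference every expanded form.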

So, in order to get the full benefit of affix compression, the soundslike data would not be used, which will result in suggestions that are not much better than Ispell's. For languages with phonetic spelling this is not likely to be a problem; however, for languages like English, which do not have phonetic spelling, this is likely to be unacceptable.

I do plan on eventually implementing affix compression; it is just being put on hold until the rest of the international code is done. If you really care about affix compression then implement it yourself. But first talk to me so I can make sure you are doing it in a manner that is consistent with the rest of the Aspell library. Please see the emails titled "Affix Compression" and "More Affix Compression Notes" posted to the Aspell mailing list for some notes from the author of Ispell on some of the issues involved.


Current Language Data

I am currently trying to build a database of language information. You can help by sending me an email at kevina at users sourceforge net with the following information.
  1. What 8-bit character sets are typically used for your language and any shortcomings those character sets may have.
  2. What additional characters can appear in a word in your language and where they can appear, such as the "-" in French or the "'" in English.
  3. The level of affixation your language has.
  4. How phonetic the spelling of the language is.
  5. Additional considerations that should be taken into account for spell checking your language such as allowing run together words.
  6. Links to word lists for your language and a critique of how good you think they are.

Current Data

If your language is already here please review your language and let me know of any errors or missing information based on the above criteria.

            Character Maps                        Special Characters  Phonetic  Affixation  Run-together Words
Language    Default      Others                   Border  Other       Spelling  Level       Allow  Word Middle Chars
Danish      ISO-8859-1   ISO-8859-5, ISO-8859-10  -                   No        High        Yes    s
English     ISO-8859-1   ASCII                            '           No        2.3         No
Esperanto   ISO-8859-3                            -       '           Yes       High        Yes
German      ISO-8859-1   ISO-8859-2                                   ?         High        No
Portuguese  ISO-8859-1                            -                   ?         High        No
Russian     KOI8-R       ISO-8859-5               -                   Mostly    High        Yes
Slovenian   ISO-8859-2                            -                   No        Low         Yes
Spanish     ISO-8859-1                                                Mostly    High        No
Welsh       ISO-8859-14                                   '           Mostly    High        No

