These translations are not supported in the current version of Aspell. However, they will be as soon as I am done rewriting the filter interfaces. This should happen soon, but not right away; I am currently aiming to have it done by the end of January 2001.
Future versions of Aspell will have better support for specifying how words can be included in run-together words, as well as affix knowledge, which will ultimately lead to affix compression much like Ispell's. However, unless someone contributes code to support affix compression, support for it will not come until after everything else is done.
One of the reasons is that in many, many places I use a direct lookup to find out various information about characters. With 8-bit characters this is very feasible because there are only 256 of them. With 16-bit wide characters this would waste a LOT of space. With 32-bit characters it is just plain impossible. Converting the lookup tables to some other form, while certainly possible, would degrade performance significantly.
Furthermore, some of my algorithms rely on words consisting of only a small number of distinct characters (often around 30 when case and accents are not considered). When a character can be any Unicode character, this number becomes several thousand, if not more. In order for these algorithms to still be usable, some sort of limit would need to be placed on the characters a word can contain. If I impose that limit, I might as well use some sort of 8-bit character set, which automatically places a limit on what the characters can be.
There is also the issue of how I should store the word lists in memory. As strings of 32-bit wide characters? That uses four times more memory than 8-bit characters would, and for languages that can fit within an 8-bit character set that is, in my view, a gross waste of memory. So maybe I should store them in some variable-width format such as UTF-8. Unfortunately, way, way too many of my algorithms simply will not work with variable-width characters without significant modification, which would very likely degrade performance. So the solution is to work with the characters as 32-bit wide characters and then convert them to a shorter representation when storing them in the lookup tables. Now, that can lead to an inefficiency. I could also use 16-bit wide characters; however, that may not be good enough to hold all of future versions of Unicode, and it has the same problems.
In response to the space wasted by storing word lists in some sort of wide format, someone asked:
Since hard drives are cheaper and cheaper, you could store the dictionary in a usable (uncompressed) form and use it directly with memory mapping. Then the efficiency would directly depend on the disk caching method, and only the used part of the dictionaries would really be loaded into memory. You would no longer have to load plain dictionaries into main memory; you'd just want to compute some indexes (or something like that) after mapping.
However, the fact of the matter is that most of the dictionary will be read into memory anyway if the memory is available. If it is not available, then there would be a good deal of disk swapping. Making characters 32-bit wide increases the chance of more disk swaps. So the bottom line is that it is cheaper to convert the characters from something like UTF-8 into some sort of wide character. I could also use some sort of on-disk lookup table such as the Berkeley Database; however, that would DEFINITELY degrade performance.
The bottom line is that keeping Aspell 8-bit internally is a very well thought out decision that is not likely to change any time soon. Feel free to challenge me on it, but don't expect me to change my mind unless you can bring up some point that I have not thought of before, and quite possibly a patch that cleanly converts Aspell to Unicode internally without a serious performance loss OR a serious increase in memory usage.
Affix compression will save a lot of space for languages with a lot of affixation; however, it is not vital. The reason is that without affix compression all you have to do is list all of the possible combinations. I realize that this wastes space, but it is doable.
For example, the word list that comes with Aspell has … words. After running it through the munchlist script it has … words. Which leads to a ratio of ….
Now, a Polish word list has these numbers.
Which means that for the Polish language, affix compression saves about 3.1 times more space than it does for the English dictionary. Not that big of a deal.
Also, notice that this dictionary is mighty large, especially considering that the largest English word list I have has these numbers.
So my question is: do you really need that large a dictionary? The original poster who sent me these figures agrees that a lot of those 146,626 base words are not needed. So, let's say that we reduce the base list down to 35,000; then the numbers are...
Which has about 3.5 times more words than the Aspell English dictionary. True, this is large and wasteful; however, it is certainly manageable.
So once again, affix compression will save space; however, the expanded word lists are manageable without affix compression, especially since Aspell now mmaps the word lists in.
Also, affix compression will not likely work well with Aspell's current suggestion strategy of finding all words with a similar soundslike. This is because I will still need to store the soundslike data, and each soundslike will still need to point to all words with that soundslike. Nevertheless, it is still doable; it is just not likely to save enough space to make it worth it. Please see my email "Compiled Word List Compression Notes" posted to the Aspell mailing list for more light on the subject as well as alternate compression strategies.
So, in order to get the full benefit of affix compression, the soundslike data would not be used, which would result in suggestions that are not much better than Ispell's. For languages with phonetic spelling this is not likely to be a problem; however, for languages like English, whose spelling is not phonetic, this is likely to be unacceptable.
I do plan on eventually implementing affix compression; it is just being put on hold until the rest of the international code is done. If you really care about affix compression, then implement it yourself. But first talk to me so I can make sure you are doing it in a manner that is consistent with the rest of the Aspell library. Please see the emails titled "Affix Compression" and "More Affix Compression Notes" posted to the Aspell mailing list for some notes from the author of Ispell on some of the issues involved.
(Table of per-language support, with columns: Language; Character Maps (Default, Others); Special Characters (Border, Other Word, Middle Chars); Phonetic Spelling; Affixation Level; Allow Run-together Words.)