Aspell's International Plans

Last Updated November 14, 2000

Current Plans | Notes | Current Language Data | Back to Aspell's Home

Current Plans and Status

Here are my current plans for international support in Aspell, along with the current status as of Aspell .32.6. Your feedback is more than appreciated. Please send feedback to the Aspell mailing list (aspell@franklin.oit.unc.edu) or directly to me (kevina at users sourceforge net), as I am sure I am still overlooking things.

Internally store languages that can fit within an existing 8-bit character set as such. The character set should be chosen from one of the standard character sets in common use, which include the ISO8859 series, VISCII, and the KOI8 series character maps. If none of the existing character sets is adequate, a new one can be created.

Support for languages which do not fit within an 8-bit character set will come later. The reason for this is that almost all languages which do not fit within an 8-bit character set cannot be spell checked in the traditional fashion. When I expand Aspell to support spell checking these languages I will also expand Aspell to work with wide characters. However, for right now this brings an extra level of complexity that I don't want to deal with.
Provide translation maps to convert the various charsets, HTML, TeX, Nroff, and ASCII representations to and from the Unicode character set. When simple translation maps are insufficient or inefficient, such as with UTF-8 or SCSU, use some simple C++ functions instead. In some cases a combined approach might be most appropriate, such as with HTML. These translation maps and/or functions will allow conversion from any input format to any output format. Good sources for these maps are the various key maps for Yudit, the official mappings from the Unicode Consortium, and Roman Czyborra's textual reference tables scattered about the ISO8859, VISCII, and Cyrillic pages.
These translations are not supported in the current version of Aspell. However, they will be as soon as I am done rewriting the filter interfaces. This should happen soon but not right away; I am currently aiming to get this done by the end of January 2001.
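As a rough illustration of the translation-map idea (this is not Aspell's code), the sketch below builds a 256-entry map from an 8-bit character set to Unicode plus the reverse map, then converts text between two charsets by going through Unicode. It leans on Python's built-in codec tables instead of the Yudit or Unicode Consortium mapping files, and the function names `build_maps` and `convert` are made up for this example.

```python
def build_maps(charset):
    """Return (to_unicode, from_unicode) maps for an 8-bit charset."""
    to_unicode = {}
    for byte in range(256):
        try:
            to_unicode[byte] = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            pass  # this byte value is unassigned in the charset
    from_unicode = {ch: byte for byte, ch in to_unicode.items()}
    return to_unicode, from_unicode


def convert(data, src, dst, replacement=b"?"):
    """Convert 8-bit text from charset src to charset dst by way of Unicode."""
    src_to_uni, _ = build_maps(src)
    _, uni_to_dst = build_maps(dst)
    out = bytearray()
    for byte in data:
        ch = src_to_uni.get(byte)
        if ch in uni_to_dst:
            out.append(uni_to_dst[ch])
        else:
            out += replacement  # no equivalent in the target charset
    return bytes(out)


if __name__ == "__main__":
    koi8 = "слово".encode("koi8_r")
    print(convert(koi8, "koi8_r", "iso8859_5") == "слово".encode("iso8859_5"))
```

Because every charset only needs a map to and from Unicode, adding a new charset gives conversions to all the others without writing a map for each pair.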
Information about the characters--such as which are consonants, which are vowels, which are symbols, and upper to lower case conversion--is currently stored as part of the Unicode character map table. The official data from the Unicode Consortium provides most of this information.
This leaves precious little information needed for the actual language support file. The language support file includes information such as the language name, the charset to use, special non-letter characters which can be part of the word such as ' or -, and run-together behavior.
Future versions of Aspell will have better support for specifying how words can be included in run-together words, as well as affix knowledge, which will ultimately lead to affix compression much like that in Ispell. However, unless someone contributes code to support affix compression, support for this will not come until after everything else is done.
Additional information which can be included in the language support file is information to convert a word to its rough soundslike equivalent. If this information is not included then the default algorithm of simply stripping all the vowels will be used for the rough soundslike equivalent. To see how this information is used please see the How Aspell Works chapter of the Aspell manual.
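The default fallback described above can be sketched in a few lines. This is only an illustration: the vowel set below is an assumption for an English-like language (in Aspell the vowel information comes from the character data), and `rough_soundslike` is a made-up name.

```python
# Assumption: a fixed vowel set; the real set depends on the language's character data.
VOWELS = set("aeiouAEIOU")


def rough_soundslike(word):
    """Strip every vowel and keep the remaining characters in order."""
    return "".join(ch for ch in word if ch not in VOWELS)


if __name__ == "__main__":
    print(rough_soundslike("receive"))  # rcv
    print(rough_soundslike("recieve"))  # rcv
```

A common misspelling and the correct word often collapse to the same rough form, which is what makes even this crude default useful for finding suggestions.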

Section 5.3: Phonetic Code

Notes on 8-bit Characters

There is a very good reason I use 8-bit characters in Aspell: speed and simplicity. Many parts of my code can fairly easily be converted to some sort of wide character, as my code is clean. Other parts cannot be.

One of the reasons is that in many, many places I use a direct lookup to find out various information about characters. With 8-bit characters this is very feasible because there are only 256 of them. With 16-bit wide characters this will waste a LOT of space. With 32-bit characters this is just plain impossible. Converting the lookup tables to some other form, while certainly possible, will degrade performance significantly.
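To make the size argument concrete, here is a minimal sketch of a direct lookup table for one 8-bit charset, together with the size the same kind of table would need at other character widths. ISO-8859-1 is chosen only as an example, and `build_lower_table` and `table_bytes` are hypothetical names, not Aspell functions.

```python
def build_lower_table(charset="iso8859_1"):
    """Build a 256-entry table mapping each byte to its lower-case byte."""
    table = bytearray(range(256))  # default: every byte maps to itself
    for b in range(256):
        try:
            lowered = bytes([b]).decode(charset).lower().encode(charset)
        except UnicodeError:
            continue  # unassigned byte, or no lower-case form in this charset
        if len(lowered) == 1:
            table[b] = lowered[0]
    return bytes(table)


def table_bytes(char_bits, entry_bytes=1):
    """Size of a direct lookup table with one entry per possible character."""
    return (2 ** char_bits) * entry_bytes


TO_LOWER = build_lower_table()

if __name__ == "__main__":
    word = "ÉCOLE".encode("iso8859_1")
    # bytes.translate performs one table index per character.
    print(word.translate(TO_LOWER).decode("iso8859_1"))
    for bits in (8, 16, 32):
        print(bits, table_bytes(bits))
```

With one byte per entry the table is 256 bytes for 8-bit characters, 65,536 bytes for 16-bit characters, and 4,294,967,296 bytes for 32-bit characters, and that is for a single property table.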

Furthermore, some of my algorithms rely on words consisting of only a small number of distinct characters (often around 30 when case and accents are not considered). When a word can contain any Unicode character this number becomes several thousand, if not more. In order for these algorithms to still be used some sort of limit will need to be placed on the possible characters a word can contain. If I impose that limit, I might as well use some sort of 8-bit character set, which will automatically place the limit on what the characters can be.

There is also the issue of how I should store the word lists in memory. As a string of 32-bit wide characters? That uses 4 times as much memory as 8-bit characters would, and for languages that can fit within an 8-bit character set that is, in my view, a gross waste of memory. So maybe I should store them in some variable-width format such as UTF-8. Unfortunately, way, way too many of my algorithms will simply not work with variable-width characters without significant modification, which will very likely degrade performance. So the solution would be to work with the characters as 32-bit wide characters and then convert them to a shorter representation when storing them in the lookup tables. Now that can lead to inefficiency. I could also use 16-bit wide characters; however, that may not be good enough to hold all of future versions of Unicode and it has the same problems.
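The trade-off between the three storage options can be checked on a single word. The sketch below is only an illustration using Python's codecs; the sample word and the name `storage_sizes` are arbitrary choices.

```python
def storage_sizes(word):
    """Bytes needed to store one word in three candidate encodings."""
    return {
        "8-bit (ISO-8859-1)": len(word.encode("iso8859_1")),
        "UTF-8": len(word.encode("utf_8")),
        # The _le codec writes no byte-order mark, so this is 4 bytes per character.
        "32-bit (UTF-32)": len(word.encode("utf_32_le")),
    }


if __name__ == "__main__":
    word = "überprüfen"  # 10 characters, two of them outside ASCII
    print(storage_sizes(word))

    # In a fixed-width encoding, character i sits at a fixed byte offset.
    # In UTF-8 it does not: byte 1 here is the second half of "ü", not "b".
    print(word.encode("iso8859_1")[1:2], word.encode("utf_8")[1:2])
```

For this word the 8-bit form takes 10 bytes, UTF-8 takes 12, and the 32-bit form takes 40. UTF-8 is nearly as compact as the 8-bit form, but code that indexes by character position would have to be rewritten to scan for character boundaries.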

As a response to the space waste used by storing word lists in some sort of wide format some one asked:

Since hard drives are cheaper and cheaper, you could store the dictionary in a usable (uncompressed) form and use it directly with memory mapping. Then the efficiency would directly depend on the disk caching method, and only the used part of the dictionaries would really be loaded into memory. You would no longer have to load plain dictionaries into main memory; you would just want to compute some indexes (or something like that) after mapping.

However, the fact of the matter is that most of the dictionary will be read into memory anyway if the memory is available. If it is not available then there would be a good deal of disk swapping. Making characters 32 bits wide will increase the chance that there is more disk swapping. So the bottom line is that it will be cheaper to convert the characters from something like UTF-8 into some sort of wide character. I could also use some sort of disk-based lookup table such as the Berkeley Database. However, this will DEFINITELY degrade performance.
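For readers unfamiliar with the memory-mapping suggestion, here is a minimal sketch of what it looks like in practice, using Python's standard `mmap` module. The file layout (newline-delimited words with a leading newline) and the helper names are assumptions made for this example, and the linear search stands in for whatever index a real implementation would compute after mapping.

```python
import mmap
import os
import tempfile


def write_word_list(path, words):
    """Write an uncompressed, newline-delimited word list of 8-bit strings."""
    with open(path, "wb") as f:
        f.write(b"\n" + b"\n".join(words) + b"\n")


def mapped_contains(path, word):
    """Map the file read-only and search it in place.

    The operating system pages in only the parts of the file that are touched.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.find(b"\n" + word + b"\n") != -1


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "words.txt")
    write_word_list(path, [b"apple", b"banana", b"cherry"])
    print(mapped_contains(path, b"banana"))
    print(mapped_contains(path, b"ban"))
```

Whether this helps depends on how much of the mapped file a lookup touches: if most pages end up being read, the mapped file occupies about as much memory as a loaded one.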

The bottom line is that keeping Aspell 8-bit internally is a very well thought out decision that is not likely to change any time soon. Feel free to challenge me on it, but don't expect me to change my mind unless you can bring up some point that I have not thought of before, and quite possibly a patch that cleanly converts Aspell to Unicode internally without a serious performance loss OR a serious increase in memory usage.

Notes on Affix Compression

Due to the large amount of affixation in many non-English languages, the most natural way to store a word list in a compact fashion is with affix compression. Affix compression basically stores the root word and then a list of prefixes and suffixes that the word can take. Affix compression will save space for almost any language, and general affix knowledge will also help to improve Aspell's suggestion intelligence.

Affix compression will save a lot of space for languages with a lot of affixation; however, it is not vital. The reason is that without affix compression all you have to do is list all of the possible combinations. I realize that this wastes space; however, it is doable.
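The two storage forms can be compared with a small sketch. It uses the Ispell-style `root/FLAGS` notation, but the three suffix rules below are simplified assumptions for illustration (real affix rules also carry conditions and can strip characters from the root), and `expand` and `expand_list` are made-up names.

```python
# Hypothetical, unconditional suffix rules keyed by a flag letter.
SUFFIXES = {"S": "s", "D": "ed", "G": "ing"}


def expand(entry):
    """Expand one 'root/FLAGS' entry into the root plus one word per flag."""
    root, _, flags = entry.partition("/")
    return [root] + [root + SUFFIXES[flag] for flag in flags]


def expand_list(entries):
    """Expand a whole compressed list into the full list of combinations."""
    words = []
    for entry in entries:
        words.extend(expand(entry))
    return words


if __name__ == "__main__":
    compressed = ["walk/SDG", "talk/SG", "cat"]
    full = expand_list(compressed)
    print(full)
    print(len(full), "words from", len(compressed), "entries")
```

Listing all combinations means storing the output of `expand_list` directly: 8 words here instead of 3 entries.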

For example the word list that comes with Aspell has

70,598 words

After running it through the munchlist script it has

30,953 words

Which leads to a ratio of

2.3

Now a Polish word list has these numbers:

1,041,430 words in the full list

146,626 base words after munchlist

a ratio of 7.1

Which means that for the Polish language affix compression saves about 3.1 times more space than it would for the English dictionary. Not that big of a deal.

Also, notice that this dictionary is mighty large, especially considering that the largest English word list I have has these numbers:

120,361 words in the full list

73,358 base words after munchlist

a ratio of 1.6

So my question is: do you really need that large of a dictionary? The original poster who sent me these figures agrees that a lot of those 146,626 base words are not needed. So, let's say that we reduce the base list down to 35,000; then the numbers are:

248,000 words in the full list

35,000 base words

a ratio of 7.1

That is about 3.5 times as many words as the Aspell English dictionary. True, this is large and wasteful; however, it is certainly manageable.
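The ratios quoted in this section can be reproduced from the word counts given in the text. The short sketch below only redoes that arithmetic; the labels are descriptive names chosen for this example.

```python
# (words in the full list, base words after munchlist), as quoted above
LISTS = {
    "Aspell English": (70_598, 30_953),
    "Polish": (1_041_430, 146_626),
    "largest English": (120_361, 73_358),
}


def ratio(full, base):
    """Expansion ratio of a word list, rounded to one decimal place."""
    return round(full / base, 1)


if __name__ == "__main__":
    for name, (full, base) in LISTS.items():
        print(name, ratio(full, base))
    # Polish ratio relative to the Aspell English ratio
    print(round(ratio(*LISTS["Polish"]) / ratio(*LISTS["Aspell English"]), 1))
    # Reduced Polish list (248,000 words) relative to the Aspell English list
    print(round(248_000 / 70_598, 1))
```

This gives 2.3, 7.1, and 1.6 for the three lists, 3.1 for the Polish ratio relative to English, and 3.5 for the size of the reduced Polish list relative to the Aspell English list.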

So once again, affix compression will save space; however, the expanded word lists are manageable without affix compression, especially since Aspell now mmaps the word lists in.

Also, affix compression will not likely work well with Aspell's current suggestion strategy of finding all words with a similar soundslike. This is because I will still need to store the soundslike data, and each soundslike will still need to point to all words with that soundslike. Nevertheless it is still doable; it is just not likely to save enough space to make it worth it. Please see my email "Compiled Word List Compression Notes" posted to the Aspell mailing list for some more light on the subject as well as alternate compression strategies.
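A minimal sketch of such a soundslike index shows why affix compression saves little here. The vowel-stripping `soundslike` function is only a stand-in for the real soundslike transformation, and `build_index` and `candidates` are hypothetical names rather than Aspell's internals.

```python
from collections import defaultdict


def soundslike(word):
    """Stand-in soundslike: lower-case the word and strip the vowels."""
    return "".join(ch for ch in word.lower() if ch not in "aeiou")


def build_index(words):
    """Map each soundslike to every word that has that soundslike."""
    index = defaultdict(list)
    for word in words:
        index[soundslike(word)].append(word)
    return index


def candidates(index, misspelling):
    """Look up the words sharing the misspelling's soundslike."""
    return index.get(soundslike(misspelling), [])


if __name__ == "__main__":
    index = build_index(["receive", "recover", "walk", "walks", "walked"])
    print(candidates(index, "recieve"))
    print(sorted(index))
```

Even though "walk", "walks", and "walked" share one root, each has its own soundslike and needs its own index entry, so the index grows with the expanded list rather than with the list of roots.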

So, in order to get the full benefit of affix compression the soundslike data would not be used, which will result in suggestions that are not much better than Ispell's. For languages with phonetic spelling this is not likely to be a problem; however, for languages like English which don't have phonetic spelling this is likely to be unacceptable.

I do plan on eventually implementing affix compression; it is just being put on hold until the rest of the international code is done. If you really care about affix compression then implement it yourself. But first talk to me so I can make sure you are doing it in a manner that is consistent with the rest of the Aspell library. Please see the emails titled "Affix Compression" and "More Affix Compression Notes" posted to the Aspell mailing list for some notes from the author of Ispell on some of the issues involved.

Current Language Data

I am currently trying to build a database of language information. You can help by sending me an email at kevina at users sourceforge net with the following information.

What 8-bit character sets are typically used for your language and any shortcomings those character sets may have.
What additional characters can appear in a word in your language and where they can appear, such as the "-" in French or the "'" in English.
The level of affixation your language has.
How phonetic the spelling of the language is.
Additional considerations that should be taken into account for spell checking your language, such as allowing run-together words.
Links to word lists for your language and a critique of how good you think they are.

Current Data

If your language is already here please review your language and let me know of any errors or missing information based on the above criteria.

Language    Default      Other                   Border  Other word  Phonetic  Affixation  Run-together  Middle
            char map     char maps               chars   chars       spelling  level       words         chars
Danish      ISO-8859-1   ISO-8859-5 ISO-8859-10  -                   No        High        Yes           s
English     ISO-8859-1   ASCII                           '           No        2.3         No
Esperanto   ISO-8859-3                           -       '           Yes       High        Yes
German      ISO-8859-1   ISO-8859-2                                  ?         High        No
Portuguese  ISO-8859-1                           -                   ?         High        No
Russian     KOI8-R       ISO-8859-5              -                   Mostly    High        Yes
Slovenian   ISO-8859-2                           -                   No        Low         Yes
Spanish     ISO-8859-1                                               Mostly    High        No
Welsh       ISO-8859-14                                  '           Mostly    High        No
