Date: Thu, 18 Mar 2004 02:40:27 -0500 (EST)
From: Kevin Atkinson
Subject: Aspell Now Has Full UTF-8 Support

Aspell now fully supports spell checking documents in UTF-8. In addition, Aspell can now accept all input and print all output in UTF-8 or any other encoding that Aspell supports. The fact that Aspell is still 8-bit internally can now be made completely transparent to the end user.

Previous versions of Aspell supported Unicode to some extent; however, word lists still had to be in an 8-bit character set. Furthermore, spell checking documents in an encoding different from the internal encoding was problematic. This has all changed now.

With this change Aspell can now support any language that has no more than 220 distinct characters, including different capitalizations and accents, _even if_ there is no existing 8-bit encoding that supports the language. All one has to do is create a new character data file, which is a fairly simple task. The internal encoding never has to be seen by the end user, including the word list author, since not even the word list has to be in the same encoding that Aspell uses.

Full UTF-8 support was added with the 0.51-20040219 snapshot; the next snapshot, 0.51-20040227, fixed a few bugs, and the latest, 0.60-20040317, uses a new, simpler format for the character data files. Aspell snapshots can be downloaded from ftp://alpha.gnu.org/gnu/aspell/.

Notes on 8-bit Characters
*************************

There is a very good reason I use 8-bit characters in Aspell: speed and simplicity. While many parts of my code can fairly easily be converted to some sort of wide character, since my code is clean, other parts cannot be. One of the reasons is that in many, many places I use a direct lookup to find out various information about characters. With 8-bit characters this is very feasible because there are only 256 of them. With 16-bit wide characters this would waste a LOT of space. With 32-bit characters it is just plain impossible.
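To make the two points above concrete, here is a minimal sketch of the general idea: convert Unicode to a private 8-bit encoding at the boundary, then use flat 256-entry tables for character properties. The mapping and table contents below are invented for illustration; they are not Aspell's actual code, data structures, or character data file format.

```python
# Sketch: why an 8-bit internal encoding keeps character handling cheap.
# All names and mappings here are hypothetical, not Aspell's real code.

# 1. Map each distinct character of the language to one byte of a
#    private 8-bit encoding (the job a character data file describes).
code_points = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZéÉ"
to_byte = {ord(ch): i for i, ch in enumerate(code_points)}
to_code_point = {i: ord(ch) for i, ch in enumerate(code_points)}

def encode(word: str) -> bytes:
    """Convert a Unicode word to the internal 8-bit encoding."""
    return bytes(to_byte[ord(ch)] for ch in word)

def decode(data: bytes) -> str:
    """Convert an internal 8-bit word back to Unicode."""
    return "".join(chr(to_code_point[b]) for b in data)

# 2. With at most 256 internal characters, per-character properties fit
#    in flat 256-entry tables indexed directly by the byte value --
#    one array access per character, no hashing or range tests.
to_lower = bytearray(range(256))
for upper, lower in zip("ABCDEFGHIJKLMNOPQRSTUVWXYZÉ",
                        "abcdefghijklmnopqrstuvwxyzé"):
    to_lower[to_byte[ord(upper)]] = to_byte[ord(lower)]

def lowercase(data: bytes) -> bytes:
    """Lowercase a word in the internal encoding via direct lookup."""
    return bytes(to_lower[b] for b in data)

# A 32-bit character set would need 2**32 entries per table, which is
# why the same direct-lookup trick cannot scale to raw Unicode.
```

For example, `decode(lowercase(encode("Résumé")))` round-trips through the internal encoding and comes back as `"résumé"`, with every character operation being a single table index.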
Converting the lookup tables to some other form, while certainly possible, would degrade performance significantly. Furthermore, some of my algorithms rely on words consisting only of a small number of distinct characters (often around 30 when case and accents are not considered). When a word can consist of any Unicode character, this number becomes several thousand, if not more. In order for these algorithms to still be used, some sort of limit would need to be placed on the possible characters a word can contain. If I impose that limit, I might as well use some sort of 8-bit character set, which automatically places the limit on what the characters can be.

There is also the issue of how I should store the word lists in memory. As strings of 32-bit wide characters? That uses four times more memory than 8-bit characters would, and for languages that can fit within an 8-bit character set that is, in my view, a gross waste of memory. So maybe I should store them in some variable-width format such as UTF-8. Unfortunately, way, way too many of my algorithms simply will not work with variable-width characters without significant modification, which would very likely degrade performance. So the solution is to work with the characters as 32-bit wide characters and then convert them to a shorter representation when storing them in the lookup tables. Now, that can lead to an inefficiency. I could also use 16-bit wide characters; however, that may not be good enough to hold all future versions of Unicode, and it has the same problems.

In response to the space wasted by storing word lists in some sort of wide format, someone asked:

    Since hard drives are cheaper and cheaper, you could store the
    dictionary in a usable (uncompressed) form and use it directly with
    memory mapping. Then the efficiency would directly depend on the
    disk caching method, and only the used parts of the dictionaries
    would really be loaded into memory.
    You would no longer have to load plain dictionaries into main
    memory; you'd just want to compute some indexes (or something like
    that) after mapping.

However, the fact of the matter is that most of the dictionary will be read into memory anyway if it is available. If it is not available, then there would be a good deal of disk swapping. Making characters 32 bits wide would increase the chance of more disk swapping. So the bottom line is that it will be cheaper to convert the characters from something like UTF-8 into some sort of wide character. I could also use some sort of on-disk lookup table such as the Berkeley Database. However, this would *definitely* degrade performance.

The bottom line is that keeping Aspell 8-bit internally is a very well thought out decision that is not likely to change any time soon. Feel free to challenge me on it, but don't expect me to change my mind unless you can bring up some point that I have not thought of before, and quite possibly a patch that cleanly converts Aspell to Unicode internally without a serious performance loss OR a serious increase in memory usage.

Languages Which Aspell can Support
**********************************

Even though Aspell will remain 8-bit internally, it should still be able to support any written language not based on a logographic script. The only logographic writing systems in current use are those based on hànzi, which include Chinese, Japanese, and sometimes Korean.

Languages with 220 or Fewer Unique Symbols
==========================================

Aspell 0.60 should be able to support the following languages as, to the best of my knowledge, they all contain 220 or fewer symbols and can thus fit within an 8-bit character set. If a suitable character set does not exist, then a new one can be invented. This is true even if the script is not yet supported by Unicode, as the private use area can be used.
Code  Language Name            Script              Dictionary  Gettext
                                                   Available   Translation

ab    Abkhazian                Cyrillic            -           -
ae    Avestan                  Avestan             -           -
af    Afrikaans                Latin               Yes         -
an    Aragonese                Latin               -           -
ar    Arabic                   Arabic              -           -
as    Assamese                 Bengali             -           -
ay    Aymara                   Latin               -           -
az    Azerbaijani              Arabic              -           -
az                             Cyrillic            -           -
az                             Latin               -           -
ba    Bashkir                  Cyrillic            -           -
be    Belarusian               Cyrillic            -           Yes
bg    Bulgarian                Cyrillic            Yes         -
bh    Bihari                   Devanagari          -           -
bn    Bengali                  Bengali             -           -
bo    Tibetan                  Tibetan             -           -
br    Breton                   Latin               Yes         -
bs    Bosnian                  Latin               -           -
ca    Catalan/Valencian        Latin               Yes         -
ce    Chechen                  Cyrillic            -           -
ch    Chamorro                 Latin               -           -
co    Corsican                 Latin               -           -
cr    Cree                     Canadian Syllabics  -           -
cr                             Latin               -           -
cs    Czech                    Latin               Yes         -
cv    Chuvash                  Cyrillic            -           -
cy    Welsh                    Latin               Yes         -
da    Danish                   Latin               Yes         -
de    German                   Latin               Yes         -
dv    Divehi                   Dhives Akuru        -           -
dz    Dzongkha                 Tibetan             -           -
el    Greek                    Greek               Yes         -
en    English                  Latin               Yes         -
eo    Esperanto                Latin               Yes         -
es    Spanish                  Latin               Yes         Incomplete
et    Estonian                 Latin               -           -
eu    Basque                   Latin               -           -
fa    Persian                  Arabic              -           -
fi    Finnish                  Latin               -           -
fj    Fijian                   Latin               -           -
fo    Faroese                  Latin               Yes         -
fr    French                   Latin               Yes         Yes
fy    Frisian                  Latin               -           -
ga    Irish                    Latin               Yes         Yes
gd    Scottish Gaelic          Latin               -           -
gl    Gallegan                 Latin               Yes         -
gn    Guarani                  Latin               -           -
gu    Gujarati                 Gujarati            -           -
gv    Manx                     Latin               -           -
ha    Hausa                    Latin               -           -
he    Hebrew                   Hebrew              -           -
hi    Hindi                    Devanagari          -           -
hr    Croatian                 Latin               Yes         -
hu    Hungarian                Latin               -           -
hy    Armenian                 Armenian            -           -
ia    Interlingua (IALA)       Latin               -           -
id    Indonesian               Arabic              -           -
id                             Latin               Yes         -
io    Ido                      Latin               -           -
is    Icelandic                Latin               Yes         -
it    Italian                  Latin               Yes         -
iu    Inuktitut                Canadian Syllabics  -           -
iu                             Latin               -           -
ja    Japanese                 Latin               -           -
jv    Javanese                 Javanese            -           -
jv                             Latin               -           -
ka    Georgian                 Georgian            -           -
kk    Kazakh                   Cyrillic            -           -
kl    Kalaallisut/Greenlandic  Latin               -           -
km    Khmer                    Khmer               -           -
kn    Kannada                  Kannada             -           -
ko    Korean                   Hangeul             -           -
kr    Kanuri                   Latin               -           -
ks    Kashmiri                 Arabic              -           -
ks                             Devanagari          -           -
ku    Kurdish                  Arabic              -           -
ku                             Cyrillic            -           -
ku                             Latin               -           -
kv    Komi                     Cyrillic            -           -
kw    Cornish                  Latin               -           -
ky    Kirghiz                  Arabic              -           -
ky                             Cyrillic            -           -
ky                             Latin               -           -
la    Latin                    Latin               -           -
lo    Lao                      Lao                 -           -
lt    Lithuanian               Latin               -           -
lv    Latvian                  Latin               -           -
mi    Maori                    Latin               Yes         -
mk    Makasar                  Lontara/Makasar     -           -
ml    Malayalam                Latin               -           -
ml                             Malayalam           -           -
mn    Mongolian                Cyrillic            -           -
mn                             Mongolian           -           -
mo    Moldavian                Cyrillic            -           -
mr    Marathi                  Devanagari          -           -
ms    Malay                    Arabic              -           -
ms                             Latin               Yes         -
mt    Maltese                  Latin               -           -
my    Burmese                  Myanmar             -           -
nb    Norwegian Bokmal         Latin               -           -
ne    Nepali                   Devanagari          -           -
nl    Dutch                    Latin               Yes         Yes
nn    Norwegian Nynorsk        Latin               -           -
no    Norwegian                Latin               Yes         -
nv    Navajo                   Latin               -           -
oc    Occitan/Provencal        Latin               -           -
oj    Ojibwa                   Ojibwe              -           -
or    Oriya                    Oriya               -           -
os    Ossetic                  Cyrillic            -           -
pa    Punjabi                  Gurmukhi            -           -
pi    Pali                     Devanagari          -           -
pi                             Sinhala             -           -
pl    Polish                   Latin               Yes         -
ps    Pushto                   Arabic              -           -
pt    Portuguese               Latin               Yes         Yes
qu    Quechua                  Latin               -           -
rm    Raeto-Romance            Latin               -           -
ro    Romanian                 Latin               Yes         -
ru    Russian                  Cyrillic            Yes         -
sa    Sanskrit                 Devanagari          -           -
sa                             Sinhala             -           -
sd    Sindhi                   Arabic              -           -
se    Northern Sami            Latin               -           -
sk    Slovak                   Latin               Yes         -
sl    Slovenian                Latin               Yes         -
sn    Shona                    Latin               -           -
so    Somali                   Latin               -           -
sq    Albanian                 Latin               -           -
sr    Serbian                  Cyrillic            -           Yes
sr                             Latin               -           -
su    Sundanese                Latin               -           -
sv    Swedish                  Latin               Yes         -
sw    Swahili                  Latin               -           -
ta    Tamil                    Tamil               -           -
te    Telugu                   Telugu              -           -
tg    Tajik                    Arabic              -           -
tg                             Cyrillic            -           -
tg                             Latin               -           -
tk    Turkmen                  Arabic              -           -
tk                             Cyrillic            -           -
tk                             Latin               -           -
tl    Tagalog                  Latin               -           -
tl                             Tagalog             -           -
tr    Turkish                  Arabic              -           -
tr                             Latin               -           -
tt    Tatar                    Cyrillic            -           -
ty    Tahitian                 Latin               -           -
ug    Uighur                   Arabic              -           -
ug                             Cyrillic            -           -
ug                             Latin               -           -
ug                             Uyghur              -           -
uk    Ukrainian                Cyrillic            Yes         -
ur    Urdu                     Arabic              -           -
uz    Uzbek                    Cyrillic            -           -
uz                             Latin               -           -
vi    Vietnamese               Latin               -           -
vo    Volapuk                  Latin               -           -
wa    Walloon                  Latin               -           Incomplete
yi    Yiddish                  Hebrew              -           -
yo    Yoruba                   Latin               -           -
zu    Zulu                     Latin               -           -

Languages in Which the Exact Script Used is Unknown
===================================================

Aspell can most likely support any of the following languages; however, I am unsure what script they are written in. Most of them are probably written in Latin, but I am not sure. If you have any information about these languages, please email me at .
Code  Language Name
aa    Afar
ak    Akan
av    Avaric
bi    Bislama
bm    Bambara
cu    Old Slavonic
ee    Ewe
ff    Fulah
ho    Hiri Motu
ht    Haitian Creole
hz    Herero
ie    Interlingue
ig    Igbo
ii    Sichuan Yi
ik    Inupiaq
kg    Kongo
ki    Kikuyu/Gikuyu
kj    Kwanyama
lb    Luxembourgish
lg    Ganda
li    Limburgan
ln    Lingala
lu    Luba-Katanga
mg    Malagasy
mh    Marshallese
na    Nauru
nd    North Ndebele
ng    Ndonga
nr    South Ndebele
ny    Nyanja
rn    Rundi
rw    Kinyarwanda
sc    Sardinian
sg    Sango
si    Sinhalese
sm    Samoan
ss    Swati
st    Southern Sotho
tn    Tswana
to    Tonga
ts    Tsonga
tw    Twi
ve    Venda
wo    Wolof
xh    Xhosa
za    Zhuang

The Ethiopic Script
===================

Even though the Ethiopic script has more than 220 distinct characters, with a little work Aspell can still handle it. The idea is to split each character into two parts based on the matrix representation. The first 3 bits will be the first part and could be mapped to `10000???'. The next 6 bits will be the second part and could be mapped to `11??????'. The combined character will then be mapped with the upper bits coming first. Thus each Ethiopic syllable will have the form `11?????? 10000???'. By mapping the first and second parts to separate 8-bit characters, it is easy to tell which part represents the consonant and which part represents the vowel of the syllable.

This encoding of the syllabary is far more useful to Aspell than if it were stored in UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell will work well with this encoding without any additional modifications. However, additional improvements may be possible by taking advantage of the consonant-vowel structure of this encoding. In fact, the split consonant-vowel representation may prove to be so useful that it may be beneficial to encode other syllabaries in this fashion, even if there are fewer than 220 symbols in them. The code to break up a syllable into its consonant and vowel parts does not exist as of Aspell 0.60.
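The bit layout described above can be sketched as follows. This is my own illustration of the proposed scheme, not code that exists in Aspell; it assumes the Unicode Ethiopic block starting at U+1200, where each row of 8 consecutive code points holds one consonant in its 8 vowel orders.

```python
# Hedged sketch of the proposed split consonant-vowel encoding for
# Ethiopic syllables.  Assumes the Unicode Ethiopic block layout
# (U+1200 up, 8 vowel orders per consonant row); not Aspell code.

ETHIOPIC_BASE = 0x1200

def split_syllable(cp: int) -> bytes:
    """Encode one Ethiopic code point as two 8-bit characters:
    `11??????' holding the consonant row (upper 6 bits of the offset),
    `10000???' holding the vowel order (lower 3 bits)."""
    offset = cp - ETHIOPIC_BASE
    consonant = offset >> 3        # which consonant row
    vowel = offset & 0b111         # which vowel order within the row
    return bytes([0b11000000 | consonant, 0b10000000 | vowel])

def join_syllable(data: bytes) -> int:
    """Recombine the two internal bytes into the original code point."""
    consonant = data[0] & 0b00111111
    vowel = data[1] & 0b00000111
    return ETHIOPIC_BASE + (consonant << 3) + vowel
```

For instance, U+1200 (the first syllable of the block) becomes the byte pair `11000000 10000000`, and the round trip through `split_syllable` and `join_syllable` recovers the original code point.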
However, it will be fairly easy to add as part of the Unicode normalization process once that is written.

The Thai Script
===============

The Thai script presents a different problem for Aspell. The problem is not that there are more than 220 unique symbols, but that there are no spaces between words. This means that there is no easy way to split a sentence into individual words. However, it is still possible to spell check Thai; it is just a lot more difficult. I will be happy to work with someone who is interested in adding Thai support to Aspell, but it is not likely something I will do in the foreseeable future.

Languages which use Hànzi Characters
====================================

Hànzi characters are used to write Chinese, Japanese, Korean, and were once used to write Vietnamese. Each hànzi character represents a syllable of a spoken word and also has a meaning. Since there are around 3,000 of them in common usage, it is unlikely that Aspell will ever be able to support spell checking languages written using hànzi. However, I am not even sure these languages need spell checking, since hànzi characters are generally not entered directly. Furthermore, even if Aspell could spell check hànzi, the existing suggestion strategy would not work well at all, and thus a completely new strategy would need to be developed.

Japanese
========

Modern Japanese is written in a mixture of hiragana, katakana, kanji, and sometimes romaji. Hiragana and katakana are both syllabaries unique to Japan, kanji is a modified form of hànzi, and romaji uses the Latin alphabet. With some work, Aspell should be able to check the non-kanji parts of Japanese text. However, based on my limited understanding of Japanese, hiragana is often used for the endings of words written in kanji. Thus, if Aspell were simply to separate out the hiragana from the kanji, it would end up with a lot of word endings which are not proper words and would thus be flagged as misspellings.
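The hiragana/kanji issue can be made concrete with a rough sketch (my own illustration, not Aspell code) that splits Japanese text into same-script runs by Unicode block:

```python
# Rough sketch: split Japanese text into maximal same-script runs
# using Unicode block ranges.  Illustration only -- not Aspell code.

def script_of(ch: str) -> str:
    """Classify one character by the Unicode block it falls in."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def split_runs(text: str):
    """Return (script, run) pairs of maximal same-script substrings."""
    runs = []
    for ch in text:
        s = script_of(ch)
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + ch)
        else:
            runs.append((s, ch))
    return runs
```

For example, the verb 食べる ("to eat") splits into the kanji stem 食 and the hiragana run べる; the latter is an inflectional ending rather than a proper word, which is exactly why naive separation would flag many false misspellings.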
Languages Written in Multiple Scripts
=====================================

Aspell should, with some work, be able to check text written in the same language but in multiple scripts. If the combined number of unique symbols in both scripts is less than 220, then a special character set can be used to allow both scripts to be encoded in the same dictionary. However, this may not be the most efficient solution. An alternative is to store each script in its own dictionary and let Aspell choose the correct dictionary based on which script the given word is written in. Aspell does not currently support this mode of spell checking; however, it is something that I hope to eventually support.

-- 
http://kevin.atkinson.dhs.org