From kevina@users.sourceforge.net Fri Jun 29 17:59:57 2001
Date: Fri, 15 Jun 2001 06:08:01 -0400 (EDT)
From: Kevin Atkinson
To: aspell-announce@lists.sourceforge.net
Subject: [aspell-announce] Official Non-English Word Lists Packages Now Available

[Please redistribute this announcement as you see fit]

In an effort to make installing word lists for non-English languages in
Aspell straightforward, I have decided to release foreign language
dictionaries in a standard format. A preliminary version of my efforts is
currently available at the Aspell home page (http://aspell.sourceforge.net).
There you will find support for the following languages:

  Breton (br), Catalan (ca), Czech (cs), Danish (da), Dutch (nl),
  Esperanto (eo), Faroese (fo), French (fr), German (de), Italian (it),
  Norwegian (no), Polish (pl), Russian (ru), Spanish (es), Swedish (sv).

Please check them out and let me know what you think, but please keep in
mind that these are preliminary and subject to change.

These packages contain everything needed to add support for a given
language to Aspell. These packages will also install the necessary files so
that the word list will be correctly recognized by Pspell -- something that
is often not handled correctly by word list authors' own Aspell packages.

Support for variants in languages (such as American, British, Canadian,
Swiss German, etc.) is not yet available as I am currently not sure of the
best way to do this. However, support *will* be available by the next
version of Pspell and Aspell, as I am planning on also packaging the English
dictionaries this way and distributing them separately.

Authors of the word lists used to create these packages are encouraged to
check them out to make sure I did things correctly. In the near future I
will allow you (the word list author) to maintain the packages yourself, but
for right now I want to maintain tight control over them as the format is
not quite finalized yet.

Also, I am especially interested in feedback from Package Maintainers (RPM,
Debian, etc.) as the Makefile is rather primitive and probably neither
installs things correctly nor supports the options maintainers need to make
packaging simple. Suggestions are more than welcome, but patches to the proc
script (the script which does all the real work) are even more welcome. I am
also strongly interested in what you think about the layout of the
dictionary files and language names.

Technical Notes for Word List authors and Package Maintainers

In order to make things as straightforward, portable, and uniform as
possible, I have decided to enforce the following rules.

  1) All language names are now the two letter ISO code (en, da, etc.).

  2) The actual dictionary files must all start with the two letter code and
     end in .rws and may only contain ASCII characters.

  3) Aliases are created using Aspell's multi files and not symbolic links.

These rules are very different from the current way dictionaries are handled
but I feel they will make life easier for everyone.

The reason for the first rule is that in the past the Aspell language names
were a mixture of the name spelled out in English and the name spelled out
in the native language, and in some cases involved non-ASCII characters,
which was just asking for trouble on non-Unix-like platforms and probably
some older Unix ones. Some people went as far as doing it both ways by
symbolically linking one language data file to the other. This amazingly
worked, but it is a complete abuse of how language names and data files are
meant to be used. Finally, others thought that language variants (American,
Swiss German, etc.) should be considered separate languages and either
attempted to specify them as a language at the command line, for example
trying "aspell --lang=canadian ...", or created separate data files for
them.
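
As a rough illustration of rules 1 and 2, the naming constraints could be
checked mechanically along the following lines. This is only a sketch and
not part of the proc script: the function names are made up for this
example, and it assumes that a language code is exactly two lowercase ASCII
letters.

```python
import re

# Assumption for this sketch: a language code is exactly two lowercase
# ASCII letters, as in the examples "en" and "da" (rule 1).
LANG_CODE = re.compile(r"[a-z]{2}")


def is_valid_lang_code(code: str) -> bool:
    """Return True if `code` looks like a two letter language code."""
    return LANG_CODE.fullmatch(code) is not None


def is_valid_dict_filename(filename: str, code: str) -> bool:
    """Check rule 2: starts with the code, ends in .rws, ASCII only."""
    return (
        is_valid_lang_code(code)
        and filename.isascii()          # only ASCII characters allowed
        and filename.startswith(code)   # must begin with the language code
        and filename.endswith(".rws")   # must carry the .rws extension
    )
```

With these hypothetical helpers, a name such as "canadian" is rejected as a
language code, and a file such as "nederlands.rws" is rejected for the code
"nl" because it does not start with that code.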

All of this did no good but to confuse people, so I wanted to formalize
this. I was originally planning on using the language name spelled out in
ASCII characters, but realized that in many cases I didn't know what this
should be, so I decided to go with the universally known language codes.

The second rule is there so that it is clear which word lists belong to
which languages. I require them to be all ASCII characters for maximum
portability. However, the end user is not expected to use these word lists
directly. Instead they are expected to use one of the aliases created via
the .multi file. These aliases can be anything whatsoever and may include
non-ASCII characters.

Symbolic links are not used as they are Unix specific and not supported by
Win32. Non-ASCII characters are okay for aliases as they can simply not be
installed on platforms which don't support them.

Draft Documentation on the layout of Aspell dicts packages

The overall goal of Aspell dicts is to provide a uniform method to
distribute dictionaries for Aspell for any language that Aspell supports.

This documentation is still in an early stage and rather incomplete. It is
meant to give you enough of an overview so you know what is going on, but
probably won't be enough information for you to actually create a
distribution.

Layout of the Distribution:

An Aspell Word List Package contains several types of files, many of them
generated by the proc script.

These must be provided:

  info: the main file which contains all of the important word lists
  *.cwl: compressed word list files
  Copyright: the copyright notice

And these are automatically generated/provided for you:

  configure: the configure script which finds the appropriate paths and
    generates the actual makefile.
  ??.dat: the data file for the language.
  *.multi: the dictionary files
  Makefile.pre: the makefile which configure uses.

And finally several optional ones:

  ??_phonet.dat: The optional phonet data file
  README: A readme file.
    If one is not provided, a generic one will be created.
  COPYING: The actual license agreement. Automatically provided for some
    licenses.
  doc/*: additional documentation

*** Format of the Info File

(Note: For a better idea of how this file is laid out see some of the
sample info files included)

The info file is the main file which contains most of the information. It
has two types of entries: single value settings and group settings.

Single value settings have the form:

And group settings have the form:

  : ...

If there is ANY whitespace before a key it is assumed to belong to a group
entry.

The following single value settings are mandatory:

  name_english: The English name of the language
  code: The two letter code
  copyright: The copyright, one of:
      LGPL
      GPL
      FDL
      Artistic
      Copyrighted (Copyright message must remain)
      Open Source (Meets OSI definition)
      Public Domain (ie none)
      Other
      Unknown
  version: A version string
  charset: charset to use
  soundslike: one of
      none
      generic
      phonet
    If it is phonet the file _phonet.dat is expected to be present.

In addition there must be at least one of each of the following group
entries:

  author:
    name: The name of the author
    email: The email address of the author.

  Multiple author groups may be specified.

  dict: The defining entry for a dictionary
    name: The name of this dict
    alias: An alternate name (may be repeated)
    add: A word list to add (may be repeated)

  For right now there should only be one dict entry and its name should be
  the same as the language code. The proc script has the ability to handle
  more than one but I have not worked out the details yet.

In addition to the above, the info file can also contain the following
optional entries:

  name_ascii: The language name spelled in its own language in all ASCII
    characters
  name_native: Like above but not limited to ASCII characters.

And a bunch of other entries which I will document later.

*** The *.cwl files

For each add entry in the dict entry there should in general be one word
list.
Each of these word lists will be compiled into a separate hash file, so you
should keep the number to a minimum. Each file is expected to have the
following format:

  [-...].cwl

These files are expected to be compressed with word-list-compress. To
compress a file do something like the following:

  export LANG=C
  cat | sort -u | word-list-compress c > ...cwl

The LANG=C is important; otherwise the file will not be compressed
optimally.

*** Copyright file

The copyright file simply states the terms under which this word list is
available. If the license is a standard one, or is more than a paragraph or
so, the actual license should be included in a separate file "COPYING". If
you are using one of the GNU licenses the COPYING file will automatically be
generated for you.

*** Running proc

Once the info and *.cwl files are created you are ready to run the proc
script. To do so simply type:

  perl proc create

and if there are no errors you should have the generated files listed
above. To try building a word list, run configure with

  ./configure

and then to build and install it

  make
  make install

To create a distribution do a

  make dist

-- 
Kevin Atkinson
kevina at users sourceforge net
http://www.ibiblio.org/kevina/