Words With Symbols in Them - GNU Aspell 0.61-cvs


C.2 Words With Spaces or Other Symbols in Them

Many languages, including English, have words that contain non-letter symbols, such as the apostrophe. These symbols generally appear in the middle of a word, but they can also appear at the end, as in an abbreviation. If a symbol can only appear as part of a word, then Aspell can treat it as if it were a letter.

The problem, however, is that most of these symbols have other uses. For example, the apostrophe is often used as a single quote, and the period that marks an abbreviation also ends a sentence. Thus, Aspell cannot blindly treat them as if they were letters.

Aspell currently handles the case where the symbol can only appear in the middle of a word fairly well. It simply assumes that if there is a letter both before and after the symbol, then the symbol is part of the word. This works most of the time, but it is not foolproof. For example, suppose the user forgot to leave a space after the period:

       ... and the dog went up the tree.Then the cat ...

Aspell would think “tree.Then” is one word. A better solution might be to then check “tree” and “Then” separately. But what if one of them is not in the dictionary? Should Aspell assume “tree.Then” is one word?
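The middle-of-word rule and the possible fallback described above can be sketched as follows. This is only an illustration, not Aspell's actual tokenizer: the names `MIDDLE_SYMBOLS`, `tokenize`, and `accept` are hypothetical, the dictionary is assumed to be a plain set of lowercase words, and rejecting the token when any part is unknown is just one possible answer to the open question.

```python
# Hypothetical sketch; none of these names come from Aspell.
MIDDLE_SYMBOLS = {"'", "."}  # assumed set of symbols allowed inside a word


def tokenize(text, middle=MIDDLE_SYMBOLS):
    """Split text into words, keeping a symbol only when a letter
    appears both immediately before and immediately after it."""
    words, current = [], []
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha() or (ch in middle and prev_is_letter and next_is_letter):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def accept(token, dictionary, middle=MIDDLE_SYMBOLS):
    """Accept the token as a whole; failing that, accept it only if
    every part between the symbols is itself in the dictionary."""
    if token.lower() in dictionary:
        return True
    parts = [token]
    for symbol in middle:
        parts = [piece for part in parts for piece in part.split(symbol)]
    return len(parts) > 1 and all(p.lower() in dictionary for p in parts)
```

With this sketch, `tokenize` still yields “tree.Then” as a single token, and `accept` passes it only because both “tree” and “then” are known. What to report when one part is unknown remains undecided here.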

The case where the symbol can appear at the beginning or end of the word is more difficult to deal with. The symbol may or may not actually be part of the word. Aspell currently handles this case by first trying to spell check the word with the symbol and, if that fails, trying it without. The problem is this: if the word is misspelled, should Aspell assume the symbol belongs with the word or not? Currently Aspell assumes it does, which is not always correct.
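A minimal sketch of the "with the symbol first, then without" strategy is shown below. The function name `check_with_edge_symbol`, the `edge_symbols` parameter, and the set-based dictionary are assumptions made for illustration and do not reflect Aspell's internals.

```python
# Hypothetical sketch; not Aspell's implementation.
def check_with_edge_symbol(token, dictionary, edge_symbols=(".", "'")):
    """Return (is_correct, word), where word is the form the checker
    settled on: with the edge symbol if that form is known, otherwise
    without it if that form is known."""
    if token in dictionary:
        return True, token
    stripped = token.strip("".join(edge_symbols))
    if stripped and stripped in dictionary:
        return True, stripped
    # Misspelled either way: keep the symbol attached to the word,
    # mirroring the current behaviour described above.
    return False, token
```

For a dictionary containing “etc.” and “cat”, the token “etc.” is accepted with its period, “cat.” is accepted as “cat”, and “catt.” is reported as misspelled with the period still attached.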

Numbers in words present a different challenge to Aspell. If Aspell treats numbers as letters, then every possible number a user might write in a document must be specified in the dictionary. This could easily be solved by having special code that assumes all numbers are correctly spelled. Yet what about something like “4th”? Since the “th” suffix can appear after any number, we are left with the same problem. The solution would be to have a special symbol for “any number”.
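One way to picture the “any number” symbol is to replace each run of digits with a placeholder before the dictionary lookup. The sketch below assumes “#” as that placeholder and a set-based dictionary. Both are illustrative choices, as are the names `ANY_NUMBER`, `normalize_numbers`, and `check_number_word`.

```python
import re

# Hypothetical sketch; "#" as the "any number" symbol is an assumption.
ANY_NUMBER = "#"
_DIGITS = re.compile(r"[0-9]+")


def normalize_numbers(token):
    """Replace every run of digits with the 'any number' placeholder."""
    return _DIGITS.sub(ANY_NUMBER, token)


def check_number_word(token, dictionary):
    """Treat bare numbers as correct; look up mixed tokens by pattern."""
    if _DIGITS.fullmatch(token):
        return True
    return normalize_numbers(token) in dictionary
```

A dictionary holding the single entry “#th” then covers “4th”, “11th”, and every other number followed by “th”. The sketch does not model which suffix fits which number, so it would also accept “1th”.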

Words with spaces in them, such as foreign phrases, are even more trouble to deal with. The basic problem is that when tokenizing a string there is no good way to keep phrases together. One solution is to use trial and error: if a word is not in the dictionary, try grouping it with the previous or next word and see if the combined word is in the dictionary. But if the combined word is not in the dictionary either, should the misspelled word be grouped with its neighbor when looking for suggestions? One solution is to also store each part of the phrase in the dictionary, but tag it as part of a phrase and not as an independent word.
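The trial-and-error grouping could look roughly like the sketch below. It assumes the text is already split into a list of words, that the dictionary is a set which may contain two-word phrases joined by a single space, and that only two-word phrases need to be tried. The name `find_misspelled` is hypothetical.

```python
# Hypothetical sketch of the trial-and-error grouping; not Aspell code.
def find_misspelled(words, dictionary):
    """Return the indices of words that are unknown on their own and
    also unknown when joined with the previous or the next word."""
    misspelled = []
    for i, word in enumerate(words):
        if word in dictionary:
            continue
        with_prev = i > 0 and f"{words[i - 1]} {word}" in dictionary
        with_next = i + 1 < len(words) and f"{word} {words[i + 1]}" in dictionary
        if not (with_prev or with_next):
            misspelled.append(i)
    return misspelled
```

If the dictionary contains “fait accompli” but neither word alone, both words pass when they appear next to each other. If one of them is misspelled, both are reported individually, which is exactly the situation where the suggestion question raised above comes up.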

To further complicate things, most applications that use spell checkers are accustomed to parsing the document themselves and sending it to the spell checker a word at a time. In order to support words with spaces in them, a more complicated interface will be required.