The hack for IL, AL and LA is only needed because these abbreviations
are removed by the legacy tokenizer as a stop word. There is no need
to keep the hack for future tokenizers. Move it therefore to the
token extraction function.
This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derived the word sets any way they
like. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.