Creating Inverted Indexes
Indexes should contain only domain-relevant information. This requires
some document processing before indexing.
- Remove formatting: we only care about the text, not how it's presented
- Remove stopwords: words like "a", "and", and "the" are so common that they do little to distinguish one document from another
- Stem words: combine morphological and inflectional variants into one entry. "Swim", "Swimming", and "Swimmer" are all indexed under "Swim".
- Match synonyms: combine synonyms into one entry. "Car", "Auto", and "Automobile" all map to "Car".
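
The four steps above can be sketched as a small pipeline that feeds an inverted index (a map from each term to the set of documents containing it). This is only an illustrative sketch: the stopword list, the synonym table, the suffix-stripping `stem` function, and the assumption that formatting means HTML-style tags are all hypothetical stand-ins. A real system would more likely use an established stemmer (such as Porter's) and a curated thesaurus.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny tables for illustration only.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "to", "of"}
SYNONYMS = {"auto": "car", "automobile": "car"}


def strip_formatting(text):
    """Remove formatting; here we assume it is HTML-style tags."""
    return re.sub(r"<[^>]+>", " ", text)


def stem(word):
    """Toy stemmer: strip one common suffix, then undouble a final consonant.

    This is NOT a real stemming algorithm; it only handles simple cases.
    """
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # "swimm" -> "swim"
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def normalize(text):
    """Apply the four steps and return the list of index terms."""
    words = re.findall(r"[a-z]+", strip_formatting(text).lower())
    terms = []
    for word in words:
        if word in STOPWORDS:
            continue
        word = stem(word)
        terms.append(SYNONYMS.get(word, word))
    return terms


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "<p>The swimmer is swimming in the pool</p>",
    2: "An automobile and a car",
    3: "<b>Swim</b> to the auto",
}
index = build_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
# car [2, 3]
# pool [1]
# swim [1, 3]
```

Because the same `normalize` function would also be applied to a query, a search for "automobiles" or "swimming" ends up looking under the entries "car" and "swim".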