Text Categorization
PreProcess: Stopwords
- Description:
This file contains stopwords. Stopwords are:
- high-frequency words, such as prepositions
- grammatical words that contribute little to the meaning of a sentence, such as "the"
This file is the default stopword list. Users can define their own stopword list in the JDI configuration.
- Input:
- Procedures:
- None. This is a static file copied from the previous version.
- Output File:
- stopWords.txt, used in TC.JDI and TC.STI
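A minimal sketch of how a stopword list such as stopWords.txt might be loaded and applied to a token stream. The file layout (one word per line, UTF-8) and the case-insensitive matching are assumptions, and the function names are hypothetical; none of this is taken from the TC.JDI or TC.STI code.

```python
from pathlib import Path


def parse_stopwords(text):
    """Build a stopword set from text with one word per line (assumed layout)."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_stopwords(path):
    """Read a stopword file from disk; UTF-8 encoding is an assumption."""
    return parse_stopwords(Path(path).read_text(encoding="utf-8"))


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stopwords]


# Demo with an in-memory string standing in for the file contents.
stopwords = parse_stopwords("the\nof\nin\n")
kept = remove_stopwords(["The", "role", "of", "aspirin", "in", "stroke"], stopwords)
```

With a real file, `load_stopwords("stopWords.txt")` would take the place of the in-memory string.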
- Notes:
Stopwords could be generated automatically by applying JDI. The concept is:
- High-frequency words carry little meaning for JD.
- Do not use the stopwords option in JD.
- Find all words whose JD scores (document count):
  - are small (the largest is < 0.25)
  - do not deviate much from one another
or
- Compute the similarity (cosine coefficient) between each word's JD scores (document count) and those of "the", and find the words for which:
  - the values are small (the largest is < 0.25)
  - the JD scores do not deviate much from one another
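The two ideas above can be sketched roughly as follows, assuming each word already has a vector of JD scores (one score per journal descriptor). The 0.25 bound comes from the notes. The deviation cutoff, the function names, and the toy scores are hypothetical, since the notes give no number for "not too much deviation". The sketch only computes the cosine coefficient against "the"; how the thresholds should be applied to it is left as stated in the notes.

```python
import math
from statistics import pstdev

MAX_SCORE = 0.25      # from the notes: the largest JD score must be < 0.25
MAX_DEVIATION = 0.05  # hypothetical cutoff; the notes do not give a value


def is_stopword_candidate(scores, max_score=MAX_SCORE, max_deviation=MAX_DEVIATION):
    """First criterion: all JD scores are small and close to one another."""
    if not scores:
        return False
    return max(scores) < max_score and pstdev(scores) <= max_deviation


def cosine(u, v):
    """Cosine coefficient between two equal-length JD score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy JD score vectors (made-up numbers, one score per journal descriptor).
jd_scores = {
    "the": [0.10, 0.11, 0.09, 0.10],
    "of": [0.12, 0.10, 0.11, 0.12],
    "cardiac": [0.90, 0.05, 0.02, 0.01],
}

# Words that pass the first criterion.
candidates = sorted(w for w, s in jd_scores.items() if is_stopword_candidate(s))

# Cosine coefficient of every word's JD vector against that of "the".
similarity_to_the = {w: cosine(s, jd_scores["the"]) for w, s in jd_scores.items()}
```

In this toy data, "the" and "of" have flat, low score vectors and pass the first criterion, while "cardiac" is concentrated on one descriptor and does not.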