Text Categorization

PreProcess: Stopwords

  • Description:
    This file contains stopwords. Stopwords are:
    • high-frequency words, such as prepositions
    • grammar (function) words that contribute little to the meaning of a sentence, such as "the"
    This stopword file is the default list of stopwords. Users can define their own stopword list in the JDI configuration.

  • Input:
    • stopWords.txt
      stopwords
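
A minimal sketch of how a stopword list like this might be loaded and applied. It assumes the file holds one stopword per line and that matching is case-insensitive; neither is stated here. The names load_stopwords and remove_stopwords are hypothetical, and an in-memory string stands in for stopWords.txt.

```python
import io


def load_stopwords(lines):
    """Build a stopword set from an iterable of lines.

    Assumption: one stopword per line; blank lines are skipped and
    words are lowercased so that matching is case-insensitive.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# In-memory stand-in for stopWords.txt (hypothetical content).
sample_file = io.StringIO("the\nof\nin\n\nA\n")
stopwords = load_stopwords(sample_file)

tokens = "The role of calcium in the heart".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

To read the real file, pass an open file handle instead of the StringIO object, for example load_stopwords(open("stopWords.txt", encoding="utf-8")).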

  • Procedures:
    • None. This is a static file, copied from the previous version.

  • Output File:
    • stopWords.txt, used in TC.JDI and TC.STI
      StopWord

  • Notes:
    Stopwords could be generated automatically by applying JDI. The concept is:
    • High-frequency words carry little meaning for JD.
    • Do not use the stopwords option in JD.
    • Find all words whose JD scores (document count) are:
      • small (the largest one is < 0.25)
      • close to one another (little deviation between JD scores)
    or
    • Find all words for which the similarity (cosine coefficient) of JDs (document count) between the word and "the" is:
      • small (the largest one is < 0.25)
      • close to one another (little deviation between JD scores)
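
The two ideas in these notes can be sketched as follows. This is an illustration under stated assumptions, not the JDI implementation: each word is assumed to have one JD score per journal descriptor, the score vectors below are invented sample data, and MAX_DEVIATION is a hypothetical cutoff, since the notes only ask for "not too much deviation". Only the 0.25 limit comes from the notes. The cosine coefficient is computed for each word against "the", but no threshold is applied to it, because the notes do not fix one unambiguously.

```python
import math
from statistics import pstdev

MAX_SCORE = 0.25      # from the notes: the largest JD score must be < 0.25
MAX_DEVIATION = 0.05  # assumption: the notes give no numeric deviation limit


def is_small_and_flat(scores, max_score=MAX_SCORE, max_dev=MAX_DEVIATION):
    """True if every JD score is small and the scores vary little."""
    return max(scores) < max_score and pstdev(scores) < max_dev


def cosine(u, v):
    """Cosine coefficient between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical JD score vectors (one score per JD), for illustration only.
jd_scores = {
    "the": [0.10, 0.11, 0.10, 0.09],
    "of": [0.12, 0.10, 0.11, 0.12],
    "cardiac": [0.90, 0.02, 0.01, 0.03],
}

# First approach: small, evenly spread JD scores mark a stopword candidate.
candidates = sorted(w for w, s in jd_scores.items() if is_small_and_flat(s))

# Second approach: compare each word's JD score vector with that of "the".
similarity_to_the = {w: cosine(s, jd_scores["the"]) for w, s in jd_scores.items()}

print(candidates)
print(similarity_to_the)
```

In this sample data, "the" and "of" pass the first test because their scores are low and nearly uniform across JDs, while "cardiac" fails because one JD dominates.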