Text Categorization

PreProcess: Word-ST Table


  • Description:

    A table (file) that stores the Word-ST scores is generated and loaded into a DB table. This table is then used to perform ST indexing on phrases. There are two types of scores:

    • word count
    • document count

  • Input:

  • Java File & Algorithm:
    • Read in the WC and DC scores for all Word-Jdid pairs from wordJdidWcDcTable (wordJdidWcDcTable.txt)
      • The JD scores are not sorted
      • A JD score is omitted from the table if it is 0
    • Read in the WC and DC scores for all ST-Jdid pairs from stJdsTable (stJdsTable.txt)
      • The JD scores are sorted
      • A JD score is in the table even if it is 0
    • Calculate the cosine coefficient on the WC and DC vectors for all Word-Jdid and ST-Jdid pairs to form the Word-St-Wc-Dc tables
      • Make sure all JD vectors have the same number of vector components (JD scores omitted because they are 0 must be filled in as 0)
    • Print out the tables
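
    The cosine step above can be sketched as follows. This is a minimal illustration, not the actual Java file: the class and method names (CosineSketch, toDense, cosine) are hypothetical, JD ids are assumed to be zero-based positions in the vector, and the scores in main are made up. The sparse word vector (zero scores omitted, unsorted) is first expanded to the same length as the ST vector, then the cosine coefficient is computed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and data layout are assumptions, not the real implementation.
class CosineSketch {

    // Expand a sparse (JD id -> score) map into a dense vector of length numJds.
    // JD ids missing from the map (score 0 in the word table) stay 0.
    // Assumes JD ids are zero-based indexes smaller than numJds.
    public static double[] toDense(Map<Integer, Double> sparse, int numJds) {
        double[] dense = new double[numJds];
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }

    // Cosine coefficient of two vectors with the same number of components.
    // Returns 0 when either vector is all zeros, to avoid dividing by zero.
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up WC scores of one word: unsorted, JD 1 omitted because it is 0.
        Map<Integer, Double> wordWc = new HashMap<>();
        wordWc.put(2, 3.0);
        wordWc.put(0, 1.0);

        // Made-up WC scores of one ST: sorted by JD id, zeros included.
        double[] stWc = {2.0, 0.0, 6.0};

        double score = cosine(toDense(wordWc, stWc.length), stWc);
        System.out.println("Word-ST WC score: " + score);
    }
}
```

    The same two calls would be repeated with the DC scores to obtain the document score for each Word-ST pair.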

  • Output Files:
    • wordStTable.txt
      Word | ST index | ST Abbreviation | TUI | Word scores | Document scores

  • Notes: