Text Categorization

PreProcess: ST-JDs Table


  • Description:

    JDI is applied to the St-Documents to produce the St-JD scores table. Each JD score vector includes:

    • word count score
    • document count score

  • Input:

  • Java File & Algorithm:

    Run JDI (using the latest word-JD table) on each ST, through its St-Documents, to get:

    • Word count score
    • Document count score

    The default input filter options of JDI should be used. The settings are as follows:

    • Remove stopwords
    • Use restrictwords
    • Use normalized signal filter between 2 ~ 510754
      => Please note that the default max. signal in JDI.2008 is 645881 (not 510754). This is because there is an SCR (SCR-44) for the change, made after the STI table was generated. Along with the 5 stop word changes (SCR-43), there are minor differences in the stJdsTables for ftcn, neop, and orgf.
      => The max. signal must include cancer, blood, and risk, and exclude function and therapy. Susanne suggests using "cancer" as the upper limit since it is not a stop word.
    • Use min. word count of 2
    • Use min. document count of 2
    • Use min. length of 3
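
    The thresholds above can be read as one acceptance test per word. The following is a minimal sketch of that reading, not JDI's actual code or API: the class name, method name, and parameters are hypothetical, the signal is assumed to be an integer value, and both signal bounds are assumed to be inclusive. The restrictwords option is left out because its behavior is not defined in this section.

```java
import java.util.Set;

/** Hypothetical illustration of the input filter settings; not part of JDI. */
final class JdiInputFilterSketch {
    // Threshold values copied from the settings list in this section.
    static final long MIN_SIGNAL = 2L;
    static final long MAX_SIGNAL = 510754L;
    static final int MIN_WORD_COUNT = 2;
    static final int MIN_DOC_COUNT = 2;
    static final int MIN_LENGTH = 3;

    private JdiInputFilterSketch() { }

    /**
     * Returns true if the word passes every filter.
     * Assumption: the caller supplies the signal and counts for the word;
     * how JDI computes them is not shown here.
     */
    static boolean accept(String word, long signal, int wordCount,
                          int docCount, Set<String> stopwords) {
        if (stopwords.contains(word)) {
            return false; // remove stopwords
        }
        if (signal < MIN_SIGNAL || signal > MAX_SIGNAL) {
            return false; // signal outside the assumed inclusive range
        }
        if (wordCount < MIN_WORD_COUNT) {
            return false; // below min. word count
        }
        if (docCount < MIN_DOC_COUNT) {
            return false; // below min. document count
        }
        return word.length() >= MIN_LENGTH; // min. length
    }
}
```

    Under these assumptions, a word whose signal is 645881 (the JDI.2008 default maximum noted above) would be rejected by the 510754 limit, while a word whose signal is exactly 510754 would be kept.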

  • Output Files:
    • stJdsTable.txt

      ST | TUI | Wc Scores | Dc Score | Jdid | Jd Name

  • Notes: