Text Categorization

JDI: Text


  • Description:

    Read the input text (a title, abstract, or phrase) and perform JD indexing based on:

    • word frequency count
    • document count for word
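
    The two statistics above differ only in how repeated words inside one text are counted. The following is a minimal sketch of that difference, not the tool's actual code; the function name and the sample documents are hypothetical.

```python
from collections import Counter


def word_and_document_counts(documents):
    """Return (word_count, doc_count) for a list of tokenized documents.

    Hypothetical helper for illustration only.
    """
    word_count = Counter()
    doc_count = Counter()
    for words in documents:
        word_count.update(words)       # every occurrence of a word counts
        doc_count.update(set(words))   # each document counts once per word
    return word_count, doc_count


# Made-up sample data: two tokenized texts.
docs = [
    ["heart", "attack", "heart"],
    ["heart", "surgery"],
]
word_count, doc_count = word_and_document_counts(docs)
```

    Here "heart" occurs three times in total but in only two documents, so the two counts diverge for that word.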

  • Inputs:
    • a text, such as:
      • title
      • abstract
      • phrase
    • a file, such as 9801.2004.TI.in
    • a file, such as 9801.2004.AB.in
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply the Word Extraction Filter (if the input is a MEDLINE TI or AB field)
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
      • Get the JD scores for each (legal) word in the text from the database table WORD_JD_SCORES
      • Calculate the average JD scores for the text
    • Post-process (Output Filter):
      • Print out Input text (term)

      • Output Filter details
      • Score entries display number
      • No output message
      • Cluster option
      • JD candidates
      • Use alphabetical order for JDs that have the same score (Ex: "taylor", "assault")
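
    The three stages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the in-memory dictionary stands in for the WORD_JD_SCORES table, a word is treated as "legal" only if it has an entry there, the scores and JD names are made up, and all function names are hypothetical. The Word Extraction and acronym filters are omitted.

```python
import re

# Hypothetical stand-in for the WORD_JD_SCORES table: word -> {JD: score}.
# The words, JD names, and scores are invented for this example.
WORD_JD_SCORES = {
    "heart": {"Cardiology": 0.75, "Surgery": 0.25},
    "surgery": {"Cardiology": 0.25, "Surgery": 0.75},
}


def preprocess(text, unique=False):
    """Pre-process: tokenize, keep legal words, optionally de-duplicate."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: a word is "legal" if it has an entry in the score table.
    words = [w for w in words if w in WORD_JD_SCORES]
    if unique:
        words = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    return words


def average_jd_scores(words):
    """Process: average each JD's score over all words of the text."""
    if not words:
        return {}
    totals = {}
    for word in words:
        for jd, score in WORD_JD_SCORES[word].items():
            totals[jd] = totals.get(jd, 0.0) + score
    return {jd: total / len(words) for jd, total in totals.items()}


def rank(scores, top_n=None):
    """Post-process: highest score first, ties in alphabetical JD order."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]  # top_n=None keeps every entry


def index_text(text, unique=False, top_n=None):
    return rank(average_jd_scores(preprocess(text, unique)), top_n)
```

    With these made-up scores, index_text("Heart surgery") gives both JDs the same average, so the alphabetical tie-break decides their order; a text with no legal words yields an empty list, which is where a "no output" message would apply.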

  • Sample commands:
    > jdi -p
    => index text from standard input, with a prompt
    
    > jdi -d -i:9801.2004.TI.in -o:9801.2004.TI.out
    => index text from the file 9801.2004.TI.in and send the results to the file 9801.2004.TI.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TI.out
    • a file, such as 9801.2004.AB.out
    • a file, such as 9801.2004.TIAB.out