Text Categorization

JDI: Text

Description:
Read in the input text (title, abstract, phrase) and perform Journal Descriptor (JD) indexing based on:
- word frequency count
- document count for word
Inputs:
- a text, such as:
  - title
  - abstract
  - phrase
- a file, such as 9801.2004.TI.in
- a file, such as 9801.2004.AB.in
- a file, such as 9801.2004.TIAB.in
Algorithm:
- Pre-Process (Input Filter):
  - Tokenize the input text into words
  - Apply Word Extraction Filter (if the input is a MEDLINE TI or AB)
  - Apply acronym filter (TBD)
  - Filter out words that are not legal
  - Filter out duplicate words if the unique flag is true
  - Assign the final words for processing
- Process:
  - Get JD scores for each (legal) word in the text from the DB table WORD_JD_SCORES
  - Calculate the average JD scores for the text
- Post-process (Output Filter):
  - Print out Input text (term)
  - Output Filter details
  - Score entries display number
  - No output message
  - Cluster option
  - JD candidates
  - Use alphabetical order for JDs that have the same score (Ex: "taylor", "assault")
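
The three stages above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual implementation. The in-memory dictionary stands in for the WORD_JD_SCORES table, and its words, JD names, and scores are invented. The function names (preprocess, average_jd_scores, rank), the rule that a word is "legal" when it has an entry in the table, and the choice to divide by the total number of legal words are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WORD_JD_SCORES table: word -> {JD: score}.
# All words, JD names, and numbers here are invented for illustration.
WORD_JD_SCORES = {
    "heart": {"Cardiology": 0.75, "Pediatrics": 0.25, "Anatomy": 0.25},
    "surgery": {"Cardiology": 0.25, "Pediatrics": 0.25, "Anatomy": 0.25},
}


def preprocess(text, unique=False):
    """Pre-process: tokenize, drop words that are not legal, optionally dedupe."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: a word counts as "legal" if it has an entry in the table.
    legal = [w for w in words if w in WORD_JD_SCORES]
    if unique:
        # dict.fromkeys keeps the first occurrence and preserves word order.
        legal = list(dict.fromkeys(legal))
    return legal


def average_jd_scores(words):
    """Process: look up each word's JD scores and average them over the text."""
    if not words:
        return {}
    totals = defaultdict(float)
    for word in words:
        for jd, score in WORD_JD_SCORES[word].items():
            totals[jd] += score
    # Assumption: the average is taken over all legal words in the text.
    return {jd: total / len(words) for jd, total in totals.items()}


def rank(scores, limit=None):
    """Post-process: highest score first, ties broken alphabetically by JD."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    words = preprocess("Heart surgery of the heart", unique=True)
    for jd, score in rank(average_jd_scores(words)):
        print(f"{score:.4f}|{jd}")
```

In this sketch the tie between "Anatomy" and "Pediatrics" is resolved alphabetically by sorting on the pair (negated score, JD name), and the limit argument plays the role of the score entries display number. An input with no legal words yields an empty result, which corresponds to the case where the no output message would be shown.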

Sample commands:

> jdi -p
=> index a text from standard input with prompt

> jdi -d -i:9801.2004.TI.in -o:9801.2004.TI.out
=> index text from file, 9801.2004.TI.in, and send the results to a file, 9801.2004.TI.out

Sample Outputs:
- a file, such as 9801.2004.TI.out
- a file, such as 9801.2004.AB.out
- a file, such as 9801.2004.TIAB.out