Text Categorization

STRI: Text

Description:
Read in the input text and perform ST real-time indexing based on:
- word frequency count
- document count for word
Inputs:
- a phrase, such as the combination of title and abstract
- a file, such as 9801.2004.TIAB.in
Algorithm:
- Pre-Process (Input Filter):
  - Tokenize all words of the input term
  - Apply Word Extraction Filter
  - Apply acronym filter (TBD)
  - Filter out illegal words
  - Filter out duplicate words if the unique flag is true
  - Assign the final words for processing
- Process:
  - Get JDI scores for the input text
  - Calculate the vector similarity (cosine coefficient) between the JDI scores from the previous step (word-JD) and the ST-JD scores
- Post-process (Output Filter):
  - Print out input text (term)
  - Detail output filter
  - Score entries display number
  - No output message
  - Cluster option
  - ST candidates
  - Use alphabetical order for STs that have the same score
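
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the score tables (WORD_JD, ST_JD), the labels in them (JD1, st_a, ...), and every function name are hypothetical, the legal-word check is reduced to a lookup in the word-JD table, and the JDI score of the text is assumed to be the average of its words' JD scores.

```python
import math
import re

# Hypothetical toy tables; a real run would use precomputed word-JD and ST-JD scores.
WORD_JD = {
    "heart": {"JD1": 0.9, "JD2": 0.1},
    "brain": {"JD1": 0.1, "JD2": 0.9},
}
ST_JD = {
    "st_b": {"JD1": 1.0, "JD2": 0.0},
    "st_a": {"JD1": 2.0, "JD2": 0.0},  # same direction as st_b, so the same cosine
    "st_c": {"JD1": 0.0, "JD2": 1.0},
}


def preprocess(text, legal_words, unique=True):
    """Tokenize, keep only legal words, optionally drop duplicates."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = [t for t in tokens if t in legal_words]
    if unique:
        # dict.fromkeys keeps the first occurrence and preserves order
        words = list(dict.fromkeys(words))
    return words


def jdi_scores(words, word_jd):
    """Assumed JDI score of a text: the average of its words' JD scores."""
    if not words:
        return {}
    totals = {}
    for word in words:
        for jd, score in word_jd[word].items():
            totals[jd] = totals.get(jd, 0.0) + score
    return {jd: total / len(words) for jd, total in totals.items()}


def cosine(a, b):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sts(text, word_jd, st_jd, unique=True):
    """Score every ST against the text; sort by score, then alphabetically."""
    words = preprocess(text, set(word_jd), unique)
    text_vector = jdi_scores(words, word_jd)
    scored = [(st, round(cosine(text_vector, vec), 4)) for st, vec in st_jd.items()]
    # Descending score first; ties fall back to alphabetical ST order
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for st, score in rank_sts("Heart heart the", WORD_JD, ST_JD):
        print(st, score)
```

With the toy tables, st_a and st_b receive the same score because their vectors point in the same direction, so the alphabetical rule places st_a first; st_c comes last.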

Sample commands:

> stri -p
=> index text from standard input, with a prompt

> stri -i:9801.2004.TIAB.in -o:9801.2004.TIAB.out
=> index text from file, 9801.2004.TIAB.in, and send the results to a file, 9801.2004.TIAB.out

Sample Outputs:
- a file, such as 9801.2004.TIAB.out