Text Categorization

STRI: Text


  • Description:

    Reads the input text and performs ST real-time indexing based on:

    • word frequency count
    • document count for word
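
    The two counts above can be illustrated with a minimal Python sketch. It
    is not the STRI implementation: the function name count_words and the
    sample documents are hypothetical, and each document is assumed to be
    already tokenized into a list of words.

```python
from collections import Counter


def count_words(documents):
    """Return (word_freq, doc_freq) for a list of tokenized documents.

    Hypothetical helper for illustration only; not part of STRI.
    """
    word_freq = Counter()  # total occurrences of each word across all documents
    doc_freq = Counter()   # number of documents that contain each word
    for tokens in documents:
        word_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # each word counts once per document
    return word_freq, doc_freq


# Illustrative data only.
docs = [["heart", "attack", "heart"], ["heart", "disease"]]
word_freq, doc_freq = count_words(docs)
```

    Here "heart" occurs three times in total but appears in only two
    documents, which is the distinction between the two counts.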

  • Inputs:
    • a phrase, such as the combination of title and abstract
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply Word Extraction Filter
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
    • Post-process (Output Filter):
      • Print out input text (term)
      • Detail output filter
      • Score entries display number
      • No output message
      • Cluster option
      • ST candidates
      • Use alphabetical order for STs that have the same score
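
    As a rough illustration of the pre-process and post-process steps above,
    the following Python sketch tokenizes an input term, drops words that are
    not legal, optionally removes duplicates, and orders scored ST candidates
    with an alphabetical tie-break. All names (preprocess, rank_candidates,
    legal_words, limit) are hypothetical, the tokenizer is an assumed simple
    one, and the scores are made-up values. The Word Extraction Filter and
    the acronym filter are not specified here, so they are left out.

```python
import re


def preprocess(text, legal_words=None, unique=False):
    """Tokenize text, keep only legal words, optionally drop duplicates.

    Hypothetical sketch; the real STRI filters may behave differently.
    """
    # Assumed tokenizer: lowercase, then split on non-alphanumeric characters.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep only legal words when a legal-word set is supplied.
    if legal_words is not None:
        tokens = [t for t in tokens if t in legal_words]
    # Remove duplicates while preserving first-seen order.
    if unique:
        tokens = list(dict.fromkeys(tokens))
    return tokens


def rank_candidates(scores, limit=None):
    """Sort candidates by descending score, then alphabetically by name."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # 'limit' stands in for the number of score entries to display.
    return ranked if limit is None else ranked[:limit]


# Illustrative inputs only.
words = preprocess(
    "Heart attack and heart disease",
    legal_words={"heart", "attack", "disease"},
    unique=True,
)
top = rank_candidates(
    {"Sign or Symptom": 0.5, "Disease or Syndrome": 0.9, "Finding": 0.5},
    limit=2,
)
```

    Sorting on the key (-score, name) places higher scores first and breaks
    ties alphabetically in a single pass.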

  • Sample commands:
    > stri -p
    => index text from standard input, with a prompt
    
    > stri -i:9801.2004.TIAB.in -o:9801.2004.TIAB.out
    => index text from the file 9801.2004.TIAB.in and write the results to the file 9801.2004.TIAB.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TIAB.out