
Text Categorization

JDI: Text


  • Description:

    Read in the input text (a title, abstract, or phrase) and perform JD indexing based on:

    • word frequency count
    • document count for word

  • Inputs:
    • a text, such as:
      • title
      • abstract
      • phrase
    • a file, such as 9801.2004.TI.in
    • a file, such as 9801.2004.AB.in
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply Word Extraction Filter (if it's MEDLINE TI or AB)
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
      • Get JD scores for each (legal) word in the text from DB: WORD_JD_SCORES table
      • Calculate the average JD scores for the text
    • Post-process (Output Filter):
      • Print out Input text (term)

      • Output Filter details
      • Score entries display number
      • No output message
      • Cluster option
      • JD candidates
      • Use alphabetical order for JDs that have the same score (e.g., "taylor", "assault")
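
The three stages above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation. The in-memory `WORD_JD_SCORES` dictionary, its sample words, JD names, and scores are hypothetical stand-ins for the database table, and the function names are invented for illustration. The sketch also assumes that a word is "legal" when it has an entry in the table, and that every word carries a score for every JD. The Word Extraction and acronym filters are omitted.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WORD_JD_SCORES table: word -> {JD: score}.
WORD_JD_SCORES = {
    "heart": {"Cardiology": 0.75, "Pulmonary Medicine": 0.25},
    "lung": {"Cardiology": 0.25, "Pulmonary Medicine": 0.75},
}


def pre_process(text, unique=False):
    """Tokenize the text, keep legal words, and optionally drop duplicates."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: a word is legal if it has scores in the table.
    words = [t for t in tokens if t in WORD_JD_SCORES]
    if unique:
        words = list(dict.fromkeys(words))  # removes repeats, keeps order
    return words


def average_jd_scores(words):
    """Average each JD's score over all words kept for processing."""
    if not words:
        return {}
    totals = defaultdict(float)
    for word in words:
        for jd, score in WORD_JD_SCORES[word].items():
            totals[jd] += score
    return {jd: total / len(words) for jd, total in totals.items()}


def rank(scores, display_number=None):
    """Sort by descending score; JDs with equal scores sort alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:display_number]


if __name__ == "__main__":
    text = "Heart, heart and lung"
    words = pre_process(text, unique=True)
    print(text)
    for jd, score in rank(average_jd_scores(words)):
        print(f"{score:.4f}|{jd}")
```

With the unique flag set, the repeated "heart" counts once, so both JDs average to the same score and the alphabetical tie-break decides their order. Without the flag, the repeated word counts twice and shifts the averages toward its own JD scores.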

  • Sample commands:
    > jdi -p
    => index text from standard input, with a prompt
    
    > jdi -d -i:9801.2004.TI.in -o:9801.2004.TI.out
    => index text from file, 9801.2004.TI.in, and send the results to a file, 9801.2004.TI.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TI.out
    • a file, such as 9801.2004.AB.out
    • a file, such as 9801.2004.TIAB.out