Text Categorization

MEDLINE Tokenizer

  • Description:

    Reads a MEDLINE file and retrieves the specified fields. The output is used as the input for JDI (both phrases and MeSH terms).

  • Usage:
    > mlt -h
    
    Synopsis:
      Mlt [options]
    
    Description:
      Mlt is a program to tokenize MEDLINE citations by specifying field tags
    
    Options:
      -ci       Show configuration information
      -h        Print program help information (this is it)
      -i:STR    Specify input file (must specify)
      -pmid     Preserve PMID in the first field
      -s        Sort output by PMID
      -o:STR    Specify output file (must specify)
      -t:STR    Specify MEDLINE field tag:TI|AB|TIAB|MHs|TIABMHs|ALL|S_ALL (must specify)
                or any MEDLINE field tag
      -v        Print the current version of Mlt
      -x:STR    Specify an alternate configuration file
     
    

  • Sample Inputs:
    • 9801.2004.baseline.sorted
    • 9801.2005.baseline.sorted

  • Algorithm:
    • Read in the file and save the MEDLINE records as Java objects (CitationObjs).
    • Sort the citations by PMID if the -s sorting flag is chosen
    • Print out field data by specified field tags:
      • MHs: starred MeSHs (MH and SH are separated by '|')
      • TIABMHs: combination of TI, AB, and MHs
      • ALL: the original format
      • TIAB: combination of TI and AB
      • Field tag: any legal MEDLINE field tag, such as TI or AB. Please refer to the MEDLINE field tags for details.
    • Re-format MHs:
      • Read in the MEDLINE file and generate the MH (MeSH) entries
      • Handle MH fields that span multiple lines
      • Filter out MHs without a star (*)
      • Tokenize the MeSH main heading
      • Tokenize the MeSH sub headings and keep only those with a star (*)
        => Indexing rules do not allow * on both MH and SH. However, the code is able to handle this situation
      • The MeSH main heading is always unique
      • Remove duplicate MeSH sub headings (a sub heading may appear more than once)
      • Change each sub heading to its abbreviated form
      • Print out the main heading and sub headings, using "|" as the separator
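
The MH re-formatting steps above can be pictured with a minimal Java sketch. It is not the Mlt source code: the class and method names are hypothetical, the abbreviation table is a three-entry illustrative subset, and the exact output layout (main heading followed by "|" and each abbreviated sub heading) is an assumption based on the description above. The sample record in main is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names are hypothetical, not Mlt's API. */
class MeshHeadingSketch {

    // Small illustrative subset of sub heading abbreviations (assumed table).
    static final Map<String, String> ABBREVIATIONS = Map.of(
        "drug therapy", "DT",
        "genetics", "GE",
        "diagnosis", "DI");

    /** Collects MH field values, joining continuation lines (6 leading spaces). */
    static List<String> mhValues(List<String> recordLines) {
        List<String> values = new ArrayList<>();
        boolean inMh = false;
        for (String line : recordLines) {
            if (line.startsWith("MH  - ")) {
                values.add(line.substring(6).trim());
                inMh = true;
            } else if (inMh && line.startsWith("      ")) {
                int last = values.size() - 1;
                values.set(last, values.get(last) + " " + line.trim());
            } else {
                inMh = false;
            }
        }
        return values;
    }

    /** Returns "Main|SH1|SH2", or null when nothing in the value is starred. */
    static String reformat(String mhValue) {
        if (!mhValue.contains("*")) {
            return null; // filter out MH without a star
        }
        String[] tokens = mhValue.split("/");
        String mainHeading = stripStar(tokens[0]);
        // LinkedHashSet drops duplicate sub headings and keeps first-seen order.
        Set<String> subHeadings = new LinkedHashSet<>();
        for (int i = 1; i < tokens.length; i++) {
            String token = tokens[i].trim();
            if (token.startsWith("*")) { // keep starred sub headings only
                String name = stripStar(token);
                subHeadings.add(ABBREVIATIONS.getOrDefault(name, name));
            }
        }
        StringBuilder out = new StringBuilder(mainHeading);
        for (String sh : subHeadings) {
            out.append('|').append(sh);
        }
        return out.toString();
    }

    static String stripStar(String token) {
        String t = token.trim();
        return t.startsWith("*") ? t.substring(1) : t;
    }

    public static void main(String[] args) {
        List<String> record = List.of( // made-up sample record
            "PMID- 12345678",
            "MH  - Adult",
            "MH  - *Neoplasms",
            "MH  - Breast Neoplasms/*drug therapy/genetics/*drug",
            "      therapy");
        for (String value : mhValues(record)) {
            String formatted = reformat(value);
            if (formatted != null) {
                System.out.println(formatted);
            }
        }
        // Prints:
        // Neoplasms
        // Breast Neoplasms|DT
    }
}
```

A LinkedHashSet is used here so that duplicate sub headings collapse while the original order is kept; whether Mlt preserves or sorts the order is not stated in this document.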

  • Sample commands:
    > mlt -t:TIAB -i:9801.2004.baseline.sorted -o:9801.2004.TIAB
    
    => Read in file 9801.2004.baseline.sorted, retrieve the TI and AB fields, and write the results to file 9801.2004.TIAB
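
To make the TIAB retrieval concrete, here is a rough Java sketch of the idea: parse MEDLINE-format lines into one tag-to-text map per citation, optionally sort by PMID (as -s does), and join TI and AB. It is not how Mlt is implemented. The class and method names are hypothetical, the input lines are made up, and the output layout, including the "|" between PMID and text, is an assumption because the actual output format is not shown in this document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical, not Mlt's API. */
class TiabSketch {

    /** Parses MEDLINE-format lines; blank lines separate citations. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> citations = new ArrayList<>();
        Map<String, String> current = null;
        String lastTag = null;
        for (String line : lines) {
            if (line.isBlank()) {
                current = null; // a blank line ends the current record
                lastTag = null;
            } else if (line.startsWith("      ") && current != null && lastTag != null) {
                // Continuation line: append to the previous field's text.
                current.merge(lastTag, line.trim(), (a, b) -> a + " " + b);
            } else if (line.length() >= 6 && line.startsWith("- ", 4)) {
                if (current == null) {
                    current = new LinkedHashMap<>();
                    citations.add(current);
                }
                lastTag = line.substring(0, 4).trim();
                // Repeated tags are simply joined with a space in this sketch.
                current.merge(lastTag, line.substring(6).trim(), (a, b) -> a + " " + b);
            }
        }
        return citations;
    }

    /** Joins TI and AB; the PMID prefix and "|" separator are assumptions. */
    static String tiab(Map<String, String> citation, boolean withPmid) {
        String text = (citation.getOrDefault("TI", "") + " "
            + citation.getOrDefault("AB", "")).trim();
        return withPmid ? citation.getOrDefault("PMID", "") + "|" + text : text;
    }

    public static void main(String[] args) {
        List<String> lines = List.of( // made-up sample input
            "PMID- 2",
            "TI  - Second title.",
            "AB  - An abstract that wraps",
            "      onto a second line.",
            "",
            "PMID- 1",
            "TI  - First title.");
        List<Map<String, String>> citations = parse(lines);
        // Numeric sort by PMID, mirroring what the -s flag is described as doing.
        citations.sort(Comparator.comparingLong(
            (Map<String, String> c) -> Long.parseLong(c.get("PMID"))));
        for (Map<String, String> c : citations) {
            System.out.println(tiab(c, true));
        }
        // Prints:
        // 1|First title.
        // 2|Second title. An abstract that wraps onto a second line.
    }
}
```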

  • Sample outputs:
    • 9801.2004.TI
    • 9801.2004.AB
    • 9801.2004.TIAB
    • 9801.2004.MH
    • 9801.2004.TIABMH