MEDLINE Tokenizer
- Description:
Reads a MEDLINE file and retrieves the specified fields. The output is used as the input for JDI (both phrases and MeSH terms).
- Usage:
> mlt -h
Synopsis:
Mlt [options]
Description:
Mlt is a program to tokenize MEDLINE citations by specifying field tags
Options:
-ci Show configuration information
-h Print program help information (this is it)
-i:STR Specify input file (must specify)
-pmid Preserve PMID in the first field
-s Sort output by PMID
-o:STR Specify output file (must specify)
-t:STR Specify MEDLINE field tag:TI|AB|TIAB|MHs|TIABMHs|ALL|S_ALL (must specify)
or any MEDLINE field tag
-v Print the current version of Mlt
-x:STR Specify an alternate configuration file
Sample Inputs:
- 9801.2004.baseline.sorted
- 9801.2005.baseline.sorted
Algorithm:
- Read in file and save MEDLINE records into Java objects, CitationObjs.
- Sort citations by PMID if the -s sorting flag is chosen
- Print out field data by specified field tags:
- MHs: starred MeSH terms (main heading, MH, and subheading, SH, are separated by '|')
- TIABMHs: combination of TI, AB, and MHs
- ALL: the original format
- TIAB: combination of TI and AB
- Field tag: any legal field tag in MEDLINE, such as TI, AB, etc.
Please refer to the MEDLINE field tags for details.
- Re-format MHs:
- Read in the MEDLINE file and generate MH (MeSH) entries
- Handle MH entries that span multiple lines
- Filter out MH entries without a star (*)
- Tokenize the MeSH main heading
- Tokenize the MeSH subheadings and keep only those with *
=> Indexing rules do not allow * on both MH and SH. However, our code is able to handle this situation
- The MeSH main heading is always unique
- Remove duplicate MeSH subheadings (a subheading may be duplicated)
- Each subheading is changed to its abbreviated form
- Print out the main heading and subheadings, using "|" as the separator
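The MH re-formatting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mlt's Java code: the function name reformat_mh and the table SUBHEADING_ABBREV are hypothetical, the table holds just a small subset of the MeSH subheading abbreviations, the input is one MH value with continuation lines already joined, and placing all starred subheadings on one line after the main heading is a guess at the output layout.

```python
# Illustrative subset of MeSH subheading abbreviations (assumption: the
# real tool uses the complete list).
SUBHEADING_ABBREV = {
    "genetics": "GE",
    "therapy": "TH",
    "drug therapy": "DT",
    "metabolism": "ME",
}


def reformat_mh(mh_value):
    """Return 'Main|SH1|SH2' for a starred MH value, or None if no star.

    mh_value is the text after 'MH  - ', e.g. 'Neoplasms/*genetics/therapy'.
    Assumes '/' is used only to separate the main heading and subheadings.
    """
    # Filter out MH entries that carry no star at all.
    if "*" not in mh_value:
        return None

    parts = [p.strip() for p in mh_value.split("/")]
    main = parts[0].lstrip("*")  # main heading, star removed

    starred = []
    for sub in parts[1:]:
        if not sub.startswith("*"):
            continue  # keep only starred subheadings
        name = sub[1:]
        # Map to the abbreviation; fall back to the name if it is unknown.
        abbrev = SUBHEADING_ABBREV.get(name.lower(), name)
        if abbrev not in starred:  # drop duplicated subheadings
            starred.append(abbrev)

    return "|".join([main] + starred)
```

With this sketch, an entry starred on both the main heading and a subheading is still accepted, which mirrors the note above about handling that case even though indexing rules do not allow it.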
Sample commands:
> mlt -t:TIAB -i:9801.2004.baseline.sorted -o:9801.2004.TIAB
=> Read in file 9801.2004.baseline.sorted, retrieve the TI and AB fields, and write the results to file 9801.2004.TIAB
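As a rough picture of what such a run does, the Python sketch below parses MEDLINE-format text and emits the combined TI and AB text per citation. It is a minimal illustration, not Mlt's Java implementation: the names parse_medline and tiab_lines are hypothetical, and the output layout (one line per citation, with the PMID in the first field followed by "|") is an assumption, since the exact layout is not specified here. It relies only on the standard MEDLINE layout: a tag, then "- ", then the value, with indented continuation lines and a blank line between records.

```python
import re

# A field line looks like 'TI  - Some title' or 'PMID- 12345'.
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")

SAMPLE = """PMID- 2
TI  - Second title
      continued here.
AB  - Abstract two.

PMID- 1
TI  - First title.
AB  - Abstract one.
"""


def parse_medline(text):
    """Parse MEDLINE text into a list of {tag: [values]} records."""
    records, current, last_tag = [], None, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_tag = None, None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.group(1), match.group(2).strip()
            if current is None:
                current = {}
            current.setdefault(tag, []).append(value)
            last_tag = tag
        elif line.startswith(" ") and current is not None and last_tag:
            # Indented line: continuation of the previous field value.
            current[last_tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


def tiab_lines(records, sort_by_pmid=False, keep_pmid=False):
    """Return one TI+AB string per record (layout is an assumption)."""
    if sort_by_pmid:  # analogous to the -s flag
        records = sorted(records, key=lambda r: int(r["PMID"][0]))
    lines = []
    for rec in records:
        text = " ".join(rec.get("TI", []) + rec.get("AB", []))
        # Analogous to -pmid: keep the PMID in the first field.
        lines.append(f'{rec["PMID"][0]}|{text}' if keep_pmid else text)
    return lines
```

Calling tiab_lines(parse_medline(SAMPLE), sort_by_pmid=True, keep_pmid=True) returns the two citations ordered by PMID, each as a single line.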
Sample outputs:
- 9801.2004.TI
- 9801.2004.AB
- 9801.2004.TIAB
- 9801.2004.MH
- 9801.2004.TIABMH