PreProcess - JDI, Phase II
JDI is a novel approach to fully automated indexing based on NLM's practice of maintaining a subject index to journal titles using a set of 122 MeSH terms, known as JDs (journal descriptors), corresponding to biomedical specialties. For example, the Journal of Pediatric Surgery is indexed by the JDs Pediatrics and Surgery.
The JDI system associates JDs with words in titles and abstracts in a training set of about 1.3 million MEDLINE records from approximately 4000 MEDLINE journals. Each record "inherits" the JDs of the journal in which it appears. A word in the training set can then be described by a list of JDs ranked according to the number of co-occurrences between the word and the JDs. Input text is indexed by averaging the word-JD co-occurrences over the words in the text that are also in the training set, then ranking the JDs in decreasing order of these averages. For example, JDI of the phrase "appendectomy in children" would result in Surgery and Pediatrics as the top two JDs indexing this text. Normally, JDI is used for indexing documents that are MEDLINE citations (titles and abstracts of journal articles).
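The word-JD association and averaging steps described above can be illustrated with a small sketch. This is not the JDI implementation: the function names, the four toy records, and the choice to scale each co-occurrence count by the number of records containing the word are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (title text, JDs inherited from the journal).
TOY_RECORDS = [
    ("appendectomy outcomes", ["Surgery"]),
    ("appendectomy in children", ["Pediatrics", "Surgery"]),
    ("asthma in children", ["Pediatrics"]),
    ("asthma inhaler therapy", ["Pulmonary Medicine"]),
]


def train(records):
    """Build a word -> {JD: score} table from (text, JDs) records.

    Assumption: the score is the fraction of records containing the word
    that carry the JD, so frequent and rare words are comparable.
    """
    word_docs = Counter()          # number of records containing each word
    cooc = defaultdict(Counter)    # word -> JD -> co-occurrence count
    for text, jds in records:
        for word in set(text.lower().split()):
            word_docs[word] += 1
            for jd in jds:
                cooc[word][jd] += 1
    return {
        word: {jd: n / word_docs[word] for jd, n in jd_counts.items()}
        for word, jd_counts in cooc.items()
    }


def index_text(text, word_jd):
    """Rank JDs for a text by averaging word-JD scores over known words."""
    known = [w for w in text.lower().split() if w in word_jd]
    if not known:
        return []                  # no word of the text is in the training set
    totals = Counter()
    for word in known:
        for jd, score in word_jd[word].items():
            totals[jd] += score
    averaged = ((jd, total / len(known)) for jd, total in totals.items())
    # Decreasing score; ties broken alphabetically for a stable order.
    return sorted(averaged, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    model = train(TOY_RECORDS)
    for jd, score in index_text("appendectomy in children", model):
        print(f"{jd}\t{score:.3f}")
```

On this toy data, Pediatrics and Surgery are the two highest-ranked JDs for "appendectomy in children". Words absent from the training set are ignored, mirroring the restriction to words "that are also in the training set".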
This phase uses both Lisp files and files from MEDLINE and the Metathesaurus to generate the JDI input files in Java format. The resulting set of data is tested by comparing all files against the Lisp files. In addition, the result of file.9801 is verified. This set of data is used in tc2007.