Text Categorization

PreProcess - STI, Phase III

This page describes the automatic pre-process tasks that generate the input files for STI (Semantic Type Indexing). The pre-process for STI has three phases:

Phase I:
Generate all files in Java input format from the Lisp files. This data set was tested by comparing it against all Lisp files and the result of file.9801, and is used in tc2006.
Phase II:
Use a Java program to generate the files from scratch (Meta-thesaurus, etc.). A newly defined and refined algorithm is used to generate the ST "documents". This data set was tested by comparing its similarity to the final files of phase I, and is used in tc2008.
Phase III:
A modified version of phase II that also uses a Java program to generate the files from scratch (Meta-thesaurus, etc.). A newly refined algorithm that uses frequency, St-Groups, and the STRI filter generates the ST "documents". This data set is tested by running it through NLM's WSD collection test.

The detailed procedure of the phase III approach is shown in the following diagram and described below:

Top Directory ${TC_PRE_2008}/data/${YEAR}/Sti/
- ${TC_PRE_2008}: the version of the pre-process software
- ${YEAR}: the version of the TC release, tc${YEAR}
Input files
Required files to generate training set
- MRSTY (Semantic Types, Meta-Thesaurus release, from ash:/u03/umls/Releases/2008AB/Full/ORF/META/MRSTY)
- MRCONSO.RRF (Concept & Source, Meta-Thesaurus release, from ash:/u03/umls/Releases/2008AB/Full/RRF/META/MRCONSO.RRF)
- SRDEF.txt (ST Abbreviations, from Semantic Network - SRDEF.txt)
- stGroups.txt (ST groups, from Semantic Network - SemGroups.txt)
Semantic Types
Semantic Types Groups
ST Documents
St-Jds Table
Word-St Table

The STI training set is processed in the following steps to generate the Word-St table.

Generate 1st run stDocuments
- Based on the algorithm developed in phase II to get all words associated with each ST
- Uses word frequency instead of unique words
- Apply JDI on 1st run stDocuments to generate St-Jd table
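
As an illustration of the frequency-based approach, here is a minimal Python sketch (not the Java program used by this pre-process). It assumes the input has already been reduced to (ST abbreviation, term string) pairs. The function name build_st_documents and the lowercase, whitespace-based tokenization are hypothetical choices made for this example only.

```python
from collections import Counter, defaultdict


def build_st_documents(st_terms):
    """Build one word-frequency "document" per semantic type (ST).

    st_terms: iterable of (st, term) pairs, e.g. ("dsyn", "Heart attack").
    Returns a dict mapping each ST to a Counter of word frequencies.
    """
    docs = defaultdict(Counter)
    for st, term in st_terms:
        # Assumption: lowercase and split on whitespace; the real
        # tokenization used by the pre-process is not described here.
        docs[st].update(term.lower().split())
    return dict(docs)


pairs = [
    ("dsyn", "Heart attack"),
    ("dsyn", "Heart failure"),
    ("bpoc", "Heart"),
]
st_documents = build_st_documents(pairs)
```

Because every occurrence is counted, "heart" carries more weight in the dsyn document than words that appear only once, which a unique-word list would not capture.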
Separate 1st run stDocuments by StGroup
- 1 stGroup: for words associated only with STs that belong to a single stGroup
- Multi stGroups: for words associated with STs that belong to multiple stGroups
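
The split can be sketched in Python as follows. This is an illustrative sketch, not the Java implementation: it assumes stDocuments are held as a mapping from ST to word frequencies and that stGroups.txt has been loaded into an ST-to-group mapping. The names split_by_st_group and st_to_group are hypothetical.

```python
def split_by_st_group(st_documents, st_to_group):
    """Split stDocuments into 1-stGroup and multi-stGroup parts.

    st_documents: {st: {word: frequency}}
    st_to_group:  {st: stGroup}, e.g. loaded from stGroups.txt
    Returns (single, multi), both shaped like st_documents.
    """
    # Collect the set of stGroups each word is linked to through its STs.
    word_groups = {}
    for st, words in st_documents.items():
        for word in words:
            word_groups.setdefault(word, set()).add(st_to_group[st])

    single, multi = {}, {}
    for st, words in st_documents.items():
        for word, freq in words.items():
            target = single if len(word_groups[word]) == 1 else multi
            target.setdefault(st, {})[word] = freq
    return single, multi


st_documents = {
    "dsyn": {"heart": 2, "failure": 1},
    "sosy": {"failure": 1},
    "bpoc": {"heart": 1},
}
st_to_group = {"dsyn": "DISO", "sosy": "DISO", "bpoc": "ANAT"}
single, multi = split_by_st_group(st_documents, st_to_group)
```

In this example "failure" only occurs under STs of the DISO group, so it lands in the 1-stGroup set, while "heart" spans DISO and ANAT and lands in the multi-stGroup set.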
Refine stDocuments by STRI filter
- The 1st run St-Jd table is used in STRI
- In our study, the following criteria gave the best results:
  - 1 stGroup: StdDev and Top 15 rank
  - Multi stGroups: Top 3 rank
- Combine the refined stDocuments of 1-StGroup and multi-StGroup
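
The rank-based part of the filter and the final combination might look like the Python sketch below. It rests on several assumptions that the text above does not confirm: that STRI yields a score per (word, ST) pair, that a word is kept in an ST document only when that ST ranks within the top N for the word, and that the two refined sets are merged per ST. The StdDev criterion is not sketched because its definition is not given here. The names filter_by_rank and combine are hypothetical.

```python
def filter_by_rank(st_documents, word_st_scores, top_n):
    """Keep a word in an ST document only if that ST is in the word's top N.

    st_documents:   {st: {word: frequency}}
    word_st_scores: {word: {st: score}}, assumed to come from STRI
    """
    refined = {}
    for st, words in st_documents.items():
        for word, freq in words.items():
            scores = word_st_scores.get(word, {})
            # Rank STs for this word from highest to lowest score.
            ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
            if st in ranked:
                refined.setdefault(st, {})[word] = freq
    return refined


def combine(*doc_sets):
    """Merge several {st: {word: frequency}} mappings into one."""
    combined = {}
    for docs in doc_sets:
        for st, words in docs.items():
            combined.setdefault(st, {}).update(words)
    return combined


single = {"dsyn": {"failure": 1}}
multi = {"dsyn": {"heart": 2}, "bpoc": {"heart": 1}}
scores = {
    "failure": {"dsyn": 0.8},
    "heart": {"bpoc": 0.9, "dsyn": 0.4},
}
refined_single = filter_by_rank(single, scores, top_n=15)
refined_multi = filter_by_rank(multi, scores, top_n=1)  # 3 in the text; 1 here to show the effect
refined = combine(refined_single, refined_multi)
```

With only two candidate STs in the toy data, top_n is set to 1 for the multi-stGroup set so that the filter visibly drops "heart" from dsyn.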
Generate the Word-St table
- Apply JDI on refined stDocuments to generate St-Jd table
- Get Word-Jd-Wc-Dc table from JDI
- Calculate the similarity (cosine coefficient) between the St-Jd table and the Word-Jd-Wc-Dc table to generate the Word-St-Wc-Dc table
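
The similarity step can be illustrated with a short Python sketch of the cosine coefficient. It assumes each row of the St-Jd table and of the Word-Jd-Wc-Dc table can be read as a sparse vector of JD scores. The Wc and Dc columns are not modeled, and the names cosine and word_st_table, along with the sample JD labels and values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two sparse vectors given as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # avoid division by zero for empty vectors
    return dot / (norm_u * norm_v)


def word_st_table(word_jd, st_jd):
    """Score every word against every ST by comparing their JD vectors."""
    return {
        word: {st: cosine(word_vec, st_vec) for st, st_vec in st_jd.items()}
        for word, word_vec in word_jd.items()
    }


word_jd = {"heart": {"Cardiology": 1.0, "Neurology": 0.0}}
st_jd = {
    "bpoc": {"Cardiology": 2.0, "Neurology": 0.0},
    "neop": {"Cardiology": 0.0, "Neurology": 1.0},
}
table = word_st_table(word_jd, st_jd)
```

The cosine coefficient depends only on the direction of the two vectors, so a word and an ST with proportional JD scores receive the maximum similarity regardless of their magnitudes.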