Text Categorization

PreProcess - STI, Phase I

This page describes the pre-processing tasks that generate the input files for the ST index. STI uses the set of 135 Semantic Types (STs) from the Semantic Network in NLM's UMLS (Unified Medical Language System). Concepts in the UMLS Metathesaurus are assigned one or more STs, which semantically characterize those concepts. For example, the concept Aspirin is assigned the STs Pharmacologic Substance and Organic Chemical.

STI (Semantic Type Indexing) uses the JDI (Journal Descriptor Indexing) methodology as its basis. It calculates the average ST scores for a phrase from the Word-ST table, then prints the ST ranks and ST scores in decreasing order of ST score.
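
The averaging and ranking described above can be sketched in Python. This is only an illustration, not the tool's actual implementation: the function name `rank_semantic_types`, the nested-dictionary layout of the Word-ST table, the whitespace tokenization, and the choice to skip words missing from the table are all assumptions, and the scores in the usage example are invented.

```python
from collections import defaultdict


def rank_semantic_types(phrase, word_st):
    """Average the ST scores of a phrase's words and rank STs by score.

    word_st is a hypothetical in-memory Word-ST table:
    {word: {semantic_type: score}}.
    Returns a list of (rank, semantic_type, average_score) tuples.
    """
    # Assumption: lowercase, split on whitespace, ignore unknown words.
    words = [w for w in phrase.lower().split() if w in word_st]
    if not words:
        return []

    # Sum each ST's score over all known words in the phrase.
    totals = defaultdict(float)
    for word in words:
        for st, score in word_st[word].items():
            totals[st] += score

    # Average over the number of known words.
    averages = {st: total / len(words) for st, total in totals.items()}

    # Decreasing order of score; ties broken alphabetically by ST name.
    ranked = sorted(averages.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, st, score) for rank, (st, score) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Invented scores, for illustration only.
    table = {
        "aspirin": {"Pharmacologic Substance": 0.8, "Organic Chemical": 0.6},
        "tablet": {"Pharmacologic Substance": 0.4, "Organic Chemical": 0.2},
    }
    for rank, st, score in rank_semantic_types("Aspirin tablet", table):
        print(f"{rank}|{st}|{score:.4f}")
```

Skipping unknown words keeps them from diluting the average; a real implementation may handle them differently.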

Three steps are applied to the STI training set to generate the Word-ST table:

  • Generate ST "documents" (all words associated with ST)
  • Apply JDI on ST "documents" to generate ST-JD table
  • Calculate the similarity (cosine coefficient) between the JDI of each ST "document" (ST-JD) and the JDI of each individual training set word (Word-JD) to generate the Word-ST table
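
The third step can be sketched as follows, assuming the ST-JD and Word-JD tables are already available as sparse score vectors keyed by journal descriptor. The first two steps are not shown because they depend on the JDI tool itself. The function names `cosine` and `build_word_st`, the dictionary layout, and the descriptor names in the usage example are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine coefficient of two sparse vectors given as {key: value} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Assumption: an all-zero vector is similar to nothing.
    return dot / (norm_u * norm_v)


def build_word_st(word_jd, st_jd):
    """Build a Word-ST table from Word-JD and ST-JD tables.

    word_jd: {word: {journal_descriptor: score}}
    st_jd:   {semantic_type: {journal_descriptor: score}}
    Returns  {word: {semantic_type: cosine_similarity}}.
    """
    return {
        word: {st: cosine(word_vec, st_vec) for st, st_vec in st_jd.items()}
        for word, word_vec in word_jd.items()
    }


if __name__ == "__main__":
    # Invented vectors, for illustration only.
    word_jd = {"aspirin": {"Pharmacology": 1.0, "Chemistry": 0.0}}
    st_jd = {
        "Pharmacologic Substance": {"Pharmacology": 2.0},
        "Organic Chemical": {"Chemistry": 3.0},
    }
    print(build_word_st(word_jd, st_jd))
```

Because the cosine coefficient normalizes by vector length, an ST "document" built from many words is not favored over one built from few.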

Independent files (reformatted from Lisp files)