Text Categorization

Semantic Type Indexing

STI (Semantic Type Indexing) uses the JDI methodology as the basis for calculating average ST (Semantic Type) scores from the Word-ST table. STs are a set of 135 categories in the Semantic Network of NLM's Unified Medical Language System (UMLS). Each concept in the UMLS Metathesaurus is assigned one or more STs, and each assignment forms an "isa" link from the concept to the ST. For example, the Metathesaurus concept Aspirin is assigned the STs Pharmacologic Substance and Organic Chemical. The set of UMLS Metathesaurus concepts assigned to an ST can be regarded as an "ST-document"; in other words, each ST-document is composed of the UMLS Metathesaurus strings belonging to that ST. Word-ST tables are generated by the following steps:

  • Generate ST-documents from the UMLS Metathesaurus (ST Concepts|Words)
  • Use JDI to index the ST-documents (ST|JD|Wc|Dc)
  • Apply the cosine coefficient to the JDI of the ST-documents (ST|JD|Wc|Dc) and the JDI of individual training set words (Word|JD|Wc|Dc) to obtain Word|ST|Wc|Dc
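
The last step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STI implementation: the JD names and scores are invented, each vector carries a single score per JD (the separate Wc and Dc columns are not modeled), and the helper names `cosine` and `build_word_st_table` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(jd, 0.0) for jd, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical JD score vectors; a real run would take them from the
# JDI output for ST-documents (ST|JD|...) and for words (Word|JD|...).
st_jd = {
    "Pharmacologic Substance": {"Pharmacology": 0.8, "Chemistry": 0.3, "Cardiology": 0.1},
    "Organic Chemical": {"Pharmacology": 0.4, "Chemistry": 0.9},
}
word_jd = {
    "aspirin": {"Pharmacology": 0.7, "Cardiology": 0.4, "Chemistry": 0.2},
}


def build_word_st_table(word_jd, st_jd):
    """Score every word against every ST by comparing their JD vectors."""
    return {
        word: {st: cosine(word_vec, st_vec) for st, st_vec in st_jd.items()}
        for word, word_vec in word_jd.items()
    }


word_st = build_word_st_table(word_jd, st_jd)
```

With these invented numbers, "aspirin" scores higher against Pharmacologic Substance than against Organic Chemical, because its JD vector points in a direction closer to that ST-document's vector.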

The STI program, along with the MEDLINE Tokenizer, is used to index MEDLINE records on:

  • Text: phrases, titles, abstracts, or combinations of titles and abstracts
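
As a rough illustration of how average ST scores might be computed for a piece of text, the sketch below looks up each word in a small Word-ST table and averages the scores per ST. Everything here is an assumption made for the example: the table values are invented, the regular-expression tokenizer is a simple stand-in for the MEDLINE Tokenizer, words missing from the table are skipped, and an ST absent from a word's row counts as zero for that word. The actual STI program may differ on each of these points.

```python
import re

# Hypothetical Word-ST table (word -> ST -> score); values are invented.
WORD_ST = {
    "aspirin": {"Pharmacologic Substance": 0.9, "Organic Chemical": 0.6},
    "headache": {"Pharmacologic Substance": 0.1, "Sign or Symptom": 0.8},
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sti_scores(text, word_st):
    """Average the ST scores of the words in `text` found in `word_st`."""
    known = [tok for tok in tokenize(text) if tok in word_st]
    if not known:
        return {}
    totals = {}
    for tok in known:
        for st, score in word_st[tok].items():
            totals[st] = totals.get(st, 0.0) + score
    # Divide by the number of known words, so an ST missing from a
    # word's row contributes zero for that word.
    return {st: total / len(known) for st, total in totals.items()}


def rank(scores):
    """Return (ST, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank(sti_scores("Aspirin for headache.", WORD_ST))
```

The same function can be applied to a phrase, a title, an abstract, or a title concatenated with its abstract, since it only depends on the words in the input string.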

I. Java Software Components:

II. Programs: