
The SPECIALIST Lexicon

CSpell - Consumer Medical Terms from UMLS

I. Pre-Process - Source: CSpell Dictionary, Medical Terms

Medical terms that are not in the SPECIALIST Lexicon were collected and added to the CSpell dictionary. These terms are also used as a source for the Lexicon build. Please see CSpell Dictionary - Medical Terms for details. The basic algorithm is described as follows:

  • Retrieve terms from UMLS (MRCONSO.RRF) that are English preferred terms
  • Keep terms that match 34 semantic types from 5 categories (problems, interventions, drugs, anatomy, population)
  • Lowercase the terms
  • Combine them with static data from gopher and the problem list

  • Retrieve unigrams
  • Convert to coreTerm
  • Filter out digits, punctuation, numbers, units, and measurements
  • Filter out terms already in the Lexicon

  • Med.cm-l.dic
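
The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the CSpell implementation: the function names, the `UNITS` stop list, and the punctuation-stripping stand-in for coreTerm are all assumptions made for the example, and the inputs are plain Python collections rather than MRCONSO.RRF.

```python
import string

# Hypothetical unit/measurement stop list; the real list is not given in the source.
UNITS = {"mg", "ml", "kg", "cm", "mm"}


def to_unigrams(term):
    """Lowercase a term and split it into whitespace-delimited single words."""
    return term.lower().split()


def core_term(word):
    """Strip leading and trailing punctuation (a simplified stand-in for coreTerm)."""
    return word.strip(string.punctuation)


def is_candidate(word, lexicon):
    """Keep words with no digits or punctuation that are neither units nor in the Lexicon."""
    if not word or word in UNITS:
        return False
    if any(ch.isdigit() or ch in string.punctuation for ch in word):
        return False
    return word not in lexicon


def med_terms(umls_terms, lexicon):
    """Return the sorted set of new unigrams found in the given terms."""
    found = set()
    for term in umls_terms:
        for word in to_unigrams(term):
            core = core_term(word)
            if is_candidate(core, lexicon):
                found.add(core)
    return sorted(found)
```

Calling `med_terms(["Heart Attack (MI)", "Aspirin 81 mg"], {"heart", "attack"})` would keep `aspirin` and `mi` and drop the number, the unit, and the words already in the Lexicon.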

II. Process - Generate LMW candidates

  • Input Files:
    • Med.cm-l.dic (source of terms)
    • noCui.data.all (to assign type if no CUI)
    • umlsDicBySt.data.all (to retrieve CUI)
    • nGram.2017.noPipe.core.lc (to retrieve frequency)
  • Run Program:
    • shell>cd ${MULTIWORDS}/bin/20.CSpellMedTerms
      1
      2
  • Algorithm:
    • Get unigrams from the sources, with their frequency
    • Associate terms, with their frequency, to each unigram
    • Assign a CUI or source (CUI_expo or CUI_prob)

    • Re-arrange by grouping singulars and plurals together, then by frequency (so a linguist can tag forms of the same term at the same time)
  • Output Files:
    • ${MULTIWORDS}/data/2017/outData/20.CSpellMedTerms/Cand_List
    • Format:
      Element Word | Frequency of Element Word | Term | Frequency of Term | CUI/Sources*

      * Field 5: CUI, CUI_expo, CUI_prob

    • cCandidates.data
    • cCandidates.data.gbp
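
One way to read the candidate-generation algorithm is sketched below. It is an interpretation under stated assumptions, not the actual program: the `Candidate` tuple, the function names, the dictionary inputs (standing in for the frequency, CUI, and no-CUI files), the naive final-"s" rule for pairing singulars with plurals, and the choice to link a term to a unigram when the term contains it are all hypothetical.

```python
from collections import namedtuple

# Five fields, mirroring the output format described above.
Candidate = namedtuple("Candidate", "word word_freq term term_freq source")


def assign_source(term, cui_by_term, type_by_term):
    """Return the CUI when one is known, otherwise the fallback type (or "")."""
    if term in cui_by_term:
        return cui_by_term[term]
    return type_by_term.get(term, "")


def singular_key(word):
    """Naive grouping key: drop a final 's' (assumption, not a real lemmatizer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_candidates(words, terms, freq, cui_by_term, type_by_term):
    """Pair each unigram with the terms containing it, then group and sort."""
    rows = []
    for word in words:
        for term in terms:
            if word in term.split():
                source = assign_source(term, cui_by_term, type_by_term)
                rows.append(Candidate(word, freq.get(word, 0),
                                      term, freq.get(term, 0), source))
    # Highest frequency seen within each singular/plural group.
    group_freq = {}
    for row in rows:
        key = singular_key(row.word)
        group_freq[key] = max(group_freq.get(key, 0), row.word_freq)
    # Keep each group contiguous, ordering groups by descending frequency.
    rows.sort(key=lambda r: (-group_freq[singular_key(r.word)],
                             singular_key(r.word), r.word, -r.term_freq))
    return rows
```

Sorting on the group key before the word itself is what keeps "stent" and "stents" on adjacent rows even when their individual frequencies differ.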

III. Post-Process

  • Get the unigrams that have a frequency greater than 1500 (the same threshold as used with MEDLINE)
  • Get the unigrams (field 1), then use option 10 in 12.CandidateList to auto-tag them
  • Get the multiwords (field 3), then use option 10 in 12.CandidateList to auto-tag them

  • Use option 3 in 12.CandidateList to add invalid LMWs to the invalidBaseLmw.data file.
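
The selection step before tagging might look like the sketch below. It assumes the candidate file holds five fields per line separated by `|`; the separator, the function names, and the rule that a multiword is a term containing a space are assumptions for illustration. Only the 1500 threshold and the field positions come from the text.

```python
MIN_FREQ = 1500  # threshold stated in the text


def parse_line(line, sep="|"):
    """Split one candidate line into its five fields (the separator is an assumption)."""
    word, word_freq, term, term_freq, source = line.rstrip("\n").split(sep)
    return word, int(word_freq), term, int(term_freq), source


def select_for_tagging(lines, min_freq=MIN_FREQ):
    """Return (unigrams, multiwords) from rows whose unigram frequency exceeds min_freq."""
    unigrams, multiwords = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, word_freq, term, _term_freq, _source = parse_line(line)
        if word_freq <= min_freq:
            continue
        if word not in unigrams:
            unigrams.append(word)  # field 1
        if " " in term and term not in multiwords:
            multiwords.append(term)  # field 3, multiword terms only
    return unigrams, multiwords
```

The two returned lists correspond to the inputs of the two auto-tagging passes; the tagging itself is done by 12.CandidateList and is not modeled here.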

IV. Results

Year | Min. frequency | Type    | Total Candidates | Valid LMW | Invalid LMW
2017 | 1500           | Unigram | 125              | 87        | 38
     |                | Terms   | 1233             | 37        | 1196
     |                | Total   | 1358             | 124       | 1234
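
As a quick sanity check on the counts in the results table, the short sketch below confirms that each row splits into valid plus invalid LMWs and that the Unigram and Terms rows add up to the Total row. The `RESULTS` dictionary and the helper names are hypothetical; the numbers are copied from the table.

```python
# (total candidates, valid LMW, invalid LMW) per row, as reported for 2017.
RESULTS = {
    "Unigram": (125, 87, 38),
    "Terms": (1233, 37, 1196),
    "Total": (1358, 124, 1234),
}


def check_results(results):
    """Return True when every row and the column totals are internally consistent."""
    for total, valid, invalid in results.values():
        if valid + invalid != total:
            return False
    summed = tuple(u + t for u, t in zip(results["Unigram"], results["Terms"]))
    return summed == results["Total"]


def valid_rate(row):
    """Fraction of candidates in a row that were tagged as valid LMWs."""
    total, valid, _invalid = row
    return valid / total
```

Comparing `valid_rate(RESULTS["Unigram"])` with `valid_rate(RESULTS["Terms"])` shows that unigram candidates were accepted far more often than multiword terms in this run.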