
CSpell

Dictionary in Ensemble

I. Introduction

The dictionary (eng_medical.dic) in the Ensemble method includes:

  • General English (eng_com.dic)
  • Medical Terms (medical.dic - from Halil):
    • medical terms from UMLS (consumer health related medical terms)
      • English strings
      • unigram
      • semantic type
        • Interventions: topp, lbrp, diap
        • Problem: cgab, acab, inpo, patf, dsyn, anab, neop, mobd, sosy, bact
        • Drugs: drdd, clnd, antb, phsu, nsba, strd, vita, aapp
      • lower case

      • File name: ${PRE_PROCESS}/data/Umls/${RELEASE}/outData/umls.dic
    • Some manually added data (Gopher + problem list)

    • Four files from Dina's consumer data:
      • umls_anatomy_merged.txt
      • umls_interventions_merged.txt
      • umls_population_merged.txt
      • umls_problem_merged.txt

    • Retrieved the first field from the four files above
    • Extracted unigrams from the retrieved terms
    • Excluded words in Jazzy (mistakes: but not yse.dic and yze.dic)
  • Total: 450K tokens (only unigrams)
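
The steps listed above (take the first field of each record, split the term into unigrams, lowercase, and drop excluded words) can be sketched as follows. This is only an illustration under stated assumptions: the field separator `|`, the whitespace tokenization, and the names `first_field`, `unigrams`, and `build_unigram_dictionary` are hypothetical and are not taken from the CSpell source.

```python
def first_field(line, sep="|"):
    """Return the first field of a delimited record (separator is an assumption)."""
    return line.split(sep, 1)[0].strip()


def unigrams(term):
    """Split a term on whitespace and lowercase each token (assumed tokenization)."""
    return [token.lower() for token in term.split()]


def build_unigram_dictionary(lines, excluded=frozenset(), sep="|"):
    """Collect unique lowercase unigrams, skipping any word in `excluded`."""
    words = set()
    for line in lines:
        for word in unigrams(first_field(line, sep)):
            if word not in excluded:
                words.add(word)
    return sorted(words)


# Made-up sample records: term in the first field, semantic type in the second.
sample = ["Heart Attack|dsyn", "Aspirin|phsu", "heart failure|dsyn"]
result = build_unigram_dictionary(sample, excluded={"attack"})
print(result)
```

Using a set keeps each unigram once, which matches a dictionary that stores only distinct tokens. The `excluded` argument stands in for the word list that is filtered out in the last step.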

II. Format

word (lower cased unigrams)
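
A minimal sketch of reading a file in this format, assuming one entry per line. The helper names `is_valid_entry` and `load_dictionary` are hypothetical, and the validity rule (non-empty, lowercase, no internal whitespace) is an interpretation of "lower cased unigrams", not a documented CSpell check.

```python
def is_valid_entry(line):
    """True if the line holds a single non-empty lowercase token."""
    word = line.rstrip("\r\n")
    if not word:
        return False
    if any(ch.isspace() for ch in word):
        return False  # contains whitespace, so it is not a unigram
    return word == word.lower()


def load_dictionary(lines):
    """Split lines into a set of valid words and a list of rejected entries."""
    words = set()
    rejected = []
    for line in lines:
        word = line.rstrip("\r\n")
        if is_valid_entry(line):
            words.add(word)
        else:
            rejected.append(word)
    return words, rejected


# Made-up sample lines, including entries that violate the format.
sample_lines = ["aspirin\n", "Heart\n", "heart attack\n", "\n", "dsyn\n"]
words, rejected = load_dictionary(sample_lines)
print(sorted(words), rejected)
```

To read a real file, an open file handle can be passed as `lines`, for example `load_dictionary(open(path, encoding="utf-8"))`, where `path` is whatever location the dictionary is stored at.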

III. Re-generate the Dictionary

We tried to reproduce the dictionary used in the Ensemble method:

  • File name: ${PRE_PROCESS}/data/Baseline/outData/baseline.dic
  • Format: lowercase word
  • Differences:
    • The generated medical dictionary (medDic.data) and Halil's file (medical.dic):
      • Almost identical
      • The only difference is in non-ASCII Unicode characters (caused by the file encoding)
    • Compare
      • A: Halil's Eng_medical.dic
      • B: Eng_com.dic + medical.dic

      The results are: