The SPECIALIST Lexicon

Inclusive Filter: N-Gram with CUIs

I. Introduction

A LexMultiWord (LMW) must have a meaning (concept). If an n-gram maps to a concept, identified by a Concept Unique Identifier (CUI), in the UMLS Metathesaurus, it is a good LMW candidate.

II. Procedure
The following procedure is used to find valid multiwords from n-grams that have Metathesaurus CUIs:

  • Dir: ${MEDLINE_WORDS}/bin
  • Program:

    shell> 09.MatcherCui ${YEAR}

    Step | Description | Inputs | Outputs | Notes
    Get CUI stats on Lexicon
    1. Add CUI to Lexicon (InflVars)
    • shell> cd ${STMT_DIR}/bin
    • smt.AA
    • Time: 1 hr. 10 min.
    • ${IN_DIR}inflVars.data.f1
    • inflVars.data.cui
    • Use smt to get CUIs for all terms from inflVars.data
    2. Analyze and get stats of CUI in Lexicon
    • AnalyzeCuiMapping.java
    • inflVars.data.cui
    • inflVars.data.cui.rpt
    • Get stats for single words and multiwords with CUIs
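The kind of statistics gathered in Step 2 can be sketched as follows. This is a minimal illustration, not AnalyzeCuiMapping.java itself: the `term|cui` line layout, the function name `cui_stats`, and the CUI values are assumptions made for the example.

```python
from collections import Counter


def cui_stats(lines, sep="|"):
    """Count single words and multiwords, with and without a CUI.

    Assumes each line is `term|cui`, where `cui` is empty when no concept
    was found. This layout is hypothetical, not the documented file format.
    """
    stats = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        term, _, cui = line.partition(sep)
        kind = "multiword" if len(term.split()) > 1 else "single"
        has_cui = "with_cui" if cui.strip() else "no_cui"
        stats[(kind, has_cui)] += 1
    return stats


# Placeholder CUIs, not real Metathesaurus assignments.
sample = ["aspirin|C0000001", "heart attack|C0000002", "of the|", "zzz|"]
stats = cui_stats(sample)
for key in sorted(stats):
    print(key, stats[key])
```

A term counts as a multiword here when it contains more than one whitespace-separated token.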
    Get CUI stats on MEDLINE n-gram set
    10. Add CUI to nGram
    • shell> cd ${STMT_DIR}/bin
    • smt.AA
    • Time: 16 hr.
    • distilledNGram.${YEAR}.core.lc
      => run 06.NGramUtil ${YEAR} 11
    • distilledNGram.${YEAR}.core.f2
    • distilledNGram.${YEAR}.core.cui
    • Get terms from distilled nGrams (field 2)
    • Use smt to get CUIs for all terms from nGrams
    11. Filter out nGrams without CUI
    • FilterCuiFromFile.java
    • distilledNGram.${YEAR}.core
    • distilledNGram.${YEAR}.core.cui
    • distilledNGram.${YEAR}.core.cui.out
    • Filter out nGrams without CUIs (including 1,2,3 substitutions)
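A rough sketch of the filtering idea in Step 11 follows. It assumes, for illustration only, that the .cui file holds `term|cui` lines and that the n-gram file is pipe-delimited with the term in field 2 (counting from 0). The function names are hypothetical, and the "1,2,3 substitutions" handling mentioned above is not reproduced.

```python
def load_cui_map(cui_lines, sep="|"):
    """Build a term -> CUI map from hypothetical `term|cui` lines."""
    cui_by_term = {}
    for line in cui_lines:
        term, _, cui = line.rstrip("\n").partition(sep)
        if cui.strip():  # terms with an empty CUI are left out
            cui_by_term[term.lower()] = cui.strip()
    return cui_by_term


def filter_ngrams_with_cui(ngram_lines, cui_by_term, term_field=2, sep="|"):
    """Keep only n-gram lines whose term has a CUI in the map."""
    kept = []
    for line in ngram_lines:
        line = line.rstrip("\n")
        fields = line.split(sep)
        if len(fields) > term_field and fields[term_field].lower() in cui_by_term:
            kept.append(line)
    return kept


# Placeholder data: the counts and the CUI are made up for the example.
cui_lines = ["heart attack|C0000001", "of the|"]
ngram_lines = ["120|450|heart attack", "900|3000|of the"]
kept = filter_ngrams_with_cui(ngram_lines, load_cui_map(cui_lines))
print(kept)
```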
    12. Tag results of Step 11
    • TagCuiTerm.java
    • distilledNGram.${YEAR}.core.cui.out
    • Max. WC (2000000)
    • Min. WC (0)

    • inflVars.data.${YEAR}
    • inflVars.data.current
    • notMwFromCuiTerm.data.${YEAR}
    • notMwFromCuiTerm.data.current
    • distilledNGram.${YEAR}.core.lc.cui.out.stats (the stats between init year and current)
    • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tag.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tbd.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.current.tag.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.current.tbd.${MIN_WC}-${MAX_WC}
    Tag and calculate precision:
    • Send distilledNGram.${YEAR}.core.cui.out.current.tbd.${MIN_WC}-${MAX_WC} to a linguist:
      • Tag yes|no|exp
      • Add valid MWs to the Lexicon
    • Update files from the "yes|no" tag results:
      • Update inflVars.data.current from the Lexicon
      • Update notMwFromCuiTerm.data.current from the no-tags
    • Rerun this step until the current.tbd file has 0 entries
    • Check precision
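One plausible way to compute the precision checked in Step 12 is sketched below. The formula is an assumption: it treats precision as yes / (yes + no) and leaves "exp" tags out of the denominator. The page does not define how "exp" is counted, so the set of counted tags is a parameter.

```python
from collections import Counter


def tag_precision(tags, positive="yes", counted=("yes", "no")):
    """Return the share of `positive` tags among the `counted` tags.

    Returns None when no counted tag is present. Which tags enter the
    denominator is an assumption, not something the procedure specifies.
    """
    counts = Counter(t.strip().lower() for t in tags)
    denom = sum(counts[t] for t in counted)
    if denom == 0:
        return None
    return counts[positive] / denom


tags = ["yes", "yes", "no", "exp"]
print(tag_precision(tags))
```

To count "exp" tags against precision as well, pass `counted=("yes", "no", "exp")`.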
    20. Apply Matcher-Cui on nGram TBD
    • distilledNGram.${YEAR}.core
    Process: Generate multiword candidates from the distilled n-gram set
    30. Proc: Apply the Lexicon filter to the distilled nGram (core)
    • ${N_GRAM}/distilledNGram.${YEAR}.core
    • ${IN_DIR}/inflVars.data
    • 30.disNGram.Core.lexicon.out
    • Use core-term of n-gram
    31. Proc: Apply matcher of Multiword on nGram (core) from results of Step 30
    • 30.disNGram.Core.lexicon.out
    • 31.disNGram.Core.lexicon.multiword.out
      => Core multiwords from n-grams, not in the Lexicon
    • Remove single words
    • This is used for ML models
    32-0. PreProc: Get unique English strings from UMLS (MRCONSO.RRF.ENG)
    • MRCONSO.RRF.ENG
      => link to MRCONSO.RRF.ENG.${YEAR}AA/AB
    • umlsStr.data
    • Preprocess to get English UMLS String for step 32
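The preprocessing in Step 32-0 can be approximated as below. MRCONSO.RRF is pipe-delimited with the language (LAT) in column 1 and the string (STR) in column 14, counting from 0. Lowercasing the strings, the function names, and the sample rows are assumptions for this sketch, and the actual layout of umlsStr.data is not shown on this page.

```python
LAT_FIELD = 1   # language of term, e.g. "ENG"
STR_FIELD = 14  # the string itself


def unique_english_strings(mrconso_lines, lowercase=True):
    """Collect the distinct English strings from MRCONSO.RRF rows."""
    strings = set()
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR_FIELD or fields[LAT_FIELD] != "ENG":
            continue  # malformed row or non-English string
        text = fields[STR_FIELD]
        strings.add(text.lower() if lowercase else text)
    return strings


def make_row(cui, lat, string):
    """Hypothetical helper: build an 18-column MRCONSO-style row for testing."""
    fields = [""] * 18
    fields[0], fields[LAT_FIELD], fields[STR_FIELD] = cui, lat, string
    return "|".join(fields) + "|"


rows = [
    make_row("C0000001", "ENG", "Heart Attack"),
    make_row("C0000001", "ENG", "heart attack"),
    make_row("C0000001", "SPA", "Infarto"),
]
strings = unique_english_strings(rows)
print(sorted(strings))
```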
    32. Proc: Apply matcher of UMLS-Str on distilled nGram (must run 31)
    • 30.disNGram.Core.lexicon.out
    • umlsStr.data
    • 32.disNGram.Core.umlsStr.out
      => n-grams with CUIs (UMLS-String)
    • A simple hashTable lookup to match n-gram to UMLS String
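The hash-table lookup described in Step 32 might look like the following sketch, where a Python set plays the role of the hash table. The pipe-delimited n-gram layout with the term in field 2 (counting from 0), the case-insensitive comparison, and the function name are assumptions.

```python
def match_umls_strings(ngram_lines, umls_strings, term_field=2, sep="|"):
    """Keep n-gram lines whose term is found in the UMLS string set."""
    lookup = {s.lower() for s in umls_strings}  # one-time build, O(1) lookups
    matched = []
    for line in ngram_lines:
        line = line.rstrip("\n")
        fields = line.split(sep)
        if len(fields) > term_field and fields[term_field].lower() in lookup:
            matched.append(line)
    return matched


# Made-up counts and strings, for illustration only.
umls_strings = {"Heart Attack", "aspirin"}
ngram_lines = ["120|450|heart attack", "900|3000|of the", "50|70|aspirin"]
matched = match_umls_strings(ngram_lines, umls_strings)
print(matched)
```

Building the set once and testing membership per n-gram keeps the pass linear in the number of n-grams.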
    33. Proc: Apply matcher of Multiword on nGram (core) from results of Step 32
    • 32.disNGram.Core.umlsStr.out
    • 33.disNGram.Core.multiword.out
      => Core multiwords from n-grams with CUIs
    • Remove single words
    • This is used for ML models
    34. Proc: Apply matcher of EndWord (top 33) on nGram (core)
    • 33.disNGram.Core.multiword.out
    • endWords.top33.data
      => Must run 10.MatchEndWord first, option 1, to get the top endword list
      => Manually create endWords.top${NN}.data
      => link endWords.top.data (used for top endWords)
    • 34.disNGram.Core.endword.out
      => Candidate with CUI and top endwords
    • Use the top endWord for matcher
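The end-word matcher in Step 34 can be illustrated with the sketch below. How 10.MatchEndWord actually derives the top end-word list is not described here, so ranking end words by frequency over a list of terms is an assumption, as are the function names and sample terms.

```python
from collections import Counter


def top_end_words(terms, n):
    """Return the n most frequent final words among the given terms."""
    counts = Counter(t.split()[-1] for t in terms if t.split())
    return {word for word, _ in counts.most_common(n)}


def match_end_word(terms, end_words):
    """Keep terms whose final word is in the end-word set."""
    return [t for t in terms if t.split() and t.split()[-1] in end_words]


terms = [
    "heart disease", "lung disease", "kidney disease",
    "stem cell", "t cell", "of the",
]
top = top_end_words(terms, 2)
candidates = match_end_word(terms, top)
print(sorted(top), candidates)
```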
    Post-Process: Auto remove, tag, and resort to final format
    35. Post-Proc: Automatically remove candidates that are in the Lexicon, and remove/tag candidates that are invalid LMWs based on the previous tags
    • 34.disNGram.Core.endword.out
    • Use 00.CandidateList
      => File updates are required; proceed with Steps 1-3
      • validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort
      • invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no
    • 35.disNGram.Core.endword.out.autoTag
    • 35.disNGram.Core.endword.out.rmYesNo
    • 35.disNGram.Core.endword.out.rmYesTagNo
    • Filter and tag candidates that are in the previous years' lists
    • Make sure to update the data and re-run 00.CandidateList for the latest updates
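The auto-remove and auto-tag logic of Step 35 could be sketched as below. The reading of the output names is an inference from the file names only: rmYesNo drops candidates already in the Lexicon (yes) and those previously tagged invalid (no), while rmYesTagNo drops the former and tags the latter. The `term|no` tag format and the function name are hypothetical.

```python
def auto_tag(candidates, valid_terms, invalid_terms):
    """Split candidates into two lists, mirroring rmYesNo and rmYesTagNo.

    valid_terms:   terms already in the Lexicon (removed from both lists)
    invalid_terms: terms previously tagged "no" (removed or tagged)
    """
    rm_yes_no = []
    rm_yes_tag_no = []
    for term in candidates:
        key = term.lower()
        if key in valid_terms:
            continue  # already in the Lexicon: drop everywhere
        if key in invalid_terms:
            rm_yes_tag_no.append(f"{term}|no")  # keep, but tagged
            continue
        rm_yes_no.append(term)
        rm_yes_tag_no.append(term)
    return rm_yes_no, rm_yes_tag_no


candidates = ["heart disease", "present study", "stem cell niche"]
valid = {"heart disease"}
invalid = {"present study"}
rm_yes_no, rm_yes_tag_no = auto_tag(candidates, valid, invalid)
print(rm_yes_no, rm_yes_tag_no)
```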
    36. Post-Proc: Rearrange (re-sort) canList by grouping singulars/plurals
    • 35.disNGram.Core.endword.out.rmYesNo
    • 35.disNGram.Core.endword.out.rmYesTagNo
    • 36.disNGram.Core.endword.out.rmYesNo.gsp
      => cp to 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}
    • 36.disNGram.Core.endword.out.rmYesTagNo.gsp
      => cp to 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
      => This file is used for annual candidate list
    • Re-sort and group the list so that singular and plural forms are together
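The regrouping in Step 36 can be approximated with a sort key that maps a plural to its singular. The sketch below uses a deliberately naive rule (drop a final "s" from the last word), which is an assumption for illustration. The actual tool may well rely on Lexicon inflection data instead.

```python
def singular_key(term):
    """Naive grouping key: strip a final 's' from the last word.

    This heuristic is an assumption. It misses irregular plurals and
    forms such as 'es' or 'ies'.
    """
    words = term.split()
    if not words:
        return term
    last = words[-1]
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)


def group_singular_plural(terms):
    """Sort terms so that a singular and its plural end up adjacent."""
    return sorted(terms, key=lambda t: (singular_key(t), t))


terms = ["heart attacks", "blood cell", "heart attack", "blood cells"]
grouped = group_singular_plural(terms)
print(grouped)
```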
    Future Usage
    40. PreProc: Get nGram spVar from result of 8.MatcherSpVar
    • medline.2.byM2CES.2.out.30.spVars.2016
    • nGramSpVars.data
    • get the n-grams that match spVar patterns
    41. Proc: Apply matcher of nGram-SpVar on nGram
    • 34.disNGram.Core.endword.out
    • nGramSpVars.data
    • 36.disNGram.Core.spVar
    • Loses recall; not used for now