The SPECIALIST Lexicon

Lexicon Words Stats

I. Introduction

This page describes the programs used to compute statistics on Lexicon words, using MEDLINE as the source of frequency data (word count, WC, and document count, DC).

II. Detailed Process

  • Dir: ${MULTIWORDS}/bin/11.LexWords
  • Programs:
    Each step below lists its description, program, inputs, outputs, and notes.

    MEDLINE Unigram Spectrum Analysis

    Step 1: Group raw unigram by core-term.lc
      • Program: NGramUtil.GrepTermsSort
      • Inputs:
        • ./Medline/unigram.${YEAR}
      • Outputs:
        • ./Medline/unigram.${YEAR}.core.lc
        • ./Medline/unigram.${YEAR}.core.lc.detail
      • Notes: Auto link unigram

    Step 2: Get MEDLINE unigram WC Frequency Spectrum
      • Program: NGramUtil.GetBasicHistogram
      • Inputs:
        • ./Medline/unigram.${YEAR}.core.lc
      • Outputs:
        • ./Medline/unigram.${YEAR}.core.lc.his.csv
      • Notes: Used as input data for Excel diagram
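
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the NGramUtil implementation. It assumes each raw unigram line has the form `DC|WC|term`, approximates a core term by stripping leading and trailing punctuation and lowercasing, and treats the frequency spectrum as the number of distinct core terms at each WC value. All function names are hypothetical.

```python
import string
from collections import Counter


def core_term_lc(term):
    """Approximate core-term.lc (assumption): strip surrounding
    punctuation and whitespace, then lowercase."""
    return term.strip(string.punctuation + string.whitespace).lower()


def group_by_core_term(lines):
    """Sum WC over raw unigrams that share a core term.

    Assumes each line is 'DC|WC|term'; lines with fewer fields are skipped.
    """
    wc_by_core = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue
        _dc, wc, term = parts
        core = core_term_lc(term)
        if core:
            wc_by_core[core] += int(wc)
    return wc_by_core


def wc_spectrum(wc_by_core):
    """Frequency spectrum: WC value -> number of core terms with that WC."""
    return dict(sorted(Counter(wc_by_core.values()).items()))


def spectrum_to_csv(spectrum):
    """Render the spectrum as CSV text with a header row."""
    rows = ["wc,terms"] + [f"{wc},{n}" for wc, n in spectrum.items()]
    return "\n".join(rows)
```

The CSV text returned by `spectrum_to_csv` plays the role of the `.his.csv` file that feeds the Excel diagram; the column names are illustrative only.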
    Lexicon Word Spectrum Analysis

    Step 10: Get Lexicon single word frequency spectrum
      • Program: LexWords.GetLexWordFreSpectrum
      • Parameter: TYPE (0: all, 1: SW, 2: MW)
      • Inputs:
        • ${IN_DIR}inflVars.data
        • ./Medline/unigram.${YEAR}.core.lc
      • Outputs:
        • ./LexSpec/sWord.b.csv (Lexicon words in MEDLINE or not)
        • ./LexSpec/sWord.l.csv (Lexicon words with MEDLINE WC)
        • ./LexSpec/sWord.rpt
        • ./LexSpec/sWord.sum

    Step 11: Group distilled n-gram set by core-term.lc
      • Program: NGramUtil.GroupByCoreTerm
      • Inputs:
        • ${NGRAM_DIR}nGrams/distilledNGram.${YEAR}
      • Outputs:
        • ${NGRAM_DIR}nGrams/distilledNGram.${YEAR}.core.lc
        • ${NGRAM_DIR}nGrams/distilledNGram.${YEAR}.core.lc.detail
      • Notes: Same as step-11 in 06.NGramUtil

    Step 12: Get all words frequency spectrum
      • Program: LexWords.GetLexWordFreSpectrum
      • Parameter: TYPE (0: all, 1: SW, 2: MW)
      • Inputs:
        • ${IN_DIR}inflVars.data
        • ${NGRAM_DIR}nGrams/distilledNGram.${YEAR}.core.lc
      • Outputs:
        • ./LexSpec/aWord.b.csv (Lexicon words in MEDLINE or not)
        • ./LexSpec/aWord.l.csv (Lexicon words with MEDLINE WC)
        • ./LexSpec/aWord.rpt
        • aWord.sum

    Step 13: Get multiwords frequency spectrum
      • Program: LexWords.GetLexWordFreSpectrum
      • Parameter: TYPE (0: all, 1: SW, 2: MW)
      • Inputs:
        • ${IN_DIR}inflVars.data
        • ${NGRAM_DIR}nGrams/distilledNGram.${YEAR}.core.lc
      • Outputs:
        • ./LexSpec/mWord.b.csv (Lexicon words in MEDLINE or not)
        • ./LexSpec/mWord.l.csv (Lexicon words with MEDLINE WC)
        • ./LexSpec/mWord.rpt
        • ./LexSpec/mWord.sum
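
The Lexicon word spectrum steps above can be sketched as follows. This is an illustration under stated assumptions, not the LexWords.GetLexWordFreSpectrum code: it assumes inflVars.data is pipe-delimited with the inflected form in the first field, that a multiword is any term containing a space, and that the core-term file has already been loaded into a mapping from lowercased term to WC. The function names and the summary keys are hypothetical.

```python
def select_lex_words(infl_lines, word_type=0):
    """Collect distinct lowercased Lexicon words from inflVars-style lines.

    word_type follows the TYPE option: 0 = all, 1 = single words (SW),
    2 = multiwords (MW). Assumes the word is the first '|' field.
    """
    words = set()
    for line in infl_lines:
        word = line.split("|", 1)[0].strip().lower()
        if not word:
            continue
        is_multi = " " in word  # assumption: a space marks a multiword
        if (
            word_type == 0
            or (word_type == 1 and not is_multi)
            or (word_type == 2 and is_multi)
        ):
            words.add(word)
    return sorted(words)


def lex_word_coverage(words, wc_by_core):
    """Compare Lexicon words with MEDLINE word counts.

    Returns (binary_rows, wc_rows, summary):
      binary_rows: (word, 1 or 0) for in MEDLINE or not
      wc_rows:     (word, WC), with 0 when the word is absent
      summary:     totals for a short report
    """
    binary_rows = [(w, 1 if w in wc_by_core else 0) for w in words]
    wc_rows = [(w, wc_by_core.get(w, 0)) for w in words]
    found = sum(flag for _, flag in binary_rows)
    summary = {
        "total": len(words),
        "in_medline": found,
        "not_in_medline": len(words) - found,
    }
    return binary_rows, wc_rows, summary
```

In this sketch the binary rows correspond to the `.b.csv` output, the WC rows to the `.l.csv` output, and the summary to the kind of totals a `.sum` file might hold; the actual contents of the `.rpt` and `.sum` files are not specified here.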
    Lexicon Word Histogram Analysis (used in AMIA paper)

    Step 20: Get normTerm.lc from inflVars
      • Program: CandidateUtil.ToCoreTerm
      • Inputs:
        • ${IN_DIR}inflVars.data.f1
      • Outputs:
        • ./LexHist/inflVars.data.f1.core.lc
      • Notes: Get the norm-term.lc from inflVars

    Step 21: Split single words and multiwords from Lexicon inflVars
      • Program: LexWords.SplitSingleMultiWords
      • Inputs:
        • ./LexHist/inflVars.data.f1.core
      • Outputs:
        • inflVars.data.f1.core.mw
        • inflVars.data.f1.core.sw
      • Notes: Same as step-11 in 06.NGramUtil

    Step 22: Add WC to Lexicon single words
      • Program: NGramUtil.AddWcToCoreTerm
      • Inputs:
        • ./LexHist/inflVars.data.f1.core.sw
        • ${NGRAM_DIR}nGrams/nGramSet.${YEAR}.30.core.lc
      • Outputs:
        • ./LexHist/inflVars.data.f1.core.sw.wc

    Step 23: Add WC to Lexicon multiwords
      • Program: NGramUtil.AddWcToCoreTerm
      • Inputs:
        • ./LexHist/inflVars.data.f1.core.mw
        • ${NGRAM_DIR}nGrams/nGramSet.${YEAR}.30.core.lc
      • Outputs:
        • ./LexHist/inflVars.data.f1.core.mw.wc

    Step 24: Get WC histogram on Lexicon single words
      • Program: CandidateUtil.HistogramUtil
      • Inputs:
        • ./LexHist/inflVars.data.f1.core.sw.wc
      • Outputs:
        • ./LexHist/inflVars.data.f1.core.sw.wc.his.minWC-maxWc.secNO.csv
      • Notes: Used as input data for Excel diagram

    Step 25: Get WC histogram on Lexicon multiwords
      • Program: CandidateUtil.HistogramUtil
      • Inputs:
        • ./LexHist/inflVars.data.f1.core.mw.wc
      • Outputs:
        • ./LexHist/inflVars.data.f1.core.mw.wc.his.minWC-maxWc.secNO.csv
      • Notes: Used as input data for Excel diagram
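
The last four steps (adding WC to Lexicon terms, then building a WC histogram) can be sketched as follows. The minimum WC, maximum WC, and number of sections are inferred from the output file name pattern (`his.minWC-maxWc.secNO.csv`), so the equal-width binning here is an assumption, not a description of CandidateUtil.HistogramUtil. The function names are hypothetical.

```python
def add_wc(terms, wc_by_core):
    """Attach a word count to each term; terms absent from the mapping get 0."""
    return [(term, wc_by_core.get(term, 0)) for term in terms]


def wc_histogram(wcs, min_wc, max_wc, sections):
    """Count WC values in `sections` equal-width bins over [min_wc, max_wc].

    Values outside the range are ignored; a value equal to max_wc falls
    into the last bin. Returns (bin_start, bin_end, count) rows.
    """
    if sections < 1 or max_wc <= min_wc:
        raise ValueError("need sections >= 1 and max_wc > min_wc")
    width = (max_wc - min_wc) / sections
    counts = [0] * sections
    for wc in wcs:
        if wc < min_wc or wc > max_wc:
            continue
        index = min(int((wc - min_wc) / width), sections - 1)
        counts[index] += 1
    return [
        (min_wc + i * width, min_wc + (i + 1) * width, counts[i])
        for i in range(sections)
    ]


def histogram_to_csv(rows):
    """Render histogram rows as CSV text with a header row."""
    lines = ["bin_start,bin_end,count"]
    lines += [f"{start},{end},{count}" for start, end, count in rows]
    return "\n".join(lines)
```

Running the sketch once on single words and once on multiwords mirrors steps 24 and 25; word counts often span several orders of magnitude, so logarithmic bins may be a better fit than equal-width ones for some diagrams.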