The SPECIALIST Lexicon

N-gram Utilities

I. Introduction

Several utility programs were developed for processing n-grams. They are used in many processes and are summarized on this page.

II. Detailed Process

  • Dir: ${MULTIWORDS}/bin/06.NGramUtil
  • Programs:
    Each step lists its description, program, inputs, outputs, and notes.
    Step 1: Grep terms (nGrams), then sort
      • Program: NGramUtil.GrepTermsSort
      • Input: nGram.${YEAR}
      • Output: nGram.${YEAR}.term.sort
      • Note: a link to the input file nGram.${YEAR} must be created
    Step 2: Filter pipes (|) from nGrams
      • Program: NGramUtil.FilterPipe
      • Input: nGram.${YEAR}
      • Outputs: nGram.${YEAR}.noPipe, nGram.${YEAR}.pipe
    Step 3: Group nGrams by core-term
      • Program: NGramUtil.GroupByCoreTerm
      • Input: nGram.${YEAR}.noPipe
      • Outputs: nGram.${YEAR}.noPipe.core, nGram.${YEAR}.noPipe.core.detail
      • Note: groups by core-term and also updates the WC
    Step 4: Group nGrams by norm-term
      • Program: NGramUtil.GroupByNormTerm
      • Input: nGram.${YEAR}.noPipe
      • Outputs: nGram.${YEAR}.noPipe.norm, nGram.${YEAR}.noPipe.norm.detail
      • Note: groups by norm-term and also updates the WC
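The pipe filter and the core-term grouping above can be illustrated with a minimal Python sketch. It is not the NGramUtil implementation. It assumes each record has the form DC|WC|TERM (taken here to mean document count, word count, and term), and that a core-term can be approximated by stripping surrounding punctuation and whitespace. The function names `filter_pipe`, `core_term`, and `group_by_core_term`, and the in-memory `detail` mapping, are hypothetical stand-ins for the programs and the `.detail` file.

```python
import string
from collections import defaultdict

FIELD_SEP = "|"


def filter_pipe(lines):
    """Split DC|WC|TERM records into (no_pipe, pipe) lists.

    Assumption: a term that itself contains '|' yields more than two
    separators, so such records are routed to the 'pipe' list.
    """
    no_pipe, pipe = [], []
    for line in lines:
        if line.count(FIELD_SEP) == 2:
            no_pipe.append(line)
        else:
            pipe.append(line)
    return no_pipe, pipe


def core_term(term):
    """Hypothetical stand-in: strip surrounding whitespace and ASCII punctuation."""
    return term.strip(string.punctuation + string.whitespace)


def group_by_core_term(no_pipe_lines):
    """Group records by core-term and sum their WC.

    Returns (core_lines, detail): core_lines are 'WC|core-term' strings,
    detail maps each core-term to the original records it was built from.
    """
    wc_by_core = defaultdict(int)
    detail = defaultdict(list)
    for line in no_pipe_lines:
        _dc, wc, term = line.split(FIELD_SEP)
        key = core_term(term)
        wc_by_core[key] += int(wc)   # the "update the WC" part of the step
        detail[key].append(line)
    core_lines = [f"{wc}{FIELD_SEP}{key}" for key, wc in wc_by_core.items()]
    return core_lines, dict(detail)
```

For example, the records `3|5|heart attack` and `2|4|(heart attack)` would fall into the same group under this approximation, with their word counts added together.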
    Convert from WC|core-term back to DC|WC|TERM:
    Step 5: Sort nGrams by DC|WC|Term
      • Program: NGramUtil.SortNGramsByDcWc
      • Input: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.noPipe.sort.WcDcTerm
      • Note: input is sorted by N, then DC|WC|Term
    Step 6: Convert (ungroup) core-term to nGrams
      • Program: NGramUtil.CoreTermToNGram
      • Inputs: nGram.${YEAR}.noPipe.core, nGram.${YEAR}.noPipe.core.detail
      • Output: nGram.${YEAR}.noPipe.core.ungroup
      • Notes: the result is sorted, same as the results from Step 5; input format (core-term): WC|core-term; output format (core-term): DC|WC|TERM
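The relationship between sorting and ungrouping can be sketched in Python as follows. This is an illustration only, under several assumptions: records are DC|WC|TERM, N is the number of whitespace-separated words in the term, the sort is ascending with DC and WC compared as integers, and the `.detail` file is modeled as an in-memory dictionary from core-term to its original records. The names `sort_key`, `sort_ngrams`, and `ungroup` are hypothetical.

```python
def sort_key(record):
    """Sort key: N (word count of the term), then numeric DC, numeric WC, term.

    Ascending order and the definition of N are assumptions of this sketch.
    """
    dc, wc, term = record.split("|")
    return (len(term.split()), int(dc), int(wc), term)


def sort_ngrams(records):
    """Return DC|WC|TERM records in the assumed N, DC, WC, term order."""
    return sorted(records, key=sort_key)


def ungroup(core_lines, detail):
    """Expand 'WC|core-term' lines back into DC|WC|TERM records.

    detail is a hypothetical mapping from core-term to the original records
    that were grouped under it.
    """
    records = []
    for line in core_lines:
        _wc, core = line.split("|", 1)
        records.extend(detail[core])
    return sort_ngrams(records)
```

Under these assumptions, ungrouping the grouped data returns the same records as sorting the original input, which matches the note that the ungrouped result is the same as the result from Step 5.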
    Convert from WC|core-term.lc back to WC|core-term:
    Step 7: Group nGrams by core-term.lc
      • Program: NGramUtil.GroupByCoreTerm
      • Input: nGram.${YEAR}.noPipe
      • Outputs: nGram.${YEAR}.noPipe.core.lc, nGram.${YEAR}.noPipe.core.lc.detail
      • Note: results are the same because the input is all lowercase
    Step 8: Convert core-term.lc to core-term nGrams
      • Program: NGramUtil.CoreTermLcToCoreTerm
      • Inputs: nGram.${YEAR}.noPipe.core.lc, nGram.${YEAR}.noPipe.core.lc.detail
      • Outputs: nGram.${YEAR}.noPipe.core.lc.core, nGram.${YEAR}.noPipe.core.lc.core.detail
      • Note: results are the same because the input is all lowercase
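A small Python sketch can show why grouping by the lowercased core-term and converting back changes nothing when the input is already lowercase. It is a hedged illustration, not the NGramUtil code: `core_term` is a simplified stand-in that strips surrounding punctuation and whitespace, records are assumed to be DC|WC|TERM, and the `detail` dictionary is a hypothetical substitute for the `.detail` file.

```python
import string
from collections import defaultdict


def core_term(term):
    """Hypothetical stand-in: strip surrounding whitespace and ASCII punctuation."""
    return term.strip(string.punctuation + string.whitespace)


def group_by_core_lc(records):
    """Sum WC per lowercased core-term.

    Returns (lc_totals, detail), where detail maps each lowercased key to the
    WC of every cased core-term that was folded into it.
    """
    lc_totals = defaultdict(int)
    detail = defaultdict(lambda: defaultdict(int))
    for record in records:
        _dc, wc, term = record.split("|")
        core = core_term(term)
        lc_totals[core.lower()] += int(wc)
        detail[core.lower()][core] += int(wc)
    return dict(lc_totals), {key: dict(cased) for key, cased in detail.items()}


def lc_to_core(lc_totals, detail):
    """Expand each lowercased key back into its cased core-terms and their WC."""
    core_totals = {}
    for lc_key in lc_totals:
        core_totals.update(detail[lc_key])
    return core_totals
```

With mixed-case input such as `Heart Attack` and `heart attack.`, the lowercased grouping merges the two and the conversion separates them again. With all-lowercase input, `lc_to_core` returns exactly the totals it was given.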
    Group n-gram set by core-term.lc:
    Step 10: Group nGram set by core-term.lc
      • Program: NGramUtil.CoreTermToNGram
      • Input: ${NGRAM_DIR}nGramSet.${YEAR}.30
      • Outputs: ${NGRAM_DIR}nGramSet.${YEAR}.30.core.lc, ${NGRAM_DIR}nGramSet.${YEAR}.core.lc.detail
    Step 11: Group distilled nGram set by core-term.lc
      • Program: NGramUtil.CoreTermToNGram
      • Input: ${NGRAM_DIR}distilledNGram.${YEAR}
      • Outputs: ${NGRAM_DIR}distilledNGram.${YEAR}.core.lc, ${NGRAM_DIR}distilledNGram.${YEAR}.core.lc.detail