
The SPECIALIST Lexicon

Consumer Health Corpus - The Raw N-gram Set

I. Introduction

The Consumer Health Corpus (also used in CSpell) is the source of the raw n-gram set used for LMW candidate generation.

II. N-gram set Specifications

  • Corpus: Consumer Health Corpus (2017)
  • Method:
  • Max. Character size: 50
  • Min. word count: 1
  • Min. document count: 1

  • Total websites: 16
  • Total articles/pages (XML files): 17,136

  • Total document count: 17,136
  • Total sentence count: 555,205
  • Total token count: 10,197,915

  • N-gram files
    • File format - 3 fields
      Document count | Word count | N-gram
  • Each n-gram file is sorted by document count, word count, then alphabetical order of the n-grams.
  • The n-gram set is the concatenation of the n-gram files (N = 1 to 5). It is not sorted.
  • The lowercased core-terms of the n-gram set are sorted and used for further processing in LMW candidate generation.
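The specification above can be illustrated with a minimal Python sketch. It is not the Lexicon's own implementation. The function names, whitespace tokenization, the rule that n-grams do not cross sentence boundaries, and the descending sort direction for the two counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

MAX_CHARS = 50  # "Max. Character size" from the specification above


def count_ngrams(docs, n, max_chars=MAX_CHARS):
    """Count n-grams of size n over docs.

    docs is a list of documents; each document is a list of sentence strings.
    Returns {ngram: (document_count, word_count)}.
    Assumptions: whitespace tokenization; no n-gram crosses a sentence boundary.
    """
    word_count = Counter()
    doc_ids = defaultdict(set)
    for doc_id, sentences in enumerate(docs):
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                if len(ngram) > max_chars:
                    continue  # skip n-grams longer than the character limit
                word_count[ngram] += 1
                doc_ids[ngram].add(doc_id)
    return {g: (len(doc_ids[g]), word_count[g]) for g in word_count}


def sorted_rows(counts):
    """Return (document_count, word_count, ngram) rows.

    Sorted by document count, word count, then n-gram; descending order
    for the two counts is an assumption, not stated in the source.
    """
    rows = [(dc, wc, g) for g, (dc, wc) in counts.items()]
    return sorted(rows, key=lambda r: (-r[0], -r[1], r[2]))
```

For example, `sorted_rows(count_ngrams(docs, 2))` yields the bigram rows for a small in-memory corpus in the three-field layout described above.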

III. Process

  • program:
    ${LMW_DIR}/bin/21.CSpellHealthCorpus
    2017

    Generate the raw n-gram set
    (Inputs are listed under ${IN_DIR}/${OUT_DIR}; outputs are listed under ${OUT_DIR}.)

    • Option 2: Convert XML files to raw corpus text files
      • Inputs: CSpellHealthCorpus/Crawl/*/*.html
      • Outputs: 21.CSpellHealthCorpus/RawCorpus/*.data
      • Notes:
        • File: each file includes all pages from one website
        • Format: ID|contents
        • Enhanced sentence tokenizer on Unicode
        • Enhanced string trim on Unicode space
    • Option 4: Generate all raw n-gram files (N = 1-5)
      • Inputs: 21.CSpellHealthCorpus/RawCorpus/*.data
      • Outputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data
      • Notes:
        • Data is relatively small, so there is no need to use the split-group-filter model
        • Process time: ~3 min.
    • Option 6: Sort all raw n-gram files
      • Inputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data
      • Outputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.tdw
    • Option 7: Generate the raw n-gram set
      • Inputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
      • Outputs: 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1
      • Notes: min. wC (word count) = 1
    • Option 8: Zip n-gram set
      • Inputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
        • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1
      • Outputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.${YEAR}.tgz
        • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1.tgz
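Two of the steps above, reading the `ID|contents` raw corpus format and concatenating the sorted n-gram files into one set with a minimum word count, can be sketched in Python as follows. This is an illustration only, not the `21.CSpellHealthCorpus` program. The function names are hypothetical, and the `|` field separator for the n-gram files is an assumption (the source states only that each file has three fields).

```python
def parse_corpus_line(line):
    """Split one raw-corpus line of the form ID|contents into (id, contents).

    Only the first "|" is treated as the separator, so the contents may
    themselves contain "|". str.strip() also removes Unicode whitespace
    such as the no-break space.
    """
    doc_id, _, contents = line.rstrip("\n").partition("|")
    return doc_id, contents.strip()


def build_ngram_set(ngram_files, out_file, min_word_count=1, sep="|"):
    """Concatenate n-gram files (in the order given) into one set file.

    Each input line is assumed to be: document count, word count, n-gram,
    joined by sep (an assumption). Lines whose word count is below
    min_word_count are dropped. Returns the number of n-grams written.
    """
    written = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for path in ngram_files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    # maxsplit=2 keeps any separator inside the n-gram intact
                    _doc_count, word_count, _ngram = line.split(sep, 2)
                    if int(word_count) >= min_word_count:
                        out.write(line + "\n")
                        written += 1
    return written
```

Because the files are appended one after another without re-sorting, the output is ordered within each N but not across the whole set, which matches the note that the n-gram set is not sorted.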

IV. Results

N-grams      File                  Zip Size    Actual Size   No. of n-grams
Unigrams     nGram.1.2017.tgz      0.985 Mb    2.8 Mb        194,407
Bigrams      nGram.2.2017.tgz      6.6 Mb      23 Mb         1,233,365
Trigrams     nGram.3.2017.tgz      18 Mb       65 Mb         2,806,783
Four-grams   nGram.4.2017.tgz      29 Mb       111 Mb        3,906,380
Five-grams   nGram.5.2017.tgz      39 Mb       149 Mb        4,396,030
N-gram Set   nGramSet.2017.1.tgz   92 Mb       350 Mb        12,536,965
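Since the n-gram set is the concatenation of the five n-gram files with a minimum word count of 1, its size should equal the sum of the per-N counts. A small Python check of that arithmetic, using the figures reported above (the variable and function names are illustrative):

```python
# Number of n-grams per N, as reported in the results above
NGRAM_COUNTS = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}
NGRAM_SET_SIZE = 12_536_965  # reported size of the n-gram set


def total_ngrams(counts):
    """Sum the per-N n-gram counts."""
    return sum(counts.values())


# The concatenated set should contain every n-gram from the five files
assert total_ngrams(NGRAM_COUNTS) == NGRAM_SET_SIZE
```

The assertion passes, so the reported totals are consistent with a plain concatenation in which no n-grams are filtered out.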