Consumer Health Corpus - The Raw N-gram Set
I. Introduction
The Consumer Health Corpus (also used in CSpell) is the source from which the n-gram set for LMW candidate generation is retrieved.
II. N-gram Set Specifications
Document count | Word Count | N-gram |
III. Process
${LMW_DIR}/bin/21.CSpellHealthCorpus
2017
Option | Description | Inputs - ${IN_DIR}/${OUT_DIR} | Outputs - ${OUT_DIR} | Notes |
---|---|---|---|---|
Generate the raw n-gram set | | | | |
2 | Convert Xml files to Raw Corpus Text files | CSpellHealthCorpus/Crawl/*/*.html | 21.CSpellHealthCorpus/RawCorpus/*.data | |
4 | Generate all raw n-gram files (N = 1-5) | 21.CSpellHealthCorpus/RawCorpus/*.data | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | |
6 | Sort all raw n-gram files | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | | |
7 | Generate the raw n-gram set | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1 | min wC = 1 |
8 | Zip n-gram set | | | |
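The generate, sort, and merge steps (options 4, 6, and 7) can be illustrated with a minimal Python sketch. It is not the LMW program itself: the whitespace tokenizer, the function names, and the in-memory dictionary are assumptions made for illustration, and the real tools may tokenize, count, and store n-grams differently. Only the range N = 1-5 and the minimum word count of 1 come from the table above.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count every n-gram of length n in an iterable of text lines.

    Tokenization is a plain whitespace split (an assumption; the
    actual corpus tools may normalize text differently).
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Slide a window of n tokens across the line.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_ngram_set(lines, max_n=5, min_wc=1):
    """Merge the n-gram counts for N = 1..max_n into one sorted mapping.

    N-grams whose count is below min_wc are dropped; with min_wc = 1
    nothing is filtered out.
    """
    lines = list(lines)  # allow several passes over the corpus
    merged = {}
    for n in range(1, max_n + 1):
        for gram, wc in ngram_counts(lines, n).items():
            if wc >= min_wc:
                merged[gram] = wc
    # Sort by n-gram text, loosely mirroring the separate sort step.
    return dict(sorted(merged.items()))


if __name__ == "__main__":
    corpus = ["take one tablet daily", "take one capsule daily"]
    for gram, wc in build_ngram_set(corpus, max_n=2).items():
        print(f"{gram}\t{wc}")
```

Raising `min_wc` above 1 in this sketch would prune rare n-grams; the raw set described here keeps everything.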
IV. Results
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | nGram.1.2017.tgz | 0.985 Mb | 2.8 Mb | 194,407 |
Bigrams | nGram.2.2017.tgz | 6.6 Mb | 23 Mb | 1,233,365 |
Trigrams | nGram.3.2017.tgz | 18 Mb | 65 Mb | 2,806,783 |
Four-grams | nGram.4.2017.tgz | 29 Mb | 111 Mb | 3,906,380 |
Five-grams | nGram.5.2017.tgz | 39 Mb | 149 Mb | 4,396,030 |
N-gram Set | nGramSet.2017.1.tgz | 92 Mb | 350 Mb | 12,536,965 |
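As a quick consistency check on the table above, the per-N counts can be summed and compared with the reported size of the n-gram set. The short Python snippet below does this; the variable names are illustrative only.

```python
# Number of n-grams per N, copied from the results table.
counts_by_n = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}
reported_set_total = 12_536_965

total = sum(counts_by_n.values())
print(total, total == reported_set_total)
```

The sum equals the reported total, which is what one would expect if the merged set keeps every n-gram from the five files (consistent with a minimum word count of 1).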