Frequency Analysis on 5 WC ranges: 100, 1K, 10K, 100K, 1M
I. Introduction
Frequency strategy is important for LMW acquisition. It is applied to LMW candidates obtained from filters and matchers for better precision. This page describes a frequency analysis on 5 word count (WC) ranges (100, 1K, 10K, 100K, 1M).
II. Details
${MULTIWORDS}/bin/08.MatcherSpVar
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good
Tag | Description |
---|---|
AUTO_YES | Automatically tagged by computer if term is in Lexicon |
AUTO_NO | Automatically tagged by computer if term is in the invalid LMW list |
Y | Manually tagged by linguists if term is an LMW; the term is then added to the Lexicon |
N | Manually tagged by linguists if term is not an LMW; the term is then added to the invalid LMW list |
III. Results
Frequency | Precision (New Terms) | Precision (Total Terms) |
---|---|---|
100 | 19.81% (= 104/525) | 21.60% (= 116/537) |
1K | 36.77% (= 196/533) | 42.42% (= 249/587) |
10K | 47.73% (= 263/551) | 67.56% (= 604/894) |
100K | 35.72% (= 384/1075) | 68.38% (= 1516/2217) |
1M | 36.77% (= 556/1512) | 71.16% (= 2396/3367) |
The total precision increases as the frequency increases, from 21.60% at 100 to 71.16% at 1M. Thus, we should acquire LMWs from the highest-frequency n-grams.
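The two precision columns in the results table can be reproduced from tag counts. The sketch below is illustrative only and rests on two assumptions that this page does not state: precision on new terms is taken as Y / (Y + N), and precision on total terms as (Y + AUTO_YES) / (Y + N + AUTO_YES + AUTO_NO). The function name `precision` and the per-tag counts in the example are hypothetical; the counts were chosen only to match the fractions reported for the 100 range (104/525 and 116/537).

```python
from collections import Counter


def precision(tags):
    """Return (new_term_precision, total_term_precision) for a list of tags.

    Assumed formulas (not confirmed by the source page):
      new terms   = Y / (Y + N)
      total terms = (Y + AUTO_YES) / (Y + N + AUTO_YES + AUTO_NO)
    """
    counts = Counter(tags)  # missing tags count as 0
    new_valid = counts["Y"]
    new_total = counts["Y"] + counts["N"]
    total_valid = counts["Y"] + counts["AUTO_YES"]
    total = new_total + counts["AUTO_YES"] + counts["AUTO_NO"]
    # Guard against empty input to avoid division by zero.
    new_p = new_valid / new_total if new_total else 0.0
    total_p = total_valid / total if total else 0.0
    return new_p, total_p


# Hypothetical tag breakdown consistent with the 100 row of the table:
# 525 new terms (104 Y + 421 N) plus 12 automatically tagged valid terms.
tags_100 = ["Y"] * 104 + ["N"] * 421 + ["AUTO_YES"] * 12
new_p, total_p = precision(tags_100)
print(f"{new_p:.2%} {total_p:.2%}")  # 19.81% 21.60%
```

Under these assumptions, the gap between the two columns at higher frequencies comes from the growing share of candidates that are already tagged automatically.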
Detailed data are available at:
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good/*.rpt