The SPECIALIST Lexicon

Frequency Analysis on 5 WC ranges: 100, 1K, 10K, 100K, 1M

I. Introduction

A frequency strategy is important for LMW (lexical multiword) acquisition. It is applied to LMW candidates obtained from filters and matchers to improve precision. This page describes a frequency analysis on 5 word count (WC) ranges (100, 1K, 10K, 100K, 1M).

II. Details

  • Directory:
    • ${MULTIWORDS}/bin/08.MatcherSpVar
    • ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/
    • ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good
  • Model:
    • Input Data: 2015 Distilled MEDLINE N-gram Set
    • Process:
      • Step 51: Use SpVar model of M2CES to get SpVar List
        medline.2.byM2CES.2.out.30.spVars (min_ed >= 2, WC >= 30)
      • Step 60: Apply CUI filter
        medline.2.byM2CES.2.out.30.spVars.cui
      • Step 61A: Retrieve 500 LMW candidates at each of the 5 WC ranges
        The algorithm only counts 500 multiwords below each of the following WC values:
        • 100
        • 1000
        • 10000
        • 100000
        • 1000000
      • Tag them:
        Tag        Description
        AUTO_YES   Automatically tagged by computer if term is in Lexicon
        AUTO_NO    Automatically tagged by computer if term is in Lexicon
        Y          Manually tagged by linguists if term is LMW, then add to Lexicon
        N          Manually tagged by linguists if term is not LMW, then add to invalid LMW List
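
One possible reading of Step 61A can be sketched as follows. This is a minimal illustration, not the project's actual program: the function names, the (term, WC) tuple input, the "at or below the WC" cutoff, and the "highest WC first" ordering are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# The five WC values listed in Step 61A.
WC_RANGES = [100, 1_000, 10_000, 100_000, 1_000_000]


def is_multiword(term: str) -> bool:
    """Assumption: a multiword is any term with more than one space-separated token."""
    return len(term.split()) > 1


def pick_candidates(
    ngrams: Iterable[Tuple[str, int]], wc_limit: int, size: int = 500
) -> List[Tuple[str, int]]:
    """Return up to `size` multiword n-grams whose WC is at or below `wc_limit`.

    Assumption: candidates are taken in descending WC order, so the sample
    sits just below the chosen WC value.
    """
    below = [(term, wc) for term, wc in ngrams if wc <= wc_limit and is_multiword(term)]
    below.sort(key=lambda pair: pair[1], reverse=True)
    return below[:size]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the MEDLINE n-gram set.
    sample = [
        ("heart attack", 120),
        ("heart", 90),
        ("blood pressure", 95),
        ("cell line", 80),
        ("renal failure", 1000),
    ]
    for limit in WC_RANGES[:2]:
        print(limit, pick_candidates(sample, limit, size=2))
```

Single words are skipped before the 500-term cap is applied, which matches the statement that only multiwords are counted.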

III. Results

Frequency (WC)   Precision (New Terms)    Precision (Total Terms)
100              19.81% (= 104/525)       21.60% (= 116/537)
1K               36.77% (= 196/533)       42.42% (= 249/587)
10K              47.73% (= 263/551)       67.56% (= 604/894)
100K             35.72% (= 384/1075)      68.38% (= 1516/2217)
1M               36.77% (= 556/1512)      71.16% (= 2396/3367)
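
The percentages in the table can be rechecked from the raw counts. The short sketch below assumes that precision is the number of valid LMWs divided by the number of tagged terms, which is what the fractions in the table suggest; the variable and function names are illustrative only.

```python
# (valid, tagged) counts copied from the results table:
# first pair is new terms, second pair is total terms.
COUNTS = {
    "100": ((104, 525), (116, 537)),
    "1K": ((196, 533), (249, 587)),
    "10K": ((263, 551), (604, 894)),
    "100K": ((384, 1075), (1516, 2217)),
    "1M": ((556, 1512), (2396, 3367)),
}


def precision_pct(valid: int, tagged: int) -> float:
    """Precision as a percentage: valid LMWs over all tagged terms."""
    return 100.0 * valid / tagged


if __name__ == "__main__":
    for wc, (new_terms, total_terms) in COUNTS.items():
        new_p = precision_pct(*new_terms)
        total_p = precision_pct(*total_terms)
        print(f"{wc:>5}  new: {new_p:.2f}%  total: {total_p:.2f}%")
```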

Precision over total terms increases as the frequency increases (from 21.60% at 100 to 71.16% at 1M), while precision over new terms peaks at 10K (47.73%). Thus, we should acquire LMWs from the highest-frequency n-grams first. Detailed data are available at: ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good/*.rpt