The SPECIALIST Lexicon

Frequency Analysis

I. Introduction

This page describes the relationship between frequency and valid words (single words and multiwords). Frequencies (word counts) from the MEDLINE n-gram set are used for the frequency analysis. Data from 2015 are used in the examples below.

II. Word Count Class vs. Term Number

  • Approach:
    • Split the Lexicon into single words (464,781) and multiwords (431,432)
    • Obtain LMW (multiword) candidates by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
    • Word count class: WC in increments of 100
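The binning step above can be sketched as follows. This is a minimal illustration, not the Lexicon team's actual tooling: the function names `wc_class` and `terms_per_class`, the half-open class boundaries (0 to 99, 100 to 199, and so on), and the sample data are all assumptions made for the example.

```python
from collections import Counter

WC_STEP = 100  # class width, taken from "WC in increments of 100"


def wc_class(word_count, step=WC_STEP):
    """Return the lower bound of the word count class containing word_count.

    Assumption: classes are half-open intervals [0, 100), [100, 200), ...
    """
    return (word_count // step) * step


def terms_per_class(word_counts, step=WC_STEP):
    """Count how many terms fall into each word count class."""
    return Counter(wc_class(wc, step) for wc in word_counts.values())


# Hypothetical term -> word count pairs, not real MEDLINE n-gram data.
sample = {"term a": 5, "term b": 42, "term c": 99, "term d": 100, "term e": 250}
histogram = terms_per_class(sample)

for lower_bound in sorted(histogram):
    print(lower_bound, histogram[lower_bound])
```

Keying each class by its lower bound keeps the result easy to sort and plot as word count class vs. term number.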

  • Results (word count class vs. term number):

    • Most valid words are located in the low WC range
    • The same pattern is observed for "Alice in Wonderland"

III. Word Count Class vs. Precision, Recall, F1

  • Approach:
    • Use only the LMW candidates obtained by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
    • Word count class: WC in increments of 100
    • Local precision = (valid tags / total tags)

      Computed only within the local word count class.

    • Local recall = (valid tags / total valid tags)

      Normalized to the range 0 to 1, with the maximum recall set to 1.
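The local metrics above can be sketched as follows. This is a hedged illustration under stated assumptions: "valid tags" in the recall formula is read as the valid tags within one class, "total valid tags" as the sum over all classes, and F1 is taken as the standard harmonic mean of local precision and normalized recall (the page does not spell out its F1 formula). The function name `local_metrics` and the sample counts are hypothetical.

```python
def local_metrics(classes):
    """Compute (precision, normalized recall, F1) per word count class.

    classes maps a class lower bound to a (valid_tags, total_tags) pair.
    """
    total_valid = sum(valid for valid, _ in classes.values())

    # Raw local recall: valid tags in the class over all valid tags.
    raw_recall = {
        c: (valid / total_valid if total_valid else 0.0)
        for c, (valid, _) in classes.items()
    }
    # Normalize so that the class with the maximum recall gets 1.
    max_recall = max(raw_recall.values(), default=0.0)

    metrics = {}
    for c, (valid, total) in classes.items():
        precision = valid / total if total else 0.0
        recall = raw_recall[c] / max_recall if max_recall else 0.0
        denom = precision + recall
        # Assumption: F1 is the harmonic mean of precision and normalized recall.
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


# Hypothetical (valid tags, total tags) per class, not real results.
sample = {0: (80, 100), 100: (40, 50), 200: (10, 40)}
result = local_metrics(sample)

for c in sorted(result):
    p, r, f1 = result[c]
    print(f"WC class {c}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because recall is normalized against its maximum, the values are comparable across classes but are not recall in the absolute sense.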

  • Results:

    • The low-frequency range has higher recall and F1 scores, with precision above 0.8.
    • LMW acquisition is set on the low WC range (100 to 10,000)
    • LSW (single word) acquisition is set on the high WC range