SPECIALIST Lexicon

Frequency Analysis

I. Introduction

This page describes the relationship between frequency and valid words (single words and multiwords). Frequencies (word counts) from the MEDLINE n-gram set are used for the frequency analysis. Data from 2015 are used in the examples below.

II. Word Count Class vs. Term Number

Approach:
- Split the Lexicon into single words (464,781) and multiwords (431,432)
- Use LMW (multiword) candidates obtained by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
- Word count class: WC in increments of 100
Results (word count class vs. term number):
- Most valid words are located in the low WC range
- Same result as "Alice in Wonderland"
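
The binning step above can be sketched as follows. This is a minimal illustration, not the Lexicon's actual tooling: the function names, the toy term counts, and the choice of half-open classes such as [0, 100) and [100, 200) are all assumptions made for the example.

```python
from collections import Counter

WC_STEP = 100  # class width, following "WC in increments of 100"


def wc_class(word_count, step=WC_STEP):
    """Return the lower bound of the word count class containing word_count.

    Assumption: classes are half-open intervals [k*step, (k+1)*step).
    """
    return (word_count // step) * step


def terms_per_class(term_counts, step=WC_STEP):
    """Count how many terms fall into each word count class."""
    return Counter(wc_class(wc, step) for wc in term_counts.values())


# Hypothetical toy data: term -> word count in an n-gram set.
toy_counts = {
    "heart attack": 42,
    "blood pressure": 150,
    "cell line": 99,
    "rare term": 3,
    "common term": 199,
}

hist = terms_per_class(toy_counts)
for lower in sorted(hist):
    print(f"WC {lower}-{lower + WC_STEP - 1}: {hist[lower]} terms")
```

Plotting such a histogram for valid terms only would show whether they concentrate in the low WC classes, as reported in the results above.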

III. Word Count Class vs. Precision, Recall, F1

Approach:
- Use only LMW (multiword) candidates obtained by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
- Word count class: WC in increments of 100
- Local precision = (valid tags / total tags)
  computed within each word count class only
- Local recall = (valid tags / total valid tags)
  normalized to the range 0 to 1, with the maximum recall set to 1
Results:
- The low-frequency range has higher recall and F1 scores, with precision above 0.8.
- LMW acquisition is focused on the low WC range (100 - 10000)
- LSW (single word) acquisition is focused on the high WC range
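
The per-class metrics defined above can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `local_metrics`, the input format of (word count, is valid) pairs, and the toy data are hypothetical, and F1 is assumed to be the usual harmonic mean of local precision and normalized local recall, which the page does not spell out.

```python
def local_metrics(candidates, step=100):
    """Compute (precision, normalized recall, F1) per word count class.

    candidates: iterable of (word_count, is_valid) pairs (assumed format).
    Returns a dict mapping each class lower bound to a 3-tuple.
    """
    total = {}  # tagged candidates per class
    valid = {}  # valid candidates per class
    for wc, is_valid in candidates:
        c = (wc // step) * step
        total[c] = total.get(c, 0) + 1
        valid[c] = valid.get(c, 0) + (1 if is_valid else 0)

    total_valid = sum(valid.values())
    # Raw local recall: share of all valid tags that fall in this class.
    raw_recall = {
        c: (valid[c] / total_valid if total_valid else 0.0) for c in total
    }
    max_recall = max(raw_recall.values(), default=0.0)

    metrics = {}
    for c in sorted(total):
        p = valid[c] / total[c]  # local precision
        r = raw_recall[c] / max_recall if max_recall else 0.0  # max recall -> 1
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # assumed harmonic mean
        metrics[c] = (p, r, f1)
    return metrics


# Hypothetical toy data: (word count, whether the candidate is a valid term).
toy = [(10, True), (20, True), (30, False), (50, True), (120, True), (180, False)]
for lower, (p, r, f1) in local_metrics(toy).items():
    print(f"WC class {lower}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because recall is rescaled by its maximum, the class holding the most valid tags always has recall 1, so the scores compare classes with each other rather than measuring absolute coverage.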
