
The SPECIALIST Lexicon

Multiwords: Normalization

I. Why Normalization?

The same term can appear in many different forms in MEDLINE, differing in genitive, punctuation, and case. For example, "diabetes mellitus" appears as the following n-gram terms in MEDLINE:

  • diabetes mellitus
  • diabetes mellitus,
  • diabetes mellitus]
  • diabetes mellitus:
  • diabetes mellitus.
  • [diabetes mellitus
  • diabetes mellitus)
  • (diabetes mellitus
  • (diabetes mellitus,
  • diabetes mellitus),
  • (diabetes mellitus;
  • diabetes mellitus?]
  • (diabetes mellitus)
  • diabetes mellitus -

  • Diabetes mellitus
  • Diabetes Mellitus
  • DIABETES MELLITUS

  • Diabetes mellitus,
  • Diabetes mellitus.
  • [Diabetes mellitus
  • [Diabetes Mellitus:
  • [Diabetes mellitus]
  • Diabetes Mellitus:
  • Diabetes Mellitus,
  • DIABETES MELLITUS]

Normalization (abstracting away from genitive, punctuation, and case) is applied to n-gram terms so that these variants can be grouped for further review and analysis. In addition, the word count of a normalized n-gram term reflects the true frequency of usage of that term.

II. Normalization

  • Normalization uses the following Lexical Tools flow components:
    • -f:g (remove genitive)
    • -f:o (replace punctuation with space)
    • -f:l (lowercase)
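
The three steps above can be approximated in a few lines of Python. This is only an illustrative sketch, not the Lexical Tools implementation: the function names are hypothetical, the genitive rule is a simple regular-expression approximation of -f:g, and collapsing the leftover whitespace at the end is an added assumption so that variants compare equal.

```python
import re
import string


def remove_genitive(term: str) -> str:
    """Rough stand-in for -f:g: drop 's after a word and the apostrophe in s'."""
    return re.sub(r"(?<=\w)'s\b|(?<=s)'(?!\w)", "", term)


def replace_punctuation(term: str) -> str:
    """Rough stand-in for -f:o: replace each ASCII punctuation character with a space."""
    return "".join(" " if ch in string.punctuation else ch for ch in term)


def normalize(term: str) -> str:
    """Apply the three steps in order: genitive, punctuation, lowercase."""
    term = remove_genitive(term)
    term = replace_punctuation(term)
    term = term.lower()  # stand-in for -f:l
    # Assumption of this sketch: collapse runs of spaces and trim the ends.
    return " ".join(term.split())


if __name__ == "__main__":
    for variant in ["[Diabetes Mellitus:", "(diabetes mellitus),", "DIABETES MELLITUS]"]:
        print(repr(variant), "->", repr(normalize(variant)))
```

Under these assumptions, every variant listed in Section I maps to the single normalized form "diabetes mellitus".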

III. Using Normalization on N-grams to Generate (Multi)words

We used normalization as follows:

  • Use the word count (WC) of normalized terms in the prediction filter to generate high-frequency n-grams.
  • Candidate multiwords filtered from MEDLINE n-grams are grouped by normalized term. Both the normalized n-gram and the original n-grams are sent to linguists for review.
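
The grouping and counting described above might look like the following sketch. It is an assumption-laden illustration: `simple_normalize` is a hypothetical stand-in for the Lexical Tools flows, the counts are made-up sample data, and the `min_wc` threshold is a placeholder, since the actual criteria of the prediction filter are not specified here.

```python
import re
from collections import defaultdict


def simple_normalize(term: str) -> str:
    """Hypothetical stand-in normalizer: genitive, punctuation, lowercase."""
    term = re.sub(r"'s\b", "", term)      # drop 's
    term = re.sub(r"[^\w\s]", " ", term)  # punctuation -> space
    return " ".join(term.lower().split())  # lowercase, tidy spaces


def group_by_normalized(ngram_counts, normalize=simple_normalize):
    """Group original n-grams by normalized term and sum their word counts."""
    groups = defaultdict(list)
    totals = defaultdict(int)
    for term, wc in ngram_counts.items():
        key = normalize(term)
        groups[key].append(term)
        totals[key] += wc
    return dict(groups), dict(totals)


def filter_by_count(totals, min_wc):
    """Keep normalized terms whose summed WC reaches a (placeholder) threshold."""
    return {term for term, wc in totals.items() if wc >= min_wc}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sample = {
        "diabetes mellitus": 10,
        "Diabetes mellitus,": 3,
        "(diabetes mellitus)": 2,
        "heart failure.": 1,
    }
    groups, totals = group_by_normalized(sample)
    print(totals)
    print(filter_by_count(totals, min_wc=5))
```

Keeping the original n-grams alongside each normalized key mirrors the review step, where both forms are available to the linguists.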