The SPECIALIST Lexicon

Spelling Variant Patterns - Normalization

I. Introduction

Normalization can be used to find a group of spelling variants from a list of words (such as N-grams). Java programs include:

II. Development Notes

SpVarNorm was tested on Lexicon.2015. All false positives were retrieved and analyzed to improve the precision of the algorithm. Please see the spVarNorm Development notes for details.

III. Algorithm Details

Description | Rule | Example
Convert non-ASCII unicode to ASCII
  • Lexical Tools - ToAsciiApi
  • Labbé|Labbe
  • λmax|lamdamax
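
The diacritic part of this step can be approximated with the JDK alone. The sketch below is not the Lexical Tools ToAsciiApi; it assumes that Unicode decomposition followed by removal of combining marks is sufficient for accented letters, and the class name `AsciiFoldSketch` is hypothetical.

```java
import java.text.Normalizer;

final class AsciiFoldSketch {
    // Decompose accented letters (NFD), then drop the combining marks.
    // Assumption: this handles diacritics only. Symbol mappings such as
    // the Greek-letter example above need a lookup table, which is not
    // reproduced here, so such characters pass through unchanged.
    static String stripDiacritics(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Calling `AsciiFoldSketch.stripDiacritics` on Labbé yields Labbe, matching the first example in this row.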
Synonym substitution
  • ^St. => ^Saint
  • St. Anthony's fire|Saint Anthony's fire
Spelling variant substitution
  • labelled => labeled
  • programme => program
  • tumour => tumor
  • carbon 14 labelled|carbon 14 labeled
  • drug benefit programme|drug benefit program
  • CPA tumour|CPA tumor
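
The synonym and spelling variant substitutions above might be sketched as follows. This is an illustration under stated assumptions, not the spVarNorm source: the class and method names are hypothetical, the table holds only the three pairs listed above, and matching is assumed to be case-sensitive on whole space-separated words.

```java
import java.util.HashMap;
import java.util.Map;

final class SubstitutionSketch {
    // Word-level variant table; only the pairs listed in the rule column.
    private static final Map<String, String> SPELLING = new HashMap<>();
    static {
        SPELLING.put("labelled", "labeled");
        SPELLING.put("programme", "program");
        SPELLING.put("tumour", "tumor");
    }

    // Synonym rule: a leading "St." becomes "Saint".
    static String substituteSynonym(String term) {
        return term.replaceFirst("^St\\.", "Saint");
    }

    // Replace each space-separated word that has an entry in the table.
    static String substituteSpelling(String term) {
        String[] words = term.split(" ");
        for (int i = 0; i < words.length; i++) {
            words[i] = SPELLING.getOrDefault(words[i], words[i]);
        }
        return String.join(" ", words);
    }
}
```

With these rules, `substituteSpelling` maps carbon 14 labelled to carbon 14 labeled, and `substituteSynonym` maps St. Anthony's fire to Saint Anthony's fire.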
Rank substitution
  • 1st => first
  • 2nd => second
  • 3rd => third
  • Vth => 5th
  • 5th => fifth
  • 8th => eighth
  • 9th => ninth
  • 12th => twelfth
  • Vth nerve|5th nerve
Number substitution
  • 60 => sixty
  • 50 => fifty
  • 40 => forty
  • 30 => thirty
  • 20 => twenty
  • 19 => nineteen
  • 18 => eighteen
  • 17 => seventeen
  • 16 => sixteen
  • 15 => fifteen
  • 14 => fourteen
  • 13 => thirteen
  • 12 => twelve
  • 11 => eleven
  • 10 => ten
  • 9 => nine
  • 8 => eight
  • 7 => seven
  • 6 => six
  • 5 => five
  • 4 => four
  • 3 => three
  • 2 => two
  • 1 => one
  • 3-membered ring|three membered ring|three-membered ring
  • 12-lead|twelve-lead
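
The rank and number substitutions above can be sketched as a token lookup. The code below is a simplified illustration: the class name is hypothetical, the table carries only a subset of the listed rules, and it assumes that rules may chain (Vth => 5th => fifth) and that a token is any run of letters and digits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RankNumberSketch {
    // Subset of the rank and number rules from the table above.
    private static final Map<String, String> RULES = new HashMap<>();
    static {
        RULES.put("Vth", "5th");
        RULES.put("1st", "first");
        RULES.put("2nd", "second");
        RULES.put("3rd", "third");
        RULES.put("5th", "fifth");
        RULES.put("12th", "twelfth");
        RULES.put("1", "one");
        RULES.put("2", "two");
        RULES.put("3", "three");
        RULES.put("12", "twelve");
    }

    // Assumption: a token is a maximal run of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");

    static String substitute(String term) {
        Matcher m = TOKEN.matcher(term);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            // Follow chained rules until no rule applies.
            while (RULES.containsKey(token)) {
                token = RULES.get(token);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(token));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Under these assumptions, 3-membered ring becomes three-membered ring, and both Vth nerve and 5th nerve become fifth nerve. The hyphen is left for the punctuation step.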
Roman Number substitution
  • class-II, type-II, TBD
  • BoHV-I|BoHV-1
  • BoHVI|BoHV1
Punctuation
  • - => space
  • . => space
  • " => space
  • ! => space
  • & => space
  • ( => space
  • ) => space
  • [ => space
  • ] => space
  • / => space
  • lamin-A|lamin A
  • A.A.D.|AAD
  • University of Rome "Tor Vergata"|University of Rome Tor Vergata
  • !Kung|Kung
  • L & A|L A (L and A?)
  • aflatoxin M(1)|aflatoxin M1
  • B(a)PDE|B[a]PDE|BaPDE
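
The punctuation rules above amount to a single character-class replacement. A minimal sketch, with a hypothetical class name, covering exactly the ten characters listed:

```java
final class PunctuationSketch {
    // Replace each of - . " ! & ( ) [ ] / with a single space.
    // The hyphen comes first in the class so it is read literally.
    static String toSpaces(String term) {
        return term.replaceAll("[-.\"!&()\\[\\]/]", " ");
    }
}
```

Here lamin-A becomes lamin A, and B(a)PDE and B[a]PDE both become B a PDE. Pairs such as A.A.D.|AAD converge only after the Remove Space step listed later in the table, because this step leaves spaces where the punctuation was.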
Genitive
  • s's => s
  • s' => s
  • 's => space
  • ' => space

Apply this operation only when the matching pattern is not at the end of the term.
  • Addison's disease|Addisons disease
  • bilateral Wilms' tumor|bilateral Wilms tumor

  • Laufe's forceps|Laufe forceps are not spVars because of an extra s-/z- sound. This is referred to as "strict homonymy" (same spelling, same pronunciation, different meaning).
Lower case
  • toLowerCase()
  • Latter-Day Saint|Latter-day Saint
Remove Space
  • space => (removed)
  • lattice work|latticework
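
Once terms are normalized, spelling variants can be found by grouping terms that share the same normalized form. The sketch below applies only the last two steps (lower case, remove space) and groups a word list by the result. The class and method names are hypothetical, and the earlier steps in the table are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SpVarGroupingSketch {
    // Simplified key: lower case, then remove all spaces.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    // Group terms by normalized key, keeping input order within each group.
    static Map<String, List<String>> group(List<String> terms) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String term : terms) {
            groups.computeIfAbsent(normalize(term), k -> new ArrayList<>())
                  .add(term);
        }
        return groups;
    }
}
```

Given lattice work, latticework, Latter-Day Saint, and Latter-day Saint, `group` returns two groups of two terms each. Any group with more than one member is a candidate set of spelling variants.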