The SPECIALIST Lexicon

ASCII LEXICON, 1st Version (09~10)

I. Introduction

The Specialist LEXICON is distributed annually with UMLS in UTF-8 format. Some NLP projects use the Specialist LEXICON but still handle only ASCII characters. In response to requests from user groups, a pure ASCII version of the LEXICON has been distributed since 2009.

II. Algorithm

  • Convert the LEXICON from UTF-8 to ASCII (7-bit):
    Use the Java API class ToAsciiApi( ) from Lexical Tools (after 2009) to perform the conversion.

  • Automatic/manual cleanup:
    After the conversion, some data in the lexical records needs to be cleaned up. For example, the spelling variant résumé is converted to resume and should be removed, since it is then identical to the base form. In the 2009 LEXICON, we found the ASCII conversion cases shown in the following table that need cleanup. LexCheck.CheckContent.Check( ) is used to remove the duplicates.

    LEXICON content               Action                      Notes & Example
    ----------------------------  --------------------------  ------------------------------------------
    {base=filler                  N/A                         All base forms are unique
    spelling_variant=filler       remove if it is duplicated  spelling_variant=résumé
    abbreviation_of=abbreviation  remove if it is duplicated  None
    acronym_of=acronyms           remove if it is duplicated  None
    nominalization_of=filler      remove if it is duplicated  None
    variants=irreg                remove if it is duplicated  irreg|saute|sautes|sauted|sauted|sauteing|
    compl=pphr(                   N/A                         Needs manual cleanup (none)
    trademark=filler(             N/A                         Needs manual cleanup (none)
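
The two steps above (convert to ASCII, then remove entries that have become duplicates) can be illustrated with a minimal Java sketch. It does not call ToAsciiApi( ) or LexCheck.CheckContent.Check( ), whose signatures are not given here. Instead it assumes a simplified stand-in built on the standard java.text.Normalizer class, which strips diacritics and replaces any remaining non-ASCII character with "?". The class AsciiCleanupSketch and its methods toAscii and cleanSpellingVariants are hypothetical names used only for illustration, and the real Lexical Tools conversion may map characters differently.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lexical Tools or LexCheck.
final class AsciiCleanupSketch {

    // Assumed stand-in for the ASCII conversion: decompose accented letters
    // (NFD), drop the combining marks, then replace anything still outside
    // the 7-bit range with '?'.
    static String toAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.replaceAll("[^\\x00-\\x7F]", "?");
    }

    // Convert each spelling variant and keep it only if it differs from the
    // converted base form and from every variant already kept.
    static List<String> cleanSpellingVariants(String base, List<String> variants) {
        String asciiBase = toAscii(base);
        Set<String> kept = new LinkedHashSet<>();
        for (String variant : variants) {
            String ascii = toAscii(variant);
            if (!ascii.equals(asciiBase)) {
                kept.add(ascii); // the set silently ignores repeated variants
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // The accented spelling is written with Unicode escapes so that the
        // source file itself stays pure ASCII.
        String accented = "r\u00e9sum\u00e9";
        System.out.println(toAscii(accented)); // prints "resume"
        System.out.println(
            cleanSpellingVariants("resume", Arrays.asList(accented, "resumee")));
        // prints "[resumee]"
    }
}
```

In this sketch the accented spelling variant collapses onto the base form resume and is dropped, which mirrors the spelling_variant row of the table. The rows marked as needing manual cleanup are not covered.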