
The SPECIALIST Lexicon

Total Data Set Files and Usage

The procedure for adding LMWs is:

  • Generate an n-gram set from MEDLINE (or any other corpus)
  • Generate candidate lists from the n-gram set
  • Linguists manually tag/add terms from the candidate lists
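The first two steps above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the whitespace tokenization, the function names `generate_ngrams` and `candidate_list`, and the `min_count` frequency threshold are all assumptions made for illustration, and the real candidate filters are not described here.

```python
from collections import Counter


def generate_ngrams(sentences, max_n=3):
    """Count word n-grams (n = 1..max_n) over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Naive lowercase + whitespace tokenization (an assumption of this sketch).
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def candidate_list(ngram_counts, min_count=2):
    """Keep multiword n-grams seen at least min_count times.

    The frequency threshold is a hypothetical stand-in for the real filters.
    Results are sorted by descending count, then alphabetically.
    """
    candidates = [
        (term, count)
        for term, count in ngram_counts.items()
        if " " in term and count >= min_count
    ]
    return sorted(candidates, key=lambda tc: (-tc[1], tc[0]))


# Toy corpus standing in for MEDLINE.
corpus = [
    "heart attack risk factors",
    "risk factors for heart attack",
]
counts = generate_ngrams(corpus)
candidates = candidate_list(counts)
print(candidates)
```

The resulting list is what a linguist would then review by hand in the third step.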

It would be very useful to collect all manually tagged terms. The collected terms can be used for:

  • Final filter to exclude previously tagged terms
    • Use all inflVars to filter out valid LMWs (terms already in the Lexicon)
    • Use the collected invalid LMWs to tag terms as invalid, as a reference for linguists (some invalid terms might become valid).
      => Please note that both of these data sets change whenever the Lexicon is updated, a new candidate list is completed, or a new not base/LMW file is updated in LexCheck
  • Training/test data set for deep learning models

The manually tagged data are collected from two sources (listed under Algorithm below):

  • Program: ${MULTIWORDS}/bin/00.CandidateList
    3
  • Data directory: ${MULTIWORDS}/data/Candidate/
  • Algorithm:
    • Get total data from
      • All previous candidate lists
      • All not base/LMW files from LexCheck
    • Use the latest Lexicon (InflVars.data) to tag valid/invalid LMWs
      • valid LMWs: totalData.data.yes
      • invalid LMWs: totalData.data.no
    • Use the tagged results to tag new candidate lists:
      • valid LMWs: inflVars.data
      • invalid LMWs: totalData.data.no
      • TBD (new Candidate list): others
  • Output files:
    • totalData.data.*

      Date        Notes                          Total Candidate  Valid LMWs          Invalid LMWs
                                                 totalData.data   totalData.data.yes  totalData.data.no
      2018-11-15  2.MNSMatcherParAcr, 2017       31004            16331 (52.67%)      14673 (47.32%)
      2019-01-03  2.MNSMatcherParAcr, 2018       31806            16924 (53.21%)      14882 (46.79%)
      2019-05-20  3.DMNSMatcherCuiEndWord, 2017  33751            16924 (50.14%)      16827 (49.86%)
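The tagging algorithm described above amounts to a few set operations. The following Python sketch shows one way to express it, assuming terms are compared by exact string match and that each source has already been loaded into memory; the function names and the toy terms are hypothetical, and this is not the code behind `00.CandidateList`.

```python
def build_total_data(previous_candidates, lexcheck_not_lmw):
    """Union of all terms already reviewed, taken from both sources."""
    return set(previous_candidates) | set(lexcheck_not_lmw)


def split_total_data(total_data, infl_vars):
    """Split total data into valid (in the Lexicon) and invalid (not in it)."""
    valid = total_data & infl_vars      # would go to totalData.data.yes
    invalid = total_data - infl_vars    # would go to totalData.data.no
    return valid, invalid


def tag_candidates(new_candidates, infl_vars, invalid):
    """Tag each new candidate as valid, invalid, or TBD (needs review)."""
    tags = {}
    for term in new_candidates:
        if term in infl_vars:
            tags[term] = "valid"
        elif term in invalid:
            tags[term] = "invalid"
        else:
            tags[term] = "TBD"
    return tags


def percent(part, total):
    """Share of total as a percentage, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Toy data; real inputs would come from the candidate lists, LexCheck, and inflVars.data.
infl_vars = {"heart attack", "risk factors"}
total = build_total_data(["heart attack", "attack risk"], ["for heart"])
valid, invalid = split_total_data(total, infl_vars)
tags = tag_candidates(["risk factors", "attack risk", "factors for"], infl_vars, invalid)
print(tags)
print(percent(16924, 33751), percent(16827, 33751))
```

Because the valid/invalid split is recomputed against the current Lexicon on every run, a term previously tagged invalid moves to the valid set automatically once it is added to the Lexicon. The last line recomputes the percentages of the 2019-05-20 row from its counts.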