
CSpell

Factor Analysis Results

I. Error Types

  • Tokens in the Brat annotation data (spelling errors to be corrected) and tokens not in the Brat annotation data (not spelling errors, should not be corrected) are tested by a set of computer programs. Each error type is identified and coded in the program for further processing, as shown below:

    Correction Type / Details

    PreCorrection:
    • B1.1. PreCorr (T)
    • B1.2. PreCorr (F)

    Dictionary-based Correction (spelling detector, candidates, ranking):
    • B2.1. DicCorr (T)
    • B2.2. DicCorr (F)
      • B2.2.1. Not detect, real-word (error tag)
      • B2.2.2. Not detect, spelling error (non-word)
      • B2.2.3. Detect, not candidates by edit-distance
      • B2.2.4. Detect, not candidates by suggestion Dic
      • B2.2.5. Detect, not candidates by multi-corrections
      • B2.2.6. Detect, candidates, wrong (not top) rank
      • B2.2.7. Detect, candidates, wrong top rank

    Combination: TBD

  • Tokens not in the Brat annotation data (correct spelling, no correction needed)

    Correction Type / Details

    Not in checkDic, not correct:
    • A2.2.1. Not in checkDic, corrected wrong, by dictionary
    • A2.2.2. Not in checkDic, corrected wrong, by preCorrection
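
As an illustration of how error types might be coded and then tallied, here is a minimal Python sketch. The codes are copied from the lists above; the `ErrorType` enum, the `tally` helper, and the sample data are hypothetical and are not part of CSpell or its test programs.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    """Error codes from the lists above (the enum itself is illustrative)."""
    PRE_CORR_T = "B1.1"
    PRE_CORR_F = "B1.2"
    DIC_CORR_T = "B2.1"
    NOT_DETECT_REAL_WORD = "B2.2.1"
    NOT_DETECT_NON_WORD = "B2.2.2"
    NO_CAND_EDIT_DISTANCE = "B2.2.3"
    NO_CAND_SUGGESTION_DIC = "B2.2.4"
    NO_CAND_MULTI_CORRECTION = "B2.2.5"
    WRONG_NOT_TOP_RANK = "B2.2.6"
    WRONG_TOP_RANK = "B2.2.7"
    VALID_WRONG_BY_DIC = "A2.2.1"
    VALID_WRONG_BY_PRE = "A2.2.2"


def tally(codes):
    """Return {code: "count (pct%)"} for an iterable of ErrorType values."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {
        err.value: f"{n} ({100 * n / total:.4f}%)"
        for err, n in counts.items()
    }


# Hypothetical sample: three tokens that have already been classified.
sample = [
    ErrorType.DIC_CORR_T,
    ErrorType.DIC_CORR_T,
    ErrorType.WRONG_NOT_TOP_RANK,
]
report = tally(sample)
```

The "count (percentage%)" string with four decimals mirrors the way counts are reported in the analysis results.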

II. Analysis Results

The results on the baseline data are shown below:

  • PreCorrection (365 of the 833 tagged terms, 43.8175%):
    • T: 332 (90.9589%)
    • F: 33 (9.0411%)

  • Dictionary-based correction:
    • * Lexicon.E: uses Lexicon, with Aa, unit, and Mw (includes spVar and Pn)
    • ** Combo1: uses Lexicon.E, replacing suggDic with the baseline (eng_med.dic)
    • *** Combo2: uses Lexicon.E + Medline, replacing suggDic with the baseline (eng_med.dic)

    Performance (by Baseline program)

    Metric    | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    TP        | 498   | 548      | 524     | 535     | 534        | 543      | 529
    Ret.      | 2606  | 845      | 809     | 829     | 814        | 737      | 695
    Rel.      | 814   | 814      | 814     | 814     | 814        | 814      | 814
    Precision | 0.19  | 0.65     | 0.65    | 0.65    | 0.66       | 0.74     | 0.76
    Recall    | 0.62  | 0.67     | 0.64    | 0.66    | 0.66       | 0.67     | 0.65
    F1        | 0.29  | 0.66     | 0.64    | 0.65    | 0.66       | 0.70     | 0.70

    Tagged terms (833), should be corrected. Percentages are relative to the 468 tagged terms not handled by PreCorrection (833 - 365).

    Error type | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    B2.1. DicCorr (T) | 227 (48.5043%) | 232 (49.5726%) | 205 (43.8034%) | 234 (50.0000%) | 235 (50.2137%) | 226 (48.2906%) | 210 (44.8718%)
    B2.2. DicCorr (F) | 241 (51.4957%) | 236 (50.4274%) | 263 (56.1966%) | 234 (50.0000%) | 233 (49.7863%) | 242 (51.7094%) | 258 (55.1282%)

    Tag issue: re-check the annotation
    B2.2.1. Not detect, real-word (error tag) | 36 (7.6923%) | 49 (10.4701%) | 43 (9.1880%) | 50 (10.6838%) | 50 (10.6838%) | 50 (10.6838%) | 50 (10.6838%)

    Detection issue: check dictionary + exception algorithm
    B2.2.2. Not detect, spelling error (non-word) | 20 (4.2735%) | 54 (11.5385%) | 76 (16.2393%) | 57 (12.1795%) | 57 (12.1795%) | 57 (12.1795%) | 85 (18.1624%)

    Candidate issue: edit distance + phonetic + suggestion dictionary
    B2.2.3. Detect, not candidates by edit-distance | 37 (7.9060%) | 34 (7.2650%) | 29 (6.1966%) | 32 (6.8376%) | 32 (6.8376%) | 32 (6.8376%) | 28 (5.9829%)
    B2.2.4. Detect, not candidates by suggestion Dic | 79 (16.8803%) | 11 (2.3504%) | 19 (4.0598%) | 17 (3.6325%) | 20 (4.2735%) | 15 (3.2051%) | 15 (3.2051%)
    B2.2.5. Detect, not candidates by multi-corrections | 2 (0.4274%) | 6 (1.2821%) | 13 (2.7778%) | 5 (1.0684%) | 5 (1.0684%) | 6 (1.2821%) | 6 (1.2821%)

    Ranking issue: in candidate list
    B2.2.6. Detect, candidates, wrong (not top) rank | 62 (13.2479%) | 75 (16.0256%) | 77 (16.4530%) | 65 (13.8889%) | 57 (12.1795%) | 75 (16.0256%) | 69 (14.7436%)
    B2.2.7. Detect, candidates, wrong top rank | 5 (1.0684%) | 7 (1.4957%) | 6 (1.2821%) | 8 (1.7094%) | 12 (2.5641%) | 7 (1.4957%) | 5 (1.0684%)

    Valid word (not tagged), but not in checkDic, corrected wrong

    Error type | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    A2.2.1. Not in checkDic, corrected wrong, by Dic | 1912 (7.8287%) | 139 (0.5691%) | 121 (0.4954%) | 143 (0.5855%) | 137 (0.5609%) | 70 (0.2866%) | 51 (0.2088%)
    A2.2.2. Not in checkDic, corrected wrong, by Pre | 41 (0.1679%) | 33 (0.1351%) | 27 (0.1106%) | 31 (0.1269%) | 31 (0.1269%) | 31 (0.1269%) | 26 (0.1065%)

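    Summary

    Dictionary | Error types summed | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    Check Dic | B2.2.2+A2.2.1+A2.2.2 | 1973 | 226 | 224 | 231 | 225 | 158 | 162
    Sugg Dic | B2.2.3+B2.2.4+B2.2.5 | 118 | 51 | 61 | 54 | 57 | 53 | 49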
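
The Precision, Recall, and F1 figures appear to follow from the TP|Ret.|Rel. counts under the standard definitions. The sketch below assumes that Ret. is the number of retrieved (attempted) corrections and Rel. is the number of relevant (expected) corrections; the function name `prf` is hypothetical. It also recomputes one Check Dic summary cell from its component error types.

```python
def prf(tp, retrieved, relevant):
    """Precision, recall, and F1 from true-positive, retrieved, relevant counts."""
    precision = tp / retrieved
    recall = tp / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Baseline column: TP|Ret.|Rel. = 548|845|814
baseline = tuple(round(x, 2) for x in prf(548, 845, 814))

# Combo1 column: TP|Ret.|Rel. = 543|737|814
combo1 = tuple(round(x, 2) for x in prf(543, 737, 814))

# Check Dic summary for the Baseline column: B2.2.2 + A2.2.1 + A2.2.2
check_dic_baseline = 54 + 139 + 33
```

Rounded to two decimals, the Baseline and Combo1 values reproduce the Precision, Recall, and F1 entries reported for those columns.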

  • Edit Distance:
    Edit distance | Instances | Percentage | Cumulative percentage
    1             | 317       | 67.74%     | 67.74%
    2             | 110       | 23.50%     | 91.24%
    3             | 24        | 5.13%      | 96.37%
    4             | 8         | 1.71%      | 98.08%
    5             | 6         | 1.28%      | 99.36%
    6             | 2         | 0.43%      | 99.79%
    7             | 1         | 0.21%      | 100.00%
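
The percentage and cumulative-percentage columns can be recomputed from the instance counts, as in the sketch below. The page does not say which edit-distance variant CSpell uses, so the `levenshtein` function here is an assumption (plain Levenshtein distance, with insertions, deletions, and substitutions each costing 1), included only to show what an edit distance of 1 or 2 means. The example word pairs are hypothetical.

```python
from itertools import accumulate


def levenshtein(a, b):
    """Plain Levenshtein distance (an assumed variant, not confirmed by the source)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


# Instance counts per edit distance, taken from the table above.
instances = {1: 317, 2: 110, 3: 24, 4: 8, 5: 6, 6: 2, 7: 1}
total = sum(instances.values())
running = accumulate(instances.values())

rows = [
    (dist, n, f"{100 * n / total:.2f}%", f"{100 * cum / total:.2f}%")
    for (dist, n), cum in zip(instances.items(), running)
]

# Hypothetical misspelling/correction pairs.
d1 = levenshtein("diabetis", "diabetes")  # one substitution
d2 = levenshtein("bellows", "below")      # two deletions
```

The counts sum to 468, and 91.24% of the instances fall within an edit distance of 2.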