
CSpell

Corrector

This page describes the corrector algorithm, which updates the text by replacing each spelling error with its top-ranked candidate.

I. One-To-One

  • Finding: Find the top-ranked candidate (TokenObj)
  • Correction: Add the candidate to the outTokenList
  • Java: OneToOneSplitCorrector.AddToFlatMapList
  • Example:

    Input: ...dianosed...
    Top Candidate: ...diagnosed...
    Correction: ...diagnosed...

II. Split

  • Finding: Find the top-ranked candidate (TokenObj)
  • Correction: Flat map the candidate into the outTokenList
    The top-ranked candidate (the split words) is flat mapped to a list of TokenObjs, which are then added to the outTokenList.
  • Java: OneToOneSplitCorrector.AddToFlatMapList
  • Example:

    Input: ...brokenbonecannotsleep...
    Top Candidate: ...broken bone can not sleep...
    Correction: ...broken bone can not sleep...
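
The flat-map step used for both one-to-one and split corrections can be sketched as below. This is an illustrative sketch, not the CSpell source: the Token class is a hypothetical stand-in for TokenObj, the method name only echoes AddToFlatMapList, and inserting a space token between the split words is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these are not the CSpell classes.
class FlatMapSketch {
    // Hypothetical stand-in for TokenObj: it carries just the token string.
    static final class Token {
        final String text;
        Token(String text) { this.text = text; }
    }

    // Appends the top-ranked candidate to outTokenList.
    // A one-to-one candidate contributes a single token; a split candidate
    // is flat mapped into its words, with a space token between neighbours
    // (the space tokens are an assumption of this sketch).
    static void addToFlatMapList(List<Token> outTokenList, Token topCandidate) {
        String[] words = topCandidate.text.split(" ");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                outTokenList.add(new Token(" "));
            }
            outTokenList.add(new Token(words[i]));
        }
    }

    // Joins the token strings back into text.
    static String toText(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> out = new ArrayList<>();
        addToFlatMapList(out, new Token("diagnosed"));                 // one-to-one
        out.add(new Token(" "));
        addToFlatMapList(out, new Token("broken bone can not sleep")); // split
        System.out.println(out.size() + " tokens: " + toText(out));
    }
}
```

Handling both cases in one method means a one-to-one correction is simply a split that yields a single word, which may be why the page lists the same Java method for sections I and II.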

III. Merge

  • Finding: Find the top-ranked candidate (TokenObj)
  • Correction:
    • Update tokens for all MergeObjs
      • Go through all MergeObjs
      • Update the tokens before the start of the target merge
      • Update the merge at the target
    • Add the tokens after the last MergeObj
  • Java: ProcessNonWordMerge.CorrectTokenListByMerge
  • Example:

    Input: ...problemsduringherpregnancies.
    Correction-1: ...problems
    Correction-2: ...problemsduring
    Correction-3: ...problemsduringherpregnancies.

    * MergeObj:

    tarWord | mergeWord | coreMergeWord | mergeNo | tarIndex | startIndex | endIndex | tarPos | startPos | endPos
    • xxxIndex is the index in the original text (including space tokens). It is used in the merge operation to correct the input text.
    • xxxPos is the index in the non-space token list. It is used to find the context for context scores.
    • coreMergeWord is used to handle ending punctuation, such as merging "disap point ment." to "disappointment."
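
The merge correction steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code of ProcessNonWordMerge.CorrectTokenListByMerge: the reduced MergeObj class is hypothetical and keeps only mergeWord, startIndex and endIndex; endIndex is assumed to be inclusive; tokens are plain strings; and the MergeObjs are assumed to be sorted by startIndex and non-overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; these are not the CSpell classes.
class MergeSketch {
    // Hypothetical, reduced MergeObj: only the fields this sketch needs.
    static final class MergeObj {
        final String mergeWord; // replacement text for the merged span
        final int startIndex;   // first token of the span (space tokens counted)
        final int endIndex;     // last token of the span, assumed inclusive
        MergeObj(String mergeWord, int startIndex, int endIndex) {
            this.mergeWord = mergeWord;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    // Builds the corrected token list from the input tokens and the merges.
    // Assumes mergeObjs is sorted by startIndex and the spans do not overlap.
    static List<String> correctTokenListByMerge(List<String> inTokens,
                                                List<MergeObj> mergeObjs) {
        List<String> outTokenList = new ArrayList<>();
        int cur = 0;
        for (MergeObj m : mergeObjs) {
            // Copy the tokens before the start of this merge.
            while (cur < m.startIndex) {
                outTokenList.add(inTokens.get(cur));
                cur++;
            }
            // Emit the merged word in place of the whole span.
            outTokenList.add(m.mergeWord);
            cur = m.endIndex + 1;
        }
        // Copy the tokens after the last MergeObj.
        while (cur < inTokens.size()) {
            outTokenList.add(inTokens.get(cur));
            cur++;
        }
        return outTokenList;
    }

    public static void main(String[] args) {
        // Tokens include space tokens, so the indices follow the xxxIndex convention.
        List<String> in = Arrays.asList("disap", " ", "point", " ", "ment.");
        List<MergeObj> merges = Arrays.asList(new MergeObj("disappointment.", 0, 4));
        System.out.println(String.join("", correctTokenListByMerge(in, merges)));
    }
}
```

The sketch uses only the xxxIndex fields because it rewrites the full token list, space tokens included. The xxxPos fields and coreMergeWord are left out, since the page ties them to context scoring and punctuation handling rather than to rebuilding the text.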