CSpell

Analysis - Impact Factors

The approach is to decouple the impact factors from one another, then tune each factor to reach the best-performing implementation of CSpell. Impact factors include (but are not limited to):

  • Dictionary (spelling error detection and spelling correction suggestion)
  • Maximum edit distance allowed for candidates
  • Ranking score system (method and weight)

These factors have complicated relationships with each module. The impact factors associated with each module are summarized as follows:

Module | Algorithm - Factors
Pre-Correction (not dictionary based):
  • Rule-based algorithm
    • Patterns observed from Lexicon
    • Developed algorithm for all patterns
Dictionary based correction:
Spelling Checker
  • IsValidWord (Check Dictionary)
    • check word
    • check core-term
    • check possessive
    • check slash used as "or" (case/test)
    • Check parenthetic plural forms (s), (es), (ies)

    • Dictionary should include words:
      • IsSpVar
      • IsProperNoun
      • IsAbbAcr
  • IsException
    • IsDigit
    • IsPunc
    • IsDigitPunc
    • IsUrl
    • IsEmail
    • IsEmptyString
    • IsMeasurements (Unit)
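
The spelling checker factors above can be sketched as two predicates. This is a minimal illustration and not CSpell's actual code: the function names, the toy `DICTIONARY`, the regular expressions, and the unit list are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary; a real check dictionary is far larger.
DICTIONARY = {"case", "test", "patient", "doctor", "box"}

def is_exception(token: str) -> bool:
    """True when the token should be skipped rather than spell-checked."""
    if token == "":                                                # IsEmptyString
        return True
    if re.fullmatch(r"[\d\W_]+", token):                           # IsDigit, IsPunc, IsDigitPunc
        return True
    if re.fullmatch(r"(https?://|www\.)\S+", token, re.I):         # IsUrl
        return True
    if re.fullmatch(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", token):         # IsEmail
        return True
    if re.fullmatch(r"\d+(\.\d+)?(mg|ml|kg|cm|mm)", token, re.I):  # IsMeasurements (assumed units)
        return True
    return False

def is_valid_word(token: str) -> bool:
    """True when the token, or a recognized variant of it, is in the dictionary."""
    word = token.lower()
    if word in DICTIONARY:                                   # check word
        return True
    plural = re.fullmatch(r"(\w+)\((s|es|ies)\)", word)      # parenthetic plural: box(es)
    if plural and plural.group(1) in DICTIONARY:
        return True
    if "/" in word and all(p in DICTIONARY for p in word.split("/")):  # slash: case/test
        return True
    possessive = re.fullmatch(r"(\w+)('s|s')", word)         # possessive: doctor's, doctors'
    if possessive and possessive.group(1) in DICTIONARY:
        return True
    core = word.strip('.,;:!?"()[]')                         # core-term: drop edge punctuation
    return core in DICTIONARY
```

A token would be flagged as a spelling error only when both predicates return False, for example `pateint`.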
Candidates: 1-to-1
  • Possibility: Edit Distance (<= 2)
  • IsDicWord (Suggestion Dictionary)
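
As a rough sketch of the two 1-to-1 factors, the code below generates every string within a given edit distance and keeps only those found in a suggestion dictionary. The names, the toy dictionary, and the choice to count a transposition of adjacent letters as a single edit are assumptions for illustration; the source does not specify which edit distance variant CSpell uses.

```python
import string

# Hypothetical toy suggestion dictionary.
SUGGESTION_DICTIONARY = {"patient", "patent", "parent", "doctor"}

def edits1(word: str) -> set:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def one_to_one_candidates(word: str, max_distance: int = 2) -> set:
    """Dictionary words reachable from word within max_distance edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Expand one more edit step, ignoring strings already reached.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return (seen - {word}) & SUGGESTION_DICTIONARY
```

Raising `max_distance` widens recall at the cost of many more candidates to rank, which is why the edit distance limit is treated as an impact factor in its own right.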
Candidates: Split
  • Possibility: Number of Splits
  • IsMultiword: (Multiword Dictionary)
  • IsDicWord: (Suggestion Dictionary)
  • IsAbb/Acr + length: (Abb/Acr Dictionary, exclude short Abb/Acr)
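
The split factors might be combined as in the following sketch, which tries up to a fixed number of split points and accepts a split when the whole phrase is a known multiword or every part is acceptable. The dictionaries, the function names, and the minimum abbreviation length of 3 are hypothetical values chosen for the example.

```python
from itertools import combinations

# Hypothetical toy resources.
SUGGESTION_DICTIONARY = {"a", "lot", "in", "fact", "of", "course"}
MULTIWORDS = {"a lot", "in fact", "of course"}
ABB_ACR = {"mg", "nih"}
MIN_ABB_ACR_LENGTH = 3  # assumed cutoff for excluding short abbreviations/acronyms

def is_valid_part(part: str) -> bool:
    """A split part is valid if it is a dictionary word or a long enough Abb/Acr."""
    if part in ABB_ACR:
        return len(part) >= MIN_ABB_ACR_LENGTH
    return part in SUGGESTION_DICTIONARY

def split_candidates(word: str, max_splits: int = 2) -> list:
    """Space-separated splits of word using at most max_splits split points."""
    results = []
    for n in range(1, max_splits + 1):
        for cuts in combinations(range(1, len(word)), n):
            bounds = (0, *cuts, len(word))
            parts = [word[a:b] for a, b in zip(bounds, bounds[1:])]
            phrase = " ".join(parts)
            if phrase in MULTIWORDS or all(is_valid_part(p) for p in parts):
                results.append(phrase)
    return results
```

For instance, `split_candidates("alot")` yields `["a lot"]` with these toy dictionaries, while a part such as `mg` is rejected because it falls under the assumed length cutoff.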
Candidates: Merge
  • TBD: most merge cases are not typos and involve real-word correction
Ranking: Orthographic
  • EditDistance
  • Phonetic (Metaphone 2)
  • Overlap

  • Weights
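
One way to picture how the orthographic factors and their weights interact is a weighted sum of three similarity scores, sketched below. Everything here is an assumption for illustration: the weight values, the normalization, the plain Levenshtein distance, and in particular `crude_phonetic_key`, which is a simple stand-in and not an implementation of Metaphone 2.

```python
# Hypothetical weights; tuning them is part of the analysis described here.
WEIGHTS = {"edit": 1.0, "phonetic": 0.7, "overlap": 0.8}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def crude_phonetic_key(word: str) -> str:
    """Stand-in phonetic code: first letter plus later consonants, repeats collapsed."""
    key = word[:1]
    for c in word[1:]:
        if c not in "aeiouy" and key[-1:] != c:
            key += c
    return key

def overlap_similarity(a: str, b: str) -> float:
    """Shared leading plus trailing characters, relative to the longer word."""
    longest, shortest = max(len(a), len(b)), min(len(a), len(b))
    if longest == 0:
        return 1.0
    lead = 0
    while lead < shortest and a[lead] == b[lead]:
        lead += 1
    trail = 0
    while trail < shortest - lead and a[-1 - trail] == b[-1 - trail]:
        trail += 1
    return (lead + trail) / longest

def orthographic_score(source: str, candidate: str, weights=WEIGHTS) -> float:
    """Weighted sum of the three orthographic similarities."""
    phonetic = edit_similarity(crude_phonetic_key(source), crude_phonetic_key(candidate))
    return (weights["edit"] * edit_similarity(source, candidate)
            + weights["phonetic"] * phonetic
            + weights["overlap"] * overlap_similarity(source, candidate))

def rank(source: str, candidates: list) -> list:
    """Candidates sorted from highest to lowest orthographic score."""
    return sorted(candidates, key=lambda c: orthographic_score(source, c), reverse=True)
```

With this shape, changing a single weight changes the ordering produced by `rank`, so each weight can be varied in isolation while the others are held fixed.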
Ranking: Frequency
  • TBD
Ranking: Context
  • TBD

In this analysis, tokenized words are split into two groups: those with annotations and those without. The analysis reports are described in the results section.