
CSpell

Context Score

Introduction

This page describes the ranking algorithm that uses context to choose the correct word from the suggested candidates for a misspelled word. There are two major approaches:

  • n-gram model:
    An n-gram model (bi-gram or tri-gram) is a simple and straightforward approach. However, we did not implement this model due to time constraints and because various studies show that word embeddings are the state-of-the-art approach (compared to n-gram models). Word embeddings are also simple to use and deliver outstanding performance.
  • word-embedding:
    In CSpell, we chose the Continuous Bag of Words (CBOW) model in word2vec to rank candidates because CBOW is designed to predict a word from its surrounding context.

Components

  • Dual embedding in the continuous bag of words model
    • Program:
      • ${PRE_PROCESS}/RunCorpus: 3, 4, 6 (Best)
      • shell> ${DEV}/DL/word2vec/word2vec/word2vec2 -train ${IN_FILE} -outsyn0 ${SYN_0_FILE} -outsyn1 ${SYN_1_FILE} -outsyn1neg ${SYN_1N_FILE} -size 200 -window 5 -cbow 1 -hs 1 -threads 12
    • Input:
      • ./Crawl/word2Vec/CorpusW2V.data
    • Output:
      • ./Crawl/word2Vec/word2VecNew.syn0 (Input Matrix, word-vec)
      • ./Crawl/word2Vec/word2VecNew.syn1 (Output Matrix)
      • ./Crawl/word2Vec/word2VecNew.syn1n (Output Matrix, with negative sampling, better for prediction)
  • calculate the word vector (word2vec) for the context (see the sketch after this list)
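
The following sketch illustrates the context-vector step: the input-matrix (syn0) vectors of the words around the target position are averaged within a fixed window. It assumes the syn0 vectors have already been loaded into a map; the class and method names are illustrative only and do not come from the CSpell source.

  import java.util.Map;

  // Illustrative sketch (not the CSpell source): build a context vector by
  // averaging the input-matrix (syn0) vectors of the words around the target.
  public class ContextVectorSketch
  {
      public static double[] getContextVector(String[] tokens, int targetPos,
          int window, Map<String, double[]> syn0, int dim)
      {
          double[] contextVec = new double[dim];
          int count = 0;
          for (int i = targetPos - window; i <= targetPos + window; i++)
          {
              // skip positions outside the sentence and the target word itself
              if ((i < 0) || (i >= tokens.length) || (i == targetPos))
              {
                  continue;
              }
              double[] wordVec = syn0.get(tokens[i].toLowerCase());
              if (wordVec == null)
              {
                  continue; // context word has no vector in the corpus model
              }
              for (int d = 0; d < dim; d++)
              {
                  contextVec[d] += wordVec[d];
              }
              count++;
          }
          // average over the context words that have vectors
          if (count > 0)
          {
              for (int d = 0; d < dim; d++)
              {
                  contextVec[d] /= count;
              }
          }
          return contextVec;
      }
  }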

Source Code:

  • RankByContext.java: get the ranked candidate list or the top-ranked candidate by context
  • ContextScore.java: Java object for the context score
  • Word2VecContext.java: Word2Vec context utility to get the context or the context vector
  • Word2VecScore.java: get the score by cosine similarity or dot product (see the sketch after this list)
  • DoubleVecUtil.java: basic vector operations on double values
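
As a rough illustration of the operations behind Word2VecScore.java and DoubleVecUtil.java, the sketch below computes the dot product and the cosine similarity of two double vectors. The class and method names are hypothetical.

  // Illustrative vector operations (hypothetical names, similar in spirit to
  // DoubleVecUtil.java and Word2VecScore.java).
  public class VectorScoreSketch
  {
      // dot product (inner product) of two vectors of equal length
      public static double dotProduct(double[] v1, double[] v2)
      {
          double sum = 0.0;
          for (int i = 0; i < v1.length; i++)
          {
              sum += v1[i] * v2[i];
          }
          return sum;
      }

      // cosine similarity: dot product divided by the product of the norms
      public static double cosineSimilarity(double[] v1, double[] v2)
      {
          double norm1 = Math.sqrt(dotProduct(v1, v1));
          double norm2 = Math.sqrt(dotProduct(v2, v2));
          if ((norm1 == 0.0) || (norm2 == 0.0))
          {
              return 0.0; // treat an all-zero vector as "no information"
          }
          return dotProduct(v1, v2) / (norm1 * norm2);
      }
  }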

Tests:

  • Use the Baseline non-word 1-to-1 and split (development set)
  • Results:

    Performance is reported as counts (correct corrections | total corrections made | total errors in the gold standard) followed by precision | recall | F1. [IM] = input matrix (syn0), [OM] = output matrix.

    • Baseline
      • Software: Baseline, Data (Word Vec): —, Score Method: Cosine
      • Performance: 358|807|774, 0.4436|0.4625|0.4529
      • Notes: Baseline
    • 2-1.c.cos.b
      • Software: CSpell, Data (Word Vec): Baseline, Score Method: Cosine: [IM]
      • Performance: 484|771|774, 0.6278|0.6253|0.6265
    • 2-2.c.cos.0
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: Cosine: [IM]
      • Performance: 443|770|774, 0.5753|0.5724|0.5738
      • Notes: baseline of new corpus
    • 2-3.c.cbow.0-1
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW: [IM] & [OM], syn1 (only use positive scores)
      • Performance: 406|678|774, 0.5988|0.5245|0.5592
      • Notes: Not used, use syn1neg instead
    • 2-4.c.cbow.0-1n.+0-
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW: [IM] & [OM], syn1neg (use only positive (+) scores)
      • Performance: 429|524|774, 0.8187|0.5543|0.6610
    • 2-5.c.cbow.0-1n.+-0!=
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW: [IM] & [OM], syn1neg (rank by +, -, 0)
      • Performance: 505|748|774, 0.6751|0.6525|0.6636
    • 2-6.c.cbow.0-1n.+0-!=
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW: [IM] & [OM], syn1neg (use +, - (only if no +) scores)
      • Performance: 445|554|774, 0.8032|0.5749|0.6702
    • 2-9.c.cbow.0-1n.+0-!=.cos
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      • Performance: 446|554|774, 0.8051|0.5762|0.6717
      • Notes: Best (10% improvement)
    • 2-10.c.cbow.0-1n.+0-!=.cos + fixed LC on W2V
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      • Performance: 457|562|774, 0.8231|0.5904|0.6841
      • Notes: Best (11% improvement)
    • Final
      • Software: CSpell, Data (Word Vec): Health Corpora, Score Method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      • Performance: 458|564|774, 0.8121|0.5917|0.6846
      • Notes: Best (11% improvement)

    A sketch of the [IM] & [OM] scoring used in the best configurations follows the table.
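
The best-performing configurations score a candidate with both embedding matrices: the context vector is built from the input matrix ([IM], syn0) and the candidate is represented by its row in the output matrix with negative sampling ([OM], syn1neg), then the two are compared by cosine similarity. The sketch below shows this scoring step under those assumptions; the names are illustrative, not the actual CSpell API.

  import java.util.Map;

  // Illustrative sketch of the "[IM] & [OM], syn1neg" cosine score: context
  // vector from the input matrix, candidate vector from the output matrix
  // with negative sampling. Names are hypothetical.
  public class DualEmbeddingScoreSketch
  {
      public static double getContextScore(double[] contextVec, String candidate,
          Map<String, double[]> syn1neg)
      {
          // candidate vector comes from the output matrix (word2VecNew.syn1n)
          double[] candidateVec = syn1neg.get(candidate.toLowerCase());
          if (candidateVec == null)
          {
              return 0.0; // no word vector => context score of 0.0 (no information)
          }
          return cosine(contextVec, candidateVec);
      }

      private static double cosine(double[] v1, double[] v2)
      {
          double dot = 0.0;
          double n1 = 0.0;
          double n2 = 0.0;
          for (int i = 0; i < v1.length; i++)
          {
              dot += v1[i] * v2[i];
              n1 += v1[i] * v1[i];
              n2 += v2[i] * v2[i];
          }
          if ((n1 == 0.0) || (n2 == 0.0))
          {
              return 0.0;
          }
          return dot / (Math.sqrt(n1) * Math.sqrt(n2));
      }
  }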

* Word2Vec Score Algorithm:

  • Word2VecScore.java: use the cosine similarity score
  • ContextScoreComparator.java: sort the context scores
  • RankByContext.java:
    • If the top-ranked score != 0: candidate = topRank
      • If the top-ranked score > 0.0 => use it to correct; a larger positive score means the word is closer to the prediction
      • If the top-ranked score < 0.0 => use it to correct; a more negative score means the word is farther from the prediction
    • If the top-ranked score = 0:
      • If there is only 1 candidate => use it to correct, even though we have no Word2Vec score information for the candidate
      • If there are multiple candidates, do not correct.
        A Word2Vec score of 0.0 means we have no information on the candidate, so we cannot tell whether 0.0 is better than a negative score.

    • Context scores may be positive, zero, or negative. A context score of zero means the word does not have a word vector; such a candidate is not preferred over a candidate with a negative score. A sketch of this decision logic follows below.
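
A minimal sketch of this decision rule is shown below. It assumes the candidates are already sorted by context score, best first (as ContextScoreComparator.java would arrange them); the class and field names are hypothetical and not taken from the CSpell source.

  import java.util.List;

  // Illustrative sketch of the RankByContext decision rule described above.
  public class RankByContextSketch
  {
      // Hypothetical holder of a candidate word and its Word2Vec context score.
      public static class Candidate
      {
          public String word;
          public double score;
          public Candidate(String word, double score)
          {
              this.word = word;
              this.score = score;
          }
      }

      // Returns the chosen correction, or null if no correction should be made.
      public static Candidate getTopCandidate(List<Candidate> sortedCandidates)
      {
          if (sortedCandidates.isEmpty())
          {
              return null;
          }
          Candidate topRank = sortedCandidates.get(0);
          if (topRank.score != 0.0)
          {
              // Non-zero top score: use the top-ranked candidate to correct.
              return topRank;
          }
          // Top score of 0.0: no Word2Vec information for the candidate.
          if (sortedCandidates.size() == 1)
          {
              return topRank; // only one choice, use it even without score information
          }
          return null; // multiple candidates with no information: do not correct
      }
  }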