The SPECIALIST Lexicon

Antonym Generation for Tt Model (TtSet)

This program is used for the TtSet, which is the training data set used in 2021 to identify the type of models. Theoretically, it does not need to be re-run annually. In practice, we still run steps 40 and 42-44 from 2022 onward to ensure the quality of this set.

shell> cd ${ANTONYM_DIR}/bin
shell> GetAntonyms ${YEAR}

  • TT model: Training and Test Set
    Option | Description | Input | Output | Notes
    Option 40
      Description:
        • Collect the antonyms in the training and test set and retag their source from [TT] to [CC|SN]
        • TtSet.CollectAntonyms.java (collect antonyms from web source files)
        • TtSet.RetagSrcOnAntRaw.java (retag)
      Input:
        • ${TT_DIR}/input/antonymSource.data (use 2021)
        • ${ML_DIR}/input/3-gram.${YEAR}.30.core (previous_year)
          => Use shell> 06.NGramUtil ${PREV_YEAR}, option 3.
      Output:
        • ./output/PreCand/antonymTtSet.data.TT
        • ./output/PreCand/antonymTtSet.data
      Notes:
        • If this is the first run:
          • shell> mkdir ./output/PreCand
          • link ${ML_DIR}/input/3-gram.${YEAR}.30.core
            => need to run option 3 on ${LMW}/bin/06.NGramUtil ${PREV_YEAR} first
        • Retag [TT] (antonymTtSet.data.TT) to sources of [LEX|SD|PD|CC|SN] (antonymTtSet.data)
        • The 1st (TT) output file should be the same (unless new source files are added). However, the retagged file might be slightly different because of updates to the MEDLINE n-gram set. Thus, this step should be re-run annually.
    Option 41
      • No need for release!
    Option 42
      Description:
        • Get antonym candidates from the TtSet collections
        • TtSet.GenAntCandFromTtSet
      Input:
        • ${TT_DIR}/output/antonymTtSet.data
        • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
        • ${LEX_DIR}/input/inflVars.data
        • ${ANT_DIR}/input/domain.data
      Output:
        • ./output/Cand/antCandTtSet.data
        • ./output/Cand/antCandTtSet.data.tbd
        • ./output/Cand/antCandTtSet.data.tag
        • ./output/candTagged/antCandTtSet.data.tag.tagged
      Notes:
        • The TBD file should be empty (0).
          However, there are two known exceptions for the 1st run:
          post|E0049060|pre|EUI_TBD|noun|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC
          post|E0049061|pre|EUI_TBD|verb|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC
          Convert them to:
          post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC
          post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC
        • Send the TBD file to linguists to tag if it contains anything other than the 2 exceptions above
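The TBD check for option 42 can be sketched in a few lines of Python. This is only an illustration, not part of the Lexicon tools: the helper name `triage_tbd` is hypothetical, and it assumes the TBD file holds one pipe-delimited record per line. It converts the two known exceptions exactly as listed in the notes and collects every other record for the linguists.

```python
# Hypothetical helper; the two mappings are copied from the notes for option 42.
KNOWN_EXCEPTIONS = {
    "post|E0049060|pre|EUI_TBD|noun|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC":
        "post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC",
    "post|E0049061|pre|EUI_TBD|verb|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC":
        "post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC",
}


def triage_tbd(lines):
    """Split TBD records into (converted known exceptions, records for linguists)."""
    converted, for_linguists = [], []
    for raw in lines:
        record = raw.strip()
        if not record:
            continue  # skip blank lines
        if record in KNOWN_EXCEPTIONS:
            converted.append(KNOWN_EXCEPTIONS[record])
        else:
            for_linguists.append(record)
    return converted, for_linguists
```

An empty second list means nothing has to be sent out for tagging.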
    Option 43
      Description:
        • Validate and fix tags of antonym candidates (TT)
        • Antonym.ValidateTaggedCand.java
      Input:
        • ./output/candTagged/antCandTtSet.data.tag.tagged
        • ${ANT_DIR}/input/domain.data
      Output:
        • ./output/candTagged/antCandTtSet.data.tag.fixed
      Notes:
        • Copy antCandTtSet.data.tag.tagged to antCandTtSet.data.tag.tagged.${YEAR}.{NO}
        • Append tagged candidates to antCandTtSet.data.tag.tagged
          post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC
          post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC
        • Run this step until the tagged and fixed files are the same (they should be the same from 2022 onward)
          • The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE].
          • Manually fix the known exceptions (2).
          • Manually copy the fixed file to the tagged file
        • Manually copy antCandTtSet.data.tag.tagged to antCandTtSet.data.tag.tagged.${YEAR}
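The auto-fix and the "run until the files are the same" loop for option 43 might look like the sketch below. It is not the logic of Antonym.ValidateTaggedCand, which may check more than this; it only applies the two substitutions named in the notes ([TYPE_TBD] to [NA], [DOMAIN_TBD] to [DOMAIN_NONE]). The function names are hypothetical, and matching a placeholder only when it fills a whole pipe-delimited field is an assumption.

```python
# Assumed placeholder-to-value mapping, taken from the note on the fixed file.
AUTO_FIXES = {"TYPE_TBD": "NA", "DOMAIN_TBD": "DOMAIN_NONE"}


def auto_fix(record):
    """Replace whole fields equal to a placeholder; leave other fields alone."""
    fields = record.split("|")
    return "|".join(AUTO_FIXES.get(field, field) for field in fields)


def needs_another_pass(tagged_records):
    """True if auto-fixing would change any record, i.e. tagged != fixed."""
    return any(auto_fix(record) != record for record in tagged_records)
```

Once `needs_another_pass` returns False for the tagged file, the tagged and fixed files agree on these two fields and the step does not need to be repeated for them.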
    Option 44
      Description:
        • Update the release antonym tagged file from TT
        • Antonym.UpdateAllTaggedFile
      Input:
        • ./output/candTagged/antCandTtSet.data.tag.tagged.${YEAR}
        • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
        • ${ANT_DIR}/input/domain.data
      Output:
        • ${ANT_DIR}/input/antCand.data.tag.updated
      Notes:
        • This step automatically updates the tag file of all antonym candidates
        • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.TT
        • The output file is used to generate the antonym and negation files for the release.
        • Re-run steps 40-44 until all steps pass
        • From 2023 onward, TT should need only one run to pass steps 40-44, with:
          • no src conflict (0)
          • no tag conflict (0)
          • no new cand (0), because all aPairs from TT are already in the past release!
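The "no new cand (0)" condition can be pictured as a set difference between the TT records and the release records. The sketch below is illustrative only: the function names are hypothetical, and treating the first five pipe-delimited fields as the identity of an antonym pair is an assumption, not something the tools are known to do.

```python
def pair_key(record, n_key_fields=5):
    """Assumed identity of an antonym pair: the leading pipe-delimited fields."""
    return tuple(record.strip().split("|")[:n_key_fields])


def new_candidates(tt_records, release_records):
    """Return TT records whose pair key does not occur in the release records."""
    released = {pair_key(record) for record in release_records}
    return [record for record in tt_records if pair_key(record) not in released]
```

An empty result corresponds to the expected outcome of 0 new candidates.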