Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for parent rules

  • DIR: ${SUFFIXD_DIR}
  • Inputs:
    • Prepare directory:
      shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
      shell> mkdir decompose.40.25 (40: min. local occurrence rate; 25: min. local coverage (recall) rate)
      shell> ln -sf ./decompose.40.25 decompose
    • Get all SD-pairs (corpus)
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
      shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
      shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
      shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
    • Decompose the parent rules one by one:
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
      • copy from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.${YEAR}.relation.children.only.rpt .
      • Remove a line if its rule is both a child and a parent rule (|CHILD => appears in the first part of the relationship).
      • Change the format to suffix-1|pos-1|suffix-2|pos-2 by removing the rest of the line.
      • A total of 17 candidate parent SD-Rules are to be evaluated. These rules were evaluated previously (they may not need to be reevaluated).
      • Go through the 17 rules one by one by commenting out (#) the rest.
      • The results may differ slightly between years, but the principle is the same.
    • Also, new rules and their child rules need to be evaluated.
      As a matter of fact, the parent-child evaluation and optimization can be simplified as follows:
      • Run GetSdRule ${YEAR} 7 as below to get all good candidate child rules.
      • Compare their precision and coverage rates with those of the root parent rules. They must be greater (F1) than the parent rules' so that the overall F1 improves (checked by running GetSdRule ${YEAR} 1).
      • We used this simplified method in 2020 to evaluate 11 new rules, and the result was 22.1 ysis$|noun|yze$|verb, which reaches the optimum SD-Rule set.
  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    7
    40 (min. occurrence rate - for decompose)
    => Need to have enough coverage for further decomposition on child rules
    25 (35) (min. coverage rate - for candidate child)
    => Need to have enough coverage to be a qualified child rule
  • Outputs:
    • sdRules.decompose.out

      A child rule must have a higher accuracy rate (precision) than the root parent rule and must meet the min. coverage rate (recall, default 25%). Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are the child rules that match the following criteria:

      • the accuracy rate (precision) is higher than parent-rule
      • the coverage rate (recall) is higher than 25% (or the specified number)
    • shell>mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
    • such as shell>mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
  • Continue to the next step, evaluating the parent-child rules, and repeat the whole process for all parent rules.
  • Update the optimization log while going through this process.
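
The candidate criteria in step I can be sketched as a small shell function. This is only an illustration: is_candidate_child is a hypothetical helper, not part of GetSdRule, and it assumes that "meets the min. coverage rate" means greater than or equal to the threshold.

```shell
# Hypothetical helper (not part of GetSdRule).
# Usage: is_candidate_child PARENT_PRECISION CHILD_PRECISION CHILD_COVERAGE [MIN_COVERAGE]
# All arguments are percentages without the % sign; MIN_COVERAGE defaults to 25.
# Exit status 0 means the child rule qualifies as a candidate.
is_candidate_child() {
    parent_precision=$1
    child_precision=$2
    child_coverage=$3
    min_coverage=${4:-25}
    # awk does the floating-point comparison; adding 0 forces numeric context.
    awk -v pp="$parent_precision" -v cp="$child_precision" \
        -v cc="$child_coverage" -v mc="$min_coverage" \
        'BEGIN { exit !((cp + 0 > pp + 0) && (cc + 0 >= mc + 0)) }'
}
```

For example, is_candidate_child 99.08 99.95 40 succeeds for a child rule with 99.95% precision under a parent with 99.08% precision, given a hypothetical coverage of 40%. Passing 35 as the fourth argument applies the stricter coverage threshold mentioned in the program inputs.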

II. Replace parent rules by selected candidate child SD-Rules for optimized set

  • DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
    • Create a new directory
      shell>mkdir ${NO}.${RULE}
      shell>mkdir 01.X-ally
  • Inputs:
    • Update sdRules.stats.in by replacing the first parent rule with its candidate child rules
      shell>cd 01.X-ally
      shell>cp ../00.baseline/sdRules.stats.in .
      => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file under the associated parent rule
      Update the following in sdRules.stats.in:
      • Find the rank (32) of associated parent rule from the baseline
      • Change the rank (1st field) to 321 (the original rank, 32, followed by the child level, 1)
      • Move 9th field - accuracy rate (precision) to 2nd field
      • Add 0 to 6th field (tbd no.)
      • Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
      • Comment out (#) the parent/child rules that are not under test
      • The new edited file looks like:
        #32|99.08%|2073|2054|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
        321|99.08%|2073|2054|19|0|c$|adj|cally$|adv|2020|DECOMPOSE|CHILD
        #322|99.95%|1950|1949|1|0|ic$|adj|ically$|adv|2020|DECOMPOSE|CHILD
        
  • Program - Get the optimal Set:
    shell> mv sdRules.stats.in sdRules.stats.in.01.1
    shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in


    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    01.X-ally
    53440 <= total Yes from baseline

  • Outputs directory:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally
    -- Optimum SD-Rules: 92|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.05%|94.26%|1.8931|50371|52993
    

    Rename the HTML file:

    • shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
    • shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
    • Update the optimization log file
  • Repeat this process for all generations of candidate child rules of the same parent rule.
  • Repeat this process for all parent rules (using the best sdRules.stats.in)
  • See the optimization log for the details of each optimization step.
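
When editing sdRules.stats.in by hand, a quick consistency check on the edited lines may help. The sketch below is an illustration, not part of the SD-Rule tool set: check_rule_precision and child_rank are hypothetical helpers, and the field meanings (field 3 = retrieved No., field 4 = relevant No., field 5 = their difference) are inferred from the example lines above rather than taken from a format specification.

```shell
# Hypothetical helper: build a child rank from the parent rank and the child
# number, e.g. 32 and 1 give 321 (an assumed reading of the ranking convention).
child_rank() {
    printf '%s%s\n' "$1" "$2"
}

# Hypothetical helper: read rule lines on stdin and recompute the precision
# (field 2) from the counts. Prints "<rank> <recomputed precision> OK|MISMATCH".
# Assumption: field 3 = retrieved No., field 4 = relevant No., field 5 = field 3 - field 4.
check_rule_precision() {
    awk -F'|' '
        NF < 5 || $3 + 0 == 0 { next }      # skip blank or incomplete lines
        {
            rank = $1
            sub(/^[ \t]*#?/, "", rank)      # drop indentation and the comment marker
            recomputed = sprintf("%.2f%%", 100 * $4 / $3)
            status = (recomputed == $2 && $3 - $4 == $5) ? "OK" : "MISMATCH"
            printf "%s %s %s\n", rank, recomputed, status
        }'
}
```

For example, piping the parent line #32|99.08%|2073|2054|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT into check_rule_precision prints 32 99.08% OK, because 2054/2073 rounds to 99.08% and 2073 - 2054 = 19.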

III. Results

Please refer to the optimization log for the details of each step of the parent-child rule optimization process.

The final optimized set includes 130 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 93 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 94.48% system (accumulated) recall, for a system performance of 1.8948. The total number of valid instances is 53440.

-- Total line no: 168
-- Total comment no: 38
-- Total Sd-Rule no: 130
---------------------------------------
-- Optimum SD-Rules: 93|63.02%|192|121|71|0|ar$|adj|e$|noun|2013|ORG_RULE|SELF|95.00%|94.48%|1.8948|50488|53145
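
The accumulated figures above can be reproduced with a short pipeline. The following is a minimal sketch under stated assumptions, not the GetSdRule implementation: accumulate_sd_rules is a hypothetical helper, field 3 is taken as the retrieved No. and field 4 as the relevant No., ties in precision are broken by the larger retrieved No., and the system performance is taken as accumulated precision plus accumulated recall, which is what the reported numbers add up to.

```shell
# Hypothetical helper: read rule lines (sdRules.stats.in format) on stdin and
# print, for each rule in sorted order:
#   position|original rank|accumulated precision|accumulated recall|performance
# $1 is the total number of valid (Yes) instances, e.g. 53440.
accumulate_sd_rules() {
    # Drop indentation, commented-out rules, and lines without usable counts.
    awk -F'|' '{ sub(/^[ \t]+/, "") } NF >= 5 && $0 !~ /^#/ && $3 > 0' |
    # Sort by precision (descending), then by retrieved No. (descending).
    LC_ALL=C sort -t'|' -k2,2nr -k3,3nr |
    awk -F'|' -v total="$1" '
        {
            retrieved += $3
            relevant += $4
            precision = relevant / retrieved    # accumulated precision
            recall = relevant / total           # accumulated recall
            printf "%d|%s|%.2f%%|%.2f%%|%.4f\n", NR, $1, 100 * precision, 100 * recall, precision + recall
        }'
}
```

Assuming the last two fields of the Optimum SD-Rules line are the accumulated relevant and retrieved numbers, the reported values are consistent with this reading: 50488/53145 gives 95.00%, 50488/53440 gives 94.48%, and the two rates sum to 1.8948.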

IV. Post-Process

Generate the SD-Rule trie from this optimized set (93 of 130 rules) for Lexical Tools SD-Rule generation (TBD).