Lexical Tools

Results of optimized set

I. The optimized set
As a result, we conclude that case 10.1 is the final optimized set of SD-rules for the Lexicon 2015 corpus. It includes 76 (out of 101) SD-rules and reaches:

system accuracy rate: 95.22%
system coverage rate: 95.70%
system performance: 1.9093

This set of SD-rules is expected to reach the same system performance when applied to other English corpora, under the assumption that:

the characteristics of derivations are consistent between Lexicon and the working general English domain.
Lexicon is considered a representative subset (in terms of derivations) of general English. Please refer to future work for a discussion of this assumption.

II. The methodology
This approach finds the best set of SD-rules from a set of known candidate SD-rules. In theory, a complete set of SD-rules can be obtained as more SD-rules are evaluated and added. This methodology provides a systematic approach to:

measure system performance
evaluate new SD-rules
obtain the set of SD-rules according to a user-specified target minimum accuracy rate (system performance)
choose among parent-child SD-rules to reach the maximum system precision and recall rates.
- In general, a parent rule has higher recall, while a child rule has higher precision.
- This method provides a good way to choose between a parent rule and its child rule(s).
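
The selection against a target minimum accuracy rate can be illustrated with a minimal sketch. It is not the Lexical Tools implementation: the `Rule` type, the `select_rules` function, and the per-rule counts are hypothetical. It also assumes that system precision is the number of valid derivational pairs divided by all generated pairs, that system recall is the number of valid pairs divided by all known valid pairs, and that rules do not overlap. Rules are added from the most precise to the least precise until adding the next one would push the cumulative precision below the target.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical per-rule counts; not a Lexical Tools data structure."""
    name: str
    valid: int      # pairs generated by the rule that are valid derivations
    generated: int  # all pairs generated by the rule


def select_rules(rules, total_valid, target=0.95):
    """Greedily keep rules while cumulative precision stays >= target.

    Assumes rules do not overlap, so their counts can simply be summed.
    Returns (selected rule names, system precision, system recall).
    """
    # Most precise rules first, so precision falls and recall rises as we add.
    ranked = sorted(rules, key=lambda r: r.valid / r.generated, reverse=True)
    selected = []
    valid = generated = 0
    for rule in ranked:
        new_valid = valid + rule.valid
        new_generated = generated + rule.generated
        if new_valid / new_generated < target:
            break  # adding this rule would drop precision below the target
        selected.append(rule.name)
        valid, generated = new_valid, new_generated
    precision = valid / generated if generated else 0.0
    recall = valid / total_valid if total_valid else 0.0
    return selected, precision, recall
```

In this sketch a parent rule and its child rules would simply be entered as separate candidates; comparing the resulting precision and recall with either the parent or the children included is one way to choose between them.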

III. The target precision and recall rate (95%)

The system precision and system recall curves of the final set intersect (the optimization point) at 95%. To reduce noise, we also smoothed both curves with a simple moving average over window sizes of 3, 5, and 7 rules, and found that the intersections are around 95% in all cases (see diagram below). Smoothing the data set captures its characteristics while leaving out noise. Accordingly, our target minimum accuracy rate of 95% is a good choice for obtaining an optimized (near-optimal) set of SD-rules.
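
The smoothing step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the curves are assumed to be lists with one value per added rule, and the window is assumed to slide over consecutive points (the source does not say whether the average is trailing or centered).

```python
def simple_moving_average(values, window):
    """Average each run of `window` consecutive values.

    Returns len(values) - window + 1 smoothed points.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def first_crossing(precision, recall):
    """Return the first index where recall reaches or passes precision.

    Assumes precision falls and recall rises as rules are added.
    Returns None if the curves never cross.
    """
    for i, (p, r) in enumerate(zip(precision, recall)):
        if r >= p:
            return i
    return None
```

Applying `simple_moving_average` to both curves with windows of 3, 5, and 7, and then calling `first_crossing` on each smoothed pair, gives one intersection estimate per window size that can be compared with the unsmoothed one.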