Inclusive Filter: EndWord pattern
I. Introduction
N-grams (terms) that end with certain words, such as "syndrome", "acid", or "disease", have a high probability of being valid multiwords. These n-grams are retrieved as LMW candidates. The top (33+) most frequent endWords are retrieved from the Lexicon, excluding numbers (1, 2, 3, I, II) and single-character words (A, B). This endWord list is used by default to test this matcher.
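The endWord selection described above can be illustrated with a minimal sketch. It assumes the Lexicon terms are available as plain strings; the names `is_excluded`, `top_end_words`, `ROMAN_NUMERALS`, and the sample terms are hypothetical and are not part of the actual tool. The Roman numeral check covers only I through X, which is an assumption made for illustration.

```python
from collections import Counter

# Assumption: a small explicit set of Roman numerals (I-X) stands in for
# the "numbers (1, 2, 3, I, II)" exclusion rule.
ROMAN_NUMERALS = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}


def is_excluded(word):
    """Return True for numbers, Roman numerals, and single-character words."""
    w = word.lower()
    return len(w) == 1 or w.isdigit() or w in ROMAN_NUMERALS


def top_end_words(terms, top_n=33):
    """Count the last word of each multiword term and return the most frequent."""
    counts = Counter()
    for term in terms:
        words = term.split()
        if len(words) < 2:
            continue  # only multiwords are analyzed here
        end_word = words[-1].lower()
        if not is_excluded(end_word):
            counts[end_word] += 1
    return counts.most_common(top_n)


# Hypothetical sample terms, not taken from the Lexicon.
sample_terms = [
    "Down syndrome", "Marfan syndrome", "acetic acid", "folic acid",
    "Lyme disease", "hepatitis B", "type 2", "factor II", "syndrome",
]
print(top_end_words(sample_terms, top_n=3))
```

In this sketch, "hepatitis B", "type 2", and "factor II" contribute nothing because their end words are excluded, and the single word "syndrome" is skipped because it is not a multiword.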
II. Procedure
The following procedures are used to find valid multiwords from n-grams by endWord pattern:
shell> 10.MatcherEndWord
Step | Description | Inputs | Outputs | Notes |
---|---|---|---|---|
preProcess - Get EndWord from Lexicon | | | | |
1 | Analyze EndWord pattern in LMW (only multiwords) | | | Find high frequency endword in Lexicon |
2 | EndWordMatcher | | | Check if a term contains an endWord |
3 | Test Matcher-EndWord in Lexicon | | | Similar to Step 1, but including single words |
Process | | | | |
10 | Apply Matcher-EndWord on nGram | | ./Distilled/${END_WROD}; ./Whole/${END_WROD} | Get N-gram that matches the specified endWords |
11 | Tag and get Stats for LMW candidates | | | |
12 | Sort by reversed string - LMWs candidates | | | Sort the results by the reversed string (last character first) |
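The matching and sorting steps (2, 10, and 12) could look roughly like the following sketch. It assumes that "contains an endWord" means the last whitespace-separated word of the term is in the endWord list; the function names, the `END_WORDS` set, and the sample n-grams are hypothetical and do not reflect the actual implementation.

```python
def match_end_word(term, end_words):
    """Return True if the last word of the term is one of the endWords."""
    words = term.lower().split()
    return bool(words) and words[-1] in end_words


def sort_by_reversed_string(terms):
    """Sort terms by their reversed string (last character first)."""
    return sorted(terms, key=lambda t: t[::-1])


# Hypothetical endWords and n-grams for illustration only.
END_WORDS = {"syndrome", "acid"}
ngrams = [
    "folic acid", "acid rain", "Down syndrome",
    "amino acid", "the syndrome of",
]

candidates = [t for t in ngrams if match_end_word(t, END_WORDS)]
sorted_candidates = sort_by_reversed_string(candidates)
print(sorted_candidates)
# prints ['folic acid', 'amino acid', 'Down syndrome']
```

Sorting by the reversed string groups candidates that share the same endWord next to each other, which makes the tagged list easier to review.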
III. End_Word List
For now, only two endWords are selected for tagging. We will test more from the high-frequency list derived from Step 2.