Core-term
I. Introduction
Many n-grams have punctuation at the beginning and/or the end, for example:
Input Term | Core-term |
---|---|
- in details, | in details |
- in details | in details |
in details, | in details |
in (5) details, | in (5) details |
(in (5) details, | in (5) details |
(in (5) details), | in (5) details |
All of the n-grams above are normalized to "in details" or "in (5) details" by stripping the leading and/or ending punctuation. The normalized term is called the core-term, that is, the core of the term. This process is called core-term normalization.
A core-term may retain internal punctuation, as in "in (5) details". Leading and/or ending punctuation may also remain in a core-term, as in "clean room(s)".
II. Algorithm
Recursively repeat the following process until the term no longer changes or its length = 0:
Strip ending punctuation:
ASCII | -)}]_!@#%&*\\:;\"',.?/~+=|>$`^ |
Unicode | ¦§»‐‑‒–—―’”•․‥…⁈ |
Strip leading punctuation:
ASCII | -({[_!@#%&*\\:;\"',.?/~+=|>$`^ |
Unicode | ¦§«‐‑‒–—―‘“•․‥…⁈ |
Strip brackets:
ASCII | (), [], {}, <> |
Unicode | «»‘’“” |
* net bracket no = total number of left brackets - total number of right brackets
For example,
Term | Net Bracket No |
---|---|
(in details:) | 0 |
(in (5) details:) | 0 |
(in (5) details | 1 |
in (5) details) | -1 |
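The net bracket number in the table above can be computed with a simple count. The following is a minimal sketch, not the tool's own code: the function name `net_bracket_no` is hypothetical, and it assumes that every character in the ASCII and Unicode bracket lists above counts as a left or right bracket.

```python
# Bracket characters taken from the ASCII and Unicode bracket lists
# (assumption: each listed character counts as a bracket).
LEFT_BRACKETS = set("([{<«‘“")
RIGHT_BRACKETS = set(")]}>»’”")


def net_bracket_no(term: str) -> int:
    """Return (number of left brackets) - (number of right brackets)."""
    left = sum(1 for ch in term if ch in LEFT_BRACKETS)
    right = sum(1 for ch in term if ch in RIGHT_BRACKETS)
    return left - right


# Reproduce the rows of the table above.
for term in ["(in details:)", "(in (5) details:)", "(in (5) details", "in (5) details)"]:
    print(f"{term} | {net_bracket_no(term)}")
```

A positive value means the term has more left brackets than right brackets, a negative value the opposite, and 0 means the counts are equal.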
III. Examples
Input nGram | Core-term |
---|---|
Strip punctuation | |
-in details | in details |
In details: | In details |
#$%IN DETAILS:%^( | IN DETAILS |
( | |
() | |
Strip brackets | |
{in (5) details} | in (5) details |
{{in (5) details} | in (5) details |
{in (5) details}} | in (5) details |
{in (5)} details}} | {in (5)} details |
Strip brackets and punctuation | |
(in details:) | in details |
(in details:)) | in details |
(-(in details)%^) | in details |
{in (5) days}, | in (5) days |
in (5 days), | in (5 days) |
in ((5) days), | in ((5) days) |
((clean room(s))) | clean room(s) |
((inch(es))) | inch(es) |
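The examples above can be reproduced with a short sketch. This is not the tool's implementation: instead of the net bracket number, it uses a swapped-in technique, stack-based bracket matching, to decide which outer brackets are unbalanced or enclose the whole term. The names `core_term` and `match_brackets` are hypothetical, and the plain punctuation set is assumed to be the listed ASCII and Unicode characters with the bracket characters removed.

```python
# Bracket pairs from the Algorithm section. PUNCT is the listed punctuation
# with bracket characters removed (an assumption of this sketch).
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">", "«": "»", "‘": "’", "“": "”"}
OPENERS = set(PAIRS)
BRACKETS = OPENERS | set(PAIRS.values())
PUNCT = "-_!@#%&*\\:;\"',.?/~+=|$`^¦§‐‑‒–—―•․‥…⁈"
STRIPPABLE = PUNCT + " "  # also trim spaces left behind at either end


def match_brackets(term):
    """Map the index of each matched bracket to the index of its partner."""
    partner, stack = {}, []
    for i, ch in enumerate(term):
        if ch in OPENERS:
            stack.append(i)
        elif ch in BRACKETS and stack and PAIRS[term[stack[-1]]] == ch:
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def core_term(term):
    """Strip outer punctuation and unbalanced or enclosing brackets."""
    while term:
        previous = term
        term = term.strip(STRIPPABLE)  # non-bracket punctuation at both ends
        if not term:
            break
        partner = match_brackets(term)
        last = len(term) - 1
        if partner.get(0) == last:
            term = term[1:-1]  # one pair encloses the whole term
        elif term[-1] in BRACKETS and last not in partner:
            term = term[:-1]   # unbalanced bracket at the end
        elif term[0] in BRACKETS and 0 not in partner:
            term = term[1:]    # unbalanced bracket at the start
        if term == previous:   # nothing changed: normalization is done
            break
    return term


for ngram in ["#$%IN DETAILS:%^(", "{in (5)} details}}", "((clean room(s)))"]:
    print(f"{ngram} | {core_term(ngram)}")
```

Matching brackets by position, rather than only counting them, is what lets this sketch keep the leading "{" in "{in (5)} details}}": that bracket pairs with the "}" after "(5)", so only the two unbalanced trailing brackets are removed.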