Design: Tokenize & Reverse Token for Strip
Let's use the same example: strip the stop words from the string "Check-in four words (in, the, top, of) are checked".
Algorithms:

Tokenize the input string into tokens and delimiters:

| Index | Token String | Token Type |
| --- | --- | --- |
| 1 | [Check-in] | Token |
| 2 | [ ] | Space Delimiter |
| 3 | [four] | Token |
| 4 | [ ] | Space Delimiter |
| 5 | [words] | Token |
| 6 | [ ] | Space Delimiter |
| 7 | [(] | Restore Delimiter |
| 8 | [in] | Token |
| 9 | [,] | Strippable Delimiter |
| 10 | [ ] | Space Delimiter |
| 11 | [the] | Token |
| 12 | [,] | Strippable Delimiter |
| 13 | [ ] | Space Delimiter |
| 14 | [top] | Token |
| 15 | [,] | Strippable Delimiter |
| 16 | [ ] | Space Delimiter |
| 17 | [of] | Token |
| 18 | [)] | Striping Delimiter |
| 19 | [ ] | Space Delimiter |
| 20 | [are] | Token |
| 21 | [ ] | Space Delimiter |
| 22 | [checked] | Token |
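
The tokenization step above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `tokenize` and the `DELIMITER_TYPES` mapping are hypothetical, and the delimiter classes are inferred only from the four delimiter characters that appear in the example table. Note that the hyphen is treated as part of a token, since the table keeps "Check-in" whole.

```python
from typing import List, Tuple

# Assumed delimiter classes, inferred only from the example table above.
DELIMITER_TYPES = {
    " ": "Space Delimiter",
    ",": "Strippable Delimiter",
    "(": "Restore Delimiter",
    ")": "Striping Delimiter",  # type name spelled as in the table
}


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into (string, type) pairs, keeping every delimiter."""
    tokens: List[Tuple[str, str]] = []
    word = ""
    for ch in text:
        if ch in DELIMITER_TYPES:
            if word:  # flush the word collected so far
                tokens.append((word, "Token"))
                word = ""
            tokens.append((ch, DELIMITER_TYPES[ch]))
        else:
            word += ch
    if word:  # flush a trailing word
        tokens.append((word, "Token"))
    return tokens


tokens = tokenize("Check-in four words (in, the, top, of) are checked")
for index, (string, kind) in enumerate(tokens, start=1):
    print(index, f"[{string}]", kind)
```

Because every delimiter is kept as its own entry, joining the strings of the list in order reproduces the original input exactly.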

Mark each stop word (in, the, of) as Stripped:

| Index | Token String | Token Type |
| --- | --- | --- |
| 1 | [Check-in] | Token |
| 2 | [ ] | Space Delimiter |
| 3 | [four] | Token |
| 4 | [ ] | Space Delimiter |
| 5 | [words] | Token |
| 6 | [ ] | Space Delimiter |
| 7 | [(] | Restore Delimiter |
| 8 | [ ] | Stripped |
| 9 | [,] | Strippable Delimiter |
| 10 | [ ] | Space Delimiter |
| 11 | [ ] | Stripped |
| 12 | [,] | Strippable Delimiter |
| 13 | [ ] | Space Delimiter |
| 14 | [top] | Token |
| 15 | [,] | Strippable Delimiter |
| 16 | [ ] | Space Delimiter |
| 17 | [ ] | Stripped |
| 18 | [)] | Striping Delimiter |
| 19 | [ ] | Space Delimiter |
| 20 | [are] | Token |
| 21 | [ ] | Space Delimiter |
| 22 | [checked] | Token |
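
Marking the stop words can be sketched as below. The function name `mark_stop_words`, the case-insensitive comparison, and the stop word list passed in are assumptions made for illustration. The table shows a stripped slot as `[ ]`; this sketch stores an empty string there, which is also an assumption.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]


def mark_stop_words(tokens: List[Token], stop_words: Iterable[str]) -> List[Token]:
    """Replace each stop word token with a placeholder of type 'Stripped'."""
    stop = {w.lower() for w in stop_words}
    marked: List[Token] = []
    for string, kind in tokens:
        if kind == "Token" and string.lower() in stop:
            marked.append(("", "Stripped"))  # keep the slot, drop the word
        else:
            marked.append((string, kind))
    return marked


# Rows 7 to 18 of the tokenized example.
fragment = [
    ("(", "Restore Delimiter"),
    ("in", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("the", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("top", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("of", "Token"),
    (")", "Striping Delimiter"),
]
marked = mark_stop_words(fragment, ["in", "the", "of"])
for string, kind in marked:
    print(f"[{string}]", kind)
```

Keeping a placeholder entry, rather than deleting the stop word outright, preserves the index positions so a later pass can decide which neighbouring delimiters to remove.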
If we compose the string from this list, the output string is:
"Check-in four words ( , , top, ) are checked"

Table-1

| Index | Token String | Token Type | Clean Up Action |
| --- | --- | --- | --- |
| 1 | [Check-in] | Token | keep: token |
| 2 | [ ] | Space Delimiter | keep |
| 3 | [four] | Token | keep |
| 4 | [ ] | Space Delimiter | keep |
| 5 | [words] | Token | keep |
| 6 | [ ] | Space Delimiter | keep |
| 7 | [(] | Restore Delimiter | keep |
| 8 | [ ] | Stripped | strip: stripped type |
| 9 | [,] | Strippable Delimiter | strip: Strippable Delimiter |
| 10 | [ ] | Space Delimiter | strip: last type is stripped |
| 11 | [ ] | Stripped | strip: stripped type |
| 12 | [,] | Strippable Delimiter | strip: Strippable Delimiter |
| 13 | [ ] | Space Delimiter | strip: last type is stripped |
| 14 | [top] | Token | keep |
| 15 | [,] | Strippable Delimiter | strip: conflict char |
| 16 | [ ] | Space Delimiter | strip: conflict char |
| 17 | [ ] | Stripped | strip: stripped type |
| 18 | [)] | Striping Delimiter | keep |
| 19 | [ ] | Space Delimiter | keep |
| 20 | [are] | Token | keep |
| 21 | [ ] | Space Delimiter | keep |
| 22 | [checked] | Token | keep |

Table-2

| Index | Token String | Token Type |
| --- | --- | --- |
| 1 | [Check-in] | Token |
| 2 | [ ] | Space Delimiter |
| 3 | [four] | Token |
| 4 | [ ] | Space Delimiter |
| 5 | [words] | Token |
| 6 | [ ] | Space Delimiter |
| 7 | [(] | Restore Delimiter |
| 8 | [top] | Token |
| 9 | [)] | Striping Delimiter |
| 10 | [ ] | Space Delimiter |
| 11 | [are] | Token |
| 12 | [ ] | Space Delimiter |
| 13 | [checked] | Token |
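
The clean-up from Table-1 to Table-2 can be sketched as follows. The rules here are inferred from the actions listed in Table-1 and are not stated by the source; in particular, "conflict char" is interpreted as "a space or strippable delimiter left dangling between a kept token and a closing delimiter after a word was stripped". The names `clean_up`, `compose`, and `SOFT` are hypothetical, and stripped slots are stored as empty strings.

```python
from typing import List, Tuple

Token = Tuple[str, str]

WORD = "Token"
SPACE = "Space Delimiter"
STRIPPABLE = "Strippable Delimiter"
RESTORE = "Restore Delimiter"
CLOSING = "Striping Delimiter"  # type name spelled as in the tables
STRIPPED = "Stripped"
SOFT = {SPACE, STRIPPABLE}  # delimiters that may be removed during clean-up


def clean_up(tokens: List[Token]) -> List[Token]:
    """Apply the assumed Table-1 rules and return the kept entries."""
    kept: List[Token] = []
    after_stripped = False
    for string, kind in tokens:
        if kind == STRIPPED:
            after_stripped = True  # strip: stripped type
            continue
        if kind in SOFT and after_stripped:
            continue  # strip: last type is stripped
        if kind == CLOSING and after_stripped:
            # strip: conflict char (dangling delimiters before the close)
            while kept and kept[-1][1] in SOFT:
                kept.pop()
        after_stripped = False
        kept.append((string, kind))
    return kept


def compose(tokens: List[Token]) -> str:
    """Join the token strings back into one output string."""
    return "".join(string for string, _ in tokens)


marked = [
    ("Check-in", WORD), (" ", SPACE), ("four", WORD), (" ", SPACE),
    ("words", WORD), (" ", SPACE),
    ("(", RESTORE), ("", STRIPPED), (",", STRIPPABLE), (" ", SPACE),
    ("", STRIPPED), (",", STRIPPABLE), (" ", SPACE),
    ("top", WORD), (",", STRIPPABLE), (" ", SPACE), ("", STRIPPED),
    (")", CLOSING), (" ", SPACE), ("are", WORD), (" ", SPACE), ("checked", WORD),
]
cleaned = clean_up(marked)
print(compose(cleaned))  # Check-in four words (top) are checked
```

With these assumed rules the 22 marked entries reduce to the 13 entries of Table-2. A space that precedes a stripped word in running text is kept, so "the top of words" would become "top words" rather than "topwords".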