
Lexical Tools

Design: Tokenize & Reverse Token for Strip

IV. Example

Let's use the same example: strip the stop words from "Check-in four words (in, the, top, of) are checked".

Algorithms:

  • Use the delimiters listed in the previous section.

  • Tokenize the string into a list of tokens and determine the type of each token:

    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [in]          Token
    9      [,]           Strippable Delimiter
    10     [ ]           Space Delimiter
    11     [the]         Token
    12     [,]           Strippable Delimiter
    13     [ ]           Space Delimiter
    14     [top]         Token
    15     [,]           Strippable Delimiter
    16     [ ]           Space Delimiter
    17     [of]          Token
    18     [)]           Striping Delimiter
    19     [ ]           Space Delimiter
    20     [are]         Token
    21     [ ]           Space Delimiter
    22     [checked]     Token

  • Modify the list by stripping the stop words (in, the, of):

    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [ ]           Stripped
    9      [,]           Strippable Delimiter
    10     [ ]           Space Delimiter
    11     [ ]           Stripped
    12     [,]           Strippable Delimiter
    13     [ ]           Space Delimiter
    14     [top]         Token
    15     [,]           Strippable Delimiter
    16     [ ]           Space Delimiter
    17     [ ]           Stripped
    18     [)]           Striping Delimiter
    19     [ ]           Space Delimiter
    20     [are]         Token
    21     [ ]           Space Delimiter
    22     [checked]     Token

    If we compose the string from this list, the output is:
    "Check-in four words ( , , top, ,) are checked"

  • However, this is not the result we want: the delimiters left behind by the stripped stop words still clutter the output. One more step therefore cleans up the list, turning Table-1 into Table-2:

    Table-1
    Index  Token String  Token Type            Clean Up Action
    1      [Check-in]    Token                 keep: token
    2      [ ]           Space Delimiter       keep
    3      [four]        Token                 keep
    4      [ ]           Space Delimiter       keep
    5      [words]       Token                 keep
    6      [ ]           Space Delimiter       keep
    7      [(]           Restore Delimiter     keep
    8      [ ]           Stripped              strip: stripped type
    9      [,]           Strippable Delimiter  strip: Strippable Delimiter
    10     [ ]           Space Delimiter       strip: last type is stripped
    11     [ ]           Stripped              strip: stripped type
    12     [,]           Strippable Delimiter  strip: Strippable Delimiter
    13     [ ]           Space Delimiter       strip: last type is stripped
    14     [top]         Token                 keep
    15     [,]           Strippable Delimiter  strip: conflict char
    16     [ ]           Space Delimiter       strip: conflict char
    17     [ ]           Stripped              strip: stripped type
    18     [)]           Striping Delimiter    keep
    19     [ ]           Space Delimiter       keep
    20     [are]         Token                 keep
    21     [ ]           Space Delimiter       keep
    22     [checked]     Token                 keep

    Table-2
    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [top]         Token
    9      [)]           Striping Delimiter
    10     [ ]           Space Delimiter
    11     [are]         Token
    12     [ ]           Space Delimiter
    13     [checked]     Token

  • Compose the string based on this cleaned list. The output is:
    "Check-in four words (top) are checked"
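
The four steps above (tokenize, strip, clean up, compose) can be sketched in a few lines of Python. This is only an illustration, not the tool's own code. The delimiter map covers just the four characters that appear in the example, because the full delimiter list is in the previous section and is not reproduced here. The function names (tokenize, strip_stop_words, clean_up, compose) are hypothetical. The clean-up rules are inferred from the "Clean Up Action" column of Table-1 and may not cover every case the real algorithm handles.

```python
# Assumed delimiter classification, limited to the characters in the example;
# the authoritative list is the one given in the previous section.
DELIMITER_TYPES = {
    " ": "Space Delimiter",
    ",": "Strippable Delimiter",
    "(": "Restore Delimiter",
    ")": "Striping Delimiter",
}
STOP_WORDS = {"in", "the", "of"}  # stop words used in the example


def tokenize(text):
    """Split text into (string, type) pairs, keeping every delimiter."""
    tokens, word = [], ""
    for ch in text:
        if ch in DELIMITER_TYPES:
            if word:
                tokens.append((word, "Token"))
                word = ""
            tokens.append((ch, DELIMITER_TYPES[ch]))
        else:
            word += ch
    if word:
        tokens.append((word, "Token"))
    return tokens


def strip_stop_words(tokens):
    """Mark stop-word tokens as Stripped; the list length is unchanged."""
    return [
        ("", "Stripped") if typ == "Token" and s.lower() in STOP_WORDS else (s, typ)
        for s, typ in tokens
    ]


def clean_up(tokens):
    """Apply clean-up rules inferred from Table-1 (an assumption)."""
    removable = ("Strippable Delimiter", "Space Delimiter")
    cleaned = []
    after_stripped = False  # True while we are right after a stripped token
    for s, typ in tokens:
        if typ == "Stripped":
            after_stripped = True  # strip: stripped type
        elif typ in removable and after_stripped:
            continue  # strip: delimiter that follows a stripped token
        else:
            if typ == "Striping Delimiter" and after_stripped:
                # strip: conflict char (delimiters left dangling before the closer)
                while cleaned and cleaned[-1][1] in removable:
                    cleaned.pop()
            cleaned.append((s, typ))
            after_stripped = False
    return cleaned


def compose(tokens):
    """Concatenate the token strings back into one string."""
    return "".join(s for s, _ in tokens)


text = "Check-in four words (in, the, top, of) are checked"
cleaned = clean_up(strip_stop_words(tokenize(text)))
print(compose(cleaned))  # Check-in four words (top) are checked
```

On the example input, tokenize yields the 22 entries of the first table and clean_up leaves the 13 entries of Table-2. Keeping delimiters as list entries rather than discarding them is what lets the original spacing and punctuation be restored around the tokens that survive.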