Lexical Tools

Tokenize & Reverse Token for Strip

II. Analysis

As discussed in the introduction, an algorithm is needed to clean up the token list after stripping. This algorithm is based on the types of delimiters (or types of tokens), which are described below.

  1. Token Type:
    => Any token in the token list that is not a delimiter (as described below)
    => A token must be kept during cleanup

  2. Space Delimiter Type:
    => " " and "\t" (tab) are the most common of this type
    => A space delimiter is always a delimiter
    => Consecutive space delimiters are trimmed to a single space, which is kept during cleanup

  3. Stripped Type:
    => Any token in the token list that is modified (stripped)
    => Such as "in" or "on" in the strip stop word function
    => The string is changed to " " when the token is stripped

  4. Restore Delimiter Type:
    => A delimiter which will always be kept during cleanup in all circumstances.
    => Such as "({[".

  5. Stripping Delimiter Type:
    => A delimiter which will always be kept during cleanup in all circumstances. However, tokens that are conjoint to and in front of it need to be stripped if they are of the Stripped type or belong to the conflict token list.
    => Such as ")}]". For example:
    (top) => (top)
    (A.I.D.S.) => (A.I.D.S.)
    (in, on, of) => ( ) => " , , " is stripped from ( , , ) since "," belongs to the conflict token list.
    The conflict list includes: "-,:;"

  6. Strippable Delimiter Type:
    => A delimiter which will be kept only if the previous token is not of the Stripped type
    => If a token is stripped, the punctuation that follows it should be stripped as well. According to grammar, most punctuation marks are placed directly after a word (with no space between), such as ".,:;". Such a punctuation mark is stripped if the conjoint token before it is stripped. For example:
    in, on, Top => Top
    The "," characters are stripped since "in" and "on" are stripped
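
The six types above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not the Lexical Tools implementation. The names Kind, classify, tokenize, clean_up and the STOP_WORDS set are hypothetical. Two details are assumptions: the walk back from a closing delimiter skips over spaces before it looks for conflict tokens, and the final result has leading and trailing spaces trimmed (chosen so that "in, on, Top" yields "Top").

```python
import re
from enum import Enum, auto


class Kind(Enum):
    TOKEN = auto()       # 1. ordinary token, always kept
    SPACE = auto()       # 2. space delimiter, collapsed to one space
    STRIPPED = auto()    # 3. token removed by the strip function
    RESTORE = auto()     # 4. opening delimiter, always kept
    STRIPPING = auto()   # 5. closing delimiter, always kept
    STRIPPABLE = auto()  # 6. punctuation that directly follows a word


RESTORE_DELIMS = set("({[")
STRIPPING_DELIMS = set(")}]")
STRIPPABLE_DELIMS = set(".,:;")
CONFLICT_TOKENS = set("-,:;")
STOP_WORDS = {"in", "on", "of"}  # assumption: a tiny illustrative stop word list


def tokenize(text):
    """Split into words, whitespace runs, and single punctuation characters."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def classify(token):
    """Assign one of the six types to a token (hypothetical helper)."""
    if token.isspace():
        return Kind.SPACE
    if token.lower() in STOP_WORDS:
        return Kind.STRIPPED
    if token in RESTORE_DELIMS:
        return Kind.RESTORE
    if token in STRIPPING_DELIMS:
        return Kind.STRIPPING
    if token in STRIPPABLE_DELIMS:
        return Kind.STRIPPABLE
    return Kind.TOKEN


def clean_up(tokens):
    """Rebuild a string from tokens, applying the cleanup rules for each type."""
    out = []
    prev_kind = None
    for token in tokens:
        kind = classify(token)
        if kind in (Kind.STRIPPED, Kind.SPACE):
            # Stripped tokens become " "; spaces are collapsed at the end.
            out.append(" ")
        elif kind is Kind.STRIPPABLE and prev_kind is Kind.STRIPPED:
            pass  # drop punctuation that follows a stripped token
        elif kind is Kind.STRIPPING:
            # Assumption: walk back over spaces and drop conflict tokens
            # that sit in front of the closing delimiter.
            i = len(out) - 1
            while i >= 0 and (out[i] == " " or out[i] in CONFLICT_TOKENS):
                if out[i] in CONFLICT_TOKENS:
                    del out[i]
                i -= 1
            out.append(token)
        else:
            out.append(token)  # TOKEN, RESTORE, or kept STRIPPABLE
        prev_kind = kind
    # Trim runs of spaces to one space; outer trim is an assumption.
    return re.sub(r"\s+", " ", "".join(out)).strip()
```

With these assumptions, clean_up(tokenize("in, on, Top")) gives "Top" and clean_up(tokenize("(in, on, of)")) gives "( )", which matches the examples listed under types 5 and 6. "(A.I.D.S.)" is returned unchanged because "." is not in the conflict list and none of its neighbours are stripped.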