The SPECIALIST Lexicon

Procedures: Generate Element Words from MEDLINE (TBD)

This page describes how to generate high-frequency element words from MEDLINE.

I. Description

  • Retrieve all TI (titles) and AB (abstracts) from Medline
  • Use Lexical Tools - wordIndex to get the word list (lowercase, remove punctuation, and use space as word separator)
  • For each (single) word, update the total word count
  • For each (single) word, assign a type (LEXICON, NUMBER, NON_WORD, DIGIT, TBD)
  • Analyze results and generate reports
    • List of high-frequency words of type TBD
    • List of all words, sorted by frequency
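
The tokenizing and counting steps above can be sketched as follows. This is a minimal Python illustration, not the Lexical Tools wordIndex implementation: the function names tokenize and count_words are hypothetical, and replacing each punctuation character with a space (rather than deleting it) is an assumption.

```python
import string
from collections import Counter

# Assumption: every ASCII punctuation character is replaced by a space,
# so "beta-blocker" becomes two words. The real wordIndex may differ.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


def count_words(texts):
    """Accumulate the total count of each single word over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Two made-up title/abstract strings, for illustration only.
counts = count_words(["Effects of aspirin.", "Aspirin, 9 trials (iii)"])
```

Here counts maps each lowercased word to its total, so a word that appears in both strings is counted twice.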

II. Processes

  • Root directory: ${LEXICON_DIR}/Components/Medline
  • Data:
    • Input
      • ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt: Medline files
      • ${ROOT_DIR}/data/${YEAR}/inData/inflVars.data: all existing words in Lexicon
      • ${ROOT_DIR}/data/${YEAR}/inData/NRVAR: all number variants included in the Lexicon release
      • ${ROOT_DIR}/data/${YEAR}/inData/exceptions.data: words that are not in the Lexicon and do not need to be added, such as ii, iii, etc. This list should be updated periodically.
    • Output directory: ${ROOT_DIR}/data/${YEAR}/outData/

  • Procedures:
    • MedlineFileList
      • Program: GenFileList.java
      • Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
      • Algorithm: get the file list of Medline.${YEAR}
      • Output: ${OUT_DIR}/MedlineFiles2014.txt
    • Generate PmidTiAb${YY}n${NNNN}.txt
      • Program: GenPmidTiAbFiles.java
      • Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
      • Algorithm: retrieve the PMID, title, and abstract from Medline.${YEAR}, separated by spaces, keeping the original case.
      • Output: ${OUT_DIR}/PmidTiAb/PmidTiAb${YY}n${NNNN}.txt
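
The extraction step above might look like the following Python sketch. It assumes the Medline files use the tagged MEDLINE text layout ("PMID- ", "TI  - ", "AB  - ", continuation lines indented by six spaces, blank line between citations); the source does not state the file layout, and parse_records is a hypothetical name, not GenPmidTiAbFiles.java.

```python
def _as_tuple(fields):
    """Return (pmid, title, abstract); missing title or abstract becomes ''."""
    return fields["PMID"], fields.get("TI", ""), fields.get("AB", "")


def parse_records(lines):
    """Yield (pmid, title, abstract) from MEDLINE-style tagged text.

    Assumed layout: a 4-character tag, "- ", then the value; continuation
    lines start with six spaces; a blank line ends a citation.
    """
    fields, tag = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                          # blank line: end of record
            if "PMID" in fields:
                yield _as_tuple(fields)
            fields, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":  # new tagged field
            tag = line[:4].strip()
            fields[tag] = line[6:].strip()            # original case is kept
        elif tag and line.startswith("      "):       # continuation of last field
            fields[tag] += " " + line.strip()
    if "PMID" in fields:                              # last record without blank line
        yield _as_tuple(fields)


# Made-up citations, for illustration only.
sample = [
    "PMID- 12345\n",
    "TI  - Aspirin use\n",
    "      in adults.\n",
    "AB  - We studied Aspirin.\n",
    "\n",
    "PMID- 67890\n",
    "TI  - No abstract here.\n",
]
records = list(parse_records(sample))
```

Each tuple could then be written as one space-separated line per citation, which matches the output described above.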
    • Generate words|count|type
      • Program: GetWordCountFromTiAbFiles.java
      • Input:
        • Medline files: ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt
        • inflVars: ${IN_DIR}/inflVars.data
        • numbers: ${IN_DIR}/NRVAR
        • exceptions: ${IN_DIR}/exceptions.data (words that should not be included in the Lexicon, such as iii)
      • Algorithm: count the words in the titles and abstracts of Medline.${YEAR}
        • Use wordIndex to get the word list (lowercase all words and use space as the word separator)
        • Update count
      • Output: ${OUT_DIR}/wordCount.out, in the following format:
        word|count|type

        where type is one of:
        LEXICON   an existing word in the Lexicon, such as of
        NUMBER    an existing number in the Lexicon, such as nine
        NON_WORD  not in the Lexicon and not a real word or an element of a multiword, such as iii
        DIGIT     a digit, such as 9
        TBD       to be done; none of the above types
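
The type assignment can be illustrated with a short Python sketch. The order in which the checks are applied is an assumption (the source lists the types but not their precedence), and classify and format_line are hypothetical names. The three sets stand in for the contents of inflVars.data, NRVAR, and exceptions.data.

```python
def classify(word, lexicon, numbers, exceptions):
    """Assign one of the five types to a lowercased single word.

    Assumed precedence: NUMBER, LEXICON, NON_WORD, DIGIT, then TBD.
    """
    if word in numbers:
        return "NUMBER"
    if word in lexicon:
        return "LEXICON"
    if word in exceptions:
        return "NON_WORD"
    if word.isascii() and word.isdigit():   # ASCII digits only, e.g. "9"
        return "DIGIT"
    return "TBD"


def format_line(word, count, word_type):
    """Render one line in the word|count|type format."""
    return f"{word}|{count}|{word_type}"


# Tiny stand-in word lists, for illustration only.
lexicon, numbers, exceptions = {"of"}, {"nine"}, {"iii"}
types = {w: classify(w, lexicon, numbers, exceptions)
         for w in ["of", "nine", "iii", "9", "zzyx"]}
line = format_line("of", 6, types["of"])
```

A word that matches none of the lists and is not made of digits falls through to TBD, which is the pool reviewed for possible addition to the Lexicon.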
    • Analyze results and generate reports
      • Program:
        • AnalyzeWordCountFile.java
        • GetTbdWords.java
      • Input: ${OUT_DIR}/wordCount.out
      • Algorithm:
      • Output:
        • ${OUT_DIR}/wordCount.sum: summary
        • ${OUT_DIR}/wordCount.rpt: report, sorted by frequency
          Columns: rank, word, type, word count, cum. word count, cum. coverage (recall)
        • ${OUT_DIR}/wordCount.csv: in csv format for diagram
          Columns: rank, word count, type
        • ${OUT_DIR}/wordCount.tbd
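
Since the algorithm for this step is not spelled out, the following Python sketch shows one plausible way to derive the report columns from word|count|type rows. It assumes cumulative coverage is the cumulative word count divided by the total word count; build_report, tbd_words, and the min_count threshold are hypothetical and are not taken from AnalyzeWordCountFile.java or GetTbdWords.java.

```python
def build_report(rows):
    """Return (rank, word, type, count, cum_count, cum_coverage) tuples.

    rows: iterable of (word, count, type). Rows are sorted by descending
    count, with ties broken alphabetically (an assumption).
    """
    rows = list(rows)
    total = sum(count for _, count, _ in rows)
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    report, running = [], 0
    for rank, (word, count, word_type) in enumerate(ordered, start=1):
        running += count
        coverage = running / total if total else 0.0
        report.append((rank, word, word_type, count, running, coverage))
    return report


def tbd_words(report, min_count):
    """List TBD words whose count is at least min_count (hypothetical cutoff)."""
    return [word for _, word, word_type, count, _, _ in report
            if word_type == "TBD" and count >= min_count]


# Made-up counts, for illustration only.
rows = [("of", 6, "LEXICON"), ("zzyx", 3, "TBD"), ("9", 1, "DIGIT")]
report = build_report(rows)
tbd = tbd_words(report, min_count=2)
```

With this layout, the coverage column shows how much of the running text is accounted for by the top-ranked words, which is what makes a frequency cutoff for TBD review easy to choose.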

  • Program: