Lexical Tools

Prefix Computer Programs

A set of computer programs retrieves prefix word|word pairs from LEXICON and validates them as derivations. These programs are run annually for each lvg release.

Get all base forms from LEXICON (inflVars.data)
- Program: GetBaseForms.java
- Input: ./dataOrg/inflVars.data
- Output: ./data/bases.data
- Descriptions
  - go through all lines (inflectional variants) in the file inflVars.data
  - retrieve the base forms (infl = 1)
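
The filter described above can be sketched roughly as follows. This is a minimal illustration, not the actual GetBaseForms.java: the class name `BaseFormSketch`, the method `baseForms`, and the assumed field layout (term in the first pipe-delimited field, inflection code in the third) are all hypothetical, and the sample records used with it are made up.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of the base-form filter; not the real GetBaseForms.java. */
final class BaseFormSketch {
    // Assumption: each line is pipe-delimited, with the term in field 0
    // and the inflection code in field 2. Adjust to the real layout.
    private static final int TERM_FIELD = 0;
    private static final int INFL_FIELD = 2;

    /** Returns the sorted, de-duplicated terms whose inflection code is 1 (base). */
    static SortedSet<String> baseForms(List<String> inflVarLines) {
        SortedSet<String> bases = new TreeSet<>();
        for (String line : inflVarLines) {
            // limit -1 keeps trailing empty fields so indexes stay stable
            String[] fields = line.split("\\|", -1);
            if (fields.length > INFL_FIELD && fields[INFL_FIELD].equals("1")) {
                bases.add(fields[TERM_FIELD]);
            }
        }
        return bases;
    }
}
```

In a real run the lines would be read from ./dataOrg/inflVars.data and the result written to ./data/bases.data; the sketch works on in-memory lists only to stay self-contained.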
Retrieve and validate prefix words|words
- Program: GetPrefixWordsFromFile.java
- Input:
  - ./dataOrg/prefix.data
  - ./data/bases.data
  - ./data/prefix.tag.data
- Output: ./data/prefixWords.meta.data
- Descriptions
  - get prefixes from a file (./dataOrg/prefix.data)
  - get base forms from a file (./data/bases.data)
  - get prefix tags from a file (./data/prefix.tag.data)
  - Find all pairs of prefix words|words in LEXICON:
    - go through all prefixes from the sorted prefixes list
    - add a prefix word|word pair to prefixWordList if:
      - the prefix word (prefix + word) is a base form
      - the word itself is also a base form
  - validate all pairs of prefix words|words in prefixWordList
    - go through all pairs of prefixWord|words in prefixWordList
      - print tag ("yes" or "no") to ./data/prefixWords.meta.data
      - print "tbd" if no tag found
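
The pairing and tagging steps above might look like the following sketch. It is not the actual GetPrefixWordsFromFile.java: the class `PrefixPairSketch`, the method `tagPairs`, and the output layout `prefixWord|word|tag` are assumptions made for illustration, and the sketch does no special handling of hyphenated prefixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of prefix word|word pairing; not the real program. */
final class PrefixPairSketch {
    /**
     * For every prefix (in sorted order), finds base forms that start with the
     * prefix and whose remainder is also a base form, then attaches the known
     * tag ("yes"/"no") or "tbd" when the pair has no tag yet.
     * Assumption: knownTags is keyed by "prefixWord|word".
     */
    static List<String> tagPairs(Set<String> prefixes, Set<String> bases,
                                 Map<String, String> knownTags) {
        List<String> out = new ArrayList<>();
        TreeSet<String> sortedBases = new TreeSet<>(bases);
        for (String prefix : new TreeSet<>(prefixes)) {
            for (String prefixWord : sortedBases) {
                // skip non-matches and the bare prefix itself
                if (!prefixWord.startsWith(prefix)
                        || prefixWord.length() == prefix.length()) {
                    continue;
                }
                String word = prefixWord.substring(prefix.length());
                if (bases.contains(word)) {
                    String pair = prefixWord + "|" + word;
                    out.add(pair + "|" + knownTags.getOrDefault(pair, "tbd"));
                }
            }
        }
        return out;
    }
}
```

Scanning every base form for every prefix is quadratic in the worst case; for a large base-form list, `TreeSet.subSet` over the prefix range would be one way to narrow the scan.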
Generate various reports from ./data/prefixWords.meta.data by tag
- Program: GeneratePrefixFiles.java
- Input:
  - ./data/prefixWords.meta.data
- Output:
  - ./data/prefix.tbd.data
  - ./data/prefixWords.data
  - ./data/prefix.newTag.data
- Descriptions
  - go through all pairs of tagged prefixWord|words in prefixWords.meta.data
    - send all "tbd" tags to prefix.tbd.data
    - send all "yes" and "no" tags to prefix.newTag.data
    - send all "yes" tags to prefixWords.data
    - check for invalid tags
    - check all comment lines
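
The routing by tag can be sketched as below. This is an illustrative guess rather than the actual GeneratePrefixFiles.java: the class `PrefixReportSketch`, the method `route`, the assumption that the tag is the last pipe-delimited field, the assumption that comment lines start with `#`, and the "invalid" bucket are all hypothetical. Lines are copied unchanged; the real program may well strip the tag from prefixWords.data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of splitting tagged pairs by tag; not the real program. */
final class PrefixReportSketch {
    static final String TBD_FILE = "prefix.tbd.data";
    static final String WORDS_FILE = "prefixWords.data";
    static final String NEW_TAG_FILE = "prefix.newTag.data";
    static final String INVALID = "invalid"; // hypothetical bucket for bad tags

    /** Maps each output name to the lines that would be written to it. */
    static Map<String, List<String>> route(List<String> metaLines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String name : List.of(TBD_FILE, WORDS_FILE, NEW_TAG_FILE, INVALID)) {
            out.put(name, new ArrayList<>());
        }
        for (String line : metaLines) {
            // Assumption: comment lines start with '#'; they are skipped here
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Assumption: the tag is the last pipe-delimited field
            String tag = line.substring(line.lastIndexOf('|') + 1);
            switch (tag) {
                case "tbd":
                    out.get(TBD_FILE).add(line);
                    break;
                case "yes":
                    out.get(NEW_TAG_FILE).add(line);
                    out.get(WORDS_FILE).add(line);
                    break;
                case "no":
                    out.get(NEW_TAG_FILE).add(line);
                    break;
                default:
                    out.get(INVALID).add(line); // anything else is an invalid tag
            }
        }
        return out;
    }
}
```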

Validate results:

Program: 2.GetPrefixWords
Input:
- ./data/prefix.tag.data
- ./data/prefix.newTag.data
- ./data/prefixWords.data.new
Output:
- ./data/prefix.tag.data.noComment.sort
- ./data/prefix.newTag.data.all.sort

Descriptions

Remove all comment lines from prefix.tag.data

			fgrep -v '#' prefix.tag.data > prefix.tag.data.noComment
			sort -u prefix.tag.data.noComment > prefix.tag.data.noComment.sort

Combine the results with the new prefix words (prefixWords.data.new, to be added in the future)

			cat prefix.newTag.data prefixWords.data.new > prefix.newTag.data.all
			sort -u prefix.newTag.data.all > prefix.newTag.data.all.sort

Compare the input tag file with the resulting tag file

			diff prefix.tag.data.noComment.sort prefix.newTag.data.all.sort > prefix.tag.diff

Usage for (future) releases:
- update inflVars.data from new release of LEXICON
- update prefix.data
- update prefixWords.data.new (for new prefix words that are not in this release)
```
./bin/1.GetBaseForms ${YEAR}
./bin/2.GetPrefixWords ${YEAR}
```
  - check the number of lines in prefix.tag.diff (should be 0)
  - prefixWords.data (to be added to derivations.data)
  - prefix.tbd.data (send to linguists for validations)