latincy/verba: verba.txt - A Latin word list in the style of Unix /usr/share/dict/words

verba

A curated Latin word list — 134,154 unique forms, analogous to Unix /usr/share/dict/words.

Format

  • One word per line
  • UTF-8 encoded
  • Alphabetically sorted
  • Normalized: v→u, j→i
  • Corpus-validated against 375M+ tokens of Latin text
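
Because of the v→u, j→i normalization, a query word needs the same treatment before it is compared with the list. A minimal sketch using `tr`; the function name `normalize_latin` is hypothetical, and mapping capital V and J the same way is an assumption, since only the lowercase rule is stated above.

```shell
# Hypothetical helper: apply the list's v→u, j→i normalization to one word.
# Assumption: capital V and J are mapped to U and I in the same way.
normalize_latin() {
  printf '%s\n' "$1" | tr 'vjVJ' 'uiUI'
}
```

For example, `normalize_latin vivo` prints `uiuo`, which can then be passed to an exact-line search.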

Usage

# Check whether a word is in the list (exact whole-line match)
grep -qx "amicus" verba.txt && echo "found"

# Count words
wc -l verba.txt

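# Use as a spell-check dictionary (needs an aspell Latin dictionary installed;
# aspell personal word lists also expect a "personal_ws-1.1 la 0" header line)
aspell --lang=la --personal=./verba.txt check mytext.txt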
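
Where aspell is not available, a rough spell-check pass can be sketched with standard tools alone: split a text into words, normalize them the way the list is normalized, and print the forms that match no line of the list. The function name `unknown_words` is hypothetical. The sketch assumes plain ASCII input and does not change letter case, because the README does not say how the list stores capitalized forms.

```shell
# Hypothetical helper: print the words of a text that are absent from a word list.
# Usage: unknown_words TEXT_FILE WORDLIST
# Assumption: ASCII text; letter case is left untouched.
unknown_words() {
  # 1. turn every run of non-letters into a newline (one word per line)
  # 2. apply the v→u, j→i normalization used by the list
  # 3. drop empty lines and keep each distinct form once
  # 4. keep only forms that match no list line exactly (-x whole line, -F fixed string)
  tr -cs '[:alpha:]' '\n' < "$1" |
    tr 'vj' 'ui' |
    sed '/^$/d' |
    sort -u |
    grep -vxF -f "$2"
}
```

A call such as `unknown_words mytext.txt verba.txt` would list candidate misspellings, one per line. Forms that differ from the list only in capitalization would also be reported under these assumptions.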

Statistics

  • 134,154 unique word forms
  • Validated against 975,803 forms from LatinCy word lists (Wiktionary + UD treebanks)
  • Validated against a 375M-token Latin corpus (CC100, Wikipedia, Wikisource, Perseus, Tesserae, Latin Library, CAMENA, Patrologia Latina, UD treebanks)
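
The "unique word forms" figure can be checked locally by comparing the total line count with the number of distinct lines. A minimal sketch; the function name `wordlist_stats` is hypothetical, and it assumes the file contains only word lines, each terminated by a newline.

```shell
# Hypothetical helper: report total lines, distinct lines, and repeated lines.
# Usage: wordlist_stats WORDLIST
wordlist_stats() {
  # wc -l counts newline characters, so a missing final newline undercounts by one
  total=$(wc -l < "$1")
  # byte-wise sort (LC_ALL=C) so that distinct forms are never merged by locale rules
  unique=$(LC_ALL=C sort -u "$1" | wc -l)
  printf 'lines=%d unique=%d duplicates=%d\n' "$total" "$unique" "$((total - unique))"
}
```

If the file matches the figure above, `wordlist_stats verba.txt` should report the same number for `lines` and `unique`, and zero duplicates.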

Reporting Errors

If you find an incorrect, missing, or spurious entry in the word list, please open an issue.

License

CC0 1.0 Universal — see LICENSE.

Citation

If you use this word list in research, please cite:

@dataset{burns_verba_2026,
  author       = {Burns, Patrick J.},
  title        = {Verba: A Curated Latin Word List for {NLP} Applications},
  year         = {2026},
  url          = {https://github.com/latincy/verba},
  version      = {0.1.1},
  note         = {134,154 unique Latin word forms derived from LatinCy word lists and validated against a 375M-token corpus}
}
