A collection of stopword lists, frequent-word lists and other related resources.
It is meant to help build applications that use NLP (Natural Language Processing), such as:
- Stemming
- Text simplification
- Text-to-speech
- Text-proofing
- Natural language search
- Query expansion
- Automated essay scoring
- Truecasing
or search engines.

Language code (ISO 639-1) | Name | Stopwords | Frequent Words | Notes |
---|---|---|---|---|
bg | Bulgarian | Yes | No | UTF-8 |
cz | Czech | Yes | No | UTF-8; the ISO 639-1 code for Czech is `cs` |
de | German | Yes | Yes | |
en | English | Yes | Yes | |
es | Spanish | Yes + | Yes | |
fi | Finnish | Yes | Yes | |
fr | French | Yes | Yes | |
hu | Hungarian | Yes | No | UTF-8 |
it | Italian | Yes | Yes | UTF-8 |
pl | Polish | Yes | No | UTF-8 |
pt | Portuguese | Yes + | No | |
ru | Russian | Yes | No | UTF-8 |
sv | Swedish | Yes | Yes | |
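
As an illustration of how such a list might be used, here is a minimal Python sketch that filters stopwords out of a tokenized sentence. It assumes the lists are plain-text files with one word per line; the file name `stopwords_en.txt` and the helper names `load_stopwords` and `remove_stopwords` are hypothetical and not part of this collection. The encoding is a parameter because only some of the lists above are marked as UTF-8.

```python
from pathlib import Path


def load_stopwords(path, encoding="utf-8"):
    """Read a stopword file (assumed format: one word per line) into a set."""
    text = Path(path).read_text(encoding=encoding)
    # Skip blank lines and lowercase every entry for case-insensitive matching.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, keeping their order."""
    return [token for token in tokens if token.lower() not in stopwords]


# Small inline set so the example runs without any file from the collection.
demo_stopwords = {"the", "is", "on", "a"}
tokens = "The cat is on the mat".split()
print(remove_stopwords(tokens, demo_stopwords))  # ['cat', 'mat']
```

With a real list, the inline set would be replaced by something like `load_stopwords("stopwords_en.txt")`, adjusting the path and encoding to the file actually used.
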
Almost everything was extracted from http://members.unine.ch/jacques.savoy/clef/
Fork the repository, make your changes and open a pull request.
Please also update this README file to reflect your changes!
Thanks for your help!