artwork: Mihrap, Osman Hamdi Bey
Aligned with new NLP Trends: Generative AI, Retrieval Systems, and Evaluation
| Generative AI & LLMs | Retrieval & RAG | Evaluation & Benchmarks | Encoder Models | Tools & Libraries | Datasets | Community & Learning | Misc |
Language models specific to Turkish, ranging from adaptations of open weights (Llama, Mistral) to native pretrained models.
- Trendyol LLMs : Bilingual (TR/EN) models ranging from 7B to 70B parameters, including specialized cybersecurity variants.
- Kumru-2B : Decoder-only foundation model trained from scratch for Turkish with a native tokenizer. blog
- TURNA : A 1.1B-parameter foundation model for NLU and generation.
- Cosmos Turkish Llama : Llama-based model for text generation, trained with DPO for coherent Turkish continuation.
- Kanarya-2b : Turkish GPT-J model trained on large-scale corpora.
- Turkcell-LLM-7b-v1 : Extended version of Mistral fine-tuned on Turkish instruction sets.
- WiroAI/wiroai-turkish-llm-9b : Robust language models adapted to Turkish culture and context.
- Kocdigital-LLM-8b-v0.1 : Fine-tuned version of Llama3 8b for Turkish.
Models adapted for specific verticals (Legal, Medical, Finance).
Model Context Protocol (MCP) servers enabling AI agents to interact with Turkish data sources.
- Borsa MCP : Istanbul Stock Exchange (BIST) and investment fund data.
- Yargı MCP : Search for Turkish Legal Databases (Yargıtay, Danıştay).
- Mevzuat MCP : Search Turkish Legislation (laws, regulations).
- YÖK Tez MCP : Turkish National Thesis Center (YÖK Tez) search.
- YÖK Atlas MCP : YÖK Atlas higher education and ranking data.
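MCP is built on JSON-RPC 2.0, and a client invokes a server tool through the `tools/call` method. The following is a minimal sketch of what such a request could look like. The tool name `search_decisions` and its `query` argument are hypothetical placeholders, not the real interface of any server listed above; each server publishes its own tool names and argument schemas.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request body for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument, for illustration only.
request = build_tool_call(1, "search_decisions", {"query": "kira sözleşmesi"})

# ensure_ascii=False keeps Turkish characters readable in the payload.
payload = json.dumps(request, ensure_ascii=False)
print(payload)
```

In practice an MCP client library or an agent framework would construct and send this message; the sketch only shows the shape of the payload.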
Semantic retrieval is crucial for RAG (Retrieval-Augmented Generation) pipelines, as it moves beyond keyword search.
Late-Interaction Models
Late-interaction models such as ColBERT match queries and documents at the token level and are designed for high-performance retrieval.
- TurkColBERT : Benchmark and collection of token-level matching models for high-performance retrieval. arxiv, blog
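To illustrate the token-level matching mentioned above, here is a rough sketch of the MaxSim scoring rule used by ColBERT-style models: each query token takes its best cosine similarity against all document tokens, and the per-token maxima are summed. The 2-dimensional token vectors are made up for illustration; a real model would produce them with its encoder, and the models listed here may differ in detail.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the max cosine similarity to any doc token."""
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy token embeddings, invented for this example.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [3.0, 0.0]]  # matches only the first query token

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Here `doc_a` scores higher than `doc_b` because every query token finds a close match in it. The sketch assumes a non-empty document; production systems also batch this computation on tensors rather than Python lists.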
Embedding models for semantic search and retrieval.
- TurkEmbed4Retrieval : Specialized embedding model for Turkish retrieval tasks.
- Mursit-Large-TR-Retrieval : Late-interaction retrieval model for Turkish.
- TY-ecomm-embed-multilingual-base-v1.2.0 : Multilingual e-commerce embeddings.
- Floret Embeddings : Turkish Floret Embeddings, large and medium sized.
- VNLP Word Embeddings : Word2Vec Turkish word embeddings.
- TurkishGloVe : Turkish GloVe word embeddings.
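As a sketch of how single-vector embeddings like these are typically used for semantic search, the snippet below ranks documents by cosine similarity to a query vector. The document IDs and 3-dimensional vectors are invented for illustration; in practice they would come from one of the embedding models above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, corpus, k=2):
    """Return the IDs of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Toy corpus with made-up embeddings.
corpus = {
    "doc_kedi": [0.9, 0.1, 0.0],
    "doc_borsa": [0.0, 0.2, 0.9],
    "doc_köpek": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

print(top_k(query_vec, corpus))
```

For corpora beyond a few thousand documents, an approximate nearest-neighbor index is usually preferred over this exhaustive scan.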
Leaderboards and datasets to validate model performance in Turkish.
- Mezura : Leaderboard focusing on human evaluation (Elo ratings) and RAG performance.
- Mizan : Embedding model leaderboard for retrieval and clustering tasks.
- TurkBench : Comprehensive generative LLM benchmark with 21 subtasks. arxiv
- Cetvel : A 26-task benchmark including translation, summarization, and correction.
- TR-MMLU : Evaluation framework with 6,200 Turkish-specific multiple-choice questions.
- TrGLUE : Turkish-native corpora curated for GLUE-style evaluations.
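Leaderboards based on human preference often turn pairwise votes into Elo ratings. The following is a sketch of the standard Elo update, assuming a K-factor of 32 and a starting rating of 1000; these values, and the exact rating scheme, are assumptions and may not match what any leaderboard above actually uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start equal; a human judge prefers model A's answer.
new_a, new_b = update_elo(1000.0, 1000.0, 1.0)
print(new_a, new_b)  # 1016.0 984.0
```

The update is zero-sum: whatever one model gains, the other loses, and an upset against a higher-rated model moves the ratings more than an expected win.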
Traditional Transformer models (BERT, RoBERTa, etc.) and Word Vectors.
- BERTurk : Turkish BERT/DistilBERT, ELECTRA and ConvBERT models.
- TurkishBERTweet : A BERTweet model fine-tuned on Turkish tweets.
- Loodos/Turkish Language Models : Transformer-based Turkish language models.
- ELMO For ManyLangs : Pre-trained ELMo representations for many languages.
- Fasttext - Word Vector : Pre-trained word vectors for 157 languages.
Core libraries for morphological analysis, tokenization, and processing.
- VNLP (Python) : State-of-the-art, lightweight NLP tools for Turkish.
- Zemberek-NLP (Java) : The veteran NLP library for Turkish (Morphology, Spell Check, etc.).
- Zemberek-Python (Python) : Python wrapper/implementation of Zemberek.
- Zemberek-Server (Docker) : REST Docker server for Zemberek.
- TRmorph (FST) : Finite-state morphological analyzer.
- spaCy Turkish models : Pre-trained Turkish pipelines for spaCy.
- Starlang Tools (Python) : Comprehensive suite (Morphology, Spell Check, Dependency Parsing, Deasciifier, NER).
- ITU Turkish NLP (Web/API) : Tools from ITU Natural Language Processing Group.
- Nuve (C#) : Turkish NLP library for morphological analysis.
- SadedeGel (Python) : Extraction-based news summarization.
- Turkish Stemmer (Python) : Stemming algorithm.
- sinKAF (Python) : Profanity detection library.
- TrTokenizer (Python) : Sentence and word tokenizers.
- snnclsr/NER (Python) : Named Entity Recognition system.
- Helsinki-NLP Translation : Neural machine translation (EN-TR).
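One reason Turkish-specific processing tools matter is casing: Turkish distinguishes dotted `İ/i` from dotless `I/ı`, while Python's built-in `str.lower()` and `str.upper()` follow the default Unicode mappings. The sketch below shows one common workaround using only the standard library; the helper names are hypothetical and are not taken from any library listed above.

```python
def turkish_lower(text):
    """Lowercase text using Turkish rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text):
    """Uppercase text using Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


print(turkish_lower("ISPARTA İZMİR"))
print(turkish_upper("ısparta izmir"))

# The default mapping turns "I" into dotted "i", which is wrong for Turkish.
print("ISPARTA".lower() == turkish_lower("ISPARTA"))
```

Applying the Turkish mapping before tokenization or dictionary lookup avoids mismatches such as `ISPARTA` being lowercased to `isparta` instead of `ısparta`. Mixed Turkish and English text needs more care, since the Turkish rules would be wrong for the English words.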
Extensive corpora and collections for training and evaluation.
- InstrucTurca : 2.58M instruction samples (OpenOrca/MedText translations).
- Turkish-Alpaca : 52k cleaned/verified instruction following samples.
- WikiRAG-TR : Questions derived from Turkish Wikipedia for RAG.
- turkish-math-186k : Large-scale dataset for mathematical reasoning.
- Boğaziçi University TABI - NLI-TR : Natural Language Inference datasets.
- TurkishLLaVA OCR Enhancement : Specialized books collection for OCR improvement.
- unsloth-pmc-vqa-tr : Turkish PMC-VQA (Medical Visual Question Answering).
- BosphorusSign22k : Turkish Sign Language Recognition (SLR) benchmark.
- Cosmos Datasets : Extensive datasets from YTU Cosmos Research Group.
- Trendyol Datasets : E-commerce and general datasets from Trendyol.
- Turkish National Corpus (TNC) : Balanced, large-scale (50M words) general-purpose corpus.
- TS Corpus : Independent project for Turkish corpora and datasets.
- TDD - Turkish Data Depository : Foundational datasets.
- METU Corpora : MTC and Discourse Bank.
- Universal Dependencies (UD) : Standardized cross-linguistic treebanks.
- UD Turkish BOUN : 9.7k sentences, created by TABILAB.
- UD Turkish Kenet : 18.7k sentences, based on TDK dictionary.
- UD Ottoman Turkish : Historical treebank.
- METU-Sabancı Treebank : Syntactic analysis resources.
- SentiTurca : Sentiment analysis benchmark.
- FSMTSAD : Balanced sentiment dataset (Hotel, Movie, Product).
- HisTR : NER dataset for historical Turkish.
- Turkish NLP Suite Datasets : NER, medical, and sentiment resources.
- Amazon MASSIVE & OPUS : Multilingual resources.
- Common Crawl (CC-100) & OSCAR : Large/Web-scale corpora.
- Miscellaneous: Song Lyrics, Poems, Idioms, Stop Words, Bad Word Blacklist, Tatoeba: Multilingual Sentences
- Awesome Turkish NLP : Alternative curated list.
- Awesome Turkish Language Models : Curated list of models.
- Açık Veri Kaynakları : Open data sources in Turkey.
- Kip : A programming language in Turkish based on case and mood.
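The Universal Dependencies treebanks listed under the datasets above are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. The following is a minimal reader sketch using only the standard library. The sample sentence is hand-written for illustration and is not taken from any of the treebanks; dedicated CoNLL-U parsing libraries handle more edge cases.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # skip sentence-level comments and metadata
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample, not drawn from a real treebank.
SAMPLE = (
    "# text = Kedi uyudu.\n"
    "1\tKedi\tkedi\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tuyudu\tuyu\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
print([(t["form"], t["upos"], t["deprel"]) for t in sentences[0]])
```

Note that this sketch keeps multiword-token ranges (IDs such as `1-2`) and empty nodes (IDs such as `1.1`) as ordinary rows; a downstream consumer may want to filter them by inspecting the `id` field.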
Your contributions are welcome! If you want to contribute to this list, send a pull request or just open a new issue.
