GitHub - agmmnn/turkish-nlp-resources: 🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.

artwork: Mihrap, Osman Hamdi Bey

Turkish NLP Resources

Turkish NLP (Türkçe Doğal Dil İşleme) Tools, Libraries, Models, Datasets, and other resources.
Organized around current NLP trends: generative AI, retrieval systems, and evaluation.

Generative AI & LLMs

Foundation & Chat Models

Language models specific to Turkish, ranging from adaptations of open-weight models (Llama, Mistral) to models pretrained natively on Turkish.

Trendyol LLMs : Bilingual (TR/EN) models ranging from 7B to 70B parameters, including specialized cybersecurity variants.
Kumru-2B : Decoder-only foundation model trained from scratch for Turkish with a native tokenizer. blog
TURNA : A 1.1B-parameter foundation model for NLU and generation.
Cosmos Turkish Llama : Llama-based model for text generation, trained with DPO for coherent Turkish continuation.
Kanarya-2b : Turkish GPT-J model trained on large-scale corpora.
Turkcell-LLM-7b-v1 : Extended version of Mistral fine-tuned on Turkish instruction sets.
WiroAI/wiroai-turkish-llm-9b : Robust language models adapted to Turkish culture and context.
Kocdigital-LLM-8b-v0.1 : Fine-tuned version of Llama 3 8B for Turkish.

Domain Specific LLMs

Models adapted for specific verticals (Legal, Medical, Finance).

Mecellem : Specialized ModernBERT-based models for the Turkish legal domain. arxiv

LLM Integrations (MCP Servers)

Model Context Protocol (MCP) servers enabling AI agents to interact with Turkish data sources.

Borsa MCP : Istanbul Stock Exchange (BIST) and investment fund data.
Yargı MCP : Search across Turkish legal databases (Yargıtay, Danıştay).
Mevzuat MCP : Search Turkish Legislation (laws, regulations).
YÖK Tez MCP : Turkish National Thesis Center (YÖK Tez) search.
YÖK Atlas MCP : YÖK Atlas higher education and ranking data.
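
To illustrate how an agent talks to servers like these: MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The sketch below only builds such a request. The tool name `search_legislation` and its `query` argument are hypothetical and are not taken from any of the servers listed above. Transport and the initialization handshake are left out.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects a unique id per request.
_ids = count(1)

def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool name and argument, for illustration only.
request = tool_call_request("search_legislation", {"query": "tüketici hakları"})

# ensure_ascii=False keeps Turkish characters readable in the payload.
payload = json.dumps(request, ensure_ascii=False)
print(payload)
```

Each real server publishes its own tool names and argument schemas, which a client would normally discover before calling anything.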


Retrieval & Semantic Search (RAG)

Resources for Retrieval-Augmented Generation (RAG) pipelines that move beyond keyword search.

Late-Interaction Models

Late-interaction models (such as ColBERT) keep one embedding per token and score a query against a document by matching tokens, instead of comparing a single vector per text.

TurkColBERT : Benchmark and collection of token-level matching models for high-performance retrieval. arxiv, blog
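
Token-level matching in ColBERT-style models is usually scored with MaxSim: each query token takes its best cosine match among the document tokens, and those maxima are summed. Below is a minimal sketch in plain Python. The 2-dimensional token vectors are made up for illustration and do not come from any model listed here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # a single "averaged" token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

In this toy case `doc_a` scores higher because each query token finds an exact match, which is the behavior a single pooled vector tends to blur.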

Embedding Models

Embedding models for semantic search and retrieval.

TurkEmbed4Retrieval : Specialized embedding model for Turkish retrieval tasks.
Mursit-Large-TR-Retrieval : Late-interaction retrieval model for Turkish.
TY-ecomm-embed-multilingual-base-v1.2.0 : Multilingual e-commerce embeddings.
Floret Embeddings : Turkish Floret Embeddings, large and medium sized.
VNLP Word Embeddings : Word2Vec Turkish word embeddings.
TurkishGloVe : Turkish GloVe word embeddings.
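
Once texts are embedded, semantic search typically reduces to ranking documents by cosine similarity to the query vector. The sketch below shows that step only. The 3-dimensional vectors and document ids are invented placeholders; in practice the vectors would come from one of the models above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder embeddings (illustrative values only).
docs = {
    "kargo": [0.9, 0.1, 0.0],
    "iade": [0.1, 0.9, 0.0],
    "hava": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

ranking = rank(query, docs)
print([doc_id for doc_id, _ in ranking])
```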


Evaluation & Benchmarks

Leaderboards and datasets to validate model performance in Turkish.

Mezura : Leaderboard focusing on human evaluation (Elo ratings) and RAG performance.
Mizan : Embedding model leaderboard for retrieval and clustering tasks.
TurkBench : Comprehensive generative LLM benchmark with 21 subtasks. arxiv
Cetvel : A 26-task benchmark including translation, summarization, and correction.
TR-MMLU : Evaluation framework with 6,200 Turkish-specific multiple-choice questions.
TrGLUE : Turkish-native corpora curated for GLUE-style evaluations.
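
For readers unfamiliar with Elo-style human evaluation: two models are compared on a prompt, a human picks a winner, and both ratings move according to how surprising the result was. The sketch below uses the standard Elo formula with an assumed `k` of 32. It is not taken from any leaderboard listed above, which may use a different variant.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is an assumed step size, not a value from any specific leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two equally rated models; A wins one human comparison.
a, b = update(1000.0, 1000.0, 1.0)
print(a, b)
```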


Encoder Models

Traditional Transformer models (BERT, RoBERTa, etc.) and Word Vectors.

BERTurk : Turkish BERT/DistilBERT, ELECTRA and ConvBERT models.
TurkishBERTweet : A BERTweet model fine-tuned on Turkish tweets.
Loodos/Turkish Language Models : Transformer based Turkish language models.
ELMO For ManyLangs : Pre-trained ELMo Representations.
Fasttext - Word Vector : Pre-trained word vectors for 157 languages.


Tools & Libraries

Core libraries for morphological analysis, tokenization, and processing.

VNLP (Python) : State-of-the-art, lightweight NLP tools for Turkish.
Zemberek-NLP (Java) : The veteran NLP library for Turkish (Morphology, Spell Check, etc.).
Zemberek-Python (Python) : Python wrapper/implementation of Zemberek.
Zemberek-Server (Docker) : REST Docker server for Zemberek.
TRmorph (FST) : Finite-state morphological analyzer.
spaCy Turkish models : Pre-trained Turkish pipelines for spaCy.
Starlang Tools (Python) : Comprehensive suite (Morphology, Spell Check, Dependency Parsing, Deasciifier, NER).
ITU Turkish NLP (Web/API) : Tools from ITU Natural Language Processing Group.
Nuve (C#) : Turkish NLP library for morphological analysis.
SadedeGel (Python) : Extraction-based news summarization.
Turkish Stemmer (Python) : Stemming algorithm.
sinKAF (Python) : Profanity detection library.
TrTokenizer (Python) : Sentence and word tokenizers.
snnclsr/NER (Python) : Named Entity Recognition system.
Helsinki-NLP Translation : Neural machine translation (EN-TR).


Datasets

Extensive corpora and collections for training and evaluation.

Instruction Tuning & Dialogue (LLM)

InstrucTurca : 2.58M instruction samples (OpenOrca/MedText translations).
Turkish-Alpaca : 52k cleaned/verified instruction following samples.
WikiRAG-TR : Questions derived from Turkish Wikipedia for RAG.
turkish-math-186k : Large-scale dataset for mathematical reasoning.
Boğaziçi University TABI - NLI-TR : Natural Language Inference datasets.

Multimodal & Vision

TurkishLLaVA OCR Enhancement : Specialized books collection for OCR improvement.
unsloth-pmc-vqa-tr : Turkish PMC-VQA (Medical Visual Question Answering).
BosphorusSign22k : Turkish Sign Language Recognition (SLR) benchmark.

Major Corpora & Collections

Cosmos Datasets : Extensive datasets from YTU Cosmos Research Group.
Trendyol Datasets : E-commerce and general datasets from Trendyol.
Turkish National Corpus (TNC) : Balanced, large-scale (50M words) general-purpose corpus.
TS Corpus : Independent project for Turkish corpora and datasets.
TDD - Turkish Data Depository : Foundational datasets.
METU Corpora : MTC and Discourse Bank.

Treebanks (Syntax & Morphology)

Universal Dependencies (UD) : Standardized cross-linguistic treebanks.
UD Turkish BOUN : 9.7k sentences, created by TABILAB.
UD Turkish Kenet : 18.7k sentences, based on TDK dictionary.
UD Ottoman Turkish : Historical treebank.
METU-Sabancı Treebank : Syntactic analysis resources.
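
Universal Dependencies treebanks are distributed in CoNLL-U, a plain-text format with one token per line, ten tab-separated columns, `#` comment lines, and blank lines between sentences. Below is a minimal reader sketch. The sample sentence and its annotation are hand-written for illustration and are not copied from any treebank listed above.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines in this sketch
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences

# Hand-written example, for illustration only.
SAMPLE = (
    "# text = Kitap okudum.\n"
    "1\tKitap\tkitap\tNOUN\t_\tCase=Nom|Number=Sing\t2\tobj\t_\t_\n"
    "2\tokudum\toku\tVERB\t_\tNumber=Sing|Person=1|Tense=Past\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sents = parse_conllu(SAMPLE)
print([(tok["form"], tok["lemma"], tok["deprel"]) for tok in sents[0]])
```

Multiword-token ranges and empty nodes also use the ten-column layout, so this reader keeps them as ordinary rows. A dedicated CoNLL-U library is the safer choice for real work.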

Sentiment, General NLP & Others

SentiTurca : Sentiment analysis benchmark.
FSMTSAD : Balanced sentiment dataset (Hotel, Movie, Product).
HisTR : NER dataset for historical Turkish.
Turkish NLP Suite Datasets : NER, medical, and sentiment resources.
Amazon MASSIVE & OPUS : Multilingual resources.
Common Crawl (CC-100) & OSCAR : Large/Web-scale corpora.
Miscellaneous : Song Lyrics, Poems, Idioms, Stop Words, Bad Word Blacklist, Tatoeba (multilingual sentences).

Dataset Search


Community & Learning

YouTube Channels

Awesome Lists

Awesome Turkish NLP : Alternative curated list.
Awesome Turkish Language Models : Curated list of models.
Açık Veri Kaynakları : Open data sources in Turkey.

Educational Resources

Turkish Natural Language Processing - Kemal Oflazer


Misc

Kip : A programming language in Turkish based on case and mood.


Contributing

Your contributions are welcome! If you want to contribute to this list, send a pull request or just open a new issue.
