E5CE GitHub - agmmnn/turkish-nlp-resources: 🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.
[go: up one dir, main page]

Skip to content

🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.

Notifications You must be signed in to change notification settings

agmmnn/turkish-nlp-resources

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 

Repository files navigation

Turkish NLP Resources

Turkish NLP (Türkçe Doğal Dil İşleme) Tools, Libraries, Models, Datasets, and other resources.
Aligned with new NLP Trends: Generative AI, Retrieval Systems, and Evaluation

Contents:

| Generative AI & LLMs | Retrieval & RAG | Evaluation & Benchmarks | Encoder Models | Tools & Libraries | Datasets | Community & Learning | Misc |


Generative AI & LLMs

Foundation & Chat Models

Language models specific to Turkish, ranging from adaptations of open weights (Llama, Mistral) to native pretrained models.

  • Trendyol LLMs : Bilingual (TR/EN) models ranging from 7B to 70B parameters, including specialized cybersecurity variants.
  • Kumru-2B : Decoder-only foundational models trained from scratch for Turkish with a native tokenizer. blog
  • TURNA : A 1.1B parameter foundational model for NLU and generation.
  • Cosmos Turkish Llama : The Cosmos Llama is designed for text generation tasks, trained with DPO for coherent Turkish continuation.
  • Kanarya-2b : Turkish GPT-J model trained on large-scale corpora.
  • Turkcell-LLM-7b-v1 : Extended version of Mistral fine-tuned on Turkish instruction sets.
  • WiroAI/wiroai-turkish-llm-9b : Robust language models adapted to Turkish culture and context.
  • Kocdigital-LLM-8b-v0.1 : Fine-tuned version of Llama3 8b for Turkish.

Domain Specific LLMs

Models adapted for specific verticals (Legal, Medical, Finance).

  • Mecellem : Specialized ModernBERT-based models for the Turkish legal domain. arxiv

LLM Integrations (MCP Servers)

Model Context Protocol (MCP) servers enabling AI agents to interact with Turkish data sources.

  • Borsa MCP : Istanbul Stock Exchange (BIST) and investment fund data.
  • Yargı MCP : Search for Turkish Legal Databases (Yargıtay, Danıştay).
  • Mevzuat MCP : Search Turkish Legislation (laws, regulations).
  • YÖK Tez MCP : Turkish National Thesis Center (YÖK Tez) search.
  • YÖK Atlas MCP : YÖK Atlas higher education and ranking data.

Retrieval & Semantic Search (RAG)

Crucial for RAG (Retrieval Augmented Generation) pipelines, moving beyond keyword search.

Late-Interaction Models

Late-interaction models (ColBERT) are specifically designed for high-performance retrieval tasks.

  • TurkColBERT : Benchmark and collection of token-level matching models for high-performance retrieval. arxiv, blog

Embedding Models

Embedding models for semantic search and retrieval.

Evaluation & Benchmarks

Leaderboards and datasets to validate model performance in Turkish.

  • Mezura : Leaderboard focusing on human evaluation (ELO) and RAG performance.
  • Mizan : Embedding model leaderboard for retrieval and clustering tasks.
  • TurkBench : Comprehensive generative LLM benchmark with 21 subtasks. arxiv
  • Cetvel : A 26-task benchmark including translation, summarization, and correction.
  • TR-MMLU : Evaluation framework with 6,200 Turkish-specific multiple-choice questions.
  • TrGLUE : Turkish-native corpora curated for GLUE-style evaluations.

Encoder Models

Traditional Transformer models (BERT, RoBERTa, etc.) and Word Vectors.

Tools & Libraries

Core libraries for morphological analysis, tokenization, and processing.

  • VNLP (Python) : State-of-the-art, lightweight NLP tools for Turkish.
  • Zemberek-NLP (Java) : The veteran NLP library for Turkish (Morphology, Spell Check, etc.).
  • Zemberek-Python (Python) : Python wrapper/implementation of Zemberek.
  • Zemberek-Server (Docker) : REST Docker server for Zemberek.
  • TRmorph (FST) : Finite-state morphological analyzer.
  • spaCy Turkish models : Pre-trained Turkish pipelines for spaCy.
  • Starlang Tools (Python) : Comprehensive suite (Morphology, Spell Check, Dependency Parsing, Deasciifier, NER).
  • ITU Turkish NLP (Web/API) : Tools from ITU Natural Language Processing Group.
  • Nuve (C#) : Turkish NLP library for morphological analysis.
  • SadedeGel (Python) : Extraction-based news summarization.
  • Turkish Stemmer (Python) : Stemming algorithm.
  • sinKAF (Python) : Profanity detection library.
  • TrTokenizer (Python) : Sentence and word tokenizers.
  • snnclsr/NER (Python) : Named Entity Recognition system.
  • Helsinki-NLP Translation : Neural machine translation (EN-TR).

Datasets

Extensive corpora and collections for training and evaluation.

Instruction Tuning & Dialogue (LLM)

Multimodal & Vision

Major Corpora & Collections

Treebanks (Syntax & Morphology)

Sentiment, General NLP & Others

Dataset Search

Community & Learning

YouTube Channels

Awesome Lists

Educational Resources

Misc

  • Kip : A programming language in Turkish based on case and mood.

Contributing

Your contributions are welcome! If you want to contribute to this list, send a pull request or just open a new issue.

About

🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.

Topics

Resources

Stars

Watchers

Forks

Contributors 5

0