Learn more about Data-Centric AI

Data-Centric AI is the systematic engineering of better data (via AI and automation). Learn about key concepts, useful tricks, and helpful tools.

CROWDLAB: The Right Way to Combine Humans and AI for LLM Evaluation

CROWDLAB improves your team's LLM Evals process by automatically producing reliable ratings and flagging which outputs need further review.

How to detect bad data in your instruction tuning dataset (for better LLM fine-tuning)

Overview of automated tools for catching: low-quality responses, incomplete/vague prompts, and other problematic text (toxic language, PII, informal writing, bad grammar/spelling) lurking in an instruction-response dataset. Here we reveal findings for the Dolly dataset.

Automatically Detect Problematic Content in any Text Dataset

Introducing AI text audits for automated content moderation and curation, including detection of toxic, non-English, and informal language, as well as personally identifiable information.

Ensuring Reliable Few-Shot Prompt Selection for LLMs

Learn data-centric techniques for better few-shot prompting when applying LLMs to noisy real-world data.

Improving any OpenAI Language Model by Systematically Improving its Data

Reduce LLM prediction error by 37% via data-centric AI.

CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators

Understanding cleanlab's new methods for multi-annotator data and what makes them effective.

Automatically catching spurious correlations in ML datasets

An open-source module to detect spurious correlations between dataset labels and features that will not generalize to real-world deployment.

Announcing Auto-Labeling Agent: Your Assistant for Rapid and High Quality Labeling

Generate AI, not headaches. Automate annotation with AI.

Accelerate Time Series Modeling with Cleanlab Studio AutoML: Train and Deploy in Minutes

Accelerate Time Series Modeling with Cleanlab Studio AutoML. Predictable results in a few clicks.

An open-source platform to catch all sorts of issues in all sorts of datasets

With cleanlab v2.6, the most popular library for Data-Centric AI now offers more comprehensive data audits including new checks for underperforming groups, null values, imbalanced classes, and more.

Comparing tools for Data Science, Data Quality, Data Annotation, and AI/ML

What's the next-generation platform for Data Science? A data-centric AI system that can automatically: find and fix data issues, label data, and train/deploy reliable models.

How to Filter Unsafe and Low-Quality Images from any Dataset: A Product Catalog Case Study

Introducing an automated solution to ensure high-quality image data, for both content moderation and boosting engagement. Easily curate any product/content catalog or photo gallery to delight your customers.