https://doi.org/10.1007/s10844-024-00901-9
Data-Centric AI
Abstract
The evolution of Artificial Intelligence (AI) has been driven by two core components: data
and algorithms. Historically, AI research has predominantly followed the Model-Centric
paradigm, which focuses on developing and refining models, while often treating data as
static. This approach has led to the creation of increasingly sophisticated algorithms, which
demand vast amounts of manually labeled and meticulously curated data. However, as data
becomes central to AI development, it is also emerging as a significant bottleneck. The Data-
Centric AI (DCAI) paradigm shifts the focus towards improving data quality, enabling the
achievement of accuracy levels that are unattainable with Model-Centric approaches alone.
This special issue presents recent advancements in DCAI, offering insights into the paradigm
and exploring future research directions, aiming to contextualize the contributions included
in this issue.
1 Introduction
Nowadays, the large amount of data generated by multiple applications has driven the recent
surge of AI techniques across various fields, ranging from remote sensing and business process
management to healthcare and industry. Despite the close relationship between AI and Big
Data, traditional AI methods have primarily operated within the traditional Model-Centric
paradigm, prioritizing algorithm design and hyper-parameter optimization, while handling
data as static entities and overlooking issues related to data quality (Kumar et al., 2024).
Hence, models developed under the Model-Centric AI paradigm are typically specialized
and tailored to specific tasks and datasets, making it challenging to transfer them across tasks
or datasets even within the same problem domain. In contrast, the emerging Data-Centric AI
(DCAI) paradigm focuses on systematically and algorithmically generating optimal data to
feed learning algorithms (Ng, 2022; Jakubik et al., 2024). As reported in Subramonyam et al.
(2021), data is the backbone of AI systems across the board, and the adoption of the DCAI
paradigm is essential for the success of the next generation of ML and DL tools (Jarrahi
et al., 2023). In particular, the DCAI paradigm is divided into three main steps: training data
Donato Malerba
donato.malerba@uniba.it
Vincenzo Pasquadibisceglie
vincenzo.pasquadibisceglie@uniba.it
1 Department of Informatics, Università degli Studi di Bari Aldo Moro, via Orabona, 4 - 70125 Bari, Italy
Journal of Intelligent Information Systems
development, inference data development, and data maintenance (Zha et al., 2023). These
steps, which are all essential for developing a robust DCAI process, are interconnected.
Training data development The main goal of the training data development step is to collect
and produce high-quality, rich training data to support the training of decision models.
Both the quality and the quantity of training data are addressed through data creation, which
focuses on encoding human intentions into datasets, and data processing, which focuses on
preparing the data for the learning stage. In particular, the training data development step
includes the following sub-steps:
– Data collection (Stonebraker et al., 2013; Stonebraker & Ilyas, 2018) which aims to
identify the most related and useful datasets from data lakes and data marketplaces. This
step often requires data integration operations.
– Data labeling (Dekel & Shamir, 2009), which assigns one or more labels to data samples,
enabling the use of supervised learning algorithms. As this operation is a time-consuming
and resource-intensive process, various techniques have been proposed to improve effi-
ciency and reduce the cost of data labeling (e.g., crowdsourced labeling, consensus
learning, semi-supervised learning, and active learning).
– Data preparation (Wan et al., 2023), which prepares raw data for the learning stage by han-
dling noise, inconsistencies, and any unnecessary information that may lead to inaccurate
or biased results. Feature extraction and feature transformation are two examples of such
transformation operations.
– Data reduction (Riquelme et al., 2003) that reduces the complexity of a given dataset,
while retaining its representative information. This is achieved by reducing either the
feature size (dimensionality reduction) or the sample size (sampling).
– Data augmentation (Frid-Adar et al., 2018) that is a technique to increase the size and
diversity of data.
Inference data development The second step concerns the construction of evaluation data
used to probe and assess trained models (Zha et al., 2023). It includes:
– In-distribution evaluation (Otles et al., 2021), which involves generating samples aligned
with the training data. This is crucial to identify and calibrate underrepresented groups to
prevent biases and errors, understand the decision boundary, and scrutinize ethical consid-
erations.
– Out-of-distribution evaluation (Madry et al., 2019), which aims to generate samples that
significantly differ from the training data. For example, adversarial samples can aid in under-
standing the robustness of models to out-of-distribution inputs.
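As an illustration of the data reduction sub-step described above, the following minimal sketch (illustrative code, not taken from any cited work) reduces the feature size of a toy dataset via PCA and its sample size via uniform random sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 samples, 10 redundant features generated from 4 latent factors.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 10))

def pca_reduce(X, n_components):
    """Project X onto its top principal components (dimensionality reduction)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data yields the principal directions in Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def subsample(X, n_samples, rng):
    """Uniform random sampling without replacement (sample-size reduction)."""
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx]

X_low = pca_reduce(X, n_components=3)   # feature size: 10 -> 3
X_small = subsample(X_low, 50, rng)     # sample size: 200 -> 50
print(X_low.shape, X_small.shape)       # (200, 3) (50, 3)
```

Both operations preserve the representative structure of the data while shrinking the matrix along one axis each, in line with the dimensionality reduction / sampling distinction above.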
Data maintenance In real-world applications, data are not created once, but they need to be
continuously updated and curated. The data maintenance step aims to maintain the quality
and reliability of data in a dynamic environment. It involves three sub-goals:
– Data understanding (Burch & Weiskopf, 2013), which may use visual summarization, clus-
tering, or data statistics to help organize complex data and produce human-readable
insights.
– Data quality assurance (Pipino et al., 2002), which is commonly performed in dynamic
environments where continuous monitoring and quality improvement are mandatory.
Quality assessment includes both objective and subjective metrics. The former measure
inherent data attributes (accuracy, timeliness, consistency, completeness), while the latter
evaluate data quality from a human perspective.
– Data storage and retrieval (Van Aken et al., 2017) that manages exponentially growing
data through resource allocation strategies to optimize throughput and latency in data
administration systems.
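Objective quality metrics of the kind mentioned above can be computed directly from the data. A minimal sketch (toy records and function names are illustrative, not from the cited works) of completeness and duplicate checks:

```python
# Toy tabular data with missing values (None) and an exact duplicate row.
records = [
    {"id": 1, "age": 34, "country": "IT"},
    {"id": 2, "age": None, "country": "IT"},
    {"id": 3, "age": 29, "country": None},
    {"id": 3, "age": 29, "country": None},  # duplicate of the previous row
]

def completeness(records, fields):
    """Fraction of non-missing values across the given fields (objective metric)."""
    total = len(records) * len(fields)
    present = sum(r[f] is not None for r in records for f in fields)
    return present / total

def duplicate_ratio(records):
    """Fraction of rows that are exact duplicates of an earlier row."""
    seen, dups = set(), 0
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            dups += 1
        seen.add(key)
    return dups / len(records)

print(completeness(records, ["age", "country"]))  # 0.625
print(duplicate_ratio(records))                   # 0.25
```

Monitoring such metrics continuously, rather than once at ingestion time, is what distinguishes data maintenance from one-off data preparation.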
Various works in academia and industry have recently started applying DCAI principles
in different application contexts. For example, the authors of Roscher et al. (2023) have
recently described the main principles of the DCAI paradigm in both remote sensing and
geospatial data applications. Their study shows that geospatial data acquisition and curation
should receive as much attention as data engineering and model development and evaluation.
In Zahid et al. (2021), the authors present a systematic review of emerging information
technologies used for data modeling and analytics to achieve Data-Centric Health-Care
(DCHC) for sustainable healthcare. From the industry perspective, the authors
of Luley et al. (2023) describe a tangible, adaptable implementation of a DCAI development
process tailored for industrial applications, particularly in machining and manufacturing
sectors.
Transitioning from a Model-Centric to a DCAI paradigm rests on the idea that better data
leads to better AI systems. However, this AI paradigm shift poses significant challenges
related to the training data development, inference data development, and data maintenance
tasks. In particular, the key research issues in DCAI require answering the following ques-
tions:
How do we collect, select, and valorize the data for an AI research project?
The Model-Centric AI paradigm involves adopting a fixed dataset that may hide several
issues in the production phase (Seedat et al., 2024). This operational mode has ensured steady
model advancement, as the literature has seen a proliferation of sophisticated deep neural
architectures, learning strategies, and optimization methods. However, the Model-Centric AI
paradigm imposes an important constraint on the nature of the data, because it assumes
the data is clean, error-free, and does not evolve over time. In real scenarios, by contrast,
datasets are imperfect: they contain label errors and missing values and do not fit the real
distribution. As reported in Northcutt et al. (2021), there are several examples of datasets
with noisy labels in the literature, such as ImageNet (Russakovsky et al., 2015) or MS-COCO
(Lin et al., 2014). The authors raise two important issues related to
the presence of noisy labels: how to identify mislabeled examples and how to learn effectively
despite noisy labels. These issues should be addressed regardless of the data type or model
used. For example, the authors of Northcutt et al. (2021) introduce DCAI strategies
such as the use of Confident Learning to address the challenges of label quality.
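The core idea can be sketched in a much-simplified form (this is not the full algorithm of Northcutt et al.; the data and thresholding rule here are illustrative): estimate a per-class confidence threshold from out-of-sample predicted probabilities, then flag samples whose confidently predicted classes exclude their given label.

```python
import numpy as np

# Out-of-sample predicted probabilities (in practice obtained via
# cross-validation) for 6 samples over 3 classes, plus the given labels.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
    [0.90, 0.05, 0.05],  # given label is 2 -> likely mislabeled
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
])
labels = np.array([0, 1, 2, 2, 1, 2])

def flag_label_issues(probs, labels):
    """Simplified confident-learning-style check: the threshold for class j is
    the mean predicted probability of j among samples labeled j; a sample is
    flagged when its confidently predicted classes do not include its label."""
    n_classes = probs.shape[1]
    thresholds = np.array([probs[labels == j, j].mean() for j in range(n_classes)])
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        confident = np.where(p >= thresholds)[0]
        if len(confident) > 0 and y not in confident:
            flagged.append(i)
    return flagged

print(flag_label_issues(probs, labels))  # [3]
```

Flagged samples can then be relabeled, down-weighted, or pruned before retraining.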
How can data quality and consistency be ensured in an AI research project?
In a study conducted in 2021, Google researchers (Sambasivan et al., 2021) explored
the impact of data quality on learning algorithms to extract empirical evidence of the ‘data
cascades’ phenomenon. This term denotes compounding events, caused by data issues, that
produce negative downstream effects and lead to technical debt over time. Specifically, the
study involved 53 AI practitioners from the USA, India, and East and West Africa, working
in high-risk sectors such as healthcare, agriculture, finance, public safety, environmental
conservation, and education. Results show that 92% of the participants experienced at least
one data cascade in their projects. Based on these premises, the literature has recently seen
the proliferation of various DCAI solutions (Peng et al., 2021; Clemente et al., 2023) to
improve data quality, in order to mitigate the phenomenon of error propagation across the
various phases of an AI project.
How should we evaluate AI systems?
The evaluation phase of a model is a crucial step to complete before releasing the model
to production. In the Model-Centric paradigm, the evaluation phase uses test datasets to
measure the models’ accuracy metrics. In the DCAI paradigm, the evaluation phase is not
limited to accuracy metrics, but should also account for aspects such as the dynamic nature
of the data, the presence of adversaries, or the right to explanation (Zha et al., 2023). In
particular, the evaluation phase opens up various challenges, such as evaluating the resilience
or reusability of a model beyond the development phase.
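The difference between accuracy-only evaluation and a DCAI-style evaluation can be illustrated with a toy robustness probe (all data and names are illustrative): the same classifier is scored on a clean in-distribution test set and on a shifted version of it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D classifier: predict class 1 when the feature exceeds 0.
predict = lambda X: (X > 0.0).astype(int)

# Clean in-distribution test set: two well-separated classes.
X_test = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
y_test = np.array([0] * 100 + [1] * 100)

def accuracy(predict, X, y):
    return float((predict(X) == y).mean())

# Score on clean data versus on data perturbed towards the decision
# boundary: a crude probe of robustness under distribution shift.
acc_clean = accuracy(predict, X_test, y_test)
acc_shifted = accuracy(predict, X_test + rng.normal(0, 2.0, X_test.shape), y_test)

print(acc_clean, acc_shifted)  # clean accuracy should be near perfect
```

A model can look flawless on the fixed test set while degrading sharply under shift, which is exactly the gap a DCAI evaluation protocol is meant to expose.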
In recent years, there has been a surge of initiatives and research activity on the DCAI theme.
Several scientific events have been organized with the aim of raising the scientific community’s
awareness of this new paradigm. The most recent events are reported in the following:
– “Data Centric AI", Workshop co-located with the Thirty-Fifth Annual Conference on
Neural Information Processing Systems (NeurIPS 2021), Virtual Conference, 2021,
https://neurips.cc/virtual/2021/workshop/21860.
– “DMLR Data-centric Machine Learning Research" (DMLR 2024) co-located with The
Forty-First International Conference on Machine Learning (ICML 2024), Vienna, Aus-
tria, July 21-27, 2024, https://dmlr.ai/.
– “Artificial Intelligence and Data Science for Healthcare: Bridging data-centric AI and
People-Centric Healthcare" (AIDSH-KDD 24), co-located with 30th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD 2024), August 25 - 29,
2024 - Barcelona, Spain, https://aimel.ai.
– “1st International Workshop on Data-Centric Artificial Intelligence" (DEARING 2024)
co-located with European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML PKDD 2024), September 13, 2024 -
Vilnius, Lithuania, https://dearing2024.di.uniba.it/landing.
– “The 4th International Workshop on Data-Centric AI" (DCAI24), co-located with the 33rd
ACM International Conference on Information and Knowledge Management (CIKM 2024).
Despite the remarkable milestones recently achieved with the DCAI paradigm, new research
trends are already emerging:
reason, the next research directions should focus on strategies to mitigate and control this
phenomenon.
– High quality There is a need for new benchmarks that allow learning algorithms to be
evaluated not only in terms of accuracy. In addition, although bias and fairness have
been less studied under the Model-Centric paradigm, these issues have become essential
in modern machine-learning applications. In Whang et al. (2023), the authors analyze
fairness measures and inequity mitigation techniques that can be applied before, during,
or after model training, making the scientific community aware of the need for controlled
data management.
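As an example of the kind of fairness measure surveyed in Whang et al. (2023), the following sketch computes the demographic parity difference between two groups (toy predictions and group assignments, illustrative only):

```python
import numpy as np

# Binary predictions and a binary sensitive attribute (group 0 / group 1).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def demographic_parity_diff(y_pred, group):
    """Absolute difference in positive-prediction rates between the groups.
    A value of 0 means the model selects both groups at the same rate."""
    rate0 = y_pred[group == 0].mean()
    rate1 = y_pred[group == 1].mean()
    return float(abs(rate0 - rate1))

print(demographic_parity_diff(y_pred, group))  # ~0.2
```

Tracking such a metric alongside accuracy, before, during, and after training, is one concrete form of the controlled data management advocated above.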
The era of DCAI marks a pivotal paradigm shift in AI, allowing us to build a new generation
of intelligent systems through the strategic utilization of high-quality data. This approach
accentuates the significance of ensuring that information not only facilitates learning, but
also precisely targets the specific learning requirements of AI.
This special issue aims to explore the transformative impact of recent developments in
the DCAI paradigm on the future of AI, ML, and DL. The call for papers invited contributions
exploring how these advancements influence the development of intelligent systems
across various domains, such as business process development and maintenance, cybersecurity,
Earth observation, bioinformatics, energy markets, smart cities, finance, and healthcare.
We welcomed both research-oriented and practical contributions that shed light on opportunities,
perspectives, and open research directions within the DCAI paradigm, inspiring further
innovations in the field. Topics of interest for this special issue included, but were not limited
to:
– High quality data preparation
– Data cleaning, denoising, and interpolation
– Novel feature engineering pipelines
– Label Errors and Confident Learning (CL)
– Selecting features and/or instances
– Performing outlier detection and removal
– Ensuring label consensus
– Producing consistent and low noise training data
– Extracting smart data from raw data
– Creating training datasets for small data problems
– Handling rare classes and explaining important class coverage in big data problems
– Incorporating human feedback into training datasets
– Combining multi-view, multi-source, multi-objective datasets
– Data-Centric machine learning and deep learning approaches
– Active learning to identify the most valuable examples to label
– Core-set learning to handle big data
– Semi-supervised learning, few-shot learning, weak supervision, and confident learning
to take advantage of a limited amount of labels or to handle label noise
– Transfer learning and self-supervised learning algorithms to achieve rich data repre-
sentations to be used under label scarcity
– Concept drift detection to identify new data to label
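As an illustration of the active learning topic listed above, a minimal least-confidence uncertainty sampling sketch (toy probabilities, illustrative only) selects the pool samples most valuable to label:

```python
import numpy as np

# Predicted class probabilities for an unlabeled pool of 5 samples.
pool_probs = np.array([
    [0.95, 0.05],
    [0.55, 0.45],   # most uncertain -> most valuable to label
    [0.80, 0.20],
    [0.60, 0.40],
    [0.99, 0.01],
])

def uncertainty_sampling(probs, k):
    """Pick the k pool samples with the smallest top-class probability
    (least-confidence strategy), i.e. those the model is least sure about."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k].tolist()

print(uncertainty_sampling(pool_probs, k=2))  # [1, 3]
```

In a full active learning loop, the selected samples would be sent to an annotator, added to the training set, and the model retrained.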
In Fraj et al. (2024), the authors propose a multi-view subspace text clustering method
called MVSTC. This method aims to enhance text clustering by integrating multiple text
representation models to capture various features such as syntactic, topic, and semantic
information. MVSTC leverages these diverse views to detect latent correlations between
documents and projects the data onto a topological map to reveal relationships. The method
outperforms existing multi-view clustering methods using several evaluation metrics on real-
world datasets.
Finally, Bernardi et al. (2024) describe in this issue an application in the context of
business process management. The authors propose BPLLM, a new methodology to enhance
conversations with Large Language Models (LLMs) and support process-aware decision
support systems (DSS). The framework analyzes and describes business processes, enhancing
LLMs’ conversational capabilities in dealing with process-related activities. The integration
of a Retrieval-Augmented Generation (RAG) framework makes it possible to acquire
contextual knowledge relevant to specific user queries, allowing the model to provide more
accurate and relevant answers. The main goal of BPLLM is to assist users
in understanding and executing business processes through natural language interactions,
offering advanced and targeted support.
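The retrieval side of a RAG pipeline can be sketched in a few lines. The following toy example (illustrative only, not the actual BPLLM implementation) retrieves the document most similar to a user query via bag-of-words cosine similarity; the retrieved snippet would then be prepended to the LLM prompt as context.

```python
import numpy as np

# Toy document store describing process activities (illustrative content).
docs = [
    "submit the purchase order to the procurement office",
    "approve the invoice after the goods are received",
    "archive the completed purchase order",
]

def bow_vectors(texts, vocab):
    """Bag-of-words count vectors over a shared vocabulary."""
    return np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

def retrieve(query, docs, k=1):
    """Return the indices of the k documents most similar to the query
    (cosine similarity over bag-of-words vectors)."""
    vocab = sorted({w for t in docs + [query] for w in t.split()})
    D = bow_vectors(docs, vocab)
    q = bow_vectors([query], vocab)[0]
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k].tolist()

print(retrieve("how do I approve an invoice", docs, k=1))  # [1]
```

Production RAG systems replace the bag-of-words vectors with learned dense embeddings, but the retrieve-then-generate structure is the same.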
Acknowledgements We sincerely appreciate all the authors who contributed their papers to this special issue.
We are also grateful to the reviewers for their meticulous evaluations and insightful feedback, which greatly
enhanced the quality of the submissions. Additionally, we acknowledge the invaluable support of Zbigniew Ras,
Editor-in-Chief of the Journal of Intelligent Information Systems, and the editorial team for their constructive
guidance and timely assistance. This special issue has been developed in fulfillment of the research
objectives of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI
(CUP H97G22000210007) and the Transversal Project TP7 - Data-Centric AI and Infrastructure,
under the NRRP MUR program funded by NextGenerationEU.
References
Andresini, G., Appice, A., Ienco, D., et al. (2024). DIAMANTE: A data-centric semantic segmentation approach
to map tree dieback induced by bark beetle infestations via satellite images. In: Journal of intelligent
information systems. https://doi.org/10.1007/s10844-024-00877-6.
Bernardi, M. L., Casciani, A., Cimitile, M., et al. (2024). Conversing with business process-aware large
language models: the BPLLM framework. In: Journal of intelligent information systems. https://doi.org/
10.1007/s10844-024-00898-1.
Burch, M., & Weiskopf, D. (2013). On the benefits and drawbacks of radial diagrams. In: Handbook of human
centric visualization. Springer, pp. 429–451. https://doi.org/10.1007/978-1-4614-7485-2_17.
Clemente, F., Ribeiro, G. M., Quemy, A., et al. (2023). ydata-profiling: Accelerating data-centric AI with
high-quality data. In: Neurocomputing 554. https://doi.org/10.1016/j.neucom.2023.126585.
Dekel, O., & Shamir, O. (2009). Vox Populi: Collecting High-Quality Labels from a Crowd. In: Proc. 22nd
Annual conference on learning theory (COLT), 2009. https://www.cs.mcgill.ca/~colt2009/papers/037.
pdf#page=1.
Fraj, M., HajKacem, M. A. B., & Essoussi, N. (2024). Multi-view subspace text clustering. In: Journal of
intelligent information systems. https://doi.org/10.1007/s10844-024-00897-2.
Frid-Adar, M., Klang, E., Amitai, M., et al. (2018). Synthetic data augmentation using GAN for improved liver
lesion classification. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018).
IEEE, pp. 289–293. https://doi.org/10.1109/ISBI.2018.8363576.
García-Gil, D., Luque-Sánchez, F., Luengo, J., et al. (2019). From big to smart data: Iterative ensemble filter
for noise filtering in big data classification. In: International Journal of Intelligent Systems 34.12, pp.
3260–3274. https://doi.org/10.1002/int.22193.
Jakubik, J., Vössing, M., Kühl, N., et al. (2024). Data-centric artificial intelligence. In: Business & information
systems engineering. https://doi.org/10.1007/s12599-024-00857-8.
Jarrahi, M. H., Memariani, A., Guha, S. (2023). The Principles of Data-Centric AI. In: Commun. ACM 66.8.
https://doi.org/10.1145/3571724.
Kumar, S., Datta, S., Singh, V., et al. (2024). Opportunities and Challenges in Data-Centric AI. In: IEEE
Access. https://doi.org/10.1109/ACCESS.2024.3369417.
Lin, T., Maire, M., Belongie, S., et al. (2014). Microsoft COCO: Common Objects in Context. In: Computer
vision - ECCV 2014. Ed. by David Fleet, Tomas Pajdla, Bernt Schiele, et al. Cham: Springer International
Publishing, pp. 740–755. https://doi.org/10.1007/978-3-319-10602-1_48.
Luley, P., Deriu, J. M., Yan, P., et al. (2023). From concept to implementation: The data-centric development
process for AI in industry. In: 2023 10th IEEE Swiss Conference on Data Science (SDS). IEEE, pp.
73–76. https://doi.org/10.1109/SDS57534.2023.00017.
Madry, A., Makelov, A., Schmidt, L., et al. (2019). Towards deep learning models resistant to adversarial
attacks. In: CoRR. https://doi.org/10.48550/arXiv.1706.06083.
Mazumder, M., Banbury, C. R., Yao, X., et al. (2022). DataPerf: Benchmarks for Data-Centric AI Development.
In: CoRR. https://doi.org/10.48550/arXiv.2207.10062.
Ng, A. (2022). Unbiggen AI. In: IEEE Spectrum. url: https://spectrum.ieee.org/andrew-ng-
data-centric-ai.
Northcutt, C., Jiang, L., Chuang, I. (2021). Confident learning: estimating uncertainty in dataset labels. In:
Journal of Artificial Intelligence Research 70. https://doi.org/10.1613/jair.1.12125.
Otles, E., Oh, J., Li, B. et al. (2021). Mind the performance gap: examining dataset shift during prospective val-
idation. In: Machine Learning for Healthcare Conference. PMLR, pp. 506-534. url: https://proceedings.
mlr.press/v149/otles21a.html.
Peng, J., Wu, W., Lockhart, B., et al. (2021). DataPrep.EDA: Task-centric exploratory data analysis for statistical
modeling in Python. In: Proceedings of the 2021 international conference on management of data, pp.
2271–2280. https://doi.org/10.1145/3448016.3457330.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. In: Communications of the ACM
45.4. https://doi.org/10.1145/505248.506010.
Polyzotis, N., & Zaharia, M. (2021). What can Data-Centric AI Learn from Data and ML Engineering? In:
CoRR. https://doi.org/10.48550/arXiv.2112.06439.
Riquelme, J. C., Aguilar-Ruiz, J. S., & Toro, M. (2003). Finding representative patterns with ordered projec-
tions. In: Pattern Recognition 36.4. https://doi.org/10.1016/S0031-3203(02)00119-X.
Roscher, R., Rußwurm, M., Gevaert, C., et al. (2023). Data-centric machine learning for geospatial remote
sensing data. In: CoRR. https://doi.org/10.48550/arXiv.2312.05327.
Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. In:
International journal of computer vision 115.3. https://doi.org/10.1007/s11263-015-0816-y.
Sambasivan, N., Kapania, S., Highfill, H., et al. (2021). Everyone wants to do the model work, not the data
work: Data Cascades in High-Stakes AI. In: Proceedings of the 2021 CHI conference on human factors
in Computing Systems. CHI ’21. Yokohama, Japan: Association for Computing Machinery. https://doi.
org/10.1145/3411764.3445518.
Sancricca, C., Siracusa, G., Cappiello, C. (2024). Enhancing data preparation: Insights from a time series case
study. In: Journal of intelligent information systems. https://doi.org/10.1007/s10844-024-00867-8.
Seedat, N., Imrie, F., & van der Schaar, M. (2024). Navigating Data-Centric Artificial Intelligence With DC-
Check: Advances, Challenges, and Opportunities. In: IEEE Transactions on Artificial Intelligence 5.6.
https://doi.org/10.1109/TAI.2023.3345805.
Shah, D., Shah, K., Jagani, M., et al. (2024). CONCORD: Enhancing COVID-19 research with weak-
supervision based numerical claim extraction. In: Journal of intelligent information systems. https://
doi.org/10.1007/s10844-024-00885-6.
Stonebraker, M., Bruckner, D., Ilyas, I. F., et al. (2013). Data curation at scale: The data tamer system. In: Sixth
biennial conference on innovative data systems research, CIDR 2013, Asilomar, CA, USA, January 6-
9, 2013, online proceedings. Vol. 2013. url: https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.
pdf.
Stonebraker, M., & Ilyas, I. F. (2018). Data Integration: The Current Status and the Way Forward. In: IEEE
Data engineering bulletin 41.2, pp. 3–9. url: http://sites.computer.org/debull/A18june/p3.pdf.
Subramonyam, H., Seifert, C., & Adar, M. E. (2021). How can human-centered design shape data-centric AI.
In: Proceedings of NeurIPS Data-Centric AI Workshop. https://www.cond.org/humandataai.pdf.
Van Aken, D., Pavlo, A., Gordon, G. J., et al. (2017). Automatic database management system tuning through
large-scale machine learning. In: Proceedings of the 2017 ACM international conference on management
of data, pp. 1009–1024. https://doi.org/10.1145/3035918.3064029.
Wan, M., Zha, D., Liu, N., et al. (2023). In-processing modeling techniques for machine learning fairness:
A survey. In: ACM Transactions on knowledge discovery from data 17.3, pp. 1–27. https://doi.org/10.
1145/3551390.
Whang, S. E., Roh, Y., Song, H., et al. (2023). Data collection and quality challenges in deep learning: A
data-centric AI perspective. In: The VLDB Journal 32.4, pp. 791–813. https://doi.org/10.1007/s00778-
022-00775-9.
Zahid, A., Kay Poulsen, J., Sharma, R., et al. (2021). A systematic review of emerging information technologies
for sustainable data-centric healthcare. In: International Journal of Medical Informatics 149. https://doi.
org/10.1016/j.ijmedinf.2021.104420.
Zha, D., Bhat, Z. P., Lai, K., et al. (2023). Data-centric ai: Perspectives and challenges. In: Proceedings of
the 2023 SIAM international conference on data mining (SDM). SIAM, pp. 945–948. https://doi.org/10.
1137/1.9781611977653.ch106.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.