
Journal of Intelligent Information Systems

https://doi.org/10.1007/s10844-024-00901-9

Data-Centric AI

Donato Malerba1 · Vincenzo Pasquadibisceglie1
donato.malerba@uniba.it · vincenzo.pasquadibisceglie@uniba.it
1 Department of Informatics, Università degli Studi di Bari Aldo Moro, via Orabona, 4 - 70125 Bari, Italy

© The Author(s) 2024

Abstract
The evolution of Artificial Intelligence (AI) has been driven by two core components: data
and algorithms. Historically, AI research has predominantly followed the Model-Centric
paradigm, which focuses on developing and refining models, while often treating data as
static. This approach has led to the creation of increasingly sophisticated algorithms, which
demand vast amounts of manually labeled and meticulously curated data. However, as data
becomes central to AI development, it is also emerging as a significant bottleneck. The Data-
Centric AI (DCAI) paradigm shifts the focus towards improving data quality, enabling the
achievement of accuracy levels that are unattainable with Model-Centric approaches alone.
This special issue presents recent advancements in DCAI, offering insights into the paradigm
and exploring future research directions, aiming to contextualize the contributions included
in this issue.

1 Introduction

Nowadays, the large amount of data generated in multiple applications has led to the recent
surge of AI techniques across various fields, ranging from remote sensing and business process
management to healthcare and industry. Despite the close relationship between AI and Big
Data, AI methods have primarily operated within the traditional Model-Centric
paradigm, prioritizing algorithm design and hyper-parameter optimization, while handling
data as static entities and overlooking issues related to data quality (Kumar et al., 2024).
Hence, models developed under the Model-Centric AI paradigm are typically specialized
and tailored to specific tasks and datasets, making it challenging to transfer them across tasks
or datasets, even within the same problem domain. In contrast, the emerging Data-Centric AI
(DCAI) paradigm is focused on systematically and algorithmically generating optimal data to
feed learning algorithms (Ng, 2022; Jakubik et al., 2024). As reported in Subramonyam et al.
(2021), data is the backbone of AI systems across the board, and the adoption of the DCAI
paradigm is mandatory for the success of the next generation of ML and DL tools (Jarrahi
et al., 2023). In particular, the DCAI paradigm is divided into three main steps: training data
development, inference data development, and data maintenance (Zha et al., 2023). These
steps, which are all essential for developing a robust DCAI process, are interconnected.

Training data development The main goal of the training data development step is to collect
and produce high-quality and rich training data to support the training of decision models.
Both the quality and the quantity of training data are addressed through two phases: data creation,
which focuses on encoding human intentions into datasets, and data processing, which prepares
the data for the learning stage. In particular, the training data development step includes the
following sub-steps (a short code sketch after the list illustrates the workflow):

– Data collection (Stonebraker et al., 2013; Stonebraker & Ilyas, 2018) which aims to
identify the most relevant and useful datasets from data lakes and data marketplaces. This
step often requires data integration operations.
– Data labeling (Dekel & Shamir, 2009) that assigns one or more labels to data samples
enabling the use of supervised learning algorithms. As this operation is a time-consuming
and resource-intensive process, various techniques have been identified to improve effi-
ciency and reduce the cost of data labeling (e.g., crowdsourced labeling, consensus
learning, semi-supervised and active learning).
– Data preparation (Wan et al., 2023) that prepares raw data for the learning stage by han-
dling noise, inconsistencies and any unnecessary information that may lead to inaccurate
or biased results. Feature extraction and feature transformation are two examples of such
operations.
– Data reduction (Riquelme et al., 2003) that reduces the complexity of a given dataset,
while retaining its representative information. This is achieved by reducing either the
feature size (dimensionality reduction) or the sample size (sampling).
– Data augmentation (Frid-Adar et al., 2018) that is a technique to increase the size and
diversity of data.
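
To make these sub-steps concrete, the following minimal sketch chains data preparation, data reduction, and data augmentation on synthetic tabular data. It is only an illustration of the workflow under stated assumptions, not a method proposed in this issue; the scikit-learn calls, noise scale, and sample sizes are arbitrary choices made for brevity.

```python
# A minimal, illustrative training-data development pipeline (assumptions:
# synthetic tabular data, scikit-learn available; sizes and noise scale are arbitrary).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Data creation (here: a synthetic stand-in for collected and labeled data).
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Data preparation: drop rows containing missing values (none here, shown for completeness).
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# Data reduction: dimensionality reduction (PCA) plus random sub-sampling.
X_red = PCA(n_components=10, random_state=0).fit_transform(X)
idx = rng.choice(len(X_red), size=500, replace=False)
X_sub, y_sub = X_red[idx], y[idx]

# Data augmentation: jitter each retained sample with small Gaussian noise.
X_aug = np.vstack([X_sub, X_sub + rng.normal(scale=0.05, size=X_sub.shape)])
y_aug = np.concatenate([y_sub, y_sub])

print(X_aug.shape, y_aug.shape)  # (1000, 10) (1000,)
```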

Inference data development The Model-Centric paradigm mainly evaluates a decision
model using performance metrics alone (e.g., accuracy metrics, computation time, memory
usage). However, this may lead to an evaluation phase that neglects critical aspects of a
decision model such as resilience, adaptability, and the reasoning behind its decisions.
The goal of the inference data development phase is to generate new evaluation sets
that offer more detailed insights into the model, or to activate a particular capability of the
model using engineered data inputs. Some sub-goals of this phase, illustrated by the sketch
after the list, are:

– In-distribution evaluation (Otles et al., 2021) that involves generating samples aligned
to the training data. This is crucial to identify and calibrate underrepresented groups to
prevent biases and errors, understand decision boundaries and scrutinize ethical considerations.
– Out-of-distribution evaluation (Madry et al., 2019) which aims to generate samples that
significantly differ from the training data. For example, adversarial samples can aid in
understanding the robustness of models to out-of-distribution inputs.
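
A minimal sketch of the two evaluation settings follows, assuming synthetic data and a simple covariate shift as a stand-in for a true out-of-distribution or adversarial set; the model choice and shift parameters are assumptions for illustration only.

```python
# Minimal sketch of in- vs. out-of-distribution evaluation (assumptions:
# synthetic data; the "shift" is a simple feature perturbation, not a real OOD set).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# In-distribution evaluation: held-out samples drawn from the training distribution.
acc_id = accuracy_score(y_te, model.predict(X_te))

# Out-of-distribution evaluation: the same samples after a covariate shift
# (here, additive noise and a scale change on every feature).
rng = np.random.default_rng(0)
X_ood = X_te * 1.5 + rng.normal(scale=1.0, size=X_te.shape)
acc_ood = accuracy_score(y_te, model.predict(X_ood))

print(f"in-distribution accuracy:     {acc_id:.3f}")
print(f"out-of-distribution accuracy: {acc_ood:.3f}")
```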

Data maintenance In real-world applications, data are not created once, but they need to be
continuously updated and curated. The data maintenance step aims to maintain the quality
and reliability of data in a dynamic environment. It involves three sub-goals:
– Data understanding (Burch & Weiskopf, 2013) that may use visual summarization, clus-
tering or data statistics to help organize complex data and produce human-readable insights.
– Data quality assurance (Pipino et al., 2002) that is commonly performed in dynamic
environments where continuous monitoring and quality improvement are mandatory.
Quality assessment includes both objective and subjective metrics. The former measure
inherent data attributes (accuracy, timeliness, consistency, completeness), while the latter
evaluate data quality from a human perspective (a short sketch of objective checks follows this list).
– Data storage and retrieval (Van Aken et al., 2017) that manages exponentially growing
data through resource allocation strategies to optimize throughput and latency in data
administration systems.
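
As a concrete illustration of objective quality metrics, the sketch below computes simple completeness, consistency, and timeliness scores on a toy pandas DataFrame. The column names, toy records, and the freshness window are assumptions for illustration, not metrics mandated by Pipino et al. (2002).

```python
# Minimal sketch of objective data-quality checks (assumptions: a pandas
# DataFrame with an "updated_at" column; the freshness cutoff is arbitrary).
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "age": [34, None, 29, 51],
    "updated_at": pd.to_datetime(["2024-06-01", "2024-06-02", "2023-01-10", "2024-06-03"]),
})

completeness = 1 - df.isna().mean().mean()            # share of non-missing cells
consistency = 1 - df.duplicated("patient_id").mean()  # share of non-duplicated keys
timeliness = (df["updated_at"] > pd.Timestamp("2024-05-04")).mean()  # records updated recently

print(f"completeness={completeness:.2f} consistency={consistency:.2f} timeliness={timeliness:.2f}")
```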
Various works in academia and industry have recently started applying DCAI principles
in different application contexts. For example, the authors of Roscher et al. (2023) have
recently described the main principles of the DCAI paradigm in both remote sensing and
geospatial data applications. Their study shows that geospatial data acquisition and curation
should receive as much attention as data engineering and model development and evaluation.
In Zahid et al. (2021), the authors present a systematic review of emerg-
ing information technologies used for data modeling and analytics to achieve Data-Centric
Health-Care (DCHC) for sustainable healthcare. From the industry perspective, the authors
of Luley et al. (2023) describe a tangible, adaptable implementation of a DCAI development
process tailored for industrial applications, particularly in machining and manufacturing
sectors.

2 Research issues in DCAI

Transitioning from a Model-Centric to a DCAI paradigm addresses the idea that better data
leads to better AI systems. However, this AI paradigm shift poses significant challenges
related to training data development, inference data development and data maintenance
tasks. In particular, the key research issues in DCAI require answering the following ques-
tions:
How do we collect, select, and valorize the data for an AI research project?
The Model-Centric AI paradigm involves adopting a fixed dataset that may hide several
issues in the production phase (Seedat et al., 2024). This operational mode has driven model
advancement, as the literature has seen a proliferation of sophisticated deep neural
architectures, learning strategies and optimization methods. However, the Model-Centric AI
paradigm places a strong constraint on the nature of the data, because it assumes the data is
clean, error-free and does not evolve over time. In real scenarios, however, datasets are
imperfect: they contain label errors and missing values and do not fit the real distribution.
As reported in Northcutt et al. (2021), there are several examples of datasets with noisy
labels in the literature, like ImageNet (Russakovsky et al., 2015) or MS-COCO (Lin et al.,
2014). The authors raise two important issues related to
the presence of noisy labels: how to identify mislabeled examples and how to learn effectively
despite noisy labels. These issues should be addressed regardless of the data type or model
used. For example, the authors of Northcutt et al. (2021) introduce DCAI strategies
such as the use of Confident Learning to address the challenges of label quality, as illustrated
by the sketch below.
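The following is a minimal sketch of this idea, assuming out-of-sample probabilities obtained via cross-validation; it uses the per-class threshold heuristic behind Confident Learning rather than the full confident-joint estimation of Northcutt et al. (2021), and the classifier, dataset, and noise rate are arbitrary illustrations.

```python
# Minimal sketch of the confident-learning idea: flag samples whose given label
# disagrees with a confidently predicted class (simplified heuristic, not the
# full method of Northcutt et al., 2021).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
y_noisy = y.copy()
flip = np.random.default_rng(0).choice(len(y), size=50, replace=False)
y_noisy[flip] = (y_noisy[flip] + 1) % 3  # inject 5% label noise

# Out-of-sample predicted probabilities.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")

# Per-class confidence thresholds: mean predicted probability of class k
# over the samples labeled k.
thresholds = np.array([probs[y_noisy == k, k].mean() for k in range(3)])

# A sample is flagged when some class exceeds its threshold while the given
# label's probability stays below its own threshold.
confident = probs >= thresholds            # (n_samples, n_classes) boolean
given_ok = confident[np.arange(len(y_noisy)), y_noisy]
suspect = confident.any(axis=1) & ~given_ok
print(f"flagged {suspect.sum()} potential label issues out of {len(y_noisy)}")
```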
How can data quality and consistency be ensured in an AI research project?
In a study conducted in 2021, Google researchers (Sambasivan et al., 2021) explored
the impact of data quality on learning algorithms to extract empirical evidence of the ‘data
cascades’ phenomenon. This term denotes compounding events that cause negative downstream effects
due to data issues, leading to technical debt over time. Specifically, the study involved 53 AI
practitioners from the USA, India and East and West Africa, working in high-risk sectors such
as healthcare, agriculture, finance, public safety, environmental conservation and education.
Results show that 92% of the participants experienced at least one data cascade in their
projects, resulting in technical debt over time. Based on these premises, the literature has
recently seen the proliferation of various DCAI solutions (Peng et al., 2021; Clemente et al.,
2023) to improve data quality, in order to mitigate the phenomenon of error propagation in
the various phases of an AI project.
How should we evaluate AI systems?
The evaluation phase of a model is a crucial step to complete before releasing the model
for production. In the Model-Centric paradigm, the evaluation phase uses test datasets to
measure the models’ accuracy metrics. In the DCAI paradigm, the evaluation phase is not
limited to evaluating the model only by accuracy metrics, but it should account for various
aspects such as, for example, the dynamic nature of the data, the presence of adversaries or
the right to explanation (Zha et al., 2023). In particular, the evaluation phase opens up various
challenges, such as evaluating the resilience or reusability of a model beyond the development phase.

3 Academic and scientific activities and achievements

In recent years, there has been a boom in initiatives and research suites on the DCAI theme.
Several scientific events have been organized with the aim of raising the scientific community’s
awareness of this new paradigm. The most recent events are listed below:

– “Data Centric AI”, Workshop co-located with the Thirty-Fifth Annual Conference on
Neural Information Processing Systems (NeurIPS 2021), Virtual Conference, 2021,
https://neurips.cc/virtual/2021/workshop/21860.
– “DMLR Data-centric Machine Learning Research” (DMLR 2024), co-located with the
Forty-First International Conference on Machine Learning (ICML 2024), Vienna, Austria,
July 21-27, 2024, https://dmlr.ai/.
– “Artificial Intelligence and Data Science for Healthcare: Bridging data-centric AI and
People-Centric Healthcare” (AIDSH-KDD 24), co-located with the 30th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD 2024), August 25-29,
2024, Barcelona, Spain, https://aimel.ai.
– “1st International Workshop on Data-Centric Artificial Intelligence” (DEARING 2024),
co-located with the European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML PKDD 2024), September 13, 2024,
Vilnius, Lithuania, https://dearing2024.di.uniba.it/landing.
– “The 4th International Workshop on Data-Centric AI” (DCAI24), co-located with the 33rd
ACM International Conference on Information and Knowledge Management (CIKM 2024),
Boise, Idaho, USA, October 21-25, 2024, https://data-centric-ai-dev.github.io/CIKM2024/.
This non-exhaustive list of recent scientific events shows how the AI community
has proactively embraced this paradigm shift. In Mazumder et al. (2022), the authors
introduce a community-led benchmark suite for evaluating DCAI datasets and algorithms.
The aim of this initiative is to promote innovation in Data-Centric AI through competition,
comparability, and reproducibility. In Seedat et al. (2024), the authors propose DC-
Check, a DCAI checklist to guide researchers in developing end-to-end ML and
DL pipelines.

4 Future research trends

Despite the remarkable milestones recently achieved with the DCAI paradigm, new research
trends are already emerging:

– Inference data holds the “same weight” as training data
One limitation of the Model-Centric paradigm is that optimizing the model’s hyperparameters
to improve its performance on the evaluation data loses the concept of fit to the data
(Polyzotis & Zaharia, 2021), i.e., the extent to which the model is fully supported and
covered by the data, considering the distinctive attributes of the real environment in which
the AI system operates. This concept ensures that the data offers a
suitable representation of this environment. For example, the class distribution should be
checked to avoid possible distortions that may adversely affect the model’s performance
(Jarrahi et al., 2023).
– Transforming raw data into smart data
In García-Gil et al. (2019), the authors define smart data as the challenge of acquiring
knowledge from raw big data, i.e., transforming information into knowledge. In detail, the
authors underline the objective of smart data, that is, separating the raw part of the big
data (volume/velocity) from the intelligent part (veracity/value). Therefore, upcoming
smart data research directions will focus on extracting, from raw data, a subset of
sufficient quality to support a successful DCAI pipeline.
– Explainability as enabler of data model co-design
Co-designing input data and models is an essential process that involves collaborating to
create adaptable and transferable models. This collaborative approach has wide-ranging
implications for data monitoring as it enhances the synergy between artificial intelligence,
data science, and human decision-making processes. The primary goal is to tackle critical
data quality and model stability challenges. Based on these premises, the role of Explain-
able Artificial Intelligence (XAI) is crucial, as it can contribute to building high-quality
AI systems in collaboration with domain experts.
– Data bias and fairness
In the machine learning field, applications are commonly classified as low-risk or high-
risk. This classification depends on the application domain of the predictive model. For
example, low-risk applications include recommender systems that suggest products to buy.
In contrast, high-risk applications concern predictions regarding a patient’s health or the
granting of a loan. These applications share the same issue, i.e., the presence of bias in
the data, which can have a negative effect on the customer’s user experience (in the case
of recommendations) or people’s lives (in the case of a loan or health decision). For this
reason, the next research directions should focus on strategies to mitigate and control this
phenomenon.
– High quality
There is a need for new benchmarks that allow learning algorithms to be
evaluated not only in terms of accuracy. In addition, although bias and fairness have
been less studied in the Model-Centric paradigm, these issues have become essential
in modern machine-learning applications. In Whang et al. (2023), the authors analyze
fairness measures and inequity mitigation techniques that can be applied before, during
or after model training (a small fairness-check sketch follows this list). In this way, the
scientific community can become more aware of the need for controlled data management.
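
As a small illustration of a pre-training fairness check of the kind surveyed in Whang et al. (2023), the sketch below computes a demographic-parity difference on a toy dataset. The column names, toy records, and the 0.1 alert threshold are assumptions for illustration only.

```python
# Minimal sketch of a pre-training fairness check (assumptions: a binary
# "approved" label and a binary sensitive attribute "group"; the 0.1 threshold
# for the demographic-parity difference is arbitrary).
import pandas as pd

df = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "A"],
    "approved": [1, 1, 0, 0, 0, 1, 0, 1],
})

rates = df.groupby("group")["approved"].mean()
dp_diff = rates.max() - rates.min()  # demographic parity difference
print(rates.to_dict(), f"DP difference = {dp_diff:.2f}")

if dp_diff > 0.1:
    print("Warning: approval rates differ across groups; consider rebalancing or reweighting.")
```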

5 Articles in this issue

The era of DCAI marks a pivotal paradigm shift in AI, allowing us to build a new generation
of intelligent systems through the strategic utilization of high-quality data. This approach
accentuates the significance of ensuring that information not only facilitates learning, but
also precisely targets the specific learning requirements of AI.
This special issue aims to explore the transformative impact of recent developments in
the DCAI paradigm on the future of AI, ML and DL. The call for papers invited contribu-
tions exploring how these advancements influence the development of intelligent systems
across various domains such as business process development and maintenance, cybersecu-
rity, Earth observation, bioinformatics, energy markets, smart cities, finance, and healthcare.
We welcomed both research-oriented and practical contributions that shed light on opportuni-
ties, perspectives, and open research directions within the DCAI paradigm, inspiring further
innovations in the field. Topics of interest for this special issue included, but were not limited
to:
– High quality data preparation
– Data cleaning, denoising, and interpolation
– Novel feature engineering pipelines
– Label Errors and Confident Learning (CL)
– Selecting features and/or instances
– Performing outlier detection and removal
– Ensuring label consensus
– Producing consistent and low noise training data
– Extracting smart data from raw data
– Creating training datasets for small data problems
– Handling rare classes and explaining important class coverage in big data problems
– Incorporating human feedback into training datasets
– Combining multi-view, multi-source, multi-objective datasets
– Data-Centric machine learning and deep learning approaches
– Active learning to identify the most valuable examples to label
– Core-set learning to handle big data
– Semi-supervised learning, few-shot learning, weak supervision, and confident learning
to take advantage of a limited amount of labels or to handle label noise
– Transfer learning and self-supervised learning algorithms to achieve rich data repre-
sentations under label scarcity
– Concept drift detection to identify new data to label
– Adversarial learning to improve robustness and resilience
– Responsible and ethical AI
– Ensuring fairness, bias, ethics and diversity
– Green AI design and evaluation
– Scalable and reliable training
– Privacy-preserving and secure learning
– Reproducibility of AI
– Data benchmark creation
– Creating licensed datasets based on public resources
– Creating high quality data from low quality resources
– Data-Centric Explainable AI
– Novel XAI methods to identify possible data issues in the learning stage
– XAI methods to generate features for machine learning problems
– Applications of novel DCAI solutions
Five papers were selected from a total of twenty-eight submitted papers. They represent
different areas involved in the topics of this special issue, like high-quality data preparation,
data benchmark creation, combining multi-view, multi-source, multi-objective datasets, weak
supervision, and extracting smart data from raw data.
In Sancricca et al. (2024), the authors describe a new DCAI framework to support data
exploration and preparation, suggesting suitable cleaning tasks to obtain valuable analysis
results. The study proposes an adaptive self-service environment that can analyze and prepare
different types of sources, i.e., tabular and streaming data. The central component of the
framework is a knowledge base that collects evidence related to the effectiveness of the data
preparation actions, along with the type of input data and the considered machine learning
model. The experiments show the potential of the proposed approach in several time series
data streams.
In Andresini et al. (2024), the authors tackle the problem of collecting high-quality data
in remote sensing. The article describes a DCAI-based semantic segmentation approach to
detect forest tree dieback events due to bark beetle infestation in satellite images. Specifically,
a multisensor data set is developed using both the SAR Sentinel-1 sensor and the optical
Sentinel-2 sensor and a multi-modal approach is designed for the model development stage.
The effectiveness of the proposed approach is evaluated in a real inventory case study that
involves non-overlapping forest scenes from the northeast of France acquired in October
2018. Experiments explore the accuracy and re-usability over time of multi-modal models.
In Shah et al. (2024), the authors deal with the problem of data labelling. Specifically,
the authors propose a weak-supervision-based white-box model for extracting numerical
claims from scientific research articles. The approach leverages a few labelled examples
and labelling functions to annotate a much larger dataset, reducing the need for extensive
manual labelling. Specifically, the article describes the use of CONCORD, an open-source
labelled dataset of numerical claims, extracted from approximately 57,000 full-text scientific
articles related to COVID-19. This dataset aims to provide credible sources of information
derived from peer-reviewed publications. The proposed model aims to enable the automated
extraction of claims from full-text articles (not just abstracts) and offers a robust method for
processing large-scale scientific literature.
In Fraj et al. (2024), the authors study the problem of multi-view representation of text
data. In particular, the authors propose a new subspace multi-view text clustering method
called MVSTC. This method aims to enhance text clustering by integrating multiple text
representation models to capture various features such as syntactic, topic, and semantic
information. MVSTC leverages these diverse views to detect latent correlations between
documents and projects the data onto a topological map to reveal relationships. The method
outperforms existing multi-view clustering methods using several evaluation metrics on real-
world datasets.
Finally, in Bernardi et al. (2024), the authors describe an application in the context of business
process management. They propose BPLLM, a new
methodology to enhance conversations with Large Language Models (LLMs) and support
process-aware decision support systems (DSS). The framework analyses and describes busi-
ness processes, enhancing LLMs’ conversational capabilities in dealing with process-related
activities. The integration of a Retrieval-Augmented Generation (RAG) framework makes it
possible to acquire contextual knowledge relevant to specific user queries, allowing the model to
provide more accurate and relevant answers. The main goal of BPLLM is to assist users
in understanding and executing business processes through natural language interactions,
offering advanced and targeted support.
Acknowledgements We sincerely appreciate all the authors who contributed their papers to this special issue.
We are also grateful to the reviewers for their meticulous evaluations and insightful feedback, which greatly
enhanced the quality of the submissions. Additionally, we acknowledge the invaluable support of Zbigniew Ras,
Editor-in-Chief of the Journal of Intelligent Information Systems, and the editorial team for their constructive
guidance and timely assistance. This special issue has been developed in fulfillment of the research objectives of the
PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007)
and the Transversal Project TP7 - Data-Centric AI and Infrastructure, under the NRRP MUR program funded
by the NextGenerationEU.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License, which permits any non-commercial use, sharing, distribution and reproduction
in any medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do
not have permission under this licence to share adapted material derived from this article or parts of it. The
images or other third party material in this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative
Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.

References
Andresini, G., Appice, A., Ienco, D., et al. (2024). DIAMANTE: A data-centric semantic segmentation approach
to map tree dieback induced by bark beetle infestations via satellite images. In: Journal of intelligent
information systems. https://doi.org/10.1007/s10844-024-00877-6.
Bernardi, M. L., Casciani, A., Cimitile, M., et al. (2024). Conversing with business process-aware large
language models: the BPLLM framework. In: Journal of intelligent information systems. https://doi.org/
10.1007/s10844-024-00898-1.
Burch, M., & Weiskopf, D. (2013). On the benefits and drawbacks of radial diagrams. In: Handbook of human
centric visualization. Springer, pp. 429– 451. https://doi.org/10.1007/978-1-4614-7485-2_17.
Clemente, F., Ribeiro, G. M., Quemy, A., et al. (2023). ydata-profiling: Accelerating data-centric AI with
high-quality data. In: Neurocomputing 554. https://doi.org/10.1016/j.neucom.2023.126585.
Dekel, O., & Shamir, O. (2009). Vox Populi: Collecting High-Quality Labels from a Crowd. In: Proc. 22nd
Annual conference on learning theory (COLT), 2009. https://www.cs.mcgill.ca/~colt2009/papers/037.
pdf#page=1.
Fraj, M., HajKacem, M. A. B., & Essoussi, N. (2024). Multi-view subspace text clustering. In: Journal of
intelligent information systems. https://doi.org/10.1007/s10844-024-00897-2.
Frid-Adar, M., Klang, E., Amitai, M., et al. (2018). Synthetic data augmentation using GAN for improved liver
lesion classification. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018).
IEEE, pp. 289–293. https://doi.org/10.1109/ISBI.2018.8363576.
García-Gil, D., Luque-Sánchez, F., Luengo, J., et al. (2019). From big to smart data: Iterative ensemble filter
for noise filtering in big data classification. In: International Journal of Intelligent Systems 34.12, pp.
3260–3274. https://doi.org/10.1002/int.22193.
Jakubik, J., Vössing, M., Kühl, N., et al. (2024). Data-centric artificial intelligence. In: Business & information
systems engineering. https://doi.org/10.1007/s12599-024-00857-8.
Jarrahi, M. H., Memariani, A., Guha, S. (2023). The Principles of Data-Centric AI. In: Commun. ACM 66.8.
https://doi.org/10.1145/3571724.
Kumar, S., Datta, S., Singh, V., et al. (2024). Opportunities and Challenges in Data-Centric AI. In: IEEE
Access. https://doi.org/10.1109/ACCESS.2024.3369417.
Lin, T., Maire, M., Belongie, S., et al. (2014). Microsoft COCO: Common Objects in Context. In: Computer
vision - ECCV 2014. Ed. by David Fleet, Tomas Pajdla, Bernt Schiele, et al. Cham: Springer International
Publishing, pp. 740–755. https://doi.org/10.1007/978-3-319-10602-1_48.
Luley, P., Deriu, J. M., Yan, P., et al. (2023). From concept to implementation: The data-centric development
process for AI in industry. In: 2023 10th IEEE Swiss Conference on Data Science (SDS). IEEE, pp.
73–76. https://doi.org/10.1109/SDS57534.2023.00017.
Madry, A., Makelov, A., Schmidt, L., et al. (2019). Towards deep learning models resistant to adversarial
attacks. In: CoRR. https://doi.org/10.48550/arXiv.1706.06083.
Mazumder, M., Banbury, C. R., Yao, X., et al. (2022). DataPerf: Benchmarks for Data-Centric AI Development.
In: CoRR. https://doi.org/10.48550/arXiv.2207.10062.
Ng, A. (2022). Unbiggen AI. In: IEEE Spectrum. url: https://spectrum.ieee.org/andrew-ng-
data-centric-ai.
Northcutt, C., Jiang, L., Chuang, I. (2021). Confident learning: estimating uncertainty in dataset labels. In:
Journal of Artificial Intelligence Research 70. https://doi.org/10.1613/jair.1.12125.
Otles, E., Oh, J., Li, B. et al. (2021). Mind the performance gap: examining dataset shift during prospective val-
idation. In: Machine Learning for Healthcare Conference. PMLR, pp. 506-534. url: https://proceedings.
mlr.press/v149/otles21a.html.
Peng, J., Wu, W., Lockhart, B., et al. (2021). DataPrep.EDA: Task-centric exploratory data analysis for statistical
modeling in python. In: Proceedings of the 2021 international conference on management of data, pp.
2271– 2280. https://doi.org/10.1145/3448016.3457330.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. In: Communications of the ACM
45.4. https://doi.org/10.1145/505248.506010.
Polyzotis, N., & Zaharia, M. (2021). What can Data-Centric AI Learn from Data and ML Engineering? In:
CoRR. https://doi.org/10.48550/arXiv.2112.06439.
Riquelme, J. C., Aguilar-Ruiz, J. S., & Toro, M. (2003). Finding representative patterns with ordered projec-
tions. In: Pattern Recognition 36.4. https://doi.org/10.1016/S0031-3203(02)00119-X.
Roscher, R., Rußwurm, M., Gevaert, C., et al. (2023). Data-centric machine learning for geospatial remote
sensing data. In: CoRR. https://doi.org/10.48550/arXiv.2312.05327.
Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. In:
International journal of computer vision 115.3. https://doi.org/10.1007/s11263-015-0816-y.
Sambasivan, N., Kapania, S., Highfill, H., et al. (2021). Everyone wants to do the model work, not the data
work: Data Cascades in High-Stakes AI. In: Proceedings of the 2021 CHI conference on human factors
in Computing Systems. CHI ’21. Yokohama, Japan: Association for Computing Machinery. https://doi.
org/10.1145/3411764.3445518.
Sancricca, C., Siracusa, G., Cappiello, C. (2024). Enhancing data preparation: Insights from a time series case
study. In: Journal of intelligent information systems. https://doi.org/10.1007/s10844-024-00867-8.
Seedat, N., Imrie, F., & van der Schaar, M. (2024). Navigating Data-Centric Artificial Intelligence With DC-
Check: Advances, Challenges, and Opportunities. In: IEEE Transactions on Artificial Intelligence 5.6.
https://doi.org/10.1109/TAI.2023.3345805.
Shah, D., Shah, K., Jagani, M., et al. (2024). CONCORD: Enhancing COVID-19 research with weak-
supervision based numerical claim extraction. In: Journal of intelligent information systems. https://
doi.org/10.1007/s10844-024-00885-6.
Stonebraker, M., Bruckner, D., Ilyas, I. F., et al. (2013). Data curation at scale: The data tamer system. In: Sixth
biennial conference on innovative data systems research, CIDR 2013, Asilomar, CA, USA, January 6-
9, 2013, online proceedings. Vol. 2013. url: https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.
pdf.
Stonebraker, M., & Ilyas, I. F. (2018). Data Integration: The Current Status and the Way Forward. In: IEEE
Data engineering bulletin 41.2, pp. 3–9. url: http://sites.computer.org/debull/A18june/p3.pdf.
Subramonyam, H., Seifert, C., & Adar, M. E. (2021). How can human-centered design shape data-centric AI.
In: Proceedings of the NeurIPS Data-Centric AI Workshop. https://www.cond.org/humandataai.pdf.
Van Aken, D., Pavlo, A., Gordon, G. J., et al. (2017). Automatic database management system tuning through
large-scale machine learning. In: Proceedings of the 2017 ACM international conference on management
of data, pp. 1009–1024. https://doi.org/10.1145/3035918.3064029.
Wan, M., Zha, D., Liu, N., et al. (2023). In-processing modeling techniques for machine learning fairness:
A survey. In: ACM Transactions on knowledge discovery from data 17.3, pp. 1–27. https://doi.org/10.
1145/3551390.
Whang, S. E., Roh, Y., Song, H., et al. (2023). Data collection and quality challenges in deep learning: A
data-centric AI perspective. In: The VLDB Journal 32.4, pp. 791–813. https://doi.org/10.1007/s00778-
022-00775-9.
Zahid, A., Kay Poulsen, J., Sharma, R., et al. (2021). A systematic review of emerging information technologies
for sustainable data-centric healthcare. In: International Journal of Medical Informatics 149. https://doi.
org/10.1016/j.ijmedinf.2021.104420.
Zha, D., Bhat, Z. P., Lai, K., et al. (2023). Data-centric AI: Perspectives and challenges. In: Proceedings of
the 2023 SIAM international conference on data mining (SDM). SIAM, pp. 945–948. https://doi.org/10.
1137/1.9781611977653.ch106.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
