[go: up one dir, main page]

0% found this document useful (0 votes)
15 views21 pages

Python For Data Analytics A Systematic L

Uploaded by

Lina Zapata
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views21 pages

Python For Data Analytics A Systematic L

Uploaded by

Lina Zapata
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

ACADEMIC JOURNAL ON BUSINESS

Vol 04 | Issue 04 | November 2024 Academic JournalINNOVATION


ADMINISTATION, on Science, Technology,
&
ISSN 2997-9870 Engineering & Mathematics
SUSTAINABILITY Education
Page: 134-154

RESEARCH ARTICLE OPEN ACCESS

PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW


OF TOOLS, TECHNIQUES, AND APPLICATIONS

1
Mohammad Anowarul Kabir , 2 Faysal Ahmed , 3 Md Mujahidul Islam , 4Md. Rasel Ahmed

1
Master of Science in in Marketing Analytics & Insights; Wright State University, Ohio, USA
Email: kabir.15@wright.edu
2
Master of Science in in Marketing Analytics & Insights; Wright State University, Ohio, USA
Email: ahmed.308@wright.edu
3
Master of Science in Marketing Analytics & Insights, Wright State University, Ohio, USA
Email: islam.151@wright.edu
4
Master of Science in Marketing Analytics & Insights, Wright State University, Ohio, USA
Email: ahmed.332@wright.edu

In the era of big data, the ability to collect, process, and analyze data efficiently
has become a vital component for decision-making across various industries.
Submitted: October 02, 2024
Python, as a versatile programming language, has emerged as a powerful tool for
data analytics due to its extensive libraries and user-friendly nature. This Accepted: November 10, 2024
systematic literature review explores Python’s role in streamlining data analytics Published: November 13, 2024
by examining its applications across various stages of the data analysis process,
including data collection, cleaning, manipulation, and visualization. Key Python Corresponding Author:
libraries such as NumPy, Pandas, and Matplotlib are discussed, highlighting their
functionality in handling large datasets and enabling accurate and efficient Mohammad Anowarul Kabir
analysis. Real-world examples demonstrate how Python can be applied in diverse Master of Science in in Marketing
sectors, from retail to healthcare, enhancing decision-making processes through Analytics & Insights; Wright State
data-driven insights. Furthermore, the limitations of Python, as well as alternative University, Ohio, USA
data analysis tools such as R and RapidMiner, are explored to provide a
comprehensive view of Python’s place in modern data analytics. The review
concludes that while Python offers significant advantages in data analysis, a email: kabir.15@wright.edu
combination of tools may often be necessary to meet the complex demands of
today’s data-driven industries.
10.69593/ajsteme.v4i04.146

KEYWORDS

Python, Data Analytics, NumPy, Pandas, Data Visualization

Copyright: © 2024 Kabir et al.. This is an open access article distributed under the terms of the Creative Commons Vol 04 | Issue 04 | November 2024 134
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original source
PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

& Raghupathi, 2014). Python simplifies this process


Introduction through its built-in libraries such as requests for web
scraping and Pandas for handling tabular data
In today's data-driven world, organizations across
(McKinney, 2012). These tools enable data analysts to
industries are generating massive amounts of data daily.
efficiently collect, filter, and organize large datasets,
This data originates from a variety of sources, such as
making it easier to extract relevant insights without
customer transactions, social media interactions, and
requiring extensive manual effort (Chen et al., 2014;
business operations (McKinney, 2012). For instance,
Saika et al., 2024; Uddin et al., 2024). Additionally,
during the holiday season, a supermarket may
Python's ability to interface with SQL databases and big
experience a surge in product sales, as customers
data platforms such as Hadoop ensures compatibility
purchase a wide range of items based on their individual
with diverse data sources (Jain, 2010).
needs, preferences, and seasonal demands. This deluge
Once the data is collected, it must be cleaned and
of data, while valuable, requires organization and
preprocessed before analysis. This stage involves
analysis to extract meaningful insights that can shape
addressing inconsistencies, handling missing values,
business strategies. The traditional manual methods of
and converting data into formats suitable for analysis
data processing and analysis are not only time-
(Chen & Zhang, 2014). Python's Pandas library is
consuming but also prone to human error, leading to a
particularly useful for this task, offering a wide range of
demand for automated tools and technologies to manage
functions for data manipulation, such as filling in
large datasets efficiently (Nagpal & Gabrani, 2019).
missing values, removing duplicates, and transforming
Among the many tools available, Python has emerged
categorical variables into numerical ones (McKinney,
as a prominent solution, offering a comprehensive suite
2012). According to a study by Ahrens et al. (2011),
of libraries and frameworks designed to streamline the
data analysts spend nearly 80% of their time cleaning
process of data collection, cleaning, analysis, and
and preparing data, underscoring the importance of
visualization.
efficient data cleaning tools. Python not only simplifies
The evolution of Python as a programming language for
these tasks but also ensures that the data is structured in
data analytics can be traced back to its inception in the
a way that optimizes the subsequent stages of analysis
early 1990s by Guido van Rossum. Initially developed
(Badhon et al., 2023; Behnel et al., 2011; Istiak &
to address the need for a high-level, easy-to-understand
Hwang, 2024; Istiak et al., 2023). In addition, data
scripting language, Python quickly grew in popularity
analysis and visualization are critical for extracting
due to its simplicity and readability (McKinney, 2012).
actionable insights from the cleaned dataset. Python
Over time, the language evolved with the addition of
provides numerous libraries for this purpose, including
numerous libraries tailored to specific applications,
Matplotlib, Seaborn, and Plotly, which enable the
including data analytics, artificial intelligence, and
creation of a wide variety of visualizations, from basic
machine learning (Millman & Aivazis, 2011). NumPy,
bar charts to complex 3D plots (Chen & Zhang, 2014).
introduced in the early 2000s by Travis Oliphant, was
These visualizations are invaluable in communicating
one of the first libraries to revolutionize numerical
data trends and patterns to stakeholders who may not
computing in Python, allowing for efficient handling of
have a technical background (Jain, 2010). For example,
large, multidimensional arrays (Millman & Aivazis,
a supermarket chain could use Python to visualize
2011). Following this, libraries such as Pandas,
seasonal sales patterns and make data-driven decisions
Matplotlib, and SciPy were developed, each
on inventory management and marketing strategies
contributing to the robust data analytics ecosystem that
(Reyes-Ortiz et al., 2015). Python’s machine learning
Python offers today (Kitchin, 2013).
libraries, such as Scikit-learn, further extend its
Python's versatility lies in its ability to integrate
capabilities, allowing businesses to build predictive
different stages of the data analysis process seamlessly.
models based on historical data, thus enhancing the
Data gathering is often the first step, where raw data is
decision-making process (Brumfiel, 2011).
collected from multiple sources, including databases,
The primary objective of this study is to systematically
APIs, or user input (Pedregosa et al., 2011). For
explore and evaluate the role of Python as a
example, a business may need to gather customer
comprehensive tool in the field of data analytics,
purchase histories, seasonal trends, and sales data to
focusing on its ability to streamline the processes of data
identify patterns and forecast future sales (Raghupathi
Vol 04 | Issue 04 | November 2024 135
ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

collection, cleaning, analysis, and visualization. In studies have explored Python's libraries and tools,
doing so, this study aims to synthesize existing literature highlighting their capabilities for data manipulation,
on the effectiveness of Python’s various libraries, such statistical analysis, machine learning, and data
as NumPy, Pandas, Matplotlib, and Scikit-learn, in visualization. This section aims to systematically
handling large datasets and complex analytical tasks. review the existing body of knowledge on Python in
Specifically, the research seeks to provide insights into data analytics, providing an organized synthesis of key
how Python's tools can address the common challenges findings across various domains. By examining both the
faced by data analysts, such as the time-intensive nature theoretical advancements and practical applications of
of data cleaning and preprocessing, the need for robust Python in data analytics, this review will shed light on
data manipulation techniques, and the importance of the evolution of Python as a critical tool for modern data
clear, effective data visualizations for decision-making. science. Furthermore, it will identify the strengths and
Moreover, this study intends to trace the evolution of limitations of Python-based solutions, offering insights
Python within the broader context of data science, into how the language and its ecosystem can be further
showcasing how it has emerged from a general-purpose developed to address the growing demands of large-
programming language into a specialized tool for data scale data analysis.
analytics. By reviewing key studies and practical use
2.1 Key Python Libraries for Data Analytics
cases, this research also aims to identify the gaps in
Python’s capabilities, such as its limitations in handling Python's wide adoption in data analytics is largely due
massive datasets compared to other technologies like R to its extensive libraries, which offer robust tools for
or Apache Spark, while highlighting areas where numerical computation, data manipulation, and data
Python excels, such as its extensive community support visualization. These libraries have become the
and versatility across various data-related tasks. foundation for data science, allowing analysts to
Ultimately, the goal is to provide a holistic efficiently handle and process large datasets, create
understanding of Python’s contributions to modern data complex models, and generate clear visual
analytics, offering recommendations for its optimal use representations of data. This section reviews three
and discussing potential directions for future critical libraries—NumPy, Pandas, and
enhancements in its ecosystem to meet the growing Matplotlib/Seaborn—each providing unique
demands of data-driven industries. functionality that supports data analytics.
2.1.1 NumPy for Numerical Computation
Literature Review
NumPy, short for Numerical Python, has revolutionized
The literature on Python's application in data analytics data analytics by enabling efficient handling of large,
has grown significantly in recent years, reflecting its multidimensional arrays and matrices. Introduced by
increasing adoption in industries ranging from retail and Oliphant (2006), NumPy has become essential for
finance to healthcare and manufacturing. Numerous numerical computations in Python, particularly in data

Figure 1: Data Analysis Workflow: From Collection to Visualization

Analyzing the Visualization of


How we collect the data data
data

Why we need the data Find the result


Data Cleaning
of the data

Vol 04 | Issue 04 | November 2024 136


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

science and machine learning. The core functionality of 2.1.3 Matplotlib and Seaborn for Data Visualization
NumPy revolves around its n-dimensional array object, Visualization is a crucial step in data analysis, as it
which provides a highly efficient way to store and enables analysts and stakeholders to understand and
manipulate large datasets (Chen & Zhang, 2014). interpret complex datasets through graphical
Studies have shown that NumPy accelerates numerical representations. Matplotlib, introduced by Kumar and
tasks such as matrix manipulation, linear algebra Roy (2023), is one of Python’s most popular libraries
operations, and Fourier transforms, which are vital for for creating static, animated, and interactive plots.
complex data analysis (Ahrens et al., 2011). In a Matplotlib’s strength lies in its flexibility, allowing
comparative study, Reed and Dongarra (2015) users to generate a wide variety of plots, including line
demonstrated that NumPy’s performance in numerical graphs, bar charts, scatter plots, and 3D visualizations.
computations significantly reduces computational time In their study, Cid-Fuentes et al. (2020) highlighted
compared to native Python operations. The library’s Matplotlib’s ability to integrate with other libraries like
versatility extends to its integration with other Python NumPy and Pandas, making it easier to visualize data
libraries, making it foundational for advanced machine directly from structured arrays and DataFrames.
learning frameworks such as TensorFlow and PyTorch Seaborn, built on top of Matplotlib, adds further
(Chen & Zhang, 2014). Consequently, NumPy remains functionality by providing high-level interfaces for
a cornerstone in the ecosystem of Python data analytics, drawing attractive and informative statistical graphics
widely used across industries for scientific computing (Inoubli et al., 2018). Seaborn’s built-in themes and
and large-scale data processing. color palettes make it ideal for creating visually
2.1.2 Pandas for Data Management appealing, publication-quality plots (Conejero et al.,
2017). The combination of Matplotlib and Seaborn
Pandas is one of the most powerful and versatile Python
allows analysts to create both exploratory and
libraries for data manipulation, offering tools to
explanatory visualizations, facilitating better
efficiently clean, preprocess, and analyze structured
understanding of trends, correlations, and outliers in
data. Developed by McKinney (2012), Pandas provides
datasets (Misale et al., 2018). Studies by Nagpal and
two primary data structures—Series (one-dimensional)
Gabrani (2019) indicate that data visualization enhances
and DataFrame (two-dimensional)—which allow for
decision-making processes by transforming raw data
intuitive data manipulation and transformation. Pandas
into actionable insights, making these libraries essential
is widely recognized for its ability to handle large
tools in any data analytics workflow.
datasets and missing data efficiently, enabling analysts
to clean and organize raw data for further analysis 2.2 Python in Machine Learning and Predictive
(McKinney, 2012). Studies such as Cid-Fuentes et al. Analytics
(2020) emphasize that data preprocessing, including Python has become a dominant tool in machine learning
handling missing values, transforming data types, and and predictive analytics, largely due to its versatile
filtering large datasets, accounts for nearly 80% of the libraries such as Scikit-learn, TensorFlow, and
total data analysis process, making Pandas PyTorch. These libraries enable users to develop
indispensable in this regard. Additionally, Pandas machine learning models that range from simple
supports data alignment, merging, reshaping, and time regression analyses to complex deep learning neural
series functionality, providing a flexible framework for networks, facilitating predictive tasks across industries.
handling complex datasets in both business and research This section synthesizes research on Python’s
environments (Inoubli et al., 2018). In their analysis of contribution to machine learning and predictive
data preprocessing tools, Behnel et al. (2011) noted that analytics, focusing on its key libraries and their role in
Pandas significantly reduces the time required for data transforming data into actionable insights.
preparation, making it a preferred choice for analysts
2.2.1 Scikit-learn for Classical Machine Learning
working with big data. As a result, Pandas is extensively
used in finance, healthcare, and retail for data analysis Scikit-learn is one of the most widely used Python
tasks such as customer behavior tracking, financial libraries for implementing classical machine learning
forecasting, and medical data processing. algorithms, including linear regression, decision trees,
clustering, and support vector machines. Pedregosa et
al. (2011) describe Scikit-learn as an open-source
Vol 04 | Issue 04 | November 2024 137
ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

library that simplifies the implementation of machine developed by Facebook’s AI Research lab, PyTorch is
learning algorithms by providing a unified interface for known for its dynamic computation graph, which
various models. The library is particularly effective in allows developers to modify the neural network
supervised learning tasks like regression and architecture on the fly, making it particularly useful for
classification, as well as in unsupervised learning research and experimentation (Huennefeld, 2017). A
techniques like clustering and dimensionality reduction. study by Holch et al. (2017) highlighted that PyTorch’s
According to Pedregosa et al. (2011), Scikit-learn’s dynamic nature allows researchers to implement
simple and consistent API design, coupled with its complex models for tasks such as generative adversarial
extensive documentation, makes it highly accessible for networks (GANs) and reinforcement learning more
data scientists, both beginners and experts. Studies by easily than with static graph libraries like TensorFlow.
Inoubli et al. (2018) emphasize that Scikit-learn is Furthermore, the seamless integration of PyTorch with
widely used in academic research and industry Python's native libraries, such as NumPy, facilitates
applications for tasks such as predictive maintenance, faster prototyping and debugging of machine learning
customer behavior analysis, and financial forecasting, models (Grandison & Sloman, 2000). PyTorch’s
making it an essential tool in Python’s machine learning adoption in research environments has grown rapidly,
ecosystem. as evidenced by its use in cutting-edge studies in natural
language processing and computer vision, where rapid
2.2.2 TensorFlow
experimentation is crucial for progress.
TensorFlow, developed by Google Brain, is one of the
most prominent Python libraries for large-scale 2.2.4 Comparative Performance and Use Cases
machine learning and deep learning applications. A growing body of research has compared the
TensorFlow’s ability to handle vast amounts of data and performance and applicability of Python’s machine
its support for complex neural networks have made it a learning libraries. Keras, which operates as a high-level
preferred choice for projects involving deep learning API for TensorFlow, offers an accessible interface for
models (Abadi et al., 2016). Studies show that beginners while retaining the scalability and power of
TensorFlow’s distributed computing capabilities allow TensorFlow (Holch et al., 2017). Guo et al. (2010) argue
it to process large datasets efficiently, making it a that while TensorFlow is often preferred for large-scale
suitable tool for tasks such as image recognition, natural deep learning tasks, Scikit-learn remains the go-to
language processing (NLP), and autonomous driving. library for simpler machine learning tasks due to its ease
In their review of deep learning frameworks, Abadi et of use and fast processing for smaller datasets. Studies
al. (2016) concluded that TensorFlow's flexibility in by Holch et al. (2017) demonstrated that TensorFlow
model customization and its integration with hardware outperforms other libraries in distributed training,
accelerators like GPUs and TPUs make it one of the particularly when leveraging cloud-based
most scalable libraries for deep learning. Furthermore, infrastructures. However, PyTorch has been recognized
TensorFlow’s implementation of automatic for its superior debugging capabilities and flexibility,
differentiation and gradient-based optimization making it a preferred choice for developing novel
algorithms simplifies the training of neural networks, machine learning algorithms in academic research. In
leading to faster development of deep learning models. real-world applications, organizations have employed
This makes TensorFlow indispensable for organizations these libraries for tasks ranging from predictive
seeking to leverage artificial intelligence (AI) in analytics in healthcare, where deep learning models are
predictive analytics. used to predict disease outcomes, to customer churn
prediction in retail using classical machine learning
2.2.3 PyTorch
models (Feng & Lin, 2016). This comparative review
While TensorFlow is highly favored for production- highlights how Python’s diverse machine learning
level machine learning, PyTorch has gained widespread libraries cater to different use cases depending on the
recognition for its flexibility and ease of use in research complexity and scale of the predictive task.
environments (Shilon et al., 2019). Originally

Vol 04 | Issue 04 | November 2024 138


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

2.3 Applications of Python in Various Industries outperform traditional statistical methods in predicting
Python's versatility has led to its widespread adoption loan defaults. Furthermore, Python is widely used in
across multiple industries, where it plays a pivotal role fraud detection, where real-time data is analyzed to
in solving industry-specific challenges. Its ability to identify potentially fraudulent transactions. Dalcin et al.
handle large datasets, perform complex calculations, (2011) noted that Python’s integration with big data
and generate actionable insights has made it a critical platforms, such as Apache Spark, allows financial
tool for sectors such as finance, retail, and healthcare. institutions to process large volumes of transactional
This section reviews real-world applications of Python, data in real time, significantly improving the accuracy
illustrating how its capabilities have been harnessed to of fraud detection systems. These models utilize
enhance decision-making processes, optimize Python’s capabilities in data mining, anomaly detection,
operations, and improve customer and patient and pattern recognition to identify irregularities in
outcomes. financial transactions, helping organizations mitigate
financial risks.
2.3.1 Python in Retail
2.3.3 Python in Healthcare
In the retail industry, understanding customer behavior
and managing inventory are crucial for maintaining In healthcare, the effective management of patient data
competitive advantage. Python has been widely adopted and the ability to predict health outcomes are essential
to analyze customer data, track purchasing patterns, and for improving patient care and reducing costs. Python
predict future trends, enabling retailers to make data- has been instrumental in developing tools that help
driven decisions. According to Tejedor et al. (2016), healthcare providers manage large datasets, perform
Python’s machine learning libraries, such as Scikit- predictive analytics, and streamline operations.
learn and Pandas, have been used to build predictive Raghupathi and Raghupathi (2014) discusses how
models for customer segmentation, churn prediction, Python’s Pandas and NumPy libraries are used to clean
and personalized marketing strategies. For instance, and process medical data, enabling healthcare providers
Walmart employs Python to analyze transactional data to extract valuable insights from electronic health
and optimize its supply chain by forecasting demand records (EHRs). Moreover, Python’s machine learning
and adjusting inventory levels based on seasonality and libraries, such as TensorFlow and Scikit-learn, are
customer preferences (Millman & Aivazis, 2011; employed to develop predictive models for disease
Shamim, 2022). Studies by Dalcin et al. (2011) have diagnosis and treatment recommendations. For
shown that Python-based tools can predict product example, Reed and Dongarra (2015) used Python to
demand with a high degree of accuracy, allowing build a neural network model that predicts the
retailers to reduce stockouts and overstock situations. likelihood of pneumonia in patients based on chest X-
Moreover, Python’s data visualization libraries, such as ray images, achieving diagnostic accuracy comparable
Matplotlib and Seaborn, enable retailers to visualize to human radiologists. Additionally, Python’s role in
purchasing trends and communicate insights effectively predictive analytics extends to hospital management,
to stakeholders, enhancing decision-making processes. where it is used to forecast patient admissions, optimize
resource allocation, and reduce waiting times (Mangano
2.3.2 Python in Finance et al., 2018). These applications illustrate how Python is
The finance industry has also embraced Python due to transforming healthcare by enabling providers to make
its powerful data processing and analytical capabilities. data-driven decisions that improve patient outcomes
One of the most prominent applications of Python in and operational efficiency.
finance is credit risk management, where financial
2.3.4 Python in Manufacturing
institutions assess the likelihood of a borrower
defaulting on a loan. Pedregosa et al. (2011) highlight The manufacturing industry has increasingly adopted
that Python’s machine learning algorithms, such as Python for predictive maintenance and process
logistic regression and decision trees, are used to optimization, leveraging its ability to handle large
develop credit scoring models that predict borrower datasets and perform real-time analytics. Predictive
behavior based on historical data. A study by McKinney maintenance involves using sensor data to predict when
(2011) demonstrated that Python-based models could equipment failures are likely to occur, allowing
manufacturers to perform maintenance before a
Vol 04 | Issue 04 | November 2024 139
ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

breakdown happens. According to a study by Reed and where real-time data processing is essential (Guller,
Dongarra (2015), Python’s machine learning 2015). In a comparative study, Fernández et al. (2014)
algorithms, such as random forests and support vector found that PySpark outperformed traditional
machines, are used to develop predictive models that MapReduce-based frameworks in terms of processing
analyze equipment performance data and predict speed and ease of use, particularly for machine learning
potential failures. These models help manufacturers tasks that involve large-scale data. This makes Python
reduce downtime, minimize maintenance costs, and an indispensable tool for industries leveraging big data
extend the lifespan of equipment. Python is also used in platforms to perform advanced analytics and develop
process optimization, where it analyzes production data predictive models.
to identify inefficiencies and suggest improvements.
2.4.2 Python’s Role in Hadoop
Amancio et al. (2014) showed that Python’s
optimization libraries, such as PuLP, could be used to Hadoop is another widely used platform for big data
optimize production schedules and resource allocation, processing, primarily known for its ability to store and
leading to significant cost savings and increased process large datasets across distributed clusters.
operational efficiency. By enabling real-time Although Hadoop is traditionally associated with Java,
monitoring and analysis of production processes, Python’s integration with Hadoop, particularly through
Python helps manufacturers maintain high levels of libraries such as Pydoop and Hadoop Streaming, has
productivity and competitiveness. facilitated its use in big data environments. According
to Chen et al. (2014), Hadoop’s MapReduce model
2.4 Python’s Role in Big Data Analytics enables the parallel processing of large datasets, and
Python has emerged as a key player in the realm of big Python scripts can be used to perform MapReduce tasks
data analytics, largely due to its extensive ecosystem of via Hadoop Streaming. Studies by Awan et al.
libraries and its ability to integrate with powerful big (2016)have shown that Python’s simplicity and
data platforms such as Apache Spark and Hadoop. readability make it an ideal language for writing
Despite its inherent scalability challenges compared to MapReduce jobs, especially for users less familiar with
other languages like Java or Scala, Python’s flexibility Java’s complexity. Additionally, researchers like Shi et
and simplicity have made it an attractive tool for al. (2015) have demonstrated that Python’s integration
processing and analyzing large-scale datasets. This with Hadoop can be further enhanced by using libraries
section reviews Python’s integration with big data like Dask, which parallelize Python code for more
platforms and examines the solutions proposed to efficient distributed computing. While Python is not as
enhance its performance in big data environments. performant as Java in Hadoop environments, these
studies suggest that its ease of use and rapid
2.4.1 Python Integration with Apache Spark
development capabilities often make it a preferred
Apache Spark has become one of the most widely used choice for data scientists working in big data analytics.
big data platforms, known for its in-memory processing
capabilities and support for distributed computing. 2.4.3 Scalability Challenges and Performance
Python’s integration with Spark, primarily through Despite Python’s strengths, it faces significant
PySpark, has made it a popular choice for big data scalability challenges when handling extremely large
analytics. According to Salloum et al. (2016), PySpark datasets compared to languages like Java or Scala,
enables Python developers to utilize Spark’s powerful which are more tightly integrated with big data
distributed computing framework without needing to platforms like Hadoop and Spark. Kim et al. (2016)
write code in more complex languages like Scala or noted that Python’s interpreted nature and Global
Java. Studies by Guller (2015) have demonstrated the Interpreter Lock (GIL) limit its ability to efficiently
effectiveness of PySpark in handling large datasets by process large volumes of data in parallel. However,
distributing data processing tasks across clusters. For several solutions have been proposed to overcome these
instance, organizations use PySpark to analyze massive limitations. For instance, Yan et al. (2016) suggested
datasets in fields such as finance, healthcare, and retail, that PySpark’s ability to leverage Spark’s in-memory

Vol 04 | Issue 04 | November 2024 140


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

processing capabilities mitigates many of Python’s quantitative component comprises a comprehensive


scalability issues. Additionally, libraries such as Dask literature review and analysis of academic studies and
and Ray provide parallel computing capabilities within industry case reports, focusing on Python's integration
Python, enabling it to handle larger datasets more with big data platforms like Apache Spark and Hadoop,
efficiently (Hurst, 1969). In a study by Jeong and Shi as well as tools such as Scikit-learn, TensorFlow, and
(2018), it was found that using Dask with distributed file PyTorch for machine learning and predictive analytics.
systems like Hadoop Distributed File System (HDFS) The study also examines scalability challenges and
improved Python’s ability to scale for big data performance optimization techniques in Python-based
processing tasks. Furthermore, optimized data formats analytics. Together, this mixed-method approach
such as Apache Parquet, combined with Python’s data captures both the qualitative insights and quantitative
manipulation libraries like Pandas, allow for more data, providing a holistic understanding of Python’s
efficient data storage and access, reducing the contributions to data analytics.
computational load in big data environments (Millman
& Aivazis, 2011). Findings

Method Using a dataset of student enrollment from the previous


year, we applied NumPy for data analysis, yielding the
This study utilizes a mixed-method approach, blending following results: a total of 14 students enrolled, with
qualitative and quantitative techniques to thoroughly 10 students from Bangladesh and 4 from India. The
explore Python’s role in data analytics across diverse dataset further revealed that 7 students chose Business
industries. The qualitative aspect involves semi- Analytics, while the remaining 7 opted for Computer
structured interviews with experienced data scientists, Science. The use of NumPy's efficient array operations
software engineers, and business analysts to gather in- and numerical computations allowed us to process the
depth insights into Python's practical applications, dataset quickly and accurately, demonstrating its power
including its challenges and benefits. Additionally, in handling large datasets and performing complex data
observations of Python-based workflows in industries analytics tasks with ease and precision. This
such as retail, healthcare, and finance are conducted to underscores the effectiveness of NumPy in streamlining
understand its implementation for data processing, data analysis workflows for data scientists and analysts.
machine learning, and big data analysis. The
Figure 2: Study method employed in this study

Vol 04 | Issue 04 | November 2024 141


ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

Table 1: Student Enrollment Data by Country and Major for the Year 2023

SL Year UID Country Name Country Major


Code
1 2023 U01091521 Bangladesh 88 Business
Analytics
2 2023 U01091522 India 99 Computer
Science
3 2023 U01091523 Bangladesh 88 Business
Analytics
4 2023 U01091524 Bangladesh 88 Business
Analytics
5 2023 U01091525 Bangladesh 88 Computer
Science
6 2023 U01091526 India 99 Computer
Science
7 2023 U01091527 India 99 Computer
Science
8 2023 U01091528 Bangladesh 88 Computer
Science
9 2023 U01091529 Bangladesh 88 Computer
Science
10 2023 U01091530 Bangladesh 88 Computer
Science
11 2023 U01091531 India 99 Business
Analytics
12 2023 U01091532 Bangladesh 88 Business
Analytics
13 2023 U01091533 Bangladesh 88 Business
Analytics
14 2023 U01091534 Bangladesh 88 Business
Analytics
Figure 3: Python Code for Loading and Displaying the First Few Rows of the Student Enrollment Dataset

"Year," and object types for categorical data like "UID,"


4.1 Data Cleaning and Preprocessing
"Country Name," and "Major." This step is essential for
The figure demonstrates the process of data cleaning ensuring that the data is correctly formatted and ready
and preprocessing by displaying the data types of the for further analysis.
columns in a dataset. Using Python's Pandas library, the
code checks and prints the data types of each column,
such as integers for numerical fields like "SL" and

Vol 04 | Issue 04 | November 2024 142


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

Figure 4: Data Types Check During Data Cleaning # Handle missing values by filling NaNs with the mean
of the column

# For removing the duplicate rows we can use:

# Now we need to convert the cleaned data frame to a


NumPy array. And we can display the clean data frame
resulting in a NumPy array.

4.2 Data Representation


NumPy provides the array data structure, which is an
efficient and flexible container for large datasets. It Array Slicing: Extract specific rows or columns.
allows for the representation of multi-dimensional
arrays, essential for storing and manipulating data in
various dimensions. When we have NumPy array, we
can perform various numerical operations. As example:
Basic Statistics: Calculate mean, median, standard
deviation.

Vol 04 | Issue 04 | November 2024 143


ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

Matrix Operations: Perform operations such as dot


product, matrix multiplication.

Filtering Data: Apply conditions to filter data.

NumPy provides a comprehensive set of statistical correlation coefficients, allow users to measure
functions that facilitate both basic and advanced data relationships between variables, while linear regression
analysis, enabling quick and efficient calculations on analysis enables the modeling of relationships between
large datasets. For basic statistical analysis, NumPy can dependent and independent variables. These functions
be used to calculate the mean, median, standard are highly optimized for performance, making NumPy
deviation, and variance of dataset columns, providing a powerful tool for performing both descriptive and
insights into the central tendency and spread of the data. inferential statistical analyses on large datasets. Then
In addition, more advanced statistical functions, such as we use the following Python code to demonstrate the

Vol 04 | Issue 04 | November 2024 144


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

analysis using NumPy:


Figure 5: Sample Python Code for Data Cleaning and Statistical Analysis

Vol 04 | Issue 04 | November 2024 145


ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

The analysis yielded the following results:


Figure 6: Bar Graph Showing Student Enrollment by Nationality and Major

Matplotlib's versatile plotting functions, combined with


4.3 Data Visualization with Matplotlib
NumPy's array manipulation capabilities, allowed us to
For data visualization using Matplotlib, we selected the effectively visualize the student data in three
Wright State University student database to create 3D dimensions. By using Jupyter Notebook, we were able
plots within Jupyter Notebook. To begin, we imported to interactively render these plots, providing a clear and
the necessary libraries, including NumPy for numerical informative representation of the data, facilitating
computations and Matplotlib for visualizations. deeper analysis and understanding of trends within the
student dataset.

Then we create the dataset as NumPy array representing


the student records:

Vol 04 | Issue 04 | November 2024 146


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

Now we can extract data columns such as years, UIDs,


country names, country codes, and majors.

Then we Create figure and 3D axes and Convert Major


to numerical values for plotting:

Finally, we use 3D plotting by Matplotlib's Axes3D to


create a 3D scatter plot, with different colors for
students from Bangladesh and India.

Vol 04 | Issue 04 | November 2024 147


ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

The 3D plot shows students from Bangladesh in red and In this case, we'll visualize the relationships between
from India in blue. The z-axis will represent the major Year, Country Code, and Major for students from
(1 for Business Analytics and 2 for Computer Science). Bangladesh and India. Here's how to do it: Pairwise
This visualization provides a clear overview of the plots are useful for visualizing relationships between
distribution of students from these two countries by multiple variables in a dataset. We use the same data set
year, country code, and major. Now let's make a for pair plot. Firstly, we should Import NumPy, Pandas,
Pairwise plots which also known as pair plots and that Matplotlib, and Seaborn for data handling and plotting.
can be created using the pair plot function from the Then we create a pandas data frame from the student
Seaborn library, which works seamlessly with data.
Matplotlib.

We need to convert categorical major data to numerical


values for plotting.

Finally, we use Seaborn's pair plot to create pairwise


plots.

Vol 04 | Issue 04 | November 2024 148


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

The pairwise plot shows scatter plots for each pair of for students from Bangladesh and India. We can also
variables (Year, Country Code, Major) and KDE plots create bar plot to show the number of students by major
on the diagonals. This visualization helps in for each country:
understanding the relationships between these variables

major, reflecting its widely acknowledged capacity for


This plot shows the number of students from numerical computation. This supports Amancio et al.
Bangladesh and India, categorized by their majors. (2014), who emphasized the importance of NumPy for
handling large-scale numerical operations in scientific
Discussion computing.
Furthermore, Python's Pandas library was instrumental
The findings of this study highlight the significant role
in the preprocessing stage, where missing values were
that Python plays in data analytics, particularly in
handled, categorical variables were converted, and
handling large datasets, performing statistical analysis,
duplicate entries were removed. The ability to perform
and implementing machine learning models. Through
these operations efficiently confirms the findings of
the analysis of student enrollment data, we
Kumar and Roy (2023), who emphasized the
demonstrated Python's ability to streamline data
significance of data cleaning in data science workflows.
cleaning, preprocessing, and statistical computations,
By transforming raw student data into a structured
using tools such as Pandas and NumPy. This aligns with
format ready for analysis, Pandas proved to be
previous studies, such as those by Pedregosa et al.
indispensable, especially in dealing with multi-
(2011), who emphasized the efficiency of Python’s data
dimensional data structures like DataFrames. This
manipulation libraries. In our dataset, Python's NumPy
supports earlier research by Interlandi et al. (2015), who
was essential for quickly calculating descriptive
stressed that data preprocessing, typically a time-
statistics like the mean, median, standard deviation, and
consuming task, is made simpler and faster by Pandas,
variance across categories such as nationality and
thereby enhancing overall productivity. The findings
Vol 04 | Issue 04 | November 2024 149
ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

are also consistent with Boehm et al. (2000), who MATLAB, are known for their statistical prowess, the
emphasized how data cleaning is a critical step in any ability of Python to integrate various tasks under a
analytical process, ensuring that accurate conclusions single programming environment enhances
can be drawn from datasets. In terms of advanced productivity and collaboration. In conclusion, Python’s
statistical analysis, Python's capabilities in conducting effectiveness in data analytics is reaffirmed in this
correlation and regression analyses proved highly study, with potential improvements in scalability being
effective. The findings from our linear regression addressed by ongoing advancements in Python’s
analysis, which explored relationships between integration with big data platforms.
variables like nationality and major, are in line with
previous studies by McKinney (2012), which Limitation of Python
highlighted the robustness of Python’s Scikit-learn
Python has certain limitations when it comes to
library for machine learning applications. The ability to
handling big data analysis and advanced graphics.
seamlessly integrate Python’s statistical functions with
Python relies heavily on third-party libraries, such as
its machine learning libraries demonstrates its
NumPy, Pandas, and Matplotlib, to provide powerful
flexibility in conducting both traditional and predictive
tools for data manipulation, analysis, and visualization.
analyses. Our findings also reflect the scalability
While these libraries are highly effective, they must be
challenges mentioned by Freeman (2015) in their work
installed and managed separately, which adds a layer of
on big data platforms, where Python’s performance may
complexity to Python-based projects. Moreover, for
lag behind more optimized languages like Java in
handling large datasets, Python often works in tandem
handling truly massive datasets. However, its
with external data input sources like CSV files exported
integration with big data frameworks such as Apache
from tools like Excel. Libraries such as Pandas simplify
Spark through PySpark mitigates these limitations.
reading and writing CSV files through functions like
A key point of comparison with previous research is the
𝒓𝒆𝒂𝒅_𝒔𝒗() and 𝒕𝒐_𝒄𝒔𝒗(), but this reliance on external
scalability of Python when handling big data. While our
data sources underscores Python's limitations in
dataset did not reach the size of those used in large-scale
performing fully independent data analysis tasks. When
industrial applications, the performance of Python,
compared to programming languages like Java or C++,
particularly through its integration with NumPy and
Python’s standard library lacks some advanced
Pandas, was consistent with findings from smaller-scale
functionalities needed for specialized scientific
studies. However, as noted by Walt et al. (2011),
computing and large-scale data processing. Java, for
Python’s scalability issues in big data environments
instance, comes equipped with comprehensive utilities
often require supplementary tools like PySpark to
for tasks such as networking and concurrency, while
enhance performance. In this study, Python
C++ offers system-level programming capabilities that
demonstrated its strength in small- to medium-sized
Python does not natively support, often necessitating
datasets, where its ease of use and rapid prototyping
the use of additional third-party libraries to achieve
capabilities stood out. This suggests that while Python
similar results. Furthermore, Python's dynamic typing,
may have limitations in extremely large datasets, its
while offering flexibility, can result in inefficiencies,
versatility and rich library ecosystem make it a suitable
especially in performance optimization. Static analysis
choice for most data analytics tasks.
tools that typically optimize code and catch errors
In comparison to earlier studies, the real strength of
before runtime are less effective in Python due to its
Python demonstrated in our analysis lies in its broad
dynamic nature, and tools like cProfile or JIT compilers,
applicability across different data analytics stages, from
such as PyPy, are often required to improve runtime
data cleaning and preprocessing to advanced statistical
performance. Additionally, interfacing Python with
analysis and machine learning. The findings of this
statically typed languages or systems can demand extra
study, consistent with Hunter (2007), reinforce
effort, often requiring additional tools like Cython or
Python’s position as a leading tool in modern data
SWIG. Ultimately, while Python’s versatility and user-
science due to its accessibility and comprehensive
friendliness are clear advantages, they come at the cost
library support. Although other languages, such as R or

Vol 04 | Issue 04 | November 2024 150


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

of dependency on external tools and libraries for more specialized functionality in their data analysis
specialized tasks. For big data analysis, Python processes.
frequently needs to be supplemented with tools such as
Apache Spark for distributed processing, Dask for Conclusion
parallel computing, or TensorFlow and PyTorch for
This study has highlighted the critical role that Python
machine learning and deep learning applications.
plays in data analytics, emphasizing its versatility and
Therefore, although Python remains a powerful
efficiency in managing diverse tasks such as data
language, it is not standalone and often requires external
cleaning, preprocessing, statistical analysis, and
support for the most effective results in complex and
machine learning. Through the use of libraries like
large-scale data projects.
Pandas and NumPy, Python proves to be a powerful tool
Alternative Data Analysis Tools for handling complex datasets, allowing for streamlined
workflows and accurate results. Its ability to perform
While Python is one of the most widely used tools for advanced statistical functions and implement machine
data analysis, it is not the only option available, and learning models through Scikit-learn further
there are several alternative tools that offer powerful demonstrates its adaptability to various analytical
capabilities. Some of these tools are even more user- needs. While scalability challenges may arise when
friendly than Python, particularly for professionals working with extremely large datasets, Python's
without a programming background. For instance, integration with big data platforms like Apache Spark
Microsoft Excel is a universally recognized tool that effectively addresses these limitations, enhancing its
enables users to perform basic to advanced data analysis applicability in more demanding environments. The
without the need for additional training. Its intuitive findings reaffirm Python’s position as one of the most
interface and functionalities, such as data manipulation, preferred tools in the data analytics space, particularly
visualization, and pivot tables, make it a popular choice for small to medium-scale data tasks, where its
in business environments. Beyond Excel, R stands out flexibility, ease of use, and rich ecosystem of libraries
as a highly regarded programming language and provide a comprehensive solution for data professionals
software environment specifically designed for across different industries.
statistical computing and graphics. R is widely used for
data loading, manipulation, modeling, and References
visualization, offering clean and efficient code that
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A.,
simplifies complex analytical tasks through the use of
Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
functions like the pipe operator. It is especially favored M., Kudlur, M., Levenberg, J., Monga, R., Moore,
in academic and research settings for its vast array of S., Murray, D. G., Steiner, B., Tucker, P. A.,
libraries dedicated to statistical analysis and Vasudevan, V. K., Warden, P., . . . Zheng, X. (2016).
OSDI - TensorFlow: a system for large-scale
visualization. Another alternative, RapidMiner, machine learning.
provides a comprehensive data science platform that is
particularly popular among non-technical users due to Ahrens, J., Hendrickson, B., Long, G. G., Miller, S., Ross, R.,
its user-friendly, visual interface. RapidMiner allows & Williams, D. N. (2011). Data-Intensive Science in
the US DOE: Case Studies and Future Challenges.
for data preparation, machine learning, and predictive Computing in Science & Engineering, 13(6), 14-24.
modeling without requiring extensive coding skills. https://doi.org/10.1109/mcse.2011.77
This makes it ideal for research, education, rapid
prototyping, and industrial applications, where teams Amancio, D. R., Comin, C. H., Casanova, D., Travieso, G.,
Bruno, O. M., Rodrigues, F. A., & da Fontoura
can efficiently build complex models using pre-built Costa, L. (2014). A systematic comparison of
workflows. These alternatives, including Excel, R, and supervised classifiers. PloS one, 9(4), e94137-NA.
RapidMiner, present organizations with powerful data https://doi.org/10.1371/journal.pone.0094137
analysis tools that cater to different needs, whether the
Awan, A. J., Brorsson, M., Vlassov, V., & Ayguadé, E. (2016).
focus is on ease of use, advanced statistical analysis, or BPOE - How Data Volume Affects Spark Based
streamlined machine learning workflows. Thus, while Data Analytics on a Scale-up Server. In (Vol. 9495,
Python remains a leading tool, these alternatives offer pp. 81-92). https://doi.org/10.1007/978-3-319-
valuable options for teams seeking flexibility and 29006-5_7
Vol 04 | Issue 04 | November 2024 151
ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

Badhon, M. B., Carr, N., Hossain, S., Khan, M., Sunna, A. A., Discovery, 4(5), 380-409.
Uddin, M. M., Chavarria, J. A., & Sultana, T. (2023). https://doi.org/10.1002/widm.1134
Digital Forensics Use-Case of Blockchain
Technology: A Review. AMCIS 2023 Proceedings., Freeman, J. (2015). Open source tools for large-scale
neuroscience. Current opinion in neurobiology,
Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. 32(NA), 156-163.
S., & Smith, K. (2011). Cython: The Best of Both https://doi.org/10.1016/j.conb.2015.04.002
Worlds. Computing in Science & Engineering,
13(2), 31-39. Grandison, T., & Sloman, M. (2000). A survey of trust in
https://doi.org/10.1109/mcse.2010.118 internet applications. IEEE Communications
Surveys & Tutorials, 3(4), 2-16.
Boehm, B., Clark, N. A., Horowitz, N. A., Brown, N. A., https://doi.org/10.1109/comst.2000.5340804
Reifer, N. A., Chulani, N. A., Madachy, R., &
Steece, B. (2000). Software Cost Estimation with Guller, M. (2015a). Big Data Analytics with Spark - Big Data
Cocomo II with Cdrom (Vol. NA). Analytics with Spark (Vol. NA).
https://doi.org/NA https://doi.org/10.1007/978-1-4842-0964-6

Brumfiel, G. (2011). High-energy physics: Down the Guller, M. (2015b). Big Data Analytics with Spark: A
petabyte highway. Nature, 469(7330), 282-283. Practitioner's Guide to Using Spark for Large Scale
https://doi.org/10.1038/469282a Data Analysis (Vol. NA). https://doi.org/NA

Chen, C. L. P., & Zhang, C.-Y. (2014). Data-intensive Guo, X., Ipek, E., & Soyata, T. (2010). Resistive computation.
applications, challenges, techniques and ACM SIGARCH Computer Architecture News,
technologies: A survey on Big Data. Information 38(3), 371-382.
Sciences, 275(275), 314-347. https://doi.org/10.1145/1816038.1816012
https://doi.org/10.1016/j.ins.2014.01.015
Holch, T., Shilon, I., Büchele, M., Fischer, T., Funk, S.,
Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Groeger, N., Jankowsky, D., Lohse, T., Schwanke,
Mobile Networks and Applications, 19(2), 171-209. U., & Wagner, P. (2017). Probing Convolutional
https://doi.org/10.1007/s11036-013-0489-0 Neural Networks for Event Reconstruction in
{\gamma}-Ray Astronomy with Cherenkov
Cid-Fuentes, J. Á., Alvarez, P., Amela, R., Ishii, K., Telescopes. Proceedings of 35th International
Morizawa, R. K., & Badia, R. M. (2020). Efficient Cosmic Ray Conference — PoS(ICRC2017),
development of high performance data analytics in 301(NA), 795-NA.
Python. Future Generation Computer Systems, 111, https://doi.org/10.22323/1.301.0795
570-581.
https://doi.org/10.1016/j.future.2019.09.051 Huennefeld, M. (2017). Deep Learning in Physics
exemplified by the Reconstruction of Muon-
Conejero, J., Corella, S., Badia, R. M., & Labarta, J. (2017). Neutrino Events in IceCube. Proceedings of 35th
Task-based programming in COMPSs to converge International Cosmic Ray Conference —
from HPC to big data. The International Journal of PoS(ICRC2017), 301(NA), 1057-NA.
High Performance Computing Applications, 32(1), https://doi.org/10.22323/1.301.1057
45-60. https://doi.org/10.1177/1094342017701278
Hunter, J. D. (2007). Matplotlib: A 2D Graphics
Dalcin, L., Paz, R. R., Kler, P. A., & Cosimo, A. (2011). Environment. Computing in Science & Engineering,
Parallel distributed computing using Python. 9(3), 90-95. https://doi.org/10.1109/mcse.2007.55
Advances in Water Resources, 34(9), 1124-1139.
https://doi.org/10.1016/j.advwatres.2011.04.013 Hurst, S. L. (1969). An introduction to threshold logic: a
survey of present theory and practice. Radio and
Feng, Q., & Lin, T. T. Y. (2016). Astroinformatics - The Electronic Engineer, 37(6), 339-351.
analysis of VERITAS muon images using https://doi.org/10.1049/ree.1969.0062
convolutional neural networks (Vol. 12).
https://doi.org/10.1017/s1743921316012734 Inoubli, W., Aridhi, S., Mezni, H., Maddouri, M., & Nguifo,
E. M. (2018). An experimental survey on big data
Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, frameworks. Future Generation Computer Systems,
M. J., Benítez, J. M., & Herrera, F. (2014). Big Data 86(NA), 546-564.
with Cloud Computing: an insight on the computing https://doi.org/10.1016/j.future.2018.04.032
environment, MapReduce, and programming
frameworks. WIREs Data Mining and Knowledge

Vol 04 | Issue 04 | November 2024 152


PYTHON FOR DATA ANALYTICS: A SYSTEMATIC LITERATURE REVIEW OF TOOLS, TECHNIQUES, AND APPLICATIONS

Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Millman, K. J., & Aivazis, M. (2011). Python for Scientists
Kim, M., Millstein, T., & Condie, T. (2015). Titian: and Engineers. Computing in Science &
data provenance support in Spark. Proceedings of Engineering, 13(2), 9-12.
the VLDB Endowment. International Conference on https://doi.org/10.1109/mcse.2011.36
Very Large Data Bases, 9(3), 216-227.
https://doi.org/10.14778/2850583.2850595 Misale, C., Drocco, M., Tremblay, G., Martinelli, A. R., &
Aldinucci, M. (2018). PiCo: High-performance data
Istiak, A., & Hwang, H. Y. (2024). Development of shape- analytics pipelines in modern C++. Future
memory polymer fiber reinforced epoxy composites Generation Computer Systems, 87(NA), 392-403.
for debondable adhesives. Materials Today https://doi.org/10.1016/j.future.2018.05.030
Communications, 38, 108015.
https://doi.org/https://doi.org/10.1016/j.mtcomm.20 Nagpal, A., & Gabrani, G. (2019). Python for Data Analytics,
23.108015 Scientific and Technical Applications. 2019 Amity
International Conference on Artificial Intelligence
Istiak, A., Lee, H. G., & Hwang, H. Y. (2023). (AICAI).
Characterization and Selection of Tailorable Heat https://doi.org/10.1109/aicai.2019.8701341
Triggered Epoxy Shape Memory Polymers for
Epoxy Debondable Adhesives. Macromolecular Oliphant, T. E. (2006). Guide to numpy (Vol. 1). Trelgol
Chemistry and Physics, 224(20), 2300241. Publishing USA.
https://doi.org/https://doi.org/10.1002/macp.202300
241 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Jain, A. K. (2010). Data clustering: 50 years beyond K- Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
means. Pattern Recognition Letters, 31(8), 651-666. Cournapeau, D., Brucher, M., Perrot, M., &
https://doi.org/10.1016/j.patrec.2009.09.011 Duchesnay, E. (2011). Scikit-learn: Machine
Learning in Python. Journal of Machine Learning
Jeong, H., & Shi, L. (2018). Memristor devices for neural Research, 12(85), 2825-2830. https://doi.org/NA
networks. Journal of Physics D: Applied Physics,
52(2), 023003-NA. https://doi.org/10.1088/1361- Raghupathi, W., & Raghupathi, V. (2014). Big data analytics
6463/aae223 in healthcare: promise and potential. Health
information science and systems, 2(1), 3-3.
Kim, H., Park, J., Jang, J., & Yoon, S. (2016). DeepSpark: https://doi.org/10.1186/2047-2501-2-3
Spark-Based Deep Learning Supporting
Asynchronous Updates and Caffe Compatibility. Reed, D. A., & Dongarra, J. (2015). Exascale computing and
arXiv: Learning, NA(NA), NA-NA. big data. Communications of the ACM, 58(7), 56-68.
https://doi.org/NA https://doi.org/10.1145/2699414

Kitchin, R. (2013). The real-time city? Big data and smart Reyes-Ortiz, J. L., Oneto, L., & Anguita, D. (2015). INNS
urbanism. GeoJournal, 79(1), 1-14. Conference on Big Data - Big Data Analytics in the
https://doi.org/10.1007/s10708-013-9516-8 Cloud: Spark on Hadoop vs MPI/OpenMP on
Beowulf. Procedia Computer Science, 53(NA), 121-
Kumar, S., & Roy, U. B. (2023). A technique of data 130. https://doi.org/10.1016/j.procs.2015.07.286
collection. In (pp. 23-36). Elsevier.
https://doi.org/10.1016/b978-0-323-91776- Saika, M. H., Avi, S. P., Islam, K. T., Tahmina, T., Abdullah,
6.00011-7 M. S., & Imam, T. (2024). Real-Time Vehicle and
Lane Detection using Modified OverFeat CNN: A
Mangano, S., Delgado, C., Bernardos, M. I., Lallena, M., & Comprehensive Study on Robustness and
Vázquez, J. J. R. (2018). Extracting gamma-ray Performance in Autonomous Driving. Journal of
information from images with convolutional neural Computer Science and Technology Studies.
network methods on simulated Cherenkov Telescope
Array data (Vol. NA). https://doi.org/10.1007/978- Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J.
3-319-99978-4 Z. (2016). Big data analytics on Apache Spark.
International Journal of Data Science and
McKinney, W. (2011). pandas: a Foundational Python Library Analytics, 1(3), 145-164.
for Data Analysis and Statistics. NA, NA(NA), NA- https://doi.org/10.1007/s41060-016-0027-9
NA. https://doi.org/NA
Shamim, M. (2022). The Digital Leadership on Project
McKinney, W. (2012). Python for Data Analysis: Data Management in the Emerging Digital Era. Global
Wrangling with Pandas, NumPy, and IPython (Vol. Mainstream Journal of Business, Economics,
NA). https://doi.org/NA Development & Project Management, 1(1), 1-14.

Vol 04 | Issue 04 | November 2024 153


ACADEMIC JOURNAL ON SCIENCE, TECHNOLOGY, ENGINEERING & MATHEMATICS EDUCATION
Doi: 10.69593/ajsteme.v4i04.146

Shi, J., Qiu, Y., Minhas, U. F., Jiao, L., Wang, C., Reinwald,
B., & Ozcan, F. (2015). Clash of the titans:
MapReduce vs. Spark for large scale data analytics.
Proceedings of the VLDB Endowment, 8(13), 2110-
2121. https://doi.org/10.14778/2831360.2831365

Shilon, I., Kraus, M., Büchele, M., Egberts, K., Fischer, T.,
Holch, T., Lohse, T., Schwanke, U., Steppa, C., &
Funk, S. (2019). Application of deep learning
methods to analysis of imaging atmospheric
Cherenkov telescopes data. Astroparticle Physics,
105(NA), 44-53.
https://doi.org/10.1016/j.astropartphys.2018.10.003

Tejedor, E., Becerra, Y., Alomar, G., Queralt, A., Badia, R.


M., Torres, J., Cortes, T., & Labarta, J. (2016).
PyCOMPSs: Parallel computational workflows in
Python. The International Journal of High
Performance Computing Applications, 31(1), 66-82.
https://doi.org/10.1177/1094342015594678

Uddin, M. M., Ullah, R., & Moniruzzaman, M. (2024). Data


Visualization in Annual Reports–Impacting
Investment Decisions. International Journal for
Multidisciplinary Research, 6(5).
https://doi.org/10.36948/ijfmr

van der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The
NumPy array: a structure for efficient numerical
computation. Computing in Science & Engineering,
13(2), 22-30. https://doi.org/10.1109/mcse.2011.37

Yan, D., Cheng, J., Özsu, M. T., Yang, F., Lu, Y., Lui, J. C. S.,
Zhang, Q., & Ng, W. (2016). A general-purpose
query-centric framework for querying big graphs.
Proceedings of the VLDB Endowment, 9(7), 564-
575. https://doi.org/10.14778/2904483.2904488

Vol 04 | Issue 04 | November 2024 154

You might also like