BIG DATA ANALYTICS
UNIT - I
Introduction: Introduction to big data: Introduction to Big Data Platform, Challenges of
Conventional Systems, Intelligent data analysis, Nature of Data, Analytic Processes and Tools,
Analysis vs Reporting.
Introduction to Big Data
Big data refers to large, complex sets of data that are difficult to manage and analyse using
traditional data processing tools and techniques. The term "big data" typically describes
datasets that are too large, too diverse, or too complex for conventional data processing
methods to handle.
The growth of big data is driven by the widespread use of digital technologies, such as social
media, mobile devices, and the Internet of Things (IoT). These technologies generate massive
amounts of data that can be collected, stored, and analysed for insights into consumer
behaviour, market trends, and other aspects of the business.
Big data technologies and tools include data storage and management systems like Hadoop,
Apache Spark, and NoSQL databases, as well as data processing and analysis tools such as
Python, R, and SQL. Machine learning algorithms and artificial intelligence techniques are
also commonly used to analyse big data and extract insights.
The use of big data has transformed many industries, including healthcare, finance, retail, and
manufacturing, by providing new insights and enabling data-driven decision-making.
However, the use of big data also raises important ethical and privacy concerns, particularly
around the collection, use, and sharing of personal data.
Introduction to Big Data Platform
A Big Data Platform is a comprehensive and scalable infrastructure that provides the necessary
tools and technologies for collecting, storing, processing, and analysing large and complex
data sets. A Big Data Platform enables organizations to manage and extract valuable insights
from vast amounts of data that are generated from various sources, such as social media,
mobile devices, sensors, and other IoT devices.
The architecture of a Big Data Platform typically consists of several layers, including:
1. Data Ingestion: This layer is responsible for collecting data, both structured and
unstructured, from various sources and bringing it into the platform.
2. Data Storage: This layer stores the ingested data in a distributed and scalable manner,
typically using technologies such as Hadoop Distributed File System (HDFS), NoSQL
databases, and cloud-based storage solutions.
3. Data Processing: This layer involves processing and transforming the ingested data
into a format that can be easily analysed using various data processing technologies such
as Apache Spark, Apache Storm, and MapReduce.
4. Data Analysis: This layer uses advanced analytics techniques such as machine learning,
data mining, and natural language processing to extract insights and knowledge from
the processed data.
5. Data Visualization: This layer presents the analysed data in an easily understandable
and interactive format, typically using visualization tools such as Tableau, QlikView,
and D3.js.
In short, a Big Data Platform provides the infrastructure, tools, and technologies that
organizations need to manage, process, and analyse large and complex data sets, so that they
can derive valuable insights and make data-driven decisions. Its scalable and flexible
architecture can also be adapted to meet the evolving needs of the organization.
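As a minimal sketch of how the ingestion, processing, and analysis layers might fit together in
practice, the PySpark example below reads a hypothetical events.csv file (the file name and its
user_id, event_type, and amount columns are assumptions made purely for illustration), cleans
it, and aggregates it; a real platform would replace the local file with a distributed source
such as HDFS or a NoSQL store.

# Minimal PySpark sketch of a Big Data Platform pipeline.
# The file name "events.csv" and its columns (user_id, event_type, amount)
# are hypothetical and used only for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BigDataPlatformSketch").getOrCreate()

# Data ingestion: read raw, structured data into the platform.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Data processing: clean and transform the ingested records.
clean = (events
         .dropna(subset=["user_id", "event_type"])
         .withColumn("amount", F.col("amount").cast("double")))

# Data analysis: aggregate the processed data to extract insight.
summary = (clean
           .groupBy("event_type")
           .agg(F.count("*").alias("events"),
                F.avg("amount").alias("avg_amount")))

# The visualization layer would typically consume this result
# (for example via Tableau or a plotting library); here we just display it.
summary.show()

spark.stop()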
Challenges of Conventional Systems
Conventional systems, also known as traditional systems, often face challenges when it comes
to managing and analysing large and complex data sets. Some of the key challenges of
conventional systems are:
1. Scalability: Conventional systems are not designed to handle large amounts of data,
and they can quickly become overwhelmed as data volumes grow. This can lead to
system crashes and performance issues.
2. Processing speed: Conventional systems are often slow when it comes to processing
large amounts of data, and they can take hours or even days to process and analyse data
sets.
3. Data complexity: Conventional systems struggle to handle complex data types, such as
unstructured data, which can include audio, video, and text data. This can lead to
difficulties in extracting insights from the data.
4. Data integration: Conventional systems often struggle to integrate data from different
sources, such as social media, IoT devices, and other systems. This can make it difficult
to get a comprehensive view of the data.
5. Security: Conventional systems are often vulnerable to security threats, such as data
breaches and cyber-attacks, due to their outdated security protocols and architectures.
6. Cost: Scaling up conventional systems to handle large amounts of data can be
expensive, as it often requires additional hardware and software resources.
These challenges can make it difficult for organizations to manage and analyse large and
complex data sets, and can hinder their ability to derive valuable insights from the data. This
is where Big Data Platforms come in, as they are specifically designed to handle the challenges
of large and complex data sets, providing scalable, fast, and secure solutions for data
management and analysis.
Intelligent data analysis
Intelligent data analysis (IDA) is a process of analysing data using advanced computational
and statistical techniques to extract insights, patterns, and knowledge. IDA involves using
machine learning, data mining, natural language processing, and other artificial intelligence
(AI) techniques to identify hidden relationships and patterns in data that might not be apparent
using traditional statistical analysis.
IDA can be applied to various types of data, including structured and unstructured data, and it
can be used to solve a wide range of problems, such as fraud detection, predictive analytics,
and customer segmentation. The goal of IDA is to automate the process of data analysis and
make it more efficient and accurate.
Some of the key techniques used in IDA include:
1. Machine learning: Machine learning involves using algorithms and statistical models
to enable computers to learn from data and improve their performance over time.
2. Data mining: Data mining involves using statistical techniques to extract patterns and
knowledge from data sets.
3. Natural language processing: Natural language processing involves using
computational techniques to analyse and interpret human language.
4. Visualization: Visualization involves presenting data in a graphical format to enable
users to better understand and interpret the data.
5. Predictive analytics: Predictive analytics involves using statistical and machine
learning techniques to make predictions about future events based on historical data.
Intelligent data analysis is increasingly being used in various industries, including healthcare,
finance, and retail, to improve decision-making, optimize business processes, and gain a
competitive edge. However, it also raises important ethical concerns around the use of personal
data and the potential for bias in automated decision-making.
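To make the machine learning and predictive analytics techniques above concrete, the sketch
below trains a simple classifier with scikit-learn on a synthetic dataset; the data is generated
on the fly and merely stands in for the historical business records a real project would use.

# Minimal predictive-analytics sketch using scikit-learn.
# The synthetic dataset stands in for historical business data
# (e.g. customer records) in a real project.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Historical data: features X and a known outcome y (e.g. churned or not).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Machine learning: fit a model that learns patterns from the data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictive analytics: use the fitted model to predict unseen cases.
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))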
Nature of Data
Data is any information that can be stored, processed, and analysed by computers or other
electronic devices. The nature of data can vary widely, depending on the source and the
purpose of the data. Some common characteristics of data include:
1. Volume: The amount of data can be massive, ranging from small data sets to petabytes
generated by IoT devices, social media, and other sources.
2. Variety: Data can be structured, unstructured, or semi-structured. Structured data is
organized and can be easily processed, while unstructured data, such as text or video, is
more difficult to analyse and interpret.
3. Velocity: The speed at which data is generated and needs to be analysed can be high,
with real-time or near-real-time processing required in some cases.
4. Veracity: The accuracy, completeness, and reliability of data can vary widely, and it
can be difficult to ensure that data is clean and error-free.
5. Value: The value of data depends on the insights and knowledge that can be extracted
from it. Data can be used to identify patterns, predict outcomes, and optimize processes,
among other applications.
6. Variability: Data can be highly variable in terms of format, structure, and quality,
which can pose challenges for processing and analysis.
The nature of data is constantly evolving, with new sources of data emerging and new
technologies being developed to process and analyse data. This has led to the emergence of
big data and data science, which are focused on managing and analysing large and complex
data sets to extract insights and knowledge.
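As a small illustration of the Variety characteristic described above, the sketch below loads
structured, semi-structured, and unstructured data in Python; the file names sales.csv,
tweets.json, and review.txt are hypothetical and used only to show how differently each kind of
data has to be handled.

# Illustration of data variety: structured, semi-structured, unstructured.
# File names ("sales.csv", "tweets.json", "review.txt") are hypothetical.
import csv
import json

# Structured data: fixed rows and columns, easy to process directly.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))      # each row is a dict of named fields

# Semi-structured data: self-describing but with a flexible schema.
with open("tweets.json") as f:
    tweets = json.load(f)               # nested lists/dicts of varying shape

# Unstructured data: free text that needs further processing to analyse.
with open("review.txt") as f:
    text = f.read()
word_count = len(text.split())          # even basic metrics require parsing

print(len(rows), len(tweets), word_count)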
Analytic Processes and Tools
Analytic processes and tools are used to process, analyze, and visualize data in order to derive
insights and knowledge from it. There are several key steps in the analytic process, which
include:
1. Data collection and preparation: This involves identifying and gathering relevant data
from various sources and preparing it for analysis. This includes data cleaning,
transformation, and normalization.
2. Data exploration and visualization: This involves using visual tools and techniques
to explore and gain insights from the data. Data visualization tools, such as charts and
graphs, can be used to present data in a clear and understandable way.
3. Data modeling and analysis: This involves using statistical and machine learning
techniques to build models and analyze the data. These models can be used to make
predictions and identify patterns in the data.
4. Interpretation and communication: This involves interpreting the results of the
analysis and communicating the insights to stakeholders. This includes presenting
findings in a clear and actionable way.
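A minimal, end-to-end sketch of these four steps is shown below, using pandas, matplotlib, and
scikit-learn on a small synthetic dataset; the column names (ad_spend, revenue) and the data
itself are invented purely for illustration.

# Minimal walk-through of the analytic process on synthetic data.
# Column names and values are invented purely for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Data collection and preparation: load (here, generate) and clean the data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"ad_spend": rng.uniform(1_000, 10_000, 200)})
df["revenue"] = 3.2 * df["ad_spend"] + rng.normal(0, 2_000, 200)
df = df.dropna()

# 2. Data exploration and visualization: look for an obvious relationship.
df.plot.scatter(x="ad_spend", y="revenue")
plt.savefig("exploration.png")

# 3. Data modeling and analysis: fit a simple predictive model.
model = LinearRegression()
model.fit(df[["ad_spend"]], df["revenue"])

# 4. Interpretation and communication: turn the result into a finding.
print(f"Each extra unit of ad spend is associated with "
      f"{model.coef_[0]:.2f} units of revenue.")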
There are several tools and technologies used in the analytic process, including:
1. Statistical software: Statistical software, such as R and SAS, is used to perform data
analysis and build models.
2. Business intelligence tools: Business intelligence tools, such as Tableau and Power BI,
are used to visualize and analyze data.
3. Machine learning platforms: Machine learning platforms, such as TensorFlow and
Scikit-learn, are used to build and deploy machine learning models.
4. Data integration tools: Data integration tools, such as Apache Kafka and Apache NiFi,
are used to integrate data from various sources.
5. Cloud-based analytics platforms: Cloud-based analytics platforms, such as AWS and
Google Cloud Platform, provide scalable and cost-effective solutions for data analysis
and processing.
The choice of tools and technologies used in the analytic process depends on the nature of the
data, the complexity of the analysis, and the specific requirements of the project.
Analysis vs Reporting
Understanding the differences between analytics and reporting can significantly benefit your
business. To use both to their full potential, and not miss out on essential parts of either, it
is important to know how the two differ. Some key differences are:
1. Analytics is the method of examining and analysing summarized data to make business
decisions, whereas reporting is the action of gathering all the needed information and data
and putting it together in an organized way.
2. Questioning the data, understanding it, investigating it, and presenting it to the end users
are all part of analytics, whereas identifying business events, gathering the required
information, and organizing, summarizing, and presenting existing data are all part of
reporting.
3. The purpose of analytics is to draw conclusions based on data, whereas the purpose of
reporting is to organize the data into meaningful information.
4. Analytics is used by data analysts, data scientists, and business people to make effective
decisions, whereas reporting is provided to the appropriate business leaders so that they can
perform effectively and efficiently within a firm.
Analysis and reporting are both important aspects of data management and communication,
but they have different objectives and approaches.
Reporting is the process of collecting and presenting data in a structured format, typically in
the form of tables, charts, and graphs. The purpose of reporting is to provide information to
stakeholders in a clear and concise way. Reporting focuses on answering specific questions
and providing a summary of key findings. Reports are typically generated on a regular basis,
such as daily, weekly, or monthly.
Analysis, on the other hand, involves a more in-depth examination of the data. The purpose of
analysis is to uncover insights, patterns, and relationships in the data that may not be apparent
through simple reporting. Analysis typically involves statistical techniques and machine
learning algorithms to identify trends and patterns. The focus of analysis is on understanding
the data and using that understanding to drive decision-making.
Reporting tends to be more structured and standardized, while analysis requires more creativity
and flexibility. Reporting is generally more quantitative, while analysis can be both
quantitative and qualitative. Reporting is typically done by a wider range of stakeholders,
while analysis is often done by data scientists, analysts, and other technical experts.
In summary, reporting provides a summary of data in a structured format, while analysis
involves a deeper examination of the data to uncover insights and patterns. Both reporting and
analysis are important in data management and communication, but they serve different
purposes and require different approaches.
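To make the contrast concrete, the sketch below first produces a simple report (a structured
summary table) and then a simple analysis (a correlation that looks for the relationship behind
the summary); the dataset and column names are synthetic and used only for illustration.

# Contrast between reporting and analysis on a small synthetic dataset.
# The data and column names are invented purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sales = pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], 500),
    "discount": rng.uniform(0, 0.3, 500),
})
sales["units_sold"] = (50 + 200 * sales["discount"]
                       + rng.normal(0, 10, 500)).round()

# Reporting: summarize existing data in a structured, repeatable format.
report = sales.groupby("region")["units_sold"].agg(["count", "mean", "sum"])
print("Sales report by region:\n", report)

# Analysis: question the data to uncover a relationship behind the summary.
correlation = sales["discount"].corr(sales["units_sold"])
print(f"\nCorrelation between discount and units sold: {correlation:.2f}")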