CHAPTER TWO: Data Science
After completing this chapter, the students will be able to:
➢ Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives
➢ Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.
2.1. Overview of Data Science
Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured, semi-structured, and unstructured data.
In other words, data science is the area of study that extracts
insights from vast amounts of data using various scientific methods,
algorithms, and processes, and it helps you discover hidden patterns
in raw data.
Skills important for data science
• Statistics
• Linear algebra
• Programming knowledge
Significant advantages of using Data Science
Data is the oil of today's world. With the right tools, technologies,
and algorithms, we can use data and convert it into a distinctive
business advantage.
Data science can help you detect fraud using advanced machine
learning algorithms.
It helps you prevent significant monetary losses.
It allows you to build intelligent abilities into machines.
You can perform sentiment analysis to gauge customer brand loyalty.
It enables you to make better and faster decisions.
It helps you recommend the right product to the right customer to
enhance your business.
Challenges of Data Science
A high variety of information and data is required for accurate analysis.
The available data science talent pool is not adequate.
Management does not provide financial support for a data science
team.
Unavailability of, or difficult access to, data.
Data science results are not effectively used by business decision-
makers.
Explaining data science to others is difficult.
Privacy issues.
Lack of domain experts.
If an organization is very small, it cannot have a data science
team.
What are data and information?
Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be suitable for
communication, interpretation, or processing by humans or electronic
machines.
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as letters (A-Z,
a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Information is the processed data on which decisions and actions
are based.
Information is data that has been processed into a form that is
meaningful to the recipient and is of real or perceived value in the
current or prospective actions or decisions of the recipient.
Furthermore, information is interpreted data; it is created from
organized, structured, and processed data in a particular context.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a
particular purpose.
Data processing consists of the following basic steps: Input,
Processing and Output. These three steps constitute the data
processing cycle.
Fig. 1. Data Processing Cycle
Input - in this step, the input data is prepared in some convenient form
for processing.
The form will depend on the processing machine.
For example, when electronic computers are used, the input data can
be recorded on any one of several types of storage media, such as a
hard disk, CD, flash disk, and so on.
Processing - in this step, the input data is changed to produce data in a
more useful form.
For example, interest can be calculated on a deposit to a bank, or a
summary of sales for the month can be calculated from the sales
orders.
Output - at this stage, the result of the preceding processing step is
collected.
The particular form of the output data depends on the use of the
data.
For example, output data may be the payroll for employees.
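To make the cycle concrete, here is a minimal Python sketch of the three steps, using the bank-deposit interest example above. The deposit amounts and the 5% rate are illustrative assumptions, not figures from this chapter.

```python
# A minimal sketch of the Input -> Processing -> Output cycle,
# using the bank-deposit interest example from the slides.
def process(deposits, annual_rate=0.05):
    """Processing step: compute simple interest for each deposit."""
    return [amount * annual_rate for amount in deposits]

# Input step: data prepared in a convenient form (here, a Python list).
deposits = [1000.0, 2500.0, 400.0]

# Processing step: transform the input into a more useful form.
interest = process(deposits)

# Output step: collect and present the result.
for amount, earned in zip(deposits, interest):
    print(f"Deposit {amount:>8.2f} earns interest {earned:>7.2f}")
```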
Data types and their representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for
instance, a data type is simply an attribute of data that tells
the compiler or interpreter how the programmer intends to
use the data.
1. Data types from Computer programming perspective
Almost all programming languages explicitly include the notion of
data type, though different languages may use different terminology.
Common data types include:
Integers (int) - used to represent whole numbers, mathematically
known as integers.
Booleans (bool) - used to represent values restricted to one of two
values: true or false.
Characters (char) - used to represent a single character.
Floating-point numbers (float) - used to represent real numbers.
Alphanumeric strings (string) - used to represent a combination of
characters and numbers.
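As a quick illustration, the following sketch shows one value of each of these types in Python (one of the tools mentioned later in this chapter). Note that Python has no separate char type, so a single character is represented as a one-character string; the variable names are illustrative.

```python
# One illustrative value for each of the common data types above.
age = 25                # integer (int): a whole number
is_student = True       # Boolean (bool): one of two values, True or False
grade = "A"             # character: a single character (1-character string)
temperature = 36.6      # floating-point (float): a real number
user_id = "user42"      # alphanumeric string: letters and digits combined

for value in (age, is_student, grade, temperature, user_id):
    print(value, type(value).__name__)
```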
2. Data types from Data Analytics perspective
From a data analytics point of view, it is important to
understand that there are three common data types or
structures:
Structured
Semi-structured and
Unstructured data types.
Structured Data
Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a
relationship between the different rows and columns.
Common examples of structured data are Excel files or SQL
databases.
Each of these has structured rows and columns that can be
sorted.
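As a small sketch of structured data, the following Python snippet builds a tiny SQL table with the built-in sqlite3 module. The table and column names are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of structured data: a small SQL table built with
# Python's built-in sqlite3 module, held in a throwaway in-memory DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "laptop", 999.0), (2, "phone", 450.0), (3, "tablet", 300.0)],
)

# Because every row follows the same pre-defined model (id, product,
# amount), the data is straightforward to sort and analyze.
for row in conn.execute("SELECT product, amount FROM sales ORDER BY amount DESC"):
    print(row)
conn.close()
```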
Semi-structured Data
Semi-structured data is a form of structured data that does not conform with
the formal structure of data models associated with relational databases or
other forms of data tables, but nonetheless, contains tags or other markers to
separate semantic elements and enforce hierarchies of records and fields
within the data.
Therefore, it is also known as a self-describing structure.
JSON and XML are common examples of semi-structured data.
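For instance, a minimal JSON document can be parsed with Python's built-in json module, as sketched below; the record contents are purely illustrative.

```python
# A minimal sketch of semi-structured data: a JSON document whose
# tags make the structure self-describing, without a table schema.
import json

doc = """
{
  "name": "Abebe",
  "email": "abebe@example.com",
  "phones": ["+251-11-000-0000", "+251-91-000-0000"]
}
"""

record = json.loads(doc)

# The tags ("name", "email", "phones") separate the semantic elements
# and enforce a hierarchy of fields within the data.
print(record["name"], "has", len(record["phones"]), "phone numbers")
```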
Unstructured Data
Unstructured data is information that either does not have a predefined data
model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain data such as
dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases.
Common examples of unstructured data include audio files, video files, and data held in NoSQL databases.
Metadata – Data about Data
• The last category of data type is metadata.
• From a technical point of view, this is not a separate
data structure, but it is one of the most important
elements for Big Data analysis and big data solutions.
• Metadata is data about data.
• It provides additional information about a specific set
of data.
• In a set of photographs, for example, metadata could
describe when and where the photos were taken.
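A minimal sketch of this idea in Python follows: a plain dictionary standing in for a photo's metadata. The field names and values are illustrative assumptions (real image metadata would typically live in EXIF tags).

```python
# A minimal sketch of metadata: a dictionary describing a photo file.
photo = "vacation_001.jpg"          # the data itself (an image file)

metadata = {                        # data about that data
    "taken_at": "2024-07-15 14:32",
    "location": "Addis Ababa",
    "camera": "Pixel 7",
    "resolution": "4032x3024",
}

# The metadata tells us when and where the photo was taken
# without opening or decoding the image itself.
print(f"{photo} was taken at {metadata['location']} on {metadata['taken_at']}")
```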
Data Value Chain
The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value
and useful insights from data. The Big Data Value Chain identifies the
following key high-level activities:
Fig. 2. Data Value Chain
1. Data Acquisition
• It is the process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other storage
solution on which data analysis can be carried out (a small
sketch follows this list).
• Data acquisition is one of the major big data challenges in
terms of infrastructure requirements.
• The infrastructure required to support the acquisition of big
data must deliver low, predictable latency both in capturing
data and in executing queries; be able to handle very high
transaction volumes, often in a distributed environment; and
support flexible and dynamic data structures.
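The sketch referenced above is a minimal, single-machine Python illustration of the gather-filter-clean idea; the "product,amount" event format and the filtering rule are illustrative assumptions, not part of any real acquisition pipeline.

```python
# A minimal sketch of acquisition: gather raw events, filter out
# malformed ones, and clean the rest before they reach storage.
raw_events = ["  purchase,100 ", "view,", "purchase,40", None]

def clean(event):
    """Cleaning step: strip whitespace; drop missing events."""
    return event.strip() if event else None

acquired = []
for event in map(clean, raw_events):
    # Filtering step: keep only well-formed "name,amount" events.
    if event and event.count(",") == 1 and event.split(",")[1].isdigit():
        acquired.append(event)

print(acquired)  # these records are now ready for the storage layer
```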
2. Data Analysis
• It is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-
specific usage.
• Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant
data, synthesizing and extracting useful hidden
information with high potential from a business point of
view.
• Related areas include data mining, business intelligence,
and machine learning.
3. Data Curation
• It is the active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation (a small validation sketch follows this list).
• Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.
• Data curators (also known as scientific curators or data annotators) hold
the responsibility of ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit for their purpose.
• A key trend for the curation of big data utilizes community and
crowdsourcing approaches.
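The validation sketch mentioned in the list above, in Python: a minimal check that keeps only plausible records before they are published for reuse. The record fields and the plausibility rules are illustrative assumptions.

```python
# A minimal sketch of one curation activity (validation): keeping
# only records that look trustworthy before they are made reusable.
def is_valid(record):
    """Keep records with a non-empty name and a plausible age."""
    return bool(record.get("name")) and 0 <= record.get("age", -1) <= 120

raw_records = [
    {"name": "Sara", "age": 29},
    {"name": "", "age": 41},         # fails: empty name
    {"name": "Bekele", "age": 300},  # fails: implausible age
]

curated = [r for r in raw_records if is_valid(r)]
print(f"kept {len(curated)} of {len(raw_records)} records")
```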
4. Data Storage
• It is the persistence and management of data in a scalable way
that satisfies the needs of applications that require fast access
to the data.
• Relational Database Management Systems (RDBMS) have
been the main, and almost unique, solution to the storage
paradigm for nearly 40 years.
• However, their ability to scale degrades as data volumes and
complexity grow, making them unsuitable for big data
scenarios.
• NoSQL technologies have been designed with the scalability
goal in mind and present a wide range of solutions based on
alternative data models.
5. Data Usage
• It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
• Data usage in business decision making can enhance
competitiveness through the reduction of costs,
increased added value, or any other parameter that can
be measured against existing performance criteria.
Basic concepts of big data
What Is Big Data?
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
• In this context, a “large dataset” means a dataset too large to
reasonably process or store with traditional tooling or on a single
computer.
• This means that the common scale of big datasets is constantly
shifting and may vary significantly from organization to
organization.
Big data is characterized by the 4Vs:
• Volume: large amounts of data (zettabytes / massive datasets)
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it? etc.
Fig. 3. Characteristics of Big Data
Big Data Solutions
Clustered Computing and Hadoop Ecosystem
Clustered Computing
Because of the qualities of big data, individual computers are
often inadequate for handling the data at most stages.
To better address the high storage and computational needs of
big data, computer clusters are a better fit.
Cluster computing: a form of computing in which a group of computers
are connected through a network and perform like a single machine.
Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault tolerance
and availability guarantees to prevent hardware or software failures
from affecting access to data and processing.
• This becomes increasingly important as we continue to emphasize the
importance of real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by adding
additional machines to the group.
• This means the system can react to changes in resource requirements
without expanding the physical resources on a machine.
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make
interaction with big data easier.
It is an Apache open-source software framework for reliable,
scalable, distributed computing over massive amounts of data.
It is a framework that allows for the distributed processing of
large datasets across clusters of computers using simple
programming models.
It is inspired by a technical document published by Google.
The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
• Reliable: It is reliable as it stores copies of the data on
different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically.
A few extra nodes help in scaling up the framework.
• Flexible: It is flexible, and you can store as much structured and
unstructured data as you need and decide to use it later.
Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and
storage.
It is continuously growing to meet the needs of Big Data.
It comprises the following components and many others:
HDFS: Hadoop Distributed File System-where Hadoop stores
data
YARN: Yet Another Resource Negotiator-A framework for job
scheduling and cluster resource management.
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
Fig. 4. Hadoop Ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The first stage of Big Data processing is Ingest.
The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas
Flume transfers event data.
Sqoop is a tool to easily import information from structured
databases (MySQL, Oracle, etc.) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster.
2. Processing the data in storage
The second stage is Processing. In this stage, the data is
stored and processed.
The data is stored in the distributed file system, HDFS, and
the NoSQL distributed database, HBase.
Spark and MapReduce perform the data processing.
Spark: a fast and general compute engine for Hadoop data.
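To make the MapReduce idea concrete, here is a minimal single-machine Python sketch of the classic word-count job. Real Hadoop MapReduce distributes the map and reduce phases across the cluster; this sketch only imitates the logic, and the sample documents are illustrative.

```python
# A minimal single-machine sketch of the MapReduce word-count idea.
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map phase: emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key (the word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 3, 'is': 1, ...}
```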
3. Computing and analyzing data
The third stage is to Analyze.
Here, the data is analyzed by processing frameworks
such as Pig, Hive, and Impala.
Pig converts the data using map and reduce steps and then
analyzes it.
Hive is also based on the map and reduce programming model
and is most suitable for structured data.
4. Visualizing the results
The fourth stage is Access, which is performed by tools
such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
Advantages and disadvantages of Hadoop
Hadoop is good for:
Processing massive amounts of data through parallelism
Handling a variety of data (structured, unstructured, semi-structured)
Using inexpensive commodity hardware
Hadoop is not good for:
Processing transactions (random access)
Work that cannot be parallelized
Low-latency data access
Processing lots of small files
Intensive calculations with small amounts of data
Big Data vs Data Science

Factors         Big Data                           Data Science
Concept         Handling large data                Analyzing data
Responsibility  Processing huge volumes of data    Understanding patterns in the
                and generating insights            data and making decisions
Industry        E-commerce, security services,     Sales, image recognition,
                telecommunications                 advertisement, risk analytics
Tools           Hadoop                             Python, R
THANK YOU!
Quiz 1 (10%) Time allotted: 18'
Write your Name, ID, and Section.
1. What is clustered computing? List the benefits of clustered computing.
2. What is data science? Explain the data processing cycle.
3. What is Hadoop, and what are the four core components of Hadoop?
4. What is big data? Explain the characteristics of big data.
5. What is a data type? Explain data types from the data analytics perspective.
6. List the key high-level activities in the Data Value Chain and explain them.