CHAPTER TWO
Data Science
Main Contents
Overview of Data Science
Data and Information
Data Processing Cycle
Data Science
Data Types and their Representation
Data Value Chain
Basic Concepts of Big Data
Clustered Computing and Hadoop Ecosystem
Overview of Data Science
Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.
Let’s consider this idea by thinking about some of the data involved in
buying a box of cereal from the store or supermarket:
Whatever your cereal preference (teff, wheat, or barley), you prepare
for the purchase by writing “cereal” in your notebook. This planned
purchase is a piece of data, even though it is only written in pencil,
because it records a fact you can read back later. (This is an example of data.)
Data and Information
Data
It is a representation of facts, concepts, or instructions in a
formalized manner, suitable for communication, interpretation, or
processing by humans or electronic machines.
It can be described as unprocessed facts and figures.
It can be represented with the help of characters such as
alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /,
*, <, >, =, etc.).
Data and Information…
Information
It is processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful
to the receiver.
Information is interpreted data: it is created from organized,
structured, and processed data in a particular context.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose.
Data processing consists of the following basic steps:
Input, processing, and output
These three steps constitute the data processing cycle.
[Figure: the data processing cycle - Input → Processing → Output]
Data Processing Cycle…
Input
In this step, the input data is prepared in some convenient form for
processing.
The form will depend on the processing machine.
Any information that is provided to a computer or a software
program is known as input.
The input enables the computer to do what it is designed to do and
produce an output.
Example: [keyboard, mouse...]
Data Processing Cycle…
Processing
In this step, the input data is changed to produce data in a more useful
form.
Example: [CPU, GPU, Network Interface Cards…]
Data Processing Cycle…
Output
At this stage, the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
Example: [Monitor, Printer, Projector…]
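To make the cycle concrete, here is a minimal Python sketch (the temperature-conversion task and the numbers are invented for illustration):

    # A tiny illustration of the Input -> Processing -> Output cycle (invented task).

    # Input: data is captured in a form the program can work with.
    readings_celsius = [21.5, 23.0, 19.8]   # e.g. values typed in or read from a file

    # Processing: the input is transformed into a more useful form.
    readings_fahrenheit = [c * 9 / 5 + 32 for c in readings_celsius]

    # Output: the result is presented in the form the user needs.
    for c, f in zip(readings_celsius, readings_fahrenheit):
        print(f"{c:.1f} C = {f:.1f} F")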
Data Types and their Representation
In computer programming, a data type is an attribute of data that tells
the compiler or interpreter how the programmer intends to use the data.
Data types from Computer programming perspective
The common data types include the following (a short sketch follows the list):
Integers (int): used to store whole numbers.
Booleans (bool): used to represent true or false.
Characters (char): used to store a single character such as “A”.
Floating-point numbers (float): used to store real numbers.
Alphanumeric strings (string): used to store a combination of
characters and digits such as “ddu01256”.
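As a brief illustration (a minimal Python sketch; the variable names and values are invented), each of these data types can be declared and inspected as follows:

    # Hypothetical values showing the common programming data types.
    count = 42               # integer (int): a whole number
    is_valid = True          # Boolean (bool): true or false
    grade = "A"              # character: a single character (Python stores it as a 1-character string)
    price = 19.99            # floating-point number (float): a real number
    student_id = "ddu01256"  # alphanumeric string: letters and digits combined

    for value in (count, is_valid, grade, price, student_id):
        print(value, type(value).__name__)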
Data Types and their Representation
Data types from Data Analytics perspective
From a data analytics point of view, it is important to understand that
there are three common types of data types or structures:
Structured
Semi-structured, and
Unstructured data types
The fourth data type is metadata, which is data about data.
The following figure describes the three types of data and metadata.
Data Types and their Representation…
Structured Data
Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
Structured data conforms to a tabular format with a relationship
between the different rows and columns.
Example: Excel files, Comma-Separated Values (.csv) files, and SQL
database files.
Each of these has structured rows and columns that can be sorted.
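As a minimal sketch (the student records below are invented), structured data of this kind can be read and then sorted on any column:

    import csv
    import io

    # A hypothetical structured dataset: every row shares the same columns.
    raw = io.StringIO(
        "name,age,grade\n"
        "Abebe,21,A\n"
        "Sara,22,B\n"
        "Lensa,20,A\n"
    )

    reader = csv.DictReader(raw)  # maps each row onto the header columns
    rows = sorted(reader, key=lambda row: row["name"])

    # Because the structure is fixed, any column can be selected or sorted on.
    for row in rows:
        print(row["name"], row["age"], row["grade"])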
Data Types and their Representation…
Semi-structured Data
Semi-structured data is a form of structured data that does not
conform to the formal structure of data models associated with
relational databases or other forms of data tables.
It contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data. Therefore, it is
also known as a self-describing structure.
Examples: JSON (JavaScript Object Notation) and XML (eXtensible
Markup Language) are forms of semi-structured data.
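As a minimal sketch (the record below is invented), a JSON document carries its own structure through keys and nesting, even though it does not fit a fixed table:

    import json

    # Self-describing, semi-structured record: keys act as markers for each field,
    # and records may nest or omit fields without a rigid schema.
    document = '''
    {
      "name": "Abebe",
      "id": "ddu01256",
      "courses": [
        {"title": "Data Science", "grade": "A"},
        {"title": "Artificial Intelligence"}
      ]
    }
    '''

    record = json.loads(document)
    print(record["name"])
    for course in record["courses"]:
        print(course["title"], course.get("grade", "not graded yet"))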
Data Types and their Representation…
Unstructured Data
Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in
structured databases.
Example: audio files, video files, and NoSQL (Not Only SQL) databases.
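As a small sketch (the sentence is invented), free text like this has to be processed before any structure, such as word counts, can be extracted from it:

    import re
    from collections import Counter

    # Unstructured input: free text with no predefined data model (invented example).
    text = "Cereal sales rose in May. Teff cereal sold out; wheat cereal did not."

    # Tokenize the text to impose some structure on it, then count word frequencies.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    print(counts.most_common(3))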
Data Types and their Representation…
Metadata (Data about Data)
From a technical point of view, this is not a separate data structure, but
it is one of the most important elements for Big Data analysis and big
data solutions.
Metadata is data about data.
It provides additional information about a specific set of data.
Metadata is frequently used by Big Data solutions for initial analysis.
In a set of photographs, for example, metadata could describe when
and where the photos were taken. The metadata then provides fields for
dates and locations which, by themselves, can be considered structured
data.
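Continuing the photo example as a minimal sketch (all field names and values here are invented), the metadata attached to an image is itself small, structured data that can be queried directly:

    # Hypothetical metadata for one photograph: data about the image, not the image itself.
    photo_metadata = {
        "filename": "IMG_0042.jpg",
        "taken_on": "2023-05-14",
        "location": "Dire Dawa",
        "width_px": 4032,
        "height_px": 3024,
    }

    # Fields such as dates and locations behave like structured data.
    print(photo_metadata["taken_on"], photo_metadata["location"])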
Data Types and their Representation…
[Figure: metadata alongside structured, semi-structured, and unstructured data]
Data Value Chain
The Data Value Chain is concerned with describing the information flow within
a big data system as a series of steps needed to generate value and useful insights
from data.
The data value chain describes the evolution of data from collection to analysis,
dissemination, and the final impact of data on decision making.
• The Big Data Value Chain identifies the following key high-level activities:
Data Value Chain…
Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried
out.
Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
The infrastructure required to support the acquisition of big data must deliver
low, predictable latency in both capturing data and in executing queries; be able
to handle very high transaction volumes, often in a distributed environment; and
support flexible and dynamic data structures.
Data Value Chain…
Data Analysis
It is concerned with making the raw data acquired amenable to use in
decision-making as well as domain-specific usage.
Data analysis involves:
Exploring,
Transforming, and
Modeling data
The main goal of data analysis is to highlight relevant data and to
synthesize and extract useful hidden information with high potential
value from a business point of view.
Related areas include
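As a minimal sketch of the exploring, transforming, and modeling steps above (the weekly sales figures are invented, and only the Python standard library is used):

    import statistics

    # Invented raw data: weekly cereal sales (units) for one shop.
    weekly_sales = [120, 135, 90, 160, 150, 95, 170, 140]

    # Explore: basic summary statistics.
    mean_sales = statistics.mean(weekly_sales)
    spread = statistics.stdev(weekly_sales)
    print(f"mean={mean_sales:.1f}, stdev={spread:.1f}")

    # Transform: keep only the weeks with above-average sales.
    good_weeks = [s for s in weekly_sales if s > mean_sales]

    # Model (very crudely): a naive forecast as the average of the good weeks.
    forecast = statistics.mean(good_weeks)
    print("naive forecast for next week:", round(forecast))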
Data Value Chain…
Data Curation
It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
Data curation processes can be categorized into different activities such as content
creation, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators who are responsible for improving
the accessibility and quality of data.
Data curators (scientific curators or data annotators) hold the responsibility of
ensuring that data are trustworthy, discoverable, accessible, reusable and fit their
purpose.
A key trend for the curation of big data is the use of community and
crowdsourcing approaches.
Data Value Chain…
Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
Not Only SQL (NoSQL) technologies have been designed with the
scalability goal in mind and present a wide range of solutions based on
alternative data models.
Data Value Chain…
Data Usage
It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis within
the business activity.
Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased added value,
or any other parameter that can be measured against existing
performance criteria.
Basic Concepts of Big Data
Big data is a blanket term for the non-traditional strategies and
technologies needed to gather, organize, process, and gather insights
from large datasets.
While the problem of working with data that exceeds the computing
power or storage of a single computer is not new, the pervasiveness,
scale, and value of this type of computing have greatly expanded in
recent years.
Basic Concepts of Big Data
What is Big Data?
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
In this context, a “large dataset” means a dataset too large to
reasonably process or store with traditional tooling or on a single
computer.
Big data is characterized by the 3Vs and more: Volume, Velocity, and
Variety, plus Veracity.
Basic Concepts of Big Data
Characteristics of Big Data
Volume: large amounts of data (massive datasets)
Velocity: Data is live streaming or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data,
computer clusters are a better fit.
Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits:
Resource Pooling
High Availability
Easy Scalability
Clustered Computing and Hadoop Ecosystem…
Resource Pooling
Combining the available storage space to hold data.
High Availability
Clusters can provide availability guarantees that prevent hardware or
software failures from affecting access to data and processing.
Easy Scalability
Clusters make it easy to scale horizontally by adding additional
machines to the group. This means the system can react to changes in
resource requirements without expanding the physical resources on a
machine.
Clustered Computing and Hadoop Ecosystem…
Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
The assembled computing cluster often acts as a foundation that other
software interfaces with to process the data.
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with
big data easier.
It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
It is inspired by a technical document published by Google. The four
key characteristics of Hadoop are:
Economical
Reliable
Scalable
Flexible
Clustered Computing and Hadoop Ecosystem…
The key characteristics of Hadoop:
Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically; a
few extra nodes help in scaling up the framework.
Flexible: It is flexible, so you can store as much structured and
unstructured data as you need and decide how to use it later.
Clustered Computing and Hadoop Ecosystem…
Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
It is continuously growing to meet the needs of Big Data.
It comprises the following main components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing (see the sketch after this list)
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Cluster management
• Oozie: Job Scheduling
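To give a feel for the MapReduce component listed above, here is a minimal single-machine Python sketch; real Hadoop jobs are written against the Hadoop APIs, so this only mimics the map, shuffle, and reduce phases on invented input:

    from collections import defaultdict

    # Invented input lines, standing in for files stored in HDFS.
    lines = [
        "big data needs clusters",
        "clusters need resource management",
        "big clusters need yarn",
    ]

    # Map phase: emit (key, value) pairs - here, (word, 1) for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: group the emitted values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key - here, sum the counts.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)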
Big Data Life Cycle with Hadoop (Stages)
1. Ingesting data into the system:
The first stage of Big Data processing is Ingest.
The data is ingested or transferred to Hadoop from various sources such
as relational databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers
event data.
2. Processing the data in storage:
The second stage is Processing.
In this stage, the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase.
Big Data Life Cycle with Hadoop…
3. Computing and analyzing data:
The third stage is to Analyze.
Here, the data is analyzed by processing frameworks such as Pig, Hive,
and Impala.
Pig converts the data using MapReduce and then analyzes it.
Hive is also based on the MapReduce programming model and is most
suitable for structured data.
4. Visualizing the results:
The fourth stage is Access, which is performed by tools such as Hue and
Cloudera Search.
In this stage, the analyzed data can be accessed by users.
END OF CHAPTER TWO
Next: Chapter Three [Artificial Intelligence]