Data Science
Chapter 2: Overview
Learning outcomes
After completing this lesson, you should be able to:
Describe what data science is and the role of data scientists
Differentiate data from information
Describe the data processing life cycle
Understand different data types from diverse perspectives
Describe the data value chain in the emerging era of big data
Explain the basic concepts of big data
An Overview of Data Science
Data science is a multidisciplinary field that uses
scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-structured,
and unstructured data
Data science continues to evolve as one of the most promising
and in-demand career paths for skilled professionals
What is data?
A representation of facts, concepts, or instructions in a formalized
manner, suitable for communication, interpretation, or processing
by humans or electronic machines
Data can be described as unprocessed facts and figures
It can also be defined as groups of non-random symbols, in the form
of text, images, and voice, representing quantities, actions, and objects
What is Information?
Organized or classified data that has some meaningful value for
the receiver
Processed data on which decisions and actions are based; plain
collected data, as raw facts, cannot help much in decision-making
Interpreted data created from organized, structured, and processed
data in a particular context
Data Processing Cycle
Data processing is the restructuring or reordering of data by
people or machines to increase its usefulness and add value for
a particular purpose
Data processing consists of the following steps:
Input
Processing
Output
Input
The input data is prepared in some convenient form for processing
The form will depend on the processing machine
For example, when electronic computers are used, the input data
can be recorded on any one of several types of input media,
such as flash disks, hard disks, and so on
Processing
In this step, the input data is changed to produce data in a more
useful form
For example, a summary of sales for a month can be calculated
from the sales orders data
Output
At this stage, the result of the preceding processing step is
collected
The particular form of the output data depends on the use of the
data
For example, the output data can be the total sales for a month
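As a minimal sketch of the full cycle in Python (the sales_orders list and its field names are hypothetical):

```python
# Input: sales orders prepared in a convenient form for processing
sales_orders = [
    {"order_id": 1, "month": "2024-01", "amount": 250.0},
    {"order_id": 2, "month": "2024-01", "amount": 120.5},
    {"order_id": 3, "month": "2024-02", "amount": 310.0},
]

# Processing: change the input into a more useful form,
# here a per-month sales summary
monthly_totals = {}
for order in sales_orders:
    monthly_totals[order["month"]] = (
        monthly_totals.get(order["month"], 0.0) + order["amount"]
    )

# Output: the result of the processing step, e.g. total sales per month
for month, total in sorted(monthly_totals.items()):
    print(f"{month}: {total:.2f}")
```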
Data Types and Their Representation
In computer science and computer programming, a data type or
simply type is an attribute of data which tells the compiler or interpreter
how the programmer intends to use the data
Common data types include
Integers, Boolean, Characters, Floating-Point Numbers,
Alphanumeric Strings
A data type defines the operations that can be done on the
data, the meaning of the data, and the way values of that type
can be stored
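A minimal illustration in Python; the variable names are hypothetical, and the point is that each value's type determines which operations are valid:

```python
# Common data types and a sample operation each type supports
count = 42                  # integer: supports arithmetic
is_valid = True             # Boolean: supports logical operations
grade = "A"                 # character (a one-character string in Python)
temperature = 36.6          # floating-point number
user_id = "user_2024_001"   # alphanumeric string

print(count + 1)            # arithmetic on an integer
print(not is_valid)         # logical negation of a Boolean
print(user_id.upper())      # a string operation; invalid on an integer
```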
Data Types from a Data Analytics Perspective
Structured, Unstructured, and Semi-structured data types
Structured Data
Data that adheres to a predefined data model and is therefore
straightforward to analyze
Conforms to a tabular format with relationships between different rows
and columns
Common examples
Excel files or SQL databases
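A minimal sketch using Python's built-in sqlite3 module and a hypothetical customers table; because the data model is predefined, the data is straightforward to query:

```python
import sqlite3

# Structured data: rows and columns that conform to a predefined model
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Abebe", "Addis Ababa"), (2, "Sara", "Adama")],
)

# A fixed schema makes analysis straightforward
for row in conn.execute("SELECT name, city FROM customers"):
    print(row)
conn.close()
```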
Unstructured Data
Data that either does not have a predefined data model or is not
organized in a predefined manner
It is typically text-heavy, but may contain data such as dates,
numbers, and facts as well
Common examples
audio and video files, pictures, PDFs, data in NoSQL stores, ...
Semi-structured Data
A form of structured data that does not conform to the formal
structure of data models associated with relational databases or
other forms of data tables
It does, however, contain tags or other markers to separate semantic
elements and enforce hierarchies of records and fields within the data
Therefore, it is also known as a self-describing structure
Examples of semi-structured data
JSON and XML
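For instance, a minimal Python sketch with hypothetical field names; the keys act as the tags that separate semantic elements and nest records, which is why the structure is self-describing:

```python
import json

# Semi-structured data: no fixed relational schema, but tags/markers
# (the keys) separate semantic elements and nest records and fields
record = """
{
  "name": "Abebe",
  "contacts": {"email": "abebe@example.com"},
  "orders": [{"id": 1, "amount": 250.0}]
}
"""
data = json.loads(record)
print(data["contacts"]["email"])  # navigate by the self-describing tags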
Metadata – Data about Data
It provides additional information about a specific set of data
For example
Metadata of a photo could describe when and where the photo
was taken
The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data
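A minimal sketch of this idea in Python, with hypothetical field names:

```python
# Metadata about a photo: the date and location fields are themselves
# structured data, even though the photo's pixels are unstructured
photo_metadata = {
    "filename": "IMG_0042.jpg",
    "taken_at": "2024-05-01T09:30:00",
    "location": {"lat": 9.03, "lon": 38.74},
}
print(photo_metadata["taken_at"])
```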
Data Value Chain
Describes the information flow within a big data system as a series of
steps needed to generate value and useful insights from data
The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition, Data Analysis, Data Curation, Data Storage,
and Data Usage
Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements
The infrastructure required for data acquisition must
deliver low, predictable latency in both capturing data and in
executing queries
be able to handle very high transaction volumes, often in a
distributed environment
support flexible and dynamic data structures
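A minimal sketch of the gather, filter, and clean pattern described above (the record format and cleaning rule are hypothetical):

```python
# Acquisition: gather raw records, filter out noise, clean values
raw_records = [
    {"sensor": "s1", "value": "21.5"},
    {"sensor": "s1", "value": ""},        # missing reading
    {"sensor": "s2", "value": "19.8"},
]

cleaned = [
    {"sensor": r["sensor"], "value": float(r["value"])}
    for r in raw_records
    if r["value"]  # filter: drop records with missing values
]
print(cleaned)  # ready to be loaded into a warehouse or other store
```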
Data Analysis
Involves exploring, transforming, and modelling data with the goal
of highlighting relevant data, synthesising and extracting useful
hidden information with high potential from a business point of view
Related areas include data mining, business intelligence, and
machine learning
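A minimal sketch of the explore-and-highlight flow using Python's statistics module on hypothetical daily sales figures:

```python
import statistics

# Explore: summarize the raw data
daily_sales = [120.0, 135.5, 98.0, 150.25, 142.0]
mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)
print("mean:", mean, "stdev:", stdev)

# Transform/highlight: flag days that deviate strongly from the mean,
# surfacing hidden information with potential business relevance
notable = [x for x in daily_sales if abs(x - mean) > stdev]
print("notable days:", notable)
```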
Data Curation
Active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage
Data curation processes can be categorized into different activities:
content creation, selection, classification, transformation,
validation, and preservation
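As one illustration of the validation activity, a minimal Python sketch with a hypothetical quality rule:

```python
# Curation/validation sketch: keep only records that satisfy
# hypothetical quality rules (required field present, value in range)
def is_valid(record):
    return (
        record.get("name") is not None
        and isinstance(record.get("age"), int)
        and 0 <= record["age"] <= 120
    )

records = [{"name": "Abebe", "age": 30}, {"name": None, "age": 30},
           {"name": "Sara", "age": 300}]
curated = [r for r in records if is_valid(r)]
print(curated)  # only the trustworthy, fit-for-purpose record remains
```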
Data curators (also known as scientific curators, or data annotators)
hold the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit their purpose
A key trend for the curation of big data utilizes community and
crowdsourcing approaches
Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data
Relational Database Management Systems (RDBMS) have been the
main, and almost only, solution to the storage paradigm for nearly 40
years
Relational databases guarantee database transactions but lack
flexibility with regard to schema changes, and their performance and
fault tolerance degrade as data volumes and complexity grow, making
them unsuitable for big data scenarios
NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models
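A minimal sketch of the schema flexibility behind document-oriented NoSQL data models; plain Python dicts stand in for documents, and no real NoSQL client is assumed:

```python
# Document-style storage sketch: records in one collection need not
# share a schema, unlike rows in a relational table
collection = [
    {"_id": 1, "name": "Abebe", "email": "abebe@example.com"},
    {"_id": 2, "name": "Sara", "phones": ["+251-11-000-0000"]},  # new field
]

# Queries tolerate the varying structure
for doc in collection:
    print(doc["name"], doc.get("email", "no email on file"))
```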
Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data analysis
within the business activity
In business decision-making, it can enhance competitiveness
through reduction of costs, increased added value, or any other
parameter that can be measured against existing performance criteria
What Is Big Data?
A collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or
traditional data processing applications
Big data is characterized by the 3 Vs and more
The Vs
Volume: large amounts of data (zettabytes, massive datasets)
Velocity: data is live, streaming, or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages
To better address the high storage and computational needs of big
data, computer clusters are a better fit
Benefits of clustered computing
Resource Pooling: combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important; processing large datasets requires large amounts of all
three of these resources
High Availability: clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and processing; this
becomes increasingly important as we continue to emphasize
real-time analytics
Easy Scalability: clusters make it easy to scale horizontally by
adding additional machines to the group, so the system can react to
changes in resource requirements without expanding the physical
resources on a machine
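As a loose analogy in Python, a single machine's process pool can stand in for a cluster's pooled CPUs (real clusters use distributed frameworks such as Hadoop or Spark):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes its own partition of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # partition across 4 workers
    with Pool(processes=4) as pool:
        # Pooled compute: partial results from each worker are combined,
        # mirroring how a cluster splits work across machines
        print(sum(pool.map(partial_sum, chunks)))
```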