Data Science
Chapter 2: Overview
Learning outcomes
After completing this lesson, you should be able to:
Describe what data science is and the role of data scientists
Differentiate data from information
Describe the data processing life cycle
Understand different data types from diverse perspectives
Describe the data value chain in the emerging era of big data
Explain the basic concepts of big data
An Overview of Data Science
Data science is a multidisciplinary field that uses
scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-structured,
and unstructured data
Data science continues to evolve as one of the most promising
and in-demand career paths for skilled professionals
What is data?
A representation of facts, concepts, or instructions in a formalized
manner, suitable for communication, interpretation, or processing
by humans or electronic machines
Data can be described as unprocessed facts and figures
It can also be defined as groups of non-random symbols, in the form
of text, images, and voice, representing quantities, actions, and objects
What is Information?
Organized or classified data that has some meaningful value for
the receiver
Processed data on which decisions and actions are based; plain
collected data, as raw facts, cannot help much in decision-making
Interpreted data created from organized, structured, and processed
data in a particular context
Data Processing Cycle
Data processing is the restructuring or reordering of data by
people or machines to increase its usefulness and add value for
a particular purpose
Data processing consists of the following steps:
Input
Processing
Output
Input
The input data is prepared in some convenient form for processing
The form will depend on the processing machine
For example, when electronic computers are used, the input data
can be recorded on any one of several types of input media,
such as flash disks, hard disks, and so on
Processing
In this step, the input data is changed to produce data in a more
useful form
For example, a summary of sales for a month can be calculated
from the sales orders data
Output
At this stage, the result of the preceding processing step is
collected
The particular form of the output data depends on the use of the
data
For example, the output data can be the total sales for a month
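As a minimal sketch of the full cycle in Python (the sales_orders list and its field names are hypothetical):

```python
# Input: sales orders prepared in a convenient form for processing
sales_orders = [
    {"order_id": 1, "month": "2024-01", "amount": 250.0},
    {"order_id": 2, "month": "2024-01", "amount": 120.5},
    {"order_id": 3, "month": "2024-02", "amount": 310.0},
]

# Processing: change the input into a more useful form,
# here a per-month sales summary
monthly_totals = {}
for order in sales_orders:
    monthly_totals[order["month"]] = (
        monthly_totals.get(order["month"], 0.0) + order["amount"]
    )

# Output: the result of the processing step, e.g. total sales per month
for month, total in sorted(monthly_totals.items()):
    print(f"{month}: {total:.2f}")
```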
Data Types and Their Representation
In computer science and computer programming, a data type or
simply type is an attribute of data which tells the compiler or interpreter
how the programmer intends to use the data
Common data types include
Integers, Boolean, Characters, Floating-Point Numbers,
Alphanumeric Strings
A data type defines the operations that can be done on the
data, the meaning of the data, and the way values of that type
can be stored
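A minimal illustration in Python; the variable names are hypothetical, and the point is that each value's type determines which operations are valid:

```python
# Common data types and a sample operation each type supports
count = 42                  # integer: supports arithmetic
is_valid = True             # Boolean: supports logical operations
grade = "A"                 # character (a one-character string in Python)
temperature = 36.6          # floating-point number
user_id = "user_2024_001"   # alphanumeric string

print(count + 1)            # arithmetic on an integer
print(not is_valid)         # logical negation of a Boolean
print(user_id.upper())      # a string operation; invalid on an integer
```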
Data Types from a Data Analytics Perspective
Structured, Unstructured, and Semi-structured data types
Structured Data
Data that adheres to a predefined data model and is therefore
straightforward to analyze
Conforms to a tabular format with relationships between different rows
and columns
Common examples
Excel files or SQL databases
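A minimal sketch using Python's built-in sqlite3 module and a hypothetical customers table; because the data model is predefined, the data is straightforward to query:

```python
import sqlite3

# Structured data: rows and columns that conform to a predefined model
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Abebe", "Addis Ababa"), (2, "Sara", "Adama")],
)

# A fixed schema makes analysis straightforward
for row in conn.execute("SELECT name, city FROM customers"):
    print(row)
conn.close()
```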
Unstructured Data
Data that either does not have a predefined data model or is not
organized in a predefined manner
It is typically text-heavy, but may contain data such as dates,
numbers, and facts as well
Common examples
audio and video files, pictures, PDFs, data in NoSQL stores, ...
Semi-structured Data
A form of structured data that does not conform to the formal
structure of data models associated with relational databases or
other forms of data tables
It does, however, contain tags or other markers to separate semantic
elements and enforce hierarchies of records and fields within the data
Therefore, it is also known as a self-describing structure
Examples of semi-structured data
JSON and XML
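For instance, a minimal Python sketch with hypothetical field names; the keys act as the tags that separate semantic elements and nest records, which is why the structure is self-describing:

```python
import json

# Semi-structured data: no fixed relational schema, but tags/markers
# (the keys) separate semantic elements and nest records and fields
record = """
{
  "name": "Abebe",
  "contacts": {"email": "abebe@example.com"},
  "orders": [{"id": 1, "amount": 250.0}]
}
"""
data = json.loads(record)
print(data["contacts"]["email"])  # navigate by the self-describing tags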
Metadata – Data about Data
It provides additional information about a specific set of data
For example
Metadata of a photo could describe when and where the photo
was taken
The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data
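A minimal sketch of this idea in Python, with hypothetical field names:

```python
# Metadata about a photo: the date and location fields are themselves
# structured data, even though the photo's pixels are unstructured
photo_metadata = {
    "filename": "IMG_0042.jpg",
    "taken_at": "2024-05-01T09:30:00",
    "location": {"lat": 9.03, "lon": 38.74},
}
print(photo_metadata["taken_at"])
```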
Data Value Chain
Describes the information flow within a big data system as a series of
steps needed to generate value and useful insights from data
The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition, Data Analysis, Data Curation, Data Storage,
and Data Usage
Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements
The infrastructure required for data acquisition must
deliver low, predictable latency in both capturing data and in
executing queries
be able to handle very high transaction volumes, often in a
distributed environment
support flexible and dynamic data structures
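A minimal sketch of the gather, filter, and clean pattern described above (the record format and cleaning rule are hypothetical):

```python
# Acquisition: gather raw records, filter out noise, clean values
raw_records = [
    {"sensor": "s1", "value": "21.5"},
    {"sensor": "s1", "value": ""},        # missing reading
    {"sensor": "s2", "value": "19.8"},
]

cleaned = [
    {"sensor": r["sensor"], "value": float(r["value"])}
    for r in raw_records
    if r["value"]  # filter: drop records with missing values
]
print(cleaned)  # ready to be loaded into a warehouse or other store
```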
Data Analysis
Involves exploring, transforming, and modelling data with the goal
of highlighting relevant data, synthesising and extracting useful
hidden information with high potential from a business point of view
Related areas include data mining, business intelligence, and
machine learning
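A minimal sketch of the explore-and-highlight flow using Python's statistics module on hypothetical daily sales figures:

```python
import statistics

# Explore: summarize the raw data
daily_sales = [120.0, 135.5, 98.0, 150.25, 142.0]
mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)
print("mean:", mean, "stdev:", stdev)

# Transform/highlight: flag days that deviate strongly from the mean,
# surfacing hidden information with potential business relevance
notable = [x for x in daily_sales if abs(x - mean) > stdev]
print("notable days:", notable)
```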
Data Curation
Active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage
Data curation processes can be categorized into different activities:
content creation, selection, classification, transformation,
validation, and preservation
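As one illustration of the validation activity, a minimal Python sketch with a hypothetical quality rule:

```python
# Curation/validation sketch: keep only records that satisfy
# hypothetical quality rules (required field present, value in range)
def is_valid(record):
    return (
        record.get("name") is not None
        and isinstance(record.get("age"), int)
        and 0 <= record["age"] <= 120
    )

records = [{"name": "Abebe", "age": 30}, {"name": None, "age": 30},
           {"name": "Sara", "age": 300}]
curated = [r for r in records if is_valid(r)]
print(curated)  # only the trustworthy, fit-for-purpose record remains
```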
Data curators (also known as scientific curators, or data annotators)
hold the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit their purpose
A key trend for the curation of big data utilizes community and
crowdsourcing approaches
Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data
Relational Database Management Systems (RDBMS) have been the
main, and almost only, solution to the storage paradigm for nearly 40
years
Relational databases guarantee database transactions but lack
flexibility with regard to schema changes, and their performance and
fault tolerance degrade as data volumes and complexity grow, making
them unsuitable for big data scenarios
NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models
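A minimal sketch of the schema flexibility behind document-oriented NoSQL data models; plain Python dicts stand in for documents, and no real NoSQL client is assumed:

```python
# Document-style storage sketch: records in one collection need not
# share a schema, unlike rows in a relational table
collection = [
    {"_id": 1, "name": "Abebe", "email": "abebe@example.com"},
    {"_id": 2, "name": "Sara", "phones": ["+251-11-000-0000"]},  # new field
]

# Queries tolerate the varying structure
for doc in collection:
    print(doc["name"], doc.get("email", "no email on file"))
```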
Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data analysis
within the business activity
In business decision-making, it can enhance competitiveness
through reduction of costs, increased added value, or any other
parameter that can be measured against existing performance criteria
What Is Big Data?
A collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or
traditional data processing applications
Big data is characterized by the 3 Vs and more
The Vs
Volume: large amounts of data (zettabytes, massive datasets)
Velocity: data is live, streaming, or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages
To better address the high storage and computational needs of big
data, computer clusters are a better fit
Benefits of clustered computing
Resource Pooling: combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important; processing large datasets requires large amounts of all
three of these resources
High Availability: clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and processing; this
becomes increasingly important as we continue to emphasize
real-time analytics
Easy Scalability: clusters make it easy to scale horizontally by
adding additional machines to the group, so the system can react to
changes in resource requirements without expanding the physical
resources on a machine
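As a loose analogy in Python, a single machine's process pool can stand in for a cluster's pooled CPUs (real clusters use distributed frameworks such as Hadoop or Spark):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes its own partition of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # partition across 4 workers
    with Pool(processes=4) as pool:
        # Pooled compute: partial results from each worker are combined,
        # mirroring how a cluster splits work across machines
        print(sum(pool.map(partial_sum, chunks)))
```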