Chapter 1
Data Science (Introduction)
Data Science
◦ Data science is a collection of techniques used to extract value from data.
◦ It has become an essential tool for any organization that collects, stores, and
processes data as part of its operations.
◦ Data science techniques rely on finding useful patterns, connections, and
relationships within data.
◦ Data science is also commonly referred to as:
◦ knowledge discovery,
◦ machine learning,
◦ predictive analytics, and
◦ data mining.
AI, MACHINE LEARNING,
AND DATA SCIENCE
◦ Artificial intelligence, machine learning, and data science are all related to each
other.
Data Science
Traditional program and machine learning.
Data Science models
• Data science is the business application of machine learning, artificial
intelligence, and other quantitative fields such as statistics, visualization,
and mathematics.
• It is an interdisciplinary field that extracts
value from data.
• In the context of how data science is used
today, it relies heavily on machine learning
and is sometimes called data mining.
Combination of Statistics, Machine
Learning, and Computing
◦ In the pursuit of extracting useful and relevant information from large datasets,
data science uses computational techniques from the disciplines of
◦ statistics,
◦ machine learning,
◦ experimentation, and
◦ database theories.
Learning Algorithms
◦ Data science can be viewed as a process of discovering previously unknown patterns in
data using automatic iterative methods.
◦ The application of sophisticated learning algorithms for extracting useful patterns from
data differentiates data science from traditional data analysis techniques.
◦ These iterative algorithms automate
the process of searching for an optimal solution for a given data problem.
◦ Based on the problem, data science work is classified into tasks such as classification,
association analysis, clustering, and regression.
◦ Each data science task uses specific learning algorithms like decision trees, neural
networks, k-nearest neighbors (k-NN), and k-means clustering, among others.
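
As a rough illustration of what "automatic iterative methods" means, the sketch below runs k-means clustering on one-dimensional data. The function name `k_means_1d`, the sample points, and the starting centroids are all made up for this example; it is a minimal sketch of the usual assign-then-update loop, not a production implementation.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Minimal k-means on 1-D data (illustrative helper, not from the slides)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        # Stop once an iteration no longer moves any centroid.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Made-up sample data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, labels = k_means_1d(points, centroids=[0.0, 5.0])
```

The loop repeats the same two steps until the solution stops changing, which is the "search for an optimal solution" that the learning algorithm automates.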
Associated Fields
◦ Several techniques are used in the steps of a data science process and are often
mentioned in conjunction with the term “data science”:
◦ Descriptive statistics:
◦ Computing the mean, standard deviation, correlation, and other descriptive statistics
quantifies the aggregate structure of a dataset.
◦ Dimensional slicing:
◦ Online analytical processing (OLAP) applications, which are prevalent in organizations, mainly provide
information on the data through dimensional slicing, filtering, and pivoting.
◦ OLAP analysis is enabled by a unique database schema design where the data are
organized as dimensions (e.g., products, regions, dates) and quantitative facts or measures
(e.g., revenue, quantity).
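
The two techniques above can be sketched on a tiny fact table. Everything here is hypothetical: the rows, the column names, and the helpers `pearson` and `slice_sum` are invented for illustration, and plain Python dictionaries stand in for a real OLAP schema.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical fact table: dimensions (product, region) and measures (quantity, revenue).
facts = [
    {"product": "A", "region": "East", "quantity": 2, "revenue": 20.0},
    {"product": "A", "region": "West", "quantity": 4, "revenue": 40.0},
    {"product": "B", "region": "East", "quantity": 6, "revenue": 60.0},
    {"product": "B", "region": "West", "quantity": 8, "revenue": 80.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def slice_sum(rows, dimension, measure):
    """Total a measure for each value of one dimension (a simple roll-up)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


quantity = [row["quantity"] for row in facts]
revenue = [row["revenue"] for row in facts]

# Descriptive statistics: aggregate structure of the dataset.
mean_revenue = mean(revenue)
stdev_revenue = stdev(revenue)  # sample standard deviation
corr_qty_revenue = pearson(quantity, revenue)

# Dimensional slicing and filtering: revenue per region, then East rows only.
revenue_by_region = slice_sum(facts, "region", "revenue")
east_rows = [row for row in facts if row["region"] == "East"]
```

Descriptive statistics summarize the whole table in a few numbers, while slicing answers the same kind of question for one dimension value at a time.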
Hypothesis testing:
◦ In confirmatory data analysis, experimental data are collected to evaluate whether a
hypothesis has enough evidence to be supported or not.
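
One way to make this concrete is an exact permutation test, sketched below. The slides do not name a specific test, so this choice is an assumption, and the two groups of measurements are made-up numbers. The idea: if the hypothesis "the groups do not differ" were true, relabeling the observations should often produce a difference in means at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_b) - mean(group_a))
    extreme = 0
    total = 0
    # Enumerate every way of relabeling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabelings at least as extreme as the observed split.
        if abs(mean(b) - mean(a)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up experimental data: a control group and a treatment group.
control = [12, 15, 11, 14, 13]
treatment = [18, 21, 19, 22, 20]
p_value = permutation_test(control, treatment)
```

A small `p_value` means few relabelings reproduce the observed gap, which counts as evidence against the "no difference" hypothesis. Enumerating all splits is only practical for small samples.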
Data engineering:
◦ Data engineering is the process of sourcing, organizing, assembling, storing, and
distributing data for effective analysis and usage.
◦ Database engineering, distributed storage, and computing frameworks (e.g., Apache
Hadoop, Spark, Kafka), parallel computing, extract-transform-load (ETL)
processing, and data warehousing constitute data engineering techniques.
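
The extract, transform, and load steps can be sketched at toy scale with the Python standard library. This is only an illustration under stated assumptions: an in-memory CSV string and an in-memory SQLite database stand in for real sources and warehouses, and the table name `sales`, the columns, and the sample rows are invented.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, including a missing value and inconsistent casing.
RAW_CSV = """date,region,revenue
2024-01-01,east,100.50
2024-01-01,west,
2024-01-02,East,200.00
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without revenue, normalize region names, cast types."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:
            continue
        cleaned.append((row["date"], row["region"].strip().title(), float(row["revenue"])))
    return cleaned


def load(rows, conn):
    """Load: store the cleaned rows in a table ready for analysis."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total_east = conn.execute(
    "SELECT SUM(revenue) FROM sales WHERE region = 'East'"
).fetchone()[0]
```

Frameworks such as Hadoop or Spark apply the same pattern across many machines; the sketch keeps only the shape of the pipeline.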
Business intelligence:
◦ Business intelligence helps organizations consume data effectively. It supports ad hoc
querying of data without the need to write technical query commands, and it uses
dashboards or visualizations to communicate facts and trends.
DATA SCIENCE
CLASSIFICATION
◦ Data science problems can be
broadly categorized into supervised
or unsupervised learning models.
◦ Supervised or directed data science
tries to infer a function or
relationship based on labeled
training data and uses this function
to map new unlabeled data.
◦ Supervised techniques predict the
value of the output variables based
on a set of input variables.
◦ To do this, a model is developed from
a training dataset where the values
of input and output are previously
known.
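
A minimal sketch of the supervised idea, assuming a simple least-squares line as the model. The training values are made up, and `fit_line` and `predict` are illustrative names. The model is built from input and output pairs that are already known, then used on an input whose output is unknown.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from labeled data."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Labeled training dataset: input values with previously known outputs (made up).
train_inputs = [1.0, 2.0, 3.0, 4.0]
train_outputs = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(train_inputs, train_outputs)


def predict(x):
    """Map a new, unlabeled input to a predicted output."""
    return slope * x + intercept


prediction = predict(5.0)
```

Here the inferred function is a straight line; other supervised techniques differ in the kind of function they learn, not in this train-then-predict pattern.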
Data Preparation