DSA Notes Unit-01

The document provides an introduction to Data Science, detailing its interdisciplinary nature and the methodologies used to extract insights from data. It covers key terminologies such as Big Data, Business Intelligence, Data Analytics, and the various types of data repositories like Data Lakes and Data Warehouses. Additionally, it outlines the roles of personnel involved in data science, including Data Scientists, Analysts, Engineers, and Architects.

Uploaded by

Sohail Agha


Padre Conceição College Of Engineering

CEAM-03 – Data Science and Analytics

(T.E. Computer, Sem-VI)

Presented by: Asst Prof. Vidya G

30/07/2025 INTRODUCTION TO DATA SCIENCE 1

Syllabus


UNIT-01
Introduction to Data Science


Data Science
Data Science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.
It is a convergence of various knowledge domains that allows analysis methods to be used effectively, so that experts get better output in their activities (refer Fig. 1.1).
Data Science is one of the more recent fields, combining big data and unstructured data with statistics, analytics, and business intelligence.

Data Science is the discipline of using quantitative methods from statistics and mathematics, along with technology, to develop algorithms designed to discover patterns, predict outcomes, and find optimal solutions to complex problems.

Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the sub-domains of machine learning, classification, cluster analysis, data mining, data lakes and warehousing, databases, and visualization (refer Fig. 1.2).

Terminology Related with Data Science
1. Big Data

Big Data is a term applied to datasets whose size or type is beyond the ability of traditional relational databases to capture, manage and process with low latency.

Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage and process data within a tolerable elapsed time.

2. Business Intelligence (BI)

BI is the technology that uses transformed and loaded historical data to create reports.

It is a set of methodologies, processes, and theories that transform raw data into useful information to help companies make better decisions.

BI is a process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.

Functions in BI technologies include reporting, online analytical processing, analytics, data
mining, process mining, complex event processing, business performance management,
benchmarking, text mining, predictive analytics and prescriptive analytics.

BI can be used by enterprises to support a wide range of business decisions ranging from
operational to strategic.

3. Data Analytics

The terms Data Analytics and analytics are used to describe the field and its comprehensive collection of associated methods.

Data analysts collect, process and perform statistical analyses of data.

Difference between Big Data and Business Intelligence

BIG DATA | BUSINESS INTELLIGENCE (BI)
Big Data refers to the act of generating, capturing, and processing enormous amounts of data on a continuous basis. | BI encompasses only commercial activities, its domain is larger. The data is collected in data lakes and refined in data warehousing through data mining techniques.
Big Data is the technology which collects and transforms huge data that is in an unstructured form. | BI refers to software and systems that import data streams of any size and use them to generate informational displays that point to specific decisions.

4. Data Wrangling

The process of converting data, through the use of scripting languages, into a form that is easier to work with is known as data wrangling or data munging.

Example: given 900,000 birth year values in the format yyyy-dd-mm and 100,000 in the format mm/dd/yyyy, writing a Perl script that converts the latter to look like the former, so that all the values can be used together, is data wrangling.
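
The conversion described in the example can be sketched in a few lines. The notes mention a Perl script; this sketch uses Python instead, and the function name and sample values are assumptions made for illustration only.

```python
from datetime import datetime

def normalize_birth_date(value: str) -> str:
    """Return the date as yyyy-dd-mm, converting mm/dd/yyyy inputs."""
    value = value.strip()
    if "/" in value:
        # Minority format: mm/dd/yyyy
        parsed = datetime.strptime(value, "%m/%d/%Y")
        return parsed.strftime("%Y-%d-%m")
    # Majority format: parse only to validate, then return unchanged
    datetime.strptime(value, "%Y-%d-%m")
    return value

# Hypothetical sample values mixing both formats
raw = ["1990-25-12", "12/25/1990", "07/04/1985"]
cleaned = [normalize_birth_date(v) for v in raw]
print(cleaned)
```

After this step every value shares one format, so the full set can be analysed together.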

5. Algorithm

A series of repeatable steps for carrying out a certain type of task with data.

6. Machine Learning

Analytics in which computers “learn” from data to produce models or rules that apply to those data and other similar data.

Predictive modelling techniques include neural nets, classification and regression trees, naïve Bayes, k-nearest neighbour, and support vector machines.
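
As a rough illustration of one technique from the list above, here is a minimal k-nearest neighbour classifier. The training points, labels, and function name are hypothetical, and a real project would more likely rely on an established library.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training items by Euclidean distance to the query and keep the closest k
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled data: two well-separated groups
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 5.1)))
```

The model here is simply the stored data: the "rule" learned from it is applied to new, similar points.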

7. Web Analytics

Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions, generally with a view to learning which web presentations are most effective in achieving the organizational goal.

This goal might be to sell products and services on a site, to serve and sell advertising space, or to purchase on other sites.

An advantage of web data is its volume and constant flow.
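
One simple way to compare web presentations is the conversion rate, that is, conversions divided by page views. The sketch below assumes two hypothetical page layouts with made-up counts; it is an illustration, not a full statistical test.

```python
def conversion_rate(views: int, conversions: int) -> float:
    """Fraction of page views that ended in a conversion."""
    return conversions / views if views else 0.0

# Hypothetical counts for two presentations of the same page
pages = {
    "layout_a": {"views": 1000, "conversions": 30},
    "layout_b": {"views": 800, "conversions": 40},
}

rates = {name: conversion_rate(s["views"], s["conversions"]) for name, s in pages.items()}
best = max(rates, key=rates.get)
print(rates, best)
```

In practice the difference between the two rates would also need a significance check before one layout is declared more effective.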


Methods of Data Repository
Data repository is the term used for data storage.

A data repository refers to a data storage entity into which data has been specifically partitioned for an analytical or reporting purpose.

It takes several different forms:

• Data lakes
• Data marts
• Data warehouses
• Big Data, Hadoop and similar frameworks

1. Data lake
• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed and refined.
• A data lake is a shared data environment that comprises multiple repositories and capitalizes on big data technologies.
• It provides data to an organization for a variety of analytic processes.

• A data lake is associated with Hadoop-oriented object storage, in which an organization's data is loaded into the Hadoop platform.

• Business analytics and data mining tools are applied to the data where it resides on the Hadoop cluster.

• The data lake concept takes Hadoop deployments to their extreme, creating a potentially limitless reservoir for disparate collections of structured, unstructured and semi-structured data generated by transaction systems, social networks, server logs, sensors and other sources.

Characteristics of Data Lake:

1. All data is loaded from source systems. No data is turned away.

2. Data is stored at the leaf level in an untransformed or nearly untransformed state.
3. Data is transformed and schema is applied to fulfil the needs of analysis.
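
The three characteristics above are often summarised as "schema on read". The sketch below imitates the idea with an in-memory list standing in for the lake. The names `raw_store`, `ingest`, and `read_with_schema`, and the sample records, are assumptions for illustration and not part of any Hadoop API.

```python
import json

raw_store = []  # stands in for the lake: records are kept exactly as received

def ingest(record_text: str) -> None:
    """Store a record untransformed; no data is turned away."""
    raw_store.append(record_text)

def read_with_schema(schema: dict) -> list:
    """Apply a schema only at analysis time, skipping records that do not fit it."""
    rows = []
    for text in raw_store:
        try:
            record = json.loads(text)
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
        except (json.JSONDecodeError, KeyError, ValueError, TypeError):
            continue  # the record stays in the lake; it just does not match this analysis
    return rows

ingest('{"user": "u1", "amount": "19.99", "channel": "web"}')
ingest('{"sensor": "t7", "reading": 21.5}')
ingest('plain text server log line')

sales = read_with_schema({"user": str, "amount": float})
print(sales)
```

All three records remain stored in raw form, while only the one that fits the sales schema is transformed for this particular analysis.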

2. Data Warehouse

• A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making.

• Data warehousing involves data cleaning, data integration, and data consolidation.

• A core component of BI, a data warehouse is a central repository of integrated data from one or more disparate sources, and it is used for reporting and data analytics.

• Unlike a hierarchical database, which stores data in files or folders, a data lake uses a flat architecture to store data.

• Example: transactions that are updated on a daily basis.

• A data warehouse provides generalized and consolidated data in a multidimensional view.

• A data warehouse provides online analytical processing (OLAP) tools.

• These tools help in the interactive and effective analysis of data in a multidimensional space. The analysis results in data mining.
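
To make the multidimensional view more concrete, the sketch below performs a roll-up, one of the typical OLAP operations, over a tiny fact table. The dimensions (region, product, quarter), the sales figures, and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: each row has three dimensions and one measure
facts = [
    {"region": "North", "product": "Laptop", "quarter": "Q1", "sales": 120},
    {"region": "North", "product": "Phone", "quarter": "Q1", "sales": 80},
    {"region": "South", "product": "Laptop", "quarter": "Q1", "sales": 100},
    {"region": "South", "product": "Laptop", "quarter": "Q2", "sales": 150},
]

def roll_up(rows, dimensions, measure="sales"):
    """Aggregate the measure over the chosen dimensions, summing out the rest."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_region = roll_up(facts, ["region"])
by_region_quarter = roll_up(facts, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```

Choosing a different list of dimensions gives a different slice of the same consolidated data, which is the kind of interactive analysis OLAP tools are meant to support.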

Understanding a Data Warehouse

1. A data warehouse is a database, which is kept separate from the organization’s operational database.

2. There is no frequent updating done in data warehouse.

3. It possesses consolidated historical data, which helps the organization to analyze the business.

4. A data warehouse helps executives to organize, understand and use their data to take strategic decisions.

5. Data warehouse systems help in the integration of a diversity of application systems.

6. Data warehouse systems help in consolidated historical data analysis.

Data warehouse models
From the perspective of data warehouse architecture, we have the following data warehouse models:
1. Virtual warehouse
2. Data marts
3. Enterprise warehouse

1. Virtual warehouse
• The view over an operational data warehouse is known as a virtual warehouse.
• It is easy to build a virtual warehouse.
• Building a virtual warehouse requires excess capacity on operational database servers.
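
The idea of a warehouse defined as a view can be sketched with SQLite from the Python standard library. The `orders` table, the `sales_by_region` view, and the sample rows are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 70.0)],
)

# The "virtual warehouse": a summary view computed from the operational data on demand
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

query = "SELECT region, total FROM sales_by_region ORDER BY region"
summary = conn.execute(query).fetchall()

# A new operational row is visible through the view without any reload step
conn.execute("INSERT INTO orders (region, amount) VALUES ('South', 30.0)")
updated = conn.execute(query).fetchall()
print(summary, updated)
```

Because the view is evaluated against the operational tables each time it is queried, it is easy to build, but the analytical load lands on the operational server, which is why excess capacity is needed there.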

2. Data marts
• A data mart contains a subset of organization-wide data.
• This subset is valuable to a specific group within an organization.
• Example: the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.

3. Enterprise warehouse
• An enterprise warehouse collects all the information and the subjects spanning an entire organization.
1. It provides enterprise-wide data integration.
2. The data is integrated from operational systems and external information providers.
3. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Process flow in data warehouse


Fig 1.3. Processes in Data Warehouse

Functions of data warehouse-tools and utilities


Fig. 1.4. Functions of data warehousing

Personnel involved with data science
1. Data Scientist

• A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

• The title data scientist implies the ability to work with large volumes of data generated not by studies, but by ongoing organizational processes.

• Due to the complexity of dealing with large datasets and data flows, most of the day-to-day work lies in data pipeline challenges.

2. Data Analyst

• Data analysts collect, process and perform statistical analyses of data.

• Their skills may not be as advanced as those of a data scientist.

• E.g. they may not be able to create new algorithms, but the goals are the same: to discover how data can be used to answer questions and solve problems.

3. Data Engineer

• A specialist in data wrangling.

• Data engineers are the ones who take the messy data and build the infrastructure for real, tangible analysis.

• They run ETL software, and enrich and clean all the data that companies have been storing for years.
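
ETL stands for extract, transform, load. The sketch below walks through the three stages on a tiny, made-up CSV extract, using only the Python standard library. The column names, cleaning rules, and `payments` table are assumptions for illustration and do not describe any particular ETL product.

```python
import csv
import io
import sqlite3

# Hypothetical messy source data: stray spaces, mixed case, one missing amount
RAW_CSV = """name,city,amount
 Alice ,goa,100
Bob,PUNE,
 carol,Goa,250
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalise case, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((row["name"].strip().title(), row["city"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, city, amount FROM payments ORDER BY name").fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and replace, which matters once the pipeline grows beyond a single file.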

4. Data Architect
• Data architects create blueprints for data management systems.
• After assessing a company’s potential data sources, architects design a plan to integrate, centralize, protect and maintain them.
• This allows employees to access critical information in the right place at the right time.

Types of Data


Unstructured data


Semi-Structured data


Meta data

Structured data


The Data Science Process (DSP)
• DSP is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently.

• DSP helps improve team collaboration and learning.

• It contains a distillation of the best practices and structures from Microsoft and others in the industry that facilitate the successful implementation of data science initiatives.

• The goal is to help companies fully realize the benefits of their analytics program.

A generic description of the process is provided here; it can be implemented with a variety of tools.

The process may involve seven clear-cut steps for data analytics.