Ch01 - Introduction To Data Science

Data Science is an interdisciplinary field that combines statistics, data analysis, and machine learning to extract insights from data, enabling better decision-making and predictive analysis across various industries. A Data Scientist must possess skills in programming, statistics, and data management, and follow a structured process of data collection, cleaning, and analysis using tools like Python and its libraries. The document also discusses the importance of data structuring, types of data, and essential functions for data analysis.

Uploaded by

Yogesh Kamble



Introduction to Data Science
• Data Science Introduction
• Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.
• What is Data Science?
• Data Science is about data gathering, analysis and decision-making.

• Data Science is about finding patterns in data through analysis, and making future predictions.

• By using Data Science, companies are able to make:

• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or hidden information, in the data)
• Where is Data Science Needed?
• Data Science is used in many industries in the world today, e.g.
banking, consultancy, healthcare, and manufacturing.
• Data Science can be applied in nearly every part of a
business where data is available.
• How Does a Data Scientist Work?
• A Data Scientist requires expertise in several areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• A Data Scientist must find patterns within the data.

• Before they can find the patterns, they must organize the data in a standard format.
• Here is how a Data Scientist works:
• Ask the right questions - To understand the business problem.
• Explore and collect data - From databases, web logs, customer feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values to a practical range (e.g. 140 cm is shorter than 1.8 m, but the number 140 is larger than 1.8, so scaling is important).
• Analyze data, find patterns and make future predictions.
• Represent the result - Present the result with useful insights in a way the "company" can understand.
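The "find and replace missing values" and "normalize data" steps can be sketched as follows. This is a minimal illustration, assuming pandas is available; the height values are hypothetical and are not taken from the sports watch data.

```python
import pandas as pd

# Hypothetical heights in centimetres; None marks a missing value.
heights_cm = pd.Series([140.0, 180.0, None, 160.0])

# Find and replace missing values with the average of the known values.
filled = heights_cm.fillna(heights_cm.mean())

# Normalize the unit: convert centimetres to metres so all values are comparable.
heights_m = filled / 100

# Min-max scaling maps the values into the range 0 to 1.
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(heights_m.tolist())
print(scaled.tolist())
```

Min-max scaling is only one possible choice here; which scaling method is suitable depends on the analysis.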
• What is Data?

• Data is a collection of information.

• One purpose of Data Science is to structure data, making it interpretable and easy to work with.

• Data can be categorized into two groups:

• Structured data
• Unstructured data
• Unstructured Data
• Unstructured data is not organized.
• We must organize the data for analysis purposes.
• Structured Data
• Structured data is organized and easier to work with.
• How to Structure Data?
• We can use an array or a database table to structure or
present data.
• Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

• The following example shows how to create an array in Python:
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
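Strictly speaking, square brackets create a Python list. The sketch below, which assumes NumPy is installed and uses illustrative variable names, shows the same values both as a list and as a NumPy array.

```python
import numpy as np

# A plain Python list holding the same values as the example.
pulse_list = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# The same values as a NumPy array, which supports element-wise arithmetic.
pulse_array = np.array(pulse_list)

print(pulse_list[0], pulse_list[-1])  # first and last elements of the list
print(pulse_array + 5)                # adds 5 to every element of the array
```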
• Database Table
• A database table is a table with structured data.
• The following table shows a database table with health data
extracted from a sports watch:
• This dataset contains information about typical training sessions, such as duration, average pulse, and calorie burnage.
• Database Table Structure
• A database table consists of column(s) and row(s):
• A row is a horizontal representation of data.
• A column is a vertical representation of data.
• Variables
• A variable is defined as something that can be measured or
counted.
• Examples can be characters, numbers or time.
• In the sports watch table, we can observe that each column represents a variable.
• There are 6 columns, meaning that there are 6 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage,
Hours_Work, Hours_Sleep).
• There are 11 rows, meaning that each variable has 10
observations.
• But if there are 11 rows, how come there are only 10
observations?
• It is because the first row is the label, meaning that it is
the name of the variable.
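The difference between rows and observations can be checked in code. The sketch below assumes pandas and uses a hypothetical three-line CSV (one label row plus two observations) instead of the full sports watch table.

```python
import io
import pandas as pd

# Hypothetical CSV text: one label row followed by two observations.
raw = io.StringIO("Duration,Average_Pulse\n30,80\n45,85\n")
table = pd.read_csv(raw)

print(list(table.columns))  # the label row becomes the variable names
print(len(table))           # only the observations are counted as rows
```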
• Data Science & Python
• Python
• Python is a programming language widely used by Data
Scientists.
• Python has built-in mathematical libraries and functions, making it easier to solve mathematical problems and to perform data analysis.
• Python Libraries
• Python has libraries with large collections of mathematical
functions and analytical tools.
• Python libraries:
• Pandas - This library is used for structured data operations, such as importing CSV files, creating data frames, and preparing data.
• NumPy - This is a mathematical library. It provides a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
• Matplotlib - This library is used for visualization of data.
• SciPy - This library has linear algebra modules.
• Data Science - Python DataFrame
• Create a DataFrame with Pandas
• A data frame is a structured representation of data.
• Let's define a data frame with 3 columns and 5 rows with
fictional numbers:
• Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2':
[4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
• Example Explained
• Import the Pandas library as pd
• Define data with columns and rows in a variable named d
• Create a data frame using the function pd.DataFrame()
• The data frame contains 3 columns and 5 rows
• Print the data frame output with the print() function
• We write pd. in front of DataFrame() to let Python know that we want to call the DataFrame() function from the Pandas library.
• Interpreting the Output

Now, we can use Python to count the columns and rows.

We can use df.shape[1] to find the number of columns:
Example
Count the number of columns:

count_column = df.shape[1]
print(count_column)
• We can use df.shape[0] to find the number of rows:
• Example
• Count the number of rows:
count_row = df.shape[0]
print(count_row)
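Because df.shape is a (rows, columns) tuple, both counts can also be read in one step. A small sketch, reusing the fictional data frame defined earlier:

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# Unpack the (rows, columns) tuple into two variables.
count_row, count_column = df.shape
print(count_row, count_column)
```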
• Data Science Functions
• Three commonly used functions when working with Data
Science: max(), min(), and mean().
• The Sports Watch Data Set
• The data set above consists of 6 variables, each with 10
observations:
• Duration - How long did the training session last, in minutes?
• Average_Pulse - What was the average pulse during the training session? This is measured in beats per minute.
• Max_Pulse - What was the maximum pulse during the training session?
• Calorie_Burnage - How many calories were burned during the training session?
• Hours_Work - How many hours did we work at our job before the training session?
• Hours_Sleep - How many hours did we sleep the night before the training session?
• We use an underscore (_) to join the words in each variable name because Python names cannot contain spaces.
• The max() function
• The Python max() function is used to find the highest value among the values passed to it, either as separate arguments or as a single list.
• Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
• The min() function
• The Python min() function is used to find the lowest value among the values passed to it, either as separate arguments or as a single list.
• Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
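In practice the values usually already sit in a list or a column. The sketch below passes a single list to max() and min(); the variable names are illustrative.

```python
# The average pulse values stored in one list.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# max() and min() accept a single iterable as well as separate arguments.
average_pulse_max = max(average_pulse)
average_pulse_min = min(average_pulse)

print(average_pulse_max, average_pulse_min)
```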
• The mean() function
• The NumPy mean() function is used to find the average value of
an array.
• Example
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)

• Note: We write np. in front of mean to let Python know that we want to call the mean function from the NumPy library.
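The mean is the sum of the values divided by the number of values. As a quick check, the sketch below computes it by hand and compares it with np.mean(), using the same calorie values as the example.

```python
import numpy as np

calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# Mean by hand: total divided by the number of observations.
manual_mean = sum(calorie_burnage) / len(calorie_burnage)

# Mean with NumPy.
numpy_mean = np.mean(calorie_burnage)

print(manual_mean, numpy_mean)
```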
• Data Science - Data Preparation
• Before analyzing data, a Data Scientist must extract the data, and
make it clean and valuable.
• Extract and Read Data With Pandas
• Before data can be analyzed, it must be imported/extracted.
• In the example below, we show you how to import data using
Pandas in Python.
• We use the read_csv() function to import a CSV file with the
health data:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
• Example Explained:
• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are
to be found in the first row (note that 0 means the first row in
Python)
• sep="," means that "," is used as the separator between the
values. This is because we are using the file type .csv (comma
separated values)

• Tip: If you have a large CSV file, you can use the head() function to show only the top 5 rows:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.head())
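The examples above need a data.csv file on disk. To try read_csv() and head() without one, the sketch below reads hypothetical CSV text from memory through io.StringIO; the values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = (
    "Duration,Average_Pulse,Max_Pulse\n"
    "30,80,120\n"
    "45,85,120\n"
    "45,90,130\n"
    "60,95,130\n"
    "60,100,140\n"
    "60,105,140\n"
)

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(health_data.head())   # the first 5 rows by default
print(health_data.head(2))  # the first 2 rows
```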
• Data Cleaning
• Look at the imported data. As you can see, the data is "dirty", with wrongly registered or missing values:
• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space
separator
• One observation of max pulse is denoted as "AF", which does
not make sense
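One way to spot such values is to ask pandas to convert a column to numbers and see what fails. This is a complementary check, not the method used in the slides; the column below is hypothetical and only mimics the kinds of dirty values listed above.

```python
import pandas as pd

# Hypothetical column with a blank field, "9 000" and "AF".
max_pulse = pd.Series(["120", "AF", "9 000", None, "130"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(max_pulse, errors="coerce")

print(as_numbers)
print(as_numbers.isna().sum())  # count of missing or non-numeric values
```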
• So, we must clean the data in order to perform the analysis.
• Remove Blank Rows:
• We see that the non-numeric values (9 000 and AF) are in the
same rows with missing values.
• Solution: We can remove the rows with missing observations
to fix this problem.
• When we load a data set using Pandas, all blank cells are
automatically converted into "NaN" values.
• So, removing the NaN cells gives us a clean data set that can
be analyzed.
• We can use the dropna() function to remove the NaNs. axis=0
means that we want to remove all rows that have a NaN value:
• Example:
health_data.dropna(axis=0, inplace=True)
print(health_data)
• The result is a data set without NaN rows:
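The effect of dropna() can be seen on a small, self-contained data frame. The values below are hypothetical; unlike the example above, this sketch omits inplace=True so that the original frame is kept for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one incomplete row.
sample = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Average_Pulse": [80.0, np.nan, 100.0],
})

# Without inplace=True, dropna() returns a new frame and leaves sample unchanged.
cleaned = sample.dropna(axis=0)

print(len(sample), len(cleaned))
print(cleaned)
```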
• Data Categories:
• To analyze data, we also need to know the types of data we are
dealing with.
• Data can be split into two main categories:
• Quantitative Data - Can be expressed as a number or can be
quantified. Can be divided into two sub-categories:
• Discrete data: Numbers are counted as "whole", e.g.
number of students in a class, number of goals in a soccer
game
• Continuous data: Numbers can be of infinite precision.
e.g. weight of a person, shoe size, temperature
• Qualitative Data - Cannot be expressed as a number and cannot
be quantified. Can be divided into two sub-categories:
• Nominal data: Example: gender, hair color, ethnicity
• Ordinal data: Example: school grades (A, B, C), economic
status (low, middle, high)
• By knowing the type of your data, you will be able to know what
technique to use when analyzing them.
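The difference between nominal and ordinal data can also be expressed in pandas. The sketch below is one possible representation, using pd.Categorical with made-up values; it is not part of the original slides.

```python
import pandas as pd

# Nominal data: categories with no inherent order.
hair_color = pd.Categorical(["brown", "black", "brown"])

# Ordinal data: categories with a meaningful order (low < middle < high).
status = pd.Categorical(
    ["low", "high", "middle"],
    categories=["low", "middle", "high"],
    ordered=True,
)

print(hair_color.ordered)          # False: no order is defined
print(status.min(), status.max())  # ordering makes min and max meaningful
```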
• Data Types
• We can use the info() function to list the data types within our
data set:
• Example:
print(health_data.info())
• We see that this data set has two different types of data:
• Float64
• Object
• We cannot use objects to calculate and perform analysis here.
• We must convert the type object to float64 (float64 is a number
with a decimal in Python).
• We can use the astype() function to convert the data into float64.
• The following example converts "Average_Pulse" and
"Max_Pulse" into data type float64 (the other variables are
already of data type float64):
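A conversion of this kind might look like the sketch below. The data frame is a hypothetical stand-in for the cleaned health data, in which the two pulse columns were read as text.

```python
import pandas as pd

# Hypothetical cleaned data: two columns were read as text (object).
health_data = pd.DataFrame({
    "Duration": [30.0, 45.0],
    "Average_Pulse": ["80", "85"],
    "Max_Pulse": ["120", "125"],
})

# astype(float) converts each text column to float64.
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)

print(health_data.dtypes)
```

Note that astype(float) raises an error if a value such as "AF" is still present, which is why the data is cleaned first.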
• Analyze the Data
• When we have cleaned the data set, we can start analyzing the
data.
• We can use the Pandas describe() function to summarize data:
• Example: print(health_data.describe())
• Result:
• Count - Counts the number of observations
• Mean - The average value
• Std - Standard deviation (explained in the statistics chapter)
• Min - The lowest value
• 25%, 50% and 75% are percentiles (explained in the statistics
chapter)
• Max - The highest value
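To see these statistics on a small scale, the sketch below calls describe() on a single hypothetical column of four durations and reads individual entries from the result.

```python
import pandas as pd

# Hypothetical durations in minutes.
duration = pd.Series([30, 45, 45, 60], name="Duration")

summary = duration.describe()
print(summary)

# Individual statistics can be read by their label.
print(summary["count"], summary["mean"], summary["min"], summary["max"])
```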

