Data Science

VERY GOOD AND HELPFUL INFO

Uploaded by

bilalhassan0201

DATA SCIENCE

Data Science Introduction

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.

What is Data Science?

Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data through analysis and making predictions about the future.

By using Data Science, companies are able to make:

 Better decisions (should we choose A or B)

 Predictive analysis (what will happen next?)

 Pattern discoveries (find pattern, or maybe hidden information in the data)

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.

Examples of where Data Science is needed:

 For route planning: To discover the best routes to ship

 To foresee delays for flight/ship/train etc. (through predictive analysis)

 To create promotional offers

 To find the best suited time to deliver goods

 To forecast next year's revenue for a company

 To analyze the health benefits of training

 To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available. Examples are:

 Consumer goods

 Stock markets

 Industry

 Politics

 Logistics companies

 E-commerce

How Does a Data Scientist Work?

A Data Scientist requires expertise in several fields:

 Machine Learning

 Statistics

 Programming (Python or R)

 Mathematics

 Databases

A Data Scientist must find patterns within the data. Before they can find the patterns, they must
organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.

2. Explore and collect data - From database, web logs, customer feedback, etc.

3. Extract the data - Transform the data to a standardized format.

4. Clean the data - Remove erroneous values from the data.

5. Find and replace missing values - Check for missing values and replace them with a suitable
value (e.g. an average value).

6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m,
but the number 140 is larger than 1.8, so scaling is important).

7. Analyze data, find patterns and make future predictions.

8. Represent the result - Present the result with useful insights in a way the "company" can
understand.
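
Steps 5 and 6 can be sketched in a few lines of plain Python. The measurements below are made up for illustration, and the helper name to_cm is hypothetical. The values are first brought into one unit so that the average used to fill the gap is meaningful.

```python
from statistics import mean

# Hypothetical height measurements as (value, unit); None marks a missing value.
raw_heights = [(140, "cm"), (1.8, "m"), (None, "cm"), (165, "cm")]


def to_cm(value, unit):
    """Convert a height to centimetres so all values share one scale."""
    if value is None:
        return None
    return value * 100 if unit == "m" else value


# Step 6 (normalize): express every measurement in the same unit.
heights_cm = [to_cm(value, unit) for value, unit in raw_heights]

# Step 5 (missing values): replace None with the average of the known values.
known = [h for h in heights_cm if h is not None]
average_height = mean(known)
heights_filled = [average_height if h is None else h for h in heights_cm]

print(heights_cm)
print(heights_filled)
```

After the conversion, 1.8 m is stored as 180 cm and correctly compares as larger than 140 cm.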

Data Science - What is Data?

What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:

 Structured data

 Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.

Structured Data
Structured data is organized and easier to work with.
How to Structure Data?
We can use an array or a database table to structure or present data.

Example of an array:

[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array in Python:

Example

Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)

Data Science - Database Table

Database Table
A database table is a table with structured data.

The following table shows a database table with health data extracted from a sports watch:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

Database Table Structure

A database table consists of column(s) and row(s):

        Column 1  Column 2       Column 3   Column 4         Column 5
        Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
Row 1   30        80             120        240              10
Row 2   30        85             120        250              10
Row 3   45        90             130        260              8
Row 4   45        95             130        270              8
Row 5   45        100            140        280              0
Row 6   60        105            140        290              7
Row 7   60        110            145        300              7
Row 8   60        115            145        310              8
Row 9   75        120            150        320              0
Row 10  75        125            150        330              8

A row is a horizontal representation of data.

A column is a vertical representation of data.

Variables
A variable is defined as something that can be measured or counted.

Examples can be characters, numbers or time.

In the example below, we can observe that each column represents a variable.

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

The data set has 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse,
Calorie_Burnage, Hours_Work, Hours_Sleep). The table above shows the first 5; the Hours_Sleep column is not reproduced here.

There are 11 rows, meaning that each variable has 10 observations.

But if there are 11 rows, how come there are only 10 observations?

It is because the first row is the label, meaning that it is the name of the variable.
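
The distinction between the label row and the observations can be checked with Python's standard csv module. The sketch below retypes the five columns shown in the table as CSV text (an assumption about how such a table might be stored) and separates the label row from the data rows.

```python
import csv
import io

# The table above, retyped as CSV text (the storage format is an assumption).
table_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work
30,80,120,240,10
30,85,120,250,10
45,90,130,260,8
45,95,130,270,8
45,100,140,280,0
60,105,140,290,7
60,110,145,300,7
60,115,145,310,8
75,120,150,320,0
75,125,150,330,8
"""

rows = list(csv.reader(io.StringIO(table_text)))

labels = rows[0]         # first row: the variable names
observations = rows[1:]  # remaining rows: the measurements

print(len(rows))          # total rows, label row included
print(len(labels))        # number of variables shown
print(len(observations))  # number of observations per variable
```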

Data Science & Python

Python
Python is a programming language widely used by Data Scientists.

Python has built-in mathematical libraries and functions, making it easier to
solve mathematical problems and to perform data analysis.

Data Science - Python DataFrame

Create a DataFrame with Pandas
A data frame is a structured representation of data.

Let's define a data frame with 3 columns and 5 rows with fictional numbers:

Example:

import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

df = pd.DataFrame(data=d)

print(df)

Example Explained

 Import the Pandas library as pd

 Define data with columns and rows in a variable named d
 Create a data frame using the function pd.DataFrame()
 The data frame contains 3 columns and 5 rows
 Print the data frame output with the print() function

We write pd. in front of DataFrame() to let Python know that we want to call the DataFrame()
function from the Pandas library.
Be aware of the capital D and F in DataFrame!

Interpreting the Output

This is the output:

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11

We see that "col1", "col2" and "col3" are the names of the columns.

Do not be confused by the vertical numbers ranging from 0 to 4. They show
the position (index) of each row.

In Python, the numbering of rows starts with zero.

Now, we can use Python to count the columns and rows.

We can use df.shape[1] to find the number of columns:

Example

Count the number of columns:

count_column = df.shape[1]
print(count_column)

We can use df.shape[0] to find the number of rows:

Example

Count the number of rows:

count_row = df.shape[0]
print(count_row)

Why Can We Not Just Count the Rows and Columns Ourselves?

If we work with larger data sets with many columns and rows, counting them
by hand is confusing and error-prone. If we use the built-in functions in
Python correctly, we can be sure that the count is correct.
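
As a sketch of the same idea on a slightly larger table, the five columns of the sports watch data shown earlier can be loaded into a data frame and counted with shape, which returns the pair (rows, columns). The variable name watch_df is chosen here for illustration.

```python
import pandas as pd

# The five columns of the sports watch table shown earlier in this document.
watch_df = pd.DataFrame({
    "Duration": [30, 30, 45, 45, 45, 60, 60, 60, 75, 75],
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Max_Pulse": [120, 120, 130, 130, 140, 140, 145, 145, 150, 150],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
    "Hours_Work": [10, 10, 8, 8, 0, 7, 7, 8, 0, 8],
})

# shape is a tuple: (number of rows, number of columns).
count_row, count_column = watch_df.shape

print(count_row)
print(count_column)
```

Note that the label row is not counted: the data frame stores the variable names separately from the observations.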

Data Science Functions

This chapter shows three commonly used functions when working with Data
Science: max(), min(), and mean().

The Sports Watch Data Set

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

The data set consists of 6 variables, each with 10 observations (the table above shows the first 5 variables):

 Duration - How long did the training session last, in minutes?

 Average_Pulse - What was the average pulse during the training session?
This is measured in beats per minute
 Max_Pulse - What was the max pulse during the training session?
 Calorie_Burnage - How many calories were burned during the training
session?
 Hours_Work - How many hours did we work at our job before the training
session?
 Hours_Sleep - How many hours did we sleep the night before the training
session?

We use an underscore (_) to separate words in variable names because Python names cannot contain spaces.

The max() function

The Python max() function is used to find the highest value in an array.

Example
# Highest value among the ten Average_Pulse observations
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_max)

The min() function

The Python min() function is used to find the lowest value in an array.

Example
# Lowest value among the ten Average_Pulse observations
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_min)

The mean() function
The NumPy mean() function is used to find the average value of an array.

Example
import numpy as np

Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

Average_calorie_burnage = np.mean(Calorie_burnage)

print(Average_calorie_burnage)

Note: We write np. in front of mean to let Python know that we want to
call the mean function from the NumPy library.
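
The same three calculations are also available as methods on a Pandas column, which avoids retyping the values. This is a minimal sketch, assuming two columns of the sports watch data have been placed in a data frame named watch_df (a name chosen here for illustration).

```python
import pandas as pd

# Two columns of the sports watch data set.
watch_df = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

# Each column is a Series with its own max(), min() and mean() methods.
highest_pulse = watch_df["Average_Pulse"].max()
lowest_pulse = watch_df["Average_Pulse"].min()
average_calories = watch_df["Calorie_Burnage"].mean()

print(highest_pulse, lowest_pulse, average_calories)
```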

Data Science - Data Preparation

Before analyzing data, a Data Scientist must extract the data and make it clean
and valuable.

Extract and Read Data with Pandas

Before data can be analyzed, it must be imported/extracted.

In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file with the health data:

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)

Example Explained

 Import the Pandas library

 Name the data frame as health_data.
 header=0 means that the headers for the variable names are to be found
in the first row (note that 0 means the first row in Python)
 sep="," means that "," is used as the separator between the values. This
is because we are using the file type .csv (comma separated values)

Tip: If you have a large CSV file, you can use the head() function to show only
the top 5 rows:

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data.head())
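
Because data.csv itself is not included here, the following sketch reads the same kind of content from an in-memory string instead of a file. The CSV text reuses the first rows of the sports watch table; passing a number to head() is an optional way to choose how many rows to show.

```python
import io
import pandas as pd

# Stand-in for data.csv: the first rows of the sports watch table as CSV text.
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work
30,80,120,240,10
30,85,120,250,10
45,90,130,260,8
45,95,130,270,8
"""

# read_csv accepts a file path or a file-like object such as StringIO.
health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# head(2) returns the first two rows; head() without an argument returns at most five.
first_rows = health_data.head(2)

print(list(health_data.columns))
print(first_rows)
```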

Data Cleaning
Look at the imported data. The data are "dirty", with wrong or
unregistered values:

 There are some blank fields

 Average pulse of 9 000 is not possible
 9 000 will be treated as non-numeric, because of the space separator
 One observation of max pulse is denoted as "AF", which does not make
sense

So, we must clean the data in order to perform the analysis.
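
One way to locate such entries is to try converting a column to numbers and see which values fail. The sketch below uses made-up values that mirror the problems listed above; pd.to_numeric with errors="coerce" turns entries that cannot be parsed into NaN. It is an illustration, not necessarily the approach used for data.csv.

```python
import pandas as pd

# Hypothetical column values mirroring the issues listed above.
average_pulse = pd.Series(["80", "9 000", "105", None])
max_pulse = pd.Series(["120", "AF", "140", "150"])

# errors="coerce" replaces anything that is not a valid number with NaN.
average_pulse_num = pd.to_numeric(average_pulse, errors="coerce")
max_pulse_num = pd.to_numeric(max_pulse, errors="coerce")

# Count how many entries in each column could not be used as numbers.
print(average_pulse_num.isna().sum())
print(max_pulse_num.isna().sum())
```

Here "9 000" fails because of the space, "AF" fails because it is text, and the blank field is already missing.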

Remove Blank Rows
 When a data set is loaded with Pandas, blank cells are converted into "NaN" values
 Rows with missing observations can be removed with the dropna() function
 axis=0 means that we want to remove all rows that contain a NaN value

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

health_data.dropna(axis=0, inplace=True)

print(health_data)

The result is a data set without NaN rows.
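
Since the output for data.csv is not reproduced here, the effect can be illustrated on a small made-up CSV. In this sketch the sample values and the name clean_data are assumptions; dropna() returns a new data frame, so the original stays available for comparison.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data.csv (values invented for illustration).
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage
30,80,120,240
45,,130,260
60,105,140,
75,120,150,320
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# Blank cells are read as NaN; count them per column.
missing_per_column = health_data.isna().sum()

# Drop every row that contains at least one NaN (axis=0 means rows).
clean_data = health_data.dropna(axis=0)

print(missing_per_column)
print(len(health_data), len(clean_data))
print(clean_data)
```

Only the rows with a value in every column remain in clean_data.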
