Data Science

VERY GOOD AND HELPFUL INFO

Uploaded by

bilalhassan0201
Copyright
© All Rights Reserved

DATA SCIENCE

Data Science Introduction


Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.

What is Data Science?


Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data through analysis and making future predictions.

By using Data Science, companies are able to make:

• Better decisions (should we choose A or B?)

• Predictive analysis (what will happen next?)

• Pattern discoveries (find patterns, or maybe hidden information, in the data)

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.

Examples of where Data Science is needed:

• For route planning: to discover the best routes to ship

• To foresee delays for flights, ships, trains, etc. (through predictive analysis)

• To create promotional offers

• To find the best-suited time to deliver goods

• To forecast next year's revenue for a company

• To analyze the health benefits of training

• To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available. Examples are:

• Consumer goods

• Stock markets

• Industry

• Politics

• Logistics companies

• E-commerce

How Does a Data Scientist Work?


A Data Scientist requires expertise in several fields:

• Machine Learning

• Statistics

• Programming (Python or R)

• Mathematics

• Databases

A Data Scientist must find patterns within the data. Before they can find the patterns, they must
organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.

2. Explore and collect data - From databases, web logs, customer feedback, etc.

3. Extract the data - Transform the data to a standardized format.

4. Clean the data - Remove erroneous values from the data.

5. Find and replace missing values - Check for missing values and replace them with a suitable
value (e.g. an average value).

6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m,
but the number 140 is larger than the number 1.8, so scaling is important).

7. Analyze data, find patterns and make future predictions.

8. Represent the result - Present the result with useful insights in a way the "company" can
understand.
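
Steps 3, 5 and 6 above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the height values, the column names and the choice of min-max scaling are made up for the example and are not part of the original workflow.

```python
import pandas as pd

# Hypothetical height measurements in mixed units; None marks a missing value.
raw = pd.DataFrame({
    "height": [140.0, 1.8, None, 165.0],
    "unit": ["cm", "m", "cm", "cm"],
})

# Step 3: transform to one standard format (everything in centimetres).
factor = raw["unit"].map({"cm": 1.0, "m": 100.0})
height_cm = raw["height"] * factor

# Step 5: replace the missing value with the average of the known values.
height_cm = height_cm.fillna(height_cm.mean())

# Step 6: min-max scaling, so the smallest value becomes 0 and the largest 1.
scaled = (height_cm - height_cm.min()) / (height_cm.max() - height_cm.min())

print(height_cm.tolist())
print(scaled.tolist())
```

After the unit conversion, 1.8 m becomes 180 cm and is correctly treated as larger than 140 cm.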

Data Science - What is Data?

What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:

• Structured data

• Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.

Structured Data
Structured data is organized and easier to work with.
How to Structure Data?
We can use an array or a database table to structure or present data.

Example of an array:

[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array (a list) in Python:

Example

Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)

Data Science - Database Table

Database Table
A database table is a table with structured data.

The following table shows a database table with health data extracted from a sports watch:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

Database Table Structure


A database table consists of column(s) and row(s):

        Column 1  Column 2       Column 3   Column 4         Column 5
        Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
Row 1   30        80             120        240              10
Row 2   30        85             120        250              10
Row 3   45        90             130        260              8
Row 4   45        95             130        270              8
Row 5   45        100            140        280              0
Row 6   60        105            140        290              7
Row 7   60        110            145        300              7
Row 8   60        115            145        310              8
Row 9   75        120            150        320              0
Row 10  75        125            150        330              8

A row is a horizontal representation of data.

A column is a vertical representation of data.


Variables
A variable is defined as something that can be measured or counted.

Examples can be characters, numbers or time.

In the example below, we can observe that each column represents a variable.

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

The full data set has 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse,
Calorie_Burnage, Hours_Work, Hours_Sleep). The table above shows the first 5; the Hours_Sleep column is not shown.

There are 11 rows, meaning that each variable has 10 observations.

But if there are 11 rows, how come there are only 10 observations?

It is because the first row is the label, meaning that it is the name of the variable.
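
The relationship between rows, labels and observations can be checked with a short pandas sketch. It uses only the five columns printed in the table above, and the variable names (health, variables, observations) are chosen for this illustration.

```python
import pandas as pd

# The five columns printed in the table above.
health = pd.DataFrame({
    "Duration":        [30, 30, 45, 45, 45, 60, 60, 60, 75, 75],
    "Average_Pulse":   [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Max_Pulse":       [120, 120, 130, 130, 140, 140, 145, 145, 150, 150],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
    "Hours_Work":      [10, 10, 8, 8, 0, 7, 7, 8, 0, 8],
})

variables = list(health.columns)   # the label row becomes the column names
observations = len(health)         # the label row is not counted as data

print(variables)
print(observations)
```

The label row ends up as column names rather than as data, which is why 11 printed rows give 10 observations.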

Data Science & Python


Python
Python is a programming language widely used by Data Scientists.

Python has built-in mathematical functions and libraries, making it easier to
solve mathematical problems and to perform data analysis.

Data Science - Python Data-Frame

Create a Data-Frame with Pandas
A data frame is a structured representation of data.

Let's define a data frame with 3 columns and 5 rows with fictional numbers:

Example:

import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

df = pd.DataFrame(data=d)

print(df)
Example Explained

• Import the Pandas library as pd
• Define data with columns and rows in a variable named d
• Create a data frame using the function pd.DataFrame()
• The data frame contains 3 columns and 5 rows
• Print the data frame output with the print() function

We write pd. in front of DataFrame() to let Python know that we want to use the DataFrame()
function from the Pandas library.
Be aware of the capital D and F in DataFrame!

Interpreting the Output


This is the output:

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11

We see that "col1", "col2" and "col3" are the names of the columns.

Do not be confused by the numbers 0-4 in the leftmost column. They are the
row index and tell us the position of each row.

In Python, the numbering of rows starts with zero.

Now, we can use Python to count the columns and rows.

We can use df.shape[1] to find the number of columns:

Example

Count the number of columns:

count_column = df.shape[1]
print(count_column)

We can use df.shape[0] to find the number of rows:

Example

Count the number of rows:

count_row = df.shape[0]
print(count_row)
Why Can We Not Just Count the Rows and Columns Ourselves?
If we work with larger data sets with many columns and rows, counting them
by hand is confusing and error-prone. If we use the built-in functions in
Python correctly, we can be sure that the count is correct.
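
As a rough illustration of this point, the sketch below builds a larger data frame and reads both counts from df.shape in one step. The size (1,000 rows by 25 columns) and the zero values are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical larger data set: 1,000 rows and 25 columns filled with zeros.
big = pd.DataFrame(np.zeros((1000, 25)))

# shape is a (rows, columns) tuple, so both counts can be unpacked at once.
count_row, count_column = big.shape

print(count_row, count_column)
```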
Data Science Functions
This chapter shows three commonly used functions when working with Data
Science: max(), min(), and mean().

The Sports Watch Data Set

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work
30        80             120        240              10
30        85             120        250              10
45        90             130        260              8
45        95             130        270              8
45        100            140        280              0
60        105            140        290              7
60        110            145        300              7
60        115            145        310              8
75        120            150        320              0
75        125            150        330              8

The full data set consists of 6 variables, each with 10 observations (the Hours_Sleep column is not shown in the table above):

• Duration - How long did the training session last, in minutes?
• Average_Pulse - What was the average pulse during the training session?
This is measured in beats per minute.
• Max_Pulse - What was the max pulse during the training session?
• Calorie_Burnage - How many calories were burned during the training
session?
• Hours_Work - How many hours did we work at our job before the training
session?
• Hours_Sleep - How many hours did we sleep the night before the training
session?

We use an underscore (_) in place of a space in variable names because Python names cannot contain spaces.

The max() function


The Python max() function is used to find the highest value in an array.
Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_max)

The min() function


The Python min() function is used to find the lowest value in an array.

Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

print(Average_pulse_min)
The mean() function
The NumPy mean() function is used to find the average value of an array.

Example
import numpy as np

Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

Average_calorie_burnage = np.mean(Calorie_burnage)

print(Average_calorie_burnage)

Note: We write np. in front of mean to let Python know that we want to
use the mean function from the NumPy library.
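
The three functions can also be applied to values stored in a list, which is closer to how the data would be held in practice. The sketch below uses the Average_Pulse and Calorie_Burnage values from the sports watch data set; the lowercase variable names are chosen for this illustration.

```python
import numpy as np

# Average_Pulse and Calorie_Burnage values from the sports watch data set.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

pulse_max = max(average_pulse)            # the built-in max() also accepts a list
pulse_min = min(average_pulse)            # the built-in min() also accepts a list
calorie_mean = np.mean(calorie_burnage)   # sum of the values divided by their count

print(pulse_max, pulse_min, calorie_mean)
```

The highest average pulse is 125, the lowest is 80, and the mean calorie burnage is 2850 / 10 = 285.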

Data Science - Data Preparation

Before analyzing data, a Data Scientist must extract the data, and make it clean
and valuable.

Extract and Read Data with Pandas


Before data can be analyzed, it must be imported/extracted.

In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file with the health data:

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)
Example Explained

• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are to be found
in the first row (note that 0 means the first row in Python)
• sep="," means that "," is used as the separator between the values. This
is because we are using the file type .csv (comma separated values)

Tip: If you have a large CSV file, you can use the head() function to show only
the top 5 rows:

Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data.head())
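
The examples above need a file named data.csv on disk. To try read_csv() without that file, one option is to read the same kind of text from memory. The sketch below assumes a comma-separated layout matching the sports watch table and uses io.StringIO as a stand-in for the file; csv_text holds only the first three rows.

```python
import io
import pandas as pd

# Stand-in for data.csv: the label row plus the first three rows of the table.
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work
30,80,120,240,10
30,85,120,250,10
45,90,130,260,8
"""

# header=0: the first line holds the variable names; sep=",": comma-separated values.
health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(health_data.head())
print(health_data.shape)
```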

Data Cleaning
Look at the imported data. As you can see, the data are "dirty", with wrong or
unregistered values:

• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space separator
• One observation of max pulse is denoted as "AF", which does not make
sense

So, we must clean the data in order to perform the analysis.
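
One possible way to locate such values is to try converting every column to numbers and see which rows fail. The sketch below is an assumption about how this could be done, not the method used in this chapter, and its rows are made up to imitate the problems listed above.

```python
import io
import pandas as pd

# Hypothetical rows imitating the problems listed above (not the real data.csv).
csv_text = """Duration,Average_Pulse,Max_Pulse
30,80,120
45,,130
45,9 000,AF
60,105,140
"""

health_data = pd.read_csv(io.StringIO(csv_text))

# Columns containing text such as "9 000" or "AF" are not read as numeric columns.
print(health_data.dtypes)

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
as_numbers = health_data.apply(pd.to_numeric, errors="coerce")

# A row is flagged if any of its fields is blank or non-numeric.
bad_rows = as_numbers.isna().any(axis=1)
print(health_data[bad_rows])
```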


Remove Blank Rows
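
One common way to remove rows with blank fields is the pandas dropna() function. The sketch below assumes that approach and uses made-up rows rather than the real data.csv; the name cleaned is chosen for this illustration.

```python
import io
import pandas as pd

# Hypothetical rows with blank fields (not the real data.csv).
csv_text = """Duration,Average_Pulse,Max_Pulse
30,80,120
45,,130
60,105,
75,120,150
"""

# Blank fields are read as NaN ("not a number").
health_data = pd.read_csv(io.StringIO(csv_text))

# axis=0 drops rows (not columns) that contain at least one NaN.
cleaned = health_data.dropna(axis=0)

print(cleaned)
```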

The result is a data set without NaN rows:
