Introduction to Data Science
• Data Science Introduction
• Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.
• What is Data Science?
• Data Science is about data gathering, analysis, and decision-making.
• Data Science is about finding patterns in data through analysis and making predictions about the future.
• By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns or hidden information in the data)
• Where is Data Science Needed?
• Data Science is used in many industries in the world today, e.g.
banking, consultancy, healthcare, and manufacturing.
• Data Science can be applied in nearly every part of a
business where data is available.
• How Does a Data Scientist Work?
• A Data Scientist requires expertise in several fields:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• A Data Scientist must find patterns within the data.
• Before the patterns can be found, the data must be organized in a standard format.
• Here is how a Data Scientist works:
• Ask the right questions - To understand the business problem.
• Explore and collect data - From databases, web logs, customer feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m, yet the number 140 is larger than the number 1.8, so values must be brought to a common scale before they are compared).
• Analyze data, find patterns and make future predictions.
• Represent the result - Present the result with useful insights in a way
the "company" can understand.
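• The missing-value and normalization steps above can be sketched with Pandas. This is only an illustration: the heights are made-up values, and min-max scaling is one of several possible ways to normalize.

```python
import pandas as pd

# The example from the text: 140 cm versus 1.8 m.
print(140 < 1.8)        # comparing the raw numbers: False
print(140 / 100 < 1.8)  # comparing both values in metres: True

# Hypothetical heights, already converted to metres; None marks a missing value.
heights = pd.Series([1.40, 1.80, None, 1.65], name="height_m")

# Replace the missing value with the average of the observed values.
filled = heights.fillna(heights.mean())

# Min-max scaling: the smallest value becomes 0 and the largest becomes 1.
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(scaled.tolist())
```

• Filling with the average keeps the column mean unchanged, which is why it is a common default; other replacement values may suit other data better.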
• What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it
interpretable and easy to work with.
• Data can be categorized into two groups:
• Structured data
• Unstructured data
• Unstructured Data
• Unstructured data is not organized.
• We must organize the data for analysis purposes.
• Structured Data
• Structured data is organized and easier to work with.
• How to Structure Data?
• We can use an array or a database table to structure or
present data.
• Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
• The following example shows how to create an array in Python (here, a built-in list):
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
• Database Table
• A database table is a table with structured data.
• The following table shows a database table with health data
extracted from a sports watch:
• This dataset contains information about a typical training session, such as duration, average pulse, and calorie burnage.
• Database Table Structure
• A database table consists of column(s) and row(s):
• A row is a horizontal representation of data.
• A column is a vertical representation of data.
• Variables
• A variable is defined as something that can be measured or
counted.
• Examples can be characters, numbers or time.
• In the example table, we can observe that each column represents a variable.
• There are 6 columns, meaning that there are 6 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage,
Hours_Work, Hours_Sleep).
• There are 11 rows, meaning that each variable has 10 observations.
• But if there are 11 rows, how come there are only 10 observations?
• It is because the first row holds the labels, that is, the names of the variables.
• Data Science & Python
• Python
• Python is a programming language widely used by Data
Scientists.
• Python has built-in mathematical functions and libraries, making it easier to solve mathematical problems and to perform data analysis.
• Python Libraries
• Python has libraries with large collections of mathematical
functions and analytical tools.
• Python libraries:
• Pandas - This library is used for structured data operations, such as importing CSV files, creating data frames, and preparing data.
• NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
• Matplotlib - This library is used for visualization of data.
• SciPy - This library has linear algebra modules.
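• As a small illustration of the N-dimensional array object and the linear algebra support mentioned for NumPy, the sketch below uses made-up numbers; the variable names are chosen only for this example.

```python
import numpy as np

# A 2-dimensional array (a 2 x 2 matrix) and a 1-dimensional array (a vector).
matrix = np.array([[2.0, 0.0],
                   [0.0, 4.0]])
vector = np.array([1.0, 2.0])

# Matrix-vector product.
product = matrix @ vector

# Linear algebra: solve matrix * x = product, which recovers the vector.
solution = np.linalg.solve(matrix, product)

print(matrix.ndim, matrix.shape)
print(product)
print(solution)
```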
• Data Science - Python DataFrame
• Create a DataFrame with Pandas
• A data frame is a structured representation of data.
• Let's define a data frame with 3 columns and 5 rows with
fictional numbers:
• Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
• Example Explained
• Import the Pandas library as pd
• Define data with columns and rows in a variable named d
• Create a data frame using the function pd.DataFrame()
• The data frame contains 3 columns and 5 rows
• Print the data frame output with the print() function
• We write pd. in front of DataFrame() to let Python know that we want to call the DataFrame() function from the Pandas library.
• Interpreting the Output
• Now, we can use Python to count the columns and rows.
• We can use df.shape[1] to find the number of columns:
• Example
• Count the number of columns:
count_column = df.shape[1]
print(count_column)
• We can use df.shape[0] to find the number of rows:
• Example
• Count the number of rows:
count_row = df.shape[0]
print(count_row)
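• Because df.shape is a tuple of (rows, columns), both counts can also be read in one step. A minimal sketch, reusing the fictional data frame defined earlier:

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# shape is (number of rows, number of columns); unpack both at once.
count_row, count_column = df.shape
print(count_row, count_column)  # 5 3

# Equivalent counts using len().
print(len(df), len(df.columns))  # 5 3
```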
• Data Science Functions
• Three functions commonly used in Data Science are max(), min(), and mean().
• The Sports Watch Data Set
• The data set consists of 6 variables, each with 10 observations:
• Duration - How long did the training session last, in minutes?
• Average_Pulse - What was the average pulse during the training session? This is measured in beats per minute.
• Max_Pulse - What was the maximum pulse during the training session?
• Calorie_Burnage - How many calories were burned during the training session?
• Hours_Work - How many hours did we work at our job before the training session?
• Hours_Sleep - How many hours did we sleep the night before the training session?
• We use an underscore (_) to separate the words in each variable name because Python names cannot contain spaces.
• The max() function
• The Python max() function is used to find the highest value in an
array.
• Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
• The min() function
• The Python min() function is used to find the lowest value in an
array.
• Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
• The mean() function
• The NumPy mean() function is used to find the average value of
an array.
• Example
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
• Note: We write np. in front of mean to let Python know that we want to call the mean function from the NumPy library.
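• The three functions also work on a list stored in a variable, which is closer to how data is usually handled. A short sketch using the average pulse values from the examples above (the variable names are chosen for this illustration):

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

highest = max(average_pulse)         # built-in max() accepts a list
lowest = min(average_pulse)          # built-in min() accepts a list
mean_value = np.mean(average_pulse)  # NumPy computes the average

print(highest)     # 125
print(lowest)      # 80
print(mean_value)  # 102.5
```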
• Data Science - Data Preparation
• Before analyzing data, a Data Scientist must extract the data, and
make it clean and valuable.
• Extract and Read Data With Pandas
• Before data can be analyzed, it must be imported/extracted.
• In the example below, we show you how to import data using
Pandas in Python.
• We use the read_csv() function to import a CSV file with the
health data:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
• Example Explained:
• Import the Pandas library
• Name the data frame health_data.
• header=0 means that the headers for the variable names are to be found in the first row (Python starts counting at 0, so 0 is the first row).
• sep="," means that "," is used as the separator between the values. This is because we are using the file type .csv (comma-separated values).
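• To try read_csv() without a file on disk, the CSV text can be supplied from memory. The sketch below is an assumption-laden stand-in: the three columns and their values are hypothetical and do not reproduce the real data.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: one header row and two observations.
csv_text = "Duration,Average_Pulse,Max_Pulse\n30,80,120\n45,85,125\n"

# io.StringIO lets read_csv() treat the string like an open file.
health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(health_data)
print(list(health_data.columns))
```

• The first row ends up as the column names rather than as an observation, which is the effect of header=0.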
• Tip: If you have a large CSV file, you can use the head() function to show only the top 5 rows:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.head())
• Data Cleaning
• Look at the imported data. As you can see, the data is "dirty", with wrongly registered or unregistered values:
• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space
separator
• One observation of max pulse is denoted as "AF", which does
not make sense
• So, we must clean the data in order to perform the analysis.
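• One way to locate such values programmatically is to try converting a column to numbers and see what fails. This is a sketch on a hypothetical column, not part of the original workflow; pd.to_numeric() with errors="coerce" turns anything it cannot parse into NaN.

```python
import pandas as pd

# Hypothetical column containing the kinds of dirty values described above.
values = pd.Series(["120", "AF", "9 000", "125"])

# Entries that cannot be parsed as numbers become NaN.
as_numbers = pd.to_numeric(values, errors="coerce")

# Keep the original text of the entries that failed to parse.
dirty = values[as_numbers.isna()]
print(dirty.tolist())  # ['AF', '9 000']
```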
• Remove Blank Rows:
• We see that the non-numeric values (9 000 and AF) are in the same rows as the missing values.
• Solution: We can remove the rows with missing observations
to fix this problem.
• When we load a data set using Pandas, all blank cells are
automatically converted into "NaN" values.
• So, removing the NaN cells gives us a clean data set that can
be analyzed.
• We can use the dropna() function to remove the NaNs. axis=0
means that we want to remove all rows that have a NaN value:
• Example:
health_data.dropna(axis=0, inplace=True)
print(health_data)
• The result is a data set without NaN rows:
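• A self-contained sketch of the same idea on a small, hypothetical data frame. Here dropna() is used without inplace=True, so it returns a new data frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row with the non-numeric "9 000" also has a blank field.
health_data = pd.DataFrame({
    "Duration": [30.0, 45.0, np.nan, 60.0],
    "Average_Pulse": ["80", "85", "9 000", "90"],
})

# axis=0 drops every row that contains at least one NaN.
cleaned = health_data.dropna(axis=0)

print(len(health_data), len(cleaned))  # 4 3
print(cleaned)
```

• Dropping rows only removes the non-numeric values when they happen to share a row with a blank field, as they do in this data set.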
• Data Categories:
• To analyze data, we also need to know the types of data we are
dealing with.
• Data can be split into two main categories:
• Quantitative Data - Can be expressed as a number or can be
quantified. Can be divided into two sub-categories:
• Discrete data: Numbers are counted as "whole", e.g.
number of students in a class, number of goals in a soccer
game
• Continuous data: Numbers can take any value within a range, with arbitrary precision, e.g. weight of a person, shoe size, temperature
• Qualitative Data - Cannot be expressed as a number and cannot
be quantified. Can be divided into two sub-categories:
• Nominal data: Categories with no natural order, e.g. gender, hair color, ethnicity
• Ordinal data: Categories with a meaningful order, e.g. school grades (A, B, C), economic status (low, middle, high)
• By knowing the type of your data, you will be able to know what technique to use when analyzing it.
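• The difference between nominal and ordinal data can be made explicit in Pandas with categorical values. The sketch below uses made-up values and assumes that grade A ranks highest.

```python
import pandas as pd

# Nominal: categories without any order.
hair_color = pd.Categorical(["brown", "black", "brown"])

# Ordinal: categories listed from lowest to highest, marked as ordered.
grades = pd.Categorical(["B", "A", "C"], categories=["C", "B", "A"], ordered=True)

print(hair_color.ordered)          # False
print(grades.ordered)              # True
print(grades.min(), grades.max())  # C A
```

• Asking for the minimum or maximum only makes sense for ordered categories, which is one example of the data type deciding which techniques apply.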
• Data Types
• We can use the info() function to list the data types within our
data set:
• Example:
health_data.info()
• We see that this data set has two different types of data:
• float64
• object
• We cannot perform calculations on columns of type object here.
• We must convert the type object to float64 (float64 is a floating-point number, that is, a number with decimals).
• We can use the astype() function to convert the data into float64.
• The following example converts "Average_Pulse" and
"Max_Pulse" into data type float64 (the other variables are
already of data type float64):
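• A minimal sketch of such a conversion, assuming the columns hold clean numeric text after the cleaning step. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned data in which the pulse columns were read as text.
health_data = pd.DataFrame({
    "Average_Pulse": ["80", "85", "90"],
    "Max_Pulse": ["120", "125", "130"],
})

# Convert both columns to float64.
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)

print(health_data.dtypes)
```

• astype(float) raises an error if a value such as "AF" is still present, so the conversion belongs after the cleaning step.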
• Analyze the Data
• When we have cleaned the data set, we can start analyzing the
data.
• We can use the Pandas describe() function to summarize the data:
• Example:
print(health_data.describe())
• Result:
• Count - Counts the number of observations
• Mean - The average value
• Std - Standard deviation (explained in the statistics chapter)
• Min - The lowest value
• 25%, 50% and 75% are percentiles (explained in the statistics
chapter)
• Max - The highest value
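• To see these rows in practice, describe() can be run on a small data frame. The sketch below pairs the pulse and calorie values from the earlier examples purely for illustration; it is not the full sports watch data set.

```python
import pandas as pd

health_data = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

summary = health_data.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "Average_Pulse"])   # 102.5
print(summary.loc["max", "Calorie_Burnage"])  # 330.0
```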