Ch01 - Introduction To Data Science

Data Science is an interdisciplinary field that combines statistics, data analysis, and machine learning to extract insights from data, enabling better decision-making and predictive analysis across various industries. A Data Scientist must possess skills in programming, statistics, and data management, and follow a structured process of data collection, cleaning, and analysis using tools like Python and its libraries. The document also discusses the importance of data structuring, types of data, and essential functions for data analysis.

Uploaded by

Yogesh Kamble

Introduction to Data Science
• Data Science Introduction
• Data Science is a combination of multiple disciplines that uses
statistics, data analysis, and machine learning to analyze data
and to extract knowledge and insights from it.
• What is Data Science?
• Data Science is about data gathering, analysis and
decision-making.
• Data Science is about finding patterns in data through
analysis and making future predictions.
• By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or maybe hidden
information, in the data)
• Where is Data Science Needed?
• Data Science is used in many industries in the world today, e.g.
banking, consultancy, healthcare, and manufacturing.
• Data Science can be applied in nearly every part of a
business where data is available.
• How Does a Data Scientist Work?
• A Data Scientist requires expertise in several areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• A Data Scientist must find patterns within the data.

• Before he/she can find the patterns, he/she must organize the
data in a standard format.
• Here is how a Data Scientist works:
• Ask the right questions - To understand the business problem.
• Explore and collect data - From databases, web logs, customer
feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values to a comparable range (e.g. 140 cm
is smaller than 1.8 m, yet the number 140 is larger than 1.8, so
scaling is important).
• Analyze data, find patterns and make future predictions.
• Represent the result - Present the result with useful insights in a way
the "company" can understand.
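The "find and replace missing values" and "normalize data" steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the height readings are made up, and the to_metres helper is a hypothetical name introduced for this sketch.

```python
from statistics import mean

# Hypothetical height readings in mixed units; None marks a missing value.
readings = [(140, "cm"), (1.8, "m"), (None, "m"), (165, "cm")]


def to_metres(value, unit):
    """Convert a height to metres so that all values share one scale."""
    return value / 100 if unit == "cm" else value


# Normalize: bring every known value to the same unit.
heights_m = [None if v is None else to_metres(v, u) for v, u in readings]

# Replace missing values with the average of the known values.
known = [h for h in heights_m if h is not None]
average = mean(known)
cleaned = [average if h is None else h for h in heights_m]

print(cleaned)
```

After scaling, 140 cm (1.4 m) compares correctly as smaller than 1.8 m, and the gap is filled with the average of the three known heights.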
• What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it
interpretable and easy to work with.
• Data can be categorized into two groups:
• Structured data
• Unstructured data
• Unstructured Data
• Unstructured data is not organized.
• We must organize the data for analysis purposes.
• Structured Data
• Structured data is organized and easier to work with.
• How to Structure Data?
• We can use an array or a database table to structure or
present data.
• Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
• The following example shows how to create an array (a list) in
Python:
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
• Database Table
• A database table is a table with structured data.
• The following table shows a database table with health data
extracted from a sports watch:
• This dataset contains information about a typical training
session, such as duration, average pulse, calories burned, etc.
• Database Table Structure
• A database table consists of column(s) and row(s):
• A row is a horizontal representation of data.
• A column is a vertical representation of data.
• Variables
• A variable is defined as something that can be measured or
counted.
• Examples can be characters, numbers or time.
• In the example, we can observe that each column
represents a variable.
• There are 6 columns, meaning that there are 6 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage,
Hours_Work, Hours_Sleep).
• There are 11 rows, meaning that each variable has 10
observations.
• But if there are 11 rows, how come there are only 10
observations?
• It is because the first row is the label, meaning that it is
the name of the variable.
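The difference between rows and observations can be checked with Python's standard csv module. The sketch below uses a small made-up table (three columns, one label row, three data rows), not the actual sports watch data.

```python
import csv
import io

# Made-up stand-in for the table: one label row plus three observations.
raw = io.StringIO(
    "Duration,Average_Pulse,Max_Pulse\n"
    "30,80,120\n"
    "45,85,125\n"
    "60,90,130\n"
)

rows = list(csv.reader(raw))
labels = rows[0]          # the first row holds the variable names
observations = rows[1:]   # the remaining rows hold the data

print(len(rows))          # total rows, label row included
print(len(labels))        # number of variables (columns)
print(len(observations))  # number of observations per variable
```

The number of observations is always one less than the number of rows, because the label row is not data.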
• Data Science & Python
• Python
• Python is a programming language widely used by Data
Scientists.
• Python has built-in mathematical libraries and functions,
making it easier to solve mathematical problems and to
perform data analysis.
• Python Libraries
• Python has libraries with large collections of mathematical
functions and analytical tools.
• Python libraries:
• Pandas - This library is used for structured data operations,
such as importing CSV files, creating data frames, and preparing
data.
• NumPy - This is a mathematical library. It has a powerful
N-dimensional array object, linear algebra, Fourier transforms,
etc.
• Matplotlib - This library is used for visualization of data.
• SciPy - This library has linear algebra modules.
• Data Science - Python DataFrame
• Create a DataFrame with Pandas
• A data frame is a structured representation of data.
• Let's define a data frame with 3 columns and 5 rows with
fictional numbers:
• Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7],
     'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
• Example Explained
• Import the Pandas library as pd
• Define data with columns and rows in a variable named d
• Create a data frame using the function pd.DataFrame()
• The data frame contains 3 columns and 5 rows
• Print the data frame output with the print() function
• We write pd. in front of DataFrame() to let Python know that we
want to call the DataFrame() function from the Pandas
library.
• Interpreting the Output
• Now, we can use Python to count the columns and rows.
• We can use df.shape[1] to find the number of columns:
• Example
• Count the number of columns:
count_column = df.shape[1]
print(count_column)
• We can use df.shape[0] to find the number of rows:
• Example
• Count the number of rows:
count_row = df.shape[0]
print(count_row)
• Data Science Functions
• Three commonly used functions when working with Data
Science: max(), min(), and mean().
• The Sports Watch Data Set
• The data set above consists of 6 variables, each with 10
observations:
• Duration - How long did the training session last, in minutes?
• Average_Pulse - What was the average pulse during the training
session? This is measured in beats per minute.
• Max_Pulse - What was the maximum pulse during the training session?
• Calorie_Burnage - How many calories were burned during the training
session?
• Hours_Work - How many hours did we work at our job before the
training session?
• Hours_Sleep - How much did we sleep the night before the training
session?
• We use an underscore (_) to separate the words in a name because
Python variable names cannot contain spaces.
• The max() function
• The Python max() function is used to find the highest value in an
array.
• Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
• The min() function
• The Python min() function is used to find the lowest value in an
array.
• Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
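The examples pass the values to max() and min() one by one. Both built-in functions also accept a single list, which is closer to how an array of observations is usually stored. A minimal sketch, where the variable names are chosen only for illustration:

```python
# The same pulse values, stored in a list.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

highest = max(average_pulse)  # largest value in the list
lowest = min(average_pulse)   # smallest value in the list

print(highest, lowest)
```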
• The mean() function
• The NumPy mean() function is used to find the average value of
an array.
• Example
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)

• Note: We write np. in front of mean to let Python know that we
want to use the mean function from the NumPy library.
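To make clear what the mean is, the sketch below computes it by hand as the sum of the values divided by their count, and compares it with mean() from Python's standard statistics module (used here instead of NumPy only so the snippet runs without extra libraries).

```python
from statistics import mean

calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# The average is the sum of all values divided by how many there are.
by_hand = sum(calorie_burnage) / len(calorie_burnage)

# The standard library gives the same value.
with_stdlib = mean(calorie_burnage)

print(by_hand, with_stdlib)
```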
• Data Science - Data Preparation
• Before analyzing data, a Data Scientist must extract the data, and
make it clean and valuable.
• Extract and Read Data With Pandas
• Before data can be analyzed, it must be imported/extracted.
• In the example below, we show you how to import data using
Pandas in Python.
• We use the read_csv() function to import a CSV file with the
health data:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
• Example Explained:
• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are
to be found in the first row (note that 0 means the first row in
Python)
• sep="," means that "," is used as the separator between the
values. This is because we are using the file type .csv (comma
separated values)

• Tip: If you have a large CSV file, you can use the head() function to show only the top 5 rows:
• Example:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.head())
• Data Cleaning
• Look at the imported data. As you can see, the data are "dirty"
with wrong or unregistered values:
• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space
separator
• One observation of max pulse is denoted as "AF", which does
not make sense
• So, we must clean the data in order to perform the analysis.
• Remove Blank Rows:
• We see that the non-numeric values (9 000 and AF) are in the
same rows with missing values.
• Solution: We can remove the rows with missing observations
to fix this problem.
• When we load a data set using Pandas, all blank cells are
automatically converted into "NaN" values.
• So, removing the NaN cells gives us a clean data set that can
be analyzed.
• We can use the dropna() function to remove the NaNs. axis=0
means that we want to remove all rows that have a NaN value:
• Example:
health_data.dropna(axis=0,inplace=True)
print(health_data)
• The result is a data set without NaN rows:
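To see the blank-to-NaN conversion and dropna() at work on something small, here is a sketch that reads made-up CSV text from memory instead of data.csv. The values and the names csv_text, sample and cleaned are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up CSV text standing in for data.csv; the blank cells are deliberate.
csv_text = (
    "Duration,Average_Pulse,Max_Pulse\n"
    "30,80,120\n"
    "45,,125\n"
    "60,90,\n"
    "45,95,130\n"
)

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.isna().sum())  # blank cells were read as NaN

# Without inplace=True, dropna() returns a new frame and leaves sample as is.
cleaned = sample.dropna(axis=0)
print(cleaned)
```

Leaving out inplace=True keeps the original frame available, which can help when comparing the data before and after cleaning.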
• Data Categories:
• To analyze data, we also need to know the types of data we are
dealing with.
• Data can be split into two main categories:
• Quantitative Data - Can be expressed as a number or can be
quantified. Can be divided into two sub-categories:
• Discrete data: Numbers are counted as "whole", e.g.
number of students in a class, number of goals in a soccer
game
• Continuous data: Numbers can be of infinite precision.
e.g. weight of a person, shoe size, temperature
• Qualitative Data - Cannot be expressed as a number and cannot
be quantified. Can be divided into two sub-categories:
• Nominal data: Example: gender, hair color, ethnicity
• Ordinal data: Example: school grades (A, B, C), economic
status (low, middle, high)
• By knowing the type of your data, you will be able to know what
technique to use when analyzing them.
• Data Types
• We can use the info() function to list the data types within our
data set:
• Example:
health_data.info()
• We see that this data set has two different types of data:
• float64
• object
• We cannot use objects to calculate and perform analysis here.
• We must convert the type object to float64 (float64 is a 64-bit
floating-point number, i.e. a number with a decimal).
• We can use the astype() function to convert the data into float64.
• The following example converts "Average_Pulse" and
"Max_Pulse" into data type float64 (the other variables are
already of data type float64):
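A minimal sketch of such a conversion is given here. It does not use the real health data: the small frame health_sample is made up, and its two columns hold numbers stored as text.

```python
import pandas as pd

# Made-up, already-cleaned values stored as text rather than numbers.
health_sample = pd.DataFrame({
    "Average_Pulse": ["80", "85", "90"],
    "Max_Pulse": ["120", "125", "130"],
})

# Convert both columns to floating-point numbers.
health_sample["Average_Pulse"] = health_sample["Average_Pulse"].astype(float)
health_sample["Max_Pulse"] = health_sample["Max_Pulse"].astype(float)

print(health_sample.dtypes)
```

The conversion only works once the data is clean: a value such as "AF" or "9 000" cannot be read as a number, so astype(float) would raise an error.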
• Analyze the Data
• When we have cleaned the data set, we can start analyzing the
data.
• We can use the Pandas describe() function to summarize
data:
• Example: print(health_data.describe())
• Result:
• Count - Counts the number of observations
• Mean - The average value
• Std - Standard deviation (explained in the statistics chapter)
• Min - The lowest value
• 25%, 50% and 75% are percentiles (explained in the statistics
chapter)
• Max - The highest value
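To show what each of these entries measures, the sketch below computes them for a single variable with Python's standard library. It reuses the Calorie_burnage values from the mean() example. The assumption is that these choices (sample standard deviation, quartiles with linear interpolation) correspond to what describe() reports by default.

```python
from statistics import mean, quantiles, stdev

calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

summary = {
    "count": len(calorie_burnage),
    "mean": mean(calorie_burnage),
    "std": stdev(calorie_burnage),  # sample standard deviation
    "min": min(calorie_burnage),
    "max": max(calorie_burnage),
}

# Quartiles, interpolating linearly between neighbouring values.
summary["25%"], summary["50%"], summary["75%"] = quantiles(
    calorie_burnage, n=4, method="inclusive"
)

for name, value in summary.items():
    print(name, value)
```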