Introduction to Data Analytics
ITC 5201
Instructor: Parisa Pouladzadeh
Email: parisa.pouladzadeh@humber.ca
Pandas

We will use pandas to:

• Read in data from Excel.

• Manipulate data in a spreadsheet.


Reading in Data From Excel
I have the following data saved in the file “Grades_Short.csv”:

Let’s see how we read this data into pandas:

• Read the data into a variable called df_grades.
• Use the built-in read_csv method.
• Pass the path to the file.
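
A minimal sketch of this step follows. The contents of Grades_Short.csv are not reproduced in these notes, so the rows and the column names (Name, Mini Exam 1, Final) are made-up stand-ins, and the text is read from an in-memory buffer rather than from a file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of Grades_Short.csv.
csv_text = """Name,Mini Exam 1,Final
Joe,7.5,27
Ann,9.0,33
Raj,6.0,18
"""

# read_csv normally takes a path such as "Grades_Short.csv";
# a text buffer is used here so the example runs without the file.
df_grades = pd.read_csv(io.StringIO(csv_text))
print(df_grades)
```

With the real file on disk, the call would simply take the path string instead of the buffer.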


The head() Method
Using the head() method

• If the data is really large you don’t want to print out the entire dataframe to your
output.

• The head(n) method outputs the first n rows of the data frame. If n is not supplied,
the default is the first 5 rows.

• I like to run the head() method after I read in the dataframe to check that
everything got read in correctly.

• There is also a tail(n) method that returns the last n rows of the dataframe.
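
As a small illustration of head() and tail(), assuming a made-up dataframe of eight rows (the names and scores are hypothetical):

```python
import pandas as pd

# Hypothetical data: eight students with made-up final scores.
df_grades = pd.DataFrame({"Name": list("ABCDEFGH"), "Final": range(20, 28)})

first_five = df_grades.head()    # n not supplied, so the first 5 rows
first_two = df_grades.head(2)    # first 2 rows
last_three = df_grades.tail(3)   # last 3 rows

print(first_five)
print(last_three)
```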
Basic Features

• Think of this as a list.
• object = string
• float64 = decimal
• int64 = integer

Basic Features

• column names
• row names = index
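
These features can be inspected directly on a dataframe. The sketch below assumes a small made-up dataframe; depending on the pandas version, a text column such as Name may be reported as object or as a dedicated string type.

```python
import pandas as pd

# Hypothetical data standing in for the grades file.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann"],
    "Mini Exam 1": [7.5, 9.0],
    "Final": [27, 33],
})

print(df_grades.dtypes)    # data type of each column
print(df_grades.columns)   # column names
print(df_grades.index)     # row names (the index)
```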


Selecting a Single Column

• Between square brackets, the column must be given as a string.
• Outputs the column as a series.
• A series is a one-dimensional dataframe; more on this in the slicing section.

Selecting a Single Column

• Exactly equivalent way to get the Name column.
• +: don’t have to type brackets or quotes.
• -: won’t generalize to selecting multiple columns, won’t work if column names have spaces, can’t create new columns this way.
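
Both ways of selecting one column can be sketched as follows. The slide images are missing, so it is an assumption that the second form is attribute (dot) access; the data is made up.

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({"Name": ["Joe", "Ann"], "Final": [27, 33]})

name_brackets = df_grades["Name"]   # column name given as a string
name_dot = df_grades.Name           # assumed equivalent: attribute access

print(type(name_brackets))          # a pandas Series
```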
Selecting Multiple Columns

• List of strings, which correspond to column names.
• You can select as many columns as you want.
• Columns don’t have to be contiguous.
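
A short sketch of selecting several columns with a list of strings, using made-up data:

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann"],
    "Mini Exam 1": [7.5, 9.0],
    "Final": [27, 33],
})

# The inner brackets build the list of column names;
# Name and Final are not next to each other in the dataframe.
subset = df_grades[["Name", "Final"]]
print(subset)
```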
Storing Result

Why store a slice?

• We might want/have to do our analysis in steps.
• Less error prone
• More readable

The variable name stores a series.
Slicing a Series

Slice/index through the index, which is usually numbers.

• Picking out a single element
• Contiguous slice (the end point is non-inclusive)
• Arbitrary slice
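
The three kinds of series slicing can be sketched as below, assuming a series with the default numeric index. The names are hypothetical.

```python
import pandas as pd

# Hypothetical series with the default index 0, 1, 2, 3, 4.
name = pd.Series(["Joe", "Ann", "Raj", "Mia", "Lee"])

single = name[0]           # single element
chunk = name[1:3]          # contiguous slice: entries 1 and 2, the end point 3 is excluded
picked = name[[0, 2, 4]]   # arbitrary slice: a list of index values

print(single)
print(chunk)
print(picked)
```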
Slicing a Data Frame

• There are a few ways to slice a data frame; we will use the .loc method.

• Access elements through the index labels and column names.

• We will see how to change both of these labels later on.


• Pick a single value out: index label (number), then column name (string).
• Pick out an entire row: “pick out all columns”. first_row is a series.
• Pick out a contiguous chunk: endpoints are inclusive!
• Pick out an arbitrary chunk.
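
The four .loc patterns can be sketched as follows. The dataframe is made up; only the variable name first_row comes from the slides.

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj", "Mia"],
    "Mini Exam 1": [7.5, 9.0, 6.0, 8.5],
    "Final": [27, 33, 18, 30],
})

value = df_grades.loc[0, "Name"]                      # single value
first_row = df_grades.loc[0, :]                       # entire row, returned as a series
chunk = df_grades.loc[1:2, "Name":"Mini Exam 1"]      # contiguous chunk, both endpoints included
picked = df_grades.loc[[0, 3], ["Name", "Final"]]     # arbitrary rows and columns

print(chunk)
print(picked)
```

Note the contrast with plain series slicing: with .loc, the row label 2 and the column Mini Exam 1 are both part of the result.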


Built-in Functions

How do I compute the average score on the final? Use the built-in mean() method.

How do I compute the highest Mini Exam 1 score?
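
A minimal sketch of both questions, using made-up scores. The mean() method is named on the slide; using max() for the highest score is an assumption, since that slide's code is not preserved.

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj", "Mia", "Lee"],
    "Mini Exam 1": [7.5, 9.0, 6.0, 8.5, 10.0],
    "Final": [27, 33, 18, 30, 36],
})

final_mean = df_grades["Final"].mean()          # average score on the final
best_mini = df_grades["Mini Exam 1"].max()      # assumed: highest Mini Exam 1 score

print(final_mean, best_mini)
```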


Creating New Columns

We can also create a column as a function of other columns. The Final was worth 36 points, so let’s create a column for each student’s percentage.
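
One way this could look, assuming made-up scores and a hypothetical name for the new column (Final Percentage):

```python
import pandas as pd

# Hypothetical data; the Final is out of 36 points.
df_grades = pd.DataFrame({"Name": ["Joe", "Raj", "Lee"], "Final": [27, 18, 36]})

# The arithmetic is applied to every row of the column at once.
df_grades["Final Percentage"] = df_grades["Final"] / 36 * 100

print(df_grades)
```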
Deleting Columns
The Drop Method

• First argument: list of column or index labels to delete.
• inplace = True – change df_grades.
• inplace = False – return a dataframe with the specified columns deleted; do not change df_grades.
• axis = 1 – delete the specified columns.
• axis = 0 – delete the specified rows.
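
The options above can be sketched with made-up data as follows. A copy is used for the inplace = True case so the original stays available for comparison.

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj"],
    "Mini Exam 1": [7.5, 9.0, 6.0],
    "Final": [27, 33, 18],
})

# inplace=False (the default): df_grades is unchanged, a new dataframe is returned.
no_final = df_grades.drop(["Final"], axis=1)

# axis=0 deletes rows, identified by their index labels.
no_first_row = df_grades.drop([0], axis=0)

# inplace=True: the dataframe itself is modified and nothing is returned.
df_copy = df_grades.copy()
df_copy.drop(["Mini Exam 1"], axis=1, inplace=True)
```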
Changing Column Names

• “Curly” brackets around the new column name assignments.
• Each assignment has the form “old_column_name”: “new_column_name”.
• inplace = False (default) returns a new dataframe (df_grades is unaltered) with updated column names.
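
The surviving slide text does not name the method. The sketch below assumes it is the dataframe's rename method with its columns argument, which fits the curly-bracket mapping and the inplace option; the data and the new column name are made up.

```python
import pandas as pd

# Hypothetical data.
df_grades = pd.DataFrame({"Name": ["Joe", "Ann"], "Mini Exam 1": [7.5, 9.0]})

# Mapping of "old_column_name": "new_column_name" inside curly brackets.
renamed = df_grades.rename(columns={"Mini Exam 1": "Mini_Exam_1"})

print(renamed.columns)
print(df_grades.columns)   # unchanged, because inplace defaults to False
```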
Concatenating DataFrames - Stacked

Let’s say you had separate csv files with the info for the students who got an A and everyone else, but you want to analyze everything together.

• axis = 0 (default) – combine the two dataframes by stacking them on top of each other. Set axis = 1 to stack side by side.
• The columns have to match (pandas fills any column that is missing from one dataframe with NaN).
• What is going to happen to the index?
• Notice the ignore_index input!
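
A sketch of stacking two dataframes with pd.concat, with and without ignore_index. The two small dataframes are made-up stand-ins for the two csv files.

```python
import pandas as pd

# Hypothetical stand-ins for the two csv files.
df_a = pd.DataFrame({"Name": ["Ann", "Lee"], "Final": [33, 36]})
df_rest = pd.DataFrame({"Name": ["Joe", "Raj", "Mia"], "Final": [27, 18, 30]})

# Each dataframe keeps its own row numbers, so index labels repeat.
stacked = pd.concat([df_a, df_rest], axis=0)

# ignore_index=True renumbers the rows 0..n-1 in the combined dataframe.
renumbered = pd.concat([df_a, df_rest], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```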


Using the Index

• The index in this case is row numbers.
• What if I want to quickly see Joe’s row?
• I have to look up what row Joe is in.
• Instead, I can make the Name column the index.
• Column that will become the index (make sure this is unique).
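
A possible version of this step, assuming the slide used the set_index method (the slide image is missing) and using made-up scores:

```python
import pandas as pd

# Hypothetical data; the names are unique, as an index column should be.
df_grades = pd.DataFrame({"Name": ["Ann", "Joe", "Raj"], "Final": [33, 27, 18]})

# Make the Name column the index (returns a new dataframe by default).
by_name = df_grades.set_index("Name")

# Joe's row can now be looked up by label instead of by row number.
joe_row = by_name.loc["Joe"]
print(joe_row)
```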
Missing Data

We can replace the missing data with a true NaN (right now everything is just a string).
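
The placeholder text used in the original file is not shown, so the sketch below assumes a hypothetical placeholder string, "missing", and made-up scores. It replaces the placeholder with NaN and then converts the column from strings to numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: every value is a string, including the placeholder.
df_raw = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj"],
    "Final": ["27", "missing", "18"],
})

# Swap the placeholder for a true NaN, then turn the strings into numbers.
df_clean = df_raw.replace("missing", np.nan)
df_clean["Final"] = pd.to_numeric(df_clean["Final"])

print(df_clean)
```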
Isnull() Method

• The isnull() method lets you check where the NaNs are:
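
A small sketch with made-up data. Summing the result per column is an extra step not mentioned on the slide; it counts the NaNs in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing scores.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj"],
    "Mini Exam 1": [7.5, np.nan, 6.0],
    "Final": [27.0, 33.0, np.nan],
})

nan_mask = df_grades.isnull()          # True wherever a value is NaN
nan_counts = df_grades.isnull().sum()  # number of NaNs per column

print(nan_mask)
print(nan_counts)
```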
Dropna() Method

How do I get rid of all rows with NaN?

• Setting axis = 1 would drop all columns with a NaN.
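
Both directions of dropna() can be sketched with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing scores.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj"],
    "Mini Exam 1": [7.5, np.nan, 6.0],
    "Final": [27.0, 33.0, np.nan],
})

rows_without_nan = df_grades.dropna()           # drop every row that contains a NaN
cols_without_nan = df_grades.dropna(axis=1)     # drop every column that contains a NaN

print(rows_without_nan)
print(cols_without_nan)
```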
Fillna() Method

Rather than getting rid of rows/columns, we fill the “holes” in a number of ways.
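
The slides do not list which fill strategies were shown. Two common ones, a constant and the column mean, are sketched here as assumptions with made-up scores.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing final score.
df_grades = pd.DataFrame({
    "Name": ["Joe", "Ann", "Raj"],
    "Final": [27.0, np.nan, 18.0],
})

# Fill the hole with a constant.
filled_zero = df_grades["Final"].fillna(0)

# Fill the hole with the mean of the values that are present.
filled_mean = df_grades["Final"].fillna(df_grades["Final"].mean())

print(filled_zero.tolist())
print(filled_mean.tolist())
```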
Introduction to Data Analytics
ITE 5201
Lecture 4 - Data Visualization
Instructor: Parisa Pouladzadeh
Email: parisa.pouladzadeh@humber.ca
Data visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends and patterns in data.

Copyright © 2018 Pearson Education, Inc. All Rights Reserved.


Types of Data Visualization
Data Storytelling
◦ For presentations to organizational decision makers

Data Showcasing
◦ For presentations to analysts, scientists, mathematicians, and engineers

Data Art
◦ For presentations to activists or to the general public


Data Storytelling

◦ Make it easy for the audience to get the point. Your data visualization
should be:
• Clutter-free
• Highly organized
◦ Audience:
• Nonanalysts
• Nontechnical business managers
◦ Product types:
• Static images
• Simple interactive dashboards




Data Showcasing

◦ Showcase lots of data so your audience members can think for themselves
◦ Your data visualization should be:
• Highly contextual
• Open ended
◦ Audience:
• Analysts
• Engineers, mathematicians, scientists
◦ Product types:
• Static images
• Interactive dashboards



Data Art
◦ Use your data visualization to make a statement
◦ Your data visualization should be:
• Attention getting
• Creative, controversial
◦ Intended audience:
• Idealists, artists, social activists
◦ Product types:
• Static images



Graphics of data storytelling



Choropleth map
• A choropleth map displays divided geographical areas or regions that are coloured in relation to a numeric variable.
• First read the instructions and colour legend/key to understand what the shading means.


Point map



Steps to choosing data graphics
• Make a list of the questions that your data visualization is meant to answer.
• What is your data visualization type (storytelling, showcasing, or art)?
• What data graphic types are preferable for that type of data visualization?
• Test out different types of data graphics to see which one is most meaningful.


Creating context with color
Color should be used:
◦ Strategically
◦ Sparingly
◦ Consistently

We use color to draw attention to a particular part of the visualization.


Charts
Line Charts
◦ Show the change in the value of an attribute with respect to an x-variable, which is often time.
◦ Can be used to visually compare the values of several attributes.


Charts
Bar Charts
◦ Represent data attribute values within a particular data category by using bars of different heights.
◦ Bar charts represent observation counts within categories.


Charts
Pie Chart
◦ The complete circle represents a whole set of categorical data, and the slices represent the proportions of observations in each category.


What about pie charts?

Commonly used to show parts of a whole

However…
➢ Hard to judge the relative size of pie slices – people are better at differentiating length
➢ Take up a lot of space to present little information
➢ Require labels and good color contrast to even be usable (often difficult)

Best use is when one value is overwhelmingly larger than the rest – no need to focus on actual values


Graphs are useful for?

Line graphs can also be used to compare changes over the same
period of time for more than one group.
Pie charts are best to use when you are trying to compare parts of
a whole. They do not show changes over time.
Bar graphs are used to compare things between different groups
or to track changes over time.



Data visualization libraries in
Python
Matplotlib

Seaborn



Time series
A set of observations, results, or other data obtained over a period of time, usually at regular intervals. Monthly sales figures, quarterly inventory data, and daily bank balances are all time series.
◦ A time series plot is a graph that you can use to evaluate patterns and behavior in data over time.


Time series components

Trends: This refers to the movement of a series to relatively higher or lower values over a long period of time.

◦ For example, when the time series analysis shows a pattern that is upward, we call it an uptrend;
◦ when the pattern is downward, we call it a downtrend;
◦ if there is no trend at all, we call it a horizontal or stationary trend.


Time series components

◦ Seasonality: This refers to a repeating pattern within a fixed time period.

◦ A trend happens for a period of time and then disappears. However, seasonality keeps happening within a fixed time period.

◦ For example, when it’s Christmas, you discover more candies and chocolates are sold, and this keeps happening every year.


Time series components

Irregularity: This is also called noise. Irregularity happens for a short duration and it’s non-repeating.

◦ A very good example is the case of the Ebola outbreak. During that period, there was a massive demand for hand sanitizers.


Time series components
Cyclic: This is when a series repeats upward and downward movements. It usually does not have a fixed pattern. It could happen in 6 months, then two years later, then 4 years, then 1 year later. These kinds of patterns are much harder to predict.
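
To make the components concrete, the sketch below builds a synthetic monthly series. It assumes an additive combination of a linear trend, a 12-month seasonal pattern, and random noise. The additive form, the numbers, and the variable names are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the noise is reproducible
months = np.arange(48)                  # four years of monthly observations

trend = 100 + 2.0 * months                          # steady upward movement
seasonality = 10 * np.sin(2 * np.pi * months / 12)  # pattern that repeats every 12 months
noise = rng.normal(0, 3, size=months.size)          # irregular, non-repeating variation

sales = trend + seasonality + noise     # assumed additive combination

# A 12-month moving average cancels the seasonal pattern and damps the noise,
# leaving an estimate of the trend.
window = 12
moving_avg = np.convolve(sales, np.ones(window) / window, mode="valid")

print(moving_avg[:3])
```

Plotting sales and moving_avg against time would show the seasonal swings riding on top of the uptrend.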


1/26/23, 9:12 AM Lecture3-Part2 - Jupyter Notebook

Lecture 4

Matplotlib Overview

Importing Library

Import the matplotlib.pyplot module under the name plt:

In [70]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

You'll also need to use this line to see plots in the notebook:

In [71]:
%matplotlib inline

That line is only for Jupyter notebooks. If you are using another editor, call plt.show() at the end of all your plotting commands to have the figure pop up in another window.

Example with numpy

Let's walk through a very simple example using two numpy arrays. You can also use lists, but most likely you'll be passing numpy arrays or pandas columns (which essentially also behave like arrays).

The data we want to plot:

In [72]:
import numpy as np
x = np.arange(0, 20, 2)
y = x ** 2

In [73]:
x

Out[73]: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

In [74]:
y

Out[74]: array([ 0, 4, 16, 36, 64, 100, 144, 196, 256, 324], dtype=int32)

Basic Matplotlib Commands


We can create a very simple line plot using the following.

In [75]:
plt.plot(x, y, 'r')  # 'r' is the color red
plt.xlabel('X Axis Title Here')
plt.ylabel('Y Axis Title Here')
plt.title('String Title Here')
plt.show()


In [76]:
# Example
x = range(1, 10)
y = [1, 2, 3, 4, 0, 4, 3, 2, 1]

Now we want to draw a line chart of a real dataset. First, download the dataset from Blackboard, then change the path below to the location of your downloaded file.

In [77]:
cars = pd.read_csv('Downloads/mtcars.csv')
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

mpg = cars['mpg']

In [78]:
mpg.plot()

Out[78]: <AxesSubplot:>

In [79]:
mpg.plot()
plt.xlabel('number of cars')
plt.ylabel('mpg')
plt.title('statistic')
plt.show()

Then we want to select 3 variables and compare them in the chart.

In [80]:
df = cars[['cyl', 'wt', 'mpg']]
df.plot()

Out[80]: <AxesSubplot:>


Object Oriented Method

To begin we create a figure instance. Then we can add axes to that figure:

In [81]:
# Create Figure (empty canvas)
fig = plt.figure()

# Add set of axes to figure
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # left, bottom, width, height (range 0 to 1)

In [82]:
# Create Figure (empty canvas)
fig = plt.figure()

# Add set of axes to figure
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # left, bottom, width, height (range 0 to 1)

# Plot on that set of axes
axes.plot(x, y, 'b')
axes.set_xlabel('Set X Label')  # Notice the use of set_ to begin methods
axes.set_ylabel('Set y Label')
axes.set_title('Set Title')

Out[82]: Text(0.5, 1.0, 'Set Title')

The code is a little more complicated, but the advantage is that we now have full control of where the plot axes are placed, and we can easily add more than one set of axes to the figure:


In [83]:
# Creates blank canvas
fig = plt.figure()

axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # main axes
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])  # inset axes

# Larger Figure Axes 1
axes1.plot(x, y, 'b')
axes1.set_xlabel('X_label_axes1')
axes1.set_ylabel('Y_label_axes1')
axes1.set_title('Axes 1 Title')

# Inset Figure Axes 2
axes2.plot(y, x, 'r')
axes2.set_xlabel('X_label_axes2')
axes2.set_ylabel('Y_label_axes2')
axes2.set_title('Axes 2 Title');

subplots()
Here we will have two subplots, ax1 and ax2, laid out in one row and two columns, so we could write ax1.plot(x) and ax2.plot(x, y).

plt.subplots() acts as a more automatic axis manager: it returns the figure together with its axes.

You can specify the number of rows and columns when calling plt.subplots():

In [84]:
# Empty canvas of 1 by 2 subplots
fig, axes = plt.subplots(nrows=1, ncols=2)

In [85]:
# Axes is an array of axes to plot on
axes

Out[85]: array([<AxesSubplot:>, <AxesSubplot:>], dtype=object)

We can iterate through this array:


In [86]:
for ax in axes:
    ax.plot(x, y, 'b')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title('title')

# Display the figure object
fig

Saving figures
Matplotlib can generate high-quality output in a number of formats, including PNG, JPG, EPS, SVG, PGF and PDF.

To save a figure to a file we can use the savefig method in the Figure class:

In [87]:
fig.savefig("filename.png")

In [88]:
# Show the current working directory, which is where filename.png was saved
%pwd

Out[88]: 'C:\\Users\\prpou'

linetypes
In [89]:
import numpy as np
x = np.linspace(0, 5, 11)
y = x ** 2

In [90]:
x

Out[90]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

In [91]:
y

Out[91]: array([ 0. , 0.25, 1. , 2.25, 4. , 6.25, 9. , 12.25, 16. , 20.25, 25. ])

In [92]:
# MATLAB style line color and style
fig, ax = plt.subplots()
ax.plot(x, x**2, 'b.-')  # blue line with dots
ax.plot(x, x**3, 'g--')  # green dashed line

Out[92]: [<matplotlib.lines.Line2D at 0x21332ae8cd0>]


line Colors
In [93]:
fig, ax = plt.subplots()

ax.plot(x, x+1, color="blue", alpha=0.5)  # half-transparent
ax.plot(x, x+2, color="#8B008B")  # RGB hex code
ax.plot(x, x+3, color="#FF8C00")  # RGB hex code

Out[93]: [<matplotlib.lines.Line2D at 0x2133181e070>]

Example:
In [94]:
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(x, x+1, color="red", linewidth=0.25)
ax.plot(x, x+2, color="red", linewidth=0.50)
ax.plot(x, x+3, color="red", linewidth=1.00)
ax.plot(x, x+4, color="red", linewidth=2.00)

# possible linestyle options: '-', '--', '-.', ':' (for step lines use drawstyle='steps')
ax.plot(x, x+5, color="green", lw=3, linestyle='-')
ax.plot(x, x+6, color="green", lw=3, ls='-.')
ax.plot(x, x+7, color="green", lw=3, ls=':')

# custom dash
line, = ax.plot(x, x+8, color="black", lw=1.50)
line.set_dashes([5, 10, 15, 10])  # format: line length, space length, ...

# possible marker symbols: marker = '+', 'o', '*', 's', ',', '.', '1', '2', '3', '4', ...
ax.plot(x, x+ 9, color="blue", lw=3, ls='-', marker='+')
ax.plot(x, x+10, color="blue", lw=3, ls='--', marker='o')
ax.plot(x, x+11, color="blue", lw=3, ls='-', marker='s')
ax.plot(x, x+12, color="blue", lw=3, ls='--', marker='1')

# marker size and color
ax.plot(x, x+13, color="purple", lw=1, ls='-', marker='o', markersize=2)
ax.plot(x, x+14, color="purple", lw=1, ls='-', marker='o', markersize=4)
ax.plot(x, x+15, color="purple", lw=1, ls='-', marker='o', markersize=8, markerfacecolor="red")
ax.plot(x, x+16, color="purple", lw=1, ls='-', marker='s', markersize=8,
        markerfacecolor="yellow", markeredgewidth=3, markeredgecolor="green");


Special Plot Types


In [98]:
plt.scatter(x, y)

Out[98]: <matplotlib.collections.PathCollection at 0x2132ff8b310>

In [99]:
plt.scatter(x, y, color='g')

Out[99]: <matplotlib.collections.PathCollection at 0x2132ff86250>

Creating bar charts

In [21]:
plt.bar(x, y)

Out[21]: <BarContainer object of 10 artists>

Creating bar charts for dataset


In [85]:
mpg.plot(kind="bar")

Out[85]: <matplotlib.axes._subplots.AxesSubplot at 0x21544dc1e08>

In [62]:
mpg.plot(kind="barh")

Out[62]: <matplotlib.axes._subplots.AxesSubplot at 0x21544214d08>

Question: what is the difference between "bar" and "barh"?

Creating a pie chart

In [64]:
x = [1, 2, 3, 4, 0.5]
plt.pie(x)
plt.show()

In [65]:
plt.pie(x)
plt.savefig('pie_chart.png')
plt.show()


In [66]:
%pwd

Out[66]: 'C:\\Users\\prpou'

Pie chart-Labels
In [60]:
import matplotlib.pyplot as plt

y = [35, 25, 25, 15]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

plt.pie(y, labels=mylabels)
plt.show()

Pie chart-Start Angle

In [62]:
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

plt.pie(y, labels=mylabels, startangle=90)
plt.show()

Pie chart-Explode
In [63]:
myexplode = [0.2, 0, 0, 0]

plt.pie(y, labels=mylabels, explode=myexplode)
plt.show()


Pie chart-Shadow
In [65]:
plt.pie(y, labels=mylabels, explode=myexplode, shadow=True)
plt.show()

Pie chart-Colors

https://www.w3schools.com/colors/colors_hexadecimal.asp
https://www.w3schools.com/colors/colors_names.asp

In [66]:
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
mycolors = ["black", "hotpink", "b", "#4CAF50"]

plt.pie(y, labels=mylabels, colors=mycolors)
plt.show()

Pie chart-Legend
In [67]:
plt.pie(y, labels=mylabels)
plt.legend()
plt.show()


Other 2D plot styles


In [36]:
n = np.array([0, 1, 2, 3, 4, 5])

In [43]:
fig, axes = plt.subplots(1, 4, figsize=(12, 3))

axes[0].scatter(n, n**2)
axes[0].set_title("scatter")

axes[1].step(n, n**2, lw=2)
axes[1].set_title("step")

axes[2].bar(n, n**2, align="center", width=0.5, alpha=0.5)
axes[2].set_title("bar")

axes[3].fill_between(n, n**2, n**3, color="green", alpha=0.5)
axes[3].set_title("fill_between");
