ELE492 - Image Process Lecture Notes 5
Hacettepe University
Department of Electrical and Electronics Engineering
Disclaimer: Much of the material and slides for this lecture were borrowed from
- IBM Courses at Coursera, https://www.coursera.org/professional-certificates/ibm-data-science
- CS109 Data Science course at Harvard University, by Rafael A. Irizarry and Verena Kaynig-Fittkau.
- Python Numpy Tutorial by Justin Johnson.
2
What is Data?
3
What is Data Science?
• Data science is the study of data.
4
Everybody is Talking about Data
5
Why Data Science Now?
• We are producing more and more data every minute via
– Sensors
– Video Surveillance Cameras
– Web Browsing
– Medical Instruments
– …
6
Python Libraries for Machine Learning
7
Python Libraries to Analyse Data
• Pandas
– Provides data structures and operations for manipulating
and analysing data such as tables and time series.
• NumPy
– Provides means to work with multidimensional arrays.
• Matplotlib
– A plotting library used to create high-quality graphs,
charts, and figures.
8
Pandas
• A library that contains high-performance, easy-to-use data
structures and data analysis tools.
– Tools for reading and writing data in different formats, e.g. CSV, Excel,
and SQL databases.
9
Create DataFrames using Dictionaries
import pandas as pd
data = { 'name': ['Fuat', 'Aykut', 'Erkut'],
'midterm': [60, 85, 100],
'final': [69, 90, 100],
'attendance': [6, 10, 10]
}
df_bbm101 = pd.DataFrame(data)
10
Same Thing, in Another Way
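names = ['Fuat', 'Aykut', 'Erkut']
midterms = [60, 85, 100]
finals = [69, 90, 100]
attendances = [6, 10, 10]
# Pair each column label with its list of values
list_labels = ['name', 'midterm', 'final', 'attendance']
list_cols = [names, midterms, finals, attendances]
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
df_bbm101 = pd.DataFrame(data)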
11
Broadcasting
df_bbm101['total'] = 0
# Adds new column to df and
# broadcasts 0 to entire column
print(df_bbm101.head())
12
Compute Columns
df_bbm101['total'] = df_bbm101['midterm']*0.3 + \
df_bbm101['final']*0.6 + \
df_bbm101['attendance']*0.1
print(df_bbm101.head())
13
Subsetting/Slicing Data
print(df_bbm101[['name', 'grade']])
    name grade
0   Fuat     D
1  Aykut     B
2  Erkut     A
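As a hedged sketch of subsetting, the snippet below rebuilds part of the small table from the earlier slides and selects columns and rows from it. The variable names `cols`, `first_two`, and `passed`, and the threshold 80, are illustrative assumptions and not part of the lecture.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Fuat', 'Aykut', 'Erkut'],
    'midterm': [60, 85, 100],
    'final': [69, 90, 100],
})

# A list of column labels selects a sub-DataFrame (note the double brackets).
cols = df[['name', 'final']]

# A row slice works like a Python list slice: rows 0 and 1.
first_two = df[:2]

# A Boolean condition keeps only the rows where it is True.
# The threshold 80 is a hypothetical value chosen for illustration.
passed = df[df['final'] > 80]

print(cols)
print(first_two)
print(passed)
```

Selecting with a single label, `df['final']`, returns a Series rather than a DataFrame.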
14
DataFrames from CSV Files
df_bbm101 = pd.read_csv('bbm101.csv')
print(df_bbm101.head())
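The file `bbm101.csv` is not included with these notes, so the following is only a sketch of a CSV round trip. It writes a small table to CSV text and reads it back, with `io.StringIO` standing in for a file on disk. The names `csv_text` and `df_back` are assumptions made for this example.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'name': ['Fuat', 'Aykut', 'Erkut'],
    'midterm': [60, 85, 100],
    'final': [69, 90, 100],
})

# Write the table as CSV text; index=False omits the row numbers.
csv_text = df.to_csv(index=False)

# Read it back. io.StringIO stands in for a file such as 'bbm101.csv'.
df_back = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_back.head())
```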
15
Indexing DataFrames
df_bbm101 = pd.read_csv('bbm101.csv', index_col='name')
print(df_bbm101.head())
       midterm  final  attendance  total grade
name
Fuat        60     69           6   60.0     D
Aykut       85     90          10   80.5     B
Erkut      100    100          10   91.0     A
print(df_bbm101.loc['Fuat'])
midterm        60
final          69
attendance      6
total          60
grade           D
Name: Fuat, dtype: object
print(df_bbm101.loc[['Aykut', 'Erkut']])
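Label-based lookup with `.loc` has a position-based counterpart, `.iloc`. The sketch below contrasts the two on a small table built in memory (the CSV file itself is not available here). The variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Fuat', 'Aykut', 'Erkut'],
    'midterm': [60, 85, 100],
    'final': [69, 90, 100],
}).set_index('name')

# .loc selects by index label.
row_by_label = df.loc['Aykut']

# .iloc selects by integer position; position 1 is the same row.
row_by_position = df.iloc[1]

# A row label plus a column label returns a single value.
erkut_final = df.loc['Erkut', 'final']

print(row_by_label)
print(erkut_final)
```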
17
Creating Numpy Arrays
import numpy as np
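The body of this slide did not survive beyond the import, so the following is a minimal sketch of the usual way to create arrays from Python lists, in the spirit of the NumPy tutorial credited in the disclaimer. It should not be read as the slide's exact content.

```python
import numpy as np

a = np.array([1, 2, 3])            # rank 1 array from a list
b = np.array([[1, 2, 3],
              [4, 5, 6]])          # rank 2 array from a nested list

print(type(a))            # <class 'numpy.ndarray'>
print(a.shape)            # (3,)
print(b.shape)            # (2, 3)
print(b[0, 0], b[1, 2])   # 1 6

a[0] = 5                  # elements can be changed in place
print(a)                  # [5 2 3]
```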
18
Miscellaneous Ways to Create Arrays
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # [[ 0. 0.]
# [ 0. 0.]]
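A few other constructors follow the same pattern as `np.zeros`. This is a short sketch using standard NumPy functions; the variable names are arbitrary.

```python
import numpy as np

b = np.ones((1, 2))        # all ones
c = np.full((2, 2), 7)     # constant array
d = np.eye(2)              # 2x2 identity matrix
e = np.arange(0, 10, 2)    # start, stop (exclusive), step

print(b)   # [[1. 1.]]
print(c)   # [[7 7]
           #  [7 7]]
print(d)   # [[1. 0.]
           #  [0. 1.]]
print(e)   # [0 2 4 6 8]
```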
20
Slicing
• Similar to slicing Python lists.
• Since arrays may be multidimensional, you must
specify a slice for each dimension of the array.
• Slices are views (not copies) of the original data.
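The last point has a practical consequence: writing to a slice changes the original array. A minimal sketch, with arbitrary variable names:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

b = a[:2, 1:3]          # view of rows 0-1, columns 1-2
b[0, 0] = 77            # writes through to a[0, 1]
print(a[0, 1])          # 77

c = a[:2, 1:3].copy()   # an explicit copy owns its data
c[0, 0] = -1
print(a[0, 1])          # 77 (unchanged by the write to c)
```

Call `.copy()` whenever the slice must be modified independently of the source array.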
21
Slicing Examples
a = np.array([[1, 2, 3, 4], # Create a rank 2 array
[5, 6, 7, 8], # with shape (3, 4)
[9, 10, 11, 12]])
print(a) # [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
b = a[:2, 1:3]
print(b) # [[ 2 3]
# [ 6 7]]
print(a[1, :]) # [5 6 7 8]
print(a[:, :-2]) # [[ 1 2]
# [ 5 6]
# [ 9 10]]
22
Integer Indexing
• NumPy arrays may be indexed with other arrays.
• Index arrays must be of integer type.
• Each value in the index array selects the element at that
position in the array being indexed.
• Returns a copy of the original data.
23
Integer Indexing Examples
a = np.array([1, 2, 3, 4, 5, 6])
print(a) # [1 2 3 4 5 6]
print(a[[1, 3, 5]]) # [2 4 6]
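For a multidimensional array, one index array can be given per dimension. The sketch below also checks that the result is a copy. The variable name `picked` is an assumption made for this example.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# One index array per dimension: picks a[0, 0], a[1, 1], a[2, 0].
picked = a[[0, 1, 2], [0, 1, 0]]
print(picked)         # [1 4 5]

# The result is a copy, so writing to it leaves a untouched.
picked[0] = 100
print(a[0, 0])        # 1
```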
24
Boolean (or, Mask) Indexing
• Boolean array indexing lets you pick out arbitrary
elements of an array.
• Frequently used to select the elements of an array
that satisfy some condition.
– Hence the name mask indexing.
25
Boolean (or, Mask) Indexing Examples
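a = np.array([1, 2, 3, 4, 5, 6])
# Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.
bool_idx = (a > 2)
print(bool_idx) # [False False True True True True]
# Use the mask to select only the elements that are > 2.
print(a[bool_idx]) # [3 4 5 6]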
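Masks can also be combined and used on the left-hand side of an assignment. The following is a small sketch of both, with an arbitrary variable name `middle`:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Combine conditions with & (and) or | (or); each condition needs parentheses.
middle = a[(a > 2) & (a < 6)]
print(middle)         # [3 4 5]

# A mask on the left-hand side assigns to the selected elements in place.
a[a % 2 == 0] = 0
print(a)              # [1 0 3 0 5 0]
```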
26
Array Math
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
# Elementwise sum
print(x + y)
print(np.add(x, y))
# [[ 6 8]
# [10 12]]
# Elementwise product
print(x * y)
print(np.multiply(x, y))
# [[ 5 12]
# [21 32]]
# The same principle holds for "np.divide, /" and "np.subtract, -".
27
Array Math (Cont’d)
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])
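The slide defines two matrices and two vectors, but the operations applied to them are missing from the text. A natural use, assumed here and following the NumPy tutorial credited in the disclaimer, is to compute inner and matrix products, which differ from the elementwise `*`:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

# Inner product of two vectors
print(v.dot(w))       # 219
print(np.dot(v, w))   # 219

# Matrix-vector product
print(x.dot(v))       # [29 67]

# Matrix-matrix product; the @ operator is equivalent for 2-D arrays
print(x @ y)          # [[19 22]
                      #  [43 50]]
```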
28
Matplotlib
• A Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats and interactive
environments.
• There are many plot types; some are used more frequently than others.
29
Why Build Visuals?
• For exploratory data analysis
• Communicate data clearly
• Share unbiased representation of data
• A picture is worth a thousand words
30
Make a Simple Plot
import matplotlib.pyplot as plt
plt.plot(5, 5, 'o')
plt.title("Plot a Point")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
31
Plot a Simple Line
import matplotlib.pyplot as plt
# year and lowest_rank are equal-length sequences defined elsewhere (not shown here)
plt.plot(year, lowest_rank)
plt.title("HU-BBM Progress")
plt.xlabel('Year')
plt.ylabel('Lowest Rank')
plt.show()
32
Dataset to Use for the Rest of This Section
• The Population Division of the United Nations compiled data
pertaining to 45 countries.
33
Immigration Data to Canada
34
Read Data into Pandas Dataframe
df = pd.read_excel(
    'http://www.un.org/…/Canada.xlsx',
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2)
print(df.head())
35
After a Little Preprocessing
36
Line Plots
A line plot displays information as a series of data points called
‘markers’ connected by straight line segments.
plt.show()
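Only the final `plt.show()` of this slide's code remains in the text. As a stand-in, here is a minimal line plot sketch on synthetic data. The numbers are made up for illustration and are not from the UN dataset, and the name `counts` is an assumption. The `Agg` backend is selected so the sketch runs without a display; in an interactive session, drop that line and call `plt.show()` at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly counts; NOT taken from the UN dataset.
counts = pd.Series([120, 150, 90, 200], index=[2010, 2011, 2012, 2013])

fig, ax = plt.subplots()
# marker='o' draws the data points that the line segments connect.
counts.plot(kind='line', marker='o', ax=ax)
ax.set_title('Synthetic Yearly Counts')
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```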
37
Area Plots
Commonly used to represent cumulated totals using numbers or percentages over time.
df_canada.sort_values(['Total'], ascending=False,
                      axis=0, inplace=True)
df_top5 = df_canada.head()
df_top5 = df_top5[years].transpose()
df_top5.plot(kind='area')
plt.show()
38
Histogram
A histogram is a way of representing the frequency distribution of
a variable.
df_canada[2013].plot(kind='hist')
plt.show()
39
Bar Chart
Unlike a histogram, a bar chart is commonly used to compare the values of
a variable at a given point.
plt.show()
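The plotting code for this slide is missing apart from `plt.show()`. The sketch below draws a bar chart from synthetic values. The categories and numbers are invented for illustration and are not from the UN dataset. The `Agg` backend keeps the sketch runnable without a display; in an interactive session, drop that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals per category; NOT taken from the UN dataset.
totals = pd.Series([30, 45, 25], index=['A', 'B', 'C'])

fig, ax = plt.subplots()
# One bar per category; the bar height is the value being compared.
totals.plot(kind='bar', ax=ax)
ax.set_title('Synthetic Totals by Category')
ax.set_xlabel('Category')
ax.set_ylabel('Total')
```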
40
Pie Chart
A pie chart is a circular statistical graphic divided into slices to
illustrate numerical proportion.
plt.show()
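Here too only `plt.show()` remains of the slide's code. This is a minimal pie chart sketch on synthetic shares. The labels and values are invented for illustration and are not from the UN dataset. The `Agg` backend keeps it runnable without a display; in an interactive session, drop that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical shares; NOT taken from the UN dataset.
shares = pd.Series([50, 30, 20], index=['A', 'B', 'C'])

fig, ax = plt.subplots()
# autopct prints each slice's percentage of the total on the chart.
shares.plot(kind='pie', ax=ax, autopct='%1.1f%%')
ax.set_ylabel('')  # hide the default y-axis label
ax.set_title('Synthetic Shares')
```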
41