ELE492 - ELE492 - Image Process Lecture Notes 5

This document provides an overview of data science libraries in Python for analyzing data, including Pandas, NumPy, and Matplotlib. It discusses what each library is used for and provides examples of how to create data frames and arrays and manipulate data using these libraries. Key terms discussed include data frames, series, indexing, slicing, and broadcasting.


ELE 492: Image Processing

Assoc. Prof. Seniha Esen Yüksel

Lecture 5: Data Science Libraries

Hacettepe University
Department of Electrical and Electronics Engineering

** Course slides are courtesy of Fuat Akal, Erkut Erdem, and Aykut Erdem, from the lecture notes of BBM 101
1
Lecture Overview
• Introduction to Data Science
– Data, Data Science, Data Scientist…
• Python Libraries to Analyse Data
– Pandas
– Numpy
– Matplotlib

Disclaimer: Much of the material and slides for this lecture were borrowed from
- IBM Courses at Coursera, https://www.coursera.org/professional-certificates/ibm-data-science
- CS109 Data Science course at Harvard University, by Rafael A. Irizarry and Verena Kaynig-Fittkau.
- Python Numpy Tutorial by Justin Johnson.
2
What is Data?

3
What is Data Science?
• Data science is the study of data.

• It involves developing methods of recording, storing, and analyzing data to effectively extract useful information.

• The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.

4
Everybody is Talking about Data

5
Why Data Science Now?
• We are producing more and more data every minute via
– Sensors
– Video Surveillance Cameras
– Browsing Web
– Medical Instruments
– …

• The biggest data source we have today is the Internet
– Currently at exabytes

• Getting insights out of data is crucial as we want to
– Build better football teams
– Sell more products
– Avoid fraud
– Find treatments
– …

6
Python Libraries for Machine Learning

7
Python Libraries to Analyse Data
• Pandas
– Provides data structures and operations for manipulating and analyzing data such as tables and time series.

• Numpy
– Provides means to work with multidimensional arrays.

• Matplotlib
– A plotting library used to create high-quality graphs,
charts, and figures.

8
Pandas
• A library that contains high-performance, easy-to-use data
structures and data analysis tools.

• Some important aspects of Pandas

– A fast and efficient DataFrame object for data manipulation with integrated indexing.

– Tools for reading and writing data in different formats, e.g. CSV, Excel, SQL databases.

– Slicing, indexing, subsetting, merging and joining of huge datasets.

• Typically imported as import pandas as pd in Python programs
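
The merging and joining mentioned above can be sketched as follows. This is a minimal illustration, assuming two small hypothetical tables (scores and labs, with made-up names and values) that share a name column; it is not taken from the course material.

```python
import pandas as pd

# Hypothetical tables for illustration; names and values are made up.
scores = pd.DataFrame({'name': ['Ada', 'Bob', 'Cem'],
                       'midterm': [70, 85, 90]})
labs = pd.DataFrame({'name': ['Bob', 'Cem', 'Dev'],
                     'lab': [8, 9, 7]})

# Inner join (the default): keep only names present in both tables.
inner = pd.merge(scores, labs, on='name')
print(inner)

# Left join: keep every row of scores; missing lab values become NaN.
left = pd.merge(scores, labs, on='name', how='left')
print(left)
```

The how argument decides which rows survive when a key is missing from one side; 'inner' and 'left' are two of the available choices.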

9
Create DataFrames using Dictionaries
import pandas as pd
data = {'name': ['Fuat', 'Aykut', 'Erkut'],
        'midterm': [60, 85, 100],
        'final': [69, 90, 100],
        'attendance': [6, 10, 10]
       }
df_bbm101 = pd.DataFrame(data)

print(df_bbm101.head()) # Prints the first 5 rows (here, all 3)

    name  midterm  final  attendance
0   Fuat       60     69           6
1  Aykut       85     90          10
2  Erkut      100    100          10

10
Same Thing, in Another Way
names = ['Fuat', 'Aykut', 'Erkut']
midterms = [60, 85, 100]
finals = [69, 90, 100]
attendances = [6, 10, 10]

list_labels = ['name', 'midterm', 'final', 'attendance']
list_cols = [names, midterms, finals, attendances]

zipped = list(zip(list_labels, list_cols))

print(zipped) # [('name', ['Fuat', 'Aykut', 'Erkut']),
              #  ('midterm', [60, 85, 100]),
              #  ('final', [69, 90, 100]),
              #  ('attendance', [6, 10, 10])]

data = dict(zipped)

df_bbm101 = pd.DataFrame(data)

11
Broadcasting
df_bbm101['total'] = 0
# Adds new column to df and
# broadcasts 0 to entire column

print(df_bbm101.head())

    name  midterm  final  attendance  total
0   Fuat       60     69           6      0
1  Aykut       85     90          10      0
2  Erkut      100    100          10      0
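
Broadcasting a scalar is not limited to assignment; the same rule applies inside arithmetic expressions. A minimal sketch, assuming a hypothetical one-column table and a made-up bonus value:

```python
import pandas as pd

# Hypothetical scores, used only for illustration.
df = pd.DataFrame({'midterm': [60, 85, 100]})

# Assigning a scalar fills every row of the new column with that value.
df['bonus'] = 5

# A scalar in an arithmetic expression is likewise applied to every row.
df['curved'] = df['midterm'] + df['bonus'] * 2

print(df)
```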

12
Compute Columns
df_bbm101['total'] = df_bbm101['midterm']*0.3 + \
                     df_bbm101['final']*0.6 + \
                     df_bbm101['attendance']*0.1

df_bbm101.loc[(df_bbm101['total'] >= 60) &
              (df_bbm101['total'] < 70), 'grade'] = 'D'
# … Code to compute Bs and Cs comes here
df_bbm101.loc[df_bbm101['total'] >= 90, 'grade'] = 'A'

print(df_bbm101.head())

    name  midterm  final  attendance  total grade
0   Fuat       60     69           6   60.0     D
1  Aykut       85     90          10   80.5     B
2  Erkut      100    100          10   91.0     A

13
Subsetting/Slicing Data
print(df_bbm101[['name', 'grade']])

print(df_bbm101.iloc[:, [0, 5]])

print(df_bbm101.iloc[:, [True, False, False, False,
                         False, True]])

# They all return the same thing:
# the name and grade columns of the df.
# The same principle can be applied to rows as well.

    name grade
0   Fuat     D
1  Aykut     B
2  Erkut     A
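
Applying the same principle to rows might look like the sketch below. It rebuilds a reduced version of the example table so that it runs on its own; the variable names by_position, by_mask and high_final are illustrative only.

```python
import pandas as pd

# Rebuild a small version of the example table so this snippet is standalone.
df = pd.DataFrame({'name': ['Fuat', 'Aykut', 'Erkut'],
                   'midterm': [60, 85, 100],
                   'final': [69, 90, 100]})

# Rows by integer position: first and last row, all columns.
by_position = df.iloc[[0, 2], :]

# Rows by Boolean list: True keeps the row, False drops it.
by_mask = df.iloc[[True, False, True], :]

# Rows by condition: students whose final is at least 90.
high_final = df[df['final'] >= 90]

print(by_position)
print(high_final)
```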

14
DataFrames from CSV Files

file name: bbm101.csv

df_bbm101 = pd.read_csv('bbm101.csv')
print(df_bbm101.head())

    name  midterm  final  attendance  total grade
0   Fuat       60     69           6   60.0     D
1  Aykut       85     90          10   80.5     B
2  Erkut      100    100          10   91.0     A
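
To try reading CSV data without a file on disk, the text can be held in memory. The sketch below assumes the file contents match the table printed above (the actual bbm101.csv is not shown in the slides) and uses io.StringIO as a stand-in for the file.

```python
import io
import pandas as pd

# CSV text held in memory; it stands in for a file such as bbm101.csv.
# The contents are an assumption based on the printed table.
csv_text = """name,midterm,final,attendance,total,grade
Fuat,60,69,6,60.0,D
Aykut,85,90,10,80.5,B
Erkut,100,100,10,91.0,A
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)   # column types are inferred from the text

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row labels.
round_trip = df.to_csv(index=False)
print(round_trip)
```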

15
Indexing DataFrames
df_bbm101 = pd.read_csv('bbm101.csv', index_col='name')
print(df_bbm101.head())

       midterm  final  attendance  total grade
name
Fuat        60     69           6   60.0     D
Aykut       85     90          10   80.5     B
Erkut      100    100          10   91.0     A

print(df_bbm101.loc['Fuat'])

midterm         60
final           69
attendance       6
total         60.0
grade            D
Name: Fuat, dtype: object

print(df_bbm101.loc[['Aykut', 'Erkut']])

       midterm  final  attendance  total grade
name
Aykut       85     90          10   80.5     B
Erkut      100    100          10   91.0     A
16
Numpy
• A library for the Python programming language, adding support for large multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

• A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers.

• The number of dimensions is the rank of the array.

• The shape of an array is a tuple of integers giving the size of the array along each dimension.

• Typically imported as import numpy as np in Python programs
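
The terms rank and shape can be checked directly on an array. A minimal sketch, assuming a hypothetical 2 x 3 array of integers:

```python
import numpy as np

# A hypothetical 2 x 3 array of integers.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)     # number of dimensions (the rank): 2
print(a.shape)    # size along each dimension: (2, 3)
print(a.dtype)    # the single element type shared by all values

# Elements are addressed by a tuple of nonnegative integers.
print(a[(1, 2)])  # same element as a[1, 2]: 6
```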

17
Creating Numpy Arrays
import numpy as np

a = np.array([1,2,3]) # Create a rank 1 array
print(type(a)) # <class 'numpy.ndarray'>
print(a.shape) # (3,)
print(a) # [1 2 3]
print(a[0], a[1], a[2]) # 1 2 3

b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print(b.shape) # (2, 3)
print(b) # [[1 2 3]
         #  [4 5 6]]
print(b[0, 0], b[0, 1], b[1, 0]) # 1 2 4

18
Miscellaneous Ways to Create Arrays
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # [[0. 0.]
         #  [0. 0.]]

b = np.ones((1,2)) # Create an array of all ones
print(b) # [[1. 1.]]

c = np.full((2,2), 7) # Create a constant array
print(c) # [[7 7]
         #  [7 7]]

d = np.eye(2) # Create a 2x2 identity matrix
print(d) # [[1. 0.]
         #  [0. 1.]]

e = np.random.random((2,2)) # Create an array filled with random values
print(e) # Might print
         # [[0.91940167 0.08143941]
         #  [0.68744134 0.87236687]]
19
Indexing Arrays
• Slicing
• Integer Indexing
• Boolean (or, Mask) Indexing

20
Slicing
• Similar to slicing Python lists.
• Since arrays may be multidimensional, you specify a slice for each dimension of the array (omitted trailing dimensions are taken in full).
• Slices are views (not copies) of the original data.
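
The practical consequence of slices being views is that writing to a slice changes the original array. A short sketch, using an explicit .copy() to show the contrast:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

b = a[:2, 1:3]         # a view onto rows 0-1, columns 1-2 of a
b[0, 0] = 77           # writes through to a[0, 1]
print(a[0, 1])         # 77

c = a[:2, 1:3].copy()  # an explicit copy owns its own data
c[0, 0] = -1
print(a[0, 1])         # still 77
```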

21
Slicing Examples
a = np.array([[1, 2, 3, 4],     # Create a rank 2 array
              [5, 6, 7, 8],     # with shape (3, 4)
              [9, 10, 11, 12]])

print(a) # [[ 1  2  3  4]
         #  [ 5  6  7  8]
         #  [ 9 10 11 12]]

b = a[:2, 1:3]
print(b) # [[2 3]
         #  [6 7]]

print(a[1, :]) # [5 6 7 8]

print(a[:, :-2]) # [[ 1  2]
                 #  [ 5  6]
                 #  [ 9 10]]

22
Integer Indexing
• NumPy arrays may be indexed with other arrays.
• Index arrays must be of integer type.
• Each value in the index array selects the element at that position in the array being indexed.
• Returns a copy of the original data.
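
The copy behaviour can be seen in a short sketch; the array values here are made up for illustration. Note that reading through an index array gives a copy, while assigning through one still updates the original.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

picked = a[[0, 2, 4]]   # integer indexing: a new array [10 30 50]
picked[0] = -1          # changes only the copy
print(a)                # [10 20 30 40 50]

# Assigning through an index array does modify a in place.
a[[0, 2, 4]] = 0
print(a)                # [ 0 20  0 40  0]
```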

23
Integer Indexing Examples
a = np.array([1, 2, 3, 4, 5, 6])
print(a) # [1 2 3 4 5 6]
print(a[[1, 3, 5]]) # [2 4 6]

a = np.array([[1, 2], [3, 4], [5, 6]])
print(a) # [[1 2]
         #  [3 4]
         #  [5 6]]

# The returned array will have shape (3,)
print(a[[0, 1, 2], [0, 1, 0]]) # [1 4 5]
print(np.array([a[0, 0], a[1, 1], a[2, 0]])) # [1 4 5]

# The same element from the source array can be reused
print(a[[0, 0], [1, 1]]) # [2 2]
print(np.array([a[0, 1], a[0, 1]])) # [2 2]

24
Boolean (or, Mask) Indexing
• Boolean array indexing lets you pick out arbitrary
elements of an array.
• Frequently used to select the elements of an array
that satisfy some condition.
– Hence it is also called mask indexing.

25
Boolean (or, Mask) Indexing Examples
a = np.array([1, 2, 3, 4, 5, 6])

bool_idx = (a > 2)
# Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.

print(bool_idx) # [False False  True  True  True  True]

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True
# values of bool_idx
print(a[bool_idx]) # [3 4 5 6]

# We can do all of the above in a single concise statement:
print(a[a > 2]) # [3 4 5 6]
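
Masks can also be combined and used on the left-hand side of an assignment. The following is a sketch with a hypothetical 2-D array; the conditions are arbitrary and chosen only to illustrate the idea.

```python
import numpy as np

# Hypothetical 2-D data for illustration.
a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions elementwise with & (and), | (or), ~ (not).
mask = (a % 2 == 0) & (a > 2)   # even AND greater than 2
selected = a[mask]
print(selected)                 # [4 6]  (the result is always rank 1)

# A mask on the left-hand side updates the selected elements in place.
a[a > 4] = 0
print(a)
```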
26
Array Math
• Basic mathematical functions operate elementwise
on arrays.
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

# Elementwise sum
print(x + y)
print(np.add(x, y))
# [[ 6 8]
# [10 12]]

# Elementwise product
print(x * y)
print(np.multiply(x, y))
# [[ 5 12]
#  [21 32]]

# The same principle holds for "np.divide, /" and "np.subtract, -"
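
The subtraction and division counterparts can be sketched the same way, reusing the x and y arrays from this slide:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

# Elementwise difference; the operator and the function agree.
print(x - y)
print(np.subtract(x, y))
# [[-4 -4]
#  [-4 -4]]

# Elementwise quotient; true division of integer arrays
# yields floating-point results.
print(x / y)
print(np.divide(x, y))
```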

27
Array Math (Cont’d)
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

v = np.array([9, 10])
w = np.array([11, 12])

# Inner product of vectors;
# both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product;
# both produce the rank 1
# array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product;
# both produce a rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

# Transpose of x
# [[1 3]
#  [2 4]]
print(x.T)

28
Matplotlib
• A Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments.

• Typically imported as import matplotlib.pyplot as plt in Python programs.

• Pyplot is a module of Matplotlib which provides simple functions to add plot elements like lines, images, text, etc.

• There are many plot types. Some are used more frequently than others.

29
Why Build Visuals?
• For exploratory data analysis
• Communicate data clearly
• Share unbiased representation of data
• A picture is worth a thousand words

30
Make a Simple Plot
import matplotlib.pyplot as plt

plt.plot(5, 5, 'o')

plt.title("Plot a Point")

plt.xlabel("X")
plt.ylabel("Y")

plt.show()

31
Plot a Simple Line
import matplotlib.pyplot as plt

year = ['2016', '2017', '2018', '2019', '2020']
lowest_rank = [21358, 20816, 17555, 11743, 7500]

plt.plot(year, lowest_rank)

plt.title("HU-BBM Progress")
plt.xlabel('Year')
plt.ylabel('Lowest Rank')

plt.show()

32
Dataset to Use for the Rest of This Section
• The Population Division of the United Nations compiled data
pertaining to 45 countries.

• For each country, annual data on the flows of international migrants is reported in addition to other metadata.

• We will work with data on Canada.

• You can get the original data at:
– https://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.asp#

33
Immigration Data to Canada

34
Read Data into Pandas Dataframe
df = pd.read_excel('http://www.un.org/…/Canada.xlsx',
                   sheet_name='Canada by Citizenship',
                   skiprows=range(20),
                   skipfooter=2)

print(df.head())

35
After a Little Preprocessing

In case you want to try:

df_canada = df.drop(columns=['Type', 'Coverage', 'AREA', 'REG', 'DEV'])
df_canada.rename(columns={'OdName':'Country', 'AreaName':'Continent',
                          'RegName':'Region'}, inplace=True)
df_canada.set_index('Country', inplace=True)
# Sum only the numeric (yearly) columns; text columns are skipped
df_canada['Total'] = df_canada.sum(axis=1, numeric_only=True)

36
Line Plots
A line plot displays information as a series of data points called ‘markers’ connected by straight line segments.

years = list(range(1980, 2014))
df_canada.loc['Haiti', years].plot(kind = 'line')

plt.title('Immigration from Haiti')
plt.xlabel('Years')
plt.ylabel('Number of Immigrants')

plt.show()

37
Area Plots
Commonly used to represent cumulative totals using numbers or percentages over time.

df_canada.sort_values(['Total'], ascending=False,
                      axis=0, inplace=True)
df_top5 = df_canada.head()
df_top5 = df_top5[years].transpose()
df_top5.plot(kind='area')

plt.title('Immigration trend of top 5 countries')
plt.xlabel('Years')
plt.ylabel('Number of Immigrants')

plt.show()
38
Histogram
A histogram is a way of representing the frequency distribution of a variable.

df_canada[2013].plot(kind='hist')

plt.title('Histogram of Immigration in 2013')
plt.xlabel('Number of Immigrants')
plt.ylabel('Number of Countries')

plt.show()
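
The counting that underlies a histogram can be inspected without plotting. The sketch below uses np.histogram on made-up values; the data, the bin count of 4 and the range are arbitrary assumptions for illustration and are not the settings used for the plot above.

```python
import numpy as np

# Hypothetical yearly counts for ten countries (made-up values).
counts = np.array([5, 12, 18, 33, 41, 47, 58, 64, 90, 99])

# Split the range 0-100 into 4 equal-width bins and count values per bin.
freq, edges = np.histogram(counts, bins=4, range=(0, 100))

print(edges)   # bin boundaries: 0, 25, 50, 75, 100
print(freq)    # number of values falling in each bin
```

Each bar of a histogram has a height equal to one entry of freq and spans two consecutive entries of edges.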

39
Bar Chart
Unlike a histogram, a bar chart is commonly used to compare the values of a variable at a given point.

df_iceland = df_canada.loc['Iceland', years]
df_iceland.plot(kind='bar')

plt.title('Icelandic Immigrants to Canada from 1980 to 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()
40
Pie Chart
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions.

df_continents = df_canada.groupby('Continent').sum()
df_continents['Total'].plot(kind='pie')

plt.title('Immigration to Canada by Continent')

plt.show()
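
The proportions that a pie chart draws can be computed explicitly. This is a minimal sketch on a hypothetical table (made-up countries A to D and totals), not the Canada dataset:

```python
import pandas as pd

# Hypothetical totals per country (made-up values for illustration).
df = pd.DataFrame({'Continent': ['Asia', 'Asia', 'Europe', 'Africa'],
                   'Total': [500, 300, 150, 50]},
                  index=['A', 'B', 'C', 'D'])

# Sum the totals within each continent, then divide by the grand total.
by_continent = df.groupby('Continent')['Total'].sum()
share = by_continent / by_continent.sum()

print(share)   # fractions that add up to 1
```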

41
