05 - Python Data Science Modules II

巨量資料探勘與應用

Big Data Mining and Applications

Python Data Science Modules II

李建樂
Chien-Yueh Lee, Ph.D.
Assistant Professor
Master Program in Artificial Intelligence
Innovation Frontier Institute of Research for Science and Technology
Department of Electrical Engineering
National Taipei University of Technology
Mar. 17, 2025
Outline
• pandas
ØIntroduction
ØSeries and DataFrame
ØBasic Operations
ØHandling NaN
ØBasic Arithmetic Operations/Functions
ØDataFrame Grouping
ØPivot Table
ØDataFrame Merging
ØFile Data Input/Output
ØVarious Plotting Functions
Understanding pandas
• pandas is an open-source third-party Python package that provides high-
performance and easy-to-use data structures and data analysis tools. Its main
features are as follows:
Ø Provides two primary data structures: Series (序列) and DataFrame (資料框), which are
used to handle one-dimensional and two-dimensional data, respectively. They can also
store heterogeneous data (different data types).
Ø After loading data, pandas allows quick preprocessing using structured object methods,
such as data imputation, removing or replacing missing values, etc.
Ø Offers methods for data analysis, statistics, and visualization, supporting multiple data
input/output formats such as TXT, CSV, Excel spreadsheets, JSON, HTML, relational
databases, and more.
• To install pandas, enter the following command: pip install pandas
• Before using it, the module must be imported. In practice, pandas is often
abbreviated as pd:
import pandas as pd
pandas Data Structure – Series
• A Series is an object composed of one-dimensional array-like data.
• pandas provides multiple ways to create a Series, such as:
ØCreating a Series from a list


ØCreating a Series from a dictionary

ØCreating a Series from a NumPy ndarray

ØCreating a Series from a scalar
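The four creation routes listed above can be sketched as follows. The values and index labels are made up for illustration and are not taken from the original slides.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns the default integer index 0, 1, 2, ...
s_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy ndarray, here with an explicit index.
s_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a scalar: the value is repeated once for every index label.
s_scalar = pd.Series(7, index=["p", "q", "r"])

print(s_list, s_dict, s_array, s_scalar, sep="\n\n")
```

When a scalar is used, an index should be supplied so that pandas knows how many times to repeat the value.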

Index and Values of a Series
• You can use the index and values attributes to retrieve the
index and values of a Series, respectively.
Reading a Series
• Similar to a NumPy ndarray, a Series can be accessed using index
values or slicing operations.
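A minimal sketch of the index and values attributes and of label-based and position-based access. The names and scores are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([85, 92, 78, 60], index=["Amy", "Ben", "Cara", "Dan"])

labels = s.index              # the index labels
data = s.values               # the stored values as a NumPy array

by_label = s["Ben"]           # access by index label
by_position = s.iloc[0]       # access by integer position
first_two = s.iloc[0:2]       # positional slice: the end position is excluded
label_slice = s["Ben":"Dan"]  # label slice: the end label is included

print(list(labels), data, by_label, by_position)
print(first_two)
print(label_slice)
```

Note the difference between the two slices: positional slicing follows the usual Python rule of excluding the end, while label slicing includes it.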
pandas Data Structure – DataFrame
• A DataFrame is a two-dimensional data structure in pandas, similar to a table
in an Excel spreadsheet.
• A DataFrame can be thought of as a dictionary of multiple Series, where each
Series shares the same index and has its own column name.
• Each column in a DataFrame can have a different data type, such as numbers,
strings, or boolean values.

https://www.runoob.com/pandas/pandas-dataframe.html
• Creating a DataFrame from a dictionary with columns as the direction
ØData content in list format:

ØData content in Series format:

• Creating a DataFrame from a dictionary with rows as the direction:

• Creating a DataFrame from a list with rows as the direction:
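The creation routes above might look like the sketch below. The data is invented, and using DataFrame.from_dict with orient="index" for the row-oriented dictionary is an assumption: the original slides may have used a different approach, such as transposing.

```python
import pandas as pd

# Column-oriented dictionary whose values are lists.
df_lists = pd.DataFrame({"name": ["Amy", "Ben"], "score": [85, 92]})

# Column-oriented dictionary whose values are Series; rows align on the index.
df_series = pd.DataFrame({
    "math": pd.Series([90, 80], index=["Amy", "Ben"]),
    "art": pd.Series([70, 95], index=["Amy", "Ben"]),
})

# Row-oriented dictionary: each key becomes a row label.
df_rows = pd.DataFrame.from_dict(
    {"Amy": [90, 70], "Ben": [80, 95]},
    orient="index",
    columns=["math", "art"],
)

# Row-oriented list: each inner list is one row.
df_list_rows = pd.DataFrame([["Amy", 85], ["Ben", 92]], columns=["name", "score"])

print(df_lists, df_series, df_rows, df_list_rows, sep="\n\n")
```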

• pandas provides multiple methods to load data from files into a DataFrame, such as:
Ø pd.read_csv(file): Import data in CSV format
Ø pd.read_excel(file): Import data in Excel format
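Ø pd.read_sql(sql, con): Import the result of an SQL query (or a table) through a database connection, such as SQLite
Ø pd.read_json(file): Import data in JSON format
Ø pd.read_html(file): Import the HTML tables found in a web page (returns a list of DataFrames)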
Ø pd.read_clipboard(): Import data from the clipboard
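As a sketch of two of these loaders, the example below reads CSV text from an in-memory buffer and reads a query result from an in-memory SQLite database. The table name students and its contents are hypothetical, chosen only so the example runs without external files.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts a path or any file-like object; StringIO stands in for a file here.
csv_text = "name,age\nAmy,15\nBen,14\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_sql takes an SQL query (or table name) plus an open database connection.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [("Amy", 15), ("Ben", 14)])
conn.commit()
df_sql = pd.read_sql("SELECT name, age FROM students ORDER BY name", conn)
conn.close()

print(df_csv)
print(df_sql)
```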
Reading a DataFrame
• Accessing data by column name, similar to dictionary-style access,
returns a Series.
• Accessing data by index number or name:
Øat[row, col]: Retrieve a single value using row and column names.
Øiat[i, j]: Retrieve a single value using row and column index numbers.
Øloc[row, col]: Retrieve partial data using row/column names.
Øiloc[i, j]: Retrieve partial data using row/column index numbers.
Reading a DataFrame by Column Name
Accessing Data by Index
• Use the at, iat, loc, and iloc attributes to retrieve part of a Series or
DataFrame by label (name) or by integer position (index number).
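The accessors can be compared side by side on a small example. The DataFrame below is an illustrative assumption, not the data from the original slides.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [15, 14, 15], "height": [160, 175, 153]},
    index=["Amy", "Ben", "Cara"],
)

heights = df["height"]                       # column access returns a Series
one_by_name = df.at["Ben", "height"]         # single value by row/column labels
one_by_pos = df.iat[1, 1]                    # single value by row/column positions
rows_by_name = df.loc["Amy":"Ben", ["age"]]  # label slice: "Ben" is included
rows_by_pos = df.iloc[0:2, 0:1]              # positional slice: row 2 is excluded

print(heights)
print(one_by_name, one_by_pos)
print(rows_by_name)
print(rows_by_pos)
```

Here at and iat return the same single value, and the loc and iloc selections return the same two-row, one-column DataFrame.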
Basic Arithmetic Operations/Functions
• Series and DataFrame objects support basic arithmetic operations and functions,
including addition, subtraction, multiplication, division, comparison, modulus,
exponentiation, product, rounding, and matrix multiplication.
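A short sketch of these operations on two small DataFrames with made-up values. Operators work element-wise and align on row and column labels.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [10, 20], "y": [30, 40]})

total = a + b               # addition, same as a.add(b)
ratio = b / a               # element-wise division
remainder = b % 3           # modulus against a scalar
squared = a ** 2            # exponentiation
larger = b > 15             # comparison gives a boolean DataFrame
col_product = a.prod()      # product of each column
rounded = (b / 3).round(1)  # rounding to one decimal place
matmul = a @ b.T            # matrix multiplication; a's columns match b.T's index

print(total, ratio, remainder, squared, larger, col_product, rounded, matmul, sep="\n\n")
```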
Handling NaN
• When a Series or DataFrame contains missing
values like NaN, pandas provides the following
functions for handling them:
Ø isna()/isnull(): Check for missing values
Ø notna()/notnull(): Check for non-missing values
Ø fillna(value): Fill missing values with a specified
value
Ø dropna(axis=0, how='any'): Remove rows or
columns containing NaN
axis=0: Delete by row; axis=1: Delete by column
how='any': Delete if at least one NaN exists in
the row/column
how='all': Delete only if all values in the
row/column are NaN
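The functions above can be tried on a small DataFrame containing a few NaN values. The data is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

missing_mask = df.isna()                 # True where a value is missing
missing_per_column = df.isna().sum()     # number of NaN values in each column
filled = df.fillna(0)                    # replace every NaN with 0
rows_any = df.dropna(axis=0, how="any")  # drop rows that contain any NaN
cols_any = df.dropna(axis=1, how="any")  # drop columns that contain any NaN
rows_all = df.dropna(axis=0, how="all")  # drop rows only if every value is NaN

print(missing_mask, missing_per_column, filled, rows_any, cols_any, rows_all, sep="\n\n")
```

With this data, how="any" by row keeps only the last row, how="any" by column keeps only column c, and how="all" keeps every row because no row is entirely NaN.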
Statistical Functions
Ø abs(): Absolute value
Ø all(axis=0): Check if all values along the specified axis are True
Ø any(axis=0): Check if at least one value along the specified axis is True
Ø count(axis=0): Count non-NaN values along the specified axis
Ø cummax(axis=None): Cumulative maximum along the specified axis
Ø cummin(axis=None): Cumulative minimum along the specified axis
Ø cumprod(axis=None): Cumulative product along the specified axis
Ø cumsum(axis=None): Cumulative sum along the specified axis
Ø max(axis=None): Maximum value along the specified axis
Ø min(axis=None): Minimum value along the specified axis
Ø prod(axis=None): Product of values along the specified axis
Ø sum(axis=None): Sum of values along the specified axis
Ø diff(axis=0): Difference between adjacent values along the specified axis
Ø describe(): Statistical summary
Ø cov(): Covariance
Ø kurt(axis=None): Kurtosis (峰度) along the specified axis
Ø median(axis=None): Median along the specified axis
Ø mean(axis=None): Mean along the specified axis
Ø std(axis=None): Standard deviation along the specified axis
Ø skew(axis=None): Skewness (偏度) along the specified axis
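A few of these functions applied to a small DataFrame, as a sketch. The height and weight values reuse the numbers from the file input/output example later in these slides, and the column names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"height": [160, 175, 153, 162], "weight": [48, 66, 50, 44]})

col_mean = df.mean()             # mean of each column
row_sum = df.sum(axis=1)         # sum across each row
running = df["weight"].cumsum()  # cumulative sum down the column
step = df["height"].diff()       # difference from the previous row (first is NaN)
summary = df.describe()          # count, mean, std, min, quartiles, max
covariance = df.cov()            # covariance matrix of the columns

print(col_mean, row_sum, running, step, summary, covariance, sep="\n\n")
```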
DataFrame Grouping
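Grouping splits the rows by the values of one or more columns and then aggregates each group. The sketch below uses groupby() with invented city and sales data, since the original slide content for this section is not available here.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Taipei", "Taipei", "Tainan", "Tainan", "Tainan"],
    "sales": [10, 20, 30, 40, 50],
})

# One aggregate per group.
mean_by_city = df.groupby("city")["sales"].mean()

# Several aggregates per group, returned as one column each.
stats_by_city = df.groupby("city")["sales"].agg(["min", "max", "count"])

print(mean_by_city)
print(stats_by_city)
```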
Pivot Table
DataFrame.pivot_table(index, [options])
• Common Parameters:
Ø index: The column (or list/array of columns) whose values become the row labels of
the result. Passing several columns produces a nested (hierarchical) index. pandas
needs at least one grouping key, so supply index, columns, or both.
Ø values: Optional. The column(s) to aggregate. If multiple values are provided as a
list or array, results are displayed in separate columns.
Ø columns: Optional. The column(s) whose values become the column labels of the result.
Ø aggfunc: Optional. The aggregation function applied to each group. Supports built-in
functions like max, min, mean (default), sum, or custom functions. Multiple functions
can be passed as a list.
• Other Parameters:
Ø fill_value: Replaces missing values (NaN) in the resulting table with a specific value.
Ø dropna: Boolean, default True. If True, columns whose entries are all NaN are left out.
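A minimal pivot table sketch using the parameters described above. The region, product, and sales data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "South", "South"],
    "product": ["A", "A", "A", "B"],
    "sales": [10, 30, 50, 40],
})

# Default aggfunc is the mean; missing combinations show up as NaN.
mean_table = df.pivot_table(index="region", columns="product", values="sales")

# Sum instead of mean, with missing combinations replaced by 0.
sum_table = df.pivot_table(
    index="region", columns="product", values="sales",
    aggfunc="sum", fill_value=0,
)

print(mean_table)
print(sum_table)
```

The North region has no sales of product B, so that cell is NaN in mean_table and 0 in sum_table.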
DataFrame Concatenating – pd.concat()
pd.concat([df1, df2, …], [options])

• Required Parameter:
Ø[df1, df2, …]: A list of two or more DataFrames to be concatenated.
• Common Parameters:
Øaxis: Optional. Default is 0 (vertical concatenation); set to 1 for horizontal
concatenation.
Øignore_index: Optional. Default is False. Set to True to ignore existing
indexes and generate new ones.
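The effect of axis and ignore_index can be seen in this sketch, which uses two small DataFrames with made-up values.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "score": [80, 90]})
df2 = pd.DataFrame({"id": [3, 4], "score": [70, 60]})

stacked = pd.concat([df1, df2])                        # vertical; index stays 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)  # vertical; index becomes 0..3
side_by_side = pd.concat([df1, df2], axis=1)           # horizontal; 2 rows, 4 columns

print(stacked)
print(renumbered)
print(side_by_side)
```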
DataFrame Merging – pd.merge()
pd.merge(left_df, right_df, [options])
• Required Parameters:
Ø left_df: Left DataFrame to merge.
Ø right_df: Right DataFrame to merge.
• Common Parameters:
Ø on: Optional. Specifies the column name used for merging. Both DataFrames
must have the same column name. If not specified, it automatically finds
common columns.
Ø left_on: Optional. Specifies the column name for merging from the left
DataFrame.
Ø right_on: Optional. Specifies the column name for merging from the right
DataFrame.
Ø how: Optional. Specifies the merge type:
—inner (default): Only keeps matching values.
—outer: Keeps all values from both DataFrames.
—left: Keeps all values from the left DataFrame.
—right: Keeps all values from the right DataFrame.
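The four merge types and the left_on/right_on parameters can be compared on two small DataFrames. The key and value columns below are illustrative assumptions.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, on="key")                    # keys b, c
outer = pd.merge(left, right, on="key", how="outer")       # keys a, b, c, d
left_join = pd.merge(left, right, on="key", how="left")    # keys a, b, c
right_join = pd.merge(left, right, on="key", how="right")  # keys b, c, d

# When the key columns have different names, use left_on and right_on.
right_renamed = right.rename(columns={"key": "code"})
by_names = pd.merge(left, right_renamed, left_on="key", right_on="code")

print(inner, outer, left_join, right_join, by_names, sep="\n\n")
```

Rows without a match on the other side are filled with NaN in the outer, left, and right merges.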
Illustration of Various Join Methods in pd.merge()

https://read01.com/GPQBMxx.html#.YboVw1l-Xb0
Merging a DataFrame and Series with Indexes
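One way to merge a DataFrame and a Series on their indexes is sketched below. The data is invented, and the choice of pd.merge() with left_index/right_index (plus DataFrame.join() as an alternative) is an assumption about what the original slide demonstrated. The Series must have a name, which becomes the new column name.

```python
import pandas as pd

df = pd.DataFrame({"height": [160, 175, 153]}, index=["Amy", "Ben", "Cara"])
weight = pd.Series([48, 66, 44], index=["Amy", "Ben", "Dan"], name="weight")

# Match rows by index label on both sides; only shared labels are kept.
merged = pd.merge(df, weight, left_index=True, right_index=True, how="inner")

# join() also aligns on the index; a left join keeps every row of df.
joined = df.join(weight, how="left")

print(merged)
print(joined)
```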
File Data Input/Output
Use the to_csv() function to write a DataFrame to a text file, e.g.,

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[15, 160, 48], [14, 175, 66], [15, 153, 50], [15, 162, 44]]))
>>> df
    0    1   2
0  15  160  48
1  14  175  66
2  15  153  50
3  15  162  44
>>> df.to_csv("df.csv", header=False, index=False)  # Write values only, without column names or row index
Use the read_csv() function to read data from .csv, .txt, and other text
files, e.g.,
>>> pd.read_csv("df.csv", names=["Age", "Height", "Weight"])
   Age  Height  Weight
0   15     160      48
1   14     175      66
2   15     153      50
3   15     162      44
>>> pd.read_csv("df.csv", names=["Age", "Height", "Weight"], nrows=2)  # Take the first 2 rows
   Age  Height  Weight
0   15     160      48
1   14     175      66
>>> pd.read_csv("df.csv", names=["Age", "Height", "Weight"], skiprows=1)  # Skip the first row
   Age  Height  Weight
0   14     175      66
1   15     153      50
2   15     162      44
pandas cheat sheet
• https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Various Plotting Functions
• Line Plot (折線圖): df.plot()
• Bar Chart (長條圖): df.plot.bar() or df.plot(kind='bar')
• Horizontal Bar Chart (橫向長條圖): df.plot.barh() or df.plot(kind='barh')
• Histogram (直方圖): df.plot.hist() or df.plot(kind='hist')
• Pie Chart (圓餅圖): df.plot.pie() or df.plot(kind='pie')
• Scatter Plot (散佈圖): df.plot.scatter(x, y) or df.plot(kind='scatter', x=x, y=y), where x and y are column names
• Box Plot (箱形圖/盒鬚圖): df.plot.box() or df.plot(kind='box')
• Kernel Density Estimation Plot (核密度估計圖): df.plot.kde() or df.plot(kind='kde')
• Area Plot (面積圖): df.plot.area() or df.plot(kind='area')
• Hexagonal Binning (六邊形分箱圖): df.plot.hexbin(x, y) or df.plot(kind='hexbin', x=x, y=y), where x and y are column names
Line Plot from a Series
• The pandas Series and DataFrame types provide basic plotting methods.
By default, the plot() function generates a line plot.
Line Plot from a DataFrame
Bar Chart
Horizontal Bar Chart
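The four chart types named in the headings above can be sketched in one figure. The quarterly numbers and the output file name pandas_plots.png are illustrative assumptions, and the Agg backend is selected only so the script runs without a display. pandas plotting requires matplotlib to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

quarters = ["Q1", "Q2", "Q3", "Q4"]
s = pd.Series([3, 5, 2, 8], index=quarters, name="sales")
df = pd.DataFrame({"north": [3, 5, 2, 8], "south": [4, 4, 6, 7]}, index=quarters)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
s.plot(ax=axes[0, 0], title="Line plot from a Series")      # one line
df.plot(ax=axes[0, 1], title="Line plot from a DataFrame")  # one line per column
df.plot.bar(ax=axes[1, 0], title="Bar chart")               # grouped vertical bars
df.plot.barh(ax=axes[1, 1], title="Horizontal bar chart")   # grouped horizontal bars

fig.tight_layout()
fig.savefig("pandas_plots.png")  # hypothetical output file name
```

Each plotting call returns a matplotlib Axes object, so titles, labels, and legends can be adjusted afterwards with the usual matplotlib methods.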
HW4-1
• Please merge the following two DataFrames and find the employee with the
highest salary in each department along with their years of service.
df1 = pd.DataFrame({
'EmpID': ['E01', 'E02', 'E03', 'E04'],
'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'Department': ['HR', 'HR', 'IT', 'IT']
})

df2 = pd.DataFrame({
'EmpID': ['E01', 'E02', 'E03', 'E04'],
'Salary': [50000, 60000, 70000, 80000],
'Years': [1, 2, 3, 4]
})
HW4-2
• Please visit this website
(https://www.kaggle.com/datasets/mayukh18/deap-deciphering-environmental-air-pollution)
to download the environmental air pollution dataset:
ØList the top 3 cities (City) with the highest PM10_median.
ØCalculate the average PM10_median for all states (State), excluding null values.
HW4-3
• Please visit the Kaggle website
(https://www.kaggle.com/datasets/omarsobhy14/nba-players-salaries?resource=download)
to download the CSV file for NBA player salaries from 2020 to 2025:
ØUse the CSV file to create a bar chart of the top 50 highest-paid players.
ØPlot a salary trend line for the highest-paid player from 2020 to 2025.

Hint: cols = ['2022/2023', '2023/2024', '2024/2025', '2024/2025.1']

df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
HW4-4
• Using the NBA salary file from HW4-3, perform a basic statistical analysis:
ØCalculate the minimum, maximum, mean, median, and standard deviation
of salaries for the 2024-2025 season.
ØCreate a box plot for the 2024-2025 season salaries and discuss the
statistical measures included in it.
HW4-5
• Read the tips.csv file (https://ppt.cc/fPDkSx) into a DataFrame using
pandas and answer the following questions:
1. Display the DataFrame contents.
2. Plot a histogram using the total_bill column.
3. Count the number of smoking and non-smoking customers (Hint:
use the value_counts() method) and visualize it as a bar chart.
4. Create a pie chart showing the number of customers per day.
5. Plot a scatter plot with total_bill on the X-axis and tip on the Y-axis,
and discuss whether total_bill and tip exhibit a linear
correlation.
