巨量資料探勘與應用
Big Data Mining and Applications
Python Data Science Modules II
李建樂
Chien-Yueh Lee, Ph.D.
Assistant Professor
Master Program in Artificial Intelligence
Innovation Frontier Institute of Research for Science and Technology
Department of Electrical Engineering
National Taipei University of Technology
Mar. 17, 2025
Outline
• pandas
ØIntroduction
ØSeries and DataFrame
ØBasic Operations
ØHandling NaN
ØBasic Arithmetic Operations/Functions
ØDataFrame Grouping
ØPivot Table
ØDataFrame Merging
ØFile Data Input/Output
ØVarious Plotting Functions
Understanding pandas
• pandas is an open-source third-party Python package that provides high-
performance and easy-to-use data structures and data analysis tools. Its main
features are as follows:
Ø Provides two primary data structures: Series (序列) and DataFrame (資料框), which are
used to handle one-dimensional and two-dimensional data, respectively. They can also
store heterogeneous data (different data types).
Ø After loading data, pandas allows quick preprocessing using structured object methods,
such as data imputation, removing or replacing missing values, etc.
Ø Offers methods for data analysis, statistics, and visualization, supporting multiple data
input/output formats such as TXT, CSV, Excel spreadsheets, JSON, HTML, relational
databases, and more.
• To install pandas, enter the following command: pip install pandas
• Before using it, the module must be imported. In practice, pandas is often
abbreviated as pd:
import pandas as pd
pandas Data Structure – Series
• A Series is an object composed of one-dimensional array-like data.
• pandas provides multiple ways to create a Series, such as:
ØCreating a Series from a list
Index Data
ØCreating a Series from a dictionary
pandas Data Structure – Series
ØCreating a Series from a NumPy ndarray
ØCreating a Series from a scalar
Index and Values of a Series
• You can use the index and values attributes to retrieve the
index and values of a Series, respectively.
Reading a Series
• Similar to a NumPy ndarray, a Series can be accessed using index
values or slicing operations.
pandas Data Structure – DataFrame
• A DataFrame is a two-dimensional data structure in pandas, similar to a table
in an Excel spreadsheet.
• A DataFrame can be thought of as a dictionary of multiple Series, where each
Series shares the same index and has its own column name.
• Each column in a DataFrame can have a different data type, such as numbers,
strings, or boolean values.
https://www.runoob.com/pandas/pandas-dataframe.html
pandas Data Structure – DataFrame
• Creating a DataFrame from a dictionary with columns as the direction
ØData content in list format:
ØData content in Series format:
pandas Data Structure – DataFrame
• Creating a DataFrame from a dictionary with rows as the direction:
• Creating a DataFrame from a list with rows as the direction:
pandas Data Structure – DataFrame
• pandas provides multiple methods to load data from files into a DataFrame, such as:
Ø pd.read_csv(file): Import data in CSV format
Ø pd.read_excel(file): Import data in Excel format
Ø pd.read_sql(file): Import data from an SQLite database
Ø pd.read_json(file): Import data in JSON format
Ø pd.read_html(file): Import data from web pages
Ø pd.read_clipboard(): Import data from the clipboard
Reading a DataFrame
• Accessing data by column name, similar to dictionary-style access,
returns a Series.
• Accessing data by index number or name:
Øat[row, col]: Retrieve a single value using row and column names.
Øiat[i, j]: Retrieve a single value using row and column index numbers.
Øloc[row, col]: Retrieve partial data using row/column names.
Øiloc[i, j]: Retrieve partial data using row/column index numbers.
Reading a DataFrame by Column Name
Reading a DataFrame by Column Name
Accessing Data by Index
• Using at, iat, loc, and iloc attributes of pandas to retrieve partial data from a Series or
DataFrame by name or index number.
Basic Arithmetic Operations/Functions
• Performing basic arithmetic operations/functions of Series or DataFrame objects,
including addition, subtraction, multiplication, division, comparison, modulus,
exponentiation, product, rounding, and matrix multiplication.
Handling NaN
• When a Series or DataFrame contains missing
values like NaN, pandas provides the following
functions for handling them:
Ø isna()/isnull(): Check for missing values
Ø notna()/notnull(): Check for non-missing values
Ø fillna(value): Fill missing values with a specified
value
Ø dropna(axis=0, how='any'): Remove rows or
columns containing NaN
axis=0: Delete by row; axis=1: Delete by column
how='any': Delete if at least one NaN exists in
the row/column
how='all': Delete only if all values in the
row/column are NaN
Statistical Functions
Function Description Function Description
abs() Absolute value prod(axis=None) Product of values along the
specified axis
all(axis=0) Check if all values along the specified
axis are True. sum(axis=None) Sum of values along the
specified axis
any(axis=0) Check if at least one value along the
specified axis is True diff(axis=0) Difference between adjacent
values along the specified axis
count(axis=0) Count non-NaN values along the
specified axis describe() Statistical summary
cummax(axis=None) Cumulative maximum along the cov() Covariance
specified axis
kurt(axis=None) Kurtosis (峰度) along the
cummin(axis=None) Cumulative minimum along the specified axis
specified axis
median(axis=None) Median along the specified
cumprod(axis=None) Cumulative product along the specified axis
axis
mean(axis=None) Mean along the specified axis
cumsum(axis=None) Cumulative sum along the specified
std(axis=None) Standard deviation along the
axis
specified axis
max(axis=None) Maximum value along the specified
skew(axis=None) Skewness (偏度) along the
axis
specified axis
min(axis=None) Minimum value along the specified axis
Statistical Functions
DataFrame Grouping
DataFrame Grouping
Pivot Table
DataFrame.pivot_table(index, [options])
• Common Parameters:
Ø index: Required. The specified column becomes the first column index and is
compared with other columns (e.g., those specified in columns). Accepts list or
array inputs, resulting in a nested structure.
Ø values: Optional. Filters data for calculations. If multiple values are provided as a
list or array, results are displayed in separate columns.
Ø columns: Optional. Used to split data and specify columns for comparison.
Ø aggfunc: Function parameter. Supports built-in functions like max, min, mean
(default), sum, or custom functions. Multiple functions can be passed as a list.
• Other Parameters:
Ø fill_value: Replaces NULL values with a specific value.
Ø dropna: Boolean; removes missing values if True.
Pivot Table
Pivot Table
DataFrame Concatenating – pd.concat()
pd.concat([df1, df2, …], [options])
• Required Parameter:
Ø[df1, df2, …]: A list of two or more DataFrames to be concatenated.
• Common Parameters:
Øaxis: Optional. Default is 0 (vertical concatenation); set to 1 for horizontal
concatenation.
Øignore_index: Optional. Default is False. Set to True to ignore existing
indexes and generate new ones.
DataFrame Concatenating – pd.concat()
DataFrame Merging – pd.merge()
pd.merge(left_df, right_df, [options])
• Required Parameters:
Ø left_df: Left DataFrame to merge.
Ø right_df:Right DataFrame to merge.
• Common Parameters:
Ø on: Optional. Specifies the column name used for merging. Both DataFrames
must have the same column name. If not specified, it automatically finds
common columns.
Ø left_on: Optional. Specifies the column name for merging from the left
DataFrame.
Ø right_on: Optional. Specifies the column name for merging from the right
DataFrame.
Ø how: Optional. Specifies the merge type:
—inner (default): Only keeps matching values.
—outer: Keeps all values from both DataFrames.
—left: Keeps all values from the left DataFrame.
—right: Keeps all values from the right DataFrame.
DataFrame Merging – pd.merge()
Illustration of Various Join Methods in pd.merge()
https://read01.com/GPQBMxx.html#.YboVw1l-Xb0
Merging a DataFrame and Series with Indexes
File Data Input/Output
Use the to_csv() function to write a DataFrame to a text file, e.g.,
>>> df = pd.DataFrame(np.array([[15, 160, 48], [14, 175, 66], [15, 153, 50], [15, 162, 44]]))
>>> df
0 1 2
0 15 160 48
1 14 175 66
2 15 153 50
3 15 162 44
>>> df.to_csv("df.csv", header = 0, index = 0)
File Data Input/Output
Use the read_csv() function to read data from .csv, .txt, and other text
files, e.g.,
>>> pd.read_csv("df.csv", names = ["Age", "Height", "Weight"])
Age Height Weight
0 15 160 48
1 14 175 66
2 15 153 50
3 15 162 44
>>> pd.read_csv("E:\\df.csv", names = ["Age", "Height", "Weight"], nrows = 2) #Take the first 2 rows
Age Height Weight
0 15 160 48
1 14 175 66
>>> pd.read_csv("E:\\df.csv", names = ["Age", "Height", "Weight"], skiprows = 1) #Skip the first rows
Age Height Weight
0 14 175 66
1 15 153 50
2 15 162 44
pandas cheat sheet
• https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Various Plotting Functions
• Line Plot (折線圖): df.plot()
• Bar Chart (長條圖): df.plot.bar() or df.plot(kind='bar')
• Horizontal Bar Chart (橫向長條圖): df.plot.barh() or df.plot(kind='barh')
• Histogram (直方圖): df.plot.hist() or df.plot(kind='hist')
• Pie Chart (圓餅圖): df.plot.pie() or df.plot(kind='pie')
• Scatter Plot (散佈圖): df.plot.scatter() or df.plot(kind='scatter')
• Box Plot (箱形圖/盒鬚圖): df.plot.box() or df.plot(kind='box')
• Kernel Density Estimation Plot (核密度估計圖): df.plot.kde() or df.plot(kind='kde')
• Area Plot (面積圖): df.plot.area() or df.plot(kind='area')
• Hexagonal Binning (六邊形分箱圖): df.plot.hexbin() or df.plot(kind='hexbin')
Line Plot from a Series
• Pandas' Series and DataFrame types provide basic plotting methods.
By default, the plot() function generates a line plot.
Line Plot from a DataFrame
Bar Chart
Horizontal Bar Chart
HW4-1
• Please merge the following two DataFrames and find the employee with the
highest salary in each department along with their years of service.
df1 = pd.DataFrame({
'EmpID': ['E01', 'E02', 'E03', 'E04'],
'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'Department': ['HR', 'HR', 'IT', 'IT']
})
df2 = pd.DataFrame({
'EmpID': ['E01', 'E02', 'E03', 'E04'],
'Salary': [50000, 60000, 70000, 80000],
'Years': [1, 2, 3, 4]
})
HW4-2
• Please visit this website (https://www.kaggle.com/datasets/mayukh18/deap-
deciphering-environmental-air-pollution) to download the environmental
air pollution dataset:
ØList the top 3 cities (City) with the highest PM10_median.
ØCalculate the average PM10_median for all states (State), excluding null values.
HW4-3
• Please visit the Kaggle website
(https://www.kaggle.com/datasets/omarsobhy14/nba-players-
salaries?resource=download) to download the CSV file for NBA player
salaries from 2020 to 2025:
ØUse the CSV file to create a bar chart of the top 50 highest-paid players.
ØPlot a salary trend line for the highest-paid player from 2020 to 2025.
Hint: cols = ['2022/2023', '2023/2024', '2024/2025', '2024/2025.1']
df[cols] = df[cols].replace({'\$': '', ',': ''}, regex=True)
HW4-4
• Using the previously mentioned file for basic statistical analysis:
ØCalculate the minimum, maximum, mean, median, and standard deviation
of salaries for the 2024-2025 season.
ØCreate a box plot for the 2024-2025 season salaries and discuss the
statistical measures included in it.
HW4-5
• Read the tips.csv file (https://ppt.cc/fPDkSx) into a DataFrame using
pandas and answer the following questions:
1. Display the DataFrame contents.
2. Plot a histogram using the total_bill column.
3. Count the number of smoking and non-smoking customers (Hint:
use the value_counts() method) and visualize it as a bar chart.
4. Create a pie chart showing the number of customers per day.
5. Plot a scatter plot with total_bill on the X-axis and tip on the Y-
axis, and discuss whether total_bill and tip exhibit a linear
correlation.