04 Getting Started with pandas

The document provides an introduction to the pandas library in Python, covering its data structures, essential functionalities, and data types. It includes a case study on Yahoo! Finance and references a book for further reading. Key topics include Series and DataFrame objects, data manipulation techniques, and the handling of structured and semi-structured data.


Getting Started with pandas

Prof. Gheith Abandah

Adapted by Prof. Iyad Jafar and Dr. Mohammad Abdel-Majeed

Developing Curricula for Artificial Intelligence and Robotics (DeCAIR)


618535-EPP-1-2020-1-JO-EPPKA2-CBHE-JP
1
Getting Started with pandas

• YouTube Video from Python Programmer

What is Pandas? Why and How to Use Pandas in Python

https://youtu.be/dcqPhpY7tWk

2
Outline
• Introduction to pandas Data Structures
• Essential Functionality
• Data Types
• Summarizing and Computing Descriptive Statistics
• Pandas for Descriptive Statistics
• Case Study – Yahoo! Finance

3
Reference
• Chapter 5
• Wes McKinney, Python for Data Analysis:
Data Wrangling with Pandas, NumPy, and
IPython, O’Reilly Media, 3rd Edition, 2022.
• Material:
https://github.com/wesm/pydata-book

4
Introduction to pandas Data
Structures

5
Pandas: Labeled Column-Oriented Data
• pandas provides high-performance, easy-to-use data structures and data
analysis tools.
• Contains a set of labeled-array data structures (Series and DataFrame)
• Designed for working with tabular or heterogeneous data.
• Index objects enabling both simple axis indexing and multi-level /
hierarchical axis indexing
• Input/Output tools: loading tabular data from flat files (CSV, delimited,
Excel 2003), and saving and loading pandas objects
• Website: https://pandas.pydata.org/
• Also, check the tutorial on Learn Python:
https://www.learnpython.org/en/Pandas_Basics
6
Series
• 1D array-like object containing
a sequence of values of the
same type, and an associated
array of data labels, called its
index.

• Has .name, .values, and .index attributes.

7
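The slide's screenshot is not reproduced in this text; a minimal runnable sketch of these attributes (the data and the name "demo" are illustrative):

```python
import pandas as pd

# A Series pairs a one-dimensional sequence of values with an index of labels.
obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"], name="demo")

print(obj.name)                 # demo
print(obj.values.tolist())      # [4, 7, -5, 3]
print(list(obj.index))          # ['d', 'b', 'a', 'c']
```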
Series
• You can specify the index
• at creation
• or by assignment.

• Index/labels can be used to access single or multiple values for reading and writing.
8
Series
• Accepts functions and
operations like NumPy:
• Scalar multiplication
• Applying math functions
• Filtering with a Boolean
array

• You can check labels, similar to dictionaries, using the in operator.
9
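A short sketch of these NumPy-style operations and the dictionary-like membership test (data is illustrative, not from the slide):

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

doubled = obj * 2           # scalar multiplication; the index is preserved
exped = np.exp(obj)         # NumPy ufuncs apply element-wise
positive = obj[obj > 0]     # filtering with a Boolean array keeps matching rows

print("b" in obj)           # True  -- 'in' tests the index, like a dict key
print("e" in obj)           # False
```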
Series
• Can convert
dictionary to Series.

• A Series can be
converted back to a
dictionary with its
to_dict method
10
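A minimal sketch of the dict-to-Series round trip (the sdata dictionary mirrors the book's example):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj3 = pd.Series(sdata)     # dict -> Series; the keys become the index
back = obj3.to_dict()       # Series -> dict
```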
Series
• Can rearrange the index
with new index.
• values found in sdata were
placed in the appropriate
locations
• no value for "California" was
found, it appears as NaN

• Supports methods like .isnull().

11
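The behavior described above can be sketched as follows (sdata and states follow the book's example):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Values from sdata are placed at the matching labels;
# "California" has no value in sdata, so it appears as NaN.
obj4 = pd.Series(sdata, index=states)
missing = obj4.isnull()
```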
DataFrame
• Table of data containing an
ordered collection of
columns, each of which can
be a different value type.

• Has both a row and a column index.
• It can be thought of as a dict
of Series all sharing the
same index.
12
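A runnable sketch of building a DataFrame from a dict of equal-length lists (the state/year/pop data follows the book's example), including the head/tail methods from the next slide:

```python
import pandas as pd

# Each column can have a different type; all columns share the same row index.
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first5 = frame.head()       # first n rows; n defaults to 5
last2 = frame.tail(2)       # last 2 rows
```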
DataFrame
• The .head(n) method selects only the first n rows.

• The .tail(n) method returns the last n rows.

• Default n is 5

13
DataFrame – Reading Columns
• A column in a DataFrame can be retrieved as a Series either by
dictionary-like or by using the dot attribute notation

14
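Both access styles can be sketched as follows (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [3.6, 2.9]})

s1 = frame["state"]   # dictionary-like access: works for any column name
s2 = frame.state      # dot attribute access: only for valid Python identifiers
```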
DataFrame – Modifying Columns
• Columns can be modified by assignment

• Use del to delete columns.


• How about .pop()?
15
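To answer the slide's question: .pop() also removes a column, but unlike del it returns the removed column as a Series. A minimal sketch (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [3.6, 2.9]})

frame["debt"] = 16.5            # assignment creates (or overwrites) a column
had_debt = "debt" in frame.columns
del frame["debt"]               # del removes a column in place

popped = frame.pop("pop")       # .pop() removes the column AND returns it
```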
DataFrame – Reading Rows
• A row in a DataFrame can be retrieved using .loc[] and .iloc[]

16
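A minimal sketch of both row accessors (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["a", "b", "c"])

row_by_label = frame.loc["b"]   # .loc selects a row by its index label
row_by_pos = frame.iloc[1]      # .iloc selects a row by integer position
```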
DataFrame – Converting to numpy

17
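The slide's example is not reproduced in this text; a minimal sketch of the conversion (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})

# to_numpy() returns a 2D ndarray; the dtype is chosen to
# accommodate all columns (here int + float -> float64).
arr = frame.to_numpy()
```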
Essential Functionality

18
Reindexing Rows in Series and DataFrame
• Assign different index using
.index attribute

• Rearrange the index using the .reindex() method.

• With DataFrames

19
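A minimal sketch of .reindex() (values follow the book's example):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# .reindex() rearranges the data to match the new label order;
# a label with no existing value ("c") is filled with NaN.
obj2 = obj.reindex(["a", "b", "c", "d"])
```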
Reindexing Columns in DataFrame
• DataFrame columns
can be rearranged
during creation from
dictionary using
columns keyword

• Rearrange columns
using .reindex()
• Reindexing with a column that
doesn’t exist will create
a new column with NaN
values.
20
Dropping Entries from an Axis
• If you don’t want some
data, you can drop rows
or columns.
• inplace
• or in a new object

• You can drop multiple items or one item.

21
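The dropping options above can be sketched as follows (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"one": [0, 4, 8], "two": [1, 5, 9]},
                     index=["Ohio", "Colorado", "Utah"])

fewer_rows = frame.drop(["Colorado", "Utah"])    # new object, rows dropped
fewer_cols = frame.drop("two", axis="columns")   # new object, column dropped
frame.drop("Ohio", inplace=True)                 # modifies frame itself
```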
Indexing, Selection, and Filtering
• You can use integers to index Series as well as labels.
  obj['b']
  obj[1]
  obj[['b', 'a', 'd']]
  obj[[1, 3]]
  obj[2:4]

• Slicing with labels includes the endpoint.
  obj['b':'c'] = 5

22
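A runnable sketch of the endpoint difference between label and positional slicing (data is illustrative; recent pandas versions deprecate integer indexing with plain [], preferring .iloc):

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0], index=["a", "b", "c", "d"])

by_label = obj["b"]          # label indexing
by_pos = obj.iloc[1]         # positional indexing (preferred over obj[1])

label_slice = obj["b":"c"]   # label slicing INCLUDES the endpoint -> 2 values
pos_slice = obj.iloc[1:3]    # positional slicing excludes the endpoint
```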
Indexing, Selection, and Filtering
• With DataFrames, []
with single or list of
elements selects
column(s).

• But slicing and selection with a Boolean array selects rows.

23
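Both behaviors can be sketched as follows (data is illustrative):

```python
import pandas as pd

data = pd.DataFrame({"one": [0, 4, 8], "two": [1, 5, 9]},
                    index=["Ohio", "Colorado", "Utah"])

col = data["two"]               # single element -> a column, as a Series
cols = data[["two", "one"]]     # list of elements -> columns, as a DataFrame
rows = data[data["two"] > 1]    # Boolean array -> selects ROWS
```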
Change Index in Frame
• Use .set_index() to change
the index to one of the
columns

• Reset the index using .reset_index().

24
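A minimal round-trip sketch (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [3.6, 2.9]})

indexed = frame.set_index("state")   # the "state" column becomes the row index
restored = indexed.reset_index()     # ...and moves back to a regular column
```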
Selection with loc and iloc
• DataFrame indexing on the rows: to select a subset of the rows and columns, use:
  • axis labels with loc
  • or integers with iloc.

data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data.loc['Colorado', ['two', 'three']]
two      5
three    6

data.iloc[2, [3, 0, 1]]
four    11
one      8
two      9
25
Arithmetic and Data Alignment
• When you are adding objects, if any index pairs are not the same,
the respective index in the result will be the union of the index
pairs.

26
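The union-of-indexes behavior can be sketched with two Series (data is illustrative):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the UNION of the two; labels present
# in only one operand produce NaN in the result.
total = s1 + s2
```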
Arithmetic and Data Alignment
• In the case of DataFrame, df1 + df2
b c d e
alignment is performed on Colorado NaN NaN NaN NaN
both the rows and the Ohio 3.0 NaN 6.0 NaN
columns. Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

df1 df2
b c d b d e
Ohio 0.0 1.0 2.0 Utah 0.0 1.0 2.0
Texas 3.0 4.0 5.0 Ohio 3.0 4.0 5.0
Colorado 6.0 7.0 8.0 Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
27
Arithmetic methods with fill values
• You can fill with a special value, like 0, when an axis label is found in one object but not the other, using add(fill_value=0).

df1                                df2
            b     c     d                   b    d    e
Ohio      0.0   1.0   2.0          Utah   0.0  1.0  2.0
Texas     3.0   4.0   5.0          Ohio   3.0  4.0  5.0
Colorado  6.0   7.0   8.0          Texas  6.0  7.0  8.0
Oregon    9.0  10.0  11.0

df1.add(df2, fill_value=0)
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
28
Arithmetic methods with fill values
• Series and DataFrame
methods for arithmetic.
• The method starting with
the letter r has arguments
flipped.
• Equivalent:
1 / df1
df1.rdiv(1)

29
Operations between DataFrame and Series
• Arithmetic between DataFrame and Series is possible.
• By default, arithmetic matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.

series = frame.iloc[0]

frame                              series
          b     d     e            b    0.0
Utah    0.0   1.0   2.0            d    1.0
Ohio    3.0   4.0   5.0            e    2.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

frame - series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
30
Operations between DataFrame and Series
• If you want to broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods with axis='index' (or axis=0).

series3 = frame['d']

series3                  frame
Utah       1.0                     b     d     e
Ohio       4.0           Utah    0.0   1.0   2.0
Texas      7.0           Ohio    3.0   4.0   5.0
Oregon    10.0           Texas   6.0   7.0   8.0
                         Oregon  9.0  10.0  11.0

frame.sub(series3, axis='index')
           b    d    e
Utah    -1.0  0.0  1.0
Ohio    -1.0  0.0  1.0
Texas   -1.0  0.0  1.0
Oregon  -1.0  0.0  1.0
31
Function Application and Mapping
• NumPy ufuncs work with pandas objects.

frame = pd.DataFrame(np.random.randn(2, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])
frame
             b         d         e
Utah -0.291993  0.085824 -0.222663
Ohio -1.473446  1.049407 -1.035874

np.abs(frame)
            b         d         e
Utah 0.291993  0.085824  0.222663
Ohio 1.473446  1.049407  1.035874
32
Function Application and Mapping
• DataFrame’s apply method applies a function on one-dimensional arrays to each column or row.

f = lambda x: x.max() - x.min()

frame
             b         d         e
Utah -0.291993  0.085824 -0.222663
Ohio -1.473446  1.049407 -1.035874

frame.apply(f)
b    1.181453
d    0.963583
e    0.813211

frame.apply(f, axis='columns')
Utah    0.377817
Ohio    2.522853

• Apply functions to all elements:
  • map for Series: frame['e'].map(f2)
  • applymap for element-wise transformations in DataFrame: frame.applymap(f2)
33
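The apply and map calls above can be sketched with concrete data (the values and the f2 formatter are illustrative, not the slide's):

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "d": [2.0, 5.0]}, index=["Utah", "Ohio"])

f = lambda x: x.max() - x.min()
per_column = frame.apply(f)                # one result per column
per_row = frame.apply(f, axis="columns")   # one result per row

# Element-wise transformation on a Series with map:
f2 = lambda x: f"{x:.2f}"
formatted = frame["d"].map(f2)
```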
Sorting
• You can sort Series and
DataFrame using
sort_index.

• You can select the axis and sort direction.

34
Sorting
• You can also sort by the
values of one or multiple
columns.

35
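Both sorting styles from the two slides above can be sketched as follows (data is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3], "a": [0, 1, 0]},
                     index=["three", "one", "two"])

by_index = frame.sort_index()               # sort rows by index label
by_cols = frame.sort_index(axis="columns")  # sort the columns instead
desc = frame.sort_index(ascending=False)    # choose the sort direction
by_values = frame.sort_values(["a", "b"])   # sort by one or more columns
```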
Sorting and Ranking
• Ranking assigns ranks from 1
through the number of valid
data points in an array.
• DataFrame can compute
ranks over the rows or the
columns

36
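A minimal sketch of ranking, including the default tie handling (data follows the book's example):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2])

ranks = obj.rank()                 # ties share the MEAN of their ranks
first = obj.rank(method="first")   # ties broken by order of appearance
```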
Axis Indexes with Duplicate Labels
• Indexing a Series with a label that has multiple entries returns a
Series.

37
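A minimal sketch of indexing with duplicate labels (data is illustrative):

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

many = obj["a"]   # duplicated label -> a Series of all matches
one = obj["c"]    # unique label -> a scalar
```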
Axis Indexes with Duplicate Labels
• The same logic extends to
indexing rows (or columns)
in a DataFrame

38
Data Types

39
Structured Data
• Structured data refers to highly organized and formatted data that is stored
in a fixed schema, typically in a tabular form within databases.

• Uses rows and columns, making it easy to store, search, and analyze.

• Examples
• Relational Databases: Employee records, student information, sales
transactions
• Spreadsheets: Excel tables with defined columns (Name, Age, Salary)
• CRM Data: Customer details, purchase history, contact information
• Inventory Systems: Product IDs, stock levels, prices, spreadsheets, CSV
files.
40
Structured Data
• Relational Databases:
Relational tables, highly
structured

• Data matrix, e.g.,


numerical matrix, crosstabs

41
Structured Data
• Transaction data

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

• Document data: term-frequency vector (matrix) of text documents

             timeout  season  coach  game  score  team  ball  lost  play  win
Document 1         3       0      5     0      2     6     0     2     0    2
Document 2         0       7      0     2      1     0     0     3     0    0
Document 3         0       1      0     0      1     2     2     0     3    0

42
Important Characteristics of Structured Data
• Dimensionality
• As the number of dimensions (features) increases, data points become sparse,
distances lose meaning, and computational complexity grows exponentially
• Sparsity
• A sparse dataset contains mostly empty or zero elements compared to the total
number of possible elements.
• Resolution
• level of detail or granularity at which data is recorded, stored, or analyzed.
• Higher resolution means more precise and detailed data, while lower resolution
means more aggregated or generalized data.
• Distribution
• Centrality and dispersion
43
Semi-Structured Data
• Semi-structured data is a type of data that does not conform to the rigid
structure of traditional relational databases.
• Yet, it still has some organizational properties and elements that make it
easier to analyze such as tags, markers or metadata to define fields and
records.
• Examples
• JSON (JavaScript Object Notation) – Used in APIs and web applications.
• XML (Extensible Markup Language) – Used in data exchange formats.
• NoSQL databases (e.g., MongoDB, Cassandra) – Store data in flexible formats.
• Email (header structure + unstructured body content).
• Graph and Networks (e.g., Neo4j) – Store relationships in a semi-structured
way using nodes (entities) and edges (relationships) that form a structure.
44
Semi-Structured Data
• Transportation network

• World Wide Web

• Molecular Structures
• Social or information
networks
• XML file

45
Un-Structured Data
• Unstructured data does not have a predefined model or organized format.
• It lacks a fixed schema, making it more difficult to store, process, and
analyze using traditional relational databases.
• Unstructured data typically requires advanced techniques like natural
language processing (NLP), image processing, machine learning, and big
data analytics for meaningful insights.
• Examples
• Text Data: Emails, chat messages, social media posts, reports, books.
• Multimedia Data: Images, videos, audio files, scanned documents.
• Web Data: Blogs, forum discussions, user reviews, website logs.
• Sensor Data: IoT device logs, satellite images, surveillance footage.
• Medical Data: X-rays, MRI scans, doctor's notes, pathology reports.
46
Data Objects
• Data sets are made up of data objects/samples/examples/
instances/data points
• A data object is an entity and is described by attributes
• In a database
• rows → data objects
• columns → attributes
• Examples
• sales database: customers, store items, sales
• university database: students, professors, courses
47
Attributes/Dimensions/Features/Variables
• A data field, representing a characteristic or feature of
a data object.
• E.g., customer_ID, name, address
• Types of attributes:
• Nominal (e.g., red, blue)
• Binary (e.g., {true, false})
• Ordinal (e.g., {freshman, sophomore, junior, senior})
• Numeric: quantitative
• Discrete vs. Continuous Attributes
48
Attribute Types
• Nominal/categories/ states/names of things
• Hair_color = {auburn, black, blond, brown, grey, red,
white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
49
Attribute Types
• Ordinal
• Values have a meaningful order (ranking) but magnitude
between successive values is not known.

• Size = {small, medium, large}


• Grades = {A, B, C, D, E, F}
• Army rankings = {Second Lieutenant, First Lieutenant,
Captain, Major, Colonel}

50
Numeric Attribute Types
• Numeric: quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point: Not every arithmetic operation can be
performed
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger
than the unit of measurement (10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities
51
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Typically represented as floating-point variables
52
Attribute Types

53
Data Attributes

54
Summarizing and Computing
Descriptive Statistics

55
Data Distribution
• Refers to the way data values are spread or distributed across
a dataset.

• Understanding data distribution is crucial in statistics, machine learning, and data science, as it helps in data preprocessing, choosing appropriate models, and making predictions.

56
Key Aspects of Data Distribution
• Range: The difference between the maximum and minimum
values in the dataset.
• Central Tendency: Includes measures like mean, median, and
mode that describe the center of the data.
• Dispersion: Includes variance, standard deviation, and
interquartile range (IQR) to measure the spread of the data.
• Shape: Determines whether the data is normally distributed,
skewed, or multimodal.
• Outliers: Extreme values that deviate significantly from the
rest of the data.
57
Descriptive Measures of Data
• Measuring the Central Tendency
• Measuring the Dispersion of Data
• Covariance and Correlation Analysis

58
Measuring the Central Tendency
• Mean:

  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Median: Middle value if odd number of values, or average of the middle two values otherwise.

• Mode: Value that occurs most frequently in the data.

59
Symmetric/Normal/Bell-Shaped Distribution
• The left and right halves of the distribution are mirror images
• mean = median = mode
• Implications
• The data is evenly distributed around the central value.
• Many statistical tests (e.g., t-tests, ANOVA) assume normality.
• Common in natural phenomena (e.g., height, IQ scores).
• Examples
• Heights of adults
• standardized test scores symmetric

60
Positively Skewed (Right-Skewed) Distribution
• The right tail (higher values) is longer
• Mean > Median > Mode.
• Implications
• Most points are on the left, but some large values pull the mean to the right.
• The median is a better measure of central tendency than the mean (less
affected by outliers).
• Often indicates presence of extreme values.
• Examples: positively
• Income distribution
skewed
• housing prices
• waiting times in queues.

61
Negatively Skewed (Left-Skewed) Distribution
• The left tail (lower values) is longer
• Mean < Median < Mode.
• Implications
• Most points are on the right; but some small values pull the mean to the left.
• The median is again a better central tendency measure than the mean.
• Common in datasets with a natural lower bound (e.g., exam scores with a
maximum score).
• May indicate presence of a floor effect.
• Example: negatively
• scores in an easy test
skewed
• age of retirement
• mortality rates.

62
Measuring Data Dispersion
• Variance: Measures how far data points
are spread out from the mean.
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

• Standard Deviation: Square root of variance, representing dispersion in the same unit as data.
• Lower values → Data is close to the mean.
• Higher values → Data is more spread out.

63
Properties of Normal Distribution Curve
• The width of the curve represents data dispersion (spread); its center represents central tendency.
64
Correlation and Covariance
• Covariance measures the relationship between two variables. It
indicates how changes in one variable are associated with
changes in another.

• Range (−∞,+∞)
• Positive covariance → Variables increase together.
• Negative covariance → One increases, the other decreases.
• If X and Y are independent, then cov is 0; however, the opposite is not true.
65
Correlation and Covariance
• Correlation between two random variables is the covariance
of the two variables normalized by the variance of each
variable.

• Pearson’s correlation coefficient.


• It is unitless and easy to compare and interpret.
• Ranges from -1 to 1
• +1 → Strong positive relationship.
• -1 → Strong negative relationship.
• 0 → No linear relationship.
66
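The covariance/correlation contrast above can be sketched in pandas (the x/y data is illustrative; y = 2x, a perfect positive linear relationship):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0]})

c = df["x"].cov(df["y"])    # covariance: sign shows direction, scale depends on units
r = df["x"].corr(df["y"])   # Pearson correlation: unitless, in [-1, 1]
matrix = df.corr()          # pairwise correlations of all numeric columns
```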
Correlation and Covariance

67
Pandas for Descriptive Statistics

68
Summarizing and Computing Descriptive
Statistics
• pandas has mathematical
and statistical methods.
• Most are reduction methods, like sum or mean.
• They have built-in handling
for missing data.

69
Summarizing and Computing Descriptive
Statistics
• Some methods, like idxmax
and idxmin return indirect
statistics, like the index value
where the minimum or
maximum values are attained

• Other methods are accumulations, like cumsum.

70
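The reductions, indirect statistics, and accumulations above can be sketched with a small DataFrame (the data follows the book's example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.40, 7.10, np.nan, 0.75],
                   "two": [np.nan, -4.5, np.nan, -1.3]},
                  index=["a", "b", "c", "d"])

totals = df.sum()              # reduction per column; NaNs skipped by default
top_labels = df.idxmax()       # index label where each column's max occurs
running = df["one"].cumsum()   # accumulation rather than reduction
```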
Summarizing and Computing Descriptive
Statistics
• Some methods are neither
reductions nor accumulations
• describe returns summary
statistics.

• On non-numeric data, it
produces alternative
summary statistics.

71
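Both behaviors of describe can be sketched as follows (data is illustrative):

```python
import pandas as pd

nums = pd.Series([2, 4, 4, 4, 5])
num_summary = nums.describe()     # count, mean, std, min, quartiles, max

labels = pd.Series(["a", "a", "b", "c"])
obj_summary = labels.describe()   # count, unique, top, freq for non-numeric data
```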
Descriptive and
summary
statistics

72
Unique Values, Value Counts, and Membership
• pandas has methods that
extract information about
the values contained in a
Series.
• Array of unique values
• The count of each value

73
Unique Values, Value Counts, and Membership
• pandas has methods that
extract information about
the values contained in a
Series.
• membership check

• The resulting mask can be used to access matches.

74
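The three value-information methods from the last two slides can be sketched together (the data follows the book's example):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()        # array of unique values, in order of appearance
counts = obj.value_counts()   # frequency of each value, sorted descending

mask = obj.isin(["b", "c"])   # Boolean membership mask...
matches = obj[mask]           # ...used to access the matching entries
```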
Case Study – Yahoo! Finance

75
Case Study: Yahoo! Finance
• Real-time and historical stock market data

76
Case Study: Yahoo! Finance

77
Data Information
• Let’s work with the closing price data for the four companies

78
Data Statistics
• Calculate some stats • Calculate daily percentage change

79
Correlation and Covariance

80
Exercises

81
Pandas Exercise 1
• Given the following code, solve the following exercises using pandas features.
import pandas as pd

# Creating a dictionary of data
data = {'Year': [2019, 2019, 2020, 2020, 2021, 2021],
        'Region': ['North', 'South', 'North', 'South', 'North', 'South'],
        'Sales': [100, 200, 150, 250, 300, 400]}

1. Create a DataFrame from the dictionary.
2. What are the total sales?
3. Add a new column (Profit) that is 15% of the Sales and find the total Profit.
4. What are the average sales of the North region?
5. How many years does this data cover?
6. What is the correlation between the North and South Sales?

82
Pandas Exercise 2
• Given the following dictionary, solve the following exercises using pandas.
data = { 'Employee_ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Department': ['HR', 'Finance', 'IT', 'IT', 'HR'],
'Salary': [60000, 70000, 80000, 75000, 65000],
'Age': [25, 30, 35, 28, 40] }

1. Create Dataframe from the dictionary and set the 'Employee_ID' as the index.
2. Select the Name and Age columns for Employee_ID 102
3. Select the first three rows of the Salary column
4. Retrieve all employees who work in the IT department
5. Reset the index of the DataFrame and display the updated DataFrame
6. Set the Department as the index and display the DataFrame.

83
Pandas Exercise 3
Create a DataFrame named df with the following data and answer
the following questions:
• Print the first two rows of the DataFrame.
• Show the column names and data types.
• Select the "Name" column.
• Retrieve the Age of Charlie.
• Update Emma’s salary to 58000.
• Select only employees with a salary greater than 60000.
• Get the names of employees under 30 years old.
• Add a new column "Department" with values: HR, IT,
Finance, IT, HR.
• Delete the "City" column.
• Sort the DataFrame by Age in descending order.

84
Pandas Exercise 4
• Given the NBA players dataset that contains basketball player statistics (points,
assists, rebounds, etc.), solve the following exercises using pandas.

1. Read the data from the csv file and store it as a dataframe.
2. Find the player with the highest points per game.
3. How many players have more than 10 assists per game?
4. What is the average height of players in the dataset?
5. Find the top 5 teams with the highest total points.
6. Count how many players play as Point Guard (PG).

85
