Pandas (Ziad)
DataFrame
A Library for Data Manipulation and Analysis
Using Powerful Data Structures
By
Dr. Ziad Al-Sharif
Pandas First Steps: install and import
• Pandas is an easy package to install. Open up your terminal program (shell or cmd)
and install it using either of the following commands:
$ conda install pandas
OR
$ pip install pandas
• For Jupyter notebook users, you can run this cell: !pip install pandas
• The ! at the beginning runs the command as if it were typed in a terminal.
• To import pandas we usually import it with a shorter name since it's used so much:
import pandas as pd
Installation: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
pandas: Data Table Representation
Core components of pandas: Series & DataFrames
• The primary two components of pandas are the Series and DataFrame.
• A Series is essentially a single column, and
• a DataFrame is a two-dimensional table made up of a collection of Series.
• DataFrames and Series are quite similar in that many operations that you can do with one you can do
with the other, such as filling in null values and calculating the mean.
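As a minimal sketch of that similarity, the example below calls mean() and fillna() on both a Series and a DataFrame. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series with one missing value (hypothetical data)
s = pd.Series([1.0, np.nan, 3.0])

# A DataFrame with one missing value in each column (hypothetical data)
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})

# mean() skips missing values by default on both structures
series_mean = s.mean()
frame_means = df.mean()      # one mean per column, returned as a Series

# fillna() is also available on both structures
s_filled = s.fillna(0)
df_filled = df.fillna(0)

print(series_mean)
print(frame_means)
```

Note that the DataFrame version works column by column, so df.mean() returns a Series indexed by column name.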
• A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
• Features of DataFrame
• Columns can be of different types
• Size is mutable
• Labeled axes (rows and columns)
• Can perform arithmetic operations on rows and columns
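The DataFrame features listed on this slide can be seen in a small sketch. The names, row labels, and values below are hypothetical.

```python
import pandas as pd

# Columns of different types, with labeled rows (hypothetical data)
df = pd.DataFrame(
    {"name": ["Ali", "Sara"], "age": [30, 25], "height_m": [1.80, 1.65]},
    index=["r1", "r2"],
)
print(df.dtypes)                          # one dtype per column

# Size is mutable: add a column, then a row
df["age_next_year"] = df["age"] + 1       # arithmetic on a whole column
df.loc["r3"] = ["Omar", 40, 1.75, 41]     # new labeled row

print(df.shape)
```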
Types of Data Structure in Pandas
Data Structure Dimensions Description
pandas.DataFrame(data, index, columns, dtype, copy)
• data: The data can take various forms, such as an ndarray, Series, map, list, dict, constant, or another DataFrame.
• index: Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
• columns: Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
• dtype: Data type to force on the data; only a single dtype is allowed.
• copy: Whether to copy the input data. The default was False in older pandas releases and depends on the input type in recent ones.
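The constructor parameters above can be exercised together in a short sketch. The array contents and the labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])   # data: a 2-D ndarray

df = pd.DataFrame(
    data=values,
    index=["r1", "r2", "r3"],    # row labels
    columns=["a", "b"],          # column labels
    dtype="float64",             # force a single dtype on all columns
    copy=True,                   # work on a copy of `values`
)

# Without index/columns, both axes are labeled 0..n-1
default_df = pd.DataFrame(values)

print(df)
print(list(default_df.index), list(default_df.columns))
```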
• Create DataFrame
• A pandas DataFrame can be created using various inputs like −
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
Creating a DataFrame from scratch
• There are many ways to create a DataFrame from scratch, but a great option is to just use a
simple dict. But first you must import pandas.
import pandas as pd
• Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each
fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could
do something like:
data = { 'apples':[3, 2, 0, 1] , 'oranges':[0, 3, 7, 2] }
df = pd.DataFrame(data)
How did that work?
• Each (key, value) item in data corresponds to a column in the resulting DataFrame.
• The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also
create our own when we initialize the DataFrame.
• E.g. if you want to have customer names as the index:
df.loc['Ali']
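The df.loc['Ali'] lookup only works once the customer names have been set as the index. A minimal sketch, assuming four customers (only 'Ali' comes from the slide; the other names are hypothetical):

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}

# Pass the row labels through the index argument at creation time
df = pd.DataFrame(data, index=['Ali', 'Sara', 'Omar', 'Lina'])

# .loc selects a row by its label and returns it as a Series
ali_order = df.loc['Ali']
print(ali_order)
```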
pandas.DataFrame.from_dict
pandas.DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)
• data : dict
• Of the form {field:array-like} or {field:dict}.
https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.from_dict.html
pandas’ orient keyword
data = {'col_1':[3, 2, 1, 0], 'col_2':['a','b','c','d']}
pd.DataFrame.from_dict(data)
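For contrast with the default orient='columns' call above, the sketch below passes orient='index', which turns each dictionary key into a row label instead of a column name. The key names and the column labels A to D are chosen for illustration.

```python
import pandas as pd

data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}

# Default: each key becomes a column (4 rows, 2 columns)
by_columns = pd.DataFrame.from_dict(data)

# orient='index': each key becomes a row (2 rows, 4 columns);
# the columns argument is accepted here to name the resulting columns
by_index = pd.DataFrame.from_dict(data, orient='index',
                                  columns=['A', 'B', 'C', 'D'])

print(by_columns.shape)
print(by_index)
```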
df = pd.read_csv('dataset.csv')
• CSVs don't have indexes like our DataFrames, so all we need to do is just designate the
index_col when reading:
df = pd.read_csv('dataset.csv', index_col=0)
df = pd.read_json('dataset.json')
• Notice this time our index came with us correctly since using JSON allowed indexes to work
through nesting.
• Pandas will try to figure out how to create a DataFrame by analyzing structure of your JSON, and
sometimes it doesn't get it right.
• Often you'll need to set the orient keyword argument depending on the structure
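A minimal sketch of how orient changes the result of read_json. The JSON text is a made-up stand-in for a file, wrapped in io.StringIO so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical JSON in which each top-level key is a customer
json_text = ('{"Ali": {"apples": 3, "oranges": 0}, '
             '"Sara": {"apples": 2, "oranges": 3}}')

# Default parsing treats the top-level keys as column names
by_columns = pd.read_json(io.StringIO(json_text))

# orient='index' treats the top-level keys as row labels instead
by_index = pd.read_json(io.StringIO(json_text), orient='index')

print(by_columns)
print(by_index)
```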
Example #1: Reading data from JSON
Example #2: Reading data from JSON
Example #3: Reading data from JSON
Converting back to a CSV or JSON
• So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice.
Similar to the ways we read in data, pandas provides intuitive commands to save it:
df.to_csv('new_dataset.csv')
df.to_json('new_dataset.json')
df.to_sql('new_dataset', con)
• When we save JSON and CSV files, all we have to input into those functions is our desired
filename with the appropriate file extension.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
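The three save commands can be tried end to end in a short sketch. The small DataFrame, the temporary directory, and the in-memory SQLite connection standing in for con are all assumptions made so the example runs on its own.

```python
import os
import sqlite3
import tempfile
import pandas as pd

df = pd.DataFrame({'apples': [3, 2], 'oranges': [0, 3]}, index=['Ali', 'Sara'])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'new_dataset.csv')
    json_path = os.path.join(tmp, 'new_dataset.json')

    df.to_csv(csv_path)
    df.to_json(json_path)

    # Read both files back to check the round trip
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_json = pd.read_json(json_path)

# to_sql needs an open database connection; SQLite in memory is assumed here
con = sqlite3.connect(':memory:')
df.to_sql('new_dataset', con)
# The unnamed index is stored in a column called "index"
from_sql = pd.read_sql('SELECT * FROM new_dataset', con, index_col='index')
con.close()

print(from_csv)
print(from_sql)
```

Unlike to_csv and to_json, to_sql takes a table name rather than a file name, plus the connection to write through.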
Most important DataFrame operations
• DataFrames possess hundreds of methods and other operations that are crucial to any analysis.
• As a beginner, you should know the operations that:
• perform simple transformations of your data, and
• provide fundamental statistical analysis of your data.
Loading dataset
• We're loading this dataset from a CSV and designating the movie titles to be our
index.
https://grouplens.org/datasets/movielens/
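Loading a CSV with the titles as the index might look like the sketch below. The column names and the three rows are hypothetical stand-ins for the real file, held in a string so the example is self-contained.

```python
import io
import pandas as pd

# Stand-in for a movies CSV file; columns and rows are hypothetical
csv_text = """title,genre,year,rating
Movie A,Action,2014,8.1
Movie B,Comedy,2012,7.0
Movie C,Action,2016,6.2
"""

# index_col names the column whose values become the row labels
movies_df = pd.read_csv(io.StringIO(csv_text), index_col='title')

print(movies_df.loc['Movie B'])
```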
Viewing your data
• The first thing to do when opening a new dataset is print out a few rows to keep as a visual
reference. We accomplish this with .head():
movies_df.head()
• .head() outputs the first five rows of your DataFrame by default, but we could also pass a number
as well: movies_df.head(10) would output the top ten rows, for example.
• To see the last five rows use .tail(), which also accepts a number; in this case we are printing the bottom two rows:
movies_df.tail(2)
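A small sketch of .head() and .tail() on a hypothetical 12-row frame that stands in for movies_df:

```python
import pandas as pd

# Hypothetical 12-row frame standing in for movies_df
movies_df = pd.DataFrame(
    {'rating': range(12)},
    index=[f'Movie {i}' for i in range(12)],
)

first_five = movies_df.head()      # default is 5 rows
first_ten = movies_df.head(10)
last_two = movies_df.tail(2)

print(last_two)
```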
Getting info about your data
• .info() should be one of the very first commands you run after loading your data
• .info() provides the essential details about your dataset, such as the number of rows and
columns, the number of non-null values, what type of data is in each column, and how much
memory your DataFrame is using.
movies_df.info()
movies_df.shape
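As a rough illustration, .info() prints its summary directly, while .shape is an attribute (no parentheses) holding a (rows, columns) tuple. The frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing revenue value
movies_df = pd.DataFrame(
    {'genre': ['Action', 'Comedy', 'Action'],
     'revenue': [100.5, np.nan, 42.0]},
    index=['Movie A', 'Movie B', 'Movie C'],
)

movies_df.info()                     # prints dtypes, non-null counts, memory

n_rows, n_cols = movies_df.shape     # tuple of (rows, columns)
print(n_rows, n_cols)
```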
Handling duplicates
• This dataset does not have duplicate rows, but it is always important to verify you aren't
aggregating duplicate rows.
• To demonstrate, let's double up our movies DataFrame by concatenating it with itself:
• pd.concat() returns a copy without affecting the original DataFrame. We are capturing this copy in temp_df so we aren't working with the real data. (DataFrame.append(), used for this in older code, was removed in pandas 2.0.)
• Calling .shape quickly proves our DataFrame rows have doubled.
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
temp_df.drop_duplicates(inplace=True)
• Another important argument for drop_duplicates() is keep, which has three possible
options:
• first: (default) Drop duplicates except for the first occurrence.
• last: Drop duplicates except for the last occurrence.
• False: Drop all duplicates.
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
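The three keep options can be compared on a tiny hypothetical frame in which the first and third rows are identical:

```python
import pandas as pd

# Rows 0 and 2 are duplicates of each other (hypothetical data)
df = pd.DataFrame({'title': ['A', 'B', 'A'], 'year': [2010, 2011, 2010]})

keep_first = df.drop_duplicates(keep='first')   # drops row 2
keep_last = df.drop_duplicates(keep='last')     # drops row 0
keep_none = df.drop_duplicates(keep=False)      # drops rows 0 and 2

print(list(keep_first.index))
print(list(keep_last.index))
print(list(keep_none.index))
```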
Understanding your variables
• Using .describe() on an entire DataFrame we can get a summary of the distribution of
continuous variables:
movies_df.describe()
• .describe() can also be used on a categorical variable to get the count of rows, unique
count of categories, top category, and freq of top category:
movies_df['genre'].describe()
• This tells us that the genre column has 207 unique values and that the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).
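A minimal sketch of both uses of .describe(), on a hypothetical four-row frame rather than the real movies data:

```python
import pandas as pd

# Hypothetical stand-in for movies_df
movies_df = pd.DataFrame({
    'genre': ['Action', 'Comedy', 'Action', 'Drama'],
    'rating': [8.0, 6.0, 7.0, 7.0],
})

# On a DataFrame, only numeric columns are summarized by default
numeric_summary = movies_df.describe()

# On a categorical column: count, unique, top, freq
genre_summary = movies_df['genre'].describe()

print(numeric_summary)
print(genre_summary)
```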
More Examples
import pandas as pd
data = [1,2,3,10,20,30]
df = pd.DataFrame(data)
print(df)
import pandas as pd
data = {'Name' : ['AA', 'BB'], 'Age': [30,45]}
df = pd.DataFrame(data)
print(df)
More Examples
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
   a   b     c
0  1   2   NaN
1  5  10  20.0
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
        a   b     c
first   1   2   NaN
second  5  10  20.0
More Examples
E.g. This shows how to create a DataFrame from a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
# With two column indices whose names match the dictionary keys
df1 = pd.DataFrame(data,index=['first','second'],columns=['a','b'])
# With two column indices, one of which ('b1') matches no dictionary key
df2 = pd.DataFrame(data,index=['first','second'],columns=['a','b1'])
print(df1)
print('...........')
print(df2)
        a   b
first   1   2
second  5  10
...........
        a   b1
first   1  NaN
second  5  NaN
More Examples:
Create a DataFrame from Dict of Series
import pandas as pd
d = {'one' : pd.Series([1, 2, 3] , index=['a', 'b', 'c']),
'two' : pd.Series([1,2, 3, 4], index=['a', 'b', 'c', 'd'])
}
df = pd.DataFrame(d)
print(df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
More Examples: Column Addition
import pandas as pd
d = {'one':pd.Series([1,2,3], index=['a','b','c']),
'two':pd.Series([1,2,3,4], index=['a','b','c','d'])
}
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object
# with column label by passing new series
print("Adding a column using Series:")
df['three'] = pd.Series([10,20,30], index=['a','b','c'])
print(df)
Adding a column using Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
More Examples: Addition of rows
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c','d'])
}
df = pd.DataFrame(d)
print(df)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
df2 = pd.DataFrame([[5,6], [7,8]], columns = ['a', 'b'])
# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
df = pd.concat([df, df2])
print(df)
   one  two    a    b
a  1.0  1.0  NaN  NaN
b  2.0  2.0  NaN  NaN
c  3.0  3.0  NaN  NaN
d  NaN  4.0  NaN  NaN
0  NaN  NaN  5.0  6.0
1  NaN  NaN  7.0  8.0
More Examples: Deletion of rows
import pandas as pd
d = {'one':pd.Series([1, 2, 3], index=['a','b','c']),
'two':pd.Series([1, 2, 3, 4], index=['a','b','c','d'])
}
df = pd.DataFrame(d)
print(df)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
df2 = pd.DataFrame([[5,6], [7,8]], columns = ['a', 'b'])
# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
df = pd.concat([df, df2])
print(df)
   one  two    a    b
a  1.0  1.0  NaN  NaN
b  2.0  2.0  NaN  NaN
c  3.0  3.0  NaN  NaN
d  NaN  4.0  NaN  NaN
0  NaN  NaN  5.0  6.0
1  NaN  NaN  7.0  8.0
# Drop the row whose label is 0
df = df.drop(0)
print(df)
   one  two    a    b
a  1.0  1.0  NaN  NaN
b  2.0  2.0  NaN  NaN
c  3.0  3.0  NaN  NaN
d  NaN  4.0  NaN  NaN
1  NaN  NaN  7.0  8.0
More Examples: Reindexing
• The pandas DataFrame.reindex_like() function returns an object whose indices match those of another object.
• Any non-matching indexes are filled with NaN values.
import pandas as pd
# Creating the first dataframe
df1 = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]},
index =["A1", "A2", "A3", "A4", "A5"])
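The slide shows only the first DataFrame. A possible completion is sketched below, in which df2 is a hypothetical second frame that shares some row and column labels with df1 and adds a row (A7) and a column (E) that df1 lacks.

```python
import pandas as pd

# First dataframe, as on the slide
df1 = pd.DataFrame({"A": [1, 5, 3, 4, 2],
                    "B": [3, 2, 4, 3, 4],
                    "C": [2, 2, 7, 3, 4],
                    "D": [4, 3, 6, 12, 7]},
                   index=["A1", "A2", "A3", "A4", "A5"])

# Hypothetical second dataframe whose labels df1 should adopt
df2 = pd.DataFrame({"A": [10, 11, 7, 8],
                    "B": [21, 5, 32, 4],
                    "C": [11, 21, 23, 7],
                    "E": [1, 5, 3, 6]},
                   index=["A1", "A2", "A3", "A7"])

# Keep df1's values, but use df2's row and column labels;
# labels missing from df1 (row A7, column E) are filled with NaN
result = df1.reindex_like(df2)
print(result)
```

Only the labels of df2 are used; its values play no part in the result.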
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'], 'SSN':[10,20], 'marks':[90, 95] })
df2 = pd.DataFrame({'Name':['B','C'], 'SSN':[25,30], 'marks':[80, 97] })
df3 = pd.concat([df1, df2])
df3
References
• pandas documentation
• https://pandas.pydata.org/pandas-docs/stable/index.html
• pandas: Input/output
• https://pandas.pydata.org/pandas-docs/stable/reference/io.html
• pandas: DataFrame
• https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
• pandas: Series
• https://pandas.pydata.org/pandas-docs/stable/reference/series.html
• pandas: Plotting
• https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html