Pandas - Series and DataFrame
There is a close connection between the DataFrames and the Series of Pandas: a DataFrame
can be seen as a concatenation of Series, each of which has the same index, i.e. the index of
the DataFrame.
We will demonstrate this in the following example.
We define the following three Series:
import pandas as pd

years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)

What happens if we concatenate these "shop" Series? Pandas provides a concat function
for this purpose:

pd.concat([shop1, shop2, shop3])

This stacks the Series vertically into one long Series. To arrange them side by side as the
columns of a DataFrame, we have to set the axis parameter to 1:

shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df
Let's do some fine-tuning by giving names to the columns:

cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities
print(shops_df)

# alternative way: give names to the Series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
print(type(shops_df))
This means we can arrange or concatenate Series into DataFrames!
DataFrames from Dictionaries
A DataFrame has a row and column index; it's like a dict of Series with a
common index.
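As a quick illustration (a minimal sketch reusing the shop Series from the previous section), passing a dict of Series to the DataFrame constructor produces the same result as the concat call with axis=1:

# the dict keys become the column names, the shared index the row index:
shops_df3 = pd.DataFrame({"Zürich": shop1,
                          "Winterthur": shop2,
                          "Freiburg": shop3})
print(shops_df3)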
cities = {"name": ["London", "Berlin",
"Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235,
2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain",
"Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame
Retrieving the Column Names
It's possible to get the names of the columns as an array or as a list:

city_frame.columns.values
# or as a plain Python list:
list(city_frame.columns)
Custom Index
We can see that an index (0,1,2, ...) has been automatically assigned to
the DataFrame. We can also assign a custom index to the DataFrame
object:
ordinals = ["first", "second", "third",
"fourth",
"fifth", "sixth", "seventh", "eigth",
"ninth", "tenth", "eleventh", "twelvth",
"thirteenth"]
city_frame = pd.DataFrame(cities,
index=ordinals)
city_frame
Rearranging the Order of Columns
We can also define and rearrange the order of the columns at the time of
creation of the DataFrame. This also makes sure that we have a defined
ordering of our columns when we create the DataFrame from a dictionary,
since dictionaries (before Python 3.7) do not guarantee any ordering.
city_frame = pd.DataFrame(cities,
                          columns=["name", "country", "population"])
city_frame
But what if you want to change the ordering of the columns and the column
names of an existing DataFrame? The ordering can be changed with reindex;
note that we have to pass the new order via the columns parameter and that
reindex returns a new DataFrame:

city_frame = city_frame.reindex(columns=["country", "name", "population"])
city_frame
Now we want to rename our columns. For this purpose, we will use the
DataFrame method 'rename'. This method supports two calling
conventions:

rename(index=index_mapper, columns=columns_mapper, ...)
rename(mapper, axis={'index', 'columns'}, ...)
We will rename the columns of our DataFrame to Romanian names in
the following example. We set the parameter inplace to True so that our
DataFrame is changed in place instead of a new DataFrame being returned,
which is what happens with the default inplace=False!

city_frame.rename(columns={"name": "Nume",
                           "country": "țară",
                           "population": "populație"},
                  inplace=True)
city_frame
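The second calling convention, with a mapper and an axis, achieves the same;
here we use it to turn the Romanian column names back into English:

city_frame.rename({"Nume": "name",
                   "țară": "country",
                   "populație": "population"},
                  axis="columns",
                  inplace=True)
city_frame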
Existing Column as the Index of a DataFrame
We want to create a more useful index in the following example. We will
use the country name as the index, i.e. the list value associated with the key
"country" of our cities dictionary:
city_frame = pd.DataFrame(cities,
                          columns=["name", "population"],
                          index=cities["country"])
city_frame
Alternatively, we can change an existing DataFrame. We can use the
method set_index to turn a column into an index. "set_index" does not
work in place; it returns a new DataFrame with the chosen column as the
index:
city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
We saw in the previous example that the set_index method returns a
new DataFrame object and doesn't change the original DataFrame. If we
set the optional parameter "inplace" to True, the DataFrame will be
changed in place, i.e. no new object will be created:
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
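The inverse operation, turning the index back into an ordinary column, is the
method reset_index. Like set_index, it returns a new DataFrame unless inplace
is set to True; a small sketch:

city_frame.reset_index(inplace=True)
print(city_frame.head())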
Label-Indexing on the Rows
So far we have indexed DataFrames via the columns. We will now
demonstrate how we can access rows of a DataFrame via the
locators 'loc' and 'iloc'. ('ix' is deprecated and will be removed in the
future.)
city_frame = pd.DataFrame(cities,
                          columns=("name", "population"),
                          index=cities["country"])
print(city_frame.loc["Germany"])
print(city_frame.loc[["Germany", "France"]])
print(city_frame.loc[city_frame.population > 2000000])
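'iloc' works analogously, but uses purely integer-based positions instead of labels:

print(city_frame.iloc[0])        # first row
print(city_frame.iloc[[0, 4]])   # first and fifth row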
Sum and Cumulative Sum
We can calculate the sum of all the columns of a DataFrame or the sum of
certain columns:
print(city_frame.sum())
city_frame["population"].sum()
x = city_frame["population"].cumsum()
print(x)
Assigning New Values to Columns
x is a Pandas Series. We can reassign the previously calculated cumulative
sums to the population column:
city_frame["population"] = x
print(city_frame)
Instead of replacing the values of the population column with the
cumulative sum, we want to add the cumulative population sum as a new
column with the name "cum_population".
city_frame = pd.DataFrame(cities,
                          columns=["country",
                                   "population",
                                   "cum_population"],
                          index=cities["name"])
city_frame
We can see that the column "cum_population" is set to NaN, as we
haven't provided any data for it.
We will assign now the cumulative sums to this column:
city_frame["cum_population"] =
city_frame["population"].cumsum()
city_frame
We can also include a column name which is not contained in the
dictionary, when we create the DataFrame from the dictionary. In this
case, all the values of this column will be set to NaN:
city_frame = pd.DataFrame(cities,
                          columns=["country",
                                   "area",
                                   "population"],
                          index=cities["name"])
print(city_frame)
Accessing the Columns of a DataFrame
There are two ways to access a column of a DataFrame. The result is in
both cases a Series:
# in a dictionary-like way:
print(city_frame["population"])
# as an attribute:
print(city_frame.population)
print(type(city_frame.population))
p = city_frame.population

From the previous example, we can see that we have not copied the
population column: "p" is a view on the data of city_frame.
Assigning New Values to a Column
The column area is still not defined. We can set all elements of the
column to the same value:
city_frame["area"] = 1572
print(city_frame)
Of course, it would be better to assign each city its actual area. The list
with the area values needs to have the same length as the number of rows
in our DataFrame.
# area in square km:
area = [1572, 891.85, 605.77, 1285,
        105.4, 414.6, 228, 755,
        525.2, 517, 101.9, 310.4,
        181.8]
# area could also have been given as a Series, an array or a scalar
city_frame["area"] = area
print(city_frame)
Sorting DataFrames
Let's sort our DataFrame according to the city area:
city_frame = city_frame.sort_values(by="area",
ascending=False)
print(city_frame)
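sort_values also accepts a list of column names; for example, we can sort by
country first and by area (descending) within each country:

city_frame = city_frame.sort_values(by=["country", "area"],
                                    ascending=[True, False])
print(city_frame)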
Let's assume we have only the areas of London, Hamburg and Milan. The
areas are in a Series with the correct indices. We can assign this Series as
well; the cities which are not contained in the Series will be set to NaN:
city_frame = pd.DataFrame(cities,
                          columns=["country",
                                   "area",
                                   "population"],
                          index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
                       index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
Inserting new columns into existing DataFrames
In the previous example we have added the column area at creation time.
Quite often it will be necessary to add or insert columns into existing
DataFrames.
city_frame = pd.DataFrame(cities,
                          columns=["country",
                                   "population"],
                          index=cities["name"])
city_frame
idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame
DataFrame from Nested Dictionaries
A nested dictionary of dicts can be passed to a DataFrame as well. The
keys of the outer dictionary are taken as the columns, and the inner
keys, i.e. the keys of the nested dictionaries, are used as the row indices:
growth = {"Switzerland": {"2010": 3.0, "2011":
1.8, "2012": 1.1, "2013": 1.9},
"Germany": {"2010": 4.1, "2011": 3.6, "2012":
0.4, "2013": 0.1},
"France": {"2010":2.0, "2011":2.1, "2012":
0.3, "2013": 0.3},
"Greece": {"2010":-5.4, "2011":-8.9, "2012":-
6.6, "2013": -3.3},
"Italy": {"2010":1.7, "2011": 0.6, "2012":-
2.3, "2013":-1.9}
}
growth_frame = pd.DataFrame(growth)
growth_frame
Would you like to have the years in the columns and the countries in the rows?
No problem: you can transpose the data:
growth_frame.T
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
                                      "Italy",
                                      "Germany",
                                      "Greece"])
print(growth_frame2)
Filling a DataFrame with random values:
import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
         "April", "May", "June",
         "July", "August", "September",
         "October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5) * 1000).round(2),
                  columns=names,
                  index=index)
df
Reading CSV and DSV Files
Pandas offers two ways to read in CSV or, to be precise, DSV files:

DataFrame.from_csv
read_csv

There is no big difference between those two functions, but they have
different default values in some cases and read_csv has more parameters.
We will focus on read_csv, because DataFrame.from_csv is only kept inside
Pandas for reasons of backwards compatibility and is deprecated.
import pandas as pd
exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t")
print(exchange_rates)
As we can see, read_csv automatically used the first line as the names for
the columns. It is possible to give the columns other names. For this
purpose, we set the parameter "header" to 0, so that the first line is
skipped, and assign a list with the column names to the parameter
"names":
import pandas as pd
exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)
Exercise
The file "countries_population.csv" is a csv file, containing the population
numbers of all countries (July 2014). The delimiter of the file is a space
and commas are used to separate groups of thousands in the numbers.
The method 'head(n)' of a DataFrame can be used to output only the
first n rows. Read the file into a DataFrame.
pop = pd.read_csv("/home/madhu/Desktop/datasets/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))
",
header=None,
names=["Country", "Population"],
index_col=0,
quotechar="'",
sep=" ",
thousands=",")
print(pop.head(5))
Writing csv Files
We can create csv (or dsv) files with the method "to_csv". Before we do
this, we will prepare some data to output, which we will write to a file. We
have two csv files with population data for various
countries: countries_male_population.csv contains the figures for the male
populations and countries_female_population.csv correspondingly the
numbers for the female populations. We will create a new csv file with the
sum:
column_names = ["Country"] + list(range(2002,
2013))
male_pop =
pd.read_csv("/home/madhu/Desktop/datasets/count
ries_male_population.csv",
header=None,
index_col=0,
names=column_names)
female_pop =
pd.read_csv("/home/madhu/Desktop/datasets/count
ries_female_population.csv",
header=None,
index_col=0,
names=column_names)
population = male_pop + female_pop
population
population.to_csv("/home/madhu/Desktop/
datasets/countries_total_population.csv")
We want to create a new DataFrame with all the information, i.e. female,
male and complete population. This means that we have to introduce a
hierarchical index. Before we apply it to our DataFrames, we will introduce
this problem with a simple example:
import pandas as pd
shop1 = {"foo":{2010:23, 2011:25}, "bar":
{2010:13, 2011:29}}
shop2 = {"foo":{2010:223, 2011:225}, "bar":
{2010:213, 2011:229}}
shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)
shops = pd.concat([shop1, shop2], keys=["one",
"two"])
shops
We want to swap the hierarchical indices. For this we will
use 'swaplevel'. Note that swaplevel returns a new DataFrame, so we
have to assign the result:

shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
We will go back to our initial problem with the population
figures. We will apply the same steps to those
DataFrames:
pop_complete = pd.concat([population.T,
                          male_pop.T,
                          female_pop.T],
                         keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
df.to_csv("/home/madhu/Desktop/datasets/countries_total_population.csv")
Exercise
Read in the dsv file (csv) bundeslaender.txt. Create a new
file with the columns 'land', 'area', 'female', 'male',
'population' and 'density' (inhabitants per square
kilometre).
Print out the rows where the area is greater than
30000 and the population is greater than 10000.
Print the rows where the density is greater than 300.
lands = pd.read_csv('/home/madhu/Desktop/datasets/bundeslaender.txt', sep=" ")
print(lands.columns.values)
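A possible solution sketch, assuming the file provides the columns 'land',
'area', 'female' and 'male' (the column names printed above); the output
file name is made up for this sketch:

# derive total population and density from the assumed columns:
lands["population"] = lands["female"] + lands["male"]
lands["density"] = (lands["population"] / lands["area"]).round(2)

print(lands[(lands["area"] > 30000) & (lands["population"] > 10000)])
print(lands[lands["density"] > 300])

lands.to_csv("/home/madhu/Desktop/datasets/bundeslaender_density.txt")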
'nan' in Python
Python knows NaN values as well. We can create them with "float":
n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)
"nan" is also part of the math module since Python 3.5:
import math
n1 = math.nan
print(n1)
print(math.isnan(n1))
nan
True
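NumPy provides a NaN constant as well, which is what Pandas uses internally:

import numpy as np
print(np.nan)
print(np.isnan(np.nan))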
NaN in Pandas
Example without NaNs
Before we work with NaN data, we will process a file without any NaN
values. The data file temperatures.csv contains the temperature data of
six sensors, taken every 15 minutes between 6:00 and 19:15 o'clock.
Reading in the data file can be done with the read_csv function:
import pandas as pd
df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                 sep=";",
                 decimal=",")
df.loc[:3]
We want to calculate the average temperatures per measuring point over
all the sensors. We can use the DataFrame method 'mean'. If we use
'mean' without parameters, it will average each sensor column over all
measuring points, which isn't what we want here, but it may be interesting
as well:
df.mean()
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])
sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])
We will now assign the average temperature values as a
new column 'temperature':
# best practice:
df = df.assign(temperature=average_temp_series)
# inplace option not available
# alternatively:
#df.loc[:,"temperature"] = average_temp_series
df[:3]
Example with NaNs
We will now use a data file similar to the previous temperature csv, but
this time we will have to cope with NaN values, which occur when sensors
malfunction.
We will create a temperature DataFrame, in which some data is not
defined, i.e. NaN.
We will use and change the data from the temperatures.csv file:
temp_df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                      sep=";",
                      index_col=0,
                      decimal=",")
We will randomly assign some NaN values into the data frame. For this
purpose, we will use the where method of DataFrame. If we apply
where to a DataFrame object df, i.e. df.where(cond, other_df), it will
return an object of the same shape as df, whose entries are taken
from df where the corresponding element of cond is True and from
other_df otherwise.
Before we continue with our task, we will demonstrate how where works
with some simple examples:
s = pd.Series(range(5))
# the first element becomes NaN, because the condition 0 > 0 is False:
s.where(s > 0)
import numpy as np
A = np.random.randint(1, 30, (4, 2))
df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
# keep the even values; replace the odd ones with their negation:
df.where(m, -df, inplace=True)
df
For our task, we need to create a DataFrame 'nan_df', which consists
purely of NaN values and has the same shape as our temperature
DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also
need a DataFrame with the conditions, "df_bool". For this purpose we will
create a DataFrame with random values between 0 and 1; by applying
'random_df < 0.8' we get the df_bool DataFrame, in which about 80 %
of the values are True, so that roughly 20 % of the temperature values
will be replaced by NaN:
random_df = pd.DataFrame(np.random.random(size=(54, 6)),
                         columns=temp_df.columns.values,
                         index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
                      columns=temp_df.columns.values,
                      index=temp_df.index)
df_bool = random_df < 0.8
df_bool[:5]
Finally, we have everything together to create our DataFrame with
disturbed measurements:

disturbed_data = temp_df.where(df_bool, nan_df)
disturbed_data.to_csv("data1/temperatures_with_NaN.csv")
disturbed_data[:10]
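As a quick sanity check, we can look at the fraction of missing values per
sensor column; it should be roughly 0.2:

print(disturbed_data.isna().mean())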
Using dropna on the DataFrame
'dropna' is a DataFrame method. If we call this method without
arguments, it will return an object in which every row is omitted in which
data are missing, i.e. some value is NaN:
df = disturbed_data.dropna()
df
'dropna' can also be used to drop all columns in which some values are
NaN. This can be achieved by assigning 1 to the axis parameter; the
default value is 0, i.e. rows, as we have seen in our previous example. As
every sensor column contains NaN values, they will all disappear:
df = disturbed_data.dropna(axis=1)
df[:5]
Let us change our task: we only want to get rid of the rows which
contain more than one NaN value. The parameter 'thresh' is ideal for this
task. It is set to an integer value, which defines the minimum number of
non-NaN values a row needs in order to be kept. We have six temperature
values in every row, so setting 'thresh' to 5 makes sure that we will have
at least 5 valid floats in every remaining row:
cleansed_df = disturbed_data.dropna(thresh=5, axis=0)
cleansed_df[:7]
Now we will calculate the mean values again, but this time on the
DataFrame 'cleansed_df', i.e. the DataFrame from which we removed all
the rows containing more than one NaN value.
average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)
# best practice:
df = df.assign(temperature=average_temp_series)
# inplace option not available
df[:6]