5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
IBM Developer
Round 2 of the Call for Code Global Challenge is now open! Register to grow your AI skills and win prizes.
Data science >
Analyze structured and unstructured data to extract knowledge and insights.
Site feedback
Using pandas to perform data analysis
Using pandas to perform data
analysis
Learn about pandas and how to use it to work with structured data
By Margriet Groenendijk, Yamini Rao, Samaya Madhavan
Published July 16, 2021
To save your progress, you
must be logged in to IBM
Developer
Log in now
Python overview
You might think that Python is only for developers and people with computer science degrees.
However, Python is great for beginners, even those with little coding experience because it's
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 1/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
free, open source, and runs on any platform. The Python packages documentation is great,
and after an introductory course , you have a good foundation to build on.
Python is a general purpose and high-level programming language that is used for more than
working with data. For example, it's good for developing desktop GUI applications, websites,
and web applications. However, this tutorial focuses on the data and only goes through getting
started with data.
Unstructured versus structured data
Data is broadly classified into unstructured and structured data. Unstructured data refers to
data that is mostly free form with information that does not have consistent data types. Hand-
written notes by doctors and movie reviews collected from blogs are two examples of
unstructured data. On the other hand, structured data is information that is available in an
organized format so that it is easily readable. Examples of structured data are tables with
variables as columns and records as rows or key-value pairs in a noSQL database.
Introduction to pandas
pandas is an open source Python Library that provides high-performance data manipulation
and analysis. With the combination of Python and pandas, you can accomplish five typical
steps in the processing and analysis of data, regardless of the origin of data: load, prepare,
manipulate, model, and analyze.
There are many options when working with the data using pandas. The following list shows
some of the things that can be done using pandas.
Cleaning data by removing or replacing missing values
Converting data formats
Sorting rows
Deleting or adding rows and columns
Merging or joining DataFrames
Summarizing data by pivoting or reshaping
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 2/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
Creating visualizations
This list is far from complete. See the pandas documentation for more of what you can do.
This tutorial walks you though some of the most interesting features of pandas using
structured data that contains information about the boroughs in London. You can download
the data used in the tutorial from data.gov.uk .
Getting started with Jupyter
Notebooks
Instead of writing code in a text file and then running the code with a Python command in the
terminal, you can do all of your data analysis in one place. Code, output, tables, and charts can
all be edited and viewed in one window in any web browser with Jupyter Notebooks . As the
name suggests, this is a notebook to keep all of your ideas and data explorations in one place.
In this tutorial, you use IBM Watson Studio to run a notebook. For this, you need a free IBM
Cloud account. The following steps show you how sign up and get started. When you have the
notebook up and running, you can go through the notebook.
Prerequisites
To complete this tutorial, you need:
An IBM Cloud
IBM Watson Studio
Steps
Set up Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 3/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
1. Sign up or log in to IBM Cloud.
2. Click Create Resource at the top of the Resources page. You'll see the resources under the
hamburger menu at the upper left.
3. Search for Watson Studio, and click that tile.
4. Select the Lite plan, and click Create.
5. Go back to the Resources list, click your Watson Studio service, and then click Get Started.
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 4/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
You should now be in Watson Studio.
6. Click either Create a project or New project.
7. Select Create an empty project.
8. Give the project a name.
9. Choose an existing Object Storage service instance or create a new one. This is used to
store the notebooks and data. Note: Don't forget to click refresh when returning to the
Project page.
10. Click Create.
Load and run a notebook
1. Click Add to project, then click Notebook to add a new notebook.
2. Download the notebook .
3. Choose new notebook From File.
4. Select the downloaded notebook.
5. Select the default runtime.
6. Click Create Notebook. The notebook loads.
7. Run the notebook. In the open notebook, click Run to run the cells one at a time.
Notebook overview
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 5/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
The following list shows some of the capabilities within pandas that are covered in this
tutorial.
Data exploration (loading data, Series, and DataFrames)
Data transformation (cleaning data, selecting data, merging data, and grouping data)
Data visualization
The notebook associated with this tutorial displays more elaborate functions of pandas.
Data exploration
Loading data
As long as the data is formatted consistently and has multiple records with numbers, text, or
dates, you can typically read the data with pandas. For example, a comma-separated text file
that is saved from a table in Excel can be loaded into a notebook with the following command.
import pandas as pd
df = pd.read_csv('data.csv')
Show more
You can load other formats of data files such as HTML, JSON, or Apache Parquet a similar
way.
Series and DataFrames
A Series is a one-dimensional labeled array that can contain data of any type (for example,
integer, string, float, or Python objects).
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Show Preferences
Cookie more
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 6/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
A DataFrame is a two-dimensional data structure. The data consists of rows and columns that
you can create in various ways. For example, you can load a file or use a NumPy array and a
date for the index. NumPy is a Python library for working with multidimensional arrays and
matrices with a large collection of mathematical functions to operate on these arrays.
The following code is an example of a DataFrame df1 with dates as the index, a 6x4 array of
random numbers as values, and column names A, B, C, and D.
dates = pd.date_range('20200101', periods=6)
numbers = np.random.randn(6, 4)
df1 = pd.DataFrame(numbers, index=dates, columns=['A', 'B', 'C', 'D'])
df1
Show more
Running the previous code generates an output similar to the following image. The notebook
shows a few more ways of creating a DataFrame.
Selecting data
To select data, you access a single row or groups of rows and columns with labels using
.loc[]. (This only works for the column that was set to the index.) Or, you select data by a
position with .iloc[]. Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 7/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
boroughs = df.copy()
boroughs = boroughs.set_index(['Code'])
boroughs.loc['E09000001', 'Inland_Area_(Hectares)']
Show more
boroughs.iloc[0]
Show more
boroughs.iloc[:,1]
Show more
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 8/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
Selecting rows based on a certain condition can be done with boolean indexing. This uses the
actual values of the data in the DataFrame as opposed to the row and column labels or index
positions. You can combine different columns using &, |, and == operators.
Data transformation
Cleaning data
When exploring data, there are always transformations needed to get it in the format that you
need for your analysis, visualization, or model. The best way to learn is to find a data set and
try to answer questions with the data. Some things to check when cleaning data are:
Is the data tidy, such as each variable forms a column, each observation forms a row, and
each type of observational unit forms a table?
Are all columns in the right data format?
Are there missing values?
Are there unrealistic outliers? Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 9/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
The following code shows how to add and drop a column that might not be required from the
data set using pandas.
boroughs['new'] = 1
boroughs.head()
Show more
boroughs = boroughs.drop(columns='new')
boroughs.head()
Show more
Merging data
pandas has several different options to combine or merge data. The documentation covers
these examples. In this notebook, you create two new DataFrames to explore how to merge
data. Then, you use append() to combine these DataFrames.
data = {'city': ['London','Manchester','Birmingham','Leeds','Glasgow'],
'population': [9787426, 2553379, 2440986, 1777934,1209143],
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 10/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
'area': [1737.9, 630.3, 598.9, 487.8, 368.5 ]}
cities = pd.DataFrame(data)
data2 = {'city': ['Liverpool','Southampton'],
'population': [864122, 855569],
'area': [199.6, 192.0]}
cities2 = pd.DataFrame(data2)
cities = cities.append(cities2)
cities
Show more
Grouping data
Grouping data is a quick way to calculate values for classes in your DataFrame.
boroughs.groupby(['Inner/Outer']).mean()
Show more
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 11/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
When you have multiple categorical variables, you can create a nested index.
boroughs.groupby(['Inner/Outer','Political control']).sum().head(8)
Show more
Data visualization
Visualization in pandas uses the Matplotlib library. This plotting library uses an object-
oriented API to embed plots into applications. Some of the examples are line plot, histograms,
scatter plot, image plot, and 3D plot. In this tutorial, we use matplotlib.pyplot, which is a
collection of command-style functions that make matplotlib work like MATLAB.
The following example shows a visualization of the employment rate through a histogram. You
can change the number of bins to get the wanted output on your histogram.
%matplotlib inline
import matplotlib.pyplot as plt
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 12/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
boroughs = boroughs.reset_index()
boroughs['Employment_rate_(%)_(2015)'].plot.hist(bins=10);
Show more
Conclusion
This tutorial walked you through the steps to get IBM Cloud, Watson Studio, and a Jupyter
Notebook installed. It gave you an overview of the ways of analyzing data using pandas and a
notebook that you can run to try it yourself.
Legend
Categories
Analytics Data science Python
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 13/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
Table of Contents
Previous
Data preprocessing in detail
Next
Explore other data science topics
Related
Learning Path
Get started with machine learning
Tutorial
Object-oriented programming in Python
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 14/15
5/7/23, 8:15 AM Using pandas to perform data analysis - IBM Developer
IBM Developer Follow Us
About Twitter
FAQ LinkedIn
Build
Third-party notice YouTube
Smart Explore
Build
Newsletters
Patterns
Secure APIs
Articles
Tutorials
Open source projects
Videos
Events
Community
Career Opportunities
Privacy
Terms of use
Accessibility
Cookie preferences
Sitemap
Cookie Preferences
https://developer.ibm.com/learningpaths/get-started-data-science/data-analysis-in-python-using-pandas/ 15/15