
11/10/22, 12:44 PM DA0101EN-Review-Introduction - Jupyter Notebook


Introduction Notebook
Estimated time needed: 10 minutes

Objectives
After completing this lab you will be able to:

Acquire data in various ways


Obtain insights from data with Pandas library

Table of Contents

1. Data Acquisition
2. Basic Insight of Dataset

Data Acquisition
There are various formats for a dataset: .csv, .json, .xlsx, etc. The dataset can be stored in
different places: on your local machine or online.

In this section, you will learn how to load a dataset into your Jupyter Notebook.

In our case, the Automobile Dataset is an online source, and it is in a CSV (comma separated
value) format. Let's use this dataset as an example to practice data reading.

Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data


localhost:8889/notebooks/DA0101EN-Review-Introduction.ipynb 1/8

Data type: csv

The Pandas library is a useful tool that enables us to read various datasets into a dataframe; our
Jupyter Notebook platform has the Pandas library pre-installed, so all we need to do is import
Pandas without installing it.

In [ ]: # install specific versions of libraries used in the lab


#! mamba install pandas==1.3.3 -y
#! mamba install numpy==1.21.2 -y

In [ ]: # import pandas library


import pandas as pd
import numpy as np

Read Data
We use the pandas.read_csv() function to read the csv file. Inside the parentheses, we put the
file path in quotation marks so that pandas will read the file into a dataframe from that address.
The file path can be either a URL or your local file address.

Because the data does not include headers, we add the argument header=None inside
the read_csv() method so that pandas will not automatically set the first row as a header.

You can also assign the dataset to any variable you create.

This dataset is hosted on IBM Cloud Object Storage (https://cocl.us/DA101EN_object_storage).

In [ ]: # Import pandas library


import pandas as pd

# Read the online file by the URL provided above, and assign it to variable "df"
other_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/
df = pd.read_csv(other_path, header=None)

After reading the dataset, we can use the dataframe.head(n) method to check the top n rows
of the dataframe, where n is an integer. Contrary to dataframe.head(n) , dataframe.tail(n)
will show you the bottom n rows of the dataframe.


In [ ]: # show the first 5 rows using dataframe.head() method


print("The first 5 rows of the dataframe")
df.head(5)

Question #1:
Check the bottom 10 rows of data frame "df".

In [ ]: # Write your code below and press Shift+Enter to execute


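One possible solution, sketched here with a small synthetic dataframe standing in for the automobile data (the column names and values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy dataframe standing in for "df"; in the lab, df holds the
# automobile data loaded with pd.read_csv above.
df = pd.DataFrame({"make": [f"make_{i}" for i in range(15)],
                   "price": list(range(15))})

# dataframe.tail(n) returns the bottom n rows of the dataframe
bottom = df.tail(10)
print(bottom)
```

The same call, df.tail(10), works unchanged on the automobile dataframe.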

Add Headers
Take a look at our dataset. Pandas automatically set the header to integers starting from 0.

To better describe our data, we can introduce a header. This information is available at:
https://archive.ics.uci.edu/ml/datasets/Automobile

Thus, we have to add headers manually.

First, we create a list "headers" that includes all column names in order. Then, we use
dataframe.columns = headers to replace the headers with the list we created.

In [ ]: # create headers list


headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width",
           "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size",
           "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
print("headers\n", headers)

We replace headers and recheck our dataframe:

In [ ]: df.columns = headers
df.head(10)

We need to replace the "?" symbol with NaN so that dropna() can remove the missing values:


In [ ]: df1 = df.replace('?', np.nan)

We can drop missing values along the column "price" as follows:

In [ ]: df = df1.dropna(subset=["price"], axis=0)
df.head(20)

Now, we have successfully read the raw dataset and added the correct headers into the
dataframe.

Question #2:
Find the name of the columns of the dataframe.

In [ ]: # Write your code below and press Shift+Enter to execute


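One possible solution, shown on a toy dataframe (the column names here are illustrative stand-ins for the automobile columns):

```python
import pandas as pd

# Toy dataframe standing in for "df"
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# .columns holds the column labels as a pandas Index;
# wrap it in list() for a plain Python list
cols = list(df.columns)
print(cols)
```

On the automobile dataframe, df.columns returns the headers list we assigned earlier.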

Save Dataset
Pandas also enables us to save a dataset to csv. Using the dataframe.to_csv() method,
you pass the file path and name in quotation marks inside the parentheses.

For example, to save the dataframe df as automobile.csv on your local machine, you
may use the syntax below, where index=False means the row index will not be written.

df.to_csv("automobile.csv", index=False)

We can also read and save other file formats. We can use similar functions like pd.read_csv()
and df.to_csv() for other data formats. The functions are listed in the following table:

Read/Save Other Data Formats

Data Format    Read               Save
csv            pd.read_csv()      df.to_csv()
json           pd.read_json()     df.to_json()
excel          pd.read_excel()    df.to_excel()
hdf            pd.read_hdf()      df.to_hdf()
sql            pd.read_sql()      df.to_sql()
...            ...                ...
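As a quick sketch of the read/save pairs above, here is a JSON round trip. It relies on the fact that to_json() with no path argument returns the JSON text instead of writing a file; the toy values are illustrative, not from the dataset:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path argument, to_json() returns the JSON text as a string
json_text = df.to_json()

# read_json() accepts a path or a file-like object such as StringIO
df2 = pd.read_json(StringIO(json_text))
print(df2)
```

The same pattern (write with df.to_<format>, read back with pd.read_<format>) applies to the other formats in the table.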

Basic Insight of Dataset


After reading data into a Pandas dataframe, it is time for us to explore the dataset.

There are several ways to obtain essential insights of the data to help us better understand our
dataset.

Data Types
Data has a variety of types.

The main types stored in Pandas dataframes are object, float, int, bool and datetime64. In order
to better learn about each attribute, it is always good for us to know the data type of each column.
In Pandas:

In [ ]: df.dtypes

A series with the data type of each column is returned.

In [ ]: # check the data type of data frame "df" by .dtypes


print(df.dtypes)

As shown above, the data types of "symboling" and "curb-weight" are int64 ,
"normalized-losses" is object , "wheel-base" is float64 , etc.

These data types can be changed; we will learn how to accomplish this in a later module.

Describe
If we would like to get a statistical summary of each column, e.g., count, column mean value,
column standard deviation, etc., we use the describe method:

dataframe.describe()

This method will provide various summary statistics, excluding NaN (Not a Number) values.

In [ ]: df.describe()

This shows the statistical summary of all numeric-typed (int, float) columns.

For example, the attribute "symboling" has 205 counts, the mean value of this column is 0.83, the
standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th
percentile is 2, and the maximum value is 3.

However, what if we would also like to check all the columns including those that are of type
object?

You can add an argument include = "all" inside the bracket. Let's try it again.

In [ ]: # describe all the columns in "df"


df.describe(include = "all")

Now it provides the statistical summary of all the columns, including object-typed attributes.

We can now see how many unique values there are, which is the top value, and the frequency of
the top value in the object-typed columns.

Some values in the table above show as "NaN". This is because those statistics are not
applicable to that column's type.

Question #3:
You can select the columns of a dataframe by indicating the name of each column. For
example, you can select three columns as follows:
dataframe[['column 1', 'column 2', 'column 3']]
Where "column" is the name of the column, you can apply the method ".describe()" to get the
statistics of those columns as follows:
dataframe[['column 1', 'column 2', 'column 3']].describe()
Apply the method ".describe()" to the columns 'length' and 'compression-ratio'.

In [ ]: # Write your code below and press Shift+Enter to execute


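One possible solution, sketched with toy numeric columns standing in for 'length' and 'compression-ratio' (the values are illustrative, not from the automobile dataset):

```python
import pandas as pd

# Toy numeric columns standing in for 'length' and 'compression-ratio'
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 176.6],
                   "compression-ratio": [9.0, 9.0, 10.0, 8.0]})

# Select the two columns with double brackets, then describe them
stats = df[["length", "compression-ratio"]].describe()
print(stats)
```

On the automobile dataframe, the same call is df[['length', 'compression-ratio']].describe().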

Info
Another method you can use to check your dataset is:

dataframe.info()

It provides a concise summary of your DataFrame.

This method prints information about a DataFrame, including the index dtype and columns,
non-null values, and memory usage.

In [ ]: # look at the info of "df"


df.info()

Excellent! You have just completed the Introduction Notebook!

Thank you for completing this lab!

Author
Joseph Santarcangelo (https://www.linkedin.com/in/joseph-s-50398b136/)

Other Contributors
Mahdi Noorian PhD (https://www.linkedin.com/in/mahdi-noorian-58219234/)

Bahare Talayian

Eric Xiao

Steven Dong

Parizad

Hima Vasudevan

Fiorella Wenver (https://www.linkedin.com/in/fiorellawever/)

Yi Yao (https://www.linkedin.com/in/yi-leng-yao-84451275/).

Change Log

Date (YYYY-MM-DD)    Version    Changed By    Change Description
2020-10-30           2.3        Lakshmi       Changed URL of the csv
2020-09-22           2.2        Nayef         Added replace() method to remove '?'
2020-09-09           2.1        Lakshmi       Made changes in info method of dataframe
2020-08-27           2.0        Lavanya       Moved lab to course repo in GitLab

© IBM Corporation 2020. All rights reserved.
