
{

"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "DgE0o3YHBw-n"
},
"source": [
"<center> <h1 style=\"background-color:orange; color:white\"><br>Exploratory
Data Analysis<br></h1></center>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w6lzj4kjDJWu"
},
"source": [
"# `Problem Statement:`\n",
"We have used Cars dataset from kaggle with features including make, model,
year, engine, and other properties of the car used to predict its price."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JpZPe8JBBw-y"
},
"source": [
"## `Importing the necessary libraries`\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "dl9ocdwHBw-2"
},
"outputs": [],
"source": [
"# import pandas as pd\n",
"# import numpy as np\n",
"# import seaborn as sns #visualisation\n",
"# import matplotlib.pyplot as plt #visualisation\n",
"# %matplotlib inline \n",
"# sns.set(color_codes=True)\n",
"# from scipy import stats\n",
"# import warnings\n",
"# warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K5JcLAN2Bw-7"
},
"source": [
"## `Load the dataset into dataframe`"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "Yc-ChymZBw_A"
},
"outputs": [],
"source": [
"## load the csv file \n",
"# df = "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "ZUd5Fl7jBw_C",
"outputId": "79c6280b-0909-4245-a805-9607cb59effa"
},
"outputs": [],
"source": [
"## print the head of the dataframe\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gi3_9poxrSjE"
},
"source": [
"Now we observe the each features present in the dataset.<br>\n",
"\n",
" `Make:` The Make feature is the company name of the Car.<br>\n",
"`Model:` The Model feature is the model or different version of Car
models.<br>\n",
"`Year:` The year describes the model has been launched.<br>\n",
"`Engine Fuel Type:` It defines the Fuel type of the car model.<br>\n",
"`Engine HP:` It's say the Horsepower that refers to the power an engine
produces.<br>\n",
"`Engine Cylinders:` It define the nos of cylinders in present in the
engine.<br>\n",
"`Transmission Type:` It is the type of feature that describe about the car
transmission type i.e Mannual or automatic.<br>\n",
"`Driven_Wheels:` The type of wheel drive.<br>\n",
"`No of doors:` It defined nos of doors present in the car.<br>\n",
"`Market Category:` This features tells about the type of car or which category
the car belongs. <br>\n",
"`Vehicle Size:` It's say about the about car size.<br>\n",
"`Vehicle Style:` The feature is all about the style that belongs to car.<br>\
n",
"`highway MPG:` The average a car will get while driving on an open stretch of
road without stopping or starting, typically at a higher speed.<br>\n",
"`city mpg:` City MPG refers to driving with occasional stopping and
braking.<br>\n",
"`Popularity:` It can refered to rating of that car or popularity of car.<br>\
n",
"`MSRP:` The price of that car.\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VQ9qn4PaBw_i"
},
"source": [
"## `Check the datatypes`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "OPozGraJBw_l",
"outputId": "b72042d2-5913-43d8-c78a-2101feea6294"
},
"outputs": [],
"source": [
"# Get the datatypes of each columns number of records in each column.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gFyzAJLIBw_n"
},
"source": [
"## `Dropping irrevalent columns`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZZ863Z4jBw_p"
},
"source": [
"If we consider all columns present in the dataset then unneccessary columns
will impact on the model's accuracy.<br>\n",
"Not all the columns are important to us in the given dataframe, and hence we
would drop the columns that are irrevalent to us. It would reflect our model's
accucary so we need to drop them. Otherwise it will affect our model.\n",
"\n",
"\n",
"The list cols_to_drop contains the names of the cols that are irrevalent, drop
all these cols from the dataframe.\n",
"\n",
"\n",
"`cols_to_drop = [\"Engine Fuel Type\", \"Market Category\", \"Vehicle Style\",
\"Popularity\", \"Number of Doors\", \"Vehicle Size\"]`\n",
"\n",
"These features are not neccessary to obtain the model's accucary. It does not
contain any relevant information in the dataset. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "oW5t3xE-Bw_p"
},
"outputs": [],
"source": [
"# initialise cols_to_drop\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "RJvrJS9-Bw_r",
"outputId": "69709257-f66a-41b3-f3e8-0cced7dbb28b"
},
"outputs": [],
"source": [
"# drop the irrevalent cols and print the head of the dataframe\n",
"# df = \n",
"\n",
"# print df head\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Jg4y0BS7Bw_s"
},
"source": [
"## `Renaming the columns`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aDciVmlRBw_t"
},
"source": [
"Now, Its time for renaming the feature to useful feature name. It will help to
use them in model training purpose.<br>\n",
"\n",
"We have already dropped the unneccesary columns, and now we are left with
useful columns. One extra thing that we would do is to rename the columns such that
the name clearly represents the essence of the column.\n",
"\n",
"The given dict represents (in key value pair) the previous name, and the new
name for the dataframe columns"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "LPr2b3NPBw_u"
},
"outputs": [],
"source": [
"# rename cols \n",
"# rename_cols = \n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "YpY0qGvIBw_v"
},
"outputs": [],
"source": [
"# use a pandas function to rename the current columns - \n",
"# df = \n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "3N1i99nYBw_v",
"outputId": "d4c5d762-55ef-4566-c6d3-374cc8f9160e"
},
"outputs": [],
"source": [
"# Print the head of the dataframe\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UgNExPnZBw_w"
},
"source": [
"## `Dropping the duplicate rows`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ozWzkdrSBw_x"
},
"source": [
"There are many rows in the dataframe which are duplicate, and hence they are
just repeating the information. Its better if we remove these rows as they don't
add any value to the dataframe. \n",
"\n",
"For given data, we would like to see how many rows were duplicates. For this,
we will count the number of rows, remove the dublicated rows, and again count the
number of rows."
]
},
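{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before counting rows manually, a quick sketch: pandas can also report the number of duplicated rows directly via df.duplicated(), as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: count duplicated rows directly\n",
"df.duplicated().sum()"
]
},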
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "drvQvYs2Bw_x",
"outputId": "a7e6f707-fab9-47f8-86c4-9cbd9f1b110f"
},
"outputs": [],
"source": [
"# number of rows before removing duplicated rows\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "LvwZZUruBw_x",
"outputId": "617daeb0-f1e8-46dd-9623-34dd5b4d3bdf"
},
"outputs": [],
"source": [
"# drop the duplicated rows\n",
"# df = \n",
"\n",
"# print head of df\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "Gg4hjGakBw_y",
"outputId": "a0f3f48c-7f23-4f2b-911b-57529b32663b"
},
"outputs": [],
"source": [
"# Count Number of rows after deleting duplicated rows\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q06o1NwrBw_z"
},
"source": [
"## `Dropping the null or missing values`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ddf1mIspBw_z"
},
"source": [
"Missing values are usually represented in the form of Nan or null or None in
the dataset.\n",
"\n",
"Finding whether we have null values in the data is by using the isnull()
function.\n",
"\n",
"There are many values which are missing, in pandas dataframe these values are
reffered to as np.nan. We want to deal with these values beause we can't use nan
values to train models. Either we can remove them to apply some strategy to replace
them with other values.\n",
"\n",
"To keep things simple we will be dropping nan values"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "s0MtVaYABw_z",
"outputId": "61fbc5cc-d21a-453c-8bf5-8ba42a7f553e"
},
"outputs": [],
"source": [
"# check for nan values in each columns\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "58N8lvWRlIVT"
},
"source": [
"As we can see that the HP and Cylinders have null values of 69 and 30. As
these null values will impact on models' accuracy. So to avoid the impact we will
drop the these values. As these values are small camparing with dataset that will
not impact any major affect on model accuracy so we will drop the values."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "TObFlN7xBw_0"
},
"outputs": [],
"source": [
"# drop missing values\n",
"# df = \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "q3tsOjvcBw_0",
"outputId": "067469f3-04d9-4894-f1e2-7ee4132a1d79"
},
"outputs": [],
"source": [
"# Make sure that missing values are removed\n",
"# check number of nan values in each col again\n",
"\n"
]
},
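{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, dropping is not the only option: missing values can instead be replaced (imputed). Below is a minimal sketch of that alternative, assuming we impute the numeric columns HP and Cylinders with their medians; it is left commented out since this notebook drops the rows instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternative to dropping (sketch only, not used further in this notebook):\n",
"# impute missing numeric values with the column median instead of dropping rows\n",
"# df['HP'] = df['HP'].fillna(df['HP'].median())\n",
"# df['Cylinders'] = df['Cylinders'].fillna(df['Cylinders'].median())"
]
},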
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "N0Ge8_yfBw_1",
"outputId": "88459604-4bba-434c-d5fb-6e81910b4b50"
},
"outputs": [],
"source": [
"#Describe statistics of df\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qBk8SZ29Bw_1"
},
"source": [
"## `Removing outliers`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tn5lLccGBw_2"
},
"source": [
"Sometimes a dataset can contain extreme values that are outside the range of
what is expected and unlike the other data. These are called outliers and often
machine learning modeling and model skill in general can be improved by
understanding and even removing these outlier values."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "2QnFqFbyBw_3",
"outputId": "b0a85d54-e5d7-4943-aec5-854695406cac"
},
"outputs": [],
"source": [
"## Plot a boxplot for 'Price' column in dataset. \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qCpI41VqBci9"
},
"source": [
"### **`Observation:`**<br>\n",
"\n",
"Here as you see that we got some values near to 1.5 and 2.0 . So these values
are called outliers. Because there are away from the normal values.\n",
"Now we have detect the outliers of the feature of Price. Similarly we will
checking of anothers features."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "lvDBhe4jBw_3",
"outputId": "6acf12e7-757f-4cbc-9020-d1d6a6e40564"
},
"outputs": [],
"source": [
"## PLot a boxplot for 'HP' columns in dataset\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-YWNqTn7GI-4"
},
"source": [
"### **`Observation:`**<br>\n",
"Here boxplots show the proper distribution of of 25 percentile and 75
percentile of the feature of HP."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "S9tucB8ABw_4"
},
"source": [
"print all the columns which are of int or float datatype in df. \n",
"\n",
"Hint: Use loc with condition"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "4uEumv0uBw_4",
"outputId": "c0c5515e-96dc-4e40-ca4b-e83c76ce7fad"
},
"outputs": [],
"source": [
"# print all the columns which are of int or float datatype in df.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pQOOqmvEBw_5"
},
"source": [
"### `Save the column names of the above output in variable list named 'l'`\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "PgJz8dtQBw_5"
},
"outputs": [],
"source": [
"# save column names of the above output in variable list\n",
"# l=\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3iAhdSFPBw_5"
},
"source": [
"## **`Outliers removal techniques - IQR Method`**\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4u67f7AzBw_6"
},
"source": [
"**Here comes cool Fact for you!**\n",
"\n",
"IQR is the first quartile subtracted from the third quartile; these quartiles
can be clearly seen on a box plot on the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eMW1PTL_Bw_6"
},
"source": [
"- Calculate IQR and give a suitable threshold to remove the outliers and save
this new dataframe into df2.\n",
"\n",
"Let us help you to decide threshold: Outliers in this case are defined as the
observations that are below (Q1 − 1.5x IQR) or above (Q3 + 1.5x IQR)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "G5EHp8JxBw_6"
},
"outputs": [],
"source": [
"## define Q1 and Q2\n",
"# Q1 = \n",
"# Q3 = \n",
"\n",
"# # define IQR (interquantile range) \n",
"# IQR = \n",
"\n",
"# # define df2 after removing outliers\n",
"# df2 = \n"
]
},
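{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the threshold in action, the sketch below prints the (Q1 − 1.5 × IQR) and (Q3 + 1.5 × IQR) bounds for the Price column; values outside this range are the ones treated as outliers. (The column name Price assumes the renaming done earlier.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: show the outlier bounds for a single column\n",
"lower = Q1['Price'] - 1.5 * IQR['Price']\n",
"upper = Q3['Price'] + 1.5 * IQR['Price']\n",
"print('Price outlier bounds:', lower, 'to', upper)"
]
},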
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# find the shape of df & df2\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "Ok1cLuSEBxAB",
"outputId": "40c55ded-4804-4ecb-b6ab-9795033207dd"
},
"outputs": [],
"source": [
"# find unique values and there counts in each column in df using value counts
function.\n",
"\n",
"# for i in df.columns:\n",
"# print (\"--------------- %s ----------------\" % i)\n",
"# # code here"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zQ0GaJ_kBxAB"
},
"source": [
"## `Visualising Univariate Distributions`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H0PQlhWEBxAC"
},
"source": [
"We will use seaborn library to visualize eye catchy univariate plots. \n",
"\n",
"Do you know? you have just now already explored one univariate plot. guess
which one? Yeah its box plot.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SnzpC8JABxAC"
},
"source": [
"### `Histogram & Density Plots`\n",
"\n",
"Histograms and density plots show the frequency of a numeric variable along
the y-axis, and the value along the x-axis. The ```sns.distplot()``` function plots
a density curve. Notice that this is aesthetically better than vanilla
```matplotlib```."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "-uqWiICoBxAC",
"outputId": "47e45800-1103-40e0-e407-93977635ea53"
},
"outputs": [],
"source": [
"#ploting distplot for variable HP\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1GSaLnCxiWHc"
},
"source": [
"### **`Observation:`**\n",
"We plot the Histogram of feature HP with help of distplot in seaborn.<br> \n",
"In this graph we can see that there is max values near at 200. similary we
have also the 2nd highest value near 400 and so on. <br>\n",
"It represents the overall distribution of continuous data variables.<br>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-P7Xup3vBxAD"
},
"source": [
"Since seaborn uses matplotlib behind the scenes, the usual matplotlib
functions work well with seaborn. For example, you can use subplots to plot
multiple univariate distributions.\n",
"- Hint: use matplotlib subplot function"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "CdlvvfvfBxAD",
"outputId": "23484911-5553-41bd-cdf6-8bd38a526ce7"
},
"outputs": [],
"source": [
"# plot all the columns present in list l together using subplot of dimention
(2,3).\n",
"\n",
"\n",
"# c=0\n",
"# plt.figure(figsize=(15,10))\n",
"# for i in l:\n",
"# # code here\n",
"# plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ziOcNh-sBxAD"
},
"source": [
"## `Bar Chart Plots`\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lF54VPLRBxAE"
},
"source": [
"Plot a histogram depicting the make in X axis and number of cars in y axis.
<br>"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "d1gpl5LxBxAE",
"outputId": "726eae7f-c413-456a-e989-960d43a9c89b"
},
"outputs": [],
"source": [
"# plt.figure(figsize = (12,8))\n",
"\n",
"# use nlargest and then .plot to get bar plot like below output\n",
"# Plot Title, X & Y label\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N-8CXMKVkn-I"
},
"source": [
"### **`Observation:`**\n",
"In this plot we can see that we have plot the bar plot with the cars model and
nos. of cars."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Xk2s0-9UBxAE"
},
"source": [
"### `Count Plot`\n",
"A count plot can be thought of as a histogram across a categorical, instead of
quantitative, variable.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OmT9X5aBBxAF"
},
"source": [
" Plot a countplot for a variable Transmission vertically with hue as Drive
mode"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"id": "UyYYXn36BxAF",
"outputId": "24b59852-4612-4065-cf6e-29b02c259565"
},
"outputs": [],
"source": [
"# plt.figure(figsize=(15,5))\n",
"\n",
"# plot countplot on transmission and drive mode\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9I0XvhdTla4h"
},
"source": [
"### **`Observation:`**\n",
"In this count plot, We have plot the feature of Transmission with help of
hue.<br>\n",
"We can see that the the nos of count and the transmission type and automated
manual is plotted. Drive mode as been given with help of hue.<br>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zDHMfUpNBxAF"
},
"source": [
"# `Visualising Bivariate Distributions`\n",
"\n",
"\n",
"Bivariate distributions are simply two univariate distributions plotted on x
and y axes respectively. They help you observe the relationship between the two
variables.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DQxcdTZsBxAG"
},
"source": [
"## `Scatter Plots`\n",
"Scatterplots are used to find the correlation between two continuos
variables.\n",
"\n",
"Using scatterplot find the correlation between 'HP' and 'Price' column of the
data. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"id": "L5zvuQD8BxAG",
"outputId": "6cc2ef16-7039-4eaa-df3f-7bdd6b4e5c80"
},
"outputs": [],
"source": [
"## Your code here - \n",
"# fig, ax = plt.subplots(figsize=(10,6))\n",
"\n",
"# plot scatterplot on hp and price\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kPLqA4B6o92w"
},
"source": [
"### **`Observation:`**<br>\n",
"It is a type of plot or mathematical diagram using Cartesian coordinates to
display values for typically two variables for a set of data.<br>\n",
"We have plot the scatter plot with x axis as HP and y axis as Price.<br>\n",
"The data points between the features should be same either wise it give
errors.<br>\n"
]
},
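{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scatter plot shows the relationship visually; to put a number on it, the correlation coefficient between the two columns can be computed directly with pandas, as sketched below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: quantify the correlation seen in the scatter plot\n",
"df['HP'].corr(df['Price'])"
]
},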
{
"cell_type": "markdown",
"metadata": {
"id": "HEUOARh5BxAN"
},
"source": [
"## `Plotting Aggregated Values across Categories`\n",
"\n",
"\n",
"### `Bar Plots - Mean, Median and Count Plots`\n",
"\n",
"\n",
"\n",
"Bar plots are used to **display aggregated values** of a variable, rather than
entire distributions. This is especially useful when you have a lot of data which
is difficult to visualise in a single figure. \n",
"\n",
"For example, say you want to visualise and *compare the Price across
Cylinders*. The ```sns.barplot()``` function can be used to do that.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"id": "dTSOpY5jBxAN",
"outputId": "13ca613f-edab-42d8-819d-84cc5b566ee2"
},
"outputs": [],
"source": [
"# bar plot with default statistic=mean between Cylinder and Price\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rFd9QisOBxAO"
},
"source": [
"### **`Observation:`**<br>\n",
"By default, seaborn plots the mean value across categories, though you can
plot the count, median, sum etc.<br>\n",
"Also, barplot computes and shows the confidence interval of the mean as well.\
n",
"\n"
]
},
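{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of plotting a statistic other than the mean, the sketch below passes numpy's median as the estimator to sns.barplot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: bar plot with the median instead of the default mean\n",
"sns.barplot(x='Cylinders', y='Price', data=df, estimator=np.median)\n",
"plt.show()"
]
},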
{
"cell_type": "markdown",
"metadata": {
"id": "od8Fuqm_BxAO"
},
"source": [
"## `When you want to visualise having a large number of categories, it is
helpful to plot the categories across the y-axis.`\n",
"\n",
"### `Let's now drill down into Transmission sub categories.`"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"id": "lJnPU4KtBxAP",
"outputId": "2dfa446f-874f-435f-dba0-a17f30f34718"
},
"outputs": [],
"source": [
"# Plotting categorical variable Transmission across the y-axis\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q5Y7xg3ZBxAQ"
},
"source": [
"These plots looks beutiful isn't it? In Data Analyst life such charts are
there unavoidable friend.:)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QX2szH0MBxAQ"
},
"source": [
"# `Multivariate Plots`\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_wiepyZEBxAT"
},
"source": [
"## `Heatmaps`\n",
"\n",
"\n",
"A heat map is a two-dimensional representation of information with the help of
colors. Heat maps can help the user visualize simple or complex information"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VslkQJNWBxAU"
},
"source": [
"Using heatmaps plot the correlation between the features present in the
dataset."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "DWpcsVJCBxAU",
"outputId": "dae92aaa-5a7f-4acf-8082-03555340ee16"
},
"outputs": [],
"source": [
"#find the correlation of features of the data \n",
"# corr = \n",
"\n",
"# print corr\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "rDqYeuI1BxAW",
"outputId": "e20f0d9a-e76f-4f59-8ebb-11047156049d"
},
"outputs": [],
"source": [
"# Using the correlated df, plot the heatmap \n",
"# set cmap = 'BrBG', annot = True - to get the same graph as shown below \n",
"# set size of graph = (12,8)\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-uMl7P-DBxAX"
},
"source": [
"### **`Observation:`**<br>\n",
"A heatmap contains values representing various shades of the same colour for
each value to be plotted. Usually the darker shades of the chart represent higher
values than the lighter shade. For a very different value a completely different
colour can also be used.\n",
"\n",
"\n",
"The above heatmap plot shows correlation between various variables in the
colored scale of -1 to 1. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
