
PROCEDURE: (DOWNLOAD / INSTALLATION - Windows)

1. Python version 3.7 must be installed beforehand.


2. To check whether Python exists,
Go to Search, type cmd; the Command Prompt appears. Type the commands given below:
>>> python -V
//python --version
Python 3.7.8rc1
>>>python -m pip install numpy
(If already installed, a message will be prompted as "Requirement already satisfied"; otherwise
the installation will continue and complete with a success message.)
>>>python -m pip install scipy
>>>python -m pip install statsmodels
>>>python -m pip install jupyter
>>>python -m pip install pandas

Note:

pip - the Package Installer for Python - is the de facto and recommended package management
system for Python, used to install and manage software packages. It connects to an online
repository of public packages called the Python Package Index (PyPI).
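
After installation, a package can be verified with pip show, which prints the installed version and location; for example:
>>>python -m pip show numpy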

Package installation: NumPy, SciPy, Jupyter, StatsModel, Pandas


RESULT:
PROCEDURE:
If we have Python and PIP already installed on a system, then installation of NumPy is very easy.
Installation – NumPy package:

C:\Users\User>pip install numpy


Once NumPy is installed, import it in your applications by adding the import keyword:
>>> import numpy as np

ARRAY CREATION:

Single-dimensional NumPy Array:


>>> import numpy as np
>>> a=np.array([1,2,3])
>>> print(a)
[1 2 3]

Multi-dimensional Numpy Array:

>>> a=np.array([(1,2,3),(4,5,6)])
>>> print(a)
[[1 2 3]
 [4 5 6]]
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5])
>>>print(arr)
[1 2 3 4 5]
>>>print(type(arr))
<class 'numpy.ndarray'>
>>>a = np.array(42)
>>>b = np.array([1, 2, 3, 4, 5])
>>>c = np.array([[1, 2, 3], [4, 5, 6]])
>>>d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
>>>print(a.ndim)
0
>>>print(b.ndim)
1
>>>print(c.ndim)
2
>>>print(d.ndim)
3

ARRAY INDEXING:

Array indexing is the same as accessing an array element. We can access an array element by
referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first
element has index 0, and the second has index 1 etc.

>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[0])
1
>>>print(arr[2])
3
>>>print(arr[4])
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
print(arr[4])
IndexError: index 4 is out of bounds for axis 0 with size 4
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[2] + arr[3])
7
>>>arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
>>>print('2nd element on 1st row: ', arr[0, 1])
2nd element on 1st row: 2
>>>print('5th element on 2nd row: ', arr[1, 4])
5th element on 2nd row: 10
>>>print('Last element from 2nd dim: ', arr[1, -1])
Last element from 2nd dim: 10
>>>arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
>>>print(arr[0, 1, 2])
6

ARRAY SLICING:

Slicing in Python means retrieving elements from one given index to another given index.

• We pass a slice instead of an index, like this: [start:end].

• We can also define the step, like this: [start:end:step].

If we don't pass start, it is considered 0. If we don't pass end, it is considered the length of the array in that
dimension. If we don't pass step, it is considered 1.

>>>arr = np.array([1, 2, 3, 4, 5, 6, 7])


>>>print(arr[1:5])
[2 3 4 5]
>>>print(arr[4:])
[5 6 7]
>>>print(arr[:4])
[1 2 3 4]
>>>print(arr[-3:-1])
[5 6]
>>>print(arr[1:5:2])
[2 4]
>>>print(arr[::2])
[1 3 5 7]
>>>print(arr[1, 1:4])
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
print(arr[1, 1:4])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
>>>arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
>>>print(arr[0:2, 2])
[3 8]
>>>print(arr[0:2, 1:4])
[[2 3 4]
[7 8 9]]
ARRAY SHAPE / RESHAPE:
Array Shape - NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.

import numpy as np
>>>arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>>print(arr.shape)
(2, 4)

Array Reshape - By reshaping we can add or remove dimensions or change the number of elements
in each dimension.

#Converting a 1d array to 2d
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>>newarr = arr.reshape(4, 3)
>>>print(newarr)
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
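
Reshaping can also flatten an array back to one dimension; passing -1 lets NumPy infer the dimension size. A minimal sketch:

#Converting a 2d array back to 1d
>>>flat = newarr.reshape(-1)
>>>print(flat)
[ 1  2  3  4  5  6  7  8  9 10 11 12]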
ARRAY ITERATION:
Iterating means looping through the elements of an array one by one.
>>>import numpy as np
>>> arr = np.array([1, 2, 3])
>>> for x in arr:
    print(x)

1
2
3

ARRAY JOINING:
Joining is the process of combining contents of two or more arrays in a single array.
>>>import numpy as np
>>>arr1 = np.array([1, 2, 3])
>>>arr2 = np.array([4, 5, 6])
>>>arr = np.concatenate((arr1, arr2))
>>>print(arr)
[1 2 3 4 5 6]

ARRAY SPLITTING:
Splitting is the reverse operation of joining. Splitting breaks one array into multiple
subarrays.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6])
>>>newarr = np.array_split(arr,3)
>>>print(newarr)
[array([1, 2]), array([3, 4]), array([5, 6])]
>>>print(np.array_split(arr,5))
[array([1, 2]), array([3]), array([4]), array([5]), array([6])]
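
Note: np.array_split() handles splits that do not divide the array evenly, as above; the stricter np.split() requires an exact division and raises an error otherwise. A minimal sketch:

>>>print(np.split(arr, 3))
[array([1, 2]), array([3, 4]), array([5, 6])]
>>>print(np.split(arr, 5))
ValueError: array split does not result in an equal division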
ARRAY SORTING:
Sorting is the process of arranging elements in an ordered sequence, either ascending or
descending.
>>>import numpy as np
#sorting numbers in ascending order
>>>arr = np.array([3, 2, 0, 1])
>>>print(np.sort(arr))
[0 1 2 3]
#sorting in alphabetical order
>>>arr = np.array(['banana', 'cherry', 'apple'])
>>>print(np.sort(arr))
['apple' 'banana' 'cherry']
SEARCHING ARRAYS:
Searching an array for a certain value returns the indexes where a match is found. To search an array, use the
where() method.
Find the indexes where the value is 4:
>>>arr = np.array([1, 2, 3, 4, 5, 4, 4])
>>>x = np.where(arr == 4)
>>>print(x)
(array([3, 5, 6], dtype=int32),)
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
>>>x = np.where(arr%2 == 0)
>>>print(x)
(array([1, 3, 5, 7], dtype=int32),)
>>>x = np.where(arr%2 == 1)
>>>print(x)
(array([0, 2, 4, 6], dtype=int32),)
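
There is also a method searchsorted(), which performs a binary search in a sorted array and returns the index where a given value should be inserted to keep the array sorted. A minimal sketch:

>>>arr = np.array([6, 7, 8, 9])
>>>print(np.searchsorted(arr, 7))
1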

DATA TYPES:
NumPy has some extra data types and refers to data types with one-character codes, like i for
integers, u for unsigned integers, etc. Below is a list of all data types in NumPy and the
characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr.dtype)
int32
>>>arr = np.array(['apple', 'banana', 'cherry'])
>>>print(arr.dtype)
<U6
>>>arr = np.array([1, 2, 3, 4], dtype='S')
>>>print(arr)
[b'1' b'2' b'3' b'4']
>>>print(arr.dtype)
|S1
>>>arr = np.array([1, 2, 3, 4], dtype='i4')
>>>print(arr)
[1 2 3 4]
>>>print(arr.dtype)
int32
>>>arr = np.array(['a', '2', '3'], dtype='i')
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
arr = np.array(['a', '2', '3'], dtype='i')
ValueError: invalid literal for int() with base 10: 'a'
>>>arr = np.array([1, 0, 3])
>>>newarr = arr.astype(bool)
>>>print(newarr)
[ True False True]
>>>print(newarr.dtype)
bool

RESULT:
Create a simple Pandas DataFrame

import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

calories duration

0 420 50
1 380 40
2 390 45

Pandas uses the loc attribute to return one or more specified row(s)

Return row 0:
#refer to the row index:
print(df.loc[0])

calories 420
duration 50

Name: 0, dtype: int64

Return rows 0 and 1:

#use a list of indexes:
print(df.loc[[0, 1]])

calories duration

0 420 50
1 380 40

Named Indexes:

With the index argument, we can name our own indexes.

import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

calories duration

day1 420 50
day2 380 40
day3 390 45
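
With named indexes, loc returns rows by their index label; for example:

#refer to the named index:
print(df.loc["day2"])

calories 380
duration 40

Name: day2, dtype: int64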

Load Files into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame. Load a comma separated
file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
iso_code ... excess_mortality_cumulative_per_million
0 AFG ... NaN
1 AFG ... NaN
2 AFG ... NaN
3 AFG ... NaN
4 AFG ... NaN
... ... ... ...
166321 ZWE ... NaN
166322 ZWE ... NaN
166323 ZWE ... NaN
166324 ZWE ... NaN
166325 ZWE ... NaN
[166326 rows x 67 columns]

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
a b c
first 1 2 NaN
second 5 10 20.0

Creating a DataFrame using List:


DataFrame can be created using a single list or a list of lists.

import pandas as pd
# list of strings
lst = ['Pandas', 'SciPy', 'DataFrames', 'NumPy', 'Analytics']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
0 Pandas
1 SciPy
2 DataFrames
3 NumPy
4 Analytics

Creating a DataFrame from a dict of ndarrays/lists: To create a DataFrame from a dict of
ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, then the length of the index
should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n),
where n is the array length.

import pandas as pd
# intialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18

Column Selection: In order to select a column in a Pandas DataFrame, we can access the
columns by calling them by their column names.

import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],'Age':[27, 24, 22, 32],'Address':['Delhi',
'Kanpur', 'Allahabad', 'Kannauj'],'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
Row Selection: Pandas provides dedicated methods to retrieve rows from a Data
frame. DataFrame.loc[] is used to retrieve rows by index label. Rows can also be
selected by passing an integer location to the iloc[] indexer, as in the sketch below.
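
A minimal iloc[] sketch on a small hypothetical DataFrame (for illustration):

import pandas as pd
df = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})
# iloc selects rows by integer position, regardless of index labels
print(df.iloc[0])    # first row as a Series
print(df.iloc[0:2])  # first two rows as a DataFrame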

File used: country.csv

import pandas as pd
data = pd.read_csv("country.csv", index_col ="iso_code")
first = data.loc["AFG"]
second = data.loc["NOR"]
print(first, "\n\n\n", second)

iso_code continent location date total_cases


AFG Asia Afghanistan 2/24/2020 5
AFG Asia Afghanistan 2/25/2020 5
AFG Asia Afghanistan 2/26/2020 5
AFG Asia Afghanistan 2/27/2020 5
AFG Asia Afghanistan 2/28/2020 5
AFG Asia Afghanistan 2/29/2020 5

iso_code continent location date total_cases


NOR Europe Norway 10/1/2021 189915
NOR Europe Norway 10/2/2021 190224
NOR Europe Norway 10/3/2021 190533
NOR Europe Norway 10/4/2021 191017
NOR Europe Norway 10/5/2021 191599
NOR Europe Norway 10/6/2021 192079
NOR Europe Norway 10/7/2021 192587

Indexing a DataFrame using indexing operator []:


The indexing operator refers to the square brackets following an object.
The .loc and .iloc indexers also use the indexing operator to make selections. In this section, the indexing
operator refers to df[].

Working with Missing Data:


Missing data can occur when no information is provided for one or more items or for a
whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also
referred to as NA (Not Available) values in pandas.

isnull() and notnull():


Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas
Series in order to find null values in a series.

import pandas as pd
import numpy as np
dict = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df.isnull())

First Score Second Score Third Score

0 False False True


1 False False False
2 True False False
3 False True False
fillna(), replace() and interpolate():

All these functions help in filling null values in a DataFrame. The interpolate() function is
basically used to fill NA values in the DataFrame, but it uses various interpolation techniques to fill the
missing values rather than hard-coding a value.

import pandas as pd
import numpy as np
dict = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df.fillna(0))

First Score Second Score Third Score


0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
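
replace() and interpolate() work the same way on this DataFrame; a minimal sketch (interpolate() performs linear interpolation by default, estimating a missing value from its neighbours instead of hard-coding one):

#replace all NaN values with -99
print(df.replace(to_replace=np.nan, value=-99))
#fill NaN values by interpolating between the surrounding values
print(df.interpolate(method='linear', limit_direction='forward'))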

Iterating over rows and columns:

A Pandas DataFrame consists of rows and columns, so in order to iterate over a DataFrame we
iterate over it much like a dictionary. To iterate over rows we can use the functions
iterrows() and itertuples(); iteritems() (items() in newer pandas) iterates over columns instead. An iteration
sketch follows the example below.

import pandas as pd
dict = {'name': ["aparna", "pankaj", "sudhir", "Geeku"], 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 'score': [90, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df)

name degree score


0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98
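
The rows of this DataFrame can then be iterated; a minimal sketch using iterrows(), which yields (index, row) pairs:

for index, row in df.iterrows():
    print(index, row['name'], row['score'])

0 aparna 90
1 pankaj 40
2 sudhir 80
3 Geeku 98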

RESULT:
PROCEDURE:

CASE 1: READING DATA FROM EXCEL/CSV FILE

We will use the Pandas library to load the Iris data set CSV file and convert it into a DataFrame, using the
read_csv() method, which reads CSV files.

import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]

# Printing top 5 rows

print(df.head())

sepallength sepalwidth petallength petalwidth class


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

#Use the shape attribute to get the shape of the dataset.


print(df.shape)
(150, 5)
#To view the columns and their data types
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149

Data columns (total 5 columns):


# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepallength 150 non-null float64
1 sepalwidth 150 non-null float64
2 petallength 150 non-null float64
3 petalwidth 150 non-null float64
4 class 150 non-null object

dtypes: float64(4), object(1)


memory usage: 6.0+ KB
None
The describe() function applies basic statistical computations to the dataset, like extreme values,
count of data points, standard deviation, etc. Any missing or NaN value is automatically skipped.
The describe() function gives a good picture of the distribution of the data.
print(df.describe())
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

CASE 2: READING DATA FROM TEXT FILE


file1 = open("/content/sample_data/Basics-Python.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
Output of Read function is
Python is a very popular general-purpose interpreted, interactive, object
oriented, and high-level programming language. Python is dynamically-typed
and garbage-collected programming language. It was created by Guido van
Rossum during 1985- 1990. Like Perl, Python source code is also available
under the GNU General Public License (GPL).

Python is consistently rated as one of the world's most popular programming


languages. Python is fairly easy to learn, so if you are starting to learn
any programming language then Python could be your great choice. Today
various Schools, Colleges and Universities are teaching Python as their
primary programming language. There are many other good reasons which makes
Python as the top choice of any programmer:

Python is Open Source which means its available free of cost.


Python is simple and so easy to learn
Python is versatile and can be used to create many different things.
Python has powerful development libraries include AI, ML etc.
Python is much in demand and ensures high salary

CASE 3: READING DATA FROM WEB

//To download a file from the web using the wget module


# wget module to be installed

pip install wget


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab
wheels/public/simple/
Collecting wget
Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
Building wheel for wget (setup.py) ... done
Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9674
sha256=c0e498fded138e8bf764bbcda6a413bfac3d6338f40f4be9b5ce9384baa4c957
Stored in directory:
/root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c1
3e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
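
Once installed, a file can be downloaded with wget.download(), which saves the file and returns its local filename; a minimal sketch (the URL here is hypothetical):

import wget
url = "https://example.com/sample_data/data.csv"   # hypothetical URL, for illustration
filename = wget.download(url)
print(filename)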
Descriptive Statistics - used to understand your data by calculating various statistical
values for the given numeric variables. For any given data, our approach is to understand it and calculate
various statistical values; this helps us identify which statistical tests can be applied to the
data.
Under descriptive statistics we can calculate the following values:
1. Central tendency - mean, median, mode
2. Dispersion - variance, standard deviation, range, interquartile range (IQR)
3. Skewness - symmetry of data about the mean value
4. Kurtosis - peakedness of data at the mean value
We have system-defined functions to get these values for any given dataset.
# Changing the column headers in Iris dataset
import pandas as pd
import numpy as np
df = pd.read_csv("/content/sample_data/Iris.csv")
data=pd.DataFrame(df,columns=list("ABCDE"))
print(data)
A B C D E
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
.. .. .. .. .. ..
145 NaN NaN NaN NaN NaN
146 NaN NaN NaN NaN NaN
147 NaN NaN NaN NaN NaN
148 NaN NaN NaN NaN NaN
149 NaN NaN NaN NaN NaN

[150 rows x 5 columns]
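
Note: passing columns=list("ABCDE") to the DataFrame constructor selects columns with those names; since the Iris file has no columns named A to E, every value comes out NaN, as shown above. To actually rename the headers so that the calls below work, assign the new names to df.columns (a minimal sketch):

df.columns = list("ABCDE")   # rename the five columns in place
data = df
print(data.head())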

1. Calculating Central Tendency


data['A'].mean()
data['A'].median()
data['A'].mode()
#mean - the average value of the given numeric values
#median - the middle-most value of the given values
#mode - the most frequently occurring value of the given numeric variable
# Mean, Median, Mode on Iris dataset
print(df)

sepallength sepalwidth petallength petalwidth class


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]


df['sepallength'].mean()
5.843333333333334
df['sepalwidth'].median()
3.0
df['petalwidth'].mode()
0
0.2
dtype: float64
df['class'].mode()
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
dtype: object

2. Dispersion
Dispersion describes the variation present in a given variable, i.e. how close to or far from the
mean the values lie.
Variance - gives the average squared deviation from the mean value
Standard Deviation - the square root of the variance
Range - the difference between the max and min values
InterQuartile Range (IQR) - the difference between Q3 and Q1, where Q3 is the 3rd
quartile value and Q1 is the 1st quartile value.
data['A'].var()
data['A'].std()
data['A'].max()-data['A'].min()
data['A'].quantile([.25,.5,.75])
df["sepalwidth"].var()
0.1880040268456376
df["sepallength"].std()
0.4335943113621737
df["sepallength"].max()-df["sepalwidth"].min()
5.9
df["petalwidth"].quantile([.25,.5,.75])
0.50 1.3
0.75 1.8
Name: petalwidth, dtype: float64

3. Skewness
Skewness measures the symmetry of data about the mean value. Symmetry
means an equal distribution of observations above and below the mean.
skewness = 0: the data is symmetric about the mean.
skewness = Negative: the data is not symmetric and the left-side tail is longer than the right-side tail of the
density plot.
skewness = Positive: the data is not symmetric and the right-side tail is longer than the left-side tail of the
density plot.
We can find the skewness of a given variable with the function below.
data['A'].skew()
df["sepallength"].skew()
0.3149109566369728
df["sepalwidth"].skew()
0.3340526621720866
df["class"].skew()
ValueError: could not convert string to float: 'Iris-setosa'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/nanops.py in _f(*args,
**kwargs)
99 # object arrays that contain strings
100 if is_object_dtype(args[0]):
--> 101 raise TypeError(e) from e
102 raise
103
TypeError: could not convert string to float: 'Iris-setosa

4. Kurtosis
Kurtosis is used to define the peakedness (or flatness) of a density plot (normal distribution
plot). Dr. Wheeler defines kurtosis as: "The kurtosis parameter is a measure of the
combined weight of the tails relative to the rest of the distribution." This means we measure the tail
heaviness of a given distribution.

kurtosis = 0: the peakedness of the graph is equal to that of the normal distribution.
kurtosis = Negative: the peakedness of the graph is less than the normal distribution (flatter plot).
kurtosis = Positive: the peakedness of the graph is more than the normal distribution (more peaked plot).
We can find the kurtosis of a given variable with the function below.
data['A'].kurt()
df["sepalwidth"].kurt()
0.2907810623654279
df["sepallength"].kurt()
-0.5520640413156395
Let us see the graphical representation of a given variable and interpret the skewness and
peakedness of its distribution.
import seaborn as sns
sns.distplot(df["sepallength"], hist=True, kde=True)
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure
level function with similar flexibility) or `histplot` (an axes-level
function for histograms).
warnings.warn(msg, FutureWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa94e2957d0>
Density plot of variable 'sepallength'

In the above graph, we can clearly see that the left and right sides of the plot are roughly equally
distributed, so the data is close to symmetric. The histogram peak sits close to the density line with a
relatively flat plot, which means the kurtosis of this distribution is near normal.

Checking Missing Values

Missing values can occur when no information is provided for one or more items or for a whole unit.
We will use the isnull() method.

df.isnull().sum()
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64

Checking Duplicates

Let’s see if our dataset contains any duplicates or not. The Pandas drop_duplicates() method
helps in removing duplicates from the data frame.
#interactive table view
data = df.drop_duplicates(subset ="class",)
data
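
To count duplicate rows without dropping them, duplicated() can be used; a minimal sketch:

print(df.duplicated().sum())   # number of fully duplicated rows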

df.value_counts("sepalwidth")
sepalwidth
3.0 26
2.8 14
3.2 13
3.4 12
3.1 12
2.9 10
2.7 9
2.5 8
3.3 6
3.5 6
3.8 6
2.6 5
2.3 4
2.4 3
2.2 3
3.6 3
3.7 3
3.9 2
4.1 1
4.2 1
2.0 1
4.0 1
4.4 1
dtype: int64
Data Visualization

Visualizing the target column - Our target column will be the sepalwidth column because, in the end, we
need the result according to sepalwidth only. Let’s see a countplot of it. (We will use the Matplotlib
and Seaborn libraries for the data visualization.)

import seaborn as sns


import matplotlib.pyplot as plt
sns.countplot(x="sepalwidth", data=df, )
plt.show()

Comparing Sepal Length and Sepal Width

import seaborn as sns


import matplotlib.pyplot as plt
sns.scatterplot(x="sepallength", y="sepalwidth",hue="class", data=df, )
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
Histograms

Histograms let us see the distribution of data for various columns. They can be used for univariate as well as
bivariate analysis.

import seaborn as sns


import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df["sepallength"], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df["sepalwidth"], bins=5);
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df["petallength"], bins=6);
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df["petalwidth"], bins=6);

Output:
• The highest bar for sepal length (a frequency between 30 and 35) falls between 5.5 and 6.
• The highest bar for sepal width (a frequency of around 70) falls between 3.0 and 3.5.
• The highest bar for petal length (a frequency of around 50) falls between 1 and 2.
• The highest bar for petal width (a frequency between 40 and 50) falls between 0.0 and 0.5.

RESULT:
PROCEDURE:

(5a) Univariate Analysis - Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis

import pandas as pd
import numpy as np
df = pd.read_csv("diabetes.csv")
print(df)

Age Gender Polyuria ... Alopecia Obesity class


0 40 Male No ... Yes Yes Positive
1 58 Male No ... Yes No Positive
2 41 Male Yes ... Yes No Positive
3 45 Male No ... No No Positive
4 60 Male Yes ... Yes Yes Positive
.. ... ... ... ... ... ... ...
515 39 Female Yes ... No No Positive
516 48 Female Yes ... No No Positive
517 58 Female Yes ... No Yes Positive
518 32 Female No ... Yes No Negative
519 42 Male No ... No No Negative
[520 rows x 17 columns]

>>>print(df['Age'].mean())
48.02884615384615
>>>print(df['Age'].median())
47.5
>>>print(df['Age'].mode())
0 35
dtype: int64
>>>print(df["Age"].var())
147.65812583370388
>>>print(df["Age"].std())
12.151465995249458
>>>print(df["Age"].skew())
0.3293593578272701
>>>print(df["Age"].kurt())
-0.19170941407070163

Data-Visualization:(pima-diabetes.csv)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath = 'pima-diabetes.csv'
df = pd.read_csv(filepath)
Data_X= df.copy(deep=True)
Data_X= Data_X.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
#Plotting Histogram of Data
Data_X.hist(bins=40)
plt.show()
(5b) Bivariate Analysis – Linear and Logistic Regression

Simple Linear Regression - It is an approach for predicting a response using a single feature. It is
assumed that the two variables are linearly related, so we try to find a linear function that predicts the
response value (y) as accurately as possible as a function of the feature or independent variable (x). Let us
consider a dataset where we have a response value y for every feature x (the example arrays x and y appear in the source code below).

Now, the task is to find the line that best fits these points, so that we can predict the response
for any new feature value (i.e. a value of x not present in the dataset). This line is called the regression line.
The equation of the regression line is represented as: h(x_i) = b_0 + b_1*x_i
Here,
• h(x_i) represents the predicted response value for the ith observation.
• b_0 and b_1 are the regression coefficients and represent the y-intercept and slope of the regression line
respectively.
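
For least-squares estimation the coefficients follow from the deviations about the means; this is exactly what the estimate_coef() function below computes:

SS_xy = Σ(x_i * y_i) - n * x̄ * ȳ
SS_xx = Σ(x_i * x_i) - n * x̄ * x̄
b_1 = SS_xy / SS_xx
b_0 = ȳ - b_1 * x̄

where x̄ and ȳ are the means of x and y, and n is the number of observations.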

SOURCE CODE:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
And the graph obtained looks like this:
Logistic Regression:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
filepath = 'pima-diabetes.csv'
df = pd.read_csv(filepath)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
print("Accuracy ", LR.score(X_test, y_test)*100)
sns.set(font_scale=1.5)
cm = confusion_matrix(y_pred, y_test)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()
(5c) Multiple Regression Analysis
import pandas as pd
from sklearn import linear_model
df = pd.read_csv("pima-diabetes.csv")
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#Predict age based on Glucose and BloodPressure
predictedage = regr.predict([[185, 145]])
print(predictedage)
Output:
[48.13025197]
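
The fitted coefficients and intercept can also be inspected to see how much the predicted Age changes per unit of each feature; a minimal sketch:

print(regr.coef_)        # one coefficient each for Glucose and BloodPressure
print(regr.intercept_)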

(5d) Comparative Analysis


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath = 'pima-diabetes.csv'
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Outcome'] == 0]["Pregnancies"], color='green') # Healthy - green
sns.distplot(df[df['Outcome'] == 1]["Pregnancies"], color='red') # Diabetic - Red
plt.title('Healthy vs Diabetic by Pregnancy', fontsize=15)
plt.xlim([-5,20])
plt.grid(linewidth = 0.7)
plt.show()

From the above graph, we can infer that pregnancy is not a likely cause of diabetes, as the distributions for
the Healthy and Diabetic groups are almost the same.
//diabetes.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath = 'diabetes.csv'
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Gender'] == 'Male']["Age"], color='green')
sns.distplot(df[df['Polyuria'] == 'No']["Age"], color='red')
plt.title('Male vs Polyuria by Age', fontsize=15)
plt.xlim([0, 100])  # cover the Age range (the earlier [-5, 20] limit suits Pregnancies, not Age)
plt.grid(linewidth = 0.7)
plt.show()

RESULT:
SOURCE CODE:

# Normal Curve
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x = np.sort(data.Glucose[0:50])   # sort the values so the pdf plots as a smooth curve
mean = st.mean(x)
sd = st.stdev(x)
pyplot.plot(x, norm.pdf(x, mean, sd))
pyplot.title("Normal plot")
pyplot.show()

OUTPUT:

#density plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
OUTPUT:

#contour plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=data.BloodPressure[0:2]
y=data.Glucose[0:2]
z=((data.BMI[0:2],data.Age[0:2]))
pyplot.figure(figsize=(7,5))
pyplot.title("Contour plot")
contours=pyplot.contour(x,y,z)
pyplot.show()

OUTPUT:
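
With only two rows of data the contour above is degenerate; pyplot.contour() is normally given z values evaluated over a grid. A minimal sketch with synthetic gridded data (the surface function is hypothetical, for illustration):

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))   # hypothetical bell-shaped surface
pyplot.figure(figsize=(7, 5))
pyplot.title("Contour plot")
pyplot.contour(X, Y, Z)
pyplot.show()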
#correlation plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
names=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"]
correlation = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlation, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,8,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.title("Correlation")
pyplot.show()

OUTPUT:

#scatter plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
scatter_matrix(data)
pyplot.show()
OUTPUT:

#Histograms
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.hist()
pyplot.show()

OUTPUT:

#three dimensional plotting


import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d

data = pd.read_csv('diabetes.csv')
fig = pyplot.figure()

ax = pyplot.axes(projection='3d')
zline = np.array(data.BMI)
xline = np.sin(zline)
yline = np.cos(zline)

ax.plot3D(xline, yline, zline, 'gray')


zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Blues')
pyplot.show()

OUTPUT:

RESULT:
PROCEDURE:
#Basemap and other packages installation
!pip install basemap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting basemap
Downloading basemap-1.3.6-cp38-cp38-manylinux1_x86_64.whl (863 kB)
863 kB 14.5 MB/s
Collecting basemap-data<1.4,>=1.3.2
Downloading basemap_data-1.3.2-py2.py3-none-any.whl (30.5 MB)
30.5 MB 1.4 MB/s
Requirement already satisfied: matplotlib<3.7,>=1.5 in /usr/local/lib/python3.8/dist-packages (from
basemap) (3.2.2)
Collecting pyproj<3.5.0,>=1.9.3
Downloading pyproj-3.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
7.8 MB 55.4 MB/s
Collecting pyshp<2.4,>=1.2
Downloading pyshp-2.3.1-py2.py3-none-any.whl (46 kB)
46 kB 3.6 MB/s
Collecting numpy<1.24,>=1.22
Downloading numpy-1.23.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
17.1 MB 46.7 MB/s
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-
packages (from matplotlib<3.7,>=1.5->basemap) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (1.4.4)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from pyproj<3.5.0,>=1.9.3->
basemap) (2022.9.24)

Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-


dateutil>=2.1->matplotlib<3.7,>=1.5->basemap) (1.15.0)
Installing collected packages: numpy, pyshp, pyproj, basemap-data, basemap
Attempting uninstall: numpy
Found existing installation: numpy 1.21.6
Uninstalling numpy-1.21.6:
Successfully uninstalled numpy-1.21.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.5 which is incompatible.
Successfully installed basemap-1.3.6 basemap-data-1.3.2 numpy-1.23.5 pyproj-3.4.0 pyshp-2.3.1
!pip install basemap-data
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: basemap-data in /usr/local/lib/python3.8/dist-packages (1.3.2)
!pip install basemap-data-hires
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting basemap-data-hires
Downloading basemap_data_hires-1.3.2-py2.py3-none-any.whl (91.1 MB)
91.1 MB 57 kB/s
Installing collected packages: basemap-data-hires
Successfully installed basemap-data-hires-1.3.2
!pip install chain
Requirement already satisfied: chain in /usr/local/lib/python3.8/dist-packages (1.0)
(Note: the chain used in the source code below is itertools.chain from the Python standard library, so no separate installation is actually required.)

SOURCE CODE:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)
plt.show()

The useful thing is that the globe shown here is not a mere image; it is a fully functioning Matplotlib axes
that understands spherical coordinates and allows us to easily overplot data on the map.
fig = plt.figure(figsize=(8, 8))
m=Basemap(projection='lcc', resolution=None,width=8E6, height=8E6,lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

Map Projections:
The Basemap package implements several dozen such projections, all referenced by a short format code.
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

Cylindrical projections

The simplest of map projections are cylindrical projections, in which lines of constant latitude and
longitude are mapped to horizontal and vertical lines, respectively. This type of mapping represents
equatorial regions quite well, but results in extreme distortions near the poles. The spacing of latitude
lines varies between different cylindrical projections, leading to different conservation properties, and
different distortion near the poles.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)


Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical; the Mollweide projection (projection='moll') is one common example of this, in which all
meridians are elliptical arcs. It is constructed so as to preserve area across the map: though there are
distortions near the poles, the area of small patches reflects the true area. Other pseudo-cylindrical
projections are the sinusoidal (projection='sinu') and Robinson (projection='robin') projections.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, similar to if you
photographed the Earth from a particular point in space (a point which, for some projections, technically
lies within the Earth!). One common example is the orthographic projection (projection='ortho'), which
shows one side of the globe as seen from a viewer at a very long distance.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=0)
draw_map(m)

Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can lead to very good
local properties, but regions far from the focus point of the cone may become much distorted.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)

Example – Dataset (California_cities.csv)


import pandas as pd
cities = pd.read_csv('/content/sample_data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m=Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-119, width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left');

This shows us where larger populations of people have settled in California: they are clustered near the coast in
the Los Angeles and San Francisco areas, stretched along the highways in the flat central valley, and avoiding
almost completely the mountainous regions along the borders of the state.

RESULT:
