Fods Lab
Note:
pip (Package Installer for Python) is the de facto and recommended package management system
for Python. It is itself written in Python and is used to install and manage software packages. It
connects to an online repository of public packages, called the Python Package Index (PyPI).
ARRAY CREATION:
>>>import numpy as np
>>>a = np.array([(1,2,3),(4,5,6)])
>>>print(a)
[[1 2 3]
 [4 5 6]]
>>>arr = np.array([1, 2, 3, 4, 5])
>>>print(arr)
[1 2 3 4 5]
>>>print(type(arr))
<class 'numpy.ndarray'>
>>>a = np.array(42)
>>>b = np.array([1, 2, 3, 4, 5])
>>>c = np.array([[1, 2, 3], [4, 5, 6]])
>>>d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
>>>print(a.ndim)
0
>>>print(b.ndim)
1
>>>print(c.ndim)
2
>>>print(d.ndim)
3
ARRAY INDEXING:
Array indexing is the same as accessing an array element. We can access an array element by
referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first
element has index 0, and the second has index 1 etc.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[0])
1
>>>print(arr[2])
3
>>>print(arr[4])
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
print(arr[4])
IndexError: index 4 is out of bounds for axis 0 with size 4
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[2] + arr[3])
7
>>>arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
>>>print('2nd element on 1st row: ', arr[0, 1])
2nd element on 1st row: 2
>>>print('5th element on 2nd row: ', arr[1, 4])
5th element on 2nd row: 10
>>>print('Last element from 2nd dim: ', arr[1, -1])
Last element from 2nd dim: 10
>>>arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
>>>print(arr[0, 1, 2])
6
ARRAY SLICING:
Slicing in Python means retrieving elements from one given index up to (but not including) another
given index. We pass a slice instead of an index, like this: [start:end] or [start:end:step].
If we don't pass start, it is considered 0. If we don't pass end, it is considered the length of the array
in that dimension. If we don't pass step, it is considered 1.
>>>import numpy as np
>>>arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>>print(arr.shape)
(2, 4)
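The slicing defaults described in this section can be sketched as follows. This is a minimal illustration; the array values and variable names are chosen only for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

first_four = arr[:4]       # start omitted, so slicing begins at index 0
from_fourth = arr[4:]      # end omitted, so slicing runs to the end of the array
every_second = arr[1:6:2]  # explicit step of 2 from index 1 up to index 6 (exclusive)

print(first_four)    # [1 2 3 4]
print(from_fourth)   # [5 6 7]
print(every_second)  # [2 4 6]

# In a 2-D array each dimension takes its own index or slice.
grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
second_row_middle = grid[1, 1:3]
print(second_row_middle)  # [6 7]
```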
Array Reshape - By reshaping we can add or remove dimensions or change the number of elements
in each dimension.
#Converting a 1d array to 2d
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>>newarr = arr.reshape(4, 3)
>>>print(newarr)
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
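Reshaping can also remove dimensions again. The sketch below, with variable names chosen for illustration, turns the same 12 values into a 3-D array and back to 1-D, and shows that a reshape fails when the element count does not match.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

cube = arr.reshape(2, 3, 2)  # 1-D to 3-D: 2 blocks of 3 rows and 2 columns
flat = cube.reshape(-1)      # -1 lets NumPy infer the length, giving 1-D again

print(cube.shape)  # (2, 3, 2)
print(flat.shape)  # (12,)

# The total number of elements must stay the same,
# so 12 values cannot fill a 5 x 3 array.
reshape_failed = False
try:
    arr.reshape(5, 3)
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```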
ARRAY ITERATION:
Iterating means going through the elements of an array one by one.
>>>import numpy as np
>>>arr = np.array([1, 2, 3])
>>>for x in arr:
    print(x)
1
2
3
ARRAY JOINING:
Joining is the process of combining contents of two or more arrays in a single array.
>>>import numpy as np
>>>arr1 = np.array([1, 2, 3])
>>>arr2 = np.array([4, 5, 6])
>>>arr = np.concatenate((arr1, arr2))
>>>print(arr)
[1 2 3 4 5 6]
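For arrays with more than one dimension, concatenate() also accepts an axis argument. A minimal sketch with two small 2-D arrays (values chosen for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows_joined = np.concatenate((a, b), axis=0)  # default axis: b is appended below a
cols_joined = np.concatenate((a, b), axis=1)  # b is appended to the right of a

print(rows_joined.shape)  # (4, 2)
print(cols_joined)
# [[1 2 5 6]
#  [3 4 7 8]]
```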
ARRAY SPLITTING:
Splitting is the reverse operation of joining. Splitting breaks one array into multiple
subarrays.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6])
>>>newarr = np.array_split(arr,3)
>>>print(newarr)
[array([1, 2]), array([3, 4]), array([5, 6])]
>>>print(np.array_split(arr,5))
[array([1, 2]), array([3]), array([4]), array([5]), array([6])]
ARRAY SORTING:
Sorting is the process of arranging elements in an ordered sequence, in either ascending or
descending order.
>>>import numpy as np
#sorting numbers in ascending order
>>>arr = np.array([3, 2, 0, 1])
>>>print(np.sort(arr))
[0 1 2 3]
#sorting in alphabetical order
>>>arr = np.array(['banana', 'cherry', 'apple'])
>>>print(np.sort(arr))
['apple' 'banana' 'cherry']
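np.sort() always sorts in ascending order. One common way to get descending order, sketched here as an assumption about what the exercise intends, is to reverse the sorted result with a slice.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

descending = np.sort(arr)[::-1]  # sort ascending, then reverse with a step of -1
print(descending)  # [3 2 1 0]

# np.sort returns a sorted copy; the original array is left unchanged.
print(arr)  # [3 2 0 1]
```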
SEARCHING ARRAYS:
Searching an array for a certain value returns the indexes at which a match occurs. To search an
array, use the where() function.
Find the indexes where the value is 4:
>>>arr = np.array([1, 2, 3, 4, 5, 4, 4])
>>>x = np.where(arr == 4)
>>>print(x)
(array([3, 5, 6], dtype=int32),)
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
>>>x = np.where(arr%2 == 0)
>>>print(x)
(array([1, 3, 5, 7], dtype=int32),)
>>>x = np.where(arr%2 == 1)
>>>print(x)
(array([0, 2, 4, 6], dtype=int32),)
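The printed results above are tuples because where() returns one index array per dimension. A short sketch of unpacking that tuple and using the indexes to read the matching values back (variable names are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

matches = np.where(arr == 4)  # a tuple holding one index array per dimension
indexes = matches[0]          # arr is 1-D, so there is exactly one index array

print(indexes.tolist())       # [3, 5, 6]
print(arr[indexes].tolist())  # [4, 4, 4]
```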
DATA TYPES:
NumPy has some extra data types and refers to data types with one character, like i for
integers, u for unsigned integers, etc. Below is a list of the data types in NumPy and the
characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr.dtype)
int32
>>>arr = np.array(['apple', 'banana', 'cherry'])
>>>print(arr.dtype)
<U6
>>>arr = np.array([1, 2, 3, 4], dtype='S')
>>>print(arr)
[b'1' b'2' b'3' b'4']
>>>print(arr.dtype)
|S1
>>>arr = np.array([1, 2, 3, 4], dtype='i4')
>>>print(arr)
[1 2 3 4]
>>>print(arr.dtype)
int32
>>>arr = np.array(['a', '2', '3'], dtype='i')
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
arr = np.array(['a', '2', '3'], dtype='i')
ValueError: invalid literal for int() with base 10: 'a'
>>>arr = np.array([1, 0, 3])
>>>newarr = arr.astype(bool)
>>>print(newarr)
[ True False True]
>>>print(newarr.dtype)
bool
RESULT:
Create a simple Pandas DataFrame
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
Pandas uses the loc attribute to return one or more specified rows.
Return row 0:
#refer to the row index:
print(df.loc[0])
calories 420
duration 50
Name: 0, dtype: int64
Return rows 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
calories duration
0 420 50
1 380 40
Named Indexes:
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
calories duration
day1 420 50
day2 380 40
day3 390 45
If your data sets are stored in a file, Pandas can load them into a DataFrame. Load a comma separated
file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
iso_code ... excess_mortality_cumulative_per_million
0 AFG ... NaN
1 AFG ... NaN
2 AFG ... NaN
3 AFG ... NaN
4 AFG ... NaN
... ... ... ...
166321 ZWE ... NaN
166322 ZWE ... NaN
166323 ZWE ... NaN
166324 ZWE ... NaN
166325 ZWE ... NaN
[166326 rows x 67 columns]
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
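#convert only the numeric Age column; newer Pandas versions may raise an error if dtype=float is applied to the Name strings
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})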
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
a b c
first 1 2 NaN
second 5 10 20.0
import pandas as pd
# list of strings
lst = ['Pandas', 'SciPy', 'DataFrames', 'NumPy', 'Analytics']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
0 Pandas
1 SciPy
2 DataFrames
3 NumPy
4 Analytics
To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an
index is passed, then the length of the index should be equal to the length of the arrays. If no index is
passed, then by default the index will be range(n), where n is the array length.
import pandas as pd
# initialise data from a dict of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
Column Selection: To select columns in a Pandas DataFrame, we can access them by their
column names.
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],'Age':[27, 24, 22, 32],'Address':['Delhi',
'Kanpur', 'Allahabad', 'Kannauj'],'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
Row Selection: Pandas provides its own indexers to retrieve rows from a DataFrame.
DataFrame.loc[] retrieves rows by index label. Rows can also be selected by passing an integer
position to DataFrame.iloc[].
import pandas as pd
data = pd.read_csv("country.csv", index_col ="iso_code")
first = data.loc["AFG"]
second = data.loc["NOR"]
print(first, "\n\n\n", second)
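The difference between loc[] and iloc[] can be sketched on a small in-memory DataFrame. The table below is a hypothetical stand-in for country.csv; its column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for country.csv, indexed by an iso_code column.
data = pd.DataFrame(
    {"country": ["Afghanistan", "Norway", "Zimbabwe"], "value": [10, 20, 30]},
    index=pd.Index(["AFG", "NOR", "ZWE"], name="iso_code"),
)

by_label = data.loc["NOR"]   # select by index label
by_position = data.iloc[1]   # select by integer position (the second row)

print(by_label["country"])           # Norway
print(by_label.equals(by_position))  # True
```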
import pandas as pd
import numpy as np
scores = {'First Score':[100, 90, np.nan, 95],'Second Score': [30, 45, 56, np.nan],'Third Score':[np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)
print(df.isnull())
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False
Functions such as fillna() and interpolate() help in filling null values in a DataFrame. The
interpolate() function fills NA values as well, but it uses an interpolation technique to estimate the
missing values rather than hard-coding a value.
import pandas as pd
import numpy as np
scores = {'First Score':[100, 90, np.nan, 95],'Second Score': [30, 45, 56, np.nan],'Third Score':[np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)
print(df.fillna(0))
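The listing above uses fillna(); a minimal sketch of interpolate() on a single column shows the difference. It assumes the default linear method, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"First Score": [100.0, 90.0, np.nan, 95.0]})

filled_constant = scores.fillna(0)    # missing value replaced by a fixed 0
filled_linear = scores.interpolate()  # default linear method: midpoint of 90 and 95

print(filled_constant["First Score"].tolist())  # [100.0, 90.0, 0.0, 95.0]
print(filled_linear["First Score"].tolist())    # [100.0, 90.0, 92.5, 95.0]
```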
A Pandas DataFrame consists of rows and columns, so iterating over a DataFrame is similar to
iterating over a dictionary. To iterate over rows, we can use iterrows() or itertuples(). To iterate over
columns, we can use items() (called iteritems() in older Pandas versions; iteritems() was removed in Pandas 2.0).
import pandas as pd
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],'degree': ["MBA", "BCA", "M.Tech",
"MBA"],'score':[90, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df)
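The listing above only prints the DataFrame. A minimal sketch of the iteration functions themselves, on a shortened version of the same data, could look like this:

```python
import pandas as pd

df = pd.DataFrame({"name": ["aparna", "pankaj"], "score": [90, 40]})

# iterrows() yields (index, row-as-Series) pairs.
row_names = []
for index, row in df.iterrows():
    print(index, row["name"], row["score"])
    row_names.append(row["name"])

# itertuples() yields one namedtuple per row.
tuple_scores = []
for record in df.itertuples(index=False):
    print(record.name, record.score)
    tuple_scores.append(record.score)

# items() walks the columns, yielding (column name, column-as-Series) pairs.
column_names = [name for name, _ in df.items()]
print(column_names)  # ['name', 'score']
```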
RESULT:
PROCEDURE:
We will use the Pandas library to load the Iris data set CSV file and convert it into a DataFrame.
The read_csv() method is used to read CSV files.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
print(df)
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
print(df.head())
2. Dispersion
Dispersion describes the variation present in a given variable, that is, how close to or far from
the mean value the values lie.
Variance: the average squared deviation from the mean value
Standard Deviation: the square root of the variance
Range: the difference between the max and min values
InterQuartile Range (IQR): the difference between Q3 and Q1, where Q3 is the 3rd
Quartile value and Q1 is the 1st Quartile value.
data['A'].var()
data['A'].std()
data['A'].max()-data['A'].min()
data['A'].quantile([.25,.5,.75])
df["sepalwidth"].var()
0.1880040268456376
df["sepalwidth"].std()
0.4335943113621737
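#note: max() is taken from sepallength and min() from sepalwidth here; a true range uses the same column for both
df["sepallength"].max()-df["sepalwidth"].min()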
5.9
df["petalwidth"].quantile([.25,.5,.75])
0.25 0.3
0.50 1.3
0.75 1.8
Name: petalwidth, dtype: float64
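The quantiles above give Q1 and Q3, from which the IQR follows by subtraction. A minimal sketch on a small made-up sample (not the Iris data) shows the dispersion measures side by side:

```python
import pandas as pd

# Small made-up sample standing in for one numeric column.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                              # interquartile range
value_range = values.max() - values.min()  # range, taken from one column

print(q1, q3, iqr)  # 3.0 7.0 4.0
print(value_range)  # 8.0

# The standard deviation is the square root of the variance.
print(abs(values.std() - values.var() ** 0.5) < 1e-12)  # True
```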
3. Skewness
Skewness measures the asymmetry of the data about the mean value. Symmetry
means that observations are distributed equally above and below the mean.
skewness = 0: the data is symmetric about the mean.
skewness = Negative: the data is not symmetric and the left side tail is longer than the right side tail of the
density plot.
skewness = Positive: the data is not symmetric and the right side tail is longer than the left side tail of the
density plot.
We can find the skewness of a given variable with the call below.
data['A'].skew()
df["sepallength"].skew()
0.3149109566369728
df["sepalwidth"].skew()
0.3340526621720866
df["class"].skew()
ValueError: could not convert string to float: 'Iris-setosa'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/nanops.py in _f(*args,
**kwargs)
99 # object arrays that contain strings
100 if is_object_dtype(args[0]):
--> 101 raise TypeError(e) from e
102 raise
103
TypeError: could not convert string to float: 'Iris-setosa'
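The error above arises because skew() is only defined for numeric data. One way to avoid it, sketched here on a made-up DataFrame, is to select the numeric columns first; the same example also shows the sign of the skewness for a right-tailed and a left-tailed column.

```python
import pandas as pd

# Made-up data: column names and values are for illustration only.
df = pd.DataFrame({
    "right_tailed": [1, 1, 2, 2, 3, 10],      # a large value stretches the right tail
    "left_tailed": [-10, -3, -2, -2, -1, -1],  # mirror image: long left tail
    "label": ["a", "b", "a", "b", "a", "b"],   # text column, like the class column
})

# Restrict the calculation to numeric columns so that text columns are skipped.
skewness = df.select_dtypes(include="number").skew()
print(skewness)
```

The right-tailed column gets a positive value and the left-tailed column a negative one.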
4. Kurtosis
Kurtosis describes the peakedness (or flatness) of a density plot relative to the normal distribution
plot. Dr. Wheeler defines kurtosis as: “The kurtosis parameter is a measure of the
combined weight of the tails relative to the rest of the distribution.” This means we measure the tail
heaviness of a given distribution.
In the above graph, we can clearly see that the left side and right side of the plot are equally
distributed. The histogram lies above the line, which means the data has a flat plot. This means the
kurtosis of this distribution is normal.
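Tail heaviness can be checked numerically with the Pandas kurt() method, which reports excess kurtosis (a normal distribution scores 0). The two series below are made up for illustration: one evenly spread, one with a single extreme value.

```python
import pandas as pd

flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])    # evenly spread, light tails
heavy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 100])  # one extreme value, heavy tail

flat_kurtosis = flat.kurt()
heavy_kurtosis = heavy.kurt()

print(flat_kurtosis)   # negative: flatter than a normal distribution
print(heavy_kurtosis)  # positive: heavier tails than a normal distribution
```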
Missing values can occur when no information is provided for one or more items or for a whole unit.
We will use the isnull() method.
df.isnull().sum()
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
Checking Duplicates
Let’s see if our dataset contains any duplicates or not. Pandas drop_duplicates() method
helps in removing duplicates from the data frame.
#interactive table view
data = df.drop_duplicates(subset ="class",)
data
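To count duplicates before dropping them, the duplicated() method can be used. The sketch below uses a small made-up table rather than the Iris file.

```python
import pandas as pd

# Made-up rows for illustration; the third row repeats the first.
df = pd.DataFrame({
    "sepalwidth": [3.5, 3.0, 3.5, 3.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row
deduplicated = df.drop_duplicates()     # keeps the first copy of each row
one_per_class = df.drop_duplicates(subset="class")

print(duplicate_rows)      # 1
print(len(deduplicated))   # 3
print(len(one_per_class))  # 2
```

With subset="class", as in the listing above, only the first row of each class survives, so the result shows one example per class rather than the truly duplicated rows.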
df.value_counts("sepalwidth")
sepalwidth
3.0 26
2.8 14
3.2 13
3.4 12
3.1 12
2.9 10
2.7 9
2.5 8
3.3 6
3.5 6
3.8 6
2.6 5
2.3 4
2.4 3
2.2 3
3.6 3
3.7 3
3.9 2
4.1 1
4.2 1
2.0 1
4.0 1
4.4 1
dtype: int64
Data Visualization
Visualizing the target column - Our target column will be the sepalwidth column because, at the end, we
need the result according to the sepalwidth only. Let’s see a countplot for species. (We will use the
Matplotlib and Seaborn libraries for the data visualization.)
Histograms show the distribution of data for various columns. They can be used for univariate as well
as bivariate analysis.
Output:
The highest frequency of the sepal length is between 30 and 35 which is between 5.5 and 6.
The highest frequency of the sepal width is around 70 which is between 3.0 and 3.5.
The highest frequency of the petal length is around 50 which is between 1 and 2.
The highest frequency of the petal width is between 40 and 50 which is between 0.0 and 0.5
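Readings like the ones above can also be obtained without a plot by computing the histogram counts directly. The sketch below uses np.histogram on made-up measurements, not the Iris columns, and reports the bin with the highest frequency.

```python
import numpy as np

# Made-up measurements standing in for one column of the data set.
values = np.array([4.9, 5.1, 5.6, 5.7, 5.8, 5.9, 6.1, 6.4, 7.0])

counts, edges = np.histogram(values, bins=[4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
tallest = int(np.argmax(counts))  # position of the bin with the highest count

print(counts.tolist())                     # [1, 1, 4, 2, 1]
print(edges[tallest], edges[tallest + 1])  # 5.5 6.0
```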
RESULT:
PROCEDURE:
(5a) Univariate Analysis - Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis
import pandas as pd
import numpy as np
df = pd.read_csv("diabetes.csv")
print(df)
>>>print(df['Age'].mean())
48.02884615384615
>>>print(df['Age'].median())
47.5
>>>print(df['Age'].mode())
0 35
dtype: int64
>>>print(df["Age"].var())
147.65812583370388
>>>print(df["Age"].std())
12.151465995249458
>>>print(df["Age"].skew())
0.3293593578272701
>>>print(df["Age"].kurt())
-0.19170941407070163
Data-Visualization:(pima-diabetes.csv)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='pima-diabetes.csv';
df = pd.read_csv(filepath)
Data_X= df.copy(deep=True)
Data_X= Data_X.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
#Plotting Histogram of Data
Data_X.hist(bins=40)
plt.show()
(5b) Bivariate Analysis – Linear and Logistic Regression
Simple Linear Regression - It is an approach for predicting a response using a single feature. It is
assumed that the two variables are linearly related. So, we try to find a linear function that predicts the
response value (y) as accurately as possible as a function of the feature or independent variable (x). Let us
consider a dataset where we have a value of response y for every feature x (the x and y arrays in the source code below).
Now, the task is to find the line that best fits the scatter plot of these points so that we can predict the
response for any new feature value (i.e. a value of x not present in the dataset). This line is called a regression line.
The equation of the regression line is represented as: h(x_i) = b_0 + b_1*x_i
Here,
h(x_i) represents the predicted response value for the ith observation.
b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the regression line
respectively.
SOURCE CODE:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Output:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
And graph obtained looks like this:
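As a cross-check, the closed-form estimates can be compared with NumPy's own least-squares fit. This is a sketch that repeats the formulas from estimate_coef on the same x and y arrays and compares them with np.polyfit for a degree-1 polynomial.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Closed-form estimates, using the same formulas as estimate_coef.
n = x.size
b_1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x * x) - n * x.mean() ** 2)
b_0 = y.mean() - b_1 * x.mean()

# np.polyfit with degree 1 returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

print(round(b_0, 4), round(b_1, 4))                        # 1.2364 1.1697
print(np.isclose(b_0, intercept), np.isclose(b_1, slope))  # True True
```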
Logistic Regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
#Load the data
filepath='pima-diabetes.csv'
df = pd.read_csv(filepath)
#Features are all columns except the last; the last column is the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
print("Accuracy ", LR.score(X_test, y_test)*100)
sns.set(font_scale=1.5)
#confusion_matrix expects the true labels first, then the predictions
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()
(5c) Multiple Regression Analysis
import pandas as pd
from sklearn import linear_model
df = pd.read_csv("pima-diabetes.csv")
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#Predict age based on Glucose and BloodPressure
predictedage = regr.predict([[185, 145]])
print(predictedage)
Output:
[48.13025197]
From the above graph, we can infer that pregnancy isn't a likely cause of diabetes, as the distribution for
the Healthy and Diabetic groups is almost the same.
//diabetes.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='diabetes.csv';
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
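#distplot is deprecated in recent seaborn releases; histplot with kde=True is a close replacement
sns.histplot(df[df['Gender'] == 'Male']["Age"], color='green', kde=True, stat='density')
sns.histplot(df[df['Polyuria'] == 'No']["Age"], color='red', kde=True, stat='density')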
plt.title('Male vs Polyuria by Age', fontsize=15)
plt.xlim([-5,20])
plt.grid(linewidth = 0.7)
plt.show()
RESULT:
SOURCE CODE:
# Normal Curve
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
#sort the values so that the curve is drawn from left to right
x=data.Glucose[0:50].sort_values()
mean=st.mean(x)
sd=st.stdev(x)
pyplot.plot(x,norm.pdf(x,mean,sd))
pyplot.title("Normal plot")
pyplot.show()
OUTPUT:
#density plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
OUTPUT:
#contour plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=data.BloodPressure[0:2]
y=data.Glucose[0:2]
z=((data.BMI[0:2],data.Age[0:2]))
pyplot.figure(figsize=(7,5))
pyplot.title("Contour plot")
contours=pyplot.contour(x,y,z)
pyplot.show()
OUTPUT:
#correlation plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
names=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"]
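#restrict the correlation matrix to the eight named columns so that it matches the tick labels
correlation = data[names].corr()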
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlation, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,8,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.title("Correlation")
pyplot.show()
OUTPUT:
#scatter plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
scatter_matrix(data)
pyplot.show()
OUTPUT:
#Histograms
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.hist()
pyplot.show()
OUTPUT:
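import numpy as np
import pandas as pd
from matplotlib import pyplot
data = pd.read_csv('diabetes.csv')
fig = pyplot.figure()
ax = pyplot.axes(projection='3d')
zline = np.array(data.BMI)
xline = np.sin(zline)
yline = np.cos(zline)
#draw the 3D line (assumed: the plotting call is missing from the original listing)
ax.plot3D(xline, yline, zline, 'gray')
pyplot.show()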
OUTPUT:
RESULT:
PROCEDURE:
#Basemap and other packages installation
!pip install basemap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting basemap
Downloading basemap-1.3.6-cp38-cp38-manylinux1_x86_64.whl (863 kB)
863 kB 14.5 MB/s
Collecting basemap-data<1.4,>=1.3.2
Downloading basemap_data-1.3.2-py2.py3-none-any.whl (30.5 MB)
30.5 MB 1.4 MB/s
Requirement already satisfied: matplotlib<3.7,>=1.5 in /usr/local/lib/python3.8/dist-packages (from
basemap) (3.2.2)
Collecting pyproj<3.5.0,>=1.9.3
Downloading pyproj-3.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
7.8 MB 55.4 MB/s
Collecting pyshp<2.4,>=1.2
Downloading pyshp-2.3.1-py2.py3-none-any.whl (46 kB)
46 kB 3.6 MB/s
Collecting numpy<1.24,>=1.22
Downloading numpy-1.23.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
17.1 MB 46.7 MB/s
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-
packages (from matplotlib<3.7,>=1.5->basemap) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from
matplotlib<3.7,>=1.5->basemap) (1.4.4)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from pyproj<3.5.0,>=1.9.3->
basemap) (2022.9.24)
SOURCE CODE:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)
plt.show()
The useful thing is that the globe shown here is not a mere image; it is a fully functioning Matplotlib axes
that understands spherical coordinates and allows us to easily overplot data on the map.
fig = plt.figure(figsize=(8, 8))
m=Basemap(projection='lcc', resolution=None,width=8E6, height=8E6,lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
Map Projections:
The Basemap package implements several dozen such projections, all referenced by a short format code.
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude and
longitude are mapped to horizontal and vertical lines, respectively. This type of mapping represents
equatorial regions quite well, but results in extreme distortions near the poles. The spacing of latitude
lines varies between different cylindrical projections, leading to different conservation properties, and
different distortion near the poles.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
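The statement that the spacing of latitude lines differs between cylindrical projections can be illustrated numerically without Basemap. The sketch below compares the equidistant cylindrical projection (Basemap's 'cyl') with the Mercator projection (Basemap's 'merc') using their standard formulas on a unit sphere; the chosen latitudes are arbitrary.

```python
import numpy as np

lats_deg = np.array([0.0, 30.0, 60.0, 80.0])
lats = np.radians(lats_deg)

# Equidistant cylindrical: y is simply proportional to latitude.
y_equidistant = lats
# Mercator: y = ln(tan(pi/4 + lat/2)), which grows without bound towards the poles.
y_mercator = np.log(np.tan(np.pi / 4 + lats / 2))

for lat, ye, ym in zip(lats_deg, y_equidistant, y_mercator):
    print(f"lat {lat:4.0f}: equidistant y = {ye:.3f}, Mercator y = {ym:.3f}")
```

Near the equator the two agree closely, while at high latitudes the Mercator spacing stretches much faster.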
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical. The Mollweide projection (projection='moll') is one common example of this, in which all
meridians are elliptical arcs. It is constructed so as to preserve area across the map: though there are
distortions near the poles, the area of small patches reflects the true area. Other pseudo-cylindrical
projections are the sinusoidal (projection='sinu') and Robinson (projection='robin') projections.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, as if you
photographed the Earth from a particular point in space (a point which, for some projections, technically
lies within the Earth!). One common example is the orthographic projection (projection='ortho'), which
shows one side of the globe as seen from a viewer at a very long distance.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=0)
draw_map(m)
Conic projections
A conic projection projects the map onto a single cone, which is then unrolled. This can lead to very good
local properties, but regions far from the focus point of the cone may become heavily distorted.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)
This shows us where larger populations of people have settled in California: they are clustered near the coast in
the Los Angeles and San Francisco areas, stretched along the highways in the flat central valley, and avoiding
almost completely the mountainous regions along the borders of the state.
RESULT: