FINAL FDS MANUAL Print
1. NUMPY
NumPy is one of the most fundamental packages in Python: a general-purpose array-
processing package that provides high-performance multidimensional array objects and tools
for working with them. It also serves as an efficient container for generic multidimensional data.
NumPy’s main object is the homogeneous multidimensional array: a table of elements
(usually numbers), all of the same data type, indexed by a tuple of non-negative integers. In
NumPy, dimensions are called axes, and the number of axes is called the rank (available as
the ndim attribute). NumPy’s array class is called ndarray, also known by the alias array.
Basic array operations: add, multiply, slice, flatten, reshape, index arrays
Advanced array operations: stack arrays, split into sections, broadcast arrays
Work with DateTime or Linear Algebra
Basic Slicing and Advanced Indexing in NumPy Python.
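The basic and advanced operations listed above can be sketched in a few lines. This is a minimal illustration; the array values and variable names are made up for the example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])

total = a + b                  # broadcasting: b is added to every row of a
product = a * 2                # element-wise multiplication by a scalar
flat = a.flatten()             # 1-D copy of the data
reshaped = a.reshape(3, 2)     # same six values, new shape
stacked = np.vstack([a, b])    # stack b below a as a third row
left, right = np.hsplit(reshaped, 2)  # split the 3x2 array into two 3x1 columns

print(total)
print(stacked.shape)
```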
2. SCIPY
The SciPy library is one of the core packages that make up the SciPy stack. The two are
different things: the SciPy stack is a collection of tools that includes NumPy, Matplotlib,
Pandas, and SymPy, while the SciPy library builds on the NumPy array object and contains
modules for efficient mathematical routines such as linear algebra, interpolation,
optimization, integration, and statistics. Various scientific computation problems arise
while working with data science.
SciPy provides a variety of sub-packages to solve these problems efficiently.
The SciPy library is fast and easy to use.
It operates on NumPy arrays and provides optimized versions of some functions that are
also available in NumPy.
After the GNU Scientific Library, SciPy is one of the most widely used scientific libraries.
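As a small sketch of three of the sub-packages mentioned above (integration, linear algebra, and optimization), the example below uses scipy.integrate.quad, scipy.linalg.solve, and scipy.optimize.minimize_scalar. The functions and numbers being solved are chosen only for illustration.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Integration: area under sin(x) between 0 and pi (the exact value is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Linear algebra: solve the system A @ x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Optimization: find the t that minimizes (t - 2)^2
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

print(area, x, result.x)
```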
3. PANDAS
Pandas is an open-source Python package that provides high-performance, easy-to-use
data structures and data analysis tools for labeled data in the Python programming
language. The name is derived from "panel data" and is also a play on "Python data
analysis". Pandas is well suited to data wrangling (munging). It is designed for quick and
easy data manipulation, reading, aggregation, and visualization. Pandas takes data from a
CSV or TSV file or a SQL database and creates a Python object with rows and columns
called a data frame. The data frame is very similar to a table in a spreadsheet or in
statistical software, such as Excel or SPSS.
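A minimal sketch of the reading and aggregation steps described above. To keep the example self-contained, the CSV content is held in a string instead of a file; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (illustrative data)
csv_text = """name,department,salary
Asha,Sales,42000
Ravi,Sales,38000
Meena,HR,45000
"""

df = pd.read_csv(io.StringIO(csv_text))   # rows and columns become a DataFrame
mean_salary = df.groupby("department")["salary"].mean()  # aggregation per group

print(df)
print(mean_salary)
```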
4. STATSMODELS
Statsmodels is built for hardcore statistics. The core of the Statsmodels library is
“production ready”. Traditional models such as robust linear models and generalized linear
models (GLM) have been around for a long time and have been validated against R and
Stata. The library also contains a time series analysis section, which includes vector
autoregression (VAR), AR, and ARMA models.
Linear/ Multiple regression – Linear regression is a statistical method for modeling
the linear relationship between a dependent variable and one or more explanatory
variables.
Logistic regression – The logistic model is used in statistics to model the
likelihood of a specific event/class occurring such as win/lose, pass/fail, etc.
Time series analysis – The analysis of time series data to extract meaningful
statistics and other characteristics of the data.
Statistical tests – Refers to the many statistical tests that can be done using the
Statsmodels Library.
5. JUPYTER
Project Jupyter is a suite of software products used in interactive computing. Packages
under the Jupyter project include:
Jupyter notebook − A web based interface to programming environments of Python,
Julia, R and many others
QtConsole − Qt based terminal for Jupyter kernels similar to IPython
nbviewer − Facility to share Jupyter notebooks
JupyterLab − Modern web based integrated interface for all products.
IPython, which provides the default Python kernel for Jupyter, has the following features:
Offers a powerful interactive Python shell.
Acts as a main kernel for Jupyter notebook and other front end tools of Project
Jupyter.
Possesses object introspection ability. Introspection is the ability to check
properties of an object during runtime.
Syntax highlighting.
Stores the history of interactions.
Tab completion of keywords, variables and function names.
Magic command system useful for controlling Python environment and
performing OS tasks.
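Object introspection, mentioned in the feature list above, can be sketched with the standard library alone. The function area below is a hypothetical example; in an IPython session, typing area? displays similar information.

```python
import inspect

def area(length, width=1.0):
    """Return the area of a rectangle."""
    return length * width

print(type(area).__name__)        # kind of object
print(inspect.signature(area))    # parameters and default values
print(inspect.getdoc(area))       # docstring
print(callable(area))             # whether the object can be called
```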
PYTHON INSTALLATION
Open the official Python website (https://www.python.org/).
Downloads ==> Windows ==> Select a recent release (requires Windows 10 or later).
Install "python-3.10.6-amd64.exe"
PACKAGE INSTALLATION
Open the command prompt and enter the following command to check whether Python was
installed properly: python --version. If the installation succeeded, it prints the Python version.
Enter the following command to check whether the Python package manager was installed
properly: pip --version
Enter the following command to install the NumPy library: pip install numpy
Enter the following command to install the SciPy library: pip install scipy
Enter the following command to install the Statsmodels library: pip install statsmodels
Enter the following command to install the Pandas library: pip install pandas
Enter the following command to install Jupyter: pip install jupyter
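After the installation steps above, the installed versions can also be checked from Python. The sketch below uses importlib.metadata from the standard library; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

PACKAGES = ["numpy", "scipy", "statsmodels", "pandas", "jupyter"]

def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```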
OUTPUT:
PROGRAM:
1. Creating Arrays:
0-D Arrays
Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called a 1-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
2. Array Dimensions:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
3. Access 2-D Arrays:
To access elements from 2-D arrays we can use comma separated integers
representing the dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
4. Access 3-D Arrays:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
5. Array Slicing:
Slicing in python means taking elements from one given index to another given index.
We pass slice instead of index like this: [start:end]. We can also define the step, like
this: [start:end:step].
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
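A few more slicing patterns, sketched on the same kind of arrays used above. Omitting start or end means "from the beginning" or "to the end", and negative indices count from the end of the array.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

first_three = arr[:3]     # [1 2 3]
last_two = arr[-2:]       # [6 7]
every_other = arr[::2]    # [1 3 5 7]

grid = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
block = grid[0:2, 1:4]    # both rows, columns 1 to 3

print(first_three, last_two, every_other)
print(block)
```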
6. Data Types:
NumPy has some extra data types and refers to data types with one character, such as i for
integers, u for unsigned integers, f for floats, and S for byte strings.
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)
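As a short sketch of how these type codes are used, the example below creates an array with an explicit dtype and converts between types with astype(), which returns a converted copy. The values are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3, 4], dtype='i4')   # 4-byte (32-bit) signed integers
floats = ints.astype('f8')                  # copy converted to 8-byte floats
flags = np.array([1, 0, 3]).astype(bool)    # zero becomes False, non-zero True
from_text = np.array(['1', '2', '3']).astype('i')  # strings parsed as integers

print(ints.dtype, floats.dtype)
print(flags)
print(from_text)
```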
8. Make a view:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42
print(arr)
print(x)
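The difference between a view and a copy can be sketched as follows: a view shares the original data, so it sees later changes, while a copy owns its data. The base attribute shows which is which.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
v = arr.view()   # shares memory with arr
c = arr.copy()   # independent data

arr[0] = 42

print(v)              # the change is visible through the view
print(c)              # the copy is unaffected
print(v.base is arr)  # True: the view refers back to arr
print(c.base is None) # True: the copy owns its data
```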
15. Sorting:
Sorting means putting elements in an ordered sequence. Ordered sequence is any
sequence that has an order corresponding to elements, like numeric or alphabetical,
ascending or descending. NumPy provides a function called sort() that returns a sorted
copy of the specified array.
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
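A brief sketch of other common sorting cases: strings, 2-D arrays (sorted along the last axis by default), descending order, and argsort(), which returns the indices that would sort the array. The values are illustrative.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])
names = np.array(['banana', 'cherry', 'apple'])
table = np.array([[3, 2, 4], [5, 0, 1]])

sorted_names = np.sort(names)     # alphabetical order
sorted_rows = np.sort(table)      # each row sorted independently
descending = np.sort(arr)[::-1]   # reverse the ascending result
order = np.argsort(arr)           # indices that put arr in ascending order

print(sorted_names)
print(sorted_rows)
print(descending, order)
```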
16. Filtering Arrays:
Getting some elements out of an existing array and creating a new array out of them is
called filtering. In NumPy, you filter an array using a boolean index list.
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)
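In practice the boolean index is usually built from a condition on the array rather than written by hand. A minimal sketch, using the same array as above:

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

mask = arr > 42               # boolean array, one entry per element
above_42 = arr[mask]          # keeps only the elements where mask is True
evens = arr[arr % 2 == 0]     # condition written directly inside the index

print(mask)
print(above_42)
print(evens)
```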
OUTPUT:
PROGRAM:
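import numpy as np
# a is not defined in the original listing; these values are assumed for illustration
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[2,3,4],[5,6,7], [8,9,10]])
mul = np.multiply(a,b)   # element-wise product
add = np.add(a,b)        # element-wise sum
sub = np.subtract(a,b)   # element-wise difference
div = np.divide(a,b)     # element-wise quotient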
import pandas as pd
print("Original DataFrame:")
print(df)
df = df.drop([0, 1])
df1 = df1.drop([2])
print("\nNew DataFrames:")
print(df)
print(df1)
print('\n"one_to_one": check if merge keys are unique in both left and right datasets:')
print(df_one_to_one)
print(df_one_to_many)
print(df_many_to_one)
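The listing above does not show how its data frames were built, so the following is a self-contained sketch of the same two ideas: dropping rows by index label and merging with the validate option. The frames students, marks, and clubs and their contents are invented for illustration.

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
marks = pd.DataFrame({"id": [1, 2, 3], "mark": [80, 65, 90]})
clubs = pd.DataFrame({"id": [1, 1, 2], "club": ["chess", "music", "chess"]})

# drop() removes the rows whose index labels are listed
trimmed = students.drop([0, 1])

# validate checks the uniqueness of the merge keys before merging
df_one_to_one = pd.merge(students, marks, on="id", validate="one_to_one")
df_one_to_many = pd.merge(students, clubs, on="id", validate="one_to_many")
df_many_to_one = pd.merge(clubs, students, on="id", validate="many_to_one")

print(trimmed)
print(df_one_to_one)
print(df_one_to_many)
print(df_many_to_one)
```

If the keys do not satisfy the requested relationship, for example merging students with clubs using validate="one_to_one", pandas raises a MergeError instead of returning a result.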
PROGRAM:
#DATA COLLECT
import pandas as pd
import numpy as np
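import matplotlib.pyplot as plt
import seaborn as sns
# scikit-learn imports needed by the preprocessing, splitting, and model steps below
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score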
dataset=pd.read_csv("iris.txt")
dataset.head()
dataset=pd.read_excel("iris.xlsx")
dataset.head()
dataset=pd.read_csv("iris.csv")
dataset.head()
dataset.info()
dataset.Species.unique()
#EDA
dataset.describe()
dataset.corr(numeric_only=True)  # skip the text column Species
dataset.Species.value_counts()
# height replaces the size argument in seaborn 0.9 and later
sns.FacetGrid(dataset,hue="Species",height=6).map(plt.scatter,"Sepal.Length","Sepal.Width").add_legend()
sns.FacetGrid(dataset,hue="Species",height=6).map(plt.scatter,"Petal.Length","Petal.Width").add_legend()
sns.pairplot(dataset,hue="Species")
plt.hist(dataset["Sepal.Length"],bins=25);
# histplot is an axes-level function, so it can be mapped onto a FacetGrid
sns.FacetGrid(dataset,hue="Species",height=6).map(sns.histplot,"Sepal.Width").add_legend();
sns.boxplot(x='Species',y='Petal.Length',data=dataset)
#PREPROCESSING
ss=StandardScaler()
x=dataset.drop(['Species'],axis=1)
y=dataset['Species']
scaler=ss.fit(x)
x_stdscaler=scaler.transform(x)
x_stdscaler
le=LabelEncoder()
y=le.fit_transform(y)
#SPLITTING
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
x_train.value_counts()
#MODEL SELECTION
svc=SVC(kernel="linear")
svc.fit(x_train,y_train)
y_pred=svc.predict(x_test)
y_pred
accuracy_score(y_test,y_pred)
#PREDICTION
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
accuracy_score(y_test,y_pred)
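To make the #PREPROCESSING step above less of a black box, the sketch below reproduces its two operations with pandas alone: standardization subtracts each column's mean and divides by its population standard deviation (ddof=0), and label encoding maps the sorted class names to integers. pd.factorize stands in for LabelEncoder here, and the four rows of data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative measurements, not taken from the iris file
x = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 6.3, 5.8],
    "Petal.Length": [1.4, 1.4, 6.0, 5.1],
})
y = pd.Series(["setosa", "setosa", "virginica", "virginica"])

# Standardization: zero mean and unit variance per column
x_scaled = (x - x.mean()) / x.std(ddof=0)

# Label encoding: integer codes assigned to the sorted class names
codes, classes = pd.factorize(y, sort=True)

print(x_scaled)
print(codes, list(classes))
```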
OUTPUT:
DATASET HEADS:
DATASET INFORMATION:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
DATASET UNIQUE:
setosa 50
versicolor 50
virginica 50
DATASET DESCRIPTION:
count 150.000000 150.000000 150.000000 150.000000 150.000000
DATASET CORRELATION:
SCATTER PLOT:
PAIRPLOT:
HISTOGRAM:
BOXPLOT:
PREPROCESSING:
-1.34022653e+00, -1.31544430e+00],
-1.39706395e+00, -1.31544430e+00],
-1.28338910e+00, -1.31544430e+00],
SPLITTING:
Sepal.Length Sepal.Width Petal.Length Petal.Width
MODEL SELECTION:
1.0
PREDICTION:
1.0
PROGRAM:
import pandas as pd
import numpy as np
df=pd.read_csv("diabetes_csv.csv")
df.head()
df.skin.value_counts()
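# numeric_only=True skips the text column 'class'; recent pandas versions no longer drop it silently
df.mean(axis = 0, numeric_only=True)
print(df.loc[:,'skin'].mean())
df.mean(axis = 1, numeric_only=True)[0:5]
df.median(numeric_only=True)
print(df.loc[:,'skin'].median())
df.std(numeric_only=True)
print(df.loc[:,'skin'].std())
df.std(axis = 1, numeric_only=True)[0:5]
df.var(numeric_only=True)
print(df.skew(numeric_only=True))
df.describe()
df.describe(include='all')
print(df.kurtosis(numeric_only=True))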
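The skew() and kurtosis() calls above return bias-corrected sample estimates (kurtosis is reported as excess kurtosis, so a normal distribution gives a value near 0). The sketch below recomputes both from the central moments of a small made-up series and compares them with the pandas results; the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 15.0])  # illustrative data
n = len(values)
dev = values - values.mean()

# Central moments (population form, dividing by n)
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()
m4 = (dev ** 4).mean()

g1 = m3 / m2 ** 1.5        # uncorrected skewness
g2 = m4 / m2 ** 2 - 3      # uncorrected excess kurtosis

# Bias-corrected sample estimates
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * g1
kurt_manual = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(skew_manual, values.skew())
print(kurt_manual, values.kurtosis())
```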
OUTPUT:
HEAD DATA:
FREQUENCY:
0 227
32 31
30 27
27 23
23 22
33 20
28 20
18 20
31 19
19 18
39 18
29 17
40 16
25 16
MEAN:
20.536458333333332
0 43.153375
1 29.868875
2 38.871500
3 40.283375
4 57.298500
dtype: float64
MODE:
preg plas pres skin insu mass pedi age class
MEDIAN:
23.0
0 34.30
1 27.80
2 15.65
3 25.55
4 37.50
dtype: float64
STANDARD DEVIATION:
15.952217567727677
0 49.397286
1 31.519803
2 62.253392
3 37.591100
4 61.533847
VARIANCE:
preg 11.354056
plas 1022.248314
pres 374.647271
skin 254.473245
insu 13281.180078
mass 62.159984
pedi 0.109779
age 138.303046
dtype: float64
SKEWNESS:
preg 0.901674
plas 0.173754
pres -1.843608
skin 0.109372
insu 2.272251
dtype: float64
KURTOSIS:
preg 0.159220
plas 0.640780
pres 5.180157
skin -0.520072
insu 7.214260
mass 3.290443
pedi 5.594954
age 0.643159
dtype: float64
GRAPH:
PROGRAM:
import pandas as pd
import numpy as np
df=pd.read_csv("pima-indians-diabetes.csv")
df.head()
df.mean(axis = 0)
print(df.loc[:,'35'].mean())
df.mean(axis = 1)[0:5]
df.median()
print(df.loc[:,'33.6'].median())
df.std()
print(df.loc[:,'35'].std())
print(df.skew())
print(df.kurtosis())
norm_data = pd.DataFrame(np.random.normal(size=100000))
norm_data.plot(kind="density",figsize=(10,10));
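In the program above the columns are addressed as '35' and '33.6' because the CSV file has no header row, so read_csv uses the first data row as column names. A possible alternative, sketched below, is to pass header=None together with explicit names. The three rows are taken from the values shown in this manual's output, and the column names are borrowed from the earlier diabetes listing; applying them to this file is an assumption.

```python
import io
import pandas as pd

# Stand-in for the headerless CSV file
csv_text = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)
columns = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]

# Default behaviour: the first row is consumed as the header
guessed = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data and applies the given names
named = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)

print(list(guessed.columns))
print(named["skin"].mean())
```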
OUTPUT:
HEAD DATA:
0 1 85 66 29 0 26.6 0.351 31 0
2 1 89 66 23 94 28.1 0.167 21 0
MEAN:
20.517601043024772
0 26.550111
1 34.663556
2 35.807444
3 51.043111
4 27.866778
dtype: float64
MODE:
MEDIAN:
32.0
0 26.6
1 8.0
2 23.0
3 35.0
4 5.0
dtype: float64
STANDARD DEVIATION:
15.954059060433842
0 31.119744
1 59.585320
2 37.639873
3 60.541569
4 41.114755
dtype: float64
VARIANCE:
6 11.362809
148 1022.622445
72 375.125415
35 254.532001
0 13290.194335
33.6 62.237755
0.627 0.109890
50 138.116452
1 0.227226
dtype: float64
SKEWNESS:
6 0.903976
148 0.176412
72 -1.841911
35 0.112058
0 2.270630
33.6 -0.427950
0.627 1.921190
50 1.135165
1 0.638949
dtype: float64
KURTOSIS:
6 0.161293
148 0.642992
72 5.168578
35 -0.518325
0 7.205266
33.6 3.282498
0.627 5.593374
50 0.660872
1 -1.595913
dtype: float64
GRAPH:
PROGRAM:
import pandas as pd
import numpy as np
%matplotlib inline
diabetes=pd.read_csv("C:\\Users\\KSK\\Documents\\diabetes.csv")
diabetes.head()
from sklearn import datasets
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)
diabetes.feature_names
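# Now we will split the data into the independent and dependent variables
X = diabetes.data[:,np.newaxis,3]
Y = diabetes.target
# We will split the data into training and testing data
from sklearn.model_selection import train_test_split
# test_size=0.3 and random_state=42 are assumed; the original call is missing
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)
# Linear Regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)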
y_pred = reg.predict(x_test)
Coef=reg.coef_
print(Coef)
from sklearn.metrics import mean_squared_error, r2_score
MSE=mean_squared_error(y_test,y_pred)
R2=r2_score(y_test,y_pred)
print(R2,MSE)
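import matplotlib.pyplot as plt
# Predicted values against actual values
plt.scatter(y_pred, y_test)
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
# Fitted regression line over the test feature
plt.plot(x_test,y_pred,linewidth=2)
plt.title('Linear Regression')
plt.xlabel('x_test')
plt.ylabel('y_pred')
plt.show()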
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score = model.score(x_test,y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_predict))
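The metrics used above can be written out by hand to show what they measure. In this sketch np.polyfit stands in for scikit-learn's LinearRegression, the data is synthetic (generated from a known line plus noise), and MSE and R2 are computed directly from their definitions. The last lines compute the accuracy implied by the confusion matrix printed in this manual's output: correct predictions lie on the diagonal.

```python
import numpy as np

# Synthetic data: y = 3x + 2 plus random noise (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight line
y_pred = slope * x + intercept

mse = np.mean((y - y_pred) ** 2)             # mean squared error
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Accuracy from a confusion matrix: diagonal sum divided by the total
cm = np.array([[130, 17], [38, 46]])
accuracy = np.trace(cm) / cm.sum()

print(slope, intercept, mse, r2)
print(round(accuracy, 3))
```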
OUTPUT:
DIABETES DESCRIPTION:
Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n = 442
diabetes patients, as well as the response of interest, a
: Attribute Information:
- Sex
COEFFICIENT VALUE:
[731.87600042]
LINEAR REGRESSION:
MODEL SCORE FOR LOGISTIC REGRESSION:
0.007518796992481203
[[130 17]
[ 38 46]]
PROGRAM:
import numpy as np
import pandas as pd
diabetes=pd.read_csv("C:\\Users\\KSK\\Documents\\FDS LAb\\diabetes.csv")
diabetes.head()
import statsmodels.api as sm
predictions = model2.predict(X) # make the predictions by the model
# Print out the statistics
model2.summary()
OUTPUT:
HEAD DATA:
1 1 85 66 29 0 26.6 0.351 31 0
Df Model: 2
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\KSK\\Documents\\train.csv")
df.head()
mean = df.loc[:,'Fare'].mean()
sd = df.loc[:,'Fare'].std()
plt.show()
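The plotting step between computing mean and sd and calling plt.show() is not shown in the listing above. One way to build a normal curve from those two numbers is sketched below; the values of mean and sd are placeholders (in the program they come from the Fare column), and the range of four standard deviations on each side is an arbitrary choice.

```python
import numpy as np

mean, sd = 32.0, 50.0   # placeholder values, not computed from the data

# Evaluate the normal density on a grid around the mean
x = np.linspace(mean - 4 * sd, mean + 4 * sd, 801)
pdf = np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Sanity check: the area under the curve should be close to 1
area = pdf.sum() * (x[1] - x[0])
print(area)
```

Passing x and pdf to plt.plot(x, pdf) before plt.show() would then draw the curve.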
OUTPUT:
NORMAL CURVE:
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\KSK\\Documents\\train.csv")
df.head()
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df["Fare"], kde=True)
sns.histplot(df["Age"], kde=True)
plt.contour(df[["Fare","Parch"]])
OUTPUT:
DENSITY PLOT:
CONTOUR PLOT:
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\KSK\\Documents\\train.csv")
df.head()
plt.figure(figsize=(8,8))
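# Heat map of the correlations between the numeric columns
sn.heatmap(df.corr(numeric_only=True), annot=True)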
plt.show()
OUTPUT:
SCATTER PLOT:
HEAT MAP:
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\KSK\\Documents\\train.csv")
df.head()
plt.hist(df["Fare"])
OUTPUT:
HISTOGRAM:
array([732., 106., 31., 2., 11., 6., 0., 0., 0., 3.]),
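plt.hist returns the bin counts, the bin edges, and the drawn patches; the array shown above is the first of these. The sketch below uses np.histogram to obtain counts and edges without drawing anything. The fares array is made up for illustration, while source_counts repeats the counts printed above.

```python
import numpy as np

fares = np.array([7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07])

# counts has one entry per bin; edges has one more entry than counts
counts, edges = np.histogram(fares, bins=5)
print(counts)
print(edges)

# The counts printed in the output above add up to the number of rows plotted
source_counts = np.array([732, 106, 31, 2, 11, 6, 0, 0, 0, 3])
print(source_counts.sum())
```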
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
%matplotlib inline
xdata = df[["Age"]]
ydata = df[["Parch"]]
from mpl_toolkits.basemap i
draw_map(m)
draw_map(m);
draw_map(m)
OUTPUT:
ORTHO PROJECTION:
MAPPING LONGITUDE AND LATITUDE:
CYLINDRICAL PROJECTIONS:
PSEUDO-CYLINDRICAL PROJECTIONS:
PERSPECTIVE PROJECTION:
CONIC PROJECTION: