Outliers, Hypothesis and Natural Language Processing

This document covers outlier treatment and hypothesis testing on the iris dataset and on synthetic samples, followed by basic natural language processing on a sample sentence. It shows how to identify outliers with the interquartile range rule and replace them with the median, use Kolmogorov–Smirnov tests to check whether a sample follows a normal distribution, and convert text to word-count vectors by removing stop words and applying a bag-of-words model.

Uploaded by

subhajitbasak001

Week-07 Outliers, Hypothesis and Natural Language Processing

[25]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

[26]: iris = pd.read_csv('iris.csv')

iris

[26]:
     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]

[27]: iris.columns

[27]: Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

[28]: import pandas as pd

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

[29]: target_column = 'species'

X = iris.drop(target_column, axis=1)
y = iris[target_column]

[30]: le = LabelEncoder()
y_encoded = le.fit_transform(y)
iris[target_column] = y_encoded
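
As a rough illustration of what `LabelEncoder` does in the cell above, the sketch below maps string labels to integers in plain Python. It relies on the fact that `LabelEncoder` assigns codes in sorted order of the distinct classes. The `labels` list and the `encode_labels` helper are made up for this example and are not part of the notebook.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes


# Illustrative labels (not read from iris.csv)
labels = ["setosa", "virginica", "setosa", "versicolor"]
codes, classes = encode_labels(labels)
print(codes)    # [0, 2, 0, 1]
print(classes)  # ['setosa', 'versicolor', 'virginica']
```

With scikit-learn itself, `le.classes_` holds the same sorted class list, and `le.inverse_transform` converts codes back to names.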

[31]: # Pearson correlation heatmap of all columns (species is label-encoded above)
sns.heatmap(iris.corr(method='pearson'), annot=True)

plt.show()
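
Each cell of the heatmap is a Pearson correlation coefficient. The sketch below computes one such coefficient by hand in plain Python, using only the five `sepal_length` and `sepal_width` values visible at the top of the iris output above. The `pearson` helper is a hypothetical name for this example, and a coefficient from five rows will not match the value the heatmap shows for all 150 rows.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First five rows of the iris table shown above
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6]

r = pearson(sepal_length, sepal_width)
print(round(r, 2))
```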

[34]: # Treating outliers
var = iris['sepal_width']
var

[34]: 0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
…
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64

[35]: q1 = np.percentile(var, 25)

q3 = np.percentile(var, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in var if x < lower_bound or x > upper_bound]

outliers

[35]: [4.4, 4.1, 4.2, 2.0]
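
The IQR rule used above can be wrapped in a small reusable function. The following is a minimal sketch in plain Python, assuming linear interpolation between sorted values for the quartiles (the default behaviour of `np.percentile`). The `percentile` and `iqr_outliers` helpers and the sample values are illustrative and do not come from the notebook.

```python
def percentile(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative data (not the iris column)
values = [2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]
outliers = iqr_outliers(values)
print(outliers)  # [2.0, 4.4]
```

The multiplier `k=1.5` matches the notebook; a larger value flags only more extreme points.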

[36]: median_data = var.median()

median_data

[36]: 3.0

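[37]: # Replace each outlier with the median (var has the default 0..149 index,
# so var[i] is a label lookup)
for i in range(len(var)):
    if var[i] in outliers:
        var[i] = median_data

print("Data with Outliers Replaced by Median:\n", var)
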
Data with Outliers Replaced by Median:
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
…
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
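
The replacement loop can also be written without index-based assignment. The sketch below is one way to do it in plain Python: it builds a new list and leaves the input untouched. The helper name, the sample values, and the bounds 2.5 and 4.0 are assumptions chosen for illustration. As in the notebook, the median is taken from the original data, outliers included.

```python
import statistics


def replace_outliers_with_median(values, lower, upper):
    """Return a new list where values outside [lower, upper] become the median."""
    median = statistics.median(values)  # computed before any replacement
    return [median if (v < lower or v > upper) else v for v in values]


# Illustrative data and bounds (not taken from the iris column)
values = [2.0, 2.9, 3.0, 3.0, 3.0, 3.2, 3.4, 4.4]
cleaned = replace_outliers_with_median(values, lower=2.5, upper=4.0)
print(cleaned)  # [3.0, 2.9, 3.0, 3.0, 3.0, 3.2, 3.4, 3.0]
```

Because replacing values shifts the quartiles, it is worth recomputing the bounds on the cleaned data afterwards.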

[38]: # Recompute the IQR bounds on the treated data and check for remaining outliers
q1 = np.percentile(var, 25)
q3 = np.percentile(var, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in var if x < lower_bound or x > upper_bound]

print(outliers)
if len(outliers) == 0:
    print("No outliers.")

[]
No outliers.

[32]: import sweetviz as sv

[33]: advert_report = sv.analyze(iris)

# Display the report
advert_report.show_html('Advertising.html')

Report Advertising.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Hypothesis
[39]: import numpy as np
from scipy.stats import kstest, norm

# Generate a reference sample from a standard normal distribution
# (not used by the test below, which is run on var)
np.random.seed(0)  # Setting a seed for reproducibility
sample_data = np.random.normal(loc=0, scale=1, size=1000)

# Perform a KS test of var (sepal_width after outlier treatment) against the
# standard normal N(0, 1). kstest(..., 'norm') uses mean 0 and standard
# deviation 1 by default, and var is centred near 3 (median 3.0 above), so a
# rejection here mainly reflects the location mismatch rather than the shape.
ks_statistic, p_value = kstest(var, 'norm')

# Define the significance level (alpha)
alpha = 0.05

# Check the result of the KS test
if p_value < alpha:
    print(f"The data does NOT follow a normal distribution (p-value = {p_value})")
else:
    print(f"The data follows a normal distribution (p-value = {p_value})")

The data does NOT follow a normal distribution (p-value = 5.8803781394734095e-279)
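
To see what the KS statistic measures, the sketch below computes it by hand in plain Python: the largest gap between the sample's empirical CDF and a reference normal CDF. The helper names and the sample values are assumptions made for this example. It compares the same sample against N(0, 1) and against a normal with the sample's own mean and standard deviation.

```python
import math
import statistics


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of sample and cdf."""
    s = sorted(sample)
    n = len(s)
    d = 0.0
    for i, x in enumerate(s):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Illustrative values centred near 3 (not the iris column)
values = [2.5, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.6]

d_raw = ks_statistic(values, normal_cdf)  # against N(0, 1)
mu, sigma = statistics.mean(values), statistics.stdev(values)
d_fit = ks_statistic(values, lambda x: normal_cdf(x, mu, sigma))

print(d_raw > 0.9)    # True: almost all of N(0, 1) lies below the sample
print(d_fit < d_raw)  # True: the fitted normal is far closer
```

In SciPy the reference parameters can be passed as `kstest(var, 'norm', args=(mean, std))`. When the mean and standard deviation are estimated from the same data, the reported p-value tends to be too large, so it should be read with caution.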

[40]: # Generate a sample of data that you want to test
np.random.seed(0)  # Setting a seed for reproducibility
sample_data_1 = np.random.normal(0, 1, 100)  # Sample data from a normal distribution

# Perform a KS test to check if sample_data_1 follows a standard normal distribution
ks_statistic, p_value = kstest(sample_data_1, 'norm')

# Define the significance level (alpha)
alpha = 0.05

# Check the result of the KS test
if p_value < alpha:
    print(f"The sample does NOT follow a normal distribution (p-value = {p_value})")
else:
    print(f"The sample follows a normal distribution (p-value = {p_value})")

The sample follows a normal distribution (p-value = 0.8667717341286251)

[41]: # Generate a sample of data that you want to test
np.random.seed(0)  # Setting a seed for reproducibility
sample_data_2 = np.random.uniform(0, 1, 100)  # Sample data from a uniform distribution on [0, 1)

# Perform a KS test to check if sample_data_2 follows a standard normal distribution
ks_statistic, p_value = kstest(sample_data_2, 'norm')

# Define the significance level (alpha)
alpha = 0.05

# Check the result of the KS test
if p_value < alpha:
    print(f"The sample does NOT follow a normal distribution (p-value = {p_value})")
else:
    print(f"The sample follows a normal distribution (p-value = {p_value})")

The sample does NOT follow a normal distribution (p-value = 7.902176095057778e-24)
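
The rejection of the uniform sample can be understood without running the test. Every value drawn from a uniform distribution on [0, 1) is non-negative, so the empirical CDF is 0 just below 0 while the standard normal CDF is already 0.5 there, which forces the KS statistic to be at least 0.5. The sketch below compares that floor with the usual large-sample critical value at alpha = 0.05. The helper names are hypothetical, and the formula is the asymptotic approximation, which may differ slightly from the exact computation SciPy performs.

```python
import math


def ks_critical_value(n, alpha=0.05):
    """Large-sample critical value for the one-sample KS statistic."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)


def standard_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


n = 100                             # sample size used in the notebook
d_crit = ks_critical_value(n)
d_floor = standard_normal_cdf(0.0)  # lower bound on D for non-negative data

print(round(d_crit, 4))   # 0.1358
print(d_floor > d_crit)   # True
```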
Natural Language Processing
[ ]: # This section covers converting text into a vector
import pandas as pd
import numpy as np
import collections
import re

[ ]: # Sample document
doc1 = 'Game of Thrones is an amazing tv series!, Game of Thrones is the best tv series! and Game of Thrones is so great'

# Lowercase the sentence, remove punctuation, and split it into words
w_doc1 = re.sub(r'[^\w\s]', '', doc1.lower()).split()
# Print the list of words without punctuation
print(w_doc1)

['game', 'of', 'thrones', 'is', 'an', 'amazing', 'tv', 'series', 'game', 'of',
'thrones', 'is', 'the', 'best', 'tv', 'series', 'and', 'game', 'of', 'thrones',
'is', 'so', 'great']

[ ]: import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data…

[nltk_data] Unzipping corpora/stopwords.zip.
True

[ ]: stop_words = set(stopwords.words('english'))
filtered_words = [word for word in w_doc1 if word.lower() not in stop_words]

# Reconstruct the text without stop words

filtered_text = ' '.join(filtered_words)

# Print the text without stop words

print(filtered_text)

game thrones amazing tv series game thrones best tv series game thrones great
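
Once the stop words are gone, counting the remaining words is a natural next step toward a bag-of-words vector. A minimal sketch using `collections.Counter` on the filtered text printed above:

```python
from collections import Counter

# Filtered text as printed by the previous cell
filtered_text = 'game thrones amazing tv series game thrones best tv series game thrones great'

word_counts = Counter(filtered_text.split())
print(word_counts.most_common(3))  # [('game', 3), ('thrones', 3), ('tv', 2)]
```

Words with equal counts are returned in the order they first appear in the text.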

[ ]: from sklearn.feature_extraction.text import CountVectorizer

doc1 = ['Game of Thrones is an amazing tv series!, Game of Thrones is the best tv series! and Game of Thrones is so great']

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit the vectorizer on the sentences and transform them into a Bag of Words representation
X = vectorizer.fit_transform(doc1)
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Convert the Bag of Words representation to a dense matrix and print it
print(X.toarray())
print("Feature names (words):", feature_names)

[[1 1 1 1 3 1 3 3 2 1 1 3 2]]
Feature names (words): ['amazing' 'an' 'and' 'best' 'game' 'great' 'is' 'of'
 'series' 'so' 'the' 'thrones' 'tv']
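
To make the bag-of-words output less of a black box, the sketch below rebuilds the same count vector in plain Python. It assumes the default `CountVectorizer` behaviour: text is lowercased, tokens are runs of two or more word characters, and the vocabulary is sorted alphabetically. The `bag_of_words` helper is a hypothetical name for this example, not a scikit-learn function.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    # Lowercase, then keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ['Game of Thrones is an amazing tv series!, Game of Thrones is the best tv series! and Game of Thrones is so great']
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)  # [[1, 1, 1, 1, 3, 1, 3, 3, 2, 1, 1, 3, 2]]
```

Passing several documents produces one row per document over a shared vocabulary, which is the shape `fit_transform` returns as a sparse matrix.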
