Data Mining Lab Record

Ex no:01
Date: 15-02-2024
Data Cleaning

Aim:
To perform data cleaning techniques for the given dataset.


Implementation:
import pandas as pd

# Load the dataset into a DataFrame (the file name below is a placeholder; use the actual path)
df = pd.read_csv("food_choices.csv")

df.describe()
df.info()
print("Column Names:")
print(df.columns.tolist())
data_types = df.dtypes
print(data_types)
columns_to_check = ['GPA', 'Gender', 'breakfast', 'calories_chicken', 'calories_day', 'calories_scone',
'coffee', 'comfort_food', 'comfort_food_reasons', 'comfort_food_reasons_coded', 'cook',
'comfort_food_reasons_coded.1', 'cuisine', 'diet_current', 'diet_current_coded', 'drink',
'eating_changes', 'eating_changes_coded', 'eating_changes_coded1', 'eating_out', 'employment',
'ethnic_food', 'exercise', 'father_education', 'father_profession', 'fav_cuisine', 'fav_cuisine_coded',
'fav_food', 'food_childhood', 'fries', 'fruit_day', 'grade_level', 'greek_food', 'healthy_feeling',
'healthy_meal', 'ideal_diet', 'ideal_diet_coded', 'income', 'indian_food', 'italian_food', 'life_rewarding',
'marital_status', 'meals_dinner_friend', 'mother_education', 'mother_profession', 'nutritional_check',
'on_off_campus', 'parents_cook', 'pay_meal_out', 'persian_food', 'self_perception_weight', 'soup',
'sports', 'thai_food', 'tortilla_calories', 'turkey_calories', 'type_sports', 'veggies_day', 'vitamins',
'waffle_calories', 'weight']
missing_values_count = df[columns_to_check].isnull().sum()
print("Number of missing values in each column:")
print(missing_values_count)
# Drop rows that have missing values in any of the checked columns, then drop the 'weight' column
cleaned_df = df.dropna(subset=columns_to_check)
cleaned_df = cleaned_df.drop('weight', axis=1)
missing_values_count_cleaned = cleaned_df.isnull().sum()
print("Number of missing values in each column of the cleaned DataFrame:")
print(missing_values_count_cleaned)
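A common alternative to dropping rows during cleaning is to impute the missing values instead. The following is a minimal sketch (not part of the recorded implementation), assuming the same DataFrame df: numeric columns are filled with their mean and object columns with their most frequent value.

from sklearn.impute import SimpleImputer

# Impute numeric columns with the column mean
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

# Impute object/categorical columns with the most frequent value
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[obj_cols])

print(df.isnull().sum().sum())  # should now report 0 missing values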
Output:
GPA object
Gender int64
breakfast int64
calories_chicken int64
calories_day float64
...
type_sports object
veggies_day int64
vitamins int64
waffle_calories int64
weight object
Length: 61, dtype: object

Number of missing values in each column:
GPA 2
Gender 0
breakfast 0
calories_chicken 0
calories_day 19
..
type_sports 26
veggies_day 0
vitamins 0
waffle_calories 0
weight 2
Length: 61, dtype: int64
Ex no:02
Date:15-02-2024
Data Normalization
Aim:
To implement data normalization techniques such as Z-score normalization, decimal scaling and min-max
scaling for the dataset.
Implementation:

 Min-Max Normalization:
Formula: X_norm = (X - X_min) / (X_max - X_min)

Where:
X – original value
X_min – minimum value in the dataset
X_max – maximum value in the dataset
X_norm – normalized value

 Z-score Normalization:
Formula: Z = (X - µ) / σ

Where:
X – original value
µ – mean of the dataset
σ – standard deviation of the dataset
Z – standardized value

 Normalization by Decimal Scaling:
Formula: X_norm = X / 10^n

Where:
X – original value
n – smallest integer such that max(|X_norm|) < 1
X_norm – normalized value

Code:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numerical_columns = cleaned_df.select_dtypes(include=['float64', 'int64']).columns

# Note: each transform below overwrites the previous result in cleaned_df

# Min-max normalization
min_max_scaler = MinMaxScaler()
cleaned_df[numerical_columns] = min_max_scaler.fit_transform(cleaned_df[numerical_columns])

# Z-score normalization
z_score_scaler = StandardScaler()
cleaned_df[numerical_columns] = z_score_scaler.fit_transform(cleaned_df[numerical_columns])

# Scaling without mean-centring (used here in place of decimal scaling)
decimal_scaler = StandardScaler(with_mean=False)
cleaned_df[numerical_columns] = decimal_scaler.fit_transform(cleaned_df[numerical_columns])

cleaned_df.head(5)
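Note that StandardScaler(with_mean=False) divides each column by its standard deviation, which is not the X / 10^n decimal-scaling formula given above. A minimal sketch of decimal scaling following that formula (an addition, not part of the recorded code) is:

import numpy as np

def decimal_scale(column):
    # n is the smallest integer such that max(|X / 10^n|) < 1
    max_abs = column.abs().max()
    n = int(np.floor(np.log10(max_abs))) + 1 if max_abs > 0 else 0
    return column / (10 ** n)

decimal_scaled_df = cleaned_df[numerical_columns].apply(decimal_scale)
decimal_scaled_df.head(5)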
Output:
Ex no:03
Date: 01-03-2024
Apriori and FP-growth Algorithm

Aim:
To extract frequent itemsets using candidate generation (Apriori) and without candidate generation (FP-growth).

Procedure:
 Consider a dataset or set of items for the transactions and load it into a DataFrame.
 One-hot encode the transaction data.
 Use the Apriori and FP-growth algorithms to find the frequent item patterns.
 Obtain the association rules for both techniques.
 Metrics such as antecedent, consequent, conviction, support and confidence are used to
evaluate the mined patterns and rules.

Support(A) = (number of transactions in which itemset A appears) / (total number of transactions)

Confidence(A => B) = Support(A ∪ B) / Support(A)
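For example, from the recorded output below, Support(bread) = 33/48 ≈ 0.6875 and Support({milk, bread}) = 17/48 ≈ 0.3542, so Confidence(bread => milk) = 0.3542 / 0.6875 ≈ 0.515.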

Code:
import pandas as pd
from mlxtend.frequent_patterns import apriori,
fpgrowth from mlxtend.frequent_patterns import
association_rules data = {
"Transaction ID": list(range(1, 49)),
"Items": [
{'milk', 'bread', 'eggs'},
{'bread', 'butter', 'cheese'},
{'milk', 'bread', 'butter'},
{'eggs', 'cheese'},
{'milk', 'bread', 'cheese'},
{'bread', 'butter'},
{'milk', 'eggs'},
{'bread', 'eggs'},
{'butter', 'cheese'},
{'milk', 'bread', 'butter', 'cheese'},
]
}
df = pd.DataFrame(data)
df['Items'] = df['Items'].apply(lambda x: ', '.join(sorted(x)))
df = df['Items'].str.get_dummies(sep=', ').join(df['Transaction ID'])
df.set_index('Transaction ID', inplace=True)
frequent_itemsets_apriori = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets_fpgrowth = fpgrowth(df, min_support=0.1, use_colnames=True)
rules_apriori = association_rules(frequent_itemsets_apriori, metric="lift",
min_threshold=1)
rules_fpgrowth = association_rules(frequent_itemsets_fpgrowth, metric="lift", min_threshold=1)
print("Frequent Itemsets (Apriori):")
display(frequent_itemsets_apriori)
print("\nFrequent Itemsets (FP-
Growth):")
display(frequent_itemsets_fpgrowth)
print("Association Rules (Apriori):")
display(rules_apriori)
print("\nAssociation Rules (FP-Growth):")
display(rules_fpgrowth)
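As a side note, mlxtend also provides a TransactionEncoder that one-hot encodes transaction lists directly, avoiding the join-and-split through strings. A minimal sketch (not part of the recorded code), applied to the itemsets listed above:

from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
transactions = [list(items) for items in data["Items"]]
te_array = te.fit(transactions).transform(transactions)
encoded_df = pd.DataFrame(te_array, columns=te.columns_)
frequent_itemsets = apriori(encoded_df, min_support=0.1, use_colnames=True)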

Output:
Frequent Itemsets (Apriori):
support itemsets
0 0.687500 (bread)
1 0.437500 (butter)
2 0.541667 (cheese)
3 0.458333 (eggs)
4 0.458333 (milk)
5 0.250000 (butter, bread)
6 0.333333 (cheese, bread)
7 0.333333 (eggs, bread)
8 0.354167 (milk, bread)
9 0.312500 (butter, cheese)
10 0.125000 (butter, milk)
11 0.125000 (eggs, cheese)
12 0.208333 (milk, cheese)
13 0.229167 (milk, eggs)
14 0.125000 (butter, cheese, bread)
15 0.125000 (butter, milk, bread)
16 0.104167 (cheese, eggs, bread)
17 0.208333 (cheese, milk, bread)
18 0.125000 (milk, eggs, bread)
19 0.104167 (butter, milk, cheese)
20 0.104167 (butter, cheese, milk, bread)

Frequent Itemsets (FP-Growth):


support itemsets
0 0.687500 (bread)
1 0.458333 (milk)
2 0.458333 (eggs)
3 0.541667 (cheese)
4 0.437500 (butter)
5 0.354167 (milk, bread)
6 0.208333 (milk, cheese)
7 0.208333 (cheese, milk, bread)
8 0.333333 (eggs, bread)
9 0.229167 (milk, eggs)
10 0.125000 (eggs, cheese)
11 0.125000 (milk, eggs, bread)
12 0.104167 (cheese, eggs, bread)
13 0.333333 (cheese, bread)
14 0.312500 (butter, cheese)
15 0.250000 (butter, bread)
16 0.125000 (butter, milk)
17 0.125000 (butter, cheese, bread)
18 0.125000 (butter, milk, bread)
19 0.104167 (butter, milk, cheese)
20 0.104167 (butter, cheese, milk, bread)
Ex no:04
Date:14-03-2024
Extracting Patterns from Multi-dimensional Data

Aim:
To extract patterns from the chosen dataset, which is multi-dimensional.

Procedure:
 Load the dataset.
 Drop the unneeded columns and separate the target variables.
 Use a standard scaler to scale the data so that it can be used to train the model.
 Remove the null values by imputing the mean value in each null position.
 Determine the number of clusters to be used by plotting the elbow curve.
 Using PCA, determine the principal components, which can then be used for
further prediction and analysis.
 Label the clusters and print the original dimensionality and the reduced dimensionality.

Code:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
# Load the dataset (path as recorded)
data = pd.read_csv("C:\\Users\\LENOVO\\Downloads\\car_prices.csv\\car_prices.csv")
numeric_features = ['odometer', 'mmr', 'sellingprice']
non_numeric_features = ['make', 'model']
# Impute missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
data[numeric_features] = imputer.fit_transform(data[numeric_features])
X_numeric = data[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_pca)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
print("Eigenvalues:")
print(pca.explained_variance_)
print("\nEigenvectors:")
print(pca.components_)
print("\nCluster Centroids:")
centroids = scaler.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_))
print(pd.DataFrame(centroids, columns=numeric_features))
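The procedure also asks to print the original and the reduced dimensionality; a small addition (not part of the recorded code or output) would report this along with the share of variance retained by the two principal components:

print("Original dimensionality:", X_scaled.shape[1])
print("Reduced dimensionality:", X_pca.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)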

Output:
Eigenvalues:
[2.45449366 0.52917194]

Eigenvectors:
[[-0.49452681 0.61516816 0.61401251]
[ 0.86913926 0.3448503 0.35450701]]

Cluster Centroids:
odometer mmr sellingprice
0 42227.768830 14227.743279 14032.671056
1 130715.926025 5019.314721 4853.519667
2 29827.178914 30570.382605 30560.756739
Ex no:05
Date: 15-03-2024
Linear Regression

Aim:
To develop a model that applies linear regression for prediction.

Procedure:
 Load the dataset.
 Drop the unneeded columns and separate the target variable.
 Fit the model and evaluate it using the mean squared error, the coefficients and the intercept.

Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = pd.read_csv("C:\\Users\\LENOVO\\Downloads\\Salary_dataset.csv")
X = data[['YearsExperience']]
y = data['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Coefficients:", model.coef_)
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Testing data')
plt.plot(X_train, model.predict(X_train), color='green', label='Linear Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

Output:
Mean Squared Error: 49830096.855908334
Coefficients: [9423.81532303]
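The procedure also mentions the intercept, which the recorded code does not print; a small addition (not reflected in the recorded output) would be:

print("Intercept:", model.intercept_)
print("R^2 on the test set:", model.score(X_test, y_test))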
Ex no:06
Date:21-03-2024
Bayesian Belief Network

Aim:
To implement a Bayesian Belief Network for the training dataset.
Procedure:
 Import the necessary libraries.
 Define the structure of the Bayesian network based on the relationships between the
different variables in the dataset.
 Train the Bayesian network model.
 Perform inference using the trained model (a library-based sketch of these steps follows this list).
 Print the result of the inference for each unique class of breast cancer, i.e. whether the tumour is malignant or not.
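The implementation below builds a custom cluster-based Bayesian classifier rather than using a Bayesian-network library. For the structure/train/infer steps described above, a minimal sketch with the pgmpy library would look roughly as follows (for illustration only: the feature names 'age' and 'tumor-size' are placeholders for actual dataset columns, and the recorded implementation does not use pgmpy):

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.read_csv("breast-cancer-data.csv")  # placeholder path

# Define the structure of the network from assumed relationships between variables
model = BayesianNetwork([('age', 'class'), ('tumor-size', 'class')])

# Train the conditional probability tables from the data
model.fit(data[['age', 'tumor-size', 'class']], estimator=MaximumLikelihoodEstimator)

# Perform inference: probability of each class given some observed evidence
# (the evidence value must be a state that actually appears in the data)
infer = VariableElimination(model)
print(infer.query(variables=['class'], evidence={'age': '40-49'}))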

Implementation:
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt


class CancerClassifier(object):
    """Infers a hidden variable and uses Bayesian classification to predict whether a tumor
    is malignant or benign."""

    def __init__(self, filename):
        data = pd.read_csv(filename)
        # Greedily group the rows into clusters of similar records (the hidden variable)
        clusters = []
        for _, row in data.iterrows():
            best = -1
            sim = 0.5
            for j, cluster in enumerate(clusters):
                x = sum(cluster[key][value] / sum(cluster[key].values())
                        for key, value in row.items()) / len(data.columns)
                if x > sim:
                    best = j
                    sim = x
            if best == -1:
                # No existing cluster is similar enough, so start a new one
                clusters.append(collections.defaultdict(lambda: collections.defaultdict(float)))
                print(len(clusters), 'clusters found')
            for key, value in row.items():
                clusters[best][key][value] += 1.0
        index = []
        for column in data.columns:
            index.extend([(column, value) for value in data[column].unique()])
        # Laplace-smoothed counts of each (feature, value) pair per cluster
        self.probabilities = pd.DataFrame({(key, value): [cluster[key][value] + 1.0 for cluster in clusters]
                                           for key, value in index}).T
        self.prior = self.probabilities.sum(axis=0)
        self.prior /= self.prior.sum()
        self.edibility_prior = self.probabilities.loc['class'].sum(axis=1)
        self.edibility_prior /= self.edibility_prior.sum()

        def normalize(group):
            return group.div(group.sum(axis=0), axis='columns')

        self.probabilities = self.probabilities.groupby(axis=0, level=0).apply(normalize)

    def __call__(self, **kwargs):
        """Estimates the probability that a tumor is malignant given the features in kwargs."""
        category = self.prior.copy()
        for key, value in kwargs.items():
            category *= self.probabilities.loc[(key, value)]
        category /= category.sum()
        result = self.edibility_prior * ((self.probabilities.loc['class'] * category).sum(axis=1))
        return result / result.sum()

    def test(self, filename):
        """Produces KDE plots of the estimated probability."""
        data = pd.read_csv(filename)
        observables = [column for column in data.columns if column != 'class']
        results = pd.DataFrame([self(**row) for _, row in data[observables].iterrows()])
        results['class'] = data['class']
        return results


# Replace the path below with the actual path to the dataset
CC = CancerClassifier("C:\\Users\\LENOVO\\Downloads\\breast-cancer-data.csv")
CC.edibility_prior.plot.bar()
plt.show()
CC.prior.plot.bar()
plt.show()
CC.probabilities.loc['class'].T.plot.bar()
plt.show()

Output:
1 clusters found
2 clusters found
3 clusters found
4 clusters found
5 clusters found
6 clusters found
7 clusters found
8 clusters found
9 clusters found
10 clusters found
11 clusters found
12 clusters found
13 clusters found
14 clusters found
15 clusters found
16 clusters found
17 clusters found
18 clusters found
19 clusters found
20 clusters found
[Bar plots of the class prior (edibility_prior), the cluster prior, and the per-cluster class probabilities, with legend entries (class, recurrence-events) and (class, no-recurrence-events); axis values omitted.]
Ex no:07
Date: 21-03-2024
Outliers Detection

Aim:
To implement outlier detection using techniques such as the Z-score, standard deviation and
inter-quartile range methods.
Procedure:
 Import the necessary libraries.
 Define a list containing the names of the columns.
 Apply each of the methods: standard deviation, Z-score and inter-quartile range (a small numeric illustration follows this list).
Standard Deviation:
 Calculate the mean and standard deviation for each feature.
 Identify outliers as data points that fall more than a chosen number of standard deviations from the mean.
 Common thresholds are ±2 or ±3 standard deviations.
Z-score method:
 The Z-score represents how many standard deviations an observation is from the mean.
 Identify outliers as data points with a Z-score above or below a threshold.
Inter-Quartile Range:
 Calculate the first quartile (Q1) and the third quartile (Q3).
 Calculate the inter-quartile range, IQR = Q3 - Q1, and flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers.
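As a quick numeric illustration of the inter-quartile range rule (a minimal sketch, not part of the recorded code):

import pandas as pd

s = pd.Series([10, 11, 12, 13, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)   # q1 = 11, q3 = 13
iqr = q3 - q1                                  # IQR = 2
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])  # flags only the value 100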

Code:
from sklearn.datasets import load_wine
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame(load_wine()["data"],columns=load_wine()["feature_names"])
data.head()
data.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20));

IQR Method:
def iqr_outlier(x, factor):
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    min_ = q1 - factor * iqr
    max_ = q3 + factor * iqr
    result_ = pd.Series([0] * len(x))
    result_[((x < min_) | (x > max_))] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = iqr_outlier(data[i], 1.5)
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Z Score method:
def zscore_outlier(x, lb, ub):
    zscore = ((x - x.mean()) / x.std()).copy()
    result_ = pd.Series([0] * len(x))
    result_[((zscore < lb) | (zscore > ub))] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = zscore_outlier(data[i], -3, 3)
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Standard Deviation Method:

def std_dev_outlier(x, threshold):
    mean = x.mean()
    std_dev = x.std()
    lower_bound = mean - threshold * std_dev
    upper_bound = mean + threshold * std_dev
    result_ = pd.Series([0] * len(x))
    result_[(x < lower_bound) | (x > upper_bound)] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = std_dev_outlier(data[i], threshold=3)  # change the threshold as needed
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Output:
Ex no:08
Date:21-03-2024
Evaluation Measures for Text Retrieval

Aim:
To evaluate the retrieval of the text using various measures.

Procedure:
 Define the path containing the dataset.
 Read the csv file using Pandas function and load the dataframe.
 Define Query and Query Category representing the search and category of interest.
 Calculating metrices which are the Total Retrivel Documents , True Postives , Retrived
relevant documents , Precision ,Recall , F1-score.
 Print the result.

Code:
import pandas as pd
import numpy as np
from google_play_scraper import app, Sort, reviews_all
import plotly.express as px

# Scrape all reviews of the app from the Google Play store
hk_project = reviews_all('com.hikingproject.android',
                         sleep_milliseconds=0, lang='en', country='IN', sort=Sort.NEWEST)
df = pd.json_normalize(hk_project)
df.head()

from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")
print(sentiment_analysis("I like your application alot!"))
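The precision, recall and F1-score reported in the output are not computed by the code above; they would come from a step along these lines (a minimal sketch with hypothetical variables: retrieved_docs holds the indices of documents returned for the query and relevant_docs the indices labelled relevant for the query category):

retrieved = set(retrieved_docs)   # documents returned by the search
relevant = set(relevant_docs)     # documents actually relevant to the query category
true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved) if retrieved else 0.0
recall = true_positives / len(relevant) if relevant else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print("Precision:", precision, "Recall:", recall, "F1-score:", f1)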

Output:
Precision_score = 0.6999
Recall = 1.0
F1-score = 0.1980
