Data Mining Lab Record

Ex no:01
Date: 15-02-2024
Data Cleaning

Aim:
To perform data cleaning techniques for the given dataset.


Implementation:
import pandas as pd

# Load the dataset into a DataFrame (the file name below is a placeholder; use the actual path)
df = pd.read_csv("food_choices.csv")

df.describe()
df.info()
print("Column Names:")
print(df.columns.tolist())
data_types = df.dtypes
print(data_types)
columns_to_check = ['GPA', 'Gender', 'breakfast', 'calories_chicken', 'calories_day', 'calories_scone',
'coffee', 'comfort_food', 'comfort_food_reasons', 'comfort_food_reasons_coded', 'cook',
'comfort_food_reasons_coded.1', 'cuisine', 'diet_current', 'diet_current_coded', 'drink',
'eating_changes', 'eating_changes_coded', 'eating_changes_coded1', 'eating_out', 'employment',
'ethnic_food', 'exercise', 'father_education', 'father_profession', 'fav_cuisine', 'fav_cuisine_coded',
'fav_food', 'food_childhood', 'fries', 'fruit_day', 'grade_level', 'greek_food', 'healthy_feeling',
'healthy_meal', 'ideal_diet', 'ideal_diet_coded', 'income', 'indian_food', 'italian_food', 'life_rewarding',
'marital_status', 'meals_dinner_friend', 'mother_education', 'mother_profession', 'nutritional_check',
'on_off_campus', 'parents_cook', 'pay_meal_out', 'persian_food', 'self_perception_weight', 'soup',
'sports', 'thai_food', 'tortilla_calories', 'turkey_calories', 'type_sports', 'veggies_day', 'vitamins',
'waffle_calories', 'weight']
missing_values_count = df[columns_to_check].isnull().sum()
print("Number of missing values in each column:")
print(missing_values_count)
# Drop rows that have missing values in any of the checked columns, then drop the 'weight' column
cleaned_df = df.dropna(subset=columns_to_check)
cleaned_df = cleaned_df.drop('weight', axis=1)
missing_values_count_cleaned = cleaned_df.isnull().sum()
print("Number of missing values in each column of the cleaned DataFrame:")
print(missing_values_count_cleaned)
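A common alternative to dropping rows during cleaning is to impute the missing values instead. The following is a minimal sketch (not part of the recorded implementation), assuming the same DataFrame df: numeric columns are filled with their mean and object columns with their most frequent value.

from sklearn.impute import SimpleImputer

# Impute numeric columns with the column mean
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

# Impute object/categorical columns with the most frequent value
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[obj_cols])

print(df.isnull().sum().sum())  # should now report 0 missing values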
Output:
GPA object
Gender int64
breakfast int64
calories_chicken int64
calories_day float64
...
type_sports object
veggies_day int64
vitamins int64
waffle_calories int64
weight object
Length: 61, dtype: object

Number of missing values in each column:
GPA 2
Gender 0
breakfast 0
calories_chicken 0
calories_day 19
..
type_sports 26
veggies_day 0
vitamins 0
waffle_calories 0
weight 2
Length: 61, dtype: int64
Ex no:02
Date:15-02-2024
Data Normalization
Aim:
To implement data normalization techniques such as Z-score normalization, decimal scaling and min-max
scaling for the dataset.
Implementation:

 Min-Max Normalization:
Formula: X_norm = (X - X_min) / (X_max - X_min)

Where:
X – original value
X_min – minimum value in the dataset
X_max – maximum value in the dataset
X_norm – normalized value

 Z-score Normalization:
Formula: Z = (X - µ) / σ

Where:
X – original value
µ – mean of the dataset
σ – standard deviation of the dataset
Z – standardized value

 Normalization by Decimal Scaling:
Formula: X_norm = X / 10^n

Where:
X – original value
n – smallest integer such that max(|X_norm|) < 1
X_norm – normalized value

Code:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numerical_columns = cleaned_df.select_dtypes(include=['float64', 'int64']).columns

# Note: each transform below overwrites the previous result in cleaned_df

# Min-max normalization
min_max_scaler = MinMaxScaler()
cleaned_df[numerical_columns] = min_max_scaler.fit_transform(cleaned_df[numerical_columns])

# Z-score normalization
z_score_scaler = StandardScaler()
cleaned_df[numerical_columns] = z_score_scaler.fit_transform(cleaned_df[numerical_columns])

# Scaling without mean-centring (used here in place of decimal scaling)
decimal_scaler = StandardScaler(with_mean=False)
cleaned_df[numerical_columns] = decimal_scaler.fit_transform(cleaned_df[numerical_columns])

cleaned_df.head(5)
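Note that StandardScaler(with_mean=False) divides each column by its standard deviation, which is not the X / 10^n decimal-scaling formula given above. A minimal sketch of decimal scaling following that formula (an addition, not part of the recorded code) is:

import numpy as np

def decimal_scale(column):
    # n is the smallest integer such that max(|X / 10^n|) < 1
    max_abs = column.abs().max()
    n = int(np.floor(np.log10(max_abs))) + 1 if max_abs > 0 else 0
    return column / (10 ** n)

decimal_scaled_df = cleaned_df[numerical_columns].apply(decimal_scale)
decimal_scaled_df.head(5)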
Output:
Ex no:03
Date: 01-03-2024
Apriori and FP-growth Algorithm

Aim:
To extract frequent itemsets using candidate generation (Apriori) and without candidate generation (FP-growth).

Procedure:
 Consider a dataset or set of items for the transactions and load it into a DataFrame.
 One-hot encode the transaction data.
 Use the Apriori and FP-growth algorithms to find the frequent item patterns.
 Obtain the association rules for both techniques.
 Metrics such as antecedent, consequent, conviction, support and confidence are used to
evaluate the mined patterns and rules.

Support(A) = (number of transactions in which itemset A appears) / (total number of transactions)

Confidence(A => B) = Support(A ∪ B) / Support(A)
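For example, from the recorded output below, Support(bread) = 33/48 ≈ 0.6875 and Support({milk, bread}) = 17/48 ≈ 0.3542, so Confidence(bread => milk) = 0.3542 / 0.6875 ≈ 0.515.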

Code:
import pandas as pd
from mlxtend.frequent_patterns import apriori,
fpgrowth from mlxtend.frequent_patterns import
association_rules data = {
"Transaction ID": list(range(1, 49)),
"Items": [
{'milk', 'bread', 'eggs'},
{'bread', 'butter', 'cheese'},
{'milk', 'bread', 'butter'},
{'eggs', 'cheese'},
{'milk', 'bread', 'cheese'},
{'bread', 'butter'},
{'milk', 'eggs'},
{'bread', 'eggs'},
{'butter', 'cheese'},
{'milk', 'bread', 'butter', 'cheese'},
]
}
df = pd.DataFrame(data)
df['Items'] = df['Items'].apply(lambda x: ', '.join(sorted(x)))
df = df['Items'].str.get_dummies(sep=', ').join(df['Transaction ID'])
df.set_index('Transaction ID', inplace=True)
frequent_itemsets_apriori = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets_fpgrowth = fpgrowth(df, min_support=0.1, use_colnames=True)
rules_apriori = association_rules(frequent_itemsets_apriori, metric="lift",
min_threshold=1)
rules_fpgrowth = association_rules(frequent_itemsets_fpgrowth, metric="lift", min_threshold=1)
print("Frequent Itemsets (Apriori):")
display(frequent_itemsets_apriori)
print("\nFrequent Itemsets (FP-
Growth):")
display(frequent_itemsets_fpgrowth)
print("Association Rules (Apriori):")
display(rules_apriori)
print("\nAssociation Rules (FP-Growth):")
display(rules_fpgrowth)
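As a side note, mlxtend also provides a TransactionEncoder that one-hot encodes transaction lists directly, avoiding the join-and-split through strings. A minimal sketch (not part of the recorded code), applied to the itemsets listed above:

from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
transactions = [list(items) for items in data["Items"]]
te_array = te.fit(transactions).transform(transactions)
encoded_df = pd.DataFrame(te_array, columns=te.columns_)
frequent_itemsets = apriori(encoded_df, min_support=0.1, use_colnames=True)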

Output:
Frequent Itemsets (Apriori):
support itemsets
0 0.687500 (bread)
1 0.437500 (butter)
2 0.541667 (cheese)
3 0.458333 (eggs)
4 0.458333 (milk)
5 0.250000 (butter, bread)
6 0.333333 (cheese, bread)
7 0.333333 (eggs, bread)
8 0.354167 (milk, bread)
9 0.312500 (butter, cheese)
10 0.125000 (butter, milk)
11 0.125000 (eggs, cheese)
12 0.208333 (milk, cheese)
13 0.229167 (milk, eggs)
14 0.125000 (butter, cheese, bread)
15 0.125000 (butter, milk, bread)
16 0.104167 (cheese, eggs, bread)
17 0.208333 (cheese, milk, bread)
18 0.125000 (milk, eggs, bread)
19 0.104167 (butter, milk, cheese)
20 0.104167 (butter, cheese, milk, bread)

Frequent Itemsets (FP-Growth):


support itemsets
0 0.687500 (bread)
1 0.458333 (milk)
2 0.458333 (eggs)
3 0.541667 (cheese)
4 0.437500 (butter)
5 0.354167 (milk, bread)
6 0.208333 (milk, cheese)
7 0.208333 (cheese, milk, bread)
8 0.333333 (eggs, bread)
9 0.229167 (milk, eggs)
10 0.125000 (eggs, cheese)
11 0.125000 (milk, eggs, bread)
12 0.104167 (cheese, eggs, bread)
13 0.333333 (cheese, bread)
14 0.312500 (butter, cheese)
15 0.250000 (butter, bread)
16 0.125000 (butter, milk)
17 0.125000 (butter, cheese, bread)
18 0.125000 (butter, milk, bread)
19 0.104167 (butter, milk, cheese)
20 0.104167 (butter, cheese, milk, bread)
Ex no:04
Date:14-03-2024
Extracting Patterns from Multi-dimensional Data

Aim:
To extract patterns from the chosen dataset, which is multi-dimensional.

Procedure:
 Load the dataset.
 Drop the unneeded columns and separate the target variables.
 Use a standard scaler to scale the data so that it can be used to train the model.
 Remove the null values by imputing the mean value in each null position.
 Determine the number of clusters to be used by plotting the elbow curve.
 Using PCA, determine the principal components, which can then be used for
further prediction and analysis.
 Label the clusters and print the original dimensionality and the reduced dimensionality.

Code:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
# Load the dataset (path as recorded)
data = pd.read_csv("C:\\Users\\LENOVO\\Downloads\\car_prices.csv\\car_prices.csv")
numeric_features = ['odometer', 'mmr', 'sellingprice']
non_numeric_features = ['make', 'model']
# Impute missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
data[numeric_features] = imputer.fit_transform(data[numeric_features])
X_numeric = data[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_pca)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
print("Eigenvalues:")
print(pca.explained_variance_)
print("\nEigenvectors:")
print(pca.components_)
print("\nCluster Centroids:")
centroids = scaler.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_))
print(pd.DataFrame(centroids, columns=numeric_features))
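The procedure also asks to print the original and the reduced dimensionality; a small addition (not part of the recorded code or output) would report this along with the share of variance retained by the two principal components:

print("Original dimensionality:", X_scaled.shape[1])
print("Reduced dimensionality:", X_pca.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)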

Output:
Eigenvalues:
[2.45449366 0.52917194]

Eigenvectors:
[[-0.49452681 0.61516816 0.61401251]
[ 0.86913926 0.3448503 0.35450701]]

Cluster Centroids:
odometer mmr sellingprice
0 42227.768830 14227.743279 14032.671056
1 130715.926025 5019.314721 4853.519667
2 29827.178914 30570.382605 30560.756739
Ex no:05
Date: 15-03-2024
Linear Regression

Aim:
To develop a model that applies linear regression for prediction.

Procedure:
 Load the dataset.
 Drop the unneeded columns and separate the target variable.
 Fit the model and evaluate it using the mean squared error, the coefficients and the intercept.

Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = pd.read_csv("C:\\Users\\LENOVO\\Downloads\\Salary_dataset.csv")
X = data[['YearsExperience']]
y = data['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Coefficients:", model.coef_)
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Testing data')
plt.plot(X_train, model.predict(X_train), color='green', label='Linear Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

Output:
Mean Squared Error: 49830096.855908334
Coefficients: [9423.81532303]
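The procedure also mentions the intercept, which the recorded code does not print; a small addition (not reflected in the recorded output) would be:

print("Intercept:", model.intercept_)
print("R^2 on the test set:", model.score(X_test, y_test))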
Ex no:06
Date:21-03-2024
Bayesian Belief Network

Aim:
To implement a Bayesian Belief Network for the training dataset.
Procedure:
 Import the necessary libraries.
 Define the structure of the Bayesian network based on the relationships between the
different variables in the dataset.
 Train the Bayesian network model.
 Perform inference using the trained model (a library-based sketch of these steps follows this list).
 Print the result of the inference for each unique class of breast cancer, i.e. whether the tumour is malignant or not.
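The implementation below builds a custom cluster-based Bayesian classifier rather than using a Bayesian-network library. For the structure/train/infer steps described above, a minimal sketch with the pgmpy library would look roughly as follows (for illustration only: the feature names 'age' and 'tumor-size' are placeholders for actual dataset columns, and the recorded implementation does not use pgmpy):

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.read_csv("breast-cancer-data.csv")  # placeholder path

# Define the structure of the network from assumed relationships between variables
model = BayesianNetwork([('age', 'class'), ('tumor-size', 'class')])

# Train the conditional probability tables from the data
model.fit(data[['age', 'tumor-size', 'class']], estimator=MaximumLikelihoodEstimator)

# Perform inference: probability of each class given some observed evidence
# (the evidence value must be a state that actually appears in the data)
infer = VariableElimination(model)
print(infer.query(variables=['class'], evidence={'age': '40-49'}))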

Implementation:
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt


class CancerClassifier(object):
    """Infers a hidden variable and uses Bayesian classification to predict whether a tumor
    is malignant or benign."""

    def __init__(self, filename):
        data = pd.read_csv(filename)
        # Greedily group the rows into clusters of similar records (the hidden variable)
        clusters = []
        for _, row in data.iterrows():
            best = -1
            sim = 0.5
            for j, cluster in enumerate(clusters):
                x = sum(cluster[key][value] / sum(cluster[key].values())
                        for key, value in row.items()) / len(data.columns)
                if x > sim:
                    best = j
                    sim = x
            if best == -1:
                # No existing cluster is similar enough, so start a new one
                clusters.append(collections.defaultdict(lambda: collections.defaultdict(float)))
                print(len(clusters), 'clusters found')
            for key, value in row.items():
                clusters[best][key][value] += 1.0
        index = []
        for column in data.columns:
            index.extend([(column, value) for value in data[column].unique()])
        # Laplace-smoothed counts of each (feature, value) pair per cluster
        self.probabilities = pd.DataFrame({(key, value): [cluster[key][value] + 1.0 for cluster in clusters]
                                           for key, value in index}).T
        self.prior = self.probabilities.sum(axis=0)
        self.prior /= self.prior.sum()
        self.edibility_prior = self.probabilities.loc['class'].sum(axis=1)
        self.edibility_prior /= self.edibility_prior.sum()

        def normalize(group):
            return group.div(group.sum(axis=0), axis='columns')

        self.probabilities = self.probabilities.groupby(axis=0, level=0).apply(normalize)

    def __call__(self, **kwargs):
        """Estimates the probability that a tumor is malignant given the features in kwargs."""
        category = self.prior.copy()
        for key, value in kwargs.items():
            category *= self.probabilities.loc[(key, value)]
        category /= category.sum()
        result = self.edibility_prior * ((self.probabilities.loc['class'] * category).sum(axis=1))
        return result / result.sum()

    def test(self, filename):
        """Produces KDE plots of the estimated probability."""
        data = pd.read_csv(filename)
        observables = [column for column in data.columns if column != 'class']
        results = pd.DataFrame([self(**row) for _, row in data[observables].iterrows()])
        results['class'] = data['class']
        return results


# Replace the path below with the actual path to the dataset
CC = CancerClassifier("C:\\Users\\LENOVO\\Downloads\\breast-cancer-data.csv")
CC.edibility_prior.plot.bar()
plt.show()
CC.prior.plot.bar()
plt.show()
CC.probabilities.loc['class'].T.plot.bar()
plt.show()

Output:
1 clusters found
2 clusters found
3 clusters found
4 clusters found
5 clusters found
6 clusters found
7 clusters found
8 clusters found
9 clusters found
10 clusters found
11 clusters found
12 clusters found
13 clusters found
14 clusters found
15 clusters found
16 clusters found
17 clusters found
18 clusters found
19 clusters found
20 clusters found
[Bar plots of the class prior (edibility_prior), the cluster prior, and the per-cluster class probabilities, with legend entries (class, recurrence-events) and (class, no-recurrence-events); axis values omitted.]
Ex no:07
Date: 21-03-2024
Outliers Detection

Aim:
To implement outlier detection using techniques such as the Z-score, standard deviation and
inter-quartile range methods.
Procedure:
 Import the necessary libraries.
 Define a list containing the names of the columns.
 Apply each of the methods: standard deviation, Z-score and inter-quartile range (a small numeric illustration follows this list).
Standard Deviation:
 Calculate the mean and standard deviation for each feature.
 Identify outliers as data points that fall more than a chosen number of standard deviations from the mean.
 Common thresholds are ±2 or ±3 standard deviations.
Z-score method:
 The Z-score represents how many standard deviations an observation is from the mean.
 Identify outliers as data points with a Z-score above or below a threshold.
Inter-Quartile Range:
 Calculate the first quartile (Q1) and the third quartile (Q3).
 Calculate the inter-quartile range, IQR = Q3 - Q1, and flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers.
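As a quick numeric illustration of the inter-quartile range rule (a minimal sketch, not part of the recorded code):

import pandas as pd

s = pd.Series([10, 11, 12, 13, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)   # q1 = 11, q3 = 13
iqr = q3 - q1                                  # IQR = 2
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])  # flags only the value 100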

Code:
from sklearn.datasets import load_wine
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame(load_wine()["data"],columns=load_wine()["feature_names"])
data.head()
data.plot(kind="box",subplots=True,layout=(7,2),figsize=(15,20));

IQR Method:
def iqr_outlier(x, factor):
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    min_ = q1 - factor * iqr
    max_ = q3 + factor * iqr
    result_ = pd.Series([0] * len(x))
    result_[((x < min_) | (x > max_))] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = iqr_outlier(data[i], 1.5)
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Z Score method:
def zscore_outlier(x, lb, ub):
    zscore = ((x - x.mean()) / x.std()).copy()
    result_ = pd.Series([0] * len(x))
    result_[((zscore < lb) | (zscore > ub))] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = zscore_outlier(data[i], -3, 3)
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Standard Deviation Method:

def std_dev_outlier(x, threshold):
    mean = x.mean()
    std_dev = x.std()
    lower_bound = mean - threshold * std_dev
    upper_bound = mean + threshold * std_dev
    result_ = pd.Series([0] * len(x))
    result_[(x < lower_bound) | (x > upper_bound)] = 1
    return result_

fig, ax = plt.subplots(7, 2, figsize=(20, 30))
row = col = 0
for n, i in enumerate(data.columns):
    if (n % 2 == 0) & (n > 0):
        row += 1
        col = 0
    outliers = std_dev_outlier(data[i], threshold=3)  # change the threshold as needed
    if sum(outliers) == 0:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], legend=False, color='green')
    else:
        sns.scatterplot(x=np.arange(len(data[i])), y=data[i], ax=ax[row, col], hue=outliers, palette=['green', 'red'])
        for x, y in zip(np.arange(len(data[i]))[outliers == 1], data[i][outliers == 1]):
            ax[row, col].text(x=x, y=y, s=y, fontsize=8)
    ax[row, col].set_ylabel("")
    ax[row, col].set_title(i)
    ax[row, col].xaxis.set_visible(False)
    if sum(outliers) > 0:
        ax[row, col].legend(ncol=2)
    col += 1
ax[row, col].axis('off')
plt.show()

Output:
Ex no:08
Date:21-03-2024
Evaluation Measures for Text Retrieval

Aim:
To evaluate the retrieval of the text using various measures.

Procedure:
 Define the path containing the dataset.
 Read the csv file using Pandas function and load the dataframe.
 Define Query and Query Category representing the search and category of interest.
 Calculating metrices which are the Total Retrivel Documents , True Postives , Retrived
relevant documents , Precision ,Recall , F1-score.
 Print the result.

Code:
import pandas as pd
import numpy as np
from google_play_scraper import app, Sort, reviews_all
import plotly.express as px

# Scrape all reviews of the app from the Google Play store
hk_project = reviews_all('com.hikingproject.android',
                         sleep_milliseconds=0, lang='en', country='IN', sort=Sort.NEWEST)
df = pd.json_normalize(hk_project)
df.head()

from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")
print(sentiment_analysis("I like your application alot!"))
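The precision, recall and F1-score reported in the output are not computed by the code above; they would come from a step along these lines (a minimal sketch with hypothetical variables: retrieved_docs holds the indices of documents returned for the query and relevant_docs the indices labelled relevant for the query category):

retrieved = set(retrieved_docs)   # documents returned by the search
relevant = set(relevant_docs)     # documents actually relevant to the query category
true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved) if retrieved else 0.0
recall = true_positives / len(relevant) if relevant else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print("Precision:", precision, "Recall:", recall, "F1-score:", f1)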

Output:
Precision_score = 0.6999
Recall = 1.0
F1-score = 0.1980
