
Name: Yash Pandit          Std: T.Y. B.Sc. Computer Science
Subject: Data Science - Practical          Roll No.: 51, Batch: A, Div: A

INDEX

Sr. No. | Practical Name | Date of Perform | Page No. | Sign.

1. Introduction to Excel (14/12/2024, pages 4-6)
 Perform conditional formatting on a dataset using various criteria.
 Create a pivot table to analyze and summarize data.
 Use VLOOKUP function to retrieve information from a different worksheet or table.
 Perform what-if analysis using Goal Seek to determine input values for desired output.

2. Data Frames and Basic Data Pre-processing (04/01/2025, pages 7-9)
 Read data from CSV and JSON files into a data frame.
 Perform basic data pre-processing tasks such as handling missing values and outliers.
 Manipulate and transform data using functions like filtering, sorting, and grouping.

3. Feature Scaling and Dummification (04/01/2025, pages 10-12)
 Apply feature-scaling techniques like standardization and normalization to numerical features.
 Perform feature dummification to convert categorical variables into numerical representations.

4. Hypothesis Testing (11/01/2025, pages 13-15)
 Formulate null and alternative hypotheses for a given problem.
 Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
 Interpret the results and draw conclusions based on the test outcomes.

5. ANOVA (Analysis of Variance) (11/01/2025, pages 16-17)
 Perform one-way ANOVA to compare means across multiple groups.
 Conduct post-hoc tests to identify significant differences between group means.

6. Regression and Its Types (18/01/2025, pages 18-19)
 Implement simple linear regression using a dataset.
 Explore and interpret the regression model coefficients and goodness-of-fit measures.
 Extend the analysis to multiple linear regression and assess the impact of additional predictors.

7. Logistic Regression and Decision Tree (25/01/2025, pages 20-21)
 Build a logistic regression model to predict a binary outcome.
 Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
 Construct a decision tree model and interpret the decision rules for classification.

8. K-Means Clustering (08/02/2025, pages 22-23)
 Apply the K-Means algorithm to group similar data points into clusters.
 Determine the optimal number of clusters using the elbow method or silhouette analysis.
 Visualize the clustering results and analyse the cluster characteristics.

9. Principal Component Analysis (PCA) (08/02/2025, pages 24-26)
 Perform PCA on a dataset to reduce dimensionality.
 Evaluate the explained variance and select the appropriate number of principal components.
 Visualize the data in the reduced-dimensional space.

10. Data Visualization and Storytelling (15/02/2025, pages 27-29)
 Create meaningful visualizations using data visualization tools.
 Combine multiple visualizations to tell a compelling data story.
 Present the findings and insights in a clear and concise manner.
Practical 1
Aim: Introduction to Excel
 Perform conditional formatting on a dataset using various criteria.
 Create a pivot table to analyze and summarize data.
 Use VLOOKUP function to retrieve information from a different worksheet or
table.
 Perform what-if analysis using Goal Seek to determine input values for desired
output.
A. Perform conditional formatting on a dataset using various criteria.
Step 1: Go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.

Step 2: Enter the threshold value for the Greater Than rule, for example 2000.

Step 3: In Conditional Formatting, go to Data Bars > Solid Fill.

B. Create a pivot table to analyse and summarize data.
Step 1: Select the entire table and go to Insert > PivotChart > PivotChart.

Step 2: Select "New Worksheet" in the Create PivotChart window.

Step 3: Select and drag the required attributes into the field boxes below.

C. Use VLOOKUP function to retrieve information from a different worksheet or table.

Step 1: Click on an empty cell and enter the following formula:
=VLOOKUP(B3, B3:D3, 1, TRUE)
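
Note: for retrieving a value from a different worksheet, a more typical pattern is an exact-match lookup. A minimal sketch, assuming a hypothetical "Products" sheet with IDs in column A and prices in column C:
=VLOOKUP(A2, Products!A:C, 3, FALSE)
Here A2 is the ID to look up, 3 is the column within the lookup range to return, and FALSE forces an exact match.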

D. Perform what-if analysis using Goal Seek to determine input values for desired output.
Steps:
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.

Step 2: Fill in the Set cell, To value, and By changing cell fields accordingly and click OK.

Practical 2
Aim: Data Frames and Basic Data Pre-processing
 Read data from CSV and JSON files into a data frame.
 Perform basic data pre-processing tasks such as handling missing values and
outliers.
 Manipulate and transform data using functions like filtering, sorting, and
grouping.

Program 1: Read data from CSV and JSON files into a data frame.

import pandas as pd

# Read a CSV file and a JSON file into pandas data frames (raw strings avoid backslash escapes)
df = pd.read_csv(r'D:\DATA SCIENCE\student_marks.csv')
data = pd.read_json(r'D:\DATA SCIENCE\IRIS.json')

print("CSV Dataset")
print(df)
print("JSON Dataset")
print(data)

Output:

Program 2: Perform basic data pre-processing tasks such as handling missing values and outliers.

import pandas as pd

# Load the Titanic dataset, which contains missing values
df = pd.read_csv(r'D:\DATA SCIENCE\titanic.csv')
print(df.head(10))
data = pd.read_json(r'D:\DATA SCIENCE\IRIS.json')

# Fill missing values with 0
print("Dataset after filling NA values with 0:")
df.fillna(value=0, inplace=True)
print(df.head(10))

# Drop any rows that still contain NA values
print("Dataset after dropping remaining NA Values:")
df.dropna(inplace=True)
print(df.head(10))

Output:
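
Note: the aim also mentions outliers, which Program 2 does not handle. A minimal sketch of IQR-based outlier removal, assuming the same Titanic data frame and a numeric 'Fare' column (present in the standard Titanic dataset):

import pandas as pd

df = pd.read_csv(r'D:\DATA SCIENCE\titanic.csv')  # same file as above (assumed path)

# Compute the interquartile range (IQR) of 'Fare'
q1, q3 = df['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only rows whose 'Fare' lies within 1.5 * IQR of the quartiles
mask = df['Fare'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(f"Rows before: {len(df)}, after outlier removal: {mask.sum()}")
df_no_outliers = df[mask]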

Program 3: Manipulate and transform data using functions like filtering, sorting, and grouping.

import pandas as pd

iris = pd.read_csv('iris.csv')

# Filtering: keep only the setosa samples
# (use 'Iris-setosa' if your copy of the dataset labels the species that way)
setosa = iris[iris['Species'] == 'setosa']
print("Setosa samples: ")
print(setosa.head())

# Sorting: order rows by sepal length, largest first
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print('\nSorted iris dataset: ')
print(sorted_iris.head())

# Grouping: mean of the numeric measurements for each species
grouped_species = iris.groupby('Species').mean(numeric_only=True)
print('\nMean measurements for each species:')
print(grouped_species)

Output:

Practical 3
Aim: Feature Scaling and Dummification
 Apply feature-scaling techniques like standardization and normalization to
numerical features.
 Perform feature dummification to convert categorical variables into numerical
representations.

Program 1: Apply feature-scaling techniques like standardization and normalization to numerical features.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Reading Data
df = pd.read_csv('D:/Data Science/wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
# Renaming Columns
df.columns = ['Class', 'Alcohol', 'Malic Acid']
# Printing Original DataFrame
print("Original DataFrame:")
print(df)
# MinMax Scaling
scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
# Printing DataFrame after MinMax Scaling
print("\nDataFrame after MinMax Scaling:")
print(df)
# Standard Scaling
scaling = StandardScaler()
scaled_standard_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standard_value
# Printing DataFrame after Standard Scaling
print("\nDataFrame after Standard Scaling:")
print(df)

Output:

Program 2: Perform feature dummification to convert categorical variables into
numerical representations.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Reading Data (raw string avoids backslash-escape issues in the Windows path)
iris = pd.read_csv(r"D:\Data Science\computer.csv")

# Printing dataset columns
print("Columns in dataset: ")
print(iris.columns)

# Printing the first few rows
print("Head in dataset: ")
print(iris.head())

# Encoding Categorical Data
le = LabelEncoder()
if 'Species' in iris.columns:
    iris['code'] = le.fit_transform(iris['Species'])
    print("\nDataset after Label Encoding: ")
    print(iris)
else:
    print("The column 'Species' is not found in dataset")

Output:
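
Note: label encoding assigns one integer per category; dummification in the strict sense creates one 0/1 indicator column per category. A minimal sketch using pandas.get_dummies, assuming the same data frame and a categorical 'Species' column:

import pandas as pd

iris = pd.read_csv(r"D:\Data Science\computer.csv")  # same file as above (assumed)

# One indicator column per category; drop_first avoids a redundant column
dummies = pd.get_dummies(iris['Species'], prefix='Species', drop_first=True)
iris_dummified = pd.concat([iris.drop('Species', axis=1), dummies], axis=1)
print(iris_dummified.head())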

Practical 4
Aim: Hypothesis Testing
 Formulate null and alternative hypotheses for a given problem.
 Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
 Interpret the results and draw conclusions based on the test outcomes.
Program: -

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy import stats

np.random.seed(42)

# Two-Sample t-test
# H0: the two samples have equal means; H1: the means differ
sample1 = np.random.normal(10, 2, 30)
sample2 = np.random.normal(12, 2, 30)

t_stat, p_value = stats.ttest_ind(sample1, sample2)
alpha = 0.05

print(f'T-statistic: {t_stat}, P-value: {p_value}, DF: {len(sample1) + len(sample2) - 2}')

# Plotting the distributions
plt.hist([sample1, sample2], alpha=0.5, label=['Sample 1', 'Sample 2'], color=['blue', 'orange'])
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

if p_value < alpha:
    plt.fill_between(np.linspace(min(sample1.min(), sample2.min()), max(sample1.max(), sample2.max()), 1000),
                     0, 0.15, color='red', alpha=0.3, label='Critical Region')

plt.text(np.mean(sample2), 5, f'T-statistic: {t_stat:.2f}', ha='center', va='center', color='black',
         backgroundcolor='white')
plt.show()

# Conclusion for t-test
if p_value < alpha:
    print(f"Conclusion: Reject null hypothesis. Mean of Sample {'1' if np.mean(sample1) > np.mean(sample2) else '2'} is significantly higher.")
else:
    print("Conclusion: Fail to reject null hypothesis. No significant difference in means.")

# Chi-Square Test on 'mpg' dataset
# H0: horsepower category and model-year period are independent; H1: they are associated
df = sb.load_dataset('mpg')

# Bin the 'horsepower' column into categories
df['horsepower_new'] = pd.cut(df['horsepower'], bins=[0, 75, 150, 240], labels=['low', 'medium', 'high'])

# Bin the 'model year' column into categories
df['modelyear_new'] = pd.cut(df['model_year'], bins=[69, 72, 74, 84], labels=['t1', 't2', 't3'])

# Perform Chi-Square test on the contingency table of the two binned variables
chi2_stat, p_val_chi, dof, expected = stats.chi2_contingency(pd.crosstab(df['horsepower_new'], df['modelyear_new']))

print(f"Chi-square: {chi2_stat}, P-value: {p_val_chi}, DF: {dof}")

# Conclusion for Chi-Square Test
if p_val_chi < alpha:
    print("Conclusion: Reject null hypothesis. Significant association between horsepower and model year.")
else:
    print("Conclusion: Fail to reject null hypothesis. No significant association.")

Output: -

Practical 5
Aim: ANOVA (Analysis of Variance)
 Perform one-way ANOVA to compare means across multiple groups.
 Conduct post-hoc tests to identify significant differences between
group means.

Program:

import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

all_data = group1 + group2 + group3 + group4
group_labels = (['Group 1'] * len(group1) + ['Group 2'] * len(group2) +
                ['Group 3'] * len(group3) + ['Group 4'] * len(group4))

# One-way ANOVA: H0 is that all group means are equal
f_stats, p_value = stats.f_oneway(group1, group2, group3, group4)
print("One-way ANOVA Results: ")
print(f"F-statistics: {f_stats:.4f}")
print(f"P-value: {p_value:.4f}")

# Run the post-hoc comparison only if ANOVA finds a significant difference
if p_value < 0.05:
    print("\nTukey-Kramer post-hoc test:")
    tukey_results = pairwise_tukeyhsd(all_data, group_labels)
    print(tukey_results)
else:
    print("\nNo significant differences found in ANOVA; post-hoc test not needed.")

Output:
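
Note: one-way ANOVA assumes roughly equal group variances. A minimal sketch of checking this assumption with Levene's test (scipy.stats.levene), using the same four groups as above:

from scipy import stats

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

# Levene's test: H0 is that all groups have equal variances
stat, p = stats.levene(group1, group2, group3, group4)
print(f"Levene statistic: {stat:.4f}, P-value: {p:.4f}")
if p < 0.05:
    print("Variances differ significantly; interpret the ANOVA with caution.")
else:
    print("No evidence of unequal variances; the ANOVA assumption holds.")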

Practical 6
Aim: -Regression and Its Types
 Implement simple linear regression using a dataset.
 Explore and interpret the regression model coefficients and goodness-
of-fit measures.
 Extend the analysis to multiple linear regression and assess the impact
of additional predictors.
Program: -

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['PRICE'] = housing.target

print("First few rows of the dataset:")
print(housing_df.head())

# Simple linear regression: predict PRICE from average rooms only
print("\nSimple Linear Regression:")
X = housing_df[['AveRooms']]
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"Coefficient: {model.coef_[0]:.4f}")

# Multiple linear regression: use all remaining features as predictors
print("\nMultiple Linear Regression:")
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

print("Coefficients:")
for feature, coef in zip(housing_df.columns[:-1], model.coef_):
    print(f"{feature}: {coef:.4f}")

Output: -
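
Note: R-squared never decreases as predictors are added, so for the multiple regression it is also common to report adjusted R-squared as a goodness-of-fit measure. A minimal sketch, assuming r2 and X_test from the multiple-regression run above:

# Adjusted R-squared penalizes the number of predictors p relative to sample size n
n, p = X_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R-squared: {adjusted_r2:.4f}")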

Practical 7
Aim: Logistic Regression and Decision Tree
 Build a logistic regression model to predict a binary outcome.
 Evaluate the model's performance using classification metrics (e.g.,
accuracy, precision, recall).
 Construct a decision tree model and interpret the decision rules for
classification.
Program:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Keep only two classes for binary classification
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)

print("Logistic Regression Metrics")
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("Precision:", precision_score(y_test, y_pred_logistic))
print("Recall:", recall_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))

# Train a decision tree model and evaluate its performance
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)

print("\nDecision Tree Metrics")
print("Accuracy:", accuracy_score(y_test, y_pred_tree))
print("Precision:", precision_score(y_test, y_pred_tree))
print("Recall:", recall_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))

Output:
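
Note: the aim also asks for interpreting the decision rules. A minimal sketch using sklearn.tree.export_text, assuming the fitted decision_tree_model and the feature data frame X from the program above:

from sklearn.tree import export_text

# Print the learned decision rules as nested if/else conditions on the features
rules = export_text(decision_tree_model, feature_names=list(X.columns))
print(rules)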

Practical 8
Aim: K-Means Clustering
 Apply the K-Means algorithm to group similar data points into
clusters.
 Determine the optimal number of clusters using elbow method or
silhouette analysis.
 Visualize the clustering results and analyse the cluster characteristics.
Program:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load the dataset (raw string avoids backslash-escape errors in the Windows path)
data_path = r"C:\Users\Admin\Downloads\wholesaler.csv"
data = pd.read_csv(data_path)
print(data.head())

# Define categorical and continuous features
categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
print(data[continuous_features].describe())

# One-hot encoding for categorical features
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
print(data.head())

# Scale the data
scaler = MinMaxScaler()
data_transformed = scaler.fit_transform(data)

# Elbow method to determine the optimal number of clusters
sum_of_squared_distances = []
k_range = range(1, 15)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

# Plot the Elbow graph
plt.figure()
plt.plot(k_range, sum_of_squared_distances, 'bo-')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Sum of squared distances (Inertia)")
plt.title("Elbow Method for Optimal K")
plt.show()

Output:
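
Note: the aim also mentions silhouette analysis and visualizing the clusters, which the elbow-only program above does not cover. A minimal sketch, assuming data_transformed from above and a hypothetical choice of k = 5 read off the elbow plot:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

k = 5  # hypothetical value chosen from the elbow plot
km = KMeans(n_clusters=k, random_state=42)
labels = km.fit_predict(data_transformed)

# Silhouette score: closer to 1 means better-separated clusters
print(f"Silhouette score for k={k}: {silhouette_score(data_transformed, labels):.3f}")

# Quick 2-D view using the first two scaled features, coloured by cluster label
plt.scatter(data_transformed[:, 0], data_transformed[:, 1], c=labels, cmap='viridis', s=30)
plt.xlabel("Feature 1 (scaled)")
plt.ylabel("Feature 2 (scaled)")
plt.title(f"K-Means clusters (k={k})")
plt.show()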

Practical 9
Aim: Principal Component Analysis (PCA)
 Perform PCA on a dataset to reduce dimensionality.
 Evaluate the explained variance and select the appropriate number of
principal components.
 Visualize the data in the reduced-dimensional space.
Program:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Separate features and target
X = iris_df.drop("target", axis=1)
y = iris_df["target"]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='-')
plt.title('Cumulative Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)

# Find number of components explaining 95% variance
n_components = np.argmax(np.cumsum(explained_variance_ratio) >= 0.95) + 1
plt.axvline(x=n_components, color='r', linestyle='--')
plt.text(n_components, 0.9, '95% variance\nexplained', color='red', ha='right')
plt.show()

print(f"Number of principal components to explain 95% variance: {n_components}")

# Reduce dimensions using selected components
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

# Scatter plot of first two principal components
plt.figure(figsize=(10, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title('Data in Reduced-dimensional Space (PC1 and PC2)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label="Target")
plt.show()
Output:
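
Note: to see which original features drive each principal component, the loadings can be inspected. A minimal sketch, assuming the fitted pca and the feature data frame X from the program above:

import pandas as pd

# Rows are principal components, columns are the original features
loadings = pd.DataFrame(pca.components_,
                        columns=X.columns,
                        index=[f"PC{i+1}" for i in range(pca.n_components_)])
print(loadings.round(3))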

Practical 10
Aim: Data Visualization and Storytelling
 Create meaningful visualizations using data visualization tools.
 Combine multiple visualizations to tell a compelling data story.
 Present the findings and insights in a clear and concise manner.
Program:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def generate_data():
    np.random.seed(42)

    data = pd.DataFrame({
        'variable1': np.random.normal(0, 1, 1000),
        'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
        'variable3': np.random.normal(-1, 1.5, 1000),
        'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000,
                                               p=[0.4, 0.3, 0.2, 0.1]), dtype='category')
    })

    return data


def create_visualizations(data):
    # Scatter plot
    plt.figure(figsize=(10, 6))
    plt.scatter(data['variable1'], data['variable2'], alpha=0.5, c='b')
    plt.title('Figure 1: Relationship between Variable 1 and Variable 2', fontsize=16)
    plt.xlabel('Variable 1', fontsize=14)
    plt.ylabel('Variable 2', fontsize=14)
    plt.grid(True)
    plt.show()

    # Count plot
    plt.figure(figsize=(10, 6))
    sns.countplot(x='category', data=data, palette='coolwarm')
    plt.title('Figure 2: Distribution of Categories', fontsize=16)
    plt.xlabel('Category', fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)
    plt.show()

    # Correlation heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(data[['variable1', 'variable2', 'variable3']].corr(), annot=True,
                cmap='coolwarm')
    plt.title('Figure 3: Correlation Heatmap', fontsize=16)
    plt.show()


def data_storytelling():
    print("\nData Storytelling\n")
    print("Title: Exploring the Relationship between Variable 1 and Variable 2")
    print("\nFigure 1: Scatter Plot of Variable 1 and Variable 2 shows a positive correlation.")
    print("Figure 2: Bar Chart of Categories shows Category A is the most common.")
    print("Figure 3: Correlation Heatmap shows a strong correlation between Variable 1 and Variable 2.")


def main():
    data = generate_data()
    create_visualizations(data)
    data_storytelling()


if __name__ == "__main__":
    main()

Output:

Scatter Plot:

Bar Chart:

Correlation Heatmap:
