Index: Sr. No. | Practical Name | Date of Performance | Page No. | Sign
Practical 1
Aim: Introduction to Excel
Perform conditional formatting on a dataset using various criteria.
Create a pivot table to analyze and summarize data.
Use the VLOOKUP function to retrieve information from a different worksheet or table.
Perform what-if analysis using Goal Seek to determine input values for a desired output.
A. Perform conditional formatting on a dataset using various criteria.
Step 1: Select the data range, then go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value for the Greater Than rule, for example 2000, and click OK.
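For comparison, the same rule can be sketched in pandas with a Styler (pandas 2.1+ assumed for Styler.map; the Marks column and values are illustrative):
import pandas as pd

marks = pd.DataFrame({'Marks': [1500, 2500, 1800, 3200]})  # illustrative data
# Mirror Excel's "Greater Than" rule: highlight cells above 2000
styled = marks.style.map(lambda v: 'background-color: salmon' if v > 2000 else '')
styled.to_html('highlighted.html')  # open the file in a browser to view the formatting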
B. Create a pivot table to analyze and summarize data.
Step 1: Select the entire table, then go to the Insert tab > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window and click OK.
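The same kind of summary can be sketched in pandas with pivot_table (the column names and data here are illustrative):
import pandas as pd

sales = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West'],
    'Product': ['A', 'B', 'B', 'A'],
    'Sales': [2000, 1500, 2400, 1800],
})
# Equivalent of an Excel pivot table: total Sales by Region and Product
pivot = pd.pivot_table(sales, values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot)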
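C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
A typical formula (cell references are illustrative): =VLOOKUP(A2, Sheet2!A:C, 3, FALSE). This looks up the value of A2 in the first column of the range Sheet2!A:C and returns the matching value from the third column of that range.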
D. Perform what-if analysis using Goal Seek to determine input values for a desired output.
Steps:
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in the Set cell, To value, and By changing cell fields, then click OK.
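Goal Seek is essentially numerical root finding; a minimal Python sketch with scipy.optimize.brentq (the formula and target below are illustrative):
from scipy.optimize import brentq

# Goal Seek as root finding: find the input where output - target = 0.
# Illustration: a sheet computes revenue = price * 120; the desired revenue is 6000.
def objective(price):
    return price * 120 - 6000

price = brentq(objective, 0, 1000)  # search within the bracket [0, 1000]
print(f"Required price: {price:.2f}")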
Practical 2
Aim: Data Frames and Basic Data Pre-processing
Read data from CSV and JSON files into a data frame.
Perform basic data pre-processing tasks such as handling missing values and outliers.
Manipulate and transform data using functions like filtering, sorting, and grouping.
Program 1: Read data from CSV and JSON files into a data frame.
import pandas as pd

# Read the CSV and JSON files into DataFrames (raw strings avoid backslash escapes)
df = pd.read_csv(r'D:\DATA SCIENCE\student_marks.csv')
data = pd.read_json(r'D:\DATA SCIENCE\IRIS.json')
print("CSV Dataset")
print(df)
print("JSON Dataset")
print(data)
Output:
Program 2: Perform basic data pre-processing tasks such as handling missing values and outliers.
import pandas as pd

df = pd.read_csv(r'D:\DATA SCIENCE\titanic.csv')
print(df.head(10))

# Work on copies so each strategy is applied to the original data
print("Dataset after filling NA values with 0:")
df_filled = df.fillna(value=0)
print(df_filled.head(10))

print("Dataset after dropping NA values:")
df_dropped = df.dropna()
print(df_dropped.head(10))
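The aim also mentions outliers; a minimal standalone sketch using the 1.5×IQR rule (the Age column of titanic.csv is assumed):
import pandas as pd

df = pd.read_csv(r'D:\DATA SCIENCE\titanic.csv')
# Flag values outside 1.5 * IQR of the quartiles as outliers
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Age'] < q1 - 1.5 * iqr) | (df['Age'] > q3 + 1.5 * iqr)]
print("Age outliers:")
print(outliers.head())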
Output:
Program 3: Manipulate and transform data using functions like filtering, sorting, and grouping.
import pandas as pd

iris = pd.read_csv('iris.csv')

# Filtering: keep only the setosa samples
setosa = iris[iris['Species'] == 'setosa']
print("Setosa samples:")
print(setosa.head())

# Sorting: order by sepal length, longest first
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print('\nSorted iris dataset:')
print(sorted_iris.head())

# Grouping: mean of the numeric columns for each species
grouped_species = iris.groupby('Species').mean(numeric_only=True)
print('\nMean measurements for each species:')
print(grouped_species)
Output:
Practical 3
Aim: Feature Scaling and Dummification
Apply feature-scaling techniques like standardization and normalization to numerical features.
Perform feature dummification to convert categorical variables into numerical representations.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Reading Data
df = pd.read_csv('D:/Data Science/wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
# Renaming Columns
df.columns = ['Class', 'Alcohol', 'Malic Acid']
# Printing Original DataFrame
print("Original DataFrame:")
print(df)

# MinMax Scaling (applied to a copy so both scalers see the original values)
minmax_df = df.copy()
scaling = MinMaxScaler()
minmax_df[['Alcohol', 'Malic Acid']] = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after MinMax Scaling:")
print(minmax_df)

# Standard Scaling (also applied to a copy of the original data)
standard_df = df.copy()
scaling = StandardScaler()
standard_df[['Alcohol', 'Malic Acid']] = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after Standard Scaling:")
print(standard_df)
Output:
Program 2: Perform feature dummification to convert categorical variables into numerical representations.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Reading Data (raw string avoids backslash escapes in the path)
iris = pd.read_csv(r"D:\Data Science\computer.csv")
# Printing Columns
print("Columns in dataset:")
print(iris.columns)
# Printing the first rows
print("Head of dataset:")
print(iris.head())

# Encoding Categorical Data
le = LabelEncoder()
if 'Species' in iris.columns:
    iris['code'] = le.fit_transform(iris['Species'])
    print("\nDataset after Label Encoding:")
    print(iris)
else:
    print("The column 'Species' was not found in the dataset")
Output:
Practical 4
Aim: Hypothesis Testing
Formulate null and alternative hypotheses for a given problem.
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
Interpret the results and draw conclusions based on the test outcomes.
Program: -
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Two-Sample t-test (the sample parameters below are illustrative)
sample1 = np.random.normal(loc=52, scale=5, size=100)
sample2 = np.random.normal(loc=50, scale=5, size=100)
t_stat, p_value = stats.ttest_ind(sample1, sample2)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Visualize the two samples
plt.hist(sample1, alpha=0.5, label='Sample 1')
plt.hist(sample2, alpha=0.5, label='Sample 2')
plt.legend()
plt.show()

alpha = 0.05
if p_value < alpha:
    print(f"Conclusion: Reject null hypothesis. Mean of Sample {'1' if np.mean(sample1) > np.mean(sample2) else '2'} is significantly higher.")
else:
    print("Conclusion: Fail to reject null hypothesis. No significant difference in means.")
Output: -
Practical 5
Aim: ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
Conduct post-hoc tests to identify significant differences between group means.
Program:
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

all_data = group1 + group2 + group3 + group4
group_labels = (['Group 1'] * len(group1) + ['Group 2'] * len(group2)
                + ['Group 3'] * len(group3) + ['Group 4'] * len(group4))

# One-way ANOVA: are the group means equal?
f_stats, p_value = stats.f_oneway(group1, group2, group3, group4)
print("One-way ANOVA Results:")
print(f"F-statistic: {f_stats:.4f}")
print(f"P-value: {p_value:.4f}")

# If ANOVA is significant, run Tukey's HSD to see which pairs of groups differ
if p_value < 0.05:
    print("\nTukey-Kramer post-hoc test:")
    tukey_results = pairwise_tukeyhsd(all_data, group_labels)
    print(tukey_results)
else:
    print("\nNo significant differences found in ANOVA; post-hoc test not needed.")
Output:
Practical 6
Aim: Regression and Its Types
Implement simple linear regression using a dataset.
Explore and interpret the regression model coefficients and goodness-of-fit measures.
Extend the analysis to multiple linear regression and assess the impact of additional predictors.
Program: -
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['PRICE'] = housing.target

# Split into predictors and target, then into train and test sets
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Goodness-of-fit measures
print(f"Mean squared error: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")

print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")
Output: -
Practical 7
Aim: Logistic Regression and Decision Tree
Build a logistic regression model to predict a binary outcome.
Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
Construct a decision tree model and interpret the decision rules for classification.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the iris dataset and split into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression and its classification metrics
log_model = LogisticRegression(max_iter=200)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_log):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_log, average='macro'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_log, average='macro'):.4f}")

# Decision tree classifier
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)
print("\nDecision Tree classification report:")
print(classification_report(y_test, y_pred_tree))
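To interpret the decision rules from the fitted tree, continuing from the program above:
from sklearn.tree import export_text

# Readable if/else representation of the learned decision rules
rules = export_text(decision_tree_model, feature_names=load_iris().feature_names)
print(rules)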
Output:
Practical 8
Aim: K-Means Clustering
Apply the K-Means algorithm to group similar data points into clusters.
Determine the optimal number of clusters using the elbow method or silhouette analysis.
Visualize the clustering results and analyze the cluster characteristics.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
# Load the dataset
data_path = r"C:\Users\Admin\Downloads\wholesaler.csv"  # raw string: avoids the invalid \U escape
data = pd.read_csv(data_path)
print(data.head())
# Define categorical and continuous features
categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
print(data[continuous_features].describe())
# One-hot encoding for categorical features
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
print(data.head())
# Scale the data
scaler = MinMaxScaler()
data_transformed = scaler.fit_transform(data)
# Elbow method to determine the optimal number of clusters
sum_of_squared_distances = []
k_range = range(1, 15)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)
# Plot the Elbow graph
plt.figure()
plt.plot(k_range, sum_of_squared_distances, 'bo-')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Sum of squared distances (Inertia)")
plt.title("Elbow Method for Optimal K")
plt.show()
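The aim also mentions silhouette analysis; a minimal sketch, continuing from data_transformed above:
from sklearn.metrics import silhouette_score

# Silhouette score for each candidate K (needs at least 2 clusters)
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(data_transformed)
    score = silhouette_score(data_transformed, labels)
    print(f"K={k}: silhouette score = {score:.4f}")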
Output:
Practical 9
Aim: Principal Component Analysis (PCA)
Perform PCA on a dataset to reduce dimensionality.
Evaluate the explained variance and select the appropriate number of principal components.
Visualize the data in the reduced-dimensional space.
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load and standardize the data
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_

# Plot cumulative explained variance and mark the 95% threshold
cumulative = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative >= 0.95) + 1
plt.plot(range(1, len(cumulative) + 1), cumulative, 'bo-')
plt.axvline(x=n_components, color='r', linestyle='--')
plt.text(n_components, 0.9, '95% variance\nexplained', color='red', ha='right')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
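To visualize the data in the reduced-dimensional space, continuing from X_pca above (colors encode the iris species):
# Scatter plot of the first two principal components
y = load_iris().target
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Iris data in the first two principal components')
plt.show()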
Practical 10
Aim: Data Visualization and Storytelling
Create meaningful visualizations using data visualization tools.
Combine multiple visualizations to tell a compelling data story.
Present the findings and insights in a clear and concise manner.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def generate_data():
    np.random.seed(42)
    data = pd.DataFrame({
        'variable1': np.random.normal(0, 1, 1000),
        'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
        'variable3': np.random.normal(-1, 1.5, 1000),
        'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000,
                                               p=[0.4, 0.3, 0.2, 0.1]), dtype='category')
    })
    return data

def create_visualizations(data):
    # Scatter plot
    plt.figure(figsize=(10, 6))
    plt.scatter(data['variable1'], data['variable2'], alpha=0.5, c='b')
    plt.title('Figure 1: Relationship between Variable 1 and Variable 2', fontsize=16)
    plt.xlabel('Variable 1', fontsize=14)
    plt.ylabel('Variable 2', fontsize=14)
    plt.grid(True)
    plt.show()

    # Count plot
    plt.figure(figsize=(10, 6))
    sns.countplot(x='category', data=data, palette='coolwarm')
    plt.title('Figure 2: Distribution of Categories', fontsize=16)
    plt.xlabel('Category', fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)
    plt.show()

    # Correlation heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(data[['variable1', 'variable2', 'variable3']].corr(), annot=True, cmap='coolwarm')
    plt.title('Figure 3: Correlation Heatmap', fontsize=16)
    plt.show()

def data_storytelling():
    print("\nData Storytelling\n")
    print("Title: Exploring the Relationship between Variable 1 and Variable 2")
    print("\nFigure 1: Scatter Plot of Variable 1 and Variable 2 shows a positive correlation.")
    print("Figure 2: Bar Chart of Categories shows Category A is the most common.")
    print("Figure 3: Correlation Heatmap shows a strong correlation between Variable 1 and Variable 2.")

def main():
    data = generate_data()
    create_visualizations(data)
    data_storytelling()

# Run the full pipeline when the script is executed
if __name__ == '__main__':
    main()
Output:
Scatter Plot:
Bar Chart:
Correlation Heatmap: