Statistics For Data Science

Uploaded by

starorionlabz

In [1]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Q1. Generate a list of 100 integers containing values
between 90 and 130 and store it in the variable int_list.
After generating the list, find the following:

(i) Write a Python function to calculate the mean of a given list of numbers. Create
a function to find the median of a list of numbers.

(ii) Develop a program to compute the mode of a list of integers.

(iii) Implement a function to calculate the weighted mean of a list of values and
their corresponding weights.

(iv) Write a Python function to find the geometric mean of a list of positive
numbers.

(v) Create a program to calculate the harmonic mean of a list of values.

(vi) Build a function to determine the midrange of a list of numbers (average of
the minimum and maximum).

(vii) Implement a Python program to find the trimmed mean of a list, excluding a
certain percentage of outliers.

In [2]: int_list = np.random.randint(90, 131, 100)  # the upper bound is exclusive, so 131 lets 130 occur
int_list

Out[2]: array([106, 102, 123, 126, 124, 99, 123, 106, 95, 97, 101, 106, 116,
123, 110, 99, 110, 127, 112, 113, 96, 108, 120, 119, 120, 127,
123, 94, 128, 95, 94, 103, 102, 127, 126, 101, 91, 97, 119,
120, 96, 93, 96, 112, 91, 95, 93, 111, 93, 106, 110, 128,
108, 97, 124, 123, 108, 118, 125, 97, 110, 121, 109, 111, 123,
127, 108, 102, 119, 125, 91, 94, 109, 105, 127, 116, 122, 122,
110, 104, 99, 106, 93, 111, 107, 122, 91, 119, 102, 114, 111,
98, 90, 99, 118, 124, 110, 123, 105, 117])

(i) Write a Python function to calculate the mean of a given list of numbers. Create
a function to find the median of a list of numbers.

In [3]: mean_int_list=np.mean(int_list)
median_int_list=np.median(int_list)
print("Mean:",mean_int_list)
print("Median:",median_int_list)

Mean: 109.66
Median: 110.0
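
The question asks for hand-written functions, whereas the cell above delegates to np.mean and np.median. A minimal pure-Python sketch of the same two quantities could look like this; the names mean_of and median_of and the six-value sample are illustrative and not part of the original notebook.

```python
def mean_of(values):
    """Arithmetic mean: sum of the values divided by their count."""
    values = list(values)
    if not values:
        raise ValueError("mean_of() needs at least one value")
    return sum(values) / len(values)


def median_of(values):
    """Middle value of the sorted data (average of the two middle values for an even count)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample = [106, 102, 123, 126, 124, 99]  # illustrative values
print("Mean:", round(mean_of(sample), 2))
print("Median:", median_of(sample))
```

Calling mean_of(int_list) and median_of(int_list) should agree with the NumPy results for the same data.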

(ii) Develop a program to compute the mode of a list of integers.

In [4]: import statistics as stats

mode_int_list = stats.mode(int_list)
print("Mode:", mode_int_list)

Mode: 123
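
When several values tie for the highest count, statistics.mode (Python 3.8 and later) returns only the first one it encounters. The sketch below, which assumes that reporting every tied value is what is wanted, collects all of them; all_modes is an illustrative name.

```python
from collections import Counter


def all_modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("all_modes() needs at least one value")
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# 110 and 123 both occur twice, so both are reported.
print(all_modes([93, 110, 123, 110, 123, 99]))
```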

(iii) Implement a function to calculate the weighted mean of a list of values and
their corresponding weights.

In [5]: def calculate_weighted_mean(values, weights):
    # Values and weights must pair up one-to-one.
    if len(values) != len(weights):
        return "Error: The number of values and weights should be the same."
    weighted_sum = 0
    total_weight = 0
    for i in range(len(values)):
        weighted_sum += values[i] * weights[i]
        total_weight += weights[i]
    weighted_mean = weighted_sum / total_weight
    return weighted_mean

values = int_list
weights = int_list  # each value is used as its own weight here
result = calculate_weighted_mean(values, weights)
print("The weighted mean is:", round(result, 2))

The weighted mean is: 110.87

In [6]: #Using numpy:


weighted_mean=np.average(int_list,weights=int_list)
print("The weighted mean is:", round(weighted_mean,2))

The weighted mean is: 110.87

In [7]: v = int_list
w = int_list
def weighted_mean(v, w):
    return sum(x * y for x, y in zip(v, w)) / sum(w)

print(round(weighted_mean(v, w), 2))

110.87

(iv) Write a Python function to find the geometric mean of a list of positive
numbers.

In [8]: def calculate_geometric_mean(values):
    # exp(mean(log(x))) equals the n-th root of the product; values must be positive.
    return np.exp(np.mean(np.log(values)))

values = int_list
result = calculate_geometric_mean(values)
print("The geometric mean is:", result)

The geometric mean is: 109.04969524412776

In [9]: # or use one line code


geo_mean = np.exp(np.mean(np.log(int_list)))
geo_mean

Out[9]: 109.04969524412776

(v) Create a program to calculate the harmonic mean of a list of values.

In [10]: def calculate_harmonic_mean(values):
    # Reciprocal of the mean of the reciprocals; every value must be non-zero.
    harmonic_mean = 1 / np.mean(1 / np.array(values))
    return harmonic_mean

values = int_list
result = calculate_harmonic_mean(values)
print("The harmonic mean is:", result)

The harmonic mean is: 108.43686474429205
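
For positive data the three means are ordered: harmonic ≤ geometric ≤ arithmetic, which matches the outputs above (108.44, 109.05, 109.66). The following sketch checks that ordering on a small illustrative sample and cross-checks the hand-written formulas against the standard library's statistics.geometric_mean and statistics.harmonic_mean (Python 3.8 and later).

```python
import math
import statistics

sample = [106, 102, 123, 126, 124, 99]  # illustrative positive values

arithmetic = sum(sample) / len(sample)
geometric = math.exp(sum(math.log(x) for x in sample) / len(sample))
harmonic = len(sample) / sum(1 / x for x in sample)

print("Harmonic <= geometric <= arithmetic:", harmonic <= geometric <= arithmetic)
print("Geometric matches stdlib:", math.isclose(geometric, statistics.geometric_mean(sample)))
print("Harmonic matches stdlib:", math.isclose(harmonic, statistics.harmonic_mean(sample)))
```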

(vi) Build a function to determine the midrange of a list of numbers (average of
the minimum and maximum).

In [11]: def Calculate_midrange(l):
    max_ = max(l)
    min_ = min(l)
    midrange = (max_ + min_) / 2
    return midrange

Calculate_midrange(int_list)

Out[11]: 109.0

(vii) Implement a Python program to find the trimmed mean of a list, excluding a
certain percentage of outliers.

In [12]: def calculate_trimmed_mean(num_list, percentage):
    sorted_list = sorted(num_list)
    # Number of values to drop from each end of the sorted list.
    exclude_count = round((percentage / 100) * len(sorted_list))
    if exclude_count == 0:
        return sum(sorted_list) / len(sorted_list)  # no trimming

    trimmed_list = sorted_list[exclude_count:-exclude_count]
    return sum(trimmed_list) / len(trimmed_list)

calculate_trimmed_mean(int_list, 10)

Out[12]: 109.725

In [13]: # Or using numpy

def trimmed_mean(values, trim_percent):
    """
    Calculate the trimmed mean by removing a percentage of smallest and largest values.
    """
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be between 0 and 50")

    arr = np.sort(np.array(values))
    n = len(arr)
    k = int(n * trim_percent / 100)

    trimmed = arr[k:n-k]  # exclude outliers from both ends
    return np.mean(trimmed)

result = trimmed_mean(int_list, 10)  # trim 10% from each side
print("Trimmed Mean:", result)

Trimmed Mean: 109.725
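
The two implementations agree here because 10% of 100 values is exactly 10 per side. They differ in how a fractional count is handled: the first uses round(), the second int(). The sketch below makes that rule a parameter so the effect is visible; trimmed_mean_with and the 15-value data set are illustrative assumptions.

```python
def trimmed_mean_with(values, percent, count_rule):
    """Trimmed mean where count_rule (e.g. round or int) turns the cut size into a whole number."""
    ordered = sorted(values)
    k = count_rule(percent * len(ordered) / 100)  # values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 100]  # 15 illustrative values

# 10% of 15 is 1.5: round() drops 2 values per side, int() drops only 1.
print("round rule:", trimmed_mean_with(data, 10, round))
print("int rule:  ", trimmed_mean_with(data, 10, int))
```

With the round rule both large values are removed; with the int rule the value 50 survives and pulls the mean up.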

Q2. Generate a list of 500 integers containing values
between 200 and 300 and store it in the variable
"int_list2". After generating the list, find the following:

(i) Compare the following visualizations of the given data:

1. Frequency & Gaussian distribution

2. Frequency smoothened KDE plot

3. Gaussian distribution & smoothened KDE plot

(ii) Write a Python function to calculate the range of a given list of numbers.

(iii) Create a program to find the variance and standard deviation of a list of
numbers.

(iv) Implement a function to compute the interquartile range (IQR) of a list of
values.

(v) Build a program to calculate the coefficient of variation for a dataset.

(vi) Write a Python function to find the mean absolute deviation (MAD) of a list of
numbers.

(vii) Create a program to calculate the quartile deviation of a list of values.

(viii) Implement a function to find the range-based coefficient of dispersion for a
dataset.

In [14]: int_list2 = np.random.randint(200, 301, 500)  # the upper bound is exclusive, so 301 lets 300 occur
int_list2

Out[14]: array([242, 247, 263, 209, 245, 287, 223, 257, 216, 230, 228, 235, 210,
209, 297, 255, 250, 214, 256, 288, 230, 222, 215, 242, 211, 271,
203, 204, 287, 278, 211, 262, 295, 210, 219, 254, 292, 264, 266,
277, 288, 254, 273, 298, 235, 245, 261, 202, 249, 281, 299, 237,
292, 299, 244, 251, 241, 254, 257, 235, 275, 225, 295, 269, 240,
275, 251, 264, 260, 208, 279, 282, 251, 229, 253, 206, 229, 217,
213, 253, 262, 294, 262, 237, 256, 260, 280, 283, 241, 261, 213,
214, 251, 221, 235, 246, 245, 273, 201, 228, 205, 215, 249, 299,
299, 239, 253, 289, 228, 216, 247, 205, 277, 241, 217, 235, 271,
240, 266, 217, 288, 203, 284, 203, 287, 248, 206, 274, 266, 205,
285, 261, 262, 234, 266, 238, 217, 246, 208, 233, 271, 231, 220,
226, 219, 206, 270, 243, 242, 209, 259, 283, 287, 281, 246, 222,
256, 293, 284, 205, 210, 287, 203, 276, 262, 273, 218, 287, 235,
263, 240, 274, 221, 212, 247, 276, 272, 263, 269, 213, 250, 256,
285, 228, 200, 289, 258, 260, 231, 249, 260, 277, 235, 218, 241,
224, 204, 281, 219, 247, 239, 208, 283, 287, 257, 257, 252, 295,
258, 234, 283, 220, 251, 250, 257, 245, 220, 207, 295, 297, 244,
265, 204, 243, 209, 252, 265, 284, 243, 274, 242, 225, 200, 263,
262, 225, 243, 264, 233, 220, 258, 238, 294, 222, 232, 267, 267,
294, 260, 243, 245, 210, 221, 298, 251, 227, 228, 260, 249, 239,
254, 201, 283, 296, 223, 237, 275, 259, 246, 292, 203, 293, 283,
219, 274, 244, 232, 212, 261, 200, 227, 254, 260, 250, 265, 219,
214, 270, 248, 279, 227, 265, 273, 205, 293, 240, 255, 243, 224,
208, 253, 280, 284, 211, 201, 235, 272, 284, 293, 221, 243, 298,
289, 227, 246, 253, 269, 255, 292, 277, 298, 289, 282, 221, 261,
279, 219, 252, 296, 233, 262, 268, 277, 295, 256, 237, 241, 295,
228, 267, 240, 267, 273, 223, 259, 211, 277, 274, 234, 208, 239,
267, 260, 282, 216, 223, 258, 268, 289, 267, 262, 231, 263, 216,
250, 226, 246, 293, 279, 256, 283, 271, 217, 219, 282, 284, 227,
270, 276, 288, 271, 225, 202, 289, 254, 254, 261, 205, 238, 292,
250, 248, 229, 260, 235, 213, 279, 202, 264, 293, 219, 269, 267,
290, 236, 281, 250, 273, 243, 234, 218, 271, 207, 261, 232, 217,
251, 295, 208, 241, 205, 257, 270, 262, 283, 290, 240, 234, 236,
278, 225, 232, 291, 247, 226, 273, 282, 269, 283, 204, 242, 243,
295, 234, 224, 260, 268, 222, 271, 296, 299, 245, 249, 207, 273,
247, 234, 277, 236, 263, 243, 297, 203, 200, 296, 236, 240, 283,
247, 221, 232, 225, 216, 240, 275, 297, 207, 297, 265, 264, 243,
296, 249, 267, 281, 222, 206, 219, 238, 270, 297, 296, 202, 244,
216, 230, 209, 212, 295, 222])

(i) Compare the following visualizations of the given data:

1. Frequency & Gaussian distribution: This visualization shows the frequency of
data points along with a Gaussian distribution curve. It helps in understanding the
distribution of the data and how closely it aligns with a normal distribution.

In [15]: import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Step 1: Generate the list


int_list2 = np.random.randint(200, 301, 500)

# Step 2: Plot frequency (histogram)


plt.hist(int_list2, bins=15, density=True, alpha=0.6, color='skyblue', edgecolor='black',
         label='Frequency')  # the source line is cut off after "l"; this label is an assumed completion

# Step 3: Fit a normal distribution & plot Gaussian curve


mu, sigma = np.mean(int_list2), np.std(int_list2)
x = np.linspace(min(int_list2), max(int_list2), 200)
pdf = norm.pdf(x, mu, sigma)
plt.plot(x, pdf, 'r', linewidth=2, label=f'Gaussian fit\nμ={mu:.2f}, σ={sigma:.2f}')

# Step 4: Labels & legend


plt.title("Frequency Distribution & Gaussian Fit")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()

In [16]: import matplotlib.pyplot as plt

#Frequency distribution
plt.hist(int_list2, bins=15, density=True, alpha=0.6, color='skyblue', edgecolor='black',
         label='Frequency')  # the source line is cut off after "l"; this label is an assumed completion
plt.xlabel("Value")    # the histogram bins run along the x-axis
plt.ylabel("Density")  # density=True normalizes the bar heights
plt.title("Frequency Distribution")
plt.show()

#Gaussian curve
mu, sigma = np.mean(int_list2), np.std(int_list2)
x = np.linspace(min(int_list2), max(int_list2), 200)
y = norm.pdf(x, mu, sigma)
plt.plot(x, y, 'r', linewidth=2, label=f'Gaussian fit\nμ={mu:.2f}, σ={sigma:.2f}')
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.legend()
plt.show()

2. Frequency smoothened KDE plot: This visualization represents the data using a
Kernel Density Estimation (KDE) plot, which smooths the data and provides a
continuous density estimate. It shows the distribution of the data as a smooth
curve, giving insight into the shape and density of the data.

In [17]: #In one line command


sns.kdeplot(int_list2, bw_adjust=0.5, color='skyblue', label='KDE (Smoothed Frequency)',
            linewidth=2)  # the source line is cut off after "li"; linewidth=2 is an assumed completion

Out[17]: <Axes: ylabel='Density'>

In [18]: data=int_list2
values,counts=np.unique(data,return_counts=True)

# Rebuild the sorted sample from the frequency table; the KDE itself does the smoothing.
smoothed_values = np.repeat(values, counts)

sns.kdeplot(smoothed_values,bw_adjust=0.5)
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Frequency smoothened KDE plot")
plt.show()

3. Gaussian distribution & smoothened KDE plot: This visualization pairs the
Gaussian distribution curve with the smoothened KDE plot. It allows for a
comparison between a normal distribution fitted from the sample mean and standard
deviation and the density estimated directly from the data by the KDE.

In [19]: #Gaussian Distribution


mu=np.mean(int_list2)
sigma=np.std(int_list2)
x=np.linspace(mu-3*sigma,mu+3*sigma,100)
y=(1/(sigma*np.sqrt(2*np.pi)))*np.exp(-0.5*((x-mu)/sigma)**2)

plt.plot(x,y)
plt.xlabel("Values")
plt.ylabel("Probability Distribution")
plt.title("Gaussian Distribution")
plt.show()

#Smoothened KDE plot

data = int_list2
values, counts = np.unique(data, return_counts=True)
# Rebuild the sorted sample from the frequency table; the KDE itself does the smoothing.
smoothed_values = np.repeat(values, counts)

sns.kdeplot(smoothed_values, bw_adjust=0.5)
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Frequency smoothened KDE plot")
plt.show()
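
To compare the two curves directly, both can be evaluated on the same grid. The sketch below does this with NumPy only. It assumes a Gaussian kernel with Scott's rule bandwidth scaled by 0.5, which is meant to mirror bw_adjust=0.5 rather than reproduce seaborn exactly; the seeded data is a stand-in for int_list2.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded so the sketch is repeatable
data = rng.integers(200, 301, size=500)  # stand-in for int_list2

mu, sigma = data.mean(), data.std()
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)


def normal_pdf(x, mean, std):
    """Density of a normal distribution, broadcast over NumPy arrays."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# Parametric curve: one normal distribution fitted to the whole sample.
gaussian_curve = normal_pdf(x, mu, sigma)

# Gaussian KDE: the average of one narrow normal bump per data point.
bandwidth = 0.5 * data.std(ddof=1) * len(data) ** (-1 / 5)  # assumed: Scott's rule times 0.5
kde_curve = normal_pdf(x[:, None], data[None, :], bandwidth).mean(axis=1)

dx = x[1] - x[0]
print("Area under Gaussian:", round(float(gaussian_curve.sum() * dx), 3))
print("Area under KDE:     ", round(float(kde_curve.sum() * dx), 3))
print("Largest gap between the curves:", float(np.abs(kde_curve - gaussian_curve).max()))
```

Passing x with gaussian_curve and then x with kde_curve to plt.plot on the same axes would overlay the two curves in a single figure.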

(ii) Write a Python function to calculate the range of a given list of numbers.

In [20]: def range_(l):
    return max(l) - min(l)

print('Range:', range_(int_list2))

Range: 100

(iii) Create a program to find the variance and standard deviation of a list of
numbers.

In [23]: def Variance(l, dof):
    n = len(l)
    # Mean of the data
    mean = sum(l) / n
    # Squared deviations from the mean
    deviation = [(x - mean) ** 2 for x in l]
    # dof=0 gives the population variance, dof=1 the sample variance
    variance = sum(deviation) / (n - dof)
    return variance

Sample_Variance = Variance(int_list2, 1)
Population_Variance = Variance(int_list2, 0)
Sample_Std = round(np.sqrt(Sample_Variance), 2)
Population_Std = round(np.sqrt(Population_Variance), 2)
print("Sample Variance:", round(Sample_Variance, 2))
print("population Variance", round(Population_Variance, 2))
print("Sample Standard Deviation:", Sample_Std)
print("Population Standard Deviation:", Population_Std)

Sample Variance: 816.18
population Variance 814.54
Sample Standard Deviation: 28.57
Population Standard Deviation: 28.54

In [24]: # Using NumPy

var_p = round(np.var(int_list2, ddof=0), 2)  # ddof=0: population, divide by n
var_s = round(np.var(int_list2, ddof=1), 2)  # ddof=1: sample, divide by n-1
std_dev_p = round(np.std(int_list2, ddof=0), 2)
std_dev_s = round(np.std(int_list2, ddof=1), 2)
print("Population Variance", var_p)
print("Sample Variance", var_s)
print("Population Standard Deviation", std_dev_p)
print("Sample Standard Deviation", std_dev_s)

Population Variance 814.54
Sample Variance 816.18
Population Standard Deviation 28.54
Sample Standard Deviation 28.57

(iv) Implement a function to compute the interquartile range (IQR) of a list of
values.

In [25]: def IQR(l):
    q1, q3 = np.percentile(l, [25, 75])
    return q3 - q1

print("IQR:", IQR(int_list2))

IQR: 48.25

In [26]: def compute_iqr(values):
    q1 = np.percentile(values, 25)
    q3 = np.percentile(values, 75)
    iqr = q3 - q1
    return iqr

iqr_value = compute_iqr(int_list2)
print("IQR", iqr_value)

IQR 48.25

(v) Build a program to calculate the coefficient of variation for a dataset.

In [27]: def coefficient_of_variation(data):
    mean = np.mean(data)
    std_dev = np.std(data)  # population standard deviation (ddof=0)
    coefficient = (std_dev / mean) * 100
    return coefficient

cv = coefficient_of_variation(int_list2).round(2)
print("The coefficient of variation is:", cv)

The coefficient of variation is: 11.35

(vi) Write a Python function to find the mean absolute deviation (MAD) of a list of
numbers.

In [28]: def mean_absolute_deviation(numbers):
    mean = sum(numbers) / len(numbers)
    deviations = [abs(X - mean) for X in numbers]
    mad = sum(deviations) / len(numbers)
    return mad

data = int_list2
mad = mean_absolute_deviation(data).round(2)
print("The Mean Absolute Deviation is:", mad)

The Mean Absolute Deviation is: 24.69

(vii) Create a program to calculate the quartile deviation of a list of values.

In [29]: def quartile_deviation(data):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    deviation = (q3 - q1) / 2  # half of the interquartile range
    return deviation

dataset = int_list2
qd = quartile_deviation(dataset)
print("The quartile deviation is:", qd)

The quartile deviation is: 24.125

(viii) Implement a function to find the range-based coefficient of dispersion for a
dataset.

In [30]: def range_coefficient_of_dispersion(data):
    coefficient = (max(data) - min(data)) / (max(data) + min(data))
    return coefficient

dataset = int_list2
rcd = range_coefficient_of_dispersion(dataset).round(2)
print("The range-based coefficient of dispersion is:", rcd)

The range-based coefficient of dispersion is: 0.2

3. Write a Python class representing a discrete random
variable with methods to calculate its expected value and
variance.

In [31]: class DiscreteRandomVariable:
    def __init__(self, values, probabilities):
        self.values = values
        self.probabilities = probabilities

    def expected_value(self):
        # E[X] = sum of x * P(x)
        return sum(value * probability
                   for value, probability in zip(self.values, self.probabilities))

    def variance(self):
        # Var(X) = sum of (x - E[X])^2 * P(x)
        expected_value = self.expected_value()
        return sum((value - expected_value) ** 2 * probability
                   for value, probability in zip(self.values, self.probabilities))

values = [1, 2, 3, 4]
probabilities = [0.2, 0.3, 0.4, 0.1]

rv = DiscreteRandomVariable(values, probabilities)
print("Expected Value:", round(rv.expected_value(), 2))
print("Variance:", round(rv.variance(), 2))

Expected Value: 2.4
Variance: 0.84

In [41]: #Another Way

class DiscreteRandomVariable:
    def __init__(self, distribution):
        self.distribution = distribution  # maps each value to its probability

    def expected_value(self):
        return sum(x * p for x, p in self.distribution.items())

    def variance(self):
        mean = self.expected_value()
        return sum((x - mean) ** 2 * p for x, p in self.distribution.items())

dist = {1: 0.2, 2: 0.3, 3: 0.4, 4: 0.1}

rv = DiscreteRandomVariable(dist)

print("Expected Value", round(rv.expected_value(), 2))
print("Variance", round(rv.variance(), 2))

Expected Value 2.4
Variance 0.84
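
Both classes trust that the probabilities form a valid distribution. The sketch below adds that check and confirms the variance through the shortcut Var(X) = E[X²] − (E[X])², using the same values as above; the variable names are illustrative.

```python
import math

values = [1, 2, 3, 4]
probabilities = [0.2, 0.3, 0.4, 0.1]

# A probability distribution must be non-negative and sum to 1.
if any(p < 0 for p in probabilities) or not math.isclose(sum(probabilities), 1.0):
    raise ValueError("probabilities must be non-negative and sum to 1")

expected = sum(x * p for x, p in zip(values, probabilities))           # E[X]
second_moment = sum(x ** 2 * p for x, p in zip(values, probabilities))  # E[X^2]

variance_shortcut = second_moment - expected ** 2
variance_definition = sum((x - expected) ** 2 * p for x, p in zip(values, probabilities))

print("E[X]:", round(expected, 2))
print("Var(X) via shortcut:  ", round(variance_shortcut, 2))
print("Var(X) via definition:", round(variance_definition, 2))
```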

4. Implement a program to simulate the rolling of a fair
six-sided die and calculate the expected value and
variance of the outcomes.

In [47]: import random

def roll_die():
    return random.randint(1, 6)  # both endpoints are included

def simulate_rolls(num_rolls):
    rolls = [roll_die() for i in range(num_rolls)]
    return rolls

def calculate_expected_value(rolls):
    return sum(rolls) / len(rolls)

def calculate_variance(rolls):
    expected_value = calculate_expected_value(rolls)
    squared_diff = [(roll - expected_value) ** 2 for roll in rolls]
    return sum(squared_diff) / len(rolls)

rolls = simulate_rolls(100)

expected_value = calculate_expected_value(rolls)
variance = calculate_variance(rolls)

print("Expected Value:", expected_value)
print("Variance:", variance)

Expected Value: 3.6
Variance: 2.7800000000000007

In [48]: #Another Way

import random
import statistics

def simulate_die_rolls(n=100):
    outcomes = [random.randint(1, 6) for i in range(n)]
    return outcomes

rolls = simulate_die_rolls(100)

exp_val = statistics.mean(rolls)
var = statistics.pvariance(rolls)

print("Expected Value", exp_val)
print("Variance", var)

Expected Value 3.69
Variance 2.9339
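
The simulated figures can be compared with the exact values for a fair die. The sketch below computes them with exact fractions, assuming each face has probability 1/6.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # a fair die gives every face the same probability

expected = sum(face * p for face in faces)                    # E[X]
variance = sum((face - expected) ** 2 * p for face in faces)  # Var(X)

print("Exact expected value:", expected, "=", float(expected))
print("Exact variance:", variance, "≈", round(float(variance), 4))
```

The simulated mean and variance from 100 rolls scatter around these exact values and move closer as the number of rolls grows.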

5. Create a Python function to generate random samples
from a given probability distribution (e.g.,
binomial, Poisson) and calculate their mean and variance.

In [68]: def generate_sample(distribution, size):
    if distribution == "binomial":
        sample = np.random.binomial(n=10, p=0.5, size=size)
    elif distribution == "poisson":
        sample = np.random.poisson(lam=5, size=size)
    else:
        # Raise instead of returning a string, which the callers below could not unpack.
        raise ValueError("Invalid distribution. Please choose either 'binomial' or 'poisson'.")

    mean = np.mean(sample)
    variance = np.var(sample)
    return sample, mean, variance

samples_binomial, binomial_mean, binomial_variance = generate_sample("binomial", 1000)
print("Mean:", round(binomial_mean, 2), "Variance:", round(binomial_variance, 2))
samples_poisson, poisson_mean, poisson_variance = generate_sample("poisson", 1000)
print("Mean:", round(poisson_mean, 2), "Variance:", round(poisson_variance, 2))

Mean: 5.02 Variance: 2.54
Mean: 5.06 Variance: 4.86

In [70]: #Another way

def generate_samples(distribution, params, size=1000):
    if distribution == "binomial":
        samples = np.random.binomial(params["n"], params["p"], size)
    elif distribution == "poisson":
        samples = np.random.poisson(params["lam"], size)
    else:
        raise ValueError("Unsupported distribution. Use 'binomial' or 'poisson'.")

    mean = np.mean(samples)
    variance = np.var(samples)

    return samples, mean, variance

samples_binomial, mean_binomial, var_binomial = generate_samples(
    distribution="binomial",
    params={"n": 10, "p": 0.5},
    size=1000
)

samples_poisson, mean_poisson, var_poisson = generate_samples(
    distribution="poisson",
    params={"lam": 5},
    size=1000
)

print("Binomial Distribution → Mean:", mean_binomial, "Variance:", var_binomial)
print("Poisson Distribution → Mean:", mean_poisson, "Variance:", var_poisson)

Binomial Distribution → Mean: 5.017 Variance: 2.5687110000000004
Poisson Distribution → Mean: 5.027 Variance: 5.1862710000000005
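
The sample statistics above can be read against the theoretical values: a binomial(n, p) variable has mean np and variance np(1 − p), and a Poisson(λ) variable has mean and variance both equal to λ. The sketch below prints both side by side; the seed and the larger sample size are assumptions chosen only to make the run repeatable and the estimates tight.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable
n, p, lam, size = 10, 0.5, 5, 100_000

binomial_sample = rng.binomial(n, p, size)
poisson_sample = rng.poisson(lam, size)

print("Binomial theory: mean", n * p, "variance", n * p * (1 - p))
print("Binomial sample: mean", binomial_sample.mean(), "variance", binomial_sample.var())
print("Poisson theory:  mean", lam, "variance", lam)
print("Poisson sample:  mean", poisson_sample.mean(), "variance", poisson_sample.var())
```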

6. Write a Python script to generate random numbers
from a Gaussian (normal) distribution and compute the
mean, variance, and standard deviation of the samples.

In [75]: def generate_gaussian_samples(mean, std_dev, size):
    samples = np.random.normal(mean, std_dev, size)
    sample_mean = np.mean(samples).round(4)
    sample_variance = np.var(samples).round(4)
    sample_std_dev = np.std(samples).round(4)
    return sample_mean, sample_variance, sample_std_dev

mean = 0
std_dev = 1
sample_size = 1000
sample_mean, sample_variance, sample_std_dev = generate_gaussian_samples(mean, std_dev, sample_size)
print("Mean:", sample_mean)
print("Variance:", sample_variance)
print("Standard Deviation:", sample_std_dev)

Mean: -0.0314
Variance: 0.9933
Standard Deviation: 0.9967

7. Use the seaborn library to load the 'tips' dataset. Find the
following from the dataset for the columns "total_bill"
and "tip":

(i) Write a Python function that calculates their skewness.

(ii) Create a program that determines whether the columns exhibit positive
skewness, negative skewness, or are approximately symmetric.

(iii) Write a function that calculates the covariance between two columns.

(iv) Implement a Python program that calculates the Pearson correlation
coefficient between two columns.

(v) Write a script to visualize the correlation between two specific columns in a
Pandas DataFrame using scatter plots.

In [3]: import seaborn as sns


tips=sns.load_dataset("tips")
tips

Out[3]:
     total_bill   tip     sex  smoker   day    time  size
0         16.99  1.01  Female      No   Sun  Dinner     2
1         10.34  1.66    Male      No   Sun  Dinner     3
2         21.01  3.50    Male      No   Sun  Dinner     3
3         23.68  3.31    Male      No   Sun  Dinner     2
4         24.59  3.61  Female      No   Sun  Dinner     4
...         ...   ...     ...     ...   ...     ...   ...
239       29.03  5.92    Male      No   Sat  Dinner     3
240       27.18  2.00  Female     Yes   Sat  Dinner     2
241       22.67  2.00    Male     Yes   Sat  Dinner     2
242       17.82  1.75    Male      No   Sat  Dinner     2
243       18.78  3.00  Female      No  Thur  Dinner     2

244 rows × 7 columns

(i) Write a Python function that calculates their skewness.

In [4]: def calculate_skewness():
    total_bill = tips["total_bill"]
    tip = tips["tip"]
    total_bill_skewness = total_bill.skew().round(2)
    tip_skewness = tip.skew().round(2)
    print("Skewness of total_bill column:", total_bill_skewness)
    print("Skewness of tip column:", tip_skewness)

calculate_skewness()

Skewness of total_bill column: 1.13
Skewness of tip column: 1.47

In [5]: #Another Way
from scipy.stats import skew

def calculate_skew(series):
    # bias=False applies the sample-size correction, matching pandas' Series.skew()
    return skew(series, bias=False)

total_bill_skew = calculate_skew(tips['total_bill']).round(2)
tip_skew = calculate_skew(tips['tip']).round(2)

print("Skewness of total_bill column", total_bill_skew)
print("Skewness of tip column", tip_skew)

Skewness of total_bill column 1.13
Skewness of tip column 1.47

(ii) Create a program that determines whether the columns exhibit positive
skewness, negative skewness, or approximate symmetry.

In [22]: for col in ['total_bill', 'tip']:
    s = skew(tips[col].dropna())
    # Skewness within ±0.5 is treated as approximately symmetric.
    print(f"{col}:{'positive Skewed' if s > 0.5 else 'Negative skewed' if s < -0.5 else 'Approximately symmetric'}")

total_bill:positive Skewed
tip:positive Skewed

(iii) Write a function that calculates the covariance between two columns.

In [26]: covariance = tips["total_bill"].cov(tips["tip"]).round(2)


print("The covariance between 'total_bill' and 'tip' is:", covariance)

The covariance between 'total_bill' and 'tip' is: 8.32

(iv) Implement a Python program that calculates the Pearson correlation coefficient
between two columns.

In [28]: correlation = tips['total_bill'].corr(tips["tip"]).round(2)

print("The Pearson correlation coefficient between 'total_bill' and 'tip' is:", correlation)

The Pearson correlation coefficient between 'total_bill' and 'tip' is: 0.68
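
Parts (iii) and (iv) ask for functions, while the cells above call the pandas methods directly. The sketch below writes the two formulas out by hand and checks them against np.cov and np.corrcoef. It uses only the first five rows shown in the table above, so its numbers differ from the full-dataset results; sample_covariance and pearson_r are illustrative names.

```python
import numpy as np

bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])  # first five total_bill values shown above
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])        # first five tip values shown above


def sample_covariance(x, y):
    """Sum of products of deviations divided by n - 1 (the same convention pandas uses)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance rescaled by both sample standard deviations, giving a value in [-1, 1]."""
    return sample_covariance(x, y) / (x.std(ddof=1) * y.std(ddof=1))


print("Covariance:", sample_covariance(bill, tip), "vs np.cov:", np.cov(bill, tip)[0, 1])
print("Pearson r: ", pearson_r(bill, tip), "vs np.corrcoef:", np.corrcoef(bill, tip)[0, 1])
```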

(v) Write a script to visualize the correlation between two specific columns in a
pandas DataFrame using scatter plots.

In [9]: import matplotlib.pyplot as plt


x= tips["total_bill"]
y= tips["tip"]

plt.scatter(x, y, alpha =0.7)


plt.xlabel("total_bill")
plt.ylabel("tip")
plt.title("Correlation between total_bill and tip")
plt.show()

8. Write a Python function to calculate the probability
density function (PDF) of a continuous random variable
for a given normal distribution.

In [14]: def normal_pdf(x, mean=0, std_dev=1):
    coeff = 1 / (std_dev * np.sqrt(2 * np.pi))
    exponent = np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
    return coeff * exponent

x = np.linspace(-5, 5, 100)

# Use the function defined above for the standard normal distribution.
pdf = normal_pdf(x, mean=0, std_dev=1)

plt.plot(x, pdf)
plt.xlabel("x")
plt.ylabel("PDF")
plt.title("Normal Distribution PDF")
plt.show()
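
A quick way to validate a hand-written density is to compare it with a reference implementation at a few points. The sketch below restates the formula in scalar form and checks it against statistics.NormalDist from the standard library; the chosen points and parameters are arbitrary.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean=0.0, std_dev=1.0):
    """Normal density: 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))."""
    coeff = 1 / (std_dev * math.sqrt(2 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mean) / std_dev) ** 2)


reference = NormalDist(mu=10, sigma=2)
for x in (6.0, 10.0, 11.5):
    print(x, normal_pdf(x, 10, 2), reference.pdf(x))
```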

9. Create a program to calculate the cumulative
distribution function (CDF) of exponential distribution.

In [15]: def exponential_cdf(x, lam=1.0):
    # F(x) = 1 - exp(-lam * x), valid for x >= 0
    return 1 - np.exp(-lam * x)

x_values = np.linspace(0, 10, 100)
cdf_values = exponential_cdf(x_values, lam=0.5)

plt.plot(x_values, cdf_values, label="Exponential CDF (λ=0.5)", color="red")
plt.xlabel("x")
plt.ylabel("CDF")
plt.title("Exponential Distribution CDF")
plt.legend()
plt.grid(True)
plt.show()
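
The formula 1 − exp(−λx) only describes the CDF for x ≥ 0; for negative x the CDF is 0. The sketch below adds that guard and inverts the CDF to obtain quantiles, which gives a simple check: the median should equal ln(2)/λ. The function names are illustrative.

```python
import math


def exponential_cdf(x, lam=1.0):
    """P(X <= x) for an exponential distribution with rate lam; 0 for negative x."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0


def exponential_quantile(q, lam=1.0):
    """Inverse of the CDF: the x at which P(X <= x) equals q, for 0 <= q < 1."""
    return -math.log(1 - q) / lam


lam = 0.5
median = exponential_quantile(0.5, lam)
print("Median:", median, "which equals ln(2)/lam =", math.log(2) / lam)
print("CDF at the median:", exponential_cdf(median, lam))
```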

10. Write a Python function to calculate the probability
mass function (PMF) of the Poisson distribution.

In [31]: import math

def calculate_pmf(k, lam):
    # P(X = i) = exp(-lam) * lam**i / i! for each i in k
    return [(math.exp(-lam) * (lam ** i)) / math.factorial(i) for i in k]

k = np.arange(0, 10)  # values of k
lam = 2               # parameter lambda
pmf = calculate_pmf(k, lam)

plt.stem(k, pmf)
plt.xlabel('k')
plt.ylabel('PMF')
plt.title('Poisson Distribution PMF')
plt.show()
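
Computing lam ** k and k! separately can become numerically unwieldy for large k. One common alternative, sketched here under the assumption that lam is positive, evaluates the PMF in log space with math.lgamma, where lgamma(k + 1) equals ln(k!). The checks confirm that the probabilities sum to about 1 and that the mean is λ.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) computed as exp(k·ln(lam) − lam − ln(k!))."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


lam = 2
support = range(0, 50)  # the tail beyond 50 is negligible for lam = 2
pmf = [poisson_pmf(k, lam) for k in support]

total = sum(pmf)
mean = sum(k * p for k, p in zip(support, pmf))
print("Sum of probabilities:", total)
print("Mean:", mean)
```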

11. A company wants to test if a new website layout leads to a higher
conversion rate (percentage of visitors who make a purchase). They
collect data from the old and new layouts to compare. Apply the z-test
to find which layout is successful.

To generate the data, use the following commands:

import numpy as np

#50 purchases out of 1000 visitors
old_layouts=np.array([1]*50+[0]*950)
#70 purchases out of 1000 visitors
new_layouts=np.array([1]*70+[0]*930)

In [37]: #Define the data


old_layouts = np.array([1] * 50 + [0] * 950)
new_layouts = np.array([1] * 70 + [0] * 930)

In [38]: from statsmodels.stats.proportion import proportions_ztest

#Perform the z-test
successes = np.array([np.sum(old_layouts), np.sum(new_layouts)])
nobs = np.array([len(old_layouts), len(new_layouts)])

z_score, p_value = proportions_ztest(successes, nobs)  # two-sided by default

#Interpret the results
if p_value < 0.05:
    print("The difference in conversion rates is statistically significant.")
    if z_score < 0:
        print("The new layout has a higher conversion rate.")
    else:
        print("The old layout has a higher conversion rate.")
else:
    print("There is no statistically significant difference in conversion rates.")

There is no statistically significant difference in conversion rates.
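
The result above comes from a two-sided test, whereas the question is directional (does the new layout convert better?). The sketch below recomputes the pooled two-proportion z-statistic by hand and reports both p-values, using only the standard library. The one-sided alternative is an assumption about how the question should be framed, not something the original cell does.

```python
import math
from statistics import NormalDist

old_success, old_n = 50, 1000
new_success, new_n = 70, 1000

p_old = old_success / old_n
p_new = new_success / new_n

# Pooled proportion under H0: both layouts share one conversion rate.
p_pool = (old_success + new_success) / (old_n + new_n)
std_error = math.sqrt(p_pool * (1 - p_pool) * (1 / old_n + 1 / new_n))
z = (p_old - p_new) / std_error

std_normal = NormalDist()
p_two_sided = 2 * std_normal.cdf(-abs(z))  # H1: the rates differ
p_one_sided = std_normal.cdf(z)            # H1: the old rate is lower than the new rate

print(f"z = {z:.3f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"one-sided p = {p_one_sided:.4f}")
```

At the 5% level the two-sided test does not reject the null hypothesis while the one-sided test does, so the conclusion depends on which alternative is stated before looking at the data.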

12. A tutoring service claims that its program improves students'
exam scores. A sample of students who participated in the program
was taken, and their scores before and after the program were
recorded. Use the code below to generate the respective arrays
of marks:
before_program=np.array([75,80,85,70,90,78,92,88,82,87])
after_program=np.array([80,85,90,80,92,80,95,90,85,88])

Use z-test to find if the claims made by the tutor are true or false.

In [13]: import numpy as np
from scipy.stats import norm

#Define the data
before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])
after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

#Calculate the mean of the differences and its standard error
differences = after_program - before_program
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1) / np.sqrt(len(differences))  # standard error of the mean difference

#Perform the z-test (one-sided p-value)
z_score = (mean_diff - 0) / std_diff
p_value = (1 - norm.cdf(abs(z_score)))

#Interpret the results
if p_value < 0.05:
    print("The tutoring program has a statistically significant impact on exam scores.")
    if z_score > 0:
        print("The students' scores improved after the program.")
    else:
        print("The students' scores decreased after the program.")
else:
    print("There is no statistically significant impact of the tutoring program on exam scores.")

print(f"Z-Score {z_score:.2f}")
print((f"P-value {p_value:.2e}"))

The tutoring program has a statistically significant impact on exam scores.
The students' scores improved after the program.
Z-Score 4.59
P-value 2.18e-06
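
With only 10 paired observations and a standard deviation estimated from the sample, a paired t-test is the more usual choice than a z-test. The sketch below runs one with scipy.stats.ttest_rel, assuming SciPy 1.6 or later for the alternative argument; the test statistic is the same number as the z-score above, but the p-value comes from a t-distribution with 9 degrees of freedom and is therefore larger.

```python
import numpy as np
from scipy import stats

before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])
after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

# H1: scores after the program are higher than before (one-sided).
result = stats.ttest_rel(after_program, before_program, alternative="greater")

print(f"t-statistic {result.statistic:.2f}")
print(f"one-sided p-value {result.pvalue:.2e}")
```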

13. A pharmaceutical company wants to determine if a new drug is
effective in reducing blood pressure. They conduct a study and
record blood pressure measurements before and after administering
the drug. Use the code below to generate the respective
arrays of blood pressure:
before_drug=np.array([145,150,140,135,155,160,152,148,130,138])
after_drug=np.array([130,140,132,128,145,148,138,136,125,130])

Implement z_test to find if the drug really works or not.


In [22]: from scipy.stats import norm

# Define the data
before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])
after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])

# Calculate the mean of the differences and its standard error
differences = after_drug - before_drug
n = len(differences)
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1) / np.sqrt(n)  # standard error of the mean difference

# Perform the z-test (two-sided p-value)
z_score = (mean_diff - 0) / std_diff
p_value = 2 * (1 - norm.cdf(abs(z_score)))

# Interpret the results
if p_value < 0.05:
    print("The drug has a statistically significant effect in reducing blood pressure.")
    if z_score < 0:
        print("The drug lowers blood pressure.")
    else:
        print("The drug increases blood pressure.")
else:
    print("There is no statistically significant effect of the drug in reducing blood pressure.")

print(f"Z-score {z_score:.2f}")
print(f"P-value {p_value:.2e}")

The drug has a statistically significant effect in reducing blood pressure.
The drug lowers blood pressure.
Z-score -10.05
P-value 0.00e+00

14. A customer service department claims that their average
response time is less than 5 minutes. A sample of recent customer
interactions was taken, and the response times were recorded.
Use the code below to generate the array of response times:
response_times=np.array([4.3,3.8,5.1,4.9,4.7,4.2,5.2,4.5,4.6,4.4])

Implement z_test to find whether the claims made by the customer service
department are true or false.

In [9]: from scipy.stats import norm

#Define the data
response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])

#Calculate the sample mean and standard deviation
sample_mean = np.mean(response_times)
sample_std = np.std(response_times, ddof=1)

#Set the null hypothesis mean and significance level
null_mean = 5
alpha = 0.05
# H0: mean response time >= 5 minutes
# H1: mean response time < 5 minutes (left-tailed test)

#Calculate the z-score
z_score = (sample_mean - null_mean) / (sample_std / np.sqrt(len(response_times)))

#Calculate the p-value (left tail)
p_value = norm.cdf(z_score)

#Interpret the results
if p_value < alpha:
    print("The claims made by the customer service department are true.")
else:
    print("The claims made by the customer service department are false.")

print(f"Z-score {z_score:.2f}")
print(f"P-value {p_value:.2e}")

The claims made by the customer service department are true.
Z-score -3.18
P-value 7.25e-04

15. A company is testing two different website layouts to see which
one leads to higher click-through rates. Write a Python function to
perform an A/B test analysis, including calculating the t-statistic,
degrees of freedom, and p-value.
Use the following data:
layouts_a_click=[28,32,33,29,31,34,30,35,36,37]
layouts_b_clicks=[40,41,38,42,39,44,43,41,45,47]

In [20]: import scipy.stats as stats

def ab_test(layouts_a_click, layouts_b_clicks):
    # Student's two-sample t-test (equal variances assumed)
    t_stat, p_value = stats.ttest_ind(layouts_a_click, layouts_b_clicks, equal_var=True)
    dof = len(layouts_a_click) + len(layouts_b_clicks) - 2
    return t_stat, dof, p_value

layouts_a_click = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layouts_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]
mean_a = np.mean(layouts_a_click)
mean_b = np.mean(layouts_b_clicks)
t_stat, dof, p_value = ab_test(layouts_a_click, layouts_b_clicks)

if p_value < 0.05:
    print("Significant difference between layouts (reject H0)")
else:
    print("No significant difference between layouts (fail to reject H0)")

print(f"t_stats:{t_stat:.2f}")
print("dof:", dof)
print(f"p_value: {p_value:.2e}")
print(f"Mean_A: {mean_a}")
print(f"Mean_B: {mean_b}")

if mean_a > mean_b:
    print("Layout A performs better than B")
else:
    print("Layout B performs better than A")

Significant difference between layouts (reject H0)
t_stats:-7.30
dof: 18
p_value: 8.83e-07
Mean_A: 32.5
Mean_B: 42.0
Layout B performs better than A
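
To see what `ttest_ind` with `equal_var=True` computes, the pooled-variance t-statistic can be worked out by hand. This is a minimal sketch for illustration; the names `pooled_var`, `std_err`, `t_manual`, and `p_manual` are introduced here and do not appear in the original solution.

```python
import numpy as np
from scipy import stats

layouts_a_click = np.array([28, 32, 33, 29, 31, 34, 30, 35, 36, 37])
layouts_b_clicks = np.array([40, 41, 38, 42, 39, 44, 43, 41, 45, 47])

n_a, n_b = len(layouts_a_click), len(layouts_b_clicks)
var_a = np.var(layouts_a_click, ddof=1)   # unbiased sample variance of A
var_b = np.var(layouts_b_clicks, ddof=1)  # unbiased sample variance of B

# Pooled variance weights each sample variance by its degrees of freedom
dof = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = (layouts_a_click.mean() - layouts_b_clicks.mean()) / std_err
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-tailed p-value

print(f"t_manual: {t_manual:.2f}, dof: {dof}, p_manual: {p_manual:.2e}")
```

The manual values should agree with the `ttest_ind` result, which is a quick way to confirm that the reported degrees of freedom (n_a + n_b - 2) belong to the same test.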

16. A pharmaceutical company wants to determine if a new drug is
more effective than an existing drug in reducing cholesterol
levels. Create a program to analyze the clinical trial data and
calculate the t-statistic and p-value for the treatment effect. Use the
following data of cholesterol levels:
existing_drug_level=[180,182,175,185,178,172,184,179,183]
new_drug_levels=[170,172,165,168,175,173,170,178,172,176]

In [31]: import scipy.stats as stats

def analyze_clinical_trial(existing_drug_levels, new_drug_levels):
    # Student's two-sample t-test (equal variances assumed)
    t_stat, p_value = stats.ttest_ind(existing_drug_levels, new_drug_levels, equal_var=True)
    dof_student = len(existing_drug_levels) + len(new_drug_levels) - 2
    return t_stat, p_value, dof_student

# Note: this cell uses 10 values for the existing drug (including 176),
# one more than the 9 values listed in the problem statement
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]
alpha = 0.05

Mean_existing = np.mean(existing_drug_levels)
Mean_new = np.mean(new_drug_levels)

t_stat, p_value, dof_student = analyze_clinical_trial(existing_drug_levels, new_drug_levels)

# Two-tailed critical value, computed once dof_student is known
t_critical = stats.t.ppf(1 - alpha/2, dof_student)

if abs(t_stat) > t_critical:
    print("Significant difference between the two drugs (Reject H0)")
else:
    print("No significant difference between the two drugs (Fail to Reject H0)")

print(f"t_statistics: {t_stat:.2f}")
print(f"t_critical: {t_critical:.2f}")
print(f"p_value:{p_value:.2e}")
print("Degree of freedom", dof_student)

# Lower cholesterol is the better outcome
if Mean_existing > Mean_new:
    print("New Drug performs better than Existing Drug (lower mean cholesterol)")
else:
    print("Existing Drug performs better than New Drug")

Significant difference between the two drugs (Reject H0)
t_statistics: 4.14
t_critical: 2.10
p_value:6.14e-04
Degree of freedom 18
New Drug performs better than Existing Drug (lower mean cholesterol)
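
The question asks whether the new drug is more effective, that is, whether it leaves lower cholesterol levels, which is a one-sided hypothesis. A minimal sketch of the one-sided version follows, assuming SciPy 1.6 or later (where `ttest_ind` accepts `alternative`) and using the ten-value lists from the cell above; `p_one_sided` is a name introduced here.

```python
from scipy import stats

existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

# H0: mean(existing) <= mean(new)
# H1: mean(existing) > mean(new), i.e. the new drug gives lower cholesterol
t_stat, p_one_sided = stats.ttest_ind(
    existing_drug_levels, new_drug_levels, equal_var=True, alternative="greater"
)

print(f"t_statistics: {t_stat:.2f}")
print(f"one-sided p_value: {p_one_sided:.2e}")
```

Because the t-statistic is positive, the one-sided p-value is half the two-sided one, so the conclusion at the 0.05 level does not change.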

17. A school district introduces an educational intervention program
to improve math scores. Write a Python function to analyze pre- and
post-intervention test scores, calculating the t-statistic and p-value
to determine if the intervention has a significant impact. Use the
following data of test scores:
pre_intervention_scores=[80,85,90,75,88,82,92,78,85,87]
post_intervention_score=[90,92,88,92,95,91,96,93,89,93]

In [45]: import scipy.stats as stats

def analyze_intervention(pre_intervention_scores, post_intervention_scores):
    # Paired t-test: the same students are measured before and after
    t_stat, p_value = stats.ttest_rel(pre_intervention_scores, post_intervention_scores)
    dof = len(pre_intervention_scores) - 1  # degrees of freedom for paired t-test
    return t_stat, p_value, dof

pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
alpha = 0.05

Mean_pre = np.mean(pre_intervention_scores)
Mean_post = np.mean(post_intervention_scores)

t_stat, p_value, dof = analyze_intervention(pre_intervention_scores, post_intervention_scores)

# Two-tailed critical value, computed with this test's own dof (9)
t_critical = stats.t.ppf(1 - alpha/2, dof)

if abs(t_stat) > t_critical:
    print("Significant difference between the Pre and Post Intervention (Reject H0)")
else:
    print("No significant difference between the Pre and Post Intervention (Fail to Reject H0)")

print(f"t_statistics: {t_stat:.2f}")
print(f"t_critical: {t_critical:.2f}")
print(f"p_value:{p_value:.2e}")
print("Degrees of freedom", dof)

if Mean_pre > Mean_post:
    print("Pre Intervention performs better than Post Intervention")
else:
    print("Post Intervention performs better than Pre Intervention")

Significant difference between the Pre and Post Intervention (Reject H0)
t_statistics: -4.43
t_critical: 2.26
p_value:1.65e-03
Degrees of freedom 9
Post Intervention performs better than Pre Intervention
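
A paired t-test is a one-sample t-test on the per-student differences, which the following sketch makes explicit. The short names `pre`, `post`, `diff`, `t_manual`, and `p_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

pre = np.array([80, 85, 90, 75, 88, 82, 92, 78, 85, 87])
post = np.array([90, 92, 88, 92, 95, 91, 96, 93, 89, 93])

# Work on the paired differences (pre minus post, matching the argument order above)
diff = pre - post
n = len(diff)

# t = mean difference / standard error of the mean difference
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)  # two-tailed, n - 1 dof

print(f"mean difference: {diff.mean():.1f}")
print(f"t_manual: {t_manual:.2f}, p_manual: {p_manual:.2e}")
```

The negative sign of the statistic comes only from the subtraction order: post-intervention scores are higher on average.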

18. An HR department wants to investigate if there's a gender-based
salary gap within the company. Develop a program to analyze salary
data, calculate the t-statistic, and determine if there's a statistically
significant difference between the average salaries of male and
female employees. Use the below code to generate synthetic data:
#Generate synthetic salary data for male and female employees
np.random.seed(0) #For reproducibility
male_salaries=np.random.normal(loc=50000,scale=10000,size=20)
female_salaries=np.random.normal(loc=55000,scale=9000,size=20)

In [43]: import scipy.stats as stats

# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility
male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)
alpha = 0.05

def analyze_salary_gap(male_salaries, female_salaries):
    # Student's two-sample t-test (equal variances assumed)
    t_stat, p_value = stats.ttest_ind(male_salaries, female_salaries, equal_var=True)
    dof = len(male_salaries) + len(female_salaries) - 2
    return t_stat, p_value, dof

t_stat, p_value, dof = analyze_salary_gap(male_salaries, female_salaries)

if p_value < alpha:
    print("Significant salary gap found (Reject H0)")
else:
    print("No significant salary gap (Fail to Reject H0)")

print(f"t_statistics: {t_stat:.2f}")
print(f"p_value:{p_value:.2e}")
print("Degrees of freedom", dof)

No significant salary gap (Fail to Reject H0)
t_statistics: 0.06
p_value:9.52e-01
Degrees of freedom 38
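
The two groups are generated with different scales (10000 and 9000), so Welch's t-test, which does not assume equal variances, is a reasonable cross-check. This is a sketch rather than part of the original solution; `dof_welch` is computed by hand with the Welch-Satterthwaite formula, and the names `v_m`, `v_f`, `t_welch`, and `p_welch` are introduced here.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # same seed as the cell above
male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

# Welch's t-test: equal_var=False drops the pooled-variance assumption
t_welch, p_welch = stats.ttest_ind(male_salaries, female_salaries, equal_var=False)

# Welch-Satterthwaite approximation of the degrees of freedom
n_m, n_f = len(male_salaries), len(female_salaries)
v_m = male_salaries.var(ddof=1) / n_m
v_f = female_salaries.var(ddof=1) / n_f
dof_welch = (v_m + v_f) ** 2 / (v_m ** 2 / (n_m - 1) + v_f ** 2 / (n_f - 1))

print(f"t_welch: {t_welch:.2f}")
print(f"p_welch: {p_welch:.2e}")
print(f"dof_welch: {dof_welch:.1f}")
```

With equal sample sizes the Welch and pooled t-statistics are identical; only the degrees of freedom, and therefore the p-value, differ slightly.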

19. A manufacturer produces two different versions of a product and
wants to compare their quality scores. Create a Python function to
analyze quality assessment data, calculate the t-statistic, and decide
whether there's a significant difference in quality between the two
versions. Use the following data:
version1_scores=[85,88,82,89,87,84,90,88,85,86,91,83,87,84,89,86,84,88,85,86,89,90,87,88,85]
version2_scores=[80,78,83,81,79,82,76,80,78,81,77,82,80,79,82,79,80,81,79,82,79,78,80,81,82]

In [51]: import scipy.stats as stats

version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88,
                   85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81,
                   79, 82, 79, 78, 80, 81, 82]
alpha = 0.05

def compare_product_versions(version1_scores, version2_scores):
    # Student's two-sample t-test (equal variances assumed by default)
    t_stat, p_value = stats.ttest_ind(version1_scores, version2_scores)
    dof = len(version1_scores) + len(version2_scores) - 2
    return t_stat, p_value, dof

Mean_V1 = np.mean(version1_scores)
Mean_V2 = np.mean(version2_scores)
t_stat, p_value, dof = compare_product_versions(version1_scores, version2_scores)

if p_value < alpha:
    print("Significant difference in product quality (Reject H0)")
else:
    print("No significant difference in product quality (Fail to Reject H0)")

print(f"t_statistics:{t_stat:.2f}")
print(f"p_value: {p_value:.2e}")
print("Degree of freedom", dof)

if Mean_V1 > Mean_V2:
    print("Version 1 performs better than Version 2")
else:
    print("Version 2 performs better than Version 1")

Significant difference in product quality (Reject H0)
t_statistics:11.33
p_value: 3.68e-15
Degree of freedom 48
Version 1 performs better than Version 2

20. A restaurant chain collects customer satisfaction scores for two
different branches. Write a program to analyze the scores, calculate
the t-statistic, and determine if there's a statistically significant
difference in customer satisfaction between the branches. Use the
below data of scores:
branch_a_scores=[4,5,3,4,5,4,5,3,4,4,5,4,4,3,4,5,5,4,3,4,5,4,3,5,4,4,5,3,4,5,4]
branch_b_scores=[3,4,2,3,4,3,4,2,3,3,4,3,3,2,3,4,4,3,2,3,4,3,2,4,3,3,4,2,3,4,3]

In [52]: import scipy.stats as stats

branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5,
                   4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4,
                   3, 3, 4, 2, 3, 4, 3]

def analyze_customer_satisfaction(branch_a_scores, branch_b_scores):
    # Student's two-sample t-test (equal variances assumed by default)
    t_stat, p_value = stats.ttest_ind(branch_a_scores, branch_b_scores)
    dof = len(branch_a_scores) + len(branch_b_scores) - 2
    return t_stat, p_value, dof

Mean_a = np.mean(branch_a_scores)
Mean_b = np.mean(branch_b_scores)
t_stat, p_value, dof = analyze_customer_satisfaction(branch_a_scores, branch_b_scores)

if p_value < 0.05:
    print("There is a statistically significant difference in customer satisfaction between the branches.")
else:
    print("There is no statistically significant difference in customer satisfaction between the branches.")

print(f"T-statistics {t_stat:.2f}")
print(f"P_value {p_value:.2f}")
print("Degree of freedom: ", dof)

if Mean_a > Mean_b:
    print("Branch a performs better than Branch b")
else:
    print("Branch b performs better than Branch a")

There is a statistically significant difference in customer satisfaction between the branches.
T-statistics 5.48
P_value 0.00
Degree of freedom:  60
Branch a performs better than Branch b

21. A political analyst wants to determine if there is a significant
association between age groups and voter preferences (Candidate A
or Candidate B). They collect data from a sample of 500 voters and
classify them into different age groups and candidate
preferences. Perform a Chi-Square test to determine if there is a
significant association between age groups and voter preferences.
Use the below code to generate data:
np.random.seed(0)
age_groups=np.random.choice(['18-30','31-50','51+','51+'],size=30)
voter_prferences=np.random.choice(["Candidate A","Candidate B"],size=30)

In [73]: from scipy.stats import chi2_contingency

np.random.seed(0)
age_groups = np.random.choice(['18-30', '31-50', '51+'], size=30)
voter_preferences = np.random.choice(["Candidate A", "Candidate B"], size=30)

# Create a contingency table (rows: age groups, columns: candidates)
contingency_table = np.zeros((3, 2))
for i in range(len(age_groups)):
    if age_groups[i] == '18-30':
        row = 0
    elif age_groups[i] == '31-50':
        row = 1
    else:
        row = 2
    if voter_preferences[i] == 'Candidate A':
        col = 0
    else:
        col = 1
    contingency_table[row, col] += 1

# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic : {chi2:.4f}")
print(f"P-value : {p_value:.6f}")
print(f"Degrees of Freedom : {dof}")
print(f"Expected Freq :\n {expected}")

if p_value < 0.05:
    print("There is a significant association between age groups and voter preferences.")
else:
    print("There is no significant association between age groups and voter preferences.")

Chi-Square Statistic : 1.4402
P-value : 0.486712
Degrees of Freedom : 2
Expected Freq :
 [[5.6        6.4       ]
 [5.13333333 5.86666667]
 [3.26666667 3.73333333]]
There is no significant association between age groups and voter preferences.

In [75]: #Another Way
from scipy.stats import chi2_contingency

# Generate synthetic data
np.random.seed(0)
age_groups = np.random.choice(['18-30','31-50','51+'], size=30)
voter_preferences = np.random.choice(["Candidate A","Candidate B"], size=30)

# Create DataFrame
df = pd.DataFrame({"Age Group": age_groups, "Voter Preference": voter_preferences})

# Create contingency table
contingency_table = pd.crosstab(df["Age Group"], df["Voter Preference"])

print(f"Contingency Table :\n{contingency_table}")

# Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic : {chi2:.4f}")
print(f"P-value : {p:.6f}")
print(f"Degrees of Freedom : {dof}")
print(f"Expected Freq: \n{expected}")

# Decision
alpha = 0.05
if p < alpha:
    print("\nSignificant association between Age Group and Voter Preference (Reject H0)")
else:
    print("\nNo significant association (Fail to Reject H0)")

Contingency Table :
Voter Preference  Candidate A  Candidate B
Age Group
18-30                       4            8
31-50                       6            5
51+                         4            3
Chi-Square Statistic : 1.4402
P-value : 0.486712
Degrees of Freedom : 2
Expected Freq:
[[5.6        6.4       ]
 [5.13333333 5.86666667]
 [3.26666667 3.73333333]]

No significant association (Fail to Reject H0)
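
The expected frequencies and the statistic reported by `chi2_contingency` can be reproduced directly from the observed counts in the contingency table above. The sketch below hard-codes those counts; `observed`, `row_totals`, `col_totals`, and `chi2_stat` are names introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from the table above (rows: 18-30, 31-50, 51+; columns: A, B)
observed = np.array([[4, 8], [6, 5], [4, 3]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Under independence: expected = row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)  # upper-tail probability

print(f"Chi-Square Statistic : {chi2_stat:.4f}")
print(f"P-value : {p_value:.6f}")
print(f"Degrees of Freedom : {dof}")
```

Two of the expected counts fall below 5 (about 3.27 and 3.73), so with only 30 observations the chi-square approximation is rough and the p-value should be read with some caution.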

22. A company conducted a customer satisfaction survey to
determine if there is a significant relationship between product
satisfaction levels (Satisfied, Neutral, Dissatisfied) and the region
where customers are located (East, West, North, South). The survey
data is summarized in a contingency table. Conduct a Chi-Square
test to determine if there is a significant relationship between
product satisfaction levels and customer regions.
Sample data:
#Sample data: Product satisfaction levels(rows) vs. Customer regions(columns)
data=np.array([[50,30,40,20],[30,40,30,50],[20,30,40,30]])

In [89]: from scipy.stats import chi2_contingency

data = np.array([[50, 30, 40, 20], [30, 40, 30, 50], [20, 30, 40, 30]])

# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(data)

if p_value < 0.05:
    print("There is a significant relationship between product satisfaction levels and customer regions.")
else:
    print("There is no significant relationship between product satisfaction levels and customer regions.")

print(f"Chi-Square Statistic: {chi2:.2f}")
print("Degrees of Freedom:", dof)
print(f"P-Value: {p_value:.4f}")
print(f"Expected Frequencies: \n{expected}")

There is a significant relationship between product satisfaction levels and customer regions.
Chi-Square Statistic: 27.78
Degrees of Freedom: 6
P-Value: 0.0001
Expected Frequencies:
[[34.14634146 34.14634146 37.56097561 34.14634146]
 [36.58536585 36.58536585 40.24390244 36.58536585]
 [29.26829268 29.26829268 32.19512195 29.26829268]]
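
The same decision can be reached by comparing the statistic with a critical value instead of comparing the p-value with alpha. A minimal sketch, where `critical_value` and `chi2_stat` are names introduced here:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

data = np.array([[50, 30, 40, 20], [30, 40, 30, 50], [20, 30, 40, 30]])
chi2_stat, p_value, dof, expected = chi2_contingency(data)

alpha = 0.05
# Upper-tail critical value of the chi-square distribution with `dof` degrees of freedom
critical_value = chi2.ppf(1 - alpha, dof)

print(f"Chi-Square Statistic: {chi2_stat:.2f}")
print(f"Critical Value (alpha={alpha}, dof={dof}): {critical_value:.2f}")

if chi2_stat > critical_value:
    print("Reject H0: satisfaction level and region are associated.")
else:
    print("Fail to reject H0: no evidence of association.")
```

Both rules are equivalent: the statistic exceeds the critical value exactly when the p-value is below alpha.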

23. A company implemented an employee training program to
improve job performance (Effective, Neutral, Ineffective). After the
training, they collected data from a sample of employees and
classified them based on their job performance before and after the
training. Perform a Chi-Square test to determine if there is a
significant difference between job performance levels before and
after the training. Sample data:
#Sample data: Job performance levels before(rows) and after(columns) training
data=np.array([[50,30,20],[30,40,30],[20,30,40]])

In [90]: from scipy.stats import chi2_contingency

data = np.array([[50, 30, 20], [30, 40, 30], [20, 30, 40]])

# Perform Chi-Square test; keep dof and expected from this table
# rather than reusing values left over from an earlier cell
chi2, p_value, dof, expected = chi2_contingency(data)

if p_value < 0.05:
    print("There is a significant difference between job performance levels before and after the training.")
else:
    print("There is no significant difference between job performance levels before and after the training.")

print("Chi-Square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-Value:", p_value)
print("Expected Frequencies:\n", expected)

There is a significant difference between job performance levels before and after the training.
Chi-Square Statistic: 22.161728395061726
Degrees of Freedom: 4
P-Value: 0.00018609719479882554
Expected Frequencies:
 [[34.48275862 34.48275862 31.03448276]
 [34.48275862 34.48275862 31.03448276]
 [31.03448276 31.03448276 27.93103448]]

24. A company produces three different versions of a
product: Standard, Premium, and Deluxe. The company wants to
determine if there is a significant difference in customer satisfaction
scores among the three product versions. They conducted a survey
and collected customer satisfaction scores for each version from a
random sample of customers. Perform an ANOVA test to determine
if there is a significant difference. Use the following data:
#Sample data: Customer satisfaction scores for each product version
standard_scores=[80,85,90,78,88,82,92,78,85,87]
premium_scores=[90,92,88,92,95,91,96,93,89,93]
deluxe_scores=[95,98,92,97,96,94,98,97,92,99]

In [94]: from scipy.stats import f_oneway

standard_scores = [80, 85, 90, 78, 88, 82, 92, 78, 85, 87]
premium_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
deluxe_scores = [95, 98, 92, 97, 96, 94, 98, 97, 92, 99]

# Perform one-way ANOVA test
f_stat, p_value = f_oneway(standard_scores, premium_scores, deluxe_scores)

if p_value < 0.05:
    print("There is a significant difference in customer satisfaction scores among the three product versions.")
else:
    print("There is no significant difference in customer satisfaction scores among the three product versions.")

print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_value:.2e}")

There is a significant difference in customer satisfaction scores among the three product versions.
F-Statistic: 27.04
P-Value: 3.58e-07
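
The F-statistic is the ratio of between-group to within-group mean squares, and it can be computed by hand to see where the value above comes from. This is an illustrative sketch; `ss_between`, `ss_within`, `f_manual`, and `p_manual` are names introduced here.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([80, 85, 90, 78, 88, 82, 92, 78, 85, 87]),  # Standard
    np.array([90, 92, 88, 92, 95, 91, 96, 93, 89, 93]),  # Premium
    np.array([95, 98, 92, 97, 96, 94, 98, 97, 92, 99]),  # Deluxe
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, n_total = len(groups), len(all_scores)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f"F-Statistic: {f_manual:.2f}")
print(f"P-Value: {p_manual:.2e}")
```

A significant F only says that at least one group mean differs; it does not say which pairs of versions differ.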
