Hands on With Probability and Statistical
1. Probability Hands-On
import random
Let’s use a dataset of heights of two groups (Group A and Group B) and perform various statistical
tests.
import numpy as np
np.random.seed(42)
A t-test determines if there’s a significant difference between the means of two groups.
from scipy import stats

# Perform an independent two-sample t-test on the two groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")
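As a minimal end-to-end sketch of the t-test described above, the following generates two height samples and compares their means. The sample sizes, means, and standard deviations are assumptions chosen only for illustration; they do not come from a real dataset.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # fixed seed so the run is reproducible

# Hypothetical height samples in cm; the means, spread, and sizes are assumptions
group_a = np.random.normal(loc=170, scale=10, size=100)
group_b = np.random.normal(loc=175, scale=10, size=100)

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"T-Statistic: {t_stat:.3f}")
print(f"P-Value: {p_value:.6f}")
```

A negative t-statistic here means the mean of Group A is lower than the mean of Group B. If equal variances cannot be assumed, passing equal_var=False to ttest_ind runs Welch's t-test instead.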
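import numpy as np
from scipy import stats

# Conversion data
n1, x1 = 200, 50  # Group A: 200 samples, 50 conversions
n2, x2 = 200, 70  # Group B: 200 samples, 70 conversions

# Calculate proportions
p1 = x1 / n1
p2 = x2 / n2

# Pooled proportion and standard error under the null hypothesis p1 == p2
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Z-statistic and two-sided p-value (a two-sided test is assumed here)
z_stat = (p1 - p2) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"Z-Statistic: {z_stat}")
print(f"P-Value: {p_value}")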
The Chi-Square Test checks for independence between two categorical variables.
# Contingency table for survey data (Age Group vs. Preferred Product)
print(f"P-Value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
if p_value < 0.05:  # a 0.05 significance level is assumed
    print("Reject the null hypothesis: There is a significant association between the variables.")
else:
    print("Fail to reject the null hypothesis: No significant association between the variables.")
Confidence Intervals
Confidence intervals estimate a range for population parameters like the mean.
mean_a = np.mean(group_a)
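A minimal sketch of a 95% confidence interval for a group mean, assuming a t-distribution is appropriate. The height data is generated for illustration only; the 95% level and the names sem_a, ci_low, and ci_high are likewise assumptions.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# Hypothetical height sample in cm (values are an assumption for illustration)
group_a = np.random.normal(loc=170, scale=10, size=100)

mean_a = np.mean(group_a)
sem_a = stats.sem(group_a)  # standard error of the mean (uses ddof=1)

# 95% confidence interval from the t-distribution with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, len(group_a) - 1, loc=mean_a, scale=sem_a)
print(f"Mean: {mean_a:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is symmetric around the sample mean and narrows as the sample size grows, because the standard error shrinks with the square root of n.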
Let’s use the statistical tests to analyze A/B test results for a webpage:
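print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:  # a 0.05 significance level and this message wording are assumed
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")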
Histograms
plt.legend()
plt.show()
Boxplots
import pandas as pd
plt.show()
The probability of an event E is given by:
P(E) = Number of favorable outcomes / Total number of outcomes
The probability of two independent events A and B both occurring is:
P(A ∩ B) = P(A) × P(B)
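prob_heads = 0.5  # probability of heads on a fair coin
prob_six = 1 / 6  # probability of rolling a six on a fair die
# Combined probability of both independent events
prob_combined = prob_heads * prob_six
print(f"P(heads and six): {prob_combined:.4f}")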
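The product rule for independent events can also be checked empirically. The sketch below simulates a coin flip and a die roll many times and compares the observed frequency with the theoretical value; the trial count and variable names are assumptions made for illustration.

```python
import random

random.seed(42)  # fixed seed for reproducibility

num_trials = 100_000
both = 0
for _ in range(num_trials):
    coin_is_heads = random.random() < 0.5   # fair coin
    die_is_six = random.randint(1, 6) == 6  # fair six-sided die
    if coin_is_heads and die_is_six:
        both += 1

estimate = both / num_trials
print(f"Simulated P(heads and six): {estimate:.4f}")
print(f"Theoretical value:          {0.5 * (1 / 6):.4f}")
```

With this many trials the simulated frequency should land close to 1/12, though the exact figure depends on the seed.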
The conditional probability of A given B is:
P(A | B) = P(A ∩ B) / P(B)
Example: Cards in a Deck
What is the probability of drawing a king, given that it is a face card?
Total cards: 52
Face cards: 12
Kings: 4
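# Probabilities
prob_face = 12 / 52          # P(face card)
prob_king_and_face = 4 / 52  # P(king and face card); every king is a face card
# Conditional probability: P(king | face) = P(king and face) / P(face)
prob_king_given_face = prob_king_and_face / prob_face
print(f"P(king | face card): {prob_king_given_face:.4f}")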
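The same conditional probability can be obtained by counting cards directly rather than dividing probabilities. This sketch builds a standard 52-card deck and restricts it to face cards; the rank and suit labels are illustrative choices.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# Conditioning on "face card" shrinks the sample space to 12 cards
face_cards = [card for card in deck if card[0] in {"J", "Q", "K"}]
kings_among_faces = [card for card in face_cards if card[0] == "K"]

prob_king_given_face = Fraction(len(kings_among_faces), len(face_cards))
print(prob_king_given_face)  # 1/3
```

Counting within the reduced sample space gives 4 kings out of 12 face cards, which matches the formula result.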
Probability Distributions
import numpy as np
# Calculate probabilities
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
# Calculate probabilities
mean = np.mean(data)
std_dev = np.std(data)
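One way to turn an estimated mean and standard deviation into probabilities is to evaluate the normal cumulative distribution function. The sketch below assumes the data is roughly normal; the generated sample and the thresholds 40 and 60 are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Hypothetical sample; the true mean, spread, and size are assumptions
data = np.random.normal(loc=50, scale=10, size=1000)

mean = np.mean(data)
std_dev = np.std(data)

# P(X <= 60) under a normal distribution with the estimated parameters
prob_below_60 = norm.cdf(60, loc=mean, scale=std_dev)

# P(40 <= X <= 60) as a difference of two CDF values
prob_between = prob_below_60 - norm.cdf(40, loc=mean, scale=std_dev)

print(f"P(X <= 60): {prob_below_60:.4f}")
print(f"P(40 <= X <= 60): {prob_between:.4f}")
```

Note that np.std uses the population formula (ddof=0) by default; passing ddof=1 gives the sample standard deviation, which differs only slightly at this sample size.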
prob_6_heads = binom.pmf(6, n, p)
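A self-contained sketch of the binomial calculation, assuming n = 10 flips of a fair coin (p = 0.5). These parameter values are assumptions, since the surrounding text does not state them.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5  # assumed: 10 flips of a fair coin

prob_6_heads = binom.pmf(6, n, p)    # P(exactly 6 heads)
prob_at_most_6 = binom.cdf(6, n, p)  # P(6 or fewer heads)

# Same value computed from the binomial formula C(n, k) * p^k * (1 - p)^(n - k)
manual = comb(n, 6) * p**6 * (1 - p) ** (n - 6)

print(f"P(exactly 6 heads): {prob_6_heads:.6f}")
print(f"P(at most 6 heads): {prob_at_most_6:.6f}")
print(f"Formula check:      {manual:.6f}")
```

The pmf gives the probability of one exact count, while the cdf accumulates the probabilities of all counts up to and including that value.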
Monte Carlo simulations are used to estimate probabilities via random sampling.
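num_points = 10000
inside_circle = 0
for _ in range(num_points):
    # Draw a random point in the square [-1, 1] x [-1, 1]
    x, y = np.random.uniform(-1, 1, size=2)
    if x**2 + y**2 <= 1:  # the point lies inside the unit circle
        inside_circle += 1
# The circle-to-square area ratio is pi / 4, so this estimates pi (assumed goal)
pi_estimate = 4 * inside_circle / num_points
print(f"Estimated value of pi: {pi_estimate}")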
Hypothesis testing is fundamental in data science for making inferences about populations based on
sample data. Here’s how you can perform hypothesis tests using SciPy and statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
3. T-Tests with SciPy
Independent T-Test
Used to compare the means of two independent groups.
np.random.seed(42)
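print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:  # a 0.05 significance level and this message wording are assumed
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")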
Paired T-Test
Used when comparing two related samples, such as pre-test and post-test scores.
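print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:  # a 0.05 significance level and this message wording are assumed
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")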
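A complete paired t-test might look like the sketch below. The pre-test and post-test scores are made-up values for eight hypothetical participants, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for the same eight participants (illustrative values only)
pre_test = np.array([72, 65, 80, 58, 90, 77, 68, 84])
post_test = np.array([75, 70, 82, 63, 91, 80, 74, 85])

# Paired t-test: tests whether the mean of the per-participant differences is zero
t_stat, p_value = stats.ttest_rel(post_test, pre_test)
print(f"T-Statistic: {t_stat:.3f}")
print(f"P-Value: {p_value:.4f}")

if p_value < 0.05:  # assumed significance level
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")
```

Because each participant serves as their own control, the paired test works on the differences and is usually more sensitive than an independent t-test on the same numbers.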
4. Z-Test with Statsmodels
One-Sample Z-Test
Used for testing whether a sample mean is significantly different from a population mean.
# Population mean
population_mean = 105
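print(f"Z-Statistic: {z_stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:  # a 0.05 significance level and the rejection message are assumed
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")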
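As a hedged sketch of the one-sample z-test, the code below uses the ztest function from statsmodels.stats.weightstats with the population mean of 105 given above. The sample itself is generated for illustration; its mean, spread, and size are assumptions.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

np.random.seed(42)

# Hypothetical sample; its mean, spread, and size are assumptions
sample = np.random.normal(loc=100, scale=15, size=50)

# Population mean under the null hypothesis
population_mean = 105

# One-sample z-test: H0 says the sample comes from a population with this mean
z_stat, p_value = ztest(sample, value=population_mean)
print(f"Z-Statistic: {z_stat:.3f}")
print(f"P-Value: {p_value:.4f}")

if p_value < 0.05:  # assumed significance level
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")
```

A z-test relies on the normal approximation, so it is generally reserved for larger samples; for small samples a one-sample t-test is the more cautious choice.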
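print(f"Z-Statistic: {z_stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:  # a 0.05 significance level and this message wording are assumed
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")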
5. Chi-Square Test with SciPy
The Chi-Square Test is used for testing independence between categorical variables.
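from scipy import stats
# Contingency table
data = np.array([[50, 30], [20, 80]])  # Rows: Age groups, Columns: Product preferences
chi2_stat, p_value, dof, expected = stats.chi2_contingency(data)
print(f"P-Value: {p_value}")
print("Expected Frequencies:")
print(expected)
if p_value < 0.05:  # a 0.05 significance level is assumed
    print("Reject the null hypothesis: There is a significant association between the variables.")
else:
    print("Fail to reject the null hypothesis: No significant association between the variables.")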