Conversation Normalization

Standard Error of the Mean (SEM) is preferred over Standard Deviation (SD) for estimating population means from samples, as SEM reflects the accuracy of the sample mean. The normal distribution is crucial in data science due to its prevalence in real-world data and its role in many statistical tests and machine learning models. Normalization techniques, such as Min-Max Scaling and Z-Score Standardization, are essential for improving model performance and handling data distributions effectively.

Why do we need to use the standard error of the mean instead of the standard deviation?

The Standard Error of the Mean (SEM) is used instead of the Standard Deviation (SD) when we are estimating the mean of a population from a sample.

- SD measures the spread of individual data points in a sample.

- SEM measures how accurately the sample mean estimates the true population mean.

Formula:

- SD = sqrt(sum((X - mean)^2) / (N - 1)), where N is the sample size

- SEM = SD / sqrt(N)

Example:

Heights of students (cm): [160, 165, 170, 175, 180]

SD approximately 7.91 cm, SEM approximately 3.54 cm
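A minimal Python sketch reproducing these numbers with only the standard library (the variable names are illustrative, not from the original notes):

    import math
    import statistics

    heights = [160, 165, 170, 175, 180]  # student heights in cm

    # Sample standard deviation: statistics.stdev divides by N - 1
    sd = statistics.stdev(heights)

    # Standard error of the mean: SEM = SD / sqrt(N)
    sem = sd / math.sqrt(len(heights))

    print(f"SD  = {sd:.2f} cm")   # SD  = 7.91 cm
    print(f"SEM = {sem:.2f} cm")  # SEM = 3.54 cm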

Why is the normal distribution important in data science?

1. Many real-world datasets (e.g., heights, IQ scores) are approximately normally distributed.

2. Many statistical tests assume normality (t-tests, ANOVA, regression).

3. The Central Limit Theorem ensures that sample means approach a normal distribution as the sample size grows, regardless of the underlying distribution.

4. Normalization methods (Z-score scaling) improve model performance.

5. Outlier detection: points more than 3 standard deviations from the mean are commonly flagged (see the sketch after this list).

6. Probability estimates and decision thresholds often rely on the normal distribution.

7. Many ML models assume normality (Gaussian Naïve Bayes; linear regression assumes normally distributed residuals).
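A minimal sketch of the 3-standard-deviation rule from point 5. The synthetic data, the seed, and the injected outlier are illustrative assumptions:

    import random
    import statistics

    random.seed(0)
    # 200 roughly normal points around 170, plus one injected extreme value
    data = [random.gauss(170, 5) for _ in range(200)] + [320]

    mean = statistics.mean(data)
    sd = statistics.stdev(data)

    # Flag points more than 3 standard deviations from the mean
    outliers = [x for x in data if abs(x - mean) > 3 * sd]
    print(outliers)  # [320]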

How to normalize data?


1. Min-Max Scaling: Scales values between 0 and 1 (methods 1-3 are sketched in code after this list).

Formula: X_norm = (X - X_min) / (X_max - X_min)

2. Z-Score Standardization: Centers data at mean 0 with standard deviation 1.

Formula: X_std = (X - mean) / std

3. Robust Scaling: Uses the median & IQR (robust to outliers).

Formula: X_scaled = (X - median) / IQR

4. Log Transformation: Compresses the range of right-skewed data.

5. Power Transformation (Box-Cox, Yeo-Johnson): Normalizes skewed data.

6. Decimal Scaling: Moves the decimal point based on the maximum absolute value.
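A minimal sketch of methods 1-3 using scikit-learn's built-in scalers (assumes scikit-learn is installed; the data column is an illustrative example):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    # One feature column; 250 is a deliberate outlier
    X = np.array([[160.0], [165.0], [170.0], [175.0], [180.0], [250.0]])

    print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
    print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1 (population std)
    print(RobustScaler().fit_transform(X).ravel())    # centered on median, scaled by IQR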

Best Methods:

- Min-Max: Good for bounded data (0-1).

- Z-Score: Best when data is normally distributed.

- Robust: Best when data has outliers.

- Log & Power: Useful for skewed distributions (sketched below).
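A minimal sketch of the log and power transforms on right-skewed data (the sample values are illustrative; scikit-learn's PowerTransformer defaults to the Yeo-Johnson method):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    # Right-skewed sample: one large value dominates the tail
    X = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [100.0]])

    X_log = np.log1p(X)  # log(1 + X): safe for zeros, compresses the tail
    X_pow = PowerTransformer().fit_transform(X)  # Yeo-Johnson; Box-Cox requires X > 0

    print(X_log.ravel())
    print(X_pow.ravel())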

Conclusion:

- Normalization improves consistency across features in ML models.

- Choose a method based on data distribution and outliers.
