Data Science and Data Analytics
Exploratory Data Analysis (EDA)
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a process of analyzing and summarizing datasets
to uncover patterns, relationships, and insights, typically before applying more
advanced modeling techniques.
EDA is a critical first step in the data analysis workflow, helping analysts and data
scientists understand the data's structure, quality, and features.
Objectives of EDA
Understand the Data:
➢ Identify the types of variables (categorical, numerical).
➢ Assess the distribution of data and its overall structure.
Detect Data Issues:
➢ Find missing values, duplicates, or outliers.
➢ Evaluate inconsistencies or errors in the data.
Uncover Patterns:
➢ Reveal trends, clusters, or correlations.
➢ Identify relationships between variables.
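The first two objectives can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the `records` dataset, the use of `None` to mark a missing value, and the helper names `variable_types`, `missing_counts`, and `duplicate_count` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy dataset: each record is a dict; None marks a missing value.
records = [
    {"age": 25, "city": "Ankara"},
    {"age": 31, "city": "Izmir"},
    {"age": None, "city": "Ankara"},
    {"age": 25, "city": "Ankara"},  # exact duplicate of the first record
]

def variable_types(rows):
    """Label each column 'numerical' or 'categorical' from its non-missing values."""
    types = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        types[column] = "numerical" if values and is_numeric else "categorical"
    return types

def missing_counts(rows):
    """Count missing (None) entries per column."""
    return {column: sum(row[column] is None for row in rows) for column in rows[0]}

def duplicate_count(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())

print(variable_types(records))
print(missing_counts(records))
print(duplicate_count(records))
```

The sketch assumes every record has the same keys as the first one; real datasets usually need a check for that as well.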
Steps in EDA
Descriptive Statistics:
➢ Calculate measures like mean, median, mode, variance, and standard
deviation.
➢ Summarize categorical data with frequency counts and percentages.
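As a rough sketch of the categorical summary described above, frequency counts and percentages can be built with `collections.Counter`. The `colors` column and the `frequency_table` helper are hypothetical names chosen for illustration, and the function assumes a non-empty input.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percentage)} ordered by descending count."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, 100 * count / total)
        for category, count in counts.most_common()
    }

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "red", "blue"]

for category, (count, percent) in frequency_table(colors).items():
    print(f"{category:<6} {count:>3} {percent:6.1f}%")
```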
Data Visualization:
➢ Univariate Analysis: Visualize single variables using histograms, box
plots, or bar charts.
➢ Bivariate Analysis: Study relationships between two variables using scatter
plots or correlation matrices.
➢ Multivariate Analysis: Explore relationships among multiple variables using
heatmaps or pair plots.
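Plotting libraries are the usual tool for these charts. To keep the example dependency-free, the sketch below only shows the binning step behind a histogram and prints a text version. The `histogram_counts` helper and the choice of three equal-width bins are assumptions for illustration, and the function assumes the data contains at least two distinct values.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would fall one past the last bin, so clamp it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return low, width, counts

data = [10, 20, 20, 30, 40]
low, width, counts = histogram_counts(data, bins=3)

for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:5.1f} to {start + width:5.1f} | {'#' * count}")
```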
Data Cleaning: Handle missing values, outliers, or erroneous data points.
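One common way to handle missing values is to fill them with the median of the observed values. The sketch below assumes that strategy purely for illustration; the slides do not prescribe a particular method, and the `ages` list and `fill_missing_with_median` helper are hypothetical.

```python
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical numerical column with two missing entries.
ages = [25, None, 31, 40, None]
cleaned = fill_missing_with_median(ages)
print(cleaned)
```

Whether to fill, drop, or flag missing values depends on why they are missing, so this should be read as one option among several.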
Correlation Analysis: Use metrics like Pearson or Spearman correlation to
examine relationships between numerical variables.
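The two correlation metrics can be sketched from their definitions: Pearson measures linear association, while Spearman applies the same calculation to the ranks of the values. The helper names and the sample `x` and `y` lists are assumptions for illustration, and the code assumes neither variable is constant (otherwise the denominator is zero).

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # increases with x, but not linearly

print(pearson(x, y), spearman(x, y))
```

Because `y` always increases with `x`, the rank-based Spearman value is at its maximum, while the Pearson value is slightly lower since the relationship is curved.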
Descriptive Statistics
Descriptive statistics is a branch of statistics that involves summarizing and
organizing data to describe its main characteristics.
It focuses on presenting data in a clear and understandable way.
Purpose of Descriptive Statistics:
➢ Simplify large datasets into a manageable format.
➢ Provide insights into the distribution, central tendency, and variability of the
data.
➢ Serve as a foundation for further statistical analysis.
Key Methods of Descriptive Statistics:
➢ Measures of Central Tendency
➢ Measures of Variability
➢ Measures of Shape
Measures of Central Tendency
These metrics describe the center or typical value of a dataset.
Mean (Average): The sum of all values divided by the number of values.
Formula: $\text{Mean} = \frac{\sum x}{n}$
Example: For the data [10,20,20,30,40], the mean is 24.
Median: The middle value when data is sorted.
If the dataset has an even number of values, it is the average of the two middle
values.
Example: For the data [10,20,20,30,40], the median is 20.
Mode: The most frequent value(s) in the dataset.
Example: For the data [10,20,20,30,40], the mode is 20.
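The three examples above can be checked with Python's standard `statistics` module. This is a small sketch; the `even_data` and `bimodal` lists are extra, hypothetical inputs added to show the even-length median and a dataset with two modes.

```python
import statistics

data = [10, 20, 20, 30, 40]

mean = statistics.mean(data)        # sum of values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of all most frequent values

# Hypothetical extra inputs.
even_data = [10, 20, 30, 40]
even_median = statistics.median(even_data)  # average of the two middle values
bimodal = [1, 1, 2, 2, 3]
bimodal_modes = statistics.multimode(bimodal)

print(mean, median, modes)
print(even_median, bimodal_modes)
```

`multimode` is used instead of `mode` because it returns every most frequent value, which matters when a dataset has more than one mode.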
Measure | Strengths | Weaknesses
Mean | Takes all data points into account. Useful for numerical datasets. | Sensitive to outliers or extreme values. May not represent skewed data accurately.
Median | Not affected by outliers. Provides a better central measure for skewed data. | Ignores the magnitude of values. May not be ideal for datasets with small variations.
Mode | Useful for categorical data. Works well when identifying common trends. | May not exist or may have multiple values (bimodal or multimodal datasets).
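The difference in outlier sensitivity between the mean and the median can be seen by adding one extreme value to the running example. The value 1000 is a hypothetical outlier chosen for illustration.

```python
import statistics

data = [10, 20, 20, 30, 40]
with_outlier = data + [1000]  # hypothetical extreme value

# How far each measure moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```

In this sketch the mean is pulled far toward the outlier, while the median changes only because the dataset gained one more value, not because of how large that value is.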
Measures of Variability
These metrics indicate how spread out the data is.
Range: The difference between the maximum and minimum values.
Formula: Range = Max - Min
Example: For the data [10,20,20,30,40], the range is 40 - 10 = 30.
Variance: The average of the squared differences from the mean.
Formula: $\text{Variance} = \frac{\sum (x - \text{Mean})^2}{n}$
Example: For the data [10,20,20,30,40], the variance is 104.
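Worked step by step, with the mean of 24 from the earlier example, the variance calculation looks like this:

```latex
\begin{align*}
\text{Mean} &= \frac{10 + 20 + 20 + 30 + 40}{5} = 24 \\
\sum (x - \text{Mean})^2 &= (-14)^2 + (-4)^2 + (-4)^2 + 6^2 + 16^2 \\
&= 196 + 16 + 16 + 36 + 256 = 520 \\
\text{Variance} &= \frac{520}{5} = 104
\end{align*}
```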
Standard Deviation: The square root of the variance, indicating how far values
typically lie from the mean, expressed in the same units as the data.
Example: For the data [10,20,20,30,40], the standard deviation is 10.20.
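These figures can be reproduced with the standard `statistics` module. The examples above divide by n, which corresponds to the population functions `pvariance` and `pstdev`. The sample version, which divides by n - 1, is included only for contrast and is not part of the examples above.

```python
import statistics

data = [10, 20, 20, 30, 40]

value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # divides by n (population variance)
std_dev = statistics.pstdev(data)      # square root of the population variance

# For contrast only: the sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(value_range, variance, round(std_dev, 2), sample_variance)
```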
Interquartile Range (IQR): The range between the 25th percentile (Q1) and the 75th
percentile (Q3).
Formula: IQR = Q3 - Q1
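A minimal sketch of the IQR using `statistics.quantiles` follows. Quartile conventions differ between tools, so the `method="inclusive"` setting here is an assumption, and other methods can give different Q1 and Q3 for the same data. The 1.5 × IQR fences are a widely used rule of thumb for flagging outliers, added for illustration rather than taken from the slides.

```python
import statistics

data = [10, 20, 20, 30, 40]

# Quartiles under the "inclusive" convention (an assumption; see lead-in).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Rule-of-thumb fences: values outside them are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```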
Percentiles and Quartiles
Measures of Shape
These metrics describe the distribution’s shape:
Skewness: Measures asymmetry in the data distribution.
➢ Positive skew: Long tail on the right.
➢ Negative skew: Long tail on the left.
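One common way to quantify skewness is the third central moment divided by the variance raised to the power 1.5. The sketch below assumes that population (moment-based) definition; statistical packages may apply sample-size corrections and report slightly different values. The sample lists are hypothetical.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 20]     # long tail on the right
left_tailed = [-x for x in right_tailed]  # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

The right-tailed list gives a positive value, its mirror image a negative one, and the symmetric list a value of zero.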
[Figure: Example of skewness]
Kurtosis: Measures the "tailedness" of the distribution.
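➢ High kurtosis: Heavy tails, with more extreme values (outliers) than a normal distribution.
➢ Low kurtosis: Light tails, with fewer extreme values than a normal distribution.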
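Kurtosis can be sketched as the fourth central moment divided by the squared variance. Subtracting 3, the value for a normal distribution, gives the excess kurtosis, so positive results point to heavier tails and negative results to lighter tails. The population definition and the sample lists below are assumptions for illustration.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance ** 2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                            # evenly spread, light tails
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # mostly central, two extremes

print(excess_kurtosis(flat), excess_kurtosis(heavy_tailed))
```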