Dept. of CSE(DATA SCIENCE)
CHAPTER 2: GET TO KNOW ABOUT YOUR DATA
1.Enlist and explain different types of data attributes used in data mining.
In data mining, understanding the nature and type of data attributes is critical for selecting
appropriate data preprocessing, analysis, and modeling techniques. Attributes, also referred to
as features or variables, are properties or characteristics of an object that can be measured or
categorized.
Data attributes can be broadly categorized into four main types: nominal, binary, ordinal, and
numeric. Each attribute type has specific characteristics that define how the data values can
be interpreted and used.
1. Nominal Attributes: Nominal attributes are qualitative in nature. They represent
categories or labels without any quantitative meaning or inherent order. The values of
nominal attributes are just names or identifiers. For example, an attribute "color" can
have values like red, green, and blue. Since these categories do not have any ranking
or arithmetic meaning, statistical operations like mean or median do not apply.
2. Binary Attributes: Binary attributes are a special case of nominal attributes with only
two possible states or values. These are typically encoded as 0 and 1, or true and false.
Binary attributes can be further classified as:
o Symmetric: When both values are equally important (e.g., gender: male or
female).
o Asymmetric: When one value is more significant (e.g., test result: positive or
negative).
3. Ordinal Attributes: Ordinal attributes are categorical attributes that have a meaningful
order or ranking among their values. For example, the attribute "education level" can
have values: high school < bachelor < master < Ph.D. While the values are ordered,
the intervals between them are not necessarily equal or known. Arithmetic operations
are not appropriate, but ranking and order-based analysis is meaningful.
4. Numeric Attributes: Numeric attributes represent quantifiable measurements and are
subdivided into:
o Interval Attributes: These have meaningful differences between values but no
true zero point. For example, temperature in Celsius or Fahrenheit is an
interval attribute. While 30°C is 10°C higher than 20°C, it is not “1.5 times
hotter.”
o Ratio Attributes: These have both meaningful differences and a true zero.
Examples include age, weight, height, and salary. Here, arithmetic operations
like division are valid. For example, a person earning $100,000 earns twice as
much as someone earning $50,000.
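These distinctions can be illustrated with a short Python sketch using only the standard library (the values are assumed examples; each attribute is used only with the operations that are meaningful for its type):

```python
from statistics import mode

# Nominal: categories with no order — only counting/mode is meaningful.
colors = ["red", "green", "blue", "red"]
print(mode(colors))                              # → 'red'

# Ordinal: order matters, so comparisons are meaningful once the
# ranking is encoded, but arithmetic on the ranks is not.
education = ["high school", "bachelor", "master", "Ph.D."]
rank = {level: i for i, level in enumerate(education)}
print(rank["master"] > rank["bachelor"])         # → True

# Ratio-scaled numeric: a true zero exists, so division is valid.
salaries = [50_000, 100_000]
print(salaries[1] / salaries[0])                 # → 2.0 ("twice as much")
```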
2.Differentiate between discrete and continuous attributes with suitable examples.
In data mining and statistical analysis, understanding the nature of attributes—particularly the
distinction between discrete and continuous attributes—is essential for proper data
preprocessing, modeling, and visualization. Discrete attributes are often numeric but may
also be categorical, whereas continuous attributes are always numeric; the two differ
significantly in how they represent data and the operations applicable to them.
Discrete Attributes:
Discrete attributes take on only a finite or countably infinite set of values. These values are
often integers and represent separate, indivisible units. A discrete attribute essentially counts
occurrences or assigns integer-based labels. For instance, the number of children in a
household, the count of cars owned, or the number of visits to a hospital in a year are all
discrete attributes. The key feature is that there are gaps between possible values, and the
values themselves are exact.
One of the important characteristics of discrete attributes is that they can be either numeric or
categorical in nature. When numeric, they typically represent count data. For example, a
variable like "number of times a customer made a purchase" is a discrete, numeric attribute.
These types of attributes are often used in association rule mining or frequency analysis.
Because discrete values are finite and clearly separated, they are typically visualized using
bar charts or histograms. In modeling, they are often treated differently from continuous data.
For example, decision trees can split on each distinct discrete value, whereas continuous
values are split based on ranges.
Continuous Attributes:
Continuous attributes, on the other hand, can take on any value within a specified range or
interval. These values are typically real numbers and are used to measure physical quantities.
For example, attributes such as height, weight, temperature, age (in years with decimal
precision), or income are continuous. The key characteristic of continuous attributes is that
there is an infinite number of possible values within any given range. For instance, between
5.0 and 6.0 kg, there can be 5.1, 5.15, 5.155, and so on, indicating infinite granularity.
Continuous data is usually visualized using line graphs, density plots, or histograms. Since
the data is measurable and often follows a distribution (e.g., normal distribution), statistical
analysis such as regression, mean, standard deviation, and correlation are highly applicable.
Comparison and Use in Data Mining:
In data mining, discrete attributes are commonly found in categorical datasets and are often
used in classification or association analysis. Continuous attributes, by contrast, are prevalent
in numerical datasets and are essential for techniques like clustering, regression, and outlier
detection.
While both attribute types are essential, preprocessing steps often differ. Continuous
attributes may need normalization or discretization before applying certain algorithms,
whereas discrete attributes may require encoding techniques like one-hot encoding if used in
machine learning models.
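As a minimal illustration, the sketch below (with an assumed `one_hot` helper, not a library function) contrasts discrete counts with continuous measurements and shows the one-hot encoding mentioned above:

```python
# Discrete: countable integers with gaps between possible values.
purchases = [0, 1, 2, 2, 3]
# Continuous: any real value within a range (infinite granularity).
weights = [5.1, 5.15, 5.155]

# One-hot encoding of a discrete categorical attribute: one indicator
# column per category, with exactly one position set to 1.
categories = sorted({"red", "green", "blue"})    # ['blue', 'green', 'red']

def one_hot(value):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green"))                          # → [0, 1, 0]
```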
3.Explain different measures of central tendency in data mining with suitable examples.
Measures of central tendency are statistical tools used to describe the center or typical value
of a dataset. In the context of data mining, these measures help to summarize large datasets
by identifying a single representative value. Understanding the central tendency allows data
analysts and scientists to make decisions, detect patterns, and assess distribution
characteristics.
The most commonly used measures of central tendency include the mean, median, mode, and
midrange. Each of these has its unique properties and is suitable for different types of data or
distributions.
1. Mean:
The mean, also known as the arithmetic average, is calculated by summing all values in a
dataset and dividing by the number of values. It is the most commonly used measure due to
its simplicity and mathematical tractability. The formula is:
Mean (μ) = (x₁ + x₂ + x₃ + ... + xₙ) / n
For example, if the salaries of 5 employees are {30, 40, 50, 60, 70}, the mean salary is:
Mean = (30 + 40 + 50 + 60 + 70) / 5 = 250 / 5 = 50
However, the mean is sensitive to outliers. If a single employee earns 500, the new mean
becomes:
Mean = (30 + 40 + 50 + 60 + 500) / 5 = 680 / 5 = 136
This example shows how one extreme value can significantly skew the mean, making it less
reliable for skewed distributions.
2. Median:
The median is the middle value in an ordered dataset. If the number of values is odd, the
median is the middle value. If even, it is the average of the two middle values. The median is
not affected by extreme values or outliers, making it a robust measure for skewed data.
Example: For the dataset {30, 40, 50, 60, 70}, the median is 50.
For {30, 40, 50, 60, 500}, the median remains 50, whereas the mean is distorted.
3. Mode:
The mode is the value that occurs most frequently in the dataset. A dataset can have one mode
(unimodal), more than one (bimodal or multimodal), or no mode at all if all values are
unique. Mode is especially useful for nominal attributes where mean and median do not
apply.
Example: In {30, 30, 40, 50, 60}, the mode is 30.
For attributes like “city of residence,” mode shows the most common category.
4. Midrange:
The midrange is calculated as the average of the minimum and maximum values in the
dataset:
Midrange = (Minimum value + Maximum value) / 2
This measure gives a quick estimate of the center but is highly sensitive to outliers.
Example: For {30, 40, 50, 60, 70}, Midrange = (30 + 70) / 2 = 50
But for {30, 40, 50, 60, 500}, Midrange = (30 + 500) / 2 = 265
Usage in Data Mining:
Each central tendency measure serves a specific purpose in data mining. The mean is often
used in clustering (like K-means), the median is critical in robust statistics and outlier
detection, and the mode is important for analyzing categorical data. Selecting the appropriate
measure depends on the data type and distribution characteristics.
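The worked examples above can be verified with Python's standard `statistics` module:

```python
from statistics import mean, median

salaries = [30, 40, 50, 60, 70]
print(mean(salaries))                        # 50
print(median(salaries))                      # 50
print((min(salaries) + max(salaries)) / 2)   # midrange = 50.0

# One extreme value drags the mean, but the median stays put:
skewed = [30, 40, 50, 60, 500]
print(mean(skewed))                          # 136
print(median(skewed))                        # 50  (robust to the outlier)
```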
4.Enlist and explain various statistical measures of dispersion in data mining.
Measures of dispersion (also called measures of variability) describe how spread out the data
values are in a dataset. In data mining, understanding dispersion helps identify consistency,
detect outliers, and understand data distribution.
Below are the key measures of dispersion:
1. Range:
• Definition: Difference between the maximum and minimum values in a dataset.
• Formula: Range = Max – Min
• Example: {10, 20, 30, 40, 50} → Range = 50 – 10 = 40
• Pros: Very simple to compute
• Cons: Highly sensitive to outliers
• Usage: Quick spread estimate, but not reliable alone
2. Variance (σ²):
• Definition: Average of the squared differences from the mean
• Formula (for population):
σ² = (Σ (xi – μ)²) / n
• Formula (for sample):
s² = (Σ (xi – x̄)²) / (n – 1)
• Example: Dataset = {2, 4, 6}, Mean = 4
Variance = [(2–4)² + (4–4)² + (6–4)²]/3 = (4 + 0 + 4)/3 ≈ 2.67
• Pros: Mathematically powerful
• Cons: Uses squared units (not intuitive), sensitive to outliers
3. Standard Deviation (σ):
• Definition: Square root of variance
• Formula: σ = √Variance
• Example: If variance = 25 → σ = √25 = 5
• Same unit as original data, making it interpretable
• Pros: Widely used; appears in many algorithms (e.g., clustering, Gaussian
distribution)
• Usage: Understanding spread and deviation from the mean
4. Interquartile Range (IQR):
• Definition: Difference between the third quartile (Q3) and the first quartile (Q1)
• Formula: IQR = Q3 – Q1
• Example: {10, 15, 20, 25, 30, 35, 40}
Q1 = 15, Q3 = 35 → IQR = 35 – 15 = 20
• Pros: Robust to outliers
• Usage: Detecting outliers and understanding middle 50% of data
5. Five-Number Summary (used in boxplots):
• Includes:
• Minimum
• Q1 (25th percentile)
• Median (50th percentile)
• Q3 (75th percentile)
• Maximum
• Helps understand overall data spread visually
• Boxplots use this to show dispersion and detect outliers
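A short standard-library sketch computing the measures above (the quartiles follow the median-of-halves convention used in the IQR example):

```python
import statistics

data = [10, 15, 20, 25, 30, 35, 40]
rng = max(data) - min(data)                    # range = 30
pvar = statistics.pvariance(data)              # population variance = 100
svar = statistics.variance(data)               # sample variance (n - 1 divisor)

# Quartiles as medians of the lower and upper halves of the sorted data:
n = len(data)
q1 = statistics.median(data[:n // 2])          # 15
q3 = statistics.median(data[-(n // 2):])       # 35
print(rng, pvar, q1, q3, q3 - q1)              # 30 100 15 35 20
```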
5.Explain the construction and usefulness of a boxplot in data mining.
A boxplot, also called a box-and-whisker plot, is a standardized graphical representation used
in data analysis to depict the distribution and variability of numerical data. It is particularly
useful in data mining for identifying data spread, skewness, and detecting outliers. A boxplot
is based on a five-number summary and is effective for visual comparison between different
datasets.
Components of a Boxplot:
1. Five-Number Summary:
The foundation of a boxplot consists of these five values:
o Minimum: The smallest value (excluding outliers)
o Q1 (First Quartile): 25% of the data fall below this value
o Median (Q2): The middle value (50th percentile)
o Q3 (Third Quartile): 75% of the data fall below this value
o Maximum: The largest value (excluding outliers)
2. Box:
o A rectangle drawn from Q1 to Q3.
o Represents the Interquartile Range (IQR = Q3 – Q1), which contains the
middle 50% of the data.
o The width of the box indicates data variability.
3. Median Line:
o A line is drawn inside the box to show the median value.
o This line indicates where the center of the data lies.
4. Whiskers:
o Lines extending from both ends of the box to the minimum and maximum
values (excluding outliers).
o Typically whiskers go up to 1.5 × IQR from Q1 and Q3.
5. Outliers:
o Data points beyond 1.5 × IQR from Q1 or Q3 are considered outliers.
o These are plotted individually as small dots or stars outside the whiskers.
Example:
For dataset {10, 12, 14, 15, 18, 20, 22, 23, 25, 30, 35}
• Min = 10
• Q1 = 14
• Median = 20
• Q3 = 25
• Max = 35
→ IQR = 25 – 14 = 11
→ Lower fence = 14 – 1.5×11 = –2.5; upper fence = 25 + 1.5×11 = 41.5.
→ Whiskers extend to the most extreme data values within these fences: 10 and 35.
→ No outliers here, since every value lies inside the fences.
Usefulness of a Boxplot in Data Mining:
• Identifies the spread and symmetry of data distribution.
• Clearly shows skewness:
→ If median is near Q1: positive skew
→ If median is near Q3: negative skew
• Highlights outliers that may need special attention or preprocessing.
• Ideal for comparing multiple datasets (e.g., multiple attributes or classes).
• Boxplots are non-parametric — they do not assume a specific distribution (e.g.,
normal distribution).
When to Use:
• During exploratory data analysis (EDA)
• To compare distributions across different groups
• Before performing normalization or transformations
• While selecting features or handling missing/outlier values
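The worked example above can be checked with a small sketch that computes the five-number summary and outlier fences (median-of-halves quartiles, 1.5 × IQR rule):

```python
import statistics

data = [10, 12, 14, 15, 18, 20, 22, 23, 25, 30, 35]
n = len(data)
q1 = statistics.median(data[:n // 2])      # 14
q2 = statistics.median(data)               # 20 (median)
q3 = statistics.median(data[-(n // 2):])   # 25
iqr = q3 - q1                              # 11
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # fences: -2.5 and 41.5
outliers = [x for x in data if x < lo or x > hi]
print(q1, q2, q3, iqr, outliers)           # 14 20 25 11 []
```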
6.Describe various graphical methods used for univariate and bivariate data
visualization in data mining.
Graphical data visualization is a critical step in data mining and exploratory data analysis
(EDA). It allows users to understand the structure, patterns, and anomalies in the data before
applying any model or transformation. Visualizations are often tailored to the number of
attributes involved:
A. Univariate Data Visualization (single attribute):
1. Histogram:
o Displays the frequency distribution of a numeric (continuous) attribute.
o Data is divided into bins (intervals), and the height of each bar represents the
number of data points in that bin.
o Useful to examine shape (e.g., normal, skewed), spread, and central tendency.
o Example: Examining salary distribution in an organization.
2. Boxplot (Box-and-Whisker Plot):
o Displays the five-number summary: Min, Q1, Median, Q3, and Max.
o Visually identifies the spread, skewness, and presence of outliers.
o Especially useful when comparing multiple groups or variables side-by-side.
3. Bar Chart:
o Used for categorical attributes (nominal or ordinal).
o Each category is represented as a bar, with height indicating frequency or
count.
o Bars can be vertical or horizontal.
o Example: Frequency of different product types sold.
4. Pie Chart:
o Displays proportions of categories as segments of a circle.
o Best for a small number of categories; not preferred for detailed analysis.
o Example: Market share of brands.
5. Quantile Plot:
o Plots data values against their estimated quantiles.
o Helps visualize how values are distributed and detect skewness or gaps.
o All data points are represented (unlike boxplot).
6. Q–Q Plot (Quantile–Quantile Plot):
o Used to compare the quantiles of a dataset to a theoretical distribution (e.g.,
normal).
o If the data aligns with the line y = x, then the dataset is approximately
normally distributed.
B. Bivariate Data Visualization (two attributes):
1. Scatter Plot:
o Displays a pair of numerical attributes on the X and Y axes.
o Each point represents an object.
o Useful to detect relationships, correlations, and clusters.
o Example: Height vs. weight of students.
2. Line Graph:
o Connects data points in order (often time-series).
o Used when one variable (like time) is sequential.
o Example: Temperature recorded over a week.
3. Bubble Chart:
o Extension of a scatter plot that includes a third variable using bubble size.
o Useful for visualizing multivariate relationships.
4. Stacked Bar Chart:
o Displays the total value broken into sub-categories.
o Example: Monthly expenses divided into food, rent, utilities.
5. Parallel Coordinate Plot (for multivariate but includes bivariate pairs):
o Attributes are represented as parallel vertical axes.
o Each line crossing the axes represents a data record.
o Useful to observe interactions and patterns across many dimensions.
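As a sketch of how a histogram bins a univariate attribute, here is a minimal pure-Python `histogram` helper (an illustrative function, not a library API; the salary values are assumed):

```python
def histogram(values, bins):
    # Divide [min, max] into equal-width bins and count values per bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The rightmost edge value falls in the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

salaries = [30, 32, 35, 40, 41, 45, 50, 52, 60, 70]
print(histogram(salaries, 4))    # → [3, 3, 2, 2]
```

The bar heights reveal the shape of the distribution at a glance — here, more mass on the left suggests a right-skewed salary distribution.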
7.Explain the concept and interpretation of a quantile–quantile (q–q) plot with
examples.
A quantile–quantile plot, commonly known as a q–q plot, is a statistical visualization used to
compare the distribution of a dataset with a theoretical distribution (e.g., normal distribution)
or with another dataset. It helps in assessing whether two datasets follow the same
distribution or whether the observed data follows a specific theoretical model.
Purpose of a Q–Q Plot:
• To assess the similarity between two distributions.
• To test whether a dataset follows a specific theoretical distribution (e.g., normality).
• To visually detect skewness, kurtosis, or outliers in data.
Construction of a Q–Q Plot:
1. Sort both datasets (or dataset and theoretical distribution) in ascending order.
2. Compute quantiles (percentiles) for each value.
3. Plot the quantiles of one dataset against the quantiles of the other.
o X-axis: Theoretical quantiles (e.g., from normal distribution).
o Y-axis: Observed quantiles from the dataset.
4. Add a reference line y = x (the 45° line) for comparison.
Interpretation of the Plot:
• If the data is from the same distribution as the theoretical one, the points will lie
roughly along the line y = x.
• If the points deviate significantly from this line, the data does not follow the assumed
distribution.
Examples:
1. If you compare a dataset’s quantiles with those of a normal distribution:
o Straight line → Data is approximately normally distributed.
o Curved pattern → Skewness exists: a convex (upward-bending) curve suggests
right skew, while a concave (downward-bending) curve suggests left skew.
o Tail divergence → Outliers or heavy/light tails compared to normal.
2. Comparing two datasets:
o Dataset A = {1, 2, 3, 4, 5}
o Dataset B = {10, 20, 30, 40, 50}
o Q–Q plot will be a straight line showing linear relationship but not equal
values.
Use Cases in Data Mining:
• To check assumptions before applying algorithms like linear regression, which
assume normally distributed residuals.
• To evaluate the effectiveness of data transformation (e.g., after applying log, sqrt).
• To compare attribute distributions across classes in classification problems.
Advantages:
• Non-parametric: Does not require parameters like mean or standard deviation.
• Easy to detect deviation from normality.
• Helps validate assumptions visually.
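The two-dataset example above can be sketched numerically: a q–q plot simply pairs the sorted values (quantiles) of one dataset with those of the other:

```python
a = sorted([1, 2, 3, 4, 5])
b = sorted([10, 20, 30, 40, 50])
pairs = list(zip(a, b))    # the points of the q-q plot
print(pairs)               # → [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

# Every point satisfies y = 10x: a straight line, but not the line y = x,
# so the two distributions have the same shape at different scales.
print(all(y == 10 * x for x, y in pairs))    # → True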
8.Explain the concepts of skewness and kurtosis with their interpretation in data
analysis.
Skewness and kurtosis are statistical measures that describe the shape and characteristics of a
data distribution. These concepts are essential in data mining and data analysis to understand
how the data deviates from a normal distribution.
1. Skewness:
• Definition:
o Skewness measures the asymmetry of the distribution of values in a dataset.
o It indicates whether the data values are concentrated more on one side of the
mean.
• Types of Skewness:
o Symmetric Distribution:
▪ Skewness ≈ 0
▪ Mean ≈ Median ≈ Mode
▪ Bell-shaped, e.g., normal distribution
o Positive Skew (Right-Skewed):
▪ Skewness > 0
▪ Tail on the right is longer; more values are clustered to the left.
▪ Mean > Median
▪ Example: Income distribution, where few values are extremely high.
o Negative Skew (Left-Skewed):
▪ Skewness < 0
▪ Tail on the left is longer; more values are clustered to the right.
▪ Mean < Median
▪ Example: Test scores where most students score high, but few score
very low.
• Interpretation:
o Helps identify asymmetry.
o Critical in deciding whether to apply transformation (e.g., log transformation)
to make data more normal.
• Formula (simplified for sample data): Skewness = [n / ((n – 1)(n – 2))] × Σ[(xi – x̄)³ /
s³]
• Usage in Data Mining:
o Before applying statistical models like linear regression or clustering.
o To determine if data needs normalization or transformation.
2. Kurtosis:
• Definition:
o Kurtosis measures the “tailedness” or peakedness of a data distribution.
o It shows how much of the variance is due to outliers or extreme deviations.
• Types of Kurtosis:
o Mesokurtic:
▪ Kurtosis ≈ 3 (excess kurtosis ≈ 0)
▪ Normal distribution
o Leptokurtic (Heavy tails):
▪ Kurtosis > 3 (excess kurtosis > 0)
▪ Sharp peak with fat tails; more prone to outliers
▪ Example: Stock market returns
o Platykurtic (Light tails):
▪ Kurtosis < 3 (excess kurtosis < 0)
▪ Flatter peak with thin tails; fewer outliers
▪ Example: Uniform distribution
• Formula (sample kurtosis): Kurtosis = [n(n+1)/((n–1)(n–2)(n–3))] × Σ[(xi – x̄)⁴ / s⁴] –
3(n–1)²/((n–2)(n–3))
• Interpretation:
o Higher kurtosis → more extreme outliers.
o Lower kurtosis → flatter and more uniform distribution.
• Usage in Data Mining:
o To assess distribution shape before model selection.
o High kurtosis may require robust methods or outlier treatment.
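The sample skewness formula above can be implemented directly; the datasets below are assumed examples chosen to have a long right tail and a long left tail, respectively:

```python
from statistics import mean, stdev

def sample_skewness(xs):
    # Adjusted sample skewness: [n / ((n-1)(n-2))] * sum(((x - mean)/s)^3)
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one extreme high value
left_skewed = [-x for x in right_skewed]     # mirror image

print(sample_skewness(right_skewed) > 0)     # → True (positive skew)
print(sample_skewness(left_skewed) < 0)      # → True (negative skew)
```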
9.Differentiate between similarity and dissimilarity measures in data mining with
examples.
Similarity and dissimilarity are foundational concepts in data mining used to quantify how
alike or different two data objects are. These measures are especially critical in clustering,
classification, recommendation systems, and anomaly detection.
1. Similarity Measure:
• Definition:
o A similarity measure quantifies how “close” or “related” two objects are.
o Higher similarity values indicate greater resemblance.
• Properties:
o Symmetric: sim(x, y) = sim(y, x)
o The value typically ranges from 0 to 1, though some measures (e.g., cosine
similarity and Pearson correlation) range from –1 to 1
o sim(x, x) = 1 (maximum similarity with itself)
• Common Similarity Measures:
o Cosine Similarity: → sim(x, y) = (x ⋅ y) / (‖x‖ × ‖y‖) → Used in text mining
and high-dimensional data
o Jaccard Similarity: → sim = |A ∩ B| / |A ∪ B| → Best for comparing binary or
set-like attributes
o Pearson Correlation: → Measures linear correlation between numerical
variables → sim(x, y) = cov(x, y) / (σx × σy)
• Example:
o For binary vectors A = [1, 0, 1, 1], B = [1, 1, 1, 0], Jaccard similarity: →
Intersection = 2, Union = 4 → sim = 2/4 = 0.5
2. Dissimilarity Measure:
• Definition:
o A dissimilarity measure quantifies how “different” or “distant” two data
objects are.
o Lower values indicate higher similarity; 0 indicates identical objects.
• Properties:
o Non-negative: d(x, y) ≥ 0
o Symmetric: d(x, y) = d(y, x)
o d(x, x) = 0
o Often satisfies the triangle inequality (distance measure)
• Common Dissimilarity Measures:
o Euclidean Distance: → d(x, y) = √[(x₁ – y₁)² + (x₂ – y₂)² + ... + (xn – yn)²] →
Used in clustering and numerical data comparisons
o Manhattan Distance: → d(x, y) = |x₁ – y₁| + |x₂ – y₂| + ... + |xn – yn| →
Measures grid-like (taxicab) distance
o Hamming Distance: → Used for binary strings; counts the number of
mismatched bits → Example: "1010" vs "1001" → distance = 2
• Example:
o For vectors A = [2, 3], B = [5, 7]: → Euclidean distance = √[(5–2)² + (7–3)²] =
√(9 + 16) = √25 = 5
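The worked examples above can be verified with minimal pure-Python implementations of these measures:

```python
import math

def jaccard(a, b):
    # Binary vectors: |A ∩ B| / |A ∪ B| over positions set to 1
    # (0-0 matches are ignored, as for asymmetric binary attributes).
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

print(jaccard([1, 0, 1, 1], [1, 1, 1, 0]))   # → 0.5
print(euclidean([2, 3], [5, 7]))             # → 5.0
print(manhattan([2, 3], [5, 7]))             # → 7
print(hamming("1010", "1001"))               # → 2
```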
10.Explain pixel-oriented visualization techniques used for high-dimensional data in
data mining.
Pixel-oriented visualization is a powerful technique designed to display large volumes of
high-dimensional data compactly and effectively. It encodes individual data values as colored
pixels, allowing users to identify patterns, trends, clusters, and anomalies visually.
1. Concept of Pixel-Oriented Visualization:
• Each data value is represented by a single pixel.
• The pixel’s color (or intensity) reflects the attribute’s magnitude.
• Values from multiple dimensions can be arranged in different segments or frames on a
screen.
• Enables visualization of millions of data items simultaneously.
2. Why Use Pixel-Oriented Techniques?
• Traditional plots (like scatter or line plots) are limited by screen space and
dimensionality.
• Pixel-oriented techniques allow visualization of: → Large datasets (millions of
records) → High-dimensional datasets (tens or hundreds of attributes)
• These techniques are non-parametric and scalable.
3. Core Features:
• High-density display: → Allows one pixel per data value (maximum information per
screen)
• Compact layout: → Multiple attributes visualized in a single frame or split into sub-
windows
• Color encoding: → Darker/brighter/more saturated pixels represent higher or lower
values
4. Types of Pixel-Oriented Techniques:
1. Recursive Pattern:
o Data values are arranged in a space-filling curve (like Hilbert or Peano curve).
o Preserves data locality, i.e., similar values appear close on the display.
o Ideal for datasets with natural ordering (e.g., time-series).
2. Circle Segments:
o Data dimensions are represented as pie segments.
o Each segment shows data values for one attribute.
o Segments are arranged radially in a circular format.
o Good for visual comparison across dimensions.
3. Spiral Display:
o Data is arranged in a spiral shape to emphasize periodic patterns.
o Often used for temporal or periodic data.
o Repeating cycles or fluctuations can be spotted visually.
4. Axes-Based Layout:
o Each axis represents one attribute.
o Pixel columns or grids show values for that attribute across all records.
o Supports detailed inspection of individual attributes.
5. Interpretation:
• Color bands indicate trends or clusters.
• Random or chaotic color spread may indicate outliers or noise.
• Similar colors aligned together may indicate correlated attributes.
6. Use Cases in Data Mining:
• Visual exploration before modeling or clustering.
• Identifying attribute relevance (features with strong patterns).
• Detecting anomalies and rare events.
• Understanding class distributions in classification tasks.
7. Limitations:
• Requires careful design of layout to avoid confusion.
• Color encoding can be misinterpreted if scale isn’t standardized.
• Not suitable for very small datasets (overkill for <100 records).
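As a rough sketch of the core idea, the assumed `to_pixels` helper below (illustrative only, not a library function) maps each data value to a grayscale intensity and lays the values out row by row, as a pixel-oriented display would:

```python
def to_pixels(values, width):
    # Normalize each value to a 0-255 grayscale shade, then split the
    # flat list of shades into rows of `width` pixels each.
    lo, hi = min(values), max(values)
    shades = [round(255 * (v - lo) / (hi - lo)) for v in values]
    return [shades[i:i + width] for i in range(0, len(shades), width)]

data = [10, 20, 30, 40, 50, 60, 70, 80]
print(to_pixels(data, 4))
# → [[0, 36, 73, 109], [146, 182, 219, 255]]
```

A real pixel-oriented display would render these shades as an image (one pixel per value), where smooth color gradients reveal trends and abrupt color changes suggest outliers.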
SUMS
11. Analyse the dataset: {45, 50, 55, 55, 60, 65, 65, 70, 75, 80, 85, 100}.
Perform the following:
a) Calculate mean and median
b) Find midrange
c) Compute IQR
d) Detect outliers using IQR
e) Find variance and standard deviation
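A sketch of the computation (median-of-halves quartiles and the 1.5 × IQR rule, matching the conventions used in this chapter):

```python
import statistics

data = [45, 50, 55, 55, 60, 65, 65, 70, 75, 80, 85, 100]
n = len(data)
mean = statistics.mean(data)                    # ≈ 67.08
median = statistics.median(data)                # 65
midrange = (min(data) + max(data)) / 2          # 72.5
q1 = statistics.median(data[:n // 2])           # 55
q3 = statistics.median(data[-(n // 2):])        # 77.5
iqr = q3 - q1                                   # 22.5
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # fences: 21.25 and 111.25
outliers = [x for x in data if x < lo or x > hi]   # [] — 100 is inside
var = statistics.pvariance(data)                # population variance
std = statistics.pstdev(data)                   # population std deviation
print(mean, median, midrange, iqr, outliers, round(var, 2), round(std, 2))
```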
12. For the dataset {10, 15, 20, 20, 25, 30, 35, 40, 40, 45},
a) Create a boxplot
b) State the five-number summary
c) Mention if there are any outliers
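A sketch verifying the five-number summary and the outlier check (median-of-halves quartiles):

```python
import statistics

data = [10, 15, 20, 20, 25, 30, 35, 40, 40, 45]
n = len(data)
summary = (min(data),
           statistics.median(data[:n // 2]),     # Q1 = 20
           statistics.median(data),              # median = 27.5
           statistics.median(data[-(n // 2):]),  # Q3 = 40
           max(data))
iqr = summary[3] - summary[1]                    # 20
lo, hi = summary[1] - 1.5 * iqr, summary[3] + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]
print(summary, outliers)    # → (10, 20, 27.5, 40, 45) []
```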
13. Analyze two distributions:
X = {30, 40, 50, 60, 70}
Y = {35, 42, 49, 58, 66}
a) Compare them using a q–q plot (describe the points)
b) Discuss similarities and shifts in distribution
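A numerical sketch of the q–q pairing (sorted values of X against sorted values of Y):

```python
x = sorted([30, 40, 50, 60, 70])
y = sorted([35, 42, 49, 58, 66])
pairs = list(zip(x, y))            # the q-q plot points
shifts = [b - a for a, b in pairs]
print(pairs)
print(shifts)                      # → [5, 2, -1, -2, -4]
```

The shifts run from +5 down to –4: Y's range (31) is narrower than X's (40), so Y is compressed relative to X, though both increase steadily — similar shape, different spread.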
14. Using the data: {100, 102, 105, 108, 112, 115, 119, 125, 130, 200},
a) Find mean, median, mode
b) Discuss the impact of the outlier on mean and median
c) Suggest whether mean or median is better for central tendency
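A sketch of the computation, including the effect of dropping the outlier 200:

```python
from statistics import mean, median
from collections import Counter

data = [100, 102, 105, 108, 112, 115, 119, 125, 130, 200]
print(mean(data))      # → 121.6
print(median(data))    # → 113.5

# Mode: no value repeats, so there is no mode.
counts = Counter(data)
top = counts.most_common(1)[0][1]
modes = [v for v, c in counts.items() if c == top and top > 1]
print(modes)           # → []

# Dropping the outlier shifts the mean a lot, the median only a little,
# so the median is the better measure of central tendency here.
trimmed = data[:-1]
print(mean(trimmed), median(trimmed))    # ≈ 112.89 and 112
```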
15. Given the salary data of two departments:
Dept A: {40, 42, 44, 46, 48}
Dept B: {35, 40, 45, 50, 55}
a) Compute standard deviation of both
b) Identify which department has more consistent salary structure
c) Interpret the result
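A sketch of the comparison using population standard deviation:

```python
import statistics

dept_a = [40, 42, 44, 46, 48]
dept_b = [35, 40, 45, 50, 55]
sd_a = statistics.pstdev(dept_a)    # ≈ 2.83
sd_b = statistics.pstdev(dept_b)    # ≈ 7.07
print(round(sd_a, 2), round(sd_b, 2))
print(sd_a < sd_b)    # → True: Dept A's salaries are more consistent
```

Both departments have similar central pay, but Dept A's smaller standard deviation means its salaries cluster tightly around the mean — a more consistent salary structure.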