Dept. of CSE(DATA SCIENCE)
CHAPTER 2: GET TO KNOW ABOUT YOUR DATA
1.Enlist and explain different types of data attributes used in data mining.
In data mining, understanding the nature and type of data attributes is critical for selecting
appropriate data preprocessing, analysis, and modeling techniques. Attributes, also referred to
as features or variables, are properties or characteristics of an object that can be measured or
categorized.
Data attributes can be broadly categorized into four main types: nominal, binary, ordinal, and
numeric. Each attribute type has specific characteristics that define how the data values can
be interpreted and used.
1. Nominal Attributes: Nominal attributes are qualitative in nature. They represent
categories or labels without any quantitative meaning or inherent order. The values of
nominal attributes are just names or identifiers. For example, an attribute "color" can
have values like red, green, and blue. Since these categories do not have any ranking
or arithmetic meaning, statistical operations like mean or median do not apply.
2. Binary Attributes: Binary attributes are a special case of nominal attributes with only
two possible states or values. These are typically encoded as 0 and 1, or true and false.
Binary attributes can be further classified as:
o Symmetric: When both values are equally important (e.g., gender: male or
female).
o Asymmetric: When one value is more significant (e.g., test result: positive or
negative).
3. Ordinal Attributes: Ordinal attributes are categorical attributes that have a meaningful
order or ranking among their values. For example, the attribute "education level" can
have values: high school < bachelor < master < Ph.D. While the values are ordered,
the intervals between them are not necessarily equal or known. Arithmetic operations
are not appropriate, but ranking and order-based analysis is meaningful.
4. Numeric Attributes: Numeric attributes represent quantifiable measurements and are
subdivided into:
o Interval Attributes: These have meaningful differences between values but no
true zero point. For example, temperature in Celsius or Fahrenheit is an
interval attribute. While 30°C is 10°C higher than 20°C, it is not “1.5 times
hotter.”
o Ratio Attributes: These have both meaningful differences and a true zero.
Examples include age, weight, height, and salary. Here, arithmetic operations
like division are valid. For example, a person earning $100,000 earns twice as
much as someone earning $50,000.
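These distinctions can be illustrated with a short Python sketch using only the standard library (the values are assumed examples; each attribute is used only with the operations that are meaningful for its type):

```python
from statistics import mode

# Nominal: categories with no order — only counting/mode is meaningful.
colors = ["red", "green", "blue", "red"]
print(mode(colors))                              # → 'red'

# Ordinal: order matters, so comparisons are meaningful once the
# ranking is encoded, but arithmetic on the ranks is not.
education = ["high school", "bachelor", "master", "Ph.D."]
rank = {level: i for i, level in enumerate(education)}
print(rank["master"] > rank["bachelor"])         # → True

# Ratio-scaled numeric: a true zero exists, so division is valid.
salaries = [50_000, 100_000]
print(salaries[1] / salaries[0])                 # → 2.0 ("twice as much")
```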
2.Differentiate between discrete and continuous attributes with suitable examples.
In data mining and statistical analysis, understanding the nature of attributes—particularly the
distinction between discrete and continuous attributes—is essential for proper data
preprocessing, modeling, and visualization. Discrete attributes are often numeric but may
also be categorical, whereas continuous attributes are always numeric; the two differ
significantly in how they represent data and the operations applicable to them.
Discrete Attributes:
Discrete attributes take on only a finite or countably infinite set of values. These values are
often integers and represent separate, indivisible units. A discrete attribute essentially counts
occurrences or assigns integer-based labels. For instance, the number of children in a
household, the count of cars owned, or the number of visits to a hospital in a year are all
discrete attributes. The key feature is that there are gaps between possible values, and the
values themselves are exact.
One of the important characteristics of discrete attributes is that they can be either numeric or
categorical in nature. When numeric, they typically represent count data. For example, a
variable like "number of times a customer made a purchase" is a discrete, numeric attribute.
These types of attributes are often used in association rule mining or frequency analysis.
Because discrete values are finite and clearly separated, they are typically visualized using
bar charts or histograms. In modeling, they are often treated differently from continuous data.
For example, decision trees can split on each distinct discrete value, whereas continuous
values are split based on ranges.
Continuous Attributes:
Continuous attributes, on the other hand, can take on any value within a specified range or
interval. These values are typically real numbers and are used to measure physical quantities.
For example, attributes such as height, weight, temperature, age (in years with decimal
precision), or income are continuous. The key characteristic of continuous attributes is that
there is an infinite number of possible values within any given range. For instance, between
5.0 and 6.0 kg, there can be 5.1, 5.15, 5.155, and so on, indicating infinite granularity.
Continuous data is usually visualized using line graphs, density plots, or histograms. Since
the data is measurable and often follows a distribution (e.g., normal distribution), statistical
analysis such as regression, mean, standard deviation, and correlation are highly applicable.
Comparison and Use in Data Mining:
In data mining, discrete attributes are commonly found in categorical datasets and are often
used in classification or association analysis. Continuous attributes, by contrast, are prevalent
in numerical datasets and are essential for techniques like clustering, regression, and outlier
detection.
While both attribute types are essential, preprocessing steps often differ. Continuous
attributes may need normalization or discretization before applying certain algorithms,
whereas discrete attributes may require encoding techniques like one-hot encoding if used in
machine learning models.
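As a minimal illustration, the sketch below (with an assumed `one_hot` helper, not a library function) contrasts discrete counts with continuous measurements and shows the one-hot encoding mentioned above:

```python
# Discrete: countable integers with gaps between possible values.
purchases = [0, 1, 2, 2, 3]
# Continuous: any real value within a range (infinite granularity).
weights = [5.1, 5.15, 5.155]

# One-hot encoding of a discrete categorical attribute: one indicator
# column per category, with exactly one position set to 1.
categories = sorted({"red", "green", "blue"})    # ['blue', 'green', 'red']

def one_hot(value):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green"))                          # → [0, 1, 0]
```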
3.Explain different measures of central tendency in data mining with suitable examples.
Measures of central tendency are statistical tools used to describe the center or typical value
of a dataset. In the context of data mining, these measures help to summarize large datasets
by identifying a single representative value. Understanding the central tendency allows data
analysts and scientists to make decisions, detect patterns, and assess distribution
characteristics.
The most commonly used measures of central tendency include the mean, median, mode, and
midrange. Each of these has its unique properties and is suitable for different types of data or
distributions.
1. Mean:
The mean, also known as the arithmetic average, is calculated by summing all values in a
dataset and dividing by the number of values. It is the most commonly used measure due to
its simplicity and mathematical tractability. The formula is:
Mean (μ) = (x₁ + x₂ + x₃ + ... + xₙ) / n
For example, if the salaries of 5 employees are {30, 40, 50, 60, 70}, the mean salary is:
Mean = (30 + 40 + 50 + 60 + 70) / 5 = 250 / 5 = 50
However, the mean is sensitive to outliers. If a single employee earns 500, the new mean
becomes:
Mean = (30 + 40 + 50 + 60 + 500) / 5 = 680 / 5 = 136
This example shows how one extreme value can significantly skew the mean, making it less
reliable for skewed distributions.
2. Median:
The median is the middle value in an ordered dataset. If the number of values is odd, the
median is the middle value. If even, it is the average of the two middle values. The median is
not affected by extreme values or outliers, making it a robust measure for skewed data.
Example: For the dataset {30, 40, 50, 60, 70}, the median is 50.
For {30, 40, 50, 60, 500}, the median remains 50, whereas the mean is distorted.
3. Mode:
The mode is the value that occurs most frequently in the dataset. A dataset can have one mode
(unimodal), more than one (bimodal or multimodal), or no mode at all if all values are
unique. Mode is especially useful for nominal attributes where mean and median do not
apply.
Example: In {30, 30, 40, 50, 60}, the mode is 30.
For attributes like “city of residence,” mode shows the most common category.
4. Midrange:
The midrange is calculated as the average of the minimum and maximum values in the
dataset:
Midrange = (Minimum value + Maximum value) / 2
This measure gives a quick estimate of the center but is highly sensitive to outliers.
Example: For {30, 40, 50, 60, 70}, Midrange = (30 + 70) / 2 = 50
But for {30, 40, 50, 60, 500}, Midrange = (30 + 500) / 2 = 265
Usage in Data Mining:
Each central tendency measure serves a specific purpose in data mining. The mean is often
used in clustering (like K-means), the median is critical in robust statistics and outlier
detection, and the mode is important for analyzing categorical data. Selecting the appropriate
measure depends on the data type and distribution characteristics.
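The worked examples above can be verified with Python's standard `statistics` module:

```python
from statistics import mean, median

salaries = [30, 40, 50, 60, 70]
print(mean(salaries))                        # 50
print(median(salaries))                      # 50
print((min(salaries) + max(salaries)) / 2)   # midrange = 50.0

# One extreme value drags the mean, but the median stays put:
skewed = [30, 40, 50, 60, 500]
print(mean(skewed))                          # 136
print(median(skewed))                        # 50  (robust to the outlier)
```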
4.Enlist and explain various statistical measures of dispersion in data mining.
Measures of dispersion (also called measures of variability) describe how spread out the data
values are in a dataset. In data mining, understanding dispersion helps identify consistency,
detect outliers, and understand data distribution.
Below are the key measures of dispersion:
1. Range:
• Definition: Difference between the maximum and minimum values in a dataset.
• Formula: Range = Max – Min
• Example: {10, 20, 30, 40, 50} → Range = 50 – 10 = 40
• Pros: Very simple to compute
• Cons: Highly sensitive to outliers
• Usage: Quick spread estimate, but not reliable alone
2. Variance (σ²):
• Definition: Average of the squared differences from the mean
• Formula (for population):
σ² = (Σ (xi – μ)²) / n
• Formula (for sample):
s² = (Σ (xi – x̄)²) / (n – 1)
• Example: Dataset = {2, 4, 6}, Mean = 4
Variance = [(2–4)² + (4–4)² + (6–4)²]/3 = (4 + 0 + 4)/3 ≈ 2.67
• Pros: Mathematically powerful
• Cons: Uses squared units (not intuitive), sensitive to outliers
3. Standard Deviation (σ):
• Definition: Square root of variance
• Formula: σ = √Variance
• Example: If variance = 25 → σ = √25 = 5
• Same unit as original data, making it interpretable
• Pros: Widely used; appears in many algorithms (e.g., clustering, Gaussian
distribution)
• Usage: Understanding spread and deviation from the mean
4. Interquartile Range (IQR):
• Definition: Difference between the third quartile (Q3) and the first quartile (Q1)
• Formula: IQR = Q3 – Q1
• Example: {10, 15, 20, 25, 30, 35, 40}
Q1 = 15, Q3 = 35 → IQR = 35 – 15 = 20
• Pros: Robust to outliers
• Usage: Detecting outliers and understanding middle 50% of data
5. Five-Number Summary (used in boxplots):
• Includes:
• Minimum
• Q1 (25th percentile)
• Median (50th percentile)
• Q3 (75th percentile)
• Maximum
• Helps understand overall data spread visually
• Boxplots use this to show dispersion and detect outliers
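A short standard-library sketch computing the measures above (the quartiles follow the median-of-halves convention used in the IQR example):

```python
import statistics

data = [10, 15, 20, 25, 30, 35, 40]
rng = max(data) - min(data)                    # range = 30
pvar = statistics.pvariance(data)              # population variance = 100
svar = statistics.variance(data)               # sample variance (n - 1 divisor)

# Quartiles as medians of the lower and upper halves of the sorted data:
n = len(data)
q1 = statistics.median(data[:n // 2])          # 15
q3 = statistics.median(data[-(n // 2):])       # 35
print(rng, pvar, q1, q3, q3 - q1)              # 30 100 15 35 20
```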
5.Explain the construction and usefulness of a boxplot in data mining.
A boxplot, also called a box-and-whisker plot, is a standardized graphical representation used
in data analysis to depict the distribution and variability of numerical data. It is particularly
useful in data mining for identifying data spread, skewness, and detecting outliers. A boxplot
is based on a five-number summary and is effective for visual comparison between different
datasets.
Components of a Boxplot:
1. Five-Number Summary:
The foundation of a boxplot consists of these five values:
o Minimum: The smallest value (excluding outliers)
o Q1 (First Quartile): 25% of the data fall below this value
o Median (Q2): The middle value (50th percentile)
o Q3 (Third Quartile): 75% of the data fall below this value
o Maximum: The largest value (excluding outliers)
2. Box:
o A rectangle drawn from Q1 to Q3.
o Represents the Interquartile Range (IQR = Q3 – Q1), which contains the
middle 50% of the data.
o The width of the box indicates data variability.
3. Median Line:
o A line is drawn inside the box to show the median value.
o This line indicates where the center of the data lies.
4. Whiskers:
o Lines extending from both ends of the box to the minimum and maximum
values (excluding outliers).
o Typically whiskers go up to 1.5 × IQR from Q1 and Q3.
5. Outliers:
o Data points beyond 1.5 × IQR from Q1 or Q3 are considered outliers.
o These are plotted individually as small dots or stars outside the whiskers.
Example:
For dataset {10, 12, 14, 15, 18, 20, 22, 23, 25, 30, 35}
• Min = 10
• Q1 = 14
• Median = 20
• Q3 = 25
• Max = 35
→ IQR = 25 – 14 = 11
→ Lower fence = 14 – 1.5×11 = –2.5; upper fence = 25 + 1.5×11 = 41.5.
→ Whiskers extend to the most extreme data values within these fences: 10 and 35.
→ No outliers here, since every value lies inside the fences.
Usefulness of a Boxplot in Data Mining:
• Identifies the spread and symmetry of data distribution.
• Clearly shows skewness:
→ If median is near Q1: positive skew
→ If median is near Q3: negative skew
• Highlights outliers that may need special attention or preprocessing.
• Ideal for comparing multiple datasets (e.g., multiple attributes or classes).
• Boxplots are non-parametric — they do not assume a specific distribution (e.g.,
normal distribution).
When to Use:
• During exploratory data analysis (EDA)
• To compare distributions across different groups
• Before performing normalization or transformations
• While selecting features or handling missing/outlier values
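The worked example above can be checked with a small sketch that computes the five-number summary and outlier fences (median-of-halves quartiles, 1.5 × IQR rule):

```python
import statistics

data = [10, 12, 14, 15, 18, 20, 22, 23, 25, 30, 35]
n = len(data)
q1 = statistics.median(data[:n // 2])      # 14
q2 = statistics.median(data)               # 20 (median)
q3 = statistics.median(data[-(n // 2):])   # 25
iqr = q3 - q1                              # 11
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # fences: -2.5 and 41.5
outliers = [x for x in data if x < lo or x > hi]
print(q1, q2, q3, iqr, outliers)           # 14 20 25 11 []
```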
6.Describe various graphical methods used for univariate and bivariate data
visualization in data mining.
Graphical data visualization is a critical step in data mining and exploratory data analysis
(EDA). It allows users to understand the structure, patterns, and anomalies in the data before
applying any model or transformation. Visualizations are often tailored to the number of
attributes involved:
A. Univariate Data Visualization (single attribute):
1. Histogram:
o Displays the frequency distribution of a numeric (continuous) attribute.
o Data is divided into bins (intervals), and the height of each bar represents the
number of data points in that bin.
o Useful to examine shape (e.g., normal, skewed), spread, and central tendency.
o Example: Examining salary distribution in an organization.
2. Boxplot (Box-and-Whisker Plot):
o Displays the five-number summary: Min, Q1, Median, Q3, and Max.
o Visually identifies the spread, skewness, and presence of outliers.
o Especially useful when comparing multiple groups or variables side-by-side.
3. Bar Chart:
o Used for categorical attributes (nominal or ordinal).
o Each category is represented as a bar, with height indicating frequency or
count.
o Bars can be vertical or horizontal.
o Example: Frequency of different product types sold.
4. Pie Chart:
o Displays proportions of categories as segments of a circle.
o Best for a small number of categories; not preferred for detailed analysis.
o Example: Market share of brands.
5. Quantile Plot:
o Plots data values against their estimated quantiles.
o Helps visualize how values are distributed and detect skewness or gaps.
o All data points are represented (unlike boxplot).
6. Q–Q Plot (Quantile–Quantile Plot):
o Used to compare the quantiles of a dataset to a theoretical distribution (e.g.,
normal).
o If the data aligns with the line y = x, then the dataset is approximately
normally distributed.
B. Bivariate Data Visualization (two attributes):
1. Scatter Plot:
o Displays a pair of numerical attributes on the X and Y axes.
o Each point represents an object.
o Useful to detect relationships, correlations, and clusters.
o Example: Height vs. weight of students.
2. Line Graph:
o Connects data points in order (often time-series).
o Used when one variable (like time) is sequential.
o Example: Temperature recorded over a week.
3. Bubble Chart:
o Extension of a scatter plot that includes a third variable using bubble size.
o Useful for visualizing multivariate relationships.
4. Stacked Bar Chart:
o Displays the total value broken into sub-categories.
o Example: Monthly expenses divided into food, rent, utilities.
5. Parallel Coordinate Plot (for multivariate but includes bivariate pairs):
o Attributes are represented as parallel vertical axes.
o Each line crossing the axes represents a data record.
o Useful to observe interactions and patterns across many dimensions.
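As a sketch of how a histogram bins a univariate attribute, here is a minimal pure-Python `histogram` helper (an illustrative function, not a library API; the salary values are assumed):

```python
def histogram(values, bins):
    # Divide [min, max] into equal-width bins and count values per bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The rightmost edge value falls in the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

salaries = [30, 32, 35, 40, 41, 45, 50, 52, 60, 70]
print(histogram(salaries, 4))    # → [3, 3, 2, 2]
```

The bar heights reveal the shape of the distribution at a glance — here, more mass on the left suggests a right-skewed salary distribution.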
7.Explain the concept and interpretation of a quantile–quantile (q–q) plot with
examples.
A quantile–quantile plot, commonly known as a q–q plot, is a statistical visualization used to
compare the distribution of a dataset with a theoretical distribution (e.g., normal distribution)
or with another dataset. It helps in assessing whether two datasets follow the same
distribution or whether the observed data follows a specific theoretical model.
Purpose of a Q–Q Plot:
• To assess the similarity between two distributions.
• To test whether a dataset follows a specific theoretical distribution (e.g., normality).
• To visually detect skewness, kurtosis, or outliers in data.
Construction of a Q–Q Plot:
1. Sort both datasets (or dataset and theoretical distribution) in ascending order.
2. Compute quantiles (percentiles) for each value.
3. Plot the quantiles of one dataset against the quantiles of the other.
o X-axis: Theoretical quantiles (e.g., from normal distribution).
o Y-axis: Observed quantiles from the dataset.
4. Add a reference line y = x (the 45° line) for comparison.
Interpretation of the Plot:
• If the data is from the same distribution as the theoretical one, the points will lie
roughly along the line y = x.
• If the points deviate significantly from this line, the data does not follow the assumed
distribution.
Examples:
1. If you compare a dataset’s quantiles with those of a normal distribution:
o Straight line → Data is approximately normally distributed.
o Curved pattern → Skewness exists: a convex (upward-bending) curve suggests
right skew, while a concave (downward-bending) curve suggests left skew.
o Tail divergence → Outliers or heavy/light tails compared to normal.
2. Comparing two datasets:
o Dataset A = {1, 2, 3, 4, 5}
o Dataset B = {10, 20, 30, 40, 50}
o Q–Q plot will be a straight line showing linear relationship but not equal
values.
Use Cases in Data Mining:
• To check assumptions before applying algorithms like linear regression, which
assume normally distributed residuals.
• To evaluate the effectiveness of data transformation (e.g., after applying log, sqrt).
• To compare attribute distributions across classes in classification problems.
Advantages:
• Non-parametric: Does not require parameters like mean or standard deviation.
• Easy to detect deviation from normality.
• Helps validate assumptions visually.
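The two-dataset example above can be sketched numerically: a q–q plot simply pairs the sorted values (quantiles) of one dataset with those of the other:

```python
a = sorted([1, 2, 3, 4, 5])
b = sorted([10, 20, 30, 40, 50])
pairs = list(zip(a, b))    # the points of the q-q plot
print(pairs)               # → [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

# Every point satisfies y = 10x: a straight line, but not the line y = x,
# so the two distributions have the same shape at different scales.
print(all(y == 10 * x for x, y in pairs))    # → True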
8.Explain the concepts of skewness and kurtosis with their interpretation in data
analysis.
Skewness and kurtosis are statistical measures that describe the shape and characteristics of a
data distribution. These concepts are essential in data mining and data analysis to understand
how the data deviates from a normal distribution.
1. Skewness:
• Definition:
o Skewness measures the asymmetry of the distribution of values in a dataset.
o It indicates whether the data values are concentrated more on one side of the
mean.
• Types of Skewness:
o Symmetric Distribution:
▪ Skewness ≈ 0
▪ Mean ≈ Median ≈ Mode
▪ Bell-shaped, e.g., normal distribution
o Positive Skew (Right-Skewed):
▪ Skewness > 0
▪ Tail on the right is longer; more values are clustered to the left.
▪ Mean > Median
▪ Example: Income distribution, where few values are extremely high.
o Negative Skew (Left-Skewed):
▪ Skewness < 0
▪ Tail on the left is longer; more values are clustered to the right.
▪ Mean < Median
▪ Example: Test scores where most students score high, but few score
very low.
• Interpretation:
o Helps identify asymmetry.
o Critical in deciding whether to apply transformation (e.g., log transformation)
to make data more normal.
• Formula (simplified for sample data): Skewness = [n / ((n – 1)(n – 2))] × Σ[(xi – x̄)³ /
s³]
• Usage in Data Mining:
o Before applying statistical models like linear regression or clustering.
o To determine if data needs normalization or transformation.
2. Kurtosis:
• Definition:
o Kurtosis measures the “tailedness” or peakedness of a data distribution.
o It shows how much of the variance is due to outliers or extreme deviations.
• Types of Kurtosis:
o Mesokurtic:
▪ Kurtosis ≈ 3 (excess kurtosis ≈ 0)
▪ Normal distribution
o Leptokurtic (Heavy tails):
▪ Kurtosis > 3 (excess kurtosis > 0)
▪ Sharp peak with fat tails; more prone to outliers
▪ Example: Stock market returns
o Platykurtic (Light tails):
▪ Kurtosis < 3 (excess kurtosis < 0)
▪ Flatter peak with thin tails; fewer outliers
▪ Example: Uniform distribution
• Formula (sample kurtosis): Kurtosis = [n(n+1)/((n–1)(n–2)(n–3))] × Σ[(xi – x̄)⁴ / s⁴] –
3(n–1)²/((n–2)(n–3))
• Interpretation:
o Higher kurtosis → more extreme outliers.
o Lower kurtosis → flatter and more uniform distribution.
• Usage in Data Mining:
o To assess distribution shape before model selection.
o High kurtosis may require robust methods or outlier treatment.
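The sample skewness formula above can be implemented directly; the datasets below are assumed examples chosen to have a long right tail and a long left tail, respectively:

```python
from statistics import mean, stdev

def sample_skewness(xs):
    # Adjusted sample skewness: [n / ((n-1)(n-2))] * sum(((x - mean)/s)^3)
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one extreme high value
left_skewed = [-x for x in right_skewed]     # mirror image

print(sample_skewness(right_skewed) > 0)     # → True (positive skew)
print(sample_skewness(left_skewed) < 0)      # → True (negative skew)
```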
9.Differentiate between similarity and dissimilarity measures in data mining with
examples.
Similarity and dissimilarity are foundational concepts in data mining used to quantify how
alike or different two data objects are. These measures are especially critical in clustering,
classification, recommendation systems, and anomaly detection.
1. Similarity Measure:
• Definition:
o A similarity measure quantifies how “close” or “related” two objects are.
o Higher similarity values indicate greater resemblance.
• Properties:
o Symmetric: sim(x, y) = sim(y, x)
o The value typically ranges from 0 to 1, though some measures (e.g., cosine
similarity and Pearson correlation) range from –1 to 1
o sim(x, x) = 1 (maximum similarity with itself)
• Common Similarity Measures:
o Cosine Similarity: → sim(x, y) = (x ⋅ y) / (‖x‖ × ‖y‖) → Used in text mining
and high-dimensional data
o Jaccard Similarity: → sim = |A ∩ B| / |A ∪ B| → Best for comparing binary or
set-like attributes
o Pearson Correlation: → Measures linear correlation between numerical
variables → sim(x, y) = cov(x, y) / (σx × σy)
• Example:
o For binary vectors A = [1, 0, 1, 1], B = [1, 1, 1, 0], Jaccard similarity: →
Intersection = 2, Union = 4 → sim = 2/4 = 0.5
2. Dissimilarity Measure:
• Definition:
o A dissimilarity measure quantifies how “different” or “distant” two data
objects are.
o Lower values indicate higher similarity; 0 indicates identical objects.
• Properties:
o Non-negative: d(x, y) ≥ 0
o Symmetric: d(x, y) = d(y, x)
o d(x, x) = 0
o Often satisfies the triangle inequality (distance measure)
• Common Dissimilarity Measures:
o Euclidean Distance: → d(x, y) = √[(x₁ – y₁)² + (x₂ – y₂)² + ... + (xn – yn)²] →
Used in clustering and numerical data comparisons
o Manhattan Distance: → d(x, y) = |x₁ – y₁| + |x₂ – y₂| + ... + |xn – yn| →
Measures grid-like (taxicab) distance
o Hamming Distance: → Used for binary strings; counts the number of
mismatched bits → Example: "1010" vs "1001" → distance = 2
• Example:
o For vectors A = [2, 3], B = [5, 7]: → Euclidean distance = √[(5–2)² + (7–3)²] =
√(9 + 16) = √25 = 5
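The worked examples above can be verified with minimal pure-Python implementations of these measures:

```python
import math

def jaccard(a, b):
    # Binary vectors: |A ∩ B| / |A ∪ B| over positions set to 1
    # (0-0 matches are ignored, as for asymmetric binary attributes).
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

print(jaccard([1, 0, 1, 1], [1, 1, 1, 0]))   # → 0.5
print(euclidean([2, 3], [5, 7]))             # → 5.0
print(manhattan([2, 3], [5, 7]))             # → 7
print(hamming("1010", "1001"))               # → 2
```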
10.Explain pixel-oriented visualization techniques used for high-dimensional data in
data mining.
Pixel-oriented visualization is a powerful technique designed to display large volumes of
high-dimensional data compactly and effectively. It encodes individual data values as colored
pixels, allowing users to identify patterns, trends, clusters, and anomalies visually.
1. Concept of Pixel-Oriented Visualization:
• Each data value is represented by a single pixel.
• The pixel’s color (or intensity) reflects the attribute’s magnitude.
• Values from multiple dimensions can be arranged in different segments or frames on a
screen.
• Enables visualization of millions of data items simultaneously.
2. Why Use Pixel-Oriented Techniques?
• Traditional plots (like scatter or line plots) are limited by screen space and
dimensionality.
• Pixel-oriented techniques allow visualization of: → Large datasets (millions of
records) → High-dimensional datasets (tens or hundreds of attributes)
• These techniques are non-parametric and scalable.
3. Core Features:
• High-density display: → Allows one pixel per data value (maximum information per
screen)
• Compact layout: → Multiple attributes visualized in a single frame or split into sub-
windows
• Color encoding: → Darker/brighter/more saturated pixels represent higher or lower
values
4. Types of Pixel-Oriented Techniques:
1. Recursive Pattern:
o Data values are arranged in a space-filling curve (like Hilbert or Peano curve).
o Preserves data locality, i.e., similar values appear close on the display.
o Ideal for datasets with natural ordering (e.g., time-series).
2. Circle Segments:
o Data dimensions are represented as pie segments.
o Each segment shows data values for one attribute.
o Segments are arranged radially in a circular format.
o Good for visual comparison across dimensions.
3. Spiral Display:
o Data is arranged in a spiral shape to emphasize periodic patterns.
o Often used for temporal or periodic data.
o Repeating cycles or fluctuations can be spotted visually.
4. Axes-Based Layout:
o Each axis represents one attribute.
o Pixel columns or grids show values for that attribute across all records.
o Supports detailed inspection of individual attributes.
5. Interpretation:
• Color bands indicate trends or clusters.
• Random or chaotic color spread may indicate outliers or noise.
• Similar colors aligned together may indicate correlated attributes.
6. Use Cases in Data Mining:
• Visual exploration before modeling or clustering.
• Identifying attribute relevance (features with strong patterns).
• Detecting anomalies and rare events.
• Understanding class distributions in classification tasks.
7. Limitations:
• Requires careful design of layout to avoid confusion.
• Color encoding can be misinterpreted if scale isn’t standardized.
• Not suitable for very small datasets (overkill for <100 records).
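As a rough sketch of the core idea, the assumed `to_pixels` helper below (illustrative only, not a library function) maps each data value to a grayscale intensity and lays the values out row by row, as a pixel-oriented display would:

```python
def to_pixels(values, width):
    # Normalize each value to a 0-255 grayscale shade, then split the
    # flat list of shades into rows of `width` pixels each.
    lo, hi = min(values), max(values)
    shades = [round(255 * (v - lo) / (hi - lo)) for v in values]
    return [shades[i:i + width] for i in range(0, len(shades), width)]

data = [10, 20, 30, 40, 50, 60, 70, 80]
print(to_pixels(data, 4))
# → [[0, 36, 73, 109], [146, 182, 219, 255]]
```

A real pixel-oriented display would render these shades as an image (one pixel per value), where smooth color gradients reveal trends and abrupt color changes suggest outliers.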
SUMS
11. Analyse the dataset: {45, 50, 55, 55, 60, 65, 65, 70, 75, 80, 85, 100}.
Perform the following:
a) Calculate mean and median
b) Find midrange
c) Compute IQR
d) Detect outliers using IQR
e) Find variance and standard deviation
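A sketch of the computation (median-of-halves quartiles and the 1.5 × IQR rule, matching the conventions used in this chapter):

```python
import statistics

data = [45, 50, 55, 55, 60, 65, 65, 70, 75, 80, 85, 100]
n = len(data)
mean = statistics.mean(data)                    # ≈ 67.08
median = statistics.median(data)                # 65
midrange = (min(data) + max(data)) / 2          # 72.5
q1 = statistics.median(data[:n // 2])           # 55
q3 = statistics.median(data[-(n // 2):])        # 77.5
iqr = q3 - q1                                   # 22.5
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # fences: 21.25 and 111.25
outliers = [x for x in data if x < lo or x > hi]   # [] — 100 is inside
var = statistics.pvariance(data)                # population variance
std = statistics.pstdev(data)                   # population std deviation
print(mean, median, midrange, iqr, outliers, round(var, 2), round(std, 2))
```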
12. For the dataset {10, 15, 20, 20, 25, 30, 35, 40, 40, 45},
a) Create a boxplot
b) State the five-number summary
c) Mention if there are any outliers
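A sketch verifying the five-number summary and the outlier check (median-of-halves quartiles):

```python
import statistics

data = [10, 15, 20, 20, 25, 30, 35, 40, 40, 45]
n = len(data)
summary = (min(data),
           statistics.median(data[:n // 2]),     # Q1 = 20
           statistics.median(data),              # median = 27.5
           statistics.median(data[-(n // 2):]),  # Q3 = 40
           max(data))
iqr = summary[3] - summary[1]                    # 20
lo, hi = summary[1] - 1.5 * iqr, summary[3] + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]
print(summary, outliers)    # → (10, 20, 27.5, 40, 45) []
```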
13. Analyze two distributions:
X = {30, 40, 50, 60, 70}
Y = {35, 42, 49, 58, 66}
a) Compare them using a q–q plot (describe the points)
b) Discuss similarities and shifts in distribution
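A numerical sketch of the q–q pairing (sorted values of X against sorted values of Y):

```python
x = sorted([30, 40, 50, 60, 70])
y = sorted([35, 42, 49, 58, 66])
pairs = list(zip(x, y))            # the q-q plot points
shifts = [b - a for a, b in pairs]
print(pairs)
print(shifts)                      # → [5, 2, -1, -2, -4]
```

The shifts run from +5 down to –4: Y's range (31) is narrower than X's (40), so Y is compressed relative to X, though both increase steadily — similar shape, different spread.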
14. Using the data: {100, 102, 105, 108, 112, 115, 119, 125, 130, 200},
a) Find mean, median, mode
b) Discuss the impact of the outlier on mean and median
c) Suggest whether mean or median is better for central tendency
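A sketch of the computation, including the effect of dropping the outlier 200:

```python
from statistics import mean, median
from collections import Counter

data = [100, 102, 105, 108, 112, 115, 119, 125, 130, 200]
print(mean(data))      # → 121.6
print(median(data))    # → 113.5

# Mode: no value repeats, so there is no mode.
counts = Counter(data)
top = counts.most_common(1)[0][1]
modes = [v for v, c in counts.items() if c == top and top > 1]
print(modes)           # → []

# Dropping the outlier shifts the mean a lot, the median only a little,
# so the median is the better measure of central tendency here.
trimmed = data[:-1]
print(mean(trimmed), median(trimmed))    # ≈ 112.89 and 112
```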
15. Given the salary data of two departments:
Dept A: {40, 42, 44, 46, 48}
Dept B: {35, 40, 45, 50, 55}
a) Compute standard deviation of both
b) Identify which department has more consistent salary structure
c) Interpret the result
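A sketch of the comparison using population standard deviation:

```python
import statistics

dept_a = [40, 42, 44, 46, 48]
dept_b = [35, 40, 45, 50, 55]
sd_a = statistics.pstdev(dept_a)    # ≈ 2.83
sd_b = statistics.pstdev(dept_b)    # ≈ 7.07
print(round(sd_a, 2), round(sd_b, 2))
print(sd_a < sd_b)    # → True: Dept A's salaries are more consistent
```

Both departments have similar central pay, but Dept A's smaller standard deviation means its salaries cluster tightly around the mean — a more consistent salary structure.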