CHAPTER 2
UNDERSTANDING OF DATA
NUMERICAL PROBLEMS
1. For the given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the mean,
median, mode, standard deviation and variance.
Solution
$$\text{mean} = \frac{5 + 10 + 15 + 20 + 25 + 30}{6} = \frac{105}{6} = 17.5$$
$$\text{median} = \frac{15 + 20}{2} = \frac{35}{2} = 17.5$$
mode = None
The mode is the element that occurs most often in the dataset. Since every element of
the above dataset is unique, there is no mode. Many software packages report the first
element as the mode when all data elements are unique.
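The variance and standard deviation are given by the following, with N = 6, Sum = 105 and mean $\mu$ = 17.5:
$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$$
$$= \frac{(5 - 17.5)^2 + (10 - 17.5)^2 + (15 - 17.5)^2 + (20 - 17.5)^2 + (25 - 17.5)^2 + (30 - 17.5)^2}{6}$$
$$= \frac{437.5}{6} = 72.9166$$
$$\sigma = \sqrt{72.9166} = 8.539$$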
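The hand computation above can be cross-checked with Python's standard `statistics` module. This is a minimal sketch rather than the book's own program; the variable names are illustrative, and the population forms (`pvariance`, `pstdev`) are used because the solution divides by N.

```python
import statistics

marks = [5, 10, 15, 20, 25, 30]

mean = statistics.mean(marks)
median = statistics.median(marks)

# multimode returns every value tied for the highest count; when all
# values are unique it returns the whole list, so no single mode exists.
modes = statistics.multimode(marks)
has_mode = len(modes) < len(marks)

# Population variance and standard deviation (divide by N, as in the solution).
variance = statistics.pvariance(marks)
std_dev = statistics.pstdev(marks)

print(mean, median, modes if has_mode else None)
print(round(variance, 4), round(std_dev, 3))
```

Using `statistics.variance` and `statistics.stdev` instead would divide by N - 1 and give the sample versions, which differ from the values in the solution.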
2. For the given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the arithmetic
mean and the geometric mean.
Solution
$$\text{mean} = \frac{5 + 10 + 15 + 20 + 25 + 30}{6} = \frac{105}{6} = 17.5$$
$$\text{Geometric mean} = \sqrt[6]{5 \times 10 \times 15 \times 20 \times 25 \times 30} = \sqrt[6]{11250000} = 14.96897$$
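The geometric mean can be checked numerically in two ways, as sketched below: directly as the N-th root of the product, and with `statistics.geometric_mean` (available from Python 3.8). The variable names are illustrative only.

```python
import math
import statistics

marks = [5, 10, 15, 20, 25, 30]

arithmetic_mean = sum(marks) / len(marks)

# Geometric mean as the N-th root of the product of the values.
product = math.prod(marks)
geometric_mean = product ** (1 / len(marks))

# The standard library offers the same quantity (Python 3.8+).
geometric_mean_stdlib = statistics.geometric_mean(marks)

print(arithmetic_mean, product, round(geometric_mean, 5))
```

The geometric mean is never larger than the arithmetic mean for positive data, which is consistent with the two results above.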
3. For the given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the five-point
summary and plot the box chart.
Solution
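Minimum = 5
Maximum = 30
$$Q_{50\%} = \text{median} = \frac{15 + 20}{2} = 17.5$$
$$Q_{25\%} = \frac{5 + 10}{2} = \frac{15}{2} = 7.5$$
$$Q_{75\%} = \frac{20 + 25}{2} = \frac{45}{2} = 22.5$$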
One can use the Python program associated with the book to draw the box chart.
4. For the given Tables 2.6 and 2.7, perform the descriptive analysis of data:
Table 2.6: Sample Data
Age Weight
1 4.2
2 4.5
3 4.7
4 5.2
5 6
6 6.2
7 7
8 7.2
9 7.5
10 8.5
Table 2.7: Students Marks Table
Sid English Hindi Maths Science
1 45 70.5 90 40
2 60 72.5 80 45
3 60 80 90 50
4 80 80 90 80
5 85 72 70 60
(a) Find the minimum and maximum marks scored in each subject.
Solution
Minimum and maximum marks:
English = 45, 85
Hindi = 70.5, 80
Maths = 70, 90
Science = 40, 80
(b) Find the details of the student who scored the highest marks in Maths.
The maximum Maths mark is 90, scored by Student IDs 1, 3 and 4.
(c) Find the students with English marks > 60 and Maths marks > 70.
English more than 60: Student IDs 4 and 5
Maths more than 70: Student IDs 1, 2, 3 and 4
Both conditions hold together only for Student ID 4.
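The three queries on Table 2.7 can be reproduced with plain Python, as in the sketch below. The record layout (a list of dictionaries with a `sid` key) is an assumption made for illustration; the marks are copied from the table.

```python
# Table 2.7 as a list of records (values copied from the table).
students = [
    {"sid": 1, "English": 45, "Hindi": 70.5, "Maths": 90, "Science": 40},
    {"sid": 2, "English": 60, "Hindi": 72.5, "Maths": 80, "Science": 45},
    {"sid": 3, "English": 60, "Hindi": 80, "Maths": 90, "Science": 50},
    {"sid": 4, "English": 80, "Hindi": 80, "Maths": 90, "Science": 80},
    {"sid": 5, "English": 85, "Hindi": 72, "Maths": 70, "Science": 60},
]
subjects = ["English", "Hindi", "Maths", "Science"]

# (a) Minimum and maximum mark per subject.
min_max = {
    s: (min(r[s] for r in students), max(r[s] for r in students))
    for s in subjects
}

# (b) Every student who reaches the highest Maths mark.
top_maths = max(r["Maths"] for r in students)
top_maths_ids = [r["sid"] for r in students if r["Maths"] == top_maths]

# (c) Students satisfying each condition, and both together.
english_ids = [r["sid"] for r in students if r["English"] > 60]
maths_ids = [r["sid"] for r in students if r["Maths"] > 70]
both_ids = [r["sid"] for r in students if r["English"] > 60 and r["Maths"] > 70]

print(min_max)
print(top_maths, top_maths_ids)
print(english_ids, maths_ids, both_ids)
```

Note that both comparisons are strict, so a student with exactly 60 in English or exactly 70 in Maths is excluded.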
5. For univariate attributes such as Weight, English marks and Maths marks, find the following:
(a) Mean, Median, Mode
Solution
Weight
$$\text{mean weight} = \frac{4.2 + 4.5 + 4.7 + 5.2 + 6 + 6.2 + 7 + 7.2 + 7.5 + 8.5}{10} = \frac{61}{10} = 6.1$$
$$\text{Median} = \frac{6 + 6.2}{2} = \frac{12.2}{2} = 6.1$$
Mode = None (all weights are unique)
English marks = {45, 60, 60, 80, 85}
$$\text{mean} = \frac{45 + 60 + 60 + 80 + 85}{5} = \frac{330}{5} = 66$$
Median = 60
Mode = 60
Maths marks = {90, 80, 90, 90, 70}
$$\text{mean} = \frac{90 + 80 + 90 + 90 + 70}{5} = \frac{420}{5} = 84$$
Median = 90
Mode = 90
(b) Weighted Mean, Geometric Mean and Harmonic Mean
The weighted mean applies to frequency data, that is, data in which each value has an associated frequency.
Harmonic mean
The formula for the harmonic mean of N values is
$$H = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}$$
Weight harmonic mean is 5.7935
English harmonic mean is 62.6407
Maths harmonic mean is 83.1683
Geometric mean
$$\text{Geometric mean of Weight} = \sqrt[10]{4.2 \times 4.5 \times \cdots \times 8.5} = 5.945$$
Geometric mean of English = 64.32
Geometric mean of Maths = 83.59
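The harmonic and geometric means can be verified with the short sketch below. The helper `harmonic_mean` is an illustrative function written from the definition (N divided by the sum of reciprocals); the Maths marks are taken from Table 2.7 as {90, 80, 90, 90, 70}.

```python
import statistics

weight = [4.2, 4.5, 4.7, 5.2, 6, 6.2, 7, 7.2, 7.5, 8.5]
english = [45, 60, 60, 80, 85]
maths = [90, 80, 90, 90, 70]


def harmonic_mean(values):
    """Harmonic mean: N divided by the sum of reciprocals."""
    return len(values) / sum(1 / v for v in values)


for name, data in [("Weight", weight), ("English", english), ("Maths", maths)]:
    hm = harmonic_mean(data)
    gm = statistics.geometric_mean(data)
    print(f"{name}: harmonic mean = {hm:.4f}, geometric mean = {gm:.2f}")
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean; the printed values can be checked against that ordering.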
(c) Variance and Standard Deviation
Weight (Table 2.6)
Sum = 61, n = 10, mean = 6.1
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
4.2     -1.9     3.61
4.5     -1.6     2.56
4.7     -1.4     1.96
5.2     -0.9     0.81
6       -0.1     0.01
6.2      0.1     0.01
7        0.9     0.81
7.2      1.1     1.21
7.5      1.4     1.96
8.5      2.4     5.76
$$\sum_i (x_i - \bar{x})^2 = 18.7$$
$$\sigma^2 = \frac{(4.2 - 6.1)^2 + \cdots + (8.5 - 6.1)^2}{10} = \frac{18.7}{10} = 1.87$$
$$\sigma = \sqrt{1.87} = 1.3675$$
English
$$\text{English mean} = \bar{x} = \frac{330}{5} = 66$$
The deviations can be computed as follows:
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
45      -21      441
60      -6       36
60      -6       36
80       14      196
85       19      361
$$\sum_i (x_i - \bar{x})^2 = 1070$$
$$\sigma^2 = \frac{1070}{5} = 214$$
$$\sigma = \sqrt{214} = 14.6287$$
Maths
$$\text{Maths mean} = \bar{x} = \frac{420}{5} = 84$$
The deviations can be computed as follows:
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
90       6       36
80      -4       16
90       6       36
90       6       36
70      -14      196
$$\sum_i (x_i - \bar{x})^2 = 320$$
$$\sigma^2 = \frac{320}{5} = 64$$
$$\sigma = \sqrt{64} = 8$$
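The deviation tables above follow one pattern, which the sketch below mirrors step by step: subtract the mean, square, sum, and divide by n. The function names `deviation_table` and `population_variance` are illustrative, not taken from the book's program.

```python
import math


def deviation_table(values):
    """Return the mean and rows of (x, x - mean, (x - mean)^2)."""
    mean = sum(values) / len(values)
    rows = [(x, x - mean, (x - mean) ** 2) for x in values]
    return mean, rows


def population_variance(values):
    """Sum of squared deviations divided by n, as in the solution."""
    _, rows = deviation_table(values)
    return sum(squared for _, _, squared in rows) / len(values)


datasets = {
    "Weight": [4.2, 4.5, 4.7, 5.2, 6, 6.2, 7, 7.2, 7.5, 8.5],
    "English": [45, 60, 60, 80, 85],
    "Maths": [90, 80, 90, 90, 70],
}

for name, data in datasets.items():
    var = population_variance(data)
    print(f"{name}: variance = {var:.2f}, standard deviation = {math.sqrt(var):.4f}")
```

Dividing by n - 1 instead of n would give the sample variance, which is the convention some software uses by default.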
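(d) Absolute Deviation, Mean Absolute Deviation, and Median Absolute Deviation
For the weights, the signed deviations from the mean sum to zero:
$$\text{Deviation} = (4.2 - 6.1) + (4.5 - 6.1) + (4.7 - 6.1) + (5.2 - 6.1) + (6 - 6.1) + (6.2 - 6.1) + (7 - 6.1) + (7.2 - 6.1) + (7.5 - 6.1) + (8.5 - 6.1) = 0$$
The mean absolute deviation uses the absolute values of the same deviations:
$$\text{Mean absolute deviation} = \frac{|4.2 - 6.1| + |4.5 - 6.1| + |4.7 - 6.1| + |5.2 - 6.1| + |6 - 6.1| + |6.2 - 6.1| + |7 - 6.1| + |7.2 - 6.1| + |7.5 - 6.1| + |8.5 - 6.1|}{10} = \frac{11.8}{10} = 1.18$$
Hint: For the median absolute deviation, the median should replace the mean.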
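The deviations for the weights can be checked with the sketch below. The median absolute deviation is computed here as the median of the absolute distances from the median, which is one common definition and is an assumption about what the exercise intends.

```python
import statistics

weight = [4.2, 4.5, 4.7, 5.2, 6, 6.2, 7, 7.2, 7.5, 8.5]

mean = statistics.mean(weight)
median = statistics.median(weight)

# Signed deviations from the mean cancel out (up to floating-point rounding).
deviation_sum = sum(x - mean for x in weight)

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in weight) / len(weight)

# Median absolute deviation, assumed here to be the median of the
# absolute distances from the median.
median_abs_dev = statistics.median(abs(x - median) for x in weight)

print(round(deviation_sum, 10), round(mean_abs_dev, 4), round(median_abs_dev, 4))
```

Because the mean and the median of the weights are both 6.1, the two sets of absolute distances coincide for this dataset; only the final averaging step differs.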
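(e) Coefficient of Variation for weights
With $\sigma = \sqrt{1.87} \approx 1.3675$ and mean 6.1:
$$CV = \frac{\sigma}{\bar{x}} = \frac{1.3675}{6.1} \approx 0.224$$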
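(f) Skewness and Kurtosis for English and Maths
Hint: Skewness uses the third power of the deviations from the mean; kurtosis is of order 4.
English
Mean = 66, sample standard deviation s = 16.3554, n = 5
$$\text{skewness} = \frac{\sum_i (x_i - \bar{x})^3}{(n - 1)\,s^3} = \frac{-90}{4 \times 16.3554^3} = \frac{-90}{17500.3} = -0.0051$$
Similarly, using the fourth powers of the deviations and the population standard deviation $\sigma = 14.6287$:
$$\text{kurtosis} = \frac{\frac{1}{n}\sum_i (x_i - \bar{x})^4}{\sigma^4} = \frac{73162}{45796} \approx 1.60$$
Maths
Mean = 84, sample standard deviation s = 8.9443, n = 5
$$\text{skewness} = \frac{-2160}{4 \times 8.9443^3} = \frac{-2160}{2862.17} = -0.75$$
With the population standard deviation $\sigma = 8$:
$$\text{kurtosis} = \frac{8512}{4096} \approx 2.08$$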
(g) Five-point summary, IQR and semi-interquartile range (SIQR)
Weight
Minimum = 4.2
Maximum = 8.5
Q50% = 6.1
Q25% = 4.6
Q75% = 7.35
IQR = 7.35 - 4.6 = 2.75
SIQR = 1.375
English
Minimum = 45
Maximum = 85
Q50% = 60
Q25% = 52.5
Q75% = 82.5
IQR = 82.5 - 52.5 = 30
SIQR = 15
Maths
Minimum = 70
Maximum = 90
Q50% = 90
Q25% = 75
Q75% = 90
IQR = 90 - 75 = 15
SIQR = 7.5
6. For bivariate data such as English and Maths marks, find
(a) Covariance and correlation between the two variables
Let x = English = {45, 60, 60, 80, 85}
Let y = Maths = {90, 80, 90, 90, 70}
The means of x and y are calculated as follows:
$$\bar{x} = \frac{45 + 60 + 60 + 80 + 85}{5} = \frac{330}{5} = 66$$
$$\bar{y} = \frac{90 + 80 + 90 + 90 + 70}{5} = \frac{420}{5} = 84$$
$x_i$    $x_i - \bar{x}$    $y_i$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
45      -21      90       6       -126
60      -6       80      -4        24
60      -6       90       6       -36
80       14      90       6        84
85       19      70      -14      -266
$$\sum x_i = 330, \quad \sum y_i = 420, \quad \sum_i (x_i - \bar{x})(y_i - \bar{y}) = -320$$
The covariance formula is given as
$$\sigma_{xy} = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
$$= \frac{(45 - 66)(90 - 84) + (60 - 66)(80 - 84) + (60 - 66)(90 - 84) + (80 - 66)(90 - 84) + (85 - 66)(70 - 84)}{5}$$
$$\sigma_{xy} = \frac{-320}{5} = -64$$
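Correlation Coefficient
The variance and standard deviation can be computed as follows:
$$\text{English mean} = \bar{x} = \frac{330}{5} = 66$$
$$\text{Maths mean} = \bar{y} = \frac{420}{5} = 84$$
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $y_i$    $y_i - \bar{y}$    $(y_i - \bar{y})^2$
45      -21      441      90       6       36
60      -6       36       80      -4       16
60      -6       36       90       6       36
80       14      196      90       6       36
85       19      361      70      -14      196
$$\sum x_i = 330, \quad \sum_i (x_i - \bar{x})^2 = 1070, \quad \sum y_i = 420, \quad \sum_i (y_i - \bar{y})^2 = 320$$
$$\sigma_x = \sqrt{\frac{1070}{5}} = \sqrt{214} = 14.6287$$
$$\sigma_y = \sqrt{\frac{320}{5}} = \sqrt{64} = 8$$
$$\sum_i (x_i - \bar{x})(y_i - \bar{y}) = (45 - 66)(90 - 84) + (60 - 66)(80 - 84) + (60 - 66)(90 - 84) + (80 - 66)(90 - 84) + (85 - 66)(70 - 84) = -320$$
$$\text{Correlation coefficient} = \frac{1}{5} \cdot \frac{-320}{\sqrt{214}\,\sqrt{64}} = -0.5469$$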
(b) Covariance between English and Hindi marks
Let x = English = {45, 60, 60, 80, 85}
Let y = Hindi = {70.5, 72.5, 80, 80, 72}
The means of x and y are calculated as follows:
$$\bar{x} = \frac{45 + 60 + 60 + 80 + 85}{5} = \frac{330}{5} = 66$$
$$\bar{y} = \frac{70.5 + 72.5 + 80 + 80 + 72}{5} = \frac{375}{5} = 75$$
$x_i$    $x_i - \bar{x}$    $y_i$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
45      -21      70.5    -4.5      94.5
60      -6       72.5    -2.5      15
60      -6       80       5       -30
80       14      80       5        70
85       19      72      -3       -57
$$\sum x_i = 330, \quad \sum y_i = 375, \quad \sum_i (x_i - \bar{x})(y_i - \bar{y}) = 92.5$$
The covariance can be computed using
$$\sigma_{xy} = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
$$\text{Therefore, } \sigma_{xy} = \frac{92.5}{5} = 18.5$$
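Correlation Coefficient
The means of English (x) and Hindi (y) are, respectively,
$$\text{English mean} = \bar{x} = \frac{330}{5} = 66$$
$$\text{Hindi mean} = \bar{y} = \frac{375}{5} = 75$$
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $y_i$    $y_i - \bar{y}$    $(y_i - \bar{y})^2$
45      -21      441      70.5    -4.5     20.25
60      -6       36       72.5    -2.5     6.25
60      -6       36       80       5       25
80       14      196      80       5       25
85       19      361      72      -3       9
$$\sum x_i = 330, \quad \sum_i (x_i - \bar{x})^2 = 1070, \quad \sum y_i = 375, \quad \sum_i (y_i - \bar{y})^2 = 85.5$$
$$\sigma_x = \sqrt{\frac{1070}{5}} = \sqrt{214} = 14.6287$$
$$\sigma_y = \sqrt{\frac{85.5}{5}} = \sqrt{17.1} = 4.1352$$
$$\text{Correlation coefficient} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{1}{5} \cdot \frac{92.5}{14.6287 \times 4.1352} = 0.3058$$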
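The covariance and correlation in parts (a) and (b) can be reproduced with the sketch below. The helpers `covariance` and `correlation` are illustrative functions written from the formulas used in the solution (division by N, not N - 1).

```python
import math


def covariance(x, y):
    """Population covariance: mean of the products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


english = [45, 60, 60, 80, 85]
maths = [90, 80, 90, 90, 70]
hindi = [70.5, 72.5, 80, 80, 72]

print(covariance(english, maths), round(correlation(english, maths), 4))
print(covariance(english, hindi), round(correlation(english, hindi), 4))
```

The variance of a variable is its covariance with itself, which is why `correlation` can reuse `covariance(x, x)` and `covariance(y, y)` for the denominators.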
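8. Solve the following set of equations using the Gaussian elimination method.
$$2x_1 + 5x_2 = 7$$
$$6x_1 + 12x_2 = 18$$
Solution
The augmented matrix is reduced step by step; the row operation applied is shown next to each result.
$$\left[\begin{array}{cc|c} 2 & 5 & 7 \\ 6 & 12 & 18 \end{array}\right]$$
$$\sim \left[\begin{array}{cc|c} 1 & 2.5 & 3.5 \\ 6 & 12 & 18 \end{array}\right] \quad R_1 = R_1 / 2$$
$$\sim \left[\begin{array}{cc|c} 1 & 2.5 & 3.5 \\ 0 & -3 & -3 \end{array}\right] \quad R_2 = R_2 - 6R_1$$
$$\sim \left[\begin{array}{cc|c} 1 & 2.5 & 3.5 \\ 0 & 1 & 1 \end{array}\right] \quad R_2 = R_2 / (-3)$$
$$\sim \left[\begin{array}{cc|c} 1 & 0 & 1 \\ 0 & 1 & 1 \end{array}\right] \quad R_1 = R_1 - 2.5R_2$$
The solution is $x_1 = 1$, $x_2 = 1$.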
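The row operations above can be automated. The sketch below is one possible Gauss-Jordan routine, assuming a square system with a unique solution; the function name `gauss_jordan` is illustrative, and exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction


def gauss_jordan(augmented):
    """Reduce an augmented matrix [A | b] to [I | x] and return x.

    Assumes a square system with a unique solution. Rows are swapped
    only when a pivot is zero, since exact fractions need no scaling.
    """
    m = [[Fraction(v) for v in row] for row in augmented]
    n = len(m)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot_row = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot_row] = m[pivot_row], m[col]
        # Scale the pivot row so that the pivot becomes 1.
        pivot = m[col][col]
        m[col] = [v / pivot for v in m[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [row[-1] for row in m]


solution = gauss_jordan([[2, 5, 7], [6, 12, 18]])
print(solution)
```

For this system the routine performs the same four operations as the worked solution: divide row 1 by 2, subtract 6 times row 1 from row 2, divide row 2 by -3, and subtract 2.5 times row 2 from row 1.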
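9. Solve the following set of equations using the LU decomposition method.
$$2x_1 + 5x_2 = 7$$
$$6x_1 + 12x_2 = 18$$
Solution
$$\text{Here, } A = \begin{pmatrix} 2 & 5 \\ 6 & 12 \end{pmatrix}$$
Apply Gaussian elimination with the operation $R_2 = R_2 - 3R_1$:
$$\begin{pmatrix} 2 & 5 \\ 6 & 12 \end{pmatrix} \sim \begin{pmatrix} 2 & 5 \\ 0 & -3 \end{pmatrix}$$
Hence, the matrix L, written in terms of the multiplier 3, is
$$L = \begin{pmatrix} 1 & 0 \\ 3 & 1 \end{pmatrix}$$
and the matrix U is
$$U = \begin{pmatrix} 2 & 5 \\ 0 & -3 \end{pmatrix}$$
One can check the factors by multiplying L and U to confirm that the product is A:
$$A = LU = \begin{pmatrix} 1 & 0 \\ 3 & 1 \end{pmatrix} \begin{pmatrix} 2 & 5 \\ 0 & -3 \end{pmatrix} = \begin{pmatrix} 2 & 5 \\ 6 & 12 \end{pmatrix}$$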
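One can apply the formula $Ly = b$ to solve for the unknowns y:
$$\begin{pmatrix} 1 & 0 \\ 3 & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 18 \end{pmatrix}$$
$$y_1 = 7$$
$$3y_1 + y_2 = 18$$
$$21 + y_2 = 18$$
$$y_2 = -3$$
Using the y values, one can obtain the unknown x values from $Ux = y$ as follows:
$$\begin{pmatrix} 2 & 5 \\ 0 & -3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ -3 \end{pmatrix}$$
$$x_2 = \frac{-3}{-3} = 1$$
$$2x_1 + 5(1) = 7$$
$$2x_1 = 2$$
$$x_1 = 1$$
Therefore, the solution is $x_1 = 1$, $x_2 = 1$.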
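The factorization and the two substitution passes can be sketched as follows. This is a minimal Doolittle-style routine without pivoting, so it assumes that no zero pivot appears; the names `lu_decompose` and `solve_lu` are illustrative.

```python
def lu_decompose(a):
    """Doolittle LU decomposition without pivoting (assumes non-zero pivots)."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(v) for v in row] for row in a]
    for col in range(n):
        for row in range(col + 1, n):
            # The multiplier used to clear upper[row][col] is stored in L.
            multiplier = upper[row][col] / upper[col][col]
            lower[row][col] = multiplier
            upper[row] = [u - multiplier * p for u, p in zip(upper[row], upper[col])]
    return lower, upper


def solve_lu(lower, upper, b):
    """Solve L y = b by forward substitution, then U x = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(lower[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(upper[i][j] * x[j] for j in range(i + 1, n))) / upper[i][i]
    return y, x


lower, upper = lu_decompose([[2, 5], [6, 12]])
y, x = solve_lu(lower, upper, [7, 18])
print(lower, upper)
print(y, x)
```

Once L and U are known, the same factors can be reused for any other right-hand side b, which is the practical advantage of LU decomposition over repeating Gaussian elimination.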
10. Apply PCA to the following matrix and prove that it works.
$$\begin{pmatrix} 4 & 3 \\ 1 & 2 \end{pmatrix}$$
Solution
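The mean vector can be computed as follows:
$$m = \begin{pmatrix} (4 + 3)/2 \\ (1 + 2)/2 \end{pmatrix} = \begin{pmatrix} 3.5 \\ 1.5 \end{pmatrix}$$
As part of PCA, the mean must be subtracted from the data to get the adjusted data:
$$x_1 = \begin{pmatrix} 4 - 3.5 \\ 1 - 1.5 \end{pmatrix} = \begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix}$$
$$x_2 = \begin{pmatrix} 3 - 3.5 \\ 2 - 1.5 \end{pmatrix} = \begin{pmatrix} -0.5 \\ 0.5 \end{pmatrix}$$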
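One can find the covariance of these data vectors. The covariance can be obtained
as follows:
$$m_1 = \begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix} \begin{pmatrix} 0.5 & -0.5 \end{pmatrix} = \begin{pmatrix} 0.25 & -0.25 \\ -0.25 & 0.25 \end{pmatrix}$$
$$m_2 = \begin{pmatrix} -0.5 \\ 0.5 \end{pmatrix} \begin{pmatrix} -0.5 & 0.5 \end{pmatrix} = \begin{pmatrix} 0.25 & -0.25 \\ -0.25 & 0.25 \end{pmatrix}$$
$$C = m_1 + m_2 = \begin{pmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{pmatrix}$$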
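The final covariance matrix C is obtained by adding these two matrices, as shown
above. The eigenvalues and eigenvectors can be obtained (left as an exercise) as
follows: the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 0$, and the eigenvectors are
$\begin{pmatrix} -1 \\ 1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$. The matrix A can be
obtained by packing the eigenvectors of these eigenvalues (after sorting them) as follows:
$$A = \begin{pmatrix} -1 & 1 \\ 1 & 1 \end{pmatrix}$$
The transpose of A, $A^T = \begin{pmatrix} -1 & 1 \\ 1 & 1 \end{pmatrix}$, is also the same matrix. The matrix
can be normalized by dividing each element of a vector by the norm of the vector
to get
$$A = \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$$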
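One can check that the PCA matrix A is orthogonal. A matrix is orthogonal if
$A^{-1} = A^T$, that is, $AA^T = I$.
$$AA^T = \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$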
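The transformed matrix y is given as
$$y = A(x - m)$$
Recall that $(x - m)$ is the adjusted matrix.
$$y = A(x - m) = \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{pmatrix}$$
$$= \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{pmatrix} \quad \left(\text{for convenience, } 0.5 = \tfrac{1}{2}\right)$$
$$= \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 0 & 0 \end{pmatrix}$$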
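One can check that the original matrix can be retrieved from this matrix as
$x = A^T y + m$, where m is added to each column:
$$x = \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 0 & 0 \end{pmatrix} + m$$
$$= \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{pmatrix} + \begin{pmatrix} 3.5 \\ 1.5 \end{pmatrix}$$
$$= \begin{pmatrix} 4 & 3 \\ 1 & 2 \end{pmatrix}$$
Therefore, one can infer that the original matrix is recovered without any loss of information.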
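The PCA round trip above can be checked numerically. The sketch below uses only the standard library; `matmul` and `transpose` are small illustrative helpers, and the matrix `A` is filled in with the normalized eigenvectors stated in the solution rather than computed by an eigen-solver.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


data = [[4, 3], [1, 2]]  # columns are the two data vectors
mean = [sum(row) / len(row) for row in data]
centered = [[v - m for v in row] for row, m in zip(data, mean)]

# Sum of outer products of the centered column vectors.
cov = matmul(centered, transpose(centered))

s = 1 / math.sqrt(2)
A = [[-s, s], [s, s]]  # rows are the normalized eigenvectors from the solution

y = matmul(A, centered)  # transformed data
restored = [
    [v + m for v in row] for row, m in zip(matmul(transpose(A), y), mean)
]

print(mean, cov)
print(y)
print(restored)
```

The second row of `y` is zero because the second eigenvalue is 0: all the variation in this dataset lies along the first principal direction.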
11. Apply SVD to the following matrix:
$$A = \begin{pmatrix} 4 & 3 \\ 1 & 2 \end{pmatrix}$$
Perform the matrix decomposition and prove that SVD works.
Solution
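The first step is to compute
$$AA^T = \begin{pmatrix} 4 & 3 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 4 & 1 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} 25 & 10 \\ 10 & 5 \end{pmatrix}$$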
The eigenvalues and eigenvectors of this matrix can be calculated to get U. The eigenvalues of this
matrix are 29.1421 and 0.8578.
The eigenvectors of this matrix are
$$u_1 = \begin{pmatrix} 2.41421356 \\ 1 \end{pmatrix} \text{ when } \lambda = 29.1421$$
$$u_2 = \begin{pmatrix} -0.41421356 \\ 1 \end{pmatrix} \text{ when } \lambda = 0.8578$$
These vectors are normalized to get, respectively,
$$u_1 = \begin{pmatrix} 0.92387953 \\ 0.38268343 \end{pmatrix}$$
$$u_2 = \begin{pmatrix} -0.38268343 \\ 0.92387953 \end{pmatrix}$$
The matrix U can be obtained by concatenating the above vectors as
$$U = [u_1, u_2] = \begin{pmatrix} 0.92387953 & -0.38268343 \\ 0.38268343 & 0.92387953 \end{pmatrix}$$
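The matrix V can be obtained from the eigenvectors of $A^T A = \begin{pmatrix} 17 & 14 \\ 14 & 13 \end{pmatrix}$. The eigenvalues are 29.14213562 and 0.85786438.
The eigenvectors can be found as follows:
$$v_1 = \begin{pmatrix} 1.15300969 \\ 1 \end{pmatrix}$$
$$v_2 = \begin{pmatrix} -0.8672954 \\ 1 \end{pmatrix}$$
The above can be normalized as follows:
$$v_1 = \begin{pmatrix} 0.75545395 \\ 0.65520174 \end{pmatrix}$$
$$v_2 = \begin{pmatrix} -0.65520173 \\ 0.75545396 \end{pmatrix}$$
The matrix V can be obtained by concatenating the above vectors as
$$V = [v_1, v_2] = \begin{pmatrix} 0.75545395 & -0.65520173 \\ 0.65520174 & 0.75545396 \end{pmatrix}$$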
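The matrix S is the diagonal matrix of the square roots of the eigenvalues:
$$S = \begin{pmatrix} \sqrt{29.14213562} & 0 \\ 0 & \sqrt{0.85786438} \end{pmatrix} = \begin{pmatrix} 5.39834564 & 0 \\ 0 & 0.92620968 \end{pmatrix}$$
Therefore, the matrix decomposition $A = U S V^T$ is complete.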
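To confirm that the decomposition works, one can multiply the three factors back together and compare the product with A. The sketch below does this with the standard library only; `matmul` and `transpose` are illustrative helpers, and the entries of U and V are the rounded values from the solution, so the match is approximate.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


A = [[4, 3], [1, 2]]
U = [[0.92387953, -0.38268343], [0.38268343, 0.92387953]]
V = [[0.75545395, -0.65520173], [0.65520174, 0.75545396]]

# Singular values are the square roots of the eigenvalues of A A^T.
S = [[math.sqrt(29.14213562), 0.0], [0.0, math.sqrt(0.85786438)]]

reconstructed = matmul(matmul(U, S), transpose(V))
max_error = max(
    abs(reconstructed[i][j] - A[i][j]) for i in range(2) for j in range(2)
)

print(reconstructed)
print(max_error)
```

A further sanity check is that the product of the two singular values equals the absolute value of the determinant of A, which is 4 × 2 − 3 × 1 = 5.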
12. Find the covariance and correlation coefficient for the following two sets of data:
X: 1 2 6 12
Y: 8 12 18 22
Solution
Let x = {1, 2, 6, 12}
Let y = {8, 12, 18, 22}
The means can be calculated as follows:
$$\bar{x} = \frac{21}{4} = 5.25$$
$$\bar{y} = \frac{60}{4} = 15$$
The formula for covariance is given as
$$\sigma_{xy} = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
Therefore, compute the table as
$x_i$    $x_i - \bar{x}$    $y_i$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
1       -4.25     8       -7       29.75
2       -3.25     12      -3       9.75
6        0.75     18       3       2.25
12       6.75     22       7       47.25
$$\sum x_i = 21, \quad \sum y_i = 60, \quad \sum_i (x_i - \bar{x})(y_i - \bar{y}) = 89$$
In other words,
$$\text{Covariance} = \frac{(1 - 5.25)(8 - 15) + (2 - 5.25)(12 - 15) + (6 - 5.25)(18 - 15) + (12 - 5.25)(22 - 15)}{4} = \frac{89}{4} = 22.25$$
Correlation coefficient
With the means $\bar{x} = 5.25$ and $\bar{y} = 15$ computed above, the variance and standard deviation of x and y can be calculated as follows:
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $y_i$    $y_i - \bar{y}$    $(y_i - \bar{y})^2$
1       -4.25     18.0625     8       -7       49
2       -3.25     10.5625     12      -3       9
6        0.75     0.5625      18       3       9
12       6.75     45.5625     22       7       49
$$\sum x_i = 21, \quad \sum_i (x_i - \bar{x})^2 = 74.75, \quad \sum y_i = 60, \quad \sum_i (y_i - \bar{y})^2 = 116$$
$$\sigma_x = \sqrt{\frac{74.75}{4}} = \sqrt{18.6875} = 4.3229$$
$$\sigma_y = \sqrt{\frac{116}{4}} = \sqrt{29} = 5.3852$$
The covariance was already computed above as 22.25.
$$\text{Therefore, Correlation coefficient} = \frac{22.25}{4.3229 \times 5.3852} = 0.9558$$