Chapter 1
Summarizing Data
1.1 Graphical summaries of the data
Dot plot and histogram
The time series plot
1.2 Numerical descriptive measures
1.3 Measures of central tendency
The sample mean
The median
Mean versus median
1.4 Measures of dispersion
The sample variance
The sample standard deviation
1.5 The empirical rule
1.6 How to relate two things
1.7 Linearly related variables
Linear functions
Mean and variance of a linear function
Linear combinations
Mean and variance of a linear combination
1.1 Graphical Summaries of the Data
© Imperial College Business School
• Data is the statistician’s raw material, the numbers that we use to interpret reality.
E = (1+.1)B = (1.1)B
[Figure: dot plot of the Canadian returns (canada), marking the center or location of the data]
Interpret: The returns are centered, or located, at about .01. The spread, or variation, in the returns is huge.
Some data does not have the mound shape.
[Figure: histogram of Volume]
It is skewed to the left.
We also have data on countries other than Canada.
Let us compare Canada with Japan.
It really helps to get things on the same scale.
How is Japan different from Canada?
Mutual fund data
[Figure: dot plots of the returns on the Dreyfus growth fund, the Putnam income fund, the equally weighted market, and T-bills]
The beer data:
nbeerm: the number of beers male MBA students claim
they can drink without getting drunk
nbeerf: same for females
We call a point like this an outlier.
Generally the males claim they can drink more, their numbers are
centered or located at larger values.
The number of bars you use affects how “smooth” the picture looks.
• The return data has an important feature that the beer data does not have.
• It has an order!
On the vertical axis we have returns. On the horizontal axis we have “time”.
Now do you see a pattern?
The data: 5, 2, 8, 6, 2, with $n = 5$ (for instance, $x_3 = 8$).
$$\bar{x} = \frac{\text{sum}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
We often use the symbol $\bar{x}$ to denote the mean of the numbers x.
We call it “x bar”.
Here is a more compact way to write the same thing.
Consider $x_1 + x_2 + \cdots + x_n$.
We use a shorthand for it (it is just notation):
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
so that
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
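As a small illustration of the summation notation, the sketch below computes the sample mean step by step in Python (the slides themselves use MATLAB). The five numbers are taken to be the values of the small example above; the variable names are only for illustration.

```python
# Sample mean written out as (1/n) * (x_1 + x_2 + ... + x_n).
x = [5, 2, 8, 6, 2]        # assumed to be the five values of the small example

n = len(x)                 # number of observations
total = sum(x)             # x_1 + x_2 + ... + x_n
x_bar = total / n          # the sample mean, "x bar"

print(n, total, x_bar)
```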
Graphical interpretation of the sample mean
Let us go back to our standard histogram.
To summarize this we can compute the average value for both men and women.
>> mean(nbeerm(1:Tm))
ans = 7.862500000000000
>> mean(nbeerf(1:Tf))
ans = 4.222222222222222
• Let us compare the means of the Canadian
and Japanese returns
>> mean(canada)
ans = 0.009065420560748
>> mean(japan)
ans = 0.002336448598131
$\sum_{i=1}^{n} x_i$ means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$.
compute y bar.
(Here, we do not sum over all observations: we sum only the second and third.)
Example
1,2,3,4,5 Median = 3
1,1,2,3,4,5 Median = (2+3)/2 =2.5
Example:
1,2,3,4,5 Mean: 3 Median: 3
1,2,3,4,100 Mean: 22 Median: 3
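The examples above can be reproduced with a few lines of Python (a sketch using the standard `statistics` module rather than the MATLAB used elsewhere in the slides). It shows that one extreme value moves the mean a lot but leaves the median alone.

```python
from statistics import mean, median

odd = [1, 2, 3, 4, 5]            # odd count: the median is the middle value
even = [1, 1, 2, 3, 4, 5]        # even count: average the two middle values
with_outlier = [1, 2, 3, 4, 100] # same as `odd`, but the last value is extreme

print("median, odd count: ", median(odd))
print("median, even count:", median(even))
print("mean vs median:    ", mean(odd), median(odd))
print("with an outlier:   ", mean(with_outlier), median(with_outlier))
```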
[Figure: dot plots of x and y on a common scale running from 0.030 to 0.105]
Deviation of each observation from the mean: $x_i - \bar{x}$
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$s_x = \sqrt{s_x^2}$$
This numerically captures the fact that y has “more variation” about its mean than x.
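To make the definition concrete, here is a minimal Python sketch of the sample variance and standard deviation, dividing by n − 1. The data are a small made-up set, and `sample_variance` is a name chosen for this illustration; the result is cross-checked against Python's `statistics` module, which also divides by n − 1.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((v - x_bar) ** 2 for v in values) / (n - 1)

x = [5, 2, 8, 6, 2]        # small illustrative data set
s2 = sample_variance(x)    # sample variance
s = sqrt(s2)               # sample standard deviation

# The standard library gives the same numbers.
print(s2, variance(x))
print(s, stdev(x))
```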
Example 2 (graphical)
The standard deviations measure the fact that there is more spread in the Japanese returns.
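$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is
The empirical rule (for mound-shaped data):
Approximately 68% of the data is in the interval
$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$
Approximately 95% of the data is in the interval
$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$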
The empirical rule will help us understand sx and relate the
summaries back to the histogram
[Figure: histogram (density) of the Canadian returns, with dotted lines at $\bar{x} - s_x$ and $\bar{x} + s_x$ and dashed lines further out; $s_x = .03833$]
The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.
Looks reasonable.
Same thing viewed from the perspective of the time series plot.
[Figure: time series plot of the returns with horizontal lines at $\bar{x}$, $\bar{x} + 2s_x$, and $\bar{x} - 2s_x$]
5% outside would be about 5 points. There are 4 points outside, which is pretty close.
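The same kind of check can be scripted. The sketch below is an illustration in Python on simulated mound-shaped data (it does not use the Canadian returns; the mean and spread passed to `random.gauss` are arbitrary choices). It counts the share of observations within one and two standard deviations of the mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
# Simulated mound-shaped data, NOT the return data from the slides.
data = [random.gauss(0.01, 0.04) for _ in range(10_000)]

x_bar = mean(data)
s = stdev(data)

def share_within(k):
    """Fraction of observations inside (x_bar - k*s, x_bar + k*s)."""
    inside = [v for v in data if x_bar - k * s < v < x_bar + k * s]
    return len(inside) / len(data)

within_1 = share_within(1)   # the empirical rule suggests roughly 0.68
within_2 = share_within(2)   # the empirical rule suggests roughly 0.95
print(round(within_1, 3), round(within_2, 3))
```

For data that are skewed or otherwise not mound-shaped, these shares can be noticeably different.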
A little finance: comparing mutual funds
Let us use the means and standard deviations to compare mutual funds.
For 9 different assets we compute the means and standard deviations.
Then, we plot the means versus the standard deviations.
The assets are:
Variable N Mean StDev
drefus 180 0.00677 0.04724
fidel 180 0.00470 0.05659
keystne 180 0.00654 0.08424
Putnminc 180 0.00552 0.03008
scudinc 180 0.00443 0.03597
windsor 180 0.01002 0.04864
eqmrkt 180 0.01082 0.06856
valmrkt 180 0.00681 0.04800
tbill 180 0.00598 0.00252
[Figure: scatter plot of Mean (vertical axis, 0.004 to 0.011) versus StDev (horizontal axis, 0.00 to 0.09), with one labeled point for each asset in the table]
Let us compare some countries
[Figure: scatter plot of Mean versus StDev for honkong, usa, singapor, france, belgium, germany, australia, finland, canada, italy, and japan. Based on monthly returns from ‘88 to ‘96.]
nbeer   weight   i
12.0    192      1
12.0    160      2
 5.0    155      3
 5.0    120      4
 7.0    150      5
13.0    175      6
 4.0    100      7
12.0    165      8
12.0    165      9
12.0    150      10
 . . .
[Figure: scatter plot of nbeer (vertical axis, 0 to 20) versus weight (horizontal axis, 100 to 200)]
Now we think of each pair of numbers as an observation.
Each pair corresponds to a person.
Each person has two numbers associated with him/her,
# beers and weight.
Each pair corresponds to a point on the plot.
Example:
[Figure: scatter plot of returns in which each point corresponds to a month]
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
[Figures: five scatter plots, each of y against x, with the following correlations]
Correlation of y1 and x1 = 0.019
Correlation of y2 and x2 = 0.995
Correlation of y3 and x3 = 0.586
Correlation of y4 and x4 = -0.982
Correlation of y5 and x5 = 0.210
[Figure: scatter plot of monthly returns, with usa on the horizontal axis]
Correlations:
            japan    usa
usa         0.246
singapor    0.407    0.473
x y
0.07 0.11
0.06 0.05
0.04 0.09
0.03 0.03
First, let us compute the covariance (which is a necessary ingredient to compute the correlation). Here $n = 4$, $\bar{x} = .05$, and $\bar{y} = .07$:
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$= \frac{1}{3}\big((.07-.05)(.11-.07) + (.06-.05)(.05-.07) + (.04-.05)(.09-.07) + (.03-.05)(.03-.07)\big)$$
$$= \frac{1}{3}\big((.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04)\big)$$
$$= \frac{1}{3}(.0008 - .0002 - .0002 + .0008) = \frac{1}{3}(.0012) = .0004$$
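[Figure: scatter plot of y versus x with lines drawn at $\bar{x}$ and $\bar{y}$, splitting the plot into four regions: (I) upper right, (II) lower left, (III) upper left, (IV) lower right]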
Points in (I) have both x and y bigger than their means so we get a
positive contribution to the covariance.
Points in (II) have both x and y less than their means so we get a
positive contribution to the covariance.
In (III) and (IV) one of x and y is less than its mean and the other is
greater so we get a negative contribution. The further out the point is,
the bigger the contribution.
[Figure: scatter plot annotated with the two remarks below]
Lots of positive contributions.
Just a few relatively small contributions.
$$r_{xy} = \frac{.0004}{(.0365)(.0183)} = .6$$
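The hand calculation can be checked with a short Python sketch (plain Python rather than the MATLAB used in the slides). It uses the four (x, y) pairs from the table above and follows the definitions of the covariance and the correlation term by term; the variable names are chosen for this illustration.

```python
from math import sqrt

x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: products of paired deviations, summed, divided by n - 1.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))

# Correlation: the covariance rescaled by both standard deviations.
r_xy = s_xy / (s_x * s_y)
print(round(s_xy, 4), round(s_x, 4), round(s_y, 4), round(r_xy, 2))
```

Because the correlation divides out the units of x and y, it always lies between -1 and 1, which is what makes it easier to interpret than the covariance.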
$$y = c_0 + c_1 x$$
$c_0$: the intercept
$c_1$: the slope
We think of the c’s as constants (fixed numbers) while x and y vary.
Let us look at our temperature example. Suppose we first multiply by (9/5) and then add 32.
>> cel = [ -10 0 10 15 20 25 30 35 ]';
>> mul = (9/5)*cel;
>> fahr = 32+mul;
>> mean([ cel mul fahr])
ans =
15.625000000000000
28.125000000000000
60.125000000000000
>> std([ cel mul fahr])
ans =
15.221577729375776
27.398839912876394
27.398839912876394
[Figure: dot plots of cel, mul, and fahr on a common scale running from 0 to 150]
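The numbers above follow the standard rules for a linear function $y = c_0 + c_1 x$: the mean is $\bar{y} = c_0 + c_1 \bar{x}$ and the standard deviation is $s_y = |c_1| s_x$ (so the variance is $c_1^2 s_x^2$). The Python sketch below checks both rules on the temperature values from the example; it is an illustration, not part of the original MATLAB session.

```python
from statistics import mean, stdev

cel = [-10, 0, 10, 15, 20, 25, 30, 35]   # Celsius values from the example
c0, c1 = 32, 9 / 5                        # intercept and slope
fahr = [c0 + c1 * c for c in cel]         # y = c0 + c1 * x

# Mean of a linear function: y_bar = c0 + c1 * x_bar
mean_direct = mean(fahr)
mean_formula = c0 + c1 * mean(cel)

# Standard deviation: s_y = |c1| * s_x. Adding c0 shifts the data
# without changing its spread, which is why mul and fahr share a std.
sd_direct = stdev(fahr)
sd_formula = abs(c1) * stdev(cel)

print(mean_direct, mean_formula)
print(sd_direct, sd_formula)
```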
Example:
Suppose our movie star also gets 5 percent of
all sales of the CD released with the movie.
How is the star’s income related to the film’s
gross and CD sales (in millions of dollars)?
$$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k$$
y is a linear combination of the x’s.
$c_i$ is the coefficient of $x_i$.
(100)*.5*(1+.1) + (100)*.5*(1+.15)
=100*(1+.5*.1+.5*.15)
$$R_p = w_1 x_1 + w_2 x_2$$
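A quick numerical check of the arithmetic above, as a Python sketch. It assumes, as the numbers suggest, that 100 is invested with half in an asset returning .1 and half in an asset returning .15; the variable names are chosen for this illustration.

```python
# Invest 100, split with weights w1 and w2 across two assets.
wealth = 100
w1, w2 = 0.5, 0.5        # portfolio weights (they sum to 1)
x1, x2 = 0.10, 0.15      # returns on the two assets

# End-of-period value: each half grows at its own rate.
end_value = wealth * w1 * (1 + x1) + wealth * w2 * (1 + x2)

# Portfolio return as a linear combination of the asset returns.
r_p = w1 * x1 + w2 * x2

# Both ways of computing the end value agree.
print(end_value, wealth * (1 + r_p))
```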
This is beautiful (=some people get a kick out of weird stuff!)
Questions:
Let us use our country data and suppose that we had put
.5 into USA and .5 into Hong Kong.
What would our returns have been?
In MATLAB:
>> port = .5*honkong + .5*usa
... Kong.
[Figure: plot of Mean versus StDev showing honkong, usa, canada, and the portfolio port]
What about the standard deviation?
Clearly, forming portfolios is an interesting thing to do!
Suppose $y = c_0 + c_1 x_1 + c_2 x_2$.
Then, $\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2$
The mean returns on USA and Hong Kong are .01346 and .02103.
Knowing what the portfolio returns are, we can easily compute the
mean return for the portfolio (i.e., it is the sample mean of the
portfolio returns): .01724.
We can now confirm the validity of our formula:
.01724 = .5*.01346+.5*.02103
Let us do the same exercise for the variance:
>> cov([ honkong usa port])
Covariances
.0033 = (.25)*(.25)*.00111 + (.75)*(.75)*.0052 + (2)*(.25)*(.75)*(.00103)
$y = .5x_1 + .5x_2$
At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.
[Figure: scatter plot of x2 versus x1, each point labeled with its value of y]
The variances and covariance are:
         x1           x2
x1     1.334636
x2    -1.208679     1.106238
Then, the variance of y is $s_y^2 = (.5)^2 s_{x_1}^2 + (.5)^2 s_{x_2}^2 + 2(.5)(.5)\, s_{x_1 x_2}$
[Figure: scatter plot of x2 versus x1, each point labeled with its value of y]
At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.
The variances and covariance are:
         x1           x2
x1     1.158167
x2     1.046490     0.9609463
Then, the variance of y is $s_y^2 = (.5)^2 s_{x_1}^2 + (.5)^2 s_{x_2}^2 + 2(.5)(.5)\, s_{x_1 x_2}$
Why is the variance of y not so much smaller than those of the x’s?
Example:
$y = .5x_1 + .5x_2$
At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.
[Figure: scatter plot of x2 versus x1, each point labeled with its value of y]
The variances and covariance are:
         x1           x2
x1     1.3870537
x2     0.1976187    0.8247886
Then, the variance of y is $s_y^2 = (.5)^2 s_{x_1}^2 + (.5)^2 s_{x_2}^2 + 2(.5)(.5)\, s_{x_1 x_2}$
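For reference, the variance of $y = .5x_1 + .5x_2$ can be worked out from each of the three variance and covariance tables above with the two-variable formula $s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2c_1c_2 s_{x_1x_2}$. The Python sketch below does the arithmetic; `var_two` and the descriptive labels are names chosen for this illustration, and the inputs are the rounded values printed in the tables.

```python
def var_two(c1, c2, var1, var2, cov12):
    """Variance of y = c0 + c1*x1 + c2*x2 from the variances and covariance."""
    return c1**2 * var1 + c2**2 * var2 + 2 * c1 * c2 * cov12

# (variance of x1, variance of x2, covariance) as printed in the three examples.
examples = {
    "strong negative covariance": (1.334636, 1.106238, -1.208679),
    "strong positive covariance": (1.158167, 0.9609463, 1.046490),
    "weak positive covariance": (1.3870537, 0.8247886, 0.1976187),
}

results = {name: var_two(0.5, 0.5, v1, v2, c)
           for name, (v1, v2, c) in examples.items()}
for name, value in results.items():
    print(name, round(value, 4))
```

With a strongly negative covariance the two x's offset each other and the variance of y nearly vanishes; with a strongly positive covariance there is little offsetting, so the variance of y stays close to those of the x's.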
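Suppose,
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$$
Then,
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$$
With three variables,
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$
the variance of y is
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2\left(c_1 c_2 s_{x_1 x_2} + c_1 c_3 s_{x_1 x_3} + c_3 c_2 s_{x_3 x_2}\right)$$
The first three terms involve the variances; the terms in parentheses involve the covariances.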
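The variance formula extends to any number of variables: sum $c_j c_k s_{x_j x_k}$ over all pairs (j, k), where the pairs with j = k give the variance terms and each covariance appears twice. The Python sketch below checks this on made-up data with three variables; the data, the coefficients, and the helper names `covariance` and `variance_of_combination` are all hypothetical.

```python
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b)) / (len(a) - 1)

def variance_of_combination(coeffs, columns):
    """s_y^2 for y = c0 + sum_j c_j x_j, as the sum over all pairs (j, k)
    of c_j * c_k * s_{x_j x_k}; the j == k terms are the variances."""
    return sum(cj * ck * covariance(xj, xk)
               for cj, xj in zip(coeffs, columns)
               for ck, xk in zip(coeffs, columns))

# Hypothetical data, made up for illustration only.
x1 = [1.0, 2.0, 4.0, 3.0, 5.0]
x2 = [2.0, 1.0, 3.0, 5.0, 4.0]
x3 = [0.5, 2.5, 1.5, 3.5, 2.0]
c0, c = 10.0, [0.2, 0.3, 0.5]

y = [c0 + c[0] * a + c[1] * b + c[2] * d for a, b, d in zip(x1, x2, x3)]

direct = variance(y)                                 # variance of y itself
formula = variance_of_combination(c, [x1, x2, x3])   # via the formula
print(direct, formula)
```

Note that the constant c0 does not enter the variance at all; it only shifts y.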