Basic Prob and Stats
Contents
1 Data
1.1 The method of science and its use of data
1.2 Data and its attributes
1.3 The purpose and content of this course
2 Datasets
2.1 Data1: The Hungama data
2.2 Data2: The Thane census dataset
2.3 Data3: Population of Canada by Age
5 Linear regression
8 Probability
8.1 Basic Definitions
8.2 The three axioms of probability
9 Probability Density Functions
19 Appendix on Regression
19.1 Multivariate Nonlinear Example: Curve Fitting
19.2 Linear regression and method of least squares error
19.3 Regularised solution to regression
Data Analysis and Interpretation
Milind Sohoni and Ganesh Ramakrishnan
CSE, IIT Bombay
1 Data
The modern world, of course, is dominated by data. Our own common perceptions are
governed to a large extent by numbers and figures, e.g., IPL rankings, inflation statistics,
state budgets and their comparisons across the years, or some figures and maps, such as
naxalite-affected districts, forest cover and so on. The use of, and the belief in, data has
grown as the world as a whole, and we in particular, have become more and more industrialized or
'developed'. In fact, most of us even frame our objectives in terms of numeric targets. For
example, the Human Development Index (HDI) is a composite of various sets of data, and
the Millennium Development Goal is for all countries of the world to achieve certain target
numbers in the various attributes of the HDI.
That said, there is much argument amongst politicians, journalists, intellectuals, cricket
players, students and parents, about whether society is becoming too much or too little data-
driven. This matches calls for more subjectivity (e.g., selecting a suitable boy for your sister)
or objectivity (admitting students into colleges). In fact, these arguments are popular even
among national leaders and bureaucrats, where for example, we now have a new area of
study called Evidence-based Policy Design which aims to put objectives ahead of ideology
and studies methods of executing such policies.
Perhaps the first collectors and users of data were the officers of the kings. Much of
the kingdom’s expenses depended on taxes, in cash and in kind, from artisans and farmers.
This called for maintaining records of, say, land productivity over the years, so that the
correct tax rate for the region could be evolved. Also, in the past, ownership of the land
could be tied to the expertise of the owner in ensuring its productivity. This too needed a
careful understanding of data. Note that for data to be put to use, there must be a certain
technical sophistication in understanding (i) what needs to be measured, (ii) how it is to
be measured, (iii) how it is to be used, and finally (iv) whether our conclusions are sound. Thus, for
example, if you have not measured rainfall, or the number of people in the household, then
you may draw wrong conclusions about the productivity of the farmer.
Another early use of data was in astronomy. The measurement of this data required several
sophisticated actions: (i) the universal acceptance of a certain fixed coordinate system,
and (ii) a measuring device to measure the various parameters associated with the objects.
While agricultural data was much about the past, astronomical data was largely about the
future. Using this, astronomers hoped to predict the seasons, eclipses, and so on. Thus,
this involved building models from the given data with certain predictive capabilities. In
fact, even for the simple panchang (the almanac, as known in Maharashtra), there are two
models, viz., the Datey panchang and the more popular Tilak panchang.
• Observe. To observe is different from to see. To observe also assumes a system and a
tool for measurement.
• Document. The observations must be recorded systematically, e.g., as a table or a graph,
so that they can be compared and used later.
• Model. This is the part which wishes to explain the data, i.e., to create a model which
is the first step towards an explanation. This may be causal, i.e., a relationship of cause
and effect, or concomitant, i.e., of coupled variables. It may be explicit, i.e., attempt
to explain one variable in terms of others, or implicit, i.e., a relationship between the
variables which may not be easily separated.
The simplest model will want to explain the observed variable as a simple function of
a classifying attribute, e.g., rainfall > 1000mm ⇒ yield = 1000kg.
• Theorize. This is the final step in the method of science. It aims to integrate the
given model into an existing set of explanations or laws, which aim to describe a set
of phenomena in terms of certain basic and advanced concepts. Thus, for example,
Mechanics would start with the variables position, velocity, acceleration, coefficient of
friction, etc., and come up with laws relating these variables.
We now see our first piece of data in Figure 1. These are the water levels observed
in an observation bore-well managed by the Groundwater Survey and Development Agency
(GSDA) of the Govt. of Maharashtra. This borewell is located in Ambiste Village of Thane
district. On the X-axis are the dates on which the observations were taken, and on the Y-axis,
the depth of the water from the top of the well.
The science here is, of course, Groundwater Hydro-geology, the science of explaining the
extent and availability of groundwater and the geology which relates to it. Since groundwater
Figure 1: The water levels in a borewell (Courtesy GSDA)
is an important source of drinking water for most Indians, almost all states of India have
a dedicated agency to supervise the use of groundwater. GSDA does this for Maharashtra.
One of the core data-items for GSDA is the observation well, i.e., dug-wells and bore-wells
which have been set aside purely for observing their levels periodically.
Let us now see how the four steps above apply to this example. Clearly, merely peering
down a well or a bore-well (which is harder) does not constitute an observation. We see here
that there must have been a device to measure the depth of water and a measuring tape. The
next process is documentation. The above graph is one such documentation, which plots
the water level against the dates of observation. There is one severe problem with our
chosen documentation (found it?), and that is that the scale on the X-axis is not uniform
in time, but equi-spaced by observation count. Thus two observations which are 10 days
apart and two which are two months apart will appear equally far apart on the X-axis. This
will need to be rectified. We see here a periodic behaviour, which obviously matches
the monsoons. Thus, groundwater recharges with the rains and then discharges as people
withdraw it from the ground through handpumps, wells and borewells. The modelling part
could attempt to describe the groundwater levels over time as ideal curves. The science will
attempt to explain these curves as arising out of natural laws.
1.2 Data and its attributes
There are two or three important basic concepts that we will associate with data. These are:
• Variables: A variable is an attribute of any system which may change its value while
it is under observation. For example, the number of people in the age group 75 − 79 in
Canada is a variable. There are two basic types of variables, viz., qualitative variables
and quantitative variables.
• Qualitative: Qualitative variables take on values that are words – they do not take on
numeric values. Examples of qualitative variables include marital status, nationality,
color of skin, gender, etc. Frequently, attributes such as Satisfaction with Service in a
Hotel are quantified, in this case by giving a scale between 1 and 5. It is obviously unclear
if a score of 3 from one customer is better than a 2 from another. Many attributes
may be quantitative at first sight but have a hidden quantification rule, e.g., number
of literates in a village. Here, what should be counted as literacy needs to be defined,
and more importantly, the thousands of census workers must be trained to test people
by this definition.
• Integrity: This is related to the trustworthiness of the data. There could be many
reasons to doubt its veracity: improper measuring instruments, or instruments of insufficient
tolerance, e.g., temperatures reported only as integers (in degrees Celsius) instead of with
one decimal place. Another frequent problem is the interpretation that different measurers
have of the same situation. For example, person A may deem person C
literate while person B may not. Loss of integrity in the data is a severe problem
from which recovery is not easy. Thus it is best that integrity be planned for right at the
very beginning. One caution: a reading which does not fit the model is not
necessarily of less integrity. Most real-life processes are fairly complicated, and trying
to correct a reading which doesn't fit may actually convey a world that is more certain than
it really is. For example, if we had a nice theory relating inflation with stock market
rates, with exceptions for a few years, then it would be wise to look into the history of
those specific years, rather than suspect the data item. Such 'outliers' may prove to
be important.
• Coverage and Relevance: This is whether the data (i) covers the situations that we
wish to explain, and (ii) includes observations on variables which may be relevant but
which we have missed. For example, groundwater levels may depend on the region
and not on the specific location. Thus, the explanation of a groundwater reading may
be correlated with levels in nearby wells, which unfortunately, we have not monitored.
It may also be that groundwater depends intimately on the rainfall in that specific
neighborhood, again, which is not included in the data set.
• Population vs. Sample: This is whether the data that we have is the whole collection
of data items that there are, or a sampling of the items. This is relevant, e.g., when
we wish to understand a village and its socio-economics. Thus, we may visit every
individual and make readings for each individual. This data is then called the popu-
lation data. On the other hand, we may select a representative sample, interview
these selected persons, and obtain their data. This is then called the sample data. It
is not always easy to cover the whole population, for it may be very large (a city such
as Mumbai), or it may be inaccessible (all tigers in a reserved forest), or even unknown or
irrelevant (e.g., when measuring soil quality in an area). In such cases, it is the sample and
the method of selecting the sample which is of prime importance.
There are, of course, many other factors that we have missed in our discussion. These
must be surmised for each situation and must be gathered by interviewing the people who
are engaged in the observations and who are familiar with the terrain or subject matter.
back-of-the-envelope analysis of a situation. Handling data, and correctly interpreting what
it tells us and what it does not, is an important skill.
The course has three main parts.
• Part I: Statistics and Data Handling. This will cover the basic notion of data-sets,
their attributes and relationships. We will introduce the basic terminology of statistics,
such as the sample, and concepts such as the sample mean and sample variance. We
will use the following datasets at different points in the notes for illustration: (a) the Thane
census 2001 data-set, and (b) the population of Canada by age group for the year 2007. We
will also study some elementary methods of representing data, such as scatter-plots
and histograms. Next, we will study the use of Scilab to manipulate data and to write
small programs which will help in representing data and in making our first conclusions.
Finally, we will develop the elements of the least-squares fit and of regression. This is the
first model-building exercise that one does with data. We will uncover some of the
mathematics of this, and also of errors and their measurement.
• Part II: Probability. This is the most mathematical part of the course. It consists
of explaining a standard set of models and their properties. These models, such as the
exponential, normal or binomial distributions, are idealized worlds but may be good
approximations to your data sets. This is especially true of the normal distribution.
The above will be introduced as example characterizations of a formal object called the
random variable. We will also study functions of a random variable and the important
notion of expectation, which is a single numeric description of a data set. This includes
the mean and variance as special cases.
• Part III: Testing and Estimation. This links statistics and probability. The key
notions here are of parameters, and their estimation and testing. A parameter is an
attribute which, we believe, determines the behaviour of the data set. For example, it
could be the rate of decline in the water level of the bore-well. We will uncover methods
of estimating parameters and assigning confidence to these estimates. We will use certain well-known
tests such as the Kolmogorov-Smirnov test, the χ²-test (pronounced chi-squared) and
Student's t-test. We will also outline methods of accepting and rejecting certain
hypotheses made about the data.
2 Datasets
2.1 Data1: The Hungama data
This data set is extracted from the corresponding report. We will primarily be using
this dataset for assignments. It is also worth looking at the detailed survey
methodology, the household survey tool and the village survey tool that form the basis
for data collection using this method.
• TOT: population.
HH          256
TOT-P      1287
P-06        302
TOT-W       716

            MAIN   MARG
TOT-WORK     374    342
CL           193    171
AL           166    170
HH             0      0
OT            15      1

NON-WORK    571
– MAINWORK: main working population. This is defined as people who have worked for more
than 6 months in the preceding year.
– MARGWORK: marginal workers, i.e., those who have worked less than 6 months in the
preceding year.
• NONWORK: non-workers, i.e., those who have not worked at all in the past year. This typically
includes students, the elderly and so on.
• AL: agricultural labourer, i.e., who works for cash or kind on other people’s land.
• HH: household industry, i.e., where production may well happen in households. Note
that household retail is not to be counted here.
You can find the data for Pimpalshet of Jawhar taluka, Thane in Figure 2.
the variable is divided into. Each class interval includes the left endpoint but not the right
(by convention). The population (second column) is recorded in the thousands. The third
column has a record of the percentage of the population that happens to fall in each age
group.
Table 1: A slightly modified estimate of the population of Canada by age group for the year
2007. The population (second column) is recorded in the thousands.
and count the frequencies, i.e., the number of occurrences of data-items in each interval.
The histogram will be our first example of a (graphical) descriptive statistic. A histogram
provides a picture of single-variable data. We will thereafter discuss the scatter
plot (or scatter diagram), which serves as a graphical description of data of two variables.
3.1 Histograms
A histogram is a graphical display of tabulated frequencies. A histogram shows what
proportion of cases fall into each of several or many specified categories. In a histogram, it is
the area of the bar that denotes the value, not the height. This is a crucial distinction to
note, especially when the categories are not of uniform width.
There are four steps to be followed when plotting a histogram for tabulated frequencies
as in Table 1.
1. Compute the percentage of values falling in each class-interval as
$$\text{percentage} = \frac{\text{count}}{\text{total number of values}} \times 100$$
2. Compute the height for each class-interval as
$$\text{height} = \frac{\text{percentage}}{\text{width of the class interval}}$$
as shown in the fourth column of Table 1.
3. Draw the axes and label them. Label the class intervals (age groups in this case) along the
x-axis and the heights along the y-axis.
4. Along each class interval on the x-axis, draw a rectangle of the corresponding height and
width, as shown in Figure 3. This is precisely the histogram for the tabulated data in
Table 1.
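The following is a minimal computational sketch of these steps, given here in Python rather than the Scilab used later in the course; the class intervals and counts are illustrative placeholders, not the actual Table 1 figures.

# Sketch: compute histogram bar heights (percent per unit width) for grouped data.
# The intervals and counts below are illustrative, not the Table 1 values.
intervals = [(0, 5), (5, 10), (10, 15)]        # [left, right) class intervals
counts    = [1740, 1810, 2060]                 # number of items in each interval

total = sum(counts)
percentages = [100.0 * c / total for c in counts]                       # step 1
heights = [p / (r - l) for p, (l, r) in zip(percentages, intervals)]    # step 2: % per unit width

for (l, r), h in zip(intervals, heights):
    print(f"{l}--{r - 1}: height {h:.2f} % per year")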
Figure 3 shows the histogram corresponding to Table 1. Note that the sum total area of
all the bars is 100 (percent).
Histograms for discrete and continuous variables look slightly different. For histograms
of continuous variables, class intervals (such as age ranges) are marked along the x-axis
of the histogram, and the width of a bar can be any positive real number. On the other
hand, histograms of discrete variables generally have a default width of 1 for every bar, and the
different values assumed by the discrete variable are marked along the x-axis. Each bar
must be centered on the corresponding value of the discrete variable, and the height of every
bar is the percentage value.
As another example, consider the taluka of Vasai and the item yi, the number of
households in village i. This is a data-set of size 100.
[Figure 3: histogram for the data in Table 1; x-axis: Age (class intervals 0--4 through 90--95), y-axis: % population per unit interval width.]
The mean is 597, the variance about 3.4 × 10⁵, and the standard deviation 583 (cf. Section 4); the maximum size of
a village is 3152 households. We may construct intervals [0, 99], [100, 199], [200, 299], and so on, and count the number of
villages with the number of households in each interval. This aggregated data may be shown
in a table.
This table may be conveniently represented as a histogram, as in Figure 4. Locate the mean
597 in the diagram and the points µ ± 3σ, viz., roughly 0 and 2200. We notice that there
are very few points outside this range. In fact, this is a routine occurrence, and σ actually is
a measure of the dispersion in the data, so that most of the data is within µ ± 3σ.
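As an illustration of this computation, here is a small Python sketch; the household counts below are made-up placeholders, not the actual Vasai data. It computes the mean, the standard deviation, the fraction of villages within µ ± 3σ, and the aggregation into intervals of width 100.

import numpy as np

# Sketch: summary statistics and the mu +/- 3*sigma check for household
# counts per village (the values below are made-up placeholders).
households = np.array([120, 450, 300, 980, 610, 75, 3152, 400, 520, 260])

mu = households.mean()
sigma = households.std()                     # population SD, i.e. divide by n
low, high = mu - 3 * sigma, mu + 3 * sigma
within = np.mean((households >= low) & (households <= high))
print(f"mean={mu:.1f}, sd={sigma:.1f}, fraction within mu±3sigma = {within:.2f}")

# Aggregate into bins of width 100, as in the text: [0,99], [100,199], ...
bins = np.arange(0, households.max() + 100, 100)
counts, edges = np.histogram(households, bins=bins)
print("villages per interval:", counts)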
While plotting histograms, there is usually ample room for innovation in selecting the
actual variable and the intervals. Here is an example. Consider the data set
composed of the tuples (si, ci, ni, ai) of drinking water schemes for villages in Thane district
sanctioned in the years 2005-2011. Here, ni is the village name, ai is the sanctioned amount,
si is the sanction year and ci is the completion year. There are about 2000 entries in
this data-set. Here is a table to illustrate a fragment of this data:
Figure 4: Number of households in villages in Vasai and Shahpur
Sanction    Completion Year
Year        2005   2006   2007   2008   2009   2010   Incomplete   Total
2005           0      0      3     15     10     13           15      56
2006                  0      6     18     33     63           72     182
2007                         1     11     12     15           36      75
2008                                0     34     55          160     249
2009                                       1     13           83      97
Reading across a row tells us the fate of the schemes sanctioned in a given year, while
reading down a column gives us an idea of the number of schemes completed in a particular year.
We see that there are considerable variations in the data, with 2007 being a lean year and
2008 being an active year in sanctioning, and 2009 in completing. In fact, both these years
did mark some event in the national drinking water policy.
The height plotted along the y-axis of a histogram is often referred to as the density scale. It
measures the 'crowdedness' of the histogram in units of '% per x unit'; the taller the histogram
bar, the greater the density. In the last example, the unit was census. Using Table 1 and
the corresponding density estimates in Figure 3, one can estimate that the percentage of the
population aged between 75 and 77 years of age is around (2.7/5) × 3 = 1.62%. This assumes
that the density of population in the age group 75−79 is evenly distributed throughout the
interval (that is, the bar is really flat). But a close look at the bars surrounding that for 75−79
will suggest that the density in the interval 75−79 is probably not quite evenly distributed.
While it would be accurate and lossless to have population counts corresponding to every age
(instead of intervals), such data may not be as easy to digest as the population estimates
based on intervals. There is a trade-off between summarization and elaborate accounting, or
equivalently, between wider bars and lots of bars.
Figure 5: A sample scatter plot.
student, with its coordinates given by '(marks in year 1, marks in year 2)'. Figure 5 shows the
scatter plot for some such hypothetical data. The dotted vertical and horizontal lines mark
the average marks for year 1 and year 2 respectively. It can be seen from the plot that most
students either performed well in both years or performed poorly in both years.
A point corresponding to an observation that is numerically far away from the rest of
the points in a scatter plot is called an outlier. Statistics derived from data sets that include
outliers can be misleading. Figure 6 shows the scatter plot of Figure 5 with an outlier
introduced (in the form of a black point). The outlier results in a relatively drastic change
in mean values of marks for years 1 and 2. While the mean value along the x-axis drops
from 50.72 to 50.68, the mean value along the y-axis increases from 55.69 to 55.74.
The scatter plot is used for a data-set consisting of tuples (xi, yi) where both are numeric
quantities. For example, we could take Shahpur taluka and let xi be the fraction of literate
people in the i-th village. Thus, xi =P-LIT/TOT-P. Let yi be the fraction of people under 6
years of age, i.e., yi =P-06/TOT-P. Thus, for any village i, we have the tuple (xi, yi) of
numbers in [0, 1]. Now the scatter plot below merely puts a cross at the point (xi, yi). Note
that we see that as literacy increases, the fraction of people under 6 years of age decreases.
However, one must be very careful before assuming causality! In other words, it is not clear that
one caused the other. It could well be that having few children induced people to study.
Figure 6: The sample scatter plot of Figure 5 with an outlier (30, 80).
Warning 1 The reader should be aware that each village is our individual data item. For
example, while calculating the mean literacy, we should add up P-LIT for all
villages and divide by the sum of TOT-P. However, we have chosen not to do this. One
reason is that it tends to drop the identity of the village as a site for many correlations which
cannot be understood at the individual level. For example, suppose that P-LIT=450 and
P-ST=300 for a village with TOT-P=600. At the individual level, it would be impossible from
this data to come up with a correlation between ST and literacy. Thus, for correlation purposes,
it is only the aggregate which makes sense. There is another reason, and that is the lack of
independence. For example, if the overall literacy in Murbad is 0.7, then for a village of
size 300, if an individual's literacy were independent of the others, then the number of literates in
the village should be very close to 210. But that's simply not true. Many large villages will
show substantial deviation from the mean. The reason of course is that the literacy of an
individual in a village is not independent of other individuals in the village.
Not all scatter-plots actually lead to insights. Here is another example, where we plot
the P-06 fraction vs. the size of the village (measured as the number of households). In this
example, we don't quite see anything useful going on.
Figure 7: Population under 6 vs. literacy fractions for Shahpur
Figure 9: A 3-way plot for Shahpur
$$\mu = \frac{\sum_{i=1}^{N} v_i}{N}$$
The mean is also the balancing point of a histogram (cf. Section 3); that is, if you think of
the x-axis of the histogram as the beam of a weighing balance and the weights to be proportional to
the areas of the bars, then the fulcrum placed at the mean point will ensure that the beam
stays horizontal.
Another measure of ‘center’ for a list of numbers is the median. The median is the number
ν such that at least half the numbers in V are less than or equal to ν and at least half the
numbers are greater than or equal to ν. In order to determine the median, you need to sort
the numbers (in either the ascending or descending order) and just pick the center. If more
than one value qualifies as the middle value, their average (arithmetic mean) is taken to be
the median. The median is the point on the x-axis of a histogram such that half of the total
area of the bars lies to its left and half to its right.
As an example, the average of the set V' = {1, 3, 4, 5, 7} is 4, while its median is also
4. On the other hand, if the last number in this set is changed from 7 to 12 to yield the
set V'' = {1, 3, 4, 5, 12}, then the median remains 4, while the mean changes to 5. Thus,
the median cares more about the number of values to its left and right than about the actual
values to its left and right. For the set V''' = {1, 1, 3, 4, 5, 12}, the mean is 13/3, whereas the
median is the average of 3 and 4, which is 3.5. In general, for a symmetric histogram, the
arithmetic mean equals the median. For a longer (shorter) right tail, the arithmetic mean
is greater (smaller) than the median.
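The following small Python sketch reproduces these numbers for the three sets V', V'' and V''' using the standard library.

import statistics

# Sketch: mean vs. median for the sets V', V'', V''' from the text.
for v in ([1, 3, 4, 5, 7], [1, 3, 4, 5, 12], [1, 1, 3, 4, 5, 12]):
    print(v, "mean =", round(statistics.mean(v), 3), "median =", statistics.median(v))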
In most applications, mean is preferred over median as a measure of center. However,
when the data is very skewed, median is preferred over mean as a measure of center. For
example, while computing summary statistics for incomes, median is often preferred over
mean, since you do not want a few very huge incomes to affect your measure of center.
Similarly, median is preferred over mean as a measure of center of housing prices.
Thus,
1. The first single-point estimate of the data set is the mean. This is denoted by
ȳ = Σᵢ yᵢ / n. For example, for the above table, it may be that the mean ȳ is 58.6 kgs.
2. The median is that value ymed such that there are as many items above it as there are
below. In other words, if we were to sort the list, then ymed = y_{n/2}. For the data-set
for Vasai in Figure 4, the median is 403.
3. The mode of a data-set is the value which occurs the most number of times. For a data-
set which has a lot of distinct possibilities, the mode has no real significance. However,
e.g., if (yi) were the number of children in a household, the mode would be important.
For the data-set in Figure 4, a reasonable mode could be read from the histogram and
it would be 250, which is of course the middle value of the interval [200, 300]. A mode
could also be a local maximum in the number of occurrences of a data-item (or a band of
data items).
4. Existence of two or more modes may point to two or more phenomena responsible for
the data, or to some missing information. Consider, for example, the weights of students
in a classroom. Upon plotting the histogram, we may notice two peaks, one in the
range 43-45 and another in the range 51-53. Now, it may be that the class is composed
of students from two distinct cultural groups, with students from one group weighing
more, on the average. Or, even simpler, the girls may be lighter than the boys. Thus,
the data seems to point out that an additional item, e.g., community or sex, should have
been recorded while recording yi.
Example 2 Suppose that we are given data (yi ) as above. Suggest a mechanism of
estimating the two expected mean weights for the two communities/sexes.
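The text leaves Example 2 as an exercise; one possible (hypothetical) mechanism is to alternately assign each weight to the nearer of two candidate means and then recompute the means, i.e., 1-D two-means clustering. A sketch in Python, with made-up weights:

# Sketch: estimate two group means from a single list of weights by
# 1-D two-means clustering (one possible answer to Example 2).
weights = [43.2, 44.1, 44.8, 43.7, 51.9, 52.4, 53.0, 44.5, 52.2, 51.5]  # made-up values

m1, m2 = min(weights), max(weights)          # crude initial guesses
for _ in range(20):                          # alternate assignment and update
    g1 = [w for w in weights if abs(w - m1) <= abs(w - m2)]
    g2 = [w for w in weights if abs(w - m1) >  abs(w - m2)]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)

print(f"estimated group means: {m1:.1f} kg and {m2:.1f} kg")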
The SD of the set V' = {1, 3, 4, 5, 7} is 2, which is a typical distance of any number in V' from
the mean µ = 4. Formulated by Galton in the late 1860s, the standard deviation remains the
most common measure of statistical spread or dispersion. The square of standard deviation
is called variance.
Thus,
1. The variance, $\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}$, is denoted by σ². The standard deviation is simply the
square-root of the variance and is denoted by σ. Note that the units of σ are the same
as that of yi, which in this case, is kilos.
The variance is the first measure of randomness or indeterminacy in the data. Note
that the variance is a sum of non-negative terms, whence the variance of a data set is
zero iff each entry yi is equal to ȳ. Thus, even if one entry deviates from the mean, the
variance of the data set will be positive.
2. Much of quantitative research goes into the analysis of variance, i.e., the reasons by
which it arises. For example, if (yi) were the weights of 1-year-old babies, then the
reasons for their variation will lead us to malnutrition, economic factors, the genetic pool
and so on. A high variance will point to substantial differences in the way that these
children are raised, maybe the health of the mothers when they were born, and so on.
A higher variance is frequently a cause for worry and discomfort, but sometimes it is also
the basis of many industries, e.g., life insurance. If our mortality were a fixed number
with zero variance, then the very basis of insurance would disappear.
Example 4 Let there be two trains every hour from Kalyan to Kasara, one roughly
at xx:10 and the other roughly at xx:50. Suppose that roughly 10 customers arrive at
Kalyan bound for Kasara every minute, and suppose that the discomfort in a train is
proportional to the density. What is the average discomfort?
Solution: Well, for the xx:10 train, there will be 200 customers and for the xx:50 train,
there will be 400 customers. Whence the density at xx:10 is 200 and that for xx:50 is
400. Thus the average density experienced by a customer is (200 · 200 + 400 · 400)/600 = 2000/6 ≈ 333. Thus, we
see that, on the average, there is a train every 30 minutes and so the average density
should be 300; however, since the variance is high, i.e., the departure times are 20
and 40 minutes apart, the average discomfort rises. It is for this reason that irregular
operations of trains cause greater discomfort even though the average behaviour may
be unchanged.
Example 5 For a given data-set (yi), minimize the function $f(\lambda) = \sum_i (y_i - \lambda)^2$.
Example 6 Consider the census data set for Thane and for each taluka, compute the
mean, variance and standard deviation for the number of house-holds in each village.
3. Sometimes you need to be careful with computing the means. Here is an example. Part
II data of the census lists, for each village, whether or not its people have access to tap
water. Thus, let yi = 1 if the i-th village has access to tap-water and yi = 0 otherwise.
If we ask what fraction of the people of Thane have access to tap-water, then we would
be tempted to compute ȳ = Σᵢ yᵢ /n, and we would be wrong, for different villages may
have different populations. Whence we need the data as a tuple (wi, yi), where wi is
the population of the i-th village, and thus the correct answer would be:
$$\mu = \bar{y} = \frac{\sum_i w_i y_i}{\sum_i w_i}$$
Thus, one needs to examine if there is a weight associated with each observation yi.
The variance for this weighted data is similarly calculated as:
$$\sigma^2 = \frac{\sum_i w_i (y_i - \bar{y})^2}{\sum_i w_i}$$
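A short Python sketch of the weighted mean and weighted variance; the village populations and tap-water indicators below are made-up.

import numpy as np

# Sketch: weighted mean and variance, e.g. y_i = 1 if village i has tap
# water and w_i is its population (numbers below are made-up).
w = np.array([300.0, 1250.0, 640.0, 980.0])   # village populations
y = np.array([1.0,   0.0,    1.0,   0.0])     # 1 = has tap water

mu = np.sum(w * y) / np.sum(w)                           # weighted mean
var = np.sum(w * (y - mu) ** 2) / np.sum(w)              # weighted variance
print(f"fraction of people with tap water = {mu:.3f}, variance = {var:.3f}")

# np.average gives the same weighted mean:
assert np.isclose(mu, np.average(y, weights=w))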
What is the effect of modifying the data V on the summary statistics of the data, such as the
arithmetic mean, median and standard deviation? The effects of some common data modifications
are enumerated below.
1. Adding a constant to every number of the data: The effect is that the arithmetic mean
and the median go up by that constant amount, while the standard deviation remains the
same. This is fairly intuitive to see.
2. Scaling the numbers in the data by a positive constant: The effect is that the arithmetic
mean, the median and the standard deviation get scaled by the same positive constant.
3. Multiplying the numbers in the data by −1: The average and the median get multiplied by −1,
whereas the standard deviation remains the same.
4.3 The Chebyshev Inequalities
1. Two-sided: Let N(Sₖ) = number of items such that |xᵢ − x̄| < ks. Then
$$\frac{N(S_k)}{n} \geq 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2}$$
Proof:
$$(n-1)s^2 = \sum_i (x_i - \bar{x})^2 \;\geq\; \sum_{i : |x_i - \bar{x}| \geq ks} (x_i - \bar{x})^2 \;\geq\; (n - N(S_k))\,k^2 s^2$$
$$\Rightarrow \quad \frac{n-1}{nk^2} \geq \left(1 - \frac{N(S_k)}{n}\right)$$
2. One-sided: Let N(k) = number of items such that xᵢ − x̄ ≥ ks. Then
$$\frac{N(k)}{n} \leq \frac{1}{1 + k^2}$$
These inequalities give limits on how 'far' data points can be from the mean. Usually, data sets are more tightly bunched
than Chebyshev's bounds suggest.
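A quick numerical check of the two-sided bound on an arbitrary data set (a Python sketch; the data is randomly generated purely for illustration):

import numpy as np

# Sketch: check the two-sided Chebyshev bound N(S_k)/n >= 1 - 1/k^2
# on an arbitrary data set (random data used here for illustration).
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=1000)

xbar, s = x.mean(), x.std(ddof=1)            # ddof=1 matches the (n-1) convention above
for k in (1.5, 2, 3):
    frac = np.mean(np.abs(x - xbar) < k * s)
    print(f"k={k}: observed {frac:.3f} >= bound {1 - 1/k**2:.3f}")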
1. Convert the values in X and Y into values in standard units, viz., Xsu and Ysu respectively.
Computing standard units requires knowledge of the mean and standard deviation and could
therefore be an expensive step. More precisely,
$$X_{su} = \left\{ \frac{x_i - \mu_x}{\sigma_x} \;\middle|\; x_i \in X \right\} \quad \text{and} \quad Y_{su} = \left\{ \frac{y_i - \mu_y}{\sigma_y} \;\middle|\; y_i \in Y \right\}.$$
2. Form the set Psu of products of corresponding values in Xsu and Ysu, one product per data item.
3. Let µsu be the arithmetic mean of the values in Psu. Then the correlation coefficient r = µsu.
Thus, if µx and µy are the means of x and y, and σx and σy are the respective standard
deviations (see Footnote 8),
$$r = \frac{1}{N}\sum_{x_i \in X,\; y_i \in Y} \frac{x_i - \mu_x}{\sigma_x} \times \frac{y_i - \mu_y}{\sigma_y} \qquad (1)$$
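A small Python sketch of this recipe, with arbitrary illustrative values for x and y; it also checks the result against numpy's built-in routine.

import numpy as np

# Sketch: correlation coefficient computed via standard units, as in (1).
x = np.array([12.0, 15.0, 9.0, 20.0, 17.0])     # illustrative values
y = np.array([30.0, 34.0, 25.0, 41.0, 38.0])

x_su = (x - x.mean()) / x.std()                 # standard units (SD computed with /n)
y_su = (y - y.mean()) / y.std()
r = np.mean(x_su * y_su)                        # mean of the products

print(f"r = {r:.4f}")
assert np.isclose(r, np.corrcoef(x, y)[0, 1])   # agrees with the library routine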
The sample scatter plot of Figure 5 is reproduced in Figure 10 with four regions marked,
which are all bordered by the average lines. Points (xi , yi ) in regions (1) and (3) contribute
as positive quantities in the summation expression for r, whereas points (xi , yi ) in regions
(2) and (4) contribute as negative quantities. The correlation coefficient has no units and is
always between −1 and +1; if r is +1 (−1), the points are on a line with positive (negative)
slope. A simple case for which r = 1 is when all values assumed by y are scalar multiples
of the corresponding values of x. If r = 0, the variables are uncorrelated. Two special but
(statistically) uninteresting cases with r = 0, are when either of the variables always takes a
constant value. Other interesting cases with r = 0 are when the scatter plot is symmetrical
with respect to any horizontal or vertical line.
As an example, the correlation coefficient between the marks in years 1 and 2, for the
data in Figure 5 is a positive quantity 0.9923. On the other hand, the correlation coefficient
between the weight and mileage of cars is generally found to be negative. O-rings are one
of the most common gaskets used in machine design. The failure of an O-ring seal was
determined to be the cause of the Space Shuttle Challenger disaster on January 28, 1986.
The material of the failed O-ring was a fluorinated elastomer called FKM, which is not a
good material for cold temperature applications. When an O-ring is cooled below its glass
transition temperature (Tg), it loses its elasticity and becomes brittle. In fact, the correlation
coefficient between the extent of damage to the O-ring and temperature has been found to
be negative.
Footnote 8: Note that, while for Chebyshev's inequality we assumed $\sigma_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{n-1}}$, generally we will assume that $\sigma_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{n}}$.
Figure 10: The sample scatter plot of Figure 5 reproduced with four regions marked based
on positive and negative contributions to the correlation coefficient.
There are some words of caution that one should exercise while interpreting and applying
scatter plots:
1. Extrapolation is generally not a good idea for determining exact values. This is because,
outside the range of values considered, the linear relationship might not hold.
2. Even if you cannot do extrapolations, scatter plots can be informative and could give
us hints about general trends (such as whether the value of one variable will increase
with increase in value of the other variable).
While correlation measures only the strength of the linear association between two variables,
the relationship could also be non-linear. In such cases, the scatter plot could show a
strong pattern that is not linear (as an example, the scatter plot could assume the shape of
a boomerang) and therefore, the quantity r is not as meaningful.
A word of caution before we move on: correlation does not imply causation, though it could
definitely make one curious about a possible causal relationship. Just because the GPAs of
students and their incomes are positively correlated, we cannot infer that high GPAs are
caused by high incomes or vice versa. There could be a latent cause of both observations,
resulting in the positive correlation. For example, there is generally a positive correlation
between the number of teachers and the number of failing students at a high school. But
this is mainly because, generally, the larger a school, the larger the number of teachers, the greater the
student population, and consequently, the larger the number of failing students. Therefore,
instead of treating the scatter diagram or the correlation measure as a proof of causation,
these could be used as indicators that might possibly signal causation.
4.5 Covariance
For paired data (xi, yi), where µX and µY are the means of the individual components, the
covariance of X, Y, denoted as cov(X, Y), is defined as the number
$$cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{n}$$
It can be shown that the correlation coefficient r, also denoted by corr(X, Y), is:
$$corr(X, Y) = \frac{cov(X, Y)}{\sqrt{cov(X, X)\,cov(Y, Y)}}$$
The first part is a mere computation. The second part is seen by recalling the property
of the inner product on n-dimensional vectors, which says that a · b = kak · kbk · cos(θ), where
θ is the angle between the two vectors.
We see that the correlation of (P-06/TOT-P, P-LIT/TOT-P) is −0.76, while that between
P-06/TOT-P and no-HH is −0.16. A correlation close to 1 or −1 conveys a close match
between X and Y. The correlation of P-06/TOT-P with P-ST/TOT-P is 0.57, thus
indicating that the fraction of children is more tightly correlated with literacy than with
being tribal. Scilab allows a 3-way plot, and we plot the fraction of children with that of ST
and LIT in Figure 9 below.
Example 8 Show that corr(X, Y) = 1 (or −1) if and only if Y = aX + b with a > 0 (or
a < 0). This exercise shows that if the correlation of two variables is ±1, then all points of
the scatter plot lie on a line. Furthermore, the sign of the slope is determined by the sign of
the correlation. Thus, the correlation measures the linear dependence of X on Y (or vice versa).
The effects of a change of scale on the correlation are far simpler than they happened to be for the
arithmetic mean and standard deviation. We will list some effects on r of changing a single
variable. The effects of changing the values of both variables can be derived by systematically
considering the effects produced by changing the value of each variable.
1. When a constant is added or subtracted from every value of any single variable, the
correlation coefficient stays the same, since such an operation involves translating all
points and average lines by the same constant value along the corresponding axis.
Consequently, the relative positions with respect to the closest line (or the standard
units) remain the same.
2. When every value of any single variable is multiplied by the same constant, the corre-
lation coefficient remains the same, since the standard units of the points remain the
same (since the average and SD get scaled by the same amount as the values).
3. When every value of any single variable is multiplied by −1, the signs of the values
of each variable in standard units change (the value and mean change signs, whereas,
the SD does not). Thus, r gets multiplied by −1. However, if each value of both the
variables is multiplied by −1, the overall correlation coefficient will remain unchanged.
4. When the values of x are switched with the values of y, the correlation coefficient stays
the same, since the terms within the summation expression for r in (1) remain the
same.
5 Linear regression
Regression is a technique used for the modeling and analysis of numerical data consisting of
values of a dependent variable y (response variable) and of a vector of independent variables
x (explanatory variables). The dependent variable in the regression equation is modeled
as a function of the independent variables, corresponding parameters (“constants”), and an
error term. However, the relationship between y and need not be causal (as in the case
of correlation). Regression is used in several ways; one of the most often used ways is to
estimate the average y value corresponding to a given x value. As an example, you might
want to guess the inflation next year, based on the inflation during the last three years.
Suppose we have a 2-attribute sample (xi, yi) for i = 1, . . . , n, e.g., where xi was the
ST population fraction in village i and yi was the population fraction below 6 years of age.
Having seen the scatter plots, it is natural to ask if the value of x determines or
explains y to a certain extent, and to measure this extent of explanation. The simplest
functional form, of course, is the linear form y = bx + a, where the constants b, a are to be
determined so that a measure of error is minimized. The simplest such measure is
$$E(b, a) = \sum_{i=1}^{n} (y_i - (bx_i + a))^2$$
Since E(b, a) is a continuous function of two variables, its minimization must be obtained at
a derivative condition:
$$\frac{\partial E}{\partial a} = 0 \qquad \frac{\partial E}{\partial b} = 0$$
These simplify to:
$$2\sum_{i=1}^{n} (y_i - (bx_i + a)) = 0 \qquad 2\sum_{i=1}^{n} x_i\,(y_i - (bx_i + a)) = 0$$
These are two linear equations in two variables. An important attribute of the matrix (where µX is the mean) is
$$\det \begin{pmatrix} \sum_i 1 & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n\sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 = n\sum_i (x_i - \mu_X)^2$$
This shows that the determinant is actually non-zero and positive, and in fact equals n²σ²ₓ. By
the same token:
$$\det \begin{pmatrix} \sum_i 1 & \sum_i y_i \\ \sum_i x_i & \sum_i x_i y_i \end{pmatrix} = n\sum_i x_i y_i - \Big(\sum_i x_i\Big)\Big(\sum_i y_i\Big)$$
which is a close relative of the correlation correl(x,y). It is easy to check (how?) that the
values of b, a obtained above actually minimize the error. Thus, our best linear model or
linear regression y = f(x) is now totally defined. Also observe that f(µX) = µY, i.e.,
the linear regression is mean-preserving. This is seen from the first defining equation ∂E/∂a = 0,
which gives us Σᵢ (yᵢ − (bxᵢ + a)) = 0, which implies that Σᵢ (yᵢ − f(xᵢ)) = 0, and which
is exactly what we have claimed.
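As a computational sketch (Python, with illustrative data), the normal equations above can be assembled and solved directly; the last line checks the mean-preserving property f(µX) = µY.

import numpy as np

# Sketch: least-squares fit y = b*x + a by solving the two normal equations.
x = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85])   # illustrative values
y = np.array([0.21, 0.19, 0.17, 0.16, 0.14, 0.12])

A = np.array([[len(x),  x.sum()],
              [x.sum(), (x ** 2).sum()]])            # the matrix discussed above
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)

print(f"y = {b:.3f} x + {a:.3f}")
print("f(mu_X) =", b * x.mean() + a, " mu_Y =", y.mean())   # mean-preserving check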
Two examples of the best fit lines are shown below, where we use the Census dataset
for Vasai taluka. We map for each village, the fraction of people 6 years old or under as a
function of (i) the literacy, and (ii) the fraction of tribal population in the village. Note that
the sign of the slope matches that of the correlation.
If we denote by eᵢ = yᵢ − bxᵢ − a the error in the i-th place, then (i) Σᵢ eᵢ = 0, and the total
squared error is obviously Σᵢ eᵢ². We will show later that Σᵢ eᵢ² < Σᵢ (yᵢ − µY)². A measure
of the goodness of fit is then r² = 1 − Σᵢ eᵢ² / Σᵢ (yᵢ − µY)².
Figure 11: Regression: Population under 6 vs. literacy and ST fraction
The closer r2 is to 1, the better is the fit. The difference 1 − r2 is the residual or unexplained
error. See for example, the two data-sets for Vasai: (i) ST-fraction vs. Population below 6,
and (ii) male literate fraction vs. female literate fraction.
We now prove the claim that 0 ≤ r² ≤ 1.
$$\sum_i e_i (f(x_i) - \mu_Y) = b\sum_i e_i x_i + a\sum_i e_i - \mu_Y \sum_i e_i = b\sum_i e_i x_i = 0,$$
using the two conditions Σᵢ eᵢ = 0 and Σᵢ eᵢxᵢ = 0. Thus, we see that the n-vectors (eᵢ) and (f(xᵢ) −
µY) are perpendicular, and they sum to (yᵢ − f(xᵢ) + f(xᵢ) − µY) = (yᵢ − µY). Thus we must have Σᵢ eᵢ² ≤ Σᵢ (yᵢ − µY)². In other words,
0 ≤ r² ≤ 1.
Another point to note is that if the input tuple were reversed, i.e., if x were to be explained
as a linear function of y, say x = b'y + a', then this line would be different from the best-fit
line for y as a function of x. To see this, note that bb' ≠ 1 in general. In fact,
$$bb' = \frac{\langle x, y\rangle^2}{\langle x, x\rangle \langle y, y\rangle}$$
(where the inner products are taken of the mean-centred vectors), and thus unless (x, y) are in fact linearly related, bb' < 1, and thus the two lines will be
distinct. See for example below, the two lines for the Vasai female literacy vs. male literacy.
The blue line is the usual line while the red line inverts the role of X and Y . Note that the
point of intersection is (µX , µY ).
Figure 12: Vasai female vs. male literacy. Both way regression.
The best possible linear combination is given by finding those αj which minimize the error
E(α1, . . . , αk). This is done via the equations:
$$\frac{\partial E}{\partial \alpha_j} = 0 \quad \text{for } j = 1, \ldots, k$$
This matrix system is actually invertible (but we will not prove this), and solving it gives the
optimal values of the constants α1, . . . , αk. Let f = Σⱼ αⱼwⱼ be this linear combination and
let e = y − f be the error.
Remark: To see how our earlier linear case is a specialization, note that for the tuple
(xi, yi), our W consists of just two vectors, viz., the vector x = (x1, x2, . . . , xn) and 1 =
(1, 1, . . . , 1). The general linear combination is precisely α1·1 + α2·x, with the i-th entry
(α1 + α2 xi), which after relabelling is (a + bxi).
We see that if 1 ∈ W, then the condition ⟨e, wᵢ⟩ = 0 for all i says that:
$$\langle e, \mathbf{1}\rangle = 0 \;\Rightarrow\; \mu_Y = \Big(\sum_i y_i\Big)/n = \Big(\sum_i f_i\Big)/n = \mu_f$$
Moreover, since f is a linear combination of the wⱼ, we also have ⟨e, f − µf·1⟩ = 0.
This implies that y − f and f − µf·1 are perpendicular, and thus ‖e‖² ≤ ‖y − µY·1‖²; the
error in the approximation does not exceed the variance of the observations y, and
we may thus define r², the goodness of fit, and the residual error similarly.
One useful application of the above formulation is to construct the multi-variable regression.
Suppose that we are given the tuples (xi, yi, zi) for i = 1, . . . , n and we seek a regression of the type
z = ax + by + c. This is computed by considering the set W = {(xi), (yi), 1} and solving for
a, b, c as:
$$\begin{pmatrix} \langle x, x\rangle & \langle x, y\rangle & \langle x, \mathbf{1}\rangle \\ \langle y, x\rangle & \langle y, y\rangle & \langle y, \mathbf{1}\rangle \\ \langle \mathbf{1}, x\rangle & \langle \mathbf{1}, y\rangle & \langle \mathbf{1}, \mathbf{1}\rangle \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \langle x, z\rangle \\ \langle y, z\rangle \\ \langle \mathbf{1}, z\rangle \end{pmatrix}$$
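A sketch of solving this 3 × 3 system numerically (Python; the data is illustrative, not the Shahpur census values discussed next):

import numpy as np

# Sketch: solve the 3x3 inner-product system above for z = a*x + b*y + c.
x = np.array([0.2, 0.5, 0.7, 0.3, 0.9])      # illustrative data
y = np.array([0.6, 0.4, 0.8, 0.5, 0.7])
z = np.array([0.20, 0.17, 0.14, 0.19, 0.12])
one = np.ones_like(x)

cols = [x, y, one]
M = np.array([[np.dot(u, v) for v in cols] for u in cols])   # inner-product matrix
rhs = np.array([np.dot(u, z) for u in cols])
a, b, c = np.linalg.solve(M, rhs)
print(f"z = {a:.3f} x + {b:.3f} y + {c:.3f}")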
One example of the above is given below: the expression of the population fraction below 6 as a
function of the ST-fraction and the literacy fraction for Shahpur gives us the coefficient of literacy
as −0.2, that of the ST fraction as −0.004, and a constant term of 0.227. This indicates that
the ST fraction is actually negatively correlated with the number of children, once literacy is
accounted for. Another interesting statistic is the r² values for the fits of P-06 with male
and female literacy separately. This is shown below for all the talukas of Thane.
Figure 13: The male and female literacy and P-06
1. In the case of the scatter diagram in Figure 14, one could make use of regression to
estimate the average possible marks of a student in ‘year 2’, given her marks in ‘year 1’.
In this case, the correlation coefficient (0.993) is very close to 1. The mean values for x
and y are 60.26 and 60.33 respectively while the corresponding standard deviations are
10.48 and 10.54 respectively. The graph that shows the average y value corresponding
to each x value is called the graph of averages. It can be seen in Figure 14, that the
average values guessed for y corresponding to different x values are very close to a line
and have very little variation across the line. Such a line is called the regression line.
The regression line is a smoothed version of the graph of averages.
3. In the case of two uncorrelated variables (i.e., r ≈ 0), such as a person's height and his
average scores in mathematics, the average value of y given any particular value of x
will be more or less the same. The variance will also be large.
Figure 14: A scatter plot for data similar to that in Figure 5, along with the regression
line. The mean values for x and y are 60.26 and 60.33 respectively while the corresponding
standard deviations are 10.48 and 10.54 respectively. The correlation coefficient is 0.993.
Figure 15: A scatter plot for the marks of students of some class in mathematics (x-axis)
against physics (y-axis), along with the regression line. The mean values for x and y are
60.34 and 60.01 respectively while the corresponding standard deviations are 9.82 and 12.06
respectively. The correlation coefficient is 0.801.
It turns out actually that if x is one SD above its average µx, the corresponding predicted
value of y is r SDs above its average µy. The point (µx, µy) is in fact termed the point of
averages. In the third example discussed above, with r ≈ 0, the predicted y hardly changes
with the value of x. For the first example, the predicted y changes the most with the value
of x.
The linear regression method concerns itself with finding points on the regression line, for
a given table of data values with each row consisting of values of the independent variable x
and the dependent variable y for a particular data point. The following three steps comprise
the linear regression method.
1. Compute the correlation coefficient r between x and y. For the data in Figure 14, the
correlation coefficient was 0.993.
2. Convert any given value of x into its standard units. Multiply this number by r to get
the predicted value of y in its standard units. This step is essentially the regression
step. For example, with respect to Figure 14, consider x = 76. This number in standard
units will be (76 − 60.26)/10.48 = 1.50. The predicted value of y in standard units will be
1.50 × 0.993 = 1.49.
3. Convert the value of y from standard units into normal units. The value of 1.49 for
y in standard units corresponds to 1.49 × 10.54 + 60.33 ≈ 76.03.
The regression line discussed above is for predicting value of y based on x (the usual
notion of line). It is also possible to perform regression to predict value of x based on value
of y, to yield another kind of regression line.
As an exercise, let us predict (using linear regression on the data in Figure 14) the marks
for a student in year 2, given that she obtained 63 marks in year 1. The predicted value for
marks in year 2 will be $60.33 + 0.993 \times \frac{63 - 60.26}{10.48} \times 10.54 = 63.07$, which compares favourably
with the actual value of y = 63.53.
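A small Python sketch of the three-step prediction, using the means, SDs and r quoted for Figure 14; it reproduces the 63.07 above and the x = 76 example from the list.

# Sketch: regression prediction via standard units (numbers quoted for Figure 14).
mu_x, sd_x = 60.26, 10.48
mu_y, sd_y = 60.33, 10.54
r = 0.993

def predict(x):
    x_su = (x - mu_x) / sd_x          # step 1: x in standard units
    y_su = r * x_su                   # step 2: multiply by r
    return mu_y + y_su * sd_y         # step 3: back to normal units

print(predict(63))                     # about 63.07
print(predict(76))                     # about 76.0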
As another exercise, with respect to the scatter plot in Figure 14, suppose a student is at the
31st percentile in year 1. What is a good guess for this student's percentile rank in year 2?
We note that the scatter diagram is 'football shaped' (a tilted oval), which is an indication
that the two variables roughly follow a normal distribution (see Footnote 9). First we determine the
standard units for x corresponding to the 31st percentile by looking up the value in standard
units corresponding to an area of (100 − 31)/100 = 0.69 in the standard normal table. The value
happens to be around 0.50. Multiplying this by the correlation coefficient r = 0.993, we get
the standard units of the approximated y value to be 0.497. The corresponding percentile
rank can again be obtained using the normal table as 100 × (1 − 0.6879) = 31.21, i.e.,
around the 31st percentile. Note that we really did not make use of any of the estimates for the mean or
standard deviation, since all computations dealt only with standard units.
Footnote 9: Later, we will discuss techniques that will help verify such assumptions.
What about the spread of values around the average predicted by linear regression? How
accurate is regression? For each point in the scatter diagram, the actual value of y will
not typically be on the regression line. The prediction error at a point x is defined as the
difference between the actual and predicted values of y at that point. Analogous to the
definition of the standard deviation, the spread for linear regression is measured as the RMS
of prediction errors and is called the RMS error. Linear regression should be suitable for
datasets for which the RMS error is small compared to the RMS of the deviations from the average across
all points. While the RMS error can be used in the context of any prediction method, for the
case of linear regression, the RMS error can be simplified to the following formula:
$$e_{rms} = \sigma_y \sqrt{1 - r^2} \qquad (2)$$
(2) to be $\sqrt{1 - (0.993)^2} \times 10.09 = 1.19$. The question then reduces to 'what percentage of
values on a normal curve with mean 65.06 and SD 1.19 are above 67?' The answer to this
can be easily obtained by looking up the standard normal table for the percentage area that
is above 1.63 standard units, which is 100 × (1 − 0.9484) = 5.16%.
Further, a football shaped scatter diagram is homoscedastic, i.e., it has approximately
the same spread around the regression line for every value of x. To cite a practical example,
we would expect different divisions within a class to have the same mean and SD values for
scores in a particular year, irrespective of their sizes. Of course, divisions with smaller sizes
may have a smaller range of scores than larger divisions.
You can refer to the Appendix for a more detailed discussion on Regression including (a)
regularization (b) drawbacks of regression.
6.3 SD Line
The SD line is a line that passes through the point of averages in a scatter plot and has slope
equal to the ratio of standard deviation of y to that of x multiplied by sgn(r). Points on the
SD line are the same in standard units for both x and y. It can be proved that if r = 1, the
SD and regression lines will coincide.
Figure 16: The Lorenz plot for the company data
for formal education, and then xi would be the number of people having i years of formal
education.
Now, we would like to measure the inequality in the data. Our first step is to assume that
the yi's are sorted, i.e., y1 < y2 < y3 < . . . < yn. Next, let Xᵢ = Σⱼ₌₁ⁱ xⱼ; in other words, Xᵢ
is the number of people with values less than or equal to yi. Let X = Xn be the number of
people in the sample. Next, we define Yᵢ = Σⱼ₌₁ⁱ xⱼ·yⱼ, i.e., the net value for the first i groups
of people. Let Y = Yn, the total value of the population. The Lorenz curve is the plot which
begins at (0, 0) and plots the points (Xi/X, Yi/Y).
Example 9 A company has 100 employees at various levels. The number of employees at
each level and their salaries are given below:
No. of Employees 60 25 10 4 1
Pay (in lakh Rs.) 1 1.5 2.5 4 8
We thus see that X = 100, Y = 146.5 and the plots for the Lorenz curve will have the
following data:
It is easy to see (show this as an exercise) that Yi/Y < Xi/X, i.e., the Lorenz curve
always sits below the 45-degree straight line joining (0, 0) with (1, 1). Note that in the above
Figure 17: The Lorenz plot for Murbad literacy
example, if the salaries were more equal, then the Lorenz curve would be closer to the 45-degree line.
The Gini coefficient is the ratio of the area A between the Lorenz curve and the 45-degree
line to the area below the line. Since the area under the line is 0.5, the Gini coefficient is exactly
2 · A. The Gini coefficient is easily computed using the trapezium rule, as follows:
$$G = 2 \cdot A = \sum_{i=1}^{n} \frac{x_i}{X}\left[\left(\frac{X_i}{X} - \frac{Y_i}{Y}\right) + \left(\frac{X_{i-1}}{X} - \frac{Y_{i-1}}{Y}\right)\right]$$
This is available as a function gini.sci which takes as input a matrix of two columns, where
the first column contains the xi's and the second column the yi's. Make sure that the second
column is increasing. It turns out that our company has a Gini coefficient of 0.245.
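The course provides this computation as the Scilab function gini.sci; the following is a hypothetical Python re-implementation of the same trapezium-rule formula, checked against the company example (it gives about 0.245).

# Sketch: Gini coefficient of grouped data by the trapezium rule
# (a hypothetical Python analogue of the gini.sci routine mentioned above).
def gini(x, y):
    # x[i] = count in group i, y[i] = value per member; y must be increasing
    X = sum(x)
    Y = sum(xi * yi for xi, yi in zip(x, y))
    g, cum_x, cum_y, d_prev = 0.0, 0.0, 0.0, 0.0   # d_prev = X_0/X - Y_0/Y = 0
    for xi, yi in zip(x, y):
        cum_x += xi
        cum_y += xi * yi
        d = cum_x / X - cum_y / Y                  # gap below the 45-degree line
        g += (xi / X) * (d + d_prev)               # trapezium contribution, times 2 (G = 2A)
        d_prev = d
    return g

print(gini([60, 25, 10, 4, 1], [1, 1.5, 2.5, 4, 8]))   # about 0.245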
Let us try another example, for aggregate data. For the Murbad taluka census data, we
have for each village i its population (TOT-P) and the number of literates (P-LIT). The i-th
village literacy fraction yi is then given by P-LITi/TOT-Pi. Let us denote TOT-Pi by xi. Let
us understand what this tuple data (xi, yi) and its Gini coefficient would mean. Since the
data is aggregated for each village, we will measure the inequality in the literacy levels across
villages. This will smoothen out the education levels within the village, at the individual
level. For Murbad, we see that the coefficient is 0.0878, which is quite small. This is also
evident from the histogram, which is bunched around the mean. The plots appear below.
Warning 10 The Gini coefficient must be used with care. For aggregate data, it will tend
to under-compute the inequality. You should try this for, say, Part II data, e.g., total agricultural
land. The Gini may be quite low but may hide the fact that within each village, land may be
concentrated in very few households. So unless household data is available, the inequality in
land ownership cannot be measured.
8 Probability
The notion of probability comes from a random variable, which is just an abstract data
source. Think, for example, of a cannon which may be fired repeatedly. Every firing i will
yield a transit distance di of the cannon ball. Clearly, as there are variations in the sizes
and weights of the cannon balls, variations in the wind conditions, and so on, the di's will not
all be equal. All the same, repeated observation will indeed give us
an estimate of the range of the cannon.
We can now loosely define a random variable X as (i) an outcome set S, (ii) a collection
E of subsets of S, called the event set, and (iii) a probability function p : E → ℝ, all
with certain properties. For E, we must have that (E1) S ∈ E, and (E2) if A, B ∈ E, then so
are A ∩ B and Ā, the complement of A. These conditions say that the subsets in E are
closed under boolean operations. More formally:
Definition 12 Event (E): An event is defined as any subset of the sample space. The total
number of distinct events possible is 2^|S|, where |S| is the number of elements in the sample
space. For a coin-pair toss experiment, some examples of events could be
Definition 13 Random variable (X): A random variable is a mapping (or function) from the
set of outcomes to a set of real numbers. A continuous random variable is defined thus:
X : S → ℝ
On the other hand, a discrete random variable maps outcomes to a countable set (e.g., a discrete
set of real numbers):
X : S → a countable subset of ℝ
Now, we move to the probability function P r. Probability P r is a function P r : 2^S → R.
It must have the following properties: (P1) P r(A) ≥ 0 for all A ∈ E, (P2) P r(φ) = 0 and
P r(S) = 1, and (P3) if A ∩ B = φ then P r(A ∪ B) = P r(A) + P r(B). More formally:
Example 14 The biased coin. Here we construct the random variable C(q) corresponding
to the biased coin. Let S = {H, T }, i.e., heads or tails, be the only possible outcomes of a
coin toss. Let E be the set of all possible (i.e., 2^S ) subsets of S, and let 0 < q < 1 be a fixed
real number. We define P r by the table below:
A        φ    {H}    {T}    {H, T}
P r(A)   0     q     1 − q     1
This merely says that the probability of the coin falling H is q, of T is (obviously) 1 − q, of
not falling at all is zero, and of falling either H or T is 1.
Example 15 The cannon-ball. Here, let S = [100, 101], i.e., the possible outcomes are
all real numbers between 100 and 101. Let E be the collection of all sub-intervals, open or
closed, of [100, 101] and their unions. For an interval [a, b] we define p([a, b]) = b − a. This
random variable CB simulates the falling of a cannon ball. It says that the cannon ball will
always fall between 100m and 101m from the cannon and the probability that a particular
trial falls within the interval [a, b] is in fact b − a. For example, the probability of the ball
falling between [100, 100.2] or [100.5, 100.7] is equal and 0.2. In other words, every outcome
between 100 and 101 is equally likely.
Two random variables X and Y are called independent if the outcome of one does not
affect the outcome of the other. Here are some dependent random variables. Let B be a box
containing k red and n − k black balls. Let us first draw one ball and note its (random)
colour as X1 and throw it away. Next, let us draw a second ball and denote its colour by the
variable X2 . Note that as individual random variables, X1 and X2 are identical, viz., the
probability of a red ball is k/n. However, they are certainly not independent. If we know the
outcome of one then we do know something more about the outcome of the other. Another
example is when X is the time that you will wait for your bus and Y is the time elapsed since
the last bus, measured at the instant that you show up at the bus-stop. Another example is
say the life-expectancy of one resident of a village with that of another in the same village.
We will not study independence formally but assume an informal understanding that
one should be careful before assuming that two random variables are independent.
We will denote by E' the collection of all open/closed intervals and their disjoint unions.
Verify that it satisfies conditions (E1) and (E2). When S is a finite set, we assume that E is the
collection of all subsets of S. Note that p is then defined by specifying its value on singletons,
i.e., p({s}) (this we abbreviate as p(s)) for all s ∈ S. For if A = {s1 , . . . , sk }, then p(A) is
clearly p(s1 ) + . . . + p(sk ).
Next, let us construct new random variables from old. The simplest is the cross product.
If (S1 , E1 , p1 ) and (S2 , E2 , p2 ) are two random variables, then we can construct the product. We
define S = S1 × S2 , E as the collection of sets which includes all products A × B with A ∈ E1 , B ∈ E2 , and define p(A × B) = p1 (A)p2 (B).
Example 16 Let us look at C(q) × C(r). This corresponds to two independent coin throws,
where one coin has bias q and the other r. We see that S = {HH, HT, T H, T T } and
p(HH) = p1 (H)p2 (H) = qr, while p(HT ) = p1 (H)p2 (T ) = q · (1 − r), and so on.
We may construct CB × CB, i.e., the random variable corresponding to two independent
ordered cannon ball firings. Clearly the outcome set is [100, 101] × [100, 101], i.e., the unit
square situated at (100, 100). The probability p([100, 100.2] × [100.3, 100.4]) = 0.2 × 0.1 =
0.02. Thus the probability of the first shot falling in the first named interval and the second
in the second interval is 0.02.
There is another technique of constructing random variables. Let R = (S, E, p) be a
random variable and let S' be another set and f : S → S' be an onto function. We define the
new variable R' = (S', E', p'), where S' is as above. We say that A' ∈ E' iff f^{−1}(A') ∈ E, and
when this happens, we define p'(A') = p(f^{−1}(A')).
Let us now construct our first important example and that is the Binomial random
variable Binom(q, n).
Definition 17 The variable Binom(q, n) has the outcome set [n] = {0, 1, . . . , n} with
p({k}) = \binom{n}{k} q^k (1 − q)^{n−k} . The binomial random variable arises from the n-way repeated trials of C(q),
i.e., C(q) × . . . × C(q). Note that the sample space of this product is S^n , which is the collection
of 2^n sequences in H and T, the i-th letter corresponding to the fall of the i-th coin. Now consider the
map f : S^n → [n] where each sequence goes to the number of H’s in it. For example, for
n = 4, f (HHT H) = 3 while f (T T HH) = 2 and so on. Thus, the function f merely counts
the number of heads. Now, if we consider any k ∈ [n], then f^{−1}(k) is precisely the set
of sequences with k heads, and the probability of the occurrence of k heads in an n-toss of a
biased coin is then precisely the number above.
Here is an example where Binom(q, n) will find use. Suppose that we must judge the
fraction q of tribals in a large village. One test, if we are unable to survey the whole village,
would be to take a sample of n people (more about sampling later), and count the number
of tribals, say k. Whence, if q were this fraction, then the chance of our meeting exactly
k tribals from a sample of n is exactly nk q k (1 − q)n−k . We will see later that k/n is a
reasonable estimate of q.
Example 18 The uniform random variable. Let S = [0, 1] and let f (x) = 1 for
x ∈ [0, 1] and zero otherwise. We see that for any sub-interval [c, d], p([c, d]) = ∫_c^d 1 dx =
d − c. If we wished to construct the uniform random variable over another interval [a, b],
then f (x) = 1/(b − a) for x ∈ [a, b] would do the job, and then, as expected, p([c, d]) = (d − c)/(b − a).
Example 19 Here is a more interesting case. Let S = [0, 1] and let f (x) = 2x for x ∈ S and
zero otherwise. We see that ∫_S 2x dx = [x²]_0^1 = 1 − 0 = 1. Also, f (x) ≥ 0 for all x, and thus
f defines the pdf of a random variable. We see that p([0, 0.5]) = 1/4 while p([0.5, 1]) = 3/4
and thus this random variable prefers higher values than lower ones.
We now come to the famous Normal or Gaussian random variable. The outcome set for
this is R, the whole real line. Let
f (x) = (1/√(2π)) e^{−x²/2}
This is a curious function which arises from classical mathematics and is plotted as the red
curve in the image below (from wikipedia). We see that the curve is smooth and symmetric.
The integral ∫_R f (x)dx is known to be 1. We see that the normal random variable allows for
all real numbers as outcomes but prefers smaller numbers (in absolute value) to bigger ones.
The values of the integral ∫_a^b f (x)dx are rather hard to calculate analytically and are usually
tabulated. We see for example that p([−2, 2]) = ∫_{−2}^{2} f (x)dx = 0.95, roughly. As can be seen
from the graph below, most of the area under the red curve is indeed between −2 and 2. In
terms of randomness, we see that the chance that the random outcome is in [−2, 2] is about
95%.
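This number is easy to check in Scilab with the cdfnor function (described in Section 18); the snippet below is a small illustration.
// P(-2 <= X <= 2) for the standard normal, via the cumulative distribution.
[P2, Q2] = cdfnor("PQ", 2, 0, 1);
[Pm2, Qm2] = cdfnor("PQ", -2, 0, 1);
disp(P2 - Pm2)          // about 0.954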
The above density function is usually denoted by N (0, 1). The general function is N (µ, σ)
and is given by:
N (x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
Assuming that ∫ f (x)dx = 1, it is easily shown that N (x; µ, σ) also gives a density
function. This is called the normal density function with mean µ and variance σ² .
The figure shows some plots for various µ and σ 2 . We see that µ decides the center value
around which the random variable is symmetric. Increasing σ increases the spread of the
outcomes. For example,
∫_{−2}^{2} N (x; 0, 2)dx = ∫_{−1}^{1} N (x; 0, 1)dx ≈ 0.68
Figure 18: The Normal density function (from wikipedia)
The product of two continuous random variables, with densities f1 and f2 , has as outcome set all pairs (x, y), in other
words, the plane R². Whence, the probability that x ∈ I and y ∈ J would be ∫_I f1 (x)dx ·
∫_J f2 (y)dy. Thus, the density function for the cross-product is merely f (x, y) = f1 (x)f2 (y)
with the outcome set R × R.
Example 20 Let us pick two random numbers uniformly between 0 and 1, say x1 and x2 .
Let x = max(x1 , x2 ). What is the probability that 0 ≤ x ≤ b? To solve this, let us look at the
random variable z = (x1 , x2 ) where each x is uniform over [0, 1]. Thus, the density function
of z = (x1 , x2 ) is merely f (x1 , x2 ) = f1 (x1 )f2 (x2 ), which is 1 · 1 = 1. Note that the function
f is zero outside the unit square and that ∫_0^1 ∫_0^1 f (x1 , x2 )dx1 dx2 = 1.
Next, we see that for the maximum of (x1 , x2 ) to be less than b, both x1 ≤ b and x2 ≤ b,
and thus, the probability of this event is b2 . See Fig 9 below.
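As a quick sanity check, the event can be simulated in Scilab; the sketch below uses grand to draw uniform numbers, with the number of trials and the threshold b chosen arbitrarily by us.
// Simulating x = max(x1, x2) for x1, x2 uniform on [0, 1]: P(x <= b) should be b^2.
n  = 100000;                       // number of trials (our choice)
x1 = grand(1, n, "def");           // uniform random numbers in [0, 1)
x2 = grand(1, n, "def");
x  = max(x1, x2);                  // element-wise maximum
b  = 0.7;
disp(sum(x <= b) / n);             // observed fraction, close to ...
disp(b^2);                         // ... the predicted probability 0.49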
One common operation is a scale and translate of an existing random variable. Thus,
for example, Y = aX + b, where f (x) is the density function for X. In other words,
f (x)dx is the probability that X lies in the interval [x, x + dx]. Now, if Y ∈ [y, y + dy],
then X ∈ [(y − b)/a, (y − b)/a + dy/a]. Thus, if fY (y) is the probability density function of Y , then
fY (y) = (1/a) f ((y − b)/a). We see, for example, that
N (x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} = (1/σ) N ((x − µ)/σ; 0, 1)
In other words, the Y = N (µ, σ) random variable is related to the variable X = N (0, 1)
by Y = σX + µ.
Figure 20: The temperature at Shimla as seen by a thermometer
Another common operation is restriction. Assume that X is a random variable with
density function f (x) and outcome set S ⊆ R. Now consider the random variable Y , where
Y only reports X if it lies in a sub-range [a, b] of S. For example, Let X represent the
temperature at Shimla on 1st of January over the years. However, our thermometer measures
temperatures in the interval [−3, 15] and reports an error if the temperature lies outside this
interval. Let Y be the reported temperature by this thermometer, whenever an error does
not occur. Thus Y is a restriction of X to the interval [−3, 15]. Now suppose that X was
actually N (2, 4), i.e., normal with mean 2 and standard deviation 4. What would be the
density function of Y ? If fY is the density function of Y , then clearly, it must be zero outside
[a, b]. Next, it must mimic the shape of f within this interval, i.e., must be a multiple of f ,
i.e., fY (x) = αf (x) when x ∈ [a, b], for a constant α. This is determined easily by requiring
Rb Rb Rb
that a fY (x)dx = α a f (x)dx = 1. Thus, we see that α = 1/ a f (x)dx.
For our example, the Shimla temperature variable is shown in blue in Figure 9 below.
The range [−3, 15] is marked in red. α turns out to be 1/0.896 which is 1.11. Thus, fY is a
scaled version of f in the interval [−3, 15] and is shown in red.
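The constant α can be computed directly with cdfnor; the following is a small sketch of that calculation.
// Truncation constant for X ~ N(2, 4) restricted to [-3, 15]: alpha = 1 / P(-3 <= X <= 15).
[Pu, Qu] = cdfnor("PQ", 15, 2, 4);
[Pl, Ql] = cdfnor("PQ", -3, 2, 4);
disp(Pu - Pl);                     // about 0.89
disp(1 / (Pu - Pl));               // alpha, about 1.1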
10 Data and probability models
The basic use of probability models is to simulate real data and to predict the effect of
certain interventions with a level of confidence. Here is a concrete example.
Example 21 A literacy program was implemented in 120 revenue villages in the eastern
part of Shahpur, which has a total of 222 revenue villages. The program entailed a teacher
training program, introduction of new kits and so on. The program director wishes to conduct a quick
and economical mid-term appraisal of the program now that 1.5 years have elapsed. Please
come up with a project plan for this task and list the technical outcomes.
It is clear that this calls for understanding the status of the villages which were a part
of the program and comparing it with others in the taluka which were not. Next, perhaps, a
sample of the 120 program villages will be taken up for a detailed (and expensive) survey. The
selection of these villages is crucial to make a concrete assertion, with a level of confidence,
on the success of the program. It is in this confidence assertion where known probability
models become very useful, for here these calculations can be done a priori and a testing
regime designed based on these assumptions.
The first task is of course, to check if the data that you have matches with some known
probability density function. We shall briefly examine this question. The first point is to
check that most standard density functions can be programmed on a computer and repeated
trials generated. In other words, for any density function, we may produce a virtual cannon
which will fire according to that density function. For the standard ones, such as Binomial or
normal, Scilab provides ready-made function grand with appropriate arguments and inputs,
see Section 18. Let us use grand to generate 200 random numbers distributed according to the binomial
density function with N = 100 and q = 0.6. After generating this sample of 200, let us plot
it as a histogram with bins of width 2, i.e., {k, k + 1} for even k. Let us also plot the expected
number of occurrences, which will be 200 ∗ (pr(k) + pr(k + 1)), where pr(k) is the probability
that the binomial random variable with q = 0.6 and N = 100 will yield k. This combined
plot is shown below in Fig. 10. We see in the plot that the numbers of
actual outcomes match fairly well with the predicted numbers. Moreover, the maximum is close
to 60 = 0.6 ∗ 100.
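A sketch of this experiment in Scilab is given below; the bin edges and plotting commands are our own choices and the exact figure will differ from run to run.
// 200 draws from Binom(N = 100, q = 0.6), histogram of width 2, with expected counts.
N = 100; q = 0.6; trials = 200;
sample = grand(1, trials, "bin", N, q);            // the 200 binomial draws
histplot(0:2:N, sample, normalization=%f);         // observed frequencies
k = 0:N;
// binomial probabilities pr(k+1) = C(N,k) q^k (1-q)^(N-k), computed via log-gamma
logpr = gammaln(N+1) - gammaln(k+1) - gammaln(N-k+1) + k*log(q) + (N-k)*log(1-q);
pr = exp(logpr);
expected = trials * (pr(1:2:N) + pr(2:2:N+1));     // 200*(pr(k)+pr(k+1)) for even k
plot(1:2:N, expected);                             // expected counts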
We try it next with the normal density function with mean 1 and SD 0.6. We plot for
1200 trials and 200 trials as below in Fig. 10. We see the important outcome that as the
number of trials increases, the observed numbers match the predicted numbers much
better.
We now consider the case of real data and checking if it matches known density functions.
Figure 21: The binomial sample and expectation
Figure 23: The normal fit to Murbad and Shahpur HH data
Let us start with the case of the number of households per village in Murbad taluka. After
several attempts, we see that N (135, 60), i.e., the normal density function with mean 135
and standard deviation 60 (plotted in blue) fits the data fairly well. The actual mean and
SD of the data set are 145 and 85 respectively. We plot that in red. As we see, this is not
as good, for many reasons. Firstly, we see that the data naturally has a truncation effect,
i.e., there cannot be any villages with a negative number of households. This truncation also
causes a change in the variation which is not very predictable. So, the question remains:
is the observed data from N (135, 60) or not, and with what confidence? Such questions are
important and are tackled through specific tests. One of them is the Kolmogorov-Smirnov
test which we will discuss later. We also note that the Shahpur households don't quite fit
the normal density function.
We may try the same with some other attributes. Below, in Fig. 10 we have the female
literacy fraction for various villages of Shahpur. The mean and SD of the data are 0.428 and
0.136 respectively. This is plotted in blue. The best suited (according to my eyes) is with
mean and SD 0.43 and 0.12 respectively. This is plotted in magenta. Of course, not all data
sets are so normalizable. See for example, the ST-fraction for Shahpur. We see that far from
being close to normal, it in fact shows bi-modal behaviour, i.e., with two peaks, at close to
0 and at close to 1. This indicates that Shahpur villages are fairly divided into those which
are largely ST and those which are largely non-ST.
Example 22 Write scilab code to obtain each of the above plots. Also, consider the question
of verifying whether ST communities tend to have better sex-ratios than non-ST communities.
How would you test the above proposition?
Figure 24: Shahpur female literacy fraction and ST fraction
11 Functions and expectation
In this section, we will delve deeper into the theory of random variables. For the purpose
of this section, we will assume that the outcome set of our standard random variable is R
and is given by density functions f and so on. In other words, for an interval I, we have
p(I) = ∫_I f (x)dx.
Frequently, we have a function g : S → R. This g may represent a value g(s) that we
attach to each outcome s ∈ S. For example, S = {HH, T H, HT, T T }, and g(HH) = 4
while g(T T ) = g(T H) = g(HT ) = −1. This may model the outcomes of a game of two coin
tosses with two heads fetching Rs. 4 while any other outcome resulting in a loss of Rs. 1.
Example 24 For the example above, for an unbiased coin, we have p(HH) = p(HT ) =
p(T H) = p(T T ) = 0.25, whence E(g) = 0.25 · 4 + 0.75 · (−1) = 0.25. Thus, the game is expected to benefit you Rs.
0.25 every time you play it.
Example 25 Let X be the uniform density function on [0, 1] and let Y = X × X. Thus
fY (x1 , x2 ) = 1 for all x1 , x2 ∈ I. Let g(x1 , x2 ) = max(x1 , x2 ). Let us compute E(g). We see
that the set S may be divided into two halves along the diagonal. The first domain would be
S1 where x1 ≥ x2 and the other, S2 , where x2 ≥ x1 . Clearly
E(g) = ∫_S g(x1 , x2 )f (x1 , x2 )dx1 dx2 = ∫_{S1} g(x1 , x2 )dx1 dx2 + ∫_{S2} g(x1 , x2 )dx1 dx2
By symmetry, both integrals must be equal and we evaluate the first one. We see that
∫_{S1} g(x1 , x2 )dx1 dx2 = ∫_{x1=0}^{1} ∫_{x2=0}^{x1} x1 dx2 dx1 = ∫_{x1=0}^{1} x1² dx1 = 1/3.
Thus E(g) = 2/3. We should recall that the maximum of two uniform random variables
is also a random variable Z with outcome set [0, 1] and density function 2x. In this case,
g(x) = x and the desired number is merely E_Z (x) for the random variable Z. We see that
∫_{[0,1]} 2x · x dx = 2/3.
• If Y = aX + b, then σ_Y² = a² σ_X² . This is an honest calculation:
σ_Y² = ∫ fY (y)(y − µ_Y )² dy
     = ∫ (1/a) f ((y − b)/a)(y − µ_Y )² dy
     = ∫ f (x)(ax + b − µ_Y )² dx   (after substituting y = ax + b)
     = a² σ_X²
Let us now compute the means and variances of the standard random variables.
• Uniform. Here f (x) = 1 on the outcome set [0, 1]. We have E(x) = ∫_0^1 x dx = [x²/2]_0^1 = 1/2. This is expected. We have the variance as
∫_0^1 (x − 1/2)² dx = [(x − 1/2)³/3]_0^1 = 1/12
• Binomial. We have p(k) = \binom{n}{k} q^k (1 − q)^{n−k} and thus
µ = Σ_{k=0}^{n} k · \binom{n}{k} q^k (1 − q)^{n−k}
  = Σ_{k=1}^{n} n · \binom{n−1}{k−1} q^k (1 − q)^{n−k}
  = nq Σ_{j=0}^{n−1} \binom{n−1}{j} q^j (1 − q)^{n−1−j}
  = nq
This establishes the expected value nq as the mean. The variance is also similarly
calculated and equals nq(1 − q).
• Normal N (µ, σ). By the linear combination result, we only need to prove this for N (0, 1),
i.e., the standard normal. Now, x · (1/√(2π)) e^{−x²/2} is an odd function, whence its integral must
be zero. Thus the mean of the standard normal is indeed zero. The mildly harder case
is the variance. We see this in the following steps:
σ² = ∫_{−∞}^{∞} x² · (1/√(2π)) e^{−x²/2} dx
   = −∫_{−∞}^{∞} x · d/dx( (1/√(2π)) e^{−x²/2} ) dx
   = [ −x · (1/√(2π)) e^{−x²/2} ]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx
   = 1
Here is another expectation which is very important in the theory of random variables,
esp. in repeated trials and the structure of the normal distribution.
Definition 27 The moment generating function (mgf ) φ_X (t) of the random variable X
given by the density function f is E(e^{tX}) = ∫ f (x) e^{tx} dx.
In fact, the mgf of a function f determines it (more or less) uniquely. We present
three results on the transform.
• If X is normal with mean µ and variance σ² then φ_X (t) = e^{µt + σ²t²/2} . We see this in the
following steps:
φ_X (t) = ∫_{−∞}^{∞} e^{tx} · (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx
        = (1/(σ√(2π))) ∫_{−∞}^{∞} e^{(2σ²tx − (x−µ)²)/(2σ²)} dx
        = (1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x−(µ+σ²t))²/(2σ²)} e^{(t²σ⁴ + 2µσ²t)/(2σ²)} dx
        = e^{t²σ²/2 + µt}
• Suppose that X1 and X2 are independent random variables with density functions f1 (x)
and f2 (x), and mgfs φ1 (t) and φ2 (t). Let Y = X1 + X2 ; then the density function fY is
given by fY (y) = ∫_{−∞}^{∞} f1 (x)f2 (y − x)dx. This is called the convolution of f1 and f2 .
This is readily seen by considering the random variable X1 × X2 with density function
f1 (x1 )f2 (x2 ). Let FY (y) denote the probability that x1 + x2 ≤ y. We see that:
FY (y) = ∫_{x1=−∞}^{∞} ∫_{x2=−∞}^{y−x1} f1 (x1 )f2 (x2 ) dx2 dx1
       = ∫_{x1=−∞}^{∞} f1 (x1 ) dx1 ∫_{x2=−∞}^{y−x1} f2 (x2 ) dx2
fY (y) = d/dy (FY (y)) = ∫_{x1=−∞}^{∞} f1 (x1 ) dx1 [ d/dy ∫_{x2=−∞}^{y−x1} f2 (x2 ) dx2 ]
       = ∫_{x1=−∞}^{∞} f1 (x1 ) f2 (y − x1 ) dx1
The mgf of fY (y) is the product φY (t) = φ1 (t) · φ2 (t). This is seen by:
φY (t) = ∫_{y=−∞}^{∞} e^{ty} · ∫_{x=−∞}^{∞} f1 (x)f2 (y − x) dx dy
       = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{ty} f1 (x)f2 (y − x) dy dx
       = ∫_{x=−∞}^{∞} e^{tx} f1 (x) [ ∫_{y=−∞}^{∞} e^{t(y−x)} f2 (y − x) dy ] dx
       = ∫_{x=−∞}^{∞} e^{tx} f1 (x) φ2 (t) dx
       = φ1 (t)φ2 (t)
• If for i = 1, . . . , n, the variables Xi are normal with mean µi and variance σi² then so
is the variable Y = X1 + . . . + Xn and it has mean Σ_i µi and variance Σ_i σi² . This follows since
φY = Π_i e^{t²σi²/2 + tµi} = e^{t² Σ_i σi²/2 + t Σ_i µi}
This is clearly the transform of the normal random variable for the said mean and
variance.
Lemma 28 Let Y = X1 + . . . + Xn be the sum of n independent copies of X and let X̄ = Y /n. The mean µ_Y of Y equals nµ_X and its variance σ_Y² = n · σ_X² . For X̄, we have
µ_X̄ = µ_X and σ_X̄² = σ_X²/n.
The linearity of expectation explains most things. The only calculation is the calculation
of the variance of the sum C of two independent random variables, say A and B, which we
do now.
σ_C² = E((c − µ_C )²)
     = E((a + b − µ_A − µ_B )²)
     = E((a − µ_A )²) + E((b − µ_B )²) + 2E((a − µ_A )(b − µ_B ))
     = σ_A² + σ_B² + ∫_A ∫_B f_A (a)f_B (b)(a − µ_A )(b − µ_B ) da db
     = σ_A² + σ_B² + { ∫_A f_A (a)(a − µ_A ) da }{ ∫_B f_B (b)(b − µ_B ) db }
     = σ_A² + σ_B²
Thus, we see that the variance of the variable X̄ diminishes with n while its mean remains
invariant. This, in fact, is the basis of much of sampling. Let us try this in an example with
each Xi uniform on [0, 1], so that µ_X = 1/2 and σ_X² = 1/12:
Z_n = (X1 + . . . + Xn − nµ_X)/(σ_X √n) = (X1 + . . . + Xn − n/2)/√(n/12)
Figure 25: 500 trials of a 10-uniform-sum
We make 500 trials and plot the observed frequencies for n = 10, i.e., Z10 . The blue line is
the expected frequencies for the normal curve. We see a close match.
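A sketch of this experiment in Scilab appears below; the bin edges are our own choice and the plot will vary from run to run.
// 500 trials of Z_10, the standardized sum of 10 uniform random numbers,
// compared with the expected frequencies from the N(0, 1) density.
n = 10; trials = 500;
U = grand(n, trials, "def");                  // each column is one trial of n uniforms
Z = (sum(U, "r") - n/2) / sqrt(n/12);         // one Z_n value per trial
edges = -6:0.5:6;
histplot(edges, Z, normalization=%f);         // observed frequencies
centers = -5.75:0.5:5.75;
expected = trials * 0.5 * exp(-centers.^2/2) / sqrt(2*%pi);
plot(centers, expected);                      // expected frequencies (the smooth curve)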
Example 30 The basis for assuming normality in social data. Scientists studied
for Thane, the passing percentages of girls and boys in their school years and considered all
factors such as economic conditions, social status, distance from school and so on, and came
out with the following probability estimates for a girl/boy to pass the 10th standard exam:
Xth passing
ST non-ST
Boy 0.13 0.33
Girl 0.21 0.26
Next, consider the village of Dhasai with population structure given below. Let X be the
random variable modelling the number of Xth standard pass adults.
It is clear that if X11 , X12 , X13 , X14 are random variables expressing whether a given boy/girl who
is ST/non-ST is Xth pass, then X is merely the sum of repeated trials: 123 copies of X11 ,
312 copies of X12 and so on. Now if Yij are these repeated sums then the theorem says that
each of these is close to being normal. Thus X, the sum of the Yij ’s is also almost normal.
This settles the argument that the number of Xth pass (or its fraction) in Dhasai should be
normal. However, it does not answer why should this quantity for another village Mhasa be
distributed by the same mean and variance as Dhasai. This is argued as follows. Suppose
that the number of adults Nij in Murbad taluka of various categories is known. Suppose next
that a village has some n number of adults. Then we may assume that the composition of
this village by various categories is obtained by n independent trials on the whole Murbad
taluka population. If that is valid, then a further counting of Xth pass may proceed along
earlier lines, giving an argument why the Xth pass fractions across all villages should be distributed
by a common normal random variable.
This is partly the basis in assuming many of these social variables as normal. There are of
course, serious limitations to this approach. First is the non-independence of many attributes
of individuals with those in his/her village, community etc., as pointed out earlier. Secondly,
as we saw in Shahpur the ST-fraction in villages is not normally distributed. In fact, there
is a divergence towards the extremes of 0 and 1. All the same, the literacy fractions do show
some match with a common normal variable. This may be due to some other mechanisms at
work which are common to both ST and non-ST.
Figure 26: Estimating q when k = 3 and n = 10
p(3) = \binom{10}{3} q³ (1 − q)^{10−3} for various values of q. The plot is in Figure 13. We see that the
probability of the event k = 3 is indeed maximized when q = 0.3, although the probability
itself is only about 0.266. Moreover, for q = 0.25, the probability of the event k = 3 is about
0.25 which is not far from 0.266.
Let us first prove the simple fact that q = k/n is indeed where the probability p(k) is
maximum. Let us differentiate \binom{n}{k} q^k (1 − q)^{n−k} and equate this to zero to obtain q.
d/dq [ \binom{n}{k} q^k (1 − q)^{n−k} ] = 0
\binom{n}{k} [ kq^{k−1} (1 − q)^{n−k} − (n − k)q^k (1 − q)^{n−k−1} ] = 0
k(1 − q) − (n − k)q = 0
k − nq = 0
Thus q = k/n is where the derivative is zero. It is easy to check that this is a maximum. Thus
our function e : S → Q with e(k) = k/n actually estimates a q such that the probability of
the outcome k is maximized. Such an estimator of the parameter q ∈ Q is called
the maximum likelihood estimator of q.
The next matter is of confidence. Suppose that, a priori, we had no guidance on the
possible values of q and that every q ∈ [0, 1] was equally possible. We then plot p(k) for all
values of q ∈ [0, 1]. This is plotted in Fig. 13. We may well assert that q = 0.3, but there
is, in fact, no strong reason to rule out q = 0.28 either. Let us quantify our assertion that q = 0.3 by
looking at the area under the curve in the interval [0.25, 0.35]. We see that this is roughly
31% of the total area. Thus, assuming that all values of q were equally likely, we may assert
that we have 31% confidence in our assertion.
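This area ratio is easy to compute numerically; the sketch below uses Scilab's intg integrator (the binomial coefficient cancels in the ratio, so we drop it, and the function name lik is our own).
// Confidence that q lies in [0.25, 0.35] given k = 3 successes in n = 10 trials,
// assuming all q in [0, 1] were a priori equally likely.
function y = lik(q)
    y = q.^3 .* (1 - q).^7;          // p(3) up to the constant binomial coefficient
endfunction
num = intg(0.25, 0.35, lik);
den = intg(0, 1, lik);
disp(num / den);                      // the confidence for the interval [0.25, 0.35]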
Figure 27: p(3) for all q and the confidence interval [0.25, 0.35]
Figure 28: p(16) and n=50 for all q and the confidence interval [0.2, 0.3]
How do we strengthen our assertion? The first option is to widen the interval. For
example, we check that for the interval [0.2, 0.4] we have a larger confidence of 56%. The
other, and wiser, option is to increase the number of trials. Suppose now that n = 50 and
k = 15 and thus q = 0.3. Thus the estimated value remains the same. However, the q-plot
changes dramatically, as seen in Fig. 13. Also, now the confidence in the interval [0.2, 0.4]
goes to roughly 91%.
All of this crucially depends on the fact that all q ∈ [0, 1] were equally likely. Suppose, a
priori, we knew that q is in fact in the interval [0.2, 0.8]. In which case, our confidence in our
assertion would increase to area(0.2, 0.3)/area(0.2, 0.8) which is 93%. In general, we have an
a priori probability density f on [0, 1] for q. In such a situation, the confidence for
the interval [a, b] when we have observed k successes in n trials would be:
[ ∫_a^b f (q) \binom{n}{k} q^k (1 − q)^{n−k} dq ] / [ ∫_0^1 f (q) \binom{n}{k} q^k (1 − q)^{n−k} dq ]
Such an analysis is called a Bayesian analysis since it bases its estimate of q by conditioning
on the case for each q ∈ [0, 1].
Suppose we have a multivariate Bernoulli distribution of the µ’s and let P r(µ = µ1 ) = P r(µ =
µ2 ) = . . . = P r(µ = µk ) = 1/k. As an example consider the toss of a die. Suppose that, in the limit of
infinitely many observations, all observations are a particular value Vi ; then we will have P r(µ = µ1 ) = 0, . . . , P r(µ =
µi ) = 1, . . . , P r(µ = µk ) = 0.
Using Bayes rule
P r([µ1 . . . µk ]|D) = P r(D|µ̄)P r(µ̄) / Σ_{µ̄'} P r(D|µ̄')P r(µ̄')
If P r(µ) has a form such that P r(µ|D) has the same form, we say that P r(µ) is the
conjugate prior to the distribution defining P r(D|µ).
Some of the conjugate priors that we will see are
• Dirichlet and Multivariate Bernoulli
• Beta and Bernoulli
• Gaussian and Gaussian
• As n1 + n2 → ∞, the spread of the posterior distribution B(µ|n1 + a, n2 + b) → 0, and a and b become immaterial:
E(µ) = (n1 + a)/(n1 + n2 + a + b) → n1 /(n1 + n2 ) = µ̂_ML
• µ̂_MAP = argmax_µ P (µ|D) = (a + n1 − 1)/(a + n1 + b + n2 − 2)
• As n1 , n2 → ∞, µ̂_MAP → µ̂_ML
Assume X is the event of a coin toss. Let X1=0 (TAILS say), X2=1, X3=0, X4=1, X5=1.
We are interested in predicting the event X6=1 given the above. This can be calculated by
different approaches. The ML, MAP and Bayes estimators are called pseudo Bayes approaches,
while the fully Bayesian method is called pure Bayes.
Maximum likelihood
µ̂_ML is the probability of X = 1 from the data.
P (X6|X1..X5) = µ̂_ML^{X6} (1 − µ̂_ML)^{(1−X6)}
MAP
Bayes Estimator
µ̂_bayes is the probability of X = 1 from the data.
P (X6|X1..X5) = µ̂_bayes^{X6} (1 − µ̂_bayes)^{(1−X6)}
Bayesian method
P (X6|X1..X5) = ∫_0^1 µ^{X6} (1 − µ)^{(1−X6)} P (µ|X1..X5) dµ
Let us now turn the tables and assume that a claim has been made, say that q = q0 . It is
our task to check the validity of the claim. Such a claim is called the null hypothesis and
is denoted by H0 . Our task is to design an experiment with outcome set S and based on the
outcome, either reject or accept the hypothesis. There are clearly two types of error we can
make and this is given in the table below:
                      H0 actually true    H0 actually false
Reject H0             Type I error        correct decision
Do not reject H0      correct decision    Type II error
Our strategy will be as follows. We will design an experiment and specify an event set
E0 ⊆ S. If the outcome of the experiment o ∈ E0 then we assert with some confidence
that H0 is false. This takes care of Type I errors of labelling something as false when it was
actually true. Now consider Type II errors. We construct another hypothesis H1 so that
both H0 and H1 cannot simultaneously hold. For H1 we construct an event E1 ⊆ E0 such
that if the outcome o ∈ E1 then we can claim with confidence that H1 is true. Since H1
is true, H0 is certainly false, and we would have concluded from our experiment that H0 is
false. Thus, the correct task is to design the experiment so that if H0 were false then E1
should be as large as possible.
Thus, the task is to design an experiment and an event E0 and conduct the experiment.
Next, we observe the outcome o. Based on whether the outcome o ∈ E0 or not:
Let us suppose that the null hypothesis is H0 ≡ q0 = 0.4. We are now supposed to build
an event set E0 which will help us refute the hypothesis. Let us suppose that we intend to
conduct 100 trials and observe k, the number of successes. Thus S = {0, 1, . . . , 100}. We see
that if the hypothesis is true then the sum Σ_{i=30}^{50} B(100, 0.4)(i) = 0.96, thus we choose E0
as the event set [0, 29] ∪ [51, 100] ⊆ S, and α = 5%. Clearly if the outcome o ∈ E0 , then we
can reject the claim H0 with confidence 1 − α, for if the hypothesis were true then o ∈ E0
is a very unlikely event. Next we set E1 = [0, 20], β = 1% and H1 as the hypothesis that
q0 < 0.35. We see that if, for example, the outcome is 20, then using our earlier theory of
estimation, we can claim with 99% confidence (check this) that q0 < 0.35. If the outcome
is lower than 20, then the confidence in fact strengthens. Thus we have:
• However, if the outcome is in the set [30, 50], i.e., the complement to E0 ∪ E1 , then we
are forced to remain silent.
What do we do when o ∈ [30, 50]? Well, we could conduct a fresh experiment with an
additional 900 trials to get a total of 1000 trials. We see that the set E0 in fact swings closer
to the number 0.4n, and the forbidden set, where we cannot draw any conclusion, becomes
smaller. In fact, for n = 1000, the inconclusive set becomes [0.37n, 0.43n].
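The number 0.96 used above can be checked with the cdfbin function; the snippet below is a small illustration.
// P(30 <= k <= 50) for k ~ Binom(n = 100, q = 0.4).
// cdfbin("PQ", S, Xn, Pr, Ompr) gives P(k <= S) for Binom(Xn, Pr).
[P50, Q50] = cdfbin("PQ", 50, 100, 0.4, 0.6);
[P29, Q29] = cdfbin("PQ", 29, 100, 0.4, 0.6);
disp(P50 - P29);                 // roughly 0.96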
13.2.2 Hypothesis Testing: More formally
Given a random sample (X1 , X2 , . . . Xn ) for a random variable X, the function S : (X1 , X2 , . . . Xn ) →
R is called a statistic (or sample statistic). For instance, the sample mean (Σ_i Xi )/n and the sample
variance Σ_i (Xi − (Σ_i Xi )/n)² / (n − 1) are statistics.
Let us consider hypotheses H0 and H1 defined by:
H0 : (X1 , X2 , . . . Xn ) ∈ C
H1 : (X1 , X2 , . . . Xn ) ∈
/C
where C is some tolerance limit, also called the confidence region. The region C is generally
defined in terms of some statistic S.
The following types of errors are defined as a consequence of the above hypotheses:
Given a significance level α ∈ [0, 1] (some bound on the Type I error), we want to
determine a C such that,
P r_{H0}({X1 , . . . Xn } ∉ C) ≤ α   (Type I error)
P r_{H0}({X1 , . . . Xn } ∉ C') ≤ P r_{H0}({X1 , . . . Xn } ∉ C)   ∀ C' ⊇ C
Thus, we are interested in the “smallest” / “tightest” C. This is called the critical region
Cα . Consequently,
P r_{H0}({X1 , . . . Xn } ∉ Cα ) = α
Figure 29: The matrix M of Q vs. Outcomes
outcome set S. In our earlier case, Q = [0, 1] and S = {0, 1, . . . , n}. We construct a Q × S
matrix M where M (q, x) = f (q; x). We see that each row of the matrix M , i.e., when
q ∈ Q is fixed, is merely the density function. While, for a fixed outcome x ∈ S, we see
the dependence on the parameter q for that outcome x. For our example, we see that for the
outcome k, the column function is a smooth function with variable q, while the row function
is the discrete probability Bin(q, n) with a discrete outcome set S.
For the problem of estimation, since the outcome of the experiment is known, it is the
column function which assumes importance. Thus, for Type II error analysis, the column
function must be understood. The Type I error analysis is about a particular hypothesis on
the parameter and thus it is the row function, i.e., the ordinary density function which must
be understood.
Example 31 The government decides to impose an additional tax of Rs. 400 per tonne of
steel. Consequently, while some of the tax is absorbed by the industry, the remaining part is
passed on to the consumer. Given the price of steel in open market as a time series, estimate
the fraction which was passed on to the consumer.
This is possibly an example where the price data is a normal
random variable with some mean and variance. By observing it before the intervention, the old µ and σ may be accurately
estimated. The economic mechanism suggests that the tax will merely cause a shift in the
mean price from µ to µ + δ without affecting σ.
Example 32 Karjat tribal block is a fairly homogenous sub-taluka of about 200 habitations
with child literacy fraction normally distributed with mean µ = 0.68 and σ = 0.14. Since
distances to school could be an important factor, an intervention was designed to serve a
region of about 120 habitations by school rickshaws. The mechanism of literacy suggests that
the intervention will move µ without significantly changing σ.
Our task is to estimate µ of an unknown normal random variable X with known variance
σ² . We define our experiment as an n-way repeated trial with the outcome set X1 × X2 ×
. . . × Xn . The parameter set Q = R is the set of possible µ values, i.e., the set of real
numbers. We define the estimator
e : X1 × . . . × Xn → R
e(x1 , . . . , xn ) = (x1 + . . . + xn )/n
Note that this is merely the mean of the observations. We see that if each Xi were indeed
independent normal N (µ, σ) then the expectation E(e) would merely be nµ/n = µ. Thus the
estimator is unbiased, i.e., its expected value is indeed the correct value, if there is one.
We will next show that it is also a maximum likelihood estimator. To see this, the
probability of an n-observation sitting within [x1 , x1 + δ] × . . . × [xn , xn + δ] is proportional
to f (x1 ) · . . . · f (xn ), where f (x) = φ(µ, σ; x), the normal density function. We may write
this as:
P r([x1 , x1 + δ] × . . . × [xn , xn + δ]) = f (x1 ) · . . . · f (xn ) δ^n = (1/(σ√(2π)))^n e^{−Σ_i (xi − µ)²/(2σ²)} δ^n
Now let us assume that σ and δ are fixed, and x1 , . . . , xn are given observations, and that we
would like to determine the best possible µ which will maximize the RHS. Next, we see that
the RHS is maximized iff its log is maximized. But the log of the RHS as a function of µ, and
up to constants, is merely Σ_i −(xi − µ)² . Thus the RHS is maximized when Σ_i (xi − µ)² is
minimized. This is easily seen by choosing µ = (Σ_i xi )/n. This proves that e(x1 , . . . , xn ) = (Σ_i xi )/n
is indeed the maximum likelihood estimator.
Let us denote (Σ_i xi )/n as x̄, i.e., the observation, while (Σ_i Xi )/n by X̄, the random variable.
We know that X̄ is also normal with mean µ and variance σ_X̄² = σ²/n. The decrease in the
variance of X̄ is the key. We see right away that if µ were the unknown mean and x̄ was the
observation, then the abstract matrix M has
M (µ, x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
Thus, both the row and the column functions have the same behaviour, which makes things
much easier. We see that:
Example 33 Suppose that X is the random variable denoting the child literacy in a village
of Karjat tribal block. Suppose that it is known to be normal with an unknown mean but
a known σ = 0.14. Suppose a team visits 10 villages and finds x̄ = 0.76. (i) What is the
assertion we can make with 99%, 95% and 90% confidence? (ii) Suppose we have a prior
belief that the child literacy rate is around 0.5; how do we incorporate this prior information?
(iii) Suppose that an expert asserts that µ = 0.68. With what confidence can you refute the
claim?
Let us solve (i) first. Firstly, we see that the effective standard deviation is only 0.14/√10 =
0.044. Next, writing a two-sided interval around 0.76 as a multiple kσ, and using cdfnor,
we have k(0.99) = 2.58, k(0.95) = 1.96 and k(0.9) = 1.65.
Thus, we see that these intervals are [0.65, 0.87], [0.67, 0.85] and [0.69, 0.83].
For (ii), we need to make use of Gaussian prior (which happens to be conjugate for mean
of a Gaussian) on µ such that the mean of the gaussian prior is 0.5. Refer to the handwritten
class notes for more details.
For (iii), we see that (0.76 − 0.68)/0.044 = 1.82. Again, using cdfnor, we see that the
event x̄ = 0.76, assuming that µ = 0.68, lies in the (one-sided) upper 4% tail. Thus, we
refute the claim with 96% confidence.
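The cdfnor calls behind these numbers can be sketched as follows; the inversion form cdfnor("X", ...) is used here to obtain the multiples k, and the variable names are ours.
// Part (i): half-widths k*sig_eff of two-sided intervals around xbar = 0.76.
sig_eff = 0.14 / sqrt(10);                                // about 0.044
conf = [0.99 0.95 0.90];
k = cdfnor("X", zeros(conf), ones(conf), 1 - (1 - conf)/2, (1 - conf)/2);
disp(k);                                                  // about 2.58, 1.96, 1.65
disp([0.76 - k*sig_eff; 0.76 + k*sig_eff]);               // the three intervals
// Part (iii): one-sided tail probability of xbar = 0.76 under mu = 0.68
[P, Q] = cdfnor("PQ", 0.76, 0.68, sig_eff);
disp(Q);                                                  // about 0.035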
Figure 30: The χ2n density function for various n
has a parameter n and is denoted by χ2n . This arises most commonly as the square of the
distance of a random point. Let X1 , . . . , Xn be independent normal random variables with
mean 0 and variance 1, i.e., N (0, 1). Let Y = X12 + . . . + Xn2 , then χ2n is the density function
of Y . Clearly E(Y ) = Σ_i E(Xi²) = n · 1 = n. See the plots below in Fig. 16 (use
cdfchi("PQ",x,n*ones(1,m))).
Again, we make n trials X1 , . . . , Xn to obtain samples x1 , . . . , xn and the sample mean
x̄ = (Σ_i xi )/n. The estimator of the variance is S² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1). The curious term is
of course, the denominator. To understand this, let us look at a related summation as a
function on X1 , . . . , Xn (where µ is the unknown mean).
Σ_i (Xi − µ)² = Σ_i ((Xi − X̄) + (X̄ − µ))²
             = Σ_i (Xi − X̄)² + Σ_i (X̄ − µ)² + 2 Σ_i (Xi − X̄)(X̄ − µ)
             = Σ_i (Xi − X̄)² + Σ_i (X̄ − µ)² + 2(X̄ − µ) Σ_i (Xi − X̄)
             = Σ_i (Xi − X̄)² + Σ_i (X̄ − µ)²
             = Σ_i (Xi − X̄)² + n · (X̄ − µ)²
Taking expectations on both sides,
nσ² = E(Σ_i (Xi − X̄)²) + n · σ²/n
Thus, we see that E(Σ_i (Xi − X̄)²) = (n − 1)σ², and thus E(S²) = σ². Thus, S² is an
unbiased estimator.
Let us start with the last equality:
Σ_i (Xi − µ)² = Σ_i (Xi − X̄)² + n · (X̄ − µ)²
Σ_i ((Xi − µ)/σ)² = (n − 1) S²/σ² + ((X̄ − µ)/(σ/√n))²
Since the LHS is a variable of density χ²_n and the second term of the RHS is χ²_1 , by a leap
of faith, the variable (n − 1)S²/σ² is distributed by the χ²_{n−1} density function, i.e., a known
density function. Note that this does not need us to assume knowledge of µ at all. Let us
now apply this in an example.
[0.82, 0.73, 0.70, 0.69, 0.67, 0.56, 0.45, 0.44, 0.43, 0.43]
Give 90% and 99% confidence interval estimates for σ 2 . With what confidence will you refute
the claim that the SD is 0.1?
We see that S 2 = 0.0217. The variance is 0.0195 and the sample SD is 0.140. Since
n = 10, we are dealing with χ29 with expected value 9. We will find intervals [a, b] around 9
such that P rχ29 ([a, b]) = 1 − α for α = 0.1 and 0.01. We use cdfchi("PQ",x,9*ones(1,m))
and get these intervals as [3.3, 18.9] and [1.8, 24]. Thus, we see that:
P r(3.3 ≤ 9 · 0.0217/σ² ≤ 18.9) = 0.9
P r(2.727 ≥ σ²/0.0217 ≥ 0.476) = 0.9
P r(1.651 ≥ σ/0.147 ≥ 0.69) = 0.9
P r(0.242 ≥ σ ≥ 0.101) = 0.9
Thus, we can claim with 90% confidence that σ lies in the interval [0.101, 0.242]. A similar
(but larger) interval may be found for our 99% confidence assertion.
Next, we move to refuting the H0 ≡ σ = 0.1, i.e., σ² = 0.01. We see that 9 · 0.0217/0.01 = 19.53.
cdfchi("PQ",19.53,9) gives about 0.98, i.e., the observed value lies in the upper 2% tail of χ²_9 . Thus, the observed S²
is outside the 2% chance and thus we can claim with roughly 98% confidence that σ = 0.1 is false.
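A sketch of these computations in Scilab follows. Here we split the 10% equally between the two tails, so the chi-square endpoints differ slightly from the [3.3, 18.9] used above; the variable names are ours.
// 90% interval for sigma from the sample variance, via the chi-square distribution.
data = [0.82 0.73 0.70 0.69 0.67 0.56 0.45 0.44 0.43 0.43];
n  = length(data);
S2 = sum((data - mean(data)).^2) / (n - 1);        // about 0.0217
a = cdfchi("X", n-1, 0.05, 0.95);                  // lower chi-square point
b = cdfchi("X", n-1, 0.95, 0.05);                  // upper chi-square point
disp([a b]);
disp([sqrt((n-1)*S2/b), sqrt((n-1)*S2/a)]);        // interval for sigma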
Figure 31: The t-density function
[0.82, 0.73, 0.70, 0.69, 0.67, 0.56, 0.45, 0.44, 0.43, 0.43]
Give 90% and 99% confidence interval estimates for µ. With what confidence will you refute
the claim that µ is 0.55?
We see that the sample mean is 0.592 and S² = 0.0217. Thus, we have the variable
T = (0.592 − µ)/√(S²/10) = (0.592 − µ)/0.0466. Next, we use cdft (with n − 1 = 9 degrees of freedom) to compute the
intervals for 90% and 99%. This we get by using the command:
cdft("T",[9 9],[0.9 0.99],[0.1 0.01]) to get 1.383, 2.827. We start with the first prob-
lem, i.e., 90%. We have that:
P r(−1.383 ≤ (0.592 − µ)/0.0466 ≤ 1.383) ≥ 0.9
This gives us the confidence interval for the 90% as [0.527, 0.656]. Note that this interval is
larger than what would have been for the normal case with σ = 0.0466. The above interval
would correspond to a confidence of 91.66% in the normal case. This is because there is an
inherent uncertainty about the variance and that causes the t-density to be broader than
the normal density.
Next, we see that (0.592 − 0.55)/0.0466 = 0.901. The p-value for this can be found by
cdft("PQ",-0.901,9) which is 0.196. Thus, we can reject the claim with a mere 1 − 2 ∗
0.196 = 0.609, i.e., 60.9%.
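These t-based numbers can be reproduced with cdft along the lines of the sketch below, following the same convention for the quantiles as the text.
// Sample mean, sample variance, 90% t interval and tail probability for the data above.
data = [0.82 0.73 0.70 0.69 0.67 0.56 0.45 0.44 0.43 0.43];
n    = length(data);
xbar = mean(data);                               // 0.592
S2   = sum((data - xbar).^2) / (n - 1);          // 0.0217
se   = sqrt(S2 / n);                             // 0.0466
t90  = cdft("T", 9, 0.9, 0.1);                   // about 1.38
disp([xbar - t90*se, xbar + t90*se]);            // about [0.527, 0.656]
[P, Q] = cdft("PQ", (xbar - 0.55)/se, 9);
disp(Q);                                         // about 0.196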
18 A few scilab commands and code
Scilab code associated with each and every chapter of the primary textbook (Sheldon M.
Ross) can be found at http://www.cse.iitb.ac.in/~IC102/code/scilab_code.zip. A
local copy of the scilab code description can be found here http://www.cse.iitb.ac.in/
~IC102/code/scilab_code_description.pdf. The code has been reformatted and reorga-
nized chapterwise for compatibility with Aakash tablets in both tar.gz and zip formats at
http://www.cse.iitb.ac.in/~IC102/code/aakash_code.
In what follows, we only briefly describe
• some functionalities in scilab and some additional functionalities that we have shared
for reading and processing data from xls files and
For any detailed information on performing operations based on probability and statistics
on datasets, you need to refer to the scilab code repository (and its documentation) pointed
to above.
Reading a .xls file: In the beginning there is an .xls file. To input it into your scilab
session, you need to use the readxls command, such as:
murbad=readxls("thane murbad census I.xls")
This creates a copy of the .xls file in your session and the file is called murbad. These will
have as many sheets as your original file had and these are referred to as murbad(1), murbad(2)
and so on. So let us do the following.
We see that columns 5, 6, and 10 onwards are numeric, while the others are text. Now,
let us select all the rows which correspond to VILLAGE (column 7) and all the numeric
columns. This is done as follows:
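The commands themselves are omitted here; the following is only a guess at what they might look like, assuming the sheet object returned by readxls exposes its string cells as .text and its numeric cells as .value (variable names follow the surrounding text).
// Hypothetical sketch of the selection step described above.
sheet = murbad(1);                                    // the first sheet
rows  = find(sheet.text(:, 7) == "VILLAGE");          // rows whose column 7 reads VILLAGE
murbadnumeric = sheet.value(rows, 10:$);              // keep the numeric columns from 10 onwards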
size(murbadnumeric) // should give you 205. 55.
save murbadnumeric // now a load will get this back for us
Now, we load all the index names. This is done by exec("index.sci"). What this will do
is to define variables such as TOT_P and NON_WORK_M and put the correct column index
for them, which are 11 and 63 respectively. Remember that while creating murbadvillage we
have deleted the first 9 columns and hence murbadvillage(:,TOT_P-9) will be the column
vector of the total populations of all villages in Murbad. Just for fun, we extract the
population fraction under 6 as follows:
• mean(X) returns the mean of the entries of the matrix X. Example mean([1 2; 3 4])
returns 2.5.
• histplot(M,X): plots a histogram of the entries in X. M is either an integer or a
row-vector of values M = [m1 , m2 , . . . , mk ]. If M is an integer, the produced figure
has M divisions. If M is a vector, then the plots are for frequencies in [mi−1 , mi ].
The Y -axis is normally the fraction of entries. Use histplot(M,X,normalization=%f) for
frequencies.
• plot2d(x,y): x and y should be vectors of the same size. This will plot a poly-line
connecting (xi , yi ) to (xi+1 , yi+1 ) for each i. plot2d(x,y,’r+’) will not draw the line,
but only the points. These will be marked red and with a ”+” sign.
• title("my title") will add a title to your graph. legend("my legend"), xlabel("mylabel"),
ylabel("mylabel") will add the labels and legends to your plot.
where φ is the gaussian function. Thus cdfnor implements the cumulative density
function.
Figure 32: The households in Shahpur
mu=mean(HH) // 201
varr=variance(HH,1) // 34484
max(HH) // 1635
sig=sqrt(varr) // 186
xx=linspace(0,1700,86) // this creates an array of 86 equally spaced points from
//0 to 1700, i.e., 20 apart
histplot(xx,HH,normalization=%f) // creates the histogram below
size(HH) // is 222
cdf=cdfnor("PQ",xx,mu*ones(1,86),sig*ones(1,86));
// this produces the vector in cdf for all the stopping points
// of the histogram
pdf=differ(cdf)*222 // this is what we want
// differ is our function to compute input(i+1)-input(i)
plot(xx,pdf) // does the job by plotting the normal on the histogram
19 Appendix on Regression
Suppose there are two sets of variables x ∈ R^n and y ∈ R^k such that x is independent
and y is dependent. The regression problem is concerned with determining y in terms of x.
Let us assume that we are given m data points D = ⟨x1 , y1 ⟩, ⟨x2 , y2 ⟩, . . . , ⟨xm , ym ⟩. Then the
problem is to determine a function f ∗ such that f ∗ (x) is the best predictor for y, with respect
to D. Suppose ε(f, D) is an error function, designed to reflect the discrepancy between the
predicted value f (x0 ) of y0 and the actual value y0 for any ⟨x0 , y0 ⟩ ∈ D, then
f ∗ = argmin_{f ∈ F} ε(f, D)   (3)
where F denotes the class of functions over which the optimization is performed.
f (x) = w0 + w1 x + w2 x² + ... + wt x^t
argmin_{w=[w0 ,w1 ,...,wt ]} (1/2) Σ_{i=1}^{K} [ (w0 + w1 xi + w2 xi² + ... + wt xi^t ) − yi ]²
If there are m data points, then a polynomial of degree m − 1 can exactly fit the data,
since the polynomial has m degrees of freedom (where degrees of freedom=no. of coefficients)
As the degree of the polynomial increases beyond m, the curve becomes more and more
wobbly, while still passing through the points. Contrast the degree 10 fit in Figure 34 against
the degree 5 fit in Figure 33. This is due to the problem of overfitting (overspecification).
Now the squared error E is a convex function of w. To optimize it, we need to set ∇_w E = 0. The ∇ operator
is also called the gradient.
The solution is given by,
Figure 33: Fit for degree 5 polynomial.
w∗ = (φ^T φ)^{−1} φ^T Y
where φ is the matrix built from the basis functions (for example, we can consider φi (x) = x^i , i.e., polynomial
basis functions). If m << t then the polynomial has far more coefficients than data points and will overfit.
Figure 34: Fit for degree 10 polynomial. Note how wobbly this fit is.
Any function in F is characterized by its parameters, the wi ’s. Thus, in (3) we have to
find f (w∗ ) where
w∗ = argmin_w ε(w, D)
The error function ε plays a major role in the accuracy and tractability of the optimization
problem. The error function is also called the loss function. The squared loss is a commonly
used loss function. It is the sum of squares of the differences between the actual value and
the predicted value.
ε(f, D) = Σ_{⟨xi ,yi ⟩∈D} (f (xi ) − yi )²
w∗ = argmin_w Σ_{j=1}^{m} ( Σ_{i=1}^{p} wi φi (xj ) − yj )²
The minimum value of the squared loss is zero. Is it possible to achieve this value ? In
other words, is ∀j, Σ_{i=1}^{p} wi φi (xj ) = yj possible?
Figure 35: Least square solution ŷ is the orthogonal projection of y onto the column space C(φ) of φ
Here φ is the m × p matrix whose i-th row is (φ1 (xi ), . . . , φp (xi )), and y = (y1 , . . . , ym )^T .
It has a solution if y is in the column space (the subspace of R^m spanned by the column
vectors) of φ. It is possible that there exists no w which satisfies these conditions. In such
situations we can solve the least squares problem.
Let ŷ be a solution in the column space of φ. The least squares solution is such that the
distance between ŷ and y is minimized. From the diagram it is clear that for the distance
to be minimized, the line joining ŷ to y should be orthogonal to the column space. This can
be summarized as
1. φw = ŷ
2. (y − ŷ) is orthogonal to the column space, i.e., ŷ^T φ = y^T φ
i.e., (φw)^T φ = y^T φ
i.e., w^T φ^T φ = y^T φ
i.e., φ^T φw = φ^T y
∴ w = (φ^T φ)^{−1} φ^T y
In the last step, please note that, φT φ is invertible only if φ has full column rank.
Theorem: If φ has full column rank, φT φ is invertible. A matrix is said to have full column
rank if all its column vectors are linearly independent. A set of vectors vi is said to be
linearly independent if Σ_i αi vi = 0 ⇒ αi = 0.
Proof: Given that φ has full column rank and hence columns are linearly independent, we
have that φx = 0 ⇒ x = 0.
Assume on the contrary that φ^T φ is not invertible. Then there exists x ≠ 0 such that φ^T φx = 0.
⇒ xT φT φx = 0
⇒ (φx)T φx = ||φx||2 = 0
⇒ φx = 0. This is a contradiction. Hence the theorem is proved.
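As an illustration of the normal equations, here is a small Scilab sketch of a polynomial least squares fit; the data points and the basis size are made up by us purely for the example.
// Least squares fit with polynomial basis functions phi_i(x) = x^(i-1).
x = [0; 0.25; 0.5; 0.75; 1.0; 1.25; 1.5; 1.75];        // made-up inputs
y = sin(%pi*x) + 0.1*grand(8, 1, "nor", 0, 1);          // noisy made-up targets
p = 4;                                                   // number of basis functions
phi = ones(8, 1);
for i = 2:p
    phi = [phi, x.^(i-1)];                               // columns 1, x, x^2, x^3
end
w = inv(phi' * phi) * phi' * y;                          // w = (phi^T phi)^(-1) phi^T y
disp(w);
// Numerically, w = phi \ y solves the same least squares problem more stably.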
Problem formulation
In order to discourage coefficients from becoming too large in magnitude, we may modify
the problem and pose a constrained optimization problem. Intuitively, for achieving this
criterion, we may impose constraints on the magnitude of the coefficients. Any norm for this
purpose might provide a good working solution. However, for mathematical convenience,
we start with the euclidean (L2 ) norm. The overall problem with objective function and
constraint goes as follows:
As observed in the last lecture, the objective function, namely f (w) = (Φw − Y )^T (Φw − Y ), is
strictly convex. Further to this, the constraint function, g(w) = ||w||₂² − ξ, is also a convex
function. For convex g(w), the set S = {w | g(w) ≤ 0} can be proved to be a convex set by
taking two elements w1 ∈ S and w2 ∈ S such that g(w1 ) ≤ 0 and g(w2 ) ≤ 0. Since g(w) is
a convex function, we have the following inequality:
As discussed earlier, we need to minimize the error function subject to the constraint ||w||₂² ≤ ξ.
If w∗ is a global optimum, then from the first order necessary conditions for minimality, we get,
i.e.
We will also obtain the following conditions from the Karush Kuhn Tucker necessary opti-
mality conditions;
||w∗||₂² ≤ ξ   (8)
and
λ ≥ 0   (9)
and
λ ||w∗||₂² = λξ   (10)
Thus values of w∗ and λ which satisfy all these equations would yield an optimal solution.
Consider equation (7),
w∗ = (ΦT Φ + λI)−1 ΦT y
(ΦT Φ + λI)w∗ = ΦT y
By the triangle inequality,
(α + λ) ||w∗|| ≥ ||Φ^T y||
i.e.
λ ≥ ||Φ^T y|| / ||w∗|| − α   (12)
Note that when ||w∗|| → 0, λ → ∞. This is obvious as higher values of λ would focus more
on reducing values of ||w∗|| than on minimizing the error function. Since ||w∗||₂² ≤ ξ,
∴ λ ≥ ||Φ^T y|| / √ξ − α   (13)
This is not the exact solution for λ, but the bound (13) proves the existence of such a λ for some ξ
and Φ.
Figure 36: RMS error vs. degree of polynomial for test and train data
Recall the polynomial curve fitting problem we considered in earlier lectures. Figure 36
shows RMS error variation as the degree of polynomial (assumed to fit the points) is in-
creased. We observe that as the degree of polynomial is increased till 5 both train and test
errors decrease. For degree > 7, test error shoots up. This is attributed to the overfitting
problem (The datasize for train set is 8 points.)
In Figure 37, the variation in the RMS error with variations in the Lagrange multiplier λ
has been explored (keeping the polynomial degree constant at 6). Given this analysis, what
is the optimum value of λ that must be chosen? We have to choose that value for which the
test error is minimum (Identified as the point of optimum in the figure.).
Figure 37: RMS error vs. 10λ for test and train data (at Polynomial degree = 6)
min_w ( ||Φw − y||² + λ ||w||² )   (15)
For the same λ, these two solutions are the same. This form of regression is known as Ridge
regression. If we employ the L1 norm it is called ’Lasso’. Note that the w∗ form that we
derived is valid only for the L2 norm.
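Continuing the least squares sketch from the previous subsection, ridge regression only changes the normal equations by λI; the value of λ below is arbitrary.
// Ridge regression: w = (phi^T phi + lambda I)^(-1) phi^T y, reusing phi and y from
// the earlier least squares sketch.
lambda = 0.01;                                   // regularization weight (arbitrary)
p = size(phi, 2);
w_ridge = inv(phi' * phi + lambda * eye(p, p)) * phi' * y;
disp(w_ridge);                                   // coefficients shrink towards 0 as lambda grows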
Question: Consider the two graphs in Figures 38 and 39. Say you know the probability function p(x). When
is knowing the value of X more useful (that is, when does it carry more information)?
Ans: It is more useful in case (2), because more information is conveyed in Figure 38
than in Figure 39.
Figure 38: Figure showing curve where Information is not distributed all along.
I(x)<I(y)
There is only one function which satisfies the above two properties.
• The Entropy in the case of a discrete random variable can be defined as:
E_P [I(p(x))] = −c Σ_x p(x) log[p(x)]   (19)
Observations:
• For a discrete random variable (with countable domain), the information is maximum
for the uniform distribution.
• For a continuous random variable (with finite mean and finite variance), the information
is maximum for the Gaussian distribution. That is, among densities satisfying
∫ x p(x)dx = µ
and
∫ (x − µ)² p(x)dx = σ²
the maximizer is
p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
• If X ∼ N (µ, σ²), then
p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} , where −∞ < x < ∞
Φ(N (µ, σ²)) = E_{N (µ,σ²)} [e^{tx}] = e^{µt + (σt)²/2}
Recall
E(X) = dφ/dt evaluated at t = 0
var(X) = d²φ/dt² − (dφ/dt)² , both evaluated at t = 0
E_{N (µ,σ²)} [e^{t(w1 x + w0 )}] = e^{(w1 µ + w0 )t + (σt)² w1²/2} , i.e., w1 X + w0 ∼ N (w1 µ + w0 , w1² σ²)
X1 + X2 + . . . + Xn ∼ N (nµ, nσ²)
In general, if Xi ∼ N (µi , σi²) then Σ_{i=1}^{n} Xi ∼ N (Σ µi , Σ σi²)
z = (X − µ)/σ ∼ N (0, 1)   (useful in setting interval estimates; take w1 = 1/σ and w0 = −µ/σ)
Note:- If X1 , X2 , . . . , Xm ∼ N (0, 1)
1. y = Σ_i Xi² ∼ χ²_m . That is, y follows the chi-square distribution with m degrees
of freedom.
2. y = z / √( (Σ_i Xi²)/m ) ∼ t_m (where z ∼ N (0, 1) is independent of the Xi ). That is, y follows the Student's t
distribution with m degrees of freedom.
Figure 20.1 : Figure showing the nature of the chi-square distribution with 5
degrees of freedom
µ̂_MLE = argmax_µ Π_{i=1}^{m} [ (1/(σ√(2π))) e^{−(Xi − µ)²/(2σ²)} ]
       = argmax_µ (1/(σ√(2π)))^m e^{−Σ_i (Xi − µ)²/(2σ²)}
µ̂_MLE = (Σ_{i=1}^{m} Xi )/m = sample mean
• Without relying on the central limit theorem, properties (2) and (1) above give
µ̂_MLE ∼ N (µ, σ²/m)
Similarly
σ̂²_MLE = Σ_{i=1}^{m} (Xi − µ̂_MLE )²/m follows a scaled chi-square distribution: (m/σ²) σ̂²_MLE ∼ χ²_{m−1}
• Coming up with the conjugate prior of N (µ, σ²):
⇒ µ ∼ N (µ0 , σ0²)
⇒ σ² ∼ Γ