NUS GEA1000 Quantitative Guide
Correlation Coefficient, r

The correlation coefficient between two numerical variables, r, is a measure of the linear association between them. It always ranges between −1 and 1.

• Sign and Magnitude of r: The sign tells us the direction of the linear association. If r > 0, the association is positive: when one variable increases, the other tends to increase as well. If r < 0, the association is negative: an increase in one variable tends to come with a decrease in the other. If r = 1 or r = −1, there is perfect positive/negative linear association. When r = 0, there is no linear association. The magnitude of r tells us the strength of the linear association. Approx: (0 - 0.3 weak, 0.3 - 0.7 moderate, 0.7 - 1 strong)

• Calculation of r:
r = [n Σxy − (Σx)(Σy)] / √([n Σx² − (Σx)²][n Σy² − (Σy)²])

• Properties of r: r is not affected by adding a number to all values of a variable, or by multiplying all values of a variable by a positive number.

• Limitations of r: Association is not causation. r gives no indication of non-linear association. Outliers can affect the correlation coefficient r significantly.
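The computational formula for r above can be sketched directly; this is a minimal illustration with made-up data, showing that adding a constant to one variable leaves r unchanged:

```python
import math

def correlation(xs, ys):
    """r = [n*Σxy − (Σx)(Σy)] / sqrt([n*Σx² − (Σx)²][n*Σy² − (Σy)²])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]              # perfectly linear, so r = 1
print(correlation(xs, ys))         # 1.0
shifted = [x + 100 for x in xs]    # adding a constant does not change r
print(correlation(shifted, ys))
```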
Linear Regression

If we believe that two variables are linearly associated, we may model the relationship by fitting a straight line to the observed data, known as linear regression.

• The slope of the line is the amount of change in Y when the value of X increases by 1.

• Finding the Regression Line: Method of least squares: fit the line that minimises the sum of the squared error terms. Hence, the two regression lines (Y on X and X on Y) are different and not interchangeable.

• Slope vs. Correlation Coefficient: The slope of the regression line and the correlation coefficient are related by
m = (sy / sx) × r
where sy is the standard deviation for y and sx is the standard deviation for x. It is important to remember that the correlation coefficient is not necessarily equal to the gradient of the regression line.

• Extrapolation: Prediction beyond the observed range is dangerous (not advisable).

• Linear Regression on Non-Linear Models: Model the relationship indirectly (e.g. via properties of log) to form a linear relation.
4. Statistical Inference

Statistical Inference is the use of samples to draw inferences or conclusions about the population in question.

Probability

Probability is a mathematical means to reason about uncertainty.

• Sample Space: The collection of all possible outcomes of a probability experiment.

• Event: A subcollection of the sample space is an event.

• Rules of Probability: The probability of an event E, P(E), is between 0 and 1 inclusive. The probability of the entire sample space, P(S), is 1.

• If E and F are mutually exclusive events, then the probability of E union F is equal to the sum of the probabilities of E and F. That is, P(E ∪ F) = P(E) + P(F).

• Uniform Probability and Rates: A way of assigning probabilities to outcomes such that equal probability is assigned to every outcome in a finite sample space. Relevant in random sampling.
Conditional Probability and Independence

Conditional probability is written using the notation P(E|F) and read as "probability of E given F".

P(E|F) = P(E ∩ F) / P(F)

• Mutually Exclusive Events: No overlap between E and F, meaning they cannot occur simultaneously. Then P(E ∩ F) = 0. If the event F itself cannot occur, then by convention P(E|F) is also equal to 0.

• Law of Total Probability: P(A) = P(A|B)P(B) + P(A|B′)P(B′), where B′ is the complement of B.

• Analogy between Probability and Sampling: Conditional probabilities are equivalent to conditional rates: P(A|B) = rate(A|B).

• Independent Events: For independent events A and B, the probability of A is the same as the probability of A given B:

P(A) = P(A|B)

If we express the conditional probability P(A|B) as P(A ∩ B)/P(B), then A and B being independent means that

P(A) × P(B) = P(A ∩ B)

which is an equivalent definition for two independent events.

• Independence as non-association: A and B are independent events whenever A and B are not associated with each other.

• Independent Probability Experiments: e.g. coin tosses, where one instance is independent of the other.
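The definitions above can be checked by brute force on a small uniform sample space; this sketch uses two fair dice (a hypothetical example, not from the guide):

```python
from itertools import product

# Sample space: all outcomes of rolling two fair six-sided dice (uniform).
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) under uniform probability: favourable outcomes / total outcomes."""
    return sum(1 for o in space if event(o)) / len(space)

def cond_prob(e, f):
    """P(E|F) = P(E ∩ F) / P(F)."""
    return prob(lambda o: e(o) and f(o)) / prob(f)

first_is_six  = lambda o: o[0] == 6
total_ge_10   = lambda o: o[0] + o[1] >= 10
second_is_odd = lambda o: o[1] % 2 == 1

# Dependent events: conditioning changes the probability.
print(round(prob(total_ge_10), 4))                     # 0.1667
print(round(cond_prob(total_ge_10, first_is_six), 4))  # 0.5
# Independent events: P(A|B) = P(A).
print(round(cond_prob(second_is_odd, first_is_six), 4))  # 0.5
```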
Random Variables

A random variable is a numerical variable with probabilities assigned to each of the possible numerical values taken by the variable. It can be conceived as a mathematical way to model a data distribution.

• A random variable may be discrete or continuous. (Visualisation: a probability histogram for the discrete case and a density curve for the continuous case, respectively.)

• For a discrete random variable, the sum of the probabilities assigned to each outcome must equal 1. For a continuous random variable, the area under the density curve is always equal to 1.

Normal Distributions

A class of continuous random variables, written N(x, y) (the bell curve).

• Normal distributions differ only by their means and variances: N(x, y) has mean x and variance y.

• Common Properties: Bell-shaped curve; the peak of the curve occurs at the mean; the curve is symmetrical about the mean (Mean = Mode = Median).
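These properties can be illustrated with the standard library's `NormalDist`; a minimal sketch (note it takes the standard deviation √y, not the variance y):

```python
from statistics import NormalDist

# N(0, 1): mean 0, variance 1 (sigma is the standard deviation, sqrt of variance).
d = NormalDist(mu=0, sigma=1)

# Total area under the density curve is 1 and the curve is symmetrical
# about the mean, so exactly half the area lies below the mean.
print(d.cdf(d.mean))                           # 0.5
# About 95% of the area lies within 1.96 standard deviations of the mean.
print(round(d.cdf(1.96) - d.cdf(-1.96), 3))    # 0.95
```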
Confidence Intervals

Using a sample statistic to estimate a population parameter is subject to inaccuracies (bias / random error).

• A Confidence Interval is a range of values that is likely to contain a population parameter at a certain degree of confidence. This degree of confidence is known as the confidence level and is usually expressed as a percentage (%).

• To construct a confidence interval for a population proportion:

p* ± z* × √(p*(1 − p*)/n)

where:
p* = sample proportion
z* = "z-value" from the standard normal distribution (table)
n = sample size

• To construct a confidence interval for a population mean µ:

x̄ ± t* × s/√n

where:
x̄ = sample mean
t* = "t-value" from the t-distribution (table)
s = sample standard deviation
n = sample size

• Interpreting a Confidence Interval: Two parts: the confidence level (e.g. 95%) and the interval (e.g. 0.254 ± 0.0191, where 0.0191 is the margin of error). This means: we are 95% confident that the population proportion (the parameter in this case) of food transactions that are from Terrace (a certain category) lies within the confidence interval.

Idea of the confidence level: about 95 out of 100 SRSs of the same size will produce intervals that contain the population parameter (whose exact value is not known). (** Not a 95% chance: the chance lies in the sampling procedure, not in the parameter.)

• Properties of CI: The larger the sample size, the smaller the random error and the narrower the CI. The higher the confidence level, the wider the CI. A CI is a way to quantify random error.
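The proportion interval above can be computed without a z-table by taking z* from the inverse normal CDF; a minimal sketch with hypothetical sample counts:

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_star, n, level=0.95):
    """CI for a population proportion: p* ± z* × sqrt(p*(1 − p*)/n)."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)   # ≈ 1.96 for 95%
    moe = z_star * sqrt(p_star * (1 - p_star) / n)   # margin of error
    return p_star - moe, p_star + moe

# Hypothetical sample: 127 of 500 transactions fall in the category.
lo, hi = proportion_ci(127 / 500, 500)
print(f"95% CI: {lo:.3f} to {hi:.3f}")

# A higher confidence level widens the interval.
lo99, hi99 = proportion_ci(127 / 500, 500, level=0.99)
assert hi99 - lo99 > hi - lo
```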
Hypothesis Testing

1. Null and Alternative Hypothesis.

• The null hypothesis usually asserts a stand of no effect / no difference. The alternative hypothesis is what we wish to confirm and pit against the null hypothesis. (Mutually exclusive.) e.g.
Null Hypothesis H₀: P(H) = 0.5
Alt. Hypothesis H₁: P(H) > 0.5

2. Collect data and determine the test statistic.

• Testing usually involves some random variable and its probability distribution (e.g. coin tosses, vaccine safety).

3. Set the level of significance and compute the p-value.

• Significance level: How convincing the evidence must be to reject H₀. The lower the significance level, the greater the evidence needed. Commonly used levels are 0.05 (5%), 0.1 (10%) and 0.01 (1%).

• p-value: The probability of obtaining a test result at least as extreme as the result observed, assuming the null hypothesis is true. Equivalently, the probability of observing a test result that favours the alternative hypothesis at least as much as the current sample does, assuming the null hypothesis is true.

4. Compare the p-value and the level of significance.

• We reject the null hypothesis in favour of the alternative if p-value < significance level (under H₀ the observation would be very unlikely).

• However, if p-value > significance level, we do not reject the null hypothesis. (We cannot "accept" H₀; this does not mean H₀ is true. We do not know whether the observation is due to chance; the test is inconclusive.)

• We only carry out a hypothesis test with sample data. When given population data, everything can be determined directly.

Common Hypothesis Tests: One-sample t-test and Chi-squared test.
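The four steps can be walked through end-to-end for the coin example H₀: P(H) = 0.5 vs H₁: P(H) > 0.5, using an exact one-sided binomial p-value. The data (60 heads in 100 tosses) are hypothetical:

```python
from math import comb

def p_value(heads, tosses, p0=0.5):
    """P(at least `heads` heads in `tosses` tosses), assuming H0 is true."""
    return sum(comb(tosses, k) * p0**k * (1 - p0)**(tosses - k)
               for k in range(heads, tosses + 1))

# Steps 2-3: hypothetical data and the resulting p-value.
p = p_value(60, 100)
print(round(p, 4))          # 0.0284

# Step 4: compare with the significance level.
alpha = 0.05
print("reject H0" if p < alpha else "do not reject H0")  # reject H0
```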