NUS GEA1000 Quantitative Guide

biz law

Uploaded by

christiinelhz
Copyright
© © All Rights Reserved

GEA1000-cheatsheet - Summary made.

Quantitative reasoning with data (National University of Singapore)


GEA1000 Summary
AY22/23 Sem 2
github.com/gerteck

1. Data Collection

Bias
• Selection Bias: Associated with the researcher's biased selection of units. Imperfect sampling frame (units excluded). Caused by non-probability sampling.
• Non-Response Bias: Associated with participants' non-participation, or non-disclosure of (sensitive) information.

Probability Sampling
Four types. Every unit has a known non-zero probability of being selected (need not be the same). Element of chance to eliminate bias. Randomized mechanism.
• SRS (Simple Random Sampling): All units selected randomly without replacement, with equal chance. Subject to non-response.
• Systematic Sampling: Apply some selection interval k and a random starting point from the first interval. List should be random.
• Stratified Sampling (some units of all groups): Divide into strata based on similar nature; size may vary. Apply SRS to each stratum.
• Cluster Sampling (whole clusters of only certain clusters): Divide into clusters. A fixed number of clusters is chosen using SRS, and all units in the chosen clusters are used.

Non Probability Sampling
Selection not done by randomisation but by human discretion. Broad types include (not mutually exclusive): Quota, Convenience, Judgement, Volunteer Sampling.
• Convenience Sampling: Subjects most easily available to participate, e.g. mall surveys.
• Volunteer Sampling: Self-selected sample, biased and non-representative.

Approach + Generalizability Criteria
• Choose sampling frame. (Larger than or equal to target population; members of the target population must not be left out.)
• Sample from sampling frame. (Decide if probability sampling in the sampling frame is feasible.)
• Remove unwanted units.
• Generalizability Criteria: Good sampling frame that covers the target population, probability-based sampling (needed to minimise selection bias), large sample size (helps to reduce variability of data and the error in the sample estimate), minimal non-response rate.

Categorical Variables:
Either category or label values (mutually exclusive, variable cannot be placed in two different categories).
Ordinal Variables: Natural ordering, numbers represent order (e.g. Happiness).
Nominal Variables: No intrinsic ordering (e.g. Eye colour).

Numerical Variables:
Discrete Variables: Possible values form a set of numbers with "gaps", e.g. Number of siblings.
Continuous Variables: Can take on all possible values in an interval, e.g. Time.

Summary Statistics for Numerical Variables
Central Tendency Measures: Mean, Median, Mode
Mean: Adding a constant to all values changes the mean by that constant. Multiplying all values by a constant multiplies the mean by that constant.
Median: Middle value of the (ascending/descending ordered) data set. Overall median will always be between the lowest and highest median amongst all subgroups.
Mode: Value that appears the most often.
Dispersion Measures: Standard deviation, Inter-quartile Range
Standard deviation: Measures how far the data points typically lie from the mean. Measure of data distribution/spread.
Coefficient of Variation: standard deviation / mean.
Quartile 1: 25th percentile value. Quartile 3: 75th percentile value. IQR: Q3 - Q1.

Experimental Study
Controlled experiment: manipulate the independent variable to observe the effect on the dependent variable. Goal is to provide evidence for a cause-effect relationship. Make sure the independent variable is the only factor, through random assignment. (Uses probability to allocate subjects into treatment and control groups.) By the law of probability, subjects will tend to be similar in all aspects.
Placebo: Inactive substance; any effect is likely caused by the psychology of believing.
Double Blinding: Patients and researchers both unaware of grouping.

Observational Study
Used when there are ethical issues. Observes individuals and measures the variable of interest, without direct manipulation of variables. Does not provide convincing evidence of a cause-effect relationship, only of association.

2. Categorical Data

Joint Rate: Chance of an event occurring out of all the possible outcomes.
Conditional Rate: Based on a given condition (X), in which the rate of success/failure is found: Rate(Success|X).

Association: Positive / Negative Association. If there is no association, we write that
rate(A|B) = rate(A|NB)
Four comparisons are mathematically equivalent (shown here for positive association):
rate(A|B) > rate(A|NB)
rate(B|A) > rate(B|NA)
rate(NA|NB) > rate(NA|B)
rate(NB|NA) > rate(NB|A)
Symmetry Rule on Rates: rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA) (likewise with < and =).
Basic Rule on Rates: The overall rate(A) will always lie between rate(A|B) and rate(A|NB).
• Simpson's Paradox: A phenomenon in which a trend appears in more than half of the groups of data but disappears or reverses when the groups are combined. Here, "disappears" means the two variables in question (say A and B) are no longer associated: the rate of A given B is now equal to the rate of A given not B.
• Confounder: A third variable that is associated with both the independent and dependent variables whose relationship is being investigated. (Can be positive or negative association.) Confounders can be addressed by slicing the data according to the confounding variable or by randomized assignment (general solution across all confounders).
• Observation of Simpson's paradox implies that there is definitely a (third) confounding variable present. However, existence of a confounder does not necessarily lead to Simpson's paradox, nor does lack of observation imply lack of a confounder.

3. Numerical Data

Univariate EDA
Exploratory Data Analysis of univariate (one variable) numerical data: Consider Distribution, Histograms, Boxplots.
Describing Distributions (Overall Pattern + Deviations): Focus on shape, centre and spread of the distribution, and outliers. Shape (by mode): unimodal or multimodal (several local maxima). Spread (standard deviation, range of distribution): low variability vs. high variability. Also note outliers.
Median and Mode are robust statistics - Outliers have little to no effect on these values. (e.g. median salary)

Histograms
• Graphical representation that organises data points into ranges/bins. Useful for large data sets.
• Histogram vs. Bar Graph: A histogram shows the distribution of a numerical variable across a number line, but a bar graph makes comparisons across categories of a variable. The ordering of bars in a histogram cannot be changed, unlike in a bar graph. No gaps between bars in a histogram.

Boxplots
• Five Number Summary: Minimum, Q1 (25th), Median (Q2), Q3 (75th), Maximum.
• Outliers: Greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR.
• Understanding boxplots: Shape, Centre and Spread.
Shape: left-skewed vs right-skewed (variability of data on lower and upper half respectively).
Centre: Described by the median. Cross represents the mean. We can compare the relative positions of the median and mean from the boxplot.
Spread: IQR gives us an idea of the spread of the middle 50% of the data set; used to compare spread across different distributions.
Boxplots vs. Histograms:
Histogram: Better sense of the shape of the distribution of a variable. Boxplot: Better identifies and indicates outliers.
Bottom line: Used together to complement each other.

Bivariate EDA
Focus on the relationship between two variables in a population.
• Deterministic Relationship: Value of one variable can be determined exactly from the other. (e.g. Conversion of units of measurement, m ⇔ ft, °C ⇔ °F.)
• Association (Non-Deterministic): Statistical relation; given the value of one variable, we can describe the average value of the other variable.
• Consider scatterplots (idea of pattern), correlation coefficients (check for linear relation) and regression analysis (fitting a line or curve to data).

Scatter Plots
Direction, Form, Strength and Outliers.
• Direction: Positive / Negative relationship or neither (curved).
• Form: General shape, classify as linear or non-linear.
• Strength: How closely data follows form.
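The four probability sampling schemes listed under Probability Sampling can be sketched in a few lines. This is a minimal illustration, not part of the original cheatsheet: the function names, the fixed per-stratum sample size, and the toy population are all assumptions made for the example.

```python
import random

def simple_random_sample(units, n, rng):
    """SRS: draw n units without replacement, each equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Pick a random start in the first interval of length k, then every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Apply SRS inside every stratum (some units of all groups).
    A fixed size per stratum is an assumption of this sketch."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters by SRS; every unit in a chosen cluster is used."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population of 100 numbered units.
rng = random.Random(2023)
population = list(range(100))
srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(
    {"year1": population[:60], "year2": population[60:]}, 5, rng)
clustered = cluster_sample(
    {"hall_A": [1, 2, 3], "hall_B": [4, 5], "hall_C": [6, 7, 8, 9]}, 2, rng)
```

Note the difference in what chance acts on: stratified sampling randomises within every group, while cluster sampling randomises which groups are taken whole.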
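The Basic Rule on Rates and Simpson's Paradox from the Categorical Data section are easier to see with numbers. The sketch below uses illustrative counts (treatments "X" and "Y", subgroups "small" and "large"), which are assumptions and not data from the module. X has the higher success rate in both subgroups, yet Y has the higher rate once the subgroups are combined.

```python
# Illustrative (successes, total) counts per treatment and subgroup.
counts = {
    "X": {"small": (81, 87), "large": (192, 263)},
    "Y": {"small": (234, 270), "large": (55, 80)},
}

def subgroup_rate(treatment, group):
    """Conditional rate: rate(Success | treatment, group)."""
    successes, total = counts[treatment][group]
    return successes / total

def overall_rate(treatment):
    """rate(Success | treatment) with the subgroups combined."""
    successes = sum(s for s, _ in counts[treatment].values())
    total = sum(t for _, t in counts[treatment].values())
    return successes / total

for treatment in counts:
    low = min(subgroup_rate(treatment, g) for g in counts[treatment])
    high = max(subgroup_rate(treatment, g) for g in counts[treatment])
    # Basic rule: the overall rate lies between the subgroup rates.
    print(treatment, round(low, 3), round(overall_rate(treatment), 3), round(high, 3))

# Simpson's paradox: X wins in every subgroup but loses overall.
x_wins_each_group = all(
    subgroup_rate("X", g) > subgroup_rate("Y", g) for g in ("small", "large"))
x_wins_overall = overall_rate("X") > overall_rate("Y")
```

The reversal happens because subgroup size acts as a confounder here: X is used mostly on the "large" subgroup, which has lower success rates for both treatments.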
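The five number summary and the 1.5 × IQR outlier rule from the Boxplots section can be checked on a small data set. Quartile conventions differ between textbooks and software, so the "inclusive" method of Python's `statistics.quantiles` used here is an assumption, and the data values are made up for illustration.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

def outliers(data):
    """Values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in data if v < lower_fence or v > upper_fence]

# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)   # (1, 3.25, 5.5, 7.75, 100)
flagged = outliers(sample)              # [100]

# The mean is pulled towards the extreme value; the median is robust to it.
mean_value = statistics.mean(sample)    # 14.5
median_value = statistics.median(sample)  # 5.5
```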

Correlation Coefficient, r
The correlation coefficient between two numerical variables, r, is a measure of linear association between them. Always ranges between -1 and 1.
• Sign and Magnitude of r: The sign tells us the direction of the linear association. If r > 0, the association is positive: when one variable increases, the other tends to increase as well. If r < 0, the association is negative: when one variable increases, the other tends to decrease. If r = 1 or r = −1, there is perfect positive/negative association. When r = 0, there is no linear association. The magnitude of r tells us the strength of the linear association. Approx: (0 - 0.3 weak, 0.3 - 0.7 moderate, 0.7 - 1 strong)
• Calculation of r:
$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
• Properties of r: r is not affected by adding a number to all values of a variable, or by multiplying all values of a variable by a positive number.
• Limitations of r: Association is not causation. r does not give an indication of non-linear association. Outliers can affect the correlation coefficient r significantly.

Linear Regression
If we believe that two variables are linearly associated, we may model the relationship by fitting a straight line to the observed data, known as linear regression.
• The slope of the line is the amount of change in Y when the value of X increases by 1.
• Finding Regression Line: Method of least squares: Fit the line to minimize the sum of the squared error terms. Hence, the two regression lines (Y on X and X on Y) are different and not interchangeable.
• Slope vs. Correlation Coefficient: The slope of the regression line and the correlation coefficient are related by:
$m = \frac{s_y}{s_x} r$
where $s_y$ is the standard deviation for y and $s_x$ is the standard deviation for x.
• Important to remember that the correlation coefficient is not necessarily equal to the gradient of the regression line.
• Extrapolation: Prediction beyond the observed range is dangerous (not advisable).
• Linear Regression on Non-Linear Models: Model the relationship indirectly (e.g. property of log) to form a linear relation.

4. Statistical Inference
Statistical Inference is the use of samples to draw inferences or conclusions about the population in question.

Probability
Probability as a mathematical means to reason about uncertainty.
• Sample Space: Collection of all possible outcomes of a probability experiment.
• Event: A subcollection of the sample space is an event.
• Rules of Probability: Probability of an event E, P(E), is between 0 and 1 inclusive. Probability of the entire sample space, P(S), is 1.
• If E and F are mutually exclusive events, then the probability of E union F is equal to the sum of the probabilities of E and F. That is,
$P(E \cup F) = P(E) + P(F)$
• Uniform Probability and Rates: Way of assigning probabilities to outcomes such that equal probability is assigned to every outcome in the finite sample space. Relevant in random sampling.

Conditional Probability and Independence
Conditional Probability is written using the notation P(E|F) and read as "probability of E given F".
$P(E|F) = \frac{P(E \cap F)}{P(F)}$
• Mutually Exclusive Events: No overlap between E and F, meaning not simultaneously possible. Then, P(E ∩ F) = 0. If an event F itself cannot occur, then by convention P(E ∩ F) is also equal to 0.
• Law of Total Probability: P(E) = P(E|F) × P(F) + P(E|not F) × P(not F).
• Analogy between Probability and Sampling: Conditional probabilities are equivalent to conditional rates:
P(A|B) = rate(A|B)
• Independent Events: For independent events A and B, the probability of A is the same as the probability of A given B.
P(A) = P(A|B)
If we express the conditional probability P(A|B) as
$\frac{P(A \cap B)}{P(B)}$
then A and B being independent means that
$P(A) \times P(B) = P(A \cap B)$
which is an equivalent definition for two independent events.
• Independence as non-association: A and B are independent events whenever A and B are not associated with each other.
• Independent Probability Experiments: E.g. Coin toss, where one instance is independent of the other.

Random Variables
A random variable is a numerical variable with probabilities assigned to each of the possible numerical values taken by the numerical variable. Conceived as a mathematical way to model data distribution.
• May be Discrete or Continuous Random Variables.
• For a discrete rv, the sum of the probabilities assigned to each outcome must equal 1. For a continuous rv, the area under the density curve is always equal to 1.

Normal Distributions
A class of continuous random variables, written N(x, y) (bell curve).
• Normal Distributions only differ by means and variances (mean x, variance y).
• Common Properties: Bell-shaped curve, peak of curve occurs at the mean, curve is symmetrical about the mean. (Mean = Mode = Median).

Confidence Intervals
Using a sample statistic to estimate the population parameter is subject to inaccuracies (bias / random error).
• A Confidence Interval is a range of values that is likely to contain a population parameter based on a certain degree of confidence. This degree of confidence is known as the confidence level and is usually expressed as a percentage (%).
• To construct confidence intervals for a population proportion:
$p^* \pm z^* \times \sqrt{\frac{p^*(1 - p^*)}{n}}$
where:
p* = sample proportion
z* = "z-value" from standard normal distribution (table)
n = sample size
• To construct confidence intervals for a population mean µ:
$\bar{x} \pm t^* \times \frac{s}{\sqrt{n}}$
where:
x̄ = sample mean
t* = "t-value" from t-distribution (table)
s = sample standard deviation
n = sample size
• Interpreting Confidence Interval:
Two parts: Confidence Level (e.g. 95%) and Interval (e.g. 0.254 ± 0.0191 [margin of error]).
This means: we are 95% confident that the population proportion (parameter in this case) of food transactions that are from Terrace (a certain category) lies within the confidence interval.
Idea of confidence level: about 95 out of 100 confidence intervals built from SRS of the same size will contain the population parameter. (Exact value not known.) (** Not a 95% chance: the chance is in the sampling procedure, not the parameter.)
• Properties of CI: The larger the sample size, the smaller the random error and the narrower the CI. The higher the confidence level, the wider the CI. A CI is a way to quantify random error.

Hypothesis Testing
1. Null and Alternative Hypothesis.
• The null hypothesis usually asserts a stand of no effect / no difference. The alternative is what we wish to confirm and pit against the null hypothesis. (Mutually exclusive) e.g.
Null Hypothesis H₀: P(H) = 0.5
Alt. Hypothesis H₁: P(H) > 0.5
2. Collect data and determine test statistic.
• Testing usually involves some random variable and its probability distribution. (e.g. coin, vaccine safety)
3. Set level of significance and compute p-value.
• Significance level: How convincing the evidence must be to reject H₀.
• The lower the significance level, the greater the evidence needed. Commonly used is the 0.05 level (5% level of significance), or 0.1 (10%), or 0.01 (1%).
• p-value: Probability of obtaining a test result at least as extreme as the result observed, assuming the null hypothesis is true. Also the probability of observing a test result that favours the alternative hypothesis at least as much as the one observed in the current sample, assuming the null hypothesis is true.
4. Compare p-value and level of significance.
• We reject the null hypothesis in favour of the alternative if
p-value < significance
(logically, the observed result is very unlikely under the null hypothesis)
• However, if
p-value > significance
we do not reject the null hypothesis (we cannot accept it; this does not mean H₀ is true) (we don't know if the observation is due to chance: inconclusive).
• We only carry out a hypothesis test with sample data. When given population data, all can be determined.
Common Hypothesis Tests: One-sample t-test and Chi-squared test.
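The formula for r and the slope relation m = (s_y / s_x) r from the Correlation Coefficient and Linear Regression sections can be verified numerically. This is a sketch on made-up data; the function names are chosen for illustration only.

```python
import math
import statistics

def correlation(x, y):
    """r computed from the sums formula in the Correlation Coefficient section."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def slope_from_r(x, y):
    """Slope of the regression line of y on x via m = (s_y / s_x) * r."""
    return statistics.stdev(y) / statistics.stdev(x) * correlation(x, y)

def least_squares_slope(x, y):
    """Slope that minimises the sum of squared errors, for comparison."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Made-up data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)              # about 0.775
m = slope_from_r(xs, ys)             # about 0.6, not equal to r
m_check = least_squares_slope(xs, ys)

# Property of r: unchanged by adding a constant or multiplying by a positive number.
r_shifted = correlation([a + 10 for a in xs], [3 * b for b in ys])
```

Regressing x on y instead gives a slope of (s_x / s_y) r, which is why the two regression lines are not interchangeable.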
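The two confidence interval formulas in the Confidence Intervals section can be evaluated directly. In the sketch below, the sample size n = 2000 is a hypothetical value (the cheatsheet does not state the sample size behind 0.254 ± 0.0191), and t* is passed in by the caller, since the document reads it from a table.

```python
import math
from statistics import NormalDist

def proportion_ci(p_star, n, confidence=0.95):
    """p* ± z* × sqrt(p*(1 − p*)/n); returns (lower, upper, margin of error)."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z_star * math.sqrt(p_star * (1 - p_star) / n)
    return p_star - margin, p_star + margin, margin

def mean_ci(x_bar, s, n, t_star):
    """x̄ ± t* × s/sqrt(n); t* is looked up in a t-table by the caller."""
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# Hypothetical sample: p* = 0.254 with an assumed n = 2000.
low, high, margin = proportion_ci(0.254, 2000)

# Properties of CI: larger n narrows the interval, higher confidence widens it.
_, _, margin_big_n = proportion_ci(0.254, 8000)
_, _, margin_99 = proportion_ci(0.254, 2000, confidence=0.99)

# Hypothetical mean example: x̄ = 10, s = 2, n = 25, t* = 2.064 (95%, 24 d.f.).
mean_low, mean_high, mean_margin = mean_ci(10, 2, 25, 2.064)
```

With these assumed inputs the margin of error comes out close to the 0.0191 quoted in the interpretation example, and quadrupling n halves it.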
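The four hypothesis testing steps can be traced with the coin example, H₀: P(H) = 0.5 against H₁: P(H) > 0.5. The observed result (16 heads in 20 tosses) and the use of an exact binomial tail probability are assumptions for illustration; the cheatsheet itself only names the one-sample t-test and Chi-squared test as common tests.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): results at least as extreme as k heads."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Step 1: H0: P(H) = 0.5, H1: P(H) > 0.5.
p_null = 0.5

# Step 2: collect data (hypothetical) and take the number of heads as the test statistic.
tosses, heads = 20, 16

# Step 3: set the significance level and compute the p-value under H0.
significance = 0.05
p_value = binomial_upper_tail(heads, tosses, p_null)

# Step 4: compare the p-value with the significance level.
reject_null = p_value < significance
```

Here the p-value is 6196 / 2^20, roughly 0.006, so H₀ is rejected at the 5% level. With 12 heads instead, the p-value would exceed 0.05 and the test would be inconclusive, not proof that the coin is fair.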
