ITCS 6100: Big Data for Competitive Advantage
Data Driven Decision Making: A/B Testing
Dr. Gabriel Terejanu
https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499
A/B Testing
• A/B testing is a method of comparing two versions of a web
page or app to see which one performs better.
• This is typically done by randomly dividing website visitors or
app users into two groups, and showing each group a different
version of the page or app.
• The version that performs better is then chosen for further use.
Example A/B testing use cases
• What to test:
• Navigation links
• Calls to action (CTAs)
• Design/layout
• Content offer
• Headline
• Email subject line
• Friendly email “from” address
• Images
• Social media buttons (or other buttons)
• Logos and taglines/slogans
• Where to test:
• mobile apps
• website pages
• components on web pages
• emails
• newsletters
• advertisements
• text messages
https://www.oracle.com/cx/marketing/what-is-ab-testing/
A/B testing process
• Randomly subset the users into two groups.
• Control group – interacts with the current state of your product.
• Treatment group – interacts with the variant(s) of the product that you want to test.
• Monitor which product (control or treatment) performs best.
Metrics – quantify performance
• Metrics are performance indicators that we want to minimize or
maximize.
• They indicate how engaged the users are with your product.
• conversion rates
• signups
• subscriptions
• time spent on the site
• click-through rates
Why control and treat at the same time?
• An alternative (bad idea) is to do it sequentially
• Measure the metrics on the current version
• Make the change to your product
• Then measure the metrics on the new version
• Flaw – misses factors such as external events, temporal trends
and seasonality.
• Differences in control and treatment performance will be
impacted by these factors that have nothing to do with the
proposed improvements of the product.
• Very challenging to estimate just the contribution of the
proposed improvements.
Why control and treat at the same time?
[Figure: timeline illustration. In the sequential approach (BAD idea), an external event – Oprah calls Kindle "her new favorite thing" – occurs between the two measurement periods and distorts the comparison. In the simultaneous approach (GOOD idea), the event affects control and treatment equally.]
The importance of randomization
• Random assignment isolates the impact of product changes.
• (Bad idea) Assignment criteria that are not random may introduce confounders, which make it challenging to estimate the contribution of the proposed changes. Example: assignments based on time, region, demographics, etc.
• Confounders are factors that affect the relationship between the treatment and the performance metric. They introduce bias into the A/B test. Example – type of device used:
• Version A optimized for desktop users
• Version B optimized for mobile users
• Even though version B may have a higher conversion rate, it could be because more mobile users are visiting the site, rather than because version B of the page is better.
Multivariate test
• Multivariate tests examine the effect of multiple variables on the outcome of interest.
• Unlike traditional A/B testing, multivariate testing allows you to
test multiple elements of a page or product simultaneously, by
creating multiple versions of the page that vary by more than
one element.
• For example, you can test different headlines, images, and call-
to-action buttons all at once.
• Full factorial design tests all possible combinations of the
multiple variables.
• This requires higher traffic as it depends on the number of
variables that you are testing. Most websites and email
campaigns would struggle to find the traffic to support that.
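To see why a full factorial design demands so much traffic, here is a minimal sketch; the elements and variant names are hypothetical:

```python
from itertools import product

# Hypothetical page elements and their candidate variants
headlines = ["H1", "H2"]
images = ["img1", "img2", "img3"]
ctas = ["Buy now", "Learn more"]

# Full factorial design: every combination of variants becomes a page version,
# and each version needs enough visitors for a reliable comparison
versions = list(product(headlines, images, ctas))
print(len(versions))  # 2 * 3 * 2 = 12 versions to test
```

The version count grows multiplicatively with each added element, which is why most sites lack the traffic for full factorial tests.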
Use cases for A/B testing
• Use A/B testing when users are impacted individually
• Test changes that can directly impact their behavior
• DO NOT use A/B testing when the problem exhibits network
effect among users
• Challenging to untangle the impact of the test
Class Experiment
What is your price?
Analyzing the results - Example
Time spent on a webpage [seconds]:
A: [ 8.50 0.93 2.18 6.83 9.14 9.12 1.23 6.47 9.60 9.26 ]
B: [ 6.93 2.16 3.07 4.06 1.76 7.78 2.09 6.75 10.24 8.79 ]
Analyzing the results – Sufficient?
Time spent on a webpage [seconds]:
mean(A) = 6.33 > mean(B) = 5.36
A: [ 8.50 0.93 2.18 6.83 9.14 9.12 1.23 6.47 9.60 9.26 ]
B: [ 6.93 2.16 3.07 4.06 1.76 7.78 2.09 6.75 10.24 8.79 ]
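The comparison above can be reproduced with a few lines of NumPy, using the data from the slide:

```python
import numpy as np

# Time spent on the webpage [seconds] for each variant
A = np.array([8.50, 0.93, 2.18, 6.83, 9.14, 9.12, 1.23, 6.47, 9.60, 9.26])
B = np.array([6.93, 2.16, 3.07, 4.06, 1.76, 7.78, 2.09, 6.75, 10.24, 8.79])

print(A.mean())  # ≈ 6.33
print(B.mean())  # ≈ 5.36
```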
Repeat with different users multiple times
Central Limit Theorem (CLT): for a large enough sample size,
the distribution of the sample mean will be approximately
normal, regardless of the underlying distribution of the
population from which the sample is drawn.
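The CLT can be illustrated with a quick simulation; this is a sketch where the exponential population and the sample size are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed (non-normal) population: exponential with mean 5
population = rng.exponential(scale=5.0, size=100_000)

# Repeat the experiment: draw many samples of size n, record each sample mean
n = 50
sample_means = np.array(
    [rng.choice(population, size=n).mean() for _ in range(2000)]
)

# CLT: the sample means cluster around the population mean, approximately
# normally, with spread close to population std / sqrt(n)
print(sample_means.mean())  # close to 5
print(sample_means.std())   # close to 5 / sqrt(50) ≈ 0.71
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the population itself is strongly skewed.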
What is the error in our mean estimates?
Standard error (SE) is a measure of the
variability of a statistic, such as the mean,
estimated from a sample.
It is calculated as the standard deviation of
the sampling distribution of a statistic.
It is used to indicate the precision of the
estimate of a population parameter.
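For a single sample, the standard error of the mean is estimated as the sample standard deviation divided by sqrt(n). A minimal sketch on the group-A data from the earlier slide:

```python
import numpy as np

A = np.array([8.50, 0.93, 2.18, 6.83, 9.14, 9.12, 1.23, 6.47, 9.60, 9.26])

# Standard error of the mean: sample std (ddof=1 for the unbiased
# variance estimate) divided by sqrt(n)
se = A.std(ddof=1) / np.sqrt(len(A))
print(se)  # ≈ 1.12
```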
What is the error in our mean estimates?
Are we still confident to say that
mean(A) > mean(B) ?
(these are sample means and not
the true population means)
Sample mean vs True mean
• Denote sample_mean(A) = mean(A)
• true_mean(A) is not known but rather a random variable
sample_mean(A) ~ Normal( true_mean(A), SE(A)^2 )
( sample_mean(A) − true_mean(A) ) / SE(A) ~ Student-t(n−1)
Interested in Average Treatment Effect (ATE)
• ATE is a measure of the difference in outcomes between a
group of individuals who receive a treatment and a group of
individuals who do not receive the treatment
true_ATE = true_mean(B) - true_mean(A)
• We have access to an unbiased estimate given by our
observations:
sample_ATE = sample_mean(B) - sample_mean(A)
sample_ATE ~ Normal(true_ATE, SE(A)^2+SE(B)^2)
( sample_ATE − true_ATE ) / sqrt( SE(A)^2 + SE(B)^2 ) ~ Student-t(2n−2)
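Putting these formulas together for the example data; this is a sketch, and `sample_ATE` / `se_ATE` are my variable names:

```python
import numpy as np

A = np.array([8.50, 0.93, 2.18, 6.83, 9.14, 9.12, 1.23, 6.47, 9.60, 9.26])
B = np.array([6.93, 2.16, 3.07, 4.06, 1.76, 7.78, 2.09, 6.75, 10.24, 8.79])

# Unbiased estimate of the average treatment effect
sample_ATE = B.mean() - A.mean()

# Standard error of the difference of two independent sample means
se_A = A.std(ddof=1) / np.sqrt(len(A))
se_B = B.std(ddof=1) / np.sqrt(len(B))
se_ATE = np.sqrt(se_A**2 + se_B**2)

print(sample_ATE)  # ≈ -0.96
print(se_ATE)      # ≈ 1.49
```

Here the estimated effect is smaller than its standard error, which already hints that the difference may not be statistically significant.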
We want to perform Hypothesis testing
• In statistics, a hypothesis is a statement or claim about a
property of a population or a relationship between two or more
populations.
• There are two types of hypotheses:
• Null hypothesis (H0): This is the default assumption that there
is no significant difference or relationship between the groups or
variables being studied. It is usually a statement of equality (e.g.
the means of two groups are equal)
• Alternative hypothesis (Ha): This is the hypothesis that is put
forward as an alternative to the null hypothesis. It is usually a
statement of inequality (e.g. the means of two groups are not
equal) or a statement of difference (e.g. the mean of a sample is
different from a specified value).
Hypothesis testing
Null hypothesis H0: true_ATE = 0
Alternative hypothesis Ha: true_ATE > 0
• The hypothesis we want to test is if Ha is “likely” true.
• Two outcomes are possible
• Reject H0 and accept Ha because of sufficient evidence in
favor of Ha
• Do not reject H0 because of insufficient evidence
• This does not mean that the null hypothesis is true!
Null Hypothesis
H0: true_ATE = 0
[Figure: sampling distribution of sample_ATE under H0, centered at true_ATE = 0.]
Hypothesis testing
t-statistic = sample_ATE / sqrt( SE(A)^2 + SE(B)^2 ) ~ Student-t(2n−2)
Significance level – usually 0.05
p-value is a probability value that is used in statistical hypothesis testing to indicate the level of evidence against a null hypothesis. The p-value represents the probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming that the null hypothesis is true. Usually reject the null hypothesis if p-value < 0.05.
[Figure: Student-t density with the rejection region beyond t-critic shaded; the p-value is the tail area beyond t-stat.]
https://web.northeastern.edu/dummit/teaching_su20_3081/probstat_4_hypothesis_testing_v1.01.pdf
Python Example A/B testing
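A sketch of such an example using `scipy.stats.ttest_ind` on the time-on-page data from the earlier slides. With `equal_var=True` this is the pooled Student-t(2n−2) test matching the slides' formula (SciPy ≥ 1.6 is needed for the `alternative` argument; `equal_var=False` would give the more robust Welch variant):

```python
import numpy as np
from scipy import stats

A = np.array([8.50, 0.93, 2.18, 6.83, 9.14, 9.12, 1.23, 6.47, 9.60, 9.26])
B = np.array([6.93, 2.16, 3.07, 4.06, 1.76, 7.78, 2.09, 6.75, 10.24, 8.79])

# H0: true_ATE = 0,  Ha: true_ATE > 0, where ATE = mean(B) - mean(A)
t_stat, p_value = stats.ttest_ind(B, A, equal_var=True, alternative="greater")

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject H0: evidence that B increases time on page")
else:
    print("Do not reject H0: insufficient evidence")
```

On this data mean(B) < mean(A), so the t-statistic is negative, the one-sided p-value is large, and we do not reject H0.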
Test anchoring bias – Mini-assignment
• Anchoring bias, also known as the anchoring effect, is a cognitive bias that occurs when an individual relies too heavily on an initial piece of information, known as the "anchor," when making subsequent judgments.
• For example, if an individual is asked to estimate the number of doctors in a city and is given an anchor of 1000, they will likely provide a higher estimate than if they were given an anchor of 100.
[Figures A and B: the two survey variants shown in class.]
H0: mean(B) = mean(A)
Ha: mean(B) > mean(A)