[MILESTONE] Run an A/B test to evaluate Edit Check (references) impact
Closed, Resolved, Public

Assigned To

Authored By

	ppelberg
	Jul 27 2023, 10:27 PM

Description

Reference Check is designed to increase the likelihood that newcomers and Junior Contributors who are editing from within Sub-Saharan Africa:

Publish edits that they are proud of and experienced volunteers consider useful
Return to edit again in the future

This task covers the work of running an A/B test (or perhaps a multivariate test [i]) to evaluate how effective this initial Edit Check has been at affecting newcomers and Junior Contributors in the ways described above.

Decision(s) To Be Made

1. Decide whether the impact Edit Check is having on users' behavior is positive enough for it to be made available by default at all Wikipedias.

Hypotheses

ID	Hypothesis	Metric(s) for evaluation
KPI	The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will include a reference or an explicit acknowledgement as to why these edits lack references.	1) Proportion of published edits that add new content and include a reference or explicit acknowledgement of why a citation was not added, 2) Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score, if we use the revision risk model (T317700, T343938))
Curiosity #1	Newcomers and Junior Contributors will be more aware of the need to add a reference when contributing new content because the visual editor will prompt them to do so in cases where they have not done so themselves.	Increase in the proportion of newcomers and Junior Contributors that publish at least one new content edit that includes a reference.
Curiosity #2	Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that includes a reference because Edit Check will have caused them to realize references are required when contributing new content to Wikipedia.	1) Proportion of newcomers and Junior Contributors that successfully publish an edit in which Edit Check was activated and return to make an unreverted edit in the main namespace during the identified retention period, 2) Proportion of newcomers and Junior Contributors that publish an edit in which Edit Check was activated and return to make a new content edit with a reference in the main namespace during the identified retention period.
~~Curiosity #3~~[v]	~~Newcomers and Junior Contributors will be more likely to change an unreliable reference if presented with that information while attempting to add a reference when contributing new content.~~	~~Proportion of newcomers and Junior Contributors that elect to "try new source" after being presented with the reference reliability check.~~
~~Curiosity #4~~	The quality of new content edits published by newcomers and Junior Contributors will increase because these contributors will be made aware that the reference they are adding is deemed to be unreliable and unfit for publishing on-wiki.	~~Proportion of new content edits by newcomers and Junior Contributors presented with the reference reliability check that are reverted within 48 hours (or have a high revision risk score).~~
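
A minimal sketch of how the two KPI proportions could be computed from a list of published edits. The record fields (`adds_new_content`, `has_reference`, `declined_with_reason`, `published_at`, `reverted_at`) are hypothetical names chosen for illustration and are not taken from the actual instrumentation; the 48-hour window comes from the metric definition above.

```python
from datetime import timedelta


def kpi_proportions(edits, window=timedelta(hours=48)):
    """Return (share with reference or acknowledgement, share reverted within window).

    `edits` is a list of dicts using hypothetical field names:
      adds_new_content (bool), has_reference (bool),
      declined_with_reason (bool), published_at (datetime),
      reverted_at (datetime or None).
    Only edits that add new content count toward either proportion.
    """
    new_content = [e for e in edits if e["adds_new_content"]]
    if not new_content:
        return None, None

    # KPI 1: edit includes a reference OR an explicit reason for omitting one.
    with_ref = sum(
        1 for e in new_content
        if e["has_reference"] or e["declined_with_reason"]
    )

    # KPI 2: edit was reverted no later than `window` after publication.
    reverted = sum(
        1 for e in new_content
        if e["reverted_at"] is not None
        and e["reverted_at"] - e["published_at"] <= window
    )

    total = len(new_content)
    return with_ref / total, reverted / total
```

Computing both proportions separately for the test and control buckets would give the comparison the KPI row describes.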

Leading indicators

See T352130.

Guardrails

This section describes the metrics we will use to make sure other important parts/dimensions of the "editing ecosystem" are not being negatively impacted by Edit Check. The scenarios named in the chart below emerged through T325851.

ID	Name	Metric(s) for Evaluation
1)	Edit quality decrease (T317700)	Proportion of published edits that add new content and are still reverted within 48 hours (or have a low revision risk score if we use the revision risk model (T317700)). Will include a breakdown of revert rate of published edits with and without a reference added.
2)	Edit completion rate drastically decreases	Proportion of edits that are started (`event.action = init`) and are successfully published (`event.action = saveSuccess`)
3)	Edit abandonment rate drastically increases	Proportion of contributors that are presented Edit Check feedback and abandon their edits (indicated by `event.action = abort` and `event.abort_type = abandon`).
4)	People shown Edit Check are blocked at higher rates	Proportion of contributors blocked after publishing an edit where Edit Check was shown
5)	High false positive or false negative rates	A) Proportion of new content edits published without a reference and without being shown edit check (indicator of false negative) & B) Proportion of contributors that dismiss adding a citation and select "I didn't add new information" or other indicator that their edit doesn't require a citation
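
A minimal sketch of how Guardrails 2 and 3 could be derived from editing events. The action values (`init`, `saveSuccess`, `abort`) and the `abandon` abort type come from the table above; the flat dict layout, the `session_id` key, and the `edit_check_shown` flag are assumptions made for illustration. Guardrail 3 is defined per contributor, whereas this sketch counts editing sessions as a simplification.

```python
from collections import defaultdict


def guardrail_rates(events):
    """Return (edit completion rate, edit abandonment rate).

    `events` is an iterable of dicts with hypothetical keys:
      session_id, action, abort_type (optional), edit_check_shown (optional).
    Completion rate: started sessions that reach saveSuccess.
    Abandonment rate: started sessions shown Edit Check that end in an
    abort of type "abandon".
    """
    sessions = defaultdict(
        lambda: {"actions": set(), "shown": False, "abandoned": False}
    )
    for event in events:
        session = sessions[event["session_id"]]
        session["actions"].add(event["action"])
        if event.get("edit_check_shown"):
            session["shown"] = True
        if event["action"] == "abort" and event.get("abort_type") == "abandon":
            session["abandoned"] = True

    started = [s for s in sessions.values() if "init" in s["actions"]]
    shown = [s for s in started if s["shown"]]

    completion = (
        sum("saveSuccess" in s["actions"] for s in started) / len(started)
        if started else None
    )
    abandonment = (
        sum(s["abandoned"] for s in shown) / len(shown)
        if shown else None
    )
    return completion, abandonment
```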

A/B Test: Decision Matrix

ID	Scenario	Indicator(s)	Plan of Action
1)	Edit Check is disrupting, discouraging, or otherwise getting in the way of volunteers who are attempting to make edits in good faith. Read: people are less likely to publish the edits they start.	Significant drop in edit completion and spike in edit abandonment in ~~good faith edit sessions~~ [iii][iv] where Edit Check is activated. Will include breakdown to review edits where reference reliability check was included.	Pause scaling plans; investigate changes to UX
2)	Edit Check is increasing the likelihood that people will publish destructive edits	Increase in the proportion of contributors blocked after publishing an edit in which Edit Check was activated; increase in the proportion of published edits in which Edit Check was activated that are reverted within 48 hours, relative to new content edits in which Edit Check was NOT activated.	Pause scaling plans, review edits to try to identify patterns of abuse and propose changes to UX to mitigate them
3)	Edit Check is causing people to publish edits that align with project policies	Increase in the proportion of edits in which Edit Check was activated that include a reference and are not reverted within 48 hours, relative to new content edits without a reference in which Edit Check was NOT activated	Move forward with scaling plans
4)	Edit Check is effective at causing people to accompany new content edits with a reference, but those references are unreliable	Increase in the proportion of published edits in which Edit Check was activated that include a reference, and increase or no change in the proportion of these edits that are reverted within 48 hours	Block scaling plans on reference reliability work (T276857)
5)	Edit Check is not effective at causing people to accompany new content edits with a reference but is not disruptive to volunteers.	No change or decrease in the proportion of published edits in which Edit Check was activated that include a reference and A) no significant drop in edit completion rate or spike in edit abandonment rate or B) no significant spike in block or revert rate	Move forward with scaling plans

i. Where a "multivariate test" in this context could look like tests wherein we compare: A) multiple variations of Reference Check user experiences or B) people who are shown the source editor by default, to people who are shown VE by default, and people who are shown VE by default with Edit Check activated, as @MNeisler and @DLynch raised offline
ii. See T331582#9132480
iii. Being able to distinguish edits made in good faith from those made in bad faith depends on T343938
iv. Per the reasons @MNeisler discovered and named in T343938#9368298, it is not feasible to use the revert risk model to assess whether an edit was made in good or bad faith: "This would require us to determine if it is a good-faith edit session while the user is attempting an edit, which is not feasible yet per engineering constraints @Pablo mentioned in T343938#9082581. The revision risk model requires a revision ID, which is only stored with published edits."
v. Per discussions with the Editing team, we have decided not to include the reference reliability check in this AB test. We will review the impact of this feature in a separate deployment.

Details

Subject	Repo	Branch	Lines +/-
Enrollment for the edit check a/b test	mediawiki/extensions/VisualEditor	wmf/1.42.0-wmf.18	+22 -0
EditAttemptStep: log buckets for the edit check test	mediawiki/extensions/WikimediaEvents	wmf/1.42.0-wmf.18	+11 -11
Launch the Visual Editor edit check a/b test	operations/mediawiki-config	master	+28 -0
Enrollment for the edit check a/b test	mediawiki/extensions/VisualEditor	master	+22 -0
EditAttemptStep: log buckets for the edit check test	mediawiki/extensions/WikimediaEvents	master	+11 -11


Related Objects

Status	Assigned	Task
Open	None	T265163 Create a system to encode best practices into editing experiences
Open	Trizek-WMF	T331946 [RELEASE TICKET] Make Edit Check (references) available to all newcomers at all Wikipedias
Resolved	MNeisler	T342930 [MILESTONE] Run an A/B test to evaluate Edit Check (references) impact
Resolved	Trizek-WMF	T345298 Identify participating wikis for Edit Check (references) A/B test
Resolved	MNeisler	T352129 Verify proposing A/B test wikis will provide sufficient edit volume to draw conclusions from
Resolved	ppelberg	T345300 [SPIKE] Decide whether adjustments are needed to Edit Check (references) heuristic
Resolved	ppelberg	T344454 QA edit check interaction data instrumentation
Resolved	DLynch	T346837 Implement bucketing for Edit Check (references) A/B test
Open	ppelberg	T343938 [SPIKE] How might the Editing Team leverage the "revert risk" model to identify high value checks?
Resolved	MNeisler	T352122 [Config] Start Edit Check (references) A/B test
Resolved	MNeisler	T352130 Report on Edit Check A/B test leading indicators
Open	Trizek-WMF	T352131 Publish findings from the Edit Check (references) A/B test
Resolved	MNeisler	T365741 Present Reference Check A/B test findings at June staff meeting
Resolved	Trizek-WMF	T366262 Publish findings from the Edit Check (references) A/B test on Diff
Resolved	Esanders	T361727 [Config] Stop the Edit Check (references) A/B test

Event Timeline


ppelberg mentioned this in T342404: [Analysis] Reference check concept validation (quantitative).Sep 6 2023, 8:13 PM

ppelberg added a subtask: T344454: QA edit check interaction data instrumentation.Sep 6 2023, 8:20 PM

Update: I've added the leading indicators @MNeisler and I discussed offline on 6 Sep 2023.

MNeisler mentioned this in T346837: Implement bucketing for Edit Check (references) A/B test.Sep 19 2023, 7:54 PM

ppelberg moved this task from Untriaged to Upcoming on the Editing-team board.Oct 3 2023, 6:41 PM

ppelberg added a subtask: T346837: Implement bucketing for Edit Check (references) A/B test.

MNeisler claimed this task.Oct 13 2023, 7:34 PM

MNeisler triaged this task as Medium priority.

MNeisler added a project: Product-Analytics.

MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.

ppelberg mentioned this in T343938: [SPIKE] How might the Editing Team leverage the "revert risk" model to identify high value checks?.Oct 18 2023, 8:20 PM

ppelberg updated the task description. (Show Details)Oct 18 2023, 9:57 PM

ppelberg added a subtask: T343938: [SPIKE] How might the Editing Team leverage the "revert risk" model to identify high value checks?.Oct 18 2023, 9:59 PM

ppelberg updated the task description. (Show Details)Oct 18 2023, 10:04 PM

ppelberg updated the task description. (Show Details)

I've updated the task description with the proposals @MNeisler and I converged on during today's (18 Oct 2023) offline meeting.

Next steps

T333709: [CONSULT] Collaborate with volunteers to define Edit Check key results
Discuss the proposed Decision matrix, Guardrails, Hypotheses, and Leading indicators with members of the Editing Team

ppelberg closed subtask T344454: QA edit check interaction data instrumentation as Resolved.Oct 24 2023, 4:48 PM

ppelberg renamed this task from [Analysis] Run an A/B test to evaluate Edit Check (references) impact to [MILESTONE] Run an A/B test to evaluate Edit Check (references) impact.Nov 22 2023, 6:20 PM

ppelberg mentioned this in T352092: Instrument the multi-check experience.Nov 27 2023, 8:27 PM

ppelberg mentioned this in T352129: Verify proposing A/B test wikis will provide sufficient edit volume to draw conclusions from.Nov 28 2023, 2:00 AM

ppelberg mentioned this in T352130: Report on Edit Check A/B test leading indicators.Nov 28 2023, 2:03 AM

ppelberg updated the task description. (Show Details)

ppelberg mentioned this in T352131: Publish findings from the Edit Check (references) A/B test.Nov 28 2023, 2:07 AM

ppelberg updated the task description. (Show Details)Nov 29 2023, 9:22 PM

ppelberg added a subscriber: Pablo.

MNeisler moved this task from Current Quarter to Upcoming Quarter on the Product-Analytics board.Dec 1 2023, 5:23 PM

MNeisler updated the task description. (Show Details)Dec 13 2023, 8:16 PM

MNeisler updated the task description. (Show Details)Dec 13 2023, 8:18 PM

MNeisler updated the task description. (Show Details)Dec 13 2023, 8:31 PM

I've updated the task description to include curiosities (Curiosity #3 and #4) and guardrails (Guardrails #2 and #3) we'd like to review as part of the planned incorporation of the reference reliability check (user experience being defined in T347531).

The measurement plan has also been updated to incorporate these changes.

MNeisler moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board.Jan 3 2024, 7:39 PM

@MNeisler @ppelberg: just want to verify that

Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score) if we use revision risk model

is what would be reported in WE1: Program Indicator Plan Template_WE1_Contributor Experiences once initial data is available (and then updated as Edit Check is deployed to more wikis)

MNeisler updated the task description. (Show Details)Jan 11 2024, 8:50 PM

MNeisler mentioned this in T352133: Instrument the reference reliability user experience.Jan 16 2024, 3:48 PM

MNeisler updated the task description. (Show Details)Feb 5 2024, 6:32 PM

From T345298, we can proceed with the following wikis:

arwiki
afwiki
frwiki
itwiki
jawiki
ptwiki
swwiki
yowiki
viwiki
zhwiki

We might have eswiki as a bonus in a few days.

Change 1003052 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1003052

Change 1003053 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1003053

Change 1003052 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1003052

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.19; 2024-02-20).Feb 16 2024, 5:01 PM

Change 1004351 had a related patch set uploaded (by DLynch; author: DLynch):

[operations/mediawiki-config@master] Launch the Visual Editor edit check a/b test

https://gerrit.wikimedia.org/r/1004351

Trizek-WMF closed subtask T345298: Identify participating wikis for Edit Check (references) A/B test as Resolved.Feb 19 2024, 1:57 PM

Change 1003053 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1003053

I've been asked to include eswiki as well, so the config patch has been updated to add them.

Change 1004708 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.18] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1004708

Change 1004709 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@wmf/1.42.0-wmf.18] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1004709

Change 1004351 merged by jenkins-bot:

[operations/mediawiki-config@master] Launch the Visual Editor edit check a/b test

https://gerrit.wikimedia.org/r/1004351

Change 1004708 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.18] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1004708

Change 1004709 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@wmf/1.42.0-wmf.18] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1004709

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:25:05Z] <zabe@deploy2002> Started scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:26:26Z] <zabe@deploy2002> kemayo and zabe: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Maintenance_bot removed a project: Patch-For-Review.Feb 19 2024, 9:30 PM

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:42:31Z] <zabe@deploy2002> Finished scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] (duration: 17m 25s)

MNeisler mentioned this in T352122: [Config] Start Edit Check (references) A/B test.Feb 21 2024, 4:13 PM

ReleaseTaggerBot edited projects, added MW-1.42-notes (1.42.0-wmf.20; 2024-02-27); removed MW-1.42-notes (1.42.0-wmf.19; 2024-02-20).Feb 22 2024, 1:00 AM

ReleaseTaggerBot edited projects, added MW-1.42-notes (1.42.0-wmf.18; 2024-02-13); removed MW-1.42-notes (1.42.0-wmf.20; 2024-02-27).Feb 22 2024, 10:02 AM

MNeisler closed subtask T352122: [Config] Start Edit Check (references) A/B test as Resolved.Feb 23 2024, 7:35 PM

MNeisler edited projects, added Product-Analytics (Kanban); removed Product-Analytics.Mar 1 2024, 7:42 PM

MNeisler moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.Mar 4 2024, 8:33 PM

DLynch closed subtask T346837: Implement bucketing for Edit Check (references) A/B test as Resolved.Mar 7 2024, 9:36 PM

I've completed an initial analysis of the Edit Check AB test results including review of the KPI, Curiosity #1, and Guardrails 1-3. Results are summarized in this slide deck which was presented to the team on 12 March 2024.

Below are the current proposed next steps:

Complete review of guardrails #4-5.
Complete review of retention as part of Curiosity #2.
Add findings for editors from Sub-Saharan Africa (SSA) once data is available. So far there have been only about 70 published edits from SSA in the test. We will extend the test duration so we can get statistically significant findings for editors volunteering from within SSA.
Edit Completion Rate (Guardrail #2)
- Look at edit completion rate of constructive edits (unreverted)
- Review options to limit edit completion rate of control group to only eligible edits.
- Shift edit completion start to "saveIntent" as that is the point that edit check is or would be shown.
Revert Rate (KPI and Guardrail #1)
- Review the sources people are citing when they add new content that includes a reference. Note: The analysis I completed in T346982 can be re-used to help complete this task.
- As the observed change between the test and control groups was only slight, it might be beneficial to apply a statistical model to confirm the impact of the Edit Check on edit revert rate.
- Review impact if revision risk score is used instead of revert rate as a measure of quality.
Summarize results into a final report that can be shared on the project page
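
One simple way to check whether a slight difference in revert rate between the test and control groups is larger than chance would be a two-proportion z-test. The sketch below is only an illustration of that idea and is not the statistical model used in the analysis; the function name and its inputs are assumptions.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1, n1: reverted edits and total edits in group 1 (e.g. test).
    x2, n2: reverted edits and total edits in group 2 (e.g. control).
    Returns (z, p_value). Uses the pooled standard error, which assumes
    both groups share one underlying rate under the null hypothesis.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both groups need at least one observation")

    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")

    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A model that also controls for wiki, platform, or editor experience (for example a regression) would be a stronger check than this unadjusted comparison.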

In T342930#9631364, @MNeisler wrote:

I've completed an initial analysis of the Edit Check AB test results including review of the KPI, Curiosity #1, and Guardrails 1-3. Results are summarized in this slide deck which was presented to the team on 12 March 2024.

Below are the current proposed next steps

The proposed next steps you identified align with the next steps the presentation you made on 12 March 2024 brought to mind for me...thank you for documenting these and all of the work that inspired them, @MNeisler

Completed review of guardrails #4-5. Findings are summarized on slides 35-45 of the slide deck.

Note: We decided to end the AB test on 5 April 2024. At the end of the test, I will review the complete data to confirm any changes to already calculated metrics and calculate metrics that require more data (Retention rate [Curiosity #2] and findings for editors from SSA)

I've completed the retention rate analysis (Curiosity #2) now that we have two full months of data. Findings are summarized on slides 20-28 of the slide deck.

Notes re methodology:

I used second-month retention as the retention period. This is defined as "Out of the users who made an edit in the month before the previous, the proportion who also made an edit during their second 30 days" and aligns with the new editor retention contributor metric used by movement insights.
Compared first edits where reference check was activated in the test group to eligible edits (edits where reference check would have been activated) in the control group.
Analysis is currently limited to registered newcomers and Junior Contributors.
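
A minimal sketch of a second-month retention calculation under one reading of the definition above: a user counts as retained if they made at least one edit between 30 and 60 days after their first edit in the test. The input shape (a mapping from user to edit timestamps) and the exact day boundaries are assumptions for illustration, not the analysis code.

```python
from datetime import timedelta


def second_month_retention(edits_by_user):
    """Share of users who edit again during their second 30 days.

    `edits_by_user` maps a user identifier to a list of datetimes, one per
    published edit. The earliest timestamp is treated as the first edit.
    A user is retained if any edit falls in [first + 30d, first + 60d).
    Returns None when there are no users with edits.
    """
    total = 0
    retained = 0
    for timestamps in edits_by_user.values():
        if not timestamps:
            continue
        total += 1
        first = min(timestamps)
        window_start = first + timedelta(days=30)
        window_end = first + timedelta(days=60)
        if any(window_start <= t < window_end for t in timestamps):
            retained += 1
    return retained / total if total else None
```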

eunnpixie subscribed.Apr 13 2024, 12:17 PM

All results and methodology are now summarized in the final report. Please let me know if you have any questions or suggested additions/changes.

Note: I'm drafting a high-level summary with a link to this report that can be published on mediawiki. This is being completed as part of T352131.
cc @ppelberg

MNeisler reassigned this task from MNeisler to ppelberg.Apr 29 2024, 4:13 PM

Review the sources people are citing when they add new content that includes a reference. Note: The analysis I completed in T346982 can be re-used to help complete this task.

Note: I created a new tab in the "References people cite when adding new content" dashboard ("February-March 2024 Reference Check Available"), which includes an updated snapshot of the references added by new content edits at the wikis where the AB test was run. You can use the filter "was reference check shown" to review the types of URLs and domains published by edits shown this prompt.

This data can be used to help investigate the types of domains and URLs being included in published edits that were still reverted after being shown reference check.

In T342930#9770352, @MNeisler wrote:

Review the sources people are citing when they add new content that includes a reference. Note: The analysis I completed in T346982 can be re-used to help complete this task.

Note: I created a new tab in the "References people cite when adding new content" dashboard ("February-March 2024 Reference Check Available"), which includes an updated snapshot of the references added by new content edits at the wikis where the AB test was run. You can use the filter "was reference check shown" to review the types of URLs and domains published by edits shown this prompt.

This data can be used to help investigate the types of domains and URLs being included in published edits that were still reverted after being shown reference check.

Oh, this is spectacular, @MNeisler! I can see it being immediately useful for T348060: Expand the Reference Reliability check to include a wider range of sources.

Specifically, I can see us sharing this data with volunteers and empowering them to decide which domains, if any, should trigger feedback for people attempting to cite them...

Screenshot 2024-05-14 at 2.04.22 PM.png (541×862 px, 79 KB)

In the meantime, a clarifying question about the meaning of each column within the Domain summary stats view:

Would it be accurate for me to understand the columns below as meaning...?

num_domain_occurrences: the number of references that cite a given domain across all pages in the main namespace at a given project
Number of distinct pages domain added to with new content edit: the number of distinct pages in the main namespace of a given project where a domain was cited during the period of time the data shown in this chart was gathered
Number of new content edits that included domain: the number of edits tagged with editcheck-newcontent that include a given domain at a given project
Revert rate: the percentage of Number of new content edits that included domain that are reverted within 48 hours

ppelberg mentioned this in T348060: Expand the Reference Reliability check to include a wider range of sources.May 14 2024, 9:14 PM

ppelberg mentioned this in T347530: Design user experience for presenting people with multiple checks.May 18 2024, 12:15 AM

ppelberg reassigned this task from ppelberg to MNeisler.May 18 2024, 12:21 AM

@ppelberg

Would it be accurate for me to understand the columns below as meaning...?

num_domain_occurrences: the number of references that cite a given domain across all pages in the main namespace at a given project

The num_domain_occurrences comes from the external links table and indicates the number of times that domain appears across all pages on a given project.

It's not necessarily restricted to just references added with new content edits but any external links on a project.

Number of distinct pages domain added to with new content edit: the number of distinct pages in the main namespace of a given project where a domain was cited during the period of time the data shown in this chart was gathered

Correct.

Number of new content edits that included domain: the number of edits tagged with editcheck-newcontent that include a given domain at a given project

Also limited to edits tagged with editcheck-newreference

Revert rate: the percentage of Number of new content edits that included domain that are reverted within 48 hours

Correct
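
As an illustration of the column definitions confirmed above, the per-domain figures could be derived from new content edits roughly as follows. The edit record fields (`page_id`, `urls`, `reverted_within_48h`) are hypothetical, and `num_domain_occurrences` is left out because it comes from the external links table rather than from these edits.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_summary(edits):
    """Aggregate new content edits by the domains they cite.

    `edits` is a list of dicts with hypothetical keys:
      page_id, urls (list of cited URLs), reverted_within_48h (bool).
    Each edit counts once per domain, even if it cites that domain
    several times.
    """
    stats = defaultdict(lambda: {"pages": set(), "edits": 0, "reverted": 0})
    for edit in edits:
        domains = {urlparse(url).hostname for url in edit["urls"]}
        domains.discard(None)  # skip URLs without a parsable host
        for domain in domains:
            entry = stats[domain]
            entry["pages"].add(edit["page_id"])
            entry["edits"] += 1
            if edit["reverted_within_48h"]:
                entry["reverted"] += 1

    return {
        domain: {
            "distinct_pages": len(entry["pages"]),
            "new_content_edits": entry["edits"],
            "revert_rate": entry["reverted"] / entry["edits"],
        }
        for domain, entry in stats.items()
    }
```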

ppelberg closed subtask T352130: Report on Edit Check A/B test leading indicators as Resolved.Jun 5 2024, 8:31 PM

ppelberg closed subtask T361727: [Config] Stop the Edit Check (references) A/B test as Resolved.Jun 12 2024, 5:46 PM

ppelberg mentioned this in T352095: Publish multi-check measurement plan.Aug 14 2024, 8:58 PM

ppelberg closed subtask T345300: [SPIKE] Decide whether adjustments are needed to Edit Check (references) heuristic as Resolved.Aug 20 2024, 6:14 PM

ppelberg closed this task as Resolved.Aug 23 2024, 7:05 PM

