8000 Refactor com.google.cloud.firestore.it.ITSystemTest.queryWatch to make it more reliable. by BenWhitehead · Pull Request #5902 · googleapis/google-cloud-java · GitHub 8000

Refactor com.google.cloud.firestore.it.ITSystemTest.queryWatch to make it more reliable. #5902

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

BenWhitehead merged 2 commits into googleapis:master from BenWhitehead:firestore-it-hang-fix

Jul 30, 2019

Contributor

BenWhitehead commented

•

Changes

Separate document update logic from event listener code
Move assertions out of listener code
Add timeouts to all await conditions with associated assertions
to clearly communicate the test fails due to an await, and which one.
Switch from using a semaphore to using multiple CountDownLatches

Motivation

The queryWatch test has been flaky for some time, resulting in failure
most of the time. Due to these persistent failures the trust of the
firestore integration test suite has been low, and the results often
ignored.

After being able to run the test in a profileable environment, I ran the
old implementation repeatedly for 25 minutes resulting in a total of 136
runs, 100 of which failed -- each with the same error. The new
implementation was ran for 25 minutes resulting in a total of 2500 runs,
none of which failed.

BenWhitehead added the api: firestore label

BenWhitehead requested review from chingor13, kolea2 and schmidt-sebastian

July 26, 2019 21:22

BenWhitehead self-assigned this

googlebot added the cla: yes label

BenWhitehead added the needs work label

schmidt-sebastian reviewed

View reviewed changes

Contributor

schmidt-sebastian left a comment

I'm somewhat on board with this, but I think that we should reduce some of the logic here that was added in order to plumb through QuerySnapshot exceptions.

I also suggested a way to deflake the original test.

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java Outdated Show resolved Hide resolved

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java Outdated Show resolved Hide resolved

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java

    
                              @Override

                              public void onEvent(

                                  @Nullable QuerySnapshot value, @Nullable FirestoreException error) {

                                receivedEvents.add(new ListenerEvent(value, error));

Contributor

schmidt-sebastian

We should never see an error here. To simplify this test, I would add an assert here that the error is null (and maybe that value is not) and replace ListenerEvent with QuerySnapshot.

Contributor Author

BenWhitehead

Asserts inside the snapshotListener unfortunately do not automatically cascade outside the listener and as such can't fail the test, that I why I pulled it up to the top level and added the assertion after all the operations.

Contributor

schmidt-sebastian

What happens if we don't see the right number of snapshots (we get an error but we ignore it and don't add a receivedEvent)? My only reservation with this PR is that the generic nature of it adds a lot of code that makes the event flow somewhat hard to parse. I'm totally biased here, but the original test seems significantly easier to follow.

Contributor Author

BenWhitehead

When I started investigating why this tests would fail ~65% of the time in CI I narrowed it down to the fact that the semaphore wasn't ever being satisfied. In #5746 I added a timeout as well as some diagnostic information to the test and found that the failure always occurred in the case 0 brach and resulted in the other branches never being executed. To ensure that performing new document updates wouldn't ever block the processing of received I separated the update and the lister as presented in this PR. I have ran this new implementation continuously for 2,500 iterations and all 2,500 runs passed.

I agree that it is more code -- a fair amount of it would be reduced if we were able to use java 1.8 constructs (which will hopefully happen sooner rather than later) -- and not super straightforward from a cursory glance, but I don't think the original test was either. One advantage with this new implementation is that it's more explicit about what the interrelationships are between the listener and document updates. In addition, there are explicit timeouts defined for each type of operation so we don't ever hang the entire build, and having the assertions outside the listeners means if there is a failure it will actually fail the test.

If it isn't important for doc1 and doc2 to be updated independent from one another I can merge those workers. If we want to avoid the anonymous inner classes from FluentIterable I can rewrite that with for loops.

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java Outdated Show resolved Hide resolved

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java

    
                        });

                    // Wait for the document update operations to be performed

                    final boolean doc1CDLAwait = doc1CDL.await(10, TimeUnit.SECONDS);

Contributor

schmidt-sebastian

If I am not mistaken, doc1CDLAwait and doc2CDLAwait could be a single count down latch initialized with new CountDownLatch(2).

Contributor Author

BenWhitehead

It is true we could do this, but then we would lose the info about which of the document operations timed out for the assertion. I'd like to keep the extra detail in case the test ever fails due to this.

...clients/google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITSystemTest.java

    
                                          actualEvents.add("event 0");

                                          assertTrue(value.isEmpty());

                                          ref1 = randomColl.add(map("foo", "foo")).get();

                                          ref2 = randomColl.add(map("foo", "bar")).get();

Contributor

schmidt-sebastian

I think the problem with this test is that these two adds are not guaranteed to generate two different events. One way of fixing this would be either split these up into two tasks or use a WriteBatch for both writes, which is guaranteed to only raise one event of size 2.

Contributor Author

BenWhitehead

This scenario feels a bit like a different test case to me. I can look at adding another test if that sounds good to you?

Contributor

schmidt-sebastian

It's different for sure - but this test is just meant to show that basic Watch functionality works. The actual steps that we verify are of minor importance.

BenWhitehead added 2 commits

July 29, 2019 14:21


          Refactor com.google.cloud.firestore.it.ITSystemTest.queryWatch to mak…

17e327d

…e it more reliable.

1. Separate document update logic from event listener code
2. Move assertions out of listener code
3. Add timeouts to all await conditions with associated assertions
   to clearly communicate the test fails due to an await, and which one.
4. Switch from using a semaphore to using multiple CountDownLatches

The queryWatch test has been flaky for some time, resulting in failure
most of the time. Due to these persistent failures the trust of the
firestore integration test suite has been low, and the results often
ignored.

After being able to run the test in a profileable environment, I ran the
old implementation repeatedly for 25 minutes resulting in a total of 136
runs, 100 of which failed -- each with the same error. The new
implementation was ran for 25 minutes resulting in a total of 2500 runs,
none of which failed.


          Code review typo fixes

43b8af6

BenWhitehead removed the needs work label

codecov bot commented

Codecov Report

Merging #5902 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #5902      +/-   ##
============================================
- Coverage     46.79%   46.79%   -0.01%     
  Complexity    25648    25648              
============================================
  Files          2456     2456              
  Lines        267593   267593              
  Branches      30564    30564              
============================================
- Hits         125229   125228       -1     
- Misses       133105   133106       +1     
  Partials       9259     9259

Impacted Files	Coverage Δ	Complexity Δ
...ain/java/com/google/cloud/pubsub/v1/Publisher.java	`89.54% <0%> (-0.35%)`	`40% <0%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b212217...43b8af6. Read the comment docs.

schmidt-sebastian approved these changes

View reviewed changes

BenWhitehead merged commit 46d7a3c into googleapis:master

BenWhitehead deleted the firestore-it-hang-fix branch

July 30, 2019 18:07

kolea2 mentioned this pull request

Release google-cloud-java v0.103.0 #5950

Merged

BenWhitehead mentioned this pull request

firestore: listen_multiple test is flakey GoogleCloudPlatform/python-docs-samples#2529

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: firestore cla: yes

0