8000 Refactor com.google.cloud.firestore.it.ITSystemTest.queryWatch to make it more reliable. by BenWhitehead · Pull Request #5902 · googleapis/google-cloud-java · GitHub 8000
[go: up one dir, main page]

Skip to content

Conversation

@BenWhitehead
Copy link
Contributor
@BenWhitehead BenWhitehead commented Jul 26, 2019

Changes

  1. Separate document update logic from event listener code
  2. Move assertions out of listener code
  3. Add timeouts to all await conditions with associated assertions
    to clearly communicate the test fails due to an await, and which one.
  4. Switch from using a semaphore to using multiple CountDownLatches

Motivation

The queryWatch test has been flaky for some time, resulting in failure
most of the time. Due to these persistent failures the trust of the
firestore integration test suite has been low, and the results often
ignored.

After being able to run the test in a profileable environment, I ran the
old implementation repeatedly for 25 minutes resulting in a total of 136
runs, 100 of which failed -- each with the same error. The new
implementation was ran for 25 minutes resulting in a total of 2500 runs,
none of which failed.

@BenWhitehead BenWhitehead added the api: firestore Issues related to the Firestore API. label Jul 26, 2019
@BenWhitehead BenWhitehead self-assigned this Jul 26, 2019
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jul 26, 2019
@BenWhitehead BenWhitehead added the needs work This is a pull request that needs a little love. label Jul 26, 2019
Copy link
Contributor
@schmidt-sebastian schmidt-sebastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat on board with this, but I think that we should reduce some of the logic here that was added in order to plumb through QuerySnapshot exceptions.

I also suggested a way to deflake the original test.

@Override
public void onEvent(
@Nullable QuerySnapshot value, @Nullable FirestoreException error) {
receivedEvents.add(new ListenerEvent(value, error));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never see an error here. To simplify this test, I would add an assert here that the error is null (and maybe that value is not) and replace ListenerEvent with QuerySnapshot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserts inside the snapshotListener unfortunately do not automatically cascade outside the listener and as such can't fail the test, that I why I pulled it up to the top level and added the assertion after all the operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we don't see the right number of snapshots (we get an error but we ignore it and don't add a receivedEvent)? My only reservation with this PR is that the generic nature of it adds a lot of code that makes the event flow somewhat hard to parse. I'm totally biased here, but the original test seems significantly easier to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I started investigating why this tests would fail ~65% of the time in CI I narrowed it down to the fact that the semaphore wasn't ever being satisfied. In #5746 I added a timeout as well as some diagnostic information to the test and found that the failure always occurred in the case 0 brach and resulted in the other branches never being executed. To ensure that performing new document updates wouldn't ever block the processing of received I separated the update and the lister as presented in this PR. I have ran this new implementation continuously for 2,500 iterations and all 2,500 runs passed.

I agree that it is more code -- a fair amount of it would be reduced if we were able to use java 1.8 constructs (which will hopefully happen sooner rather than later) -- and not super straightforward from a cursory glance, but I don't think the original test was either. One advantage with this new implementation is that it's more explicit about what the interrelationships are between the listener and document updates. In addition, there are explicit timeouts defined for each type of operation so we don't ever hang the entire build, and having the assertions outside the listeners means if there is a failure it will actually fail the test.

If it isn't important for doc1 and doc2 to be updated independent from one another I can merge those workers. If we want to avoid the anonymous inner classes from FluentIterable I can rewrite that with for loops.

});

// Wait for the document update operations to be performed
final boolean doc1CDLAwait = doc1CDL.await(10, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, doc1CDLAwait and doc2CDLAwait could be a single count down latch initialized with new CountDownLatch(2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true we could do this, but then we would lose the info about which of the document operations timed out for the assertion. I'd like to keep the extra detail in case the test ever fails due to this.

actualEvents.add("event 0");
assertTrue(value.isEmpty());
ref1 = randomColl.add(map("foo", "foo")).get();
ref2 = randomColl.add(map("foo", "bar")).get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem with this test is that these two adds are not guaranteed to generate two different events. One way of fixing this would be either split these up into two tasks or use a WriteBatch for both writes, which is guaranteed to only raise one event of size 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scenario feels a bit like a different test case to me. I can look at adding another test if that sounds good to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's different for sure - but this test is just meant to show that basic Watch functionality works. The actual steps that we verify are of minor importance.

…e it more reliable.

1. Separate document update logic from event listener code
2. Move assertions out of listener code
3. Add timeouts to all await conditions with associated assertions
   to clearly communicate the test fails due to an await, and which one.
4. Switch from using a semaphore to using multiple CountDownLatches

The queryWatch test has been flaky for some time, resulting in failure
most of the time. Due to these persistent failures the trust of the
firestore integration test suite has been low, and the results often
ignored.

After being able to run the test in a profileable environment, I ran the
old implementation repeatedly for 25 minutes resulting in a total of 136
runs, 100 of which failed -- each with the same error. The new
implementation was ran for 25 minutes resulting in a total of 2500 runs,
none of which failed.
@BenWhitehead BenWhitehead removed the needs work This is a pull request that needs a little love. label Jul 29, 2019
@codecov
Copy link
codecov bot commented Jul 29, 2019

Codecov Report

Merging #5902 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5902      +/-   ##
============================================
- Coverage     46.79%   46.79%   -0.01%     
  Complexity    25648    25648              
============================================
  Files          2456     2456              
  Lines        267593   267593              
  Branches      30564    30564              
============================================
- Hits         125229   125228       -1     
- Misses       133105   133106       +1     
  Partials       9259     9259
Impacted Files Coverage Δ Complexity Δ
...ain/java/com/google/cloud/pubsub/v1/Publisher.java 89.54% <0%> (-0.35%) 40% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b212217...43b8af6. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: firestore Issues related to the Firestore API. cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

0