[FLINK-39699][tests] Stabilize flaky tests in KafkaSinkITCase and KafkaWriterFaultToleranceITCase #254
spuru9 left a comment
Thanks for the PR. Added a few comments.
Here too is a similar function, worth replacing as well. WDYT?
Good catch, replaced 👍
void setUp() {
    topic = UUID.randomUUID().toString();
    createTestTopic(topic, 1, TOPIC_REPLICATION_FACTOR);
    Properties adminProperties = new Properties();
getKafkaClientConfiguration can be reused in the setup; the properties seem generic.
getKafkaClientConfiguration() returns consumer properties (+zookeeper properties, which will be tackled in https://issues.apache.org/jira/browse/FLINK-39705).
Meanwhile here, we need only admin/generic properties.
WDYT?
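To make the distinction in this reply concrete, here is a minimal sketch of what "admin/generic properties" could look like. The class and method names are hypothetical and not taken from the PR; the only real Kafka detail used is the standard `bootstrap.servers` client key.

```java
import java.util.Properties;

// Hypothetical helper, not code from the PR: builds only the generic
// client settings that an admin client needs to reach the broker.
final class AdminPropsSketch {

    static Properties adminProperties(String bootstrapServers) {
        Properties props = new Properties();
        // "bootstrap.servers" is the standard Kafka client key for the broker address.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Deliberately no consumer-only settings (group id, deserializers, ...).
        return props;
    }
}
```

Keeping this set small avoids pulling consumer-specific settings into a place that only needs to talk to the broker.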
…ilable tests testFlush/testWrite/testCloseExceptionWhenKafkaUnavailable all rely on the producer still having undelivered work when KAFKA_CONTAINER.stop() takes effect. Under CI load the sender thread can ship and ack the buffered record before stop() returns, so the operation under test (flush/write/close) has nothing to fail on and the .rootCause() assertion fires with "Expecting actual not to be null" instead of seeing the expected NetworkException / TimeoutException.

Drain a warm-up record before stopping the broker, then issue the real write while the broker is down. The producer's metadata is already cached so write() returns immediately; the sender fails to deliver (retries=0); the operation under test reliably surfaces the underlying exception.
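The ordering described in this commit message can be sketched with a toy model in plain Java. `ToyBroker` and `ToyProducer` are hypothetical stand-ins, not Kafka or Flink classes, and the model is single-threaded, so it illustrates only the sequence of steps (warm-up, drain, stop, real write), not the race itself.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only: ToyBroker and ToyProducer are hypothetical, not Kafka classes.
final class WarmUpThenStopSketch {

    /** Stand-in for the broker container: accepts records only while running. */
    static final class ToyBroker {
        private boolean running = true;

        void stop() {
            running = false;
        }

        boolean accept(String record) {
            return running;
        }
    }

    /** Stand-in for a producer with retries=0: one failed delivery is final. */
    static final class ToyProducer {
        private final ToyBroker broker;
        private final Queue<String> buffer = new ArrayDeque<>();

        ToyProducer(ToyBroker broker) {
            this.broker = broker;
        }

        /** Buffers the record and returns immediately. */
        void write(String record) {
            buffer.add(record);
        }

        /** Delivers everything buffered; throws on the first failed delivery. */
        void flush() {
            String record;
            while ((record = buffer.poll()) != null) {
                if (!broker.accept(record)) {
                    throw new IllegalStateException("delivery failed: " + record);
                }
            }
        }
    }

    /** Returns true if flush() fails once the broker is down. */
    static boolean flushFailsAfterBrokerStop() {
        ToyBroker broker = new ToyBroker();
        ToyProducer producer = new ToyProducer(broker);

        producer.write("warm-up");
        producer.flush();          // drain: nothing is in flight any more
        broker.stop();             // the broker is now definitely down
        producer.write("real");    // buffered while the broker is down

        try {
            producer.flush();      // the operation under test
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

Because the real record is written only after the stop, there is always undelivered work for the operation under test to fail on, which is the property the original tests depended on by timing.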
@Savonitar Can you move the PR to ready for review? Also, is the other PR now part of this one, as mentioned in the PR title?
I will when it is ready.
Thanks for the reviews/comments/merge.
Stabilizes three related CI flakes across two test classes. All three failures shared the same family of symptoms: a one-shot read or an unsynchronized exception assertion that worked on local machines but raced under CI load.
E.g., in PR #252 we are facing a flaky test issue.
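For the "one-shot read" symptom mentioned in the description, one common remedy is to poll until a condition holds or a deadline passes. The text shown here does not say how the PR replaces the one-shot read, so the helper below is an assumed illustration with hypothetical names, not the PR's code.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not code from the PR: re-checks a condition until it
// holds or the timeout expires, instead of reading the state exactly once.
final class PollingSketch {

    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;                       // condition observed
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;                      // gave up after the timeout
            }
            try {
                Thread.sleep(interval.toMillis()); // back off before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A test would then assert on the result of `waitUntil(...)` rather than on a single read, which tolerates a slow CI machine up to the chosen timeout.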