CSI: Flaky Tests | How to Investigate Intermittent Failures Without Losing Your Mind
Curiously, all my recent projects have been about “fixing failing suites” where nobody knows why they fail.
The same test sometimes passes and sometimes fails, or passes locally but fails in CI. The more workers (instances) running, the higher the failure rate. This is exactly the definition of a flaky test: one that, given the same code and the same environment, sometimes passes and sometimes fails.
The purpose of tests is to provide fast, effective feedback on what we’re delivering. Flaky tests take us away from that purpose in these ways:
- Trust issues: Developers start ignoring real failures (“oh, that same test is failing again, let’s just re-run it”). This erodes developers’ confidence in the tests we write, and let’s be honest: that’s what tests are for, to give us confidence in what we’re delivering to customers and users in each release. There may come a time when the team starts ignoring test results to ship a new version of the product “faster,” and we end up letting bugs slip through.
- Slower development cycles: If we don’t ignore test results in favor of faster releases, we find ourselves at the other extreme: re-running tests over and over again until they pass, making work in general (not just releases) slower, from blocked pull requests (which we can’t merge until tests pass and PR checks are green) to missed deadlines. Re-running CI/CD consumes time and money.
- The tradeoff of forcing tests to pass: I’ve also seen (and wish I hadn’t) tests forced to pass that have no relevant verifications/assertions, and some, much worse, that verify nothing at all, with poor design (oriented toward running automatically and quickly, but not toward detecting errors). These tests don’t reflect the true state of the system.
How Can We Be Sure a Test or Group of Tests Is Flaky?
The main indicator is that the test fails intermittently:
- Passes locally but not in CI, or vice versa.
- Passes when you run it alone (isolated) but fails when the entire regression suite (or a group of tests) runs, or vice versa.
- Fails more as we increase the number of workers or instances (in parallel).
- Fails only the first time and passes when you re-run.
- Fails more at certain times of day.
- Fails only in headless mode, or only in headed mode.
Where Do We Start Diagnosing?
Some ideas:
- Reproduce the failure consistently (run in a loop), especially if we’re running the complete suite.
- Isolate the test to check whether it passes when run on its own.
- Add strategic logging: it’s super useful to “print” states, IDs (that we can later check in the database), API responses, and the data we’re sending to the API (see the sketch after this list). It may seem like too much data, but when evaluating why a test is flaky, it gives us a lot of information we can use to fix the failure. You don’t have to do this for all tests, but definitely for those under investigation.
- Review application logs (and API), not just test logs. Sometimes only in the application logs can we find the true cause of the error (for example: you’re trying to create a duplicate record, there’s a failure in a state change, something can’t be deleted, the user doesn’t have permission to execute a certain operation, you don’t have access to a certain resource, a transaction is locked).
- Review screenshots/videos of the failure—if it’s a UI test, we can see the exact state (in the interface) when the test failed, what data it was using, which user was logged in, and the differences between one execution and another.
- Analyze timing with traces, which allows us to detect at which specific point the test might be taking longer than expected (waiting for a response or state, or for an element or data to exist).
- Analyze the data and states the test uses when run alone versus with the complete suite. I know it sounds like a lot of work, but sometimes it’s necessary to understand what’s happening.
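To make the logging idea concrete, here is a minimal sketch of what strategic logging can look like in a UI plus API test. It assumes Playwright; the /api/orders endpoint, the payload, and the /orders page are hypothetical placeholders, and the relative URLs assume a configured baseURL.

```typescript
import { test, expect } from '@playwright/test';

test('create an order and see it in the list', async ({ page, request }, testInfo) => {
  // Unique name per run and per worker so parallel tests don't collide on the same record.
  const payload = { name: `order-${Date.now()}-${testInfo.workerIndex}` };

  const response = await request.post('/api/orders', { data: payload });

  // "Print" what we sent and what we got back: if this fails in CI,
  // the evidence is already in the logs and the report.
  console.log('[setup] payload:', JSON.stringify(payload));
  console.log('[setup] status:', response.status());
  const body = await response.json();
  console.log('[setup] created id:', body.id);

  // Attach the raw response to the test report for later inspection.
  await testInfo.attach('create-order-response', {
    body: JSON.stringify(body, null, 2),
    contentType: 'application/json',
  });

  await page.goto('/orders');
  await expect(page.getByText(payload.name)).toBeVisible();
});
```

For the first point (running in a loop), Playwright can repeat the same test with something like `npx playwright test --repeat-each=20 --workers=4`; other frameworks offer similar options.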
Common Errors We CAN Fix in the Test/Suite
In the previous analysis, you may find errors like these:
- Multiple tests using the same data: creating, modifying, and/or deleting the same record in the database, or modifying the state of the same entity/record. Perhaps they’re trying to create more than one record with the same name? Even in read-only mode, do we have 50 tests using the same database record?
- DOM (UI elements) that doesn’t load as fast as the test expects.
- One test depends on another to run. For example, a test that edits permissions for a user that doesn’t exist beforehand and depends on another test creating that user.
How do we fix them?
| Error | Cause | Fix |
|---|---|---|
| Multiple tests using the same data | The test depends on data left by a previous test, or several tests change the same record’s state or values, altering the expected result. | Each test should have its own data, whether it creates it before the test and deletes it after, uses data pre-loaded in the database, or uses mock data. The method matters less than the outcome: each test must be independent of the others so they can run in parallel (multiple tests at the same time). See the fixture sketch after this table. |
| DOM that doesn’t load fast enough | Slow server or service. Too many tests running at the same time pushing the server/service to its limit. | Add dynamic waits (for the element to exist, to be visible, to be interactable, among others) that wait for UI elements up to a maximum time we specify. For example, with a 15-second limit the test waits at most 15 seconds; if the element appears sooner, it moves on to the next step immediately instead of always waiting the full 15 seconds. This ensures we wait long enough for an element without slowing execution more than strictly necessary. |
| One test depends on another to run | The test depends on data left by a previous test, so it sometimes passes only because it finds data from another test. | Again, each test should have its own data. In some cases, if one test creates data, another edits it, and another deletes it, you can configure them to run sequentially, but then you sacrifice parallelization. Weigh your most immediate need: execution speed or test reliability (both are important). |
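As a sketch of the “each test should have its own data” fix, assuming Playwright fixtures and a hypothetical /api/users endpoint, each test can create its own record before running and delete it afterwards:

```typescript
import { test as base, expect } from '@playwright/test';

type User = { id: string; name: string };

// A fixture that gives every test its own freshly created user and cleans it up afterwards.
const test = base.extend<{ user: User }>({
  user: async ({ request }, use) => {
    // Unique name so parallel tests never fight over the same record.
    const name = `user-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
    const created = await request.post('/api/users', { data: { name } });
    const user: User = await created.json();

    await use(user); // the test runs here with its own independent record

    await request.delete(`/api/users/${user.id}`); // cleanup, so no test depends on leftovers
  },
});

test('edit permissions for an existing user', async ({ page, user }) => {
  await page.goto(`/users/${user.id}/permissions`);
  await expect(page.getByText(user.name)).toBeVisible();
});
```

Because the record is created inside the fixture with a unique name, the tests stay independent and can run in parallel without stepping on each other’s data.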
Common Errors We CANNOT Fix (By Ourselves) in the Test/Suite
If tests run slower and fail more as we add workers/instances, we’re most likely facing request rate limits or server overload.
- Request rate limits: more requests reach the API and/or server/service than its security configuration allows per user, per time period, or per IP. Fixing this requires developers/DevOps to raise those limits on the servers. Many parallelization errors (multiple tests running in parallel against the same server/service) come down to this.
- Server overload: sometimes our tests (if there are many) push the server to its limit. Test environments are rarely equivalent to production (in resources like RAM and disk space) due to budget constraints. Other times, tests exert a greater load than real usage by users and customers would in production.
- External service failures: an API or external service that’s outside our team’s control.
- Intermittent 500 responses: these need to be evaluated by developers.
- Network errors: the most difficult to diagnose and fix because, by nature, they’re intermittent and depend on factors external to both the test code and the application itself. Errors like ECONNRESET, ETIMEDOUT, or ECONNREFUSED in responses to API and service calls are a clear sign we’re facing a network error, and they often depend on cloud provider configuration and region.
Race Conditions and Asynchronous Processes
A race condition occurs when the result of a test depends on the sequence or timing of events we don’t control.
An automated test runs much faster than the same flow executed manually, so there are errors that only show up in automated execution and can’t be reproduced by hand (which is why reviewing failure videos/screenshots and the logging of each request/step/verification becomes even more important). This is what causes tests to fail on timeouts: elements that don’t appear in the expected time, data that doesn’t load in the UI in the expected time, or requests that don’t respond in the expected time (because of the amount of data they return or some API problem).
This happens because our test code moves from one step (line of code) to the next almost instantly, while the application has to render HTML, download CSS, execute JavaScript, and wait for server responses. That takes milliseconds or seconds, and it’s what we call an “asynchronous” process. Since the framework and the application don’t always move at the same pace, this is where race conditions are born.
Some solutions we can implement:
Dynamic wait times (described in the table above) to wait for UI elements or an API response.
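A minimal sketch of the difference between a fixed sleep and a dynamic wait, assuming Playwright; the selector and timeout are illustrative:

```typescript
import { test, expect } from '@playwright/test';

test('dashboard shows the results table', async ({ page }) => {
  await page.goto('/dashboard');

  // Anti-pattern: a fixed sleep always burns the full 15 seconds.
  // await page.waitForTimeout(15_000);

  // Dynamic wait: continues as soon as the table is visible, up to 15 seconds maximum.
  await expect(page.getByRole('table')).toBeVisible({ timeout: 15_000 });
});
```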
Beyond waiting for elements, if we’re waiting for something like a state change that we know can take up to 5 minutes, for example, instead of sleeping for 5 minutes and then fetching the state to compare it with what we expect (verification or assertion), we can create a poll function that checks at regular intervals (every minute, for example) whether the state is what we expect, with a maximum time of 7 minutes. If the expected state hasn’t occurred after 7 minutes, the test fails.
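A minimal, framework-agnostic sketch of such a poll function; the one-minute interval and seven-minute maximum are the illustrative values from above, and the usage line is hypothetical:

```typescript
// Poll a value at a fixed interval until it matches what we expect or a maximum time passes.
async function pollUntil<T>(
  getValue: () => Promise<T>,
  isExpected: (value: T) => boolean,
  intervalMs = 60_000,    // check every minute
  timeoutMs = 7 * 60_000, // give up after 7 minutes
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await getValue();
    if (isExpected(value)) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Expected state not reached within ${timeoutMs} ms (last value: ${JSON.stringify(value)})`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical usage: wait for an order to reach the COMPLETED state.
// const status = await pollUntil(() => fetchOrderStatus(orderId), (s) => s === 'COMPLETED');
```

For reference, Playwright offers the same behavior built in through `expect.poll`, which accepts a timeout and polling intervals.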
Data control: If your tests create and delete their own data, find the most efficient/fastest way to do so; sometimes a test fails before it even starts executing because it’s still waiting for data creation. There are other options, like pre-loading data in the database (which also implies defining an “initial state” for that data and resetting it before each suite execution) or using mocks and stubs (many test automation frameworks already cover this, so you don’t have to build your own modules and libraries to use mock data).
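As one example of the mocks/stubs option, here is a sketch using Playwright’s network interception; the /api/orders endpoint and the mock payload are hypothetical:

```typescript
import { test, expect } from '@playwright/test';

test('orders list renders from mocked data', async ({ page }) => {
  // Intercept the API call and return stable mock data instead of hitting a real backend.
  await page.route('**/api/orders', (route) =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify([{ id: '1', name: 'mock-order' }]),
    })
  );

  await page.goto('/orders');
  await expect(page.getByText('mock-order')).toBeVisible();
});
```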
If tests create data, make sure the records don’t share the same names and codes (it sounds obvious, but it happens a lot), and at the same time, don’t create more data than strictly necessary.
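A tiny sketch of generating names that can’t collide, even across parallel workers (a plain timestamp can repeat when two workers create data in the same millisecond):

```typescript
import { randomUUID } from 'node:crypto';

// A UUID (or a timestamp plus worker index) keeps every generated record unique.
const uniqueName = (prefix: string) => `${prefix}-${randomUUID()}`;

// Example: uniqueName('user') -> 'user-3f9c2e0a-...'
```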
Conclusion
Investigating and fixing flaky tests requires patience, analysis, and a methodical approach.
A flaky test can be a symptom of very diverse problems: from errors in test design (shared data, dependencies between tests, inadequate waits) to situations beyond our direct control (rate limits, server overload, external services).
The most important things to remember:
- Don’t ignore the failure: Each manual re-execution is lost time and money, and each ignored failure is a potential bug we let slip through.
- Don’t blame the test or the system under test without evidence: Sometimes the test is just the messenger of a real problem in the system. Before “fixing it” to make it pass, investigate whether it’s revealing something that also affects users.
- Isolate, reproduce, document: Your best (and almost always the only) tool is evidence: logs, screenshots, traces, and above all, the ability to reproduce the failure and analyze all the information together.
- Collaborate: In cases we can’t fix by ourselves (race conditions, request limits, infrastructure and network problems), we need the development and DevOps teams.
Flaky tests don’t disappear by magic, but with the right techniques, we can transform a chaotic suite into one that truly fulfills its purpose: giving us confidence in what we deliver.

