From 17:17 PDT to 18:00 PDT on May 9, 2020, the API that powers coinbase.com and our mobile applications experienced an elevated error rate. The error rate peaked at 17:24 PDT and gradually decayed until the issue was fully resolved at 18:00 PDT. This issue affected customers' ability to access the Coinbase and Coinbase Pro UIs, but did not impact trading via our exchange APIs or the health of the underlying markets.
At 17:18 PDT, alongside an increase in traffic due to market volatility, our monitoring detected elevated latency and error rates in our API and alerted our engineering team to the issue. In response to these alerts, our engineering team observed elevated latency across all outgoing HTTP requests on the application instances that serve our API traffic. In our monitoring, this appeared as a sharp increase in the share of each API request's time spent in outgoing HTTP requests.
As a result of this increase in latency, we saw an elevated error rate due to timeouts while attempting these outgoing HTTP requests. The elevated error rate was amplified by our load balancer killing otherwise-healthy application instances that failed health checks. These health checks failed because the instances' request queues were saturated: with each request waiting longer on outgoing HTTP calls, requests occupied the queue for longer than usual.
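The saturation described above follows from Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The sketch below works through that arithmetic. Every number in it (worker slots, arrival rate, latencies) is hypothetical and chosen only for illustration; none of them come from the incident.

```python
def in_flight_requests(arrival_rate_per_s: float, mean_latency_ms: float) -> float:
    """Little's law: mean requests in flight = arrival rate x mean time per request."""
    return arrival_rate_per_s * mean_latency_ms / 1000.0


# All numbers below are hypothetical, chosen only to illustrate the effect.
WORKER_SLOTS = 32          # concurrent requests one instance can process
ARRIVAL_RATE_PER_S = 200   # requests per second reaching that instance

# Same traffic, but outgoing HTTP calls make each request ten times slower.
normal = in_flight_requests(ARRIVAL_RATE_PER_S, mean_latency_ms=50)
degraded = in_flight_requests(ARRIVAL_RATE_PER_S, mean_latency_ms=500)

print(f"normal:   {normal:.0f} in flight, saturated={normal > WORKER_SLOTS}")
print(f"degraded: {degraded:.0f} in flight, saturated={degraded > WORKER_SLOTS}")
```

With traffic held constant, a tenfold rise in per-request latency pushes the required concurrency from well under the instance's capacity to well over it, so new requests, health checks included, wait in the queue.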
Upon further investigation, we identified that the increase in latency was due to instance-level rate limiting of the DNS queries used to serve these HTTP requests. As traffic fell because of the elevated error rate, we dipped below the rate limit, which led to the gradual decay of the error rate. In parallel, we rolled out a previously in-flight change to add per-instance DNS caching, bringing us back into the non-rate-limited range for global DNS queries and ensuring the failure mode would not appear again.
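To illustrate why per-instance caching reduces DNS query volume, here is a minimal sketch of a TTL cache placed in front of a resolver. It is not the change that was deployed, whose details are not described here. The class name `CachingResolver`, the fixed 30-second TTL, and the host name in the usage note are all assumptions made for the example; a production cache would typically honor each record's own TTL and would need locking for multi-threaded use.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple

Resolver = Callable[[str], List[str]]


def system_resolve(host: str) -> List[str]:
    """Resolve a host name through the operating system's resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


class CachingResolver:
    """Hypothetical per-instance DNS cache with a fixed TTL (not thread-safe)."""

    def __init__(
        self,
        resolve: Resolver = system_resolve,
        ttl_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        # host -> (expiry time, addresses)
        self._cache: Dict[str, Tuple[float, List[str]]] = {}
        self.upstream_queries = 0  # resolver calls that were not served from cache

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no upstream query
        self.upstream_queries += 1
        addresses = self._resolve(host)
        self._cache[host] = (now + self._ttl_s, addresses)
        return addresses
```

Under these assumptions, an instance that makes thousands of outgoing requests per TTL window to the same host, for example `CachingResolver().lookup("api.example.com")` in a loop, issues one upstream query per window instead of one per request.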
Beyond addressing the specific root cause for this incident, we are making a number of improvements to increase availability in the event of future similar failures. First, we’re adjusting our health check logic to ensure that saturated, but otherwise healthy, application instances are not automatically removed from the load balancer. Second, though this incident impacted all HTTP requests, we’re rolling out improved tooling to ensure we can quickly identify and shut down errant external services that increase latency. Finally, we’re rolling out safeguards that will allow us to contain the impact of future HTTP failures to as small a subset of requests as possible.
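The first of these improvements can be sketched as a health check that distinguishes a saturated instance from a broken one. The following is an illustrative sketch only, not the actual health check logic: the `InstanceState` fields, the 0.9 saturation threshold, and the choice of HTTP status codes are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    SATURATED = "saturated"    # busy, but still functioning correctly
    UNHEALTHY = "unhealthy"    # the process itself is broken


@dataclass
class InstanceState:
    """Hypothetical snapshot of one application instance."""
    process_responsive: bool   # e.g. the worker process answers an internal ping
    queue_depth: int           # requests currently waiting
    queue_capacity: int        # maximum queue length


def classify(state: InstanceState, saturation_ratio: float = 0.9) -> Health:
    """Separate 'overloaded' from 'broken' so each can be handled differently."""
    if not state.process_responsive:
        return Health.UNHEALTHY
    if state.queue_depth >= saturation_ratio * state.queue_capacity:
        return Health.SATURATED
    return Health.HEALTHY


def health_check_status(state: InstanceState) -> int:
    """HTTP status for the load balancer: fail only when the instance is broken."""
    return 503 if classify(state) is Health.UNHEALTHY else 200
```

For this to help, the health check itself has to be answered outside the saturated request queue (for example, on a separate thread or port); otherwise it would time out for the same reason ordinary requests do. Removing a saturated instance only shifts its load onto the remaining instances, which is why the sketch keeps such instances in rotation.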