From 10:28 PDT to 10:40 PDT on April 29, 2020, the API that powers coinbase.com and our mobile applications became unavailable for our customers globally. This was followed by 30 minutes of stability, and then a period of instability from 11:12 PDT to 12:11 PDT, during which we sustained 20 minutes of full unavailability and 40 minutes of degraded performance with elevated error rates. At 12:11 PDT, full service was restored and all systems began operating normally.
This issue affected customers’ ability to access the Coinbase and Coinbase Pro UIs, but did not impact trading via our exchange APIs or the health of the underlying markets. It was caused and perpetuated by two separate, but related, root causes.
The initial incident was triggered at 10:28 PDT by an increase in the rate of connections to one of our primary databases. This connection rate increase was the result of a deploy creating new connections while our systems were scaled up to respond to elevated traffic at the time. When this spike in connections occurred, the host operating system for the database began rejecting new TCP connections to the host, which triggered degraded operations and restarts in the routing layer for the database. At that point, our monitoring began reporting an elevated error rate across all API requests that touched the impacted database.
In response to the failures in the routing layer and the corresponding operational failures, our systems attempted to reconnect in order to retry these operations. Unfortunately, due to improper handling of closed connections and a lack of support for timing jitter in new connection creation, our systems “connection stormed” the database. This connection storm triggered the same failure we saw on other members of the routing layer, preventing new connections from being established. While the initial database was able to recover at 10:40 PDT, this same failure mode occurred with three other, separate database instances, creating the second period of unavailability from 11:12 PDT to 12:11 PDT.
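To illustrate what timing jitter in connection creation means, here is a minimal sketch of exponential backoff with "full jitter", a common general technique for spreading out reconnect attempts. It is not our driver's code: the names `backoff_delays` and `connect_with_retry`, the `connect` callable, and the `base` and `cap` values are all hypothetical and chosen only for illustration.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that fail at the same
    moment do not all reconnect at the same moment.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponentially growing ceiling, bounded by cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def connect_with_retry(connect, attempts=5, base=0.1, cap=10.0,
                       sleep=time.sleep, rng=None):
    """Call connect() up to `attempts` times, sleeping a jittered delay
    between failures. `connect` is an assumed zero-argument callable that
    raises ConnectionError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base=base, cap=cap, rng=rng)
    for attempt, delay in enumerate(delays):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of retries: surface the last error to the caller.
                raise
            sleep(delay)
```

Without the random component, every client that lost its connection at the same instant would retry on the same schedule, which is the synchronized behavior that produces a connection storm.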
In response to this failure, we’re rolling out a number of changes. First, we’re changing our database deployment topology to reduce our overall connection count, limit connection spikes, and separate the routing and daemon processes of the database to limit competition for host resources. Second, we’re resolving the issue with the driver’s closed-connection logic and implementing better jitter to prevent connection storming when this failure mode occurs. Finally, we’re rolling out safeguards that will allow us to contain the impact of future database failures to as small a subset of requests as possible.