Around 16:05 PDT, the price of BTC reached $10,000 USD. As the price rose, we experienced a 5x traffic spike over 4 minutes, and our autoscaling was unable to keep pace with this increase.
This traffic spike affected a number of our internal services and increased latency between them. The added latency led to process saturation on the web servers responsible for our API: the number of incoming requests exceeded the number of listening processes, so requests were either queued until they timed out or failed immediately. Our request error rate spiked to 50%, causing customers to experience errors when interacting with coinbase.com and our mobile apps.
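To see why a rise in latency can saturate a fixed pool of processes, Little's law gives a rough estimate: the average number of in-flight requests equals the arrival rate multiplied by the time each request spends in the system. The sketch below uses purely illustrative numbers (200 requests per second, 100 ms latency, 40 processes). None of these figures come from the incident, and `required_workers` and `is_saturated` are hypothetical helper names.

```python
import math


def required_workers(requests_per_second: int, latency_ms: int) -> int:
    """Estimate concurrent requests via Little's law (rate * time in system)."""
    return math.ceil(requests_per_second * latency_ms / 1000)


def is_saturated(requests_per_second: int, latency_ms: int, workers: int) -> bool:
    """True when the estimated concurrency exceeds the available processes."""
    return required_workers(requests_per_second, latency_ms) > workers


# Illustrative numbers only (assumptions, not measurements from the incident).
WORKERS = 40
baseline = required_workers(200, 100)       # normal traffic, normal latency
spike = required_workers(5 * 200, 250)      # 5x traffic with slower dependencies

print(baseline, is_saturated(200, 100, WORKERS))
print(spike, is_saturated(5 * 200, 250, WORKERS))
```

Under these assumptions the baseline needs about half of the pool, while the spike needs several times more processes than exist, so the excess requests can only queue or fail.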
The health check was also served by these saturated processes, so some instances failed their checks, were marked as unhealthy, and were taken out of the load balancer. The remaining instances then had to absorb the same traffic, further exacerbating the issue.
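The feedback loop described here can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of our load balancer: it assumes traffic is spread evenly, that an instance fails its health check whenever its share of traffic exceeds its capacity, and that one failing instance is removed per round. The function name and all numbers are hypothetical.

```python
from typing import List


def simulate_health_check_cascade(
    instances: int, capacity_per_instance: float, total_rps: float
) -> List[int]:
    """Return the number of in-rotation instances after each health-check round."""
    history = [instances]
    while instances > 0:
        load_per_instance = total_rps / instances
        if load_per_instance <= capacity_per_instance:
            break  # every remaining instance can cope, so the fleet is stable
        # Assumption: one saturated instance fails its check and is removed,
        # which raises the load on every instance that is left.
        instances -= 1
        history.append(instances)
    return history


# Within capacity: nothing is removed.
print(simulate_health_check_cascade(10, 100, 900))
# Slightly over capacity: each removal makes the next one more likely.
print(simulate_health_check_cascade(10, 100, 1200))
```

In this model, a fleet that is only modestly over capacity loses every instance, because each removal increases the per-instance load instead of relieving it.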
In an effort to mitigate the saturation, we redeployed the API at 16:20 PDT to increase the number of machines serving traffic. Once this deploy completed, the previous deploy’s instances were taken out of rotation, leading to another 2-minute outage as instances saturated and were marked unhealthy. Our autoscaling recovered from this second outage automatically.
In response to these events, we’re working on a number of improvements. We have since fixed the health endpoint to ensure that saturated instances don’t get taken out of rotation. We’re working on reducing the impact of price-related traffic spikes through pre-scaling and caching. Longer term, we’re planning to improve our deployment process to mitigate some of the autoscaling issues we experienced.
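As a rough illustration of how caching can absorb a price-related spike, the sketch below puts a short-lived in-memory cache in front of an expensive lookup, so that many requests arriving within the same window trigger only one upstream call. It is a minimal example under stated assumptions, not our implementation: `TTLCache`, `fetch_price`, and the 5-second TTL are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache values per key and refetch only after ttl_seconds have elapsed."""

    def __init__(
        self,
        fetch: Callable[[str], float],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, float]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> float:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the upstream call
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


upstream_calls = 0


def fetch_price(symbol: str) -> float:
    """Hypothetical stand-in for a slow internal price service."""
    global upstream_calls
    upstream_calls += 1
    return 10000.0


cache = TTLCache(fetch_price, ttl_seconds=5.0)
for _ in range(1000):
    cache.get("BTC-USD")
print(upstream_calls)
```

The trade-off is staleness: a cached price can be up to one TTL old, so the window has to be short enough for the data being served.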