Database outages often happen because peaky workloads waste a lot of resources during non-peak periods, and most RDBMS deployments don't scale in or down easily.
Everyone’s seen the “wall of shame” of tweets about major e-commerce sites suffering slowdowns and outages during Black Friday and Cyber Monday. Here’s a brief recap just from the last decade:
And there are similar “flash sales” (short duration, limited items, deep discount) all over the world, including China’s Singles Day and Flipkart’s Big Billion Day.
This is the very reason scale is needed: to avoid these kinds of high-impact outages. But hidden in that need is a big reason why these outages keep happening.
Each additional node in a master/master cluster doesn’t give linear write scale; it gives additional HA instead. So removing nodes doesn’t deliver the same scale-in as actually shrinking each node, i.e., swapping each node for a smaller instance. And that kind of swap requires bringing up separate nodes from backup, using replication to catch up, and then cutting over, which is a lot of effort.
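To make the “no linear write scale” point concrete, here’s a toy calculation (my own illustration, not a benchmark): in a master/master topology, every node has to apply every write, whether it arrived from a client or via replication, so per-node write load stays flat no matter how many masters you add.

```python
# Toy arithmetic: why adding master/master nodes doesn't add write capacity.
# Every master must apply every write (its own plus the replicated ones),
# so per-node write load stays flat while HA and read capacity improve.
TOTAL_WRITES_PER_SEC = 10_000   # assumed cluster-wide write rate
TOTAL_READS_PER_SEC = 50_000    # assumed cluster-wide read rate

for nodes in (1, 2, 3, 4):
    writes_applied_per_node = TOTAL_WRITES_PER_SEC        # unchanged
    reads_served_per_node = TOTAL_READS_PER_SEC // nodes  # reads do spread out
    print(f"{nodes} masters: each applies {writes_applied_per_node} writes/s, "
          f"serves ~{reads_served_per_node} reads/s")
```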
Scaling in a sharded array is similarly complex. Partitions have to be consolidated between shards, application queries often have to be modified, and the shard-to-data lookup table (LUT) routing has to be updated. Nearly everyone I’ve talked to who has deployed and/or supported sharded installations has confirmed it: “We never try to scale back in.”
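As a rough sketch of where that complexity lives, the snippet below models a shard-to-data routing LUT and what retiring a shard implies. The bucket scheme and the helper names (`route`, `scale_in`) are illustrative assumptions, not any particular sharding framework’s API.

```python
# Minimal sketch of a shard routing LUT and what scaling in costs.
# The layout and helpers are assumptions for illustration only.

# Routing LUT: maps a hash bucket to the shard that currently owns it.
SHARD_LUT = {
    0: "shard-a",
    1: "shard-b",
    2: "shard-c",
    3: "shard-d",
}
NUM_BUCKETS = len(SHARD_LUT)

def route(customer_id: int) -> str:
    """Application-side routing: every query consults the LUT first."""
    return SHARD_LUT[customer_id % NUM_BUCKETS]

def scale_in(removed_shard: str, target_shard: str) -> None:
    """Retiring a shard: its partitions must be bulk-copied to a surviving
    shard, the LUT repointed, and every client that caches the LUT refreshed.
    Queries that assumed shard-local joins or constraints may need rewriting."""
    for bucket, owner in SHARD_LUT.items():
        if owner == removed_shard:
            # In reality: copy the bucket's rows, let replication catch up,
            # verify, and only then repoint the routing entry.
            SHARD_LUT[bucket] = target_shard

if __name__ == "__main__":
    print(route(42))                 # routes by bucket, here "shard-c"
    scale_in("shard-d", "shard-a")   # consolidate shard-d into shard-a
    print(SHARD_LUT)                 # bucket 3 now points at "shard-a"
```

And that only covers the routing bookkeeping; the data movement and application changes are the expensive part, which is why teams avoid scaling back in at all.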
So when traffic spikes 40% above normal, the site craters.