Tickets, queues, disk util %, CPU, and connections are just a few of the things you should consider when setting up your alerts for MongoDB.
What’s happening inside your database can have a huge impact on your application and your customers’ happiness. In this post, we’ll talk through some of the things you need to look out for to keep your MongoDB deployment on track. There’s a ton of new stuff in MongoDB that will change how you monitor, including a new storage engine, and MongoDB Atlas, our database as a service offering.
Since all MongoDB Atlas deployments use the WiredTiger storage engine instead of MMAPv1, the old Lock % metric doesn’t make sense anymore. Document-level locking means better performance for just about all workloads.
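If you are ever unsure which engine a given deployment is running, you can check it directly. Here is a quick sketch using PyMongo; the connection string is a placeholder.

```python
# Quick check of the storage engine a deployment is running, via serverStatus.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@your-cluster.mongodb.net")
status = client.admin.command("serverStatus")

# WiredTiger deployments report "wiredTiger" here; older MMAPv1 ones report "mmapv1".
print(status["storageEngine"]["name"])
```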
“Tickets Available” reveals the number of tickets available to the storage engine, which represents the number of concurrent read or write operations that can occur. When all tickets are claimed, new operations must wait in a queue, which is where the “Queues” metric comes in. Together, these metrics help you detect queries that take longer than expected due to load. Increasing your instance size (or sometimes your disk speed) will improve them. A good starting point for alerts is Tickets Available under 30 for a few minutes, or Queues over 100 for a minute. You want to avoid false positives from relatively harmless short-term spikes, but still catch issues when these metrics stay elevated for a while, so “send if condition persists” helps a lot here.
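If you want to sanity-check these numbers outside the Atlas UI, here is a minimal PyMongo sketch that reads the underlying serverStatus fields. The connection string is a placeholder, and the thresholds simply mirror the suggestions above.

```python
# Sketch: poll the WiredTiger ticket counts and queue depth behind the
# "Tickets Available" and "Queues" metrics.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@your-cluster.mongodb.net")
status = client.admin.command("serverStatus")

tickets = status["wiredTiger"]["concurrentTransactions"]
read_available = tickets["read"]["available"]
write_available = tickets["write"]["available"]

# Total operations currently queued waiting for a lock or ticket.
queued = status["globalLock"]["currentQueue"]["total"]

if read_available < 30 or write_available < 30:
    print("Warning: WiredTiger tickets are running low")
if queued > 100:
    print("Warning: operations are queueing up")
```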
Since MongoDB Atlas is a complete database as a service solution, we provision instances on your behalf for your data to live on. Those instances have their own metrics, which can be useful to alert on. The first of these I’d be keen to know about is Disk Utilization.
By default, we set an alert for you at 90% Disk Util %, but you might want to tweak this threshold even lower to get an earlier warning.
Similarly, it may be useful to know when your app has stopped writing to your database, so you might want an alert for disk util under approximately 10% that fires after 5 minutes of continued low disk utilization. This works well for production clusters, but is probably too noisy for ones in development. Additionally, if your app has very bursty usage (e.g., no one uses it at night), you might get some false positives with this alert. As with all of these alerts, experiment with values to find what’s right for your application, and don’t forget to periodically reevaluate them to make sure you’re getting the best information.
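For reference, alerts like this can also be created programmatically. The sketch below drives the Atlas Alert Configurations API with the requests library; the event type, metric name, and payload layout are my assumptions, so verify them against the current API documentation before relying on this.

```python
# Hedged sketch: create a "disk utilization under 10% for 5 minutes" alert via
# the Atlas API. Field names below are assumptions -- check the Alert
# Configurations API docs for the exact values for your deployment.
import requests
from requests.auth import HTTPDigestAuth

GROUP_ID = "<your-project-id>"      # placeholder
PUBLIC_KEY = "<api-public-key>"     # placeholder
PRIVATE_KEY = "<api-private-key>"   # placeholder

payload = {
    "eventTypeName": "OUTSIDE_METRIC_THRESHOLD",           # assumed event type
    "enabled": True,
    "metricThreshold": {
        "metricName": "DISK_PARTITION_UTILIZATION_DATA",   # assumed metric name
        "operator": "LESS_THAN",
        "threshold": 10,
        "units": "RAW",
        "mode": "AVERAGE",
    },
    "notifications": [
        # delayMin: only fire after the condition has persisted for 5 minutes.
        {"typeName": "GROUP", "delayMin": 5, "intervalMin": 60, "emailEnabled": True}
    ],
}

resp = requests.post(
    f"https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP_ID}/alertConfigs",
    json=payload,
    auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY),
)
print(resp.status_code, resp.json())
```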
In AWS, our M10 and M20 instances run on t2-family machines, which share CPU cores with their neighbors. To prevent noisy neighbor issues, the hypervisor intervenes when an instance uses too much of its allotted core fraction for too long, throttling it. This shows up as “CPU Steal” on your metrics tab. A good indication that it’s time to scale up is CPU steal going over approximately 15% and staying there. At that point you’ll experience slow queries and difficulty connecting, so it’s pretty important to avoid CPU steal.
The good news is that you can set an alert on CPU Steal, so I’d strongly encourage you to create one for your deployment. Set it at 10% and get ahead of the need to scale.
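If you are curious what CPU steal looks like at the operating-system level, here is a rough, Linux-only sketch that derives it from /proc/stat. It is only a local approximation of the metric Atlas already charts for you.

```python
# Rough sketch: the 8th field of the aggregate "cpu" line in /proc/stat is
# time stolen by the hypervisor. Sample it twice and compute a percentage.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]   # first line is the aggregate "cpu" line
    values = [int(v) for v in fields]
    steal = values[7]                       # steal is the 8th field
    total = sum(values[:8])                 # user..steal
    return steal, total

steal_1, total_1 = cpu_times()
time.sleep(5)
steal_2, total_2 = cpu_times()

steal_pct = 100.0 * (steal_2 - steal_1) / (total_2 - total_1)
print(f"CPU steal over the last 5s: {steal_pct:.1f}%")
if steal_pct > 10:
    print("Consider scaling up before throttling hurts query latency")
```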
If I had to come up with another title for this section, it’d be “Indexes, [unprintable], do you speak it?!” If your database has its indexes configured correctly, you may almost never need to fetch entire documents from disk, serving queries at lightning speed from the index in the filesystem cache. Remember, however, that indexes are not purely good: too many of them can slow down your writes, since every index has to be updated on each write.
That said, if your “Query Targeting Scanned/Returned” or “Query Targeting Scanned Objects/Returned” goes over 50 or 100, it’s a strong indicator that your queries are inefficient. Check your primary’s logs to find them; in particular, look for log lines containing COLLSCAN, which indicate that no suitable index was found for the query.
Collection scans are expensive in terms of both CPU and disk I/O, so eliminating them can let you get better performance out of smaller instances and slower disks.
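One way to spot-check an individual query is to run it through explain() and compare documents examined to documents returned, which is the same ratio behind the Query Targeting metrics. Here is a hedged PyMongo sketch; the collection name and filter are made-up examples.

```python
# Sketch: check a query's scanned/returned ratio with the explain command.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@your-cluster.mongodb.net")
db = client.myapp  # hypothetical database

plan = db.command(
    "explain",
    {"find": "orders", "filter": {"status": "pending"}},  # hypothetical query
    verbosity="executionStats",
)

stats = plan["executionStats"]
scanned = stats["totalDocsExamined"]
returned = max(stats["nReturned"], 1)

print(f"scanned/returned = {scanned / returned:.1f}")
if scanned / returned > 50:
    # A supporting index, e.g. db.orders.create_index("status"), usually fixes this.
    print("This query is likely doing a collection scan; consider adding an index")
```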
Finally, each MongoDB Atlas instance size has a connection limit. That’s because each connection consumes resources from the operating system, and we want to make sure the operating system has enough left for MongoDB to properly handle your data. Your cluster’s connection limit is displayed on your Clusters tab, so use that number to your advantage.
I’d suggest setting an alert somewhere between 80% and 90% of your limit; this way, you can get ahead of your scaling needs, or catch an app problem where connections are getting gobbled up for no reason.
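As a quick illustration, you can also read the current and available connection counts straight from serverStatus. This sketch warns at the 80% mark suggested above; the connection string is a placeholder.

```python
# Sketch: compare current connections against the server-reported headroom.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@your-cluster.mongodb.net")
conns = client.admin.command("serverStatus")["connections"]

current = conns["current"]
limit = conns["current"] + conns["available"]   # available = unused connections left

usage = 100.0 * current / limit
print(f"Using {current} of {limit} connections ({usage:.0f}%)")
if usage > 80:
    print("Time to scale up, or hunt down a connection leak in the app")
```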