Avoiding Alerts Overload from Microservices: Sarah Wells at QCon London

At QCon London, Sarah Wells presented “Avoiding Alerts Overload from Microservices” and cautioned that developers and operators must fundamentally change the way they think about monitoring when building a distributed microservice-based system. Key takeaways included: build a system that can be supported; focus monitoring and alerts on ‘stuff that matters’, such as core user journeys and business functionality; and continually and proactively cultivate and improve alerts.
Wells, a Principal Engineer at the Financial Times, began the talk by stating that knowing when there is a problem is not enough: an alert should be triggered only when action by a human is required. A microservices architecture may allow the development team to move fast, but there is an operational cost, and the number (and complexity) of alerts generated by a microservice-based system can be overwhelming.
The FT.com website is powered by a microservice backend, written primarily in Java and Go, and packaged and deployed with Docker and CoreOS onto the Amazon Web Services (AWS) platform. Data stores include MongoDB, Elasticsearch, Neo4j and Apache Kafka. There are 99 functional services (with 350 running instances at any given time) and 52 non-functional services (with 218 running instances). Wells stated that if each of the 568 service instances were checked every minute, this would result in 817,920 checks per day. Running containers on shared Virtual Machines (VMs) requires a further 92,160 system-level checks, for a total of 910,080 checks per day. In addition, any microservice-based application is a distributed system, and accordingly its services do not run independently. If something fails, it can often lead to cascading failures, which further complicates monitoring and alerting.
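The check counts quoted above can be reproduced with a little arithmetic. The following is a minimal sketch in Python, not anything shown in the talk: the function and constant names are illustrative, and the 92,160 system-level figure is taken as given because the summary does not say how it breaks down.

```python
# Figures quoted in the talk summary; names here are illustrative only.
FUNCTIONAL_INSTANCES = 350
NON_FUNCTIONAL_INSTANCES = 218
# Taken as given: the summary does not state how this figure is derived.
SYSTEM_LEVEL_CHECKS_PER_DAY = 92_160

MINUTES_PER_DAY = 24 * 60  # 1,440


def daily_checks(instances: int, checks_per_minute: int = 1) -> int:
    """Number of checks per day if every instance is checked at a fixed rate."""
    return instances * checks_per_minute * MINUTES_PER_DAY


total_instances = FUNCTIONAL_INSTANCES + NON_FUNCTIONAL_INSTANCES
service_checks = daily_checks(total_instances)
total_checks = service_checks + SYSTEM_LEVEL_CHECKS_PER_DAY

print(f"instances:       {total_instances:,}")
print(f"service checks:  {service_checks:,}")
print(f"total checks:    {total_checks:,}")
```

With one check per instance per minute, the sketch yields 568 instances, 817,920 service checks and 910,080 checks in total, matching the figures Wells quoted. Even a very small failure rate at this volume would produce far more alerts than a team could act on.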
