Systems Observability

December 18, 2021

This article helps explain what observability is in the world of large information systems and how to apply it properly in large distributed organizations.
According to Wikipedia: “Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In control theory, the observability and controllability of a linear system are mathematical duals.” In simple words, it is how well a system describes its internal state through its external outputs.

There are three main pillars of observability. Metrics are time-series sensor data that provide quick, low-latency feedback on system performance. Traces help to find where an error happened. Logs are text data describing in detail the events that happened at a low application level.

Everything starts with data gathering. No matter how complex or simple your system is, you will need this data as a basis for further analysis and action.

In the world of distributed computing, clouds, and microservices, making a system observable might look like a very hard problem. It becomes much easier when you look at a system from the perspective of the users who interact with it. What the user should know in terms of observability is the system's operational state: is it good or bad, working or not, operating successfully or out of operation?

We have plenty of real-world examples of doing this day to day, such as judging whether we are healthy or whether the car is ready to go. We do not think about it consciously because we do it automatically, but to answer these questions we need metrics. To know whether we are OK, we need to measure temperature, blood pressure, and blood test results. To say whether the car is ready to go, we need to check the control panel for errors.

If the system has many components, its overall state is the binary multiplication (logical AND) of the component states: the system is operational only if every component is. To know the overall system state, we therefore need to collect metrics from each component. We also want to know what the state was in the past and when it changed.
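The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each component is assumed to report a single boolean health flag, and the class name `SystemStateTracker` and the component names `api` and `db` are hypothetical, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Mapping, Tuple


@dataclass
class SystemStateTracker:
    """Hypothetical helper: overall state is the logical AND of component states."""

    # (timestamp, overall_state) pairs, appended only when the state changes.
    history: List[Tuple[datetime, bool]] = field(default_factory=list)

    def record(self, component_states: Mapping[str, bool], at: datetime) -> bool:
        # "Binary multiplication": the system is up only if every component is up.
        overall = all(component_states.values())
        # Keep transitions only, so the history doubles as a state-change log.
        if not self.history or self.history[-1][1] != overall:
            self.history.append((at, overall))
        return overall


# Usage with made-up component names and timestamps.
tracker = SystemStateTracker()
t0 = datetime(2021, 12, 18, 12, 0, tzinfo=timezone.utc)
tracker.record({"api": True, "db": True}, t0)
tracker.record({"api": True, "db": False}, t0 + timedelta(minutes=1))
tracker.record({"api": False, "db": False}, t0 + timedelta(minutes=2))

for changed_at, state in tracker.history:
    print(changed_at.isoformat(), "UP" if state else "DOWN")
```

Storing only the transitions keeps the history small while still answering both questions from the text: what the state was at an earlier time, and when it changed.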
To track state over time, we need to collect this data from the components continuously. Once we have metrics data, we can build a dashboard with a nice view. A nice dashboard shows your boss you are cool, but which of these indicators do you actually need to tell what state the system is in? From the huge number of possible metrics, KPIs, and measurements, the next three are the most important, as they directly affect user experience and business operations.