Metrics Collection From Large Scale IoT Deployments at Vivint

Vivint’s engineering team built its own metrics collection platform to collect and analyze metrics from its deployed devices. The key motivation for writing a new system was to store only aggregated data and focus on analyzing it, which the team achieves with its Rothko project.
Vivint is a provider of smart home devices. The fundamental design decision that differentiates Rothko from other systems like Graphite and OpenTSDB is to store aggregated data instead of data points for every service. This was a conscious trade-off: the system does not store every data point, yet it still has to make it possible to pinpoint issues. At the same time, the data had to remain available for statistical analysis without losing any key features needed for such analysis.
Rothko allows looking at overall distributions of metrics and analyzing them. Since individual metrics are not stored, does the team ever run into situations where issues with individual devices need to be diagnosed? InfoQ got in touch with Jeff Wendling, Software Engineer at Vivint, to find out more about this and about Rothko’s architecture:
Indeed, we don’t store the individual data points. This is solved in two ways. One, we can easily and cheaply store the minimum and maximum as well as who they came from, so we do. That helps when they’re the most deviant outlier. Two, since every device is sending data approximately every 30 minutes, we have a “firehose” that lets us tap into the data and filter out specific metrics or devices, etc. Assuming that it’s still sending, we can usually figure out who it is. Of course, both of these methods don’t guarantee you’ll determine the problem, but it’s a cheap and easy 80% solution for 20% of the effort, which fits in with the Rothko principles.
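The first approach Wendling describes, keeping only the minimum and maximum together with who reported them, can be illustrated with a minimal Go sketch. This is not Rothko’s code: the Extremes type, the Observe method, and the device IDs are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Extremes is a hypothetical summary that keeps only the smallest and
// largest value seen for one metric, plus the instance ID behind each.
type Extremes struct {
	Min, Max     float64
	MinID, MaxID string
	Count        int
}

// Observe folds one data point into the summary. The point itself is
// not stored anywhere afterwards.
func (e *Extremes) Observe(id string, v float64) {
	if e.Count == 0 || v < e.Min {
		e.Min, e.MinID = v, id
	}
	if e.Count == 0 || v > e.Max {
		e.Max, e.MaxID = v, id
	}
	e.Count++
}

func main() {
	var e Extremes
	e.Observe("dev-a", 12.5) // made-up instance IDs and values
	e.Observe("dev-b", 97.0)
	e.Observe("dev-c", 3.25)
	fmt.Printf("min=%v from %s, max=%v from %s, n=%d\n",
		e.Min, e.MinID, e.Max, e.MaxID, e.Count)
}
```

The summary stays the same size however many devices report, which is why this kind of bookkeeping is cheap. As the quote notes, it only identifies a misbehaving device when that device happens to be the most extreme outlier.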
Time series data typically has metadata like tags that store additional properties of the data like the application name or datacenter location for logical grouping during analysis. Is this true for Vivint’s data also? Wendling replied:
While we don’t send up anything other than a random ‘instance id’, it’s currently just an unstructured slice of bytes. Theoretically you could send up whatever you wanted in there. Since the set of devices we’re monitoring are mostly cheap devices in customers’ homes, we don’t have any GPS equipment in them or anything, but you can get reasonably close with geolocation on the IP.
Rothko’s architecture has five main parts: a database that stores each metric in a configurable number of flat files, which it reads and writes using mmap; a listener that accepts metrics over the Graphite wire protocol; an approximate quantile sketch that aggregates the data; API endpoints that retrieve data and render graphs; and a frontend UI for easy human consumption. Data can be sent securely from devices to the Rothko endpoint.
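For readers unfamiliar with the Graphite wire protocol, its plaintext form sends one line per data point: a metric path, a numeric value, and a Unix timestamp in seconds, separated by whitespace. The Go sketch below parses one such line. It is an illustration only, not Rothko’s listener: the Point type, the ParseLine function, and the metric name are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Point is a hypothetical in-memory form of one Graphite plaintext line.
type Point struct {
	Name  string
	Value float64
	Time  time.Time
}

// ParseLine parses "<metric path> <value> <unix timestamp>".
func ParseLine(line string) (Point, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return Point{}, fmt.Errorf("want 3 fields, got %d", len(fields))
	}
	v, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return Point{}, fmt.Errorf("bad value %q: %w", fields[1], err)
	}
	ts, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return Point{}, fmt.Errorf("bad timestamp %q: %w", fields[2], err)
	}
	return Point{Name: fields[0], Value: v, Time: time.Unix(ts, 0).UTC()}, nil
}

func main() {
	// The metric name below is made up for the example.
	p, err := ParseLine("devices.panel.battery_pct 87.5 1500000000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.Name, p.Value, p.Time.Unix())
}
```

In a design like the one described, each parsed point would be handed to the aggregation step rather than written to disk individually.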
“The design was kept pluggable”, says Wendling, since “there are many competing standards and different workloads. For example, internally, we have our own plugin for reading metrics from our custom wire protocol. It’s designed to be easy to write plugins and configure them with a toml file. Even logging and internal metrics collection of the process can be easily swapped out to match whatever you want.”
Rothko was designed to handle a small number of metrics across a large set of instances. It currently handles close to 50,000 metrics and completes disk flushes for them in about 50 seconds with 500MB of RAM. The flushes happen every 10 minutes, so “it should be easy to do 500k metrics”, according to Wendling. It’s deployed on a single instance, and there has been no need yet to implement scaling policies like horizontal sharding.
Asked whether Vivint’s team also uses any alerting mechanisms, Wendling responded that it does not, and that the focus is more on keeping an eye on dashboards. Rothko is written in Go, is open source, and is hosted on GitHub.
