Janna Brummel and Robin van Zijll, from ING Netherlands, talked at the Velocity conference in London about how the poor availability of their internet banking systems prompted the bank to implement an SRE culture. A centralized SRE team was set up in the Netherlands to provide tooling, consulting and education on reliability to product teams (known internally as BizDevOps squads).
By mid-2017, ING’s metrics showed that the availability of their internet banking retail systems was down to 96.84%, in contrast with other systems (iDEAL retail and mobile banking retail) that were closer to the ideal 99.99% mark. Some of the factors leading to this outcome included: lack of monitoring ownership by product teams; a centralized alerting system that triggered only at a very high level (system down), which led to long times to diagnose incidents and delegate them to engineers (69 minutes on average for a major incident); infrequent post-incident reviews and sharing of lessons learned; and lack of availability insights at component level (results aggregated at service level only, which contributed to product teams not feeling directly responsible).
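To put those percentages in perspective, the short Python sketch below converts an availability figure into the downtime it implies. The measurement window is an assumption: the talk summary does not say over which period (or which service hours) availability was measured, so a full 365-day year is used here purely for illustration, and the function name is hypothetical.

```python
# Assumption: availability is measured over a full, non-leap calendar year.
# The source does not state the actual measurement window.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (96.84, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% availability: {hours:.2f} hours "
              f"({hours * 60:.0f} minutes) of downtime per year")
```

Under that assumption, 96.84% corresponds to roughly 277 hours (about eleven and a half days) of downtime per year, while 99.99% corresponds to less than one hour.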
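The centralized SRE team has a consulting role only (it does not run the services and is not on call for them), but it also acts as a platform team, providing tooling and internal services to help the product teams run and improve their systems’ reliability. Planning and prioritization of the team’s backlog are guided by the service reliability hierarchy defined in Google’s SRE book, a pyramid whose layers are, from bottom to top: monitoring, incident response, postmortem and root cause analysis, testing and release procedures, capacity planning, development, and product.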
So far, the SRE team has focused mostly on the bottom three layers of the pyramid. For monitoring and incident response, they are building shared tooling based on Prometheus, Grafana and Mattermost (ChatOps). They facilitate postmortems run by the product teams and provide consulting on how to identify and fix reliability issues. Brummel and van Zijll mentioned that it took time and concerted effort to remove the existing blame culture around major incidents. They advise investing time in creating awareness and setting the scene before actually increasing the frequency of incident reviews, as otherwise the reviews can backfire.
All these changes were rolled out on demand, not as a “big bang” initiative, allowing product teams to decide whether to switch to the tooling and practices proposed by the SRE team. The SRE team itself is also in the process of scaling from one team with a few engineers to a larger community of practice, with multiple SRE teams across different countries (currently three teams in the Netherlands, one in Spain and one in Australia). Demos and internal discussions on SRE topics help build the community.
Brummel and van Zijll’s takeaways so far in their SRE journey include: value an SRE mindset over specific skills when hiring; give the SRE team a product owner to protect it from conflicting priorities; be ready to spend a lot of time explaining and promoting SRE to product teams; make sure the tooling you provide is of commercial quality in terms of usability and alleviates actual pain points of your users; and consider scalability and ownership in your tooling strategy.