
Q&A With Stuart Davidson on Scaling CD at Skyscanner


March 31, 2018


Stuart Davidson spoke at QConLondon 2018 about Skyscanner’s mission to get from a reactive operations model to providing teams with an empowering developer experience. Davidson told the story of how, with support and a lofty goal from their CTO, they began a technical and cultural journey to enable their squads to deploy 10,000 times a day. InfoQ spoke with Davidson to learn more.
Stuart Davidson, an engineering manager at Skyscanner, spoke at QConLondon 2018 about his organisation’s journey from a reactive operations model to providing teams with an empowering developer experience. Davidson told the story of how, with support and a lofty goal from CTO Bryan Dove, Skyscanner set out to provision the capacity to deploy 10,000 times a day, across an organisation which already had 600 technologists worldwide and was continuing to grow. This took the form of an iterative path of small increments and cultural change, ultimately resulting in a containerised platform owned by empowered teams.
Davidson, who heads up Skyscanner’s Development Mechanics and Deployment & Orchestration teams, shared that he was originally part of a small group fully occupied with reactively delivering minor increments, such as specific CI enhancements in response to requests from development teams. Recognising the strategic benefit that a robust deployment pipeline would bring, the company’s leadership set a goal to create a platform capable of deploying 10,000 changes a day. Davidson described this as an awakening, as they realised that the platform they were designing would never have scaled to this capacity.
Davidson explained that his team discovered they were “a strategic enabler” for the business by recognising that they were, in fact, a “strategic roadblock”, gating provisioning of the CI pipeline “which all products had to go through.” Davidson calls this the “Jenkins Paradox,” which he described as the contradiction between providing repeatable builds, infrastructure and robustness while at the same time encouraging exploration and innovation with new and unfamiliar tools.
Their solution was to move their CI infrastructure to Drone, a container-native tool, which managed layering of the build and CI environments provided by development squads. This gave the squads greater ownership of their build, test and runtime infrastructure. He explained that making teams responsible for their own repeatable pipelines proved popular with the squads and resulted in upskilling: “The adoption was insane because the engineers saw the benefits as well. These semi-autonomous squads could build the pipelines they wanted. They loved it…and we’d inadvertently trained every squad in containers. There was at least one person in every squad who knew how to manage their Dockerfile, as they needed it to control their build environment. So we thought, let’s take it to production.”
Skyscanner started their production journey with AWS’s ECS service, as it was the “cheapest, easiest, most accessible container scheduler.” While they are now migrating to managed Kubernetes, Davidson doubted that choosing Kubernetes at the outset would have been as successful, given that they were already an AWS shop. He pointed out that technical problems can be deferred and that teams should make “solutions as simple as possible if taking an iterative step.” On working in small iterations, he said: “Don’t invest too much. It’s a hypothesis. Don’t get into a position where you are investing six months’ worth of effort. Try and find something that’s quick and easy. Learn from that and then try and find what’s important to you in the next step.”
Davidson shared that experimenting with tools should be balanced against the cultural shifts required to achieve CD, saying: “If you want to try some of these tools, look for one that is robust and will run, and you won’t have to worry about the operations of it. Because you will have way, way more problems. You’ll have the cultural shift.”
Skyscanner’s deployment solution evolved rapidly to one where teams were able to do blue/green deploys with integrated monitoring and observability. Davidson told the audience that they were able to “take an idea and put it into production, and have it monitored and alerted in 30 minutes.” He pointed out that the steps they took to get there were “just small iterations on the idea that we already had.”
Davidson also spoke about the safety that small deploys gave the teams, with risk reduced even though they had concerns about their testing: “Every change going into GitLab was being deployed continuously in a blue/green fashion. This was scary as our testing was good, but maybe not that good. With continuous deployment we did find, if you look at the risk equation, the change which was being deployed was very, very small… So if there was a problem it was easy to find where the problem stemmed from.”
Davidson spoke of how they added a canary-like pause to their blue/green deploys, using Stack Overflow’s Bosun to execute queries against a time series database holding both system and business metrics to assess the health of a release. Davidson explained that squads could define programmatic pauses of the rollout at specific percentages to analyse the metrics for acceptance patterns; if this analysis failed, the deployment was automatically rolled back. Davidson said: “We can also query (OpenTSDB) to see if our sales of flights have gone down. In fact, we would prefer to rollback if that suddenly takes a plummet. It could be that something good came on the television, people stopped looking at Skyscanner and we rolled back accidentally, but that’s OK. We want to be safe with this sort of thing.”
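The pause-and-check rollout described above can be illustrated with a minimal Python sketch. It is not Skyscanner’s implementation: the names `HealthCheck`, `run_rollout`, `scripted` and the metric `flight_sales_per_minute` are hypothetical, the traffic-shifting callback stands in for whatever the real load balancer or scheduler exposes, and the metric queries are plain callables rather than real Bosun or OpenTSDB calls.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class HealthCheck:
    """One acceptance check. `query` is a hypothetical stand-in for a
    metrics query (system or business metric) that returns a current value."""
    name: str
    query: Callable[[], float]
    minimum: float  # the check fails if the metric drops below this value


def run_rollout(
    pause_points: Iterable[int],
    checks: List[HealthCheck],
    set_green_weight: Callable[[int], None],
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Shift traffic to the new (green) version in steps.

    At each percentage in `pause_points` the rollout pauses, then runs
    every check. Any failing check sends all traffic back to the old
    (blue) version. Returns "promoted" or "rolled_back".
    """
    for percent in pause_points:
        set_green_weight(percent)
        sleep(pause_seconds)  # let metrics accumulate before judging
        failed = [c.name for c in checks if c.query() < c.minimum]
        if failed:
            set_green_weight(0)  # automatic rollback
            return "rolled_back"
    set_green_weight(100)  # every pause passed: finish the cut-over
    return "promoted"


def scripted(values: Iterable[float]) -> Callable[[], float]:
    """Demo helper (assumption): replays fixed metric readings in order."""
    it = iter(values)
    return lambda: next(it)


# Demo: flight sales look healthy at 10% but plummet at 50%.
weights: List[int] = []
sales = HealthCheck("flight_sales_per_minute", scripted([120.0, 40.0]), minimum=100.0)
outcome = run_rollout([10, 50], [sales], weights.append, pause_seconds=0)
print(outcome, weights)  # rolled_back [10, 50, 0]
```

As in Davidson’s television example, a check like this cannot tell a bad release from an unrelated dip in sales, so the sketch errs on the side of rolling back.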
At the time of presenting, Davidson showed that in the previous month Skyscanner had deployed 456 distinct services a total of 3,733 times across multiple regions. He described his teams’ goal as being one of empowering squads to be able to focus on delivering value to the traveller: “We aim to be a force multiplier. For every engineer who works in Skyscanner we try to enable them to do more in a day. We try and do as much of the heavy lifting as we can, so an engineer can get their source into production as quickly and reliably as possible. They can then focus on the product and features we give to the traveller.”
InfoQ spoke with Davidson to learn more about this journey to scaled CD.
InfoQ: What form did management support take after the initial challenge to be able to scale to 10K deploys a day?
Stuart Davidson: Bryan lived up to his word and mentioned Slingshot as often as he could; we even got to present it to the board.
My manager (Ryan Crawford) and my Engineering Lead (Paul Gillespie) did a ton of work going round the engineering leadership and getting feedback, challenging perceptions and gathering a group of influencers that would help sell it to the rest of the company.
But it wasn’t just top-down: we had tremendous support from an ambitious Tribe Engineering Lead called Dave Garcia, who could see the benefit of what we were doing and was adamant that his Direct Booking team become early adopters of the product.