Evolution of Deployment Architecture at Buzzfeed

The Buzzfeed engineering team shared the story of how their deployment process evolved from monolithic-app releases that took days to roughly 150 deployments per day. They built an in-house framework called rig that leverages tools like Docker, AWS ECS, and Jenkins to move to a service-oriented architecture and a more collaborative engineering team.
Many engineering teams have shared their stories of how they evolved in terms of architecture, deployment, and DevOps culture. Buzzfeed, a media and technology site, is one of the latest to do so. Buzzfeed started out with a small monolithic application which gradually grew in features and users, leading to a parallel growth in the size and scope of releases. The deployment process became cumbersome as the suite of products expanded, each of which had varied requirements. Deployments started to take days to push and validate.
A small group in the infrastructure team started an internal PaaS project built around containers to improve the deployment workflow. InfoQ reached out to Matt Reiferson, VP of Engineering at Buzzfeed, to find out more about this turning point:
There was quite a bit of discussion around consolidating and improving our config management systems and associated tooling. We ultimately felt like that approach would only yield minor gains over the existing workflow. Our hypothesis was that instead of choosing between Puppet, Chef, or Ansible, etc. we should obviate the need for the user to interact with them at all. Essentially, we were missing a higher-level abstraction that would allow teams to focus on solving their actual problems and iterating quickly. Containers are a natural answer — they drastically simplify "config management" and provide a substrate for a uniform "service" abstraction. We realized that all of the container orchestration solutions required glue to provide the consistent development and configuration experience we wanted to provide.
Along with this toolset, the team decided to adopt a service-oriented architecture (SOA) model for their applications. SOA comes with its own set of challenges, both cultural and technical. Teams have to be empowered, and that requires organizational maturity. Reiferson reflects on Buzzfeed's experience on this aspect:
We had already adopted an organizational structure loosely based on Spotify's model, where we divided into mission-driven "groups" composed of "squads", each staffed with the appropriate skillsets to accomplish their goals. This shift highlighted the need to invest in infrastructure. Rig then crystallized and encouraged behaviors we were looking for those teams and individuals to have – around development workflow, operational ownership, observability, and consistency. We crafted a series of internal documents called "The BuzzFeed Guide To Computers", which talk about our technology choices, workflow, and FE/BE/Mobile architecture. Most importantly, they dig into the tradeoffs we considered — the why in addition to what — which provide context for making good choices when building new systems. We also formed an architecture review council along with templates for project submission which teams could take advantage of.
Rig took some inspiration from the open-source paasta project. The infrastructure requirements for each service are declared in a YAML file, alongside a Dockerfile for building the container image. The designers adopted several runtime conventions, some drawn from the Twelve-Factor principles and others such as keeping no local state and exposing health check endpoints on all HTTP-based services. The infrastructure layer was VM-based, with deployments initiated from a web interface and Terraform used for cloud infrastructure provisioning. The team leveraged Amazon's Elastic Container Service (ECS) for container orchestration, with other AWS services handling DNS and load balancing.
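The article does not show rig's actual configuration schema, but a minimal sketch of what such a per-service declaration might look like, with purely illustrative field names, could be:

```yaml
# Hypothetical service declaration in the spirit of rig; the field names
# are illustrative and not rig's actual schema.
name: example-service
docker:
  dockerfile: ./Dockerfile      # the image is built from the service's own Dockerfile
environments:
  production:
    instances: 4                # desired number of ECS tasks
    cpu: 256
    memory_mb: 512
healthcheck:
  path: /healthcheck            # every HTTP-based service exposes a health endpoint
  interval_seconds: 30
```

Combined with the no-local-state convention, a declaration like this lets the orchestrator replace or scale tasks freely, since any instance can be stopped and rescheduled without losing data.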
The primary improvements sought over the old system were making the development and deployment pipeline easier to work with, standardizing application interfaces, and driving higher post-deployment engagement across all teams. The goal was better collaboration and site reliability. The right toolsets are necessary for dev, QA, and ops collaboration, especially ones that improve visibility. Observability was one of the key principles for the Buzzfeed team in building their toolset, which would have "out-of-the-box support for system and application distributed logging, instrumentation, and monitoring."
Buzzfeed uses Datadog for metrics collection and a Nagios-based tool for monitoring. Nagios is integrated with PagerDuty, with critical, actionable alerts defined in Nagios. "These alerts are also delivered to team-specific Slack channels, declared in the respective service config", says Reiferson. The integration between Nagios and Datadog needs to be better defined to arrive at an effective escalation policy, something Buzzfeed is still exploring.
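Since alert delivery to team-specific Slack channels is declared in each service's config, a hedged sketch of such a section, again with made-up field names, might look like this:

```yaml
# Hypothetical monitoring/alerting section of a service config; the keys are
# illustrative, not rig's real schema.
monitoring:
  slack_channel: "#example-service-alerts"  # team-specific channel for alert delivery
  alerts:
    - name: high-error-rate
      source: nagios                        # critical, actionable alerts are defined in Nagios
      condition: "5xx rate above 1% for 5 minutes"
      page: true                            # true routes the alert through PagerDuty
```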
A typical deployment pipeline starts from a code commit and goes through a Jenkins-based builder service, which builds a container image and runs the tests against it. On a successful test run, the image is pushed to a container registry.
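The builder itself is Jenkins-based and its configuration is not shown in the article; the flow it implements can be summarized, independent of any Jenkins syntax, roughly as:

```yaml
# Illustrative summary of the build flow described above; this is not a Jenkins
# configuration format, just the stages expressed as data.
pipeline:
  trigger: commit pushed to the service repository
  stages:
    - build: build the container image from the service's Dockerfile
    - test: run the service's test suite against the built image
    - publish: push the image to the container registry on test success
    - deploy: the published image can then be deployed from rig's web interface
```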
As the team moved from a proof of concept to a production-grade tool, they faced challenges such as limited operational experience with Docker and with the then newly launched AWS ECS. However, ECS rests on top of proven AWS infrastructure, which saved them from worrying about container scheduling and let them focus on the machines and the software stack running on them.
The migration was done in stages, with low-risk, small-workload systems migrated first. The impact has been manifold, with an average of ~150 deploys per day since rig launched. Culturally, too, the team has seen changes in terms of consistency across services, low-cost experimentation, and greater ownership.
