Operations in Teams - Part 1
Background
In early 2018 we reorganised from small, functional teams into purpose-driven, autonomous product teams. There were several driving factors.
Autonomy and purpose are two core principles of intrinsic motivation for individuals, and have proven to be effective for teams as well, so we wanted to make them part of how we organise ourselves. We also identified that this way of organising fits well with the Acast culture and way of working, and enables the teams to deliver value at high velocity while the company continues to grow rapidly.
During this reorganisation, we also had an idea for a new way to manage operations and on-call responsibilities for all of our systems. Rather than having architects or managers dictate how teams should design solutions and what tools to use, we wanted to give our DevOps engineers, who were on call for all of Acast, the mandate to decide which systems should be accepted into the on-call rotation.
That change meant our teams were autonomous in everything they do, right up until the point a system required people to be available 24/7 — at which point it had to adhere to a set of requirements.
We wanted each team to have engineers who are part of the on-call rotation, so there would be competence in each team to build its systems the “right way” from the beginning.
A few months into 2019, we still had the same people being on call for all of Acast’s increasing number of systems — and we had failed to hire and onboard people in all teams into the on-call rotation.
We had ended up in a situation where it was unsustainable for the individuals who were on call to manage operations for the entire platform while also contributing to their home product team. Nor would it be sustainable to grow the organisation any further without making changes.
During 2019, we grew from about 30 people working in our product and engineering organisation to more than 50 — and anticipated continued growth during 2020.
At that time, we believed it was the right time for us to take another step in having even more autonomous teams, and to have them take ownership of the complete lifecycle of their products and systems, including the operational part.
There’s a commonly cited quote from Werner Vogels that captures the core of it really well:
“Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”
Rolling it out
We wanted every team to take operational responsibility for the services it owns, potentially including taking part in an on-call rotation if service levels require it.
We expected all software engineers joining Acast to learn and operate their teams’ systems, and it became part of their job description. We hoped that engineers would appreciate the higher degree of autonomy this brings, and would take pride in being part of the complete product lifecycle.
When rolling this out, we didn’t push it on to any team until they felt ready. That meant some investments in tech-debt, deployments, automation, hiring, and so on.
For many teams, it also meant moving their services away from being deployed via our old Rancher setup (described here) to our current setup, running them on fully managed container services in AWS. This transition was ongoing for six months, as we moved to an AWS account structure that would support our way of working.
In order to achieve a high degree of autonomy while delivering secure, reliable and resilient services, we isolate workloads between all our product teams by giving them their own AWS accounts. All resources should be owned by a specific workload and not shared among several.
We’re implementing three levels of isolation, achieving a high level of segregation of duties while still giving full autonomy to our teams.
Level 1 is marked in red in the figure above, and access to it is restricted to a small number of individuals.
Level 2 is marked in blue, and we’re aiming to have that only accessible by each team’s build server. This means that by design no human interacts with the production environment — and we ensure that all changes, both in infrastructure and the applications, go through our change management process.
Level 3 is marked in green, and this is accessible by all team members — enabling them to do both experimentation and development work completely isolated from the production environments.
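To make the three levels concrete, here is a minimal sketch of the access rules as a plain TypeScript function. It is an illustration only, not our actual IAM setup: the type names, the `canAccess` function and the idea that Level 2 and Level 3 accounts are scoped to a single team are assumptions made for the example.

```typescript
// Hypothetical model of the three isolation levels described above.
// All names here are illustrative and do not correspond to real AWS APIs.
type IsolationLevel = 1 | 2 | 3;

type Principal =
  | { kind: "restricted-admin"; name: string } // one of the few individuals with Level 1 access
  | { kind: "build-server"; team: string }
  | { kind: "team-member"; name: string; team: string };

interface Account {
  level: IsolationLevel;
  // Owning team. Assumed to be unset for Level 1 accounts in this sketch.
  team?: string;
}

// Returns true if the principal may access the account under the stated rules.
function canAccess(principal: Principal, account: Account): boolean {
  switch (account.level) {
    case 1:
      // Level 1: restricted to a small set of individuals.
      return principal.kind === "restricted-admin";
    case 2:
      // Level 2: production, reachable only by the owning team's build server.
      return principal.kind === "build-server" && principal.team === account.team;
    case 3:
      // Level 3: experimentation and development, open to the team's members.
      return principal.kind === "team-member" && principal.team === account.team;
    default:
      return false;
  }
}
```

Under these assumptions, a developer calling `canAccess` against their own team's Level 2 account gets `false`, which mirrors the goal that no human interacts with production directly.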
Current state
We’re now in a position where all the teams have taken operational ownership of their own systems, and have 24/7 on-call rotations. Some argue that this is the actual meaning of the much-hyped term ‘DevOps’ — in other words, the same individuals both develop and operate their services.
We’ve chosen to compensate the on-call time outside of office hours with a fixed hourly rate, regardless of whether there are live site issues or not. The goal with this approach is to incentivise all teams to build reliable and resilient services by design, so the on-call compensation ends up being easy money with no alarms ever triggering outside of office hours.
We use PagerDuty as the tool for alerts and notifications, and also base our on-call payroll on teams’ PagerDuty schedules.
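As a rough illustration of how a fixed hourly rate can be derived from a schedule, the sketch below counts on-call hours that fall outside office hours and multiplies them by a rate. The `OnCallShift` shape, the function names, the 09:00 to 17:00 UTC Monday-to-Friday definition of office hours and the assumption that shifts start and end on the hour are all hypothetical; it does not use the PagerDuty API or reflect our real payroll rules.

```typescript
// Hypothetical shape for one on-call shift, e.g. exported from a schedule.
interface OnCallShift {
  engineer: string;
  start: Date; // assumed to fall on a whole hour
  end: Date;   // exclusive, assumed to fall on a whole hour
}

const HOUR_MS = 60 * 60 * 1000;

// Assumption for this sketch: office hours are 09:00-17:00 UTC, Monday to Friday.
function isOfficeHour(t: Date): boolean {
  const day = t.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = t.getUTCHours();
  return day >= 1 && day <= 5 && hour >= 9 && hour < 17;
}

// Counts the hours of a shift that fall outside office hours.
function offHoursOnCall(shift: OnCallShift): number {
  let hours = 0;
  for (let t = shift.start.getTime(); t < shift.end.getTime(); t += HOUR_MS) {
    if (!isOfficeHour(new Date(t))) {
      hours += 1;
    }
  }
  return hours;
}

// Sums compensation per engineer at a fixed hourly rate,
// regardless of whether any alert fired during the shift.
function onCallCompensation(
  shifts: OnCallShift[],
  hourlyRate: number
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const shift of shifts) {
    const amount = offHoursOnCall(shift) * hourlyRate;
    totals[shift.engineer] = (totals[shift.engineer] ?? 0) + amount;
  }
  return totals;
}
```

Because the amount depends only on scheduled hours and not on the number of incidents, the calculation itself carries the incentive described above: a quiet night pays the same as a noisy one.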
When the teams moved their services away from our Rancher setup, they implemented the infrastructure to be managed as code using the AWS CDK. There’s a threshold to get started with using AWS CDK, but once you get past that it provides our software engineers with the ability to write infrastructure as code in a familiar language, in their favourite IDE — and, in the process, produce more secure, consistent infrastructure.
We’re already seeing that engineers appreciate the higher degree of autonomy this brings them.
We’ve also noticed that lots of old issues, both known and unknown, are surfacing, and more of them are being fixed now that there are more people with more bandwidth to spend time on them. We’re still working on figuring out how teams should most effectively prioritise incident remediation work versus delivering new features and products.
When we put the expectation on people to own all aspects of what they’re developing, it’s important to ensure people feel they have the autonomy and control to allocate time needed to create healthy systems, while still maintaining a healthy and sustainable work environment.
The ultimate goal of being on call at Acast is that it should be easy money with zero work required outside office hours, because the systems are so resilient.
In the second part of this blog series we’ll share testimonies and experiences from some of our engineers. Watch this space.