
Platform health is the key to high-performing teams
Many large businesses have nearly imploded because they failed to address their technical debt. Here's how to measure platform health, and what to do about it.
All stakeholders and team members want to be part of an efficient and effective process. But how does a business identify what it needs to do to create a culture of high performance?
After a complete outage in 1999, eBay went nearly 2 years without releasing a major feature so it could re-platform. In 2002, following Bill Gates's Trustworthy Computing memo, Microsoft froze feature development for a year. The same happened with Amazon in 2004, Twitter in 2008, LinkedIn in 2009, and many more.
These businesses all nearly imploded because unhealthy platforms had left their teams performing poorly. Each had to stop building new features and address the health of its platform in order to survive and ultimately thrive.
According to Gene Kim, founder of Tripwire and author of The Phoenix Project, high performance in teams begins with DevOps. His definition of DevOps is a particularly useful one because, rather than describing what DevOps is, it describes the outcomes we should aspire to:
"DevOps is the architecture, technical practices, and cultural norms that increase our ability to deliver applications and services, enable rapid experimentation and innovation, and deliver value to customers quickly."
The State of DevOps Report 2019 used cluster analysis to identify trends in the deployment practices of teams who rate their performance as Elite, High, Medium and Low. It looked at metrics like deployment lead time and frequency, finding them roughly 100 and 200 times better, respectively, in Elite versus Low-performing teams. So what can be done to help your teams achieve the deployment practices so closely linked with team and business performance?
Technical debt = risk
Part of this can be attributed to the health of the architecture and code — something engineers often call "technical debt". Technical debt leads to increased fear of deployments, increased complexity in testing, and increased pain of development. The trouble with the term "tech debt" is that it doesn't transfer its meaning well from engineers to executives. Debt to an exec can be something quite acceptable — a strategic norm to be paid off later without consequence along the way — but this is not the case when it comes to platform health. Maybe instead we can use a shared concept that engineers and execs both understand: risk.
In a project full of technical debt, risk escalates the longer the technical and architectural problems are ignored. Fixes are deferred, releases grow more painful, delivery of business value slows, engineers grow fearful and dissatisfied, and eventually the project melts down. To stop this risk from escalating, we need to understand when to step in and fix the problems, and to do that we need a metric to measure against.
Risk can be measured in pain
Arty Starr, founder of DreamScale and author of Idea Flow, talks about one such metric: the pain felt by the engineers developing the software. Working in a complex piece of software often requires debugging and troubleshooting, essential when implementing a new feature or fixing a defect. Your engineers are happily coding away when they find something unexpected. This leads to confusion, which in turn impedes progress. This confusion is a normal part of the development process, but when it extends beyond a certain duration, it becomes pain.
Arty suggests that debugging something for more than 50 minutes is excessive and should always lead to a conversation to find out how to address the underlying issues. She suggests engineers keep a record of their progress and note when that confusion moment begins and ends. If there's no investigation and subsequent mitigation when an engineer feels pain, these problems will lead to ever-increasing project risk.
Don't be afraid to pull the andon cord
Reflecting whenever an engineer's confusion extends beyond a certain tolerance is not dissimilar to the idea of the andon cord in a Toyota plant. The principle is that a single cord runs through the plant, and any time someone on the production line finds something wrong they can pull the cord and stop the whole line. It could be that a component arrives at a worker's station in an unsatisfactory condition: they pull the cord, the team finds out how that happened, and something is put in place to prevent it from happening again.
The idea behind stopping everything any time something is wrong is simple: even a 20-second delay compounds over time and has knock-on effects further down the chain for everyone, so to de-risk the whole business, immediate intervention is always preferred. There should be no shame in calling this out. Often. At a typical Toyota manufacturing plant in Kentucky, the andon cord is pulled and the production line completely stopped 5,000 times per day. This is how Toyota maintains its position as a high-performing business: not by producing more features, but by refining its process to be more predictable.
Invest in platform health in order to de-risk
Given all of this, it's clear businesses need to invest in platform health in order to de-risk. If engineers are suffering daily pain; if deployments are taking days instead of minutes; if engineers are scared of the codebase they are working in, it's time to act.
So if you're a DevOps engineer feeling the burden of deployments; a developer feeling the pain of feature development or debugging; or an exec wondering why your business isn't innovating fast enough — maybe it's time to start talking with one another about the risk that you have inadvertently accepted by neglecting the health of your platform.