It's three o'clock in the morning and you’re awakened by the unwelcome beep of a text message. You don't even have to look at your phone to know what it’s about. Good news always sleeps ’til noon, as the saying goes, and so you surmise (quite rightly) that somewhere - somehow - something’s gone wrong with the servers and the site is down. Your business is based entirely on global eCommerce. Every minute the site is down, customers are annoyed and sales are lost.
You stumble out of bed and make your way to your office downstairs. Your wife sleepily yet pointedly reminds you on your way out that this is the fourth time this month. This is not news to you. And to be honest, her count may even be on the conservative side. For months now, the site has been springing leaks on a weekly basis. The team is disgruntled. Some have left, while the ones who’ve remained are feeling powerless. The issues are too numerous. They simply can't keep up.
You get into your office, boot up your laptop and find two team members chatting about the problem. Dave has already isolated it, and is in the midst of fixing it. That indicates it's a problem with the fulfillment system. Dave knows his way through that code better than anyone. If it were a problem with search, Ryan would be all over it. If it were with transaction processing, Mei would take care of it. As for the other parts of the site, well, they don't break.
The six other team members no longer bother responding to pre-dawn texts. They assume (quite rightly) that whatever problem’s presented itself relates to one of the three typically problematic systems, and that one of the three aforementioned “heroes” will take care of it.
As the team lead, you just have to wake up and be present – but the heroes will get the job done. Soon it's just you and Dave in IM, though you're not even talking. Then Dave types, “Try it now.”
You run through a quick, perhaps even haphazard, battery of tests, and everything looks fine. You know Dave’s fixes work.
You close your laptop, and go back to bed.
The next morning at work, everyone’s busy fixing bugs. No one even mentions the problems that arose in the middle of the night. This is the status quo.
While it may seem depressing, many online businesses will find this scenario familiar. It is not an edge or extreme case. When teams create complex software and a lot of hastily released code, they end up with something called “technical debt.” In essence, technical debt is releasing something (and it can be anything - from software, to a car, to food in a restaurant) that can work most of the time, but is likely to be returned.
When teams release software with a lot of technical debt, a few things happen. First, the team becomes unhappy, knowing that they've released a questionable product and that bugs are going to come back. Second, if the bugs are numerous and severe (causing outages like those in the story) heroes show up to save the day.
Heroes tend to be armed with razor sharp, double-edged swords.
The Edge that Solves
On one edge, they quickly cut through to the heart of the problem: they fix the code, and get the site up and running. While that's wonderful and everyone is relieved for the moment, it’s just that - for the moment.
The Edge that Cuts
The second edge of that sword cuts into the team. While the hero gets the site up and running, no one knows what they did or why. Their code was never vetted, tested, or “refactored” (a fancy word coders use for “edited”). So in the end, the hero has created undocumented code that no one else can go back and work with. Every time the hero touches the system, they fix the symptom (the outage) and exacerbate the problem (technical debt).
That's not optimal, and everyone knows it. But frequently organizations suffer from so much technical debt that focus is placed on putting out fires at the expense of creating value. More fires equate to more undocumented code. More undocumented code makes it more difficult to fix the root causes of the problem. If the root cause remains, we continue spending time and money responding to it.
In Agile, we've declared hero culture to be a detriment because it creates and perpetuates technical debt and introduces inefficiencies. The Agile solution to team heroes is to cross-train – to make sure that many people on the team have the capability to fix the code. That solution is backed up with coding standards, unit tests, and other practices designed to avoid sloppy code and remove the need for heroes in the first place.
So, we have our solution. Simple! But wait - what about all the technical debt that remains? We still have heroes. We still have not found our coding nirvana.
Because coding is inherently messy, and it always will be. If it weren't messy, it would be automatically generated by software writing software. If it weren't messy, coders would make $35 an hour. If it weren't messy, we wouldn't need Agile to begin with.
So, technical debt is a fact of life. Heroes are a fact of life.
Should they be?
So let's think about this a minute.
We have a group who knows their product. We have individuals on the team who have hyper-specialized in response to the reality of the system. We have a system that is breaking often and requiring emergency repairs.
Many Agile coaches would put systems into place that removed the heroes from their stations and cross-trained others in the group. Is that the right solution? Maybe.
The other people on the team are overworked just keeping up with all the technical debt hitting them in the face. Catastrophic failures currently require immediate intervention by someone with specific knowledge. Every minute is worth millions of dollars. This is NOT the time for cross-training.
It's hero time.
So, in a traditional system, the hero's tasks would be lost in the shuffle. The code enters the system, patches the problem, and people go on about their work. That's not good either. Hero time cannot be blind-faith time.
But what if we're using a kanban on this team? For the non-heroes, we have a simple value stream of:
And that works fine, because those coders are doing things by the book. Their code has unit tests, it meets coding standards, they are pairing, have code reviews, and so on. The legitimately generated new code has ample safeguards for quality.
The heroes, however, are responding to a threat-in-progress. Their task is to code it and forget it; get the site up and running at all costs. To make matters worse, these heroes are brilliant problem solvers. This means that their code is not only undocumented, it's completely confusing to read. It works, but no one can tell how it works. They saved the day, but left this nugget of ingenious-yet-impenetrable code gristle for people to chew on later.
At the time though, and this is important, they successfully got the system up and running immediately. They routinely save the company from ruin. And that's a good thing.
So let's rationally solve this problem. Let them be heroes, but give them a different value stream that gets the fix in and then responsibly refactors and integrates that fix with the rest of the system.
The new hero-embracing value stream might look like this:
So we extend our hero metaphor a bit with some familiar characteristics.
- Above The Law: The hero is uniquely gifted with the ability to release production code without any testing. The goal for the hero, when invoked, is to fix the problem and get the system running ASAP, no questions asked.
- Jimmy Olsen is Needed: Immediately upon release, the hero teams with a normal human being to refactor their genius code into something that is maintainable, readable, and avoids future technical debt.
- Bat Caves & Arctic Fortresses: Heroes need to be introspective. “With great power comes great responsibility,” as the Spider-Man adage goes. After a situation where a hero action is invoked, the team or even part of the team should meet and discuss the issue, why it happened, and how to strike at its root cause and ensure it does not recur. This gives the crime fighter the ability to stop the next crime before it starts - really taking advantage of their gifts.
Number 1 in this list ensures that mission-critical problems are dealt with immediately and decisively. Number 2 ensures that the hack is not the final solution. Number 3 ensures that the team learns and quickly sets out to improve the system to avoid future calls to the hero.
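For teams that track their board in software, the three rules above could be sketched as a tiny state machine. This is only an illustration under assumptions: the stage names, the `HeroCard` class, and the rule that the refactoring partner must be someone other than the hero are all hypothetical, invented here to mirror the list, not taken from any particular kanban tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    # Hypothetical stage names mirroring the three hero rules above.
    EMERGENCY_FIX = "emergency fix released"    # Above The Law
    PAIRED_REFACTOR = "paired refactor"         # Jimmy Olsen is Needed
    ROOT_CAUSE_REVIEW = "root-cause review"     # Bat Caves & Arctic Fortresses
    DONE = "done"


# The order in which a hero card moves across the board.
HERO_STREAM = [
    Stage.EMERGENCY_FIX,
    Stage.PAIRED_REFACTOR,
    Stage.ROOT_CAUSE_REVIEW,
    Stage.DONE,
]


@dataclass
class HeroCard:
    title: str
    hero: str
    stage: Stage = Stage.EMERGENCY_FIX
    partner: Optional[str] = None
    history: List[Stage] = field(default_factory=list)

    def advance(self, partner: Optional[str] = None) -> Stage:
        """Move the card one stage to the right, enforcing the pairing rule."""
        if self.stage is Stage.DONE:
            raise ValueError("card is already done")
        next_stage = HERO_STREAM[HERO_STREAM.index(self.stage) + 1]
        # Assumed rule: the hero may not refactor their own fix alone.
        if next_stage is Stage.PAIRED_REFACTOR:
            if partner is None or partner == self.hero:
                raise ValueError("refactor needs a partner other than the hero")
            self.partner = partner
        self.history.append(self.stage)
        self.stage = next_stage
        return self.stage


if __name__ == "__main__":
    card = HeroCard(title="fulfillment outage", hero="Dave")
    card.advance(partner="Ryan")   # the fix is live; Ryan pairs on the refactor
    card.advance()                 # refactor finished; the team reviews the root cause
    card.advance()                 # review held; the card is done
    print(card.stage.value)
```

The point of the sketch is that the emergency fix is the first column, not the last: the card cannot reach "done" without passing through the paired refactor and the root-cause review.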
Ultimately, we'd like to build systems with no technical debt. Our processes are, of course, aimed at that lofty goal. But like it or not, we do have emergencies that necessitate rapid fixes. We have but two choices – boldly stand up and face reality, or continue to solve technical debt with more technical debt until the software implodes.
Which do you choose?
("Superhero Down-Time" Photo by Jason Stanley)