Giovanna knew stressful days were coming. The night before, her Product Team had made a production deployment of a new release of their flagship web application; if the past was any indication, for at least the next two weeks Giovanna's team was going to receive a big wave of user support requests.
She wasn't too worried about functionality defects; in fact, the Product Team did a good job of testing the application's functionality before release. Defects represented some work for her team (especially following up with users once a fix was done), but it was relatively straightforward. What worried her were the data fixes and the manual configuration requests and adjustments.
It was common for new releases to interact badly with their very complex production database, very often resulting in data corruption that had to be repaired by hand. These "data fixes" constituted the biggest portion of the wave of work observed during the first week after a new release, and they were usually high priority and difficult to implement.
But in addition to data fixes, she knew to expect an initial high volume of manual configuration requests. The company had decided, long ago, to implement only very limited configuration functionality in the application's user interface. At the time, the decision made sense: the user base was relatively small and the configuration parameters relatively simple, so it was thought that users could adjust configuration files themselves. Over time, however, the application's configurability grew, moved into the database and became more complex, so eventually users were instructed to submit a configuration request whenever they needed changes. With the user base also expanding, that meant lots of configuration changes right after a new release.
As had happened with every other production release, all this support work would arrive while the team was already working on features promised for the next release. This would create conflicts in priorities, and the need to put other work "on hold" to attend to this (usually disproportionately large) wave of urgent and disruptive user requests.
Work We'd Rather Do Less Of
We have a name for this kind of demand Giovanna is anticipating: failure demand.
The name needs to be understood in opposition to "value demand": it refers to those cases where some customer demand (as in Giovanna's story, a user request to fix some data corruption or to change a configuration parameter) is driven by our "failure" to provide full value with our originally delivered work. In essence, when we deliver value, some of it is "left on the table", and customers come back to us to request the part that is missing.
The names "value" and "failure" demand are, perhaps, historical relics and, if we take the words too literally, they can lead to unproductive conversations (like never-ending debates about what constitutes "value" and whose "failure" it was). That said, I think there's merit in having a label to apply to work that we're requested to do whose origin can be traced back to some "deficiency" or "shortcoming" in a work item we delivered earlier (whatever the reason for that deficiency/shortcoming).
So, if "value demand" is what motivates us to do the work that is part of our "mission" (and arguably we'd like to do more of it), "failure demand" constitutes requests for work that 1) originate in the delivery of the first kind of demand, 2) are remedial in nature, and 3) we'd like to do less of.
It's important to point out that failure/value demand are not usually used as Work Item Types, but as a way to classify them. Each team/service would benefit from agreeing on which of their work item types fall into which category, and managing accordingly.
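One lightweight way to record such an agreement is a simple lookup from work item type to demand category. This is only a sketch; the type names below are hypothetical, loosely borrowed from Giovanna's story, and each team would maintain its own list:

```python
# Hypothetical mapping of a team's work item types to demand categories.
# The type names are illustrative; each team agrees on its own list.
DEMAND_CATEGORY = {
    "feature": "value",
    "defect fix": "failure",
    "data fix": "failure",
    "configuration request": "failure",
}

def classify(work_item_type: str) -> str:
    """Return 'value' or 'failure' for a known work item type."""
    return DEMAND_CATEGORY[work_item_type]
```

With a mapping like this in a tracking tool, every delivered item can be tallied into one of the two categories automatically.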
Different Varieties of "Failure Demand"
Defect fixes: this is probably the most commonly acknowledged form of failure demand. We deliver some work item and later we (or worse, the Customer) notice a "problem" with it which requires some work to fix. Whether this "problem" originates in missing or misunderstood requirements, or in an implementation deficiency, will be a matter of (usually "passionate") debate; but regardless of how the argument is settled, the result is still a demand that is derivative from some earlier (value) demand, and it can be very disruptive.
Production issues: data fixes in Giovanna's story are an example of this variety. Work is delivered and might be in itself fine, but it interacts badly with the production environment creating some other problem, often of high criticality. This is usually the most disruptive form of failure demand, because it often requires a "drop everything and focus on this" approach (expedite class of service).
Manual work: sometimes we deliver work that, to be fully useful to the customer, requires us to do some additional work for them on a case-by-case basis, often performed manually. In Giovanna's story, the application included reporting capabilities, but there was no UI to create new reports, so users requiring new reports (or slight changes to existing ones) had to submit a support request to the team, and someone would manually change configuration parameters to make the report available.
Workarounds/"Training issues": these are similar to Manual Work, in the sense that they require a support request that ends up being resolved manually on behalf of the customer; but in this case, rather than involving new work, the resolution is about exploiting some other characteristic of work delivered earlier to accomplish what the customer needs. As an example, I once worked with a system that offered an English/French user interface depending on the settings of the current user. However, because of the way the translation mechanism had been designed, updates to some translations were propagated only as part of a nightly batch process intended for data consolidation, and they wouldn't immediately appear on the screens where users were expecting them. Many users were unaware of this technicality, resulting in a support request each time someone needed to make a translation change. The workaround was, of course, to trigger the batch process manually, but most users didn't know about it, resulting in a support call. Because these workarounds can often be executed directly by customers (as was the case in my example), work of this kind is commonly regarded as a "user training issue": once enough users are "sufficiently trained", this kind of failure demand can largely diminish, but until then it can still be very disruptive, and the "training effort" can be substantial.
Deliberate Technical Debt: Martin Fowler proposes a model for classifying technical debt into several categories. One of them is the technical debt that is acquired "on purpose", normally as a result of making some form of trade-off: for example, we know some code we've written is messy and requires refactoring to clean up the design and eliminate duplication, but doing so will take time and we have to release something now to take advantage of some narrow window of opportunity. So we decide to "take on some debt" in exchange for some reward. This decision can be prudent or imprudent (two additional dimensions in Fowler's model), but regardless of that, it means we've just created future work for ourselves, which fits my earlier definition of "failure demand". In more general terms, I'd bundle here any work that originates internally in the service and is the result of taking some "shortcut" in the delivery of other work; the demand in this case doesn't originate from the customer, but internally.
Failure Demand and its Discontents
Failure demand can be very disruptive, especially if it arrives in big waves or if it has a very high cost of delay. As we shift focus to remediation of past shortcomings, we have to put aside other work that arguably provides more value to our customers; that work then gets delayed, or its quality is affected by the increase in WIP and the need to context-switch (which in turn creates more future failure demand).
Perhaps we manage not to affect delivery timelines for current work by allocating some capacity to failure demand, and minimizing task switching. That might stabilize lead time for work items in process, but it also means that other work will have to wait in a queue to enter the workflow later, accumulating delay there.
Either way, failure demand will add to the service's management overhead. In addition to making decisions about how to treat the work, allocate and manage capacity, etc., this is the kind of demand that usually causes additional "political noise", because it erodes the trust of customers and other stakeholders: customers will be upset, higher management will want explanations and remediation plans, all of it will have to be discussed in meetings, special plans will have to be made, task forces will be assembled, fingers will be pointed at various people, and so on. And because individual occurrences are very often treated as "one-off" cases rather than as a systemic recurrence, the whole situation will keep happening cycle after cycle.
What to Do About It
Gaining situational awareness is, as usual, the first step: understand the nature of your demand, and use that to figure out how to proceed.
My first question would be: what's the proportion of failure demand with respect to the other work you do? To answer that question, a prerequisite is to have an understanding of the various work item types a service delivers, and which of those are to be considered failure demand.
Let's say that Giovanna builds the following table as an answer to that question, showing the number of items (throughput) for each work item type delivered in each of the last 3 releases, and classifies each one as either failure or value demand:
We can then build these charts to get an idea of the proportions:
A common objection here is that we're dealing with items of perhaps very different size: how do, for example, 15 features and 25 data fixes delivered in Release 1 compare in terms of the effort they required? Perhaps the features took months to complete, while the data fixes were more quickly resolved in the couple of weeks that followed the release. Those are fair observations, and it is true that these charts don't tell the whole story (or even the most important part of the story).
However, the charts do show that, at least in terms of raw throughput, the balance in this service seems to be tilted towards failure demand rather than value demand, and that could in itself be a source of dissatisfaction because of the perception it can create of a low-quality product. At the very least, they tell us that more investigation could be useful.
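The raw-throughput proportion is straightforward to compute once items are tallied by type. As a sketch, the 15 features and 25 data fixes come from the Release 1 example above, while the configuration request count is hypothetical:

```python
# Throughput per work item type for one release, tagged with its demand
# category. The 15 features and 25 data fixes are from the example above;
# the configuration request count is hypothetical.
release_1 = {
    "feature": (15, "value"),
    "data fix": (25, "failure"),
    "configuration request": (10, "failure"),
}

def failure_ratio(throughput):
    """Fraction of delivered items that were failure demand."""
    total = sum(count for count, _ in throughput.values())
    failed = sum(count for count, cat in throughput.values() if cat == "failure")
    return failed / total

print(f"failure demand: {failure_ratio(release_1):.0%}")  # 35 of 50 items -> 70%
```

As the article notes, this counts items rather than effort, so it's only a first-pass signal, not the whole story.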
Giovanna might then remember that it's always advisable to visualize data as it evolves over time, to understand trends. So, let's say she comes up with another graph:
Here we can see more clearly that, irrespective of the absolute values for each kind of demand, value demand (the orange line) seems to oscillate around a middle value, whereas failure demand (the blue line) seems to show an upward trend. Yes, it's a very small sample, so perhaps we shouldn't read too much into it, but assuming the trend continues (or at the very least matches our intuition), this could be enough evidence to justify other actions, or point to more research.
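A quick sanity check on "is this trending up?" is a least-squares slope over the release sequence. A minimal sketch, using hypothetical per-release failure-demand counts consistent with the upward trend described above:

```python
# Hypothetical failure-demand throughput for three consecutive releases.
failure_counts = [28, 33, 41]

def slope(ys):
    """Least-squares slope of ys against their 0-based index."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

print(slope(failure_counts))  # positive -> failure demand growing per release
```

A positive slope over such a small sample is a prompt for investigation, not proof; with only three points, one unusual release can dominate the result.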
It could be important, for example, to understand things like severity, complexity of resolution, delivery lead time, concurrent WIP, etc. and with all that get an idea of how much of a problem our current level of failure demand represents, so that we can decide an appropriate response.
An initial countermeasure could be to decide to explicitly allocate some capacity to failure demand, based on the demand analysis performed above. One of the most common traps seen in delivery teams is to assume they have all their capacity available for value demand when making plans for the work ahead. The result of that usually means making delivery promises that are later difficult to fulfill because failure demand reduces the available capacity and delays value work. Understanding the pattern of your failure demand may help you make more credible plans.
In Giovanna's story, she knew that the most common situation right after a release was a "wave" of data fix requests. She could decide, for example, to dedicate some team members to those fixes, so they could be immediately available to focus on them, while containing their impact and isolating other team members from it. Perhaps she could even bring in additional, temporary team members to help with data fixes for the few weeks immediately following a release. Or maybe she could establish a total WIP limit for the service, and explicitly allocate a percentage of it to data fixes.
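The WIP-allocation option can be sketched in a few lines. Both the total limit and the percentage here are hypothetical choices; in practice they would come from the demand analysis above:

```python
# Sketch: split a total WIP limit between value and failure demand.
# Both numbers are hypothetical policy choices, not recommendations.
TOTAL_WIP_LIMIT = 20
FAILURE_ALLOCATION = 0.30  # share reserved for data fixes after a release

failure_wip = round(TOTAL_WIP_LIMIT * FAILURE_ALLOCATION)  # slots for data fixes
value_wip = TOTAL_WIP_LIMIT - failure_wip                  # slots for value work

print(f"value: {value_wip}, failure: {failure_wip}")
```

The point of making the split explicit is that value-work commitments are then made against the smaller, realistic capacity rather than the nominal total.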
An important ingredient for making decisions like those is understanding the cost of delay profile for your failure demand, and then using that to decide the appropriate class of service to give to each request.
But in the long run, of course, the real solution to getting failure demand under control is to find and remove the root causes that generated it in the first place. This may require things like improving design and testing activities, more automation, earlier feedback from users and customers, more flexibility and liquidity of skills, implementation of self-serve options for your users, etc.