Incidents, post-mortems and teamwork

My initial intention with this article was to talk about how to manage incidents when they happen, how to identify the source of problems, and how to make sure they don't happen again.

As I have done on other occasions, I had thought about creating a story that reflected all this in a narrative way to understand this type of situation first-hand. However, thinking about it, I realized that the best way to express it was to tell you a true story of an incident, how the team reacted, and how they took action to fix it.

So this time I bring you not a story of mine but of a team with whom I have worked closely at Mercadona Tech. It's a story about incidents, actions taken, and lessons learned.

The day of the incident

That day the team was visiting one of the hives, the warehouses where we prepare orders at Mercadona Tech. They had been testing some new features related to truck reception, and by noon they had finished all the work. Everything had worked as expected, so they went down to the parking lot to get the cars and go back to the office.

That's when the on-call mobile phone rang.

One, two, three stores were calling support because they were getting an error when they scanned a container while unloading it from the truck... Without validating it, they couldn't go ahead and start replenishing the shelves.

The team left the cars and ran upstairs to see what was going on and fix it. Had they touched something they shouldn't? Was the failure due to an application change or an infrastructure failure?

They went to the kitchen and took out their laptops. The backend developers started looking at the latest changes they had deployed to the project, and the Android developers started digging through the logs to see what could be happening.

It didn't take long for them to find the problem: an endpoint that was supposed to be no longer in use and that they had deleted that very morning. Why was something that no one was using causing problems?

The Android developers quickly found the answer: the endpoint was still called by old versions of the app that some stores were still running. The app had stopped using it a few versions ago, but some stores were on a version from a few weeks back and hadn't updated yet.

Hadn't they already talked about this? Wasn't the update supposed to have been forced a long time ago? It didn't matter. They would analyze it later; the important thing now was to unblock the stores.

What solution did they have in hand? Force the app to the latest version. Consequences? A forced update to stores that did not have the problem.

They assessed the risk. The downside of forcing the version was that every store not on the latest version would have to download it before continuing with its processes, but it was a quick way to solve the problem without blocking the rest of the stores. In addition, many of the stores were already on the latest version, so the final impact was not too great.

Another option would be to roll back the backend version to recover that endpoint so the stores could continue, but they weren't entirely clear on what impact a rollback could have at that point.

Finally, after evaluating the risks, they decided to force the version of the app. Unblocking the stores was the most important thing at that time.

That fixed the problem, and after updating, the stores were able to continue.

The post-mortem

The next day the team got together to analyze what had happened. They started the post-mortem as usual: they described what had happened in order and how they had solved it. They also commented on the impact: a forced upgrade to some stores that had not suffered from the problem.

Then they analyzed the source of the problem: they had deleted an endpoint that was not being used, or so they thought.

Had it been a communication problem? Not really. The backend and Android developers had communicated as they should: they had talked about the change and checked that the endpoint no longer existed in the implementation.

So what had been the mistake? A simple human error. They hadn't taken into account that older versions might still be using the endpoint and hadn't looked at the logs to check. Something that could happen to anyone.

Did anyone blame anyone? Obviously not. They worked as a team. The important thing was not to point at whoever had made the mistake but to find a way to keep it from happening again.
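The missing check can be sketched as follows. This is only an illustration and not the team's tooling: the log format (timestamp, method, path, app version), the endpoint path, and the version numbers are all hypothetical. The idea is to count recent requests to an endpoint per app version before deleting it.

```python
from collections import Counter


def callers_by_app_version(log_lines, method, path):
    """Count requests to one endpoint, grouped by the app version that made them.

    Assumes a hypothetical log format with four space-separated fields:
    timestamp, HTTP method, path, app version.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        _timestamp, line_method, line_path, app_version = fields
        if line_method == method and line_path == path:
            counts[app_version] += 1
    return counts


# Hypothetical sample: an old app version still calls the endpoint.
sample_log = [
    "2024-01-10T09:00:01Z POST /containers/scan 1.4.0",
    "2024-01-10T09:00:05Z POST /v2/containers/scan 1.5.0",
    "2024-01-10T09:01:12Z POST /containers/scan 1.4.0",
]

usage = callers_by_app_version(sample_log, "POST", "/containers/scan")
safe_to_delete = not usage  # only safe when nobody has called it recently

for version, hits in usage.items():
    print(f"app {version} still calls the endpoint ({hits} requests)")
```

An empty result over a long enough window suggests the endpoint really is unused; anything else names the versions that would break.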

After a while, they defined the actions to be taken to avoid this type of problem in the future. They picked up a conversation that had come up a couple of times in the past: implementing a contract testing system that would check the compatibility of contracts between the Android app and the web services, so that nothing incompatible with the app versions in use would reach production.

In a short time they had a working prototype that verified the changes they made to the contracts against the versions running in production.

All this happened about 8 months ago. Since then they have not had any similar incidents.
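To make the idea concrete, here is a minimal sketch of that kind of check. It is not the team's implementation: the `Endpoint` type, the `find_broken_clients` function, the paths, and the version numbers are hypothetical. It compares the endpoints the backend would serve after a change with the endpoints each app version still in production expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str


def find_broken_clients(served, consumed_by_version):
    """Return {app_version: missing endpoints} for every version the backend would break."""
    broken = {}
    for version, needed in consumed_by_version.items():
        missing = needed - served  # endpoints the app calls but the backend no longer serves
        if missing:
            broken[version] = sorted(missing, key=lambda e: (e.path, e.method))
    return broken


# Hypothetical contracts: the old app version still relies on the old endpoint.
SCAN_OLD = Endpoint("POST", "/containers/scan")
SCAN_NEW = Endpoint("POST", "/v2/containers/scan")

served_after_change = {SCAN_NEW}  # the backend after deleting the old endpoint
consumed_by_version = {
    "1.4.0": {SCAN_OLD},  # still installed in some stores
    "1.5.0": {SCAN_NEW},
}

broken = find_broken_clients(served_after_change, consumed_by_version)
for version, missing in broken.items():
    for endpoint in missing:
        print(f"app {version} needs {endpoint.method} {endpoint.path}")
```

Run before a deployment, a non-empty result would be a reason to stop the release. A real contract test would also compare request and response shapes, not just the presence of endpoints.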

Work as a team so that it doesn't happen again

As we have seen, one of the most important things in solving a problem is working as a team. Relying on your colleagues and working together is the best way to move forward. In addition, when facing a crisis or resolving an incident, the important things are:

Unblock, then analyze: as we saw in the example, the team focused first on unblocking the situation and then on analyzing why they had reached that point. When faced with a crisis, the important thing is to solve it first. Investigating what caused it will come later, when you're not working against the clock.
Balance the risk: before acting, you must assess the risk of each option. The team assessed the impact of forcing the release and decided to move forward because the potential impact was acceptable. Sometimes the risk may be too great and you may have to consider other options.
Blameless culture: this is the most important one. No one is to be blamed. We work as a team to solve the problem and deliver value to our users. We don't compete with each other; we have to reward continuous improvement and learning over individual faults.
Take actions that prevent it from happening again: every post-mortem has one goal, which is to define the actions to be taken so that the problem does not happen again. These actions also have to be preventive rather than reactive: instead of a patch that solves the current problem, think of actions that anticipate potential problems. Implementing contract testing was one of these actions. Validating contracts before deploying changes to production guards not only against that problem but also against similar incompatibilities introduced in the future.

As we have seen, teamwork and decision-making to mitigate and avoid potential problems are essential when faced with incidents. The team worked in an exemplary way to solve the incident, and the problem did not get worse. In addition, they took the necessary actions so that it would not happen again, and since then nothing like this has happened.

I want to thank Alejandro Capdevila and Luis Cencillo for telling me the details of that day, and the entire V6 team for the good work they did and continue to do.

If you want to read more about how they implemented contract testing to solve the problem, here is an article by Edgar Miró about how they did it: Contract Tests: A New Hope

Teamwork, the most important thing to solve a problem.

TL;DR

Rapid identification of the problem and quick decision-making to unblock the situation demonstrate the importance of immediate action in critical situations. The post-mortem carried out the day after the incident reflects a blameless culture, where the emphasis is on learning and continuous improvement rather than on pointing fingers at individuals.

Implementing preventative measures, such as contract testing, is an effective strategy to prevent similar problems from happening again. The success of these actions is shown in the absence of similar incidents in the last 8 months, underlining the effectiveness of the decisions made by the team.

Effective resolution of an incident is important, but it is also essential to cultivate an environment of collaboration, informed decision-making, and preventative measures. The story of this team at Mercadona Tech serves as a reminder that, in the face of challenges, team strength and the ability to learn and evolve together are critical to long-term success.
