Handling Production / Critical Issues in your team / project

Aug 1, 2022

Recently, I came across a few critical issues in the service our team is responsible for, which potentially impacted every team working on the project. Thankfully, the issues were in a non-Production environment.

That experience led me to write this blog and share what I learned from handling them. These learnings could come in handy when you deal with similar issues in your own application, whether in Prod or Non-Prod.

To illustrate, let's take an example.

GroceryApp is an Android app that end users use to buy all types of groceries. One of its basic functions allows users to add items to the cart. Adding to cart is mandatory before any later step, such as pricing, payment and final order generation, can take place.

Consider the impact if Add to Cart is broken: no new order could be created at all.

This is different from a case where one particular product cannot be added to the cart (which is not to say that case is unimportant). The urgency of fixing a broken Add to Cart should be higher than that of fixing a single product that cannot be added.
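The distinction between the two cases can be written down as a small triage rule. The sketch below is only an illustration: the `Severity` levels, the `Issue` fields and the `triage` function are hypothetical names, not part of GroceryApp or of any incident tool.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = 1  # a mandatory flow is broken for everyone
    HIGH = 2      # a mandatory flow is broken for some users or items
    NORMAL = 3    # everything else


@dataclass
class Issue:
    summary: str
    blocks_mandatory_flow: bool  # e.g. Add to Cart, which gates pricing and payment
    affects_all_users: bool


def triage(issue: Issue) -> Severity:
    """Rank an issue by how much of a mandatory flow it blocks."""
    if issue.blocks_mandatory_flow and issue.affects_all_users:
        return Severity.CRITICAL
    if issue.blocks_mandatory_flow:
        return Severity.HIGH
    return Severity.NORMAL


cart_down = Issue("Add to Cart fails for every product", True, True)
one_product = Issue("One product cannot be added to cart", True, False)
```

Here `triage(cart_down)` ranks above `triage(one_product)`, which matches the reasoning above: both matter, but the first one stops every new order.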

Responding to the issue

While any issue could be important, I am focusing on the ones that stop an end user from completing a basic flow in the application. However, the information here applies to other scenarios too.

Below, I have documented what an issue response could look like.

1. Acknowledgement of the issue

Acknowledging the issue is paramount to starting the incident response well.
You may have diagnosed the issue yourself, received an alert, or heard about it from another stakeholder. In every case, the first important step is to state that you are aware of the issue and are going to start looking into it.

It is perfectly OK to hand over to someone else after the acknowledgement.

For self-diagnosed issues, it is OK if only a few details are available at first. What matters most is to update people on the issue, so that they are aware of the situation and can prepare to handle it on their end if need be.
Choose the appropriate forum to either raise or respond to the issue.
Timely acknowledgement gives the relevant stakeholders confidence that the issue is being addressed.

For Production issues, there may be an SLA with the client, so it is important to acknowledge the issue within the agreed time frame. If you are not aware of such an SLA, check with your tech lead or project leadership.
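If an acknowledgement SLA exists, it helps to know the deadline the moment an issue is reported. Below is a minimal sketch of that calculation. The `ACK_WINDOW` values are made-up placeholders, and the function names are hypothetical; the real time frames come from the agreement with your client.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical acknowledgement windows; real values come from the client SLA.
ACK_WINDOW = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
}


def ack_deadline(reported_at: datetime, severity: str) -> datetime:
    """Latest time by which the issue must be acknowledged."""
    return reported_at + ACK_WINDOW[severity]


def ack_is_late(reported_at: datetime, acknowledged_at: datetime, severity: str) -> bool:
    """True if the acknowledgement came after the SLA window closed."""
    return acknowledged_at > ack_deadline(reported_at, severity)


reported = datetime(2022, 8, 1, 10, 0, tzinfo=timezone.utc)
deadline = ack_deadline(reported, "critical")
```

With these placeholder values, a critical issue reported at 10:00 UTC has to be acknowledged by 10:15 UTC.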

2. Share Initial analysis and potential impact

Now that the issue has been acknowledged, continue your analysis. As soon as you understand the issue, share that information on the channel you used in Step 1. It helps if you do this within around 10–15 minutes of the acknowledgement.

You shouldn't make your stakeholders wait for an initial brief. It's OK to start off with very few details.

Working on a critical issue can be strenuous, so preferably work in a pair. That lets analysis and communication run in parallel.
When debugging an issue, always be mindful of the following:

(i) First and foremost, the goal is to get the system back to the state it was in before the issue occurred.

(a) Look at any recent code changes as a possible cause of the issue.
(b) Look for any configuration or toggle that would switch off the problem area in the application.
(c) A workaround may also be considered, but validate it with your internal and external stakeholders, depending on the workaround you are suggesting.

(ii) Remind yourself at this point that no one is looking for the perfect fix. What is needed is a fix that allows the business to run as usual.

(iii) When proposing a solution, be confident about it, because it reflects on the decision makers if they have to decide based on your proposal.
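To make point (i)(b) concrete, here is a minimal sketch of switching off a problem area with a toggle so that the basic flow works again. Everything in it is an assumption made for illustration: `FeatureToggles` is an in-memory stand-in for whatever configuration or toggle mechanism your project uses, and `related_items` and `suggest_related` are hypothetical names for a newly released feature suspected of breaking Add to Cart.

```python
class FeatureToggles:
    """In-memory stand-in for a real configuration or toggle service."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown toggles default to off, the safer choice during an incident.
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False


def suggest_related(item):
    # Stand-in for the newly released code suspected of breaking the flow.
    raise RuntimeError("recommendation service unavailable")


def add_to_cart(cart, item, toggles):
    cart.append(item)
    if toggles.is_enabled("related_items"):
        return suggest_related(item)
    return []  # toggle off: the core flow works, just without suggestions


toggles = FeatureToggles({"related_items": True})
toggles.disable("related_items")  # the mitigation: switch the problem area off
cart = []
add_to_cart(cart, "milk", toggles)
```

This is not the perfect fix, since the suggestions stay unavailable, but users can add items to the cart again while the root cause is investigated.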

3. Keep Communication going

As I have pointed out above, keep communication going with your stakeholders.
As the incident drags on, the pressure to fix the problem can increase, and people waiting for that "fix" may grow more anxious. So keep the channel aware of where you currently stand on the issue.
It is also good to jump on a common call to discuss, align and work on the issue.
If you need any additional support, such as access or third-party personnel, feel free to call it out. You don't need to worry about arranging it yourself, as there will invariably be people available to sort out those logistics.
Once the issue is resolved, state that clearly to the relevant stakeholders. If the solution comes with any limitations, call those out as well.
In some situations, the issue cannot be fixed on the same day. Make sure to end the conversation for the day with a clear status and next steps.

For critical Prod issues, this is usually not an option, as the issue has to be fixed.

4. Inform when you have handed over or are no longer available

While working on the issue, you may stop being available because your shift is ending or someone else is taking over. Communicate that clearly so that people know whom to reach out to.

5. Root Cause Analysis

Last but equally important is the root cause analysis, done once the situation has been resolved, to answer questions such as (to name just a few):

Why did the issue happen?
Is any further action required for the issue / incident?
How could it be prevented in future?
How can we be better prepared for it in future?
Was the incident response appropriate?
Are there any learnings that need to be shared with a wider audience?

Some clients may already have predefined steps for responding to an incident. The pointers above give you a general approach to start with.

Image Reference(s):

https://agilepainrelief.com/blog/scrum-production-support.html