#46 - Managing Incidents in Code and Beyond
Key concepts from production incidents applied to unstructured problems
Introduction
As a software engineer, your work doesn't end when your code reaches production. In fact, one could say it's just the beginning. Once your code is in production, it must be maintained, monitored, improved, and kept available. An "incident" is the term used in software engineering for a case where code in production is unavailable, doesn't behave as it should, or adversely impacts customers.
You're probably familiar with the term "bug" in software engineering. An incident is like a bug on steroids: it needs to be addressed now. Incident management refers to the entire lifecycle from the moment you identify an incident until it's resolved and, ideally, won't happen again. Incident management matters because a product that constantly has incidents will impact customers and, eventually, revenue. It's unrealistic to expect to never have incidents; it is reasonable to resolve them fast (tracked by a time-to-resolve metric) and to do your best to ensure the same incident does not happen again.
This post will explore the main concepts of incident management and how we can learn from them and apply them to real-world problems.
Identifying an Incident
The first challenge is knowing when an incident happens. Take, for example, a website that starts returning errors and doesn't load for customers. Do you find out when the first customer encounters it? When the 100th customer encounters it? Or when a customer calls the company's CEO and says their website doesn't work? That's the first challenge. To identify an incident correctly, your services and products must be production-ready, meaning you have logs, metrics, and observability tools in place that help you understand the system's status at any time.
The second step is to define what is normal and what is abnormal. For example, if you monitor a database's CPU usage over time, you learn what percentage is considered normal, say 70%, and anything well above that is potentially an incident. The third step is to define benchmarks and rules for what actually constitutes an incident. For example: if the CPU goes above 90% on Saturday and the checkout smoke test fails, it is an incident.
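To make this concrete, here is a minimal sketch of such an alerting rule expressed as code. The helper functions (get_db_cpu_usage, checkout_smoke_test_passed, open_incident) are hypothetical placeholders for whatever metrics store, smoke-test job, and paging tool you actually use; the thresholds simply mirror the example above.

```python
from datetime import datetime, timezone

CPU_INCIDENT_THRESHOLD = 90.0  # percent; in this example ~70% is considered normal


def get_db_cpu_usage() -> float:
    """Placeholder: in reality this would query your metrics store."""
    return 72.0


def checkout_smoke_test_passed() -> bool:
    """Placeholder: in reality this would run or read the smoke test result."""
    return True


def open_incident(summary: str) -> None:
    """Placeholder: in reality this would page on-call via your alerting tool."""
    print(f"INCIDENT: {summary}")


def evaluate_incident_rule() -> None:
    cpu = get_db_cpu_usage()
    is_saturday = datetime.now(timezone.utc).weekday() == 5
    smoke_ok = checkout_smoke_test_passed()

    # The rule from the example: CPU above 90% on Saturday combined with a
    # failing checkout smoke test is an incident, not just a warning.
    if cpu > CPU_INCIDENT_THRESHOLD and is_saturday and not smoke_ok:
        open_incident(f"DB CPU at {cpu:.0f}% and checkout smoke test failing")


evaluate_incident_rule()
```

In practice, a rule like this would usually live in your monitoring system as an alert definition rather than in application code, but the shape is the same: a measured signal, a threshold, and an action.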
It's important to know that incidents are not black and white; there is no one-size-fits-all. Incidents can have different priorities and affect different areas, so the approach to identifying and managing them should be flexible.
Triage and War Room
Once an incident is triggered, the owning engineer must triage it: What’s the priority? Who needs to be informed or involved? What’s the impact? Should the status page be updated, or should a meeting be opened? The triaging process is where these critical decisions are made.
The engineer typically starts a war room, physically or virtually, bringing in the necessary people. The rest of the company is made aware of the incident, and insights are shared: customer success may report impacted clients, support can highlight relevant tickets, and DevOps shares system performance insights. Relevant teams then join the effort.
The engineer is responsible for communicating the impact, current actions, next steps, and update timing. They also track completed and pending tasks. In short, triaging and the war room move the incident from acknowledgment to active resolution.
Managing the Incident
During the incident, it’s important to first focus on mitigation. For example, if the site doesn’t load, you might be able to get it to load in specific regions or ensure that some parts of the site are functional. After that, the goal is to fully resolve the incident and ensure no loose ends remain—all affected customers should no longer be impacted.
To achieve this, the incident owner must be empowered with access to all the necessary resources. If they need a specific engineer’s help, that engineer should be made available. If they require permissions they don’t usually have, those should be granted for the duration of the incident. It’s about ensuring they are fully equipped with the right tools to mitigate and resolve the issue.
Closing and Analyzing
An incident isn’t genuinely closed until a Root Cause Analysis (RCA) is in place. In other words, you can only consider the incident over once you understand what caused it and have identified the steps needed to prevent it from happening again. As the old saying goes, "Fool me once, shame on you; fool me twice, shame on me." Preventing the next incident is not a matter of chance or destiny: we must act on our findings and take the necessary steps to ensure it doesn’t happen again.
When closing an incident, it’s important to ensure everyone is in the loop and aware of the impact and the final resolution. If there’s an external status page, it should be updated accordingly. Finally, the RCA should include concrete action items that the team can take to prevent similar incidents in the future.
Applying It to Real-World Problems
I believe we can learn from the structured approach to incident management and apply some of its principles to the unstructured problems we face daily. Below, I review three issues that most organizations have likely encountered and explain how I apply practices from incident management to deal with them.
Deal with Employee Retention
The negotiation to retain an employee often starts when they’ve already decided to leave. Just as you monitor services, you should consider how to sense the 'heartbeat' of each employee and catch early signals of churn risk. Examples include skip-level 1-on-1s, internal NPS or feedback surveys, and recurring sessions with an HRBP. Building processes and tools to monitor these signals and be 'alerted' when something goes off track is crucial. For example, if a UX designer’s 1-on-1 with the VP felt disengaged, and they didn’t answer the NPS survey and missed the last company all-hands meeting, it might warrant a check-in and potentially 'trigger an incident.'
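Framed the same way as the monitoring rule earlier, here is an illustrative sketch of that trigger. The signal names and the two-signal threshold are hypothetical; the point is only to show how churn-risk signals can be tracked and acted on like any other alert.

```python
from dataclasses import dataclass


@dataclass
class EmployeeSignals:
    # Hypothetical churn-risk signals, mirroring the example above.
    one_on_one_engaged: bool       # did the skip-level 1-on-1 feel engaged?
    answered_nps_survey: bool      # did they answer the internal NPS survey?
    attended_last_all_hands: bool  # did they attend the last all-hands?


def check_in_triggered(signals: EmployeeSignals) -> bool:
    """Count the warning signs and 'trigger an incident' past a threshold."""
    warning_signs = [
        not signals.one_on_one_engaged,
        not signals.answered_nps_survey,
        not signals.attended_last_all_hands,
    ]
    # The threshold of two signals is arbitrary; what matters is that the
    # signals are watched continuously rather than noticed only at resignation.
    return sum(warning_signs) >= 2


# The UX designer from the example above would trigger a check-in.
designer = EmployeeSignals(one_on_one_engaged=False,
                           answered_nps_survey=False,
                           attended_last_all_hands=False)
print(check_in_triggered(designer))  # True
```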
Implement a New Tool or Process
Let’s assume the team has been working with Jira, and the company has decided to switch to Monday. This type of change resembles managing an incident: a clear outcome (complete the transition), a defined timeframe (as soon as possible), and an implementation that spans the organization. It’s important to have a clear owner, such as a Program Manager, to manage it. Next, they must plan and communicate milestones, just as engineers do when communicating a mitigation plan. If additional resources or permissions are needed during the process, like admin access to Jira or dedicated sessions with teams, the owner must be empowered to obtain them. Ownership and access to resources are crucial to success in these cases.
Quarterly Planning
Before planning the next quarter, we usually reflect on the previous one. However, this reflection often remains shallow. Inspired by root cause analysis and the 'five whys', we should dig deeper to understand why the past quarter went well (or didn’t). With a clear understanding and concrete action items, we can plan the next quarter and ensure we don’t repeat the same mistakes (and, hopefully, do repeat our successes). As with incidents, the quarter isn’t over until a deep analysis is conducted and insights are extracted to improve the next one.
Summary
I'm obsessed with productivity and efficiency, and a big part of both is decision-making. With incident management, each decision counts, and the faster, the better. While incidents clearly impact our day-to-day work, other problems and dilemmas often don't get the same structured treatment. I believe we can take inspiration from incident management, with practices like on-call schedules, alerting and monitoring, war rooms, and root cause analysis, to handle the unstructured problems in our day-to-day lives faster and more efficiently.