Intro
In my previous post, I explored how to break high-level metrics into actionable working agreements, using the DORA metrics framework as a reference. Now, I’d like to extend this concept to our day-to-day practices.
Imagine our team is working on a new feature to increase product adoption, and we’ve decided to measure success by tracking the average weekly sessions per user. What should we monitor once the feature is deployed and the rollout begins? Should we immediately focus on weekly sessions per user? Of course not.
Risk Metrics
In the example above, as in most deployments, the primary focus at first should be on risk metrics. These typically include error logs, CPU utilization, overall system health, user complaints, and bugs. Within the first few hours, you’re far more likely to detect whether the new feature introduces a regression than to see any noticeable impact on the defined KPI. This aligns with the broader concept of monitoring: ensuring your product remains functional and efficient and keeps its KPIs within their normal ranges.
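To make this concrete, below is a minimal sketch of what a first-hours risk check could look like. The metric names, baselines, and allowed increases are illustrative assumptions, not values or APIs from any specific monitoring tool; in practice the current readings would come from your logging and infrastructure dashboards.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RiskMetric:
    name: str
    baseline: float       # typical value before the rollout
    max_increase: float   # relative increase that should trigger a rollback review

def check_risk_metrics(metrics: list[RiskMetric],
                       get_current_value: Callable[[str], float]) -> list[str]:
    """Return the names of risk metrics that regressed beyond their allowed increase."""
    regressions = []
    for m in metrics:
        current = get_current_value(m.name)
        if current > m.baseline * (1 + m.max_increase):
            regressions.append(m.name)
    return regressions

# Illustrative baselines and readings for the first hour after the rollout.
metrics = [
    RiskMetric("error_logs_per_minute", baseline=12.0, max_increase=0.50),
    RiskMetric("cpu_utilization_pct", baseline=55.0, max_increase=0.20),
    RiskMetric("open_user_complaints", baseline=3.0, max_increase=1.00),
]
current = {"error_logs_per_minute": 31.0, "cpu_utilization_pct": 58.0, "open_user_complaints": 4.0}

print(check_risk_metrics(metrics, current.get))  # ['error_logs_per_minute']
```

The useful part is agreeing on the baselines and allowed increases before the rollout, so nobody has to debate thresholds while a regression is unfolding.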
Tips for defining risk metrics
They should be easily measurable.
Regressions should be identifiable within minutes.
They should impact critical product flows.
For instance, the number of login attempts is a strong risk metric: it’s easy to measure, a regression shows up within minutes, and it sits on a critical product flow. However, account for factors like seasonality (e.g., weekends), which can otherwise lead to false positives.
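As a rough sketch of how to account for that seasonality, the check below compares today’s login attempts with the same weekday a week earlier rather than with yesterday, so a normal weekend dip doesn’t get flagged. The counts and the drop threshold are made-up numbers for illustration.

```python
def login_attempts_regressed(today: int, same_weekday_last_week: int,
                             max_drop: float = 0.30) -> bool:
    """Flag a regression when login attempts drop sharply versus the same weekday last week.

    Comparing Saturday with the previous Saturday, rather than with Friday,
    keeps the normal weekend dip from looking like a regression.
    """
    if same_weekday_last_week == 0:
        return False  # no baseline to compare against
    drop = (same_weekday_last_week - today) / same_weekday_last_week
    return drop > max_drop

# Illustrative numbers: a quiet Saturday compared against the previous Saturday.
print(login_attempts_regressed(today=4_200, same_weekday_last_week=4_500))  # False
print(login_attempts_regressed(today=2_800, same_weekday_last_week=4_500))  # True
```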
Leading and Lagging Metrics
Metrics can signal that something is about to happen (leading) or reflect something that has already occurred (lagging). It’s crucial to select the right metrics based on your goals.
At one point, I worked with a product team that spent 70% of its allocation on maintenance: addressing bugs, technical debt, and stabilizing production. That figure was a classic lagging metric, describing a situation that had already escalated. To get ahead of it, I began tracking leading indicators such as the throughput rate of maintenance tickets and the number of production alerts.
For example, there were weeks with three alerts a day, ten new tickets opened, and only four closed. As we tracked these metrics regularly, patterns began to emerge, and acting on them eventually helped us reduce our overall maintenance allocation from 70% to 40%.
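To put those numbers together: with ten tickets opened and four closed, that week’s throughput rate (tickets closed relative to tickets opened) was 4 / 10 = 40%, meaning the backlog grew by six tickets on top of the twenty-odd alerts competing for the same attention.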
Tips for defining leading metrics
Use methods like the KPI tree to drill down to actionable, lower-level metrics.
Apply a root cause mindset by asking why. For example, “Why do we spend so much time on maintenance? Because we have too many tickets. Why do we have so many tickets? Because new ones keep opening before we can close the old ones.”
Define actions based on results. For example, if the throughput rate hits a 50% threshold, we will halt new work and escalate to the group manager (a rough sketch of such a rule follows this list). If no actionable step can be tied to a metric, it’s probably not a good leading indicator.
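Here is a minimal sketch of what tying an action to that metric could look like, assuming the throughput rate is defined as tickets closed divided by tickets opened. The 50% threshold comes from the example above; the function names and the escalation message are illustrative placeholders rather than a real integration.

```python
def weekly_throughput_rate(tickets_closed: int, tickets_opened: int) -> float:
    """Throughput rate for the week: tickets closed divided by tickets opened."""
    if tickets_opened == 0:
        return 1.0  # nothing new came in, so we are keeping up by definition
    return tickets_closed / tickets_opened

def review_maintenance_load(tickets_closed: int, tickets_opened: int,
                            threshold: float = 0.50) -> str:
    """Return the agreed action once the throughput rate crosses the threshold."""
    rate = weekly_throughput_rate(tickets_closed, tickets_opened)
    if rate < threshold:
        # The working agreement from the example: stop pulling new work and escalate.
        return f"Throughput at {rate:.0%}: halt new work and escalate to the group manager."
    return f"Throughput at {rate:.0%}: continue as planned."

# The week described earlier: ten tickets opened, four closed.
print(review_maintenance_load(tickets_closed=4, tickets_opened=10))
# Throughput at 40%: halt new work and escalate to the group manager.
```

The exact rule matters less than the fact that the threshold and the reaction were agreed on in advance.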
Conclusion
Defining metrics is only the first step. The real challenge lies in tracking, monitoring, and continually improving them. It is crucial to invest more time in working with risk metrics and in identifying the right leading indicators. Focusing on these can turn minor improvements into meaningful, long-term changes that improve overall performance.