Generative AI for incident intelligence
A global financial-data company was drowning in operational noise: roughly 100,000 incident and alert signals every month, and no fast way to separate cause from symptom. We built a generative-AI workflow that cut mean time to detect and resolve in-scope incidents by about 30%, worth an estimated $20 million a year in platform stability.
For a company whose product is data that has to be right and always on, reliability is not a support function. It is the product. When something in the platform degrades, every minute spent noticing it and tracing it back to a cause is a minute of risk to customers who are relying on that data to make decisions.
The Firm generated an enormous amount of operational signal: roughly 100,000 incident and alert data points every month, spread across a large estate of systems. Engineers triaged much of it by hand. For in-scope incidents, mean time to detect and resolve averaged 45 to 60 minutes, and the harder problem wasn’t any single alert. It was the relationships between them. Which failures were related, what the real root cause was, and how far the blast radius reached were all questions that took experienced people time to answer under pressure.
We integrated a generative-AI workflow, built on an enterprise large-language-model engine, directly into the operations pipeline. It correlated those 100,000 monthly signals to surface patterns, point to a probable root cause, and estimate client impact and blast radius, so engineers started from a hypothesis instead of a blank page.
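To make "correlating signals" concrete, here is a minimal, deterministic sketch of the kind of grouping that can turn raw alerts into a starting hypothesis. It is an illustration only, not the workflow described above, which relied on a large-language-model engine. The service names, the dependency map, the `Alert` type, and the five-minute window are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    timestamp: float  # seconds since an arbitrary start


# Hypothetical dependency map: service -> upstream services it depends on.
DEPENDS_ON = {
    "api": {"pricing"},
    "pricing": {"feed"},
    "feed": set(),
    "billing": set(),
}


def group_by_window(alerts, window_seconds=300):
    """Group time-sorted alerts; a gap longer than the window starts a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def summarize(group, depends_on=DEPENDS_ON):
    """Return a probable root cause and blast radius for one group of alerts."""
    affected = {a.service for a in group}
    # A service is a probable root cause if none of its upstream
    # dependencies are alerting in the same group.
    roots = sorted(s for s in affected if not (depends_on.get(s, set()) & affected))
    return {"probable_root_causes": roots, "blast_radius": sorted(affected)}


alerts = [
    Alert("api", 60),
    Alert("feed", 0),
    Alert("pricing", 30),
    Alert("billing", 1000),
]
summaries = [summarize(g) for g in group_by_window(alerts)]
for summary in summaries:
    print(summary)
```

In this toy data the three alerts that arrive within a minute of each other collapse into one group that points at `feed`, while the later `billing` alert stands alone. A real pipeline would need far richer context than timestamps and a static dependency map, and the output stays a hypothesis for an engineer to confirm.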
The model was the easy part. Making it useful and safe was the work.
We set automation standards and responsible-model governance for AI-driven operations, taught engineers prompt engineering so they could get reliable results out of the system, and deployed it where it changed an outcome rather than switching it on everywhere at once. The goal was never to take engineers out of the loop. It was to put them further along in it.
What changed, in numbers
| Measure | Before | After |
|---|---|---|
| Mean time to detect and resolve (in-scope incidents) | 45–60 minutes | 32–42 minutes, roughly 30% lower |
| Monthly incident and alert signals correlated | Triaged largely by hand | ~100,000, correlated automatically |
| Estimated annual value | N/A | ~$20 million in platform stability |
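The "After" range in the table is consistent with applying the reported reduction of roughly 30% to the 45 to 60 minute baseline. The short check below reproduces that arithmetic; the function name is ours, and the 30% figure is the rounded value quoted above rather than a precise measurement.

```python
def reduced_range(low_minutes, high_minutes, reduction):
    """Apply a fractional reduction to both ends of a time range."""
    factor = 1 - reduction
    return low_minutes * factor, high_minutes * factor


# Baseline of 45 to 60 minutes with a 30% reduction.
after_low, after_high = reduced_range(45, 60, 0.30)
print(f"{after_low:.1f} to {after_high:.1f} minutes")
```

The result is about 31.5 to 42 minutes, which the table reports as 32 to 42.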
This worked because it was pointed at a specific, measured problem and supported by the same four principles behind every Natoma engagement. We educated the engineers who used it, governed how and where it was deployed, chose the technology to fit the work, and measured the result in the one number that mattered: mean time to detect and resolve. AI earned its place here by making good engineers faster, not by standing in for their judgment. That is the bar we hold it to, and it is the difference between AI that pays for itself and AI that just adds another dashboard.