Automated anomaly detection has become invaluable in the observability tools I rely on for production systems. More advanced vendors like Datadog have introduced “AI” functionality, which is really just conventional machine-learning models for proactively identifying suspicious behavior.
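For a sense of what that usually boils down to, here is a minimal sketch of the kind of conventional model behind such features: a rolling baseline plus a standard-deviation threshold. This is purely illustrative, not Datadog's (or any vendor's) actual implementation, and the window and threshold values are arbitrary.

```python
from statistics import mean, stdev

def anomalies(samples: list[float], window: int = 60, threshold: float = 3.0) -> list[int]:
    """Flag indices whose value sits more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```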

One of the systems I am responsible for is showing a dramatic spike in read latency. I am running a large-ish data ingestion process, so some difference in load is not unexpected. Looking at the graphs, however, it appears that something serious is happening: a 4x spike in latency. That can't be good!

The y-axis matters, but these tools sometimes omit the unit or leave out the context needed to interpret it.

Before unfurling my Jump To Conclusions™ mat, I noticed that the 4x increase is from 500 microseconds to 2 milliseconds.

For a system with a latency budget more than an order of magnitude larger than that, this is not actually a concern.
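To make that concrete, here is the arithmetic with an assumed 50 ms latency budget (the exact figure is hypothetical; the point is that a 4x relative jump can be a tiny absolute change):

```python
# Illustrative numbers only; the 50 ms budget is an assumption for this sketch.
BASELINE_MS = 0.5   # 500 microseconds
SPIKE_MS = 2.0      # the "frightening" new latency
BUDGET_MS = 50.0    # hypothetical per-request latency budget

relative_increase = SPIKE_MS / BASELINE_MS   # 4.0x, which is what the graph shows
absolute_increase = SPIKE_MS - BASELINE_MS   # 1.5 ms, which is what actually matters
budget_consumed = SPIKE_MS / BUDGET_MS       # 0.04, i.e. 4% of the budget

print(f"{relative_increase:.0f}x increase, +{absolute_increase} ms, "
      f"{budget_consumed:.0%} of the latency budget")
```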

In isolation the “Read Latency” spike looks frightening; in the context of the rest of the metrics, the addition of another millisecond or so doesn't even register.