The problem
How many times a year do we get complaints from our end users that the network is slow, while our monitoring tool has not yet alerted us to anything critical? Sometimes Zoom lags; other times Jira takes a lifetime to load. You don't know whether it's a circuit issue, a Wi-Fi issue or a DNS resolution problem, each of which is owned by a different team and a different tool. You end up desperately toggling between screens and your CLI.
These slowness complaints come up in my organization from time to time, and executive directors are never happy when the escalations reach them.
For us, once the ticket opens, the usual troubleshooting routine that every network engineer knows by heart begins, at least for the first three steps: ping the destination, run a traceroute and check the physical links for drops. Let's agree that in 2025 this is a very time-consuming and drab process.
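For illustration, here is a minimal sketch of that first-pass triage as a script, assuming a Linux jump host with ping and traceroute installed and a hypothetical list of destinations; it is not our actual tooling, just the manual routine in code form.

```python
#!/usr/bin/env python3
"""Minimal first-pass triage sketch: ping, traceroute and interface-drop check.

Assumptions (not from the article): a Linux jump host with ping/traceroute
installed, and a hypothetical list of destinations to test.
"""
import subprocess

DESTINATIONS = ["10.0.0.1", "app.example.com"]  # hypothetical targets


def run(cmd: list[str]) -> str:
    """Run a command and return its combined output (does not raise on failure)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


for dest in DESTINATIONS:
    print(f"=== {dest} ===")
    # Step 1: ping the destination (5 probes).
    print(run(["ping", "-c", "5", dest]))
    # Step 2: traceroute to see where latency builds up.
    print(run(["traceroute", "-n", dest]))

# Step 3: check local interface counters for drops/errors (Linux example).
print(run(["ip", "-s", "link"]))
```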
And in most cases, ping times are normal, traceroute shows ordinary latency and, of course, the interfaces show zero drops. That is when we log into the CLI device by device, which eats up another hour and is a real strain on the eyes.
We are living in an era where AI is supposed to handle the repetitive tasks and the heavy lifting, help us troubleshoot issues like these or, even better, alert us in advance.
But the question arises: where do we even start? Thankfully, our network devices have evolved with time as well. Some support open telemetry, and almost all of them support event-based logging. In our case, most of our vendors do not support open telemetry, so we went ahead with logs.
The solution and challenges
For AI to work for us and alert us in advance, it needs good-quality, reliable data over time, and that data can come from our classic logs, which are emitted the moment an event is triggered. Ping and SNMP only provide data at polling intervals of two or three minutes, a blurred picture of reality; they won't tell us the current state, let alone project trends.
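As a minimal sketch of what event-driven collection means here, the following listens for syslog messages over UDP and timestamps each one on arrival, instead of waiting for the next two- or three-minute poll; the port and message format are assumptions, not our production collector.

```python
"""Tiny UDP syslog listener sketch: events arrive as they happen,
instead of waiting for the next 2-3 minute SNMP/ping poll.

Assumptions: port 5514 and plain RFC 3164-style messages; in production
the devices would point their syslog exports at a proper collector.
"""
import socket
from datetime import datetime, timezone

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5514))  # unprivileged port, for the sketch only

while True:
    data, (src_ip, _) = sock.recvfrom(8192)
    received_at = datetime.now(timezone.utc).isoformat()
    # Each line is an event with its own arrival time -- no polling blind spots.
    print(f"{received_at} {src_ip} {data.decode(errors='replace').strip()}")
```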
So the research began: at what severity level should we be collecting logs? We settled on informational level. We were collecting logs from around 2,500 devices globally, so we needed to scale our collector capacity, which is not a problem in a large organization.
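To give a sense of the sizing exercise, here is a back-of-envelope calculation; the device count comes from above, but the per-device message rate and average message size are purely illustrative assumptions.

```python
"""Back-of-envelope sizing sketch for informational-level logging.

The device count (2,500) is from the article; the per-device message rate
and average message size are illustrative assumptions, not measured values.
"""
devices = 2_500
msgs_per_device_per_sec = 5   # assumption: informational level is chatty
avg_msg_bytes = 300           # assumption: typical syslog line size

bytes_per_day = devices * msgs_per_device_per_sec * avg_msg_bytes * 86_400
print(f"~{bytes_per_day / 1e9:.0f} GB/day before compression")
# ~324 GB/day under these assumptions -- hence the need to scale collector capacity.
```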
We were now collecting every informational-level log from our SD-WAN routers: SLA violations, CPU spikes on hardware, bandwidth threshold crossings and configuration changes, logged every second. We even collected NetFlow, because let's just agree that brownouts usually hide between "user" and "app," not inside a single device.
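A simple way to keep that firehose usable is to tag each raw message with an event category on the way in. The sketch below does this with regexes; the patterns and the sample message format are illustrative assumptions, since syslog formats vary by vendor.

```python
"""Sketch of tagging raw informational-level logs into event categories
before they land in the data lake. The regexes and the sample message are
illustrative assumptions -- real SD-WAN syslog formats vary by vendor.
"""
import re

RULES = {
    "sla_violation": re.compile(r"SLA.*(violat|breach)", re.I),
    "cpu_spike": re.compile(r"CPU.*(high|threshold)", re.I),
    "config_change": re.compile(r"config(ured|uration) change", re.I),
    "bandwidth_threshold": re.compile(r"bandwidth.*threshold", re.I),
}


def classify(message: str) -> str:
    """Return the first matching category label, or 'other'."""
    for label, pattern in RULES.items():
        if pattern.search(message):
            return label
    return "other"


print(classify("%SDWAN-5-SLA_VIOLATION: tunnel T1 latency breach"))  # sla_violation
```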
Our SD-WAN routers have SLA monitors configured for DNS, HTTPS and SaaS applications. These worked as our synthetic emulators, creating a log whenever an SLA was breached for a layer 7 service or a website was "slow," which let us monitor layer 7 protocols from the router itself.
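For readers without SD-WAN SLA monitors, a rough standalone analogue looks like this: time a DNS lookup and an HTTPS fetch, and emit a log line when a threshold is crossed. The thresholds and target host are illustrative assumptions, and this is not the router feature itself.

```python
"""Standalone analogue of the router SLA monitors described above: time a DNS
lookup and an HTTPS fetch, and log a breach when a threshold is exceeded.
Thresholds and the target host are illustrative assumptions.
"""
import socket
import time
import urllib.request

DNS_THRESHOLD_S = 0.2     # assumed DNS SLA
HTTPS_THRESHOLD_S = 1.0   # assumed HTTPS SLA
TARGET = "www.example.com"  # hypothetical monitored service

# Time the DNS resolution.
start = time.monotonic()
socket.getaddrinfo(TARGET, 443)
dns_time = time.monotonic() - start
if dns_time > DNS_THRESHOLD_S:
    print(f"SLA_BREACH dns target={TARGET} rtt={dns_time:.3f}s")

# Time the HTTPS fetch (first byte is enough for the sketch).
start = time.monotonic()
urllib.request.urlopen(f"https://{TARGET}", timeout=5).read(1)
https_time = time.monotonic() - start
if https_time > HTTPS_THRESHOLD_S:
    print(f"SLA_BREACH https target={TARGET} rtt={https_time:.3f}s")
```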
From our RADIUS/TACACS servers, we were receiving logs on layer 2 port security violations and, occasionally, MAC flooding. On the wireless side, we even collected granular data such as signal strength, SSID, channel bandwidth and client count per access point, all thanks to a vendor API that made quick work of it. Similarly, from our switches we collected everything from layer 2 VLAN changes to OSPF convergence, and from RADIUS server health to interface statistics.
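The wireless collection, in spirit, looked something like the sketch below; the controller URL, endpoint path, auth scheme and field names are all hypothetical placeholders, since every vendor's API differs.

```python
"""Sketch of pulling per-AP wireless metrics from a vendor controller API.
The controller URL, endpoint path, auth header and field names are all
hypothetical -- every wireless vendor exposes this differently.
"""
import json
import urllib.request

CONTROLLER = "https://wlc.example.internal"  # hypothetical controller URL
TOKEN = "REPLACE_ME"                         # hypothetical API token

req = urllib.request.Request(
    f"{CONTROLLER}/api/v1/access-points",    # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    for ap in json.load(resp):
        # Hypothetical field names for the metrics mentioned above.
        print(ap["name"], ap["ssid"], ap["rssi_dbm"],
              ap["channel_width_mhz"], ap["client_count"])
```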
After all the heavy lifting, we were able to get all this data into a data lake, but it turned out to be more like a data swamp.