Datadog Bits AI
It’s Songkran festival in Thailand right now. This year I’m not going anywhere, so I’m spending a lot of time learning Datadog, AI, and training Muay Thai.
Thanks to the courses I took on the Datadog Learning Center, I got a chance to try out a very cool feature, Bits AI, which isn’t enabled in the organisation of the client I’m working with.
I have always been interested in trying out Datadog’s AI features, because I believe there is so much we can do with that abundant volume of data. After trying it out in a Datadog ephemeral environment, I am fully sold.
The course I did was about monitors. Datadog created a fake e-commerce site called storedog to demonstrate its capabilities. Storedog is a distributed system with many moving parts: various microservices, nginx, a frontend app, Postgres, and Redis. The whole system is well instrumented with Datadog, and it is the demo app reused across all of Datadog’s training materials. In the monitor course, they simulated a fault in the store-frontend service, which triggered a p95 high-latency alert. The course is all about setting up and tuning the monitor so it aligns with SRE best practices.
After finishing the course, I started looking around Datadog and found that the Bits AI feature is available in this learning environment. The course never told us what caused the issue. As a learner, I don’t have access to the underlying code, nor am I familiar with it. That felt like a typical real-world situation: as an infrastructure engineer, you often deal with errors you have no clue about. This is exactly what interested me: how can AI help in this situation?
I tried it, and the result was amazing!
So, let’s look into it. The alert query is simply this:
```
avg(last_5m):p95:trace.browser.request{env:notification-message,resource_name:/ads,service:store-frontend} by {service} > 3
```
When the evaluated value goes over 3 seconds, the alert is triggered.
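To make the semantics concrete, here is a minimal Python sketch of what the monitor evaluates: the 95th percentile of browser request latencies over the evaluation window, compared against the 3-second threshold. This is not Datadog’s internal implementation, and the sample window below is fabricated; it only illustrates why a slow tail of requests trips a p95 monitor even when most requests are fast.

```python
import math

def p95(latencies_s):
    """Nearest-rank 95th percentile of latency samples (in seconds)."""
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # 1-indexed nearest-rank position
    return ordered[rank - 1]

def should_alert(latencies_s, threshold_s=3.0):
    """Mimic the `p95:... > 3` check over one evaluation window."""
    return p95(latencies_s) > threshold_s

# Mostly-fast requests with a slow tail: the p95 crosses the threshold.
window = [0.2] * 90 + [7.3] * 10  # fabricated last-5m samples, in seconds
print(p95(window))          # 7.3
print(should_alert(window)) # True
```

This is why p95 alerts are popular in SRE practice: the average of the window above is well under a second, but one request in ten is unacceptably slow, and the p95 catches it.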
As mentioned earlier, I don’t know what exactly caused this. However, there is a button at the top called “Investigate with Bits AI SRE”. Why not give it a try?
So I clicked it, and it automatically created a Bits AI SRE investigation for me. This is what it looks like:
It starts from some crafted prompts and investigates step by step automatically. At the same time, you can add more details in the chat pane on the left. I didn’t do anything, and the agent went through the investigation all by itself.
Here is a step-by-step summary of its investigation:
| Step | Name | Finding | Time Consumed |
|---|---|---|---|
| – | Start | Started from the Datadog prompt | not shown |
| 1 | Identified latency spike on /ads resource | store-frontend has high latency on the /ads endpoint | 5.4s |
| 2 | Identified dual deployments in store-ads | store-frontend calls store-ads, which has two deployments running at the same time: 1.0.0 and 1.0.10 | 5.0s |
| 3 | Ruled out backend latency | Related APM traces show store-ads took only 47ms and Postgres 22ms, yet the whole browser.request span took 7,262ms | 6.7s |
| 4 | Spotted network-layer latency gap | The gap between server completion time and the browser span suggests something is wrong at the network layer; suspected a proxy/reverse proxy in the middle | 4.3s |
| 5 | Linked browser latency to a systemic frontend issue | Confirmed it’s a host-level issue, because other endpoints such as /discount and /_next/data/... were all slow too | 24.4s |
| 6 | Spotted high CPU pressure on the host | Checked host metrics and found the 1-minute CPU load spiking to 17.19 | 2.2s |
| 7 | Queried per-container CPU usage | Broke CPU usage down per container | not shown |
| 8 | Identified Puppeteer CPU contention | lab-puppeteer-1 had the highest CPU contention | 14.4s |
| 9 | Confirmed CPU contention pattern | In the past 2 hours, the lab-puppeteer-1 container regularly hit 80–124% CPU utilization, and the system load correlates directly with these spikes | 11.3s |
| 10 | Confirmed CPU oversubscription bottleneck | Concluded that total container CPU demand massively exceeds the host’s 200% (2-core) capacity | not shown |
| – | Conclusion | Wrote up the Root Cause Analysis: Evidence, Analysis, Impact, and Significant Tool Calls (of the AI agent) | 4.0s |
| – | Total | | ~77.3s |
Throughout the run, all of its analysis is evidence-driven. You can click on any step to review the actual data in Datadog: the related monitor, APM traces, container metrics, and so on.
And most importantly, the whole investigation took under two minutes. This is blazingly fast. Imagine being an SRE with very minimal application context; the impact of a feature like this would be huge.
The AI agent is also very clever. I don’t know whether Datadog includes a description on the container, but the Bits AI agent apparently understood that the problematic Puppeteer container is used to generate traffic. That context is specific to the Datadog Learning Center, since such a traffic generator wouldn’t exist in a real production environment.
This is just a quick demo of a Bits AI investigation. There is more to it, such as auto-investigating monitor alerts and generating per-team reports. In short, this feature is so cool that I absolutely love it.