Observability with AIOps
Proposing a new way of IT operations.
Why does root cause analysis always fail?
Various OpenTelemetry and observability books have already been published. Perhaps the ultimate goal of reading such books is to accurately analyze the root cause and quickly solve the problem. So, was that goal achieved?
I do not know whether there is a book that deals with the topic of root cause analysis in depth. A single book is well suited to explaining one technology in detail, but root cause analysis is so complex and covers so many topics that it is not easy to tell the whole story in one book. As far as I know, there was no book that explained how to analyze the root cause. Since there was no suitable book on this subject, I decided to write one that describes my experiences and cases.
While observability, AIOps, and OpenTelemetry are becoming standard technologies for improving IT operations, these technologies still have clear limitations. There are already many books that explain the advantages, so I think it is more important to explain what the problems are, why projects fail, and what the limitations and constraints are. Don’t you think people often get burned later because only the good things are discussed and the bad things are never mentioned? For example, OpenTelemetry is useful, but it can be problematic if you expect too much from it. Problems that never occurred when using only existing logs now occur frequently when using OpenTelemetry, and because they are low-level system errors rather than simple issues, they are not easy to solve. Adopting OpenTelemetry is not a simple technology choice; it changes the platform, so it must be approached with caution. We must understand the limitations and use the technology appropriately.
The book includes various practical examples and cases. The first example is the observability of a bank, where a single fund-transfer transaction involves 500 spans and multiple traces. That level of approximate observability is not sufficient to understand delays and problems. In the demo in this book, more than 5,000 spans are created in a single transaction without a break. The trace I have configured is never broken, and it captures the delays from interrupts, context switches, and packet retransmissions that occur in the Linux kernel, without modifying the application code.
These 5,000 spans do not miss a nanosecond of delay; they expose the internal behavior of system resources and align it with business processes. The number 5,000 is relative and may seem excessive, but it is clearly a number no one has experienced yet. At first, no one believed my results, so this book serves as the evidence. I have included as many demonstrations as possible so that the evidence is proven. I wanted to show through this book that anything is possible.
Can you identify the specific method causing a problem in a million-line open source project? Or can you find the bug that causes a nanosecond delay in a Linux kernel consisting of 10,000 methods?
When a complex and difficult problem arises, it must be solved. However, the reality is that we avoid most of them without solving them. The same problem recurs later, and again we work around it rather than address the root cause. This is clear evidence that we are becoming stagnant water. SRE requires a passion for learning and an interest in internal operations. Solving problems that occur in system resources demands highly specialized knowledge and experience. Given time and cost constraints, nothing can be solved with technology alone; the approach to the problem matters just as much. And if you work in a large organization, collaboration matters as much as working alone.
Technically speaking, traces and logs alone are not very helpful for root cause analysis. If you rely only on traces and logs, most root cause analyses will fail; detailed examples of why are given in this book. As explained earlier, even if 5,000 spans are captured and a small number of them appear to be the problem, additional investigation is needed to determine whether those spans really are the problem.
The investigation methods I describe in the book are profiling and debugging. Profiles and debugging are important signals for root cause analysis. This book explains in detail the problems that occur in the kernel, network, CPU, and IO. A reader who understands the book will be able to pinpoint, within 30 minutes, the method that causes a delay among 1,000 events and 10,000 methods in the kernel.
In root cause analysis, the approach is more important than the technology. There is no root cause that can be solved with a single technology, and it is common to give up after using a few technologies without understanding the correct approach. Root cause analysis is not a tactic, but a strategy. eBPF is a good example. There are many good books that explain eBPF. However, eBPF cannot be used alone. eBPF is the last tool used in root cause analysis, and before eBPF, it is necessary to narrow down the scope of the problem and identify the event and method that caused the delay.
Most books tend to focus on the use of eBPF tools, excluding the complex analysis process required before eBPF. Tools are just tools, and the overall scope and procedures are more important. I often see people misunderstand eBPF and fail to use the tool effectively, which ultimately leads to a failure of root cause analysis. This book proposes the right approach and explains how to use the tools in the right place at the right time. In this book, I have tried to avoid the mistake of starting with the tools without an accurate approach.
Why are AIOps, anomaly detection, alarms, and dashboards always below expectations?
The results of most anomaly detection and AIOps projects fall below expectations; in fact, I consider them failures. Not only have I never heard specific success stories, but teams never share their failures with others. Clearly it is nothing to be proud of, since these were expensive, large-scale failures. The same goes for dashboards and alarms. Alarms are inaccurate and noisy, and dashboards are often managed as little more than lists of metrics. I have not liked any of the anomaly detection, AIOps, alarms, or dashboards I have seen so far. The reasons for the failures and the solutions are explained in this book.
For dashboards and alarms, I will introduce a completely different approach that requires no code changes. If users cannot understand a dashboard’s content, that dashboard has failed. Even a dashboard that contains very complex and difficult content should convey a story that users can understand. I recommend forgetting the old approach.
SREs often fail to solve problems because they do not understand the business, not because of the technology. To develop a good dashboard, you need to understand the business; the problem is the business, not the technology. To solve this, we use events and traces. Events capture the business context and distributed traces capture the technical context; joining events and traces ultimately produces dashboards and alarms. Using events and traces is one of the various approaches described in this book. We will develop dashboards and alarms in a completely different way from the existing approach. It is not magic, but it is close to magic. This book explains how to combine various signals to create AIOps, anomaly detection, alarms, and dashboards.
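To make the join of business and technical context concrete, here is a minimal sketch of my own (not the book's implementation) that merges business events with trace spans on a shared trace ID using pandas; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Illustrative data only: in practice, events come from the business event
# pipeline and spans from the trace backend. Column names are assumptions.
events = pd.DataFrame([
    {"trace_id": "a1", "event_type": "fund_transfer", "amount": 250_000, "channel": "mobile"},
    {"trace_id": "b2", "event_type": "fund_transfer", "amount": 80_000,  "channel": "branch"},
])

spans = pd.DataFrame([
    {"trace_id": "a1", "service": "payment-gateway", "span_name": "debit",  "duration_ms": 42.0,  "error": False},
    {"trace_id": "a1", "service": "core-banking",    "span_name": "credit", "duration_ms": 830.5, "error": False},
    {"trace_id": "b2", "service": "payment-gateway", "span_name": "debit",  "duration_ms": 39.1,  "error": True},
])

# Join the business context (events) with the technical context (spans).
joined = events.merge(spans, on="trace_id")

# A dashboard panel or alarm rule can now be expressed in business terms,
# e.g. p95 latency and error count per business event type and channel.
summary = joined.groupby(["event_type", "channel"]).agg(
    p95_ms=("duration_ms", lambda s: s.quantile(0.95)),
    errors=("error", "sum"),
)
print(summary)
```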
This book does not consider environments with only dozens of microservices. It explains complex production environments that include thousands of microservices and dozens of legacy systems. In fact, if you run dozens of microservices with no legacy, there is little need to worry about observability; at that scale, almost any approach will do.
To the readers of this book
- This book describes the detailed internal operation of the system. It analyzes the lower levels of system resources and measures nanosecond delays. Most developers have never experienced such technology.
- It helps you solve problems that other developers could not. When others give up on a problem, you can become the developer who solves it within 30 minutes.
- This book explains the kernel and virtual machines. The content is difficult, complicated, and takes a lot of time to understand. Through this book, the reader can become a guru.
As close to actual practice as possible.
Theory and practice are clearly different. Practice is much more complicated and requires consideration of many variables. Recent books that cover OpenTelemetry do not explain the various constraints of practice. I by no means deny the theoretical contributions of other books; I myself have learned a great deal from them. However, there was a big gap between theory and practice, and no book filled this gap.
I believe this book can fill the gap between theory and practice. For example, telecommunications and banking use a lot of legacy. Legacy is technical debt, but it is also the most important asset. Most books focus only on new technologies and do not explain black boxes and legacy. Observability and AIOps cannot succeed without a solution for legacy and black boxes. If they are ignored, observability remains very limited and root cause analysis becomes impossible.
This book explains in detail most of the legacy systems used in banking and telecommunications. It contains a lot of content that has not been covered in any other book. Indeed, what other book would explain SAP ERP, SIEBEL CRM, CICS DB2, and TUXEDO? This book covers about 50 legacy systems used in practice, which is sufficient to implement the complex observability required in the field. In fact, there is even more legacy, including IBM SNA LU 0, but I did not include it out of concern that it would make the book too complicated.
All legacy systems in the book were virtualized and configured internally and were used for the book’s demonstrations. However, I must be cautious about disclosing them externally. I am thinking of ways to share them with readers, and I welcome personal questions at any time.
Most services today are developed as microservices, and multi-cloud environments are built using AWS, GCP, and Azure. The cloud is connected to on-premises environments where legacy is deployed. The architecture of banks and telecoms deploys more than 10,000 microservices across multiple Kubernetes clusters and interfaces them with various legacy systems. Can you see why the monitoring of the past is no longer suitable? Rather than relying on specific commercial software as in the past, we now use open source, cloud, Kubernetes, and legacy together, so it is very difficult to manage dynamic and complex applications with the old approach. While a developer only needs to understand and develop in a single language for a specific task, SREs are involved in every failure that can occur, so the difficulty is much higher.
I will not elaborate further on 24x7 global services for users around the world. Service operations have become complex; they are no longer as static and simple as they used to be. Failures can occur anytime, anywhere, and it is time to think constantly about, and prepare for, the reliability of services. This book tries to reflect as much as possible the concerns experienced in practice. It may be light on theory, but there are many other books that explain the theory in detail, so it would be helpful to study them together.
What was the motivation for writing the book?
The biggest motivation for writing this book was a distaste for the existing methods. The existing culture, methods, and theories were inefficient and not suitable in practice. If the results had been good, there would have been no distaste; because the results were bad, I did not want to repeat the failure.
In terms of solutions, there are too many problems that cannot be solved with either open-source or commercial observability. Despite the enormous amounts of money spent on large-scale observability projects, the results are not good. The accuracy of inference is too low, automation fails, and root causes cannot be analyzed. There are so many problems that I cannot list them all.
On the technical side, SREs can feel happy when a single trace includes hundreds of spans and clearly shows the delay. However, this book is not satisfied with that level. Not hundreds but thousands of spans must be output clearly. You must understand the internal operation by capturing every small event that occurs inside the CPU without missing any, and every internal action of a system resource must include business context. Only then can we understand the complexity and manage it. We should not just boast about our technology; we should also strive to manage overhead, reduce costs, and improve service reliability.
How will it help each type of reader?
System developers who want to understand the internal operation of the system and the root causes of problems.
Failures occur, but SREs do not understand the cause. Due to time and manpower constraints, applications are restarted and everyone moves on without understanding the root cause. Willing or not, root cause analysis is impossible if developers and engineers do not know how systems work internally, or if observability signals are not collected accurately.
The purpose of this book is to help engineers understand the internal operation of a system, collect signals, and perform root cause analysis. That is why I go into detail about kernels, VMs, instrumentation, agents, IO, and traces. I am a developer and an engineer myself. I know it is hard and complicated, but I hope readers can work through it without giving up and become better engineers and developers.
Data engineers who want to automate observability and understand AIOps.
From a data perspective, observability is big data, and it is a great place to apply AI. It is a transitional time, and both open source and commercial offerings have limited capabilities. At the same time, many companies are investing heavily in emerging technologies like AIOps. My experience with commercial AIOps deployments is that they are not effective at root cause analysis or at reducing noise in production. The cost/benefit ratio of AI is still far from ideal, but given how the technology is evolving, it is worth staying interested and continuing to learn.
The analysis and aggregation of observability data is still a new field that needs to be explored and understood. Not only is there a lack of understanding of the various observability signals, but there is also a lack of open source for loading and analyzing them. SREs need to work together with data engineers and complement each other’s strengths.
SREs who want to quickly identify the root cause and understand how to resolve it
An important part of operations is automating systems and resolving problems quickly. The ability to identify and resolve problems is especially important for SREs. With so many development languages in use and so many types of problems propagating, it is not easy to recognize and resolve failures quickly.
Recently published SRE books seem to focus more on the macro level than on the practical level of solving problems, so a book like this one, which explains approaches and solutions to the various errors that occur in production, should be helpful to many SREs. To understand and solve problems, traces are explained in detail. Tracing comes in two forms: distributed tracing and system tracing. I believe traces are the most basic signal for observability and the starting point, so my approach is to start with traces and gradually expand to other signals. Despite the complexity and difficulty of tracing, it is necessary to configure distributed traces at the E2E level and system traces at the system-resource level to solve problems.
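As a minimal illustration of that distributed-trace starting point (a generic OpenTelemetry sketch of my own, not the book's banking configuration), the following creates a parent and child span with business attributes; the service, span, and attribute names are assumptions.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to the console for demonstration; in production you would
# typically point an OTLP exporter at a collector instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("transfer-demo")  # tracer name is illustrative

with tracer.start_as_current_span("fund-transfer") as parent:
    parent.set_attribute("account.id", "12345")       # business context as attributes
    with tracer.start_as_current_span("debit-account") as child:
        child.set_attribute("amount", 250000)
```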
Infrastructure engineers in banking, telecom, and other legacy-heavy industries who want to understand new observability.
While SRE culture is often characterized as tech-centric, only a small percentage of organizations are tech companies. It is more important to successfully configure observability in mainstream industries such as banking, telecommunications, and manufacturing. Applying observability to legacy is more important than applying it to new technology.
The main problems are black boxes and legacy, which make configuring observability difficult and slow. Legacy breaks the propagation of traces, and agents are often difficult to configure and not technically supported. In this book, I describe various middleware. Since legacy is usually connected through EAI servers and message servers, I describe how to configure E2E observability by instrumenting this middleware. I also describe various legacy applications, such as SAP ERP and TUXEDO, and explain how to configure observability for legacy.
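As a hedged sketch of the general idea (not the exact mechanism used for any specific EAI or message server in this book), W3C trace context can be injected into a message's headers on the producer side and extracted on the consumer side so the trace is not broken across the middleware; the header dictionary and span names here are illustrative.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("eai-bridge-demo")  # illustrative name

# Producer side: before handing the message to the middleware, copy the
# current trace context into the message headers (e.g. JMS/MQ properties).
def publish(message: dict) -> dict:
    with tracer.start_as_current_span("eai.publish"):
        headers: dict = {}
        inject(headers)              # writes traceparent/tracestate keys
        message["headers"] = headers
        return message               # hand off to the real middleware here

# Consumer side: restore the context from the headers so the consuming span
# becomes a child of the producer's span instead of starting a new trace.
def consume(message: dict) -> None:
    ctx = extract(message.get("headers", {}))
    with tracer.start_as_current_span("eai.consume", context=ctx):
        pass                         # process the message
```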
Architects who want to advance observability technically and understand its business value
In this book, I categorize observability into three levels. First, organize application and infrastructure observability and leverage it for root cause analysis. Second, automate the analysis and aggregation of observability data, and use AIOps to analyze root causes and forecast failures. Third, provide reliable service to customers to minimize losses from failures, and use observability to align business and IT.
I look at observability as a tool for business improvement. In the past, observability was considered only in the realm of systems: the application, the kernel, and the virtual machine. In the future, observability can be used as a tool to further support and align the business and increase revenue. I will go beyond root cause analysis and operational automation to explain how observability can lead the business and deliver value to product owners and executives. I hope this book helps you see that possibility.
Java developers curious about observability best practices at tech companies and large enterprises
Developers love Google’s SRE culture and practices and have a certain admiration for SRE at tech companies. At my current job, my manager, team leader, and teammates come from Google SRE, so I have a great opportunity to understand and learn SRE as it is thought about and practiced by people from tech companies.
What I have learned from working with them is that there are pros and cons. Our legacy is the most important core system, so there are many constraints on implementing observability. The difference is that we have complex domains and business processes that tech companies do not have.
DevOps engineers who use both open source and commercial observability
While I personally prefer open source, the company I work for is paying tens of billions to build commercial observability. A service failure can be very damaging to a banking business, so the company pays a high price for commercial observability in order to minimize failures and solve problems quickly. Observability is not just a tool for SRE; it is used across the organization together with developers, and we are looking for ways to reduce costs and improve developer productivity.
The chapters of this book are as follows.
Chapter 1. Root cause analysis
Analyzes the limitations and problems of existing observability and root cause analysis methods, and explains a successful approach to root cause analysis.
Explains the limits of observability and the solutions to them.
Describes the important signals that make up observability, including distributed traces, system traces, events, and profiles. In addition to the details, it explains each signal’s limitations, bugs, workarounds, and future direction.
How to combine multiple traces within an event and create business-centric observability
How to use eBPF profiles in Infrastructure Observability for root cause analysis
Chapter 2. Observability approach
Most observability dashboards and alarms are inaccurate and fail. Explains the causes of failure and the solutions.
Observability is complex and has a steep learning curve. Correlations are easy for developers to understand and use; they narrow the scope of the problem and provide multiple perspectives for root cause analysis.
Visualizes correlations between various signals, and describes how to develop dashboards and charts and design a root cause analysis process.
Visualizes the 14 correlations between the 8 signals, including:
- Business and IT alignment
- Correlation between applications and infrastructure
Dashboards and alarms are often inaccurate. Explains the cause and the solution, and proposes a different approach.
Once the correlations are constructed, I show how to develop dashboards, charts, and root cause analysis processes based on them.
Chapter 3. Trace-oriented observability
When building observability, problems arise in server frameworks and message servers, where traces break or logs conflict. Analyzes the causes of these problems and explains them with examples.
Many companies are introducing E2E tracing. Explains what problems arise when SREs configure E2E traces and how to solve them.
The starting point for application observability is the distributed trace. This section describes the internal operation of traces and agents. Traces instrument a wide variety of software, including:
- Cloud-based managed services
- Non-blocking application servers
- Event-driven message servers
- EAI servers that integrate legacy
- Extensions that complement the OpenTelemetry agent (a minimal sketch follows this list)
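As one hedged example of such an extension point (a generic OpenTelemetry SDK mechanism, not the specific agent extensions described later in the book), a custom SpanProcessor can enrich every span with environment or business attributes before export; the attribute names below are assumptions.

```python
from opentelemetry.sdk.trace import SpanProcessor

class EnrichingSpanProcessor(SpanProcessor):
    """Adds common attributes to every span when it starts."""

    def __init__(self, extra_attributes: dict):
        self._extra = extra_attributes

    def on_start(self, span, parent_context=None):
        # Runs synchronously on span start; keep it cheap.
        for key, value in self._extra.items():
            span.set_attribute(key, value)

    def on_end(self, span):
        pass  # export is handled by the exporter's own processor

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis: int = 30_000) -> bool:
        return True

# Registration on the tracer provider (attribute names are illustrative):
# provider.add_span_processor(EnrichingSpanProcessor({"deployment.zone": "dmz-1"}))
```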
Observability is more difficult and complex than microservices development because it requires understanding the internals of the system. This chapter helps SREs understand and solve those internal problems using observability.
Chapter 4. Observability cases by industry
Explains trace cases in banking, telecommunications, and gaming. Through complex real-life examples, we can understand the advantages and direction of observability.
In practice, observability is very complex, but the complexity can vary considerably with the size of the company. A small startup operates hundreds of microservices, while a bank operates tens of thousands; the difference in complexity is large. This chapter explains how observability is actually configured and provides examples for each industry.
Each industry has different goals, architectures, and business processes, so the observability that supports them must be organized differently.
- Banking: Explains how front-end Microservices and back-end EAI servers are connected by traces and how to configure E2E traces with hundreds of spans.
- Telecommunication: Describes the process of distributed transactions in the ordering process that handles new activations of combined wireline and wireless products.
- Online games: Unlike banks and telecoms, which have a lot of legacy, game companies develop and operate ultra-low-latency applications. The way they handle threads, memory, and locks is very different from the Spring Boot server framework. This section provides detailed guidelines for developing and operating ultra-low-latency applications.
I will show how to solve these problems using both open source and commercial observability.
The applications used in the banking demo are as follows.
- Tuxedo: a TP monitor that runs on UNIX.
- CICS: a TP monitor that runs on IBM mainframes. In the demo, it manages cash and savings accounts.
- SAP ERP: an enterprise resource planning application, used in banking as core banking. In the demo, it processes payment services.
- SWIFT CASmf: SWIFT middleware that supports the FTP and MQ protocols and MT message formats for communicating with the SWIFT network. The CASmf SWIFT server interfaces with the SWIFT network and processes business-to-business payments and foreign exchange transactions. The Global PAYplus payment gateway uses Tuxedo internally and exchanges transactions with CASmf via MQSeries.
- IBM MQ: a queue-based, industry-standard message server.
- Customer MDM
- ORACLE ERP
- Managed File Transfer
- Lotus Notes
- DataStage ETL
- FileNet BPM
The applications used in the Telecom demo are as follows.
- TIBCO Order Management: supports order fulfillment and network provisioning
- Portal Infranet
- Siebel CRM
- PeopleSoft
- MetaSolv
- ORACLE ASAP
Chapter 5. OpenTelemetry Demo
Customizes the OpenTelemetry demo in various ways, improves its functionality, and explains practical examples.
Using the OpenTelemetry reference demo, resolves root causes by applying application observability.
Configures profiles, RUM, events, traces, and more, and explains how to improve them. Analyzes observability data using Promscale.
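As a rough sketch of what SQL-based analysis of trace data can look like (Promscale exposes traces through SQL views; the connection string and the ps_trace.span view and column names below are assumptions that may differ by version, so check your Promscale schema), one might rank slow spans per service:

```python
import psycopg2

# Connection details and the ps_trace.span view/column names are illustrative.
conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/postgres")

QUERY = """
SELECT service_name,
       span_name,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
       count(*) AS calls
FROM   ps_trace.span
WHERE  start_time > now() - interval '1 hour'
GROUP  BY service_name, span_name
ORDER  BY p95_ms DESC
LIMIT  20;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for service, span, p95, calls in cur.fetchall():
        print(f"{service:30s} {span:30s} p95={p95:8.1f} ms calls={calls}")
```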
Chapter 6. Observability of Infrastructure
Existing infrastructure observability cannot analyze root causes. Explains the problem and configures system traces and profiles for a new kind of root cause analysis.
This is my personal favorite chapter, and it provides the answers for root cause analysis. It explains why observability fails at root cause analysis and how to use system traces and system profiles to solve this.
Uses system traces and eBPF to identify delays and errors in system resources and troubleshoot problems.
It measures nanosecond-level unnecessary delays caused by context switches, the scheduler, interrupts, and timers. It aligns kernel methods and system calls with the business context, and measures 99th-percentile latency and system call counts through BPF profiles. By correlating distributed traces with system traces, it visualizes every section where delays and problems occur.
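To give a flavor of the BPF-profile side, here is a minimal bcc sketch of my own (not the book's tooling; it requires root and the bcc toolkit, and vfs_read is only an illustrative target chosen after the analysis has narrowed the problem down to one function): it builds a latency histogram for that single kernel function.

```python
import time
from bcc import BPF

# Kernel-side program: record entry timestamps per thread, then add the
# log2-bucketed latency to a histogram on return.
bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(latency_us);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    u64 delta_ns = bpf_ktime_get_ns() - *tsp;
    latency_us.increment(bpf_log2l(delta_ns / 1000));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_read", fn_name="trace_entry")        # example target
b.attach_kretprobe(event="vfs_read", fn_name="trace_return")

print("Tracing vfs_read latency for 10 seconds...")
try:
    time.sleep(10)
except KeyboardInterrupt:
    pass

b["latency_us"].print_log2_hist("usecs")
```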
It also explains how to apply Cilium and chaos engineering to infrastructure observability.
Chapter 7. Infrastructure anomaly detection
Existing anomaly detection is mostly unsuccessful because it is inaccurate and noisy. Explains the problem and shows how to build anomaly detection that works.
Anomaly detection is a useful technology that can complement infrastructure observability. However, most anomaly detection is built incorrectly or fails. Explains the reasons for failure and how to solve them.
Configures anomaly detection using OpenSearch’s anomaly detection plugin, and configures infrastructure observability to identify and proactively respond to anomalies in system resources.
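As a rough sketch of how a detector can be created through the Anomaly Detection plugin's REST API (the endpoint path follows the plugin documentation, but the index name, field names, credentials, and intervals below are illustrative assumptions, not the book's configuration):

```python
import json
import requests

# Placeholders: adjust host, credentials, index pattern, and field names.
OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")

detector = {
    "name": "cpu-usage-detector",
    "description": "Detect anomalies in host CPU usage",
    "time_field": "@timestamp",
    "indices": ["system-metrics-*"],
    "feature_attributes": [
        {
            "feature_name": "max_cpu",
            "feature_enabled": True,
            "aggregation_query": {"max_cpu": {"max": {"field": "cpu.usage"}}},
        }
    ],
    "detection_interval": {"period": {"interval": 5, "unit": "Minutes"}},
    "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
}

resp = requests.post(
    f"{OPENSEARCH}/_plugins/_anomaly_detection/detectors",
    auth=AUTH,
    headers={"Content-Type": "application/json"},
    data=json.dumps(detector),
    verify=False,  # demo clusters often use self-signed certificates
)
print(resp.status_code, resp.json())
```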
Introduces various anomaly detection cases and explains how to configure reliable anomaly detection.
Chapter 8. Analyze Observability Data
Commercial observability allows you to analyze observability data in detail. Explains how to easily analyze observability data using open source.
The biggest reason for using commercial observability instead of open source is data analysis: with open source, analyzing observability data is difficult. I will present a solution to this. Observability data is primarily used to solve problems through root cause analysis, but it can also be used for a variety of other purposes, including dashboards, alarms, and AIOps.
In the beginning, observability is a simple requirement, but as it matures, complex requirements must be implemented and a separate observability data analysis tool is needed. Describes how to store collected traces and metrics in Promscale, and how to view and analyze observability data for complex requirements.
Chapter 9. Aggregate Observability Data
Commercial observability cannot analyze the root causes of complex clusters. This chapter analyzes root causes successfully through Java observability.
Selects Presto and Druid for building a data lake that aggregates observability signals, and describes the pros and cons.
Describes the process of using observability to debug and understand a Druid cluster consisting of complex microservices. To understand complex open source, it uses VisualVM to explain sampling and profiling of CPU, memory, and locks.
Java is the most popular development language, but virtual threads and non-blocking code are difficult to monitor, and observability does not yet support the latest Java technology well. This chapter explains Java observability in detail and shows how to apply observability to non-blocking code and coroutines.
Chapter 10. AIOps
Most commercial observability AIOps offerings fail at root cause analysis. This chapter proposes a plan for a successful AIOps implementation.
The previous chapters explained how to search, analyze, and aggregate data; now I explain how to implement AIOps with LangChain and RAG using the collected observability data.
It analyzes why traditional AIOps fails and explains the various data required for an AIOps configuration, such as the CMDB. To build AIOps, you can use OpenSearch’s agents and tools to quickly build RAG. AIOps built with OpenSearch RAG includes a knowledge base, which can automate root cause analysis and failure inference.
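To make the RAG idea concrete, here is a heavily simplified sketch of my own (not the book's OpenSearch agent/tool pipeline): retrieve past incidents similar to a new alert from an OpenSearch index and assemble them into an LLM prompt. The index name, fields, credentials, and the final LLM call are placeholders.

```python
from opensearchpy import OpenSearch

# Client settings and the 'incident-knowledge-base' index/fields are
# illustrative assumptions, not the book's actual configuration.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=True,
    verify_certs=False,
)

def retrieve_similar_incidents(alert_text: str, k: int = 3) -> list:
    """Retrieval step: find past incidents whose description matches the alert."""
    body = {"size": k, "query": {"match": {"description": alert_text}}}
    hits = client.search(index="incident-knowledge-base", body=body)["hits"]["hits"]
    return [hit["_source"]["resolution"] for hit in hits]

def build_prompt(alert_text: str, documents: list) -> str:
    """Augmentation step: ground the LLM in retrieved operational knowledge."""
    context = "\n---\n".join(documents)
    return (
        "You are an SRE assistant. Using the past incident resolutions below,\n"
        "suggest the most likely root cause and the next diagnostic step.\n\n"
        f"Past incidents:\n{context}\n\nNew alert:\n{alert_text}\n"
    )

alert = "p99 latency of payment-gateway exceeded 2s; CPU steal time rising on node-7"
prompt = build_prompt(alert, retrieve_similar_incidents(alert))
# Generation step: pass `prompt` to the LLM of your choice (LangChain, an
# OpenSearch ML Commons agent, etc.); the call itself is omitted here.
print(prompt)
```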