AIOps-Based Observability

36 min read, Jan 30, 2025

Proposing a new approach to IT operations.

Why does root cause analysis always fail?

Many books on OpenTelemetry and observability have already been published. Readers presumably pick them up with one end goal: to analyze root causes accurately and resolve problems quickly. So has that goal been achieved?

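I am not sure that any book covers root cause analysis in depth. A single book is a good length for explaining one technology in detail. Root cause analysis, however, spans so many complex and varied topics that it is hard to tell its story in one volume. As far as I know, no book has explained how to actually analyze a root cause. Because no suitable book existed, I decided to write one that describes my own experience and cases.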

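Observability, AIOps, and OpenTelemetry are becoming standard technologies for improving IT operations, but they clearly still have limits. Since many books already explain their advantages, I think it is more important to explain what the problems are, why projects fail, and what the limits and constraints are. When only the advantages are discussed and the drawbacks are left out, many of us end up blindsided later. OpenTelemetry, for example, is useful, but expecting too much from it leads to trouble. Problems that never occurred when only the existing logs were used occur frequently once OpenTelemetry is introduced. These problems are not simple; they are low-level system errors and are hard to fix. You are not adopting a simple technology but changing a platform, and that calls for caution. You have to understand the limits precisely and use the technology appropriately.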

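The book includes a variety of practical examples and cases. The first example is observability for a bank, where an account-transfer transaction contains 500 spans and multiple traces. Observability at that coarse level is not enough to understand latency and problems. In this book’s demo, a single transaction generates more than 5,000 spans without the trace breaking. The traces I configure never break, and, without any code changes, they turn the latency of interrupts, context switches, and packet retransmissions in the Linux kernel into spans.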

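With 5,000 spans, nanosecond-level delays are not missed, and the internal behavior of system resources, including the business process, can be shown in detail. Whether 5,000 is a lot is a matter of perspective, and it may look complicated. But it is clearly a number that nobody has experienced so far. At first, no one believed my results, so this book serves as evidence. It includes as many demos as possible so that the evidence can be verified. Through this book, I wanted to prove that all of this is possible.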

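Can you identify the specific method that causes a problem in an open source codebase of one million lines? Can you find the bug that causes a nanosecond-level delay in a Linux kernel made up of 10,000 methods?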

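When a complex, difficult problem occurs, it has to be solved. In reality, though, it is usually avoided rather than solved. The same problem recurs later, and the effort goes into avoiding it instead of fixing the root cause. This is clear evidence of stagnation. SREs need curiosity about internal behavior and a passion for learning. Solving problems that occur in system resources requires highly specialized knowledge and experience. Given the time and cost involved, nothing can be solved by technology alone. The approach to the problem matters as well, and if you work in a large organization rather than alone, so does collaboration.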

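From a purely technical point of view, traces and logs are not much help for root cause analysis. If you use only traces and logs, most root cause analyses will fail. The book gives detailed examples of why they fail. As described above, even if you capture 5,000 spans and find a few that appear to be the delayed ones, further investigation is needed to determine whether those few spans are really the problem.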
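
As a rough illustration of that first narrowing step, the sketch below ranks the spans of one trace by self time (a span's duration minus the time spent in its direct children). The `Span` fields, the sample trace, and the function names are hypothetical and not taken from the book, and the result is only a list of candidates that still has to be confirmed by further investigation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical minimal span record; real trace data carries many more fields.
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ns: int
    end_ns: int

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


def self_times(spans):
    """Return {span_id: self time in ns}.

    Self time = duration minus the summed duration of direct children.
    Assumption: children of one span run sequentially (no overlap).
    """
    child_total = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ns
    return {s.span_id: max(0, s.duration_ns - child_total[s.span_id]) for s in spans}


def slowest_spans(spans, top_k=3):
    """Return the top_k (name, self_time_ns) pairs, largest self time first."""
    own = self_times(spans)
    ranked = sorted(spans, key=lambda s: own[s.span_id], reverse=True)
    return [(s.name, own[s.span_id]) for s in ranked[:top_k]]


# Made-up four-span trace for an account transfer.
trace = [
    Span("a", None, "transfer", 0, 1_000_000),
    Span("b", "a", "debit-account", 100_000, 400_000),
    Span("c", "a", "credit-account", 400_000, 950_000),
    Span("d", "c", "db-commit", 450_000, 900_000),
]
candidates = slowest_spans(trace, top_k=2)
print(candidates)
```

Ranking by self time rather than total duration keeps a parent span from looking slow merely because one of its children is slow.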

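The investigation methods I describe in the book are profiling and debugging. Profiles and debugging are important signals for root cause analysis. The book explains in detail the problems that occur in the kernel, network, CPU, and I/O. A reader who has understood the book should be able to pinpoint the delayed method among 1,000 kernel events and 10,000 methods within 30 minutes.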

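In root cause analysis, the approach matters more than the technology. No root cause is solved by a single technology, and without understanding the right approach people tend to try a few tools and then give up. Root cause analysis needs a strategy, not tactics. BPF is a good example. There are several good books on BPF, but BPF cannot be used on its own. BPF is the tool you use last in root cause analysis; before BPF, you need a process that narrows the scope of the problem and identifies the events and methods where the delay occurs.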

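Most books skip the complex analysis that has to come before BPF and concentrate on how to use BPF tools. A tool is only a tool; the overall scope and procedure matter more. I often see people misunderstand BPF and use the tools ineffectively, and in the end the root cause analysis fails. This book proposes the right approach and explains how to use each tool in the right place. I tried to avoid the mistake of explaining tools before establishing the approach.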

Why are AIOps, anomaly detection, alarms, and dashboards always disappointing?

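The results of most anomaly detection and AIOps projects fall short of expectations. I consider them failures in practice. I have never heard of a concrete success story, and failures are never shared with others. When a lot of money has been spent and the project has collapsed, it is clearly nothing to boast about. The same goes for dashboards and alarms. Alarms are inaccurate and noisy. Dashboards tend to be maintained as little more than lists of metrics. I have not been satisfied with any of the anomaly detection, AIOps, alarms, or dashboards I have seen. The reasons for failure and the solutions are explained in this book.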

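For dashboards and alarms, I introduce a method that is completely different from existing approaches and requires no code changes. If the users of a dashboard cannot understand its content, the dashboard has failed. Even a dashboard with very complex and difficult content must deliver a story that its users can understand. I recommend forgetting the existing way of doing things.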

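SREs fail to solve problems and run into difficulties not because they lack technology but because they do not understand the business. Building a good dashboard requires an understanding of the business. The bigger problem is not the technology but the business. To address this, I use events and traces. Events capture the business context and distributed traces capture the technical context, and in the end events and traces are joined to produce dashboards and alarms. Using events and traces is one of the various approaches described in the book. Dashboards and alarms are developed in a way that is completely different from existing approaches. It is not magic, but it is close. The book explains how to combine various signals to build AIOps, anomaly detection, alarms, and dashboards.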
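
To make the idea of joining events and traces concrete, here is a minimal sketch. It assumes that each business event carries the ID of the trace it belongs to; the record types, field names, sample data, and latency targets are all hypothetical and do not come from the book or from any particular observability product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessEvent:
    # Business context (hypothetical fields).
    trace_id: str
    process: str
    customer_tier: str


@dataclass(frozen=True)
class TraceSummary:
    # Technical context, reduced to one row per trace (hypothetical fields).
    trace_id: str
    duration_ms: float
    error: bool


def join_events_and_traces(events, traces):
    """Inner-join events and trace summaries on trace_id."""
    by_id = {t.trace_id: t for t in traces}
    rows = []
    for e in events:
        t = by_id.get(e.trace_id)
        if t is None:
            continue  # no matching trace, e.g. not sampled or propagation broke
        rows.append({
            "trace_id": e.trace_id,
            "process": e.process,
            "customer_tier": e.customer_tier,
            "duration_ms": t.duration_ms,
            "error": t.error,
        })
    return rows


def alarms(rows, slo_ms):
    """Keep rows that failed or exceeded the latency target of their business process."""
    return [r for r in rows if r["error"] or r["duration_ms"] > slo_ms[r["process"]]]


events = [
    BusinessEvent("t1", "account-transfer", "gold"),
    BusinessEvent("t2", "account-transfer", "basic"),
    BusinessEvent("t3", "card-payment", "gold"),
    BusinessEvent("t9", "card-payment", "basic"),  # has no matching trace
]
traces = [
    TraceSummary("t1", 120.0, False),
    TraceSummary("t2", 950.0, False),
    TraceSummary("t3", 80.0, True),
]
rows = join_events_and_traces(events, traces)
fired = [r["trace_id"] for r in alarms(rows, {"account-transfer": 500.0, "card-payment": 300.0})]
print(fired)
```

Because every joined row carries the business process and customer tier alongside latency and error status, an alarm or dashboard panel can be phrased in business terms instead of as a bare metric.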

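I did not have an environment of a few dozen microservices in mind. The book considers and describes complex production environments with thousands of microservices and dozens of legacy systems. Frankly, with a few dozen microservices and no legacy, there is not much about observability to worry about. If your scale is small, do it however you like.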

To the readers of this book

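This book explains the detailed internal behavior of systems. It analyzes the low levels of system resources and measures nanosecond-level delays. Most developers have probably never worked with these techniques.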

You can solve problems that other developers could not. When other developers have given up, you can be the developer who resolves the problem within 30 minutes.

The book covers the kernel and the virtual machine. The content is difficult and complex, and it takes a long time to understand. Through this book, the reader can become a guru.

As close to practice as possible

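Theory and practice are clearly different. Practice is far more complex and has many more variables to consider. Recent books on OpenTelemetry do not explain the constraints encountered in practice. I do not deny the theoretical contributions of other books; I learn a great deal from them myself. But there is a large gap between theory and practice, and no book has filled it.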

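I believe this book can fill that gap. For example, telecom companies and banks use a lot of legacy. Legacy is technical debt, but it is also the most important asset. Most books focus only on new technology and say nothing about black boxes and legacy. Without a solution for legacy and black boxes, observability and AIOps cannot succeed. If you ignore them, you end up with observability that is very limited and cannot support root cause analysis.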

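This book describes in detail most of the legacy used in banking and telecom. It includes a good deal of content that no other book covers. What other book explains SAP ERP, SIEBEL CRM, CICS DB2, and TUXEDO? The book covers roughly 50 legacy systems used in practice, which is enough to implement the complex observability that practice requires. There is more legacy, including IBM SNA LU 0, but I left it out for fear of making the book too complicated.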

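All of the legacy systems in the book were virtualized, set up internally, and used in the book’s demos. I am always cautious about releasing them externally, however. I am considering ways to share them with readers. Personal questions are welcome at any time.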

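Most services today are developed as microservices and run on a multi-cloud made up of AWS, GCP, and Azure. The cloud is connected to the on-premises environment where the legacy runs. In banking and telecom architectures, more than ten thousand microservices are deployed across many Kubernetes clusters and integrated with a variety of legacy systems. Can you see why the monitoring of the past no longer fits? Instead of relying on one commercial product as before, we now use open source, cloud, Kubernetes, and legacy together, so it is very hard to manage dynamic, complex applications with the operating practices of the past. A developer only needs to understand one business function and one programming language, but an SRE is involved in every failure that is likely to occur, so the difficulty is much higher.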

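I will not go further into zero-downtime global services for users around the world. Operating complex services is entirely possible, but they are no longer static and simple as in the past. Failures can occur anytime and anywhere, and we have reached the point where we must always think about and prepare for service reliability. I tried to reflect in this book, as much as possible, the concerns I have faced in practice. It may be light on theory, but many other books explain the theory in detail, so studying them alongside this one will help.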

What motivated me to write the book?

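My biggest motivation was a reaction against the existing way of doing things. The existing culture, methods, and theory were too inefficient and did not fit reality. Had the results been good, I would have had no objection, but they were not. I do not want to keep repeating the failures.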

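On the solution side, there are far too many problems that neither open source nor commercial observability can solve. Large observability projects are run at great expense, yet the results are poor. Inference accuracy is too low, automation fails, and root causes are not found. There are too many problems to list them all.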

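On the organizational side, I wanted to learn the culture and technology of tech companies such as Google and Meta. What I felt after working with them was deep disappointment. I decided never to idolize tech companies again. I came to think it would be better to write a book myself and describe reality accurately.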

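On the technical side, an SRE can be happy to see a single trace with hundreds of spans laying out the latency clearly. But this book is not satisfied with that level. It should lay out thousands of spans, not hundreds. Even small events inside the CPU must not be missed; they must all be shown in order so that the internal behavior can be understood. Every internal operation of a system resource must carry business context. Only then can the complexity be understood and managed. Rather than showing off technical skill, we should work to control overhead, cut costs, and improve service reliability.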

Which readers will benefit?

System developers who want to understand the internal operation of the system and the root causes of its failures.

Failures occur, but SREs don’t understand the cause. Because of time and staffing constraints, applications are restarted and the team moves on without understanding the root cause. Whether they are willing or not, developers and engineers cannot perform root cause analysis if they do not know how the system works internally, or if the observability signals are not collected accurately.

The purpose of this book is to help engineers understand the internal operation of a system, collect signals, and perform root cause analysis. That’s why I go into detail about the kernel, the VM, instrumentation, agents, I/O, and traces. I am a developer and an engineer myself. I know it’s hard and complicated, but I hope readers can work through it without giving up and become better engineers and developers.

Data engineers who want to automate observability and understand AIOps.

From a data perspective, observability is big data, and it’s a great place to apply AI. This is a transitional period, and both open source and commercial offerings have limited capabilities. At the same time, many companies are investing heavily in emerging technologies like AIOps. My experience with commercial AIOps deployments is that they are not effective at root cause analysis or at reducing noise in production. The cost/benefit ratio of AI is still far from ideal, but given how the technology is evolving, it’s worth staying interested and learning.

The analysis and aggregation of observability data is still a new field that needs to be explored and understood. Not only are the various observability signals poorly understood, but there is also a lack of open source software for loading and analyzing them. SREs need to work together with data engineers so that each complements the other’s strengths.

SREs who want to quickly identify the root cause and understand the resolution

An important part of operations is automating systems and resolving problems quickly. The ability to identify and resolve problems is especially important for SREs. With so many different development languages and so many types of problems propagating, it’s not easy to recognize and resolve failures quickly.

Recently published SRE books seem to focus more on the macro level than on the practical level of solving problems, so many SREs would benefit from a book like this one that explains approaches and solutions to the various errors that occur in production. To help you understand and solve these problems, the book explains traces in detail. There are two types of trace: distributed traces and system traces. I believe that traces are the most basic signal for observability and its starting point, so my approach is to start with traces and gradually expand to other signals. Traces are complex and difficult, but solving the problem requires distributed traces at the E2E level and system traces at the system resource level.

Infrastructure engineers in banking, telecom, and other industries with a lot of legacy who want to understand new observability.

While SRE cultures are often characterized as tech-centric, only a small percentage of organizations are tech companies. It is more important to configure observability successfully for mainstream industries such as banking, telecommunications, and manufacturing. Applying observability to legacy is more important than applying observability to new technology.

The main problems are black boxes and legacy, which make observability difficult and slow to configure. Legacy breaks the propagation of traces, and agents are often difficult to configure or not technically supported. In this book, I describe various middleware. Since legacy is often connected through EAI servers and message servers, I describe how to configure E2E observability by instrumenting this middleware. I also describe various legacy applications, such as SAP ERP and TUXEDO, and explain how to configure observability for legacy.

Architects who want to technically advance observability and understand its business value

In this book, I categorize observability into three levels. First, organize application and infrastructure observability and leverage it for root cause analysis. Second, automate the analysis and aggregation of observability data, and use AIOps for root cause analysis and failure forecasting. Third, provide reliable service to customers to minimize losses from failures, and use observability to bring business and IT together.

This book looks at observability as a tool for business improvement. In the past, observability has been considered only in the realm of systems: application, kernel, and virtual machine. In the future, observability can be used as a tool to further support and align the business and increase revenue. I’ll go beyond root cause analysis and operational automation to explain how observability can lead the business and deliver value to product owners and executives. I hope this book helps you see the possibility.

Java developers curious about observability best practices at tech companies and large enterprises

Developers love Google’s SRE culture and practices and have a certain admiration for SRE at tech companies. At my current job, my manager, team leader, and teammates come from Google SRE, so I have a great opportunity to learn about SRE as it is understood and practiced by people from tech companies.

What I’ve learned from working with them is that there are pros and cons. Legacy is the most important core system, so there are many constraints to implementing observability. The difference is that we have complex domains and business processes that tech companies don’t have.

DevOps engineers who use both open source and commercial observability

While I personally prefer open source, the company I work for is paying tens of billions of dollars to build commercial observability. Service failures can be very damaging to a banking business, so banks pay high prices for commercial observability in order to minimize failures and solve problems quickly. Observability is not just a tool for SREs; it is used across the organization, including by developers, and we are looking for ways to reduce costs and improve developer productivity.

The chapters of the book are as follows.

Chapter 1. Root cause analysis

Analyzes the limits and problems of existing observability and root cause analysis methods, and explains how to perform root cause analysis successfully.

Explains the limits of observability and the solutions to them.
Describes the important signals that make up observability, including distributed traces, system traces, events, and profiles. Beyond the details, it explains each signal’s limitations, bugs, workarounds, and future directions.
How to combine multiple traces within an event and create business-centric observability
How to use eBPF profiles in Infrastructure Observability for root cause analysis

Chapter 2. Observability approach

Most observability dashboards and alarms are inaccurate and fail. Explains the causes of failure and the solutions.

Observability is complex and has a steep learning curve. Correlations are easy for developers to understand and use; they narrow the scope of the problem and provide multiple perspectives for root cause analysis.
Visualizes correlations between various signals. Describes how to develop dashboards and charts and how to design a root cause analysis process.
Visualization of the 14 correlations between the 8 signals.

Business and IT alignment
Correlation between applications and infrastructure

Dashboards and alarms are often inaccurate. I explain the causes and the solutions and propose a different approach.
Once the correlations are constructed, I’ll show you how to develop dashboards, charts, and root cause analysis processes based on them.

Chapter 3. Trace-centric observability

When observability is built, traces break or logs conflict in server frameworks, message servers, and elsewhere. Analyzes the causes of these problems and describes cases.

Many companies are introducing E2E traces. Explains what problems arise when SREs configure E2E traces and how to solve them.
The starting point for application observability is the distributed trace. This chapter describes the internal operation of traces and agents. Traces instrument a wide variety of software, including:

Cloud-based managed services
Non-blocking application servers
Event-driven Message server
EAI server to integrate legacy
Extensions to complement the OpenTelemetry Agent

Observability is more difficult and complex than microservices development because it requires understanding the internals of the system. This chapter helps SREs understand and solve internal problems using observability.

Chapter 4. Observability cases by industry

Describes tracing cases from banking, telecom, and gaming. Complex real-world cases show the advantages and direction of observability.

In practice, observability is very complex. However, the complexity can vary considerably depending on the size of the company. While a small startup operates hundreds of Microservices, a bank operates tens of thousands of Microservices. There are many differences in complexity. This chapter explains how the actual observability is configured and provides examples for each industry.
Describes how observability is organized in practice by industry. Each industry has different goals, architectures, and business processes to target, so the observability that supports them must be organized differently.

Banking: Explains how front-end Microservices and back-end EAI servers are connected by traces and how to configure E2E traces with hundreds of spans.
Telecommunication: Describes the process of distributed transactions in the ordering process that handles new activations of combined wireline and wireless products.
Online Game: Describes the processes and applications of an online game and organizes the observability for game operations.

I’ll show you how to solve them using open source and commercial observability.

The applications used in the bank demo are as follows.

Tuxedo: a TP monitor that runs on UNIX.
CICS: a TP monitor that runs on IBM mainframes. In the demo, it manages cash and savings accounts.
SAP ERP: an enterprise resource planning application that banks use as core banking. In the demo, it processes payment services.
Customer MDM
ORACLE ERP
SWIFT CASmf: Swift middleware that supports the FTP and MQ protocols and MT message formats for communicating with the SWIFT network. The CASmf Swift Server interfaces with the SWIFT network and processes business-to-business payments and foreign exchange transactions. The Global PAYplus Payment gateway uses Tuxedo internally and processes transactions with CASmf via MQSeries.
Managed File Transfer
Lotus Notes
IBM MQ: a queue-based message server and the industry-standard message server.
DataStage ETL
FileNet BPM

The applications used in the Telecom demo are as follows.

TIBCO Order Management: supports order fulfillment and network provisioning
Portal Infranet
Siebel CRM
PeopleSoft
MetaSolv
ORACLE ASAP

Chapter 5. OpenTelemetry demo

Customizes the OpenTelemetry demo in various ways, improves its functionality, and explains practical examples.
Uses the OpenTelemetry reference demo to resolve a root cause by applying application observability.
Configures profiles, RUM, events, traces, and more, and explains how to improve them. Analyzes observability data using Promscale.

Chapter 6. Infrastructure observability

Existing infrastructure observability cannot analyze root causes. Explains the problems and configures system traces and profiles for a new kind of root cause analysis.

This is my personal favorite chapter, and it provides answers for root cause analysis. It explains why observability fails at root cause analysis and how to use system traces and system profiles to solve this.
Uses system traces and eBPF to identify delays and errors in system resources and to troubleshoot problems.
It measures unnecessary nanosecond-level delays caused by context switches, the scheduler, interrupts, and timers. It aligns kernel methods and system calls with business context, and measures 99th-percentile latency and the number of system calls through BPF profiles. By correlating distributed traces with system traces, it visualizes every section where delays and problems occur.
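
As a small worked example of the last measurement, the sketch below computes the call count and the 99th-percentile latency per business context and system call. It assumes the samples have already been collected (for example from a BPF profile) and exported as `(business_context, syscall, latency_ns)` tuples; that tuple layout, the sample values, and the function names are hypothetical, and the code does not call any BPF library.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, p=99):
    """Group latency samples by (business_context, syscall); return count and p-th percentile."""
    latencies = defaultdict(list)
    for business_ctx, syscall, latency_ns in samples:
        latencies[(business_ctx, syscall)].append(latency_ns)
    return {
        key: {"count": len(vals), "p99_ns": percentile(vals, p)}
        for key, vals in latencies.items()
    }


# Made-up samples: 100 write() calls with latencies 1,000 ns .. 100,000 ns,
# plus three futex() calls, all attributed to the same business process.
samples = [("account-transfer", "write", i * 1_000) for i in range(1, 101)]
samples += [("account-transfer", "futex", ns) for ns in (500, 700, 90_000)]

summary = summarize(samples)
for key, stats in summary.items():
    print(key, stats)
```

The nearest-rank definition is used here because it always returns an observed sample, which is convenient when the next step is to look up the trace that produced it; interpolating definitions would give slightly different values.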

It also explains how to apply Cilium and chaos engineering to infrastructure observability.

Chapter 7. Infrastructure anomaly detection

Existing anomaly detection has low accuracy and a lot of noise, so it mostly fails. Explains the problems and configures anomaly detection that works.

Anomaly detection is a useful technology that can complement infrastructure observability. However, most anomaly detection is built incorrectly or fails. Explain what the reasons for failure are and how to solve them.
Configure anomaly detection using OpenSearch anomaly detection. Configure infrastructure observability to identify and proactively respond to anomalies in system resources.
Introduce various anomaly detection cases and explain how to configure reliable anomaly detection.

Chapter 8. Observability data analysis

With commercial observability, observability data can be analyzed in detail. Explains how to analyze observability data easily using open source.

The biggest reason for using commercial observability instead of open source is data analysis. With open source, observability data is difficult to analyze. I will present a solution to this. Observability data is primarily used to solve problems through root cause analysis, but it can also be used for a variety of other purposes, including dashboards, alarms, and AIOps.

In the beginning, observability requirements are simple, but as observability becomes more sophisticated, complex requirements need to be implemented and a separate tool for analyzing observability data is needed. Describes how to store collected traces and metrics in Promscale, and how to view and analyze observability data for complex requirements.

Chapter 9. Observability data aggregation

Commercial observability cannot analyze root causes in a complex cluster. Analyzes root causes successfully through Java observability.

Select Presto and Druid for building a data-lake for aggregation of observability signals and describe the pros and cons.

Describes the process of using observability to debug and understand a Druid cluster consisting of complex microservices. To understand complex open source software, it uses VisualVM to explain sampling and profiling of CPU, memory, and locks.
This chapter explains Java observability in detail as the way to solve this problem. Java is the most popular development language. Virtual threads and non-blocking code are difficult to monitor, and observability does not yet support the latest Java technology. Explains how to apply observability to non-blocking code and coroutines.

Chapter 10. AIOps

Analyzes why the AIOps of commercial observability fails at root cause analysis, and proposes a way to build AIOps successfully.

Having explained how to search, analyze, and aggregate data, I now explain how to implement AIOps with LangChain and RAG using the collected observability data.
The chapter analyzes why traditional AIOps fails and explains the various data required to configure AIOps, such as a CMDB. To build AIOps, you can use OpenSearch’s agents and tools to build RAG quickly. AIOps built using OpenSearch RAG includes a knowledge base, which can automate root cause analysis and the inference of failures.