
Observability in the AI-Native Era: AIOps: Building, observing, and operating resilient systems in the artificial intelligence age
Author(s): Andreas Grabner (Author), Hilliary Lipsig (Author), Robert Rati (Author)
- Publisher: Packt Publishing – ebooks Account
- Publication Date: April 9, 2026
- Language: English
- Print length: 318 pages
- ISBN-10: 1806389592
- ISBN-13: 9781806389599
Book Description
Discover how AIOps is transforming the observability landscape for cloud-native and traditional systems. Learn how to build, monitor, and operate resilient services using AI-driven dynamic insights for smarter and more scalable operations.
Key Features
- Practical Integration of AI and Observability in Modern Engineering Workflows
- Real-World Use Cases Grounded in Industry Experience
- Tailored for Modern Engineering Roles and Organizations
Book Description
With OpenTelemetry, observability has become central to building and operating cloud-native distributed systems. At the same time, advances in AI are transforming how we extract value from the growing volume of observability data. This book shows you how to implement scalable observability, improve engineering efficiency with AI, and extend observability practices from production into development through modern internal developer platforms.
You’ll begin with the fundamentals of observability (logs, metrics, and traces), then learn how AIOps enhances signal correlation, anomaly detection, and root-cause analysis. Through real-world examples and architectural guidance, the book demonstrates how to integrate AIOps into existing systems and build pipelines that proactively detect and resolve issues before users are affected.
You’ll also explore best practices for expanding observability across the software development lifecycle, enabling AI-powered observability as a self-service capability for engineers. Using tools such as OpenTelemetry, Prometheus, Elasticsearch, and Grafana alongside machine learning models, you’ll learn how to automate diagnostics and remediation.
By the end of this book, you’ll be able to design and implement AIOps-enabled observability solutions that make cloud-native systems more resilient and efficient.
What you will learn
- Build observability pipelines with logging, metrics, and tracing
- Apply AI/ML for anomaly detection and root cause analysis
- Correlate signals from multiple sources for better incident triage
- Automate responses with self-healing and remediation scripts
- Integrate tools like OpenTelemetry, Prometheus, and Elasticsearch
- Design scalable architectures for intelligent monitoring
Who this book is for
This book is for software engineers and engineering leaders working on teams with operational responsibilities, such as platform engineering, site reliability engineering (SRE), DevOps, or application development, who want to integrate AIOps capabilities into their workflows. If your team is responsible for building and running high-performing, resilient software systems, this book is for you.
Table of Contents
- Observability: The art of turning data into information
- The Elephant in the Room: Artificial Intelligence
- From Observability to AIOps and the Use Cases it solves today!
- Financial One ACME: Implementing AIOps!
- Democratizing Observability: A Primer to Self-Service Platforms
- Observability Agents in Action
- Financial One ACME: How to move from AIOps to Agentic Platforms
- Evolving Operations: Proactive -> Preventive -> Self-Driven Architecture
- Navigating AI Pitfalls: Governance, Cost & Ethical Guardrails
- Transforming Financial One ACME with AI-Driven Observability
Editorial Reviews
About the Author
Andreas Grabner is a technical advocate for making distributed systems observable and making automated data-driven decisions across the software development lifecycle. In his capacity as a CNCF ambassador and a DevRel at Dynatrace, he connects and educates global software engineering communities on building and continuously validating digital services for resiliency, high availability, and security.
Since his early days, he has been passionate about software quality and performance engineering, as they result in excellent digital products. Andi uses his advocacy platforms to share best practices on topics such as observability, progressive delivery, DevOps, site reliability engineering, platform engineering, and digital business operations.
Hilliary Lipsig is an autodidact and start-up veteran who has frequently learned and applied technologies to get a job done. She’s had her hand in every part of the application delivery process, honing her skills originally as a quality engineer. Hilliary is an IT polyglot, able to talk the lingo of both the Operations and Development teams. She’s currently a Principal Site Reliability Engineer at Red Hat Inc., working on Kubernetes-based platforms. She’s passionate about GitOps, continuous integration, scalable processes, consistency in tooling, and good developer documentation. Her open source activities include contributions to the CNCF Glossary and she’s a member of the Code of Conduct Committee for Kubernetes.
Robert Rati is a veteran platform engineer who has worked at small, medium, and large corporations in regulated industries ranging from wireless communications to the financial sector. He is passionate about reducing noise and enabling teams to focus on creating business value. He emphasises maintainability, consistency, user friendliness, and productivity when planning projects.
Robert is currently a Senior Software Engineer with Second Front.
