
Data Pipelines with Apache Airflow, Second Edition: Orchestration for data and AI
Authors: Julian de Ruiter, Ismael Cabral, Kris Geusebroek, Daniel van der Ende, Bas Harenslak
- Publisher: Manning Publications
- Publication Date: January 27, 2026
- Edition: 2nd
- Language: English
- Print Length: 512 pages
- ISBN-10: 1633436373
- ISBN-13: 9781633436374
Book Description
Using real-world scenarios and examples, this book teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack. Part reference and part tutorial, the book illustrates each technique with engaging hands-on examples, from training machine learning models for generative AI to optimizing delivery routes.
In Data Pipelines with Apache Airflow, Second Edition you’ll learn how to:
• Master the core concepts of Airflow architecture and workflow design
• Schedule data pipelines using the Dataset API and timetables, including complex irregular schedules (see the short sketch after this list)
• Develop custom Airflow components for your specific needs
• Implement comprehensive testing strategies for your pipelines
• Apply industry best practices for building and maintaining Airflow workflows
• Deploy and operate Airflow in production environments
• Orchestrate workflows in container-native environments
• Build and deploy Machine Learning and Generative AI models using Airflow
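To make the scheduling topics above concrete, here is a minimal sketch (not taken from the book) of data-aware scheduling using the Airflow 2.x Dataset API, the same idea Airflow 3 carries forward as Assets. The bucket path, DAG names, and task bodies are placeholders:

# A producer DAG updates a Dataset; a consumer DAG is scheduled on that Dataset
# instead of on a clock. All identifiers below are illustrative placeholders.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

sales_data = Dataset("s3://example-bucket/sales/daily.parquet")  # placeholder URI

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def produce_sales():
    @task(outlets=[sales_data])
    def extract_and_store():
        # Write the day's sales extract to the dataset location (placeholder logic).
        ...
    extract_and_store()

@dag(schedule=[sales_data], start_date=datetime(2026, 1, 1), catchup=False)
def transform_sales():
    @task
    def transform():
        # Runs whenever the producer DAG updates the sales dataset.
        ...
    transform()

produce_sales()
transform_sales()

Because the downstream DAG is triggered by dataset updates rather than by the clock, it runs only when the producer actually publishes fresh data.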
About the Technology
Apache Airflow provides a unified platform for collecting, consolidating, cleaning, and analyzing data. With its easy-to-use UI, powerful scheduling and monitoring features, plug-and-play options, and flexible Python scripting, Airflow makes it easy to implement secure, consistent pipelines for any data or AI task.
About the book
Data Pipelines with Apache Airflow, Second Edition teaches you how to build, monitor, and maintain effective data workflows. This new edition adds comprehensive coverage of Airflow 3 features, such as event-driven scheduling, dynamic task mapping, DAG versioning, and Airflow’s entirely new UI. The numerous examples address common use cases like data ingestion and transformation and connecting to multiple data sources, along with AI-aware techniques such as building RAG systems.
What’s inside
• Deploying data pipelines as Airflow DAGs
• Time and event-based scheduling strategies
• Integrating with databases, LLMs, and AI models
• Deploying Airflow using Kubernetes (see the container-task sketch after this list)
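As a rough illustration of the container-native side of this list, the sketch below (not from the book) launches a task as a Kubernetes pod with the KubernetesPodOperator, assuming a recent cncf.kubernetes provider release. The image name, namespace, and DAG id are placeholders:

# Run a pipeline step as an isolated Kubernetes pod instead of on an Airflow worker.
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def containerized_pipeline():
    KubernetesPodOperator(
        task_id="clean_sales_data",
        name="clean-sales-data",
        namespace="airflow",                                    # placeholder namespace
        image="example.io/data-pipelines/clean-sales:latest",   # placeholder image
        cmds=["python", "-m", "cleaner"],                       # placeholder entrypoint
        get_logs=True,
    )

containerized_pipeline()

Packaging the heavy lifting into its own image keeps each task's dependencies isolated from the Airflow workers.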
About the reader
For data engineers, machine learning engineers, DevOps, and sysadmins with intermediate Python skills.
About the authors
Julian de Ruiter, Ismael Cabral, Kris Geusebroek, Daniel van der Ende, and Bas Harenslak are seasoned data engineers and Airflow experts.
Table of Contents
Part 1
1 Meet Apache Airflow
2 Anatomy of an Airflow DAG
3 Time-based scheduling
4 Asset-aware scheduling
5 Templating tasks using the Airflow context
6 Defining dependencies between tasks
Part 2
7 Triggering workflows with external input
8 Communicating with external systems
9 Extending Airflow with custom operators and sensors
10 Testing
11 Running tasks in containers
Part 3
12 Best practices
13 Project: Finding the fastest way to get around NYC
14 Project: Keeping family traditions alive with Airflow and generative AI
Part 4
15 Operating Airflow in production
16 Securing Airflow
17 Airflow deployment options
A Running code samples
B Prometheus metric mapping
