Sticking with Airflow for a second, this is a stellar post where the Lyft data eng team talks about their production Airflow deployment (500+ DAGs!). They discuss:
- overall architecture
- monitoring & SLAs
- customizations they’ve made
- production performance and reliability
Even while outlining a very sophisticated Airflow environment, the post makes it hard to miss the areas of the product where duct tape needed to be applied. The monitoring system, in particular, feels somewhat rudimentary relative to its criticality; there is clearly a lot of scope for a managed service to add value here.