

Apache Airflow is a popular platform to create, schedule and monitor workflows in Python. It has more than 15k stars on GitHub and it's used by data engineers at companies like Twitter, Airbnb and Spotify. If you're using Apache Airflow, your architecture has probably evolved based on the number of tasks and their requirements. We started with a few hundred DAGs to execute all our data engineering tasks. We wanted to keep using Airflow to orchestrate machine learning pipelines, but we soon realized that we needed a solution to execute machine learning tasks remotely.

In this post you'll learn:

- The different strategies for scaling the worker nodes in Airflow.
- How machine learning tasks differ from traditional ETL pipelines.
- How to easily execute Airflow tasks on the cloud.
- How to get automatic version control for each machine learning task.

Apache Airflow has a multi-node architecture based on a scheduler, worker nodes, a metadata database, a web server and a queue service.
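If you haven't written a DAG before, here is a minimal sketch of what one looks like, assuming Airflow 2.x; the DAG id and the extract/train callables are made-up placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    # Placeholder: pull raw data from a source system.
    print("extracting data")


def train():
    # Placeholder: fit a model on the extracted data.
    print("training model")


with DAG(
    dag_id="example_ml_pipeline",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)

    extract_task >> train_task  # train runs only after extract succeeds
```

The scheduler parses this file, queues each task when its upstream dependencies succeed, and hands the queued task to the executor.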

One of the first choices when using Airflow is the type of executor. The executor communicates with the scheduler to allocate resources for each task as it's queued. The difference between executors comes down to the resources they have available.

Example Airflow configuration
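The executor is selected in airflow.cfg. A minimal sketch, assuming the default config file layout; the broker and backend URLs are placeholders:

```ini
# airflow.cfg — pick one executor; the [celery] settings below
# only apply when using the CeleryExecutor.
[core]
executor = LocalExecutor

[celery]
# Required for CeleryExecutor: a Redis or RabbitMQ broker.
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```

The same setting can also be supplied as the AIRFLOW__CORE__EXECUTOR environment variable, which is convenient in containerized deployments.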
Sequential Executor

The default executor makes it easy to test Airflow locally. It runs tasks sequentially on one machine and uses SQLite to store the task metadata.

Local Executor

The local executor can run tasks in parallel and requires a database that supports parallelism, like PostgreSQL. While you can run the local executor in production, it's common to migrate to the Celery executor to improve availability and scalability.

Celery Executor

The Celery executor requires you to set up Redis or RabbitMQ to distribute messages to workers. Airflow then distributes tasks to Celery workers that can run on one or multiple machines.

Kubernetes Executor

The Kubernetes executor creates a new pod for every task instance. It allows you to dynamically scale up and down based on the task requirements. This is the executor we use to run up to 256 concurrent data engineering tasks.
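With the Kubernetes executor, per-task resource requirements can be declared on the task itself via executor_config. A minimal sketch, assuming Airflow 2.x and the kubernetes Python client; the DAG id and the CPU/memory numbers are made up for illustration:

```python
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.operators.python import PythonOperator


def train():
    # Placeholder training step.
    print("training model")


with DAG(
    dag_id="example_k8s_resources",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Ask the KubernetesExecutor for a bigger pod for this one task only.
    train_task = PythonOperator(
        task_id="train",
        python_callable=train,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # Airflow runs the task in a container named "base"
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "8Gi"},
                                limits={"cpu": "4", "memory": "16Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )
```

Because each task instance gets its own pod, these resources are only reserved for the duration of that task, which is what makes running hundreds of concurrent tasks practical.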
