Customized Airflow Pipeline
Airflow was released two years after Luigi but gained more popularity due to its scalability, visualization capabilities, and flexibility in building workflows.

Airflow Setup (Local Mac)
Setting up the Airflow ecosystem requires more effort compared to Luigi. Let's go over Lady H.'s notes. She experimented with both Docker and local setups on Windows and Mac but found that the easiest approach was a local setup on Mac or Linux. Here's how she got Airflow running on Mac:
To install Airflow, follow the steps here
It's a good idea to note the Python version used by your Airflow installation, as this will help you locate its site-packages later.
After the installation succeeds, follow the steps below each time you want to start Airflow:
- Type `airflow standalone` in your terminal to start Airflow.
- Then you can access http://localhost:8080/home, which is the interface listing all the DAGs (workflows).
To create your own DAG
- Make sure you have defined AIRFLOW_HOME during the installation stage, like this: `export AIRFLOW_HOME=~/airflow`.
- Find the file `airflow.cfg` in your AIRFLOW_HOME folder. In this file, make sure to specify the DAGs folder, like this: `dags_folder = ~/airflow/dags`. This is where you will add new DAGs as .py files.
- When creating your DAG in the .py file, make sure to define the `dag_id`; this is the DAG name shown in the DAG list (a minimal skeleton follows this list).
- To check whether your DAG has been added to the DAG list, type `airflow dags list` in a terminal and you will see all the DAGs. If you want to find DAGs by keyword, you can type `airflow dags list | grep [key_word]`. For example, Lady H. was searching for DAGs with "super" in their names:
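To make the `dag_id` point concrete, here is a minimal DAG file sketch, assuming Airflow 2.x; the file name hello_dag.py and the dag_id are made up for illustration:

```python
# Hypothetical file saved as ~/airflow/dags/hello_dag.py (inside dags_folder)
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.3

with DAG(
    dag_id="hello_dag",               # this id is what appears in `airflow dags list`
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,           # only runs when triggered manually
    catchup=False,
) as dag:
    placeholder = EmptyOperator(task_id="placeholder")
```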
If your DAG isn't shown in the list
Check whether all the imported packages are installed in the site-packages of the Python environment used by your Airflow.
Check whether any error is shown at the top of http://localhost:8080/home; if so, fix it.
Once your DAG appears in the list, you can run it through the localhost interface:

If you know how to set up Airflow in other ways, such as using Docker or on a Windows system, you're welcome to share your ideas here!
Simple Airflow Pipeline
Airflow's learning curve is steeper than Luigi's, so to help you along, Lady H. decided to exhibit a simple Airflow pipeline that covers the key learning points.
This simple pipeline has only two tasks: a data splitting task followed by a model training task. The whole workflow is one DAG and can be defined within a single .py file.
💻 Check Simple Airflow DAG >>

At the start of this DAG, you'll import various Python packages. In addition to built-in Python and Airflow packages, external libraries like lightgbm and sklearn must be installed in the site-packages directory of the Python environment used by Airflow. Otherwise, the DAG won't show up in the DAG list. Next, you'll need to define the parameters for the DAG. Make sure dag_id and tags are unique for each DAG.
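As a hedged sketch (the actual dag_id, tags, and schedule in Lady H.'s DAG may differ), the top of such a DAG file could look roughly like this, assuming Airflow 2.x:

```python
from datetime import datetime

# External libraries: these must live in the site-packages of the Python
# environment that runs Airflow, otherwise the DAG will not appear in the list
import lightgbm as lgb
from sklearn.model_selection import train_test_split

from airflow import DAG
from airflow.operators.python import PythonOperator

# DAG-level parameters: dag_id and tags should be unique across your DAGs
dag = DAG(
    dag_id="super_simple_pipeline",    # illustrative id
    tags=["super"],                    # illustrative tag
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # trigger manually from the web UI
    catchup=False,
)
```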
Now comes the core part, where each task needs to be defined. The logic for data splitting and model training is implemented within functions, and each task is represented by a PythonOperator. The task_id parameter specifies the unique ID for the task, and the python_callable parameter links to the corresponding function for that task. Additionally, op_kwargs stores the function's parameters as key-value pairs.
For example, the logic for split_data_task is defined in the split_data() function, so its python_callable points to split_data. The split_data() function also has a configurable parameter called label. In this case, the value of label is set to "species", so 'label': 'species' is included in the task's op_kwargs.
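Putting that description into code, the two tasks might be wired up roughly like this (a sketch that assumes split_data() and train_model() are defined earlier in the same file):

```python
split_data_task = PythonOperator(
    task_id="split_data_task",         # unique id shown in the graph/tree views
    python_callable=split_data,        # the function holding the splitting logic
    op_kwargs={"label": "species"},    # passed to split_data(label="species")
    dag=dag,
)

train_model_task = PythonOperator(
    task_id="train_model_task",
    python_callable=train_model,
    dag=dag,
)

# model training only starts after data splitting succeeds
split_data_task >> train_model_task
```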
You may also have noticed the xcom_push() and xcom_pull() used in the code. In this DAG, they are used to transfer data between tasks (a sketch of the pattern follows this list):
- In the `split_data()` function, `xcom_push()` is used to push the absolute path of the data output. Remember, Airflow only accepts absolute paths here.
- In order to load the split data, the `train_model()` function uses `xcom_pull()` to locate the pushed data through `task_ids` and the `key` specified in `xcom_push()`. In this example, `task_ids` points to the `task_id` of `split_data_task`.
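A rough sketch of that push/pull pattern is below; it assumes Airflow 2.x, where the task instance ti is passed into the callable, and the key name and file path are made up for illustration:

```python
import os

def split_data(label, ti):
    # ... split the data by `label` and write the result to disk ...
    output_path = os.path.abspath("split_data.pkl")        # illustrative output file
    # push the ABSOLUTE path so the downstream task can find the file
    ti.xcom_push(key="data_path", value=output_path)

def train_model(ti):
    # pull the path pushed by split_data_task, via its task_id and the same key
    data_path = ti.xcom_pull(task_ids="split_data_task", key="data_path")
    # ... load the data from data_path and train the model ...
```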


After your DAG shows up in the DAG list, you can run it, and you may get errors in a certain task. For example, as shown below, the data splitting task succeeded, so it's marked in green, while the model training task failed, so it's marked in red:

To check the error log, click the failed task and open its log; it will indicate which lines of code caused the errors.

After some trial and error, you will finally get green lights on every task of the DAG 🥳, and from the Tree view you can see an overview of all the historical attempts.