Customized Airflow Pipeline
Airflow was released two years after Luigi but gained more popularity due to its scalability, visualization capabilities, and flexibility in building workflows.

Airflow Setup (Local Mac)
Setting up the Airflow ecosystem requires more effort compared to Luigi. Let's go over Lady H.'s notes. She experimented with both Docker and local setups on Windows and Mac but found that the easiest approach was a local setup on Mac or Linux. Here's how she got Airflow running on Mac:
To install Airflow, follow the steps here
It's a good idea to note the Python version used by your Airflow installation, as this will help you locate its site-packages later.
After the installation succeeds, follow the steps below each time you want to start Airflow:
- Type `airflow standalone` in your terminal to start Airflow.
- Then you can access http://localhost:8080/home, which is the interface listing all the DAGs (workflows).
To create your own DAG
- Make sure you have defined AIRFLOW_HOME during the installation stage, like this: `export AIRFLOW_HOME=~/airflow`.
- Find the file `airflow.cfg` in your AIRFLOW_HOME folder. In this file, make sure to specify the DAGs folder, like this: `dags_folder = ~/airflow/dags`. This is where you will add new DAGs as .py files.
- When creating your DAG in the .py file, make sure to define the `dag_id`; this is the DAG name shown in the DAG list (a minimal skeleton follows this list).
- To check whether your DAG has been added to the DAG list, type `airflow dags list` in a terminal and you will see all the DAGs. If you want to find DAGs by keyword, you can type `airflow dags list | grep [key_word]`. For example, Lady H. was searching for DAGs with "super" in their names:
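To make the `dag_id` point concrete, here is a minimal DAG file sketch, assuming Airflow 2.x; the file name hello_dag.py and the dag_id are made up for illustration:

```python
# Hypothetical file saved as ~/airflow/dags/hello_dag.py (inside dags_folder)
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.3

with DAG(
    dag_id="hello_dag",               # this id is what appears in `airflow dags list`
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,           # only runs when triggered manually
    catchup=False,
) as dag:
    placeholder = EmptyOperator(task_id="placeholder")
```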
If your DAG isn't shown in the list
Check whether all the imported packages are installed in the site-packages of the Python environment used by your Airflow.
Check whether any error is shown at the top of http://localhost:8080/home; if so, fix it.
Once your DAG appears in the list, you can run it through the localhost interface:

If you know how to set up Airflow in other ways, such as using Docker or on a Windows system, you're welcome to share your ideas here!
Simple Airflow Pipeline
Airflow's learning curve is steeper than Luigi's, so to help you along, Lady H. decided to exhibit a simple Airflow pipeline that covers the key learning points.
This simple pipeline has only two tasks: a data splitting task followed by a model training task. The whole workflow is one DAG and can be defined within a single .py file.
💻 Check Simple Airflow DAG >>

At the start of this DAG, you'll import various Python packages. In addition to built-in Python and Airflow packages, external libraries like lightgbm and sklearn must be installed in the site-packages directory of the Python environment used by Airflow. Otherwise, the DAG won't show up in the DAG list. Next, you'll need to define the parameters for the DAG. Make sure dag_id and tags are unique for each DAG.
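As a hedged sketch (the actual dag_id, tags, and schedule in Lady H.'s DAG may differ), the top of such a DAG file could look roughly like this, assuming Airflow 2.x:

```python
from datetime import datetime

# External libraries: these must live in the site-packages of the Python
# environment that runs Airflow, otherwise the DAG will not appear in the list
import lightgbm as lgb
from sklearn.model_selection import train_test_split

from airflow import DAG
from airflow.operators.python import PythonOperator

# DAG-level parameters: dag_id and tags should be unique across your DAGs
dag = DAG(
    dag_id="super_simple_pipeline",    # illustrative id
    tags=["super"],                    # illustrative tag
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # trigger manually from the web UI
    catchup=False,
)
```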
Now comes the core part, where each task needs to be defined. The logic for data splitting and model training is implemented within functions, and each task is represented by a PythonOperator. The task_id parameter specifies the unique ID for the task, and the python_callable parameter links to the corresponding function for that task. Additionally, op_kwargs stores the function's parameters as key-value pairs.
For example, the logic for split_data_task is defined in the split_data() function, so its python_callable points to split_data. The split_data() function also has a configurable parameter called label. In this case, the value of label is set to "species", so 'label': 'species' is included in the task's op_kwargs.
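Putting that description into code, the two tasks might be wired up roughly like this (a sketch that assumes split_data() and train_model() are defined earlier in the same file):

```python
split_data_task = PythonOperator(
    task_id="split_data_task",         # unique id shown in the graph/tree views
    python_callable=split_data,        # the function holding the splitting logic
    op_kwargs={"label": "species"},    # passed to split_data(label="species")
    dag=dag,
)

train_model_task = PythonOperator(
    task_id="train_model_task",
    python_callable=train_model,
    dag=dag,
)

# model training only starts after data splitting succeeds
split_data_task >> train_model_task
```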
You may also have noticed the xcom_push() and xcom_pull() used in the code. In this DAG, they are used to transfer data between tasks (a sketch of the pattern follows this list):
- In the `split_data()` function, `xcom_push()` is used to push the absolute path of the data output. Remember, Airflow only accepts absolute paths here.
- In order to load the split data, the `train_model()` function uses `xcom_pull()` to locate the pushed data through `task_ids` and the `key` specified in `xcom_push()`. In this example, `task_ids` points to the `task_id` of `split_data_task`.
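A rough sketch of that push/pull pattern is below; it assumes Airflow 2.x, where the task instance ti is passed into the callable, and the key name and file path are made up for illustration:

```python
import os

def split_data(label, ti):
    # ... split the data by `label` and write the result to disk ...
    output_path = os.path.abspath("split_data.pkl")        # illustrative output file
    # push the ABSOLUTE path so the downstream task can find the file
    ti.xcom_push(key="data_path", value=output_path)

def train_model(ti):
    # pull the path pushed by split_data_task, via its task_id and the same key
    data_path = ti.xcom_pull(task_ids="split_data_task", key="data_path")
    # ... load the data from data_path and train the model ...
```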


After your DAG shows up in the DAG list, you can run it, and you may get errors in a certain task. For example, as shown below, the data splitting task succeeded, so it's marked in green, while the model training task failed, so it's marked in red:

To check the error log, click the failed task and open its log; it will indicate which lines of code caused the errors.

After some trial and error, you will finally get green lights on every task of the DAG 🥳, and from the Tree view you can see an overview of all the historical attempts.