In this blog, I will summarize some Airflow tips: aha moments I had after carefully going through the documentation for the parts I had not paid much attention to at work.
Airflow has two core components. One is the metadata database, which maintains information about DAGs and task statuses. The other is the scheduler, which parses DAG files and uses the information stored in the metadata database to decide when each task should be executed. It is critical to keep a DAG file very light, like a configuration file, so that the scheduler needs less time and fewer resources to process it at each heartbeat. A good takeaway is that we don't do any actual data processing in the DAG files themselves; all data processing should happen inside the functions passed to the python_callable parameter of an Operator.
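Below is a minimal sketch of that pattern, assuming Airflow 2.x style imports; the DAG ID, schedule, and function name are made up for illustration. The module-level code only declares the DAG and its task, while all the real work sits inside the python_callable.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_sales_data(**context):
    # All heavy lifting (reading inputs, transforming, writing results)
    # happens here at task run time, not at DAG parse time.
    print("processing data for", context["ds"])


# Module-level code stays as light as a configuration file, because the
# scheduler re-parses this file at every heartbeat.
with DAG(
    dag_id="example_light_dag",      # hypothetical DAG ID
    start_date=datetime(2020, 9, 1),
    schedule_interval="0 7 * * *",   # daily at 7 am UTC
    catchup=False,
) as dag:
    PythonOperator(
        task_id="process_sales_data",
        python_callable=process_sales_data,
    )
```

Because the scheduler re-parses this file at every heartbeat, anything expensive left at module level (API calls, database queries, heavy imports) would run on every parse rather than on every DAG run.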
Changing the DAG ID of an existing DAG is equivalent to creating a brand-new DAG. Don't rename a DAG ID unless it is really necessary. Renaming causes extra trouble: you lose all of the DAG run history, and Airflow will attempt to backfill all the past DAG runs again if catchup is turned on.
Deleting a DAG file doesn't erase its DAG run history and other metadata. Either use the delete button in the Airflow UI or the airflow delete_dag CLI command (renamed to airflow dags delete in Airflow 2), because both of these trigger the deletion of the corresponding records from Airflow's metadata database.
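As a rough illustration only, the same metadata cleanup can also be done programmatically. The sketch below assumes an Airflow 2.x deployment with the stable REST API and a basic-auth backend enabled; the host, credentials, and DAG ID are placeholders.

```python
import requests

# Placeholder host, credentials, and DAG ID -- replace with real values.
AIRFLOW_HOST = "http://localhost:8080"
DAG_ID = "dag_to_remove"

# DELETE /api/v1/dags/{dag_id} removes the DAG's metadata (run history,
# task instances, and so on) from the metadata database, the same effect
# as the UI delete button or the CLI command.
response = requests.delete(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}",
    auth=("admin", "admin"),
)
response.raise_for_status()
```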
The most difficult part of Airflow for me is the execution time. The Airflow scheduler triggers a DAG run at the end of its schedule period rather than at the beginning; in other words, a run is triggered only once its time period has fully passed. The execution time in Airflow is therefore not the actual run time. For example, suppose we set a schedule of 7 am UTC daily. At 7 am UTC on 2020-09-17, a DAG run starts, and its execution time is 7 am UTC, 2020-09-16, because, as mentioned above, the run is triggered at the end of its time period: the 2020-09-16 run covers the period from 7 am UTC, 2020-09-16 to 7 am UTC, 2020-09-17.
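Here is a small sketch to make that concrete, again assuming Airflow 2.x style imports and a hypothetical DAG ID: the templated execution date (ds) printed by the run that starts at 7 am UTC on 2020-09-17 is 2020-09-16.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def show_execution_date(ds, **kwargs):
    # For the run that starts at 7 am UTC on 2020-09-17, `ds` is
    # "2020-09-16": the start of the period the run covers, not the
    # wall-clock time at which it actually runs.
    print(f"execution date: {ds}")


with DAG(
    dag_id="example_execution_date_dag",  # hypothetical DAG ID
    start_date=datetime(2020, 9, 1),
    schedule_interval="0 7 * * *",        # daily at 7 am UTC
    catchup=False,
) as dag:
    PythonOperator(
        task_id="show_execution_date",
        python_callable=show_execution_date,
    )
```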
Catchup is a feature in Airflow that allows us to backfill the runs that were missed between the start_date and now. There are two ways to configure it. One is at the cluster level: set catchup_by_default = True in airflow.cfg. The other is at the DAG level, for example dag = DAG("dag-name", default_args=default_args, schedule_interval="0 * * * *", catchup=True), and then pass the dag variable into each Operator object. It is good practice to make sure DAGs are idempotent and that each DAG run is independent of the others and of the actual run date, meaning the result of running the DAG multiple times should be the same as the result of running it once.
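A minimal sketch of the DAG-level setting, with illustrative names: with catchup=True and a start_date in the past, the scheduler creates one DAG run per hourly interval between the start_date and now, which is exactly why each run has to be idempotent.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(ds, **kwargs):
    # Idempotent by design: the run for a given execution date always
    # (re)writes the same partition, so replaying it is safe.
    print(f"loading partition for {ds}")


dag = DAG(
    "example-catchup-dag",          # hypothetical DAG ID
    start_date=datetime(2020, 9, 1),
    schedule_interval="0 * * * *",  # hourly
    catchup=True,  # scheduler backfills every interval since start_date
)

load = PythonOperator(
    task_id="load_partition",
    python_callable=load_partition,
    dag=dag,  # the dag variable is passed into the Operator object
)
```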
After understanding these peculiarities of Airflow, I feel much more confident handling loads of Airflow tasks with these aha moments in mind.