Welcome to Apache Airflow Overview. After watching this video, you will be able to: recognize Apache Airflow as a platform to programmatically author, schedule, and monitor workflows; list the main features and principles of Apache Airflow; and list common use cases for Apache Airflow.

Apache Airflow is an open source workflow orchestration tool supported by an active community. It is a platform that lets you build and run workflows, such as batch data pipelines. In Apache Airflow, a workflow is represented as a DAG (a Directed Acyclic Graph) and contains individual pieces of work, called tasks, arranged with their dependencies. Note that unlike Big Data tools such as Apache Kafka, Apache Storm, Apache Spark, or Apache Flink, Apache Airflow is not a data streaming solution. It is primarily a workflow manager.

Let's take a look at a simplified overview of Apache Airflow's basic components. Airflow comes with a built-in Scheduler, which handles the triggering of all scheduled workflows and is responsible for submitting individual tasks from each scheduled workflow to the Executor. The Executor handles the running of these tasks by assigning them to Workers, which then run the tasks. The Web Server serves Airflow's powerful interactive User Interface. From this UI, you can inspect, trigger, and debug any of your DAGs and their individual tasks. The DAG Directory contains all of your DAG files, ready to be accessed by the Scheduler, the Executor, and the Executor's Workers. Finally, Airflow hosts a Metadata Database, which is used by the Scheduler, the Executor, and the Web Server to store the state of each DAG and its tasks.

A DAG specifies the dependencies between tasks and the order in which to execute them; the tasks themselves describe what to do. In this example DAG, the tasks include ingesting data, analyzing data, saving the data, generating reports, and triggering other systems, such as reporting any errors by email.
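The core idea of a DAG can be sketched in plain Python. This is a conceptual illustration using only the standard library, not Airflow's own API, and the task names are hypothetical, mirroring the example pipeline above: each task runs only after everything upstream of it has finished.

```python
# Conceptual sketch: a DAG is just a set of tasks plus acyclic
# dependencies, and a scheduler runs each task only after all of
# its upstream tasks have succeeded. (Plain Python, not Airflow.)
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key runs only after every task in
# its value set has completed.
dag = {
    "ingest_data": set(),
    "analyze_data": {"ingest_data"},
    "save_data": {"analyze_data"},
    "generate_report": {"save_data"},
    "email_errors": {"generate_report"},
}

# A topological sort yields an execution order that respects
# every dependency; because this chain is linear, the order is unique.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)
# ['ingest_data', 'analyze_data', 'save_data', 'generate_report', 'email_errors']
```

In a real Airflow DAG file you would express the same structure with operator classes and Python code rather than a bare dictionary, but the underlying model, tasks ordered by acyclic dependencies, is the same.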
Let's have a look at the life cycle of a task's state. In this diagram, you can see how Apache Airflow might assign states to a task during its life cycle.

No status: the task has not yet been queued for execution.
Scheduled: the Scheduler has determined that the task's dependencies are met and has scheduled it to run.
Removed: for some reason, the task has vanished from the DAG since the run started.
Upstream failed: an upstream task has failed.
Queued: the task has been assigned to the Executor and is waiting for a Worker to become available.
Running: the task is being run by a Worker.
Success: the task finished running without errors.
Failed: the task hit an error during execution and failed to run.
Up for retry: the task failed but has retry attempts left and will be rescheduled.

Ideally, a task should flow through the Scheduler from 'no status', to 'scheduled', to 'queued', to 'running', and finally to 'success'.

Now, let's have a look at the five main features and benefits of Apache Airflow.

Pure Python: create your workflows using standard Python. This allows you to maintain full flexibility when building your data pipelines.
Useful UI: monitor, schedule, and manage your workflows via a sophisticated web app, offering you full insight into the status of your tasks.
Integration: Apache Airflow provides many plug-and-play integrations, such as IBM Cloudant, that are ready to execute your tasks.
Easy to use: anyone with Python knowledge can deploy a workflow, and Airflow does not limit the scope of your pipelines.
And finally, open source: whenever you want to share an improvement, you can do so by opening a pull request. Airflow has many active users who share their experiences in the Apache Airflow community.

Apache Airflow pipelines are built on four main principles.

Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of Workers. It is ready to scale to infinity.
Dynamic: Airflow pipelines are defined in Python, which allows dynamic pipeline generation. Thus, your pipelines can contain multiple simultaneous tasks.
Extensible: you can easily define your own operators and extend libraries to suit your environment.
And Lean: Airflow pipelines are lean and explicit. Parameterization is built into its core using the powerful Jinja templating engine.

Apache Airflow has supported many companies in reaching their goals. For example, Sift used Airflow to define and organize machine learning pipeline dependencies, SeniorLink increased the visibility of their batch processes and decoupled them, Experity deployed Airflow as an enterprise scheduling tool, and Onefootball used Airflow to orchestrate SQL transformations in their data warehouses and to send daily analytics emails.

In this video, you learned that Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. The five main features of Airflow are its use of pure Python, its intuitive and useful user interface, its extensive plug-and-play integrations, its ease of use, and the fact that it is open source. You also learned that Apache Airflow pipelines are scalable, dynamic, extensible, and lean. And finally, defining and organizing machine learning pipeline dependencies with Apache Airflow is one of its common use cases.
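As a closing illustration, the Jinja-based parameterization mentioned under the Lean principle can be sketched with a tiny stand-in renderer. This is plain Python, not the Jinja engine itself, and the command and script name are hypothetical; only the {{ ds }} placeholder is a real Airflow template variable (the run's logical date).

```python
# Conceptual sketch of Airflow-style parameterization: commands are
# written as templates and rendered with runtime context before they
# run. Airflow uses the Jinja engine for this; the stand-in below only
# handles simple {{ name }} placeholders for the illustration.
import re

def render(template: str, context: dict) -> str:
    """Replace each {{ name }} placeholder with its value from context."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(context[m.group(1)]), template)

# {{ ds }} is a real Airflow template variable: the run's logical date.
# The script name and flag here are made up for the example.
command = render("python ingest.py --date {{ ds }}", {"ds": "2024-01-01"})
print(command)  # python ingest.py --date 2024-01-01
```

Because the template is rendered per run, the same pipeline definition can process a different date on every scheduled execution without any code changes, which is exactly what makes parameterized, lean pipelines practical.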