Adopting a new tool for our automated jobs: Apache Airflow

Performance marketing combines advertising and innovation to help merchants and affiliates grow their businesses. Each retailer's campaign is carefully targeted, so every party has a chance to succeed. When both sides carry out their operations correctly, performance marketing is a win-win for merchants and affiliates.

Okan Karaduman - 01 November, 2021

As the tech team, we decided to start blogging about the digital marketing projects we build with our software engineering skills. We build solutions for different brands, and our main goal is to improve their performance through data analysis and automation. With that in mind, we will be publishing posts about our projects.

What do we produce as a team?

As a tech team, we have adopted the modern performance marketing approach, and as a result we built an automation project that uses Airflow to schedule product reports for our customers. For these reports we visit each product's URL and collect its unique code, availability, discounted price, discount percentage, and many other product attributes. We also have solid experience with Google Sheets, which we use to add monitoring and reporting to our automation projects. Finally, we run the Airflow environment in Docker containers to avoid the problems we might otherwise face during deployment.


To be clear, Airflow is an automation tool for creating data pipelines for many purposes. The main reason we chose Airflow over cron jobs is its UI, which lets us monitor every process in near real time. Analyzing the logs on the platform also makes a significant difference in catching and handling errors. In addition, we can configure Airflow to email us when an error occurs during a run, so we can intervene right away.

The whole process depends on scraping. For performance, we mainly use parallel threads working asynchronously, which lets us scrape a large set of URLs in minutes. We designed multiple virtual machine templates on Google Cloud that run all the required tasks in a given order. The most important part of that order is passing data between the steps within a single template, so that it is ready for the upcoming reports on the Airflow platform.
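The parallel scraping idea can be sketched with Python's standard thread pool. This is a minimal illustration and not our production scraper: the names `fetch_page` and `scrape_all`, the worker count, and the timeout are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_page,
    max_workers: int = 16,  # assumed value; tune it for the target sites
) -> Dict[str, str]:
    """Fetch every URL on a thread pool and map each URL to its result."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front so they run concurrently.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One bad URL should not stop the rest of the batch.
                results[url] = f"ERROR: {exc}"
    return results
```

Because the work is mostly waiting on the network, threads are usually enough here; the `fetch` parameter is injectable so the function can be exercised without touching real product pages.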

What is Airflow?

Let’s look more closely at Airflow's main structure to understand how it works and how we can adapt this data pipeline environment to other projects in the future. “A DAG specifies the dependencies between Tasks, and the order in which to execute them and run retries; the Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more”[1]. Basically, we can think of Airflow as a far more capable version of cron jobs. In Airflow, workers act like threads that execute all the Tasks.

First of all, every Airflow task belongs to a predefined object called a ‘DAG’ (directed acyclic graph), which groups the tasks we schedule on the Airflow system. For example:


with DAG(
    "Company_X_Product_Report",
    schedule_interval='@daily',
    catchup=False,
    default_args=default_arguments,
) as dag:
    ...  # the tasks of the report are defined inside this block

1- We can define DAGs using a context manager. The DAG object has many options, such as scheduling, naming, and retries in case of errors, which let us configure each aspect of the data pipeline. We can also pass default arguments, which apply to every task in the DAG. For example:

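from datetime import timedelta

from airflow.utils.dates import days_ago

default_arguments = {
    'owner': 'AnalyticaHouse',
    'start_date': days_ago(1),
    'sla': timedelta(hours=1),  # alert if a task takes longer than one hour
    'email': ['analyticahouse@analyticahouse.com'],
    'email_on_failure': True,
}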
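The retry options mentioned above can be set through the same dictionary. The sketch below is an assumption about how that might look, not our actual configuration: `retries`, `retry_delay`, and `email_on_retry` are standard Airflow task arguments, but the values and the fixed start date are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical variant of the default arguments with retry settings added.
default_arguments = {
    "owner": "AnalyticaHouse",
    "start_date": datetime(2021, 11, 1),  # assumed fixed start date
    "retries": 2,                         # assumed: retry a failed task twice
    "retry_delay": timedelta(minutes=5),  # assumed: wait 5 minutes between tries
    "email": ["analyticahouse@analyticahouse.com"],
    "email_on_failure": True,
    "email_on_retry": False,              # only email once the retries are used up
}
```

With settings like these, a task that fails on a temporary network problem gets another chance before anyone is notified.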

2- We used multiple functions, one for each task in the DAG. The whole process can be expressed as:

url_task >> scrape_task >> write_to_sheet_task >> find_path_task >> parse_message_task >> write_message_task
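To see why a chain like this reads left to right, here is a toy model of the `>>` operator written in plain Python. It is not Airflow's implementation: the `Task` class and `execution_order` helper are hypothetical and only mimic the idea that `a >> b` records `b` as downstream of `a` and then returns `b`.

```python
from typing import List, Optional


class Task:
    """Toy stand-in for an Airflow operator (illustration only)."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: List["Task"] = []

    def __rshift__(self, other: "Task") -> "Task":
        # Record `other` as running after `self`, then return `other`
        # so that `a >> b >> c` keeps chaining from left to right.
        self.downstream.append(other)
        return other


def execution_order(first: Task) -> List[str]:
    """Follow a linear chain from `first` and list the task ids in order."""
    order: List[str] = []
    current: Optional[Task] = first
    while current is not None:
        order.append(current.task_id)
        current = current.downstream[0] if current.downstream else None
    return order


# Task ids here simply reuse the variable names from the chain above.
url_task = Task("url_task")
scrape_task = Task("scrape_task")
write_to_sheet_task = Task("write_to_sheet_task")
find_path_task = Task("find_path_task")
parse_message_task = Task("parse_message_task")
write_message_task = Task("write_message_task")

url_task >> scrape_task >> write_to_sheet_task >> find_path_task >> parse_message_task >> write_message_task

order = execution_order(url_task)
```

In this toy model, `order` lists the six tasks in the same sequence as they appear in the chain, which is the order in which the real pipeline runs them.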

3- We wrap each function in its own task using PythonOperator. An operator defines what a task does; the most commonly used ones are BashOperator and PythonOperator, which execute Bash commands and Python functions. In our project we mainly used PythonOperator to run our functions one after another. As the example below shows, each scheduled Python function has to be wrapped in an operator so that Airflow treats it as a job. We used more than five main functions for the whole scheduled job, and you can think of the tasks as being placed in a queue: the PythonOperators line up and run one by one. If one task fails (that is, it raises an exception), the tasks after it do not run, and Airflow notifies us.

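from airflow.operators.python import PythonOperator  # Airflow 2.x import path

url_task = PythonOperator(
    task_id='get_url_data',
    python_callable=getUrlData,  # Python function that this task runs
)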

Combining the Airflow architecture with Docker

Docker is a free and open platform for building, delivering, and running apps. It lets you decouple your apps from your infrastructure so that you can release software quickly. We used Docker throughout our development stages, and in the end we created a Docker image based on Apache's Airflow image on Docker Hub[3], with small changes such as network bridging and port mapping. While developing the Airflow environment we ran into minor and major problems, including package incompatibilities and version conflicts. Before we started working with Airflow, we researched how to use Docker with it and held several meetings to agree on the advantages Docker could bring. During development we got stuck at many points, and every time we hit a wall, instead of removing Airflow from the server, we deleted the container and redeployed it from the same Docker image. Working with containers in this way saved us a great deal of time during development.

Advantages:

  • Saves time when unexpected errors occur during development
  • Networking and volumes are easy to configure
  • An isolated environment, which helps when dealing with small bugs and errors
  • A containerized structure keeps configuration simple on every build

Challenges during development

Before we started implementing the project, we did a lot of brainstorming to identify its critical parts, because every time we came up with an architectural design we kept adding features on top of it to make it the best in the industry. As a team, we share the ideal of building products as flawlessly as we can. As a result, the end product differs from where the project started. We first built the project on a Google Cloud VM, but as we developed other projects it became harder to manage and deploy all of them, so we decided as a team to move to Docker to make everything easier. We are a young team, eager to learn and to build new projects with the latest reliable technologies. We also implemented error-catching modules, because during development we faced server and Airflow crashes with no apparent cause. Catching errors is essential for us: our customers need to receive their product reports daily, and any error in the automation flow results in undelivered reports.
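One simple way to structure an error-catching module is a decorator that logs the failure and then re-raises it, so that the scheduler still sees the task as failed. This is only a sketch under assumptions: `report_errors`, `scrape_step`, and the logger name are hypothetical and do not come from our codebase.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("product_report")  # assumed logger name


def report_errors(step_name: str) -> Callable:
    """Log any exception raised by a pipeline step, then re-raise it."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except Exception:
                # Keep the full traceback in the logs for later analysis.
                logger.exception("Step %r failed", step_name)
                # Re-raise so the failure is not silently swallowed.
                raise

        return wrapper

    return decorator


@report_errors("scrape")
def scrape_step(urls: list) -> int:
    """Hypothetical step: fail loudly when there is nothing to scrape."""
    if not urls:
        raise ValueError("no URLs to scrape")
    return len(urls)
```

Re-raising matters here: if the step swallowed the exception, the run would look successful and a missing report could go unnoticed.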


References

[1], [2] https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html
[3] https://hub.docker.com/r/apache/airflow