Background: Athemaster is a technology company offering solutions and expertise in implementing an Enterprise Data Hub and automating data integration with open source technologies. My offices are based in Taipei, Taiwan. There, I manage more than 50 data pipelines that depend on one another. First, some tables from the relational database have to be ingested into the Hadoop cluster. After that, some of those tables have to be joined. Then I run statistical models over the data to check for fraudulent behavior. If a model detects fraud, the fraud data is sent to our security department for verification. The whole pipeline is very long, and the relationships between its stages are complex. I set out to find a tool to help me manage these pipelines so that I can easily observe the status of each job and rerun or revise it.
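To make the dependency problem concrete, here is a minimal sketch in Python of the four stages described above, modeled as a dependency graph. The stage names and both helper functions are hypothetical and chosen only for illustration. This is not how Jenkins represents pipelines internally; it only shows why knowing the dependencies matters when a job has to be rerun.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage names; each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest_tables": set(),
    "join_tables": {"ingest_tables"},
    "fraud_model": {"join_tables"},
    "notify_security": {"fraud_model"},
}


def run_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def stages_to_rerun(dependencies, failed):
    """Return the failed stage plus every stage downstream of it, in run order."""
    order = run_order(dependencies)
    affected = {failed}
    # Walking in topological order guarantees upstream stages are seen first.
    for stage in order:
        if dependencies[stage] & affected:
            affected.add(stage)
    return [stage for stage in order if stage in affected]


if __name__ == "__main__":
    print(run_order(PIPELINE))
    print(stages_to_rerun(PIPELINE, "join_tables"))
```

With more than 50 interdependent pipelines, working out the rerun set by hand becomes error-prone, which is the kind of bookkeeping a pipeline management tool is expected to take over.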
Goals: To identify and adopt tooling for automated ETL and MLOps pipelines, make the dependencies between data processing stages clearer, and make pipeline configuration more flexible so that it is easier to manage.
Solution & Results: I chose Jenkins to manage data pipelines for 3 reasons.
Jenkins makes complex data pipeline management simple.
Speaking of capabilities:
Actually, this is not the first time I have used Jenkins to manage data pipelines, and the results have always been excellent. For this use case, we saw: