Don't Use Apache Airflow

  • Published Nov 26, 2024

COMMENTS • 208

  • @wexwexexort
    @wexwexexort 2 years ago +93

    I wasn't using it but after this video I just changed my mind. I'm gonna schedule some jobs using Airflow next sprint.

    • @BryanCafferky
      @BryanCafferky  2 months ago +2

      How is your use of Airflow going? What use cases did you use it for?

  • @Seatek_Ark
    @Seatek_Ark 1 year ago +48

    I was recently brought onto a team to convert our ETLs from Apache NiFi over to Airflow, and while your assessment is fine, I think there are a few areas where I would have structured this differently.
    1. Airflow is not an ETL tool; you're right in calling it a job scheduler, though it's technically referred to as a task scheduler. In your ETL processes you're really trying to do four things (a-d), and then make sure they all succeed (e):
    a. trigger when an event happens (an email is received, x amount of time has passed, someone put a file in your fileshare or S3 bucket, some notification prompts you to start).
    b. extract your data from one location.
    c. transform your data. This is where the bulk of your coding comes into play.
    d. put your data into its appropriate database or storage.
    e. make sure a-d go off without an issue.
    The reason Airflow is a great ETL tool is that it does A and E by itself really well, and it facilitates B and D. Hooks and sensors are built into Airflow and are fully customizable. If your project is reliant on programs like Glue then you can do all of this in the AWS suite (or Azure or GCP), but Airflow very cleanly packages up your connection points and your custom ETL and runs that sequence of tasks beautifully. Should you default to Airflow? If your data engineers are already experts it's fine; if not, then no. Is it the magic tool for ETL? No; watch for AWS and fellow tech giants to come out with something like that in the next 5-10 years. Is it the best task scheduler? Due to support it's miles ahead of its competitors, so yes.
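[Editor's note: the a-e split described above can be sketched as a minimal Airflow 2.x DAG. This is a hypothetical example, not code from the video; it assumes the `apache-airflow-providers-amazon` package is installed, and the bucket/key names are made up.]

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def extract_transform_load():
    # B, C, D: your custom extract/transform/load code lives here
    # (or, better, is just called from here).
    ...

with DAG(
    dag_id="etl_example",                  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    # A: wait for the triggering event (a file landing in S3).
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-bucket",           # made-up bucket/key
        bucket_key="incoming/data.csv",
    )
    # B-D: the actual ETL, delegated to your own function.
    etl = PythonOperator(task_id="etl", python_callable=extract_transform_load)

    # E: ordering, retries, and monitoring come from Airflow itself.
    wait_for_file >> etl
```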

    • @vasdecabeza2
      @vasdecabeza2 1 year ago +2

      I agree. Furthermore, Airflow is a workflow/orchestration [management] tool/platform; that's why it includes job/task scheduling, monitoring, retries, and other features. On the other hand, there are things I don't like about Airflow, such as the lack of a declarative way (via JSON or YAML) to define DAGs and tasks.

    • @viewpointzero1420
      @viewpointzero1420 4 months ago +1

      I've recently started to use Airflow, and while I agree with many points, the thing with Airflow, at least in the current 2.9 version, is that by now it has all the operators, so the "Python" code you write *is* the description.
      For many tasks you do nothing but call operators like SqlExecuteQuery('file.sql')
      If you had to write the task descriptions in YAML you would just be copy/pasting the same Python declarations with nothing much taken away; by now they're terse, and unless you actually need some Python code they're just declarations strung together.
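[Editor's note: the shorthand above presumably refers to `SQLExecuteQueryOperator` from the common-sql provider package. A hypothetical sketch of such an "operators-only" DAG, with made-up connection IDs and file names, where the Python really does read like a declaration:]

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Each task is a bare operator call; the DAG file is essentially declarative.
with DAG(dag_id="nightly_sql", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    load = SQLExecuteQueryOperator(task_id="load", conn_id="my_db",
                                   sql="load.sql")
    report = SQLExecuteQueryOperator(task_id="report", conn_id="my_db",
                                     sql="report.sql")
    load >> report
```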

    • @gherbihicham8506
      @gherbihicham8506 2 months ago

      @@viewpointzero1420 But that's kind of the point, isn't it? Do you want to spend your time re-inventing the wheel and patting yourself on the back? Or do you want to deliver the product you are working on? That's the whole point of these tools: they make the parts of the job that aren't actually "your problem" easier.
      If you just want to write Python code, which is a really uncool language if you ask me, then you shouldn't be using Airflow.

  • @-MaCkRage-
    @-MaCkRage- 1 year ago +3

    I'm a developer on a data analytics team, and now I'm setting up Apache Airflow for my team. They will create DAGs using JupyterLab, and it will be very comfortable.

  • @janHodle
    @janHodle 1 year ago +5

    On most points I agree... Airflow is not an ELT tool. It's an orchestrator, in my opinion the best in the world. At the company where I work I built up BI for online activities. I tried a lot of tools; I don't want to mention them all, but they all had a lot of drawbacks and were expensive. I ended up using Airflow and I'm pretty happy with it. Sure, it's all code! That's what you have to keep in mind. Other tools like dbt, Airbyte and so on integrate perfectly into Airflow, so scheduling and monitoring the entire pipeline is absolutely great. On the other hand, I had to struggle with a lot of data sources where out-of-the-box tools had problems understanding the data. In the end I had to program middleware in Python to make the data compatible with these tools. Now it works inside the Airflow environment. Because Airflow delivers a lot of good operators, the code got even smaller. Furthermore, the Docker (Compose) images are great and the Helm charts are good... So yes: it's not a native ELT tool. You have to use code only... but with code comes a lot of flexibility. I don't want to go back to Kettle, Talend or SAP Data Services. What looks interesting is NiFi...

    • @BryanCafferky
      @BryanCafferky  1 year ago

      Thanks for the feedback. I agree if you need a high degree of control and have a lot of dependency/complexity it can be a good option. It does not fit most of the use cases I have done over several decades of data engineering work though.

  • @bnmeier
    @bnmeier 2 years ago +10

    Although I agree with most of what was said in this video I do have some comments that would likely change someone's mind as it pertains to using Airflow in a real world business scenario. I agree Airflow is not an ETL/ELT tool. I would agree that it is a scheduler. I disagree that code is not reusable. That's one of the reasons why providers and operators exist. If you want to use the same set of tasks multiple times inside the current project or across multiple projects, create a custom operator and use it where you wish.
    If you are running a medium to large business and the company/IT philosophy is to adopt products that have vendor support, then NiFi and Kettle are not going to be for you. There is no one to call for support when your production instance of either of those goes down. With Airflow a business has the ability to go with Astronomer for a fully vendor supported and highly automated solution which doesn't require the heavy lift of setup.
    Anyone saying they use AWS Glue and love it has either not used it or is lying to you. Simply put, it's got a long way to go to catch up with most orchestrator-type tools like Azure Data Factory. If you are in a situation where your company has chosen AWS as their cloud provider and Snowflake as their cloud data warehouse, your options are limited for orchestration of workflows, which is a major player in a complete data pipeline strategy. Products like Matillion are great for drag-and-drop functionality but are expensive and have a huge deficiency in deployment pipelines and CI/CD implementation. If you are living in the cloud data space and don't know Python at least at a basic level, there is a good chance you are entry level and will need to learn it at some point, or you're not very effective at putting together data pipelines. One of the most powerful modules/libraries available to someone in the data space is the pandas Python module. This becomes a very powerful tool in Airflow or any other orchestration engine dealing with data movement.
    Just my 2 cents. Again, I don't disagree with what was said. I just think there are way more valid use cases and reasons to use Airflow than insinuated.

  • @ariocecchettini1159
    @ariocecchettini1159 2 years ago +18

    Dear Bryan, thank you for your informative video! For me personally it is actually great news that Airflow IS NOT a full-fledged ETL tool, this is actually exactly what I need. I honestly don't see mentioned limitations (no ETL functionality) as a disadvantage. ETL as a concept is also becoming outdated, in the wake of new approaches such as data mesh and service mesh solutions. What is definitely a no-no is the amount of code overhead and the strong coupling. Will definitely look into suggested tools.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Yeah. It is good for orchestration and it can work with Databricks.

  • @gudata1
    @gudata1 2 years ago +26

    Airflow is a scheduler and it doesn't care about what code you run. The easiest approach is to pack all your Golang/Rust/Python code in Docker containers and scale with that.

  • @yevgenym9204
    @yevgenym9204 2 years ago +10

    As someone coming from SSIS, and who literally hates it for being all too much graphical interface,
    I have to say you did a good job of describing the problems with Airflow.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Thanks

    • @AP-nq4pe
      @AP-nq4pe 2 years ago +4

      The only thing I hate in SSIS is the variables. If you follow the ELT pattern and do minimal/no data transformation in the package, it is nice, scalable and, most importantly, easy to administer/manage, without tons of code.

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      @@AP-nq4pe Best to do most work in SQL Server T-SQL, but SSIS does orchestrate well. Package parameters are also a nice feature.

    • @michaeldowd5545
      @michaeldowd5545 1 year ago

      Python Code is often Keep It Stupid Simple compared to SSIS or other tools for that matter.

  • @DodaGarcia
    @DodaGarcia 10 months ago +2

    I've been using Airflow for a little over a year and your video really confirmed that a lot of the things that have been bugging me about it are not really a me problem.
    I really love how powerful it is, but having been using it mostly for ETL, I've often found myself overwhelmed with all the coupling and the little "gotchas" in the form of how specifically things have to be set up. It adds a lot of overhead from the get-go, and importantly, means that no matter how well designed the business code is, whenever something breaks or needs to be changed I always need to re-learn all of the Airflow-specific code. I can see why it's a favorite for specialized data teams whose main job is maintaining data pipelines, but not for use cases like mine in which the data flow management is just a small part of the job. So not really anything wrong with Airflow, just that it might be overkill for users like myself.
    I'm going to look into some of the ETL tools you mentioned, and one thing I'm very interested in using Airflow for soon is managing 3D rendering pipelines. I think it's going to be fantastic for coordinating render jobs and their individual frames, which are often in the thousands.

    • @BryanCafferky
      @BryanCafferky  10 months ago +2

      Yeah. After the video, I came to the conclusion that there are job schedulers and orchestrators and often you just need a good job scheduler. When the complexity requires an orchestrator, I recommend you look at Dagster. It is much more extensible, testable, and adds a ton of features over Airflow. I've been studying up on it for months to be sure I liked it. dagster.io/

  • @tomhas4442
    @tomhas4442 1 year ago +9

    Been using Airflow a little over a year now and totally agree with most of your points. I appreciate it for logging, monitoring of pipelines and the visualizations. Also the good K8s integrations and active community. Would recommend it if most of the code you orchestrate is Python or dockerized. It does come with some downsides, like the lack of pipeline version management or the complex setup. There are managed versions though, e.g. Cloud Composer.

  • @igoryurchenko559
    @igoryurchenko559 1 year ago +2

    A main issue with defining a function inside another function is that it's impossible to unit test. But testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
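[Editor's note: a sketch of the abstraction-layer pattern this comment describes. The function name and data are hypothetical; the business logic is a plain, importable function that can be unit-tested without any Airflow installation, and the DAG file only references it, shown here as a comment.]

```python
def clean_rows(rows):
    """Pure business logic: strip whitespace and drop empty records."""
    return [r.strip() for r in rows if r and r.strip()]

# In the Airflow DAG module, the function is only *referenced*,
# never defined inline, roughly like:
#
#   from airflow.operators.python import PythonOperator
#   clean = PythonOperator(task_id="clean", python_callable=clean_rows,
#                          op_args=[raw_rows])

# The function itself stays trivially testable:
print(clean_rows(["  a ", "", "b  ", "   "]))
```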

  • @Jeffsdata_0
    @Jeffsdata_0 2 years ago +22

    Love the video. Definitely made me think and gave me some good tools to look into.
    A few notes here (I'm an Airflow noob, but I've at least used it...)
    1. It doesn't really work on Windows like the screenshot at the beginning says - unless you're using Docker or WSL. It only works on Linux.
    2. It does not only support Python. As you mention, there's a BashOperator, which means it can run anything via a bash script (Python, JavaScript, PHP script, Java app, C# console app, etc.).
    3. I think it's a bit disingenuous to say your DAG code could be more than your actual code running - the DAG definitions are insanely simple... your examples are probably about as complex as 70% of jobs (outside of the actual logic).
    4. All the alternate solutions you present also have overhead to learn and their own proprietary outputs (that can't be reused anywhere else - except maybe Data Factory, which might be able to port into SSIS on-prem or whatever). A Python script (or whatever script - PowerShell, C# app, etc.) can run just about anywhere.
    5. Instead of putting your Python logic inside the DAG, you can just use a BashOperator to run the Python script (i.e. "python3 path/to/thescript.py") - which means you can decouple and reuse the script part anywhere, and the DAG definition is the only thing specific to Airflow (which is... trivial most of the time). This might not work if you have complex dependencies between your scripts - mine were always fairly linear jobs like: move data to cloud, train ML model, run batch model outputs, do something with the outputs, update some API.
    I'll just say... if you're currently running C# console and Python script jobs on Windows Task Scheduler (which is where I'm coming from, lol!), Airflow is an awesome tool that's super easy to get started with. We didn't end up using it because it was Linux-only and our infra team is scared of Linux (and Docker... and WSL2...).
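[Editor's note: point 5 above, decoupled scripts run via BashOperator, can be sketched as below. The script paths and DAG name are made up; the only Airflow-specific code is this thin wrapper.]

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The scripts live outside Airflow and can run anywhere;
# only this DAG definition is Airflow-specific.
with DAG(dag_id="script_runner", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    move = BashOperator(task_id="move_to_cloud",
                        bash_command="python3 /opt/jobs/move_to_cloud.py")
    train = BashOperator(task_id="train_model",
                         bash_command="python3 /opt/jobs/train_model.py")
    move >> train  # a fairly linear pipeline, as the comment describes
```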

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      Thanks. Lots of good comments. My point is about parsimony: do only as much as needed and keep maintenance in mind. To create the DAGs, I believe Python must be used, but from there you can call other languages. Not sure how tightly integrated other operators are, i.e. they seem to just shell out, but OK. I've used SQL Server Agent for ETL scheduling and it worked great with no coding required. But in the cloud, I need to use other options like Azure Data Factory, etc. Azure also has Azure Automation, but I wish Azure had a good job scheduler.

    • @HamzaHafeez7292
      @HamzaHafeez7292 1 year ago +1

      @@BryanCafferky Having worked extensively with Airflow in recent months, on multiple proofs of concept, I will admit it has a fair bit of complexity to it. However, it does provide a lot of operators out of the box, e.g. DockerOperator, KubernetesPodOperator. Working with those in a managed environment like AWS MWAA (Managed Airflow) has made things very straightforward for us. We have been using our pre-cooked Spark Docker images to carry out all the tasks at runtime. It does require a fair bit of training to understand how to best use it. And COST, yes, the COST is expensive. But we were able to get started with Airflow on AWS in a couple of hours and were testing out Spark modules that very day.

  • @lahvoopatel2661
    @lahvoopatel2661 2 years ago +1

    This is amazing. Rarely is anyone so fair in evaluating a popular tool like Airflow.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thank you. There are some who disagree but I was trying to be fair.

  • @shutaozhang9827
    @shutaozhang9827 2 years ago

    I am studying Apache NiFi now; it looks like a good tool for ETL purposes. Thanks for your comments.

  • @abhinee
    @abhinee 2 years ago +1

    Writing 800 lines of code to schedule a job in Airflow... I totally agree with you, it's a pain in the wrong place.

  • @ben.morris
    @ben.morris 2 years ago +6

    Thank you for the POV. Take a look at dbt too, from Fishtown Analytics. I think version control needs to be a core requirement for any tool that is responsible for moving data. This might be a problem if the solution isn't code-based.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Thanks for your feedback. Do you use dbt or work for Fishtown? Source code control can take many forms. SSIS stores its programs as XML which can be placed under SSC. The level of and need for SSC depends on the project requirements. For example, in a small shop where one person maintains the code, ease of use and a GUI may outweigh the need for SSC assuming the ETL object snapshots can be stored.

    • @AP-nq4pe
      @AP-nq4pe 2 years ago

      @@BryanCafferky The latest version of SSIS I checked does version control and CI/CD like a pro!

  • @Theoboeguy
    @Theoboeguy 2 years ago +1

    Whether I end up using Airflow or not, this is a great video that clearly explains how to use the tool and your perspective. Thank you!

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks for your kind words. Glad it is helpful!

  • @sanjaybhatikar
    @sanjaybhatikar 2 years ago +4

    Beautifully explained! I love how you dive into the code without getting lost in the weeds. Very helpful, thank you :)

  • @goutham4678
    @goutham4678 1 year ago +1

    KubernetesPodOperator can be used to run any Docker image using Airflow.

  • @halildurmaz7827
    @halildurmaz7827 2 years ago +2

    As far as I know, Airflow is used for "scheduling" ETLs, not "creating" them. So, can you perform both "creating" and "scheduling" operations via AWS Glue?

    • @BryanCafferky
      @BryanCafferky  2 years ago

      I've not used Glue but the docs say you can. "AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs." For time based scheduling see docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html

    • @halildurmaz7827
      @halildurmaz7827 2 years ago +1

      @@BryanCafferky Thank you so much for your attention. Then, if you are working for a company that uses a cloud platform, you actually do not even need Airflow.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      @@halildurmaz7827 YW. If you just need to do ETL work, you don't need Airflow. If you need complex task orchestration, i.e. workflows, Airflow might be a good option.

  • @rodrigoloza4263
    @rodrigoloza4263 2 years ago +5

    Airflow is great. Coupled with Kubernetes, you don't have to stick to Python anymore. The only drawback I saw was that DAGs don't scale when they have huge numbers of tasks, though that's easy to solve by splitting the DAG.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks for the comment. You do have to define the DAGs in Python. What do you use Airflow for?

    •  2 years ago +1

      There are different ways to use Airflow, you can rely on Kubernetes Pods to run Docker instances.
      In recent versions, you can scale schedulers to solve task issues.
      Nowadays, anyone can give an opinion just by reading Wikipedia and some basic examples.
      It's not an ETL solution.
      It's just an orchestrator with batteries.

  • @thybui1368
    @thybui1368 10 days ago

    It definitely has a steep learning curve. I was not able to deploy with Airflow, so I switched to Dagster and it was way simpler; I was able to spin up a task schedule within a day.

  • @brendoaraujo9110
    @brendoaraujo9110 2 years ago +1

    Hello, I have Airflow running on my machine with PostgreSQL as the scheduler's backend and the LocalExecutor, but when I run my DAGs it consumes a lot of server CPU. How could I solve this high-consumption problem?

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      If you are running it all on your machine, then it sounds like your machine may not have enough power to support it. You could deploy Airflow to cloud VMs or a Kubernetes cluster to get more resources. This Stack Overflow post talks about limiting Airflow memory consumption: stackoverflow.com/questions/52140942/airflow-how-to-specify-quantitative-usage-of-a-resource-pool This blog discusses how to configure Airflow with settings for max_threads, worker_concurrency, etc.: medium.com/@sdubey.slb/high-performance-airflow-dags-7ad163a9f764
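[Editor's note: the settings mentioned in the reply live in airflow.cfg (or matching AIRFLOW__ environment variables). Names have shifted across versions - max_threads became parsing_processes in Airflow 2 - so treat this as an illustrative fragment with made-up values, not exact syntax for any particular version.]

```ini
[core]
# Upper bound on tasks running concurrently across the whole instance.
parallelism = 8

[scheduler]
# Number of DAG-file parsing processes (called max_threads before Airflow 2).
parsing_processes = 2

[celery]
# Tasks each worker may run at once (applies to the CeleryExecutor only).
worker_concurrency = 4
```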

  • @davidgao4333
    @davidgao4333 2 years ago +1

    I use Airflow too, but I totally agree with Bryan's point of view.
    Airflow is a powerful tool, but the other side of the coin is its steep learning curve, especially for new Airflow users.
    Most of the time I just need to do simple stuff, and I find using Airflow leads to over-engineering.
    Lots of people use the Kubernetes operator; the biggest problem I see with it is that a lot of the time I have one common base Docker image, but I need to bundle different code into that Docker image just for the sake of using the Kubernetes operator.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      If you are using Databricks, the new Workflows are pretty easy to use for task orchestration. Thanks for your comment.

    • @davidgao4333
      @davidgao4333 1 year ago

      @@BryanCafferky Thank you for posting this video. It's THE best video I've encountered that explains what Airflow is. A lot of people in my company use Airflow for use cases that are not a fit for it.

    • @BryanCafferky
      @BryanCafferky  1 year ago

      @@davidgao4333 Glad you liked it.

  • @yahyaayyoub9959
    @yahyaayyoub9959 2 years ago +1

    You can't compare Apache Airflow with Apache NiFi.
    These two tools aren't mutually exclusive, and they both offer some exciting capabilities and can help with resolving data silos. It's a bit like comparing oranges to apples - they are both (tasty!) fruits but can serve very different purposes.

  • @awadelrahman
    @awadelrahman 2 years ago +2

    Does the AWS Step Functions service fit anywhere among those alternative options?

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      I have not used them but from the docs, yes, it looks like Step Functions would be a good option.

  • @konzy2
    @konzy2 3 months ago

    In Airflow the correct pattern is not to write top-level code. In the Postgres example, in production you could have a dbt file or a .sql file that contains the queries.
    Airflow specifically says not to do processing on Airflow itself; it's used for kicking off jobs elsewhere, monitoring them and dealing with the results.
    Some examples would be running a Glue job or Apache Spark, starting an EMR cluster, making a RESTful call to an API, or training a model on a Ray cluster.
    Most of your code to do the data processing should live on those platforms and can be in Scala, Java, Golang, C, etc.
    Apache NiFi is also good for ETL, but parts of it require that data move along its processors, such as converting from one file format to another or regexing columns. So some parts of it need more compute to process the data, requiring you to scale the NiFi cluster. NiFi 1.x is Java-only; only recently, with 2.x, is Python supported.

  • @ricardorodriguez4180
    @ricardorodriguez4180 2 years ago +7

    This is a cool vid. I personally love Airflow; I use it mostly as an interface to K8s and run applications on pods.
    I think in terms of "if I can write a container for the task, then I can orchestrate it in Airflow".
    You're not wrong though, it did take time to learn the intricacies of Airflow (both in code and UI). Our company practice is to make reusable functions that generate DAGs, which reduces the code for creating workflows per use case down to just a function call.
    Thanks for putting this video together. I learned about some good alternatives.
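[Editor's note: the "reusable functions that generate DAGs" practice mentioned above can be sketched as a DAG factory. This is a hypothetical example - the function, image names, and namespace are made up, and the KubernetesPodOperator import path assumes a recent cncf.kubernetes provider.]

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

def make_k8s_dag(dag_id: str, image: str, schedule: str = "@daily") -> DAG:
    """Build a one-task DAG that runs the given container image on K8s."""
    with DAG(dag_id=dag_id, start_date=datetime(2024, 1, 1),
             schedule=schedule, catchup=False) as dag:
        KubernetesPodOperator(
            task_id="run",
            name=f"{dag_id}-pod",
            namespace="data-jobs",       # made-up namespace
            image=image,
        )
    return dag

# Each use case is now a single function call; Airflow picks up
# the module-level DAG objects.
sales_dag = make_k8s_dag("sales_etl", "registry.local/sales-etl:latest")
churn_dag = make_k8s_dag("churn_model", "registry.local/churn:latest")
```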

    • @BryanCafferky
      @BryanCafferky  2 years ago

      You're welcome and thanks for your feedback.

  • @patrickbateman7665
    @patrickbateman7665 2 years ago +2

    Recently an idiot on Reddit argued with me by saying Airflow is better than Data Factory. This video says it all. Thanks a lot, Bryan 🙏

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Well, Airflow may be better at some things but not data movement/transformations in most use cases. ADF is a solid choice if you are on Azure.

  • @abc8879
    @abc8879 2 years ago +3

    "You are only limited to Python" - I don't think this is a bad thing. Python is a stable and versatile language with libraries for everything.
    "It's complex" - If the developer already knows Python, IMO, Airflow isn't difficult to learn.
    "Requires 100% coding" - I see this as an advantage. I'm using both Airflow and Pentaho. With Pentaho, code review is just painful because the raw code is in XML, which makes it difficult to read and keep track of. Also, there's not a huge user base like Python's or Airflow's, so there isn't much help out there on Stack Overflow.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks for the feedback.

    • @guyvleugels8507
      @guyvleugels8507 2 years ago

      I'm not against visual programming or low-code tools, but if your team consists of experienced developers, they'll be more productive and happier using all code.
      Nevertheless, ETL tools have their place. Let your teams use the tools that are most suited for the job and use Airflow as a centralized orchestrator. You can orchestrate ADF pipelines, for instance.

  • @paleface_brother
    @paleface_brother 2 years ago +5

    Thank you, Bryan, for your videos. They are really useful. It would be very kind of you to make lessons about Apache NiFi, especially how to choose processors for needed actions.

  • @najbighouse
    @najbighouse 9 months ago +1

    Which tool is recommended for a project where you have to call these jobs every 20 seconds? I suppose Airflow is better for tasks that run once or twice a day and not in a constant loop, right? Only 10% of my tasks are daily or weekly. Any recommendations?

    • @BryanCafferky
      @BryanCafferky  9 months ago

      If the job is constantly running, then an orchestration service seems unnecessary. Perhaps you should consider using a streaming source.

  • @JimRohn-u8c
    @JimRohn-u8c 2 years ago +2

    Just to make sure I'm understanding correctly:
    1. If a company uses Azure or AWS, they can just use ADF or AWS Glue instead of Airflow? If so, it seems that Airflow is more for companies who do end-to-end Python ETL/ELT and don't wanna pay for ADF or AWS Glue?
    2. I'm a bit confused, because a couple of articles on the web and answers on Quora say that Apache NiFi is NOT a replacement for Apache Airflow. So are there things that Apache NiFi can't do that Airflow can?
    3. I really don't wanna learn Airflow because of the learning curve, but some jobs do require it :/ so if Apache NiFi can replace it I'd rather just use that. Do you know of any good resources to learn Apache NiFi, or do you plan on making videos on it?

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Thanks for your thoughts. Airflow can be a good solution, but my point was that it is not an ETL tool. It is a job scheduler or orchestrator. It is often promoted as an ETL solution, which I think is misleading. But yes, I too see jobs that ask for it. For complex workflows it may make sense, especially streaming or something with complex dependencies. Bear in mind a given workflow cannot run concurrently with itself, i.e. each run must go from start to finish before it can start again.
      I would google NiFi or check Amazon for books. The documentation online looks pretty good. NiFi videos might be something I'll do in the future. It looks pretty cool.

  • @DaveAlbert54
    @DaveAlbert54 2 years ago +2

    I think you are mistakenly comparing Airflow to AWS Glue where AWS Step Functions (maybe also with AWS Glue) are a better representation of what it seems you get from Airflow. I'm not an expert in Airflow, but based on what is shown here in this video, that is the impression I get.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Thanks for the feedback. My intent was that ETL-focused tools include SSIS, Informatica, Databricks, Azure Data Factory, NiFi, Pentaho, etc. Airflow is a workflow orchestrator. I saw many places where it is promoted as an ETL service. It is not an ETL tool, although it can be used to orchestrate ETL work. However, unless there are many task dependencies, it is probably overkill.

  • @TheUnderdogr
    @TheUnderdogr 2 years ago +2

    I think there's a misunderstanding. Airflow is NOT an ETL tool, and I don't think it was ever meant to be, or marketed as such.
    It's rather an unfortunate confusion in the minds of many between workflow management/orchestration (which Airflow DOES) and the ETL tasks that actually implement the data transformations making up the ETL pipeline (which should not, and usually are not, Airflow tasks).
    With Airflow on AWS we run nightly ingestion of RDBMS data into AWS S3; all the tables in a given schema are processed in parallel Airflow tasks, but each of these tasks just calls an Informatica script which actually does the job of ingesting a given DB table.
    So, again, if people don't understand the meaning of orchestration, don't blame the tool 😁
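[Editor's note: the per-table fan-out this comment describes is typically done by generating tasks in a loop inside the DAG file. A hypothetical sketch - the table names, script path, and DAG name are made up:]

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = ["customers", "orders", "payments"]  # made-up schema contents

with DAG(dag_id="nightly_ingest", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    # One task per table; with no dependencies between them,
    # Airflow runs them in parallel up to the executor's limits.
    # Each task only *calls out* to the script that does the real work.
    for table in TABLES:
        BashOperator(
            task_id=f"ingest_{table}",
            bash_command=f"/opt/etl/run_ingest.sh {table}",
        )
```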

  • @ForestFWhite
    @ForestFWhite 1 year ago +1

    Good comparisons. Python has the best/easiest frameworks (pandas, PySpark, et al.) for data transformation, so that isn't a limitation.

  • @bres6486
    @bres6486 1 year ago

    I don't think social media is a good example of a DAG, since in general if a is connected to b, then b is connected to a; those are bidirectional (undirected) edges. I suppose if you track who connected to whom first then you could keep it directed, but that seems artificial.

  • @razyuval
    @razyuval 2 years ago +1

    First, thanks for this video and for your insights.
    I understand why you said some things, but I don't agree with most of it. You're right, Airflow is a great job scheduler, not an ETL/ELT tool.
    But from my experience, neither is NiFi, not if you want to do some long, complex batch jobs; each block is autonomous and they don't wait for the previous one to complete (the others I have very little experience with, so apart from being pretty expensive...).
    I think the strength of Airflow, the reason I chose to use it, is the level of control you get and the diversity of jobs/tools you can use.
    It can start with a bash script calling a Talend job that loads your DB, followed by a dbt job that processes it.
    You can further split your dbt into tests and loads, and when there's a failure, rerun from the point of failure.
    These are features I saw in expensive enterprise tools such as Control-M.
    It does have a steep learning curve, but looking at the trends in the market today and the way teams are being structured - engineers for infrastructure and analysts for the BI part - I think it's a good choice.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks. NiFi is documented as only an ETL tool and seems to fit that from what I read, though I have not used it. As I discuss later in the video, Airflow can be a good choice as a scheduler if you need the sophistication, i.e. DAGs, it offers. I purposely titled the video to alert people who think Airflow is an ETL tool that it is not. That's what I wanted to use it for, and after reading a book on it, I realized it's not an ETL tool. It is a workflow engine. There's a similar one in Windows that works with C#. It's fine if that's what you need. Airflow seems great for complex ML pipelines. On SQL Server, I have used SQL Server Agent, which worked well for that environment. It had sufficient dependency management and control for most jobs. The best ETL service to use depends on your environment and requirements: Databricks Notebooks for Spark, Azure Data Factory for the Azure cloud, Pentaho, Informatica, etc.
      I appreciate the feedback. Good thoughts.

  • @chasedoe2594
    @chasedoe2594 2 years ago +2

    Coming from legacy ETL.
    I am kind of confused. As you said, a lot market them as ETL tools, and when I look closely, I totally agree with you that it is CRON on steroids. I guess it is marketed as ETL because Python with pandas makes it relatively easy to ingest data compared to other frameworks, but doing real heavy ETL in pandas is not a good way to go.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Depending on your needs and environment, you can use different tools: Azure Data Factory for Azure, AWS Glue, Databricks Notebooks for Databricks which runs on Spark, Pentaho, Informatica, etc. Lots of choices.

  • @jenya7united
    @jenya7united 1 year ago

    Hi, I am working with Pentaho. Can you make a video on it?

  • @enesteymir
    @enesteymir 2 years ago +2

    Thanks for the clear explanations. I haven't used Airflow yet, but it is in nearly all the job posts :) Companies like to use it, actually.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      YW. Yeah. I wonder if they all use it or just like to list it in job ads. But it could be. The best tool is often not the one selected for the job. Thanks for watching.

  • @snehotoshbanerjee1938
    @snehotoshbanerjee1938 2 years ago +3

    Very nice video Bryan! What is your take on Prefect? They highlighted a few shortcomings in Airflow, hence Prefect. But Airflow in its recent version came out with less boilerplate. Happy to hear back from you on Prefect. Thanks!

  • @yoyartube
    @yoyartube 1 year ago +1

    With the BashOperator and custom operators, I find it hard to understand how it only supports Python.

    • @BryanCafferky
      @BryanCafferky  11 months ago

      True, with the BashOperator it can do an OS call-out to run a script, but that's not tight integration. Your DAGs are defined in Python, and Airflow is a Python framework. Thanks for your feedback.
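
To make the loose-integration point concrete: a BashOperator ultimately just spawns a shell command and watches its exit code, much like this plain-Python sketch (no Airflow needed; the echoed string stands in for a hypothetical `etl.sh`):

```python
import subprocess

# Roughly what a BashOperator boils down to: spawn a command, capture its
# output, and treat a nonzero exit code as task failure. The orchestrator
# sees only the return code and captured text, not the script's internals,
# which is why calling out to bash is loose, not tight, integration.
result = subprocess.run(
    ["echo", "extract step done"],   # stand-in for e.g. ["bash", "etl.sh"]
    capture_output=True,
    text=True,
    check=True,                      # raises CalledProcessError on failure
)
print(result.stdout.strip())         # -> extract step done
```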

  • @mirmir1918
    @mirmir1918 2 years ago +1

    Very good explanation! It's good that other options (products from AWS or MS, etc.) are mentioned.

  • @vilivilhunen3383
    @vilivilhunen3383 2 years ago +2

    Thanks! I struggled getting Airflow up and running - it seems like a really complex system. I'll take a look at Apache NiFi instead :)

  • @dragon3602010
    @dragon3602010 2 years ago +1

    Hey, can we compare it to n8n, or absolutely not?

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Yes. I had to look up n8n, but it seems to be better focused on ETL work and has many connectors. However, it does not appear to run on Spark, so you would need to configure a Docker/K8s environment or use their Cloud service, which is in the Azure Marketplace.

  • @steinofenb3645
    @steinofenb3645 2 years ago +1

    What does ETL stand for in ETL Service? (at min 4:02)

    • @BryanCafferky
      @BryanCafferky  2 years ago

      It stands for Extract, Transform, and Load.

  • @aeggeska1
    @aeggeska1 2 months ago

    I have been reading on their website, and I just can't understand what airflow even is or does.

  • @Davidgon100
    @Davidgon100 4 months ago +1

    We are actually considering this to replace some old batch processes we inherited. These processes are created in a no code solution and we cannot stand it.

    • @BryanCafferky
      @BryanCafferky  4 months ago

      Have you looked at Dagster? It addresses most of the issues I mentioned and has an excellent data-object-centric model. It's all Python based too. See dagster.io/ It's code centric but provides a lot of value for the code you write.

    • @Davidgon100
      @Davidgon100 4 months ago

      @@BryanCafferky I'll check it out. Thanks!

  • @H1d3AndSeek1
    @H1d3AndSeek1 2 years ago +1

    Very interesting video. What would be a suitable orchestrator to use if, e.g., our stack for ELT is Fivetran and dbt?
    While yes, we might be able to hook up these individual tools directly, I feel an overarching orchestrator ("DAG job scheduler") is needed.
    So, I am not interested in using Airflow as ETL/ELT; I always thought of it as only an orchestrator tool.
    Cheers

  • @v4ldevrr4m47
    @v4ldevrr4m47 2 years ago +1

    Thanks. It's totally true that to use Airflow as a great ETL tool you need a focused effort in Python. When you are a developer who uses Python and can prepare SQL queries, it works perfectly. Anyway, I will consider NiFi because I don't know it. Let me read about it.

  • @evgeny_web
    @evgeny_web 2 years ago

    Hi, thank you very much for this video. The project where I work plans to replace Apache Oozie with Airflow, so I think it is pretty useful to watch a video like this one. I don't have any prior knowledge of Airflow, and it was very easy to understand the main ideas behind this framework.

  • @IIIxwaveIII
    @IIIxwaveIII 2 months ago

    I liked this vid. I understand your view, but I'm not sure I would call Airflow a scheduler...
    Anyway, it's late 2024. Which open-source, on-prem tool (bare metal and private cloud) would you use for ETL processes? (the more options the merrier)
    10x!

    • @BryanCafferky
      @BryanCafferky  2 months ago

      Good question. When I did on prem ETL, we used SQL Server Integration Services (SSIS) which is proprietary but an excellent ETL tool. For open-source, I really like Dagster (dagster.io/) which is a Python based data orchestration framework. While it has some of the same issues as Airflow, the data centric nature of Dagster including integrated data validation, data lineage tracking, and composability make it far superior to Airflow in my opinion. I have not used it in production so I would recommend a POC and pilot before committing to Dagster. The Dagster university free online training is good to get started. Dagster is still an orchestrator, not an ETL service. I've seen and looked at some open source ETL tools but have not dug in enough to recommend any. Does that help?

  • @sau002
    @sau002 2 years ago +1

    Nicely done. Airflow is being explored by one of our team members. I have a question for you - is it possible to debug the code on my local workstation before running it on Airflow?

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Well, you can run Airflow locally, see airflow.apache.org/docs/apache-airflow/stable/start/local.html
      To test without Airflow, remember that Airflow just runs Python code in the specified sequence so you should be able to test that code. Just run it in the order it will run when it is in Airflow.
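
A sketch of that advice (plain Python, no Airflow; the extract/transform/load names are illustrative): keep each task a plain callable so a test can invoke them in the same order the DAG would run them.

```python
# Plain callables - in Airflow each would be wrapped as a task, but nothing
# stops a test from calling them directly, in DAG order, with no scheduler.
def extract():
    return [{"id": 1, "amt": "10.5"}, {"id": 2, "amt": "3.0"}]

def transform(rows):
    # Cast the amount column from string to float.
    return [{**r, "amt": float(r["amt"])} for r in rows]

def load(rows, target):
    target.extend(rows)
    return len(rows)

# "Run it in the order it will run when it is in Airflow":
warehouse = []
n = load(transform(extract()), warehouse)
print(n, warehouse[0]["amt"])   # -> 2 10.5
```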

  • @paulellicapadilla3421
    @paulellicapadilla3421 1 year ago

    I've worked with the alternatives you mentioned, and you're missing one other product that surpasses all of them. That is: Dagster

    • @BryanCafferky
      @BryanCafferky  1 year ago

      Yeah. Looks interesting . Do you work for Dagster?

    • @paulellicapadilla3421
      @paulellicapadilla3421 1 year ago

      @@BryanCafferky Nope. Don't work for Dagster. Just a mild-mannered data engineer trying to weed out all the noise in the tech world, finding the right gems so I can focus on exploiting those gems to be productive and stay ahead in the game. Unfortunately, most of my time is spent weeding out noise. I thank you for your service in doing the same. I think the road to take in discovering new tools is to ask "why is this tool bad", rather than "why is this tool good".

  • @andalupu6145
    @andalupu6145 2 years ago +1

    Hi, please allow me to add that I used Airflow to run complex queries in Impala using .sql files (that contain the Impala query) and run them inside the DAG tasks in the needed order. This might be useful; for me it was. I agree that NiFi is best and my favorite. Thanks

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks. Yes. Sounds like you had a good use case for Airflow.

  • @DanielRodriguez-el1gb
    @DanielRodriguez-el1gb 1 year ago

    I'm having a problem handling around 50 scripts that generate reports that are sent to the users. I would like to schedule them but also trigger them from some microservices. Is there any suggestion for that? :(

    • @BryanCafferky
      @BryanCafferky  1 year ago

      50 scripts generating and sending reports is probably not the ideal solution. A reporting tool would make more sense. However, you could use Azure Automation, which supports Python and PowerShell, to do the scheduling and run the scripts. Azure Functions could also be used.

  • @teenspirit1
    @teenspirit1 1 year ago

    DAG is a bad name for a task schedule.
    1. Directed graphs are obviously necessary if you want to define an execution flow, so duh.
    2. The fact that you have a schedule interval means that your DAG isn't really acyclic, because it loops onto its own start node at the end node.
    3. Acyclic graphs are good for reducing complexity and dependency between tasks, and that's a great thing. But that's an actual restriction, a lack of functionality, so it isn't really a feature.
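
On point 3: the acyclicity restriction is precisely what makes a deterministic execution order computable at all. A small stand-in sketch (plain Python, standard library only, no Airflow) shows why:

```python
from graphlib import TopologicalSorter, CycleError

# Dependencies expressed as {task: set of tasks it must wait for}.
deps = {
    "transform": {"extract"},   # transform runs after extract
    "load": {"transform"},      # load runs after transform
}

# A valid execution order exists because the graph is acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)   # -> ['extract', 'transform', 'load']

# Introduce a cycle (extract now waits for load) and no schedule exists.
deps["extract"] = {"load"}
try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("cycle detected - no valid schedule")
```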

  • @himanshutech8320
    @himanshutech8320 5 months ago

    Thanks. Excellent video. I recently moved to a data engineering project that uses Airflow with dbt (and Cosmos). Finding it difficult to understand why use Airflow, especially with an ELT tool like dbt. For any task there is a dependency on available operators if you want to use Airflow. Python code is tightly coupled with Airflow, and as the video says, you have to code everything. It's not that you can't get work done with Airflow and dbt, but with something like Pentaho you would have done it with half the effort.

    • @BryanCafferky
      @BryanCafferky  5 months ago +1

      Thanks for your comment. I would suggest also looking at Dagster. It addresses many of my concerns with Airflow: dagster.io/ Not sure how well it works with Databricks clusters though.

    • @himanshutech8320
      @himanshutech8320 5 months ago +1

      @@BryanCafferky Thanks. Will check !!

  • @wesselbindt7589
    @wesselbindt7589 1 month ago

    A social network is a perfect example of a graph that is neither directed (Facebook connections are not unilateral relations) nor acyclic (Tom knows Sally knows Bob knows Tom - there's your cycle). Airflow not having ETL functionality is a perfect example of the Unix philosophy: do one thing, and do it well. It not being low code is a strength rather than a limitation. Whenever I'm constrained by the whims of a low-code platform, I always end up struggling immensely to find some way to get the job done whenever my use case doesn't match exactly what the platform expects. A lot of misses at the start of the video already. Not sure if the rest of the video is worth anyone's time.

  • @vpn740
    @vpn740 1 year ago +1

    The entire functionality of Airflow is already available in tools like Control-M, Zeke, AutoSys, etc., which have been on the market for more than two decades. What is it that Airflow is doing differently? It seems the programming cult has taken over the data processing and data management world and is rewriting all the tools the way they were in the 1980s. We intentionally moved away from the code-heavy data processing/management model because of its heavy and expensive maintenance costs. Almost 18 years ago, in the early days of my career, I worked on an "ELT" tool called Sunopsis (later acquired by Oracle). Today we are lauding a similar technology called "dbt" which is doing exactly what Sunopsis did 20 years ago. What's going on, folks?

    • @BryanCafferky
      @BryanCafferky  1 year ago

      Good feedback. Not sure about dbt. It seems to offer quite a bit for ETL, less so for scheduling/orchestration.

  • @macbeth1910
    @macbeth1910 2 years ago +4

    Sorry, but there are many misleading statements here. Firstly, you are not coerced to use Python in your tasks; you can perfectly orchestrate almost anything if you put your code in an image (so yeah, you can use NodeJS, Java, etc.). The learning curve is no more complicated than for any other framework, like Django (obviously we are in the "data processing" domain here). Most of all, it is a powerful tool to organize your tasks when using a bunch of cron jobs in microservices is not an option.

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      Thanks for your comment. Your code to orchestrate must be Python, which is a limitation. Parsimony is key. For a given project, the question is: 'Do I need to take on the overhead of creating and maintaining code just to orchestrate work?' Code which can break. Absorb the learning curve time and the future skill set needed for employees. It is powerful, but with great power comes great responsibility. I don't think most data movement/transformation cases need Airflow.

    • @OgnyanDimitrov
      @OgnyanDimitrov 2 years ago

      @@BryanCafferky The validity of the reasoning is best observed if you compare Airflow with Alteryx and contrast them. Then we really see the difference in the learning curve. Alteryx and Kettle allow non-devs to make ETL pipelines, and the learning curve for non-devs is shorter. Am I correct in my assumptions? Thanks for the video. It was a real time saver.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      @@OgnyanDimitrov Yes. You got it! Thanks

    • @guyvleugels8507
      @guyvleugels8507 2 years ago +1

      I don't really understand Python being a limitation here. It's just the technology and ecosystem Airflow is using.
      SSIS, ADF, Pentaho,... They all have their limitations in the ecosystems they are sitting in.
      As for maintaining code... the same applies to SSIS, ADF,... Only you build logic using a visual tool instead of all code. Airflow has lots of pre-built provider packages for database actions, ADF, Databricks, non-data-related stuff,... which you can use, so you don't need to build tasks from scratch.
      Thanks for the vid btw. Your other points were valid. Airflow is indeed an orchestrator, not an ETL tool. 😊

  • @MichaelCizmar
    @MichaelCizmar 2 years ago

    Thanks for this. It is easy to understand things sometimes in the context of when you should not use it rather than what it's for.

  • @samsal073
    @samsal073 2 years ago

    I agree Apache Airflow is a pain in the butt to learn, install, and figure out the code. One big limitation is that it doesn't support Windows unless you run it inside a Docker container. I would rather use Apache NiFi since it can run on Windows, supports multiple scripting languages, and is UI-oriented vs. code, which makes it more productive and much easier to use.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Nifi may be a good option. Databricks Workflows are also a good one. See my video on it. ua-cam.com/video/tMH3K8Rncmk/v-deo.html

  • @programminginterviewsprepa7710
    @programminginterviewsprepa7710 2 years ago +1

    Many times all-code is much better than no-code: much better version management, code reviews, and existing-code readability.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      But what if you can do no-code faster, cheaper, and with fewer bugs?

  • @rick-kv1gl
    @rick-kv1gl 2 years ago +1

    ur channel is underrated.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      I didn't know it was rated but hope you find it useful.

    • @rick-kv1gl
      @rick-kv1gl 2 years ago

      @@BryanCafferky Def. It's a hidden gem. Thanks for the content!

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      @@rick-kv1gl Thanks. Please let others know about my channel.

  • @ridwantahirhassen197
    @ridwantahirhassen197 2 years ago +1

    We have extensively used Airflow; it is AMAZING. I think the whole video revolves around "workflow orchestration is not that complicated and is of secondary importance", which is not usually the case. For "complex" workflows, using configurations is not any simpler or neater than writing Python scripts. It is also important that you test your workflow; Airflow has that functionality. The UI feature is very handy: restarting jobs, clear visibility into what happened, etc. It also scales really well! This video is a little misleading!

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks for your comments. Did you watch the video? That's not what I said.

  • @rdean150
    @rdean150 2 years ago

    Surprised you didn't mention Argo as an alternative.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      There are many alternatives. Too many to cover them all. Thanks for the suggestion.

  • @janekschleicher9661
    @janekschleicher9661 1 year ago +1

    I think a huge characteristic of Airflow is that it is a static tool (which I personally really don't like, but let's try to keep it neutral).
    If you want to change something, you'll need to change the underlying code deployed to the server where Airflow is running.
    This means first going through the whole procedure to change the code, reviewing it, running it in CI/CD development, and then shipping the code to the server (or probably just redeploying the server). That's a long process; you can take some shortcuts, but you'll never have an experimental mode or fast prototyping. Even when working with a test instance, it's still a slow process.
    For some use cases, that's great, because there is always a definite and reliable and versioned description of what's going on.
    But if you need to change workflows and aren't sure whether they work fine (e.g. because the production cluster is different in terms of performance than your development cluster, or w/e), the development speed goes down drastically. Even if you don't want to try it out live, you either have a lot of latency going to the development cluster or you need a huge machine as you need to put it locally in a K8s setup (for realistic scenarios in enterprises).
    There are benefits having everything in code and inside GitOps, but it's certainly not fast prototyping for sure.
    The comparison to cron is very true.
    The only way to really check that it runs is to deploy it (like for cron, too), but you should only deploy what you are sure that it runs, so it's a chicken-egg problem. You can run tests, but they don't look the same way as usual in Python or in ETL or in SQL databases or in pandas, and they are complex to write and failure modes might be difficult to understand (especially checking all possible triggering rules).
    I personally would in most cases prefer a dynamic tool I could easily change while running. (You might still want to block changes on the production system, but at least for the development or staging environment, this is what I really missed when working with Airflow.)
    But yeah, the visualizations are awesome, and explaining the complexity of a system to stakeholders works much more easily. So, in practice, you'll get a lot of acceptance even if work is slow, and this counterbalances it significantly.

  • @jamescaldwell3207
    @jamescaldwell3207 2 years ago

    I would argue that the useful functions should be called into the Airflow context from a separate module. With this methodology, Python could be used to run code outside Airflow.
    Am I missing something?

    • @BryanCafferky
      @BryanCafferky  2 years ago

      What are you responding to specifically?

    • @jamescaldwell3207
      @jamescaldwell3207 2 years ago

      @@BryanCafferky Reusability of code utilized by airflow.
      For context, I landed here while listening to arguments for and against airflow because I'm trying to figure out if I'm going to learn it or Prefect. I don't know much about either, hence the question at the bottom of my comment.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      @@jamescaldwell3207 Did you watch the video? I have no issue with reuse. Whichever fits your requirements with the least cost/effort to maintain is probably the best tool.

    • @jamescaldwell3207
      @jamescaldwell3207 2 years ago +1

      @@BryanCafferky Of course I did. My comment was regarding right around 13:50 where you state that generic functions cannot be used anywhere else because of the decorators.
      I would think non-specific functions would be in a separate module and imported for use inside a task. If that function is specific to airflow but generic within the operational capacity of airflow, then one could create an airflow specific library for use across multiple jobs.
      As stated, I'm deciding whether to learn one of two tools and my comment was an assumption which posed the question if I was missing something. Having now looked it up in the spirit of ending what is starting to feel like a combative exchange, I've learned my assumption was correct.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      @@jamescaldwell3207 Sorry. No worries. Glad to get the question. I recorded this video 5 months ago, so not all the details are still fresh in my mind. The reference time was helpful. Your point is valid. In fact, you could create non-Airflow generic function libraries too. As I look back at this, I can see that when using the decorator, only the outermost function is decorated. Also, you can write code that does not use the decorators, although I think the decorators are intuitive. See this page for more details: airflow.apache.org/docs/apache-airflow/stable/modules_management.html
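
A sketch of that layering (a stand-in decorator is used here so the example runs without Airflow installed; with Airflow you would apply `@task` from `airflow.decorators` the same way): generic logic lives undecorated in its own module, and only a thin wrapper is bound to the orchestrator.

```python
import functools

# --- reusable module (e.g. a hypothetical my_etl/cleaning.py): no orchestrator imports ---
def normalize_names(rows):
    """Usable anywhere: scripts, unit tests, notebooks, any scheduler."""
    return [{**r, "name": r["name"].strip().title()} for r in rows]

# --- DAG file: only this thin wrapper is tied to the orchestrator ---
def task(fn):  # stand-in for airflow.decorators.task
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@task
def normalize_names_task(rows):
    return normalize_names(rows)

rows = [{"name": "  ada lovelace "}]
print(normalize_names(rows)[0]["name"])       # -> Ada Lovelace (no orchestrator at all)
print(normalize_names_task(rows)[0]["name"])  # -> Ada Lovelace (inside the "task")
```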

  • @tomwright9904
    @tomwright9904 2 months ago +1

    cron job + make?

    • @tomwright9904
      @tomwright9904 2 months ago

      Hmm... not sure about the idea of throwing away configuration that is written down with a bunch of non-documented, non-recreatable jobs.

  • @XxXxXboxLivexXxXxX
    @XxXxXboxLivexXxXxX 1 year ago +1

    I evaluated Airflow and Luigi (which you didn't mention). I feel that Airflow is the one with enough extensibility to work with my company's compute resources/environment. It seems you just went through the tutorials and didn't implement anything significant in Airflow. The limitations you mention seem a little arbitrary (most people like Python), and I don't understand how they are resolved with the other options or what associated tradeoffs I would be making. Still going to use Airflow; this is clickbait.

    • @BryanCafferky
      @BryanCafferky  1 year ago

      Thanks for your thoughts, but I have asked colleagues who have used Airflow extensively and they agreed with my points. Also, most of the viewers of this video who left comments agreed and confirmed with their experiences. It's not about Python; it's about the best solution to a problem. Sometimes that will be Airflow, but for most use cases, I don't think it is, and I get concerned when people get defensive about a given technology. BTW: It's not clickbait when you follow through with content that is consistent with the title. Live Long and Prosper.

  • @thiagopdesouza
    @thiagopdesouza 2 years ago

    Dear Bryan, thank you very much for this video! Very valuable and straight to the point content. Congrats!

  • @ivarec
    @ivarec 2 years ago +2

    Your channel is awesome (and I'm very picky). I've recommended it to my whole team and I'll try to get our company to help you on Patreon as well. Keep it up!

  • @msingh1319
    @msingh1319 2 years ago

    Hi Bryan, the GCP GUI ETL option is Data Fusion.

  • @MattCamp
    @MattCamp 2 years ago +2

    did you really delete my comment.. wow.. I didn't even say anything bad.. just that I disagreed and thought you were wrong..

  • @pulanala1421
    @pulanala1421 1 year ago

    Can it compete with Control M?

    • @BryanCafferky
      @BryanCafferky  1 year ago +1

      Don't know. Never heard of Control M. Do you work for them?

    • @pulanala1421
      @pulanala1421 1 year ago +1

      @@BryanCafferky Nope, it is a commercial scheduling tool I have used, and based on your presentation, everything you mentioned is exactly what Control-M does. A task or job scheduling tool!

  • @rursus8354
    @rursus8354 2 years ago

    Good video! Besides, the singular of "vertices" is "vertex", because it is Latin.

  • @falcon20243
    @falcon20243 1 year ago

    Thanks Bryan. This is a good video.

  • @SagarSingh-ie8tx
    @SagarSingh-ie8tx 2 years ago

    You are correct 👍

  • @kalasend
    @kalasend 3 months ago

    You, sir, are a master in title marketing 😂

  • @tutkal1985
    @tutkal1985 2 years ago

    clear and great explanation

  • @cmcmahon1978
    @cmcmahon1978 2 years ago

    Dear lord... please don't use ADF over Airflow unless you are doing a deployment pipeline. Unless you enjoy working DEEEP under the covers doing things like spinning up PowerShell jobs to complete tasks in an environment that is not really strongly backed by source control... unless you want to link it to a git repo and stare at JSON blobs to figure out what's wrong with the underlying "code". I do agree with you that there is a finite set of things that Airflow is good at and things that it shouldn't be used for out of the box.
    I wholeheartedly disagree that the Python needed for many of the simpler DAG use cases is difficult, as most of the out-of-the-box operators are pretty thoroughly documented and example code on how to use them lives everywhere on the internet. I would say that even in the case that you want to do something Airflow doesn't do directly out of the box, there is always the ability to use the numerous Python operators to run custom code, or the ability to spin up Kubernetes Pod Operators and allow them to scale in the cluster for heavier ML tasks.
    "Use Databricks"... yes you can... but Databricks is a potentially expensive way to orchestrate one thing, whereas Airflow can not only orchestrate Spark but do many things that Databricks can't do. Also, at the end of the day, Databricks just winds up being a bunch of JSON.
    I think the ETL code you show is a fairly OK example of example code, but not really an example of how an ETL process would be set up in the real world. Nor are you showing many of the purely built-in operators that will allow you to orchestrate jobs across a tremendous number of services in one centralized place.
    Mostly, IMO, yes: if you don't want to write any code, don't use Airflow. If you are OK with some mostly cut/paste code for many basic DAGs and functions but also want the ability to do things that none of the other mentioned tools that I have personally looked at can do, I would give Airflow a shot. Or, if you aren't into doing ANY of the management work, look at a managed Airflow service.

  • @gamsc
    @gamsc 2 years ago

    Thanks. Very informative.

  • @sakesun
    @sakesun 2 years ago

    Agree with the video.

  • @zacharyedwards665
    @zacharyedwards665 1 year ago +1

    Got fucking boomed by the title

  • @MSPalazzuoli
    @MSPalazzuoli 2 years ago

    Thanks, best explanation ever!

  • @swapnilpatil6986
    @swapnilpatil6986 1 year ago

    Wonderful video, myth busted.
    Can you please throw some light on the dbt tool?
    It's also being promoted as an ETL tool, but I am not sure of its use case.

  • @llorulez
    @llorulez 2 years ago +1

    In the current startup where I work, it's good enough, not expensive, and easy to use.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks for the comment. Yes. It does do a lot. What other Workflow engines were considered?

    • @llorulez
      @llorulez 2 years ago +1

      @@BryanCafferky Mainly Kubeflow, but our company is not big enough to use fully dedicated Kubernetes clusters. Any tool you would recommend? Interesting video btw.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      @@llorulez Thanks for the info. It all depends on what you need to do. The video was meant to get people to stop and think before jumping in, as Airflow is pretty complex but can be a great solution. For ETL/data movement, if the workflow is sequential, I would use a simpler tool, which I mention in the video. Databricks notebooks/jobs can work well, but it depends on whether you need the scale. Dask looks good for non-Spark loads and is really easy to start with but gets complex with the scale-out. Each public cloud has its own ETL PaaS services as well. My focus is parsimony, i.e. just enough to do what you need and no more.

    • @llorulez
      @llorulez 2 years ago +1

      @@BryanCafferky Maybe it was easy for me because we extensively use Docker and it was quick using DockerOperators, but as you mention, it can be really challenging.

  • @josuevervideos
    @josuevervideos 2 years ago

    great video!! thank you

  • @adibauI
    @adibauI 2 years ago

    I think the right title should be "Don't Use Apache Airflow if you are a Data Scientist", because as a junior DevOps engineer, Airflow looks awesome compared to cron scripts, at least in the project I'm in.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Could be. I am finding that installing and configuring Airflow can be challenging. I only see one SaaS offering on Azure for it and it starts at 45K.

    • @adibauI
      @adibauI 2 years ago +1

      @@BryanCafferky Yeah you are totally right, I'm trying to implement it in a docker project with conda deps, and hell, this is hard

    • @BryanCafferky
      @BryanCafferky  2 years ago

      @@adibauI Thanks. I was wondering if it was me. :-) Usually, just to get a basic dev environment for a tool is easy but not this. Python Dask is a piece of cake and for Spark, you can just use Databricks Community Edition.

    • @adibauI
      @adibauI 2 years ago

      @@BryanCafferky Works perfectly when you build the system based on it, but the thing is that I need to execute Python modules from outside Airflow's container. I think the best way will be to define every single dependency I need in Airflow's Dockerfile so it can run the tasks.

    • @BryanCafferky
      @BryanCafferky  2 years ago

      @@adibauI Yeah. I think that makes sense. Reach out on LinkedIn if you would like to connect. I'd be interested in following your progress on this.

  • @damarh
    @damarh 2 years ago

    I am actually looking for a scheduler to run Python scripts, but if that means I have to write MORE Python... good lord.

  • @pmsanthosh
    @pmsanthosh 1 year ago

    Kettle by Pentaho is slow.

  • @rjribeiro
    @rjribeiro 2 years ago +4

    - I thought it was obvious that Airflow's use case is to be the orchestrator of a data pipeline, not the executor. Whoever uses Airflow for ETL/ELT is using it wrong.
    - I don't see a problem with it being a code-oriented tool, as Python is very easy to learn. It's almost low code.
    - The comparisons with "best options" were meaningless. The use cases are different. It would have been more logical to have cited Prefect, perhaps Dagster.

    • @edpearson5464
      @edpearson5464 2 years ago +6

      Python as low code was a good laugh to start my morning, thanks.

  • @Praveen_Kumar_R_CBE
    @Praveen_Kumar_R_CBE 2 years ago

    Very true..

  • @eth6706
    @eth6706 2 years ago

    Azure Data Factory is far superior in my experience. Airflow isn't terrible though.

  • @CobraTackle
    @CobraTackle 5 months ago

    Thank you so much

  • @IgorLucci
    @IgorLucci 2 years ago

    very good!!

  • @bettatheexplorer1480
    @bettatheexplorer1480 2 years ago

    I love airflow.

  • @kaanmutlu4953
    @kaanmutlu4953 2 years ago

    My exact thoughts... literally a job scheduler on roids...