There is a problem somewhere in the dependencies, because I can't seem to install protobuf==3.20 and cosmos 1.0.3 at the same time. Not sure how you were able to do that.
I am getting an error while launching airflow tasks test retail gcs_to_raw 2023-01-01 : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 227179: invalid start byte Any ideas ? Thanks !
Hello! I had an issue with Metabase too. It wasn't starting up after running "astro dev restart". Docker apparently wasn't picking up the "docker-compose-override.yml" file, and since there was no error it was hard to spot. I later realized I had misspelled the file's name: it's ".override.yml", not "-override.yml". I hope this helps.
Hi guys, after finishing the whole project and deleting the tables on GCP, I got this error in the first task when I started the tasks in Airflow: airflow.exceptions.AirflowNotFoundException: The conn_id `gcp` isn't defined. Please, can someone help me?
I am getting an error while installing soda-core-bigquery==3.0.45 using requirements.txt. Here is the error:
ERROR: Cannot install apache-airflow==2.7.0+astro.1 and soda-core-bigquery because these package versions have conflicting dependencies.
#0 20.50
#0 20.50 The conflict is caused by:
#0 20.50 soda-core 3.0.45 depends on opentelemetry-api~=1.16.0
#0 20.50 apache-airflow 2.7.0+astro.1 depends on opentelemetry-api==1.15.0
#0 20.50
#0 20.50 To fix this you could try to:
#0 20.50 1. loosen the range of package versions you've specified
#0 20.50 2. remove package versions to allow pip attempt to solve the dependency conflict
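For anyone wondering why pip gives up here: the two requirements in the log cannot both be true. `~=1.16.0` is a compatible-release specifier (at least 1.16.0 and still within 1.16.*), while the Airflow build in the log pins exactly 1.15.0. Below is a minimal sketch in plain Python that checks a few candidate versions against both rules. The helper names (`parse`, `satisfies_compatible`, `satisfies_exact`) and the candidate list are made up for illustration, and only simple numeric versions are handled; this is not how pip's resolver is implemented.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.16.0' into (1, 16, 0). Only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible(candidate: str, base: str) -> bool:
    """Check the compatible-release rule `~=base`.

    `~=1.16.0` means: at least 1.16.0, and still inside the 1.16.* series.
    """
    cand, floor = parse(candidate), parse(base)
    prefix = floor[:-1]  # drop the last component, e.g. (1, 16)
    return cand >= floor and cand[: len(prefix)] == prefix


def satisfies_exact(candidate: str, pinned: str) -> bool:
    """Check an exact pin `==pinned`."""
    return parse(candidate) == parse(pinned)


# Hypothetical candidate versions, chosen only to illustrate the rules.
candidates = ["1.15.0", "1.16.0", "1.16.3", "1.17.0"]

usable = [
    version
    for version in candidates
    if satisfies_compatible(version, "1.16.0")  # what soda-core 3.0.45 asks for
    and satisfies_exact(version, "1.15.0")      # what apache-airflow 2.7.0+astro.1 asks for
]
print(usable)  # empty list: no version can satisfy both requirements
```

Since no version can pass both checks, changing one of the two sides (the soda-core version or the Airflow base image) is the only way out, which is what the BUG FIXES comment in this thread suggests.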
I followed the steps but got an error when including DBT project in airflow. the error was: AttributeError: 'ProfileConfig' object has no attribute 'validate_project'. Did you mean: 'validate_profile'? Could you let me know what could have gone wrong?
This was the complete error message:
Broken DAG: [/usr/local/airflow/dags/run_template_best_target.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/cosmos/airflow/task_group.py", line 26, in __init__
    DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))
  File "/usr/local/lib/python3.10/site-packages/cosmos/converter.py", line 106, in __init__
    project_config.validate_project()
AttributeError: 'ProfileConfig' object has no attribute 'validate_project'. Did you mean: 'validate_profile'?
@@MarcLamberti Thanks for replying, Marc @MarcLamberti ! I believe that my profiles.yml is correct, because if I run the dbt directly through the terminal in my project, it works. But performing DbtTaskGroup in Airflow gives this error.
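One reading of the traceback above, offered as a guess rather than a confirmed diagnosis: the line that fails is `project_config.validate_project()`, and the object it is called on is a `ProfileConfig`. That would happen if the profile config object ended up in the `project_config` slot, for example by swapping the two arguments. The sketch below reproduces the same kind of error with stand-in classes. These classes and `build_task_group` are hypothetical and are not the real cosmos code; they only mimic the two method names that appear in the traceback.

```python
class ProjectConfig:
    """Stand-in for a project config; NOT the real cosmos class."""

    def validate_project(self) -> str:
        return "project ok"


class ProfileConfig:
    """Stand-in for a profile config; NOT the real cosmos class."""

    def validate_profile(self) -> str:
        return "profile ok"


def build_task_group(project_config, profile_config) -> str:
    """Hypothetical constructor that validates both configs, like the traceback suggests."""
    project_config.validate_project()
    profile_config.validate_profile()
    return "task group built"


# Correct wiring: each object goes to the matching keyword.
print(build_task_group(project_config=ProjectConfig(), profile_config=ProfileConfig()))

# Swapped wiring: the profile object lands in the project slot.
try:
    build_task_group(project_config=ProfileConfig(), profile_config=ProjectConfig())
except AttributeError as err:
    print(f"AttributeError: {err}")
```

If this is what happened, checking which variable is passed to `project_config=` and which to `profile_config=` in the DbtTaskGroup call would be the first thing to look at. It would also explain why dbt works from the terminal, since the terminal run does not go through that call.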
Hello, I am getting an encoding error while running gcs_to_raw: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 227179: invalid start byte
The way I solved it (I guess there is a better way): open the CSV with pandas using the latin encoding, then save it as a new CSV and use that one for the tutorial.
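The same fix can be done without pandas. The byte 0xa3 from the error message is the pound sign (£) in ISO-8859-1, which is why reading the file as UTF-8 fails. A minimal sketch, assuming the CSV really is ISO-8859-1 encoded; the function name `reencode_to_utf8`, the file names and the sample row are made up for illustration.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "ISO-8859-1") -> None:
    """Read `src` using `source_encoding` and write the same text to `dst` as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with src.open("r", encoding=source_encoding, newline="") as fin, \
            dst.open("w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


# Demo on a tiny made-up file that contains the problematic byte 0xa3.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_latin1.csv"
    dst = Path(tmp) / "sample_utf8.csv"
    src.write_bytes(b"Description,UnitPrice\nGIFT BOX \xa35,2.55\n")

    reencode_to_utf8(src, dst)
    converted = dst.read_bytes()

# The converted bytes now decode cleanly as UTF-8.
print(converted.decode("utf-8"))
```

For the tutorial, point `src` at the downloaded dataset CSV, write to a new file, and upload the new file to GCS.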
Error.... Can someone help me? Airflow is starting up! Error: there might be a problem with your project starting up. The webserver health check timed out after 1m0s but your project will continue trying to start. Run 'astro dev logs --webserver | --scheduler' for details. Try again or use the --wait flag to increase the time out
I am getting this error. Can someone help me? Airflow is starting up! Error: there might be a problem with your project starting up. The webserver health check timed out after 1m0s but your project will continue trying to start. Run 'astro dev logs --webserver | --scheduler' for details. Try again or use the --wait flag to increase the time out
@@MarcLamberti I did that and it says that it is waiting for port 5432 of postgres. It seems that the port is being used. I modified the yaml and set the port to 5433, but the problem is still there.
@@MarcLamberti I managed to solve the issue. It seems the Docker Desktop app was the problem. I uninstalled and reinstalled the app, and now it is working well. Thanks for your effort
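For anyone stuck at the same "waiting for port 5432" message, a quick way to see whether something on your machine already listens on a port is a plain socket check. This is a minimal sketch with a made-up helper name (`port_in_use`). The port list assumes Postgres on 5432 (as in the message above), the 5433 alternative tried above, and the Airflow webserver on 8080; adjust it to your own setup.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Ports assumed for a local setup; change them if yours differ.
for port in (5432, 5433, 8080):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```

Run it before starting the project: a port reported as "in use" while Airflow is stopped means another program (often a local Postgres) is holding it.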
Hi @MarcLamberti ! I have a problem: I am unable to run these commands (no error, but no tables in BigQuery and no output at all when running them): dbt deps, then dbt run --profiles-dir /usr/local/airflow/include/dbt/
Same here! After dbt deps or even dbt run --profiles-dir /path I got nothing from the terminal and nothing on BigQuery. I tried adding --print or --no-quiet but got nothing as well. Did you find a solution? PS: I'm using Linux (Ubuntu 22.04 LTS)
@@cesar222vinicius Nah man, the only way is to set up Airflow by yourself without the Astro CLI, or to find a similar project that doesn't use it. Personally I recommend the first way. And don't start from scratch, start from the Airflow Docker image available in the Airflow docs
I get the following error when testing either the create dataset task or the gcs_to_raw task: "OSError: Could not find lib geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so']."
Hi, Thanks for this amazing tutorial. I really appreciate the effort and time that you put in it. But when I install soda-core bigquery, there is a dependency issue which I can't seem to resolve and need your help, I hope you can help because I really want to complete this project.
Versions trying to install:
apache-airflow-providers-google==10.3.0
soda-core-bigquery==3.0.45
Error:
ERROR: Cannot install -r requirements.txt (line 6) because these package versions have conflicting dependencies.
#9 34.65
#9 34.65 The conflict is caused by:
#9 34.65 soda-core-bigquery 3.0.41 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.40 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.39 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.38 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.37 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.35 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.34 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.33 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.32 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.31 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.30 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.29 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.28 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.27 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.26 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.25 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.24 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.23 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.22 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.21 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.20 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.19 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.18 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.17 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.16 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.15 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.14 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.13 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.12 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.11 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.10 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.9 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.8 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.7 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.6 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.5 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.4 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.3 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.2 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.1 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.0 depends on google-cloud-bigquery=2.25.0
#9 34.65
#9 34.65 To fix this you could try to:
#9 34.65 1. loosen the range of package versions you've specified
#9 34.65 2. remove package versions to allow pip attempt to solve the dependency conflict
#9 34.65
#9 34.65 ERROR: ResolutionImpossible: for help visit pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
I am really trying to follow all the steps but every time I encounter error. Now this ERROR: failed to solve: process "/bin/bash -o pipefail -e -u -x -c if grep -Eqx 'apache-airflow\\s*[=~>]{1,2}.*' requirements.txt; then echo >&2 \"Do not upgrade by specifying 'apache-airflow' in your requirements.txt, change the base image instead!\"; exit 1; fi; pip install --no-cache-dir --root-user-action=ignore -r requirements.txt" did not complete successfully: exit code: 1 Error: command 'docker build -t pipeline-tutorial_04e6a1/airflow:latest failed: failed to execute cmd: exit status 1
@@MarcLamberti DAG import error: ValueError: 'source' is not a valid DbtResourceType. I tested the transformation model and it completed successfully. I was able to confirm the dataset exists in BigQuery. But when I add the dbt task group task, my DAG breaks.
BUG FIXES:
❌ "I am getting an error while launching airflow tasks test retail gcs_to_raw 2023-01-01 : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 227179: invalid start byte"
✅ open the file with Notepad or VSCode and make sure to save it with the UTF-8 encoding then reupload the file to GCS and it will work.
❌ Dependency errors
✅ Make sure you use quay.io/astronomer/astro-runtime:8.8.0 in the Dockerfile (or Airflow 2.6.1). If not, switch to that version and restart Airflow (astro dev restart with the Astro CLI)
thx changing encoding and saving the file before uploading to bucket helped a lot
Please pin this comment
Dependency error solution didnt work.
Sorry for newbie question, but which file has to be opened with vscode and utf-8 encoded? Thanks
ok this did the trick for me:
import pandas as pd
# Step 1: Read the CSV file with a more lenient encoding
df = pd.read_csv('/Users/thomasstoffels/airflow_tutorial/include/dataset/online_retail.csv', encoding='ISO-8859-1')
# Step 2: Write the cleaned DataFrame to a new file as UTF-8 (use the same path to overwrite the original instead)
df.to_csv('/Users/thomasstoffels/airflow_tutorial/include/dataset/online_retail_cleaned.csv', index=False, encoding='utf-8')
Hi, so just like everyone else here I faced a similar issue during the dbt part. To solve it, use these lines in the requirements file instead of the ones mentioned in the video:
astronomer-cosmos[dbt-bigquery]
protobuf
Don't put any version number after them. Once the changes are done, run astro dev restart. It will take some time, but it worked for me.
Thank you so much for sharing
@@MarcLamberti Thank you too for sharing this amazing project 🎉
Thanks for sharing. I also used the code above when seeing the error shown on my airflow web page.
We need more such kind of end-to-end solution based tutorials. Thanks Marc
Will do my best 🫶
we need more @@MarcLamberti
@@MarcLamberti I am getting the error below when I run the task test on gcs_to_raw. I could not load the data into BQ because of this, but the schema got loaded
WARNING - Connection schemes (type: google_cloud_platform) shall not contain '_' according to RFC3986.
A bit overwhelmed at the beginning because we need to plan a lot of models and configurations. But it soon became interesting once we could just copy/paste and replace them later, because we could grasp the structure, i.e. DAGs, dependencies, dim tables, fact tables,...
Very powerful and easy to manage, maintain and collaborate on, I feel.
Thanks for a pleasant, well-edited video.
Pure gold! This is such a fascinating project end to end. Thank you Marc for such a huge video. You are a monster man, really appreciate the love and effort you put into this 🙏
Thank you ❤️❤️❤️❤️
Thanks for this tutorial, i've tried the same thing with airflow+postgres+dbt+cosmos, it was very challenging and inspiring
Thank you so much Marc. After following the steps you taught, my knowledge of this flow and these tools is stronger than before. Even though I hit some errors while following the steps, many comments on this video helped me overcome them. Thanks again!
Very nice with these end to end projects! Your best video so far!!
Thank you!!
Hi Marc, following your guide, so far so good, but I haven't finished it yet...
Mistyping in Notion guide:
1. Keypath JSON -> Keyfile Path
2. "airflow tasks test retail raw_to_gcs 2023-01-01" -> "airflow tasks test retail upload_csv_to_gcs 2023-01-01"
Thank you so much! Will make the edits!
Fixed!
Update 2:
1. "airflow tasks test retail check_extract 2023-01-01" -> "airflow tasks test retail check_load 2023-01-01"
2. "project.yml" -> "dbt_project.yml"
3. "target_name='dev',cos -> "target_name='dev',
4. Reports
   astro dev bas -> astro dev bash
   cd /usr/local/airflow -> cd include/dbt
5. report_customer_invoices => report_customer_invoices.yml
   report_product_invoices => report_product_invoices.yml
   report_year_invoices => report_year_invoices.yml
6. code section for chain is missing
Thank you so much for this video. I work primarily with Databricks, and I wanted to see whether somebody could create a real pipeline demo guide, and what the situation is with other DE paths like Snowflake+dbt or Airflow+dbt.
I see that all the cool stuff is actually not free, so people have to pay for comfort ("less pain in ass"): Airflow (Astronomer) + soda.io for DQ + dbt + DWH + dashboard probably...
Thanks again!
@@eldardragomir6705 Thank you for your feedback! Well, I would say all of that cool stuff is free unless you want it managed by someone else. Otherwise, Airflow, dbt Core, Soda and Metabase are free :)
End to end projects are the best to be added in portfolio section in the cv
That’s right ☺️
Hi @MarcLamberti ! Thank you so much for making this video. I have one question so far, about the dbt section to be precise: did you do something else between 30:47 and 30:49? When running the bash commands I can see that my dbt folder looks quite different from yours. At the later timestamp there appear to be 2 new folders: dbt_packages and logs.
Hey Marc, such a cool project!!!
Thanks a lot.
But I am having the weirdest issue: when I go into astro dev bash and try to run dbt from the command line, all the commands produce no output and do nothing. They don't fail, but they don't work either, and the output is blank
Hi Marc, thanks for this project. At 44:00 my airflow doesn't show full dbt tasks (no dim_run and dim_test) even though I followed your instructions exactly step by step. Can you help me fix this?
Same here
@@sirinebouksila9631 same here, did you find any solution?
Great job, I have enjoyed every part of it. Just one question: I am at the part with DbtTaskGroup in Airflow. In the video, the DAG graph in the Airflow UI shows a test task for every transform task in the dbt task group, but I don't have them. Why is that? Same thing for the report tasks in the UI
Not sure whether this helps, but I faced the same problem and it worked after closing the web page and reconnecting to localhost
That's a really great tutorial. Very useful. But what are the concerns/steps if we want to put this into a CI/CD environment and push it to production, for example?
Next video, stay tuned 😃
@@MarcLamberti that would be too great 😃
@@MarcLamberti Yes! We need this video!!
The only manual step, querying the country table in BigQuery, can be added as a function in the DAG chain. If anyone is interested, mention me in the comments and I can help you out.
And THANKS Marc!
I love this format, I wish you had more like that. I think it's so cool and useful to have end-to-end projects like that, it gives the big picture
Thank you so much for the project, I just finished it and it was amazing. I have a question: is it possible to deploy the container as a VM in Google Cloud?
Hi Marc
Can you help me understand how Airflow is actually hosted in the real world? How is it installed by organizations, how is access granted, and how do developers use it?
Very nice Thank You marc.
Great Content! Thanks for offering this
Hello Marc, nice work! I have a problem with the dbt command line: when I try dbt deps in "/usr/local/airflow/include/dbt" it does nothing, but when I create a new directory .dbt in "/usr/local/airflow/ " and move the .yml into the new directory, it works. Do you have any idea why that might be?
Hm, I need to try again, maybe something has changed or I made a mistake. What's the timecode where I show that?
Thanks for answering. It doesn't do anything. I could still fix it with sudo chmod 777 -R retail_airflow. I think dbt deps doesn't log an error for permission problems @@MarcLamberti
I have a similar issue: dbt deps and dbt run --profiles-dir /usr/local/airflow/include/dbt/ don't do anything.
Congrats Marc, great content!
Thanks a lot!
Just by the accent I assume that you are a French speaker. It would be great if you could make videos in French too. Great and explanatory video!
Very nice! This is a very important video. I hope we can build the same project with Snowflake
Great content! Would it be possible to have some videos about how to configure cosmos with Cloud Composer on GCP? Much appreciated!
Hi, I'm having trouble: after running the "dbt deps" command nothing happens. What should I do?
I have the same problem
Try updating the BigQuery and dbt versions, then run dbt deps once more
This is so helpful ! Thank you so much for your efforts .
Very interesting video, but I have certain functional doubts about working with a cloud to ingest data. This is regarding the example in the video with the CSV file, and I apologize for my ignorance.
What would be the difference in bash processes or real-time processing, knowing that in my organization, the data source is a data warehouse? I understand that I could do the same with DQA and DBT. I don't see the advantage of using BigQuery.
very good project, thanks for sharing Marc
You’re welcome 🫶
Thank you so much for this amazing project. Learn a lot!!
You're very welcome!
This is just awesome. I've learned a lot from watching this video. Thank you so much for creating such a great value, Marc :)
Thank you 🙏
This is a great project! I have a question though, I used it as a template for a project of my own (using astro and same docker image) and everything seems to work great except I believe I don't have writing rights inside of the container (PermissionError: [Errno 13] Permission denied). I found out when trying to run a function with PythonOperator, but also seems to happen with BashOperator if I wanted to create a directory for example. When inside I tested making a directory or creating a file with bash and permission was also denied. I've tried RUN chmod 777 in the dockerfile without success. I am pretty sure it is something relatively simple but I haven't found any solution online yet. Thanks!
Hey Marc, thanks for the amazing project! I have one quick doubt: around the 27:00 to 28:00 mark, is it necessary to put the imports inside the function, right below the definition? For example at 28:36 -> "from soda.scan import Scan". Or is it the same to import at the top of the Python file like we usually do? Thank you in advance
It’s necessary as the python code here runs in a python virtual environment
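To illustrate the reply above: a function body is only executed when the function is called, so an import placed inside it is resolved by whichever interpreter runs the task, not by the one that merely loads the file. A minimal sketch, assuming the DAG file is loaded by an interpreter where `soda` is not installed while the task itself runs in a separate virtual environment that has it. The function body here is simplified and is not the code from the video.

```python
import importlib.util


def check_load():
    # This import runs only when the function is called, i.e. inside the
    # interpreter that executes the task (assumed to have soda installed).
    from soda.scan import Scan

    return Scan


# Defining the function never touches the soda package, so loading this file
# works even where soda is missing. A top-level "from soda.scan import Scan"
# would instead fail as soon as the file is loaded in such an environment.
soda_available = importlib.util.find_spec("soda") is not None
print(f"soda importable in this interpreter: {soda_available}")
print(f"check_load defined: {callable(check_load)}")
```

With the import at the top of the file, the error would show up when the file is parsed; with the import inside the function, it can only show up when the task runs.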
I have a question: do you suggest that we deploy and test the whole process in a local container with astro and then upload the image to the cloud as a compute instance, or work entirely with Google Cloud Composer?
Hi @Marc, when I run "astro dev start", it doesn't return any log or pass/fail message, it's totally blank. I was looking for a solution but couldn't find one. Can you help?
Where can I find the documentation for cosmos? I cannot find any documentation that shows the code/config you use!
nice tutorial, just a quick question:
why (paid)Soda and why not (FREE) Great Expectations ?
Thank you! Well, Soda is free, but Soda cloud is not. You can use Soda without the cloud offering. Great expectations is *in my opinion* way too hard just for making data quality checks. I found Soda much easier to set up and use.
@@MarcLamberti Hi Marc. So after the 45-day trial, can I still run this project?
Hi Marc and guys, how can I increase the timeout of Airflow in this project?
What do you mean by the time out of airflow?
Hello Marc, after I launch Metabase it takes forever to load the Metabase UI and get to "Let's get started", and Docker shows a warning that the Metabase image might be slow or crash when used as an image. What should I do?
Is there any AWS data engineering project that can be explained in an interview, sir?
Very nice Thank you.
Thank you!
I've been having a dbt deps issue around 55:41. I can't install packages, which is weird. I tried using sudo but astro keeps asking for a password. Online resources are not helping.
I was able to use dbt deps earlier in the course and don't know what changed. Did anyone have a similar issue?
Hi Marc, I want to trigger the completed DAG but it gives me an error in the transform. What is the trouble?
What error?
Getting Error:
The conflict is caused by:
8.521 soda-core 3.0.45 depends on opentelemetry-api~=1.16.0
8.521 apache-airflow 2.7.0+astro.1 depends on opentelemetry-api==1.15.0
8.521
8.521 To fix this you could try to:
8.521 1. loosen the range of package versions you've specified
8.521 2. remove package versions to allow pip attempt to solve the dependency conflict
I get it when inserting soda in the requirements.txt
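To see why pip gives up here, a rough sketch of the two constraints from the log. It is a simplified stand-in for pip's resolver (plain `x.y.z` versions only, no pre-releases), and the helper names are made up for illustration. `~=1.16.0` means "at least 1.16.0, and still 1.16.something", while `==1.15.0` allows exactly one version.

```python
def parse(version):
    """Turn '1.16.0' into (1, 16, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible(candidate, spec):
    """Simplified `~=spec` check: at least spec, and same leading parts."""
    c, s = parse(candidate), parse(spec)
    # ~=1.16.0 is shorthand for ">=1.16.0, ==1.16.*"
    return c >= s and c[: len(s) - 1] == s[:-1]


def satisfies_exact(candidate, spec):
    """`==spec` check for plain release numbers."""
    return parse(candidate) == parse(spec)


candidates = ["1.15.0", "1.16.0", "1.16.3", "1.17.0"]
ok_for_soda = [v for v in candidates if satisfies_compatible(v, "1.16.0")]
ok_for_airflow = [v for v in candidates if satisfies_exact(v, "1.15.0")]
ok_for_both = [v for v in ok_for_soda if v in ok_for_airflow]

print(ok_for_soda)     # versions soda-core 3.0.45 would accept
print(ok_for_airflow)  # versions apache-airflow 2.7.0+astro.1 would accept
print(ok_for_both)     # empty: no version satisfies both pins
```

Since no opentelemetry-api version can satisfy both, one of the two packages has to change. The pinned comment suggests using quay.io/astronomer/astro-runtime:8.8.0 (Airflow 2.6.1) in the Dockerfile rather than the 2.7.0 runtime shown in this log.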
Let me check
Any reason why you create the dbt folder under include? In most tutorials I've seen, it is placed under the dags folder.
What's not a DAG should not be in the DAGs folder :)
Does anyone else only have 4 jobs instead of 8 within the DbtTaskGroup task in Airflow? I only have "*_run" jobs but not "*_test" jobs.
Any help is appreciated. Thanks!
Same here, I believe he did not show that.
Hello Marc! Thank you for your great explanation.
But I want to ask why I can't import cosmos although astronomer-cosmos is already installed in that env? It's been 2 days already and I still can't figure it out lol
This project got stuck after building the models in BQ (for me) :(
What do you mean by you can’t import it?
@@MarcLamberti I don't know, sir. What I did was install it manually using pip instead of using requirements.txt (?)
But now I'm experiencing a new issue:
DBT_CONFIG = ProfileConfig(profiles_yml_filepath) always returns None although the path is correct. Maybe my DAGs fail because of this issue. Any help, sir?
I don't understand at 26:49: why do we need to create that VM?
Nice video. Can you please make a video on the Airflow managed instance in Azure?
Could be a great idea :)
Nice video. You can also take a look at kedro framework
Thank you 🫶
thanks a lot for a very detailed demo. Your course on Udemy also rocks!!
Thank you so much 🫶
Wow such a huge video. I enjoyed every part of it. I would love to see more end to end projects like this using airflow. Thank you so much for all your efforts Marc 🙏
Thank you so much! More will come then ❤
What incredible content! Thank you so much for sharing so much knowledge!!! Hugs from Brazil
Thank you so much!
I am unable to restart the Airflow instance after 29:30, can anyone please help?
Same here.
@MarcLamberti would you be able to help us?
excellent content ❤
Thank you so much!
There is a problem somewhere in the dependencies, because I can't seem to install protobuf==3.20 and cosmos 1.0.3 at the same time. Not sure how you were able to do that.
What's your astro-runtime version in the Dockerfile?
@@MarcLamberti 9.1.0, but I can try 8.8 as you have in the video and see if that fixes it. It's true that there is a jump between major versions.
@@dan_dom Yes, could you try? In the meantime I'm checking how to solve that
I am getting an error while launching airflow tasks test retail gcs_to_raw 2023-01-01 : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 227179: invalid start byte Any ideas ? Thanks !
Yes, open the file with Notepad or VSCode and make sure to save it with the UTF-8 encoding then reupload the file to GCS and it will work.
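If you prefer doing this from a script, here is a small sketch using only the standard library. It assumes the original file is ISO-8859-1 (Latin-1), where byte 0xa3 is the pound sign; that is a guess about this dataset, not something confirmed in the thread, and `reencode_to_utf8` is a name made up for this example.

```python
def reencode_to_utf8(src_path, dst_path, source_encoding="ISO-8859-1"):
    """Read a text file in source_encoding and write it back as UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

You would call it with your own paths, for example the CSV under include/dataset/ and a new output file next to it, then reupload the new file to GCS.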
Hi, I'm facing a problem with Metabase, it is not showing up
@MarcLamberti
Hello!
I had an issue with Metabase too.
It wasn't starting up after running "astro dev restart".
It seemed Docker wasn't recognizing the "docker-compose-override.yml" file, since there was no error.
I later realized I had misspelled the file's name.
It's ".override.yml", not "-override.yml".
I hope this helps.
Hi guys, after finishing the whole project and deleting the tables on GCP, I got this error in the first task when I started the tasks in Airflow: airflow.exceptions.AirflowNotFoundException: The conn_id `gcp` isn't defined
Please, can someone help me?
Complete log:
airflow.exceptions.AirflowNotFoundException: The conn_id `gcp` isn't defined
[2024-02-23, 00:05:58 UTC] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=retail, task_id=upload_csv_to_gcs, execution_date=20240222T015123, start_date=20240223T000558, end_date=20240223T000558
[2024-02-23, 00:05:58 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 15 for task upload_csv_to_gcs (The conn_id `gcp` isn't defined; 688)
[2024-02-23, 00:05:58 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-02-23, 00:05:58 UTC] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check
It looks like you’re using a connection gcp that you haven’t defined yet in the Airflow UI, look at your connections :)
I am getting an error while installing soda-core-bigquery==3.0.45 using requirements.txt
Here is the error:
ERROR: Cannot install apache-airflow==2.7.0+astro.1 and soda-core-bigquery because these package versions have conflicting dependencies.
#0 20.50
#0 20.50 The conflict is caused by:
#0 20.50 soda-core 3.0.45 depends on opentelemetry-api~=1.16.0
#0 20.50 apache-airflow 2.7.0+astro.1 depends on opentelemetry-api==1.15.0
#0 20.50
#0 20.50 To fix this you could try to:
#0 20.50 1. loosen the range of package versions you've specified
#0 20.50 2. remove package versions to allow pip attempt to solve the dependency conflict
I followed the steps but got an error when including DBT project in airflow.
the error was:
AttributeError: 'ProfileConfig' object has no attribute 'validate_project'. Did you mean: 'validate_profile'?
Could you let me know what could have gone wrong?
This was the complete error message:
Broken DAG: [/usr/local/airflow/dags/run_template_best_target.py] Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/cosmos/airflow/task_group.py", line 26, in __init__
DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))
File "/usr/local/lib/python3.10/site-packages/cosmos/converter.py", line 106, in __init__
project_config.validate_project()
AttributeError: 'ProfileConfig' object has no attribute 'validate_project'. Did you mean: 'validate_profile'?
Can you double check your profile file?
@@MarcLamberti Thanks for replying, Marc!
I believe that my profiles.yml is correct, because if I run dbt directly through the terminal in my project, it works. But running DbtTaskGroup in Airflow gives this error.
At 40:50, how did you get the dbt_packages and logs folders? :///
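One thing the traceback hints at: `validate_project()` is being called on a `ProfileConfig` object, so the profile config may have ended up in the slot where the project config is expected (for example through positional arguments in the wrong order). Below is a sketch with stand-in classes that only mimic the names from the traceback. They are not the real Cosmos classes, and `build_task_group` is a hypothetical function used just to reproduce the mix-up.

```python
class ProjectConfig:
    """Stand-in for the project config (hypothetical, not the Cosmos class)."""

    def __init__(self, dbt_project_path):
        self.dbt_project_path = dbt_project_path

    def validate_project(self):
        return True


class ProfileConfig:
    """Stand-in for the profile config (hypothetical, not the Cosmos class)."""

    def __init__(self, profile_name, target_name, profiles_yml_filepath):
        self.profile_name = profile_name
        self.target_name = target_name
        self.profiles_yml_filepath = profiles_yml_filepath

    def validate_profile(self):
        return True


def build_task_group(group_id, project_config, profile_config):
    # Mirrors the failing call in the traceback: whatever is bound to
    # project_config must provide validate_project().
    project_config.validate_project()
    profile_config.validate_profile()
    return group_id


project = ProjectConfig("/usr/local/airflow/include/dbt/")
profile = ProfileConfig("retail", "dev", "/usr/local/airflow/include/dbt/profiles.yml")

# Keyword arguments make the intent explicit, so the two cannot be swapped.
build_task_group("transform", project_config=project, profile_config=profile)

# Swapping them positionally reproduces the same kind of AttributeError.
try:
    build_task_group("transform", profile, project)
except AttributeError as err:
    print(err)
```

If that is what happened, passing both configs by keyword in the DAG file should rule it out. The profile name, target name, and paths above are placeholders.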
what do you mean?
Hello, I am getting an encoding error while running gcs_to_raw: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 227179: invalid start byte
The way I solved it (I guess there is a better way): open the CSV with pandas using a Latin encoding, then save it as a new CSV and use that for the tutorial.
Thank you for sharing. Or you can open the file with VSCode or Notepad and save the file again with UTF-8 encoding.
so helpful
Thank you 🙏
Error.... Can someone help me?
Airflow is starting up!
Error: there might be a problem with your project starting up. The webserver health check timed out after 1m0s but your project will continue trying to start. Run 'astro dev logs --webserver | --scheduler' for details.
Try again or use the --wait flag to increase the time out
I am guessing you are using astro version 8.8.0
If yes, then revert back to what it was before and read my other comment, which has the solution.
@@sprinter5901 Yes, before it was astro-runtime:10.2.0
@@sprinter5901 Yes, it was astro 10.2.0. I can't find your other comment?
@@azizbohra6487 Hi, currently I am not on my PC. Just sort the comments by most recent and you will find it.
Hi, so just like everyone else here I faced a similar issue during the dbt part. To solve that, in the requirements file use this code instead of the one mentioned in the video:
astronomer-cosmos[dbt-bigquery]
protobuf
Don't put any version number after them. Once the changes are done, run astro dev restart. It will take some time, but it worked for me.
I am getting this error. Can someone help me?
Airflow is starting up!
Error: there might be a problem with your project starting up. The webserver health check timed out after 1m0s but your project will continue trying to start. Run 'astro dev logs --webserver | --scheduler' for details.
Try again or use the --wait flag to increase the time out
Can you look at the logs as suggested?
@@MarcLamberti I did that and it says that it is waiting for port 5432 of postgres. It seems that the port is being used; however, I modified the yaml and set the port to 5433 and the problem is still there.
@@MarcLamberti I managed to solve the issue. It seems the Docker Desktop app was the issue. I uninstalled and reinstalled the app. Now it is working well. Thanks for your effort.
@@theniyota Ty bro, this comment saved me time
@@theniyota Thanks Man
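Before reinstalling anything, it may be worth checking whether something is really listening on the port. A minimal sketch with the standard library follows; `port_in_use` is a made-up helper name, and 5432 is simply the port mentioned in the logs above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("5432 in use:", port_in_use(5432))
```

If this prints False while Airflow still waits for postgres, the port is probably not the real problem, which would be consistent with the Docker Desktop reinstall fixing it.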
i tried to do it with aws managed airflow - and failed. damn.
don't use MWAA 🥹
Hi @MarcLamberti! I have a problem: I'm unable to run these commands (no error, but no tables in BigQuery and no output when running them):
dbt deps
dbt run --profiles-dir /usr/local/airflow/include/dbt/
Same, did you find the solution?
Same here! Somehow after dbt deps, or even dbt run --profiles-dir /path, I get nothing in the terminal and nothing in BigQuery. I tried adding --print or --no-quiet but got nothing as well. Did you find the solution? PS: I'm using Linux (Ubuntu 22.04 LTS).
@@cesar222vinicius Nah man, the only way is to set up Airflow yourself without the Astro CLI, or just try to find the same project as the Astro one. Personally I recommend the first way. And do not start from scratch; use the Airflow Docker image available in the Airflow docs.
Very Fast
This vidéo gave me noisy
I'm unable to run these commands (no error, but no tables in BigQuery and no output when running them):
dbt deps
dbt run --profiles-dir /usr/local/airflow/include/dbt/
make sure that you are in the include/dbt folder before executing dbt deps and dbt run
@@MarcLamberti Yes, I am in the correct folder:
(dbt_venv) astro@a1629bd2ef32:/usr/local/airflow/include/dbt$ dbt deps
(dbt_venv) astro@a1629bd2ef32:/usr/local/airflow/include/dbt$ dbt run --profiles-dir /usr/local/airflow/include/dbt/
(dbt_venv) astro@a1629bd2ef32:/usr/local/airflow/include/dbt$
I'm encountering the same issue. It seems like dbt is not installed or something because it doesn't even throw an error.
@@Facu55 can you connect with me on LinkedIn and send me a screenshot of your error?
@@MarcLamberti Done!
I get the following error when testing either the create dataset task or the gcs_to_raw task: "OSError: Could not find lib geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so']."
Keep the same versions as shown in the pinned comment and the video. I have the same error with the latest versions
@@MarcLamberti Thanks, managed to complete the pipeline :)
Hi, Thanks for this amazing tutorial. I really appreciate the effort and time that you put in it. But when I install soda-core bigquery, there is a dependency issue which I can't seem to resolve and need your help, I hope you can help because I really want to complete this project.
Versions trying to install:
apache-airflow-providers-google==10.3.0
soda-core-bigquery==3.0.45
Error:
ERROR: Cannot install -r requirements.txt (line 6) because these package versions have conflicting dependencies.
#9 34.65
#9 34.65 The conflict is caused by:
#9 34.65 soda-core-bigquery 3.0.41 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.40 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.39 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.38 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.37 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.35 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.34 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.33 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.32 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.31 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.30 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.29 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.28 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.27 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.26 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.25 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.24 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.23 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.22 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.21 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.20 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.19 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.18 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.17 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.16 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.15 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.14 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.13 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.12 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.11 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.10 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.9 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.8 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.7 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.6 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.5 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.4 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.3 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.2 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.1 depends on google-cloud-bigquery=2.25.0
#9 34.65 soda-core-bigquery 3.0.0 depends on google-cloud-bigquery=2.25.0
#9 34.65
#9 34.65 To fix this you could try to:
#9 34.65 1. loosen the range of package versions you've specified
#9 34.65 2. remove package versions to allow pip attempt to solve the dependency conflict
#9 34.65
#9 34.65 ERROR: ResolutionImpossible: for help visit pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
let me double check
@@MarcLamberti Thank you! It would also be nice if you could pin the solution, because I think others might be facing the same issue.
Cool accent. Didn't understand a damn thing 😂
You can do it 🥲
I am really trying to follow all the steps, but every time I encounter an error. Now this one: ERROR: failed to solve: process "/bin/bash -o pipefail -e -u -x -c if grep -Eqx 'apache-airflow\\s*[=~>]{1,2}.*' requirements.txt; then echo >&2 \"Do not upgrade by specifying 'apache-airflow' in your requirements.txt, change the base image instead!\"; exit 1; fi; pip install --no-cache-dir --root-user-action=ignore -r requirements.txt" did not complete successfully: exit code: 1
Error: command 'docker build -t pipeline-tutorial_04e6a1/airflow:latest failed: failed to execute cmd: exit status 1
Hi,
I faced the same issue, then I installed "Microsoft Visual C++" and it worked.
Hope it can help.
THANKS A LOT
Happy that it helps :)
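A note on the build error above: the failing step first greps requirements.txt for a line that pins apache-airflow itself, then runs pip install, so the exit code 1 could come from either part. To rule out the first one, here is a rough Python equivalent of that grep check. The pattern is copied from the error message; `offending_lines` is a made-up helper name.

```python
import re

# Same pattern as in the build step; fullmatch plays the role of grep -x.
GUARD = re.compile(r"apache-airflow\s*[=~>]{1,2}.*")


def offending_lines(requirements_text):
    """Return the lines that would trip the 'Do not upgrade' guard."""
    return [
        line for line in requirements_text.splitlines() if GUARD.fullmatch(line)
    ]


sample = """apache-airflow-providers-google==10.3.0
soda-core-bigquery==3.0.45
apache-airflow==2.7.0
"""
print(offending_lines(sample))
```

Provider packages such as apache-airflow-providers-google do not match, only a line that starts with apache-airflow followed directly by a version operator. If the list is empty for your file, the failure most likely comes from the pip install part instead.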
@@MarcLamberti I'm stuck after creating the DbtTaskGroup ---> ValueError: 'source' is not a valid DbtResourceType
@MarcLamberti can you help me with the issue? I've been stuck here for 3 days.
@@zuandrecoetzee1957 What issue?
@@MarcLamberti DAG import error: ValueError: 'source' is not a valid DbtResourceType. I tested the transformation model and it completed successfully, and I was able to confirm that the dataset exists in BigQuery. When I add the task (the dbt task group), my DAG breaks.
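For what it's worth, that message has the shape Python's Enum produces when a value has no matching member. The sketch below uses a stand-in enum whose member list is hypothetical, not the real Cosmos definition. It shows how the error would appear if the installed version does not know the 'source' resource type while the dbt project contains sources.

```python
from enum import Enum


class DbtResourceType(Enum):
    """Stand-in enum; the member list here is an assumption."""

    MODEL = "model"
    SEED = "seed"
    SNAPSHOT = "snapshot"
    TEST = "test"


def is_known(resource_type):
    """Return True if the value maps to a member of the enum."""
    try:
        DbtResourceType(resource_type)
        return True
    except ValueError:
        # Same exception type as in the DAG import error.
        return False


print(is_known("model"))
print(is_known("source"))
```

I can't tell from the thread which versions are involved, so this may be a version mismatch between the packages. Keeping the same versions as in the pinned comment and the video, as suggested earlier in the thread, is probably the first thing to try.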