Python Libraries You Should Know As A Data Engineer - Python For Beginners
- Published 18 Aug 2024
- What Python libraries should data engineers know?
Here is a list from beginner to advanced!
Beginner
- Requests
- Paramiko
- Psycopg2 or SQLAlchemy
- Datetime
Mid
- BeautifulSoup
- Airflow
- All the cloud libraries (AWS, GCP, Azure)
Advanced
- PySpark
- PyKafka
0:00 Intro
2:10 Requests
2:44 Paramiko
3:02 Psycopg2
4:00 Basic Data Engineering Project Idea
4:42 BeautifulSoup
5:02 Datetime
6:00 Airflow
6:33 All the cloud libraries (AWS, GCP, Azure)
8:30 PySpark and PyKafka
If you enjoyed this video, check out some of my other top videos.
Top Courses To Become A Data Engineer In 2022
• Top Courses To Become ...
What Is The Modern Data Stack - Intro To Data Infrastructure Part 1
• What Is The Modern Dat...
If you would like to learn more about data engineering, then check out Google's GCP certificate
bit.ly/3NQVn7V
If you'd like to read up on my updates about the data field, then you can sign up for our newsletter here.
seattledataguy...
Or check out my blog
www.theseattle...
And if you want to support the channel, then you can become a paid member of my newsletter
seattledataguy...
Tags: Data engineering projects, Data engineer project ideas, data project sources, data analytics project sources, data project portfolio
_____________________________________________________________
Subscribe: / @seattledataguy
_____________________________________________________________
About me:
I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems, both solo and with a company called Acheron Analytics. I have experience both working hands-on with technical problems and helping leadership teams develop strategies to maximize their data.
*I do participate in affiliate programs. If a link has an "*" by it, then I may receive a small portion of the proceeds at no extra cost to you.
If you guys want to learn more about data engineering, then sign up for my newsletter here seattledataguy.substack.com/
Beginner -
1. Requests (and sftp)
2. Psycopg2 and similar database libraries
3. Beautifulsoup and scrapy
4. Datetime
5. Virtualenv
Intermediate -
6. Airflow
7. Boto3 and similar libraries to interact with cloud
8. Flask/Django
Advanced (based on need to know) -
9. Pyspark
10. Pyarrow
Up
Warnings and logging too
Some other cool libraries from my side:
- Pandas - you've mentioned it, but I don't think you've put it in a context that shows one should know it (vide the case from your Facebook interviews) - I think it's essential for any sort of data wrangling with Python.
- NumPy - essential stuff for any sort of algebra if you want to dive deeper into ML
- MyPy/Pydantic - for data validation & static typing
- Pytest - for testing
- matplotlib & seaborn - for data visualization in Python
- any sort of file libraries for specific file formats like json, csv, avro-python etc.
- ML libraries like scikit-learn
- FastAPI as an alternative to Django/Flask
- Selenium
- argparse for scripting
Although I haven't used most of these in my job on a regular basis - I think it doesn't hurt to know them :)
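As a rough illustration of the standard-library items in the list above (argparse for scripting, plus the json and csv modules), here is a minimal sketch of a small conversion script. The flag names (`--fields`, `--delimiter`), the function names, and the sample records are all made up for this example.

```python
import argparse
import csv
import io
import json


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: these flag names are invented for the example.
    parser = argparse.ArgumentParser(description="Convert JSON records to CSV.")
    parser.add_argument("--fields", nargs="+", required=True,
                        help="column names to keep, in order")
    parser.add_argument("--delimiter", default=",")
    return parser


def records_to_csv(json_text: str, fields: list[str], delimiter: str = ",") -> str:
    """Turn a JSON array of objects into CSV text with a header row."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter=delimiter,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # An explicit argument list keeps the demo independent of sys.argv.
    args = build_parser().parse_args(["--fields", "id", "name"])
    sample = '[{"id": 1, "name": "a", "extra": true}, {"id": 2, "name": "b"}]'
    print(records_to_csv(sample, args.fields, args.delimiter))
```

In a real script you would call `parse_args()` with no arguments so the flags come from the command line.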
sympy is more of an algebra library. I think you meant numpy is a linear algebra library. This can be a good way of thinking about it for a beginner who wants to learn ML, but I find it gets used a lot for stuff where you want to try and represent continuous mathematics as closely as possible on a computer. For example, numpy would also be good for stuff like signal processing or creating a function of best fit for your data that can be plotted.
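To make the "function of best fit" idea concrete, here is a small sketch using `numpy.polyfit` and `numpy.poly1d`. The sample points are invented and lie roughly on a straight line. A degree-1 fit is just one possible choice, not something the comment prescribes.

```python
import numpy as np

# Made-up sample points that lie roughly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Least-squares fit of a degree-1 polynomial; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# poly1d wraps the coefficients in a callable that can be evaluated or plotted.
best_fit = np.poly1d([slope, intercept])

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(best_fit(5.0))
```

Evaluating `best_fit` over a range of x values gives the points you would hand to a plotting library.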
Requests
Psycopg
Bigquery
Beautifulsoup & scrapy
Datetime
Boto 3
Flask
Virtualenv
Spark
Pyarrow
Pykafka
Snowflake
Thanks! I finally added in the agenda so these are now included.
Psycho pg2 is how I've heard folks say it too!
Great content as usual! I'd add json library to that
amazing thank you!
You're very welcome!
Watching the premiere... expecting to hear about the tenacity library here xD
I'm stuck in a "data engineer" position where all my boss will let me do is debug SQL scripts and it's killing me
how long have you been there?
QUIT
Leave if you can. You are doing yourself no favors by wasting years at a job you don’t like and especially one that isn’t improving your skills
I have to use a shell script to execute MySQL queries, then pass the result as an argument to my Python scripts >_< wish I could just use mysql connector
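For comparison, here is a minimal sketch of what querying directly from Python looks like through a DB-API style connection, with no shell step in between. It uses the standard library's `sqlite3` with an in-memory database purely as a stand-in, since no MySQL server is assumed here. With a MySQL driver the connect call would differ and the placeholder style would be `%s` instead of `?`. The table, column names, and `fetch_active_users` are hypothetical.

```python
import sqlite3


def fetch_active_users(conn, min_logins: int) -> list[tuple]:
    """Run a parameterized query and return the rows as a list of tuples."""
    cur = conn.cursor()
    # sqlite3 uses "?" placeholders; other DB-API drivers may use a different style.
    cur.execute(
        "SELECT name, logins FROM users WHERE logins >= ? ORDER BY name",
        (min_logins,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


# In-memory database with a made-up table, standing in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, logins INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", 12), ("bo", 3), ("cy", 7)],
)
conn.commit()

# The result is an ordinary Python list, ready to use without shell arguments.
rows = fetch_active_users(conn, min_logins=5)
print(rows)
```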
How can you know pandas every which direction, but not understand a dictionary? You wouldn't know how to construct a dataframe from a dictionary of lists (often my approach when webscraping) or know how to use the map function to change categorical names. Wes McKinney (who created pandas) even says that a pandas series data structure is similar to an ordered dictionary.
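A quick sketch of the two things mentioned above: building a DataFrame from a dictionary of lists, and using `map` with a dictionary to rename categories. The column names and values are invented for illustration.

```python
import pandas as pd

# A dictionary of lists: each key becomes a column, each list holds that column's values.
scraped = {
    "product": ["widget", "gadget", "gizmo"],
    "size": ["S", "L", "M"],
    "price": [9.99, 24.50, 14.00],
}
df = pd.DataFrame(scraped)

# Series.map with a dictionary swaps each category code for a readable name.
size_names = {"S": "small", "M": "medium", "L": "large"}
df["size"] = df["size"].map(size_names)

print(df)
```

Any value missing from the mapping dictionary would come back as NaN, so it is worth checking that the dictionary covers every category.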
I've gone through possibly all the Python courses on Udemy but have never seen a course focused on data engineering and the good-to-know libraries. Sometimes there is one short chapter about one of them, but nothing complete. Does anyone have any tips?
You are awesome.
good list, but most of your psycopg2 stuff prob would have been easier with sqlalchemy
Regarding APIs, I always thought we should learn how to pull from them, not actually create them. So where does Flask fit into all that?
Depends on what product is built on top of your db/dw. You might need to build an api on top of your warehouse to power your product.
@playea123 Cool. And do you know what kind of custom API could run over a DW? I could only think of such a case in an OLTP context...
@gabrielkolletalves493 depends on how you model your DW. If you want something similar to an OLTP, Snowflake rolled out hybrid tables a few months ago
hey! leave gcp libs alone 😂