TheAverageEngineer
United States
Joined 22 Feb 2023
No-holds-barred Data Engineering. Talk of Data Engineering, Spark, SQL, Python, Rust, Go, Data Warehouses, Data Lakes, Databricks, Delta Lake. The list goes on. I cover everything Data Engineering; nothing is sacred.
Daft on Databricks Unity Catalog | The Spark Killer | Reading Delta Lake with Daft
The Average Engineer introduces Daft, a newer Dataframe tool in Python, built with Rust, that has the PERFECT integration with Databricks Unity Catalog Delta Lake Tables.
Views: 141
Videos
DuckDB INSIDE Postgres | pg_duckdb Postgres Extension Explained | Performance - DuckDB vs Postgres
Views: 316 · 1 day ago
The Average Engineer goes deep into testing Postgres vs DuckDB with the new pg_duckdb Postgres Extension. You will not believe the query performance results!
WordPress Drama Explained | WordPress is Dying | What's Happening with WordPress??!!
Views: 452 · 21 days ago
The Average Engineer gives you a quick and simple explanation of what is happening with all the WordPress Drama!
How to make the perfect PR | Small Engineering Changes for a BIG win | PRs can be a nightmare!
Views: 103 · 28 days ago
The Average Engineer teaches you how to make the perfect pull request (PR) to avoid the nightmare of ego and back-and-forth fights. Keeping Engineering Changes small and to the point is step number one.
The Death of the Data Warehouse (death by Lake House)
Views: 224 · 1 month ago
The Data Warehouse is dead now, killed by the all-powerful Lake House. What is a Data Warehouse, and what is a Lake House? The Lake House of Delta Lake and Spark has replaced SQL Server Data Warehouses.
Amazon RTO | Amazon employees must return to the office! | Amazon is firing lazy people.
Views: 530 · 1 month ago
In a strange twist, the giant Amazon is mandating an RTO (Return To Office) policy for ALL their employees! What's going to happen next? Is everyone quitting or getting fired from Amazon??
There are 3 Types of Data Engineers ... Which are You?? | Data Engineering | Career
Views: 310 · 1 month ago
Have you ever wondered what the different types of Data Engineers are? Career paths in Data Engineering? The Average Engineer tells the hidden truth about Data Engineering types.
Streaming Data from Postgres to Delta Lake Unity Catalog in Databricks | Postgres | Delta Lake
Views: 153 · 2 months ago
The Average Engineer shows you how to stream real-time data from a Postgres RDS instance on AWS into Databricks Unity Catalog Delta Lake tables with Estuary.
Polars in 2 Minutes | Introduction to Polars | Python Polars Overview
Views: 406 · 2 months ago
The Average Engineer gives the Introduction and Overview of Python's new Dataframe tool called Polars, in under 2 minutes. Learn Polars now!
The Rise of the Evil Notebook Engineer | Databricks Notebooks | Data Engineering | Software
Views: 416 · 2 months ago
The Average Engineer lets loose on all those Evil Notebook Engineers who can't conform to Engineering Best Practices. Databricks Notebooks will kill you someday.
Replace Databricks Spark with Polars || Save Money on Databricks || Polars || Apache Spark
Views: 340 · 2 months ago
The Average Engineer shows you how to cut Databricks costs by replacing Apache Spark with Polars.
Snowflake Is Dying | Snowflake Data Breach! | Snowflake Data | Feature Doom
Views: 488 · 3 months ago
The Average Engineer looks to the sky and asks what in the world is happening with Snowflake? Data breaches, no new features, high costs, will Snowflake make it, or die on the vine??
CI/CD for Data Engineering | Continuous Delivery - Continuous Deployment | Data Engineering
Views: 294 · 3 months ago
The best Data Teams are the ones that embrace CI/CD best practices for the deployment and management of code. Data Teams SUCK at CI/CD. This is CI/CD for Data Engineers.
Introduction to Daft | Distributed Dataframes | Python | Polars vs Daft
Views: 375 · 3 months ago
The Average Engineer pits two Python Dataframe libraries against each other: Daft vs Polars. Can the Daft Dataframe library, written in Rust, beat Polars, the GOAT?
Premature Optimization Is Evil? | Software Engineering | Software Design
Views: 102 · 4 months ago
The Average Engineer takes on both sides of the Premature Optimization Is Evil debate.
How Tech Debt, Databricks, and Spark UDFs ruined my weekend | Spark | Databricks | Tech Debt
Views: 174 · 4 months ago
Databricks Cost Savings || Databricks || Saving Money on Databricks
Views: 227 · 5 months ago
Snowflake Summit 2024 Key Highlights || Snowflake Summit || 2024 || Key Takeaways
Views: 370 · 5 months ago
Databricks Buys Tabular || Eats Snowflake's Lunch || Databricks vs Snowflake || Delta Lake vs Iceberg
Views: 1.4K · 5 months ago
Python Classes are Useless and The Devil?! || Python Programming || OOP is dead? || Never write OOP!
Views: 368 · 5 months ago
Introduction to Databricks Unity Catalog || Databricks Setup || Spark + Delta Lake || Unity Catalog
Views: 264 · 5 months ago
Developing Production Level Databricks Pipelines || Spark || Delta Lake || Databricks Data Pipelines
Views: 976 · 5 months ago
Data Warehouse Battles || Redshift vs Snowflake vs BigQuery vs Databricks || Lake House vs DWs
Views: 839 · 6 months ago
Reading JSON with Rust || Explicit vs Vague Programming || Python vs Rust
Views: 144 · 6 months ago
Google Fires Python || Google Layoffs || Google Fires Python Team || Google Outsources!
Views: 7K · 6 months ago
Data Engineering Survey 2024 Reaction || Data Engineering Trends in 2024
Views: 282 · 6 months ago
Data Analytics Suck! || Analytics is the Worst Job Ever!
Views: 1.5K · 6 months ago
Why you CAN'T get to Senior Software Engineer || Skills needed to move from Junior to Senior Engineer
Views: 638 · 6 months ago
Databricks Doubles Costs. Reddit Goes Wild. I get in trouble.
Views: 1.2K · 7 months ago
Apache Spark Connect ... Writing Spark with Rust!
Views: 638 · 7 months ago
I thought the same thing when I saw Daft. There was a great demo of it at the Ray conference earlier this year
Some obvious remarks with a lot of rage bait on top. What BS.
You're right. Why would I want to be in the office when my whole team is in another office in another country? I've been a remote employee since the pandemic. Really, who likes going to the office? I think the answer is people who don't have family, friends, or hobbies, and who like spending 3 or 4 hours driving.
Fair enough comments. If analytical processing risks interfering with transaction processing, it does not really help to do full table scans. Having said that: transactional processing is brilliant if you can drip-feed changes to a DuckDB database sitting elsewhere. That may be faster than using materialized views with aggregate refreshes. Something like BemiDB, which fills a DuckDB database from Postgres, can be handy. Maybe compare the speed between that extract and Postgres.
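For illustration, a minimal sketch of that drip-feed pattern using DuckDB's postgres extension from Python. The connection string, table name, and id column are all hypothetical:

```python
import duckdb

con = duckdb.connect("analytics.duckdb")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=app host=localhost user=app' AS pg (TYPE postgres)")

tables = {row[0] for row in con.execute("SHOW TABLES").fetchall()}
if "orders" not in tables:
    # First run: seed the local analytical copy from the live OLTP table.
    con.execute("CREATE TABLE orders AS SELECT * FROM pg.public.orders")
else:
    # Later runs: drip-feed only the rows not copied yet, keyed on a
    # monotonically increasing id column.
    con.execute("""
        INSERT INTO orders
        SELECT * FROM pg.public.orders
        WHERE id > (SELECT coalesce(max(id), 0) FROM orders)
    """)
```

Full table scans then hit the DuckDB copy instead of the transactional database, which is the point being made above.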
Thank you for making this! I combed through all the pg_duckdb promotional material trying to figure out what amazing innovation made pg_duckdb analytical queries so much faster than their pure Postgres equivalents. After all, the main thing that's supposed to make most OLAP databases (including DuckDB) so much faster is compressed columnar storage, and pg_duckdb doesn't change the row-oriented storage method of Postgres. I couldn't find an answer better than "it's magic." Glad to know that the "magic" is just to not use the main optimization tool that Postgres has. I will say that it's still impressive that the pg_duckdb query performs at least on the same order of magnitude as Postgres (with indexes). I think this should have been the main selling point. If you are an analyst who has to write a bunch of ad-hoc queries, it can be really painful and time-consuming to have to think through the indexes at every turn. An index that works well for some analytical query might be useless against a tweaked version. Indexes are a resource that take time to build, especially for large data sets. If you can get close to index-based performance but eliminate the need for index building altogether, that usually amounts to a substantial net gain, especially for ad-hoc work.
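A rough way to run the comparison this comment describes; the table, files, and connection string are placeholders, and psycopg 3 plus a Parquet export of the same data are assumed:

```python
import time

import duckdb
import psycopg  # psycopg 3

SQL = "SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count"

# The row-store side: Postgres, whose speed here depends on indexes.
with psycopg.connect("dbname=bench") as pg:
    t0 = time.perf_counter()
    pg.execute(SQL).fetchall()
    print(f"postgres: {time.perf_counter() - t0:.2f}s")

# The columnar side: DuckDB scanning a Parquet export of the same table,
# with no indexes to design or build at all.
t0 = time.perf_counter()
duckdb.sql(
    "SELECT passenger_count, avg(total_amount) FROM 'trips.parquet' GROUP BY passenger_count"
).fetchall()
print(f"duckdb:   {time.perf_counter() - t0:.2f}s")
```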
TL;DR(?): a programmer makes a very good system but is being a perfectionist. Other people say his current craftwork is fine. They disagree, more people find out, and everyone takes sides.
For an opinion, this one's pretty spot on. I would say that linting/formatting errors shouldn't wait for CI/CD; they should instead be trapped with pre-commit hooks so they never reach the point where another person has to witness them.
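As a sketch of that idea, a hand-rolled hook dropped into .git/hooks/pre-commit could look like this. The pre-commit framework is the more common route; ruff is just an example linter/formatter and is assumed to be installed:

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit - block the commit if lint or format checks fail.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],              # lint
    ["ruff", "format", "--check", "."],  # formatting, without rewriting files
]

for cmd in CHECKS:
    if subprocess.run(cmd).returncode != 0:
        print(f"commit blocked by: {' '.join(cmd)}", file=sys.stderr)
        sys.exit(1)
```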
I do system migrations for a living, which involves a lot of data. I'm a developer who uses Python code to do all transformations and loading. There are also analysts on the projects who use SQL to do their analysis, so they can check that the data looks like it should and decide what outscoping should be done. This makes it necessary that the tables I'm saving support querying with SQL. Is there a data lake solution that has full SQL support? Or should I adopt a hybrid approach where I save JSON and mapping tables (.csv) to the data lake, but save all tables to e.g. Microsoft SQL Server? Thanks.
Nice overview. I wasn't clear on what DuckDB was about but it's clearer now. I think I prefer the polars snippet due to the better modularity. It's closer to the unix philosophy: do one thing and do it well. Integrating with cloud storage providers in a database engine seems a bit weird to me. I do agree that the appeal is obvious and my preference isn't a strong one. The DuckDB way is also way more portable, as you can just copy the SQL code and run it wherever.
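The trade-off this comment describes reads roughly like this side by side; the local Parquet file is hypothetical, though both engines can also point at s3:// URIs:

```python
import duckdb
import polars as pl

src = "events.parquet"  # hypothetical file

# Polars: composable expressions, lazy evaluation, "do one thing well".
counts_pl = (
    pl.scan_parquet(src)
    .group_by("event_type")
    .agg(pl.len().alias("n"))
    .collect()
)

# DuckDB: a single SQL string you can copy and run anywhere DuckDB runs.
counts_db = duckdb.sql(
    f"SELECT event_type, count(*) AS n FROM '{src}' GROUP BY event_type"
).pl()
```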
But traditional RDBMS will continue to be used to serve data to applications, right? Because applications require fast response times, and RDBMSs are still faster at serving data for small to medium data volumes.
I agree with you, for now, but that answer will be different 3 years from now. With the advent of things like Lake House apps from Databricks, along with Serverless compute ... that discussion is changing quickly.
duckdb
SQL programmers are bad programmers? That's nothing but ignorance. SQL can do in one statement what you write pages of crazy code for, code that is ultimately converted to SQL anyway. Stop your BS advice. People are more knowledgeable than you.
wow, all the information in 2 minutes. Thank you so much
Who does RTO serve? For some, WFH is the right path - it avoids the commute (unpaid time, travel costs, harm to environment and health), reduces the inevitable distractions in open floor plan workplaces, and allows many people to work where they thrive. It can be counterproductive for some, and for jobs with poor ability to assess employee performance it's also a great way to hide incompetence. In-office can be useful for social types (though that socialization might come at the cost of others' productivity) or for work that simply can't be effectively done remotely. It's also a performative way for the C-level to pretend they're doing something meaningful to address financial woes, and gives managers a reason to exist (after all, you can't manage people when you can't hover over their shoulders, apparently).
I've been remote since COVID, and full time in office before that. I miss the whiteboarding sessions the most. There's a lot less idea sharing remote because you're chatting less. It also makes my days blur together; the last four years went by in an instant. Startup offices aren't beige hell holes. I worked on site at AWS and it wasn't either.
It would be helpful to see more of the code, while you are talking about the code.
I got an offer recently for 'Junior Data Software Engineer'. Do you think it's somehow another way of naming one of these groups?
Probably Group 1 unless you're writing code and not SQL. I've hired a Junior Data Engineer as a primarily Group 2 engineer (lots of Spark pipelines, bash, etc ... not much SQL)
How would Group 3 folks still be called Data Engineers?
Because traditional Staff-level+ Software Engineers have little to no data experience, depending on their background. The people who make tools like Polars clearly have a next-level Data Engineering understanding, to be able to provide the tooling that Data Engineers use.
Link to full article dataengineeringcentral.substack.com/p/there-are-3-types-of-data-engineers
RE: os.environ.get Why not use pydantic to load your env vars? It'll ensure the types are correctly cast, and create a "define once, use anywhere" kind of deal. Unless os.environ.get was only for demonstration purposes, in which case I've said nothing.
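For anyone following along, a minimal sketch of what this comment means, assuming the pydantic-settings package and made-up variable names:

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Field names map to env vars (DATABASE_URL, BATCH_SIZE); values are
    # validated and cast from strings exactly once, at startup.
    database_url: str
    batch_size: int = 500

settings = Settings()  # raises a clear validation error if DATABASE_URL is unset
```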
Yeah for the most part, the syntax is better than pandas
Are the Databricks jobs you are talking about notebooks? If not, could you describe what kind of jobs you are using to presumably run Python (PySpark) code? I'm not yet familiar with the capabilities of Databricks.
I'm talking about large prepackaged (zipped) PySpark code that is submitted to a cluster as a Job.
@theaverageflatlander is this method easy to use? Sounds a bit weird to me, but I suppose this is the way it's commonly done? Is it easier if you have a self-hosted server from which you orchestrate and send the requests?
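For context, the packaged PySpark code described in this thread usually boils down to an entry point like the following sketch; every name in it is made up:

```python
# main.py - entry point inside a zipped PySpark package, submitted
# to a cluster as a Job instead of being run as a notebook.
from pyspark.sql import DataFrame, SparkSession


def run(spark: SparkSession) -> None:
    events: DataFrame = spark.read.table("raw.events")  # hypothetical table
    (
        events.groupBy("event_type")
        .count()
        .write.mode("overwrite")
        .saveAsTable("analytics.event_counts")
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("packaged-job").getOrCreate()
    run(spark)
```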
How are you hosting Apache Airflow? Do you guys have a Linux server at the office? I work at a small startup and don't feel like this is something I would want to be responsible for (I'm talking about setting up a self-hosted server).
We use MWAA from AWS. It's very expensive. Airflow is easy to manage if you want. Look into Astronomer, Dagster, Prefect, Mage, etc. They are all the same.
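Whichever host you pick, the Airflow side looks the same; a minimal DAG sketch with made-up ids and schedule:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    print("move some data")  # stand-in for the real pipeline step


with DAG(
    dag_id="nightly_extract",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ spelling; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```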
Very helpful. Please use OBS or something to record your screen and your face in one place.
Thank you for sharing this. Really useful. The audio could be a little louder.
I wish you would’ve asked how many work with streaming data (near real time and/or real time data) pipelines.
I'm extremely inexperienced and have only used notebooks in the cloud (for programs that need to run continuously). Is the alternative to notebooks just Docker containers? How does a company generally run Python code in production? I work at a startup and I'm the only technical guy.
"Is the alternative to notebooks just docker containers?" Oh man. No. Sorry, it's not you. It's just that this is a common question and shows how far away so many analyst-types are. The alternative to notebooks is small, modular .py files organized into something called a "package". Docker helps both notebooks AND packages be reproducible (if you leave and come back to your code 1 month from now, there's still a good chance you can install and run everything without fighting "Dependency hell")
@Eriddoch yes, obviously my Python project makes use of packages. I'm more wondering how I run these continuously in the cloud. Do I let Microsoft host the Docker container for me, or are there other ways of running my Python project in the cloud?
@Eriddoch Could you have a look at my previous reply? I'm still struggling to find out how you would run a Python project in the cloud without Docker containers.
@matthiaswarlop2316 You can still use Databricks to run Python packages as workflows, for example. Instead of the workflows pointing to notebooks in a Git repo, they can point to a package. An alternative, if you're not using Databricks, is indeed to run containers, in Kubernetes clusters for example.
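The "package" this thread keeps pointing at tends to have a conventional shape; one possible layout, with invented names:

```
my_pipeline/
├── pyproject.toml            # metadata + dependencies, so the project is installable
├── my_pipeline/
│   ├── __init__.py
│   ├── main.py               # entry point a workflow, job, or container can invoke
│   └── transforms.py         # small, testable functions instead of notebook cells
└── tests/
    └── test_transforms.py
```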
Skill issue 😂
Pure gold here 😄😄
What I'm struggling to understand is how it's even possible to make notebooks work in production such that there is an entire faction of people shilling for it. Like, how are their companies even functioning when 9.999999/10 times the notebook breaks because the cells are run out of order or something?
You can run notebooks (running the cells in order) in Databricks, as if it's production. I imagine they're nice prototypes, but manually tested code is just a long-term no-no.
Databricks sort of encourages you to use notebooks all the way. Orchestrating them through workflows solves the cell execution order issue. But maintaining them is insane as the Workspace scales up.
@recs8564 how do you run Python code in your company?
That blog got you a subscriber
thank you kindly
If you had a post every day, I’d read every one. Your posts are excellent.
wow, someone on the internet who I haven't made mad. It's a Christmas miracle.
Interesting, thanks for sharing!
Not everyone is a Christian.
don't be a baby.
very good advice ;)
Polars also has a lot of issues reading and writing parquet files. I recently logged a bug, and it got fixed a week later, but the update broke another part of writing parquet files in my code, so I just convert all my dataframes to pandas before writing them out.
What OS are you having problems on, Linux or Windows?
@theaverageengineer Linux, Ubuntu. Does this matter?
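The workaround described above looks something like this; pandas and pyarrow are assumed installed, and whether the native writer misbehaves will depend on your versions:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "amount": [9.5, 3.25, 7.0]})

# Native writer - the path the bug reports above are about.
df.write_parquet("out_native.parquet")

# The round-trip-through-pandas workaround from the comment.
df.to_pandas().to_parquet("out_pandas.parquet", index=False)
```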
Lots of words but saying nothing
Don't be a baby. Snowflake sucks, they lost to Databricks. Even your mom knows that.
What a childish ad hominem. Guess it explains why he has less than 1k subscribers.
@alexschievink where's your YouTube channel, you milquetoast programmer?
dataengineeringcentral.substack.com/p/snowflake-is-dying-on-the-vine
I don't think most places will leave Snowflake, because all these database migrations cost A LOT of money, especially for a place that uses Snowflake as a Data Warehouse. The people who need it more for AI/ML and Data Warehousing will probably be more likely to leave Snowflake for Databricks. Also, like you said, people on old tech who actually want to or need to move to the cloud will probably go to Snowflake first if they don't have enough people skilled with Python and Spark to move to Databricks.
I think Snowpipe, Snowpark, and dynamic tables in Snowflake are great. New features don't mean great software services. They could mean confusion.
Snowflake has lost the ML race.
Weirdly, you are like invisible on Google. I can't find your blog after finding it once.
👻
The substack link to full article. dataengineeringcentral.substack.com/p/cicd-for-data-engineers
Can you post the substack link
Make me. dataengineeringcentral.substack.com/p/cicd-for-data-engineers
@theaverageengineer Your SEO is really bad and could stunt your growth. I read your blog and wanted to find it later and just couldn't. "Average" is used in a ton of generic blog posts: "average day in the life", "average salary for xyz", etc. I ended up finding it a month later by searching for a very specific article. Just a heads up, because I LOVE your work and think you deserve more attention.
What data did you say you used? I didn't catch the name.
Resorting to setting up a raspberry pi cluster in my room so I can peacefully learn Spark
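You don't strictly need the Pi cluster to get started, for what it's worth; Spark's local mode runs the same API on one machine. A minimal sketch, assuming pyspark is installed:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark in-process using all available cores.
spark = SparkSession.builder.master("local[*]").appName("learn-spark").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

spark.stop()
```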
I for one am happy they're using Ray instead of spinning off with their own software. Though I agree on the (slight) bait-and-switch of their claims.
Also, please normalize the sound in your videos. Your voice is a lot quieter than your outro music 😅
Speaking of S3 - did AWS fix the bug where, if you know someone else's bucket name, you can spam that bucket with requests that end up as 404s, racking up the owner's bill because a call was made?
Good video Daniel, I ran to your Substack to check if you had made a new post but didn't find one, lol. I enjoy all the Rust DE content you have made.
Here it is dataengineeringcentral.substack.com/p/introduction-to-daft-vs-polars
@theaverageengineer damn, I must have missed it. Thank you so much, you're one of the best community creators out there.
To be clear, you are running a Databricks notebook with Airflow. What is your view on using Pulumi to push your Spark scripts to Databricks to run your pipeline? Do you use that approach too?
No. Only one pipeline was running a DB Notebook from Airflow - the one that broke. Everything else is in git, deployed via CI/CD, and run as DB jobs.
@theaverageengineer Awesome, I anticipate tutorials on this.
Heard it takes a year to learn the build system. Phew! As for "search", it is just a catalog.