DSDSD - Dutch Seminar on Data Systems Design

Efficient CSV Parsing - On the Complexity of Simple Things - Pedro Holanda

DSDSD - THE DUTCH SEMINAR ON DATA SYSTEMS DESIGN:
We hold bi-weekly talks on Fridays from 3:30 PM to 5 PM CET for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch Data Systems community to unite, foster collaborations between its members, and bring in high-quality international speakers. We would like to invite all researchers, especially PhD students, who are working on related topics to join the events. It is an excellent opportunity to receive feedback early on from researchers in your field.
Website: dsdsd.da.cwi.nl/
X: x.com/dsdsdnl
Speaker: Pedro Holanda
Title: Efficient CSV Parsing: On the Complexity of Simple Things
Abstract: In this talk, we will revisit different CSV parsing
implementations in DuckDB and compare them with the current
implementation. The bulk of the talk is to discuss the design and
implementation decisions in DuckDB's current CSV Parser. In particular,
we will examine the parallel algorithm, the CSV buffer manager, and the
transitions of the CSV state machine. Disclaimer: This talk is not for
the faint of heart; some very exotically built CSV files will be depicted.
Bio: Pedro is an early contributor to DuckDB and currently works as a
software engineer at DuckDB Labs, focusing on core and integration
aspects of DBMS technology. He completed his PhD at the Database
Architectures group at CWI, researching Indexes for Interactive Data
Analysis.
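
The abstract above mentions the transitions of a CSV state machine. As a rough illustration of the general idea only, here is a minimal character-by-character sketch in Python. It is not DuckDB's implementation: the state names, the `parse_csv` function, and the choice to treat only "\n" as a row terminator are all assumptions made for this example.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # at the beginning of a field
    UNQUOTED = auto()         # inside an unquoted field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # saw a quote inside a quoted field (close or escape)


def parse_csv(text, delimiter=",", quote='"'):
    """Hypothetical toy parser: returns a list of rows, each a list of strings."""
    rows, row, field = [], [], []
    state = State.FIELD_START
    for ch in text:
        if state is State.QUOTED:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)  # delimiters and newlines are literal here
            continue
        if state is State.QUOTE_IN_QUOTED and ch == quote:
            field.append(quote)   # doubled quote means one literal quote
            state = State.QUOTED
            continue
        # Remaining cases: FIELD_START, UNQUOTED, or a quoted field that just closed.
        if ch == delimiter:
            row.append("".join(field))
            field = []
            state = State.FIELD_START
        elif ch == "\n":
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
            state = State.FIELD_START
        elif ch == quote and state is State.FIELD_START:
            state = State.QUOTED
        else:
            field.append(ch)
            state = State.UNQUOTED
    # Flush a final row that is not terminated by a newline.
    if field or row or state is not State.FIELD_START:
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'a,"b,""c""",d\n1,"x\ny",3\n'
parsed = parse_csv(sample)
print(parsed)
```

Even this toy version has to decide how to treat a quote followed by ordinary text, carriage returns, and unterminated quotes, which hints at why the talk calls CSV parsing a deceptively simple problem.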

Videos

Lambda functions in the duck's nest - Tania Bogatsch

24:56

202 views, 6 months ago

C3: Compressing Correlated Columns - Thomas Glas

25:29

160 views, 6 months ago

Towards LLM-augmented Database Systems - Carsten Binnig

49:36

281 views, 6 months ago

ALP: Adaptive Lossless floating-Point Compression - Leonardo Kuffó (CWI)

28:37

494 views, 1 year ago

Cardinality Estimation Graphs by Semih Salihoğlu - University of Waterloo

54:09

242 views, 1 year ago

LingoDB: Open compilation and optimization framework for sustainable data processing - M. Jungmair

26:45

280 views, 1 year ago

Decoupling Compute and Storage for Stream Processing Systems by Yingjun Wu - CEO RisingWave Labs

50:25

713 views, 1 year ago

Implementing InfluxDB IOx, "from scratch" using Apache Arrow, DataFusion, and Rust by Andrew Lamb

48:47

3.5K views, 1 year ago

Shredding deeply nested JSON, one vector at a time by Laurens Kuiper - DuckDB Labs

21:25

994 views, 1 year ago

Provenance Research in Gray Systems Lab at Microsoft by Fotis Psallidas

44:06

132 views, 1 year ago

Database Schemas in the Wild: What Can We Learn from a Large Corpus of Relational Database Schemas?

21:08

177 views, 1 year ago

Efficient detection of multivariate correlations in static and streaming data by Jens d'Hondt

20:11

77 views, 1 year ago

Stardog query optimiser: Join ordering and cardinality estimations for graph queries by Pavel Klinov

59:34

151 views, 1 year ago

Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning by David Vos

17:29

257 views, 1 year ago

Leveraging Generative AI for Data Processing by Immanuel Trummer [DSDSD 2023]

42:10

298 views, 1 year ago

Data Science through the Looking Glass and what we found there by Bojan Karlaš

30:01

157 views, 2 years ago

Data Management for Emerging Problems in Large Networks by Arijit Khan

45:31

538 views, 2 years ago

Building machine learning systems for the era of data-centric AI by Ce Zhang

37:10

146 views, 2 years ago

mlinspect - Lightweight Inspection of Native Machine Learning Pipelines by Stefan Grafberger

28:04

123 views, 2 years ago

Algorithms for Relational Knowledge Graphs by Martin Bravenboer

53:33

527 views, 2 years ago

The LDBC Social Network Benchmark: Business Intelligence workload by Gábor Szárnyas

21:52

149 views, 2 years ago

Taking a Peek under the Hood of Snowflake's Metadata Management by Max Heimel

37:13

798 views, 2 years ago

Glidesort: Efficient In-Memory Adaptive Stable Sorting on Modern Hardware by Orson Peters

17:00

1.9K views, 2 years ago

Learned DBMS Components 2.0: From Workload-Driven to Zero-Shot Learning By Carsten Binnig

44:33

117 views, 2 years ago

Parallel Grouped Aggregation in DuckDB By Hannes Mühleisen

29:00

899 views, 2 years ago

Efficient collaborative analytics with no information leakage:An idea whose time has come | Vasiliki

37:03

115 views, 2 years ago

Opening the Black Box of Internal Stream Processor State By Jim Verheijde

28:08

50 views, 2 years ago

Push-Based Execution in DuckDB by Mark Raasveldt (CWI)

30:28

1.4K views, 2 years ago

Building Advanced SQL Analytics From Low-Level Plan Operator By Thomas Neumann (TU Munich)

26:50

1.2K views, 2 years ago

COMMENTS

@JohnMyers-w8x 1 month ago
The key concept for me was "the closer a double is to 0, the more exact its representation"
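
One way to read the remark above: the gap between adjacent doubles shrinks as values approach 0, so the absolute rounding error is smaller there, while relative precision stays roughly constant for normal values. A minimal Python sketch of this follows; the helper name `spacing` is made up for illustration, and `math.ulp` requires Python 3.9 or later.

```python
import math


def spacing(x: float) -> float:
    """Distance from x to the next representable double away from zero."""
    return math.ulp(x)


# The absolute gap grows with magnitude; the relative gap stays near 2**-52.
for x in (0.001, 1.0, 1000.0, 1e15):
    gap = spacing(x)
    print(f"x = {x:g}: gap = {gap:.3e}, relative gap = {gap / x:.3e}")
```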
@zhengyuzhang7992 3 months ago
very nice and detailed lecture
@madacol 5 months ago
It was confusing to understand what happened at 12:47 when the indexes [0,2,3,3,4] appeared. Here is what's going on: the filter removes 2 entries, so only indexes 0, 2, 3 remain, and those are exactly the first 3 elements of the vector [ *0,2,3* , 3,4], so the last 2 elements (3, 4) are ignored. This is specified by the first vector [ [0,2] , [2,1] ], which says that the first 2 elements form the first row and the third element forms the second row. (There's no mention of what to do with the rest of the indexes, so they are ignored.)
@DouEnergy 6 days ago
Thanks for the explanation
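
The layout described in @madacol's comment, a vector of (offset, length) entries pointing into a flat child vector, can be sketched as follows. This is a toy model built only from the numbers in that comment, not DuckDB's actual code; the function name `materialize_lists` is hypothetical.

```python
def materialize_lists(entries, child):
    """Build one Python list per (offset, length) entry from the flat child vector."""
    return [child[offset:offset + length] for offset, length in entries]


entries = [(0, 2), (2, 1)]   # row 0: 2 elements from offset 0; row 1: 1 element from offset 2
child = [0, 2, 3, 3, 4]      # the trailing 3, 4 are not referenced by any entry

rows = materialize_lists(entries, child)
print(rows)
```

Because no entry points at the last two child elements, they are simply never read, which matches the comment's observation that they are ignored.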
@manickbadsah 6 months ago
Thanks for the amazing presentation.
@timpz 8 months ago
Great presentation and interesting idea! It would be interesting to see what the compression ratio would be for 32-bit floats. Bit-packing the integers would be SIMD-friendly, but I suspect it would reduce the ratio by a significant amount for "normal" values.
@LeonardoKuffo 5 months ago
With 32-bit floats, compression ratios would be halved, since the same numbers would still be packed into the same number of bits after FOR+BP. But the algorithm would remain the same. It would be interesting to see how ALP then keeps up with other algorithms (Chimp, Zstd, etc.). We have already implemented ALP in DuckDB for floats as well, so you can run some quick tests there using the DuckDB Python API. What we saw is that 32-bit floats are more common in an ML context (e.g. model weights), in which case ALPrd would be used given the randomness of these numbers. Here we can still save a few more bits than, for example, Zstd if the weights need to be stored losslessly!
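
The halving argument in the reply above comes down to simple arithmetic: after frame-of-reference (FOR) encoding and bit-packing (BP), the packed width depends on the value range, not on the width of the source type. A simplified sketch follows. It assumes the floats have already been turned into integers, ignores the cost of storing the base and any exceptions, and uses hypothetical function names and made-up sample values.

```python
def for_bitpack_width(values):
    """Bits per value after subtracting the minimum (FOR) and bit-packing the deltas."""
    base = min(values)
    max_delta = max(v - base for v in values)
    return max(max_delta.bit_length(), 1)


def compression_ratio(values, source_bits):
    """Source bits per value divided by packed bits per value (metadata ignored)."""
    return source_bits / for_bitpack_width(values)


# Made-up integers standing in for encoded decimal values.
encoded = [1234, 1250, 1201, 1299]

width = for_bitpack_width(encoded)
ratio_64 = compression_ratio(encoded, 64)
ratio_32 = compression_ratio(encoded, 32)
print(width, ratio_64, ratio_32)
```

The packed width is the same in both cases, so the ratio for a 32-bit source is exactly half of the ratio for a 64-bit source.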
@-h2780 1 year ago
There really are a lot of crazy people out there
@nahblue 1 year ago
👏
@刘陶峰 2 years ago
good
@howardzhang6655 2 years ago
thanks
@sufalpal5041 2 years ago
Data really powers everything that we do
@sweetmelodies9678 2 years ago
“The world is now awash in data and we can see consumers in a lot clearer ways.”....👍👍👍