DSDSD - Dutch Seminar on Data Systems Design
Efficient CSV Parsing - On the Complexity of Simple Things - Pedro Holanda
DSDSD - THE DUTCH SEMINAR ON DATA SYSTEMS DESIGN:
We hold bi-weekly talks on Fridays from 3:30 PM to 5 PM CET for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch Data Systems community to unite, foster collaborations between its members, and bring in high-quality international speakers. We would like to invite all researchers, especially PhD students, who are working on related topics to join the events. It is an excellent opportunity to receive feedback early on from researchers in your field.
Website: dsdsd.da.cwi.nl/
X: x.com/dsdsdnl
Speaker: Pedro Holanda
Title: Efficient CSV Parsing: On the Complexity of Simple Things
Abstract: In this talk, we will revisit different CSV parsing
implementations in DuckDB and compare them with the current
implementation. The bulk of the talk discusses the design and
implementation decisions in DuckDB's current CSV parser. In particular,
we will examine the parallel algorithm, the CSV buffer manager, and the
transitions of the CSV state machine. Disclaimer: This talk is not for
the faint of heart; some very exotically built CSV files will be depicted.
Bio: Pedro is an early contributor to DuckDB and currently works as a
software engineer at DuckDB Labs, focusing on core and integration
aspects of DBMS technology. He completed his PhD at the Database
Architectures group at CWI, researching Indexes for Interactive Data
Analysis.
Views: 1,267

Videos

Lambda functions in the duck's nest - Tania Bogatsch
202 views · 6 months ago
C3: Compressing Correlated Columns - Thomas Glas
160 views · 6 months ago
Towards LLM-augmented Database Systems - Carsten Binnig
281 views · 6 months ago
ALP: Adaptive Lossless floating-Point Compression - Leonardo Kuffó (CWI)
494 views · 1 year ago
Cardinality Estimation Graphs by Semih Salihoğlu - University of Waterloo
242 views · 1 year ago
LingoDB: Open compilation and optimization framework for sustainable data processing - M. Jungmair
280 views · 1 year ago
Decoupling Compute and Storage for Stream Processing Systems by Yingjun Wu - CEO RisingWave Labs
713 views · 1 year ago
Implementing InfluxDB IOx, "from scratch" using Apache Arrow, DataFusion, and Rust by Andrew Lamb
3.5K views · 1 year ago
Shredding deeply nested JSON, one vector at a time by Laurens Kuiper - DuckDB Labs
994 views · 1 year ago
Provenance Research in Gray Systems Lab at Microsoft by Fotis Psallidas
132 views · 1 year ago
Database Schemas in the Wild: What Can We Learn from a Large Corpus of Relational Database Schemas?
177 views · 1 year ago
Efficient detection of multivariate correlations in static and streaming data by Jens d'Hondt
77 views · 1 year ago
Stardog query optimiser: Join ordering and cardinality estimations for graph queries by Pavel Klinov
151 views · 1 year ago
Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning by David Vos
257 views · 1 year ago
Leveraging Generative AI for Data Processing by Immanuel Trummer [DSDSD 2023]
298 views · 1 year ago
Data Science through the Looking Glass and what we found there by Bojan Karlaš
157 views · 2 years ago
Data Management for Emerging Problems in Large Networks by Arijit Khan
538 views · 2 years ago
Building machine learning systems for the era of data-centric AI by Ce Zhang
146 views · 2 years ago
mlinspect - Lightweight Inspection of Native Machine Learning Pipelines by Stefan Grafberger
123 views · 2 years ago
Algorithms for Relational Knowledge Graphs by Martin Bravenboer
527 views · 2 years ago
The LDBC Social Network Benchmark: Business Intelligence workload by Gábor Szárnyas
149 views · 2 years ago
Taking a Peek under the Hood of Snowflake's Metadata Management by Max Heimel
798 views · 2 years ago
Glidesort: Efficient In-Memory Adaptive Stable Sorting on Modern Hardware by Orson Peters
1.9K views · 2 years ago
Learned DBMS Components 2.0: From Workload-Driven to Zero-Shot Learning By Carsten Binnig
117 views · 2 years ago
Parallel Grouped Aggregation in DuckDB By Hannes Mühleisen
899 views · 2 years ago
Efficient collaborative analytics with no information leakage: An idea whose time has come | Vasiliki
115 views · 2 years ago
Opening the Black Box of Internal Stream Processor State By Jim Verheijde
50 views · 2 years ago
Push-Based Execution in DuckDB by Mark Raasveldt (CWI)
1.4K views · 2 years ago
Building Advanced SQL Analytics From Low-Level Plan Operator By Thomas Neumann (TU Munich)
1.2K views · 2 years ago

COMMENTS

  • @JohnMyers-w8x · 1 month ago

    The key concept for me was "the closer a double is to 0, the more exact its representation"

  • @zhengyuzhang7992 · 3 months ago

    very nice and detailed lecture

  • @madacol · 5 months ago

    It was confusing to understand what happened at 12:47 when the indexes [0,2,3,3,4] appeared. Here is what's going on: the filter removes 2 entries, so only indexes 0, 2, 3 remain, and those are exactly the first 3 elements of the vector [ *0,2,3* , 3,4], so the last 2 elements (3, 4) are ignored. This is specified by the first vector [ [0,2] , [2,1] ], which says that the first 2 elements form the first row and the third element forms the second row. (There is no mention of what to do with the remaining indexes, so they are ignored.)

    • @DouEnergy · 6 days ago

      Thanks for the explanation

  • @manickbadsah · 6 months ago

    Thanks for the amazing presentation.

  • @timpz · 8 months ago

    Great presentation and interesting idea! It would be interesting to see what the compression ratio would be for 32-bit floats. Bit-packing the integers would be SIMD-friendly, but I suspect it would reduce the ratio by a significant amount for "normal" values.

    • @LeonardoKuffo · 5 months ago

      With 32-bit floats, compression ratios would be halved, since the same numbers would still be packed into the same number of bits after FOR+BP. But the algorithm would remain the same. It would be interesting to see how ALP then keeps up with the other algorithms (Chimp, Zstd, etc.). We have already implemented ALP in DuckDB for floats as well, so you can run some quick tests there using the DuckDB Python API. What we saw is that 32-bit floats are more common in an ML context (e.g. model weights), in which case ALPrd would be used given the randomness of these numbers. Here we can still save a few more bits than, for example, Zstd if the weights need to be stored losslessly!


  • @-h2780 · 1 year ago

    There really are a lot of crazy people out there

  • @nahblue · 1 year ago

    👏

  • @刘陶峰 · 2 years ago

    good

  • @howardzhang6655 · 2 years ago

    thanks

  • @sufalpal5041 · 2 years ago

    Data really powers everything that we do

  • @sweetmelodies9678 · 2 years ago

    “The world is now awash in data and we can see consumers in a lot clearer ways.”....👍👍👍