Robin Moffatt
United Kingdom
Joined 26 Feb 2020
Robin has been speaking at conferences since 2009, including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find many of his talks online at rmoff.net/talks/, and his blog articles at rmoff.net/. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.
[Devoxx UK] 🚂 On Track with Apache Kafka: Building a Streaming ETL solution with Rail Data
As data engineers, we frequently need to build scalable systems working with data from a variety of sources and with various ingest rates, sizes, and formats. This talk takes an in-depth look at how Apache Kafka can be used to provide a common platform on which to build data infrastructure driving both real-time analytics as well as event-driven applications.
Using a public feed of railway data, it will show how to ingest data from message queues such as ActiveMQ with Kafka Connect, as well as from static sources such as S3 and REST endpoints. We’ll then see how to use stream processing to transform the data into a form useful for streaming to analytics in tools such as Elasticsearch and Neo4j. The same data will be used to drive a real-time notifications service through Telegram.
If you’re wondering how to build your next scalable data platform, how to reconcile the impedance mismatch between stream and batch, and how to wrangle streams of data, this talk is for you!
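To give a flavour of the ingest step, here is a minimal ksqlDB sketch of creating the ActiveMQ source connector. The URL, destination, and topic names are illustrative assumptions, not the talk's exact config:

-- Stream messages from an ActiveMQ queue into a Kafka topic
-- (worker and licence settings omitted for brevity; names are hypothetical)
CREATE SOURCE CONNECTOR ACTIVEMQ_SOURCE WITH (
  'connector.class'      = 'io.confluent.connect.activemq.ActiveMQSourceConnector',
  'activemq.url'         = 'tcp://activemq:61616',
  'jms.destination.name' = 'TRAIN_MVT_ALL_TOC',
  'kafka.topic'          = 'activemq_train_movements'
);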
⏱️Time codes
===========
00:00:00 Introduction
00:01:14 Trains as a source of event streams
00:01:48 Streaming ETL - overview
00:02:47 Streaming ETL with rail data - live visualisation demo
00:06:27 ksqlDB introduction and brief overview
00:07:24 Streaming ETL - ingest - detail
00:07:39 Data sources
00:08:13 Streaming data from ActiveMQ into Kafka
00:09:34 Ingesting data from Amazon S3 into Kafka
00:10:29 Ingesting data from CSV into Kafka
00:11:54 Ingest - recap
00:12:11 Streaming ETL - Transformation overview
00:12:45 Modeling rail data in an Entity-Relationship Diagram (ERD)
00:14:01 Extracting and wrangling ActiveMQ messages in Kafka
00:15:12 Using ksqlDB to process messages from ActiveMQ
00:16:03 Splitting multiple message types out from a single topic in Kafka using ksqlDB
00:16:58 Joining events to lookup data with ksqlDB (stream-table joins)
00:18:03 Building composite keys in ksqlDB
00:18:11 CASE statement in ksqlDB for decoding values
00:18:36 Schemas in stream processing, and using ksqlDB to define and apply them
00:20:37 The role of the Schema Registry in streaming ETL
00:21:18 Transformation - recap
00:21:40 Using the transformed data
00:22:33 Kafka Connect overview
00:22:59 Streaming from Kafka to Elasticsearch
00:23:36 Kafka to RDBMS (PostgreSQL)
00:24:28 Do you *actually* need a database?
00:25:00 Building materialised views in ksqlDB
00:25:42 Kafka to S3
00:27:02 Kafka to Neo4j
00:27:36 Building real-time alerting using Kafka
00:29:33 Monitoring and Maintenance
00:30:26 Conclusion & Summary
Confluent Cloud
============
Confluent Cloud provides fully managed Apache Kafka, connectors, Schema Registry, and ksqlDB. Try it out and use code RMOFF200 for money off your bill: www.confluent.io/confluent-cloud/tryfree/?.devx_ch.rmoff_xHV1mGXV5Ds&
Resources & Links
============
📓Slides: talks.rmoff.net/6GsyFX/on-track-with-apache-kafka-building-a-streaming-etl-solution-with-rail-data
👾Demo code: rmoff.dev/kafka-trains-code-01
Views: 1,968
Videos
[Kafka Summit] 🚢 All at Sea with Streams - Using Kafka to Detect Patterns in the Behaviour of Ships
1.1K views · 3 years ago
This talk is from #KafkaSummit Americas 2021 📝 Abstract: The great thing about streams of real-time events is that they can be used to spot behaviours as they happen and respond to them as needed. Instead of waiting until tomorrow to find out what happened yesterday, we can act on things straight away. This talk will show a real-life example of one particular pattern that it's useful to detect-...
[DevSum 2021] Kafka as a Platform: the Ecosystem from the Ground Up
792 views · 3 years ago
Presented at DevSum 2021: www.devsum.se Kafka has become a key data infrastructure technology, and we all have at least a vague sense that it is a messaging system, but what else is it? How can an overgrown message bus be getting this much buzz? Well, because Kafka is merely the center of a rich streaming data platform that invites detailed exploration. In this talk, we’ll look at the entire st...
Kafka Connect JDBC sink deep-dive: Working with Primary Keys
12K views · 3 years ago
The Kafka Connect JDBC Sink can be used to stream data from a Kafka topic to a database such as Oracle, Postgres, MySQL, DB2, etc. This video explains how to configure it to handle primary keys based on your data using the `pk.mode` and `pk.fields` configuration options. ✍️ [Blog] Kafka Connect JDBC Sink deep-dive: Working with Primary Keys rmoff.net/2021/03/12/kafka-connect-jdbc-sink-deep-dive...
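As an illustration, here is a sketch of a JDBC sink that takes its primary key from a field of the Kafka message key; the connection details and field names are invented:

CREATE SINK CONNECTOR POSTGRES_SINK WITH (
  'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'  = 'jdbc:postgresql://postgres:5432/demo', -- hypothetical database
  'topics'          = 'orders',
  'pk.mode'         = 'record_key', -- take the primary key from the message key
  'pk.fields'       = 'order_id',   -- column name to use for the key value
  'insert.mode'     = 'upsert',
  'auto.create'     = 'true'
);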
ksqlDB HOWTO: Handling Time
2.5K views · 3 years ago
When you do processing in ksqlDB that is based on time (such as windowed aggregations, or stream-stream joins) it is important that you define correctly the timestamp by which you want your data to be processed. This could be the timestamp that's part of the Kafka message metadata, or it could be a field in the value of the Kafka message itself. By default ksqlDB will use the timestamp of the K...
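For instance, a sketch of declaring a stream whose event time comes from a string field in the message value (stream, field, and format are assumed for illustration):

-- Process by the ORDER_TS field rather than the Kafka message timestamp
CREATE STREAM ORDERS (ORDER_ID INT, ORDER_TS VARCHAR, ITEM VARCHAR)
  WITH (KAFKA_TOPIC='orders',
        VALUE_FORMAT='JSON',
        TIMESTAMP='ORDER_TS',
        TIMESTAMP_FORMAT='yyyy-MM-dd''T''HH:mm:ssX');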
ksqlDB HOWTO: Split and Merge Kafka Topics
3.7K views · 3 years ago
Using ksqlDB you can split streams of data in Apache Kafka based on values in a field. You can also merge separate streams of data together into one. ksqlDB uses SQL to describe the stream processing that you want to do. For example: Splitting a stream: CREATE STREAM ORDERS_UK AS SELECT * FROM ORDERS WHERE COUNTRY='UK'; CREATE STREAM ORDERS_OTHER AS SELECT * FROM ORDERS WHERE COUNTRY!='UK'; Mer...
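The merge example is truncated above; it follows the same pattern in reverse. A minimal sketch, reusing the two streams from the split:

-- Create the combined stream from one source, then pipe the other into it
CREATE STREAM ORDERS_COMBINED AS SELECT * FROM ORDERS_UK;
INSERT INTO ORDERS_COMBINED SELECT * FROM ORDERS_OTHER;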
ksqlDB HOWTO: Reserialising data in Apache Kafka
2.1K views · 3 years ago
Using ksqlDB you can reserialise data in Apache Kafka topics. For example, you can take a stream of CSV data and write it to a new topic in Avro. ksqlDB supports many serialisation formats including Avro, Protobuf, JSON Schema, JSON, and Delimited (CSV, TSV, etc). ksqlDB uses SQL to describe the stream processing that you want to do. For example: CREATE STREAM ORDERS_CSV WITH (VALUE_FORMAT='DEL...
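A sketch of the full round trip, with assumed topic and column names: declare a schema over the CSV topic, then write it out as Avro:

-- Schema for the delimited (CSV) source topic
CREATE STREAM ORDERS_CSV (ORDER_ID INT, ITEM VARCHAR, QTY INT)
  WITH (KAFKA_TOPIC='orders_csv', VALUE_FORMAT='DELIMITED');

-- Reserialise to Avro on a new topic
CREATE STREAM ORDERS_AVRO WITH (KAFKA_TOPIC='orders_avro', VALUE_FORMAT='AVRO') AS
  SELECT * FROM ORDERS_CSV;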
ksqlDB HOWTO: Integration with other systems
1.6K views · 3 years ago
Using ksqlDB you can pull data in from other systems (e.g. databases, JMS message queues, etc etc), and push data down to other systems (NoSQL stores, Elasticsearch, databases, Neo4j, etc etc). This is done using Kafka Connect, which can be run embedded within ksqlDB or as a separate cluster of workers. ksqlDB can be used to create and control the connectors. For example: CREATE SINK CONNECTOR ...
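A sketch of how the truncated example might continue, pushing a topic to Elasticsearch (host and topic names invented):

CREATE SINK CONNECTOR ELASTIC_SINK WITH (
  'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
  'connection.url'  = 'http://elasticsearch:9200',
  'topics'          = 'orders',
  'key.ignore'      = 'false', -- use the Kafka message key as the document ID
  'schema.ignore'   = 'true'
);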
ksqlDB HOWTO: Stateful Aggregates
3K views · 3 years ago
Using ksqlDB you can build stateful aggregations of state on events in Apache Kafka topics. These are persisted as Kafka topics and held in a state store within ksqlDB that you can query directly or from an external application using the Java client or REST API. ksqlDB uses SQL to describe the stream processing that you want to do. For example: CREATE TABLE ORDERS_BY_MAKE AS SELECT MAKE, COUNT(...
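A sketch of how the truncated example might read in full (column names assumed):

-- Materialised count of orders per make, continuously maintained
CREATE TABLE ORDERS_BY_MAKE AS
  SELECT MAKE,
         COUNT(*) AS ORDER_COUNT
  FROM ORDERS
  GROUP BY MAKE;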
ksqlDB HOWTO: Joins
2.9K views · 3 years ago
Using ksqlDB you can enrich messages on a Kafka topic with reference data held in another topic. This could come from a database, message queue, producer API, etc. ksqlDB uses SQL to describe the stream processing that you want to do. With JOIN clause you can define relationships between streams and/or tables in ksqlDB (which are built on topics in Kafka) For example: CREATE STREAM ORDERS_ENRIC...
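A sketch of how the truncated join might read in full, assuming an ORDERS stream and a CUSTOMERS table keyed on ID (all names invented):

-- Stream-table join: each order event is enriched with customer reference data
CREATE STREAM ORDERS_ENRICHED AS
  SELECT O.ORDER_ID,
         O.ITEM,
         C.NAME AS CUSTOMER_NAME,
         C.CITY
  FROM ORDERS O
  LEFT JOIN CUSTOMERS C
    ON O.CUSTOMER_ID = C.ID;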
ksqlDB HOWTO: Schema Manipulation
1.7K views · 3 years ago
Using ksqlDB you can manipulate a stream of data in Apache Kafka and write it to a new topic with transformations including: * Remove/drop fields * CAST datatypes * Reformat timestamps from BIGINT epoch to human-readable strings * Flatten nested objects (STRUCT) 💾 Run ksqlDB yourself: ksqldb.io?.devx_ch.rmoff_7pH5KEQiYYo& ☁️ Use ksqlDB as a managed service: www.confluent.io/confluent-cloud/tryf...
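A sketch combining those transformations (stream and field names assumed; ORDER_TS is taken to be a BIGINT epoch-milliseconds value):

CREATE STREAM ORDERS_CLEANED AS
  SELECT ORDER_ID,
         CAST(QTY AS DOUBLE) AS QTY,                               -- change datatype
         TIMESTAMPTOSTRING(ORDER_TS, 'yyyy-MM-dd HH:mm:ss') AS TS, -- epoch to string
         ADDRESS->CITY AS CITY                                     -- flatten a STRUCT
  FROM ORDERS;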
ksqlDB HOWTO: Filtering
1.8K views · 3 years ago
Using ksqlDB you can filter streams of data in Apache Kafka and write new topics in Kafka populated by a subset of another. ksqlDB uses SQL to describe the stream processing that you want to do. With the WHERE clause you can define predicates to filter the data as you require. For example: CREATE STREAM ORDERS_NY AS SELECT * FROM ORDERS WHERE ADDRESS_STATE='New York'; 💾 Run ksqlDB yourself: ksq...
🎄Twelve Days of SMT 🎄 - Day 12: Community transformations
2.1K views · 3 years ago
Apache Kafka ships with many Single Message Transformations included - but the great thing about it being an open API is that people can, and do, write their own transformations. Many of these are shared with the wider community, and in this final installment of the series I’m going to look at some of the transformations written by Jeremy Custenborder and available in kafka-connect-transform-co...
🎄Twelve Days of SMT 🎄 - Day 11: Filter and Predicate
2.8K views · 3 years ago
Apache Kafka 2.6 added support for defining predicates against which transforms are conditionally executed, as well as a Filter Single Message Transform to drop messages - which in combination means that you can conditionally drop messages. The predicates that ship with Apache Kafka are: * RecordIsTombstone - The value part of the message is null (denoting a tombstone message) * HasHeaderKey- M...
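A sketch of the combination, dropping tombstone records before they reach a JDBC sink. The connector and connection details are invented; the Filter transform and RecordIsTombstone predicate classes ship with Apache Kafka:

CREATE SINK CONNECTOR SINK_NO_TOMBSTONES WITH (
  'connector.class'                     = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'                      = 'jdbc:postgresql://postgres:5432/demo',
  'topics'                              = 'orders',
  'transforms'                          = 'dropTombstones',
  'transforms.dropTombstones.type'      = 'org.apache.kafka.connect.transforms.Filter',
  'transforms.dropTombstones.predicate' = 'isTombstone',
  'predicates'                          = 'isTombstone',
  'predicates.isTombstone.type'         = 'org.apache.kafka.connect.transforms.predicates.RecordIsTombstone'
);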
🎄Twelve Days of SMT 🎄 - Day 10: ReplaceField
1.1K views · 3 years ago
The ReplaceField Single Message Transform has three modes of operation on fields of data passing through Kafka Connect, either in a Source connector or Sink connector. * Include *only* the fields specified in the list (`whitelist`) * Include all fields *except* the ones specified (`blacklist`) * Rename field(s) (`renames`) 👾 Demo code and details: github.com/confluentinc/demo-scene/blob/master/...
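A sketch showing two of those modes together on a sink (connector details and field names invented):

CREATE SINK CONNECTOR SINK_REPLACEFIELD WITH (
  'connector.class'                 = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'                  = 'jdbc:postgresql://postgres:5432/demo',
  'topics'                          = 'orders',
  'transforms'                      = 'tidyFields',
  'transforms.tidyFields.type'      = 'org.apache.kafka.connect.transforms.ReplaceField$Value',
  'transforms.tidyFields.blacklist' = 'internal_id',  -- drop this field
  'transforms.tidyFields.renames'   = 'qty:quantity'  -- rename qty to quantity
);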
🎄Twelve Days of SMT 🎄 - Day 8: TimestampConverter
1.8K views · 3 years ago
🎄 Twelve Days of SMT 🎄 - Day 7: TimestampRouter
598 views · 4 years ago
🎄Twelve Days of SMT 🎄 - Day 6: InsertField II
920 views · 4 years ago
🎄Twelve Days of SMT 🎄 - Day 5: MaskField
794 views · 4 years ago
🎄Twelve Days of SMT 🎄- Day 4: RegexRouter
1.5K views · 4 years ago
🎄Twelve Days of SMT 🎄 - Day 3: Flatten
1.8K views · 4 years ago
🎄Twelve Days of SMT 🎄 - Day 2: ValueToKey and ExtractField
2.9K views · 4 years ago
🎄 Twelve Days of SMT 🎄 - Day 1: InsertField (timestamp)
10K views · 4 years ago
Exploring the Kafka Connect REST API
11K views · 4 years ago
From Zero to Hero with Kafka Connect
30K views · 4 years ago
Apache Kafka and ksqlDB in Action: Let's Build a Streaming Data Pipeline!
7K views · 4 years ago
Building a Telegram bot with Apache Kafka, ksqlDB, and Go
2.6K views · 4 years ago
Kafka Connect in Action: Loading a CSV file into Kafka
26K views · 4 years ago