AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418)

AWS Events

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 20 сер 2024

КОМЕНТАРІ • 27

@vaibhavdce123 4 роки тому ⁺³
One of best talks on Amazon Redshift
@DiogoNomura 3 роки тому ⁺²
Really the best content of Redshift!
@galeop Рік тому ⁺¹
My notes:
19:34:
## Data distrubution strategies, for data distribution across slices (ie virtual nodes -1 physical node is made up of [2, 16] slices, acting as sharding virtual nodes). ##
Note that data in a table is stored as follows : it is distributed according to the table's data-distribution strategy (eg according to their distribution key, if you chose "KEY based distribution"), and then data is sorted within each partition according to the sort key(s) (the sort key is relevant for zone maps).
Data distribution has two goals :
1) *Avoid "hot partitions/shards"* , ie a partition of data where the request is centered, thus putting a lot of strain on the resources underlying the slice/node that hosts it, while the other slices in the cluster are under-utilized.
2) *Skip IO* : with a query like "SELECT * FROM my_table WHERE country='FRANCE'; " if this table was distributed according to the "country" column, then all the relevant data for this query was stored on the same partition, and can be returned quickly. Compute nodes can quickly identify the partitions where the data is and is not. (Note: "country" is probably not a good choice of partition key, as the resulting partitions will probably be too big, and therefore we'll result in a hot partition)
3) *Return relevant data together in few IO operations:* this is the same idea as 2) : when all relevant data is in the whole partition (or serveral partitions, all the data of those partitions is relevant), then all the data blocks of the partition are relevant, and the storage layer can read them quickly in just a few IO operations. For this reason, it makes sense to use as partition key a column that is often used as joining creteria (eg "JOIN mytable1 and mytable2 ON columnToUseAsPartitionKey")
There are 4 data distribution options for a table:
- *hash partitioning/sharding* (aka "distribution based on a *key*") : you choose a column (aka key) of your table that will be used as partition key. All the rows that share the same value for this column will be stored together in the same partition/shard. To get a good data dsitribution, and avoid having hot partitions, use a column with an even value distribution (to get even-sized partitions), and with high cardinality (to avoid having huge partitions, that can't be broken down in any smaller pieces), and on which your queries aren't focused on a single value (that's another reason why "country" is a bad partition key, because you'll typically query about facts regarding a single country, and therefore you'll put all the strain on the partition corresponding to that country, thus making it a hot partition ; for the same reason, don't use date as a partition key, because you'll typically query facts on a specific date, thus making the partition corresponding to that date a hot partition).
With hash partitioning, if you choose your distribution key right, you'll avoid hot partitions (gaol 1), skip IO (goal 2) and read your data in fewer IO operations (goal 3).
- *Even distribution:* the rows of your table are evenly spread among your slices, in a round robin fashion, row by row. The idea is to avoid hot partitions (goal 1), but you don't get goals 2 and 3.
- *ALL*: your table is duplicated on the 1st slice of each of your compute nodes. This only makes sense for "small" tables (ie No I think it’s okay as the data model is typically for historizations, so inserts would be appended to the table, in the same HCC/ORC set of row. Sure, adding one by one new rows is a heavier process than doing so in a row-oriented DB, but it’s a heavy process taking place “at the margin”, affecting just one HCC/ORC set of rows per table at a time.
25:00 data is always at least twice redundant in the cluster. Redshift uses 2-phase commit for transactions.
@aksharjamgaonkar1677 2 роки тому
this is pretty crisp and to the point, worth your time.
@raghukundurthi1288 4 роки тому ⁺¹
Wow! An intelligent and insightful deep dive into the hidden pearls of Redshift! Love the topic of Advisor! Thanks Tony & Harshida!
@galeop Рік тому
13:40 so zone maps are like "storage indexes" in Oracle Exadata ; they allow to skip unecessary IO by telling the storage server what are the min and max values of a compressed 1MB storage block, thus allowing the server to know if the values it looks for NOT in that block. Sort keys in Redshift are meant to determine how your table will be sorted on disk, and therefore impact the zone maps that track the min/max of the sort keys in a 1MB block.You'd typically set sort keys for frequently filtered columns, and have fewer than 4 sort keys per table (the table being sorted based on column X, and then column Y when column X had identical vallues, etc.). Use in priority low cardinality columns for sort keys, as they'll allow to make zone maps more meaningful (to know where the filtered data is not)
@galeop Рік тому
16:34 like all datawarehouses, you should use a denormalized data moldel ; something akin to a star schema or snowflake schema. In your star schema, the Fact table is a denormalised table where entities that belong to the same Fact have been pre-joinned ("materialized", ie they are stored joined together on the Fact table). You will also store in the Faact table pre-calculated aggregations (as you store those values on the table, they are "materialized").
@venkata.krishnan 4 роки тому ⁺¹
Excellent presentation structure..
@saravanangovindharaj6997 3 роки тому ⁺¹
Awesome presentation!
@ArtisticKalam 3 роки тому
Best talk on redshift
@andreyavdeev8181 3 роки тому ⁺¹
fantastic session! thank you!
@upuldi 2 роки тому
Very good presentation
@liuruibnu 3 роки тому ⁺¹
Excellent structure!
@vivien1252 3 роки тому
Fantastic session
@shrabanti84 2 роки тому ⁺²
I have two question,
1. from performance point of view do we have any major difference between temp table and ctas with backup no?
2. For alter table append command what if the the source datatypes are same but data lengths are not matching will it still work?
@RPRAVEENKUMAR 4 роки тому ⁺¹
Very vast for a beginner like me. But answered a lot of questions
@yasink18 Рік тому
Best way to use update statement where I am getting slow response
@galeop Рік тому
43:40 for WLM and queues, what are these percentages of memory about? Leader node's memory, compute node's memory, or both?
@galeop Рік тому
55:01 I suppose that altering the distribution key of a table means that all my table's partitions will be re-written from scratch, according to that new distribution key ; so it's a very heavy operation. Correct ?
@galeop Рік тому
27:51 according to which criteria should the big S3 file to COPY be split into 16 smaller file ? Should we partition it according to the same distribution method as the one used in the table (eg same hash key)?
@dsapraveen 4 роки тому
I suppose, Advisor is not available in ap-south-1, I wonder why AWS is selective when releasing these kind add-on of services,.
@Thevisionaryaddy 3 роки тому
how much concurrency and CPU usage if you are running a Decent Size Query which takes about 4-5 hours and is made joining 2-3 tables. It runs in the very morning
@akshayarawat 4 роки тому ⁺²
What is the slidshare link?
@liuruibnu 3 роки тому
16:16
@user-so5fk1mt9o 4 роки тому ⁺¹
Is a joke compare to Google bigquery
@archivepurpose4983 3 роки тому
in your dreams. :)
@avinashravipati9002 Рік тому
Slides can be found at
d1.awsstatic.com/events/reinvent/2019/Deep_dive_and_best_practices_for_Amazon_Redshift_ANT418.pdf

Наступне

Автоматичне відтворення

AWS re:Invent 2019: [REPEAT] Best practices for Amazon S3 (including storage classes) (STG302-R)