Generic Type 2 Slowly Changing Dimension using Mapping Data Flows

Data Factory

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 7 лис 2024

КОМЕНТАРІ • 65

@yogeshnikam8064 2 роки тому ⁺¹
This is really good. What i observed is it will update start time. end time of all old entries for that ID. ideally it should update the recent active entry only. to avoid this the ID and Key columns should be used in sink key columns.
@jorglang7883 4 місяці тому
I really like this generice approach.
How would I handle the case, if the source column name is different from the sink column name?
@jodyclaggett786 3 роки тому ⁺²
I got this running, but it doesn't appear to be handling deleted records appropriately. I want a missing/deleted record from the source data to be marked as Active = 0 and ActiveEndTime set. However, when I run the dataflow with a source record that was previously in the data and now isn't, nothing happens. What do I need to do to handle "deleted" records?
@AllAboutBI 3 роки тому
Hi Daniel, I just have a basic question. In the source of data flow, you dont explicitly mention the file name. Still, it works perfectly. How is that possible?
@MSDataFactory 3 роки тому ⁺²
The delimited text source is reading from a parameterized folder name. All files in that folder will be read by the source.
@jannujasmineshalini Рік тому ⁺¹
Do we need to create a separate data flow for each dimension? because this approach can work only on a single file what if the other file have diff schema?
@bytedonor 6 місяців тому
You need to create a metadata in Azure SQL Database where you can define all the particulars of your dimension tables. The metadata can be configurable and can be called in your dataflow's pipeline via Lookup activity.
@sandroquinta1125 2 місяці тому
Will this work on Fabric?
@vickyvinay23 3 роки тому
In the "AddHashExisting", why the primary key is split (Comma Seperated) but not on the CSV source?
@mountainsandmagic645 3 роки тому ⁺¹
Unless I’ve missed something this doesn’t work , it updates all previous records with the the endtime. So if I have 5 previous records it updates them all based on the PK.
@martijnstraatman6296 2 роки тому ⁺¹
Hi,
Thanks for this nice demo!
In the case of multiple csv's with the same primary key, all new records are marked for insert which leads to two active records. My expectation is that the current record is ended ( which is the case). The two new records should processed in sequence so that I have only one active record.
Example:
current state in dimension table: record a is active
csv 1 has an update for record a
csv 2 has an update for record a
Both files are being read and both records are inserted which leads to the following result
current record is set inactive
record from csv 1 is set active
record from csv 2 is set active
How can I solve this issue? I think the oldest record should be processed first.
thx
@rahuldhip2002 4 роки тому ⁺¹
Thanks for this video, question related to performance: when creating md5 hash and comparing run time, do you see any issue? Say if my dimension table got 1 million rows and have 70 columns then, don't you see issue comparing the has columns run time? Or do you suggest I should store the hash values in a column of target dimension table and compare that with the source?
@MSDataFactory 4 роки тому
Its tough to say without looking at the monitoring logs! If you are doing multiple comparisons, hash values should be quicker though
@easyyuz2707 3 роки тому
@@MSDataFactory thanks a lot. I also wonder about this question. Doing a hash comparison is definitely better if we compare multiple columns. But I think the question is something different. target dim table does not have hashcolumn stored, so every time we run the pipeline we recalculate the hashcolumn both for incoming source and destination tables, instead of storing the hash value on the destination and only generating the hash for incoming source columns. If the dim table is huge as question states, generating hash at every runtime and comparing wouldn't cause any performance issues? on the other side, if we create a hashcolumn in Dimtable, then storage will be an issue. Is that why you don't store the hashvalue in dimtable?
@MSDataFactory 3 роки тому
@@easyyuz2707 Calculating the hash in Spark data frames is a very fast operation. It's the I/O and database connector read/write times that will effect your flows the most. Either method is valid, although storing the hash in the Dim table is also very common.
@easyyuz2707 3 роки тому
@@MSDataFactory Thanks a lot! This is very helpful!
@epictamil3755 3 роки тому
@@easyyuz2707 I 've implemented this in both Informatica and ADF same MD5. ADF is faster than Informatica
@philiplee6011 2 роки тому
Great demo - thanks!! However can this be made to work if the source and target column names are different? eg customer_id in the source and custId in the target
@manianan897 4 роки тому
Thanks for the lovely demo.
Can you please advise if there is any option where we can parameterise the data type conversion from String to Int/date etc.
e.g. passing a list of attributes which shoould be converted to int/date before table load ? implicit conversion does not seem to work
Thanks in advance
@MSDataFactory 4 роки тому
Hey Manish, you can always use column patterns to have matching conditions or infer drifted columns if the data is coming from CSV. Drifted data takes the data type of the source system.
@ubaddala 3 роки тому
Thank you Daniel for the nice video. As you mentioned at timeline 5:05, is it possible to compute the row hash on BOTH the source and the target ONLY for the column names that are common to both? I tried to something like "src_hash = md5(of only columns that are also in target) and trg_hash = md5(of only columns that are also in the source)". But, I wasn't able to! Can you please do video about this -- if such a thing is possible or throw some ideas?
@puthenful 3 роки тому
Quick question the data debug cluster taking lot of time to start cluster.
@sivaramkodali8282 3 роки тому
I did the exact same thing and It is failing at NewAndUpdatedRows. While columns_hash comparison between Salary. Any thoughts?
@mugunthanc8660 3 роки тому ⁺¹
I believe the data types should be declared as 'VARCHAR' for the 'Salary' column in the Dim table. I also had the problem, as i declared the salary data type as DECIMAL. Although the values are the same, the MD5 returned different values. Then tried changing the data type to VARCHAR, & it worked.
Botton line is..
@Brondahl 4 роки тому
Heya, how do the parameters for the Generic DataSets get set? We don't seem to be passing them in the way I would in a pipeline?
@MSDataFactory 4 роки тому
The parameters for the generic dataset get passed in as part of the Execute Data Flow activity. These parameters are created in the dataset.
@BrentLeslie 3 роки тому
This is good, but what happens when you don't have an ID column in your source? I can't find a way to insert the ID column name (from the parameters) as a new field in the data flow. Tried with a Select transformation and a Derived Column transformation, but can't set the name of the field to be dynamic based on the parameter. Any tips?
@MSDataFactory 3 роки тому
You need a key to be able to know which dimension rows to update or insert. You can use Exists to check for the existence of a dimensions member by looking for it by other attributes in your existing dimension table. If the dimension member is not found, then it is new and you will generate a new ID using Surrogate Key. If the dimension member exists, and properties have changed, you can then grab the ID from the target dimension table and use it to update the existing row.
@Sarabjitsethi 3 роки тому
I have used exact same flow for my transformation but i am getting dups for 2nd run... Addhashexist and addhasexisting should be different? in my case it is different for same data row.. when i am running the pipeline 2nd time i am getting dups what would be the reasons
.Any suggestions .
@epictamil3755 3 роки тому
to avoid the dups you have to add one more condition in the branch that addhasexist ! = Addhash incoming .
@AshishPatel-kn3kc 3 роки тому ⁺¹
Its really helpful.
@devprakash7876 4 роки тому ⁺¹
unable to see this "Generic Type 2 Slowly Changing Dimension" in template section
@MSDataFactory 4 роки тому ⁺²
Should be available from the template gallery next week
@mani653 Рік тому
@@MSDataFactory still not available .but,it was very good video
@balanm8570 4 роки тому
How do we pass composite primary key columns in a parameterized way ?. In the above example, we have ONLY one column parameterized as 'ID'. What if I have composite primary key columns like 'ID, Code'. I tried with this approach and below is the error
"message":"Job failed due to reason: at Sink 'sink1': org.apache.spark.sql.AnalysisException: cannot resolve `source.AIDNLIFDMD,NSEQ` in search condition given columns...
Here the composite primary key columns are AIDNLIFDMD & NSEQ
Any suggestions would be much appreciated pls ?
@MSDataFactory 4 роки тому
In Sink Settings under "Key columns", select "Custom expression". Use a string array parameter like: ['id','name']
@balanm8570 4 роки тому
@@MSDataFactory Thanks for your response team. But, I could not find the option as "Custom expression" in Sink Settings under "Key Columns". It says "Import schema to get the column list". When we give as ['key1','key2'], it is not validated and it says "Invalid type, expected string and provided string[]". Sorry I could not attach my screen shot in the comment here.
Appreciate your support on this.
@MSDataFactory 4 роки тому ⁺¹
@@balanm8570 This only works with database sinks and when you delete, update, or upserts turned on. See if you are able to open my screenshot here: github.com/kromerm/adfdataflowdocs/blob/master/images/keycols.png
@balanm8570 4 роки тому
@@MSDataFactory Thanks Team for your timely response. Really Appreciate it....
Do we have any plan or ETA by when we can expect this feature for Delta Sink?
@rahuldhip2002 4 роки тому
Is the surrogate key column 'Key' is identity in the SQL Database? Do you need to make the surrogate key in the table as Identity for this scenario? I was thinking if you use MPP database then you can't have sequential number.
@MSDataFactory 4 роки тому
In ADF, "Key column" is just being used as a way to uniquely identify which rows to update in the target database. So the target database, like an MPP, is not required to support primary key as a column property.
@rahuldhip2002 4 роки тому
@@MSDataFactory Thanks, Actually, I was talking about 'Key' column (surrogate key) not the unique business key (ID) which you treated as Primary key. I believe you need the surrogate key. So the question was, are you using IDENTITY for the surrogate key?
@MSDataFactory 4 роки тому
@@rahuldhip2002 This example is generating the surrogate key inside the data flow logic itself using the Surrogate Key transformation, not the database Identity column property. The primary key parameter is just the name of the database key column for matching.
@rahuldhip2002 4 роки тому
@@MSDataFactory Thanks for your explanation. It means whenever I am creating surrogate key in the database, I should not make that column as Identity. That is my takeaway.
@MSDataFactory 4 роки тому ⁺³
@@rahuldhip2002 You can still do that, but when you map that identity column in a data flow sink, you will need to check "Skip writing key columns" or use a pre-processing script to enable identity write.
@anilmuppavarapu8618 3 роки тому
Can some one provide the link to the template mentioned at the end of the video. where do i find: data factory scd type 2 templates
@MSDataFactory 3 роки тому
In the ADF or Synapse UI, you will use the template gallery. In the gallery, you can find the template or use the search bar.
@RonaldPostelmans 2 роки тому
Hello Daniel, it's a wonderfull example of type 2. Your the only one sofar that has made a clear video about it. I have done it with my one data. It adds new records perfectly, only updated records will not get an enddate and the field Active is not updated from 1 to 0. Weird enough in the data preview it does exactly what it should do but in real unfortunately not. Any idea how i can fix that?
@leidagoncalves6403 2 роки тому
I am having the exact same problem, did you manage to find a solution?
@RonaldPostelmans 2 роки тому
@@leidagoncalves6403 unfortunately not, at this point we work with An sql database so My work around is to do the scdtype2 with An sql script which works perfectly. When we start working with datalake in stead of sql database, we have to work with dataflows in the future , then this will be a problem. I creates An issue at Microsoft but they did not answer it. On the internet i was not able to Find other websites or other UA-cam tutorials that about the specific slowly changing dimensionale type 2 case, only type 1 . Some where there but totally un clear
@leidagoncalves6403 2 роки тому
@@RonaldPostelmans Thank you very much. I am actually sinking the data to CosmosDb but having the exact same issue. It seems that some other step or configuration is needed for updates and deletes but I need to investigate further. Thanks.
@nigelcrask9408 Рік тому ⁺¹
@@RonaldPostelmans It's the last step (the sink) you need to enable the Update and set the column using thePrimaryKey Parameter - see 11mins 41secs of the video. I hope this helps someone else and I'll not get my 2 hours back!
@nigelcrask9408 Рік тому
@@leidagoncalves6403 see my post below
@RobertLenior 11 місяців тому
In SSIS this is very very easy to accomplish, why is it still so painstaking cumbersome in ADF?
@MSDataFactory 11 місяців тому
This is a pattern for generic SCD Type 2 for any dimension table using schema drift and column pattern transformations, which is not possible in SSIS
@learnnrelearn7553 4 роки тому ⁺¹
nice video
@srikarvandanapu2192 3 роки тому
For Dropping columns.can you please send me the code or logic ?
@MSDataFactory 3 роки тому
To drop columns, use the Select tranformation
@shauntkhalatian429 3 роки тому ⁺¹
!in(['id_hash','columns_hash','MaxSurrogateKey'],name) I hope that helps
@crispy9489 3 роки тому
@@MSDataFactory they know that, you didnt show the code on the screen to use in the select.
@crispy9489 3 роки тому
@@shauntkhalatian429 thank you, works
@MSDataFactory 3 роки тому
@@crispy9489 There is a link to the actual pipeline template in the video description that has all of the code

Наступне

Автоматичне відтворення

Handle SCD Type-2 Data loads with ADF | SCD Type 2 , 1, 0 Data loads with ADF