A few people have asked for templates for these flows. The template for this flow can be found at gist.githubusercontent.com/nifinotes/995287d1831e12a8f3405fbdb49650a4/raw/2555e90aede61ef8fc8dc35d4400874add3433d0/Anti-Patterns_Part_1_-_Split_and_Re-Merge.xml
Pls pin this comment 👍🏻
To summarize the 14-minute talk: use Record processors :) Good job! I found a great performance improvement with Record processors as well, and I love that they infer schemas most of the time, so one doesn't have to type Avro schemas like in the first versions of NiFi's record processors.
Thanks. We've seen a huge improvement in the record-based processors over the last year or so. That's the power of open source with an amazing community behind it!
Very informative and clear explanation.
Really nice and informative video. Do you have a repository for these tutorials (i.e. XML files of both the anti-pattern and the best practice) so that we can follow along more easily?
I can't wait for this series, as it really helps to turbo-charge any team that is using NiFi.
I don't have it available anywhere right now, but that's a really good idea. I'll see if I can get a NiFi Registry setup somewhere that is publicly consumable. Thanks!
@@nifinotes5127 Or even a set of GitHub Gists of these XML files would suffice for a start? For example, from @ijokarumawak:
gist.github.com/ijokarumawak/b37db141b4d04c2da124c1a6d922f81f - these are very helpful tips as well.
Hi Mark! Outstanding video, congrats! I have a question: I ran into this anti-pattern myself, and I'm trying to improve performance. I have a flowfile with 2.2M records coming out of an ExecuteSQL processor. Each of these records must be enriched via a REST call to an external service, using one of the record's columns as a parameter. Is there a way to optimize performance other than splitting all the records and picking that single column from each record to perform the REST call?
Thanks Roberto. There are options, assuming the web service allows batch queries. Generally the pattern would be to use SplitRecord to break up those 2.2MM records into something like 100 records per flowfile - or whatever size you want as a batch for your web service. Then you’d use ForkEnrichment, transform the data on the ‘enrich’ route, use InvokeHTTP, etc. to get the enrichment data, and then use JoinEnrichment to join the enrichment data back with the original data. There are examples in the ForkEnrichment docs.
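To make the shape of that pattern concrete, here is a rough sketch in plain Python of the batch-enrich-join logic the flow performs; the endpoint URL, field names, and response format are all assumptions, not a real API:

```python
# Sketch of the split/enrich/join pattern, assuming a hypothetical batch
# endpoint that accepts a JSON list of keys and returns one enrichment
# object per key, in order. All names below are placeholders.
import json
from urllib.request import Request, urlopen

BATCH_SIZE = 100  # same idea as SplitRecord's records-per-flowfile setting
ENRICH_URL = "https://example.com/api/enrich"  # hypothetical service

def enrich(records, key_column="customer_id"):
    enriched = []
    for start in range(0, len(records), BATCH_SIZE):
        batch = records[start:start + BATCH_SIZE]
        # "Fork": pull out only the column the service needs (the enrich route)
        keys = [r[key_column] for r in batch]
        req = Request(ENRICH_URL,
                      data=json.dumps({"keys": keys}).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:  # the InvokeHTTP step
            extra = json.loads(resp.read())
        # "Join": merge each enrichment result back into its original record
        enriched.extend({**r, **e} for r, e in zip(batch, extra))
    return enriched
```

In NiFi itself, ForkEnrichment/JoinEnrichment handle the correlation and merging for you; the sketch just shows why the batch size matters.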
@@nifinotes5127 thank you so much! Such a prompt and comprehensive response!
@@nifinotes5127 Thanks again Mark! What's the reason behind splitting a big flowfile into 100-record chunks?
And another question: would it be possible to enrich my multi-row flowfile with a complex SQL query against an Oracle database? I have to pull a column out of the enrichment DB based on a rather complex query, and I would like to do this with a batch approach.
@onesoulgospelchoir6742 The splitting to 100 records is to provide a reasonable size for the bulk request to the web service. You probably don’t want a single web request with 2.2MM records; even creating that request would take a lot of memory.
You could do enrichment against a database with a bulk query, but it’s a bit complicated. You’d need to form your own query, likely using something like SELECT * FROM TABLE WHERE COLUMN IN (…). If more help is needed, I’d recommend the Slack channel or the NiFi mailing lists. YouTube comments are not very conducive to long explanations 🙂
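As a rough illustration of that bulk-query idea, here is a minimal Python sketch that builds the IN clause with bind variables; the driver choice (python-oracledb) and the table/column names are assumptions:

```python
# Sketch of a bulk IN-clause enrichment lookup, assuming a DB-API driver
# such as python-oracledb. Table and column names are placeholders.
import oracledb  # hypothetical driver choice; any DB-API module works

def fetch_enrichment(conn, keys):
    # One bind variable per key, rather than string concatenation,
    # so the driver handles escaping the values.
    binds = ", ".join(f":k{i}" for i in range(len(keys)))
    sql = f"SELECT * FROM ENRICHMENT_TABLE WHERE KEY_COLUMN IN ({binds})"
    cur = conn.cursor()
    try:
        cur.execute(sql, {f"k{i}": key for i, key in enumerate(keys)})
        return cur.fetchall()
    finally:
        cur.close()
```

Note that Oracle limits IN lists to 1,000 entries, so batches on the order of 100 keys fit comfortably.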
Awesome.
Thank you a BUNCH for this content, Mark! Would you consider making a video explaining the use of Schema Registries for the read/write record services? I'm a bit of a noob and I'm trying to generate records from XML files. Does InferSchema work even if I don't have a custom schema made?
Sorry I missed this comment Alex. After I finish up this series on anti-patterns, I'm not entirely sure what I'll want to talk about next - I have quite a list :) But I'll certainly add this to it, it's a great idea! Might need to do a Twitter poll to find out what's most interesting to people :)
You can certainly infer the schema without having a custom schema. If you have an explicit schema that you want to use, that is preferred (because it helps to ensure that the data matches what you expect, because it's more efficient, and because it can then be shared with other services that may want to process the data). But for users who don't have a more mature schema registry setup, schema inference should treat you well!
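For anyone curious what an explicit schema looks like: an Avro schema is just a JSON document, something like the hypothetical example below, which you would register in a schema registry (e.g. NiFi's AvroSchemaRegistry controller service) and reference from the record reader and writer:

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The nullable union for email is a common pattern when a field may be absent from the incoming data.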
Hi Mark, very informative videos. Do you mind sharing the templates? Thanks!
There should be a link in the comments.
@@nifinotes5127 Thanks, I will figure it out.
I read almost 3 million messages from Kafka and then have to find and replace some text. I'm using ReplaceText, which is causing memory issues since it's resource-intensive. How do we address this? Is there a specific alternative other than splitting into small files?
Great!!!