Your article on scaling NiFi to 1000 nodes is awesome.
Great video! Please add the link to your blog post in the description. Thanks and best regards from Brazil!
Good call, updated description. Thanks!
Hi Mark,
This is the best NiFi lesson I've had.
Thanks a lot for providing the "why's" and taking the time to show these best practices.
Please consider advising on the following flow:
- The flow starts from a FS [PrimaryNode], but then a cookie-cutter splits the source FlowFiles even further.
- Where is the best place to share the load across the cluster (3-node cluster)?
- Currently the flow takes ~7 hrs with only 2 nodes participating, using the default setting 'DO_NOT_LOAD_BALANCE'.
Thanks a lot
If you do need to load balance you generally want to do it as soon as possible in your flow. That way, all nodes get to participate the most in the data processing. I.e., if you load balance near the end, then the receiving node has to do a lot of processing before distributing. Better to load balance early and share the processing.
Also, if you’re reading from a remote File System, the best approach is to perform a listing of the file system (ListFile, for instance) and load balance that. Load balancing the listing is super inexpensive because you don’t have to push the data between nodes, just the listing.
Does that answer your question?
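For anyone scripting this rather than clicking through the UI, here's a minimal sketch of enabling Round Robin on the connection between the List and Fetch processors via the NiFi REST API. It assumes an unsecured dev instance at localhost:8080, and the connection ID is a placeholder for your own flow:

```python
# Minimal sketch: turn on Round Robin load balancing for the queue
# between ListFile and FetchFile via the NiFi REST API.
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumption: unsecured dev instance
CONNECTION_ID = "your-connection-uuid"        # hypothetical: the ListFile -> FetchFile queue

# Fetch the current state of the connection; the revision is required for updates.
conn = requests.get(f"{NIFI_API}/connections/{CONNECTION_ID}").json()

update = {
    "revision": conn["revision"],
    "component": {
        "id": CONNECTION_ID,
        # Distribute the (tiny) listing FlowFiles across all nodes.
        "loadBalanceStrategy": "ROUND_ROBIN",
        # Listings carry almost no content, so compression usually isn't worth it.
        "loadBalanceCompression": "DO_NOT_COMPRESS",
    },
}
resp = requests.put(f"{NIFI_API}/connections/{CONNECTION_ID}", json=update)
resp.raise_for_status()
print("Load balance strategy:", resp.json()["component"]["loadBalanceStrategy"])
```

Because only the listing is balanced, each node then runs FetchFile for its own share, and the heavy content never has to hop between nodes.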
Hi, Mark! Great materials!
Where can I read about the availability of queuing semantics for all the possible sources in NiFi? Does only Kafka have this capability?
Thank you!
Queuing semantics are entirely source-dependent. Kafka is certainly one of the most popular. But JMS Queues, Amazon SQS, MQTT, and most pub-sub systems offer similar queuing semantics. In this case, there's usually no need to load balance. For other sources, where data is only received by a single node (for example, when using a ListFTP / FetchFTP combo) you'd want to use load balancing after the listing.
@@nifinotes5127 Thanks!
@@nifinotes5127 Hi Mark, thanks for the information. How about SelectHiveQL? The upstream is a GenerateFlowFile that is configured to run on the primary node only.
Nice vid, good series of videos. 😊
Got a question: how is load balancing with round robin handled across multiple reboots? Does the data move neatly to the rest of the cluster, or is it stuck on one node until it reconnects?
If the node disconnects, the data destined for that node will be rebalanced to go to the other nodes. The data that is already sitting on that node will stay there until the node restarts, because the data only lives on that node.
When should we decide that we need to use round robin after pulling data from Kafka?
If you want to distribute the data evenly across the NiFi cluster and you have more NiFi nodes than Kafka partitions, you'd want to load balance. If you had, say, 8 partitions and 10 NiFi nodes, though, it probably wouldn't be worth the cost of load balancing; you'd be better off just distributing across the 8 nodes that are pulling from Kafka.
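To make that trade-off concrete, here's a tiny back-of-the-envelope sketch. The numbers are illustrative, not measurements:

```python
# Back-of-the-envelope: is load balancing worth it after ConsumeKafka?
# Only min(nodes, partitions) nodes actually receive data from Kafka.
partitions = 8
nodes = 10

consuming_nodes = min(nodes, partitions)
idle_nodes = nodes - consuming_nodes

# Fraction of total cluster capacity sitting idle without load balancing.
idle_fraction = idle_nodes / nodes
print(f"{idle_nodes} of {nodes} nodes idle ({idle_fraction:.0%} of capacity)")

# With Round Robin, every FlowFile's content may be pushed to another node.
# If the processing capacity recovered (here, 20%) is small compared to the
# network and serialization cost of re-shipping all the data, load balancing
# is a net loss: better to let the 8 consuming nodes do the work.
```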
Hi Mark, nice video. I am using Round Robin after the FetchSQL processor and migrating the data to the cloud. Once the loading is completed, I redirect the FlowFiles to a custom processor via a Single Node load balancer. I am getting an error while using the Single Node load balancer. The NiFi version used is 1.9.4. Can you please assist?
Hi Priscilla, the best place to go for support is probably the mailing list - users@nifi.apache.org. You can also use the Slack channel, if you prefer - apachenifi.slack.com/
Hi Mark,
Well explained.
Can you suggest something regarding CPU utilization? When I am running NiFi in a clustered environment, one node is always above 150% even though there are no FlowFiles; the same flow works fine on a single node!
It's hard to say without a lot more details. What version of NiFi are you running? Is it always the same node? How many nodes in the cluster? Is the node with high CPU usage the Primary Node or the Cluster Coordinator or just a random node? Are the processors running, but with no data, or are they stopped?
@@nifinotes5127 Thanks Mark for your prompt response.
Please find my answers to your questions, along with my observations:
What version of NiFi are you running?
[Ans.] I am using NiFi 1.13.2 and JDK 8 (tried with JDK 11 as well).
Is it always the same node?
[Ans.] Yes, it is always the same node. As of now I am using 2 nodes in the cluster, and I found the same issue with 3 nodes as well.
Is the node with high CPU usage the Primary Node or the Cluster Coordinator or just a random node?
[Ans.] A random node.
Are the processors running, but with no data, or are they stopped?
[Ans.] Once we start the flow, all processors are running, and the Scheduling Strategy is "Timer driven".
Use case: processing a CSV file and loading the content into Hive.
1. Fetching the CSV file with the GetSFTP processor (this is the entry point, and its execution happens on the "Primary node" only; all other processors are configured as "All nodes").
2. In each queue the Load Balance Strategy is currently "Round robin"; I tried the other strategies and some combinations as well.
I also tried the other Scheduling Strategy option, "Event driven".
Finding: While replying to you I found that if I run only the processors configured to execute on "All nodes", CPU utilization is 3% when idle, but if I also run the processors configured to execute on "Primary Node", those processors push it above 150% when idle. But this configuration is needed because it is the entry-point processor, and if I remove it I get duplicate copies in my queue and further along the flow.
OK. So if you're looking to pull data from an SFTP server and distribute across the cluster, you don't want to use GetSFTP. Instead, use ListSFTP. Run ListSFTP on Primary Node only. Run Schedule can probably be something like 1 minute. Then connect ListSFTP to FetchSFTP. The connection here should be load balanced using "Round Robin". No other connection should have load balancing. This will appropriately load balance the listing of data, and then all nodes will pull their share of data from the SFTP server. No need to re-shuffle the data after receiving it.
Also, please eliminate any use of Event driven. That was an experimental feature in order to see if it could improve performance in a lot of cases. In reality, it caused worse performance than Timer-Driven and can be expensive. It will be removed in version 2.0 of NiFi.
Making these changes should give you a very efficient dataflow. Single node periodically polls for data and distributes across the cluster. All nodes then grab their share and do their processing.
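For completeness, here's a minimal sketch of that ListSFTP configuration expressed via the REST API, under the same assumptions as the earlier connection sketch: an unsecured dev instance, a hypothetical processor ID, and the processor stopped while its config is updated:

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumption: unsecured dev instance
LIST_SFTP_ID = "your-listsftp-uuid"           # hypothetical processor ID

# Fetch current state; the revision is required for updates, and the
# processor must be stopped before its configuration can change.
proc = requests.get(f"{NIFI_API}/processors/{LIST_SFTP_ID}").json()

update = {
    "revision": proc["revision"],
    "component": {
        "id": LIST_SFTP_ID,
        "config": {
            # Only one node performs the listing...
            "executionNode": "PRIMARY",
            # ...polling the SFTP server once a minute.
            "schedulingStrategy": "TIMER_DRIVEN",
            "schedulingPeriod": "1 min",
        },
    },
}
requests.put(f"{NIFI_API}/processors/{LIST_SFTP_ID}", json=update).raise_for_status()
# The ListSFTP -> FetchSFTP connection then gets Round Robin (see the earlier
# connection sketch), so every node fetches its own share of the files.
```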
You guys are geniuses.
Template for the flow can be found at gist.github.com/nifinotes/80381e2accf22ba4fc118f75af4528c2
ty
So the anti-pattern is compressing the data?
The anti-pattern is using load balanced connections to distribute the load across the cluster when the data is already well distributed. This uses a huge amount of resources but provides no benefit.
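A quick way to check whether data is already well distributed, before reaching for a load balanced connection, is to ask the cluster for per-node queue counts. A minimal sketch, assuming the same unsecured dev instance and a hypothetical connection ID (verify the exact response shape on your NiFi version):

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumption: unsecured dev instance
CONNECTION_ID = "your-connection-uuid"        # hypothetical connection to inspect

# nodewise=true asks the cluster coordinator for one snapshot per node
# instead of a single aggregated snapshot.
status = requests.get(
    f"{NIFI_API}/flow/connections/{CONNECTION_ID}/status",
    params={"nodewise": "true"},
).json()

for node in status["connectionStatus"]["nodeSnapshots"]:
    snap = node["statusSnapshot"]
    print(f"{node['address']}: {snap['flowFilesQueued']} FlowFiles queued")
# If the counts are already roughly even, a load balanced connection here
# would burn CPU and network for no benefit -- that's the anti-pattern.
```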
Great job! Keep going, man!
Doing a great job! But breathe - imagine you are telling all of this to a single person :)