Apache NiFi Anti-Patterns Part 3 - Load Balancing

  • Published 29 Nov 2024

COMMENTS • 26

  • @convincible-u1y
    @convincible-u1y 4 years ago +2

    Your article on scaling NiFi to 1000 nodes is awesome.

  • @RossittoS
    @RossittoS 4 years ago +2

    Great video! Please add the link to your blog post in the description. Thanks and best regards from Brazil!

    • @nifinotes5127
      @nifinotes5127  4 years ago +1

      Good call, updated description. Thanks!

  • @ranzilber
    @ranzilber 2 years ago +1

    Hi Mark,
    This is the best NiFi lesson I've had.
    Thanks a lot for providing the "why's" and taking the time to show these best practices.
    Please consider advising on the following flow:
    - The flow starts from a file system source [Primary Node], but a cookie-cutter step splits the source FlowFiles even further.
    - Where would be the best place to share the load across the cluster (3-node cluster)?
    - Currently the flow takes ~7 hrs with only 2 nodes participating, using the default setting 'DO_NOT_LOAD_BALANCE'.
    Thanks a lot

    • @nifinotes5127
      @nifinotes5127  2 years ago

      If you do need to load balance, you generally want to do it as soon as possible in your flow. That way, all nodes get to participate as much as possible in the data processing. If you load balance near the end, the node receiving the data has to do a lot of processing before distributing it. Better to load balance early and share the processing.
      Also, if you're reading from a remote file system, the best approach is to perform a listing of the file system (ListFile, for instance) and load balance that. Load balancing the listing is super inexpensive because you don't have to push the data between nodes, just the listing.
      Does that answer your question?
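
      A minimal sketch of the two flow shapes described above, assuming a hypothetical list/fetch/convert/store pipeline (the processor names are illustrative):

```python
# Hypothetical three-step flow, expressed as (connection, load-balance strategy)
# pairs. Only the position of the load-balanced connection differs.

# Balancing late: the primary node fetches and converts everything itself,
# then ships full payloads across the cluster near the end of the flow.
balance_late = [
    ("ListFile -> FetchFile", "Do not load balance"),
    ("FetchFile -> ConvertRecord", "Do not load balance"),
    ("ConvertRecord -> PutHDFS", "Round Robin"),  # heavy data moved between nodes
]

# Balancing early: only the tiny listing FlowFiles are shuffled between nodes,
# so every node fetches and processes its own share of the data.
balance_early = [
    ("ListFile -> FetchFile", "Round Robin"),  # listing records are cheap to move
    ("FetchFile -> ConvertRecord", "Do not load balance"),
    ("ConvertRecord -> PutHDFS", "Do not load balance"),
]
```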

  • @РустемАлетдинов-ю7ь

    Hi, Mark! Great materials!
    Where can I read about the availability of queuing semantics across all possible sources in NiFi? Does only Kafka have this capability?
    Thank you!

    • @nifinotes5127
      @nifinotes5127  1 year ago

      Queuing semantics are entirely source-dependent. Kafka is certainly one of the most popular. But JMS Queues, Amazon SQS, MQTT, and most pub-sub systems offer similar queuing semantics. In this case, there's usually no need to load balance. For other sources, where data is only received by a single node (for example, when using a ListFTP / FetchFTP combo) you'd want to use load balancing after the listing.
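
      A rough sketch of that split, using a few stock processors as examples (the lists are illustrative, not exhaustive):

```python
# Sources backed by a queuing system hand each message to exactly one consumer,
# so the data arrives already distributed; no load-balanced connection needed.
queue_backed_sources = ["ConsumeKafka", "ConsumeJMS", "GetSQS", "ConsumeMQTT"]

# List-style sources run on a single node (typically Primary Node only);
# load balance the connection from the List processor to its Fetch counterpart.
single_node_patterns = [
    ("ListFTP", "FetchFTP"),
    ("ListFile", "FetchFile"),
    ("ListSFTP", "FetchSFTP"),
]
```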

    • @РустемАлетдинов-ю7ь
      @РустемАлетдинов-ю7ь 1 year ago

      @@nifinotes5127 Thanks!

    • @elvinfatku
      @elvinfatku 1 year ago

      @@nifinotes5127 Hi Mark, thanks for the information. How about SelectHiveQL? The upstream is a GenerateFlowFile that is configured to run on the Primary Node only.

  • @semstuff2324
    @semstuff2324 3 years ago +1

    Nice vid, good series of videos. 😊
    Got a question: how is load balancing with Round Robin handled across multiple reboots? Does the data neatly move to the rest of the cluster, or is it stuck on one node until it reconnects?

    • @nifinotes5127
      @nifinotes5127  3 years ago

      If the node disconnects, the data destined for that node will be rebalanced to go to the other nodes. The data that is already sitting on that node will stay there until the node restarts, because the data only lives on that node.

  • @esraamagdi8264
    @esraamagdi8264 3 months ago

    When should we decide that we need to use Round Robin after pulling data from Kafka?

    • @nifinotes5127
      @nifinotes5127  3 months ago

      If you want to distribute the data evenly across the NiFi cluster and you have more NiFi nodes than Kafka partitions, you'd want to load balance. If you had, say, 8 partitions and 10 NiFi nodes, though, it probably wouldn't be worth the cost of load balancing; you'd be better off just distributing across the 8 nodes that are pulling from Kafka.
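
      A back-of-the-envelope way to frame that trade-off, using the numbers from the example above:

```python
# Without a load-balanced connection, only as many NiFi nodes as there are
# Kafka partitions actually receive data.
partitions = 8
nifi_nodes = 10

consuming_nodes = min(partitions, nifi_nodes)  # 8 nodes pull from Kafka
idle_nodes = nifi_nodes - consuming_nodes      # 2 nodes sit out

# Gaining 2 of 10 nodes rarely justifies serializing and shipping every
# FlowFile across the cluster; with, say, 2 partitions and 10 nodes,
# load balancing after the consumer would be much more compelling.
print(f"{idle_nodes} of {nifi_nodes} nodes idle without load balancing")
```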

  • @priscilla2899
    @priscilla2899 3 years ago

    Hi Mark, nice video. I am using Round Robin after the fetchsql processor and migrating the data to the cloud. Once the loading is complete, I redirect the FlowFiles to a custom processor via a Single Node load balancer, but I am getting an error when using the Single Node load balancer. The NiFi version used is 1.9.4. Can you please assist?

    • @nifinotes5127
      @nifinotes5127  3 years ago

      Hi Priscilla, the best place to go for support is probably the mailing list - users@nifi.apache.org. You can also use the Slack channel, if you prefer - apachenifi.slack.com/

  • @adityamoghe1820
    @adityamoghe1820 2 years ago

    Hi Mark,
    Well explained.
    Can you suggest something regarding CPU utilization? When I run NiFi in a clustered environment, one of the nodes is always above 150% even though there are no FlowFiles; the same flow works fine on a single node!

    • @nifinotes5127
      @nifinotes5127  2 years ago

      It's hard to say without a lot more details. What version of NiFi are you running? Is it always the same node? How many nodes in the cluster? Is the node with high CPU usage the Primary Node or the Cluster Coordinator or just a random node? Are the processors running, but with no data, or are they stopped?

    • @adityamoghe1820
      @adityamoghe1820 2 years ago

      @@nifinotes5127 Thanks, Mark, for your prompt response.
      Please find my answers to your questions, along with my observations:
      What version of NiFi are you running?
      [Ans.] I am using NiFi 1.13.2 and JDK 8 (tried with JDK 11 as well).
      Is it always the same node?
      [Ans.] Yes, it is always the same node. I am currently using 2 nodes in the cluster, and I found the same issue with 3 nodes as well.
      Is the node with high CPU usage the Primary Node or the Cluster Coordinator or just a random node?
      [Ans.] A random node.
      Are the processors running, but with no data, or are they stopped?
      [Ans.] Once we start the flow, all processors are running, and the Scheduling Strategy is "Timer driven".
      Use case: processing a CSV file and loading the content into Hive.
      1. The CSV file is fetched by a GetSFTP processor (which is the entry point; its execution happens on the "Primary node" only, while all other processors are configured as "All nodes").
      2. In each queue the Load Balance Strategy is currently "Round robin"; I tried the other strategies and some combinations as well.
      I also tried the "Event driven" option for the Scheduling Strategy.
      Finding: While replying to you I found that if I run only the processors configured to execute on "All nodes", CPU utilization is 3% when idle, but if I also run the processor configured to execute on the "Primary Node", it shows more than 150% even when idle. But this configuration is needed, as this is the entry-point processor; if I remove it, I get duplicate copies in my queue and further down the flow.

    • @nifinotes5127
      @nifinotes5127  2 years ago

      OK. So if you're looking to pull data from an SFTP server and distribute it across the cluster, you don't want to use GetSFTP. Instead, use ListSFTP. Run ListSFTP on the Primary Node only; the Run Schedule can probably be something like 1 minute. Then connect ListSFTP to FetchSFTP. The connection here should be load balanced using "Round Robin". No other connection should have load balancing. This will appropriately load balance the listing of the data, and then all nodes will pull their share of data from the SFTP server. No need to re-shuffle the data after receiving it.
      Also, please eliminate any use of Event driven. That was an experimental feature meant to see whether it could improve performance in a lot of cases; in reality, it caused worse performance than Timer driven and can be expensive. It will be removed in version 2.0 of NiFi.
      Making these changes should give you a very efficient dataflow: a single node periodically polls for data and distributes the listing across the cluster, and all nodes then grab their share and do their processing.
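
      A sketch of those settings, assuming the ListSFTP/FetchSFTP flow described above (property names are abbreviated and values illustrative):

```python
# Recommended replacement for the GetSFTP entry point: list on one node,
# balance the listing, fetch on all nodes.
flow_settings = {
    "ListSFTP": {
        "Execution": "Primary node",  # a single node polls for new files
        "Run Schedule": "1 min",      # periodic listing is cheap
    },
    "ListSFTP -> FetchSFTP": {
        "Load Balance Strategy": "Round Robin",  # balance only the listing
    },
    "FetchSFTP": {
        "Execution": "All nodes",     # each node fetches its share of files
    },
    # Every other connection: no load balancing, and no Event driven scheduling.
}
```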

  • @samsal073
    @samsal073 3 years ago +1

    You guys are geniuses.

  • @nifinotes5127
    @nifinotes5127  4 years ago +1

    Template for the flow can be found at gist.github.com/nifinotes/80381e2accf22ba4fc118f75af4528c2

  • @perianka
    @perianka 2 months ago

    So the anti-pattern is compressing the data?

    • @nifinotes5127
      @nifinotes5127  2 months ago

      The anti-pattern is using load balanced connections to distribute the load across the cluster when the data is already well distributed. This uses a huge amount of resources but provides no benefit.

  • @behrouzseyedi
    @behrouzseyedi 4 years ago +1

    Great job! Keep going, man!

  • @InsightByte
    @InsightByte 4 years ago +1

    Doing a great job! But breathe - imagine you are telling all of this to a single person :)