NiFi GetAzureEventHub is multiplying the data by the number of nodes - Azure

I have some flows that read data from an Azure Event Hub using the GetAzureEventHub processor. The data I receive is being multiplied by the number of nodes I have in the cluster; I have 4 nodes. If I configure the processor to run on the primary node only, the data is not replicated 4 times.
I found that an Event Hub accepts up to 5 readers per consumer group (I read this in this article). Each reader keeps its own separate offset, and they all consume the same data. So, in conclusion, I'm reading the same data 4 times.
I have 2 questions:
How can I coordinate these 4 nodes so that they share the same reader (or at least the same offsets)?
If that is not possible, how can I tell NiFi that only one of the nodes should read?
Thanks; if you need any clarification, just ask.

GetAzureEventHub currently does not perform any coordination across nodes, so you would have to run it on the primary node only to avoid duplication.
The processor would require refactoring to perform coordination across the nodes of the cluster and assign unique partitions to each node, and handle failures (i.e. if a node consuming partition 1 goes down, another node has to take over partition 1).
If the Azure client provided this coordination somehow (similar to the Kafka client) then it would require less work on the NiFi side, but I'm not familiar enough with Azure to know if it provides anything like this.
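For what it's worth, Azure's newer Event Hubs SDKs do provide Kafka-style coordination: an "event processor" client backed by a checkpoint store load-balances partitions across running instances. As for what the NiFi-side refactoring would involve, here is a minimal sketch of static partition assignment with failover; the function names are illustrative, not NiFi or Azure APIs:

```python
def assign_partitions(partition_ids, node_ids):
    """Round-robin partitions across nodes; every partition gets exactly one owner."""
    assignment = {node: [] for node in node_ids}
    for i, partition in enumerate(sorted(partition_ids)):
        owner = node_ids[i % len(node_ids)]
        assignment[owner].append(partition)
    return assignment

def reassign_on_failure(assignment, failed_node):
    """If a node dies, hand its partitions to the surviving nodes."""
    survivors = [n for n in assignment if n != failed_node]
    orphaned = assignment.pop(failed_node, [])
    for i, partition in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(partition)
    return assignment
```

With 8 partitions and 4 nodes, each node owns 2 partitions; when a node fails, its partitions are handed to the survivors. A real implementation would also need to persist offsets (checkpoints) so the new owner resumes where the failed node stopped.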

Related

Azure Eventhub / Event Processor Host: Partitioning not working as predicted

We are working on a project right now which implements and uses the Azure Eventhub.
We use the Event Processor Host to process the data from the Event Hub. We have 32 partitions distributed across 3 nodes and are wondering how the Event Processor Host distributes and balances the partitions across the receivers / nodes – especially when using a partition key.
We currently have 4 different customers (blue, orange, purple and light blue) which send us different amounts of data. The blue customer on the left sends approx. 132k strings of data, while the light blue customer on the right sends only 28 (chart omitted).
Our theory was that, given a partition key based on the customer (the color identification), we would see that a customer's data would only be placed on one node.
Instead, we can see that the data is somehow distributed evenly across the 3 nodes (per-node charts for Node 1, Node 2 and Node 3 omitted).
Is there something we've misunderstood about how the partition key works? From what we've read in the documentation, when we don't specify a partition key, a "round-robin" approach is used – but even with a partition key, the data somehow gets distributed evenly.
Are we somehow stressing the nodes, with the blue customer having a huge amount of data and another customer having almost nothing? Or what is going on?
To visualize our theory we've drawn the following (drawing omitted):
So are we stressing the top node with a blue customer, that in the end has to move a partition to the middle node?
A partition key is intended to be used when you want to be sure that a set of events is routed to the same partition, but you don't want to assign an explicit partition. In short, use of a partition key is an explicit request to control routing and prevents the service from balancing across partitions.
When you specify a partition key, it is used to produce a hash value that the Event Hubs service uses to assign the partition to which the event will be routed. Every event that uses the same partition key will be published to the same partition.
To allow the service to round-robin when publishing, you cannot specify a partition key or an explicit partition identifier.
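The hash routing described above can be illustrated with a small sketch. Event Hubs' actual hash function is internal to the service, so this uses a generic stable hash purely to demonstrate the property "same key, same partition":

```python
import hashlib

def partition_for(partition_key: str, partition_count: int) -> int:
    """Illustrative stable hash routing: the same key always lands on the
    same partition. Event Hubs uses its own internal hash, not this one."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# Every event keyed "blue" goes to one partition; different keys may still
# collide on the same partition, and nothing ties a partition to a
# particular consumer node.
```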
Jesse already explained what partition key is good for so I won't repeat that.
If you want customer-to-consumer-node affinity, you should consider dedicating an independent event hub to each customer, so that you can tell your system something like:
node-1 processes data from customerA only by consuming events from eventhub-1
node-2 processes data from customerB only by consuming events from eventhub-2
and so on...
Making use of a partition key doesn't really address your business logic here.
One more thing: if you plan to run this with a larger number of customers in the future, you will also need to consider scaling out your design to create affinity between customer and Event Hubs namespace as well.
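A routing table for the design above might look like the following; all names here are hypothetical, made up purely for illustration:

```python
# Hypothetical routing table: each customer gets its own event hub, and each
# node consumes exactly one of them.
CUSTOMER_EVENTHUB = {
    "customerA": {"eventhub": "eventhub-1", "node": "node-1"},
    "customerB": {"eventhub": "eventhub-2", "node": "node-2"},
}

def eventhub_for(customer: str) -> str:
    """Look up which event hub a customer's events should be published to."""
    return CUSTOMER_EVENTHUB[customer]["eventhub"]
```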

How does Cassandra scale for single-key reads

How does Cassandra handle a large number of reads for a single key? Think of a very popular celebrity whose Twitter page is hit constantly.
You will usually have multiple replicas of each shard. Let's say your replica count is 3. Reads for a single key can then be spread over the nodes hosting those replicas, but that is the limit of the parallelism: adding more nodes to your cluster would not increase the number of replicas, so traffic for that key would still only have those 3 nodes to talk to. There are various tricks people use for such cases, e.g. caching in the web server so it doesn't have to keep going back to the database, or denormalizing the data so it is spread over more nodes.

Equivalent to consumer group in Cassandra

I'm trying to create something like the consumer groups that exist in Kafka, but for Cassandra. The goal is to have a query paginated, with each page processed by one instance of an app.
Is there any notion like consumer groups in Cassandra?
The TL;DR is: no, the consumer-group notion doesn't exist in Cassandra's clients. The burden of deciding which client processes what falls entirely on the app developer.
You can use Cassandra's tokens to do selective paging.
Assuming 2 clients (easy example)
Client 1 pages from -2^63 to 0
Client 2 pages from 1 to 2^63 - 1
The above idea assumes you want to page through all the data, which is something like a batch process and wouldn't be a good fit for Cassandra.
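The two-client split above generalizes to N clients. A minimal sketch, assuming the full Murmur3 token range of the default partitioner:

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_ranges(num_clients: int):
    """Split the full Murmur3 token range into contiguous, non-overlapping
    slices, one per client."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens
    size = total // num_clients
    ranges = []
    start = MIN_TOKEN
    for i in range(num_clients):
        end = MAX_TOKEN if i == num_clients - 1 else start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Each client i then pages with something like:
#   SELECT ... WHERE token(pk) >= :start AND token(pk) <= :end
```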
If you're after the latest N results, where the 1st half is sent to client 1 and the second to client 2 you can use a logical bucket in your partitioning key.
If you're looking to scale the processing of a large number of Cassandra rows, you might consider a scalable execution platform like Flink or Storm. You'd be able to parallelize both the reading of the rows and the processing of the rows, although a parallelizable source (or spout in Storm) is not something you can get out of the box.

Asynchronous object sharing in Spark

I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a pool of objects shared over all the nodes, asynchronously.
What I am currently thinking is this: let's say there are ten nodes, numbered 1 to 10.
If I have a single shared object, I will have to synchronize access to it for it to be usable from any node. I do not want that.
The second option is that I can have a pool of, say, 10 objects.
I want to write my code in such a way that node number 1 always uses object number 1, node number 2 always uses object number 2, and so on.
A sample approach would be: before performing a task, get the thread ID and use object number (threadID % 10). But this would result in a lot of collisions and would not work.
Is there a way that I can somehow get a node ID or process ID, and make my code fetch an object according to that ID? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if this sounds trivial; I am just getting started and cannot find many resources about this online.
PS: I am using a Spark Streaming + Kafka + YARN setup, if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
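If the objects are mutable (e.g. connections), one common alternative to the thread-ID idea from the question is Spark's `mapPartitionsWithIndex`, which exposes a stable per-partition index. A minimal pure-Python sketch of the mapping; the actual Spark call is shown only in a comment:

```python
# In PySpark you would use this inside
#   rdd.mapPartitionsWithIndex(lambda idx, it: process(idx, it, pool))
# where idx is stable per partition (unlike a thread ID).

def object_for_partition(partition_index: int, pool: list):
    """Pick one pool entry per partition index; deterministic, and
    collision-free as long as the pool is at least as large as the
    number of partitions."""
    return pool[partition_index % len(pool)]

def process(partition_index, records, pool):
    """Process every record in a partition with that partition's object."""
    obj = object_for_partition(partition_index, pool)
    for record in records:
        yield (obj, record)
```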

Why can't Cassandra survive the loss of any node without data loss, with replication factor 2?

Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
but I could not understand the following result shown for this configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference, I have seen the question Cassandra loss of a node,
but it still does not help me understand why write level 1 with replication factor 2 would make my Cassandra cluster unable to survive the loss of any node without data loss.
A write request goes to all replica nodes, and even if just 1 responds, it is a success; so assuming 1 node is down, all write requests will go to the other replica node and return success. It will be eventually consistent.
Can someone help me understand with an example.
I guess the calculator is working with the worst-case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is, that there is no guarantee that the data is actually present on two nodes right after your write was acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as it has been committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to reduce the latency perceived by the client). If, in that moment, right after acknowledging the write but before contacting the second node, the coordinator goes down and cannot be brought back, then you have lost that write and the data with it.
When you read or write data, Cassandra computes the hash token for the partition key and routes the request to the nodes responsible for it. With a 3-node cluster and a replication factor of 2, your data is stored on 2 nodes. So if the 2 nodes responsible for a token A are both down, and that token is not stored on node 3, then even though you still have one live node you will get a TokenRangeOfflineException.
The point is that we need the replicas (tokens) to be available, not just some nodes. Also see the similar question answered here.
This is the case because the write level is 1. If your application writes to only 1 node (with the data expected to become eventually consistent/in sync, which takes a non-zero amount of time), then data can be lost if that one server itself is lost before the sync happens.
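The calculator's rule of thumb can be written down explicitly. This is a sketch of my reading of its logic, not the calculator's actual code: with write consistency W, an acknowledged write is only guaranteed on W replicas at ack time, so at most W - 1 nodes can be lost without risking acknowledged data.

```python
def nodes_survivable_without_data_loss(replication_factor: int, write_level: int) -> int:
    """Worst case: an acknowledged write is guaranteed on only `write_level`
    replicas, so losing that many nodes can lose data. Survivable = W - 1."""
    return min(write_level, replication_factor) - 1

def is_strongly_consistent(replication_factor: int, write_level: int, read_level: int) -> bool:
    """Reads are guaranteed to see the latest acknowledged write when R + W > RF."""
    return read_level + write_level > replication_factor
```

For the configuration in the question (RF=2, W=1, R=1) this gives 0 survivable nodes and no strong consistency, which matches the calculator's output.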
