Asynchronous object sharing in Spark

I have a very basic understanding of Spark, and I am trying to find something that can help me achieve the following:
Have a pool of objects shared across all the nodes, asynchronously.
What I am currently thinking is this: let's say there are ten nodes, numbered from 1 to 10.
If I have a single object, I will have to synchronize access to it so that any node can use it. I do not want that.
The second option is to have a pool of, say, 10 objects.
I want to write my code in such a way that node number 1 always uses object number 1, node number 2 always uses object number 2, and so on.
A sample approach would be, before performing a task, to get the thread ID and use object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way to get a node ID or process ID and make my code fetch an object according to that ID? Or is there some other way to have an asynchronous pool of objects on my cluster?
I apologize if this sounds trivial; I am just getting started and cannot find many resources online that address my question.
PS: I am using a Spark Streaming + Kafka + YARN setup, if it matters.

Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling, which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
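The copy semantics described above can be imitated in plain Python, without Spark. This is only a sketch: `pickle` stands in for whatever serialization ships the object to a worker, and `lookup` and `run_on_worker` are hypothetical names made up for the illustration.

```python
import pickle

# Hypothetical object created once on the "driver".
lookup = {"threshold": 10}

def run_on_worker(serialized):
    # Each worker deserializes its own private copy of the object.
    local = pickle.loads(serialized)
    # This change is visible only inside this worker's copy.
    local["threshold"] += 1
    return local["threshold"]

# Ship the same serialized object to three simulated workers.
payload = pickle.dumps(lookup)
worker_results = [run_on_worker(payload) for _ in range(3)]
```

Every simulated worker sees its own incremented value, while the original `lookup` is untouched, which is why a shared object is only useful when it is treated as read-only.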


Hazelcast - when a new cluster member is merging, is the new member operational?

When a new member joins a cluster, table repartitioning and data merge will happen.
If the data is large, I believe it will take some time. While it is happening, what is the state of the cache like?
If I am using embedded mode, does it block my application until the merging is completed? Or, if I don't want to work with an incomplete cache, do I need to wait (somehow) before starting my application operations?
Partition migration will start as soon as the member joins the cluster. It will not block your application because it will progress asynchronously in the background.
Only mutating operations that fall into a migrating partition are blocked. Read-only operations are not blocked.
Mutating operations will get a PartitionMigratingException, which is a RetryableHazelcastException, so they will be retried for up to 2 minutes by default. The smaller your partitions are, the less time the migration of each partition takes. You can make partitions smaller by increasing the partition count via the system property hazelcast.partition.count.
If you want to block your application until all migrations finish, you can check the isClusterSafe method to make sure there are no migrating partitions in the cluster. But beware that isClusterSafe returns the status of the whole cluster rather than of the current member, so it might not be something to rely on. Instead, I would recommend not blocking the application while partitions are migrating.
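To make the retry behaviour concrete, here is a minimal plain-Python sketch of retrying an operation until a deadline. It does not use the Hazelcast client; `RetryableError`, `call_with_retry`, and `put_during_migration` are hypothetical names, and the pause values are arbitrary choices for the illustration.

```python
import time

class RetryableError(Exception):
    """Hypothetical stand-in for a retryable exception."""

def call_with_retry(operation, timeout_s=120.0, pause_s=0.5):
    # Keep retrying until the operation succeeds or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return operation()
        except RetryableError:
            if time.monotonic() + pause_s > deadline:
                raise  # give up: the retry window is exhausted
            time.sleep(pause_s)

# Simulated write that hits a "migrating partition" twice, then succeeds.
attempts = {"count": 0}

def put_during_migration():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("partition is migrating")
    return "stored"

result = call_with_retry(put_during_migration, timeout_s=1.0, pause_s=0.01)
```

From the caller's point of view the write simply takes a little longer; it only fails if the migration outlasts the retry window.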

NiFi GetAzureEventHub is multiplying the data by the number of nodes

I have some flows that get data from an Azure Event Hub using the GetAzureEventHub processor. The data that I'm getting is multiplied by the number of nodes in the cluster; I have 4 nodes. If I tell the processor to run only on the primary node, the data is not replicated 4 times.
I found that an Event Hub accepts up to 5 readers per consumer group (I read this in an article); each reader has its own separate offset, and they all consume the same data. So, in conclusion, I'm reading the same data 4 times.
I have 2 questions:
How can I coordinate these 4 nodes so that they go through the same reader?
If this is not possible, how can I tell NiFi that only one of the nodes should read?
Thanks. If you need any clarification, just ask.
GetAzureEventHub currently does not perform any coordination across nodes, so you would have to run it on the primary node only to avoid duplication.
The processor would require refactoring to perform coordination across the nodes of the cluster and assign unique partitions to each node, and handle failures (i.e. if a node consuming partition 1 goes down, another node has to take over partition 1).
If the Azure client provided this coordination somehow (similar to the Kafka client) then it would require less work on the NiFi side, but I'm not familiar enough with Azure to know if it provides anything like this.
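The kind of coordination described above (unique partitions per node, with takeover on failure) can be sketched in plain Python. This is not NiFi or Azure code; `assign_partitions` and the node names are hypothetical, and a real implementation would also need a shared view of which nodes are alive.

```python
def assign_partitions(partitions, live_nodes):
    """Give every partition exactly one owner among the live nodes."""
    if not live_nodes:
        raise ValueError("no live nodes to own partitions")
    nodes = sorted(live_nodes)
    # Round-robin over the sorted nodes so each node computes the same plan.
    return {p: nodes[i % len(nodes)] for i, p in enumerate(sorted(partitions))}

partitions = [0, 1, 2, 3, 4, 5, 6, 7]

# All four nodes are up: each partition has a single reader.
before = assign_partitions(partitions, {"node-1", "node-2", "node-3", "node-4"})

# node-3 goes down: its partitions are taken over by the remaining nodes.
after = assign_partitions(partitions, {"node-1", "node-2", "node-4"})
```

This naive round-robin reshuffles more partitions than strictly necessary when a node leaves; it is only meant to show that every partition keeps exactly one owner.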

Is it possible for Spark worker nodes to broadcast variables?

I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Broadcast variables are generally passed from the driver to the workers, but I can tell you what I did in a similar case in Python.
If you know the total number of rows, you can try to create an RDD of that length and then run a map operation on it (which will be distributed to the workers). In the map, the workers run a function that fetches some piece of the data (I'm not sure how you are going to make them all get different data).
Each worker would retrieve its required data by making the calls itself. You could then do a collectAsMap() to get a dictionary on the driver and broadcast that to all workers.
However, keep in mind that every worker needs all the software dependencies for making the client requests. You also need to keep socket usage in mind. I just did something similar with querying an API and did not see a rise in sockets, although I was making regular HTTP requests. Not sure....
Ok, so the answer it seems is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.
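The load-in-parallel, collect, then broadcast pattern from the first answer can be sketched as a plain-Python analogy. A thread pool stands in for the Spark workers here, and `load_chunk` is a hypothetical placeholder for the real database query; in Spark the merged dictionary would be what gets passed to sc.broadcast().

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(chunk_id):
    # Hypothetical stand-in for a query that returns one shard of the
    # database as (key, value) pairs.
    return [(f"key-{chunk_id}-{i}", chunk_id * 10 + i) for i in range(2)]

# "Workers" each load a different chunk in parallel.
chunk_ids = range(4)
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load_chunk, chunk_ids))

# The "driver" merges everything into one dictionary, which is the
# object that would then be broadcast to all workers.
lookup = {key: value for chunk in loaded for key, value in chunk}
```

The database load is spread out, but the combined result still has to pass through the driver before it can be shared, which matches the conclusion above.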

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as an RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
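The chain of stages described in the question can be sketched in plain Python, with built-in `filter` and `map` standing in for the RDD transformations. All function names, fields, and reference values below are hypothetical, chosen only to make the flow concrete.

```python
def is_relevant(req):
    return req["type"] == "order"

def is_valid(req):
    return req["amount"] > 0

def add_reference_data(req, reference):
    # Look up the rate for the request's currency.
    return {**req, "rate": reference[req["currency"]]}

def calculate(req):
    return {**req, "total": req["amount"] * req["rate"]}

reference = {"USD": 1.0, "EUR": 1.5}  # made-up reference data
requests = [
    {"id": 1, "type": "order", "amount": 100, "currency": "USD"},
    {"id": 2, "type": "ping", "amount": 0, "currency": "USD"},
    {"id": 3, "type": "order", "amount": -5, "currency": "EUR"},
    {"id": 4, "type": "order", "amount": 20, "currency": "EUR"},
]

# Each stage consumes the output of the previous one.
filtered = filter(is_relevant, requests)
validated = filter(is_valid, filtered)
enriched = map(lambda r: add_reference_data(r, reference), validated)
calculated = map(calculate, enriched)

# Final step: "send" each result downstream (collected in a list here).
sent = [req for req in calculated]
```

Each stage is a small, independent function, so the same functions could be passed to the corresponding RDD operations one after another.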
We have done something similar while working on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes, and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time series database.
We set the batch interval to 1 second; the processing time was around 300 ms and RAM usage was in the hundreds of KB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports back pressure, but you will need to set it up yourself. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for the scalability of your system.
What impressed me the most is how easily you can scale with Spark. With each node we added, the message rate we could handle grew linearly.
I hope this helps a little.
Good luck

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that partition 2 is computed only after partition 1 has been computed, and partition 3 only after partition 2. The reason I need this synchronization is that each partition depends on some computation from the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition( partition => {
  // 1. open jdbc connection
  // 2. poll database for the completion of dependent partition
  // 3. read dependent edge case value from computed dependent partition
  // 4. compute this partition
  // 5. write this edge case result to database
  // 6. close connection
})
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get no benefit from in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:
1. Create the first RDD with data from the first partition. Process it in parallel and optionally push the results outside.
2. Compute the differential buffer.
3. Create the second RDD with data from the second partition. Merge it with the differential buffer from step 2, process it, and optionally push the results to the database.
4. Go back to step 2 and repeat.
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
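The loop described above can be sketched in plain Python. A thread pool stands in for the cluster, and the "differential buffer" is assumed, purely for illustration, to be the running sum of all earlier chunks; `process_chunk` and `next_buffer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(values, carry):
    # All elements of one chunk are processed in parallel, each using
    # the value carried over from the earlier chunks.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda v: v + carry, values))

def next_buffer(carry, values):
    # Assumed dependency: later chunks need the sum of everything so far.
    return carry + sum(values)

chunks = [[1, 2, 3], [4, 5], [6]]
carry = 0
results = []
for chunk in chunks:
    results.append(process_chunk(chunk, carry))  # parallel within a chunk
    carry = next_buffer(carry, chunk)            # sequential between chunks
```

Only the small buffer moves from one iteration to the next; the work inside each chunk still runs in parallel and nothing is written to a database in between.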
