Is it possible for Spark worker nodes to broadcast variables? - apache-spark

I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Thanks!

Broadcast variables are generally created on the driver and passed to the workers, but I can tell you what I did in a similar case in Python.
If you know the total number of rows, you can create an RDD of that length and run a map operation on it, which will be distributed to the workers. In the map, each worker runs a function that fetches some piece of the data (I'm not sure how you would make each one fetch different data).
Each worker retrieves its data by making the calls itself. You can then call collectAsMap() to get a dictionary on the driver and broadcast that to all workers.
Keep in mind, however, that every worker needs all the software dependencies for making the client requests. You also need to keep socket usage in mind. I just did something similar when querying an API and did not see a rise in sockets, although I was only making regular HTTP requests, so I'm not sure this carries over.
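The approach described above could be sketched roughly as follows. This is a minimal sketch, not the answerer's actual code: `fetch_chunk` is a hypothetical function that loads one slice of the database and returns a list of (key, value) pairs, and `num_chunks` is an assumed way of splitting the load. `sc` is an existing SparkContext.

```python
def load_and_broadcast(sc, num_chunks, fetch_chunk):
    """Load data on the workers, gather it on the driver, then broadcast it.

    sc          -- an existing pyspark SparkContext
    num_chunks  -- assumed number of independent slices the load is split into
    fetch_chunk -- hypothetical function: chunk index -> list of (key, value) pairs
    """
    # One element per chunk and one partition per chunk, so the
    # database reads are spread across the workers.
    chunk_ids = sc.parallelize(range(num_chunks), num_chunks)

    # Each worker fetches its own chunks; flatMap flattens the per-chunk lists.
    pairs = chunk_ids.flatMap(fetch_chunk)

    # collectAsMap() returns the pairs to the driver as a dict.
    lookup = pairs.collectAsMap()

    # The broadcast itself is still created on the driver.
    return sc.broadcast(lookup)
```

Later map operations would then read the data through the returned handle, for example `rdd.map(lambda key: lookup_bc.value.get(key))`. Note that the whole dictionary has to fit in driver memory, since it passes through the driver before being broadcast.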

OK, so the answer, it seems, is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.

Related

How to run multiple queries in Scylla using "Non Atomic" Batch/Pipeline

I understand that Scylla allows batch statements like these.
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
These statements have performance implications because they ensure atomicity. However, I simply have many insert statements which I want to perform from my Node.js client in a single IO. Atomicity among these inserts is not needed. Any idea how I can do that? I can't find anything.
Batching multiple inserts is usually an antipattern in the Cassandra world (except when they all go into one partition; see the docs). When you send inserts for multiple partitions in one batch, the coordinator node has to take the data from the batch and send it to the nodes that own the data. This puts additional load on the coordinator, which first needs to back up the content of the batch so that it is not lost if the coordinator crashes in the middle of execution, and then needs to execute all the operations and wait for their results before responding to the caller (see this diagram to understand how a so-called logged batch works).
When you don't need atomicity, the best performance comes from sending multiple inserts in parallel and waiting for them to complete. This is faster, puts less load on the nodes, and lets the driver use a token-aware load balancing policy, so requests are sent to the nodes that own the data (if you're using prepared statements). In Node.js you can achieve this with the Concurrent Execution API. There are several variants of its usage, so it's better to look at the documentation to select what is best for your use case.
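The general pattern, many independent inserts in flight at once with a cap on concurrency, is sketched below in Python with asyncio. This is not the Node.js driver's Concurrent Execution API; it only illustrates the idea. `insert_row` is a hypothetical coroutine standing in for whatever call executes one prepared INSERT, and the default concurrency of 32 is an arbitrary assumption.

```python
import asyncio


async def run_concurrently(rows, insert_row, concurrency=32):
    """Run insert_row(row) for every row, at most `concurrency` at a time.

    insert_row is a hypothetical coroutine that executes one insert.
    Results come back in the same order as `rows`.
    """
    limit = asyncio.Semaphore(concurrency)

    async def guarded(row):
        # The semaphore caps how many inserts are in flight at once.
        async with limit:
            return await insert_row(row)

    # return_exceptions=True: a failed insert is returned as an exception
    # object in the result list instead of aborting the other inserts.
    return await asyncio.gather(
        *(guarded(row) for row in rows), return_exceptions=True
    )
```

Because the inserts are independent, a failure in one leaves the others untouched; the caller can inspect the result list and retry only the rows whose entry is an exception.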

Asynchronous object sharing in Spark

I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a Pool of objects shared over all the nodes, asynchronously.
What I am currently thinking of is this: let's say there are ten nodes numbered from 1 to 10.
If I have a single object, I will have to make my object synchronous in order for it to be accessible by any node. I do not want that.
Second option is, I can have a pool of say 10 objects.
I want to write my code in such a way that the node number 1 always uses the object number 1, the node number 2 always uses the object number 2 and so on..
A sample approach would be, before performing a task, get the thread ID and use the object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way that I can somehow get a nodeID or processID, and make my code fetch an object according to that ID ? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if this sounds trivial; I am just getting started and cannot find many resources about my question online.
PS : I am using a SparkStreaming + Kafka + YARN setup if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.

How to reuse broadcast variable in Spark?

I am using a broadcast variable for a join operation in Spark, but loading the broadcast from the driver to the executors takes a long time. I want to load it once and reuse it across multiple jobs (for the lifetime of the application).
My reference: https://github.com/apache/spark/blob/branch-2.2/core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
Broadcast variables are not related to a job but to a session/context. If you reuse the same SparkSession it's likely that the broadcast variable will be reused. If I recall correctly, under certain types of memory pressure the workers may clear the broadcast variable but, if it is referenced, it would be automatically re-broadcast to satisfy the reference.
Broadcast variables can be used to cache a value in memory on all nodes. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
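As a rough illustration of reusing one broadcast across several jobs in the same context, consider the sketch below. The function name and the `lookup`/`datasets` parameters are hypothetical; `sc` is an existing SparkContext, and each `collect()` triggers a separate job.

```python
def run_jobs_with_one_broadcast(sc, lookup, datasets):
    """Broadcast `lookup` once, then use it in one job per dataset.

    sc       -- an existing pyspark SparkContext
    lookup   -- a dict small enough to fit in memory on every executor
    datasets -- hypothetical list of key lists, one job is run for each
    """
    # Created once; the handle stays valid for the lifetime of the context.
    lookup_bc = sc.broadcast(lookup)

    results = []
    for data in datasets:
        rdd = sc.parallelize(data)
        # Each collect() is a separate job, but every job refers to the
        # same broadcast handle instead of creating a new one.
        results.append(rdd.map(lambda key: lookup_bc.value.get(key)).collect())
    return results
```

When the value is no longer needed, `lookup_bc.unpersist()` drops the cached copies on the executors and `lookup_bc.destroy()` releases the broadcast entirely.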
EdhBroadcast broadcast = new EdhBroadcast(JavaSparkContext);
It's not possible. Broadcast variables are used to send some immutable state once to each worker. You use them when you want a local copy of a variable.
You can create an RDD, cache it, and reuse it.

How to define global read/write variables in Spark

Spark has broadcast variables, which are read-only, and accumulator variables, which can be updated by the nodes but not read. Is there a way, or a workaround, to define a variable which is both updatable and readable?
One use for such a read/write global variable would be to implement a cache. As files are loaded and processed as RDDs, calculations are performed. The results of these calculations, which happen on several nodes running in parallel, need to be placed into a map whose key is made up of some of the attributes of the entity being processed. As subsequent entities within the RDDs are processed, the cache is queried.
Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?
The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:
Performance would presumably be bad because of the network overhead between the nodes and the driver application.
To my understanding, each RDD will be passed a copy of the variable (the cache in this case) when the variable is first accessed by the function passed to the RDD. Each RDD would have its own copy, not access to a shared global variable.
What is the best way to implement and store such a cache?
Thanks
Well, the best way of doing this is not doing it at all. In general, the Spark processing model doesn't provide any guarantees* regarding
where,
when,
in what order (excluding of course the order of transformations defined by the lineage / DAG)
and how many times
a given piece of code is executed. Moreover, any updates that depend directly on the Spark architecture are not granular.
These are the properties that make Spark scalable and resilient, but at the same time they make shared mutable state very hard to implement and, most of the time, completely useless.
If all you want is a simple cache then you have multiple options:
use one of the methods described by Tzach Zohar in Caching in Spark
use local caching (per JVM or executor thread) combined with application specific partitioning to keep things local
for communication with external systems use node local cache independent of Spark (for example Nginx proxy for http requests)
If the application requires much more complex communication, you may try different message passing tools to keep state synchronized, but in general this requires complex and potentially fragile code.
* This partially changed in Spark 2.4, with the introduction of the barrier execution mode (SPARK-24795, SPARK-24822).
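The local caching option could look something like the sketch below. It assumes the code lives in a module that is importable on the executors (for example, shipped with --py-files), and `expensive_compute` is a hypothetical stand-in for the real calculation. Each Python worker process gets its own independent cache; nothing is shared between processes or sent back to the driver, and the cache only survives as long as the worker process is reused.

```python
from functools import lru_cache


def expensive_compute(key):
    """Hypothetical stand-in for the real calculation or lookup."""
    return key * key


@lru_cache(maxsize=10_000)
def cached_compute(key):
    # The cache lives in whichever worker process calls this function.
    # Other processes keep their own, separate caches.
    return expensive_compute(key)


def process_partition(records):
    """Generator intended for rdd.mapPartitions: yields (key, result) pairs."""
    for key in records:
        yield key, cached_compute(key)
```

It would be applied as `rdd.mapPartitions(process_partition)`. Partitioning the RDD so that equal keys land in the same partition raises the hit rate, since the same process then sees repeated keys.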

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as an RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
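The steps above could be sketched along these lines, here in Python rather than Scala. Everything specific is an assumption made for illustration: requests are modelled as dicts, the field names (`type`, `amount`, `currency`) and the four functions are hypothetical, `reference_bc` is a broadcast variable holding the reference data, and `send` is a hypothetical callable that delivers one message downstream.

```python
def is_relevant(request):
    # Filtering step (hypothetical rule).
    return request.get("type") == "order"


def is_valid(request):
    # Validation step (hypothetical rule).
    return request.get("amount", 0) > 0


def enrich(request, reference):
    # Reference data lookup: attach a rate for the request's currency.
    return {**request, "rate": reference.get(request["currency"], 1.0)}


def calculate(request):
    # Calculation step.
    return {**request, "total": request["amount"] * request["rate"]}


def build_pipeline(requests_rdd, reference_bc):
    """Chain the steps; each transformation returns a new RDD."""
    return (
        requests_rdd.filter(is_relevant)
        .filter(is_valid)
        .map(lambda r: enrich(r, reference_bc.value))
        .map(calculate)
    )


def send_partition(requests, send):
    """Send every request of one partition downstream."""
    for request in requests:
        send(request)
```

Instead of a loop on the driver, which would require collecting the whole RDD first, the final step can run on the executors with `processed.foreachPartition(lambda part: send_partition(part, send))`, where `processed` is the RDD returned by `build_pipeline`.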
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time series database.
We set the batch interval to 1 second; the processing time was around 300 ms and RAM usage was in the hundreds of KB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports back pressure, but you need to enable it yourself (in Spark Streaming it is controlled by spark.streaming.backpressure.enabled, which is off by default). Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for your system's scalability.
What impressed me the most is how easily you can scale with Spark. With each node we added, the frequency of messages we could handle grew linearly.
I hope this helps a little.
Good luck
