How do I share a private variable among tasks in Flink (Scala)? - multithreading

I want to share a private variable in my Flink job (Scala) across the parallel tasks of Flink. My code is something like this :
object myJob extends flinkJob {
private val myVariable = someValue
def run(params) : Unit = {
//Stream processing
//myVariable is used here in the ProcessFunction
}
}
When I run this job with some parallelization, will there be a single copy of "myVariable" shared across all of the Flink tasks? If not, how can I ensure that only one single copy of the variable is used and maintained across all of the parallel tasks?

Since parallelized copies of your operator run as tasks in separate JVMs, there can be no "sharing" of a variable. What you can do is use a BroadcastStream to share the same data with multiple tasks. If you need to be able to update the variable, then you'll need to look at using iterations, or store the variable in an external system that you can regularly query.

Related

Spark Broadcast Variable with only one task using it

To my understanding, Spark works like this:
For standard variables, the Driver sends them together with the lambda (or better, closure) to the executors for each task using them.
For broadcast variables, the Driver sends them to the executors only once, the first time they are used.
Is there any advantage to use a broadcast variable instead of a standard variable when we know it is used only once, so there would be only one transfer even in case of a standard variable?
Example (Java):
public class SparkDriver {
public static void main(String[] args) {
String inputPath = args[0];
String outputPath = args[1];
Map<String,String> dictionary = new HashMap<>();
dictionary.put("J", "Java");
dictionary.put("S", "Spark");
SparkConf conf = new SparkConf()
.setAppName("Try BV")
.setMaster("local");
try (JavaSparkContext context = new JavaSparkContext(conf)) {
final Broadcast<Map<String,String>> dictionaryBroadcast = context.broadcast(dictionary);
context.textFile(inputPath)
.map(line -> { // just one transformation using BV
Map<String,String> d = dictionaryBroadcast.value();
String[] words = line.split(" ");
StringBuffer sb = new StringBuffer();
for (String w : words)
sb.append(d.get(w)).append(" ");
return sb.toString();
})
.saveAsTextFile(outputPath); // just one action!
}
}
}
There are several advantages regarding the use of broadcast variables, even if you use only once:
You avoid several problems of serialization. When you serialize an anonymous inner class that uses a field belonging to the external class this involves serializing its enclosing class. Even if spark and other framework has written a workaround to partially mitigate this problem, although sometimes the ClosureCleaner doesn't do the trick. You could avoid the NotSerializableExceptions by performing some tricks i.e.: copy a class member variable into a closure, transform the anonymous inner class into a class and put only the required fields in the constructor etc.
If you use the BroadcastVariable you don't even think about that, the method itself will serialize only the required variable. I suggest reading not-serializable-exception question and the first answer to deepen better the concept.
The serialization performance of the closure is, most of the time, worse than a specialized serialization method. As the official documentation of spark says: data-serialization
Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Searching on the Spark classes from the official spark repo I had seen that the closure is serialized through the variable SparkEnv.get.closureSerializer. The only assignment of that variable is the one present at line 306 of the SparkEnv class that use the standard and inefficient JavaSerializer.
In that case, if you serialize a big object you lose some performance due to the network bandwidth. This could be also an explanation of why the official doc claiming about to switch to BroadcastVariable for task larger than 20 KiB.
There is only one copy for each machine, in case of more executor on the same phisical machine there is an advantages.
> Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
The distribution algorithm is probably a lot more efficient. Using the immutability property of the BroadcastVariable is not difficult to think of distributing following a p2p algorithm instead of a centralized one. Imagine for example from the driver whenever you had finished with the first executor sending the BroadcastVariable to the second, but in parallel the initial executor send the data to the third and so on. Picture kindly provided by the bitTorrent Wikipedia page:
I had no deepen the implementation from spark but, as the documentation of the Broadcast variables says:
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Surely a more efficient algorithm than the trivial centralized-one can be designed using the immutability property of the BroadcastVariable.
Long story short: isn't the same thing use a closure or a broadcast variable. If the object that you are sending is big, use a broadcast variable.
Please refer to this excellent article: https://www.mikulskibartosz.name/broadcast-variables-and-broadcast-joins-in-apache-spark/ I could re-write it but it serves the purpose well and answers your question.
In summary:
A broadcast variable is an Apache Spark feature that lets us send a
read-only copy of a variable to every worker node in the Spark
cluster.
The broadcast variable is useful only when we want to:
Reuse the same variable across multiple stages of the Spark job
Speed up joins via a small table that is broadcast to all worker nodes, not all Executors.

from single machine to parallel processing?

i am new to Spark Streaming and Big Data in general. I am trying to understand the structure of a project in spark. I want to create a main class lets say "driver" with M machines, each machine keeps an array of counters and its values. In single machine and not in Spark, I would create a class for the machines and a class for the counters and i would do the computations that i want. But i am wondering if that is happening in Spark too. Would the same project but in Spark, have the structure I am quoting bellow?
class Driver {
var num : Int = 100
var machines: Array[Machine] = new Array[Machine](num)
//split incoming dstream and fill machines' queues
}
class Machine {
var counters = new Queue[(Int,Int)]() // counter with id 1 and value 25
def fillCounters: Unit = { ... } //function to fill the queue counters
}
In general, you could imagine Spark application as a driver part of application which runs all coordination tasks, constructs graph (you will find mentions of direct acyclic graph or DAG in theoretical parts of tutorials on Spark and distributed computations) of operations to take place over your data, and executor part which results in many copies of the code, sent to each node of the cluster to run over the data.
Main idea is that driver extracts part of your application's code that needs to be run locally with data on nodes, serializes it, sends over network to each executor, launches, manages and collects results.
Spark framework hides this details for simplicity of usage, so applications being developed and looks like single-threaded application.
Developer could separate contexts that run on driver and executors, but this is not very common for tutorials (again, for simplicity).
So, to the answer for the actual question above:
you do not need to design your application in a way you demonstrated above, unless your certainly want to.
Just follow official Spark tutorial to get viable solution and split it afterwards with contexts of execution.
There is good post, summarizing a lot of Spark turorials, videos and talks - you could find it here at SO.

Asynchronous object sharing in Spark

I have a very basic understanding about spark and I am trying to find something that can help me achieve the following :
Have a Pool of objects shared over all the nodes, asynchronously.
What I am thinking of currently, is, lets say there are ten nodes numbered from 1 to 10.
If I have a single object, I will have to make my object synchronous in order for it to be accessible by any node. I do not want that.
Second option is, I can have a pool of say 10 objects.
I want to write my code in such a way that the node number 1 always uses the object number 1, the node number 2 always uses the object number 2 and so on..
A sample approach would be, before performing a task, get the thread ID and use the object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way that I can somehow get a nodeID or processID, and make my code fetch an object according to that ID ? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if it sounds trivial, I am just getting started and cannot find a lot of resources pertaining to my doubt online.
PS : I am using a SparkStreaming + Kafka + YARN setup if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.

Dynamic CPUs per Task in Spark

Lets say my job performs several spark actions, where the first few are not using multiple cores for a single task so I would like each instance to perform (executor.cores) tasks in parallel (spark.task.cpus=1).
Then suppose I have another action which can be parallelized - I'm desiring a feature where I could increase spark.task.cpus (say to use more cores on the executor), and perform fewer tasks simultaneously on each instance.
My workaround right now is to save data, start a new sparkContext with new settings, and reload the data.
The use case: my later actions may be unavoidable skewed and I may want to apply more than one core per task to avoid bottlenecking on such large tasks, but I don't want this to impact the earlier actions which can benefit from using 1 core per task.
From looking around my guess is that I can't do this currently, so I'm mainly wondering if there is a a significant limitation for not allowing this. Alternatively, suggestions for how I could trick spark into achieving something similar.
Note: Currently using 1.6.2 but willing to hear other options for Spark2+

is multi-stage workers allowed in Apache Spark?

I need to know that how does Spark allow communication between worker nodes ?
All the tasks assigned to workers are from the master program, but can a worker's output be sent to another worker, so it can process the further steps on it ..
I am working on a case where there are multiple types of tasks to be carried out, suppose say tasks A,B,C.
For task C to be started, task A and B should be completed, but A and B can be done independent of each other. So, i need few workers for task A, and few for B, and they must call workers of task C, without involving the master. Please provide me insights on how this can be achieved.
Is this kind of a feature available in Yarn ?
I am just throwing a possible solution, although I have not tested it myself and I am unsure of its success possibilities.
What comes to my mind is creating a kind of barrier between the B and C tasks by making use of an action such as count. This will force Spark to complete all previous steps -in all nodes- before starting with stage C (I am not very sure of this statement).
You could then use the broadcast functionality to cache a variable and make it available for all executors without having to communicate with the master.
I would give like to give a shot for the possible answers for this question. In my opinion this can be done in two ways:
1.) If the tasks A and B are independent and need to be completed before C, why not first perform tasks A and B on your RDD and then use the result of these tasks (or new rdd) and perform C using another action.
2.) Inter worker communication is a problem in spark (AFAIK). Only ways of communication in spark are broadcast and accumulator variables. But both of them are useful for driver-worker communicataion, not worker-worker communication. One possible workaround would be to save the result or variables from worker to a common storage such as HDFS and access it from another worker. For e.g. in PySpark there are efficient ways of communicating from worker machine to HDFS using Popep, Pydoop, Hadoopy etc.

Resources