Currently, inside a transformation, I read one file and build a HashMap, which I keep in a static field so it can be reused.
For each record I need to check whether the HashMap contains the corresponding key; if it does, I get the value from the HashMap.
What is the best way to do this?
Should I broadcast this HashMap and use it inside the transformation? [HashMap or ConcurrentHashMap]
Will broadcasting make sure the HashMap always contains the values?
Is there any scenario where the HashMap becomes empty, so that I need to handle that case as well? [If it's empty, load it again.]
Update:
Basically, I need to use the HashMap as a lookup table inside a transformation. What is the best way to do that: broadcast or a static variable?
When I use a static variable, for a few records I don't get the correct value from the HashMap. The HashMap contains only 100 elements, but I am comparing it against 25 million records.
First of all, a broadcast variable can only be read; it is not a global variable that can be modified as in classic programming (one thread, one computer, procedural code, and so on). You can use it anywhere in your code (even inside maps), but never modify it.
As you can see in the advantages of broadcast variables, they boost performance because a cached copy of the data is kept on every node, which avoids repeatedly shipping the same object to each node.
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
For example:
rdd = sc.parallelize(range(1000))
broadcast = sc.broadcast({"number":1, "value": 4})
rdd = rdd.map(lambda x: x + broadcast.value["value"])
rdd.collect()
As you can see I access the value inside the dictionary in every iteration of the transformation.
You should broadcast the variable.
Making the variable static will cause the class to be serialized and distributed, and you might not want that.
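A minimal sketch of that pattern in PySpark (the name records_rdd and the sample entries are hypothetical; sc is the existing SparkContext, and the same pattern applies to a Java HashMap via Broadcast<Map<String, String>>): broadcast the lookup map once and read it through .value inside the transformation.
# Small lookup table (~100 entries in your case), broadcast once from the driver
lookup = sc.broadcast({"k1": "v1", "k2": "v2"})

def enrich(record):
    # record is assumed to be a (key, payload) pair; adapt to your real schema
    key, payload = record
    return (key, payload, lookup.value.get(key))  # None if the key is absent

enriched = records_rdd.map(enrich)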
Related
I have a folder that contains n number of files.
I am creating an RDD that contains all the filenames of above folder with the code below:
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a file path, so the content needs to be read through the SparkContext)
The content read above should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written whose steps I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first two steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since I want to read a file in each iteration, which requires sc, but map is executed on the worker nodes where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'll be holding the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct(), to break that single partition into N smaller ones, and avoid the need to have large heap on the driver. (Of course, this is only relevant if your list of input files is REALLY long.)
The method to list file names itself seems to be less than efficient. Perhaps you'd want to consider something more direct, using FileSystem API for example like in this article.
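For instance, here is a minimal sketch of such a direct listing through the Hadoop FileSystem API, reached via PySpark's internal _jvm / _jsc gateways (the folder path is hypothetical, and these gateways are private APIs, so treat this as a sketch rather than a guaranteed interface):
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# List the files directly instead of reading them all just to extract their names
statuses = fs.listStatus(Path("/data/input"))  # hypothetical folder
file_paths = [s.getPath().toString() for s in statuses if s.isFile()]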
I believe you're looking for recursive file lookup:
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the directory and pick up all the files that sit under the root and its child folders. This reads the files into a single DataFrame.
To my understanding, Spark works like this:
For standard variables, the Driver sends them together with the lambda (or better, closure) to the executors for each task using them.
For broadcast variables, the Driver sends them to the executors only once, the first time they are used.
Is there any advantage to using a broadcast variable instead of a standard variable when we know it will be used only once, so that there would be only one transfer even in the case of a standard variable?
Example (Java):
public class SparkDriver {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("J", "Java");
        dictionary.put("S", "Spark");

        SparkConf conf = new SparkConf()
                .setAppName("Try BV")
                .setMaster("local");

        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            final Broadcast<Map<String, String>> dictionaryBroadcast = context.broadcast(dictionary);

            context.textFile(inputPath)
                   .map(line -> { // just one transformation using BV
                       Map<String, String> d = dictionaryBroadcast.value();
                       String[] words = line.split(" ");
                       StringBuffer sb = new StringBuffer();
                       for (String w : words)
                           sb.append(d.get(w)).append(" ");
                       return sb.toString();
                   })
                   .saveAsTextFile(outputPath); // just one action!
        }
    }
}
There are several advantages to using broadcast variables, even if you use them only once:
You avoid several serialization problems. When you serialize an anonymous inner class that uses a field of its enclosing class, the enclosing class gets serialized too. Spark and other frameworks have workarounds that partially mitigate this problem, but sometimes the ClosureCleaner doesn't do the trick. You could avoid the NotSerializableExceptions with tricks such as copying a class member variable into a local variable of the closure, turning the anonymous inner class into a named class and passing only the required fields to its constructor, and so on.
If you use a broadcast variable you don't even have to think about that; the mechanism serializes only the required variable. I suggest reading the not-serializable-exception question and its first answer to deepen the concept.
The serialization performance of the closure is, most of the time, worse than that of a specialized serialization method. As the official Spark documentation on data-serialization says:
Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Searching through the Spark classes in the official Spark repo, I saw that the closure is serialized through the variable SparkEnv.get.closureSerializer. The only assignment of that variable is the one at line 306 of the SparkEnv class, and it uses the standard, inefficient JavaSerializer.
In that case, if you serialize a big object you lose some performance due to the network bandwidth. This could also explain why the official documentation suggests switching to broadcast variables for tasks larger than 20 KiB.
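As a side note, the data serializer (which broadcast variables and shuffled data go through, unlike closures) can be switched to Kryo via configuration; a minimal PySpark sketch, with an arbitrary app name:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-example")  # arbitrary name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

# Broadcast data is serialized with spark.serializer, so it benefits from Kryo
dictionary = sc.broadcast({"J": "Java", "S": "Spark"})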
There is only one copy per machine, which is an advantage when there is more than one executor on the same physical machine.
> Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
The distribution algorithm is probably a lot more efficient. Given the immutability of a broadcast variable, it is not difficult to imagine distributing it with a peer-to-peer algorithm instead of a centralized one. For example, once the driver has finished sending the broadcast variable to the first executor, it can send it to the second while, in parallel, the first executor forwards the data to a third, and so on (the BitTorrent Wikipedia page illustrates this kind of distribution).
I haven't dug into Spark's implementation, but as the documentation of broadcast variables says:
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Surely a more efficient algorithm than the trivial centralized one can be designed by exploiting the immutability of the broadcast variable.
Long story short: using a closure and using a broadcast variable are not the same thing. If the object you are sending is big, use a broadcast variable.
Please refer to this excellent article: https://www.mikulskibartosz.name/broadcast-variables-and-broadcast-joins-in-apache-spark/ I could re-write it but it serves the purpose well and answers your question.
In summary:
A broadcast variable is an Apache Spark feature that lets us send a
read-only copy of a variable to every worker node in the Spark
cluster.
The broadcast variable is useful only when we want to:
Reuse the same variable across multiple stages of the Spark job
Speed up joins via a small table that is broadcast to all worker nodes, not all Executors.
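For the join case, a minimal PySpark sketch using the broadcast hint (orders_df and the small countries_df are hypothetical DataFrames):
from pyspark.sql.functions import broadcast

# Ship the small table to every worker instead of shuffling the large one
joined = orders_df.join(broadcast(countries_df), on="country_code", how="left")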
I am using a broadcast variable for a join operation in Spark, but I have an issue with the time it takes to load the broadcast from the driver to the executors. I want to load it once and reuse it across multiple jobs (for the whole application life cycle).
Link to my reference: https://github.com/apache/spark/blob/branch-2.2/core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
Broadcast variables are not related to a job but to a session/context. If you reuse the same SparkSession it's likely that the broadcast variable will be reused. If I recall correctly, under certain types of memory pressure the workers may clear the broadcast variable but, if it is referenced, it would be automatically re-broadcast to satisfy the reference.
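As a minimal sketch (PySpark, with hypothetical RDD names): the broadcast is created once against the context and can then be referenced by several jobs (actions) within the same session.
lookup = sc.broadcast({"US": "United States", "DE": "Germany"})

# First job (action) using the broadcast
counts = codes_rdd.map(lambda c: lookup.value.get(c, "unknown")).countByValue()

# A later job in the same application references the same broadcast object
labels = other_codes_rdd.map(lambda c: lookup.value.get(c)).collect()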
Broadcast variables, which can be used to cache a value in memory on all nodes. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
EdhBroadcast broadcast = new EdhBroadcast(JavaSparkContext);
It's not possible. Broadcast variables are used to send some immutable state once to each worker; you use them when you want a local copy of a variable.
Alternatively, you can create an RDD, cache it, and reuse it.
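A minimal sketch of that alternative (the path and parsing are hypothetical), caching the RDD so later actions reuse it instead of recomputing it:
ref_rdd = sc.textFile("/data/reference.csv").map(lambda line: tuple(line.split(",")))
ref_rdd.cache()            # keep the partitions in executor memory after the first action

ref_rdd.count()            # first action materializes and caches the RDD
sample = ref_rdd.take(5)   # later actions read the cached partitions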
I am just getting the hang of Spark, and I have function that needs to be mapped to an rdd, but uses a global dictionary:
from pyspark import SparkContext
sc = SparkContext('local[*]', 'pyspark')
my_dict = {"a": 1, "b": 2, "c": 3, "d": 4} # at no point will be modified
my_list = ["a", "d", "c", "b"]
def my_func(letter):
    return my_dict[letter]
my_list_rdd = sc.parallelize(my_list)
result = my_list_rdd.map(lambda x: my_func(x)).collect()
print result
The above gives the expected result; however, I am really not sure about my use of the global variable my_dict. It seems that a copy of the dictionary is made for every partition, and that just does not feel right.
It looked like broadcast is what I am looking for. However, when I try to use it:
my_dict_bc = sc.broadcast(my_dict)

def my_func(letter):
    return my_dict_bc[letter]
I get the following error:
TypeError: 'Broadcast' object has no attribute '__getitem__'
This seems to imply that I cannot broadcast a dictionary.
My question: If I have a function that uses a global dictionary, that needs to be mapped to rdd, what is the proper way to do it?
My example is very simple, but in reality my_dict and my_list are much larger, and my_func is more complicated.
You forgot something important about Broadcast objects: they have a property called value where the data is stored.
Therefore you need to modify my_func to something like this:
my_dict_bc = sc.broadcast(my_dict)

def my_func(letter):
    return my_dict_bc.value[letter]
The proper way to do it depends on how the read-only shared variables (the dictionary in your case) will be accessed in the rest of the program. In the case you described, you don't need to use a broadcast variable. From the Spark programming guide section on broadcast variables:
Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
In your case, if the data is only needed in the single map stage, there is no need to explicitly broadcast the variable (it is not "useful"). However, if the same dictionary were to be used later in another stage, then you might wish to use broadcast to avoid serializing and deserializing the dictionary before each stage.
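For instance, a minimal sketch reusing the names from the question, where the dictionary is needed in two stages and the explicit broadcast avoids shipping it again with each stage's tasks (the second stage's arithmetic is arbitrary, just to reuse the dictionary):
my_dict_bc = sc.broadcast(my_dict)

# Stage 1: look up each letter and pair it with a count of 1
pairs = my_list_rdd.map(lambda letter: (my_dict_bc.value[letter], 1))

# reduceByKey introduces a shuffle, so the map below runs in a second stage
totals = pairs.reduceByKey(lambda a, b: a + b)
scaled = totals.map(lambda kv: (kv[0], kv[1] * my_dict_bc.value["a"]))  # dictionary used again

print(scaled.collect())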
A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection within the Scala class that contains the function being passed to the RDD. Would this be efficient, or is there a better approach to caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g.: val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
Otherwise (if this map might be too large) - you'll need to use a join if you want to keep the data in the cluster, while still refraining from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue is done in the cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)