When is a broadcast overkill? - apache-spark

I have started working on a Scala Spark codebase where everything that can be broadcast seems to be, even small objects (a handful of small String attributes).
For example, I see this a lot:
val csvParser: CSVParser = new CSVParser(someComputedValue())
val csvParserBc = sc.broadcast(csvParser)
someFunction(..., csvParserBc)
My question is twofold:
Is broadcasting useful when a small object is reused in several closures?
Is broadcasting useful when a small object is used one single time?
I'm under the impression that in those cases broadcasting is not useful, and could even be wasteful, but I'd like a more enlightened opinion.

When you broadcast something it is copied to each executor once. If you don't broadcast it, it's copied along with each task. So broadcasting is useful if you have a large object and/or many more tasks than executors.
In my experience this is very rarely the case. Broadcasting complicates the code, so I would always start off without a broadcast and only add one if I find that it is required for good performance.
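To make the trade-off concrete, here is a minimal Scala sketch of both variants, assuming a SparkContext named sc as in spark-shell (the lookup table and its size are made up):
// Small value: let the closure capture it directly. It is serialized with
// every task, but the cost is negligible for a few strings or numbers.
val prefix = "user_"
val ids = sc.parallelize(1 to 1000)
val named = ids.map(id => prefix + id)

// Large value: broadcast it so it is shipped to each executor only once
// instead of once per task.
val bigLookup: Map[Int, String] = (1 to 1000000).map(i => i -> ("v" + i)).toMap
val bigLookupBc = sc.broadcast(bigLookup)
val decorated = ids.map(id => bigLookupBc.value.getOrElse(id, "unknown"))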

Related

Learning Spark Streaming

I am learning Spark Streaming using the book "Learning Spark Streaming". In the book I found the following in a section talking about DStreams, RDDs, and blocks/partitions.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head against this and simply can't understand what the author is talking about, although I feel like I should. Can someone give me some pointers on that?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has two working modes: one where the producing side delivers one message at a time, and the other where the receiver may deliver many messages at once (bulk).
In the one-message-at-a-time mode, Spark is responsible for buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
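For concreteness, here is a rough sketch (not from the book) of a custom receiver that uses both overloads; the data here is just hard-coded strings:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ToyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    new Thread("toy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // One-message-at-a-time mode: Spark buffers items into blocks
          // according to the block interval.
          store("a single record")

          // Bulk mode: the producing side hands over a ready-made
          // collection, which becomes one block.
          store(ArrayBuffer("record 1", "record 2", "record 3"))

          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = ()
}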
I agree with you that the paragraph is convoluted and might not convey the message as clearly as we would like. I'll take care of improving it.
Thanks for your feedback!

Is it possible for Spark worker nodes broadcast variables?

I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Thanks!
Broadcast variables are generally passed from the driver to the workers, but I can tell you what I did in a similar case in Python.
If you know the total number of rows, you can create an RDD of that length and then run a map operation on it (which will be distributed to the workers). In the map, each worker runs a function to fetch some piece of data (I am not sure how you would make them all get different pieces).
Each worker would retrieve the data it needs by making those calls. You could then do a collectAsMap() to get a dictionary on the driver and broadcast that to all workers.
However, keep in mind that every worker needs all the software dependencies required to make the client requests. You should also keep socket usage in mind. I did something similar with querying an API and did not see a rise in sockets, although I was making regular HTTP requests. I'm not sure, though.
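A rough Scala sketch of the same pattern (totalRows and loadRowFromDb are placeholders for your row count and your database client call):
// Placeholder for whatever client call fetches one piece of data.
def loadRowFromDb(key: Int): (Int, String) = (key, "value-for-" + key)

val totalRows = 100000  // assumed to be known up front

// Distribute the keys, let the workers fetch their share, pull the result
// back to the driver as a map, and broadcast that map to every executor.
val keys = sc.parallelize(0 until totalRows)
val loaded = keys.map(loadRowFromDb)   // runs on the workers
val asMap = loaded.collectAsMap()      // back on the driver
val lookupBc = sc.broadcast(asMap)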
OK, so it seems the answer is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.

Would it be worthwhile to broadcast small variables in Apache Spark?

According to Spark Tuning Tips, broadcast functionality can be used on large objects to reduce the size of each serialized task.
This makes sense to me, but my question would be: for small objects like Integer or Boolean objects, would it still be worthwhile to take on the object creation overhead of broadcasting them? My hunch is that it is discouraged, but I couldn't find any convincing explanation on this topic online; please help out if you have done some benchmarking or study.
Here is the code to define the variables:
final Broadcast<String> someFolderBroadcast = javaSparkContext.broadcast(someFolder);
final Broadcast<Boolean> someModeBroadcast = javaSparkContext.broadcast(isSomeMode);
someFolderBroadcast.value() and someModeBroadcast.value() are used to retrieve the stored values in the Broadcast variables.
Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large. In general, tasks larger than about 20 KB are probably worth optimizing.
So if your variables (or tasks) are larger than 20 KB, broadcast them!

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading data from these files is distributed and no single machine is burdened by it?
I was looking at the Spark UI for this app, but since it only shows which actions were performed by which machines, and since "sc.textFile(filePath)" is not an action, I couldn't be sure which machines were performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
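For example (the path and partition count here are made up):
// Ask for at least 64 partitions so the read, and everything downstream,
// is spread across more tasks.
val lines = sc.textFile("hdfs:///data/input/*.csv", minPartitions = 64)
println(lines.getNumPartitions)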
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data back to the driver. After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
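To illustrate the difference with the treeDF example from the question (the "height" column is made up):
// Distributed: the filter and count run on the executors.
val tallTrees = treeDF.filter(treeDF("height") > 10)
println(tallTrees.count())

// Not distributed: after collect() everything is a plain local array on the
// driver, processed single-threaded and limited by driver memory.
val treeArr = treeDF.collect()
val tallCount = treeArr.count(row => row.getAs[Double]("height") > 10)
println(tallCount)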

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The SparkContext is only valid on the driver, and Spark will prevent it from being serialized. Therefore it's not possible to use the SparkContext from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with. Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, please see the previous point.
If you need to process every record, consider performing some form of join with the 'decorating' dataset; that will be only one large job instead of tons of small ones.
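A minimal Scala sketch of that approach with DataFrames (the question uses Java, but the idea is the same; all names and columns here are made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("decorate-by-join").getOrCreate()
import spark.implicits._

// Hypothetical incoming records and 'decorating' dataset.
val events = Seq((1, "click"), (2, "view")).toDF("key", "event")
val lookup = Seq((1, "premium"), (3, "trial")).toDF("key", "label")

// One join over the whole batch instead of launching a job per record.
val decorated = events.join(lookup, Seq("key"), "left_outer")
decorated.show()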
