I am using a broadcast variable in a loop like the following (to keep it short, I show it as pseudocode rather than exact Java syntax):
Broadcast<List<E>> brdList = jsc.broadcast(myVariable);
JavaRDD<myType> rdd = rawRdd.map(f(brdList.value()));
List<E> updatedBrdList = rdd.map(g).collect();
brdList.unpersist();
int itr = 1000;
while (itr != 0){
Broadcast<List<E>> brdNewList = jsc.broadcast(updatedBrdList);
rdd = rdd.map(f(brdNewList.value()));
updatedBrdList = rdd.map(g).collect();
itr--;
}
Is this a valid way of using a broadcast variable? Does brdNewList occupy a single location in memory, or is new space allocated and a new copy created in each iteration?
With few iterations (fewer than ~100) it works fine, but with more iterations it gives the following error:
[Screenshot: the error raised when using the broadcast variable in Spark]
Is there any way to work around this and make it work? The value of the broadcast variable needs to be accessible on all nodes in each iteration.
Is this related to driver memory, or do some computations happen on the executors (workers)? (I am running my code on a cluster with 5 nodes.)
Any help is appreciated!
"To use a broadcast value in a Spark transformation you have to create it first using SparkContext.broadcast and then use value method to access the shared value. Learn it in Introductory Example section".
Related
I am working with Spark SQL, where I need to find the diff between two large CSVs.
The diff should give:
Inserted rows (new records) // comparing only IDs
Changed rows (not including the inserted ones) // comparing all column values
Deleted rows // comparing only IDs
Spark 2.4.4 + Java
I am using Databricks to read/write the CSVs.
Dataset<Row> insertedDf = newDf_temp.join(oldDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti");
Long insertedCount = insertedDf.count();
logger.info("Inserted File Count == "+insertedCount);
Dataset<Row> deletedDf = oldDf_temp.join(newDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti")
.select(oldDf_temp.col(key));
Long deletedCount = deletedDf.count();
logger.info("deleted File Count == "+deletedCount);
Dataset<Row> changedDf = newDf_temp.exceptAll(oldDf_temp); // This gives rows (New +changed Records)
Dataset<Row> changedDfTemp = changedDf.join(insertedDf, changedDf.col(key)
.equalTo(insertedDf.col(key)),"left_anti"); // This gives only changed record
Long changedCount = changedDfTemp.count();
logger.info("Changed File Count == "+changedCount);
This works well for CSVs with up to 50 or so columns.
The above code fails even for a single-row CSV with 300+ columns, so I am sure this is not a file-size problem.
With a CSV that has 300+ columns, it fails with the exception:
Max iterations (100) reached for batch Resolution – Spark Error
If I set the property below in Spark, it works:
sparkConf.set("spark.sql.optimizer.maxIterations", "500");
But my question is: why do I have to set this?
Is there something wrong in what I am doing?
Or is this behaviour expected for CSVs with a large number of columns?
Can I optimize it in any way to handle wide CSVs?
The issue you are running into is related to how Spark takes the instructions you give it and turns them into the things it is actually going to do. It first needs to understand your instructions by running the Analyzer, and then it tries to improve them by running its Optimizer. The setting appears to apply to both.
Specifically, your code is failing during a step in the Analyzer. The Analyzer is responsible for figuring out what you are actually referring to when you refer to things, for example mapping function names to implementations, or mapping column names across renames and different transforms. It does this in multiple passes, resolving additional things on each pass, then checking again to see if it can resolve more.
I think what is happening in your case is that each pass probably resolves one column, but 100 passes isn't enough to resolve all of the columns. By increasing the limit you are giving it enough passes to get entirely through your plan. This is definitely a red flag for a potential performance issue, but if your code works then you can probably just increase the value and not worry about it.
If it isn't working, then you will probably need to do something to reduce the number of columns used in your plan, for example combining all the columns into one encoded string column as the key. You might also benefit from checkpointing the data before doing the join so you can shorten your plan.
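As a rough illustration of the checkpointing idea (an untested sketch; the checkpoint directory is a placeholder), checkpoint() materializes the Dataset and truncates its logical plan, so the Analyzer no longer has to re-resolve everything that came before the join:
// Needed once so checkpoint files have somewhere to go (path is a placeholder).
spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

// Materialize and truncate the plans before the expensive joins.
Dataset<Row> newDfShort = newDf_temp.checkpoint();   // or newDf_temp.localCheckpoint()
Dataset<Row> oldDfShort = oldDf_temp.checkpoint();

Dataset<Row> insertedDf = newDfShort.join(oldDfShort,
        oldDfShort.col(key).equalTo(newDfShort.col(key)), "left_anti");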
EDIT:
Also, I would refactor your code above so you can do it all with only one join. This should be a lot faster, and it might solve your other problem.
Each join leads to a shuffle (data being sent between compute nodes), which adds time to your job. Instead of computing adds, deletes and changes independently, you can do them all at once. Something like the code below. It is in Scala pseudocode because I am more familiar with that than with the Java APIs.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"col" syntax
var oldDf = ..
var newDf = ..
val changeCols = newDf.columns.filter(_ != "id").map(col)
// Make the columns you want to compare into a single struct column for easier comparison
newDf = newDf.select($"id", struct(changeCols:_*) as "compare_new")
oldDf = oldDf.select($"id", struct(changeCols:_*) as "compare_old")
// Outer join on ID
val combined = oldDf.join(newDf, Seq("id"), "outer")
// Figure out status of each based upon presence of old/new
// IF old side is missing, must be an ADD
// IF new side is missing, must be a DELETE
// IF both sides present but different, it's a CHANGE
// ELSE it's NOCHANGE
val status = when($"compare_old".isNull, lit("add")).
  when($"compare_new".isNull, lit("delete")).
  when($"compare_new" =!= $"compare_old", lit("change")).
  otherwise(lit("nochange"))
val labeled = combined.select($"id", status as "status")
At this point, we have every ID labeled ADD/DELETE/CHANGE/NOCHANGE, so we can just do a groupBy/count. This aggregation can be done almost entirely map-side, so it will be a lot faster than a join.
labeled.groupBy("status").count.show
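Since the question uses the Java API, here is a rough, untested Java sketch of the same single-join idea (Spark 2.4 Dataset API; oldDf, newDf and the "id" key column are assumed to match the Scala version):
import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;

// Everything except the key column gets packed into a struct for comparison.
Column[] changeCols = Arrays.stream(newDf.columns())
        .filter(c -> !c.equals("id"))
        .map(c -> col(c))
        .toArray(Column[]::new);

Dataset<Row> newSel = newDf.select(col("id"), struct(changeCols).as("compare_new"));
Dataset<Row> oldSel = oldDf.select(col("id"), struct(changeCols).as("compare_old"));

// Single outer join on the key; the two "id" columns are merged into one.
Dataset<Row> combined = oldSel.join(newSel,
        JavaConverters.asScalaBufferConverter(Arrays.asList("id")).asScala(),
        "outer");

Column status = when(col("compare_old").isNull(), lit("add"))
        .when(col("compare_new").isNull(), lit("delete"))
        .when(col("compare_new").notEqual(col("compare_old")), lit("change"))
        .otherwise(lit("nochange"));

combined.select(col("id"), status.as("status"))
        .groupBy("status")
        .count()
        .show();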
I recently began to use Spark to process a huge amount of data (~1TB), and I have been able to get the job done. However, I am still trying to understand how it works. Consider the following scenario:
Set a reference time (say tref)
Do any one of the following two tasks:
a. Read a large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs, OR
b. Read the data as above, do additional preprocessing work, and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable and the time difference wrt tref (i.e., t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = the time required for preprocessing? Where can I find a reliable reference for this?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy evaluation and its ability to run jobs asynchronously. Can/does it start the subsequent (preprocessing) tasks while thousands of input files are still being read? The suspicion comes from the unbelievable performance I see (with results verified as okay), which looks too good to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
val start = System.currentTimeMillis
val res = op
val end = System.currentTimeMillis
(end - start) / 1000f
}
def XYZ = {
val r00 = sc.parallelize(0 to 999999)
val r01 = r00.map(x => (x,(x,x,x,x,x,x,x)))
r01.join(r01).count()
}
val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with Actions that force actual execution. This has its limitations, though, because of lazy evaluation and the like.
On the other hand, the Spark Web UI records every Action, and it records the Stage durations for each Action.
In general, performance measurement in shared environments is difficult. With dynamic allocation in Spark on a busy cluster, you hold on to acquired resources during a Stage, but on successive runs of the same or the next Stage you may get fewer resources. Still, this is at least indicative, and you can run in a less busy period.
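For completeness, a minimal Java sketch of the same idea (the input path and the preprocess() helper are placeholders): force each step with an action before taking a timestamp, so the measured interval covers the actual work of that step rather than just lazy plan construction.
long tRef = System.currentTimeMillis();

JavaRDD<String> raw = jsc.textFile("hdfs:///data/input/*");        // step 1a: read
raw.cache();
raw.count();                        // action: forces the read to actually happen
long t0a = System.currentTimeMillis();

JavaRDD<String> preprocessed = raw.map(line -> preprocess(line));  // step 1b: preprocessing
preprocessed.cache();
preprocessed.count();               // action: forces the preprocessing
long t0b = System.currentTimeMillis();

System.out.println("read: " + (t0a - tRef) / 1000f + " s, "
        + "preprocessing: " + (t0b - t0a) / 1000f + " s");
Without the intermediate cache()/count(), the second interval would include the read as well, because nothing is actually executed until an action runs.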
I'm writing a Spark app but am stuck on broadcast variables. According to the documentation, a broadcast variable should be 'read only'. But what if its properties are mutable?
In local mode it behaves like a regular variable. I don't have a cluster environment, so I can't check what happens there.
case object Var {
private var a = 1
def get() = {
a = a + 1
a
}
}
val b = sc.broadcast(Var)
// usage
b.value.get // => 2
b.value.get // => 3
// ...
Is this a wrong usage of broadcast? It seems to destroy the broadcast variable's consistency.
Broadcasts are moved from the driver JVM to the executor JVMs once per executor. What happens is that Var gets serialized on the driver with its current a, then copied and deserialized into all executor JVMs. Let's say get was never called on the driver before broadcasting. Now all executors get a copy of Var with a = 1, and whenever they call get, the value of a in their local JVM is increased by one. That's it; nothing else happens, the changes to a are not propagated to any other executor or to the driver, and the copies of Var end up out of sync.
Is this wrong usage of broadcast?
Well, the interesting question is why you would do that, as only the initial value of a gets transferred. If the aim is to build local counters with a common initial value, it technically works, but there are much cleaner ways to implement that. If the aim is to get the value changes back to the driver, then yes, it is wrong usage, and accumulators should be used instead.
It seems destroy the broadcast variable's consistency.
Yep, definitely as explained earlier.
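For the accumulator route mentioned above, here is a minimal Java sketch (assuming a JavaSparkContext jsc; the counter name and the rdd variable are placeholders):
import org.apache.spark.util.LongAccumulator;

// Driver side: register an accumulator with the SparkContext.
LongAccumulator visited = jsc.sc().longAccumulator("visited");

// Executor side: tasks only add to it; they never read a consistent global value.
rdd.foreach(x -> visited.add(1L));

// Back on the driver, after the action has finished, the merged total is available.
System.out.println("visited = " + visited.value());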
Let's say I declare a variable and use it inside a map/filter function in Spark. Is my declared variable sent from the driver to the workers every time, for each operation on the values in map/filter?
Is my helloVariable sent to the worker nodes for each value of consumerRecords? If so, how can I avoid that?
String helloVariable = "hello testing"; //or some config/json object
JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
consumerRecord -> {
return consumerRecord.value()+" --- "+helloVariable;
} );
Yes. When you pass functions to Spark, such as to map() or filter(), these functions can use variables defined outside of them in the driver program, but each task running on the cluster gets a new copy of each variable (via serialization, sent over the network), and updates to these copies are not propagated back to the driver.
So the common solution for this scenario is to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. If you are interested in the broadcasting mechanism, here you can read a very good short explanation.
Broadcast variables can be used, for example, to give every node a copy of a large dataset (for example, a dictionary with a list of keywords) in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
So in your case your code might look like this:
Broadcast<String> broadcastVar = sc.broadcast("hello testing");
JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
consumerRecord -> {
return consumerRecord.value() + " --- " + broadcastVar.value();
});
I am trying to understand the behavior of Java 8 parallel streams inside Spark parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size, but that's not the case; I sometimes have missing items in my output. The behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
List listOfThings = IteratorUtils.toList(iterator);
return listOfThings.parallelStream().map(
//some stuff here
).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
It's a very good question.
What is going on here is a race condition. When you parallelize the stream, the stream splits the full list into several roughly equal parts (based on the available threads and the size of the list) and then tries to run those sub-parts independently, one on each available thread.
But you are also using Apache Spark, which is itself a general-purpose computation engine built to do exactly this kind of work quickly. Spark uses the same approach (parallelizing the work) to perform the action.
So in this scenario, Spark has already parallelized the whole job, and inside that you are parallelizing the work again; this is where the race condition starts: a Spark executor starts processing its work, your code re-parallelizes it, the stream processing acquires other threads, and if a thread processing the stream work finishes before the Spark executor completes its own work, its result is added; otherwise the Spark executor goes ahead and reports its result to the master.
Re-parallelizing the work like this is not a good approach; it will always cause you pain. Let Spark do it for you.
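For example (a hedged sketch; the element type Thing and the doWork() helper are placeholders), instead of re-parallelizing inside mapPartitions you can give Spark more, smaller partitions and keep the per-element code single-threaded:
// Let Spark own the parallelism: more partitions -> more concurrent tasks,
// with no nested thread pool inside each partition.
JavaRDD<Thing> test = listRDD
        .repartition(listRDD.getNumPartitions() * 4)   // tune the factor for your cluster
        .map(thing -> {
            // some stuff here (the same per-element work as before)
            return doWork(thing);
        });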
Hope this helps you understand what is going on here.
Thanks