Access mutable property of immutable broadcast variable - apache-spark

I'm building a Spark app but I'm stuck on broadcast variables. According to the documentation, a broadcast variable should be 'read only'. What if its properties are mutable?
In local mode it works like a regular variable. I don't have a cluster environment, so ...
case object Var {
  private var a = 1
  def get() = {
    a = a + 1
    a
  }
}

val b = sc.broadcast(Var)

// usage
b.value.get // => 2
b.value.get // => 3
// ...
Is this a wrong usage of broadcast? It seems to destroy the broadcast variable's consistency.

Broadcasts are moved from the driver JVM to the executor JVMs once per executor. What happens is that Var gets serialized on the driver with its current value of a, then copied and deserialized into all executor JVMs. Let's say get was never called on the driver before broadcasting. Now all executors get a copy of Var with a = 1, and whenever they call get, the value of a in their local JVM increases by one. That's it; nothing else happens, the changes to a are not propagated to any other executor or to the driver, and the copies of Var end up out of sync.
Is this a wrong usage of broadcast?
Well, the interesting question is why you would do that, since only the initial value of a gets transferred. If the aim is to build local counters with a common initial value, it technically works, but there are much cleaner ways to implement that. If the aim is to get the value changes back to the driver, then yes, it is wrong usage and accumulators should be used instead.
It seems to destroy the broadcast variable's consistency.
Yep, definitely as explained earlier.
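For the counters-back-to-the-driver case, a minimal sketch of the accumulator approach (Spark 2.x API; the RDD and names here are illustrative, not taken from the question):

// Executors only add to the accumulator; the driver reads the merged result.
val counter = sc.longAccumulator("eventCounter")

sc.parallelize(1 to 100).foreach { _ =>
  counter.add(1L) // safe to call from tasks on any executor
}

println(counter.value) // on the driver: 100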

Related

In which scenario is an object from the driver node serialized and sent to worker nodes in Apache Spark

Let's say I declare a variable and use it inside a map/filter function in Spark. Is my declared variable sent from the driver to the workers each time, for every operation on the values in map/filter?
Is my helloVariable sent to the worker nodes for each value of consumerRecords? If so, how do I avoid that?
String helloVariable = "hello testing"; // or some config/json object
JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
    consumerRecord -> {
        return consumerRecord.value() + " --- " + helloVariable;
    });
Yep. When you pass functions to Spark, such as to map() or filter(), those functions can use variables defined outside them in the driver program, but each task running on the cluster gets a new copy of each variable (serialized and sent over the network), and updates to these copies are not propagated back to the driver.
So the common solution for this scenario is to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Broadcast variables can be used, for example, to give every node a copy of a large dataset (for example, a dictionary with a list of keywords) in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
So in your case your code might look like this:
Broadcast<String> broadcastVar = sc.broadcast("hello testing");

JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
    consumerRecord -> {
        return consumerRecord.value() + " --- " + broadcastVar.value();
    });

Using a broadcast variable in a loop in Spark with updated values

I am using a broadcast variable in a loop like the following (to keep it short, I show it as a kind of pseudocode, not exact Java syntax):
Broadcast<List<E>> brdList = jsc.broadcast(myVariable);
JavaRDD<myType> rdd = rawRdd.map(f(brdList.value()));
List<E> updatedBrdList = rdd.map(g).collect();
brdList.unpersist();

int itr = 1000;
while (itr != 0) {
    Broadcast<List<E>> brdNewList = jsc.broadcast(updatedBrdList);
    rdd = rdd.map(f(brdNewList.value()));
    updatedBrdList = rdd.map(g).collect();
    itr--;
}
Is this usage a valid form of using broadcast variable? Does the brdNewList occupy one location in memory or in each iteration new space is occupied and a new copy is created?
With a few iterations (fewer than ~100) it works fine, but with more iterations it gives the following error:
(the error with using the broadcast variable in Spark; screenshot not included)
Is there any way to work around it and make it work? The value of the broadcast variable needs to be accessible on all nodes in each iteration.
Is this related to driver memory, or are there some computations happening on the executors (workers)? (I am running my code on a cluster with 5 nodes.)
Any help is appreciated!
"To use a broadcast value in a Spark transformation you have to create it first using SparkContext.broadcast and then use value method to access the shared value. Learn it in Introductory Example section".

Using Java 8 parallelStream inside Spark mapPartitions

I am trying to understand the behavior of Java 8 parallel streams inside Spark parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size. But that's not the case; I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
    List listOfThings = IteratorUtils.toList(iterator);
    return listOfThings.parallelStream().map(
        // some stuff here
    ).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
It's a very good question.
What's going on here is a race condition. When you parallelize the stream, the stream splits the full list into several roughly equal parts (based on the available threads and the size of the list) and then tries to process the sub-parts independently on each available thread.
But you are also using Apache Spark, a general-purpose computation engine known for doing exactly this kind of work fast. Spark uses the same approach (parallelizing the work) to perform the action.
So in this scenario Spark has already parallelized the whole job, and inside it you are parallelizing the work again, and this is where the race condition starts: a Spark executor begins processing its work, your code parallelizes it further, and the stream processing acquires other threads. If the thread processing the stream work finishes before the Spark executor completes its own work, the result gets added; otherwise the Spark executor continues and reports its result to the master.
Re-parallelizing the work like this is not a good approach; it will always give you pain. Let Spark do it for you.
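As an illustration of letting Spark handle the parallelism, here is a minimal sketch (in Scala for brevity; the Java version has the same shape): transform the partition iterator directly instead of collecting it into a list and re-parallelizing it with a parallel stream. The per-record work is a placeholder.

import org.apache.spark.sql.SparkSession

object SequentialPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SequentialPartitions").getOrCreate()
    val sc = spark.sparkContext

    val listRDD = sc.parallelize(1 to 10)

    // Spark already runs one task per partition in parallel, so the per-record
    // work can simply be mapped over the iterator within each task.
    val test = listRDD.mapPartitions { iterator =>
      iterator.map(x => x * 2) // placeholder for "some stuff here"
    }

    println(test.count()) // always 10; no records go missing

    spark.stop()
  }
}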
Hope you understand what's going on here.
Thanks

Replaying an RDD in spark streaming to update an accumulator

I am actually running out of options.
In my Spark Streaming application I want to keep state for some keys. I am getting events from Kafka. Then I extract a key from each event, say a userID. When no events are coming from Kafka, I want to keep updating a counter for each userID every 3 seconds, since I configured the batch duration of my StreamingContext to 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my spark streaming context as the following:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for (i <- 1 to 100) {
  rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
  Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This lets me access my accumulableCollection and update all the counters for all userIDs. Up to now everything works fine; however, when I change my loop from:
for (i <- 1 to 100) {} // this is for testing
To:
while (true) {} // this is to let me access and update my accumulator through the whole application life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this? Is there a straightforward way that would let me update the values of my userIDs (rather than creating a useless RDD and pushing it periodically to the queue stream)?
The reason the while (true) ... version does not work is that control never returns to the main execution line, and therefore nothing below that line gets executed. To solve that specific problem, execute the while loop in a separate thread; Future { while () ... } should probably work.
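A minimal sketch of that idea, reusing rddQueue and ssc exactly as defined in the question (only the Future wrapper is new):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Feed the queue from a background thread so execution can continue on to
// ssc.start() / ssc.awaitTermination() on the main line.
Future {
  while (true) {
    rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
    Thread.sleep(3000)
  }
}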
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed. Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
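A minimal, self-contained sketch of that variant (the foreachRDD body is only a stand-in for UPDATE_MY_ACCUMULATOR from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object TickStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TickStream")
    val ssc = new StreamingContext(conf, Seconds(3))

    // the same 'tick' RDD is replayed on every 3-second batch interval
    val tickRDD = ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
    val ticks = new ConstantInputDStream(ssc, tickRDD)

    ticks.foreachRDD { rdd =>
      // update per-user counters here
      rdd.foreach(_ => ())
    }

    ssc.start()
    ssc.awaitTermination()
  }
}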
That said, the current approach looks fragile to me and would need revision.

arangodb truncate fails on a large collection

I get a timeout in arangosh and the ArangoDB service becomes unresponsive if I try to truncate a large collection of ~40 million docs. Message:
arangosh [database_xxx]> db.[collection_yyy].truncate();
JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13:
[ArangoError 2001: Error reading from: 'tcp://127.0.0.1:8529' 'timeout during read']
!     throw new ArangoError(requestResult);
!     ^
stacktrace: Error
    at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
    at ArangoCollection.truncate (/usr/share/arangodb/js/client/modules/org/arangodb/arango-collection.js:468:12)
    at <shell command>:1:11
ArangoDB 2.6.9 on Debian Jessie, AWS ec2 m4.xlarge, 16G RAM, SSD.
The service becomes unresponsive. I suspect it got stuck (not just busy), because nothing works until I stop the service, delete the database in /var/lib/arangodb/databases/, and start it again.
I know I may be pushing the limits of performance due to the size, but I would guess the intention is that it should not fail, regardless of size.
However, on a non-cloud Windows 10 machine (16 GB RAM, SSD) the same action eventually succeeded, after a while.
Is it a bug? I have some Python code that loads dummy data into a collection, if that helps. Please let me know if I should provide more info.
Would it help to fiddle with --server.request-timeout?
Increasing --server.request-timeout for the ArangoShell will only increase the timeout that the shell will use before it closes an idle connection.
The arangod server will also shut down lingering keep-alive connections, and that may happen earlier. This is controlled via the server's --server.keep-alive-timeout setting.
However, increasing both won't help much. The actual problem seems to be the truncate() operation itself. And yes, it may be very expensive. truncate() is a transactional operation, so it will write a deletion marker for each document it removes into the server's write-ahead log. It will also buffer each deletion in memory so the operation can be rolled back if it fails.
A much less intrusive operation than truncate() is to instead drop the collection and re-create it. This should be very fast.
However, indexes and special settings of the collection need to be recreated / restored manually if they existed before dropping it.
For either a document or an edge collection, it can be achieved like this:
function dropAndRecreateCollection (collectionName) {
  // save state
  var c = db._collection(collectionName);
  var properties = c.properties();
  var type = c.type();
  var indexes = c.getIndexes();

  // drop existing collection
  db._drop(collectionName);

  // restore collection
  var i;
  if (type == 2) {
    // document collection
    c = db._create(collectionName, properties);
    i = 1;
  }
  else {
    // edge collection
    c = db._createEdgeCollection(collectionName, properties);
    i = 2;
  }

  // restore indexes
  for (; i < indexes.length; ++i) {
    c.ensureIndex(indexes[i]);
  }
}
