Why does Spark not invoke reduceByKey when the Tuple2's key is the original object in mapToPair - apache-spark

I have a JavaDStream<SourceObject> named sourceDStream for stream processing.
In mapToPair for this DStream, I use the input object as both the key and the value of the Tuple2, as in
Case 1:
public Tuple2<SourceObject, SourceObject> call(SourceObject sourceObject) {
Tuple2<SourceObject, SourceObject> tuple2;
tuple2 = new Tuple2<>(sourceObject, sourceObject);
return tuple2;
}
where SourceObject also implements equals() because it is used as the key in mapToPair and reduceByKey.
I also call cache on both sourceDStream and rdd to ensure that they are processed before reduceByKey as in
sourceDStream = sourceDStream.cache ();
rdd = sourceDStream.mapToPair ()
rdd = rdd.cache ();
reducedRdd = rdd.reduceByKey ();
reducedRdd.foreachRDD ();
reducedRdd.foreachPartition ();
However, when sourceDStream's size is small, say 50 or fewer, Spark does not call SourceObject's equals, so in turn the reduceByKey function is not called at all.
So the duplicate keys are not reduced / merged when foreachPartition is called.
Even when sourceDStream's size is larger, say 100+, Spark only calls SourceObject's equals for a small subset of objects,
even though there are more objects in sourceDStream with the same key. So reduceByKey is not called for the many remaining objects with the same key.
Both conditions result in an excessive number of objects with the same key that foreachPartition needs to process.
Yet when I use a wrapper object as a key for sourceObject as in the code below
Case 2:
public class SourceKey {
private SourceObject sourceObject;
public void setSourceObject (SourceObject sourceObject) {
this.sourceObject = sourceObject;
}
public boolean equals (Object obj) {
...
}
}
public Tuple2<SourceKey, SourceKey> call(SourceObject sourceObject) {
Tuple2<SourceKey, SourceKey> tuple2;
SourceKey sourceKey = new SourceKey();
sourceKey.setSourceObject(sourceObject);
tuple2 = new Tuple2<>(sourceKey, sourceKey);
return tuple2;
}
then Spark works as expected: it calls SourceKey's equals for all objects in sourceDStream, so reduceByKey is called for all objects with the same key.
For case 1, why does Spark skip calling SourceObject's equals when SourceObject is also used as the key / value in the Tuple2 of mapToPair?
How do you solve this issue and have Spark call SourceObject's equals for all objects in sourceDStream, so that objects with the same key are reduced?
Thanks.
Michael,
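One possible cause, which the question does not confirm, is that SourceObject overrides equals() but not hashCode(). reduceByKey groups keys by hash code before it compares them with equals, so equal keys whose hash codes differ are rarely compared at all. The plain-Java sketch below involves no Spark; KeyWithoutHashCode, KeyWithHashCode, and HashContractDemo are hypothetical names used only for illustration. It shows the same effect with a HashMap-based merge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for SourceObject: equals() is overridden, hashCode() is not.
final class KeyWithoutHashCode {
    final String id;

    KeyWithoutHashCode(String id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof KeyWithoutHashCode && ((KeyWithoutHashCode) o).id.equals(id);
    }
    // hashCode() is deliberately inherited from Object (identity-based).
}

// Same key, but hashCode() is consistent with equals().
final class KeyWithHashCode {
    final String id;

    KeyWithHashCode(String id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof KeyWithHashCode && ((KeyWithHashCode) o).id.equals(id);
    }

    @Override
    public int hashCode() { return Objects.hash(id); }
}

final class HashContractDemo {
    // Merges keys the way a hash-based reduce would and returns how many distinct keys remain.
    static int distinctKeys(Object[] keys) {
        Map<Object, Integer> counts = new HashMap<>();
        for (Object key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts.size();
    }

    public static void main(String[] args) {
        Object[] broken = { new KeyWithoutHashCode("a"), new KeyWithoutHashCode("a") };
        Object[] fixed = { new KeyWithHashCode("a"), new KeyWithHashCode("a") };
        // Usually 2: the equal keys land in different buckets, so equals() is never consulted.
        System.out.println(distinctKeys(broken));
        // 1: equal hash codes bring the keys together, then equals() confirms the match.
        System.out.println(distinctKeys(fixed));
    }
}
```

If this is the cause, overriding hashCode() in SourceObject so that it agrees with equals() should let case 1 behave like case 2.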

Related

The function in map is not executed [duplicate]

When I call the map function of an RDD, it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List ("a" , "d" , "c" , "d")
list.map(l => {
println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
println("mapping RDD")
})
Result of above code is :
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap( dist: RDD[(String)]) = {
var testMap = new java.util.HashMap[String , String]();
dist.map(m => {
println("populating map")
testMap.put(m , m)
})
testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null
Is this due to lazy evaluation ?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, not on the driver where you are expecting it but on the worker executing that closure. Try looking into the logs of the workers.
A similar thing happens with the HashMap population in the second part of the question. The closure, including a copy of testMap, is serialized and sent to the workers, and each task updates its own copy; those updates are never sent back to the driver. The driver's testMap therefore stays empty, and testM.get("a") returns null, because an empty HashMap returns null for a missing key.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data must fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
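The lazy-evaluation part can be illustrated without Spark. The sketch below is only an analogy: it uses java.util.stream rather than an RDD, and LazyMapDemo and lazyPipeline are hypothetical names. As with an RDD transformation, the function passed to map does not run until a terminal operation (the counterpart of a Spark action) is requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyMapDemo {
    // Builds a lazy pipeline; each element is recorded in `log` only when the pipeline actually runs.
    static Stream<String> lazyPipeline(List<String> input, List<String> log) {
        return input.stream().map(s -> {
            log.add(s);
            return s.toUpperCase();
        });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Stream<String> pipeline = lazyPipeline(Arrays.asList("a", "d", "c", "d"), log);
        System.out.println(log.size()); // prints 0: map has only been declared, not executed

        List<String> result = pipeline.collect(Collectors.toList()); // terminal operation triggers execution
        System.out.println(log.size()); // prints 4
        System.out.println(result);     // prints [A, D, C, D]
    }
}
```

Unlike this single-JVM analogy, Spark also runs the function on executors, so side effects on driver-side variables are not visible to the driver even after an action.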

Transform JavaPairDStream to Tuple3 in Java

I am experimenting with the Spark job that streams data from Kafka and produces to Cassandra.
The sample I am working with takes a bunch of words in a given time interval and publishes the word count to Cassandra. I am also trying to publish the timestamp along with the word and its count.
What I have so far is as follows:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
JavaDStream<String> lines = messages.map(Tuple2::_2);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
Now I am trying to append to these records the timestamp. What I have tried is something like this:
Tuple3<String, Date, Integer> finalRecord =
wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));
Which of course is shown as wrong in my IDE. I am completely new to working with Spark libraries and writing in this form (I guess lambda based) functions.
Can someone help me correct this error and achieve what I am trying to do?
After some searching done on the web and studying some examples I was able to achieve what I wanted as follows.
In order to append the timestamp attribute to the existing Tuple with two values, I had to create a simple bean which represents my Cassandra row.
public static class WordCountRow implements Serializable {
String word = "";
long timestamp;
Integer count = 0;
// Constructor matching the call made in the map step below
public WordCountRow(String word, long timestamp, Integer count) {
this.word = word; this.timestamp = timestamp; this.count = count;
}
}
Then, I had to map the (word, count) Tuple2 objects in the JavaPairDStream structure to a JavaDStream structure that holds objects of the above WordCountRow class.
JavaDStream<WordCountRow> wordCountRows = wordCounts.map((Function<Tuple2<String, Integer>, WordCountRow>)
tuple -> new WordCountRow(tuple._1, new Date().getTime(), tuple._2));
Finally, I could call foreachRDD method on this structure (which returns objects of WordCountRow) which I can write to Cassandra one after the other.
wordCountRows.foreachRDD((VoidFunction2<JavaRDD<WordCountRow>,Time>)(rdd,time)->{
final SparkConf sc=rdd.context().getConf();
final CassandraConnector cc=CassandraConnector.apply(sc);
rdd.foreach((VoidFunction<WordCountRow>)wordCount->{
try(Session session=cc.openSession()){
String query=String.format(Joiner.on(" ").join(
"INSERT INTO test_keyspace.word_count",
"(word, ts, count)",
"VALUES ('%s', %s, %s);"),
wordCount.word,wordCount.timestamp,wordCount.count);
session.execute(query);
}
});
});
Thanks

How to create Spark broadcast variable from Java String array?

I have a Java String array which contains 45 strings, which are basically column names:
String[] fieldNames = {"colname1","colname2",...};
Currently I am storing the above array of Strings in a static field in the Spark driver. My job is running slowly, so I am trying to refactor the code. I use the above String array when creating a DataFrame:
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using a broadcast variable so that it doesn't ship a huge string array to every executor. I believe we can do something like the following to create the broadcast:
String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error
DataFrame df = sourceFrame.toDF(???);//how do I use above broadcast can I use it as is by passing brArray
I am new to Spark.
This is a somewhat old question; however, I hope my solution helps somebody.
In order to broadcast any object (a single POJO or a collection) with Spark 2+, you first need the following method, which creates a ClassTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
Next, you use the SparkContext obtained from a SparkSession to broadcast your object as before:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(YourObject.class)
)
In case of a collection, say, java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(List.class)
)
The return value of sc.broadcast is of type Broadcast<String[]> and not String[]. When you want to access the value, you simply call value() on the variable. For your example it would look like:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());
Note, that if you are writing this in Java, you probably want to wrap the SparkContext within the JavaSparkContext. It makes everything easier and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more on broadcasting variables on http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
ArrayList<String> dataToBroadcast = new ArrayList<>();
dataToBroadcast.add("string1");
...
dataToBroadcast.add("stringn");
// Creating the broadcast variable.
// No need to write the classTag code by hand; use akka.japi.Util, which is available.
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
dataToBroadcast,
akka.japi.Util.classTag(ArrayList.class));
// Here is the catch: when you iterate over a Dataset,
// Spark actually runs the function in distributed mode. So if you try to access
// your object directly (e.g. dataToBroadcast), it would be null,
// because you didn't explicitly ask Spark to send that outside variable to each
// machine where this foreach runs in parallel.
// So you need to use a broadcast variable (the most common use of Broadcast).
someSparkDataSetWhere.foreach((row) -> {
ArrayList<String> stringlist = strngBrdCast.value();
...
...
});

Store countByKey result into Cassandra

I want to count the number of IndicatePresence messages for each user for any given day (out of a Cassandra table), and then store this in a separate Cassandra table to drive some dashboard pages. I managed to get the 'countByKey' working, but now cannot figure out how to use the Spark-Cassandra 'saveToCassandra' method with a Map (it only takes RDD).
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> indicatePresenceTable = javaFunctions(sc).cassandraTable("mykeyspace", "indicatepresence");
JavaPairRDD<UserDate, CassandraRow> keyedByUserDate = indicatePresenceTable.keyBy(new Function<CassandraRow, UserDate>() {
private static final long serialVersionUID = 1L;
@Override
public UserDate call(CassandraRow cassandraIndicatePresenceRow) throws Exception {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
return new UserDate(cassandraIndicatePresenceRow.getString("userid"), sdf.format(cassandraIndicatePresenceRow.getDate("date")));
}
});
Map<UserDate, Object> countByKey = keyedByUserDate.countByKey();
writerBuilder("analytics", "countbykey", ???).saveToCassandra();
Is there a way use a Map directly in a writerBuilder? Or should I write my own custom reducer, that returns an RDD, but essentially does the same thing as the countByKey method? Or, should I convert each entry in the Map into a new POJO (eg UserDateCount, with user, date, and count) and use 'parallelize' to turn the list into an RDD and then store that?
The best thing to do would be to never return the result to the driver (by using countByKey). Instead do a reduceByKey to get another RDD back in the form of (key, count). Map that RDD to the row format of your table and then call saveToCassandra on that.
The most important strength of this approach is that we never serialize the data back to the driver application. All the information is kept on the cluster and saved from there directly to C*, rather than running through the bottleneck of the driver application.
Example (Very Similar to a Map Reduce Word Count):
Map each element to (key, 1)
Call reduceByKey to change (key, 1) -> (key, count)
Map each element to something writeable to C* (key,count)-> WritableObject
Call save to C*
In Scala this would be something like
keyedByUserDate
.map(kv => (kv._1, 1)) // Take the key portion of the tuple and replace the value portion with 1
.reduceByKey( _ + _ ) // Combine the value portions for all elements which share a key
.map{ case (key, value) => your C* format} // Change the Tuple2 to something that matches your C* table
.saveToCassandra(ks,tab) // Save to Cassandra
In Java it is a little more convoluted (insert your types for K and V):
JavaRDD<OutputTableClass> rows = keyedByUserDate
.mapToPair(new PairFunction<Tuple2<K, V>, K, Long>() {
@Override
public Tuple2<K, Long> call(Tuple2<K, V> input) throws Exception {
return new Tuple2<>(input._1(), 1L); // keep the key, replace the value with a count of 1
}
}).reduceByKey(new Function2<Long, Long, Long>() {
@Override
public Long call(Long value1, Long value2) throws Exception {
return value1 + value2; // sum the counts for each key
}
}).map(new Function<Tuple2<K, Long>, OutputTableClass>() {
@Override
public OutputTableClass call(Tuple2<K, Long> input) throws Exception {
// Do some work here
return new OutputTableClass(col1,col2,col3 ... colN);
}
});
// Save through the connector's Java API, as in the question's writerBuilder call
javaFunctions(rows).writerBuilder(ks, tab, mapToRow(OutputTableClass.class)).saveToCassandra();

Periodic Broadcast in Apache Spark Streaming

I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt that there is a way to broadcast variables after the initial broadcast. So what happens if I need to broadcast a variable every time I update it? If there is a way to do it, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class over the broadcast variable. The updateAndGet method of wrapper class returns the refreshed broadcast variable. I am calling this function inside dStream.transform -> as per the Spark Documentation
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
Transform Operation states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
BroadcastWrapper class will look like :
public class BroadcastWrapper {
private Broadcast<ReferenceData> broadcastVar;
private Date lastUpdatedAt = Calendar.getInstance().getTime();
private static BroadcastWrapper obj = new BroadcastWrapper();
private BroadcastWrapper(){}
public static BroadcastWrapper getInstance() {
return obj;
}
public JavaSparkContext getSparkContext(SparkContext sc) {
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
return jsc;
}
public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext){
Date currentDate = Calendar.getInstance().getTime();
long diff = currentDate.getTime()-lastUpdatedAt.getTime();
if (broadcastVar == null || diff > 60000) { // Let's say we want to refresh every 1 min = 60000 ms
if (broadcastVar != null)
broadcastVar.unpersist();
lastUpdatedAt = new Date(System.currentTimeMillis());
// Your logic to refresh
ReferenceData data = getRefData();
broadcastVar = getSparkContext(sparkContext).broadcast(data);
}
return broadcastVar;
}
}
You can use this updateAndGet function inside the stream.transform method, which allows RDD-to-RDD transformations:
objectStream.transform(stream -> {
Broadcast<ReferenceData> broadcastVar = BroadcastWrapper.getInstance().updateAndGet(stream.context());
/**Your code to manipulate stream **/
});
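The refresh rule used by updateAndGet (reload when nothing is loaded yet or when the value is older than a threshold) can be isolated and tested without Spark. The following is a minimal sketch under stated assumptions, not the author's implementation: RefreshingHolder and RefreshingHolderDemo are hypothetical names, the loader stands in for "unpersist the old broadcast and broadcast fresh data", and the clock is injected so the timing can be controlled in a test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: holds a value and reloads it once it is older than maxAgeMillis.
final class RefreshingHolder<T> {
    private final Supplier<T> loader;   // assumption: in Spark this would re-broadcast fresh data
    private final LongSupplier clock;   // injected time source, in milliseconds
    private final long maxAgeMillis;
    private T current;
    private long loadedAt;

    RefreshingHolder(Supplier<T> loader, LongSupplier clock, long maxAgeMillis) {
        this.loader = loader;
        this.clock = clock;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Returns the cached value, reloading it first if it is missing or too old.
    synchronized T updateAndGet() {
        long now = clock.getAsLong();
        if (current == null || now - loadedAt > maxAgeMillis) {
            current = loader.get();
            loadedAt = now;
        }
        return current;
    }
}

final class RefreshingHolderDemo {
    public static void main(String[] args) {
        RefreshingHolder<String> holder =
                new RefreshingHolder<>(() -> "reference data", System::currentTimeMillis, 60000L);
        System.out.println(holder.updateAndGet()); // first call always loads
        System.out.println(holder.updateAndGet()); // second call within a minute reuses the cached value
    }
}
```

The method is synchronized because the holder is shared mutable state; whether that matters depends on how many threads on the driver call it.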
Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245
Hope it helps
My understanding is that once a broadcast variable is initially sent out, it is 'read only'. I believe you can update the broadcast variable on the local nodes, but not on remote nodes.
Maybe you need to consider doing this 'outside Spark'. How about using a NoSQL store (Cassandra, etc.) or even Memcache? You can then update the variable from one task and periodically check this store from other tasks.
I found an ugly hack, but it worked!
The source shows how a broadcast value is looked up from a broadcast object: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
The lookup uses only the broadcast id,
so I periodically rebroadcast under the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastcontent = broadcastFactory.newBroadcast[Set[String]](initIds, false, BroadcastId)
The BroadcastId comes from the first broadcast value:
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = ids.id
The workers then use ids as a normal Broadcast value.
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)
bkc.destroy()
bkc = sc.broadcast(tableResultMap)
bkv = bkc.value
You may try this; I cannot guarantee that it works.
It is best that you collect the data to the driver and then broadcast them to all nodes.
Use DStream#foreachRDD to collect the computed RDDs at the driver, and once you know that you need to change the learning rate, use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD{ rdd =>
val valueToBroadcast = rdd.collect()
sc.broadcast(valueToBroadcast)
}
You may also find this thread useful, from the spark user mailing list. Let me know if that works.
