How to create Spark broadcast variable from Java String array? - apache-spark

I have a Java String array that contains 45 strings, which are basically column names:
String[] fieldNames = {"colname1","colname2",...};
Currently I store this array of Strings in a static field on the Spark driver. My job is running slowly, so I am trying to refactor the code. I use the array when creating a DataFrame:
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using a broadcast variable so that it doesn't ship the huge string array to every executor. I believe we can do something like the following to create the broadcast:
String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error
DataFrame df = sourceFrame.toDF(???);//how do I use above broadcast can I use it as is by passing brArray
I am new to Spark.

This is a fairly old question, but I hope my solution helps somebody.
To broadcast any object (a single POJO or a collection) with Spark 2+, you first need a helper method that creates a ClassTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
Next, use the SparkContext obtained from a SparkSession to broadcast your object:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(YourObject.class)
)
In case of a collection, say, java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(List.class)
)

sc.broadcast returns a Broadcast<String[]>, not a String[]. To access the value, simply call value() on the variable. For your example it would look like this:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());
Note that if you are writing this in Java, you probably want to wrap the SparkContext in a JavaSparkContext. It makes everything easier, and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more about broadcast variables at http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

ArrayList<String> dataToBroadcast = new ArrayList<>();
dataToBroadcast.add("string1");
...
dataToBroadcast.add("stringn");
// Create the broadcast variable.
// No need to write the ClassTag code by hand: akka.japi.Util.classTag does it,
// provided Akka is on your classpath.
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
    dataToBroadcast,
    akka.japi.Util.classTag(ArrayList.class));
// Here is the catch: when you iterate over a Dataset,
// Spark actually runs the code in distributed mode. So if you try to access
// your object directly (e.g. dataToBroadcast), it would be null,
// because you didn't explicitly ask Spark to send that outside variable to each
// machine that runs this foreach in parallel.
// So you need to use a broadcast variable (the most common use of broadcasts).
someSparkDataSetWhere.foreach((row) -> {
ArrayList<String> stringlist = strngBrdCast.value();
...
...
});

Related

Spark dataset : Casting Columns of dataset

This is my dataset :
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like this:
Dataset<Row> myResultNew = myResult.select(col("number")
, col("name")
, col("age")
, col("class")
, col("mask"));
name , age and class are created from column document from Dataset myResult .
I guess I can call functions on the column document and then perform any operation on that.
myResult.select(extract(col("document")));
private String extract(final Column document) {
//TODO ADD A NEW COLUMN nam, age, class TO THE NEW DATASET.
// PARSE DOCUMENT AND GET THEM.
XMLParser doc= (XMLParser) document // this doesnt work???????
}
My question is: document is of type Column, and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is XML, and I need to parse it to get the other three columns, so I can't avoid converting it to XML.
Converting the extract method into a UDF would be the solution closest to what you are asking for. A UDF can take the value of one or more columns and execute any logic on this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
[...]
UserDefinedFunction extract = udf(
(String document) -> {
List<String> result = new ArrayList<>();
XMLParser doc = XMLParser.parse(document);
String name = ... //read name from xml document
String age = ... //read age from xml document
String clazz = ... //read class from xml document
result.add(name);
result.add(age);
result.add(clazz);
return result;
}, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
.withColumn("extract", extract.apply(col("document"))) //1
.withColumn("name", col("extract").getItem(0)) //2
.withColumn("age", col("extract").getItem(1)) //2
.withColumn("class", col("extract").getItem(2)) //2
.drop("document", "extract"); //3
1. Call the UDF and pass the column that contains the XML document as the parameter of the apply function.
2. Create the result columns from the array returned in step 1.
3. Drop the intermediate columns.
Note: the UDF is executed once per row in the dataset. If creating the XML parser is expensive, this might slow down the Spark job, because one parser is instantiated per row. Due to the parallel nature of Spark, the parser cannot simply be reused for the next row. If this is an issue, another option (slightly more complex, at least in the Java world) would be to use mapPartitions. There you would not need one parser per row, only one parser per partition of the dataset.
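The "one parser per partition" idea can be illustrated without Spark. The following is only a sketch in plain Java: CountingParser is a hypothetical stand-in for an expensive XML parser, and each inner list plays the role of one partition. The parsePartition method has the same shape as a mapPartitions function (one iterator in, one iterator out), so the expensive object is built once and reused for every row it sees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PerPartitionParserSketch {

    /** Hypothetical stand-in for an expensive XML parser; counts constructions. */
    static final class CountingParser {
        static int instancesCreated = 0;

        CountingParser() {
            instancesCreated++;
        }

        /** Pretend "parsing": just trims the input. */
        String parse(String document) {
            return document.trim();
        }
    }

    /** Same shape as a mapPartitions function: one iterator in, one iterator out. */
    static Iterator<String> parsePartition(Iterator<String> documents) {
        CountingParser parser = new CountingParser(); // created once per partition
        List<String> parsed = new ArrayList<>();
        while (documents.hasNext()) {
            parsed.add(parser.parse(documents.next())); // reused for every row
        }
        return parsed.iterator();
    }

    public static void main(String[] args) {
        // Two simulated partitions holding three documents in total.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList(" <a/> ", " <b/> "),
                Arrays.asList(" <c/> "));
        for (List<String> partition : partitions) {
            parsePartition(partition.iterator()).forEachRemaining(System.out::println);
        }
        // One parser per partition, not one per document.
        System.out.println("parsers created: " + CountingParser.instancesCreated);
    }
}
```

In a real job the body of parsePartition would move into the function passed to mapPartitions, and the output would need an encoder for the result type; those details are left out here.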
A completely different approach would be to use spark-xml.

Transform JavaPairDStream to Tuple3 in Java

I am experimenting with the Spark job that streams data from Kafka and produces to Cassandra.
The sample I am working with takes a bunch of words in a given time interval and publishes the word count to Cassandra. I am also trying to publish the timestamp along with the word and its count.
What I have so far is as follows:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
JavaDStream<String> lines = messages.map(Tuple2::_2);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
Now I am trying to append to these records the timestamp. What I have tried is something like this:
Tuple3<String, Date, Integer> finalRecord =
wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));
This is, of course, flagged as an error in my IDE. I am completely new to the Spark libraries and to writing functions in this (I guess lambda-based) form.
Can someone help me correct this error and achieve what I am trying to do?
After some searching on the web and studying some examples, I was able to achieve what I wanted as follows.
To append the timestamp attribute to the existing tuple with two values, I had to create a simple bean that represents my Cassandra row.
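public static class WordCountRow implements Serializable {
    String word = "";
    long timestamp;
    Integer count = 0;
    // Constructor used by the map call that builds the rows
    WordCountRow(String word, long timestamp, Integer count) {
        this.word = word; this.timestamp = timestamp; this.count = count;
    }
}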
Then I had to map the (word, count) Tuple2 objects in the JavaPairDStream structure to a JavaDStream structure that holds objects of the above WordCountRow class.
JavaDStream<WordCountRow> wordCountRows = wordCounts.map((Function<Tuple2<String, Integer>, WordCountRow>)
tuple -> new WordCountRow(tuple._1, new Date().getTime(), tuple._2));
Finally, I could call the foreachRDD method on this structure (whose RDDs contain WordCountRow objects) and write the rows to Cassandra one after the other.
wordCountRows.foreachRDD((VoidFunction2<JavaRDD<WordCountRow>,Time>)(rdd,time)->{
final SparkConf sc=rdd.context().getConf();
final CassandraConnector cc=CassandraConnector.apply(sc);
rdd.foreach((VoidFunction<WordCountRow>)wordCount->{
try(Session session=cc.openSession()){
String query=String.format(Joiner.on(" ").join(
"INSERT INTO test_keyspace.word_count",
"(word, ts, count)",
"VALUES ('%s', %s, %s);"),
wordCount.word,wordCount.timestamp,wordCount.count);
session.execute(query);
}
});
});
Thanks

broadcast variable fails to take all data

When I create a broadcast variable with collectAsMap(), not all of the values end up in the broadcast variable. For example:
val emp = sc.textFile("...text1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("...text2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
emp_new.foreach(println)
val emp_newBC = sc.broadcast(emp_new.collectAsMap())
println(emp_newBC.value)
When I checked the values in emp_newBC, I saw that not all the data from emp_new appears. What am I missing?
Thanks in advance.
The problem is that emp_new is a collection of tuples, while emp_newBC is a broadcast map. When you collect into a map, tuples with duplicate keys are collapsed into a single entry per key, so you end up with less data. If you want to get back a list of all tuples, use
val emp_newBC = sc.broadcast(emp_new.collect())
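The effect is easy to reproduce without Spark. The following plain-Java sketch (the class name, keys and values are made up for illustration) folds three distinct pairs into a map. Two of them share a key, so the map ends up with only two entries. In this sketch the later pair wins because of how Map.put works; with collectAsMap you should not rely on which value survives.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CollectAsMapSketch {

    /** Folds pairs into a map; a later pair with the same key overwrites the earlier one. */
    static Map<String, String> toMap(List<Map.Entry<String, String>> pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            result.put(pair.getKey(), pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Three distinct pairs, but only two distinct keys.
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("dept1", "alice"));
        pairs.add(new SimpleEntry<>("dept1", "bob"));
        pairs.add(new SimpleEntry<>("dept2", "carol"));

        Map<String, String> asMap = toMap(pairs);
        System.out.println("pairs: " + pairs.size());       // 3
        System.out.println("map entries: " + asMap.size()); // 2
    }
}
```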

Periodic Broadcast in Apache Spark Streaming

I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt that there is a way to broadcast variables after the initial broadcast. So what happens if I need to broadcast a variable every time I update it? If there is a way to do it, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class around the broadcast variable. The updateAndGet method of the wrapper class returns the refreshed broadcast variable. I call this method inside dStream.transform, as per the Spark documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
The Transform Operation section states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
The BroadcastWrapper class looks like this:
public class BroadcastWrapper {
    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();
    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        return JavaSparkContext.fromSparkContext(sc);
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        // Refresh on first use, or (let's say) every 1 min = 60000 ms
        if (broadcastVar == null || diff > 60000) {
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());
            // Your logic to refresh
            ReferenceData data = getRefData();
            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
You can call updateAndGet inside the stream.transform method, which allows RDD-to-RDD transformations:
objectStream.transform(stream -> {
    Broadcast<ReferenceData> broadcastVar = BroadcastWrapper.getInstance().updateAndGet(stream.context());
    /** Your code to manipulate stream **/
});
Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245
Hope it helps
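The time-based refresh at the heart of the wrapper can be tried out without Spark. This is only a sketch under stated assumptions: RefreshingHolder, its loader and its injected clock are hypothetical names, not part of the answer or of Spark. In the real wrapper, the reload step is where the old broadcast would be unpersisted and a new one created.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class RefreshingHolder<T> {
    private final Supplier<T> loader;   // hypothetical: reloads the reference data
    private final LongSupplier clock;   // injected so the behaviour can be tested
    private final long maxAgeMillis;
    private T value;
    private long loadedAt;
    private boolean loaded = false;

    RefreshingHolder(Supplier<T> loader, LongSupplier clock, long maxAgeMillis) {
        this.loader = loader;
        this.clock = clock;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the cached value, reloading it first if it is missing or too old. */
    synchronized T updateAndGet() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt > maxAgeMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0L};
        int[] loads = {0};
        RefreshingHolder<String> holder = new RefreshingHolder<>(
                () -> "v" + (++loads[0]), () -> fakeNow[0], 60_000L);

        System.out.println(holder.updateAndGet()); // first call loads: v1
        fakeNow[0] = 30_000L;
        System.out.println(holder.updateAndGet()); // still fresh: v1
        fakeNow[0] = 90_000L;
        System.out.println(holder.updateAndGet()); // older than one minute: v2
    }
}
```

Injecting the clock instead of calling System.currentTimeMillis() directly is a design choice made for this sketch only, so that the refresh rule can be checked without waiting a minute.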
My understanding is that once a broadcast variable has been sent out, it is read-only. I believe you can update the broadcast variable on the local node, but not on remote nodes.
Maybe you need to consider doing this outside Spark. How about using a NoSQL store (Cassandra etc.) or even Memcache? You could then update the variable from one task and periodically check this store from the other tasks.
I came up with an ugly hack, but it worked!
The source shows how a broadcast value is retrieved from a broadcast object: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
It is looked up just by broadcast id,
so I periodically rebroadcast under the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastcontent = broadcastFactory.newBroadcast[Set[String]](initIds, false, BroadcastId)
I can get BroadcastId from the first broadcast:
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = ids.id
The workers then use ids as a normal Broadcast:
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)
bkc.destroy()
bkc = sc.broadcast(tableResultMap)
bkv = bkc.value
You may try this; I cannot guarantee that it is effective.
It is best to collect the data to the driver and then broadcast it to all nodes.
Use DStream#foreachRDD to collect the computed RDDs at the driver, and once you know that the learning rate needs to change, use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD{ rdd =>
val valueToBroadcast = rdd.collect()
sc.broadcast(valueToBroadcast)
}
You may also find this thread useful, from the spark user mailing list. Let me know if that works.

How to update multiple rows using Hector

Is there a way to update multiple rows in a Cassandra database using a column family template, for example by supplying a list of keys?
Currently I use the ColumnFamilyTemplate updater to loop through a list of keys and update each row. I have seen queries like multigetSliceQuery, but I don't know their equivalent for updates.
There is no utility method in ColumnFamilyTemplate that allows you to pass a list of keys with a list of mutations in one call.
You can implement your own using mutators.
This is the basic code for doing it in Hector:
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;
Set<HColumn<String, String>> columns = new HashSet<HColumn<String, String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
    columns.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}
Mutator<String> mutator = template.createMutator();
String columnFamilyName = template.getColumnFamily();
for (String key : keys) {
    for (HColumn<String, String> column : columns) {
        // Queue the insertion; nothing is sent until execute() is called
        mutator.addInsertion(key, columnFamilyName, column);
    }
}
// Send all queued mutations in one batch
mutator.execute();
It should look something like that. This is an example for insertion; be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since these execute right away, without waiting for mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: A mutator allows you to batch mutations on multiple rows on multiple column families at once ... which is why I generally prefer to use them instead of CF templates. I have a lot of denormalization for functionalities that use the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within the thrift_max_message_length_in_mb limit). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.

Resources