Transform JavaPairDStream to Tuple3 in Java - apache-spark

I am experimenting with a Spark job that streams data from Kafka and writes to Cassandra.
The sample I am working with takes a bunch of words in a given time interval and publishes the word count to Cassandra. I am also trying to publish the timestamp along with the word and its count.
What I have so far is as follows:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
JavaDStream<String> lines = messages.map(Tuple2::_2);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
Now I am trying to append to these records the timestamp. What I have tried is something like this:
Tuple3<String, Date, Integer> finalRecord =
wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));
Which, of course, is shown as wrong in my IDE. I am completely new to working with the Spark libraries and to writing functions in this form (lambda-based, I guess).
Can someone help me correct this error and achieve what I am trying to do?
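For reference, here is a minimal sketch of the direct fix, using the names from the question: map on a JavaPairDStream returns a plain JavaDStream of whatever the lambda produces, and new Date().getTime() yields a long, so the middle type parameter should be Long rather than Date.
import java.util.Date;
import scala.Tuple3;
import org.apache.spark.streaming.api.java.JavaDStream;

// map() on a JavaPairDStream yields a JavaDStream of the lambda's return type
JavaDStream<Tuple3<String, Long, Integer>> finalRecords =
        wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));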

After some searching done on the web and studying some examples I was able to achieve what I wanted as follows.
In order to append the timestamp attribute to the existing two-value Tuple, I had to create a simple bean which represents my Cassandra row.
public static class WordCountRow implements Serializable {
    String word = "";
    long timestamp;
    Integer count = 0;
    public WordCountRow(String word, long timestamp, Integer count) {
        this.word = word; this.timestamp = timestamp; this.count = count;
    }
}
Then, I had to map the (word, count) Tuple2 objects in the JavaPairDStream structure to a JavaDStream structure that holds objects of the above WordCountRow class.
JavaDStream<WordCountRow> wordCountRows = wordCounts.map((Function<Tuple2<String, Integer>, WordCountRow>)
tuple -> new WordCountRow(tuple._1, new Date().getTime(), tuple._2));
Finally, I could call the foreachRDD method on this structure (which yields WordCountRow objects) and write each record to Cassandra one after the other.
wordCountRows.foreachRDD((VoidFunction2<JavaRDD<WordCountRow>, Time>) (rdd, time) -> {
    final SparkConf sc = rdd.context().getConf();
    final CassandraConnector cc = CassandraConnector.apply(sc);
    rdd.foreach((VoidFunction<WordCountRow>) wordCount -> {
        try (Session session = cc.openSession()) {
            String query = String.format(Joiner.on(" ").join(
                    "INSERT INTO test_keyspace.word_count",
                    "(word, ts, count)",
                    "VALUES ('%s', %s, %s);"),
                    wordCount.word, wordCount.timestamp, wordCount.count);
            session.execute(query);
        }
    });
});
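As a side note, rather than hand-building CQL strings, the connector's streaming Java API can write the bean directly. This is only a rough sketch; it assumes the WordCountRow field names are adjusted to match the table's columns (word, ts, count) and that the bean has the usual getters/setters.
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// Writes every WordCountRow of each micro-batch to test_keyspace.word_count.
// Assumes the bean's field names line up with the table's column names.
javaFunctions(wordCountRows)
        .writerBuilder("test_keyspace", "word_count", mapToRow(WordCountRow.class))
        .saveToCassandra();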
Thanks

Related

How to retrieve a particular string from text in Kotlin?

I am getting the scan result in a string like ---
DriverId=60cb1daa20056c0c92ebe457,Amount=10.0
I want to retrieve the driver id and the amount from this string.
How can I retrieve them?
Please help...
It depends on your overall format. Basic operations like substrings, as suggested by @iLoveYou3000, can work fine if you really have this fixed format.
If the keys are dynamic, or could be changed in the future, you could also use more general approaches, for instance using split():
val attributeStrings = input.split(",")
val attributesMap = attributeStrings.map { it.split("=") }.associate { it[0] to it[1] }
val driverId = attributesMap["DriverId"]
val amount = attributesMap["Amount"].toDouble() // or .toBigDecimal()
This is one of the possible ways that I could think of.
val driverID= str.substringAfter("DriverId=", "").substringBefore(",", "")
val amount = str.substringAfter("Amount=", "")

Spark dataset : Casting Columns of dataset

This is my dataset:
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
, col("name")
, col("age")
, col("class")
, col("mask"));
name, age and class are created from the column document of Dataset myResult.
I guess I can call functions on the column document and then perform any operation on that.
myResult.select(extract(col("document")));
private String extract(final Column document) {
//TODO ADD A NEW COLUMN nam, age, class TO THE NEW DATASET.
// PARSE DOCUMENT AND GET THEM.
XMLParser doc= (XMLParser) document // this doesnt work???????
}
My question is: document is of type Column, and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is XML, and I need to parse it to get the other 3 columns, so I can't avoid converting it to an XML object.
Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the value of one or more columns and execute any logic with this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
[...]
UserDefinedFunction extract = udf(
(String document) -> {
List<String> result = new ArrayList<>();
XMLParser doc = XMLParser.parse(document);
String name = ... //read name from xml document
String age = ... //read age from xml document
String clazz = ... //read class from xml document
result.add(name);
result.add(age);
result.add(clazz);
return result;
}, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
.withColumn("extract", extract.apply(col("document"))) //1
.withColumn("name", col("extract").getItem(0)) //2
.withColumn("age", col("extract").getItem(1)) //2
.withColumn("class", col("extract").getItem(2)) //2
.drop("document", "extract"); //3
1. call the UDF, using the column that contains the xml document as the parameter of the apply function
2. create the result columns out of the array returned in step 1
3. drop the intermediate columns
Note: the UDF is executed once per row in the dataset. If the creation of the XML parser is expensive, this might slow down the execution of the Spark job, as one parser is instantiated per row. Due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions. Here one would not need one parser per row but only one parser per partition of the dataset.
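To make the mapPartitions idea concrete, here is a rough sketch; PersonRecord is a hypothetical bean holding the extracted fields, and XMLParser with a toPersonRecord helper is a placeholder for whatever parsing logic the extract method would contain.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<PersonRecord> parsed = myResult.mapPartitions(
        (MapPartitionsFunction<Row, PersonRecord>) rows -> {
            // One parser per partition instead of one per row
            XMLParser parser = new XMLParser();
            List<PersonRecord> out = new ArrayList<>();
            while (rows.hasNext()) {
                Row row = rows.next();
                String document = row.getAs("document");
                // parse name/age/class out of the document, keep number/mask from the row
                out.add(parser.toPersonRecord(row, document));
            }
            return out.iterator();
        },
        Encoders.bean(PersonRecord.class));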
A completely different approach would be to use spark-xml.

How to create Spark broadcast variable from Java String array?

I have a Java String array which contains 45 strings, which are basically column names:
String[] fieldNames = {"colname1","colname2",...};
Currently I am storing the above array of Strings in a static field on the Spark driver. My job is running slow, so I am trying to refactor the code. I am using the above String array while creating a DataFrame:
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using a broadcast variable so that it doesn't ship the huge string array to every executor. I believe we can do something like the following to create the broadcast:
String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error
DataFrame df = sourceFrame.toDF(???);//how do I use above broadcast can I use it as is by passing brArray
I am new to Spark.
This is a bit of an old question; however, I hope my solution helps somebody.
In order to broadcast any object (could be a single POJO or a collection) with Spark 2+ you first need to have the following method that creates a classTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
Next you use the SparkContext from your SparkSession to broadcast your object:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(YourObject.class)
)
In case of a collection, say, java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(List.class)
)
The return variable of sc.broadcast is of type Broadcast<String[]> and not String[]. When you want to access the value, you simply call value() on the variable. From your example it would be like:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());
Note, that if you are writing this in Java, you probably want to wrap the SparkContext within the JavaSparkContext. It makes everything easier and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more on broadcasting variables on http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
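To make that note about JavaSparkContext concrete, here is a small sketch (no ClassTag needed); sc, sourceFrame and fieldNames are the names assumed from the question.
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Assuming sc is the underlying SparkContext; skip the wrapping if you already have a JavaSparkContext
JavaSparkContext jsc = new JavaSparkContext(sc);
// JavaSparkContext.broadcast does not require a ClassTag
Broadcast<String[]> broadcastedFieldNames = jsc.broadcast(fieldNames);

// On the driver, value() simply returns the original array
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());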
ArrayList<String> dataToBroadcast = new ArrayList<>();
dataToBroadcast.add("string1");
...
dataToBroadcast.add("stringn");
//Creating the broadcast variable
//No need to write classTag code by hand; use akka.japi.Util which is available
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
    dataToBroadcast,
    akka.japi.Util.classTag(ArrayList.class));
//Here is the catch. When you are iterating over a Dataset,
//Spark will actually run it in distributed mode. So if you try to access
//your object directly (e.g. dataToBroadcast) it will be null,
//because you didn't ask Spark to explicitly send the outside variable to each
//machine where this runs in parallel.
//So you need to use a broadcast variable. (Most common use of Broadcast)
someSparkDataSetWhere.foreach((row) -> {
ArrayList<String> stringlist = strngBrdCast.value();
...
...
})

Store countByKey result into Cassandra

I want to count the number of IndicatePresence messages for each user for any given day (out of a Cassandra table), and then store this in a separate Cassandra table to drive some dashboard pages. I managed to get 'countByKey' working, but now I cannot figure out how to use the Spark-Cassandra 'saveToCassandra' method with a Map (it only takes an RDD).
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> indicatePresenceTable = javaFunctions(sc).cassandraTable("mykeyspace", "indicatepresence");
JavaPairRDD<UserDate, CassandraRow> keyedByUserDate = indicatePresenceTable.keyBy(new Function<CassandraRow, UserDate>() {
private static final long serialVersionUID = 1L;
@Override
public UserDate call(CassandraRow cassandraIndicatePresenceRow) throws Exception {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
return new UserDate(cassandraIndicatePresenceRow.getString("userid"), sdf.format(cassandraIndicatePresenceRow.getDate("date")));
}
});
Map<UserDate, Object> countByKey = keyedByUserDate.countByKey();
writerBuilder("analytics", "countbykey", ???).saveToCassandra();
Is there a way to use a Map directly in a writerBuilder? Or should I write my own custom reducer that returns an RDD but essentially does the same thing as the countByKey method? Or should I convert each entry in the Map into a new POJO (e.g. UserDateCount, with user, date, and count), use 'parallelize' to turn the list into an RDD, and then store that?
The best thing to do would be to never return the result to the driver (which is what countByKey does). Instead, do a reduceByKey to get another RDD back in the form of (key, count). Map that RDD to the row format of your table and then call saveToCassandra on it.
The most important strength of this approach is that we never serialize the data back to the driver application. All the information is kept on the cluster and saved from there directly to C*, rather than running through the bottleneck of the driver application.
Example (Very Similar to a Map Reduce Word Count):
Map each element to (key, 1)
Call reduceByKey to change (key, 1) -> (key, count)
Map each element to something writeable to C* (key,count)-> WritableObject
Call save to C*
In Scala this would be something like
keyedByUserDate
.map{ case (key, _) => (key, 1) } // Take the Key portion of the tuple and replace the value portion with 1
.reduceByKey( _ + _ ) // Combine the value portions for all elements which share a key
.map{ case (key, value) => your C* format} // Change the Tuple2 to something that matches your C* table
.saveToCassandra(ks,tab) // Save to Cassandra
In Java it is a little more convoluted (insert your own types for K and V):
JavaRDD<OutputTableClass> outputRdd = keyedByUserDate
    .mapToPair(new PairFunction<Tuple2<K, V>, K, Long>() {
        @Override
        public Tuple2<K, Long> call(Tuple2<K, V> input) throws Exception {
            return new Tuple2<>(input._1(), 1L);
        }
    })
    .reduceByKey(new Function2<Long, Long, Long>() {
        @Override
        public Long call(Long value1, Long value2) throws Exception {
            return value1 + value2;
        }
    })
    .map(new Function<Tuple2<K, Long>, OutputTableClass>() {
        @Override
        public OutputTableClass call(Tuple2<K, Long> input) throws Exception {
            //Do some work here
            return new OutputTableClass(col1, col2, col3 /* ... colN */);
        }
    });
javaFunctions(outputRdd)
    .writerBuilder(ks, tab, mapToRow(OutputTableClass.class))
    .saveToCassandra();

Periodic Broadcast in Apache Spark Streaming

I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt that there is a way to broadcast variables after the initial broadcast. So what happens if I need to broadcast a variable every time I update it? If there is a way to do it, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class over the broadcast variable. The updateAndGet method of the wrapper class returns the refreshed broadcast variable. I am calling this function inside dStream.transform, as per the Spark documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
Transform Operation states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
The BroadcastWrapper class will look like:
public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());
            // your logic to refresh the reference data
            ReferenceData data = getRefData();
            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
You can use this updateAndGet function of the wrapper in the stream.transform method, which allows RDD-to-RDD transformations:
objectStream.transform(rdd -> {
    Broadcast<ReferenceData> refDataBroadcast = BroadcastWrapper.getInstance().updateAndGet(rdd.context());
    /** Your code to manipulate the RDD using the refreshed broadcast value **/
    return rdd;
});
Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245
Hope it helps
My understanding is that once a broadcast variable is initially sent out, it is 'read only'. I believe you can update the broadcast variable on the local nodes, but not on the remote nodes.
Maybe you need to consider doing this 'outside Spark'. How about using a NoSQL store (Cassandra, etc.) or even Memcache? You could then update the variable from one task and periodically check that store from the other tasks.
I found an ugly workaround, but it worked!
We can see how a broadcast value is fetched from a broadcast object, just by its broadcast id: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
So I periodically rebroadcast through the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastcontent = broadcastFactory.newBroadcast[Set[String]](initIds, false, BroadcastId)
And I can get the BroadcastId from the first broadcast value:
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = ids.id
Then the workers use ids as a Broadcast type as normal:
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)
bkc.destroy()
bkc = sc.broadcast(tableResultMap)
bkv = bkc.value
You may try this; I can't guarantee whether it is effective.
It is best to collect the data to the driver and then broadcast it to all nodes.
Use DStream#foreachRDD to collect the computed RDDs at the driver, and once you know when you need to change the learning rate, use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD{ rdd =>
val valueToBroadcast = rdd.collect()
sc.broadcast(valueToBroadcast)
}
You may also find this thread useful, from the spark user mailing list. Let me know if that works.
