Spark Streaming: Average of all time - apache-spark

I wrote a Spark Streaming application which receives temperature values and calculates the average temperature over all time. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key). For state tracking I use the StatCounter class, which accumulates the temperature values as doubles and recalculates the average on each batch via the StatCounter.mean method. Here is my program:
EDIT: I reworked my whole code, now using StatCounter:
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));
streamingContext.checkpoint("hdfs://server:8020/spark-history/checkpointing");

JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);

JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
    public SensorData call(String json) throws Exception {
        ObjectMapper om = new ObjectMapper();
        return om.readValue(json, SensorData.class);
    }
});

JavaPairDStream<String, Float> temperatureDStream = sensorDStream.mapToPair(new PairFunction<SensorData, String, Float>() {
    public Tuple2<String, Float> call(SensorData sensorData) throws Exception {
        return new Tuple2<String, Float>(sensorData.getIdSensor(), sensorData.getValTemp());
    }
});

JavaPairDStream<String, StatCounter> statCounterDStream = temperatureDStream.updateStateByKey(new Function2<List<Float>, Optional<StatCounter>, Optional<StatCounter>>() {
    public Optional<StatCounter> call(List<Float> newTemperatures, Optional<StatCounter> statsYet) throws Exception {
        StatCounter stats = statsYet.or(new StatCounter());
        for (float temp : newTemperatures) {
            stats.merge(temp);
        }
        return Optional.of(stats);
    }
});

JavaPairDStream<String, Double> avgTemperatureDStream = statCounterDStream.mapToPair(new PairFunction<Tuple2<String, StatCounter>, String, Double>() {
    public Tuple2<String, Double> call(Tuple2<String, StatCounter> statCounterTuple) throws Exception {
        String key = statCounterTuple._1();
        double avgValue = statCounterTuple._2().mean();
        return new Tuple2<String, Double>(key, avgValue);
    }
});

avgTemperatureDStream.print();
This seems to work fine. But now to the question:
I just found an example online which also shows how to calculate an average over all time here: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
They use AtomicLongs etc. for storing the "stateful values" and update them in a foreachRDD call.
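For context, a rough sketch of that pattern (not the exact code from the linked example; the stream name temperatureValues and the use of DoubleAdder are my own illustration) could look like this:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.DoubleAdder;

// Driver-side running totals, updated once per batch inside foreachRDD.
// temperatureValues is assumed to be a JavaDStream<Float> of raw readings.
final AtomicLong totalCount = new AtomicLong(0);
final DoubleAdder totalSum = new DoubleAdder();

temperatureValues.foreachRDD(rdd -> {
    // Aggregate each batch on the executors, then fold the result into the driver-side totals.
    long batchCount = rdd.count();
    if (batchCount > 0) {
        double batchSum = rdd.mapToDouble(Float::doubleValue).sum();
        totalCount.addAndGet(batchCount);
        totalSum.add(batchSum);
        System.out.println("Average of all time so far: " + totalSum.sum() / totalCount.get());
    }
});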
My question now is: which is the better solution for a stateful all-time calculation in Spark Streaming? Are there any advantages or disadvantages to one approach over the other? Thank you!

Related

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and below is what we want to achieve:
Group the data by deviceId; this is done, but we are not sure whether we can get a Dataset out of it.
We want to loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to a Dataset and applied the aggregation logic on the Dataset using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process died.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far; it actually runs on the worker nodes, but we have yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");

        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));

        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };

        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);

        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); // every 60 seconds one RDD is created
        machineDataRDD.print();

        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >> " + data.next());
                        }
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
The grouping code below gives us an Iterable of strings per device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping using foreachPartition, so that the code execution gets pushed down to the worker nodes.
After looking through more code samples and guidelines, SQLContext and SparkSession are not serializable and not available on the worker nodes, so we will be changing our strategy and not try to build a Dataset within the foreachPartition loop.
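One way to keep the aggregation on the workers without a SparkSession inside foreachPartition is to do the per-device math in plain Java, e.g. with the question's own getMap helper (a sketch only; the "speed" field and the Map<String, Object> return type of getMap are assumptions based on the sample JSON above):

// Sketch: per-device aggregation inside each partition using plain Java collections,
// so nothing non-serializable (SparkSession/SQLContext) is needed on the workers.
groupedData.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
    while (partition.hasNext()) {
        Tuple2<String, Iterable<String>> deviceAndEvents = partition.next();
        String deviceId = deviceAndEvents._1();
        long count = 0;
        double speedSum = 0.0;
        for (String json : deviceAndEvents._2()) {
            Map<String, Object> event = getMap(json); // the question's JSON helper
            Map<String, Object> data = (Map<String, Object>) event.get("data");
            speedSum += ((Number) data.get("speed")).doubleValue();
            count++;
        }
        LOGGER.info("Device " + deviceId + " average speed over window: "
                + (count > 0 ? speedSum / count : 0.0));
    }
}));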

How to output the content of JavaMapWithStateDStream to a text file?

Hi all. I have two questions about a Spark Streaming application.
The first one is how to output the content of a JavaMapWithStateDStream to a text file. I went through the API documentation and found that it is of the DStreamLike interface, so I used the following code to try to write the content out:
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
        new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> call(String word, Optional<Integer> one, State<Integer> state) {
        int sum = one.or(0) + (state.exists() ? state.get() : 0);
        Tuple2<String, Integer> output = new Tuple2<>(word, sum);
        state.update(sum);
        return output;
    }
};

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
        adCounts.mapWithState(StateSpec.function(mappingFunc));

stateDstream.print();

stateDstream.foreachRDD(new Function<JavaRDD<Tuple2<String, Integer>>, Void>() {
    @Override
    public Void call(JavaRDD<Tuple2<String, Integer>> rdd) throws Exception {
        rdd.saveAsTextFile("/path/to/hdfs");
        return null;
    }
});
However, nothing is written to the HDFS path, although I can see the print results on the console. Please tell me what's wrong. How can I output the content of a JavaMapWithStateDStream?
Second question:
I want the real-time result to be updated every batch interval, even if no new data is flowing in. How can I implement that?
Thanks.
I found out why the JavaMapWithStateDStream can print output but does not save anything to the text file: since it is updated/initialized every batch interval, newly arrived data is overwritten by the next interval's initialization, so nothing ends up being saved to the text file.
The workaround is to declare a new variable that holds the underlying stream of stateDstream.
I use DStream here; I think JavaPairDStream should also work.
DStream<Tuple2<String, Integer>> fin_Counts = stateDstream.dstream();
fin_Counts.print();
fin_Counts can be updated and saved.
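A minimal sketch of the saving step on top of that (the output path prefix is illustrative):

// DStream.saveAsTextFiles writes one directory per batch, named <prefix>-<batch time in ms>.<suffix>
fin_Counts.saveAsTextFiles("hdfs:///path/to/output/counts", "txt");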

Measure duration of executing combineByKey function in Spark

I want to measure the time that the execution of the combineByKey function needs. With the code below I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning), independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true, or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;

boolean partitioning = true;  // or false
int partitionsCount = 100;    // between 1 and 200 I can't see any difference in the duration!

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
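(One helper referenced above, createCombiner, is not shown in the question. Judging from how mergeValue parses the CSV, it presumably looks something like the following; this is an assumption added for completeness, not the asker's actual code.)

// Assumed createCombiner: turn the first record seen for a key into an initial Float sum.
private static Function<String, Float> createCombiner = new Function<String, Float>() {
    public Float call(String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        return Float.valueOf(data[2]); // same consumption column that mergeValue reads
    }
};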
The web UI provides details on the jobs/stages that your application has run. It shows the time for each of them, and you can also filter on various metrics such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080. Completed applications are listed there; you can then click on the name, or craft the URL like x.x.x.x:8080/history/app-[APPID], to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means it is not applied to your RDD immediately, as opposed to actions (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structures when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
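To put that concretely, a sketch of the measurement moved around an action (reusing the question's variables) would be:

// combineByKey only records the transformation; the action below is what triggers
// reading the file, the shuffle, and the combiner/merge functions.
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);

long start = System.currentTimeMillis();
long keyCount = consumptionRDD.count();   // action: this is where the real work happens
long actionDuration = System.currentTimeMillis() - start;

System.out.println("combineByKey + count over " + keyCount + " keys took " + actionDuration + " ms");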

How to store and read data from Spark PairRDD

A Spark PairRDD has the option to save to a file.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));

JavaPairRDD<String, Integer> myPairRDD =
    baseRDD.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String input) throws Exception {
            return new Tuple2<String, Integer>(input, input.length());
        }
    });

myPairRDD.saveAsTextFile("path");
SparkContext's textFile reads the data back as a JavaRDD<String> only.
How can I reconstruct the PairRDD directly from the source?
Note:
A possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it, but with huge data this takes a considerable amount of resources.
Storing this intermediate file in a non-text format is also fine.
Execution environment - JRE 1.7
You can save them as an object file if you don't mind the result file not being human-readable.
save file:
myPairRDD.saveAsObjectFile(path);
and then you can read pairs like this:
JavaPairRDD.fromJavaRDD(sc.objectFile(path))
EDIT:
working example:
JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD
= JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
Storing the Spark PairRDD in a sequence file works well in this scenario.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));

JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {
    @Override
    public Tuple2<Text, IntWritable> call(String input) throws Exception {
        return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
    }
});

myPairRDD.saveAsHadoopFile(path, Text.class, IntWritable.class, SequenceFileOutputFormat.class);

JavaPairRDD<Text, IntWritable> newbaseRDD = context.sequenceFile(path, Text.class, IntWritable.class);

// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
    @Override
    public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
        System.out.println(arg0);
    }
});
As suggested by user52045, the following code works with Java 8:
myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, String> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);
Example using Scala:
Reading a text file & saving it in object file format:
val ordersRDD = sc.textFile("/home/cloudera/orders.txt");
ordersRDD.count();
ordersRDD.saveAsObjectFile("orders_save_obj");
Reading the object file back & saving it in text file format:
val ordersRDD = sc.objectFile[String]("orders_save_obj");
ordersRDD.count();
ordersRDD.saveAsTextFile("orders_save_text");

spark-streaming: how to output streaming data to cassandra

I am reading Kafka streaming messages using Spark Streaming.
Now I want to set Cassandra as my output.
I have created a table in Cassandra, "test_table", with columns "key: text primary key" and "value: text".
I have mapped the data successfully into JavaDStream<Tuple2<String,String>> data like this:
JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream",conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(3000));
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap );
JavaDStream<Tuple2<String,String>> data = messages.map(new Function< Tuple2<String,String>, Tuple2<String,String> >()
{
public Tuple2<String,String> call(Tuple2<String, String> message)
{
return new Tuple2<String,String>( message._1(), message._2() );
}
}
);
Then I have created a List:
List<TestTable> list = new ArrayList<TestTable>();
where TestTable is my custom class having the same structure as my Cassandra table, with members "key" and "value":
class TestTable
{
String key;
String val;
public TestTable() {}
public TestTable(String k, String v)
{
key=k;
val=v;
}
public String getKey(){
return key;
}
public void setKey(String k){
key=k;
}
public String getVal(){
return val;
}
public void setVal(String v){
val=v;
}
public String toString(){
return "Key:"+key+",Val:"+val;
}
}
Please suggest a way to add the data from the JavaDStream<Tuple2<String,String>> data into the List<TestTable> list.
I am doing this so that I can subsequently use
JavaRDD<TestTable> rdd = sc.parallelize(list);
javaFunctions(rdd, TestTable.class).saveToCassandra("testkeyspace", "test_table");
to save the RDD data into Cassandra.
I have tried coding it this way:
messages.foreachRDD(new Function<Tuple2<String,String>, String>()
{
public List<TestTable> call(Tuple2<String,String> message)
{
String k = message._1();
String v = message._2();
TestTable tbl = new TestTable(k,v);
list.put(tbl);
}
}
);
but it seems some type mismatch is happening.
Please help.
Assuming that the intention of this program is to save the streaming data from Kafka into Cassandra, it is not necessary to dump the JavaDStream<Tuple2<String,String>> data into a List<TestTable> list.
The Spark-Cassandra connector by DataStax supports this functionality directly through the Spark Streaming extensions.
It should be sufficient to use such extensions on the JavaDStream:
javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra();
instead of draining the data into an intermediate list.
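For completeness, a slightly fuller sketch with the static imports the connector's Java API typically needs (package names per the DataStax spark-cassandra-connector japi; treat the details as assumptions and check them against your connector version):

import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// Map the (key, value) tuples onto the TestTable bean first, so mapToRow can derive the
// columns from its getters; the bean's property names must line up with the table's column names.
JavaDStream<TestTable> rows = data.map(tuple -> new TestTable(tuple._1(), tuple._2()));

javaFunctions(rows)
    .writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class))
    .saveToCassandra();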
