Cartesian product of two DStreams in Spark - apache-spark

How can I take the product of two DStreams in Spark Streaming, like cartesian(RDD<U>), which, when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)?
One solution is to use join as follows, but it doesn't seem like a good approach.
JavaPairDStream<Integer, String> xx = DStream_A.mapToPair(s -> {
    return new Tuple2<>(1, s);
});
JavaPairDStream<Integer, String> yy = DStream_B.mapToPair(e -> {
    return new Tuple2<>(1, e);
});
DStream_A_product_B = xx.join(yy);
Is there a better solution? Or how can I use the cartesian method of RDD?

I found the answer:
JavaPairDStream<String, String> cartes = DStream_A.transformWithToPair(DStream_B,
    new Function3<JavaRDD<String>, JavaRDD<String>, Time, JavaPairRDD<String, String>>() {
        @Override
        public JavaPairRDD<String, String> call(JavaRDD<String> rddA, JavaRDD<String> rddB, Time v3) throws Exception {
            JavaPairRDD<String, String> res = rddA.cartesian(rddB);
            return res;
        }
    });
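For reference, a more compact equivalent using a Java 8 lambda (just a sketch, assuming DStream_A and DStream_B are both JavaDStream<String>):
// Same per-batch cartesian idea, written as a lambda (illustrative sketch).
JavaPairDStream<String, String> cartes2 = DStream_A.transformWithToPair(DStream_B,
    (rddA, rddB, time) -> rddA.cartesian(rddB));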

Related

Spark Streaming: Average of all the time

I wrote a Spark Streaming application which receives temperature values and calculates the average temperature of all time. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key). For state tracking I use the StatCounter class, which accumulates all temperature values as doubles and recalculates the average for each batch by calling the StatCounter.mean method. Here is my program:
EDIT: my whole code now uses StatCounter.
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));
streamingContext.checkpoint("hdfs://server:8020/spark-history/checkpointing");
JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);

JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
    public SensorData call(String json) throws Exception {
        ObjectMapper om = new ObjectMapper();
        return (SensorData) om.readValue(json, SensorData.class);
    }
});

JavaPairDStream<String, Float> temperatureDStream = sensorDStream.mapToPair(new PairFunction<SensorData, String, Float>() {
    public Tuple2<String, Float> call(SensorData sensorData) throws Exception {
        return new Tuple2<String, Float>(sensorData.getIdSensor(), sensorData.getValTemp());
    }
});

JavaPairDStream<String, StatCounter> statCounterDStream = temperatureDStream.updateStateByKey(new Function2<List<Float>, Optional<StatCounter>, Optional<StatCounter>>() {
    public Optional<StatCounter> call(List<Float> newTemperatures, Optional<StatCounter> statsYet) throws Exception {
        StatCounter stats = statsYet.or(new StatCounter());
        for (float temp : newTemperatures) {
            stats.merge(temp);
        }
        return Optional.of(stats);
    }
});

JavaPairDStream<String, Double> avgTemperatureDStream = statCounterDStream.mapToPair(new PairFunction<Tuple2<String, StatCounter>, String, Double>() {
    public Tuple2<String, Double> call(Tuple2<String, StatCounter> statCounterTuple) throws Exception {
        String key = statCounterTuple._1();
        double avgValue = statCounterTuple._2().mean();
        return new Tuple2<String, Double>(key, avgValue);
    }
});

avgTemperatureDStream.print();
This seems to work fine. But now to the question:
I just found an example online which also shows how to calculate an average of all time here: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
They use AtomicLongs etc. for storing the "stateful values" and update them in a foreachRDD call.
My question now is: what is the better solution for an all-time stateful calculation in Spark Streaming? Are there any advantages or disadvantages to one approach over the other? Thank you!
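For comparison, here is a minimal sketch of the foreachRDD approach used in the linked example, assuming a JavaDStream<Float> of temperature values named temperatureValues and a Spark version where foreachRDD accepts a VoidFunction (the names and the stream are illustrative, not taken from either code base):
// Running totals kept on the driver, as in the linked reference application.
// AtomicDouble comes from Guava.
final AtomicLong runningCount = new AtomicLong(0);
final AtomicDouble runningSum = new AtomicDouble(0.0);

temperatureValues.foreachRDD(new VoidFunction<JavaRDD<Float>>() {
    public void call(JavaRDD<Float> rdd) throws Exception {
        long count = rdd.count();                 // action, executed from the driver
        if (count > 0) {
            float batchSum = rdd.reduce(new Function2<Float, Float, Float>() {
                public Float call(Float a, Float b) { return a + b; }
            });
            runningCount.addAndGet(count);
            runningSum.addAndGet(batchSum);
            System.out.println("All-time average: " + runningSum.get() / runningCount.get());
        }
    }
});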

Is it good practice to open several Kafka streams in one Spark Context?

We have several applications that follow the same logic and patterns, and we would like to know if it is good practice to open several streams in one Spark context. The main application to submit would then have something of this sort:
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("test-app");
conf.set("log4j.configuration", "\\log4j.properties");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(20));
// Iterate streams
for (RealtimeApplication app : realtimeApplications) {
    app.execute(ssc);
}
// Trigger!
ssc.start();
// Await stopping of the service...
ssc.awaitTermination();
Then, in the implementation of the abstract method execute(JavaStreamingContext ssc), you would have the following code...
JavaPairReceiverInputDStream<String, String> kafkaStream =
        KafkaUtils.createStream(ssc, this.getZkQuorum(), this.getSparkGroup(), topicsSet);
JavaDStream<String> lines = kafkaStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        // Extract the transaction
        String value = tuple2._2();
        // Do something here...
        String result = executeSomething(value);
        return result;
    }
});
Is this something to be considered wrong in Spark development?
I would rather share your logic across DStreams derived from a single stream, like:
JavaDStream<String> lines1 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
JavaDStream<String> lines2 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
JavaDStream<String> lines3 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
all with one source stream.
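A minimal sketch of that idea, assuming the execute method is changed to accept the shared stream rather than the streaming context (the signature change and variable names are illustrative, not from the original code):
// Create the Kafka receiver once, then let every application derive its own
// DStream from the same shared source stream.
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(20));
JavaPairReceiverInputDStream<String, String> kafkaStream =
        KafkaUtils.createStream(ssc, zkQuorum, sparkGroup, topicsSet);

for (RealtimeApplication app : realtimeApplications) {
    app.execute(kafkaStream);   // each app maps/filters the shared stream
}

ssc.start();
ssc.awaitTermination();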

What is the recommended way to get the top N (key, value) pairs of an RDD in the form of another RDD?

When we want the result back in the Spark context, the task can be done using takeOrdered. I extend this idea by calling the parallelize function to push the result back to the cluster, as shown below:
JavaPairDStream<Integer, String> filteredPair = pairDStream.transformToPair(
    new Function<JavaPairRDD<Integer, String>, JavaPairRDD<Integer, String>>() {
        public JavaPairRDD<Integer, String> call(JavaPairRDD<Integer, String> pairRDD) throws Exception {
            List<Tuple2<Integer, String>> topList = pairRDD.takeOrdered(10);
            return ssc.sparkContext().parallelizePairs(topList);
        }
    });
What bothers me is that I think only RDD transformation functions should be expected inside transformToPair. By mixing an action into the transformation, does it destroy the laziness of the transformation?

Measure duration of executing combineByKey function in Spark

I want to measure the time that the execution of the combineByKey function needs. With the code below I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning), independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true? Or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true; //or false
int partitionsCount = 100; // between 1 and 200 I can't see any difference in the duration!
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
    public Float call(Float sumYet, String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        float value = Float.valueOf(data[2]);
        sumYet += value;
        return sumYet;
    }
};

// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
    public Float call(Float a, Float b) throws Exception {
        a += b;
        return a;
    }
};
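The createCombiner referenced in the combineByKey call above is not shown in the question. A plausible sketch, consistent with mergeValue parsing the third CSV field as the consumption value (an assumption, not the original code), would be:
// Hypothetical createCombiner: turns a key's first record into the initial Float sum.
private static Function<String, Float> createCombiner = new Function<String, Float>() {
    public Float call(String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        return Float.valueOf(data[2]);
    }
};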
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
    if (partitioning) {
        return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
    } else {
        return pairRDD;
    }
}

private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
    return input.mapToPair(new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
            String[] data = debsDataSet.split(",");
            int houseId = Integer.valueOf(data[6]);
            return new Tuple2<Integer, String>(houseId, debsDataSet);
        }
    });
}
The web UI provides you with details on the jobs/stages that your application has run. It shows the time for each of them, and you can filter various metrics such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080. Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID] to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means it is not applied immediately to your RDD, as opposed to actions (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structures when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
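Following that reasoning, a minimal sketch of how to time the work that an action actually triggers (count is used here as an illustrative action; the question's takeSample would behave the same way):
// Time the action instead of the transformation: this is when combineByKey actually runs.
long start = System.currentTimeMillis();
long resultCount = consumptionRDD.count();   // forces evaluation of the whole lineage
long actionDuration = System.currentTimeMillis() - start;
System.out.println("count = " + resultCount + ", took " + actionDuration + " ms");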

How to store and read data from Spark PairRDD

Spark's PairRDD has the option to save to a file.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<String, Integer> myPairRDD =
    baseRDD.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String input) throws Exception {
            return new Tuple2<String, Integer>(input, input.length());
        }
    });
myPairRDD.saveAsTextFile("path");
SparkContext's textFile reads the data back into a JavaRDD<String> only.
How can the PairRDD be reconstructed directly from the saved data?
Note:
A possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it, but with huge data it takes a considerable amount of resources.
Storing this intermediate file in a non-text format is also fine.
Execution environment - JRE 1.7
You can save them as an object file if you don't mind the result file not being human-readable.
Save the file:
myPairRDD.saveAsObjectFile(path);
and then you can read pairs like this:
JavaPairRDD.fromJavaRDD(sc.objectFile(path))
EDIT:
working example:
JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD =
        JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
Storing the Spark PairRDD in a SequenceFile works well in this scenario.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {
    @Override
    public Tuple2<Text, IntWritable> call(String input) throws Exception {
        return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
    }
});
myPairRDD.saveAsHadoopFile(path, Text.class, IntWritable.class,
        SequenceFileOutputFormat.class);
JavaPairRDD<Text, IntWritable> newbaseRDD =
        context.sequenceFile(path, Text.class, IntWritable.class);
// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
    @Override
    public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
        System.out.println(arg0);
    }
});
As suggested by user52045, the following code works with Java 8.
myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, String> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);
Example using Scala:
Reading a text file & saving it in object file format:
val ordersRDD = sc.textFile("/home/cloudera/orders.txt");
ordersRDD.count();
ordersRDD.saveAsObjectFile("orders_save_obj");
Reading the object file & saving it in text file format:
val ordersRDD = sc.objectFile[String]("orders_save_obj");
ordersRDD.count();
ordersRDD.saveAsTextFile("orders_save_text");
