Get the key of a JavaPairRDD - apache-spark

I have a JavaPairRDD<String, Iterable<Tuple2<String, String>>>
I printed it in a file and the content is
(ABC,[(ABC,1)])
(BBC,[(BBC,1)])
(CBD,[(CBD,1)])
(BBD,[(BBD,1)])
(ACD,[(ACD,1)])
Now I want to take only the strings ABC, BBC, CBD, BBD, ACD into a JavaRDD and print them to a file.
So far I am only able to print them to the console using foreach:
pairRdd.foreach(new VoidFunction<Tuple2<String, Iterable<Tuple2<String, String>>>>() {
    @Override
    public void call(Tuple2<String, Iterable<Tuple2<String, String>>> t) throws Exception {
        System.out.println(t._1);
    }
});
I want to do the same in a file. I am new to Spark and so don't know how I could achieve this. Any help would be much appreciated. Thanks in advance.

Please, try:
pairRdd.keys().coalesce(1).saveAsTextFile("some_path");
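That one-liner works because `keys()` simply drops the value side of each pair. As a rough plain-Java illustration of the same transformation (no Spark involved; `KeysSketch` and `extractKeys` are hypothetical names chosen for this sketch):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class KeysSketch {
    // Mirrors what JavaPairRDD.keys() does: keep only the first element of each pair
    static List<String> extractKeys(List<Map.Entry<String, List<String>>> pairs) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, List<String>> p : pairs) {
            keys.add(p.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, List<String>>> pairs = Arrays.asList(
                new SimpleEntry<>("ABC", Arrays.asList("(ABC,1)")),
                new SimpleEntry<>("BBC", Arrays.asList("(BBC,1)")));
        System.out.println(extractKeys(pairs)); // [ABC, BBC]
    }
}
```

The `coalesce(1)` then pulls everything into a single partition so a single part file is written; that is fine for small outputs but defeats parallelism for large ones.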

Related

How to save Sparks MatrixFactorizationModel recommendProductsForUsers to Hbase

I am new to Spark and I want to save the output of recommendProductsForUsers to an HBase table. I found an example (https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/) showing how to use JavaPairRDD and saveAsNewAPIHadoopDataset to save.
How can I convert JavaRDD<Tuple2<Object, Rating[]>> to JavaPairRDD<ImmutableBytesWritable, Put> so that I can use saveAsNewAPIHadoopDataset?
//Loads the data from hdfs
MatrixFactorizationModel sameModel = MatrixFactorizationModel.load(jsc.sc(), trainedDataPath);
//Get recommendations for all users
JavaRDD<Tuple2<Object, Rating[]>> ratings3 = sameModel.recommendProductsForUsers(noOfProductsToReturn).toJavaRDD();
By using mapToPair. Here is the example from the same source you provided (I changed the types by hand):
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = javaRDD.mapToPair(
    new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> row) throws Exception {
            // The tuple's first element (the user id) becomes the row key
            Put put = new Put(Bytes.toBytes(row._1().toString()));
            // Tuple2 has no getString(); pull the values from the Rating array instead
            put.add(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier1"), Bytes.toBytes(row._2()[0].product()));
            put.add(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier2"), Bytes.toBytes(row._2()[0].rating()));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
It goes like this: you create a new instance of Put, supplying it with the row key in the constructor, then for each column you call add, and finally you return the Put you created.
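A small aside on the byte conversions above: to my understanding, HBase's `Bytes.toBytes(String)` is a plain UTF-8 encode, so the row-key conversion can be sketched with the JDK alone (`BytesSketch` is a hypothetical stand-in for illustration, not the HBase class):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesSketch {
    // What Bytes.toBytes(String) does for the row key: UTF-8 encode the string
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] rowKey = toBytes("user42"); // would become the HBase row key
        System.out.println(Arrays.toString(rowKey));
    }
}
```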
This is how I solved the above problem; hope this will be helpful to someone.
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts1 = ratings3
    .mapToPair(new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> arg0)
                throws Exception {
            Rating[] userAndProducts = arg0._2;
            System.out.println("***********" + userAndProducts.length + "**************");
            // Use the user id (the tuple's first element) as the row key
            Put put = new Put(Bytes.toBytes(arg0._1.toString()));
            String recommendedProduct = "";
            for (Rating r : userAndProducts) {
                // Some logic here to convert Ratings into the appropriate put command
                // recommendedProduct = r.product;
            }
            put.addColumn(Bytes.toBytes("recommendation"), Bytes.toBytes("product"),
                    Bytes.toBytes(recommendedProduct));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
System.out.println("*********** Number of records in JavaPairRdd: "+ hbasePuts1.count() +"**************");
hbasePuts1.saveAsNewAPIHadoopDataset(newApiJobConfig.getConfiguration());
jsc.stop();
We just open sourced Splice Machine, and we have examples integrating MLlib with querying and storage in Splice Machine. I do not know if this will help, but I thought I would let you know.
http://community.splicemachine.com/use-spark-libraries-splice-machine/
Thanks for the post, very cool.

Spark Streaming: Average of all the time

I wrote a Spark Streaming application which receives temperature values and calculates the average temperature of all time. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key). For state tracking I use the StatCounter class, which maintains running statistics over the temperature values, and I recompute the average for each batch by calling StatCounter.mean. Here is my program:
EDITED MY WHOLE CODE: NOW USING StatCounter
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));
streamingContext.checkpoint("hdfs://server:8020/spark-history/checkpointing");
JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);

JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
    public SensorData call(String json) throws Exception {
        ObjectMapper om = new ObjectMapper();
        return om.readValue(json, SensorData.class);
    }
});

JavaPairDStream<String, Float> temperatureDStream = sensorDStream.mapToPair(new PairFunction<SensorData, String, Float>() {
    public Tuple2<String, Float> call(SensorData sensorData) throws Exception {
        return new Tuple2<String, Float>(sensorData.getIdSensor(), sensorData.getValTemp());
    }
});

JavaPairDStream<String, StatCounter> statCounterDStream = temperatureDStream.updateStateByKey(new Function2<List<Float>, Optional<StatCounter>, Optional<StatCounter>>() {
    public Optional<StatCounter> call(List<Float> newTemperatures, Optional<StatCounter> statsYet) throws Exception {
        StatCounter stats = statsYet.or(new StatCounter());
        for (float temp : newTemperatures) {
            stats.merge(temp);
        }
        return Optional.of(stats);
    }
});

JavaPairDStream<String, Double> avgTemperatureDStream = statCounterDStream.mapToPair(new PairFunction<Tuple2<String, StatCounter>, String, Double>() {
    public Tuple2<String, Double> call(Tuple2<String, StatCounter> statCounterTuple) throws Exception {
        String key = statCounterTuple._1();
        double avgValue = statCounterTuple._2().mean();
        return new Tuple2<String, Double>(key, avgValue);
    }
});

avgTemperatureDStream.print();
This seems to work fine. But now to the question:
I just found an example online which also shows how to calculate a average of all time here: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
They use AtomicLongs etc. for storing the "stateful values" and update them in a foreachRDD method.
My question now is: What is the better solution for a stateful calculation of all time in Spark Streaming? Are there any advantages / disadvantages of using one or the other way? Thank you!
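For reference, StatCounter does not keep every value; it folds each new observation into a running count and mean. A minimal plain-Java sketch of that incremental update (`RunningMean` is a hypothetical class for illustration, not Spark's StatCounter, which additionally tracks variance, min, and max):

```java
public class RunningMean {
    private long n = 0;
    private double mu = 0.0;

    // Fold one new observation into the running mean without storing past values
    void merge(double x) {
        n++;
        mu += (x - mu) / n;
    }

    double mean() {
        return mu;
    }

    public static void main(String[] args) {
        RunningMean m = new RunningMean();
        for (double temp : new double[] {20.0, 22.0, 24.0}) {
            m.merge(temp);
        }
        System.out.println(m.mean()); // 22.0
    }
}
```

Because the state stays O(1) per key, this style of update scales better than accumulating all readings and recomputing the average from scratch.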

How to output the content of JavaMapwithStateDstream to the textFile?

Hi all. I have two questions about my Spark Streaming application.
The first is how to output a JavaMapWithStateDStream's content to a text file. I went through the API document and found that it implements the DStream-like interface, so I used the following code to try to output the content:
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> call(String word, Optional<Integer> one,
                State<Integer> state) {
            int sum = one.or(0) + (state.exists() ? state.get() : 0);
            Tuple2<String, Integer> output = new Tuple2<>(word, sum);
            state.update(sum);
            return output;
        }
    };

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    adCounts.mapWithState(StateSpec.function(mappingFunc));
stateDstream.print();

stateDstream.foreachRDD(new Function<JavaRDD<Tuple2<String, Integer>>, Void>() {
    @Override
    public Void call(JavaRDD<Tuple2<String, Integer>> rdd) throws Exception {
        rdd.saveAsTextFile("/path/to/hdfs");
        return null;
    }
});
However, nothing is output to the HDFS path, although I can see the printed results on the console.
Please tell me what is wrong. How can I output the content of the JavaMapWithStateDStream?
Second question:
I want to update the real-time result every batch interval, even when no new data flows in. How can I implement that?
Thanks.
I found out the reason why the JavaMapWithStateDStream can print something but not save to a text file: since it is updated/initialized every batch interval, newly arrived data is overwritten by the next interval's initialization, so nothing gets saved to the text file.
The workaround is to declare a new variable that copies the value of stateDstream.
I use DStream here; I think JavaPairDStream should also work.
DStream<Tuple2<String, Integer>> fin_Counts = stateDstream.dstream();
fin_Counts.print();
fin_Counts can be updated and saved.
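One detail worth knowing when saving DStreams: `saveAsTextFiles(prefix, suffix)` writes one output directory per batch, named from the batch time, so successive batches do not overwrite each other. A sketch of that naming scheme (`BatchPath` and `outputPath` are hypothetical helpers; the `prefix-<timeInMs>.suffix` pattern is how I understand the generated names):

```java
public class BatchPath {
    // Mirrors the per-batch directory name DStream.saveAsTextFiles(prefix, suffix) generates
    static String outputPath(String prefix, long batchTimeMs, String suffix) {
        return prefix + "-" + batchTimeMs + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(outputPath("/path/to/hdfs/counts", 1000L, "txt"));
        // /path/to/hdfs/counts-1000.txt
    }
}
```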

wordCounts.dstream().saveAsTextFiles("LOCAL FILE SYSTEM PATH", "txt"); does not write to file

I am trying to write a JavaPairDStream to a file on the local system. Code below:
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
    new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
        }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
        }
    });
wordCounts.dstream().saveAsTextFiles("/home/laxmikant/Desktop/teppppp", "txt");
I am trying to save the logs or the word counts to a file, but it is not able to save to a local file (NOT HDFS).
I also tried to save on HDFS using
saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt")
The above line does not write to file. Can anybody tell the solution?
Various solutions on Stack Overflow don't work.
Try to write output as an absolute path:
saveAsTextFiles("file:///home/laxmikant/Desktop/teppppp", "txt");
rdd.saveAsTextFile("C:/Users/testUser/file.txt")
will not save the data into the file.txt file; it will throw a FileAlreadyExistsException, because this method creates its own output files and saves the RDD into them.
Try the following to save the RDD to a file:
rdd.saveAsTextFile("C:/Users/testUser");
It will create files under the testUser folder and save the RDD into them.
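For context on the FileAlreadyExists behaviour: `saveAsTextFile` writes a directory of part files and refuses to reuse a path that already exists. The fail-fast behaviour is analogous to creating the same directory twice with the JDK (`OutputDirSketch` is a hypothetical sketch, not Spark code):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputDirSketch {
    // Analogue of Spark's output check: a second save to the same path fails fast
    static boolean tryCreate(Path dir) {
        try {
            Files.createDirectory(dir);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static boolean[] demo() {
        try {
            Path out = Files.createTempDirectory("spark-out").resolve("result");
            return new boolean[] { tryCreate(out), tryCreate(out) };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0]); // true: first save succeeds
        System.out.println(r[1]); // false: the path already exists
    }
}
```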
The syntax seems to be correct:
saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt");
but the full syntax is
wordCounts.saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt"); // no dstream()
My guess is that the data is somewhere stuck in some system buffer and is not getting written. If you try to stream a lot more data using "nc" then you may see a file with data being created. This is what happened in my case.

How to store and read data from Spark PairRDD

A Spark pair RDD has the option to save to a file.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<String, Integer> myPairRDD =
    baseRDD.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String input) throws Exception {
            return new Tuple2<String, Integer>(input, input.length());
        }
    });
myPairRDD.saveAsTextFile("path");
The Spark context's textFile reads the data back only as a JavaRDD<String>.
How can I reconstruct the JavaPairRDD directly from the source?
Note:
A possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it.
But with huge data it takes a considerable amount of resources.
Storing this intermediate file in non-text format is also fine.
Execution environment - JRE 1.7
You can save them as an object file if you don't mind the result not being human-readable.
save file:
myPairRDD.saveAsObjectFile(path);
and then you can read pairs like this:
JavaPairRDD.fromJavaRDD(sc.objectFile(path))
EDIT:
working example:
JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD =
        JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
Storing the Spark PairRDD in Sequence file works well in this scenario.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {
    @Override
    public Tuple2<Text, IntWritable> call(String input) throws Exception {
        return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
    }
});
myPairRDD.saveAsHadoopFile(path, Text.class, IntWritable.class,
        SequenceFileOutputFormat.class);

JavaPairRDD<Text, IntWritable> newbaseRDD =
        context.sequenceFile(path, Text.class, IntWritable.class);

// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
    @Override
    public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
        System.out.println(arg0);
    }
});
As suggested by user52045, following code works with Java 8.
myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, String> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);
Example using Scala:
Reading a text file & saving it in object file format:
val ordersRDD = sc.textFile("/home/cloudera/orders.txt");
ordersRDD.count();
ordersRDD.saveAsObjectFile("orders_save_obj");
Reading an object file & saving it in text file format:
val ordersRDD = sc.objectFile[String]("orders_save_obj");
ordersRDD.count();
ordersRDD.saveAsTextFile("orders_save_text");
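Under the hood, saveAsObjectFile stores Java-serialized objects, which is why the round trip preserves the pair types. The serialize/deserialize cycle can be sketched with plain JDK serialization (`ObjectFileSketch` is a hypothetical standalone example, not Spark code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.AbstractMap.SimpleEntry;

public class ObjectFileSketch {
    // saveAsObjectFile side: serialize a pair to bytes
    static byte[] write(SimpleEntry<String, Integer> pair) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(pair);
        }
        return buf.toByteArray();
    }

    // objectFile side: deserialize the bytes back into the typed pair
    @SuppressWarnings("unchecked")
    static SimpleEntry<String, Integer> read(byte[] bytes) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SimpleEntry<String, Integer>) in.readObject();
        }
    }

    static SimpleEntry<String, Integer> roundTrip(SimpleEntry<String, Integer> pair) {
        try {
            return read(write(pair));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SimpleEntry<String, Integer> pair = new SimpleEntry<>("This", 4);
        System.out.println(roundTrip(pair)); // This=4
    }
}
```

Note that this is also why object files are tied to the classes that wrote them: deserializing requires the same (serialization-compatible) classes on the reading side.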
