Access Broadcast Variables in Spark Java - apache-spark

I need to process Spark broadcast variables using the Java RDD API. This is the code I have tried so far:
This is only sample code to check whether it works or not; in my actual case I need to work with two CSV files.
SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);
Map<Integer,String> map = new HashMap<Integer,String>();
map.put(1, "aa");
map.put(2, "bb");
map.put(9, "ccc");
Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);
List<Integer> list = new ArrayList<Integer>();
list.add(1);
list.add(2);
list.add(9);
JavaRDD<Integer> listrdd = ctx.parallelize(list);
JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value());
System.out.println(mapr.collect());
and it prints output like this:
[{1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}]
and my requirement is:
[{aa, bb, ccc}]
Is it possible to get the output the way I require?

I used JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value().get(x));
instead of JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value());.
It's working now.
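For completeness, here is a minimal end-to-end sketch of that lookup approach (the class name is arbitrary; it reuses the same map and list as above). The map step looks each element up in the broadcast map, so the result contains only the matched values rather than the whole map:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastLookup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        Map<Integer, String> map = new HashMap<>();
        map.put(1, "aa");
        map.put(2, "bb");
        map.put(9, "ccc");

        // Ship the lookup map to the executors once instead of once per task.
        Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);

        JavaRDD<Integer> listrdd = ctx.parallelize(Arrays.asList(1, 2, 9));

        // Look each key up in the broadcast map; the result is the value, not the whole map.
        JavaRDD<String> mapr = listrdd.map(x -> broadcastVar.value().get(x));

        System.out.println(mapr.collect()); // prints [aa, bb, ccc]
        ctx.close();
    }
}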

Related

Spark Kotlin - create empty Dataset

I am playing around with Kotlin for Spark: https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/
and I am trying to create an empty Dataset based on a data class:
data class Company(val ticker:String)
val ds:Dataset<Company> = spark.createDataset() // <- don't know what to put in the brackets
Found out by myself:
val emptyList:List<Company> = emptyList()
var ds = emptyList.toDS(spark)
A much simpler way would be:
withSpark {
val ds = dsOf<Company>()
}
or, what will be introduced in version 1.2.0:
withSpark {
val ds = emptyDataset<Company>()
}

Spark ALS with strings labels - Conversion back to string

I have this code:
val userIndexer: StringIndexer = new StringIndexer()
.setInputCol("userKey")
.setOutputCol("user")
val userIndexerModel = userIndexer.fit(ratings)
val alsRatings = userIndexerModel.transform(ratings)
val matrixFactorizationModel = ALS.trainImplicit(alsRatings.rdd, rank = 10, iterations = 10)
val rec = matrixFactorizationModel.recommendProductsForUsers(20)
This gives me back recommendations with user ids. I want to have my user key strings back. What is the most efficient way to do it? Thanks.
PS: I certainly cannot understand why the ALS library developers don't accept string labels. It's extremely painful and expensive to deal with the conversions (string to int and then int to string) from the outside. I hope there is an issue or something for this in their backlog.
I generally run the StringIndexer and collect the labels in the driver, then
parallelize the labels together with their index. Instead of calling transform on the StringIndexer, I join the DataFrames to get the same result a StringIndexer would give.
val swidConverter = new StringIndexer()
.setInputCol("id")
.setOutputCol("idIndex").fit(df)
import spark.implicits._ // needed for toDF on the RDD
val idDf = spark.sparkContext.parallelize(
swidConverter.labels.zipWithIndex
).toDF("id", "idIndex").repartition(PARTITION_SIZE) // set the partition size depending on your data size.
// Joining the idDf(DataFrame) with the actual Data.
val indexedDF = df.join(idDf,idDf.col("id")===df.col("id")).select("idIndex","product_id","rating")
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("idIndex")
.setItemCol("product_id")
.setRatingCol("rating")
val model = als.fit(indexedDF)
val resultRaw = model.recommendForAllUsers(4)
// Joining the idDf(DataFrame) with the Result to get the original ID from the indexed Id.
val resultDf = resultRaw.join(idDf,resultRaw.col("idIndex")===idDf.col("idIndex")).select("id","recommendations")

Kotlin and Spark - SAM issues

Maybe I'm doing something that is not quite supported, but I really want to use Kotlin as I learn Apache Spark with this book
Here is the Scala code sample I'm trying to run. The flatMap() accepts a FlatMapFunction SAM type:
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
Here is my attempt to do this in Kotlin, but it has a compilation issue on the fourth line:
val conf = SparkConf().setMaster("local").setAppName("Line Counter")
val sc = SparkContext(conf)
val input = sc.textFile("C:\\spark_workspace\\myfile.txt",1)
val words = input.flatMap{ s:String -> s.split(" ") } //ERROR
When I hover over it I get a compile error (posted as a screenshot in the original question).
Am I doing anything unreasonable or unsupported? I don't see any suggestions to autocomplete with lambdas either :(
Despite the fact that the problem is solved, I would like to provide some information regarding the reasons for the compilation problem. In this example, input has the type RDD, whose flatMap() method accepts a lambda that should return TraversableOnce[U]. Since Scala has its own collections framework, Java collection types cannot be converted to TraversableOnce.
Moreover, I'm not so sure Scala Functions are really SAMs. As far as I can see from the screenshots, Kotlin doesn't offer to replace a Function instance with a lambda.
Ah, I figured it out. I knew there was a way since Spark supports both Java and Scala. The key to this particular problem was to use a JavaSparkContext instead of the Scala-based SparkContext.
For some reason Scala and Kotlin don't always get along with SAM conversions. But Java and Kotlin do...
fun main(args: Array<String>) {
val conf = SparkConf().setMaster("local").setAppName("Line Counter")
val sc = JavaSparkContext(conf)
val input = sc.textFile("C:\\spark_workspace\\myfile.txt",1)
val words = input.flatMap { it.split(" ") }
}
See my comment on @Michael's answer for my fix. However, can I recommend the open source Kotlin Spark API by JetBrains for future reference? It solves many lambda errors, especially with the Dataset API, but it can also make working with Spark from Kotlin generally easier:
withSpark(appName = "Line Counter", master = "local") {
val input = sc.textFile("C:\\spark_workspace\\myfile.txt", 1)
val words = input.flatMap { s: String -> s.split(" ").iterator() }
}

How to convert List to JavaRDD

We know that in Spark there is a method rdd.collect() which converts an RDD to a List:
List<String> f= rdd.collect();
String[] array = f.toArray(new String[f.size()]);
I am trying to do exactly the opposite in my project. I have an ArrayList of Strings which I want to convert to a JavaRDD. I have been looking for a solution for quite some time but have not found the answer. Can anybody please help me out here?
You're looking for JavaSparkContext.parallelize(List) and similar. This is just like in the Scala API.
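For example, a minimal sketch of that call (assuming an already created JavaSparkContext named sc):
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

List<String> data = Arrays.asList("abc", "def", "ghi");

// Distribute the local list across the cluster as an RDD.
JavaRDD<String> rdd = sc.parallelize(data);

System.out.println(rdd.count()); // prints 3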
Adding to Sean Owen's and the other solutions:
you can use JavaSparkContext#parallelizePairs for a List of Tuples:
List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
pairs.add(new Tuple2<>(0, 5));
pairs.add(new Tuple2<>(1, 3));
JavaSparkContext sc = new JavaSparkContext();
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(pairs);
There are two ways to convert a collection to an RDD:
1) sc.parallelize(collection)
2) sc.makeRDD(collection)
Both methods behave identically (note that makeRDD is only available on the Scala SparkContext), so we can use either of them.
If you are using a .scala file, or you don't want to or cannot use JavaSparkContext, then you could:
use SparkContext instead of JavaSparkContext
convert your Java List to a Scala List
use SparkContext's parallelize method
For example:
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[String]()
javaList.add("abc")
javaList.add("def")
sc.parallelize(javaList.asScala)
This will generate an RDD for you.
Alternatively, if what you ultimately need is a DataFrame rather than a plain JavaRDD, you can build one directly from a Java List of Rows:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("fieldx1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx3", DataTypes.LongType, true));
StructType schema = DataTypes.createStructType(fields);

List<Row> data = new ArrayList<>();
data.add(RowFactory.create("", "", 0L)); // fieldx3 is a LongType, so the third value must be a Long
Dataset<Row> rawDataSet = spark.createDataFrame(data, schema);
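If you then need a JavaRDD after all, the Dataset above can be converted back; a minimal sketch:
// Convert the Dataset<Row> back to a JavaRDD<Row>.
JavaRDD<Row> rowRdd = rawDataSet.javaRDD();
System.out.println(rowRdd.count());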

Apache Spark, NotSerializableException: org.apache.hadoop.io.Text

here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB works fine, but lineA throws: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I tried to use Kryo to solve my problem, but it seems nothing has changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.mapToPair(new PairFunction<Tuple2<Text, VideoRecording>, String, VideoRecording>() {
    @Override
    public Tuple2<String, VideoRecording> call(
            Tuple2<Text, VideoRecording> kv) throws Exception {
        // Necessary to copy the value because Hadoop reuses Writable objects
        VideoRecording vr = new VideoRecording(kv._2());
        return new Tuple2<>(kv._1().toString(), vr);
    }
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
When dealing with sequence files in Apache Spark, we have to follow these techniques:
-- Use the Java-equivalent data types in place of the Hadoop Writable types.
-- Spark automatically converts the Writables into their Java-equivalent types.
For example, say we have a sequence file "xyz" whose key type is Text and whose value type is LongWritable. When we use this file to create an RDD, we use their Java-equivalent data types, i.e. String and Long respectively:
val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
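For reference, here is a minimal sketch of the same idea using the Java API (the path, master, and app name are just placeholders): the Writables are converted, and thereby copied, into plain serializable Java types right after reading.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local", "SequenceFileExample");

// Read the raw Hadoop Writables from the sequence file.
JavaPairRDD<Text, LongWritable> raw =
    sc.sequenceFile("path/to/xyz", Text.class, LongWritable.class);

// Convert (and thereby copy) them to serializable Java types before caching or shuffling.
JavaPairRDD<String, Long> converted =
    raw.mapToPair(kv -> new Tuple2<>(kv._1().toString(), kv._2().get()));

System.out.println(converted.collect());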
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
// ... set master, appname, etc, then:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)
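As a side note, more recent Spark versions also let you register Kryo classes directly on the SparkConf without writing a custom KryoRegistrator. A minimal sketch from the Java API, using the class names from the question (BundleWritable is the question's own Writable class, and the app name is arbitrary):
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("Reconstruction") // arbitrary app name
    .setMaster("local")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class<?>[]{Text.class, BundleWritable.class});

JavaSparkContext sc = new JavaSparkContext(conf);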
