Kotlin with Spark: create DataFrame from a POJO which has POJO classes within - apache-spark

I have Kotlin data classes as shown below
data class Persona_Items(
    val key1: Int = 0,
    val key2: String = "Hello")

data class Persona(
    val persona_type: String,
    val created_using_algo: String,
    val version_algo: String,
    val createdAt: Long,
    val listPersonaItems: List<Persona_Items>)

data class PersonaMetaData(
    val user_id: Int,
    val persona_created: Boolean,
    val persona_createdAt: Long,
    val listPersona: List<Persona>)

fun main() {
    val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2 = "abcffffff"), Persona_Items(20, "rrr"))
    val persona1 = Persona("HelloWorld", "tttAlgo", "1.0", 10L, personalItemList1)
    val persona2 = Persona("HelloWorld", "qqqqAlgo", "1.0", 10L, personalItemList2)
    val personMetaData = PersonaMetaData(884, true, 1L, listOf(persona1, persona2))
    val spark = SparkSession
        .builder()
        .master("local[2]")
        .config("spark.driver.host", "127.0.0.1")
        .appName("Simple Application").orCreate
    val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
    val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
    df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a dataframe from a list of data classes is not supported? Please help me understand what is missing in the above code.

It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
    val ds = dsOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    // rest of the logic here
}
The thing is that Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take extra steps to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
The error means that Spark doesn't know how to serialize the Persona class.
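For comparison, here is a minimal Scala sketch of what that import provides (my illustration, not part of the original answer; Item is a hypothetical stand-in for Persona_Items): with the case class at the top level and spark.implicits._ in scope, the Encoder is derived implicitly.
import org.apache.spark.sql.SparkSession

// Top-level case class: spark.implicits._ can derive an Encoder for it.
case class Item(key1: Int = 0, key2: String = "Hello")

object EncoderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("EncoderDemo")
      .getOrCreate()
    import spark.implicits._ // supplies the implicit Encoder[Item] and the toDS syntax
    val ds = Seq(Item(1), Item(key2 = "abc"), Item(10, "rrr")).toDS()
    ds.show(false)
    spark.stop()
  }
}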

Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration if you see something Scala-specific in a Kotlin project. You can check my build.gradle inside this project, or you can read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Related

Unable to find encoder for type stored in a Dataset. error in spite of providing the proper implicits [duplicate]

This question already has answers here:
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
(3 answers)
Closed 4 years ago.
I was testing some basic Spark code wherein I was converting a dataframe to a dataset by reading from a data source.
import org.apache.spark.sql.SparkSession

object RunnerTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkSessionExample")
      .master("local[4]")
      .config("spark.sql.warehouse.dir", "target/spark-warehouse")
      .getOrCreate
    case class Characters(name: String, id: Int)
    import spark.implicits._
    val path = "examples/src/main/resources/Characters.csv"
    val peopleDS = spark.read.csv(path).as[Characters]
  }
}
This is very simple code, yet I am getting a compilation error saying:
Error:(42, 43) Unable to find encoder for type Characters. An implicit
Encoder[Characters] is needed to store Characters instances in a
Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for
serializing other types will be added in future releases.
val peopleDS = spark.read.csv(path).as[Characters]
I am using Spark 2.4 and Scala 2.12.8 though.
Actually the problem here was that the case class was inside the main object. For some reason Spark doesn't like that: a case class defined inside a method is a local class, and Scala cannot materialize the TypeTag that Spark's implicit Encoder derivation requires for it. It was a silly mistake, but it took a while to figure out what was missing. Once I moved the case class out of the object, it compiled fine.
import org.apache.spark.sql.SparkSession

case class Characters(name: String, id: Int)

object RunnerTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkSessionExample")
      .master("local[4]")
      .config("spark.sql.warehouse.dir", "target/spark-warehouse")
      .getOrCreate
    import spark.implicits._
    val path = "examples/src/main/resources/Characters.csv"
    val peopleDS = spark.read.csv(path).as[Characters]
  }
}

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:
case class Person(name: String, age: Long)

val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream = ssc.textFileStream("file:///home/sa/testFiles")

smDstream.foreachRDD((rdd, time) => {
  val peopleDF = rdd.map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql(
    "insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show
})
ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
please help
The question does not really make sense anymore, in that DStreams are being deprecated / abandoned.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to ponder as well, as I am no serialization expert.
You can find a few posts of people trying to write to a Hive table directly as opposed to a path; in my answer I use one approach, but you can use your approach of Spark SQL to write to a TempView - that is all possible.
I simulated input from a QueueStream, so I need no split to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your tempView and then use spark.sql as per your initial approach.
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to
each RDD generated from the stream. This function should push the data
in each RDD to an external system, such as saving the RDD to files, or
writing it over the network to a database. Note that the function func
is executed in the driver process running the streaming application,
and will usually have RDD actions in it that will force the
computation of the streaming RDDs.
It states saving to files, but it can do what you want via foreachRDD, albeit I assume the idea was to write to external systems. Saving to files is quicker in my view, as opposed to going through the steps of writing a table directly. You want to offload data ASAP with Streaming, as volumes are typically high.
Two steps:
In a separate class to the Streaming class - run under Spark 2.4:
case class Person(name: String, age: Int)
Then the Streaming logic you need to apply - you may need some imports that are already in scope in my notebook, as I ran this under Databricks:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode

val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
import spark.implicits._ // needed for toDF() when running outside a notebook

val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)

QS.foreachRDD(q => {
  if (!q.isEmpty) {
    val q_flatMap = q.flatMap { x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.write
      .format("parquet")
      .mode(SaveMode.Append)
      .saveAsTable("SO_Quest_BigD")
  }
})

ssc.start()
for (c <- List(List(("Fred", 53), ("John", 22), ("Mary", 76)),
               List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
               List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
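To verify the sink, the saved table can be read back in a separate cell, or after stopping the streaming context, since awaitTermination() blocks (the table name is the one used in the sketch above):
spark.table("SO_Quest_BigD").show(false)
println(spark.table("SO_Quest_BigD").count())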

Case Class within foreachRDD causes Serialization Error

I can create a DF inside foreachRDD if I do not try to use a case class and simply let default names for columns be made with toDF(), or if I assign them via toDF("c1", "c2").
As soon as I try to use a case class, having looked at the examples, I get:
Task not serializable
If I shift the Case Class statement around I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious about the nth serialization error that Spark can produce and whether it carries over into Structured Streaming.
I have an RDD that need not be split; maybe that is the issue? No. Running in Databricks?
Coding is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required

val spark = SparkSession.builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)

QS.foreachRDD(q => {
  if (!q.isEmpty) {
    import spark.implicits._
    val q_flatMap = q.flatMap { x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.show(false)
  }
})

ssc.start()
for (c <- List(List(("Fred", 53), ("John", 22), ("Mary", 76)),
               List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
               List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Having not grown up with Java, but having looked around, I found out what to do, though I am not expert enough to explain it.
I was running in a Databricks notebook where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same Databricks notebook. One needs to define the case class external to the current notebook - in a separate notebook - and thus separate from the class running the Streaming.
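A minimal sketch of the working layout, assuming a plain compiled Scala application rather than a notebook (names are illustrative): the case class sits at the top level of the file, outside any object or method, so nothing from an enclosing scope gets captured.
import org.apache.spark.sql.SparkSession

// Top level: Person carries no reference to an enclosing class or notebook wrapper.
case class Person(name: String, age: Int)

object CaseClassLayout {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("CaseClassLayout")
      .getOrCreate()
    import spark.implicits._
    // The implicit Encoder[Person] now resolves, and closures mentioning Person
    // no longer drag a non-serializable outer instance along with them.
    Seq(Person("Fred", 53), Person("Mary", 76)).toDF().show(false)
    spark.stop()
  }
}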

Kotlin and Spark - SAM issues

Maybe I'm doing something that is not quite supported, but I really want to use Kotlin as I learn Apache Spark with this book
Here is the Scala code sample I'm trying to run. The flatMap() accepts a FlatMapFunction SAM type:
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
Here is my attempt to do this in Kotlin. But it is having a compilation issue on the fourth line:
val conf = SparkConf().setMaster("local").setAppName("Line Counter")
val sc = SparkContext(conf)
val input = sc.textFile("C:\\spark_workspace\\myfile.txt",1)
val words = input.flatMap{ s:String -> s.split(" ") } //ERROR
When I hover over it I get this compile error:
Am I doing anything unreasonable or unsupported? I don't see any suggestions to autocomplete with lambdas either :(
Despite the fact the problem is solved, I would like to provide some information regarding the reasons for the compilation problem. In this example, input has the type RDD, whose flatMap() method accepts a lambda that should return TraversableOnce[U]. As Scala has its own collections framework, Java collection types cannot be converted to TraversableOnce.
Moreover, I'm not so sure Scala Functions are really SAMs. As far as I can see from the screenshots, Kotlin doesn't offer replacing a Function instance with a lambda.
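A small Scala sketch of the point (my illustration, not the answerer's code): RDD.flatMap as declared in Spark 2.x takes a function returning a Scala TraversableOnce[U], which is why the Scala lambda compiles while a Kotlin lambda returning a java.util.List cannot match.
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapSignatureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sig"))
    val input = sc.parallelize(Seq("hello world", "foo bar"))
    // Compiles: split returns Array[String], which Scala adapts to TraversableOnce.
    val words = input.flatMap(line => line.split(" "))
    // Would NOT compile: a java.util.List is not a TraversableOnce.
    // val bad = input.flatMap(line => java.util.Arrays.asList(line.split(" "): _*))
    println(words.collect().toList)
    sc.stop()
  }
}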
Ah, I figured it out. I knew there was a way since Spark supports both Java and Scala. The key to this particular problem was to use a JavaSparkContext instead of the Scala-based SparkContext.
For some reason Scala and Kotlin don't always get along with SAM conversions. But Java and Kotlin do...
fun main(args: Array<String>) {
    val conf = SparkConf().setMaster("local").setAppName("Line Counter")
    val sc = JavaSparkContext(conf)
    val input = sc.textFile("C:\\spark_workspace\\myfile.txt", 1)
    val words = input.flatMap { it.split(" ") }
}
See my comment at @Michael for my fix. However, can I recommend the open-source Kotlin Spark API by JetBrains for future reference? It solves many lambda errors, especially when using the Dataset API, but can also make working with Spark from Kotlin generally easier:
withSpark(appName = "Line Counter", master = "local") {
    val input = sc.textFile("C:\\spark_workspace\\myfile.txt", 1)
    val words = input.flatMap { s: String -> s.split(" ").iterator() }
}

apache spark, NotSerializableException: org.apache.hadoop.io.Text

Here is my code:
val bg = imageBundleRDD.first()    //bg: [Text, BundleWritable]
val res = imageBundleRDD.map(data => {
  val desBundle = colorToGray(bg._2)     //lineA: NotSerializableException: org.apache.hadoop.io.Text
  //val desBundle = colorToGray(data._2) //lineB: everything is ok
  (data._1, desBundle)
})
println(res.count)
lineB goes well, but lineA shows: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I tried to use Kryo to solve my problem, but it seems nothing changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Text])
    kryo.register(classOf[BundleWritable])
  }
}

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text, VideoRecording>, String, VideoRecording>() {
  @Override
  public Tuple2<String, VideoRecording> call(
      Tuple2<Text, VideoRecording> kv) throws Exception {
    // Necessary to copy value as Hadoop chooses to reuse objects
    VideoRecording vr = new VideoRecording(kv._2);
    return new Tuple2(kv._1.toString(), vr);
  }
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
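For reference, a Scala sketch of that precaution (the path and value type are assumptions): copy the reused Writables into plain, serializable values before caching or collecting.
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileCopyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("seqfile"))
    // Hadoop's RecordReader reuses the same Text instances for every record,
    // so materialize them into plain Strings before caching the RDD.
    val safe = sc.sequenceFile("path/to/xyz", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }
    safe.cache()
    println(safe.count())
    sc.stop()
  }
}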
In Apache Spark, while dealing with sequence files, we have to follow these techniques:
-- Use Java-equivalent data types in place of Hadoop data types.
-- Spark automatically converts the Writables into Java-equivalent types.
Ex: We have a sequence file "xyz", where the key type is, say, Text and the value is LongWritable. When we use this file to create an RDD, we need to use their Java-equivalent data types, i.e., String and Long respectively:
val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
  // ... set master, appname, etc, then:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)
