"Unable to find encoder for type stored in a Dataset" error in spite of providing the proper implicits [duplicate] - apache-spark

This question already has answers here:
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
(3 answers)
Closed 4 years ago.
I was testing some basic Spark code, converting a DataFrame to a Dataset by reading from a data source.
import org.apache.spark.sql.SparkSession

object RunnerTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkSessionExample")
      .master("local[4]")
      .config("spark.sql.warehouse.dir", "target/spark-warehouse")
      .getOrCreate

    case class Characters(name: String, id: Int)

    import spark.implicits._
    val path = "examples/src/main/resources/Characters.csv"
    val peopleDS = spark.read.csv(path).as[Characters]
  }
}
This is fairly simple code, yet I am getting a compilation error:
Error:(42, 43) Unable to find encoder for type Characters. An implicit
Encoder[Characters] is needed to store Characters instances in a
Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for
serializing other types will be added in future releases.
val peopleDS = spark.read.csv(path).as[Characters]
I am using Spark 2.4 and Scala 2.12.8, though.

Actually, the problem here was that the case class was defined inside the main object; for some reason Spark doesn't like that. It was a silly mistake, but it took a while to figure out what was missing. Once I moved the case class out of the object, it compiled fine.
import org.apache.spark.sql.SparkSession

case class Characters(name: String, id: Int)

object RunnerTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkSessionExample")
      .master("local[4]")
      .config("spark.sql.warehouse.dir", "target/spark-warehouse")
      .getOrCreate
    import spark.implicits._
    val path = "examples/src/main/resources/Characters.csv"
    val peopleDS = spark.read.csv(path).as[Characters]
  }
}
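As a side note not in the original answer: spark.read.csv with no options yields string columns named _c0, _c1, so even with the encoder fixed, the .as[Characters] cast would still fail at runtime on column resolution. A minimal sketch, continuing the example above and assuming the CSV file has a header row:
import org.apache.spark.sql.Encoders

// derive the expected schema from the top-level case class defined above
val charactersSchema = Encoders.product[Characters].schema
val peopleDS = spark.read
  .option("header", "true")   // assumes the CSV has a header row with matching column names
  .schema(charactersSchema)   // enforce name: String, id: Int instead of all-string _c0, _c1
  .csv("examples/src/main/resources/Characters.csv")
  .as[Characters]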

Related

Kotlin with Spark: create a DataFrame from a POJO which has POJO classes within

I have Kotlin data classes as shown below:
data class Persona_Items(
    val key1: Int = 0,
    val key2: String = "Hello")

data class Persona(
    val persona_type: String,
    val created_using_algo: String,
    val version_algo: String,
    val createdAt: Long,
    val listPersonaItems: List<Persona_Items>)

data class PersonaMetaData(
    val user_id: Int,
    val persona_created: Boolean,
    val persona_createdAt: Long,
    val listPersona: List<Persona>)

fun main() {
    val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2 = "abcffffff"), Persona_Items(20, "rrr"))
    val persona1 = Persona("HelloWorld", "tttAlgo", "1.0", 10L, personalItemList1)
    val persona2 = Persona("HelloWorld", "qqqqAlgo", "1.0", 10L, personalItemList2)
    val personMetaData = PersonaMetaData(884, true, 1L, listOf(persona1, persona2))
    val spark = SparkSession
        .builder()
        .master("local[2]")
        .config("spark.driver.host", "127.0.0.1")
        .appName("Simple Application").orCreate
    val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
    val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
    df.show(false)
}
When I try to create the DataFrame I get the error below.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a DataFrame from a list of data classes is not supported? Please help me understand what is missing in the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
    val ds = dsOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    // rest of the logic here
}
The thing is, Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take extra steps to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
The error means that Spark doesn't know how to serialize the Persona class.
Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration if you see something Scala-specific in a Kotlin project. You can check my build.gradle inside this project, or read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Convert String expression to actual working instance expression

I am trying to convert a Scala expression that is saved in a database as a String back into working code.
I have tried the reflection Toolbox, Groovy, etc., but I can't seem to achieve what I need.
Here's what I tried:
import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val toolbox = currentMirror.mkToolBox()
val code1 = q"""StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true))"""
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
where I need to use the sType instance to pass a custom schema when creating a DataFrame from a CSV file, but it seems to fail.
Is there any way to convert the string representation of the StructType into an actual StructType instance? Any help would be appreciated.
If the StructType is from Spark and you just want to convert a String to a StructType, you don't need reflection. You can try this:
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}
import scala.util.Try
def fromString(raw: String): StructType =
  Try(DataType.fromJson(raw)).getOrElse(LegacyTypeStringParser.parse(raw)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing: $raw")
  }
val code1 =
"""StructType(Array(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true)))"""
fromString(code1) // res0: org.apache.spark.sql.types.StructType
The code is taken from the org.apache.spark.sql.types.StructType companion object in Spark. You cannot use that method directly, as it lives in a private package. Moreover, it relies on LegacyTypeStringParser, so I'm not sure it is good enough for production code.
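If you control the format of the stored string, an arguably simpler alternative (my suggestion, assuming Spark 2.3+) is to store the schema as a DDL string and parse it with the public StructType.fromDDL helper:
import org.apache.spark.sql.types.StructType

val ddl = "id INT, name STRING, tstamp TIMESTAMP, date DATE"
val sType: StructType = StructType.fromDDL(ddl)
// sType can be passed straight to spark.read.schema(sType).csv(path) as the custom schema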
The code inside quasiquotes needs to be valid Scala syntax, so you need to put the strings in quotes. You also need to provide all the necessary imports. This works:
val toolbox = currentMirror.mkToolBox()
val code1 =
q"""
//we need to import all sql types
import org.apache.spark.sql.types._
StructType(
//StructType needs list
List(
//name arguments need to be in proper quotes
StructField("id",IntegerType,true),
StructField("name",StringType,true),
StructField("tstamp",TimestampType,true),
StructField("date",DateType,true)
)
)
"""
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
println(sType)
But maybe instead of trying to recompile the code, you should consider alternatives, such as serializing the struct type in some other form (perhaps as JSON?).
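For completeness, a minimal sketch of that JSON route (not part of the original answer): build the StructType once, persist its JSON form, and rebuild it later without any recompilation.
import org.apache.spark.sql.types._

val original = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("tstamp", TimestampType, nullable = true),
  StructField("date", DateType, nullable = true)))

val stored: String = original.json                     // store this string in the database
val restored = DataType.fromJson(stored).asInstanceOf[StructType]
println(restored.simpleString)                         // struct<id:int,name:string,tstamp:timestamp,date:date>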

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:
case class Person(name: String, age: Long)

val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream = ssc.textFileStream("file:///home/sa/testFiles")

smDstream.foreachRDD((rdd, time) => {
  val peopleDF = rdd.map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql(
    "insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show
})
ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
Please help.
The question does not really make sense anymore, in that DStreams are deprecated / abandoned.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to ponder it as well, as I am not a serialization expert.
You can find a few posts of people trying to write to a Hive table directly as opposed to a path; in my answer I write to a path, but you can use your Spark SQL approach of writing via a TempView, that is all possible.
I simulated input from a QueueStream, so no split needs to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your tempView and then use spark.sql as per your initial approach, as in the sketch below.
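A minimal sketch of that tempView + spark.sql variant (my own addition, not the original answer's code). It assumes the Person case class is defined outside the streaming class, that the Hive table devDB.stam already exists, and it reuses the question's smDstream; getting the SparkSession from the RDD inside foreachRDD is the usual way to avoid dragging non-serializable outer references into the closure.
import org.apache.spark.sql.SparkSession

smDstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    // obtain (or reuse) the session on the driver from the RDD's configuration
    val spark = SparkSession.builder
      .config(rdd.sparkContext.getConf)
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    val peopleDF = rdd
      .map(_.split(","))
      .map(a => Person(a(0), a(1).trim.toLong))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    spark.sql("insert into table devDB.stam select name, age from people where age between 13 and 29")
  }
}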
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to
each RDD generated from the stream. This function should push the data
in each RDD to an external system, such as saving the RDD to files, or
writing it over the network to a database. Note that the function func
is executed in the driver process running the streaming application,
and will usually have RDD actions in it that will force the
computation of the streaming RDDs.
It mentions saving to files, but foreachRDD can do what you want, albeit I assume the idea was external systems. Saving to files is quicker
in my view than going through the steps to write a table directly. With streaming you want to offload data as soon as possible, as volumes are typically high.
Two steps:
In a separate class from the streaming class (run under Spark 2.4):
case class Person(name: String, age: Int)
Then the streaming logic you need to apply. You may need some imports that I already have in my notebook, as I ran this under Databricks:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode

val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
import spark.implicits._ // needed for toDF on the RDD of Person below

val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)

QS.foreachRDD(q => {
  if (!q.isEmpty) {
    val q_flatMap = q.flatMap{ x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.write
      .format("parquet")
      .mode(SaveMode.Append)
      .saveAsTable("SO_Quest_BigD")
  }
})

ssc.start()
for (c <- List(List(("Fred", 53), ("John", 22), ("Mary", 76)),
               List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
               List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()

Case Class within foreachRDD causes Serialization Error

I can create a DF inside foreachRDD if I do not try to use a case class and simply let default column names be made with toDF(), or if I assign them via toDF("c1", "c2").
As soon as I try to use a case class, having looked at the examples, I get:
Task not serializable
If I shift the case class statement around, I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious about the nth serialization error that Spark can produce and whether it carries over into Structured Streaming.
I have an RDD that need not be split; maybe that is the issue? No. Is it that I am running in Databricks?
The code is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required

val spark = SparkSession.builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)

QS.foreachRDD(q => {
  if (!q.isEmpty) {
    import spark.implicits._
    val q_flatMap = q.flatMap{ x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.show(false)
  }
})

ssc.start()
for (c <- List(List(("Fred", 53), ("John", 22), ("Mary", 76)),
               List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
               List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Not having grown up with Java, but having looked around, I found out what to do, though I am not expert enough to explain it.
I was running in a Databricks notebook, where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same Databricks notebook. One needs to define the case class externally to the current notebook (in a separate notebook), and thus separately from the class running the streaming.
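A minimal sketch of the layout the answer describes, with hypothetical names; the only point is that the case class is compiled separately from (and before) the streaming code.
// Person.scala (or a separate notebook, run before the streaming one):
case class Person(name: String, age: Int)

// Streaming.scala / streaming notebook:
// the streaming code from the question, unchanged, now sees Person as an external class:
// QS.foreachRDD { q => ... q.map(field => Person(field._1, field._2)).toDF() ... }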

Spark Avro write RDD to multiple directories by key

I need to split an RDD by first letters (A-Z) and write the files into directories respectively.
The simple solution is to filter the RDD for each letter, but this requires 26 passes.
There is a response to a similar question for writing to text files here, but I cannot figure out how to do this for Avro files.
Has anyone been able to do this?
You can use MultipleOutputFormat to do this.
It is a two-step task:
First you need a multiple output format for Avro. Below is the code for that:
package avro

import org.apache.hadoop.mapred.lib.MultipleOutputFormat
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.util.Progressable
import org.apache.avro.mapred.AvroOutputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD
import org.apache.hadoop.mapred.RecordWriter

class MultipleAvroFileOutputFormat[K] extends MultipleOutputFormat[AvroWrapper[K], NullWritable] {
  val outputFormat = new AvroOutputFormat[K]

  // Route each record to a subdirectory named after the first letter of its key
  override def generateFileNameForKeyValue(key: AvroWrapper[K], value: NullWritable, name: String) = {
    val firstLetter = key.datum().asInstanceOf[String].substring(0, 1)
    firstLetter + "/" + firstLetter
  }

  override def getBaseRecordWriter(fs: FileSystem,
                                   job: JobConf,
                                   name: String,
                                   arg3: Progressable) = {
    outputFormat.getRecordWriter(fs, job, name, arg3).asInstanceOf[RecordWriter[AvroWrapper[K], NullWritable]]
  }
}
In your driver code you have to specify that you want to use the above output format. You also need to specify the output schema for the Avro data. Below is sample driver code which stores an RDD of strings in Avro format with schema {"type":"string"}:
package avro

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.JobConf
import org.apache.avro.mapred.AvroJob
import org.apache.avro.mapred.AvroWrapper

object AvroDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf
    conf.setAppName(args(0))
    conf.setMaster("local[2]")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(classOf[AvroWrapper[String]]))
    val sc = new SparkContext(conf)
    val input = sc.parallelize(Seq("one", "two", "three", "four"), 1)
    val pairRDD = input.map(x => (new AvroWrapper(x), null))
    val job = new JobConf(sc.hadoopConfiguration)
    val schema = "{\"type\":\"string\"}"
    job.set(AvroJob.OUTPUT_SCHEMA, schema) // set schema for avro output
    pairRDD.partitionBy(new HashPartitioner(26)).saveAsHadoopFile(args(1), classOf[AvroWrapper[String]], classOf[NullWritable], classOf[MultipleAvroFileOutputFormat[String]], job, None)
    sc.stop()
  }
}
I hope you get a better answer than mine...
I've been in a similar situation myself, except with "ORC" instead of Avro. I basically threw up my hands and ended up calling the ORC file classes directly to write the files myself.
In your case, my approach would be to partition the data via partitionBy into 26 partitions, one for each first letter A-Z. Then call mapPartitionsWithIndex, passing a function that writes the i-th partition to an Avro file at the appropriate path. Finally, to convince Spark to actually do something, have mapPartitionsWithIndex return, say, a List containing the single boolean value true, and then call count on the RDD returned by mapPartitionsWithIndex to get Spark to start the show.
I found an example of writing an Avro file here: http://www.myhadoopexamples.com/2015/06/19/merging-small-files-into-avro-file-2/
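A rough sketch of that approach (my own, under stated assumptions: an RDD[String] whose records start with A-Z, an output path visible to the executors as in a local[*] run, and the plain Avro Java API from the linked example):
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.GenericDatumWriter
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Route each record to the partition for its first letter (A=0 ... Z=25); assumes A-Z keys only.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 26
  override def getPartition(key: Any): Int = key.toString.charAt(0).toUpper - 'A'
}

def writeByFirstLetter(input: RDD[String], baseDir: String): Unit = {
  val schemaJson = "{\"type\":\"string\"}"
  input
    .keyBy(_.substring(0, 1).toUpperCase)
    .partitionBy(new FirstLetterPartitioner)
    .mapPartitionsWithIndex { (idx, iter) =>
      val schema = new Schema.Parser().parse(schemaJson) // parse on the executor; Schema is not serializable
      val letter = ('A' + idx).toChar
      val records = iter.toList
      if (records.nonEmpty) {
        val dir = new File(s"$baseDir/$letter")
        dir.mkdirs()
        val writer = new DataFileWriter[AnyRef](new GenericDatumWriter[AnyRef](schema))
        writer.create(schema, new File(dir, s"part-$idx.avro"))
        records.foreach { case (_, value) => writer.append(value) }
        writer.close()
      }
      Iterator(true)                                     // something for count() to consume
    }
    .count()                                             // the action that forces the writes to happen
}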
