Output not in readable format in Spark 2.2.0 Dataset - apache-spark

Following is the code which I am trying to execute with Spark 2.2.0 in the IntelliJ IDE, but the output I am getting doesn't look readable.
import org.apache.spark.sql.{Encoder, SparkSession}
import scala.reflect.ClassTag

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example").master("local[2]")
  .getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Catch-all Kryo encoder for any class with a ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] =
  org.apache.spark.sql.Encoders.kryo[A](ct)

case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
Output shown :
+--------------------+
| value|
+--------------------+
|[01 00 44 61 74 6...|
+--------------------+
Can anyone explain if I am missing anything here?
Thanks

This is because you're using a Kryo encoder, which is not designed to deserialize objects for show.
In general you should never use a Kryo encoder when more precise encoders are available: it has poorer performance and fewer features. Use the Product encoder instead:
spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])
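With the catch-all Kryo implicit removed (or shadowed by an explicit Product encoder as above), the same show() call prints the case-class fields by name. A minimal sketch of what to expect:
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .master("local[2]")
  .getOrCreate()

case class Person(name: String, age: Long)

// Explicit Product encoder, so nothing falls back to Kryo
val ds = spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])
ds.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+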

Related

Kotlin with spark create dataframe from POJO which has pojo classes within

I have Kotlin data classes as shown below:
data class Persona_Items(
val key1:Int = 0,
val key2:String = "Hello")
data class Persona(
val persona_type: String,
val created_using_algo: String,
val version_algo: String,
val createdAt:Long,
val listPersonaItems:List<Persona_Items>)
data class PersonaMetaData
(val user_id: Int,
val persona_created: Boolean,
val persona_createdAt: Long,
val listPersona:List<Persona>)
fun main() {
val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2="abc"), Persona_Items(10,"rrr"))
val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2="abcffffff"),Persona_Items(20,"rrr"))
val persona1 = Persona("HelloWorld","tttAlgo","1.0",10L,personalItemList1)
val persona2 = Persona("HelloWorld","qqqqAlgo","1.0",10L,personalItemList2)
val personMetaData = PersonaMetaData(884,true,1L, listOf(persona1,persona2))
val spark = SparkSession
.builder()
.master("local[2]")
.config("spark.driver.host","127.0.0.1")
.appName("Simple Application").orCreate
val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a dataframe from a list of data classes is not supported? Please help me understand what is missing in the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
    val ds = dsOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    // rest of the logic here
}
The thing is that Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take an extra step to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
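For comparison, a minimal Scala sketch of what that import provides (PersonaItem here just mirrors the Kotlin data class above):
import org.apache.spark.sql.SparkSession

case class PersonaItem(key1: Int = 0, key2: String = "Hello")

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("implicits-demo")
  .getOrCreate()
import spark.implicits._  // brings Product encoders for case classes into scope

val ds = Seq(PersonaItem(1), PersonaItem(key2 = "abc"), PersonaItem(10, "rrr")).toDS()
ds.show()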
The error means that Spark doesn't know how to serialize the Persona class.
Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration if you see something Scala-specific in a Kotlin project. You can check my build.gradle inside this project, or read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Kryo encoder vs. RowEncoder in Spark Dataset

The purpose of the following examples is to understand the difference between the two encoders in Spark Dataset.
I can do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
val myStructType = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))
implicit val myRowEncoder = RowEncoder(myStructType)
val ds = df.map{case row => row}
ds.show
//+---+-----+
//| id|value|
//+---+-----+
//| 1| a|
//| 2| d|
//+---+-----+
I can also do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]
val ds = df.map{case row => row}
ds.show
//+--------------------+
//| value|
//+--------------------+
//|[01 00 6F 72 67 2...|
//|[01 00 6F 72 67 2...|
//+--------------------+
The only difference between the two snippets is that one uses the Kryo encoder and the other uses RowEncoder.
Questions:
What is the difference between the two?
Why does one display encoded values while the other displays human-readable values?
When should we use which?
Encoders.kryo simply creates an encoder that serializes objects of type T using Kryo
RowEncoder is an object in Scala with apply and other factory methods.
RowEncoder can create an ExpressionEncoder[Row] from a schema.
Internally, apply creates a BoundReference for the Row type and returns an ExpressionEncoder[Row] for the input schema, with a CreateNamedStruct serializer (built by the internal serializerFor method) and a deserializer for the schema and the Row type.
RowEncoder knows about schema and uses it for serialization and deserialization.
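A quick way to see the practical difference is to compare the schemas the two encoders produce for the same data; a minimal sketch:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))

// RowEncoder keeps the columnar structure
RowEncoder(schema).schema.printTreeString()
// root
//  |-- id: integer (nullable = true)
//  |-- value: string (nullable = true)

// Encoders.kryo collapses everything into a single binary column
Encoders.kryo[Row].schema.printTreeString()
// root
//  |-- value: binary (nullable = true)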
Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
Kryo is good for efficiently storing large datasets and for network-intensive applications.
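On the RDD/shuffle side, that registration happens through SparkConf; a small sketch (Person here is just a stand-in for your own classes):
import org.apache.spark.SparkConf

case class Person(name: String, age: Long)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register application classes up front for best performance
  .registerKryoClasses(Array(classOf[Person]))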
For more information you can refer to these links:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-RowEncoder.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Encoders.html
https://medium.com/@knoldus/kryo-serialization-in-spark-55b53667e7ab
https://stackoverflow.com/questions/58946987/what-are-the-pros-and-cons-of-java-serialization-vs-kryo-serialization#:~:text=Kryo%20is%20significantly%20faster%20and,in%20advance%20for%20best%20performance.
According to Spark's documentation, Spark SQL does NOT use Kryo or Java serialization (by default).
Kryo is for RDDs and not DataFrames or Datasets. Hence the question is a little off-beam, afaik.
Does Kryo help in SparkSQL?
This elaborates on custom objects, but...
UPDATED Answer after some free time
Your example was not really what I would call a custom type; those are just structs with primitives. No issue there.
Kryo is a serializer; Datasets and DataFrames use Encoders to get the columnar advantage.
Kryo is used internally by Spark for shuffling.
This user-defined example, case class Foo(name: String, position: Point), is one that we can handle with a DS or DF, or via Kryo. But then what is the point of Tungsten and Catalyst working with an "understanding of the structure of the data" and thus being able to optimize? You also get a single binary value with Kryo, and I have found few examples of how to work successfully with it, e.g. a JOIN.
KRYO Example
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._
case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)
implicit val PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+--------------------+
| value|
+--------------------+
|[01 00 D2 02 6C 6...|
+--------------------+
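Note that even with the Kryo encoder the typed API can still reach the fields, because map deserializes each element back to a Foo before your function runs; a small sketch continuing the example above:
// Typed operations deserialize first, so the original structure is still there
ds.map(foo => s"${foo.name} @ (${foo.position.a}, ${foo.position.b})")
  .collect()
  .foreach(println)
// bar @ (0, 0)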
Encoder for DS using case class Example
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._
case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)
val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+----+--------+
|name|position|
+----+--------+
| bar| [0, 0]|
+----+--------+
This strikes me as the way to go with Spark, Tungsten, Catalyst.
Now, things get more complicated when an Any is involved, but Any is not a good thing:
import org.apache.spark.sql.types.{MapType, StringType}

val data = Seq(
("sublime", Map(
"good_song" -> "santeria",
"bad_song" -> "doesn't exist")
),
("prince_royce", Map(
"good_song" -> 4,
"bad_song" -> "back it up")
)
)
val schema = List(
("name", StringType, true),
("songs", MapType(StringType, StringType, true), true)
)
val rdd= spark.sparkContext.parallelize(data)
rdd.collect
val df = spark.createDataFrame(rdd)
df.show()
df.printSchema()
returns:
java.lang.UnsupportedOperationException: No Encoder found for Any.
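One way around that particular failure is to normalize the heterogeneous map values to String before building the DataFrame, so a plain Product encoder applies; a minimal sketch reusing the data Seq above:
// Stringify the mixed Any values so the schema becomes Map[String, String]
val cleaned = data.map { case (name, songs) =>
  (name, songs.map { case (k, v) => (k, v.toString) })
}
val df = spark.createDataFrame(cleaned).toDF("name", "songs")
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- songs: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)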
Then this example is interesting, as it is a valid custom-object use case: Spark No Encoder found for java.io.Serializable in Map[String, java.io.Serializable]. But I would stay away from such things.
Conclusions
Kryo vs Encoder vs Java Serialization in Spark? states that Kryo is for RDDs, but that is for legacy reasons; internally one can still use it. Not 100% correct, but actually to the point.
Spark: Dataset Serialization is also an informative link.
Things have evolved, and the spirit is to not use Kryo for Datasets or DataFrames.
Hope this helps.
TL;DR: don't trust the show() method's output. The internal structure of your case class is not lost.
I was confused about how the show() method rendered the content of the tuples as one column named 'value' with binary content. I concluded (incorrectly) that the dataframe was just a one-column binary blob that no longer adhered to the structure of a Tuple2[Integer, String] with columns named id and value.
However, when I printed the actual content of the collected data frame, I saw the correct values, column names and types. So I think this is just an issue with the show() method.
The program below should serve to reproduce my results:
import java.util

import org.apache.spark.sql.SparkSession

object X extends App {
  val sparkSession = SparkSession.builder().appName("tests")
    .master("local")
    .getOrCreate()
  import sparkSession.implicits._
  val df = Seq((1, "a"), (2, "d")).toDF("id", "value")

  import org.apache.spark.sql.{Encoder, Encoders, Row}
  implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]
  val ds = df.map{case row => row}
  ds.show // This shows only one column 'value' with binary content

  // This shows that the schema and values are actually correct. The below will print:
  // row schema:StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row:[1,a]
  // row schema:StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row:[2,d]
  val collected: util.List[Row] = ds.collectAsList()
  collected.forEach { row =>
    System.out.println("row schema:" + row.schema)
    System.out.println("row:" + row)
  }
}

Decoding Java enums/custom non case classes using Structured Spark Streaming

I am trying to use Structured Streaming in Spark 2.1.1 to read from Kafka and decode Avro-encoded messages. I have a UDF defined as per this question.
val sr = new CachedSchemaRegistryClient(conf.kafkaSchemaRegistryUrl, 100)
val deser = new KafkaAvroDeserializer(sr)
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
val topic = conf.inputTopic
val df = session
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", conf.kafkaServers)
.option("subscribe", topic)
.load()
df.printSchema()
val result = df.selectExpr("CAST(key AS STRING)", """decodeMessage($"value") as "value_des"""")
val query = result.writeStream
.format("console")
.outputMode(OutputMode.Append())
.start()
However I get the following failure.
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type DeviceRelayStateEnum is not supported
It fails on this line
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
An alternate approach was to define encoders for the custom classes I have
implicit val enumEncoder = Encoders.javaSerialization[DeviceRelayStateEnum]
implicit val messageEncoder = Encoders.product[DeviceRead]
but that fails with the following error when the messageEncoder is getting registered.
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for DeviceRelayStateEnum
- option value class: "DeviceRelayStateEnum"
- field (class: "scala.Option", name: "deviceRelayState")
- root class: "DeviceRead"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:476)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
When I attempt to do this using a map after the load() I get the following compilation error.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[DeviceRead])org.apache.spark.sql.Dataset[DeviceRead].
Unspecified value parameter evidence$6.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Does that essentially mean that I cannot use Structured Streaming with Java enums, and that it can only be used with either primitives or case classes?
I read a few related questions (1, 2, 3) around this, and it seems the possibility of specifying a custom Encoder for a class (i.e. a UDT) was removed in 2.1 and new functionality has not been added.
Any help will be appreciated.
I think you may be asking for too much in the current version of Structured Streaming (and Spark SQL) in general.
I have not yet been able to fully understand how to deal with the issue of missing encoders in a more professional way, but you would get the same issue if you tried to create a Dataset of enums. That might simply not be supported yet.
Structured Streaming is just a streaming library on top of Spark SQL and uses it for serialization and deserialization (SerDe).
To make the story short and to get you going (until you figure out a better way), I'd recommend avoiding enums in the business objects you use to represent the schema of your datasets.
So, I'd recommend doing something along these lines:
val decodeMessage = udf { bytes:Array[Byte] =>
val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
// do additional transformation here so you use a custom streaming-specific class
// Here I'm using a simple tuple to hold what might be relevant
// You could create a case class instead to have proper names
(dr.id, dr.value)
}
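If you'd rather keep column names than tuple positions, the same trick works with a small case class. A hedged sketch (the field types here are assumptions; id and value are reused from the tuple above, and deviceRelayState comes from the stack trace):
// Field types are guesses; adjust to whatever DeviceRead actually exposes
case class DeviceReadRow(id: String, value: String, deviceRelayState: Option[String])

val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
  // Convert the Java enum to its name so Spark SQL can derive a schema for the UDF result
  DeviceReadRow(dr.id, dr.value, dr.deviceRelayState.map(_.name))
}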

Spark: printing Hbase data and converting it into Dataframe

I am having difficulty playing with the data received from my HBase table. I have an HBase table EMP_META (COLUMN_NAME, SALARY, DESIGNATION, BONUS) and I read it using the code below:
def main(args: Array[String]): Unit = {
val sc = new SparkContext("local", "hbase-test")
println("Running Phoenix Context")
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "EMP_META")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("--------------: "+hBaseRDD.first())
}
However, when I print it using the above print statement, I get the output below:
(65 6d 70 6c 6f 79 65 65,keyvalues={employee/0:COLUMN_NAME/1483975443911/Put/vlen=4/seqid=0, employee/0:DATA_TYPE/1483975443911/Put/vlen=7/seqid=0, employee/0:_0/1483975443911/Put/vlen=1/seqid=0})
This is not a simple, readable text row. I want to convert the output to a dataframe so that I can easily play with the data. Can someone please help me with this?
Thanks
If you want to convert hBaseRDD to a DataFrame, you can use the following code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
hBaseRDD.toDF
If you want to convert the result to a String, you should convert the Array[Byte] to String; the data stored in HBase is Array[Byte]. Try Bytes.toString(data) to convert it.
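A slightly fuller sketch of the same idea, pulling a couple of columns out of each Result and converting the bytes to String before calling toDF (the column family "0" and the qualifiers are taken from the printed output above; adjust them to your table):
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

case class EmpMeta(rowKey: String, columnName: String, salary: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = hBaseRDD.map { case (key, result: Result) =>
  EmpMeta(
    Bytes.toString(key.get()),
    Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("COLUMN_NAME"))),
    Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("SALARY")))
  )
}.toDF()
df.show()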

Transforming PySpark RDD with Scala

TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though.
I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized.
My question would be: how to get Java strings out of the DStream object?
Here is the simplest Python code I came up with:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))
from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(ssc, ["IN"], {"metadata.broker.list": "localhost:9092"})
values = stream.map(lambda tuple: tuple[1])
ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)
ssc.start()
I'm running this code in PySpark, passing it the path to my JAR:
pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar
On the Scala side, I have:
package com.seigneurin
import org.apache.spark.streaming.api.java.JavaDStream
object MyPythonHelper {
def doSomething(jdstream: JavaDStream[String]) = {
val dstream = jdstream.dstream
dstream.foreachRDD(rdd => {
rdd.foreach(println)
})
}
}
Now, let's say I send some data into Kafka:
echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic IN
The println statement in the Scala code prints something that looks like:
[B@758aa4d9
I expected to get foo bar instead.
Now, if I replace the simple println statement in the Scala code with the following:
rdd.foreach(v => println(v.getClass.getCanonicalName))
I get:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
This suggests that the strings are actually passed as arrays of bytes.
If I simply try to convert this array of bytes into a string (I know I'm not even specifying the encoding):
def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
val dstream = jdstream.dstream
dstream.foreachRDD(rdd => {
rdd.foreach(bytes => println(new String(bytes)))
})
}
I get something that looks like (special characters might be stripped off):
�]qXfoo barqa.
This suggests the Python string was serialized (pickled?). How could I retrieve a proper Java string instead?
Long story short, there is no supported way to do something like this. Don't try this in production. You've been warned.
In general Spark doesn't use Py4j for anything else than some basic RPC calls on the driver and doesn't start Py4j gateway on any other machine. When it is required (mostly MLlib and some parts of SQL) Spark uses Pyrolite to serialize objects passed between JVM and Python.
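Since what arrives on the JVM side is a pickled byte array (that is what the [B@... and the garbled output above are), one unsupported workaround is to unpickle it with the Pyrolite classes Spark already ships; a rough sketch, not something to rely on:
import net.razorvine.pickle.Unpickler
import org.apache.spark.streaming.api.java.JavaDStream
import scala.collection.JavaConverters._

object MyPythonHelper {
  def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
    jdstream.dstream.foreachRDD { rdd =>
      rdd.foreach { bytes =>
        val unpickler = new Unpickler() // Pyrolite unpickler; create one per record, it is not thread-safe
        unpickler.loads(bytes) match {
          // PySpark batches records, so a blob usually unpickles to a list
          case list: java.util.ArrayList[_] => list.asScala.foreach(println)
          case single                       => println(single)
        }
      }
    }
  }
}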
This part of the API is either private (Scala) or internal (Python) and as such not intended for general usage. While theoretically you can access it anyway, either per batch:
package dummy
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.sql.DataFrame
object PythonRDDHelper {
def go(rdd: JavaRDD[Any]) = {
rdd.rdd.collect {
case s: String => s
}.take(5).foreach(println)
}
}
complete stream:
object PythonDStreamHelper {
def go(stream: JavaDStream[Any]) = {
stream.dstream.transform(_.collect {
case s: String => s
}).print
}
}
or exposing individual batches as DataFrames (probably the least evil option):
object PythonDataFrameHelper {
def go(df: DataFrame) = {
df.show
}
}
and use these wrappers as follows:
from pyspark.streaming import StreamingContext
from pyspark.mllib.common import _to_java_object_rdd
from pyspark.rdd import RDD
ssc = StreamingContext(spark.sparkContext, 10)
spark.catalog.listTables()
q = ssc.queueStream([sc.parallelize(["foo", "bar"]) for _ in range(10)])
# Reserialize RDD as Java RDD<Object> and pass
# to Scala sink (only for output)
q.foreachRDD(lambda rdd: ssc._jvm.dummy.PythonRDDHelper.go(
_to_java_object_rdd(rdd)
))
# Reserialize and convert to JavaDStream<Object>
# This is the only option which allows further transformations
# on DStream
ssc._jvm.dummy.PythonDStreamHelper.go(
q.transform(lambda rdd: RDD( # Reserialize but keep as Python RDD
_to_java_object_rdd(rdd), ssc.sparkContext
))._jdstream
)
# Convert to DataFrame and pass to Scala sink.
# Arguably there are relatively few moving parts here.
q.foreachRDD(lambda rdd:
ssc._jvm.dummy.PythonDataFrameHelper.go(
rdd.map(lambda x: (x, )).toDF()._jdf
)
)
ssc.start()
ssc.awaitTerminationOrTimeout(30)
ssc.stop()
This is not supported and untested, and as such rather useless for anything other than experiments with the Spark API.
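If you do go the DataFrame route, the Scala side can then read plain Java strings out of the Rows, because the Python-to-JVM conversion already happened in toDF(); a minimal variant of PythonDataFrameHelper above (the column name _1 comes from the unnamed tuple in the Python code):
import org.apache.spark.sql.DataFrame

object PythonDataFrameHelper {
  def go(df: DataFrame) = {
    // rdd.map(lambda x: (x, )).toDF() produces a single column named "_1"
    df.select("_1").collect().foreach(row => println(row.getString(0)))
  }
}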
