spark custom kryo encoder not providing schema for UDF - apache-spark

While following along with How to store custom objects in Dataset? and trying to register my own Kryo encoder for a data frame, I hit the error Schema for type com.esri.core.geometry.Envelope is not supported.
There is a function which parses a String (WKT) into a geometry object:
import com.esri.core.geometry.{Envelope, SpatialReference}
import com.esri.core.geometry.ogc.OGCGeometry
def mapWKTToEnvelope(wkt: String): Envelope = {
  val envBound = new Envelope()
  val spatialReference = SpatialReference.create(4326)
  // Parse the WKT String into a Geometry Object
  val ogcObj = OGCGeometry.fromText(wkt)
  ogcObj.setSpatialReference(spatialReference)
  ogcObj.getEsriGeometry.queryEnvelope(envBound)
  envBound
}
This is applied in a UDF like:
implicit val envelopeEncoder: Encoder[Envelope] = Encoders.kryo[Envelope]
val ST_Envelope = udf((wkt: String) => mapWKTToEnvelope(wkt))
However, the UDF will compile but throw a runtime error of:
[error] Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type com.esri.core.geometry.Envelope is not supported
[error] at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
[error] at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
[error] at org.apache.spark.sql.functions$.udf(functions.scala:3076)
Edit:
Whereas
val spatialReference = SpatialReference.create(4326)
val first = df.as[(String, String)].first
val envBound = new Envelope()
val ogcObj = OGCGeometry.fromText(first._1)
ogcObj.setSpatialReference(spatialReference)
ogcObj.getEsriGeometry.queryEnvelope(envBound)
spark.createDataset(Seq(envBound))(envelopeEncoder)
Works just fine:
root
|-- value: binary (nullable = true)
+--------------------+
| value|
+--------------------+
|[01 00 63 6F 6D 2...|
+--------------------+
How can I get it to work in the UDF as well?
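For context: org.apache.spark.sql.functions.udf derives the return type's schema via ScalaReflection (the schemaFor frames in the stack trace above), and a Kryo Encoder only applies to Dataset operations, so it is never consulted for a UDF's return type. A minimal sketch of a common workaround, assuming only the envelope's bounds are needed downstream and that Envelope exposes getXMin/getYMin/getXMax/getYMax accessors (not confirmed by the original post), is to have the UDF return a case class that Catalyst can encode natively:
// Hypothetical wrapper type (defined at top level): a case class of doubles
// has a Catalyst schema, so no custom Encoder is needed for the udf return type.
case class EnvelopeBounds(xmin: Double, ymin: Double, xmax: Double, ymax: Double)
val ST_Envelope = udf { (wkt: String) =>
  val env = mapWKTToEnvelope(wkt)
  EnvelopeBounds(env.getXMin, env.getYMin, env.getXMax, env.getYMax)
}
If the raw Envelope object itself is required, a typed Dataset map with the Kryo encoder in scope (as in the edit above) is the alternative, at the cost of an opaque binary column.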

Related

Not able to register UDF in spark sql

I am trying to register my UDF and use it in my Spark SQL query, but I am not able to register it; I am getting the error below.
val squared = (s: Column) => {
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
}
squared: org.apache.spark.sql.Column => org.apache.spark.sql.Column = <function1>
scala> sqlContext.udf.register("dc",squared)
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:143)
... 48 elided
I tried to change Column to String but got the error below.
val squared = (s: String) => {
| concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
| }
<console>:28: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
Can someone please guide me on how I should implement this?
Spark functions from the package org.apache.spark.sql.functions._ cannot be used inside a UDF: they operate on Columns and return Column expressions, while a UDF body works with plain Scala values.
You can either use plain Scala code inside a UDF to get the same result (see the sketch after the code below), or skip the UDF entirely and keep the built-in functions in an ordinary Column => Column helper:
val df = spark.sql("select * from your_table")
def date_concat(date: Column): Column = {
  concat(substring(date,4,2),year(to_date(from_unixtime(unix_timestamp(date,"dd-MM-yyyy")))))
}
df.withColumn("date_column_name", date_concat($"date_column_name")) // with the helper function
df.withColumn("date_column_name", concat(substring($"date_column_name",4,2),year(to_date(from_unixtime(unix_timestamp($"date_column_name","dd-MM-yyyy")))))) // without the helper, inlined directly
df.createOrReplaceTempView("table_name")
spark.sql("[...]") // write your further logic in SQL if you want

Output not in readable format in Spark 2.2.0 Dataset

Following is the code I am trying to execute with Spark 2.2.0 in the IntelliJ IDE, but the output I am getting does not look readable.
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example").master("local[2]")
  .getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
  org.apache.spark.sql.Encoders.kryo[A](ct)
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
Output shown:
+--------------------+
| value|
+--------------------+
|[01 00 44 61 74 6...|
+--------------------+
Can anyone explain if I am missing anything here?
Thanks
This is because the implicit kryoEncoder shadows the built-in case-class encoder, and the Kryo encoder serializes objects into an opaque binary value column that show cannot render.
In general you should not use the Kryo encoder when a more precise encoder is available: it has poorer performance and fewer features. Instead, use the product encoder:
spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])
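Equivalently, a minimal sketch assuming the blanket implicit kryoEncoder is simply removed and Person is defined at top level (outside the method): with spark.implicits._ in scope, toDS() derives the product encoder automatically and show() prints readable columns:
// No competing Kryo implicit in scope, so the case-class encoder is used
case class Person(name: String, age: Long)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+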

Spark classnotfoundexception in UDF

When I call a function directly it works, but when I call that function inside a UDF it does not.
This is the full code.
import scala.collection.mutable.WrappedArray
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, udf}
val sparkConf = new SparkConf().setAppName("HiveFromSpark").set("spark.driver.allowMultipleContexts","true")
val sc = new SparkContext(sparkConf)
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
///////////// UDFS
def toDoubleArrayFun(vec: Any): scala.Array[Double] = {
  vec.asInstanceOf[WrappedArray[Double]].toArray
}
def toDoubleArray = udf((vec: Any) => toDoubleArrayFun(vec))
//////////// PROCESS
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where word='soccer'")
println("==== test get value then transform")
println(df.head().get(0))
println(toDoubleArrayFun(df.head().get(0)))
println("==== test transform by udf")
df.withColumn("word_v", toDoubleArray(col("vec")))
  .show(10)
Then this is the output.
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext#6e9484ad
hive: org.apache.spark.sql.hive.HiveContext =
toDoubleArrayFun: (vec: Any)Array[Double]
toDoubleArray: org.apache.spark.sql.UserDefinedFunction
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
==== test get value then transform
WrappedArray(-0.88675,, 0.0216657)
[D#4afcc447
==== test transform by udf
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, xdad008.band.nhnsystem.com): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$$$5ba2a895f25683dd48fe725fd825a71$$$$$$iwC$$anonfun$toDoubleArray$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Full output here.
https://gist.github.com/jeesim2/efb52f12d6cd4c1b255fd0c917411370
As you can see "toDoubleArrayFun" function works well, but in udf it claims ClassNotFoundException.
I can not change the hive data structure, and need to convert vec to Array[Double] to make a Vector instance.
So what problem with code above?
Spark version is 1.6.1
Update 1
Hive table's 'vec' column type is "array<double>"
The code below also causes an error:
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where word='hh'")
df.printSchema()
var word_vec = df.head().get(0)
println(word_vec)
println(Vectors.dense(word_vec))
output
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
root
|-- vec: array (nullable = true)
| |-- element: double (containsNull = true)
==== test get value then transform
word_vec: Any = WrappedArray(-0.88675,...7)
<console>:288: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues:Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Any)
println(Vectors.dense(word_vec))
This means the Hive array<double> column cannot be cast to Array[Double].
Actually I want to calculate a distance: Double from two array<double> columns.
How do I add a Vector column based on an array<double> column?
The typical method is
Vectors.sqrt(Vectors.dense(arr1: Array[Double]), Vectors.dense(arr2: Array[Double]))
Since a udf function has to go through a serialization and deserialization process, Any will not work as the argument type. You have to declare the exact DataType of the column you are passing to the udf function.
From the output in your question it seems that you have only one column in your dataframe, i.e. vec, which is of array<double> type:
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
There is actually no need for that udf function, as your vec column is already of array data type, and that is all your udf function does as well, i.e. cast the value to Array[Double].
Now, your other function call works
println(toDoubleArrayFun(df.head().get(0)))
because there is no serialization and deserialization involved; it is just a plain Scala function call.
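A minimal sketch of what that could look like (not from the original answer; the second column name other_vec used for the distance is hypothetical): declare the element type as Seq[Double] instead of Any, and once both sides are typed the distance can be computed with mllib's Vectors:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
// Declare the exact type: an array<double> column arrives as a Seq[Double]
val toDoubleArray = udf((vec: Seq[Double]) => vec.toArray)
// Squared Euclidean distance between two array<double> columns
val sqDist = udf((a: Seq[Double], b: Seq[Double]) =>
  Vectors.sqdist(Vectors.dense(a.toArray), Vectors.dense(b.toArray)))
df.withColumn("word_v", toDoubleArray(col("vec"))).show(10)
// df.withColumn("dist", sqDist(col("vec"), col("other_vec"))) // other_vec is assumed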

How do I convert Binary string to scala string in spark scala

I am reading an Avro file which contains a field as a binary string. I need to convert it into a java.lang.String to pass it to another library (spark-xml-util). How do I convert it to a java.lang.String efficiently? This is the code I have so far:
val df = sqlContext.read.format("com.databricks.spark.avro").load("filePath/fileName.avro")
df.select("myField").collect().mkString
The last line gives me the following exception:
Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.sql.Row$class.getString(Row.scala:255)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:165)
The df schema is:
root
|-- id: string (nullable = true)
|-- myField: binary (nullable = true)
Considering the state of the API right now (2.2.0), your best bet is to create a UDF to do just that and replace the column:
import org.apache.spark.sql.functions.udf
val toString = udf((payload: Array[Byte]) => new String(payload))
df.withColumn("myField", toString(df("myField")))
or, if as you seem to imply the data is compressed using GZIP, you can:
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.sql.functions.udf
val toString = udf((payload: Array[Byte]) => {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(payload))
  scala.io.Source.fromInputStream(inputStream).mkString
})
df.withColumn("myField", toString(df("myField")))
In Spark 3.0 you can cast between BINARY and STRING data.
scala> val df = sc.parallelize(Seq("ABC", "BCD", "CDE", "DEF")).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.select($"value", $"value".cast("BINARY"),
$"value".cast("BINARY").cast("STRING")).show()
+-----+----------+-----+
|value| value|value|
+-----+----------+-----+
| ABC|[41 42 43]| ABC|
| BCD|[42 43 44]| BCD|
| CDE|[43 44 45]| CDE|
| DEF|[44 45 46]| DEF|
+-----+----------+-----+
I don't have your data to test with, but you should be able to do:
df.select($"myField".cast("STRING"))
This obviously depends on the actual data (i.e. don't cast a JPEG to STRING) but assuming it's UTF-8 encoded this should work.
In the previous solution, the code new String(payload) did not work for me on true binary data.
Ultimately the solution was a little more involved, with the length of the binary data required as a 2nd parameter.
import org.apache.spark.sql.functions.udf
def binToString(payload: Array[Byte], payload_length: Int): String = {
  val ac: Array[Char] = Range(0, payload_length).map(i => payload(i).toChar).toArray
  ac.mkString
}
val binToStringUDF = udf(binToString(_: Array[Byte], _: Int): String)
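Hypothetical usage (the column names myField and myFieldLength are assumed; the length has to be available as its own column):
import org.apache.spark.sql.functions.col
df.withColumn("myField", binToStringUDF(col("myField"), col("myFieldLength")))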

How to add a schema to a Dataset in Spark?

I am trying to load a file into spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a dataset in the output. But if I load a Json file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a dataframe with a readymade schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
JSON/Parquet/ORC files carry a schema, so I understand this is a Spark 2.x feature that makes things easier: we directly get a DataFrame in this case, while for a normal textFile we get a Dataset with no schema, which makes sense.
What I'd like to know is how I can add a schema to a Dataset that results from loading a textFile into Spark. For an RDD, there is the case class/StructType option to add the schema and convert it to a DataFrame.
Could anyone let me know how I can do it?
When you use textFile, each line of the file will be a string row in your Dataset. To convert to DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema of a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset using map, then use toDF to convert it to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
import org.apache.spark.sql.Dataset
import spark.implicits._
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[(Int, Int)] = partFile.map {
  line: String => (line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
case class MyRow(value0: Int, value3: Int)
import spark.implicits._
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[MyRow] = partFile.map {
  line: String => MyRow(line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)
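For completeness, a minimal sketch of the explicit StructType route the question mentions (assumed, not part of the answer above): drop to an RDD[Row] and pass the schema to createDataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// Explicit schema matching the two-integer example above
val schema = StructType(Seq(
  StructField("value0", IntegerType, nullable = true),
  StructField("value3", IntegerType, nullable = true)
))
val rowRdd = partFile.rdd.map(line => Row(line(0).toInt, line(3).toInt))
val dfWithSchema = spark.createDataFrame(rowRdd, schema)
dfWithSchema.printSchema()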
