Not able to register UDF in spark sql - apache-spark

I trying to register my UDF function and want to use this in my spark sql query but not able to register my udf Im getting below error.
val squared = (s: Column) => {
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
}
squared: org.apache.spark.sql.Column => org.apache.spark.sql.Column = <function1>
scala> sqlContext.udf.register("dc",squared)
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:143)
... 48 elided
I tried to change Column to String but getting below error.
val squared = (s: String) => {
| concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
| }
<console>:28: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
can someone please guide me how should i implement this.

All spark functions from this package org.apache.spark.sql.functions._ will not be able to access inside UDF.
Instead of built in spark functions ..you can use plain scala code to get same result.
val df = spark.sql("select * from your_table")
def date_concat(date:Column): Column = {
concat(substring(date,4,2),year(to_date(from_unixtime(unix_timestamp(date,"dd-MM-yyyy")))))
}
df.withColumn("date_column_name",date_concat($"date_column_name")) // with function.
df.withColumn("date_column_name",concat(substring($"date_column_name",4,2),year(to_date(from_unixtime(unix_timestamp($"date_column_name","dd-MM-yyyy")))))) // without function, direct method.
df.createOrReplaceTempView("table_name")
spark.sql("[...]") // Write your furthur logic in sql if you want.

Related

how to create dataframe in UDF

I have a problem. I want to create a DataFrame in UDF and use my model to transform it to another. But I get this Exception. Is there something wrong in Spark Conf? I don't know. Is there anyone can help me to solve this problem?
Code:
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String):Double = {
val modelLoaded = modelBroad.value
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val dataDF = sparkss.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
println(f"The prediction of $id and $text is $result")
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text")).show()
Exception:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics$lzycompute(LocalTableScanExec.scala:37)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:85)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3366)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3365)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
at com.zamplus.mine.SparkSubmit$.com$zamplus$mine$SparkSubmit$$model_predict$1(SparkSubmit.scala:21)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
... 22 more
There is issue with your UDF. UDF runs on multiple instances and uses all variables that we are using inside it. So you should passed all required global variable as a parameters such as modelBroad otherwise it will give you null pointer exception.
There are few more good practice that you are not following in UDF. Some of are:
You do not need to create spark session in UDF. Otherwise it will create multiple spark session and which will cause issues. Instead of this pass global spark session as a variable in UDF if required.
Remove unnecessary pritnln in UDF, which effect your return also.
I have changed your code just for reference. It is just a prototype of ideal UDF. Please change it accordingly.
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String,spark:SparkSession,modelBroad:<datatype>):Double = {
val modelLoaded = modelBroad.value
val dataDF = spark.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text",lit(sparkss),lit(modelBroad))).show()

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
You can also use .toDF method on desc formatted table then filter from dataframe.
DataframeAPI:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
First approach
You can use input_file_name with dataframe.
it will give you absolute file-path for a part file.
spark.read.table("zen.intent_master").select(input_file_name).take(1)
And then extract table path from it.
Second approach
Its more of hack you can say.
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), new IllegalArgumentException(idtfr + " done not exists"))
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}
Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)
Use this as re-usable function in your scala project
def getHiveTablePath(tableName: String, spark: SparkSession):String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
caller would be
println(getHiveTablePath("src", spark)) // you can prefix schema if you have
Result (I executed in local so file:/ below if its hdfs hdfs:// will come):
+--------+------------------------------------+-------+
|col_name|data_type |comment|
+--------+--------------------------------------------+
|Location|file:/Users/hive/spark-warehouse/src| |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src
USE ExternalCatalog
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog#24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

Spark 2.1 register UDF to functionRegistry

Hi I want to register a UDF object that is already created. I'm using spark 2.1, and the sparkSession.udf.register() function does not accept a UDF parameter only a regular scala function. It's easy to miss something from the large Spark API so just asking is there a function or constructor that will allow this in 2.1?
In this case I'd reverse the the problem and use udf registration to get UserDefinedFunction:
import org.apache.spark.sql.expressions.UserDefinedFunction
val id: UserDefinedFunction = spark.udf.register("id", (x: Int) => x)
Which would work both in DataFrames:
val id: UserDefinedFunction = spark.udf.register("id", (x: Int) => x)
and SQL:
spark.sql("SELECT id(id) FROM RANGE(42)")

Spark classnotfoundexception in UDF

When I call a function it works. but when I call that function in UDF will not work.
This is full code.
val sparkConf = new SparkConf().setAppName("HiveFromSpark").set("spark.driver.allowMultipleContexts","true")
val sc = new SparkContext(sparkConf)
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
///////////// UDFS
def toDoubleArrayFun(vec:Any) : scala.Array[Double] = {
return vec.asInstanceOf[WrappedArray[Double]].toArray
}
def toDoubleArray=udf((vec:Any) => toDoubleArrayFun(vec))
//////////// PROCESS
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where word='soccer'")
println("==== test get value then transform")
println(df.head().get(0))
println(toDoubleArrayFun(df.head().get(0)))
println("==== test transform by udf")
df.withColumn("word_v", toDoubleArray(col("vec")))
.show(10);
Then this the output.
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext#6e9484ad
hive: org.apache.spark.sql.hive.HiveContext =
toDoubleArrayFun: (vec: Any)Array[Double]
toDoubleArray: org.apache.spark.sql.UserDefinedFunction
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
==== test get value then transform
WrappedArray(-0.88675,, 0.0216657)
[D#4afcc447
==== test transform by udf
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, xdad008.band.nhnsystem.com): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$$$5ba2a895f25683dd48fe725fd825a71$$$$$$iwC$$anonfun$toDoubleArray$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Full output here.
https://gist.github.com/jeesim2/efb52f12d6cd4c1b255fd0c917411370
As you can see "toDoubleArrayFun" function works well, but in udf it claims ClassNotFoundException.
I can not change the hive data structure, and need to convert vec to Array[Double] to make a Vector instance.
So what problem with code above?
Spark version is 1.6.1
Update 1
Hive table's 'vec' column type is "array<double>"
Below code also cause error
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where
word='hh'")
df.printSchema()
var word_vec = df.head().get(0)
println(word_vec)
println(Vectors.dense(word_vec))
output
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
root
|-- vec: array (nullable = true)
| |-- element: double (containsNull = true)
==== test get value then transform
word_vec: Any = WrappedArray(-0.88675,...7)
<console>:288: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues:Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Any)
println(Vectors.dense(word_vec))
This means hive 'array<double>' column cant not be casted to Array<Double>
Actually I want to calculate distance:Double with two array<double> column.
How do I add Vector column based on array<double> column?
Typical method is
Vectors.sqrt(Vectors.dense(Array<Double>, Array<Double>)
Since udf function has to go serialization and deserialization process, any DataType will not work. You will have to define exact DataType of the column you are passing to the udf function.
From the output in your question it seems that you have only one column in your dataframe i.e. vec which is of Array[Double] type
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
There actually is no need of that udf function as your vec column is already of Array dataType and that is what your udf function is doing as well i.e. casting the value to Array[Double].
Now, your other function call is working
println(toDoubleArrayFun(df.head().get(0)))
because there is no need of serialization and de-serialization process, its just scala function call.

to_json function in spark sql

I am trying to use the new to_json function in Spark 2.1 to convert a Map type column to JSON. Code snippet below:
import org.apache.spark.sql.functions.to_json
val df = spark.sql("SELECT *, to_json(Properties) AS Prop, to_json(Measures) AS Meas FROM tempTable")
However, I get the below error. Any ideas what could be wrong?
org.apache.spark.sql.AnalysisException: Undefined function: 'to_json'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 10

Resources