How to use UDF in expr?

How do we use a Spark UDF inside expr?
I have:
def power3(input: Int) = input * input * input

import org.apache.spark.sql.functions.{expr, udf}
val power3UDF = udf(power3(_: Int): Int).withName("power3UDF")

numDF
  .select(expr("Number"), expr("power3UDF(Number) AS POWER3"))
  .show()
This gives error:
Undefined function: power3UDF. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.power3udf.

You need to register the function as a Spark SQL function:
spark.udf.register("power3", power3(_:Int):Int)
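Once registered under the name power3, the UDF can be referenced by that name inside expr (a minimal sketch, reusing numDF and its Number column from the question):
numDF
  .select(expr("Number"), expr("power3(Number) AS POWER3"))
  .show()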

Related

Using Accumulator inside Pyspark UDF

I want to access an accumulator inside a PySpark UDF:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

accum = spark.sparkContext.accumulator(0)

def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return accum.value

convertUDF = udf(lambda g, s: prob(g, s), IntegerType())
The problem I am getting:
raise Exception("Accumulator.value cannot be accessed inside tasks")
Exception: Accumulator.value cannot be accessed inside tasks
Please let me know how to access the accumulator value and how we can change it inside a PySpark UDF.
You cannot access the .value of the accumulator in the udf. From the documentation (see this answer too):
Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using value.
It is unclear why you need to return accum.value in this case. Looking at your if block, I believe you only need to return 2 in the else block:
def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return 2

Not able to register UDF in spark sql

I am trying to register my UDF and use it in my Spark SQL query, but I am not able to register it. I'm getting the error below.
val squared = (s: Column) => {
  concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
}
squared: org.apache.spark.sql.Column => org.apache.spark.sql.Column = <function1>
scala> sqlContext.udf.register("dc",squared)
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:143)
... 48 elided
I tried to change Column to String but am getting the error below.
val squared = (s: String) => {
| concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
| }
<console>:28: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
Can someone please guide me on how I should implement this?
Spark functions from the package org.apache.spark.sql.functions._ cannot be accessed inside a UDF; a UDF operates on plain Scala values, so inside one you would have to reimplement the logic in plain Scala.
Here you don't need a UDF at all: apply the built-in Column functions directly to get the same result.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.sql("select * from your_table")

def date_concat(date: Column): Column = {
  concat(substring(date, 4, 2), year(to_date(from_unixtime(unix_timestamp(date, "dd-MM-yyyy")))))
}

df.withColumn("date_column_name", date_concat($"date_column_name")) // with a helper function
df.withColumn("date_column_name", concat(substring($"date_column_name", 4, 2), year(to_date(from_unixtime(unix_timestamp($"date_column_name", "dd-MM-yyyy")))))) // without a function, inline
df.createOrReplaceTempView("table_name")
spark.sql("[...]") // write your further logic in SQL if you want

Spark 2.1 register UDF to functionRegistry

Hi, I want to register a UDF object that is already created. I'm using Spark 2.1, and the sparkSession.udf.register() function does not accept a UDF parameter, only a regular Scala function. It's easy to miss something in the large Spark API, so I'm just asking: is there a function or constructor that will allow this in 2.1?
In this case I'd reverse the problem and use UDF registration to get a UserDefinedFunction:
import org.apache.spark.sql.expressions.UserDefinedFunction
val id: UserDefinedFunction = spark.udf.register("id", (x: Int) => x)
which works both with DataFrames:
spark.range(42).select(id($"id"))
and SQL:
spark.sql("SELECT id(id) FROM RANGE(42)")

to_json function in spark sql

I am trying to use the new to_json function in Spark 2.1 to convert a Map type column to JSON. Code snippet below:
import org.apache.spark.sql.functions.to_json
val df = spark.sql("SELECT *, to_json(Properties) AS Prop, to_json(Measures) AS Meas FROM tempTable")
However, I get the below error. Any ideas what could be wrong?
org.apache.spark.sql.AnalysisException: Undefined function: 'to_json'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 10
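In Spark 2.1, to_json is exposed only through the DataFrame API; it is not registered as a SQL function until a later release, which is why the AnalysisException is raised. A sketch of the workaround, assuming the same tempTable and columns as above:
import org.apache.spark.sql.functions.to_json
import spark.implicits._

val df = spark.table("tempTable")
  .withColumn("Prop", to_json($"Properties"))
  .withColumn("Meas", to_json($"Measures"))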

How to register UDF with no argument in Pyspark

I have tried a Spark UDF with a parameter using a lambda function and registered it. But how can I create a UDF with no argument and register it? I have tried this; my sample code is expected to show the current time.
from datetime import datetime
from pyspark.sql.functions import udf

def getTime():
    timevalue = datetime.now()
    return timevalue

udfGateTime = udf(getTime, TimestampType())
But PySpark is showing
NameError: name 'TimestampType' is not defined
which probably means my UDF is not registered
I was comfortable with this format
spark.udf.register('GATE_TIME', lambda():getTime(), TimestampType())
but does a lambda function take an empty argument? Though I didn't try it, I am a bit confused. How should I write the code for registering this getTime() function?
A lambda expression can be nullary. You're just using incorrect syntax:
spark.udf.register('GATE_TIME', lambda: getTime(), TimestampType())
There is nothing special about lambda expressions in the context of Spark. You can use getTime directly:
spark.udf.register('GetTime', getTime, TimestampType())
There is no need for an inefficient udf at all. Spark provides the required function out of the box:
spark.sql("SELECT current_timestamp()")
or
from pyspark.sql.functions import current_timestamp
spark.range(0, 2).select(current_timestamp())
I have done a bit of a tweak here and it is working well for now:
import datetime
from pyspark.sql.types import *

def getTime():
    timevalue = datetime.datetime.now()
    return timevalue

def GetVal(x):
    if (True):
        timevalue = getTime()
        return timevalue

spark.udf.register('GetTime', lambda x: GetVal(x), TimestampType())
spark.sql("select GetTime('currenttime') as value").show()
Instead of 'currenttime' any value can be passed; it will give the current date and time here.
The error "NameError: name 'TimestampType' is not defined" seems to be due to the lack of:
from pyspark.sql.types import TimestampType
For more info regarding TimestampType, see this answer: https://stackoverflow.com/a/30992905/5088142
