How to register a UDF with no arguments in PySpark

I have tried a Spark UDF with a parameter, using a lambda function, and registered it. But how can I create a UDF with no arguments and register it? I tried the following; my sample code is expected to show the current time:
from datetime import datetime
from pyspark.sql.functions import udf

def getTime():
    timevalue = datetime.now()
    return timevalue

udfGateTime = udf(getTime, TimestampType())
But PySpark is showing
NameError: name 'TimestampType' is not defined
which probably means my UDF is not registered
I was comfortable with this format
spark.udf.register('GATE_TIME', lambda():getTime(), TimestampType())
but does a lambda function take an empty argument list? I have not tried it, and I am a bit confused. How can I write the code to register this getTime() function?

A lambda expression can be nullary. You're just using incorrect syntax:
spark.udf.register('GATE_TIME', lambda: getTime(), TimestampType())
There is nothing special about lambda expressions in the context of Spark; you can use getTime directly:
spark.udf.register('GetTime', getTime, TimestampType())
There is no need for an inefficient udf at all. Spark provides the required function out of the box:
spark.sql("SELECT current_timestamp()")
or
from pyspark.sql.functions import current_timestamp
spark.range(0, 2).select(current_timestamp())
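Putting it together, a minimal end-to-end sketch (assuming an active SparkSession named spark): with the TimestampType import in place, the nullary function can be registered and called from SQL.
from datetime import datetime
from pyspark.sql.types import TimestampType

def getTime():
    # plain Python function with no arguments
    return datetime.now()

# register the function itself; no lambda is required
spark.udf.register('GATE_TIME', getTime, TimestampType())
spark.sql("SELECT GATE_TIME() AS gate_time").show(truncate=False)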

I have made a small tweak here and it is working well for now:
import datetime
from pyspark.sql.types import *

def getTime():
    timevalue = datetime.datetime.now()
    return timevalue

def GetVal(x):
    if True:
        timevalue = getTime()
        return timevalue

spark.udf.register('GetTime', lambda x: GetVal(x), TimestampType())
spark.sql("select GetTime('currenttime')as value ").show()
Instead of 'currenttime', any value can be passed; it will return the current date and time.

The error "NameError: name 'TimestampType' is not defined" seems to be due to the lack of:
from pyspark.sql.types import TimestampType
For more info regarding TimestampType, see this answer: https://stackoverflow.com/a/30992905/5088142

Related

Using Accumulator inside Pyspark UDF

I want to access an accumulator inside a pyspark udf:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

accum = spark.sparkContext.accumulator(0)

def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return accum.value

convertUDF = udf(lambda g, s: prob(g, s), IntegerType())
The problem I am getting:
raise Exception("Accumulator.value cannot be accessed inside tasks")
Exception: Accumulator.value cannot be accessed inside tasks
Please let me know how to access the accumulator value, and how we can change it inside a Pyspark UDF.
You cannot access the .value of the accumulator in the udf. From the documentation (see this answer too):
Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using value.
It is unclear why you need to return accum.value in this case. Looking at your if block, I believe you only need to return 2 in the else block:
def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return 2
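A minimal sketch of the full pattern (assuming a SparkSession named spark and a DataFrame df with hypothetical columns gender and salary): the accumulator is updated inside the UDF on the workers, and its value is read back only on the driver, after an action has run.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

accum = spark.sparkContext.accumulator(0)

def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return 2

convertUDF = udf(prob, IntegerType())

# 'gender' and 'salary' are assumed column names; collecting the UDF output forces it to run on the workers
df.select(convertUDF(col('gender'), col('salary')).alias('flag')).collect()
# reading .value is only allowed here, on the driver
print(accum.value)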

How to convert a pyspark column (pyspark.sql.column.Column) to a pyspark dataframe?

I have a use case to map the elements of a pyspark column based on a condition.
Going through the pyspark Column documentation, I could not find a map function for a pyspark column.
So I tried to use the pyspark DataFrame map function, but I am not able to convert the pyspark column to a dataframe.
Note: The reason I am using a pyspark column is that I get it as an input from a library (Great Expectations) which I use.
@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))
The _spark function arguments are passed from the library.
What I have:
A pyspark column with timestamp strings.
What I require:
A pyspark column with a boolean (True/False) for each element, based on validating the timestamp format.
Example for a dataframe:
df.rdd.map(lambda x: __valid_date(x)).toDF()
The __valid_date function returns True/False.
So I either need to convert the pyspark column into a dataframe to use the above map function, or is there any map function available for a pyspark column?
Looks like you need to return a column object that the framework will use for validation.
I have not used Great Expectations, but maybe you can define a UDF to transform your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T

valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)
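The body of __valid_date is not shown in the question; a hypothetical sketch of such a helper, assuming a single hard-coded timestamp format, might look like this:
from datetime import datetime

def __valid_date(value, fmt='%Y-%m-%d %H:%M:%S'):
    # returns True when the string parses with the assumed format, False otherwise
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False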

NameError: name 'average' is not defined

How can I use avg on a dataframe column so the result can be used to replace null values?
When running this snippet of code:
df.select(colname).agg(avg(colname))
I receive this exception:
NameError: name 'avg' is not defined
What other command can I use?
Got it. You have to use:
from pyspark.sql.functions import *
from pyspark.sql.types import *
A cleaner solution would be something like this:
import pyspark.sql.functions as F
df.select(colname).agg(F.avg(colname))
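Since the stated goal is to replace null values with the average, a minimal sketch (assuming a DataFrame df and a string variable colname naming a numeric column) might be:
import pyspark.sql.functions as F

# compute the column average on the driver, then fill nulls in that column with it
avg_value = df.select(F.avg(colname)).first()[0]
df_filled = df.fillna({colname: avg_value})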

Is there an inverse function for pyspark's expr?

I know there's a function called expr that turns your spark sql into a spark column with that expression:
>>> from pyspark.sql import functions as F
>>> F.expr("length(name)")
Column<b'length(name)'>
Is there a function that does the opposite - turn your Column into a pyspark's sql string? Something like:
>>> F.inverse_expr(F.length(F.col('name')))
'length(name)'
I found out that Column's __repr__ gives you an idea of what the column expression is (like Column<b'length(name)'>), but it doesn't seem to be usable programmatically without some hacky parsing and string replacing.
In Scala, we can use Column#expr to get the SQL-style expression, as below:
length($"entities").expr.toString()
// length('entities)
In pyspark:
print(F.length("name")._jc.expr.container)
# length(name)
I tried the accepted answer by @Som in Spark 2.4.2 and Spark 3.2.1, and it didn't work.
The following approach worked for me in pyspark:
import re

import pyspark
from pyspark.sql import Column

def inverse_expr(c: Column) -> str:
    """Convert a column from `Column` type to an equivalent SQL column expression (string)"""
    from packaging import version

    sql_expression = c._jc.expr().sql()
    if version.parse(pyspark.__version__) < version.parse('3.2.0'):
        # prior to Spark 3.2.0 F.col('a.b') would be converted to `a.b` instead of the correct `a`.`b`
        # this workaround is used to fix this issue
        sql_expression = re.sub(
            r'''(`[^"'\s]+\.[^"'\s]+?`)''',
            lambda x: x.group(0).replace('.', '`.`'),
            sql_expression,
            flags=re.MULTILINE
        )
    return sql_expression
>>> from pyspark.sql import functions as F
>>> inverse_expr(F.length(F.col('name')))
'length(`name`)'
>>> inverse_expr(F.length(F.lit('name')))
"length('name')"
>>> inverse_expr(F.length(F.col('table.name')))
'length(`table`.`name`)'

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments:
A Spark DataFrame column which holds dates as strings (String)
The input format of the column's values, like yyyy-mm-dd (String)
The output format, i.e. the format in which the input has to be returned, like yyyymmdd (String)
I have now registered this function as a UDF in Pyspark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in a dataframe select, and it seems to work fine as long as the input and output formats are different, like below:
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails when the input format and output format are the same, with the following error:
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF
I have tried the above example on Python 2.7 and Spark 2.1
The function seems to work as expected in normal Python when input and output formats are the same
>>>DateTimeFormat('10152019','mmddyyyy','mmddyyyy')
'10152019'
>>>
But the code below gives an error when run in Spark:
import datetime

# Standard date/timestamp formatter
# Takes a string date, its format, and the output format as arguments
# Returns the formatted date as a string
def DateTimeFormat(col, in_frmt, out_frmt):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)
Calling the UDF using the code below:
from pyspark.sql.functions import udf,lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Create SPARK session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("header", "true").load(file_path)
# Registering UDF
udf_date_time = udf(DateTimeFormat,StringType())
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
CSV file input: (linked input file)
The expected result is that the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw an error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this, but you can try the following.
Here I have assumed that you want your dates in a particular format and have set a default for the output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function.
I have added a new function called udf_score to help with the conversion. That might interest you.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit

df = spark.createDataFrame([
    ["10-15-2019"],
    ["10-16-2019"],
    ["10-17-2019"],
], ['exit_date'])

import datetime

def DateTimeFormat(col, in_frmt, out_frmt='yyyy-mm-dd'):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)

def udf_score(in_frmt):
    return udf(lambda l: DateTimeFormat(l, in_frmt))

in_frmt = 'mm-dd-yyyy'
df.select('exit_date', udf_score(in_frmt)('exit_date').alias('new_dates')).show()
+----------+----------+
| exit_date| new_dates|
+----------+----------+
|10-15-2019|2019-10-15|
|10-16-2019|2019-10-16|
|10-17-2019|2019-10-17|
+----------+----------+
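For completeness, a hedged alternative that avoids a Python UDF entirely, assuming Spark 2.2+ (where to_date accepts a format string) and the same df with an exit_date column in mm-dd-yyyy form: the built-in to_date and date_format functions can perform the same conversion.
from pyspark.sql.functions import to_date, date_format

# parse the string with the input pattern, then render it with the output pattern
df.select(
    'exit_date',
    date_format(to_date('exit_date', 'MM-dd-yyyy'), 'yyyy-MM-dd').alias('new_dates')
).show()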
