Using Accumulator inside Pyspark UDF - apache-spark

I want to access an accumulator inside a PySpark UDF:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
accum = spark.sparkContext.accumulator(0)

def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return accum.value

convertUDF = udf(lambda g, s: prob(g, s), IntegerType())
The problem I am getting:

raise Exception("Accumulator.value cannot be accessed inside tasks")
Exception: Accumulator.value cannot be accessed inside tasks

Please let me know how to access the accumulator value and how to change it inside a PySpark UDF.

You cannot access the .value of the accumulator inside the UDF. From the documentation (see this answer too):

Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using value.

It is unclear why you need to return accum.value in this case. Judging from your if block, you only need to return 2 in the else block:
def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return 2

Related

Is there an inverse function for pyspark's expr?

I know there's a function called expr that turns your spark sql into a spark column with that expression:
>>> from pyspark.sql import functions as F
>>> F.expr("length(name)")
Column<b'length(name)'>
Is there a function that does the opposite - turn your Column into a PySpark SQL string? Something like:
>>> F.inverse_expr(F.length(F.col('name')))
'length(name)'
I found that Column's __repr__ gives you an idea of what the column expression is (like Column<b'length(name)'>), but it doesn't seem usable programmatically without some hacky parsing and string replacement.
In Scala, we can use Column#expr to get the SQL expression, as below:
length($"entities").expr.toString()
// length('entities)
In PySpark:
print(F.length("name")._jc.expr.container)
# length(name)
I tried the accepted answer by @Som in Spark 2.4.2 and Spark 3.2.1 and it didn't work.
The following approach worked for me in PySpark:
import re

import pyspark
from pyspark.sql import Column


def inverse_expr(c: Column) -> str:
    """Convert a column from `Column` type to an equivalent SQL column expression (string)"""
    from packaging import version

    sql_expression = c._jc.expr().sql()
    if version.parse(pyspark.__version__) < version.parse('3.2.0'):
        # Prior to Spark 3.2.0, F.col('a.b') would be converted to `a.b` instead of
        # the correct `a`.`b`; this workaround fixes that issue.
        sql_expression = re.sub(
            r'''(`[^"'\s]+\.[^"'\s]+?`)''',
            lambda x: x.group(0).replace('.', '`.`'),
            sql_expression,
            flags=re.MULTILINE,
        )
    return sql_expression
>>> from pyspark.sql import functions as F
>>> inverse_expr(F.length(F.col('name')))
'length(`name`)'
>>> inverse_expr(F.length(F.lit('name')))
"length('name')"
>>> inverse_expr(F.length(F.col('table.name')))
'length(`table`.`name`)'

Spark Accumulator not working

I want to get the number of closed orders from this data using accumulators, but it gives me an incorrect answer: just zero (0). What is the problem? I am using the Hortonworks Sandbox and submitting with spark-submit. The code is below.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)
rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)
def is_closed(N_closed, line):
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
    return status
closedRDD = rdd.filter(lambda x: is_closed(N_closed, x))
print('The answer is ' + str(N_closed.value))
But when I submit it, I get zero.
spark-submit --master yarn closedCounter.py
Update:
Now, when I change my code, it works fine. Is this the right way to do it?
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)
rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)
def is_closed(line):
    global N_closed
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
rdd.foreach(is_closed)
print('The answer is ' + str(N_closed.value))
Second update:
I understand it now. In a Jupyter notebook, without YARN, it gives me the correct answer because I called an action (count) before checking the value of the accumulator.
Computations inside transformations are evaluated lazily, so unless an action happens on an RDD, the transformations are not executed. As a result, accumulators used inside functions like map() or filter() won't be updated unless some action happens on the RDD.
https://www.edureka.co/blog/spark-accumulators-explained
(Examples in Scala)
But basically, you need to perform an action on the RDD.
For example
N_closed = sc.accumulator(0)
def is_closed(line):
    status = line.split(",")[-1] == "CLOSED"
    if status:
        N_closed.add(1)
    return status
rdd.foreach(is_closed)
print('The answer is ' + str(N_closed.value))
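The first version from the question also works once an action forces the filter to run; only then does the driver see a meaningful value. A minimal sketch of that pattern, reusing the sc and rdd defined above:

N_closed = sc.accumulator(0)

def is_closed(line):
    status = line.split(",")[-1] == "CLOSED"
    if status:
        N_closed.add(1)
    return status

closedRDD = rdd.filter(is_closed)   # lazy: nothing runs yet, so the accumulator is still 0
closedRDD.count()                   # the action triggers the filter, updating the accumulator
print('The answer is ' + str(N_closed.value))

Note that Spark only guarantees exactly-once accumulator updates for updates made inside actions (such as foreach); updates made inside transformations like filter() can be applied more than once if a task or stage is re-executed.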

Join RDD using python conditions

I have two RDDs. The first one contains information related to IP addresses (see the c_ip column):
[Row(unic_key=1608422, idx=18, s_date='2016-12-31', s_time='15:00:07', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
Row(unic_key=1608423, idx=19, s_date='2016-12-31', s_time='15:00:08', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
]
And another RDD which contains IP geolocation data:
network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
1.0.0.0/24,2077456,2077456,,0,0,,-33.4940,143.2104,1000
1.0.1.0/24,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.2.0/23,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.4.0/22,2077456,2077456,,0,0,,-33.4940,143.2104,1000
I would like to match these two, but the problem is that there is no strict equivalence between the columns of the two RDDs.
I would like to use the Python 3 package ipaddress and do a check like this:
> import ipaddress
> ipaddress.IPv4Address('1.0.0.5') in ipaddress.ip_network('1.0.0.0/24')
True
Is it possible to use a Python function to perform the join (a left outer join, so as not to exclude any rows from my first RDD)? How can I do that?
When using Apache Spark 1.6, you can still use a UDF as a predicate in a join. After generating some test data:
import ipaddress
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructField, StructType, BooleanType, ArrayType, IntegerType
sessions = sc.parallelize([(1608422,'119.228.181.78'),(1608423, '119.228.181.78')]).toDF(['unic_key','c_ip'])
geo_ip = sc.parallelize([('1.0.0.0/24', 2077456, 2077456),
                         ('1.0.1.0/24', 1810821, 1814991),
                         ('1.0.2.0/23', 1810821, 1814991),
                         ('1.0.4.0/22', 2077456, 2077456)]).toDF(['network', 'geoname_id', 'registered_country_geoname_id'])
You can create the UDF predicate as follows:

def ip_range(ip, network_range):
    return ipaddress.IPv4Address(unicode(ip)) in ipaddress.ip_network(unicode(network_range))

pred = udf(ip_range, BooleanType())

And then you can use the UDF in the following join:

sessions.join(geo_ip).where(pred(sessions.c_ip, geo_ip.network))
Unfortunately this currently doesn't work in Spark 2.x, see https://issues.apache.org/jira/browse/SPARK-19728
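One workaround in Spark 2.x (not from the original answer, just a sketch) is to materialize the cross join first and then filter with the same kind of UDF, so the Python UDF is no longer part of the join condition itself. This assumes the geolocation table is small enough for a cross join to be acceptable and uses Python 3 (str instead of unicode); it reuses the sessions and geo_ip DataFrames from above:

import ipaddress

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def ip_in_network(ip, network_range):
    # True when the session IP falls inside the CIDR block of the geo row.
    return ipaddress.IPv4Address(ip) in ipaddress.ip_network(network_range)

pred = udf(ip_in_network, BooleanType())

# Cross join first, then filter with the Python UDF.
matched = sessions.crossJoin(geo_ip).filter(pred(sessions.c_ip, geo_ip.network))

To keep the left outer semantics from the question (no session rows dropped), the matched rows can then be joined back to sessions with a left join on unic_key.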

How to register UDF with no argument in Pyspark

I have tried a Spark UDF with a parameter, using a lambda function, and registered it. But how can I create a UDF with no argument and register it? I tried this; my sample code is expected to show the current time:
from datetime import datetime
from pyspark.sql.functions import udf
def getTime():
    timevalue = datetime.now()
    return timevalue
udfGateTime=udf(getTime,TimestampType())
But PySpark is showing
NameError: name 'TimestampType' is not defined
which probably means my UDF is not registered. I was comfortable with this format:
spark.udf.register('GATE_TIME', lambda():getTime(), TimestampType())
but does a lambda function take an empty argument? Though I didn't try it, I am a bit confused. How can I write the code to register this getTime() function?
A lambda expression can be nullary. You're just using incorrect syntax:
spark.udf.register('GATE_TIME', lambda: getTime(), TimestampType())
There is nothing special about lambda expressions in the context of Spark. You can use getTime directly:
spark.udf.register('GetTime', getTime, TimestampType())
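If the registration succeeds, the nullary UDF should then be callable from SQL with empty parentheses (shown here for the GetTime registration above; the GATE_TIME variant works the same way):

spark.sql("SELECT GetTime() AS gate_time").show(truncate=False)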
There is no need for an inefficient udf at all. Spark provides the required function out of the box:
spark.sql("SELECT current_timestamp()")
or
from pyspark.sql.functions import current_timestamp
spark.range(0, 2).select(current_timestamp())
I have done a bit of tweaking here and it is working well for now:
import datetime

from pyspark.sql.types import *

def getTime():
    timevalue = datetime.datetime.now()
    return timevalue

def GetVal(x):
    if True:
        timevalue = getTime()
        return timevalue

spark.udf.register('GetTime', lambda x: GetVal(x), TimestampType())

spark.sql("select GetTime('currenttime') as value").show()
Instead of 'currenttime', any value can be passed; it will give the current date and time here.
The error "NameError: name 'TimestampType' is not defined" seems to be due to the lack of:
import pyspark.sql.types.TimestampType
For more info regarding TimeStampType see this answer https://stackoverflow.com/a/30992905/5088142

PySpark PythonUDF Missing input attributes

I'm trying to use a Spark SQL DataFrame to read some data in and apply a bunch of text clean-up functions to each row.
import langid
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
from pyspark.sql import HiveContext
hsC = HiveContext(sc)
df = hsC.sql("select * from sometable")
def check_lang(data_str):
    language = langid.classify(data_str)
    # only english
    record = ''
    if language[0] == 'en':
        # probability of correctly id'ing the language greater than 90%
        if language[1] > 0.9:
            record = data_str
    return record
check_lang_udf = udf(lambda x: check_lang(x), StringType())
clean_df = df.select("Field1", check_lang_udf("TextField"))
However, when I attempt to run this, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.select.
: java.lang.AssertionError: assertion failed: Unable to evaluate PythonUDF. Missing input attributes
I've spent a good deal of time trying to gather more information on this, but I can't find anything.
As a side note, I know the code below works, but I'd like to stay with DataFrames.
removeNonEn = data.map(lambda record: (record[0], check_lang(record[1])))
I haven't tried this code, but the API docs suggest this should work:

hsC.registerFunction("check_lang", check_lang)
clean_df = df.selectExpr("Field1", "check_lang(TextField)")

(Note that TextField should not be quoted inside the expression; otherwise the literal string 'TextField' is passed to the UDF instead of the column.)

Resources