Writing test for pyspark udf - apache-spark

I have an internal Python dependency that is executed inside a Spark pandas_udf. To pass parameters, we wrap it inside another function.
The code looks like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType

def wrapper_fn(df, parameters):
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def run_pandas_code(pdf):
        """ Importing some python library and using it """
        return pandas_df
    return df.groupby(<key>).apply(run_pandas_code)
I want to write a test that executes the function wrapper_fn. But when I write these tests, I get a pickle error. Can someone recommend a good way to test PySpark UDFs?

It was possible to do this eventually. The problem was a reference to a class that PySpark wasn't able to serialise.
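For anyone looking for a concrete starting point, here is a minimal sketch of such a test using pytest and a local SparkSession. The module name, fixture, parameters dict and column names below are placeholders, not taken from the question:

import pandas as pd
import pytest
from pyspark.sql import SparkSession

from my_module import wrapper_fn  # hypothetical module containing the wrapper


@pytest.fixture(scope="session")
def spark():
    # A local SparkSession so the test runs without a cluster.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("udf-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_wrapper_fn(spark):
    # Hypothetical input; use whatever schema wrapper_fn actually expects.
    input_df = spark.createDataFrame(
        pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
    )
    result = wrapper_fn(input_df, parameters={"some_param": 1})
    # Calling an action makes the pandas_udf actually run on the executors,
    # which is typically where pickling/serialisation errors surface.
    assert result.count() > 0

The action at the end matters: without materialising the result, parts of the serialisation and execution may never happen, and the test would pass without exercising the UDF.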

Related

Apache Beam + Databricks Notebook - map function error

I am trying to run a simple pipeline using Apache Beam in a Databricks notebook, but I am unable to create any custom functions. Here is a simple example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def my_func(s):
    print(s)

pipeline_options = PipelineOptions([
    "--runner=DirectRunner",
])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.Map(my_func)
    )
Produces:
RuntimeError: Unable to pickle fn CallableWrapperDoFn(<function Map.<locals>.<lambda> at 0x7fb5a17a6b80>): It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
This error also occurs if I use a lambda function. Everything works as expected if I use the builtin print function. Why am I getting this error? How can I pass custom functions to my pipeline in this environment?
A reference to SparkContext seems to be getting pulled in when the lambda gets serialized by Beam.
Instead of using beam.Map, can you try defining a Beam DoFn and using the ParDo transform?
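For reference, a minimal sketch of that suggestion (the DoFn name is arbitrary):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class PrintFn(beam.DoFn):
    # An explicit DoFn instead of wrapping a bare function in beam.Map.
    def process(self, element):
        print(element)
        yield element

pipeline_options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(["foo", "bar", "baz"])
        | "print result" >> beam.ParDo(PrintFn())
    )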

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing it with the local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple code in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works:
It works if I run the script via gluesparksubmit.
It works if I use a lambda expression instead of a function declaration (see the sketch at the end of this question).
It works if I declare the function within the REPL and pass it as an argument.
It does not work if I put both the def func(): ... definition and the .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option because I need to pack everything into one job script.
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6
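For concreteness, the lambda variant mentioned above (reported to work when the script is imported into the gluepyspark shell) would look something like this, reusing the same sc from the snippet:

# Self-contained lambda, with no reference to a module-level function.
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: None)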

Pattern to add functions to existing Python classes

I'm writing a helper library for pandas.
Similarly to scala implicits, I would like to add my custom functions to all the instances of an existing Python class (pandas.DataFrame in this case), of which I have no control: I cannot modify it, I cannot extend it and ask users to use my extension instead of the original class.
import pandas as pd
df = pd.DataFrame(...)
df.my_function()
What's the suggested pattern to achieve this with Python 3.6+?
If exactly this is not achievable, what's the most common, robust, clear and least-surprising pattern used in Python for a similar goal? Can we get anything better by requiring Python 3.7+, 3.8+ or 3.9+?
I know it's possible to patch single instances or classes at runtime to add methods. This is not what I would like to do: I would prefer a more elegant and robust solution, applicable to the whole class rather than to single instances, and IDE-friendly so that code completion can suggest my_function.
My case is specific to pandas.DataFrame, so a solution applicable only to this class could be also fine, as long as it uses documented, official APIs of pandas.
In the code below I create a function with a single self argument.
This function is then assigned to an attribute of the pd.DataFrame class and is callable as a method.
import pandas as pd

def my_new_method(self):
    print(type(self))
    print(self)

pd.DataFrame.new_method = my_new_method

df = pd.DataFrame({'col1': [1, 2, 3]})
df.new_method()
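If you would rather rely on a documented pandas API than assign attributes on the class directly, pandas also provides an accessor-registration decorator. A minimal sketch, with an arbitrary accessor name "helper":

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("helper")
class HelperAccessor:
    def __init__(self, pandas_obj):
        # The accessor is constructed with the DataFrame it is accessed on.
        self._obj = pandas_obj

    def my_function(self):
        print(type(self._obj))
        print(self._obj)

df = pd.DataFrame({'col1': [1, 2, 3]})
df.helper.my_function()

Note the extra namespace: the call becomes df.helper.my_function() rather than df.my_function(), but the registration goes through pandas' documented extension API.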

Logging from pandas udf

I am trying to log from a pandas udf called within a python transform.
Because the code is being called on the executor, it does not show up in the driver's logs.
I have been looking at some options on SO, but so far the closest option is this one.
Any idea on how to surface the logs in the driver logs or any other log files available under build is welcome.
import logging
from pyspark.sql.functions import pandas_udf, PandasUDFType

logger = logging.getLogger(__name__)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    logger.info('calling my udf')
    do_some_stuff()

results_df = my_df.groupby("Name").apply(my_udf)
As you said, the work done by the UDF is done by the executor not the driver, and Spark captures the logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture and store it in a column to view once the build is finished:
from pyspark.sql import functions as F

def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df
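Once the query has run, the captured values sit in the debugging column next to the data, so something like df.select("example_integer_col", "debugging").show() (or inspecting that column in the finished build) surfaces the per-row log messages.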
I also explain another option, if you are more familiar with pandas, here:
How to debug pandas_udfs without having to use Spark?
Edit: I have a complete answer here: In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?
It is not ideal (as it stops the code), but you can do
raise Exception(<variable_name>)
inside the pandas_udf, and it gives you the value of the named variable.
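For example, in a grouped-map UDF like the one in the question (my_pdf taken from the snippet above, the message text being just an illustration), that would look roughly like:

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    # Deliberately aborts the job; the exception message, including the
    # inspected value, shows up in the driver-side error output.
    raise Exception("my_pdf columns: %s" % list(my_pdf.columns))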

Can one see the whole definition of user defined function?

I've loaded a couple of functions (f and g) from another script in my Jupyter notebook. If I pass the parameters, I am able to get the proper output. My question is: is it possible for me to see the whole definition of the function (f or g)?
I tried to look at the function, but it only shows the memory location that was assigned to it.
You can do this with the built-in inspect module.
The snippet below should get you acquainted with how to see the source code of a function.
def hello_world():
    print("hello_world")

import inspect
source = inspect.getsource(hello_world)
print(source)
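Applied to the question's setup, where f was imported from another script, that would be along the lines of:

import inspect
print(inspect.getsource(f))  # prints the full definition of the imported function

In a Jupyter/IPython notebook, f?? shows much the same information without importing inspect.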
You need to document your function with a docstring (see https://www.python.org/dev/peps/pep-0257/), like:
def func(a, b):
    """
    Wonderful
    """
    return a + b
Then, in your Jupyter notebook, you can use Shift + Tab on your function.
I cannot comment, but this comes from another thread: How can I see function arguments in IPython Notebook Server 3?
