List user defined functions in pyspark - apache-spark

I am currently using the pySpark console to play around with Spark, and I was wondering: is there a way to list all functions that were defined by me?
Currently I am forced to scroll all the way up to the definition of the function, which can be tedious if you have a lot of output to scroll over.
Thank you so much for your help!

Keeping your workspace clean makes more sense, but if you really need something like this you can filter the variables in the current scope:
[k for (k, v) in globals().items() if (
    callable(v) and                                   # function or callable object
    getattr(v, "__module__", None) == "__main__" and  # defined in __main__
    not k.startswith("_")                             # not hidden
)]
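For example, after defining a function in the shell (squared is a hypothetical name), the comprehension picks it up:

def squared(x):
    return x * x

[k for (k, v) in globals().items() if (
    callable(v) and
    getattr(v, "__module__", None) == "__main__" and
    not k.startswith("_")
)]
# ['squared']  (plus any other callables defined in the session)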

Related

Logging from pandas udf

I am trying to log from a pandas udf called within a Python transform.
Because the code is called on the executor, it does not show up in the driver's logs.
I have been looking at some options on SO, but so far the closest option is this one.
Any idea on how to surface the logs in the driver logs, or in any other log files available under the build, is welcome.
import logging
from pyspark.sql.functions import pandas_udf, PandasUDFType

logger = logging.getLogger(__name__)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    logger.info('calling my udf')
    do_some_stuff()

results_df = my_df.groupby("Name").apply(my_udf)
As you said, the work done by the UDF is done by the executor, not the driver, and Spark captures logging output only from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df
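Once the build finishes, you can inspect the captured values, for example with df.select("debugging").show().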
I also explain another option, if you are more familiar with pandas, here:
How to debug pandas_udfs without having to use Spark?
Edit: I have a complete answer here: In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?
It is not ideal (as it stops the code), but you can do
raise Exception(<variable_name>)
inside the pandas_udf, and it gives you the value of the named variable.
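For instance (a hypothetical grouped-map UDF; schema and the column name are placeholders):

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    # Aborts the build, but the exception message surfaces the value in the driver output
    raise Exception(my_pdf["Name"].iloc[0])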

Using the Python WITH statement to create temporary variable

Suppose I have Pandas data. Any data. I import seaborn to make a colored version of the correlation between variables. Instead of passing the correlation expression into the heatmap function, and instead of creating a one-time variable to store the correlation output, how can I use the with statement to create a temporary variable that no longer exists after the heatmap is plotted?
Doesn't work
# Assume: seaborn imported as sns, data is heatmap-able
with mypandas_df.corr(method="pearson") as heatmap_input:
    # possible other statements
    sns.heatmap(heatmap_input)
    # possible other statements
If this existed, then after seaborn plots the map, heatmap_input would no longer exist as a variable. I would like that functionality.
Long way
# this could be temporary but is now global
tcbtbing = mypandas_df.corr(method="pearson")
sns.heatmap(tcbtbing)
Compact way
sns.heatmap(mypandas_df.corr(method="pearson"))
I'd like to use the with statement (or a similarly short construction) to avoid the Long way and the Compact way, but leave room for other manipulations, such as to the plot itself.
You need to implement __enter__ and __exit__ for the class you want to use it with.
see: Implementing use of 'with object() as f' in custom class in python
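If you only need the protocol without writing a full class, contextlib offers a shortcut. A minimal sketch (temporary is a hypothetical helper; note the caveat at the end):

import numpy as np
import pandas as pd
import seaborn as sns
from contextlib import contextmanager

@contextmanager
def temporary(value):
    # Hand the value to the with-block; code after `yield` runs on exit
    yield value

df = pd.DataFrame(np.random.rand(10, 3), columns=list("abc"))

with temporary(df.corr(method="pearson")) as heatmap_input:
    sns.heatmap(heatmap_input)

# Caveat: Python does not delete `heatmap_input` when the block ends; the with
# statement only guarantees that the exit code runs. Dropping the name still
# requires an explicit `del heatmap_input`.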

Can one see the whole definition of user defined function?

I've loaded a couple of functions (f and g) from another script in my Jupyter notebook. If I pass the parameters, I am able to get the proper output. My question is: is it possible for me to see the whole definition of the function (f or g)?
I tried to print the function, but it only shows the memory location that was assigned to it.
You can do this with the built-in inspect library.
The snippet below should get you acquainted with how to see the source code of a function.
import inspect

def hello_world():
    print("hello_world")

source = inspect.getsource(hello_world)
print(source)
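In a Jupyter notebook you can get the same output with IPython's ?? help syntax:

hello_world??  # shows the source, when it is available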
You need to document your function with a docstring (see PEP 257, https://www.python.org/dev/peps/pep-0257/), like
def func(a, b):
    """
    Wonderful
    """
    return a + b
Then in your Jupyter notebook you can use Shift + Tab on your function.
I cannot comment, but this comes from another thread: How can I see function arguments in IPython Notebook Server 3?

UDF reload in PySpark

I am using PySpark (inside a Jupyter Notebook that connects to a Spark cluster) together with some UDFs. One UDF takes a list as an additional parameter, and I construct it like this:
my_udf = F.udf(partial(my_normal_fn, list_param=list), StringType())
Everything works fine as far as executing the function is concerned. But I noticed that the UDF is never updated.
To clarify: when I update the list, for example by altering an element in it, the UDF is not updated. The old version with the old list is still used, even if I execute the whole notebook again.
I have to restart the Jupyter kernel in order to use the new version of the list, which is really annoying...
Any thoughts?
I found the solution.
My my_normal_fn had the following signature:
def my_normal_fn(x, list_param=[]):
    dosomestuffwith_x_and_list_param
Changing it to
def my_normal_fn(x, list_param):
    dosomestuffwith_x_and_list_param
did the trick. See here for more information.
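For background, default argument values in Python are evaluated once, when the def statement runs, so a mutable default is shared across calls. A minimal illustration of the pitfall:

def append_to(value, target=[]):
    # `target` is created once, at definition time, not once per call
    target.append(value)
    return target

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list object is reused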
Thanks to user Drjones78 of the SparkML-Slack channel.

How to structure a python3/tkinter project

I'm developing a small application using tkinter, with PAGE 4.7 for designing the UI.
I designed my interface and generated the Python source code. I got two files:
gm_ui_support.py: declares the tk variables
gm_ui.py: declares the widgets for the UI
I'm wondering how these files are supposed to be used. One of my goals is to be able to change the UI as many times as I need by regenerating these files, so any code I put inside either of them will be overwritten each time.
So, my question is:
Where do I put my own code? Do I extend gm_ui_support? Do I create a third file? Do I write directly into gm_ui_support?
Due to the lack of answers, I'm going to explain my solution:
It seems it is not possible to keep both files unmodified, so I edit gm_ui_support.py (declaration of tk variables and event callbacks). Each time a change forces me to regenerate the files, I copy those edits over manually.
To minimize the changes to gm_ui_support.py, I created a new file called gm_control.py, where I keep a state dict with all variables (logical and visual) and all the available actions.
Changes to gm_ui_support.py:
I created a generic function (sync_control) that fills my tk variables from the state dict.
At initialization time it creates my control class and invokes sync_control (to pick up the default values defined in the control).
Each callback extracts its parameters from the event and invokes a logical action on the control class (which changes the state dict), then calls sync_control to show the changes.
Sample:
gm_ui_support.py
def sync_control():
    # Copy each entry of the control's state dict into the matching tk variable
    for k in current_gm_control.state:
        gv = 'var_' + k
        if gv in globals():
            #print ('********** found ' + gv)
            if type(current_gm_control.state[k]) is list:
                # Build a tuple literal such as ('a','b') so eval can pass it to set()
                full = "("
                for v in current_gm_control.state[k]:
                    if len(full) > 1:
                        full = full + ','
                    full = full + "'" + v + "'"
                full = full + ")"
                eval("%s.set(%s)" % (gv, full))
            else:
                eval("%s.set('%s')" % (gv, current_gm_control.state[k]))
        else:
            pass

def set_Tk_var():
    global current_gm_control
    current_gm_control = gm_control.GM_Control()
    global var_username
    var_username = StringVar()
    ...
    sync_control()
    ...

def on_select_project(event):
    w = event.widget
    index = int(w.curselection()[0])
    value = w.get(index)
    current_gm_control.select_project(value)
    sync_control()
...
