Pass Python functions to Scala RDD in PySpark

I have a Scala library which (to put it simply) receives a function, applies it to an RDD and returns another RDD:
def runFunction(rdd: RDD, function: Any => Any) = {
  ....
  val res = rdd.map(function)
  ...
}
In Scala the usage would be:
import mylibrary.runFunction
runFunction(myRdd, myScalaFun)
This library is packaged in a jar, and I now want to use it in Python too. What I would like to do is load this library in Python and pass a Python function to it. Usage in Python would be:
spark._jvm.mylibrary.runFunction(myPythonRdd, myPythonFun)
This would allow me to use Python functions as well as Scala ones without having to port the whole library to Python. Is this something that can be achieved with Spark's capability of going back and forth between Python and the JVM?

There are some subtleties in the way Python and the JVM communicate in PySpark. The bridge uses Java objects, i.e., JavaRDD and not RDD, and those need explicit unboxing in Scala. Since your Scala function takes an RDD, you need to write a wrapper in Scala that receives a JavaRDD and performs the unboxing first:
def runFunctionWrapper(jrdd: JavaRDD, ...) = {
  runFunction(jrdd.rdd, ...)
}
Then call it like
spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)
Note that, by Python convention, _jrdd is considered a private member of the Python RDD class, so this effectively relies on an undocumented implementation detail. The same applies to the _jvm member of SparkContext.
The real problem is making Scala call back into Python to apply function. In PySpark, the Python RDD's map() method creates an instance of org.apache.spark.api.python.PythonFunction, which holds a pickled reference to the Python mapper function together with its environment. Each RDD partition then gets serialised and, together with the pickled function, sent over TCP to a Python process colocated with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of org.apache.spark.api.python.PythonRunner. This is very different from simply building a wrapper around the Python function and passing it to the map() method of the RDD instance.
I believe it is best if you simply replicate the functionality of runFunction in Python or (much better performance-wise) replicate the functionality of myPythonFun in Scala. Or, if what you do can be done interactively, follow the suggestion of @EnzoBnl and make use of a polyglot notebook environment like Zeppelin or Polynote.
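For the first option, here is a minimal sketch of what replicating runFunction on the Python side could look like, assuming the elided parts of the Scala version essentially reduce to the map over the RDD (all names below are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical Python counterpart of the Scala runFunction; assumes the elided
# logic boils down to mapping the supplied function over the RDD.
def run_function(rdd, fn):
    return rdd.map(fn)

# Example usage with a plain Python function.
my_python_rdd = sc.parallelize([1, 2, 3])
my_python_fun = lambda x: x * 2
print(run_function(my_python_rdd, my_python_fun).collect())  # [2, 4, 6]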

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use PySpark to read in hundreds of CSV files and create a single DataFrame that is (roughly) the concatenation of all the CSVs. Since each CSV fits in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I try to create a PySpark DataFrame in the kernel function of my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
This throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (DataFrames with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc.* methods, it is against the rules. Is that right?
I (think I) could use the plain-old Python (not PySpark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark DataFrame (a sketch of this idea follows these questions), but that doesn't seem to leverage PySpark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
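A minimal sketch of that plain-Python route, assuming all CSVs share the same schema; spark stands in for the SparkSession and the path variables mirror the ones in the question:

import functools
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
file_path_list = [path1, path2]  # list of string path variables, as in the question

# Plain Python loop on the driver: each element is a (lazily evaluated) DataFrame.
dfs = [spark.read.options(delimiter=",", header=True).csv(p) for p in file_path_list]

# Plain functools.reduce to union them into a single DataFrame.
big_df = functools.reduce(lambda a, b: a.unionByName(b), dfs)
big_df.show(3)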
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here:
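For example, a minimal sketch of the wildcard read, assuming the CSVs all sit under one directory (the path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# Spark expands the glob itself and reads every matching CSV into one DataFrame.
big_df = spark.read.options(delimiter=",", header=True).csv("/data/csvs/*.csv")

# Alternatively, pass the explicit list of paths in a single call:
# big_df = spark.read.options(delimiter=",", header=True).csv(file_path_list)

big_df.show(3)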
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems that workers are not allowed to send work to the scheduler, which makes sense, since in a lot of use cases that would be an anti-pattern.
The design pattern is also odd here, because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I feel like if it worked, it would pass back a DAG, not data.
That doesn't sound good, because you want the loading of files to be lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluates immediately. Trying that makes you even more likely to run out of memory.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you fall short of enough memory by a considerable margin.
I think you should use the wildcard as shown above.

SparkContext can only be used on the driver

I am trying to use the SparkContext.binaryFiles function to process a set of ZIP files. The setup is to map over an RDD of filenames, where the mapping function uses the binaryFiles function.
The problem is that SparkContext is referenced in the mapping function, and I'm getting this error. How can I fix it?
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Sample code:
file_list_rdd.map(lambda x: sc.binaryFiles("/FileStore/tables/xyz/" + x[1]))
where file_list_rdd is an RDD of (id, filename) tuples.
It would appear that you need to call the function without referencing the Spark context, if that is at all applicable to your case.
Also consider moving the function / def into the map body statement(s) itself. That is commonly done, and we are using a functional language after all. I have been at a loss to resolve serialization errors unless I resort to the aforementioned and move the defs into the executor logic.
Some file processing is also done via the driver. This post could be of interest: How to paralelize spark etl more w/out losing info (in file names). Based on your code snippet, I would be looking at that approach here.
And you should use something like this and process accordingly:
zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
That way you are using binaryFiles and sc from the driver, where they belong.
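From there, a minimal sketch of how the resulting (path, bytes) pairs could be processed, assuming sc is the SparkContext from above and that the goal is to open each ZIP and list its entries (the processing logic is illustrative):

import io
import zipfile

# binaryFiles yields (path, whole file content as bytes) pairs.
zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')

def list_entries(pair):
    path, content = pair
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        # Emit one record per file inside the archive.
        return [(path, name) for name in zf.namelist()]

entries = zip_data.flatMap(list_entries)
print(entries.take(5))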

UDFs in Spark pipelines

I create a UDF in Python to compute an array of dates between two date columns in a table, and register it with the Spark session. I use this UDF in a pipeline to compute a new column.
Now when I save this pipeline to HDFS and expect it to be read back for execution in a different program (with a different Spark session), the UDF is not available, because it is not globally registered anywhere. Since the process is generic and needs to run multiple pipelines, I don't want to add the UDF definition there and register it with that Spark session.
Is there any way for me to register UDFs globally across all Spark sessions?
Can I add this in as a dependency somehow, in a neat, maintainable manner?
I have the same problem trying to save a pipeline from Python and import it in Scala.
I think I will just use SQL to do what I want to do.
I also saw that I could use a Python .py file in Scala, but I have not yet found a way to use it in a UDF transformer.
If you want to go the other way and use a Java/Scala UDF from a PySpark pipeline, I think you can use the jar of the UDF via sql_context.udf.registerJavaFunction (or sql_context.sql("CREATE FUNCTION newF AS 'f' USING JAR 'udf.jar'")). This seemed to work for me, but it does not help in my case, since I need to go Python => Scala.
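For reference, a minimal sketch of that Java-UDF-from-PySpark route; the class name, jar path, and table/column names are hypothetical, and the jar must already be on the classpath (e.g. passed via --jars):

from pyspark.sql.types import StringType

# Register a UDF implemented in Java/Scala (hypothetical class) under the name "newF".
spark.udf.registerJavaFunction("newF", "com.example.udfs.MyUdf", StringType())

# The SQL variant can point at the jar explicitly:
# spark.sql("CREATE FUNCTION newF AS 'com.example.udfs.MyUdf' USING JAR 'udf.jar'")

# Afterwards the UDF is usable from SQL like any other function, e.g.:
# spark.sql("SELECT newF(some_column) FROM some_table").show()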

What PySpark API calls require the same version of Python on the workers in yarn-client mode

Usually I run my code with different versions of Python on the driver than on the worker nodes, using yarn-client mode.
For instance, I usually use Python 3.5 on the driver and the default Python 2.6 on the workers, and this works pretty well.
I am currently in a project where we need to call
sqlContext.createDataFrame
But this seems to try to execute the statement in Python on the workers, and then I hit the requirement of installing the same version of Python on the workers, which is what I am trying to avoid.
So, in order to use sqlContext.createDataFrame, is it a requirement to have the same Python version on the driver and the workers?
And if so, which other "pure" pyspark.sql API calls would also have this requirement?
Thanks,
Jose
Yes, the same Python version is the requirement in general. Some API calls may not fail, because no Python executor is in use, but it is not a valid configuration.
Every call that interacts with Python code, like udf or DataFrame.rdd.*, will trigger the same exception.
If you want to avoid upgrading cluster Python then use Python 2 on the driver.
In general, many PySpark operations are just wrappers around Spark operations on the JVM. For these operations it doesn't matter what version of Python is used on the worker, because no Python code is executed there, only JVM operations.
Examples of such operations include reading a DataFrame from a file, all built-in functions which do not require Python objects/functions as input, etc.
Once an operation requires an actual Python object or function, things become a little trickier.
Let's say for example that you want to use a UDF and use lambda x: x+1 as the function.
Spark doesn't really know what the function is. Instead, it serializes it and sends it to the workers, which deserialize it in turn.
For this serialization/deserialization process to work, the Python versions on both sides need to be compatible, and that is often not the case (especially between major versions).
All of this leads us to createDataFrame. If you use an RDD as one of the parameters, for example, the RDD will contain Python objects as its records, and these need to be serialized and deserialized, so the same version requirement applies.
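A small sketch contrasting the two cases described in this answer; the data and column names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# JVM-only path: generating data and using built-in functions never runs Python
# on the workers, so the worker Python version does not matter here.
df = spark.range(10).withColumn("doubled", col("id") * 2)

# Python path: the lambda is pickled on the driver and unpickled by the Python
# workers, so driver and workers need compatible Python versions.
plus_one = udf(lambda x: x + 1, LongType())
df.withColumn("plus_one", plus_one(col("id"))).show()

# Same for createDataFrame from an RDD of Python objects: the records are
# pickled/unpickled, so again the versions must match.
rdd = sc.parallelize([(1, "a"), (2, "b")])
spark.createDataFrame(rdd, ["id", "letter"]).show()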

Apache Spark: Python function serialized automatically

I was going through the Apache Spark documentation. The Spark docs for Python say the following:
...We can pass Python functions to Spark, which are automatically
serialized along with any variables that they reference...
I don't fully understand what it means. Does it have something to do with the RDD type?
What does it mean in the context of Spark?
Serialization is necessary when using PySpark because the function you define locally needs to be executed remotely on each of the worker nodes. This concept isn't really related to the RDD type.
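A minimal sketch of what that means in practice; the variable and function names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# 'multiplier' is captured by the function's closure; PySpark pickles the function
# together with that referenced variable and ships both to the worker processes.
multiplier = 3

def scale(x):
    return x * multiplier

rdd = sc.parallelize(range(5))
print(rdd.map(scale).collect())  # [0, 3, 6, 9, 12]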
