UDFs in Spark pipelines - apache-spark

I create a UDF in Python to compute an array of dates between two date columns in a table and register it to the Spark session. I use this UDF in a pipeline to compute a new column.
Now, when I save this pipeline to HDFS and expect it to be read back for execution in a different program (with a different Spark session), the UDF is not available because it is not globally registered anywhere. Since the process is generic and needs to run multiple pipelines, I don't want to add the UDF definition and register it to the Spark session there.
Is there any way for me to register UDFs globally across all Spark sessions?
Can I add this as a dependency somehow, in a neat, maintainable manner?

I have the same problem trying to save it from Python and import it in Scala.
I think I will just use SQL to do what I want to do.
I also saw that I could use a Python .py file in Scala, but I have not yet found a way to use it in a UDF transformer.
If you want to use PySpark from a Java pipeline, I think you can use the jar of the UDF with sql_context.udf.registerJavaFunction (or sql_context.sql("CREATE FUNCTION newF AS 'f' USING JAR 'udf.jar'")). This seemed to work for me, but it doesn't help my case since I need to go Python => Scala.
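For reference, a minimal sketch of the jar-based registration route mentioned above. The class name com.example.DateRangeUDF, the jar path, and the events table are placeholders, and the CREATE FUNCTION ... USING JAR form assumes the session has Hive/metastore support:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DateType

spark = SparkSession.builder.appName("udf-registration").getOrCreate()

# Option 1: register a UDF implemented in Java/Scala (packaged in a jar on the classpath)
spark.udf.registerJavaFunction(
    "date_range",                   # name visible to SQL
    "com.example.DateRangeUDF",     # hypothetical class implementing org.apache.spark.sql.api.java.UDF2
    ArrayType(DateType()),
)

# Option 2: create a permanent, metastore-backed function so other sessions can resolve it by name
spark.sql("CREATE FUNCTION date_range AS 'com.example.DateRangeUDF' USING JAR 'hdfs:///libs/udf.jar'")

# Either way, the UDF is then callable from SQL
spark.sql("SELECT date_range(start_date, end_date) FROM events").show()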

Related

How to find out when Spark Application has any memory/disk spills without checking Spark UI

My Environment:
Databricks 10.4
Pyspark
I'm looking into Spark performance, specifically into the memory/disk spills that are shown in the Spark UI's Stages section.
What I want to achieve is to get notified if my job had spills.
I have found the class below, but I'm not sure how it works:
https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/SpillListener.html
I want a smart way to find where the major spills are, rather than going through all the jobs/stages manually.
Ideally, I want to find the spills programmatically using PySpark.
You can use the SpillListener class as shown below:
spillListener = spark._jvm.org.apache.spark.SpillListener()
print(spillListener.numSpilledStages())
If you need more details, you have to extend that class and override its methods.
However, I think we can't add custom listeners directly from PySpark; we have to do it via Scala. Refer to this.
You can refer to this page to see how to implement a SpillListener in Scala.
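For the listener to count anything, it has to be attached to the SparkContext before the job runs. A rough PySpark sketch of that wiring, assuming (as in the snippet above) that the SpillListener class is available on the driver's classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# instantiate the JVM-side listener
spill_listener = spark._jvm.org.apache.spark.SpillListener()

# attach it to the underlying Scala SparkContext so it receives stage/task events
spark.sparkContext._jsc.sc().addSparkListener(spill_listener)

# ... run the workload you want to inspect here ...

# afterwards, ask the listener how many stages spilled
print(spill_listener.numSpilledStages())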

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use PySpark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I try to make a PySpark dataframe inside the kernel function of my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  ## list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
This throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post, but I don't understand the answer.
Questions:
I think this means that, since my kernel_f calls sc.* methods, it is against the rules. Is that right?
I (think I) could use the plain-old Python (not PySpark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe, but that doesn't seem to leverage PySpark much. Does this seem like a good route? (See the sketch after these questions.)
Can you teach me a good, ideally a "tied-for-best", way to do this?
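For concreteness, that plain-Python map/reduce route might look roughly like the sketch below. It is only a sketch: it assumes all files share one schema, and path1/path2 are the same placeholder paths as above.

import functools
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
file_path_list = [path1, path2]  # list of string path variables, as above

# reads are lazy, so this loop mostly builds query plans on the driver
dfs = [spark.read.options(delimiter=",", header=True).csv(p) for p in file_path_list]

# union all the per-file dataframes into one
combined = functools.reduce(lambda left, right: left.unionByName(right), dfs)
combined.show(3)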
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems they don't allow workers to send work to the scheduler, which makes sense, since that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so the read does not happen immediately. I feel like if it worked, it would pass a DAG back, not data.
It also doesn't sound good because you want the loading of the files to be lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluates immediately. That makes running out of memory even more likely.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how large the out-of-memory portion can be relative to the available memory. You will inevitably run into errors if you don't have enough memory by a considerable margin.
I think you should use the wildcard approach as noted above.
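A minimal sketch of that wildcard (or multi-path) read; the glob pattern and file_path_list are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# one reader call handles all the files and returns a single dataframe
df = spark.read.options(delimiter=",", header=True).csv("/data/csvs/*.csv")

# the reader also accepts an explicit list of paths
df = spark.read.options(delimiter=",", header=True).csv(file_path_list)
df.show(3)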

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but it cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input data has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class has methods for this operation that require a string to specify a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production, I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to the parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.
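For comparison, PySpark can already build a DataFrame from JSON records held in memory by handing the reader an RDD of JSON strings (the Java/Scala API has an analogous DataFrameReader.json(Dataset<String>) overload since Spark 2.2). A minimal sketch with made-up records:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

json_strings = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']  # illustrative records
df = spark.read.json(spark.sparkContext.parallelize(json_strings))
df.show()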

Is a Spark application equivalent to user code?

I want to know if the concept of a Spark application is equivalent to "user code". I mean: is a Spark application the same thing as the user code or script that uses the Spark framework (like PySpark in Python)?
If I understand your question correctly:
In general, your Spark scripts are the same as regular code.
But there are some differences. When you run Spark, most of your code is evaluated lazily and executed only on actions (like collect, show, count, etc.). Before execution, these operations are optimized under the hood and might not run in the same order as they appear in the script. For example, filters are pushed upstream, closer to the data source.
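A tiny illustration of that laziness; nothing runs until the action at the end:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)              # transformation: builds a plan, does no work
even = df.filter(F.col("id") % 2 == 0)   # still only a plan; the optimizer may rearrange such steps
print(even.count())                      # action: the optimized plan is executed now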
This course is good for general understanding: https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/course/ (of course there are other and newer resources).
And talking about PySpark: it is just an API to the Spark framework, and you might have code that runs in Python and then calls PySpark for data processing.

What PySpark API calls require the same version of Python on the workers in yarn-client mode

Usually I run my code with a different version of Python in the driver than in the worker nodes, using yarn-client mode.
For instance, I usually use python3.5 in the driver and the default python2.6 in the workers, and this works pretty well.
I am currently in a project where we need to call
sqlContext.createDataFrame
But this seems to try to execute the statement in Python on the workers, and then I run into the requirement of installing the same version of Python on the workers, which is what I am trying to avoid.
So, to use "sqlContext.createDataFrame", is it a requirement to have the same Python version in the driver and the workers?
And if so, which other "pure" pyspark.sql API calls would also have this requirement?
Thanks,
Jose
Yes, the same Python version is the requirement in general. Some API calls may not fail because no Python executor is in use, but it is not a valid configuration.
Every call that interacts with Python code, like udf or DataFrame.rdd.*, will trigger the same exception.
If you want to avoid upgrading the cluster's Python, then use Python 2 on the driver.
In general, many PySpark operations are just wrappers around Spark operations on the JVM. For these operations it doesn't matter what version of Python is used on the worker, because no Python is executed on the worker, only JVM operations.
Examples of such operations include reading a dataframe from a file and all built-in functions which do not require Python objects/functions as input.
Once a function requires an actual Python object or function, this becomes a little trickier.
Let's say, for example, that you want to use a UDF with lambda x: x + 1 as the function.
Spark doesn't really know what the function is. Instead, it serializes it and sends it to the worker, which deserializes it in turn.
For this serialization/deserialization process to work, the Python versions on both sides need to be compatible, and that is often not the case (especially between major versions).
All of this leads us to createDataFrame. If you use an RDD as one of the parameters, for example, the RDD would contain Python objects as the records, and these would need to be serialized and deserialized, so both sides must run the same version.
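A small illustrative contrast, assuming an existing SparkSession named spark and a hypothetical dataset with id and name columns:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.read.parquet("/some/table")            # placeholder path; pure JVM work, no Python on the workers

df.select(F.length("name")).show()                # built-in function: still JVM-only

add_one = F.udf(lambda x: x + 1, IntegerType())   # Python UDF: workers must now run a compatible Python
df.select(add_one("id")).show()

df.rdd.map(lambda r: r["id"]).count()             # DataFrame.rdd.* with a Python lambda: same requirement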
