Apache Spark: Python function serialized automatically

I was going through the Apache Spark documentation. The Spark docs for Python say the following:
...We can pass Python functions to Spark, which are automatically
serialized along with any variables that they reference...
I don't fully understand what this means. Does it have something to do with the RDD type?
What does it mean in the context of Spark?

The serialization is necessary when using PySpark because the function you define locally needs to be executed remotely on each of the worker nodes. This concept isn't really related to the RDD type.
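For example, here is a minimal sketch of that idea (a local SparkSession is assumed and the variable and function names are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("closure-demo").getOrCreate()
sc = spark.sparkContext

threshold = 10  # a plain local variable referenced by the function below

def keep_large(x):
    # 'threshold' is captured from the enclosing scope; PySpark pickles this
    # function together with that variable and ships both to the worker
    # processes that evaluate the filter.
    return x > threshold

rdd = sc.parallelize(range(20))
print(rdd.filter(keep_large).collect())  # [11, 12, ..., 19]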

Related

Do User Defined Functions (UDFs) in Spark work in a distributed way?

Do User Defined Functions (UDFs) in Spark work in a distributed way when the data is stored on different nodes, or do they accumulate all the data onto the master node for processing? If they work in a distributed way, can we convert any Python function, whether pre-defined or user-defined, into a Spark UDF as mentioned below:
spark.udf.register("myFunctionName", functionNewName)
A Spark DataFrame is distributed across the cluster in partitions. The UDF is applied to each partition on the executor that holds it, so the answer is yes. You can also see this in the Spark UI.
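As a hedged illustration of that answer, here is a minimal sketch of registering and applying a UDF (the function, column and value names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

def str_length(s):
    # A plain Python function; Spark serializes it and applies it to each
    # partition of the DataFrame on whichever executor holds that partition.
    return len(s) if s is not None else 0

# Register it for use in SQL, as in the question...
spark.udf.register("strLength", str_length, IntegerType())

# ...and/or wrap it for the DataFrame API.
str_length_udf = udf(str_length, IntegerType())

df = spark.createDataFrame([("spark",), ("arrow",), (None,)], ["word"])
df.withColumn("len", str_length_udf("word")).show()
spark.sql("SELECT strLength('distributed') AS len").show()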

What is the difference between spark.read.parquet and pyarrow.hdfs.connect().read_parquet?

I have files in HDFS in Parquet format; there are two options to read them:
spark.read.parquet(hdfs_path)
pyarrow.hdfs.connect().read_parquet(hdfs_path)
May I know what the difference is between these two, and which one is better?
Thanks.
The first piece of code, the spark.read.parquet() function, is native to Apache Spark. It is a method defined on the DataFrameReader class in Apache Spark's source code and is implemented in Scala.
The second piece of code, pyarrow.hdfs.connect().read_parquet(hdfs_path), also reads Parquet files from HDFS, but is implemented in Apache Arrow and exposed through the PyArrow library in Python.
The first code snippet will read your Parquet data into a distributed Spark DataFrame, so you have all of Spark's parallel processing capabilities available from the get-go. The second pulls the data into an in-memory Arrow Table in a single Python process, so it gives you no distributed processing by itself.
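For a rough, hedged sketch of the practical difference (the path below is hypothetical, and pyarrow.hdfs is the legacy API that newer PyArrow versions replace with pyarrow.fs):

from pyspark.sql import SparkSession
import pyarrow.hdfs as hdfs  # legacy API; newer pyarrow uses pyarrow.fs

spark = SparkSession.builder.appName("parquet-compare").getOrCreate()
hdfs_path = "/data/events.parquet"  # hypothetical path

# Option 1: a distributed Spark DataFrame; the work runs on the executors.
sdf = spark.read.parquet(hdfs_path)
print(sdf.count())

# Option 2: an in-memory pyarrow.Table pulled into this single Python process.
table = hdfs.connect().read_parquet(hdfs_path)
pdf = table.to_pandas()  # a local pandas DataFrame, limited by local memory
print(len(pdf))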

Pass Python functions to a Scala RDD in PySpark

I have a Scala library which (to put it simply) receives a function, applies it to an RDD and returns another RDD:
def runFunction(rdd: RDD, function: Any => Any) = {
  ...
  val res = rdd.map(function)
  ...
}
In Scala the usage would be:
import mylibrary.runFunction
runFunction(myRdd, myScalaFun)
This library is packaged in a jar, and now I want to use it from Python too. What I would like to do is load this library in Python and pass a Python function to it. Usage in Python would be:
spark._jvm.mylibrary.runFunction(myPythonRdd, myPythonFun)
This would allow me to use Python functions as well as Scala ones without having to port the whole library to Python. Is this something that can be achieved with Spark's capabilities for going back and forth between Python and the JVM?
There are some subtleties in the way Python and the JVM communicate in PySpark. The bridge uses Java objects, i.e., JavaRDD and not RDD, and those need explicit unboxing in Scala. Since your Scala function takes an RDD, you need to write a wrapper in Scala that receives a JavaRDD and performs the unboxing first:
def runFunctionWrapper(jrdd: JavaRDD, ...) = {
  runFunction(jrdd.rdd, ...)
}
Then call it like
spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)
Note that by Python convention, _jrdd is considered a private member of the Python RDD class, so this is effectively relying on an undocumented implementation detail. Same applies to the _jvm member of SparkContext.
The real problem is making Scala call back into Python for the application of function. In PySpark, the Python RDD's map() method creates an instance of org.apache.spark.api.python.PythonFunction, which holds a pickled reference to the Python mapper function together with its environment. Then each RDD partition gets serialised and, together with the pickled function, sent over TCP to a Python process colocated with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of org.apache.spark.api.python.PythonRunner. This is very different from building a wrapper around the Python function and passing it to the map() method of the RDD instance.
I believe it is best if you simply replicate the functionality of runFunction in Python or (much better performance-wise) replicate the functionality of myPythonFun in Scala. Or, if what you do can be done interactively, follow the suggestion of @EnzoBnl and make use of a polyglot notebook environment like Zeppelin or Polynote.
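For the first suggestion, a minimal PySpark-side sketch of replicating runFunction, assuming its Scala body really boils down to the rdd.map(function) shown in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("run-function-demo").getOrCreate()
sc = spark.sparkContext

def run_function(rdd, function):
    # Python-side equivalent of the Scala runFunction sketched above:
    # apply the function to every element of the RDD.
    return rdd.map(function)

my_python_rdd = sc.parallelize(range(10))  # placeholder input
my_python_fun = lambda x: x * x            # placeholder function
print(run_function(my_python_rdd, my_python_fun).collect())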

Spark - JDBC read all happens on the driver?

I have Spark reading from a JDBC source (Oracle). I specify lowerBound, upperBound, numPartitions and partitionColumn, but looking at the web UI all the reading is happening on the driver, not the workers/executors. Is that expected?
In the Spark framework, in general, whatever code you write inside a transformation such as map, flatMap, etc. will be executed on the executors. To invoke a transformation you need an RDD, which is created from the dataset you are trying to compute on. To materialize the RDD you need to invoke an action so that the transformations are applied to the data.
I believe that in your case you have written a Spark application that reads the JDBC data directly in driver code. If that is the case, it will all be executed on the driver and not on the executors.
If you haven't already, try creating a DataFrame with the DataFrameReader JDBC API and the partitioning options you mention.
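For illustration, a hedged sketch of such a partitioned JDBC read with the DataFrameReader API (the connection details, table and column names are invented; partitionColumn must be a numeric, date or timestamp column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# With these options each executor opens its own JDBC connection and reads
# one slice of the table, instead of the driver pulling everything through
# a single connection.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "SALES")
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "SALE_ID")
      .option("lowerBound", 1)
      .option("upperBound", 1000000)
      .option("numPartitions", 8)
      .load())

df.groupBy("REGION").count().show()  # this action triggers the parallel read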

How to chain multiple jobs in Apache Spark

I would like to know whether there is a way to chain jobs in Spark, so that the output RDD (or other format) of the first job is passed as input to the second job.
Is there any API for it in Apache Spark? Is this even an idiomatic approach?
From what I have found, there is a way to spin up another process through the YARN client, for example Spark - Call Spark jar from java with arguments, but this assumes that you save the output to some intermediate storage between jobs.
There are also runJob and submitJob on SparkContext, but are they a good fit for it?
Use the same RDD definition for the output of the first job and the input of the second: if both steps run in the same Spark application, you can simply keep the reference to the output RDD and chain further transformations onto it, as sketched below.
The other option is to use DataFrames instead of RDDs and figure out the schema at run-time.
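A minimal sketch of chaining two jobs inside one application by reusing the RDD handle (the paths and transformations are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-jobs").getOrCreate()
sc = spark.sparkContext

# "Job 1": build and materialize an intermediate RDD.
cleaned = (sc.textFile("hdfs:///data/raw/*.log")   # hypothetical input path
             .filter(lambda line: "ERROR" not in line))
cleaned.cache()                                    # keep it around for the next job
cleaned.count()                                    # action -> first Spark job

# "Job 2": consumes the same RDD handle as its input, no intermediate storage needed.
word_counts = (cleaned.flatMap(lambda line: line.split())
                      .map(lambda w: (w, 1))
                      .reduceByKey(lambda a, b: a + b))
word_counts.saveAsTextFile("hdfs:///data/out/word_counts")  # action -> second Spark job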
