SparkContext can only be used on the driver - apache-spark

I am trying to use the SparkContext.binaryFiles function to process a set of ZIP files. The setup is to map over an RDD of filenames, where the mapping function calls binaryFiles.
The problem is that SparkContext is referenced in the mapping function, and I'm getting this error. How can I fix it?
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Sample code:
file_list_rdd.map(lambda x: sc.binaryFiles("/FileStore/tables/xyz/" + x[1]))
where file_list_rdd is an RDD of (id, filename) tuples.

It would appear that you need to call the function without referencing the SparkContext at all, where that is applicable.
Also consider moving the function / def into the map body itself. That is commonly done, and we are using a functional language; I have generally been unable to resolve serialization errors without doing exactly that and moving defs into the executor logic.
Some file processing can also be done on the driver. This post could be of interest: How to paralelize spark etl more w/out losing info (in file names). Based on your code snippet, that is what I would be looking at here.
Instead, you should use something like this and process the results accordingly:
zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
This way binaryFiles is invoked from the driver, where the sc is available.
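To sketch that driver-side pattern end to end: binaryFiles yields (path, bytes) pairs, and the per-file ZIP handling can live in a plain Python helper that is safe to ship to executors because it never touches the SparkContext. The helper below is illustrative (the function name and the returned tuple shape are my own); only the zipfile handling is standard library.

```python
import io
import zipfile

def zip_entries(path_and_bytes):
    """Plain-Python mapper: takes one (path, bytes) pair as produced by
    sc.binaryFiles and yields (zip_path, member_name, member_bytes) tuples.
    It references no SparkContext, so it pickles cleanly to executors."""
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            yield (path, name, zf.read(name))

# On the driver you would then write something like (hypothetical path):
#   sc.binaryFiles("/FileStore/tables/xyz/*.zip").flatMap(zip_entries)
```

The key point is that only the helper function, not the context, appears inside the transformation.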

Related

How to access java runtime variables like java.lang.Runtime.getRuntime().maxMemory() for pyspark executors?

The question is all there is. I want a way to check the Java runtime variables of the executor JVMs, but I am working with pyspark. How can I access java.lang.Runtime.getRuntime().maxMemory() from pyspark?
Based on the comment, I have tried to run the following code, but both approaches are unsuccessful:
# create an RDD
l = sc.range(100)
Now, I have to run func = sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory() on each executor. So, I do the following
l.map(lambda x:sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()).collect()
Which results in
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The spark context can only be used on the driver
I also tried
func = sc._gateway.jvm.java.lang.Runtime.getRuntime()
l.map(lambda x:func.maxMemory()).collect()
which results in the following error
TypeError: cannot pickle '_thread.RLock' object
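That second TypeError is not Spark-specific: the py4j gateway object obtained via sc._gateway holds a thread lock on its connection, and Python's pickle refuses to serialize locks, which is exactly what map() must do to ship the closure to executors. A minimal reproduction without Spark:

```python
import pickle
import threading

# py4j's gateway keeps an RLock on its connection; any closure capturing
# such an object fails to pickle, so it can never be shipped to executors.
try:
    pickle.dumps(threading.RLock())
except TypeError as e:
    print(e)  # cannot pickle '_thread.RLock' object
```

On the driver itself the sc._gateway.jvm call works fine; per-executor JVM figures are typically read from Spark's metrics/REST endpoints rather than by calling into the JVM from inside a task.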

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I try to make a Pyspark dataframe in the kernel function of my map, resulting in an error:
# initiate spark session and other variables
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  ## list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use plain-old python (not pyspark) function map to apply the kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all these into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, probably because that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read is not happening immediately. I feel like if it worked, it would pass a DAG back, not data.
That doesn't sound good, because you want the loading of files to be lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/Python, which evaluates immediately. You would run out of memory trying this even sooner.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how large the data can be relative to the memory available. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
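On the driver-side route: PySpark's DataFrameReader.csv also accepts a list of paths directly, so spark.read.options(header=True).csv(file_path_list) covers the whole job without an RDD of paths. The plain-Python map-then-reduce idea from the question can be sketched with list concatenation standing in for DataFrame unions, just to show the shape of the functools.reduce step (read_csv here is a hypothetical stand-in, not spark.read):

```python
import functools

# Stand-in loader: in PySpark this would be spark.read.csv and the
# reduce step would use DataFrame.unionByName instead of list +.
def read_csv(path):
    return [f"row-from-{path}"]  # one fake "row" per file

frames = map(read_csv, ["a.csv", "b.csv", "c.csv"])
combined = functools.reduce(lambda x, y: x + y, frames)
print(combined)  # ['row-from-a.csv', 'row-from-b.csv', 'row-from-c.csv']
```

This driver-side reduce is legal because no sc. call ever runs inside a Spark transformation, but the wildcard/list read remains the simpler option.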

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But for streaming, I don't want to create a list in memory; I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but such a wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an UnsupportedOperationException.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
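One workaround for the sizing problem, sketched rather than taken from any Spark API: chunk the stream into fixed-size batches, parallelize each batch (whose size is known), and union the resulting RDDs. The chunking idea is language-agnostic, so the helper below is plain Python standing in for the Java-side logic; the Spark calls in the trailing comment are illustrative.

```python
import itertools

def batches(iterator, size):
    """Yield successive lists of at most `size` items from an iterator,
    so a stream can be fed to parallelize() chunk by chunk instead of
    being materialized as one List whose size must be known up front."""
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Hypothetical driver-side assembly:
#   rdd = sc.emptyRDD()
#   for chunk in batches(stream, 10_000):
#       rdd = rdd.union(sc.parallelize(chunk))
```

Each chunk has a definite size when parallelize sees it, which sidesteps the early size() call.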
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class's methods for this operation require a string that specifies a path. The nature of the path is not documented, but I assume it represents a path in the file system, or possibly a URL or URI (the documentation doesn't say how Spark resolves it). For testing, I'd prefer to supply the JSON as a string, and in production I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to the parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.

Pass python functions to Scala RDD in pyspark

I have a scala library which (to put it simply) receives a function, applies it to an RDD and returns another RDD
def runFunction(rdd: RDD, function: Any => Any) = {
  ....
  val res = rdd.map(function)
  ...
}
In scala the usage would be
import mylibrary.runFunction
runFunction(myRdd, myScalaFun)
This library is packaged in a jar and I want to now use it in python too. What I would like to do is to load this library in Python and pass to it a python function. Usage in Python would be:
spark._jvm.mylibrary.runFunction(myPythonRdd, myPythonFun)
This would allow me to use python functions as well as Scala ones without the need to port the whole library to python. Is this something that can be achieved with Spark capabilities of going back and forth between Python and JVM?
There are some subtleties in the way Python and JVM in PySpark communicate. The bridge uses Java objects, i.e., JavaRDD and not RDD, and those need explicit unboxing in Scala. Since your Scala function takes an RDD, you need to write a wrapper in Scala that receives a JavaRDD and performs the unboxing first:
def runFunctionWrapper(jrdd: JavaRDD, ...) = {
  runFunction(jrdd.rdd, ...)
}
Then call it like
spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)
Note that by Python convention, _jrdd is considered a private member of the Python RDD class, so this is effectively relying on an undocumented implementation detail. Same applies to the _jvm member of SparkContext.
The real problem is making Scala call back into Python to apply the function. In PySpark, the Python RDD's map() method creates an instance of org.apache.spark.api.python.PythonFunction, which holds a pickled reference to the Python mapper function together with its environment. Each RDD partition then gets serialised and, together with the pickled function, sent over TCP to a Python process colocated with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of org.apache.spark.api.python.PythonRunner. This is very different from building a wrapper around the Python function and passing it to the map() method of the RDD instance.
I believe it is best if you simply replicate the functionality of runFunction in Python or (much better performance-wise) replicate the functionality of myPythonFun in Scala. Or, if what you do can be done interactively, follow the suggestion of @EnzoBnl and make use of a polyglot notebook environment like Zeppelin or Polynote.

Serialization of an object used in foreachRDD() when CheckPointing

According to this question and the documentation I've read, Spark Streaming's foreachRDD(someFunction) will have someFunction itself executed in the driver process ONLY, though if there are operations done on the RDDs, they will be done on the executors, where the RDDs sit.
All of the above works for me as well, though I noticed that if I turn on checkpointing, Spark seems to try to serialize everything in foreachRDD(someFunction) and send it somewhere, which causes an issue for me because one of the objects used is not serializable (namely schemaRegistryClient). I tried the Kryo serializer, but also no luck.
The serialization issue goes away if I turn checkpointing off.
Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Thanks a lot.
Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Checkpointing shouldn't have anything to do with your problem. The underlying issue is the fact that you have a non serializable object instance which needs to be sent to your workers.
There is a general pattern to use in Spark when you have such a dependency: you create an object with a lazy transient property, which will be loaded inside the worker nodes when needed:
object RegisteryWrapper {
  @transient lazy val schemaClient: SchemaRegisteryClient = new SchemaRegisteryClient()
}
And when you need to use it inside foreachRDD:
someStream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    val schemaClient = RegisteryWrapper.schemaClient
    iterator.foreach(schemaClient.send(_))
  }
}
A couple of things here are important:
You cannot use a driver-side client instance in code that is executed on the workers (that is, inside RDD operations).
You can create an object with a transient client field and have it re-created once the job is restarted. An example of how to accomplish this can be found here.
The same principle applies to Broadcast and Accumulator variables.
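The @transient lazy val trick has a direct PySpark analogue: a module-level, lazily created singleton that each worker process builds on first use inside foreachPartition, so nothing non-serializable ever crosses the driver-worker boundary. A minimal sketch, where SchemaClient is a hypothetical stand-in for the non-serializable client:

```python
_client = None  # one instance per Python worker process, never pickled

class SchemaClient:
    """Hypothetical stand-in for a non-serializable registry client."""
    def send(self, record):
        return f"sent:{record}"

def get_client():
    # Lazy initialization, mirroring @transient lazy val: the client is
    # constructed on the worker, on first use, not shipped from the driver.
    global _client
    if _client is None:
        _client = SchemaClient()
    return _client

def handle_partition(iterator):
    client = get_client()
    for record in iterator:
        client.send(record)

# In PySpark this would be wired up as:
#   someStream.foreachRDD(lambda rdd: rdd.foreachPartition(handle_partition))
```

Only handle_partition is pickled; the client itself stays worker-local, so checkpointing has nothing non-serializable to capture.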
Checkpointing persists data, job metadata, and code logic. When the code is changed, your checkpoints become invalid.
The issue might be with the checkpoint data: if you changed anything in your code, you need to remove your old checkpoint metadata.
