Read inside mapGroupsWithState - apache-spark

I am a bit blocked on this. Is it possible to read from a source inside mapGroupsWithState? It seems the SparkSession is not available there. Is that the case? Should I think about another solution?
Similar question here:
Batch Read inside flatMapGroupWithState
From what I know, you cannot use the SparkSession inside transformations, since the SparkSession only exists on the driver. I was expecting to get a Task not serializable error, but instead I get "no active session". Any help would be appreciated.
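A common workaround, sketched here in PySpark for illustration (the lookup path, the key column, and the rate source below are placeholders, and this is not necessarily what the asker ended up doing): read the extra source on the driver into a static DataFrame and join it with the streaming Dataset before the stateful step, instead of calling spark.read inside mapGroupsWithState.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("stream-static-join").getOrCreate()

# static lookup data, read once on the driver (path and "key" column are placeholders)
lookup_df = spark.read.option("header", True).csv("/data/lookup.csv")

# streaming source; the rate source is used only to keep the sketch self-contained
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# enrich the stream with the static data before any stateful processing,
# rather than reading inside the stateful function
enriched = stream_df.join(lookup_df, stream_df["value"].cast("string") == lookup_df["key"], "left")

query = enriched.writeStream.format("console").start()
query.awaitTermination()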

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I am trying to make a Pyspark dataframe inside the kernel function of my map call, which results in an error:
# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  ## list of string path variables
# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)
# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df
# apply .map
rdd2 = rdd.map(kernel_f)
# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2 into one dataframe.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use a plain-old python (not pyspark) map to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route? (A sketch of this route follows the questions.)
Can you teach me a good, ideally a "tied-for-best" way to do this?
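For what it's worth, a minimal sketch of that map-plus-reduce route, assuming all files share a schema (the paths below are placeholders); the reads are issued from the driver, so no SparkContext is referenced on a worker:
import functools
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
file_path_list = ["/data/a.csv", "/data/b.csv"]  # placeholder paths

def kernel_f(path):
    # spark.read is called on the driver; the scan itself is still distributed when an action runs
    return spark.read.options(delimiter=",", header=True).csv(path)

# plain Python map + reduce on the driver
dfs = map(kernel_f, file_path_list)
combined = functools.reduce(lambda a, b: a.unionByName(b), dfs)
combined.show(3)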
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here
A Spark cluster is composed of the scheduler (driver) and the workers. You use the SparkSession to pass work to the scheduler. Workers are not allowed to send work back to the scheduler, presumably because that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so the read does not happen immediately. I suspect that if it worked, it would pass back a DAG, not data.
That doesn't sound good, because you want the loading of files to be lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluates immediately, making you even more likely to run out of memory.
Speaking of memory, Spark lets you compute on data larger than memory, but there are limits on how much larger the data can be relative to the memory available. You will inevitably run into errors if you don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
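For reference, a minimal sketch of the wildcard approach (the paths below are placeholders); in PySpark, spark.read.csv also accepts a list of paths, which covers the hundreds-of-files case directly:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# read every CSV matching the pattern into a single DataFrame (pattern is a placeholder)
df = spark.read.options(delimiter=",", header=True).csv("/data/csvs/*.csv")

# or pass the explicit list of paths you already have
file_path_list = ["/data/csvs/a.csv", "/data/csvs/b.csv"]
df2 = spark.read.options(delimiter=",", header=True).csv(file_path_list)

df.printSchema()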

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against them. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input, but for streaming I don't want to build a list in memory; I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader methods for this operation require a string that specifies a path. The nature of the path is not documented, but I assume it represents a path in the file system, or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to the parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
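On the second point, Spark can read JSON that is already in memory rather than on a path: the Java API has a spark.read().json(Dataset<String>) overload, and PySpark's reader accepts an RDD of JSON strings. A minimal sketch in PySpark (the records here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("json-from-memory").getOrCreate()

# JSON records held in memory as strings (contents invented for the example)
json_lines = ['{"id": 1, "name": "alpha"}', '{"id": 2, "name": "beta"}']

# no file path needed: parallelize the strings and hand the RDD to the JSON reader
df = spark.read.json(spark.sparkContext.parallelize(json_lines))
df.show()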
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.

SparkContext can only be used on the driver

I am trying to use the SparkContext.binaryFiles function to process a set of ZIP files. The setup is to map over an RDD of filenames, where the mapping function calls the binaryFiles function.
The problem is that SparkContext is referenced in the mapping function, and I'm getting this error. How can I fix it?
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Sample code:
file_list_rdd.map(lambda x: sc.binaryFiles("/FileStore/tables/xyz/" + x[1]))
where file_list_rdd is a RDD of (id, filename) tuples.
It would appear that you need to call the function without referencing the SparkContext, if that is actually applicable here.
Also consider moving the function / def into the map body statement(s) itself. That is commonly done, and we are working in a functional language; I have been at a loss to resolve serialization errors unless I resort to that approach and move the defs into the executor logic.
Some file processing is also done via the driver. This post could be of interest: How to paralelize spark etl more w/out losing info (in file names). Based on your code snippet, this is what I would be looking at here.
And you should use something like this and process accordingly:
zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
Now you are calling binaryFiles from the driver, where the sc is available.
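Continuing from the zip_data RDD above (so sc is the driver-side SparkContext), a minimal sketch of processing the archives; the member-listing step is only an illustration of how to open the (path, bytes) pairs that binaryFiles returns:
import io
import zipfile

def list_members(path_and_content):
    path, content = path_and_content
    # open the in-memory bytes of one archive and list its entries
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        return [(path, name) for name in zf.namelist()]

members = zip_data.flatMap(list_members)
print(members.take(5))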

Pyspark Checkpointing in Dataproc (StackOverflowError)

I ran into a StackOverflowError when persisting a dataset with pyspark. I am casting the whole dataframe to DoubleType and then persisting it to calculate statistics, and I read that checkpointing is a solution to the StackOverflowError. However, I am having trouble implementing it in Dataproc.
I am working with pyspark, and when I checkpoint the dataframe and check it with df.isCheckpointed(), it returns False. However, when I debug it, df.rdd.is_checkpointed says True. Is there an issue with the package, or am I doing something wrong?
I thought localCheckpoint would be more appropriate for my purpose (https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint()), as my problem was simply that the DAG was too deep, but I couldn't find any use case. Also, if I just checkpoint, the RDD says it is checkpointed (as in the first question), but if I try localCheckpoint, it says it is not. Has anyone tried this function?
After trying this in local standalone mode, I tried it on Dataproc. I tried both HDFS and Google Cloud Storage, but either way the storage was empty, even though the rdd says it is checkpointed.
Can anyone help me with this? Thanks in advance!
If you're using localCheckpoint, it will write to the local disk of executors, not to HDFS/GCS: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint--.
Also note that there's an eager (checkpoint immediately) and non-eager (checkpoint when the RDD is actually materialized) mode to checkpointing. That may affect what those methods return. The code is often the best documentation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1615.
In general, please attach sample code (repros) to questions like this -- that way we can answer your question more directly.
Set a checkpoint directory using your SparkSession object:
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
dataframe_name = ...  # any Spark DataFrame of type <pyspark.sql.dataframe.DataFrame>
At this point, dataframe_name is just a DAG of transformations, which you can materialize as a checkpoint like this:
dataframe_checkpoint = dataframe_name.checkpoint()
dataframe_checkpoint is also a Spark dataframe of type <pyspark.sql.dataframe.DataFrame>, but instead of carrying the whole DAG it references the stored result of the query.
Use checkpoints if:
the computation takes a long time
the chain of computations is too long
it depends on too many RDDs
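For the deep-DAG case specifically, a minimal sketch (the path, loop count, and checkpoint frequency are placeholder choices; on Dataproc, point setCheckpointDir at an HDFS or GCS path):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder directory

df = spark.range(1000).withColumn("value", F.col("id").cast("double"))

# many chained transformations build a deep lineage; checkpointing every few
# iterations materializes the data and truncates the DAG
for i in range(50):
    df = df.withColumn("value", F.col("value") * 1.01)
    if i % 10 == 0:
        df = df.checkpoint()  # eager by default

df.agg(F.avg("value")).show()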

Serialization of an object used in foreachRDD() when CheckPointing

According to this question and the documentation I've read, Spark Streaming's foreachRDD(someFunction) will have someFunction itself executed in the driver process only, though any operations performed on the RDDs inside it are executed on the executors, where the RDDs sit.
All of the above works for me as well, though I noticed that if I turn on checkpointing, Spark seems to try to serialize everything in foreachRDD(someFunction) and send it somewhere, which causes a problem for me because one of the objects used is not serializable (namely schemaRegistryClient). I tried the Kryo serializer, but no luck there either.
The serialization issue goes away if I turn checkpointing off.
Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Thanks a lot.
Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Checkpointing shouldn't have anything to do with your problem. The underlying issue is the fact that you have a non serializable object instance which needs to be sent to your workers.
There is a general pattern to use in Spark when you have such a dependency. You create an object with a lazy, transient property that is initialized on the worker nodes when needed:
object RegisteryWrapper {
  @transient lazy val schemaClient: SchemaRegisteryClient = new SchemaRegisteryClient()
}
And when you need to use it inside foreachRDD:
someStream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    val schemaClient = RegisteryWrapper.schemaClient
    iterator.foreach(schemaClient.send(_))
  }
}
A couple of things here are important:
You cannot use this client in code that is executed on the workers (that is, inside RDD operations).
You can create an object with a transient client field and have it re-created once the job is restarted. An example of how to accomplish this can be found here.
The same principle applies to Broadcast and Accumulator variables.
Checkpointing persists data, job metadata, and code logic. When the code is changed, your checkpoints become invalid.
The issue might be with the checkpoint data: if you changed anything in your code, you need to remove your old checkpoint metadata.
