Serialization of an object used in foreachRDD() when CheckPointing - apache-spark

According to this question and the documentation I've read, Spark Streaming's foreachRDD(someFunction) will have someFunction itself executed in the driver process ONLY, though any operations performed on the RDDs will be done on the executors - where the RDDs sit.
All of the above works for me as well, but I noticed that if I turn on checkpointing, Spark seems to try to serialize everything in foreachRDD(someFunction) and send it somewhere - which causes an issue for me because one of the objects used is not serializable (namely schemaRegistryClient). I tried the Kryo serializer but had no luck.
The serialization issue goes away if I turn checkpointing off.
Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Thanks a lot.

Is there a way to let Spark not to serialize what's used in foreachRDD(someFunc) while also keep using checkpointing?
Checkpointing shouldn't have anything to do with your problem. The underlying issue is the fact that you have a non-serializable object instance which needs to be sent to your workers.
There is a general pattern to use in Spark when you have such a dependency. You create an object with a lazy transient property which will be initialized on the worker nodes when needed:
object RegisteryWrapper {
  @transient lazy val schemaClient: SchemaRegisteryClient = new SchemaRegisteryClient()
}
And when you need to use it inside foreachRDD:
someStream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    val schemaClient = RegisteryWrapper.schemaClient
    iterator.foreach(schemaClient.send(_))
  }
}

A couple of things here are important:
You cannot use a client created on the driver in code that is executed on the workers (that is, inside RDD operations), because it would have to be serialized and shipped.
You can create an object with a transient client field and have it re-created once the job is restarted (see the sketch below). An example of how to accomplish this can be found here.
The same principle applies to Broadcast and Accumulator variables.
Checkpointing persists data, job metadata, and code logic. When the code is changed, your checkpoints become invalid.
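For reference, here is a minimal sketch of how a checkpointed streaming job is usually wired up so it can be re-created on restart. This is not the asker's code; the checkpoint path, batch interval, and app name are placeholder assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream") // placeholder app name
  val ssc = new StreamingContext(conf, Seconds(10))            // placeholder batch interval
  ssc.checkpoint("hdfs:///tmp/checkpoints")                     // placeholder checkpoint dir
  // Define the sources and the foreachRDD logic here, inside the factory,
  // so Spark can rebuild the same graph from the checkpoint on restart.
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()

Because the whole DStream graph (including the closure passed to foreachRDD) is written into the checkpoint, everything that closure captures must be serializable - which is why the transient lazy singleton above avoids the problem.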

The issue might be with the checkpoint data: if you changed anything in your code, you need to remove your old checkpoint metadata.

Related

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an UnsupportedOperationException.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class has methods for this operation that require a string to specify a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.
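As a side note on the JSON question: one hedged sketch, shown in Scala for brevity rather than the asker's Java, is to hand DataFrameReader a Dataset[String] instead of a path (available since Spark 2.2). The sample records below are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-from-strings").getOrCreate()
import spark.implicits._

// Hypothetical JSON records held in memory rather than read from a path.
val jsonLines = Seq(
  """{"id": 1, "name": "sensor-a", "value": 0.42}""",
  """{"id": 2, "name": "sensor-b", "value": 1.37}"""
)

// spark.read.json(Dataset[String]) parses the strings directly; no filesystem path is needed.
val df = spark.read.json(jsonLines.toDS())
df.printSchema()
df.show()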

Pyspark Checkpointing in Dataproc (StackOverflowError)

I faced a StackOverflowError when persisting a dataset with pyspark. I am casting the whole dataframe to DoubleType and then persisting it to calculate statistics, and I read that checkpointing is a solution to the stack overflow. However, I am having trouble implementing it in Dataproc.
I am working with pyspark, and when I checkpoint the dataframe and check it with df.isCheckpointed(), it returns False. However, when I debug it, df.rdd.is_checkpointed says True. Is there an issue with the package, or am I doing something wrong?
I thought localCheckpoint would be more appropriate for my purpose (https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint()), as my problem was simply that the DAG was too deep, but I couldn't find any use case. Also, if I just checkpoint, the RDD says it is checkpointed (as in the first question), but if I try localCheckpoint, it says it is not. Has anyone tried this function?
After trying local standalone mode, I tried it on Dataproc. I tried both HDFS and Google Cloud Storage, but either way the storage was empty, even though the RDD says it is checkpointed.
Can anyone help me with this? Thanks in advance!
If you're using localCheckpoint, it will write to the local disk of executors, not to HDFS/GCS: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint--.
Also note that there's an eager (checkpoint immediately) and non-eager (checkpoint when the RDD is actually materialized) mode to checkpointing. That may affect what those methods return. The code is often the best documentation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1615.
In general, please attach sample code (repros) to questions like this -- that way we can answer your question more directly.
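To illustrate the eager/non-eager and local/reliable distinctions, here is a hedged sketch using the Scala Dataset API that the linked source backs (the pyspark methods mirror it); the paths and sizes are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-modes").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder reliable location

val df = spark.range(1000000).toDF("id")

// Eager (the default): the plan is executed and checkpoint files are written immediately.
val eagerCp = df.checkpoint()

// Non-eager: only marked for checkpointing; files appear once an action materializes it.
val lazyCp = df.checkpoint(eager = false)
lazyCp.count()

// localCheckpoint stores blocks on executor local storage, not in the checkpoint
// directory, so an empty HDFS/GCS checkpoint path is expected with this variant.
val localCp = df.localCheckpoint()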
Set a checkpoint directory (a temp dir here) using your SparkSession object:
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
dataframe_name = # Any Spark Dataframe of type <pyspark.sql.dataframe.DataFrame>
At this point dataframe_name is just a DAG (a lazy plan), which you can materialize as a checkpoint like:
dataframe_checkpoint = dataframe_name.checkpoint()
dataframe_checkpoint is also a Spark dataframe of type <pyspark.sql.dataframe.DataFrame>, but instead of the DAG it stores the result of the query.
Use checkpoints if:
the computation takes a long time
the computing chain is too long
it depends on too many RDDs

Get SparkSession in partition loop [duplicate]

I'm writing Spark Jobs that talk to Cassandra in Datastax.
Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.
You can do this by calling the SparkContext getOrCreate method.
Now, sometimes there are concerns that referring to the SparkContext inside a Spark job can capture a large object (the SparkContext) which is not serializable and try to distribute it over the network.
In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.
One day my tech lead came to me and said
Don't use SparkContext getOrCreate; you can and should use joins instead
But he didn't give a reason.
My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?
TL;DR There are many legitimate applications of the getOrCreate methods, but attempting to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, although there are some caveats, most notably:
In its simplest form it doesn't allow you to set job-specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in scope.
It can lead to opaque code with "magic" dependencies. This affects testing strategies and overall code readability.
However, your question, specifically:
Now, sometimes there are concerns that referring to the SparkContext inside a Spark job can capture a large object (the SparkContext) which is not serializable and try to distribute it over the network.
and
Don't use SparkContext getOrCreate; you can and should use joins instead
suggests you're actually using the method in a way that it was never intended to be used - by using SparkContext on an executor node:
val rdd: RDD[_] = ???

rdd.map(_ => {
  val sc = SparkContext.getOrCreate()
  ...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one, SparkContext initialized on the driver, and the Apache Spark developers have done a lot to prevent users from any attempt to use SparkContext outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is a fundamental feature of Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
Describes the processing pipeline in a way that can be translated into actual tasks.
Enables graceful recovery in case of task failures.
Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext, cyclic dependencies are not an issue - RDDs and Datasets exist only in the scope of their parent context, so you won't be able to refer to objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time, the cluster manager won't have any indication that the applications are somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to work around this, with careful resource allocation and use of manager-level scheduling pools, or even a separate cluster manager with its own set of resources, but it is not something Spark is designed for; it is not supported, and overall it would lead to a brittle and convoluted design, where correctness depends on configuration details, the specific cluster manager choice, and overall cluster utilization.
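For completeness, here is a hedged sketch of what "use joins instead" typically means in practice; the data sources, column names, and paths below are invented for illustration, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-instead").getOrCreate()

// Hypothetical stand-ins for the two datasets you would otherwise try to
// combine by calling SparkContext.getOrCreate inside a map.
val events  = spark.read.parquet("/data/events")   // large fact data
val devices = spark.read.parquet("/data/devices")  // small lookup table

// Express the lookup as a join planned on the driver; broadcast() hints a
// map-side join without ever touching SparkContext on the executors.
val enriched = events.join(broadcast(devices), Seq("device_id"))
enriched.write.parquet("/data/enriched")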

How to access cached data in Spark Streaming application?

I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.
I'd like to save in memory (RAM) some specific fields of my JSON data, which I believe I could achieve using the cache() and persist() operators.
The next time I receive new JSON data in the Spark Streaming application, I check in memory (RAM) whether there are fields in common that I can retrieve. If so, I do some simple computations and finally update the values of the fields I saved in memory (RAM).
Thus, I would like to know if what I previously described is possible. If yes, do I have to use cache() or persist()? And how can I retrieve my fields from memory?
It's possible with cache / persist, which use memory or disk for data in Spark applications (not only Spark Streaming applications -- it's a more general use of caching in Spark).
But... in Spark Streaming you've got special support for such use cases, which are called stateful computations. See the Spark Streaming Programming Guide to explore what's possible.
I think for your use case the mapWithState operator is exactly what you're after.
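A minimal sketch of mapWithState, assuming the JSON has already been parsed into (key, value) pairs; the key/value types and the summing logic are placeholders, not the asker's schema:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stream of (fieldName, reading) pairs extracted from the Kafka JSON.
val readings: DStream[(String, Double)] = ???

// One state value is kept per key across batches (requires ssc.checkpoint(...)).
val spec = StateSpec.function { (key: String, value: Option[Double], state: State[Double]) =>
  val updated = state.getOption.getOrElse(0.0) + value.getOrElse(0.0)
  state.update(updated)   // overwrite the stored value for this key
  (key, updated)          // emitted downstream for this batch
}

val runningTotals = readings.mapWithState(spec)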
Spark does not work that way. Please think it through in a distributed way.
For the first part, keeping data in RAM: you can use either cache() or persist(), as by default they keep the data in the memory of the workers.
You can verify this from the Apache Spark code.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As far as I understand, you need the updateStateByKey operation to implement your second use case!
For more on windowing, see here.
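And a matching hedged sketch of updateStateByKey, under the same assumptions as above (pre-parsed (field, value) pairs, placeholder summing logic):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical (fieldName, reading) pairs extracted from the incoming JSON.
val fields: DStream[(String, Double)] = ???

// updateStateByKey keeps one running state value per key across batches;
// like mapWithState, it requires ssc.checkpoint(...) to be configured.
val runningState = fields.updateStateByKey[Double] { (newValues: Seq[Double], state: Option[Double]) =>
  Some(state.getOrElse(0.0) + newValues.sum)
}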
