Spark create new spark session/context and pick up from failure - apache-spark

The Spark platform where I work is not stable and keeps failing my jobs, for a different reason each time. The job does not die in the Hadoop manager; it lingers as Running, so I want to kill it.
In the same Python script, I would like to kill the current Spark session once there is a failure, create another SparkContext/session, and pick up from the last checkpoint. I checkpoint frequently to keep the DAG from getting too long. The part that tends to fail is a while loop, so I can afford to pick up with the current df.
Any idea how I can achieve that?
My initial thought is:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("test_Terminal").config("spark.sql.broadcastTimeout", "36000").getOrCreate()
flag_finish = False
flag_fail = False
while not flag_finish:
    if flag_fail:  # kill current erroneous session
        sc.stop()
        conf = pyspark.SparkConf().setAll([('spark.executor.memory', '60g'),
            ('spark.driver.memory', '30g'), ('spark.executor.cores', '16'),
            ('spark.driver.cores', '24'), ('spark.cores.max', '32')])
        sc = pyspark.SparkContext(conf=conf)
        spark = SparkSession(sc)
        df = ...  # read back from checkpoint or disk
        flag_fail = False
    # process with current df or df picked up
    while ...:  # loop condition elided; this is where the server tends to fail my job after some time
        try:
            # df processing and update
            ...
            df = df.checkpoint()  # checkpoint() returns the checkpointed DataFrame
            df.count()  # activate checkpoint
            if complete:
                flag_finish = True
        except Exception as e:
            flag_fail = True
            break  # leave the inner loop so the outer loop can rebuild the session
Another question is how to explicitly read from a checkpoint (one written by df.checkpoint()).

Checkpointing in non-streaming Spark is used to sever lineage. It is not designed for sharing data between different applications or different Spark contexts.
What you would like is in fact not possible.
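If the goal is only to resume the loop, data written explicitly to durable storage can play the role that a checkpoint cannot, because a new session can read it back. The sketch below shows just the control flow, under stated assumptions: run_with_restarts and its four callables are hypothetical names, not Spark APIs. With Spark, make_session could wrap SparkSession.builder.getOrCreate(), save_state could call df.write.parquet(...) on a fresh path per iteration, and load_state could call spark.read.parquet(...) on the latest complete path.

```python
def run_with_restarts(make_session, load_state, step, save_state, max_restarts=3):
    """Drive an iterative job, rebuilding the session after a failure.

    All four callables are hypothetical hooks supplied by the caller:
      make_session() -> session object exposing stop()
      load_state(session) -> last saved state (or the initial state)
      step(session, state) -> (new_state, done)
      save_state(session, state) -> None, persists the state durably
    """
    session = make_session()
    state = load_state(session)
    restarts = 0
    while True:
        try:
            state, done = step(session, state)
            save_state(session, state)
            if done:
                return state
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up instead of looping forever
            try:
                session.stop()  # best effort: the old session may already be dead
            except Exception:
                pass
            session = make_session()
            state = load_state(session)  # resume from the last saved state
```

Whether stopping a broken context from inside the same Python process works reliably depends on how the platform fails, so treat the restart branch as something to verify on your cluster.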

Related

(py)Spark checkpointing consumes driver memory

Context
I have a PySpark query that creates a rather large DAG. I therefore break the lineage with checkpoint(eager=True) to shrink it, which normally works.
Note: I do not use localCheckpoint() since I use dynamic resource allocation (see the docs for reference about this).
# --> Pseudo-code! <--
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Collect distributed data sources which results in touching a lot of files
# -> Large DAG
df1 = spark.sql("Select some data")
df2 = spark.sql("Select some other data")
df3 ...
# Bring these DataFrames together to break lineage and shorten DAG
# Note: In "eager"-mode this is executed right away
intermediate_results = df1.union(df2).union(df)....
sc.setCheckpointDir("hdfs:/...")
checkpointed_df = intermediate_results.checkpoint(eager=True)
# Now continue to do stuff
df_X = spark.sql("...")
result = checkpointed_df.join(df_X ...)
Problem
I start the Spark session in client mode (an admin requirement) in a Docker container in a Kubernetes cluster (or rather, a third-party product manages this, as set up by the admins).
When I execute my code and reach intermediate_results.checkpoint(eager=True), two things happen:
I receive a PySpark error about losing the connection to the JVMs, and a resulting call error:
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
...
Py4JError: An error occurred while calling o1540.checkpoint
This is of course a heavily shortened stack trace.
The software controlling the Docker container states:
Engine exhausted available memory, consider a larger engine size.
This refers to the container's memory limit being exceeded.
Question
The only explanation I can find for the Docker container's memory limit being exceeded is that checkpoint() actually passes data through the driver at some point. Otherwise, I have no action that would deliberately collect anything to the driver. However, I didn't read anything about this in the docs.
Does checkpoint() actually consume memory in the driver when executed?
Has anybody encountered a similar error and can point out whether it stems from something else?

Periodically persist the computed result using Spark Streaming?

I am working on a requirement to display a real-time dashboard based on some aggregations computed on the input data.
I have just started to explore Spark/Spark Streaming, and I see we can compute in real time using Spark Integration in micro-batches and provide the results to the UI dashboard.
My question is: if the Spark Integration job is stopped or crashes at any time after it is started, how will it resume from the position it was last processing when it comes back up? I understand Spark maintains an internal state, which we update for every new piece of data we receive. But wouldn't that state be gone when the job is restarted?
I feel we may have to periodically persist the running total/result so that Spark can resume its processing by fetching it from there when it is restarted. But I am not sure how to do that with Spark Streaming.
I am also not sure whether Spark Streaming by default ensures that data is not lost, as I have just started using it.
If anyone has faced a similar scenario, can you please share your thoughts on how I can address this?
Key points:
Enable the write-ahead log (WAL) for the receiver.
Enable checkpointing.
Details:
Enable the WAL: set spark.streaming.receiver.writeAheadLog.enable to true.
Enable checkpointing: a checkpoint periodically writes your application state to reliable storage, so that when your application fails it can recover from the checkpoint files.
To write checkpoints, add this:
ssc.checkpoint("checkpoint.path")
To read from checkpoint:
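import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def main(args: Array[String]): Unit = {
  // Restore the context from the checkpoint if one exists, otherwise build a new one
  val ssc = StreamingContext.getOrCreate("checkpoint.path", () => createContext())
  ssc.start()
  ssc.awaitTermination()
}
In the createContext function, create the ssc and set up your own logic. For example:
def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("app.name")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
  // streamingInterval stands for your batch interval in seconds; Seconds takes a Long, not a String
  val ssc = new StreamingContext(conf, Seconds(streamingInterval))
  ssc.checkpoint("checkpoint.path")
  // your code here
  ssc
}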
Here is the documentation on the steps needed to deploy Spark Streaming applications, including recovery from driver/executor failure:
https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#deploying-applications
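To make the recovery idea concrete outside Spark, here is a toy sketch in plain Python: it periodically persists a running total together with the position reached, and on restart resumes from that file. The file layout and the names save_checkpoint, load_checkpoint and process are assumptions for illustration only; Spark Streaming's own checkpoint format and recovery logic are different and are handled by StreamingContext.getOrCreate.

```python
import json
import os
import tempfile

def save_checkpoint(path, offset, total):
    """Write the position reached and the running total in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset, "total": total}, f)
    os.replace(tmp, path)  # swap the finished file in, so readers never see a partial write

def load_checkpoint(path):
    """Return (offset, total), starting from zero if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        data = json.load(f)
    return data["offset"], data["total"]

def process(records, path, every=2, fail_at=None):
    """Sum records, checkpointing every `every` records; `fail_at` simulates a crash."""
    offset, total = load_checkpoint(path)
    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        total += records[i]
        if (i + 1) % every == 0:
            save_checkpoint(path, i + 1, total)
    save_checkpoint(path, len(records), total)
    return total
```

Saving the offset and the total together is the design choice that matters here: after a crash, records processed since the last checkpoint are processed again, but starting from a total that does not yet include them, so nothing is counted twice.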
Spark Streaming acts as a consumer application. In practice, data is pulled from Kafka topics, and you can store the offsets of the data you have processed in some data store. The same applies if you are reading data from Twitter streams. The posts below describe how to store the offsets and resume from them if the application crashes or is restarted.
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
https://www.linkedin.com/pulse/achieving-exactly-once-semantics-kafka-application-ishan-kumar

How to start and stop spark Context Manually

I am new to Spark. In my current Spark application script, I can send queries to an in-memory table saved in Spark and get the desired result using spark-submit. The problem is that the Spark context stops automatically each time after returning the result. I want to send multiple queries sequentially, and for that I need to keep the Spark context alive. How could I do that? What I am after is:
Manual start and stop of the SparkContext by the user
Kindly advise. I am using PySpark 2.1.0. Thanks in advance.
To answer your question, this works:
import pyspark
# start
sc = pyspark.SparkContext()
#stop
sc.stop()
Try this code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("RatingsHistogram").setMaster("local")
sc = SparkContext.getOrCreate(conf)
This way you don't always have to stop your context, and if an existing SparkContext is available, it will be reused.
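If the goal is to run several queries and stop the context only once at the end, a context manager keeps that lifecycle explicit. This is a minimal sketch, not a PySpark feature: managed_context is a hypothetical helper, and factory is any zero-argument callable that returns an object with a stop() method, for example lambda: SparkContext.getOrCreate(conf).

```python
from contextlib import contextmanager

@contextmanager
def managed_context(factory):
    """Create a context, hand it to the caller, and always stop it afterwards."""
    ctx = factory()
    try:
        yield ctx
    finally:
        ctx.stop()  # runs even if one of the queries raises

# Intended use with PySpark (not executed here; assumes pyspark is installed):
#   with managed_context(lambda: SparkContext.getOrCreate(conf)) as sc:
#       first = sc.parallelize([1, 2, 3]).sum()
#       second = sc.parallelize([4, 5]).count()
```

All queries inside the with block share one context, and the stop happens exactly once when the block exits.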

Why does Spark throw "SparkException: DStream has not been initialized" when restoring from checkpoint?

I am restoring a stream from an HDFS checkpoint (a ConstantInputDStream, for example) but I keep getting SparkException: <X> has not been initialized.
Is there something specific I need to do when restoring from a checkpoint?
I can see that it wants DStream.zeroTime set, but when the stream is restored zeroTime is null. It doesn't get restored, possibly because it is a private member; I don't know. I can see that the StreamingContext referenced by the restored stream does have a value for zeroTime.
initialize is a private method and gets called at StreamingContext.graph.start but not by StreamingContext.graph.restart, presumably because it expects zeroTime to have been persisted.
Does someone have an example of a stream that recovers from a checkpoint and has a non-null value for zeroTime?
def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Duration(1000))
  ssc.checkpoint(checkpointDir)
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext)
val socketStream = ssc.socketTextStream(...)
socketStream.checkpoint(Seconds(1))
socketStream.foreachRDD(...)
The problem was that I created the DStreams after the StreamingContext had been recreated from the checkpoint, i.e. after StreamingContext.getOrCreate. Creating the DStreams and all transformations should have happened inside createStreamingContext.
The issue was filed as [SPARK-13316] "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards.
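In PySpark the same rule applies: everything that defines the stream belongs inside the setup function passed to StreamingContext.getOrCreate. The sketch below is an illustration under assumptions: create_streaming_context, run, the host/port arguments and the pprint() output operation are placeholders rather than the original poster's code, and pyspark is imported lazily so that the definitions load even where it is not installed.

```python
def create_streaming_context(checkpoint_dir, host, port, batch_seconds=1):
    """Build the context AND the whole DStream graph; used only when no checkpoint exists."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, batch_seconds)
    ssc.checkpoint(checkpoint_dir)
    # Define every DStream and output operation here, before returning.
    lines = ssc.socketTextStream(host, port)
    lines.pprint()  # placeholder output operation
    return ssc

def run(checkpoint_dir, host, port):
    """Restore from the checkpoint if present, otherwise build a fresh context."""
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext.getOrCreate(
        checkpoint_dir,
        lambda: create_streaming_context(checkpoint_dir, host, port),
    )
    # No DStreams are created here: on recovery the graph comes from the checkpoint.
    ssc.start()
    ssc.awaitTermination()
```

Keeping run free of any stream definitions is the point of the pattern: the code after getOrCreate behaves the same whether the context was freshly built or restored.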
This exception may also occur when you use the same checkpoint directory for two different Spark Streaming jobs.
Try using a unique checkpoint directory for each Spark job.
ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.spark.SparkException: org.apache.spark.streaming.dstream.FlatMappedDStream#6c17c0f8 has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:313)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
The above error occurred because I also had another Spark job writing to the same checkpoint directory. Even though the other Spark job was not running, it had already written to the checkpoint directory, so the new Spark job was not able to configure the StreamingContext.
I deleted the contents of the checkpoint directory and resubmitted the Spark job, and the issue was resolved.
Alternatively, to keep it simple, you can just use a separate checkpoint directory for each Spark job.

How is a Spark Streaming application loaded and run?

Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the quick example in the Spark Streaming guide confused me.
I know that the job should get data from the DStream you have set up and do something with it, but since it is running 24/7, how will the application be loaded and run?
Will it run every n seconds, or just run once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming is actually micro-batch processing. That means a new batch is executed at each interval, and you can customize the interval.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
You define a streaming context with a micro-batch interval of 1 second.
That is, the operations defined by the subsequent code that uses the streaming context
lines = ssc.socketTextStream("localhost", 9999)
...
get executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
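One way to picture this: the Python lines that build the stream run once and only register what should happen to each batch; after start(), a scheduler applies the registered operations to every new batch. The toy model below is not Spark code. ToyStreamingContext, ToyStream and their methods are invented for illustration, and the batches are supplied up front instead of arriving from a socket once per interval.

```python
class ToyStream:
    """Records transformations and outputs instead of running them immediately."""
    def __init__(self):
        self.transforms = []
        self.outputs = []

    def map(self, fn):
        self.transforms.append(fn)
        return self

    def foreach_batch(self, fn):
        self.outputs.append(fn)

class ToyStreamingContext:
    def __init__(self, batches):
        self.batches = batches  # stands in for data arriving once per interval
        self.stream = ToyStream()

    def start(self):
        # The "loop part": apply the registered operations to every batch
        for batch in self.batches:
            for fn in self.stream.transforms:
                batch = [fn(x) for x in batch]
            for out in self.stream.outputs:
                out(batch)

# Definition phase: runs exactly once and processes no data
ssc = ToyStreamingContext(batches=[["a b"], ["c"], ["d e f"]])
results = []
ssc.stream.map(lambda line: len(line.split())).foreach_batch(results.append)
# Execution phase: the registered operations run once per batch
ssc.start()
print(results)  # [[2], [1], [3]]
```

In this model nothing marks a piece of code as "the loop": the loop lives in start(), and the user code only describes what each pass should do.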

Resources