Apache Spark - ReducedWindowedDStream has not been initialized - apache-spark

I am restarting a Spark streaming job that is checkpointed in HDFS. I am purposely killing the job after 5 minutes and restarting it to test the recovery. I receive this error once ssc.start() is invoked.
INFO WriteAheadLogManager : Recovered 1 write ahead log files from hdfs://...receivedBlockMetadata
INFO WriteAheadLogManager : Reading from the logs:
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ReducedWindowedDStream@65600fb3 has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
I am starting the job using: StreamingContext.getOrCreate(checkpointDir,...
The job has three windowed operations with sliding windows of 5 minutes, 1 hour, and 1 day, but the job was stopped after 5 minutes. For recovery from the checkpoint to work, does the maximum window duration need to pass so that all the windowed ops can initialize?
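For context, the job is wired up roughly along the lines of the sketch below (this is a simplified stand-in, not the actual code: the checkpoint path, socket source, and window durations are placeholders). The whole DStream graph, including the windowed ops, is built inside the factory passed to getOrCreate, as the checkpointing docs require:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/me/streaming-checkpoint"   // placeholder path

// All DStream setup must happen inside this factory so that getOrCreate
// can rebuild the graph when it recovers from the checkpoint.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("windowed-counts")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  val events = ssc.socketTextStream("localhost", 9999)       // stand-in source
  val counts = events.map(e => (e, 1L))
    .reduceByKeyAndWindow(_ + _, _ - _, Minutes(5), Minutes(1))  // creates a ReducedWindowedDStream
  counts.print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()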

I encountered the same problem; deleting the checkpoint path on HDFS avoided the exception.

Related

Checkpoint takes long time in a Spark Job

I have a batch Spark job with a checkpoint that takes over 3 hours to finish, and the checkpoint appears over 30 times in the Spark UI.
I tried removing the checkpoint from the code, and a similar thing happens: there is a 3-hour gap between one job and the next.
The data is not too big; the job just reads from 6 tables with no more than 3 GB of data, and it runs on a Cloudera platform (YARN).
I have already tried using more shuffle partitions and more parallelism, and also fewer, but it doesn't work. I also tried changing the number of executors, but nothing changed...
What do you think is happening?
I finally managed to solve it.
The problem was that the input Hive table had just 5 partitions (5 Parquet files), so the job was working with only 5 partitions the whole time.
Adding .repartition(100) after the read solved the problem and sped the process up from 5 hours to 40 minutes.
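Roughly, the fix looks like the sketch below (the table name and checkpoint directory are placeholders, not the real ones):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/job-checkpoints")   // placeholder path

// The source table is backed by only 5 Parquet files, so the read yields 5 partitions;
// repartitioning right after the read restores parallelism for everything downstream.
val df = spark.table("my_db.my_table")   // placeholder table name
  .repartition(100)

val checkpointed = df.checkpoint()       // the checkpoint now runs with 100 tasks instead of 5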

Spark is dropping all executors at the beginning of a job

I'm trying to configure a Spark job to run with fixed resources on a Dataproc cluster; however, after the job had been running for 6 minutes I noticed that all but 7 executors had been dropped. 45 minutes later the job has not progressed at all, and I cannot find any errors or logs to explain why.
When I check the timeline in the job details, it shows all but 7 executors being removed at the 6-minute mark, with the message Container [really long number] exited from explicit termination request.
The command I am running is:
gcloud dataproc jobs submit spark --region us-central1 --cluster [mycluster] \
--class=path.to.class.app --jars="gs://path-to-jar-file" --project=my-project \
--properties=spark.executor.instances=72,spark.driver.memory=28g,spark.executor.memory=28g
My cluster is 1 + 24 n2-highmem-16 instances, if that helps.
EDIT: I terminated the job, reset, and tried again. The exact same thing happened at the same point in the job (Job 9, Stage 9/12).
Typically that message is associated with Spark dynamic allocation; if you want to always have a fixed number of executors, you can try adding the property:
...
--properties=spark.dynamicAllocation.enabled=false,spark.executor.instances=72...
However, that probably won't address the root problem in your case, aside from letting the idle executors stick around; if dynamic allocation was relinquishing those executors, it is because their tasks had already completed while your remaining executors, for whatever reason, had not finished for a long time. This often indicates some kind of data skew, where the remaining executors have much more work to do than the ones that already completed, unless the remaining executors were simply all equally loaded as part of a smaller phase of the pipeline, perhaps a "reduce" phase.
If you're seeing lagging tasks out of a large number of equivalent tasks, you might consider adding a repartition() step to your job to chop it up more finely in the hope of spreading out those skewed partitions, or otherwise changing the way you group or partition your data.
Fixed. The job was running out of resources. Allocated some more executors to the job and it completed.

Why is the filesystem closed automatically?

I am running a Spark job that reads from a Kafka queue and writes the data into Avro files with Spark.
But when it has been running in the cluster for more or less 50 minutes, it gets this exception:
20/01/27 16:21:23 ERROR de.rewe.eem.spark.util.FileBasedShutdown$: Not possible to await termination or timout - exit checkRepeatedlyShutdownAndStop with exception
20/01/27 16:21:23 ERROR org.apache.spark.util.Utils: Uncaught exception in thread Thread-2
java.io.IOException: Filesystem closed
at com.mapr.fs.MapRFileSystem.checkOpen(MapRFileSystem.java:1660)
at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:633)
The Spark configuration and properties are the following:
SPARK_OPTIONS="--driver-memory 4G --executor-memory 4G --num-executors 4 --executor-cores 4 --conf spark.driver.memoryOverhead=768 --conf spark.driver.maxResultSize=0 --conf spark.executor.memory=4g --master yarn --deploy-mode cluster"
For the roughly 50 minutes that it is running, it works fine.

Spark Streaming - Jobs run concurrently with default spark.streaming.concurrentJobs setting

I have come across a weird behaviour in a Spark Streaming job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job had been running properly for more than a day with the batch interval set to 10 minutes.
Suddenly that same job started running all the incoming batches concurrently instead of queueing them.
Has anyone faced this before?
This would be of great help!
This kind of behavior is curious, but as I understand it, with only 1 job running at a time, if the batch processing time < batch interval, then the system should stay stable.
Spark Streaming creator Tathagata has commented on this in: How jobs are assigned to executors in Spark Streaming?
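If you want to rule out a configuration override somewhere, you can pin the setting explicitly; a minimal sketch (the app name and batch interval are illustrative, and spark.streaming.concurrentJobs is an undocumented internal property whose default is 1):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Pin the setting explicitly instead of relying on the default, so an override in
// spark-defaults.conf or the submit command can't change it silently.
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Minutes(10))   // 10-minute batch interval, as in the question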

Why do Spark streaming executors start at different times?

I'm using Spark Streaming 1.6, which uses Kafka as a source.
My input arguments are as follows:
num-executors 5
num-cores 4
batch Interval 10 sec
maxRate 600
blockInterval 350 ms
Why do some of my executors start later than others?
That's not the executors' start time, but the tasks' start time.
This is most likely due to locality scheduling. Spark delays the start of a task to find the best executor to run that task on. Check the configuration "spark.locality.wait" in Spark's documentation for further details.
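As a rough sketch (the values are illustrative, not a recommendation), the relevant knobs look like this; the default for spark.locality.wait is 3s, and setting it to 0 effectively disables the delay:

import org.apache.spark.SparkConf

// Shorten how long the scheduler waits for a data-local slot before falling back
// to a less-local executor; the per-level properties override the global one.
val conf = new SparkConf()
  .set("spark.locality.wait", "1s")         // global wait (default 3s)
  .set("spark.locality.wait.node", "1s")    // node-local level
  .set("spark.locality.wait.rack", "1s")    // rack-local level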
