Let's say I have a Spark 2.x application with speculation enabled (spark.speculation=true) that writes data to a specific location on HDFS.
If the task that writes to HDFS takes too long, Spark starts a copy of the same task on another executor, and both attempts run in parallel.
How does Spark handle this? Obviously the two tasks shouldn't write to the same file location at the same time (which seems to be what is happening in this case).


As I understand it, this is what happens in my tasks:
When one of the speculative attempts finishes, the other one is killed.
When Spark kills that attempt, it deletes the temporary file the attempt wrote.
So no data is duplicated.
If you choose overwrite mode, some speculative tasks may fail with this exception:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
Failed to CREATE_FILE /<hdfs_path>/.spark-staging-<...>///part-00191-.c000.snappy.parquet
for DFSClient_NONMAPREDUCE_936684547_1 on
because this file lease is currently owned by DFSClient_NONMAPREDUCE_-1803714432_1 on
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(
I will keep studying this situation, so this answer may become more helpful over time.
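The behaviour described above (each attempt writes to its own temporary file, the first attempt to finish is kept, the other is cleaned up) can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not Spark's code: ToyCommitter, its in-memory "file system" and the path layout are all made up for illustration.

```scala
import scala.collection.mutable

// Toy model only: an in-memory stand-in for a file system, keyed by path.
final class ToyCommitter {
  private val files = mutable.Map.empty[String, String]
  private val committed = mutable.Set.empty[Int] // partitions that already have a winner

  // Every attempt gets its own temporary path, so two attempts never share a file.
  def writeTemp(partition: Int, attempt: Int, data: String): String = synchronized {
    val tmp = s"_temporary/attempt_${partition}_$attempt/part-$partition"
    files(tmp) = data
    tmp
  }

  // The first attempt to finish is promoted to the final path.
  // Any later attempt for the same partition only has its temporary file removed.
  def finish(partition: Int, tmp: String): Boolean = synchronized {
    files.remove(tmp) match {
      case Some(data) if committed.add(partition) =>
        files(s"part-$partition") = data
        true
      case _ => false
    }
  }

  def listing: Set[String] = synchronized(files.keySet.toSet)
}

object ToyCommitterDemo {
  def main(args: Array[String]): Unit = {
    val c = new ToyCommitter
    val original = c.writeTemp(191, 0, "rows")
    val speculative = c.writeTemp(191, 1, "rows")
    println(c.finish(191, speculative)) // the faster (speculative) attempt wins
    println(c.finish(191, original))    // the slower attempt is discarded
    println(c.listing)                  // only the final file remains
  }
}
```

In this model the final listing contains a single part-191 file, which matches the "no data is duplicated" observation. The lease exception above would correspond to two attempts being given the same temporary path, which this toy model deliberately avoids.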


HBase batch loading with speed control because of a slow consumer

We need to load a large amount of data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer. But the consumer is too slow.
At the same time, Kafka does not have enough space to hold the whole scan result.
Our key contains ...yyyy.MM.dd, and currently we load 30 days in one Spark job using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the total scan too slow.
Currently we launch the Spark job with 100 threads, but we can't simply slow it down by using fewer threads (for example, 7 threads), because Kafka is also used by third-party developers, which sometimes leaves it too busy to hold any of our data. So we need to control the HBase scan speed, continuously checking whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many small files. Grouping them by size is a problem (or is there a way? If you know, please tell me how), and storing small files in HDFS is bad. Merging such files is a very expensive operation and takes so long that the total time becomes too slow.
Suggested solutions:
Maybe it is possible to have Spark store the scan result in HDFS, by setting some special flag in the filter operator, and then run 30 Spark jobs that select data from the saved result and put each result into Kafka when possible.
Maybe there is an existing mechanism in Spark to pause and resume launched jobs.
Maybe there is an existing mechanism in Spark to split the result into batches (without control over pausing and resuming the load).
Maybe there is an existing mechanism in Spark to split the result into batches (with pausing and resuming of the load controlled by an external condition).
Maybe, when Kafka throws an exception (that there is no room to store data), there is some backpressure mechanism in Spark that pauses the scan for a while when exceptions appear during execution (but I guess the number of retries for an operator is limited; is it possible to make it retry forever, if that is a real solution?). It would be better, though, to keep some free space in Kafka rather than wait until it is overloaded.
Use PageFilter in HBase (but I guess that is hard to implement), or some other variant? I also guess there would be too many objects in memory to use PageFilter.
This will not help, we already use filter
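One way to get "batches with pause and resume controlled by an external condition" without any special Spark feature is to throttle inside each partition. The sketch below is plain Scala and makes several assumptions: Throttle.drain is a hypothetical helper, hasCapacity stands for whatever check you have for free room in Kafka, and send stands for your own producer call. Neither is a real Kafka or Spark API.

```scala
object Throttle {
  // Drains `records` in batches of `batchSize`. Before each send it polls
  // `hasCapacity` and sleeps while the sink reports that it has no room.
  // Returns the number of records handed to `send`.
  def drain[T](
      records: Iterator[T],
      batchSize: Int,
      hasCapacity: () => Boolean,
      send: Seq[T] => Unit,
      pauseMillis: Long = 1000L): Long = {
    var sent = 0L
    // grouped() pulls from the underlying iterator lazily, so at most one
    // batch is read ahead while we are waiting for capacity.
    records.grouped(batchSize).foreach { batch =>
      while (!hasCapacity()) Thread.sleep(pauseMillis)
      send(batch)
      sent += batch.size
    }
    sent
  }
}
```

Inside a job this could be called from foreachPartition, for example rdd.foreachPartition(it => Throttle.drain(it, 500, kafkaHasRoom, sendToKafka)), where kafkaHasRoom and sendToKafka are placeholders for your own code. Because the partition iterator is consumed lazily, the scan only advances as batches are accepted. One thing to verify under this approach is that long pauses do not run into HBase scanner timeouts.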


Spark write to HDFS is slow

I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size.
I am reading the data into a DataFrame and writing it back out without any transformations, using partitionBy:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
Monitoring the job in the Spark UI, I see the job and stage complete in 20 minutes, but the "SQL" tab shows 40 minutes.
After running the job in debug mode and going through the Spark log, I realised the tasks writing to "_temporary" complete in 20 minutes.
After that, merging "_temporary" into the actual output path takes another 20 minutes.
So my question is: does the driver process merge the data from "_temporary" into the output path sequentially, or is it done by executor tasks?
Is there anything I can do to improve the performance?
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which is known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
On a separate note, having 8 cores per executor is not recommended, as it might saturate disk IO when all 8 tasks are writing output at once.
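A minimal sketch of setting that option when the session is built is shown below, assuming your Spark and Hadoop versions support algorithm version 2. The object name, the application name and the use of args for the input and output paths are placeholders. The write itself is the one from the question.

```scala
import org.apache.spark.sql.SparkSession

object RewritePartitionedOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-partitioned-orc") // placeholder name
      // Ask the Hadoop FileOutputCommitter to use commit algorithm version 2.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    val df = spark.read.orc(args(0)) // args(0): input path (placeholder)
    df.write.mode("overwrite").partitionBy("some_column").orc(args(1)) // args(1): output path

    spark.stop()
  }
}
```

The same key can also be passed on the command line with --conf instead of being hard-coded. Comparing the "SQL" tab duration before and after the change should show whether the job-level commit was really the bottleneck.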

Spark concurrent writes on same HDFS location

I have Spark code that saves a DataFrame to an HDFS location (a date-partitioned location) in JSON format using append mode.
sample hdfs location : /tmp/table1/datepart=20190903
I am consuming data from upstream in a NiFi cluster. Each node in the NiFi cluster creates a flow file for the consumed data, and my Spark code processes that flow file. As NiFi is distributed, my Spark code is executed from different NiFi nodes in parallel, all trying to save data into the same HDFS location.
I cannot store the output of the Spark job in different directories, as my data is partitioned by date.
This process has been running once a day for the last 14 days, and my Spark job has failed 4 times with different errors.
First Error: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
Following are the problems/issue:
I am not able to reproduce this scenario. How can I do that?
On all 4 occasions, the errors are related to the _temporary directory. Is it because 2 or more jobs are trying to save data to the same HDFS location in parallel, and while doing that Job A might have deleted the _temporary directory of Job B? (Because the location is the same and all folders share the common name /_temporary/0/.)
If it is a concurrency problem, I can run all NiFi processors from the primary node, but then I will lose performance.
Need your expert advice.

It seems the problem is that two Spark jobs are independently trying to write to the same place, causing conflicts: the faster one cleans up the working directory before the second one is done with it.
The most straightforward solution may be to avoid this.
As I understand how you use NiFi and Spark, the node where NiFi runs also determines the node where Spark runs (is there a 1-1 relationship?).
If that is the case, you should be able to solve this by routing the work in NiFi to nodes that do not interfere with each other. Check out the load balancing strategy (a property of the queue) that partitions by attribute. Of course you would need to define the right attribute, but something like directory or table name should go a long way.
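If routing in NiFi is not an option, another way to keep jobs out of each other's working directory is to let every job write to a private staging directory and only then move the finished files into the shared partition. The following is a sketch under assumptions, not a tested recipe: IsolatedAppend and the .staging-<uuid> naming are made up, and it relies on Spark's part file names containing a per-job UUID (as in the error messages above) so that moved files do not collide.

```scala
import java.util.UUID
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

object IsolatedAppend {
  // Writes `df` as JSON into a private staging directory next to the partition,
  // then moves the finished part files into `partitionDir`.
  def appendViaPrivateStaging(spark: SparkSession, df: DataFrame, partitionDir: String): Unit = {
    val target = new Path(partitionDir)
    // Hypothetical naming scheme; the leading dot keeps the directory hidden from readers.
    val staging = new Path(target.getParent, s".staging-${UUID.randomUUID()}")

    // The _temporary directory used by this write lives under `staging`,
    // so no other job can delete it.
    df.write.json(staging.toString)

    val fs = target.getFileSystem(spark.sparkContext.hadoopConfiguration)
    fs.mkdirs(target)
    fs.listStatus(staging)
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .foreach { p =>
        val moved = fs.rename(p, new Path(target, p.getName))
        require(moved, s"could not move $p into $target")
      }
    fs.delete(staging, true) // recursive delete of the now empty staging directory
  }
}
```

A call would look like IsolatedAppend.appendViaPrivateStaging(spark, df, "/tmp/table1/datepart=20190903"). The moves are individual renames, so a reader may briefly see only part of a batch. Whether that is acceptable depends on how the table is consumed.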
Try to enable outputcommitter v2:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
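With v2, each task moves its output straight into the destination directory when the task commits, instead of leaving everything for a final job-level move. As far as I can tell, task attempts are still staged under the shared _temporary directory, so this shortens the window in which concurrent jobs can collide rather than removing it.
It also speeds up the write, but allows some rare cases of partial output being left behind if a job fails midway.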
Try to check this doc for more info:

Spark code to protect from FileNotFoundExceptions?

Is there a way to run my Spark program and be shielded from the underlying files changing?
The code starts by reading a parquet file (no errors during the read):
val mappings = spark.read.parquet(<...> + "/table/mappings/")
It then does transformations with the data e.g.,
val newTable = mappings.join(anotherTable, 'id)
These transformations take hours (which is another problem).
Sometimes the job finishes, other times, it dies with the following similar message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 1014.0 failed 4 times, most recent failure: Lost task
6.3 in stage 1014.0 (TID 106820,, executor 5): No such file or directory:
We believe another job is changing the files underneath us, but haven't been able to find the culprit.
This is a very complicated problem to solve. If the underlying data changes while you are operating on a DataFrame built from it, the Spark job will fail. The reason is that when the DataFrame was created, the underlying RDD recorded the location of the data and the DAG associated with it. If the underlying data is then changed by some other job, the RDD has no option but to fail.
One possibility is to enable retries, speculation, etc., but the underlying problem remains. Generally, if you have a table in Parquet and you want to read and write it at the same time, partition the table by date or time, so that writes go to one partition while reads come from another.
Now for the problem of the join taking a long time. If you read the data from S3, join it, and write the result back to S3, performance will be slower, because Hadoop now has to fetch the data from S3 first and then perform the operation (the code is not going to the data). Although the network call is fast, I ran some experiments comparing S3 with EMR FS and found a 50% slowdown with S3.
One alternative is to copy the data from S3 to HDFS and then run the join. That shields you from the data being overwritten, and performance will be better.
One last thing: if you are using Spark 2.2, writing to S3 is painfully slow due to the deprecation of DirectOutputCommitter. That could be another reason for the slowdown.
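The "copy first, then join" idea can be sketched with Spark itself, as below. The paths are passed as arguments and are placeholders, the object name is made up, and the join key "id" is taken from the question. Treat this as an illustration of the approach rather than a drop-in job. A tool such as distcp would be another way to make the copy.

```scala
import org.apache.spark.sql.SparkSession

object SnapshotThenJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("snapshot-then-join").getOrCreate()

    val source = args(0)       // location other jobs may modify (placeholder)
    val snapshot = args(1)     // private location only this job writes to (placeholder)
    val otherTablePath = args(2)

    // Copy the input once, up front. This is the only step exposed to concurrent changes.
    spark.read.parquet(source).write.mode("overwrite").parquet(snapshot)

    // The long-running transformations read only from the private copy.
    val mappings = spark.read.parquet(snapshot)
    val anotherTable = spark.read.parquet(otherTablePath)
    val newTable = mappings.join(anotherTable, "id")

    newTable.write.mode("overwrite").parquet(args(3)) // output path (placeholder)
    spark.stop()
  }
}
```

The copy step can still fail if the source changes while it is being read, but it is short compared with hours of transformations, so a retry is cheap.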

Spark Streaming - Restarting from checkpoint replays last batch

We are trying to build a fault-tolerant Spark Streaming job, and there is one problem we keep running into. Here's our scenario:
1) Start a spark streaming process that runs batches of 2 mins
2) We have checkpoint enabled. Also the streaming context is configured to either create a new context or build from checkpoint if one exists
3) After a particular batch completes, the spark streaming job is manually killed using yarn application -kill (basically mimicking a sudden failure)
4) The spark streaming job is then restarted from checkpoint
The issue we are having is that after the Spark Streaming job is restarted, it replays the last successful batch. It always does this, and only the last successful batch is replayed, not the earlier ones.
The side effect is that the data in that batch is duplicated. We even tried waiting for more than a minute after the last successful batch before killing the process (just in case writing the checkpoint takes time), but that didn't help.
Any insights? I have not added the code here, hoping someone has faced this as well and can give some ideas or insights. I can post the relevant code if that helps. Shouldn't Spark Streaming checkpoint right after a successful batch so that it is not replayed after a restart? Does it matter where I place the ssc.checkpoint command?
You have the answer in the last line of your question: the placement of ssc.checkpoint() matters. When you restart the job from a saved checkpoint, it comes up with whatever state was saved. In your case you killed the job after the batch completed, so the most recent batch in the checkpoint is the last successful one, and that is the one that gets replayed. Checkpointing is mainly meant for picking up where you left off, especially after a failure.
There are two things that need to be taken care of.
1] Ensure that the same checkpoint directory is passed to the getOrCreate streaming context method when you restart the program.
2] Set "spark.streaming.stopGracefullyOnShutdown" to "true". This allows Spark to finish processing the current data and update the checkpoint directory accordingly. If it is set to false, the checkpoint directory may end up with corrupt data.
Note: Please post code snippets if possible. And yes, the placement of ssc.checkpoint does matter.
In such a scenario, one should ensure that the checkpoint directory used in the streaming context method is the same after the Spark application restarts. Hopefully this helps.
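Putting the two points together, a minimal sketch of a recoverable streaming job could look like this. The checkpoint path, the socket source on localhost:9999 and the count().print() output are placeholders chosen only to make the example complete. The 2 minute batch interval comes from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableStream {
  // Placeholder path; it must be the same on every start of the application.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Called only when no usable checkpoint exists. All stream setup goes in here,
  // including ssc.checkpoint, so a recovered context and a fresh one are built alike.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("recoverable-stream")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(120)) // 2 minute batches
    ssc.checkpoint(checkpointDir)

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.count().print()                               // placeholder output operation
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuilds the context from the checkpoint if one exists, otherwise creates it.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Even with this setup a batch may be processed again after a hard kill, so it is safer to make the output step tolerate replays, for example by writing each batch under a key derived from the batch time so that a replay overwrites rather than duplicates.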

