I have a Spark Structured Streaming job running on AWS EMR that is essentially a join of two input streams over a one-minute time window. The input streams have a 1-minute watermark. I don't do any aggregation. I write results to S3 "by hand" with a foreachBatch and a foreachPartition per batch that converts the data to strings and writes them to S3.
I would like to run this for a long time, i.e. "forever", but unfortunately Spark slowly fills up HDFS storage on my cluster and eventually dies because of this.
There seem to be two types of data that accumulate: logs in /var, and .delta and .snapshot files in /mnt/tmp/.../. They don't get deleted when I kill the task with CTRL+C (or, when running on YARN, with yarn application -kill) either; I have to delete them manually.
I run my task with spark-submit. I tried adding
--conf spark.streaming.ui.retainedBatches=100 \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
--conf spark.cleaner.periodicGC.interval=15min \
--conf spark.rdd.compress=true
without effect. When I add --master yarn, the paths where the temporary files are stored change a bit, but the problem of them accumulating over time persists. Adding --deploy-mode cluster seems to make the problem worse, as more data seems to be written.
I used to have a Trigger.ProcessingTime("15 seconds") in my code, but removed it after reading that Spark might fail to clean up after itself if the trigger interval is too short compared to the compute time. This seems to have helped a bit: HDFS fills up more slowly, but temporary files are still piling up.
If I don't join the two streams, but just select on both and union the results to write them to S3, the accumulation of cruft in /mnt/tmp doesn't happen. Could it be that my cluster is too small for the input data?
I would like to understand why Spark is writing these temp files, and how to limit the space they consume. I would also like to know how to limit the amount of space consumed by logs.

Spark fills HDFS with logs because of
One needs to set spark.eventLog.enabled=false so that no event logs are written.
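As a rough illustration, the setting can be passed at submit time in the same way as the --conf flags in the question. The sketch below only assembles the argument list. The helper name build_submit_args and the script name my_streaming_job.py are made up for the example, and spark.sql.streaming.minBatchesToRetain is included on the assumption that the .delta and .snapshot files are Structured Streaming state store files, which is not confirmed for this job.

```python
import shlex


def build_submit_args(app, conf):
    """Return a spark-submit argument list with one --conf per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


conf = {
    # From the answer above: stop writing Spark event logs.
    "spark.eventLog.enabled": "false",
    # Assumption: retaining fewer micro-batches leaves fewer state files on disk.
    "spark.sql.streaming.minBatchesToRetain": "10",
}

# my_streaming_job.py is a placeholder script name.
args = build_submit_args("my_streaming_job.py", conf)
print(" ".join(shlex.quote(a) for a in args))
```

The value 10 is arbitrary; whether lowering it is safe depends on how far back the query may need to recover.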

In addition to @adrianN's answer: on the EMR side, application logs are retained on HDFS - see


Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the Spark shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and looking for ways to avoid it. This drop and spike happens only during the read stage; all the intermediate stages run in parallel without such huge spikes.
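For reference, the ceiling of 500 running tasks matches the number of task slots implied by the spark-shell flags above. A quick back-of-the-envelope check, assuming one core per task (the default spark.task.cpus=1) and taking the ~40,000 figure at face value:

```python
import math

num_executors = 100    # --num-executors
executor_cores = 5     # --executor-cores
task_cpus = 1          # assumed default for spark.task.cpus
total_tasks = 40_000   # approximate task count from the question

# Each executor can run executor_cores // task_cpus tasks at once.
slots = num_executors * (executor_cores // task_cpus)
# Number of full "waves" needed if every slot stays busy.
waves = math.ceil(total_tasks / slots)
print(slots, waves)
```

So a healthy read stage would keep roughly 500 tasks running for about 80 waves; dips toward zero mean slots are sitting idle rather than that the work has run out.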
I'll be happy to provide any missing information.

Spark executor hangs on binaryFiles read

We use Spark 2.1.0 on YARN for batch processing of multiline records.
Our job is written in PySpark and runs once a day. The input folder contains ~45000 very small files (1kB-100kB each), for a total of ~2GB.
Every file contains a different number of multiline records. The first line of a record follows a standard pattern: a timestamp followed by a Greek µ and some other information. For example:
28/09/2018 08:54:22µfirst record metadata
first record content with
number of
28/09/2018 08:57:12µsecond record metadata
second record content
with a different
number of lines
This is how we read files in our Dataframe:
df=spark.sparkContext.binaryFiles(input_path).toDF(['filename', 'content'])
raw = df.select('filename', explode(split(df.content, r'(?=\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}µ)'))).cache()
The first line's output is a dataframe with one entry for every file, the second line's output is a dataframe with one entry for every record. Dataframe is then cached and other operations are performed.
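To check the splitting logic outside Spark, a small pure-Python sketch on the sample above may help. It is not the Spark code path: it uses re.finditer on the header pattern instead of a lookahead split, and the function name split_records is made up for this example.

```python
import re

# Same header pattern as in the Spark code, without the lookahead.
HEADER = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}µ")


def split_records(text):
    """Cut text into records, each starting at a timestamp header."""
    starts = [m.start() for m in HEADER.finditer(text)]
    ends = starts[1:] + [len(text)]
    # Anything before the first header is dropped.
    return [text[s:e] for s, e in zip(starts, ends)]


sample = (
    "28/09/2018 08:54:22µfirst record metadata\n"
    "first record content with\n"
    "number of\n"
    "28/09/2018 08:57:12µsecond record metadata\n"
    "second record content\n"
    "with a different\n"
    "number of lines\n"
)
records = split_records(sample)
print(len(records))
```

Timing this function on a copy of one of the rare large files could give a first hint as to whether the regex split itself is the slow part.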
We are currently testing the solution, and this is the current deploy configuration for the job (the memory settings are deliberately oversized):
spark2-submit --master yarn \
--conf spark.kryoserializer.buffer.max=1g \
--deploy-mode cluster \
--driver-memory 16g \
--driver-cores 1 \
--conf spark.yarn.driver.memoryOverhead=1g \
--num-executors 20 \
--executor-memory 16g \
--executor-cores 1 \
--conf spark.yarn.executor.memoryOverhead=1g \
The job runs fine almost every day and it performs all its operations in 10-15 minutes, writing results to HDFS.
The problem is that once every 7-10 days one of the ~45000 input files has a completely different size from the others: 100MB to 1GB (less than 2GB, anyway). In this case, our job (in particular, one of the executors) hangs and seems to be doing nothing the entire time. There are no new log lines after the first few minutes. It runs for hours and we have never seen such a job finish, because we have to kill it after a few hours. We suspect this is because of the "big" file; in fact, the job runs fine if we remove it from the input folder.
These are screenshots taken from our last run:
The PySpark documentation notes: "Small files are preferred, large file is also allowable, but may cause bad performance." We can accept worse performance, but we think this is not the case here, because it seems to us that the job is simply doing nothing the whole time.
Is a 200MB file really a "large file" from Spark's point of view? If so, how can we improve the performance of our job, or at least find out whether it is actually doing something?
Thank you
Maybe you should increase your executor-cores number. binaryFiles creates a BinaryFileRDD, and the number of partitions a BinaryFileRDD gets depends on the number of CPU processors.
// setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
// traversing a large number of directories and files. Parallelize it.
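Since binaryFiles loads each file as a single value, one thing that might be worth trying for the case in the question is to detect the rare oversized file up front and handle it separately. A minimal sketch, assuming a list of (path, size) pairs is already available (for example from an HDFS listing); the paths, the 10 MB threshold, and the name partition_by_size are illustrative only:

```python
def partition_by_size(files, threshold_bytes):
    """Split (path, size) pairs into normal and oversized path lists."""
    normal, oversized = [], []
    for path, size in files:
        (oversized if size > threshold_bytes else normal).append(path)
    return normal, oversized


# Hypothetical listing; in practice the sizes would come from HDFS.
listing = [
    ("/input/a.txt", 4_000),
    ("/input/b.txt", 90_000),
    ("/input/c.txt", 200 * 1024 * 1024),
]
normal, oversized = partition_by_size(listing, threshold_bytes=10 * 1024 * 1024)
print(oversized)
```

The normal paths could then go through the existing job unchanged, while the oversized ones are logged or processed by a separate step.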

What is the correct way to query Hive on Spark for maximum performance?

Spark newbie here.
I have a pretty large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to pack it as a parquet file.
I'm using the default EMR cluster configuration, 6 * r3.xlarge instances, to submit my Spark application written in Python. I then run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)
data = hiveCtx.sql("select * from my_table")
Later, I submit it with something similar to this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files
However, my task takes forever to complete. Spark shuts down all but one worker very quickly after the job starts, since the others are not being used, and it takes a few hours before it has all the data from Hive. The Hive table itself is not partitioned or clustered yet (I also need some advice on that).
Could you help me understand what I'm doing wrong, where should I go from here and how to get the maximum performance out of resources I have?
Thank you!
I had a similar use case where I used Spark to write to S3 and had performance issues. The primary reason was that Spark was creating a lot of zero-byte part files, and renaming the temp files to their final names was slowing down the write. I tried the approaches below as workarounds:
1. Write the Spark output to HDFS and use Hive to write to S3. Performance was much better, as Hive created fewer part files. The problem I had (also when using Spark) was that the delete action was not granted by the policy in the prod environment for security reasons. The S3 bucket was KMS-encrypted in my case.
2. Write the Spark output to HDFS, copy the HDFS files to local disk, and use aws s3 copy to push the data to S3. This gave the second-best results. I created a ticket with Amazon and they suggested going with this one.
3. Use s3 dist cp to copy files from HDFS to S3. This worked with no issues, but was not performant.

Spark job randomly hangs in the middle of a stage while reading data

I have a Spark job that reads data, transforms it (a shuffle is involved), and writes the data back to disk. Different instances of the same Spark job process separate data in parallel (each has its own input/output dir). Some of the jobs, so far roughly 3 out of 200, got stuck in the middle of the read stage. By stuck I mean that after some point no tasks finish, the stage makes no progress, and no new executor error logs appear in the UI; a job can run for half an hour and then stop making progress. When I rerun the whole set of jobs, everything can be fine, or some other jobs (with different in/out dirs) can hang instead. We use Spark 1.6.0 (CDH_5.8). We use dynamic allocation, and such a job can acquire more resources even after it is already "stuck". Any idea what can be done in such situations?
I start jobs using these properties:
--master yarn-cluster
--driver-memory 8g
--executor-memory 4g
--conf spark.yarn.executor.memoryOverhead=1024
--conf spark.dynamicAllocation.maxExecutors=2200
--conf spark.yarn.maxAppAttempts=2
--conf spark.dynamicAllocation.enabled=true
Disabling dynamic allocation seems to have solved the issue; we are going to keep running our jobs for several more days to conclude whether it really was the cause.

Spark Yarn running 1000 jobs in queue

I am trying to schedule 1000 jobs on a YARN cluster. I want to run more than 1000 jobs daily at the same time and have YARN manage the resources. For 1000 files of different categories in HDFS, I am trying to build a spark-submit command from Python and execute it. But I am getting an out-of-memory error due to spark-submit using driver memory.
How can I schedule 1000 jobs on a Spark YARN cluster? I even tried the Oozie job scheduling framework along with Spark; it did not work as expected with HDP.
Actually, you might not need 1000 jobs to read 1000 files from HDFS. You could try to load everything into a single RDD instead (the APIs support reading multiple files and wildcards in paths). After reading all the files into a single RDD, you should focus on making sure enough memory, cores, etc. are assigned to the job, and then look at whether your business logic can avoid costly operations such as shuffles.
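As a sketch of that idea: sc.textFile accepts a comma-separated list of paths (as well as wildcards), so the 1000 files could be grouped into a handful of path strings, one per job, instead of one job per file. The helper name batch_paths and the hdfs:///data/... paths are made up for illustration.

```python
def batch_paths(paths, n_batches):
    """Spread paths round-robin over n_batches comma-joined strings."""
    buckets = [[] for _ in range(n_batches)]
    for i, path in enumerate(paths):
        buckets[i % n_batches].append(path)
    # Skip empty buckets when there are fewer paths than batches.
    return [",".join(b) for b in buckets if b]


# Hypothetical input files.
paths = [f"hdfs:///data/file_{i:04d}.csv" for i in range(1000)]
batches = batch_paths(paths, n_batches=10)
print(len(batches), batches[0].count(",") + 1)
```

Each element of batches could then be passed as the path argument of a single read, which turns 1000 submissions into 10.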
But if you insist that you need to spawn 1000 jobs, one for each file, you should look at --executor-memory and --executor-cores (along with --num-executors for parallelism). These give you leverage to optimise the memory/CPU footprint.
I'm also curious: you say that you get an OOM during spark-submit (using driver memory). The driver normally needs little memory unless you do things like collect or take on a large dataset, which bring data from the executors to the driver. Also, are you firing the jobs in yarn-client mode? Another hunch is to check whether the box from which you launch the Spark jobs even has enough memory to spawn them in the first place.
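If the launching box turns out to be the bottleneck, one option is to cap how many spark-submit processes are alive at once, since each one starts its own JVM on that machine. A minimal sketch using only the standard library; run_bounded and the stand-in commands are hypothetical, and the limit of 3 is arbitrary:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_bounded(commands, max_parallel):
    """Run commands with at most max_parallel child processes at once."""
    def run(cmd):
        # Blocks until the child exits, so each worker thread owns one process.
        return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))


# Stand-in commands; a real run would use spark-submit argument lists.
commands = [[sys.executable, "-c", "print('job', %d)" % i] for i in range(8)]
return_codes = run_bounded(commands, max_parallel=3)
print(return_codes)
```

In cluster deploy mode the driver runs on YARN rather than on the launching box, so the local footprint per submission should be smaller, but the cap still keeps the number of local JVMs predictable.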
It will be easier if you could also paste some logs here.
