Spark executor hangs on binaryFiles read - apache-spark

We use Spark 2.1.0 on YARN for batch processing of multiline records.
Our job is written in PySpark and runs once a day. The input folder contains ~45,000 very small files (1 kB to 100 kB each), for a total of ~2 GB.
Every file contains a different number of multiline records. The first line of a record follows a standard pattern: a timestamp, followed by a Greek µ and some other metadata. For example:
28/09/2018 08:54:22µfirst record metadata
first record content with
undefined
number of
lines
28/09/2018 08:57:12µsecond record metadata
second record content
with a different
number of lines
This is how we read the files into our dataframe:
from pyspark.sql.functions import explode, split

df = spark.sparkContext.binaryFiles(input_path).toDF(['filename', 'content'])
raw = df.select('filename', explode(split(df.content, r'(?=\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}µ)'))).cache()
The first line produces a dataframe with one entry per file; the second produces a dataframe with one entry per record. The dataframe is then cached and other operations are performed on it.
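(As a quick sanity check of how the read is split, a minimal sketch assuming the same spark session and input_path as above:)

# Sketch: inspect how the binaryFiles read was partitioned; assumes the
# same `spark` and `input_path` used above.
rdd = spark.sparkContext.binaryFiles(input_path)
print(rdd.getNumPartitions())        # partitions of the underlying RDD
df = rdd.toDF(['filename', 'content'])
print(df.rdd.getNumPartitions())     # partitions of the resulting dataframe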
We are still testing the solution; this is the current submit configuration for the job (memory requirements are deliberately oversized):
spark2-submit --master yarn \
--conf spark.kryoserializer.buffer.max=1g \
--deploy-mode cluster \
--driver-memory 16g \
--driver-cores 1 \
--conf spark.yarn.driver.memoryOverhead=1g \
--num-executors 20 \
--executor-memory 16g \
--executor-cores 1 \
--conf spark.yarn.executor.memoryOverhead=1g \
spark_etl.py
The job runs fine almost every day, performing all its operations in 10-15 minutes and writing results to HDFS.
The problem is that once every 7-10 days one of the ~45,000 input files has a completely different size from the others: 100 MB to 1 GB (always less than 2 GB). In this case our job, in particular one of the executors, hangs and appears to do nothing the entire time: there are no new log lines after the first few minutes. We have never seen one of these runs finish, because we have to kill them after a few hours. We suspect the "big" file is the cause; indeed, the job runs fine if we remove it from the input folder.
The PySpark documentation notes: "Small files are preferred, large file is also allowable, but may cause bad performance." We could accept worse performance, but we don't think that is what is happening here: it looks like the job is simply doing nothing the whole time.
Is a 200MB file really a "large file" from Spark's point of view? If so, how can we improve the performance of our job, or at least understand whether it is actually doing anything?
Thank you

Maybe you should increase your executor-cores number. binaryFiles creates a BinaryFileRDD, and BinaryFileRDD determines the number of partitions based on the number of CPU processors:
// setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
// traversing a large number of directories and files. Parallelize it.
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS,
Runtime.getRuntime.availableProcessors().toString)
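Note also that binaryFiles accepts an explicit minPartitions argument in PySpark. A minimal sketch of combining that with a repartition after the explode (the value 200 is only illustrative, not a recommendation):

from pyspark.sql.functions import explode, split

# Sketch: ask for more input partitions up front (this cannot split a single
# file, since binaryFiles keeps each file as one record), then repartition
# after the explode so the records from the one large file get redistributed.
df = spark.sparkContext.binaryFiles(input_path, minPartitions=200) \
         .toDF(['filename', 'content'])
raw = (df.select('filename',
                 explode(split(df.content, r'(?=\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}µ)')))
         .repartition(200)
         .cache())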

Related

How many partitions does pyspark create while reading a csv of a relatively small size?

I start a pyspark session in the shell by running
pyspark --master yarn --num-executors 8 --executor-cores 2 --executor-memory 4G
The cluster starts successfully.
I have a CSV with 100,000 rows whose size on disk is 175.46 MB.
I try to read it and check the number of partitions, and the behaviour is very erratic: sometimes I get 16 partitions, sometimes fewer. If I read the same CSV again and again, the number of partitions decreases progressively; it shrinks erratically, but it always shrinks and never goes back up from the previous count.
I am reading the exact same CSV each time, yet the number of partitions keeps reducing. Can you please help me understand what is happening here? How does Spark decide the number of partitions, and why the element of seeming randomness to it?
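For reference, a hedged way to observe this yourself (the file path is a placeholder) is to print the partition count alongside the two settings that usually drive it:

# Sketch: read the CSV and print the partition count plus the settings that
# typically influence it. "/path/to/file.csv" is a placeholder.
df = spark.read.csv("/path/to/file.csv", header=True)
print(df.rdd.getNumPartitions())
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.sparkContext.defaultParallelism)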

Spark - I cannot increase number of tasks in local mode

I tried to submit my application with different combinations of coalesce(k) in my code:
Firstly, I read some data from my local disk:
val df = spark.read.option("encoding", "gbk").option("wholeFile",true).option("multiline",true).option("sep", "|+|").schema(schema).csv("file:///path/to/foo.txt")
Situation 1
I think local[*] means there are 56 cores in total. And I specify 4 * 4 = 16 tasks:
spark-submit:
spark-submit --master local[*] --class foo --driver-memory 8g --executor-memory 4g --executor-cores 4 --num-executors 4 foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").partitionBy("date").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")
But when I have a look at the Spark history server UI, there is only 1 task. In the data set, the 'date' column has only a single value.
So I tried another combination and removed partitionBy:
Situation 2
spark-submit:
spark-submit --master local[*] --class foo foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").orc("hdfs://xxxx:9000/user/hive/warehouse/ods/foo")
But the history server shows there is still only 1 task.
There are 56 cores and 256GB memory on my local machine.
I know that in local mode Spark creates one JVM for both the driver and the executor, which means we have one executor with the number of cores of our machine (let's say 56) if we run it with local[*].
Here are the questions:
Could anyone explain why my task number is always 1?
How can I increase the number of tasks so that I can make use of parallelism?
Will my local file be read into different partitions?
Spark can read a CSV file only with one executor, as there is only a single file.
This is in contrast to files located in a distributed file system such as HDFS, where a single file can be stored across multiple partitions. That means your resulting dataframe df has only a single partition. You can check that using df.rdd.getNumPartitions. See also my answer on How is a Spark Dataframe partitioned by default?
Note that coalesce will only collapse partitions onto the same worker, so calling coalesce(16) will not have any impact at all, as the single partition of your dataframe is already located on a single worker.
In order to increase parallelism you may want to use repartition(16) instead.
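A minimal sketch of the difference, assuming the single-partition df from the question:

# Sketch: coalesce can only merge existing partitions, so a one-partition
# dataframe stays at one; repartition shuffles and actually splits the data.
print(df.rdd.getNumPartitions())                  # 1 for a single local file
print(df.coalesce(16).rdd.getNumPartitions())     # still 1
print(df.repartition(16).rdd.getNumPartitions())  # 16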

How to stop Spark Structured Streaming from filling HDFS

I have a Spark Structured Streaming task running on AWS EMR that is essentially a join of two input streams over a one-minute time window. The input streams have a 1 minute watermark. I don't do any aggregation. I write results to S3 "by hand" with a foreachBatch, and a foreachPartition per batch that converts the data to strings and writes to S3.
I would like to run this for a long time, i.e. "forever", but unfortunately Spark slowly fills up HDFS storage on my cluster and eventually dies because of this.
There seem to be two types of data that accumulate: logs in /var, and .delta and .snapshot files in /mnt/tmp/.../. They don't get deleted when I kill the task with CTRL+C (or, when using YARN, with yarn application -kill) either; I have to delete them manually.
I run my task with spark-submit. I tried adding
--conf spark.streaming.ui.retainedBatches=100 \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
--conf spark.cleaner.periodicGC.interval=15min \
--conf spark.rdd.compress=true
without effect. When I add --master yarn, the paths where the temporary files are stored change a bit, but the problem of them accumulating over time persists. Adding --deploy-mode cluster seems to make the problem worse, as more data seems to be written.
I used to have a Trigger.ProcessingTime("15 seconds") in my code, but removed it after reading that Spark might fail to clean up after itself if the trigger interval is too short compared to the compute time. This seems to have helped a bit; HDFS fills up more slowly, but temporary files are still piling up.
If I don't join the two streams, but just select on both and union the results to write them to S3, the accumulation of cruft in /mnt/tmp doesn't happen. Could it be that my cluster is too small for the input data?
I would like to understand why Spark is writing these temp files, and how to limit the space they consume. I would also like to know how to limit the amount of space consumed by logs.
Spark fills HDFS with logs because of https://issues.apache.org/jira/browse/SPARK-22783
One needs to set spark.eventLog.enabled=false so that no logs are created.
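A minimal sketch of one place to set this, assuming you build the SparkSession yourself in PySpark (the app name is a placeholder):

from pyspark.sql import SparkSession

# Sketch: disable the event log so it no longer accumulates on HDFS.
spark = (SparkSession.builder
         .appName("streaming-join")  # placeholder name
         .config("spark.eventLog.enabled", "false")
         .getOrCreate())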
In addition to adrianN's answer: on the EMR side, they retain application logs on HDFS - see https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/

Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the Spark shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, slowly drops down to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and to find possible ways to avoid it. This drop and spike happens only in the read stage; all intermediate stages run in parallel without such huge spikes.
I'll be happy to provide any missing information.

Spark tasks one more than number of partitions

I am trying to do a simple count and group by on a Spark dataset.
However, each time one of the stages gets stuck at something like (200/201, 1 running).
I have retried with several partition settings ranging from 1000 to 6000. Each time I get stuck on a stage showing (1000/1001, 1 running) or (6000/6001, 1 running) in the status bar.
Kindly help me understand where this extra 1 task is being spawned from.
The spark-submit options are as below :
--conf spark.dynamicAllocation.enabled=false --conf spark.kryoserializer.buffer.max=2000m --conf spark.shuffle.service.enabled=true --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.default.parallelism=3000 --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=6000 --conf spark.driver.memory=30g --conf spark.yarn.maxAppAttempts=1 --conf spark.driver.cores=6 --num-executors 80 --executor-cores 5 --executor-memory 40g
The number of Spark shuffle partitions is huge. Spark writes a file to disk for each shuffle partition. That may take a lot of time if you have such a large number of partitions as well as shuffle partitions. You could try reducing both the default parallelism and the shuffle partitions.
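A hedged sketch of what reducing the shuffle partitions could look like at runtime (600 is only an illustrative value):

# Sketch: lower the number of shuffle partitions for subsequent stages.
spark.conf.set("spark.sql.shuffle.partitions", "600")
# spark.default.parallelism is fixed once the SparkContext exists, so it would
# have to be lowered on spark-submit instead, e.g. --conf spark.default.parallelism=600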
It's hard to know without seeing your specific spark code and the input format, but the first thing I would look into is data skew in your input data.
If one task is consistently taking longer to complete, it's probably because it is significantly larger than the others. This will happen during a shuffle if one key that you are grouping by shows up far more frequently than the others, since all of its rows will end up in the same shuffled partition.
That being said, if you are literally just doing df.groupBy("key").count then Spark won't need to shuffle the values, just the intermediate counts for each key. That's why it would be helpful to see your specific code.
Another consideration is that your input format and data define the number of initial partitions, not your Spark parallelism settings. For example, if you have 10 gzipped text files, you will only ever be able to have 10 input partitions. It sounds like the stage you are seeing get stuck changes its task count with your setting changes, though, so I'm assuming it's not the first stage.
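If skew is the suspicion, a quick hedged check is to count rows per grouping key before the heavy stage ("key" is a placeholder column name):

from pyspark.sql import functions as F

# Sketch: a single key dominating the counts is the classic sign of shuffle skew.
(df.groupBy("key")
   .count()
   .orderBy(F.desc("count"))
   .show(20, truncate=False))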
