Spark tasks one more than number of partitions

I am trying to do a simple count and group-by on a Spark dataset.
However, each time one of the stages gets stuck at something like (200/201 1 running).
I have retried with several partition counts ranging from 1000 to 6000. Each time I am stuck in a stage showing (1000/1001 1 running) or (6000/6001 1 running) in the status bar.
Can anyone help me figure out where this extra 1 task is being spawned from?
The spark-submit options are as below:
--conf spark.dynamicAllocation.enabled=false --conf spark.kryoserializer.buffer.max=2000m --conf spark.shuffle.service.enabled=true --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.default.parallelism=3000 --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=6000 --conf spark.driver.memory=30g --conf spark.yarn.maxAppAttempts=1 --conf spark.driver.cores=6 --num-executors 80 --executor-cores 5 --executor-memory 40g

The number of Spark shuffle partitions is huge. Spark writes files to disk for each shuffle partition, which can take a lot of time when you have such a large number of partitions as well as shuffle partitions. Try reducing both the default parallelism and the shuffle partitions.
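As a rough sketch of doing that from inside the application rather than on the command line (the table and column names below are made up, and 400 is just an example value):

from pyspark.sql import SparkSession

# Sketch: lower both parallelism settings before the shuffle-heavy aggregation.
# "events" and "event_type" are placeholder names; tune 400 to your data size.
spark = (SparkSession.builder
         .appName("group-by-count")
         .config("spark.sql.shuffle.partitions", "400")   # instead of 6000
         .config("spark.default.parallelism", "400")      # instead of 3000
         .getOrCreate())

counts = spark.table("events").groupBy("event_type").count()
counts.show()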

It's hard to know without seeing your specific Spark code and the input format, but the first thing I would look into is data skew in your input data.
If one task is consistently taking longer to complete, it's probably because it is significantly larger than the others. This will happen during a shuffle if one of the keys you are grouping by shows up far more frequently than the others, since all of its rows will end up in the same shuffled partition.
That being said, if you are literally just doing df.groupBy("key").count then Spark won't need to shuffle the values, just the intermediate counts for each key. That's why it would be helpful to see your specific code.
Another consideration is that your input format and data define the number of initial partitions, not your Spark parallelism settings. For example, if you have 10 gzipped text files, you will only ever be able to have 10 input partitions. It sounds like the task count of the stuck stage changes with your setting changes, though, so I'm assuming it's not the first stage.
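If you want to check for skew, a quick sketch (here df and the column name "key" are placeholders for your actual data) is to count rows per grouping key and look at the heaviest ones:

from pyspark.sql import functions as F

# Sketch: every row with the same key lands in the same shuffle partition,
# so a key that is orders of magnitude more frequent will stall one task.
key_counts = df.groupBy("key").count()
key_counts.orderBy(F.desc("count")).show(20)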

Related

Spark Cassandra connector control number of reads per sec

I am running a Spark application which performs a direct join on a Cassandra table.
I am trying to control the number of reads per second, so that the long-running job doesn't impact the overall database.
Here are my configuration parameters:
--conf spark.cassandra.concurrent.reads=2
--conf spark.cassandra.input.readsPerSec=2
--conf spark.executor.cores=1
--conf spark.executor.instances=1
--conf spark.cassandra.input.fetch.sizeInRows=1500
I know I won't read more than 1500 rows from each partition.
However, in spite of all these thresholds, reads per second are reaching 200-300.
Is there any other flag or configuration that needs to be turned on?
It seems that CassandraJoinRDD has a bug in throttling with spark.cassandra.input.readsPerSec; see https://datastax-oss.atlassian.net/browse/SPARKC-627 for details.
In the meantime, use spark.cassandra.input.throughputMBPerSec to throttle your join. Note that the throttling is based on the RateLimiter class, so it won't kick in immediately (you need to read at least throughputMBPerSec of data before throttling starts). This is something that may be improved in the SCC.
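As a sketch, that throttle is passed the same way as your other connector settings; the value below is only an example and should be tuned to your cluster:
--conf spark.cassandra.input.throughputMBPerSec=1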

Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for spark-shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops down to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and looking for possible ways to avoid it. This drop and spike happens only in the read stage; all the intermediate stages run in parallel without such huge spikes.
I'll be happy to provide any missing information.

Spark executor hangs on binaryFiles read

We use Spark 2.1.0 on YARN for batch processing of multiline records.
Our job is written in PySpark and runs once every day. The input folder contains ~45,000 very small files (1 kB-100 kB each), for a total of ~2 GB.
Every file contains a different number of multiline records. The first line of a record has a standard pattern: a timestamp followed by a Greek µ and some other info. For example:
28/09/2018 08:54:22µfirst record metadata
first record content with
undefined
number of
lines
28/09/2018 08:57:12µsecond record metadata
second record content
with a different
number of lines
This is how we read the files into our DataFrame:
from pyspark.sql.functions import explode, split

df = spark.sparkContext.binaryFiles(input_path).toDF(['filename', 'content'])
raw = df.select('filename', explode(split(df.content, r'(?=\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}µ)'))).cache()
The binaryFiles line produces a DataFrame with one entry per file; the select/explode line produces a DataFrame with one entry per record. The DataFrame is then cached and other operations are performed.
We are currently testing the solution, and this is the current deployment configuration for the job (the memory requirements, however, are oversized):
spark2-submit --master yarn \
--conf spark.kryoserializer.buffer.max=1g \
--deploy-mode cluster \
--driver-memory 16g \
--driver-cores 1 \
--conf spark.yarn.driver.memoryOverhead=1g \
--num-executors 20 \
--executor-memory 16g \
--executor-cores 1 \
--conf spark.yarn.executor.memoryOverhead=1g \
spark_etl.py
The job runs fine almost every day, performing all its operations in 10-15 minutes and writing results to HDFS.
The problem is that once every 7-10 days one of the ~45,000 input files has a completely different size from the others: 100 MB to 1 GB (less than 2 GB, anyway). In this case our job (in particular, one of the executors) hangs and seems to be doing nothing the entire time: there are no new log lines after the first few minutes. It takes hours, and we have never seen one of these jobs finish, because we have to kill them after a few hours. We suspect the "big" file is the cause; in fact, the job runs fine if we remove it from the input folder.
The PySpark documentation notes: "Small files are preferred, large file is also allowable, but may cause bad performance." We can accept some performance degradation, but we don't think this is the case, because it seems to us that the job is simply doing nothing the whole time.
Is a 200 MB file really a "large file" from Spark's point of view? If so, how can we improve the performance of our job, or at least understand whether it is actually doing something?
Thank you
Maybe you should increase your executor-cores number. binaryFiles creates a BinaryFileRDD, and BinaryFileRDD determines the number of partitions from the number of CPU processors:
// setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
// traversing a large number of directories and files. Parallelize it.
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS,
Runtime.getRuntime.availableProcessors().toString)
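If you would rather not rely on the processor count, binaryFiles also accepts a minPartitions hint, and you can repartition right after the explode. Here is a sketch based on the question's own code (the value 200 is arbitrary):

from pyspark.sql.functions import explode, split

# Sketch: ask for more input partitions up front and spread the exploded
# records out before caching. 200 is an arbitrary example value.
df = spark.sparkContext.binaryFiles(input_path, minPartitions=200).toDF(['filename', 'content'])
raw = (df.select('filename',
                 explode(split(df.content, r'(?=\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}µ)')))
         .repartition(200)
         .cache())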

Spark job fails when cluster size is large, succeeds when small

I have a Spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). The most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with an r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with this setting, the number of executors shown in the Spark jobs UI was 500 (one per node).
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem was that too many executors were being spawned and the driver had the overhead of tracking them all. I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the driver's instance type to r3.8xlarge; this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have another hypothesis for why this would happen?
This is really a matter of understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, i.e. 2,000 cores. What you are doing with your request is creating 4,000 executors with 3 cores each, which means you are asking for 12,000 cores from your cluster, and nothing like that exists.
This kind of RPC timeout error is regularly associated with how many JVMs you start on the same machine; the machine cannot respond in time because too much is happening at once.
You need to know that --num-executors is better tied to the number of nodes, and the number of executor cores should match the cores you have in each node.
For example, an m3.xlarge has 4 cores and 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do. If you are going to run just one job, I suggest you set it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This should allow your job to run fine; if you have no problem fitting your data onto your workers, it is better to change default.parallelism to 2000, or you are going to lose a lot of time on shuffle.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism and the memory, and your job will run like a charm.
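A sketch of what that could look like (the memory and parallelism values are carried over from above and are only examples; dynamic allocation also needs the external shuffle service):
spark-submit --deploy-mode client --master yarn-client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --executor-memory 10g --executor-cores 4 \
  --conf spark.default.parallelism=2000 \
  --conf spark.yarn.executor.memoryOverhead=4000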
I experimented with a lot of configurations, modifying one parameter at a time, on 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So the driver is probably overwhelmed by the large number of shuffles that have to be tracked when there are more partitions, and hence the lower partition count helps.

Spark job randomly hangs in the middle of a stage while reading data

I have a Spark job which reads data, transforms it (a shuffle is involved) and writes the data back to disk. Different instances of the same Spark job are used to process separate data in parallel (each has its own input/output dir). Some of the jobs, so far roughly 3 out of 200, got stuck in the middle of the reading stage. By stuck I mean no tasks finish after some point, there is no progress in the stage, and there are no new executor error logs in the UI; a job can run for half an hour and then it simply stops making progress. When I rerun the whole set of jobs, everything can be fine, or some other jobs (with other in/out dirs) can hang again. We use Spark 1.6.0 (CDH 5.8). We use dynamic allocation, and such a job can eat up more resources after it is already "stuck". Any idea what can be done in such situations?
I start jobs using this properties:
--master yarn-cluster
--driver-memory 8g
--executor-memory 4g
--conf spark.yarn.executor.memoryOverhead=1024
--conf spark.dynamicAllocation.maxExecutors=2200
--conf spark.yarn.maxAppAttempts=2
--conf spark.dynamicAllocation.enabled=true
UPDATE
Disabling dynamic allocation seems to have solved the issue; we are going to keep running our jobs for several more days to confirm that this was really the reason.
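For reference, a sketch of what the submit properties might look like with dynamic allocation turned off and a fixed executor count (the executor number below is only an example):
--master yarn-cluster
--driver-memory 8g
--executor-memory 4g
--conf spark.yarn.executor.memoryOverhead=1024
--conf spark.dynamicAllocation.enabled=false
--num-executors 500
--conf spark.yarn.maxAppAttempts=2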
