Very long spark process on EMR - apache-spark

Context
I am processing some data (5 billion rows, ~7 columns) via pyspark on EMR.
The first steps, including some joins, work as expected, up to and including a cache() (MEMORY_AND_DISK_SER). Then I filter one column for nulls and do a count() on this big dataframe.
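Roughly, the pipeline looks like the sketch below (paths, column names and the join key are placeholders, not the actual job):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.read.parquet('s3://bucket/table_a/')   # hypothetical inputs
df_b = spark.read.parquet('s3://bucket/table_b/')

joined = df_a.join(df_b, 'id', 'left')              # the joins themselves complete fine
joined = joined.persist()                           # cached as described (MEMORY_AND_DISK_SER in the original)

# this is the step that runs for hours
n = joined.filter(F.col('some_col').isNotNull()).count()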
Problem
The count then runs for hours and eventually fails with some 'no connection' error (I do not remember the exact message, and I am more interested in why it is slow than in the final error).
What I noticed
Out of my 256 vCores, 1 is always at 100% while the rest are idle. The busy one belongs to a DataNode JVM.
Configuration
I have 4 r5a.16xlarge instances, each with 4 EBS SSDs.
EMR is supposed to take care of its own configuration, and this is what I see in the Spark UI:
spark.emr.default.executor.memory 18971M
spark.driver.memory 2048M
spark.executor.cores 4
I am setting the following myself:
spark.network.timeout: 800s
spark.executor.heartbeatInterval: 60s
spark.dynamicAllocation.enabled: True
spark.dynamicAllocation.shuffleTracking.enabled: True
spark.executor.instances: 0
spark.default.parallelism: 128
spark.shuffle.spill.compress: true
spark.shuffle.compress: true
spark.rdd.compress: true
spark.storage.level: MEMORY_AND_DISK_SER
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Duser.timezone=GMT
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Duser.timezone=GMT
Question
What am I doing wrong, or what am I not understanding properly? Counting a cached dataframe that was built in 10 minutes, even with a null filter, should not take hours.
Some more details
The data source is on S3, as homogeneous Parquet files. Reading those always works fine, since the joins succeed.
During the count(), I see 200 tasks; 195 succeed within a few seconds, but 5 consistently never complete. The 5 hanging tasks are all NODE_LOCAL (though some NODE_LOCAL tasks do complete).

It's a bit hard to tell what is going wrong without looking at the resource manager.
But first, make sure that you are not 'measuring' success on non-action APIs: cache() is not an action, and neither are the joins, so those steps only look instantaneous because the actual work is deferred until the count().
My best guess would be the Spark configuration; I would run in cluster mode with these settings:
spark.default.parallelism=510
--num-executors=13 (or spark.executor.instances)
spark.executor.cores = spark.driver.cores = 5
spark.executor.memory = spark.driver.memory = 36g
spark.driver.memoryOverhead=4g
I think the problem lies with spark.executor.instances, which you have set to 0.
Otherwise, these settings (from the official AWS guide) are close to optimal for the AWS instance types you are using.
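For reference, a spark-submit invocation along those lines might look like this sketch (my_job.py is a placeholder for your application):
spark-submit \
  --deploy-mode cluster \
  --num-executors 13 \
  --executor-cores 5 \
  --driver-cores 5 \
  --executor-memory 36g \
  --driver-memory 36g \
  --conf spark.driver.memoryOverhead=4g \
  --conf spark.default.parallelism=510 \
  my_job.py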

Related

Getting BusyPoolException in com.datastax.spark.connector.writer.QueryExecutor - what am I doing wrong?

I am using spark-sql 2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.
My spark-submit / Spark cluster environment for loading 2 billion records is as follows:
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
Using the following configuration:
cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
The job takes around 2 hours, which is a really long time.
When I check the logs I see:
WARN com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException
How do I fix this?
You have an incorrect value for cassandra.concurrent.writes - it means that you are sending 1500 concurrent batches at the same time, while by default the Java driver allows only 1024 simultaneous requests. Setting this parameter too high usually overloads the Cassandra nodes and, as a result, causes task retries.
Other settings are also incorrect - if you specify cassandra.output.batch.size.rows, its value overrides cassandra.output.batch.size.bytes. See the corresponding section of the Spark Cassandra Connector reference for more details.
One aspect of performance tuning is having the correct number of Spark partitions so that you reach good parallelism - but this really depends on your code, how many nodes are in the Cassandra cluster, etc.
P.S. Also note that the configuration parameters must start with spark.cassandra., not plain cassandra. - if you specified them in the latter form, they are ignored and the defaults are used.
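As an illustration only (the concurrency value below is arbitrary, and the exact property names should be checked against the connector reference for your connector version), the properly prefixed form passed to spark-submit would look like:
--conf spark.cassandra.output.concurrent.writes=100 \
--conf spark.cassandra.output.batch.grouping.key=partition \
--conf spark.cassandra.output.consistency.level=LOCAL_QUORUM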

Spark Graphframes large dataset and memory Issues

I want to run PageRank on a relatively large graph: 3.5 billion nodes and 90 billion edges. I have been experimenting with different cluster sizes to get it to run. But first the code:
from pyspark.sql import SparkSession
import graphframes
spark = SparkSession.builder.getOrCreate()
edges_DF = spark.read.parquet('s3://path/to/edges') # 1.4TB total size
verts_DF = spark.read.parquet('s3://path/to/verts') # 25GB total size
graph_GDF = graphframes.GraphFrame(verts_DF, edges_DF)
graph_GDF = graph_GDF.dropIsolatedVertices()
result_df = graph_GDF.pageRank(resetProbability=0.15, tol=0.1)
pagerank_df = result_df.vertices
pagerank_df.write.parquet('s3://path/to/output', mode='overwrite')
I experienced high garbage collection times right from the start, so I experimented with different settings and cluster sizes. I mainly followed two articles:
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
I run the cluster on Amazon EMR. These are the relevant settings I currently use:
"spark.jars.packages": "org.apache.hadoop:hadoop-aws:2.7.6,graphframes:graphframes:0.7.0-spark2.4-s_2.11",
"spark.dynamicAllocation.enabled": "false",
"spark.network.timeout":"1600s",
"spark.executor.heartbeatInterval":"120s",
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.storage.level": "MEMORY_AND_DISK_SER",
"spark.rdd.compress": "true",
"spark.shuffle.compress": "true",
"spark.shuffle.spill.compress": "true",
"spark.memory.fraction": "0.80",
"spark.memory.storageFraction": "0.30",
"spark.serializer":"org.apache.spark.serializer.KryoSerializer",
"spark.sql.shuffle.partitions":"1216"
"yarn.nodemanager.vmem-check-enabled": "false",
"yarn.nodemanager.pmem-check-enabled": "false"
"maximizeResourceAllocation": "true"
"fs.s3.maxConnections": "5000",
"fs.s3.consistent": "true",
"fs.s3.consistent.throwExceptionOnInconsistency":"false",
"fs.s3.consistent.retryPolicyType":"fixed",
"fs.s3.consistent.retryPeriodSeconds":"10"
I experimented with cluster sizes. My first experiment that seemed to work was a cluster with the following parameters: --deploy-mode cluster --num-executors 75 --executor-cores 5 --executor-memory 36g --driver-memory 36g --driver-cores 5
With this configuration GC time was way down and everything was working, but since it was a test cluster it had relatively "little" memory, 2.7 TB in total. After a while I got ExecutorLostFailure (executor 54 exited caused by one of the running tasks) Reason: Container from a bad node, Exit status: 137, which I thought happened because I left the node too little RAM. So I reran the whole thing, this time with --executor-cores 5 --executor-memory 35g, and right away my GC problems were back and my cluster acted really weird. So I thought I understood the problem: the reason for the high GC times was insufficient memory per executor.
Next cluster I spun up was with the following parameters: --deploy-mode cluster --num-executors 179 --executor-cores 5 --executor-memory 45g --driver-memory 45g --driver-cores 5
So a larger cluster and even more memory per executor than before. Everything was running smoothly, and I noticed via Ganglia that the first step took about 5.5 TB of RAM.
I thought I understood the issue: leaving some of the cluster's cores unused and enlarging the memory of each executor makes the program faster. My guess was that it has to do with verts_DF being about 25 GB in size: this way it fits into the memory of each executor and leaves room for the calculations (25 GB * 179 is nearly 5.5 TB).
So the next cluster I spun up had the same number of nodes, but I resized the executors to: --num-executors 119 --executor-cores 5 --executor-memory 75g
Instantly all the problems were back! High GC times, the cluster was hanging, and via Ganglia I could see the RAM filling up to 8 of the 9 available TB. I was baffled.
I went back and spun up the --num-executors 179 --executor-cores 5 --executor-memory 45g cluster again, which luckily is easy to do with EMR because I could just clone it. But now this configuration did not work either: high GC times and the cluster hitting 8 TB of used memory right away.
What is going on here? It feels like I am playing roulette: sometimes the same config works and other times it does not.
In case someone still stumbles upon this after some time has passed: I realized that the problem lies in how GraphX and GraphFrames load the graph. Both try to generate all triplets of the graph they are loading, which with very large graphs results in OOM errors, because a graph with 3.5 billion nodes and 70 billion edges has an awful lot of them.
I wrote a solution by implementing PageRank in pure PySpark. It is certainly not as fast as a Scala implementation, but it scales and does not run into the described triplet problem.
I published it on GitHub:
https://github.com/thagorx/spark_pagerank
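For illustration, a minimal DataFrame-only PageRank sketch (not the implementation linked above) could look like the following; it assumes edge rows with src/dst columns and vertex rows with an id column, runs a fixed number of iterations instead of using a tolerance, and ignores dangling vertices for brevity:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
edges = spark.read.parquet('s3://path/to/edges')     # assumed columns: src, dst
verts = spark.read.parquet('s3://path/to/verts')     # assumed column: id

out_deg = edges.groupBy('src').agg(F.count('*').alias('degree'))
ranks = verts.select('id', F.lit(1.0).alias('rank'))
alpha = 0.15                                         # reset probability

for _ in range(10):                                  # fixed iteration count
    contribs = (edges
                .join(out_deg, 'src')
                .join(ranks, F.col('src') == F.col('id'))
                .select(F.col('dst').alias('id'),
                        (F.col('rank') / F.col('degree')).alias('contrib')))
    ranks = (contribs
             .groupBy('id')
             .agg((F.lit(alpha) + F.lit(1.0 - alpha) * F.sum('contrib')).alias('rank')))
    ranks = ranks.localCheckpoint()                  # cut the growing lineage between iterations

ranks.write.parquet('s3://path/to/output', mode='overwrite')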
If you are running a standalone version with pyspark and graphframes, you can launch the pyspark REPL by executing the following command:
pyspark --driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
Be sure to set the SPARK_VERSION environment variable appropriately for the Spark release you are actually running.

Spark job fails when cluster size is large, succeeds when small

I have a spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). The most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with an r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with these settings, the number of executors seen in the Spark jobs UI was 500 (one per node).
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem was that too many executors were being spawned and the driver had the overhead of tracking them all, so I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the instance type of driver to r3.8xlarge, this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have any other hypothesis on why this would happen?
This is really about understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, so 2000 cores in total. What you are doing with your request is creating 4000 executors with 3 cores each, which means you are requesting 12000 cores for your cluster, and there is nothing like that available.
This RPC timeout error is usually associated with how many JVMs you start on the same machine: the machine cannot respond in time because too much is happening at once.
Keep in mind that --num-executors should be aligned with the number of nodes you have, and the number of executor cores should be aligned with the cores available on each node.
For example, an m3.xlarge has 4 cores and 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do, but if you are going to run just one job, I suggest setting it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This will allow your job to run fine. If you have no problem fitting your data onto your workers, it is better to set spark.default.parallelism to 2000, or you are going to lose a lot of time on shuffles.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism and the memory, and your job will run like a charm.
I experimented with a lot of configurations, modifying one parameter at a time, with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So the driver is probably overwhelmed by the large amount of shuffle bookkeeping that has to be done when there are more partitions, and hence the lower partition count helps.

Spark shows different number of cores than what is passed to it using spark-submit

TL;DR
The Spark UI shows a different number of cores and a different amount of memory than what I ask for with spark-submit.
more details:
I'm running Spark 1.6 in standalone mode.
When I run spark-submit I pass it 1 executor instance with 1 core for the executor and also 1 core for the driver.
What I would expect is that my application will be run with 2 cores in total.
When I check the Environment tab in the UI, I see that it received the correct parameters I gave it; however, it still seems to be using a different number of cores. You can see it here:
This is my spark-defaults.conf that I'm using:
spark.executor.memory 5g
spark.executor.cores 1
spark.executor.instances 1
spark.driver.cores 1
Checking the Environment tab in the Spark UI shows that these are indeed the received parameters, but the UI still shows something else.
Does anyone have any idea what might cause Spark to use a different number of cores than what I pass it? I obviously tried googling it, but didn't find anything useful on the topic.
Thanks in advance
TL;DR
Use spark.cores.max instead to define the total number of cores available, and thus limit the number of executors.
In standalone mode, a greedy strategy is used and as many executors will be created as there are cores and memory available on your worker.
In your case, you specified 1 core and 5GB of memory per executor.
The following will be calculated by Spark :
As there are 8 cores available, it will try to create 8 executors.
However, as there is only 30GB of memory available, it can only create 6 executors : each executor will have 5GB of memory, which adds up to 30GB.
Therefore, 6 executors will be created, and a total of 6 cores will be used with 30GB of memory.
Spark basically fulfilled what you asked for. In order to achieve what you want, you can make use of the spark.cores.max option documented here and specify the exact number of cores you need.
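For instance, to end up with the single 1-core executor (plus the 1-core driver) that the question aims for on the 8-core / 30 GB worker described above, a spark-defaults.conf along these lines should do it (a sketch, not a tested configuration):
spark.executor.memory 5g
spark.executor.cores 1
spark.cores.max 1
spark.driver.cores 1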
A few side-notes:
spark.executor.instances is a YARN-only configuration
spark.driver.cores already defaults to 1
I am also working on making the number of executors in standalone mode easier to reason about; this might be integrated into a future release of Spark and will hopefully help you figure out exactly how many executors you are going to get, without having to calculate it on the fly.

spark scalability: what am I doing wrong?

I am processing data with Spark. It works on a day's worth of data (40G) but fails with OOM on a week's worth:
import pyspark
import datetime
import operator
sc = pyspark.SparkContext()
sqc = pyspark.sql.SQLContext(sc)
sc.union([sqc.parquetFile(hour.strftime('.....'))
             .map(lambda row: (row.id, row.foo))
          for hour in myrange(beg, end, datetime.timedelta(0, 3600))]) \
  .reduceByKey(operator.add).saveAsTextFile("myoutput")
The number of different IDs is less than 10k.
Each ID is a smallish int.
The job fails because too many executors fail with OOM.
When the job succeeds (on small inputs), "myoutput" is about 100k.
What am I doing wrong?
I tried replacing saveAsTextFile with collect (because I actually want to do some slicing and dicing in Python before saving); there was no difference in behavior, the same failure. Is this to be expected?
I used to have reduce(lambda x,y: x.union(y), [sqc.parquetFile(...)...]) instead of sc.union - which is better? Does it make any difference?
The cluster has 25 nodes with 825GB RAM and 224 cores among them.
Invocation is spark-submit --master yarn --num-executors 50 --executor-memory 5G.
A single RDD has ~140 columns and covers one hour of data, so a week is a union of 168(=7*24) RDDs.
Spark very often suffers from out-of-memory errors when scaling. In these cases, the programmer has to do some fine tuning. Also recheck your code to make sure you are not doing anything excessive, such as collecting all the big data in the driver, which is very likely to exceed the memoryOverhead limit no matter how high you set it.
To understand what is happening, you should know when YARN decides to kill a container for exceeding memory limits: that happens when the container goes beyond the memoryOverhead limit.
In the scheduler you can check the Event Timeline to see what happened to the containers. If YARN has killed a container, it will appear red, and when you hover over or click it you will see a message like:
Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So in that case, what you want to focus on is these configuration properties (values are examples on my cluster):
# More executor memory overhead
spark.yarn.executor.memoryOverhead 4096
# More driver memory overhead
spark.yarn.driver.memoryOverhead 8192
# Max on my nodes
#spark.executor.cores 8
#spark.executor.memory 12G
# For the executors
spark.executor.cores 6
spark.executor.memory 8G
# For the driver
spark.driver.cores 6
spark.driver.memory 8G
The first thing to do is to increase the memoryOverhead.
In the driver or in the executors?
When you are looking at your cluster in the UI, you can click on the Attempt ID and check the Diagnostics Info, which should mention the ID of the container that was killed. If it is the same as your AM container, then it is the driver; otherwise it is one of the executors.
That didn't resolve the issue, now what?
You have to fine tune the number of cores and the heap memory you are providing. PySpark does most of its work in off-heap memory, so you do not want to give too much space to the heap, since that would be wasted. You also do not want to give too little, because the garbage collector will then have issues. Recall that these are JVMs.
As described here, a worker can host multiple executors, so the number of cores used affects how much memory each executor gets; decreasing the number of cores might help.
I have written this up in more detail in "memoryOverhead issue in Spark" and "Spark – Container exited with a non-zero exit code 143", mostly so that I won't forget! Another option that I haven't tried here would be spark.default.parallelism and/or spark.storage.memoryFraction, which, based on my experience, didn't help.
You can pass configuration flags as sds mentioned, or like this:
spark-submit --properties-file my_properties
where "my_properties" is something like the attributes I list above.
For non-numerical values, you could do this:
spark-submit --conf spark.executor.memory='4G'
It turned out that the problem was not with Spark, but with YARN.
The solution is to run spark with
spark-submit --conf spark.yarn.executor.memoryOverhead=1000
(or modify the YARN configuration).
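If you prefer the properties-file route mentioned above, the same fix is a one-line entry in that file (a sketch; adjust the value to your job):
# my_properties
spark.yarn.executor.memoryOverhead 1000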

Resources