The Spark cluster (Spark 2.2) is used by around 30 people via spark-shell and Tableau (10.4). Once a day the Thriftserver gets killed or freezes because the JVM has too much garbage to collect. These are the error messages that I can find in the Thriftserver log file:
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR TaskSchedulerImpl: Lost executor 2 on XXX.XXX.XXX.XXX: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Exception in thread "HiveServer2-Handler-Pool: Thread-152" java.lang.OutOfMemoryError: Java heap space
General information:
The Thriftserver is started with the following options (copied from the web-ui of the master -> sun.java.command):
org.apache.spark.deploy.SparkSubmit --master spark://bd-master:7077 --conf spark.driver.memory=6G --conf spark.driver.extraClassPath=--hiveconf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --executor-memory 12G --total-executor-cores 12 --supervise --driver-cores 2 spark-internal hive.server2.thrift.bind.host bd-master --hiveconf hive.server2.thrift.port 10001
The Spark standalone cluster has 48 cores and 240 GB of memory across 6 machines. Every machine has 8 cores and 64 GB of memory. Two of them are virtual machines.
The users are querying a Hive table backed by a 1.6 GB CSV file that is replicated on all machines.
Is there something I have done wrong that allows Tableau to kill the Thriftserver? Is there any other information I could provide that would help you to help me?
We are able to bypass this issue by setting:
spark.sql.thriftServer.incrementalCollect=true
With this parameter set to true, the Thriftserver sends the result to the requester one partition at a time. This reduces the peak memory the Thriftserver needs while collecting and returning the result.
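For reference, a minimal sketch of one way to pass the setting when starting the Thriftserver via the bundled start-thriftserver.sh script (host and port are taken from the command above; resource flags are omitted and your paths may differ):

$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://bd-master:7077 \
  --conf spark.sql.thriftServer.incrementalCollect=true \
  --hiveconf hive.server2.thrift.bind.host=bd-master \
  --hiveconf hive.server2.thrift.port=10001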
Related
I am learning Spark and trying to execute a simple wordcount application. I am using:
Spark version 2.4.7 (spark-2.4.7-bin-hadoop2.7)
Scala 2.12
Java 8
The Spark cluster, with 1 master and 2 worker nodes, is running as a standalone cluster.
The Spark config is:
spark.master spark://localhost:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 500M
master start script is ${SPARK_HOME}/sbin/start-master.sh
slave start script is ${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 1 -m 50M
I want to start the driver in cluster mode
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-memory 500M --driver-cores 8 --executor-memory 50M --executor-cores 4 <absolut path to the jar file having code>
Note: The completed driver/apps are the ones I had to kill
I used the above parameters after reading the Spark documentation and checking several blogs.
But after I submit the job, the driver does not run; it always shows the worker as none. I have read multiple blogs and checked the documentation to find out how to submit a job in cluster mode, and I tweaked different spark-submit parameters, but it does not execute. The interesting thing is that when I submit in client mode, it works.
Can you help me in fixing this issue?
Take a look at the CPU and memory configuration of your workers and your driver.
Your application requires 500 MB of RAM and one CPU core to run the driver, plus 50 MB and one core to run computational jobs, so you need 550 MB of RAM and two cores. These resources are provided by a worker when you run your driver in cluster mode. But each worker is only allowed to use one CPU core and 50 MB of RAM (the -c 1 -m 50M options above), so the resources the worker has are not enough to execute your driver.
You have to give your Spark cluster as many resources as your workload actually needs:
Worker Cores >= Driver Cores + Executor Cores
Worker Memory >= Driver Memory + Executor Memory
Perhaps you have to increase the amount of memory for both the driver and the executor. Try running the worker with 1 GB of memory and passing 512 MB for both --driver-memory and --executor-memory.
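A minimal sketch of what that could look like, assuming the same standalone master on localhost and giving the worker 2 cores so it can host both the driver and one executor (the exact numbers are only illustrative, and the jar path stays a placeholder):

${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 2 -m 1G
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster \
  --driver-memory 512M --driver-cores 1 --executor-memory 512M --executor-cores 1 \
  <absolute path to the jar file having the code>

With these numbers, Worker Memory (1 GB) >= Driver Memory + Executor Memory (512 MB + 512 MB) and Worker Cores (2) >= Driver Cores (1) + Executor Cores (1), so the single worker can run both the driver and the executor.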
I'm provisioning a Google Cloud Dataproc cluster in the following way:
gcloud dataproc clusters create spark --async --image-version 1.2 \
--master-machine-type n1-standard-1 --master-boot-disk-size 10 \
--worker-machine-type n1-highmem-8 --num-workers 4 --worker-boot-disk-size 10 \
--num-worker-local-ssds 1
Launching a Spark application in yarn-cluster mode with
spark.driver.cores=1
spark.driver.memory=1g
spark.executor.instances=4
spark.executor.cores=8
spark.executor.memory=36g
will only ever launch 3 executor instances instead of the requested 4, effectively "wasting" a full worker node which seems to be running only the driver. Reducing spark.executor.cores from 8 to 7 to "reserve" a core on a worker node for the driver does not help either.
What configuration is required to be able to run the driver in yarn-cluster mode alongside executor processes, making optimal use of the available resources?
An n1-highmem-8 using Dataproc 1.2 is configured to have 40960m allocatable per YARN NodeManager. Instructing Spark to use 36g of heap memory per executor also adds 3.6g of memoryOverhead (0.1 * heap memory), and YARN rounds the resulting container up to the full 40960m.
The driver uses 1g of heap plus 384m of memoryOverhead (the minimum value), which YARN allocates as 2g. Since the driver always launches before the executors, its memory is allocated first. When a 40960m allocation request for an executor then comes in, the driver's node no longer has that much memory available, so no executor container is allocated on the same node as the driver.
Using spark.executor.memory=34g will allow the driver and executor to run on the same node.
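A rough back-of-the-envelope check, assuming the default memoryOverhead of max(0.1 * heap, 384m) and the 40960m NodeManager limit quoted above:

driver container: 1024m heap + 384m overhead, allocated as 2048m, leaving 38912m on that node
executor at 36g: 36864m heap + ~3686m overhead = ~40550m, rounded up to 40960m, which no longer fits next to the driver
executor at 34g: 34816m heap + ~3482m overhead = ~38298m, which fits in the remaining 38912m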
We are running Spark in cluster mode. As per the Spark documentation, we can set spark.executor.memory=3g to change the executor memory size, or we can pass spark-shell --executor-memory 3g. But with both approaches, when I check the Spark UI it shows each executor having 530 MB of memory. Any ideas how to get more than 530 MB per executor?
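For reference, a minimal sketch of the two ways mentioned above (the file path assumes a default Spark layout and may differ on your installation):

# in $SPARK_HOME/conf/spark-defaults.conf
spark.executor.memory  3g

# or directly on the command line
$SPARK_HOME/bin/spark-shell --executor-memory 3g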
I launch a Python Spark program like this:
/usr/lib/spark/bin/spark-submit \
--master yarn \
--executor-memory 2g \
--driver-memory 2g \
--num-executors 2 --executor-cores 4 \
my_spark_program.py
I get the error:
Required executor memory (2048+4096 MB) is above the max threshold
(5760 MB) of this cluster! Please check the values of
'yarn.scheduler.maximum-allocation-mb' and/or
'yarn.nodemanager.resource.memory-mb'.
This is a brand new EMR 5 cluster with one m3.2xlarge master node and two m3.xlarge core nodes. Everything should be set to the defaults. I am currently the only user, running only one job on this cluster.
If I lower executor-memory from 2g to 1500m, it works. This seems awfully low. An EC2 m3.xlarge server has 15 GB of RAM. These are Spark worker/executor machines; they have no other purpose, so I would like to use as much of that memory as possible for Spark.
Can someone explain how I go from having an EC2 worker instance with 15GB to being able to assign a Spark worker only 1.5GB?
On http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html I see that the EC2 m3.xlarge default for yarn.nodemanager.resource.memory-mb is 11520 MB, or 5760 MB with HBase installed. I'm not using HBase, but I believe it is installed on my cluster. Would removing HBase free up a lot of memory? Is the yarn.nodemanager.resource.memory-mb setting the most relevant one for available memory?
When I pass --executor-memory to spark-submit, is that per core or for the whole worker?
In the error Required executor memory (2048+4096 MB), the first value (2048) is what I pass to --executor-memory, and I can change it and see the error message change accordingly. What is the second value, 4096 MB? How can I change it? Should I change it?
I tried to post this issue to the AWS developer forum (https://forums.aws.amazon.com/forum.jspa?forumID=52), but I get the error "Your message quota has been reached. Please try again later." even though I haven't posted anything. Why would I not have permission to post a question there?
Yes, if HBase is installed it will use quite a bit of memory by default. You should not install it on your cluster unless you need it.
Your error would make sense if there were only one core node: 6 GB (4 GB for the two executors, 2 GB for the driver) would be more memory than your resource manager has to allocate. With two core nodes, you should actually be able to allocate three 2 GB executors: one on the node with the driver and two on the other node.
In general, this sheet could help make sure you get the most out of your cluster.
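As a quick sanity check, the arithmetic implied by the numbers in the error message itself (the second figure, 4096 MB, appears to be the executor memory overhead configured on the cluster, and 5760 MB is the HBase-adjusted per-node cap quoted above):

executor-memory 2048m: 2048 + 4096 = 6144 MB, which is above the 5760 MB cap, so the request is rejected
executor-memory 1500m: 1500 + 4096 = 5596 MB, which fits under the 5760 MB cap, so the request is accepted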
I have written a Spark job which seems to work fine for almost an hour; after that, executors start getting lost because of timeouts, and I see the following in the log:
15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms
I don't see any errors, only the above warning; because of it the executor gets removed by YARN, and I then see "RPC client disassociated" errors, IOException (connection refused), and FetchFailedException.
After an executor gets removed, I see it being added again and it starts working, and then some other executors fail again. My question is: is it normal for executors to get lost? What happens to the tasks the lost executors were working on? My Spark job keeps running since it is long (around 4-5 hours), and I have a very good cluster with 1.2 TB of memory and a good number of CPU cores.
To solve the above timeout issue I tried increasing spark.akka.timeout to 1000 seconds, but no luck. I am using the following command to run my Spark job. I am new to Spark. I am using Spark 1.4.1.
./spark-submit --class com.xyz.abc.MySparkJob --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 4g --master yarn-client --executor-memory 25G --executor-cores 8 --num-executors 5 --jars /path/to/spark-job.jar
What might be happening is that the slaves cannot launch executors anymore due to memory issues. Look for messages like the following in the master logs:
15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5 because it is EXITED
15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9 on worker worker-20150713153302-192.168.122.229-59013
15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms) ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited with code 1),Some(1)) from Actor[akka.tcp://sparkWorker#192.168.122.229:59013/user/Worker#-83763597]
You might find some detailed Java errors in the worker's log directory, and maybe a file of this type: work/app-id/executor-id/hs_err_pid11865.log.
See http://pastebin.com/B4FbXvHR
This issue might be resolved by how your application manages its RDDs, not by increasing the size of the JVM heap.
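A minimal sketch of where those files usually live on a worker host, assuming the default standalone layout under $SPARK_HOME (your install may use different directories):

# daemon logs for the master and workers
ls $SPARK_HOME/logs/
# per-application executor directories, including any JVM crash reports
ls $SPARK_HOME/work/app-*/*/
ls $SPARK_HOME/work/app-*/*/hs_err_pid*.log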