How to launch parallel spark jobs? - apache-spark

I don't think I understand well enough how to launch jobs.
I have one job which takes 60 seconds to finish, and I run it with the following command:
spark-submit --executor-cores 1 \
--executor-memory 1g \
--driver-memory 1g \
--master yarn \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=4 \
--conf spark.dynamicAllocation.initialExecutors=4 \
--conf spark.executor.instances=4 \
If I increase the number of partitions in the code and the number of executors, the app finishes faster, which is fine. But if I increase only executor-cores, the finish time stays the same, and I don't understand why. I expect the time to be lower than the initial time.
My second problem: if I launch the above code twice, I expect both jobs to finish in 60 seconds, but this does not happen. Both jobs finish after 120 seconds, and I don't understand why.
I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 threads). From what I saw in the default EMR configuration, YARN is set to FIFO (default) mode with the CapacityScheduler.
What do you think about these problems?

Spark creates partitions based on logic inside the data source. In your case it probably creates a number of partitions that is smaller than the number of executors * executor cores, so just increasing the cores will not make the job run faster, since the extra cores would sit idle. When you also increase the number of partitions, it can work faster.
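For example (a rough spark-shell sketch; the input path and the numbers are placeholders, not taken from your job), you can check how many partitions the source actually produced and repartition so that every core gets work:
val rdd = sc.textFile("hdfs:///some/input")   // placeholder input; sc is the spark-shell SparkContext
println(rdd.getNumPartitions)                 // how many partitions the data source actually created
// with e.g. 4 executors * 2 cores each, you need at least 8 partitions to keep every core busy
val repartitioned = rdd.repartition(8)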
When you run spark-submit twice, there is a good chance that dynamic allocation reaches the maximum number of executors before the second job starts (it takes ~4 seconds by default in your case). Depending on how YARN is configured, this might fill up all of the available resources (either because the configured number of vcores is too small or because memory fills up). If this indeed happens, the second spark-submit would not start processing until some executor is freed, meaning the total time is roughly the sum of the two.
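One way to test this (a sketch only; whether it helps depends on whether vcores or memory is what fills up first in your YARN setup) is to cap each application's dynamic allocation so that two submissions can hold executors at the same time:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "2")   // half of your current cap of 4, leaving room for the second app
The same settings can of course be passed as --conf flags on spark-submit, as in your command above.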
BTW, remember that in cluster mode the driver also takes up a container (cores and memory) on the cluster...

Related

Spark dropping executors while reading HDFS file

I'm observing behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the spark-shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory 4g \
--driver-memory 4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query would spin up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and to find possible ways to avoid it. This drop and spike happens only during the read stage; all the intermediate stages run in parallel without such huge spikes.
I'll be happy to provide any missing information.

Spark job fails when cluster size is large, succeeds when small

I have a Spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). The most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with an r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with this setting, the number of executors seen in the Spark jobs UI was 500 (one per node)
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem might be that too many executors are being spawned and the driver has the overhead of tracking all of them. I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the instance type of the driver to r3.8xlarge; this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have any other hypothesis on why this would happen?
Well, this is partly a matter of understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, so you have 2000 cores. What you are doing with your request is creating 4000 executors with 3 cores each. That means you are requesting 12,000 cores for your cluster, and there is nothing close to that available.
This RPC timeout error is regularly associated with how many JVMs you start on the same machine; the machine cannot respond in time because too much is happening at once.
You need to know that --num-executors is best tied to the number of nodes you have, and the number of cores should match the cores available on each node.
For example, an m3.xlarge has 4 cores and 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do. If you are going to run just one job, I suggest you set it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This should allow your job to run fine. If you don't have a problem fitting your data on your workers, it is better to change default.parallelism to 2000, or you are going to lose a lot of time in shuffles.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism, and the memory, and your job will run like a charm.
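For example (a sketch in SparkConf form rather than a definitive setup; the same keys can be passed as --conf flags, and the memory/parallelism values below are just the ones suggested above, not measured optima):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")        // EMR enables this by default
  .set("spark.executor.cores", "4")                      // match the 4 cores of an m3.xlarge
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "4000")
  .set("spark.default.parallelism", "2000")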
I experimented with a lot of configurations, modifying one parameter at a time, on 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So probably the driver is overwhelmed by the large number of shuffles that have to be done when there are more partitions, and hence the lower partition count helps.

Spark-submit executor memory issue

I have a 10-node cluster, 8 DNs (256 GB, 48 cores) and 2 NNs. I have a Spark SQL job being submitted to the YARN cluster. Below are the parameters I used for spark-submit.
--num-executors 8 \
--executor-cores 50 \
--driver-memory 20G \
--executor-memory 60G \
As can be seen above, executor-memory is 60 GB, but when I check the Spark UI it shows 31 GB.
1) Can anyone explain to me why it is showing 31 GB instead of 60 GB?
2) Also, please help in setting optimal values for the parameters mentioned above.
I think the allocated memory gets divided into two parts:
1. Storage (caching dataframes/tables)
2. Processing (the one you can see)
31 GB is the memory available for processing.
Play around with the spark.memory.fraction property to increase/decrease the memory available for processing.
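As a rough sketch of where a number like 31 GB can come from (assuming Spark 2.x's unified memory manager and the default spark.memory.fraction of 0.6; neither is stated in the question):
// The executors page roughly shows (JVM heap - 300 MB reserved) * spark.memory.fraction,
// and the JVM reports a heap somewhat smaller than the -Xmx you request.
val requestedGb   = 60.0
val jvmReportedGb = requestedGb * 0.95            // assumption: the JVM reports a bit less than -Xmx
val unifiedGb     = (jvmReportedGb - 0.3) * 0.6   // subtract 300 MB reserved, then apply spark.memory.fraction
// unifiedGb lands in the low-to-mid 30s of GB, the same ballpark as the 31 GB shown in the UI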
I would suggest reducing the executor cores to about 8-10.
My configuration:
spark-shell --executor-memory 40g --executor-cores 8 --num-executors 100 --conf spark.memory.fraction=0.2

Spark job randomly hangs in the middle of a stage while reading data

I have a Spark job which reads data, transforms it (shuffle involved) and writes the data back to disk. Different instances of the same Spark job are used to process separate data in parallel (each has its own input/output dir). Some of the jobs, so far roughly 3 out of 200, got stuck in the middle of the reading stage. By stuck I mean that no tasks finish after some point, there is no progress in the stage, and there are no new error logs from executors in the UI; a job can run for half an hour and then it just stops making progress. When I rerun the whole set of jobs, everything can be fine, or some other jobs can hang again, this time different ones (another in/out dir). We use Spark 1.6.0 (CDH 5.8). We use dynamic allocation, and such a job can eat even more resources after it is already "stuck". Any idea what can be done in such situations?
I start the jobs using these properties:
--master yarn-cluster
--driver-memory 8g
--executor-memory 4g
--conf spark.yarn.executor.memoryOverhead=1024
--conf spark.dynamicAllocation.maxExecutors=2200
--conf spark.yarn.maxAppAttempts=2
--conf spark.dynamicAllocation.enabled=true
UPDATE
Disabling dynamic allocation seems to have solved the issue; we are going to keep running our jobs for another several days before concluding that this was really the reason.
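Disabling it amounts to something like the following (a sketch; the fixed executor count is just a placeholder, not the value we actually used):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "200")   // placeholder: a fixed executor count sized to the queue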

Getting OOM when fetching more than 1,000,000 rows in apache-spark

Problem:
I want to query my table, which is stored in Hive, through the SparkSQL JDBC interface,
and fetch more than 1,000,000 rows, but I hit an OOM.
sql = "select * from TEMP_ADMIN_150601_000001 limit XXX ";
My Env:
5 nodes = one master + 4 workers, 1000M network switch, Red Hat 6.5
Each node: 8G RAM, 500G Harddisk
Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13
Data:
A table with users and their electricity charge data.
About 1,600,000 Rows. About 28MB.
Each row occupies about 18 bytes.
2 columns: user_id String, total_num Double
Repro Steps:
1. Start Spark
2. Start SparkSQL thriftserver, command:
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh \
--master spark://cx-spark-001:7077 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=2g \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.shuffle.manager=sort \
--conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
--conf spark.file.transferTo=false \
--conf spark.akka.timeout=2000 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.cores.max=8 \
--conf spark.kryoserializer.buffer.mb=256 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.akka.frameSize=512 \
--driver-class-path /usr/local/hive/lib/classes12.jar
Run the test code; see it in the attached file testHiveJDBC.java.
I get OOM: GC overhead limit exceeded, or OOM: Java heap space, or the worker heartbeat is lost after 120s. See the attached logs.
Preliminary diagnose:
6. When fetching fewer than 1,000,000 rows, it always succeeds.
7. When fetching more than 1,300,000 rows, it always fails with OOM: GC overhead limit exceeded.
8. When fetching about 1,040,000-1,200,000 rows, it succeeds most of the time if I query right after the thrift server starts up. If I query successfully once and then retry the same query, it fails.
9. There are 3 failure patterns: OOM: GC overhead limit exceeded, OOM: Java heap space, or lost worker heartbeat after 120s.
10. I tried starting the thrift server with different configurations, giving the workers 4 GB or 2 GB of memory, and got the same behavior. That means that regardless of the workers' total memory, I can fetch fewer than 1,000,000 rows but cannot fetch more than 1,300,000 rows.
Preliminary conclusions:
11. The total data is less than 30 MB. It is that small, and there is no complex computation involved,
so the failure is not caused by excessive memory requirements.
So I guess there is some defect in the Spark SQL code.
12. Allocating 2 GB or 4 GB of memory to each worker gives the same behavior.
This strengthens my doubt: there is some defect in the code, but I can't find the specific location.
The Spark workers send all task results to the driver program (the ThriftServer), and the driver program collects all of them into an org.apache.spark.sql.Row[TASK_COUNT][ROW_COUNT] array.
This is the root cause of the ThriftServer OOM.
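To illustrate the difference (a standalone sketch using the newer SparkSession API for brevity, not the thrift server code itself; only the table name is taken from the question):
// Pulling everything to the driver at once is what blows up:
val df = spark.sql("select * from TEMP_ADMIN_150601_000001")
val allRows = df.collect()            // materializes every row in driver memory -> OOM for large results
// Streaming the result partition by partition keeps the driver footprint small:
val it = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // handle one row at a time here
}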
Something you could additionally try is to set spark.sql.thriftServer.incrementalCollect to true. The effects are described pretty nicely in https://issues.apache.org/jira/browse/SPARK-25224!
