Spark Graphframes large dataset and memory Issues - apache-spark

I want to run PageRank on a relatively large graph: 3.5 billion nodes and 90 billion edges. I have been experimenting with different cluster sizes to get it to run. But first the code:
from pyspark.sql import SparkSession
import graphframes
spark = SparkSession.builder.getOrCreate()
edges_DF = spark.read.parquet('s3://path/to/edges') # 1.4TB total size
verts_DF = spark.read.parquet('s3://path/to/verts') # 25GB total size
graph_GDF = graphframes.GraphFrame(verts_DF, edges_DF)
graph_GDF = graph_GDF.dropIsolatedVertices()
result_df = graph_GDF.pageRank(resetProbability=0.15, tol=0.1)
pagerank_df = result_df.vertices
pagerank_df.write.parquet('s3://path/to/output', mode='overwrite')
I experienced high garbage collection times right from the start, so I experimented with different settings and sizes for the cluster. I mainly followed two articles:
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
I run the cluster on Amazon EMR. These are the relevant settings I currently use:
"spark.jars.packages": "org.apache.hadoop:hadoop-aws:2.7.6,graphframes:graphframes:0.7.0-spark2.4-s_2.11",
"spark.dynamicAllocation.enabled": "false",
"spark.network.timeout":"1600s",
"spark.executor.heartbeatInterval":"120s",
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.storage.level": "MEMORY_AND_DISK_SER",
"spark.rdd.compress": "true",
"spark.shuffle.compress": "true",
"spark.shuffle.spill.compress": "true",
"spark.memory.fraction": "0.80",
"spark.memory.storageFraction": "0.30",
"spark.serializer":"org.apache.spark.serializer.KryoSerializer",
"spark.sql.shuffle.partitions":"1216"
"yarn.nodemanager.vmem-check-enabled": "false",
"yarn.nodemanager.pmem-check-enabled": "false"
"maximizeResourceAllocation": "true"
"fs.s3.maxConnections": "5000",
"fs.s3.consistent": "true",
"fs.s3.consistent.throwExceptionOnInconsistency":"false",
"fs.s3.consistent.retryPolicyType":"fixed",
"fs.s3.consistent.retryPeriodSeconds":"10"
I experimented with cluster sizes. My first experiment that seemed to work was
a cluster with the following parameters: --deploy-mode cluster --num-executors 75 --executor-cores 5 --executor-memory 36g --driver-memory 36g --driver-cores 5
With this configuration GC time was way down and everything was working, but since it was a test cluster it had relatively "little" memory, 2.7 TB in total. After a while I also got ExecutorLostFailure (executor 54 exited caused by one of the running tasks) Reason: Container from a bad node, Exit status: 137, which I thought happened because I left the nodes too little RAM. So I reran the whole thing, this time with --executor-cores 5 --executor-memory 35g, and right away my GC problems were back and my cluster acted really weird. So I thought I understood the problem: the reason for the high GC times was nothing other than insufficient memory per executor.
The next cluster I spun up had the following parameters: --deploy-mode cluster --num-executors 179 --executor-cores 5 --executor-memory 45g --driver-memory 45g --driver-cores 5
So a larger cluster with even more memory per executor than before. Everything was running smoothly, and I noticed via Ganglia that the first step took about 5.5 TB of RAM.
I thought I understood the issue: using fewer of the cores available to my cluster and enlarging the memory of each executor makes the program faster. I guessed that it has to do with verts_DF being about 25 GB in size; this way it would fit into the memory of each executor and leave room for the calculations (25 GB * 179 is nearly 5.5 TB).
So the next cluster I spun up had the same number of nodes, but I resized the executors to: --num-executors 119 --executor-cores 5 --executor-memory 75g
Instantly all the problems were back! High GC times, the cluster was hanging, and via Ganglia I could see the RAM filling up to 8 of the 9 available TB. I was baffled.
I went back and spun up the --num-executors 179 --executor-cores 5 --executor-memory 45g cluster again, which luckily is easy to do with EMR because I could just clone it. But now even this configuration did not work: high GC times and the cluster hitting 8 TB of used memory right away.
What is going on here? It feels like I am playing roulette: sometimes the same config works and other times it does not.

If someone still stumbles upon this after some time has passed: I realized that the problem lies with how GraphX and GraphFrames load the graph. Both try to generate all triplets of the graph they are loading, which with very large graphs results in OOM errors, because a graph with 3.5 billion nodes and 70 billion edges has an awful lot of them.
I wrote a solution by implementing PageRank directly in PySpark. It is certainly not as fast as a Scala implementation, but it scales and does not run into the described triplet problem.
I published it on GitHub:
https://github.com/thagorx/spark_pagerank
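For illustration, here is a minimal sketch of the idea behind a DataFrame-only PageRank (power iteration via joins); this is not the repository's actual code, and the column names src, dst and id, the fixed iteration count, and the checkpoint path are assumptions:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('s3://path/to/checkpoints')  # assumed location

edges = spark.read.parquet('s3://path/to/edges')      # assumed columns: src, dst
vertices = spark.read.parquet('s3://path/to/verts')   # assumed column: id

alpha = 0.15                                          # reset probability
out_degree = edges.groupBy('src').agg(F.count('*').alias('degree'))
ranks = vertices.select('id', F.lit(1.0).alias('rank'))

for _ in range(10):                                   # fixed number of power iterations for the sketch
    # every vertex sends rank/degree along each of its out-edges
    contribs = (edges
                .join(out_degree, 'src')
                .join(ranks.withColumnRenamed('id', 'src'), 'src')
                .select(F.col('dst').alias('id'),
                        (F.col('rank') / F.col('degree')).alias('contrib')))
    # note: vertices without incoming edges drop out here; dangling-node handling is omitted
    ranks = (contribs.groupBy('id')
             .agg(F.sum('contrib').alias('msg'))
             .select('id', (F.lit(alpha) + F.lit(1.0 - alpha) * F.col('msg')).alias('rank')))
    ranks = ranks.checkpoint()                        # cut the lineage so the plan does not grow each iteration

ranks.write.parquet('s3://path/to/output', mode='overwrite')
The key point is that nothing here ever materializes edge triplets carrying full vertex attributes on both ends; each iteration is just a join and an aggregation.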

If you are running a standalone version with pyspark and graphframes, you can launch the pyspark REPL by executing the following command:
pyspark --driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
Be sure to set the SPARK_VERSION environment variable appropriately to match the latest released version of Spark.
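Once the shell is up, a tiny smoke test (toy vertices and edges, purely made up here) confirms that the package loaded correctly:
# 'spark' is already defined inside the pyspark shell
from graphframes import GraphFrame

v = spark.createDataFrame([('a',), ('b',), ('c',)], ['id'])
e = spark.createDataFrame([('a', 'b'), ('b', 'c'), ('c', 'a')], ['src', 'dst'])
g = GraphFrame(v, e)
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()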

Related

Spark job fails when cluster size is large, succeeds when small

I have a Spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). The most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with these settings, the number of executors seen in the Spark jobs UI was 500 (one per node).
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem was that too many executors were being spawned and the driver had the overhead of tracking them all. I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the driver's instance type to r3.8xlarge; this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have another hypothesis on why this would happen?
Well, this comes down to understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, so 2,000 cores in total. What your request does is create 4,000 executors with 3 cores each, which means you are asking for 12,000 cores from a cluster that simply does not have them.
The RPC timeout error is usually associated with starting too many JVMs on the same machine; that machine cannot respond in time because too much is happening at once.
Keep in mind that --num-executors should be tied to your number of nodes, and --executor-cores to the cores you have in each node.
For example, an m3.xlarge has 4 cores and 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do. If you are going to run just one job, I suggest you set it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This should allow your job to run fine. If fitting your data on the workers is not a problem, it is better to set spark.default.parallelism to 2000, or you will lose a lot of time on shuffles.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism and the memory, and your job will run like a charm.
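As a back-of-the-envelope check of the arithmetic above (a quick sketch using only the figures quoted in this thread):
nodes, cores_per_node = 500, 4                      # 500 m3.xlarge workers
requested_executors, cores_per_executor = 4000, 3   # what the original spark-submit asked for

available_cores = nodes * cores_per_node                       # 2000
requested_cores = requested_executors * cores_per_executor     # 12000
print(available_cores, requested_cores)   # YARN ends up granting roughly one executor per node instead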
I experimented with a lot of configurations, modifying one parameter at a time, with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So probably the driver is overwhelmed by the large amount of shuffle work it has to track when there are more partitions, and hence the lower partition count helps.

Spark tuning job

I have a problem tuning Spark jobs executing on a YARN cluster. I have a feeling that I'm not getting the most out of my cluster, and additionally my jobs fail (executors get removed all the time).
I have the following setup:
4 machines
each machine has 10GB of RAM
each machine has 8 cores
8GBs of RAM are allocated for yarn jobs
14 (of 16) virtual cores are allocated for yarn jobs
I have run my spark job (actually connected to a jupyter notebook) using different setups, e.g.
pyspark --master yarn --num-executors 7 --executor-cores 4 --executor-memory 3G
pyspark --master yarn --num-executors 7 --executor-cores 7 --executor-memory 2G
pyspark --master yarn --num-executors 11 --executor-cores 4 --executor-memory 1G
I've tried different combinations and none of them seems to work; my executors keep getting destroyed. Additionally, I've read somewhere that it is a good idea to increase spark.yarn.executor.memoryOverhead to 600 MB so as not to lose executors (and I did that), but it doesn't seem to help. How should I set up my job?
Additionally, it confuses me that when I look at the ResourceManager UI it says, for my job, "VCores Used: 8, VCores Total: 56". It seems that I'm using a single core per executor, but I don't understand why.
One more thing: when I set up my job, how many partitions should I specify when reading data from HDFS to get maximal performance?
Donald Knuth said premature optimisation is the root of all evil, and a faster-running program that fails is of no use. Start by giving all the memory to one executor, say 7 GB of the 8 GB, and just 1 core. This is a complete waste of cores, but if it works, it proves your application can run on this hardware at all. If even this doesn't work, you should try getting bigger machines. Assuming it works, keep increasing the number of cores for as long as it still works.
The gist of the argument is: your application requires a certain amount of memory per task, and the number of tasks running concurrently per executor depends on the number of cores. First find the worst-case memory per core for your application; then you can set executor memory and cores to some multiple of that number, as in the sketch below.
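A rough sketch of that sizing rule (all numbers below are placeholders, not measurements from this cluster):
mem_per_task_gb = 3.0     # worst-case memory one task needed in the single-core experiment (placeholder)
cores_per_executor = 2    # scale this up only while the job keeps succeeding
overhead_fraction = 0.10  # spark.yarn.executor.memoryOverhead defaults to max(10% of executor memory, 384 MB)

executor_memory_gb = mem_per_task_gb * cores_per_executor
total_per_executor_gb = executor_memory_gb * (1 + overhead_fraction)
print(executor_memory_gb, total_per_executor_gb)  # the total has to fit inside the 8 GB YARN offers per node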

How to write huge data (almost 800 GB) as a Hive ORC table in HDFS using Spark?

I have been working on a Spark project for the last 3-4 months, and recently I have been doing some calculation with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation happens very fast in Spark using hqlContext and DataFrames, but writing the calculated result as a Hive table in ORC format, which will contain almost 20 billion records with a data size of almost 800 GB, takes too much time (more than 2 hours, and it eventually fails).
My cluster details are: 19 nodes, 1.41 TB of total memory, 361 total vcores.
For tuning I am using
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
at run time.
If I take a count of the result, it completes within 15 minutes, but if I write that result to HDFS as a Hive table
[ UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET") ]
then I am facing the above issue.
Please provide me with any suggestion regarding this, as I have been stuck on it for the last couple of days.
Code format:
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")
// some Spark calculation using DataFrames, Hive queries & UDFs ...
Finally:
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
Hello friends, I found the answer to my own question a few days back, so I am writing it here.
When we execute a Spark program without specifying the queue parameter, the default queue sometimes has limitations that do not allow you to run as many executors or tasks as you want. This can cause slow processing and, later on, job failures due to memory issues, because you are running fewer executors/tasks than intended. So don't forget to mention a queue name in your execution command:
spark-submit --class com.xx.yy.FactTable_Merging.ScalaHiveHql
--num-executors 25
--executor-cores 5
--executor-memory 20g
--driver-memory 10g
--driver-cores 5
--master yarn-cluster
--name "FactTable HIST & INCR Re Write After Null Merging Seperately"
--queue "your_queue_name"
/tmp/ScalaHiveProgram.jar
/user/poc_user/FactTable_INCR_MERGED_10_PARTITION
/user/poc_user/FactTable_HIST_MERGED_50_PARTITION
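Equivalently, the queue can be selected from code via the standard spark.yarn.queue property; a minimal PySpark sketch (the application name is just a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('FactTable HIST and INCR merge')        # placeholder name
         .config('spark.yarn.queue', 'your_queue_name')   # same effect as --queue on spark-submit
         .getOrCreate())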

spark submit executor memory/failed batch

I have 2 questions on Spark Streaming:
I have a Spark Streaming application running and collecting data in 20-second batch intervals. Out of 4000 batches, 18 batches failed because of the exception:
Could not compute split, block input-0-1464774108087 not found
I assume the data size was bigger than the memory available to Spark at that point; also, the app's StorageLevel is MEMORY_ONLY.
Please advise how to fix this.
Also, in the command below I use 20G of executor memory (total RAM on the data nodes is 140G). Does that mean all of that memory is reserved in full for this app, and what happens if I have multiple Spark Streaming applications?
Would I not run out of memory after a few applications? Do I need that much memory at all?
/usr/iop/4.1.0.0/spark/bin/spark-submit --master yarn --deploy-mode
client --jars /home/blah.jar --num-executors 8 --executor-cores
5 --executor-memory 20G --driver-memory 12G --driver-cores 8
--class com.ccc.nifi.MyProcessor Nifi-Spark-Streaming-20160524.jar
It seems your executor memory might be getting full. Try a few optimization techniques like these, as sketched below:
Use StorageLevel.MEMORY_AND_DISK instead of MEMORY_ONLY.
Use Kryo serialization, which is faster and better than normal Java serialization, if you go for caching with memory and serialization.
Check whether GC is the problem; you can see the GC time for the tasks being executed in the UI.
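A minimal PySpark sketch of the first two suggestions (illustrated on a plain RDD rather than the actual streaming application):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')  # Kryo instead of Java serialization
         .getOrCreate())
sc = spark.sparkContext

rdd = sc.parallelize(range(1000))           # stand-in for the real stream data
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk instead of dropping blocks when memory is full
print(rdd.count())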

Using all resources in Apache Spark with Yarn

I am using Apache Spark with Yarn client.
I have 4 worker PCs with 8 vcpus each and 30 GB of ram in my spark cluster.
I set my executor memory to 2G and the number of instances to 33.
My job is taking 10 hours to run and all machines are about 80% idle.
I don't understand the correlation between executor memory and executor instances. Should I have an instance per vCPU? Should I set the executor memory to be the machine's memory divided by the number of executors per machine?
I believe that you have to use the following command:
spark-submit --num-executors 4 --executor-memory 7G --driver-memory 2G --executor-cores 8 --class "YourClassName" --master yarn-client
The number of executors should be 4, since you have 4 workers. The executor memory should be close to the maximum memory that each YARN node has allocated, roughly ~5-6 GB (I assume you have 30 GB of total RAM).
You should take a look at the spark-submit parameters and fully understand them.
We were using Cassandra as our data source for Spark. The problem was that there were not enough partitions; we needed to split up the data more. Our mapping of Cassandra partitions to Spark partitions was not fine-grained enough, and we would only generate 10 or 20 tasks instead of hundreds of tasks.
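A generic sketch of forcing more partitions after the read (the table and keyspace names are hypothetical, and the connector's own split-size setting is omitted because its exact name depends on the connector version):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format('org.apache.spark.sql.cassandra')   # spark-cassandra-connector data source
      .options(table='my_table', keyspace='my_keyspace')    # hypothetical names
      .load())
df = df.repartition(400)                                    # aim for hundreds of tasks instead of 10-20
print(df.rdd.getNumPartitions())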
