Saving large file exceeds frameLimit - apache-spark

I try to save a large text file of a approx. 5GB
sc.parallelize(cfile.toString()
.split("\n"), 1)
.saveAsTextFile(new Path(path+".cs", "data").toUri.toString)
but I keep getting
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
...
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
I'm stuck here for ages now. Can anybody help me here and explain how I can save cfile as a textfile?
Standalone/Local/Yarn cluster?
Yarn cluster
Memory/Cores settings?
1,8 TB
285 Cores
Number of Partitions?
I am currently setting the number of partitions to 1:
The concerned lines of code for setting the number of partitions:
val model = word2vec
.setMinCount(minCount.asInstanceOf[Int])
.setVectorSize(arguments.getVectorSize)
.setWindowSize(arguments.getContextWindowSize)
.setNumPartitions(numW2vPartitions)
.setLearningRate(learningRate)
.setNumIterations(arguments.getNumIterations)
.fit(wordSequence)
spark-submit arguments:
spark-submit --master yarn
--deploy-mode cluster
--driver-memory 20G
--num-executors 5
--executor-cores 8
--driver-java-options "-Dspark.akka.frameSize=2000"
--executor-memory 20G --class

Standalone/Local/Yarn cluster?
Memory/Cores settings?
Number of Partitions?
Your error probably symptom that one of the workers are gone(OOM killer might have killed it or it got some OOM error)
I'm not sure why are you doing this: cfile.toString().split("\n") - from this I understand that you hold all 5GB content in memory and try to parallelize it ? Clearly it's not optimal.
Another problem that could be relevant - if you driver can hold all 5GB in memory somehow, but still all network layers between driver-workers won't like this amount of data - so my advice to split it into partitions.
instead you can read file with sc.textFile(..) and then save it into your new path. You also can control number of partitions(pieces) of your text file with sc.textFile(..).repartition(100).

Related

spark - application returns different results based on different executor memory?

I am noticing some peculiar behaviour, i have spark job which reads the data and does some grouping ordering and join and creates an output file.
The issue is when I run the same job on yarn with memory more than what the environment has eg the cluster has 50 GB and i submit spark-submit with close to 60 GB executor and 4gb driver memory.
My results gets decreased seems like one of the data partitions or tasks are lost while processing.
driver-memory 4g --executor-memory 4g --num-executors 12
I also notice the warning message on driver -
WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
but when i run with limited executors and memory example 15GB, it works and i get exact rows/data. no warning message.
driver-memory 2g --executor-memory 2g --num-executors 4
any suggestions are we missing some settings on cluster or anything?
Please note my job completes successfully in both the cases.
I am using spark version 2.2.
This is meaningless (except maybe for debugging) - the plan is larger when there are more executors involved and the warning is that it is too big to be converted into a string. if you need it you can set spark.debug.maxToStringFields to a larger number (as suggested in the warning message)

Spark java.lang.OutOfMemoryError : Java Heap space [duplicate]

This question already has answers here:
Spark java.lang.OutOfMemoryError: Java heap space
(14 answers)
Closed 1 year ago.
I am geting the above error when i run a model training pipeline with spark
`val inputData = spark.read
.option("header", true)
.option("mode","DROPMALFORMED")
.csv(input)
.repartition(500)
.toDF("b", "c")
.withColumn("b", lower(col("b")))
.withColumn("c", lower(col("c")))
.toDF("b", "c")
.na.drop()`
inputData has about 25 million rows and is about 2gb in size. the model building phase happens like so
val tokenizer = new Tokenizer()
.setInputCol("c")
.setOutputCol("tokens")
val cvSpec = new CountVectorizer()
.setInputCol("tokens")
.setOutputCol("features")
.setMinDF(minDF)
.setVocabSize(vocabSize)
val nb = new NaiveBayes()
.setLabelCol("bi")
.setFeaturesCol("features")
.setPredictionCol("prediction")
.setSmoothing(smoothing)
new Pipeline().setStages(Array(tokenizer, cvSpec, nb)).fit(inputData)
I am running the above spark jobs locally in a machine with 16gb RAM using the following command
spark-submit --class holmes.model.building.ModelBuilder ./holmes-model-building/target/scala-2.11/holmes-model-building_2.11-1.0.0-SNAPSHOT-7d6978.jar --master local[*] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=2g --conf spark.rpc.message.maxSize=1024 --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=50g --driver-memory=12g
The oom error is triggered by (at the bottow of the stack trace)
by org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)
Logs :
Caused by: java.lang.OutOfMemoryError: Java heap space at java.lang.reflect.Array.newInstance(Array.java:75) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)
Any suggestions will be great :)
Things I would try:
1) Removing spark.memory.offHeap.enabled=true and increasing driver memory to something like 90% of the available memory on the box. You probably are aware of this since you didn't set executor memory, but in local mode the driver and the executor all run in the same process which is controlled by driver-memory. I haven't tried it, but the offHeap feature sounds like it has limited value. Reference
2) An actual cluster instead of local mode. More nodes will obviously give you more RAM.
3a) If you want to stick with local mode, try using less cores. You can do this by specifying the number of cores to use in the master setting like --master local[4] instead of local[*] which uses all of them. Running with less threads simultaneously processing data will lead to less data in RAM at any given time.
3b) If you move to a cluster, you may also want to tweak the number of executors cores for the same reason as mentioned above. You can do this with the --executor-cores flag.
4) Try with more partitions. In your example code you repartitioned to 500 partitions, maybe try 1000, or 2000? More partitions means each partition is smaller and less memory pressure.
Usually, this error is thrown when there is insufficient space to allocate an object in the Java heap. In this case, The garbage collector cannot make space available to accommodate a new object, and the heap cannot be expanded further. Also, this error may be thrown when there is insufficient native memory to support the loading of a Java class. In a rare instance, a java.lang.OutOfMemoryError may be thrown when an excessive amount of time is being spent doing garbage collection and little memory is being freed.
How to fix error :
How to set Apache Spark Executor memory
Spark java.lang.OutOfMemoryError: Java heap space

Spark job fails when cluster size is large, succeeds when small

I have a spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). Most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with this setting, number of executors seen in spark jobs UI were 500 (one per node)
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem would be because of there are too many executors being spawned and driver has an overhead of tracking these executors. I tried reducing the number of executors by increasing the executor-memory to 4g. This did not help.
I tried changing the instance type of driver to r3.8xlarge, this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs file. Does any one have any other hypothesis on why this would happen?
Well this is a little bit a problem to understand how is the allocation of Spark works.
According to your information, you have 500 nodes with 4 cores each. So, you have 4000 cores. What you are doing with your request is creating 4000 executors with 3 cores each. It means that you are requesting 12000 cores for your cluster and there is no thing like that.
This error of RPC timeout is regularly associated with how many jvms you started in the same machine, and that machine is not able to respond in proper time due to much thing happens at the same time.
You need to know that, --num-executors is better been associated to you nodes, and the number of cores should be associated to the cores you have in each node.
For example, the configuration of m3.xLarge is 4 cores with 15 Gb of RAM. What is the best configuration to run a job there? That depends what you are planning to do. See if you are going to run just one job I suggest you to set up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This will allow you job to run fine, if you don't have problem to fit your data to your worker is better change the default.parallelism to 2000 or you are going to lost lot of time with shuffle.
But, the best approach I think that you can do is keeping the dynamic allocation that EMR enables it by default, just set the number of cores and the parallelism and the memory and you job will run like a charm.
I experimented with lot of configurations modifying one parameter at a time with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So probably the driver is overwhelmed with a the large number of shuffles that has to be done when there are more partitions and hence the lower partition helps.

How to write Huge Data ( almost 800 GB) as a hive orc table in HDFS using SPARK?

I am working in Spark Project since last 3-4 months and recently.
I am doing some calculation with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation is happening very fast in spark using hqlContext & dataframe, but when I am trying to write the calculated result as a hive table with orc format which will contain almost 20 billion of records with a data size of almost 800 GB is taking too much time (more than 2 hours and finally getting failed).
My cluster details are: 19 nodes , 1.41 TB of Total Memory, Total VCores are 361.
For tuneup I am using
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
at run time.
If I take a count of result, then it is completing within 15 minutes, but if I want to write that result in HDFS as hive table.
[ UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET") ]
then I am facing the above issue.
Please provide me with a suggestion or anything regarding this as I am stuck in this case since last couple of days.
Code format:
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")
some spark calculation using dataframe, hive query & udf.....
Finally:
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
Hello friends I have found the answer of my own question few days back so here
I am writing that.
Whenever we execute any spark program we do not specify the queue parameter and some time the default queue has some limitations which does not allow you to execute as many executors or tasks that you want so it might cause a slow processing and later on a cause of job failure for memory issue as you are running less executors/tasks. So don't forget to mention a queue name at in your execution command:
spark-submit --class com.xx.yy.FactTable_Merging.ScalaHiveHql
--num-executors 25
--executor-cores 5
--executor-memory 20g
--driver-memory 10g
--driver-cores 5
--master yarn-cluster
--name "FactTable HIST & INCR Re Write After Null Merging Seperately"
--queue "your_queue_name"
/tmp/ScalaHiveProgram.jar
/user/poc_user/FactTable_INCR_MERGED_10_PARTITION
/user/poc_user/FactTable_HIST_MERGED_50_PARTITION

spark submit executor memory/failed batch

I have 2 questions on spark streaming :
I have a spark streaming application running and collection data in 20 seconds batch intervals, out of 4000 batches there are 18 batches which failed because of exception :
Could not compute split, block input-0-1464774108087 not found
I assumed the data size is bigger than spark available memory at that point, also the app StorageLevel is MEMORY_ONLY.
Please advice how to fix this.
Also in the command I use below, I use executor memory 20G(total RAM on the data nodes is 140G), does that mean all that memory is reserved in full for this app, and what happens if I have multiple spark streaming applications ?
would I not run out of memory after a few applications ? do I need that much memory at all ?
/usr/iop/4.1.0.0/spark/bin/spark-submit --master yarn --deploy-mode
client --jars /home/blah.jar --num-executors 8 --executor-cores
5 --executor-memory 20G --driver-memory 12G --driver-cores 8
--class com.ccc.nifi.MyProcessor Nifi-Spark-Streaming-20160524.jar
It seems might be your executor memory will be getting full,try these few optimization techniques like :
Instead of using StorageLevel is MEMORY_AND_DISK.
Use Kyro serialization which is fast and better than normal java serialization.f yougo for caching with memory and serialization.
Check if there are gc,you can find in the tasks being executed.

Resources