Running Spark MLlib Decision Tree gives block size error - apache-spark

I'm running a decision tree on a DataFrame of about 2000 points and 500 features, with maxBins set to 182. No matter how much I increase the shuffle block size setting, from 200 up to 4000, I keep getting a failure at stage 3 of the decision tree training saying "max integer reached", referring to the Spark shuffle block size. Note that my data is in Spark SQL DataFrames, not RDDs.
Here is the error:
...
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522)
at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
...
Here is the code producing it:
val assembled = assembler.transform(features)
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setImpurity(impurity)
  .setMaxBins(maxBins)
  .setMaxDepth(maxDepth)
val pipeline = new Pipeline().setStages(Array(labelIndexer, dt))
val model = pipeline.fit(assembled)
Thank you for any pointers on what might be causing this and how to fix it.

Try to increase the number of partitions, using the repartition() method.
The reason for this error is that Spark uses memory-mapped files to handle partition data blocks, and it's currently not possible to memory-map anything larger than 2 GB (Integer.MAX_VALUE). That is a JVM limitation, though, not a Spark issue.
The workaround is to increase the number of partitions. That reduces the block size of each individual partition and might resolve the issue.
There is also some activity to work around this in Spark itself, by processing partition blocks in chunks.
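As a rough illustration, the repartition could be applied to the assembled DataFrame from the question before fitting the pipeline. This is only a sketch: assembler, features, labelIndexer, impurity, maxBins and maxDepth are the values from the question, and 1000 partitions is an arbitrary starting point to tune.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Spread the assembled feature vectors over more partitions so that no
// single shuffle block grows past the 2 GB memory-mapping limit.
val assembled = assembler.transform(features).repartition(1000)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setImpurity(impurity)
  .setMaxBins(maxBins)
  .setMaxDepth(maxDepth)

val pipeline = new Pipeline().setStages(Array(labelIndexer, dt))
val model = pipeline.fit(assembled)
Whether 1000 is enough depends on the data; the goal is simply to keep each partition well below 2 GB.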

Related

Spark Broadcast results in increased size of dataframe

I have a dataframe of 1 integer column made of 1B rows. So ideally, the size of the dataframe should be 1B * 4 bytes ~= 4GB. This is proven to be correct when I cache the dataframe and check the size. The size is around 4GB.
Now, if I try to broadcast the same dataframe to join with another dataframe, I get an error: Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
Why does the size of a broadcast DataFrame increase? I have seen this in other cases as well, where a 300 MB DataFrame shows up as a 3 GB broadcast DataFrame in the Spark UI SQL tab.
Any reasoning or help is appreciated.
The size increases in memory when a DataFrame is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark needs to copy your DataFrame to every worker to handle your subsequent operations.
Do not broadcast big DataFrames; only broadcast small ones, for use in join operations.
As per the Spark documentation on broadcast variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
This also appears to be a bug, according to this post; see https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321
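For the join use case, here is a hedged sketch of what "broadcast only the small side" looks like with the DataFrame API (smallDf, largeDf and the join column "id" are hypothetical names; the broadcast hint is the one in org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.broadcast

// Broadcast only the small lookup table; the large table stays partitioned.
// Spark ships a full copy of the small table to every executor.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))
If the small side is anywhere near spark.sql.autoBroadcastJoinThreshold or the 8 GB hard limit from the error above, it is usually safer to let Spark choose a shuffle join instead.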

PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (100 MB each) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one CSV (note I used 35 different window functions):
from pyspark.sql import functions as f
from pyspark.sql.window import Window

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving avg. I used 35 different moving averages (one window per value of i).
w = (Window.partitionBy("type")
     .orderBy(f.col("timestamp").cast("long"))
     .rangeBetween(-24 * 7 * 3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I increased the number of window functions to 50, the job OOMs. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job for 2 CSV files, it also OOMs. It is also not clear why it isn't spilled to disk, since the window functions are essentially partitioned by CSV file, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent OOM, or how I can hint Spark to do it.
If your machine cannot run all of these at once, you can process them in sequence and write out the data of each batch of files before loading the next batch (see the sketch below).
I'm not sure if this is what you mean, but you can try to hint Spark to write some of the data to disk instead of keeping it all in RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Please update if it helps.
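The batching suggestion as a minimal sketch, written in Scala here; the PySpark version follows the same pattern, and the file list, batch size and output paths are made up for illustration:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Hypothetical list of inputs; process a few files at a time so that only
// one batch's window computations are in memory at once.
val files = Seq("TypeA.csv", "TypeB.csv", "TypeC.csv", "TypeD.csv")

files.grouped(2).zipWithIndex.foreach { case (batch, idx) =>
  val df = spark.read.option("header", "false").csv(batch: _*)
  // ... apply the window functions for this batch here ...
  df.write.parquet(s"res_batch_$idx.parquet")
}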
In theory, you could process all these 600 files on one single machine. Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file and then writes the shuffle output to disk; the reduce phase then needs to fetch all of that shuffle output from all map tasks. It's obvious that in your case you can't hold all map tasks running at once.
So it's highly likely that the OOM happened in the map phase. If this is the case, it means the memory per core can't process one single partition of a file. Please be aware that Spark makes only a rough estimation of memory usage and spills when it thinks it should; as that estimation is not accurate, an OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128 MB)
Usually, a 128 MB input partition needs about 2 GB of heap, i.e. roughly 4 GB of total executor memory, since the unified execution/storage memory (roughly 0.5 of total executor memory) is:
(total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.fraction (default 0.6)
You could also post all your configs from the Spark UI for further investigation.
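As a rough illustration of that knob, here is a hedged Scala sketch of shrinking the input partition size (the same conf key can be set identically from PySpark via spark.conf.set; 64m is just an assumed starting value, not a recommendation):
// Smaller input partitions mean less data per task, so each core needs
// less execution memory before Spark can spill.
spark.conf.set("spark.sql.files.maxPartitionBytes", "64m")

// The setting is applied at read time, so re-read the input afterwards.
val logData = spark.read.option("header", "false").csv("TypeA.csv")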

Spark Out of Memory Error For MapOutputTracker serializeMapStatuses

I have a Spark job which has hundreds of thousands of tasks (300,000 tasks and more) at stage 0, and then during the shuffle the following exception is thrown on the driver side:
util.Utils: Suppressing exception in finally: null
java.lang.OutOfMemoryError at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at
java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) at
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894) at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875) at
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) at
java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) at
java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) at
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:618) at
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617) at
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:560) at
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:349) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
I checked the ByteArrayOutputStream code, and it throws an OutOfMemoryError when the array size would grow beyond Integer.MAX_VALUE, which is about 2 GB. That means the serialized map statuses must be less than 2 GB.
I also checked the MapOutputTracker code; the map status size is related to the number of tasks in this stage and in the following stage.
I was wondering whether anyone has encountered this issue and how you resolved it. My understanding is that we can only reduce the number of tasks, but with fewer partitions my job just gets stuck, because fewer partitions delay the computation.
This is likely caused by a single block that exceeds 2 GB during a shuffle. It usually means your operation requires more parallelism, which will reduce the size of any individual block, hopefully to below the 2 GB limit (which is already extremely high).
No Spark shuffle block can be greater than 2 GB.
Spark uses ByteBuffer as the abstraction for storing blocks:
val buf = ByteBuffer.allocate(length.toInt)
ByteBuffer is limited by Integer.MAX_VALUE (2 GB).
Increasing your Parallelism
1) Repartition your data before invoking the operation that causes this error as follows:
DataFrame.repartition(400)
RDD.repartition(400)
2) Pass the number of partitions into the operation as the last argument (where supported):
import org.apache.spark.rdd.PairRDDFunctions
RDD.groupByKey(numPartitions: Int)
RDD.join(other: RDD, numPartitions: Int)
3) Set the default parallelism (partitions) through the SparkConf as follows (NOT YET SUPPORTED in Databricks Cloud):
// create the SparkConf used to create the SparkContext
val conf = new SparkConf()
// set the parallelism/partitions
conf.set("spark.default.parallelism", "400")
// create the SparkContext with the conf
val sc = new SparkContext(conf)
// check the parallelism/partitions
sc.defaultParallelism
4) Set the SQL partitions through SQL as follows (default is 200):
SET spark.sql.shuffle.partitions=400;
Why 2GB?
This limit exists because of the limit on Java integers: Integer.MAX_VALUE = 2^31 - 1 = 2,147,483,647 bytes ~= 2 GB.
Spark's shuffle mechanism currently uses Java byte arrays to transport the data across the network.
This may be enhanced in the future, either by expanding Spark's shuffle to use larger byte arrays indexed by Longs, by chaining byte arrays together, or both.
https://forums.databricks.com/questions/1140/im-seeing-an-outofmemoryerror-requested-array-size.html
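To make the Integer.MAX_VALUE point concrete, here is a tiny illustrative Scala snippet (not from the linked page) showing why a block over 2 GB cannot even be requested through this API:
import java.nio.ByteBuffer

val length = 3L * 1024 * 1024 * 1024   // a 3 GB block size, as a Long
println(length.toInt)                  // overflows to a negative Int
// ByteBuffer.allocate takes an Int, so any size above Integer.MAX_VALUE
// either overflows like this or cannot be expressed at all:
// ByteBuffer.allocate(length.toInt)   // would throw IllegalArgumentException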

spark shuffle read time

(Spark UI screenshot: k.imgur.com/r8NIv.png)
I am having a hard time interpreting this information from the Spark UI. The executor which has the lowest shuffle read size/records takes the maximum time to read the shuffle blocks, as shown in the picture. I don't understand whether this is a code issue or a data node issue.
Maybe it is not caused only by the shuffle read size; there are many factors affecting the shuffle time, such as the number of partitions. You can try to modify the shuffle-related configuration parameters.
See the shuffle-behavior section of the Spark configuration documentation.
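For example, here is a hedged Scala sketch of a few shuffle-related settings one might experiment with; the values are arbitrary starting points, not recommendations:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // More shuffle partitions -> smaller, more evenly sized shuffle blocks.
  .config("spark.sql.shuffle.partitions", "400")
  // Buffer size for fetching shuffle blocks from remote executors (default 48m).
  .config("spark.reducer.maxSizeInFlight", "96m")
  // Retry behaviour for slow or failed shuffle block fetches.
  .config("spark.shuffle.io.maxRetries", "6")
  .config("spark.shuffle.io.retryWait", "10s")
  .getOrCreate()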

spark groupby driver throws OutOfMemory

I have an RDD[((Long, Long), Float)] of about 150 GB (as shown in the web UI Storage tab).
When I group this RDD by key, the driver program throws the following error:
15/07/16 04:37:08 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-39] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:844)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:747)
The executors didn't even start the stage.
This RDD has 120000 partitions. Could this be the cause of the error?
The size of at least one of the partitions is more than the memory you have allocated to the executor (you can set that with the --executor-memory flag on the spark-submit command line).
After grouping by (Long, Long), at least one of your groups is too big to fit in memory. Spark expects each record after grouping, ((Long, Long), Iterable[Float]), to fit in memory, and this is not the case for your data. See https://spark.apache.org/docs/1.2.0/tuning.html and look for "Memory Usage of Reduce Tasks".
As a workaround, I suggest increasing your data parallelism: add a mapping step before the groupBy to break down your data.
ds.map(x => ((x._1._1, x._1._2, x._1._1 % 2), x._2))
Then group by the new key (you might use something more sophisticated than x._1._1 % 2 as the salt).
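A hedged sketch of that salting idea for the RDD from the question; the number of buckets and the salt expression are assumptions, and the final per-key combination step depends on what you actually compute per group:
// `ds` is assumed to be the RDD[((Long, Long), Float)] from the question.
val buckets = 16L

val salted = ds
  // Spread each (Long, Long) key over `buckets` sub-keys so that no single
  // group has to fit entirely in one executor's memory.
  .map { case ((a, b), f) => ((a, b, a % buckets), f) }
  .groupByKey()   // RDD[((Long, Long, Long), Iterable[Float])]

// Second step (not shown): compute your per-group result inside each salted
// bucket, then combine the partial results per original (a, b) key.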
