Java Out-of-Memory Error in pyspark - apache-spark

My problem is fairly simple: the JVM is running out of memory when I run RandomForest.trainRegressor in pyspark. I'm using about 3GB of training data, 77 features, and with numTrees set to 15. If I set numTrees to 15 or more, the training fails with an out-of-memory error (like below):
error info:
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},numTrees=15, featureSubsetStrategy="sqrt",impurity='variance', maxDepth=20, maxBins=14) #variance
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.trainRandomForestModel.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:380)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:744)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
I'm using spark 1.5 and my parameters are:
spark-submit --master yarn-client --conf spark.cassandra.connection.host=x.x.x.x \
--jars /home/retail/packages/spark_cassandra_tool-1.0.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hive/lib/HiveAuthHook.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/lib/spark-examples.jar \
--num-executors 30 --executor-cores 4 \
--executor-memory 20000M --driver-memory 200g \
--conf spark.yarn.executor.memoryOverhead=5000 \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.kryoserializer.buffer=40m \
--conf spark.driver.extraJavaOptions=\"-Xms2048m -Xmx2048m -XX:+DisableExplicitGC -Dcom.sun.management.jmxremote -XX:PermSize=512m -XX:MaxPermSize=2048m -XX:MaxDirectMemorySize=5g\" \
--conf spark.driver.maxResultSize=10g \
--conf spark.port.maxRetries=100
In my opinion, the trees should be trained in series, so why does the number of them create a higher memory load?
If I want to train 300 trees successfully, how should I set the spark parameters to use the RandomForest.trainRegressor function properly?

Related

Spark-submit with Stocator failing with Class com.ibm.stocator.fs.ObjectStoreFileSystem not found error

I am trying to run spark-submit wordcount Python on a Kubernetes cluster by pulling a text file stored in COS.
For the config, I followed the Stocator README.md
./bin/spark-submit \
--master k8s://https://c111.us-south.containers.cloud.ibm.com:32206 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi --packages com.ibm.stocator:stocator:1.1.3 \
--conf spark.executor.instances=5 --conf spark.hadoop.fs.cos.myobjectstorage.access.key= --conf spark.hadoop.fs.cos.myobjectstorage.secret.key= --conf spark.hadoop.fs.stocator.scheme.list=cos --conf spark.hadoop.fs.cos.impl=com.ibm.stocator.fs.ObjectStoreFileSystem --conf spark.hadoop.fs.stocator.cos.impl=com.ibm.stocator.fs.cos.COSAPIClient --conf spark.hadoop.fs.stocator.cos.scheme=cos --conf spark.jars.ivy=/tmp/.ivy\
--conf spark.kubernetes.container.image=us.icr.io/mods15/spark-py:v1 --conf spark.hadoop.fs.cos.myobjectstorage.endpoint=http://s3.us.cloud-object-storage.appdomain.cloud --conf spark.hadoop.fs.cos.myobjectstorage.v2.signer.type=false --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/src/main/python/wordcount.py cos://vmac-code-engine-bucket.myobjectstorage/book.txt
I can see the driver and executors pods spinning up and after a couple of minutes the driver errors-out with the log below.
Driver stacktrace:
21/01/12 11:52:55 INFO DAGScheduler: Job 0 failed: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 7.839348 s
Traceback (most recent call last):
File "/opt/spark/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 889, in collect
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.30.43.123, executor 4): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:84)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.$anonfun$readToUnsafeMem$1(TextFileFormat.scala:119)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
Caused by: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
... 26 more
Any idea on how I can make this work? I want to pass the text file stored in COS to the wordcount Python example that comes with Spark download (examples folder)
I am using Spark-3.0.1-hadoop2.7 and for the container images, I followed the documentation here
The part that is failing here is
local:///opt/spark/examples/src/main/python/wordcount.py cos://vmac-code-engine-bucket.myobjectstorage/book.txt
For some reason, the wordcount.py is not able to pick the book.txt file in the COS.
Moving the cos file call inside the python file as mentioned in the link here solved the issue
from pyspark import SparkContext
sc = SparkContext("local", "count app")
sonnets = sc.textFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files")
counts = sonnets.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
counts.saveAsTextFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files/wordcount-result")

Reading data from S3 using pyspark throws java.lang.NumberFormatException: For input string: "100M"

I am using the following code to read some json data from S3:
df = spark_sql_context.read.json("s3a://test_bucket/test.json")
df.show()
The above code throws the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.json.
: java.lang.NumberFormatException: For input string: "100M"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:391)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have read several other SO posts on this topic (like this one or this) and have done all they have mentioned but nothing seems to fix my issue.
I am using spark-2.4.4-bin-without-hadoop and hadoop-3.1.2. As for the jar files, I've got:
aws-java-sdk-bundle-1.11.199.jar
hadoop-aws-3.0.0.jar
hadoop-common-3.0.0.jar
Also, using the following spark-submit command to run the code:
/opt/spark-2.4.4-bin-without-hadoop/bin/spark-submit
--conf spark.app.name=read_json --master yarn --deploy-mode client --num-executors 2
--executor-cores 2 --executor-memory 2G --driver-cores 2 --driver-memory 1G
--jars /home/my_project/jars/aws-java-sdk-bundle-1.11.199.jar,
/home/my_project/jars/hadoop-aws-3.0.0.jar,/home/my_project/jars/hadoop-common-3.0.0.jar
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.rpc.askTimeout=600s" /home/my_project/read_json.py
Anything I might be missing here?
From the stack trace the error is thrown when it's trying to read one of the configuration options, so the issue is with one of the default configuration options that now require numeric format.
In my case the error was resolved after I added the following configuration parameter to the spark-submit command:
--conf fs.s3a.multipart.size=104857600
See Tuning S3A Uploads.
I am posting what I ended up doing to fix the issue for anyone who might see the same exception:
I added hadoop-aws to HADOOP_OPTIONAL_TOOLS in hadoop-env.sh. I also removed all configurations in spark for s3a except the access/secret and everything worked. My code before the changes:
# Setup the Spark Process
conf = SparkConf() \
.setAppName(app_name) \
.set("spark.hadoop.mapred.output.compress", "true") \
.set("spark.hadoop.mapred.output.compression.codec", "true") \
.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
.set("spark.hadoop.mapred.output.compression.`type", "BLOCK") \
.set("spark.speculation", "false")\
.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")\
.set("com.amazonaws.services.s3.enableV4", "true")
# Some other configs
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
)
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.secret.key", s3_secret
)
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.multipart.size", "104857600"
)
And after:
# Setup the Spark Process
conf = SparkConf() \
.setAppName(app_name) \
.set("spark.hadoop.mapred.output.compress", "true") \
.set("spark.hadoop.mapred.output.compression.codec", "true") \
.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
.set("spark.hadoop.mapred.output.compression.`type", "BLOCK") \
.set("spark.speculation", "false")
# Some other configs
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
"fs.s3a.secret.key", s3_secret
)
That probably means that it was a class path issue. The hadoop-aws wasn't getting added to the class path and so under the covers it was defaulting to some other implementation of S3AFileSystem.java. Hadoop and spark are a huge pain in this area because there are so many different places and ways to load things and java is particular about the order as well because if it doesn't happen in the right order, it will just go with whatever was loaded last. Hope this helps others facing the same issue.

Understanding why smaller executors fail and larger succeeds in spark

I have a job that parses approximate a terabyte of json formatted data split in 20 mb files (this is because each minute gets a 1gb dataset essentially).
The job parses, filters, and transforms this data and writes it back out to another path. However, whether it runs depends on the spark configuration.
The cluster consists of 46 nodes with 96 cores and 768 gb memory per node. The driver has the same specs.
I submit the job in standalone mode and:
Using 22g and 3 cores per executor, the job fails due to gc and OOM
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError19/04/13 01:35:32 WARN TransportChannelHandler: Exception in connection from /10.0.118.151:34014
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.security.sasl.digest.DigestMD5Base$DigestIntegrity.getHMAC(DigestMD5Base.java:1060)
at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1470)
at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:150)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:126)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:101)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o54.json.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
Using 120g and 15 cores per executor, the job succeeds.
Why would the job fail on the smaller memory/core setup?
Notes:
There is an explode operation that possibly may be related as well. Edit: Unrelated. Tested the code, did a simple spark.read.json().count().show() and it gc'd and OOM'd.
My current pet theory at the moment is the the large number of small files results in high shuffle overhead. Is this what's going on and is there a way around this (outside of re-aggregating the files separately)?
Code as requested:
Launcher
./bin/spark-submit --master spark://0.0.0.0:7077 \
--conf "spark.executor.memory=90g" \
--conf "spark.executor.cores=12" \
--conf 'spark.default.parallelism=7200' \
--conf 'spark.sql.shuffle.partitions=380' \
--conf "spark.network.timeout=900s" \
--conf "spark.driver.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
launcher.py
Code
spark = SparkSession.builder \
.appName('Rewrites by Frequency') \
.getOrCreate()
spark.read.json("s3://path/to/file").count()

Spark submission on DCOS cluster fails with java.net.UnknownHostException: hdfs

I am running a spark-submit in cluster/rest mode on a DCOS cluster:
$ ./spark-submit --deploy-mode cluster --master mesos://localhost:7077 --conf spark.master.rest.enabled=true --conf spark.mesos.uris=http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml,http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml --conf spark.mesos.executor.docker.image=someregistry:5000/someimage:2.0.0-rc3 --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://hdfs/history --conf spark.executor.extraClassPath=/opr/spark/dist/elasticsearch-spark-20_2.11-6.4.2.jar --conf spark.mesos.driverEnv.SPARK_HDFS_CONFIG_URL=http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml --conf spark.executor.memory=42G --conf spark.driver.memory=8G --conf spark.executor.cores=8 --driver-class-path /opt/spark/dist/elasticsearch-spark-20_2.11-6.4.2.jar http://hostname/somescript.py
The task fails as follows:
java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:668)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:604)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:68)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: hdfs
... 26 more
I have created a tunnel from my localhost:7707 to master 7707 because the internal uri:7707 was directly reachable;
edit: I think it could be related to my executors not being able to read core-site.xml where the following declaration resides
<property>
<name>fs.defaultFS</name>
<value>hdfs://hdfs</value>
</property>
Do I need to point my local spark installation at these files somehow?
The problem is in the --conf spark.eventLog.dir=hdfs://hdfs/history
You need to change it to --conf spark.eventLog.dir=hdfs://HDFS_NAME_NODE_HOSTNAME:8020/hdfs/history
Note: Replace HDFS_NAME_NODE_HOSTNAME with the actual Namenode hostname.

How to get the working directory in executor

I am using the following command to submit Spark job, I hope to send jar and config files to each executor and load it there
spark-submit --verbose \
--files=/tmp/metrics.properties \
--jars /tmp/datainsights-metrics-source-assembly-1.0.jar \
--total-executor-cores 4\
--conf "spark.metrics.conf=metrics.properties" \
--conf "spark.executor.extraClassPath=datainsights-metrics-source-assembly-1.0.jar" \
--class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
./target/datainsights-1.0-jar-with-dependencies.jar
--files and --jars is used to send files to executors, I found that the files are sent to the working directory of executor like 'worker/app-xxxxx-xxxx/0/
But when job is running, the executor always throws exception saying that it could not find the file 'metrics.properties'or the class which is contained in 'datainsights-metrics-source-assembly-1.0.jar'. It seems that the job is looking for files under another dir rather than working directory.
Do you know how to load the file which is sent to executors?
Here is the trace (The class 'org.apache.spark.metrics.PerfCounterSource' is contained in the jar 'datainsights-metrics-source-assembly-1.0.jar'):
ERROR 2016-01-14 16:10:32 Logging.scala:96 - org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.PerfCounterSource cannot be instantiated
java.lang.ClassNotFoundException: org.apache.spark.metrics.PerfCounterSource
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_80]
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_80]
at java.security.AccessController.doPrivileged(Native Method) [na:1.7.0_80]
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_80]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_80]
at java.lang.Class.forName0(Native Method) ~[na:1.7.0_80]
at java.lang.Class.forName(Class.java:195) ~[na:1.7.0_80]
It looks like you have a typo in your --jars argument, so it could be that it's not actually loading the file and and continuing silently.

Resources