PySpark - java.lang.OutOfMemoryError: Java heap space while writing to CSV file

When trying to write to a CSV file with the code below:
DF.coalesce(1).write \
    .option("header", "false") \
    .option("sep", ",") \
    .option("escape", '"') \
    .option("ignoreTrailingWhiteSpace", "false") \
    .option("ignoreLeadingWhiteSpace", "false") \
    .mode("overwrite") \
    .csv(filename)
I get the following error:
...FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
Could someone advise a workaround?

Try increasing the executor memory (spark.executor.memory) in your spark-submit command, something like this:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000

For me, adding the Spark config below (increasing spark.driver.memory) fixed the issue:
spark = SparkSession.builder.master('local[*]').config("spark.driver.memory", "15g").appName('sl-app').getOrCreate()
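If more memory alone doesn't help, note that coalesce(1) funnels the entire dataset through a single task, which is what exhausts the heap in the first place. A sketch (assuming the DF and filename from the question): dropping the coalesce lets each partition write its own part file, and the parts can be merged afterwards (e.g. with hadoop fs -getmerge) if a single CSV is required.
# Sketch: one part file per partition, so no single task holds all the data
DF.write \
    .option("header", "false") \
    .option("sep", ",") \
    .option("escape", '"') \
    .mode("overwrite") \
    .csv(filename)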

Related

Unable to write data using spark-submit

When I run spark-submit with this command on Cloudera:
time spark-submit \
--deploy-mode client \
--conf spark.app.name='XXXxxxxxx' \
--conf spark.master=local[*] \
--conf spark.driver.memory=20g \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=20g \
--conf spark.executor.cores=7 \
--conf spark.dynamicAllocation.enabled=True \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=5 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.5 \
--conf spark.local.dir=/data/xxx/ssss/spark_local/ \
--py-files test.py \
--files test_ed.ini \
test_py.py
I am getting this in my output log file:
Caused by: java.io.FileNotFoundException: /xxx/xxxxx/xxx/xxx/spark_local (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I have tried adjusting the shuffle partitions, after which this job succeeds, but my downstream job then fails.
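A hedged note beyond the original thread: "Too many open files" during a shuffle write usually means the job exceeds the OS per-process file-handle limit. Raising that limit (ulimit -n) is one fix; another is lowering the shuffle partition count, since each map task opens one shuffle file per partition. A sketch with an illustrative value:
from pyspark.sql import SparkSession

# Sketch: fewer shuffle partitions -> fewer simultaneously open shuffle
# files per task; 200 is illustrative, not a recommendation.
spark = SparkSession.builder \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()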

Reading data from S3 using pyspark throws java.lang.NumberFormatException: For input string: "100M"

I am using the following code to read some JSON data from S3:
df = spark_sql_context.read.json("s3a://test_bucket/test.json")
df.show()
The above code throws the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.json.
: java.lang.NumberFormatException: For input string: "100M"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:391)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have read several other SO posts on this topic and have done everything they mention, but nothing seems to fix my issue.
I am using spark-2.4.4-bin-without-hadoop and hadoop-3.1.2. As for the jar files, I've got:
aws-java-sdk-bundle-1.11.199.jar
hadoop-aws-3.0.0.jar
hadoop-common-3.0.0.jar
Also, using the following spark-submit command to run the code:
/opt/spark-2.4.4-bin-without-hadoop/bin/spark-submit \
--conf spark.app.name=read_json --master yarn --deploy-mode client --num-executors 2 \
--executor-cores 2 --executor-memory 2G --driver-cores 2 --driver-memory 1G \
--jars /home/my_project/jars/aws-java-sdk-bundle-1.11.199.jar,/home/my_project/jars/hadoop-aws-3.0.0.jar,/home/my_project/jars/hadoop-common-3.0.0.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.rpc.askTimeout=600s" \
/home/my_project/read_json.py
Anything I might be missing here?
From the stack trace, the error is thrown from Configuration.getLong during S3AFileSystem initialization, i.e. while parsing a configuration option whose default value ("100M") uses a unit suffix that this older parser cannot read as a number; given the fix below, that option is likely fs.s3a.multipart.size. Supplying a plain numeric value works around it.
In my case the error was resolved after I added the following configuration parameter to the spark-submit command:
--conf fs.s3a.multipart.size=104857600
(Note: depending on your Spark version, Hadoop options passed through --conf may need the spark.hadoop. prefix.)
See Tuning S3A Uploads.
I am posting what I ended up doing to fix the issue for anyone who might see the same exception:
I added hadoop-aws to HADOOP_OPTIONAL_TOOLS in hadoop-env.sh. I also removed all s3a-related configuration in Spark except the access/secret keys, and everything worked. My code before the changes:
# Setup the Spark Process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false") \
    .set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") \
    .set("com.amazonaws.services.s3.enableV4", "true")
# Some other configs
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.multipart.size", "104857600"
)
And after:
# Setup the Spark Process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false")
# Some other configs
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
That probably means it was a classpath issue: hadoop-aws wasn't being added to the classpath, so under the covers it was defaulting to some other implementation of S3AFileSystem. Hadoop and Spark are a huge pain in this area because there are so many places and ways to load things, and Java is particular about ordering: if classes aren't loaded in the right order, it will simply go with whatever was loaded last. Hope this helps others facing the same issue.
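A hedged addition beyond the original answer: one quick way to confirm which jar actually supplied S3AFileSystem at runtime is to ask the JVM where the class was loaded from, via PySpark's py4j gateway (a sketch, assuming an active SparkSession named spark):
# Sketch: print the jar that provided S3AFileSystem, to diagnose
# classpath mix-ups like the one described above.
jvm = spark._jvm
cls = jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())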

Understanding why smaller executors fail and larger succeeds in spark

I have a job that parses approximately a terabyte of JSON data split into 20 MB files (essentially, each minute produces a 1 GB dataset).
The job parses, filters, and transforms this data and writes it back out to another path. However, whether it succeeds depends on the Spark configuration.
The cluster consists of 46 nodes with 96 cores and 768 GB of memory per node. The driver has the same specs.
I submit the job in standalone mode and:
Using 22g and 3 cores per executor, the job fails with GC pressure and OOM:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError
19/04/13 01:35:32 WARN TransportChannelHandler: Exception in connection from /10.0.118.151:34014
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.security.sasl.digest.DigestMD5Base$DigestIntegrity.getHMAC(DigestMD5Base.java:1060)
at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1470)
at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:150)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:126)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:101)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o54.json.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
Using 120g and 15 cores per executor, the job succeeds.
Why would the job fail on the smaller memory/core setup?
Notes:
There is an explode operation that may be related as well. Edit: unrelated; I tested the code with a simple spark.read.json(...).count() and it still GC'd and OOM'd.
My current pet theory is that the large number of small files results in high shuffle overhead. Is this what's going on, and is there a way around it (other than re-aggregating the files separately)?
Code as requested:
Launcher
./bin/spark-submit --master spark://0.0.0.0:7077 \
--conf "spark.executor.memory=90g" \
--conf "spark.executor.cores=12" \
--conf 'spark.default.parallelism=7200' \
--conf 'spark.sql.shuffle.partitions=380' \
--conf "spark.network.timeout=900s" \
--conf "spark.driver.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
launcher.py
Code
spark = SparkSession.builder \
    .appName('Rewrites by Frequency') \
    .getOrCreate()
spark.read.json("s3://path/to/file").count()

Spark YARN: Cannot allocate a page with more than 17179869176 bytes

I am joining 11 million records, running with 5 workers on an EMR cluster with Spark 2.2.1.
I am getting the following error while running the job:
executor 3): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:277)
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not able to understand the possible reason for this. Please advise which parameter I should set.
Currently I am running with the following arguments: --num-executors 5 --conf spark.eventLog.enabled=true --executor-memory 70g --driver-memory 30g --executor-cores 16 --conf spark.shuffle.memoryFraction=0.5
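A hedged note, beyond the original thread: this error usually means a single task tried to reserve a sort buffer larger than Spark's maximum page size, which is often a symptom of skewed join keys concentrating data in one partition. Spreading the shuffle across more partitions, so each task handles less data, is a common first step (the value below is illustrative). Note also that spark.shuffle.memoryFraction has been deprecated since Spark 1.6 in favor of unified memory management.
from pyspark.sql import SparkSession

# Sketch: more shuffle partitions -> less data (and smaller page
# requests) per task; 2000 is illustrative, tune to your volume.
spark = SparkSession.builder \
    .config("spark.sql.shuffle.partitions", "2000") \
    .getOrCreate()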

How to get the working directory in an executor

I am using the following command to submit a Spark job; I want to send a jar and a config file to each executor and load them there:
spark-submit --verbose \
--files=/tmp/metrics.properties \
--jars /tmp/datainsights-metrics-source-assembly-1.0.jar \
--total-executor-cores 4 \
--conf "spark.metrics.conf=metrics.properties" \
--conf "spark.executor.extraClassPath=datainsights-metrics-source-assembly-1.0.jar" \
--class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
./target/datainsights-1.0-jar-with-dependencies.jar
--files and --jars are used to send files to the executors; I found that the files are sent to the executor's working directory, something like 'worker/app-xxxxx-xxxx/0/'.
But when the job runs, the executor always throws an exception saying that it could not find the file 'metrics.properties' or the class contained in 'datainsights-metrics-source-assembly-1.0.jar'. It seems that the job is looking for the files under a different directory rather than the working directory.
Do you know how to load a file that has been sent to the executors?
Here is the trace (the class 'org.apache.spark.metrics.PerfCounterSource' is contained in the jar 'datainsights-metrics-source-assembly-1.0.jar'):
ERROR 2016-01-14 16:10:32 Logging.scala:96 - org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.PerfCounterSource cannot be instantiated
java.lang.ClassNotFoundException: org.apache.spark.metrics.PerfCounterSource
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_80]
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_80]
at java.security.AccessController.doPrivileged(Native Method) [na:1.7.0_80]
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_80]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_80]
at java.lang.Class.forName0(Native Method) ~[na:1.7.0_80]
at java.lang.Class.forName(Class.java:195) ~[na:1.7.0_80]
It looks like you may have a typo in your --jars argument, so it could be that it's not actually loading the jar and is continuing silently.
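A hedged addition beyond that answer: files shipped with --files can be resolved on the executors with SparkFiles, which returns the executor-local path regardless of the working directory (a sketch):
from pyspark import SparkFiles

# Sketch: resolve the executor-local path of a file distributed via
# --files=/tmp/metrics.properties
path = SparkFiles.get("metrics.properties")
print(path)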
