PySpark driver memory exceptions while reading too many small files - apache-spark

PySpark version: 2.3.0, HDP 2.6.5
My source populates a Hive table (HDFS-backed) with 826 partitions and 1,557,242 small files (~40 KB each). I know this is a highly inefficient way to store data, but I don't have control over my source.
The problem is that when I need to do a historical load and scan all the files at once, the driver runs into memory exceptions. I tried setting driver-memory to 8g and 16g, and similar values for the driver memory overhead, but the problem persists.
What puzzles me is that it fails while listing files, which I presume is just metadata. Is there an explanation for why file metadata would need so much memory?
py4j.protocol.Py4JJavaError: An error occurred while calling o351.saveAsTable.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1956)
at java.net.URI$Parser.substring(URI.java:2869)
at java.net.URI$Parser.parse(URI.java:3049)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3$$anonfun$7.apply(InMemoryFileIndex.scala:251)
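For context, the trace shows the failure inside InMemoryFileIndex, which keeps a FileStatus entry (path, size, block locations) on the driver for every listed file, so the metadata for ~1.5 million files alone can occupy several gigabytes of heap. Below is a minimal PySpark sketch of the listing-related options that can be tuned in-session; the values and names are illustrative only, and spark.driver.memory itself still has to be passed to spark-submit (as already tried above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("historical-load")  # hypothetical app name
         # Hand the file listing off to a distributed Spark job once this many
         # paths are involved (default 32), instead of listing in driver threads.
         .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
         # Cap on the parallelism of that distributed listing job (default 10000).
         .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
         .getOrCreate())

df = spark.table("db.hive_table")  # hypothetical table name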

It would be helpful if you could share the parameters you were passing to spark-submit.
I was facing a similar issue too; adjusting the parameters made it work.
Try different configs (I can't suggest exact numbers, since it depends on your server configuration).
Mine:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 5g \
--executor-memory 6g \
--executor-cores 3

Related

Driver cores must be a positive number

I have upgraded Spark from version 3.1.1 to 3.2.1, and now all existing Spark jobs break with the following error:
Exception in thread "main" org.apache.spark.SparkException: Driver cores must be a positive number
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:634)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:257)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:234)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:119)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1026)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1026)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
We are using Spark in cluster mode with Apache Mesos, co-located with Cassandra.
I tried a few options, e.g.:
appl/spark/bin/spark-submit --name "Testjob" --deploy-mode cluster --master mesos://<master node>:7077 --executor-cores 4 --driver-memory 1G --driver-cores 1 --class ....
Do you have any hints or solutions for this problem?
Many thanks...
cheers
Unfortunately, I think it is impossible to run Spark 3.2.x with Mesos in cluster mode, because of this feature and the way MesosClusterDispatcher works.
Basically, what happens is that the Dispatcher submits the Spark application with the --driver-cores argument as a floating-point number, and Spark (SparkSubmitArguments.scala) then reads it as a String and parses it simply as:
driverCores.toInt
which, of course, fails.
I proposed a quick fix for this, but in the meantime I just built the code with the change I made in the PR. I also reported this as a bug.
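To make the failure mode concrete, here is a small Python illustration of the same parsing problem; the "1.0" value is hypothetical, and the real code path is driverCores.toInt in SparkSubmitArguments.scala, where Scala's String.toInt rejects fractional strings just as int() does here:

driver_cores = "1.0"  # hypothetical value, as the dispatcher might pass it

try:
    cores = int(driver_cores)          # mirrors driverCores.toInt: rejects "1.0"
except ValueError:
    cores = int(float(driver_cores))   # one tolerant way to parse such a value

print(cores)  # prints 1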

Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for spark-shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up roughly 40,000 tasks. During execution, the number of running tasks starts at 500, slowly drops down to ~0 (even though I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and looking for possible ways to avoid it. The drop and spike happen only in the read stage; all the intermediate stages run in parallel without such huge swings.
I'll be happy to provide any missing information.

A SPARK CLUSTER ISSUE

I know that the Spark cluster in the production environment runs jobs in standalone mode.
While I was running a job, memory overflow on a few of the workers caused the worker node process to die.
I would like to ask how to analyze the error shown in the image below:
Spark Worker Fatal Error
EDIT: This is a relatively common problem; if the steps below don't help you, please also see Spark java.lang.OutOfMemoryError: Java heap space.
Without seeing your code, here is the process you should follow:
(1) If the issue is caused primarily by the Java allocation running out of space within the container allocation, I would advise adjusting your memory overhead settings (below). The current values are a little high and will cause excess spin-up of vcores. Add the two settings below to your spark-submit and re-run.
--conf "spark.yarn.executor.memoryOverhead=4000m"
--conf "spark.yarn.driver.memoryOverhead=2000m"
(2) Adjust Executor and Driver Memory Levels. Start low and climb. Add these values to the spark-submit statement.
--driver-memory 10g
--executor-memory 5g
(3) Adjust Number of Executor Values in the spark submit.
--num-executors ##
(4) Look at the YARN stages of the job and figure out where inefficiencies in the code are present and where persistence can be added or replaced. I would advise looking heavily into Spark tuning.
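If you prefer to keep some of these values in code, the executor-side settings from steps (1) through (3) can also be supplied when the session is built. This is only a sketch with placeholder values; the driver memory and driver overhead settings still need to go on the spark-submit line, because the driver JVM is typically already running by the time this code executes:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-sketch")                               # placeholder app name
         .config("spark.executor.memory", "5g")                  # step (2)
         .config("spark.yarn.executor.memoryOverhead", "4000m")  # step (1)
         .config("spark.executor.instances", "10")               # step (3), placeholder count
         .getOrCreate())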

DAG scheduler Event Loop outOfMemoryerror: Java Heap Space with Spark Streaming Application

My Spark Streaming application keeps failing with OutOfMemoryError: Java heap space.
I am using the below configuration for my Spark Submit Job.
spark-submit \
--class ... \
--master ...\
--deploy-mode cluster \
--executor-memory 1G \
--total-executor-cores 3 \
--driver-memory 2G
and spark.yarn.driver.memoryOverhead is set to 1G.
After analysing the heap dump, I noticed excessive usage by the "DAG scheduler Event Loop" thread, and looking further into it, I see char[] and byte[] instances being retained.
The old generation occupies all 2 GB of heap and continues to grow.
Please suggest whether this is a bug, or let me know if you need more information to analyse it further.
Your help is much appreciated.
There seems to be a bug in the usage of ForkJoinPool in Spark 2.0.0 that creates far too many threads.
This issue is resolved here: https://issues.apache.org/jira/browse/SPARK-17396

spark submit executor memory/failed batch

I have 2 questions on Spark Streaming:
I have a Spark Streaming application running and collecting data in 20-second batch intervals; out of 4000 batches, 18 batches failed because of the exception:
Could not compute split, block input-0-1464774108087 not found
I assume the data size was bigger than the available Spark memory at that point; also, the app's StorageLevel is MEMORY_ONLY.
Please advise how to fix this.
Also, in the command below I use 20G of executor memory (total RAM on the data nodes is 140G). Does that mean all of that memory is reserved in full for this app, and what happens if I have multiple Spark Streaming applications?
Would I not run out of memory after a few applications? Do I need that much memory at all?
/usr/iop/4.1.0.0/spark/bin/spark-submit \
--master yarn \
--deploy-mode client \
--jars /home/blah.jar \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 20G \
--driver-memory 12G \
--driver-cores 8 \
--class com.ccc.nifi.MyProcessor \
Nifi-Spark-Streaming-20160524.jar
It seems your executor memory might be getting full; try a few optimization techniques like these (see the sketch after this list):
Use StorageLevel MEMORY_AND_DISK instead of MEMORY_ONLY.
Use Kryo serialization, which is faster and more compact than normal Java serialization, if you go for caching with memory and serialization.
Check for GC pressure; you can see GC time in the tasks being executed.
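For example, a minimal PySpark Streaming sketch of the first two suggestions, receiving with MEMORY_AND_DISK and enabling Kryo; the host, port, and app name are placeholders, not taken from the question:

from pyspark import SparkConf, SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("streaming-storage-sketch")  # placeholder name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))  # Kryo
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 20)  # 20-second batches, as in the question

# Receive with MEMORY_AND_DISK so input blocks spill to disk instead of being evicted.
lines = ssc.socketTextStream("source-host", 9999,  # placeholder source
                             storageLevel=StorageLevel.MEMORY_AND_DISK)
lines.count().pprint()

ssc.start()
ssc.awaitTermination()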
