Structured Spark Streaming Throws OOM exception - apache-spark

My Spark Structured Streaming job fails with the following exception after running for more than 24 hours.
Exception in thread "spark-listener-group-eventLog" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.math.BigInteger.<init>(BigInteger.java:1114)
at java.math.BigInteger.valueOf(BigInteger.java:1098)
at scala.math.BigInt$.apply(BigInt.scala:49)
at scala.math.BigInt$.long2bigInt(BigInt.scala:101)
at org.json4s.Implicits$class.long2jvalue(JsonDSL.scala:45)
at org.json4s.JsonDSL$.long2jvalue(JsonDSL.scala:61)
Quick background:
My Structured Streaming job ingests events that arrive as new Parquet files into a Solr collection. The sources are 8 different Hive tables (8 different HDFS locations) receiving events, and the sink is a single Solr collection.
Configuration:
Number of executors: 30
Executor memory: 20 G
Driver memory: 20 G
Cores: 5
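For reference, a minimal sketch of a session carrying this configuration (the app name is a placeholder; in practice these resources are usually passed to spark-submit rather than hard-coded):

```scala
import org.apache.spark.sql.SparkSession

object StreamingToSolrConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-to-solr")  // placeholder name
      .config("spark.executor.instances", "30")
      .config("spark.executor.memory", "20g")
      .config("spark.driver.memory", "20g")     // in client mode this must be set before the driver JVM starts
      .config("spark.executor.cores", "5")
      .getOrCreate()

    // ... streaming sources (8 Hive/HDFS locations) and the Solr sink would be defined here ...

    spark.stop()
  }
}
```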
I generated an hprof heap dump and loaded it into MAT to understand the cause. This is a test environment and the data stream rate (transactions per minute) is very low, with sometimes no transactions at all.
Any clue as to what is causing this? Unfortunately, I'm unable to share a code snippet; sorry about that.

Related

How Spark and S3 interact

I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format, and I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
I have the following questions:
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Does all of this happen in memory? If one of the executors crashes during a job run, will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Answer:
Spark follows a master-slave architecture: there is one master (the driver/coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the program's main method runs inside the driver process. The driver process creates the SparkSession or SparkContext, and converts the user code into tasks based on the transformation and action operations in the code, using the lineage graph. The driver builds the logical and physical plans, and once the physical plan is ready it coordinates with the cluster manager to get executors to complete the tasks. The driver only keeps track of the state of the data (metadata) for each of the executors.
So the 10 GB file does not get loaded onto the master node. S3 is distributed storage, and Spark reads from it in a split manner. The driver process just decides how the data will be split and what each executor needs to work on. Even if you cache the data, it is cached only on the executor nodes, based on the partitions/data that each executor is working on. Also, nothing gets triggered unless you call an action such as count or collect; Spark builds a lineage graph plus a DAG to keep track of this information.
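As a small illustration of that split read path, here is a minimal sketch (the bucket/path is hypothetical, and it assumes the Hadoop S3A libraries are on the classpath):

```scala
import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-parquet-read-sketch").getOrCreate()

    // No row data is read here; Spark only lists the files and infers the schema.
    val df = spark.read.parquet("s3a://my-bucket/transactions/") // hypothetical path

    // Each partition maps to one or more Parquet file splits and is read by an executor,
    // never by the driver.
    println(s"partitions: ${df.rdd.getNumPartitions}")

    // The action below is what actually triggers the distributed read from S3.
    println(s"rows: ${df.count()}")

    spark.stop()
  }
}
```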
Does all of this happen in memory? If one of the executors crashes during a job run, will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
Answer:
As answered in the first question, anything gets loaded into memory only when an action is performed, and "loaded into memory" does not necessarily mean loaded into driver memory. Depending on the action, data is loaded into the memory of the driver or of the executors. If you use collect, everything is pulled into driver memory, but for an operation like count on a cached DataFrame, the data is loaded into memory on each of the executor nodes.
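A minimal sketch of that distinction (the path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object MemoryLocationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("memory-location-sketch").getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/transactions/") // hypothetical path

    df.cache()   // only marks the data for caching; nothing is materialized yet
    df.count()   // action: partitions are read and cached in *executor* memory

    // collect() pulls results back into *driver* memory, so collecting a large
    // DataFrame can OOM the driver; here only a small sample is collected.
    val sample = df.limit(100).collect()
    println(s"collected ${sample.length} rows to the driver")

    spark.stop()
  }
}
```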
Now, if one of the executors crashes during the job run, the driver has the lineage graph information and the metadata for the data that the crashed executor held, so it re-runs the same lineage on another executor and completes the task. This is what makes Spark resilient and fault tolerant.
Each worker will issue one or more GET requests on the ranges of the Parquet file it has been given, and more as it seeks around the files. The whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall IO capacity of the store/shard.

Running 16 processes on a single JVM machine

I'm using a machine with 64 GB RAM and 24 cores, and I have allocated 32 GB to the JVM. I want to run the following processes:
7 Kafka brokers
3 instances of ZooKeeper
Elasticsearch
Cassandra
Spark
MongoDB
MySQL
Kafka Manager
Node.js
I am also running 4-5 Spark applications on 5-6 executors with 1 GB each, simultaneously. The Spark jobs work as follows (a rough sketch of the first job's shape follows the list):
1) 1 Spark job takes data from Kafka and inserts it into Cassandra.
2) 1 Spark job takes data from another Kafka topic and inserts it into a different Cassandra table.
3) 2 Spark jobs take data from Cassandra, do some processing/analysis, and write the data into their respective Cassandra tables.
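For orientation only, a minimal sketch of what job 1 might look like with Structured Streaming and the spark-cassandra-connector (topic, keyspace, table, and bootstrap server are hypothetical; the actual jobs may be written differently):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-cassandra-sketch").getOrCreate()

    // Read a stream of records from a (hypothetical) Kafka topic.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    // Write each micro-batch to a (hypothetical) Cassandra table.
    // Requires the spark-cassandra-connector on the classpath and Spark 2.4+ for foreachBatch.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "ks", "table" -> "events"))
        .mode("append")
        .save()

    val query = events.writeStream
      .foreachBatch(writeBatch)
      .start()

    query.awaitTermination()
  }
}
```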
Sometimes my insertion application hangs. It is consuming around 500 records/second from Kafka. After running for some time, it starts queueing up batches; there is no error, yet the processing time shown in the Spark dashboard keeps increasing gradually.
I used top to check CPU usage and found one process, "0QrmJB", taking 1500%+ CPU, while java is taking 200%.
What might be the issue? I'm not able to analyse it. Is it OK to run this many processes on a single machine? Thanks.

Spark SQL job optimization

I have a job that consists of around 9 SQL statements to pull data from Hive and write it back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches a total of 11 stages.
I did some analysis using the Spark UI and found the grey areas below, which could be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this gap and found that Spark performs some IO outside of Spark jobs, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the configuration below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
I'm attaching an image of the UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image, I see a long list of "Executor added" events which sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35,000 tasks, while the rest have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? An executor need not always be slim; you can have more cores in one executor. It is a trade-off between creating slim vs. fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions (see the sketch after this list).
Ensure that while reading data from Hive you are using SparkSession (essentially HiveContext). This pulls the data into Spark memory from HDFS and the schema information from the Hive metastore.
Yes, dynamic allocation of resources is a feature that helps allocate the right set of resources, and it is better than a fixed allocation.
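A minimal sketch of a session that follows these suggestions (the values are illustrative assumptions, not tuned numbers for this workload):

```scala
import org.apache.spark.sql.SparkSession

object HiveSqlJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-sql-job")                           // placeholder name
      .enableHiveSupport()                               // read Hive tables and schema via the metastore
      .config("spark.sql.shuffle.partitions", "800")     // raise from the default 200 for the 1.5 TB shuffle
      .config("spark.dynamicAllocation.enabled", "true") // let Spark scale executors up and down
      .config("spark.shuffle.service.enabled", "true")   // required for dynamic allocation on YARN
      .config("spark.dynamicAllocation.maxExecutors", "200")
      .getOrCreate()

    // The existing Hive SQL statements would go here, e.g.:
    // spark.sql("INSERT INTO target_db.target_table SELECT ... FROM source_db.source_table ...")

    spark.stop()
  }
}
```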

Spark: driver logs showing "thread spilling sort data to disk"

Could somebody help me understand the possible reasons for the lines below appearing in the Spark job logs?
2018-06-11T05:35:46,181 - INFO [Executor task launch worker for task 328:Logging$class#54] - TID 328 waiting for at least 1/2N of on-heap execution pool to be free
2018-06-11T05:35:46,182 - INFO [Executor task launch worker for task 329:UnsafeExternalSorter#202] - Thread 151 spilling sort data of 50.0 MB to disk (20 times so far)
2018-06-11T05:35:46,188 - INFO [Executor task launch worker for task 322:UnsafeExternalSorter#202] - Thread 176 spilling sort data of 33.0 MB to disk (27 times so far)
How the Spark program works:
Query the database and cache the whole table (2 GB is cached).
Select records sequentially for one country out of 3 (Denmark, India, New Zealand).
Break the DataFrame into 500 pieces and pass each piece to a map function, which builds the JSON for the set of records in that piece and sends it to the search server.
The map is applied over a parallel collection (Vector) so the pieces are processed in parallel and sent to the search server for indexing in parallel.
I am a newbie in Spark, so please help me understand which part of the configuration I should look at to stop this spilling. The Spark version is 2.1.1.
Based on the log, you are sorting the data. During the sort there is not enough memory to keep the auxiliary data structures for the shuffle in memory, so Spark spills data to disk.
This log means there isn't enough memory for the task's computation, so data is exchanged to disk, which is an expensive operation.
When you find this log in only one or a few executor tasks, it indicates data skew; you may need to find the skewed keys and preprocess that data.
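As a starting point, these are the knobs commonly adjusted when sorts spill to disk; a minimal sketch follows, with illustrative values only (not a prescription for this job):

```scala
import org.apache.spark.sql.SparkSession

object SpillTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-tuning-sketch")
      .config("spark.executor.memory", "8g")         // more heap per executor
      .config("spark.memory.fraction", "0.7")        // larger share of the heap for execution/storage (default 0.6)
      .config("spark.sql.shuffle.partitions", "400") // more, smaller partitions mean smaller per-task sorts
      .getOrCreate()

    // ... the existing job logic would follow here ...

    spark.stop()
  }
}
```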

Bad read performance on Spark over HBase Hadoop

When reading 161,000 elements from HBase (462 MB based on the HDFS file size), Spark spends at least 6 seconds reading them.
HBase is configured to use a block cache. During the test (there is no other process running at that moment), the block cache has a size of 470.1 MB (752.0 MB free).
All the elements are in the block cache.
The executor is running in a YARN container (yarn mode) with 1408 MB of memory.
Everything is running on a single node (including the master), an Amazon m4.large instance.
There are no other rows in the table, and a range scan is performed.
The RDD is initialized like this (the original snippet is omitted).
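For orientation only, a minimal sketch of how such an HBase-backed RDD is typically initialized via newAPIHadoopRDD; the table name and row-key bounds are hypothetical, and this is not the asker's actual code:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read-sketch"))

    // Hypothetical table name and scan range; substitute the real ones.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, "row-000000")
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, "row-161000")

    // Each HBase region served to the scan becomes one Spark partition.
    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows read: ${rdd.count()}")
    sc.stop()
  }
}
```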
Executor logs (it took 8 seconds at debug logging level).
The job is executed via Spark JobServer.
Even a simple count on the RDD (no other operation) takes 5 seconds.
I don't know what I can do based on these figures. Where does the executor spend its time? How can I identify the bottleneck?
Thank you very much,
Sébastien.
