Apache Spark Temp Files Size on Disk - apache-spark

I have a setup where incoming data from a Kafka cluster is processed by an Apache Spark streaming job.
Version info:
Kafka = 0.8.x
Spark version = 2.3.1
Recently, when the capacity of the Kafka cluster was increased (by adding new nodes), we suddenly saw an exponential increase in the disk usage of the Spark clusters (most of the space was occupied by Spark temp files).
I am not sure whether these are related and would like some pointers to address/debug this.
As a precaution, we have increased the disk space of the Spark clusters to avoid the "No space left on device" error.
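For reference, a minimal sketch (path and values are placeholders, and this is not a diagnosis of the Kafka-related growth) of the general knobs that control where Spark writes its shuffle/spill temp files and how a standalone worker cleans them up:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: spark.local.dir controls where shuffle/spill temp files land
// (on YARN it is overridden by yarn.nodemanager.local-dirs; in standalone mode
// by SPARK_LOCAL_DIRS on the workers).
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark-tmp")   // hypothetical volume with more headroom

// Standalone mode only: these belong in the worker's spark-defaults.conf / spark-env,
// shown here as comments for completeness. They periodically delete old application
// work directories on disk.
//   spark.worker.cleanup.enabled    true
//   spark.worker.cleanup.interval   1800     (seconds between cleanup runs)
//   spark.worker.cleanup.appDataTtl 86400    (delete app dirs older than 1 day)

val spark = SparkSession.builder().config(conf).getOrCreate()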

Related

Spark Structured Streaming on RocksDB is not clearing the disk

I have been running Spark stateful structured streaming in EMR production. Previously the state store was running on the HDFS backend and accumulated a few GBs (around 2.5 GB) in HDFS directories. Later, when I moved to the RocksDB backend with 3.2.1, it accumulated close to 45 GB in the last 10 days.
I am also using the default RocksDB Spark configurations. I am sure it was clearing the state after the processing timeout; even now I see only 25 MB and 6.5 million records overall in the Spark UI.
How can we purge the RocksDB state store beyond some storage limit?
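Not a definitive fix, but a sketch of the settings (Spark 3.2+) that bound how much on-disk state history the RocksDB provider keeps; the values are illustrative only, not tuned for this workload:

import org.apache.spark.sql.SparkSession

// Sketch only: illustrative values.
val spark = SparkSession.builder()
  // Keep the RocksDB state store provider (available from Spark 3.2).
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  // Compact RocksDB on each commit so timed-out/removed state is physically
  // reclaimed instead of lingering in old SST files on disk.
  .config("spark.sql.streaming.stateStore.rocksdb.compactOnCommit", "true")
  // Retain fewer historical micro-batch versions of state in the checkpoint location.
  .config("spark.sql.streaming.minBatchesToRetain", "10")
  .getOrCreate()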

Low Performance of Spark Report when Spark and Cassandra are on separate Docker Containers

I am using Spark 3.0.1, the Spark Cassandra Connector, and Cassandra on Kubernetes.
I was using the Spark Cassandra Connector's repartitionByCassandraReplica API to get the data-locality benefit of aligning Spark partitions with Cassandra, followed by joinWithCassandraTable.
But this repartitioning fails because the Cassandra data is not local to the Spark containers, which makes joinWithCassandraTable performance very low.
Is there any other way to get good performance from joinWithCassandraTable?
As you already know, calling repartitionByCassandraReplica is pointless when Spark and Cassandra are not co-located on the same machines; joinWithCassandraTable itself still works without it (see the sketch after this answer).
To maximise the throughput of your Cassandra cluster, the recommendations are:
allocate at least 4 cores to each C* pod (8 cores is recommended)
allocate at least 16GB of RAM to each C* pod (24-30GB is recommended)
allocate at least 8GB of memory to the heap (16GB is recommended)
allocate volumes which have at least 5K IOPS
Once you've configured these recommendations, increase the capacity of your cluster by adding more C* pods. Cheers!
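For what it's worth, a minimal sketch of calling joinWithCassandraTable directly on an RDD of keys, without the repartitionByCassandraReplica step (the keyspace, table, and UserKey case class are hypothetical):

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Hypothetical key type matching the Cassandra table's partition key column.
case class UserKey(user_id: Int)

val spark = SparkSession.builder().appName("cassandra-join-sketch").getOrCreate()
val sc = spark.sparkContext

// RDD of keys to look up; joinWithCassandraTable issues targeted reads per key,
// so it still works (just without the locality benefit) when Spark and Cassandra
// are not co-located.
val keys = sc.parallelize(Seq(UserKey(1), UserKey(2), UserKey(3)))
val joined = keys.joinWithCassandraTable("my_keyspace", "users")

joined.take(10).foreach(println)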

Spark driver pod getting killed with 'OOMKilled' status

We are running a Spark Streaming application on a Kubernetes cluster using Spark 2.4.5.
The application is receiving massive amounts of data through a Kafka topic (one message every 3 ms). 4 executors and 4 Kafka partitions are being used.
While running, the memory of the driver pod keeps increasing until it gets killed by K8s with an 'OOMKilled' status. The memory of the executors is not facing any issues.
When checking the driver pod resources using this command:
kubectl top pod podName
We can see that the memory increases until it reaches 1.4 GB, and the pod gets killed.
However, when checking the storage memory of the driver in the Spark UI, we can see that the storage memory is not fully used (50.3 KB / 434 MB). Is there any difference between the storage memory of the driver and the memory of the pod containing the driver?
Has anyone had experience with a similar issue before?
Any help would be appreciated.
Here are a few more details about the app:
Kubernetes version : 1.18
Spark version : 2.4.5
Batch interval of spark streaming context : 5 sec
Rate of input data : 1 Kafka message every 3 ms
Scala language
In brief, Spark memory consists of three parts:
Reserved memory (300 MB)
User memory ((all - 300MB) * 0.4), used for data-processing logic.
Spark memory ((all - 300MB) * 0.6, where 0.6 is spark.memory.fraction), used for caching and shuffle in Spark.
Besides this, there is also max(container memory * 0.1, 384MB) of extra memory (0.1 is spark.kubernetes.memoryOverheadFactor) used by non-JVM allocations in K8s; this overhead applies to the driver pod as well as the executors.
Increasing the pod memory limit by this memory overhead in K8s should fix the OOM.
You can also decrease spark.memory.fraction to allocate more RAM to user memory.
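As a rough sketch (the sizes are placeholders, not recommendations), the relevant settings look like this; note they must be supplied at submit time, because the driver JVM/pod is sized before application code runs:

import org.apache.spark.SparkConf

// Sketch only: in practice pass these via spark-submit --conf or spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.driver.memory", "2g")             // JVM heap = reserved (300MB) + user + Spark memory
  .set("spark.driver.memoryOverhead", "512m")   // non-JVM memory added to the pod's request/limit
  .set("spark.memory.fraction", "0.5")          // lower it to give more of the heap to user memory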

What is the advantage of using Spark with HDFS as the file storage system and YARN as the resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for the analysis of big data. Since Spark keeps operations on data in memory, when using HDFS as the storage system for Spark, does it take advantage of the distributed storage that HDFS provides? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to run analysis on it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, which is what MapReduce programs written in Hadoop leverage? If not, then what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is a RAM overflow, but does this spill occur per node of the cluster (say 5 GB per node) or for the complete data (100 GB)?
Spark jobs can be configured to spill to local executor disk if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100 GB of CSV, you could easily have less than half that if it were written as Parquet or ORC...
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you are moving the execution to the NodeManagers on the DataNodes, rather than pulling data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication regarding where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage looking at Tez instead.
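To make the point concrete, a small sketch (the paths and column name are hypothetical) of reading the CSV from HDFS, where each HDFS block becomes its own partition and task rather than the whole 100 GB being loaded at once, and rewriting it as Parquet:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Each HDFS block of the CSV becomes an input split read by a separate task;
// the full 100 GB is never pulled into a single node's memory.
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/input/large_file.csv")      // hypothetical path

// Rewrite as columnar, compressed Parquet; later queries read far fewer bytes.
df.write.mode("overwrite").parquet("hdfs:///data/output/large_file_parquet")

val analysed = spark.read.parquet("hdfs:///data/output/large_file_parquet")
analysed.groupBy("some_column").count().show()   // "some_column" is a placeholder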

Amazon EMR - Is Spark dataframe.cache() stored in core nodes only, or also on task nodes?

I'm running Spark (v2) on AWS EMR and have calculated a large dataframe. When the dataframe.cache() (which is lazy) kicks in, would it be performed on the task nodes, or would the dataframe be moved to the core nodes for caching?
