Spark data locality on a large cluster

Since Spark executors are allocated when the SparkContext is initialized, how can Spark ensure data locality when I load data afterwards (e.g. with sc.textFile())? I mean, in a large cluster with, say, 5000 servers, the executors land on a more or less random subset of all workers, and Spark does not even know what or where my data is when it allocates executors. At that point, does data locality come down to luck, or is there some mechanism in Spark to reallocate executors or something similar?

After a few days of thinking, I realized that Spark's strength is its ability to handle iterative computing: it should only read from disk the first time, and after that everything can be served from the executors' memory. So the executors' initial location does not matter much.
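For illustration, a minimal sketch of that pattern, assuming a hypothetical HDFS path and a simple iterative job (all names are made up):

    import org.apache.spark.storage.StorageLevel

    // The first action reads from HDFS, so locality matters mainly for this pass.
    val lines = sc.textFile("hdfs:///data/input")    // hypothetical path
    val parsed = lines.map(_.split('\t'))
    parsed.persist(StorageLevel.MEMORY_ONLY)         // equivalent to .cache()

    // Later iterations are served from the executors' block managers,
    // not from HDFS, so the initial executor placement matters less.
    for (_ <- 1 to 10) {
      parsed.filter(_.length > 1).count()
    }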

Related

How can Spark process data that is way larger than Spark storage?

I am currently taking a course on Spark and came across this definition of an executor:
Each executor will hold a chunk of the data to be processed. This chunk is called a Spark partition. It is a collection of rows that sits on one physical machine in the cluster. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for two things: (1) execute code assigned by the driver, (2) report the state of the computation back to the driver.
I am wondering what will happen if the storage of the Spark cluster is smaller than the data that needs to be processed. How will the executors fetch the data to sit on the physical machines in the cluster?
The same question goes for streaming data, which is unbounded. Does Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it can be configured to run jobs using disk only.
Section 2.6.4, Behavior with Insufficient Memory, of Matei Zaharia's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of memory available.
In practice, you don't usually persist a 100 TB source dataframe, but only the aggregations or intermediate computations that are reused.
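As a hedged sketch of that practice (the table path and column names are hypothetical): persist only the reused aggregate, with a storage level that spills to disk when memory runs short, and leave the huge source unpersisted.

    import org.apache.spark.sql.functions._
    import org.apache.spark.storage.StorageLevel

    val source = spark.read.parquet("hdfs:///warehouse/huge_table")   // not persisted
    val dailyAgg = source
      .groupBy(col("day"), col("key"))
      .agg(sum("value").as("total"))
      .persist(StorageLevel.MEMORY_AND_DISK)   // blocks that don't fit in memory spill to disk

    dailyAgg.count()   // materialize once, then reuse dailyAgg in downstream queries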

Why does caching small Spark RDDs take a big memory allocation in YARN?

The RDDs that are cached (8 in total) are not big, only around 30 GB, yet the Hadoop UI shows the Spark application taking a lot of memory (with no active jobs running), i.e. 1.4 TB. Why so much?
Why does it show around 100 executors (here, i.e. vCores) even when there are no active jobs running?
Also, if the cached RDDs are stored across 100 executors, are those executors reserved so that no other Spark apps can use them for running tasks any more? To rephrase the question: will keeping a little memory occupied (.cache) in the executors prevent other Spark apps from leveraging their idle computing resources?
Is there any Spark config / Zeppelin config that could cause this behavior?
UPDATE 1
After checking the Spark conf (Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, here is a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
Perhaps you can try to repartition(n) your RDD to a smaller number of partitions, n < 100, before caching; a ~30 GB RDD would probably fit into the storage memory of ten 10 GB executors (see the sketch below). A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
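A minimal sketch of that suggestion, assuming an existing RDD named rdd of roughly 30 GB (the name and numbers are illustrative):

    import org.apache.spark.storage.StorageLevel

    val compacted = rdd.repartition(10)           // fewer, larger cached partitions
    compacted.persist(StorageLevel.MEMORY_ONLY)   // what .cache() does for RDDs
    compacted.count()                             // materialize the cache

    // Relevant settings (spark-defaults.conf or --conf) so that executors
    // holding no cached blocks can be returned to YARN when idle:
    //   spark.dynamicAllocation.enabled=true
    //   spark.shuffle.service.enabled=true
    //   spark.dynamicAllocation.executorIdleTimeout=60s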
Q: Is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
When Spark runs on YARN as its cluster manager, YARN allocates containers of a size specified by the application -- at least spark.executor.memory + spark.executor.memoryOverhead, and possibly more in the case of PySpark -- for all the executors. How much memory Spark actually uses inside a container is irrelevant, since the resources allocated to a container are considered off-limits to other YARN applications.
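As a rough worked example under default settings (assuming the default overhead factor of 0.10 with a 384 MiB floor):

    // spark.executor.memory          = 10 GiB
    // spark.executor.memoryOverhead  = max(384 MiB, 0.10 * 10 GiB) = 1 GiB
    // requested container size      ~= 10 GiB + 1 GiB = 11 GiB,
    // which YARN then rounds up to a multiple of yarn.scheduler.minimum-allocation-mb.
    val executorMemoryMiB = 10 * 1024
    val overheadMiB       = math.max(384, (0.10 * executorMemoryMiB).toInt)   // 1024
    val containerMiB      = executorMemoryMiB + overheadMiB                   // 11264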
Spark assumes that your data is evenly distributed across all executors and tasks; that is why you set memory per task. So to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure that they have similar sizes. Running repartition() causes a shuffle, which, if the data is heavily skewed, may cause further problems when executors don't have enough resources.
Cache won't help to release memory on the executors because it doesn't redistribute the data
Look at the "Event Timeline" on the Stages page and check how big the green bars are. Normally that reflects the data distribution, so it is a way to see how much data is loaded (proportionally) into every task and how much work each one is doing. Since all tasks are assigned the same amount of memory, you can see graphically whether resources are being wasted (mostly tiny bars with only a few big ones).
There are different ways to create evenly distributed files for your process. I mention some possibilities, but there are surely more:
Using Hive and the DISTRIBUTE BY clause: you need to distribute by a field that is well balanced in order to create the expected number of files, each of a proper size.
If the process creating those files is a Spark process reading from a database, try to create as many connections as the files you need, and use a proper field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties (see the sketch after this list).
Repartition may work, but see whether coalesce also makes sense in your process, or in the previous one that generates the files you are reading from.
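A hedged sketch of the JDBC approach mentioned above (URL, table and column names are hypothetical); the partitionColumn, lowerBound, upperBound and numPartitions options tell Spark how to split the query into parallel partitioned reads:

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // hypothetical connection
      .option("dbtable", "public.events")
      .option("user", "spark_reader")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "event_id")   // must be a numeric, date or timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "100")          // 100 parallel connections / partitions
      .load()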

Spark directs shuffle output to disk even when there is plenty of RAM

We have a Spark cluster without local disks, and spilling is set up to go to NFS. We realize that this is not how Spark was designed to be used, but such are our current realities.
In this world, spills slow down Spark jobs a great deal and we would like to minimize them. For most of our jobs, the Spark executors have enough RAM to hold all intermediate computation results, but we see that Spark always writes shuffle results to disk, i.e. to NFS in our case. We have played with every Spark config option that looked even vaguely related, trying to make Spark write shuffle outputs to RAM, to no avail.
I have seen in a few places, like Does Spark write intermediate shuffle outputs to disk, that Spark prefers to write shuffle output to disk. My questions are:
Is there a way to make Spark use RAM for shuffle outputs when there is RAM available?
If not, what would be a way to make it issue fewer, larger writes? We see it doing a lot of small 1-5 KB writes and waiting on NFS latency after every write. The following config options didn't help (a sketch of how they might be set follows the list):
spark.buffer.size
spark.shuffle.spill.batchSize
spark.shuffle.spill.diskWriteBufferSize
spark.shuffle.file.buffer
spark.shuffle.unsafe.file.output.buffer
spark.shuffle.sort.initialBufferSize
spark.io.compression.*.blockSize
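For reference, a sketch of how some of those buffer-related options might be raised (the values are illustrative, not recommendations; as noted above, tuning them did not avoid the small NFS writes in this case):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-buffer-tuning")
      .config("spark.shuffle.file.buffer", "1m")                 // default 32k
      .config("spark.shuffle.unsafe.file.output.buffer", "1m")   // default 32k
      .config("spark.shuffle.spill.diskWriteBufferSize", "8m")   // default 1m
      .config("spark.io.compression.lz4.blockSize", "512k")      // default 32k
      .getOrCreate()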

How to best partition data in Spark for optimal processing

I am using a 40 node EMR cluster with 16 cores in each node and 1 TB of memory; the data I am processing is close to 70-80 GB.
I am repartitioning the input DataFrame so that each executor can process an equal chunk of data, but the repartitioning is not working as expected: 90% of the heavy lifting is done by 1-2 executors, while the rest of the executors sit on only MBs of data.
Even if I don't explicitly repartition and let Spark handle it, the skew across partitions still exists.
What change should I make in my Spark code so that each executor gets an almost equal amount of data to process and the skew is reduced?
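One common remedy, offered here only as a sketch and not a definitive fix, is to repartition on a well-distributed or salted key instead of relying on the default partitioning; the DataFrame and column names below are hypothetical:

    import org.apache.spark.sql.functions._

    // Add a random salt so a single hot key is spread over many partitions.
    val salted = inputDf
      .withColumn("salt", (rand() * 200).cast("int"))        // 200 is an arbitrary salt range
      .repartition(640, col("customer_id"), col("salt"))     // ~ 40 nodes * 16 cores

    // Aggregate per (key, salt) first, then merge the partial results.
    val partial = salted.groupBy(col("customer_id"), col("salt"))
      .agg(sum("amount").as("partial_sum"))
    val result = partial.groupBy(col("customer_id"))
      .agg(sum("partial_sum").as("total"))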

Distributing partitions across cluster

In Apache Spark one can load datasets from many different sources. As I understand it, the compute nodes of a Spark cluster can be different from the nodes Hadoop uses to store the data (am I right?). What's more, we can even load a local file into a Spark job. Here is the main question: even if we use the same machines for HDFS and Spark, is it always true that Spark will shuffle all the data when creating an RDD? Or will Spark try to load the data in a way that takes advantage of the existing data locality?
You can use HDFS as the common underlying storage for both MapReduce (Hadoop) and Spark engines, and use a cluster manager like YARN to perform resource management. Spark will try to take advantage of data locality, and execute tasks as close as possible to the data.
This is how it works: if data is available on a node but its CPU is busy, Spark will wait a certain amount of time (determined by the configuration parameter spark.locality.wait, default 3 seconds) for the CPU to become available.
If the CPU is still not free after the configured time has passed, Spark will switch the task to a lower locality level. It will then again wait spark.locality.wait seconds and, if the timeout occurs again, switch to a yet lower locality level.
The locality levels are defined as below, in order from closest to data, to farthest from data (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.TaskLocality$):
PROCESS_LOCAL (data is in the same JVM as the running code)
NODE_LOCAL (data is on the same node)
NO_PREF (data is accessed equally quickly from anywhere and has no locality preference)
RACK_LOCAL (data is on the same rack of servers)
ANY (data is elsewhere on the network and not in the same rack)
The waiting time can also be configured individually per locality level (see the sketch below). For longer jobs, the wait time can be increased beyond the default of 3 seconds, since the CPUs might be tied up longer.
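A minimal sketch of setting those waits (the values are illustrative); the per-level keys fall back to spark.locality.wait when unset:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-tuning")
      .config("spark.locality.wait", "10s")          // global fallback, default 3s
      .config("spark.locality.wait.process", "5s")   // wait for PROCESS_LOCAL
      .config("spark.locality.wait.node", "10s")     // wait for NODE_LOCAL
      .config("spark.locality.wait.rack", "15s")     // wait for RACK_LOCAL
      .getOrCreate()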
