Spark Jobs filling up the disk at SPARK_LOCAL_DIRS location - apache-spark

Spark jobs are filling up the disk within a short amount of time (< 10 mins). I have 10GB of disk space and it is getting full at the SPARK_LOCAL_DIRS location. In my case SPARK_LOCAL_DIRS is set to /usr/local/spark/temp.
There are a lot of files like input-0-1489072623600, each somewhere between 3MB and 8MB.
Any ideas?

SPARK_LOCAL_DIRS is used for the RDD cache (on disk) and for shuffle data. You should check the storage details to see how much data is cached to disk, and whether your job performs any shuffle operations.
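If the 10GB volume is simply too small for what the job caches and shuffles, a hedged sketch of two things to try (the paths and DataFrame names below are made up): point SPARK_LOCAL_DIRS, or the equivalent spark.local.dir conf, at a larger mount, and unpersist cached data as soon as you are done with it:
from pyspark.sql import SparkSession
# Equivalent to SPARK_LOCAL_DIRS when the env var is not set by the cluster
# manager; /data/spark-tmp is a hypothetical mount with more free space.
spark = (SparkSession.builder
         .config("spark.local.dir", "/data/spark-tmp")
         .getOrCreate())
df = spark.read.parquet("/data/input")   # hypothetical input
df.persist()                             # blocks that spill go under the local dirs
# ... use df ...
df.unpersist()                           # frees the cached blocks on disk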

Related

Spark SQL data storage life cycle

I recently had an issue with one of my Spark jobs, where I was reading a Hive table with several billion records; the job failed due to high disk utilization. After adding an AWS EBS volume, the job ran without any issues. Although that resolved the problem, I still have a few doubts; I tried doing some research but couldn't find any clear answers. So my questions are:
When Spark SQL reads a Hive table, where is the data stored for processing initially, and what is the entire life cycle of the data in terms of its storage, if I don't explicitly specify anything? And how does adding an EBS volume solve the issue?
Spark will read the data; if it does not fit in memory, it will spill it to disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion starts as soon as you read(); it is not part of the DAG, and you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In recent versions of Spark, you can push down filters (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion); I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend (a short sketch follows this list).
Storage MUST be visible/accessible from the executors, so if you have EBS volumes, make sure they are accessible from the cluster nodes where the executors/workers are running, not just from the node where the driver is running.
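A minimal PySpark sketch of limiting ingestion in the query itself, and of a pushed-down filter on a file source (the JDBC URL, table, credentials, and paths are made up for illustration):
# Push the filter into the source query so only matching rows are ingested.
df_db = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://dbhost:3306/sales")
         .option("dbtable", "(SELECT * FROM orders WHERE year = 2023) AS t")
         .option("user", "reader").option("password", "...")
         .load())
# For CSV/Avro/Parquet sources, a filter right after the read can be pushed
# down into the scan by Spark's optimizer.
df_files = spark.read.parquet("/data/orders").filter("year = 2023")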
Initially the data is in the table location on HDFS/S3/etc. Spark spills data to local storage if it does not fit in memory.
Read the Apache Spark FAQ:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
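To illustrate "as determined by the RDD's storage level", a minimal PySpark sketch (the path is made up):
from pyspark import StorageLevel
rdd = spark.sparkContext.textFile("/data/big_input")
# MEMORY_ONLY: partitions that don't fit in memory are recomputed when needed.
rdd.persist(StorageLevel.MEMORY_ONLY)
# MEMORY_AND_DISK: partitions that don't fit in memory are spilled under
# spark.local.dir instead of being recomputed (an RDD can only be assigned
# one storage level, hence the commented-out alternative).
# rdd.persist(StorageLevel.MEMORY_AND_DISK)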
Whenever Spark reads data from Hive tables, it stores it in an RDD. One point I want to make clear here is that Hive is just a warehouse, a layer above HDFS; when Spark interacts with Hive, Hive provides Spark with the HDFS location where the data lives.
Thus, when Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by Hadoop (by whatever InputFormat is used to read this file; e.g. if you use textFile() it would be TextInputFormat in Hadoop), which gives you a single partition for a single HDFS block (note: the split between partitions is done on line boundaries, not on the exact block boundary), unless you have a compressed file format like Avro/Parquet.
If you manually add rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in the RDD to the x partitions you want to have; partitioning is done on a round-robin basis.
If you have a 10GB uncompressed text file stored on HDFS with the HDFS block size set to 256MB, it would be stored in 40 blocks, which means the RDD you read from this file would have 40 partitions. When you call repartition(1000), your RDD is only marked as to-be-repartitioned; it will actually be shuffled into 1000 partitions when you execute an action on top of this RDD (lazy execution).
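A minimal PySpark sketch of the behaviour described above (the path is made up):
rdd = spark.sparkContext.textFile("hdfs:///data/10gb_text_file")
print(rdd.getNumPartitions())          # roughly one partition per HDFS block
repartitioned = rdd.repartition(1000)  # only recorded in the lineage for now
repartitioned.count()                  # the action triggers the actual shuffle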
Now it is all up to Spark how it will process the data. Since Spark does lazy evaluation, it prepares a DAG for optimal processing before doing the work. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate the job will fail.
Once it has prepared the DAG, it starts processing the data: it divides your job into stages and the stages into tasks. Each task then uses specific executors, shuffles, and partitioning. So in your case, when you process billions of records, your configuration may not be adequate for the processing. One more point: when we say Spark loads the data into an RDD/DataFrame, it is managed by Spark; there are options to keep the data in memory, on disk, memory only, etc. (see the Spark storage-level reference).
Briefly,
Hive --> HDFS --> Spark --> RDD (where it is stored depends on the action, since evaluation is lazy).
You may refer to the following link: Spark RDD - is partition(s) always in RAM?

PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (100 MB each) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each CSV), timestamp, price
First I tested it on one CSV (note that I used 35 different window functions):
from pyspark.sql import functions as f
from pyspark.sql import Window
logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving averages. I used 35 different moving averages (i = 1..35).
for i in range(1, 36):
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
    logData = logData.withColumn("moving_avg_" + str(i), f.avg("price").over(w))
# Some other simple operations... No agg, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I increased the number of window functions to 50, the job OOMed. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job on 2 CSV files, it also OOMed. It is also not clear why the data is not spilled to disk, since the window functions are essentially partitioned by CSV file, so they are independent.
The question is: why doesn't PySpark spill to disk in these two cases to prevent OOM, and how can I hint Spark to do it?
If your machine cannot handle all of these at once, you can process them in sequence and write the output of each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to your disk instead of keeping it all in RAM with:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
Update, in case it helps:
In theory, you could process all these 600 files on one single machine; Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of the file and writes the shuffle output to local files; the reduce phase then needs to fetch all this shuffle output from all map tasks. It's obvious that in your case you can't keep all map tasks running at once.
So it's highly likely that the OOM happened in the map phase. If this is the case, it means the memory per core can't process one single partition of the file. Please be aware that Spark does a rough estimation of memory usage and spills when it thinks it should; as the estimation is not accurate, OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, 128MB of input needs about 2GB of heap, i.e. roughly 4GB of total executor memory, because the usable execution/storage memory is roughly half of the total:
usable memory ≈ (total executor memory - spark.executor.memoryOverhead (default 10%)) * spark.memory.fraction (default 0.6)
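As a rough back-of-the-envelope check of the estimate above, using the defaults just mentioned (illustrative only, not an exact accounting):
total_executor_mem_gb = 4.0
overhead_gb = max(0.10 * total_executor_mem_gb, 0.384)  # spark.executor.memoryOverhead
heap_gb = total_executor_mem_gb - overhead_gb
usable_gb = heap_gb * 0.6              # spark.memory.fraction (execution + storage)
print(round(usable_gb, 2))             # ~2.2 GB usable for a ~128MB input partition
# If a partition is still too big for that budget, shrink the partitions, e.g.:
# spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # 64MB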
You can post all your configs (as shown in the Spark UI) for further investigation.

Specify which file system spark uses for spilling RDDs

How do we specify the local (Unix) file system where Spark spills RDDs when they won't fit in memory? We cannot find this in the documentation. Analysis confirms that the spill is going to the Unix file system, not to HDFS.
We are running on Amazon Elastic MapReduce. Spark is spilling to /mnt. On our system, /mnt is an EBS volume while /mnt1 is an SSD. We want to spill to /mnt1; if that fills up, we want to spill to /mnt2, and we want /mnt to be the spill location of last resort. It's unclear how to configure things this way, and how to monitor spilling.
We have reviewed the existing SO questions:
Understanding Spark shuffle spill appears out of date.
Why SPARK cached RDD spill to disk? and Use SSD for SPARK RDD discuss spill behavior, but not where the files are spilled.
Spark shuffle spill metrics is an unanswered question showing the Spill UI, but does not provide the details we are requesting.
Check out https://spark.apache.org/docs/2.2.1/configuration.html#application-properties and search for
spark.local.dir
This defaults to /tmp; try setting it to the location of your EBS volume.
NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
Also look at the following stack overflow post for more insightful info
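Note that, per the Spark docs, spark.local.dir can also be a comma-separated list of directories on different disks; Spark spreads scratch data across all of them rather than filling one before moving to the next, and on YARN/EMR the cluster manager's LOCAL_DIRS takes precedence, as noted above. A minimal sketch (whether these mounts match your EMR setup is an assumption):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.local.dir", "/mnt1/spark-tmp,/mnt2/spark-tmp")
         .getOrCreate())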

Spark wholeTextFiles(): java.lang.OutOfMemoryError: Java heap space

I'm processing a 400MB file with spark.wholeTextFiles() and I keep getting an out of memory error. I first used this API with a folder of files totaling 40MB, and I wanted to know whether my code works with a large file; that's where the big file comes in.
This is the configuration, and I think I gave the heap enough RAM, but still no luck. I'm just reading the folder and then writing it out with
files.saveAsTextFile("data/output/no")
and the command is
spark-submit --driver-memory 4G --driver-java-options -Xms4096m
--executor-memory 4G target/scala-2.11/mz_2.11-1.0.jar
I compared Spark SQL, sc.hadoopFile and sc.wholeTextFiles, and wholeTextFiles is the fastest. I think that's because wholeTextFiles tries to load the whole folder into the memory of one node (the master, I guess) and everything happens in RAM, so it is fast.
hadoopFile() loads by partition, with as many partitions as there are files, even if the files are small, and this read is expensive.
Spark SQL will load the folder into partitions; the partition size can be set with
spark.conf.set("spark.sql.files.maxPartitionBytes", 32000000)
but if the files are small, it takes time to load the files into each partition.
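A minimal PySpark sketch for comparing how the two read paths partition the same folder (the folder path is made up; actual partition counts depend on file sizes and your defaults):
spark.conf.set("spark.sql.files.maxPartitionBytes", 32000000)  # ~32MB per partition
df = spark.read.text("data/input_folder")                      # one row per line
print(df.rdd.getNumPartitions())
pairs = spark.sparkContext.wholeTextFiles("data/input_folder") # one record per file
print(pairs.getNumPartitions())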
Q1. Why do I keep getting the out of memory error?
Q2. When Spark loads a folder/big file by partition and returns an RDD, how many partitions have been read into RAM? Maybe none, and Spark waits for an action and then loads as many partitions as there are executors (or cores?) at a time to process them? In that case, maybe we should load big partitions like 64MB or 128MB instead of small partitions like 32KB?
Can you please share the entire code?
wholeTextFiles() is used when both the file path and the file content are required.
Something like key -> filePath (C:\\fileName) and value -> actual fileContent.
The number of partitions when wholeTextFiles() is used depends on how many executor cores you have.
Here the number of partitions will be 1 or more.
A Spark job isn't triggered unless an action is called.
It's a bottom-up approach / lazy evaluation.
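A minimal sketch of the (path, content) pairs and the lazy trigger described above (the folder path is made up):
pairs = spark.sparkContext.wholeTextFiles("data/input_folder")
# Each element is (filePath, fileContent); the whole file body is a single
# value, so one large file must fit in a single task's memory.
sizes = pairs.map(lambda kv: (kv[0], len(kv[1])))  # still lazy, nothing read yet
print(sizes.take(3))                               # the action triggers the read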

spark data locality on large cluster

Since Spark executors are allocated when the SparkContext is initialized, how can Spark ensure data locality when I load data after that (e.g. with sc.textFile())? I mean, in a large cluster with, say, 5000 servers, the executors land on a random subset of all workers, and Spark doesn't even know what and where my data is when it allocates the executors. Does data locality then depend only on luck, or is there some other mechanism in Spark to reallocate executors or something similar?
After a few days of thinking, I realized that the strength of Spark is its ability to handle iterative computation: it should only read from disk the first time, and after that everything can be reached in the executors' memory. So the executors' initial locations do not matter much.
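A minimal sketch of the iterative pattern alluded to here, reading from storage once and then iterating over cached data (the path and the computation are made up):
data = spark.sparkContext.textFile("hdfs:///data/points").cache()
n = data.count()   # the first action pays the disk/network read cost
# Later iterations hit the in-memory copy, so the executors' initial
# placement relative to the HDFS blocks matters much less.
for _ in range(10):
    total = data.map(lambda line: len(line)).sum()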
