OOM loading data from parquet - apache-spark

we have an Apache Spark 1.4.0 cluster and we would like to load data from a set of 350 parquet file from HDFS. Currently, when we try to run our program, we get a "OutOfMemory Error" driver side.
Profiling both an Executor and the driver we noticed that during the operation the executor memory remains constant when the driver memory constantly increase.
For each parquet file we load data as follows:
sqlContext.read().format(PARQUET_OUT_TYPE).load("path").toJavaRDD(mappingFunction)
and, after that, we join the RDDs by "union" and then we coalesce them
partitions.reduce((r1,r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
What looks really strange to me is that the executor memory remain constant during the loading phase (when I expect to see it increase 'cause of the data read by the node) and the driver memory constantly increase (when I expect to see it remain constant because that should not be loaded in the driver memory).
Is there something wrong with the way we load data? Could you please explain me how to read data from parquet in parallel?
Thanks

the OOM was caused by the parquet metadata not by the data.
Thanks

Related

Memory leak in Spark Structured Streaming driver

I am using Spark 3.1.2 with hadoop 3.2.0 to run Spark Structured Streaming (SSS) aggregation job, running on Spark K8S.
Theses job are reading files from S3 using SSS provided File Source input and also use S3 for checkpointing (with the directory output commiter).
What I noticed is that, after few days of running, the driver is having memory issue and crash.
As the driver is not doing many things (just calling Spark SQL functions and write the output to S3), I am wondering how to detect the source of these memory issues (memory leaking from hadoop/S3A library ?) and how I can fix them.
As shown on the screenshot, the driver take some time befoe using all the memory, and once it reached it, it seems to be able to call GC enough time. But after 1 week of running, it crash, as if the GC doesn't run enough/doesn't find something to clean.

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

Why getting out of memory error in Spark jobs?

In spark jobs, there are multiple reasons to get out of memory error, mostly when the shuffle size is large. However the expected behaviour from spark jobs is to spill on disk whenever the data (either cached data or shuffle data) does not fit into the executor memory, so theoretically we should never see an out of memory issue. But apparently this is not the behaviour in practice. What is the reason?

Spark partitionBy on write.save brings all data to driver?

So basically I have a python spark job that reads some simple json files, and then tries to write them as orc files partitioned by one field. The partition is not very balanced, as some keys are really big, and other really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder"), format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it increasing the driver memory. Does this mean that all the data is being send to the driver for it to write? Can't each executor write its own partition? Im using Spark 2.0.1
Even if you partition dataset and then write it on storage there is no possibility that records are sent to the driver. You should look at logs of memory issues (if they occur on driver on or executors) to figure out exact reason of failing.
Probably your driver has too low memory to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save memory on old jobs and stages metadata. If this won't help, connect to driver with jvisualvm to find job/stage than consumes large heap fragments and try to optimize.

spark streaming failed batches

I see some failed batches in my spark streaming application because of memory related issues like
Could not compute split, block input-0-1464774108087 not found
, and I was wondering if there is a way to re process those batches on the side without messing with the current running application, just in general , does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen in cases where your data ingestion rate into spark is higher than memory allocated or can be kept. You can try changing StorageLevel to MEMORY_AND_DISK_SER so that when it is low on memory Spark can spill data to disk. This will prevent your error.
Also, I don't think this error means that any data was lost while processing, but that input block which was added by your block manager just timed out before processing started.
Check similar question on Spark User list.
Edit:
Data is not lost, it was just not present where the task was expecting it to be. As per Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.

Resources