Does Apache Spark load the entire partition into memory?

Does Apache Spark load the entire partition into memory, or does it load it gradually? Is there any reference (preferably official) about that?
If I have a large partition, do I need to have that much memory available?
Does loading data from the in-memory partition depend on the type of transformation?

That depends on your file type. If it is CSV/textFile, Spark will usually load it gradually, even if you have multiple partitions, and it also depends on the size of the files. CSV behaves this way because it is row-oriented: Spark cannot pick out only the data you need, so to get one row of data it has to scan the whole file.
Parquet and ORC files are naturally splittable. Spark will not load the full files if you add conditions to the read, such as a where to filter rows and a select to choose columns. That is why the recommended file size is around 1GB to optimise Spark processing time.
So if you are using Parquet, each Spark partition should be able to fit in memory while it is being processed. Spark will try to keep as many partitions as it can in the cluster's memory during your transformations; anything that does not fit is spilled to disk, which slows execution down but ensures the job finishes.
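The difference between reading gradually and loading everything first can be illustrated with a plain-Python analogy. This is not Spark code and it does not claim to mirror Spark's internals; the function names read_rows and count_matching and the sample data are made up for illustration. It only shows records being streamed through an iterator, so that a single row is held at a time instead of the whole input.

```python
import io

def read_rows(handle):
    """Yield one parsed row at a time instead of building a full list."""
    for line in handle:
        yield line.rstrip("\n").split(",")

def count_matching(rows, column, value):
    """Consume rows one by one; only the current row is kept around."""
    return sum(1 for row in rows if row[column] == value)

# Hypothetical in-memory stand-in for a small CSV file.
sample = io.StringIO("id,country\n1,BR\n2,US\n3,BR\n")

rows = read_rows(sample)
next(rows)  # skip the header row
matches = count_matching(rows, column=1, value="BR")
print(matches)
```

Because read_rows is a generator, nothing is read until count_matching starts pulling rows from it.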

Related

Spark SQL data storage life cycle

I recently had an issue with one of my Spark jobs, where I was reading a Hive table with several billion records. The job failed due to high disk utilization, but after adding an AWS EBS volume it ran without any issues. Although that resolved the issue, I have a few doubts. I tried doing some research but couldn't find any clear answers. So my question is:
When Spark SQL reads a Hive table, where is the data stored for processing initially, and what is the entire life cycle of the data in terms of storage if I don't explicitly specify anything? And how does adding EBS volumes solve the issue?
Spark will read the data; if it does not fit in memory, it will spill it to disk.
A few things to note:
- Data in memory is compressed. From what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
- Ingestion will start as soon as you read(); it is not part of the DAG. You can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
- In recent versions of Spark, you can push down the filter (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion). I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
- Storage MUST be visible/accessible from the executors, so if you have EBS volumes, make sure they are accessible from the cluster nodes where the executors/workers are running, not just from the node where the driver is running.
Initially the data is in the table location in HDFS/S3/etc. Spark spills data to local storage if it does not fit in memory.
Read the Apache Spark FAQ:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever Spark reads data from Hive tables, it stores it in an RDD. One point I want to make clear here is that Hive is just a warehouse, a layer on top of HDFS. When Spark interacts with Hive, Hive gives Spark the HDFS location where the data lives.
When Spark reads a file from HDFS, it creates a single partition for each input split. The input split is determined by the Hadoop InputFormat used to read the file. For example, if you use textFile(), it would be Hadoop's TextInputFormat, which returns a single partition for each HDFS block (note: the boundary between partitions falls on a line break, not on the exact block boundary), unless you have a compressed file format like Avro/Parquet.
If you manually add rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in the RDD to the x partitions you want; the partitioning is done on a round-robin basis.
If you have a 10GB uncompressed text file stored on HDFS with a block size of 256MB (note that the HDFS default is 128MB in Hadoop 2 and later), it would be stored in 40 blocks, which means that the RDD you read from this file would have 40 partitions. When you call repartition(1000), your RDD is marked to be repartitioned, but it is actually shuffled into 1000 partitions only when you execute an action on top of this RDD (lazy execution).
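The block arithmetic can be checked with a few lines of plain Python. This is only a sketch of the calculation, assuming one partition per HDFS block as described above; block_count is a hypothetical helper, not a Spark or Hadoop function. It computes the block count for the 10GB file with a 256MB block size and, for comparison, with a 128MB block size.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

blocks_256 = block_count(10 * GB, 256 * MB)  # block size used in the text
blocks_128 = block_count(10 * GB, 128 * MB)  # comparison value
print(blocks_256, blocks_128)
```

Halving the block size doubles the number of blocks, and therefore the number of initial partitions under the one-partition-per-block assumption.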
Now it is up to Spark how it processes the data. Because Spark uses lazy evaluation, it prepares a DAG for optimal processing before doing any work. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate the job will fail.
Once it has prepared the DAG, it starts processing the data. It divides your job into stages and the stages into tasks. Each task runs on a specific executor and may involve shuffling and partitioning. So in your case, when processing billions of records, your configuration may not be adequate. One more point: when we say Spark loads the data into an RDD/DataFrame, that data is managed by Spark, and there are options to keep it in memory, on disk, in memory only, etc. (ref: storage_spark).
Briefly,
Hive --> HDFS --> Spark --> RDD (storage depends, as evaluation is lazy).
You may refer to the following link: Spark RDD - is partition(s) always in RAM?

Apache Spark: how Spark reads large partitions from source when there is not enough memory

Suppose my data source contains data in 5 partitions of 10GB each, so the total data size is 50GB. My doubt is: when my Spark cluster doesn't have 50GB of main memory, how does Spark handle out-of-memory exceptions, and what is the best practice to avoid these scenarios in Spark?
50GB is data that can fit in memory, and you probably don't need Spark for this amount of data; it would run slower than other solutions.
Also, depending on the job and data format, often not all the data needs to be read into memory (e.g. reading just the needed columns from a columnar storage format like Parquet).
Generally speaking, when the data can't fit in memory, Spark will write temporary files to disk. You may need to tune the job to use more, smaller partitions so that each individual partition fits in memory. See Spark Memory Tuning.
Arnon
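As a rough sketch of the "more, smaller partitions" advice, the partition count for a given target partition size can be estimated in plain Python. The helper partitions_for and the 128MB target are assumptions chosen for illustration, not a Spark recommendation; the right target depends on executor memory and the job itself.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def partitions_for(total_bytes, target_partition_bytes):
    """Smallest partition count that keeps each partition at or under the target size."""
    return -(-total_bytes // target_partition_bytes)

total = 5 * 10 * GB      # 5 source partitions of 10GB each, as in the question
target = 128 * MB        # assumed target size per partition
n_partitions = partitions_for(total, target)
print(n_partitions)
```

The resulting number is the kind of value one might pass to repartition(), assuming the data is spread roughly evenly across partitions.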

Breaking lineage of an RDD without relying on HDFS

I'm running a spark application on Amazon spot instances. In the end, I'm exporting my results to parquet files on S3. The tasks are memory intensive, so I have to run the initial calculations using a large number of partitions (hundreds of thousands). In the end, I would like to coalesce the partitions to a few large partitions and save them to big parquet files. And this is where I get into trouble:
- If I'm using .coalesce(), which is a narrow transformation, the entire lineage that precedes the coalesce will be executed on a small number of partitions, which will cause OOMs.
- If I'm using .repartition(), I rely on HDFS for the shuffle files.
This is a problem when using spot instances, which may be decommissioned, leaving corrupt/missing HDFS blocks.
- checkpointing also relies on HDFS so I can't use that.
- converting to a DataFrame and back didn't actually break the lineage (rdd.toDF.rdd, am I missing something?).
To conclude, I'm looking for a way to coalesce to a smaller number of partitions only to persist the data on S3. I would like the calculation to happen using the original partitions.

Concatenate ORC partition files on disk?

I am using Spark 2.3 to convert some CSV data to ORC for use with Amazon Athena; it is working fine! Athena works best with files that are not too small so, after manipulating the data a bit, I am using Spark to coalesce the partitions into a single partition before writing to disk, like so:
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')
This results in a single ORC file that is an optimal file size for use with Athena. However, the coalesce step takes a very long time. It adds about 33% to the total amount of time to convert the data!
This is obviously due to the fact that Spark cannot parallelize the coalesce step when saving to a single file. When I create the same number of partitions as there are CPUs available, the ORC write out to disk is much faster!
My question is, can I parallelize the ORC write to disk and then concatenate the files somehow? This would allow me to parallelize the write and merge the files without having to compress everything on a single CPU?

Are Spark RDDs stored in blocks or stored in memory? A few queries around Spark

I have a few questions about Spark RDDs. Can someone enlighten me, please?
I can see that RDDs are distributed across nodes. Does that mean the distributed RDDs are cached in the memory of each node, or does the RDD data reside on the HDFS disk? Or is the RDD data cached in memory only when an application runs?
My understanding is that when I create an RDD based on a file that is present in HDFS blocks, the RDD will read the data from the blocks the first time (an I/O operation) and then cache it persistently. At least once it has to read the data from disk. Is that true?
Is there any way I can cache external data directly into an RDD instead of storing the data in HDFS first and then loading it into the RDD from HDFS blocks? My concern is that storing data in HDFS first and then loading it into memory will add latency.
RDDs are data structures similar to arrays and lists. When you create an RDD (for example, by loading a file), the data is stored on your laptop if you are in local mode; if you are using HDFS, it is stored in HDFS. Remember: ON DISK.
If you want to store it in the cache (in RAM), you can use the cache() function.
Hopefully the first answer covers your second question too.
Yes, you can load the data directly from your laptop without loading it into HDFS.
val newfile = sc.textFile("file:///home/user/sample.txt")
Specify the file path.
By default Spark uses HDFS as storage; you can change that by using the line above.
Don't forget to put the three slashes (///):
file:///
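Why three slashes are needed can be seen by splitting the URI with Python's standard library. This is only an illustration of the URI structure, using the same example path as above: the first two slashes introduce the host part, which is empty for a local file, and the third slash is the start of the absolute path.

```python
from urllib.parse import urlsplit

parts = urlsplit("file:///home/user/sample.txt")

print(parts.scheme)        # URI scheme
print(repr(parts.netloc))  # host part, empty for a local file
print(parts.path)          # absolute path, starting with the third slash
```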

Resources