I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?
According to the official Azure Guide using native Spark caching, even with disk persistence, won't take advantage of local SSD. I suspect that in order to benefit from it we need to use OFF_HEAP option when persisting RDDs. But then how to configure it so it uses local SDD (mounted as SDB1 under /mnt) and Alluxio for in-memory stuff? I know switches
--conf spark.memory.offHeap.enabled="true" \
--conf spark.memory.offHeap.size=10G \
I'm asking about datasets generated through a set of operations rather than generated from input datasets (which would be easy - they only "HDFS://" prefix is needed).
To persist Data from Spark to a shared external storage that can manage SSD resource, you can use Alluxio. Spark can save and load RDDs or Dataframes to Alluxio easily:
// Save RDD to Alluxio as Text File
scala> rdd.saveAsTextFile("alluxio://master:19998/myRDD")
// Load the RDD back from Alluxio as Text File
scala> sc.textFile("alluxio://master:19998/myRDD")
// Save Dataframe to Alluxio as Parquet files
scala> df.write.parquet("alluxio://master:19998/path")
// Load Dataframe back from Alluxio as Parquet files
scala> df = sqlContext.read.parquet("alluxio://master:19998/path")
Maybe they meant storing data explicitly to Alluxio or Hdfs directly?
e.g. instead of:
df.cache()
use write and read:
df.write.parquet("alluxio://master:19998/out.parquet")
df.read.parquet("alluxio://master:19998/out.parquet")
p.s. sorry for a stupid answer, wanted to write it in a comment but hadn't enough reputation.
When I use Spark to read multiple files from S3 (e.g. a directory with many Parquet files) -
Does the logical partitioning happen at the beginning, then each executor downloads the data directly (on the worker node)?
Or does the driver download the data (partially or fully) and only then partitions and sends the data to the executors?
Also, will the partitioning default to the same partitions that were used for write (i.e. each file = 1 partition)?
Data on S3 is external to HDFS obviously.
You can read from S3 by providing a path, or paths, or using Hive Metastore - if you have updated this via creating DDL for External S3 table, and using MSCK for partitions, or ALTER TABLE table_name RECOVER PARTITIONS for Hive on EMR.
If you use:
val df = spark.read.parquet("/path/to/parquet/file.../...")
then there is no guarantee on partitioning and it depends on various settings - see Does Spark maintain parquet partitioning on read?, noting APIs evolve and get better.
But, this:
val df = spark.read.parquet("/path/to/parquet/file.../.../partitioncolumn=*")
will return partitions over executors in some manner as per your saved partition structure, a bit like SPARK bucketBy.
The Driver only gets the metadata if specifying S3 directly.
In your terms:
"... each executor downloads the data directly (on the worker node)? " YES
Metadata is gotten in some way with Driver coordination and other system components for file / directory locations on S3, but not that the data is first downloaded to Driver - that would be a big folly in design. But it depends also on format of statement how the APIs respond.
The question is regarding spark 1.6
When a dataframe is written to HDFS in SaveMode.APPEND mode, I want to know which files were created new.
A way to do this is to keep track of files in HDFS before and after job, is there a better way?
Also Map-Reduce prints job statistics at the end, do we have something similar for every spark action.
I am trying to understand which of the below two would be better option especially in case of Spark environment :
Loading the parquet file directly into a dataframe and access the data (1TB of data table)
Using any database to store and access the data.
I am working on data pipeline design and trying to understand which of the above two options will result in more optimized solution.
Loading the parquet file directly into a dataframe and access the data is more scalable comparing to reading RDBMS like Oracle through JDBC connector. I handle the data more the 10TB but I prefer ORC format for better performance. I suggest you have to directly read data from files the reason for that is data locality - if your run your Spark executors on the same hosts, where HDFS data nodes located and can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.