Spark pulling data into RDD or dataframe or dataset - apache-spark

I'm trying to put into simple terms when Spark pulls data through the driver, and when Spark doesn't need to pull data through the driver.
I have 3 questions:
1. Let's say you have a 20 TB flat file stored in HDFS and from a driver program you pull it into a DataFrame or an RDD, using one of the respective libraries' out-of-the-box functions (sc.textFile(path) or sc.textFile(path).toDF, etc). Will it cause the driver program to hit an OOM if the driver is run with only 32 GB of memory? Or at least cause swapping on the driver JVM? Or will Spark and Hadoop be smart enough to distribute the data from HDFS into Spark executors to make a DataFrame/RDD without going through the driver?
2. The exact same question as 1, except from an external RDBMS?
3. The exact same question as 1, except from a specific node's file system (just a Unix file system, a 20 TB file but not HDFS)?

Regarding 1
Spark operates on distributed data structures such as RDD and Dataset (and DataFrame before 2.0). Here are the facts you should know about these data structures to answer your question:
All transformation operations (map, filter, etc.) are lazy. This means that no reading is performed until you require a concrete result of your operations (such as reduce, fold, or saving the result to a file).
When processing a file on HDFS, Spark operates on file partitions. A partition is the minimal logical batch of data that can be processed. Normally one partition equals one HDFS block, and the total number of partitions can never be less than the number of blocks in the file. The common (and default) HDFS block size is 128 MB.
All actual computations (including reading from HDFS) in RDD and Dataset are performed inside executors and never on the driver. The driver creates a DAG and a logical plan of execution and assigns tasks to executors for further processing.
Each executor runs its assigned task against a particular partition of data. So normally, if you allocate only one core to your executor, it processes no more than 128 MB (the default HDFS block size) of data at a time.
So basically, when you invoke sc.textFile no actual reading happens. The facts above explain why an OOM doesn't occur even when processing 20 TB of data.
There are some special cases, such as join operations. But even then, all executors flush their intermediate results to local disk for further processing.
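To put rough numbers on this, here is a back-of-the-envelope sketch in plain Python (not Spark code). The cluster size of 50 executors with 4 cores each is an assumption made up for illustration; the 128 MB block size is the default mentioned above.

```python
TB = 1024 ** 4
MB = 1024 ** 2

def num_partitions(file_bytes, block_bytes=128 * MB):
    """One partition per HDFS block, rounding up for a trailing partial block."""
    return (file_bytes + block_bytes - 1) // block_bytes

def bytes_in_flight(executors, cores_per_executor, block_bytes=128 * MB):
    """Each core works on one partition at a time, so this bounds the input being read at once."""
    return executors * cores_per_executor * block_bytes

partitions = num_partitions(20 * TB)
# Hypothetical cluster: 50 executors x 4 cores (assumed, not from the question)
in_flight = bytes_in_flight(executors=50, cores_per_executor=4)

print(partitions)        # 163840
print(in_flight // MB)   # 25600
```

So a 20 TB file becomes about 164 thousand small tasks, and only a few tens of GB of input are being read across the whole cluster at any moment, none of it on the driver.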
Regarding 2
In the case of JDBC, you decide how many partitions your table is read into, and you choose an appropriate partition key in the table that splits the data into partitions evenly. It's up to you how much data is loaded into memory at the same time.
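As an illustration of what "choosing a partition key" means, here is a simplified sketch of range partitioning on a numeric column. Spark's JDBC reader takes the options partitionColumn, lowerBound, upperBound and numPartitions; the function below is only a stand-in written for this answer, not Spark's implementation, and the column name `id` and the bounds are assumptions.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Build one WHERE clause per partition so each task reads a disjoint key range."""
    if num_partitions <= 1:
        return ["1 = 1"]  # a single task reads the whole table
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        end = start + stride
        if i == 0:
            # first range is open at the bottom and also picks up NULL keys
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last range is open at the top
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates

for clause in range_predicates("id", 0, 1000, 4):
    print(clause)
```

Each clause becomes a separate query run by a separate task, so the amount of data held in memory per task is roughly the table size divided by the number of partitions, provided the key is evenly distributed.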
Regarding 3
The block size of a local file is controlled by the fs.local.block.size property (I guess 32 MB by default). So it is basically the same as 1 (the HDFS file), except that you read all the data from one machine and one physical disk drive (which is extremely inefficient for a 20 TB file).

Related

Spark SQL data storage life cycle

I recently had an issue with one of my Spark jobs, where I was reading a Hive table with several billion records and the job failed due to high disk utilization. After adding an AWS EBS volume, the job ran without any issues. Although that resolved the issue, I have a few doubts. I tried doing some research but couldn't find any clear answers. So my question is:
When Spark SQL reads a Hive table, where is the data stored for processing initially, and what is the entire life cycle of the data in terms of its storage if I don't explicitly specify anything? And how does adding EBS volumes solve the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(); it is not part of the DAG. You can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In the latest versions of Spark, you can push down the filter (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion). I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever Spark reads data from Hive tables, it stores it in an RDD. One point I want to make clear here is that Hive is just a warehouse, so it is a layer on top of HDFS. When Spark interacts with Hive, Hive gives Spark the HDFS location of the table.
Thus, when Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is determined by the Hadoop InputFormat used to read the file. For example, if you use textFile() it would be TextInputFormat in Hadoop, which returns a single partition for a single HDFS block (note: the split between partitions is made on line boundaries, not on exact block boundaries), unless you have a compressed file format like Avro/Parquet.
If you manually add rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in the RDD to the x partitions you want, and the partitioning is done on a round-robin basis.
If you have a 10 GB uncompressed text file stored on HDFS, then with an HDFS block size of 256 MB (the Hadoop default is 128 MB) it is stored in 40 blocks, which means that the RDD you read from this file has 40 partitions. When you call repartition(1000), your RDD is marked to be repartitioned, but it is actually shuffled to 1000 partitions only when you execute an action on top of this RDD (the lazy execution concept).
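The round-robin idea can be sketched in a few lines of plain Python. This is a simplification for illustration only: it uses a single global counter, whereas Spark distributes the records of each input partition independently, and the 10 sample records are made up.

```python
def repartition_round_robin(partitions, x):
    """Deal every record from the input partitions into x output partitions in turn."""
    out = [[] for _ in range(x)]
    position = 0
    for part in partitions:
        for record in part:
            out[position % x].append(record)
            position += 1
    return out

# N = 2 input partitions holding 10 records in total
source = [list(range(0, 5)), list(range(5, 10))]
result = repartition_round_robin(source, 4)

print([len(p) for p in result])   # [3, 3, 2, 2]
```

Because records are dealt out without looking at their content, the output partitions end up nearly equal in size, which is the point of repartition as opposed to partitioning by key.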
Now it's up to Spark how it processes the data. Since Spark uses lazy evaluation, it prepares a DAG for optimal processing before doing any work. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate the job will fail.
Once it has prepared the DAG, it starts processing the data. It divides your job into stages and stages into tasks. Each task then uses specific executors, shuffling, and partitioning. So in your case, when you process billions of records, your configuration may not be adequate for the processing. One more point: when we say Spark loads the data into an RDD/DataFrame, the data is managed by Spark, and there are options to keep it in memory and disk, memory only, etc. (ref: storage_spark).
Briefly,
Hive --> HDFS --> Spark --> RDD (storage depends, as evaluation is lazy).
You may refer to the following link: Spark RDD - is partition(s) always in RAM?

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This
chunk is called a Spark partition. It is a collection of rows that
sits on one physical machine in the cluster. Executors are responsible
for carrying out the work assigned by the driver. Each executor is
responsible for two things: (1) execute code assigned by the driver,
(2) report the state of the computation back to the driver
I am wondering what happens if the storage of the Spark cluster is less than the data that needs to be processed. How will executors fetch the data to sit on the physical machine in the cluster?
The same question goes for streaming data, which is unbounded. Does Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it could be configured to run the jobs only with disk.
Section 2.6.4, Behavior with Insufficient Memory, of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of memory available.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.

How to process data in parallel but write results in a single file in Spark

I have a Spark job that:
Reads data from hdfs
Does some intensive transformation without shuffling and aggregation (only map operations)
Writes results back to hdfs
Let's say I have 10GB of raw data (40 blocks = 40 input partitions), which results in 100MB of processed data. To avoid generating many small files in hdfs I use "coalesce(1)" statement in order to write single file with results.
Doing so I get only 1 task running (because of "coalesce(1)" and absence of shuffling), which processes all 10GB in a single thread.
Is there a way to do actual intensive processing in 40 parallel tasks and reduce number of partitions right before writing to disk and avoid data shuffle?
I have an idea that might work - to cache dataframe in memory after all processing (do a count to force Spark to cache the data) and then put "coalesce(1)" and write dataframe to disk
The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead of
coalesce(1)
you can try
repartition(1)
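A toy model in plain Python of why this helps (it only counts tasks; it is not Spark code, and the function name is made up for this sketch). Without a shuffle, the map work is pulled into the merged partition, so it runs in as many tasks as there are output partitions. With a shuffle, the map work keeps the upstream partitioning.

```python
def stage_task_counts(input_partitions, target_partitions, shuffle):
    """Return (tasks that run the map work, tasks that write the output)."""
    if shuffle:
        # repartition: a shuffle boundary separates the map work from the write
        return input_partitions, target_partitions
    # coalesce without shuffle: the map work runs inside the merged partitions
    merged = min(input_partitions, target_partitions)
    return merged, merged

print(stage_task_counts(40, 1, shuffle=False))  # (1, 1)
print(stage_task_counts(40, 1, shuffle=True))   # (40, 1)
```

In the question's terms: with repartition(1) the intensive processing still runs as 40 parallel tasks, and only the roughly 100 MB of processed output is shuffled to the single task that writes the file.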

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job. It reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got an OOM; I was then able to fix the memory issue by using 1024 partitions. But why did using more partitions solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all of the data fits into memory and it cannot be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data is processed for each partition, the results need to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a revolutionary concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated. These RDDs can be treated as intermediate results that we don't actually want to retrieve.
Actions are used when you really want to get the data. These RDDs/data can be treated as what we actually want, like taking the top failing records.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets it.
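The transformation/action split can be mimicked with a tiny class in plain Python. This is only a toy written for this answer (the class name LazyDataset and the source function are made up; it has none of Spark's distribution or fault tolerance): transformations just record steps, and nothing is read until an action runs.

```python
class LazyDataset:
    """Records transformations; only collect() (the 'action') touches the source."""

    def __init__(self, source, steps=()):
        self._source = source          # a function that returns an iterator
        self._steps = tuple(steps)     # the recorded plan

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        out = []
        for item in self._source():   # the source is read only here
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

reads = []

def read_source():
    reads.append("read")              # track how often the source is opened
    return iter(range(10))

ds = LazyDataset(read_source).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
print(len(reads))     # 0
print(ds.collect())   # [0, 6, 12, 18]
print(len(reads))     # 1
```

Building the chain of map and filter opens nothing; the single read happens when collect() is called, and the intermediate values are never stored.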
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency
Increased memory pressure for transformations that involve a shuffle
More susceptibility to data skew
Too many partitions can also have a negative impact:
Too much time spent scheduling multiple tasks
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks, as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
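A rough sketch in plain Python of how these three settings appear to combine into a maximum split size when a DataFrame is read from files. The formula is my reading of Spark's behaviour and may differ between versions, so treat it as an approximation; the file sizes and the parallelism of 8 are assumptions for the example.

```python
MB = 1024 ** 2

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Approximate split size: capped by maxPartitionBytes, floored by openCostInBytes."""
    # every file is charged an extra "open cost" on top of its length
    total = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# One large 10 GB file, 8 cores: the 128 MB cap wins
print(max_split_bytes([10 * 1024 * MB], 8) // MB)   # 128

# One small 64 MB file, 8 cores: the split shrinks so that all cores get work
print(max_split_bytes([64 * MB], 8))                # 8912896
```

The practical consequence is that large inputs are cut into pieces of at most 128 MB by default, while small inputs are cut into smaller pieces so that the available cores are not left idle.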
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit, Aaron Davidson gave some tips about tuning partitions. He also defined a reasonable number of partitions, summarized in the three points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" here depends on the sizes of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = each task should still take at least 100 ms to run
Rockie's answer is right, but it doesn't get to the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, when you have more partitions, Spark keeps fewer of them in the LRU cache at once, so they do not cause an OOM (this can have a negative impact too, such as the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: this is the memory needed for each task, and each task is executed on one (virtual) core. So it's not difficult to see that X = 64 leads to an OOM, but X = 1024 does not.
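The division is easy to check with a few lines of plain Python. The question only says "a few hundred GB", so the 300 GB total used here is an assumption for illustration.

```python
GB = 1024 ** 3

def bytes_per_task(total_bytes, partitions):
    """Average amount of data one task has to handle."""
    return total_bytes / partitions

total = 300 * GB   # assumed input size, not stated exactly in the question

for x in (64, 1024):
    print(x, round(bytes_per_task(total, x) / GB, 2))
# 64 4.69
# 1024 0.29
```

With 64 partitions each task handles several GB, and a handful of such tasks running concurrently on a 30 GB node leaves little headroom; with 1024 partitions each task handles a few hundred MB.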

how does hashpartitioner work in spark?

Say I have a lot of data in a couple of S3 files, about 5 GB each, which I read in using sc.textFile.
I need to join the data from the two files, so I opted to use the HashPartitioner technique and set a partition count of 20. The job, submitted to 8 worker nodes, fails without any meaningful messages. Now I am thinking maybe I need to pick a proper number of partitions.
Obviously, the idea is for Spark to partition all the data based on a chosen key. In order to load the data into 20 partitions, I imagine Spark has to read through every line of data, compute its hash, and load it into the memory of the matching partition, which resides on one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory cannot fit all the data, I imagine Spark works on certain partitions first. After processing these first partitions, it flushes the original partitions and reads from the source files again, loading the remaining data into new partitions. This would mean reading the same file as many times as necessary to process all partitions using the available memory. Is this also correct?
Should I calculate the number of partitions so that at least one full partition fits into a single node's memory? Are there other guidelines to follow?
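For reference, the basic hash-partitioning idea described above can be sketched in plain Python. This is a minimal illustration, not Spark's HashPartitioner: zlib.crc32 is used here as a stand-in for the key's hash code so that the result is stable between runs, and the sample records are made up.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def hash_partition(records, num_partitions):
    """Place each (key, value) record into the partition its key hashes to."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts

left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", "x"), ("c", "y")]
left_parts = hash_partition(left, 20)
right_parts = hash_partition(right, 20)

# Matching keys land in the same partition index on both sides,
# so the join can be done one partition pair at a time.
joined = []
for l_part, r_part in zip(left_parts, right_parts):
    lookup = {}
    for k, v in r_part:
        lookup.setdefault(k, []).append(v)
    for k, v in l_part:
        for rv in lookup.get(k, []):
            joined.append((k, v, rv))

print(sorted(joined))   # [('a', 1, 'x'), ('c', 3, 'y')]
```

The point of using the same partitioner and partition count on both datasets is that only one pair of partitions has to be considered at a time, rather than the whole of both inputs.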

Resources