Faster reading from blob storage via spark

Faster reading from blob storage via spark - azure

I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning the data into k (number of nodes) different segments and saving them back to blob as parquet files. This way, I can load in different parts of the data set in parallel then union... However, I am unsure if all the data is just loaded on the head node, then when computation occurs, it distributes to the other machines. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.

Generally, you will want to have smaller file sizes on blob storage so that way you can transfer data between blob storage to compute in parallel so you have faster transfer rates. A good rule of thumb is to have a file size between 64MB - 256MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your call out for reading the file and then saving it back to Parquet (with default snappy codec compression) is a good idea. Parquet is natively used by Spark and is often faster to query against. The only tweak would be to partition more by the file size vs. # of worker nodes. The data is loaded onto the worker nodes but partitioning is helpful because more tasks are created to read more files.

Related

Apache Spark loads the entire partition into memory?

Apache Spark loads the entire partition into memory or does it load gradually? Is there any reference (preferably official) about that?
If I have a large partition will be necessary to have the partition size in memory available?
Will loading data from the in-memory partition depend on the type of transformation?

That depends of your file type, if it is CSV/textFile spark usually will load gradually even if you have multiple partitions and it depends of the size of the files. CSV does that because you cannot split by which data you need to read. CSV/textFile to get one row of data you need to scan the whole file.
If we are talking about parquet or orc files the format is naturally splittable. The data will never load the full files if you put some conditions during the read as where and select to choose the columns. That is why the recommended file size is around 1GB to optimise the spark time processing.
So if you are using parquet, each partition of spark should be able to be stored in memory while the process is going. Spark will try to store most partitions it can in the memory of the cluster during the transformations you are doing, if that cannot be fitted that will spill to the disk, reducing the execution time but ensure your execution to finish.

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?

Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.

Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.

Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

Can Spark/EMR read data from s3 multi-threaded

Due to some unfortunate sequences of events, we've ended up with a very fragmented dataset stored on s3. The table metadata is stored on Glue, and data is written with "bucketBy", and stored in parquet format. Thus discovery of the files is not an issue, and the number of spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR we end up having each spark partition loading around ~8k files from s3.
As we've stored the data in a columnar format; per our use-case where we need a couple of fields, we don't really read all the data but a very small portion of what is stored.
Based on CPU utilization on the worker nodes, I can see that each task (running per partition) is utilizing almost around 20% of their CPUs, which I suspect is due to a single thread per task reading files from s3 sequentially, so lots of IOwait...
Is there a way to encourage spark tasks on EMR to read data from s3 multi-threaded, so that we can read multiple files at the same time from s3 within a task? This way, we can utilize the 80% idle CPU to make things a bit faster?

There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true) as each file will have to touched to discover its schema.
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions + more available cores means faster I/O... in theory. In practice, S3 can be quite slow if you haven't accesses these files regularly for S3 to scale their availability up. Therefore, in practice, additional Spark parallelism has diminishing returns. Watch the total network RW bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask the prefixes with your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.

Understanting file distribution and partitioning in HDFS when using Hive

On the one hand, in HDFS docs they say:
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more
times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. A typical block
size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different
DataNode.
Meaning every file will be splitted between nodes.
On the other hand, when I use Hive or Spark SQL, I manage the partitions in such a way that there is a folder for each partition, and all the files inside belong to this partition. For example:
/Sales
/country=Spain
/city=Barcelona
/2019-08-28.parquet
/2019-08-27.parquet
/city=Madrid
/2019-08-28.parquet
/2019-08-27.parquet
Let's say that each file's size is 1GB and the HDFS block size is 128 MB.
So I am confused. I don't understand if city=Barcelonav/2019-08-28.parquet is saved on only one node as a whole (even together with city=Barcelona/2019-08-27.parquet), or each file is distributed between 8 nodes.
If each file is distributed, then what is the benefit of the partitions?
If the data is distributed according to the partitions I define, how does HDFS know to do that? Does it look for folders with a name in the form of key=value and make sure they will be saved intact?

You are confused between "how HDFS stores the files that we dump into it" and "how Hive/Spark creates different directories in case of partitioning".
Let me try to provide you a perspective.
HDFS works as you have mentioned.
HDFS breaks up the files into n number of blocks depending upon the block size and the size of the file to be stored. The metadata (directories, permissions, etc..) is an abstraction in a sense that the file (2019-08-27.parquet) that you see as one is indeed distributed among nodes. Namenode maintains the metadata.
However, when we partition it creates different directories on HDFS. This ultimately helps when you want to query the data with conditions on the partitioned column. Only relevant directories are searched for the requested data. If you go ahead and query on your partitioned data and write an explain to have a look at the logical plan, you can notice the Partition Filters while FileScan phase.
The partitioned data is still stored on HDFS in the same way that you mentioned.
Hope this helps!

Process multiple small files of total size 100GB in HDFS

I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from local dir to HDFS path (hdfs://messageDir/..) using batches and for every batch, i could see a few thousand .txt files and their total size is around 100GB. Almost all of the files are less than 1 MB.
May i know how HDFS stores these files and perform splits? Because every file is less than 1 MB (less than HDFS block size of 64/128MB), I dont think any split would happen but the files will be replicated and stored in 3 different data nodes.
When i use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wild card matching like *.txt as below:-
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read the files and perform Partition because HDFS doesn't have any partition for these small files.
What if my file size increases over a period of time and get 1TB volume of small files for every batch? Can someone tell me how this can be handled?

I think you are mixing things up a little.
You have files sitting in HDFS. Here, Blocksize is the important factor. Depending on your configuration, a block normally has 64MB or 128MB. Thus, each of your 1MB files, take up 64MB in HDFS. This is aweful lot of unused space. Can you concat these TXT-files together? Otherwise you will run out of HDFS blocks, really quick. HDFS is not made to store a large amount of small files.
Spark can read files from HDFS, Local, MySQL. It cannot control the storage principles used there. As Spark uses RDDs, they are partitioned to get part of the data to the workers. The number of partitions can be checked and controlled (using repartition). For HDFS reading, this number is defined by the number of files and blocks.
Here is a nice explanation on how SparkContext.textFile() handles Partitioning and Splits on HDFS: How does Spark partition(ing) work on files in HDFS?

You can read from spark even files are small. Problem is HDFS. Usually HDFS block size is really large(64MB, 128MB, or more bigger), so many small files make name node overhead.
If you want to make more bigger file, you need to optimize reducer. Number of write files is determined by how many reducer will write. You can use coalesce or repartition method to control it.
Another way is make one more step that merge files. I wrote spark application code that coalesce. I put target record size of each file, and application get total number of records, then how much number of coalesce can be estimated.
You can use Hive or otherwise.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string