Most efficient way to load many files in spark in parallel? - apache-spark

[Disclaimer: While this question is somewhat specific, I think it circles a very generic issue with Hadoop/Spark.]
I need to process a large dataset (~14TB) in Spark. I'm not doing aggregations, mostly filtering. Given ~30k files (250 part files per month for 10 years, each part being ~200MB), I would like to load them into an RDD/DataFrame and filter out items based on some arbitrary filters.
To make the listing of the files efficient (I'm on Google Dataproc/Cloud Storage, so the driver doing a wildcard glob was very serial and very slow), I precalculate the list of file names, then load them into a DataFrame (I'm using Avro, but the file type shouldn't be relevant), e.g.
# returns an array of files to load
files = sc.textFile('/list/of/files/').collect()
# load the files into a dataframe
documents = sqlContext.read.format('com.databricks.spark.avro').load(files)
When I do this, even on a 50-worker cluster, it seems that only one executor is doing the work of reading the files. I've experimented with broadcasting the file list and have read about a dozen different approaches, but I can't seem to crack the issue.
So, is there an efficient way to create a very large dataframe from multiple files? How do I best take advantage of all the potential computing power when creating this RDD?
This approach works very well on smaller sets but, at this size, I see a lot of symptoms like long-running processes with no feedback. Is there some treasure trove of knowledge -- besides @zero323 :-) -- on optimizing Spark at this scale?

Listing 30k files shouldn't be an issue for GCS - even if each GCS list request (which returns up to 500 files at a time) takes 1 second, all 30k files will be listed in a minute or so. There could be some corner cases where certain glob patterns make it slow, but there have been recent optimizations in the GCS connector's globbing implementation that could help.
That's why it should be good enough for you to just rely on the default Spark API with globbing:
val df = sqlContext.read.avro("gs://<BUCKET>/path/to/files/")
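For reference, a PySpark sketch of the same idea (hedged: the bucket name and path are placeholders, and it assumes the spark-avro package is on the classpath):
# Rely on Spark's default listing/globbing instead of collecting file names on the driver
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('com.databricks.spark.avro').load('gs://<BUCKET>/path/to/files/')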

Related

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?

My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.
E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/; however, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")
but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.
The problem is that your path contains colons (:). Unfortunately, colons in paths are still not supported. Here are some related tickets:
https://issues.apache.org/jira/browse/SPARK-20061
https://issues.apache.org/jira/browse/HADOOP-14217
and threads:
Struggling with colon ':' in file names
I think the only way is to rename these files...
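If renaming is the route you take, a hedged boto3 sketch might look like this (the bucket name and prefix are placeholders; S3 has no real rename, so each object is copied and then deleted):
# Replace the colons in the keys so Spark/Hadoop can read the paths afterwards
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")
for obj in bucket.objects.filter(Prefix="timestamp=2021-12-12 12:"):
    new_key = obj.key.replace(":", "-")
    bucket.copy({"Bucket": bucket.name, "Key": obj.key}, new_key)
    obj.delete()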
If you want performance...
I humbly suggest that when you re-architect this, you don't use S3 file/directory listings to accomplish it. I suggest you use a Hive table partitioned by hour. (Or write a job to migrate the data into hourly partitions made of larger files, not small files.)
S3 is a wonderful engine for cheap long-term storage. It's not performant, and it is particularly bad at directory listing due to how it is implemented. (And performance only gets worse when there are many small files in the directories.)
To get some real performance from your job, you should use a Hive table (partitioned so the file lookups are done in DynamoDB, with partitions at the hour level) or some other groomed file structure that reduces the file count and the directory listings required.
You will see a large performance boost if you can restructure your data into bigger files and avoid file listings altogether.
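A hedged PySpark sketch of that layout (the column names, table name, and paths are hypothetical; the point is only that partition pruning via the metastore replaces S3 directory listing):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/")  # colon-free copy of the raw data
(df.withColumn("event_hour", F.date_format("event_ts", "yyyy-MM-dd-HH"))
   .repartition("event_hour")  # fewer, larger files per partition
   .write.mode("overwrite")
   .partitionBy("event_hour")
   .saveAsTable("events_by_hour"))
# Reading one hour now prunes partitions instead of listing the whole bucket
hour_df = spark.table("events_by_hour").where(F.col("event_hour") == "2021-12-12-12")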

Sorted parquet files for query optimization

Question Purpose
Sorting parquet files provides a number of benefits:
more efficient filtering using file metadata
a better compression ratio
There may be other benefits as well; there is a lot of discussion about this on the Internet. For this reason, this question is not about why to sort. Rather, it is about how to sort, which the Internet resources mention with only minimal explanation (about 30%), while the challenges of sorting the data are not mentioned at all. The purpose of this question is to get help from friends who are experts and experienced in this field, and to determine the best method (based on cost and benefit) for sorting.
Brief explanation about Apache parquet library
Before discussing Spark, I will explain the tool used to produce parquet files. The parquet-mr library (I use Java as the example, but this probably extends to other languages) writes to disk and memory at the same time when we create a parquet file. This library also has a method called getDataSize() that returns the exact final size of the file after it is completely closed on disk, so we can use it to achieve the following two conditions when writing parquet files:
Do not produce small parquet files (which are not good for query engines)
Produce all parquet files with a certain minimum or fixed size (for example, 1 GB per file)
Since this library writes to disk and memory at the same time, it does not allow the data to be sorted unless all of it is first sorted in memory and then handed to the library (which is not possible with large volumes of data). We also implicitly assume that the data is being generated as a stream that we intend to store. (With a fixed dataset, the problem stated in this question would be meaningless, because the whole dataset could be sorted once and for all and the problem would be over. But we assume there is a continuous flow of data, in which case it is important to have an optimal way to sort it.)
One advantage mentioned above for the Apache parquet library is that we can fix the exact size of the output parquet file, which is an advantage in my opinion. For example, if I know that the Hadoop block size is 128 MB and the parquet row-group size is 128 MB, I can fix the parquet file size at 1 GB. Then I know that every parquet file will occupy exactly 8 blocks, HDFS storage will be used optimally, and all parquet files will be the same size. (Because in HDFS, when the block size is 128 MB, a smaller file still takes up the same amount of space.) This may not be an advantage for everyone, and I'd be happy for experienced people to critique it if needed.
Parquet File Sorting Challenges
One point before we start: we are looking for permanent data sorting, because the sorted data will be used by thousands of subsequent queries. The descriptions above have already hinted at some of the challenges of sorting, but I will describe all of them below:
The parquet tools do not sort the data for you as it is written. So one way is to keep all the data in memory and, after sorting it, give it to the parquet library to be written into the parquet file. This method has two drawbacks: 1) it is not possible to keep all the data in memory; 2) because all the data is in memory, the size of the parquet file is not known in advance and may end up below or above 1 GB (or any target) after writing, so the advantage of a fixed parquet file size is lost.
Suppose we do this sorting in a separate parallel job instead of doing it in real time on the stream. If we still use the parquet library directly, we again have the problem of bringing the whole dataset into memory for sorting, which is not possible. So let's say we use a tool like Spark for the sorting. One concrete cost here is that cluster resources are spent on sorting, and in practice every record is written twice (once when the parquet files are first written and once after sorting). The next point is that, even setting those two issues aside, after sorting the data the parquet compression ratio for the sorted column, and for the data as a whole, may change and increase or decrease depending on the other columns. For this reason, after the sorted parquet files are written, small files may appear or the fixed size (for example, 1 GB) may no longer hold. Unfortunately, Spark does not provide a direct way to control the output file size (it may not even be possible in practice), so if we want to restore the fixed file size we may need to use methods like the one in the link below, which are not free (they cause the files to be written several more times, on top of the cluster resources consumed, and the exact file size still will not be fixed): How do you control the size of the output file
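For what it's worth, a minimal sketch of that Spark-based route, assuming a single sort key. The column name, paths, partition count, and records-per-file figure are all assumptions, and maxRecordsPerFile only bounds row counts, so file sizes stay approximate:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/unsorted/")
(df.repartitionByRange(64, "sort_key")  # range partitioning: each output file covers a key range
   .sortWithinPartitions("sort_key")  # sort rows inside every partition
   .write.option("maxRecordsPerFile", 5000000)  # rough size control by row count, not bytes
   .mode("overwrite")
   .parquet("hdfs:///data/sorted/"))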
Maybe there is no other way and the only options are the ones mentioned above. In that case, I would be happy for experts to state this explicitly, so that others know there is currently no other way.
Challenges In Summary
In general, then, we observed two types of problems in these solutions:
How to do the sorting at a reasonable cost and in reasonable time (on a stream of data)
How to keep the size of parquet files fixed
So although it is said everywhere that sorting is very beneficial (and surveys, both on the Internet and my own, show that it really is useful), the methods and challenges are hardly mentioned at all. I ask experienced and expert friends in this field to help me in this direction (hoping that it will help others as well), and if any approaches or points are missing from this explanation, please point them out.
Sorry if there is a typo in some parts due to my weakness in English language. Thanks.

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid 'df.coalesce(1)', but when saving to CSV or Parquet we (my team) add 'df.coalesce(1)'. Is this a common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate only one file, for example to import a CSV file into Excel or a Parquet file into a Pandas-based program. But if you're doing .coalesce(1), then the write happens via a single task, and that becomes the performance bottleneck because that task needs to get the data from the other executors and write it.
If you're consuming the data from Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noticed, this may generate a big number of files, which is often not desirable because you'll get performance overhead from accessing them. So we need a balance between the number of files and their size - for Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages: you read fewer files, you can get better compression for the data inside each file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view - each node writes its data in parallel - but you can end up with too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as was correctly pointed out by @Kafels, there are optimizations that allow you to remove that .coalesce(N) and do automatic tuning to achieve the best throughput (so-called "Optimized Writes") and create bigger files ("Auto Compaction") - but they should be used carefully.
Overall, the optimal file size for Delta is an interesting topic - with big files (1GB is the default used by the OPTIMIZE command) you get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, then big files are bad from a performance standpoint, and it's better to have smaller files (16-64-128MB) so you rewrite less data.
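As an illustration, a hedged sketch of compacting a Delta table (the table path is a placeholder; OPTIMIZE and the target-size setting are available on Databricks and recent Delta Lake versions):
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(128 * 1024 * 1024))  # e.g. a smaller target for heavily updated tables
spark.sql("OPTIMIZE delta.`/mnt/tables/events`")  # compact small files into larger ones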
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10MB and you have 1000 partitions, for example, each file would be about 10KB. Having so many small files would reduce Spark performance dramatically, not to mention that when you have too many files you'll eventually hit OS limits on the number of files. Anyhow, when your dataset is small enough, you should merge the partitions into a couple of files with coalesce.
However, if your dataframe is 100GB, you technically can still use coalesce(1) and save it to a single file, but later on you will have to deal with less parallelism when reading from it.
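To make the trade-off concrete, a small sketch of both patterns (the paths are placeholders; the Delta write assumes the delta-spark package is configured):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumnRenamed("id", "value")
# One CSV file, handy for Excel, but written by a single task:
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tmp/export_csv")
# Delta (or Parquet) written with the default parallelism, one file per partition:
df.write.format("delta").mode("overwrite").save("/tmp/export_delta")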

Correct Parquet file size when storing in S3?

I've been reading a few questions regarding this topic and also several forums, and in all of them they seem to say that each resulting .parquet file coming out of Spark should be either 64MB or 1GB in size, but I still can't make up my mind about which scenarios belong to each of those file sizes and the reasons behind them, apart from HDFS splitting them into 64MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5GB to 3GB of daily data, which will be split and saved into daily buckets per year. The reason behind 'n' being 4 or 48 is just for testing: since I know the size of my test set in advance, I try to get files as close to 64MB or 1GB as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, what would be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
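For example, a hedged sketch of setting that from the Spark side (the 128 MB value is arbitrary; the spark.hadoop. prefix forwards the option to the Hadoop/s3a configuration):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .config("spark.hadoop.fs.s3a.block.size", str(128 * 1024 * 1024))  # split size used when reading from s3a
    .config("spark.hadoop.fs.s3a.fast.upload", "true")  # incremental block uploads (Hadoop 2.8+)
    .getOrCreate())
df = spark.read.parquet("s3a://my-bucket/daily/")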
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead: scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your s3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true.)
Personally (and this is opinion, some of it benchmark driven, but not with your queries):
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
on Hadoop 3.1, use the zero-rename committers; otherwise, switch to the v2 commit algorithm
if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise=random)
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB

How to process different graph files independently across the cluster nodes in Apache Spark?

Let's say I have a large number of graph files and each graph has around 500K edges. I have been processing these graph files on Apache Spark and was wondering how to parallelize the entire graph-processing job efficiently. Since, for now, every graph file is independent of the others, I am looking for parallelism across the files. So, if I have 100 graph files and a 20-node cluster, can I process each file on a node, so that each node processes 5 files? What is happening now is that a single graph is processed in a number of stages, which causes a lot of shuffling.
graphFile = "/mnt/bucket/edges"  # This directory has 100 graph files, each with around 500K edges
nodeFile = "/mnt/bucket/nodes"  # This directory has the node files
# Edges: each line is "src dst"
graphData = sc.textFile(graphFile).map(lambda line: line.split(" ")).flatMap(lambda edge: [(int(edge[0]), int(edge[1]))])
graphDataFrame = sqlContext.createDataFrame(graphData, ['src', 'dst']).withColumn("relationship", lit('edges'))  # DataFrame created so as to work with GraphFrames
# Nodes: split on whitespace (split("\s") would split on the literal backslash-s, not a regex)
nodeData = sc.textFile(nodeFile).map(lambda line: line.split()).flatMap(lambda edge: [(int(edge[0]),)])
nodeDataFrame = sqlContext.createDataFrame(nodeData, ['id'])
graphGraphFrame = GraphFrame(nodeDataFrame, graphDataFrame)
connectedComponent = graphGraphFrame.connectedComponents()
The thing is, it's taking a lot of time to process even a couple of files, and I have to process something like 20K files. Each file has 800K edges. Maybe if a data-partitioning strategy could be figured out that ensures all dependent edges are processed on a single node, there would be less shuffling.
Or what is the best way of solving this efficiently?
TL;DR Apache Spark is not the right tool for the job.
The main scope of Spark is data parallelism, but what you're looking for is task parallelism. Theoretically the core Spark engine is generic enough to achieve limited task parallelism as well, but in practice there are better tools for a job like this, and it is definitely not the goal of libraries like GraphX and GraphFrames.
Since data distribution is the core assumption behind these libraries, their algorithms are implemented using techniques like message passing or joins, which is reflected in the multistage job structure and the shuffles. If the data fits in main memory (you can easily process graphs with millions of edges on a single node using optimized graph-processing libraries), these techniques are completely useless in practice.
Given the piece of code you've shown, an in-core graph-processing library like igraph or NetworkX (the latter better documented and much more comprehensive, but unfortunately memory hungry and a bit slow) combined with GNU Parallel should be more than enough and much more efficient in practice. For more complex jobs you may consider using a full-featured workflow management tool like Airflow or Luigi.
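A hedged sketch of that per-file, in-core route (the paths follow the question; the script name and the parallel invocation are only illustrative):
# connected_components.py - process one edge file entirely in memory with NetworkX
import sys
import networkx as nx
def process(edge_file):
    g = nx.read_edgelist(edge_file, nodetype=int)  # each line: "src dst"
    for i, component in enumerate(nx.connected_components(g)):
        print(edge_file, i, len(component))
if __name__ == "__main__":
    process(sys.argv[1])
Running it with GNU Parallel, e.g. ls /mnt/bucket/edges/* | parallel -j 20 python connected_components.py {}, lets each core (or each node, via parallel's --sshlogin option) handle whole files independently, with no shuffling at all.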
