Efficiently write one output file when partitioning by column - apache-spark

I have a large data set, df, made up of events. I want to write it out, partitioned by year/month/dat/hour, and have each resulting partition contain only file.
Here's a code snippet:
df.partitionBy("event_year", "event_month", "event_day", "event_hour").
mode(SaveMode.Overwrite).
parquet(s"${output_data_root}/tmp/")
What's unclear is what to do with df prior to this operation to get one file out, as it's unclear how partition(COL) and coalesce interact. IE, what happens when I do:
df.repartition(col("year"), col("month"), col("day"), col("event_hour")).coalesce(1)
(or vice versa)
It wouldn't work to just coalesce(1) (the data set is far too large), but from what I can tell, repartition(COL) will not necessarily result in one partition per column set.

It's still unclear to me exactly what is going on under the hood, but it turns out you just:
df.repartition(1, col("year"), col("month"), col("day"), col("event_hour"))
Anecdotally, this is WAY faster than repartition(...).coalesce, and particularly when using S3, it definitely is important to keep your file count minimal.

Related

Sorted parquet files for query optimization

Question Purpose
Sorting a parquet files provides a number of benefits:
more efficient filtering using file metadata
more efficient compression rate
There may be other benefits for this. There is a lot of discussion about this on the Internet. For this reason, the discussion of this question is not about the cause of sorting. Rather, the purpose of this question is to talk about how to sort, which is mentioned in all Internet links with the least explanation (about 30%) and the challenges of data sorting are not mentioned at all. The purpose of this question is to get help from all friends who are experts and experienced in this field and to determine the best method (based on cost and benefit) for sorting.
Brief explanation about Apache parquet library
Before starting discussing Spark, I will explain about the tool used to produce parquet files. The parquet-mr library (I use Java for example, but it can probably be extended to other languages) writes to a disk and memory at the same time when we create a parquet file. This library also has a feature called getDataSize() that returns the exact final size of the file after it is completely closed on the disk, so we can use it to achieve the following two conditions when we write parquet files:
Do not make parquet files with small size (which is not good for query engines)
All parquet files can be produced with a certain minimum size or fixed size (for example, 1 GB each file)
Since this library writes to disk and memory at the same time, it does not allow data to be sorted unless all the data is first sorted in memory and then given to the library. (But this is not possible with large volumes of data.) We also implicitly assume that data is being generated as a stream that we intend to store. (In the case of a fixed data, the problem stated in this question will be meaningless because it can be said that the whole data is arranged once and for all and the problem is over. But we assume that there is a flow of data, in which case it is important to have an optimal way to sort the data)
One advantage mentioned above for the Apache parquet library is that we can fix the exact size of the output parquet file. This is an advantage in my opinion. Because, for example, if I know that the size of Hadoop blocks is equal to 128 MB and the size of parquet row-group is 128 MB, I can fix the parquet file size to 1 GB. Then I know that all parquet files will have 8 blocks and HDFS storage will be used best and all parquet files will be the same. (Because in HDFS, when the block size is 128 MB, the smaller file will take up the same amount of space) This may not be an advantage for everyone, and we'd be happy for experienced people to critique it if needed.
Parquet File Sorting Challenges
One point before we start is that we are looking for permanent data sorting because we are going to use it in the next thousands of queries. Almost so far, the above descriptions have identified some of challenges for sorting, but I will describe all of the challenges below:
Parquet tools do not allow you to write sorted data. So one way is to keep all the data in memory and after sorting, give it to the parquet library to be written in the parquet file. This method has two drawbacks: 1) It is not possible to keep all data in memory. 2) Because all the data is in memory, the size of the parquet file is not known and may be less than or more than 1 GB or any amount after writing, and the advantage of being fixed parquet size is lost.
Suppose we want to do this sorting in a parallel process instead of doing it in real time and stream. In this way, if we want to use parquet library, we will still have the problem that we have to bring the whole data to the memory for sorting, which is not possible. So let's say we use a tool like Spark for sorting. A specific cost we give in this section is that cluster resources are used for sorting, and in practice each data is written twice. (Once the parquet writing time and once the sorting) The next point is that even if we skip these two cases, after sorting the data, depending on the other columns in the parquet file, the amount of parquet compression for that particular column and for the whole data may change and increase or decrease. For this reason, after the parquet file is written, small files may be created or the fixed size (for example, 1 GB) may change. Unfortunately, Spark does not provide a way to control the file size (it may not be possible in practice), and therefore if we want to restore the fixed file size, we may need to use methods such as the mentioned link, which will not be free (causes to write the file several times apart from the cluster resources that are consumed and the exact file size will not be fixed):How do you control the size of the output file
Maybe there is no other way and the only ways are the mentioned one at the above. In which case, I would be happy for this note to be expressed by experts so that others know that there is no other way right now.
Challenges In Summary
For this reason, we generally observed 2 types of problems in these solutions:
How to do sorting at a reasonable cost and time (in stream flow)
How to keep the size of parquet files fixed
For this reason, although it is said everywhere that sorting is very good (and the results of surveys, both on the Internet and by myself, show that it is really useful), there is no mention at all of its methods and challenges. I ask experienced and expert friends in this field to help me in this direction (hoping that it will help others as well) and if ways or points are missed in this explanation, please state it.
Sorry if there is a typo in some parts due to my weakness in English language. Thanks.

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blob posts suggesting that:
df.repartition('category').write().partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write().partitionBy('category')
df.repartition(2, 'category').write().partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write().partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count if df is expensive and not cached (and/or so big that it would be wasteful to cache just for this purpose), and also any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputss along the way.
The "fastest" way probably depends on the actual hardware set-up and actual data (in case it is skewed). To my knowledge, I also agree that df.repartition('category').write().partitionBy('category') will not help solving your problem.
We faced a similar problem in our application but instead of doing first a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure and simply reads the data per folder/partition, coalesces its data to one partition and overwrites them back. Again, I can not tell if that is the fastest way also to your environment, but for us it did the trick.
Having done some research on this topic lead to the Auto Optimize Writes feature on Databricks for writing to a Delta Table. Here, they use a similar approach: First writing the data and then running a separate OPTIMIZE job to aggregate the files into a single file. In the mentioned link you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: Make sure to keep the configuration spark.sql.files.maxRecordsPerFile to 0 (default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
You can try coalesce(n); coalesce is used to decrease the number of partitions, which is an optimized version of repartition.
n = The number of partitions you want to be output.

Does Spark guarantee consistency when reading data from S3?

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantees that I read the data each time in the same order? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files again with same content you will get the same blocks / partitions again and the same id using f.monotonically_increasing_id().
If the total number of rows differs on the successive read(s) with different partitioning applied before this function, then typically you will get different id's.
If you have more data second time round and apply coalesce(1) then the prior entries will have same id still, newer rows will have other ids. A less than realistic scenario of course.
Blocks for files at rest remain static (in general) on HDFS. So partition 0..N will be the same upon reading from rest. Otherwise zipWithIndex would not be usable either.
I would never rely on the same data being in same place when read twice unless there were no updates (you could cache as well).

Spark SQL join with empty dataset results in larger output file size

I ran into an issue in which apparently performing a full outer join with an empty table in Spark SQL results in a much larger file size than simply selecting columns from the other dataset without doing a join.
Basically, I had two datasets, one of which was very large, and the other was empty. I went through and selected nearly all columns from the large dataset, and full-outer-joined it to the empty dataset. Then, I wrote out the resulting dataset to snappy-compressed parquet (I also tried with snappy-compressed orc). Alternatively, I simply selected the same columns from the large dataset and then saved the resulting dataset as snappy-compressed parquet (or orc) as above. The file sizes were drastically different, in that the file from the empty-dataset-join was nearly five times bigger than the simple select file.
I've tried this on a number of different data sets, and get the same results. In looking at the data:
Number of output rows are the same (verified with spark-shell by reading in the output datasets and doing a count)
Schemas are the same (verified with spark-shell, parquet-tools, and orc-tools)
Spot-checking the data looks the same, and I don't see any crazy data in either type of output
I explicitly saved all files with the same, snappy, compression, and output files are given '.snappy.parquet' extensions by Spark
I understand that doing a join with an empty table is effectively pointless (I was doing so as part of some generic code that always performed a full outer join and sometimes encountered empty datasets). And, I've updated my code so it no longer does this, so the problem is fixed.
Still, I would like to understand why / how this could be happening. So my question is -- why would doing a Spark SQL join with an empty dataset result in a larger file size? And/or any ideas about how to figure out what is making the resulting parquet files so large would also be helpful.
After coming across several other situations where seemingly small differences in the way data was processed resulted in big difference in file size (for seemingly the exact same data), I finally figured this out.
The key here is understanding how parquet (or orc) encodes data using potentially sophisticated formats such as dictionary encoding, run-length encoding, etc. These formats take advantage of data redundancy to make file size smaller, i.e., if your data contains lots of similar values, the file size will be smaller than for many distinct values.
In the case of joining with an empty dataset, an important point is that when spark does a join, it partitions on the join column. So doing a join, even with an empty dataset, may change the way data is partitioned.
In my case, joining with an empty dataset changed the partitioning from one where many similar records were grouped together in each partition, to one where many dissimilar records were grouped in each partition. Then, when these partitions were written out, many similar or dissimilar records were put together in each file. When partitions had similar records, parquet's encoding was very efficient and file size was small; when partitions were diverse, parquet's encoding couldn't be as efficient, and file size was larger -- even though the data overall was exactly the same.
As mentioned above, we ran into several other instances where the same problem showed up -- we'd change one aspect of processing, and would get the same output data, but the file size might be four times larger. One thing that helped in figuring this out was using parquet-tools to look at low-level file statistics.

Spark transformations and ordering

I am working on parsing different types of files (text,xml,csv etc.) into a specific text file format using spark java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these element are extracted from input file at some point.
I tried to achieve this in below 2 ways:
Read file to RDD using sparks textFile and perform parsing by using map or mapPartions which returns new RDD.
Read file using sparks textFile , reduce to 1 partition using coalesce and perform parsing by using mapPartions which returns new RDD.
While I am not concerned about sequencing of actual data, with first approach I am not able to keep the required order of File Header, Start Tag, Data Header and End Tag.
The latter works for me, but I know it is not efficient way and may cause problem in case of BIG files.
Is there any efficient way to achieve this?
You are correct in you assumptions. The second choice simply cancels the distributional aspect of your application, so it's not scalable. For the order issue, as the concept is asynchronous, we cannot keep track of order when the data reside in different nodes. What you could do is some preprocessing that would cancel the need for order. Meaning, merge lines up to the point where the line order does not matter and only then distribute your file. Unless you can make assumptions about the file structure, such as number of lines that belong together, I would go with the above.

Resources