Spark/pySpark: Best way to read small binary data files - apache-spark

I need to read data from binary files. The files are small, of the order of 1 MB, so it's probably not efficient to use binaryFiles() and process them file by file (too much overhead).
I can join them in one big file and then use binaryRecords(), but the record size is just 512 bytes, so I'd like to concatenate several records together, in order to produce chunks of the size of tens of megabytes. The binary file format allows this.
How can I achieve this?
More generally: is this the right approach to the problem?
Thanks!

As of Spark 2.1, binaryFiles() will coalesce multiple small input files into a partition (default is 128 MB per partition), so using binaryFiles() to read small files should be much more efficient now.
See also https://stackoverflow.com/a/51460293/215945 for more details about binaryFiles() and how to adjust the default 128 MB size (if desired).
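Once binaryFiles() hands back each small file as a byte string, the 512-byte records mentioned in the question still have to be cut out of it. A minimal sketch of that step, assuming every file is a whole number of fixed-size records (the helper name split_records and the path in the comment are hypothetical, not from the original posts):

```python
RECORD_SIZE = 512  # record size stated in the question


def split_records(blob, record_size=RECORD_SIZE):
    """Split the bytes of one file into fixed-size records."""
    if len(blob) % record_size != 0:
        raise ValueError("file length is not a multiple of the record size")
    return [blob[i:i + record_size] for i in range(0, len(blob), record_size)]


# Local check with a fake file that holds three empty records.
fake_file = bytes(3 * RECORD_SIZE)
records = split_records(fake_file)
print(len(records), len(records[0]))

# With a live SparkContext `sc` (the path is a placeholder):
#   rdd = (sc.binaryFiles("hdfs:///path/to/small/files")
#            .flatMap(lambda path_and_bytes: split_records(path_and_bytes[1])))
```

Keeping the parsing in a plain function means it can be checked locally before it is passed to flatMap on a cluster.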

I'm not sure, but this way might help:
N is the number of your small files.
rdd = sc.parallelize(1 to N, N).mapPartitions(binaryFiles()).collect()

Related

What is open cost bytes in spark?

What does the property spark.sql.files.openCostInBytes do?
This is the definition from the official documentation:
The estimated cost to open a file, measured by the number of bytes could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over-estimated, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first). This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
I still don't get it. Can anyone explain, with a small example, why and where it is useful?
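One way to get a feel for the setting is a simplified model of how files are packed into partitions: each file is charged its own length plus the open cost, and a partition is closed once the next file would push it past the size limit. The sketch below is an approximation written for illustration, not Spark's actual code. It assumes the documented defaults of 128 MB for spark.sql.files.maxPartitionBytes and 4 MB for spark.sql.files.openCostInBytes, and it ignores the splitting of large files.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_split=128 * MB, open_cost=4 * MB):
    """Greedily pack files into partitions, charging open_cost per file."""
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # largest files first
        if current and current_size + size > max_split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost  # the padding is what limits file count
    if current:
        partitions.append(current)
    return partitions


small_files = [1 * MB] * 100  # one hundred 1 MB files
with_cost = pack_files(small_files, open_cost=4 * MB)
without_cost = pack_files(small_files, open_cost=0)
print(len(with_cost), [len(p) for p in with_cost])
print(len(without_cost), [len(p) for p in without_cost])
```

In this model, a larger open cost puts fewer small files into each partition, so a task is not handed hundreds of tiny files just because their combined byte count is low.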

Sorted parquet files for query optimization

Question Purpose
Sorting parquet files provides a number of benefits:
more efficient filtering using file metadata
more efficient compression rate
There may be other benefits as well, and there is a lot of discussion about them on the Internet. For that reason, this question is not about why to sort. Rather, it is about how to sort, which every link I found covers with the least explanation (about 30%), while the challenges of sorting data are not mentioned at all. I hope that people who are expert and experienced in this field can help determine the best method, based on cost and benefit, for sorting.
Brief explanation about Apache parquet library
Before discussing Spark, I will explain the tool used to produce the parquet files. The parquet-mr library (I use Java as the example, but this can probably be extended to other languages) writes to disk and memory at the same time when we create a parquet file. The library also has a method called getDataSize() that returns the exact final size of the file after it is completely closed on disk, so we can use it to meet the following two conditions when we write parquet files:
Do not make parquet files with small size (which is not good for query engines)
All parquet files can be produced with a certain minimum size or fixed size (for example, 1 GB each file)
Since this library writes to disk and memory at the same time, it does not allow data to be sorted unless all the data is first sorted in memory and then handed to the library, which is not possible with large volumes of data. We also implicitly assume that the data is generated as a stream that we intend to store. (For a fixed dataset the problem in this question is meaningless, because the whole dataset can be sorted once and the problem is over. We assume instead that there is a flow of data, in which case it is important to have an optimal way to sort it.)
One advantage of the Apache parquet library mentioned above is that we can fix the exact size of the output parquet file. In my opinion this is an advantage. For example, if I know that the Hadoop block size is 128 MB and the parquet row-group size is 128 MB, I can fix the parquet file size at 1 GB. Then I know that every parquet file will have 8 blocks, HDFS storage will be used as well as possible, and all parquet files will be the same. (My understanding is that in HDFS, when the block size is 128 MB, a smaller file takes up the same amount of space.) This may not be an advantage for everyone, and I would be happy for experienced people to critique it if needed.
Parquet File Sorting Challenges
One point before we start: we are looking for a permanent sort of the data, because it will be used by thousands of later queries. The descriptions above have already identified some of the challenges of sorting, but I will describe all of them below:
Parquet tools do not allow you to write sorted data. So one way is to keep all the data in memory and, after sorting, give it to the parquet library to be written to the parquet file. This method has two drawbacks: 1) It is not possible to keep all the data in memory. 2) Because all the data is in memory, the size of the parquet file is not known in advance and may be less than or more than 1 GB (or any other target) after writing, so the advantage of a fixed parquet size is lost.
Suppose we want to do the sorting in a separate parallel process instead of doing it in real time on the stream. If we use the parquet library, we still have the problem that the whole dataset must be brought into memory for sorting, which is not possible. So let's say we use a tool like Spark for sorting. One specific cost here is that cluster resources are used for sorting, and in practice every record is written twice (once when the parquet file is first written and once when it is sorted). The next point is that even if we ignore these two costs, after sorting, depending on the other columns in the parquet file, the compression achieved for that particular column and for the whole dataset may increase or decrease. For this reason, after the parquet file is written, small files may be created or the fixed size (for example, 1 GB) may change. Unfortunately, Spark does not provide a way to control the file size (it may not be possible in practice), so if we want to restore the fixed file size we may need to use methods such as those in the linked question "How do you control the size of the output file". These are not free: the file is written several times, cluster resources are consumed, and the exact file size is still not fixed.
Maybe there is no other way and the methods mentioned above are the only ones. In that case, I would be happy for experts to say so, so that others know there is no other way right now.
Challenges In Summary
Overall, these solutions run into two kinds of problems:
How to do sorting at a reasonable cost and time (in stream flow)
How to keep the size of parquet files fixed
Although it is said everywhere that sorting is very good (and the results of surveys, both on the Internet and my own, show that it really is useful), there is no mention at all of its methods and challenges. I ask experienced people in this field to help me with this (hoping that it will help others as well), and if any approaches or points are missing from this explanation, please point them out.
Sorry if there are mistakes in some parts due to my weak English. Thanks.

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and they all seem to say that each of the .parquet files produced by Spark should be either 64 MB or 1 GB in size. I still can't work out which scenarios call for which of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' is 4 or 48; reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data, which will be split and saved into daily buckets per year. 'n' is 4 or 48 purely for testing purposes: since I know the size of my test set in advance, I try to pick a number that gets the files as close to 64 MB or 1 GB as I can. I haven't implemented code to buffer the data until I reach the exact size I need before saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
Also, what would be the optimal size for daily datasets of around 10 GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
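The way 'n' is picked in the question (known dataset size divided by the desired file size) can be sketched as below. This is only a rough estimate under stated assumptions: it treats the in-memory size as the on-disk size, so it ignores parquet compression, and it ignores the extra files that partitionBy creates per partition directory. The function name is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimate_num_files(total_bytes, target_file_bytes):
    """Number of output files needed to stay at or below the target size."""
    if total_bytes <= 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division in integer arithmetic, with a floor of one file.
    return max(1, -(-total_bytes // target_file_bytes))


daily_bytes = 3 * GB  # upper end of the 2.5 GB to 3 GB range in the question
print(estimate_num_files(daily_bytes, 64 * MB))
print(estimate_num_files(daily_bytes, 1 * GB))
```

The result would be the value passed to coalesce(n) or repartition(n) before the write.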
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on S3 is very slow.
impossible to ask for larger block sizes than the file length
easier to save if your S3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload true.)
Personally, and this is opinion, partly benchmark driven but not with your queries:
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
on Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2
if your FS connector supports it, make sure random IO is turned on (hadoop-2.8 + spark.hadoop.fs.s3a.experimental.fadvise random)
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
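The S3A settings named in this answer can be collected in one place, for example as spark-submit arguments. The sketch below only assembles strings; the 64 MB block size is an illustrative value taken from the "32-64 MB as a minimum" advice, not a recommendation, and the helper name is hypothetical. The spark.hadoop. prefix is how Spark forwards an option to the Hadoop configuration.

```python
MB = 1024 * 1024

# Options mentioned in the answer above; values are examples only.
s3a_options = {
    "spark.hadoop.fs.s3a.block.size": str(64 * MB),
    "spark.hadoop.fs.s3a.fast.upload": "true",
    "spark.hadoop.fs.s3a.experimental.fadvise": "random",
}


def as_submit_args(options):
    """Render options as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(options):
        args += ["--conf", f"{key}={options[key]}"]
    return args


print(" ".join(as_submit_args(s3a_options)))
```

The same key/value pairs could instead be passed to SparkSession.builder.config(key, value) when the session is created.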

Spark output JSON vs Parquet file size discrepancy

New Spark user here. I wasn't able to find any information comparing the file size of JSON and parquet output of the same DataFrame via Spark.
Testing with a very small data set for now, doing a df.toJSON().collect() and then writing to disk creates a 15 KB file, but doing a df.write.parquet creates 105 files at around 1.1 KB each. Why is the total file size so much larger with parquet in this case than with JSON?
thanks in advance
What you're doing with df.toJSON.collect is gathering all your data into a single JSON (15 KB in your case) and saving that to disk. This does not scale to the situations where you would want to use Spark at all.
To save parquet you are using Spark's built-in function, and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did), so you get 105 files. Each of these files carries the overhead of the file structure and probably stores 0, 1, or 2 records. If you want to save a single file, you should coalesce(1) before you save (again, this is just for the toy example you have), so you'd get 1 file. Note that it still might be larger due to the file format overhead (i.e. the overhead might still be larger than the compression benefit).
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even say how many rows are in your DataFrame). But let me speculate.
First. Text files containing JSON usually take more space on disk than parquet, at least when storing millions or billions of rows. The reason is that parquet is a highly optimized column-based storage format that uses a binary encoding to store your data.
Second. I would guess that you have a very small DataFrame with 105 partitions (and probably 105 rows). When you store something that small the disk footprint should not bother you, but if it does, you need to be aware that each parquet file has a relatively sizeable header describing the data you store.
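The per-file overhead argument in these answers can be put into a small back-of-the-envelope model. The split used below (1.0 KB of fixed overhead per file and 10.5 KB of actual data in total) is a made-up assumption chosen only so that 105 files come out at about 1.1 KB each, as in the question. Real overhead depends on the schema and the writer.

```python
def parquet_total_kb(num_files, payload_kb, overhead_kb_per_file):
    """Total size when every file pays a fixed format overhead."""
    return num_files * overhead_kb_per_file + payload_kb


PAYLOAD_KB = 10.5   # hypothetical: actual data across all files
OVERHEAD_KB = 1.0   # hypothetical: fixed cost of one parquet file

many_files = parquet_total_kb(105, PAYLOAD_KB, OVERHEAD_KB)
single_file = parquet_total_kb(1, PAYLOAD_KB, OVERHEAD_KB)
print(many_files, single_file)
```

Under these assumed numbers almost all of the 105-file total is overhead, which is why coalesce(1) shrinks the output so much for a toy dataset.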

Faster way to split a very big file into smaller files?

I have a small file which is about 6.5 GB and I tried to split it into files of size 5 MB each using split -d --line-bytes=5MB. It took me over 6 minutes to split this file.
I have files over 1TB.
Is there a faster way to do this?
Faster than a tool specifically designed to do this kind of job? Doesn't sound likely in the general case. However, there are a few things you may be able to do:
Save the output files to a different physical storage unit. This avoids reading and writing data to the same disk at the same time, allowing more uninterrupted processing.
If the record size is static you can use --bytes to avoid the processing overhead of dealing with full lines.
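To show what a byte-based split does differently from a line-based one, here is a minimal sketch that cuts a file into fixed-size pieces without ever looking for newlines. It is an illustration of the idea only and is not claimed to be faster than split itself. The function name and the output naming scheme are hypothetical.

```python
import os
import tempfile


def split_by_bytes(src_path, out_dir, chunk_size):
    """Write consecutive chunk_size-byte pieces of src_path into out_dir."""
    out_paths = []
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)  # raw bytes, no newline scanning
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"x{len(out_paths):02d}")
            with open(out_path, "wb") as out:
                out.write(chunk)
            out_paths.append(out_path)
    return out_paths


# Small local demonstration on a throwaway 2500-byte file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.bin")
    with open(src_path, "wb") as f:
        f.write(os.urandom(2500))
    pieces = split_by_bytes(src_path, tmp, chunk_size=1000)
    print([os.path.getsize(p) for p in pieces])
```

Pointing out_dir at a different physical disk from the input would combine this with the first suggestion in the answer.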

Resources