This may be a silly question, but I'm not able to understand how the files are split across partitions.
My requirement is to read 10,000 binary files (persisted Bloom filter files) from an HDFS location and process each binary file separately by converting its data to a ByteArrayInputStream. The point to note is that these are persisted Bloom filter files: each must be read sequentially from the start of the file to the end and converted to a byte array, and this byte array is then used to reconstruct the Bloom filter object.
JavaPairRDD<String, PortableDataStream> rdd = sparkContext.binaryFiles(commaSeparatedfilePaths);
rdd.map(new Function<Tuple2<String, PortableDataStream>, BloomCheckResponse>() {
    // call(v1) body elided: builds a BloomCheckResponse from v1
});
Here in the code, v1._1 is the file path and v1._2 is the PortableDataStream that will be converted to a ByteArrayInputStream.
Each binary file is 34 MB.
Now the question: can it ever happen that part of a file ends up in one partition and the rest in another? Or will I always get the complete content of a file, mapped to its path, within a single partition, never split across partitions?
Executor memory = 4 GB, cores per executor = 2, and there are 180 executors.
Basically, the expectation is that each file is read as-is, from start to end, without being split.
Each (file, stream) pair is guaranteed to provide the full content of the file in the stream. There is no case where data is divided between multiple pairs, let alone multiple partitions.
You're safe to use it for your intended scenario.
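For illustration, a minimal sketch of that pattern in Scala (rebuildBloomFilter is a placeholder for whatever deserialization your Bloom filter library provides; the Java version works the same way through PortableDataStream.toArray()):
import java.io.ByteArrayInputStream

// each pair is (file path, stream over the *complete* file)
val files = sc.binaryFiles(commaSeparatedfilePaths)

val blooms = files.map { case (path, stream) =>
  val bytes = stream.toArray()                // full content of this one file, never a fragment
  val in    = new ByteArrayInputStream(bytes) // read sequentially from start to end
  (path, rebuildBloomFilter(in))              // placeholder: your own Bloom filter deserialization
}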
For a project I need to frequently, but non-periodically, append about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file, or any other appropriate file format (no DB). Data from a single input file may be spread over one, two, or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times, by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append a new file is generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to even count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data to a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to a single file. They append "into a directory" by creating new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you mentioned, you want Parquet, so write that then. That will give you even smaller file sizes in HDFS.
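As a rough sketch (Scala here, but the same calls exist in PySpark; the paths and the partition count are illustrative), batch each ingest run into a few larger output files instead of one file per source file:
// read everything waiting in the ingest folder in one pass (path is illustrative)
val pending = spark.read.option("header", "true").csv("hdfs:///ingest/incoming/*.csv")

pending
  .coalesce(4)                              // illustrative: aim for roughly 64 MB-1 GB per output file
  .write
  .mode("append")
  .parquet("hdfs:///warehouse/uuid_table")  // illustrative target directory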
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.
I need to use Spark to read a huge uncompressed text file (>20 GB) into an RDD. Each record in the file spans multiple lines (<20 lines per record), so I can't use sc.textFile. I'm considering using SparkContext.newAPIHadoopFile with a custom delimiter. However, since the file is fairly big, I'm curious whether the reading and parsing will be distributed across multiple Spark executors, or happen on only one node?
File content looks as follow:
record A
content for record A
content for record A
content for record A
record B
content for record B
content for record B
content for record B
...
It depends on your input format and, mostly, on the compression codec. E.g. gzip is not splittable, but Snappy is when used inside a splittable container format; an uncompressed text file like yours is splittable.
If it is splittable, the Hadoop API will take care of it according to its split-size configuration:
minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
maxSize = getMaxSplitSize(job);
for each file:
    blockSize = file.getBlockSize();
    splitSize = computeSplitSize(blockSize, minSize, maxSize);
Then each split will become a partition and will be distributed across the cluster.
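For your multi-line records, a sketch along these lines should both distribute the read and keep each record together (textinputformat.record.delimiter is the knob the new-API TextInputFormat reads; the "\nrecord " delimiter is only a guess based on your sample layout):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\nrecord ")  // assumed record boundary

val records = sc
  .newAPIHadoopFile("hdfs:///data/huge_records.txt",       // illustrative path
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, value) => value.toString }               // one complete multi-line record per element
Reading and parsing then run on as many partitions as there are splits.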
I've been reading a few questions on this topic, and also several forums, and in all of them people seem to say that each resulting .parquet file coming out of Spark should be either 64 MB or 1 GB in size. I still can't make up my mind about which scenarios call for each of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.
My current testing scenario is the following.
dataset
.coalesce(n) # being 'n' 4 or 48 - reasons explained below.
.write
.mode(SaveMode.Append)
.partitionBy(CONSTANTS)
.option("basepath", outputPath)
.parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data that will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing purposes: since I know the size of my test set in advance, I try to get a number as close to 64 MB or 1 GB as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, which should be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
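For example, a sketch of doing that from Spark (assuming an existing SparkSession named spark; the path and the 128 MB value are illustrative):
// S3A reports fs.s3a.block.size as the "block size" of its objects, which splittable
// readers going through the Hadoop input formats use when computing splits
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.block.size", (128 * 1024 * 1024).toString)  // 128 MB instead of the default

// subsequent s3a:// reads pick the setting up from the shared Hadoop configuration
val df = spark.read.parquet("s3a://my-bucket/daily/")       // illustrative path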
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead: scheduling work, starting processing, committing tasks.
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on S3 is very slow.
impossible to ask for larger block sizes than the file length
easier to save if your S3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true).
Personally - and this is opinion, and somewhat benchmark-driven, but not with your queries:
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
on Hadoop 3.1, use the zero-rename committers; otherwise, switch to the v2 commit algorithm.
if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise random).
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
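For what it's worth, a hedged sketch of wiring a few of those switches into a session (property names are as I understand them for Hadoop 2.8+; double-check them against your Hadoop release):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuning-sketch")
  // incremental block uploads while writing (Hadoop 2.8+)
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // random IO for columnar formats (exact key name may differ across releases)
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // fallback when the zero-rename committers are not available
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()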
New Spark user here. I wasn't able to find any information comparing the file size of JSON vs. Parquet output of the same DataFrame via Spark.
Testing with a very small data set for now: doing a df.toJSON().collect() and then writing to disk creates a 15 kB file, but doing a df.write.parquet creates 105 files of around 1.1 kB each. Why is the total file size so much larger with Parquet in this case than with JSON?
Thanks in advance.
What you're doing with df.toJSON().collect() is producing a single JSON document from all your data (15 kB in your case) and saving that to disk - this is not scalable for the situations in which you'd want to use Spark.
For saving Parquet you are using a Spark built-in function, and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did), so you get 105 files. Each of these files carries the overhead of the file structure and probably stores 0, 1, or 2 records. If you want to save a single file, you should coalesce(1) before you save (again, this is just for the toy example you have), so you'd get 1 file. Note that it still might be larger due to the file-format overhead (i.e. the overhead might still be larger than the compression benefit).
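For the toy case that could look like this (Scala sketch; the output path is illustrative):
// one partition in, one parquet part-file out
df.coalesce(1).write.parquet("/tmp/single_file_output")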
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even say how many rows are in your DataFrame). But let me speculate.
First: text files containing JSON usually take more space on disk than Parquet, at least when you store millions to billions of rows. The reason is that Parquet is a highly optimized, column-based storage format that uses binary encoding to store your data.
Second: I would guess that you have a very small DataFrame with 105 partitions (and probably 105 rows). When you store something that small, the disk footprint should not bother you, but if it does, be aware that each Parquet file carries a relatively sizeable amount of metadata describing the data you store.
I would like to process chunks of data (from a CSV file) and then do some analysis within each partition/chunk.
How do I do this and then process these multiple chunks in parallel? I'd like to run map and reduce on each chunk.
I don't think you can read only part of a file. Also, I'm not quite sure whether I understand your intent correctly, or whether you have understood the concept of Spark correctly.
If you read a file and apply map function on the Dataset/RDD, Spark will automatically process the function in parallel on your data.
That is, each worker in your cluster will be assigned to a partition of your data, i.e. will process "n%" of the data. Which data items will be in the same partition is decided by the partitioner. By default, Spark uses a Hash Partitioner.
(As an alternative to map, you can apply mapPartitions.)
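A rough Scala sketch of the per-partition route (the path and the "analysis" inside each partition are placeholders):
val lines = sc.textFile("hdfs:///data/input.csv")         // illustrative path

// every partition (chunk) is handed to a task as a whole and analysed in parallel
val perChunkStats = lines.mapPartitions { iter =>
  val rows = iter.toSeq                                   // all rows of this one chunk
  Iterator((rows.size, rows.map(_.length).sum))           // placeholder analysis: row count, total chars
}

perChunkStats.collect().foreach(println)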
Here are some thoughts that came to my mind:
Partition your data using the partitionBy method and create your own partitioner. This partitioner can, for example, put the first n rows into partition 1, the next n rows into partition 2, and so on (see the sketch after this list).
If your data is small enough to fit on the driver, you can read the whole file, collect it into an array, skip the desired number of rows (in the first run, no rows are skipped), take the next n rows, and then create an RDD again from these rows.
You can preprocess the data, create the partitions somehow (i.e. each containing n% of the data), and then store it again. This will create different files on your disk/HDFS: part-00000, part-00001, etc. Then, in your actual program, you can read just the desired part-files, one after the other...
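A hedged sketch of the custom-partitioner idea from the first bullet (row counts, chunk counts, and the path are illustrative; zipWithIndex supplies the row numbers the partitioner keys on):
import org.apache.spark.Partitioner

// routes rows 0..n-1 to partition 0, the next n rows to partition 1, and so on
class ChunkPartitioner(rowsPerChunk: Long, chunks: Int) extends Partitioner {
  override def numPartitions: Int = chunks
  override def getPartition(key: Any): Int =
    math.min(key.asInstanceOf[Long] / rowsPerChunk, chunks - 1L).toInt
}

val chunked = sc.textFile("hdfs:///data/input.csv")        // illustrative path
  .zipWithIndex()                                          // (line, rowIndex)
  .map(_.swap)                                             // key by row index
  .partitionBy(new ChunkPartitioner(rowsPerChunk = 1000L, chunks = 10))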