I need to use Spark to read a huge uncompressed text file (>20 GB) into an RDD. Each record in the file spans multiple lines (<20 lines per record), so I can't use sc.textFile. I'm considering SparkContext.newAPIHadoopFile with a custom delimiter. However, since the file is fairly big, I'm curious whether the reading and parsing will be distributed across multiple Spark executors, or happen on only one node.
File content looks as follows:
record A
content for record A
content for record A
content for record A
record B
content for record B
content for record B
content for record B
...
It depends on your input format and, mostly, on the compression codec. For example, gzip is not splittable, while bzip2 is (Snappy is splittable only when used inside a container format such as SequenceFile or Parquet); an uncompressed text file like yours is splittable.
If the input is splittable, the Hadoop API will take care of it according to its split-size configuration:
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
// for each input file:
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
Then each split will become a partition and will be distributed across the cluster.
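For the original question, here is a minimal PySpark sketch of reading multi-line records with newAPIHadoopFile and a custom record delimiter; the delimiter string, file path, and split size below are illustrative assumptions, not taken from the question.
# Hedged sketch: assumes every record starts on a line beginning with "record ".
conf = {
    "textinputformat.record.delimiter": "\nrecord ",                           # assumed record boundary
    "mapreduce.input.fileinputformat.split.maxsize": str(128 * 1024 * 1024),   # ~128 MB splits (illustrative)
}
rdd = sc.newAPIHadoopFile(
    "/data/huge_records.txt",                                                  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
records = rdd.map(lambda kv: kv[1])       # one element per multi-line record
print(records.getNumPartitions())         # > 1 on a 20 GB uncompressed file, i.e. a distributed read
If the partition count is greater than one, the read and the subsequent parsing are spread across executors.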
Related
For a project I need to frequently, but non-periodically, append about a thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). In the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to even count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the result to a new file doesn't seem very efficient either.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory" by creating new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you mentioned, you want Parquet, so write that instead. That will give you even smaller file sizes in HDFS.
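A minimal sketch of the coalesce idea above, assuming you periodically compact the accumulated part-files; the target of 8 output files and the output path are illustrative, not from the question.
df = spark.read.csv('/user/applepy/pyspark_partition/uuid.csv', header=True)
(df.coalesce(8)                       # merge partitions so at most 8 files are written
   .write.mode('overwrite')
   .parquet('/user/applepy/pyspark_partition/uuid_compacted.parquet'))   # hypothetical compacted dataset
Running a compaction like this on a schedule keeps the number of part-files, and therefore the NameNode's object count, bounded.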
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.
This may be a silly question, but I'm not able to understand how the files are split across partitions.
My requirement is to read 10,000 binary files (persisted Bloom filter files) from an HDFS location and process each binary file separately by converting its data to a ByteArrayInputStream. The point to note is that these files are persisted Bloom filters and must be read sequentially from the start of the file to the end, then converted to a byte array; this byte array will be used to reconstruct the BloomFilter object.
JavaPairRDD<String, PortableDataStream> rdd = sparkContext.binaryFiles(commaSeparatedfilePaths);
rdd.map(new Function<Tuple2<String, PortableDataStream>, BloomCheckResponse>() { ... });
Here in the code, v1._1 is the file path and v1._2 is the PortableDataStream, which will be converted to a ByteArrayInputStream.
Each binary file is 34 MB.
Now the question is: will there ever be a situation where part of a file ends up in one partition and the rest in a different one? Or will I always get the full content of a file mapped to a single partition, never split?
Executor memory = 4 GB, cores = 2, and there are 180 executors.
Basically the expectation is that each file should be read as-is from start to end, without being split.
Each (file, stream) is guaranteed to provide full content of the file in the stream. There is no case where data will be divided between multiple pairs, not to mention multiple partitions.
You're safe to use it for your intended scenario.
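For completeness, a PySpark sketch of the same pattern; the path and the rebuild_bloom_filter helper are hypothetical stand-ins for your paths and Bloom-filter deserialization. binaryFiles hands each element the whole file content, so no file is ever split.
import io
rdd = sc.binaryFiles('/data/bloom_filters')                    # (path, full file bytes) pairs, one per file
filters = rdd.mapValues(lambda raw: rebuild_bloom_filter(io.BytesIO(raw)))   # hypothetical deserializer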
On the official Apache website, this is the explanation of this parameter (spark.sql.parquet.mergeSchema):
When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available.
In fact, my question is, what is the summary file?
Apache Parquet uses metadata to store all the information required to load the data from a file, like column metadata, dictionaries, row groups and so on.
The format is designed to keep this metadata embedded in the file itself, or stored in a separate file. That separate file is the summary file.
A Parquet summary file contains a collection of footers from the actual Parquet data files in a directory. It can be used to skip row groups when reading without fetching the footer from each individual Parquet file, which may be expensive if you have a lot of files and/or are on blob stores.
https://github.com/apache/parquet-mr/blob/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L407
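A short sketch of how the setting is typically used at read time; the dataset path is illustrative. When parquet-mr writes summary files, they are conventionally named _metadata and _common_metadata in the dataset directory.
# Either set the session-wide config or pass the per-read option.
spark.conf.set('spark.sql.parquet.mergeSchema', 'true')
df = spark.read.option('mergeSchema', 'true').parquet('/data/events_parquet')   # hypothetical path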
The Parquet storage format is a column-oriented file format, which means the data for a particular column across all rows is stored adjacently. This results in two main benefits: a better compression ratio and increased query performance.
New Spark user here. I wasn't able to find any information about file-size comparisons between JSON and Parquet output of the same DataFrame via Spark.
Testing with a very small data set for now: doing a df.toJSON().collect() and then writing to disk creates a 15 KB file, but doing df.write.parquet creates 105 files of around 1.1 KB each. Why is the total file size so much larger with Parquet in this case than with JSON?
Thanks in advance.
What you're doing with df.toJSON().collect() is getting a single JSON document from all your data (15 KB in your case) and saving that to disk - this is not scalable for the situations you'd want to use Spark in anyway.
For saving Parquet you are using the Spark built-in writer, and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did), so you get 105 files. Each of these files has the overhead of the file structure and probably stores 0, 1 or 2 records. If you want to save a single file, you should coalesce(1) before you save (again, this is just for the toy example you have), so you'd get 1 file. Note that it still might be larger due to the file-format overhead (i.e. the overhead might still be larger than the compression benefit).
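A one-line sketch of the coalesce(1) suggestion above for the toy example; the output path is illustrative.
df.coalesce(1).write.mode('overwrite').parquet('/tmp/toy_single_file.parquet')   # one partition -> one part-file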
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even tell us the number of rows in your DataFrame). But let me speculate.
First: text files containing JSON usually take more space on disk than Parquet, at least when one stores millions to billions of rows. The reason is that Parquet is a highly optimized, column-based storage format which uses binary encoding to store your data.
Second: I would guess that you have a very small DataFrame with 105 partitions (and probably about 105 rows). When you store something that small, the disk footprint should not bother you, but if it does, you need to be aware that each Parquet file has a relatively sizeable header describing the data you store.
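To check the guess about 105 partitions, you can inspect the DataFrame directly; each partition becomes one output file when writing.
print(df.rdd.getNumPartitions())   # number of partitions == number of part-files written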
In the EXTRACT documentation there's the (awesome) auto-magic support for gzipped files (which we are using).
But should I assume it won't use more than one AU? If I understand correctly, the files need to be "splittable" to be spread across AUs.
Or will it be split across AUs once extracted on the fly, and/or do gzipped files have an index to indicate where they can be split somehow?
Or perhaps I'm muddling the vertex concept with AUs?
This is a good question :).
In general, if the file format is splittable (e.g., basically row-oriented, with rows smaller than the row-size limit, which is currently 4 MB), then large files will be split into about 1 GB per vertex.
However, GZip itself is not a splittable format. Thus we cannot split a GZip file during decompression, and we end up not splitting the processing of the decompressed file either (the current framework does not provide this). As a consequence, we limit the size of a GZip file to 4 GB. If you want to scale out with GZip files, we recommend splitting the data into several GZip files and then using file sets to scale out the processing.