How to properly load millions of files into an RDD

I have a very large set of JSON files (>1 million files) that I would like to work on with Spark.
However, I've never tried loading this much data into an RDD before, so I don't know whether it can be done, or whether it even should be done.
What is the correct pattern for dealing with this amount of data within RDD(s) in Spark?

The easiest way is to create a directory, copy all the files into it, and pass that directory as the path when reading the data.
If you use glob patterns in the directory path instead, Spark might run into out-of-memory issues.
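
A minimal sketch of the directory approach described above, assuming PySpark and an already created SparkSession. The function name and the example path are hypothetical; `spark.read.json` is the standard DataFrameReader call and accepts a directory.

```python
def load_json_dir(spark, directory):
    """Read every JSON file under `directory` and return the rows as an RDD.

    `spark` is an existing SparkSession. `directory` is a hypothetical path
    such as "hdfs:///data/all_json". The directory itself is passed, not a
    glob pattern such as ".../*.json".
    """
    df = spark.read.json(directory)  # DataFrameReader.json accepts a directory path
    return df.rdd                    # drop down to an RDD only if you really need one
```

If the schema is known up front, supplying it with `spark.read.schema(...)` before `.json(...)` avoids the schema inference pass over the data.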

Related

Continuous appending of data to an existing tabular data file (CSV, parquet) using PySpark

For a project I need to append, frequently but not periodically, about one thousand or more data files (tabular data) to one existing CSV or parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to filter the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions with different sizes in the resulting CSV/parquet file. Sometimes all the data may be appended in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/parquet files and noticed that a new file was generated for each append, for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file in 'uuid.csv' (which is actually a directory generated by PySpark containing all pieces of appended data).
While doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data in a new file doesn't seem very efficient here.

"NameNode memory overflow"
Then increase the heap size of the NameNode.
"quickly realized that I was generating A LOT of files"
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
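
As a rough sketch of the coalesce/repartition suggestion above: one way to pick the partition count is to divide an estimated data size by a target file size. The helper name, the 128 MiB default and the `uuid.parquet` path in the usage comment are assumptions for illustration, not something from the question.

```python
import math


def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files so that each is roughly `target_file_bytes` in size.

    The 128 MiB default is an assumption; tune it for your cluster.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Usage with PySpark (df and estimated_bytes are assumed to exist already):
#   n = target_partitions(estimated_bytes)
#   df.repartition(n).write.mode("append").parquet(
#       "/user/applepy/pyspark_partition/uuid.parquet")  # hypothetical path
```

Each append still creates at least one new file, so a periodic compaction job that rewrites the directory into fewer files may also be needed.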
"or any other appropriate file format (no DB)"
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non periodically" from Kafka.

Disk read performance - Does splitting 100k+ files into subdirectories help read them faster?

I have 100K+ small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped, e.g. by first letter, into corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that the subdirectories contain 33 times fewer files each? Does it make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (in response to jfriend00's comment): I am not sure what the underlying filesystem will be yet. But let's assume an S3 bucket.
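
For concreteness, the first-letter layout being asked about could be computed as in the sketch below. The function name is hypothetical, and the sketch only shows the path mapping; it makes no claim about whether lookups get faster.

```python
from pathlib import PurePosixPath


def sharded_path(root, filename):
    """Map e.g. 'baaaa.json' under '/json' to '/json/b/baaaa.json'."""
    shard = filename[0].lower()  # group by the first character of the file name
    return str(PurePosixPath(root) / shard / filename)
```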

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?

My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.
E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/. However, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")
but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.

The problem is that your path contains colons (:). Unfortunately, this is still not supported. Here are some related tickets:
https://issues.apache.org/jira/browse/SPARK-20061
https://issues.apache.org/jira/browse/HADOOP-14217
and threads:
Struggling with colon ':' in file names
I think the only way is to rename these files...
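
If renaming is the route taken, the new names have to be derived from the old ones somehow. The sketch below shows one possible mapping that swaps the colons for hyphens; the naming scheme and the function name are assumptions, and the actual rename on S3 (a copy followed by a delete) is not shown.

```python
import re

# Matches directory names like "timestamp=2021-12-12 12:19:27".
_TS_DIR = re.compile(r"^timestamp=(\d{4}-\d{2}-\d{2}) (\d{2}):(\d{2}):(\d{2})$")


def colon_free_name(dir_name):
    """Turn 'timestamp=2021-12-12 12:19:27' into 'timestamp=2021-12-12 12-19-27'."""
    m = _TS_DIR.match(dir_name)
    if m is None:
        raise ValueError(f"unexpected directory name: {dir_name!r}")
    date, hh, mm, ss = m.groups()
    return f"timestamp={date} {hh}-{mm}-{ss}"
```

With names like these, a path pattern for a given hour no longer needs to contain a colon.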

If you want performance...
I humbly suggest that when you do re-architect this, you don't use S3 file lists/directory lists to accomplish it. I suggest you use a Hive table partitioned by hour. (Or write a job that migrates the data into hourly partitions made of larger files rather than small ones.)
S3 is a wonderful engine for cheap long-term storage. It's not performant, and it is particularly bad at directory listing due to how it is implemented. (And performance only gets worse if there are multiple small files in the directories.)
To get some real performance from your job you should use a Hive table (partitioned so the file lookups are done in DynamoDB, and the partition is at the hour level) or some other groomed file structure that reduces the file count/directory listings required.
You will see a large performance boost if you can restructure your data into bigger files without the use of file lists.

Is it possible to retrieve the list of files when a DataFrame is written, or have Spark store it somewhere?

With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?

The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further worry about consistency by using df.coalesce(fileCount) to output a known number of file parts (for Redshift you want a multiple of the slices in your cluster). You can then check how many files are listed in the Spark code and also how many files are loaded in Redshift stl_load_commits.
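
For the do-it-yourself route, once a list of written objects is in hand, building the manifest itself is straightforward. This is a minimal sketch, assuming the standard Redshift manifest layout (an `entries` list of `url`/`mandatory` pairs) and that Redshift wants `s3://` rather than `s3a://` URLs; the function names and object keys are hypothetical.

```python
import json


def to_redshift_url(url):
    """Rewrite an s3a:// URL (Hadoop scheme) to the s3:// form used in manifests."""
    if url.startswith("s3a://"):
        return "s3://" + url[len("s3a://"):]
    return url


def build_manifest(s3_urls):
    """Return the JSON text of a Redshift COPY manifest for the given objects."""
    entries = [{"url": to_redshift_url(u), "mandatory": True} for u in s3_urls]
    return json.dumps({"entries": entries}, indent=2)
```

Marking every entry as mandatory makes the COPY fail if a listed object cannot be found, instead of loading a partial set.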

It's good to be aware of consistency risks; they show up in listings as delayed visibility of newly created objects and as deleted objects still being found.
AFAIK, you can't get a list of the files created, as tasks can generate whatever they want into the task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read & spin for "a bit" to allow for the shards to catch up. I have no good idea of what is a good value of "a bit".
However, if you are calling df.write.csv(), you may have been caught by listing inconsistencies within the committer used to propagate task output to the job dir, as that's done in S3A via list + copy.
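
The "read & spin" idea can be sketched as a polling loop that re-lists until an expected number of files is visible. Everything here is an assumption for illustration: `list_files` is a caller-supplied function that returns the current listing (for example, only the part files), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_files(list_files, expected_count, timeout_s=60.0, interval_s=2.0):
    """Poll `list_files()` until it returns at least `expected_count` entries.

    Returns the listing on success and raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        files = list_files()
        if len(files) >= expected_count:
            return files
        if time.monotonic() >= deadline:
            raise TimeoutError(f"saw {len(files)} of {expected_count} files")
        time.sleep(interval_s)
```

This only helps when the expected count is known in advance, which is what writing a fixed number of parts with coalesce is meant to provide.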

Spark: How to read & write temporary files?

I need to write a Spark app that uses temporary files.
I need to download many many large files, read them with some legacy code, do some processing, delete the file, and write the results to a database.
The files are on S3 and take a long time to download. However, I can do many at once, so I want to download a large number in parallel. The legacy code reads from the file system.
I think I cannot avoid creating temporary files. What are the rules about Spark code reading and writing local files?
This must be a common issue, but I haven't found any threads or docs that talk about it. Can someone give me a pointer?
Many thanks
P
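
The workflow described in the question (fetch, write to a local file, hand the path to legacy code, delete) can be sketched with Python's standard `tempfile` module. This is only an illustration under assumptions: `legacy_read` stands in for the legacy code, the bytes are assumed to be downloaded already, and nothing here is a statement of Spark's own rules for local files.

```python
import os
import tempfile


def process_via_temp_file(data, legacy_read):
    """Write `data` (bytes) to a temp file, run `legacy_read(path)`, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return legacy_read(path)  # legacy code only needs a file system path
    finally:
        os.remove(path)  # clean up even if the legacy code raises
```

Inside a Spark job, a function like this would run on the executors, so the temporary files would live on each worker's local disk rather than on the driver.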
