Create an RDD of FileMetadata - apache-spark

We have files that follow the naming convention below. Each file is a few KB in size, and we have millions of them on NFS.
"XXXXXXXXXX..YYMMDD.HHMMSS.NNNN.tarbz2
We want to load only the last 5 files per month per "XXXXXXXXXX".
We can make filesystem calls to get the file names and pass a filtered set of files to sc.binaryFiles. But this seems like a hack and may not work once we move to HDFS!
Is there a better way of achieving this use case in Spark?
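For illustration, the selection step itself ("last 5 files per month per prefix") could be sketched in plain Python as below. This is only a sketch under assumptions: the regular expression is my reading of the naming convention above (one or more dots after the prefix, then YYMMDD, HHMMSS, NNNN and the tarbz2 extension), and last_n_per_month is a hypothetical helper name. How the list of file names is obtained (NFS or HDFS listing) is left open.

```python
import re
from collections import defaultdict

# Assumed layout: <key>.<YYMMDD>.<HHMMSS>.<NNNN>.tarbz2, with one or more dots
# after the key (the example in the question shows two).
NAME_RE = re.compile(
    r"^(?P<key>.+?)\.+(?P<date>\d{6})\.(?P<time>\d{6})\.(?P<seq>\d{4})\.tarbz2$"
)


def last_n_per_month(filenames, n=5):
    """Return the n most recent files for each (key, YYMM) group."""
    groups = defaultdict(list)
    for name in filenames:
        # Match on the base name so full paths can be passed in as well.
        m = NAME_RE.match(name.rsplit("/", 1)[-1])
        if m is None:
            continue  # skip names that do not follow the convention
        sort_key = (m["date"], m["time"], m["seq"])
        groups[(m["key"], m["date"][:4])].append((sort_key, name))

    selected = []
    for group_key in sorted(groups):
        entries = sorted(groups[group_key])  # oldest first within the group
        selected.extend(name for _, name in entries[-n:])
    return selected
```

The resulting list could then be handed to sc.binaryFiles, which accepts comma-separated paths, although with very many files it is probably better to pass it in batches.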

Related

Continuous appending of data to an existing tabular data file (CSV, parquet) using PySpark

For a project I need to append, frequently but not periodically, about one thousand or more data files (tabular data) to one existing CSV or parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100000.
On user request, all input files copied in a source folder should be ingested by an ETL pipeline and appended at the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/parquet file. Sometimes all the data may be appended in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/parquet files and noticed that each append generated a new file. For example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file in the file 'uuid.csv' (which is actually a directory generated by pyspark containing all pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data in a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
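As a rough sketch of the coalesce suggestion (not a definitive recipe), something like the following could reduce the number of part files per append. The helper names target_file_count and append_compacted, the 128 MB target size, and the idea that the caller supplies its own size estimate are all assumptions for illustration; df is assumed to be a PySpark DataFrame.

```python
import math


def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files needed so each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def append_compacted(df, path, total_bytes):
    """Append df to a Parquet directory with a bounded number of part files.

    df is assumed to be a pyspark.sql.DataFrame; total_bytes is the caller's
    own estimate of the batch size (hypothetical, not provided by Spark here).
    """
    num_files = target_file_count(total_bytes)
    # coalesce() merges existing partitions without a full shuffle, so it can
    # lower the number of part files written but never raise it.
    df.coalesce(num_files).write.mode("append").parquet(path)
    return num_files
```

Each append still adds at least one new file to the directory, so a periodic compaction job that rewrites the directory into fewer files may still be needed.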
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non periodically" from Kafka.

Disk read performance - Does splitting 100k+ files into subdirectories help read them faster?

I have 100Ks+ of small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/filesystem retrieve them faster, if the files would be grouped by e.g. first letter, in corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that the subdirectories contain 33 times fewer files each? Does it make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (in response to jfriend00's comment): I am not sure what the underlying filesystem will be yet. But let's assume an S3 bucket.
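To make the "pyramid" layout concrete, the mapping from a file name to its first-letter subdirectory could look like the sketch below. The function name sharded_path is hypothetical, and the sketch only shows how the path is built; it says nothing about whether lookups actually become faster.

```python
import posixpath


def sharded_path(root, filename):
    """Map a file name to root/<first character>/<file name>."""
    shard = filename[0].lower()  # assumes a non-empty file name
    return posixpath.join(root, shard, filename)
```

For example, sharded_path("/json", "baaaa.json") gives "/json/b/baaaa.json".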

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?

My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.
E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/. However, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")
but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.
The problem is that your path contains colons (:). Unfortunately, colons in paths are still not supported. Here are some related tickets:
https://issues.apache.org/jira/browse/SPARK-20061
https://issues.apache.org/jira/browse/HADOOP-14217
and threads:
Struggling with colon ':' in file names
I think the only way is to rename these files...
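If renaming is the route taken, the new directory name could be derived along these lines. This is a sketch only: the target format (a T separator and dashes instead of colons) is an assumption chosen for illustration, and colon_free_partition is a hypothetical helper. The actual rename or copy in S3 is not shown.

```python
from datetime import datetime


def colon_free_partition(dirname):
    """Rewrite 'timestamp=YYYY-mm-dd HH:MM:SS' without colons or spaces."""
    key, _, value = dirname.rstrip("/").partition("=")
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return f"{key}={ts.strftime('%Y-%m-%dT%H-%M-%S')}"
```

With names in that form, a pattern for one hour such as s3://my-bucket/timestamp=2021-12-12T12-*/ would no longer contain a colon, which should avoid the URISyntaxException described above.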
If you want performance.....
I humbly suggest that when you do re-architect this you don't use S3 file lists/directory lists to accomplish this. I suggest you use a Hive table partitioned by hour. (Or you write a job to help migrate data into hours in larger files not small files.)
S3 is a wonderful engine for long term cheap storage. It's not performant, and it is particularly bad at directory listing due to how they implemented it. (And performance only gets worse if there are multiple small files in the directories).
To get some real performance from your job you should use a Hive table (partitioned so that the file lookups are done in DynamoDB, with the partition at the hour level) or some other groomed file structure that reduces the file count and the directory listings required.
You will see a large performance boost if you can restructure your data into bigger files without use of file lists.

How to get the number of partitions written by a DataFrameWriter

Let's assume we have the following code in Spark:
dataset.write.partitionBy("c1", "c2", "c3").parquet("myDir")
I have seen a couple of threads on SO explaining how to get the number of files or records written after the parquet method completes. However, what I would like to access is the name of the partitioning directories created, i.e. the number of directories myDir/c1=XX/c2=YY/c3=ZZ where XX, YY and ZZ are domain-related values.
One reason I need these directory names is to perform data integrity checks after an ETL process: I need to know which directories have been created during the ETL (say 3-4 directories for my use case) among thousands of them.
Does anyone know if there is a way to retrieve this information (at the Spark API level)?
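One possible workaround, sketched here under assumptions, is to compute the distinct partition values of the batch before writing and derive the directory names from them. This is not a Spark API that reports what the writer created. The helper partition_dirs is hypothetical, and it does not reproduce Spark's escaping of special characters in partition values. The rows could come from something like dataset.select("c1", "c2", "c3").distinct().collect().

```python
import posixpath

# Directory name Spark uses when a partition column value is null.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def partition_dirs(base, columns, rows):
    """Build the col=value directory path for each distinct row of values."""
    dirs = set()
    for row in rows:
        parts = [
            f"{col}={HIVE_NULL if val is None else val}"
            for col, val in zip(columns, row)
        ]
        dirs.add(posixpath.join(base, *parts))
    return sorted(dirs)
```

The extra distinct() pass costs one more job over the data, which may or may not be acceptable for a batch that touches only 3-4 directories.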

Most efficient way to load many files in spark in parallel?

[Disclaimer: While this question is somewhat specific, I think it circles a very generic issue with Hadoop/Spark.]
I need to process a large dataset (~14TB) in Spark. Not doing aggregations, mostly filtering. Given ~30k files (250 part files, per month for 10 years, each part being ~200MB), I would like to load them into an RDD/DataFrame and filter out items based on some arbitrary filters.
To make the listing of the files efficient (I'm on google dataproc/cloud storage, so the driver doing a wildcard glob was very serial and very slow), I precalculate an RDD of the file names, then load them into an RDD (I'm using avro, but file type shouldn't be relevant), e.g.
# read the precomputed list of file paths back to the driver as a Python list
files = sc.textFile('/list/of/files/').collect()
# load all of the listed files into a single DataFrame
documents = sqlContext.read.format('com.databricks.spark.avro').load(files)
When I do this, even on a 50-worker cluster, it seems that only one executor is doing the work of reading the files. I've experimented with broadcasting the files list and read a dozen different approaches but I can't seem to crack the issue.
So, is there an efficient way to create a very large dataframe from multiple files? How do I best take advantage of all the potential computing power when creating this RDD?
This approach works very well on smaller sets but, at this size, I see a large number of symptoms like long-running processes with no feedback. Is there some treasure trove of knowledge -- besides @zero323 :-) -- on optimizing spark at this scale?
Listing 30k files shouldn't be an issue for GCS: even if each GCS list request, which lists up to 500 files at a time, takes 1 second, all 30k files will be listed in a minute or so. There could be some corner cases with some glob patterns that make it slow, but there were recent optimizations in the GCS connector's globbing implementation that could help.
That's why it should be good enough for you to just rely on the default Spark API with globbing:
import com.databricks.spark.avro._  // provides the .avro method on the reader
val df = sqlContext.read.avro("gs://<BUCKET>/path/to/files/")
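The "a minute or so" estimate in this answer can be checked with a little arithmetic. The sketch below simply assumes sequential list requests with the page size and latency quoted above; listing_seconds is a hypothetical helper, not part of any GCS client.

```python
import math


def listing_seconds(num_files, page_size=500, seconds_per_request=1.0):
    """Estimated time to list num_files with sequential paged list requests."""
    requests = math.ceil(num_files / page_size)
    return requests * seconds_per_request
```

With the numbers from the answer, listing_seconds(30_000) comes to 60 requests at 1 second each, i.e. about 60 seconds.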
