Is there a limit to how many CSV files PySpark can read? - Azure

I am relatively new to Spark/PySpark, so any help is much appreciated.
Currently we have files being delivered to an Azure data lake hourly into a single directory, for example:
hour1.csv
hour2.csv
hour3.csv
I am using Databricks to read the files in that directory using the code below:
sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)
Each of the CSV files is about 5 KB, and all have the same schema.
What I am unsure about is how scalable spark.read is. We are currently processing about 2,000 of these small files, and I am worried that there is a limit on the number of files that can be processed. Is there a limit, such as a maximum of 5,000 files, beyond which the code above breaks?
From what I have read online, I believe data size is not an issue with this method; Spark can read petabytes worth of data (comparatively, our total data size is still very small). But I have found no mention of a limit on the number of files it can process - correct me if I am wrong.
Any explanation is very much appreciated.
Thank you.

The limit is your driver's memory.
When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing to executors, but it collects the results either way).
After having the list of files, it creates tasks for the executors to run.
With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
You can always increase the driver's memory to manage it, or add a preprocessing step that merges the files (GCS, for example, has gsutil compose, which can merge files without downloading them).
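For illustration, here is a minimal sketch (not from the original post) of the read pattern with the format string quoted and extra driver headroom configured up front; the schema fields and storage path are placeholders. Note that on Databricks the driver size normally comes from the cluster configuration rather than the session builder.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Placeholder schema and path - substitute your own.
schema = StructType([
    StructField("event_id", IntegerType(), True),
    StructField("payload", StringType(), True),
])
file_location = "abfss://container@account.dfs.core.windows.net/hourly/"

# On a self-managed setup you can give the driver more headroom for the file
# listing at session creation (before any session/JVM exists); on Databricks
# the driver size is set on the cluster instead.
spark = (
    SparkSession.builder
    .appName("hourly-csv-ingest")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

sparkdf = (
    spark.read.format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location)
)
```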

Related

Continuous appending of data to an existing tabular data file (CSV, Parquet) using PySpark

For a project I need to frequently, but non-periodically, append about one thousand or more tabular data files to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the result to a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory" by creating new files, yes.
From Spark, you can use coalesce and repartition to create larger write batches, as shown in the sketch below.
As you mentioned, you want Parquet, so write that instead; its columnar encoding and compression will give you even smaller file sizes in HDFS.
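As a minimal sketch of the coalesce idea (assuming df is the batch of rows to append; the output path and the partition count 4 are illustrative placeholders in the spirit of the question's example):

```python
# Coalesce the batch before appending so each ingest run writes a few larger
# part-files instead of one file per task. The value 4 is illustrative only.
(df.coalesce(4)
   .write
   .mode("append")
   .parquet("/user/applepy/pyspark_partition/uuid.parquet"))
```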
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest/ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

What is openCostInBytes in Spark?

What does the property spark.sql.files.openCostInBytes do?
This is the official documentation definition:
The estimated cost to open a file, measured by the number of bytes could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over-estimated, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first). This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
But I still don't get it. Can anyone explain with a small example why and where it is useful?
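Not an answer, but a small sketch of where the setting comes into play (the path and the 8 MB value are illustrative assumptions): when Spark packs files into read partitions, every file is charged at least openCostInBytes, so with the default spark.sql.files.maxPartitionBytes of 128 MB and an open cost of 8 MB, at most roughly 16 tiny files end up in a single read partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-cost-demo").getOrCreate()

# Default is 4 MB; raising it means small files are packed less densely into
# each read partition (more partitions, hence more parallelism over many tiny files).
spark.conf.set("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))

df = spark.read.parquet("/data/many_small_files/")  # placeholder path
print(df.rdd.getNumPartitions())
```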

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?

My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.
E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/; however, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")
but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.
The problem is that your path contains colons (:). Unfortunately, this is still not supported. Here are some related tickets:
https://issues.apache.org/jira/browse/SPARK-20061
https://issues.apache.org/jira/browse/HADOOP-14217
and threads:
Struggling with colon ':' in file names
I think the only way is to rename these files...
If you want performance.....
I humbly suggest that when you re-architect this, you don't use S3 file/directory listings to accomplish it. I suggest you use a Hive table partitioned by hour. (Or write a job that migrates the data into hourly, larger files rather than many small ones.)
S3 is a wonderful engine for long-term, cheap storage. It is not performant, and it is particularly bad at directory listing because of how listing is implemented. (And performance only gets worse when there are many small files in the directories.)
To get real performance from your job you should use a Hive table (partitioned so the file lookups are done in DynamoDB, with the partition at the hour level) or some other groomed file structure that reduces the file count and the directory listings required.
You will see a large performance boost if you can restructure your data into bigger files without the use of file listings, for example along the lines of the sketch below.
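As a hypothetical sketch of such a groomed layout (df, the event_time column, and the bucket/prefix names are placeholders, not from the question): write the data partitioned by date and hour, so reading one hour touches a single prefix and no colons appear in the path.

```python
from pyspark.sql import functions as F

# Write partitioned by date and hour instead of a single timestamp=... prefix.
(df.withColumn("dt", F.to_date("event_time"))
   .withColumn("hour", F.hour("event_time"))
   .write
   .mode("append")
   .partitionBy("dt", "hour")
   .parquet("s3://my-bucket/events/"))

# Reading a given hour is then a single, colon-free prefix.
hour_df = spark.read.parquet("s3://my-bucket/events/dt=2021-12-12/hour=12/")
```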

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid df.coalesce(1), but when saving to CSV or Parquet we (my team) add df.coalesce(1). Is this common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate only one file, for example to import a CSV file into Excel, or a Parquet file into a pandas-based program. But if you're doing .coalesce(1), then the write happens via a single task, and that becomes the performance bottleneck because you need to pull data from the other executors and write it.
If you're consuming the data from Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noted, this may generate a big number of files, which is often not desirable because you'll pay a performance overhead for accessing them. So we need a balance between the number of files and their size - for Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages: you read fewer files, you get better compression for the data inside each file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view - each node writes its data in parallel - but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as @Kafels correctly pointed out, there are some optimizations that let you remove that .coalesce(N) and do automatic tuning to achieve the best throughput (so-called "Optimized Writes") and create bigger files ("Auto Compaction") - but they should be used carefully.
Overall, the optimal file size for Delta is an interesting topic - if you have big files (1 GB is the default used by the OPTIMIZE command), you can get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, then big files are bad from a performance standpoint, and it's better to have smaller files (16-64-128 MB) so you rewrite less data.
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10 MB and you have 1,000 partitions, for example, each file would be about 10 KB. Having so many small files reduces Spark performance dramatically, not to mention that with too many files you will eventually hit the OS limit on the number of files. Anyhow, when your dataset is small enough, you should merge it into a couple of files with coalesce.
However, if your dataframe is 100 GB, technically you can still use coalesce(1) and save to a single file, but later on you will have to deal with less parallelism when reading from it.
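For illustration, the two patterns side by side (paths are placeholders, and the Delta write assumes the Delta Lake libraries are available, e.g. on Databricks):

```python
# Single-file export: one task writes everything - fine for a small dataframe
# that will be opened in Excel or pandas, a bottleneck for anything large.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("/tmp/export_single"))

# Default behaviour: one part-file per partition, written and read in parallel.
(df.write.mode("overwrite")
   .format("delta")
   .save("/tmp/events_delta"))
```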

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and they all seem to say that each of the resulting .parquet files coming out of Spark should be either 64 MB or 1 GB in size. I still can't work out which scenarios call for which of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below.
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basePath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data that will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing purposes: since I know the size of my test set in advance, I try to get a number as close to 64 MB or 1 GB as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need before saving.
So my question here is...
Should I pay that much attention to file size if I'm not planning to use HDFS and am merely storing and retrieving data from S3?
And also, what would be the optimal size for daily datasets of around 10 GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your S3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true.)
Personally, and this is opinion and somewhat benchmark driven, but not with your queries:
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2
if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: set spark.hadoop.fs.s3a.experimental.fadvise to random)
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
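A minimal sketch of wiring the s3a settings mentioned above into a session (assuming a Spark build with the hadoop-aws connector on the classpath; the values are illustrative starting points and the paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-parquet-tuning")
    .config("spark.hadoop.fs.s3a.block.size", str(128 * 1024 * 1024))  # split size seen by readers
    .config("spark.hadoop.fs.s3a.fast.upload", "true")                 # incremental block uploads
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")      # random IO for columnar reads
    .getOrCreate()
)

# Fewer, larger parquet files via an explicit repartition before the write.
df = spark.read.parquet("s3a://my-bucket/raw/")
df.repartition(48).write.mode("append").parquet("s3a://my-bucket/curated/")
```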
