Outputting many CSV files, and combining them into one without a performance impact, when transforming data using mapping data flows in Azure Data Factory

I followed the example below, and all is going well.
https://learn.microsoft.com/en-gb/azure/data-factory/tutorial-data-flow
The tutorial says the following about the output files and rows:
If you followed this tutorial correctly, you should have written 83
rows and 2 columns into your sink folder.
My run produced the same number of rows and columns, so the result is correct.
Below is the output. Note that the total number of files is 77 - not 83, and not 1.
Question: Is it correct to have so many CSV files (77 items)?
Question: How can I combine all the files into one without slowing down the process?
I can create a single file by following the link below, but it warns that doing so slows down the process.
How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?

The number of files generated from the process is dependent upon a number of factors. If you've set the default partitioning in the optimize tab on your sink, that will tell ADF to use Spark's current partitioning mode, which will be based on the number of cores available on the worker nodes. So the number of files will vary based upon how your data is distributed across the workers. You can manually set the number of partitions in the sink's optimize tab. Or, if you wish to name a single output file, you can do that, but it will result in Spark coalescing to a single partition, which is why you see that warning. You may find it takes a little longer to write that file because Spark has to coalesce existing partitions. But that is the nature of a big data distributed processing cluster.
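Since mapping data flows run on Spark, the trade-off the sink options expose can be illustrated with a small PySpark sketch. Given a dataframe df, the paths and the partition count below are illustrative only, not taken from the tutorial:
# hypothetical output locations - adjust to your storage
base = "/mnt/output/moviesDB"

# default partitioning: one file per Spark partition, so the file count depends on the cluster
df.write.mode("overwrite").csv(base + "/default")

# manually set the number of partitions: predictable file count, still written in parallel
df.repartition(20).write.mode("overwrite").csv(base + "/twenty_files")

# single named output: coalesce to one partition, so a single task writes everything (slower)
df.coalesce(1).write.mode("overwrite").csv(base + "/single_file")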

Related

Continuous appending of data to an existing tabular data file (CSV, Parquet) using PySpark

For a project I need to frequently, but non-periodically, append about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file, or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that each append generated a new file - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data to a new file does not seem very efficient here.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
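A minimal PySpark sketch of that idea, assuming the schema of the dummy files above (the paths and the partition count are illustrative, not prescriptive):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# schema of the incoming dummy files shown above
schema = StructType([
    StructField("id", IntegerType()),
    StructField("uuid", StringType()),
    StructField("price", IntegerType()),
])

# read the whole batch of small incoming files at once
batch_df = spark.read.option("header", True).schema(schema).csv("/user/applepy/incoming/*.csv")

# repartition (or coalesce) into a few large writer batches before appending as Parquet
(batch_df
    .repartition(4)          # illustrative value; size it so files land near the HDFS block size
    .write
    .mode("append")
    .parquet("/user/applepy/pyspark_partition/uuid_parquet"))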
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

Best option for storage in spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently there are around 55 million records. This will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them, and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, and filter it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thoughts are that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job that will be deployed onto the same EMR cluster; it will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third-party database table dump is only the initial goal. Later on there will quite possibly be tens or even hundreds of other sources that I would need to match against, which means trillions of records to be blocked, and those blocks need to be stored somewhere. Would this option still be viable at that stage?
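As an aside, the daily "added or updated since the previous day" step described above can be sketched in PySpark by comparing two snapshots; the paths and dates below are hypothetical:
# hypothetical snapshot locations in S3
today_df = spark.read.parquet("s3://vendor-dumps/authors/2024-01-02/")
yesterday_df = spark.read.parquet("s3://vendor-dumps/authors/2024-01-01/")

# rows that are new or whose values changed since the previous snapshot
changed_df = today_df.exceptAll(yesterday_df)

# continue normalisation / blocking on the much smaller delta and persist the blocks
changed_df.write.mode("overwrite").parquet("hdfs:///authors/delta/2024-01-02/")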

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blog posts suggesting that:
df.repartition('category').write.partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write.partitionBy('category')
df.repartition(2, 'category').write.partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write.partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count of df is expensive if it is not cached (and/or so big that it would be wasteful to cache just for this purpose), and any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and actual data (in case it is skewed). To my knowledge, I also agree that df.repartition('category').write().partitionBy('category') will not help solving your problem.
We faced a similar problem in our application but instead of doing first a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure and simply reads the data per folder/partition, coalesces its data to one partition and overwrites them back. Again, I can not tell if that is the fastest way also to your environment, but for us it did the trick.
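A rough sketch of that second compaction job, assuming the first job wrote with partitionBy('category') and that the compacted copy goes to a separate location (all names below are illustrative):
base_path = "/data/blocked"              # output of the first (write-optimized) job
out_path = "/data/blocked_compacted"     # target of the compaction job

# partition values are discovered from the folder structure written by the first job
categories = [r.category for r in
              spark.read.parquet(base_path).select("category").distinct().collect()]

for c in categories:
    (spark.read.parquet(base_path + "/category=" + str(c))
          .coalesce(1)                                   # exactly one file per category
          .write.mode("overwrite")
          .parquet(out_path + "/category=" + str(c)))

Writing to a separate location avoids reading and overwriting the same folder within one job; overwriting in place, as described above, needs a bit more care.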
Doing some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta table. Here they use a similar approach: first write the data, then run a separate OPTIMIZE job to aggregate the files into a single file. In the linked page you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: Make sure to keep the configuration spark.sql.files.maxRecordsPerFile to 0 (default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
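For example, a small snippet to keep that limit disabled at session level (0 is the default and means no per-file record limit):
# make sure the per-file record limit does not split the per-category files behind your back
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)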
You can try coalesce(n); coalesce decreases the number of partitions and is an optimized version of repartition for that purpose.
n = the number of output partitions you want.

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid 'df.coalesce(1)', but when saving to CSV or Parquet we (my team) add 'df.coalesce(1)'. Is it a common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate only one file, for example to import a CSV file into Excel, or a Parquet file into a Pandas-based program. But if you do .coalesce(1), then the write happens via a single task, and it becomes a performance bottleneck because you need to pull the data from the other executors and write it.
If you're consuming data from Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noticed, this may generate a large number of files, which is often undesirable because of the overhead of accessing them. So we need a balance between the number of files and their size - for Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages: you read fewer files, you get better compression for the data inside the file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view, since each node writes its data in parallel, but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as was correctly pointed out by @Kafels, there are optimizations that allow you to remove that .coalesce(N) and do automatic tuning to achieve the best throughput (so-called "Optimized Writes") and create bigger files ("Auto Compaction") - but they should be used carefully.
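On Databricks these features can be switched on per table, roughly like this (the table name is illustrative; check the Databricks documentation for the exact properties supported by your runtime):
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")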
Overall, the optimal file size for Delta is an interesting topic - if you have big files (1 GB is used by default by the OPTIMIZE command), you can get better read throughput, but if you are rewriting them with MERGE/UPDATE/DELETE, then big files are bad from a performance standpoint and it's better to have smaller files (16-128 MB), so you rewrite less data.
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10 MB and you have 1000 partitions, for example, each file would be about 10 KB. Having so many small files reduces Spark performance dramatically, not to mention that when you have too many files you will eventually hit the OS limit on the number of files. In any case, when your dataset is small enough, you should merge it into a couple of files with coalesce.
However, if your dataframe is 100 GB, technically you can still use coalesce(1) and save to a single file, but later on you will have to deal with less parallelism when reading from it.
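A small sketch of that rule of thumb, given placeholder dataframes small_df and large_df (the file counts and paths are illustrative, not fixed rules):
# small dataframe: merge into a handful of files instead of hundreds of tiny ones
small_df.coalesce(4).write.mode("overwrite").parquet("/mnt/output/small_table")

# large dataframe: keep the default partitioning so reads and writes stay parallel
large_df.write.mode("overwrite").parquet("/mnt/output/large_table")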

Databricks Spark CREATE TABLE takes forever for 1 million small XML files

I have a set of 1 million XML files, each of size ~14KB in Azure Blob Storage, mounted in Azure Databricks, and I am trying to use CREATE TABLE, with the expectation of one record for each file.
The Experiment
The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.
<OBSERVATION>
<HEADER>...</HEADER>
<RESULT>
<ID>...</ID>
<VALUES>...</VALUES>
</RESULT>
</OBSERVATION>
For parsing/deserialization, I am using spark-xml by Databricks. At this moment, I am expecting records having two columns HEADER and RESULT, which is what I am getting.
CREATE TABLE Observations
USING XML
OPTIONS (
path "/mnt/blobstorage/records/*.xml",
rowTag "RESULT",
rootTag "OBSERVATION",
excludeAttribute True
)
The Problem
The CREATE TABLE statement runs for 5.5 hours (a SQL query having name sql at SQLDriverLocal.scala:87 in the Spark UI) out of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).
I have noticed that the cell with the CREATE TABLE command remains stuck at "Listing files at /mnt/blobstorage/records/*.xml" for most of the time. At first I thought it was a scaling problem in the storage connector. However, I can run the command on ~500K JSON files of similar size in ~25 s (a problem with XML vs JSON?).
I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To eliminate this possibility, I tried to:
predefine a schema (from only the first XML file)
ingest as plaintext without parsing (using the TEXT provider).
The same problem persists in both cases.
The same statement runs within 20s for 10K records, and in 30 mins for 200K records. With linear scaling (which is obviously not happening), 1 million records would have been done in ~33 minutes.
My Databricks cluster has 1 worker node and 3 driver nodes, each having 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have successfully reproduced the issue in multiple runs over 4 days.
The Question
What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?
My guess is that you are running into a small-files problem, as you are processing only about 15 GB. I would merge the small files into bigger files of roughly 250 MB each.
As your dataset is still small, you could do this on the driver. The following code shows a merge on the driver node (without considering optimal file size):
1. Copy the files from Blob to local file-system and generate a script for file merge:
# copy files from mounted storage to driver local storage
dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)
unzipdir = 'temp/records/'
gzipdir = 'temp/gzip/'
# make sure the target directory for the merged file exists
dbutils.fs.mkdirs("file:/databricks/driver/" + gzipdir)
# generate a shell script (concatenate all XMLs, then gzip the result) and write it to the local filesystem
script = "cat " + unzipdir + "*.xml > " + gzipdir + "all.xml\ngzip " + gzipdir + "all.xml"
dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)
2. Run the shell script
%sh
sudo sh ./scripts/makeone.sh
3. Copy the files back to the mounted storage
dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/mnt/blobstorage/recordsopt/", recurse=True)
Another important point is that the spark-xml library takes a two-step approach:
It parses the data to infer the schema. If the parameter samplingRatio is not changed, it does this for the whole dataset. Often it is enough to do this only for a smaller sample, or you can predefine the schema (use the parameter schema for this); then you don't need this step.
Reading the data.
Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run on a column-based format rather than directly on the XMLs, and using the spark-xml library only for this preprocessing step.
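A hedged sketch of that preprocessing step with spark-xml, using a predefined schema so the inference pass is skipped (field types and paths are assumptions based on the structure shown above):
from pyspark.sql.types import StructType, StructField, StringType

# schema predefined from the known <RESULT> structure, so spark-xml does not scan all files to infer it
result_schema = StructType([
    StructField("ID", StringType()),
    StructField("VALUES", StringType()),
])

observations = (spark.read
    .format("xml")                        # the spark-xml data source by Databricks
    .option("rowTag", "RESULT")
    .schema(result_schema)
    .load("/mnt/blobstorage/recordsopt/"))

# store as Parquet and run the heavier queries against the columnar copy
observations.write.mode("overwrite").parquet("/mnt/blobstorage/observations_parquet/")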
