How to handle CSV files in the Bronze layer without the extra layer - databricks

If my raw data is in CSV format and I would like to store it in the Bronze layer as Delta tables, then I would end up with four layers: Raw + Bronze + Silver + Gold. Which approach should I consider?

A bit of an open question, but with respect to retaining the "raw" data in CSV: I would normally recommend keeping it, since storing these data is usually cheap relative to the value of being able to re-process when there are problems, or for data audit/traceability purposes.
I would normally compress the raw files after processing, perhaps tar-balling them together, and then move them to colder/cheaper storage.
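For example, a minimal sketch of that compress-and-archive step, assuming the raw CSVs sit under a mounted path (the directory names are purely illustrative):
import tarfile
from pathlib import Path
# Hypothetical landing and archive locations - adjust to your own layout.
landing_dir = Path("/mnt/raw/landing/2024-01-15")
archive_path = Path("/mnt/raw/archive/2024-01-15.tar.gz")
# Bundle and gzip-compress every CSV that has already been processed.
with tarfile.open(archive_path, "w:gz") as tar:
    for csv_file in sorted(landing_dir.glob("*.csv")):
        tar.add(csv_file, arcname=csv_file.name)
# Once the archive is verified, the originals can be removed and the tarball
# moved to a cooler/cheaper storage tier (e.g. via lifecycle rules).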

Related

Continuous appending of data to an existing tabular data file (CSV, Parquet) using PySpark

For a project I need to append, frequently but non-periodically, about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file, or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times, from different users and different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the namenode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the result as a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heapsize of the namenode
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to a single file. They append "into a directory" and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you mentioned, you want Parquet, so write that instead; being compressed and columnar, it will give you even smaller file sizes in HDFS.
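As a rough illustration of the coalesce idea, a minimal sketch (the input path, output path, and partition count are placeholders, not your real layout):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("batch-append").getOrCreate()
# Read the newly arrived input files for this ingestion run.
new_data = spark.read.csv("/user/applepy/incoming/*.csv", header=True, inferSchema=True)
# Collapse the run into a few writer tasks so each append adds a handful of
# reasonably sized files instead of hundreds of tiny ones.
(new_data
    .coalesce(4)  # placeholder; size so each output file lands near the HDFS block size
    .write
    .mode("append")
    .parquet("/user/applepy/pyspark_partition/uuid_parquet"))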
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest/ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

Spark: writing data to a place that is being read from without losing data

Please help me understand how I can write data to a place that is also being read from, without any issue, using EMR and S3.
So I need to read partitioned data, find the old data, delete it, and write the new data back. I'm thinking about two ways here:
Read all the data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here: before writing, Spark will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the write, all the data will be lost. I can use dynamic partition overwrite, but in that situation I would still lose the data from one partition.
Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But since this is S3 storage there is no move operation, so all the files will be copied, which can be a bit pricey (I'm going to work with 200 GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake (https://delta.io/) format supports transactional deletes, updates, and merge/upsert, and does so very well. You can read & delete (say, for GDPR purposes) exactly as you're describing, and you'll have a transaction log to track what you've done.
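As a rough sketch of what such a delete looks like with the Delta Lake Python API (assuming the table has already been written as Delta; the path and predicate are illustrative):
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("delta-delete").getOrCreate()
# Point at an existing Delta table in S3 (illustrative path).
table = DeltaTable.forPath(spark, "s3://my-bucket/events_delta")
# Transactionally remove the old records; concurrent readers keep seeing a
# consistent snapshot, and the change is recorded in the transaction log.
table.delete("event_date < '2020-01-01'")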
On point 2, as long as you have a reasonable number of files, your costs should be modest, with storage charges at ~$23/TB/month. However, if you end up with too many small files, the API costs of listing and fetching them can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping, and Z-ordering.
Disclaimer, I work for Databricks....

Azure Synapse loading: split large compressed files into smaller compressed files

I'm receiving this recommendation from Azure Synapse.
Recommendation details
We have detected that you can increase load throughput by splitting your compressed files that are staged in your storage account. A good rule of thumb is to split compressed files into 60 or more to maximize the parallelism of your load.
Looking at Azure's docs, this is the recommendation.
Preparing data in Azure Storage
To minimize latency, colocate your storage layer and your SQL pool.
When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.
Split large compressed files into smaller compressed files.
What I'm trying to understand is how I can split a large compressed file into smaller compressed files. Is there an option for that? Thanks!
You may check out this article: How to maximize COPY load throughput with file splits.
It's recommended to load multiple files at once for parallel processing and to maximize bulk-loading performance with SQL pools using the COPY statement.
File-splitting guidance is outlined in the linked documentation, and the blog covers how to easily split CSV files residing in your data lake using Azure Data Factory Mapping Data Flows within your data pipeline.
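If a Spark pool (Synapse Spark or Databricks) is available, one way to do the split yourself is simply to read the large file and rewrite it as many smaller compressed parts; the paths and the 60-way split below are illustrative, not Synapse-specific commands:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("split-staged-file").getOrCreate()
# gzip is not splittable, so reading the single large file is single-threaded;
# that is a one-off cost paid to enable parallel loading afterwards.
df = spark.read.csv("abfss://staging@mylake.dfs.core.windows.net/big_extract.csv.gz", header=True)
# Rewrite as ~60 smaller compressed files so the COPY statement can load them in parallel.
(df.repartition(60)
   .write
   .option("header", True)
   .option("compression", "gzip")
   .mode("overwrite")
   .csv("abfss://staging@mylake.dfs.core.windows.net/big_extract_split/"))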

Optimal Data Lake File Partition Sizes

The Small File Problem gets referenced a lot when discussing performance issues with Delta Lake queries. Many sources recommend file sizes of 1GB for optimal query performance.
I know Snowflake is different than Delta Lake, but I think it's interesting that Snowflake's strategy contradicts the conventional wisdom. They rely on micro-partitions, which aim to be between 50MB and 500MB before compression.
Snowflake and Delta Lake have similar features:
File Pruning - Snowflake vs Delta Lake
Metadata about contents of file - Snowflake vs Delta Lake
Can anyone explain why Snowflake thrives on smaller files while conventional wisdom suggests that Delta Lake struggles?
Disclaimer: I'm not very familiar with Snowflake, so I can only answer based on the documentation and my experience with Delta Lake.
The small files problem usually arises when you're storing streaming data, or something similar, in formats like Parquet that rely only on the file listing provided by the storage provider. With a lot of small files, listing them is very expensive and is often where most of the time is spent.
Delta Lake solves this by tracking file names in its manifest (transaction log) files and then reaching objects by name, instead of listing all files and extracting the names from the listing. On Databricks, Delta has further optimizations, such as data skipping, that are driven by the metadata stored in those manifest files. As far as I can see from the documentation, Snowflake does something similar under the hood.
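For intuition, here is a hedged peek at that manifest: a Delta table's _delta_log directory holds JSON commit files whose "add" entries record each data file's path, size and statistics (the table path is illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("peek-delta-log").getOrCreate()
# Each commit lists the data files it added, so readers resolve file paths from
# the log instead of listing the whole directory tree on the object store.
log = spark.read.json("s3://my-bucket/events_delta/_delta_log/*.json")
log.where("add is not null").select("add.path", "add.size", "add.stats").show(truncate=False)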
Regarding file size: on Delta the default target is ~1 GB, but in practice it can be much lower, depending on the type of data being stored and on whether you need to update it. When updating or deleting data you have to rewrite whole files, and the bigger the files, the more you rewrite.
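If the table lives on Databricks, one hedged way to steer that trade-off is the delta.targetFileSize table property plus OPTIMIZE for compaction; the table name and the 128 MB target below are illustrative, and the property is Databricks-specific:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Aim for smaller data files on a table that sees frequent updates/deletes,
# then compact existing small files toward that target size.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.targetFileSize' = '134217728')")  # 128 MB in bytes
spark.sql("OPTIMIZE events")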

Is compression a linear operation?

I'm mincing terms here, but I think it's the most concise way to ask it.
When I say linear, I mean linear as in a linear operation in mathematics, i.e. f(t+p) = f(t) + f(p).
At work, we have datasets we store in .h5 files. We do this for space reasons. When we perform data analysis and post-processing, we have to open the .h5 file and grab data. Due to the way our system has been set up, we unfortunately end up opening that .h5 file multiple times, grabbing just a small subset of the tables each time. This is hard on our file system and clearly hurts the run time of the scripts, and the network in general.
I wonder if it would be better not to store all the tables in a single .h5 file. Perhaps chunk it up into smaller ones, or even compress each table individually, so that when a script only needs a small set of tables it doesn't take so long.
I am curious whether this would come with a trade-off, i.e. would we lose space by compressing individually? That is why I titled this post the way I did: if the act of compressing a file were linear, then compressing the tables individually should add up to the same size as compressing a single file holding all the tables.
I of course doubt that compression is linear, but just how much of a difference in storage utilization could there be?
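One hedged way to put a number on it for your own data: compress a couple of tables separately and together with a general-purpose codec and compare sizes (gzip here is just a stand-in for whatever filter your .h5 files use, and the toy tables are synthetic):
import gzip
# Two toy "tables" serialized as CSV text; substitute dumps of your real tables.
table_a = "\n".join(f"{i},{i * 0.5}" for i in range(100_000)).encode()
table_b = "\n".join(f"{i},{i * 2.0}" for i in range(100_000)).encode()
separate = len(gzip.compress(table_a)) + len(gzip.compress(table_b))
combined = len(gzip.compress(table_a + table_b))
# Compression is not linear: the combined stream can reuse patterns across
# tables, so "combined" is typically a bit smaller than "separate", though
# gzip's 32 KB window limits how much cross-table redundancy it can exploit.
print(f"separate: {separate} bytes, combined: {combined} bytes")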
