What Happens When a Delta Table is Created in Delta Lake? - databricks

With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is, what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (data storage location) that Delta Lake runs on top of... right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are parquet tables different from CSV tables in some way, or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.

There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax to define a table. It's just metadata that allows users to easily use the table without knowing where it's located, what data format it uses, etc. You can read more about databases & tables in the Databricks documentation.
The actual format for data storage is specified by the USING directive. In your case it's parquet, so when people or code read or write data, the underlying engine will first read the table metadata, figure out the location of the data and the file format, and then use the corresponding code.
Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, but adds additional capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
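To make the distinction concrete, here is a rough PySpark sketch of both routes to a Delta table; the table name People10M_delta is made up for the example, and the path is assumed to be a directory of Parquet files:
# Option 1: read the existing Parquet data and save it as a new Delta table
df = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
df.write.format("delta").saveAsTable("People10M_delta")

# Option 2: convert the Parquet directory to Delta in place
spark.sql("CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`")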

Related

Increment data load from Azure Synapse to ADLS using delta lake

We have some views created in an Azure Synapse DB. We need to query this data incrementally based on a watermark column, and it has to be loaded into the Azure data lake container, into the Raw layer and then into the Curated layer. In the Raw layer the file should contain the entire data (full load data), so basically we need to append this data and export it as a full load. Should we use Databricks Delta Lake tables to handle this requirement? How can we upsert data into the Delta Lake table? Also, we need to delete a record if it has been deleted from the source. What should the partition column be?
Please look at the syntax for Delta tables - UPSERT (MERGE). Before the Delta file format, one would have to read the old file, read the new file, and do a set operation on the dataframes to get the result.
The nice thing about Delta is the ACID properties. I like using data frames since the syntax can be more compact. Here is an article for you to read.
https://www.databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
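For illustration, here is a minimal PySpark sketch of such an upsert with MERGE; the path, the id key column, the is_deleted flag, and the incremental dataframe updates_df are all assumptions for the example:
from delta.tables import DeltaTable

# target Delta table in the curated layer (placeholder path)
target = DeltaTable.forPath(spark, "/mnt/datalake/curated/my_table")

# updates_df is the incremental batch read from Synapse (assumed to exist)
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedDelete(condition="s.is_deleted = true")  # propagate source deletes
    .whenMatchedUpdateAll()                               # update changed rows
    .whenNotMatchedInsertAll()                            # insert new rows
    .execute())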

Reading version specific files delta-lake

I want to read the delta of the data after a certain timestamp/version. The logic here suggests reading the entire data and the specific version, and then finding the delta. As my data is huge, I would prefer not to read the entire dataset and, if somehow possible, to read only the data after a certain timestamp/version.
Any suggestions?
If you need data that has a timestamp after some specific date, then you still need to sift through all the data. But Spark & Delta Lake may help here if you organize your data correctly:
You can have time-based partitions, for example, store data by day/week/month, so when Spark reads the data it may read only specific partitions (perform so-called predicate pushdown), for example, df = spark.read.format("delta").load(...).filter("day > '2021-12-29'") - this will work not only for Delta, but for other formats as well. Delta Lake may additionally help here because it supports so-called generated columns, where you don't need to create a partition column explicitly, but let Spark generate it for you based on other columns.
On top of partitioning, formats like Parquet (and Delta, which is based on Parquet) allow skipping the reading of all data because they maintain min/max statistics inside the files. But you will still need to open these files.
On Databricks, Delta Lake has more capabilities for selective reads of the data - for example, the min/max statistics that Parquet keeps inside the file can be saved into the transaction log, so Delta won't need to open a file to check whether the timestamp is in the given range - this technique is called data skipping. Additional performance can come from Z-Ordering of the data, which collocates related data closer together - that's especially useful when you need to filter by multiple columns.
Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0
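As a rough sketch of the generated-column idea above (the table, column names and the Z-Order column are illustrative only):
# Delta table whose partition column is generated from the event timestamp
spark.sql("""
CREATE TABLE IF NOT EXISTS sensor_events (
  sensor_id  STRING,
  event_time TIMESTAMP,
  value      DOUBLE,
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING delta
PARTITIONED BY (event_date)
""")

# a filter on event_time lets Delta prune the generated event_date partitions
recent = spark.table("sensor_events").filter("event_time > '2021-12-29 00:00:00'")

# Databricks (and OSS Delta >= 2.0): cluster files so data skipping helps on other columns too
spark.sql("OPTIMIZE sensor_events ZORDER BY (sensor_id)")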

Update existing records of parquet file in Azure

I am converting my table into the parquet file format using Azure Data Factory, and performing queries on the parquet files using Databricks for reporting. I want to update only the existing records which were updated in the original SQL Server table. Since I am performing this daily on a very big table, I don't want to truncate and reload the entire table, as it would be costly.
Is there any way I can update those parquet files without performing a truncate and reload operation?
Parquet is by default immutable, so the only way to rewrite the data is to rewrite the table. But that becomes possible if you switch to the Delta file format, which supports updating/deleting entries and also supports the MERGE operation.
You can still use the Parquet format for producing the data, but then you need to use that data to update the Delta table.
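A hedged sketch of that route (the path and the update condition are placeholders): convert the existing Parquet directory to Delta once, then apply in-place updates:
from delta.tables import DeltaTable

# one-time, in-place conversion of the Parquet directory
# (pass a partition schema as a third argument if the table is partitioned)
DeltaTable.convertToDelta(spark, "parquet.`/mnt/datalake/reporting/my_table`")

# afterwards, individual records can be updated without rewriting the whole table
tbl = DeltaTable.forPath(spark, "/mnt/datalake/reporting/my_table")
tbl.update(
    condition="id = 42",            # rows changed in the source system
    set={"status": "'closed'"}      # new values given as SQL expression strings
)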
I have found a workaround to this problem.
Read the parquet file into a data frame using any tool or Python scripts.
Create a temporary table or view from the data frame.
Run SQL queries to modify, update and delete the records.
Convert the result back into a data frame.
Overwrite the existing parquet files with the new data.
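A rough PySpark sketch of that workaround (paths and column names are made up); note that Spark cannot safely overwrite a path it is still reading from in the same job, so the result is staged first:
src_path = "/mnt/datalake/raw/my_table"
staging_path = "/mnt/datalake/raw/my_table_new"

df = spark.read.parquet(src_path)
df.createOrReplaceTempView("my_table")

# express the update/delete as a SELECT that rewrites the affected rows
updated = spark.sql("""
SELECT id,
       CASE WHEN id = 42 THEN 'closed' ELSE status END AS status,
       amount
FROM my_table
WHERE id <> 99   -- drop deleted records
""")

# write to a staging location, then swap the directories (e.g. with dbutils.fs.mv)
updated.write.mode("overwrite").parquet(staging_path)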
Always go for a soft delete while working with NoSQL; a hard delete is very costly.
Also, with a soft delete, the downstream pipeline can consume the update and act upon it.

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to perform SQL queries on this data (typically fetching time ranges for a specific client/sensor, performing aggregations, etc.). I've tried loading it into Postgres and TimescaleDB, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the metadata for the row-group information is not as optimized as if I used one single parquet file containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many parquet files as to have a single partitioned Parquet file? I know that internally the partitioned parquet data is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the file.
The reason I am not able to do this is that I am continuously receiving new data, and from my understanding, I cannot append to a parquet file due to the way the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with Kafka (as you are already using it) for real-time processing of your data and store the data in parquet format. And yes, you can append data to parquet files. Use SaveMode.Append for that, such as
df.write.mode('append').parquet(path)
You can even partition your data on an hourly basis:
client/sensor/year/month/day/hour, which will give you a further performance improvement when querying.
You can create the hour partition based on system time or on a timestamp column, depending on the type of query you want to run on your data.
You can use watermarking to handle late records if you choose to partition based on a timestamp column.
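For instance, a minimal sketch of an appending, partitioned write (the partition columns are assumed to exist in the incoming dataframe, and the output path is a placeholder):
(df.write
   .mode("append")
   .partitionBy("client_id", "sensor_id", "year", "month", "day", "hour")
   .parquet("/data/sensor_readings"))
Spark will then prune partitions whenever a query filters on those columns.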
Hope this helps!
I can share my experience and the technology stack being used at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics is stored in Druid and ClickHouse. ClickHouse is used to hold real-time data for the last two days; Druid (0.9) wasn't able to manage it. Druid holds the rest of our data, which is populated daily via Hadoop.
Druid is the right candidate in case you don't need raw data but pre-aggregated data, on a daily or hourly basis.
I would suggest you give ClickHouse a chance; it lacks documentation and examples but works robustly and fast.
Also, you might take a look at Apache Hudi.

Convert schema evolving (SCD) JSON/XML into ORC/Parquet format

We are getting a variety of JSONs/XMLs as input, where the schema is always evolving. I want to process them using the ORC or Parquet format in a Hadoop/Hive environment for a performance gain.
I know the common way below of achieving the same objective:
Use a JSONSerde or XMLSerde library and first create Hive tables using these serdes. Later, a select * query is fired on each XML/JSON Hive table to save the data as ORC or as Parquet into another table. Once that is done successfully, I can drop the serde tables and the XML/JSON data.
What would be other good ways of doing the same?
As suggested by you, this is the most common way to do an offline conversion of JSON/XML data to the parquet format.
But another way could be to parse the JSON/XML and create Parquet Groups for each of the JSON records. Essentially:
1. Open the JSON file.
2. Read each individual record.
3. Open another file.
4. Create a Parquet Group from the record read in #2.
5. Write the parquet group to the file created in #3.
6. Do this for all records in the file.
7. Close both files.
We came up with such a converter for one of our use cases.
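The answer above refers to the low-level Java Parquet Group API; as a rough Python analogue of the same record-by-record idea (assuming newline-delimited JSON and that pyarrow is available), one could do:
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = []
with open("input.json") as f:        # assumes one JSON object per line
    for line in f:
        records.append(json.loads(line))

# pyarrow infers the schema from the records themselves
table = pa.Table.from_pylist(records)
pq.write_table(table, "output.parquet")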
