Update existing records of Parquet file in Azure

I am converting my table into Parquet format using Azure Data Factory, and querying the Parquet files with Databricks for reporting. I want to update only the records that were updated in the original SQL Server table. Since the table is very big and this runs daily, I don't want to truncate and reload the entire table, as that would be costly.
Is there any way I can update those Parquet files without performing a truncate-and-reload operation?

Parquet is immutable by default, so the only way to change the data is to rewrite the table. However, this becomes possible if you switch to the Delta file format, which supports updating and deleting entries and also supports the MERGE operation.
You can still use Parquet for producing the data, but you then need to use that data to update the Delta table.
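For example, a minimal PySpark sketch of such a MERGE, assuming the reporting table has already been created as a Delta table; the paths and the key column id are illustrative, and spark is the Databricks session:

from delta.tables import DeltaTable

# New/changed rows exported as Parquet by Azure Data Factory (illustrative path)
updates_df = spark.read.parquet("/mnt/landing/mytable_updates")

# Existing Delta table that the reports query
target = DeltaTable.forPath(spark, "/mnt/delta/mytable")

# Update rows that already exist, insert the ones that don't
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())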

I have found a workaround for this problem (a sketch of these steps follows below):
Read the Parquet file into a data frame using any tool or Python script.
Create a temporary table or view from the data frame.
Run SQL queries to modify, update, and delete records.
Convert the result back into a data frame.
Overwrite the existing Parquet files with the new data.
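A minimal PySpark sketch of these steps, with illustrative paths, column names, and change-set logic:

# 1. Read the existing Parquet data and the change set into data frames
base_df    = spark.read.parquet("/mnt/data/mytable")          # current table
changes_df = spark.read.parquet("/mnt/data/mytable_changes")  # updated records

# 2. Expose both as temporary views
base_df.createOrReplaceTempView("base")
changes_df.createOrReplaceTempView("changes")

# 3. Apply the updates with SQL: prefer the changed row where one exists
merged_df = spark.sql("""
    SELECT COALESCE(c.id, b.id)       AS id,
           COALESCE(c.value, b.value) AS value
    FROM base b
    FULL OUTER JOIN changes c ON b.id = c.id
""")

# 4./5. Write the result to a fresh location and swap directories afterwards;
# overwriting the same path that is still being read in the same job is risky
# because Spark reads lazily.
merged_df.write.mode("overwrite").parquet("/mnt/data/mytable_v2")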

Always go for soft delete when working with NoSQL; a hard delete is very costly.
Also, with soft delete, downstream pipelines can consume the update and act on it.
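For illustration, in the Parquet/Delta setup discussed above a soft delete just flips a flag instead of removing the row; the table name, is_deleted/deleted_at columns, and IDs below are assumptions:

# Mark rows as deleted rather than physically removing them, so that
# downstream consumers can detect the change and act on it.
spark.sql("""
    UPDATE my_delta_table
    SET is_deleted = true,
        deleted_at = current_timestamp()
    WHERE id IN (1, 2, 3)
""")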

Related

Incremental data load from Azure Synapse to ADLS using Delta Lake

We have some views created in an Azure Synapse DB. We need to query this data incrementally based on a watermark column, load it into the Raw layer of an Azure Data Lake container, and then into the Curated layer. In the Raw layer the file should contain the entire data (full-load data), so basically we need to append this data and export it as a full load. Should we use Databricks Delta Lake tables to handle this requirement? How can we upsert data to the Delta Lake table? We also need to delete a record if it has been deleted from the source. What should the partition column be?
Please look at the syntax for Delta tables - UPSERT (MERGE). Before the Delta file format, one would have to read the old file, read the new file, and do a set operation on the data frames to get the result.
The nice thing about Delta is the ACID properties. I like using data frames since the syntax can be more compact. Here is an article for you to read.
https://www.databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
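A hedged sketch of such an upsert with PySpark and the delta-spark API, assuming a watermark column modified_date, a key column id, and a source flag is_deleted (all illustrative):

from delta.tables import DeltaTable

# Incremental slice pulled from the Synapse view into the Raw layer (illustrative)
incremental_df = (spark.read.parquet("/raw/my_view/")
                       .where("modified_date > '2024-01-01'"))  # last watermark value

# Curated Delta table
target = DeltaTable.forPath(spark, "/curated/my_view_delta")

(target.alias("t")
    .merge(incremental_df.alias("s"), "t.id = s.id")
    .whenMatchedDelete(condition="s.is_deleted = true")          # propagate source deletes
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll(condition="s.is_deleted = false")   # don't insert tombstones
    .execute())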

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is: what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (the data storage location) that Delta Lake is running on top of, right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are Parquet tables different from CSV tables in some way, or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax for defining a table. It's just metadata that allows users to easily use the table without knowing where it's located, what the data format is, etc. You can read more about databases & tables in the Databricks documentation.
The actual format for data storage is specified by the USING directive. In your case it's parquet, so when people or code read or write data, the underlying engine first reads the table metadata, figures out the location of the data and the file format, and then uses the corresponding code.
Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, but adds additional capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
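For instance (the path reuses the one from the question; the new table name is illustrative):

# Convert the existing Parquet directory into Delta in place
spark.sql("CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`")

# Or create a new table backed by the Delta format from the start
spark.sql("""
    CREATE TABLE People10M_delta
    USING delta
    AS SELECT * FROM parquet.`/mnt/training/dataframes/people-10m.parquet`
""")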

Overwrite underlying parquet data seamlessly for Impala table

I have an Impala table backed by parquet files which is used by another team.
Every day I run a batch Spark job that overwrites the existing Parquet files (a new data set is created; the existing files are deleted and new files are written).
Our Spark code looks like this:
dataset.write.format("parquet").mode("overwrite").save(path)
During this update (overwriting the Parquet data files and then running REFRESH on the Impala table), if someone accesses the table they end up with an error saying the underlying data files are not there.
Is there any solution or workaround for this issue? I do not want other teams to see the error at any point in time when they access the table.
Maybe I could write the new data files into a different location and then make the Impala table point to that location?
The behaviour you are seeing is because of the way Impala is designed to work. Impala fetches the metadata of the table, such as the table structure, partition details, and HDFS file paths, from HMS, and the block details of the corresponding HDFS file paths from the NameNode. All these details are fetched by the catalog and distributed across the Impala daemons for their execution.
When the table's underlying files are removed and new files are written outside Impala, it is necessary to perform a REFRESH so that the new file details (such as files and the corresponding block details) are fetched and distributed across the daemons. This way Impala becomes aware of the newly written files.
Since you're overwriting the files, Impala queries fail to find the files they are aware of, because those have already been removed and the new files are still being written. This is an expected event.
As a solution, you can do one of the following:
Append the new files to the table's existing HDFS path instead of overwriting. This way, Impala queries against the table still return results, although only the older data (because Impala is not yet aware of the new files); the error you mentioned is avoided while the overwrite is occurring. Once the new files are created in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for the table.
OR
As you said, you can write the new Parquet files to a different HDFS path and, once the write is complete, either [remove the old files, move the new files into the table's actual HDFS path, and run a REFRESH] or [issue an ALTER statement against the table to point its location at the new directory]. If it's a daily process, you might have to implement this through a script that runs after the Spark write succeeds, passing the directories (new and old) as arguments.
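A rough sketch of the second option, reusing the dataset data frame from the question (the paths are placeholders, and the ALTER/REFRESH statements would be run through impala-shell or an Impala client rather than Spark):

from datetime import date

# Write the new snapshot to a dated directory instead of the live one
new_path = f"/data/warehouse/my_table/snapshot={date.today():%Y%m%d}"
dataset.write.format("parquet").mode("overwrite").save(new_path)

# Then point Impala at the new directory and refresh its metadata, e.g.:
#   ALTER TABLE my_table SET LOCATION '/data/warehouse/my_table/snapshot=20240101';
#   REFRESH my_table;
# The previous snapshot directory can be removed afterwards.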
Hope this helps!

Convert schema evolving (SCD) JSON/XML into ORC/Parquet format

We receive a variety of JSONs/XMLs as input, and the schema is always evolving. I want to process them using the ORC or Parquet format in a Hadoop/Hive environment for performance gains.
I know the common way of achieving this objective:
Use a JSONSerDe or XMLSerDe library and first create Hive tables using these SerDes. Then a select-all query is fired on each XML/JSON Hive table to save the data as ORC or as Parquet into another table. Once that succeeds, I can drop the SerDe tables and the XML/JSON data.
What would be other good ways of doing the same?
As you suggested, this is the most common way to do an offline conversion of JSON/XML data to the Parquet format.
But another way could be to parse the JSON/XML and create Parquet groups for each of the JSON records (a sketch is given after the steps). Essentially:
Open the JSON file
Read each individual record
Open another file
Create a Parquet Group from the record read in #2
Write the parquet group to the file created in #3
Do this for all records in the file
Close both files.
We came up with such a converter for one of our use cases.
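In Python, a comparable record-by-record converter could be sketched with pyarrow; the file names and batch size are illustrative, and a truly evolving schema would need a unified schema supplied up front instead of inferring it from the first batch:

import json
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
batch, batch_size = [], 10_000

# Stream a newline-delimited JSON file and write it out as Parquet in batches
with open("input.json") as src:
    for line in src:
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            table = pa.Table.from_pylist(batch)
            if writer is None:                      # schema taken from the first batch
                writer = pq.ParquetWriter("output.parquet", table.schema)
            writer.write_table(table)
            batch = []

# Flush the remainder and close both files
if batch:
    table = pa.Table.from_pylist(batch)
    if writer is None:
        writer = pq.ParquetWriter("output.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()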

Design of Spark + Parquet "database"

I've got 100 GB of text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year) and to incrementally add data each day, preferably without read locks.
Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?
give up on concurrent reads/writes and append new data to the existing parquet file.
create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
something else?
Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
My Level 0 design for this:
Use partitioning by date/time (if your queries are based on date/time, to avoid scanning all the data)
Use the Append SaveMode where required (see the sketch after this list)
Run the Spark SQL distributed SQL engine so that:
you enable querying of the data from multiple clients/applications/users
the data is cached only once across all clients/applications/users
Use plain HDFS, if you can, to store all your Parquet files
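As a sketch of the first two points, a date-partitioned append keeps each day's load cheap and lets queries prune partitions instead of scanning a year of data (daily_df, the partition column, and the paths are illustrative):

# Daily ingest: partition by date and append
(daily_df.write
    .partitionBy("event_date")
    .mode("append")
    .parquet("hdfs:///data/events"))

# Queries that filter on the partition column only touch matching directories
spark.read.parquet("hdfs:///data/events").where("event_date >= '2015-01-01'")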
I have a very similar requirement in my system. If you load the whole year's data, at 100 GB a day that is roughly 36 TB, and loading 36 TB daily can't be fast no matter what. It's better to save the processed daily data somewhere (such as count, sum, and distinct results) and use that when you need to go back over the whole year.
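For example, a hedged sketch of pre-aggregating one day and storing only the summary (the column names and paths are placeholders):

# Compute a compact daily summary once, instead of rescanning the raw data later
daily = spark.read.parquet("hdfs:///data/events").where("event_date = '2015-06-01'")
summary = daily.groupBy("customer_id").agg({"amount": "sum", "*": "count"})
summary.write.mode("append").parquet("hdfs:///data/daily_summary")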
