The documentation does not make it clear and we can't adequately test this:
Does CREATE OR REPLACE TABLE 'x' DEEP CLONE 'y' synchronize two pre-existing Delta tables or does it delete the target and recreate it from the source?
It will do the copying of only data that were added since previous clone, it won’t delete the target before copying. That’s one of the reasons why it’s very good for things like backing up the data.
Create or replace are related to the metastore operations as explained in the docs.
Related
I am working with Azure Databricks and we are moving hundreds of gigabytes of data with Spark. We stream them with Databricks' autoloader function from a source storage on Azure Datalake Gen2, process them with Databricks notebooks, then load them into another storage. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means if a record is deleted at the source, we also have to delete it. If a record is updated or added, then we do that too. For the latter autoloader with a file level listener, combined with a MERGE INTO and with .forEachBatch() is an efficient solution But what about deletions? For technical reasons (dynamics365 azure synapse link export being extremely limited in configuration) we are not getting delta files, we have no data on whether a certain record got updated, added or deleted. We only have the full data dump every time.
To simply put: I want to delete records in a target dataset if the record's primary key is no longer found in a source dataset. In T-SQL MERGE could check both ways, whether there is a match by the target or the source, however in Databricks this is not possible, MERGE INTO only checks for the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Any best practices to this?
We are investigating how to implement the GDPR 'right to be forgotten' in Delta Lake. Basically, the key functionality is to delete a record (from a person who has requested to have their data removed) from delta lake, including previous versions.
I thought (hoped) that VACUUM would do the trick, but as I understand it, VACUUM deletes whole tables. Hence, I lose the history of all other records, which I would like to keep.
Here is a notebook demonstrating what I want to do.
Versions in Delta tables are immutable - each modification operation doesn't change the existing files, but take the original data from it, do modification & create a new version. Because of that, you need to do modification of the data & clean the old versions using the VACUUM. Databricks has very good guide on handling of GDPR & CCPA data using the Delta Lake, that describes how to approach to that problem.
Theoretically, you can write a script that will go through the whole history, read each version, do modification of the data, and write as a new version, and at the end do the vacuum, but that could be quite resource intensive.
Also, if you'll need to perform that operation periodically, you may think about other approaches, like, encrypting each user's data with individual keys, separating the PII data into a separate table that you can modify, and other things.
I want to copy some tables from a production system to a test system on a regular basis. Both systems run a PostgreSQL server. I want to copy only specific tables from production to test.
I´ve already set up a foreach which iterates over the table names I want to copy. The problem is, that the table structures may change during development process and the copy job might fail.
So is there a way to use some kind auf "automatic mapping"? Cause the tables in both systems always have exactly the same structure. Or is there some kind of "Copy table" procedure?
You could remove mapping and structure in your pipeline . Then it will using the default mapping behavior. Given your tables always have the same schema, both mapping by name and mapping by order should work.
I am building data lake to integrate multiple data sources for advanced analytics.
In the begining, I select HDFS as data lake storage. But I have a requirement for updates and deletes in data sources which I have to synchronise with data lake.
To understand the immutable nature of Data Lake I will consider LastModifiedDate from Data source to detect that this record is updated and insert this record in Data Lake with a current date. The idea is to select the record with max(date).
However, I am not able to understand how
I will detect deleted records from sources and what I will do with Data Lake?
Should I use other data storage like Cassandra and execute a delete command? I am afraid it will lose the immutable property.
can you please suggest me good practice for this situation?
1. Question - Detecting deleted records from datasources
Detecting deleted records from data sources, requires that your data sources supports this. Best is that deletion is only done logically, e. g. with a change flag. For some databases it is possible to track also deleted rows (see for example for SQL-Server). Also some ETL solutions like Informatica offer CDC (Changed Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of cause you can use a key value store adding some kind of complexity to the overall solution. First you have to clarify, if it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally build an actual image (data as it is in your data source). Also consider solutions like Databricks Delta addressing this topics, without the need of an additional store. For example you are able to do an upsert on parquet files with delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low latency access via a key (e. g. to support an API) then a key-values store like HBase, Cassandra, etc. would be helpfull.
Usually this is always a constraint while creating datalake in Hadoop, one can't just update or delete records in it. There is one approach that you can try is
When you are adding lastModifiedDate, you can also add one more column naming status. If a record is deleted, mark the status as Deleted. So the next time, when you want to query the latest active records, you will be able to filter it out.
You can also use cassandra or Hbase (any nosql database), if you are performing ACID operations on a daily basis. If not, first approach would be your ideal choice for creating datalake in Hadoop
Looking in my keyspace directory I see several versions of most of my tables. I am assuming this is because I dropped them at some point and recreated them as I was refining the schema.
table1-b3441432142142sdf02328914104803190
table1-ba234143018dssd810412asdfsf2498041
These created tables names are very cumbersome to work with. Try changing to one of the directories without copy pasting the directory name from the terminal window... Painful. So easy to mistype something.
That side note aside, how do I tell which directory is the most current version of the table? Can I automatically delete the old versions? I am not clear if these are considered snapshots or not since each directory also can contain snapshots. I read in another post you can stop autosnapshot, but I'm not sure I want that. I'd rather just automatically delete any tables not being currently used (i.e.: that are not the latest version).
I stumbled across this trying to do a backup. I realized I am forced go to every table directory and copy out the snapshot files (there are like 50 directories..not including all the old table versions) which seems like a terrible design (maybe I'm missing something??).
I assumed I could do a snapshot of the whole keyspace and get one file back or at least output all the files to a single directory that represents the snapshot of the entire keyspace. At the very least it would be nice knowing what the current versions are so I can grab the correct files and offload them to storage somewhere.
DataStax Enterprise has a backup feature but it only supports AWS and I am using Azure.
So to clarify:
How do I automatically delete old table versions and know which is
the current version?
How can I backup the most recent versions of the tables and output the files to a single directory that I can offload somewhere? I only have two nodes, so simply relying on the repair is not a good option for me if a node goes down.
You can see the active version of a table by looking in the system keyspace and checking the cf_id field. For example, to see the version for a table in the 'test' keyspace with table name 'temp', you could do this:
cqlsh> SELECT cf_id FROM system.schema_columnfamilies WHERE keyspace_name='test' AND columnfamily_name='temp' allow filtering;
cf_id
--------------------------------------
d8ea9830-20e9-11e5-afc0-c381f961c62a
As far as I know, it is safe to delete (rm -r) outdated table version directories that are no longer active. I imagine they don't delete them automatically so that you can recover the data if you dropped them by mistake. I don't know of a way to have them removed automatically even if auto snapshot is disabled.
I don't think there is a command to write all the snapshot files to a single directory. According to the documentation on snapshot, "After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place." So it's left up to the application developer how they want to handle archiving the snapshot files.