I can't Delete or Truncate Table Storage - Azure

I created a table in a storage account and I can insert records with no problem.
I decided to use this type of table so I can share the URI (HTTPS) for another system to consume, and since it is a NoSQL table, it gives me the flexibility to adapt what I need to store in it.
But I have the inconvenience that I must truncate this table every time information is processed, and the insert-mode option in Data Factory (Replace or Merge) does not work; it always performs an append.
I tried to do it from Databricks, but I don't know how to reference that table, since it lives outside of Blob Storage and cannot be mounted as such. Any ideas?
Tags: Azure Table Storage, Azure Data Factory
(Screenshots: the Data Factory sink configuration, where "Tipo de Inserción" = Insert Type and "Reemplazar" = Replace, and the table data.)
How can I configure this so that I can delete the data?
Thanks a lot.
Greetings.
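For reference, since Table storage has no truncate operation, one way to empty the table outside Data Factory is with a small script. A minimal sketch, assuming the azure-data-tables Python SDK and hypothetical connection details (not from the original question):

from azure.data.tables import TableClient

# Hypothetical values - replace with your own storage account and table.
CONN_STR = "<storage-account-connection-string>"
TABLE_NAME = "MyTable"

client = TableClient.from_connection_string(CONN_STR, table_name=TABLE_NAME)

# "Truncate" by deleting every entity; Table storage only deletes
# entities by their PartitionKey / RowKey pair.
for entity in client.list_entities():
    client.delete_entity(
        partition_key=entity["PartitionKey"],
        row_key=entity["RowKey"],
    )

Deleting and recreating the whole table is another option, but a freshly deleted table name can take a short while before it can be recreated.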

Related

Delta Logic implementation using SnapLogic

Is there any snap available in SnapLogic to do the following?
1. Connect to Snowflake and get data with SELECT * FROM VIEW
2. Connect to Azure Blob Storage and get the data from a CSV file: FILENAME_YYYYMMDD.csv
3. Take only those records which are available in 1 but NOT available in 2 and write this delta back to Azure Blob Storage: FILENAME_YYYYMMDD.csv
Is In-Memory Lookup useful for this?
No, In-Memory Lookup snap is used for cases where you need to look up the value corresponding to the value in a certain field of the incoming records. For example, say you want to look up a country name against the country ISO code. This snap generally fetches the lookup table once and stores it in memory. Then it uses this stored lookup table to provide data corresponding to the incoming records.
In your case, you have to use the Join snap and configure it to an inner join.
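For reference, outside SnapLogic the delta described in steps 1-3 (records present in the view but not in the CSV) amounts to an anti join on the key columns. A rough PySpark sketch, with entirely hypothetical paths and an assumed key column id:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: the staged Snowflake view extract and the existing CSV.
view_df = spark.read.parquet("/staging/view_extract")
csv_df = (spark.read.option("header", True)
          .csv("abfss://container@account.dfs.core.windows.net/FILENAME_YYYYMMDD.csv"))

# Keep only rows from the view that are not in the CSV (the delta),
# comparing on the assumed key column "id".
delta_df = view_df.join(csv_df, on="id", how="left_anti")

# Spark writes a folder of part files; coalesce(1) if a single CSV is needed.
(delta_df.coalesce(1).write.mode("overwrite")
    .option("header", True)
    .csv("abfss://container@account.dfs.core.windows.net/delta/FILENAME_YYYYMMDD"))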

What's the most efficient way to delete rows in target table that are missing in source table? (Azure Databricks)

I am working with Azure Databricks and we are moving hundreds of gigabytes of data with Spark. We stream them with Databricks' Auto Loader from a source storage on Azure Data Lake Storage Gen2, process them with Databricks notebooks, then load them into another storage. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means if a record is deleted at the source, we also have to delete it. If a record is updated or added, then we do that too. For the latter, Auto Loader with a file-level listener, combined with MERGE INTO inside .foreachBatch(), is an efficient solution. But what about deletions? For technical reasons (the Dynamics 365 Azure Synapse Link export being extremely limited in configuration) we are not getting delta files; we have no data on whether a certain record got updated, added or deleted. We only have the full data dump every time.
To put it simply: I want to delete records in a target dataset if the record's primary key is no longer found in a source dataset. In T-SQL, MERGE can check both ways, whether there is a match by the target or the source; however, in Databricks this is not possible, as MERGE INTO only checks against the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Any best practices to this?
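For reference, a minimal PySpark/Delta sketch of the two options discussed, assuming hypothetical Delta tables a (target) and b (full source dump) keyed by id; the MERGE variant requires a newer Delta Lake / Databricks runtime that supports whenNotMatchedBySourceDelete():

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: the anti-join delete from the question, run as SQL.
spark.sql("""
    DELETE FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE a.id = b.id)
""")

# Option 2 (newer runtimes): express the delete as part of the MERGE,
# so the upsert and the deletion happen in a single pass over the target.
target = DeltaTable.forName(spark, "a")
source = spark.table("b")
(target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .whenNotMatchedBySourceDelete()
    .execute())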

How do I store run-time data in Azure Data Factory between pipeline executions?

I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only btw). I'd rather just store that somewhere else in Azure. Maybe in the blob storage or whatever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU that specific watermark info was for.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
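For what it's worth, the "extra effort" for the blob option is roughly a few lines of code, for example in an Azure Function or notebook called from the pipeline. A minimal sketch using the azure-storage-blob Python SDK, with a hypothetical container and blob name:

from azure.storage.blob import BlobClient

# Hypothetical names - replace with your own storage account, container and blob.
CONN_STR = "<storage-account-connection-string>"
blob = BlobClient.from_connection_string(
    CONN_STR, container_name="adf-config", blob_name="watermark.txt")

def read_watermark() -> str:
    # Return the last stored watermark, e.g. "2020-11-24T08:31:42Z".
    return blob.download_blob().readall().decode("utf-8").strip()

def write_watermark(value: str) -> None:
    # Overwrite the blob with the new watermark after a successful load.
    blob.upload_blob(value, overwrite=True)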
There is a way to achieve this by using a Copy activity, but it is complicated to get the latest watermark in 'LookupOldWaterMarkActivity'; the following is just for reference.
Dataset setting:
Copy activity setting:
The source and sink dataset are the same one. Change the expression in Additional columns to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
Through this, you can save the watermark as a column in a .txt file. But it is difficult to get the latest watermark with a Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:
{
  "count": 1,
  "value": [
    {
      "Prop_0": "11/24/2020 02:39:14",
      "Prop_1": "11/24/2020 08:31:42"
    }
  ]
}
The key names are generated by ADF. If you want to get "11/24/2020 08:31:42", you need to get the column count and then use an expression like @activity('LookupOldWaterMarkActivity').output.value[0][Prop_(column count - 1)].
How to get the latest watermark:
1. Use a Get Metadata activity to get the columnCount.
2. Use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]

Synchronize data lake with the deleted record

I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But I have a requirement for updates and deletes in the data sources which I have to synchronize with the data lake.
Given the immutable nature of the data lake, I will use LastModifiedDate from the data source to detect that a record has been updated, and insert that record into the data lake with the current date. The idea is to then select the record with max(date).
However, I am not able to understand how I will detect deleted records from the sources and what I should do with them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutable property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. It is best if deletion is only done logically, e.g. with a change flag. For some databases it is also possible to track deleted rows (see, for example, Change Data Capture for SQL Server). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building an actual image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you are able to do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is usually a constraint when creating a data lake in Hadoop: one can't just update or delete records in it. One approach that you can try is this:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark its status as Deleted. The next time you want to query the latest active records, you will be able to filter the deleted ones out.
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be your ideal choice for creating a data lake in Hadoop.
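A rough PySpark sketch of that pattern (keep the newest row per key, then drop rows whose latest state is a logical delete), assuming hypothetical columns id, lastModifiedDate and status:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily-appended data in the data lake.
raw = spark.read.parquet("/datalake/raw/customers")

# Keep only the newest row per business key ...
w = Window.partitionBy("id").orderBy(F.col("lastModifiedDate").desc())
latest = (raw
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))

# ... then filter out keys whose latest state is a logical delete.
current = latest.filter(F.col("status") != "Deleted")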

Azure tables, blobs

Just starting to study the Azure framework.
Say you create a row in Table storage.
You then create a blob in Blob storage.
Is there some way you can correlate what you just added to the table with the blob you just created?
Or any time you have related entries, must you use SQL Azure?
That would be disappointing: say you wanted to store some video in a blob, but had a row in a SQL Azure table that you wanted to link to the blob.
If you cannot have this link, must you store your video in SQL Azure somehow?
You can store the unique filename of the file stored in blob storage in the Azure table row.
Storing full video/binary content in an Azure table or a SQL Azure table is not recommended, as retrieval will be a bit slower and SQL Azure is a bit more expensive compared to blob storage.
As long as you choose your blob filename carefully, you should just be able to reference that filename in your Azure table. However, be careful how you name it. It is very tempting to jump in with a timestamp based name and reference that in the table. However, using Azure you are obviously using a scalable solution that can end up highly parallel.
Use a GUID for the filename to guarantee uniqueness. Or if you want to be able to more easily browse the storage based on when the items were added, perhaps choose a date format YYYYMMDDHHMMSS with a GUID suffixed to it.
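A tiny illustration of that naming scheme, as a hypothetical Python snippet:

import uuid
from datetime import datetime, timezone

# Timestamp prefix for browsability, GUID suffix for guaranteed uniqueness.
blob_name = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S") + "-" + str(uuid.uuid4())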
It is perfectly possible and done many times, I personally implemented such an architecture for one of the projects I worked on.
I used the Blob Name as Partition Key and the Blob Container Name as Row Key for the entity I write to the table.
You can also do it the other way around and choose the Blob Container Name as PK and the Blob Name as RK. In that case you may want to make sure the partitions in your table do not get too big and cause performance issues down the road if your blob containers get too crowded.
There is one thing you need to handle explicitly though: Blob Names and Blob Container Names have different naming constraints than Table Partition Keys and Row Keys, so you need to sanitize the Blob Name and Blob Container Name before using them as PK and RK and writing the entity to the table or reading the entity from the table.
Here is how you can sanitize Blob Names and Blob Container Names to be used as table keys:
public static readonly Regex DisallowedCharsInTableKeys = new Regex(@"[\#%/?\u0000-\u001F\u007F-\u009F]");
string sanitizedKey = DisallowedCharsInTableKeys.Replace(BlobName, disallowedCharReplacement); // disallowedCharReplacement: any character allowed in table keys, e.g. "-"
At this stage you may also want to prefix the sanitized key (Partition Key or Row Key) with a hash of the original key (i.e. the Blob Name) to avoid false collisions where different invalid keys end up with the same sanitized value.
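To make the pattern concrete, here is a small Python sketch of the same idea (the snippet above is C#), using the azure-data-tables SDK; the replacement character, table name, and the container-as-PK / blob-name-as-RK orientation are all hypothetical choices:

import hashlib
import re
from azure.data.tables import TableClient

# Mirrors the character class in the C# regex above.
DISALLOWED = re.compile(r"[#%/?\u0000-\u001F\u007F-\u009F]")

def sanitize(key: str, replacement: str = "-") -> str:
    # Prefix with a short hash of the original name so that two different
    # blob names which sanitize to the same string still get distinct keys.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return digest + "_" + DISALLOWED.sub(replacement, key)

client = TableClient.from_connection_string("<connection-string>", table_name="BlobIndex")
client.create_entity({
    "PartitionKey": sanitize("my-container"),   # blob container name
    "RowKey": sanitize("videos/intro.mp4"),     # blob name
    "BlobUrl": "https://account.blob.core.windows.net/my-container/videos/intro.mp4",
})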
Azure supports associating metadata with the blobs. Metadata is essentially just name-value pairs. The following video might be useful: http://www.bestechvideos.com/2009/04/15/mix09-windows-azure-storage
So instead of associating the table with the blob, you might want to do it the other way round by using metadata.
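A minimal sketch of that metadata approach with the azure-storage-blob Python SDK, tagging the blob with the table row it belongs to (all names hypothetical):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="videos", blob_name="intro.mp4")

# Attach the table coordinates to the blob as name-value metadata ...
blob.set_blob_metadata({"tablePartitionKey": "customer42", "tableRowKey": "intro"})

# ... and read them back later.
metadata = blob.get_blob_properties().metadata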
