Archive tables in Azure

I have Table storage in Azure where one of the tables is growing rapidly, and I need to archive any data older than 90 days. The only solution I could find online is the eventually consistent transactions pattern: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/. Although that document uses an employee table as its example and could help me achieve my objective, I am posting this question to find out whether there is a better solution.
Please note I am very new to Azure, so I might be missing a very easy way to achieve this.
Regards Tarun

The guide you are referencing is a good source. Note that if you are using Tables for logging data (where archiving older data is a common requirement) then you might want to look at blob storage instead - see the log data pattern in the guide you reference. By the way, AzCopy can also be used to export the data either to a blob or to the local file system. See here for more information: https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/#copy-entities-in-an-azure-table-with-azcopy-preview-version-only.
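If you do stay on Table storage, here is a minimal sketch of the archive flow that guide describes (copy old entities to an archive table, then delete them from the live table), using the Python Tables SDK. The table names and connection string are placeholders, and this is only a sketch of the eventually consistent pattern, not a hardened implementation:

```python
# Copy entities older than 90 days to an archive table, then delete them
# from the live table. A failure part-way leaves the entity in at least one
# of the two tables, which is why the guide calls the pattern "eventually
# consistent". Table names and the connection string are placeholders.
from datetime import datetime, timedelta, timezone
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
live = service.get_table_client("LiveTable")
archive = service.create_table_if_not_exists("ArchiveTable")

cutoff = datetime.now(timezone.utc) - timedelta(days=90)
odata_filter = "Timestamp lt datetime'{}'".format(cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"))

for entity in live.query_entities(odata_filter):
    archive.upsert_entity(entity)                          # copy first...
    live.delete_entity(partition_key=entity["PartitionKey"],
                       row_key=entity["RowKey"])           # ...delete second
```

In practice you would run something like this from a scheduled job (a WebJob or a timer-triggered Azure Function, for example) rather than interactively.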


Azure Data Factory copy activity very slow

Summarize the problem
I'm seeing particularly slow performance out of Azure Data Factory. Searching for similar questions on Stack Overflow turns up nothing except advice to contact support.
I'm rolling the dice here to see if anyone has seen something similar and knows how to fix it.
In short, every operation I try in ADF results in excruciatingly slow performance.
This includes:
Extracting a zip in blob storage to blob storage
Copying a number of small compressed files into Azure Data Explorer
Copying a number of small uncompressed json files into Azure Data Explorer
(Throughput screenshots for the 'Extracting ZIP' and 'Copying to ADX' runs are omitted here.)
In both cases the throughput is in the kilobytes-per-second range.
In both cases the copy/import will eventually complete, but it can take hours.
Describe what you've tried
I've tried:
using different regions
creating and using my own Integration Runtime
playing with different parameters that could potentially affect performance such as parallel connections etc.
Contacting Microsoft support (who sent me here)
Show some code
Not really any code to share. To reproduce just try extracting a zip to and from blob storage. I get ~400KB/s.
In summary, any advice would be gratefully received. If I can't get this working I'll have to implement the ingestion pipeline manually, which on reflection sounds like more fun than I've been having with ADF.
These 'deep' folder structures will affect copy speed. You should minimize the folder depth and increase the number of parallel copy activities. You can reference the copy activity performance troubleshooting documentation, or you can send feedback to Microsoft Azure.
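If it helps, here is a rough sketch (not a full deployment) of the two copy-activity knobs that usually matter for throughput, parallelCopies and dataIntegrationUnits, expressed with the azure-mgmt-datafactory Python models. The dataset names and the tuning values are placeholders, so treat this as an illustration of where the settings live rather than recommended numbers:

```python
# Sketch only: builds a copy activity with explicit parallelism settings.
# Deploying it additionally needs a DataFactoryManagementClient plus the
# resource group and factory name; dataset names here are placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

copy_activity = CopyActivity(
    name="CopySmallFiles",
    inputs=[DatasetReference(reference_name="SourceBlobDataset")],
    outputs=[DatasetReference(reference_name="SinkDataset")],
    source=BlobSource(),
    sink=BlobSink(),
    parallel_copies=16,         # read more small files concurrently
    data_integration_units=8,   # give the copy run more compute
)

pipeline = PipelineResource(activities=[copy_activity])
# adf_client.pipelines.create_or_update(resource_group, factory_name, "TunedCopy", pipeline)
```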

How to bulk delete (say, millions of) documents spread across millions of logical partitions in Cosmos DB SQL API?

The Azure documentation does not say anything about it. The official bulk executor documentation only covers insert and update operations, not delete. There is a suggested JavaScript server-side stored procedure which sounds very good, but it requires us to supply the partition key value, which does not make sense when our documents are spread across millions of logical partitions.
This is a very simple business need. While migrating a huge volume of data into a SQL API Cosmos collection, if we insert some wrong data there seems to be no option to delete it other than restoring to a previous state. I have explored this for a few hours now but couldn't find a solution. I even raised a case with MS support; they directed me to some .NET code, which does not look straightforward. What if someone doesn't know .NET?
Can't we easily bulk delete documents spread across several logical partitions in the Cosmos DB SQL API? It feels like a basic gap.
I hope you can provide some accurate details on how to achieve this, with simple, straightforward sample code and steps. I hope MS and Cosmos DB experts will share their views as well.
"I even raised a case with MS support; they directed me to some .NET code, which does not look straightforward."
Obviously you have already made some effort to find a solution, so apart from the two official options below there is not much else:
1. Bulk delete stored procedure: https://github.com/Azure/azure-cosmosdb-js-server/blob/master/samples/stored-procedures/bulkDelete.js
2. Bulk delete executor:
.NET: https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/blob/master/BulkDeleteSample/BulkDeleteSample/Program.cs
Java: https://github.com/Azure/azure-cosmosdb-bulkexecutor-java-getting-started/blob/master/samples/bulkexecutor-sample/src/main/java/com/microsoft/azure/cosmosdb/bulkexecutor/bulkdelete/BulkDeleter.java
So far, only the official solutions above are supported. Another workaround is TTL (time to live) in Cosmos DB. I believe you have your own logic to judge which data is correct and which data is wrong and should be deleted. You could set a TTL on the bad documents so that they are purged automatically as soon as they expire.
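As a rough illustration of that TTL workaround, here is a sketch with the azure-cosmos Python SDK. It assumes the container already has a default TTL enabled (for example -1, "on with no default"), and it uses a hypothetical migrationBatch property to identify the documents that were loaded by mistake:

```python
# Tag the bad documents with a short per-item TTL so Cosmos DB purges them.
# Assumes default TTL is enabled on the container; `migrationBatch` is a
# hypothetical marker property on the wrongly imported documents.
from azure.cosmos import CosmosClient

client = CosmosClient("<account-uri>", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

bad_docs = container.query_items(
    query="SELECT * FROM c WHERE c.migrationBatch = @batch",
    parameters=[{"name": "@batch", "value": "2019-06-01-load"}],
    enable_cross_partition_query=True,
)

for doc in bad_docs:
    doc["ttl"] = 60                       # seconds until the item expires
    container.replace_item(item=doc, body=doc)
```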
Has anyone tried this? It looks like a good solution in Java:
https://github.com/Azure/azure-cosmosdb-bulkexecutor-java-getting-started#bulk-delete-api
If you write a batch job that deletes documents overnight using some date configuration, you can achieve this. Here is an article published on how to do it:
https://medium.com/@vaibhav.medavarapu/bulk-delete-documents-from-azure-cosmos-db-using-asp-net-core-8bc95dd20411
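For reference, the same overnight, date-driven delete can be sketched with the azure-cosmos Python SDK as well. _ts is the system last-modified timestamp in epoch seconds, and deviceId stands in for whatever your partition key property actually is:

```python
# Cross-partition query for old documents, then delete each one by id and
# partition key. `deviceId` is a placeholder for your partition key property.
import time
from azure.cosmos import CosmosClient

container = (CosmosClient("<account-uri>", credential="<account-key>")
             .get_database_client("mydb")
             .get_container_client("mycoll"))

cutoff = int(time.time()) - 90 * 24 * 3600     # e.g. older than 90 days

old_docs = container.query_items(
    query="SELECT c.id, c.deviceId FROM c WHERE c._ts < @cutoff",
    parameters=[{"name": "@cutoff", "value": cutoff}],
    enable_cross_partition_query=True,
)

for doc in old_docs:
    container.delete_item(item=doc["id"], partition_key=doc["deviceId"])
```

Note that each delete costs request units, so for millions of documents you would still want to throttle or parallelize this carefully.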

Copy millions of files from root Azure Storage blob container to subfolders

I’ve got multiple Azure Storage blob containers, each with over 1M JSON files, including in the root. They're impossible to work with (no shocker), so I'm trying to use Data Factory to move them into multiple folders, using a timestamp inside each file to create a YYYY-MM-DD/HH folder layout as a partition scheme. But every approach I've tried fails with timeouts or too-many-items limits. I need to open each file, get the timestamp, and use it to move the file to a dynamic path built from that timestamp. Ideas?
UPDATE: I was able to get around this, but I wouldn't call it an "answer", so I'll just update the question. To create smaller collections, I parameterized the pipeline to accept a file name wildcard. I then created another pipeline that uses an array of 0-9, a-z as a parameter on the dataset. It's a brute-force workaround... I assume there's got to be a better solution, but this works for now.
Read doc: Move data to and from Azure Blob storage
The following articles describe how to move data to and from Azure Blob storage using different technologies.
Azure Storage Explorer
AzCopy
SDKs (.NET, Java, Node.js, Python, Go, PHP, Ruby)
SSIS
In your case, I would suggest using an SDK; SDKs are available for .NET, Java, Node.js, Python, Go, PHP, and Ruby.
Believe me, if you want to migrate your data out of Azure Blob storage, Data Factory is not a good way; it makes the problem more complicated.
(This is my suggestion after migrating over 100 million JSON files (over 2 TB) out of Azure Blob storage.)
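To make the SDK suggestion concrete, here is a minimal sketch with the azure-storage-blob Python SDK: read each JSON blob from the source container, pull its timestamp, server-side copy it to a YYYY-MM-DD/HH path in a destination container, then delete the original. The container names and the "timestamp" field name are placeholders, and a real run over millions of blobs would need batching, parallelism, and retry handling on top of this:

```python
# Reorganize flat JSON blobs into YYYY-MM-DD/HH "folders" in another
# container. "timestamp" is a hypothetical field inside each JSON file;
# writing to a separate container avoids re-listing blobs you just moved.
import json
from datetime import datetime
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("rawjson")
dest = service.get_container_client("partitionedjson")

for blob in source.list_blobs():
    src = source.get_blob_client(blob.name)
    doc = json.loads(src.download_blob().readall())

    ts = datetime.fromisoformat(doc["timestamp"])
    dest_name = ts.strftime("%Y-%m-%d/%H/") + blob.name

    copy = dest.get_blob_client(dest_name).start_copy_from_url(src.url)
    if copy.get("copy_status") == "success":   # same-account copies usually finish immediately
        src.delete_blob()
```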
If you have time... I would do the following:
Create an Azure Function that reads the file, gets your timestamp, and does the move operation; scope the function to handle a single file. Then use events (Event Grid) on the storage account to trigger the function when a blob is created. That way any new file gets moved to the right spot automatically. (Remember you need to reach a million executions per month on the consumption plan before Functions starts billing, so this is a low-cost option.)
For the existing files, create another function (or, if you want more control, use a Logic App, though the cost will be a bit higher) and set the parallelism on the function or Logic App to a low value (to keep an eye on your executions); have it run a simple for-each, with limits, that invokes your first function. This will slowly move your files out of that container, eventually getting you to a reasonable item count to work with in tools like ADF. It might also solve your problem for the long run, since any new files will be categorized accordingly while your backlog is gradually moved as required. If you need to update a DB with a pointer to where each file lives, you could put that piece of code in your function or Logic App too. Just my two cents :)
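A sketch of that Event Grid-triggered function, using the Python programming model, might look like the following. The matching function.json with an eventGridTrigger binding is assumed, and the actual move is the same copy-then-delete logic shown in the SDK sketch above:

```python
# Event Grid-triggered function: fires on Microsoft.Storage.BlobCreated,
# pulls the new blob's URL from the event payload, and hands it to the
# same copy-then-delete logic as the SDK sketch above (omitted here).
import json

import azure.functions as func
from azure.storage.blob import BlobClient

def main(event: func.EventGridEvent):
    data = event.get_json()               # BlobCreated event data
    blob_url = data["url"]

    src = BlobClient.from_blob_url(blob_url, credential="<account-key>")
    doc = json.loads(src.download_blob().readall())
    # ...derive the YYYY-MM-DD/HH destination from doc["timestamp"] and
    # copy/delete exactly as in the earlier sketch.
```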
It is not clear whether you are using the hierarchical namespace provided by Azure Data Lake Storage Gen2; generation 1 only simulates a folder structure, which is not optimal.
There are several advantages to ADLS Gen2 that should help in your case, mainly related to move operations.
To migrate from ADLS Gen1 to ADLS Gen2, have a look here.
Additionally, you may explore optimizations for your specific case with the following paper here.

Best practice - Storage options for external reference data that is queried in different ways

We have a cloud platform with various Health Care applications. Each application needs what we call reference data. Reference data is always external data coming from a provider on a daily or some regular schedule. An example of reference data is FDB MedKnowledge which includes a comprehensive compendium of consumer medication monographs, along with drug images and imprints.
Various applications will query the reference data to present it to their target customers (who can be physicians, nurses, technicians, procurement department etc...). A common global API will be developed to return the requested data.
Historical information is required (for example: FDB in 2017 had NDC1, which was then removed from the FDB feed in 2019; a physician who prescribed NDC1 should still be able to query that drug's information through its history).
Each day we receive the feed from the external provider and use it as the input source to merge (update, insert, delete) into our copy of the reference data, so that the live table reflects the latest external feed.
In Azure, we have the following storage options:
Blob storage
Cosmos Db
Azure SQL Database with system versioning
Azure SQL Data Warehouse
Azure Data lake
What is the best practice for storing external reference data? We are leaning toward Azure SQL Database with system versioning. Have any of you worked with external reference data? If yes, what was your storage decision, and has it worked well for you? I would like to hear your comments and opinions. Thank you!
You need to base your choice on the type of data you are trying to store, and how you need to reference it. It sounds like you might actually need a few different technologies here.
For example, Azure SQL is great for storing relational data. So if your data is tabular in form and needs relationships between its parts, then this is a good choice. However, if you're going to be storing millions and millions of rows, performance might suffer in a relational database. In that sort of scenario, or one where you have lots of transactional data, you might want to look at Cosmos DB.
You mentioned images at one point; putting these in a database is not a good idea, and in that scenario you will want to look at using blob storage.
"Reference data" on its own really doesn't mean anything; look at the individual types of data you need to store and how the data is used, and make decisions based on that. With lots of different types of data, there is unlikely to be a one-size-fits-all solution.

How can I back up my Windows Azure table storage?

I would like to be able to back up my table storage, and I also need to move the data (export and import) from my production environment to the development environment on my desktop.
Does anyone know of any tools or methods that I can use to do this?
You can use Cerebrata's Azure Management Cmdlets product. It allows you to download and restore your Azure table storage (and many more things). You can download it from here.
http://clumsyleaf.com/products/tablexplorer
TableXplorer will let you export all table data to an XML or CSV file.
As mentioned by others, there are tools out there that let you download your data in various formats, but it's worth noting that none of these is a true backup like you might be used to getting with SQL Server.
As far as I'm aware, they all just run a regular table storage query to scan through all of the records in the table and save out the results. If you have a reasonable amount of data (and if you're using table storage then you probably do), it's quite possible that this backup could take an hour or more.
For the sake of simplicity, let's say you have two large related tables, A and B. If the backup starts with table A and then moves on to table B, by the time it finishes, the copy of table B might contain records that rely on data in table A that simply isn't in the backup.
If you just want to refresh the data in your development environment, this could be perfectly acceptable, but you do need to be aware of it.
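For what it's worth, a minimal sketch of what such an export amounts to, using the Python Tables SDK, is below. The connection string and table name are placeholders, and it carries exactly the consistency caveat described above, since it is just a scan of the table at export time (for a very large table you would also stream to disk rather than build one in-memory list):

```python
# Dump every entity in one table to a local JSON file. This is a plain
# table scan, not a transactionally consistent backup.
import json
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="Orders")

with open("Orders.backup.json", "w") as f:
    json.dump([dict(e) for e in table.list_entities()], f, default=str)
```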
You can try this solution; it lets you back up/restore your tables and blobs to the same or a different storage account.
