Is anyone aware of a way to anonomise data within a storage account container on Azure?
I was assuming no but wanted to see if anyone had any ideas on this?
I'm aware of data masking on SQL databases on Azure but within containers we have files with data that are used within a data lake.
I know my details a little sparse... But I'm still trying to understand the underlying architecture I'm working with atm.
Any questions/ideas/thoughts, throw them at me!
Azure storage account(data lake) don't support data masking. so mask at the source of the data is not possible.
You can create app to mask the data and put it in the same VNET with your data lake(In this way, the others can not access the primary data). And then send the data to target storage.
But maybe what you need is customer managed keys to do data encryption?
Related
In my azure subscription I have a storage account with a lot of tables that contains important data.
As far as I know azure offers a backup point-in-time for the storages and blobs, and geo redundancy in event of a failover. But I couldn't find anything regarding the backup of table storages.
The only way to do so is by using azCopy which is fine and a logic, but I couldn't make it work as I had some issues with permissions even if I set the Azure Blob Data Contributor to my container.
So as an option, I was thinking if there is a way how to implement this using python code to loop throu all the tables in a specific container and make a copy into another container.
Can anyone enlighten me on this matter please?
Did you set the Azure Storage firewall: allow access from all networks?:
Python code is a way but we can't help you design the code. And there isn't an example for you. It doesn't meet Stack Overflow's guideline.
If you still couldn't figure it out with AzCopy, I would suggest you think about use Data Factory to schedule backup the data from table storage to another container.
Create a pipeline with copy active to copy the data from Table
Storage. Ref this tutorial:Copy data to and from Azure Table
storage by using Azure Data Factory.
Create a schedule trigger for the pipeline to make the jobs
automatic.
If the Table storage has many tables, the easiest way is using Copy Data Tool.
Update:
Copy data tool source settings:
Sink settings: auto create the table in sink table storage
HTH.
I plan on using Azure Blob storage to store images. I will have around 5000 categories for images that I plan on using folders to keep separated. For each of the image files, the file names won't differ a lot across the board and there is the potential to need to change metadata frequently.
My original plan was to use a SQL database to index all of these files and store my metadata there, but I'm second guessing that plan.
Is it feasible to index files in Azure Blob storage using a database, or should I just stick with using blob metadata?
Edit: I guess this question should really be "are there any downsides to indexing Azure Blob storage using a relational database?". I'm much more comfortable working with a DB than I am Azure storage, so my preference is to use a DB.
I'm second guessing whether or not to use a DB after looking at Azure storage more and discovering meta-tags and indexing. Hope this helps.
You can use Azure Search for this task as well, store images in Azure Storage (BLOB) and use Azure Search for crawling. indexing and searching. Using metadata you can enhance your search as well. This way you might not even need to use Folders to separate different categories.
Blob Index is a very feasible option and it can save the in the pricing, time, and overhead in terms of not using SQL.
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/
If you are looking for more information on this preview feature, I would love hear more and work closer on this issue. Could you please reach me on BlobIndexPreview#microsoft.com.
I am performing ETL in Azure Data Factory and I just wanted to confirm my understanding of it before going further. Please find the image attached below.
I am collecting data from multiple source and storing in Azure Blob Storage then perform Transformation and Loading. What I am confused about is that whether Azure Blob Storage is a landing or staging area here in my case. Some people use these terms interchangeably and couldn't understand the fine line between these two terms.
Also, can anyone explain me which part is Extract, Transform and Load is. In my understating, collecting the data from multiple source and store into Azure Blob Storage is Extracting, Azure Data Factory is Transformation and copying the transformed data into Azure Database is Loading. Am i correct or is there something I am misunderstanding here?
What I am confused about is that whether Azure Blob Storage is a
landing or staging area here in my case.
In your case, Azure Blob Storage is both landing area and staging area. Landing area means a area collecting data from different places. Staing area means it only save data for a little time, staging data should be deleted during ETL process.
Also, can anyone explain me which part is Extract, Transform and Load
is.
Copy Activity is a typical technology based on ETL. If only talking about the Copy Activity of Azure Data Factory, after you specify the copy source, the ADF will perform copy activities based on this, this is 'extract'. The part of the ADF that transfers data to the specified Sink according to your settings, this is 'Load', and the details of the copy behavior is 'Transform'. If you look at your entire process, you collect data to blob storage is also 'Extract'.
I'd like to have our daily csv log files transferred from GCS to Azure Datalake Store, but I can't really figure out what would be the easiest way for it.
Is there a built-in solution for that?
Can I do that with Data Factory?
I'd rather avoid running a VM scheduled to do this with the apis. The idea comes from the GCS->(DataFlow->)BigQuery solution.
Thanks for any ideas!
Yes, you can move data from Google Cloud Storage to Azure Data lake Store using Azure Data Factory by developing custom copy activity. However, in this activity, you will be using APIs for transferring that data. See details on this article.
A simple question: Can this be achieved directly? I mean without the Azure blob storage in between (as showed in all the examples)? Can someone provide some code example please.
yes, you can do this directly. In fact, you can do direct copies from any of our supported sources/sinks, you don't have to pass through blob. To go from on-prem SQL Server-->SQL azure, you will need to setup a Data Management Gateway connector on your on-prem server. Then, you use a linked service of type AzureStorage and an output dataset of type AzureSQLTable as the output dataset, instead of AzureBlob as is shown in the example. The exact steps to setup the DMG and the JSON code for the linked services, datasets, and pipelines can be found in our documentation. We are also improving our UI in the near future to make these kinds of copy setups an easy code-free experience.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-sqlserver-connector/