I am working on a backup strategy for Data Lake Store (DLS). My plan is to create two DLS accounts and copy data between them. I have evaluated several approaches, but none of them satisfies the requirement to preserve the POSIX ACLs (permissions in DLS parlance). The PowerShell cmdlets require data to be downloaded from the primary DLS onto a VM and re-uploaded to the secondary DLS. The AdlCopy tool works only on Windows 10, does not preserve permissions, and does not support copying data across regions (not that this is a hard requirement). Data Factory seemed like the most sensible approach until I realized it also doesn't preserve permissions.
Which leads me to my last option: DistCp. According to the DistCp guide (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html), the tool supports preserving permissions. The downside is that it has to be run from HDInsight: although it supports both intra- and inter-cluster copying, I would rather not keep an HDInsight cluster running just for backup operations.
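For reference, this is roughly what I would expect to run from the cluster, wrapped in a small Python script so it could be scheduled. It is only a sketch: the account names and paths are placeholders, and the -pugpa preserve flags (user, group, permission, ACL) come from the DistCp guide.

    # Sketch only: run DistCp from an HDInsight head node, preserving users,
    # groups, permissions and ACLs (-pugpa). Account names/paths are placeholders.
    import subprocess

    src = "adl://primarydls.azuredatalakestore.net/data"
    dst = "adl://secondarydls.azuredatalakestore.net/data"

    subprocess.run(
        ["hadoop", "distcp", "-update", "-pugpa", src, dst],
        check=True,
    )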
Am I missing something? Does anyone have any better suggestions?
Your assessment is comprehensive. Those are indeed the options that are available should you want to copy over permissions. So you will have to choose one of them, sorry. If you truly want a serverless option that would copy over the permissions, Azure Data Factory would have to be it. Could you please create a feedback item here - https://feedback.azure.com/forums/270578-data-factory?
Thanks,
Sachin Sheth
Program Manager, Azure Data Lake.
In my Azure subscription I have a storage account with a lot of tables that contain important data.
As far as I know, Azure offers point-in-time backup for storage accounts and blobs, and geo-redundancy in the event of a failover, but I couldn't find anything about backing up table storage.
The only way to do so seems to be AzCopy, which is fine and logical, but I couldn't make it work: I ran into permission issues even after assigning the Storage Blob Data Contributor role on my container.
So, as an option, I was wondering whether there is a way to implement this in Python: loop through all the tables in a specific storage account and copy each one into another storage account.
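Something along these lines is what I have in mind. It is only a rough sketch using the azure-data-tables package; the connection strings are placeholders and I haven't verified it end to end.

    # Rough sketch: copy every table from one storage account to another.
    # Requires: pip install azure-data-tables. Connection strings are placeholders.
    from azure.data.tables import TableServiceClient

    source = TableServiceClient.from_connection_string("<source-connection-string>")
    target = TableServiceClient.from_connection_string("<target-connection-string>")

    for table in source.list_tables():
        src_client = source.get_table_client(table.name)
        dst_client = target.create_table_if_not_exists(table.name)
        # Copy (upsert) every entity; for large tables consider batching.
        for entity in src_client.list_entities():
            dst_client.upsert_entity(entity)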
Can anyone enlighten me on this matter please?
Did you check the Azure Storage firewall settings: is access allowed from all networks?
Python code is a way, but we can't design the whole script for you, and there isn't a ready-made example to point to; that kind of request doesn't meet Stack Overflow's guidelines.
If you still can't get it to work with AzCopy, I would suggest thinking about using Data Factory to schedule backups of the data from Table storage to another account.
Create a pipeline with a Copy activity to copy the data from Table storage. See this tutorial: Copy data to and from Azure Table storage by using Azure Data Factory.
Create a schedule trigger for the pipeline to make the job run automatically.
If the Table storage has many tables, the easiest way is to use the Copy Data Tool.
Update:
In the Copy Data Tool, pick the tables in the source settings; in the sink settings, enable auto-create of the table in the sink Table storage.
HTH.
I'm very new to Azure Data Lake Storage and am currently training on Data Factory. I have a developer background, so right away I'm not a fan of the 'tools' approach to development. I really don't like how there are all these settings to set and objects to create everywhere. I much prefer a code approach, which lets us detach the logic from the service (I don't like having to publish in order to save), see everything by scrolling or by navigating to different objects in a project, see differences more easily in source control, and so on. So I found Microsoft's Filesystem SDK, which seems to be an alternative to Data Factory:
https://azure.microsoft.com/en-us/blog/filesystem-sdks-for-azure-data-lake-storage-gen2-now-generally-available/
What has been your experience with this approach? Is it a good alternative? Is there a way to run SDK code in Data Factory, so we can still leverage its scheduling and triggers? I guess I'm looking for pros/cons.
thank you
Well, the docs refer to several SDKs, one of them being the .NET SDK, and the title is:
Use .NET (or Python or Java etc.) to manage directories, files, and ACLs in Azure Data Lake Storage Gen2
So the SDK lets you manage the filesystem only. There is no support for triggers, pipelines, data flows and the like; you will have to stick with Azure Data Factory for that.
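To give a flavour of what "managing the filesystem" looks like in code, here is a rough sketch with the Python SDK (azure-storage-file-datalake); the account, key, filesystem, paths and ACL string are just placeholders.

    # Rough sketch: create a directory, set its ACL and upload a file with the
    # ADLS Gen2 filesystem SDK. Account, key, filesystem and paths are placeholders.
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential="<account-key>",
    )
    fs = service.get_file_system_client("raw")

    directory = fs.create_directory("incoming/2023")
    directory.set_access_control(acl="user::rwx,group::r-x,other::---")

    file_client = directory.create_file("sample.csv")
    file_client.upload_data(b"id,value\n1,42\n", overwrite=True)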
Regarding this:
I'm not a fan of the 'tools' approach for development
I hate to tell you, but the world is moving that way whether you like it or not; take Logic Apps, for example. Azure Data Factory isn't aimed at the hardcore developer, but it fulfils a need for people working with large sets of data, like data engineers. I'm glad it at least integrates well with git. Yes, there is some overhead in defining sinks and sources, but they are reusable across pipelines.
If you really want to use code try Azure Databricks. Take a look at this Q&A as well.
TL;DR:
The FileSystem SDK is not an alternative.
The code-centric alternative to Azure Data Factory for building and managing your Azure Data Lake is Spark. Typically either Azure Databricks or Azure Synapse Spark.
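As a flavour of that code-centric approach, a minimal PySpark sketch (the abfss paths, storage account and container names are placeholders):

    # Minimal PySpark sketch, e.g. in a Databricks or Synapse notebook.
    # Storage account, containers and paths are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    raw = spark.read.option("header", "true").csv(
        "abfss://land@<account>.dfs.core.windows.net/sales/"
    )
    cleaned = raw.filter(F.col("amount").isNotNull())
    cleaned.write.mode("overwrite").parquet(
        "abfss://curated@<account>.dfs.core.windows.net/sales/"
    )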
Currently I am tasked with researching a solution for easily copying data from one environment to another (QA to DEV, for example), as well as having the flexibility to go back to different points in time to compare our data. It is an easy task to do locally with SSMS, and I am looking for the best way to do it using Azure and its tools.
These are the options that I found so far:
Backup Service and Backup Vault (the Microsoft solution that I am not asking for; they don't generate .bak files)
Azure Function to generate and transfer SQL exports (flexible, but the code needs to be maintained and authentication managed)
PowerShell process with Azure Automation (also flexible, but needs to be maintained)
Data Factory/SSIS (still learning and researching)
Anyone got any tools/methods that are worth looking into before I dive deeper with a solution?
For Azure SQL Database, SQL Data Sync is a feature for syncing data between Azure SQL and on-premises SQL Server. Some limitations: the Azure SQL database must be the hub, and every synced table must have a primary key. That may not suit you.
In my experience, Data Factory is the best option for you. You can copy the data between different environments, and in the sink settings you can use the upsert (insert or update) operation to sync the data.
If you only want to schedule automatic backups of the SQL database, a third-party tool could also meet your needs: SQL Backup and FTP.
Since you have searched a lot and found almost all the options in Azure, any of these approaches can achieve it. You need to be clear about your real requirement: data sync, or an automatic backup that creates a .bacpac file in storage. As asked, it's hard to point you to a single best way; the one that fits you is the best.
I went with writing an Azure Automation PowerShell script. Including cmdlets like New-AzureRmSqlDatabaseExport and passing in the parameters was tricky, but it finally did the job.
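For anyone who prefers scripting this outside PowerShell, a roughly equivalent .bacpac export can be started with the Azure CLI. A sketch only: all names and secrets are placeholders, and it assumes the az CLI is installed and logged in wherever the script runs.

    # Sketch: export an Azure SQL database to a .bacpac in blob storage via the
    # Azure CLI. Resource names, credentials and the storage URI are placeholders.
    import subprocess

    subprocess.run(
        [
            "az", "sql", "db", "export",
            "--resource-group", "<resource-group>",
            "--server", "<sql-server>",
            "--name", "<database>",
            "--admin-user", "<admin-login>",
            "--admin-password", "<admin-password>",
            "--storage-key-type", "StorageAccessKey",
            "--storage-key", "<storage-account-key>",
            "--storage-uri", "https://<account>.blob.core.windows.net/backups/qa.bacpac",
        ],
        check=True,
    )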
Copying data between various instances of ADLS using DistCp
Hi All
Hope you are doing well.
We have a use case around using ADLS for different tiers of the ingestion process, and we'd appreciate your opinion on its feasibility.
INFRASTRUCTURE: There will be two ADLS instances, named LAND and RAW. The LAND instance will receive files directly from the source, while the RAW instance will receive files once validations have passed in the LAND instance. We also have a Cloudera cluster hosted on Azure with connectivity established to both ADLS instances.
PROCESS: A set of data and control files will land in the LAND instance. We need to run Spark code on the Cloudera cluster to perform a count validation between the data and control files present in the LAND instance. Once the validation succeeds, we want a DistCp command to copy the data from the LAND instance to the RAW instance. We are assuming the DistCp utility will already be available on the Cloudera cluster.
Can you suggest whether the above approach looks fine?
Primarily, our question is whether the DistCp utility supports data movement between two different ADLS instances.
We also considered other options like AdlCopy, but DistCp appeared to be the better fit.
NOTE: We haven't considered using Azure Data Factory, since it may pose certain security challenges, even though we know Data Factory is well suited to the above use case.
If your use case requires copying data between multiple storage accounts, DistCp is the right way to do it.
Note that even if you were to wrap this solution in Data Factory, the pipeline's copy activity can invoke DistCp under the covers.
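For illustration, a rough sketch of the validate-then-copy flow described in the question; the paths, the adl:// URIs and the control-file format (record count on the first line) are all assumptions.

    # Sketch: count-validate the data file against the control file with Spark,
    # then trigger DistCp from LAND to RAW. Paths/URIs/format are assumptions.
    import subprocess
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    land_data = "adl://land.azuredatalakestore.net/incoming/orders/"
    land_ctrl = "adl://land.azuredatalakestore.net/incoming/orders.ctl"
    raw_dest = "adl://raw.azuredatalakestore.net/orders/"

    actual = spark.read.option("header", "true").csv(land_data).count()
    expected = int(spark.read.text(land_ctrl).first()[0])  # assumes count on line 1

    if actual != expected:
        raise ValueError(f"Count mismatch: data={actual}, control={expected}")

    subprocess.run(["hadoop", "distcp", "-update", land_data, raw_dest], check=True)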
We're trying to get information about Azure and/or AWS in terms of their ability to create snapshots of data drives that are writable and can be attached to VMs.
Currently we use a model with our test environments on-prem, where we have a clone of a set of production databases/logs on drives that are quite large (+2TB) on our EMC SAN. Instead of making full copies of the clone for each test environment DB server, we use EMC VNX redirect-on-write snapshots. This allows us to quickly provision the DB server VM in the test environment without having to make a full copy of the DB/logs, and saves on SAN space, as only the delta from new writes to the snapshot are stored as new data. This works really well as we only need one full copy of the source DBs/logs.
Does anyone know if Azure or AWS can do something similar, or offer a reasonable alternative? Making full copies of the databases/logs for each test environment is not really an option for us. We started looking at the Azure SQL Database copy feature, but we were not sure whether it creates full copies or writable snapshots.
Thanks in advance.
Does anyone know if Azure or AWS can do something similar, or offer a reasonable alternative?
Azure VM disks use Azure page blobs to store data. Currently, a snapshot of an Azure blob can be read, copied, or deleted, but not modified.
I am sorry to say that Azure doesn't provide anything similar that fits your requirement. In Azure, you need to copy the whole blob (for example with AzCopy) to get a new, writable blob.
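If you do go the copy route, here is a rough sketch with the azure-storage-blob Python package: take a snapshot for a stable point-in-time source, then do a server-side copy of that snapshot into a new, writable blob. The connection string, container and blob names are placeholders.

    # Sketch: snapshot a blob, then copy the snapshot to a new, writable blob.
    # Connection string, container and blob names are placeholders.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    source = service.get_blob_client("disks", "data-disk.vhd")
    target = service.get_blob_client("disks", "data-disk-test.vhd")

    snapshot = source.create_snapshot()  # read-only point-in-time version
    snapshot_url = f"{source.url}?snapshot={snapshot['snapshot']}"

    # Server-side copy; the resulting blob is fully writable (it is a full copy,
    # not a delta-based snapshot).
    target.start_copy_from_url(snapshot_url)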