Migrate Delta Lake from Azure to GCP

We are looking to migrate our Delta Lake from Azure to GCP.
So far we are thinking about moving the Delta files from an ADLS container to a GCS bucket, but we suspect there is more to it than that.
We are looking for a methodology, best practices, and hints on how to do that migration. Can anybody help with that, please?

You might like to check the sources and sinks supported by the Cloud Storage Transfer Service. One of the sources is Azure Blob Storage, including Azure Data Lake Storage Gen2, so that may help in your case. There is also documentation about configuring access to the source.
All other details depend on your case, and it is very difficult to provide a general answer.
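If the Transfer Service does fit, the job can also be created programmatically. Below is a minimal sketch using the google-cloud-storage-transfer Python client; the project, storage account, container, bucket names, and SAS token are placeholders for this example, not values from your setup.

```python
# Minimal sketch: a Storage Transfer Service job that copies an
# Azure Blob Storage / ADLS Gen2 container into a GCS bucket.
# Requires: pip install google-cloud-storage-transfer
# All names below (project, account, container, bucket) are placeholders.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": "my-gcp-project",                   # hypothetical project
    "description": "Copy Delta Lake files from ADLS Gen2 to GCS",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "transfer_spec": {
        "azure_blob_storage_data_source": {
            "storage_account": "mydeltalakeaccount",  # hypothetical account
            "container": "delta",                     # hypothetical container
            "azure_credentials": {
                "sas_token": "<SAS token with read and list permissions>",
            },
        },
        "gcs_data_sink": {"bucket_name": "my-delta-bucket"},
    },
}

job = client.create_transfer_job({"transfer_job": transfer_job})
print("Created transfer job:", job.name)
```

One Delta-specific point worth checking: as long as the table's _delta_log references its data files by relative paths (the default for normal writes), the copied table should be readable in place from gs://my-delta-bucket/... after the transfer; tables that contain absolute paths (for example after a shallow clone) need extra handling.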

Related

Are you able to apply Azure Information Protection on Azure Data Lake or Blob Storage?

I'd like to know whether you can apply Azure Information Protection to files that are in Azure Data Lake or Blob storage.
I can't seem to find any documentation that confirms this.
Cheers
Tim

Any good alternative to AzCopy for data movement?

I have hundreds of TB of data to move from S3 to Blob Storage. Is there a good alternative to AzCopy? AzCopy uses a lot of bandwidth and has high CPU usage, so I don't want to use it. These issues still occur in AzCopy v10 even after applying the required parameters. Can someone help me in this regard? I have done some research on it but have not found an alternative.
I agree with #S RATH.
For moving big data, Data Factory is the best alternative to AzCopy, and it has better copy performance.
Data Factory supports both Amazon S3 and Blob Storage as connectors.
With the Copy activity, you can create Amazon S3 as the source dataset and Blob Storage as the sink dataset.
See these tutorials:
Copy data from Amazon Simple Storage Service by using Azure Data Factory: This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure Data Factory, read the introductory article.
Copy and transform data in Azure Blob storage by using Azure Data Factory: This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Blob storage. It also describes how to use the Data Flow activity to transform data in Azure Blob storage. To learn about Azure Data Factory, read the introductory article.
Data Factory also provides many ways to improve copy performance; see the Copy activity performance and scalability guide.
I think it will help you save a lot of time, and as we know, time is money.
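For a rough idea of what that setup looks like outside the authoring UI, here is a sketch using the azure-mgmt-datafactory Python SDK to wire up an S3-to-Blob copy pipeline. The resource group, factory, bucket, container, and credential values are placeholders, and the binary-copy choice is just one option for moving objects as-is.

```python
# Sketch only: linked services, binary datasets, and a Copy pipeline that
# moves objects from Amazon S3 to Azure Blob Storage with Data Factory.
# Requires: pip install azure-identity azure-mgmt-datafactory
# All resource names, keys, and paths below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AmazonS3LinkedService, AzureBlobStorageLinkedService, LinkedServiceResource,
    LinkedServiceReference, BinaryDataset, DatasetResource, DatasetReference,
    AmazonS3Location, AzureBlobStorageLocation,
    CopyActivity, BinarySource, BinarySink, PipelineResource, SecureString,
)

sub_id, rg, factory = "<subscription-id>", "my-rg", "my-adf"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Linked services: where the data lives (S3) and where it goes (Blob Storage).
adf.linked_services.create_or_update(rg, factory, "S3LinkedService",
    LinkedServiceResource(properties=AmazonS3LinkedService(
        access_key_id="<aws-access-key-id>",
        secret_access_key=SecureString(value="<aws-secret-access-key>"))))
adf.linked_services.create_or_update(rg, factory, "BlobLinkedService",
    LinkedServiceResource(properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value="<blob-connection-string>"))))

# Binary datasets: copy the objects as-is, without parsing their format.
adf.datasets.create_or_update(rg, factory, "S3Source",
    DatasetResource(properties=BinaryDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="S3LinkedService"),
        location=AmazonS3Location(bucket_name="my-s3-bucket", folder_path="data"))))
adf.datasets.create_or_update(rg, factory, "BlobSink",
    DatasetResource(properties=BinaryDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobLinkedService"),
        location=AzureBlobStorageLocation(container="landing", folder_path="data"))))

# A single Copy activity wrapped in a pipeline, then triggered once.
copy = CopyActivity(
    name="CopyS3ToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="S3Source")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BlobSink")],
    source=BinarySource(), sink=BinarySink())
adf.pipelines.create_or_update(rg, factory, "S3ToBlobPipeline",
    PipelineResource(activities=[copy]))
run = adf.pipelines.create_run(rg, factory, "S3ToBlobPipeline", parameters={})
print("Started pipeline run:", run.run_id)
```

At that data volume you would also want to tune the copy-performance settings mentioned above (parallel copies, data integration units) rather than rely on the defaults.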

Use Data Lake or Blob on an HDInsight cluster on Azure

When creating an HDInsight Hadoop cluster in Azure there are two storage options: either Azure Data Lake Store (ADLS) or Azure Blob Storage.
What are the real differences between these two options, and how do they affect performance?
I found this page https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage
But it is not very specific and only uses very general terms like "ADLS is optimized for analytics".
Does that mean it is better for storing the HDInsight file system? And if ADLS is indeed faster, then why not use it for non-analytics data as well?
As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.
Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.
Hope this helps.
In addition to Ashok's answer: ADLS is currently only available in a few regions, compared to Azure Storage. So if you need your HDInsight account in a specific region, you should make sure your storage is in the same region.
Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.
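To make the security-model difference concrete, here is a small sketch of granting an AAD principal access to a folder through a POSIX-style ACL. Note that it uses the current azure-storage-file-datalake (ADLS Gen2) SDK for illustration rather than the Gen1 SDK this question originally concerned, and the account, filesystem, path, and object ID are made-up placeholders.

```python
# Sketch: grant an AAD principal read/execute on a folder via a POSIX-style ACL.
# Uses the ADLS Gen2 SDK (azure-storage-file-datalake) for illustration;
# account name, filesystem, path, and object ID below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),
)

directory = service.get_file_system_client("analytics").get_directory_client("raw/events")

# ACL entries reference AAD object IDs rather than shared access keys:
# the owner keeps rwx, one named user gets r-x, everyone else gets nothing.
directory.set_access_control(
    acl="user::rwx,user:11111111-2222-3333-4444-555555555555:r-x,"
        "group::r-x,mask::r-x,other::---"
)
print(directory.get_access_control())
```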
In addition to the other answers, it is not possible to use the Spark activity of Data Factory on HDInsight clusters that use Data Lake as the primary storage. This limitation applies to both ADF v1 and v2, as seen here: https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-spark

Does the Azure Data Lake Store offer any encryption?

I am under the impression that Azure Data Lake Store does not currently offer any encryption at rest (the way Azure Blob Storage does). I managed to find some vague mention of this on the official website, suggesting it is coming soon.
Is this your understanding as well? Does this cover the databases stored under Azure Data Lake Analytics as well?
Actually encryption is available in preview on ADL Storage right now. If you contact us we can give you access to the preview.
Azure Data Lake Store encryption is now generally available, not in preview any more. You can specify encryption at rest when creating the ADLS account in the portal: the options are to use keys managed by Data Lake Store (the default), keys from your own Azure Key Vault, or no encryption.

How to transfer CSV files from Google Cloud Storage to Azure Data Lake Store

I'd like to have our daily CSV log files transferred from GCS to Azure Data Lake Store, but I can't really figure out the easiest way to do it.
Is there a built-in solution for that?
Can I do it with Data Factory?
I'd rather avoid running a scheduled VM that does this through the APIs. The idea comes from the GCS->(Dataflow->)BigQuery solution.
Thanks for any ideas!
Yes, you can move data from Google Cloud Storage to Azure Data Lake Store using Azure Data Factory by developing a custom copy activity. However, in this activity you will be using the APIs to transfer the data. See the details in this article.
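For a rough idea of what the copy logic inside such a custom activity (or any small scheduled job) could look like, here is a sketch that reads objects from GCS and writes them into the lake. It uses the ADLS Gen2 SDK (azure-storage-file-datalake) rather than the Gen1 API the original question was about, and the bucket, account, filesystem, and prefix names are placeholders.

```python
# Sketch: copy daily CSV logs from a GCS bucket into an ADLS Gen2 filesystem.
# Requires: pip install google-cloud-storage azure-storage-file-datalake
# Bucket, account, filesystem, and prefix names below are placeholders.
from google.cloud import storage
from azure.storage.filedatalake import DataLakeServiceClient

gcs = storage.Client()                   # uses application default credentials
bucket = gcs.bucket("my-log-bucket")     # hypothetical bucket

adls = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical account
    credential="<storage-account-key>",
)
filesystem = adls.get_file_system_client("logs")

# Copy every object under the daily prefix, keeping the same relative path.
for blob in bucket.list_blobs(prefix="daily/2024-01-01/"):
    data = blob.download_as_bytes()
    filesystem.get_file_client(blob.name).upload_data(data, overwrite=True)
    print("copied", blob.name, len(data), "bytes")
```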
