Any better alternative to AzCopy for data movement? - azure

I have hundreds of TB of data to move from S3 to Blob Storage. Is there a good alternative to AzCopy? AzCopy uses high bandwidth and has high CPU usage, so I don't want to use it. These issues persist in AzCopy v10 even after applying the recommended parameters. Can someone help me here? I have researched this but have not found an alternative.

I agree with #S RATH.
For moving big data, Data Factory is the best alternative to AzCopy, and it has better copy performance:
Data Factory supports Amazon S3 and Blob Storage as connectors.
With the Copy activity, you can use Amazon S3 as the source dataset and Blob Storage as the sink dataset.
See these tutorials:
Copy data from Amazon Simple Storage Service by using Azure Data Factory: This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure Data Factory, read the introductory article.
Copy and transform data in Azure Blob storage by using Azure Data Factory: This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Blob storage. It also describes how to use the Data Flow activity to transform data in Azure Blob storage. To learn about Azure Data Factory, read the introductory article.
Data Factory also provides many ways to improve copy performance; see the Copy activity performance and scalability guide.
I think it will save you a lot of time, and as we know, time is money.
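To make that concrete, here is a minimal sketch of the same setup using the azure-mgmt-datafactory Python SDK. It assumes an existing data factory, and the subscription, resource group, bucket, container, and credential values are all placeholders; the identical pipeline can just as well be built in the ADF authoring UI.

    # Minimal sketch: ADF Copy activity from Amazon S3 to Blob Storage (placeholder names).
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, AmazonS3LinkedService, AzureBlobStorageLinkedService,
        DatasetResource, BinaryDataset, AmazonS3Location, AzureBlobStorageLocation,
        LinkedServiceReference, DatasetReference, PipelineResource,
        CopyActivity, BinarySource, BinarySink, SecureString,
    )

    rg, df = "<resource-group>", "<data-factory-name>"
    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Linked services for the S3 source and the Blob Storage sink.
    adf.linked_services.create_or_update(rg, df, "S3LinkedService", LinkedServiceResource(
        properties=AmazonS3LinkedService(
            access_key_id="<aws-access-key-id>",
            secret_access_key=SecureString(value="<aws-secret-access-key>"))))
    adf.linked_services.create_or_update(rg, df, "BlobLinkedService", LinkedServiceResource(
        properties=AzureBlobStorageLinkedService(
            connection_string="<blob-storage-connection-string>")))

    # Binary datasets: files are copied as-is, with no parsing overhead.
    adf.datasets.create_or_update(rg, df, "S3SourceDataset", DatasetResource(
        properties=BinaryDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="S3LinkedService"),
            location=AmazonS3Location(bucket_name="<s3-bucket>", folder_path="<prefix>"))))
    adf.datasets.create_or_update(rg, df, "BlobSinkDataset", DatasetResource(
        properties=BinaryDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="BlobLinkedService"),
            location=AzureBlobStorageLocation(container="<container>", folder_path="<prefix>"))))

    # Pipeline with a single Copy activity, then start a run.
    copy = CopyActivity(
        name="CopyS3ToBlob",
        inputs=[DatasetReference(type="DatasetReference", reference_name="S3SourceDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="BlobSinkDataset")],
        source=BinarySource(),
        sink=BinarySink())
    adf.pipelines.create_or_update(rg, df, "S3ToBlobPipeline", PipelineResource(activities=[copy]))

    run = adf.pipelines.create_run(rg, df, "S3ToBlobPipeline")
    print("Started pipeline run:", run.run_id)

The copy itself then runs on Data Factory's integration runtime, so it does not consume bandwidth or CPU on your own machines, and parallelism can be tuned per the performance guide above.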

Related

Migrate delta lake from Azure to GCP

We are looking to migrate our delta lake from Azure to GCP.
So far we are thinking about moving the delta files from an ADLS container to a GCS bucket, but we believe there might be more to it than that.
We are looking for a methodology, best practices, and hints on how to do that migration. Can anybody help with that, please?
You might like to check the sources and sinks of the Cloud Storage Transfer Service. One of the supported sources is Azure Blob Storage, including Azure Data Lake Storage Gen2; I don't know whether that helps in your case. There is also documentation about configuring access.
All other details depend on your case, and it is very difficult to provide a general answer.
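To show roughly what that looks like, here is a sketch of creating such a transfer job with the google-cloud-storage-transfer Python client; the project, storage account, container, SAS token, and bucket values are placeholders, and since no schedule is set the job would be started on demand.

    # Rough sketch: Storage Transfer Service job from Azure Blob/ADLS Gen2 to GCS (placeholder values).
    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()

    job = client.create_transfer_job(
        storage_transfer.CreateTransferJobRequest(
            transfer_job=storage_transfer.TransferJob(
                project_id="<gcp-project-id>",
                description="Copy delta files from ADLS Gen2 to GCS",
                status=storage_transfer.TransferJob.Status.ENABLED,
                transfer_spec=storage_transfer.TransferSpec(
                    azure_blob_storage_data_source=storage_transfer.AzureBlobStorageData(
                        storage_account="<storage-account-name>",
                        container="<container-name>",
                        azure_credentials=storage_transfer.AzureCredentials(
                            sas_token="<sas-token>"),
                    ),
                    gcs_data_sink=storage_transfer.GcsData(bucket_name="<gcs-bucket>"),
                ),
            )
        )
    )
    print("Created transfer job:", job.name)

Note that this only copies the objects (the Parquet files plus the _delta_log folder); re-registering the Delta tables on the GCP side is a separate step.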

When to use Data Factory (copy) over direct pull in SQL synapse

I am going through some Microsoft documentation and doing hands-on work for data engineering.
I have a couple of queries about this scenario: copy CSV file(s) from Blob Storage to Synapse Analytics (stage table(s)):
I read that we can do a direct data pull into Synapse by creating external tables. (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw)
If the above is possible, in what cases would we use the Azure Data Factory Copy or Data Flow method?
While working with Azure Data Factory, is it a good idea to use PolyBase, given that it will use Blob Storage again for staging in this scenario (i.e., I am copying the file from Blob and using Blob again for staging)?
I searched for answers to these queries but haven't found a satisfactory answer yet.
If you're just straight loading data from CSV into the DW, use Copy. PolyBase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.
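For reference, the "direct pull" route from the question (external tables and PolyBase, as in the linked tutorial) boils down to a handful of T-SQL statements. Below is a hedged sketch run from Python via pyodbc; the server, credentials, storage account, folder, and column definitions are all placeholders.

    # Sketch of the "direct pull" load with PolyBase external tables (placeholder values).
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<server>.sql.azuresynapse.net;DATABASE=<dedicated-pool>;"
        "UID=<user>;PWD=<password>;Encrypt=yes",
        autocommit=True)  # run the DDL/CTAS outside an explicit transaction
    cur = conn.cursor()

    statements = [
        # External data source pointing at the Blob container holding the CSVs.
        # (A DATABASE SCOPED CREDENTIAL would be added here for a private storage account.)
        """CREATE EXTERNAL DATA SOURCE StageBlobStorage
           WITH (TYPE = HADOOP,
                 LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net')""",
        # CSV file format; skip the header row.
        """CREATE EXTERNAL FILE FORMAT CsvFormat
           WITH (FORMAT_TYPE = DELIMITEDTEXT,
                 FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2))""",
        # External table over the CSV folder (placeholder columns).
        """CREATE EXTERNAL TABLE dbo.Ext_StageSales (Id INT, Amount DECIMAL(18, 2))
           WITH (LOCATION = '/sales/', DATA_SOURCE = StageBlobStorage, FILE_FORMAT = CsvFormat)""",
        # CTAS into a regular stage table - this is the parallel PolyBase load.
        """CREATE TABLE dbo.StageSales
           WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
           AS SELECT * FROM dbo.Ext_StageSales""",
    ]
    for sql in statements:
        cur.execute(sql)
    print("Stage table loaded.")

As I understand it, the ADF Copy activity with the PolyBase option performs a similar load under the hood, which is why it wants the data staged in Blob Storage first.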

Azure Lake to Lake transfer of files

My company has two Azure environments. The first one was a temporary environment and is being re-purposed / decommissioned / I'm not sure. All I know is that I need to get files from a Data Lake in one environment to a Data Lake in the other. I've looked at AdlCopy and AzCopy, and neither seems like it will do what I need. Has anyone encountered this before, and if so, what did you use to solve it?
Maybe you can consider Azure Data Factory; it can help you transfer files or data from one Azure Data Lake to another.
You can reference Copy data to or from Azure Data Lake Storage Gen2 using Azure Data Factory.
This article outlines how to use Copy Activity in Azure Data Factory to copy data to and from Data Lake Storage Gen2. It builds on the Copy Activity overview article that presents a general overview of Copy Activity.
For example, you can learn from this tutorial: Quickstart: Use the Copy Data tool to copy data.
In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a pipeline that copies data from a folder in Azure Blob storage to another folder.
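Once the Copy Data tool (or the authoring UI) has produced that pipeline, triggering it and watching it finish from code is short. A sketch, where the subscription, resource group, factory, and pipeline names are placeholders:

    # Sketch: trigger and monitor an existing ADF pipeline (placeholder names).
    import time
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df, pipeline = "<resource-group>", "<data-factory-name>", "<lake-to-lake-pipeline>"

    run = adf.pipelines.create_run(rg, df, pipeline)
    status = adf.pipeline_runs.get(rg, df, run.run_id).status
    while status in ("Queued", "InProgress"):
        time.sleep(30)  # poll every 30 seconds
        status = adf.pipeline_runs.get(rg, df, run.run_id).status
    print("Pipeline run finished with status:", status)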
Hope this helps.

Logic Apps - Get Blob Content Using Path

I have an event-driven logic app (blob event) which reads a block blob using the path and uploads the content to Azure Data Lake. I noticed the logic app is failing with 413 (RequestEntityTooLarge) when reading a large file (~6 GB). I understand that Logic Apps has a limitation of 1024 MB - https://learn.microsoft.com/en-us/connectors/azureblob/ - but is there any workaround to handle this type of situation? The alternative solution I am working on is moving this step to an Azure Function and getting the content from the blob there. Thanks for your suggestions!
If you want to use an Azure Function, I would suggest you have a look at this article:
Copy data from Azure Storage Blobs to Data Lake Store
There is a standalone version of the AdlCopy tool that you can deploy to your Azure function.
So your logic app will call this function, which will run a command to copy the file from blob storage to your data lake store. I would suggest using a PowerShell function.
Another option would be to use Azure Data Factory to copy file to Data Lake:
Copy data to or from Azure Data Lake Store by using Azure Data Factory
You can create a job that copy file from blob storage:
Copy data to or from Azure Blob storage by using Azure Data Factory
There is a connector to trigger a Data Factory run from a logic app, so you may not need an Azure Function, but it seems there are still some limitations:
Trigger Azure Data Factory Pipeline from Logic App w/ Parameter
You should consider using the Azure Files connector: https://learn.microsoft.com/en-us/connectors/azurefile/
It is currently in preview; the advantage it has over Blob is that it doesn't have a size limit. The above link includes more information about it.
For the benefit of others who might be looking for a solution of this sort:
I ended up creating an Azure Function in C#, as my design dynamically parses the blob name and creates the ADL structure based on it. I used chunked memory streaming for reading the blob and writing it to ADL, with multithreading to address the Azure Functions timeout of 10 minutes.
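The function above is C#, but for anyone who wants the same chunked-streaming idea in Python, here is a rough sketch against ADLS Gen2 (an assumption on my part; the connection strings, container, and target path are placeholders):

    # Sketch: stream a large blob to ADLS Gen2 in chunks instead of loading it into memory.
    from azure.storage.blob import BlobClient
    from azure.storage.filedatalake import DataLakeServiceClient

    blob = BlobClient.from_connection_string(
        "<blob-connection-string>", container_name="<container>", blob_name="<blob-name>")

    lake = DataLakeServiceClient.from_connection_string("<adls-gen2-connection-string>")
    fs = lake.get_file_system_client("<filesystem>")
    file_client = fs.create_file("<folder-derived-from-blob-name>/<file-name>")

    offset = 0
    downloader = blob.download_blob()    # streaming download
    for chunk in downloader.chunks():    # iterates over ~4 MB chunks by default
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)

    file_client.flush_data(offset)       # commit everything appended so far
    print("Copied", offset, "bytes")

In a real function you would still need to watch the execution timeout for multi-GB files, which is what the multithreading mentioned above addresses.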

How to transfer csv files from Google Cloud Storage to Azure Datalake Store

I'd like to have our daily CSV log files transferred from GCS to Azure Data Lake Store, but I can't really figure out the easiest way to do it.
Is there a built-in solution for that?
Can I do that with Data Factory?
I'd rather avoid running a scheduled VM that does this with the APIs. The idea comes from the GCS -> (Dataflow ->) BigQuery solution.
Thanks for any ideas!
Yes, you can move data from Google Cloud Storage to Azure Data Lake Store using Azure Data Factory by developing a custom copy activity. However, in this activity, you will be using APIs to transfer that data. See the details in this article.
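The core of such a custom activity is just the two SDKs talking to each other. A minimal sketch of the per-file transfer (bucket, prefix, filesystem, and connection string are placeholders, and ADLS Gen2 is assumed here rather than Gen1):

    # Sketch of the transfer a custom copy activity would perform (placeholder names, ADLS Gen2 assumed).
    from google.cloud import storage
    from azure.storage.filedatalake import DataLakeServiceClient

    gcs = storage.Client()
    lake = DataLakeServiceClient.from_connection_string("<adls-connection-string>")
    fs = lake.get_file_system_client("<filesystem>")

    # Copy every daily CSV log under the given prefix, mirroring the object paths.
    for blob in gcs.list_blobs("<gcs-bucket>", prefix="logs/daily/"):
        data = blob.download_as_bytes()    # fine for modestly sized log files
        fs.get_file_client(blob.name).upload_data(data, overwrite=True)
        print("Copied", blob.name, "-", len(data), "bytes")

The pipeline trigger then takes care of the daily schedule, so no separate scheduled VM is needed.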
