I have a large volume of data (~20 TB) in Azure Blob Storage that I want to access from a Spark cluster running on Amazon EMR. What is the best way to do this? Is transferring the data to S3 the only option? If yes, what is the cheapest way to transfer it to S3?
Thanks!
Based on your description, you have about 20 TB of data that you want to transfer to Amazon S3. I am not familiar with the Amazon side, but on the Azure side you will be charged for outbound data transfer; see the Azure bandwidth pricing page. At roughly $0.08 per GB, 20 * 1024 * 0.08 = $1,638.40, which is very expensive, so I would suggest considering other approaches. If cost is not a concern, you can look for an existing transfer tool or write your own code to move the data.
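As a rough sanity check, here is the same estimate as a short Python snippet. It assumes a flat $0.08/GB egress rate; actual Azure bandwidth pricing is tiered and varies by region.

# Ballpark Azure egress cost estimate. Assumes a flat $0.08/GB rate;
# real bandwidth pricing is tiered and region-dependent.
data_tb = 20
rate_per_gb = 0.08

data_gb = data_tb * 1024              # 20480 GB
cost = data_gb * rate_per_gb
print(f"{data_gb} GB at ${rate_per_gb}/GB ~= ${cost:,.2f}")  # ~$1,638.40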
Related
We are looking to migrate our Delta Lake from Azure to GCP.
So far we are thinking about moving the Delta files from an ADLS container to a GCS bucket, but we believe there might be more to it than that.
We are looking for a methodology, best practices, and hints on how to do that migration. Can anybody help with that, please?
You might like to check the supported sources and sinks of the Cloud Storage Transfer Service. One of the supported sources is Azure Blob Storage, including Azure Data Lake Storage Gen2; I don't know whether that helps in your case. There is also some documentation about access configuration.
All other details depend on your case, and it is very difficult to provide a general answer.
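If the Storage Transfer Service does fit, a transfer job from an Azure Blob container to a GCS bucket can also be created programmatically. Below is a minimal sketch using the google-cloud-storage-transfer Python client; the project, account, container, bucket, and SAS token values are placeholders, and the SAS token needs list/read permission on the container.

# Minimal sketch: create a Storage Transfer Service job that copies an
# Azure Blob Storage container into a GCS bucket. All names below are
# placeholders.
from google.cloud import storage_transfer  # pip install google-cloud-storage-transfer


def create_azure_to_gcs_job(project_id, azure_account, azure_container, sas_token, gcs_bucket):
    client = storage_transfer.StorageTransferServiceClient()

    request = storage_transfer.CreateTransferJobRequest(
        {
            "transfer_job": {
                "project_id": project_id,
                "status": storage_transfer.TransferJob.Status.ENABLED,
                "transfer_spec": {
                    "azure_blob_storage_data_source": {
                        "storage_account": azure_account,
                        "container": azure_container,
                        "azure_credentials": {"sas_token": sas_token},
                    },
                    "gcs_data_sink": {"bucket_name": gcs_bucket},
                },
            }
        }
    )

    job = client.create_transfer_job(request)
    print(f"Created transfer job: {job.name}")
    return job

This copies objects as-is; since a Delta table is just Parquet data files plus its _delta_log directory, an object-level copy of the whole folder tree preserves the table.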
I would like to create an Azure Storage Account, and use blob storage in the region US West.
However, my business need is to upload/download files from all over the world, not just US West.
When I download/upload files from India or places that are far from US West, there is a severe degradation in performance.
For downloads I could use the geo-redundant read replica (RA-GRS), which partially solves the problem. However, this increases the cost significantly, and replication takes several minutes, which does not work for me.
In AWS S3 there is a feature called Transfer Acceleration, which speeds up uploads and downloads by optimizing the routing of packets. Is there a similar feature in Azure?
You may use AzCopy (a command-line utility that you can use to copy blobs or files to or from a storage account; the linked article helps you download AzCopy, connect to your storage account, and then transfer files), Fast Data Transfer, or Azure Data Factory (a fully managed, serverless data integration solution for ingesting, preparing, and transforming all your data at scale).
High-Throughput with Azure Blob Storage
You should look at the Azure Storage Data Movement Library: https://azure.microsoft.com/en-us/blog/introducing-azure-storage-data-movement-library-preview-2/
You should also take into account the performance guidelines from the Azure Storage team: https://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/
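The Data Movement Library itself is a .NET package. If you are scripting transfers in Python instead, the same parallel, chunked-transfer idea is exposed by the azure-storage-blob SDK; here is a minimal sketch, where the connection string, container, and file names are placeholders.

# Minimal sketch: high-throughput upload of a large file with the
# azure-storage-blob SDK. Connection string, container, and paths are
# placeholders; max_concurrency controls how many blocks are uploaded
# in parallel.
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

service = BlobServiceClient.from_connection_string(
    "<storage-connection-string>",
    max_block_size=8 * 1024 * 1024,       # upload in 8 MiB blocks
    max_single_put_size=8 * 1024 * 1024,  # force block uploads above 8 MiB
)
blob = service.get_blob_client(container="mycontainer", blob="bigfile.bin")

with open("bigfile.bin", "rb") as data:
    blob.upload_blob(data, overwrite=True, max_concurrency=8)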
This article provides an overview of some of the common Azure data transfer solutions. The article also links out to recommended options depending on the network bandwidth in your environment and the size of the data you intend to transfer. Choose an Azure solution for data transfer
I have hundreds of TB of data to move from S3 to Blob Storage. Is there a good alternative to AzCopy? AzCopy uses a lot of bandwidth and has high CPU usage, so I don't want to use it; even in AzCopy v10 these issues persist after applying the relevant parameters. Can someone help me in this regard? I have done some research on it but have not found an alternative.
I agree with #S RATH.
For moving big data, Data Factory is the best alternative to AzCopy; it has better copy performance.
Data Factory supports both Amazon S3 and Blob Storage as connectors.
With the Copy activity, you can create Amazon S3 as the source dataset and Blob Storage as the sink dataset.
See these tutorials:
Copy data from Amazon Simple Storage Service by using Azure Data Factory: this article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure Data Factory, read the introductory article.
Copy and transform data in Azure Blob storage by using Azure Data Factory: this article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Blob storage, and how to use the Data Flow activity to transform data in Azure Blob storage.
Data Factory also provides many ways to improve copy performance; see the Copy activity performance and scalability guide.
I think it will save you a lot of time, and as we know, time is money.
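If you prefer to drive the copy from code once the pipeline is authored, it can be triggered and monitored with the azure-mgmt-datafactory SDK. A minimal sketch, where the subscription, resource group, factory, and pipeline names are placeholders for resources you have already created:

# Minimal sketch: trigger an existing Data Factory pipeline (e.g. an
# S3 -> Blob copy pipeline) and poll its status. The subscription,
# resource group, factory, and pipeline names are placeholders.
import time

from azure.identity import DefaultAzureCredential               # pip install azure-identity
from azure.mgmt.datafactory import DataFactoryManagementClient  # pip install azure-mgmt-datafactory

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "CopyS3ToBlob"   # hypothetical pipeline name

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = adf.pipelines.create_run(resource_group, factory_name, pipeline_name, parameters={})
print(f"Started run {run.run_id}")

# Poll until the run leaves the queued/in-progress states.
while True:
    status = adf.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Run finished with status: {status.status}")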
We have a large amount of (live) data, about 1 PB, that we have to transfer periodically between S3 and Azure Blob Storage. What tools do you use for that, and what strategy do you use to minimize the cost of transfer and downtime?
We have evaluated a number of solutions, including AzCopy, but none of them satisfies all of our requirements. We are a small startup, so we want to avoid homegrown solutions.
Thank you
Azure Data Factory is probably your best bet.
Access the ever-expanding portfolio of more than 80 prebuilt connectors (including Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery) at no additional cost. Data Factory provides efficient and resilient data transfer by using the full capacity of underlying network bandwidth, delivering up to 1.5 GB/s throughput.
I'm planning to transfer data from an old FTP server to a www.backblaze.com/b2 bucket.
I'm considering using rclone for this, running it on an AWS machine. In rclone I will configure two remotes, the FTP server and the B2 bucket, and then execute something like:
./rclone sync ftp:/myfolder b2:/myfolder
The full data size is somewhere between 100 GB and 500 GB.
The AWS machine is in South America, and I guess the FTP server is too; I'm not sure about the B2 bucket.
Question: does this consume my AWS network transfer allowance, and will it cost a lot?
Data transfer into AWS is free; you are charged only when you transfer data out from AWS to the internet, and the charge depends on the region and the amount of data you transfer out. For a 100 GB to 500 GB transfer, data transfer alone will cost you $9 to $45. See Data Transfer Cost for more information.
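To make the reasoning explicit: the data flows FTP -> AWS machine (inbound, free) and then AWS machine -> B2 (outbound, billed), so the full volume is charged once as egress. A quick estimate in Python, assuming the roughly $0.09/GB rate implied by the $9 to $45 figures above (actual rates vary by region and are tiered):

# Rough AWS egress cost for relaying data through an EC2 machine.
# Inbound (FTP -> EC2) is free; outbound (EC2 -> B2 over the internet)
# is billed. Assumes ~$0.09/GB, consistent with the $9 to $45 range
# above; real pricing is tiered and region-dependent.
egress_rate_per_gb = 0.09

for size_gb in (100, 500):
    inbound_cost = 0.0                           # data transfer into AWS is free
    outbound_cost = size_gb * egress_rate_per_gb
    print(f"{size_gb} GB: inbound ${inbound_cost:.2f}, outbound ${outbound_cost:.2f}")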