Azure blob copy in cloud - azure

In AWS, the "upload-part-copy" command accepts a byte range, so I can copy portions of two objects into a new object without the data leaving the cloud.
I could not find any comparable mechanism in Azure for copying portions of blobs to a new blob. I tried AzCopy, but it has no option to select a portion of a blob.
Can anyone please tell me whether such a method exists?

As of today, this feature is not available in Azure Blob Storage. A copy operation copies the entire source blob to the destination blob.
A workaround is to download the byte ranges from the source blobs to your local machine and then create a new blob by uploading them as blocks.
If you were using the Blob Service REST API, these are the operations you would need to perform:
1. Read Source Blob 1, specifying the range you want in the Range or x-ms-range request header. Store the fetched data somewhere in your application.
2. Repeat the same for Source Blob 2.
3. Upload the data fetched from the 1st source blob to the new blob using Put Block.
4. Repeat the same for the 2nd source blob.
5. Create the destination blob by committing the block list (Put Block List).
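
The same sequence can be sketched with the Python SDK. This is a minimal sketch, assuming the azure-storage-blob v12 package and BlobClient objects created elsewhere. The function name copy_ranges_to_new_blob, the make_block parameter, and the six-digit block ids are illustrative choices, not part of the original answer. Each range is staged as a single block, so a very large range would have to be split into several blocks.

```python
def copy_ranges_to_new_blob(sources, dest_blob, make_block=None):
    """Build dest_blob from byte ranges of other blobs.

    sources   -- iterable of (source BlobClient, offset, length) tuples
    dest_blob -- BlobClient for the blob to create
    make_block -- hypothetical hook that wraps a block id for the commit call
    """
    if make_block is None:
        # Lazy import (assumption: azure-storage-blob v12 is installed).
        from azure.storage.blob import BlobBlock
        make_block = BlobBlock

    block_list = []
    for index, (src_blob, offset, length) in enumerate(sources):
        # Read only the requested byte range of the source blob.
        data = src_blob.download_blob(offset=offset, length=length).readall()
        # Upload the bytes as an uncommitted block (Put Block).
        block_id = f"{index:06d}"  # ids within one blob must have equal length
        dest_blob.stage_block(block_id=block_id, data=data)
        block_list.append(make_block(block_id))

    # Commit the staged blocks in order (Put Block List).
    dest_blob.commit_block_list(block_list)
    return len(block_list)
```

The data still passes through the machine running the script, as in the workaround described above.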

Related

Extracting Files from Onprem server to Azure Blob storage while filtering files with no data

I am trying to transfer on-premises files to Azure Blob Storage. However, of the 5 files that I have, 1 has "no data", so I can't map its schema. Is there a way to filter out this file while importing it to Azure? Or would I have to import the files into Azure Blob Storage as is and then filter them into another blob container? If so, how would I do this?
DataPath
CompleteFiles Nodata
If your on-premises source is your local file system, first copy the files, keeping the folder structure, to a temporary blob container using AzCopy with a SAS token. Please refer to this thread for details.
Then use an ADF pipeline to filter out the empty files and store the rest in the final blob container.
These are my files in blob container and sample2.csv is an empty file.
First, use a Get Metadata activity to get the list of files in that container.
It lists all the files; pass that array to the ForEach as @activity('Get Metadata1').output.childItems
Inside the ForEach, use a Lookup to get the row count of each file and, if the count is not 0, use a Copy activity to copy the file.
Use a dataset parameter to supply the file name.
Inside the If Condition, give the condition below.
@not(equals(activity('Lookup1').output.count,0))
Inside the True activities, use a Copy activity.
Set the copy sink to another blob container.
Execute this pipeline and you will see that the empty file is filtered out.
If your on-premises source is SQL, use a Lookup to get the list of tables and then use a ForEach. Inside the ForEach, follow the same procedure for each table.
If your on-premises source is something other than the above, first try to copy all files to blob storage and then follow the same procedure.
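
For comparison, the filter step can also be sketched outside ADF with the Python SDK. This is a rough sketch, assuming azure-storage-blob v12 ContainerClient objects and a source URL that the destination service can read (same storage account, or a SAS token appended). The function name copy_non_empty_blobs is hypothetical. Note that it treats "empty" as zero bytes, which is a simplification: the Lookup approach above counts rows, so a file containing only a header line would pass this check.

```python
def copy_non_empty_blobs(source_container, dest_container):
    """Copy every blob larger than zero bytes; return (copied, skipped) names."""
    copied, skipped = [], []
    for blob in source_container.list_blobs():
        if blob.size == 0:
            # Zero-byte blob: leave it behind in the temporary container.
            skipped.append(blob.name)
            continue
        source_url = source_container.get_blob_client(blob.name).url
        # Server-side copy; the call returns before the copy has finished.
        dest_container.get_blob_client(blob.name).start_copy_from_url(source_url)
        copied.append(blob.name)
    return copied, skipped
```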

Load data from public Azure blob in Matillion

I am going through Matillion Academy (Building a Data Warehouse). There is a slide deck to follow online and I am running my own instance of Matillion to recreate the building of the warehouse.
My Matillion is on Azure, as is my Snowflake database.
The training is AWS-based, but gives information about the adjustments needed for Azure or GS.
One of the steps shows how to Load data from blob storage. It is S3 based.
For Azure different components need to be used (as the S3 ones don't exist there), and data needs to be loaded from azure storage instead of S3 storage.
It also explains that for Snowflake on Azure yet another component needs to be used.
I have created a Stage in Snowflake:
CREATE STAGE "onlinemtlntrainingazure_flights"
URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights'
The stage shows in Snowflake (external stage) and in Matillion (when using 'manage stages' on the database). The code is taken from the json file I imported to create the job to do this (see first step below).
I have created the target table in my database. It is accessible and visible in Matillion IDE.
The adjusted component I am to use is 'Azure Blob Storage Load'.
According to the documentation, I will need:
For Snowflake on Azure:
Create a Stage in Snowflake:
You should create a Stage in Snowflake which will be pointing to the
public data we provide. Please, find below the .json file containing
the job that will help you to do this. Don't forget to change the SQL
Script for pointing to your own schema
After Creating the Stage in Snowflake:
You should use the 'Create Table' and the 'Azure Blob Storage Load'
components individually as the 'Azure Blob Load Generator' won't let
you to select the Stage previously created. We have attached below the
Create Table metadata to save you some time.
'Azure Blob Storage Load' Settings:
Stage: onlinemtlntrainingazure_flights
Pattern: training_azure_flights_2016.gz
Target Table: training_flights
Record Delimiter: 0x0a
Skip Header: 1
The source data on Azure is located here:
Azure Blob Container (with flights data)
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz
Unfortunately, the 'Azure Blob Storage Load' component complains when I use these settings:
The stage does not appear in the list, and manually entering the stage name yields an error (unrecognised option). Prefixing the stage name with my schema (and even database) does not help.
The Azure storage location property does not accept the https://... URI to the data files. When I replace 'https' with 'azure', or remove the part after the last '/', it complains with 'Unable to find an account with name: [onlinemtlntrainingazure]'.
Using [Custom] for the stage property removes the error message, but when I run the component the 'Unable to find account' error comes back.
Any thoughts?
Edit: I found a workaround by using the Data Transfer Object, which first copies the files from the public https location to my own Azure blob location and then I process it further from there. But I would like to know how to do it as suggested in the training, and why it now fails.
The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. Instead of using a Snowflake Stage, you might find it easier to copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.
In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.
You already have the https:// source URLs for the three files, so:
Set the source type to HTTPS (no username or password is needed)
Add the source URL
Set the target type to Azure Blob Storage
In the example I used two variables, with defaults set to my storage account and container name
Repeat for all three files
After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
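
Outside Matillion, the same three copies could be sketched with the Python SDK. This is only an illustration, assuming an azure-storage-blob v12 ContainerClient for a container you own and that the source URLs are readable without credentials (as they were for the Data Transfer component). The function name copy_public_files is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Source URLs taken from the question.
SOURCE_URLS = [
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz",
]


def copy_public_files(source_urls, dest_container):
    """Start a server-side copy of each URL into dest_container."""
    copied = []
    for url in source_urls:
        # Reuse the last path segment of the URL as the destination blob name.
        blob_name = posixpath.basename(urlparse(url).path)
        dest_container.get_blob_client(blob_name).start_copy_from_url(url)
        copied.append(blob_name)
    return copied
```

The copies run asynchronously on the service side, so they may still be in progress when the function returns.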

How to copy blob from one container to another using Logic App

In Azure, I have a Blob storage account with two containers, Input and Output. I have a file, say Test1.csv, which I want to copy to the Output container after processing. I need to do this as a step in an Azure Logic App. However, I cannot work out how to configure the Copy Blob action, and in particular how to set the source path URL correctly.
It gives a "file not found" error message.
Thanks
If you want to use the Copy blob action, the simplest way to get the blob URL is to use the Create SAS URI by path action. Then pass that URL and the destination to the Copy blob action.
Alternatively, you can copy the blob with Create blob: first use Get blob content using path to read the blob content, then use Create blob to upload it to the destination container.

Azure: Unable to copy Archive blobs from one storage account to another?

Whenever I try to copy Archive blobs to a different storage account and change their tier at the destination, I get the following error:
Copy source blob has been modified. ErrorCode: CannotVerifyCopySource
I have tried copying Hot/Cool blobs to Hot/Cool/Archive. I am facing the issue only while copying Archive to Hot/Cool/Archive. Also, there is no issue while copying within same storage account.
I am using Azure python SDK:
blob_url = source_block_blob_service.make_blob_url(copy_from_container, blob_name, sas_token = sas)
dest_blob_service.copy_blob(copy_to_container, blob_name, blob_url, requires_sync = True, standard_blob_tier = 'Hot')
You're getting this error because copying an archived blob is only supported within the same storage account, and you're trying to copy across storage accounts.
From the REST API documentation page:
Copying Archived Blob (version 2018-11-09 and newer)
An archived blob can be copied to a new blob within the same storage
account. This will still leave the initially archived blob as is. When
copying an archived blob as source the request must contain the header
x-ms-access-tier indicating the tier of the destination blob. The data
will be eventually copied to the destination blob.
While a blob is in the archive access tier, it's considered offline and can't be read or modified.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-rehydration
To read the blob, you need to rehydrate it first. Alternatively, as described in the link above, you can use the CopyBlob operation. I am not sure whether the Python SDK copy_blob() operation uses that API behind the scenes; perhaps not, since it did not work that way for you.
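
The "rehydrate first" route can be sketched as follows. This is a rough sketch, assuming the newer azure-storage-blob v12 BlobClient API rather than the legacy SDK used in the question; rehydrate_then_copy, its parameters, and the polling interval are hypothetical. It changes the tier of the source blob in place, which may not be what you want, and source_url must be readable by the destination account (for example via a SAS token). Rehydration can take hours.

```python
import time


def rehydrate_then_copy(src_blob, dest_blob, source_url, tier="Hot",
                        poll_seconds=600, sleep=time.sleep):
    """Rehydrate an archived source blob, then start a copy to dest_blob."""
    props = src_blob.get_blob_properties()
    if props.blob_tier == "Archive" and not props.archive_status:
        # No rehydration pending yet, so request one by changing the tier.
        src_blob.set_standard_blob_tier(tier)

    polls = 0
    # The blob keeps reporting the Archive tier until rehydration completes.
    while src_blob.get_blob_properties().blob_tier == "Archive":
        sleep(poll_seconds)
        polls += 1

    # The source is now online, so a cross-account copy can read it.
    dest_blob.start_copy_from_url(source_url)
    return polls
```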

Could not verify the copy source within the specified time. RequestId: (blank)

I am trying to copy some blob files from one storage account to another one. I am using AzCopy in order to fulfill this goal.
The process works for copying files between containers within the same storage account, but not between different storage accounts.
The command I am issuing is:
AzCopy /Source:https://<storage_account1>.blob.core.windows.net/<container_name1>/<path_to_desired_blobs> /Dest:https://<storage_account2>.blob.core.windows.net/<container_name2>/<path_to_store>/ /SourceKey:<source_key> /DestKey:<dest_key> /Pattern:<some_pattern> /S
The error I am getting is the following:
The remote server returned an error: (400) Bad Request.
Could not verify the copy source within the specified time.
RequestId:
Time:2016-04-01T19:33:01.0527460Z
The only difference between the two storage accounts is that one is Standard, whereas the other one is Premium.
Any help will be appreciated!
From your description, you're trying to copy a block blob from the source account to a page blob in the destination account, which is not supported by the Azure Storage service or AzCopy.
To work around it, first use AzCopy to download the block blobs from the source account to the local file system, and then upload them from the local file system to the destination account with the option /BlobType:Page (this option is only valid when uploading from local to blob).
Premium Storage only supports page blobs. Please confirm that you are copying page blobs from the standard to the premium storage account. Also, set the BlobType parameter to "page" in order to copy the data as page blobs into the destination premium storage account.
From the description, I am assuming your source blob is a block blob. Azure's "Async Copy Blob" process (which AzCopy uses by default) preserves the blob type. That is, you cannot convert a blob from block to page through async copy blob.
Instead, can you try AzCopy again with the "/SyncCopy" option along with the "/BlobType:page" parameter? That might change the destination blob type to page.
(If that doesn't work, the only other solution would be to first download the blob, and then upload it with "/BlobType:page".)

Resources