I want to use the Copy Data tool in Azure Data Factory to connect to Azure Data Explorer (Kusto) and copy large amounts of data to blob storage. There are five steps to go through to set up the Copy Data tool, and I am stuck on number three, "Destination".
Destination Screen
The connection test succeeds, and I created an empty blob container in my storage account specifically to store this data. When I press the Next button I get the error message:
TypeError: Cannot set properties of undefined (setting 'sink')
For the life of me I don't know why... I tried creating a new connection with a different name and I tried different optional parameters, but nothing seems to work.
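A JavaScript TypeError like this comes from the Copy Data wizard's UI rather than from the connections themselves, so one workaround while the wizard misbehaves is to define the copy pipeline outside the wizard. Below is a minimal, hypothetical sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory and dataset names and the KQL query are placeholders, and it assumes the Kusto and blob linked services and datasets already exist in the factory.
# Sketch: create the ADX -> Blob copy pipeline programmatically, bypassing the wizard.
# All names below (resource group, factory, datasets) and the KQL query are placeholders;
# the linked services and datasets are assumed to exist already.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    AzureDataExplorerSource, DelimitedTextSink,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy_activity = CopyActivity(
    name="CopyKustoToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="AdxSourceDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BlobSinkDataset")],
    source=AzureDataExplorerSource(query="MyTable"),  # hypothetical KQL query
    sink=DelimitedTextSink(),
)

adf.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "CopyKustoToBlobPipeline",
    PipelineResource(activities=[copy_activity]),
)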
I am going through Matillion Academy (Building a Data Warehouse). There is a slide deck to follow online and I am running my own instance of Matillion to recreate the building of the warehouse.
My Matillion is on Azure, as is my Snowflake database.
The training is AWS-based, but gives information about the adjustments needed for Azure or GS.
One of the steps shows how to load data from blob storage. It is S3-based.
For Azure, different components need to be used (as the S3 ones don't exist there), and data needs to be loaded from Azure storage instead of S3.
It also explains that for Snowflake on Azure yet another component needs to be used.
I have created a Stage in Snowflake:
CREATE STAGE "onlinemtlntrainingazure_flights"
URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights'
The stage shows up in Snowflake (as an external stage) and in Matillion (when using 'Manage Stages' on the database). The code is taken from the .json file I imported to create the job that does this (see the first step below).
I have created the target table in my database. It is accessible and visible in the Matillion IDE.
The adjusted component I am to use is 'Azure Blob Storage Load'.
According to the documentation, I will need:
For Snowflake on Azure:
Create a Stage in Snowflake:
You should create a Stage in Snowflake pointing to the public data we
provide. Please find below the .json file containing the job that will
help you do this. Don't forget to change the SQL script to point to
your own schema.
After Creating the Stage in Snowflake:
You should use the 'Create Table' and 'Azure Blob Storage Load'
components individually, as the 'Azure Blob Load Generator' won't let
you select the previously created Stage. We have attached the
Create Table metadata below to save you some time.
'Azure Blob Storage Load' Settings:
Stage: onlinemtlntrainingazure_flights
Pattern: training_azure_flights_2016.gz
Target Table: training_flights
Record Delimiter: 0x0a
Skip Header: 1
The source data on Azure is located here:
Azure Blob Container (with flights data)
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz
Unfortunately, when using these settings on the 'Azure Blob Storage Load' component, it complains:
The stage does not appear in the list, and manually entering the stage name yields an 'unrecognised option' error. Prefixing the stage name with my schema (and even the database) does not help. (A quick way to check the stage outside Matillion is sketched after this list.)
The Azure Storage Location property does not accept the https://... URI to the data files. When I replace 'https' with 'azure', or remove the part after the last '/', it complains with 'Unable to find an account with name: [onlinemtlntrainingazure]'.
Using [Custom] for the Stage property removes the error message, but when running the component, the 'Unable to find account' error comes back.
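One way to narrow down the stage problem is to confirm, outside Matillion, that the stage exists in the same database and schema, and is visible to the same role, that the Matillion environment uses (note that a stage created with a quoted lowercase name, as in the CREATE STAGE above, is case-sensitive in Snowflake). Here is a minimal check using the snowflake-connector-python package; all connection values are placeholders.
# Sketch: verify the stage is visible to the role/database/schema Matillion uses.
# All connection values below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account-identifier>",
    user="<user>",
    password="<password>",
    role="<matillion-role>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# List stages matching the name; an empty result means this role/schema cannot see it.
cur.execute("SHOW STAGES LIKE 'onlinemtlntrainingazure_flights'")
for row in cur.fetchall():
    print(row)

# Describe the stage (quoted, because it was created as a quoted lowercase identifier).
cur.execute('DESCRIBE STAGE "onlinemtlntrainingazure_flights"')
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()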
Any thoughts?
Edit: I found a workaround by using the Data Transfer Object, which first copies the files from the public https location to my own Azure blob location and then I process it further from there. But I would like to know how to do it as suggested in the training, and why it now fails.
The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. Instead of using a Snowflake Stage, you might find it easier to simply copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.
In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.
You already have the https:// source URLs for the three files, so:
Set the source type to HTTPS (no username or password is needed)
Add the source URL
Set the target type to Azure Blob Storage
In the example I used two variables, with defaults set to my storage account and container name
Repeat for all three files
After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
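If you would rather script that copy instead of clicking through the Data Transfer component three times, the same HTTPS-to-blob copy can be done with the azure-storage-blob Python package. This is only a sketch; the destination account name, container and credential are placeholders.
# Sketch: copy the three public HTTPS files into your own blob container,
# mirroring what the Matillion Data Transfer component does.
# <my-account>, <my-container> and the credential are placeholders.
from azure.storage.blob import BlobClient

source_urls = [
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz",
]

for url in source_urls:
    blob_name = url.rsplit("/", 1)[-1]
    dest = BlobClient(
        account_url="https://<my-account>.blob.core.windows.net",
        container_name="<my-container>",
        blob_name=blob_name,
        credential="<account-key-or-sas>",
    )
    # Server-side copy from the publicly readable source URL.
    dest.start_copy_from_url(url)
    print(f"Started copy of {blob_name}")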
Need help with this error.
ADF copy activity, moving data from Snowflake to Azure Blob Storage as delimited text.
I am able to preview the Snowflake source data, and I can browse the containers via the sink's Browse option, so this doesn't look like a permissions issue.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Snowflake Export Copy Command validation failed:
'The Snowflake copy command payload is invalid.
Cannot specify property: column mapping,
Source=Microsoft.DataTransfer.ClientLibrary,'
Thanks for your help
Clearing the mapping from the copy activity worked.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Snowflake Export Copy Command validation failed: 'The Snowflake copy command payload is invalid. Cannot specify property: column mapping,Source=Microsoft.DataTransfer.ClientLibrary,'"
To resolve the above error, please check the following:
The error means the COPY command being issued is not valid, so check the copy command you are using.
To get more detail about the error, try running the COPY with a validation mode, for example:
COPY INTO MYTABLE VALIDATION_MODE = 'RETURN_ERRORS' FILES=('result.csv');
Check if you have any issue with the data files before loading the data again.
Try checking the storage account you are currently using and note that Snowflake doesn't support Data Lake Storage Gen1.
Use the COPY INTO command to copy the data from the Snowflake table into an Azure Blob Storage container.
Note:
Use the blob.core.windows.net endpoint for all supported types of Azure blob storage accounts, including Data Lake Storage Gen2.
Make sure you have either the ACCOUNTADMIN role or a role with the global CREATE INTEGRATION privilege to run the sample command below:
copy into 'azure://myaccount.blob.core.windows.net/mycontainer/unload/' from mytable storage_integration = myint;
With the above command, you do not need to include credentials to access the storage; the storage integration handles authentication.
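As a rough end-to-end sketch of that flow from Python (using snowflake-connector-python): the integration name, tenant ID, storage account, container and table below are placeholders, and you still need to grant the Snowflake service principal access to the container in Azure as described in the documentation linked below.
# Sketch: create an Azure storage integration and unload a table to blob storage.
# Integration name, tenant ID, storage account, container and table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account-identifier>",
    user="<user>",
    password="<password>",
    role="ACCOUNTADMIN",  # or a role with the global CREATE INTEGRATION privilege
)
cur = conn.cursor()

# One-time setup: the storage integration that holds the Azure tenant/consent details.
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS myint
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'AZURE'
      ENABLED = TRUE
      AZURE_TENANT_ID = '<tenant-id>'
      STORAGE_ALLOWED_LOCATIONS = ('azure://myaccount.blob.core.windows.net/mycontainer/unload/')
""")

# Unload the table to the container; no credentials are needed in the COPY itself.
cur.execute("""
    COPY INTO 'azure://myaccount.blob.core.windows.net/mycontainer/unload/'
    FROM mytable
    STORAGE_INTEGRATION = myint
""")

cur.close()
conn.close()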
For more detail, please refer to the links below:
Unloading into Microsoft Azure — Snowflake Documentation
COPY INTO <table> — Snowflake Documentation
I send my Azure Databricks diagnostic logs to a storage account, and by default Microsoft writes those entries as append blobs. When I tried to read the JSON data with an access key, I got this error:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB
Is there any way to read that data path directly (insights-logs-jobs.mdd.blob.core.windows.net/resourceId=/SUBSCRIPTIONS/xxxxxx-xxxx-xxxxx/RESOURCEGROUPS/ssd--RG/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/addd-PROCESS-xx-ADB/y%3D2021/m%3D12/d%3D07/h%3D00/m%3D00/PT1H.json)?
As a second approach, I copied the data to another container, where it arrives as block blobs, and reading that with Databricks works. But I need to automate copying the data from multiple containers to another one, as shown in the diagram.
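One way to automate that copy is a small job that reads each append blob and rewrites it as a block blob in the destination container, using the azure-storage-blob Python package. This is only a sketch; the account, container names and credential are placeholders, and very large logs would need chunked uploads rather than a single readall().
# Sketch: copy append blobs (diagnostic logs) into another container as block blobs
# so Databricks can read them with the standard driver. Names and credential are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<account-key-or-sas>",
)
src = service.get_container_client("insights-logs-jobs")  # source container with append blobs
dst = service.get_container_client("<destination-container>")

for blob in src.list_blobs():
    data = src.download_blob(blob.name).readall()  # read the append blob
    # upload_blob writes a block blob by default, which Databricks can read.
    dst.upload_blob(name=blob.name, data=data, overwrite=True)
    print(f"Copied {blob.name}")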
We're migrating from blob storage to ADLS Gen 2 and we want to test the access to Data Lake from DataBricks. I created a service principal which has Blob Storage Reader and Blob Storage Contributor access to Data Lake.
My notebook sets the below spark config:
spark.conf.set("fs.azure.account.auth.type","OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type","org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id","<clientId")
spark.conf.set("fs.azure.account.oauth2.client.secret","<secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint","https://login.microsoftonline.com/<endpoint>/oauth2/token")
//I replaced the values in my notebook with correct values from my service principal
When I run the code below, the contents of the directory are shown correctly:
dbutils.fs.ls("abfss://ado-raw@<storage account name>.dfs.core.windows.net")
I can read a small text file from my data lake which is only 3 bytes,
but when I try to show its content, the cell gets stuck at "Running command..." and nothing happens (the read I mean is sketched below).
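For reference, the read is nothing more exotic than something like this (the file path is a placeholder):
# Sketch of the 3-byte file read described above; the path is a placeholder.
path = "abfss://ado-raw@<storage account name>.dfs.core.windows.net/<path-to-small-file>.txt"
dbutils.fs.head(path)             # gets stuck at "Running command..."
spark.read.text(path).show()      # same behaviour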
What do you think the issue is, and how do I resolve it?
Thanks in advance
The issue was the private and public subnets had been deleted by mistake and then recreated using a different IP range. They need to be on the same range as the management subnet, otherwise the private endpoint set up for the storage account won’t work.
In AWS, the "upload-part-copy" command has an option for byte ranges. If I want to copy portions of two objects to a new object within the cloud, I can do so using the "upload-part-copy" command.
I could not find any such method or mechanism to copy portions of blobs to a new blob in Azure. I tried AzCopy, but it does not have any option to select a portion of a blob.
Can anyone please help me if there is any method like that.
As of today, this feature is not available in Azure Blob Storage. A copy operation copies the entire source blob to the destination blob.
A workaround would be to download the byte ranges (blocks) from the source blobs to your local machine and then create a new blob by uploading these blocks.
If you were using the Blob Service REST API, these are the operations you would need to perform:
Read source blob 1 by specifying the byte range you would like to read in the Range or x-ms-range request header. Store the fetched data somewhere in your application.
Repeat the same for source blob 2.
Now create a new blob by uploading the data fetched from the first source blob using Put Block.
Repeat the same for the second source blob.
Create the destination blob by committing the block list (Put Block List).
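For completeness, here is roughly what those steps look like with the azure-storage-blob Python package, where Put Block and Put Block List correspond to stage_block and commit_block_list. It is only a sketch; the account, container, blob names, credential and byte ranges are placeholders.
# Sketch: build a new blob from byte ranges of two source blobs.
# Account, container, blob names, credential and ranges are placeholders.
import uuid
from azure.storage.blob import BlobServiceClient, BlobBlock

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<account-key-or-sas>",
)
container = service.get_container_client("<container>")
dest = container.get_blob_client("combined-blob")

block_ids = []

# (source blob name, offset, length) - the byte ranges to stitch together
ranges = [
    ("source-blob-1", 0, 1024),     # first 1 KiB of source blob 1
    ("source-blob-2", 2048, 4096),  # 4 KiB of source blob 2, starting at offset 2048
]

for name, offset, length in ranges:
    # Ranged read (equivalent to the Range / x-ms-range header on Get Blob).
    data = container.download_blob(name, offset=offset, length=length).readall()
    # Stage the data as an uncommitted block on the destination blob (Put Block).
    block_id = uuid.uuid4().hex
    dest.stage_block(block_id=block_id, data=data)
    block_ids.append(BlobBlock(block_id=block_id))

# Commit the staged blocks in order to create the destination blob (Put Block List).
dest.commit_block_list(block_ids)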