Load data from public Azure blob in Matillion

I am going through Matillion Academy (Building a Data Warehouse). There is a slide deck to follow online and I am running my own instance of Matillion to recreate the building of the warehouse.
My Matillion is on Azure, as is my Snowflake database.
The training is AWS-based, but gives information about the adjustments needed for Azure or GS.
One of the steps shows how to Load data from blob storage. It is S3 based.
For Azure, different components need to be used (as the S3 ones don't exist there), and data needs to be loaded from Azure storage instead of S3.
It also explains that for Snowflake on Azure yet another component needs to be used.
I have created a Stage in Snowflake:
CREATE STAGE "onlinemtlntrainingazure_flights"
URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights'
The stage shows in Snowflake (external stage) and in Matillion (when using 'manage stages' on the database). The code is taken from the json file I imported to create the job to do this (see first step below).
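As a sanity check (just a sketch on my part, assuming the stage sits in my working schema and the container allows anonymous reads), I can list the files through the stage directly in Snowflake before involving Matillion:
-- List the files visible through the external stage created above.
LIST @onlinemtlntrainingazure_flights;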
I have created the target table in my database. It is accessible and visible in Matillion IDE.
The adjusted component I am to use is 'Azure Blob Storage Load'.
According to the documentation, I will need:
For Snowflake on Azure:
Create a Stage in Snowflake:
You should create a Stage in Snowflake pointing to the
public data we provide. Please find below the .json file containing
the job that will help you do this. Don't forget to change the SQL
script to point to your own schema.
After Creating the Stage in Snowflake:
You should use the 'Create Table' and the 'Azure Blob Storage Load'
components individually, as the 'Azure Blob Load Generator' won't let
you select the Stage previously created. We have attached below the
Create Table metadata to save you some time.
'Azure Blob Storage Load' Settings:
Stage: onlinemtlntrainingazure_flights
Pattern: training_azure_flights_2016.gz
Target Table: training_flights
Record Delimiter: 0x0a
Skip Header: 1
The source data on Azure is located here:
Azure Blob Container (with flights data)
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz
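If I understand the component correctly, those settings should amount to a Snowflake COPY roughly like the following sketch (the file-format options here are my own assumption, not something the training spells out):
-- Rough equivalent of the 'Azure Blob Storage Load' settings above (sketch only).
COPY INTO training_flights
FROM @onlinemtlntrainingazure_flights
PATTERN = '.*training_azure_flights_2016[.]gz'
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 RECORD_DELIMITER = '0x0a' COMPRESSION = GZIP);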
Unfortunately, when using these settings on the 'Azure Blob Storage Load' component, it complains:
The stage does not appear in the list, and manually entering the stage name yields an error (unrecognised option). Prefixing the stage name with my schema (and even database) does not help.
The Azure Storage Location property does not accept the https://... URI to the data files. When I replace 'https' with 'azure', or remove the part after the last '/', it complains with 'Unable to find an account with name: [onlinemtlntrainingazure]'.
Using [Custom] for the Stage property removes the error message, but when running the component the 'Unable to find an account' error comes back.
Any thoughts?
Edit: I found a workaround using the Data Transfer component, which first copies the files from the public https location to my own Azure blob location; I then process them further from there. But I would like to know how to do it as suggested in the training, and why it currently fails.

The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. Instead of using a Snowflake Stage, you might find it easier to copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.
In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.
You already have the https:// source URLs for the three files, so:
Set the source type to HTTPS (no username or password is needed)
Add the source URL
Set the target type to Azure Blob Storage
In the example I used two variables, with defaults set to my storage account and container name
Repeat for all three files
After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
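If you would rather skip the Load Generator once the files are in your own account, a plain Snowflake COPY from the external location is another option; this is only a sketch, and the storage account, container, and SAS token below are placeholders for your own values:
-- Sketch only: load the copied files straight from your own container.
COPY INTO training_flights
FROM 'azure://mystorageaccount.blob.core.windows.net/mycontainer/'
CREDENTIALS = (AZURE_SAS_TOKEN = '<your-sas-token>')
PATTERN = '.*training_azure_flights_201[678][.]gz'
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 RECORD_DELIMITER = '0x0a' COMPRESSION = GZIP);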

Related

Azure Data Factory Copy Snowflake to Azure blob storage ErrorCode 2200 User configuration issue ErrorCode=SnowflakeExportCopyCommandValidationFailed

Need help with this error.
ADF copy activity, moving data from Snowflake to Azure Blob Storage as delimited text.
I am able to preview the Snowflake source data. I am also able to browse the containers via the sink's browse option. This doesn't look like an issue with permissions.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Snowflake Export Copy Command validation failed:
'The Snowflake copy command payload is invalid.
Cannot specify property: column mapping,
Source=Microsoft.DataTransfer.ClientLibrary,'
Thanks for your help
Clearing the mapping from the copy activity worked.
To resolve the above error, please check the following:
Check the COPY command you are using; the error means the command you are using is not valid.
To get more detail about the error, try validating the files first, as below (MYTABLE and @mystage are placeholders for your own table and stage):
COPY INTO MYTABLE FROM @mystage FILES = ('result.csv') VALIDATION_MODE = 'RETURN_ERRORS';
Check whether there are any issues with the data files before loading the data again.
Check the storage account you are currently using, and note that Snowflake doesn't support Data Lake Storage Gen1.
Use the COPY INTO command to copy the data from the Snowflake database table into the Azure Blob Storage container.
Note:
Use the blob.core.windows.net endpoint for all supported types of Azure blob storage accounts, including Data Lake Storage Gen2.
Make sure you have either the ACCOUNTADMIN role or a role with the global CREATE INTEGRATION privilege to run the below sample command:
copy into 'azure://myaccount.blob.core.windows.net/mycontainer/unload/' from mytable storage_integration = myint;
By using the above command, you do not need to include credentials to access the storage.
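For completeness, a minimal sketch of creating such a storage integration (the integration name, tenant ID, and allowed location are placeholders for your own values):
-- Sketch only: the storage integration referenced by the COPY above.
CREATE STORAGE INTEGRATION myint
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://myaccount.blob.core.windows.net/mycontainer/unload/');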
For more detail, please refer to the links below:
Unloading into Microsoft Azure — Snowflake Documentation
COPY INTO <table> — Snowflake Documentation

"Azure Blob Source 400 Bad Request" when using Azure Blob Source in SSIS to pull a file from Azure Storage container

My package is very simple. It loads data from a CSV file that I have stored in an Azure storage container and inserts that data into an Azure SQL database. The issue stems from the connection to my Azure storage container.
Making this even more odd, while the data flow task is failing, the individual components within the data flow task all indicate success.
Setting up the package, the connection to the container seemed fine (after all, it was able to extract all the column names from the desired file and map them to their destinations).
So the issue is only realized upon execution.
I will also note that I found this post describing the exact same issue that I am experiencing now. As the top response there instructed, I added the new registry keys, but no cigar.
Any thoughts would be helpful.
First, make sure your blob can be accessed publicly.
And if you don't have a requirement to restrict networking, make sure the storage account's networking settings allow public access.
Then set the container access level, and make sure the container name is correct.

Running query using serverless sql pool (built-in) on CSV file in Azure Data Lake Storage Gen2 failed

I uploaded my CSV file into my Azure Data Lake Storage Gen2 using the Azure Synapse portal. Then I tried 'Select TOP 100 rows' and got an error after running the auto-generated SQL.
Auto-generated SQL:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Error:
File 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv'
cannot be opened because it does not exist or it is used by another process.
This error in Synapse Studio has a link underneath it (which leads to a self-help document) that explains the error itself.
Do you have the rights needed on the storage account?
You must have the Storage Blob Data Contributor or Storage Blob Data Reader role in order for this query to work.
Summary from the docs:
You need to have a Storage Blob Data Owner/Contributor/Reader role to
use your identity to access the data. Even if you are an Owner of a
Storage Account, you still need to add yourself into one of the
Storage Blob Data roles.
Check out the full documentation for Control Storage account access for serverless SQL pool
If your storage account is protected with firewall rules then take a look at this stack overflow answer.
Reference full docs article.
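If you cannot be granted one of those roles, the same documentation also covers authenticating with a SAS token via a server-scoped credential instead; a minimal sketch (the storage URL and the secret are placeholders for your own values):
-- Sketch only: lets serverless SQL use a SAS token for this storage path.
CREATE CREDENTIAL [https://accountname.dfs.core.windows.net/filesystemname]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<your-sas-token>';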
I just took your code, updated the path to one of my own files, and it worked just fine:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://XXX.dfs.core.windows.net/himanshu/NYCTaxi/PassengerCountStats.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Please check that the path to which you uploaded the file and the one used in the script are the same.
You can check this as follows:
Navigate to the workspace -> Data -> ADLS Gen2 -> go to the file -> right-click, open Properties, copy the URI from there and paste it into the script.

Copying files in fileshare with Azure Data Factory configuration problem

I am trying to learn using the Azure Data Factory to copy data (a collection of csv files in a folder structure) from an Azure File Share to a Cosmos DB instance.
In Azure Data Factory I'm creating a "Copy data" activity and trying to set my file share as the source using the following host:
mystorageaccount.file.core.windows.net\\mystoragefilesharename
When trying to test the connection, I get the following error:
[{"code":9059,"message":"File path 'E:\\approot\\mscissstorage.file.core.windows.net\\mystoragefilesharename' is not supported. Check the configuration to make sure the path is valid."}]
Should I move the data to another storage type like a blob, or am I not entering the correct host URL?
You'll need to specify the host in the JSON file like this: "\\myserver\share" if you create the pipeline with JSON directly, or set the host URL like this: "\myserver\share" if you're using the UI to set up the pipeline.
Here is more info:
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system#sample-linked-service-and-dataset-definitions
I believe when you created the file linked service, you might have chosen the public IR. If you choose the public IR, a local path (e.g. C:\xxx, D:\xxx) is not allowed, because the machine that runs your job is managed by the service and does not contain any customer data. Please use a self-hosted IR to copy your local files.
Based on the link posted by Nicolas Zhang (https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system#sample-linked-service-and-dataset-definitions) and the examples provided therein, I was able to solve it and successfully create the copy activity. I had two errors (I'm configuring via the Data Factory UI and not the JSON directly):
In the host path, the correct one should be: \\mystorageaccount.file.core.windows.net\mystoragefilesharename\myfolderpath
The username and password must be the ones corresponding to the storage account, not the actual user's account, which I was erroneously using.

Access a file from a directory in Azure Blob Storage through an Azure Logic App

I am using a Logic App to import a set of files which are inside a directory (/devcontainer/sample1/abc.csv).
The problem here is that I could not even locate the Azure file from my Logic App; I am getting the following error:
verify that the path exists and does not contain the blob name. List Folder is not allowed on blobs.
The problem here is that I could not even locate the Azure file from my Logic App.
The file explorer will show all the containers and blobs when you choose the blob path. It caches the data for a period of time to keep the operation smooth, so if a blob was added to the container recently, it will not be seen or selectable in the file explorer. The workaround is to click the 'change connection' link and use a new connection to retrieve the data.
Is your blob connection pointing to the correct storage account? One thing you can try: instead of providing the path, browse to it, so that you can see which containers and blobs are present in the storage account you are trying to access.
