Azure Synapse Analytics batch job

I'm new to Azure Synapse Analytics.
I'm wondering if it is possible to do such a thing:
In a pipeline:
as input I have a parquet file;
I'd like to process the parquet file row by row (ForEach?) and, for each row:
make a call to a SOAP XML API;
then, according to the result:
= if OK, store the result (an XML file) in a blob container;
= if not OK, store the result in an Azure Storage table.
I've already done such a thing with an Azure WebJob; would it be doable (rather simply) with Azure Synapse Analytics?
Regards.
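For illustration, the per-row logic described above could look roughly like the following Python sketch (for example inside a Synapse notebook activity in the pipeline). This is only a sketch: the endpoint URL, connection string, container, table and column names are hypothetical placeholders, and it assumes the requests, pandas, azure-storage-blob and azure-data-tables packages.

import pandas as pd
import requests
from azure.storage.blob import BlobServiceClient
from azure.data.tables import TableClient

# Placeholders: storage connection string, result container and error table
conn_str = "<storage-connection-string>"
blob_service = BlobServiceClient.from_connection_string(conn_str)
results_container = blob_service.get_container_client("soap-results")
errors_table = TableClient.from_connection_string(conn_str, table_name="soaperrors")

# In a Synapse notebook you would typically read the parquet input via
# spark.read.parquet(...).toPandas(); the path here is a placeholder.
rows = pd.read_parquet("<path-to-input>.parquet")

for i, row in rows.iterrows():
    # Build and send the SOAP request for this row (envelope is a placeholder)
    envelope = f"<soap:Envelope>...{row['some_column']}...</soap:Envelope>"
    resp = requests.post("https://example.com/soap-endpoint",
                         data=envelope,
                         headers={"Content-Type": "text/xml"})
    if resp.ok:
        # OK: store the XML response in the blob container
        results_container.upload_blob(name=f"row_{i}.xml", data=resp.text, overwrite=True)
    else:
        # Not OK: record the failure in an Azure Storage table
        errors_table.create_entity({
            "PartitionKey": "soap-errors",
            "RowKey": str(i),
            "StatusCode": resp.status_code,
            "Body": resp.text[:32000],  # table string properties are size-limited
        })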

Related

How to Read Append Blobs as DataFrames in Azure DataBricks

My batch processing pipeline in Azure has the following scenario: I am using the Copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.
zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files are stored in another container, preserving the hierarchy via the sink's copy behavior option, e.g.
unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled logging on the copy activity and provided the folder path to store the generated logs (in txt format), which have the following structure:
Timestamp | Level | OperationName | OperationItem | Message
2022-03-01 15:14:06.9880973 | Info | FileWrite | "deviceA/component1/2022.zip/measurements_01.csv" | "Complete writing file. File is successfully copied."
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The generated logs from the copy activity are of append blob type; read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?
According to this Microsoft documentation, Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs.
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
When you try to read this log file of append blob type, it fails with: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read a log file of append blob type from a blob storage account this way. A solution would be to use an Azure Data Lake Storage Gen2 container for logging. When you run the pipeline using ADLS Gen2 for the logs, it creates log files of block blob type, which you can then read without any issue from Databricks.
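As a quick check (and a way to pull the existing append-blob logs down without re-running the pipeline), the logs can also be read through the storage SDK instead of WASB. This is a minimal Python sketch, not part of the original answer, assuming the azure-storage-blob and pandas packages; the connection string, container name and log folder prefix are placeholders:

import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("<log-container>")

frames = []
for blob in container.list_blobs(name_starts_with="copyactivity-logs/"):
    # The SDK reports the blob type, so you can verify it really is an append blob
    print(blob.name, blob.blob_type)
    # download_blob works for append blobs as well as block blobs
    text = container.get_blob_client(blob.name).download_blob(encoding="utf-8").readall()
    frames.append(pd.read_csv(io.StringIO(text)))

logs = pd.concat(frames, ignore_index=True)
print(logs[["OperationItem", "Message"]].head())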
[Screenshots in the original answer show the resulting log blob type when using blob storage for logging vs. using ADLS Gen2 for logging.]

How to create an external table in SQL Server from a parquet file in Gen2 blob storage

I have created a pipeline that extracts data from a data source and stores it as a parquet file in Gen2 blob storage. After checking that the file was created (using Azure Storage Explorer), I tried to create an external table from this parquet file in SQL Server Management Studio, and it gives me the following error message:
Msg 15151, Level 16, State 1, Line 1
Cannot find the external data source 'storageGen2FA - File systems', because it does not exist or you do not have permission.
I checked my permissions, though, and I have the right ones to do this operation.
The SQL instance is Azure SQL, and the data source exists.
Thanks in advance.

Azure DataFactory Copy Data - how to know it has finished copying data?

We have a bunch of files in Azure blob storage in tsv format, and we want to move them to the destination, which is ADLS Gen2, in parquet format. We want this to run on a daily basis, so the ADF pipeline will write a bunch of parquet files into folders that have the date in their name, for example
../../YYYYMMDD/*.parquet
On the other side we have an API which will access this. How does the API know whether the data migration is complete for a particular day or not?
Basically, is there a built-in ADF feature to write a done file or _SUCCESS file which the API can rely on?
Thanks
Why not simply call the API from ADF using a Web activity to let it know?
You can use the Web activity to pass the name of the processed file as a URL or body parameter, so that the API knows what to process.
Here are two ways for you. From the perspective of the ADF copy activity's execution results, they can be divided into an active way and a passive way.
1. Active way: you could use the waitOnCompletion feature of the Execute Pipeline activity.
After that, execute a Web activity to trigger your custom API. Please see this case: Azure Data Factory: How to trigger a pipeline after another pipeline completed successfully
2. Passive way: you could use the monitoring feature of the ADF pipeline. Please see this .NET SDK example:
Console.WriteLine("Checking copy activity run details...");
RunFilterParameters filterParams = new RunFilterParameters(
DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();
Check that the status is Succeeded, then run your custom business logic.
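If the consumer sits outside ADF, the same passive check can also be done from Python with the azure-mgmt-datafactory package. This is only a sketch, assuming the pipeline run ID is already known to the caller; the subscription ID, resource group and factory name are placeholders:

import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Poll the pipeline run until it reaches a terminal state
while True:
    run = client.pipeline_runs.get("<resource-group>", "<factory-name>", "<run-id>")
    if run.status not in ("InProgress", "Queued"):
        break
    time.sleep(30)

if run.status == "Succeeded":
    print("Copy finished; safe to read the parquet files for this day")
else:
    print("Pipeline run ended with status:", run.status)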

How to access files stored in Azure Data Lake and use them as input to AzureBatchStep in azureml.pipeline.steps?

I registered an Azure Data Lake datastore, as described in the documentation, in order to access the files stored in it.
I used
DataReference(datastore, data_reference_name=None, path_on_datastore=None, mode='mount', path_on_compute=None, overwrite=False)
and used it as an input to an Azure ML pipeline step via the AzureBatchStep method.
But I got an error saying that the datastore name could not be fetched from the input.
Is Azure Data Lake not accessible in Azure ML, or am I getting something wrong?
Azure Data Lake is not supported as an input in AzureBatchStep. You should probably use a DataTransferStep to copy the data from ADLS to Blob storage, and then use the output of the DataTransferStep as an input to AzureBatchStep.
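A rough Python sketch of that pattern with the (v1) azureml-sdk is shown below; the datastore names, paths and compute target names are placeholders, and the exact AzureBatchStep arguments will depend on your setup:

from azureml.core import Workspace, Datastore
from azureml.core.compute import DataFactoryCompute
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DataTransferStep, AzureBatchStep

ws = Workspace.from_config()

# Source: the registered ADLS datastore; destination: a blob datastore
adls_store = Datastore.get(ws, "my_adls_datastore")
blob_store = Datastore.get(ws, "workspaceblobstore")

adls_ref = DataReference(datastore=adls_store,
                         data_reference_name="adls_input",
                         path_on_datastore="input/data")
blob_ref = DataReference(datastore=blob_store,
                         data_reference_name="copied_input",
                         path_on_datastore="copied/data")

# Copy ADLS -> Blob with a DataTransferStep (runs on an attached Data Factory compute)
adf_compute = DataFactoryCompute(ws, "my-adf-compute")
transfer = DataTransferStep(name="adls_to_blob",
                            source_data_reference=adls_ref,
                            destination_data_reference=blob_ref,
                            compute_target=adf_compute)

# Feed the blob copy to the batch step
batch = AzureBatchStep(name="run_batch",
                       executable="my_app.exe",
                       source_directory="batch_app",
                       inputs=[blob_ref],
                       compute_target="my-batch-compute",
                       pool_id="my-pool")
batch.run_after(transfer)  # ensure the copy happens before the batch job

pipeline = Pipeline(workspace=ws, steps=[transfer, batch])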

Get list of all files in an Azure Data Lake directory to a Lookup activity in ADFv2

I have a number of files in Azure Data Lake storage, and I am creating a pipeline in ADFv2 to get the list of all the files in a folder in ADLS. How can I do this?
You should use the Get Metadata activity.
Check this
You could follow the steps below to list files in ADLS.
1. Use the ADLS SDK to get the list of file names in a specific directory and output the results, for example with the Java SDK here. Of course, you could also use .NET or Python (a Python sketch follows after these steps).
// Uses the Azure Data Lake Store Java SDK; "client" is an ADLStoreClient and
// printDirectoryInfo is a helper from the linked sample.
// list directory contents
List<DirectoryEntry> list = client.enumerateDirectory("/a/b", 2000);
System.out.println("Directory listing for directory /a/b:");
for (DirectoryEntry entry : list) {
    printDirectoryInfo(entry);
}
System.out.println("Directory contents listed.");
2. Compile the program so that it can be executed, and store it in Azure Blob Storage.
3. Use a Custom activity in Azure Data Factory to configure the blob storage path and execute the program. For more details, please follow this document.
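As the Python sketch promised in step 1, here is a minimal example using the azure-datalake-store package (for Data Lake Storage Gen1, matching the linked Java sample); the tenant, client and store names are placeholders:

from azure.datalake.store import core, lib

# Authenticate with a service principal (all IDs below are placeholders)
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# List the entries under /a/b; detail=False returns just the paths
for path in adls.ls("/a/b", detail=False):
    print(path)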
You could use a Custom activity in Azure Data Factory.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-java-sdk#list-directory-contents
