Azure Data Factory API call throws payload limit error

We are performing a Copy Activity with a REST API URL as the data source and ADLS Gen2 as the sink. The pipeline works in most cases but sporadically throws the error below. We have a nested pipeline that loops through multiple REST API request parameters and makes the call within a ForEach activity.
Error displayed in the ADF monitor:
Error Code - 2200
Failure Type - User Configuration issue
Details - The payload including configurations on activity/dataset/linked service is too large. Please check if you have settings with very large value and try to reduce its size.

Error message: The payload including configurations on
activity/dataSet/linked service is too large. Please check if you have
settings with very large value and try to reduce its size.
Cause: The payload for each activity run includes the activity configuration, the associated dataset(s), and linked service(s) configurations if any, and a small portion of system properties generated per activity type. The limit of such payload size is 896 KB as mentioned in the Azure limits documentation for Data Factory and Azure Synapse Analytics.
Recommendation: You likely hit this limit because you pass in one or more large parameter values, either from an upstream activity output or from an external source, especially if you pass actual data across activities in the control flow. Check whether you can reduce the size of the large parameter values, or tune your pipeline logic to avoid passing such values across activities and handle the data inside the activity instead.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-factory-troubleshoot-guide#payload-is-too-large
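For example (hypothetical activity and field names), instead of forwarding an entire Lookup result into a downstream activity parameter:
@activity('LookupConfig').output
pass only the scalar value the downstream activity actually needs, or a reference such as a file path that the activity can resolve on its own:
@activity('LookupConfig').output.firstRow.filePath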

Related

Azure Stream Analytics job path pattern for ApplicationInsights data in storage container

My goal is to import telemetry data from an ApplicationInsights resource into a SQL Azure database.
To do so, I enabled allLogs and AllMetrics in the Diagnostic settings of the ApplicationInsights instance, and set the Destination details to "Archive to a storage account".
This works fine, and I can see data being saved to containers beneath the specified storage account, as expected. For example, page views are successfully written to the insights-logs-apppageviews container.
My understanding is that I can use Stream Analytics job(s) from here to import these JSON files into SQL Azure by specifying a container as input and a SQL Azure table as output.
The problems I encounter from here are twofold:
I don't know what to use for the "Path pattern" on the input resource: there are available tokens to use for {date} and {time}, but the actual observed container path is slightly different, and I'm not sure how to account for this discrepancy. For example, the {date} token expects the YYYY/MM/DD format, but that part of the observed path is of the format /y={YYYY}/m={MM}/d={DD}. If I try to use the {date} token, nothing is found. As far as I can tell, there doesn't appear to be any way to customize this.
For proof-of-concept purposes, I resorted to using a hard-coded container path with some data in it. With this, I was able to set the output to a SQL Azure table, check that there were no schema errors between input and output, and after starting the job, I do see the first batch of data loaded into the table. However, if I perform additional actions to generate more telemetry, I see the JSON files updating in the storage containers, yet no additional data is written to the table. No errors appear in the Activity Log of the job to explain why the updated telemetry data is not being picked up.
What settings do I need to use in order for the job to run continuously/dynamically and update the database as expected?

Passing Databricks ClusterID at runtime from Azure Databricks Pipeline

I am looking to make the Azure Databricks linked service configurable, and hence pass the Databricks WorkspaceURL and the ClusterID at runtime. I will have multiple Spark clusters, and based on the size required I would invoke the corresponding type/size of cluster.
I am not finding an option to get the Databricks ClusterID and pass it from the ADF pipeline.
You can use the Clusters API 2.0 REST API to get the cluster list.
https://adb-7012303279496007.7.azuredatabricks.net/api/2.0/clusters/list
I have reproduced the above and got the below result.
First, generate an access token in the Databricks workspace and use it as the authorization header in a Web activity to get the list of clusters.
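If you want to verify the token and inspect the response shape outside ADF first, here is a minimal sketch of the same call (the workspace URL is a placeholder and the token is read from an environment variable):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class ListClusters
{
    static async Task Main()
    {
        // Placeholder workspace URL – substitute your own adb-<workspace-id>.<random-number>.azuredatabricks.net host.
        var workspaceUrl = "https://adb-<workspace-id>.<random-number>.azuredatabricks.net";
        // Personal access token generated in the Databricks workspace.
        var token = Environment.GetEnvironmentVariable("DATABRICKS_TOKEN");

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token);

        // Same endpoint the Web activity calls: GET /api/2.0/clusters/list
        var response = await client.GetAsync($"{workspaceUrl}/api/2.0/clusters/list");
        response.EnsureSuccessStatusCode();

        // The body contains a "clusters" array with cluster_id, cluster_name,
        // cluster_memory_mb, cluster_cores, state, etc.
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}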
Output from web activity:
The output also contains the cluster memory size in MB. Store the clusters array in an array variable.
To get the desired cluster ID based on cluster size, you can use a Filter activity with whatever condition suits your requirement.
Here, as a sample, I have used the cluster memory size in MB as the filter condition.
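As a sketch (the Web activity name, variable name, and threshold are assumptions; only Filter1 is named in this answer), the settings could look like:
Set variable value: @activity('Web1').output.clusters
Filter1 Items:      @variables('cluster_list')
Filter1 Condition:  @greaterOrEquals(item().cluster_memory_mb, 16384)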
Notebook linked service: add a parameter for cluster_id.
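A sketch of the parameterized field in the Databricks linked service, assuming the parameter is named cluster_id and the value is supplied via dynamic content:
Existing cluster id: @linkedService().cluster_id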
Pass the desired cluster_id from the filtered array like below.
@activity('Filter1').output.Value[0].cluster_id
You can give the Notebook path using dynamic content.
My Execution:

Azure App Service Timeout for Resource creation

I have a problem with my Web app on Azure App Service and its timeout. It provides an API that creates a CosmosDB instance in an Azure Resource group. Since its creation takes a lot of time (~ 5 minutes), the App Service timeout (230 seconds) forces the App to return an HTTP Response 500, while the CosmosDB creation is successful. Within the method, the Resource is created and then some operations are performed on it.
ICosmosDBAccount cosmosDbAccount = azure.CosmosDBAccounts
    .Define(cosmosDbName)
    .WithRegion(Region.EuropeNorth)
    .WithNewResourceGroup(resourceGroupName)
    .WithDataModelSql()
    .WithSessionConsistency()
    .WithDefaultWriteReplication()
    .Create();

DoStuff(cosmosDbAccount);
Since I've read that the timeout cannot be increased, is there a simple way to await the Resource creation and get a successful response?
From your code, you are using the .NET SDK to implement this.
Looking at the official SDK source code, it is clear that the SDK creates the resource asynchronously and ultimately determines whether creation is complete by checking the ProvisioningState value.
In a web app, an API should return a response promptly after the request is received. If the API has to block on the SDK's synchronous or asynchronous result before returning, the request will inevitably take a long time, so that design does not work well with the App Service timeout.
So my suggestion is to use the REST API to achieve your needs.
Create (use Database Accounts - Create Or Update)
The database account create or update operation will complete asynchronously.
Check the response result.
Check the provisioningState in properties (use Database Accounts - Get)
If the provisioningState is Succeeded, then we know the resource has been created successfully.
If you want to achieve the same effect as the portal, it is recommended to add a timer that periodically polls the provisioningState value.
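A minimal polling sketch along those lines (the subscription, resource group, account name, and api-version are placeholders, and the ARM bearer token is assumed to be acquired elsewhere):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

static class CosmosDbProvisioningPoller
{
    // Polls Database Accounts - Get until provisioningState is Succeeded (or gives up).
    public static async Task<bool> WaitForSucceededAsync(string armToken)
    {
        // Placeholder identifiers – substitute your own values.
        var url = "https://management.azure.com/subscriptions/<subscriptionId>" +
                  "/resourceGroups/<resourceGroupName>" +
                  "/providers/Microsoft.DocumentDB/databaseAccounts/<accountName>" +
                  "?api-version=2021-04-15";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", armToken);

        for (var attempt = 0; attempt < 60; attempt++)
        {
            var json = await client.GetStringAsync(url);
            using var doc = JsonDocument.Parse(json);
            var state = doc.RootElement
                           .GetProperty("properties")
                           .GetProperty("provisioningState")
                           .GetString();
            if (state == "Succeeded")
                return true;

            // Creation takes a few minutes; poll on a timer instead of blocking the original request.
            await Task.Delay(TimeSpan.FromSeconds(15));
        }
        return false;
    }
}

The create endpoint can then return immediately (for example, 202 Accepted) while a status endpoint or background timer runs this check, keeping each request well under the 230-second limit.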

Azure DataFactory Copy Data - how to know it has done copying data?

We have a bunch of files in Azure Blob Storage in TSV format, and we want to move them to a destination which is ADLS Gen2 in Parquet format. We want this to run on a daily basis, so the ADF pipeline will write a bunch of Parquet files into folders that have the date in their name, for example:
../../YYYYMMDD/*.parquet
On the other side we have an API which will access this. How does the API know whether the data migration has completed for a particular day?
Basically, is there a built-in ADF feature to write a done file or _SUCCESS file which the API can rely on?
Thanks
Why not simply call the API from ADF using a Web activity to let it know?
You can even use the Web activity to pass the name of the processed file as a URL or body parameter so that the API knows what to process.
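For example (the endpoint and body shape are assumptions), a Web activity placed after the Copy activity could POST the processed date folder to the API:
URL:    https://<your-api>/copy-complete
Method: POST
Body:   @concat('{"folder":"', formatDateTime(utcnow(), 'yyyyMMdd'), '"}')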
Here are two ways for you. From the perspective of the ADF Copy activity execution result, they can be divided into an active way and a passive way.
1. Active way: you could use the waitOnCompletion feature of the Execute Pipeline activity.
After that, execute a Web activity to trigger your custom API. Please see this case: Azure Data Factory: How to trigger a pipeline after another pipeline completed successfully
2. Passive way: you could use the monitoring feature of the ADF pipeline. Please see this .NET SDK example:
Console.WriteLine("Checking copy activity run details...");
RunFilterParameters filterParams = new RunFilterParameters(
DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();
Check that the status is Succeeded, then run your custom business logic.

How to use dynamic content in relative URL in Azure Data Factory

I have a REST data source where I need pass in multiple parameters to build out a dataset in Azure Data Factory V2.
I have about 500 parameters that I need to pass in so don’t want to pass these individually. I can manually put these in a list (I don’t have to link to another data source to source these). The parameters would be something like [a123, d345, e678]
I'm working in the UI. I cannot figure out how to pass these into the relative URL (where it says Parameter) to then form the dataset. I could do this in Power BI using functions and parameters but can't figure it out in Azure Data Factory as I'm completely new to it. I'm using the Copy Data functionality in ADF to do this.
The sink would be a json file in an Azure blob that I can then access via Power BI. I'm fine with this part.
Relative URL with Parameter requirement
How to add dynamic content
I'm afraid your requirement can't be implemented directly. As you know, the ADF REST dataset is used to retrieve data from a REST endpoint using GET or POST HTTP requests. There is no way to configure a list of parameters in the relativeUrl property that ADF would loop through automatically for you.
Two ways to reach your goal:
1. Loop over your parameter array and pass a single item into relativeUrl, executing the copy activity once per item. For this you could use the ForEach activity in ADF; see the sketch after this list.
2. Write a wrapper API that accepts the list parameter in the request body and loops over it inside the API.
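A minimal sketch of option 1, assuming a pipeline array parameter named ids (e.g. ["a123", "d345", "e678"]) and a REST dataset parameter named id:
ForEach Items:        @pipeline().parameters.ids
Dataset parameter id: @item()
Dataset Relative URL: @concat('items/', dataset().id)
The 'items/' path segment is only an illustration of how the relative URL is built from the current item; substitute your API's actual route.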
