I have an Azure Data Factory pipeline for data extraction from an on-premises CRM. I am running into an issue with one of the data entities where the pipeline runs for close to 8 hours and then throws the exception below. I know it's not an issue with authentication, as I am able to get the other entities without any problems. I tried changing parallelCopies to 18 along with the DIUs, but when I trigger the pipeline it sticks to a parallel copy count of 1 and 4 DIUs and eventually fails. I'd appreciate any input.
Operation on target XXXX failed: Failure happened on 'Source' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path XXXXXXX,Source=Microsoft.DataTransfer.Common,''Type=System.NotSupportedException,Message=The authentication endpoint Kerberos was not found on the configured Secure Token Service!,Source=Microsoft.Xrm.Sdk,'
I ran into something similar when using CRM as a sink; any upsert activities would fail at almost exactly 60 minutes. The error I observed in the Azure Data Factory activity was:
'Type=System.NotSupportedException,Message=The authentication endpoint Kerberos was not found on the configured Secure Token Service!,Source=Microsoft.Xrm.Sdk,'
This post helped me find what to change in ADFS. I ran Get-ADFSRelyingPartyTrust and reviewed the TokenLifetime property, which happened to be 0. Apparently tokens last 60 minutes when the configuration is 0.
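For example (the trust name below is a placeholder for your own relying party trust), you can check the current value with:
Get-AdfsRelyingPartyTrust -Name "<RelyingPartyTrust>" | Select-Object Name, TokenLifetime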
The following PowerShell increased the timeout, and I confirmed upsert activities no longer fail when exceeding 60 minutes.
Set-ADFSRelyingPartyTrust -TargetName "<RelyingPartyTrust>" -TokenLifetime <timeout in minutes>
It turned out to be a timeout setting on ADFS; once the timeout was increased, the job ran successfully.
I need to run an aggregate query to calculate the count of records e.g. SELECT r.product_id, r.rating, COUNT(1) FROM product_ratings r GROUP BY r.product_id, r.rating. The query works perfectly fine on the Azure Data Explorer, albeit a little slow. An optimised version of the query takes about 30 seconds when executed on the Data Explorer. However, when I run the same query in my Java app, it appears to be timing out in 5 seconds with the following exception:
com.azure.cosmos.implementation.GoneException: {"innerErrorMessage":"The requested resource is no longer available at the server."}
I believe this is due to a default request timeout of 5 seconds defined in ConnectionPolicy (both Direct and Gateway modes). I can't find a way to override this default. Is there a way to increase the request timeout? Is there another possible reason for this error?
I tried this on both the Java SDK v4 and the Spring Data connector v3 with the same end result, i.e. GoneException.
You could consider the following recommendations; they should help address the issue (a rough sketch of where these settings live follows below):
Try increasing the HTTP connection pool size (the default is 1000; you can increase it to 2000).
If you are using Gateway mode, try Direct mode; more traffic will go over TCP and less over HTTP.
You can refer to the GitHub sample code on setting the timeout.
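Not an authoritative answer for your exact query, but as a minimal sketch of where those knobs sit in the Java SDK v4: the endpoint, key, and concrete values below are placeholders, and setNetworkRequestTimeout may require a recent 4.x release.

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.DirectConnectionConfig;
import com.azure.cosmos.GatewayConnectionConfig;
import java.time.Duration;

public class CosmosClientFactory {
    public static CosmosClient build() {
        // Direct mode sends data-plane traffic over TCP instead of HTTPS.
        DirectConnectionConfig direct = DirectConnectionConfig.getDefaultConfig();
        // Raise the per-request network timeout (placeholder value; the SDK caps this).
        direct.setNetworkRequestTimeout(Duration.ofSeconds(10));

        // The gateway config still applies to metadata requests; bump the pool from 1000 to 2000.
        GatewayConnectionConfig gateway = GatewayConnectionConfig.getDefaultConfig();
        gateway.setMaxConnectionPoolSize(2000);

        return new CosmosClientBuilder()
                .endpoint("<your-account-endpoint>") // placeholder
                .key("<your-account-key>")           // placeholder
                .directMode(direct, gateway)
                .buildClient();
    }
}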
So I am getting an error in Azure Data Factory that I haven't been able to find any information about. I am running a data flow, and eventually (after an hour or so) I get this error:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to
reason: The service has encountered an error processing your request.
Please try again. Error code 1204.","Details":"The service has
encountered an error processing your request. Please try again. Error
code 1204."}
Troubleshooting I have already done:
I have successfully run the data flow using the sample option. I did this with 1 million rows.
I am processing 3 years of data, and I have successfully processed all of it by filtering the data by year and running the data flow once for each year.
So I think I have shown the data isn't the problem, because I have processed all of it by breaking it down into 3 runs.
I haven't found a pattern in the time the pipeline runs for before the error occurs that would indicate I am hitting any timeout value.
The source and sink for this data flow are both an Azure SQL Server database.
Does anyone have any thoughts? Any suggestions for getting a more verbose error out of Data Factory (I already have the pipeline set to verbose logging)?
We are glad to hear that you have found the cause:
"I opened a Microsoft support ticket and they are saying it is a
database transient caused failure."
I think the error will resolve itself automatically. I am posting this as an answer so it can be beneficial to other community members. Thank you.
Update:
The most important thing is that you resolved it in the end by increasing the vCores.
"The only thing they gave me was their BS article on handling
transient errors. Maybe I’m just old but a database that cannot
maintain connections to it is not very useful. What I’ve done to
workaround this is increase my vCores. This sql database was a
serverless one. While performance didn’t look bad my guess is the
database must be doing some sort of resize in the background to
handle the hour long data builds I need it to do. I had already tried
setting the min/max vCores to be the same. The connection errors
disappeared when I increased the vCores count to 6."
We have a .NET Core 3.0 solution running in a Docker image on Azure. For now, we haven't set it up in a k8s cluster. Our App Service plan is PremiumV2, which basically means we're running on dedicated hardware and not sharing our resources with anyone else.
We have a simple API call to get the executing user based on the JWT. This validates the JWT, gets the user's mail from the claims, and queries Cosmos to get more information about the user. When the request is sent from Postman the first time, it takes roughly 320 ms, but subsequent requests take around 50 ms. If we wait, let's say, 10 minutes more or less, the request is back at around 300 ms, and again the subsequent requests take around 50 ms. This indicates that the behavior is reproducible. It's worth mentioning that it's not only this call where we see this behavior; every "first" request to our API takes more time than the following ones.
Looking into Application Insights, Cosmos is apparently not the bottleneck here. We've also configured the App Service to be "Always On".
Any ideas on how we can trace down this issue? Has anyone else experienced the same behavior? Is there any settings or configuration we should look at in Azure?
I'm trying to grab Firebase Analytics data from Google BigQuery with Azure Data Factory.
The connection to BigQuery works, but I quite often have timeout issues when running a (simple) query. Three out of five times I run into a timeout. If no timeout occurs, I receive the data as expected.
Can any of you confirm this issue, or does anyone have an idea what the reason for it might be?
Thanks & best,
Michael
Timeout issues can happen in Azure Data Factory sometimes. They are affected by the source dataset, sink dataset, network, query performance, and other factors. After all, your connectors are not Azure services.
You could try to set the timeout parameter in the activity JSON, or you could set a retry count to deal with timeout issues.
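For reference, here is a minimal sketch of what that looks like on a copy activity (the activity name and the values are placeholders; the timeout uses the d.hh:mm:ss format):

{
    "name": "CopyFromBigQuery",
    "type": "Copy",
    "policy": {
        "timeout": "0.02:00:00",
        "retry": 3,
        "retryIntervalInSeconds": 60
    }
}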
If your sample data is so simple that it shouldn't time out, maybe you could submit feedback here to ask the ADF team about your concern.
I have two Azure WebJobs. The first takes an incoming message that tells it to grab a PDF and break it into individual page images and then queue another message for the second WebJob to process the individual pages. It worked fine on our QC instance, but when we tried to move to production I started getting strange errors on the second job, but not consistently. The first job runs and breaks the file into page images. That is working fine. I have confirmed that every page image gets created and every page message gets queued. However, for the second job, only some of the messages are getting processed correctly. The remaining show this error in the WebJob diagnostics:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.ProcessBatchPage ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 52 - Unable to locate a Local Database Runtime installation. Verify that SQL Server Express is properly installed and that the Local Database Runtime feature is enabled.) ---> System.ComponentModel.Win32Exception: The system cannot find the file specified
But what's weird is that this error mentions the Local Database Runtime and SQL Server Express, and I am not referencing either anywhere in my code. The system points at an Azure SQL DB. The job uses ADO.NET, and I have hardcoded the connection string to try to eliminate any issues with configuration-based connection strings. Stranger still, it only happens to a certain portion of the messages. The others process perfectly.
Lastly, I ran the job in debug locally (still pointing at the real queue and DB on Azure) and got the same problem. But the job outputs a console line with the job ID as the first line of the code. For those jobs that process successfully, I see this writeline. For those that fail, I never see anything. It's almost like the job is not really starting up correctly. (the failed jobs also have a really short run time 50-100ms)
I had the same issue with some jobs, and I came across these articles to find a solution:
Transient Fault Handling (Building Real-World Cloud Apps with Azure)
Connection Resiliency / Retry Logic (EF6 onwards)
From these articles:
Causes of transient failures:
In the cloud environment you’ll find that failed and dropped database connections happen periodically. That’s partly because you’re going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you’re dependent on a multi-tenant service you’ll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Use smart retry/back-off logic to mitigate the effect of transient failures:
The Microsoft Patterns & Practices group has a Transient Fault Handling Application Block that does everything for you if you’re using ADO.NET for SQL Database access (not through Entity Framework). You just set a policy for retries – how many times to retry a query or command and how long to wait between tries – and wrap your SQL code in a using block:
public void HandleTransients()
{
    var connStr = "some database";

    // Retry up to 3 times, waiting 5 seconds between attempts.
    var _policy = new RetryPolicy<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));

    using (var conn = new ReliableSqlConnection(connStr, _policy))
    {
        // Open the connection through the retry policy, then do SQL stuff here.
        conn.Open();
    }
}
When you use the Entity Framework you typically aren’t working directly with SQL connections, so you can’t use this Patterns and Practices package, but Entity Framework 6 builds this kind of retry logic right into the framework. In a similar way you specify the retry strategy, and then EF uses that strategy whenever it accesses the database.
To use this feature in the Fix It app, all we have to do is add a class that derives from DbConfiguration and turn on the retry logic.
// EF follows a Code based Configuration model and will look for a class that
// derives from DbConfiguration for executing any Connection Resiliency strategies
public class EFConfiguration : DbConfiguration
{
    public EFConfiguration()
    {
        AddExecutionStrategy(() => new SqlAzureExecutionStrategy());
    }
}