Azure website timing out after long process - azure

Team,
I have an Azure website published on Azure. The application reads around 30,000 employees from an API and, after the read succeeds, updates the secondary Redis cache with all 30,000 employees.
The timeout occurs in the second step, when it updates the secondary Redis cache with all the employees. It works fine from my local machine, but as soon as I deploy it to Azure, it gives me a
500 - The request timed out.
The web server failed to respond within the specified time
From the blogs I learned that the default timeout for an Azure website is 4 minutes.
I have tried all the fixes suggested on the blogs, like setting SCM_COMMAND_IDLE_TIMEOUT to 3600 in the application settings.
I even tried putting the Azure Redis cache session state provider settings in the web.config with inflated timeout figures:
<add type="Microsoft.Web.Redis.RedisSessionStateProvider" name="MySessionStateStore" host="[name].redis.cache.windows.net" port="6380" accessKey="QtFFY5pm9bhaMNd26eyfdyiB+StmFn8=" ssl="true" abortConnect="False" throwOnError="true" retryTimeoutInMilliseconds="500000" databaseId="0" applicationName="samname" connectionTimeoutInMilliseconds="500000" operationTimeoutInMilliseconds="100000" />
The offending code responsible for the timeout is this:
public void Update(ReadOnlyCollection<ColleagueReferenceDataEntity> entities)
{
    //Trace.WriteLine("Updating the secondary cache with colleague data");
    var secondaryCache = this.Provider.GetSecondaryCache();
    foreach (var entity in entities)
    {
        try
        {
            secondaryCache.Put(entity.Id, entity);
        }
        catch (Exception ex)
        {
            // If a record fails - log and continue.
            this.Logger.Error(ex, string.Format("Error updating a colleague in secondary cache: Id {0}", entity.Id));
        }
    }
}
Is there anything I can change in this code?
Please, can anyone help me? I have run out of ideas!

You're doing it wrong! Redis is not the problem. The main request thread itself is getting terminated before the process completes. You shouldn't let a request wait that long: there is a hard-coded, unchangeable limit of 230 seconds on in-flight requests.
Read here: Why does my request time out after 230 seconds?
Assumption #1: You're loading the data on the very first request from the client side.
Solution: If the 30,000 employee records are for the whole application, and not per specific user, you can trigger the data load on app start-up rather than on a user request (see the sketch below these assumptions).
Assumption #2: You have individual users, and for each of them you have to store 30,000 employee records on the first request from the client side.
Solution: Add a background job (maybe a WebJob/Azure Function) to process the task. Upon a request from the client, return a 202 (Accepted) with the job-status location in the header. The client can then poll the status of the task at a certain frequency and update the user accordingly.
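For Assumption #1, a minimal sketch of the start-up warm-up (ASP.NET Global.asax; CacheWarmer is a hypothetical wrapper around the Update(...) method from the question, not something in your code):

using System.Web;
using System.Web.Hosting;

public class MvcApplication : HttpApplication
{
    protected void Application_Start()
    {
        // Runs on a background thread that ASP.NET tracks during shutdown, so no
        // client request ever waits on the 30,000-record load (and the
        // 230-second request limit never applies).
        HostingEnvironment.QueueBackgroundWorkItem(ct => CacheWarmer.UpdateSecondaryCache(ct));
    }
}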
Edit 1:
For Assumption #1 - You can try batching the objects while pushing them to Redis. Currently you're updating one object at a time, which means 30,000 requests and will certainly exhaust the 230-second limit. As a quick fix, batch multiple objects into one request to Redis; that should do the trick.
UPDATE:
As you're using StackExchange.Redis, use the batching pattern already mentioned here:
Batch set data from Dictionary into Redis
The number of objects per request varies depending on the payload size and the available bandwidth. As your site is hosted on Azure, I do not think bandwidth will be much of a concern.
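A minimal sketch of that pattern, assuming StackExchange.Redis with JSON serialization; the key prefix and helper names are illustrative only:

using System.Collections.Generic;
using System.Threading.Tasks;
using Newtonsoft.Json;
using StackExchange.Redis;

public static class SecondaryCacheBatchUpdater
{
    public static async Task UpdateAsync(IConnectionMultiplexer redis, IEnumerable<ColleagueReferenceDataEntity> entities)
    {
        IDatabase db = redis.GetDatabase();
        IBatch batch = db.CreateBatch();
        var pending = new List<Task>();

        foreach (var entity in entities)
        {
            // Commands are only queued at this point; nothing is sent yet.
            pending.Add(batch.StringSetAsync("colleague:" + entity.Id, JsonConvert.SerializeObject(entity)));
        }

        batch.Execute();             // flush all queued commands together
        await Task.WhenAll(pending); // observe per-key failures, if any
    }
}

A single batch of 30,000 SETs may still be a large payload; splitting it into chunks of a few hundred keys keeps each round trip small.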
Hope that helps!

Related

Random 21/42 seconds timeout in outgoing traffic on Azure Web Sites

I have an ASP.NET MVC 5 application running in the Azure German cloud as an Azure Web App (single instance - Standard S3 size).
I'm calling a non-Azure-hosted REST/SOAP service on a particular host, and the web requests either succeed promptly or time out after 21 / 42 seconds.
I've load tested the requests, and the percentage of requests timing out is between 20 and 80 percent.
One particularly remarkable property of the timeouts is that they occur after exactly 21 or 42 seconds (this is serious, no reference to the Hitchhiker's Guide to the Galaxy intended).
Calling a different service from the web app works just fine, temporarily at least.
We've already checked the firewall of the non-Azure service, and when the timeout occurs, not a single packet reaches the host.
This issue occurred once in the past, one year ago, and support was unable to tell what the cause was. The issue suddenly went away roughly two weeks after first occurring, so the ticket got closed as having fixed itself, but now it's back.
The code is using https://github.com/canton7/RestEase (uses HttpClient underneath) and looks like
[Header("Content-Type", "application/json")]
public interface IApi
{
[Post("/Login")]
Task<LoginToken> Login([Body]LoginRequest request);
}
private static Dictionary<string, IApi> ApiClientsByHost = new Dictionary<string, IApi>();
private IApi GetApiForHost(string host)
{
if (!ApiClientsByHost.TryGetValue(host, out var client))
{
lock (ApiClientsByHost)
{
if (!ApiClientsByHost.TryGetValue(host, out client))
{
ApiClientsByHost[host] = client = RestClient.For<IApi>(host);
}
}
}
return client;
}
var client = GetApiForHost("https://production/");
var loginToken = await client.Login(new LoginRequest { Username = username, Password = password });
By a different service, I mean using "https://testserver/" instead of "https://production/" (testserver is located in a different data center, with a different IP and all).
The API authentication passes a token via the query string, but the call times out before it is even able to get a token.
The code caches the IApi to avoid the port exhaustion problems caused by disposing HttpClients (though I've never actually run into port exhaustion).
Restarting the app does not resolve the issue, and the issue currently only occurs against production (a year ago, when this issue occurred on production, we switched to testserver, which worked initially but after some time ran into the same problem).
EDIT: Found some explanation in the last answer as to where those magical 21 seconds are coming from.
EDIT: One workaround I've found is to set up an Azure VM with a proxy on it and configure defaultProxy to pass traffic through that VM.
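For reference, a hedged sketch of the same workaround done in code rather than via defaultProxy, assuming a proxy listening on the VM at a placeholder address and using RestEase's HttpClient overload:

using System;
using System.Net;
using System.Net.Http;
using RestEase;

// Route outgoing calls through the Azure VM proxy (address is a placeholder).
var handler = new HttpClientHandler
{
    Proxy = new WebProxy("http://10.0.0.4:3128"),
    UseProxy = true
};
var httpClient = new HttpClient(handler) { BaseAddress = new Uri("https://production/") };
var api = RestClient.For<IApi>(httpClient); // RestEase can wrap an existing HttpClient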
That's TCP retransmission timing out. It's odd that you are getting different values though.

Azure WebJobs getting initialized randomly

We have WebJobs consisting of several methods in a single Functions.cs file. They have Service Bus triggers on topics/queues, so they keep listening to the topic/queue for a BrokeredMessage. As soon as a message arrives, we have processing logic that does a lot of work. But we find that sometimes all the WebJobs suddenly get reinitialized. I found a few articles which say WebJobs do get reinitialized and that this is usual.
But is that the only behaviour, and can we prevent them from getting reinitialized? We call brokeredMessage.Complete() as soon as we get the BrokeredMessage, since we do not want it to keep being processed again and again.
Also, we have a few WebJobs in one App Service and a few WebJobs in another App Service, and we find that all of the WebJobs from both App Services get reinitialized at the same time. Not sure why?
You should design your process to be able to deal with occasional disconnects and failures, since this is a "feature" of applications living in the cloud.
Use a transaction to manage the critical area of your code.
Pseudo/commented code is below, and a link to the Microsoft documentation is here.
var msg = receiver.Receive();
using (var scope = new TransactionScope())
{
    // Do whatever work is required.
    // Start with computation and business logic.
    // Finish with any persistence or new message generation,
    // giving your application the best chance of success.
    // Keep in mind that all BrokeredMessage operations are enrolled in
    // the transaction. They will all succeed or fail together.
    // If you have multiple data stores to update, you can use brokered messages
    // to send new individual messages to do the operation on each store,
    // giving eventual consistency.
    msg.Complete(); // mark the message as done
    scope.Complete(); // declare the transaction done
}

Umbraco 7.6.0 - Site becomes unresponsive for several minutes every day

We've been having a problem for several months where the site becomes completely unresponsive for 5-15 minutes every day. We have added a ton of request logging, enabled DEBUG logging, and have finally found a pattern: Approximately 2 minutes prior to the outages (in every single log file I've looked at, going back to the beginning), the following lines appear:
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Timer: release.
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Run now (sync).
From what I gather this is the process that rebuilds the umbraco.config, correct?
We have ~40,000 nodes, so I can't imagine this would be the quickest process to complete; however, the strange thing is that the CPU and memory on the Azure Web App do not spike during these outages. This would seem to point to disk I/O being the bottleneck.
This raises a few questions:
Is there a way to schedule this task so that it only runs during off-peak hours?
Are there performance improvements in the newer versions (we're on 7.6.0) that might improve this functionality?
Are there any other suggestions to help correct this behavior?
Hosting environment:
Azure App Service B2 (Basic)
SQL Azure Standard (20 DTUs) - DTU usage peaks at 20%, so I don't think there's anything there. Just noting for completeness
Azure Storage for media storage
Azure CDN for media requests
Thank you so much in advance.
Update 10/4/2017
If it helps, it appears that these particular log entries correspond with the first publish of the day.
I don't feel like 40,000 nodes is too much for Umbraco, but if you want to schedule republishes, you can do this:
You can programmatically call a cache refresh using:
ApplicationContext.Current.Services.ContentService.RePublishAll();
(Umbraco source)
You could create an API controller which you could call periodically via a URL. The controller would probably look something like:
public class CacheController : UmbracoApiController
{
    [HttpGet]
    public HttpResponseMessage Republish(string pass)
    {
        if (pass != "passcode")
        {
            return Request.CreateResponse(HttpStatusCode.Unauthorized, new
            {
                success = false,
                message = "Access denied."
            });
        }

        var result = Services.ContentService.RePublishAll();
        if (result)
        {
            return Request.CreateResponse(HttpStatusCode.OK, new
            {
                success = true,
                message = "Republished"
            });
        }

        return Request.CreateResponse(HttpStatusCode.InternalServerError, new
        {
            success = false,
            message = "An error occurred"
        });
    }
}
You could then periodically ping this URL:
/umbraco/api/cache/republish?pass=passcode
I have a blog post you can read on how to schedule events like these. I recommend just using the Windows Task Scheduler to ping the URL: https://harveywilliams.net/blog/better-task-scheduling-in-umbraco#windows-task-scheduler

Azure Function on Always-On App Service Plan Times Out with No functionTimeout Set

Like the title describes - I have an Azure Function on the App Service plan, configured for Always On and with no functionTimeout set in my host.json, and it appears to time out / not finish any time after 30 minutes to 1 hour (...but I feel this may be a false positive...).
The HTTP Triggered function can sometimes take over 1-2 hours to complete. I understand that this probably isn't the best design and according to the Azure Function Best Practices I should break this out into smaller / more manageable pieces - I get that. However, I expect the Function on the App Service plan to work as advertised - no hard limit on execution time. Perhaps this is the same question as Unexpected azure-function timeouts on app-service-plan, but that has no answer and I am using an HTTP Trigger instead.
Currently, the HTTP Triggered method does not return until the work is complete. (Is this a problem - the HTTP trigger needs to return quicker?)
According to the Kudu Function Invocation Logs, this case reports "Never Finished", and when I click on the Toggle Output button to view the logs, they never come in.
When I viewed this function's run in the Logs section of that trigger, it seems like the function just stopped, and the log stream just reports no new trace:
2017-07-26T16:36:43.116 [INFO] [Class1] Update operation started processing 790 sales records ...
2017-07-26T16:36:43.116 [DBUG] [Class2] Matching and updating ids from the map...
2017-07-26T16:38:07 No new trace in the past 1 min(s).
2017-07-26T16:39:07 No new trace in the past 2 min(s).
2017-07-26T16:40:07 No new trace in the past 3 min(s).
2017-07-26T16:41:07 No new trace in the past 4 min(s).
So I'm not sure why this function just seemed to stop - or perhaps it stopped collecting log statements (there are many) and, for some reason, the function never completed.
Any ideas?
Approx time: 2017-07-26T16:00:00 UTC
InvocationID: d856c107-f1ee-455a-892b-ed970dcad128 (I think?)
If it is indeed being timed out, is there any way for us to know (an exception? App Insights? etc.)?
Based on my test, I found that Azure Functions will not stop your function if you don't set the timeout.
Here is my test: I created a ManualTrigger function which logs a message every 10 minutes.
The code is like below:
public static void Run(string input, TraceWriter log)
{
    for (int i = 0; i < 100; i++)
    {
        log.Info("Worked " + i * 10 + " minutes");
        Thread.Sleep(600000);
    }
}
The log details:
In the log, you can see that my function executed for 70 minutes. It still works well.
The 'no new trace' messages just mean that no new requests were sent to the Azure Function.
Currently, the HTTP Triggered method does not return until the work is complete. (Is this a problem - the HTTP trigger needs to return quicker?)
As Jesse Carter says, you can't execute long-running work in an HTTP-triggered method,
since your client side (which sends the request) has a timeout value and will be waiting for the function's response.
Normally, if you want to execute a long-running function, I suggest using an HTTP trigger just to receive the request. In the HTTP trigger function, you add a queue message to an Azure Storage queue.
Then you write a queue-triggered function which executes the long-running work.
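A hedged sketch of that split, written against the Functions v1 class-library model used elsewhere in this question; the queue name and function names are placeholders:

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Azure.WebJobs.Host;

public static class OffloadToQueue
{
    // Accept the request, enqueue the payload, and return 202 straight away.
    [FunctionName("EnqueueWork")]
    public static async Task<HttpResponseMessage> EnqueueWork(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [Queue("work-items")] IAsyncCollector<string> queue,
        TraceWriter log)
    {
        string payload = await req.Content.ReadAsStringAsync();
        await queue.AddAsync(payload);
        log.Info("Work item queued.");
        return req.CreateResponse(HttpStatusCode.Accepted);
    }

    // The long-running work happens here, outside the HTTP pipeline.
    [FunctionName("ProcessWork")]
    public static void ProcessWork(
        [QueueTrigger("work-items")] string workItem,
        TraceWriter log)
    {
        log.Info($"Processing work item: {workItem}");
        // ...long-running processing...
    }
}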
If your HTTP method takes more than a minute, you should be offloading it to a Queue. Period. (I know the other answers have said this, but it's worth repeating).
HTTP connections are a limited resource.
While Azure Functions as an execution engine can handle long-running operations (as demonstrated by queue / Service Bus support), the HTTP pipeline may cut off / time out long-running requests.
Queue triggers can easily run for 30+ minutes. If your job is longer than that, you really should split it into multiple queue messages.
Also check out Durable Function support: https://github.com/Azure/azure-functions-durable-extension/
Regardless of the function app timeout setting, 230 seconds is the maximum amount of time that an HTTP triggered function can take to respond to a request. This is because of the default idle timeout of Azure Load Balancer. For longer processing times, consider using the Durable Functions async pattern or defer the actual work and return an immediate response.
Function app timeout duration: Check Notes
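For completeness, a minimal sketch of the Durable Functions async HTTP pattern mentioned above, using the Durable Functions 1.x API; the orchestration name is a placeholder:

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class DurableStarter
{
    // HTTP starter: kicks off the orchestration and returns 202 immediately
    // with status-polling URLs, so the 230-second HTTP limit never applies.
    [FunctionName("Start")]
    public static async Task<HttpResponseMessage> Start(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient starter)
    {
        string instanceId = await starter.StartNewAsync("LongRunningOrchestration", null);
        return starter.CreateCheckStatusResponse(req, instanceId);
    }
}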

Limiting the number of concurrent jobs on Azure Functions queue

I have a Function app in Azure that is triggered when an item is put on a queue. It looks something like this (greatly simplified):
public static async Task Run(string myQueueItem, TraceWriter log)
{
    using (var client = new HttpClient())
    {
        client.BaseAddress = new Uri(Config.APIUri);
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

        StringContent httpContent = new StringContent(myQueueItem, Encoding.UTF8, "application/json");
        HttpResponseMessage response = await client.PostAsync("/api/devices/data", httpContent);
        response.EnsureSuccessStatusCode();

        string json = await response.Content.ReadAsStringAsync();
        ApiResponse apiResponse = JsonConvert.DeserializeObject<ApiResponse>(json);
        log.Info($"Activity data successfully sent to platform in {apiResponse.elapsed}ms. Tracking number: {apiResponse.tracking}");
    }
}
This all works great and runs pretty well. Every time an item is put on the queue, we send the data to some API on our side and log the response. Cool.
The problem happens when there's a big spike in "the thing that generates queue messages" and a lot of items are put on the queue at once. This tends to happen around 1,000 - 1,500 items in a minute. The error log will have something like this:
2017-02-14T01:45:31.692 mscorlib: Exception while executing function: Functions.SendToLimeade. f-SendToLimeade__-1078179529: An error occurred while sending the request. System: Unable to connect to the remote server. System: Only one usage of each socket address (protocol/network address/port) is normally permitted 123.123.123.123:443.
At first, I thought this was an issue with the Azure Function app running out of local sockets, as illustrated here. However, then I noticed the IP address. The IP address 123.123.123.123 (of course changed for this example) is our IP address, the one that the HttpClient is posting to. So, now I'm wondering if it is our servers running out of sockets to handle these requests.
Either way, we have a scaling issue going on here. I'm trying to figure out the best way to solve it.
Some ideas:
If it's a local socket limitation, the article above has an example of increasing the local port range using Req.ServicePoint.BindIPEndPointDelegate. This seems promising, but what do you do when you truly need to scale? I don't want this problem coming back in 2 years.
If it's a remote limitation, it looks like I can control how many messages the Functions runtime will process at once. There's an interesting article here that says you can set serviceBus.maxConcurrentCalls to 1 and only a single message will be processed at once. Maybe I could set this to a relatively low number. Now, at some point our queue will be filling up faster than we can process them, but at that point the answer is adding more servers on our end.
Multiple Azure Functions apps? What happens if I have more than one Azure Functions app and they all trigger on the same queue? Is Azure smart enough to divvy up the work among the Function apps and I could have an army of machines processing my queue, which could be scaled up or down as needed?
I've also come across keep-alives. It seems to me if I could somehow keep my socket open as queue messages were flooding in, it could perhaps help greatly. Is this possible, and any tips on how I'd go about doing this?
Any insight on a recommended (scalable!) design for this sort of system would be greatly appreciated!
I think the code error is because of: using (var client = new HttpClient())
Quoted from Improper instantiation antipattern:
this technique is not scalable. A new HttpClient object is created for each user request. Under heavy load, the web server may exhaust the number of available sockets.
I think I've figured out a solution for this. I've been running these changes for the past 6 hours, and I've had zero socket errors. Before, I would get these errors in large batches every 30 minutes or so.
First, I added a new class to manage the HttpClient.
public static class Connection
{
    public static HttpClient Client { get; private set; }

    static Connection()
    {
        Client = new HttpClient();
        Client.BaseAddress = new Uri(Config.APIUri);
        Client.DefaultRequestHeaders.Add("Connection", "Keep-Alive");
        Client.DefaultRequestHeaders.Add("Keep-Alive", "timeout=600");
        Client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    }
}
Now, we have a static instance of HttpClient that we use for every call to the function. From my research, keeping HttpClient instances around for as long as possible is highly recommended; everything is thread safe, and HttpClient will queue up and optimize requests to the same host. Notice I also set the Keep-Alive headers (I think this is the default, but I figured I'd be explicit).
In my function, I just grab the static HttpClient instance like:
var client = Connection.Client;
StringContent httpContent = new StringContent(myQueueItem, Encoding.UTF8, "application/json");
HttpResponseMessage response = await client.PostAsync("/api/devices/data", httpContent);
response.EnsureSuccessStatusCode();
I haven't really done any in-depth analysis of what's happening at the socket level (I'll have to ask our IT guys if they're able to see this traffic on the load balancer), but I'm hoping it just keeps a single socket open to our server and makes a bunch of HTTP calls as the queue items are processed. Anyway, whatever it's doing seems to be working. Maybe someone has some thoughts on how to improve.
If you use consumption plan instead of Functions on a dedicated web app, #3 more or less occurs out of the box. Functions will detect that you have a large queue of messages and will add instances until queue length stabilizes.
maxConcurrentCalls only applies per instance, allowing you to limit per-instance concurrency. Basically, your processing rate is maxConcurrentCalls * instanceCount.
The only way to control global throughput would be to use Functions on dedicated web apps of the size you choose. Each app will poll the queue and grab work as necessary.
The best scaling solution would improve the load balancing on 123.123.123.123 so that it can handle any number of requests from Functions scaling up/down to meet queue pressure.
Keep alive afaik is useful for persistent connections, but function executions aren't viewed as a persistent connection. In the future we are trying to add 'bring your own binding' to Functions, which would allow you to implement connection pooling if you liked.
I know the question was answered long ago, but in the meantime Microsoft has documented the anti-pattern that you were using.
Improper Instantiation antipattern
