Memory leak/consumption in Hubs due to large messages? - iis

I'm currently trying SignalR and RabbitMQ in order to round-robin / load balance JSON web service queries, and I'm having trouble with the memory consumption of one of the applications when it processes large (~300 - 2500 KB) messages.
I have an IIS server hosting a web application (named "Backend") that needs to query another web application (named "Pricing"), also hosted by an IIS server.
In order to keep a connection alive with my RabbitMQ server, I developed console applications that are connected to Backend and Pricing using SignalR.
So when Backend needs to query Pricing, it asks its console to publish the message in the queue, and the console attached to Pricing takes the message and gives it to Pricing (with the Invoke<> method). When Pricing has finished its job, it asks its console to publish the reply message, and the console attached to Backend takes it and gives it to Backend.
To sum up :
[Backend] -> [Console] -> [RabbitMQ] <- [Console] <- [Pricing]
And I have 2 Pricing instances, each taking messages from the RabbitMQ queue through its own console.
This setup is meant to replace a traditional web service query between the 2 IIS servers and benefit from the advantages of RabbitMQ (load balancing and asynchronous calls in a micro/web services architecture).
I added
GlobalHost.Configuration.MaxIncomingWebSocketMessageSize = null;
in Startup.cs on both IIS servers in order to accept large messages.
When I look at Pricing's memory consumption in Windows Task Manager, it quickly grows from 500 MB to 1500 MB (in 5 minutes, while handling a never-ending stream of queries from Backend to test the setup).
I tried something else: writing the query content to files in a shared folder and publishing only the file name in the RabbitMQ messages. With that change (and of course a code modification to load the file), Pricing's memory consumption doesn't move and stays around 500 MB.
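The file-based workaround described above can be sketched roughly as follows. This is only an illustration in Python, not the code actually used: the function names store_payload and load_payload are hypothetical, and a temporary directory stands in for the shared folder. The point is that only the short file name travels through the queue and SignalR, never the large body.

```python
import json
import tempfile
import uuid
from pathlib import Path


def store_payload(shared_dir: Path, payload: dict) -> str:
    """Write the payload to the shared folder and return only its file name."""
    file_name = f"{uuid.uuid4().hex}.json"
    (shared_dir / file_name).write_text(json.dumps(payload), encoding="utf-8")
    return file_name  # this short string is what would go into the queue message


def load_payload(shared_dir: Path, file_name: str) -> dict:
    """Consumer side: read the payload back from the shared folder."""
    return json.loads((shared_dir / file_name).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)  # stands in for the shared folder
        name = store_payload(shared, {"query": "x" * 1000})
        assert load_payload(shared, name)["query"] == "x" * 1000
```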
So it seems to have something to do with the length of the messages that my console passes to IIS.
I tried disconnecting the console from the IIS Hubs because I thought it might free some memory, but nope.
Has anyone experienced this issue of memory consumption caused by large messages in Hubs? How can I check whether there is indeed a memory leak in my application?
What about using SignalR and RabbitMQ in a web/micro services environment? Any feedback?
Many thanks,
Jean-Francois
.NETFramework : 4.5
Microsoft.AspNet.SignalR : 2.4.1

So it seems that the version of SignalR I use (the .NET Framework one) lets you tune the number of messages per hub per connection kept in memory.
I set it to an arbitrary 50 in Startup.cs
GlobalHost.Configuration.DefaultMessageBufferSize = 50;
Its default value is 1000, meaning (if I understood it correctly) that IIS keeps a circular buffer of 1000 messages in memory. Some of the messages weighed 2.5 MB, meaning that the memory used could go up to 2500 MB per connection.
As my IIS server only has one connection (its console) and doesn't need to keep track of messages (as it works as a web service), 1000 messages seems like way too much.
With the limit of 50 messages, the memory used by the application in Windows Task Manager stays put (around 500 MB).
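To make the arithmetic explicit, here is a small back-of-the-envelope sketch in Python. The function name is hypothetical, and it assumes the worst case where every buffer slot holds a message of the maximum size (2.5 MB), so it gives an upper bound rather than a measurement.

```python
def worst_case_buffer_mb(buffer_size: int, max_message_mb: float, connections: int = 1) -> float:
    """Upper bound on memory held by the message buffer, in MB.

    Assumes every slot of the buffer is filled with a maximum-size message.
    """
    return buffer_size * max_message_mb * connections


print(worst_case_buffer_mb(1000, 2.5))  # default buffer size: 2500.0
print(worst_case_buffer_mb(50, 2.5))    # reduced buffer size: 125.0
```

So with a single connection, going from 1000 to 50 buffered messages lowers the worst case from about 2500 MB to about 125 MB.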
Is there any flaw in the way I'm using it?
Thanks !

Related

Azure slow communication between APIs

In some 1-5% of our requests, we are seeing slow communication between APIs (REST API requests). Both APIs are developed by us and hosted on Azure, each app service on its own app service plan in the same region, P1v2 tier.
What we are seeing on application insights is that POST or GET requests on origin API can take a few seconds to execute, while real execution time on destination API is only a few milliseconds.
Examples (first line POST request on origin, second execution time on destination API): slow req 1, slow req 2
Our best guess is that the time difference is lost in communication between components. We don't have an explanation for it since the payload is really small and in most cases, communication takes less than 5 milliseconds.
We dismiss the possible explanation it could be due to component cold start since it happens during constant load and no horizontal scaling was performed.
Do you have any idea what might cause it or how to do additional analysis in order to discover it?
If you're running multiple sites on the App Service Plan, enable the "Always On" setting: your web app > All Settings > Application Settings > Always On.
See here for details: https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
When Always On is off, the site is shut down after 20 minutes of inactivity to free up resources for any additional websites that might be using the same App Service Plan.
The amount of information it needs to collect, process and then present requires some time and involves internal calls as well, which is why, depending on server load and usage, it takes around 6 to 7 seconds, sometimes even more.
To troubleshoot that latency, try these steps provided by Microsoft.

How to find/cure source of function app throughput issues

I have an Azure function app triggered by an HttpRequest. The function app reads the request, tosses one copy of it into a storage table for safekeeping and sends another copy to a queue for further processing by another element of the system. I have a client running an ApacheBench test that reports approximately 148 requests per second processed. That rate of processing will not be enough for our expected load.
My understanding of function apps is that it should spawn as many instances as is needed to handle the load sent to it. But this function app might not be scaling out quickly enough as it’s only handling that 148 requests per second. I need it to handle at least 200 requests per second.
I’m not 100% sure the problem is on my end, though. In analyzing the performance of my function app I found a LOT of 429 errors. What I found online, particularly https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits, suggests that these errors could be due to too many requests being sent from a single IP. Would several ApacheBench 10K and 20K request load tests within a given day cause the 429 error?
However, if that’s not it, if the problem is with my function app, how can I force my function app to spawn more instances more quickly? I assume this is the way to get more throughput per second. But I’m still very new at working with function apps so if there is a different way, I would more than welcome your input.
Maybe the Premium app service plan that’s in public preview would handle more throughput? I’ve thought about switching over to that and running a quick test but am unsure if I’d be able to switch back?
Maybe EventHub is something I need to investigate? Is that something that might increase my apparent throughput by catching more requests and holding on to them until the function app could accept and process them?
Thanks in advance for any assistance you can give.
You don't provide much context about your app, but here are a few steps you can take to improve it:
If you want more control, you need to use an App Service plan with Always On to avoid cold starts. You will also need to configure auto scaling, since you are responsible for it in this plan and auto scale is not enabled by default in an App Service plan.
Your Azure function must be fully async: you have external dependencies, so you don't want to block a thread while you are calling them.
Look at the limits. You can tweak them using host.json.
A 429 error means that the function is too busy to process your request, so probably when you are writing to the table you are not using async and are blocking a thread.
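The "fully async" point can be illustrated with a minimal sketch. This is Python with asyncio rather than the actual function app code, and save_to_table and send_to_queue are hypothetical stand-ins for the storage table write and the queue send. The idea is that both calls are awaited concurrently instead of blocking a thread one after the other.

```python
import asyncio


async def save_to_table(request_body: str) -> str:
    # Hypothetical stand-in for an awaitable storage table write.
    await asyncio.sleep(0.01)
    return f"table:{len(request_body)}"


async def send_to_queue(request_body: str) -> str:
    # Hypothetical stand-in for an awaitable queue send.
    await asyncio.sleep(0.01)
    return f"queue:{len(request_body)}"


async def handle_request(request_body: str) -> list:
    # Both writes run concurrently; no thread is blocked while waiting on them.
    results = await asyncio.gather(
        save_to_table(request_body),
        send_to_queue(request_body),
    )
    return list(results)


if __name__ == "__main__":
    print(asyncio.run(handle_request("abc")))
```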
Function apps work very well and scale as advertised. It could be because the requests are coming from a single IP and Azure could be treating it as a DDoS. You can do the following:
Azure DevOps Load Test
You can load test using one of the Azure services. I am very sure they have better criteria for handling IPs. Azure DevOps Load Test
Provision VM in Azure
What I normally do is provision a VM (Windows 10 Pro) in Azure and use JMeter to load test. I have used this method and it works fine. You can provision a couple of them and split the load between them.
Use professional Load testing services
If possible, you may use services like Loader.io. They use sophisticated algorithms to run the load test and provision a bunch of VMs to run the same test.
Use Application Insights
If you are not already, you should be using Application Insights to get a better look from the server's perspective. Go to the live stream and see how many instances it provisions to handle the load test. You can easily look into events and error logs that may be arising and investigate. You can deep dive into each associated dependency and investigate the problem.

Stress testing Azure App Service periodically stops processing requests

I am currently stress testing a .NET Core application, targeting netcoreapp2.2, that is hosted on Azure as an App Service connected to a P1V2 (210 ACU, 3.5 GB memory) service plan with 2 instances.
The endpoint that I'm stress testing is very simple: it validates an OAuth 2.0 token, gets the user and some info about the user from a P2 (250 DTU) Azure hosted database (4 DB queries in total per request), and returns the string "Pong".
When running 15 concurrent users (or more) in 200 loops I see the stop(s) in processing seen in the image (between the high peaks). The service plan never hits more than around 20-35% CPU and the database never uses more than 2% load. Increasing the users decreases the average throughput.
When looking at the slow requests, it is like processing just randomly stops, never at the same place. When I look at the DB requests, I never see a request that takes longer than a couple of hundred milliseconds, while some requests can take upwards of 5-6 s to process.
It feels like I reach some limit which results in something stopping for a period of time, but I can't figure out where the problem lies.
When running the same stress locally I don't see these stops.
I'm using jmeter cli to run the stress tests against both environments.
Any help is greatly appreciated, thanks!
This could be because of Azure DDoS protection behaviour.
If your application is being attacked by a DDoS attack, Microsoft will stop all connections to your endpoint, in effect taking down your service.
To avoid this you need to set up a Web Application Firewall (WAF) to exclude healthy requests.

Designing a message processing system

I have been asked to create a message processing system as follows. As I am not sure if this is the right place to post this, feel free to move it to any other appropriate SC group.
Problem
The server has about 100 to 500 clients connected at any moment. When a client connects to the server, the server loads part of their data and caches it in memory for faster access. The server will receive between 200 and 1000 messages per second across all clients. These messages are relatively small (about 500 bytes). Any changes to data in the cache should be saved to disk as soon as possible. When a client disconnects, all their data is saved to disk and removed from the cache. Each message contains some instruction and a text message which will be saved as a file. Instructions should be executed as fast as possible (near instant) and all clients using that file should get the update. Only writing the modified message to disk can be delayed.
Here is my solution in a diagram
My solution consists of a web server (HTTP or socket), a message queue, and two or more instances of the file server and instruction server.
The web server grabs client messages and, if there is a message available for the client in the message queue, pushes it back to the client.
The instruction processor grabs instructions from the queue, creates the necessary message to be processed by the file server (get/set file), waits for the file to be available in the queue, and then does further processing to create another message for the client.
The file server only provides the files, either from cache or from a physical file depending on the type of file.
Concerns:
There are peak times when total connected clients might go over 10000 at once and total messages received from clients increase to 10~15K.
I should be able to clear the queue and go back to the normal state as soon as possible (while processing the requests, obviously).
I should be able to add extra instruction processors and file servers on the fly without having to shut down the other instances.
In case the file server crashes, it shouldn't lose files, so it has to write files to disk as soon as there are any changes and processing time is available.
The file system should be in B+ tree format so some applications (local reporting apps) can easily access files without having to go through the queue server.
My Solution
I am thinking of using node.js for the socket/web server, and maybe a NoSQL database for the file server and a queue server such as RabbitMQ or Node_Redis and Redis.
Questions:
Is there a better way of structuring this system?
What are my other options for components of this system?
Is it possible to run all the instances on the same server machine or even in the same application (in different threads)?
You have a couple of holes here, mostly around the web server "pushing" the message back to the client. That doesn't really work in a web-based world. You can try and use websockets, but generally, this ends up being polling based.
I don't know what the "instructions" to be executed are, but saving 1000 500-byte messages is trivial. Many NoSQL solutions boast million+ writes per second capacity, especially if you let committing to disk lag.
Don't bother with the queue for the return of the file. A good NoSQL solution will scale better. Build out a Cassandra cluster and load test it until it can handle your peak load.
This simplifies your architecture into one or more web servers, clients polling those servers for file updates, a queue for submitting "messages" to the "instruction server" (also known as an application server in web-developer terms), and a NoSQL database for the instruction server to write files to.
This makes scaling easy, you can always add more web servers, and with a decent cluster size for your no-sql server, you should get to scale horizontally there as well. Your only real bottleneck is your instruction server queue, which you could always throw more instruction servers at.
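As a rough illustration of "throwing more instruction servers at the queue", here is a minimal in-process sketch in Python. Everything in it is an assumption for illustration: worker threads stand in for instruction servers, a queue.Queue stands in for the message queue, and a plain dict stands in for the NoSQL database. In a real deployment these would be separate processes talking to a real broker and database.

```python
import queue
import threading


def run_instruction_servers(messages, worker_count=2):
    """Process (file_id, text) messages from one shared queue with several workers."""
    work = queue.Queue()
    store = {}               # stands in for the NoSQL database
    lock = threading.Lock()  # protects the shared dict

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                work.task_done()
                return
            file_id, text = item
            with lock:
                store[file_id] = text  # "write the file"
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for message in messages:
        work.put(message)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for t in threads:
        t.join()
    return store
```

Raising worker_count is the in-process analogue of adding instruction servers: the queue is the single shared entry point, and the consumers scale independently of it.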

High amount of http read timeouts on azure

When we migrated our apps to azure from rackspace, we saw almost 50% of http requests getting read timeouts.
We tried placing the client both inside and outside azure with the same results. The client in this case is also a server btw, so no geographic/browser issues either.
We even tried increasing the size of the box to ensure azure wasn't throttling. But even using D boxes for a single request, the result was the same.
Once we moved our apps out of azure they started functioning properly again.
Each query was done directly on an instance using a public ip, so no load balancer issues either.
Almost 50% of queries ran into this issue. The timeout was set to 15 minutes.
Region was US East 2
Having 50% of HTTP requests time out is not normal behavior. You need to analyze what is causing those timeouts, starting by validating that the requests are hitting your VM. For this, I would recommend running a packet capture on your server and analyzing response times, as well as looking for a high number of retransmissions. It is even better if you can take a simultaneous network trace on your client machines so you can do TCP sequence number analysis and compare packets sent vs received.
If you are seeing high latencies or a high number of retransmissions in the packet capture, it requires detailed analysis. I strongly suggest you open a support incident so Microsoft support can help you investigate your issue further.

Resources