I have a Web server (Web01) setup in VM. Currently, I facing performance issue on the Web Server, The bottleneck is too many request, the web server is not enough process power to execute. So I have 2 options to resolve this problem.
Increase CPU and Memory
Setup Web02 in VM (Same VM Host of Web01) and build NLB.
I don't know above 2 options which is the best. Actually, I struggle option 2 that if I setup 2 web server's in same VM host, is the performance is bester than option 1?
I can share some thoughts with you on the pros and cons of NLB, but I can't directly help you make a choice.
Network load balancing has several potential advantages. By distributing network traffic among multiple servers or virtual machines, processing is faster than if all traffic flows through a single server. If demand decreases, servers can be taken offline, and the feature will balance traffic among the remaining hosts. NLB provides fault tolerance at the network layer, ensuring that connections are not directed to servers that are down. Network Load Balancing also enables organizations to rapidly scale server applications by adding hosts and then distributing the application's traffic among the new hosts.
But it still has some drawbacks. It cannot detect service interruption, only by IP address. If a particular server service fails, WNLB cannot detect the failure and will still route requests to that server. The current CPU load and RAM utilization of each server cannot be considered when distributing the client load.
Related
We are having difficulties choosing a load balancing solution (Load Balancer, Application Gateway, Traffic Manager, Front Door) for IIS websites on Azure VMs. The simple use case when there are 2 identical sites is covered well – just use Azure Load Balancer or Application Gateway. However, in cases when we would like to update websites and test those updates, we encounter limitation of load balancing solutions.
For example, if we would like to update IIS websites on VM1 and test those updates, the strategy would be:
Point a load balancer to VM2.
Update IIS website on VM1
Test the changes
If all tests are passed then point the load balancer to VM1 only, while we update VM2.
Point the load balancer to both VMs
We would like to know what is the best solution for directing traffic to only one VM. So far, we only see one option – removing a VM from backend address pool then returning it back and repeating the process for other VMs. Surely, there must be a better way to direct 100% of traffic to only one (or to specific VMs), right?
Update:
We ended up blocking the connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
Without any Azure specifics, the usual pattern is to point a load balancer to a /status endpoint of your process, and to design the endpoint behavior according to your needs, eg:
When a service is first deployed its status is 'pending"
When you deem it healthy, eg all tests pass, do a POST /status to update it
The service then returns status 'ok'
Meanwhile the load balancer polls the /status endpoint every minute and knows to mark down / exclude forwarding for any servers not in the 'ok' state.
Some load balancers / gateways may work best with HTTP status codes whereas others may be able to read response text from the status endpoint. Pretty much all of them will support this general behavior though - you should not need an expensive solution.
We ended up blocking connection between VMs and Load Balancer by creating Network Security Group rule with Deny action on Service Tag Load Balancer. Once we want that particular VM to be accessible again we switch the NSG rule from Deny to Allow.
The downside of this approach is that it takes 1-3 minutes for the changes to take an effect. Continuous Delivery with Azure Load Balancer
If anybody can think of a faster (or instantaneous) solution for this, please let me know.
I had exactly the same requirement in an Azure environment which I built a few years ago. Azure Front Door didn't exist, and I had looked into using the Azure API to automate the process of adding and removing backend servers the way you described. It worked sometimes, but I found the Azure API was unreliable (lots of 503s reconfiguring the load balancer) and very slow to divert traffic to/from servers as I added or removed them from my cluster.
The solution that follows probably won't be well received if you are looking for an answer which purely relies upon Azure resources, but this is what I devised:
I configured an Azure load balancer with the simplest possible HTTP and HTTPS round-robin load balancing of requests on my external IP to two small Azure VMs running Debian with HAProxy. I then configured each HAProxy VM with backends for the actual IIS servers. I configured the two HAProxy VMs in an availability set such that Microsoft should not ever reboot them simultaneously for maintenance.
HAProxy is an excellent and very robust load balancer, and it supports nearly every imaginable load balancing scenario, and crucially for your question, it also supports listening on a socket to control the status of the backends. I configured the following in the global section of my haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats socket ipv4#192.168.95.100:9001 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
In my example, 192.168.95.100 is the first HAProxy VM, and 192.168.95.101 is the second. On the second server, these lines would be identical except for its internal IP.
Let's say you have an HAProxy frontend and backend for your HTTPS traffic to two web servers, ws1pro and ws2pro with the IPs 192.168.95.10 and 192.168.95.11 respectively. For simplicity sake, I'll assume we don't need to worry about HTTP session state differences across the two servers (e.g. Out-of-Process session state) so we just divert HTTPS connections to one node or the other:
listen stats
bind *:8080
mode http
stats enable
stats refresh 10s
stats show-desc Load Balancer
stats show-legends
stats uri /
frontend www_https
bind *:443
mode tcp
option tcplog
default_backend backend_https
backend backend_https
mode tcp
balance roundrobin
server ws1pro 192.168.95.10:443 check inter 5s
server ws2pro 192.168.95.11:443 check inter 5s
With the configuration above, since both HAProxy VMs are listening for admin commands on port 9001, and the Azure load balancer is sending the client's requests to either VM, we need to tell both servers to disable the same backend.
I used Socat to send the cluster control commands. You could do this from a Linux VM, but there is also a Windows version of Socat, and I used the Windows version in a set of really simple batch files. The cluster control commands would actually be the same in BASH.
stop_ws1pro.bat:
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo disable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
start_ws1pro.bat:
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.100:9001
echo enable server backend_https/ws1pro | socat - TCP4:192.168.95.101:9001
These admin commands execute almost instantly. Since the HAProxy configuration above enables the stats page, you should be able to watch the status change happen on the stats page as soon as it refreshes. The stats page will show the connections or sessions draining from the server you disabled over to the remaining enabled servers when you disable a backend, and then show them returning to the server once it is enabled again.
I have been working on the Azure monitoring side for a while. I need your inputs for one of my requirement.
We have lot of IaaS VM’s both SQL and Non-SQL provisioned in our subscriptions. We are paying non-trivial amount for these VM’s. I am trying to come up w/ solution to identify low usage machines and during what times( night, early morning etc) the usage is very low. With this, I can take an action by either shut down VM’s during low usage period or reduce VM size.
For this, I am trying couple of options like Azure Advisor, Azure metrics for CPU usage, Network I/O, Disk Read/Write parameters. But considering only these might not help. Because, your network I/O might be having load balancer requests which cannot be considered.
So I need to come up w/ actual IIS requests went in during the given period.
Can you recommend on how to identify low usage VM’s? It would be a great help.
Can you recommend on how to identify low usage VM’s?
Generally, we will based on CPU usage to identify low usage VM.
Network traffic in or out of one application might be having load balancer requests, but network traffic in or out of this VM will not have load balancer, we also can use this to identify low usage VM.
So I need to come up w/ actual IIS requests went in during the given
period.
We can use OMS to monitor IIS request of each VMs in Azure, please follow this article to configure OMS.
like this:
Also we can config zabbix on one Azure VM and use that to monitor all VMs.
We've got an API micro-services infrastructure hosted on Azure VMs. Each VM will host several APIs which are separate sites running on Kestrel. All external traffic comes in through an RP (running on IIS).
We have some API's that are designed to accept external requests and some that are internal APIs only.
The internal APIs are hosted on scalesets with each scaleset VM being a replica that hosts all of the internal APIs. There is an internal load balancer(ILB)/vip in front of the scaleset. The root issue is that we have internal APIs that call other internal APIs that are hosted on the same scaleset. Ideally these calls would go to the VIP (using internal DNS) and the VIP would route to one of the machines in the scaleset. But it looks like Azure doesn't allow this...per the documentation:
You cannot access the ILB VIP from the same Virtual Machines that are being load-balanced
So how do people set this up with micro-services? I can see three ways, none of which are ideal:
Separate out the APIs to different scalesets. Not ideal as the
services are very lightweight and I don't want to triple my Azure VM
expenses.
Convert the internal LB to an external LB (add a public
IP address). Then put that LB in it's own network security
group/subnet to only allow calls from our Azure IP range. I would
expect more latency here and exposing the endpoints externally in
any way creates more attack surface area as well as more
configuration complexity.
Set up the VM to loopback if it needs a call to the ILB...meaning any requests originating from a VM will be
handled by the same VM. This defeats the purpose of micro-services
behind a VIP. An internal micro-service may be down on the same
machine for some reason and available on another...thats' the reason
we set up health probes on the ILB for each service separately. If
it just goes back to the same machine, you lose resiliency.
Any pointers on how others have approached this would be appreciated.
Thanks!
I think your problem is related to service discovery.
Load balancers are not designed for that obviously. You should consider dedicated softwares such as Eureka (which can work outside of AWS).
Service discovery makes your microservices call directly each others after being discovered.
Also take a look at client-side load balancing tools such as Ribbon.
#Cdelmas answer is awesome on Service Discovery. Please allow me to add my thoughts:
For services such as yours, you can also look into Netflix's ZUUL proxy for Server and Client side load balancing. You could even Use Histrix on top of Eureka for latency and Fault tolerance. Netflix is way ahead of the game on this.
You may also look into Consul.io product for your cause if you want to use GO language. It has a scriptable configuration for better managing your services, allows advanced security configurations and usage of non-rest endpoints. Eureka also does these but requires you add a configuration Server (Netflix Archaius, Apache Zookeeper, Spring Cloud Config), coded security and support accesses using ZUUL/Sidecar.
Can you please give me a better understanding of how we can scale the stateless services without partitioning?
Say we have 5 nodes in a cluster and we have 5 instances of the service. On simple testing a node is behaving as sticky where all the requests I am sending are being served by only one node. In the scenario when we have high volume of requests that come in, can other instances be automatically used to serve the traffic. How do we handle such scale out situations in service fabric?
Thanks!
Usually there's no need to use partitioning for stateless SF services, so avoid that if you can:
more on SF partitioning, including why its not normally used for stateless services
If you're using the ServiceProxy API, it will maintain sticky connections to a given physical node in the cluster. If you're (say) exposing HTTP endpoints, you'll have one for each physical instance in the cluster (meaning you'll end up talking to one at a time, unless you manually cycle thru them). You can avoid this by:
Creating a new proxy instance for each call, which tends to be expensive if you do it alot (or manually cycle thru the list of instance endpoint URLs, which can be tedious and/or expensive)
Put a load balancer in front of your cluster and configure all traffic from your clients to SF nodes to be forwarded thru that. The load balancer can be configured for Round-Robin, etc. style semantics:
Azure Load Balancer
Azure Traffic Manager
Good luck!
You can query the request using the reverse proxy installed on each node. Using the https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reverseproxy
The reverse proxy then resolve the endpoint for you. If you have multiple instances of the a stateless service then it will forward your request to a random one.
If during heavy load you can increase the instance count of your service and the proxy then include the new instances automatically.
I will assume you are calling your services from outside your cluster. If yes, your problem is not specific for Service Fabric, it is Azure VMSS + LB.
Service Fabric runs on top of Virtual Machines Scale Set, these VMs are created behind a Load Balancer, when the client connects to your service, they are creating a connection through the load balancer to your service, whenever a connection is open, the load balancer assign one target VM for handling your request, and any request made from your client, while using the same connection(keep alive), will be handled by the same node, this is why your load goes to a single node.
LB won't round robin the requests because they are using the same connection, it is a limitation(feature) of the LB, to work around this problem, you should open multiple connections or use multiple clients(instances).
This is for default distribution mode(Hash-based). You have to check also the routing rules in the LB to check if the distribution mode is Hash-based(5 tuple= ip+port) or if it is IP affinity mode(ip only), otherwise multiple connections from same IP will still be linked to same node.
Source: Azure Load Balaner Distribution Mode
I am starting to develop a webservice that will be hosted in the cloud but needs higher availability than typical cloud SLAs provide.
Typical SLAs, e.g. Windows Azure, promise an availability of 99.9%, i.e. up to 43min downtime per month. I am looking for an order of magnitude better availability (<5min down time per month). While I can configure several load balanced database back-ends to resolve that part of the issue I see a bottleneck at the webserver. If the webserver fails, the whole service is unavailable to the customer. What are the options of reducing that risk without introducing another possible single point of failure? I see the following solutions and drawbacks to each:
SRV-record:
I duplicate the whole infrastructure (and take care that the databases are in sync) and add additional SRV records for the domain so that the user tying to access www.example.com will automatically get forwarded to example.cloud1.com or if that one is offline to example.cloud2.com. Googling around it seems that SRV records are not supported by any major browser, is that true?
second A-record:
Add an additional A-record as alternatives. Drawbacks:
a) at my hosting provider I do not see any possibility to add a second A-record but just one... is that normal?
b)if one server of two servers are down I am not sure if the user gets automatically re-directed to the other one or 50% of all users get a 404 or some other error
Any clues for a best-practice would be appreciated
Cheers,
Sebastian
The availability of the instance i.e. SLA when specified by the Cloud Provider means the "Instance's Health is server running in the context of Hypervisor or Fabric Controller". With that said, you need to take an effort and ensure the instance is not failing because of your app / OS / or pretty much anything running inside the instance. There are few things which devops tend to miss and that kind of hit back hard like for instance - forgetting to configure the OS Updates and Patches.
The fundamental axiom with the availability is the redundancy. More redundant your application / infrastructure is more availabile is your app.
I recommend your to look into the Azure Traffic Manager and then re-work on your architecture. You need not worry about the SRV record or A-Record. Just a CNAME for the traffic manager would do the trick.
The idea of traffic manager is simple, you can tell the traffic
manager to stand after the domain name ( domain name resolution of the
app ) then the traffic manager decides where to send the request on
considerations of factors like Round-Robin, Disaster Management etc.
With the combination of the Traffic Manager and multi-region infrastructure setup; you will march towards the high availability goal.
Links
Azure Traffic Manager Overview
Cloud Power: How to scale Azure Websites globally with Traffic Manager
Maybe You should configure a corosync cluster with DRBD ?
DRBD will ensure You that the data on both nodes are replicated (for example website files and db files).
Apache as web server will be available under a virtual IP to which domain is pointed. In case of one server is down corosync will move all services to second server within few seconds.