Azure Cloud Service Worker Role not running after reboot or publish

I have an Azure Cloud Service worker role running, with only one role instance.
The worker role acts as a TCP server listening on a port which is configured in the service definition file.
So once the role instance is running, my TCP client program is able to connect to the worker role.
But every time I reboot the role instance or publish a new version from Visual Studio, I wait for the reboot or publish to finish, the Azure portal says the status is Running, and yet the TCP client program is still unable to connect to the server. BUT, without my doing anything, about 10 minutes later it fixes itself and the TCP client is able to connect again.
Where does this 10-minute delay come from?
I thought that as soon as the role instance's status becomes Running, it should work again.
First, I thought it was the load balancer. But I remoted into the role instance and ran netstat -a from the command line, and the port was not even listening. So it seems my worker role code is not running?
Ten minutes later, when connecting worked again, I went back in over Remote Desktop and ran netstat -a again; now the port was listening.
So, after a reboot/publish, do I have to wait 10 minutes for my worker role code to start running?
Or am I missing something here?

Hard to say, but the following references should help:
http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx. This gives you the architecture of the processes running inside your service. When you RDP and netstat shows the port is not listening, what do you see as far as processes? Is WaWorkerHost.exe running?
http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx. This walks through all of the diagnostic data typically used to troubleshoot an issue in an Azure PaaS VM. If you check those logs and event logs, do you see anything that stands out between the time when you can't connect and the time when you can?
You can check the Windows Azure event log to see when your OnStart() and Run() methods are started and stopped. If you see that Run() has started but netstat still shows the port as not listening, then you know the problem is in your code and you may need to step through with a debugger (you can set up the remote debugger so you can use Visual Studio on your desktop to debug the Azure VM - http://blogs.msdn.com/b/cie/archive/2014/01/24/windows-azure-remote-debugging.aspx).
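For reference, here is a minimal sketch of what the Run() method of such a worker role typically looks like (the endpoint name "TcpIn" and the trace messages are assumptions, not taken from the question). Tracing at the top of Run() makes the event-log check above much more useful:

using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        // Trace as early as possible so the diagnostics logs show
        // exactly when Run() actually began executing.
        Trace.TraceInformation("WorkerRole Run() entered");

        // "TcpIn" must match the endpoint name in ServiceDefinition.csdef.
        IPEndPoint endpoint = RoleEnvironment.CurrentRoleInstance
            .InstanceEndpoints["TcpIn"].IPEndpoint;

        var listener = new TcpListener(endpoint);
        listener.Start(); // only from this point on does netstat show the port as LISTENING
        Trace.TraceInformation("Listening on " + endpoint);

        while (true)
        {
            TcpClient client = listener.AcceptTcpClient();
            // ... hand the connection off to your protocol handler ...
        }
    }
}

If the trace line never appears after a reboot, the delay is happening before your code runs (host bootstrap); if it appears immediately but the port still isn't listening, the delay is inside your own startup path.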

Related

Azure AspNetCore WebApp under high load returns "The specified CGI application encountered an error and the server terminated the process"

I'm hosting my AspNetCore app in Azure (Windows hosting plan, P3v2). It works perfectly fine under normal load (5-10 requests/sec), but under high load (100-200 requests/sec) it starts to hang and requests return the following response:
The specified CGI application encountered an error and the server terminated the process.
And from the event logs I can get even more details:
An attempt was made to access a socket in a way forbidden by its access permissions aaa.bbb.ccc.ddd
I have to scale to 30 instances, and while each instance gets just 3-5 requests per second, it works just fine. I believe that 30 hosts is too many to process this load, that the resources are underutilized, and I'm trying to find the real bottleneck. If I set the instance count to 10, everything crashes and every request starts to return the error above. Resource utilization metrics for the high-load case with 30 instances enabled:
The service plan CPU usage is low, about 10-15% for each host
The service plan memory usage is around 30-40%
Dependencies respond quickly, 50-200 ms
Azure SQL DTU usage is about 5%
I discovered this useful article on current tier limits, and after running the Azure TCP connections diagnostics I identified a few possible issues:
High outbound TCP connection count
High TCP Socket handle count - High TCP Socket handle count was detected on the instance .... During this period, the process dotnet.exe of site ... with ProcessId 8144 had the maximum open handle count of 17004.
So I dug deeper and found the following information:
Per my service plan tier, my TCP connection limit should be 8064, which is far from the number displayed above. Next I checked the socket state:
Even though I can see that the number of active TCP connections is below the limit, I'm wondering whether the open socket handle count could be the issue here. What can cause this socket handle leak (if any)? How can I troubleshoot and debug it?
I see that you have tried to isolate the possible cause of the error; just highlighting some of the reasons to revalidate/remediate:
1. On Azure App Service, connection attempts to local addresses (e.g. localhost, 127.0.0.1) and to the machine's own IP will fail, unless another process in the same sandbox has created a listening socket on the destination port. Rejected connection attempts normally return the socket-forbidden error quoted above.
For a peered VNet/on-premises network, kindly ensure that the IP address used is in the ranges listed for routing to the VNet; incorrect routing produces the same error.
2. On Azure App Service, the outbound TCP connections on the VM instance may be exhausted; limits are enforced on the maximum number of outbound connections that can be made from each VM instance.
Other causes, as highlighted in this blog (see the sketch after this list):
Using client libraries which are not implemented to re-use TCP connections.
Application code or the client library is leaking TCP socket handles.
Burst load of requests opening too many TCP socket connections at once.
In the case of a higher-level protocol like HTTP, this is encountered if the Keep-Alive option is not leveraged.
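The most common version of the first two causes in .NET code is instantiating a new HttpClient per request instead of sharing one. A minimal sketch of the pattern to aim for (class and method names here are illustrative, not from the question):

using System.Net.Http;
using System.Threading.Tasks;

public static class ApiClient
{
    // One shared HttpClient for the process: connections are pooled
    // and kept alive instead of opening a new socket per request.
    private static readonly HttpClient Http = new HttpClient();

    public static Task<string> GetAsync(string url)
    {
        // Reuses pooled connections. Creating and disposing an
        // HttpClient per call leaves sockets in TIME_WAIT and drives
        // up the handle count under load.
        return Http.GetStringAsync(url);
    }
}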
I'm unsure whether you have already tried App Service Diagnostics to fetch more details; kindly give that a shot:
In the Azure portal, open the app in App Services.
Navigate to the Diagnose and solve problems blade.
Select Diagnose and solve problems > "TCP Connections".
Consider optimizing the application to implement connection pooling for your .NET code and observe the behavior locally. If feasible, restart the WebApp and then check whether that helps.
If the issue still persists, kindly file a support ticket for a detailed/deeper investigation of the backend logs.

Random timeouts at Node.js + gRPC application on Kubernetes

We have a weird networking issue.
We have a Hyperledger Fabric client application written in Node.js running in Kubernetes which communicates with an external Hyperledger Fabric Network.
We randomly get timeout errors on this communication. When the pod is restarted, everything works for a while, then the timeout errors start; sometimes it randomly fixes itself and then goes bad again.
This is Azure AKS; we also set up a quick Kubernetes cluster in AWS with Rancher, deployed the app there, and the same timeout error happened there too.
We ran scripts in the same container all night long, hitting the external Hyperledger endpoint both with cURL and a small Node.js script every minute, and we didn't get even a single error.
We ran the application in another VM as plain Docker containers and there was no issue there.
We inspected the network traffic inside the container. When the issue happens, netstat shows an established connection, but tcpdump shows no traffic; no packets are even attempted.
Checking the Hyperledger Fabric SDK code, it uses gRPC (with protocol buffers) behind the scenes.
So any clues maybe?
This turned out to be not a Kubernetes problem but a dropped-connection issue.
gRPC keeps the connection open, and after some period of inactivity an intermediary component drops it. In the Azure AKS case this is the load balancer, as every outbound connection goes through one. There is a non-configurable idle timeout of 4 minutes, after which the load balancer drops the connection.
The fix is configuring gRPC to send keepalive messages.
The scripts in the container worked without a problem because they open a new connection every time they run.
The application running as plain Docker containers didn't have this issue because we were hitting endpoints every minute, hence never reaching the idle-timeout threshold. When we hit endpoints every 10 minutes, the timeout issue started there too.
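The application in the question is Node.js, but the keepalive knobs are standard gRPC channel arguments with the same names across implementations. As an illustration, here is a C# (Grpc.Core) sketch with a hypothetical Fabric peer endpoint; the values are examples chosen to stay under the 4-minute idle timeout:

using Grpc.Core;

// Send a keepalive ping every 2 minutes (under the load balancer's
// 4-minute idle timeout), fail if it isn't acknowledged within 20 s,
// and allow pings even when no RPC is in flight.
var channel = new Channel("fabric-peer.example.com", 7051,
    ChannelCredentials.Insecure,
    new[]
    {
        new ChannelOption("grpc.keepalive_time_ms", 120000),
        new ChannelOption("grpc.keepalive_timeout_ms", 20000),
        new ChannelOption("grpc.keepalive_permit_without_calls", 1),
    });

The Node.js SDK accepts the same "grpc.keepalive_*" options on its channels, so the equivalent fix there is passing these options where the connection to the peer is created.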

IIS Manager Error - Unable to bind to the underlying transport for [::]:80. The process cannot access the file because it is being used by another process

I know that this question has been asked in multiple forums and has several versions of answers. Unfortunately, none of those answers helped me resolve my issue.
I stood up an AWS EC2 instance of Windows Server 2016 and installed IIS, MSMQ, Windows Process Activation Service, and a few other things. When I opened IIS Manager, I noticed that the "Default Web Site" was stopped, and when I tried to start it I got the error "The process cannot access the file because it is being used by another process (Exception from HRESULT: 0x80070020)". Digging a little more, I found these two exceptions in my Event Viewer:
Unable to bind to the underlying transport for [::]:80. The IP Listen-Only list may contain a reference to an interface which may not exist on this machine. The data field contains the error number.
The World Wide Web Publishing Service (WWW Service) did not register the URL prefix http://*:80/SmsHandler for site 1. The site has been disabled. The data field contains the error number.
Researching more online, I found more than two dozen articles on this issue, and more than 95% of them say that the application most likely conflicting with IIS and using ports 80 and 443 is Skype. But I DON'T HAVE SKYPE installed on my server.
I ran the "netstat -aon" command and found this:
C:\Windows\system32>netstat -aon | findstr :80
TCP 169.254.170.2:80 0.0.0.0:0 LISTENING 1164
Going by what's mentioned in other articles online, I tried to trace down PID 1164 in my Task Manager and found that it's the "Service Host - Local System" process, with 15 system services running under it. There's no way I can kill that process to make my IIS work.
I then tried to change the bindings in IIS to listen on a port other than 80 and was able to get it up and running. But I don't want IIS to run on any port other than 80, since I don't want users to have to specify the port in the URL every time they hit the website.
I'm now running short of ideas. Any suggestions would be greatly appreciated.
Thanks!
I ran into a similar issue, but not with port 80. In my case it was because the IP address [::] wasn't allowed to listen on any port. Adding it to the HTTP.sys IP listen list fixed the issue.
From an admin command prompt:
netsh http add iplisten ipaddress=::
From this thread.
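You can confirm the change took effect with netsh http show iplisten, which prints the addresses HTTP.sys is currently allowed to listen on; an empty list means HTTP.sys listens on all addresses.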
Found the culprit. It apparently wasn't Skype for me (as it is in most cases); it was a service called IP Helper that was listening on port 80 and conflicting with IIS. The way I found that out: I checked all the services running under the PID for Service Host - Local System (which in my case was 1164) and stopped them one at a time, checking whether IIS started working. Just wanted to close this thread. Hope this helps if someone else gets stuck with the same issue.
I had VMware Workstation installed; the solution was VMware -> Edit -> Preferences -> Shared VMs -> "Disable Sharing".

AWS EC2 Error: The site can't be reached - ec2.us-west-1.compute.amazonaws.com took too long to respond. Deploy NodeJS

I currently have an EC2 instance up and running with Amazon Linux, and I transferred my project (which contains both React/NodeJS/Express) onto the EC2 instance via SFTP using FileZilla.
For the EC2 instance's security groups, I opened port 3000 (protocol: tcp, source: 0.0.0.0/0), which is the port my Express server is configured for as well.
So I SSHed into the EC2 instance and ran the project's Express server, and I can see it listening on port 3000 in the terminal. But once I hit the public DNS at ec2...us-west-1.compute.amazonaws.com:3000, it says "The site can't be reached - ec2...us-west-1.compute.amazonaws.com took too long to respond."
What could be the issue, and how can I go about connecting to it?
Thank you in advance; I will upvote/accept an answer.
Just check if your Node.js server is running on the EC2 instance.
Debugging:
First check whether it is working properly locally.
Check for the Node.js server on the EC2 instance:
sudo netstat -tulpn | grep :3000
Try running the server with the --verbose flag, i.e. npm run server --verbose;
it will show the server's logs while starting.
Check the security group settings for the EC2 instance.
Try to connect with ip:port, i.e. 35.2..:3000.
If it is still not working and the response takes a long time,
that may mean some other service is running on the same port.
Try this on the EC2 instance:
sudo killall -9 node
npm run server
Then connect using the IP (54.4.5.*:3000) or the public DNS (http://ec2...us-west-1.compute.amazonaws.com:3000).
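One more thing worth checking (not mentioned in the question): make sure Express is not bound explicitly to localhost. app.listen(3000) binds to all interfaces by default, but app.listen(3000, '127.0.0.1') makes the server reachable only from inside the instance, which produces exactly this timeout even with the security group open.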
Hope it helps :)
You may be encountering an issue with outbound traffic. You may be inside a company's network, either physically connected or VPN'd in. In some instances, your VPN isn't set up to handle split traffic, so you must abide by your company's outbound restrictions.
In a situation like this, you would want to use a proxy to access your site. When locking down your security group, make sure you use your proxy's public IP (not your company's).
Usually, when we have connectivity issues, it is something basic or a firewall. I assume you have checked whether a firewall is running on either end, e.g. iptables -L -n. Also, any protocol analyzer like Wireshark or tcpdump would tell you whether packets to port 3000 are visible.

Launch application before server completes startup

I have two apps on the server: "WebSphere Commerce" and "myapp". While myapp initializes, it needs to receive some data from WC using SOAP; however, until both apps have started, the common HTTP port 9060 isn't listening.
There's a flag:
Enterprise Applications > * > Startup behavior
Startup order
Launch application before server completes startup
It's cleared for both apps. I thought WAS would first report:
TCP Channel TCP_2 is listening on host * (IPv6) port 9060.
Server server1 open for e-business
then start the apps, but it first starts them, then opens the port.
Then what does this flag do?
Check this page: Startup behavior settings
Launch application before server completes startup
Specifies whether the application must initialize fully before the server starts. The default setting of false indicates that server startup will not complete until the application starts.
A setting of true informs the product that the application might start on a background thread, and thus server startup might continue without waiting for the application to start. Thus, the application might not be ready for use when the application server starts.
So it is the other way around: the server first ensures that the applications are started, and then opens the port to allow traffic to them.
