I have an Azure VM that I use as a workhorse to connect to a database. If I am remotely connected, e.g. through Bastion (or SSH/RDP), my service will connect and run with no problem. If I close my connection, my service will keep connecting and running with no problem for a limited time. After several hours, the connection to the database will fail.
If I remote back into the VM, I notice it's exactly where I left it, all windows still open, etc. If I run a connection to my database, it succeeds. Again, if I leave it alone for a while, the connection to my database fails.
I have tried running PowerShell commands akin to the below to keep the machine "alive". The problem seems to be that the VM stops talking to the internet after some lengthy period of nobody being logged in. What can I do to keep it connected?
# try to keep awake by sending a harmless key press every 59 seconds
$wsh = New-Object -ComObject WScript.Shell
while ($true) {
    # Shift+F15 is effectively a no-op key combination, but it resets the idle timer
    $wsh.SendKeys('+{F15}')
    Start-Sleep -Seconds 59
}
To keep the VM awake, we can change the setting under Power & sleep and set sleep to Never.
The additional power settings can be adjusted there as well to keep the VM awake without it falling asleep.
For more information, please refer to this GitHub link.
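If you prefer to set this from a script rather than the Settings UI, something like the following (run in an elevated PowerShell session) should disable sleep on AC power. This is only a sketch and applies to the currently active power plan; adjust as needed:
# 0 = never; applies to the active power plan while on AC power
powercfg /change standby-timeout-ac 0
powercfg /change hibernate-timeout-ac 0
powercfg /change monitor-timeout-ac 0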
We are facing a strange issue with our IIS Deployments.
ApplicationPools sometimes fail to start properly but do not throw errors when doing so.
The only site contained in the ApplicationPool is not responsive (it does not even return a 500 or the like; requests just time out after some time).
The ApplicationPool and Sites are up and running (not stopped) as far as IIS is concerned.
Restarting the Site or the ApplicationPool does not fix the issue.
However, removing the site and ApplicationPool and recreating it with identical Properties does fix it.
Once any ApplicationPool has reached this state, the only way to solve this (as far as we know) is recreating the entire ApplicationPool.
We would gladly do so in an automated way, but there is no error to catch and handle.
Some background data:
We are using IIS Version 10
The ApplicationPool appears to start correctly. EventLog states that Application '<OUR_APP>' started successfully.
We suspect that the problem might be multiple ApplicationPool starts happening simultaneously (as they are automatically triggered by our CI/CD pipeline).
Now, I am by no means an IIS Expert, so my questions are:
Would it be possible, that many app pool starts (circa 20-60) happening at roughly the same time cause such behaviour?
What could I do to investigate this further?
Would it be possible that many app pool starts (circa 20-60) happening at roughly the same time cause such behaviour?
Difficult to say. An app pool is just an empty container; what mostly takes the time and places limits on that number is whatever your application code and dependencies are doing at startup and runtime, plus a little .NET precompilation overhead.
What could I do to investigate this further?
Check the HTTPERR logs in the Windows folder - they might provide a clue if you're not seeing the requests logged elsewhere (a quick sketch follows below).
Monitor the w3wp.exe processes themselves - those are your app pools' worker processes. It's possible for them to get stuck without "properly" crashing, which sounds like your case.
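For the HTTPERR logs, a minimal sketch assuming the default log location (adjust the path if yours was moved):
# Tail the newest HTTPERR log for clues about rejected or dropped requests
Get-ChildItem 'C:\Windows\System32\LogFiles\HTTPERR' -Filter *.log |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1 |
    Get-Content -Tail 50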
Assuming all your apps normally work and you just want a way to recover random failures, try this...
When you have a broken app pool, run the following on your server from PowerShell or ISE (as an Administrator) to view the running IIS worker processes:
Get-WmiObject Win32_Process -Filter "name = 'w3wp.exe'" | Select-Object ProcessId,CommandLine
The above outputs the worker process IDs and the arguments used to start them. Among the arguments you can see the app pool's name - use the matching ProcessId with the command Stop-Process -Force -Id X (replacing X with the ProcessId number) to forcibly kill the process. Does the app start successfully once you access it again after killing the process?
If you know the name of the app pool to kill you can use this code to terminate the process:
$AppPoolName = 'NAMEOFMYAPPPOOL'
# Match the w3wp.exe whose command line contains "-ap <pool name>" and kill it
Stop-Process -Force -Id (Get-WmiObject Win32_Process -Filter "name = 'w3wp.exe' AND CommandLine like '%-ap%$($AppPoolName)%'").ProcessId
(substitute NAMEOFMYAPPPOOL for the name of the app pool, and run as Administrator)
If killing the stalled process is sufficient to let it restart successfully, it would be fairly easy to script a simple health check, roughly like the sketch below. I would read the bindings of each site, make an HTTP request to each binding, and confirm the app pool really is running/responsive and returns a 200 OK response. If the request fails after some reasonable timeout, terminate the process and retry the HTTP request to restart the app pool. Add some retry logic, and maybe a delay between attempts, so it doesn't get stuck in a loop.
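A rough sketch of that health check. It assumes the WebAdministration module is available, that the sites have plain HTTP bindings reachable on localhost, and that killing the stuck w3wp.exe is enough to recover; URLs, timeouts, and the re-request logic are placeholders to adapt:
Import-Module WebAdministration

foreach ($site in Get-Website | Where-Object { $_.State -eq 'Started' }) {
    # Build a local URL from the first HTTP binding (host headers ignored for brevity)
    $binding = $site.bindings.Collection | Where-Object { $_.protocol -eq 'http' } | Select-Object -First 1
    if (-not $binding) { continue }
    $port = $binding.bindingInformation.Split(':')[1]
    $url  = "http://localhost:$port/"

    try {
        Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 60 | Out-Null
    }
    catch {
        Write-Warning "Site '$($site.Name)' did not respond - killing its worker process."
        # Find the w3wp.exe serving this site's app pool and terminate it
        Get-WmiObject Win32_Process -Filter "name = 'w3wp.exe'" |
            Where-Object { $_.CommandLine -like "*$($site.applicationPool)*" } |
            ForEach-Object { Stop-Process -Force -Id $_.ProcessId }
        Start-Sleep -Seconds 5
        # Re-request so IIS spins the app pool back up
        try { Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 60 | Out-Null }
        catch { Write-Warning "Site '$($site.Name)' is still failing at $url" }
    }
}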
Just a thought - try giving each app pool its own temp folder - configured in web.config per site:
<system.web>
  <compilation tempDirectory="D:\tempfiles\apppoolname" />
</system.web>
Cross-talk between app pools sharing a temp folder during startup is a possible source of weirdness.
The problem seemed to be caused by our deployment scripts not waiting for the ApplicationPools to actually be in the Stopped state before removing the old application files, replacing them with the new ones, and immediately starting the ApplicationPools again.
We noticed issues related to this earlier this year, when files could not be deleted because they were still in use even after stopping the ApplicationPool (which we "solved" by implementing a retry mechanism)...
Solution
Calling the following code after stopping the ApplicationPool seems to solve the issue:
# Requires the WebAdministration module for Get-WebAppPoolState
Import-Module WebAdministration

$stopWaitCount = 0
# Poll until the pool reports Stopped, waiting a little longer each time (max 12 attempts)
while ((Get-WebAppPoolState -Name $appPool).Value -ne "Stopped" -and $stopWaitCount -lt 12)
{
    $stopWaitCount++
    Write-Log "Waiting for Application-Pool '$appPool' to stop..."   # Write-Log is our own logging helper
    Start-Sleep -Seconds $stopWaitCount
}
We implemented this 2 days ago, and the problem hasn't occurred in 100+ deployments since.
I have a set of Lambda functions that process messages from an SQS queue. They take data sets, process them, and store the results in an RDS MySQL database, which they connect to via VPC. Both the Lambda functions and the RDS database are in the same availability zone.
This has been working for the last couple of months without any issues, but early this morning (2019-01-12) at 01:00 I started seeing lambda timeouts and messages being moved into the dead letter queue.
I've done some troubleshooting and confirmed that the timeouts are caused by Lambda being unable to establish a connection to the database server.
The RDS server is public, but locked down to allow access only through VPC and 2 public IPs.
I've taken the following steps so far to try and resolve the issue:
Given the lambda service role admin rights to rule out IAM issues
Unassigned the VPC from the Lambda functions and opened up RDS inbound access from 0.0.0.0/0 to rule out VPC issues.
Restarted the RDS hosts, the good ol' off'n'on again.
Used serverless to invoke the lambda functions locally with test data (worked). My local machine connects to the public RDS IP, not through VPC.
Changed the runtime environment from 3.6 to 3.7
It doesn't appear to be a code issue: it's been working flawlessly for the past couple of months, I can invoke locally without issue, and my Elastic Beanstalk instance, which sits on the same VPC subnet, continues to connect through VPC without issue.
Here's the code I'm using to connect:
import os
from sqlalchemy import create_engine, MetaData
from sqlalchemy.pool import NullPool

connectionString = 'mysql+pymysql://{0}:{1}@{2}/{3}'.format(os.environ['DB_USER'], os.environ['DB_PASSWORD'], os.environ['DB_HOST'], os.environ['DB_SCHEMA'])
engine = create_engine(connectionString, poolclass=NullPool)
with engine.connect() as con:                # <--- breaking here
    meta = MetaData(engine, reflect=True)    # <-- never gets to here
I double checked the connection string & user accounts, both are correct/working locally.
If someone could point me in the right direction, I'd be grateful!
My first guess is that you've hit a connection limit on the RDS database. Because Lambdas can execute concurrently (which could easily be the case if there were suddenly a lot of messages in your SQS queue), and each execution opens a new connection to your DB, the available connections can quickly be exhausted.
If this is the case, you can set a concurrent execution limit on your Lambda function to prevent this.
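For example, with the AWS CLI (the function name here is a placeholder), reserving a small concurrency cap looks roughly like this:
aws lambda put-function-concurrency --function-name my-sqs-worker --reserved-concurrent-executions 5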
A side note - it is not recommended to use a database with a persistent connection in a serverless architecture exactly for this reason. AFAIK, AWS is working on a better solution to use RDS from Lambda, but it's not available yet.
So...
I was changing security groups and it was having no effect on the RDS host; at one point I removed all access and could still connect, which is crazy. At that point I started to think that Friday night's outage had put the underlying RDS host into a weird state. I put the security groups back the way they should be, stopped and started the RDS host (a restart had no effect), and everything started to work again.
Very frustrating, but happy it's finally resolved.
We have a setup with several RESTful APIs on the same VM in Azure.
The websites run in Kestrel on IIS.
They are protected by the Azure Application Gateway with the firewall enabled.
We now have requests that would run for at least 20 minutes.
The requests run their full length uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even after the request has finished in Kestrel. The request continues in Kestrel even if the connection to the sender was interrupted.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites being the problem.
Ran the request in the VM (to localhost): No problems, response was received.
Ran the request within Azure from one to another VM: Request ran forever.
Ran the request from outside of Azure: Request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 min, IIS: 4000 s, ApplicationGateway-HttpSettings: 3600 s.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
You're most probably hitting the Azure SNAT layer's idle timeout.
You can change it under the Configuration blade for the Public IP.
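If you'd rather do it with the Az PowerShell module than the portal, a sketch along these lines should work (the resource group and public IP names are placeholders; the maximum idle timeout is 30 minutes):
# Raise the TCP idle timeout on the public IP in front of the Application Gateway
$pip = Get-AzPublicIpAddress -ResourceGroupName 'my-resource-group' -Name 'my-appgw-public-ip'
$pip.IdleTimeoutInMinutes = 30
Set-AzPublicIpAddress -PublicIpAddress $pip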
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
Not sure what your backend connection looks like, but something similar (a backend DB proxy) could work to give you more ability to tune connection / reconnection behaviour on your side.
For us, we were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't an answer, I know there are also two types of public IP addresses; one of them is considered 'Basic' and doesn't have the same configurability. Could it be something related to the difference between Basic and Standard public IPs / load balancers?
We are using the following code to connect to our caches (in-memory and Redis):
settings
.WithSystemRuntimeCacheHandle()
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
.And
.WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
.WithMaxRetries(3)
.WithRetryTimeout(100)
.WithJsonSerializer()
.WithRedisBackplane(CacheManagerRedisConfigurationKey)
.WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime);
It works fine, but sometimes the machine is restarted (automatically by Azure, where we host it), and after the restart the connection to Redis fails with the following exception:
Connection to '{connection string}' failed.
at CacheManager.Core.BaseCacheManager`1..ctor(String name, ICacheManagerConfiguration configuration)
at CacheManager.Core.BaseCacheManager`1..ctor(ICacheManagerConfiguration configuration)
at CacheManager.Core.CacheFactory.Build[TCacheValue](String cacheName, Action`1 settings)
at CacheManager.Core.CacheFactory.Build(Action`1 settings)
According to the Azure Redis FAQ (https://learn.microsoft.com/en-us/azure/redis-cache/cache-faq), section "Why was my client disconnected from the cache?", this might happen after a redeploy.
The question is
is there any mechanism to restore the connection after redeploy
is anything wrong in the way we initialize the connection
We are sure the connection string is OK
Most clients (including StackExchange.Redis) usually connect / re-connect automatically after a connection break. However, your connect timeout setting needs to be large enough for the re-connect to happen successfully. Remember, you only connect once, so it's alright to give the system enough time to be able to reconnect. Higher connect timeout is especially useful when you have a burst of connections or re-connections after a blip causing CPU to spike and some connections might not happen in time.
In this case, I see RetryTimeout as 100. If this is the Connection timeout, check if this is in milliseconds. 100 milliseconds is too low. You might want to make this more like 10 seconds (remember it's a one time thing, so you want to give it time to be able to connect).
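If I remember right, CacheManager's Redis handle sits on top of StackExchange.Redis, so the connect timeout can also be raised directly in the connection string rather than via WithRetryTimeout (which is the pause between retries). A sketch with placeholder host and password; connectTimeout is in milliseconds:
yourcache.redis.cache.windows.net:6380,password=...,ssl=True,abortConnect=False,connectTimeout=10000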
We have an Oracle 11g DB on a Microsoft Azure VM.
The Oracle connection on the client side closes after a period of time, even while active SQLs are running. I'm checking active SQL reports one minute, and in the next, BOOM, the connection is closed.
We have not defined any profile that times out connections, but still, if I keep a connection open in SQL Developer and run a query some time later, the connection is closed.
When running batch programs that take more time on the server itself, the batch seems to hang and no session for it exists any more.
I'm guessing that some time after a session is obtained, the DB closes the connection, which makes the batch hang.
Is this related to Azure or to something else? It is not even producing any error codes.