I had a virtual machine fail. Is there any way to figure out why the machine became unresponsive? I have no way of connecting to the VM, as RDP does not appear to be responding. What are the options in this case, and what preventative measures can be put in place to keep this from happening in the future?
Thanks,
Steve Armitage
If you're in a situation like this, there's not much you can do but contact Microsoft Support. I've always found them helpful, but you do need to be aware that root cause analysis of a problem can sometimes take a while. As far as preventing this problem from happening again, that depends on the problem. I've encountered two such cases: one was caused by a memory leak in our code, so it was something I could fix; the other was something only MS could fix. So it's hard to say.
Related
I have an API application running in a Docker container, and since moving to AWS, the API stops daily with the error: Erlang closed the connection. I've monitored the server during that time and IOPS don't seem to be causing this issue. Beyond that, when the API fails, it won't restart on its own on one of our clusters. I'm not sure where to find the logs to get more context and could use any input that may be helpful. For more context, this API worked fairly well in our data center on physical servers, but now in AWS it fails daily. Any thoughts or suggestions as to why it may be failing to restart?
I've looked at the syslogs and the application server logs and don't see any kind of failures. Maybe I'm not looking in the proper place. At this point, someone from one of our teams has to manually restart the api with an init.d command. I don't want to create a cron job to "fix" this because that's a band-aid and not a true fix.
There really isn't enough information on the structure of your app or its connections, so no one can give a "true fix".
The problem may be as small as configuring your nodes differently, changing some of the server's local configurations, or you may need some "keep alive" handler towards AWS.
Maybe try adding a manual periodic dump to see if it's an accumulating problem. But I believe that if Erlang closed the connection, there's a problem between your API and AWS, and your API can't understand where it is located.
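One way to gather the missing information is to record exactly when the API stops accepting connections, so the timestamps can be lined up against the container and AWS logs. The following is a minimal sketch, not a fix: it only checks that a TCP port accepts connections, and the function names, log format, and default intervals are assumptions made for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def port_is_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def probe_once(host: str, port: int, log_path: str) -> bool:
    """Probe the port once and append a timestamped result line to log_path."""
    ok = port_is_reachable(host, port)
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {host}:{port} {'up' if ok else 'DOWN'}\n")
    return ok


def watch(host: str, port: int, log_path: str,
          interval_s: float = 30.0, rounds: int = 10) -> int:
    """Probe repeatedly and return how many probes failed."""
    failures = 0
    for i in range(rounds):
        if not probe_once(host, port, log_path):
            failures += 1
        if i < rounds - 1:
            time.sleep(interval_s)
    return failures
```

The first DOWN line in the log gives a time window to search in the other logs. This is a diagnostic aid only and does not restart anything.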
I am currently running GitLab CE. I have an issue where its disk usage is constantly growing.
There is 1 current user (myself). But sitting idle, it gains 20 GB of usage in under an hour for no apparent reason (I am not pushing, pulling, or even using it; the service is simply live and idle) until it eventually fills my drive (411 GB of free space before the installation of GitLab; it takes less than 24 hours to fill).
I cannot locate the source of the issue. Google keeps referring me to size limitations, which would be fine if I needed to increase them, but I don't. I have tried disabling some metrics and safety features such as "Health checks" in an attempt to stop this, but with no success.
I have to keep reinstalling it to negate the idle data usage. There is a reason for me setting it up, but I cannot deploy this the way it is. Have any of you experienced this issue? Is there a way around this?
The system currently running it: Fedora 36, with the installation on a 500 GB SSD and an 8-core Ryzen 7 processor.
Any advice to solve this problem would be great. Please note I am not an expert.
Answer to this question:
rsync was scheduled automatically and was running in a loop.
I removed rsync, reinstalled it, rescheduled it to run on my own schedule, and removed the 100 or so older backups, and my space has been returned.
For those who are running rsync, just check that its runs are not scheduled too close together and that it is detecting that its own backups are there, as the backups I found were corrupted.
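For anyone who needs to locate what is filling a drive, as in this question, a rough sketch like the one below ranks the immediate subdirectories of a folder by total size. The function names are made up for illustration, and the directory to scan is an assumption that depends on how GitLab and the backups were set up.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat counts a symlink as itself, not as its target
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return total


def largest_subdirs(root: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return up to `top` immediate subdirectories of root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size_bytes(Path(entry.path))))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

Running `largest_subdirs(Path("/var/opt/gitlab"))` twice, an hour apart, should show which subdirectory is growing. That path is only a guess at where the data lives; point it at the backup target or any other suspect directory instead.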
Can you help me?
I tried load testing a Node.js server with JMeter, but after about 5000 requests the Node server stops responding, so I have to restart the server to make it work again. Is there any way to make it work again without restarting the server?
What you are asking for is a way to treat the symptoms without treating the cause. I know of no way to "make it work again", but if we can find the cause of your problem we can fix it and remove the symptoms. It is difficult to comment on what exactly is happening without more information/code, but two things come to mind.
You may be performing some very heavy computations and accidentally blocking your event loop. This article discusses it in more detail.
You may have a memory leak which is crashing Node. This is easy to check by watching the memory usage of the node process in Windows Task Manager or a Mac/Linux equivalent. If the memory keeps increasing and never falls, Node may reach its max memory limit and crash. The only way I know of to fix this is to run the Node garbage collector manually. This article talks about it. This is of course a temporary solution; you should fix the memory leak if you find one.
Those are the only two I can think of. If you want more help, I'll need to see your code.
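The memory check described above can also be scripted rather than watched by eye. Below is a small sketch, written in Python only so that it runs outside the Node process: it samples a process's resident memory through `ps` (Linux/macOS) and flags a series that keeps rising without falling back. The function names, the 1 MB tolerance, and the minimum sample count are arbitrary assumptions.

```python
import subprocess


def rss_mb(pid: int) -> float:
    """Resident memory of a process in MB, read via `ps` (Linux/macOS only)."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip()) / 1024  # ps reports kilobytes


def steadily_growing(samples_mb: list[float],
                     min_samples: int = 5,
                     tolerance_mb: float = 1.0) -> bool:
    """True if memory never drops by more than the tolerance and ends higher."""
    if len(samples_mb) < min_samples:
        return False  # too little data to judge
    never_falls = all(
        later >= earlier - tolerance_mb
        for earlier, later in zip(samples_mb, samples_mb[1:])
    )
    grew = samples_mb[-1] - samples_mb[0] > tolerance_mb
    return never_falls and grew
```

Collecting `rss_mb(pid)` for the node process every few seconds during the JMeter run and passing the list to `steadily_growing` gives a quick hint. A healthy process normally shows drops after garbage collection, so a series that never falls is worth investigating.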
You can use 'forever': https://github.com/foreverjs/forever
But I think you know why the script fails.
Good luck
Suppose I have the following situation. One of my Azure role instances happens to be started on a VM that runs inside a faulty server but Azure wiring processes don't see any problems. I somehow deduce this fact - for example I see an "impossible" call stack - one that can't happen in my program under any normal conditions.
So I'd like Azure to move my instance to another VM and have the underlying hardware checked and repaired.
How can I do that except contacting support?
A few comments:
You can have this done, sort of, by calling support. The support team won't move your VM to a new server just because you ask, but they will work with you to determine if the physical server really is bad, and if so move it out of service.
RequestRecycle will only shut down the host process (i.e. WaIISHost) and related processes and then restart them. It won't reboot the VM, clean boot, or redeploy.
You can try a 'Reimage' from the portal or PowerShell if you suspect you might have a corrupt Windows installation. A Reimage will recreate the Windows partition from scratch.
In order to force a new VM to be on a new server you would have to do an in-place upgrade and modify the size of the VM (i.e. go from Small to Medium). This will cause new VMs to be created on new servers. You can then do another in-place upgrade to revert to the original size.
That being said, I strongly agree with Brian's comment that it is very unlikely that bad hardware is causing an 'impossible' call stack. I would recommend opening a support incident so you can find the actual root cause instead of just fixing the most visible symptom.
I don't think you can move a VM. But you could create a new staging deployment, swap it into production, and then destroy the old one. You can't actually guarantee that the VMs are on different physical machines, but it seems reasonably likely. The larger the VMs are, the more likely that they're on separate servers.
That said, it seems really unlikely that your problems are due to a hardware fault rather than some subtle bug.
Very suddenly, without any changes or recent access, my Azure virtual server is no longer available for RDP or web. I have logged into the Azure control panel and everything appears to be running without issue, but it is not working.
I have checked the end points and they are present for both RDP and Web, totally weird.
I have 2 virtual servers and the other one is working fine and responding.
Anyone ever experience this? Just when my client wants to view his website as well...
http://cn-web-02.cloudapp.net is the URL
TIA
As I just answered for this question, Virtual Machines are in Preview and not in Production yet. There are several reasons why your Virtual Machines became unavailable (see other answer). Given that this is the second reported incident here today, it's a good guess it's related to the underlying Host OS being updated, which would take your Virtual Machine offline for a short period of time.
I tried your URL and it's available again. Just remember about this being in Preview, especially since you mention having a client that wants to view his website. If you put a production website in Virtual Machines, then you'll have to absorb the risk of not having an SLA.
Having said that: You can mitigate downtime risk by running two Virtual Machines, listening on a load-balanced input endpoint. Be sure to have both Virtual Machines in the same Availability Set. Doing that ensures that the Windows Azure fabric controller will not take both Virtual Machines offline at the same time when doing things like Host OS updates. If this were in Production, you'd then have a very high availability scenario. Even in Preview, you'll improve availability by taking advantage of Availability Sets. Note: You'll need to use some type of shared session cache, since visitors will now be sent to either one of your Virtual Machines.
I had the same experience! We had 2 instances and all of them were re-imaged without any notification. I know because we had made some local changes via RDP.
A Reboot or Reimage may help. Give it a try!
Turns out it was an outage from Microsoft that lasted over 22 hours, but everything is back up and running. This is the 2nd time in 6 months this has happened for long stretches, which makes me a little nervous to say the least.
Thanks for the input, everyone. For anyone who's interested, MS has a good site that tracks the service levels on Azure: Windows Azure Service Dashboard.
S