What causes frequent GCP Ubuntu server restarts? - linux

I have an e2-medium GCP Compute Engine instance running my application, for a while now. It has worked just well, however recently I have experienced frequent auto restarts (3 times in the space of 2 weeks), nothing in the logs points me to the reason for this, but it keeps repeating itself. Can someone please tell me what could be the problem?

VM's in GCP are not guaranteed to be up 100% of the time. Restarts could be cause by anything from hardware failures to Google performing some maintenance on it's physical servers (hardware repairs or Software patching).

Related

Rapid gain in storage use with gitlab not exchanging any files

I am currently running Gitlab CE. I have an issue where it is constantly gaining space,
There is 1 current user (myself). But sitting idle it gains 20gb of usage in under an hour for no apparent reason (not pushing or pulling or even using it, the service is simply live and idle) until eventually it fills my drive (411gb of free space before the installation of Gitlabs. takes less than 24hrs to fill it.).
I cannot locate the source of the issue, google seems to like referring me to size limitations, and that is fine if I needed to increase that which I don't, i have tried to disable some metrics and the safety features such as "Health checks" in an attempt to stop it from doing this but with no success
I have to keep reinstalling it to negate the idle data usage. There is a reason for me setting it up, but I cannot deploy this the way it is. Have any of you experienced this issue? Is there a way around this?
The system current running it: Fedora 36 running the installation on a 500GB SSD, 8 core Ryzen 7 Processor.
any advice to solve this problem would be great. Please note I am not an expert.
Answer to this question:
rsync was scheduled automatically and was in a loop.
Removed rsync, reinstalled it, rescheduled rsync to go on my schedule, removed the older 100 or so back ups and my space has been returned.
for those that are running rsync, just check that it is not running too closely and is detecting that its own backups are there. as the back ups i found were corrupted.

Redis on Azure - suddenly keys stop increasing, hits & misses zero

We are running a setup on Azure consisting:
S3 web app in UK South
S2 failover in UK West
200DTU Elastic Pool with around 25 databases
Redis server
Several times this week, we have had periods where Redis has stopped hitting and missing data, and no additional items are being added to the cache. In effect the caching completely ceases being available.
Flushing the cache does not make any difference to the issue - nothing is added, nothing is hit or even missed.
The only way to re-enable is to restart the web app itself. After which everything is back to normal.
Our developers are looking into potential causes in our codebase, but I wonder if anyone has any ideas on how to diagnose or solve this issue.
Thanks
If restarting your web app fixes it, it sounds like some contention on the client side. You might find this (https://gist.github.com/JonCole/db0e90bedeb3fc4823c2#file-diagnoserediserrors-clientside-md) link useful. It could likely be threadpool or CPU contention on the client machines hosting your web app.

Azure App Service Plan CPU spikes for no obvious reason

We're experiencing CPU spikes on our Azure App Service Plan for no obvious reason. Its not something that stops the service, but we'd like to have an understanding of when&how that kind of things happen.
For example, CPU percentage sits at 0-1% range for days but then all of the sudden it spikes to 98%, 45%, 60% and comes back to 0-1% range very quickly. Memory stays unchanged at comfortable 40-45% level, no incoming requests to it, no web jobs, nothing unusual in logs, no failures, service health ok, nothing we could point our finger to as a reason.
We tried to find out through kudu > support > analyze (metrics)...but we couldn't get request submited. It just keeps giving error to try later.
There is only one web app running in that app service plan, its a asp.net core 2.0. web api.
Could someone shed some light on this kind of behavior? Is this normal, expected? If so, why it happens? Is there a danger that it spikes to 90% and don't immediately come back?
Just, what's going on?
After speaking with MS support i've got an answer it is a normal behavior coming from their monitoring tool:
We reviewed our internal tools taking as starting point 12/26 and
today 12/29 and we could notice that this was majority System
processes doing background tasks, which is normal for each sandbox
environment. In your case, it was mostly MonAgentCore.exe fluctuating
in CPU which is our diagnostic log capturing process and this looks
like a very temporary spike and appears normal.

I'm not sure how to correctly configure my server setup

This is kind of a multi-tiered question in which my end goal is to establish the best way to setup my server which will be hosting a website as well as a service (using Socket.io) for an iOS (and eventually an Android) app. Both the app service and the website are going to be written in node.js as I need high concurrency and scaling for the app server and I figured whilst I'm at it may as well do the website in node because it wouldn't be that much different in terms of performance than something different like Apache (from my understanding).
Also the website has a lower priority than the app service, the app service should receive significantly higher traffic than the website (but in the long run this may change). Money isn't my greatest priority here, but it is a limiting factor, I feel that having a service that has 99.9% uptime (as 100% uptime appears to be virtually impossible in the long run) is more important than saving money at the compromise of having more down time.
Firstly I understand that having one node process per cpu core is the best way to fully utilise a multi-core cpu. I now understand after researching that running more than one per core is inefficient due to the fact that the cpu has to do context switching between the multiple processes. How come then whenever I see code posted on how to use the in-built cluster module in node.js, the master worker creates a number of workers equal to the number of cores because that would mean you would have 9 processes on an 8 core machine (1 master process and 8 worker processes)? Is this because the master process usually is there just to restart worker processes if they crash or end and therefore does so little it doesnt matter that it shares a cpu core with another node process?
If this is the case then, I am planning to have the workers handle providing the app service and have the master worker handle the workers but also host a webpage which would provide statistical information on the server's state and all other relevant information (like number of clients connected, worker restart count, error logs etc). Is this a bad idea? Would it be better to have this webpage running on a separate worker and just leave the master worker to handle the workers?
So overall I wanted to have the following elements; a service to handle the request from the app (my main point of traffic), a website (fairly simple, a couple of pages and a registration form), an SQL database to store user information, a webpage (probably locally hosted on the server machine) which only I can access that hosts information about the server (users connected, worker restarts, server logs, other useful information etc) and apparently nginx would be a good idea where I'm handling multiple node processes accepting connection from the app. After doing research I've also found that it would probably be best to host on a VPS initially. I was thinking at first when the amount of traffic the app service would be receiving will most likely be fairly low, I could run all of those elements on one VPS. Or would it be best to have them running on seperate VPS's except for the website and the server status webpage which I could run on the same one? I guess this way if there is a hardware failure and something goes down, not everything does and I could run 2 instances of the app service on 2 different VPS's so if one goes down the other one is still functioning. Would this just be overkill? I doubt for a while I would need multiple app service instances to support the traffic load but it would help reduce the apparent down time for users.
Maybe this all depends on what I value more and have the time to do? A more complex server setup that costs more and maybe a little unnecessary but guarantees a consistent and reliable service, or a cheaper and simpler setup that may succumb to downtime due to coding errors and server hardware issues.
Also it's worth noting I've never had any real experience with production level servers so in some ways I've jumped in the deep end a little with this. I feel like I've come a long way in the past half a year and feel like I'm getting a fairly good grasp on what I need to do, I could just do with some advice from someone with experience that has an idea with what roadblocks I may come across along the way and whether I'm causing myself unnecessary problems with this kind of setup.
Any advice is greatly appreciated, thanks for taking the time to read my question.

Azure Virtual Machines stuck in Starting mode

This morning I found 5 of my Azure Virtual machines to be stuck in Starting mode.
All other VMs are working ok.
I managed to stop the VMs using the Azure command shell and then start them again but they are still stuck in starting mode with no end in sight.
It has now been over 5 1/2 hours and still stuck in starting mode.
I have contacted Microsoft support but they are taking hours to respond :(((
The Azure Status page doesn't show anything is wrong in my region.
Has anybody else experiencing this problem?
We've had the same issue and it's linked to a big issue Azure is having this morning.
The trick we used in order to get the instance running again is:
1. stop the VMs via Powershell
2. change the size of the vm and back (preferably from A to D as this is different hardware)
3. start the VM
We also have people complaining about RDP not working where reboots fixed the problem.
There are currently some problems with Azure, including the VM service. Also the status page does not reflect all of the problems. Here you have to keep in mind that this page also show impacts affecting most of the service customers. It does not reflect minor outages to single customers. You should keep an eye at the Azure blog which possibly gives a statement related to the current problems.
What works for me is a redeploy of the Virtual Machine within the Azure Portal whenever it gets stuck at "Starting...". Altho it takes half an hour to redeploy, it solves the issue. More details here.
Same problem I experienced and what I did is I resized Virtual Machine's Disk Size, You can go for increasing the whole VM size / power but for me the Disk size fixed it, probably it was updating and the disk file ran out.

Resources