Unable to connect to SSH on Google Cloud VM Instance - linux

I have run into a problem today where I am unable to connect via SSH to my Google Cloud VM instance running debian-10-buster. SSH had been working until today, when the connection suddenly dropped while Docker was running. I've tried rebooting and resetting the VM instance, but the problem persists. This is the serial console output on GCE, but I am not sure what to look for in it, so any help would be highly appreciated.
Another weird thing is that earlier today, before the problem started, my disk usage was fine, and then suddenly I was getting a bunch of errors that the disk was out of space, even after I tried clearing up a bunch of space. df showed that the disk was 100% full, to the point where I couldn't even install ncdu to see what was taking the space. So then I rebooted the instance to see if that would help, and that's when the SSH problem started. Now I am unable to connect over SSH at all (even through the online GCE interface), so I am not sure what next steps to take.

Your system has run out of disk space for the boot (root) file system.
The error message is:
Root filesystem has insufficient free space
Shut down the VM, resize the disk to a larger size in the Google Cloud web console, and then restart the VM.
Provided that there are no uncorrectable file system errors, your system will start up, resize the partition and file system, and be fine.
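The same resize can also be done from the command line with the gcloud CLI. A minimal sketch, assuming a boot disk and instance both named my-instance in zone us-central1-a (placeholder names):
# Stop the VM so the boot disk can be resized safely
gcloud compute instances stop my-instance --zone=us-central1-a
# Grow the boot disk (disks can only be grown, never shrunk)
gcloud compute disks resize my-instance --size=50GB --zone=us-central1-a
# On restart, the Debian image grows the partition and file system automatically
gcloud compute instances start my-instance --zone=us-central1-a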
If you have modified the boot disk (restructured the partitions, added extra partitions, etc.), then you will need to repair and resize manually.
I wrote an article on resizing the Debian root file system. It goes into more detail than you need, but it explains the low-level details of what happens.
Google Cloud – Debian 9 – Resize Root File System

Related

Azure VM won't boot - Error is 'Fatal error C0000034 applying update operation 63 of 82641'

The VM is set to start every morning at 8am. This morning I got the following error:
'Fatal error C0000034 applying update operation 63 of 82641' in the Boot Diagnostics section in the VM Console
Every previous occurrence of this error I found by googling related not to an Azure VM but to a standalone laptop. All of those suggest booting from a different partition or a rescue disk, which is not possible in my case.
I have tried:
Restarting the VM
Redeploying the VM
Resizing the VM
Whatever I try, I still can't RDP to the VM.
I can't restore the C: drive as I can't connect to the VM to do it.
Any ideas how I can recover from this or rescue the VM? All suggestions greatly appreciated.
Thanks,
Dan.
I've now managed to resolve this, so I will post the solution here for anyone else.
I found the error in the Microsoft docs here:
https://learn.microsoft.com/en-gb/troubleshoot/azure/virtual-machines/troubleshoot-stuck-updating-boot-error?WT.mc_id=Portal-Microsoft_Azure_Support
This advised taking a copy of the OS disk and attaching it as a data disk to a 'rescue VM'.
Run the following command against the attached disk in the rescue VM, where <drive> is the drive letter of the attached OS volume:
dism /image:<drive>:\ /get-packages > c:\temp\Patch_level.txt
Open the file, scroll to the bottom, and look for updates that are Install Pending or Uninstall Pending.
Then run the following command for each pending package:
dism /image:<drive>:\ /Remove-Package /PackageName:<package identity>, where the package identity is taken from the text file above.
Detach the now-repaired OS disk from the rescue VM and attach it to the original VM.
Start the VM - it may take a while to start.
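For reference, Microsoft also ships an az vm repair extension that automates creating the rescue VM and swapping the disk back; a hedged sketch, assuming the extension is available and using placeholder resource group and VM names:
az extension add --name vm-repair
# Create a rescue VM with a copy of the broken OS disk attached as a data disk
az vm repair create -g MyResourceGroup -n MyBrokenVM --repair-username rescueadmin --repair-password '<password>' --verbose
# After fixing the disk (e.g. with the dism steps above), swap it back into the original VM
az vm repair restore -g MyResourceGroup -n MyBrokenVM --verbose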
With the oncoming tide of managed disks, I don't suppose there will be much call for this solution, but it's here if anyone needs it.

Does Docker manage its filesystem like a standalone OS?

I have a program I'm running in a Docker container. After 10-12 hours of running, the program terminated with filesystem-related errors (FileNotFoundError, or similar).
I'm wondering if the disk space filled up, there was a similar filesystem-related issue, or there was a problem in my code (e.g. one process deleted the file prematurely).
I don't know much about Docker's management of files and wonder whether, inside a container, Docker creates and manages its own filesystem. Here are the three possibilities I'm considering; I mainly wonder whether #1 could be the case:
1. If Docker manages its own filesystem, could it be that although disk space is available on the host machine, the Docker container ran out of its own storage space? (I've seen similar issues with out-of-memory for a process whose memory was artificially limited using cgroups.)
2. Could it be that the host filesystem ran out of space and the files got corrupted or didn't get written correctly?
3. There is some bug in my code.
This is likely a bug in your code. Most programs print the error they encounter, and when a program runs out of space, the error returned by the filesystem is "No space left on device" (errno 28, ENOSPC).
If you see FileNotFoundError, that means the file is missing. My best theory is that it's coming from your consumer process.
It's still possible, though, that the file doesn't exist because the producer ran out of space and you didn't handle the error correctly - you'll need to check your logs.
It might also be a race condition, depending on your application. There's really not enough detail here to answer that.
As to the title question:
By default, Docker just overlay-mounts an empty directory from the host's filesystem into the container, so the amount of free space in the container is the same as the amount on the host.
If you're using volumes, it depends on the storage driver you use. As @Dan Serbyn mentioned, the default limit for the devicemapper driver is 10 GB. The overlay2 driver - the current default - doesn't have that limitation.
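To rule out possibilities #1 and #2 directly, you can compare free space as seen inside the container with free space on the host. A quick check (the container name is a placeholder):
# Free space as seen from inside the container
docker exec my-container df -h /
# Free space on the host, including Docker's data directory
df -h / /var/lib/docker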
In the current Docker version, there is a default limit of 10 GB on Docker container storage.
You can check the disk space that containers are using by running the following command:
docker system df
It's also possible that the file your container is trying to access has permission restrictions. Try making it available to Docker, or to everybody (chmod 777 file.txt).

New Azure data disk does not appear when running lsblk from virtual machine's cli

I am attempting to create and attach a new data disk to an Azure linux VM per these instructions: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/attach-disk-portal
Azure Portal reported that the disk was created and attached to my VM successfully, and I can see it listed as a data disk under "Disks" for that VM in Azure Portal. However, when I run lsblk from the VM's command line, as instructed under "Find the disk" in the documentation, the new disk does not appear in the listing, so I can't proceed with setting up the disk.
How can I get the disk to show up in lsblk, or at least begin to diagnose why it didn't? The VM is running Ubuntu 20.04, in case that matters.
For what it's worth, immediately before this, I executed the same process to add a different data disk to a different VM and it went very smoothly, so there seems to be some particular problem with this VM.
If the VM was running when you added the disk, you need to rescan for the new disk. Rebooting will work, but you can also rescan without rebooting.
If the sg3-utils package is installed, you can use rescan-scsi-bus.sh to rescan. If not, you can use the following:
# Run as root: ask every SCSI host adapter to rescan for new devices
for h in /sys/class/scsi_host/*; do
  echo '- - -' > "$h/scan"
done
For more information, refer to this document: Virtual Hard Disk is added, but not showing using lsblk -d command
I've been in touch with Azure support and they've diagnosed (through some back-end method) that the storage module (hv_storvsc) on the virtual machine is down. They have stated that the only solution is to reboot the virtual machine.
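If you hit the same state, you can confirm that diagnosis from inside the VM before rebooting; a hedged sketch of what to look for:
# Check whether the Hyper-V storage driver is loaded
lsmod | grep hv_storvsc
# Look for storvsc errors in the kernel log
dmesg | grep -i storvsc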

Does an opened SSH connection to a GCLoud VM prevent it from freezing/crashing?

I have an f1-micro gcloud VM instance running Ubuntu 20.04.
It has 0.2 vCPUs and 600 MB of memory.
I write freezing/crashing, which stands for simply not responding to anything anymore.
From my monitoring I can see that the CPU peaks at 40% usage (usually steady under 1%), while the memory is always around 60% (both stats with my Node.js server running).
When I open an SSH connection to my instance and run my Node.js server in the background, everything works fine as long as I keep the SSH connection alive. As soon as I close the connection, it takes a few more minutes until the instance freezes/crashes. Without closing the SSH connection I can keep it running for hours without any problem.
I don't get any crash or freeze information from gcloud itself. The instance has a green checkmark and is, in a way, still running. I just can't open a new SSH connection, and the only way to do anything with this instance again is to restart it.
I have Cloud Logging active, and there are also no messages in there.
So with this knowledge, my question is whether gcloud somehow boosts SSH-connected VMs to keep them alive,
because I don't know what else could cause this behaviour.
My Node.js server uses around 120 MB, another service uses 80 MB, and the GCP monitoring agent uses 30 MB. The Linux free command on the instance shows available memory between 60 MB and 100 MB.
In addition to John Hanley's and Mike's answers, you can edit your machine type based on your needs.
In the Google Cloud Console, go to VM instances under Compute Engine.
Select the instance name to open its overview page.
Make sure to stop the instance before editing it.
Select a machine type that matches your application's needs.
Save.
For more info and guides, you may refer to the links below:
Edit Instance
Machine Family Categories
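The same change can also be scripted with the gcloud CLI; a minimal sketch with placeholder instance, zone, and machine type names:
# The instance must be stopped before its machine type can be changed
gcloud compute instances stop my-instance --zone=us-central1-a
# Move to a machine type with more memory (e2-small is just an example)
gcloud compute instances set-machine-type my-instance --zone=us-central1-a --machine-type=e2-small
gcloud compute instances start my-instance --zone=us-central1-a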
Since there were no answers that explained the strange behaviour I encountered:
I still haven't figured it out, but at least my servers won't crash/freeze anymore.
I somehow fixed it by running my Node.js application as an actual background job using forever, instead of running it like node main.js &.
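For anyone else hitting this, the change was roughly the following (main.js stands in for your entry point; forever is installed from npm):
# Before: backgrounded by the shell, tied to the SSH session
node main.js &
# After: supervised by forever, which keeps it alive after the session ends
npm install -g forever
forever start main.js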

Unable to increase disk size on file system

I'm currently trying to log in to one of the instances created on Google Cloud but find myself unable to do so. Somehow the machine escaped my attention and the hard disk got completely full. Of course I wanted to free some disk space and make sure the server could restart, but I am facing some issues.
First off, I found the guide on increasing the size of the persistent disk (https://cloud.google.com/compute/docs/disks/add-persistent-disk). I followed it and already set the disk to 50 GB, which should be fine for now.
However, at the file system level, because the disk is full I cannot make any SSH connection. The error is simply a timeout, caused by the fact that there is absolutely no space for the SSH daemon to write to its log. Without any form of connection I cannot free disk space and/or run the resize2fs command.
Furthermore, I have already tried different approaches:
I cannot seem to change the boot disk to something else.
I created a snapshot and tried to increase the disk size on the new instance I created from that snapshot, but it has the same problem (the filesystem is stuck at 15 GB).
I am not allowed to mount the disk as an additional disk in another instance.
Currently I'm pretty much out of ideas. The important data on the disk is backed up, but I'd rather get the setup working again as well. Does anyone have any clues as to where to start?
[EDIT]
Currently still trying out new things. I have also tried to run shutdown and startup scripts that remove /opt/* in order to free some temporary space, but the scripts either don't run or produce an error I cannot catch. It's pretty frustrating working nearly blind, I must say.
The next step for me is to try to get the snapshot locally. It should be doable using the bucket, but I will let you know.
[EDIT2]
Getting a snapshot locally is not an option either, or so it seems. Images of Google Cloud instances can only be created or deleted, not downloaded.
I'm now out of ideas.
So I finally found the answer. These were the steps taken:
In the GUI, I increased the size of the disk to 50 GB.
In the GUI, I detached the drive by deleting the machine while ensuring that I did not throw away the original disk.
In the GUI, I created a new machine with a sufficiently big hard disk.
On the command line (important!), I attached the disk to the newly created machine (the GUI option still has a bug ...).
After that I could mount the disk as a secondary disk and perform all the operations I needed.
Keep in mind: by default, Google Cloud images do NOT use logical volume management, so pvresize/lvresize/etc. are not installed, and resize2fs might not work out of the box.
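A hedged sketch of the command-line attach and the follow-up repair, with placeholder instance, disk, and device names:
# Attach the old boot disk to the rescue VM from the CLI
gcloud compute instances attach-disk rescue-vm --disk=old-boot-disk --zone=us-central1-a
# On the rescue VM: find the disk, grow the partition, then the file system
lsblk                       # the disk shows up as e.g. /dev/sdb
sudo growpart /dev/sdb 1    # grow partition 1 (growpart is in cloud-guest-utils)
sudo e2fsck -f /dev/sdb1    # check the file system before resizing
sudo resize2fs /dev/sdb1    # grow ext4 to fill the partition
sudo mount /dev/sdb1 /mnt   # mount it to clean up files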
