GCE instance hangs after some time - linux

Image:
Debian GNU/Linux 7.6 (wheezy) amd64 with backports kernel and SSH packages built on 2014-10-17
Machine type:
n1-highcpu-2 (2 vCPU, 1.8 GB memory)
Zone:
europe-west1-b
I have 10-20 Ruby workers that listen to an external RabbitMQ server. They do nothing special: minifying CSS/JS/HTML code, uploading pictures via HTTP, and transferring data from MongoDB to MySQL.
Everything works fine, but after some time (roughly 5-6 hours) the instance hangs. I can't SSH to it from an external client or from the browser console. Only rebooting the instance helps.
What should I check or change to fix this behavior?

One possible cause of the "hanging" is I/O throttling. You need to understand your disk access patterns and check whether your disk type and size are a good match for them.
Please refer to the section of the documentation that explains Compute Engine disk performance:
https://cloud.google.com/compute/docs/disks#pdperformance
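One rough way to look at disk access patterns from inside the instance is to sample /proc/diskstats twice and compare the counters. The sketch below is only an illustration under that assumption: the helper names (parse_diskstats, busy_percent, sample) are made up for this example, and a device that stays close to 100% busy only suggests that the disk is a bottleneck; it does not prove that throttling is the cause.

```python
import time


def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed, ms_doing_io)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # not a full diskstats record
        # Layout: major minor name, then the per-device counters.
        stats[fields[2]] = (int(fields[3]), int(fields[7]), int(fields[12]))
    return stats


def busy_percent(before, after, interval_s):
    """Share of the interval (in percent) each device spent doing I/O."""
    result = {}
    for dev, (_, _, ms_after) in after.items():
        if dev in before:
            delta_ms = ms_after - before[dev][2]
            result[dev] = 100.0 * delta_ms / (interval_s * 1000.0)
    return result


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return busy_percent()."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return busy_percent(first, second, interval_s)
```

Calling sample() on the instance while the workers are running returns a dictionary such as {"sda": ...} with one busy percentage per block device.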

Related

Does an open SSH connection to a gcloud VM prevent it from freezing/crashing?

I have an f1-micro gcloud VM instance running Ubuntu 20.04.
It has 0.2 vCPUs and 600 MB of memory.
By "freezing/crashing" I mean that the instance simply stops responding to anything.
From my monitoring I can see that CPU usage peaks at 40% (it is usually steady under 1%), while memory usage is always around 60% (both figures with my Node.js server running).
When I open an SSH connection to my instance and run my Node.js server in the background, everything works fine as long as I keep the SSH connection alive. Once I close the connection, it takes a few more minutes until the instance freezes/crashes. Without closing the SSH connection I can keep it running for hours without any problem.
I don't get any crash or freeze information from gcloud itself. The instance has a green checkmark and is nominally still running. I just can't open a new SSH connection, and the only way to do anything with the instance again is to restart it.
I have Cloud Logging active, and there are no messages in there either.
So, with this knowledge, my question is whether gcloud somehow boosts VMs with an SSH connection to keep them alive.
I don't know what else could cause this behaviour.
My Node.js server uses around 120 MB, another service uses 80 MB, and the GCP monitoring agent uses 30 MB. The Linux free command on the instance shows between 60 MB and 100 MB of available memory.
In addition to what John Hanley and Mike said, you can change your machine type based on your needs.
In the Google Cloud Console, go to VM instances under Compute Engine.
Select the instance name to open its overview page.
Make sure to stop the instance before editing it.
Select a machine type that matches your application's needs.
Save.
For more information, refer to the guides below:
Edit Instance
Machine Family Categories
None of the answers explained the strange behaviour I encountered.
I haven't figured out the cause either, but at least my servers no longer crash/freeze.
I somehow fixed it by running my Node.js application as an actual background job using forever instead of running it like node main.js &.

Unable to connect to SSH on Google Cloud VM Instance

I have run into a problem today where I am unable to connect via SSH to my Google Cloud VM instance running debian-10-buster. SSH has been working until today when it suddenly lost connection while docker was running. I've tried rebooting the VM instance and resetting, but the problem still persists. This is the serial console output on GCE, but I am not sure what to look for in that, so any help would be highly appreciated.
Another weird thing is that earlier today before the problem started, my disk usage was fine and then suddenly I was getting a bunch of errors that the disk was out of space even after I tried clearing up a bunch of space. df showed that the disk was 100% full to the point where I couldn't even install ncdu to see what was taking the space. So then I tried rebooting the instance to see if that would help and that's when the SSH problem started. Now I am unable to connect to SSH at all (even through the online GCE interface), so I am not sure what next steps to take.
Your system has run out of disk space for the boot (root) file system.
The error message is:
Root filesystem has insufficient free space
Shut down the VM, increase the size of the disk in the Google Cloud web console, and then restart the VM.
Provided that there are no uncorrectable file system errors, your system will start up, resize the partition and file system, and be fine.
If you have modified the boot disk (partition restructuring, added additional partitions, etc) then you will need to repair and resize manually.
I wrote an article on resizing the Debian root file system. My article goes into more detail than you need, but I do explain the low level details of what happens.
Google Cloud – Debian 9 – Resize Root File System
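Once the instance is reachable again, it helps to find out what filled the disk. The question mentions that ncdu could not be installed because the disk was full, so here is a minimal standard-library sketch that ranks the entries of a directory by apparent size. The function names are made up for this example, sizes are apparent file sizes rather than allocated blocks, and other mounted file systems (such as /proc) are skipped.

```python
import os


def _same_dev(path, dev):
    """True if path lives on the file system identified by dev."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def tree_size(path, dev):
    """Sum apparent file sizes under path, staying on file system dev."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        # Prune subdirectories that are mount points of other file systems.
        dirnames[:] = [d for d in dirnames
                       if _same_dev(os.path.join(dirpath, d), dev)]
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable
    return total


def largest_children(root, top=10):
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    root_dev = os.stat(root).st_dev
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue
            if st.st_dev != root_dev:
                continue  # different file system, not part of this disk
            if entry.is_dir(follow_symlinks=False):
                sizes.append((tree_size(entry.path, root_dev), entry.path))
            else:
                sizes.append((st.st_size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Running largest_children("/var") as root and then descending into the largest entry is a crude substitute for ncdu. shutil.disk_usage("/") reports the overall total, used, and free bytes for comparison with df.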

Inconsistency Errors in kombu using celery and redis with the key '_kombu.binding.reply.celery.pidbox'

I have two Django sites (archive and test-archive) on one machine. Each has its own virtual environment and different celery queues and daemons, using Python 3.6.9 on Ubuntu 18.04, Django 3.0.2, Redis v 4.0.9, celery v 4.3, and Kombu v4.6.3. This server has 16 GB of RAM, and under load there is at least 10GB free and swap is minimal.
I keep getting this error in my logs:
kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
I tried:
downgrading Kombu to 4.5 for both sites, per some Stack Overflow posts
setting maxmemory=2GB and maxmemory-policy=allkeys-lru in redis.conf, per the Celery docs (https://docs.celeryproject.org/en/stable/getting-started/backends-and-brokers/redis.html#broker-redis); originally the settings were the defaults of unlimited memory and noeviction, and these errors were present with both versions of Kombu
I still get those errors when one site is under load (i.e. doing something like uploading a set of images and processing them) and the other site is idle.
What is a little strange is that on some test runs using test-archive, test-archive will not have any errors, while archive will show those errors, even though the archive site is not doing anything. On other identical test runs using test-archive, test-archive will generate the errors and archive will not.
I know this is a reported bug in kombu/celery, so I am wondering if anyone has a workaround that works more often than not for this configuration. Which versions of celery, kombu, redis, etc. seem to work more often than not? I am happy to share my config files or log files, but there are so many that I thought it would be best to start this discussion with the problem statement and my setup and see what else is needed.
Thanks!

Collectd server not writing down received client data

I have a pretty strange problem with collectd. I'm not new to collectd: I used it for a long time on CentOS-based boxes, but now we have Ubuntu 12.04 LTS boxes and I'm running into a really strange issue.
I'm using version 5.2 on Ubuntu 12.04 LTS. The two boxes reside on Rackspace (possibly important, but I'm not sure). The network plugin is configured using the two local IPs, without any firewall in between and without any security (just to set up a simple client-server scenario).
On both machines collectd writes its own data to the configured folders as it should, but on the server machine it doesn't write the data received from the client.
I troubleshot this with tcpdump, and I can clearly see the UDP traffic and collectd data, including the hostname and plugin names from my client machine, arriving at the server, but they are never flushed to the appropriate folder (configured in collectd). I'm also running everything as root, to rule out permission problems.
Does anyone have an idea or a similar experience? Or some idea of what else I could do to troubleshoot this, besides crawling the internet (I think I clicked on every sensible link Google gave me in the last two days) and checking the network layer (which looks fine)?
One small note: exactly the same thing happened with the official 4.10.2 version from Ubuntu's repo. After troubleshooting it for hours, I upgraded to version 5.
I'd suggest trying out the quite generic troubleshooting procedure based on the csv and logfile plugins, as described in this answer. As everything seems to be fine locally, follow this procedure on the server, activating only the network plugin (in addition to logfile, csv and possibly rrdtool).
Since I found no way of fixing this, I upgraded my Ubuntu to 12.04.2 LTS (3.2.0-24-virtual) and it just started working fine, without any other intervention.

QSslSocket timeouts in Ubuntu Server, but not in Desktop

We have a problem with our Qt-based production server for our business application. As the total number of SSL connections grows over time, some clients fail to connect at all.
QSslSocket::waitForEncrypted() starts to fail with no QSslError, regardless of which timeout is set. The problem starts to kick in once there are more than ~100 active connections.
So there are ~170 connections, twice as many threads, and "lsof" reports a little more than 1000 open files (we had to increase the file "ulimit" for that).
It does not look like a client-side problem, since the IPs that fail and reconnect change over time (some get in successfully, while others then don't).
As mentioned, this happens on Ubuntu Server (Zentyal 10.04 and "vanilla" 9.10), but NOT on Ubuntu Desktop 9.10.
Everything runs inside VMware ESX 4.1, and the systems were tested with the same resources attached. System load stays below 1.0. The daemon runs with root permissions.
It looks like some difference in the "server" versus "desktop" kernel or other configuration, but I can't tell what exactly would prevent the SSL handshake from completing on the server editions.
We are using Qt 4.5.3 compiled by ourselves.
EDIT: after all, it's the same on every Linux I tried. It feels like some kind of per-process socket limit, which is about 1016 - other_opened_files. I'll try to create a new question about that.
EDIT 2: It's a select() and FD_SETSIZE limit problem.
The problem is that Qt uses select(), which can only watch descriptors numbered below the FD_SETSIZE macro. I had to change the FD_SETSIZE value inside /usr/include/bits/typesizes.h before compiling libQtNetwork and libQtCore.
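The same operating-system limit can be observed outside Qt. The sketch below uses Python rather than Qt, purely as an illustration: CPython's select.select() refuses descriptor numbers at or above FD_SETSIZE (1024 on typical Linux builds, which is an assumption about the platform), while poll() has no such bound. The function names are made up for this example.

```python
import os
import select


def select_accepts(fd):
    """Return True if select() is able to watch this descriptor number."""
    try:
        select.select([fd], [], [], 0)
        return True
    except ValueError:
        # Raised for descriptor numbers >= FD_SETSIZE, before any system call.
        return False


def poll_ready(fd, timeout_ms=0):
    """Return True if fd is readable, using poll() instead of select()."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return bool(poller.poll(timeout_ms))
```

With r, w = os.pipe(), select_accepts(r) succeeds because the descriptor number is small, while select_accepts(4096) fails on a platform where FD_SETSIZE is 1024. This matches the observation above that connections start failing once the process holds roughly a thousand open files. The fix described in the question was to rebuild Qt with a larger FD_SETSIZE; interfaces such as poll() or epoll avoid the limit altogether.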
