hung_task_timeout_secs error during copy to a mount point in Linux

I am trying to copy data files from my VM to an NFS VM backed by ZFS storage (both VMs can talk to each other). During the copy I sometimes hit this error:
INFO: task cp: blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
Both of my VMs hang and I have to restart them. If I copy again it works.
I have around 233 data files to copy, and it is becoming difficult to restart the VMs again and again.
I looked at the solutions suggested online and changed vm.dirty_ratio to 5 and vm.dirty_background_ratio to 10, but that did not resolve it.
I am running these VMs on VirtualBox; I allocated around 17GB of RAM to one and around 6GB to the NFS VM.
Is there any hack that could help me copy these files to the NFS share without my VMs hanging?
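For reference, I applied those values roughly along these lines (a sketch of the commands; the /etc/sysctl.conf step is only there to make them persistent):
# Lower the dirty-page thresholds for the current boot
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=10
# Make them persistent across reboots
echo "vm.dirty_ratio = 5" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 10" >> /etc/sysctl.conf
sysctl -p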

I am sorry if I am answering a question with more questions, but this case has many variables that need exploring.
1. You have a Linux VM sharing your storage (assumption).
A. Which distro? 32 or 64 bit? When the problem happens, what does top report for system load?
B. Local storage, NAS, or SAN?
C. Which version of NFS, 3 or 4?
D. Can you set the options of your mount when mapping the NFS share? You might want to play with rsize and wsize, setting them to at least 64000. I would also recommend setting noatime and nodiratime on the share (see the example mount entry after this list).
E. From my VMware background with Gluster, there are some timeout/refresh settings you can tune on the storage side, controlling how often the storage advertises that it is alive. A good starting point is 20 seconds.
F. VMware can tell you how much latency you have for reads and writes at both the physical and the VM level. Try to measure those to know what to blame.
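To illustrate point D, a mount entry along these lines is a reasonable starting point (server name, export path, and mount point are placeholders; tune the values for your environment):
# /etc/fstab entry with larger rsize/wsize and access-time updates disabled
nfs-server:/export/share  /mnt/share  nfs  rsize=65536,wsize=65536,noatime,nodiratime  0  0
# Or test the options by hand first
mount -t nfs -o rsize=65536,wsize=65536,noatime,nodiratime nfs-server:/export/share /mnt/share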
Ah, and, of course, make sure your Linux VM has the latest patches applied.
Let's see where we get from here.

Related

Unable to connect to SSH on Google Cloud VM Instance

I have run into a problem today where I am unable to connect via SSH to my Google Cloud VM instance running debian-10-buster. SSH had been working until today, when the connection was suddenly lost while Docker was running. I've tried rebooting and resetting the VM instance, but the problem persists. This is the serial console output on GCE, but I am not sure what to look for in it, so any help would be highly appreciated.
Another weird thing is that earlier today, before the problem started, my disk usage was fine; then suddenly I was getting a bunch of errors that the disk was out of space, even after I tried clearing up a bunch of space. df showed that the disk was 100% full, to the point where I couldn't even install ncdu to see what was taking up the space. I then rebooted the instance to see if that would help, and that's when the SSH problem started. Now I am unable to connect over SSH at all (even through the online GCE interface), so I am not sure what steps to take next.
Your system has run out of disk space on the boot (root) file system.
The error message is:
Root filesystem has insufficient free space
Shut down the VM, resize the disk to a larger size in the Google Cloud Web GUI, and then restart the VM.
Provided that there are no uncorrectable file system errors, your system will start up, resize the partition and file system, and be fine.
If you have modified the boot disk (restructured the partitions, added additional partitions, etc.), then you will need to repair and resize it manually.
I wrote an article on resizing the Debian root file system. It goes into more detail than you need, but it explains the low-level details of what happens.
Google Cloud – Debian 9 – Resize Root File System
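If you prefer the command line over the Web GUI, the same sequence looks roughly like this (instance name, disk name, zone, and size are placeholders):
# Stop the VM, grow its boot disk, then start it again
gcloud compute instances stop my-instance --zone=us-central1-a
gcloud compute disks resize my-boot-disk --size=50GB --zone=us-central1-a
gcloud compute instances start my-instance --zone=us-central1-a
# Recent Debian images grow the root partition and file system automatically on boot;
# if yours does not, grow them by hand once you are back in:
# growpart /dev/sda 1 && resize2fs /dev/sda1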

Proxmox VE: How to create a raw disk and pass it through to a VM

I am searching for an answer on how to create a raw disk and pass it through to a VM using Proxmox. Through that I am hoping to get full control of the disk, including S.M.A.R.T. stats and disk spindown.
Currently I am using the SATA passthrough offered by Proxmox.
Unfortunately I have no clue how to create a raw disk file from my (empty) disk. Furthermore, I am not entirely certain how to bind it to the VM.
I hope someone knows the relevant steps.
Side notes:
This question covers just one approach I want to try in order to achieve a certain goal. For the sake of simplicity I have confined my question to the part above. However, if you have a better idea, feel free to give me a hint. So far I have tried a lot of things to achieve my ultimate goal.
Goal that I want to achieve:
I am using Proxmox VE 5.3-8 on an HP ProLiant Gen8 server. It hosts several VMs, among which OMV should serve as a NAS. Since the files will not be accessed very often, I want the drives to spin down.
My goal is reduction of noise and power savings.
Current status:
I passed through two disks by adding them to /etc/pve/nodes/pve/qemu-server/vmid.conf:
sata1: /dev/disk/by-id/{disk-id}
Through that I do see SMART stats, and everything except disk spindown works fine (the equivalent qm command is sketched after these notes). Using virtio instead of SATA does not give me SMART values.
Using hdparm -y to put a drive to sleep does not work inside the VM. Doing the same on the Proxmox console results in a sleep, but the drive wakes up a few seconds later.
Passing through the entire HBA is currently not an option.
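For completeness, the same mapping can be done with qm instead of editing the config file directly; I assume it produces an identical sata1 line (the VM id and disk id below are placeholders):
# Attach the whole physical disk to VM 100 as a SATA device
qm set 100 -sata1 /dev/disk/by-id/ata-EXAMPLE_DISK_ID
# Check the resulting line in the VM config
grep sata1 /etc/pve/qemu-server/100.conf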
I read in a forum that first installing Debian and then manually installing the Proxmox packages resulted in success. However, that was for Debian Jessie and three years ago.
Install Proxmox VE on Debian Stretch
Before I try this as a last resort, I want to be sure whether passing the disk through as a raw file will lead to the desired result.
Maybe someone has an idea on how to achieve my ultimate goal.
I do not have a clear answer to your question as far as "passing through" the disk goes, but I recently found a good enough solution for my use case.
I have an HDD that I planned to use as a backup directory for VMs, but I also wanted to put any kind of data on it and share the disk with any VM that needs it.
The solution I found is to format the disk using ZFS, then create mount points for the different uses (vzdump backups, a shared NAS folder across VMs, an ISO mount point, etc.). I followed this guide: https://forum.level1techs.com/t/how-to-create-a-nas-using-zfs-and-proxmox-with-pictures/117375
I ended up installing Samba on the Proxmox host itself, with a config that shares some folders/mount points of the disk via SMB. Now the device appears as a normal disk over the network, with excellent read/write speed since everything is local.
Sorry that this post does not "answer" your question (no SMART data or low-level things like that :'( ), BUT it does give you shared storage ^^'
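In outline, the steps I followed were something like this (pool name, disk id, and share name are placeholders; adjust smb.conf to your needs):
# Create a single-disk ZFS pool on the spare drive plus datasets for the different uses
zpool create tank /dev/disk/by-id/ata-EXAMPLE_DISK_ID
zfs create -o mountpoint=/tank/nas tank/nas
zfs create -o mountpoint=/tank/vzdump tank/vzdump
# Minimal Samba share served from the Proxmox host itself
apt install samba
cat >> /etc/samba/smb.conf <<'EOF'
[nas]
   path = /tank/nas
   read only = no
   guest ok = no
EOF
systemctl restart smbd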

How do I migrate Proxmox 3.x openVZ containers to Proxmox 4.x LXC?

I have just upgraded one of the Proxmox machines in a cluster from 3.4 to 4.2 following these instructions.
Normal VMs migrated correctly. We had to change the hard drives from virtio to IDE so that the machines would detect them, but other than that, normal VMs migrated without much problem.
OpenVZ containers, on the other hand, just appeared in the list of machines in the GUI but didn't allow me to do anything. I found and removed /etc/pve/nodes/testnode/openvz/xxx.conf, and that finally removed them from the GUI.
Then I tried to restore one OpenVZ container into an LXC machine following these instructions, but I couldn't. This is what I tried:
The manual says that I should restore a previous backup into a new LXC container, but I cannot find that option anywhere in the GUI.
I tried to restore using the console with this command: pct restore 236 /mnt/nas/dump/vzdump-openvz-236-2016_08_26-20_53_07.tar.gz, but after a while the disk being restored fills up (No space left on device). I don't understand why this happens; maybe the uncompressed content plus the base LXC image is bigger than the 12GiB I assigned to that machine when it was on OpenVZ?
When that didn't work, I created a new LXC container from a similar template (CentOS) on the same IP, but with a bigger disk (20GiB). I then renamed the backup file so that it had a valid name for LXC containers (mv /mnt/nas/dump/vzdump-openvz-236-2016_08_26-20_53_07.tar.gz /mnt/nas/dump/vzdump-lxc-236-2016_08_26-20_53_07.tar.gz). Then I tried to restore that backup to the new machine, but it reformatted the machine back to 12GiB and I got exactly the same result as before (Cannot write: No space left on device).
To see if it was just this machine, I tried restoring a machine from another Proxmox host, but it said that that machine already exists on another host.
After those last two tests I guessed that the disk size, the VM's id, and the Proxmox host must all be recorded somewhere in the backup file, but I cannot find them.
How can I restore my old OpenVZ machines into LXC containers?
Edit: I have been able to restore eight machines so far; I only get this error with this particular machine.
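For reference, the failing restore was along these lines; one thing that may be worth verifying (an assumption based on pct's documented options, not something confirmed here) is whether passing an explicit rootfs size overrides the 12GiB recorded in the backup:
# The restore that fails with "No space left on device"
pct restore 236 /mnt/nas/dump/vzdump-lxc-236-2016_08_26-20_53_07.tar.gz
# Possibly worth trying: request a larger root file system explicitly
# (assumes pct restore accepts the same --storage/--rootfs options as pct create)
pct restore 236 /mnt/nas/dump/vzdump-lxc-236-2016_08_26-20_53_07.tar.gz --storage local-lvm --rootfs local-lvm:20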

Unable to increase disk size on file system

I'm currently trying to log in to one of my instances on Google Cloud, but I find myself unable to do so. Somehow the machine escaped my attention and its hard disk got completely full. Of course I want to free some disk space and make sure the server can start again, but I am facing some issues.
First off, I found the guide on increasing the size of the persistent disk (https://cloud.google.com/compute/docs/disks/add-persistent-disk). I followed it and have already set the disk to 50GB, which should be fine for now.
However, at the file system level, because the disk is full I cannot make any SSH connection. The error is simply a timeout, caused by the fact that there is absolutely no space for the SSH daemon to write to its log. Without any form of connection I cannot free disk space and/or run the resize2fs command.
Furthermore, I have already tried different approaches:
I do not seem to be able to change the boot disk to something else.
I created a snapshot and tried to increase the disk size on a new instance created from that snapshot, but it has the same problem (the file system is stuck at 15GB).
I am not allowed to mount the disk as an additional disk in another instance.
Currently I'm pretty much out of ideas. The important data on the disk is backed up, but I'd rather get the setup working as well. Does anyone have any clue where to start?
[EDIT]
Currently still trying out new things. I have also tried to run shutdown and startup scripts that remove /opt/* in order to free some temporary space, but the scripts either don't run or produce some error I cannot catch. It's pretty frustrating working nearly blind, I must say.
The next step for me would be to try and get the snapshot locally. It should be doable using a bucket, but I will let you know.
[EDIT2]
Getting a snapshot locally is not an option either, or so it seems. Images of Google Cloud instances can only be created or deleted, not downloaded.
I'm now out of ideas.
So I finally found the answer. These are the steps I took:
In the GUI I increased the size of the disk to 50GB.
In the GUI I detached the drive by deleting the machine, while making sure I did not throw away the original disk.
In the GUI I created a new machine with a sufficiently big hard disk.
On the command line (important!) I attached the disk to the newly created machine (the GUI option still has a bug ...).
After that I could mount the disk as a secondary disk and perform all the operations I needed.
Keep in mind: by default Google Cloud images do NOT use logical volume management, so pvresize/lvresize/etc. are not installed, and resize2fs might not work out of the box.
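Put together, the command-line attach and the follow-up file system work looked roughly like this (instance, disk, zone, and device names are placeholders):
# Attach the old boot disk to the rescue VM as a secondary disk (this is the CLI step)
gcloud compute instances attach-disk rescue-vm --disk old-boot-disk --zone=us-central1-a
# Inside the rescue VM: mount it, free some space, then grow the file system
sudo mkdir -p /mnt/rescue
sudo mount /dev/sdb1 /mnt/rescue
sudo rm -rf /mnt/rescue/tmp/*        # or whatever is actually safe to delete
sudo growpart /dev/sdb 1             # only if the partition has not grown to 50GB yet
sudo resize2fs /dev/sdb1
sudo umount /mnt/rescue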

"Unstable" NFS mount point

First of all, this is the first time I'm posting a question on Stack Overflow, so please don't kill me if I've done anything wrong.
Here goes my issue:
We have a few dedicated servers with a well-known French provider. With one of those servers we recently acquired 5,000GB of backup space which can be mounted via NFS, and that's what we've done.
The issue comes when backing up big files. Every night we back up several VMs running on that host, and we know for a fact that the backups are not being done properly (the file size differs a lot from one day to the next, plus we've checked the content of the backups and there's stuff missing).
So it seems like the mount point is not stable and the backups are not being done properly. It seems like there are micro network cuts, and as a result the hypervisor gives up on the current backup and starts with the next one.
This is how it's mounted right now:
xxx.xxx.xxx:/export/ftpbackup/xxx.ip-11-22-33.eu/ /NFS nfs auto,timeo=5,retrans=5,actimeo=10,retry=5,bg,soft,intr,nolock,rw,_netdev,mountproto=tcp 0 0
Any advice? Are there any parameters you would change?
We need to be sure that the NFS mount point works reliably in order to get proper backups.
Thank you so much
By specifying "soft" as an option, you're saying that it's OK for the mount to be unreliable -- for the kernel to return an I/O error instead of running the I/O to completion when things are taking too long. Using a hard mount, without the "soft" option instructs the kernel to avoid returning I/O errors for timeouts.
This will fix your corrupted backups, but... your backup process will hang hard until I/O's complete. An alternative is to use much longer timeout values.
You're using TCP for the mount protocol, but not for NFS itself. If your server supports it, consider adding "tcp" to the options line.
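Putting those two suggestions together, the fstab entry might look something like this (same export path as above; the timeo/retrans values are starting points rather than tested recommendations):
# hard mount, longer timeout, TCP for both the mount protocol and NFS itself
xxx.xxx.xxx:/export/ftpbackup/xxx.ip-11-22-33.eu/ /NFS nfs auto,hard,intr,timeo=600,retrans=2,retry=5,bg,nolock,rw,_netdev,proto=tcp,mountproto=tcp 0 0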
