I have an AWS micro instance running Ubuntu 12.04 LTS. Last night when I SSHed in, I ran apt-get update and it gave me an error (I don't recall which), so I thought I would give my instance a reboot. This morning, it says that my instance has failed an Instance Status Check and I am unable to SSH into it. The bottom of my system log is below. Is there any way to save this instance, and if not, any way to save the data?
Thank you!
Loading, please wait...
[35914369.823672] udevd[81]: starting version 175
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[35914370.187877] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[35914373.347844] init: mountall main process (183) terminated with status 1
General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
Give root password for maintenance
(or type Control-D to continue):
It depends on how badly broken the filesystem is.
You can start a new instance in AWS and then attach the EBS volume to your new instance. That may help you recover the data.
Don't terminate the instance, otherwise you could lose the filesystem completely.
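A minimal sketch of that recovery path, assuming the AWS CLI is configured; the volume ID, instance IDs, and device names below are placeholders:
# Stop the broken instance, then move its root volume to a healthy one
aws ec2 stop-instances --instance-ids i-broken
aws ec2 detach-volume --volume-id vol-12345678
aws ec2 attach-volume --volume-id vol-12345678 --instance-id i-healthy --device /dev/sdf
# On the healthy instance, mount the volume and copy the data off
sudo mkdir -p /mnt/recovery
sudo mount /dev/xvdf1 /mnt/recovery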
Always use the "Create AMI" option of a running instance before doing an apt-get update/upgrade or yum update/upgrade. This way, if your system fails to come up after a reboot (after the update), you can spin up a 'before' version (i.e. bootable) instance using the AMI you just created.
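With the AWS CLI that looks something like this (a sketch; the instance ID and image name are placeholders):
# Create a bootable backup AMI of the instance before upgrading;
# note that without --no-reboot the instance is rebooted during imaging
aws ec2 create-image --instance-id i-12345678 --name "pre-upgrade-backup" --no-reboot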
In this case, Ubuntu probably tried to install a new kernel and/or ram file system (ramfs) and since this is an AWS virtual machine with kernel and ramfs dependencies, the standard Ubuntu build probably did not meet those dependencies and your virtual machine is now toast.
As mentioned, if you need to recover data from the unbootable system, mount its EBS volume on a working system. It may complain that it is in use. If so, and to keep the EBS volume, you must check the option that preserves the volume before you terminate the instance. The default on termination of an instance is to destroy its EBS volume, because the assumption is that you booted from an EBS-backed AMI that you previously (or regularly) created.
I'm having an issue with my script that calculates integrity on this version of Ubuntu:
cyber#ubuntu:/$ hostnamectl
Static hostname: ubuntu
Icon name: computer-vm
Chassis: vm
Machine ID: 48d13c046d74421781e6c6f771f6ac31
Boot ID: 847b838897ac47eb932f6427361232d1
Virtualization: vmware
Operating System: Ubuntu 20.04.4 LTS
Kernel: Linux 5.13.0-51-generic
Architecture: x86-64
I'm wondering if /sys/kernel/tracing/per_cpu/cpu45 is by any chance a "live" file, because calculating the hash of the files inside it takes infinite time.
If you want to check filesystem integrity, skip the whole /sys folder - it is an interface to the kernel.
It would also be better to skip the /proc (also a kernel interface) and /dev (special or device files) folders. For example, you can read from /dev/zero or /dev/urandom forever. Network mounts can give you a lot of bright moments too.
Your script can also freeze on reading pipes: if it has enough permissions, it can read from a pipe forever.
If I were building such a script, I'd start from the mounts, check their filesystem types, and scan only the ones that are needed. For example, if a mount is tmpfs, its contents are located in RAM and will be wiped after a reboot.
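A minimal sketch of that approach, assuming a sha256-based integrity check and a placeholder output file:
#!/bin/bash
# Stay on the root filesystem (-xdev), prune the kernel interfaces and
# device files, and hash only regular files so nothing can block forever.
find / -xdev \( -path /proc -o -path /sys -o -path /dev -o -path /run \) -prune \
    -o -type f -print0 | xargs -0 sha256sum > /tmp/integrity.sha256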
You should also check out the Filesystem Hierarchy Standard:
https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard
I have the following problem and I am not sure what is happening. I'll explain briefly.
I work on a cluster with several nodes which are managed via Slurm. All these nodes share the same disk storage (I think it uses NFSv4). My problem is that since this storage is shared by a lot of users, we have a limited amount of disk space per user.
I use Slurm to launch Python scripts that run some code and save the output to a CSV file and a folder.
Since I need more space than I am assigned, what I do is mount a remote folder via sshfs from a machine where I have plenty of disk. Then, I configure the Python script to write to that folder via an environment variable named EXPERIMENT_PATH. The example script is the following:
Python script:
import os

# Use EXPERIMENT_PATH if it is set, otherwise fall back to the current directory
root_experiment_dir = os.getenv('EXPERIMENT_PATH')
if root_experiment_dir is None:
    root_experiment_dir = os.path.expanduser("./")
print(root_experiment_dir)

## create experiment directory
experiment_dir = os.path.join(root_experiment_dir, 'exp_dir')
os.makedirs(experiment_dir, exist_ok=True)

# Append to the results file if it already exists, otherwise create it
file_results_dir = os.path.join(root_experiment_dir, 'exp_dir', 'results.csv')
if os.path.isfile(file_results_dir):
    f_results = open(file_results_dir, 'a')
else:
    f_results = open(file_results_dir, 'w')
If I directly launch this Python script, I can see the created folder and file on my remote machine whose folder has been mounted via sshfs. However, if I use sbatch to launch this script via the following bash file:
export EXPERIMENT_PATH="/tmp/remote_mount_point/"
sbatch -A server -p queue2 --ntasks=1 --cpus-per-task=1 --time=5-0:0:0 --job-name="HOLA" --output='./prueba.txt' ./run_argv.sh "python foo.py"
where run_argv.sh is a simple bash script that takes its arguments from argv and runs them, i.e. the file contains:
#!/bin/bash
$*
then I observe that nothing has been written on my remote machine. I can check the mounted folder in /tmp/remote_mount_point/ and nothing appears there either. Only when I unmount this remote folder using fusermount -u /tmp/remote_mount_point/ can I see that a folder named /tmp/remote_mount_point/ has been created on the machine running the job, with the file created inside, but obviously nothing appears on the remote machine.
In other words, it seems like launching through Slurm bypasses the sshfs-mounted folder and creates a new one on the host machine, which is only visible once the remote folder is unmounted.
Does anyone know why this happens and how to fix it? I emphasize that this only happens if I launch everything through the Slurm manager. If not, everything works.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
Thanks in advance.
I shall emphasize that all the nodes in the cluster share the same disk space so I guess that the mounted folder is visible from all machines.
This is not how it works, unfortunately. To put it simply: you could say that mount points inside mount points (here SSHFS inside NFS) are "stored" in memory and not in the "parent" filesystem (here NFS), so the compute nodes have no idea there is an SSHFS mount on the login node.
For your setup to work, you should create the SSHFS mount point inside your submission script (which can create a whole lot of new problems, for instance regarding authentication, etc.)
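A minimal sketch of such a submission script, assuming key-based SSH from the compute nodes to the remote machine works non-interactively; the host, paths, and script name are placeholders:
#!/bin/bash
# Mount the remote folder on the node that actually runs the job
MOUNT_POINT=/tmp/remote_mount_point
mkdir -p "$MOUNT_POINT"
sshfs user@remote-host:/data/experiments "$MOUNT_POINT"

# Point the Python script at the mount and run it
export EXPERIMENT_PATH="$MOUNT_POINT"
python foo.py

# Clean up the mount when the job finishes
fusermount -u "$MOUNT_POINT"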
But before you dive into that, you should probably ask whether the cluster has another filesystem ("scratch", "work", etc.) where you could temporarily store larger data than the quota allows in your home filesystem.
After changing a BCache cache device, I was unable to mount my BTRFS filesystem without the "usebackuproot" option; I suspect that there is corruption without any disk failures. I've tried recreating the previous caching setup but it doesn't seem to help.
So the question is, what now? Both btrfs rescue chunk-recover and btrfs check failed, but with the "usebackuproot" option I can mount it r/w, the data seems fine, and btrfs rescue super-recover reports no issues. I'm currently performing a scrub operation but it will take several more hours.
Can I trust the data stored on that filesystem? I made a read-only snapshot shortly before the corruption occurred, but still within the same filesystem; can I trust that? Will btrfs scrub or any other operation truly check whether any of my files are damaged? Should I just add "usebackuproot" to my /etc/fstab (and update-initramfs) and call it a day?
I don't know if this will work for everyone, but I saved my filesystem with the following steps*:
Mount the filesystem with mount -o usebackuproot†
Scrub the mounted filesystem with btrfs scrub
Unmount (or remount as ro) the filesystem and run btrfs check --mode=lowmem‡
*These steps assume that you're unable to mount the filesystem normally and that btrfs check has failed. Otherwise, try that first.
†If this step fails, try running btrfs rescue super-recover, and if that alone doesn't fix it, btrfs rescue chunk-recover.
‡This mode will not fix your filesystem if problems are found, but the default mode of btrfs check is very memory intensive and will be killed by the kernel if run from a live image. If problems are found, make or use a separate installation to run btrfs check --repair.
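Put together, the sequence looks roughly like this (a sketch; the device and mount point are placeholders):
mount -o usebackuproot /dev/sdb1 /mnt
btrfs scrub start -B /mnt        # -B runs the scrub in the foreground
umount /mnt
btrfs check --mode=lowmem /dev/sdb1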
I am following the basic NFS server tutorial here; however, when I try to create the test busybox replication controller, I get an error indicating that the mount has failed.
Can someone point out what am I doing wrong ?
MountVolume.SetUp failed for volume
"kubernetes.io/nfs/4e247b33-a82d-11e6-bd41-42010a840113-nfs"
(spec.Name: "nfs") pod "4e247b33-a82d-11e6-bd41-42010a840113" (UID:
"4e247b33-a82d-11e6-bd41-42010a840113") with: mount failed: exit
status 32 Mounting arguments: 10.63.243.192:/exports
/var/lib/kubelet/pods/4e247b33-a82d-11e6-bd41-42010a840113/volumes/kubernetes.io~nfs/nfs
nfs [] Output: mount: wrong fs type, bad option, bad superblock on
10.63.243.192:/exports, missing codepage or helper program, or other error (for several filesystems (e.g. nfs, cifs) you might need a
/sbin/mount.<type> helper program) In some cases useful info is found
in syslog - try dmesg | tail or so
I have tried using an Ubuntu VM as well, just to see if I can mitigate a possibly missing /sbin/mount.nfs dependency by running apt-get install nfs-common, but that too fails with the same error.
Which container image are you using? On the 18th of October, Google announced a new container image which doesn't support NFS yet. Since Kubernetes 1.4 this image (called GCI) is the default. See also https://cloud.google.com/container-engine/docs/node-image-migration#known_limitations
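If the image is the culprit, the migration documented there boils down to switching the node image type back; a sketch, assuming the gcloud CLI of that time and a placeholder cluster name:
gcloud container clusters upgrade my-cluster --image-type=container_vm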
I'm writing a Chef recipe to automate setting up software RAID 1 on an existing system. The basic procedure is:
1. Clear the partition table on the new disk (/dev/sdb)
2. Add new partitions, and set them to RAID using parted (sdb1 for /boot and sdb2 with LVM for /)
3. Create a degraded RAID with /dev/sdb using mdadm --create ... missing
4. pvcreate /dev/md1 && vgextend VolGroup /dev/md1
5. pvmove /dev/sda2 /dev/md1
6. vgreduce VolGroup /dev/sda2 && pvremove /dev/sda2
7. ...
8. ...
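For reference, step 3 would look roughly like this (a hypothetical illustration; the array and partition names follow the steps above):
# Create degraded RAID1 arrays with the second member marked missing
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb2 missing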
I'm stuck on no. 5. With 2 disks of the same size I always get an error:
Insufficient free space: 10114 extents needed, but only 10106 available
Unable to allocate mirror extents for pvmove0.
Failed to convert pvmove LV to mirrored
I think it's because when I do the mdadm --create, it adds extra information to the disk, so it has slightly fewer physical extents.
To remedy the issue, one would normally reboot the system off a live distro and:
e2fsck -f /dev/VolGroup/lv_root
lvreduce -L -0.5G --resizefs ...
pvresize --setphysicalvolumesize ...G /dev/sda2
etc etc
reboot
and continue with step no. 5 above.
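Filled in with hypothetical values (the real figures depend on how many extents the md metadata consumes):
e2fsck -f /dev/VolGroup/lv_root
# Shrink the LV and its filesystem by half a gigabyte
lvreduce -L -0.5G --resizefs /dev/VolGroup/lv_root
# Shrink the PV so it fits inside /dev/md1; the size is a placeholder
pvresize --setphysicalvolumesize 39.5G /dev/sda2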
I can't do that with Chef as it can't handle the rebooting onto a live distro and continuing where it left off. I understand that this obviously wouldn't be idempotent.
So my requirement is to be able to lvreduce (somehow) on the live system, without using a live distro CD.
Anyone out there have any ideas on how this can be accomplished?
Maybe?:
Mount a remote filesystem as root and remount current root elsewhere
Remount the root filesystem as read-only (but I don't know how that's possible, as you can't unmount the live system in the first place).
Or another solution to somehow reboot into a live distro, script the resize, and reboot back and continue the Chef run (not sure if this is even possible).
Ideas?
I'm quite unsure Chef is the right tool for this.
Not a definitive solution, but here is what I would do in this case:
Create a live system with Chef and a cookbook
Boot on it
Run Chef as chef-solo with the recipe doing the work (which should work, as the physical disks are unmounted at first); see the sketch after this list
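A minimal sketch of that chef-solo run, assuming the cookbook has been copied into the live environment; the config path and recipe name are placeholders:
# Run the RAID/LVM recipe standalone, without a Chef server
chef-solo -c /tmp/solo.rb -o 'recipe[raid_setup]'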
The best way would be to write cookbooks that let you redo the target boxes from scratch; once that's done, you can entirely reinstall the target with the correct partitioning at system install time and let Chef rebuild your application stack after that.