How can I simulate a failed disk during testing?

In a Linux VM (VMware Workstation or similar), how can I simulate a failure on a previously working disk?
I have a situation happening in production where a disk fails (probably a controller, cable or firmware problem). Obviously this is not predictable or reproducible, so I want to test my monitoring to ensure that it alerts correctly.
Ideally I'd like to be able to simulate a situation where writes fail but reads succeed, as well as a complete failure, i.e. the SCSI interface reports errors back to the kernel.

There are several layers at which a disk error can be simulated. If you are testing a single user-space program, probably the simplest approach is to interpose the appropriate calls (e.g. write()) and have them sometimes return an error. The libfiu fault-injection library can do this using its fiu-run tool.
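For example, a minimal sketch with libfiu, where ./myprogram is a placeholder for the binary under test: -x preloads the POSIX fault-injection wrappers, and -c enables failures for the whole POSIX I/O group.
fiu-run -x -c "enable name=posix/io/*" ./myprogram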
Another approach is to use a kernel driver that passes data through to/from another device, but injects faults along the way. You can then mount the device and use it from any application as if it were a faulty disk. The fsdisk driver is an example of this.
There is also a fault injection infrastructure that has been merged into the Linux kernel, although you will probably need to reconfigure your kernel to enable it. It is documented in Documentation/fault-injection/fault-injection.txt. This is useful for testing kernel code.
It is also possible to use SystemTap to inject faults at the kernel level. See The SCSI fault injection test and Kernel Fault injection using SystemTap.

To add to mark4o's answer, you can also use Linux's Device Mapper to generate failing devices.
Device Mapper's delay device can be used to send read and write I/O of the same block to different underlying devices (it can also delay that I/O as its name suggests). Device Mapper's error device can be used to generate permanent errors when a particular block is accessed. By combining the two you can create a device where writes always fail but reads always succeed for a given area.
The above is a more complicated example of what is described in the question Simulate a faulty block device with read errors? (see https://stackoverflow.com/a/1871029 for a simple Device Mapper example).
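As a minimal sketch of that combination, assuming /dev/sdX is the backing disk and the target names are arbitrary:
SIZE=$(blockdev --getsz /dev/sdX)
# every I/O to this target fails
dmsetup create sdx_err --table "0 $SIZE error"
# reads are served from the real disk; writes are routed to the error target
dmsetup create sdx_badwrites --table "0 $SIZE delay /dev/sdX 0 0 /dev/mapper/sdx_err 0 0"
Reads from /dev/mapper/sdx_badwrites should then succeed while writes fail.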
There is also a list of Linux disk fault injection mechanisms on the Special File that causes I/O error Unix & Linux question.

A simple way to make a SCSI disk disappear with a 2.6 kernel is:
echo 1 > /sys/bus/scsi/devices/H:B:T:L/delete
(H:B:T:L is host, bus, target, LUN). To simulate the read-only case you'll have to use the fault injection methods that mark4o mentioned, though.
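For example, assuming the disk lives at the hypothetical address 2:0:0:0:
echo 1 > /sys/bus/scsi/devices/2:0:0:0/delete
# bring it back later by rescanning the corresponding SCSI host
echo "- - -" > /sys/class/scsi_host/host2/scan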

The Linux kernel provides a nice feature called “fault injection”. For example, to mark a partition as failing:
echo 1 > /sys/block/vdd/vdd2/make-it-fail
To set up some of the options:
mkdir /debug
mount debugfs /debug -t debugfs
cd /debug/fail_make_request
echo 10 > interval # interval between failures
echo 100 > probability # failure probability in percent (100% here)
echo -1 > times # how many times to fail: -1 means no limit
https://lxadm.com/Using_fault_injection
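With the settings above, reads from the partition marked with make-it-fail should start returning I/O errors (with interval set to 10, failures are spaced out rather than hitting every request):
dd if=/dev/vdd2 of=/dev/null bs=4k count=100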

You can use the scsi_debug kernel module to simulate a RAM-backed SCSI disk; it supports all the SCSI errors via its opts and every_nth options.
See http://sg.danny.cz/sg/sdebug26.html for details.
Example of a medium error on sector 4656:
[fge@Gris-Laptop ~]$ sudo modprobe scsi_debug opts=2 every_nth=1
[fge@Gris-Laptop ~]$ sudo dd if=/dev/sdb of=/dev/null
dd: error reading ‘/dev/sdb’: Input/output error
4656+0 records in
4656+0 records out
2383872 bytes (2.4 MB) copied, 0.021299 s, 112 MB/s
[fge@Gris-Laptop ~]$ dmesg | tail
[11201.454332] blk_update_request: critical medium error, dev sdb, sector 4656
[11201.456292] sd 5:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11201.456299] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current]
[11201.456303] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error
[11201.456308] sd 5:0:0:0: [sdb] CDB: Read(10) 28 00 00 00 12 30 00 00 08 00
[11201.456312] blk_update_request: critical medium error, dev sdb, sector 4656
You can alter the opts and every_nth options at runtime via sysfs:
echo 2 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts # opts=2: inject medium errors
echo 1 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts # opts=1: noisy logging only, stop injecting errors
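When you are done, the simulated disk disappears once the module is unloaded:
sudo rmmod scsi_debug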

One can also use methods provided by the disks themselves for media error testing. SCSI has a WRITE LONG command that can be used to corrupt a block by writing data with an invalid ECC. SATA and NVMe have similar commands.
For the most common case (SATA) you can use hdparm with --make-bad-sector to employ that command; for SCSI you can use sg_write_long, and for NVMe you can use nvme-cli with the write-uncor option.
The big advantage these commands have over other injection methods is that they behave just like a real drive failure, with the full latency impact and with recovery via reallocation when the sector is next written. The drive's error counters also increase.
The disadvantage is that if you do this too often on the same drive, its error counters will climb, SMART may flag the disk as bad, or you may exhaust its reallocation table. So use it for manual testing, but don't run it too often in automated testing.
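For example, with hdparm on SATA (sector 4656 and /dev/sdX are placeholders; this really corrupts the data stored at that LBA):
sudo hdparm --make-bad-sector 4656 --yes-i-know-what-i-am-doing /dev/sdX
# reading the sector now returns an I/O error
sudo dd if=/dev/sdX of=/dev/null bs=512 skip=4656 count=1
# writing the sector again lets the drive reallocate and repair it
sudo hdparm --repair-sector 4656 --yes-i-know-what-i-am-doing /dev/sdX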

You can also use a low-level SCSI utility from sg3_utils to stop the drive. It will still respond to INQUIRY, so its state will still be "running", but reads and writes will fail until it is started again. I've tested RAID drive removal and recovery with mdadm this way.
sg_start --stop /dev/sdb
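The drive can be spun up again afterwards with:
sg_start --start /dev/sdb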

Related

Is running `sync` necessary after writing a disk image?

A common way to write an image to disk looks like:
dd if=file.img of=/dev/device
After this command, is it necessary to run sync?
The sync(2) man page explains that it only flushes filesystem caches. Since dd is not related to any filesystem, I think it is not necessary to run sync. However, the block layer is complex, and when in doubt most people prefer to run sync.
Does anyone have proof that it is useful or useless?
TL;DR: Run blockdev --flushbufs /dev/device after dd.
I tried to follow the different code paths in the kernel. Here is what I understood:
ioctl(block_dev, BLKFLSBUF, 0) calls blkdev_flushbuf(). Considering its name, it should flush the caches associated with the device (or, I think, you can consider it a bug in the device driver otherwise). I think it should also be responsible for flushing hardware caches if they exist. Notice that e2fsprogs uses BLKFLSBUF.
fdatasync() (and fsync()) will call blkdev_fsync(). It looks like blkdev_flushbuf(), but it only affects the range of data written by the current process (it uses filemap_write_and_wait_range(), while BLKFLSBUF uses filemap_write_and_wait()).
Closing a block device calls blkdev_close(), which does not flush buffers.
sync() will call sync_fs(). It will flush filesystem caches and call fsync() on the underlying block device.
The command sync /dev/device will call fsync() on /dev/device. However, I think it is useless, since dd didn't touch any filesystem.
So my conclusion is that a call to sync has no (direct) impact on the block device. However, passing conv=fdatasync (or conv=fsync) to dd is the only way to guarantee that the data is correctly written to the media.
If you have run dd but forgot to pass conv=fdatasync, running sync /dev/device is not sufficient. You would have to run dd with conv=fdatasync over the whole device again. Alternatively, you can issue a BLKFLSBUF ioctl to flush the whole device. Unfortunately, there is no standard command for that.
EDIT: You can issue a BLKFLSBUF with blockdev --flushbufs /dev/device.
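Putting it together, either flush explicitly after a plain dd or have dd flush before it exits:
dd if=file.img of=/dev/device
blockdev --flushbufs /dev/device
# or, in one step:
dd if=file.img of=/dev/device conv=fdatasync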
To ensure the data is flushed on a USB device before unplugging it, I use the following command:
echo 1 > /sys/block/${device}/device/delete
This way, the data is flushed, and if the device is a hard drive, then the head is parked.

How to force immediate writes to a disk from our own driver on Linux

We have a driver that writes data directly to disk sectors by sending BIOs through the make_request_fn of another block device at the block layer.
But somehow the data is not written to the disk immediately; when I reboot the machine, the data I wrote before the reboot is gone.
The data is correctly persisted across a reboot ONLY if I do one of the following before rebooting:
Use a function to flush the block device (from a program) after writing
Drop the system cache with "sync; echo 1 > /proc/sys/vm/drop_caches" after writing
We also tried the following, but neither worked:
Adding a flush/FUA flag to the BIO
Calling blkdev_issue_flush()
This happens both on real machines and in virtual environments like VMware.
The OS is Ubuntu 14.04.3, with kernel 3.19 and an ext4 filesystem.
I am wondering if somebody can explain the reason and help me out.

How can I write a kernel module that can keep track of the SCSI commands sent to a particular device?

Given a SCSI device as input, I am trying to implement a kernel module that can:
Get the list of SCSI commands sent to that particular device and count how many times each command was issued.
How do I proceed to implement this?
I am a beginner at programming kernel modules; in fact, I have only written a "Hello, World" module so far.
I believe there is already debugging and logging support on the device driver side.
As a starting point, you should investigate the SCSI driver in the Linux kernel, which starts here:
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_module.c
And for the logging side, there is this one:
https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_logging.c
And it is always a good idea to read the manual first! Give it a chance here:
https://www.kernel.org/doc/Documentation/scsi/
If you can't get logging activated, you should come back and ask a more specific question. But I believe you have no need to write your own logger at all.
From http://www.theunixway.com/2013/10/ol56-enable-additional-scsi-logging-or.html
Enabling SCSI logging in the kernel:
sysctl -q -w dev.scsi.logging_level=<N>
or
echo <N> > /proc/sys/dev/scsi/logging_level
N is a bit field containing different categories of logging information. Refer to the documentation of the actual kernel version you use.
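As a sketch, assuming the bit layout from the kernel's scsi_logging.h, where mid-level queueing ("MLQUEUE") messages occupy bits 9-11, this logs every command the SCSI mid layer queues:
sysctl -w dev.scsi.logging_level=$((1 << 9)) # MLQUEUE at level 1
dmesg -w # watch the commands as they are issued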

Reasons for direct_io failure

I want to know under what circumstances a direct I/O transfer will fail.
I have the following three sub-queries for that. As per the "Understanding the Linux Kernel" book:
Linux offers a simple way to bypass the page cache: direct I/O transfers. In each I/O direct transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.
-- So to explain a failure, one needs to check whether the application has a self-caching feature or not? I am not sure how that can be done.
2. Furthermore, the book says: "When a self-caching application wishes to directly access a file, it opens the file specifying the O_DIRECT flag. While servicing the open() system call, the dentry_open() function checks whether the direct_IO method is implemented for the address_space object of the file being opened, and returns an error code in the opposite case".
-- Apart from this any other reason that can explain direct I/O failure ?
3. Will the command "dd if=/dev/zero of=myfile bs=1M count=1 oflag=direct" ever fail in Linux (assuming ample disk space is available)?
The underlying filesystem and block device must support the O_DIRECT flag. This command will fail because tmpfs doesn't support O_DIRECT:
dd if=/dev/zero of=/dev/shm/test bs=1M count=1 oflag=direct
The write size must be a multiple of the block size of the underlying device. This command will fail because 123 is not a multiple of 512:
dd if=/dev/zero of=myfile bs=123 count=1 oflag=direct
There are many reasons why direct I/O can fail.
So to explain failure one needs to check whether application has self caching feature or not?
You can't check this externally: you have to either deduce it from the source code or watch how the program uses resources as it runs (or disassemble the binary, I guess). This is more a property of how the program does its work than a feature you turn on with a call. It would be a dangerous assumption to think all programs that use O_DIRECT do self caching (probabilistically I'd say it's more likely, but you can't know for sure).
There are strict requirements for using O_DIRECT; they are mentioned in the open(2) man page (see the O_DIRECT section under NOTES).
With modern kernels the area being operated on must be aligned to, and its size must be a multiple of, the disk's block size. Failure to do this correctly may even result in silent fallback to buffered I/O.
Yes. For example, trying to use it on a filesystem (such as tmpfs) that doesn't support O_DIRECT will fail. I suppose it could also fail if the path down to the disk returns failures for some reason (e.g. the disk is dying and returns the error much sooner than it would under writeback).
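A quick way to check the required alignment for a given disk (/dev/sdX as a placeholder):
blockdev --getss /dev/sdX # logical sector size in bytes
cat /sys/block/sdX/queue/logical_block_size # the same value via sysfs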

Is there a way to check if a USB drive is stopped?

I've written a script to back up my server's HD every night. At the end of the script, I sync, wait a couple of minutes, sync again and then issue sg_start --stop to stop the device. The idea is to extend the lifetime of the device by switching the HD off after the ten minutes of incremental backup (desktop disks will survive several thousand on/off cycles but only a few hundred hours of continuous running).
This doesn't always work; I often find the drive still spinning the next morning. Is there a shell command which I can use to check that the drive has stopped (so I can issue the stop command again [EDIT2] or write a script to create a process list when the drive is running so I can debug this [/EDIT2])?
[EDIT] I've tried sg_inq (as suggested by the sg_start man page) but this command always returns 0.
I've tried hdparm but it always returns "drive state is: unknown" for USB drives (attached via /dev/sdX) and when trying to spin down the drive, I get "HDIO_DRIVE_CMD(setidle1) failed: Input/output error".
sdparm seems to support setting the idle timer on the drive (see the "Power condition" mode page), but the IDLE option has "Changeable: n", and I haven't found an option which tells me the drive's power state.
[EDIT2] Note: I can stop the drive with sg_start --stop from the console. That always works; it just doesn't always stay down until midnight. The server is in the basement (where it's nice and cool) and I'd rather have a way to check whether the drive is up or not from the warm living room :) If I had a command which told me the status of the drive, I could write a script to alert me when it spins up (check every minute) and then I could try to figure out what might be causing this.
If that matters: I'm using openSUSE 11.1.
When you say you've tried hdparm, you haven't said exactly what you tried. I have some USB hard drives in an enclosure, and some of the commands work for them while others don't; I guess it all depends on the transport mechanism.
hdparm -S 120 /dev/sda
Should tell the drive to spin itself down after ~10 minutes of inactivity (for -S values from 1 to 240, the unit is 5 seconds, so 120 means 600 seconds).
I'm guessing you have already tried this, but it's not obvious, and writing it as an answer may help a future reader.
Nothing accesses the drive but the backup script.
This is nice in theory, but I have found that on odd occasions it is not enough. There are lots of processes and tasks that can spin the drive up just by looking at it, if the lookup for some reason isn't served from the disk cache.
Common culprits are tools like updatedb scanning all mounts for files, and fam or gamin doing funky stuff to monitor disks for changes.
If in doubt, add a layer of certainty by mounting the device before executing the script, and unmounting it when you are done.
Seeing things that could cause a wakeup
lsof +D /mountpoint
You should probably parse the output of this before attempting to unmount, to be sure that nothing is still trying to use the mount.
You should probably also do a lazy unmount,
umount -l /mountpoint
so that if anything accesses the device between your lsof | grep and the umount call, the drive will still be unmounted and further reads will be stopped.
HAL and friends
It's also possible that HAL and friends are waking the drive up by probing for connect/removal state. I really hope they aren't, but they do that on some device types. It seems an unlikely cause, but I'll consider anything possible.
Try fun stuff like
lsof /dev/devicehere
and
lsof /dev/devicehere1
which appear to give a comprehensive list of everything accessing the handle either directly or indirectly.
You also need to check the mount options and mount with noatime, otherwise the kernel still updates file access times periodically; this may be the cause of your problem too.
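For example, assuming the backup filesystem is already mounted at /mountpoint:
mount -o remount,noatime /mountpoint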
If hdparm returns this:
drive state is: unknown
And smartctl returns this:
CHECK POWER MODE: incomplete response, ATA output registers missing
CHECK POWER MODE not implemented, ignoring -n option
Then it is not possible to obtain the USB drive's sleep status. I tried several creative things, like measuring USB power consumption and monitoring file access, but I found nothing that could verify the drive's state.
The only solution is to replace the USB SATA adapter (or enclosure) with one that has a different USB bridge chip. The one that did not work for me used the JMicron JMS567 (152d:0578 / 0x152d 0x0578), whose firmware could perhaps be updated, but I was not able to test that as the update tool needs an ARM CPU. Now I'm using two different adapters, both based on the ASMedia ASM1051E / ASM1053E / ASM1153 / ASM1153E, and they pass through all commands.
You can check the bridge chip of your USB SATA adapter / enclosure as follows:
lsusb -v | grep -E ': ID |idVendor|idProduct'
Which returns something like:
Bus 002 Device 003: ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge
idVendor 0x174c ASMedia Technology Inc.
idProduct 0x55aa ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge
I often find the drive still spinning the next morning.
Couldn't this just be because it was spun up again when the server e.g. wrote to a log file during the night? You might try sdparm, which can return status information on a drive. But I think it's better to just set the option in your BIOS that lets your HD automatically spin down after a period of inactivity; it's easier.
Have you tried stopping the drive with the following command?
eject -t /dev/yourHD
This works quite well for my USB hard drives.
If hdparm -C /dev/sda says "drive state is: unknown", then it's not supported for your drive and there is no way to tell.
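If the bridge chip does pass ATA commands through, smartctl can check the power state without waking the drive; a sketch (by default smartctl exits with status 2 when -n standby makes it skip the check):
smartctl -i -n standby /dev/sdX
echo $? # 2 suggests the drive was in standby and was left asleep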
Your additional questions (answered by others already):
What's using a drive?
lsof
fuser
triggers
How to force a drive to stay sleeping?
hdparm
umount
(Your device can still be woken when unmounted, but likely only by something you do: smartctl, blkid, etc.)
Related: you can also automatically pause the backup (or whatever) for an hour if your drives get too hot.
