Shred: Doesn't work on Journaled FS? - linux

Shred documentation says shred is "not guaranteed to be effective" (See bottom). So if I shred a document on my Ext3 filesystem or on a Raid, what happens? Do I shred part of the file? Does it sometimes shred the whole thing and sometimes not? Can it shred other stuff? Does it only shred the file header?
CAUTION: Note that shred relies on a very important assumption:
that the file system overwrites data in place. This is the
traditional way to do things, but many modern file system designs
do not satisfy this assumption. The following are examples of file
systems on which shred is not effective, or is not guaranteed to be
effective in all file system modes:
log-structured or journaled file systems, such as those supplied with AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
file systems that write redundant data and carry on even if some writes fail, such as RAID-based file systems
file systems that make snapshots, such as Network Appliance’s NFS server
file systems that cache in temporary locations, such as NFS version 3 clients
compressed file systems
In the case of ext3 file systems, the above disclaimer applies
(and shred is thus of limited effectiveness) only in data=journal
mode, which journals file data in addition to just metadata. In
both the data=ordered (default) and data=writeback modes, shred
works as usual. Ext3 journaling modes can be changed by adding
the data=something option to the mount options for a
particular file system in the /etc/fstab file, as documented in the
mount man page (man mount).

All shred does is overwrite, flush, check success, and repeat. It does absolutely nothing to find out whether overwriting a file actually results in the blocks which contained the original data being overwritten. This is because without knowing non-standard things about the underlying filesystem, it can't.
So, journaling filesystems won't overwrite the original blocks in place, because that would stop them recovering cleanly from errors where the change is half-written. If data is journaled, then each pass of shred might be written to a new location on disk, in which case nothing is shredded.
RAID filesystems (depending on the RAID mode) might not overwrite all of the copies of the original blocks. If there's redundancy, you might shred one disk but not the other(s), or you might find that different passes have affected different disks such that each disk is partly shredded.
On any filesystem, the disk hardware itself might just so happen to detect an error (or, in the case of flash, apply wear-leveling even without an error) and remap the logical block to a different physical block, such that the original is marked faulty (or unused) but never overwritten.
Compressed filesystems might not overwrite the original blocks, because the data with which shred overwrites is either random or extremely compressible on each pass, and either one might cause the file to radically change its compressed size and hence be relocated. NTFS stores small files in the MFT, and when shred rounds up the filesize to a multiple of one block, its first "overwrite" will typically cause the file to be relocated out to a new location, which will then be pointlessly shredded leaving the little MFT slot untouched.
Shred can't detect any of these conditions (unless you have a special implementation which directly addresses your fs and block driver - I don't know whether any such things actually exist). That's why it's more reliable when used on a whole disk than on a filesystem.
Shred never shreds "other stuff" in the sense of other files. In some of the cases above it shreds previously-unallocated blocks instead of the blocks which contain your data. It also doesn't shred any metadata in the filesystem (which I guess is what you mean by "file header"). The -u option does attempt to overwrite the file name, by renaming to a new name of the same length and then shortening that one character at a time down to 1 char, prior to deleting the file. You can see this in action if you specify -v too.
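The overwrite-then-rename dance that -u performs can be sketched roughly. This is an illustrative Python sketch, not the real GNU implementation (which uses specific bit patterns, fdatasync per pass, and more careful error handling):

```python
import os

def shred_u(path, passes=3):
    """Crude sketch of shred -u: overwrite the file in place, sync,
    then rename the name down to one character before unlinking."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # overwrite with random data
            f.flush()
            os.fsync(f.fileno())        # push the pass past the page cache
    # obscure the name: rename to same-length junk, then shorten it
    d = os.path.dirname(path) or "."
    name = "0" * len(os.path.basename(path))
    while name:
        new = os.path.join(d, name)
        os.rename(path, new)
        path = new
        name = name[:-1]
    os.remove(path)
```

Whether the overwritten blocks are the original ones is, of course, exactly the filesystem-dependent question discussed above.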

The other answers have already done a good job of explaining why shred may not be able to do its job properly.
This can be summarised as:
shred is only reliable on whole partitions (or disks), not on individual files
As explained in the other answers, if you shred a single file:
there is no guarantee the actual data is really overwritten, because the filesystem may send writes to the same file to different locations on disk
there is no guarantee the fs did not create copies of the data elsewhere
the fs might even decide to "optimize away" your writes, because you are writing the same file repeatedly (syncing is supposed to prevent this, but again: no guarantee)
But even if you know that your filesystem does not do any of the nasty things above, you also have to consider that many applications will automatically create copies of file data:
crash recovery files which word processors, editors (such as vim) etc. will write periodically
thumbnail/preview files in file managers (sometimes even for non-image files)
temporary files that many applications use
So, short of checking every single binary you use to work with your data, it might have been copied right, left & center without you knowing. The only realistic way is to always shred complete partitions (or disks).

The concern is that data might exist on more than one place on the disk. When the data exists in exactly one location, then shred can deterministically "erase" that information. However, file systems that journal or other advanced file systems may write your file's data in multiple locations, temporarily, on the disk. Shred -- after the fact -- has no way of knowing about this and has no way of knowing where the data may have been temporarily written to disk. Thus, it has no way of erasing or overwriting those disk sectors.
Imagine this: You write a file to disk on a journaled file system that journals not just metadata but also the file data. The file data is temporarily written to the journal, and then written to its final location. Now you use shred on the file. The final location where the data was written can be safely overwritten with shred. However, shred would have to have some way of guaranteeing that the sectors in the journal that temporarily contained your file's contents are also overwritten to be able to promise that your file is truly not recoverable. Imagine a file system where the journal is not even in a fixed location or of a fixed length.
If you are using shred, then you're trying to ensure that there is no possible way your data could be reconstructed. The authors of shred are being honest that there are some conditions beyond their control where they cannot make this guarantee.


Ext4 on magnetic disk: Is it possible to process an arbitrary list of files in a seek-optimized manner?

I have a deduplicated store of a few million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed from its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks (randomized, I assume, by the hashed directory layout).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given they are stored on an ext4 partition on a magnetic disk (implying the use of linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don't care too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
compute smaller versions of all JPEG images to process them with a classifier
reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
extract the headlines of all Word documents
recompute all MD5 hashes to verify data integrity
While researching for this question, I learned of the FIBMAP ioctl command (e.g. mentioned here), which may be worth a shot, because the files will not be moved around and the results can be stored alongside the metadata. But I suppose it will only work as a sort criterion if the location of a file's inode correlates somewhat with the location of its contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
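Without FIBMAP (which requires root privileges), a cheap approximation of the sorting idea is to order the list by inode number. This is purely a heuristic, resting on the assumed inode/data correlation asked about above; a minimal Python sketch:

```python
import os

def process_in_inode_order(paths, process):
    """Process a file list sorted by inode number.
    Heuristic only: on ext4 an inode's position in the inode table
    determines its block group, which often (but not always)
    correlates with where the file's data blocks live."""
    for path in sorted(paths, key=lambda p: os.stat(p).st_ino):
        process(path)
```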
A file (especially a large one) is scattered across several blocks on the disk (see e.g. the figure on the ext2 Wikipedia page; it is still broadly relevant for ext4, even if the details differ). More importantly, the file could already be in the page cache (in which case no disk access is required at all). So "sorting the file list by disk location" usually does not make much sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
If the files are really small (only hundreds of bytes each), it is probable that using something else (e.g. SQLite, a real RDBMS like PostgreSQL, or gdbm ...) would be faster.
BTW, adding more RAM would enlarge the page cache, and thus improve the overall experience. Replacing your HDD with an SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up all your file system (e.g. in some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs) might slightly lower fragmentation, but not that much.
You might optimize your file system settings (block size, cluster size, etc... e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner.
If the list is not too long (otherwise, split it in chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would somehow prefill your page cache (and the kernel could reorder the required IO operations).
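The same hint can be issued from Python with os.posix_fadvise: POSIX_FADV_WILLNEED is essentially what readahead(2) does. A rough sketch, where the chunk size is an arbitrary guess that you would need to benchmark:

```python
import os

CHUNK = 500  # files per batch; tune by benchmarking

def prefetch_then_process(paths, process):
    """Hint the kernel to prefetch a batch of files, then process them.
    POSIX_FADV_WILLNEED queues asynchronous readahead, letting the
    kernel reorder the underlying I/O operations."""
    for i in range(0, len(paths), CHUNK):
        batch = paths[i:i + CHUNK]
        for p in batch:                      # pass 1: queue readahead
            fd = os.open(p, os.O_RDONLY)
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            finally:
                os.close(fd)
        for p in batch:                      # pass 2: data is (hopefully) cached
            process(p)
```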
(I don't know how effective that is in your case; you need to benchmark.)
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the CHS addressing seen by the kernel is "logical" addressing handled by the disk controller and no longer bears much relation to the physical geometry. Read about LBA, TCQ, NCQ (today the kernel has no direct influence on the actual mechanical movements of a hard-disk head). I/O scheduling mostly happens in the hard disk itself, not so much in the kernel.

Force rsync to compare local files byte by byte instead of checksum

I have written a Bash script to backup a folder. At the core of the script is an rsync instruction
rsync -abh --checksum /path/to/source /path/to/target
I am using --checksum because I neither want to rely on file size nor modification time to determine if the file in the source path needs to be backed up. However, most -- if not all -- of the time I run this script locally, i.e., with an external USB drive attached which contains the backup destination folder; no backup over network. Thus, there is no need for a delta transfer, since both files will be read and processed entirely by the same machine. Calculating the checksums actually slows things down in this case. It would be better if rsync would just diff the files when both are stored locally.
After reading the manpage I stumbled upon the --whole-file option which seems to avoid the costly checksum calculation. The manpage also states that this is the default if source and destination are local paths.
So I am thinking to change my rsync statement to
rsync -abh /path/to/source /path/to/target
Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up? I definitely do not want to rely on file size or modification times to decide if a backup should take place.
UPDATE
Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced. So blindly rsync'ing all files in the source folder, e.g., by supplying --ignore-times as suggested in the comments, is not an option. It would create too many duplicate files and waste storage space. Keep also in mind that I am trying to reduce backup time and workload on a local machine. Just backing up everything would defeat that purpose.
So my question could be rephrased as, is rsync capable of doing a file comparison on a byte by byte basis?
Question: is rsync capable of doing a file comparison on a byte by byte basis?
Strictly speaking, Yes:
It's a block by block comparison, but you can change the block size.
You could use --block-size=1 (but it would be unreasonably inefficient and inappropriate for basically every use case)
The block based rolling checksum is the default behavior over a network.
Use the --no-whole-file option to force this behavior locally. (see below)
Statement 1. Calculating the checksums actually slows things down in this case.
This is why it's off by default for local transfers.
Using the --checksum option forces an entire file read, as opposed to the default block-by-block delta-transfer checksum checking
Statement 2. Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up?
By default it will use size & modification time.
You can use a combination of --size-only, --(no-)ignore-times, --ignore-existing and
--checksum to modify this behavior.
Statement 3. I definitely do not want to rely on file size or modification times to decide if a backup should take place.
Then you need to use --ignore-times and/or --checksum
Statement 4. supplying --ignore-times as suggested in the comments, is not an option
Perhaps using --no-whole-file and --ignore-times is what you want then? This forces the use of the delta-transfer algorithm, but for every file regardless of timestamp or size.
You would (in my opinion) only use this combination of options if it was critical to avoid meaningless writes (and specifically the meaningless writes, not overall efficiency, since a delta-transfer is not actually more efficient for local files), and if you had reason to believe that files with identical modification stamps and byte sizes could indeed be different.
I fail to see how modification stamp and size in bytes is anything but a logical first step in identifying changed files.
If you compared the following two files:
File 1 (local) : File.bin - 79776451 bytes and modified on the 15 May 07:51
File 2 (remote): File.bin - 79776451 bytes and modified on the 15 May 07:51
The default behaviour is to skip these files. If you're not satisfied that the files should be skipped, and want them compared, you can force a block-by-block comparison and differential update of these files using --no-whole-file and --ignore-times
So the summary on this point is:
Use the default method for the most efficient backup and archive
Use --ignore-times and --no-whole-file to force delta-change (block by block checksum, transferring only differential data) if for some reason this is necessary
Use --checksum and --ignore-times to be completely paranoid and wasteful.
Statement 5. Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced
Yes, but this can work however you want it to; it doesn't necessarily mean a full backup every time a file is updated, and it certainly doesn't mean that a full transfer will take place at all.
You can configure rsync to:
Keep 1 or more versions of a file
Configure it with a --backup-dir to be a full incremental backup system.
Doing it this way doesn't waste space beyond what is required to retain the differential data. I can verify that in practice: there would not be nearly enough space on my backup drives for all of my previous versions to be full copies.
Some Supplementary Information
Why is Delta-transfer not more efficient than copying the whole file locally?
Because you're not tracking the changes to each of your files. If you actually have a delta file, you can merge just the changed bytes, but you need to know what those changed bytes are first. The only way you can know this is by reading the entire file
For example:
I modify the first byte of a 10MB file.
I use rsync with delta-transfer to sync this file
rsync immediately sees that the first byte (or a byte within the first block) has changed, and proceeds to update just that block (writing a temporary copy by default, or modifying the block directly with --inplace)
However, rsync doesn't know it was only the first byte that's changed. It will keep checksumming until the whole file is read
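What makes that whole-file scan cheap per byte is rsync's rolling weak checksum, which can slide the block window one byte at a time without recomputing from scratch. A simplified Python sketch of the idea (the structure of rsync's weak checksum, not its exact implementation):

```python
M = 1 << 16  # checksum halves are kept modulo 2^16

def weak_checksum(block):
    """rsync-style weak checksum over one block:
    a = plain byte sum, b = position-weighted byte sum."""
    a = b = 0
    n = len(block)
    for i, x in enumerate(block):
        a += x
        b += (n - i) * x
    return a % M, b % M

def roll(a, b, out_byte, in_byte, n):
    """Slide the n-byte window one position: drop out_byte (the byte
    leaving the window) and append in_byte, in O(1) per step."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b
```

Rolling the window over a file this way is what lets rsync test every byte offset against the receiver's block checksums without rereading each block.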
For all intents and purposes:
Consider rsync a tool that conditionally performs a --checksum based on whether or not the file timestamp or size has changed. Overriding this to --checksum is essentially equivalent to --no-whole-file and --ignore-times, since both will:
Operate on every file, regardless of time and size
Read every block of the file to determine which blocks to sync.
What's the benefit then?
The whole thing is a tradeoff between transfer bandwidth, and speed / overhead.
--checksum is a good way to only ever send differences over a network
--checksum while ignoring files with the same timestamp and size is a good way to both only send differences over a network, and also maximize the speed of the entire backup operation
Interestingly, it's probably much more efficient to use --checksum as a blanket option than it would be to force a delta-transfer for every file.
There is no way to do byte-by-byte comparison of files instead of checksum, the way you are expecting it.
The way rsync works is to create two processes, a sender and a receiver, that build a list of files and their metadata and decide between themselves which files need to be updated. This happens even for local files, but in that case the processes communicate over a pipe rather than a network socket. Once the list of changed files has been decided, the changes are sent either as deltas or as whole files.
Theoretically, one side could send whole files for the other to diff, but in practice this would be rather inefficient in many cases: the receiver would need to keep those files in memory in case it detects the need to update a file, or else the changed files would have to be re-sent. Neither option sounds very efficient.
There is a good overview about (theoretical) mechanics of rsync: https://rsync.samba.org/how-rsync-works.html

Can inode and crtime be used as a unique file identifier?

I have a file indexing database on Linux. Currently I use file path as an identifier.
But if a file is moved/renamed, its path is changed and I cannot match my DB record to the new file and have to delete/recreate the record. Even worse, if a directory is moved/renamed, then I have to delete/recreate records for all files and nested directories.
I would like to use inode number as a unique file identifier, but inode number can be reused if file is deleted and another file created.
So, I wonder whether I can use a pair of {inode,crtime} as a unique file identifier.
I hope to use i_crtime on ext4 and creation_time on NTFS.
In my limited testing (with ext4) inode and crtime do, indeed, remain unchanged when renaming or moving files or directories within the same file system.
So, the question is whether there are cases when inode or crtime of a file may change.
For example, can fsck or defragmentation or partition resizing change the inode or crtime of a file?
Interestingly, http://msdn.microsoft.com/en-us/library/aa363788%28VS.85%29.aspx says:
"In the NTFS file system, a file keeps the same file ID until it is deleted."
but also:
"In some cases, the file ID for a file can change over time."
So, what are those cases they mentioned?
Note that I studied similar questions:
How to determine the uniqueness of a file in linux?
Executing 'mv A B': Will the 'inode' be changed?
Best approach to detecting a move or rename to a file in Linux?
but they do not answer my question.
{device_nr,inode_nr} are a unique identifier for an inode within a system
moving a file to a different directory does not change its inode_nr
the linux inotify interface enables you to monitor changes to inodes (either files or directories)
Extra notes:
moving files across filesystems is handled differently (it is in fact copy + delete)
networked filesystems (or a mounted NTFS) cannot always guarantee the stability of inode numbers
Microsoft is not a unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored (except for NTFS's internals)
Extra text: the old Unix adage "everything is a file" should in fact be "everything is an inode". The inode carries all the meta-information about a file (or directory, or special file) except the name. The filename is in fact only a directory entry that happens to link to that particular inode. Moving a file means creating a new link to the same inode and deleting the old directory entry that linked to it.
The inode metadata can be obtained with the stat(), fstat(), and lstat() system calls.
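In Python, the {device_nr,inode_nr} pair mentioned above is just two fields of os.stat, and you can check that it survives a rename within the same filesystem:

```python
import os

def file_id(path):
    """Identify a file by its (device, inode) pair, which is stable
    across rename/move within one filesystem."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)
```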
The allocation and management of i-nodes in Unix is dependent upon the filesystem. So, for each filesystem, the answer may vary.
For the Ext3 filesystem (the most popular), i-nodes are reused, and thus cannot serve as unique file identifiers; nor does reuse occur according to any predictable pattern.
In Ext3, i-nodes are tracked in a bit vector, each bit representing a single i-node number. When an i-node is freed, its bit is set to zero. When a new i-node is needed, the bit vector is searched for the first zero bit, and that i-node number (which may previously have been allocated to another file) is reused.
This may lead to the naive conclusion that the lowest-numbered available i-node will be the one reused. However, the Ext3 filesystem is complex and highly optimised, so no assumptions should be made about when and how i-node numbers are reused, even though they clearly will be.
From the source code for ialloc.c, where i-nodes are allocated:
There are two policies for allocating an inode. If the new inode is a
directory, then a forward search is made for a block group with both
free space and a low directory-to-inode ratio; if that fails, then of
the groups with above-average free space, that group with the fewest
directories already is chosen. For other inodes, search forward from
the parent directory's block group to find a free inode.
The source code that manages this for Ext3 is called ialloc and the definitive version is here: https://github.com/torvalds/linux/blob/master/fs/ext3/ialloc.c
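The reuse itself boils down to a first-zero-bit scan over a block group's inode bitmap. A toy Python sketch of just that scan (illustrative only, nothing like the kernel's actual allocator policy):

```python
def find_free_inode(bitmap):
    """First-zero-bit scan over an inode bitmap (bytes, bit = 1 means
    allocated), illustrating how a freed inode number gets picked up
    again. Returns the inode index, or -1 if the group is full."""
    for byte_idx, byte in enumerate(bitmap):
        if byte != 0xFF:                      # at least one free bit here
            for bit in range(8):
                if not (byte >> bit) & 1:
                    return byte_idx * 8 + bit
    return -1                                 # block group is full
```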
I guess the DB application would need to consider the case where the file is restored from backup, which would preserve the file's crtime but not its inode number.

How to estimate a file size from header's sector start address?

Suppose I have a deleted file in the unallocated space on a Linux partition and I want to retrieve it.
Suppose I can get the start address of the file by examining the header.
Is there a way by which I can estimate the number of blocks to be analyzed from there (this depends on the size of the image)?
In general, Linux/Unix does not support recovering deleted files: if a file is deleted, it is meant to be gone. This is also good for security: one user should not be able to recover data from a file that was deleted by another user by creating a huge empty file spanning almost all free space.
Some filesystems even support so called secure delete - that is, they can automatically wipe file blocks on delete (but this is not common).
You can try to write a utility which opens the whole partition that your filesystem is mounted on (say, /dev/sda2) as one huge file, reads it, and scans for remnants of your original data. But if the file was fragmented (which is highly likely), the chances are small that you will recover much of the data in usable form.
Having said all that, there are some utilities which try to be a bit smarter than a simple scan and can attempt to undelete your files on Linux, such as extundelete. It may work for you, but success is never guaranteed. Of course, you must be root to use it.
And finally, if you want to be able to recover anything from that filesystem, you should unmount it right now, and take a backup of it using dd (or pipe dd through gzip to save space).

File backup in linux with advisory locks

How do backup programs make sure they get a consistent copy of a file, when file locks in Linux are mostly advisory?
For example, if some other process does not respect file locks and writes to a file, how can I create a consistent copy of that file?
This is quite an interesting topic. The modern way seems to be to use a filesystem snapshot; another way is to use a block-device snapshot.
In any case, some kind of snapshot is the best solution. ZFS has snapshots (but is not available as a "first-class" filesystem under Linux), as does btrfs (which is quite new).
Alternatively, a LVM volume can have a block-level snapshot taken (which can then be mounted readonly in a temporary location while a backup is taken).
If you had mandatory file locks, then a backup program would disrupt normal operation of (for example) a database so that it was not able to work correctly. Moreover, unless there was a mechanism to atomically take a mandatory lock on every file in the filesystem, there would be no way to take a consistent backup (i.e. with every file as it was at the same moment).
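The advisory nature is easy to demonstrate: a descriptor holding an exclusive flock() cannot stop a plain write() from another descriptor that never asked for the lock. A small Python sketch:

```python
import fcntl
import os
import tempfile

def demo_advisory_lock():
    """Hold an exclusive flock() on a file while a second,
    non-cooperating file descriptor writes to it. Advisory locks only
    constrain processes that also call flock(); a plain write is
    never blocked by them."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as holder:
            fcntl.flock(holder, fcntl.LOCK_EX)    # exclusive advisory lock held
            with open(path, "w") as rogue:        # never asks for the lock...
                rogue.write("modified while locked")  # ...and succeeds anyway
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)
```

This is exactly why a backup program reading files directly cannot rely on locking for consistency, and why snapshots are the robust answer.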
