Deleting Files Permanently and Securely on CentOS

I would like to know how to permanently and securely delete files on CentOS. The problem I'm having right now is that the filesystem is ext3, and when I thought of using srm, its documentation said something like:
"It should work on ext2, FAT-based file systems, and the BSD native file system. Ext3 users should be especially careful as it can be set to journal data as well, which is an obvious route to reconstructing information."
If I can't use shred or srm, and secure-delete is also not an option, I'm clueless about how to securely and permanently delete the data. The files I'm deleting are NOT encrypted.

Just use shred:
shred -v -n 1 -z -u /path/to/your/file
This shreds the given file by overwriting it first with random data and then with 0x00 (zeros); afterwards it removes the file ;) Happy shredding!
Note that ext3/ext4 (and all journaling file systems) could buffer both the random-data pass and the zero pass and write only the zeros to disk; this can be the case when you have a small file. For a small file use this:
shred -v -n 1 /path/to/your/file #overwrite with random data
sync #force a sync of the buffers to the disk
shred -v -n 0 -z -u /path/to/your/file #overwrite with zeros and remove the file
For ext3, 1 MB or more should be enough to force a write to the disk (but I'm not sure about that; it's a long time since I used ext3!). For ext4 there's a huge buffer (up to half a gig, more or less).

The srm readme says only that Ext3 users should be especially careful, not that srm definitely won't work on Ext3.
In particular, Ext3 does not enable data journaling by default, so in theory, srm should work basically to the extent that it was designed to work. You may want to take a look at this link for a good overview of the basic issues.
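To find out whether data journaling is actually switched on for a given mount, something like the following sketch might help. It assumes a Linux system where /proc/mounts is readable and where a non-default data=journal setting shows up in the option list; the function names and the choice of / as the mount point are only illustrative.

```shell
# Succeeds (exit status 0) if a comma-separated mount option string
# contains data=journal, i.e. file data is journaled as well as metadata.
has_data_journal() {
    case ",$1," in
        *,data=journal,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the options recorded in /proc/mounts for a mount point.
# Fields are: device, mount point, fs type, options, dump, pass.
mount_options() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { print opts }' /proc/mounts
}

# Example: check the root file system (an assumption; substitute the
# mount point that holds the file you want to delete).
if has_data_journal "$(mount_options /)"; then
    echo "data=journal is active: overwritten data may survive in the journal"
else
    echo "data=journal not found in the mount options"
fi
```

If the first message is printed, the srm/shred caveats about journaled data apply to that mount.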

Related

How to make file sparse?

If I have a big file containing many zeros, how can I efficiently make it a sparse file?
Is the only possibility to read the whole file (including all zeros, which may partially be stored sparse) and to rewrite it to a new file using seek to skip the zero areas?
Or is there a possibility to make this in an existing file (e.g. File.setSparse(long start, long end))?
I'm looking for a solution in Java or some Linux commands, Filesystem will be ext3 or similar.

A lot's changed in 8 years.
Fallocate
fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place,
without using extra disk space. The minimum size of the hole
depends on filesystem I/O block size (usually 4096 bytes).
Also, when using this option, --keep-size is implied. If no
range is specified by --offset and --length, then the entire
file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the
need for extra disk space.
See --punch-hole for a list of supported filesystems.
(That list:)
Supported for XFS (since Linux 2.6.38), ext4 (since Linux
3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).
tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
GNU cp
Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:
sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:
Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.
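As a rough illustration of the cp route, here is a small sketch that builds a file full of zeros in a scratch directory and makes a sparse copy of it. It assumes GNU cp and du; the helper name sparse_copy and the file names are made up for the demo, and how much space is actually saved depends on the file system.

```shell
# Copy SRC to DST, turning long runs of zero bytes into holes (GNU cp).
sparse_copy() {
    cp --sparse=always -- "$1" "$2"
}

# Demo in a scratch directory: 1 MiB of zeros followed by a text marker.
workdir=$(mktemp -d)
head -c 1048576 /dev/zero > "$workdir/dense.img"
printf 'end marker\n' >> "$workdir/dense.img"
sparse_copy "$workdir/dense.img" "$workdir/sparse.img"

# The contents are identical ...
cmp "$workdir/dense.img" "$workdir/sparse.img" && echo "contents match"
# ... but the allocated size reported by du may differ between the two.
du -k "$workdir/dense.img" "$workdir/sparse.img"

# In-place alternative from util-linux, where the file system supports it:
#   fallocate -d "$workdir/dense.img"
rm -rf "$workdir"
```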

Some filesystems on Linux / UNIX have the ability to "punch holes" into an existing file. See:
LKML posting about the feature
UNIX file truncation FAQ (search for F_FREESP)
It's not very portable and not done the same way across the board; as of right now, I believe Java's IO libraries do not provide an interface for this.
If hole punching is available either via fcntl(F_FREESP) or via any other mechanism, it should be significantly faster than a copy/seek loop.

I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied.
Making a file sparse would result in those sections being fragmented if they were ever re-used. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.

You can use truncate -s filesize filename in a Linux terminal to create a sparse file that has only metadata and no allocated data blocks.
NOTE: filesize is in bytes unless you add a suffix such as K, M or G.
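A quick way to see the effect, as a sketch assuming GNU truncate and stat (the helper name and file name are only for illustration). The size argument comes first, then the file name:

```shell
# Create (or resize) a file of the given size without writing any data.
create_sparse_file() {
    truncate -s "$1" -- "$2"
}

workdir=$(mktemp -d)
create_sparse_file 100M "$workdir/sparse.img"
# Apparent size in bytes versus the number of blocks actually allocated.
stat -c 'apparent size: %s bytes, allocated blocks: %b' "$workdir/sparse.img"
rm -rf "$workdir"
```

Note that this creates a new, empty sparse file; it does not make an existing file with real zero blocks sparse.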

According to this article, it seems there is currently no easy solution, except for using FIEMAP ioctl. However, I don't know how you can make "non sparse" zero blocks into "sparse" ones.

Preventing Linux from adding file into memory cache?

I hope you've all seen the wonderful site, Linux Ate My Ram. This is usually great, but it presents a problem for me. I have a secure file that I'm decrypting with gpg and then reading into memory to process. The unencrypted file is deleted a short time later, but I do NOT want that decrypted file to be saved in Linux's in-memory file cache.
Is there a way to explicitly prevent a file from being saved in Linux's cache?
Thanks!

Use gpg -d, which will cause GPG to output the file to STDOUT, so then you can have it all in memory.
Depending on how paranoid you are, you may want to use mlock as well.

If you really, really need gpg's output to be a file, you could put that file on a ramfs file system. The file's contents will only exist in non-swappable memory pages.
You can attach a ramfs file system to your tree by running (as root):
mount none /your/mnt/point -t ramfs
You may have also heard of tmpfs. It's similar in that its files have no permanent storage and generally exist only in RAM. However, for your use, you want to avoid this file system because tmpfs files can be swapped to disk.

Sure. Shred the file as you delete it.
shred -u "$FILE"
Granted, it doesn't directly answer your question, but I still think it's a solution---whatever's living in the cache is now randomly-generated garbage. :-)

Creating a hard link to partial file contents in linux

I have fileA, which has a size of 200 MB. Now I want to create a hard link to fileA, named fileB, but I only want this file to point to the first 100 MB of fileA. So basically I need fileB to point to the same data blocks, but with a different length. It doesn't necessarily have to be a real hard link; it could be a virtual file proxying the contents.
I was thinking about duplicating the Inode somehow and changing the length, but I presume this could cause filesystem coherency issues (when datablocks move around etc.). Is there any linux tool or user-level system call that could let me do this?

You cannot do this directly in the manner you are talking about. There are filesystems that sort of do this. For example LessFS can do this. I also believe that the underlying structure of btrfs supports this, so if someone put the hooks in, it would be accessible at user level.
But, there is no way to do this with any of the ext filesystems, and I believe that it happens implicitly in LessFS.
There is a really ugly way of doing it in btrfs. If you make a snapshot, then truncate the snapshot file to 100M, you have effectively achieved your goal.
I believe this would also work with btrfs and a sufficiently recent version of cp: you could just copy the file with cp and then truncate one copy. The version of cp would have to have the --reflink option, and if you want to be extra sure about this, give the --reflink=always option.

Adding to @Omnifarious's answer:
What you're describing is not a hard link. A hard link is essentially a reference to an inode, by a path name. (A soft link is a reference to a path name, by a path name.) There is no mechanism to say, "I want this inode, but slightly different, only the first k blocks". A copy-on-write filesystem could do this for you under the covers. If you were using such a filesystem then you would simply say
cp fileA fileB && truncate -s 100M fileB
Of course, this also works on a non-copy-on-write filesystem, but it takes up an extra 100 MB instead of just the filesystem overhead.
Now, that said, you could still implement something like this easily on Linux with FUSE. You could implement a filesystem that mirrors some target directory but simply artificially sets a maximum length to files (at, say, 100 MB).
FUSE
FUSE Hello World
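For the copy-then-truncate approach, a minimal sketch might look like this. It assumes GNU cp, truncate and cmp; partial_copy is a made-up helper name, and the demo uses 2 MiB / 1 MiB stand-ins for the file sizes in the question.

```shell
# Create DST holding only the first SIZE bytes of SRC.
# --reflink=auto shares data blocks on copy-on-write file systems and
# falls back to an ordinary copy everywhere else (GNU cp).
partial_copy() {
    cp --reflink=auto -- "$1" "$2" && truncate -s "$3" -- "$2"
}

# Demo with small stand-ins for the files in the question.
workdir=$(mktemp -d)
head -c 2097152 /dev/urandom > "$workdir/fileA"
partial_copy "$workdir/fileA" "$workdir/fileB" 1M
# fileB should equal the first 1 MiB of fileA.
cmp -n 1048576 "$workdir/fileA" "$workdir/fileB" && echo "prefix matches"
rm -rf "$workdir"
```

Keep in mind that fileB is a copy taken at one point in time, not a live view: later changes to fileA will not show up in it.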

Maybe you can check ChunkFS. I think this is what you need (I didn't try it).

Shred: Doesn't work on Journaled FS?

Shred documentation says shred is "not guaranteed to be effective" (See bottom). So if I shred a document on my Ext3 filesystem or on a Raid, what happens? Do I shred part of the file? Does it sometimes shred the whole thing and sometimes not? Can it shred other stuff? Does it only shred the file header?
CAUTION: Note that shred relies on a very important assumption:
that the file system overwrites data in place. This is the
traditional way to do things, but many modern file system designs
do not satisfy this assumption. The following are examples of file
systems on which shred is not effective, or is not guaranteed to be
effective in all file system modes:
log-structured or journaled file systems, such as those supplied with AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
file systems that write redundant data and carry on even if some writes fail, such as RAID-based file systems
file systems that make snapshots, such as Network Appliance’s NFS server
file systems that cache in temporary locations, such as NFS version 3 clients
compressed file systems
In the case of ext3 file systems, the above disclaimer applies
(and shred is thus of limited effectiveness) only in data=journal
mode, which journals file data in addition to just metadata. In
both the data=ordered (default) and data=writeback modes, shred
works as usual. Ext3 journaling modes can be changed by adding
the data=something option to the mount options for a
particular file system in the /etc/fstab file, as documented in the
mount man page (man mount).

All shred does is overwrite, flush, check success, and repeat. It does absolutely nothing to find out whether overwriting a file actually results in the blocks which contained the original data being overwritten. This is because without knowing non-standard things about the underlying filesystem, it can't.
So, journaling filesystems won't overwrite the original blocks in place, because that would stop them recovering cleanly from errors where the change is half-written. If data is journaled, then each pass of shred might be written to a new location on disk, in which case nothing is shredded.
RAID filesystems (depending on the RAID mode) might not overwrite all of the copies of the original blocks. If there's redundancy, you might shred one disk but not the other(s), or you might find that different passes have affected different disks such that each disk is partly shredded.
On any filesystem, the disk hardware itself might just so happen to detect an error (or, in the case of flash, apply wear-leveling even without an error) and remap the logical block to a different physical block, such that the original is marked faulty (or unused) but never overwritten.
Compressed filesystems might not overwrite the original blocks, because the data with which shred overwrites is either random or extremely compressible on each pass, and either one might cause the file to radically change its compressed size and hence be relocated. NTFS stores small files in the MFT, and when shred rounds up the filesize to a multiple of one block, its first "overwrite" will typically cause the file to be relocated out to a new location, which will then be pointlessly shredded leaving the little MFT slot untouched.
Shred can't detect any of these conditions (unless you have a special implementation which directly addresses your fs and block driver - I don't know whether any such things actually exist). That's why it's more reliable when used on a whole disk than on a filesystem.
Shred never shreds "other stuff" in the sense of other files. In some of the cases above it shreds previously-unallocated blocks instead of the blocks which contain your data. It also doesn't shred any metadata in the filesystem (which I guess is what you mean by "file header"). The -u option does attempt to overwrite the file name, by renaming to a new name of the same length and then shortening that one character at a time down to 1 char, prior to deleting the file. You can see this in action if you specify -v too.
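To watch the passes and the renaming described above, you could try shred on a throwaway file. This is only a sketch with a made-up file in a scratch directory, and it says nothing about whether the old blocks are really gone on your particular filesystem.

```shell
workdir=$(mktemp -d)
printf 'not actually secret\n' > "$workdir/demo.txt"

# One random pass (-n 1), a final pass of zeros (-z), then rename and
# unlink the file (-u). -v reports each pass and each rename on stderr.
shred -v -n 1 -z -u "$workdir/demo.txt"

[ -e "$workdir/demo.txt" ] || echo "demo.txt is gone"
# rmdir only succeeds if shred left nothing behind in the directory.
rmdir "$workdir"
```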

The other answers have already done a good job of explaining why shred may not be able to do its job properly.
This can be summarised as:
shred is only reliable on whole partitions, not on individual files
As explained in the other answers, if you shred a single file:
there is no guarantee the actual data is really overwritten, because the filesystem may send writes to the same file to different locations on disk
there is no guarantee the fs did not create copies of the data elsewhere
the fs might even decide to "optimize away" your writes, because you are writing the same file repeatedly (syncing is supposed to prevent this, but again: no guarantee)
But even if you know that your filesystem does not do any of the nasty things above, you also have to consider that many applications will automatically create copies of file data:
crash recovery files which word processors, editors (such as vim) etc. will write periodically
thumbnail/preview files in file managers (sometimes even for non-image files)
temporary files that many applications use
So, short of checking every single binary you use to work with your data, it might have been copied right, left & center without you knowing. The only realistic way is to always shred complete partitions (or disks).

The concern is that data might exist on more than one place on the disk. When the data exists in exactly one location, then shred can deterministically "erase" that information. However, file systems that journal or other advanced file systems may write your file's data in multiple locations, temporarily, on the disk. Shred -- after the fact -- has no way of knowing about this and has no way of knowing where the data may have been temporarily written to disk. Thus, it has no way of erasing or overwriting those disk sectors.
Imagine this: You write a file to disk on a journaled file system that journals not just metadata but also the file data. The file data is temporarily written to the journal, and then written to its final location. Now you use shred on the file. The final location where the data was written can be safely overwritten with shred. However, shred would have to have some way of guaranteeing that the sectors in the journal that temporarily contained your file's contents are also overwritten to be able to promise that your file is truly not recoverable. Imagine a file system where the journal is not even in a fixed location or of a fixed length.
If you are using shred, then you're trying to ensure that there is no possible way your data could be reconstructed. The authors of shred are being honest that there are some conditions beyond their control where they cannot make this guarantee.

What happens if there are too many files under a single directory in Linux?

If there are, say, 1,000,000 individual files (mostly 100k in size) in a single flat directory (no subdirectories), are there going to be any compromises in efficiency or other disadvantages?

ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the total size, in bytes, of the arguments (plus environment) that can be passed to a program's entry point. So, let's take 'rm', and the example "rm -rf *": the shell is going to turn '*' into a list of file names, which in turn become the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
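If you are already stuck with such a directory, the usual way around the limit is to not let the shell expand the glob at all. A sketch, assuming GNU find and xargs (the helper names and the scratch directory are just for the demo):

```shell
# Upper bound, in bytes, on the arguments plus environment of a new process.
getconf ARG_MAX 2>/dev/null || echo "getconf is not available here"

# Let find remove the files itself, so no argument list is built at all.
delete_with_find() {
    find "$1" -maxdepth 1 -type f -delete
}

# Or let xargs split the names into batches that each fit under the limit.
delete_with_xargs() {
    find "$1" -maxdepth 1 -type f -print0 | xargs -0 -r rm -f --
}

# Demo on a scratch directory holding a few thousand empty files.
workdir=$(mktemp -d)
( cd "$workdir" && seq 1 5000 | xargs touch )
delete_with_xargs "$workdir"
# Count what is left.
find "$workdir" -type f | wc -l
rmdir "$workdir"
```

The same find | xargs pattern works for any command, not just rm.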

My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D

Most distros use Ext3 by default, which can use hashed b-tree (HTree) indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with the partition for which you want to activate it. Run e2fsck only while that file system is unmounted.

When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
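A layout like the one above could be produced from an existing flat directory with something along these lines. This is only a sketch: bucket_by_prefix is a made-up name, it assumes GNU find, it skips names that are not longer than the prefix, and it will not cope with file names containing newlines.

```shell
# Move every regular file in DIR into a subdirectory named after the
# first N characters of its file name (N defaults to 3).
bucket_by_prefix() (
    dir=$1
    n=${2:-3}
    find "$dir" -maxdepth 1 -type f -print | while IFS= read -r path; do
        name=${path##*/}
        # Skip names too short to leave anything after the prefix.
        [ "${#name}" -gt "$n" ] || continue
        prefix=$(printf '%s' "$name" | cut -c "1-$n")
        mkdir -p "$dir/$prefix" || continue
        mv -- "$path" "$dir/$prefix/$name"
    done
)

# Demo on a scratch directory with the file names from the listing above.
workdir=$(mktemp -d)
touch "$workdir/aaavnj78t93ufjw4390" "$workdir/aaaz84390842092njk423" \
      "$workdir/abckhr89032423"
bucket_by_prefix "$workdir"
find "$workdir" -type f | sort
rm -rf "$workdir"
```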

The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?

Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
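You can check how close you are to that limit with df. A tiny sketch, assuming a df that supports the -i option (GNU and BusyBox both do); the helper name is made up.

```shell
# Show inode totals (instead of block usage) for the file system that
# holds the given path.
inode_usage() {
    df -i "$1"
}

# Example: the file system holding the current directory.
inode_usage .
```

On ext2/3/4 the total number of inodes is fixed when the file system is created, so a file system can run out of inodes while still having free space.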

I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches on the path):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read the file containing the filenames and read all the files (with limited parallelism; wait -n needs bash 4.3 or newer):
while read -r line; do
    # count running background jobs; at 10, wait for one to finish
    if test "$(jobs -r | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with 10x fanout, e.g. read the file once
        cat "$line" > /dev/null
    } &
done < filenames.txt
wait # wait for the remaining background jobs
Posting here in case anyone finds the specific work-around useful when working with too many files.
