Need ideas for running srm on a large number of files - linux

For security, I need to use srm (secure delete) rather than rm to delete some files: http://en.wikipedia.org/wiki/Srm_%28Unix%29
I currently have srm set up to run 3 passes over any data that I need to delete. The problem I'm having is that srm runs extremely slowly on large numbers of files. For example, there is a 150 GB directory I tried to delete, and I found that it had only deleted 10 GB over a week.
I know that srm will run slowly with multiple small files, but does directory depth matter as well? For most of the data I need to delete on a weekly basis, the actual files themselves are nested in various deep subdirectories. Would it help out if I flattened the directory structure before running srm?
Here are two workarounds I am looking at (maybe a combination of both), though I don't know how much they would help out:
Flatten all the directory structures before running srm. That way, all the files that need to be wiped are in the same target directory (a rough sketch of this is shown below).
Archive the entire directory before running srm. That way, the target file would be one big tar.gz file. Compressing the data will likely take a while, but not as long as the srm passes would have taken.
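For the first workaround, a minimal sketch of the flattening step (the paths are placeholders; it assumes GNU mv and the secure-delete srm, which accepts -r):
SRC=/data/to-wipe
FLAT=/data/to-wipe-flat
mkdir -p "$FLAT"
# GNU mv: -t sets the target directory; --backup=numbered keeps files whose
# names collide after flattening instead of silently overwriting them
find "$SRC" -type f -exec mv --backup=numbered -t "$FLAT" {} +
srm -r "$FLAT"    # wipe all the flattened files in a single srm run
rm -rf "$SRC"     # what is left behind is only an empty directory tree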
Does anybody have any other suggestions on what I could do? Some others have used shred as well, but the results were similar and we ended up switching over to srm.

Don't know much about srm, but it might be worth trying:
find $mydir -type f -exec srm {} \;
find $mydir -type d -exec srm {} \;
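If forking one srm per file adds to the slowness, a variant worth trying (a sketch, assuming an srm that accepts multiple file arguments, as the secure-delete implementation does):
# batch many files into each srm invocation instead of one process per file
find "$mydir" -depth -type f -exec srm {} +
# -depth visits children before parents, so the emptied directories
# can simply be removed afterwards
find "$mydir" -depth -type d -exec rmdir {} +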

Related

Can I list the files and directories on a logical volume

I have a logical volume, /dev/dm-0, 95% full. I want to delete some directories and files from it to free up space.
I have no idea how to list what files and dirs are on it. Google doesn't even have an answer it seems. I have to believe this is possible, but crazy how hard it is to find the answer.
Go to the directory and execute:
find . -type f
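If the goal is to free space, a sketch that goes one step further (assuming GNU find and coreutils; /var is only a placeholder for wherever /dev/dm-0 is actually mounted):
df /dev/dm-0                  # the last column shows the mount point, e.g. /var
# list the biggest files on that filesystem, without crossing into other mounts
find /var -xdev -type f -printf '%s %p\n' | sort -rn | head -50
# or per-directory totals
du -xh /var | sort -rh | head -20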

What is the best way to speed up a find command on a huge directory tree using GNU parallel?

I've been using GNU parallel for a while, mostly to grep large files or run the same command for various arguments when each command/arg instance is slow and needs to be spread out across cores/hosts.
One thing which would be great to do across multiple cores and hosts as well would be to find a file on a large directory subtree. For example, something like this:
find /some/path -name 'regex'
will take a very long time if /some/path contains many files and other directories with many files. I'm not sure if this is as easy to speed up. For example:
ls -R -1 /some/path | parallel --profile manyhosts --pipe egrep regex
Something like that comes to mind, but ls would be very slow to come up with the files to search. What's a good way, then, to speed up such a find?
If you have N hundred immediate subdirs, you can use:
parallel --gnu -n 10 find {} -name 'regex' ::: *
to run find in parallel on each of them, ten at a time.
Note however that listing a directory recursively like this is an IO bound task, and the speedup you can get will depend on the backing medium. On a hard disk drive, it'll probably just be slower (if testing, beware disk caching).
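As a usage note (a sketch; 'pattern' is a placeholder, and despite the question's wording find -name matches a shell glob, not a regex), the job count can be tuned and the split made explicit:
# one find per batch of 10 immediate subdirectories, one job per CPU core
parallel --gnu -j+0 -n 10 find {} -name 'pattern' ::: */
# entries sitting directly in the top-level directory are covered separately
find . -maxdepth 1 -name 'pattern'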

Linux - Recursively list all files from folder with performance

I want to list the absolute filenames of almost 80 thousand files, recursively from a set of folders, into a text file.
Does anybody know which command will give me the best performance?
I know that I can use commands like ls, tree, and ll, but which command will perform best?
Also note that these files are on a NAS, and the NAS is mapped using a symlink.
Simply find -type f -print0 for zero-delimited output. I don't think you can get significantly faster than that.
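Concretely, a sketch of writing the list to a file (the path is a placeholder; -L is only needed because the NAS is reached through a symlink, and with an absolute starting path the printed paths are absolute):
find -L /path/to/folder -type f > /tmp/filelist.txt
# zero-delimited output is safer if any filename may contain a newline
find -L /path/to/folder -type f -print0 > /tmp/filelist.0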

Can /tmp in Linux ever fill up?

I'm putting some files in /tmp on a web server that are being used by a web application for a limited amount of time. If the files get left in the server's /tmp after the user quits using the application, and this happens repeatedly, should I be concerned about the directory filling up? I read online that rebooting cleans out the /tmp directory, but this box doesn't get rebooted very much.
Tom
Yes, it will fill up. Consider implementing a cron job that will delete old files after a while.
Something like this should do the trick:
/usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
This will delete files that haven't been accessed in more than a day (that's what -atime tests; use -mtime instead if you want to go by modification time).
Or as a crontab entry:
# run five minutes after midnight, every day
5 0 * * * /usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
where /tmp/mydata is a subdirectory where your application stores its temporary files. (Simply deleting old files under /tmp would be a very bad idea, as someone else pointed out here.)
Look at the crontab and find man pages for the details. Don't go running scripts that delete files on your filesystem without understanding all the details - that's how bad things happen to good servers. :)
Of course, if you can just modify your application to delete temporary files when it's done with them, that would be a far better solution, generally.
You can't just blindly delete everything that hasn't been modified for a certain amount of time. A lot of programs store sockets in there, which never get modified but are still an integral part of the program working. Take for instance mysql from one of my servers:
srwxrwxrwx 1 mysql mysql 0 Sep 11 04:01 mysql.sock=
That's a valid, working "file" in /tmp. It just looks old because mysql hasn't been restarted in a while. Either limit your find with '-type f' or '-atime', or use one of the distro-provided tools others have mentioned.
The only thing you can write to without worrying it will fill up is /dev/null. Everything else will eventually run out of space if you keep dumping things in it.
One simple approach would be to have a cron job clean up all your /tmp files that are older than, say, a few days.
Yep, it will be linked to one of your disks/partitions and can fill up.
It gets deleted on a reboot.
When the user quits the application you should clean the files up after them.
In which language is your web application written? A lot of languages provide temp-file facilities:
C
python
php
...
Search in your language if there is such a feature.
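In shell terms the usual pattern looks roughly like this (a sketch; 'myapp' is just a placeholder prefix, and your web language will have its own equivalent, e.g. tmpfile() in C or the tempfile module in Python):
#!/bin/sh
# create the temp file safely and remove it when the script exits, even on error
tmpfile=$(mktemp /tmp/myapp.XXXXXX) || exit 1
trap 'rm -f "$tmpfile"' EXIT
# ... write to and read from "$tmpfile" here ...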
Just a warning: not all Linux installations clean the /tmp directory after each reboot.
Some Linux distros have a package that will clean up old files in /tmp for you. It isn't hard to implement your own, as mentioned above. One thing to look out for is long-running processes, especially "zombies", which are ones that have died but haven't finished cleaning up after themselves. If a process has a file open, just deleting it from /tmp won't actually reclaim its space - you have to kill the process or somehow coerce it to close the file. Many programs that write log or temporary files are designed to catch a signal (often SIGUSR1) and close and re-open any log or temporary files for that reason.
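If you suspect that is happening, one quick check (a sketch, assuming lsof is installed):
# files that are deleted but still held open (link count 0) keep using space
lsof +L1 | grep /tmp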
Many Linux distributions include something named 'tmpwatch', or similar which runs via cron and deletes things on a pre-defined gradient. Some are smart enough to go by the owner of the file .. stuff that is owned by daemon users gets cleaned out faster than stuff owned by regular users. Check on the mailing lists for your distro of choice to find out.
Still, you should have SNMP or some other kind of monitoring watching how much room is available; if it fills up, services like Apache aren't going to be happy. For instance, e-accelerator for PHP needs plenty of room, some mail scanners don't clean up properly, etc.

Maximum number of inodes in a directory? [closed]

Is there a maximum number of inodes in a single directory?
I have a directory of over 2 million files and can't get the ls command to work against that directory. So now I'm wondering if I've exceeded a limit on inodes in Linux. Is there a limit before a 2^64 numerical limit?
df -i should tell you the number of inodes used and free on the file system.
Try ls -U or ls -f.
ls, by default, sorts the files alphabetically. If you have 2 million files, that sort can take a long time. With ls -U (or perhaps ls -f), the file names are printed immediately, without sorting.
No. Inode limits are per-filesystem, and decided at filesystem creation time. You could be hitting another limit, or maybe 'ls' just doesn't perform that well.
Try this:
tune2fs -l /dev/DEVICE | grep -i inode
It should tell you all sorts of inode related info.
What you hit is an internal limit of ls. Here is an article which explains it quite well:
http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/
Maximum directory size is filesystem-dependent, and thus the exact limit varies. However, having very large directories is a bad practice.
You should consider making your directories smaller by sorting files into subdirectories. One common scheme is to use the first two characters for a first-level subdirectory, as follows:
${topdir}/aa/aardvark
${topdir}/ai/airplane
This works particularly well if using UUID, GUIDs or content hash values for naming.
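A rough sketch of sorting an existing flat directory into that layout (the path is a placeholder; names shorter than two characters, or clashing with a shard directory name, would need extra handling):
topdir=/path/to/bigdir
cd "$topdir" || exit 1
for f in *; do
    [ -f "$f" ] || continue                   # skip the shard directories themselves
    sub=$(printf '%s' "$f" | cut -c1-2)
    mkdir -p "$sub" && mv -- "$f" "$sub/"
done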
As noted by Rob Adams, ls is sorting the files before displaying them. Note that if you are using NFS, the NFS server will be sorting the directory before sending it, and 2 million entries may well take longer than the NFS timeout. That makes the directory unlistable via NFS, even with the -f flag.
This may be true for other network file systems as well.
While there's no enforced limit to the number of entries in a directory, it's good practice to keep the number of entries you anticipate within some bound.
Can you get a real count of the number of files? Does it fall very near a 2^n boundary? Could you simply be running out of RAM to hold all the file names?
I know that in Windows, at least, file system performance would drop dramatically as the number of files in the folder went up, but I thought that Linux didn't suffer from this issue, at least if you were using a command prompt. God help you if you try to get something like Nautilus to open a folder with that many files.
I'm also wondering where these files come from. Are you able to calculate file names programmatically? If that's the case, you might be able to write a small program to sort them into a number of sub-folders. Often listing the name of a specific file will grant you access where trying to look up the name will fail. For example, I have a folder in windows with about 85,000 files where this works.
If this technique is successful, you might try finding a way to make this sort permanent, even if it's just running this small program as a cron job. It'll work especially well if you can sort the files by date somewhere.
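If the files do carry usable dates, a cron-able sketch along those lines (placeholder path; assumes GNU date and filenames without embedded newlines):
srcdir=/path/to/bigdir
find "$srcdir" -maxdepth 1 -type f | while IFS= read -r f; do
    month=$(date -r "$f" +%Y-%m)              # the file's mtime, e.g. 2011-08
    mkdir -p "$srcdir/$month" && mv -- "$f" "$srcdir/$month/"
done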
Unless you are getting an error message, ls is working but very slowly. You can try looking at just the first ten files like this:
ls -f | head -10
If you're going to need to look at the file details for a while, you can put them in a file first. You probably want to send the output to a different directory than the one you are listing at the moment!
ls > ~/lots-of-files.txt
If you want to do something to the files, you can use xargs. If you decide to write a script of some kind to do the work, make sure that your script will process the list of files as a stream rather than all at once. Here's an example of moving all the files.
ls | xargs -I thefilename mv thefilename ~/some/other/directory
You could combine that with head to move a smaller number of the files.
ls | head -10000 | xargs -I x mv x /first/ten/thousand/files/go/here
You can probably combine ls | head into a shell script that will split up the files into a bunch of directories with a manageable number of files in each.
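One way such a splitting script might look (a sketch; the batch directory and size are arbitrary, and it assumes filenames without embedded newlines):
batch=0
count=0
ls -f | while IFS= read -r f; do
    [ -f "$f" ] || continue                   # skips . and .. and any directories
    if [ "$count" -eq 0 ]; then
        batch=$((batch + 1))
        mkdir -p "../batches/batch-$batch"
    fi
    mv -- "$f" "../batches/batch-$batch/"
    count=$(( (count + 1) % 10000 ))
done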
For NetBackup, the binaries that analyze the directories on clients perform some type of listing that times out because of the enormous quantity of files in every folder (about one million per folder, SAP work directory).
My solution was (as Charles Duffy wrote in this thread) to reorganize the folders into subfolders with fewer files.
Another option is find:
find . -name '*' -exec somecommand {} \;
{} is replaced with the path of each file found (the pattern must be quoted so the shell doesn't expand it first).
The advantage/disadvantage is that the files are processed one after each other.
find . -name '*' > ls.txt
would write all filenames into ls.txt
find . -name '*' -exec ls -l {} \; > ls.txt
would write the full ls -l information for each file into ls.txt
