how to delete 600 GBs of small files?

how to delete 600 GBs of small files? - linux

This was an interview question, they did not tell any information about the files, ie: extension, hidden files?, location (stored in single directory or a directory tree), so my first reaction to this question was:
rm -fr *
oh no, wait, should be:
rm -fr -- *
Then I realize that the above command would not remove hidden files successfully and quite frankly directories like . and .. might interfere, my second and final thought was a ShellScript that uses find.
find -depth -type f -delete
I'm not sure if this is the right way of doing it, I'm wondering if there is a better way of doing this task.

It's not as obvious as it seems:
http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html

Related

bash: get path from current directory given sub-directory name

Trying to write a script to clean up environment files after a resource is deleted. The problem is all the script is given as input is the name of the resource (this cannot be changed) with zero identifying information beyond that. How can I find the path of the directory the resource is sitting in?
The directory is set up a bit like the following, although much more extensive. All of these are directories, not files. There can be as many as 40+ directories to search, but the desired one is generally not more than 2-3 directories deep.
foo
aaa
aaa_green
aaa_blue
bbb
ccc
ccc_green
bar
ddd
eee
eee_green
eee_blue
fff
fff_green
fff_blue
fff_pink
I might be handed input like aaa_green or just ddd.
As an example, given eee_blue as input, I need to know eee_blue's path from the working directory so I can cd there and delete the directory. IE, I would expect to return bar/eee/eee_blue/ or bar/eee/, either is acceptable.
The "best" option I can see currently is to cd into the lowest level of each directory via multiple greps, get each's contents and look for a match, and when it does (eventually) match save that cd'ing as the path. This frankly sounds awful and inefficient.
The only other alternative method I could think of was a straight recursive grep, but I tested it and at 8 minutes it still hadn't finished running.
This script needs to run on both mac and linux, although in a desperate pinch I could go linux only.

The standard Unix tool for doing this sort of task is the find command. The GNU version of find has more extensive options than the POSIX specification (by quite a margin). The version on macOS Sierra (and Mac OS X) is similar to the GNU version. I found an online manual for OS X 10.9 at Apple find, but there's probably a better location somewhere.
It looks like you might want to run:
find . -name 'eee_blue'
which will print the names of matching files or directories, or perhaps:
find . -name 'eee_blue' -exec rm -fr {} +
which will run the rm -fr command on each name. You can run a custom script you create in place of rm -fr if you prefer; if the logic is complex, it's what I do.
Be extremely cautious before using rm -fr automatically!

Go to bottom most directory?

I'm working with a directory with a lot of nested folders like /path/to/project/users/me/tutorial
I found a neat way to navigate up the folders here:
https://superuser.com/questions/449687/using-cd-to-go-up-multiple-directory-levels
But I'm wondering how to go down them. This seems significantly more difficult, but a couple things about the directory structure help. Each directory only has another directory in it, or maybe a directory and a README.
The directory I'm looking for looks more like a traditional project and might have random directories and files in it (more than any of the other higher directories certainly).
Right now I'm working on a solution using uh.. recursive bash functions cd'ing into the only directory underneath until there are either 0 or 2+ directories to loop through. This doesn't work yet..
Am I overcomplicating this? I feel like there could be some sweet solution using find. Ideally I want to be able to type something like:
down path
where path is a top-level folder. And that will take me down to the bottom folder tutorial.

There is an environment variable named CDPATH. This variable is used by cd in the same manner that executables use PATH when searching for pathname.
For example, if you have the following directories:
/path/to/project/users/me
/path/to/project/users/me/tutorial
/path/to/project/users/him
/path/to/project/users/him/test
/path/to/project/users/her
/path/to/project/users/her/uat
/path/to/project/users/her/dev
/path/to/application
/path/to/application/conf
/path/to/application/bin
/path/to/application/share
export CDPATH=/path/to/project/users/me:/path/to/project/users/him:/path/to/project/users/her:/path/to/application
A simple command such as cd tutorial will search the above paths for tutorial.
Let's pretend /path/to/application has directories underneath namely, conf, bin, share. A simple cd conf will send you to /path/to/application/conf as long as none of the paths before it have conf directory. This behavior is similar to executables in PATH. The first occurrence always gets chosen

My attempt - this actually works now! I'm still afraid it could easily go infinite with symbolic links or some such.
Also, I have to run this like
. down
from within the first empty folder.
#!/bin/bash
function GoDownOnce {
Dirs=$(find ./ -maxdepth 1 -mindepth 1 -type d)
NumDirs=$(echo $Dirs | wc -w)
echo $Dirs
echo $NumDirs
if [ "$NumDirs" = "1" ]; then
cd $Dirs
GoDownOnce
fi
}
GoDownOnce

A friend also suggested this sweet one liner:
cd $(find . -type d -name tutorial)
Admittedly this isn't quite what I asked, but it gets the job done pretty well.

Deleting large no of folders in one go

I have a list of 1M folders/directory, which I need to delete on my system. What is the best possible way to do it ?
I am looking for best possible solution, which will not consume lot of time, as I a have some processes which will be waiting for its completion.
PS: I can put all folders name in a file, if required, or do it in batch, if we can not do it in one go.

Use the xargs tool. It will read all folder names from the file and call a command - in this case rmdir.
xargs rmdir < list_of_folders
If you are sure you can delete non-empty folders, use rm -r instead of rmdir.
I think this is about the fastest you can get. rmdir will act as fast as it can (simple OS call), and using xargs ensures that you do not create 1M separate processes.
You may exploit that there are no "nested" directories in your list.
By that, if you have three folder a/ a/b/ and a/c/, and b/ and c/ are the only entries in a/, then you can omit a/b/ and a/c/ and just call rm -r a/.
But it will not be worth checking that by ls, as ls will also cost time which you probably won't save.

The rm command is perfectly capable of handling this. Just give it the list of folders you need to delete (shell expansions can save you some time for that matter), and don't forget the -r switch.
Exemple using some common expansions:
rm -r folder_a src/dir_* app_{logs,src,bin}

cronjob to remove files older than 99 days

I have to make a cronjob to remove files older than 99 days in a particular directory but I'm not sure the file names were made by trustworthy Linux users. I must expect special characters, spaces, slash characters, and others.
Here is what I think could work:
find /path/to/files -mtime +99 -exec rm {}\;
But I suspect this will fail if there are special characters or if it finds a file that's read-only, (cron may not be run with superuser privileges). I need it to go on if it meets such files.

When you use -exec rm {} \;, you shouldn't have any problems with spaces, tabs, returns, or special characters because find calls the rm command directly and passes it the name of each file one at a time.
Directories won't' be removed with that command because you aren't passing it the -r parameter, and you probably don't want too. That could end up being a bit dangerous. You might also want to include the -f parameter to do a force in case you don't have write permission. Run the cron script as root, and you should be fine.
The only thing I'd worry about is that you might end up hitting a file that you don't want to remove, but has not been modified in the past 100 days. For example, the password to stop the autodestruct sequence at your work. Chances are that file hasn't been modified in the past 100 days, but once that autodestruct sequence starts, you wouldn't want the one to be blamed because the password was lost.
Okay, more reasonable might be applications that are used but rarely modified. Maybe someone's resume that hasn't been updated because they are holding a current job, etc.
So, be careful with your assumptions. Just because a file hasn't been modified in 100 days doesn't mean it isn't used. A better criteria (although still questionable) is whether the file has been accessed in the last 100 days. Maybe this as a final command:
find /path/to/files -atime +99 -type f -exec rm -f {}\;
One more thing...
Some find commands have a -delete parameter which can be used instead of the -exec rm parameter:
find /path/to/files -atime +99 -delete
That will delete both found directories and files.
One more small recommendation: For the first week, save the files found in a log file instead of removing them, and then examine the log file. This way, you make sure that you're not deleting something important. Once you're happy thet there's nothing in the log file you don't want to touch, you can remove those files. After a week, and you're satisfied that you're not going to delete anything important, you can revert the find command to do the delete for you.

If you run rm with the -f option, your file is going to be deleted regardless of whether you have write permission on the file or not (all that matters is the containing folder). So, either you can erase all the files in the folder, or none. Add also -r if you want to erase subfolders.
And I have to say it: be very careful! You're playing with fire ;) I suggest you debug with something less harmful likfe the file command.
You can test this out by creating a bunch of files like, e.g.:
touch {a,b,c,d,e,f}
And setting permissions as desired on each of them

You should use -execdir instead of -exec. Even better, read the full Security considerations for find chapter in the findutils manual.

Please, always use rm [opts] -- [files], this will save you from errors with files like -rf wiich would otherwise be parsed as options. When you provide file names, then end all options.

Is there a way to check if there are symbolic links pointing to a directory?

I have a folder on my server to which I had a number of symbolic links pointing. I've since created a new folder and I want to change all those symbolic links to point to the new folder. I'd considered replacing the original folder with a symlink to the new folder, but it seems that if I continued with that practice it could get very messy very fast.
What I've been doing is manually changing the symlinks to point to the new folder, but I may have missed a couple.
Is there a way to check if there are any symlinks pointing to a particular folder?

I'd use the find command.
find . -lname /particular/folder
That will recursively search the current directory for symlinks to /particular/folder. Note that it will only find absolute symlinks. A similar command can be used to search for all symlinks pointing at objects called "folder":
find . -lname '*folder'
From there you would need to weed out any false positives.

You can audit symlinks with the symlinks program written by Mark Lord -- it will scan an entire filesystem, normalize symlink paths to absolute form and print them to stdout.

There isn't really any direct way to check for such symlinks. Consider that you might have a filesystem that isn't mounted all the time (eg. an external USB drive), which could contain symlinks to another volume on the system.
You could do something with:
for a in `find / -type l`; do echo "$a -> `readlink $a`"; done | grep destfolder
I note that FreeBSD's find does not support the -lname option, which is why I ended up with the above.

find . -type l -printf '%p -> %l\n'

Apart from looking at all other folders if there are links pointing to the original folder, I don't think it is possible. If it is, I would be interested.

find / -lname 'fullyqualifiedpathoffile'

find /foldername -type l -exec ls -lad {} \;

For hardlinks, you can get the inode of your directory with one of the "ls" options (-i, I think).
Then a find with -inum will locate all common hardlinks.
For softlinks, you may have to do an ls -l on all files looking for the text after "->" and normalizing it to make sure it's an absolute path.

To any programmers looking here (cmdline tool questions probably should instead go to unix.stackexchange.com nowadays):
You should know that the Linux/BSD function fts_open() gives you an easy-to-use iterator for traversing all sub directory contents while also detecting such symlink recursions.
Most command line tools use this function to handle this case for them. Those that don't often have trouble with symlink recursions because doing this "by hand" is difficult (any anyone being aware of it should just use the above function instead).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string