Deleting a large number of folders in one go - Linux

I have a list of 1M folders/directories which I need to delete on my system. What is the best possible way to do it?
I am looking for the best possible solution, one which will not consume a lot of time, as I have some processes that will be waiting for its completion.
PS: I can put all the folder names in a file if required, or do it in batches if we cannot do it in one go.

Use the xargs tool. It will read all folder names from the file and call a command - in this case rmdir.
xargs rmdir < list_of_folders
If you are sure you can delete non-empty folders, use rm -r instead of rmdir.
I think this is about the fastest you can get. rmdir will act as fast as it can (simple OS call), and using xargs ensures that you do not create 1M separate processes.
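If you prefer to work in batches, as the question's PS suggests, xargs can also cap the number of names handed to each invocation. A minimal sketch, assuming one folder name per line and no whitespace in the names:
# at most 1000 folder names per rmdir call
xargs -n 1000 rmdir < list_of_folders
# or, for non-empty folders
xargs -n 1000 rm -r < list_of_folders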
You may be able to exploit any "nesting" among the directories in your list.
That is, if you have three folders a/, a/b/ and a/c/, and b/ and c/ are the only entries in a/, then you can omit a/b/ and a/c/ and just call rm -r a/.
But it will not be worth checking that with ls, as ls will also cost time which you probably won't save.

The rm command is perfectly capable of handling this. Just give it the list of folders you need to delete (shell expansions can save you some time for that matter), and don't forget the -r switch.
Example using some common expansions:
rm -r folder_a src/dir_* app_{logs,src,bin}

Related

Bash Scripting with xargs to BACK UP files

I need to copy a file from multiple locations to the BACK UP directory by retaining its directory structure. For example, I have a file "a.txt" at the following locations /a/b/a.txt /a/c/a.txt a/d/a.txt a/e/a.txt, I now need to copy this file from multiple locations to the backup directory /tmp/backup. The end result should be:
When I list /tmp/backup/a, it should contain /b/a.txt, /c/a.txt, /d/a.txt & /e/a.txt.
For this, I had used the command: echo /a/*/a.txt | xargs -I {} -n 1 sudo cp --parent -vp {} /tmp/backup. This is throwing the error "cp: cannot stat '/a/b/a.txt /a/c/a.txt a/d/a.txt a/e/a.txt': No such file or directory"
The -I option is taking the complete output of echo as one value instead of splitting it into individual values (the way -n 1 would). If someone can help debug this issue, that would be very helpful, rather than providing an alternative command.
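The root cause is that echo prints all four paths on a single line, and with -I, xargs treats each whole input line as one replacement value, so {} becomes the entire string. A sketch of a fix that keeps the same cp invocation (using the full --parents spelling), assuming GNU printf and xargs, is to feed one path per line:
# one path per line, so -I {} substitutes each path individually
printf '%s\n' /a/*/a.txt | xargs -I {} sudo cp --parents -vp {} /tmp/backup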
Use rsync with the --relative (-R) option to keep (parts of) the source paths.
I've used a wildcard for the source to match your example command rather than the explicit list of directories mentioned in your question.
rsync -avR /a/*/a.txt /tmp/backup/
Do the backups need to be exactly the same as the originals? In most cases, I'd prefer a little compression. [tar](https://man7.org/linux/man-pages/man1/tar.1.html) does a great job of bundling things including the directory structure.
tar cvzf /path/to/backup/tarball.tgz /source/path/
tar can't update compressed archives, so you can skip the compression
tar uf /path/to/backup/tarball.tar /source/path/
This gives you versioning of a sort, as it only appends files that have changed, but keeps both the before and after versions.
If you have time and cycles and still want the compression, you can decompress before and recompress after.
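A rough sketch of that decompress/update/recompress cycle, using the same illustrative paths as above:
gunzip /path/to/backup/tarball.tgz            # leaves tarball.tar
tar uf /path/to/backup/tarball.tar /source/path/
gzip /path/to/backup/tarball.tar              # produces tarball.tar.gz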

Absolute fastest way to recursively delete all files and folders in a given path. Linux

I am looking for the absolute fastest method of performing unlink and rmdir commands on a path containing millions of files and thousands of folders.
I have found the following perl one-liner, but this does not recurse and it also performs a stat before each unlink (which is unnecessary):
perl -e 'for(<*>){((stat)[9]<(unlink))}'
It's not going to make much difference either way - CPUs are fast, disks are slow. Most of the work - however you do it - will be the traverse and unlink system calls.
There's not really a way to speed that up (well, short of maybe just initialising/quickformatting your disk and starting over).
The fastest way to delete all files and folders recursively that I was able to find is:
perl -le 'use File::Find; find(sub{unlink if -f}, ".")' && rm -rf *
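Another option worth benchmarking, assuming GNU find, is to let find do the unlinking itself in a single depth-first traversal:
# -delete implies -depth, so files are removed before their parent directories;
# -mindepth 1 leaves the top-level directory itself in place
find /path/to/dir -mindepth 1 -delete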

How do I clear space on my main system drive on a Linux CentOS system?

Sorry if this sounds dumb, but I'm not sure what to do.
I've got an Amazon EC2 instance with a completely full ephemeral drive (the main drive with all the system files). Almost all the directories where I've installed things like Apache, MySQL, Sphinx, my applications, etc. are on a separate physical drive and have symlinks from the ephemeral drive. As far as I am aware, none of their data or logs write to the ephemeral drive, so I'm not sure what happened to the space.
Obviously lots of system stuff is still on the ephemeral drive, but I'm not sure how to clear things off to make space. My best guess is that Amazon filled the drive when it did some auto updates to the system. I'm trying to install some new packages and update all my system packages via yum, but the drive has no space.
What should I do?
du --max-depth=1 -h /
where / can be any directory, starting from the root, will show you the sizes in human-readable form (-h) without recursing further down.
Once you find something big that you want to remove you can do it via
rm <thing you want to remove>
this accepts shell expansion, so for instance to remove all mp3 files:
rm *.mp3
if it's a directory then you need to add -r
rm -r /dir/to/remove
To protect yourself, it would be advisable to add the -i switch to every rm call; this forces you to acknowledge that you want the files removed.
If there are a lot of read-only files you want to remove, then you could add the -f switch to force deletion; be very careful with this.
Be careful: rm accepts multiple parameters, so when you specify an absolute path, make sure to put it within quotes or not to have any spaces in it, especially if you execute it as root and most especially with the -r and -f options. (Otherwise you'll join the group of people who did rm -rf / some/directory/* and killed their / inadvertently.)
If you just want to look for big files and delete those then you could also use find
find / -type f -size +100M
would search for files only (-type f) with a size > 100MB (-size +100M)
subsequently you could use the same command to delete them.
find / -type f -size +100M -exec rm \{\} \;
-exec executes a program which gets passed the file or folder it has found ( \{\} ); the command needs to be terminated with \;
Don't forget you can add -i to rm to approve or reject each deletion.
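Putting those two together, the interactive variant would be:
find / -type f -size +100M -exec rm -i \{\} \;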
You can use the unix disk utility command du to see what's taking up all the space for starters.
This works great. It can take a few minutes on bigger drives (over a few hundred GB):
find /directory/to/scan/ -type f -exec du -a {} + | sort -n -r | less
The output will show the biggest files first. You can page through the results with the normal less commands: space bar (next page) and b (previous page).
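If you prefer to let du do the sizing directly, a variant (assuming GNU sort, whose -h option understands human-readable sizes) is:
# biggest files and directories first, in human-readable sizes
du -ah /directory/to/scan/ | sort -rh | head -n 20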

cronjob to remove files older than 99 days

I have to make a cronjob to remove files older than 99 days in a particular directory but I'm not sure the file names were made by trustworthy Linux users. I must expect special characters, spaces, slash characters, and others.
Here is what I think could work:
find /path/to/files -mtime +99 -exec rm {} \;
But I suspect this will fail if there are special characters or if it finds a file that's read-only (cron may not run with superuser privileges). I need it to carry on if it meets such files.
When you use -exec rm {} \;, you shouldn't have any problems with spaces, tabs, returns, or special characters because find calls the rm command directly and passes it the name of each file one at a time.
Directories won't be removed with that command because you aren't passing it the -r parameter, and you probably don't want to. That could end up being a bit dangerous. You might also want to include the -f parameter to force deletion in case you don't have write permission. Run the cron script as root, and you should be fine.
The only thing I'd worry about is that you might end up hitting a file that you don't want to remove but that has not been modified in the past 100 days. For example, the password to stop the auto-destruct sequence at your work. Chances are that file hasn't been modified in the past 100 days, but once that auto-destruct sequence starts, you wouldn't want to be the one blamed because the password was lost.
Okay, more realistic examples might be applications that are used but rarely modified, or maybe someone's resume that hasn't been updated because they are holding a current job, etc.
So, be careful with your assumptions. Just because a file hasn't been modified in 100 days doesn't mean it isn't used. A better criterion (although still questionable) is whether the file has been accessed in the last 100 days. Maybe this as a final command:
find /path/to/files -atime +99 -type f -exec rm -f {} \;
One more thing...
Some find commands have a -delete parameter which can be used instead of the -exec rm parameter:
find /path/to/files -atime +99 -delete
That will delete both found directories and files.
One more small recommendation: for the first week, save the files found to a log file instead of removing them, and then examine the log file. This way you make sure that you're not deleting something important. Once you're satisfied that there's nothing in the log you don't want to touch, you can switch the find command back to doing the delete for you.
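A minimal sketch of that "log first" week, reusing the same criteria (the log path here is just illustrative):
# list the candidates instead of deleting them, then review the log
find /path/to/files -atime +99 -type f >> /var/log/old-file-candidates.log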
If you run rm with the -f option, your files are going to be deleted regardless of whether you have write permission on them or not (all that matters is the containing folder). So you can either erase all the files in the folder, or none. Also add -r if you want to erase subfolders.
And I have to say it: be very careful! You're playing with fire ;) I suggest you debug with something less harmful, like the file command.
You can test this out by creating a bunch of files like, e.g.:
touch {a,b,c,d,e,f}
And setting permissions as desired on each of them
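For example (the modes here are arbitrary, just to get a mix for testing):
chmod 444 a b c    # read-only test files
chmod 644 d e f    # writable test files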
You should use -execdir instead of -exec. Even better, read the full Security considerations for find chapter in the findutils manual.
Please, always use rm [opts] -- [files]; this will save you from errors with files named like -rf, which would otherwise be parsed as options. The -- marks the end of the options, so everything after it is treated as a file name.
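For instance, to remove a file that is literally named -rf:
rm -- -rf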

Modifying files nested in tar archive

I am trying to do a grep and then a sed to search for specific strings inside files, which are inside multiple tars, all inside one master tar archive. Right now, I modify the files by
First extracting the master tar archive.
Then extracting all the tars inside it.
Then doing a recursive grep and then sed to replace a specific string in files.
Finally packaging everything again into tar archives, and all the archives inside the master archive.
Pretty tedious. How do I do this automatically using shell scripting?
There isn't going to be much option except automating the steps you outline, for the reasons demonstrated by the caveats in the answer by Kimvais.
tar modify operations
The tar command has some options to modify existing tar files. They are, however, not appropriate for your scenario for multiple reasons, one of them being that it is the nested tarballs that need editing rather than the master tarball. So, you will have to do the work longhand.
Assumptions
Are all the archives in the master archive extracted into the current directory or into a named/created sub-directory? That is, when you run tar -tf master.tar.gz, do you see:
subdir-1.23/tarball1.tar
subdir-1.23/tarball2.tar
...
or do you see:
tarball1.tar
tarball2.tar
(Note that nested tars should not themselves be gzipped if they are to be embedded in a bigger compressed tarball.)
master_repackager
Assuming you have the subdirectory notation, then you can do:
for master in "$@"
do
tmp=$(pwd)/xyz.$$
trap "rm -fr $tmp; exit 1" 0 1 2 3 13 15
cat $master |
(
mkdir $tmp
cd $tmp
tar -xf -
cd * # There is only one directory in the newly created one!
process_tarballs *
cd ..
tar -czf - * # There is only one directory down here
) > new.$master
rm -fr $tmp
trap 0
done
If you're working in a malicious environment, use something other than xyz.$$ for the directory name. However, this sort of repackaging is usually not done in a malicious environment, and the chosen name based on the process ID is sufficient to give everything a unique name. The use of tar -f - for input and output allows you to switch directories but still handle relative pathnames on the command line. There are likely other ways to handle that if you want. I also used cat to feed the input to the sub-shell so that the top-to-bottom flow is clear; technically, I could improve things by using ) > new.$master < $master at the end, but that hides some crucial information multiple lines later.
The trap commands make sure that (a) if the script is interrupted (signals HUP, INT, QUIT, PIPE or TERM), the temporary directory is removed and the exit status is 1 (not success) and (b) once the subdirectory is removed, the process can exit with a zero status.
You might need to check whether new.$master exists before overwriting it. You might need to check that the extract operation actually extracted stuff. You might need to check whether the sub-tarball processing actually worked. If the master tarball extracts into multiple sub-directories, you need to convert the 'cd *' line into some loop that iterates over the sub-directories it creates.
All these issues can be skipped if you know enough about the contents and nothing goes wrong.
process_tarballs
The second script is process_tarballs; it processes each of the tarballs on its command line in turn, extracting the file, making the substitutions, repackaging the result, etc. One advantage of using two scripts is that you can test the tarball processing separately from the bigger task of dealing with a tarball containing multiple tarballs. Again, life will be much easier if each of the sub-tarballs extracts into its own sub-directory; if any of them extracts into the current directory, make sure you create a new sub-directory for it.
for tarball in "$@"
do
# Extract $tarball into sub-directory
tar -xf $tarball
# Locate appropriate sub-directory.
(
cd $subdirectory
find . -type f -print0 | xargs -0 sed -i 's/name/alternative-name/g'
)
mv $tarball old.$tarball
tar -cf $tarball $subdirectory
rm -f old.$tarball
done
You should add traps to clean up here, too, so the script can be run in isolation from the master script above and still not leave any intermediate directories around. In the context of the outer script, you might not need to be so careful to preserve the old tarball before the new one is created (so rm -f $tarball instead of the move and remove commands), but treated in its own right, the script should be careful not to damage anything.
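A sketch of how the two pieces might be invoked, assuming the loops above are saved as executable scripts named master_repackager and process_tarballs, and that process_tarballs is somewhere on the PATH (here, the current directory) so the master script can find it:
chmod +x master_repackager process_tarballs
PATH=$PWD:$PATH master_repackager master.tar.gz   # writes new.master.tar.gz in the current directory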
Summary
What you're attempting is not trivial.
Debuggability splits the job into two scripts that can be tested independently.
Handling the corner cases is much easier when you know what is really in the files.
You can probably sed the actual tar stream, as tar does not do the compression itself.
e.g.
zcat archive.tar.gz|sed -e 's/foo/bar/g'|gzip > archive2.tar.gz
However, beware that this will also replace foo with bar in filenames, usernames and group names, and it ONLY works if foo and bar are of equal length.
