how to efficiently find if a linux directory including subdirectories has at least 1 file - linux

In my project, various jobs are created as files in directories inside subdirectories.
But usually I find that the jobs are mostly in a few dirs and not in most of the others.
currently I use
find $DIR -type f | head -n 1
to know if the directory has at least 1 file, but this seems wasteful.
how to efficiently find if a linux directory including subdirectories has at least 1 file

Your code is already efficient, but perhaps the reason is not obvious. When you pipe the output of find to head -n 1 you probably assume that find lists all the files and then head discards everything after the first one. But that's not quite what head does.
When find lists the first file, head prints it and exits. The next time find writes to the pipe, it receives SIGPIPE because the read end has been closed, and the default signal handler for SIGPIPE terminates the program that receives it, so find stops running.
So the cost of your pipelined commands is only the cost of finding two files, not the cost of finding all files. For most obvious use cases this should be good enough.
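If you want to avoid even that second lookup, GNU find's -quit action (used in an answer further down) stops the search inside find itself; a minimal sketch, assuming GNU find:
# Print the first regular file found anywhere under $DIR, then stop.
first=$(find "$DIR" -type f -print -quit)
if [ -n "$first" ]; then
    echo "the tree under $DIR contains at least one file"
fi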

Try this
find -type f -printf '%h\n' | uniq
The find part finds all files but prints only the directory each one is in. The uniq part collapses consecutive duplicates.
Pitfall: like your original command, it breaks if a directory path contains a NEWLINE.
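If newlines in paths are a real concern, a NUL-separated variant should be safe, assuming GNU find and sort (the trailing tr is only there to make the output readable again):
find -type f -printf '%h\0' | sort -zu | tr '\0' '\n'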

This command finds the first subdirectory containing at least one file and then stops:
find . -mindepth 1 -type d -exec bash -c 'c=$(find {} -maxdepth 1 -type f -print -quit);test "x$c" != x' \; -print -quit
The outer find iterates over the subdirectories; the inner find looks for the first file in each one and stops as soon as it finds it, and the outer -print -quit stops the whole search at the first match.
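A more readable sketch of the same idea, assuming GNU find and bash:
# Walk the subdirectories and report the first one that directly contains a file.
find . -mindepth 1 -type d -print0 |
while IFS= read -r -d '' d; do
    if [ -n "$(find "$d" -maxdepth 1 -type f -print -quit)" ]; then
        printf '%s\n' "$d"
        break
    fi
done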

Related

Find command - How to cut short after finding first file

I am using the find command to check if a certain pattern of
file exists within a directory tree. (Note, anywhere down the tree)
Once the first file is found the checking can stop because the answer is "yes".
How can I stop "find" from continuing the unnecessary search for other files?
Limiting -maxdepth does not work for the obvious reason that I am checking
any where down the tree.
I tried -exec exit ; and -exec quit ;
Hoping there was a linux command to call via -exec that would stop processing.
Should I write a script (to call via -exec above) that kills the find process
but continues running my script?
Additional detail: I am calling find from a perl script.
I don't necessarily have to use 'find' if there are other tools.
I may have to resort to walking the dir-path via a longer perl script in which I can control
when to stop.
I also looked into the -prune option, but it seems to be valid only up front (globally),
and I can't change it in the middle of processing.
This was one instance of my find command that worked and returned all occurrences of the file pattern.
find /media/pk/mm2020A1/00.staging /media/pk/mm2020A1/x.folders -name hevc -o -name 'HEVC' -o -name '265'
It sounds like you want something along the lines of
find . -name '*.csv' | wc -l
and then ask whether that is -gt 0,
with the detail that we'd like to exit
early if possible, to conserve compute resources.
Well, here's a start:
find . -name '*.csv' | head -1
It doesn't exactly bail after finding the first match,
since there's a race condition,
but it keeps you from spending two minutes
recursing down a deep directory tree.
In particular, after receiving the first result head
will close() its stdin, so find won't be able
to write to its stdout, and it will soon exit.
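If you'd rather stop the search inside find itself instead of relying on the closed pipe, GNU find's -quit action does that; a hedged sketch:
# Print the first match and stop walking the tree immediately.
find . -name '*.csv' -print -quit
# Usable as a yes/no test:
[ -n "$(find . -name '*.csv' -print -quit)" ] && echo "found at least one csv"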
I don't know your business use case.
But you may find it convenient and performant
to record find . -ls | sort > files.txt
every now and again,
and have your script consult that file.
It typically takes less time to access those stored results
than to re-run find, that is, to once again
recurse through the directory trees.
Why? It's a random I/O versus sequential access story.
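For example, a small sketch of consulting such a cached listing (files.txt as recorded above) instead of re-walking the tree; the patterns are just illustrations:
grep -c '\.csv$' files.txt    # how many cached entries end in .csv
grep -m1 '\.csv$' files.txt   # first cached .csv entry, stop reading early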
You can exit earlier if you adopt this technique:
use Path::Class;
dir('.')->recurse( ...

Why cat command not working in script

I have the following script and it has an error. I am trying to merge all the files into one large file. From the command line the cat command works fine and the content is printed to the redirected file. From the script it works sometimes but not other times. I don't know why it's behaving abnormally. Please help.
#!/bin/bash
### For loop starts ###
for D in `find . -type d`
do
combo=`find $D -maxdepth 1 -type f -name "combo.txt"`
cat $combo >> bigcombo.tsv
done
Here is the output of bash -x app.sh
++ find . -type d
+ for D in '`find . -type d`'
++ find . -maxdepth 1 -type f -name combo.txt
+ combo=
+ cat
^C
UPDATE:
The following worked for me. There was an issue with the path. I still don't know what the issue was, so an answer is welcome.
#!/bin/bash
### For loop starts ###
rm -rf bigcombo.tsv
for D in `find . -type d`
do
psi=`find $D -maxdepth 1 -type f -name "*.psi_filtered"`
# This will give us only the directory path from find result i.e. removing filename.
directory=$(dirname "${psi}")
cat $directory"/selectedcombo.txt" >> bigcombo.tsv
done
The obvious problem is that you are attempting to cat a file which doesn't exist.
Secondary problems are related to efficiency and correctness.
Running two nested loops is best avoided, though splitting the action into two steps is merely inelegant here; the inner loop will only execute once, at most.
Capturing command results into variables is a common beginner antipattern; a variable which is only used once can often be avoided entirely, which keeps the shell's memory free of cruft (and coincidentally sidesteps the multiple problems with missing quoting - a variable which contains a file or directory name should basically always be interpolated in double quotes).
Redirection is better performed outside any containing loop;
rm file
while something; do
another thing >>file
done
will open, seek to the end of the file, write, and close the file as many times as the loop runs, whereas
while something; do
another thing
done >file
only performs the open, seek, and close actions once, and avoids having to clear the file before starting the loop. Though your script can be refactored to not have any loops at all;
find ./*/ -type f -name "*.psi_filtered" -execdir cat selectedcombo.txt \; > bigcombo.tsv
Depending on your problem, it might be an error for there to be directories which contain combo.txt but which do not contain any *.psi_filtered files. Perhaps you want to locate and examine these directories.
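A hedged sketch of locating those directories, assuming the combo.txt / *.psi_filtered layout described above:
# List directories that contain combo.txt but no *.psi_filtered file.
find . -type d | while IFS= read -r d; do
    if [ -e "$d/combo.txt" ] && ! ls "$d"/*.psi_filtered >/dev/null 2>&1; then
        printf '%s\n' "$d"
    fi
done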

Find and sort files by date modified

I know that there are many answers to this question online. However, I would like to know if this alternate solution would work:
ls -lt `find . -name "*.jpg" -print | head -10`
I'm aware of course that this will only give me the first 10 results. The reason I'm asking is because I'm not sure whether the ls is executing separately for each result of find or not. Thanks
In your solution:
- the ls will be executed after the find is evaluated
- it is likely that find will yield too many results for ls to process, in which case you might want to look at the xargs command
This should work better:
find . -type f -print0 | xargs -0 stat -f"%m %Sm %N" | sort -rn
The three parts of the command do this:
- find all files and print their path
- use xargs to process the (long) list of files and print out the modification unixtime, human readable time, and filename for each file
- sort the resulting list in reverse numerical order
The main trick is to add the numerical unixtime when the files were last modified to the beginning of the lines, and then sort them.
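Note that stat -f "..." is the BSD/macOS syntax; on GNU/Linux, assuming GNU stat is installed, a rough equivalent sketch would be:
find . -type f -print0 | xargs -0 stat -c '%Y %y %n' | sort -rn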

Create a bash script to delete folders which do not contain a certain filetype

I have recently run into a problem.
I used a utility to move all my music files into directories based on tags. This left a LOT of almost empty folders. The folders, in general, contain a thumbs.db file or some sort of image for album art. The mp3s have the correct album art in their new directories, so the old ones are okay to delete.
Basically, I need to find any directories within D:/Music/ that:
-Do not have any subdirectories
-Do not contain any mp3 files
And then delete them.
I figured this would be easier to do in a shell script or bash script or whatever else the linux/unix world offers than in Windows 8.1 (HAHA).
Any suggestions? I'm not very experienced writing scripts like this.
This should get you started
find /music -mindepth 1 -type d |
while IFS= read -r dt
do
    find "$dt" -mindepth 1 -type d | read && continue
    find "$dt" -iname '*.mp3' -type f | read && continue
    echo DELETE "$dt"
done
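Once the DELETE lines look right, a hedged final step would be to replace the echo with an actual removal, for example:
rm -r -- "$dt"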
Here's the short story...
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
find . -type d -print | sort | uniq > all-dirs.tmp
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
less dirs-to-be-deleted.tmp
cat dirs-to-be-deleted.tmp | xargs rm -rf
Note that you might have to run all the commands a few times (depending on your repository's directory depth) before you're done deleting all recursive empty directories...
And the long story goes...
You can approach this problem from two basic perspectives: either you find all directories, then iterate over each of them, checking whether it contains any mp3 file or any subdirectory, and if not, mark it for deletion. That will work, but on very large repositories you can expect a significant run time.
Another approach, which is in my sense much more interesting, is to build a list of directories NOT to be deleted, and subtract that list from the list of all directories. Let's work through the second strategy, one step at a time...
First of all, to find the path of all directories that contains mp3 files, you can simply do:
find . -name '*.mp3' -printf '%h\n' | sort | uniq
This means "find any file ending with .mp3, then print the path to its parent directory".
Now, I could certainly name at least ten different approaches to find directories that contain at least one subdirectory, but keeping the same strategy as above, we can easily get...
find . -type d -printf '%h\n' | sort | uniq
What this means is: "Find any directory, then print the path to its parent."
Both of these queries can be combined in a single invocation, producing a single list containing the paths of all directories NOT to be deleted. Let's redirect that list to a temporary file.
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
Let's similarly produce a file containing the paths of all directories, no matter if they are empty or not.
find . -type d -print | sort | uniq > all-dirs.tmp
So there, we have, on one side, the complete list of all directories, and on the other, the list of directories not to be deleted. What now? There are tons of strategies, but here's a very simple one:
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
Once you have that, well, review it, and if you are satisfied, then pipe it through xargs to rm to actually delete the directories.
cat dirs-to-be-deleted.tmp | xargs rm -rf
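If you don't want to re-run those steps by hand, a hedged wrapper along these lines repeats them (the same commands as above, just looped) until nothing is left to delete:
while true; do
    find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
    # -mindepth 1 keeps "." itself off the candidate list so the loop terminates.
    find . -mindepth 1 -type d -print | sort | uniq > all-dirs.tmp
    comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
    [ -s dirs-to-be-deleted.tmp ] || break
    xargs rm -rf < dirs-to-be-deleted.tmp
done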

grep files based on time stamp

This should be pretty simple, but I am not figuring it out. I have a large code base, more than 4GB, under Linux. A few header files and xml files are generated during the build (using gnu make). If it matters, the header files are generated based on the xml files.
I want to search for a keyword in the header files that were last modified after a given time instant (my compile start time), and similarly in the xml files, but as separate grep queries.
If I run it on all possible header or xml files, it takes a lot of time; I only need the ones that were auto-generated. Further, the search has to be recursive, since there are a lot of directories and sub-directories.
You could use the find command:
find . -mtime 0 -type f
prints a list of all files (-type f) in and below the current directory (.) that were modified in the last 24 hours (-mtime 0; use -mtime -2 for the last 48h, -mtime -3 for 72h, ...). Try
grep "pattern" $(find . -mtime 0 -type f)
To find 'pattern' in all files newer than some_file in the current directory and its sub-directories recursively:
find -newer some_file -type f -exec grep 'pattern' {} +
You could specify the timestamp directly in date -d format and use other find tests e.g., -name, -mmin.
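For instance, with GNU find you can give the timestamp directly to -newermt (the date below is just a placeholder for your compile start time):
find . -name '*.h' -newermt '2024-01-15 10:30:00' -type f -exec grep 'pattern' {} +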
The file list could also be generated by your build system if find is too slow.
More specific tools such as ack, etags, GCCSense might be used instead of grep.
Use this instead, because if find doesn't return any file, grep will sit waiting for input on stdin, halting the script.
find . -mtime 0 -type f | xargs grep "pattern"
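With GNU xargs you can also pass -r (--no-run-if-empty) so grep is skipped entirely when find prints nothing:
find . -mtime 0 -type f | xargs -r grep "pattern"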
