Shell: find files in a list under a directory


I have a list of about 1000 file names to search for under a directory and its subdirectories. There are hundreds of subdirectories containing more than 1,000,000 files. The following command runs find 1000 times:
cat filelist.txt | while read -r f; do find /dir -name "$f"; done
Is there a much faster way to do it?

If filelist.txt has a single filename per line:
find /dir | grep -f <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
(The -f option means that grep searches for all the patterns in the given file.)
Explanation of <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):
The <( ... ) is called a process substitution and is somewhat similar to $( ... ). The command above is equivalent to the following two commands (though process substitution is neater and possibly a little faster):
sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
The call to sed runs the commands s#^#/#, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.
s#^#/# means put a / before each filename. (The ^ means "start of line" in a regex.)
s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").
The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.
s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).
As an example:
$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile
$ sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$
Grep then uses each line of that output as a pattern when it is searching the output of find.
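To make the steps above concrete, here is a minimal sketch that wraps them in a function. The function name find_listed and the temporary pattern file are my own choices, not part of the answer. It also escapes the regex characters before adding the / and $ anchors, using a bracket expression without the \| alternation, and it extends the escaped set to \, ^ and $. Treat it as an illustration under those assumptions rather than a drop-in replacement.

```shell
# Hypothetical helper: list files under directory $1 whose name
# appears in the list file $2 (one plain file name per line).
find_listed() {
    dir=$1
    list=$2
    patterns=$(mktemp)
    # 1. escape the regex characters ] \ $ * . ^ [
    # 2. put a / in front of each name
    # 3. put a $ (end of line) after each name
    sed 's/[]\$*.^[]/\\&/g; s#^#/#; s/$/$/' "$list" > "$patterns"
    # grep keeps only the paths that end in /<name>
    find "$dir" -type f | grep -f "$patterns"
    rm -f "$patterns"
}
```

Usage would be something like find_listed /dir filelist.txt. The pattern file is written to a temporary file instead of a process substitution only so that the sketch also runs in shells without <( ... ).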

If filelist.txt is a plain list:
$ find /dir | grep -F -f filelist.txt
If filelist.txt is a pattern list:
$ find /dir | grep -f filelist.txt
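Note that grep -F matches the listed strings anywhere in the path, so a.txt in the list also selects ./a.txt.backup. If only exact file names should match, one possible alternative (not from the answer above, and the function name find_exact is made up) is to let awk compare the last path component against the list:

```shell
# Hypothetical helper: print paths under $1 whose last component
# is exactly one of the names in list file $2.
find_exact() {
    find "$1" -type f | awk -F/ '
        NR == FNR { want[$0]; next }   # first input: remember every listed name
        $NF in want                    # second input (stdin): print matching paths
    ' "$2" -
}
```

This still runs find only once; the list is held in an awk array, so each path costs one hash lookup.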

Using xargs(1) instead of the bash while loop can be a bit faster, like this:
xargs -a filelist.txt -I filename find /dir -name filename
Be careful if the file names in filelist.txt contain whitespace; read the second paragraph of the DESCRIPTION section of the xargs(1) man page about this problem.
An improvement is possible under some assumptions. For example, if a.txt is in filelist.txt and you know there is only one a.txt in /dir, you can tell find(1) to exit as soon as it finds that instance:
xargs -a filelist.txt -I filename find /dir -name filename -print -quit
Another solution: pre-process filelist.txt into a find(1) argument list like this, which reduces the work to a single find(1) invocation:
find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
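A rough sketch of how that pre-processing could be done in the shell itself follows. The function name find_any_name is hypothetical, and -o is used as the POSIX spelling of -or. Keep in mind that -name takes a glob pattern, so names containing *, ? or [ would need escaping first.

```shell
# Hypothetical helper: run a single find over $1 with one -name test
# per line of list file $2, joined with -o.
find_any_name() {
    dir=$1
    list=$2
    set --                                   # reuse "$@" as the argument list
    while IFS= read -r name; do
        [ -n "$name" ] || continue           # skip empty lines
        if [ "$#" -gt 0 ]; then
            set -- "$@" -o
        fi
        set -- "$@" -name "$name"
    done < "$list"
    [ "$#" -gt 0 ] || return 0               # empty list: nothing to search for
    find "$dir" \( "$@" \) -print
}
```

With about 1000 names the argument list stays far below typical command-line length limits, but find still evaluates up to 1000 -name tests per file.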

I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13000 files had failed to copy.
Neither of the answers did it for me so I did this:
cp file-list file-list2
find dir/ >> file-list2
sort file-list2 | uniq -u
Which resulted in a list of the 4 files I needed.
The idea is to combine the two file lists to determine the unique entries.
sort is used to make duplicate entries adjacent to each other, because uniq only compares adjacent lines.
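One caveat with uniq -u is that it also reports files that exist under dir/ but are not in the list. If only the entries missing from the directory are wanted, comm can do the comparison instead. This is a sketch under the assumption that the list contains paths in exactly the form find prints them; the function name not_copied is made up.

```shell
# Hypothetical helper: print the entries of list file $1 that
# find does not report under directory $2.
not_copied() {
    list=$1
    dir=$2
    sorted_list=$(mktemp)
    sorted_found=$(mktemp)
    LC_ALL=C sort "$list" > "$sorted_list"
    find "$dir" -type f | LC_ALL=C sort > "$sorted_found"
    # -2 drops lines found only in the second file,
    # -3 drops lines common to both, leaving lines only in the list
    LC_ALL=C comm -23 "$sorted_list" "$sorted_found"
    rm -f "$sorted_list" "$sorted_found"
}
```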

Related

Format xargs output to grep

I have a script that I'm trying to optimize with xargs. The current version uses find with -exec to call the command:
find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \;
after which I can pipe to grep with something like:
grep -z -P user\.md5\=\"$input_search_hash\"
to filter the results while keeping the whole output with -z.
I need the whole output returned from getfattr to be "preserved", per file, because I need the filename for which there is a matching extended attribute, which is then passed to sed to extract it. There are also cases where I have multiple grep commands in sequence if I need to search for files with multiple matches in the extended attributes. The problem is that the output of:
find -type f -iname "*.mp4" -print0 | xargs -0 getfattr -d --absolute-names
is not formatted in such a way that grep will filter in this way. This does work with the -exec method. Can I pass an additional option to xargs or pipe in some additional command that will format the output to make grep properly replicate the behaviour of -exec? I'm guessing I need some sort of line-break before feeding to grep like what -printf '\n' does in the -exec method. I would just use getfattr to "search" the extended attributes instead of needing to grep the output at all, but it has no way to do this by supplying an xattr name and value.
Example
The input comes from the find command, which is a list of video files in an arbitrary directory structure. The output of each getfattr command, for each file is such:
# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="10"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="1645645"
If I attempt to grep the output of find using the + method, say for a value of "10" on the quality, I will get results like this:
# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"
user.sha256="613d16b2a0270e2e5f81cfd58b1eacf710a65b82ce2dab49a1e415275440f429"
user.size="1645645"
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="15645"
All files that find locates are returned and the string to be searched from grep, in this example user.quality="10", is highlighted, but the other files test.mp4 and test2.mp4 still have the output printed post-grep. In other words, find may locate 1000 mp4 files of which maybe 20 have a user.quality="10" entry, but even applying grep to search for that string still returns 1000 filenames (after sed).
This does not happen when using \;. The only thing I would get out from grep would be:
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
This is the expected behaviour.

xargs vs find -exec
To me it seems like you want to use xargs instead of find -exec {} \; to speed things up.
Yes, xargs is faster than find -exec {} \;, not because it does the same work more efficiently, but because it does different work!
find -exec {} \; calls once for each file (getfattr file1, then getfattr file2, and so on).
xargs crams as many files into one call as possible (getfattr file1 file2 file3 ...).
The same behavior (and even more speedup) can be achieved with find -exec {} + -- no need to use xargs for that.
With xargs and find -exec {} + you lose control over the output format. There is only one call of getfattr so that program decides what to print between file1, file2 and so on. getfattr has no option to customize its output format.
No problem! You can ...
Parse getfattr's output
... pretty easily.
For starters, we assume that all path names are pretty normal. Spaces, *, and ? are ok though. For really unusual path names containing backslashes and linebreaks see the last section.
If you output only the relevant attribute using -n user.md5 instead of -d, then you know that the output (if any) for each file is always of the form
# file: path in a single line
user.md5=encoded value of the attribute
Files without the attribute user.md5 are not printed at all. They cause a warning on stderr which can be suppressed by 2> /dev/null.
Now, grep for matching attributes. Use grep -B1 to print the line above each match (i.e. the path) too. Then use sed -n or grep -o to extract the filenames.
find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p'
The above command prints the paths of all mp4 files having the attribute user.md5 with value $input_search_hash.
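If you need to keep using -d (for example to test several attributes), the blocks can also be parsed with awk instead of grep -B1. The following is only a sketch that relies on the "# file: " header format shown in the question; the function name files_with_attr is hypothetical.

```shell
# Hypothetical helper: read getfattr-style output on stdin and print the
# path of every "# file: ..." block that contains the exact line $1.
# Note: awk -v interprets backslash escapes in the value.
files_with_attr() {
    awk -v want="$1" '
        /^# file: / { path = substr($0, 9); next }   # remember the path of the current block
        $0 == want  { print path }                   # attribute line matches exactly
    '
}
```

It would be used at the end of the pipeline, for example find -type f -iname '*.mp4' -exec getfattr -d --absolute-names {} + 2> /dev/null | files_with_attr 'user.quality="10"'.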
Handling Unusual Filenames
At least my version (getfattr 2.4.48 by Andreas Gruenbacher) on Debian 10 always prints the file name in a single line. Linebreaks are encoded using \012 and backslashes are encoded using \134. Therefore, safe processing of those files is possible.
The above command works, but prints only the encoded file names. To get the actual filenames you have to extend the sed command or add another command to interpret octal escape sequences. For me, getfattr only escapes \n, \r and \\, thus sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' should be sufficient for printing. For further processing, you may want to use tr \\n \\0 | sed -z ... instead, such that filenames are separated by null bytes.
To test which characters are escaped for you, create a filename containing all allowed bytes and let getfattr print its name:
f=$(printf $(printf '\\%o' $(seq 1 255)) | tr -d /)
touch "$f"
setfattr -n user.md5 -v 123 "$f"
getfattr -n user.md5 "$f"
rm "$f"

using grep in single-line files to find the number of occurrences of a word/pattern

I have json files in the current directory, and subdirectories. All the files have a single line of content.
I want a list of all files that contain the word XYZ, and the number of times it occurs in each file.
I want to print the list according to the following format:
file_name pattern_occurence_times
It should look something like:
.\x1\x2\file1.json 3
.\x1\file3.json 2
The problem is that grep counts the NUMBER of lines containing XYZ, not the number of occurrences.
Since the whole content of the files is always contained in a single line, the count is always 1 (if the pattern occurs in the file).
I used this command for that:
find . -type f -name "*.json" -exec grep --files-with-match -i 'xyz' {} \; -exec grep -wci 'xyz' {} \;
I wrote a Python script that works, but I would like to know if there is any way of doing that using find and grep or any other command line tools.
Thanks

The classical approach to this problem is the pipeline grep -o regex file | wc -l. However, to execute a pipeline in find's -exec you have to run a shell (e.g. sh -c ... ). But all these things together will only print the number of matches, not the file names. Also, files with no matches have to be filtered out.
Because of all of this I think a single awk command would be preferable:
find ... -type f -exec awk '{$0=tolower($0); c+=gsub(/xyz/,"")}
END {if(c>0) print FILENAME " " c}' {} \;
Here the tolower($0) emulates grep's -i option. Make sure to write your search pattern xyz only in lowercase.
If you want to combine this with subsequent filters in find you can add else exit 1 at the end of the last awk block to continue (inside find) only with the printed files.
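For reference, here is the awk approach written out as a small function that can be tried on a test directory. The function name count_matches is made up, and the lower-cased copy is kept in a separate variable (line) rather than overwriting $0.

```shell
# Hypothetical helper: for every *.json file under $1, print
# "<file> <count>" where count is the number of case-insensitive
# occurrences of xyz; files without a match are not printed.
count_matches() {
    find "$1" -type f -name '*.json' -exec awk '
        { line = tolower($0); c += gsub(/xyz/, "", line) }   # gsub returns the number of replacements
        END { if (c > 0) print FILENAME " " c }
    ' {} \;
}
```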

Use the -o option of grep, e.g. in conjunction with wc, e.g.
find . -name "*.json" | while IFS= read -r f ; do
echo "$f : $(grep -ow XYZ "$f" | wc -l)"
done

How do I classify files in Linux server by their names?

How can I use the ls command and its options to list filenames that are repeated in different directories?

You can't use a single, basic ls command to do this. You'd have to use a combination of other POSIX/Unix/GNU utilities. For example, to find the duplicate filenames first:
find . -type f -exec basename {} \; | sort | uniq -d > dupes
This means find all the files (-type f) through the entire directory hierarchy in the current directory (.), and execute (-exec) the command basename (which strips the directory portion) on each found file ({}), with \; marking the end of the command. The names are then sorted (sort) and only the duplicated ones are printed (uniq -d). The result goes in the file dupes. Now you have the filenames that are duplicated, but you don't know what directory they are in. Use find again to find them. Using bash as your shell:
while read filename; do find . -name "$filename" -print; done < dupes
This means loop through (while) all contents of file dupes and read into the variable filename each line. For each line, execute find again and search for the specific -name of the $filename and print it out (-print, but it's implicit so this is redundant).
Truth be told you can combine these without using an intermediate file:
find . -type f -exec basename {} \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
If you're not familiar with it, the | operator means, execute the following command using the output of the previous command as the input to the following command. Example:
eje@EEWANCO-PC:~$ mkdir test
eje@EEWANCO-PC:~$ cd test
eje@EEWANCO-PC:~/test$ mkdir 1 2 3 4 5
eje@EEWANCO-PC:~/test$ mkdir 1/2 2/3
eje@EEWANCO-PC:~/test$ touch 1/0000 2/1111 3/2222 4/2222 5/0000 1/2/1111 2/3/4444
eje@EEWANCO-PC:~/test$ find . -type f -exec basename {} \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
./1/0000
./5/0000
./1/2/1111
./2/1111
./3/2222
./4/2222
Disclaimer: The requirement stated that the filenames were all numbers. While I have tried to design the code to handle filenames with spaces (and in tests on my system, it works), the code may break when it encounters special characters, newlines, nuls, or other unusual situations. Please note that the -exec parameter has special security considerations and should not be used by root over arbitrary user files. The simplified example provided is intended for illustrative and didactic purposes only. Please consult your man pages and relevant CERT advisories for full security implications.
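As a variation on the same idea (not part of the answer above), the grouping can be done in one pass with awk, so that find and basename are not started once per file. This is a sketch; the function name dupes_by_name is hypothetical and, like the commands above, it assumes filenames without newlines.

```shell
# Hypothetical helper: print every path under $1 whose file name
# occurs more than once in the tree.
dupes_by_name() {
    find "$1" -type f | awk -F/ '
        { count[$NF]++; paths[$NF] = paths[$NF] $0 "\n" }   # group paths by last component
        END {
            for (name in count)
                if (count[name] > 1)
                    printf "%s", paths[name]
        }
    '
}
```

The order in which awk walks the array is unspecified, so pipe the result through sort if a stable order matters.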

I have a function in my bash profile (bash 4.4) for duplicate files.
It is true that find is the correct tool.
I use find with the -print0 option, which separates the results with a null character instead of a newline (the default). This catches all files under the current directory and its subdirectories.
This ensures that the results are correct even if filenames contain special characters such as spaces or (in very rare cases) newlines. Instead of running find against find's own output, you can build an array and locate the duplicate files within it. Then you grep the whole array using the "duplicates" as the pattern.
So something like this works ok for my function:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") |uniq -d)
$ grep -e "$dupes" <(printf '%s\n' "${fn[@]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
This is a test:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
# find all files and load them in an array using null delimiter
$ printf '%s\n' "${fn[@]}" #print the array
./tmp/file7
./tmp/file14
./tmp/file11
./tmp/file8
./tmp/file9
./tmp/tmp2/file09 99
./tmp/tmp2/file14.txt
./tmp/tmp2/file15.txt
./tmp/tmp2/file$100
./tmp/tmp2/file14.txt.bak
./tmp/tmp2/file15.txt.bak
./tmp/file1
./tmp/file4
./file09 99
./file14
./file$100
./file1
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[@]##*/}") |uniq -d)
#Locate duplicate files
$ echo "$dupes"
\<file$100\>$ #Mind this one with special char $ in filename
\<file09 99\>$ #Mind also this one with spaces
\<file14\>$
\<file1\>$
#I have purposely enclosed the results between \<...\> to force grep to match whole words later, so that file1 does not match file1.txt or file11
$ grep -e "$dupes" <(printf '%s\n' "${fn[@]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
file$100 ==> ./file$100 #File with special char correctly captured
file$100 ==> ./tmp/tmp2/file$100
file09 99 ==> ./file09 99 #File with spaces in name also correctly captured
file09 99 ==> ./tmp/tmp2/file09 99
file1 ==> ./file1
file1 ==> ./tmp/file1
file14 ==> ./file14 #other files named file14 like file14.txt and file14.txt.bak not captured since they are not duplicates.
file14 ==> ./tmp/file14
Tips:
This one <(printf '\<%s\>$\n' "${fn[@]##*/}") uses process substitution on the basenames of the find results, obtained with bash's built-in parameter expansion.
LC_ALL=C is required when sorting so that the filenames are sorted correctly.
In bash versions before 4.4 , the readarray does not accept -d option (delimiter). In this case you can transform find results to an array with
while IFS= read -r -d '' res;do fn+=( "$res" );done < <(find.... -print0)

How can I use grep to get all the lines that contains string1 and string2 separated by space?

Line1: .................
Line2: #hello1 #hello2 #hello3
Line3: .................
Line4: .................
Line5: #hello1 #hello4 #hello3
Line6: #hello1 #hello2 #hello3
Line7: .................
I have files that look similar in terms of lines on one of my project directories. I want to get the counts of all the lines that contain #hello1 and #hello2. In this case I would get 2 as a result only for this file. However, I want to do this recursively.

The canonical way to "do something recursively" is to use the find command. If you want to find lines that have two words on them, a simple regex will do:
grep -lr '#hello1.*#hello2' .
The option -l instructs grep to show us only filenames rather than file content, and the option -r tells grep to traverse the filesystem recursively. The start of the search is the path at the end of the line. Once you have the list of files, you can parse that list using commands run by xargs.
For example, this will count all the lines in files matching the pattern you specified.
grep -lr '#hello1.*#hello2' . | xargs -n 1 wc -l
This uses xargs to run the wc command on each of the files listed by grep. You could probably also run this without the -n 1, unless you're dealing with many many thousands of files that would exceed your maximum command line length.
Or, if I'm interpreting your question correctly, the following will count just the patterns in those files.
grep -lr '#hello1.*#hello2' . | xargs -n 1 grep -Hc '#hello1.*#hello2'
This runs a similar grep to the one used to generate your recursive list of files, and presents the output with filename (-H) and count (-c).
But if you want complex rules like finding two patterns possibly on different lines in the file, then grep probably is not the optimal tool, unless you use multiple greps launched by find:
find /path/to/base -type f \
-exec grep -q '#hello1' {} \; \
-exec grep -q '#hello2' {} \; \
-print
(Lines split for easier reading.)
This is somewhat costly, as find needs to launch up to two children for each file. So another approach would be to use awk instead:
find /path/to/base -type f \
-exec awk '/#hello1/{a=1} /#hello2/{b=1} END{exit !(a&&b)}' {} \; \
-print
Alternately, if your shell is bash version 4 or above, you can avoid using find and use the bash option globstar:
$ shopt -s globstar
$ awk 'FNR==1{a=b=0} /#hello1/{a=1} /#hello2/{b=1} a&&b{print FILENAME; nextfile}' **/*
Note: none of this is tested.

If you only need the total number of matching lines and not the per-file counts,
then something along these lines:
find "$BASEDIRECTORY" -type f -print0 | xargs -0 grep -h PATTERN | wc -l

If you want to count lines containing #hello1 and #hello2 separated by space in a specific file you can:
$ grep -c '#hello1 #hello2' file
If you want to count in more than one file:
$ grep -c '#hello1 #hello2' file1 file2 ...
And if you want to get the grand total (-h suppresses the file name prefixes so that only the numbers are summed):
$ grep -ch '#hello1 #hello2' file1 file2 ... | paste -s -d+ - | bc
Of course, you can let your shell expand the file names. So, for example:
$ grep -ch '#hello1 #hello2' *.txt | paste -s -d+ - | bc
or so...

find . -type f | xargs -n 1 awk '/#hello1/ && /#hello2/{c++} END{print FILENAME, c+0}'
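A variant of the same awk test that avoids starting one awk process per file is sketched below. It uses find -exec ... {} + and keeps one counter per file name; the function name count_both is made up, and files without any matching line are simply not listed.

```shell
# Hypothetical helper: for each file under $1, print "<file> <count>"
# where count is the number of lines containing both #hello1 and #hello2
# (in either order).
count_both() {
    find "$1" -type f -exec awk '
        /#hello1/ && /#hello2/ { n[FILENAME]++ }   # one counter per input file
        END { for (f in n) print f, n[f] }
    ' {} +
}
```

For the sample file in the question this reports 2, since only Line2 and Line6 contain both strings.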

Search and replace entire files

I've seen numerous examples for replacing one string with another among multiple files but what I want to do is a bit different. Probably a lot simpler :)
Find all the files that match a certain string and replace them completely with the contents of a new file.
I have a find command that works
find /home/*/public_html -name "index.php" -exec grep "version:1.23" '{}' \; -print
This finds all the files I need to update.
Now how do I replace their entire content with the CONTENTS of /home/indexnew.txt (I could also name it /home/index.php)
I emphasize content because I don't want to change the name or ownership of the files I'm updating.

find ... | while read filename; do cat static_file > "$filename"; done
efficiency hint: use grep -q -- it will return "true" immediately when the first match is found, not having to read the entire file.
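Putting the two hints together, a possible sketch looks like this. The function name replace_matching and its three parameters are made up for illustration. Because cat writes into the existing file instead of replacing it, the name, owner and permissions of each index.php stay as they are.

```shell
# Hypothetical helper: under directory $1, overwrite the contents of every
# index.php that contains the fixed string $2 with the contents of file $3.
replace_matching() {
    dir=$1
    marker=$2
    new=$3
    # grep -q stops at the first match; the second -exec only runs
    # for files where grep succeeded
    find "$dir" -type f -name 'index.php' \
        -exec grep -qF -- "$marker" {} \; \
        -exec sh -c 'cat "$1" > "$2"' sh "$new" {} \;
}
```

For the question's setup the call would be along the lines of replace_matching /home 'version:1.23' /home/indexnew.txt, although that searches all of /home rather than only the public_html directories. It is worth trying on a copy first, since the overwrite cannot be undone.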

If you have a bunch of files you want to replace, and you can get all of their names using wildcards you can try piping output to the tee command:
cat my_file | tee /home/*/update.txt
This should look through all the directories in /home and write the text in my_file to update.txt in each of those directories.
Let me know if this helps or isn't what you want.

I think it is simpler to add -l to grep so that it lists the matching files directly, rather than running grep without -l and then using -print:
find /home/*/public_html -name "index.php" -exec grep -l "version:1.23" '{}' \; | xargs -I {} cp /home/index.php {}
Here is the description of the -l option:
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
