using grep in single-line files to find the number of occurrences of a word/pattern - linux

I have json files in the current directory, and subdirectories. All the files have a single line of content.
I want to a list of all files that contain the word XYZ, and the number of times it occurs in that file.
I want to print the list according to the following format:
file_name pattern_occurence_times
It should look something like:
.\x1\x2\file1.json 3
.\x1\file3.json 2
The problem is that grep counts the NUMBER of lines containing XYZ, not the number of occurrences.
Since the whole content of the files is always contained in a single line, the count is always 1 (if the pattern occurs in the file).
I used this command for that:
find . -type f -name "*.json" -exec grep --files-with-match -i 'xyz' {} \; -exec grep -wci 'xyz' {} \;
I wrote a python code, and it works, but I would like to know if there is any way of doing that using find and grep or any other command line tools.
Thanks

The classical approach to this problem is the pipeline grep -o regex file | wc -l. However, to execute a pipeline in find's -exec you have to run a shell (e.g. sh -c ... ). But all these things together will only print the number of matches, not the file names. Also, files with no matches have to be filtered out.
Because of all of this I think a single awk command would be preferable:
find ... -type f -exec awk '{$0=tolower($0); c+=gsub(/xyz/,"")}
END {if(c>0) print FILENAME " " c}' {} \;
Here the tolower($0) emulates grep's -i option. Make sure to write your search pattern xyz only in lowercase.
If you want to combine this with subsequent filters in find you can add else exit 1 at the end of the last awk block to continue (inside find) only with the printed files.

Use the -o option of grep, e.g. in conjunction with wc, e.g.
find . -name "*.json" | while read -r f ; do
echo $f : $(grep -ow XYZ "$f" | wc -l)
done

Related

How to read out a file line by line and for every line do a search with find and copy the search result to destination?

I hope you can help me with the following problem:
The Situation
I need to find files in various folders and copy them to another folder. The files and folders can contain white spaces and umlauts.
The filenames contain an ID and a string like:
"2022-01-11-02 super important file"
The filenames I need to find are collected in a textfile named ids.txt. This file only contains the IDs but not the whole filename as a string.
What I want to achieve:
I want to read out ids.txt line by line.
For every line in ids.txt I want to do a find search and copy cp the result to destination.
So far I tried:
for n in $(cat ids.txt); do find /home/alex/testzone/ -name "$n" -exec cp {} /home/alex/testzone/output \; ;
while read -r ids; do find /home/alex/testzone -name "$ids" -exec cp {} /home/alex/testzone/output \; ; done < ids.txt
The output folder remains empty. Not using -exec also gives no (search)results.
I was thinking that -name "$ids" is the root cause here. My files contain the ID + a String so I should search for names containing the ID plus a variable string (star)
As argument for -name I also tried "$ids *" "$ids"" *" and so on with no luck.
Is there an argument that I can use in conjunction with find instead of using the star in the -name argument?
Do you have any solution for me to automate this process in a bash script to read out ids.txt file, search the filenames and copy them over to specified folder?
In the end I would like to create a bash script that takes ids.txt and the search-folder and the output-folder as arguments like:
my-id-search.sh /home/alex/testzone/ids.txt /home/alex/testzone/ /home/alex/testzone/output
EDIT:
This is some example content of the ids.txt file where only ids are listed (not the whole filename):
2022-01-11-01
2022-01-11-02
2020-12-01-62
EDIT II:
Going on with the solution from tripleee:
#!/bin/bash
grep . $1 | while read -r id; do
echo "Der Suchbegriff lautet:"$id; echo;
find /home/alex/testzone -name "$id*" -exec cp {} /home/alex/testzone/ausgabe \;
done
In case my ids.txt file contains empty lines the -name "$id*" will be -name * which in turn finds all files and copies all files.
Trying to prevent empty line to be read does not seem to work. They should be filtered by the expression grep . $1 |. What am I doing wrong?
If your destination folder is always the same, the quickest and absolutely most elegant solution is to run a single find command to look for all of the files.
sed 's/.*/-o\n—name\n&*/' ids.txt |
xargs -I {} find -false {} -exec cp {} /home/alex/testzone/output +
The -false predicate is a bit of a hack to allow the list of actual predicates to start with -o (as in "or").
This could fail if ids.txt is too large to fit into a single xargs invocation, or if your sed does not understand \n to mean a literal newline.
(Here's a fix for the latter case:
xargs printf '-o\n-name\n%s*\n' <ids.txt |
...
Still the inherent problem with using xargs find like this is that xargs could split the list between -o and -name or between -name and the actual file name pattern if it needs to run more than one find command to process all the arguments.
A slightly hackish solution to that is to ensure that each pair is a single string, and then separately split them back out again:
xargs printf '-o_-name_%s*\n' <ids.txt |
xargs bash -c 'arr=("$#"); find -false ${arr[#]/-o_-name_/-o -name } -exec cp {} "$0"' /home/alex/testzone/ausgabe
where we temporarily hold the arguments in an array where each file name and its flags is a single item, and then replace the flags into separate tokens. This still won't work correctly if the file names you operate on contain literal shell metacharacters like * etc.)
A more mundane solution fixes your while read attempt by adding the missing wildcard in the -name argument. (I also took the liberty to rename the variable, since read will only read one argument at a time, so the variable name should be singular.)
while read -r id; do
find /home/alex/testzone -name "$id*" -exec cp {} /home/alex/testzone/output \;
done < ids.txt
Please try the following bash script copier.sh
#!/bin/bash
IFS=$'\n' # make newlines the only separator
set -f # disable globbing
file="files.txt" # name of file containing filenames
finish="finish" # destination directory
while read -r n ; do (
du -a | awk '{for(i=2;i<=NF;++i)printf $i" " ; print " "}' | grep $n | sed 's/ *$//g' | xargs -I '{}' cp '{}' $finish
);
done < $file
which copies recursively all the files named in files.txt from . and it's subfiles to ./finish
This new version works even if there are spaces in the directory names or file names.

Format xargs output to grep

I have a script that I'm trying to optimize with xargs. The current version uses find with -exec to call the command:
find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \;
after which I can pipe to grep with something like:
grep -z -P user\.md5\=\"$input_search_hash\"
to filter the results while keeping the whole output with -z.
I need the whole output returned from getfattr to be "preserved", per file, because I need the filename for which there is a matching extended attribute, which then is then passed to sed to extract it. There are also cases where I have multiple grep commands in sequence if I need to search for files with multiple matches in the extended attributes. The problem is that the output of:
find -type f -iname "*.mp4" -print0 | xargs -0 getfattr -d --absolute-names
is not formatted in such a way that grep will filter in this way. This does work with the -exec method. Can I pass an addional option to xargs or pipe in some additional command that will format the output to make grep properly replicate the behaviour of -exec? I'm guessing I need some sort of line-break before feeding to grep like what -printf '\n' does in the -exec method. I would just use getfattr to "search" the extended attributes instead of needing to grep the output at all, but it has no way to do this by suppling a xattr name and value.
Example
The input comes from the find command, which is a list of video files in an arbitrary directory structure. The output of each getfattr command, for each file is such:
# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="10"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="1645645"
If I attempt to grep the output of find using the + method, say for a value of "10" on the quality, I will get results like this:
# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"
user.sha256="613d16b2a0270e2e5f81cfd58b1eacf710a65b82ce2dab49a1e415275440f429"
user.size="1645645"
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="15645"
All files that find locates are returned and the string to be searched from grep, in this example user.quality="10", is highlighted, but the other files test.mp4 and test2.mp4 still have the output printed post-grep. In other words, find may locate 1000 mp4 files of which maybe 20 have a user.quality="10" entry, but even applying grep to search for that string still returns 1000 filenames (after sed).
This does not happen when using \;. The only thing I would get out from grep would be:
# file: /path/to/file/test.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
This is the expected behaviour.
xargs vs find -exec
To me it seems like you want to use xargs instead of find -exec {} \; to speed things up.
Yes, xargs is faster than find -exec {} \;, not because it does the same work more efficiently, but because it does different work!
find -exec {} \; calls once for each file (getfattr file1, then getfattr file2, and so on).
xargs crams as many files into one call as possible (getfattr file1 file2 file3 ...).
The same behavior (and even more speedup) can be achieved with find -exec {} + -- no need to use xargs for that.
With xargs and find -exec {} + you loose control over the output format. There is only one call of getfattr so that program decides what to print between file1, file2 and so on. getfattr has no option to customize its output format.
No problem! You can ...
Parse getfattr's output
... pretty easily.
For starters, we assume that all path names are pretty normal. Spaces, *, and ? are ok though. For really unusual path names containing backslashes and linebreaks see the last section.
If you output only the relevant attribute using -n user.md5 instead of -d, then you know that the output (if any) for each file is always of the form
# file: path in a single line
user.md5=encoded value of the attribute
Files without the attribute user.md5 are not printed at all. They cause a warning on stderr which can be suppressed by 2> /dev/null.
Now, grep for matching attributes. Use grep -B1 to print the line above each match (i.e. the path) too. Then use sed -n or grep -o to extract the filenames.
find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p'
Above command prints the paths of all mp4 files having the attribute user.md5 with value $input_search_hash.
Handling Unusual Filenames
At least my version (getfattr 2.4.48 by Andreas Gruenbacher) on Debian 10 always prints the file name in a single line. Linebreaks are encoded using \012 and backslashes are encoded using \134. Therefore, safe processing of those files is possible.
Above command works, but prints only the encoded file names. To get the actual filenames you have to extend the sed command or add another command to interpret octal escape sequences. For me, getfattr only escapes \n, \r and \\, thus sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' should be sufficient for printing. For further processing, you may want to use tr \\n \\0 | sed -z ... instead, such that filenames are separated by null bytes.
To test which characters are escaped for you, create a filename containing all allowed bytes and let getfattr print its name:
f=$(printf $(printf '\\%o' $(seq 1 255)) | tr -d /)
touch "$f"
setfattr -n user.md5 -v 123 "$f"
getfattr -n user.md5 "$f"
rm "$f"

How can I use grep to get all the lines that contains string1 and string2 separated by space?

Line1: .................
Line2: #hello1 #hello2 #hello3
Line3: .................
Line4: .................
Line5: #hello1 #hello4 #hello3
Line6: #hello1 #hello2 #hello3
Line7: .................
I have files that look similar in terms of lines on one of my project directories. I want to get the counts of all the lines that contain #hello1 and #hello2. In this case I would get 2 as a result only for this file. However, I want to do this recursively.
The canonical way to "do something recursively" is to use the find command. If you want to find lines that have two words on them, a simple regex will do:
grep -lr '#hello1.*#hello2' .
The option -l instructs grep to show us only filenames rather than file content, and the option -r tells grep to traverse the filesystem recursively. The start of the search is the path at the end of the line. Once you have the list of files, you can parse that list using commands run by xargs.
For example, this will count all the lines in files matching the pattern you specified.
grep -lr '#hello1.*#hello2' . | xargs -n 1 wc -l
This uses xargs to run the wc command on each of the files listed by grep. You could probably also run this without the -n 1, unless you're dealing with many many thousands of files that would exceed your maximum command line length.
Or, if I'm interpreting your question correctly, the following will count just the patterns in those files.
grep -lr '#hello1.*#hello2' . | xargs -n 1 grep -Hc '#hello1.*#hello2'
This runs a similar grep to the one used to generate your recursive list of files, and presents the output with filename (-H) and count (-c).
But if you want complex rules like finding two patterns possibly on different lines in the file, then grep probably is not the optimal tool, unless you use multiple greps launched by find:
find /path/to/base -type f \
-exec grep -q '#hello1' {} \; \
-exec grep -q '#hello2' {} \; \
-print
(Lines split for easier reading.)
This is somewhat costly, as find needs to launch up to two children for each file. So another approach would be to use awk instead:
find /path/to/base -type f \
-exec awk '/#hello1/{c++} /#hello2/{c++} c==2{r=1} END{exit 1-r}' {} \; \
-print
Alternately, if your shell is bash version 4 or above, you can avoid using find and use the bash option globstar:
$ shopt -s globstar
$ awk 'FNR=1{c=0} /#hello1/{c++} /#hello2/{c++} c==2{print FILENAME;nextfile}' **/*
Note: none of this is tested.
If you are not nterested in the number of files also,
then just something along:
find $BASEDIRECTORY -type f -print0 | xargs -0 grep -h PATTERN | wc -l
If you want to count lines containing #hello1 and #hello2 separated by space in a specific file you can:
$ grep -c '#hello1 #hello2' file
If you want to count in more than one file:
$ grep -c '#hello1 #hello2' file1 file2 ...
And if you want to get the gran total:
$ grep -c '#hello1 #hello2' file1 file2 ... | paste -s -d+ - | bc
of course you can let your shell expanding file names. So, for example:
$ grep -c '#hello1 #hello2' *.txt | paste -s -d+ - | bc
or so...
find . -type f | xargs -1 awk '/#hello1/ && /#hello2/{c++} END{print FILENAME, c+0}'

find and copy all images in directory using terminal linux mint, trying to understand syntax

OS Linux Mint
Like the title says finally I would like to find and copy all images in a directory.
I found:
find all jpg (or JPG) files in a directory and copy them into the folder /home/joachim/neu2:
find . -iname \*.jpg -print0 | xargs -I{} -0 cp -v {} /home/joachim/neu2
and
find all image files in a direcotry:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
My problem is first of all, I don't really understand the syntax. Could someone explain the code?
And secondly can someone connect the two codes for generating a code that does what I want ;)
Greetings and thanks in advance!
First, understand that the pipe "|" links commands piping the output of the first into the second as an argument. Your two shell codes both pipe output of the find command into other commands (grep and xargs). Let's look at those commands one after another:
First command: find
find is a program to "search for files in a directory hierarchy" (that is the explanation from find's man page). The syntax is (in this case)
find <search directory> <search pattern> <action>
In both cases the search directory is . (that is the current directory). Note that it does not just search the current directory but all its subdirectories as well (the directory hierarchy).
The search pattern accepts options -name (meaning it searches for files the name of which matches the pattern given as an argument to this option) or -iname (same as name but case insensitive) among others.
The action pattern may be -print0 (print the exact filename including its position in the given search directory, i.e. the relative or absolute path to the file) or -exec (execute the given command on the file(s), the command is to be ended with ";" and every instance of "{}" is replaced by the filename).
That is, the first shell code (first part, left of the pipe)
find . -iname \*.jpg -print0
searches all files with ending ".jpg" in the current directory hierarchy and prints their paths and names. The second one (first part)
find . -name '*' -exec file {} \;
finds all files in the current directory hierarchy and executes
file <filename>
on them. File is another command that determines and prints the file type (have a look at the man page for details, man file).
Second command: xargs
xargs is a command that "builds and exectues command lines from standard input" (man xargs), i.e. from the find output that is piped into xargs. The command that it builds and executes is in this case
cp -v {} /home/joachim/neu2"
Option -I{} defines the replacement string, i.e. every instance of {} in the command is to be replaced by the input it gets from file (that is, the filenames). Option -0 defines that input items are not terminated (seperated) by whitespace or newlines but only by a null character. This seems to be necessary when using and the standard way to deal with find output as xargs input.
The command that is built and executed is then of course the copy command with option -v (verbose) and it copies each of the filenames it gets from find to the directory.
Third command: grep
grep filters its input giving only those lines or strings that match a particular output pattern. Option -o tells grep to print only the matching string, not the entire line (see man grep), -P tells it to interpret the following pattern as a perl regexp pattern. In perl regex, ^ is the start of the line, .+ is any arbitrary string, this arbitrary should then be followed by a colon, a space, a number of alphanumeric characters (in perl regex denoted \w+) a space and the string "image". Essentially this grep command filters the file output to only output the filenames that are image files. (Read about perl regex's for instance here: http://www.comp.leeds.ac.uk/Perl/matching.html )
The command you actually wanted
Now what you want to do is (1) take the output of the second shell command (which lists the image files), (2) bring it into the appropriate form and (3) pipe it into the xargs command from the first shell command line (which then builds and executes the copy command you wanted). So this time we have a three (actually four) stage shell command with two pipes. Not a problem. We already have stages (1) and (3) (though in stage (3) we need to leave out the -0 option because the input is not find output any more; we need it to treat newlines as item seperators).
Stage (2) is still missing. I suggest using the cut command for this. cut changes strings py splitting them into different fields (seperated by a delimiter character in the original string) that can then be rearranged. I will choose ":" as the delimiter character (this ends the filename in the grep output, option -d':') and tell it to give us just the first field (option -f1, essentialls: print only the filename, not the part that comes after the ":"), i.e. stage (2) would then be
cut -d':' -f1
And the entire command you wanted will then be:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/joachim/neu2
Note that you can find all the man pages for instance here: http://www.linuxmanpages.com
I figured out a command only using awk that does the job as well:
find . -name '*' -exec file {} \; |
awk '{
if ($3=="image"){
print substr($1, 0, length($1)-1);
system("cp " substr($1, 0, length($1)-1) " /home/joachim/neu2" )
}
}'
the substr($1, 0, length($1)-1) is needed because in first column file returns name;
The above answer is really good. but it could take longer if it a huge directory.
here is a shorter version of it , if you already know your file extension
find . -name \*.jpg | cut -d':' -f1 | xargs -I{} cp --parents -v {} ~/testimage/
Here's another one which works like a charm.
It adds the EPOCH time to prevent overwriting files with the same name.
cd /media/myhome/'Local station'/
find . -path ./jpg -prune -o -type f -iname '*.jpg' -exec sh -c '
for file do
newname="${file##*/}"
newname="${newname%.jpg}"
mv -T -- "$file" "/media/myhome/Local station/jpg/$newname-$(date +%s).jpg"
done
' find-sh {} +
cd ~/
It's been designed by Kamil in this post here.
Find a specific type file from a directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
Copy specific type of file from one directory to another directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/user/copy/data/

Shell: find files in a list under a directory

I have a list containing about 1000 file names to search under a directory and its subdirectories. There are hundreds of subdirs with more than 1,000,000 files. The following command will run find for 1000 times:
cat filelist.txt | while read f; do find /dir -name $f; done
Is there a much faster way to do it?
If filelist.txt has a single filename per line:
find /dir | grep -f <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
(The -f option means that grep searches for all the patterns in the given file.)
Explanation of <(sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):
The <( ... ) is called a process subsitution, and is a little similar to $( ... ). The situation is equivalent to (but using the process substitution is neater and possibly a little faster):
sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
The call to sed runs the commands s#^#/#, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.
s#^#/# means put a / at the before each filename. (The ^ means "start of line" in a regex)
s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").
The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.
s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).
As an example:
$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile
$ sed 's#^#/#; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$
Grep then uses each line of that output as a pattern when it is searching the output of find.
If filelist.txt is a plain list:
$ find /dir | grep -F -f filelist.txt
If filelist.txt is a pattern list:
$ find /dir | grep -f filelist.txt
Use xargs(1) for the while loop can be a bit faster than in bash.
Like this
xargs -a filelist.txt -I filename find /dir -name filename
Be careful if the file names in filelist.txt contains whitespaces, read the second paragraph in the DESCRIPTION section of xargs(1) manpage about this problem.
An improvement based on some assumptions. For example, a.txt is in filelist.txt, and you can make sure there is only one a.txt in /dir. Then you can tell find(1) to exit early when it finds the instance.
xargs -a filelist.txt -I filename find /dir -name filename -print -quit
Another solution. You can pre-process the filelist.txt, make it into a find(1) arguments list like this. This will reduce find(1) invocations:
find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13000 files had failed to copy.
Neither of the answers did it for me so I did this:
cp file-list file-list2
find dir/ >> file-list2
sort file-list2 | uniq -u
Which resulted with a list of the 4 files I needed.
The idea is to combine the two file lists to determine the unique entries.
sort is used to make duplicate entries adjacent to each other which is the only way uniq will filter them out.

Resources