I have json files in the current directory, and subdirectories. All the files have a single line of content.
I want to a list of all files that contain the word XYZ, and the number of times it occurs in that file.
I want to print the list according to the following format:
file_name pattern_occurence_times
It should look something like:
.\x1\x2\file1.json 3
.\x1\file3.json 2
The problem is that grep counts the NUMBER of lines containing XYZ, not the number of occurrences.
Since the whole content of the files is always contained in a single line, the count is always 1 (if the pattern occurs in the file).
I used this command for that:
find . -type f -name "*.json" -exec grep --files-with-match -i 'xyz' {} \; -exec grep -wci 'xyz' {} \;
I wrote a python code, and it works, but I would like to know if there is any way of doing that using find and grep or any other command line tools.
Thanks
The classical approach to this problem is the pipeline grep -o regex file | wc -l. However, to execute a pipeline in find's -exec you have to run a shell (e.g. sh -c ... ). But all these things together will only print the number of matches, not the file names. Also, files with no matches have to be filtered out.
Because of all of this I think a single awk command would be preferable:
find ... -type f -exec awk '{$0=tolower($0); c+=gsub(/xyz/,"")}
END {if(c>0) print FILENAME " " c}' {} \;
Here the tolower($0) emulates grep's -i option. Make sure to write your search pattern xyz only in lowercase.
If you want to combine this with subsequent filters in find you can add else exit 1 at the end of the last awk block to continue (inside find) only with the printed files.
Use the -o option of grep, e.g. in conjunction with wc, e.g.
find . -name "*.json" | while read -r f ; do
echo $f : $(grep -ow XYZ "$f" | wc -l)
done
Line1: .................
Line2: #hello1 #hello2 #hello3
Line3: .................
Line4: .................
Line5: #hello1 #hello4 #hello3
Line6: #hello1 #hello2 #hello3
Line7: .................
I have files that look similar in terms of lines on one of my project directories. I want to get the counts of all the lines that contain #hello1 and #hello2. In this case I would get 2 as a result only for this file. However, I want to do this recursively.
The canonical way to "do something recursively" is to use the find command. If you want to find lines that have two words on them, a simple regex will do:
grep -lr '#hello1.*#hello2' .
The option -l instructs grep to show us only filenames rather than file content, and the option -r tells grep to traverse the filesystem recursively. The start of the search is the path at the end of the line. Once you have the list of files, you can parse that list using commands run by xargs.
For example, this will count all the lines in files matching the pattern you specified.
grep -lr '#hello1.*#hello2' . | xargs -n 1 wc -l
This uses xargs to run the wc command on each of the files listed by grep. You could probably also run this without the -n 1, unless you're dealing with many many thousands of files that would exceed your maximum command line length.
Or, if I'm interpreting your question correctly, the following will count just the patterns in those files.
grep -lr '#hello1.*#hello2' . | xargs -n 1 grep -Hc '#hello1.*#hello2'
This runs a similar grep to the one used to generate your recursive list of files, and presents the output with filename (-H) and count (-c).
But if you want complex rules like finding two patterns possibly on different lines in the file, then grep probably is not the optimal tool, unless you use multiple greps launched by find:
find /path/to/base -type f \
-exec grep -q '#hello1' {} \; \
-exec grep -q '#hello2' {} \; \
-print
(Lines split for easier reading.)
This is somewhat costly, as find needs to launch up to two children for each file. So another approach would be to use awk instead:
find /path/to/base -type f \
-exec awk '/#hello1/{c++} /#hello2/{c++} c==2{r=1} END{exit 1-r}' {} \; \
-print
Alternately, if your shell is bash version 4 or above, you can avoid using find and use the bash option globstar:
$ shopt -s globstar
$ awk 'FNR=1{c=0} /#hello1/{c++} /#hello2/{c++} c==2{print FILENAME;nextfile}' **/*
Note: none of this is tested.
If you are not nterested in the number of files also,
then just something along:
find $BASEDIRECTORY -type f -print0 | xargs -0 grep -h PATTERN | wc -l
If you want to count lines containing #hello1 and #hello2 separated by space in a specific file you can:
$ grep -c '#hello1 #hello2' file
If you want to count in more than one file:
$ grep -c '#hello1 #hello2' file1 file2 ...
And if you want to get the gran total:
$ grep -c '#hello1 #hello2' file1 file2 ... | paste -s -d+ - | bc
of course you can let your shell expanding file names. So, for example:
$ grep -c '#hello1 #hello2' *.txt | paste -s -d+ - | bc
or so...
find . -type f | xargs -1 awk '/#hello1/ && /#hello2/{c++} END{print FILENAME, c+0}'
Say my current working directory is called my_dir and I have a few different files in them:
file1
file_dog
filedog_1
dog_file
How do I count the number of instances "dog" appears (should be 3) in my directory?
Thank you!
You can use the -c option from grep
grep -c:
Suppress normal output; instead print a count of matching lines for each input file.
ls -1 | grep -c dog
or:
ls -1 *[dD][oO][gG]* | wc -l
The following solution works even if filenames contain newline characters:
$ touch file1 file_dog filedog_1 dog_file
$ find . -maxdepth 1 -name '*dog*' -print0 | grep -zc .
3
How it works:
find . -maxdepth 1 -name '*dog*' -print0
This finds all files in the current directory with dog in their name and prints them out in a nul-separated list. Since the nul character is one of the few not allowed in a file name, this is safe. If you want a case-insensitive match (so that Dog is matched as well), replace -name with -iname.
grep -zc .
This reads in a nul-separated list and counts the lines.
-z tells grep that the input is nul-separated
-c tells grep to suppress normal output and count the number of matches
. tells grep to match anything.
while I was trying to practice my linux skills, but I could not solve this question.
So its basically saying "Write a bash script that takes a name of
directory as a command argument and printf the name of subdirectories
that has more than 5 files in it."
I thought we will use the find command but ı still could not figure it out. My code is:
find directory -type d -mindepth5
but it's not working.
You can use find twice:
First you can use find and wc to count the number of files in a given directory:
nb=$(find directory -maxdepth 1 -type f -printf "x\n" | wc -l)
This just asks find to output an x on a line for each file in the directory directory, proceeding non-recursively, then wc -l counts the number of lines, so, really, nb is the number of files in directory.
If you want to know whether a directory contains more than 5 files, it's a good idea to stop find as soon as 6 files are found:
nb=$(find directory -maxdepth 1 -type f -printf "x\n" | head -6 | wc -l)
Here nb has an upper threshold of 6.
Now if for each subdirectory of a directory directory you want to output the number of files (threshold at 6), you can do this:
find directory -type d -exec bash -c 'nb=$(find "$0" -maxdepth 1 -type f -printf "x\n" | head -6 | wc -l); echo "$nb"' {} \;
where the $0 that appears is the 0-th argument, namely {} that find will replaced by the subdirectory of directory.
Finally, you only want to display the subdirectory name if the number of files is more than 5:
find . -type d -exec bash -c 'nb=$(find "$0" -maxdepth 1 -type f -printf "x\n" | head -6 | wc -l); ((nb>5))' {} \; -print
The final test ((nb>5)) returns success or failure whether nb is greater than 5 or not, and in case of success, find will -print the subdirectory name.
This should do the trick:
find directory/ -type f | sed 's/\(.*\)\/.*/\1/g' | sort | uniq -c | sort -n | awk '{if($1>5) print($2)}'
Using mindpeth is useless here since it only lists directories at at least depth 5. You say you need subdirectories with more then 5 files in it.
find directory -type f prints all files in subdirectories
sed 's/\(.*\)\/.*/\1/g' removes names of files leaving only list of subdirecotries without filenames
sort sorts that list so we can use uniq
uniq -c merges duplicate lines and writes how many times it occured
sort -n sorts it by number of occurences (so you end up with a list:(how many times, subdirectory))
awk '{if($1>5) print($2)}' prints only those with first comlun 1 > 5 (and it only prints the second column)
So you end up with a list of subdirectories with at least 5 files inside.
EDIT:
A fix for paths with spaces was proposed:
Instead of awk '{if($1>5) print($2)}' there should be awk '{if($1>5){ $1=""; print(substr($0,2)) }}' which sets first part of line to "" and then prints whole line without a leading space (which was delimiter). So put together we get this:
find directory/ -type f | sed 's/\(.*\)\/.*/\1/g' | sort | uniq -c | sort -n | awk '{if($1>5){ $1=""; print(substr($0,2)) }}'
I am able to list all the directories by
find ./ -type d
I attempted to list the contents of each directory and count the number of files in each directory by using the following command
find ./ -type d | xargs ls -l | wc -l
But this summed the total number of lines returned by
find ./ -type d | xargs ls -l
Is there a way I can count the number of files in each directory?
This prints the file count per directory for the current directory level:
du -a | cut -d/ -f2 | sort | uniq -c | sort -nr
Assuming you have GNU find, let it find the directories and let bash do the rest:
find . -type d -print0 | while read -d '' -r dir; do
files=("$dir"/*)
printf "%5d files in directory %s\n" "${#files[#]}" "$dir"
done
find . -type f | cut -d/ -f2 | sort | uniq -c
find . -type f to find all items of the type file, in current folder and subfolders
cut -d/ -f2 to cut out their specific folder
sort to sort the list of foldernames
uniq -c to return the number of times each foldername has been counted
You could arrange to find all the files, remove the file names, leaving you a line containing just the directory name for each file, and then count the number of times each directory appears:
find . -type f |
sed 's%/[^/]*$%%' |
sort |
uniq -c
The only gotcha in this is if you have any file names or directory names containing a newline character, which is fairly unlikely. If you really have to worry about newlines in file names or directory names, I suggest you find them, and fix them so they don't contain newlines (and quietly persuade the guilty party of the error of their ways).
If you're interested in the count of the files in each sub-directory of the current directory, counting any files in any sub-directories along with the files in the immediate sub-directory, then I'd adapt the sed command to print only the top-level directory:
find . -type f |
sed -e 's%^\(\./[^/]*/\).*$%\1%' -e 's%^\.\/[^/]*$%./%' |
sort |
uniq -c
The first pattern captures the start of the name, the dot, the slash, the name up to the next slash and the slash, and replaces the line with just the first part, so:
./dir1/dir2/file1
is replaced by
./dir1/
The second replace captures the files directly in the current directory; they don't have a slash at the end, and those are replace by ./. The sort and count then works on just the number of names.
Here's one way to do it, but probably not the most efficient.
find -type d -print0 | xargs -0 -n1 bash -c 'echo -n "$1:"; ls -1 "$1" | wc -l' --
Gives output like this, with directory name followed by count of entries in that directory. Note that the output count will also include directory entries which may not be what you want.
./c/fa/l:0
./a:4
./a/c:0
./a/a:1
./a/a/b:0
Slightly modified version of Sebastian's answer using find instead of du (to exclude file-size-related overhead that du has to perform and that is never used):
find ./ -mindepth 2 -type f | cut -d/ -f2 | sort | uniq -c | sort -nr
-mindepth 2 parameter is used to exclude files in current directory. If you remove it, you'll see a bunch of lines like the following:
234 dir1
123 dir2
1 file1
1 file2
1 file3
...
1 fileN
(much like the du-based variant does)
If you do need to count the files in current directory as well, use this enhanced version:
{ find ./ -mindepth 2 -type f | cut -d/ -f2 | sort && find ./ -maxdepth 1 -type f | cut -d/ -f1; } | uniq -c | sort -nr
The output will be like the following:
234 dir1
123 dir2
42 .
Everyone else's solution has one drawback or another.
find -type d -readable -exec sh -c 'printf "%s " "$1"; ls -1UA "$1" | wc -l' sh {} ';'
Explanation:
-type d: we're interested in directories.
-readable: We only want them if it's possible to list the files in them. Note that find will still emit an error when it tries to search for more directories in them, but this prevents calling -exec for them.
-exec sh -c BLAH sh {} ';': for each directory, run this script fragment, with $0 set to sh and $1 set to the filename.
printf "%s " "$1": portably and minimally print the directory name, followed by only a space, not a newline.
ls -1UA: list the files, one per line, in directory order (to avoid stalling the pipe), excluding only the special directories . and ..
wc -l: count the lines
This can also be done with looping over ls instead of find
for f in */; do echo "$f -> $(ls $f | wc -l)"; done
Explanation:
for f in */; - loop over all directories
do echo "$f -> - print out each directory name
$(ls $f | wc -l) - call ls for this directory and count lines
This should return the directory name followed by the number of files in the directory.
findfiles() {
echo "$1" $(find "$1" -maxdepth 1 -type f | wc -l)
}
export -f findfiles
find ./ -type d -exec bash -c 'findfiles "$0"' {} \;
Example output:
./ 6
./foo 1
./foo/bar 2
./foo/bar/bazzz 0
./foo/bar/baz 4
./src 4
The export -f is required because the -exec argument of find does not allow executing a bash function unless you invoke bash explicitly, and you need to export the function defined in the current scope to the new shell explicitly.
My answer is a little different, due to the options of find, you can actually be much more flexible. Just try:
find . -type f -printf "%h\n" | sort | uniq -c
With the "%h" option to "-printf", find prints only the directory of the files it found. Then sort and count with "uniq -c". This prints the number of search result entries with the same directory, per directory.
Using further options on find, you can be much more flexible. For example, to get an overview how many files in which directory have been modified at a certain date, use:
find . -newermt "2022-01-01 00:00:00" -type f -printf "%TY-%Tm-%Td %h\n" | sort | uniq -c
This finds all files that have been modified since 1. January 2022, prints (with "-printf") the modification date and the directory, then sorts and counts them. In this example, each line in the result has the number of files, the date of modification (without time), and the directory.
Note that "-printf" may not be available in all versions of find I think.
I combined #glenn jackman's answer and #pcarvalho's answer(in comment list, there is something wrong with pcarvalho's answer because the extra style control function of character '`'(backtick)).
My script can accept path as an augument and sort the directory list as ls -l, also it can handles the problem of "space in file name".
#!/bin/bash
OLD_IFS="$IFS"
IFS=$'\n'
for dir in $(find $1 -maxdepth 1 -type d | sort);
do
files=("$dir"/*)
printf "%5d,%s\n" "${#files[#]}" "$dir"
done
FS="$OLD_IFS"
My first answer in stackoverflow, and I hope it can help someone ^_^
THis could be another way to browse through the directory structures and provide depth results.
find . -type d | awk '{print "echo -n \""$0" \";ls -l "$0" | grep -v total | wc -l" }' | sh
find . -type f -printf '%h\n' | sort | uniq -c
gives for example:
5 .
4 ./aln
5 ./aln/iq
4 ./bs
4 ./ft
6 ./hot
I tried with some of the others here but ended up with subfolders included in the file count when I only wanted the files. This prints ./folder/path<tab>nnn with the number of files, not including subfolders, for each subfolder in the current folder.
for d in `find . -type d -print`
do
echo -e "$d\t$(find $d -maxdepth 1 -type f -print | wc -l)"
done
This will give the overall count.
for file in */; do echo "$file -> $(ls $file | wc -l)"; done | cut -d ' ' -f 3| py --ji -l 'numpy.sum(l)'
A super fast miracle command, which recursively traverses files to count the number of images in a directory and organize the output by image extension:
find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -n | grep -Ei '(tiff|bmp|jpeg|jpg|png|gif)$'
Credits: https://unix.stackexchange.com/a/386135/354980
I edited the script in order to exclude all node_modules directories inside the analyzed one.
This can be used to check if the project number of files is exceeding the maximum number that the file watcher can handle.
find . -type d ! -path "*node_modules*" -print0 | while read -d '' -r dir; do
files=("$dir"/*)
printf "%5d files in directory %s\n" "${#files[#]}" "$dir"
done
To check the maximum files that your system can watch:
cat /proc/sys/fs/inotify/max_user_watches
node_modules folder should be added to your IDE/editor excluded paths in slow systems, and the other files count shouldn't ideally exceed the maximum (which can be changed though).
Easy Method:
find ./|grep "Search_file.txt" |cut -d"/" -f2|sort |uniq -c
In my case I needed the count at subfolder level, so I did:
du -a | cut -d/ -f3 | sort | uniq -c | sort -nr
Easy way to recursively find files of a given type. In this case, .jpg files for all folders in current directory:
find . -name *.jpg -print | wc -l
omg why the complex commands. just use something like
find whatever_folder | wc -l