print search term with line count - linux

Hello, bash beginner question here. I want to look through multiple files, find the lines that contain a search term, count the number of unique lines in that list, and then print to a text file:
the input file name
the search term used
the count of unique lines
So an example output line for the file 'Firstpredictoroutput.txt', using the search term 'Stop_gained', where there are 10 unique matching lines, would be:
Firstpredictoroutput.txt Stop_gained 10
I can get the unique count for a single file using:
grep 'Search_term' inputfile.txt | uniq -c | wc -l >> output.txt
But I don't know enough yet about implementing loops in pipelines using bash.
All my inputfiles end with *predictoroutput.txt
Any help is greatly appreciated.
Thanks in advance,
Rubal

You can write a function called fun, and call fun with two arguments: filename and pattern:
$ fun() { echo "$1 $2 $(grep -c "$2" "$1")"; }
$ fun input.txt Stop_gained
input.txt Stop_gained 2

You can use find:
find . -type f -exec sh -c "grep 'Search_term' {} | uniq -c | wc -l >> output.txt" \;
Although you may have issues with weird filenames. You can add more options to find, for example to process only '.txt' files:
find . -type f -name "*.txt" -exec sh -c "grep 'Search_term' {} | uniq -c | wc -l >> output.txt" \;

q="search for this"
for f in *.txt; do echo "$f $q $(grep "$q" "$f" | uniq | wc -l)"; done > out.txt
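Putting the pieces together for the original question, a rough sketch (untested; it assumes the *predictoroutput.txt files sit in the current directory, and uses sort -u rather than uniq alone, since uniq only collapses adjacent duplicate lines) could be:
q='Stop_gained'
for f in *predictoroutput.txt; do
    # unique matching lines per file, printed as: filename search_term count
    printf '%s %s %s\n' "$f" "$q" "$(grep "$q" "$f" | sort -u | wc -l)"
done > output.txt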

Related

Given a specific directory, get how many occurrences of a specific string appear in every file, in bash

Given a directory, get all the txt files and count how many occurrences of a string there are in every text file, using find and grep.
find $1 -type f -name "*."$2"" -exec grep $3 -l printf {} \;
Where $1 is a directory, $2 the txt extension, and $3 the string whose occurrences we want to find.
The output must be:
$1/file1.txt
3
$1/file2.txt
6
As grep -c counts the matched lines, not the occurrences of a string, grep -o word | wc -l will be better. Would you please try:
find "$1" -type f -name "*.$2" -exec grep -l "$3" "{}" \; -exec bash -c 'grep -o '"$3"' "{}" | wc -l' \;
Suggesting feeding the find results into the grep command, then formatting the results with awk:
grep -c "$3" $(find "$1" -name "*.$2") | awk '{print $1,$2}' FS=":" OFS="\\n"

Finding and counting duplicate filenames

I need to search through all subfolders of the current folder recursively and list all files of a certain type along with the number of duplicates.
e.g. if the current folder is home and there are 2 subfolders dir1 and dir2,
then I need it to search dir1 and dir2 and list file names and the number of duplicates.
This is what I have so far:
I am using
find . -name "*.h"
to get a list of all the files of a certain type.
I now need to count duplicates and create a new list like
file1.h 2
file2.h 1
where file1.h is the file name and 2 is the number of duplicates overall.
Use uniq --count
You can use a set of core utilities to do this quickly. For example, given the following setup:
mkdir -p foo/{bar,baz}
touch foo/bar/file{1,2}.h
touch foo/baz/file{2,3}.h
you can then find (and count) the files with a pipeline like this:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c
This results in the following output:
1 file1.h
2 file2.h
1 file3.h
If you want other output formats, or to order the list in some other way than alphabetically by file, you can extend the pipeline with another sort (e.g. sort -nr) or reformat your columns with sed, awk, perl, ruby, or your text-munging language of choice.
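For example, to show the most duplicated names first, the same pipeline with a numeric reverse sort appended might look like:
find foo -name \*.h -print0 | xargs -0n1 basename | sort | uniq -c | sort -nr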
find -name "*.h"|awk -F"/" '{a[$NF]++}END{for(i in a)if(a[i]>1)print i,a[i]}'
Note: this will print file names that are shared by more than one file, and only those.
Using a shell script, the following code will print a filename if there are duplicates, then below that list all the duplicates.
The script is used as in the following example:
./find_duplicate.sh ./ Project
and will search the current directory tree for file names with 'project' in them.
#! /bin/sh
find "${1}" -iname "*${2}*" -printf "%f\n" \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && break
    FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
    echo "count: ${COUNT} | file: ${FILE}"
    FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
    find "${1}" -iname "${FILE}" -exec echo " {}" ';'
    echo
done
If you wish to search for all files (and not search for a pattern in the name), replace the line:
find "${1}" -iname "*${2}*" -printf "%f\n" \
with
find "${1}" -type f -printf "%f\n" \

Use wc on all subdirectories to count the sum of lines

How can I count all lines of all files in all subdirectories with wc?
cd mydir
wc -l *
..
11723 total
man wc suggests wc -l --files0-from=-, but I do not know how to generate the list of all files as NUL-terminated names
find . -print | wc -l --files0-from=-
did not work.
You probably want this:
find . -type f -print0 | wc -l --files0-from=-
If you only want the total number of lines, you could use
find . -type f -exec cat {} + | wc -l
Perhaps you are looking for the -exec option of find.
find . -type f -exec wc -l {} \; | awk '{total += $1} END {print total}'
To count all lines for a specific file extension you can use:
find . -name '*.fileextension' | xargs wc -l
If you want it on two or more different types of files you can use the -o option:
find . -name '*.fileextension1' -o -name '*.fileextension2' | xargs wc -l
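If any of the file names contain spaces, a null-delimited variant of the same idea (assuming GNU find and xargs) is safer:
find . -name '*.fileextension' -print0 | xargs -0 wc -l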
Another option would be to use a recursive grep:
grep -hRc '' . | awk '{k+=$1}END{print k}'
The awk simply adds the numbers. The grep options used are:
-c, --count
    Suppress normal output; instead print a count of matching lines
    for each input file. With the -v, --invert-match option (see
    below), count non-matching lines. (-c is specified by POSIX.)
-h, --no-filename
    Suppress the prefixing of file names on output. This is the
    default when there is only one file (or only standard input) to
    search.
-R, --dereference-recursive
    Read all files under each directory, recursively. Follow all
    symbolic links, unlike -r.
The grep, therefore, counts the number of lines matching anything (''), so essentially just counts the lines.
I would suggest something like
find ./ -type f | xargs wc -l | cut -c 1-8 | awk '{total += $1} END {print total}'
Based on ДМИТРИЙ МАЛИКОВ's answer:
Example for counting lines of java code with formatting:
one liner
find . -name '*.java' -exec wc -l {} \; | awk '{printf ("%3d: %6d %s\n",NR,$1,$2); total += $1} END {printf (" %6d\n",total)}'
awk part:
{
    printf ("%3d: %6d %s\n",NR,$1,$2);
    total += $1
}
END {
    printf (" %6d\n",total)
}
example result
  1:    120 ./opencv/NativeLibrary.java
  2:     65 ./opencv/OsCheck.java
  3:      5 ./opencv/package-info.java
     190
Bit late to the game here, but wouldn't this also work? find . -type f | wc -l
This counts all lines output by the 'find' command. You can fine-tune the 'find' to show whatever you want. I am using it to count the number of subdirectories, in one specific subdirectory, in a deep tree: find ./*/*/*/*/*/*/TOC -type d | wc -l . Output: 76435. (Just doing a find without all the intervening asterisks yielded an error.)
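If the fixed depth implied by all the asterisks is a problem, a depth-independent variant (assuming every directory of interest is literally named TOC) would be:
find . -type d -name TOC | wc -l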

How to count occurrences of a word in all the files of a directory?

I’m trying to count a particular word occurrence in a whole directory. Is this possible?
Say for example there is a directory with 100 files all of whose files may have the word “aaa” in them. How would I count the number of “aaa” in all the files under that directory?
I tried something like:
zegrep "xception" `find . -name '*auth*application*' | wc -l
But it’s not working.
grep -roh aaa . | wc -w
Grep recursively all files and directories in the current dir searching for aaa, and output only the matches, not the entire line. Then, just use wc to count how many words there are.
Another solution based on find and grep.
find . -type f -exec grep -o aaa {} \; | wc -l
Should correctly handle filenames with spaces in them.
Use grep in its simplest way. Try grep --help for more info.
To get count of a word in a particular file:
grep -c <word> <file_name>
Example:
grep -c 'aaa' abc_report.csv
Output:
445
To get count of a word in the whole directory:
grep -c -R <word>
Example:
grep -c -R 'aaa'
Output:
abc_report.csv:445
lmn_report.csv:129
pqr_report.csv:445
my_folder/xyz_report.csv:408
Let's use AWK!
$ function wordfrequency() { awk 'BEGIN { FS="[^a-zA-Z]+" } { for (i=1; i<=NF; i++) { word = tolower($i); words[word]++ } } END { for (w in words) printf("%3d %s\n", words[w], w) } ' | sort -rn; }
$ cat your_file.txt | wordfrequency
This lists the frequency of each word occurring in the provided file. If you want to see the occurrences of your word, you can just do this:
$ cat your_file.txt | wordfrequency | grep yourword
To find occurrences of your word across all files in a directory (non-recursively), you can do this:
$ cat * | wordfrequency | grep yourword
To find occurrences of your word across all files in a directory (and its sub-directories), you can do this:
$ find . -type f | xargs cat | wordfrequency | grep yourword
Source: AWK-ward Ruby
find . | xargs perl -pe 's/ /\n/g' | grep aaa | wc -l
cat the files together and grep the output: cat $(find /usr/share/doc/ -name '*.txt') | zegrep -ic '\<exception\>'
if you want 'exceptional' to match, don't use the '\<' and '\>' around the word.
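For example, the same pipeline without the word boundaries also counts partial matches such as 'exceptional':
cat $(find /usr/share/doc/ -name '*.txt') | zegrep -ic 'exception'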
How about starting with:
cat * | sed 's/ /\n/g' | grep '^aaa$' | wc -l
as in the following transcript:
pax$ cat file1
this is a file number 1
pax$ cat file2
And this file is file number 2,
a slightly larger file
pax$ cat file[12] | sed 's/ /\n/g' | grep 'file$' | wc -l
4
The sed converts spaces to newlines (you may want to include other space characters as well such as tabs, with sed 's/[ \t]/\n/g'). The grep just gets those lines that have the desired word, then the wc counts those lines for you.
Now there may be edge cases where this script doesn't work but it should be okay for the vast majority of situations.
If you wanted a whole tree (not just a single directory level), you can use something like:
( find . -name '*.txt' -exec cat {} ';' ) | sed 's/ /\n/g' | grep '^aaa$' | wc -l
There's also a grep regex syntax for matching words only:
# based on Carlos Campderrós solution posted in this thread
man grep | less -p '\<'
grep -roh '\<aaa\>' . | wc -l
For a different word matching regex syntax see:
man re_format | less -p '\[\[:<:\]\]'

Combining greps to make script to count files in folder

I need some help combining elements of scripts to form a readable output.
Basically I need to get the user name from the folder structure listed below and count the number of lines in that user's folder for files of type *.ano.
This is shown in the extract below; note that the position of the username in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user given your awk is correct. I often use sort/uniq -c to count the number of instances of a string, in this case username, as opposed to 'wc -l' only counting input lines.
Enjoy.
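To get the 'UserA 500' style output asked for, one rough, untested sketch that combines the awk path split with the non-blank-line grep (assuming the username is always the third path component from the end, as in the awk above) is:
find /home -name '*.ano' | while IFS= read -r f; do
    user=$(printf '%s\n' "$f" | awk -F/ '{print $(NF-2)}')   # username from the path
    lines=$(grep -cv '^[[:space:]]*$' "$f")                  # non-blank lines in this file
    echo "$user $lines"
done | awk '{sum[$1] += $2} END {for (u in sum) print u, sum[u]}'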
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
    echo "user $dir"
    find "$dir" -iname '*.ano' | wc -l
done
Execute the bash script below from the folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
# save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
    cd "$base"
    cd "$d"
    # search for all files named *.ano and count their non-blank lines
    sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
    echo "$d" "$sum"
done
This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
    username=${userdir##*/}
    lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $2} END {print sum}')
    (( count[$username] += lines ))
done
for user in "${!count[@]}"; do
    echo "$user" "${count[$user]}"
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done
