Find command prints numbers in front of results - linux

I am using the following find command to extract some files:
find /lag/cnnf/ -maxdepth 3 -newer ./start ! -newer ./end | grep -nri abc | egrep '([^0-9]45[^0-9])' | grep -nri "db.tar.gz" >> sample.txt
My output in sample.txt is
5:175:/lag/cnnf/abc/45/r-01.bac.db.tar.gz
20:190:/lag/cnnf/abc/45/r-01.bac.db.tar.gz
What should I do to get only
/lag/cnnf/abc/45/r-01.bac.db.tar.gz
/lag/cnnf/abc/45/r-01.bac.db.tar.gz
without the numbers in front of it, and what do those numbers actually mean?

It is grep, not find, which is printing the numbers: the -n option prefixes each matching line with its line number in grep's input. Since your pipeline runs two greps with -n, you get two prefixes. In 5:175:, 175 is the line number in the output of find (added by the first grep) and 5 is the line number within the final grep's input. Remove the -n option from the grep commands and the numbers will disappear:
find /lag/cnnf/ -maxdepth 3 -newer ./start ! -newer ./end | grep -ri abc | egrep '([^0-9]45[^0-9])' | grep -ri "db.tar.gz" >> sample.txt
It also looks like unnecessary overhead to use three grep processes; one should be enough, or the find command itself can do the filtering job. (Note too that -r has no effect when grep reads from a pipe.) We would need to know your input data to say more.
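For instance, a single grep might cover all three filters (a sketch, assuming the matching paths all look like the sample output above):
find /lag/cnnf/ -maxdepth 3 -newer ./start ! -newer ./end | grep 'abc.*[^0-9]45[^0-9].*db\.tar\.gz' >> sample.txt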

Related

Count files in a directory with filename matching a string

The command:
ls /some/path/some/dir/ | grep some_mask_*.txt | wc -l
returns the correct number of files when doing this via ssh on bash. When I put this into a .sh Script
iFiles=`ls /some/path/some/dir/ | grep some_mask_*.txt | wc -l`
echo "iFiles: ${iFiles}"
it is always 0. What's wrong here?
Solution:
When I worked on it I found out that my "wildcard mask" seemed to be the problem. Using grep some_mask_ | grep \.txt instead of the single grep above helped me solve the problem at first.
I marked as the solution the answer which pretty much describes exactly what I did wrong. I'm going to edit my script now. Thanks everyone.
The problem here is that grep some_mask_*.txt is expanded by the shell and not by grep, so most likely you have a file in the directory where grep is executed which matches some_mask_*.txt, and that filename is then used by grep as a filter.
If you want to ensure that the pattern is used by grep then you need to enclose it in single quotes. In addition you need to write the pattern as a regexp and not as a wildcard match (which bash uses for matching). Putting this together your command line version should be:
ls /some/path/some/dir/ | grep 'some_mask_.*\.txt' | wc -l
and the script:
iFiles=`ls /some/path/some/dir/ | grep 'some_mask_.*\.txt' | wc -l`
echo "iFiles: ${iFiles}"
Note that . needs to be prefixed with a backslash since it has special significance as a regexp that matches a single character.
I would also suggest that you postfix the regexp with $ in order to anchor it to the end (thus ensuring that the regexp matches filenames that ends with ".txt"):
ls /some/path/some/dir/ | grep 'some_mask_.*\.txt$' | wc -l
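To watch the expansion happen, prefix the command with echo (assuming, for illustration, that a file named some_mask_1.txt exists in the current directory):
$ echo grep some_mask_*.txt
grep some_mask_1.txt
The shell has already replaced the pattern before grep ever runs.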
Parsing ls is not a good thing. If you want to find files, use find:
find /some/path/some/dir/ -maxdepth 1 -name "some_mask_*.txt" -print0
This will print the files matching the condition within that directory, without going into subdirectories. Using -print0 prevents problems when a file name contains unusual characters:
-print0
       True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.
Then count the results. Since -print0 separates the names with null characters rather than newlines, count the separators instead of piping to wc -l.
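One way to do that (a sketch, assuming GNU tr: -dc '\0' deletes everything except the NUL separators, then wc -c counts them):
find /some/path/some/dir/ -maxdepth 1 -name "some_mask_*.txt" -print0 | tr -dc '\0' | wc -c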
By the way, note that
ls /some/path/some/dir/ | grep some_mask_*.txt
can be reduced to a simple
ls /some/path/some/dir/some_mask_*.txt
A simple solution (for bash):
find -name "*pattern*" | wc -l
"*" represent anything (prefix- anything before , postfix - anything after)
wc -l : give the count
find -name : will find file with given name in double quotes
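Applied to the question's directory and mask, that would be something like (a sketch; -maxdepth 1 keeps find from descending into subdirectories):
find /some/path/some/dir/ -maxdepth 1 -type f -name "some_mask_*.txt" | wc -l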
I suggest to use find as shown below. The reason for that is that filenames may contain newlines which would break a script that is using wc -l. I'm printing just a dot per filename and count the dots with wc -c:
find /some/path/some/dir/ -maxdepth 1 -name 'some_mask_*.txt' -printf '.' | wc -c
or if you want to write the results to variable:
ifiles=$(find /some/path/some/dir/ -maxdepth 1 -name 'some_mask_*.txt' -printf '.' | wc -c)
Try this,
iFiles=$(ls /some/path/some/dir/ | grep some_mask_*.txt | wc -l)
echo "iFiles: ${iFiles}"
Using $( ) instead of backticks should rule out any shell version problem (though note that the unquoted pattern is still subject to the expansion issue described above).
Try using an escape character in your command, like below:
ls /some/path/some/dir/ | grep some_mask_\*.txt | wc -l
Your problem is due to shell expansion. You probably tested the command line in the original directory, but if you try it from another directory then it will not work anymore.
When you type:
grep *.txt
then the shell replaces *.txt with all the file names matching the pattern and then executes the command (something like grep a.txt dummy.txt). But you want the pattern to be interpreted by grep, not expanded by the shell, so:
ls /tmp | grep '.*\.cpp'
will do it. Here the pattern is written in the syntax of the grep command (each command has its own syntax) and is not expanded by the shell, because it is protected by the surrounding single quotes.
Modify your command like:
a=`ls /tmp | grep '.*\.cpp'`
This is quite similar to the other answers, but a bit more robust:
iFiles=$( find /some/path/ -name "some_mask_*.txt" -type f 2> /dev/null | wc -l )
echo "Number of files: $iFiles"
This limits the find to files and also pipes stderr to null, so if the find command doesn't work or has permission issues you don't get a bogus result.
I was writing a shell script to count the files of the same type in a directory. For that I used the commands below, and they worked well:
LOCATION=/home/students/run_date/FILENAME                 # store the location in a variable
DIRECTORYCOUNT=$(find $LOCATION -type d -print | wc -l)   # count directories
FILECOUNT=$(find $LOCATION -type f -print | wc -l)        # count files

How can I use grep to get all the lines that contain string1 and string2 separated by a space?

Line1: .................
Line2: #hello1 #hello2 #hello3
Line3: .................
Line4: .................
Line5: #hello1 #hello4 #hello3
Line6: #hello1 #hello2 #hello3
Line7: .................
I have files like this in one of my project directories. I want to count all the lines that contain both #hello1 and #hello2. In this case I would get 2 as the result for this file alone. However, I want to do this recursively.
The canonical way to "do something recursively" is to use the find command. If you want to find lines that have two words on them, a simple regex will do:
grep -lr '#hello1.*#hello2' .
The option -l instructs grep to show us only filenames rather than file content, and the option -r tells grep to traverse the filesystem recursively. The start of the search is the path at the end of the line. Once you have the list of files, you can parse that list using commands run by xargs.
For example, this will count all the lines in files matching the pattern you specified.
grep -lr '#hello1.*#hello2' . | xargs -n 1 wc -l
This uses xargs to run the wc command on each of the files listed by grep. You could probably also run this without the -n 1, unless you're dealing with many many thousands of files that would exceed your maximum command line length.
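If any of the filenames contain spaces, a null-delimited variant is safer (a sketch assuming GNU grep, whose -Z option terminates each filename with a NUL for xargs -0 to consume):
grep -lrZ '#hello1.*#hello2' . | xargs -0 -n 1 wc -l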
Or, if I'm interpreting your question correctly, the following will count just the patterns in those files.
grep -lr '#hello1.*#hello2' . | xargs -n 1 grep -Hc '#hello1.*#hello2'
This runs a similar grep to the one used to generate your recursive list of files, and presents the output with filename (-H) and count (-c).
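Alternatively, grep -c alone can produce those per-file counts; filtering out the files with zero matches gives much the same result (a sketch):
grep -rc '#hello1.*#hello2' . | grep -v ':0$'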
But if you want complex rules like finding two patterns possibly on different lines in the file, then grep probably is not the optimal tool, unless you use multiple greps launched by find:
find /path/to/base -type f \
-exec grep -q '#hello1' {} \; \
-exec grep -q '#hello2' {} \; \
-print
(Lines split for easier reading.)
This is somewhat costly, as find needs to launch up to two children for each file. So another approach would be to use awk instead:
find /path/to/base -type f \
-exec awk '/#hello1/{a=1} /#hello2/{b=1} a&&b{r=1} END{exit 1-r}' {} \; \
-print
Alternatively, if your shell is bash version 4 or above, you can avoid find altogether and use the bash option globstar:
$ shopt -s globstar
$ awk 'FNR==1{a=b=0} /#hello1/{a=1} /#hello2/{b=1} a&&b{print FILENAME; nextfile}' **/*
Note: none of this is tested.
If you are not also interested in the number of files, then just something along the lines of:
find $BASEDIRECTORY -type f -print0 | xargs -0 grep -h PATTERN | wc -l
If you want to count lines containing #hello1 and #hello2 separated by space in a specific file you can:
$ grep -c '#hello1 #hello2' file
If you want to count in more than one file:
$ grep -c '#hello1 #hello2' file1 file2 ...
And if you want to get the grand total:
$ grep -c '#hello1 #hello2' file1 file2 ... | paste -s -d+ - | bc
Of course you can let your shell expand the file names. So, for example:
$ grep -c '#hello1 #hello2' *.txt | paste -s -d+ - | bc
or so...
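The same grand total can also be sketched with awk instead of paste and bc, summing the count after the last colon (assuming no colons in the filenames themselves):
grep -c '#hello1 #hello2' *.txt | awk -F: '{s += $NF} END {print s}'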
find . -type f | xargs -n 1 awk '/#hello1/ && /#hello2/{c++} END{print FILENAME, c+0}'

How to count the number of files whose name contains a vowel

I was trying to write a script that counts the number of files in a directory whose names contain a vowel.
If I use
find $1 -type f | wc -l
I get the number of files in the directory $1, but I do not know how to use grep to count just the ones with a vowel. I was trying something like this:
find $1 -type f | grep -l '[a,e,i,o,u,A,E,I,O,U]' | wc -l
You can use this GNU find command to count all the files with at least one vowel in their names:
find . -maxdepth 1 -type f -iname '*[aeiou]*' -printf ".\n" | wc -l
The -iname '*[aeiou]*' glob pattern matches only filenames containing at least one of a, e, i, o, u (ignoring case).
Remove -maxdepth 1 if you want to count files recursively in sub directories as well.
If you can accept counting directories:
ls -d *a* *e* *i* *o* *u* *y* *A* *E* *I* *O* *U* *Y* | wc -l
Otherwise:
find $1 -type f | grep -i '[aeiouy]' | wc -l
Your attempt fails for two reasons. First, -l does not make sense if grep is reading in a pipeline, since the purpose of -l is to print only the input file that matched, but in this case the only input file is stdin. Second, your syntax is wrong. Try:
... | grep -i '[aeiou]' | ...
Please don't use commas in a character group expression (the thing in [] brackets).
The best way is to first do a find(1) to get the files you want to scan. Then you need just the base names, because a vowel anywhere in the directory part of the path would otherwise match too. Finally, grep with [aeiouAEIOU] keeps only the names with a vowel in them, and wc(1) counts the lines.
find ${DIRECTORY} -type f -print | sed -e 's#^.*/##' | grep '[aeiouAEIOU]' | wc -l
-type f allows you to select just files (not directories). The sed(1) command edits the output, line by line, eliminating the first part of the name up to the last / character. The grep filters names with at least one vowel and discards the others, and finally wc -l counts the lines.

grep a pattern in some files and print the sum in each file

I want to grep a pattern in some files and count the occurrences along with the filename. Right now, if I use
grep -r "month" report* | wc -l
it will sum all instances across all files, so the output is a single value such as 324343. I want something like this:
report1: 3433
report2: 24399
....
Plain grep will show the filename, but it prints every matching line rather than a count.
grep -c will give you a count of matching lines for each file:
grep -rc "month" report*
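grep -c prints each filename followed by a colon and its count of matching lines, so the output looks something like:
report1:3433
report2:24399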
You need to pass each file to grep: echo report* | xargs grep -c month
If you need to recurse, use find report* -exec grep month -Hc '{}' \;

How can I find all of the distinct file extensions in a folder hierarchy?

On a Linux machine I would like to traverse a folder hierarchy and get a list of all of the distinct file extensions within it.
What would be the best way to achieve this from a shell?
Try this (not sure if it's the best way, but it works):
find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u
It works as follows:
Finds all files under the current folder
Prints the extension of each file, if any
Makes a unique sorted list
No need for the pipe to sort, awk can do it all:
find . -type f | awk -F. '!a[$NF]++{print $NF}'
Recursive version:
find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort -u
If you want totals (how many times each extension was seen):
find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort | uniq -c | sort -rn
Non-recursive (single folder):
for f in *.*; do printf "%s\n" "${f##*.}"; done | sort -u
I've based this upon this forum post, credit should go there.
My awk-less, sed-less, Perl-less, Python-less POSIX-compliant alternative:
find . -type f | rev | cut -d. -f1 | rev | tr '[:upper:]' '[:lower:]' | sort | uniq --count | sort -rn
The trick is that it reverses the line and cuts the extension at the beginning.
It also converts the extensions to lower case.
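To see the trick on a single (hypothetical) name:
$ echo 'archive.tar.gz' | rev
zg.rat.evihcra
$ echo 'archive.tar.gz' | rev | cut -d. -f1 | rev
gz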
Example output:
3689 jpg
1036 png
610 mp4
90 webm
90 mkv
57 mov
12 avi
10 txt
3 zip
2 ogv
1 xcf
1 trashinfo
1 sh
1 m4v
1 jpeg
1 ini
1 gqv
1 gcs
1 dv
Powershell:
dir -recurse | select-object extension -unique
Thanks to http://kevin-berridge.blogspot.com/2007/11/windows-powershell.html
Adding my own variation to the mix. I think it's the simplest of the lot and can be useful when efficiency is not a big concern.
find . -type f | grep -oE '\.(\w+)$' | sort -u
Find everything with a dot and show only the suffix.
find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u
If you know all suffixes have three characters, then:
find . -type f -name "*.???" | awk -F. '{print $NF}' | sort -u
Or, with sed, show all suffixes with one to four characters. Change {1,4} to the range of characters you are expecting in the suffix:
find . -type f | sed -n 's/.*\.\(.\{1,4\}\)$/\1/p'| sort -u
I tried a bunch of the answers here, even the "best" answer. They all came up short of what I was specifically after. So, after the past 12 hours of sitting in regex code for multiple programs and reading and testing these answers, this is what I came up with, and it works exactly like I want.
find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort -u
Finds all files which may have an extension.
Greps only the extension.
Greps for file extensions between 2 and 16 characters (just adjust the numbers if they don't fit your need). This helps avoid cache files and system files (the system-file part is for searching a jail).
Awk prints the extensions in lower case.
Sort and keep only unique values. I had originally tried the awk answer, but it would print items twice when they differed only in case.
If you need a count of the file extensions then use the below code
find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort | uniq -c | sort -rn
While these methods will take some time to complete and probably aren't the best ways to go about the problem, they work.
Update:
Per @alpha_989, long file extensions cause an issue. That's due to the original regex "[[:alpha:]]{3,6}". I have updated the answer to use the regex "[[:alpha:]]{2,16}". However, anyone using this code should be aware that those numbers are the minimum and maximum length allowed for an extension in the final output. Anything outside that range will be split into multiple lines in the output.
Note: the original post read "- Greps for file extensions between 3 and 6 characters (just adjust the numbers if they don't fit your need). This helps avoid cache files and system files (system file bit is to search jail)."
Idea: Could be used to find file extensions over a specific length via:
find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{4,}" | awk '{print tolower($0)}' | sort -u
Where 4 is the minimum extension length to include; any extensions longer than that are also found.
In Python using generators for very large directories, including blank extensions, and getting the number of times each extension shows up:
import json
import collections
import itertools
import os

root = '/home/andres'
files = itertools.chain.from_iterable(
    files for _, _, files in os.walk(root)
)
counter = collections.Counter(
    os.path.splitext(file_)[1] for file_ in files
)
print(json.dumps(counter, indent=2))
Since there's already another solution which uses Perl:
If you have Python installed you could also do (from the shell):
python -c "import os;e=set();[[e.add(os.path.splitext(f)[-1]) for f in fn]for _,_,fn in os.walk('/home')];print('\n'.join(e))"
Another way:
find . -type f -name "*.*" -printf "%f\n" | while IFS= read -r; do echo "${REPLY##*.}"; done | sort -u
You can drop the -name "*.*" but this ensures we are dealing only with files that do have an extension of some sort.
The -printf is find's print, not bash. -printf "%f\n" prints only the filename, stripping the path (and adds a newline).
Then we use string substitution to remove up to the last dot using ${REPLY##*.}.
Note that $REPLY is simply read's built-in default variable. We could just as easily use our own, in the form while IFS= read -r file; then $file would be the variable.
None of the replies so far deal with filenames with newlines properly (except for ChristopheD's, which just came in as I was typing this). The following is not a shell one-liner, but works, and is reasonably fast.
import os, sys

def names(roots):
    for root in roots:
        for a, b, basenames in os.walk(root):
            for basename in basenames:
                yield basename

sufs = set(os.path.splitext(x)[1] for x in names(sys.argv[1:]))
for suf in sufs:
    if suf:
        print(suf)
I think the simplest and most straightforward way is
for f in *.*; do echo "${f##*.}"; done | sort -u
It's modified from ChristopheD's 3rd way.
I don't think this one was mentioned yet:
find . -type f -exec sh -c 'echo "${0##*.}"' {} \; | sort | uniq -c
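Since this spawns one shell per file, a batched variant is considerably faster on large trees (a sketch using -exec ... + so each shell handles many files at once):
find . -type f -exec sh -c 'for f; do echo "${f##*.}"; done' sh {} + | sort | uniq -c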
The accepted answer uses a regex, and you cannot create an alias from a command containing a regex; you have to put it into a shell script. I'm using Amazon Linux 2 and did the following:
I put the accepted answer's code into a file:
sudo vim find.sh
added this code:
find ./ -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u
and saved the file by typing :wq!
Then I created an alias for it:
sudo vim ~/.bash_profile
alias getext=". /path/to/your/find.sh"
saved with :wq! and reloaded the profile:
. ~/.bash_profile
Now you can run getext from any directory.
You could also do this:
find . -type f -name "*.php" -exec PATHTOAPP {} +
I've found it simple and fast...
# find . -type f -exec basename {} \; | awk -F"." '{print $NF}' > /tmp/outfile.txt
# cat /tmp/outfile.txt | sort | uniq -c | sort -n > /tmp/outfile_sorted.txt
