Count specific pattern inside text - Linux

I have a huge file, and I want a shell command to count the number of occurrences of the word 'new' in it.
I tried to use wc and grep, but I only get the number of lines that contain the pattern.

From @Fravadona's suggestion:
grep -ow new file.txt | wc -l
-o means "print only the matches, one per line"
-w means "only match if it's a full word", which avoids matching e.g. newOrder
wc -l counts the number of lines grep printed
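As a quick sanity check, here's a hypothetical three-line file (the name file.txt and its contents are made up for illustration) showing what -w does and does not match:
printf 'new news\nrenew new\nnewOrder new\n' > file.txt
grep -ow new file.txt | wc -l
# prints 3: the standalone 'new' on each line matches, while 'news',
# 'renew' and 'newOrder' are rejected by -w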

Related

GREP to show files WITH text and WITHOUT text

I am trying to search for files containing a specific text while excluding another text, and to show only the files.
Here is my code:
grep -v "TEXT1" *.* | grep -ils "ABC2"
However, it returns:
(standard input)
Please suggest. Thanks a lot.
The output should only show the filenames.
Here's one way to do it, assuming you want to match these terms anywhere in the file.
grep -LZ 'TEXT1' *.* | xargs -0 grep -li 'ABC2'
-L will match files not containing the given search term
use -LiZ if you want to match TEXT1 irrespective of case
The -Z option separates the filenames with a NUL character, and xargs -0 then splits its input on NUL, so filenames containing spaces or newlines are handled correctly
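A minimal demonstration of the pipeline, using two made-up files (the names good.txt and bad.txt and their contents are just for illustration):
printf 'ABC2 here\n' > good.txt
printf 'TEXT1 and ABC2\n' > bad.txt
grep -LZ 'TEXT1' *.txt | xargs -0 grep -li 'ABC2'
# prints only good.txt: bad.txt is skipped because it contains TEXT1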
If you want to check these two conditions on same line instead of anywhere in the file:
grep -lP '^(?!.*TEXT1).*(?i:ABC2)' *.*
-P enables PCRE, which I assume is available since the question is tagged linux
(?!regexp) is a negative lookahead construct, so ^(?!.*TEXT1) will ensure the line doesn't have TEXT1
(?i:ABC2) will match ABC2 case insensitively
Use grep -liP '^(?!.*TEXT1).*ABC2' *.* if you want to match both terms irrespective of case
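For example, with the same two hypothetical files as above, the PCRE version gives the same result in a single grep process:
grep -lP '^(?!.*TEXT1).*(?i:ABC2)' *.txt
# prints only good.txt; in bad.txt the lookahead fails because TEXT1 is on the same line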
(standard input)
This output appears because grep -l is used in a pipeline: the second grep command reads its input from stdin, not from a file, so the -l option prints (standard input) instead of a filename.
You can use this alternate solution in a single awk command:
awk '/ABC2/ && !/TEXT1/ {print FILENAME; nextfile}' *.* 2>/dev/null
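A quick check of the awk version with the same hypothetical files; nextfile makes awk stop reading a file as soon as its name has been printed, so each filename appears at most once:
awk '/ABC2/ && !/TEXT1/ {print FILENAME; nextfile}' *.txt 2>/dev/null
# prints good.txt once, even if ABC2 appeared on several of its lines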

Counting lines starting with a symbol

There are many lines containing the > symbol in a file. How can I count the total number of > symbols in the file? I have tried sed and grep, but it did not work.
You can use GNU grep together with wc:
grep -o '>' file.txt | wc -l
grep -o prints every match on a separate line. wc counts the lines.
Btw, it's not 100% clear from your question whether the > can appear only at the start of a line. If you just want to count the lines that start with a >, you can use the following grep command:
grep -c '^[[:space:]]*>' file.txt
^ matches the beginning of the line, and [[:space:]]* allows for zero or more whitespace characters in front of the >, just in case.
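To see the difference between the two counts, here is a small made-up file (the name and contents are illustrative only):
printf '> quoted\n  > indented\nno symbol here\na > b > c\n' > file.txt
grep -o '>' file.txt | wc -l            # 4: every occurrence of >
grep -c '^[[:space:]]*>' file.txt       # 2: only lines starting with >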

Find files from a folder when a specific word appears on at least a specific number of lines

How can I find the files from a folder where a specific word appears on more than 3 lines? I tried using recursive grep for finding that word and then using -c to count the number of lines where the word appears.
This command will recursively list the files in the current directory where word appears on more than 3 lines, along with the matches count for each file:
grep -c -r 'word' . | grep -v -e ':[0123]$' | sort -n -t: -k2
The final sort is not necessary if you don't want the results sorted, but I'd say it's convenient.
The first command in the pipeline (grep -c -r 'word' .) recursively scans every file in the current directory and prints filename:count pairs, where count is the number of lines matching word in that file. The intermediate grep discards every count that is 0, 1, 2 or 3, so you only keep counts greater than 3 (the -v option of grep(1) inverts the sense of matching to select non-matching lines). The final sort step orders the list by the count for each file: it sets the field delimiter to : and instructs sort(1) to do a numeric sort using the 2nd field (the count) as the key.
Here's a sample output from some tests I ran:
./file1:4
./dir1/dir2/file3:5
./dir1/file2:8
If you just want the filenames without the match counts, you can use sed(1) to discard the :count portions:
grep -m 4 -c -r 'word' . | grep -v -e ':[0123]$' | sed -r 's/:[0-9]+$//'
As noted in the comments, if the exact match count is not important, we can optimize the first grep with -m 4, which stops reading each file after 4 matching lines.
UPDATE
The solution above works fine for small thresholds, but it does not scale well to larger ones, since the character class in ':[0123]$' would have to be rewritten for every threshold. If you want to filter on an arbitrary number, you can use awk(1) instead (and it actually ends up much cleaner), like so:
grep -c -r 'word' . | awk -F: '$2 > 10'
The -F: argument is necessary; it instructs awk(1) to split fields on : rather than on the default (runs of whitespace). This solution generalizes to any threshold.
Again, if matches count doesn't matter and all you want is to get a list of the filenames, do this instead:
grep -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'
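If the threshold changes often, you can pass it in from the shell; a small sketch (the min variable name is arbitrary):
min=10
grep -c -r 'word' . | awk -F: -v n="$min" '$2 > n { print $1 }'
# note: like the original pipeline, this assumes filenames contain no ':'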

How do I grep in a list of files targeted by a previous grep?

I am using grep to get a list of files that I want to use for another grep search (and not simply piping it).
For example I got as an output:
file1.h:XXX: linecontent
file2.h:XXX: linecontent
file3.h:XXX: linecontent
file4.h:XXX: linecontent
and I want to grep only file1.h, file2.h ...
I'm assuming you want to search for files that contain two different patterns. If so, this is what you want:
grep 'your pattern 2' `grep -l 'your pattern 1' *`
The contents of the back quotes will be executed first and the output substituted into the command line. Use of the -l flag will restrict the output of the grep command to just the file names.
If a very large number of files match your pattern 1, this could fail because the substituted command line becomes too long. The solution for that is to use xargs:
grep -l 'your pattern 1' * | xargs grep 'your pattern 2'
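One caveat worth noting: plain xargs splits its input on whitespace, so filenames containing spaces would break this. Assuming GNU grep and xargs are available, a NUL-delimited variant of the same idea is safer:
grep -lZ 'your pattern 1' * | xargs -0 grep 'your pattern 2'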
Assuming what you want is the names of files that contain 'lineofcontent', you could use:
grep -l 'lineofcontent' file*.h

Grep only a part of a text file

How can I apply the following command to only a part of a text file? For example, from the beginning to line 5000.
grep "^ A : 11 B : 10" filename | wc -l
I cannot use head and then apply the above command since the text file is huge.
You could try using the sed command, which I believe does better on large files, as suggested in this question, and pipe its output to grep.
sed -n 1,5000p file | grep ...
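Since the file is huge, it may be worth telling sed to quit once it reaches line 5000, so it doesn't keep scanning the rest of the file; a sketch with the original pattern filled in:
sed -n '1,5000p;5000q' filename | grep -c '^ A : 11 B : 10'
# grep -c counts the matching lines directly, replacing the extra wc -l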
You can try a combination of -n (prefix each line of output with its line number) and -m (limit the number of matching lines). Something like this:
grep -n -m 5000 pattern file.txt | grep -B 5000 "^5000:" | wc -l
The first grep searches for the pattern, adds line numbers, and limits the output to the first 5000 matching lines (the worst case, where every line in the range matches). The second grep matches the line numbered 5000 and prints all lines before it. Note that this only works if line 5000 itself matches the pattern; otherwise the second grep prints nothing and the count comes out as 0.
I don't know if it is a more efficient solution, though.
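A more robust sketch under the same requirements (count matching lines among the first 5000, then stop reading) is a single awk command:
awk 'NR > 5000 { exit } /^ A : 11 B : 10/ { n++ } END { print n+0 }' filename
# exits as soon as it passes line 5000; n+0 prints 0 when nothing matched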
