Why does Linux grep not give the correct count for line breaks?

On Ubuntu 10.04.4 LTS, I did the following small test and got a surprising result:
First, I created a file with 5 lines and named it a.txt:
echo -e "1\n2\n3\n4\n5" > a.txt
$ cat a.txt
1
2
3
4
5
Then I ran wc to count the number of lines:
$ wc -l a.txt
5 a.txt
However, when I ran grep to count the number of lines that have line breaks, I got an answer that I did not understand:
$ grep -c -P '\n' a.txt
3
My question is: how does grep get this number? Shouldn't it be 4?

Please Read The Fine Manual!
seq 1 5 | wc -l
5
seq 1 5 | grep -ac $'\n'
5
I don't understand where the problem is!?
seq 1 5 | hd
00000000 31 0a 32 0a 33 0a 34 0a 35 0a |1.2.3.4.5.|
Explanation:
The -a switch tells grep to process the file as if it were plain text, i.e. not to apply its special handling of binary files.
The $'\n' syntax is resolved by bash itself, before grep is run. This gives you the ability to pass control characters as arguments to any command under bash.
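As a quick check that bash really performs this expansion before the command runs (wc -c just counts the bytes the command would receive):
$ printf '%s' $'\n' | wc -c
1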

grep cannot see the newline character; it searches for a pattern within each line, after the line terminator has been stripped.
Consider using grep -c -P '$' a.txt to match the ending of each line.
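With the original five-line a.txt this should count every line, since $ matches at the end of each one:
$ grep -c -P '$' a.txt
5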

The newline character is not part of the lines. grep uses the newline character as the record separator and removes it from the lines, so that patterns with $ work as expected. For example, to search for lines ending with foo you can use the pattern foo$; otherwise you would have to write foo\n$, which would be very inconvenient.
So grep -c -P '\n' a.txt should give you 0. If you're getting 3, that sounds extremely strange, but perhaps it can be explained by the "highly experimental" remark in man grep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see
below). This is highly experimental and grep -P may warn of
unimplemented features.
I'm on Debian Wheezy, which is much more recent than Ubuntu 10.04. If -P is "highly experimental" today, it's not too difficult to imagine it was buggy in older systems. This is just a guess, though.
To count the number of newlines, use wc -l, not a grep -c hack.
Btw, interestingly:
$ printf hello >> a.txt
$ wc -l a.txt
5 a.txt
$ grep -c '' a.txt
6
That is, printf doesn't print a newline, so after we append "hello" to a.txt, there won't be a newline at the end of the file. So wc -l counts newline characters, not exactly "lines", and grep '' (empty string) matches all lines.
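You can confirm the missing trailing newline with hd, as used earlier (the exact hd column spacing may differ slightly on your system):
$ hd a.txt
00000000  31 0a 32 0a 33 0a 34 0a  35 0a 68 65 6c 6c 6f     |1.2.3.4.5.hello|
0000000f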

I think you want to use
$ grep -c -P "." a.txt
5
$ echo "6" >> a.txt
$ grep -c -P "." a.txt
6
$ cat a.txt
1
2
3
4
5
6

Related

How to get lines that don't contain certain patterns

I have a file that contains many lines:
ABRD0455252003666
JLKS8568875002886
KLJD2557852003625
.
.
.
AION9656532007525
BJRE8242248007866
I want to extract the lines that do not start with (ABRD or AION) and that have the numbers (003 or 007) in columns 12 to 14.
The output should be
KLJD2557852003625
BJRE8242248007866
I have tried this and it works, but it is too long a command and I want to optimise it for performance reasons:
egrep -a --text '^.{12}(?:003|007)' file.txt > result.txt |touch results.txt && chmod 777 results.txt |egrep -v -a --text "ABRD|AION" result.txt > result2.text
The -a option is a non-standard extension for dealing with binary files; it is not needed for text files.
grep -E '^.{11}(003|007)' file.txt | grep -Ev '^(ABRD|AION)'
The first stage matches any line with either 003 or 007 in the twelfth through fourteenth column.
The second stage filters out any line starting with either ABRD or AION.
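With the sample lines shown above (ignoring the elided ones), this should print exactly the two expected lines:
$ grep -E '^.{11}(003|007)' file.txt | grep -Ev '^(ABRD|AION)'
KLJD2557852003625
BJRE8242248007866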
You really just need to read a regexp tutorial, but in the meantime try this:
grep -E "^(ABRD|AION).{7}00[37]"

How to find which line is missing in another file

On a Linux box I have one file, as below:
A.txt
1
2
3
4
The second file is as below:
B.txt
1
2
3
6
I want to know what is inside A.txt but not in B.txt
i.e. it should print value 4
I want to do that on Linux.
awk 'NR==FNR{a[$0]=1;next}!a[$0]' B A
didn't test, give it a try
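Spelled out with the file names from the question, that would be (the expected output being the line that is only in A.txt):
$ awk 'NR==FNR{a[$0]=1;next}!a[$0]' B.txt A.txt
4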
Use comm if the files are sorted as your sample input shows:
$ comm -23 A.txt B.txt
4
If the files are unsorted, see @Kent's awk solution.
You can also do this using grep by combining the -v (show non-matching lines), -x (match whole lines) and -f (read patterns from file) options:
$ grep -v -x -f B.txt A.txt
4
This does not depend on the order of the files - it will remove any lines from A that match a line in B.
(An addition to @rjmunro's answer)
The proper way to use grep for this is:
$ grep -F -v -x -f B.txt A.txt
4
Without the -F flag, grep interprets each pattern read from B.txt as a basic regular expression (BRE), which is undesired here and can cause trouble. The -F flag makes grep treat the patterns as a set of newline-separated fixed strings. For instance:
$ cat A.txt
&
^
[
]
$ cat B.txt
[
^
]
|
$ grep -v -x -f B.txt A.txt
grep: B.txt:1: Invalid regular expression
$ grep -F -v -x -f B.txt A.txt
&
Using diff:
diff --changed-group-format='%<' --unchanged-group-format='' A.txt B.txt
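With the sample A.txt and B.txt above, this should print just the line that exists only in A.txt:
$ diff --changed-group-format='%<' --unchanged-group-format='' A.txt B.txt
4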

Exact grep -f command in Linux

I have 2 txt files in Linux.
A.txt contents (each line will contain a number):
1
2
3
B.txt contents (each line will contain a number):
1
2
3
10
20
30
grep -f A.txt B.txt results below:
1
2
3
10
20
30
Is there a way to grep such that I will get only the exact matches, i.e. not 10, 20, 30?
Thanks in advance
For an exact match, use the -x switch:
grep -x -f A.txt B.txt
EDIT: If you don't want grep's regex capabilities and need to treat the search patterns as fixed strings, then use the -F switch as well:
grep -xF -f A.txt B.txt
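With the sample files from the question, this should print only the exact matches:
$ grep -xF -f A.txt B.txt
1
2
3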
As anubhava pointed out, grep -x will match the whole line. There is another switch, -w, for matching a whole word. So grep -wf A.txt B.txt will show matches if a word from A.txt matches a word in B.txt.
Try adding the -w flag:
grep -wf A.txt B.txt
This will give you the exact result, which is:
1
2
3
Thanks
You can try to identify the file name which contains different contents:
# cat a.txt
1
2
3
# cat b.txt
1
2
3
10
20
30
# grep -L a.txt b.txt
b.txt

Count the number of occurrences of binary data

I need to count the occurrences of the hex string 0xFF 0x84 0x03 0x07 in a binary file, without too much hassle... Is there a quick way of grepping for this data from the Linux command line, or should I write dedicated code to do it?
Patterns without linebreaks
If your version of grep takes the -P parameter, then you can use grep -a -P, to search for an arbitrary binary string (with no linebreaks) inside a binary file. This is close to what you want:
grep -a -c -P '\xFF\x84\x03\x07' myfile.bin
-a ensures that binary files will not be skipped
-c outputs the count
-P specifies that your pattern is a Perl-compatible regular expression (PCRE), which allows strings to contain hex characters in the above \xNN format.
Unfortunately, grep -c will only count the number of "lines" the pattern appears on - not actual occurrences.
To get the exact number of occurrences with grep, it seems you need to do:
grep -a -o -P '\xFF\x84\x03\x07' myfile.bin | wc -l
grep -o separates out each match onto its own line, and wc -l counts the lines.
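A quick way to sanity-check this on your own system (sample.bin is just an illustrative file; on some systems you may also need to force the C locale so that -P treats the pattern as raw bytes):
# build a tiny test file containing the byte sequence twice (0x41 is just a separator byte)
$ printf '\xFF\x84\x03\x07\x41\xFF\x84\x03\x07' > sample.bin
$ LC_ALL=C grep -a -o -P '\xFF\x84\x03\x07' sample.bin | wc -l
2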
Patterns containing linebreaks
If you do need to grep for linebreaks, one workaround I can think of is to use tr to swap the character for another one that's not in your search term.
# set up test file (0a is newline)
xxd -r <<< '0:08 09 0a 0b 0c 0a 0b 0c' > test.bin
# grep for '\xa\xb\xc' doesn't work
grep -a -o -P '\xa\xb\xc' test.bin | wc -l
# swap newline with oct 42 and grep for that
tr '\n\042' '\042\n' < test.bin | grep -a -o -P '\042\xb\xc' | wc -l
(Note that 042 octal is the double quote " sign in ASCII.)
Another way, if your string doesn't contain Nulls (0x0), would be to use the -z flag, and swap Nulls for linebreaks before passing to wc.
grep -a -o -P -z '\xa\xb\xc' test.bin | tr '\0\n' '\n\0' | wc -l
(Note that -z and -P may be experimental in conjunction with each other. But with simple expressions and no Nulls, I would guess it's fine.)
Use hexdump like this:
hexdump -v -e '"0x" 1/1 "%02X" " "' <filename> | grep -oh "0xFF 0x84 0x03 0x07" | wc -l
hexdump will output the binary file in the given format, one 0xNN token per byte;
grep -o will find every occurrence of the string, printing each match on its own line, even when several occur on the same line;
wc -l will give you the final count.
Did you try grep -a?
from grep man page:
-a, --text
Process a binary file as if it were text; this is equivalent to the --binary-files=text option.
How about:
$ hexdump a.out | grep -Ec 'ff ?84 ?03 ?07'
This doesn't quite answer your question, but does solve the problem when the search string is ASCII but the file is binary:
cat binaryfile | sed 's/SearchString/SearchString\n/g' | grep -c SearchString
Basically, grep -c was almost there, except that it counts at most one occurrence per line when there is no newline byte in between, so I added the newline bytes.

Find line number in a text file - without opening the file

In a very large file I need to find the position (line number) of a string, then extract the 2 lines above and below that string.
To do this right now, I launch vi, find the string, note its line number, exit vi, then use sed to extract the lines surrounding that string.
Is there a way to streamline this process... ideally without having to run vi at all.
Maybe using grep like this:
grep -n -2 your_searched_for_string your_large_text_file
will give you almost what you expect:
-n : tells grep to print the line number
-2 : prints 2 lines of context before and after the match (and the matching line itself, of course)
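For example, with a small numbered file (nums.txt is just an illustrative name):
$ seq 1 5 > nums.txt
$ grep -n -2 3 nums.txt
1-1
2-2
3:3
4-4
5-5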
You can do
grep -C 2 yourSearch yourFile
To send it to a file, do
grep -C 2 yourSearch yourFile > result.txt
Use grep -n string file to find the line number without opening the file.
You can use cat -n to display the line numbers, then grep for the word, and use awk to extract the line number:
cat -n FILE | grep WORD | awk '{print $1;}'
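For example (sample.txt is just an illustrative file):
$ printf 'alpha\nbeta\ngamma\n' > sample.txt
$ cat -n sample.txt | grep beta | awk '{print $1;}'
2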
Although grep already does what you mention if you give it -C 2 (2 lines above/below):
grep -C 2 WORD FILE
You can do it with grep -A and -B options, like this:
grep -B 2 -A 2 "searchstring" yourFile | sed 3d
grep will find the matching line and show two lines of context before and after it; sed then removes the third line of that output (the matched line itself).
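For example, assuming a file containing the numbers 1 to 7, one per line:
$ seq 1 7 > sample.txt
$ grep -B 2 -A 2 "4" sample.txt | sed 3d
2
3
5
6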
If you want to automate this, you can simply write a shell script. You may try the following:
#!/bin/bash
VAL="your_search_keyword"
NUM1=$(grep -n "$VAL" file.txt | cut -f1 -d ':')
echo "$NUM1"                      # show the line number of the matched keyword
MYNUMUP=$((NUM1 - 1))             # line number just above the keyword
MYNUMDOWN=$((NUM1 + 1))           # line number just below the keyword
sed -n "${MYNUMUP}p" file.txt     # display the line above the keyword
sed -n "${MYNUMDOWN}p" file.txt   # display the line below the keyword
The advantage of the script is that you can change the keyword in the VAL variable as you like and re-run it to get the needed output.
