grep reads patterns from a file but prints more than the whole-word matches - linux

I want to extract from file.txt the lines containing the exact words listed in check.txt.
# file.txt
CA1C 2637 green
CA1C-S1 2561 green
CA1C-S2 2371 green
# check.txt
CA1C
I tried
grep -wFf check.txt file.txt
but I'm not getting the desired output: all three lines are printed.
Instead, I'd like to get only the first line,
CA1C 2637 green
I searched and found this post, which is relevant; it's easy to do when matching only one word. But how can I improve my command so that grep reads the patterns from check.txt and prints only the lines where the whole word matches?
Many thanks!

The man page for grep says the following about the -w switch:
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
In your case, all three lines start with "CA1C", which meets both conditions: the match is at the beginning of the line, and it is followed by a non-word constituent character (a space in the first line, a hyphen in the other two).
I would do this with a loop, reading lines manually from check.txt:
while IFS= read -r line; do grep "^$line " file.txt; done < check.txt
CA1C 2637 green
This loop reads the lines from check.txt, and searches for each one at the start of a line in file.txt, with a following space.
There may be a better way to do this, but I couldn't get -f to actually consider whitespace at the end of a line of the input file.
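As one possible alternative to the loop, here is a minimal sketch using awk instead of grep. It assumes (this is not stated in the question) that the word to match is always the first whitespace-separated field of each line in file.txt. The scratch directory and the variable names are only for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'CA1C 2637 green' 'CA1C-S1 2561 green' 'CA1C-S2 2371 green' > "$workdir/file.txt"
printf '%s\n' 'CA1C' > "$workdir/check.txt"

# First file (NR == FNR): remember every word listed in check.txt.
# Second file: print a line only if its first field is one of those words.
matches=$(awk 'NR == FNR { want[$1]; next } $1 in want' "$workdir/check.txt" "$workdir/file.txt")
printf '%s\n' "$matches"   # prints: CA1C 2637 green

rm -r "$workdir"
```

Because awk compares the whole field as a string, characters that are special in regular expressions need no escaping in check.txt.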

Related

Grep the first line from each contiguous group of matching lines

I have a data file which looks like this:
a separator
interesting line 1
interesting line 2
a comment
interesting line 3
interesting line 4
interesting line 5
a non interesting line
some other data
interesting line 6
.
.
.
and I would like to extract the first interesting line from each contiguous group, no matter how many lines are in the group or how many extra lines separate the groups.
For the test input above the output would be:
interesting line 1
interesting line 3
interesting line 6
I could easily do this in python by having a state variable that triggers when I match a line, and resets when I encounter a non-matching line, but what about a one-line shell script? Is there a not-too-obscure way to do this?
You can use grep with a greedy regex, then print the first line of every match with head:
grep -Pzo '([^\n]*interesting line[^\n]*(\n|$))+' file |
while IFS='' read -d '' -r match
do
head -n1 <<< "$match"
done
grep parameters:
-P : Use Perl Compatible regular expression (instead of the default basic regular expression) for the \n in the regex.
-z : Treat the input as a set of lines, each terminated by a zero byte. An ASCII NUL character will separate each match, allowing us to reliably separate the matches.
-o : Print only the matched part, not the whole record it was found in.
The regex ([^\n]*blablabla[^\n]*(\n|$))+ will match each group of contiguous lines containing blablabla (a placeholder for the text you are looking for).
In the while condition command, the IFS is emptied for the read. Otherwise, with the default IFS, the last newline character of each match would be eaten by read (that might not be a problem). It's a good practice to always clear IFS in "while read" to get the text in the variable exactly as it is read (leading spaces are also easily eaten up).
read parameters:
-d '' : Use the empty string as delimiter (= the ASCII NUL character). This is equivalent to -d $'\0' (see https://unix.stackexchange.com/q/61029/283498).
-r : don't interpret any backslash in the lines (see https://unix.stackexchange.com/q/192786/283498).
match : just a variable name I chose, which is used in the body of the loop.
And in the body of the loop: head -n1 <<< "$match" prints only the first line of the current match (the command head with -n 1 prints the first 1 line of its input). Side note: <<< is a bashism ; the command is equivalent to echo "$match" | head -n1.
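For comparison, the state-variable idea mentioned in the question also fits in a short awk command. This is a sketch of a different technique from the grep -Pzo answer above, not a rewrite of it. The anchored pattern ^interesting line and the names in_group and data.txt are assumptions made for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'a separator' 'interesting line 1' 'interesting line 2' 'a comment' \
    'interesting line 3' 'interesting line 4' 'interesting line 5' \
    'a non interesting line' 'some other data' 'interesting line 6' > "$workdir/data.txt"

# in_group is empty (false) until a matching line is seen and is cleared by
# any non-matching line, so only the first line of each run is printed.
first_lines=$(awk '/^interesting line/ { if (!in_group) print; in_group = 1; next }
                   { in_group = 0 }' "$workdir/data.txt")
printf '%s\n' "$first_lines"

rm -r "$workdir"
```

The ^ anchor keeps "a non interesting line" from counting as a match; drop it if a match anywhere in the line should count.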

Why does "grep -w" match strings ending with "." or "$"? [duplicate]

I am using Debian 8.4 in VirtualBox, and let's say I have a text file named sample.txt containing:
Linux.
Linux$
Then I ran the command grep -w Linux sample.txt and the output was
Linux.
Linux$
So I was wondering why it matches those lines, since I specified the -w option, which is supposed to match the exact string only?
Both $ and . are non-word constituent characters, so -w matches Linux in both lines, nothing else.
man grep states that:
-w, --word-regexp
Select only those lines containing matches that form whole words. The
test is that the matching substring must either be at the beginning of
the line, or preceded by a non-word constituent character. Similarly,
it must be either at the end of the line or followed by a non-word
constituent character. Word-constituent characters are letters,
digits, and the underscore. This option has no effect if -x is also
specified.
This means that Linux will be matched in all cases where this text is surrounded by anything but letters, digits and the underscore.
To see exactly what grep is matching, use -o to print only the matched part:
$ echo "Linux.
Linux$" | grep -wo Linux
Linux
Linux
So it is just Linux that gets matched.
Option -w has the semantics of matching "whole words". A word delimiter is a change of character class, e.g. from a letter to a symbol or to punctuation, so x$ contains a word delimiter between its two characters, and so does x followed by a period.
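A small sketch to make the rule concrete: it runs grep -w and grep -x over a few variations of the word. The extra sample lines (Linux_x, Linux2, Linux-y) are made up for illustration and are not from the question.

```shell
sample=$(printf '%s\n' 'Linux' 'Linux.' 'Linux$' 'Linux_x' 'Linux2' 'Linux-y')

# -w: the match must be delimited by non-word characters or line boundaries.
# The underscore and the digit are word constituents, so those lines are skipped.
word_matches=$(printf '%s\n' "$sample" | grep -w 'Linux')

# -x: the match must span the entire line.
line_matches=$(printf '%s\n' "$sample" | grep -x 'Linux')

printf 'with -w:\n%s\n' "$word_matches"
printf 'with -x:\n%s\n' "$line_matches"
```

If the goal is "the line is exactly this string", -x is the option to reach for rather than -w.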

how to print last few lines when a pattern is matched using sed?

I want to print the last few lines when a pattern is matched in a file, using sed.
if file has following entries:
This is the first line.
This is the second line.
This is the third line.
This is the forth line.
This is the Last line.
So: search for the pattern "Last" and print the last few lines.
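Use sed to find 'Last' and pipe its output to tail, which prints the last n lines of its input; -n specifies how many lines to print from the end, here the last 2. Note that without sed's own -n option, sed also passes every non-matching line through and prints each matching line twice, so tail is working on the whole file plus the duplicated matches: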
sed '/Last/ p' yourfile.txt|tail -n 2
For more info on tail use man tail.
Also, the | symbol here is known as a pipe (an unnamed pipe), which provides inter-process communication. In simple words, sed feeds its output to the tail command through the pipe.
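To see what the pipeline does with and without sed's -n option, here is a small sketch on the sample file from the question. The scratch directory and the variable names are only for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'This is the first line.' 'This is the second line.' \
    'This is the third line.' 'This is the forth line.' \
    'This is the Last line.' > "$workdir/yourfile.txt"

# Without -n, sed echoes every line and prints matching lines a second time.
without_n=$(sed '/Last/p' "$workdir/yourfile.txt" | tail -n 2)

# With -n, only the lines selected by p reach tail.
with_n=$(sed -n '/Last/p' "$workdir/yourfile.txt" | tail -n 2)

printf 'without -n:\n%s\n' "$without_n"
printf 'with -n:\n%s\n' "$with_n"

rm -r "$workdir"
```

On this sample the first variant shows the matching line twice, and the second shows it once.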
I assume you mean "find the pattern and also print the previous few lines". grep is your friend: to print the previous 3 lines:
$ grep -B 3 "Last" file
This is the second line.
This is the third line.
This is the forth line.
This is the Last line.
-B n for "before". There's also -A n ("after"), and -C n ("context", both before and after).
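A quick sketch of the other two options on the same sample file, matching "third" instead of "Last" so that there are lines on both sides of the match. The variable names are made up for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'This is the first line.' 'This is the second line.' \
    'This is the third line.' 'This is the forth line.' \
    'This is the Last line.' > "$workdir/file"

# -A 1: the matching line plus one line after it.
after=$(grep -A 1 'third' "$workdir/file")

# -C 1: one line of context on each side of the match.
context=$(grep -C 1 'third' "$workdir/file")

printf 'with -A 1:\n%s\n' "$after"
printf 'with -C 1:\n%s\n' "$context"

rm -r "$workdir"
```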
This might work for you (GNU sed):
sed ':a;$!{N;s/\n/&/2;Ta};/Last/P;D' file
This will print the line containing Last and the two previous lines.
N.B. This will only print the lines before the match once. Also, more lines can be shown by changing the 2 to however many lines you want.

I'm having problems creating a sed script

My teacher didn't really go over sed scripts so they're very confusing. I need to finish this to get an A though. It's due very soon so I doubt I have time to fully understand it because I don't understand the syntax at all.
The instructions are:
Create a sed script named script3 that will print a file with “The Raven” at the top, replace every occurrence of multiple spaces with a single space, and print a line of 30 dashes below each line.
This is what I have so far:
sed script
echo "The Raven"
s/[ ]\{2,\}/ /g
/\,\./ s/^/------------------------------ /
Using the online GNU sed manual:
use the i command with an address to insert the title
use the a command with no address to append the separator after every line
use s/[[:blank:]]\+/ /g to replace any run of horizontal whitespace characters with a single space (see the section on sed regular expressions in the manual)
Create a sed script named script3
So create a text file called script3 that is invoked with sed -f script3 YourSourceFile
Its content is
1 i\
The Raven
s/ \{2,\}/ /g
a\
------------------------------
that will print a file with
“The Raven” at the top
1 i\
The Raven
On the first line (1), insert (i) the text that follows; the text continues onto the next line for as long as the current line ends with a \.
replace every occurrence of multiple spaces with a single space
s/ \{2,\}/ /g
Substitute (s///) the pattern between the first pair of / separators, a space followed by \{2,\}, which means two or more occurrences of that preceding space, with the single space between the second pair. The g option applies this to every occurrence on the line. Because the command has no line number or filter pattern in front of it (unlike the 1 on the previous command), it runs on every line.
and print a line of 30 dashes below each line
a\
------------------------------
a works like i but appends instead of inserting. No filter pattern or line number restricts it, so it also acts on every line.
i and a add text to the output but not to the working buffer (the pattern space), so that text cannot be manipulated by later commands in the same script. s/// is different: it works on the current working buffer, which is printed once the line has been fully processed, before the next line is read.
This is what I have so far:
About your script
echo "The Raven"
That is a shell command, not a sed command, so it cannot work inside a sed script like this. (It could be done with shell substitution, but that is normally not the goal.)
s/[ ]\{2,\}/ /g
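That works; the bracket expression [ ] around a single space is not needed, but it is valid. To cover other whitespace, use [[:blank:]] (space and tab) or [[:space:]] (which also includes newlines and other vertical whitespace). [[:space:]] already covers everything in [[:blank:]], so combining the two adds nothing.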
/\,\./ s/^/------------------------------ /
This means: on each line containing ,. (a comma followed by a period, in that order), replace the start of the line with 30 dashes and a space, i.e. prepend them. So it does not act on every line, and it adds the dashes at the start of the line rather than after it. One way to do it with s/// would be:
s/.*/&\
------------------------------/
This replaces the whole content of each line (matched at once by .*) with itself (&), followed by a newline and 30 dashes.
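Putting the pieces together, here is a minimal end-to-end sketch that writes script3 and runs it. The two-line input file poem.txt and its content are hypothetical, and the dash line is generated with printf and tr so that the count is exactly 30.

```shell
workdir=$(mktemp -d)

# 30 dashes, generated rather than typed.
dashes=$(printf '%30s' '' | tr ' ' '-')

# Write script3. Each printf argument becomes one line of the script:
#   1i\             insert the following text before line 1
#   The Raven
#   s/ \{2,\}/ /g   squeeze runs of two or more spaces into one
#   a\              append the following text after every line
#   (the 30 dashes)
printf '%s\n' '1i\' 'The Raven' 's/ \{2,\}/ /g' 'a\' "$dashes" > "$workdir/script3"

# Hypothetical input with repeated spaces.
printf '%s\n' 'Once  upon a   midnight dreary' 'while I    pondered' > "$workdir/poem.txt"

result=$(sed -f "$workdir/script3" "$workdir/poem.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

The title appears once at the top, and each input line comes out with single spaces and a dash line below it.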

Filter out some input with GREP

Echo "Hello everybody!"
I need to check whether the input argument of a Linux script complies with my security needs. It should contain only a-z characters, 0-9 digits, spaces and the "+" sign. E.g.: "for 3 minutes do r51+r11"
This didn't work for me:
if grep -v '[0123456789abcdefghijklmnopqrstuvwxyz+ ]' /tmp/input; then
echo "THIS DOES NOT COMPLY!";
fi
Any clues?
You are telling grep:
Show me every line that does not contain [0123456789abcdefghijklmnopqrstuvwxyz+ ]
That only shows you lines that contain none of the characters above. So a line containing only other characters, like (), would match, but asdf() would not.
Instead, have grep show you every line that contains a character not in the list above:
if grep '[^0-9A-Za-z+ ]' file; then
In other words: if grep finds a character that is not a digit, a letter, a plus sign or a space, then the input does not comply.
You want to test the entire row (assuming there is only one row in /tmp/input), not just whether a single character anywhere matches, so you need to anchor the pattern to the start and end of the row. Try this regexp:
^[0123456789abcdefghijklmnopqrstuvwxyz+ ]*$
Note that you can shorten this using ranges:
^[0-9a-z+ ]*$
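As a sketch of how the check could be wrapped for a script argument: the function name complies is made up for this example, and it assumes the input is a single line (grep works line by line, so embedded newlines are not themselves checked). LC_ALL=C is set so that the a-z range means exactly the 26 ASCII lowercase letters.

```shell
# Succeeds (exit status 0) only if the argument consists solely of
# lowercase ASCII letters, digits, spaces and plus signs.
complies() {
    # grep -q exits 0 when it finds a forbidden character, so negate it.
    ! printf '%s\n' "$1" | LC_ALL=C grep -q '[^0-9a-z+ ]'
}

if complies 'for 3 minutes do r51+r11'; then
    echo "input accepted"
else
    echo "THIS DOES NOT COMPLY!"
fi
```

Like the anchored pattern with *, this accepts an empty string; add a separate test if empty input should be rejected.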

Resources