Exact Match of Word using grep - linux

I have data in file.txt as follows
BRAD CHICAGO|NORTH SAMSONCHESTER|
CORA|NEW ERICA|
CAMP LOGAN|KINGBERG|
NCHICAGOS|ESTING|
CHICAGO|MANKING|
OCREAN|CHICAGO|
CHICAGO PIT|BULL|
CHICAGO |NEWYORK|
Question 1:
I want to search for the exact match for word "CHICAGO" in first column and print second column.
Output should look like:
MANKING
NEWYORK
Question 2:
If multiple matches are found, can the output be limited to just one, so that only MANKING or NEWYORK is printed?
I tried the following:
grep -E -i "^CHICAGO" file.txt | awk -F '|' '{print $2}'
but I get this output:
MANKING
BULL
NEWYORK
Expected output for Question 1:
MANKING
NEWYORK
Expected output for Question 2:
MANKING

Here are some more ways:
Using grep and cut:
grep "^CHICAGO|" file.txt | cut -d'|' -f2
Using awk
awk -F"|" '/^CHICAGO\|/{print $2}' file.txt
For question 2 simply pipe it to head, i.e:
grep "^CHICAGO|" file.txt | cut -d'|' -f2 | head -n1
Similarly for the awk command.
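One detail worth checking against the sample data: the row CHICAGO |NEWYORK| has a space before the pipe, so the pattern "^CHICAGO|" does not match it. The sketch below uses a three-line excerpt of the question's data and assumes that trailing spaces in the first column should be tolerated, as the expected output suggests. The variable names are only for illustration.

```shell
# Three rows taken from the question's file.txt.
data=$(mktemp)
printf '%s\n' 'CHICAGO|MANKING|' 'CHICAGO PIT|BULL|' 'CHICAGO |NEWYORK|' > "$data"

# Pattern from the answer above: the pipe must follow CHICAGO immediately.
strict=$(grep '^CHICAGO|' "$data" | cut -d'|' -f2)

# ERE variant: allow optional spaces between CHICAGO and the literal pipe.
padded=$(grep -E '^CHICAGO *\|' "$data" | cut -d'|' -f2)

# Question 2: keep only the first match.
first=$(grep -E '^CHICAGO *\|' "$data" | cut -d'|' -f2 | head -n 1)

printf 'strict:\n%s\npadded:\n%s\nfirst:\n%s\n' "$strict" "$padded" "$first"
rm -f "$data"
```

Here strict contains only MANKING, while padded contains MANKING and NEWYORK.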

How about an awk solution?
awk -F'|' '$1 == "CHICAGO"{print $2}' file
To print only one result, exit after the first match:
awk -F'|' '$1 == "CHICAGO"{print $2; exit}' file
To make that more generic, you can pass the target in as a variable and use a regex that also tolerates trailing spaces (as in CHICAGO |NEWYORK|):
awk -v trgt="CHICAGO" -F'|' '{targ="^" trgt " *$"; if ( $1 ~ targ ) {print $2}}' file
The " *$" part of the regex allows zero or more trailing spaces but no other characters after the target string. So this meets your criteria: it matches CHICAGO |NEWYORK| and skips CHICAGO PIT|BULL|.
This can be further reduced to
awk -v trgt="CHICAGO" -F'|' '{ if ( $1 ~ "^" trgt " *$" ) {print $2}}' file
which constructs the regex in place, as part of the comparison.
You can either use more verbose variable names to "describe" how the regex is built from the input and the regex "wrappers" (as in the 3rd example), or combine the input variable with the regex syntax in place. Which you choose is a matter of taste or documentation conventions.
You might want to include a comment explaining that the constructed test is equivalent to $1 ~ /^CHICAGO *$/.
IHTH
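A possible variation, not from the answer above: trim the trailing spaces first and then compare as plain strings. This avoids treating the target as a regex, which may matter if it ever contains characters such as . or *. The sketch runs against the question's sample data; the variable names key, by_string and by_trim are made up for illustration.

```shell
# Sample data copied from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
BRAD CHICAGO|NORTH SAMSONCHESTER|
CORA|NEW ERICA|
CAMP LOGAN|KINGBERG|
NCHICAGOS|ESTING|
CHICAGO|MANKING|
OCREAN|CHICAGO|
CHICAGO PIT|BULL|
CHICAGO |NEWYORK|
EOF

# Plain string comparison: "CHICAGO " (trailing space) is not equal to "CHICAGO".
by_string=$(awk -F'|' '$1 == "CHICAGO" {print $2}' "$data")

# Copy $1, strip trailing spaces from the copy, then compare as strings.
by_trim=$(awk -v trgt="CHICAGO" -F'|' '
    { key = $1; sub(/ +$/, "", key) }
    key == trgt { print $2 }' "$data")

printf 'by_string:\n%s\nby_trim:\n%s\n' "$by_string" "$by_trim"
rm -f "$data"
```

With this data, by_string holds only MANKING and by_trim holds MANKING and NEWYORK. Adding exit after the print would again stop at the first match.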

Related

How to get 1st field of a file only when 2nd field matches a string?

How to get 1st field of a file only when 2nd field matches a given string?
#cat temp.txt
Ankit pass
amit pass
aman fail
abhay pass
asha fail
ashu fail
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'*
gives no output
Another syntax with awk:
awk '$2 ~ /^fail$/{print $1}' input_file
This drops the unnecessary cat command.
^ anchors the start of the string
$ anchors the end of the string
Anchoring both ends makes the pattern match the whole field.
Either:
Your fields are not tab-separated or
You have blanks at the end of the relevant lines or
You have DOS line-endings and so there are CRs at the end of every
line and so also at the end of every $2 in every line (see
Why does my tool output overwrite itself and how do I fix it?)
With GNU cat you can run cat -Tev temp.txt to see tabs (^I), CRs (^M) and line endings ($).
Your code seems to work fine when I remove the * at the end
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'
The other thing to check is if your file is using tab or spaces. My copy/paste of your data file copied spaces, so I needed this line:
cat temp.txt | awk '$2 == "fail" { print $1 }'
The other way of doing this is with grep:
cat temp.txt | grep 'fail$' | awk '{ print $1 }'
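The failure modes listed above can be reproduced with a small sketch. It assumes a hypothetical file whose fields are separated by spaces and whose lines end in CRLF, which is only one of the possible causes. The fix shown, stripping a trailing carriage return before comparing, is one option among several.

```shell
# Hypothetical sample: space-separated fields, DOS (CRLF) line endings.
tmp=$(mktemp)
printf 'Ankit pass\r\naman fail\r\nasha fail\r\n' > "$tmp"

# With a tab separator each whole line becomes $1, so $2 is empty: no output.
tab_split=$(awk -F'\t' '$2 == "fail" {print $1}' "$tmp")

# Default whitespace splitting, but $2 is "fail" plus a CR: still no output.
cr_stuck=$(awk '$2 == "fail" {print $1}' "$tmp")

# Strip the trailing CR from the record (awk re-splits the fields), then compare.
fixed=$(awk '{ sub(/\r$/, "") } $2 == "fail" {print $1}' "$tmp")

printf 'tab_split=[%s]\ncr_stuck=[%s]\nfixed:\n%s\n' "$tab_split" "$cr_stuck" "$fixed"
rm -f "$tmp"
```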

awk - Delimiter as combination of number and | (pipe) not working

I have an input file with some records as below,
input.txt
Record|111|aaa|aaa|11|1-bb|bb|1111|cccc|cccc
Record|11|1-aaa|aaa|111|bb|bb|1111|cccc|cccc
Record|111|aaa|aaa|11|1-bb|bb|1111|cccc|cccc
Record|111|aaa|aaa|111|bb|bb|11|1-cccc|cccc
Record|22|aaa|aaa|222|bb|bb|2222|cccc|cccc|11|1-dddd|dd
Record|333|aaa|aaa|11|1-bb|bb|333|cccc|cccc
Record|11|1-aaa|aaa|102|bb|bb|1111|cccc|cccc
I want to use |11| as the delimiter in awk and get the second field. I tried the most common way, as below:
Command
awk -F'|11|' '{print $2}' input.txt
Output
1|aaa|aaa|
|1-aaa|aaa|
1|aaa|aaa|
1|aaa|aaa|
|1-dddd|dd
|1-bb|bb|333|cccc|cccc
|1-aaa|aaa|102|bb|bb|
Expected Output
1-bb|bb|1111|cccc|cccc
1-aaa|aaa|111|bb|bb|1111|cccc|cccc
1-bb|bb|1111|cccc|cccc
1-cccc|cccc
1-dddd|dd
1-bb|bb|333|cccc|cccc
1-aaa|aaa|102|bb|bb|1111|cccc|cccc
Basically it's not considering the last | of the delimiter |11|; instead it seems to be taking |11 as the delimiter.
I tried all of the below; none gave me the expected output:
awk -F"|11|" '{print $2}' input.txt # gives wrong output
awk -F\|11\| '{print $2}' input.txt # gives wrong output
awk -v FS='|11|' '{print $2}' input.txt # gives wrong output
Finally I had to write a for loop inside awk with | as the delimiter to make it work. I would like to know why the simple solution doesn't work.
The argument to -F is a regex, and | is the alternation operator in a regex, so the pipes must be escaped to match literally.
awk -F "\\\|11\\\|" '{print $2}' file
or
awk -F '\\|11\\|' '{print $2}' file
or (Thanks to EdMorton)
awk -F'[|]11[|]' '{print $2}' input.txt
Output:
1-bb|bb|1111|cccc|cccc
1-aaa|aaa|111|bb|bb|1111|cccc|cccc
1-bb|bb|1111|cccc|cccc
1-cccc|cccc
1-dddd|dd
1-bb|bb|333|cccc|cccc
1-aaa|aaa|102|bb|bb|1111|cccc|cccc
Cyrus explained why your delimiter does not work as expected (a combination of regular expression quoting issues).
With sed, removing everything up to and including the |11| on each line:
$ sed 's/.*|11|//' input.txt
1-bb|bb|1111|cccc|cccc
1-aaa|aaa|111|bb|bb|1111|cccc|cccc
1-bb|bb|1111|cccc|cccc
1-cccc|cccc
1-dddd|dd
1-bb|bb|333|cccc|cccc
1-aaa|aaa|102|bb|bb|1111|cccc|cccc
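To make the reason concrete: the output in the question is consistent with awk splitting on just 11, because the unescaped pipes act as alternation rather than literal characters. The sketch below takes the first record from input.txt and shows two ways to treat |11| literally. The bracket form comes from the answer above; the index/substr form is an additional, regex-free assumption of mine.

```shell
# First record from the question's input.txt.
line='Record|111|aaa|aaa|11|1-bb|bb|1111|cccc|cccc'

# Bracket expressions make each pipe a literal character in the separator regex.
bracket=$(printf '%s\n' "$line" | awk -F'[|]11[|]' '{print $2}')

# Regex-free alternative: find the literal string and print what follows it.
# The 4 is the length of "|11|".
plain=$(printf '%s\n' "$line" |
    awk '{ i = index($0, "|11|"); if (i) print substr($0, i + 4) }')

printf '%s\n%s\n' "$bracket" "$plain"
```

Both commands print 1-bb|bb|1111|cccc|cccc for this record. Note that the index/substr form prints everything after the first |11|, whereas $2 stops at the next |11| if there is one.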

Linux awk with condition

I have a very large file (2.5M records) with 2 columns separated by |.
I would like to keep all records that do not contain the value "-1" in the second column and write them to a new file.
I tried to use:
grep -v "-1" norm_cats_21_07_assignments.psv > norm_cats_21_07_assignments.psv
but no luck.
For a quick and dirty solution, you can simply add | to your grep:
grep -v "|-1" input.psv > output.psv
This assumes that rows to be ignored look like
something|-1
Note that if you ever need to use grep -v "-1", you have to add -- after options, otherwise grep will treat -1 as an option, something like this:
grep -v -- "-1"
You could do this through awk,
awk -F"|" '$2~/^-1$/{next}1' file > newfile
Example:
$ cat r
foo|-1
foo|bar
$ awk -F"|" '$2~/^-1$/{next}1' r
foo|bar
You can have:
awk -F'|' '$2 != "-1"' file.psv > new_file.psv
Or
awk -F'|' '$2 !~ /-1/' file.psv > new_file.psv
!= compares against the whole column, while !~ drops a row if -1 appears anywhere in the column.
Edit: Just noticed that your input file and output file are the same. You can't do that: the shell truncates the output file before the command even starts reading it.
With awk, after creating the new filtered file (e.g. new_file.psv), you can save it back with cat new_file.psv > file.psv or mv new_file.psv file.psv.
If you have exactly 2 columns separated by |, with no spaces in between and no quotes around the values, you can just use in-place editing with sed:
sed -i '/|-1$/d' file.psv
Or perhaps something equivalent to awk -F'|' '$2 !~ /-1/':
sed -i '/|.*-1/d' file.psv
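The difference between the two awk tests, and a safe way to write the result back, can be sketched as follows. The sample rows baz|-10 and qux|1-1 are hypothetical and were added only to show where != and !~ disagree.

```shell
# foo|-1 and foo|bar come from the example above; the other rows are made up.
src=$(mktemp)
printf 'foo|-1\nfoo|bar\nbaz|-10\nqux|1-1\n' > "$src"

# Whole-column string comparison: only the row whose column is exactly -1 is dropped.
exact=$(awk -F'|' '$2 != "-1"' "$src")

# Regex match anywhere in the column: -10 and 1-1 are dropped as well.
partial=$(awk -F'|' '$2 !~ /-1/' "$src")

# Replace the file safely: write to a temporary file, then move it over the
# original only if awk succeeded.
tmp=$(mktemp)
awk -F'|' '$2 != "-1"' "$src" > "$tmp" && mv "$tmp" "$src"
after=$(cat "$src")

printf 'exact:\n%s\npartial:\n%s\n' "$exact" "$partial"
rm -f "$src"
```

Here exact keeps foo|bar, baz|-10 and qux|1-1, while partial keeps only foo|bar.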

how can i get certain columns and certain rows from file with egrep and awk

This is my data and file name : example.txt
id name lastname point
1234;emanuel;emenike;2855
1357;christian;baroni;398789
1390;alex;souza;23143
8766;moussa;sow;5443
I want to see the name and point columns for the rows with these ids (1234, 1390), like this:
emanuel 2855
alex 23143
How can I do this on the Linux command line with awk and egrep?
You can try this:
awk -F\; '$1=="1234" || $1=="1390" {print $2,$4}' file
Using grep and cut:
grep '^\(1234\|1390\);' input | cut -d\; --output-delimiter=' ' -f2,4
A variation with awk:
awk -F\; '$1~/^(1234|1390)$/ {print $2,$4}' file
emanuel 2855
alex 23143
Through awk,
awk -F';' '$1~/^1234$/ || $1~/^1390$/ {print $2,$4}' file
Example:
$ cat ccc
id name lastname point
1234;emanuel;emenike;2855
1357;christian;baroni;398789
1390;alex;souza;23143
8766;moussa;sow;5443
$ awk -F';' '$1~/^1234$/ || $1~/^1390$/ {print $2,$4}' ccc
emanuel 2855
alex 23143
Use the GNU version of awk (= gawk) in a two step approach to make your solution very flexible:
Step 1:
Parse your data file (e.g., example.txt) to generate a gawk lookup function (stored here in "function_library.awk"):
$ /PATH/TO/generate_awk_function.sh /PATH/TO/example.txt
"generate_awk_function.sh" is just a shell wrapper around a gawk script that prints the function:
#! /bin/bash -
gawk 'BEGIN {
    FS = ";"
    OFS = "\t"
    print "#### gawk function library \"function_library.awk\""
    print "function lookup_value(key, value_for_key) {"
}
{
    # Skip the header line; emit one array assignment per data row.
    if (NR > 1) print "\tvalue_for_key[" $1 "] = \"" $2 OFS $4 "\""
}
END {
    print "\tprint value_for_key[key]"
    print "}"
}' "$1" > function_library.awk
You have generated this lookup function:
$ cat function_library.awk
#### gawk function library "function_library.awk"
function lookup_value(key, value_for_key) {
value_for_key[1234] = "emanuel 2855"
value_for_key[1357] = "christian 398789"
value_for_key[1390] = "alex 23143"
value_for_key[8766] = "moussa 5443"
print value_for_key[key]
}
Adapt "generate_awk_function.sh" for your needs:
a) FS=";" is setting the field separator in your input file (here a semicolon)
b) OFS="\t" is setting the output field separator (here a TAB)
You only have to generate this gawk "lookup-function" anew when your "example.txt" has changed.
Step 2:
Read your IDs to look up your results:
$ cat id.txt
1234
1390
$ gawk -i function_library.awk '{lookup_value($1)}' id.txt
emanuel 2855
alex 23143
You can also use this approach in a pipe like this:
$ cat id.txt | gawk -i function_library.awk '{lookup_value($1)}'
or like this:
$ echo 1234 | gawk -i function_library.awk '{lookup_value($1)}'
You can adapt this approach if your lookup string (1234) or file (id.txt) contains additional unwanted data ("noise"), using simple awk means:
a) Here, too, you can define a field separator, e.g., by setting it to a colon (:)
$ gawk -F":" -i function_library.awk '{lookup_value($5)}' id.txt
b) You can use the nth field of your lookup string, e.g., setting it from the 1st field to the 5th field just by changing the lookup_value from $1 to $5:
$ gawk -i function_library.awk '{lookup_value($5)}' id.txt
Please be aware that the '-i' command-line option is only supported by the GNU version of awk (= gawk).
HTH
bernie
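For comparison, a portable sketch that does not rely on gawk's -i option or a generated function library. This is a different technique from the answer above: the common two-file awk idiom, where the id list is read first into an array and the data file is filtered against it. The array name want is hypothetical, and the sketch assumes the id file is not empty.

```shell
# Data file from the question and an id list as in the answer above.
data=$(mktemp)
ids=$(mktemp)
cat > "$data" <<'EOF'
id name lastname point
1234;emanuel;emenike;2855
1357;christian;baroni;398789
1390;alex;souza;23143
8766;moussa;sow;5443
EOF
printf '1234\n1390\n' > "$ids"

# NR == FNR is true only while the first file (the id list) is being read.
# After that, skip the header line and print name and point for wanted ids.
result=$(awk -F';' '
    NR == FNR { want[$1]; next }
    FNR > 1 && ($1 in want) { print $2, $4 }' "$ids" "$data")

printf '%s\n' "$result"
rm -f "$data" "$ids"
```

This prints emanuel 2855 and alex 23143, in the order the rows appear in the data file rather than the order of the id list.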

how to get required field from file on linux?

I have one file which contains three fields separated by two spaces. I need to get only third field from file. File content is as in following example:
kuldeep Mirat Shakti
balaji salunke pune
.
.
.
How can I get the third field?
To get the 3rd field, assuming you don't have any "embedded spaces", just
awk '{print $3}' file
By default awk splits fields on runs of whitespace, so even with 2 or more spaces between fields, the 3rd field is always $3.
However, if you want to be specific, then specify a Field delimiter
awk -F" " '{print $3}' file
If you have other choices, a Ruby one
ruby -F" " -ane 'print $F[2]' file
ruby -ane 'print $F[2]' file
Update: If you need to get all fields after 3rd,
awk -F" " '{$1=$2=$3=""}1' OFS=" " file # add a pipe to `sed 's/^[ \t]*//'` if desired
ruby -F" " -ane 'puts $F[3..-1].join(" ")' file
Use awk:
awk -F'  ' '{print $3}' file
With the two-space delimiter, this also works if fields contain single embedded spaces.
To get the third field of each line, pipe through awk, e.g
cat filename | awk '{print $3}'
If you just want to get the third field of the first line, use head, too:
cat filename | head -n 1 | awk '{print $3}'
Given @balaji's comment to @kurani's answer:
perl -pe 's/^.*? .*? //' filename
awk -F' ' '{for(i=3; i<NF; i++) {printf("%s%s",$i,FS)}; print $NF}' filename
less filename | cut -d" " -f 3
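Since the question says the fields are separated by two spaces, the choice of field separator matters as soon as a field contains a single space. A small sketch, using a hypothetical line whose third field is "Shakti Nagar":

```shell
# Hypothetical record: two spaces between fields, one space inside the third field.
line='kuldeep  Mirat  Shakti Nagar'

# Default splitting treats any run of whitespace as a separator.
default_split=$(printf '%s\n' "$line" | awk '{print $3}')

# A two-character FS is treated as a regex that matches exactly two spaces.
two_space=$(printf '%s\n' "$line" | awk -F'  ' '{print $3}')

printf '[%s]\n[%s]\n' "$default_split" "$two_space"
```

The default split yields Shakti, while the two-space separator yields Shakti Nagar.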
