save multiple matches in a list (grep or awk) - text

I have a file that looks something like this:
# a mess of text
Hello. Student Joe Deere has
id number 1. Over.
# some more messy text
Hello. Student Steve Michael Smith has
id number 2. Over.
# etc.
I want to record the pairs (Joe Deere, 1), (Steve Michael Smith, 2), etc. into a list (or two separate lists with the same order). Namely, I will need to loop over those pairs and do something with the names and ids.
(names and ids are on distinct lines, but come in the order: name1, id1, name2, id2, etc. in the text). I am able to extract the lines of interest with
VAR=$(awk '/Student/,/Over/' filename.txt)
I think I know how to extract the names and ids with grep, but it will give me the result as one big block like
`Joe Deere 1 Steve Michael Smith 2 ...`
(and maybe even with a separator between names and ids). I am not sure at this point how to go forward with this, and in any case it doesn't feel like the right approach.
I am sure that there is a one-liner in awk that will do what I need. The possibilities are infinite and the documentation monumental.
Any suggestion?

$ cat tst.awk
/^id number/ {
    gsub(/^([^ ]+ ){2}| [^ ]+$/, "", prev)
    printf "(%s, %d)\n", prev, $3
}
{ prev = $0 }
$ awk -f tst.awk file
(Joe Deere, 1)
(Steve Michael Smith, 2)

You could also try the following:
awk '
/id number/ {
    sub(/\./, "", $3)
    print val ", " $3
    val = ""
    next
}
{
    gsub(/Hello\. Student | has.*/, "")
    val = $0
}
' Input_file

grep -oP 'Hello\. Student \K.+(?= has)|id number \K\d+' file | paste - -
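In that one-liner, `\K` resets the start of the reported match, so grep prints only the name or the id, one per line; `paste - -` then joins every two consecutive lines with a tab. A small demonstration of the paste step on its own:

```shell
# paste - - reads stdin twice per output line, joining every two
# consecutive input lines with a tab.
printf 'Joe Deere\n1\nSteve Michael Smith\n2\n' | paste - -
```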

Related

Awk iteratively replacing strings from array

I've been recently trying to do the following in awk -
we have two files (F1.txt and F2.txt.gz). While streaming the second one, I want to replace all occurrences of entries from F1.txt with their substrings. I came to this point:
zcat F2.txt.gz |
awk 'NR==FNR { a[$1]; next }
     {
         for (i in a)
             $0 = gsub(i, substr(i, 0, 2), $0)  # this does not work of course
     }
     { print $0 }
' F1.txt -
Was wondering how to do this properly in Awk. Thanks!
Please correct these assumptions if wrong: you have two files, one of which contains a set of entries, and if the second file contains any of those words, you want to replace them with their first two characters.
Example:
==> file1 <==
Azerbaijan
Belarus
Canada
==> file2 <==
Caspian sea is in Azerbaijan
Belarus is in Europe
Canada is in metric system.
$ awk 'NR==FNR { a[$1]; next }
       { for (i = 1; i <= NF; i++)
             if ($i in a) $i = substr($i, 1, 2) } 1' file1 file2
Caspian sea is in Az
Be is in Europe
Ca is in metric system.
Note that substring indexing starts at 1 in awk.
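A quick check of that 1-based indexing:

```shell
# substr(s, 1, 2) takes the first two characters; position 0 does not exist.
echo 'Azerbaijan' | awk '{ print substr($0, 1, 2) }'
```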
Try changing
$0=gsub(i, substr(i, 0, 2), $0)
into
gsub(i, substr(i, 1, 2))
(note the start index 1 rather than 0).
The return value of gsub() is the number of substitutions made, not the string after the replacement.
$0=gsub(i, substr(i, 0, 2), $0) #this does not work of course
GNU AWK's gsub() function modifies its 3rd argument in place (so that argument must be assignable) and returns the number of substitutions made. You should not use the return value if you just want the altered string.
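A small demonstration of both points, the return value and the in-place modification:

```shell
# gsub() modifies $0 (or its 3rd argument) in place and returns the
# number of substitutions it performed.
echo 'quick fox jumped over lazy dog' |
awk '{ n = gsub(/o/, "0"); print n; print }'
```

This prints 3 followed by the modified line.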
Consider following simple example, let file1.txt content be
a x
b y
c z
and file2.txt content be
quick fox jumped over lazy dog
then
awk 'FNR==NR{arr[$1]=$2;next}{for(i in arr){gsub(i,arr[i],$0)};print}' file1.txt file2.txt
gives output
quizk fox jumped over lxzy dog
Be warned that if there is any chain in your replacements, e.g.
a b
b c
then the output becomes dependent on the array traversal order.
(tested in gawk 4.2.1)
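A sketch of that order dependence (rules.txt and the single-letter input are made up for this demo): with the chained rules a->b and b->c, the result for input "a" depends on which rule the unordered "for (i in arr)" scan applies first.

```shell
# Chained replacement rules: a -> b and b -> c.
printf 'a b\nb c\n' > rules.txt

# Depending on the array traversal order, "a" becomes either "b"
# (a->b applied last) or "c" (a->b first, then b->c).
echo 'a' | awk 'FNR==NR { arr[$1] = $2; next }
                { for (i in arr) gsub(i, arr[i]); print }' rules.txt -
```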

How to use grep to match two strings in the same line

How can I use grep to find two terms / strings in one line?
The output, or an entry in the log file, should only be made if the two terms / strings have been found.
I have made the following attempts:
egrep -n --color '(string1.*string2)' debuglog.log
In this example, everything between the two strings is marked.
But I would like to see only the two found strings marked.
Is that possible?
Maybe you could do this with another tool; I am open to suggestions.
The simplest solution would be to first select only the lines that contain both strings, and then grep twice to color the matches, e.g.:
egrep 'string1.*string2|string2.*string1' debuglog.log |
egrep -n --color=always 'string1' | egrep --color 'string2'
It is important to set color to always; otherwise grep won't write the color information into the pipe.
Here is a single-command awk solution that wraps the matched strings in color codes:
awk '/string1.*string2/ {
    gsub(/string1|string2/, "\033[01;31m\033[K&\033[m"); print }' file
I know some people will disagree, but I think the best way is to do it like this.
Let's say this is your input:
$ cat > fruits.txt
apple banana
orange strawberry
coconut watermelon
apple peach
With this code you get exactly what you need, and it looks nicer and cleaner:
awk '{
    if ($0 ~ /apple/ && $0 ~ /banana/)
        print $0
}' fruits.txt
But, as I said before, some people will disagree, as it's too much typing. The short way with grep is just to concatenate greps, e.g.:
grep 'apple' fruits.txt | grep 'banana'
Regards!
I am a little confused about what you really want, as there was no sample data or expected output, but:
$ cat foo
1
2
12
21
132
13
And the awk that prints the matching parts of the records:
$ awk '
/1/ && /2/ {
    while (match($0, /1|2/)) {
        b = b substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
    }
    print b
    b = ""
}' foo
12
21
12
but it fails to print overlapping matches.

How to sort lines in textfile according to a second textfile

I have two text files.
File A.txt:
john
peter
mary
alex
cloey
File B.txt
peter does something
cloey looks at him
franz is the new here
mary sleeps
I'd like to:
- merge the two
- sort one file according to the other
- put the unknown lines of B at the end
like this:
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
$ awk '
NR==FNR { b[$1]=$0; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i in b) print b[i] }
' fileB fileA
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
The above will print the remaining items from fileB in a "random" order (see http://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array for details). If that's a problem then edit your question to clarify your requirements for the order those need to be printed in.
It also assumes the keys in each file are unique (e.g. peter only appears as a key value once in each file). If that's not the case, then again edit your question to include cases where a key appears multiple times in your sample input/output, and additionally explain how you want them handled.
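If the leftover fileB lines should instead come out in their original file order, here is a hedged variant of the script above (keeping an indexed order array alongside the lookup table; the choice of file order is an assumption, not something the question specifies):

```shell
cat > fileA <<'EOF'
john
peter
mary
alex
cloey
EOF
cat > fileB <<'EOF'
peter does something
cloey looks at him
franz is the new here
mary sleeps
EOF

# Same join as before, but the END block replays fileB's keys in file
# order and prints whichever lines were never consumed.
awk '
NR==FNR { b[$1] = $0; order[++n] = $1; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (j = 1; j <= n; j++) if (order[j] in b) print b[order[j]] }
' fileB fileA
```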

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john@mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column (since I am only matching first names) but still print the entire customer record?
I found it easy to "cut" the first column and grep --file=female.names.txt, but that way it no longer prints the entire record.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This prints the lines of your CSV file whose first name appears in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This prints the lines whose first name is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes each name in the list of female names into the regular expression ^name, so it matches only at the beginning of the line and only when followed by a comma. It then uses process substitution to use that as the pattern file to match against the data file.
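To see exactly what the process substitution feeds grep:

```shell
# sed turns each name into an anchored pattern: ^name,
printf 'Heather\nIrene\nJane\n' | sed 's/.*/^&,/'
```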
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;

BEGIN {
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}

print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following.
Suppose you have a file named test.txt containing the following lines:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field has vowels as its first and last letters. A simple grep would return both lines, but the following gives you the first line only, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines whose third field has vowels as its first and last letters. Similarly, a simple grep would return both lines, but the following gives you the second line only, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The quantifier {1,} in the first pair of curly braces specifies that the preceding character class (which ranges from 0 to z in the ASCII table) can occur one or more times. After that comes the field separator, a space in this case. Change the value in the second pair of curly braces ({0} or {2}) to the desired field number minus 1, then use a regular expression to state your criteria.

Get string between characters in Linux

I have text file, which has data like this:
asd.www.aaa.com
abc.abc.co
look at me
asd.www.bbb.com
bzc.bzc.co
asd.www.ddd.com
hello world
www.eee.com
xx.yy.z
I want the strings that are surrounded by "asd.www.[i want this string].com".
So my output will be like:
aaa
bbb
ddd
Try:
grep -Po '^asd\.www\.\K[^.]*(?=\.com)' file
If asd could be in the middle of the string, remove the leading ^.
There could be other corner cases, like greedy matching etc.; it depends on your source input.
I originally suggested cut, but I misread your question, so here is an alternative with awk instead. You are looking for the third column of your input, where there are a total of four columns.
awk -F '.' '{ if ($4 != "") print $3 }' file.txt
It splits each line on . and prints column $3 only if column $4 is non-empty. This should yield the following given your example text:
aaa
bbb
ddd
