AWK: confirming matching columns of consecutive rows - linux

Hello and thank you for taking the time to read this question. For the last day I have been trying to solve a problem and haven’t come any closer to a solution. I have a sample file of data that contains the following:
Fighter#Trainer
Bobby#SamBonen
Billy#BobBrown
Sammy#DJacobson
James#DJacobson
Donny#SonnyG
Ben#JasonS
Dave#JuanO
Derrek#KMcLaughlin
Dillon#LGarmati
Orson#LGarmati
Jeff#RodgerU
Brad#VCastillo
The goal is to identify "Trainers" that have more than one fighter. My gut feeling is that getline and variable assignment in AWK are going to be needed. I have tried different combinations of
awk -F# 'NR>1{a=$2; getline; if($2 = a) {print $0,"Yes"} else {print $0,"NO"}}' sample.txt
Yet, the output is nowhere near the desired results. In fact, it doesn’t even output all the rows in the sample file!
My desired results are:
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
I am completely lost as to where to go from here. I have been searching and trying to find a solution to no avail, and I'm looking for some input. Thank you!

You don't need getline. You could just process the input normally, building up counts per trainer, and print the result in an END block:
awk -F# '{
    lines[NR] = $0        # remember every line in input order
    trainers[NR] = $2     # trainer named on that line
    counts[$2]++          # number of fighters per trainer
}
END {
    print lines[1]        # header line, unchanged
    for (i = 2; i <= NR; i++)
        print lines[i] "#" (counts[trainers[i]] > 1 ? "YES" : "NO")
}' sample.txt

Another option is to make two passes:
$ cat p.awk
BEGIN {FS=OFS="#"}
NR==1 {print;next};
NR==FNR {++trainers[$2]; next}
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}
$ awk -f p.awk p.txt p.txt
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
Explained:
Set the input and output file separators:
BEGIN {FS=OFS="#"}
Print the header:
NR==1 {print;next};
First pass, count occurrences of each trainer:
NR==FNR {++trainers[$2]; next}
Second pass, set YES or NO according to trainer count, and print result:
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}

Related

Match lines based on patterns and reformat file Bash/Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file
2nd solution: Within a single read of the Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation (1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleting arr value with 2nd field index here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, then just run sort -k2,2 on your input first), this makes 1 pass, only stores one token at a time in memory, and produces the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
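Since the sample above is not fully grouped by the second column, CC_LlanR and EN_DavaW each appear twice in that output. Combining the two commands already shown (the suggested sort piped into tst.awk) should give one line per key, in sorted key order rather than the original order:
$ sort -k2,2 input.txt | awk -f tst.awk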
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
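For readability, the same GNU sed program can also be kept in a script file and run with sed -Ef group.sed file (the name group.sed is just an example); this is only a restatement of the one-liner above, one command per line:
# turn "id group" into "group(id)"
s/^(\S+)\s+(\S+)/\2(\1)/
# append the rewritten line to the hold space
H
# swap in the hold space and fold a repeated group into its earlier entry
x
s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/
# swap back, then delete every line except the last
x
$!d
# on the last line, fetch the accumulated result and drop the leading newline left by H
x
s/.//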

Filling empty spaces in a CSV file

I have a CSV file where some columns are empty such as
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
this left me with
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
this resulted in
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly I tried
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
this left me with
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because given contiguous commas like ,,, one regexp match would use up the first 2 ,s and so you'd be left with ,empty,,.
The above would change a completely empty line into empty, let us know if that's an issue.
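To see why one pass is not enough, here is a quick check on a throwaway string (not part of the answer above, just an illustration):
$ echo 'a,,,b' | sed -E 's/(^|,)(,|$)/\1empty\2/g'
a,empty,,b
$ echo 'a,,,b' | sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g'
a,empty,empty,b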
This is the awk command:
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," as awk allows chained assignment (which I did not know, thank you @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
Clearly you can put the script in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ","
}
{
    for (i = 1; i <= NF; ++i)
        if ($i == "")
            $i = "empty"
    print $0
}
and use it as a standalone program (you have to chmod +x it), even if this is known to have some drawbacks (consult the comments to this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Clearly you are still able to feed the above script to awk this way:
awk -f the_script_above file1 file2

Search for a string and print the line in a different order using Linux

I need to write a shell script that does the following which I am showing below with an example.
Suppose I have a file cars.txt which depicts a table like this
Person|Car|Country
The '|' is the separator. So the first two lines go like this:
Michael|Ford|USA
Rahul|Maruti|India
I have to write a shell script which will find the lines in the cars.txt file that have the country as USA and print them like
USA|Ford|Michael
I am not very adept with Unix so I need some help here.
Will this do?
while read -r i; do
    NAME="$(cut -d'|' -f1 <<<"$i")"
    MAKE="$(cut -d'|' -f2 <<<"$i")"
    COUNTRY="$(cut -d'|' -f3 <<<"$i")"
    echo "$COUNTRY|$MAKE|$NAME"
done < <(grep "USA$" cars.txt)
Updated To Locate USA Not 1st Line As Provided in Your Question
Using awk you can do what you are attempting in a very simple manner, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS}' cars.txt
USA|Ford|Michael
Explanation
awk -F'|' - read the file using '|' as the Field-Separator, specified as -F'|' at the beginning of the call, or as FS within the command itself,
/USA/ - locate only lines containing "USA",
for (i = NF; i >= 1; i--) - loop over fields in reverse order,
printf "%s%s", $i, i==1 ? RS : FS - output the field followed by a '|' (FS) if i is not equal 1 or by the Record-Separator (RS) which is a "\n" by default if i is equal 1. The form test ? true_val : false_val is just the ternary operator that tests if i == 1 and if so provides RS for output, otherwise provides FS for output.
It will be orders of magnitude faster than spawning 8-subshells using command substitutions, grep and cut (plus the pipes).
Printing Only The 1st Occurrence of Line Containing "USA"
To print only the first line with "USA", all you need to do is exit after processing, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS; exit}' cars.txt
USA|Ford|Michael
Explanation
simply adding exit to the end of the command will cause awk to stop processing records after the first one.
While both awk and sed take a little time to make friends with, together they provide the Unix-Swiss-Army-Knife for text processing. Well worth the time to learn both. It only takes a couple of hours to get a good base by going through one of the tutorials. Good luck with your scripting.

Grouping related rows of data into a single column in Linux

I have a csv file that gets generated daily and automatically that has output similar to the following example:
"N","3.5",3,"Bob","10/29/17"
"Y","4.5",5,"Bob","10/11/18"
"Y","5",6,"Bob","10/28/18"
"Y","3",1,"Jim",
"N","4",2,"Jim","09/29/17"
"N","2.5",4,"Joe","01/26/18"
I need to transform the text so that it is grouped by person (the fourth column), with all of that person's records on a single row and the columns repeated in the same sequence: 1, 2, 3, 5. Some cells may be missing data but must remain in the sequence so the columns line up. So the output I need will look like this:
"Bob","N","3.5",3,"10/29/17","Y","4.5",5,"10/11/18","Y","5",6,"10/28/18"
"Jim","Y","3",1,,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
I am open to using sed, awk, or pretty much any standard Linux command to get this task done. I've been trying to use awk, and though I get close, I can't figure out how to finish it.
Here is the command where I'm close. It lists the header and the names, but no other data:
awk -F"," 'NR==1; NR>1 {a[$4]=a[$4] ? i : ""} END {for (i in a) {print i}}' test2.csv
You need a little more code:
$ awk 'BEGIN {FS=OFS=","}
{k=$4; $4=$5; NF--; a[k]=(k in a?a[k] FS $0:$0)}
END {for(k in a) print k,a[k]}' file
"Bob","N","3.5",3,"10/29/17" ,"Y","4.5",5,"10/11/18" ,"Y","5",6,"10/28/18"
"Jim","Y","3",1, ,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
Note that the NF-- trick may not work in all awks.
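If your awk does not handle NF--, a portable variant is to build the joined record explicitly instead of deleting the last field; a rough sketch along the same lines (untested):
awk 'BEGIN{FS=OFS=","}
     {k=$4; rec=$1 OFS $2 OFS $3 OFS $5; a[k]=(k in a ? a[k] OFS rec : rec)}
     END{for (k in a) print k, a[k]}' file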
Could you please try the following too; reading the Input_file twice, it will provide output in the same sequence in which the 4th column appears in the Input_file.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  a[$4]=a[$4]?a[$4] OFS $1 OFS $2 OFS $3 OFS $5:$4 OFS $1 OFS $2 OFS $3 OFS $5
  next
}
a[$4]{
  print a[$4]
  delete a[$4]
}
' Input_file Input_file
If there is any chance that any of the CSV values has a comma, then a "CSV-aware" tool would be advisable to obtain a reliable but straightforward solution.
One approach would be to use one of the many readily available csv2tsv command-line tools. A variety of elegant solutions then becomes possible. For example, one could pipe the CSV into csv2tsv, awk, and tsv2csv.
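As a rough sketch of that pipeline idea (assuming csv2tsv and tsv2csv from one of the common TSV toolkits act as plain CSV-to-TSV converters and back, and reusing the NF-- grouping from the answer above, so it needs an awk where NF-- rebuilds the record, e.g. GNU awk):
csv2tsv < input.csv |
awk 'BEGIN{FS=OFS="\t"} {k=$4; $4=$5; NF--; a[k]=(k in a ? a[k] FS $0 : $0)} END{for (k in a) print k, a[k]}' |
tsv2csv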
Here is another solution that uses csv2tsv and jq:
csv2tsv < input.csv | jq -Rrn '
  [inputs | split("\t")]
  | group_by(.[3])[]
  | sort_by(.[2])
  | [.[0][3]] + ( map( del(.[3])) | add)
  | @csv
'
This produces:
"Bob","N","3.5","3","10/29/17 ","Y","4.5","5","10/11/18 ","Y","5","6","10/28/18 "
"Jim","Y","3","1"," ","N","4","2","09/29/17 "
"Joe","N","2.5","4","01/26/18"
Trimming the excess spaces is left as an exercise :-)

Using awk to calculate an average of numbers

I have a file named test.txt with the following:
10
200
3000
=======
4
5
=======
I need to write an awk script to take the text in this file as input into the awk script and output:
10
200
3000
Average 1070.00
4
5
Average 4.50
I wrote my script like this:
{while($1!~"=======") s+=$1;}
{print "Average ", s}
Every time I run this code, I use:
awk -f awrp4 test.txt
But it crashes. I don't know what I'm doing wrong. I'm a beginner and trying to learn about the awk function so I apologize if this seems rather simple. Any help is welcome.
Using GNU awk, you can write:
gawk '
    BEGIN {FS = "\n"; RS = "\n=+\n"}
    NF > 0 {
        sum = 0
        for (i=1; i<=NF; i++) {
            print $i
            sum += $i
        }
        printf "Average %.2f\n", sum/NF
    }
' file
Certainly nothing wrong with Glenn's solution, but it might be a bit advanced for you. Maybe this is better suited:
{
    if ($1 == "=======") {
        print "\nAverage " s/i "\n";
        s=0;
        i=0;
    } else {
        print $1;
        s += $1;
        i += 1;
    }
}
As I mentioned in the comments, the nature of awk is to loop through every line of a text file. Unless you're doing some post-processing or working with arrays, a while loop probably isn't of much use.
awk '$1~/^[[:digit:]]/ {i++; sum+=$1; print $1} $1!~/[[:digit:]]/ {print "Average", sum/i; sum=0;i=0}' file
The first part checks if the first character of column 1 is a digit; if so, it increments the counter i and adds that record (number) to sum.
The second part matches any record that does not start with a digit (the separator lines); there it prints the average by dividing sum/i, and finally resets the counter and sum to 0.
