awk - pull out pair columns and get the count of occurrences - linux

I have a table schema: the names of the columns, comma-separated. For clarity, I'll show them one per line, as below
$ cat cols_name.txt
id
resp
x_amt
rate1
rate2
rate3
pay1
pay2
rate_r1
rate_r2
x_rate1
x_rate2
x_rate3
x_rate_r1
x_rate_r2
x_pay1
x_pay2
rev1
x_rev1
I need to find the columns that have an x_-prefixed counterpart ( pay1 -> x_pay1 ) and list each pair together as an intermediate output, like below
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_pay1 pay1
x_pay2 pay2
x_rate_r1 rate_r1
x_rate_r2 rate_r2
x_rev1 rev1
And then finally print the frequency as
pay 2
rate 3
rate_r 2
rev 1
In my attempt to get the intermediate output, the below awk command is not working.
awk ' NR==FNR { if( $1~/^x_/ ) a[$1]=1 ; next } $1~/"x_" a[$1]/ { print $0 } ' cols_name.txt cols_name.txt
It does not print anything. Could you please help me fix it?

Here is a single-pass awk to get it done:
awk '/^x_/ {xk[$0]; next} {s=$0; sub(/[0-9]+$/, "", s); xv[$0]=s} END {for (i in xv) if ("x_" i in xk) {print "x_" i, i; ++fq[xv[i]]}; print "== Summary =="; for (i in fq) print i, fq[i]}' file
x_rev1 rev1
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_rate_r1 rate_r1
x_pay1 pay1
x_rate_r2 rate_r2
x_pay2 pay2
== Summary ==
rate_r 2
rate 3
rev 1
pay 2
A more readable form:
awk '
/^x_/ {
    xk[$0]
    next
}
{
    s = $0
    sub(/[0-9]+$/, "", s)
    xv[$0] = s
}
END {
    for (i in xv)
        if ("x_" i in xk) {
            print "x_" i, i
            ++fq[xv[i]]
        }
    print "== Summary =="
    for (i in fq)
        print i, fq[i]
}' file
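For reference, the attempt in the question prints nothing because everything between the slashes is a regex literal: /"x_" a[$1]/ searches each line for the literal text "x_" a followed by $ or 1, instead of concatenating "x_" with an array element. Below is a minimal two-pass sketch of what was probably intended, testing membership with `in` and printing the summary in first-seen order. The function name `pair_counts` and the arrays `seen`, `order` and `cnt` are illustrative names, not taken from the posts, and the sketch assumes each column name occurs at most once.

```shell
# pair_counts FILE: list x_NAME/NAME pairs, then a count per base name.
# Reads FILE twice; assumes each column name occurs at most once.
pair_counts() {
    awk '
    NR == FNR {                      # pass 1: remember every x_-prefixed name
        if ($1 ~ /^x_/) seen[$1]
        next
    }
    ("x_" $1) in seen {              # pass 2: plain name whose x_ twin exists
        print "x_" $1, $1
        base = $1
        sub(/[0-9]+$/, "", base)     # strip trailing digits: rate_r1 becomes rate_r
        if (!(base in cnt)) order[++n] = base
        cnt[base]++
    }
    END {                            # summary in first-seen order
        for (i = 1; i <= n; i++) print order[i], cnt[order[i]]
    }' "$1" "$1"
}
```

Called as `pair_counts cols_name.txt`, this should print the pairs in file order followed by the per-name counts, with no dependence on the unspecified order of `for (i in array)`.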

Using any awk in any shell on every Unix box and assuming every entry in your input file occurs only once as in your posted example:
$ cat tst.awk
{
    sub(/^x_/,"")
    pair = "x_" $0 OFS $0
    if ( ++count[pair] == 2 ) {
        print pair
        sub(/[0-9]+$/,"")
        freq[$0]++
    }
}
END {
    print "---"
    for (key in freq) {
        print key, freq[key]
    }
}
$ awk -f tst.awk cols_name.txt
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_rate_r1 rate_r1
x_rate_r2 rate_r2
x_pay1 pay1
x_pay2 pay2
x_rev1 rev1
---
rate_r 2
rev 1
rate 3
pay 2

Assuming that the file is actually:
id,resp,x_amt,rate1,rate2,rate3,pay1,pay2,rate_r1,rate_r2,x_rate1,x_rate2,x_rate3,x_rate_r1,x_rate_r2,x_pay1,x_pay2,rev1,x_rev1
as suggested in the original post (it's not very clear), using GNU awk:
awk '{ split($0,map,",");for (i in map) { map1[map[i]]="1" } for (i in map) { if ( map[i] ~ /^x_/ ) { hd=gensub("x_","","g",map[i]);hd1=gensub("[[:digit:]]","","g",hd);if (map1[hd]=="1") { map2[hd1]++;print map[i]" "hd } } } printf "\n";for (i in map2) { print i" "map2[i] } }' cols_name.txt
Explanation:
awk '{
    split($0,map,","); # Split the line into an array called map, using comma as the separator
    for (i in map) {
        map1[map[i]]="1" # Loop through map and create another array map1 with the values of map as indexes
    }
    for (i in map) {
        if ( map[i] ~ /^x_/ ) {
            hd=gensub("x_","","g",map[i]); # Loop through map and if the value is prefixed with "x_", remove it, reading the result into hd
            hd1=gensub("[[:digit:]]","","g",hd); # Take any digits out of hd and read into hd1
            if (map1[hd]=="1") {
                map2[hd1]++; # Create a third array map2 with the index hd1 and the value an incrementing counter
                print map[i]" "hd # If a match exists in the map1 array, print the match
            }
        }
    }
    printf "\n";
    for (i in map2) {
        print i" "map2[i] # Loop through the count array and print the values
    }
}' cols_name.txt
Output:
x_pay2 pay2
x_rev1 rev1
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_rate_r1 rate_r1
x_rate_r2 rate_r2
x_pay1 pay1
rate_r 2
rate 3
rev 1
pay 2

With your shown samples, please try the following, written and tested in GNU awk.
awk -v s1="x_" '
FNR==NR{
  if($0~"^"s1){
    arr[$0]=$0
  }
  next
}
((s1 $0) in arr){
  print arr[s1 $0],$0
  gsub(/^x_|[0-9]+$/,"",$0)
  sum[$0]++
}
END{
  for(i in sum){
    print i,sum[i]
  }
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="x_" '                  ##Starting awk program from here.
FNR==NR{                          ##Condition is TRUE while Input_file is being read the first time.
  if($0~"^"s1){                   ##If the line starts with the value of s1 then do the following.
    arr[$0]=$0                    ##Creating arr with the current line as both index and value.
  }
  next                            ##next will skip all further statements from here.
}
((s1 $0) in arr){                 ##If s1 $0 is present in arr then do the following.
  print arr[s1 $0],$0             ##Printing the stored x_ name followed by the current line.
  gsub(/^x_|[0-9]+$/,"",$0)       ##Globally substituting a leading x_ and trailing digits with NULL in the current line.
  sum[$0]++                       ##Increasing the count in sum by 1 each time the cursor comes here.
}
END{                              ##Starting END block of this program from here.
  for(i in sum){                  ##Traversing through sum elements here.
    print i,sum[i]                ##Printing key and value of sum here.
  }
}
' Input_file Input_file           ##Mentioning Input_file names here.

Related

Changing previous duplicate line in awk

I want to change all duplicate names in .csv to unique, but after finding duplicate I cannot reach previous line, because it's already printed. I've tried to save all lines in array and print them in End section, but it doesn't work and I don't understand how to access specific field in this array (two-dimensional array isn't supported in awk?).
sample input
...,9,phone,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone,...
desired output
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
My attempt ($2 - id field, $3 - name field)
BEGIN{
FS=","
OFS=","
marker=777
}
{
if (names[$3] == marker) {
$3 = $3 $2
#Attempt to change previous duplicate
results[nameLines[$3]]=$3 id[$3]
}
names[$3] = marker
id[$3] = $2
nameLines[$3] = NR
results[NR] = $0
}
END{
#it prints some numbers, not saved lines
for(result in results)
print result
}
Here is a single-pass awk that buffers all records:
awk -F, '
{
    rec[NR] = $0
    ++fq[$3]
}
END {
    for (i=1; i<=NR; ++i) {
        n = split(rec[i], a, /,/)
        if (fq[a[3]] > 1)
            a[3] = a[3] a[2]
        for (k=1; k<=n; ++k)
            printf "%s", a[k] (k < n ? FS : ORS)
    }
}' file
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
This can be done easily by reading Input_file twice in awk, with no need for two-dimensional arrays. Written and tested with your shown samples in GNU awk.
awk '
BEGIN{FS=OFS=","}
FNR==NR{
  arr1[$3]++
  next
}
{
  $3=(arr1[$3]>1?$3 $2:$3)
}
1
' Input_file Input_file
Output will be as follows:
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...

move lines into a file by number of columns using awk

I have a sample file with '||o||' as field separator.
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
ngascash||o||||o||
ms-bronze.com.br||o||||o||
I want to move the lines with only 1 field in 1.txt and those having more than 1 field in not_1.txt. I am using the following command:
sed 's/\(||o||\)\+$//g' sample.txt | awk -F '[|][|]o[|][|]' '{if (NF == 1) print > "1.txt"; else print > "not_1.txt" }'
The problem is that it is moving not the original lines but the replaced ones.
The output I am getting is (not_1.txt):
td#the-end.org||o||srScSG2C5tg=||o||bnm
erba01#tiscali.it||o||4sQVj09gpls=
1.txt:
ngas
ms-inside#bol.com.br
As you can see the original lines are modified. I don't want to modify the lines.
Any help would be highly appreciated.
Awk solution:
awk -F '[|][|]o[|][|]' \
'{
    c = 0;
    for (i=1; i<=NF; i++) if ($i != "") c++;
    print > (c == 1? "1" : "not_1")".txt"
}' sample.txt
Results:
$ head 1.txt not_1.txt
==> 1.txt <==
ngascash||o||||o||
ms-bronze.com.br||o||||o||
==> not_1.txt <==
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
The following awk may help as well.
awk -F'\\|\\|o\\|\\|' '{for(i=1;i<=NF;i++){if($i!=""){count++}};if(count==1){print > "1_field_only"};if(count>1){print > "not_1_field"};count=""}' Input_file
A non-one-liner form of the solution:
awk -F'\\|\\|o\\|\\|' '
{
  for(i=1;i<=NF;i++){ if($i!=""){ count++ } };
  if(count==1) { print > "1_field_only" };
  if(count>1) { print > "not_1_field" };
  count=""
}
' Input_file
Explanation of the above code:
awk -F'\\|\\|o\\|\\|' ' ##Setting field separator as ||o|| here and escaping each | so it is taken as a literal character.
{
  for(i=1;i<=NF;i++){ if($i!=""){ count++ } }; ##Looping through all the fields, increasing count if a field is NOT null (compared against "" so that a field holding 0 still counts).
  if(count==1) { print > "1_field_only" }; ##If count is 1 the line has only 1 non-empty field, so printing current line into 1_field_only file.
  if(count>1) { print > "not_1_field" }; ##If count is more than 1, printing current line into output file named not_1_field.
  count="" ##Nullifying the variable count here.
}
' Input_file ##Mentioning Input_file name here.

LINUX Shell script to convert rows to multiple columns

Shell script to convert rows to multiple columns
input CSV file:
Driver Id,Driver Name,Measure Names,Measure Values
123,XYZ,Total Offers,10
123,XYZ,Driver Reject,0
123,XYZ,Driver Accept ,4
123,XYZ,Expired Offers,3
123,XYZ,Total Bookings,6
123,XYZ,Rider Cancels,2
123,XYZ,Driver Cancels,0
123,XYZ,Rider No-Show,0
123,XYZ,Completed Rides,4
124,PQR,Total Offers,2
124,PQR,Driver Reject,0
124,PQR,Driver Accept ,1
124,PQR,Expired Offers,1
124,PQR,Total Bookings,1
124,PQR,Rider Cancels,0
124,PQR,Driver Cancels,0
124,PQR,Rider No-Show,0
124,PQR,Completed Rides,1
Output Required:
Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides
123,XYZ,10,0,4,3,6,2,0,0,4
124,PQR,2,0,1,1,1,0,0,0,1
I tried with awk but it gives incorrect result.
awk -F\, '
BEGIN{
P["Total Offers"]="%s;%s;%s;;;;;;;;;\n"
P["Driver Reject"]="%s;%s;;%s;;;;;;;;\n"
P["Driver Accept"]="%s;%s;;;%s;;;;;;;\n"
P["Expired Offers"]="%s;%s;;;;%s;;;;;;\n"
P["Total Bookings"]="%s;%s;;;;;%s;;;;;\n"
P["Rider Cancels"]="%s;%s;;;;;;%s;;;;\n"
P["Driver Cancels"]="%s;%s;;;;;;;%s;;;\n"
P["Rider No-Show"]="%s;%s;;;;;;;;%s;;\n"
P["Completed Rides"]="%s;%s;;;;;;;;;%s;\n"
}
FNR==1{
print "Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides"
next
}
{
printf(P[$3],$1,$2,$4)
}
' sample1.csv
could somebody please assist me or show me any other method to implement this.
Thanks in Advance
Assuming your Input_file is the same as the shown sample, and if you don't need the output in the same order as the input, the following may help.
awk -F, 'FNR>1{a[$1,$2]=a[$1,$2]?a[$1,$2] FS $NF:$NF} END{for(i in a){print i FS a[i]}}' SUBSEP="," Input_file
The one below preserves the order of the output and also handles missing values, if there are any:
awk '
BEGIN{
    FS=OFS=SUBSEP=",";
}
FNR==1{
    printf("%s%s%s",$1,OFS,$2);
    next
}
{
    if(!(($1,$2) in tmp)){
        usr[++u] = $1 OFS $2
        tmp[$1,$2]
    }
    if(!($3 in tmp)){
        names[++n] = $3;
        tmp[$3]
        printf("%s%s",OFS,$3)
    }
    arr[$1,$2,$3] = $4
}
END{
    print ""
    for(u=1; u in usr; u++){
        printf("%s", usr[u]);
        for(n=1; n in names; n++){
            indexkey = usr[u] SUBSEP names[n]
            printf("%s%s",OFS, (indexkey in arr) ? arr[indexkey]:"")
        }
        print ""
    }
}
' infile
Explanation:
FS=OFS=SUBSEP=","; - Sets the field separator, the output field separator and the built-in variable SUBSEP to comma. In this program at least OFS and SUBSEP must be the same, because the array is accessed through indexkey = usr[u] SUBSEP names[n]; so if your input uses another field separator (say pipe), use FS="|"; OFS=SUBSEP=","
FNR==1{
printf("%s%s%s",$1,OFS,$2);
next
}
On the first line, print the first 2 fields and go to the next line
if(!(($1,$2) in tmp)){
usr[++u] = $1 OFS $2
tmp[$1,$2]
}
Since you want ordered output, a contiguous (in-order) array usr is used in this program. tmp is an array indexed by $1 and $2; usr is an array indexed by the counter u, with $1 and $2 as its value. The test if(!(($1,$2) in tmp)) makes sure each pair is stored only the first time it is seen.
if(!($3 in tmp)){
names[++n] = $3;
tmp[$3]
printf("%s%s",OFS,$3)
}
Similarly, the names array is contiguous, with $3 as its value.
arr[$1,$2,$3] = $4 stores $4 in array arr under a key built from the 3 fields $1,$2,$3.
Finally, the END block loops through the usr and names arrays, builds indexkey and prints the array value if indexkey exists in array arr.
Input :
$ cat infile
Driver Id,Driver Name,Measure Names,Measure Values
123,XYZ,Total Offers,10
123,XYZ,Driver Reject,0
123,XYZ,Driver Accept ,4
123,XYZ,Expired Offers,3
123,XYZ,Total Bookings,6
123,XYZ,Rider Cancels,2
123,XYZ,Driver Cancels,0
123,XYZ,Rider No-Show,0
123,XYZ,Completed Rides,4
124,PQR,Total Offers,2
124,PQR,Driver Reject,0
124,PQR,Driver Accept ,1
124,PQR,Expired Offers,1
124,PQR,Total Bookings,1
124,PQR,Rider Cancels,0
124,PQR,Driver Cancels,0
124,PQR,Rider No-Show,0
124,PQR,Completed Rides,1
Output:
Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept ,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides
123,XYZ,10,0,4,3,6,2,0,0,4
124,PQR,2,0,1,1,1,0,0,0,1
If the rows are not ordered by the required fields, you have to use an associative array.
$ awk -F, -v cols='Total Offers,Driver Reject,Driver Accept ,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides' '
BEGIN {n=split(cols,f)}
NR>1 {k=$1 FS $2; keys[k]; a[k,$3]=$4}
END {for(k in keys)
{printf "%s", k;
for(i=1;i<=n;i++) printf "%s%d", FS,+a[k,f[i]];
print ""}}' file
124,PQR,2,0,1,1,1,0,0,0,1
123,XYZ,10,0,4,3,6,2,0,0,4
This also handles missing measure rows, which are printed as 0.
P.S. Note that "Driver Accept " has a trailing space, which I kept.

file manipulation with command line tools on linux

I want to transform a file from this format
1;a;34;34;a
1;a;34;23;d
1;a;34;23;v
1;a;4;2;r
1;a;3;2;d
2;f;54;3;f
2;f;34;23;e
2;f;23;5;d
2;f;23;23;g
3;t;26;67;t
3;t;34;45;v
3;t;25;34;h
3;t;34;23;u
3;t;34;34;z
to this format
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
These are CSV files, so it should work with awk or sed ... but I have failed so far. If the first value is the same, I want to append the last three values to the first line, and so on until the last entry in the file.
Here some code in awk, but it does not work:
#!/usr/bin/awk -f
BEGIN{ FS = " *; *"}
{ ORS = "\;" }
{
x = $1
print $0
}
{ if (x == $1)
print $3, $4, $5
else
print "\n"
}
END{
print "\n"
}
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr == prev {
    sub(/^[^;]*;[^;]*/,"")
    printf "%s", $0
    next
}
{
    printf "%s%s", (NR>1?ORS:""), $0
    prev = curr
}
END { print "" }
$ awk -f tst.awk file
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
If I understand you correctly that you want to build a line from fields 3-5 of all lines with the same first two fields (preceded by those two fields), then
awk -F \; 'key != $1 FS $2 { if(NR != 1) print line; key = $1 FS $2; line = key } { line = line FS $3 FS $4 FS $5 } END { print line }' filename
That is
key != $1 FS $2 {                 # if the key (first two fields) changed
  if(NR != 1) print line;         # print the line (except at the very
                                  # beginning, to not get an empty line there)
  key = $1 FS $2                  # remember the new key
  line = key                      # and start building the next line
}
{
  line = line FS $3 FS $4 FS $5   # take the value fields from each line
}
END {                             # and at the very end,
  print line                      # print the last line (that the block above
}                                 # cannot handle)
You got good answers in awk. Here is one in perl:
perl -F';' -lane'
    $key = join ";", @F[0..1];           # Establish your key
    $seen{$key}++ or push @rec, $key;    # Remember the order
    push @{ $h{$key} }, @F[2..$#F]       # Build your data structure
}{
    $, = ";";                            # Set the output list separator
    print $_, @{ $h{$_} } for @rec' file # Print as per order
This is going to seem a lot more complicated than the other answers, but it's adding a few things:
It computes the maximum number of fields from all built up lines
Appends any missing fields as blanks to the end of the built up lines
The POSIX awk on a Mac doesn't maintain the order of array elements when using the for(key in array) syntax, even when the keys are numbered. To maintain the output order, you can keep track of it as I've done, or pipe to sort afterwards.
Having matching numbers of fields in the output appears to be a requirement per the specified output. Without knowing what it should be, this awk script is built to load all the lines first, compute the maximum number of fields in an output line then output the lines with any adjustments in order.
#!/usr/bin/awk -f
BEGIN {FS=OFS=";"}
{
    key = $1
    # create an order array for the mac's version of awk
    if( key != last_key ) {
        order[++key_cnt] = key
        last_key = key
    }
    val = a[key]
    # build up an output line in array a for the given key
    start = (val=="" ? $1 OFS $2 : val)
    a[key] = start OFS $3 OFS $4 OFS $5
    # count number of fields for each built up output line
    nf_a[key] += 3
}
END {
    # compute the max number of fields per any built up output line
    for(k in nf_a) {
        nf_max = (nf_a[k]>nf_max ? nf_a[k] : nf_max)
    }
    for(i=1; i<=key_cnt; i++) {
        key = order[i]
        # compute the number of blank flds necessary
        nf_pad = nf_max - nf_a[key]
        blank_flds = nf_pad!=0 ? sprintf( "%*s", nf_pad, OFS ) : ""
        gsub( / /, OFS, blank_flds )
        # output lines along with appended blank fields in order
        print a[key] blank_flds
    }
}
If the desired number of fields in the output lines is known ahead of time, simply appending the blank fields on key switch without all these arrays would work and make a simpler script.
I get the following output:
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
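If the target width is known up front, the simpler variant mentioned above might look like the following sketch. It assumes the input is already grouped by key and that the number of value fields per output line (15 for the sample) is passed in by the caller. The names `join_padded`, `emit`, `max` and `nvals` are hypothetical and do not come from the answer.

```shell
# join_padded MAX: read ;-separated rows on stdin, merge consecutive rows
# that share the first two fields, and pad each merged row with empty
# fields up to MAX value fields.
join_padded() {
    awk -F';' -v max="$1" '
    function emit(   i) {                # print the line built so far
        if (line == "") return           # nothing buffered yet
        for (i = nvals; i < max; i++) line = line FS
        print line
    }
    {
        key = $1 FS $2
        if (key != prev) {               # key switch: finish previous line
            emit()
            line = key; nvals = 0; prev = key
        }
        line = line FS $3 FS $4 FS $5    # append the three value fields
        nvals += 3
    }
    END { emit() }'
}
```

For the sample, `join_padded 15 < file` should pad only the `2;f` line, since it is the only group with four rows instead of five.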

Check variables from different lines with awk

I want to combine values from multiple lines with different lengths using awk into one line if they match. In the following sample match values for first field,
aggregating values from second field into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (maybe some other shell expression) to check if the first field value match with the next/previous line, and then print a list of second fields values aggregated and separated by a pipe?
awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
else { if (prev) { print prev ";" sec; };
prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}'
This, as you requested, checks the consecutive lines.
does this oneliner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
if you want to keep it sorted, add a |sort at the end.
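A plain sort compares the keys as text, so it would typically place 4444 before 555 rather than reproduce the order shown in the question. One alternative is to remember the order in which keys first appear. The sketch below does that with an extra array; `group_by_first`, `order` and `vals` are names introduced here for illustration.

```shell
# group_by_first: read ;-separated rows on stdin and print KEY;v1|v2|...
# with keys in the order they first appear in the input.
group_by_first() {
    awk -F';' '
    !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }   # new key
    { vals[$1] = vals[$1] "|" $2 }                            # key seen before
    END { for (i = 1; i <= n; i++) print order[i] ";" vals[order[i]] }'
}
```

Testing membership with `in` rather than the truthiness of the stored value also keeps a first value of 0 or an empty string from being overwritten by the next row.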
Slightly convoluted, but does the job:
awk -F';' \
'{
if (a[$1]) {
a[$1]=a[$1] "|" $2
} else {
a[$1]=$2
}
}
END {
for (k in a) {
print k ";" a[k]
}
}' file
Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)
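One way to apply the two fixes hinted at above is to print only once a previous key exists, and to start each list with the key instead of a leading |. This is a sketch rather than the answerer's own solution; the function name `merge_consecutive` and the flag `have` are introduced here.

```shell
# merge_consecutive: read ;-separated rows on stdin and print KEY;v1|v2|...
# for each run of consecutive lines sharing the first field.
merge_consecutive() {
    awk -F';' '
    {
        if (have && $1 != last) { print s; s = "" }   # key changed: flush
        if (s == "") s = $1 ";" $2                    # first value for this key
        else         s = s "|" $2                     # further values
        last = $1; have = 1
    }
    END { if (have) print s }'                        # flush the final group
}
```

Like the snippet it is based on, this only compares neighbouring lines, so the input must already be grouped by the first field.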
this should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
