Changing previous duplicate line in awk

Changing previous duplicate line in awk - linux

I want to change all duplicate names in .csv to unique, but after finding duplicate I cannot reach previous line, because it's already printed. I've tried to save all lines in array and print them in End section, but it doesn't work and I don't understand how to access specific field in this array (two-dimensional array isn't supported in awk?).
sample input
...,9,phone,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone,...
desired output
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
My attempt ($2 - id field, $3 - name field)
BEGIN{
FS=","
OFS=","
marker=777
}
{
if (names[$3] == marker) {
$3 = $3 $2
#Attempt to change previous duplicate
results[nameLines[$3]]=$3 id[$3]
}
names[$3] = marker
id[$3] = $2
nameLines[$3] = NR
results[NR] = $0
}
END{
#it prints some numbers, not saved lines
for(result in results)
print result
}

Here is single pass awk that stores all records in buffer:
awk -F, '
{
rec[NR] = $0
++fq[$3]
}
END {
for (i=1; i<=NR; ++i) {
n = split(rec[i], a, /,/)
if (fq[a[3]] > 1)
a[3] = a[3] a[2]
for (k=1; k<=n; ++k)
printf "%s", a[k] (k < n ? FS : ORS)
}
}' file
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...

This could be easily done in 2 pass Input_file in awk where we need not to create 2 dimensional arrays in it. With your shown samples written in GNU awk.
awk '
BEGIN{FS=OFS=","}
FNR==NR{
arr1[$3]++
next
}
{
$3=(arr1[$3]>1?$3 $2:$3)
}
1
' Input_file Input_file
Output will be as follows:
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...

Related

LINUX Shell script to convert rows to multiple columns

Shell script to convert rows to multiple columns
input CSV file:
Driver Id,Driver Name,Measure Names,Measure Values
123,XYZ,Total Offers,10
123,XYZ,Driver Reject,0
123,XYZ,Driver Accept ,4
123,XYZ,Expired Offers,3
123,XYZ,Total Bookings,6
123,XYZ,Rider Cancels,2
123,XYZ,Driver Cancels,0
123,XYZ,Rider No-Show,0
123,XYZ,Completed Rides,4
124,PQR,Total Offers,2
124,PQR,Driver Reject,0
124,PQR,Driver Accept ,1
124,PQR,Expired Offers,1
124,PQR,Total Bookings,1
124,PQR,Rider Cancels,0
124,PQR,Driver Cancels,0
124,PQR,Rider No-Show,0
124,PQR,Completed Rides,1
Output Required:
Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides
123,XYZ,10,0,4,3,6,2,0,0,4
124,PQR,2,0,1,1,1,0,0,0,1
I tried with awk but it gives incorrect result.
awk -F\, '
BEGIN{
P["Total Offers"]="%s;%s;%s;;;;;;;;;\n"
P["Driver Reject"]="%s;%s;;%s;;;;;;;;\n"
P["Driver Accept"]="%s;%s;;;%s;;;;;;;\n"
P["Expired Offers"]="%s;%s;;;;%s;;;;;;\n"
P["Total Bookings"]="%s;%s;;;;;%s;;;;;\n"
P["Rider Cancels"]="%s;%s;;;;;;%s;;;;\n"
P["Driver Cancels"]="%s;%s;;;;;;;%s;;;\n"
P["Rider No-Show"]="%s;%s;;;;;;;;%s;;\n"
P["Completed Rides"]="%s;%s;;;;;;;;;%s;\n"
}
FNR==1{
print "Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides"
next
}
{
printf(P[$3],$1,$2,$4)
}
' sample1.csv
could somebody please assist me or show me any other method to implement this.
Thanks in Advance

Considering your Input_file is same as shown sample and if you don't care of about output sequence should be as input then following may help you in same.
awk -F, 'FNR>1{a[$1,$2]=a[$1,$2]?a[$1,$2] FS $NF:$NF} END{for(i in a){print i FS a[i]}}' SUBSEP="," Input_file

Below one takes care of order of output as well as missing value, if there's any
awk '
BEGIN{
FS=OFS=SUBSEP=",";
}
FNR==1{
printf("%s%s%s",$1,OFS,$2);
next
}
{
if(!(($1,$2) in tmp)){
usr[++u] = $1 OFS $2
tmp[$1,$2]
}
if(!($3 in tmp)){
names[++n] = $3;
tmp[$3]
printf("%s%s",OFS,$3)
}
arr[$1,$2,$3] = $4
}
END{
print ""
for(u=1; u in usr; u++){
printf("%s", usr[u]);
for(n=1; n in names; n++){
indexkey = usr[u] SUBSEP names[n]
printf("%s%s",OFS, (indexkey in arr) ? arr[indexkey]:"")
}
print ""
}
}
' infile
Explanation:
FS=OFS=SUBSEP=","; - Set field separator, output field separator and built-in variable subsep to comma, in current program atleast atleast OFS and SUBSEP should be same, because I used it access array indexkey = usr[u] SUBSEP names[n], so if you got any other input field separator (say pipe) then make FS="|"; OFS=SUBSEP=","
FNR==1{
printf("%s%s%s",$1,OFS,$2);
next
}
If first line, then print first 2 fields and go to next line
if(!(($1,$2) in tmp)){
usr[++u] = $1 OFS $2
tmp[$1,$2]
}
Since you want ordered output, contiguous (in order) array (usr) is used in this program. tmp is array, where as index being $1 and $2, usr is array, where index being variable u, value being $1 and $2, if(!(($1,$2) in tmp)) takes care of if doesn't exist before.
if(!($3 in tmp)){
names[++n] = $3;
tmp[$3]
printf("%s%s",OFS,$3)
}
Similarly like above, names array is contiguous, value being $3
arr[$1,$2,$3] = $4 array arr key being 3 fields, $1,$2,$3 and value being $4
Finally in END block loop through usr and names array, build indexkey and print array value, if indexkey exists in array arr
Input :
$ cat infile
Driver Id,Driver Name,Measure Names,Measure Values
123,XYZ,Total Offers,10
123,XYZ,Driver Reject,0
123,XYZ,Driver Accept ,4
123,XYZ,Expired Offers,3
123,XYZ,Total Bookings,6
123,XYZ,Rider Cancels,2
123,XYZ,Driver Cancels,0
123,XYZ,Rider No-Show,0
123,XYZ,Completed Rides,4
124,PQR,Total Offers,2
124,PQR,Driver Reject,0
124,PQR,Driver Accept ,1
124,PQR,Expired Offers,1
124,PQR,Total Bookings,1
124,PQR,Rider Cancels,0
124,PQR,Driver Cancels,0
124,PQR,Rider No-Show,0
124,PQR,Completed Rides,1
Output:
$ awk '
BEGIN{
FS=OFS=SUBSEP=",";
}
FNR==1{
printf("%s%s%s",$1,OFS,$2);
next
}
{
if(!(($1,$2) in tmp)){
usr[++u] = $1 OFS $2
tmp[$1,$2]
}
if(!($3 in tmp)){
names[++n] = $3;
tmp[$3]
printf("%s%s",OFS,$3)
}
arr[$1,$2,$3] = $4
}
END{
print ""
for(u=1; u in usr; u++){
printf("%s", usr[u]);
for(n=1; n in names; n++){
indexkey = usr[u] SUBSEP names[n]
printf("%s%s",OFS, (indexkey in arr) ? arr[indexkey]:"")
}
print ""
}
}
' infile
Driver Id,Driver Name,Total Offers,Driver Reject,Driver Accept ,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides
123,XYZ,10,0,4,3,6,2,0,0,4
124,PQR,2,0,1,1,1,0,0,0,1

if the rows are not ordered in the required fields you have to use an associative array.
$ awk -F, -v cols='Total Offers,Driver Reject,Driver Accept ,Expired Offers,Total Bookings,Rider Cancels,Driver Cancels,Rider No-Show,Completed Rides' '
BEGIN {n=split(cols,f)}
NR>1 {k=$1 FS $2; keys[k]; a[k,$3]=$4}
END {for(k in keys)
{printf "%s", k;
for(i=1;i<=n;i++) printf "%s%d", FS,+a[k,f[i]];
print ""}}' file
124,PQR,2,0,1,1,1,0,0,0,1
123,XYZ,10,0,4,3,6,2,0,0,4
this will take care if any of the measure rows are missing
ps. Note that "Driver Accept " has a trailing space, which I kept.

AWK file reformatting

I'm struggling to reformat a comma separated file using awk. The file contains minute data for a day for multiple servers and for multiple metrics
e.g 2 records, per minute, per server for 24hrs
Example input file:
server01,00:01:00,AckDelayAverage,9999
server01,00:01:00,AckDelayMax,8888
server01,00:02:00,AckDelayAverage,666
server01,00:02:00,AckDelayMax,5555
.....
server01,23:58:00,AckDelayAverage,4545
server01,23:58:00,AckDelayMax,8777
server01,23:59:00,AckDelayAverage,4686
server01,23:59:00,AckDelayMax,7820
server02,00:01:00,AckDelayAverage,1231
server02,00:01:00,AckDelayMax,4185
server02,00:02:00,AckDelayAverage,1843
server02,00:02:00,AckDelayMax,9982
.....
server02,23:58:00,AckDelayAverage,1022
server02,23:58:00,AckDelayMax,1772
server02,23:59:00,AckDelayAverage,1813
server02,23:59:00,AckDelayMax,9891
I'm trying to re-format the file to have a single row for each minute with a unique concatenation of fields 1 & 3 as the column headers
e.g the expected output file would look like:
Minute, server01-AckDelayAverage,server01-AckDelayMax, server02-AckDelayAverage,server02-AckDelayMax
00:01:00,9999,8888,1231,4185
00:02:00,666,5555,1843,8892
...
...
23:58:00,4545,8777,1022,1772
23:59:00,4686,7820,1813,9891

A solution using GNU awk. Call this as awk -F, -f script input_file:
/Average/ { average[$2, $1] = $4; }
/Max/ { maximum[$2, $1] = $4; }
{
if (!($2 in minutes)) {
minutes[$2] = 1;
}
if (!($1 in servers)) {
servers[$1] = 1;
}
}
END {
mcount = asorti(minutes, smin);
scount = asorti(servers, sserv);
printf "minutes";
for (col = 1; col <= scount; col++) {
printf "," sserv[col] "-average," sserv[col] "-maximum";
}
print "";
for (row = 1; row <= mcount; row++) {
key = smin[row];
printf key;
for (col = 1; col <= scount; col++) {
printf "," average[key, sserv[col]] "," maximum[key, sserv[col]];
}
print "";
}
}

run awk command : ./script.awk file
#! /bin/awk -f
BEGIN{
FS=",";
OFS=","
}
$1 ~ /server01/ && $3 ~ /Average/{
a[$2]["Avg01"] = $4;
}
$1 ~ /server01/ && $3 ~ /Max/{
a[$2]["Max01"] = $4;
}
$1 ~ /server02/ && $3 ~ /Average/{
a[$2]["Avg02"] = $4;
}
$1 ~ /server02/ && $3 ~ /Max/{
a[$2]["Max02"] = $4;
}
END{
print "Minute","server01-AckDelayAverage","server01-AckDelayMax","server02-AckDelayAverage","server02-AckDelayMax"
for(i in a){
print i,a[i]["Avg01"],a[i]["Max01"],a[i]["Avg02"],a[i]["Max02"] | "sort"
}
}

With awk and sort:
awk -F, -v OFS=, '{
a[$2]=(a[$2]?a[$2]","$4:$4)
}
END{
for ( i in a ) print i,a[i]
}' File | sort
If $4 has 0 values:
awk -F, -v OFS=, '!a[$2]{a[$2]=$2} {a[$2]=a[$2]","$4} END{for ( i in a ) print a[i]}' | sort
!a[$2]{a[$2]=$2}: If array with a with Index $2 ( the time in Minute) doesn't exit, array a with index as $2( the time in Minute) with value as $2 is created. True when Minute entry first time occurs in line.
{a[$2]=a[$2]","$4}: Concatenate value $4 to this array
END: Print all values of in array a
Finally pipe this awk result to sort.

file manipulation with command line tools on linux

I want to transform a file from this format
1;a;34;34;a
1;a;34;23;d
1;a;34;23;v
1;a;4;2;r
1;a;3;2;d
2;f;54;3;f
2;f;34;23;e
2;f;23;5;d
2;f;23;23;g
3;t;26;67;t
3;t;34;45;v
3;t;25;34;h
3;t;34;23;u
3;t;34;34;z
to this format
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
These are cvs files, so it should work with awk or sed ... but I have failed till now. If the first value is the same, I want to add the last three values to the first line. And this will run till the last entry in the file.
Here some code in awk, but it does not work:
#!/usr/bin/awk -f
BEGIN{ FS = " *; *"}
{ ORS = "\;" }
{
x = $1
print $0
}
{ if (x == $1)
print $3, $4, $5
else
print "\n"
}
END{
print "\n"
}

$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr == prev {
sub(/^[^;]*;[^;]*/,"")
printf "%s", $0
next
}
{
printf "%s%s", (NR>1?ORS:""), $0
prev = curr
}
END { print "" }
$ awk -f tst.awk file
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z

If I understand you correctly that you want to build a line from fields 3-5 of all lines with the same first two fields (preceded by those two fields), then
awk -F \; 'key != $1 FS $2 { if(NR != 1) print line; key = $1 FS $2; line = key } { line = line FS $3 FS $4 FS $5 } END { print line }' filename
That is
key != $1 FS $2 { # if the key (first two fields) changed
if(NR != 1) print line; # print the line (except at the very
# beginning, to not get an empty line there)
key = $1 FS $2 # remember the new key
line = key # and start building the next line
}
{
line = line FS $3 FS $4 FS $5 # take the value fields from each line
}
END { # and at the very end,
print line # print the last line (that the block above
} # cannot handle)

You got good answers in awk. Here is one in perl:
perl -F';' -lane'
$key = join ";", #F[0..1]; # Establish your key
$seen{$key}++ or push #rec, $key; # Remember the order
push #{ $h{$key} }, #F[2..$#F] # Build your data structure
}{
$, = ";"; # Set the output list separator
print $_, #{ $h{$_} } for #rec' file # Print as per order

This is going to seem a lot more complicated than the other answers, but it's adding a few things:
It computes the maximum number of fields from all built up lines
Appends any missing fields as blanks to the end of the built up lines
The posix awk on a mac doesn't maintain the order of array elements even when the keys are numbered when using the for(key in array) syntax. To maintain the output order then, you can keep track of it as I've done or pipe to sort afterwards.
Having matching numbers of fields in the output appears to be a requirement per the specified output. Without knowing what it should be, this awk script is built to load all the lines first, compute the maximum number of fields in an output line then output the lines with any adjustments in order.
#!/usr/bin/awk -f
BEGIN {FS=OFS=";"}
{
key = $1
# create an order array for the mac's version of awk
if( key != last_key ) {
order[++key_cnt] = key
last_key = key
}
val = a[key]
# build up an output line in array a for the given key
start = (val=="" ? $1 OFS $2 : val)
a[key] = start OFS $3 OFS $4 OFS $5
# count number of fields for each built up output line
nf_a[key] += 3
}
END {
# compute the max number of fields per any built up output line
for(k in nf_a) {
nf_max = (nf_a[k]>nf_max ? nf_a[k] : nf_max)
}
for(i=1; i<=key_cnt; i++) {
key = order[i]
# compute the number of blank flds necessary
nf_pad = nf_max - nf_a[key]
blank_flds = nf_pad!=0 ? sprintf( "%*s", nf_pad, OFS ) : ""
gsub( / /, OFS, blank_flds )
# output lines along with appended blank fields in order
print a[key] blank_flds
}
}
If the desired number of fields in the output lines is known ahead of time, simply appending the blank fields on key switch without all these arrays would work and make a simpler script.
I get the following output:
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z

Using awk on large txt to extract specific characters of fields

I have a large txt file ("," as delimiter) with some data and string:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW' and then print the $1 field from 'subject' onwards and only the times and floats from $2 onwards. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5,$6,$9,$10 are printed after the previous fields and if $4 == 'NP=2', all following fields are printed (times and floats only)
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!

Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
str="subject=" $2 OFS
# strip ":GMT" from dates and "}" from everywhere
gsub( /:GMT|[\}]/, "")
# append common fields to str with OFS
for(i=5;i<=13;i+=2) str=str $i OFS
# print the remaining fields and line separator
if($9==3) { print str $19, $21 }
else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko and chmod'ing it then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have reparse for = to work with the field data
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS Dates and numbers are two apart from each other but still loop-able
In either final print call, the str already ends in an OFS, so the comma between it and next field can be skipped

If I understand your requirements, the following will work:
BEGIN {
FS=","
OFS=","
}
/T_GRIFW/ {
split($1, subject, " ")
result = subject[2] OFS
delete arr
counter = 1
for (i = 2; i <= NF; i++) {
add = 0
if ($4 == "NP=3") {
if (i == 5 || i == 6 || i == 9 || i == 10) {
add = 1
}
}
else if ($4 == "NP=2") {
add = 1
}
if (add) {
counter = counter + 1
split($i, field, "=")
if (match(field[2], "[0-9]*\.[0-9]+|GMT")) {
arr[counter] = field[2]
}
}
}
for (i in arr) {
gsub(/{|}/,"", arr[i]) # remove curly braces
result = result arr[i] OFS
}
print substr(result, 0, length(result)-1)
}

Check variables from different lines with awk

I want to combine values from multiple lines with different lengths using awk into one line if they match. In the following sample match values for first field,
aggregating values from second field into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (maybe some other shell expression) to check if the first field value match with the next/previous line, and then print a list of second fields values aggregated and separated by a pipe?

awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
else { if (prev) { print prev ";" sec; };
prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}'
This, as you requested, checks the consecutive lines.

does this oneliner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
if you want to keep it sorted, add a |sort at the end.

Slightly convoluted, but does the job:
awk -F';' \
'{
if (a[$1]) {
a[$1]=a[$1] "|" $2
} else {
a[$1]=$2
}
}
END {
for (k in a) {
print k ";" a[k]
}
}' file

Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)

this should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Changing previous duplicate line in awk - linux

Related

LINUX Shell script to convert rows to multiple columns

AWK file reformatting

file manipulation with command line tools on linux

Using awk on large txt to extract specific characters of fields

Check variables from different lines with awk

Categories

Resources