AWK file reformatting - linux

I'm struggling to reformat a comma-separated file using awk. The file contains per-minute data for one day, for multiple servers and multiple metrics,
e.g. 2 records per minute, per server, for 24 hours.
Example input file:
server01,00:01:00,AckDelayAverage,9999
server01,00:01:00,AckDelayMax,8888
server01,00:02:00,AckDelayAverage,666
server01,00:02:00,AckDelayMax,5555
.....
server01,23:58:00,AckDelayAverage,4545
server01,23:58:00,AckDelayMax,8777
server01,23:59:00,AckDelayAverage,4686
server01,23:59:00,AckDelayMax,7820
server02,00:01:00,AckDelayAverage,1231
server02,00:01:00,AckDelayMax,4185
server02,00:02:00,AckDelayAverage,1843
server02,00:02:00,AckDelayMax,9982
.....
server02,23:58:00,AckDelayAverage,1022
server02,23:58:00,AckDelayMax,1772
server02,23:59:00,AckDelayAverage,1813
server02,23:59:00,AckDelayMax,9891
I'm trying to reformat the file so that it has a single row for each minute, with each unique concatenation of fields 1 and 3 as a column header,
e.g. the expected output file would look like:
Minute,server01-AckDelayAverage,server01-AckDelayMax,server02-AckDelayAverage,server02-AckDelayMax
00:01:00,9999,8888,1231,4185
00:02:00,666,5555,1843,9982
...
...
23:58:00,4545,8777,1022,1772
23:59:00,4686,7820,1813,9891

A solution using GNU awk. Call this as awk -F, -f script input_file:
/Average/ { average[$2, $1] = $4; }
/Max/ { maximum[$2, $1] = $4; }
{
if (!($2 in minutes)) {
minutes[$2] = 1;
}
if (!($1 in servers)) {
servers[$1] = 1;
}
}
END {
mcount = asorti(minutes, smin);
scount = asorti(servers, sserv);
printf "minutes";
for (col = 1; col <= scount; col++) {
printf ",%s-average,%s-maximum", sserv[col], sserv[col];
}
print "";
for (row = 1; row <= mcount; row++) {
key = smin[row];
printf "%s", key;
for (col = 1; col <= scount; col++) {
printf ",%s,%s", average[key, sserv[col]], maximum[key, sserv[col]];
}
print "";
}
}

Make the script executable and run it as ./script.awk file (the nested a[i][j] arrays require GNU awk 4.0 or later):
#! /bin/awk -f
BEGIN{
FS=",";
OFS=","
}
$1 ~ /server01/ && $3 ~ /Average/{
a[$2]["Avg01"] = $4;
}
$1 ~ /server01/ && $3 ~ /Max/{
a[$2]["Max01"] = $4;
}
$1 ~ /server02/ && $3 ~ /Average/{
a[$2]["Avg02"] = $4;
}
$1 ~ /server02/ && $3 ~ /Max/{
a[$2]["Max02"] = $4;
}
END{
print "Minute","server01-AckDelayAverage","server01-AckDelayMax","server02-AckDelayAverage","server02-AckDelayMax"
for(i in a){
print i,a[i]["Avg01"],a[i]["Max01"],a[i]["Avg02"],a[i]["Max02"] | "sort"
}
}

With awk and sort:
awk -F, -v OFS=, '{
a[$2]=(a[$2]?a[$2]","$4:$4)
}
END{
for ( i in a ) print i,a[i]
}' File | sort
If $4 can be 0, the a[$2] ? ... : ... test above treats a stored 0 as false and overwrites it, so use this instead:
awk -F, -v OFS=, '!a[$2]{a[$2]=$2} {a[$2]=a[$2]","$4} END{for ( i in a ) print a[i]}' File | sort
!a[$2]{a[$2]=$2}: if array a has no entry for index $2 (the minute) yet, create one whose value is $2. This is true the first time a minute appears.
{a[$2]=a[$2]","$4}: append a comma and the value $4 to that entry.
END: print every value stored in array a.
Finally, pipe the awk output to sort.
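The approaches above either hard-code the server names or leave out the header row. As a rough sketch only, the same pivot can be done in portable awk by remembering the order in which column labels and minutes first appear. It assumes the input is already ordered by server and then by minute, as in the sample, so that first-seen order is the wanted order; pivot_minutes is a hypothetical helper name that reads from stdin.

```shell
# Sketch: pivot server,minute,metric,value rows into one row per minute.
# pivot_minutes is a hypothetical helper name; input is read from stdin.
pivot_minutes() {
    awk -F, -v OFS=, '
    {
        col = $1 "-" $3                                             # column label, e.g. server01-AckDelayAverage
        if (!(col in cseen)) { cseen[col] = 1; cols[++ncols] = col } # remember first-seen column order
        if (!($2 in mseen))  { mseen[$2] = 1; mins[++nmins] = $2 }   # remember first-seen minute order
        val[$2, col] = $4                                           # keyed by minute SUBSEP column label
    }
    END {
        line = "Minute"
        for (c = 1; c <= ncols; c++) line = line OFS cols[c]
        print line
        for (m = 1; m <= nmins; m++) {
            line = mins[m]
            for (c = 1; c <= ncols; c++) line = line OFS val[mins[m], cols[c]]
            print line
        }
    }'
}
```

It would be called as pivot_minutes < input_file. A minute that is missing for one server simply yields an empty cell in that column.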

Related

Changing previous duplicate line in awk

I want to make all duplicate names in a .csv file unique, but by the time I find a duplicate I can no longer change the previous line, because it has already been printed. I've tried saving all lines in an array and printing them in the END section, but it doesn't work, and I don't understand how to access a specific field in this array (are two-dimensional arrays not supported in awk?).
sample input
...,9,phone,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone,...
desired output
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
My attempt ($2 - id field, $3 - name field)
BEGIN{
FS=","
OFS=","
marker=777
}
{
if (names[$3] == marker) {
$3 = $3 $2
#Attempt to change previous duplicate
results[nameLines[$3]]=$3 id[$3]
}
names[$3] = marker
id[$3] = $2
nameLines[$3] = NR
results[NR] = $0
}
END{
#it prints some numbers, not saved lines
for(result in results)
print result
}
Here is a single-pass awk solution that stores all records in a buffer:
awk -F, '
{
rec[NR] = $0
++fq[$3]
}
END {
for (i=1; i<=NR; ++i) {
n = split(rec[i], a, /,/)
if (fq[a[3]] > 1)
a[3] = a[3] a[2]
for (k=1; k<=n; ++k)
printf "%s", a[k] (k < n ? FS : ORS)
}
}' file
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
This can be done easily by reading Input_file twice in awk, which avoids the need for two-dimensional arrays. Written and tested in GNU awk with your shown samples.
awk '
BEGIN{FS=OFS=","}
FNR==NR{
arr1[$3]++
next
}
{
$3=(arr1[$3]>1?$3 $2:$3)
}
1
' Input_file Input_file
Output will be as follows:
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
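On the question of two-dimensional arrays: standard awk has no true nested arrays, but a subscript such as cell[row, col] is accepted and is stored under a single key made of both parts joined by SUBSEP. A minimal sketch of the asker's "save everything, print in END" idea using that feature follows. rename_dups is a hypothetical helper name, and the code assumes the id is field 2 and the name is field 3, as in the question.

```shell
# Sketch: buffer every field as cell[row, col], then fix duplicates in END.
# rename_dups is a hypothetical helper name; input is read from stdin.
rename_dups() {
    awk -F, -v OFS=, '
    {
        nf[NR] = NF                                   # number of fields on this row
        for (i = 1; i <= NF; i++) cell[NR, i] = $i    # one array, key is row SUBSEP col
        count[$3]++                                   # how often each name occurs
    }
    END {
        for (r = 1; r <= NR; r++) {
            if (count[cell[r, 3]] > 1) cell[r, 3] = cell[r, 3] cell[r, 2]   # append the id to duplicated names
            line = cell[r, 1]
            for (i = 2; i <= nf[r]; i++) line = line OFS cell[r, i]
            print line
        }
    }'
}
```

It would be called as rename_dups < file. Like the buffering answer above, it keeps the whole file in memory, so the two-pass version is the better fit for very large inputs.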

Print sum of Nth column at the header of file with existing rows bash

I have an input file with billions of records and a header.
The header consists of meta info, the total number of rows and the sum of the sixth column. I am splitting the file into smaller files, so the header record of each one must be updated, because the sum of the sixth column and the total number of rows change.
This is the sample record
filename: testFile.text
00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
Expected:
filename: testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
filename: testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
I am able to split the file and calculate the sum, but I am unable to replace the values in the header.
This is the script I have written:
#!/bin/bash
splitRowCount=$1
transactionColumn=$2
filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"
echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"
awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount')+1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename
ls *.txt | while read line
do
firstLine=$(head -n 1 $line);
awk -F '|' 'NR !=1 {sum += '$transactionColumn'}END {print sum} ' $line
done
Here's an awk solution for splitting the original file into files of n records. The idea is to accumulate the records until the given count is reached then generate a file with the updated header and the accumulated records:
n=3
file=./testFile.text
awk -v numRecords="$n" '
BEGIN {
FS = OFS = "|"
if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
filePrefix = substr(ARGV[1],1,RSTART)
fileSuffix = substr(ARGV[1],RSTART+1)
} else {
filePrefix = ARGV[1]
fileSuffix = ""
}
if ((getline headerStr) <= 0)
exit 1
split(headerStr, headerArr)
}
(NR-2) % numRecords == 0 && recordsCount {
outfile = filePrefix "_" ++filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
records = ""
recordsCount = recordsSum = 0
}
{
records = records $0 ORS
recordsCount++
recordsSum += $6
}
END {
if (recordsCount) {
outfile = filePrefix "_" ++filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
}
}
' "$file"
With the given sample you'll get:
testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
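The headers in this output show 6.1 and 7 where the question expected 6.10 and 7.00, because awk prints numbers in their shortest form. A minimal sketch of one way to keep two decimals, assuming the sum should always be written with exactly two, is to format it with sprintf. rebuild_header is a hypothetical helper name, and it only rebuilds the header line for a single chunk read from stdin.

```shell
# Sketch: recompute the header of one chunk with a two-decimal sum.
# rebuild_header is a hypothetical helper name; input is read from stdin.
rebuild_header() {
    awk -F'|' -v OFS='|' '
    NR == 1 { split($0, h, "|"); next }    # keep the original header fields
    { count++; sum += $6 }                 # count records and add up column 6
    END { print h[1], h[2], h[3], h[4], count, sprintf("%.2f", sum), h[7] }'
}
```

The same sprintf("%.2f", recordsSum) call could be used in place of recordsSum in the two print statements of the script above.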
With your shown samples, please try the following awk code (written and tested in GNU awk). It takes four awk variables: fileInitials is the base name of the output files (e.g. testFile), extension is their extension (e.g. .txt), lines is the number of records to write to each output file, and count is the number given to the first output file.
You don't need to combine shell and awk code; this can be done in a single awk program, as follows.
awk -v count="1" -v fileInitials="testFile" -v extension=".txt" -v lines="3" '
BEGIN { FS=OFS="|" }
FNR==1{
match($0,/^([^|]*\|[^|]*\|[^|]*\|[^|]*\|[^|]*)\|[^|]*(.*)/,arr)
header1=arr[1]
header2=arr[2]
outputFile=(fileInitials count extension)
next
}
{
if(prev!=count){
print (header1,sum header2 ORS val) > (outputFile)
close(outputFile)
outputFile=(fileInitials count extension)
sum=0
val=""
}
sum+=$6
val=(val?val ORS:"") $0
prev=count
count=(++countline%lines==0?++count:count)
}
END{
if(count && val){
print (header1,sum header2 ORS val) > (outputFile)
close(outputFile)
}
}
' Input_file

In bash how to move row field to column in a text file

I have a .txt file with this record:
field_1 value01a value01b value01c
field_2 value02
field_3 value03a value03b value03c
field_1 value11
field_2 value12a value12b
field_3 value13
field_1 value21
field_2 value22
field_3 value23
...
field_1 valuen1
field_2 valuen2
field_3 valuen3
I would like to convert them like this:
field1 field2 field3
value01a value01b value01c value02 value03a value03b value03c
value11 value12a value12b value13
value21 value22 value23
...
valuen1 valuen2 valuen3
I have tried something like:
awk '{for (i = 1; i <NR; i ++) FNR == i {print i, $ (i + 1)}}' filename
or like
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j]
}
print str
}
}'
but I can't get it to work.
I would like the values to be transposed and that each tuple of values associated with a specific field is aligned with the others
Any suggestions?
Thank you in advance
I have downloaded your bigger sample file. And here is what I have come up with:
awk -v OFS='\t' -v RS= '
((n = split($0, a, / {2,}| *\n/)) % 2) == 0 {
# print header
if (NR==1)
for (i=1; i<=n; i+=2)
printf "%s", a[i] (i < n-1 ? OFS : ORS)
# print all records
for (i=2; i<=n; i+=2)
printf "%s", a[i] (i < n ? OFS : ORS)
}' reclamiTestFile.txt | column -t -s $'\t'
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
{
first=$1
$1=""
sub(/^ +/,"")
if(!arr[first]++){
++indArr
counter[indArr]=first
}
++count[first]
arr[first OFS count[first]]=$0
}
END{
for(j=1;j<=indArr;j++){
printf("%s\t%s",counter[j],j==indArr?ORS:"\t")
}
for(i=1;i<=FNR;i++){
for(j=1;j<=indArr;j++){
if(arr[counter[j] OFS i]){
printf("%s\t%s",arr[counter[j] OFS i],j==indArr?ORS:"\t")
}
}
}
}' Input_file | column -t -s $'\t'
The column command is taken from @anubhava's answer.
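For input laid out exactly like the sample in the question (no blank lines, every record listing the same field names in the same order), a simpler streaming sketch is possible: treat a repeated field name as the start of a new record and print the previous one. This is only an illustration under those assumptions. rows_to_columns and emit are hypothetical names, and a line that holds a field name with no values is not handled.

```shell
# Sketch: turn "name values..." lines into tab-separated rows, one per record.
# rows_to_columns and emit are hypothetical names; input is read from stdin.
rows_to_columns() {
    awk -v OFS='\t' '
    function emit(   i, line) {
        if (!header_done++) {                  # print the header once, from the first record
            line = names[1]
            for (i = 2; i <= n; i++) line = line OFS names[i]
            print line
        }
        line = row[names[1]]
        for (i = 2; i <= n; i++) line = line OFS row[names[i]]
        print line
        split("", row)                         # portable way to empty an array
    }
    {
        name = $1
        sub(/^[^ ]+ +/, "")                    # drop the field name, keep the values
        if (name in row) emit()                # a repeated name starts a new record
        if (!(name in known)) { known[name] = 1; names[++n] = name }
        row[name] = $0
    }
    END { if (n) emit() }'
}
```

The output can be piped through column -t -s $'\t' as in the answers above to align the columns.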

How to get column names in awk?

I have a data file in the following format:
Program1, Program2, Program3, Program4
0, 1, 1, 0
1, 1, 1, 0
Columns are program names, and rows are features of programs. I need to write an awk loop that will go through every row, check if a value is equal to one, and then return the column names and put them into a "results.csv" file. The desired output should be this:
Program2, Program3
Program1, Program2, Program3
I was trying this code, but it wouldn't work:
awk -F, '{for(i=1; i<=NF; i++) if ($i==1) {FNR==1 print$i>>results}; }'
Help would be very much appreciated!
awk -F', *' '
NR==1 {for(i=1;i<=NF;i++) h[i]=$i; next}
{
sep="";
for(x=1;x<=NF;x++) {
if($x) {
printf "%s%s", sep, h[x];
sep=", ";
}
}
print ""
}' file
outputs:
Program2, Program3
Program1, Program2, Program3
$ cat tst.awk
BEGIN { FS=", *" }
NR==1 { split($0,a); next }
{
out = ""
for (i=1; i<=NF; i++)
out = out ($i ? (out?", ":"") a[i] : "")
print out
}
$ awk -f tst.awk file
Program2, Program3
Program1, Program2, Program3
My take on things is more verbose, but should handle the trailing comma. Not really a one-liner, though.
BEGIN {
# Formatting for the input and output files.
FS = ", *"
OFS = ", "
}
FNR == 1 {
# First line in the file
# Read the headers into a list for later use.
for (i = 1; i <= NF; i++) {
headers[i] = $i
}
}
FNR > 1 {
# Print the header for each column containing a 1.
stop = 0
for (i = 1; i <= NF; i++) {
# Gather the results from this line.
if ($i > 0) {
stop += 1
results[stop] = headers[i]
}
}
if (stop > 0) {
# If this input line had no results, the output line is blank
for (i = 1; i <= stop; i++) {
# Print the appropriate headers for this result.
if (i < stop) {
# Results other than the last
printf("%s%s", results[i], OFS)
} else {
# The last result
printf("%s", results[i])
}
}
}
printf("%s", ORS)
}
Save this as something like script.awk, and then run it as something like:
awk -f script.awk infile.txt > results

Sorting List and Adding Together Amounts Shellscript

I have a list such as:
10,Car Tyres
8,Car Tyres
4,Wheels
18,Crowbars
5,Jacks
5,Jacks
8,Jacks
The first number is quantity, second is item name. I need to get this list so that it only shows each item once and it adds together the quantity if the item appears more than once. The output of this working correctly would be:
18,Car Tyres
4,Wheels
18,Crowbars
18,Jacks
This will need to work on lists in this format of a few thousand lines, preferably coded in Linux shellscript, any help appreciated, thanks!
awk -F"," '{ t[$2] = t[$2] + $1 }
END{
for(o in t){
print o, t[o]
}
}' file
output
$ ./shell.sh
Crowbars 18
Wheels 4
Car Tyres 18
Jacks 18
How about a perl script?:
#!/usr/bin/perl -w
use strict;
my %parts;
while (<>) {
chomp;
my @fields = split /,/, $_;
if (scalar @fields > 1) {
if ($parts{$fields[1]}) {
$parts{$fields[1]} += $fields[0];
} else {
$parts{$fields[1]} = $fields[0];
}
}
}
foreach my $k (keys %parts) {
print $parts{$k}, ",$k\n";
}
awk -v FS=, '{ if (!($2 in a)) {
a[$2] = $1;
}
else {
a[$2] += $1;
}
}
END {
for (name in a) {
printf("%s\t%d\n", name, a[name]);
}
}'
Look at:
man sort
man awk
The actual command you need is something like this (-k2 replaces the obsolete +1 key syntax):
sort -t, -k2 yourfile.txt | awk ......
You could also do this entirely in awk
Sum by group
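None of the awk answers above prints the quantity,name layout the question asked for, and for (name in a) returns items in an unspecified order. As a small sketch, assuming the items should appear in the order they first occur in the list, the totals can be kept alongside an ordered list of names. sum_by_item is a hypothetical helper name that reads from stdin.

```shell
# Sketch: sum quantities per item, keeping first-seen item order.
# sum_by_item is a hypothetical helper name; input is read from stdin.
sum_by_item() {
    awk -F, -v OFS=, '
    {
        if (!($2 in total)) order[++n] = $2    # remember first-seen item order
        total[$2] += $1                        # add this quantity to the running total
    }
    END { for (i = 1; i <= n; i++) print total[order[i]], order[i] }'
}
```

It would be called as sum_by_item < yourfile.txt, and no sort step is needed because the input order is preserved.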
