Merge every two rows into one and sum multiple entries - linux

I am struggling a bit with the output, as I need to merge every second row with the first, sort, and add up all the duplicate entries.
sample output:
bittorrent_block(PCC)
127
default_384k(PCC)
28
default_384k(BWM)
28
bittorrent_block(PCC)
127
default_384k(PCC)
28
default_384k(BWM)
28
Convert the 2nd row into a column (expected):
bittorrent_block(PCC): 127
default_384k(PCC): 28
default_384k(BWM): 28
bittorrent_block(PCC): 127
default_384k(PCC): 28
default_384k(BWM): 28
Sum all duplicate entries (expected):
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56
These are the pieces of code I tried, and what I am finally getting:
zcat file.tar.gz | awk 'NR%2{v=$0;next;}{print $0,v}'
bittorrent_block(PCC)
default_384k(PCC)
default_384k(BWM)
default_mk1(PCC)
default_mk1_10m(PCC)
zcat file.tar.gz | awk 'NR%2{ prev = $0; next }{ print prev, $0 }'
127orrent_block(PCC)
28ault_384k(PCC)
28ault_384k(BWM)
Because of this, I am not able to sum up the duplicate values.
Please help.

I often find it easier to transform the input first and then process it. paste converts consecutive lines into columns; summing the numbers with awk then becomes trivial:
$ <input paste -sd'\t\n' | awk '{sum[$1] += $2}END{for(s in sum) print s": "sum[s]}'
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56

It seems you have CRLF line endings in your file, so you'll have to strip them:
zcat file.tar.gz |
awk -F '\r' -v OFS=': ' '
NR % 2 { id = $1; next }
{ sum[id] += $1 }
END { for (id in sum) print id, sum[id] }
'
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56

Here is a Ruby one-liner to do that:
zcat file | ruby -e '$<.read.split(/\R/).
each_slice(2).
each_with_object(Hash.new {|h,k| h[k] = 0}) {
|(k,v), h| h[k] = h[k]+v.to_i
}.
each{|k,v| puts "#{k}: #{v}"}'
By splitting on \R this automatically handles either DOS or Unix line endings.
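If you would rather stay in the shell, a similar line-ending-agnostic sketch strips any CRs with tr before pairing and summing (same paste/awk idea as above; the sample input here is made up):

```shell
# Delete DOS carriage returns, pair consecutive lines, then sum per key.
printf 'a(X)\r\n1\r\na(X)\n2\n' |
tr -d '\r' |
paste - - |
awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k ": " sum[k]}'
```

This prints a(X): 3 regardless of whether the input lines end in \n or \r\n.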

This might work for you (GNU sed, sort and bash):
zcat file |
paste - - |
sort |
uniq -c |
sed -E 's/^ *(\S+) (.*)\t(\S+)/echo "\2 $((\1*\3))"/e'
Decompress file.
Join pairs of lines.
Sort.
Count duplicate lines.
Format and compute final sums.
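The e flag on the sed substitution is GNU-specific (it executes the pattern space as a shell command). On systems without it, a sketch of the same idea does the final count * value multiplication in awk instead (inline sample data for illustration):

```shell
# Pair lines, sort, count duplicates, then multiply count by value in awk.
printf 'k1\n10\nk1\n10\nk2\n5\n' |
paste - - |
sort |
uniq -c |
awk '{print $2 ": " $1 * $3}'
```

awk's default whitespace splitting handles both uniq's leading count and the tab that paste inserted.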

Related

Converting 4-digit year to 2-digit in shell script

I have file as:
$ cat file.txt
1981080512 14 15
2019050612 17 18
2020040912 19 95
Here the 1st column represents dates as YYYYMMDDHH
I would like to write the dates as YYMMDDHH. So the desire output is:
81080512 14 15
19050612 17 18
20040912 19 95
My script:
while read -r x;do
yy=$(echo $x | awk '{print substr($0,3,2)}')
mm=$(echo $x | awk '{print substr($0,5,2)}')
dd=$(echo $x | awk '{print substr($0,7,2)}')
hh=$(echo $x | awk '{print substr($0,9,2)}')
awk '{printf "%10s%4s%4s\n",'$yy$mm$dd$hh',$2,$3}'
done < file.txt
It is printing
81080512 14 15
81080512 17 18
Any help please. Thank you.
Please don't kill me for this simple answer, but what about this:
cut -c 3- file.txt
You simply drop the first two digits by printing from character 3 to the end of every line (the -c switch tells cut to select characters rather than bytes).
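A quick sanity check on one line of the sample input (inline via printf here, rather than file.txt):

```shell
printf '1981080512 14 15\n' | cut -c 3-
# prints: 81080512 14 15
```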
You can do it using a single GNU AWK substr as follows. Let file.txt content be
1981080512 14 15
2019050612 17 18
2020040912 19 95
then
awk '{$1=substr($1,3);print}' file.txt
output
81080512 14 15
19050612 17 18
20040912 19 95
Explanation: I used the substr function to take the 3rd and following characters of the 1st column and assign them back to that column, then print the changed line.
(tested in gawk 4.2.1)

how to use shell to split string into correct format?

I have this file with time durations. Some have days, but most are in hh:mm form; the full form is dd+hh:mm.
I was trying to "tr -s '+:' ':'" them into dd:hh:mm form and then split($1,tm,":") to calculate seconds.
However, the problem I am facing is that after this operation, the hh:mm form ends up with hh in tm[1], whereas the dd:hh:mm form has dd in tm[1].
Is there a way to put the hh of the hh:mm form into tm[2] and set tm[1] to 0, please?
4+11:26
10+06:54
20:27
is the input
the output I wanted would be(in form of tm[1], tm[2], tm[3]):
4 11 26
10 06 54
0 20 27
I would first preprocess it with sed (to add the missing 0+ in lines that don't have a plus sign) and then tr the + and : to spaces:
cat a.txt | sed 's/^\([^+]\+\)$/0+\1/g' | tr '+:' ' '
Or as suggested by Lars, shorter sed version:
cat a.txt | sed '/+/! s/^/0+/;' | tr '+:' ' '
awk to the rescue!
You can do the conversion and computation in awk; using your input file, the values are converted to minutes:
$ awk -F: '{if($1~/\+/){split($1,f,"+");h=f[1]*24+f[2]}
else h=$1; m=h*60+$2; print $0 " --> " m}' file
4+11:26 --> 6446
10+06:54 --> 14814
20:27 --> 1227
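Since the original goal was seconds, the same awk extends with one more multiplication (a sketch; it assumes the input never carries a seconds field of its own):

```shell
# dd+hh:mm (or hh:mm) to total seconds: hours = dd*24 + hh, then (h*60 + mm)*60.
printf '4+11:26\n10+06:54\n20:27\n' |
awk -F: '{
  if ($1 ~ /\+/) { split($1, f, "+"); h = f[1]*24 + f[2] } else h = $1
  print $0 " --> " (h*60 + $2) * 60
}'
```

For example 4+11:26 becomes (4*24+11)*60+26 = 6446 minutes, i.e. 386760 seconds.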

How to count all numbers in a file with awk?

I want to count all numbers that are in a file.
Example:
input -> Hi, this is 25 ...
input -> Lalala 21 or 29 what is ... 79?
The output should be the sum of all numbers: 154 (that is, 25+21+29+79).
From this beautiful answer by hek2mgl on how to extract the biggest number in a file, let's catch all the numbers in the file and sum them:
$ awk '{for(i=1;i<=NF;i++){sum+=$i}}END{print sum}' RS='$' FPAT='-{0,1}[0-9]+' file
154
This sets the record separator so that the whole block of text is a single record. Then it sets FPAT so that every number (positive or negative) becomes a separate field:
FPAT #
A regular expression (as a string) that tells gawk to create the
fields based on text that matches the regular expression. Assigning a
value to FPAT overrides the use of FS and FIELDWIDTHS for field
splitting.
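FPAT is a gawk extension; on a POSIX system without gawk, one portable sketch squeezes every non-digit run to a newline with tr first (note this ignores signs, so it only handles non-negative integers):

```shell
# Turn every run of non-digits into a single newline, then sum one number per line.
printf 'Hi, this is 25 ...\nLalala 21 or 29 what is ... 79?\n' |
tr -cs '0-9' '\n' |
awk '{sum += $1} END {print sum}'
# prints: 154
```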
$ cat data
Hi, this is 25 ...
Lalala 21 or 29 what is ... 79?
$ grep -oP '\b\d+\b' data | paste -s -d '+' | bc
154
With grep and awk :
$ cat test.txt
Hi, this is 25 ...
Lalala 21 or 29 what is ... 79?
$ grep '[0-9]\+' -o test.txt | awk '{ sum+=$1} END {print sum}'
154

Compare if two columns of a file are identical in linux

I would like to compare whether two columns (mid) in a file are identical to each other. I am not sure how to do it, since the original file I am working on is rather huge (in GB).
file 1 (column 1 and column 4 - to check if they are identical)
mid A1 A2 mid A3 A4 A5 A6
18 we gf 18 32 23 45 89
19 ew fg 19 33 24 46 90
21 ew fg 21 35 26 48 92
Thanks
M
If you just need to find the differing rows, awk will do:
awk '$1!=$4{print $1,$4}' data
You can also check using diff and awk for a more advanced comparison:
diff <(awk '{print $1}' data) <(awk '{print $4}' data)
The status code ($?) of this command will tell you whether they are the same (zero) or different (non-zero).
You can use that in a bash expression like this too:
if diff <(awk '{print $1}' data) <(awk '{print $4}' data) >& /dev/null;
then
echo same;
else
echo different;
fi
Something like this:
awk '{ if ($1 == $4) { print "same"; } else { print "different"; } }' < foo.txt
Completing the answer by Shiplu Mokaddim a little: if you have another delimiter, for example in a CSV file, you can use:
awk -F';' '$1!=$4{print $1,$4}' data.csv | sed -r 's/ /;/g'
In this sample, the delimiter is ";". The sed command at the end replaces the delimiter back with the original one. Be sure that you don't have other spaces in your data, e.g. in date-time fields.
Question: Compare two column values in the same file.
Answer:
cut -d, -f1 a.txt > b.txt ; cut -d, -f3 a.txt > c.txt ; cmp b.txt c.txt && echo "Column values are same"; rm -rf b.txt c.txt
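For a file measured in GB, a single pass that stops at the first mismatch avoids the temporary files entirely (a sketch; it assumes whitespace-separated columns 1 and 4 as in the question, with inline sample data):

```shell
# awk exits non-zero at the first row where columns 1 and 4 differ.
printf '18 we gf 18 32\n19 ew fg 19 33\n' |
awk '$1 != $4 { exit 1 }' && echo "same" || echo "different"
```

Because awk exits at the first difference, the worst case (identical columns) reads the file once, and a mismatch near the top returns almost immediately.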

Slice 3TB log file with sed, awk & xargs?

I need to slice several TB of log data, and would prefer the speed of the command line.
I'll split the file up into chunks before processing, but need to remove some sections.
Here's an example of the format:
uuJ oPz eeOO 109 66 8
uuJ oPz eeOO 48 0 221
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 2 9 771
mxmx lo uUui 577 765 27878456
The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.
I want to keep only the last line in each group.
If there's only 1 line in a group, it should be kept.
Here's the expected output:
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 577 765 27878456
How can I do this with sed, awk, xargs and friends, or should I just use something higher level like Python?
awk -F '\t' '
NR==1 {key=$1}
$1!=key {print line; key=$1}
{line=$0}
END {print line}
' file_in > file_out
Try this:
awk 'BEGIN{FS="\t"}
{if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1} lastLine=$0}
END{print lastLine}'
It saves the last line and prints it only when it notices that the key has changed.
This might work for you:
sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
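If GNU tac is available, another sketch reverses the file, keeps the first line seen per key, and reverses back (it assumes the key is the first three space-separated fields, as in the question; inline sample data stands in for the real log):

```shell
# Last line per key in the file == first line per key in the reversed file.
printf 'a b c\t1\na b c\t2\nd e f\t3\n' |
tac |
awk '!seen[$1, $2, $3]++' |
tac
```

awk's memory here grows with the number of distinct keys rather than lines, but tac itself buffers, so the streaming key-change awk above is the better fit for 3 TB.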
