Slice 3TB log file with sed, awk & xargs? - linux

I need to slice several TB of log data, and would prefer the speed of the command line.
I'll split the file up into chunks before processing, but need to remove some sections.
Here's an example of the format:
uuJ oPz eeOO 109 66 8
uuJ oPz eeOO 48 0 221
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 2 9 771
mxmx lo uUui 577 765 27878456
The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.
I want to keep only the last line in each group.
If there's only 1 line in a group, it should be kept.
Here's the expected output:
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 577 765 27878456
How can I do this with sed, awk, xargs and friends, or should I just use something higher-level like Python?

awk -F '\t' '
NR==1   {key=$1}             # initialise the key from the first line
$1!=key {print line; key=$1} # new key: print the last line of the previous group
        {line=$0}            # remember the current line
END     {print line}         # print the last line of the final group
' file_in > file_out

Try this:
awk 'BEGIN{FS="\t"}
{if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1} lastLine=$0}
END{print lastLine}'
It saves the last line and prints it only when it notices that the key has changed.

This might work for you:
sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
It appends the next line to the pattern space; if both lines start with the same three space-separated fields, it discards the earlier line and loops, so when the keys finally differ, P prints the surviving last line of the group and D restarts the cycle.

Related

Merge every two rows into one and sum multiple entries

I am struggling a bit with the output, as I need to merge every second row with the first, sort, and add up all the duplicate entries.
Sample data:
bittorrent_block(PCC)
127
default_384k(PCC)
28
default_384k(BWM)
28
bittorrent_block(PCC)
127
default_384k(PCC)
28
default_384k(BWM)
28
Convert the 2nd row into a column (expected):
bittorrent_block(PCC): 127
default_384k(PCC): 28
default_384k(BWM): 28
bittorrent_block(PCC): 127
default_384k(PCC): 28
default_384k(BWM): 28
Sum all duplicate entries (expected):
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56
These are the pieces of code I tried, and what I finally got:
zcat file.tar.gz | awk 'NR%2{v=$0;next;}{print $0,v}'
bittorrent_block(PCC)
default_384k(PCC)
default_384k(BWM)
default_mk1(PCC)
default_mk1_10m(PCC)
zcat file.tar.gz | awk 'NR%2{ prev = $0; next }{ print prev, $0 }'
127orrent_block(PCC)
28ault_384k(PCC)
28ault_384k(BWM)
Because of this, I am not able to sum up the duplicate values.
Please help.
I often find it easier to transform the input first and then process it. paste helps to convert consecutive lines into columns; then summing the numbers with awk becomes trivial:
$ <input paste -sd'\t\n' | awk '{sum[$1] += $2}END{for(s in sum) print s": "sum[s]}'
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56
It seems like you've got CRLF line endings in your file, so you'll have to strip them:
zcat file.tar.gz |
awk -F '\r' -v OFS=': ' '
NR % 2 { id = $1; next }
{ sum[id] += $1 }
END { for (id in sum) print id, sum[id] }
'
bittorrent_block(PCC): 254
default_384k(PCC): 56
default_384k(BWM): 56
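As a variant (my addition, not part of the answer above), you could strip the carriage returns up front with tr and then reuse the paste approach:
zcat file.tar.gz | tr -d '\r' |
paste - - |
awk '{sum[$1] += $2} END {for (s in sum) print s": "sum[s]}'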
Here is a Ruby one-liner to do that:
zcat file | ruby -e '$<.read.split(/\R/).
each_slice(2).
each_with_object(Hash.new {|h,k| h[k] = 0}) {
|(k,v), h| h[k] = h[k]+v.to_i
}.
each{|k,v| puts "#{k}: #{v}"}'
By splitting on \R this automatically handles either DOS or Unix line endings.
This might work for you (GNU sed, sort and bash):
zcat file |
paste - - |
sort |
uniq -c |
sed -E 's/^ *(\S+) (.*)\t(\S+)/echo "\2: $((\1*\3))"/e'
Decompress file.
Join pairs of lines.
Sort.
Count duplicate lines.
Format and compute final sums.
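To illustrate (my own trace; the exact uniq -c column spacing may differ), the intermediate output on the sample data, just before the sed stage, is:
$ zcat file | paste - - | sort | uniq -c
      2 bittorrent_block(PCC)	127
      2 default_384k(BWM)	28
      2 default_384k(PCC)	28
The sed then rewrites each line into a command such as echo "bittorrent_block(PCC): $((2*127))", and the e flag executes it, printing the summed total.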

How to use awk '{print $1*Number}' from the second line, or tell it to ignore NaN values?

I have a file called 'waterproofposters.jsonl' with this type of output:
Regular price
100
200
300
400
500
And I need to take out 2% of each value. I have used the following code:
awk '{print $1*0.98}' waterproofposters.jsonl
And then I have the following output:
0
98
196
294
392
490
And then I'm stuck because I need to have 'Regular price' in the first line instead of '0'.
I thought to replace '0' with 'Regular price' using:
find . -name "waterproof.jsonl" | xargs sed -i -e 's/0/Regular price/g'
But it will replace every '0' with 'Regular price'.
To print the first line as-is:
awk '{print (NR>1 ? $0*0.98 : $0)}'
To print lines that are not a number as-is:
awk '{print ($0+0 == $0 ? $0*0.98 : $0)}'
I'm using $0 instead of $1 in the multiplication because:
They're the same thing in your numerical input, and
I aesthetically prefer using the same value across the whole script rather than different values for the numeric vs non-numeric lines, and
When you use a specific field it causes awk to do field-splitting so it's a bit more efficient to not reference a field when the whole record will do.
Here's both of the above working with the posted sample input:
$ awk '{print (NR>1 ? $0*0.98 : $0)}' file
Regular price
98
196
294
392
490
$ awk '{print ($0+0 == $0 ? $0*0.98 : $0)}' file
Regular price
98
196
294
392
490
And here's the difference between the two, given an input file that has a non-numeric value in the middle:
$ cat file
Regular price
100
200
foobar
400
500
$ awk '{print (NR>1 ? $0*0.98 : $0)}' file
Regular price
98
196
0
392
490
$ awk '{print ($0+0 == $0 ? $0*0.98 : $0)}' file
Regular price
98
196
foobar
392
490
You can certainly achieve what you need with a single awk call, but an answer to why your sed -i -e 's/0/Regular price/g' command did not work as expected is that you used 0 as the regex pattern. 0 matches any zero char inside the string.
You want to replace 0s that are the only char on a line.
Hence, you need to use ^ and $ anchors to match the start and end of the line respectively:
sed -i 's/^0$/Regular price/'
If you need to replace on the first line only, add the 1 address before the substitution command:
sed -i '1 s/^0$/Regular price/'
Note you do not need g, since you only expect one replacement per line and g is only needed when performing multiple replacements on a line. By default, all lines will get processed.
How to use awk '{print $1*Number}' from the second line or tell it to ignore NaN values?
I would do it the following way using GNU AWK. Let file.txt's content be:
Regular price
100
200
300
400
500
then
awk 'NR==1{print}NR>=2{print $1*0.98}' file.txt
output
Regular price
98
196
294
392
490
Explanation: if it is the 1st line, just print it; if it is the 2nd or a later line, print 0.98 times the value of the 1st column.
(tested in GNU Awk 5.0.1)

Find strings from one file that are not in lines of another file

In a bash shell script, I need to create a file with strings from file 1 that are not found in lines from file 2. File 1 is opened through a for loop of files in a directory.
files=./Output/*
for f in $files
do
  : # the per-file processing goes here
done
I have very large files, so using grep isn't ideal. I previously tried:
awk 'NR==FNR{A[$2]=$0;next}!($2 in A){print }' file2 file1 > file3
file 1:
NB551674:136:HHVMJAFX2:1:11101:18246:1165
NB551674:136:HHVMJAFX2:1:11101:10296:1192
NB551674:136:HHVMJAFX2:1:11101:13281:1192
NB551674:136:HHVMJAFX2:2:21204:11743:6409
file 2:
aggggcgttccgcagtcgacaagggctgaaaaa|AbaeA1 NB551674:136:HHVMJAFX2:2:21204:11743:6409 100.000 32 0 0 1 32 83 114 7.30e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:4:21611:6341:1845 100.000 32 0 0 1 32 27 58 6.70e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:4:11504:1547:13124 100.000 32 0 0 1 32 88 119 6.70e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:3:11410:11337:15451 100.000 32 0 0 1 32 27 58 6.70e-10 60.2
expected output:
NB551674:136:HHVMJAFX2:2:21204:11743:6409
You were close - file1 only has 1 field ($1) but you were trying to use $2 in the hash lookup ($2 in A). Do this instead:
$ awk 'NR==FNR{a[$2]; next} !($1 in a)' file2 file1
NB551674:136:HHVMJAFX2:1:11101:18246:1165
NB551674:136:HHVMJAFX2:1:11101:10296:1192
NB551674:136:HHVMJAFX2:1:11101:13281:1192
By the way, don't use all upper case for user-defined variable names in awk or shell, to avoid clashes with builtin variables (among other reasons).
Use comm, which requires sorted files. Print the second field of file2 using a Perl one-liner (or cut):
comm -23 <(sort file1) <(perl -lane 'print $F[1]' file2 | sort)
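The cut variant (my sketch; it assumes single spaces between the fields of file2, since cut, unlike Perl's -a autosplit, does not treat runs of whitespace as one separator):
comm -23 <(sort file1) <(cut -d' ' -f2 file2 | sort)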
Don't do that one-line-on-the-left versus one-line-on-the-right comparison mess.
Use gawk in bytes mode (the -b option), or preferably mawk, and preload every line of file 1 into an array, using the strings themselves as the array's hash indices instead of just numerical 1, 2, 3, ...
Set FS to the same value as ORS, to prevent awk from unnecessarily attempting to split each string into fields.
Close file 1, open file 2, then for each string in file 2 delete the corresponding entry from the array.
Close file 2.
In the END section, print out whatever is left inside the array: that's your set. A sketch of this approach follows.
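Here is a minimal sketch of that recipe (my own rendering of it; it assumes, as in the sample, that the IDs sit in the 2nd whitespace-separated field of file2):
mawk 'BEGIN { FS = ORS }              # no field splitting while reading file1
NR == FNR { seen[$0]; next }          # every line of file1 becomes an array index
{ split($0, f, /[ \t]+/)              # file2: split out the fields ourselves
  delete seen[f[2]] }                 # drop each id that file2 mentions
END { for (id in seen) print id }     # whatever remains was never matched
' file1 file2
Note that for (id in seen) makes no ordering promise, so pipe the result through sort if order matters.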

GNU Awk - don't modify whitespaces

I am using GNU Awk to replace a single character in a file. The file is a single line with varying whitespacing between "fields". After passing through gawk all the extra whitespacing is removed and I end up with single spaces. This is completely unintended and I need it to ignore these spaces and only change the one character I have targeted. I have tried several variations, but I cannot seem to get gawk to ignore these extra spaces.
Since I know this will come up, I read from the end of the line for replacement because the whitespacing is arbitrary/inconsistent in the source file.
Command:
gawk -i inplace -v new=3 'NF {$(NF-5) = new} 1' ~/scripts/tmp_beta_weather_file
Original file example:
2020-07-01 18:29:51.00 C M -11.4 28.9 29 9 23 5.5 000 0 0 00020 044013.77074 1 1 1 3 0 0
Result after command above:
2020-07-01 18:30:51.00 C M -11.8 28.8 29 5 23 5.5 000 0 0 00020 044013.77143 3 1 1 3 0 0
Assigning a field makes awk rebuild $0 using OFS, which is what flattens your spacing (see the demo below the sed command), so it might be easier with sed:
sed -E 's/([^ ]+)(( [^ ]+){5})$/3\2/' file
Test it, then add -i for an in-place edit.
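To see the behaviour that flattens the whitespace (my illustration): assigning any field, even to itself, makes awk rebuild $0 with every field joined by OFS, which defaults to a single space:
$ echo 'a   b     c' | gawk '{$1 = $1} 1'
a b c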

Match specific column with grep command

I am having trouble matching a specific column with the grep command. I have a test file (test.txt) like this:
Bra001325 835 T 13 c$c$c$c$c$cccccCcc !!!!!68886676
Bra001325 836 C 8 ,,,,,.,, 68886676
Bra001325 841 A 6 ,$,.,,. BJJJJE
Bra001325 866 C 2 ,. HJ
And I want to extract all the lines that have the number 866 in the second column. When I use grep, I get every line that contains that number anywhere:
grep "866" test.txt
Bra001325 835 T 13 c$c$c$c$c$cccccCcc !!!!!68886676
Bra001325 836 C 8 ,,,,,.,, 68886676
Bra001325 866 C 2 ,. HJ
How can I match a specific column with grep?
Try doing this:
$ awk '$2 == 866' test.txt
No need to add {print}, the default behaviour of awk is to print on a true condition.
With grep:
$ grep -P '^\S+\s+866\b' *
But awk can print filenames too, and is rather more robust than grep here:
$ awk '$2 == 866{print FILENAME":"$0; nextfile}' *
In my case, the field separator is not a space but a comma, so I had to add this or it wouldn't work for me (on Ubuntu 18.04.1):
awk -F ', ' '$2 == 866' test.txt
