awk conditionals for grep-like things and sum of the two - linux

I have data in a file with numbers in denominations like GB and TB, and I have to sum them together.
Below is the file data:
$ cat data_in_transit| awk '/TB/{print $6}'
1.26TB
1.24TB
2.85TB
1.03TB
1.07TB
1.01TB
$ cat data_in_transit| awk '/GB/{print $6}'
962.2GB
1005GB
892.5GB
910.0GB
823.4GB
1008GB
426.4GB
168.6GB
208.1GB
511.3GB
787.5GB
448.0GB
509.6GB
496.1GB
550.7GB
I can calculate them individually; however, I want the two queries below to be summed into one.
Anything matching GB:
$ awk '/GB/{sumGB+=$6}END{printf ("%.2f\n", sumGB / 1024)}' seoul_data_in_transit
9.48
Anything matching TB:
$ awk '/TB/{sumTB+=$6}END{printf ("%.2f\n", sumTB)}' seoul_data_in_transit
8.46
Please suggest.

awk '$6~/GB/{s+=$6}$6~/TB/{s+=$6 * 1024}END{print s/1024,"TB"}' file
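With the sample data above, this should print roughly 17.94 TB, i.e. the 9.48 and 8.46 from the two separate queries added together. Spread out with comments, the same idea reads like this (a sketch; it assumes the size is always in field 6 and always carries a GB or TB suffix):
awk '$6 ~ /GB/ { s += $6 }              # GB values accumulate as-is (in GB)
     $6 ~ /TB/ { s += $6 * 1024 }       # TB values are converted to GB first
     END       { printf "%.2f TB\n", s / 1024 }   # grand total, reported in TB
    ' file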

Assuming the current summation code generates the correct results:
awk '
/GB/ { sumGB += $6 }
/TB/ { sumTB += $6 }
END  {
    printf ("%.2f GB\n", sumGB / 1024)
    printf ("%.2f TB\n", sumTB)
}
' seoul_data_in_transit
Which should generate:
9.48 GB
8.46 TB

This is probably overkill, but here's a unified way to sum up sizes for anything from kibibytes to yottabytes down to individual bytes, without hard-coding the units one by one:
mawk 'BEGIN {
OFS = "\r\t\t"
_ = "_KMGTPEZY"
gsub("[^_]","&B",_)
_+=_^=__=___=____*=_____=_
OFMT = \
CONVFMT = "%\047"((__=-_-+-++_^_)+_*_)".f"
__*=++_+!(_-=_=___=_*(_+_))
} $++NF=____+=int(___^index(_____, substr($!_,length($!_)-!_))*$!_)'
509.6GB 547,178,833,510
509.6MB 547,713,187,839
550.7MB 548,290,638,642
2.85TB 3,681,898,777,803
168.6MB 3,682,075,567,716
1.01KB 3,682,075,568,750
1.03TB 4,814,572,545,359
962.2MB 4,815,581,485,186
448.0GB 5,296,617,822,338
962.2GB 6,329,772,205,390
448.0MB 6,330,241,967,438
823.4MB 6,331,105,364,916
823.4GB 7,215,224,382,797
1.07KB 7,215,224,383,892
550.7GB 7,806,534,006,368
511.3GB 8,355,538,200,979
892.5MB 8,356,474,055,059
1.26TB 9,741,858,706,056
511.3MB 9,742,394,842,964
496.1MB 9,742,915,041,517
1.07TB 10,919,392,483,237
426.4GB 11,377,235,996,990
1.24KB 11,377,235,998,259
426.4MB 11,377,683,111,065
208.1MB 11,377,901,319,730
1008MB 11,378,958,284,338
787.5GB 12,224,529,970,738
892.5GB 13,182,844,548,658
208.1GB 13,406,290,222,232
1005MB 13,407,344,041,112
910.0GB 14,384,449,100,952
2.85KB 14,384,449,103,870
1008GB 15,466,780,862,462
168.6GB 15,647,813,733,988
1.26KB 15,647,813,735,278
787.5MB 15,648,639,488,878
1005GB 16,727,750,021,998
910.0MB 16,728,704,226,158
1.03KB 16,728,704,227,212
1.24TB 18,092,098,645,654
496.1GB 18,624,781,964,540
1.01TB 19,735,288,708,593
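A more conventional and readable take on the same idea, for comparison, just maps the unit letter to a power of 1024 with index() and prints one grand total in bytes (a sketch, assuming the size is the first field of each line):
awk '{
    units = "KMGTPEZY"                        # 1024^1 .. 1024^8
    val   = $1 + 0                            # numeric part of e.g. "509.6GB"
    unit  = substr($1, length($1) - 1, 1)     # the letter in front of the trailing B
    total += val * 1024 ^ index(units, unit)  # index() is 0 for a bare byte count
}
END { printf "%.0f bytes\n", total }' file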

Related

Is there any command to do fuzzy matching in Linux based on multiple columns

I have two CSV files.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is use the crosswalk file, File 2, to recover those observations' PIDs in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to recover as many PIDs as possible (the level of accuracy should of course be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to fill in as many of File 1's PIDs as possible while keeping accuracy high, I'm considering three steps. First, use a command that assigns a PID to an observation in File 1 only if FNAME, MNAME, LNAME, GENDER, and DOB all match exactly. The output should be:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command that, for the observations whose PID was not identified in the first step, requires DOB to match exactly but uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to recover the PID. The output after these two steps is supposed to be:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command that fuzzy-matches all the related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PIDs. The final output is expected to be:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data is about 100 GB, I want to handle this with Linux tools.
But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to each value, e.g. field[2]["66M"]="S2"; then, for each record in file1, it counts the number of PID matches and prints the PID with the biggest count:
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="#val_num_desc"
}
NR==FNR {                                # file2
    for(i=1;i<=6;i++)                    # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                        # file1
    for(i=1;i<=6;i++) {                  # fields 1-6
        if($i in field[i]) {             # if value matches
            split(field[i][$i],t,FS)     # get PIDs
            for(j in t) {                # and
                matches[t[j]]++          # increase PID counts
            }
        } else {                         # if no value match
            for(j in field[i])           # for all field values
                if($i~j || j~$i)         # "go fuzzy" :D
                    matches[field[i][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                  # the best match first
        print $0,i
        delete matches
        break                            # we only want the best match
    }
}
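Run it with the crosswalk file first, so the NR==FNR block reads it (assuming the script above is saved as fuzzy.awk; the file name is just for illustration):
gawk -f fuzzy.awk file2 file1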
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is the naive if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta Code seems to have one.
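For reference, a Levenshtein distance function in awk might look something like the sketch below (an illustration, not part of the original answer); the naive test could then become something like if (levenshtein($i, j) <= 2):
function levenshtein(a, b,    la, lb, i, j, cost, d, m) {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i            # cost of deleting i characters
    for (j = 0; j <= lb; j++) d[0, j] = j            # cost of inserting j characters
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
            m = d[i-1, j] + 1                        # deletion
            if (d[i, j-1] + 1 < m)      m = d[i, j-1] + 1        # insertion
            if (d[i-1, j-1] + cost < m) m = d[i-1, j-1] + cost   # substitution
            d[i, j] = m
        }
    return d[la, lb]
}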
You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="#val_num_desc"
    map[1]=1                             # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                # file2
    for(i=1;i<=6;i++)                    # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                        # file1
    for(i in map) {
        if($i in field[map[i]]) {        # if value matches
            split(field[map[i]][$i],t,FS)  # get PIDs
            for(j in t) {                # and
                matches[t[j]]++          # increase PID counts
            }
        } else {                         # if no value match
            for(j in field[map[i]])      # for all field values
                if($i~j || j~$i)         # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                  # the best match first
        print $0,i
        delete matches
        break                            # we only want the best match
    }
}

separate .txt file to csv file

I am trying to convert a txt file to CSV but it doesn't work.
Original text:
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
Expected result:
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
I have tried this, but it doesn't handle the space and comma constraints:
awk 'BEGIN{print "text,value"}{print $1","$2}' ifile.txt
I have also tried this with Python, but it doesn't keep all of the text:
import pandas as pd
df = pd.read_fwf('log.txt')
df.to_csv('log.csv')
Your request is unclear; how do you want to format the last field?
I created a script that aligns the last field on column 60.
script.awk
BEGIN {printf("text%61s\n","value")} # formatted printing heading line
{
lastField = $NF; # store current last field into var
$NF = ""; # remove last field from line
alignLen = 60 - length() + length(lastField); # compute last field alignment
alignFormat = "%s%"alignLen"s\n"; # create printf format for computed alignment
printf(alignFormat, $0, lastField); # format print current line and last field
}
run script.awk
awk -f script.awk ifile.txt
output
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
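If a real comma-separated CSV is what's wanted instead of aligned columns, a sketch along these lines may be closer (it quotes the free-text part since it contains spaces and possibly commas; the exact output format is an assumption):
awk 'BEGIN { print "text,value" }
{
    label = $NF                                 # last field is the label (OBJ/POS/NEG/NEUTRAL)
    sub(/[[:space:]]+[^[:space:]]+$/, "")       # strip that label from the line
    gsub(/"/, "\"\"")                           # escape any embedded double quotes
    printf "\"%s\",%s\n", $0, label             # quote the text, append the label
}' ifile.txt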

force linux sort to use lexicographic order

I generated a text file with pseudo-random numbers like this:
-853340442 1130519212 -2070936922
-707168664 -2076185735 -2135012102
166464098 1928545126 5768715
1060168276 -684694617 395859713
-680897578 -2095893176 1457930442
299309402 192205833 1878010157
-678911642 2062673581 -1801057195
795693402 -631504846 2117889796
448959250 547707556 -1115929024
168558507 7468411 1600190097
-746131117 1557335455 73377787
-1144524558 2143073647 -2044347857
1862106004 -193937480 1596949168
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
Now I try to put it in order using the Linux sort command:
sort prng >prngsorted
The result is not what I expected:
1060168276 -684694617 395859713
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
795693402 -631504846 2117889796
-853340442 1130519212 -2070936922
Obviously, sort tries to parse strings and extract numbers for sorting. And it seems to ignore minus signs.
Is it possible to force sort to be a bit dumber and just compare lines lexicographically? The result should be like this:
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796
Note: I tried -d option but it did not help
Note 2: Probably I should use another utility instead of sort?
The sort command takes your locale settings into account, and many locales ignore dashes for collation.
You can get appropriate sorting with
LC_COLLATE=C sort filename
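Applied to the question's file, for example:
LC_COLLATE=C sort prng >prngsorted
should produce the lexicographic ordering shown above. (If LC_ALL is already set in the environment it overrides LC_COLLATE, so LC_ALL=C sort prng >prngsorted may be needed instead.)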
A custom sort with the help of awk:
$ awk '{print ($1<0?"-":"+") "\t" $0}' file | sort -k1,1 -k2 | cut -f2-
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796
Sort by sign first, then do a regular sort within each group, and remove the sign key afterwards.

To Replace Numerical Values in a certain column with other Numerical Values

I have data as below:
83997000|17561815|20370101000000 83997000|3585618|20370101000000
83941746|13898890|20361231230000 83940169|13842974|20171124205011
83999444|3585618|20370101000000 83943970|10560874|20370101000000
83942000|13898890|20371232230000 83999333|3585618|20350101120000
Now, what I want to achieve is as below:
If column 2 is 17561815, print 22220 to replace 17561815.
If column 2 is 3585618, print 23330 to replace 3585618.
If column 2 is 13898890, print 24440 to replace 13898890.
If column 2 is 13842974, print 25550 to replace 13842974.
If column 2 is 3585618, print 26660 to replace 3585618.
If column 2 is 10560874, print 27770 to replace 10560874.
Output to be like this:
83997000|22220|20370101000000 83997000|23330|20370101000000
83941746|24440|20361231230000 83940169|25550|20171124205011
83999444|26660|20370101000000 83943970|27770|20370101000000
83942000|24440|20371232230000 83999333|26660|20350101120000
awk solution:
awk 'BEGIN{
FS=OFS="|";
a["17561815"]=22220; a["13898890"]=24440;
a["3585618"]=26660; a["13842974"]=25550;
a["10560874"]=27770
}
$2 in a{ $2=a[$2] }
$4 in a{ $4=a[$4] }1' file
The output (note that the question maps 3585618 to both 23330 and 26660; the script above uses 26660, so the first line differs from the expected output shown in the question):
83997000|22220|20370101000000 83997000|26660|20370101000000
83941746|24440|20361231230000 83940169|25550|20171124205011
83999444|26660|20370101000000 83943970|27770|20370101000000
83942000|24440|20371232230000 83999333|26660|20350101120000
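If the list of replacements grows, the pairs could also be kept in a separate lookup file instead of being hard-coded in the BEGIN block (a sketch; a lookup.txt containing old|new lines is an assumption):
awk 'BEGIN{ FS=OFS="|" }
     NR==FNR{ a[$1]=$2; next }                      # read the old|new pairs first
     { if($2 in a) $2=a[$2]; if($4 in a) $4=a[$4] } # replace columns 2 and 4 if mapped
     1' lookup.txt file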

Compare different items in two files and output the combined result to a new file using awk

Greetings!
I have files in pairs taken from two nodes in a network; each file has records of TCP segment send/receive times, IP id numbers, segment types, seq numbers and so on.
For the same TCP flow, it looks like this on the sender side:
1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
1420862364.780810 50371 seq 20296:21744
....
or like this on the receiver side (1 second delay, segment with IP id 50371 lost):
1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
....
I want to compare the IP identification numbers in the two files and output to a new file like this:
1420862364.778332 1420862365.778332 50369 seq 17400:18848 o
1420862364.780798 1420862365.780798 50370 seq 18848:20296 o
1420862364.780810 1420862365.780810 50371 seq 20296:21744 x
which has the time of arrival on the receiver side; by comparing the id field, an x is added when the same value is not found on the receiver side (packet loss), otherwise an o.
I already have code like this:
awk 'ARGIND==1 { w[$2]=$1 }
     ARGIND==2 {
         flag=0;
         for(a in w)
             if($2==a) {
                 flag=1;
                 print $1,w[a],$2,$3,$4;
                 break;
             }
         if(!flag)
             print $1,"x",$2,$3,$4;
     }' file2 file1 >file3
but it doesn't work on Linux: it stops right after I press Enter and leaves only an empty file.
The shell script containing this code has been through chmod +x.
Please help. My code is not well organized; any new one-liner will be appreciated.
Thank you for your time.
ARGIND is gawk-specific btw so check your awk version. – Ed Morton
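As that comment points out, ARGIND only exists in GNU awk, so with another awk neither block ever matches and the output file ends up empty, which fits the symptom described. A portable sketch of the same idea uses FNR==NR and a direct hash lookup instead of the inner loop (the output layout follows the question's description; the "-" placeholder for the missing receiver time is an assumption):
awk 'FNR==NR { recv[$2]=$1; next }                      # receiver file: time keyed by IP id
     $2 in recv    { print $1, recv[$2], $2, $3, $4, "o" }   # segment seen on both sides
     !($2 in recv) { print $1, "-",      $2, $3, $4, "x" }   # segment lost: no receiver time
    ' file2 file1 >file3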
