awk conditionals for grep-like matching and summing the two - linux
I have data in a file containing numbers with denominations like GB and TB, and I have to sum them together.
Below is the file data:
$ cat data_in_transit| awk '/TB/{print $6}'
1.26TB
1.24TB
2.85TB
1.03TB
1.07TB
1.01TB
$ cat data_in_transit| awk '/GB/{print $6}'
962.2GB
1005GB
892.5GB
910.0GB
823.4GB
1008GB
426.4GB
168.6GB
208.1GB
511.3GB
787.5GB
448.0GB
509.6GB
496.1GB
550.7GB
I can calculate them individually; however, I want the two queries below combined into one.
Anything ending in GB:
$ awk '/GB/{sumGB+=$6}END{printf ("%.2f\n", sumGB / 1024)}' seoul_data_in_transit
9.48
Anything ending in TB:
$ awk '/TB/{sumTB+=$6}END{printf ("%.2f\n", sumTB)}' seoul_data_in_transit
8.46
Please suggest a solution.
awk '$6~/GB/{s+=$6}$6~/TB/{s+=$6 * 1024}END{print s/1024,"TB"}' file
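For readability, the same approach can be laid out across lines with comments (a sketch; it assumes, as in the question, that the size sits in field 6 with a GB or TB suffix):
awk '
$6 ~ /GB/ { s += $6 }                 # GB values count as-is; awk reads "962.2GB" as 962.2
$6 ~ /TB/ { s += $6 * 1024 }          # TB values are scaled up to GB
END       { print s / 1024, "TB" }    # grand total, converted to TB
' file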
Assuming the current summation code generates the correct results:
awk '
/GB/ { sumGB+=$6 }
/TB/ { sumTB+=$6 }
END { printf ("%.2f GB\n", sumGB / 1024)
printf ("%.2f TB\n", sumTB)
}
' seoul_data_in_transit
Which should generate:
9.48 GB
8.46 TB
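If a single combined number is also wanted, one more line in the END block adds the two sums on the terabyte scale (sumGB is in GB, so dividing by 1024 puts it on the same scale as sumTB); a small sketch of that extra line:
printf ("%.2f TB combined\n", sumGB / 1024 + sumTB)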
This is probably way overkill, but here's a unified way to sum up sizes for anything from kibibytes to yottabytes, down to individual bytes, without hard-coding them one by one:
mawk 'BEGIN {
OFS = "\r\t\t"
_ = "_KMGTPEZY"
gsub("[^_]","&B",_)
_+=_^=__=___=____*=_____=_
OFMT = \
CONVFMT = "%\047"((__=-_-+-++_^_)+_*_)".f"
__*=++_+!(_-=_=___=_*(_+_))
} $++NF=____+=int(___^index(_____, substr($!_,length($!_)-!_))*$!_)'
509.6GB 547,178,833,510
509.6MB 547,713,187,839
550.7MB 548,290,638,642
2.85TB 3,681,898,777,803
168.6MB 3,682,075,567,716
1.01KB 3,682,075,568,750
1.03TB 4,814,572,545,359
962.2MB 4,815,581,485,186
448.0GB 5,296,617,822,338
962.2GB 6,329,772,205,390
448.0MB 6,330,241,967,438
823.4MB 6,331,105,364,916
823.4GB 7,215,224,382,797
1.07KB 7,215,224,383,892
550.7GB 7,806,534,006,368
511.3GB 8,355,538,200,979
892.5MB 8,356,474,055,059
1.26TB 9,741,858,706,056
511.3MB 9,742,394,842,964
496.1MB 9,742,915,041,517
1.07TB 10,919,392,483,237
426.4GB 11,377,235,996,990
1.24KB 11,377,235,998,259
426.4MB 11,377,683,111,065
208.1MB 11,377,901,319,730
1008MB 11,378,958,284,338
787.5GB 12,224,529,970,738
892.5GB 13,182,844,548,658
208.1GB 13,406,290,222,232
1005MB 13,407,344,041,112
910.0GB 14,384,449,100,952
2.85KB 14,384,449,103,870
1008GB 15,466,780,862,462
168.6GB 15,647,813,733,988
1.26KB 15,647,813,735,278
787.5MB 15,648,639,488,878
1005GB 16,727,750,021,998
910.0MB 16,728,704,226,158
1.03KB 16,728,704,227,212
1.24TB 18,092,098,645,654
496.1GB 18,624,781,964,540
1.01TB 19,735,288,708,593
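For comparison, here is a less obfuscated sketch of the same idea: look the unit prefix up in a string and scale by the matching power of 1024. It assumes the size (e.g. 1.26TB, 962.2GB) is in the first field (adjust $1 if yours is elsewhere) and it prints only the final total rather than a running one:
awk '{
    units  = "KMGTPEZY"                       # 1024^1 .. 1024^8
    prefix = substr($1, length($1) - 1, 1)    # character just before the trailing B
    power  = index(units, prefix)             # 0 for plain bytes, 3 for G, 4 for T, ...
    total += $1 * 1024 ^ power                # awk reads "1.26TB" numerically as 1.26
}
END { printf "%.0f bytes\n", total }' file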
Related
Is there any command to do fuzzy matching in Linux based on multiple columns
I have two csv files.

File 1:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018

What I want to do is use the crosswalk file, File 2, to back out the PID of the observations in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to recover their PID as far as possible (of course the level of accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".

Since I want to fill in File 1's PID as far as possible while keeping high accuracy, I consider three steps. First, use a command to make sure that if and only if the information in FNAME, MNAME, LNAME, GENDER, and DOB matches completely, an observation in File 1 is assigned a PID. The output should be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,

Next, write another command that, where the DOB information is exactly the same, uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to back out the PID of the File 1 observations not identified in the first step. So the output through these two steps is supposed to be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

In the final step, use a new command to do fuzzy matching on all related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PID. So the final output is expected to be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

I need to keep the order of File 1's observations, so it must be a kind of left outer join.
Because my original data size is about 100 GB, I want to use Linux to deal with this, but I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to the value, like field[2]["66M"]="S2", and for each record in file1 it counts the number of PID matches and prints the one with the biggest count:

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="#val_num_desc"
}
NR==FNR {                                    # file2
    for(i=1;i<=6;i++)                        # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1   # attach PID to value
        }
    next
}
{                                            # file1
    for(i=1;i<=6;i++) {                      # fields 1-6
        if($i in field[i]) {                 # if value matches
            split(field[i][$i],t,FS)         # get PIDs
            for(j in t) {                    # and
                matches[t[j]]++              # increase PID counts
            }
        } else {                             # if no value match
            for(j in field[i])               # for all field values
                if($i~j || j~$i)             # "go fuzzy" :D
                    matches[field[i][j]]+=0.5 # fuzzy is half a match
        }
    }
    for(i in matches) {                      # the best match first
        print $0,i
        delete matches
        break                                # we only want the best match
    }
}

Output:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

The "fuzzy match" here is the naive if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta Code seems to have one. You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.

Update: a version that maps file1 fields to file2 fields (as mentioned in the comments):

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="#val_num_desc"
    map[1]=1                                 # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                    # file2
    for(i=1;i<=6;i++)                        # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1   # attach PID to value
        }
    next
}
{                                            # file1
    for(i in map) {
        if($i in field[map[i]]) {            # if value matches
            split(field[map[i]][$i],t,FS)    # get PIDs
            for(j in t) {                    # and
                matches[t[j]]++              # increase PID counts
            }
        } else {                             # if no value match
            for(j in field[map[i]])          # for all field values
                if($i~j || j~$i)             # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5   # fuzzy is half a match
        }
    }
    for(i in matches) {                      # the best match first
        print $0,i
        delete matches
        break                                # we only want the best match
    }
}
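Since the answer suggests swapping in a real approximate-matching algorithm, here is a minimal Levenshtein distance function in awk (an illustrative sketch; the function name and threshold are made up, and the naive if($i~j || j~$i) test could be replaced by something like levenshtein($i, j) <= 2):

function levenshtein(a, b,    i, j, la, lb, cost, d) {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i        # cost of deleting i characters
    for (j = 0; j <= lb; j++) d[0, j] = j        # cost of inserting j characters
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
            d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
        }
    return d[la, lb]
}
function min3(x, y, z) { return x < y ? (x < z ? x : z) : (y < z ? y : z) }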
separate .txt file to csv file
I am trying to convert a txt file to a csv file, but it doesn't work.

Original text:

استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL

Expected result:

text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL

I have tried this, but it doesn't handle the space and comma constraint:

awk 'BEGIN{print "text,value"}{print $1","$2"}' ifile.txt

I have also tried this with Python, but it doesn't capture all of them:

import pandas as pd
df = pd.read_fwf('log.txt')
df.to_csv('log.csv')
Your request is unclear: how do you want to format the last field? I created a script that aligns the last field on column 60.

script.awk:

BEGIN {printf("text%61s\n","value")}                # formatted printing of heading line
{
    lastField = $NF;                                # store current last field into var
    $NF = "";                                       # remove last field from line
    alignLen = 60 - length() + length(lastField);   # compute last field alignment
    alignFormat = "%s%"alignLen"s\n";               # create printf format for computed alignment
    printf(alignFormat, $0, lastField);             # format print current line and last field
}

Run script.awk:

awk -f script.awk ifile.txt

Output:

text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
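If actual CSV output is wanted rather than aligned columns, a portable sketch is to treat the last whitespace-separated field as the value and quote the rest as the text (the NF >= 2 guard and the column names are assumptions about the input):

awk 'BEGIN { print "text,value" }
NF >= 2 {
    label = $NF                              # last field: OBJ / POS / NEG / NEUTRAL
    text  = $0
    sub(/[ \t]+[^ \t]+[ \t]*$/, "", text)    # strip that trailing label from the text
    gsub(/"/, "\"\"", text)                  # escape embedded double quotes for CSV
    printf("\"%s\",%s\n", text, label)
}' ifile.txt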
force linux sort to use lexicographic order
I generated a text file with pseudo-random numbers like this:

-853340442 1130519212 -2070936922
-707168664 -2076185735 -2135012102
166464098 1928545126 5768715
1060168276 -684694617 395859713
-680897578 -2095893176 1457930442
299309402 192205833 1878010157
-678911642 2062673581 -1801057195
795693402 -631504846 2117889796
448959250 547707556 -1115929024
168558507 7468411 1600190097
-746131117 1557335455 73377787
-1144524558 2143073647 -2044347857
1862106004 -193937480 1596949168
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603

Now I try to put it in order using the Linux sort command:

sort prng >prngsorted

The result is not what I expected:

1060168276 -684694617 395859713
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
795693402 -631504846 2117889796
-853340442 1130519212 -2070936922

Obviously, sort tries to parse strings and extract numbers for sorting, and it seems to ignore minus signs. Is it possible to force sort to be a bit dumber and just compare lines lexicographically? The result should be like this:

-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796

Note: I tried the -d option but it did not help.
Note 2: Probably I should use another utility instead of sort?
The sort command takes your locale settings into account, and many locales ignore dashes for collation. You can get the appropriate ordering with:

LC_COLLATE=C sort filename
Custom sort with the help of awk:

$ awk '{print ($1<0?"-":"+") "\t" $0}' file | sort -k1,1 -k2 | cut -f2-
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796

Sort by the sign first, then do a regular sort, and remove the sign afterwards.
To Replace Numerical Values in a certain column with other Numerical Values
I have data as below:

83997000|17561815|20370101000000
83997000|3585618|20370101000000
83941746|13898890|20361231230000
83940169|13842974|20171124205011
83999444|3585618|20370101000000
83943970|10560874|20370101000000
83942000|13898890|20371232230000
83999333|3585618|20350101120000

Now, what I want to achieve is as below:

If column 2 is 17561815, print 22220 to replace 17561815.
If column 2 is 3585618, print 23330 to replace 3585618.
If column 2 is 13898890, print 24440 to replace 13898890.
If column 2 is 13842974, print 25550 to replace 13842974.
If column 2 is 3585618, print 26660 to replace 3585618.
If column 2 is 10560874, print 27770 to replace 10560874.

Output to be like this:

83997000|22220|20370101000000
83997000|23330|20370101000000
83941746|24440|20361231230000
83940169|25550|20171124205011
83999444|26660|20370101000000
83943970|27770|20370101000000
83942000|24440|20371232230000
83999333|26660|20350101120000
awk solution:

awk 'BEGIN{
    FS=OFS="|";
    a["17561815"]=22220;
    a["13898890"]=24440;
    a["3585618"]=26660;
    a["13842974"]=25550;
    a["10560874"]=27770
}
$2 in a{ $2=a[$2] }
$4 in a{ $4=a[$4] }1' file

The output:

83997000|22220|20370101000000
83997000|26660|20370101000000
83941746|24440|20361231230000
83940169|25550|20171124205011
83999444|26660|20370101000000
83943970|27770|20370101000000
83942000|24440|20371232230000
83999333|26660|20350101120000
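If the mappings are numerous, they can be kept in their own file instead of hard-coded in BEGIN. A sketch, assuming a pipe-delimited map.txt with old|new pairs (both the file name and its format are made up for illustration):

awk -F'|' -v OFS='|' '
NR == FNR { map[$1] = $2; next }     # first file: old-id|new-id pairs
$2 in map { $2 = map[$2] }           # data file: rewrite column 2 when a mapping exists
1                                    # print every line, changed or not
' map.txt file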
Compare items in two files and output the combined result to a new file using AWK
Greetings! I have some files, in pairs, taken from two nodes in a network. Each file has records of the TCP segment send/receive time, IP id number, segment type, seq number, and so on.

For the same TCP flow, it looks like this on the sender side:

1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
1420862364.780810 50371 seq 20296:21744
....

or on the receiver side (1 second delay, segment with IP id 50371 lost):

1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
....

I want to compare the IP identification numbers in the two files and output to a new file like this:

1420862364.778332 1420862365.778332 50369 seq 17400:18848 o
1420862364.780798 1420862365.780798 50370 seq 18848:20296 o
1420862364.780810 1420862365.780810 50371 seq 20296:21744 x

which has the time of arrival on the receiver side and, from comparing the id field, an x added when the same value is not found on the receiver side (packet loss), otherwise an o.

I already have code like this:

awk 'ARGIND==1 {w[$2]=$1}
ARGIND==2 {
    flag=0;
    for(a in w)
        if($2==a) {
            flag=1;
            print $1,w[a],$2,$3,$4;
            break;
        }
    if(!flag) print $1,"x",$2,$3,$4;
}' file2 file1 >file3

but it doesn't work on Linux; it stops right after I press Enter and leaves only an empty file. The shell script containing this code has been through chmod +x. Please help. My code is not well organized; any new one-liner will be appreciated. Thank you for your time.
ARGIND is gawk-specific btw so check your awk version. – Ed Morton
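Following up on that comment, here is a portable sketch of the same lookup that avoids ARGIND (and the inner loop) by using the usual NR==FNR two-file idiom; the "-" placeholder for a missing receiver time is an assumption, adjust as needed:

awk '
NR == FNR { recv[$2] = $1; next }            # receiver file: remember arrival time keyed by IP id
{
    if ($2 in recv)
        print $1, recv[$2], $2, $3, $4, "o"  # seen on the receiver: sender time, receiver time, o
    else
        print $1, "-", $2, $3, $4, "x"       # not seen on the receiver: packet lost, mark with x
}' file2 file1 > file3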