I have data in a file containing numbers with unit suffixes such as GB and TB, and I have to sum them together.
Below is the file data:
$ cat data_in_transit| awk '/TB/{print $6}'
1.26TB
1.24TB
2.85TB
1.03TB
1.07TB
1.01TB
$ cat data_in_transit| awk '/GB/{print $6}'
962.2GB
1005GB
892.5GB
910.0GB
823.4GB
1008GB
426.4GB
168.6GB
208.1GB
511.3GB
787.5GB
448.0GB
509.6GB
496.1GB
550.7GB
I can calculate them individually; however, I want the two sums below combined into one query.
Anything with GB:
$ awk '/GB/{sumGB+=$6}END{printf ("%.2f\n", sumGB / 1024)}' seoul_data_in_transit
9.48
Anything with TB:
$ awk '/TB/{sumTB+=$6}END{printf ("%.2f\n", sumTB)}' seoul_data_in_transit
8.46
Please suggest a solution.
awk '$6~/GB/{s+=$6}$6~/TB/{s+=$6 * 1024}END{print s/1024,"TB"}' file
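This normalizes everything to GB first (TB values are multiplied by 1024), then divides the grand total by 1024 so the single figure comes out in TB.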
Assuming the current summation code generates the correct results:
awk '
/GB/ { sumGB+=$6 }
/TB/ { sumTB+=$6 }
END { printf ("%.2f GB\n", sumGB / 1024)
printf ("%.2f TB\n", sumTB)
}
' seoul_data_in_transit
Which should generate:
9.48 GB
8.46 TB
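A slightly stricter variant anchors the unit to field 6 itself (so a stray GB or TB elsewhere on a line cannot be miscounted) and folds both sums into a single TB figure, which is what the question asks for. A minimal sketch along those lines:
awk '
$6 ~ /GB$/ { gb += $6 }                   # field 6 ends in GB
$6 ~ /TB$/ { gb += $6 * 1024 }            # convert TB to GB before adding
END { printf ("%.2f TB\n", gb / 1024) }   # report the grand total in TB
' seoul_data_in_transit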
This is probably way overkill, but here's a unified way to sum up sizes for anything from kibibytes to yottabytes down to individual bytes, without hard-coding the units one by one:
mawk 'BEGIN {
OFS = "\r\t\t"
_ = "_KMGTPEZY"
gsub("[^_]","&B",_)
_+=_^=__=___=____*=_____=_
OFMT = \
CONVFMT = "%\047"((__=-_-+-++_^_)+_*_)".f"
__*=++_+!(_-=_=___=_*(_+_))
} $++NF=____+=int(___^index(_____, substr($!_,length($!_)-!_))*$!_)'
509.6GB 547,178,833,510
509.6MB 547,713,187,839
550.7MB 548,290,638,642
2.85TB 3,681,898,777,803
168.6MB 3,682,075,567,716
1.01KB 3,682,075,568,750
1.03TB 4,814,572,545,359
962.2MB 4,815,581,485,186
448.0GB 5,296,617,822,338
962.2GB 6,329,772,205,390
448.0MB 6,330,241,967,438
823.4MB 6,331,105,364,916
823.4GB 7,215,224,382,797
1.07KB 7,215,224,383,892
550.7GB 7,806,534,006,368
511.3GB 8,355,538,200,979
892.5MB 8,356,474,055,059
1.26TB 9,741,858,706,056
511.3MB 9,742,394,842,964
496.1MB 9,742,915,041,517
1.07TB 10,919,392,483,237
426.4GB 11,377,235,996,990
1.24KB 11,377,235,998,259
426.4MB 11,377,683,111,065
208.1MB 11,377,901,319,730
1008MB 11,378,958,284,338
787.5GB 12,224,529,970,738
892.5GB 13,182,844,548,658
208.1GB 13,406,290,222,232
1005MB 13,407,344,041,112
910.0GB 14,384,449,100,952
2.85KB 14,384,449,103,870
1008GB 15,466,780,862,462
168.6GB 15,647,813,733,988
1.26KB 15,647,813,735,278
787.5MB 15,648,639,488,878
1005GB 16,727,750,021,998
910.0MB 16,728,704,226,158
1.03KB 16,728,704,227,212
1.24TB 18,092,098,645,654
496.1GB 18,624,781,964,540
1.01TB 19,735,288,708,593
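For readers who prefer something less cryptic, here is the same idea in plain awk, a minimal sketch assuming the size is the last field on each line and the suffixes are powers of 1024 (B, KB, MB, GB, TB, ...):
awk '
{
    units = "BKMGTPEZY"                    # unit letters; B = plain bytes
    v = $NF                                # size string, e.g. "509.6GB"
    suf = substr(v, length(v) - 1, 1)      # letter just before the trailing B
    p = index(units, suf)                  # 1 = bytes, 4 = GB, 5 = TB, ...
    if (p == 0) p = 1                      # a plain "...B" value has a digit there
    total += v * 1024 ^ (p - 1)            # awk reads the leading number of v
}
END { printf "%.0f bytes\n", total }
' file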
I have two CSV files.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is use the crosswalk file, File 2, to back out the PIDs of the observations in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to recover as many PIDs as possible (the level of accuracy should of course be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to fill in File 1's PIDs as completely as possible while keeping high accuracy, I consider three steps. First, use a command that assigns a PID to an observation in File 1 if and only if the information in FNAME, MNAME, LNAME, GENDER, and DOB matches completely. The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command so that, where the DOB information is exactly the same, fuzzy matching on FNAME, MNAME, LNAME, and GENDER is used to recover the PIDs of the File 1 observations not identified in the first step. The output after these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command to do fuzzy matching on all the related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the PIDs of the remaining observations. So the final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data is about 100 GB, I want to handle this in Linux.
But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to each value, like field[2]["66M"]="S2"; then, for each record in file1, it counts the number of PID matches and prints the PID with the biggest count:
BEGIN {
FS=OFS=","
PROCINFO["sorted_in"]="#val_num_desc"
}
NR==FNR { # file2
for(i=1;i<=6;i++) # fields 1-6
if($i!="") {
field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
}
next
}
{ # file1
for(i=1;i<=6;i++) { # fields 1-6
if($i in field[i]) { # if value matches
split(field[i][$i],t,FS) # get PIDs
for(j in t) { # and
matches[t[j]]++ # increase PID counts
}
} else { # if no value match
for(j in field[i]) # for all field values
if($i~j || j~$i) # "go fuzzy" :D
matches[field[i][j]]+=0.5 # fuzzy is half a match
}
}
for(i in matches) { # the best match first
print $0,i
delete matches
break # we only want the best match
}
}
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is naivistic if($i~j || j~$i) but feel free to replace it with any approximate matching algorithm, for example there are a few implementations of the Levenshtein distance algorithms floating in the internets. Rosetta seems to have one.
You didn't mention how big file2 is but if it's way beyond your memory capasity, you may want to consider spliting the files somehow.
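For completeness, a standalone Levenshtein distance function in awk could look like the sketch below; how to plug it in (for example, replacing the if($i~j || j~$i) test with a distance threshold) is left as a design choice:
function levenshtein(a, b,    la, lb, i, j, c, d) {        # d[] is the DP matrix
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i                  # cost of deleting i chars
    for (j = 0; j <= lb; j++) d[0, j] = j                  # cost of inserting j chars
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            c = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
            d[i, j] = d[i-1, j] + 1                                    # deletion
            if (d[i, j-1] + 1   < d[i, j]) d[i, j] = d[i, j-1] + 1     # insertion
            if (d[i-1, j-1] + c < d[i, j]) d[i, j] = d[i-1, j-1] + c   # substitution
        }
    return d[la, lb]
}
It could then be called as, for example, if (levenshtein($i, j) <= 1) in place of the regex test.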
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
FS=OFS=","
PROCINFO["sorted_in"]="#val_num_desc"
map[1]=1 # map file1 fields to file2 fields
map[2]=3
map[3]=4
map[4]=2
map[5]=5
map[7]=6
}
NR==FNR { # file2
for(i=1;i<=6;i++) # fields 1-6
if($i!="") {
field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
}
next
}
{ # file1
for(i in map) {
if($i in field[map[i]]) { # if value matches
split(field[map[i]][$i],t,FS) # get PIDs
for(j in t) { # and
matches[t[j]]++ # increase PID counts
}
} else { # if no value match
for(j in field[map[i]]) # for all field values
if($i~j || j~$i) # "go fuzzy" :D
matches[field[map[i]][j]]+=0.5 # fuzzy is half a match
}
}
for(i in matches) { # the best match first
print $0,i
delete matches
break # we only want the best match
}
}
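Both versions read the crosswalk file first, so they would be invoked along the lines of the following (fuzzy.awk is just a placeholder name for wherever you saved the script):
awk -f fuzzy.awk file2 file1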
I am trying to convert a txt file to CSV, but it doesn't work.
Original text:
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
Expected result:
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
I have tried this, but it doesn't handle the space and comma constraints
awk 'BEGIN{print "text,value"}{print $1","$2}' ifile.txt
I have also tried this with Python, but it doesn't capture all of the fields
import pandas as pd
df = pd.read_fwf('log.txt')
df.to_csv('log.csv')
Your request is unclear: how do you want to format the last field?
I created a script that aligns the last field on column 60.
script.awk
BEGIN {printf("text%61s\n","value")} # formatted printing heading line
{
lastField = $NF; # store current last field into var
$NF = ""; # remove last field from line
alignLen = 60 - length() + length(lastField); # compute last field alignment
alignFormat = "%s%"alignLen"s\n"; # create printf format for computed alignment
printf(alignFormat, $0, lastField); # format print current line and last field
}
run script.awk
awk -f script.awk ifile.txt
output
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
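If the goal is a real two-column CSV rather than aligned columns, a minimal sketch is shown below. It assumes the label is always the last whitespace-separated field and quotes the text column so embedded commas stay intact; ofile.csv is just an example output name:
awk '
BEGIN { print "text,value" }               # CSV header
{
    label = $NF                            # last field is the label (OBJ, POS, ...)
    sub(/[[:space:]]+[^[:space:]]+$/, "")  # strip the label from the line
    gsub(/"/, "\"\"")                      # escape any embedded double quotes
    printf "\"%s\",%s\n", $0, label        # quoted text column, then the label
}
' ifile.txt > ofile.csv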
I have a file that is delimited with "#". It has a recurring marker that can be used to split the file into sections. In another file I have data that I would like to add as another column to the first file. That second file is cycled through, advancing one entry at every occurrence of the recurring marker in the first file. The files look like this:
File 1
Race1#300Yards#6
Race2#300Yards#7
Race3#250Yards#7
Race4#250Yards#7
Race5#250Yards#8
Race6#250Yards#9
Race7#300Yards#10
Race8#300Yards#12
Race1#330Yards#10
Race2#300Yards#10
Race3#300Yards#10
Race4#300Yards#10
Race5#11/2Miles#11
Race6#7Miles#9
Race7#6Miles#8
Race8#51/2Miles#7
Race9#1Mile#8
Race10#51/2Miles#12
Race1#61/2Miles#6
Race2#11/16Miles#9
Race3#1Mile#9
Race4#11/2Miles#6
Race5#11/16Miles#10
Race6#1Mile#10
Race7#11/16Miles#12
Race8#1Mile#12
The other file looks like:
File 2
London
New York
Dallas
The desired results look like:
Race1#300Yards#6#London
Race2#300Yards#7#London
Race3#250Yards#7#London
Race4#250Yards#7#London
Race5#250Yards#8#London
Race6#250Yards#9#London
Race7#300Yards#10#London
Race8#300Yards#12#London
Race1#330Yards#10#New York
Race2#300Yards#10#New York
Race3#300Yards#10#New York
Race4#300Yards#10#New York
Race5#11/2Miles#11#New York
Race6#7Miles#9#New York
Race7#6Miles#8#New York
Race8#51/2Miles#7#New York
Race9#1Mile#8#New York
Race10#51/2Miles#12#New York
Race1#61/2Miles#6#Dallas
Race2#11/16Miles#9#Dallas
Race3#1Mile#9#Dallas
Race4#11/2Miles#6#Dallas
Race5#11/16Miles#10#Dallas
Race6#1Mile#10#Dallas
Race7#11/16Miles#12#Dallas
Race8#1Mile#12#Dallas
I know that awk can be used to split on the race marker "Race1". I think it starts with something like:
awk '/Race1/{x="Race"++i;}{print $5= something relating to file 2}
Does anybody know how to parse two files like this with awk, or any other Linux command, using loops and conditions?
If you save this as a.awk
BEGIN {
FS = OFS = "#"
i = 0
j = -1
}
NR == FNR {
a[i++] = $1
}
NR != FNR {
if ($1 == "Race1")
j++
$4 = a[j]
print
}
and run
awk -f a.awk file2 file1
You will get your desired results.
Output
Race1#300Yards#6#London
Race2#300Yards#7#London
Race3#250Yards#7#London
Race4#250Yards#7#London
Race5#250Yards#8#London
Race6#250Yards#9#London
Race7#300Yards#10#London
Race8#300Yards#12#London
Race1#330Yards#10#New York
Race2#300Yards#10#New York
Race3#300Yards#10#New York
Race4#300Yards#10#New York
Race5#11/2Miles#11#New York
Race6#7Miles#9#New York
Race7#6Miles#8#New York
Race8#51/2Miles#7#New York
Race9#1Mile#8#New York
Race10#51/2Miles#12#New York
Race1#61/2Miles#6#Dallas
Race2#11/16Miles#9#Dallas
Race3#1Mile#9#Dallas
Race4#11/2Miles#6#Dallas
Race5#11/16Miles#10#Dallas
Race6#1Mile#10#Dallas
Race7#11/16Miles#12#Dallas
Race8#1Mile#12#Dallas
Explanation
We begin by setting the input and output field separators to #. We also initialize the variables i and j that will be used as array indices.
The first condition, NR == FNR, checks whether we are going through file2. In that block, we store the first field, which is the city name, at index i and then increment i.
The second condition, NR != FNR, checks whether we are going through file1. If the first field is equal to Race1, we increment j (notice that we initialized j to -1). We set the 4th field to a[j] and then print the line.
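For reference, the same logic fits in a one-liner; this is a sketch assuming city names never contain # and every track's card starts with a Race1 line:
awk -F'#' -v OFS='#' 'NR==FNR{a[n++]=$0; next} $1=="Race1"{i++} {print $0, a[i-1]}' file2 file1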
I have some files in Linux with lines like:
2013/08/16,name1,,5000,8761,09:00,09:30
2013/08/16,name1,,5000,9763,10:00,10:30
2013/08/16,name1,,5000,8866,11:00,11:30
2013/08/16,name1,,5000,5768,12:00,12:30
2013/08/16,name1,,5000,11764,13:00,13:30
2013/08/16,name2,,5000,2765,14:00,14:30
2013/08/16,name2,,5000,4765,15:00,15:30
2013/08/16,name2,,5000,6765,16:00,16:30
2013/08/16,name2,,5000,12765,17:00,17:30
2013/08/16,name2,,5000,25665,18:00,18:30
2013/08/16,name2,,5000,45765,09:00,10:30
2013/08/17,name1,,5000,33765,10:00,11:30
2013/08/17,name1,,5000,1765,11:00,12:30
2013/08/17,name1,,5000,34765,12:00,13:30
2013/08/17,name1,,5000,12765,13:00,14:30
2013/08/17,name2,,5000,1765,14:00,15:30
2013/08/17,name2,,5000,3765,15:00,16:30
2013/08/17,name2,,5000,7765,16:00,17:30
My column separator is "," and in the third column (currently empty, hence the ,,), I need the entry number within the same day. For example, with date 2013/08/16 I have 11 lines and with date 2013/08/17 I have 7 lines, so I need to add the numbers, for example:
2013/08/16,name1,1,5000,8761,09:00,09:30
2013/08/16,name1,2,5000,9763,10:00,10:30
2013/08/16,name1,3,5000,8866,11:00,11:30
2013/08/16,name1,4,5000,5768,12:00,12:30
2013/08/16,name1,5,5000,11764,13:00,13:30
2013/08/16,name2,6,5000,2765,14:00,14:30
2013/08/16,name2,7,5000,4765,15:00,15:30
2013/08/16,name2,8,5000,6765,16:00,16:30
2013/08/16,name2,9,5000,12765,17:00,17:30
2013/08/16,name2,10,5000,25665,18:00,18:30
2013/08/16,name2,11,5000,45765,09:00,10:30
2013/08/17,name1,1,5000,33765,10:00,11:30
2013/08/17,name1,2,5000,1765,11:00,12:30
2013/08/17,name1,3,5000,34765,12:00,13:30
2013/08/17,name1,4,5000,12765,13:00,14:30
2013/08/17,name2,5,5000,1765,14:00,15:30
2013/08/17,name2,6,5000,3765,15:00,16:30
2013/08/17,name2,7,5000,7765,16:00,17:30
I need to do it in bash. How can I do it?
This one's good too:
awk -F, 'sub(/,,/, ","++a[$1]",")1' file
Output:
2013/08/16,name1,1,5000,8761,09:00,09:30
2013/08/16,name1,2,5000,9763,10:00,10:30
2013/08/16,name1,3,5000,8866,11:00,11:30
2013/08/16,name1,4,5000,5768,12:00,12:30
2013/08/16,name1,5,5000,11764,13:00,13:30
2013/08/16,name2,6,5000,2765,14:00,14:30
2013/08/16,name2,7,5000,4765,15:00,15:30
2013/08/16,name2,8,5000,6765,16:00,16:30
2013/08/16,name2,9,5000,12765,17:00,17:30
2013/08/16,name2,10,5000,25665,18:00,18:30
2013/08/16,name2,11,5000,45765,09:00,10:30
2013/08/17,name1,1,5000,33765,10:00,11:30
2013/08/17,name1,2,5000,1765,11:00,12:30
2013/08/17,name1,3,5000,34765,12:00,13:30
2013/08/17,name1,4,5000,12765,13:00,14:30
2013/08/17,name2,5,5000,1765,14:00,15:30
2013/08/17,name2,6,5000,3765,15:00,16:30
2013/08/17,name2,7,5000,7765,16:00,17:30
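The sub() call replaces the empty ,, pair with a counter keyed on column 1 (the date), so the numbering restarts on every new date. If you would rather assign the third column explicitly instead of relying on the ,, pattern, an equivalent sketch:
awk 'BEGIN { FS = OFS = "," } { $3 = ++count[$1] } 1' file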