AWK: adding a new field from another file - linux
I have a file that is delimited with "#". It contains recurring data that can be used to split the file into sections. In another file I have data that I would like to add as another column to the first file. The second file should be cycled through: each time the recurring marker appears in the first file, the next value from the second file is used. The files look like this:
File 1
Race1#300Yards#6
Race2#300Yards#7
Race3#250Yards#7
Race4#250Yards#7
Race5#250Yards#8
Race6#250Yards#9
Race7#300Yards#10
Race8#300Yards#12
Race1#330Yards#10
Race2#300Yards#10
Race3#300Yards#10
Race4#300Yards#10
Race5#11/2Miles#11
Race6#7Miles#9
Race7#6Miles#8
Race8#51/2Miles#7
Race9#1Mile#8
Race10#51/2Miles#12
Race1#61/2Miles#6
Race2#11/16Miles#9
Race3#1Mile#9
Race4#11/2Miles#6
Race5#11/16Miles#10
Race6#1Mile#10
Race7#11/16Miles#12
Race8#1Mile#12
The other file looks like:
File 2
London
New York
Dallas
The desired results look like:
Race1#300Yards#6#London
Race2#300Yards#7#London
Race3#250Yards#7#London
Race4#250Yards#7#London
Race5#250Yards#8#London
Race6#250Yards#9#London
Race7#300Yards#10#London
Race8#300Yards#12#London
Race1#330Yards#10#New York
Race2#300Yards#10#New York
Race3#300Yards#10#New York
Race4#300Yards#10#New York
Race5#11/2Miles#11#New York
Race6#7Miles#9#New York
Race7#6Miles#8#New York
Race8#51/2Miles#7#New York
Race9#1Mile#8#New York
Race10#51/2Miles#12#New York
Race1#61/2Miles#6#Dallas
Race2#11/16Miles#9#Dallas
Race3#1Mile#9#Dallas
Race4#11/2Miles#6#Dallas
Race5#11/16Miles#10#Dallas
Race6#1Mile#10#Dallas
Race7#11/16Miles#12#Dallas
Race8#1Mile#12#Dallas
I know that awk can be used to split the file into sections at each "Race1" line. I think it starts with something like:

awk '/Race1/{x="Race"++i;}{print $5= something relating to file 2}'

Does anybody know how to do this with awk, or any other Linux command, processing two files with loops and conditions?
If you save this as a.awk
BEGIN {
    FS = OFS = "#"        # split input and join output on "#"
    i = 0                 # next free slot while storing city names
    j = -1                # current city index; bumped to 0 at the first "Race1"
}
NR == FNR {               # true only while reading the first file argument (file2)
    a[i++] = $1           # remember each city name in order
}
NR != FNR {               # now reading file1
    if ($1 == "Race1")    # every "Race1" line starts a new section
        j++
    $4 = a[j]             # add the current city as a 4th field
    print
}
and run
awk -f a.awk file2 file1
You will get your desired results.
Output
Race1#300Yards#6#London
Race2#300Yards#7#London
Race3#250Yards#7#London
Race4#250Yards#7#London
Race5#250Yards#8#London
Race6#250Yards#9#London
Race7#300Yards#10#London
Race8#300Yards#12#London
Race1#330Yards#10#New York
Race2#300Yards#10#New York
Race3#300Yards#10#New York
Race4#300Yards#10#New York
Race5#11/2Miles#11#New York
Race6#7Miles#9#New York
Race7#6Miles#8#New York
Race8#51/2Miles#7#New York
Race9#1Mile#8#New York
Race10#51/2Miles#12#New York
Race1#61/2Miles#6#Dallas
Race2#11/16Miles#9#Dallas
Race3#1Mile#9#Dallas
Race4#11/2Miles#6#Dallas
Race5#11/16Miles#10#Dallas
Race6#1Mile#10#Dallas
Race7#11/16Miles#12#Dallas
Race8#1Mile#12#Dallas
Explanation
We begin by setting the input and output field separators to #. We also initialize the variables i and j that will be used as array indices.

The first condition checks whether we are going through file2 with NR == FNR. In that block we store the first field, which is the city name, at index i, and then increment i.

The second condition checks whether we are going through file1 with NR != FNR. If the first field equals Race1, we increment j (notice that we initialized j to -1, so it becomes 0 at the first section). We set the 4th field to a[j], and then we print the line.
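For reference, the same idea also fits on one line; this is just a condensed sketch of the script above (storing whole lines of file2 and appending by concatenation instead of assigning $4):

awk -F'#' 'NR==FNR{a[++n]=$0; next} $1=="Race1"{j++} {print $0 "#" a[j]}' file2 file1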
Related
Is there any command to do fuzzy matching in Linux based on multiple columns
I have two csv files.

File 1

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2

PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018

What I want to do is use the crosswalk file, File 2, to back out those observations' PID in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in the observations of File 1 is not complete, I'm thinking of using fuzzy matching to back out as many PIDs as possible (of course the level of accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".

Since I want to fill in File 1's PID as completely as possible while keeping high accuracy, I consider three steps. First, use a command to make sure that observations in File 1 are assigned a PID if and only if the information in FNAME, MNAME, LNAME, GENDER, and DOB is completely matched. The output should be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,

Next, write another command so that where the DOB information is exactly the same, fuzzy matching on FNAME, MNAME, LNAME, GENDER is used to back out the PID of those observations in File 1 not identified in the first step. The output through these two steps is supposed to be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

In the final step, use a new command to do fuzzy matching on all related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PID. So the final output is expected to be:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

I need to keep the order of File 1's observations, so it must be a kind of left outer join.

Because my original data size is about 100GB, I want to use Linux to deal with this. But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help me? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to the value, like field[2]["66M"]="S2", and for each record in file1 it counts the PID matches and prints the one with the biggest count:

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR {                                # file2
    for(i=1;i<=6;i++)                    # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                        # file1
    for(i=1;i<=6;i++) {                  # fields 1-6
        if($i in field[i]) {             # if value matches
            split(field[i][$i],t,FS)     # get PIDs
            for(j in t) {                # and
                matches[t[j]]++          # increase PID counts
            }
        } else {                         # if no value match
            for(j in field[i])           # for all field values
                if($i~j || j~$i)         # "go fuzzy" :D
                    matches[field[i][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                  # the best match first
        print $0,i
        delete matches
        break                            # we only want the best match
    }
}

Output:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2

The "fuzzy match" here, if($i~j || j~$i), is naive, but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta Code seems to have one. You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.

Update: A version that maps file1 fields to file2 fields (as mentioned in the comments):

BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1    # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                # file2
    for(i=1;i<=6;i++)                    # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                        # file1
    for(i in map) {
        if($i in field[map[i]]) {        # if value matches
            split(field[map[i]][$i],t,FS)  # get PIDs
            for(j in t) {                # and
                matches[t[j]]++          # increase PID counts
            }
        } else {                         # if no value match
            for(j in field[map[i]])      # for all field values
                if($i~j || j~$i)         # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                  # the best match first
        print $0,i
        delete matches
        break                            # we only want the best match
    }
}
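For reference, a self-contained Levenshtein distance function in awk could replace the if($i~j || j~$i) test; this is a sketch of the standard dynamic-programming algorithm (the function name levdist and the threshold of 2 are my own choices, not part of the answer above):

# Edit distance between strings a and b; the extra parameters are awk-style locals.
function levdist(a, b,    la, lb, i, j, c, d) {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i          # delete i leading chars of a
    for (j = 0; j <= lb; j++) d[0, j] = j          # insert j leading chars of b
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            c = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
            d[i, j] = d[i-1, j] + 1                                   # deletion
            if (d[i, j-1] + 1   < d[i, j]) d[i, j] = d[i, j-1] + 1    # insertion
            if (d[i-1, j-1] + c < d[i, j]) d[i, j] = d[i-1, j-1] + c  # substitution
        }
    return d[la, lb]
}

The fuzzy branch would then become something like if (levdist($i, j) <= 2) matches[field[i][j]] += 0.5.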
separate .txt file to csv file
Trying to convert a txt file to csv, but it doesn't work.

Original text:

استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL

Expected result:

text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL

I have tried this, but it doesn't handle the spaces and the comma constraint:

awk 'BEGIN{print "text,value"}{print $1","$2}' ifile.txt

I have also tried this with Python, but it doesn't capture all of them:

import pandas as pd
df = pd.read_fwf('log.txt')
df.to_csv('log.csv')
Your request is unclear as to how you want to format the last field. I created a script that aligns the last field on column 60.

script.awk

BEGIN {printf("text%61s\n","value")}               # print formatted heading line
{
    lastField = $NF;                               # store current last field into var
    $NF = "";                                      # remove last field from line
    alignLen = 60 - length() + length(lastField);  # compute last field alignment
    alignFormat = "%s%"alignLen"s\n";              # create printf format for computed alignment
    printf(alignFormat, $0, lastField);            # print current line and aligned last field
}

run script.awk

awk -f script.awk ifile.txt

output (the value column is right-aligned to column 60):

text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
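If the goal is a real CSV file rather than aligned columns, a hedged alternative (my own sketch, not part of the answer above) is to quote the text part and keep the label as a second column:

awk 'BEGIN{print "text,value"}
     {v=$NF; $NF=""; sub(/ $/,"")      # split off the trailing label
      gsub(/"/,"\"\"")                 # escape embedded double quotes per CSV rules
      print "\"" $0 "\"," v}' ifile.txt

Quoting the text field keeps any embedded commas from breaking the CSV.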
To Replace Numerical Values in a certain column with other Numerical Values
I have data as below:

83997000|17561815|20370101000000
83997000|3585618|20370101000000
83941746|13898890|20361231230000
83940169|13842974|20171124205011
83999444|3585618|20370101000000
83943970|10560874|20370101000000
83942000|13898890|20371232230000
83999333|3585618|20350101120000

Now, what I want to achieve is as below:

If column 2 is 17561815, print 22220 to replace 17561815.
If column 2 is 3585618, print 23330 to replace 3585618.
If column 2 is 13898890, print 24440 to replace 13898890.
If column 2 is 13842974, print 25550 to replace 13842974.
If column 2 is 3585618, print 26660 to replace 3585618.
If column 2 is 10560874, print 27770 to replace 10560874.

Output to be like this:

83997000|22220|20370101000000
83997000|23330|20370101000000
83941746|24440|20361231230000
83940169|25550|20171124205011
83999444|26660|20370101000000
83943970|27770|20370101000000
83942000|24440|20371232230000
83999333|26660|20350101120000
awk solution:

awk 'BEGIN{
    FS=OFS="|";
    a["17561815"]=22220;
    a["13898890"]=24440;
    a["3585618"]=26660;
    a["13842974"]=25550;
    a["10560874"]=27770
}
$2 in a{ $2=a[$2] }
$4 in a{ $4=a[$4] }1' file

The output:

83997000|22220|20370101000000
83997000|26660|20370101000000
83941746|24440|20361231230000
83940169|25550|20171124205011
83999444|26660|20370101000000
83943970|27770|20370101000000
83942000|24440|20371232230000
83999333|26660|20350101120000
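If the substitution table grows, it can be kept in its own file instead of being hardcoded; a hedged sketch, where mapping.txt is a hypothetical file of old|new pairs such as 17561815|22220:

awk 'BEGIN{FS=OFS="|"}
     NR==FNR{map[$1]=$2; next}    # first file: remember each old|new pair
     $2 in map{$2=map[$2]}        # data file: rewrite column 2 when a mapping exists
     1' mapping.txt file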
entering text in a file at specific locations by identifying the number being integer or real in linux
I have an input like below:

46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02

Each segment where the 2nd entry is an integer like 1 runs for thousands of lines, and then a segment starts where the 2nd entry is a real number like 3.58077402e+01. Before anything begins I have to insert text like this:

*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02

so I need to enter specific texts at those locations. It is worth mentioning that the file is space delimited, not tab delimited, and that the text starting with * has to be at the very left of the line without spacing. The format of the rest of the file should be kept too. Any suggestions with sed or awk would be highly appreciated! The text at the beginning could be entered directly, since that is the start of the file; the real problem is the second bunch of lines, i.e. detecting that the second entry has turned into a real number.
An awk with fixed strings:

awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"}
     match($2,/\+/)&&!pr{print "*Cracked\n*Crippled";pr=1}
     1' yourfile

match($2,/\+/) && !pr: true when a + character is found in the $2 field (a real number in exponent notation) and the pr flag is still unset.
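This keys on the literal + of the exponent. If the real numbers could also appear without an exponent (my assumption; the question only shows e+ notation), a slightly broader test on the second column might be:

awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"}
     !pr && $2 ~ /[.eE]/ {print "*Cracked\n*Crippled"; pr=1}   # a decimal point or exponent marks a real
     1' yourfile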
Search in directory of files based on keywords from another file
Perl newbie here, looking for some help. I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type. For example:

Keywords.txt

Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk

For each file in the directory, I have to:

lookup the keywords.txt
search based on attribute type

something like the below:

IF attribute_type = boolean THEN
    search for attribute; set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
    extract string where attribute is found
ELSIF attribute_type = chunk THEN
    extract the complete chunk of paragraph where attribute is found.

This is what I have so far, and I'm sure there is a more efficient way to do this. I'm hoping someone can guide me in the right direction to do the above. Thanks & regards, SiMa

# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;

# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
    chomp;
    # open the file
    open(FILE,'attribute_config.txt');
    while (<FILE>) {
        chomp;
        ($attribute,$attribute_type) = split("\t");
        $is_boolean = ($attribute_type eq "boolean") ? "N" : "Y";
        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }
        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But such a story - whether true or made up, because you can't disclose the truth - should give

a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms

So I'd start with: I'm working in a prison and have to scan the emails of the inmates for

names (like "Al Capone") mentioned anywhere in the text; the director wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance officer wants those info to calculate the amount of ammunition and rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...; the parson wants to prepare her sermons efficiently

Taken for itself, each of the terms "keyword" (~running text) and "attribute" (~structured text) may be 'clear', but if both are applied to "the X I have to search for", things get mushy. Instead of general ("chunk") and technical ("string") terms, you should use 'real-world' (line) and specific (paragraph) words.

Samples of your input:

From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in Alcatraz.

Regards
Robin

and your expected output:

--- Robin.txt ----
keywords:
  Al Capone: Yes
  Billy the Kid: No
  Scarface: Yes
order lines:
  knife:
    knife: Bowie quantity: 8
  machine gun:
  stinger rocket:
  weapon:
    weapon: AK 4711 quantity: 14
social relations paragaphs:
  Tell my wife in Folsom to send some money to my son in Alcatraz.

Pseudo code should begin at the top level. If you start with

for each file in folder
    load search list
    process current file('s content) using search list

it's obvious that

load search list
for each file in folder
    process current file using search list

would be much better.

Based on this story, examples, and top level plan, I would try to come up with proof of concept code for a simplified version of the "process current file('s content) using search list" task:

given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
    print boolean item text
    if found anywhere in whole text
        print "Yes"
    else
        print "No"
print "order line:"
for each line item
    print line item text
    if found anywhere in whole text
        print whole line
print "social relations paragaphs:"
for each paragraph
    for each social relation item
        if found
            print paragraph
            no need to check for other items

first implementation attempt:

use Modern::Perl;
#use English qw(-no_match_vars);
use English;

exit step_00();

sub step_00 {
    # given file/text to search in
    my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in Alcatraz.

Regards
Robin
EOT
    # print file name
    say "--- Robin.txt ---";
    # print "keywords:"
    say "keywords:";
    # for each boolean item
    for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
        # print boolean item text
        printf "  %s: ", $bi;
        # if found anywhere in whole text
        if ($whole_text =~ /$bi/) {
            # print "Yes"
            say "Yes";
        # else
        } else {
            # print "No"
            say "No";
        }
    }
    # print "order line:"
    say "order lines:";
    # for each line item
    for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
        # print line item text
        # if found anywhere in whole text
        if ($whole_text =~ /^$li.*$/m) {
            # print whole line
            say "  ", $MATCH;
        }
    }
    # print "social relations paragaphs:"
    say "social relations paragaphs:";
    # for each paragraph
    for my $para (split /\n\n/, $whole_text) {
        # for each social relation item
        for my $sr ("wife", "son", "husband") {
            # if found
            if ($para =~ /$sr/) {
            ## if ($para =~ /\b$sr\b/) {
                # print paragraph
                say $para;
                # no need to check for other items
                last;
            }
        }
    }
    return 0;
}

output:

perl 16953439.pl
--- Robin.txt ---
keywords:
  Al Capone: Yes
  Billy the Kid: No
  Scarface: Yes
order lines:
  knife: Bowie quantity: 8
  weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.

Tell my wife in Folsom to send some money to my son in Alcatraz.

Such (premature) code helps you to

clarify your specs (Should not-found keywords go into the output? Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line search be done on the array of lines of the whole text?)
identify topics for further research/rtfm (e.g. regex (prison!))
plan your next steps (folder loop, read input file)

(In addition, people in the know will point out all my bad practices, so you can avoid them from the start.)

Good luck!
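To extend the proof of concept toward the original task, the "folder loop, read input file" steps might look like the hedged sketch below. The Keywords.txt name and its tab-separated format come from the question; the docs folder name is my own assumption, and only the boolean type is handled here:

use Modern::Perl;

# load search list: each line of Keywords.txt is "attribute<TAB>type"
my %type_of;
open my $kw, '<', 'Keywords.txt' or die "Keywords.txt: $!";
while (my $line = <$kw>) {
    chomp $line;
    my ($attribute, $type) = split /\t/, $line;
    $type_of{$attribute} = $type if defined $type;
}
close $kw;

# for each file in folder: process its whole content with the search list
for my $file (glob 'docs/*.txt') {              # 'docs' is an assumed folder name
    open my $fh, '<', $file or die "$file: $!";
    my $whole_text = do { local $/; <$fh> };    # slurp the file in one go
    close $fh;

    say "--- $file ---";
    for my $attribute (sort keys %type_of) {
        if ($type_of{$attribute} eq 'boolean') {
            say "  $attribute: ", ($whole_text =~ /\Q$attribute\E/ ? "Yes" : "No");
        }
        # search_and_extract and chunk types would follow the answer's
        # line and paragraph patterns (/^...$/m and split /\n\n/).
    }
}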