Is there any command to do fuzzy matching in Linux based on multiple columns - linux
I have two CSV files.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is use the crosswalk file, File 2, to recover the PID for the observations in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to recover as many PIDs as possible (of course, the level of accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID, because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert": Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1"; Marc,H,Robert,M,2000,201211.0 PID "S0"; and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to recover as many of File 1's PIDs as possible while keeping accuracy high, I'm considering three steps. First, use a command that assigns a PID to an observation in File 1 if and only if all of FNAME, MNAME, LNAME, GENDER, and DOB match exactly. The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command that, for observations not identified in the first step, requires an exact match on DOB but uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to recover the PID. The output after these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command that fuzzy-matches on all of the relevant columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PIDs. The final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it has to be a kind of left outer join. Because my original data is about 100 GB, I want to handle this in Linux.
But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes File 2's field values per field and attaches the PID to each value, like field[2]["66M"]="S2"; then, for each record in File 1, it counts the number of PID matches and prints the PID with the biggest count:
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR {                                      # file2
    for(i=1;i<=6;i++)                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                              # file1
    for(i=1;i<=6;i++) {                        # fields 1-6
        if($i in field[i]) {                   # if value matches
            split(field[i][$i],t,FS)           # get PIDs
            for(j in t) {                      # and
                matches[t[j]]++                # increase PID counts
            }
        } else {                               # if no value match
            for(j in field[i])                 # for all field values
                if($i~j || j~$i)               # "go fuzzy" :D
                    matches[field[i][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                        # the best match first
        print $0,i
        delete matches
        break                                  # we only want the best match
    }
}
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is the naive if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta Code seems to have one.
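For instance, here is a self-contained Levenshtein distance written directly in awk (the standard dynamic-programming formulation, sketched from scratch here rather than taken from Rosetta Code); the BEGIN block just demos it on two name pairs from the question, and you could treat, say, distance <= 1 as a fuzzy hit:

```shell
awk '
# lev(a, b): classic dynamic-programming edit distance.
# Parameters after the extra spaces are awk-style local variables.
function lev(a, b,    i, j, la, lb, c, d) {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i     # delete all of a
    for (j = 0; j <= lb; j++) d[0, j] = j     # insert all of b
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            c = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
            d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + c)
        }
    return d[la, lb]
}
function min3(x, y, z,    m) { m = x; if (y < m) m = y; if (z < m) m = z; return m }
BEGIN { print lev("Marc", "Mark"); print lev("HM", "H M") }'
```

Both demo pairs are one edit apart, so both prints yield 1.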
You didn't mention how big File 2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
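As an aside, your first step (assign a PID only on an exact match of fields 2-6) needs none of the fuzzy machinery. A sketch in plain awk, with file2.csv/file1.csv as placeholder names; only the small crosswalk is held in memory, which matters when file1 is 100 GB:

```shell
# Step 1 only: exact match on FNAME,MNAME,LNAME,GENDER,DOB (fields 2-6),
# preserving file1's row order -- i.e. a left outer join.
awk -F, -v OFS=, '
NR==FNR {                                  # first file argument: file2
    if (FNR > 1)                           # skip file2 header
        pid[$2 FS $3 FS $4 FS $5 FS $6] = $1   # composite key -> PID
    next
}
FNR==1 { print $0, "PID"; next }           # extend file1 header
{
    k = $2 FS $3 FS $4 FS $5 FS $6
    print $0, (k in pid ? pid[k] : "")     # empty PID when no exact match
}' file2.csv file1.csv
```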
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1                                   # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                                      # file2
    for(i=1;i<=6;i++)                          # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                              # file1
    for(i in map) {
        if($i in field[map[i]]) {              # if value matches
            split(field[map[i]][$i],t,FS)      # get PIDs
            for(j in t) {                      # and
                matches[t[j]]++                # increase PID counts
            }
        } else {                               # if no value match
            for(j in field[map[i]])            # for all field values
                if($i~j || j~$i)               # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                        # the best match first
        print $0,i
        delete matches
        break                                  # we only want the best match
    }
}
Related
Removing columns and sorting by Name in finger command
When I use the finger command, it displays Login, Name, Tty, Idle, Login Time, Office, Office Phone, and Host. I just need the information in the Login, Name, Idle, and Login Time columns. I tried using awk and sed, but they left the chart all over the place (example below).

$ finger | sed -r 's/\S+//3'
Login    Name                Idle  Login Time   Office     Office Phone  Host
user1    Full Name           pts/1    20 Feb  3 19:34 (--------------------)
user2    FirstName LastName  pts/2       Feb  3 17:04 (--------------)
user3    Name NameName       pts/3  1:11 Feb  2 11:37 (-------------------------------)
user4    F Last              pts/4  1:09 Feb 13 18:14 (-------------------)

How do I go about removing specific columns while keeping the structure intact?
The problem here is that you cannot extract particular fields based on a whitespace separator, because on certain rows the columns might be blank and contain only whitespace, especially the Idle column, which will be blank for sessions with little idle time. (An additional problem is that the real name field may contain a variable number of spaces.) So you may have to resort to cut -b ... using hard-coded byte offsets. The following seems to work on my system, as finger seems to use a fixed-format output, truncating real names etc. as needed, so the byte offsets do not change when the length of the GECOS (real name) field of logged-in users changes.

finger | cut -b 1-20,30-48

Note that it will be inherently fragile if the format of the finger command output were to change in future. You might be able to produce something slightly more robust using regular expression parsing, for example parsing the column headings (the first line of finger output) to obtain the byte offsets rather than hard-coding them, but it will still be somewhat fragile. A more robust solution would involve writing your own code to obtain information from the same sources that finger uses, and using that in place of finger. The existing code of an open-source implementation of finger might be a suitable starting point, which you can then adapt to remove the columns that are not of interest.

Update: building a patched version of finger. Save this patch as /tmp/patch. It is just a quick-and-dirty patch to suppress certain fields from being printed; they are still calculated.
--- sprint.c~	2020-06-13 12:27:12.000000000 +0100
+++ sprint.c	2020-06-13 12:32:23.363138500 +0100
@@ -89,7 +89,7 @@
 	if (maxlname + maxrname < space-2) { maxlname++; maxrname++; }
 
 	(void)xprintf("%-*s %-*s %s\n", maxlname, "Login", maxrname,
-	    "Name", " Tty      Idle  Login Time   Office     Office Phone");
+	    "Name", " Idle  Login Time");
 	for (cnt = 0; cnt < entries; ++cnt) {
 		pn = list[cnt];
 		for (w = pn->whead; w != NULL; w = w->next) {
@@ -100,12 +100,6 @@
 				(void)xprintf("  *     *  No logins   ");
 				goto office;
 			}
-			(void)xputc(w->info == LOGGEDIN && !w->writable ?
-			    '*' : ' ');
-			if (*w->tty)
-				(void)xprintf("%-7.7s ", w->tty);
-			else
-				(void)xprintf("        ");
 			if (w->info == LOGGEDIN) {
 				stimeprint(w);
 				(void)xprintf("  ");
@@ -118,17 +112,6 @@
 			else
 				(void)xprintf(" %.5s", p + 11);
 office:
-			if (w->host[0] != '\0') {
-				xprintf(" (%s)", w->host);
-			} else {
-				if (pn->office)
-					(void)xprintf(" %-10.10s", pn->office);
-				else if (pn->officephone)
-					(void)xprintf(" %-10.10s", " ");
-				if (pn->officephone)
-					(void)xprintf(" %-.14s",
-					    prphone(pn->officephone));
-			}
 			xputc('\n');
 		}
 	}

Then obtain the source code, patch it, and build it. (Change destdir as required.)

apt-get source finger
cd bsd-finger-0.17/
pushd finger
patch -p0 < /tmp/patch
popd
destdir=/tmp/finger
mkdir -p $destdir/man/man8 $destdir/sbin $destdir/bin
./configure --prefix=$destdir
make
make install

And run it...

$destdir/bin/finger
Basically, to manipulate columns, awk is the way to go. For example, to remove the third column:

finger | awk '{$3="";print}'
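One caveat: assigning $3="" leaves the field separator behind, so you get a doubled space where the column was. A sketch of a variant that shifts the later fields left instead (note: decrementing NF works in gawk, mawk and busybox awk but is not strictly guaranteed by POSIX), demonstrated on a sample line rather than live finger output:

```shell
# Remove the 3rd column by shifting fields 4..NF one position left,
# then dropping the last field, so no empty slot remains.
echo "user1 Full pts/1 20 Feb" |
awk '{ for (i = 3; i < NF; i++) $i = $(i + 1); NF--; print }'
```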
Another way: if finger can display this information, it has to be stored somewhere on the system, and it can be gathered with who, awk, cut and getent passwd. I created a test user with adduser:

# adduser foobar
Adding user `foobar' ...
Adding new group `foobar' (1001) ...
Adding new user `foobar' (1001) with group `foobar' ...
Creating home directory `/home/foobar' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for foobar
Enter the new value, or press ENTER for the default
	Full Name []: Jean-Charles De la tour
	Room Number []: 42
	Work Phone []: +33140000000
	Home Phone []: +33141000000
	Other []: sysadmin
Is the information correct? [Y/n] Y

And the new line in the /etc/passwd file:

foobar:x:1001:1001:Jean-Charles De la tour,42,+33140000000,+33141000000,sysadmin:/home/foobar:/bin/bash

So it's easy to retrieve the information from this:

for u in $(who | cut -d' ' -f1); do   # iterate over connected users
    getent passwd | awk -F'[:,]' -v OFS='\n' -v u="$u" '$1==u{print "user: "$1, "full name: "$5, "room: "$6, "work phone: "$7, "home phone: "$8, "other: "$9}'
done

Just make sure you have commas in the $5 (GECOS) column.

Output:

user: foobar
full name: Jean-Charles De la tour
room: 42
work phone: +33140000000
home phone: +33141000000
other: sysadmin
Entering text in a file at specific locations by identifying whether a number is integer or real in Linux
I have an input like below:

46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02

Each segment whose 2nd entry is an integer like 1 runs for thousands of lines, and then comes the segment whose 2nd entry is a real like 3.58077402e+01. Before anything begins I have to insert text like this:

*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02

so I need to enter specific text at those locations. It is worth mentioning that the file is space-delimited, not tab-delimited, that the text starting with * has to be at the very left of the line without leading spaces, and that the format of the rest of the file should be kept. Any suggestions with sed or awk would be highly appreciated! The text at the beginning could be entered directly, since that is the start of the file; the problematic part is the second bunch of lines, i.e. detecting that the second entry has turned into a real.
An awk with fixed strings:

awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"} $2~/\+/&&!pr{print "*Cracked\n*Crippled";pr=1} 1' yourfile

$2~/\+/ && !pr: triggers when a + character is found in the $2 field (a real number in exponent notation) and the pr flag is not yet set. (The original match($2,"\+") works too, but the "\+" escape inside a string draws a warning from gawk; a regex literal is cleaner.)
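Relying on the + sign works for the e-notation shown. An alternative sketch (under my assumption that every real entry contains a decimal point) keys off the dot instead; the printf here just feeds two sample lines in place of the actual file:

```shell
printf '%s\n' \
  '46742 1 48276 48343 48199 48198' \
  '48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02' |
awk '
BEGIN { print "*Revolved\n*Gripped\n*Crippled" }             # block for the very start of the file
$2 ~ /\./ && !done { print "*Cracked\n*Crippled"; done = 1 } # first row whose 2nd entry is real
1                                                            # print every input line unchanged
'
```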
Compare different items in two files and output the combined result to a new file using AWK
Greetings! I have files in pairs taken from two nodes in a network; each file has records of TCP segment send/receive times, IP id numbers, segment types, seq numbers, and so on. For the same TCP flow, it looks like this on the sender side:

1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
1420862364.780810 50371 seq 20296:21744
....

or on the receiver side (1-second delay, segment with IP id 50371 lost):

1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
....

I want to compare the IP identification numbers in the two files and output to a new file like this:

1420862364.778332 1420862365.778332 50369 seq 17400:18848 o
1420862364.780798 1420862365.780798 50370 seq 18848:20296 o
1420862364.780810 1420862365.780810 50371 seq 20296:21744 x

which has the time of arrival on the receiver side; by comparing the id field, when the same value is not found on the receiver side (packet loss), an x is added, otherwise an o. I already have code like this:

awk 'ARGIND==1 {w[$2]=$1}
ARGIND==2 {
    flag=0;
    for(a in w)
        if($2==a) {
            flag=1;
            print $1,w[a],$2,$3,$4;
            break;
        }
    if(!flag) print $1,"x",$2,$3,$4;
}' file2 file1 > file3

but it doesn't work on my Linux box; it stops right after I press Enter and leaves only an empty file. The shell script containing this code has been through chmod +x. Please help. My code is not well organized; any new one-liner will be appreciated. Thank you for your time.
ARGIND is gawk-specific btw so check your awk version. – Ed Morton
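A portable sketch of the same idea (file names receiver.txt/sender.txt are placeholders) that works in any awk by reading the receiver file first and hashing arrival time by IP id, which also removes the inner loop; here "-" is my choice of placeholder for the missing arrival time of a lost segment:

```shell
# receiver.txt is read first (NR==FNR) to build rcv: IP id -> arrival time;
# sender.txt is then annotated: o = seen by receiver, x = lost.
awk '
NR==FNR { rcv[$2] = $1; next }
{
    if ($2 in rcv) print $1, rcv[$2], $2, $3, $4, "o"
    else           print $1, "-",     $2, $3, $4, "x"
}' receiver.txt sender.txt
```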
awk - insert row with specific text at a specific position
I have a file where the first couple of rows start with a # mark, followed by the classical netlist, in which there can also be rows beginning with a # mark. I need to insert one row with the text protect between the block of initial rows beginning with # and the first row of the classical netlist. At the end of the file I need to insert a row with the word unprotect. It would be good to save this modified text to a new file with a specific name, because the original file is protected. Sample file:

// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1
How about sed:

sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file

This inserts protect on a line by itself after the first block of commented lines, then appends unprotect at the end. Note: because I use $'\n' in place of a literal newline, bash is assumed as the shell.
Since you awk'd the post:

awk 'BEGIN{ protected=""}
{
    if($0 !~ /#/ && !protected){
        protected="1";
        print "protect";
    }
    print $0
}
END{print "unprotect";}' input_file > output_file

As soon as a row is detected without # as the first non-whitespace character, it outputs a line with protect. At the end it outputs a line with unprotect.

Test file:

#
#
#
	#Preceded by a tab
begin protect
# before unprotect

Result:

#
#
#
	#Preceded by a tab
protect
begin protect
# before unprotect
unprotect

Edit: removed the [:space:]* as that seems to be handled by default.

Support //: if you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped with \. This checks for at least one /; adding the quantifier {2} will match // exactly: /#|\/{2}/
Search in directory of files based on keywords from another file
Perl newbie here, looking for some help. I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type. For example:

Keywords.txt

Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk

For each file in the directory, I have to look up keywords.txt and search based on the attribute type, something like the below:

IF attribute_type = boolean THEN
    search for attribute; set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
    extract string where attribute is found
ELSIF attribute_type = chunk THEN
    extract the complete chunk of paragraph where attribute is found.

This is what I have so far, and I'm sure there is a more efficient way to do this. I'm hoping someone can guide me in the right direction. Thanks & regards, SiMa

# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;

# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
    chomp;
    # open the file
    open(FILE,'attribute_config.txt');
    while (<FILE>) {
        chomp;
        ($attribute,$attribute_type) = split("\t");
        $is_boolean = ($attribute_type eq "boolean") ? "N" : "Y";
        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }
        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But such a story (whether true or made up, because you can't disclose the truth) should give:

a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms

So I'd start with:

I'm working in a prison and have to scan the emails of the inmates for
- names (like "Al Capone") mentioned anywhere in the text; the director wants to read those mails in toto
- order lines (like "weapon: AK 4711 quantity: 14"); the ordnance officer wants that info to calculate the amount of ammunition and rack space needed
- paragraphs containing 'family' keywords like "wife", "child", ...; the parson wants to prepare her sermons efficiently

Taken by itself, each of the terms "keyword" (~running text) and "attribute" (~structured text) may be 'clear', but if both are applied to "the X I have to search for", things get mushy. Instead of general ("chunk") and technical ("string") terms, you should use 'real-world' (line) and specific (paragraph) words.

Samples of your input:

From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in Alcatraz.

Regards
Robin

and your expected output:

--- Robin.txt ----
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife:
  knife: Bowie quantity: 8
 machine gun:
 stinger rocket:
 weapon:
  weapon: AK 4711 quantity: 14
social relations paragaphs:
 Tell my wife in Folsom to send some money to my son in Alcatraz.

Pseudo code should begin at the top level. If you start with

for each file in folder
    load search list
    process current file('s content) using search list

it's obvious that

load search list
for each file in folder
    process current file using search list

would be much better.
Based on this story, examples, and top level plan, I would try to come up with proof of concept code for a simplified version of the "process current file('s content) using search list" task:

given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
    print boolean item text
    if found anywhere in whole text
        print "Yes"
    else
        print "No"
print "order line:"
for each line item
    print line item text
    if found anywhere in whole text
        print whole line
print "social relations paragaphs:"
for each paragraph
    for each social relation item
        if found
            print paragraph
            no need to check for other items

first implementation attempt:

use Modern::Perl;
#use English qw(-no_match_vars);
use English;

exit step_00();

sub step_00 {
    # given file/text to search in
    my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in Alcatraz.

Regards
Robin
EOT
    # print file name
    say "--- Robin.txt ---";
    # print "keywords:"
    say "keywords:";
    # for each boolean item
    for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
        # print boolean item text
        printf " %s: ", $bi;
        # if found anywhere in whole text
        if ($whole_text =~ /$bi/) {
            # print "Yes"
            say "Yes";
        # else
        } else {
            # print "No"
            say "No";
        }
    }
    # print "order line:"
    say "order lines:";
    # for each line item
    for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
        # print line item text
        # if found anywhere in whole text
        if ($whole_text =~ /^$li.*$/m) {
            # print whole line
            say " ", $MATCH;
        }
    }
    # print "social relations paragaphs:"
    say "social relations paragaphs:";
    # for each paragraph
    for my $para (split /\n\n/, $whole_text) {
        # for each social relation item
        for my $sr ("wife", "son", "husband") {
            # if found
            if ($para =~ /$sr/) {
            ## if ($para =~ /\b$sr\b/) {
                # print paragraph
                say $para;
                # no need to check for other items
                last;
            }
        }
    }
    return 0;
}

output:

perl 16953439.pl
--- Robin.txt ---
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife: Bowie quantity: 8
 weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in Alcatraz.

Such (premature) code helps you to

clarify your specs (Should not-found keywords go into the output? Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line search be done on the array of lines of the whole text?)
identify topics for further research/rtfm (e.g. regex (prison!))
plan your next steps (folder loop, read input file)

(in addition, people in the know will point out all my bad practices, so you can avoid them from the start)

Good luck!