How would I find duplicate lines by matching only one part of each line and not the whole line itself?
Take for example the following text:
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=152084(k152084) gid=10003(pemcln) groups=10003(pemcln) k152084
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
uid=154226(u154226) gid=10003(pemcln) groups=10003(pemcln) u154226
I would like to show only the 1st and 3rd lines, as they have the same duplicate UID value "154163".
The only ways I know how would match the whole line and not the subset of each one.
This code looks for the ID from each line. If any ID appears more than once, its lines are printed:
$ awk -F'[=(]' '{cnt[$2]++;lines[$2]=lines[$2]"\n"$0} END{for (k in cnt){if (cnt[k]>1)print lines[k]}}' file
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
How it works:
-F'[=(]'
awk separates input files into records (lines) and separates the records into fields. Here, we tell awk to use either = or ( as the field separator. This is done so that the second field is the ID.
cnt[$2]++; lines[$2]=lines[$2]"\n"$0
For every line that is read in, we keep a count, cnt, of how many times that ID has appeared. Also, we save all the lines associated with that ID in the array lines.
END{for (k in cnt){if (cnt[k]>1)print lines[k]}}
After we reach the end of the file, we go through each observed ID and, if it appeared more than once, its lines are printed.
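If holding every line in memory, or the unspecified order of for (k in cnt), is a concern, a two-pass variant is one option. This is only a sketch and not part of the answer above: it assumes the input is a regular file that can be read twice (not a pipe), and the function name show_dup_ids is made up for illustration.

```shell
# Hypothetical helper: print every line whose ID (field 2) occurs more than once.
# Pass 1 (NR == FNR) only counts IDs; pass 2 prints matching lines in input order.
show_dup_ids() {
    awk -F'[=(]' 'NR == FNR { cnt[$2]++; next } cnt[$2] > 1' "$1" "$1"
}
```

Calling show_dup_ids file on the sample above should print the 1st and 3rd lines. The trade-off is reading the file twice in exchange for storing only the counts.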
Someone has already provided an awk script that will do what you need, assuming the files are small enough to fit into memory (it stores all the lines until the end and then decides what to output). There's nothing wrong with it; indeed, it could be considered the canonical awk solution to this problem. I provide this answer for those cases where awk may struggle with the storage requirements.
Specifically, if you have larger files that cause problems with that approach, the following awk script, myawkscript.awk, will handle them, provided you first sort the file so the script can rely on the fact that related lines are adjacent. To ensure the input is sorted and that you can easily get at the relevant key (using = and ( as field separators), you call it with:
sort <inputfile | awk -F'[=(]' -f myawkscript.awk
The script is:
state == 0 {
    if (lastkey == $2) {
        printf "%s", lastrec;
        print;
        state = 1;
    };
    lastkey = $2;
    lastrec = $0"\n";
    next;
}
state == 1 {
    if (lastkey == $2) {
        print;
    } else {
        lastkey = $2;
        lastrec = $0"\n";
        state = 0;
    }
}
It's basically a state machine where state zero is scanning for duplicates and state one is outputting the duplicates.
In state zero, the relevant part of the current line is checked against the previous and, if there's a match, it outputs both and switches to state one. If there's no match, it simply moves on to the next line.
In state one, it checks each line against the original in the set and outputs it as long as it matches. When it finds one that doesn't match, it stores it and reverts to state zero.
I have some data (basically bounding box annotations) in a txt files (space separated)
I would like to replace multiple occurrences of specific characters with some other characters. For example
0 0.649489 0.666668 0.0625 0.260877
1 0.89485 0.445085 0.0428084 0.084259
1 0.80625 0.508509 0.0469892 0.005556
2 0.529068 0.0906668 0.0582908 0.0954804
2 0.565625 0.0268509 0.0040625 0.0546296
I might have to change it to something like
2 0.649489 0.666668 0.0625 0.260877
4 0.89485 0.445085 0.0428084 0.084259
4 0.80625 0.508509 0.0469892 0.005556
7 0.529068 0.0906668 0.0582908 0.0954804
7 0.565625 0.0268509 0.0040625 0.0546296
and this should happen simultaneously for all the elements only in the first column (not one after the other replacement as that will index it incorrectly)
I'll basically have a mapping {old_class_1:new_class_1,old_class_2:new_class_2,old_class_3:new_class_3} and so on...
I looked into the post here, but it does not work for my case since the method described in those answers would change all the values to the last replacement.
I looked into this post as well, but am not sure if the answer here can be applied to my case since I'll have around 25 classes, so the indexes (the values of the first column) can range from 0-24
I know this can probably be done in Python by reading each file line by line and making the replacement; I was just wondering if there was a quicker way.
Any help would be appreciated. Thanks!
Here's a simple example of how to map the labels in the first column to different ones.
This specifies the mapping as a variable; you could equally well specify it in a file, or something else entirely. The main consideration is that you need to have unambiguous separator characters, and use a format which isn't unnecessarily hard for Awk to parse.
awk 'BEGIN { n = split("0:2 1:4 2:7", m);
    for(i=1; i<=n; ++i) { split(m[i], p, ":"); map[p[1]] = p[2] } }
  $1 in map { $1 = map[$1] }1' file
The BEGIN block could be simplified, but I wanted to make it easy to update; now all you have to do is update the string which is the first argument to the first split to specify a different mapping. We spend a few temporary variables on parsing the values into an associative array, map, which is what the main script then uses.
The final 1 is not a typo; it is a standard Awk idiom to say "print every line unconditionally".
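As mentioned, the mapping could equally well live in a file. A minimal sketch of that variant, assuming a mapping file with one "old new" pair per line separated by whitespace; the function name remap_labels and both file arguments are placeholders, not anything from the question.

```shell
# Hypothetical helper: remap column 1 of a data file using a two-column mapping file.
# Usage: remap_labels MAPFILE DATAFILE
remap_labels() {
    awk 'NR == FNR { map[$1] = $2; next }   # first file: load old -> new pairs
         $1 in map { $1 = map[$1] }          # second file: one lookup per line
         1' "$1" "$2"
}
```

Because each line is rewritten with a single lookup on its original label, a line changed from 0 to 2 is not changed again by the 2-to-7 rule.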
I currently have a code where I want to enter a condition block if I can confirm that a substring is within another string. In this case the substring is a variable and the string is an array element. I'm cycling through the array with a for loop. Searching online I've determined that Awk does not do variable expansion within regex //. See below for my current code, note that all "..." are other non-relevant parts of my code:
#!/bin/bash
...
cat "${inputFile}" | awk -v mac="$mac" '
...
n=split($0,array,";");
if(array[8]~/MAC/){
mac=array[9];
}
...
#Don't bother checking if first line is a duplicate of some data. We won't have storage[i] declared/values set until after the first run through "if (duplicate == 0)"
if (inputLine!=1)
{
for (i=0; i<=row;++i)
{
if($storage[i]~/mac/)
{
#Test block I want to enter
duplicate = 1;
}
}
}
if (duplicate == 0)
{
print outputText;
storage[row]= [concatenation of some variables with spaces];
row++;
}
...
'
After following some of the suggestions I'm still not entering the if block. I know that some of the storage lines (array elements) should contain a concatenation of strings that will contain the string inside the mac variable. However, I never seem to enter the if block. How am I supposed to correct the format of ~/mac/ ?
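For reference, /mac/ is a regex constant that matches the literal letters "mac", and $storage[i] refers to the input field whose number is storage[i] rather than to the array element itself. A minimal sketch of two ways to test against the value of a variable instead, using made-up sample strings and a hypothetical function name contains_mac:

```shell
# Hypothetical helper: report whether the value passed as mac occurs in a stored string.
# Usage: contains_mac MAC STORED_STRING
contains_mac() {
    awk -v mac="$1" -v stored="$2" 'BEGIN {
        # index() does a literal substring search and returns 0 when nothing is found.
        if (index(stored, mac) > 0) print "substring match"
        # With ~, a variable on the right-hand side is used as a dynamic regex.
        if (stored ~ mac) print "regex match"
    }'
}
```

In the loop from the question this would correspond to something like if (index(storage[i], mac)) or if (storage[i] ~ mac), with no $ and no slashes. index() is the safer choice when the value could contain regex metacharacters.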
I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
    chomp(my $string1=<>);
    chomp(my $string2=<>);
    lexi($string1,$string2);
    $n--;
}
sub lexi{
    my($str1,$str2)=@_;
    my @str1=split(//,$str1);
    my @str2=split(//,$str2);
    my $final_string="";
    while(@str2 && @str1){
        my $st2=$str2[0];
        my $st1=$str1[0];
        if($st1 le $st2){
            $final_string.=$st1;
            shift @str1;
        }
        else{
            $final_string.=$st2;
            shift @str2;
        }
    }
    if(@str1){
        $final_string=$final_string.join('',@str1);
    }
    else{
        $final_string=$final_string.join('',@str2);
    }
    print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
It gives the right results for the sample test case but wrong results for other test cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm and why it is failing with other test cases?
Update:
As per @squeamishossifrage's comments, if I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
the results become the same regardless of the order of the inputs, but the test case still fails.
The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume.
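sub lexi {
    my ($str1, $str2) = @_;
    utf8::downgrade($str1);  # Makes sure length() will be fast
    utf8::downgrade($str2);  # since we only have ASCII letters.
    # Append a sentinel that sorts after 'Z'. Without it, a string that is a
    # prefix of the other (e.g. "C" vs "CAB") compares as smaller and is
    # consumed first, which gives AAAABACCAB for the failing case above.
    $str1 .= '[';
    $str2 .= '[';
    my $final_string = "";
    while (length($str1) > 1 && length($str2) > 1) {
        $final_string .= substr($str1 le $str2 ? $str1 : $str2, 0, 1, '');
    }
    chop($str1);  # Remove the sentinels again.
    chop($str2);
    $final_string .= $str1;
    $final_string .= $str2;
    print $final_string, "\n";
}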
Too little rep to comment thus the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAAA, since for the first comparison Z will be le Z. So what you need to do (a naive solution which most likely won't be very efficient) is to keep looking ahead as long as the strings/chars match, so:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one you want to pop from for the first character.
Edit
You updated with sorting the strings, but like I wrote you still need to look ahead. The sorting will solve the two above strings but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
Because here the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.
I have a large body of text and I print only lines that contain one of several strings. Each line can contain more than one string.
Example of the rule:
(house|mall|building)
I want to mark the found string for making the result easier to read.
Example of the result I want:
New record: Two New York houses under contract for nearly $5 million each.
New record: Two New York #house#s under contract for nearly $5 million each.
I know I can find the location, trim, add marker, add string etc.
I am asking if there is a way to mark the found string in one command.
Thanks.
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression ...
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the
extended regular expression ERE in string in and return the number of
substitutions. An ampersand ( '&' ) appearing in the string repl shall
be replaced by the string from in that matches the ERE ...
BEGIN {
r = "house|mall|building"
s = "Two New York houses under contract for nearly $5 million each."
gsub(r, "#&#", s)
print s
}
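The BEGIN block above works on a fixed string. To apply the same idea to a stream of input lines, something along these lines should work; it is a sketch, and the function name mark_hits is made up for illustration.

```shell
# Hypothetical helper: print only lines containing one of the words,
# with every occurrence wrapped in # markers (& stands for the matched text).
mark_hits() {
    awk '/house|mall|building/ { gsub(/house|mall|building/, "#&#"); print }' "$@"
}
```

It can be used as mark_hits file, or at the end of a pipeline. Lines without a match are skipped, so filtering and marking happen in one command.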
I have a field in a text file exported from a database. The field contains addresses but sometimes they are quite long and the database allows them to contain multiple lines. When exported, the newline character gets replaced with a dollar sign like this:
first part of very long address$second part of very long address$third part of very long address
Not every address has multiple lines and no address contains more than three lines. The length of each line is variable.
I'm massaging the data for import into MS Access which is used for a mailmerge. I want to split the field on the $ sign if it's there but if the field only contains 1 line, I want to set my two extra output fields to a zero length string so that I don't wind up with blank lines in the address when it gets printed.
I have an awk file that's working correctly on all the other data in the textfile but I need to get this last bit working. I tried the below code. Aside from the fact that I get a syntax error at the else, I'm not sure this is a good way to do what I want. This is being done with gawk on Windows.
BEGIN { FS = "|" }
$1 != "HEADER" {
if ($6 ~ /\$/)
split($6, arr, "$")
address = arr[1]
addresstwo = arr[2]
addressthree = arr[3]
addressLength = length(address)
addressTwoLength = length(addresstwo)
addressThreeLength = length(addressthree)
else {
address = $6
addressLength = length($6)
addresstwo = ""
addressTwoLength = length(addresstwo)
addressthree = ""
addressThreeLength = length(addressthree)
}
printf("%*s\t%*s\t\%*s\n",
addressLength, address, addressTwoLength, addresstwo, addressThreeLength, addressthree)
}
EDIT:
Sorry about that. Here's a sample
HEADER|0000000130|0000527350|0000171250|0000058000|0000756600|0000814753|0000819455|100106
rec1|ILL/COLORADO COLLEGE$TUTT LIBRARY|1021 N CASCADE$COLORADO SPRINGS, CO 80903|
rec2|ILL /PIKES PEAK LIBRARY DISTRICT|20 N. CASCADE AVE. / PO BOX 1579$COLORADO SPRINGS, CO 80903|
rec3|DOE,JOHN|PO Box 8034|
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY|INFORMATION DELIVERY DEPT$704 CHERRY ST$ATLANTA, GA 30332-0900
I match only lines without HEADER in them. I need to split the textstrings on the $ signs. The string between the pipes should not be padded (which is why I was trying to get the length in my original code). For this example, there are 6 output fields and any field for which there is no data is simply an empty string (also what I was trying to do in the code).
rec1|ILL/COLORADO COLLEGE|TUTT LIBRARY|1021 N CASCADE|COLORADO SPRINGS, CO 80903||
rec2|ILL /PIKES PEAK LIBRARY DISTRICT||20 N. CASCADE AVE. / PO BOX 1579|COLORADO SPRINGS, CO 80903||
rec3|DOE,JOHN||PO Box 8034|||
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY||INFORMATION DELIVERY DEPT|704 CHERRY ST|ATLANTA, GA 30332-0900|
Hope that helps! Let me know if this still isn't clear.
BEGIN { FS = "|" }
$1 != "HEADER" {
for(i = gsub(/\$/, "\t", $6); i < 3; i++)
$6 = $6 "\t"
print $6
}
I'm not really sure if I got your requirements right though.
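For comparison, here is a sketch that aims at the pipe-delimited output shown in the edited question. It assumes, based on that sample rather than on the original code, that field 2 expands to two columns and field 3 to three columns; the function name split_addresses is hypothetical.

```shell
# Hypothetical helper: expand field 2 into 2 columns and field 3 into 3 columns,
# splitting on the dollar sign and padding missing parts with empty strings.
split_addresses() {
    awk 'BEGIN { FS = OFS = "|" }
    $1 != "HEADER" {
        split($2, name, /\$/)   # split() clears the array first; absent parts read as ""
        split($3, addr, /\$/)
        print $1, name[1], name[2], addr[1], addr[2], addr[3], ""
    }' "$@"
}
```

The final empty string in the print statement produces the trailing pipe seen in the expected output. No length bookkeeping is needed, because an array element that split() did not set simply evaluates to an empty string.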