Replace the first field with values from a mapping - linux

I have some data (basically bounding-box annotations) in space-separated txt files.
I would like to replace multiple occurrences of specific characters with some other characters. For example
0 0.649489 0.666668 0.0625 0.260877
1 0.89485 0.445085 0.0428084 0.084259
1 0.80625 0.508509 0.0469892 0.005556
2 0.529068 0.0906668 0.0582908 0.0954804
2 0.565625 0.0268509 0.0040625 0.0546296
I might have to change it to something like
2 0.649489 0.666668 0.0625 0.260877
4 0.89485 0.445085 0.0428084 0.084259
4 0.80625 0.508509 0.0469892 0.005556
7 0.529068 0.0906668 0.0582908 0.0954804
7 0.565625 0.0268509 0.0040625 0.0546296
and this should happen simultaneously for all the elements only in the first column (not one after the other replacement as that will index it incorrectly)
I'll basically have a mapping {old_class_1:new_class_1,old_class_2:new_class_2,old_class_3:new_class_3} and so on...
I looked into the post here, but it does not work for my case since the method described in those answers would change all the values to the last replacement.
I looked into this post as well, but am not sure if the answer here can be applied to my case since I'll have around 25 classes, so the indexes (the values of the first column) can range from 0-24
I know this can probably be done in Python by reading each file line by line and making the replacement; I was just wondering if there was a quicker way.
Any help would be appreciated. Thanks!

Here's a simple example of how to map the labels in the first column to different ones.
This specifies the mapping as a variable; you could equally well specify it in a file, or something else entirely. The main consideration is that you need to have unambiguous separator characters, and use a format which isn't unnecessarily hard for Awk to parse.
awk 'BEGIN { n = split("0:2 1:4 2:7", m);
for(i=1; i<=n; ++i) { split(m[i], p); map[p[1]] = p[2] } }
$1 in map { $1 = map[$1] }1' file
The BEGIN block could be simplified, but I wanted to make it easy to update; now all you have to do is update the string which is the first argument to the first split to specify a different mapping. We spend a few temporary variables on parsing the values into an associative array, map, which is what the main script then uses.
The final 1 is not a typo; it is a standard Awk idiom to say "print every line unconditionally".
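Since the question mentions Python as a fallback, here is a minimal sketch of the same idea there. The function name remap_first_column and the sample data are made up for illustration. Each label is looked up exactly once in the original mapping, so replacements cannot chain (0 becomes 2, but that 2 is not then turned into 7).

```python
def remap_first_column(lines, mapping):
    """Replace the first space-separated field of each line via mapping."""
    out = []
    for line in lines:
        fields = line.split(" ")
        # Single lookup against the original label: no chained replacements.
        fields[0] = mapping.get(fields[0], fields[0])
        out.append(" ".join(fields))
    return out


mapping = {"0": "2", "1": "4", "2": "7"}
sample = [
    "0 0.649489 0.666668 0.0625 0.260877",
    "1 0.89485 0.445085 0.0428084 0.084259",
    "2 0.529068 0.0906668 0.0582908 0.0954804",
]
for line in remap_first_column(sample, mapping):
    print(line)
```

Labels that are missing from the mapping are left unchanged, which mirrors the `$1 in map` guard in the Awk version.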

Related

String permutations in lexicographic order without inversions/reversions

The problem:
I would like to generate a list of permutations of a string in lexicographic order, but excluding string inversions. For instance, if I have the string abc, I would like to generate the following list
abc
acb
bac
instead of the typical
abc
acb
bac
bca
cab
cba
An alternative example would look something like this:
100
010
instead of
100
010
001
Currently, I can generate the permutations using perl, but I am not sure on how to best remove the reverse duplicates.
I had thought of applying something like the following:
create map with the following:
1) 100
2) 010
3) 001
then perform the reversion/inversion on each element in the map and create a new map with:
1') 001
2') 010
3') 100
then compare: if the primed list value matches the original value, leave it in place; if it is different, keep it if its index is greater than the median index, else remove it.
Trouble is, I am not sure if this is an efficient approach or not.
Any advice would be great.
Two possibilities represented by examples are for permutations where all elements are different (abcd), or for variations of two symbols where one appears exactly once (1000). More general cases are addressed as well.
Non-repeating elements (permutations)
Here we can make use of Algorithm::Permute, and of the particular observation:
Each permutation where the first element is greater than its last need be excluded. It comes from this post, brought up in the answer by ysth.
This rule holds as follows. Consider substrings of a string without its first and last elements. For each such substring, all permutations of the string must contain its inverse. One of these, padded with last and first, is thus the inverse of the string. By construction, for each substring there is exactly one inverse. Thus permutations with swapped first and last elements of each string need be excluded.
use warnings;
use strict;
use feature 'say';
use Algorithm::Permute;
my $size = shift || 4;
my @arr = ('a'..'z')[0..$size-1]; # 1..$size for numbers
my @res;
Algorithm::Permute::permute {
push @res, (join '', @arr) unless $arr[0] gt $arr[-1]
} @arr;
say for @res; # print the collected permutations
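For comparison, the same exclusion rule can be sketched in Python with the standard itertools module. The function name is hypothetical, and the sketch assumes all characters of the input are distinct, as in this part of the answer.

```python
from itertools import permutations


def perms_without_reversals(s):
    """Permutations of s (distinct characters assumed), one per reversal pair."""
    # Keep a permutation only if its first element is not greater than its last.
    return ["".join(p) for p in permutations(sorted(s)) if p[0] <= p[-1]]


print(perms_without_reversals("abc"))  # ['abc', 'acb', 'bac']
```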
Problems with repeated elements (abbcd) can be treated in exactly the same way as above, but we also need to prune duplicates, since swapping the two b's generates abbcd twice.
use List::MoreUtils 'uniq';
# build @res the same way as above ...
@res = uniq @res;
Doing this during construction would not reduce complexity nor speed things up.
The permute is quoted as the fastest method in the module, by far. It is about an order of magnitude faster than the other modules I tested (below), taking about 1 second for 10 elements on my system. But note that this problem's complexity is factorial in size. It blows up really fast.
Two symbols, where one appears exactly once (variations)
This is different and the above module is not meant for it, nor would the exclusion criterion work. There are other modules, see at the end. However, the problem here is very simple.
Start from (1,0,0,...) and 'walk' 1 along the list, up to the "midpoint" – which is the half for even sized list (4 for 8-long), or next past half for odd sizes (5 for 9-long). All strings obtained this way, by moving 1 by one position up to midpoint, form the set. The second "half" are their inversions.
use warnings;
use strict;
my $size = shift || 4;
my @n = (1, map { 0 } 1..$size-1);
my @res = (join '', @n); # first element of the result
my $end_idx = ( @n % 2 == 0 ) ? @n/2 - 1 : int(@n/2);
foreach my $i (0..$end_idx-1) # stop one short as we write one past $i
{
@n[$i, $i+1] = (0, 1); # move 1 by one position from where it is
push @res, join '', @n;
}
print "$_\n" for @res;
We need to stop before the last index since it has been filled in the previous iteration.
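The "walk the 1 up to the midpoint" idea can also be written without mutating a list. This is only a sketch with a hypothetical function name; it builds the string for each position of the 1 directly.

```python
def single_one_strings(size):
    """Strings of length size with exactly one '1', one per reversal pair."""
    # Positions 0 .. (size - 1) // 2 cover one representative of each pair;
    # for odd sizes the middle position is its own reversal.
    return ["0" * i + "1" + "0" * (size - i - 1) for i in range((size + 1) // 2)]


print(single_one_strings(3))  # ['100', '010']
```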
This can be modified if both symbols (0,1) may appear repeatedly, but it is far simpler to use a module and then exclude inverses. Algorithm::Combinatorics has routines for all needs here. For all variations of 0 and 1 of length $size, where both may repeat
use Algorithm::Combinatorics qw(variations_with_repetition);
my @rep_vars = variations_with_repetition([0, 1], $size);
Inverse elements can then be excluded by a brute-force search, with O(N^2) complexity at worst.
Also note Math::Combinatorics.
The answer in the suggested duplicate Generating permutations which are not mirrors of each other doesn't deal with repeated elements (because that wasn't part of that question) so naively following it would include e.g. both 0100 and 0010. So this isn't an exact duplicate. But the idea applies.
Generate all the permutations but filter only for those with $_ le reverse $_. I think this is essentially what you suggest in the question, but there's no need to compute a map when a simple expression applied to each permutation will tell you whether to include it or not.
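A minimal Python sketch of that filter, with a hypothetical function name. It keeps a string only when it is not greater than its own reversal, and uses a set to drop the duplicates that repeated characters produce. Note that for 100 it keeps the lexicographically smaller member of each pair (001 rather than 100).

```python
from itertools import permutations


def unique_up_to_reversal(s):
    """Distinct permutations of s, keeping one string per reversal pair."""
    seen = set()
    result = []
    for p in permutations(sorted(s)):
        word = "".join(p)
        # Keep the smaller of (word, reversed word); skip repeats.
        if word <= word[::-1] and word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(unique_up_to_reversal("100"))
print(unique_up_to_reversal("abc"))
```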

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS?

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS EG ?
For example,
row1: the sun is nice
row2: the sun looks great
row3: the sun left me
Is there a code that would produce the following result column (2 words where sun is the first):
SUN IS
SUN LOOKS
SUN LEFT
and possibly a second column with COUNT in case of duplicate matches.
So if there was 20 SUN LOOKS then it they would be grouped and have a count of 20.
Thanks
I think you can use the functions findw() and scan() to do what you want. Both of those functions operate on the concept of word boundaries. findw() returns the position of the word in the string (with the 'e' modifier used below, it counts words rather than characters). Once you know the position, you can use scan() in a loop to get the next word or words following it.
Here is a simple example to show you the concept. It is by no means a finished or polished solution, but is intended to point you in the right direction. The input data set (text) contains the sentences you provided in your question with slight modifications. The data step finds the word "sun" in the sentence and creates a variable named fragment that contains 3 words ("sun" + the next 2 words).
data text2;
set text;
length fragment $15;
word = 'sun'; * search term;
fragment_len = 3; * number of words in target output;
word_pos = findw(sentence, word, ' ', 'e');
if word_pos then do;
do i = 0 to fragment_len-1;
fragment = catx(' ', fragment, scan(sentence, word_pos+i));
end;
end;
run;
Here is a partial print of the output data set.
You can use a combination of the INDEX, SUBSTR and SCAN functions to achieve this functionality.
INDEX - takes two arguments and returns the position at which a given substring appears in a string. You might use:
INDEX(str,'sun')
SUBSTR - simply returns a substring of the provided string, taking a second numeric argument referring to the starting position of the substring. Combine this with your INDEX function:
SUBSTR(str,INDEX(str,'sun'))
This returns the substring of str from the point where the word 'sun' first appears.
SCAN - returns the 'words' from a string, taking the string as the first argument, followed by a number referring to the 'word'. There is also a third argument that specifies the delimiter, but this defaults to space, so you wouldn't need it in your example.
To pick out the word after 'sun' you might do this:
SCAN(SUBSTR(str,INDEX(str,'sun')),2)
Now all that's left to do is build a new string containing the words of interest. That can be achieved with concatenation operators. To see how to concatenate two strings, run this illustrative example:
data _NULL_;
a = 'Hello';
b = 'World';
c = a||' - '||b;
put c;
run;
The log should contain this line:
Hello - World
This is the result of displaying the value of the c variable with the put statement. There are a number of functions that can be used to concatenate strings; see CAT, CATX and CATS in the documentation for some examples.
Hopefully there is enough here to help you.
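To make the target logic concrete outside SAS, here is a small Python sketch of the whole task: find the word, take it plus the following tokens, and count duplicate fragments. The function name and sample sentences are made up; in SAS the counting step would typically be a separate summary step, which is an assumption on my part and not something shown in the answers above.

```python
from collections import Counter


def fragments(sentences, word, n_words=2):
    """Count upper-cased fragments of n_words tokens starting at word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        lowered = [t.lower() for t in tokens]
        if word.lower() in lowered:
            start = lowered.index(word.lower())  # first occurrence only
            counts[" ".join(tokens[start:start + n_words]).upper()] += 1
    return counts


rows = ["the sun is nice", "the sun looks great", "the sun looks fine"]
for fragment, count in fragments(rows, "sun").items():
    print(fragment, count)
```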

Perl Morgan and a String?

I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
chomp(my $string1=<>);
chomp(my $string2=<>);
lexi($string1,$string2);
$n--;
}
sub lexi{
my($str1,$str2)=@_;
my @str1=split(//,$str1);
my @str2=split(//,$str2);
my $final_string="";
while(@str2 && @str1){
my $st2=$str2[0];
my $st1=$str1[0];
if($st1 le $st2){
$final_string.=$st1;
shift @str1;
}
else{
$final_string.=$st2;
shift @str2;
}
}
if(@str1){
$final_string=$final_string.join('',@str1);
}
else{
$final_string=$final_string.join('',@str2);
}
}
print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
For the sample test case it gives the right results, but it gives wrong results for other test cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm and why it is failing with other test cases?
Update:
As per @squeamishossifrage's comments, if I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
The results become same irrespective of user-inputs but still the test-case fails.
The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
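By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume. One caveat: when one remaining string is a prefix of the other (C versus CAB in the question's failing case), a plain le treats the shorter string as smaller, which picks the wrong C. The comparison below therefore appends a sentinel that sorts after every upper-case letter.
sub lexi {
my ($str1, $str2) = @_;
utf8::downgrade($str1); # Makes sure length() will be fast
utf8::downgrade($str2); # since we only have ASCII letters.
my $final_string = "";
while (length($str2) && length($str1)) {
# "~" sorts after every upper-case letter, so a string that is a prefix
# of the other compares as larger and the longer one is consumed first.
$final_string .= substr("$str1~" le "$str2~" ? $str1 : $str2, 0, 1, '');
}
$final_string .= $str1;
$final_string .= $str2;
print $final_string, "\n";
}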
Too little rep to comment thus the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAAA, since in the first comparison Z is le Z and the Z from the first string is taken. What you need to do (a naive solution, which most likely won't be very efficient) is keep looking ahead for as long as the strings/chars match:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one you want to pop the first character from.
Edit
You updated with sorting the strings, but like I wrote you still need to look ahead. The sorting will solve the two above strings but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
because here the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.
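The look-ahead can be sketched compactly in Python by comparing the remaining suffixes instead of single characters. The function name morgan is made up, and the lower-case "z" sentinel assumes the input contains only upper-case letters, as the problem statement says. The sentinel makes a string that is a prefix of the other compare as larger, which handles cases like C versus CAB. Slicing on every step makes this quadratic in the worst case, so treat it as an illustration rather than a tuned solution.

```python
def morgan(a, b):
    """Lexicographically smallest merge of two stacks of upper-case letters."""
    # Sentinel greater than any upper-case letter (assumes A-Z input only).
    a += "z"
    b += "z"
    i = j = 0
    out = []
    while i < len(a) - 1 and j < len(b) - 1:
        # Comparing whole suffixes looks ahead as far as needed to break ties.
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One side is exhausted; append the rest of the other, minus the sentinel.
    out.append(a[i:-1])
    out.append(b[j:-1])
    return "".join(out)


print(morgan("JACK", "DANIEL"))   # DAJACKNIEL
print(morgan("AABAC", "AACAB"))   # AAAABACABC
```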

Display lines containing duplicates within subset of string

How would I find duplicate lines by matching only one part of each line and not the whole line itself?
Take for example the following text:
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=152084(k152084) gid=10003(pemcln) groups=10003(pemcln) k152084
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
uid=154226(u154226) gid=10003(pemcln) groups=10003(pemcln) u154226
I would like to show only the 1st and 3rd lines, as they have the same UID value "154163".
The only ways I know how would match the whole line and not the subset of each one.
This code looks for the ID from each line. If any ID appears more than once, its lines are printed:
$ awk -F'[=(]' '{cnt[$2]++;lines[$2]=lines[$2]"\n"$0} END{for (k in cnt){if (cnt[k]>1)print lines[k]}}' file
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
How it works:
-F'[=(]'
awk separates input files into records (lines) and separates the records into fields. Here, we tell awk to use either = or ( as the field separator. This is done so that the second field is the ID.
cnt[$2]++; lines[$2]=lines[$2]"\n"$0
For every line that is read in, we keep a count, cnt, of how many times that ID has appeared. Also, we save all the lines associated with that ID in the array lines.
END{for (k in cnt){if (cnt[k]>1)print lines[k]}}
After we reach the end of the file, we go through each observed ID and, if it appeared more than once, its lines are printed.
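The same group-then-filter logic can be sketched in Python. The function name and the regular expression are assumptions for illustration; the pattern expects each line to start with uid=<digits>(, as in the sample data.

```python
import re
from collections import defaultdict


def duplicate_uid_lines(lines):
    """Return the lines whose numeric uid appears on more than one line."""
    groups = defaultdict(list)
    for line in lines:
        m = re.match(r"uid=(\d+)\(", line)
        if m:
            groups[m.group(1)].append(line)
    # dicts keep insertion order, so groups come out in first-seen order.
    return [line for group in groups.values() if len(group) > 1 for line in group]


sample = [
    "uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163",
    "uid=152084(k152084) gid=10003(pemcln) groups=10003(pemcln) k152084",
    "uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999",
    "uid=154226(u154226) gid=10003(pemcln) groups=10003(pemcln) u154226",
]
for line in duplicate_uid_lines(sample):
    print(line)
```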
Someone has already provided an awk script that will do what you need, assuming the files are small enough to fit into memory (they store all the lines until the end then decide what to output). There's nothing wrong with it, indeed it could be considered the canonical awk solution to this problem. I provide this answer really for those cases where awk may struggle with the storage requirements.
Specifically, if you have larger files that cause problems with that approach, the following awk script, myawkscript.awk, will handle them, provided you first sort the file so that the script can rely on related lines being adjacent. To ensure the input is sorted and that you can easily get at the relevant key (using = and ( as field separators), you call it with:
sort <inputfile | awk -F'[=(]' -f myawkscript.awk
The script is:
state == 0 {
if (lastkey == $2) {
printf "%s", lastrec;
print;
state = 1;
};
lastkey = $2;
lastrec = $0"\n";
next;
}
state == 1 {
if (lastkey == $2) {
print;
} else {
lastkey = $2;
lastrec = $0"\n";
state = 0;
}
}
It's basically a state machine where state zero is scanning for duplicates and state one is outputting the duplicates.
In state zero, the relevant part of the current line is checked against the previous and, if there's a match, it outputs both and switches to state one. If there's no match, it simply moves on to the next line.
In state one, it checks each line against the original in the set and outputs it as long as it matches. When it finds one that doesn't match, it stores it and reverts to state zero.
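For sorted input, the same streaming idea can be sketched in Python with itertools.groupby, which only ever holds one run of equal keys at a time. The helper names and the sample lines are hypothetical, and the key pattern again assumes lines of the form uid=<digits>(.

```python
import re
from itertools import groupby


def uid_key(line):
    """Extract the numeric uid, or None if the line has a different shape."""
    m = re.match(r"uid=(\d+)\(", line)
    return m.group(1) if m else None


def duplicates_in_sorted(lines):
    """Yield lines whose uid repeats; lines must already be sorted by uid."""
    for key, group in groupby(lines, key=uid_key):
        group = list(group)  # only the current run is held in memory
        if key is not None and len(group) > 1:
            yield from group


sample = sorted([
    "uid=154163(j154163) gid=10003(pemcln)",
    "uid=152084(k152084) gid=10003(pemcln)",
    "uid=154163(b153999) gid=10003(pemcln)",
])
for line in duplicates_in_sorted(sample):
    print(line)
```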

Fortran read of data with * to signify similar data

My data looks like this
-3442.77 -16749.64 893.08 -3442.77 -16749.64 1487.35 -3231.45 -16622.36 902.29
.....
159*2539.87 10*0.00 162*2539.87 10*0.00
which means I start with either 7 or 8 reals per line and then (towards the end) have 159 values of 2539.87 followed by 10 values of 0 followed by 162 of 2539.87 etc. This seems to be a space-saving method as previous versions of this file format were regular 6 reals per line.
I am already reading the data into a string because of not knowing whether there are 7 or 8 numbers per line. I can therefore easily spot lines that contain *. But what then? I suppose I have to identify the location of each * and then identify the integer number before and real value after before assigning to an array. Am I missing anything?
Read the line. Split it into tokens delimited by whitespace. Replace the * with a space in the tokens that have it. Then read one or two values from the token, depending on whether there was an asterisk or not. Sample code follows:
REAL, DIMENSION(big) :: data
CHARACTER(LEN=40) :: token
INTEGER :: iptr, count, idx
REAL :: val
iptr = 1
DO WHILE (there_are_tokens_left)
... ! Get the next token into "token"
idx = INDEX(token, "*")
IF (idx == 0) THEN
READ(token, *) val
count = 1
ELSE
! Replace "*" with space and read two values from the string
token(idx:idx) = " "
READ(token, *) count, val
END IF
data(iptr:iptr+count-1) = val ! Add "val" "count" times to the list of values
iptr = iptr + count
END DO
Here I have arbitrarily set the length of the token to be 40 characters. Adjust it according to what you expect to find in your input files.
BTW, for the sake of completeness, this method of compressing something by replacing repeating values with value/repetition-count pairs is called run-length encoding (RLE).
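The token-by-token expansion is easy to prototype in Python before committing to the Fortran version. This is only a sketch with a hypothetical function name; it assumes tokens are whitespace-separated and that a repeat token has the form count*value.

```python
def expand_line(line):
    """Expand run-length tokens such as '10*0.00' into a flat list of floats."""
    values = []
    for token in line.split():
        if "*" in token:
            count, value = token.split("*", 1)
            values.extend([float(value)] * int(count))
        else:
            values.append(float(token))
    return values


data = expand_line("159*2539.87 10*0.00 162*2539.87 10*0.00")
print(len(data))  # 341
```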
Your input data may have been written in a form suitable for list directed input (where the format specification in the READ statement is simply *). List directed input supports the r*c form that you see, where r is a repeat count and c is the constant to be repeated.
If the total number of input items is known in advance (perhaps it is fixed for that program, perhaps it is defined by earlier entries in the file) then reading the file is as simple as:
REAL :: data(size_of_data)
READ (unit, *) data
For example, for the last line shown in your example on its own, size_of_data would need to be 341, from 159+10+162+10.
With list directed input the data can span across multiple records (multiple lines) - you don't need to know how many items are on each line in advance - just how many appear in the next "block" of data.
List directed input has a few other "features" like this, which is why it is generally not a good idea to use it to parse "arbitrary" input that hasn't been written with it in mind. Use an explicit format specification instead (which may require creating the format specification on the fly to match the width of the input field if that is not known ahead of time).
If you don't know (or cannot calculate) the number of items in advance of the READ statement then you will need to do the parsing of the line yourself.

Resources