How to create a lookup table in Groovy?

I want to create a lookup table in Groovy, given a size (in this case the size is of 4):
RGGG
RRGG
RRRG
RRRR
That is, in the first iteration there should be one R followed by size-1 Gs. As the iteration number increases, the number of Rs should grow and the number of Gs should shrink accordingly. So for size 4 I will have 4 lookup values.
How could one do this in Groovy?

You mean like this:
def lut( width, a='R', b='G' ) {
    (1..width).collect { n ->
        ( a * n ) + ( b * ( width - n ) )
    }
}
def table = lut( 4 )
table.each { println it }
prints:
RGGG
RRGG
RRRG
RRRR
Your question doesn't really say what sort of data structure you expect as output; this code gives a List of Strings.
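For comparison, here is the same idea as a rough Python sketch (purely illustrative; the question itself is about Groovy):

def lut(width, a='R', b='G'):
    # row n is n copies of a followed by (width - n) copies of b
    return [a * n + b * (width - n) for n in range(1, width + 1)]

for row in lut(4):
    print(row)   # RGGG, RRGG, RRRG, RRRR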

Related

How to append float to list?

I want to append a float to a list, but I got an error like this:
<ipython-input-47-08d9c3f8f180> in maxEs()
12 Es = lists[0]*0.3862 + lists[1]*0.3091 + lists[2]*0.4884
13 aaa = []
---> 14 Es.append(aaa)
15
AttributeError: 'float' object has no attribute 'append'
I guess I can't append a float to a list like this. Can I add floats to a list another way?
This is my code:
import math

def maxEs():
    for a in range(1, 101):
        for b in range(1, 101):
            for c in range(1, 101):
                if a + b + c == 100:
                    lists = []
                    lists.append(a * 0.01)
                    lists.append(b * 0.01)
                    lists.append(c * 0.01)
                    Es = lists[0] * 0.3862 + lists[1] * 0.3091 + lists[2] * 0.4884
                    aaa = []
                    Es.append(aaa)
I don't know what you want, but you are trying to append a list to a float, not the other way round.
Should be
aaa.append(Es)
The other answer already explained the main problem with your code, but there is more:
as already said, it has to be aaa.append(Es) (you did it right for the other list)
speaking of the other list: you don't need it at all; just use the values directly in the formula
aaa is re-initialized and overwritten in each iteration of the loop; you should probably move it to the top
you do not need the inner loop to find c; once you know a and b, you can calculate c so that it satisfies the condition
you can also restrict the loop for b, so the result does not exceed 100
finally, you should probably return some result (the max of aaa maybe?)
We do not know what exactly the code is trying to achieve, but maybe try this:
def maxEs():
    aaa = []
    for a in range(1, 98 + 1):
        for b in range(1, 99 - a + 1):
            c = 100 - a - b
            Es = 0.01 * (a * 0.3862 + b * 0.3091 + c * 0.4884)
            aaa.append(Es)
    return max(aaa)
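A quick usage check (hypothetical; the maximum should come from making c, the term with the largest coefficient, as large as possible):

print(maxEs())   # roughly 0.4856, reached at a=1, b=1, c=98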

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id   event   timestamp
----------------------
1    A       1
1    B       2
1    C       4
1    D       7
1    E       15
1    F       16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key : value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, RequestData> call(String s) {
If this were not Spark, I could simply have created an array timeDifferences, stored the differences between adjacent timestamps, and split the array into parts whenever a difference in timeDifferences is larger than TIMEOUT. (Although, actually, there is no need to explicitly create an array.)
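For illustration, a plain (non-Spark) Python sketch of that splitting idea might look like this:

def split_by_timeout(events, timeout=5):
    # events: list of (event, timestamp) pairs, already sorted by timestamp
    groups, current = [], []
    for name, ts in events:
        if current and ts - current[-1][1] > timeout:
            groups.append(current)     # gap exceeded the timeout: close the window
            current = []
        current.append((name, ts))
    if current:
        groups.append(current)
    return groups

# split_by_timeout([("A", 1), ("B", 2), ("C", 4), ("D", 7), ("E", 15), ("F", 16)])
# -> [[("A", 1), ("B", 2), ("C", 4), ("D", 7)], [("E", 15), ("F", 16)]]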
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x=>(x._2,x._1))
val B = A.map(x=>(x._1-1,x._2))
val C = A.leftOuterJoin(B).map(x=>(x._2._1,x._2._1._3 - (x._2._2 match {
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x=>(x._2 <= 5))
val group2 = C.filter(x=>(x._2 > 5))
So the concept is: you zip with index to create val A (which assigns a serial long number to each entry of your RDD), duplicate the RDD but shift the index to point at the following entry to create val B (by subtracting 1 from the index), then use a join to work out the gap between consecutive entries and compare it with the TIMEOUT. Then filter. This method stays entirely within RDDs. An easier way is to collect the entries to the master and use a map or zipped mapping, but that would be plain Scala rather than Spark, I guess.
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i - 1, e) })

  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })

  // collecting (to driver memory!) cutoff points - timestamps of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()

  // going back to the original input, grouping by each event's nearest cutoff point (i.e. the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer: joining the input with itself at an offset of 1 to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.

What is the best way to check correspondence in R

I am trying to check if two variables have a one-to-one relation. One of the two variables contains address strings, while the other contains an ID for the address. I would like to see if there is a one-to-one correspondence.
I was thinking about converting the characters to ASCII codes or assigning them a value using a math function, but I want to know if there are other, easier and more effective ways to do this.
You can use table, and check if the resulting matrix has exactly one 1
in each row and in each column. That also tells you where the duplicates are.
d <- data.frame(
    x = sample( LETTERS, 10, replace=TRUE ),
    y = sample( LETTERS, 10, replace=TRUE )
)
m <- table(d) != 0
all( rowSums( m ) == 1 ) && all( colSums( m ) == 1 )
But if there is a lot of data, this is not very efficient.
You can use a sparse matrix instead.
library(Matrix)
m <- sparseMatrix(
    i = as.numeric( as.factor( d$x ) ),
    j = as.numeric( as.factor( d$y ) ),
    x = rep( 1, nrow(d) )
)
m <- m > 0
all( rowSums( m ) == 1 ) && all( colSums( m ) == 1 )
You can also use sqldf.
library(sqldf)
sqldf( "SELECT x, COUNT( DISTINCT y ) AS n FROM d GROUP BY x HAVING n > 1" )
sqldf( "SELECT y, COUNT( DISTINCT x ) AS n FROM d GROUP BY y HAVING n > 1" )
You can also simply count how many different pairs you have: it should be the same as the number of distinct values of x and of y.
nrow( unique(d) ) == length(unique(d$x)) && nrow( unique(d) ) == length(unique(d$y))
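The check itself is language-agnostic; for illustration only, here is a rough Python sketch of the same pair-counting idea (not R, just to show the logic):

def is_one_to_one(pairs):
    # pairs: iterable of (address, id) tuples
    fwd, back = {}, {}
    for x, y in set(pairs):                 # exact duplicate pairs are harmless
        if fwd.setdefault(x, y) != y:       # one address mapped to two ids
            return False
        if back.setdefault(y, x) != x:      # one id mapped to two addresses
            return False
    return True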

Evenly distribute repetitive strings

I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better than simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
Neither the number of distinct strings nor the number of repetitions of any string has a limit. The input can be changed to a list of string number_of_reps pairs:
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?
Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
Weighing an arrangement of items
By "evenly distribute" I mean that intervals between each two occurrences of a string and the interval between the start point and the first occurrence of the string and the interval between the last occurrence and the end point must be as much close to equal as possible where 'interval' is the number of other strings.
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
A B A C B A A
would give the count
A: 1 2 3 1 1
B: 2 3 3
C: 4 4
I.e. two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the sequence. These properties make the distances easier to calculate, but are just a constant that will be removed later.
This is the code for counting distances:
sub distances {
    my %distances;
    my %last_seen;
    for my $i (0 .. $#_) {
        my $s = $_[$i];
        push @{ $distances{$s} }, $i - ($last_seen{$s} // -1);
        $last_seen{$s} = $i;
    }
    push @{ $distances{$_} }, @_ - $last_seen{$_} for keys %last_seen;
    return values %distances;
}
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far it is off from the average a. As it is squared, large anomalies are heavily penalized:
variance(d, a) = (a - d)²
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
svar(items) = sqrt ∑_i variance(items[i], average(items))
Expressed as Perl code:
use List::Util qw/sum min/;

sub svar (@) {
    my $med = sum(@_) / @_;
    sqrt sum map { ($med - $_) ** 2 } @_;
}
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
Strings with more occurrences should have greater weight than strings with fewer occurrences.
Uneven distributions should have greater weight than even distributions, to strongly penalize unevenness.
The following can be swapped out for a different procedure, but I decided to weigh each standard variance by raising it to the power of the number of occurrences, then adding up all the weighted variances:
sub weigh_distance {
    return sum map {
        my @distances = @$_; # the distances of one string
        svar(@distances) ** $#distances;
    } distances(@_);
}
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to weigh_distance. Therefore, we can decide whether two permutations are equally well distributed, or whether one is to be preferred:
Selecting optimal permutations
Given a selection of permutations, we can select those permutations that are optimal:
sub select_best {
    my %sorted;
    for my $strs (@_) {
        my $weight = weigh_distance(@$strs);
        push @{ $sorted{$weight} }, $strs;
    }
    my $min_weight = min keys %sorted;
    @{ $sorted{$min_weight} }
}
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returned array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
Creating all possible permutations
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining available occurrences. With a bit of recursion, we can build all permutations like this:
use Carp;

# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
    my %words = @_;
    my @keys =
        sort                # sorting is important for cache access
        grep { $words{$_} > 0 }
        grep { length or carp "Can't use empty strings as identifiers" }
        keys %words;
    my ($perms, $ok) = _fetch_perm_cache(\@keys, \%words);
    return @$perms if $ok;
    # build perms manually, if it has to be.
    # pushing into @$perms directly updates the cached values
    for my $key (@keys) {
        my @childs = make_perms(%words, $key => $words{$key} - 1);
        push @$perms, (@childs ? map [$key, @$_], @childs : [$key]);
    }
    return @$perms;
}
The _fetch_perm_cache returns a ref to a cached array of permutations, plus a boolean flag to test for success. I used the following implementation with deeply nested hashes, which stores the permutations on leaf nodes. To mark the leaf nodes, I have used the empty string, hence the test above.
sub _fetch_perm_cache {
    my ($keys, $idxhash) = @_;
    state %perm_cache;
    my $pointer = \%perm_cache;
    my $ok = 1;
    $pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for @$keys;
    $pointer = $pointer->{''} //= do { $ok = 0; +[] }; # access empty string key
    return $pointer, $ok;
}
The fact that not all strings are valid input keys is not an issue: every collection can be enumerated, so make_perms could be given integers as keys, which the caller then translates back to whatever data they represent. Note that the caching makes this non-threadsafe (if %perm_cache were shared).
Connecting the pieces
This is now a simple matter of
say "#$_" for select_best(make_perms(A => 4, B => 1, C => 1))
This would yield
A A C A B A
A A B A C A
A C A B A A
A B A C A A
which are all optimal solutions by the definition used. Interestingly, the solution
A B A A C A
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Further work.
Completing the test cases
Preferable versions are listed first: AABAA over ABAAA; ABABACA over ABACBAA (two 'A's in a row); ABAC over ABCA.
We can run these test cases with:
use Test::More tests => 3;
my @test_cases = (
    [0 => [qw/A A B A A/], [qw/A B A A A/]],
    [1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
    [0 => [qw/A B A C/], [qw/A B C A/]],
);

for my $test (@test_cases) {
    my ($correct_index, @cases) = @$test;
    my $best = select_best(@cases);
    ok $best ~~ $cases[$correct_index], "[@{$cases[$correct_index]}]";
}
Out of interest, we can calculate the optimal distributions for these letters:
my @counts = (
    { A => 4, B => 1 },
    { A => 4, B => 2, C => 1 },
    { A => 2, B => 1, C => 1 },
);

for my $count (@counts) {
    say "Selecting best for...";
    say " $_: $count->{$_}" for keys %$count;
    say "@$_" for select_best(make_perms(%$count));
}
This gives us:
Selecting best for...
A: 4
B: 1
A A B A A
Selecting best for...
A: 4
C: 1
B: 2
A B A C A B A
Selecting best for...
A: 2
C: 1
B: 1
A C A B
A B A C
C A B A
B A C A
Further work
Because the weighing attributes the same importance to the distance to the edges as to the distance between letters, symmetrical setups are preferred. This condition could be eased by reducing the value of the distance to the edges.
The permutation generation algorithm has to be improved. Memoization could lead to a speedup. Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.
It would be great to find a heuristic to guide the permutation generation, instead of evaluating all possibilities. A possible heuristic would consider whether there are enough different strings available that no string has to neighbour itself (i.e. distance 1); a rough check for this is sketched after these notes. This information could be used to narrow the width of the search tree.
Transforming the recursive perm generation into an iterative solution would allow interweaving searching with weight calculation, which would make it easier to skip or defer unfavourable solutions.
The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
weight(svar, occurrences) → weighted_variance
weight(0.9, 10) → 0.35
weight(0.5, 1) → 0.5
This should in fact be reversed.
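As an aside on the heuristic mentioned above, a feasibility check of the "no string has to neighbour itself" condition could look roughly like this Python sketch (my own illustration, not part of the answer's Perl code):

def can_avoid_adjacent_repeats(counts):
    # counts: mapping from string to number of occurrences.
    # No string ever has to sit next to itself iff the most frequent
    # string fits into every other slot of the output sequence.
    total = sum(counts.values())
    return max(counts.values()) <= (total + 1) // 2

# can_avoid_adjacent_repeats({'A': 4, 'B': 1})          -> False (AABAA is forced)
# can_avoid_adjacent_repeats({'A': 4, 'B': 2, 'C': 1})  -> True  (e.g. ABACABA)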
Edit
Below is a faster procedure that approximates a good distribution. In some cases it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but it is generally acceptable where only a few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
sub approximate {
    my %def = @_;
    my ($init, @keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
    my @out = ($init) x $def{$init};
    while (my $key = shift @keys) {
        my $visited = 0;
        for my $parts_left (reverse 2 .. $def{$key} + 1) {
            my $interrupt = $visited + int((@out - $visited) / $parts_left);
            splice @out, $interrupt, 0, $key;
            $visited = $interrupt + 1;
        }
    }
    # check if strings should be swapped
    for my $i (0 .. $#out - 2) {
        @out[$i, $i + 1] = @out[$i + 1, $i]
            if  $out[$i] ne $out[$i + 1]
            and $out[$i + 1] eq $out[$i + 2]
            and (!$i or $out[$i + 1] ne $out[$i - 1]);
    }
    return @out;
}
Edit 2
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like "two of the first thing, one of the second". The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2, C => 4, B => 2 and $regex => 2, $fh => 4 all represent the same abstract multiset. The speed penalty incurred by the necessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.
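For illustration only (the actual project code is not shown here), the abstract-representation idea might be sketched in Python roughly like this:

def canonical_form(counts):
    # counts: mapping from arbitrary objects to their multiplicities,
    # e.g. {'A': 4, 'C': 2} or {some_regex: 2, some_filehandle: 4}
    items = sorted(counts.items(), key=lambda kv: (-kv[1], id(kv[0])))
    abstract = tuple(mult for _, mult in items)   # both examples become (4, 2)
    mapping = [obj for obj, _ in items]           # abstract index -> original object
    return abstract, mapping

Permutations can then be generated and cached purely over the abstract form, and translated back through the mapping by the caller.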
The large bottleneck is in the select_best sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).
These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.

R: How to replace a character in a string after sampling and print out character instead of index?

I'd like to replace a character in a string with another character, where the replacement is chosen by sampling. I'm having trouble getting it to print out the character instead of the index.
Example data, labelled "try":
L 0.970223325 - 0.019851117 X 0.007444169
K 0.962779156 - 0.027295285 Q 0.004962779
P 0.972704715 - 0.027295285 NA 0
C 0.970223325 - 0.027295285 L 0.00248139
V 0.970223325 - 0.027295285 T 0.00248139
I'm trying to sample a character for a given row using weighted probabilities.
samp <- function(row) {
    sample(try[row, seq(1, length(try), 2)], 1, prob = try[row, seq(2, length(try), 2)])
}
Then, I want to use the selected character to replace a position in a given string.
subchar <- function(string, pos, new) {
    paste(substr(string, 1, pos-1), new, substr(string, pos+1, nchar(string)), sep='')
}
My question is: if I do, for example,
> subchar("KLMN", 3, samp(4))
[1] "KL1N"
But I want it to read "KLCN". as.character(samp(4)) doesn't work either. How do I get it to print out the character instead of the index?
The problem arises because your letters are stored as factors rather than characters, and samp is returning a data.frame.
C is the first level in your factor so that is stored as 1 internally, and as.character (which gets invoked by the paste statement) pulls this out when working on the mini-data.frame:
samp(4)
V1
4 C
as.character(samp(4))
[1] "1"
You can solve this in two ways: either drop the data.frame from the samp output in your call to subchar, or modify samp to do so:
subchar("KLMN", 3, samp(4)[,1])
[1] "KLCN"
samp2 <- function(row) {
    sample(try[row, seq(1, length(try), 2)], 1, prob = try[row, seq(2, length(try), 2)])[,1]
}

subchar("KLMN", 3, samp2(4))
[1] "KLCN"
You may also find it easier to sample within your subsetting, and you can drop the data.frame from there:
samp3 <- function(row) {
    try[row, sample(seq(1, length(try), 2), 1, prob = try[row, seq(2, length(try), 2)]), drop = TRUE]
}
