Association and Sequence Mining - search

Suppose I have a string of numbers in which hyphens represent spaces; define
string, A := 1212-241321413-21341-3
and I have a known group of numbers that I am interested in,
group, G := ( 1, 2)
meaning that I don't care about the order, 12 or 21. I just want to know: is there an algorithm that finds all the substrings, however long, and their beginning positions, that contain ONLY 1 AND 2? (The substring must contain both a 1 and a 2, and there is no neighboring repetition, i.e. you will never see a 22 or an 11.)
That means if I ran the algorithm with the string A and the group G, I would get something like
>> substringfind(A, G)
>>
>> { 1212 : [0], 21 : [9, 15] }
if the algorithm returned a dictionary with the substrings as keys and lists of their beginning locations in the string as values.
Another example would be
group H := (1, 3, 4)
and the algorithm would produce
>> substringfind(A, H)
>>
>> { 413 : [7], 1413 : [11], 1341 : [16] }
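A minimal Python sketch of one way such a substringfind could be implemented (my own illustration, not from the original post; it assumes the rules above: a run breaks at any character outside the group or at an immediate repeat, and a run counts only if it contains every group member):
def substringfind(s, group):
    group = {str(g) for g in group}
    result = {}
    start = None
    for i, c in enumerate(s + "\0"):           # "\0" sentinel closes the final run
        in_run = start is not None
        if c in group and (not in_run or c != s[i - 1]):
            if not in_run:
                start = i
        else:
            if in_run:
                run = s[start:i]
                if set(run) == group:           # must contain every group member
                    result.setdefault(run, []).append(start)
            start = i if c in group else None   # a repeated character starts a new run
    return result

print(substringfind("1212-241321413-21341-3", (1, 2)))   # {'1212': [0], '21': [9, 15]}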

Related

Replace string characters with their word index

Note the two consecutive spaces in this string:
string = "Hello there everyone!"
for i, c in enumerate(string):
    print(i, c)
0 H
1 e
2 l
3 l
4 o
5
6 t
7 h
8 e
9 r
10 e
11
12
13 e
14 v
15 e
16 r
17 y
18 o
19 n
20 e
21 !
How can I make a list len(string) long, with each value containing the word count up to that point in the string?
Expected output: 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
The only way I could do it was by looping through each character, setting a space=True flag and increasing a counter each time I hit a non-space character when space == True. This is probably because I'm most proficient with C, but I would like to learn a more Pythonic way to solve this.
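For reference, the flag-and-counter approach described above might look roughly like this in Python (a sketch of the described idea, not the asker's actual code):
def word_index_flags(phrase):
    counts = []
    count = -1          # becomes 0 when the first word starts
    in_space = True     # treat the start of the string as "after a space"
    for c in phrase:
        if c != " " and in_space:
            count += 1
        in_space = (c == " ")
        counts.append(count)
    return counts

print(word_index_flags("Hello there  everyone!"))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]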
I feel like your solution is not that far from being Pythonic. Maybe you can use the zip function to iterate over your string in pairs and then just detect local changes (from a space to a letter -> this is a new word):
string = "Hello there everyone!"
def word_index(phrase):
nb_words = 0
for a, b in zip(phrase, phrase[1:]):
if a == " " and b != " ":
nb_words += 1
yield nb_words
print(list(word_index(string)))
This also makes use of generators, which are quite common in Python (see the documentation for the yield keyword). You can probably do the same by using itertools.accumulate instead of the for loop, but I'm not sure it wouldn't obfuscate the code (see the third item of The Zen of Python). Here is what it would look like; note that I used a lambda function here, not because I think it's the best choice, but simply because I couldn't find a meaningful function name:
import itertools
def word_index(phrase):
    char_pairs = zip(phrase, phrase[1:])
    new_words = map(lambda p: int(p[0] == " " and p[1] != " "), char_pairs)
    return itertools.accumulate(new_words)
This second version, like the first one, returns an iterator. Note that using iterators is usually a good idea, as it makes no assumption about whether your user wants to instantiate anything. If the user wants to turn an iterator it into a list, they can always call list(it) as I did in the first piece of code. Iterators simply give you the values one by one: at any point in time, only a single value is in memory:
for word_index in word_index(string):
    print(word_index)
Remark that phrase[1:] makes a copy of the phrase, which means it doubles the memory used. This can be improved by using itertools.islice, which returns an iterator (and therefore only uses constant memory). The second version would, for example, look like this:
def word_index(phrase):
    char_pairs = zip(phrase, itertools.islice(phrase, 1, None))
    new_words = map(lambda p: int(p[0] == " " and p[1] != " "), char_pairs)
    return itertools.accumulate(new_words)

How to implement a sequence of numbers, each consecutive pair adds to 4?

I have a sequence (seq) of numbers.
I want the addition of each consecutive pair of numbers to equal 4.
Below is my attempt at implementing this. But, it is wrong. The Alloy Analyzer showed me it's wrong, by generating this instance:
2, 2, -2, 4
The first pair adds to 4. (2 + 2 = 4)
The second pair does not. (2 + -2 = 0)
What is the correct way to implement this? Note: I need to use sequences (seq), so please don't change the signature or its field. I am hoping that you can show me the correct way to express the fact. Or, tell me that it's impossible to implement given the use of seq.
one sig Test {
    numbers: seq Int
}

fact {
    all disj n, n': Test.numbers.elems {
        (plus[Test.numbers.idxOf[n], 1] = Test.numbers.idxOf[n']) =>
            plus[n, n'] = 4
    }
}
run {#Test.numbers.indsOf[2] > 1}
To explain why your fact is incorrect, consider the following counterexample: the Test.numbers sequence is 2, 2, 2, 4.
In that counterexample:
Test.numbers.elems evaluates to 2, 4
Test.numbers.idxOf[2] is 0 (the first index of element 2)
Test.numbers.idxOf[4] is 3
there are no two disjoint n and n' in Test.numbers.elems (i.e., {2, 4}) such that plus[Test.numbers.idxOf[n], 1] = Test.numbers.idxOf[n'], so the fact trivially holds.
The following fact should express your desired property correctly:
fact {
    all i: Test.numbers.inds - (#Test.numbers).prev |
        plus[Test.numbers[i], Test.numbers[i.next]] = 4
}
mySeq.inds evaluates to indexes of the sequence mySeq
i.next evaluates to i + 1
i.prev evaluates to i - 1
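Outside Alloy, the intended property is simply that every adjacent pair of numbers sums to 4. A quick Python check of that condition (my own illustration, not part of the original answer) makes the counterexamples easy to test:
def pairs_sum_to_4(numbers):
    # every consecutive pair must add up to 4
    return all(a + b == 4 for a, b in zip(numbers, numbers[1:]))

print(pairs_sum_to_4([2, 2, -2, 4]))   # False: 2 + -2 = 0
print(pairs_sum_to_4([2, 2, 2, 4]))    # False: 2 + 4 = 6
print(pairs_sum_to_4([2, 2, 2, 2]))    # True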

Python convert string to variable name

I'm aware that this may come up as a duplicate, but so far I haven't found (or, should that be, understood) an answer to what I'm looking for.
I have a list of strings and want to convert each one into a variable name which I then assign something to. I understand that I may need a dict for this, but I am unfamiliar with them as I am relatively new to Python, and all the examples I have seen so far deal with values whilst I'm trying something different.
I'm after something like:
list = ['spam', 'eggs', 'ham']
for i in range(len(list)):
    list[i] = rat.readColumn(ratDataset, list[i])
where the first list[i] is a variable name and not a string. The second list[i] is a string (and, for context, is the name of a column I'm reading from a raster attribute table (rat)).
Essentially I want each string within the list to be set as a variable name.
The idea behind this is that I can create a loop without having to write out the line for each variable I want, with the matching rat column name (the string). Maybe there is a better way of doing this than I am suggesting?
Try the following:
lst = ['spam', 'eggs', 'ham']
d = {} # empty dictionary
for name in lst:
    d[name] = rat.readColumn(ratDataset, name)
Do not use list for your identifiers, as it is a type identifier and you would mask its existence. The for loop can iterate directly over the elements inside -- no need to construct an index and use it against the list. The d['spam'] will be one of your variables.
Although it is also possible to create real variable names like spam, eggs, and ham, you would probably not do that, as the effect would be of little use.
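For completeness, dynamically creating real names is usually done through the globals() dictionary; a sketch of what that would look like (shown only to illustrate the idea, the dictionary approach above is almost always preferable):
names = ['spam', 'eggs', 'ham']
for i, name in enumerate(names):
    globals()[name] = i      # creates module-level variables spam, eggs, ham

print(spam, eggs, ham)       # 0 1 2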
Here is a simple dictionary-based example:
variables = ['spam', 'eggs', 'ham']
data = {}
datum = 0
for variable in variables:
    data[variable] = datum
    datum += 1
print(data)
print("value : ",data[variables[2]])
It gives as result :
{'eggs': 1, 'ham': 2, 'spam': 0}
value : 2
NB: don't use list as a variable name; list is a type identifier that you can use to transform an object into a list if possible (list("abc") == ['a', 'b', 'c']), and you are overriding it with your value list right now.
One way is setting the variable name as a string, changing part or all of it via the format() method, and then using the string as a variable via vars()[STRING]:
import numpy as np
X1 = np.arange(1, 10)
y1 = [i**2 for i in X1]
X2 = np.arange(-5, 5)
y2 = [i**2 for i in X2]

for i in range(1, 3):
    X = 'X{}'.format(i)
    y = 'y{}'.format(i)
    print('X_{}'.format(i), vars()[X])
    print('y_{}'.format(i), vars()[y])
Output:
X_1 [1 2 3 4 5 6 7 8 9]
y_1 [1, 4, 9, 16, 25, 36, 49, 64, 81]
X_2 [-5 -4 -3 -2 -1 0 1 2 3 4]
y_2 [25, 16, 9, 4, 1, 0, 1, 4, 9, 16]

Evenly distribute repetitive strings

I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better than simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
The number of repetitive strings doesn't have any limit, nor does the number of reps of any string.
The input can be changed to list string number_of_reps
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?
Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
Weighing an arrangement of items
By "evenly distribute" I mean that intervals between each two occurrences of a string and the interval between the start point and the first occurrence of the string and the interval between the last occurrence and the end point must be as much close to equal as possible where 'interval' is the number of other strings.
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
A B A C B A A
would give the count
A: 1 2 3 1 1
B: 2 3 3
C: 4 4
I.e., two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the string. These properties make the distances easier to calculate; the extra one is just a constant offset that cancels out later, since it shifts every distance and the average equally.
This is the code for counting distances:
sub distances {
    my %distances;
    my %last_seen;
    for my $i (0 .. $#_) {
        my $s = $_[$i];
        push @{ $distances{$s} }, $i - ($last_seen{$s} // -1);
        $last_seen{$s} = $i;
    }
    push @{ $distances{$_} }, @_ - $last_seen{$_} for keys %last_seen;
    return values %distances;
}
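For readers more comfortable with Python, the same bookkeeping can be sketched like this (my transcription, not part of the original answer):
def distances(seq):
    # distance from the previous occurrence (or from the start), plus a
    # final distance from the last occurrence to the end of the sequence
    dist, last_seen = {}, {}
    for i, s in enumerate(seq):
        dist.setdefault(s, []).append(i - last_seen.get(s, -1))
        last_seen[s] = i
    for s, i in last_seen.items():
        dist[s].append(len(seq) - i)
    return dist

print(distances("ABACBAA"))
# {'A': [1, 2, 3, 1, 1], 'B': [2, 3, 3], 'C': [4, 4]}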
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far it is off from the average a. As it is squared, large anomalies are heavily penalized:
variance(d, a) = (a - d)²
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
svar(items) = sqrt ∑_i variance(items[i], average(items))
Expressed as Perl code:
use List::Util qw/sum min/;

sub svar (@) {
    my $med = sum(@_) / @_;
    sqrt sum map { ($med - $_) ** 2 } @_;
}
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
Strings with more occurrences should have greater weight than strings with fewer occurrences.
Uneven distributions should have greater weight than even distributions, to strongly penalize unevenness.
The following can be swapped out for a different procedure, but I decided to weigh each standard variance by raising it to the power of occurrences, then adding all weighted svariances:
sub weigh_distance {
    return sum map {
        my @distances = @$_;    # the distances of one string
        svar(@distances) ** $#distances;
    } distances(@_);
}
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to weigh_distance. Therefore, we can decide whether two permutations are equally well distributed, or whether one is to be preferred:
Selecting optimal permutations
Given a selection of permutations, we can select those that are optimal:
sub select_best {
    my %sorted;
    for my $strs (@_) {
        my $weight = weigh_distance(@$strs);
        push @{ $sorted{$weight} }, $strs;
    }
    my $min_weight = min keys %sorted;
    @{ $sorted{$min_weight} }
}
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returned array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
Creating all possible permutations
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining available occurrences. With a bit of recursion, we can build all permutations like this:
use Carp;

# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
    my %words = @_;
    my @keys =
        sort    # sorting is important for cache access
        grep { $words{$_} > 0 }
        grep { length or carp "Can't use empty strings as identifiers" }
        keys %words;
    my ($perms, $ok) = _fetch_perm_cache(\@keys, \%words);
    return @$perms if $ok;
    # build perms manually, if it has to be.
    # pushing into @$perms directly updates the cached values
    for my $key (@keys) {
        my @childs = make_perms(%words, $key => $words{$key} - 1);
        push @$perms, (@childs ? map [$key, @$_], @childs : [$key]);
    }
    return @$perms;
}
The _fetch_perm_cache returns a ref to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, which stores the permutations on leaf nodes. To mark the leaf nodes, I used the empty string, hence the above test.
sub _fetch_perm_cache {
    my ($keys, $idxhash) = @_;
    state %perm_cache;
    my $pointer = \%perm_cache;
    my $ok = 1;
    $pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for @$keys;
    $pointer = $pointer->{''} //= do { $ok = 0; +[] };   # access the empty string key
    return $pointer, $ok;
}
That not all strings are valid input keys is no issue: every collection can be enumerated, so make_perms could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if %perm_cache were shared).
Connecting the pieces
This is now a simple matter of
say "#$_" for select_best(make_perms(A => 4, B => 1, C => 1))
This would yield
A A C A B A
A A B A C A
A C A B A A
A B A C A A
which are all optimal solutions by the used definition. Interestingly, the solution
A B A A C A
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Further work.
Completing the test cases
Preferable versions are first: AABAA ABAAA, ABABACA ABACBAA (two 'A' in a row), ABAC ABCA
We can run these test cases by
use Test::More tests => 3;

my @test_cases = (
    [0 => [qw/A A B A A/], [qw/A B A A A/]],
    [1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
    [0 => [qw/A B A C/], [qw/A B C A/]],
);

for my $test (@test_cases) {
    my ($correct_index, @cases) = @$test;
    my $best = select_best(@cases);
    ok $best ~~ $cases[$correct_index], "[@{$cases[$correct_index]}]";
}
Out of interest, we can calculate the optimal distributions for these letters:
my @counts = (
    { A => 4, B => 1 },
    { A => 4, B => 2, C => 1 },
    { A => 2, B => 1, C => 1 },
);

for my $count (@counts) {
    say "Selecting best for...";
    say "  $_: $count->{$_}" for keys %$count;
    say "@$_" for select_best(make_perms(%$count));
}
This brings us
Selecting best for...
A: 4
B: 1
A A B A A
Selecting best for...
A: 4
C: 1
B: 2
A B A C A B A
Selecting best for...
A: 2
C: 1
B: 1
A C A B
A B A C
C A B A
B A C A
Further work
Because the weighing attributes the same importance to the distance to the edges as to the distance between letters, symmetrical setups are preferred. This condition could be eased by reducing the value of the distance to the edges.
The permutation generation algorithm has to be improved. Memoization could lead to a speedup. Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.
It would be great to find a heuristic to guide the permutation generation, instead of evaluating all possibilities. A possible heuristic would consider whether there are enough different strings available that no string has to neighbour itself (i.e. distance 1). This information could be used to narrow the width of the search tree.
Transforming the recursive perm generation into an iterative solution would allow interweaving the search with the weight calculation, which would make it easier to skip or defer unfavourable solutions.
The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
weight(svar, occurrences) → weighted_variance
weight(0.9, 10) → 0.35
weight(0.5, 1) → 0.5
This should in fact be reversed.
Edit
Below is a faster procedure that approximates a good distribution. In some cases, it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but is generally acceptable where only few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
sub approximate {
    my %def = @_;
    my ($init, @keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
    my @out = ($init) x $def{$init};
    while (my $key = shift @keys) {
        my $visited = 0;
        for my $parts_left (reverse 2 .. $def{$key} + 1) {
            my $interrupt = $visited + int((@out - $visited) / $parts_left);
            splice @out, $interrupt, 0, $key;
            $visited = $interrupt + 1;
        }
    }
    # check if strings should be swapped
    for my $i (0 .. $#out - 2) {
        @out[$i, $i + 1] = @out[$i + 1, $i]
            if  $out[$i] ne $out[$i + 1]
            and $out[$i + 1] eq $out[$i + 2]
            and (!$i or $out[$i + 1] ne $out[$i - 1]);
    }
    return @out;
}
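As a rough Python transcription of the same interval-insertion idea (my own sketch, not the author's code; the input is assumed to be a dict such as {'aaa': 3, 'bbb': 2}):
def approximate(counts):
    # most frequent string first; it forms the initial backbone
    keys = sorted(counts, key=lambda k: (-counts[k], k))
    init, rest = keys[0], keys[1:]
    out = [init] * counts[init]
    for key in rest:
        visited = 0
        # insert this string at roughly regular intervals
        for parts_left in range(counts[key] + 1, 1, -1):
            cut = visited + (len(out) - visited) // parts_left
            out.insert(cut, key)
            visited = cut + 1
    # spread out avoidable repetitions by swapping neighbours
    for i in range(len(out) - 2):
        if (out[i] != out[i + 1] and out[i + 1] == out[i + 2]
                and (i == 0 or out[i + 1] != out[i - 1])):
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(approximate({'aaa': 3, 'bbb': 2}))   # ['aaa', 'bbb', 'aaa', 'bbb', 'aaa']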
Edit 2
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like "two of the first thing, one of the second". The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2 and C => 4, B => 2 and $regex => 2, $fh => 4 all represent the same abstract multiset. The speed penalty incurred by the necessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.
The large bottleneck is in the select_best sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).
These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.

R: How to replace a character in a string after sampling and print out character instead of index?

I'd like to replace a character in a string with another character, by first sampling the replacement character. I'm having trouble getting it to print out the character instead of the index.
Example data, labelled "try":
L 0.970223325 - 0.019851117 X 0.007444169
K 0.962779156 - 0.027295285 Q 0.004962779
P 0.972704715 - 0.027295285 NA 0
C 0.970223325 - 0.027295285 L 0.00248139
V 0.970223325 - 0.027295285 T 0.00248139
I'm trying to sample a character for a given row using weighted probabilities.
samp <- function(row) {
    sample(try[row, seq(1, length(try), 2)], 1, prob = try[row, seq(2, length(try), 2)])
}
Then, I want to use the selected character to replace a position in a given string.
subchar <- function(string, pos, new) {
    paste(substr(string, 1, pos-1), new, substr(string, pos+1, nchar(string)), sep='')
}
My question is - if I do, for example
> subchar("KLMN", 3, samp(4))
[1] "KL1N"
But I want it to read "KLCN". as.character(samp(4)) doesn't work either. How do I get it to print out the character instead of the index?
The problem arises because your letters are stored as factors rather than characters, and samp is returning a data.frame.
C is the first level in your factor so that is stored as 1 internally, and as.character (which gets invoked by the paste statement) pulls this out when working on the mini-data.frame:
samp(4)
V1
4 C
as.character(samp(4))
[1] "1"
You can solve this in 2 ways, either dropping the data.frame of the samp output in your call to subchar, or modifying samp to do so:
subchar("KLMN", 3, samp(4)[,1])
[1] "KLCN"
samp2 <- function(row) {
    sample(try[row, seq(1, length(try), 2)], 1, prob = try[row, seq(2, length(try), 2)])[,1]
}
subchar("KLMN", 3, samp2(4))
[1] "KLCN"
You may also find it easier to sample within your subsetting, and you can drop the data.frame from there:
samp3 <- function(row) {
    try[row, sample(seq(1, length(try), 2), 1, prob = try[row, seq(2, length(try), 2)]), drop=TRUE]
}
