Spill ranges: casting arrays to intersection value with # - excel

Before Excel introduced spill ranges, before the “#” operator, one could ‘cast’ a range into a single value with “0+” (numeric values) or “""&” (strings). But “#” isn’t quite the same.
Assume that there is a column of positive integers heading south from B4; and there is a row of positive integers heading east from D2; and that columns A and C and rows 1 and 3 are completely empty.
The object is to put into D4 a single spill formula, referring to something like $B$4# and $D$2#, that, when the column integer is bigger than the row integer, calculates the pairwise Greatest Common Divisor of the two integers. Each of the desired spill cells is to be a pairwise GCD of just two integers.
So a candidate formula is:
= IF($B$4#>$D$2#, #GCD(#$B$4#, #$D$2#), "·")
Alas, GCD sees two array parameters, rather than two values cast/intersected from the two arrays, and so calculates the GCD of all of these integers, inevitably returning 1. Sigh.
Indeed, the next few don’t even spill.
= #IF(#$B$4#>#$D$2#, GCD(#$B$4#, #$D$2#), "·")
= IF(#$B$4#>#$D$2#, #GCD(#$B$4#, #$D$2#), "·")
= #IF($B$4#>$D$2#, #GCD(#$B$4#, #$D$2#), "·")
= GCD($B$4#, $D$2#)
= GCD(#$B$4#, #$D$2#)
Suggestions please.
(Mac Excel 16.32 (19120802) — which hopefully is irrelevant.)
Thank you.

It will be a while before this is widely available, but this can be done with a recursive LAMBDA.
Set the name
gcdArray =LAMBDA(vData,hData,vIndex,hIndex,
LET(vSq,SEQUENCE(COUNT(vData)),
hSq,SEQUENCE(1,COUNT(hData)),
g, GCD(INDEX(vData,vIndex),INDEX(hData,hIndex)),
vFrame, IF(vIndex < COUNT(vData), IF(hIndex=1, gcdArray(vData, hData, vIndex+1, hIndex),""),""),
hFrame, IF(hIndex < COUNT(hData), gcdArray(vData, hData, vIndex, hIndex+1),""),
IF(vIndex=vSq,IF(hIndex=hSq,g,hFrame),vFrame)))
Then use =gcdArray(B4#, D2#, 1, 1)
This can be modified to evaluate similar 2D functions in general. Set the names
eval2Drecur =LAMBDA(func,vData,hData,vIndex,hIndex,
LET(vSq,SEQUENCE(COUNT(vData)),
hSq,SEQUENCE(1,COUNT(hData)),
y, func(INDEX(vData,vIndex),INDEX(hData,hIndex)),
vFrame, IF(vIndex < COUNT(vData),IF(hIndex = 1, eval2Drecur(func, vData, hData, vIndex+1, hIndex),""),""),
hFrame, IF(hIndex < COUNT(hData), eval2Drecur(func, vData, hData, vIndex, hIndex+1),""),
IF(vIndex=vSq,IF(hIndex=hSq,y,hFrame),vFrame)))
nameGCD =LAMBDA(x, y, GCD(x,y))
Then call =eval2Drecur(nameGCD, B4#, D2#, 1, 1)
Update for MAKEARRAY function
MAKEARRAY is still in beta. When it's released, it will simplify the answer.
=MAKEARRAY(ROWS(B4#),COLUMNS(D2#),LAMBDA(a, b,
LET(c,INDEX(B4#,a),
d,INDEX(D2#,b),
IF(c>d, GCD(c, d), "·"))))

Related

Is there a faster way to calculate the distance between elements in the same matrix with a Gaussian function?

Starting from a matrix M of shape 7000 x 2, I calculate the following quantity for every pair (i, j):
W[i,j] = exp(-||M[i,:] - M[j,:]||^2 / (2*sigma^2)) / sum_k exp(-||M[i,:] - M[k,:]||^2 / (2*sigma^2))
I do it in the following way (the variance sigma is arbitrary):
import math
import numpy as np

# M is the 7000 x 2 data matrix and sigma is chosen beforehand.
W = np.zeros((M.shape[0], M.shape[0]))
elements_sum_by_i = np.zeros((M.shape[0]))
for i in range(0, M.shape[0]):
    # normalization
    for k in range(0, M.shape[0]):
        elements_sum_by_i[k] = math.exp(-(np.linalg.norm(M[i,:] - M[k,:])**2)/(2*sigma**2))
    sum_by_i = sum(elements_sum_by_i)
    # calculation
    for j in range(0, M.shape[0]):
        W[i,j] = (math.exp(-(np.linalg.norm(M[i,:] - M[j,:]))**2/(2*sigma**2)))/(sum_by_i)
The problem is that it is really very slow (takes about 30 minutes). Is there a faster way to do this calculation?
Maybe you can extract some ideas from the following comments:
1) Calculate Log(W[i,j]); with the simplifications of the formula the exponentials disappear, so the processing should be quicker.
2) Then take the exponential of it: Exp(Log(W[i,j])) == W[i,j]
3) Use variables for values that are constant inside the iterations, like sigma = 2*sigma**2, which you can compute once at the start, outside of the loops.
Important: before any change, save the current result so that your new implementation can be tested against the final matrix that you already know (I assume) is correct.
Good luck.
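For what it's worth, here is a fully vectorized NumPy sketch of the same computation (my own addition, not from the comments above; gaussian_affinity is just an illustrative name). It computes all squared pairwise distances with broadcasting, evaluates the 2*sigma**2 constant once, and avoids Python loops entirely:

import numpy as np

def gaussian_affinity(M, sigma):
    # Pairwise squared Euclidean distances via broadcasting:
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    sq_norms = np.sum(M**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (M @ M.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negative values
    K = np.exp(-sq_dists / (2.0 * sigma**2))
    # Normalize each row by its sum, as in the original double loop.
    return K / K.sum(axis=1, keepdims=True)

W = gaussian_affinity(M, sigma)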

Search and remove algorithm

Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g 25) are unknown. So the algorithm must obtain this by looking at the values. In addition, the number of values, and where the outliers are in the array are not guaranteed. There may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis. (green-line) There are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
arr = [0,25,50,60,75,100]
First construct the distances array
dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array into 3 parts: the main part, the upper values, and the lower values. I arbitrarily set the cutoffs at 5%.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now have the indexes where the spacing differs from the others.
So dist[2] has a problem, which means by construction the problem is between arr[2] and arr[2+1].
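As a hedged completion of this answer (my own addition, assuming each flagged gap corresponds to a single inserted point): the suspicious gap starts at dist index 2, so the inserted outlier is the element right after it, which np.delete can drop:

import numpy as np

# cond_inf[0] holds the indices of the flagged gaps; the outlier sits
# one position to the right of each flagged gap.
cleaned = np.delete(arr, cond_inf[0] + 1)
print(cleaned)  # the 60 is gone: [0 25 50 75 100]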
I don't know if you want to remove 1 or more numbers from this array. So I think the way to solve this problem would be like this (a Python sketch of these steps follows below):
array A[] = [0,25,50,60,75,100];
sort array (if needed).
create a new array B[] whose i-th value is B[i] = A[i+1] - A[i]
find the value that appears most often among the elements of B[]; that will be our distance.
find i such that A[i+1]-A[i] != distance
find k (k > i, k minimal) such that A[i+k]-A[i] == distance
then we need to remove A[i+1] .. A[i+k-1]
I hope it is right.
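Here is a minimal Python sketch of those steps (my own reading of the pseudocode above; it assumes the first element already lies on the regular grid):

from collections import Counter

def remove_outliers(a):
    # Steps from the answer above: sort, build the gap array, take the most
    # common gap as the spacing, then drop points that break that spacing.
    a = sorted(a)
    gaps = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    spacing = Counter(gaps).most_common(1)[0][0]
    result = []
    i = 0
    while i < len(a):
        result.append(a[i])
        j = i + 1
        # Skip ahead until the next point exactly one spacing away from a[i];
        # everything in between is treated as an outlier.
        while j < len(a) and a[j] - a[i] != spacing:
            j += 1
        if j == len(a):
            break  # no further point matches the spacing; stop here
        i = j
    return result

print(remove_outliers([0, 25, 50, 60, 75, 100]))  # [0, 25, 50, 75, 100]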

Simultaneous Subset sums

I am dealing with a problem which is a variant of a subset-sum problem, and I am hoping that the additional constraint could make it easier to solve than the classical subset-sum problem. I have searched for a problem with this constraint but I have been unable to find a good example with an appropriate algorithm either on StackOverflow or through googling elsewhere.
The problem:
Assume you have two lists of positive numbers A1,A2,A3... and B1,B2,B3... with the same number of elements N. There are two sums Sa and Sb. The problem is to find the simultaneous set Q where |sum (A{Q}) - Sa| <= epsilon and |sum (B{Q}) - Sb| <= epsilon. So, if Q is {1, 5, 7} then A1 + A5 + A7 - Sa <= epsilon and B1 + B5 + B7 - Sb <= epsilon. Epsilon is an arbitrarily small positive constant.
Now, I could solve this as two completely separate subset sum problems, but removing the simultaneity constraint results in the possibility of erroneous solutions (where Qa != Qb). I also suspect that the additional constraint should make this problem easier than the two NP-complete problems. I would like to solve an instance with 18+ elements in both lists of numbers, and most subset-sum algorithms have a long runtime with this number of elements. I have investigated the pseudo-polynomial-time dynamic programming algorithm, but this has the problems that a) its speed relies on a small bit-depth of the numbers in the list (which does not necessarily apply to my instance) and b) it does not take into account the simultaneity constraint.
Any advice on how to use the simultaneity constraint to reduce the run time? Is there a dynamic programming approach I could use to take into account this constraint?
If I understand your description of the problem correctly (I'm confused about why you have the absolute value bars around "sum (A{Q}) - Sa" and "sum (B{Q}) - Sb"; it doesn't seem to fit the rest of the explanation), then it is NP-hard.
You can see this by making a reduction from Subset sum (SUB) to Simultaneous subset sum (SIMSUB).
If you have a SUB problem consisting of a set X = {x1,x2,...,xn} and a target called t, and you have an algorithm that solves SIMSUB when given two sets A = {a1,a2,...,an} and B = {b1,b2,...,bn}, two integers Sa and Sb, and a value for epsilon, then we can solve SUB like this:
Let A = X and let B be a set of length n consisting of only 0's. Set Sa = t, Sb = 0 and epsilon = 0. You can now run the SIMSUB algorithm on this problem and get the solution to your SUB problem.
This shows that SIMSUB is at least as hard as SUB and is therefore NP-hard.
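A tiny sketch of that construction (my own illustration; solve_simsub stands for any solver of the simultaneous problem):

def subset_sum_via_simsub(X, t, solve_simsub):
    # Reduction described above: A = X, B = all zeros, Sa = t, Sb = 0, epsilon = 0.
    A = list(X)
    B = [0] * len(X)
    return solve_simsub(A, B, Sa=t, Sb=0, epsilon=0)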

How can I easily show that the string index order when calculating Levenshtein distance doesn't matter for strings of the same length?

When working on my Levenshtein distance implementation I stumbled upon the fact that my indexes were swapped, as shown in this pseudocode (note the s1[j] == s2[i] instead of s1[i] == s2[j]).
L(i, j) = min(L(i - 1, j) + 1,
L(i, j - 1) + 1,
L(i - 1, j - 1) + (s1[j] == s2[i] ? 0 : 1))
But because my implementation calculates the matrix as a sequence of rectangular submatrices, it doesn't seem to affect the computation at all, and always yields the correct result whether or not the indexes are swapped. (Or, for simplicity, just think of the strings as having the same length.)
Now my question is, how can I prove (not necessarily in a formal way) that the index order doesn't matter for equal-length strings? It seems that because this is the only place that affects the matrix, and because it ends up being symmetrical, swapping the indexes would just transpose the matrix, but I'm not sure whether I'm missing something important.
As you pointed out, this will only work if the two strings are of equal lengths.
But given a more formal definition of the Levenshtein distance, the only thing actually referring to the content of the strings is the function r(x, y). The rest is only concerned with the lengths of the strings, which in this case are the same. So the effect of using s1[j] == s2[i] instead of s1[i] == s2[j] is the same as swapping the two input parameters s1 and s2.
Note: MSD = minimum sum of distances
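A quick empirical check of that claim (my own sketch, not part of the original answer): the lev function below optionally swaps only the character comparison, and for equal-length inputs the swapped version gives the same result because it is effectively computing lev(s2, s1), and the distance is symmetric.

def lev(s1, s2, swapped=False):
    # Standard DP over an (len(s1)+1) x (len(s2)+1) table.
    n, m = len(s1), len(s2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        L[i][0] = i
    for j in range(m + 1):
        L[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # swapped=True uses s1[j-1] == s2[i-1], as in the question.
            same = (s1[j - 1] == s2[i - 1]) if swapped else (s1[i - 1] == s2[j - 1])
            L[i][j] = min(L[i - 1][j] + 1,
                          L[i][j - 1] + 1,
                          L[i - 1][j - 1] + (0 if same else 1))
    return L[n][m]

a, b = "kitten", "mitten"                    # equal lengths
print(lev(a, b), lev(a, b, swapped=True))    # both print 1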

Converting N strings to a common target string in maximum of K edits

I have a set of strings [S1, S2, S3, ..., Sn] and I need to count all target strings T such that each of S1, S2, ..., Sn can be converted into T within a total of K edits. All the strings are of fixed length L, and an edit here is a single-character substitution (the distance is Hamming distance).
All I have is a sort of brute-force approach.
So, if my alphabet size is 4, I have a sample space of O(4^L), and it takes O(L) time to check each candidate. I can't seem to bring the complexity down from exponential to something polynomial or pseudo-polynomial! Is there any way to prune down the sample space to do better?
I tried to visualize it as an L-dimensional vector space: I've been given N points and have to count all the points whose sum of distances from the given N points is less than or equal to K, i.e. d1 + d2 + d3 + ... + dN <= K.
Is there any known geometric algorithm which solves this or a similar problem with better complexity? Kindly point me in the right direction; any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings; you just need to know how many targets are possible with the remaining edit budget, considering only the string positions up to a given index.
from functools import lru_cache

alphabet = 'abcd'
s = ['aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']

# The original answer used the memoized decorator from
# http://wiki.python.org/moin/PythonDecoratorLibrary;
# functools.lru_cache provides the same memoization.
@lru_cache(maxsize=None)
def count(edits_left, index):
    # Base cases: all positions decided, or edit budget exceeded.
    if index == -1 and edits_left >= 0:
        return 1
    if edits_left < 0:
        return 0
    ret = 0
    for char in alphabet:
        # Edits spent at this position if the target uses `char` here.
        edits_used = 0
        for mutate_str in s:
            if mutate_str[index] != char:
                edits_used += 1
        ret += count(edits_left - edits_used, index - 1)
    return ret
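For example, to count the targets reachable from every string in s within a total of 10 edits (an illustrative call, not from the original answer):

print(count(10, len(s[0]) - 1))  # start at the last position with the full edit budget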
Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general, for a string S of length L, there are C(L,K) (binomial coefficient) ways to choose the positions to substitute, and (ALPHABET_SIZE - 1) possible new symbols at each of them, so there are C(L,K)*(ALPHABET_SIZE-1)^K target strings T at a Hamming distance of exactly K.
Binomial coefficients can be computed quite easily using Dynamic Programming and the Pascal Triangle... No need to get crazy with factorials etc...
Now that the one-string case is treated, dealing with multiple strings is a little bit more tricky, since you might double count targets. Intuitively though, if S1 is at distance K from S2, then both strings will generate the same set of targets, so you don't double count in this case. This last statement might be a long shot, that's why I made sure to say "intuitively" :)
Hope it helps,
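A small sketch of the single-string count described above, with the binomial coefficient built row by row from Pascal's triangle as suggested (single_string_targets is just an illustrative name):

def binomial(n, k):
    # Pascal's triangle, built one row at a time.
    row = [1]
    for _ in range(n):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row[k] if 0 <= k <= n else 0

def single_string_targets(L, K, alphabet_size):
    # Targets at Hamming distance exactly K from one string of length L:
    # choose the K positions, then one of (alphabet_size - 1) new symbols each.
    return binomial(L, K) * (alphabet_size - 1) ** K

print(single_string_targets(6, 2, 4))  # 15 * 9 = 135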
