Sum of elementwise product of subsets' min of two arrays - dynamic-programming

As the title suggests, writing the question out explicitly could create some confusion, so I will start with an example.
Let's say we have two arrays: a = [1,3,2] and b = [5,4,6]. What I would like to compute, in a brute-force way, is first the minimum of each non-empty subset:
aa = [1, 3, 2, min(1,3), min(1,2), min(3,2), min(1,3,2)]
bb = [5, 4, 6, min(5,4), min(5,6), min(4,6), min(5,4,6)]
and then finally the sum of the elementwise products of aa and bb:
1*5 + 3*4 + 2*6 + min(1,3)*min(5,4) + min(1,2) * min(5,6) + min(3,2)*min(4,6) + min(1,3,2)*min(5,4,6)
Obviously, the brute-force calculation is not efficient: the memory and time cost grow exponentially with the number of elements in the arrays.
To give a bit more context, the arrays can contain real numbers (positive and negative), and each element in the first array corresponds to the element with the same index in the second array.
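For reference, here is a minimal brute-force sketch in Python of the quantity defined above (brute_force is just an illustrative name); for the example arrays it prints 50.0:

from itertools import combinations

def brute_force(a, b):
    # Sum, over all non-empty index subsets S, of min(a[i] for i in S) * min(b[i] for i in S).
    # Exponential in len(a); only meant to pin down the definition.
    n = len(a)
    total = 0.0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += min(a[i] for i in idx) * min(b[i] for i in idx)
    return total

print(brute_force([1, 3, 2], [5, 4, 6]))  # 50.0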
I have already seen some similar problems, like:
Sum of OR of smallest and largest elements of each subset of a set
Sum of products of elements of all subarrays of length k
But those are mainly focused on the sum of the minima of subsets of a single array.

Related

How can I find the missing integers when a given list of integers, each repeatable up to C times, cannot sum to every value up to V

So I am given 3 input values, C, D, and V, where D is a list of integers, C is the number of times any element of D may be repeated in a sum, and V is the target value.
Now it will not always be the case that the sums formed from D cover every value from 1 to V. Therefore, my function has to find the minimum number of integers to add to D so that they do.
In other words: the minimum number of integers to insert into D such that every integer 1 <= x <= V can be represented as a sum of elements of D, where no element is used more than C times.
For example,
D = [1, 5, 10, 25]
C = 2
V = 100
As you can see, the max value I can get from D and C is:
2(1+5+10+25) = 82 which is less than V = 100.
I solved it manually: every value from 1 to 100 can be reached once I add the two new integers 2 and 12 to D.
So my output would be 2 and 12.
The way I saw it was to first cover all the single-digit values. Since I already have 1 and 5 in D, each usable twice (C = 2), the values I cannot reach are 3, 4, 8, and 9, which are all fixed by introducing 2 into D; likewise, for the double-digit values, 12 does the job given C and D.
i.e. 3 = 1+2, 4 = 2+2 or 2+1+1, 8 = 5+2+1, 9 = 5+2+2 or 5+2+1+1
What becomes obvious now is that this can be computationally extremely expensive for large inputs, especially if I brute-force it.
My strategy was this:
Find all possible sums given D and C and put them into a list; let's call it dSums.
Create a list ranging from 1 to V, i.e.
vList = [i for i in range(1, v+1, 1)]
Iterate through dSums and remove every value that appears in dSums from vList.
So now I have a list of all the integers that are missing, but here is the problem.
If D = [1, 5], C=2 and V=10
vList = [3, 4, 8, 9]
I thought I could find the integers that, together with D and C, cover every value in vList, but I am not sure how to traverse or map the lists in such a way that I ultimately arrive at my desired value, which is 2.
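For concreteness, here is a rough sketch of the first half of my strategy (missing_values is just a placeholder name); it finds the reachable sums and the missing values, but not yet the integers to add:

from itertools import product

def missing_values(D, C, V):
    # Enumerate every sum formed by using each element of D between 0 and C times,
    # then report which targets in 1..V are not reachable. Exponential in len(D).
    reachable = set()
    for counts in product(range(C + 1), repeat=len(D)):
        s = sum(c * d for c, d in zip(counts, D))
        if 0 < s <= V:
            reachable.add(s)
    return [v for v in range(1, V + 1) if v not in reachable]

print(missing_values([1, 5], 2, 10))  # [3, 4, 8, 9]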
Any thoughts?

What are handy Haskell concepts to generate numbers of the form 2^m*3^n*5^l [duplicate]

This question already has answers here:
New state of the art in unlimited generation of Hamming sequence
(3 answers)
Closed 10 months ago.
I am trying to generate numbers of the form 2^m*3^n*5^l, where m, n, and l are natural numbers including 0.
The sequence follows: 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30, 32, .....
I am testing it by getting the one millionth number. I implemented it using a list comprehension and sorting, but it takes too long, and I want a faster solution. I have spent days trying to do this to no avail.
I do not want a complete solution. I just want to know what Haskell concepts are necessary in accomplishing it.
Here's an approach that doesn't need any Haskell concepts, just some math and computer science.
Grab a library that offers priority queues.
Initialize a priority queue containing only the number 1.
Loop the following indefinitely: extract the minimum value from the queue. Put it next in the output list. Insert that number times 2, 3, and 5 as three individual entries in the queue. Make sure the queue insert function merges duplicates, because there will be a lot of them thanks to commutativity of multiplication.
If you have a maximum you're working up to, you can use it to prune insertions to the queue as a minor optimization. Alternatively, you could take advantage of actual Haskell properties and just return an infinite list using laziness.
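To make those steps concrete, here is a language-agnostic sketch of the same loop in Python (heapq as the priority queue, with a seen set standing in for duplicate-merging on insert); a Haskell version would use a real priority-queue library as described:

import heapq

def nth_hamming(n):
    # Return the n-th member (1-indexed) of the sequence 2^i * 3^j * 5^k.
    heap, seen = [1], {1}
    for _ in range(n):
        x = heapq.heappop(heap)        # extract the minimum
        for p in (2, 3, 5):
            if x * p not in seen:      # merge duplicates
                seen.add(x * p)
                heapq.heappush(heap, x * p)
    return x

print(nth_hamming(7))  # 8, matching the sequence above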
First, write a function of type Int -> Bool that determines whether a given integer is in the sequence you defined. It would divide the number by 2 as many times as possible (without creating a fraction), then divide it by 3 as many times as possible, and finally divide it by 5 as many times as possible. After all of this, if the number is larger than 1, then it cannot be expressed as a product of twos, threes, and fives, so the function would return False. Otherwise, the number is in your sequence, so the function returns True.
Then take the infinite sequence of integers greater than 0, and use the function above to filter out all numbers that are not in the sequence.
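Sketched in Python rather than Haskell just to show the shape of the idea (in_sequence is a made-up name); note that filtering the naturals like this is simple but far too slow to reach the millionth member:

from itertools import count, islice

def in_sequence(n):
    # Divide out all factors of 2, 3 and 5; n is in the sequence iff nothing else remains.
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1

print(list(islice((n for n in count(1) if in_sequence(n)), 10)))  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12]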
Carl's approach can be improved by inserting fewer elements when removing the minimal element x: since 2 < 3 < 4 < 5 < 6, you can just
append 3*x/2 if x is even but not divisible by 4
append 4*x/3 if x is divisible by 3
append 5*x/4 if x is divisible by 4
append 6*x/5 if x is divisible by 5
In code it looks like this:
g2 x | mod x 4 == 0 = [5 * div x 4]
     | even x       = [3 * div x 2]
     | otherwise    = []
g3 x | mod x 3 == 0 = [4 * div x 3]
     | otherwise    = []
g5 x | mod x 5 == 0 = [6 * div x 5]
     | otherwise    = []
g x = concatMap ($ x) [g2, g3, g5]
So when you remove the minimal element x from the priority queue, you have to insert the elements of g x into the priority queue. On my laptop I get the millionth element after about 8 min, even if I use just a list instead of a proper priority queue, as the list grows to only a bit more than 10000 elements.

What is the difference between [0,1,2,3,4] and [[0],[1],[2],[3],[4]]?

I have a list of years, multiply it by a factor of 2, and get this result:
years = [0,1,2,3,4]
x = [[2*i] for i in years]
result = [[0],[2],[4],[6],[8]]
However, I would like to divide this by the sum of the results, but it seems like that is not possible.
Therefore, what is the difference between [0,1,2,3,4] and [[0],[2],[4],[6],[8]]?
And how can I change the format so it is possible to divide each number in the result list by the sum of the results?
[0, 1, 2, 3, 4] is a list of integers. It's a one-dimensional list and you only need one index to access its elements. For example, years[3] is the integer 3.
[[0],[2],[4],[6],[8]] is a list of lists of integers, so it's a two-dimensional list. You need two indices to access its integer elements. For example, result[3] will give you the list [6], and the zeroth index of that list will give you the integer 6. In other words, result[3][0] gives you the integer 6.
The list comprehension result = [[2*i] for i in years] is what creates the two-dimensional list because you asked it to. You said:
For every i in years,
Calculate 2 * i
Put that into a list [2 * i]
Collect all these lists of [2 * i] in a single list.
If you want a 1-d list, drop the brackets around 2 * i, like so: result = [2 * i for i in years]. This tells Python to:
For every i in years,
Calculate 2 * i
Collect all these 2 * i into a single list
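Putting that together with your original goal of dividing each number by the sum, a minimal sketch:

years = [0, 1, 2, 3, 4]
result = [2 * i for i in years]        # flat list: [0, 2, 4, 6, 8]
total = sum(result)                    # 20
shares = [r / total for r in result]   # [0.0, 0.1, 0.2, 0.3, 0.4]
print(shares)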

Return N Choose K Columns of An NxN array (Choosing N choose K features from a Correlation Matrix)

I have a 40x40 Numpy array like so (here's a 4x4 example):
Correlationarray = np.array([[1, .1, .3, .4],
                             [.1, 1, .2, .7],
                             [.3, .2, 1, .5],
                             [.4, .7, .5, 1]])
I would like to form all possible n choose k subsets of this array's columns and find the ones with the lowest overall correlation values (so I would compute all the subsets, store them in another array/list, and then write a function that squares each value in a given array and sums along the rows).
For example, suppose these were two of the arrays in the 4 choose 2 set of arrays
a = np.array([[1, .1],
              [.1, 1],
              [.3, .2],
              [.4, .7]])
b = np.array([[.1, .4],
              [1, .7],
              [.2, .5],
              [.7, 1]])
I'd like to square each value in the array, sum the values row-wise (and then sum the row scores), save that score for this and every other array generated from the subsets of my correlation matrix, and return the n arrays with the lowest sums of squares.
I'm not sure how to select all the subsets in a computationally quick fashion or how to optimize performance, and I'm not sure how to return the arrays with the lowest correlation scores (I'm familiar with returning elements of an array ordered by value from largest to smallest, but I'm not sure if I can generalize that here).
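For concreteness, here is my brute-force understanding of the computation (lowest_correlation_subsets is just a placeholder name), scoring each k-column subset by the sum of its squared entries; I know enumerating all C(40, k) subsets this way won't scale:

import heapq
from itertools import combinations
import numpy as np

def lowest_correlation_subsets(corr, k, top_n):
    # Score every k-column subset by the sum of its squared entries and
    # return the top_n subsets with the smallest scores.
    scored = []
    for cols in combinations(range(corr.shape[1]), k):
        sub = corr[:, cols]                  # all rows, chosen columns
        scored.append((float(np.sum(sub ** 2)), cols))
    return heapq.nsmallest(top_n, scored)    # [(score, column_indices), ...]

# Using the 4x4 example from above:
corr = np.array([[1, .1, .3, .4],
                 [.1, 1, .2, .7],
                 [.3, .2, 1, .5],
                 [.4, .7, .5, 1]])
print(lowest_correlation_subsets(corr, k=2, top_n=2))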
Thanks!

Explanation of normalized edit distance formula

Based on this paper:
IEEE TRANSACTIONS ON PATTERN ANALYSIS: Computation of Normalized Edit Distance and Applications. In that paper, the normalized edit distance is defined as follows:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).
Can I safely translate the normalized edit distance algorithm explained above as this:
normalized edit distance =
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))
You are probably misunderstanding the metric. There are two issues:
The normalization step divides W(P), the weight of the edit path, by L(P), the length of the edit path, not by the maximum length of the strings as you did;
Also, the paper showed that (Example 3.1) normalized edit distance cannot be simply computed with levenshtein distance. You probably need to implement their algorithm.
An explanation of Example 3.1 (c):
From aaab to abbb, the paper used the following transformations:
match a with a;
skip a in the first string;
skip a in the first string;
skip b in the second string;
skip b in the second string;
match the final bs.
These are 6 operations, which is why L(P) is 6; from the matrix in (a), matching has cost 0 and skipping has cost 2, so we have a total cost of 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) = 1.33. Similar results can be obtained for (b), which I'll leave to you as an exercise :-)
The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".
In figure 2(b), W(P) = 6 because the algorithm does the following steps:
keep first a (cost 0)
convert first b to a (cost 3)
convert second b to a (cost 3)
keep last b (cost 0)
The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).
In figure 2(c), the steps are different:
keep first a (cost 0)
delete first b (cost 2)
delete second b (cost 2)
insert a (cost 2)
insert a (cost 2)
keep last b (cost 0)
In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.
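To tie the two paths together, here is a tiny sketch that just adds up the per-step costs from figure 2(a) (match 0, substitution 3, insert/delete 2):

# Per-step costs of the two editing paths described above.
path_b = [0, 3, 3, 0]        # fig. 2(b): keep, substitute, substitute, keep
path_c = [0, 2, 2, 2, 2, 0]  # fig. 2(c): keep, delete, delete, insert, insert, keep

for name, path in [("(b)", path_b), ("(c)", path_c)]:
    W, L = sum(path), len(path)
    print(name, W, "/", L, "=", round(W / L, 2))  # (b) 6/4 = 1.5, (c) 8/6 = 1.33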
