Partition problem variation - fixed size of subsets - dynamic-programming

I have a problem which is a variation of the partition problem, which is NP-complete. This is an optimization problem, not a decision problem.
Problem: Partition a list of numbers into two subsets such that the difference of their sums is minimum, and find the two subsets. If n is even, then both subsets should have size n/2; if n is odd, then their sizes should be floor(n/2) and ceil(n/2).
Assuming that the pseudo-polynomial-time DP algorithm is the best way to get an exact solution, how can it be modified to solve this? And what would be the best approximation algorithms for it?

Since you didn't specify which algorithm to use, I'll assume you mean the one defined here:
http://www.cs.cornell.edu/~wdtseng/icpc/notes/dp3.pdf
Using that algorithm, you add a variable to track the best result, initialize it to N (the sum of all the numbers in the list, since you can always take one subset to be the empty set), and every time you set an entry of T (e.g. T[i] = true) you update it with something like bestRes = abs(i - N/2) < bestRes ? abs(i - N/2) : bestRes. At the end you return bestRes. This of course doesn't change the complexity of the algorithm.
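For concreteness, here is a small Python sketch of that idea, extended with a second DP dimension for the subset size so that only subsets of size floor(n/2) are accepted. This is my own illustration (names included), not the code from the linked notes; it stays pseudo-polynomial, roughly O(n^2 * sum) time.

# reachable[k][s] is True if some subset of exactly k numbers sums to s
def min_diff_equal_split(nums):
    n, total = len(nums), sum(nums)
    half = n // 2
    reachable = [[False] * (total + 1) for _ in range(half + 1)]
    reachable[0][0] = True
    for x in nums:
        # iterate k and s backwards so each number is used at most once
        for k in range(half, 0, -1):
            for s in range(total, x - 1, -1):
                if reachable[k - 1][s - x]:
                    reachable[k][s] = True
    # the other subset sums to total - s, so the difference is |total - 2s|
    return min(abs(total - 2 * s) for s in range(total + 1) if reachable[half][s])

print(min_diff_equal_split([3, 1, 4, 2, 2]))   # 0: {4, 2} and {3, 1, 2} both sum to 6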
I've got no idea about your 2nd question.

Related

General Question about Dynamic Programming

So I saw a video about the knapsack problem, which can be solved recursively as well as with dynamic programming. The gist I got is that dynamic programming is nothing more than a dictionary, a list, or, collectively, a record of things we have already computed so that we don't have to compute them again.
Is that what dynamic programming is all about? Performing record keeping and using the records when necessary?
In simple words, we solve a small problem (called a subproblem) and then use it to solve bigger problems.
To achieve this, we keep a record of what we have computed so far, which can in turn be reused next time rather than computed all over again.
We think of a dynamic programming approach to a problem if it has
overlapping subproblems
optimal substructure
In very simple words, dynamic programming has two faces: the top-down and the bottom-up approach.
In the top-down approach, we write a recursive (brute-force) solution and memoize its results, so that when the same subproblem arrives again we reuse the stored result; it is brute force + memoization.
In the bottom-up approach, we start from base cases or very small subproblems whose solutions we already know, and build the solution to the larger problem by filling a dynamic programming table over every possible combination of subproblem parameters, which again follows the brute-force template.
Coming up with a mathematical relation (recurrence) for the problem and identifying the above two properties is the challenging part.
overlapping subproblems
Informally, when a problem needs the same subproblem to be solved more than once, we say it has overlapping subproblems.
optimal substructure
Informally, to solve a problem of size n you divide it into subproblems of size n'. So there are two stages: one is the problem of size n and the other is the subproblems of size n'. Assume you know the optimal solutions for size n', and you somehow combine these subproblem solutions to get a solution for size n. If the combined solution is the same as the actual optimal solution for the problem of size n, then you can safely say that the problem has optimal substructure.
Let's take a simple example of finding nth Fibonacci number, to understand the two properties well.
The usual mathematical recursive relation would be
F(n) = F(n-1) + F(n-2)
Let's try to figure out the two properties for this example.
Informally, it's always easier to pick a concrete value of n to understand what's going on.
Let n be 3; then
F(3) = F(2) + F(1)
We know the optimal solutions for the F(0) = 0 and F(1) = 1 as base cases.
overlapping subproblems
            F(3)
           /    \
       F(2)      F(1)
       /  \
    F(1)  F(0)
From the above recursion tree you can easily see that F(1) has to be computed more than once (and for larger n, many more subproblems repeat in the same way). So it has overlapping subproblems.
Optimal Substructure
We know the Fibonacci sequence as 0 1 1 2 ...
Let's consider a subtree as
       F(2)
       /  \
    F(1)  F(0)
Combining the optimal solutions of the subproblems with an addition operation gives:
F(2) = F(1) + F(0)
F(2) = 1 + 0
F(2) = 1
Combining the subproblem solutions has given us the actual optimal solution to the problem for n = 2, which can be confirmed from the known Fibonacci sequence. So this problem also has optimal substructure.
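To tie this together with the top-down and bottom-up approaches described above, here is a minimal Python sketch of both for F(n); the function names are just illustrative.

from functools import lru_cache

# Top-down: the plain recursion plus memoization (brute force + a record of results).
@lru_cache(maxsize=None)
def fib_top_down(n):
    if n < 2:                       # base cases F(0) = 0, F(1) = 1
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

# Bottom-up: fill a table from the base cases upward.
def fib_bottom_up(n):
    table = [0, 1] + [0] * (n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_top_down(10), fib_bottom_up(10))   # 55 55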
From Wikipedia: "Dynamic Programming is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions using a memory-based data structure (array, map, etc.)."
Like with recursive algorithms, the key is breaking down the problem in smaller sub-problems, using efficient data-structures to help you in the task.
So, in a nutshell, it is exactly about efficient record keeping (+ sorting algorithms + smart data structures).

Quickest way to find closest set of points

I have three arrays of points:
A=[[5,2],[1,0],[5,1]]
B=[[3,3],[5,3],[1,1]]
C=[[4,2],[9,0],[0,0]]
I need the most efficient way to find the three points (one for each array) that are closest to each other (within one pixel in each axis).
What I'm doing right now is taking one point as a reference, let's say A[0], and cycling through all the B and C points looking for a solution. If A[0] gives me no result, I move the reference to A[1] and so on. This approach has a huge problem: if I increase the number of points in each array and/or the number of arrays, it sometimes takes far too long to converge, especially if the solution lies among the last members of the arrays. So I'm wondering if there is any way to do this without using a reference, or any quicker way than just looping over all the elements.
The rules that I must follow are the following:
the final solution has to be made by only one element from each array like: S=[A[n],B[m],C[j]]
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so |Xi-Xj|<=1 and |Yi-Yj|<=1 for each pair of members of the solution).
For example in this simplified case the solution would be: S=[A[1],B[2],C[1]]
To clarify the problem further: what I wrote above is just a simplified example to explain what I need. In my real case I don't know a priori the length of the lists nor the number of lists I have to work with; it could be A,B,C, or A,B,C,D,E... (each with a different number of points), etc. So I also need a way to make it as general as possible.
This requirement:
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so |Xi-Xj|<=1 and |Yi-Yj|<=1 for each pair of members of the solution).
massively simplifies the problem, because it means that for any given (xi, yi), there are only nine possible choices of (xj, yj).
So I think the best approach is as follows:
Copy B and C into sets of tuples.
Iterate over A. For each point (xi, yi):
Iterate over the values of x from xi−1 to xi+1 and the values of y from yi−1 to yi+1. For each resulting point (xj, yj):
Check if (xj, yj) is in B. If so:
Iterate over the values of x from max(xi, xj)−1 to min(xi, xj)+1 and the values of y from max(yi, yj)−1 to min(yi, yj)+1. For each resulting point (xk, yk):
Check if (xk, yk) is in C. If so, we're done!
If we get to the end without having a match, that means there isn't one.
This requires roughly O(len(A) + len(B) + len(C)) time and O(len(B) + len(C)) extra space.
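Here is a minimal Python sketch of the three-list version of this approach, assuming integer pixel coordinates (so "within one pixel" means the coordinates differ by at most 1). The function name and the final print are just illustrative.

def find_triple(A, B, C):
    B_set = {tuple(p) for p in B}
    C_set = {tuple(p) for p in C}
    for xi, yi in A:
        # only the nine points around (xi, yi) can be the B member
        for xj in range(xi - 1, xi + 2):
            for yj in range(yi - 1, yi + 2):
                if (xj, yj) in B_set:
                    # the C member must be within 1 of both (xi, yi) and (xj, yj)
                    for xk in range(max(xi, xj) - 1, min(xi, xj) + 2):
                        for yk in range(max(yi, yj) - 1, min(yi, yj) + 2):
                            if (xk, yk) in C_set:
                                return [(xi, yi), (xj, yj), (xk, yk)]
    return None   # no valid triple exists

A = [[5, 2], [1, 0], [5, 1]]
B = [[3, 3], [5, 3], [1, 1]]
C = [[4, 2], [9, 0], [0, 0]]
print(find_triple(A, B, C))   # [(5, 2), (5, 3), (4, 2)]: every pair differs by at most 1 in x and y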
Edited to add (due to a follow-up question in the comments): if you have N lists instead of just 3, then instead of nesting N loops deep (which gives time exponential in N), you'll want to do something more like this:
Copy B, C, etc., into sets of tuples, as above.
Iterate over A. For each point (xi, yi):
Create a set containing (xi, yi) and its eight neighbors.
For each of the lists B, C, etc.:
For each element in the set of nine points, see if it's in the current list.
Update the set to remove any points that aren't in the current list and don't have any neighbors in the current list.
If the set still has at least one element, then — great, each list contained a point that's within one pixel of that element (with all of those points also being within one pixel of each other). So, we're done!
If we get to the end without having a match, that means there isn't one.
which is much more complicated to implement, but is linear in N instead of exponential in N.
Currently, you are finding the solution with a brute-force algorithm which has O(n²) complexity. If your lists contain 1000 items, your algorithm will need 1000000 iterations to run... (It's even O(n³), as tobias_k pointed out.)
As you can see here: https://en.wikipedia.org/wiki/Closest_pair_of_points_problem, you could improve this by using a divide-and-conquer algorithm, which runs in O(n log n) time.
You should search for Delaunay triangulation and/or Voronoi diagram implementations.
NB: if you can use external libs, you should also consider taking a look at the scipy lib: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.Delaunay.html
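For illustration, here is one way to use scipy.spatial for this kind of neighbour query. It uses a cKDTree rather than the Delaunay triangulation linked above, but the principle is the same: index the points once, then query neighbourhoods instead of looping over all pairs (the arrays are the ones from the question).

import numpy as np
from scipy.spatial import cKDTree

A = np.array([[5, 2], [1, 0], [5, 1]])
B = np.array([[3, 3], [5, 3], [1, 1]])

tree = cKDTree(B)
# indices of B-points within Chebyshev distance 1 of each A-point
# (p=inf gives exactly the "within one pixel in each axis" metric)
matches = tree.query_ball_point(A, r=1, p=np.inf)
print([sorted(m) for m in matches])   # [[1], [2], []]: B[1] is near A[0], B[2] is near A[1]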

SVD and singular / non-singular matrices

I need to use the SVD form of a matrix to extract concepts from a series of documents. My matrix is of the form A = [d1, d2, d3 ... dN] where di is a binary vector of M components. Then the svd decomposition gives me svd(A) = U x S x V' with S containing the singular values.
I use SVDLIBC to do the processing in nodejs (using a small module I wrote to wrap it). It all seemed to work well, but I noticed something quite weird in the running-time behavior depending on the state of my matrix (N and M are growing, but both are already above 1000).
At first I didn't think about identical document vectors, but after some tests it looks like adding a document twice sometimes speeds up the processing extraordinarily.
Do I have to make sure that the columns of A are pairwise independent? Are they required to be all linearly independent? (I thought not, since SVD seems to do its job well even with some columns being exactly the same; it will simply show in the resulting decomposition which columns / rows are useless by having 0 components in U or V.)
Now that it sometimes takes way too much time to compute the SVD of my big matrix, I was trying to reduce its size by removing identical columns, but I found out that actually adding dummy duplicate vectors can make it way faster. Is that normal? What's happening?
Logically, I'd say that I want my matrix to contain as much information as possible, and thus
[A] Remove all identical columns, and in the best case, maybe
[B] Remove linearly dependent columns.
Doing [A] seems pretty simple and not too computationally expensive: I could hash my vectors at construction time to find candidates that might be identical, and then spend time checking only those. But are there good computational techniques for [A] and [B]?
(For [A], I'd prefer not to have to check equality of each new vector against all the previous vectors the brute-force way, and as for [B], I don't know any good way to check or do it.)
Added related question: about my second question, why would SVD's running time behavior change so massively by just adding one similar column? Is that a normal possible behavior, or does it mean I should look for a bug in SVDLIBC?
It is difficult to say where the problem is without samples of fast and slow input matrices. But, since one of the primary uses of the SVD is to provide a rotation that eliminates covariance, redundant (or the same) columns should not cause problems.
To answer your question about whether the slow behavior is a bug in the library you're using, I'd suggest trying to retrieve the SVD of the same matrix using another tool. For example, in Octave, retrieve an SVD of your matrix to compare runtimes:
[U, S, V] = svd(A)
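Or the same cross-check with numpy, if Python is handier than Octave for you; the random binary matrix here is just a stand-in for your real term-document matrix.

import numpy as np

A = np.random.randint(0, 2, size=(1500, 1200)).astype(float)   # stand-in binary matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S[:5])   # the largest singular values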

bin packing with overlapping objects

I have some bins with different capacities and some objects with specified sizes. The goal is to pack these objects into the bins. Up to this point it is just the bin-packing problem. But the twist is that each object has a partial overlap with the others. So while objects 1 and 2 have sizes s1 and s2, when I put them in the same bin the filled space is less than s1+s2. Supposing that I know this overlap value for each pair of objects, is there any approximation algorithm for this problem, like the ones for the original bin-packing problem?
The answer is to use a kind of tree that captures the similarity of the objects, assuming that objects can be broken. Then run a greedy algorithm to fill the bins according to the tree. This algorithm has a 3x approximation bound. However, there may be better answers.
This method is presented in Michael Sindelar, Ramesh K. Sitaraman, Prashant J. Shenoy: Sharing-aware algorithms for virtual machine colocation. SPAA 2011: 367-378.
I got this answer from this thread but just wanted to close this question by giving the answer.
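The paper's tree-based algorithm needs the similarity structure it defines, but just to make the setting concrete, here is a naive first-fit-decreasing-style greedy sketch in Python. It is not the paper's method and carries no approximation guarantee; it also assumes the space used by a bin is the sum of its objects' sizes minus their pairwise overlaps, which is itself only an approximation of real overlapping content.

def greedy_pack(sizes, overlap, capacities):
    # sizes[i]: size of object i; overlap[(i, j)] with i < j: overlap of objects i and j;
    # capacities[b]: capacity of bin b. Returns the bin index chosen for each object (or None).
    bins = [[] for _ in capacities]        # objects placed in each bin
    used = [0.0] * len(capacities)         # space currently used in each bin
    assignment = [None] * len(sizes)
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):   # larger objects first
        for b, cap in enumerate(capacities):
            saved = sum(overlap.get((min(i, j), max(i, j)), 0) for j in bins[b])
            extra = sizes[i] - saved       # incremental space needed, given what is already in bin b
            if used[b] + extra <= cap:
                bins[b].append(i)
                used[b] += extra
                assignment[i] = b
                break
    return assignment

sizes = [4, 3, 2]
overlap = {(0, 1): 2, (0, 2): 1, (1, 2): 0}    # made-up pairwise overlap values
print(greedy_pack(sizes, overlap, capacities=[5, 5]))   # [0, 0, 1]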
The only algorithm I think will work is to prune the items that don't fit into the bins and use another bin for them. I don't mean the first-fit algorithm, but rather waiting for a period of time and then using new bins for those items. In practice, can you just use another bin? It's a practical approach. I mean you can grow the bin to the left or to the right, as in this example: http://codeincomplete.com/posts/2011/5/7/bin_packing/.

Speed up text comparisons (with sparse matrices)

I have a function which takes two strings and gives out the cosine similarity value which shows the relationship between both texts.
If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.
Is there a way to reduce this number of comparisons? For example sparse matrices or k-means?
I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.
What Ben says is true: to get better help, you need to tell us what the goal is.
For example, one possible optimization if you want to find similar strings is storing the string vectors in a spatial data structure such as a quadtree, where you can outright discard the vectors that are too far away from each other, avoiding many comparisons.
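For example, here is a sketch of that idea with a k-d tree instead of a quadtree (same principle, but not limited to 2-D). It relies on the fact that for unit-length vectors ||u - v||^2 = 2 - 2*cos(u, v), so a cosine-similarity threshold becomes a Euclidean radius and the tree can discard far-away texts without comparing them. The vectors here are random stand-ins for your real text vectors, and note that with very high-dimensional sparse vectors the pruning gets much weaker.

import numpy as np
from scipy.spatial import cKDTree

vectors = np.random.rand(75, 50)                        # stand-in for your 75 text vectors
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

min_cosine = 0.8
radius = np.sqrt(2 - 2 * min_cosine)                    # Euclidean radius equivalent to the threshold

tree = cKDTree(vectors)
pairs = tree.query_pairs(r=radius)                      # only the pairs that can be similar enough
print(len(pairs), "candidate pairs instead of", 75 * 74 // 2)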
If your algorithm is pair-wise, then you probably can't reduce the number of comparisons, by definition.
You'll need to use a different algorithm, or at the very least pre-process your input if you want to reduce the number of comparisons.
Without the details of your function, it's difficult to give any concrete help.
