How do I solve the 0/1 knapsack problem with big weights? - dynamic-programming

I've been trying to find the optimal solution for the 0/1 knapsack problem with big weights and values, like {300: 256, 653: 300, 1500: 500}. From my research, I found that the standard solution relies on a DP technique where the table rows span the capacities [0..W]. So for big weights like 1500, a table of size [0..1500][] would be a total waste of space. Any recommendations on how to avoid such waste? Is it better to use a recursive algorithm for such cases, or is there a way to create the table without allocating this much memory?
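One common way to sidestep the dense [0..W] table is top-down recursion with memoization, which only stores the (item index, remaining capacity) states that are actually reached. Below is a minimal sketch using the items from the question; the capacity W = 2000 is an assumed example value, since the question does not give one.

```python
from functools import lru_cache

# Items from the question, as (weight, value) pairs.
items = [(300, 256), (653, 300), (1500, 500)]
W = 2000  # assumed example capacity; the question does not specify one

@lru_cache(maxsize=None)
def best(i, capacity):
    """Best achievable value using items[i:] with the given remaining capacity."""
    if i == len(items) or capacity == 0:
        return 0
    weight, value = items[i]
    skip = best(i + 1, capacity)                               # don't take item i
    if weight > capacity:
        return skip
    return max(skip, value + best(i + 1, capacity - weight))   # take item i

print(best(0, W))  # -> 756 (the 300 and 1500 items) with these numbers
```

Only the reachable capacities end up in the cache, so sparse weight combinations never force a full row of size W.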

Related

Suggestion for a clustering algorithm?

I have a dataset of 590000 records after preprocessing, and I want to find clusters in it. The data is strings (for now, assume I have only one column with 590000 unique values). I am also using a custom-defined distance measure and need to compute a distance matrix of size 590000 * 590000. Using some partitioning logic I created the distance matrix in pieces, but I cannot merge those partitions into one big distance matrix due to memory constraints. Does anyone have any idea how to resolve this? I picked DBSCAN for the clustering. Is there any way to use deep learning methodologies, or any other ideas?
Use a manageable sample first. I doubt the results will be good enough to warrant any effort on scaling a method that does not work anyway.
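As a concrete starting point, here is a minimal sketch of that suggestion, assuming `records` holds the string data and `my_distance` is the custom measure (both are placeholders for your own names); the eps and min_samples values are arbitrary examples to be tuned on the sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_sample(records, my_distance, sample_size=5000, eps=0.3, min_samples=5):
    """Cluster a random subset of `records` using a custom distance measure."""
    rng = np.random.default_rng(0)
    sample_idx = rng.choice(len(records), size=sample_size, replace=False)
    sample = [records[i] for i in sample_idx]

    # A 5000 x 5000 float64 matrix is only ~200 MB, so a plain double loop is fine.
    n = len(sample)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = my_distance(sample[i], sample[j])

    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(D)
    return sample_idx, labels
```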

How to fit a huge distance matrix into memory?

I have a huge distance matrix of size around 590000 * 590000 (the data type of each element is float16). Will it fit in memory for a clustering algorithm? If not, could anyone give me an idea of how to use it with the DBSCAN clustering algorithm?
590000 * 590000 * 2 bytes (float16 size) = 696.2 GB of RAM
It won't fit in memory on a standard computer. Moreover, float16 values are converted to float32 in order to perform computations (see Python numpy float16 datatype operations, and float8?), so it might use a lot more than 700 GB of RAM.
Why do you have a square matrix? Can't you use a condensed matrix? It would use half the memory needed by a square matrix.
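For illustration only, here is what the condensed form looks like with SciPy's pdist/squareform on toy numeric data (the real data here is strings with a custom metric, so this is just to show the storage difference):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.randn(1000, 16)            # toy numeric data, not the string records
condensed = pdist(X)                     # upper triangle only: 1000*999/2 = 499500 entries
square = squareform(condensed)           # full 1000 x 1000 matrix, twice the storage

print(condensed.nbytes, square.nbytes)   # 3996000 vs 8000000 bytes
```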
Clustering (creating chunks) to decrease the problem size for DBSCAN can be done, for example, by splitting the data into areas with overlapping regions. The size of the overlapping regions has to fit your problem; find a reasonable size for the chunks and for the overlap. Afterwards, stitch the results manually by iterating over the chunks and comparing the clusters found in the overlapping regions: check whether the elements of a cluster in one chunk are also present in other chunks. You may have to apply some stitching parameters, e.g. if a certain number of elements appear in clusters from two different chunks, treat them as the same cluster.
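Here is a rough, illustrative sketch of that stitching step, assuming each chunk has already been clustered on its own (e.g. DBSCAN on a per-chunk distance matrix) and that chunks share points in their overlap regions; the function name, the input layout, and the `min_shared` parameter are all made up for this example, not a library API.

```python
from collections import defaultdict

def stitch_chunk_clusters(chunk_labels, min_shared=1):
    """chunk_labels: one dict per chunk mapping global point id -> local cluster
    label (noise points omitted). Returns a dict: point id -> merged cluster id."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Every (chunk index, local label) pair starts as its own cluster.
    for ci, labels in enumerate(chunk_labels):
        for lab in labels.values():
            parent.setdefault((ci, lab), (ci, lab))

    # Count how many points each pair of clusters from different chunks shares.
    point_clusters = defaultdict(list)
    for ci, labels in enumerate(chunk_labels):
        for pid, lab in labels.items():
            point_clusters[pid].append((ci, lab))

    shared = defaultdict(int)
    for clusters in point_clusters.values():
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                shared[(clusters[i], clusters[j])] += 1

    # Apply the stitching parameter: enough shared points -> same cluster.
    for (ca, cb), count in shared.items():
        if count >= min_shared:
            union(ca, cb)

    # Points whose overlapping clusters were not merged simply keep their first label.
    return {pid: find(clusters[0]) for pid, clusters in point_clusters.items()}
```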
I just saw this:
The problem apparently is a non-standard DBSCAN implementation in scikit-learn. DBSCAN does not need a distance matrix.
But this has probably been fixed years ago.
Which implementation are you using?
DBSCAN only needs the neighbors of each point.
So if you knew the appropriate parameters (which I doubt), you could read the huge matrix one row at a time and build a list of neighbors within your distance threshold. Assuming that fewer than 1% of the points are neighbors (on such huge data you'll likely want to go even lower), that would reduce the memory need about 100x.
But usually you want to avoid computing such a matrix at all!
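If you do go the row-at-a-time route, the sketch below shows the idea; `distance_row(i)` is a placeholder for however you compute or read the i-th row of distances, and `eps` is the DBSCAN radius.

```python
import numpy as np

def build_neighbor_lists(n_points, distance_row, eps):
    """distance_row(i) must return the i-th row of distances as a 1-D array."""
    neighbors = []
    for i in range(n_points):
        row = np.asarray(distance_row(i))
        idx = np.flatnonzero(row <= eps)      # points within the DBSCAN radius
        neighbors.append(idx[idx != i])       # drop the point itself
    return neighbors
```

Points with at least min_samples entries in their list are exactly DBSCAN's core points, which is all the algorithm needs to run.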

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries - Spark MLlib and Spark ML. They somewhat overlap in what they implement, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backwards compatibility.
My question is very concrete and related to PCA. In the MLlib implementation there seems to be a limitation on the number of columns:
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
Also, if you look at the Java code example there is also this
The number of columns should be small, e.g, less than 1000.
On the other hand, if you look at the ML documentation, there are no limitations mentioned.
So, my question is - does this limitation also exist in Spark ML? And if so, why is there a limitation, and is there any workaround to use this implementation even when the number of columns is large?
PCA consists of finding a set of decorrelated random variables with which you can represent your data, sorted in decreasing order of the amount of variance they retain.
These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.
When X is large, say of dimensions n x d, you can compute X^T X by taking the outer product of each row of the matrix with itself, then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is, because the outer product of each row with itself is a d x d matrix that each worker has to manipulate in main memory. That's why you might run into trouble when handling many columns.
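As a toy illustration of that map-reduce idea (plain NumPy rather than Spark, just to show the mechanics), each "map" step turns one row into a d x d outer product and the "reduce" step sums them:

```python
import numpy as np

def gramian_by_outer_products(X):
    """Compute X^T X as a sum of per-row outer products (the map-reduce view)."""
    n, d = X.shape
    G = np.zeros((d, d))
    for row in X:                 # "map": each row becomes a d x d outer product
        G += np.outer(row, row)   # "reduce": sum the d x d partial results
    return G

X = np.random.randn(1000, 50)
X -= X.mean(axis=0)               # mean-center
assert np.allclose(gramian_by_outer_products(X), X.T @ X)
```

The d x d partial results are exactly what must fit in each worker's memory, hence the column limit.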
If the number of columns is large (and the number of rows not so much), you can still compute PCA. Just compute the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There's your orthogonal subspace.
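A rough NumPy sketch of that second approach, with arbitrary example shapes: work with the small n x n matrix X X^T instead of the d x d matrix X^T X, then map its eigenvectors back into the d-dimensional space.

```python
import numpy as np

n, d, k = 100, 5000, 10
X = np.random.randn(n, d)
X -= X.mean(axis=0)                       # mean-center

G = X @ X.T                               # n x n instead of d x d
eigvals, V = np.linalg.eigh(G)            # ascending eigenvalues
eigvals, V = eigvals[::-1], V[:, ::-1]    # sort descending

# Principal directions (columns) in the original d-dimensional space.
components = X.T @ V[:, :k] / np.sqrt(eigvals[:k])

# Projecting onto these directions reproduces the rank-k SVD approximation of X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X @ components @ components.T,
                   U[:, :k] @ np.diag(S[:k]) @ Vt[:k])
```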
Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If they check the dimensions of the input dataset to decide whether they should go for the second approach, then you won't have problems dealing with large numbers of columns if the number of rows is small.
Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling themselves rather than suggesting a limitation that may not apply to everyone. That might be the reason why they decided not to mention the limitation in the new docs.
Update: the source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535 columns, and a warning is issued at 10,000.

k-means with ellipsoids

I have n points in R^3 that I want to cover with k ellipsoids or cylinders (I don't really care; whichever is easier). I want to approximately minimize the union of the volumes. Let's say n is tens of thousands and k is a handful. Development time (i.e. simplicity) is more important than runtime.
Obviously I can run k-means and use perfect balls for my ellipsoids. Or I can run k-means, then use a minimum enclosing ellipsoid per cluster rather than covering with balls, though in the worst case that's no better. I've seen talk of handling anisotropy with k-means, but the links I saw seemed to assume I had a tensor in hand; I don't, I just know the data will be a union of ellipsoids. Any suggestions?
[Edit: There are a couple of votes for fitting a mixture of multivariate Gaussians, which seems like a viable thing to try. Firing up an EM code to do that won't minimize the volume of the union, but of course k-means doesn't minimize volume either.]
So you likely know k-means is NP-hard, and this problem is even more general (and harder). Since you want ellipsoids, it might make a lot of sense to fit a mixture of k multivariate Gaussian distributions. You would probably want to find a maximum-likelihood solution, which is a non-convex optimization, but at least it's easy to formulate and there is likely code available.
Other than that, you're likely to have to write your own heuristic search algorithm from scratch, which is a huge undertaking.
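For example, here is a minimal sketch of that suggestion using scikit-learn's GaussianMixture (my choice of library, not something from the original answer); the Mahalanobis radius of 2.0 that sets the ellipsoid size is an arbitrary choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.random.randn(10000, 3)          # stand-in for your n points in R^3
k = 5

gmm = GaussianMixture(n_components=k, covariance_type="full").fit(points)

for mean, cov in zip(gmm.means_, gmm.covariances_):
    eigvals, eigvecs = np.linalg.eigh(cov)
    semi_axes = 2.0 * np.sqrt(eigvals)      # semi-axis lengths of the ellipsoid
    # `eigvecs` columns give the ellipsoid's orientation, `mean` its center.
    print(mean, semi_axes)
```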
I did something similar with multivariate Gaussians using this method. The authors use kurtosis as the split measure, and I found it to be a satisfactory method for my application: clustering points obtained from a laser range finder (i.e. computer vision).
If the ellipsoids can overlap a lot, then methods like k-means that try to assign points to single clusters won't work very well. Part of each ellipsoid has to fit the surface of your object, but the rest may be inside it, don't-cares. That is, covering algorithms seem to me quite different from clustering / splitting algorithms; unions are not splits.
Gaussian mixtures with lots of overlaps? No idea, but see the picture and code in Numerical Recipes, p. 845.
Coverings are hard even in 2D; see find-near-minimal-covering-set-of-discs-on-a-2-d-plane.

2D shape optimization through genetic algorithms

I just recently started learning about genetic algorithms and am now trying to apply them to 2D shape optimization in a physics simulation. The simulation produces a single scalar for each shape. (I guess this is kind of similar to boxcar2d http://boxcar2d.com/)
The 2D shapes are actually the union of several 2D "sub-shapes." Each sub-shape is stored as a list of angles/radii. The 2D shape is then stored as a list of sub-shape lists. This serves as my chromosome right now.
Right now, for fitness, I will probably use the scalar the simulation produces. My question is: how should I go about the selection and reproduction process? Would tournament selection be more appropriate, or would I want to use truncation in combination with proportional selection? Also, how do you find a good mutation rate, population size, etc.?
Sorry for so many questions, but thanks in advance. I just don't really know where to start.
In my view, the best way is to use an adaptive reproduction strategy during evolution: in the first phase of the run you set a high mutation probability, and in this phase you should find a reasonably good solution. In the second phase you decrease the mutation probability every few steps, refining that solution. Sometimes in my practice, though, I've noticed degradation of the population during the second phase (every chromosome becomes strongly similar to the others), which slows the optimization down dramatically; my fix was to add occasional high-magnitude random mutation perturbations, and it helped.
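Here is a bare-bones skeleton of that two-phase / adaptive-mutation idea; the fitness function, chromosome encoding, crossover, mutation operator, and all numeric parameters are placeholders you would supply, so treat it as a sketch rather than a recipe.

```python
import random

def evolve(fitness, random_chromosome, mutate, crossover,
           pop_size=50, generations=200):
    population = [random_chromosome() for _ in range(pop_size)]
    mutation_rate = 0.5                       # phase 1: explore aggressively

    for gen in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]      # truncation selection, for simplicity

        # Phase 2: slowly decrease the mutation rate to refine solutions.
        mutation_rate = max(0.05, mutation_rate * 0.99)

        # Crude stagnation check: if the parents have collapsed to near-identical
        # fitness values, kick the mutation rate back up (the "perturbation" above).
        if len({fitness(c) for c in parents}) < len(parents) // 4:
            mutation_rate = 0.5

        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = parents + children

    return max(population, key=fitness)
```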
I'd also advise you to read about the differential evolution algorithm: http://en.wikipedia.org/wiki/Differential_evolution. In my experience its performance is much better than a plain genetic algorithm's.
