Defining the Linear Programming Model for TSP with multi access per point - pulp

I am trying to use pulp to solve a TSP (similar to this:
Defining the Linear Programming Model for Traveling Salesman in Python
https://www.kaggle.com/itoeiji/solving-tsp-and-vrp-by-mip-using-pulp)
However, instead of going through all the points, some of the points have more than one access point, and visiting just one of them per point is enough. Imagine a postman delivering to n houses, where some of the houses (m of them) have two letter boxes, one in the north and one in the south.
Example:
Consider 5 points, with points 2 and 3 each having two letter boxes.
Some potential solutions would be:
[1, 2S, 3N, 4, 5]
[1, 2N, 3S, 4, 5]
[1, 2S, 3S, 4, 5]
etc...
Any ideas how to modify the constraints in order to tackle this variation?
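One way this variation is usually modelled (it is essentially the "generalized TSP") is to treat each letter box as its own node, group the boxes belonging to the same point, and require that exactly one node per group is visited. A hedged sketch of just those extra pieces on top of a standard PuLP TSP model (the names below are my own, not taken from the linked notebook):
import pulp

nodes = ["1", "2N", "2S", "3N", "3S", "4", "5"]
groups = [["1"], ["2N", "2S"], ["3N", "3S"], ["4"], ["5"]]

prob = pulp.LpProblem("gtsp_sketch", pulp.LpMinimize)

# x[i, j] = 1 if the tour travels directly from node i to node j
x = pulp.LpVariable.dicts(
    "x", [(i, j) for i in nodes for j in nodes if i != j], cat="Binary"
)
# y[i] = 1 if node i is visited at all
y = pulp.LpVariable.dicts("y", nodes, cat="Binary")

# Exactly one access point per group is visited
for group in groups:
    prob += pulp.lpSum(y[i] for i in group) == 1

# A visited node gets exactly one outgoing and one incoming arc; a skipped node gets none
for i in nodes:
    prob += pulp.lpSum(x[i, j] for j in nodes if j != i) == y[i]
    prob += pulp.lpSum(x[j, i] for j in nodes if j != i) == y[i]

# The distance objective and the subtour-elimination constraints of the standard model
# still apply, with the subtour constraints restricted to the visited nodes.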

Related

Generating a Random Variable that does not overlap with previous ones while placing components

I'm randomly placing components on a grid using random variables in Python, with the following line of code repeated for the number of components to be placed:
[random.randint(0,32), random.randint(0,27), random.randint(0,3), 1]
I also have the part sizes for each of the parts in an array like this:
partSizes = [[[10, 10], [4, 2], [4, 2]]]
So when I randomly place parts on my grid, some of them overlap, which I need to avoid by skipping the portions of the grid already occupied by previously placed parts.
I'm very new to Python so I'm having a hard time figuring this out. Any help will be much appreciated.
Regards
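One simple approach (a hedged sketch of my own, reusing only the grid ranges and part sizes from the question, with the nested partSizes list flattened) is rejection sampling: keep the placements already accepted and re-draw any candidate whose bounding box overlaps one of them.
import random

GRID_W, GRID_H = 32, 27
partSizes = [[10, 10], [4, 2], [4, 2]]  # flattened from the question's nested list

def overlaps(a, b):
    # Axis-aligned bounding-box test on (x, y, w, h) rectangles
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

placed = []
for w, h in partSizes:
    while True:  # simple rejection loop; fine as long as the grid is not crowded
        candidate = (random.randint(0, GRID_W - w), random.randint(0, GRID_H - h), w, h)
        if not any(overlaps(candidate, p) for p in placed):
            placed.append(candidate)
            break

print(placed)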

How to use machine learning or prediction to solve this regression in Python 3?

I would like a tip about a machine learning problem.
Input:
[[2, 3, 7],
[3, 9, 5],
[2, 6, 4]]
Output:
[4, 1, 1].T
This is the already-given dataset. I want to know the output for the input [8, 1, 7]. I think this is a kind of machine-learning regression problem. Fundamentally, let's assume the input dataset is a group of pure numbers.
In the long term, I want to handle the case where some kinds of input sets are not pure numbers, but this is not urgent now, so I'll think about it later. Intuitively, it looks simple, but I cannot work out how to solve it because of my limited skills.
How can I tackle this?
Your dataset of 3 observations is far too small to employ a reasonable machine learning technique.
I'd probably go for the k-nearest neighbours method here: given an input, calculate the distance to the known data points and select the output associated with the closest one.
Here, the closest observation (measured by Euclidean distance) to [8, 1, 7] is [2, 3, 7], so this method will predict that the output is 4.
If you obtain a larger dataset, you will be able to use much better methods.
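A minimal sketch of that nearest-neighbour prediction on the data above (plain Python, no library needed):
import math

X = [[2, 3, 7], [3, 9, 5], [2, 6, 4]]
y = [4, 1, 1]
query = [8, 1, 7]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Pick the training point closest to the query and return its output
closest = min(range(len(X)), key=lambda i: euclidean(X[i], query))
print(X[closest], y[closest])  # -> [2, 3, 7] 4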

Combinations to Equal Sums (similar to Subset-Sum and Coin Changing Algorithms)

I have the Coin Changing problem, except with a twist: instead of finding solutions from an unlimited supply of coins that equal a single sum, I need to find a list of solutions from a finite set of coins that stay at or below a collection of sums.
(A great link to the classic problem is here)
(This is also similar to subset-sum problem, except with a set of targets instead of just one - link here)
A similar, but different and seemingly more difficult problem to code:
Given -
A list of possible numeric values that all must be used [1, 9, 4, 1, 3, 2]
A list of groups with a maximum total that can be achieved [(GroupA,17), (GroupB,1), (GroupC,5)]
Goal - Assign each numeric value from the list to a Group, so that all items are used with the following constraint: the total sum of the values for each group must not exceed the assigned maximum.
For example, this instance might yield 3 solutions:
[[[GroupA, [9, 4, 1, 3]], [GroupB, [1]], [GroupC, [2]]],
 [[GroupA, [9, 4, 1, 2]], [GroupB, [1]], [GroupC, [3]]],
 [[GroupA, [9, 2, 1, 3]], [GroupB, [1]], [GroupC, [4]]]]
This is called the multiple knapsack problem. You have n items with weights w[j] and m knapsacks with capacities c[i], and you need to assign the items to the knapsacks so that no knapsack exceeds its capacity.
Wikipedia provides the integer programming formulation of the problem, and that's the approach I'd take to solving practical examples -- by using an IP solver.
For completeness, the IP formulation is:
maximize sum(x[i,j] * w[j], i=1..m, j=1..n)
such that:
sum(x[i,j], i=1..m) <= 1 (for all j=1..n)
sum(x[i,j] * w[j], j=1..n) <= c[i] (for all i=1..m)
x[i,j] in {0, 1} (for all i=1..m, j=1..n)
That is, you're maximizing the total weight of things in the knapsacks, such that no item is assigned to more than one knapsack, no knapsack exceeds its capacity, and items are discrete (they can't be partially assigned to a knapsack). This is phrased as an optimization problem -- but of course, you can look at the best solution and see if it assigns all the items.
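For example, a minimal PuLP sketch of this integer program on the example data (the model and variable names here are my own, not part of the original question):
import pulp

values = [1, 9, 4, 1, 3, 2]                       # item weights w[j]
caps = {"GroupA": 17, "GroupB": 1, "GroupC": 5}   # knapsack capacities c[i]
items = range(len(values))

prob = pulp.LpProblem("multiple_knapsack", pulp.LpMaximize)

# x[g, j] = 1 if item j is placed in group g
x = pulp.LpVariable.dicts("x", [(g, j) for g in caps for j in items], cat="Binary")

# Objective: maximise the total weight packed
prob += pulp.lpSum(x[g, j] * values[j] for g in caps for j in items)

# Each item goes into at most one group
for j in items:
    prob += pulp.lpSum(x[g, j] for g in caps) <= 1

# Each group stays within its capacity
for g, cap in caps.items():
    prob += pulp.lpSum(x[g, j] * values[j] for j in items) <= cap

prob.solve()

# If every item ends up assigned, the original feasibility question is answered "yes"
assignment = {g: [values[j] for j in items if x[g, j].value() == 1] for g in caps}
print(assignment)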

Apache Spark user-user recommendation?

I have a dataset of questions and answers that users have completed by picking choices. I'm trying to build a user-user recommendation engine to find similar users based on their answers to the questions. An important point is that the questions are shuffled, not in any fixed order, and the data is streaming.
So for each user I have a data like this:
user_1: {"question_1": "choice_1", ...}
user_2: {"question_3": "choice_4", ...}
user_3: {"question_1": "choice_3", ...}
I have found most tutorials to be about user-item recommendations, but nothing about user-user recommendations.
I've realized that clustering and cosine similarity might be good options, and I've found that columnSimilarities is very efficient.
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
mat = RowMatrix(rows)
# Upper-triangular CoordinateMatrix of pairwise cosine similarities between columns
sims = mat.columnSimilarities()
I have two questions:
Is it wise to define each user as a column and each question/choice as a row to get the result I need?
And how should I vectorize this kind of data into numbers, if I need to do clustering?
Thanks in advance :)
Unfortunately, it can't be done that way. It's too good to be true, isn't it?
columnSimilarities is meant for tall and skinny matrices, so if you put users in the columns of a huge matrix (e.g. 1M users) in order to get user-user similarities, it won't work.
From your description, it sounds like you might have a short and wide matrix, so columnSimilarities won't work for you.
If you wish to perform user-user collaborative filtering (UUCF), clustering would be one way to go (among others, LSH is also a good approach).
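As for vectorizing the answers, one simple option (a sketch of my own, not something the answer prescribes) is to one-hot encode every (question, choice) pair per user, which also copes with the questions arriving in any order:
users = {
    "user_1": {"question_1": "choice_1"},
    "user_2": {"question_3": "choice_4"},
    "user_3": {"question_1": "choice_3"},
}

# Fixed index over every (question, choice) pair seen so far
pairs = sorted({(q, c) for answers in users.values() for q, c in answers.items()})
index = {pair: i for i, pair in enumerate(pairs)}

def vectorize(answers):
    v = [0] * len(index)
    for q, c in answers.items():
        v[index[(q, c)]] = 1
    return v

vectors = {u: vectorize(a) for u, a in users.items()}
# These vectors can then be fed to clustering or to pairwise cosine similarity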

How can I cluster documents using k-means (FLANN with Python)?

I want to cluster documents based on similarity.
I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster, that FLANN is the fastest of all implementations, and that it is more accurate, so I am trying FLANN with Python bindings, but I can't find any example of how to use it on text (it only supports arrays of numbers).
I am very new to this field (k-means, natural language processing). What I need is speed and accuracy.
My questions are:
Can we do document similarity grouping/clustering using k-means (FLANN does not seem to allow any text input)?
Is FLANN the right choice? If not, please suggest a high-performance library that supports text/document clustering and has a Python wrapper/API.
Is k-means the right algorithm?
You need to represent your document as an array of numbers (aka a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.
So here's what you do:
Count up the number of times each word appears in the document.
Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.
Make a vector for each document based on the counts of the feature words.
Here's an example.
If your "documents" are single sentences, and they look like (one doc per line):
there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog
If my set of feature words are [dog, cat, street, pizza, lunch], then I can convert each document into a vector:
[1, 1, 0, 0, 0] // dog 1 time, cat 1 time
[0, 0, 0, 1, 1] // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0] // dog 2 times, cat 1 time, street 1 time
You can use these vectors in your k-means algorithm, and it will hopefully group the first and third sentences together because they are similar, and make the second sentence a separate cluster since it is very different.
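A minimal sketch of those steps, using scikit-learn's KMeans purely as a stand-in clusterer (the question asked about FLANN, but the count vectors themselves are library-agnostic):
from sklearn.cluster import KMeans

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]
features = ["dog", "cat", "street", "pizza", "lunch"]

# Count how many times each feature word appears in each document
vectors = [[doc.split().count(word) for word in features] for doc in docs]
# vectors == [[1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [2, 1, 1, 0, 0]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # documents 1 and 3 should share a cluster, document 2 gets its own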
There is one big problem here:
K-means is designed for Euclidean distance.
The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).
Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking, the problem is that the mean of a large number of documents will no longer look like a real document, and will thereby become dissimilar to any real document and more similar to other mean vectors. So the results degenerate to some extent.
For text vectors, you probably will want to use a different distance function such as cosine similarity.
And of course you first need to compute numeric vectors, for example by using relative term frequencies and normalizing them via TF-IDF.
There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.
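A minimal sketch of the TF-IDF route, assuming scikit-learn (which is not mentioned above): TfidfVectorizer L2-normalises its output by default, and on unit-length vectors squared Euclidean distance equals 2 - 2 * cosine similarity, so plain k-means on TF-IDF vectors is a reasonable stand-in for cosine-based clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

# TF-IDF vectors, L2-normalised by default, with English stopwords removed
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)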