I am using Google OR-Tools for the following problem:
Vehicles always start from the depot with all the weights for the drop points already loaded. There are also pickup points where they have to load weights. The tour has to be completed within a time window of 10 hours.
Nodes = [A, B, C, D, E, F, G]
Weights = [50, 60, 30, 20, 80, 90, 40]
PointType = [D, D, P, D, P, D, D] where D = Delivery and P = Pickup
I tried the Pickup and Delivery example in OR-Tools, adding a dummy pickup node for every drop and a dummy drop node for every pickup.
This gives every location a unique pickup and drop combination.
This approach works well for up to 50-60 locations, but the solver fails to return any solution for a large number of locations (135 locations and hence 135 * 2 = 270 nodes, because there is one dummy node for each).
Is there another way to solve this problem with OR-Tools that does not require dummy nodes and would therefore lower the total number of nodes?

OR-Tools was not generating a solution because the 135-location problem was infeasible within the time windows that I was passing.
It has no trouble generating a solution when the time windows are feasible.
Also, creating dummy/duplicate nodes is probably necessary, because in any variant of the Travelling Salesman Problem (TSP), a node can only be visited once.
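To make the dummy-node construction concrete, here is a minimal sketch in plain Python that expands the example data from the question into pickup/delivery pairs. The helper name build_pairs, the label suffixes, and the choice of index 0 for the depot are assumptions made for illustration only. In an OR-Tools model, each resulting pair would typically be registered with routing.AddPickupAndDelivery, and each dummy node would share the depot's location.

```python
def build_pairs(nodes, weights, point_types):
    """Expand stops into (pickup_index, delivery_index) pairs.

    Index 0 is the depot (an assumption of this sketch). Every delivery 'D'
    gets a dummy pickup located at the depot, and every pickup 'P' gets a
    dummy delivery located at the depot, so each stop becomes one pair.
    """
    labels = ["depot"]
    demands = [0]  # positive = load onto the vehicle, negative = unload
    pairs = []
    for name, weight, kind in zip(nodes, weights, point_types):
        if kind == "D":
            labels += [f"{name}_load_at_depot", name]
        elif kind == "P":
            labels += [name, f"{name}_unload_at_depot"]
        else:
            raise ValueError(f"unknown point type: {kind!r}")
        demands += [weight, -weight]
        # The two nodes just appended form one pickup/delivery pair.
        pairs.append((len(labels) - 2, len(labels) - 1))
    return labels, demands, pairs


nodes = ["A", "B", "C", "D", "E", "F", "G"]
weights = [50, 60, 30, 20, 80, 90, 40]
point_types = ["D", "D", "P", "D", "P", "D", "D"]
labels, demands, pairs = build_pairs(nodes, weights, point_types)
```

With 7 locations this yields 1 depot + 14 nodes, which is the same doubling that turns 135 locations into 270 nodes.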


ArangoDB AQL: can I traverse a graph from multiple start vertices, but ensure uniqueVertices across all traversals?

I have a graph dataset with a large number of relatively small disjoint graphs. I need to find all vertices reachable from a set of vertices matching certain search criteria. I use the following query:
FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    FOR node IN 0..100000 OUTBOUND startnode edges
        COLLECT k = node._key
        RETURN k
The query is very slow, even though it returns the correct result. This is because Arango actually ends up traversing the same subgraphs many times. For example, say there is the following subgraph:
a -> b -> c -> d -> e
When vertices a and c are selected by the filter condition, Arango ends up doing two independent traversals starting from a and c. It visits vertices d and e during both of these traversals, which wastes time. Adding the uniqueVertices option doesn't help, because vertex uniqueness is not checked across different traversals.
To confirm the impact on performance, I created an extra root document and added links from it to all the documents found by my filter:
FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    INSERT { _from: 'fakeVertices/0', _to: startnode._id } IN fakeEdges
Now the following query runs 4x faster than my original query, while producing the same result:
FOR node IN 1..1000000 OUTBOUND 'fakeVertices/0' edges, fakeEdges
    OPTIONS { uniqueVertices: 'global', bfs: true }
    COLLECT k = node._key
    RETURN k
Unfortunately, I cannot create fake vertices/edges for all of my queries, as creating them takes even more time.
My question is: does Arango provide a way to ensure uniqueness of vertices visited across all traversals in a given query? If not, is there a better way to solve the problem described above?
From what I understand, this is what the uniqueVertices option is for, but for each iteration of the FOR ... statement, it considers vertices unique for the traversal from that start node. It doesn't know about other traversals that have happened on other nodes in the FOR ... statement. It appears that you will traverse LOTS of vertices each time, and this happens from each new start node.
Just throwing this at the wall to see if it sticks, but what about a combination of the two queries, adding OPTIONS to the original?
FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    FOR node IN 0..100000 OUTBOUND startnode edges
        OPTIONS { uniqueVertices: 'global', bfs: true }
        COLLECT k = node._key
        RETURN k
Also, I would highly recommend a named graph instead of specifying edge collections. Not only is it far more flexible, it allows you to use shortest-path calculations as well, which might help here.
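To illustrate why a single shared visited set matters (which is what the fake root vertex achieves in the question), here is a small client-side sketch in Python. It is not ArangoDB API code: the function name reachable_from_all and the adjacency-dict representation are assumptions for illustration. All start vertices are seeded into one breadth-first search, so each vertex is expanded at most once no matter how many start vertices can reach it.

```python
from collections import deque


def reachable_from_all(adjacency, start_keys):
    """Multi-source BFS with one visited set shared by all start vertices."""
    visited = set()
    queue = deque()
    # Seed every start vertex, like linking them all to one fake root.
    for key in start_keys:
        if key not in visited:
            visited.add(key)
            queue.append(key)
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour not in visited:  # global uniqueness check
                visited.add(neighbour)
                queue.append(neighbour)
    return visited


# The subgraph from the question: a -> b -> c -> d -> e
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
result = reachable_from_all(graph, ["a", "c"])
```

Here d and e are visited once, whereas two independent traversals from a and c would each walk through them.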

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed. However, every time, one or more tasks take an inordinate amount of time to complete.
As you can see, the median is 6ms per task (I'm running it on a smaller source dataset to test), but 1 task takes 10min. It's hardly using any CPU cycles. It actually joins data, but very, very slowly.
The next slowest task runs in 14s, has 4x more records & actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (minhash), in accordance with the minhash specification, followed by a udf that calculates the Jaccard distance between matched pairs.
Explode the hashtables:
struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
  val xSet = x.toSparse.indices.toSet
  val ySet = y.toSparse.indices.toSet
  val intersectionSize = xSet.intersect(ySet).size.toDouble
  val unionSize = xSet.size + ySet.size - intersectionSize
  assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
  1 - intersectionSize / unionSize
}
Join of the processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
  .drop(explodeCols: _*).distinct()
// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
  distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)
// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
This will have a short distance to records where there is a typo in the city or zip.
This gives pretty accurate results, but may be the cause of the join skew.
My questions are:
What may cause this discrepancy? (One task taking very, very long, even though it has fewer records.)
How can I prevent this skew in minhash without losing accuracy?
Is there a better way to do this at scale? (I can't Jaro-Winkler / Levenshtein compare millions of records with all records in the location dataset.)
It might be a bit late, but I will post my answer here anyway to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting the use of n-grams to reduce the data skew. It helped me a lot. You could also try using e.g. 3-grams or 4-grams.
I don't know how dirty the data is, but you could try to make use of states. It reduces the number of possible matches substantially already.
What really helped me improve the accuracy of the matches was to postprocess the connected components (groups of connected matches made by the MinHashLSH) by running a label propagation algorithm on each component. This also allows you to increase N (of the n-grams), which mitigates the problem of skewed data, to set the Jaccard distance parameter in approxSimilarityJoin less tightly, and then to clean up the matches with label propagation.
Finally, I am currently looking into using skipgrams to match it. I found that in some cases it works better and reduces the data skew somewhat.
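As a rough illustration of how the n-gram size changes the shingle sets, here is a minimal sketch in plain Python (not Spark code). The function names shingles and jaccard_distance and the tokenisation rule are assumptions for illustration, and the output is not guaranteed to match the shingling used in the question exactly. The distance function follows the same intersection-over-union definition as keyDistance above.

```python
import re


def shingles(text, n=2):
    """Character n-grams taken per alphanumeric token, lower-cased."""
    grams = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if len(token) <= n:
            grams.add(token)  # keep short tokens such as "wi" whole
        else:
            grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty shingle sets."""
    union = a | b
    if not union:
        raise ValueError("the union of the two sets must not be empty")
    return 1.0 - len(a & b) / len(union)


clean = "53536, Evansville, WI"
typo = "53536, Evansvile, WI"
distance_2grams = jaccard_distance(shingles(clean, 2), shingles(typo, 2))
distance_3grams = jaccard_distance(shingles(clean, 3), shingles(typo, 3))
```

Longer n-grams are shared by fewer records, so each hash bucket tends to hold fewer candidate pairs. The price is that a single typo changes more shingles, which is why the distance threshold may need loosening.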

Iterate cluster centers for K means in Python

I have 4 columns of data. For these Xs, I need to pick 3 cluster centers randomly and find the clustering with the least SSE. Why do the centers and inertia (SSE) turn out to be the same with varying random states, even with init='random'?
kmeans1= KMeans(n_clusters=3, init='random', random_state=101)
On data that is too simple, many different initial seeds will converge to the same result.
Plus, the default of n_init is 10 if I remember correctly, so if just 1 out of the ten runs reaches that result, it is the one that gets returned...
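The effect of restarts can be sketched without scikit-learn. The following is a minimal pure-Python version of Lloyd's algorithm with random initial centers, written only to illustrate the point above. The names lloyd and best_of and the toy data are assumptions for illustration, not scikit-learn internals. best_of keeps the run with the lowest SSE, which is the role n_init plays.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, k, rng, max_iter=100):
    """One k-means run from k randomly chosen data points; returns (centers, sse)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def best_of(points, k, seed, n_init):
    """Run lloyd n_init times from one seeded generator and keep the lowest SSE."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(n_init)), key=lambda run: run[1])


# Three well-separated groups: "too simple" data in the sense used above.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (10.0, 10.1), (10.1, 10.0)]
sse_by_seed = {seed: best_of(points, 3, seed, n_init=10)[1] for seed in (1, 101, 2024)}
```

To see seed-dependent results in scikit-learn, one option is to pass n_init=1 explicitly, so that each random_state corresponds to a single run.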

Calculating the size of a full outer join in pandas

My issue here is that I'm stuck at calculating how many rows to anticipate on each part of a full outer merge when using Pandas DataFrames as part of a combinatorics graph.
Questions (repeated below).
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)? Databases are out of the question.
The problem
Recently I re-wrote the py-upset graphing library to make it more efficient in terms of time when calculating combinations across DataFrames. I'm not looking for a review of this code, it works perfectly well in most instances and I'm happy with the approach. What I am looking for now is the answer to a very specific problem; uncovered when working with large data-sets.
The approach I took with the re-write was to formulate an in-memory merge of all provided dataframes on a full outer join as seen on lines 480 - 502 of pyupset.resources
for index, key in enumerate(keys):
    frame = self._frames[key]
    frame.columns = [
        '{0}_{1}'.format(column, key)
        if column not in self._unique_keys
        else column
        for column in self._frames[key].columns
    ]
    if index == 0:
        self._merge = frame
    else:
        suffixes = (
            ...
        )
        self._merge = self._merge.merge(
            ...
        )
For small to medium dataframes, using joins works incredibly well. In fact, recent performance tests have shown that it'll handle 5 or 6 datasets containing tens of thousands of lines each in less than a minute, which is more than ample for the application structure I require.
The problem now moves from time-based to memory-based.
Given datasets of potentially hundreds of thousands of records, the library very quickly runs out of memory even on a large server.
To put this in perspective, my test machine for this application is an 8-core VMware box with 128GiB RAM running CentOS 7.
Given the following dataset sizes, memory usage spirals exponentially when adding the 5th dataframe. This was pretty much anticipated but underlines the heart of the problem I am facing.
Rows | Dataframe
13963 | dataframe_one
48346 | dataframe_two
52356 | dataframe_three
337292 | dataframe_four
49936 | dataframe_five
24542 | dataframe_six
258093 | dataframe_seven
16337 | dataframe_eight
These are not "small" dataframes in terms of the number of rows, although the column count for each is limited to one unique key + 4 non-unique columns. The type of each column in pandas is:
column | type | unique
X | object | Y
id | int64 | N
A | float64 | N
B | float64 | N
C | float64 | N
This merge can cause problems as memory is eaten up. Occasionally it aborts with a MemoryError (great, I can catch and handle those), other times the kernel takes over and simply kills the application before the system becomes unstable, and occasionally, the system just hangs and becomes unresponsive / unstable until finally the kernel kills the application and frees the memory.
Sample output (memory sizes approximate):
[INFO] Creating merge table
[INFO] Merging table dataframe_one
[INFO] Data index length = 13963 # approx memory <500MiB
[INFO] Merging table dataframe_two
[INFO] Data index length = 98165 # approx memory <1.8GiB
[INFO] Merging table dataframe_three
[INFO] Data index length = 1296665 # approx memory <3.0GiB
[INFO] Merging table dataframe_four
[INFO] Data index length = 244776542 # approx memory ~13GiB
[INFO] Merging table dataframe_five
Killed # > 128GiB
When the merge table has been produced, it is queried in set combinations to produce graphs similar to
The approach I am trying to build for solving the memory issue is to look at the sets being offered for merge, pre-determine how much memory the merge will require, then if that combination requires too much, split it into smaller combinations, calculate each of those separately, then put the final dataframe back together (divide and conquer).
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of the merge.
Questions (repeated from above)
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)?
Apologies for the lengthy question. I'm happy to provide more information if required or possible.
Can anybody shed some light on what might be the reason for this?
Thank you.
Question 1.
Dask shows a lot of promise in being able to calculate the merge table "out of memory" by using hdf5 files as a temporary store.
By using multi-processing to create the merges, dask also offers a performance increase over pandas. Unfortunately this is not carried through to the query method so performance gains made on the merge are lost on querying.
It is still not a completely viable solution as dask may still run out of memory on large, complex merges.
Question 2.
Pre-calculating the size of the merge is entirely possible using the following method:
1. Group each dataframe by the unique key and calculate the size of each group.
2. Create a set of key names for each dataframe.
3. Create the intersection of the sets from step 2.
4. Create the set difference for set 1 and for set 2.
5. To account for np.nan stored in the unique key, count all NaN values in each frame. If one frame contains NaN and the other doesn't, count the other as 1.
6. For keys in the intersection, multiply the counts from each groupby('...').size().
7. Add the counts from the set differences.
8. Add the count of np.nan values.
In Python this could be written as:
def merge_size(left_frame, right_frame, group_by):
    # Rows per key in each frame (groupby drops NaN keys by default).
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    # Keys present in both frames, and keys present in only one of them.
    intersection = right_keys & left_keys
    left_sub_right = left_keys - intersection
    right_sub_left = right_keys - intersection
    # NaN is the only value not equal to itself, so this counts NaN keys.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    # If only one frame has NaN keys, its rows still appear once each.
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
    # Shared keys contribute a cross product; unshared keys pass through.
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_sub_right]
    sizes += [right_groups[group_name] for group_name in right_sub_left]
    sizes += [left_nan * right_nan]
    return sum(sizes)
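The same counting idea can be checked on plain lists of key values without pandas. This is a dependency-free sketch. The function name outer_join_rows is hypothetical, and it assumes that missing keys are represented as None and match each other, mirroring how the method above treats np.nan.

```python
from collections import Counter


def outer_join_rows(left_keys, right_keys):
    """Predict the row count of a full outer join on a single key column."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    total = 0
    for key in left.keys() | right.keys():
        n_left, n_right = left[key], right[key]  # Counter gives 0 if absent
        if n_left and n_right:
            total += n_left * n_right  # shared key: cross product of rows
        else:
            total += n_left + n_right  # key on one side only: rows pass through
    return total


# "a" matches 2 x 1 rows, "b" and "c" each pass through once.
example = outer_join_rows(["a", "a", "b"], ["a", "c"])
```

Repeated keys on both sides are what make the merge explode: a key that occurs 1,000 times in each frame alone contributes 1,000,000 rows.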
Question 3
This method is fairly heavy on calculating and would be better written in Cython for performance gains.

Packet profile from netflow

I have NetFlow data from the previous month in files covering 5 minutes each, and I would like to build a packet profile of all this traffic. I need the percentage representation of 1-packet flows, 2-packet flows, etc. It is also possible to do it in categories like 1-packet flows, 1-100 packet flows, 100 and more... It's not so important. But my question is how to do it. How do I produce a percentage representation of data which I can't add together? Something like computing a percentage representation for every file and then taking some type of average of those?
What do you mean by "I can't add together"? Actually you can do that with nfdump. From the manual: -R expr /dir/file1:file2 Read all files from file1 to file2. For instance
nfdump -R /yournetflowfolder/nfcapd.201204051609:nfcapd.201204051639
will gather NetFlow information from 16:09 to 16:39. Then you can do whatever query you need on that data.
It sounds like you're describing a histogram: You create 'bins' of the size you describe with the raw counts. The sum of the counts for the bins is the total number of sessions. To get the percentages of the total traffic, you just normalize by dividing each bin by the total flow count.
So, if you do a two-bin histogram where the first bin is the count of all flows with fewer than 100 packets and the other is the count of flows with 100 or more packets (note that there can't be gaps or overlaps), and it works out to 30 flows in the former and 60 in the latter, then the total number of flows is 90, and 33% of the flows have fewer than 100 packets.
When working with multiple files, the trick is to always use the same bin delineations and to store and work with the raw counts as long as possible and only derive the %s as the very last step. You can add together histograms with no trouble as long as their bins mean the same thing, and then when you normalize the result, you have for each bin the total percent for all files. If you're going to need to add a file, just keep track of the raw counts so that you can re-normalize when there's new data.
You can do this in a tool like Matlab pretty easily, but be careful because many of these tools will very kindly auto-determine bin widths for you. So, the histogram for one file might have bins {x < 100, 100 <= x < 200, x >= 200} and another file, {x < 90, 90 <= x < 180, x >=180} and you won't be able to add the results together.
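The "same bins, raw counts, normalize last" recipe can be sketched in a few lines of Python. The bin edges, labels, and sample packet counts below are made up for illustration. In practice the per-flow packet counts would come from each capture file, for example parsed from nfdump output.

```python
from bisect import bisect_right
from collections import Counter

# Fixed bin definition shared by every file (illustrative choice):
# exactly 1 packet, 2 to 100 packets, 101 packets or more.
EDGES = (2, 101)
LABELS = ("1", "2-100", "101+")


def histogram(packet_counts):
    """Raw flow counts per bin for one file's per-flow packet counts."""
    counts = Counter()
    for packets in packet_counts:
        counts[LABELS[bisect_right(EDGES, packets)]] += 1
    return counts


def percentages(total_counts):
    """Normalize raw counts to percentages; do this only as the last step."""
    total = sum(total_counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * total_counts[label] / total for label in LABELS}


file_a = histogram([1, 1, 5, 250])       # hypothetical flows from one file
file_b = histogram([1, 40, 3000, 101])   # hypothetical flows from another
combined = file_a + file_b               # raw counts add up bin by bin
profile = percentages(combined)
```

Averaging the per-file percentages instead would weight a quiet 5-minute file the same as a busy one, which is why the raw counts are kept until the end.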
