HyperLogLog intersection: why not use min? - hyperloglog

When doing a union between two compatible HyperLogLog objects, you can just take the maximum bucket to do a lossless union that doesn't introduce any new error:
Union.Bucket[i] = Max(A.Bucket[i], B.Bucket[i])
When doing an intersection though, you have to use the inclusion-exclusion principle:
IntersectionCountEstimate = A.CountEstimate() + B.CountEstimate() - Union.CountEstimate()
Why is it that using the minimum bucket value doesn't work as an effective intersection?
Intersection.Bucket[i] = Min(A.Bucket[i], B.Bucket[i])

The cause is that the relationship between two instances of the HyperLogLog statistic is not very intuitive:
Consider two corresponding buckets A[i] and B[i] from separate HyperLogLog structures A and B (which have the same number of buckets and use the same hash function), and for simplicity's sake assume the data in A and in B are independently drawn from the same distribution. Let's assume we first draw all the elements for A, and only then draw elements for B.
For every element we observe reaching B[i], what is the probability that is it in the intersection of A and B, i.e. what is the probability that it is already in A[i]? Well that depends - how "full" is A[i]? If A[i] is completely "full" (i.e., A[i] "contains" ALL the elements from the distribution which can reach A[i]), then the probability is 1. In that case, the cardinality of the intersection of A[i] and B[i] would indeed be the cardinality of B[i]. However, it is almost NEVER the case that A[i] is "full" - so the intersection is MUCH SMALLER than the cardinality of B[i].

Related

Reaching nth Stair

total number of ways to reach the nth floor with following types of moves:
Type 1 in a single move you can move from i to i+1 floor – you can use the this move any number of times
Type 2 in a single move you can move from i to i+2 floor – you can use this move any number of times
Type 3 in a single move you can move from i to i+3 floor – but you can use this move at most k times
i know how to reach nth floor by following step 1 ,step 2, step 3 any number of times using dp like dp[i]=dp[i-1]+dp[i-2]+dp[i-3].i am stucking in the condition of Type 3 movement with atmost k times.
someone tell me the approach here.
While modeling any recursion or dynamic programming problem, it is important to identify the goal, constraints, states, state function, state transitions, possible state variables and initial condition aka base state. Using this information we should try to come up with a recurrence relation.
In our current problem:
Goal: Our goal here is to somehow calculate number of ways to reach floor n while beginning from floor 0.
Constraints: We can move from floor i to i+3 at most K times. We name it as a special move. So, one can perform this special move at most K times.
State: In this problem, our situation of being at a floor could be one way to model a state. The exact situation can be defined by the state variables.
State variables: State variables are properties of the state and are important to identify a state uniquely. Being at a floor i alone is not enough in itself as we also have a constraint K. So to identify a state uniquely we want to have 2 state variables: i indicating floor ranging between 0..n and k indicating number of special move used out of K (capital K).
State functions: In our current problem, we are concerned with finding number of ways to reach a floor i from floor 0. We only need to define one function number_of_ways associated with corresponding state to describe the problem. Depending on problem, we may need to define more state functions.
State Transitions: Here we identify how can we transition between states. We can come freely to floor i from floor i-1 and floor i-2 without consuming our special move. We can only come to floor i from floor i-3 while consuming a special move, if i >=3 and special moves used so far k < K.
In other words, possible state transitions are:
state[i,k] <== state[i-1,k] // doesn't consume special move k
state[i,k] <== state[i-2,k] // doesn't consume special move k
state[i,k+1] <== state[i-3, k] if only k < K and i >= 3
We should now be able to form following recurrence relation using above information. While coming up with a recurrence relation, we must ensure that all the previous states needed for computation of current state are computed first. We can ensure the order by computing our states in the topological order of directed acyclic graph (DAG) formed by defined states as its vertices and possible transitions as directed edges. It is important to note that it is only possible to have such ordering if the directed graph formed by defined states is acyclic, otherwise we need to rethink if the states are correctly defined uniquely by its state variables.
Recurrence Relation:
number_of_ways[i,k] = ((number_of_ways[i-1,k] if i >= 1 else 0)+
(number_of_ways[i-2,k] if i >= 2 else 0) +
(number_of_ways[i-3,k-1] if i >= 3 and k < K else 0)
)
Base cases:
Base cases or solutions to initial states kickstart our recurrence relation and are sufficient to compute solutions of remaining states. These are usually trivial cases or smallest subproblems that can be solved without recurrence relation.
We can have as many base conditions as we require and there is no specific limit. Ideally we would want to have a minimal set of base conditions, enough to compute solutions of all remaining states. For the current problem, after initializing all not computed solutions so far as 0,
number_of_ways[0, 0] = 1
number_of_ways[0,k] = 0 where 0 < k <= K
Our required final answer will be sum(number_of_ways[n,k], for all 0<=k<=K).
You can use two-dimensional dynamic programming:
dp[i,j] is the solution value when exactly j Type-3 steps are used. Then
dp[i,j]=dp[i-1,j]+dp[i-2,j]+dp[i-3,j-1], and the initial values are dp[0,0]=0, dp[1,0]=1, and dp[3*m,m]=m for m<=k. You can build up first the d[i,0] values, then the d[i,1] values, etc. Or you can do a different order, as long as all necessary values are already computed.
Following #LaszloLadanyi approach ,below is the code snippet in python
def solve(self, n, k):
dp=[[0 for i in range(k+1)]for _ in range(n+1)]
dp[0][0]=1
for j in range(k+1):
for i in range(1,n+1):
dp[i][j]+=dp[i-1][j]
if i>1:
dp[i][j]+=dp[i-2][j]
if i>2 and j>0:
dp[i][j]+=dp[i-3][j-1]
return sum(dp[n])

Search and remove algorithm

Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g 25) are unknown. So the algorithm must obtain this by looking at the values. In addition, the number of values, and where the outliers are in the array are not guaranteed. There may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis. (green-line) There are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
arr = [0,25,50,60,75,100]
First construct the distances array
dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array in 3 part: the main , the upper values and the lower values. I arbitrary set them to 5%.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now got indexes where the value is different from the others.
So, dist[2] has a problem, which mean by construction the problem is between arr[2] and arr[2+1]
I don't know if you want to remove 1 or more numbers from this array. So I think the way to solve this problem will be like this:
array A[] = [0,25,50,60,75,100];
sort array (if needed).
create a new array B[] with value i-th: B[i] = A[i+1] - A[i]
find the value of B[] elements that appear most time. It's will be our distance.
find i such that A[i+1]-A[i] != distance
find k (k>i and k min) such that A[i+k]-A[i] == distance
so, we need remove A[i+1] => A[i+k-1]
I hope it is right.

TSP / CPP variant - subtour constraint

I'm developing an optimization problem that is a variant on Traveling Salesman. In this case, you don't have to visit all the cities, there's a required start and end point, there's a min and max bound on the tour length, you can traverse each arc multiple times if you want, and you have a nonlinear objective function that is associated with the arcs traversed (and number of times you traverse each arc). Decision variables are integers, how many times you traverse each arc.
I've developed a nonlinear integer program in Pyomo and am getting results from the NEOS server. However I didn't put in subtour constraints and my results are two disconnected subtours.
I can find integer programming formulations of TSP that say how to formulate subtour constraints, but this is a little different from the standard TSP and I'm trying to figure out how to start. Any help that can be provided would be greatly appreciated.
EDIT: problem formulation
50 arcs , not exhaustive pairs between nodes. 50 Decision variables N_ab are integer >=0, corresponds to how many times you traverse from a to b. There is a length and profit associated with each N_ab . There are two constraints that the sum of length_ab * N_ab for all ab are between a min and max distance. I have a constraint that the sum of N_ab into each node is equal to the sum N_ab out of the node you can either not visit a node at all, or visit it multiple times. Objective function is nonlinear and related to the interaction between pairs of arcs (not relevant for subtour).
Subtours: looking at math.uwaterloo.ca/tsp/methods/opt/subtour.htm , the formulation isn't applicable since I am not required to visit all cities, and may not be able to. So for example, let's say I have 20 nodes and 50 arcs (all arcs length 10). Distance constraints are for a tour of exactly length 30, which means I can visit at most three nodes (start at A -> B -> C ->A = length 30). So I will not visit the other nodes at all. TSP subtour elimination would require that I have edges from node subgroup ABC to subgroup of nonvisited nodes - which isn't needed for my problem
Here is an approach that is adapted from the prize-collecting TSP (e.g., this paper). Let V be the set of all nodes. I am assuming V includes a depot node, call it node 1, that must be on the tour. (If not, you can probably add a dummy node that serves this role.)
Let x[i] be a decision variable that equals 1 if we visit node i at least once, and 0 otherwise. (You might already have such a decision variable in your model.)
Add these constraints, which define x[i]:
x[i] <= sum {j in V} N[i,j] for all i in V
M * x[i] >= N[i,j] for all i, j in V
In other words: x[i] cannot equal 1 if there are no edges coming out of node i, and x[i] must equal 1 if there are any edges coming out of node i.
(Here, N[i,j] is 1 if we go from i to j, and M is a sufficiently large number, perhaps equal to the maximum number of times you can traverse one edge.)
Here is the subtour-elimination constraint, defined for all subsets S of V such that S includes node 1, and for all nodes i in V \ S:
sum {j in S} (N[i,j] + N[j,i]) >= 2 * x[i]
In other words, if we visit node i, which is not in S, then there must be at least two edges into or out of S. (A subtour would violate this constraint for S equal to the nodes that are on the subtour that contains 1.)
We also need a constraint requiring node 1 to be on the tour:
x[1] = 1
I might be playing a little fast and loose with the directional indices, i.e., I'm not sure if your model sets N[i,j] = N[j,i] or something like that, but hopefully the idea is clear enough and you can modify my approach as necessary.

Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

I was looking through a programming question, when the following question suddenly seemed related.
How do you convert a string to another string using as few swaps as follows. The strings are guaranteed to be interconvertible (they have the same set of characters, this is given), but the characters can be repeated. I saw web results on the same question, without the characters being repeated though.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
Thanks!
This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is, given a n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that n-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j1, ..., jk are pairwise distinct in {1, ..., m}, the cycle (j1 ... jk) is a permutation that maps ji to ji + 1 for i in {1, ..., k - 1}, maps jk to j1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j1 ... jk) can be written as the product (j1 jk) (j1 jk - 1) ... (j1 j2) of k - 1 cycles. In general, every permutation can be written as a composition of m swaps minus the number of cycles comprising its cycle decomposition. A straightforward induction proof shows that this is optimal.
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., n}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j1 ... jk) of q, there is a part whose directed cycle is comprised of the arcs labeled j1, ..., jk.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(mn), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
#article{DBLP:journals/corr/GutinJSW14,
author = {Gregory Gutin and
Mark Jones and
Bin Sheng and
Magnus Wahlstr{\"o}m},
title = {Parameterized Directed \$k\$-Chinese Postman Problem and \$k\$
Arc-Disjoint Cycles Problem on Euler Digraphs},
journal = {CoRR},
volume = {abs/1402.2137},
year = {2014},
ee = {http://arxiv.org/abs/1402.2137},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding out the largest number of cycles in a directed graph, which, I think, is NP-hard, (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights )
Delete it
Repeat until all vertexes have not been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Edit:
Please refer to David's answer for a fuller and correct explanation of the problem.
Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.

What is the complexity of this code?

I have a the above model represented in a Face Table List where the F1, F2,...F_n are the faces of the model and their face number is the index of the list array. Each list element is another array of 3 vertices. And each vertex is an array of 3 integers representing its x,y,z coordinates.
I want to find out all the neighbouring faces of the vertex with coordinates (x2, y2, z2). I came out with this code that I believe would do the task:
List faceList; //say the faceList is the table in the picture above.
int[] targetVertex = {x2, y2, z2}; //say this is the vertex I want to find with coordinates (x2, y2, z2)
List faceIndexFoundList; //This is the result, which is a list of index of the neighbouring faces of the targetVertex
for(int i=0; i<faceList.length; i++) {
bool vertexMatched = true;
for(int j=0; j<faceList[i].length; j++) {
if(faceList[i][j][0] != targetVertex[0] && faceList[i][j][1] != targetVertex[1] && faceList[i][j][2] != targetVertex[2]) {
vertexMatched = false;
break;
}
}
if(vertexMatched == true) {
faceIndexFoundList.add(i);
}
}
I was told that the complexity to do the task is O(N^2). But with the code that I have, it looks like only O(N). The length of targetVertex is 3 since there is only 3 vertices per polygon. So, the second inner loop is merely a constant. Then, I left only with the outer for loop, which is then O(N) only.
What is the complexity of the code that I have above? What could I have done wrong?
The complexity is (aproximatly) faceList.length * faceList[i].length, these are independent, but can both grow very large, and as they grow they will each approch infinity at which point (conceptually) they will converge on n, resulting in the complexity being O(n^2)
If the vertex list is explicitly limited to 3, then the complexity becomes faceList[i].length * 3, which is O(n)
It's pretty obvious that in the worst case you must look at each vertex of each polygon.
This is just O(size of the table) in your post, which in turn is the sum of all row lengths or the sum of all polygon vertex counts, whichever you prefer.
If you say polygons have no more than m vertices and there are n polygons, then the algorithm is O(mn).
FWIW it's possible to get the answer with no searching at all with a more sophisticated data structure. See for example the winged edge data structure and others. In this case, you just go to the vertex you're interested in and traverse the links that connect all adjacent polygons. Cost is constant for each polygon in the output.
These fancier data structures for polygonal meshes support lots of frequently used operations with wonderful efficiency.
From Wikipedia:
Big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity.
In this single case you might only be running the one for loop. But what happens when the number of vertices of the polygon approaches infinity? Do the majority of the cases cause the second for loop to run, or to break? This will determine whether your function is O(n) or O(n^2).

Resources