Count number of repeated elements in list considering the ones larger than them - python-3.x

I am trying to do some clustering analysis on a dataset. I am using a number of different approaches to estimate the number of clusters, then I put what every approach gives (number of clusters) in a list, like so:
total_pred = [0, 0, 1, 1, 0, 1, 1]
Now I want to estimate the real number of clusters, so I let the methods above vote. For example, above, more models found 1 cluster than 0, so I take 1 as the real number of clusters.
I do this by:
import numpy as np

counts = np.bincount(np.array(total_pred))
real_nr_of_clusters = np.argmax(counts)
There is a problem with this method, however. If the above list contains something like:
[2, 0, 1, 0, 1, 0, 1, 0, 1]
I will get 0 clusters as the answer, since 0 is repeated most often. However, if one model found 2 clusters, it's safe to assume it considers that at least 1 cluster is there, hence the real number would be 1.
How can I do this by modifying the above snippet?
To make the problem clear, here are a few more examples:
[1, 1, 1, 0, 0, 0, 3]
should return 1,
[0, 0, 0, 1, 1, 3, 4]
should also return 1 (since most of them agree there is AT LEAST 1 cluster).

There is a problem with your logic
Here is an implementation of the described algorithm.
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
l = sorted(l, reverse=True)
votes = {x: i for i, x in enumerate(l, start=1)}
Output
{2: 1, 1: 5, 0: 9}
Notice that since you define a vote as agreeing with anything smaller than itself, then min(l) will always win, because everyone will agree that there are at least min(l) clusters. In this case min(l) == 0.
How to fix it
Mean and median
First, note that taking the mean or the median are both valid, lightweight options that satisfy the desired output on your examples.
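For instance, a quick check on the two examples from the question (using numpy, as in your snippet):

import numpy as np

for votes in ([1, 1, 1, 0, 0, 0, 3], [0, 0, 0, 1, 1, 3, 4]):
    print(int(np.round(np.mean(votes))), int(np.median(votes)))
# prints "1 1" for both lists: mean and median both land on 1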
Bias
However, taking the mean might not be what you want if, say, you encounter votes with high variance such as [0, 0, 7, 8, 10], where it is unlikely that the answer is 5.
A more general way to fix that is to include a voter's bias toward votes close to their own. Surely a voter who said 2 will agree more with a 1 than with a 0.
You do that by implementing a metric (note: this is not a metric in the mathematical sense) that determines how much an instance that voted for x is willing to agree to a vote for y on a scale of 0 to 1.
Note that this approach will allow voters to agree on a number that is not on the list.
We need to update our code to account for applying that pseudometric.
def d(x, y):
    return x <= y

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
Output
{0: 9, 1: 5, 2: 1}
The above metric is a sanity check. It is the one you provided in your question, and it indeed ends up determining that 0 wins.
Metric choices
You will have to toy a bit with your metrics, but here are a few which may make sense.
Inverse of the linear distance
def d(x, y):
    return 1 / (1 + abs(x - y))

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.33, 1: 6.5, 2: 4.33}
Inverse of the nth power of the distance
This one is a generalization of the previous. As n grows, voters tend to agree less and less with distant vote casts.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 5.25, 2: 2.44}
Upper-bound distance
Similar to the previous metric, this one is close to what you described at first in the sense that a voter will never agree to a vote higher than theirs.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n if x >= y else 0

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 4.25, 2: 1.0}
Normal distribution
Another option that would make sense is a normal distribution or a skewed normal distribution.
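For instance, here is a sketch of the (unskewed) Gaussian variant; sigma is a tuning parameter of my own choosing, not something prescribed above:

import math

def d(x, y, sigma=1.0):
    # Gaussian kernel: agreement decays smoothly with the distance between votes
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.56, 1: 7.03, 2: 3.97}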

While the other answer provides a comprehensive review of possible metrics and methods, it seems what you are seeking is simply the number of clusters closest to the mean!
So something as simple as:
cluster_num = int(np.round(np.mean(total_pred)))
which returns 1 for all your cases, as you expect.

Related

Neighbors sum of numpy array with mask

I have two large arrays, one containing values, and one being a mask basically. The code below shows the function I want to implement.
from scipy.signal import convolve2d
import numpy as np

sample = np.array([[6, 4, 5, 5, 5],
                   [7, 1, 0, 8, 3],
                   [2, 5, 4, 8, 4],
                   [2, 0, 2, 6, 0],
                   [5, 7, 2, 3, 2]])
mask = np.array([[1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 1, 0],
                 [1, 1, 0, 0, 1]])

neighbors_sum = convolve2d(sample, np.ones((3, 3), dtype=int), mode='same', boundary='wrap')
# neighbors_sum = np.array([[40, 37, 35, 33, 44],
#                           [37, 34, 40, 42, 48],
#                           [24, 23, 34, 35, 40],
#                           [27, 29, 37, 31, 32],
#                           [31, 33, 34, 30, 34]])

result = np.where(mask, neighbors_sum, 0)
print(result)
This code works, and gets me what I expect:
np.array([[40,  0, 35, 33,  0],
          [ 0,  0, 40,  0, 48],
          [ 0, 23,  0,  0,  0],
          [ 0,  0,  0, 31,  0],
          [31, 33,  0,  0, 34]])
So far, so good. However, I encounter a large issue when I increase the size of the arrays. In my case, instead of a 5x5 input and a 3x3 summing mask, I need a 50,000x20,000 input and a 100x100 summing mask. When I move to that, the convolve2d function struggles and the calculation takes extremely long.
Given that I only care about the masked result, and thus only care about the summation from convolve2d at those points, can anyone think of a smart approach to take here? Going to a for loop and selecting only the points of interest would lose the speed advantage of the vectorization so I'm not convinced this would be worth it.
Any suggestion welcome!
convolve2d is very inefficient in this case. Since the mask is np.ones, you can split the filter in two trivial ones thanks to separable filtering: one np.ones((100, 1)) filter and one np.ones((1, 100)) filter. Moreover, a rolling sum can be used to speed up the computation even more.
Here is a simple solution without a rolling sum:
# Simple faster implementation
tmp = convolve2d(sample, np.ones((1,100), dtype=int), mode='same', boundary='wrap')
neighbors_sum = convolve2d(tmp, np.ones((100,1), dtype=int), mode='same', boundary='wrap')
result = np.where(mask, neighbors_sum, 0)
You can compute the rolling sum efficiently using Numba. The strategy is to split the computation in 3 parts: the horizontal rolling sum, the vertical rolling sum and the final masking. Each step can be fully parallelized using multiple threads (although parallelizing the vertical rolling sum is harder with Numba). Each part needs to work line by line so to be cache friendly.
# Complex very-fast implementation
import numpy as np
import numba as nb

# Numerical results may diverge if the input contains big
# values with many small ones.
# Does not support inputs containing NaN values or +/- Inf ones.
@nb.njit('float64[:,::1](float64[:,::1], int_)', parallel=True, fastmath=True)
def horizontalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Horizontal rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        lShift = fs // 2
        rShift = (fs - 1) // 2
        for j in range(m - lShift, m):
            s += sample[i, j]
        for j in range(0, rShift + 1):
            s += sample[i, j]
        tmp[i, 0] = s
        for j in range(1, m):
            jLeft, jRight = (j - 1 - lShift) % m, (j + rShift) % m
            s += sample[i, jRight] - sample[i, jLeft]
            tmp[i, j] = s
    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_)', fastmath=True)
def verticalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Vertical rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    tShift = fs // 2
    bShift = (fs - 1) // 2
    for j in range(m):
        tmp[0, j] = 0.0
    for i in range(n - tShift, n):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(0, bShift + 1):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(1, n):
        iTop = (i - 1 - tShift) % n
        iBot = (i + bShift) % n
        for j in range(m):
            tmp[i, j] = tmp[i - 1, j] + (sample[iBot, j] - sample[iTop, j])
    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_[:,::1], int_)', parallel=True, fastmath=True)
def compute(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        for j in range(m):  # iterate over columns (m), not rows (n)
            res[i, j] = neighbors_sum[i, j] * mask[i, j]
    return res
Benchmark & Notes
Here is the testing code:
n, m = 5000, 2000
sample = np.random.rand(n, m)
mask = (np.random.rand(n, m) < 0.05).astype(int)
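For reference, the Numba version above is then invoked like this (assuming the three jitted functions are defined; the signature expects a float64 sample and an integer mask):

result = compute(sample, mask, 100)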
Here are the results on my 6-core machine:
Initial solution: 174366 ms (x1)
With separate filters: 5710 ms (x31)
Final Numba solution: 40 ms (x4359)
Optimal theoretical time: 10 ms (optimistic)
Thus, the Numba implementation is 4359 times faster than the initial one.
That being said, be careful of possible numerical issues that this last implementation can have regarding the input array (see the comments in the code). It should be fine as long as np.std(sample) is relatively small and np.all(np.isfinite(sample)) is true.
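A cheap guard for the second condition (my addition, not part of the benchmark above):

assert np.all(np.isfinite(sample)), "the fastmath Numba version assumes no NaN/Inf"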
Note that the code can be further optimized: the vertical rolling sum can be parallelized; modulus operations can be avoided in the horizontal rolling sum; the vertical rolling sum and the masking steps can be merged together (i.e. by computing res on-the-fly and not storing tmp); tiling can be used to compute all the steps simultaneously in a more cache-friendly way. However, these optimizations make the code more complex, and some of them are very hard to perform (especially the last one with Numba).
Note that using a boolean mask (instead of an integer-based one) should make the algorithm faster since it takes less memory and processors can fetch values faster.
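A sketch of that change (my addition; Numba accepts boolean arrays in explicit signatures):

mask = np.random.rand(n, m) < 0.05   # keep the mask boolean: 1 byte per cell instead of 8
# and declare it as such in compute's signature:
# @nb.njit('float64[:,::1](float64[:,::1], boolean[:,::1], int_)', parallel=True, fastmath=True)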

Crossover and mutation in Differential Evolution

I'm trying to solve Traveling Salesman problem using Differential Evolution. For example, if I have vectors:
[1, 4, 0, 3, 2, 5], [1, 5, 2, 0, 3, 5], [4, 2, 0, 5, 1, 3]
how can I make crossover and mutation? I saw something like a+Fx(b-c), but I have no idea how to use this.
I ran into this question when looking for papers on solving the TSP problem using evolutionary computing. I have been working on a similar project and can provide a modified version of my written code.
For mutation, we can swap two indices in a vector. I assume that each vector represents an order of nodes you will visit.
import random

def swap(lst):
    n = len(lst)
    # pick two random positions (randint is inclusive, so use n - 1)
    x = random.randint(0, n - 1)
    y = random.randint(0, n - 1)
    # store values to be swapped
    old_x = lst[x]
    old_y = lst[y]
    # do swap
    lst[y] = old_x
    lst[x] = old_y
    return lst
For the case of crossover in respect to the TSP problem, we would like to keep the general ordering of values in our permutations (we want a crossover with a positional bias). By doing so, we will preserve good paths in good permutations. For this reason, I believe single-point crossover is the best option.
def singlePoint(parent1, parent2):
    point = random.randint(1, len(parent1) - 2)

    def helper(v1, v2):
        # this is a helper function to save with code duplication
        # take the values left of the crossover point from v1
        points = [v1[i] for i in range(0, point)]
        # add values from right of crossover point in v2
        # that are not already in points
        for i in range(point, len(v2)):
            pt = v2[i]
            if pt not in points:
                points.append(pt)
        # add values from head of v2 which are not in points
        # this ensures all values are in the vector.
        for i in range(0, point):
            pt = v2[i]
            if pt not in points:
                points.append(pt)
        return points

    # notice how I swap parent1 and parent2 on the second call
    offspring_1 = helper(parent1, parent2)
    offspring_2 = helper(parent2, parent1)
    return offspring_1, offspring_2
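A quick usage sketch with two hypothetical parent tours (any two permutations of the same cities work):

parent1 = [1, 4, 0, 3, 2, 5]
parent2 = [5, 2, 0, 4, 1, 3]
child1, child2 = singlePoint(parent1, parent2)
# both children are permutations of 0..5, i.e. valid tours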
I hope this helps! Even if your project is done, this could come in handy; GAs are a great way to solve optimization problems.
If F = 0.6, a = [1, 4, 0, 3, 2, 5], b = [1, 5, 2, 0, 3, 5], c = [4, 2, 0, 5, 1, 3],
then a + F*(b - c) = [-0.8, 5.8, 1.2, 0, 3.2, 6.2].
Then change the smallest number in the array to 0, change the second smallest number to 1, and so on,
so it returns [0, 4, 2, 1, 3, 5].
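A minimal NumPy sketch of that rank transform (my own illustration, not code from the original answer):

import numpy as np

a = np.array([1, 4, 0, 3, 2, 5])
b = np.array([1, 5, 2, 0, 3, 5])
c = np.array([4, 2, 0, 5, 1, 3])
F = 0.6

mutant = a + F * (b - c)                             # [-0.8, 5.8, 1.2, 0., 3.2, 6.2]
ranks = np.empty(len(mutant), dtype=int)
ranks[np.argsort(mutant)] = np.arange(len(mutant))   # smallest -> 0, next smallest -> 1, ...
print(ranks)                                         # [0 4 2 1 3 5]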
This method is inefficient when used to solve JSP problems.

How to return the last index of a group in a list Python

I have a list in following form:
[0, 0, 0, 0, 1, 1, 1, 0.6, 0.6, 0, 0, 0]
each of the items in the list is a small decimal number. I'm looking for a way of returning the last index position of each group. In the above example it would be something like:
0: 3, 1: 6, 0.6: 8, 0: 11
I'm fairly new to Python and I don't really know how to approach this.
itertools.groupby may be useful here: it deals with pretty much everything except tracking the indices, which isn't hard to do yourself:
import itertools

a = [0, 0, 0, 0, 1, 1, 1, 0.6, 0.6, 0, 0, 0]
i = 0
for val, group in itertools.groupby(a):
    for inst in group:
        # each element that is the same in sequence, increment index
        i += 1
    # after the inner for loop i will be the index of the first element of next group
    # so i - 1 is the index of last occurrence.
    print(val, i - 1)
# prints: 0 3, 1 6, 0.6 8, 0 11 (each on its own line)
If you are particularly clever with enumerate and variable unpacking, you can make this super short, although it's less obvious how it works.
import itertools
from operator import itemgetter

a = [0, 0, 0, 0, 1, 1, 1, 0.6, 0.6, 0, 0, 0]
# still group only by value but now use enumerate to have it keep track of indices
for val, group in itertools.groupby(enumerate(a), itemgetter(1)):
    # this is tuple unpacking: irrelevant is a list of the (index, value) pairs that
    # aren't the last one, and last is the pair we care about.
    [*irrelevant, last] = group
    print(last)
# prints: (3, 0), (6, 1), (8, 0.6), (11, 0) (each on its own line)
This answer is less intended to say "here's how you should do it" and more "this is some of the things that exist in python", happy coding :)
Try this:
a = [0, 0, 0, 0, 1, 1, 1, 0.6, 0.6, 0, 0, 0]
i = 0
while i < len(a):
    if i == len(a) - 1:
        print(str(a[i]) + ":" + str(i))
        break
    if a[i] != a[i + 1]:
        print(str(a[i]) + ":" + str(i))
    i = i + 1

Find number of ‘+’ formed by all ones in a binary matrix

The question I have is similar to the problem found here: https://www.geeksforgeeks.org/find-size-of-the-largest-formed-by-all-ones-in-a-binary-matrix/
The difference is that all cells outside the '+' must be zeros. For example:
00100
00100
11111
00100
00100
This will be a 5x5 matrix with 2 '+', one inside another.
Another example:
00010000
00010000
00010000
11111111
00010000
00010010
00010111
00010010
This matrix is 8x8 and will have 3 '+': one is the small 3x3 '+' in the bottom right, and the other 2 are formed from the 5x5 submatrix, one inside another, similar to the first example.
Using the code from the link above, I can only get so far:
M = [[0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 1, 0],
     [0, 0, 0, 1, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 1, 0]]
R = len(M)
N = len(M)
C = len(M[0])

left = [[0 for k in range(C)] for l in range(R)]
right = [[0 for k in range(C)] for l in range(R)]
top = [[0 for k in range(C)] for l in range(R)]
bottom = [[0 for k in range(C)] for l in range(R)]

for i in range(R):
    top[0][i] = M[0][i]
    bottom[N - 1][i] = M[N - 1][i]
    left[i][0] = M[i][0]
    right[i][N - 1] = M[i][N - 1]

for i in range(R):
    for j in range(1, R):
        if M[i][j] == 1:
            left[i][j] = left[i][j - 1] + 1
        else:
            left[i][j] = 1
        if M[j][i] == 1:
            top[j][i] = top[j - 1][i] + 1
        else:
            top[j][i] = 0
        j = N - 1 - j
        if M[j][i] == 1:
            bottom[j][i] = bottom[j + 1][i] + 1
        else:
            bottom[j][i] = 0
        if M[i][j] == 1:
            right[i][j] = right[i][j + 1] + 1
        else:
            right[i][j] = 0
        j = N - 1 - j

n = 0
for i in range(N):
    for j in range(N):
        length = min(top[i][j], bottom[i][j], left[i][j], right[i][j])
        if length > n:
            n = length
print(n)
Currently, it returns the arm length of the largest '+'. The desired output would be the number of '+' in the square matrix.
I am having trouble checking that all other cells in the matrix are zeros, and finding a separate '+' if there is one elsewhere in the matrix.
Any help is greatly appreciated.
I don't want to spoil the fun of solving this problem, so rather than a solution, here are some hints:
1. Try to write a sub-routine (a function) that, given a square matrix as input, decides whether this input matrix is a '+' or not (say the function returns a '1' if it is a '+' and a '0' otherwise).
2. Modify the function from 1. so that you can give it as input a submatrix of the full matrix (in which you want to count '+'). More specifically, the input could be the coordinate of the upper left entry of the submatrix and its size. The return value should be the same as for 1.
3. Can you write a loop that examines all the submatrices of your given matrix and counts the ones that are '+' using the function from 2.?
Here are some minor remarks: the algorithm this leads to runs in polynomial time (in the dimension of the input matrix), so it shouldn't take too long.
I haven't thought about it too much, but probably the algorithm can be made more efficient.
Also, you should think about whether or not a '1' that is surrounded by '0's counts as a '+' (the sketch below does count it).
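To get you started on hint 1, here is one possible sketch of such a function; this is my own illustration of the hint, not the only way to do it:

def is_plus(sub):
    # sub is a square list-of-lists; a '+' has its middle row and middle
    # column all ones and every other entry zero
    k = len(sub)
    if k % 2 == 0:  # a '+' needs a unique center, so the size must be odd
        return 0
    mid = k // 2
    for i in range(k):
        for j in range(k):
            expected = 1 if (i == mid or j == mid) else 0
            if sub[i][j] != expected:
                return 0
    return 1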

How to generate all the permutations of a multiset?

A multiset is a set in which elements may be repeated. How can I enumerate all the possible permutations of its elements?
Generating all the possible permutations and then discarding the repeated ones is highly inefficient. Various algorithms exist to directly generate the permutations of a multiset in lexicographical or some other ordering. Takaoka's algorithm is a good example, but that of Aaron Williams is probably better:
http://webhome.csc.uvic.ca/~haron/CoolMulti.pdf
Moreover, it has been implemented in the R package multicool.
Btw, if you just want the total number of distinct permutations, the answer is the Multinomial coefficient:
e.g., if you have, say, n_a elements 'a', n_b elements 'b', n_c elements 'c',
the total number of distinct permutations is (n_a+n_b+n_c)!/(n_a!n_b!n_c!)
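For example, a direct computation of that coefficient in Python (a small helper of my own):

from math import factorial

def multinomial(counts):
    # counts = the multiplicities of each distinct element
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

print(multinomial([3, 2, 1]))  # 'banana': 3 'a', 2 'n', 1 'b' -> 60 distinct permutations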
This is my translation of the Takaoka multiset permutations algorithm into Python (available here and at repl.it):
def msp(items):
    '''Yield the permutations of `items` where items is either a list
    of integers representing the actual items or a list of hashable items.
    The output are the unique permutations of the items given as a list
    of integers 0, ..., n-1 that represent the n unique elements in
    `items`.

    Examples
    ========

    >>> for i in msp('xoxox'):
    ...     print(i)

    [1, 1, 1, 0, 0]
    [0, 1, 1, 1, 0]
    [1, 0, 1, 1, 0]
    [1, 1, 0, 1, 0]
    [0, 1, 1, 0, 1]
    [1, 0, 1, 0, 1]
    [0, 1, 0, 1, 1]
    [0, 0, 1, 1, 1]
    [1, 0, 0, 1, 1]
    [1, 1, 0, 0, 1]

    Reference: "An O(1) Time Algorithm for Generating Multiset Permutations", Tadao Takaoka
    https://pdfs.semanticscholar.org/83b2/6f222e8648a7a0599309a40af21837a0264b.pdf
    '''
    def visit(head):
        (rv, j) = ([], head)
        for i in range(N):
            (dat, j) = E[j]
            rv.append(dat)
        return rv

    u = list(set(items))
    E = list(reversed(sorted([u.index(i) for i in items])))
    N = len(E)
    # put E into linked-list format
    (val, nxt) = (0, 1)
    for i in range(N):
        E[i] = [E[i], i + 1]
    E[-1][nxt] = None
    head = 0
    afteri = N - 1
    i = afteri - 1
    yield visit(head)
    while E[afteri][nxt] is not None or E[afteri][val] < E[head][val]:
        j = E[afteri][nxt]  # added to algorithm for clarity
        if j is not None and E[i][val] >= E[j][val]:
            beforek = afteri
        else:
            beforek = i
        k = E[beforek][nxt]
        E[beforek][nxt] = E[k][nxt]
        E[k][nxt] = head
        if E[k][val] < E[head][val]:
            i = k
        afteri = E[i][nxt]
        head = k
        yield visit(head)
sympy provides multiset_permutations.
from the doc:
>>> from sympy.utilities.iterables import multiset_permutations
>>> from sympy import factorial
>>> [''.join(i) for i in multiset_permutations('aab')]
['aab', 'aba', 'baa']
>>> factorial(len('banana'))
720
>>> sum(1 for _ in multiset_permutations('banana'))
60
There are O(1) (per permutation) algorithms for multiset permutation generation, for example, from Takaoka (with implementation)
Optimisation of smichr's answer: I unzipped the nxts to make the visit function more efficient with an accumulate() (the map() is faster than a list comprehension, and it seemed shallow and pedantic to have to nest it in a second one with a constant index).
from itertools import accumulate

def msp(items):
    def visit(head):
        '''(rv, j) = ([], head)
        for i in range(N):
            (dat, j) = E[j]
            rv.append(dat)
        return(rv)'''
        #print(reduce(lambda e,dontCare: (e[0]+[E[e[1]]],nxts[e[1]]),range(N),([],head))[0])
        #print(list(map(E.__getitem__,accumulate(range(N-1),lambda e,N: nxts[e],initial=head))))
        return(list(map(E.__getitem__,accumulate(range(N-1),lambda e,N: nxts[e],initial=head))))

    u = list(set(items))
    E = list(sorted(map(u.index, items)))
    N = len(E)
    nxts = list(range(1, N)) + [None]
    head = 0
    i, ai, aai = N-3, N-2, N-1
    yield visit(head)
    while aai != None or E[ai] > E[head]:
        beforek = (i if aai == None or E[i] > E[aai] else ai)
        k = nxts[beforek]
        if E[k] > E[head]:
            i = k
        nxts[beforek], nxts[k], head = nxts[k], head, k
        ai = nxts[i]
        aai = nxts[ai]
        yield visit(head)
Here are the test results (the second has (13!/2!/3!/3!/4!)/10! = 143/144 times as many permutations but takes longer due to being more of a multiset, I suppose); mine seems 9% and 7% faster respectively:
cProfile.run("list(msp(list(range(10))))")
cProfile.run("list(msp([0,1,1,2,2,2,3,3,3,3,4,4,4]))")
original:
43545617 function calls in 28.452 seconds
54054020 function calls in 32.469 seconds
modification:
39916806 function calls in 26.067 seconds
50450406 function calls in 30.384 seconds
I have insufficient reputation to comment upon answers, but for an items input list, Martin Böschen's answer has time complexity greater by a factor of the product of the factorials of the multiplicities of the element values, or
reduce(int.__mul__,map(lambda n: reduce(int.__mul__,range(1,n+1)),map(items.count,set(items))))
This can grow large quickly when computing large multisets with many occurrences. For instance, it will take 1728 times longer per permutation for my second example than my first.
You can reduce your problem to enumerating all permutations of a list. The typical permutation-generation algorithm takes a list and doesn't check whether elements are equal. So you only need to generate a list out of your multiset and feed it to your permutation-generating algorithm.
For example, you have the multiset {1, 2, 2}.
You transform it to the list [1, 2, 2].
And generate all permutations, for example in Python:
import itertools as it

for i in it.permutations([1, 2, 2]):
    print(i)
And you will get the output
(1, 2, 2)
(1, 2, 2)
(2, 1, 2)
(2, 2, 1)
(2, 1, 2)
(2, 2, 1)
The problem is that you get some permutations repeatedly. A simple solution would be just to filter them out:
import itertools as it

permset = set(it.permutations([1, 2, 2]))
for x in permset:
    print(x)
Output:
(1, 2, 2)
(2, 2, 1)
(2, 1, 2)
