I want to rank all the entities in a list based on two variables (both percentages). One of the variables is 'the bigger the better' (x) and the other is 'the smaller the better' (y). What is the best way to give each entity a score in order to rank them?
I tried x*(1-y), but since some of the y values are over 1, the resulting negative values caused errors.
Below is the data:
x y
a 0.953882755 0.926422663
b 0.757267676 0.926967001
c 1 1.01607838
d 0.89805254 1.008814817
e 0.672989727 0.932579014
f 0.643306278 0.924523932
g 0.621091809 0.935122957
h 0.56891321 0.918181342
i 0.563662125 0.924102288
j 0.579410248 0.946421415
k 0.781299906 1.040418561
l 0.490013047 0.920900829
m 0.475050754 0.932586282
n 0.505211144 0.972570665
o 0.566582462 1.009732948
p 0.610994363 1.031047605
q 0.686065983 1.060742126
r 0.47642017 0.983301498
s 0.463552006 0.976645044
t 0.551532341 1.025816246
u 0.478092524 1.012675037
v 0.645790431 1.084143812
w 0.390365014 1.189518019
Two ways: averaged ranking, or sorting by distance from min & max.
Averaged ranking:
Use =RANK.AVG() on x and y separately (descending for x since bigger is better, ascending for y since smaller is better). Take the average of the two ranks, then rank again based on that average.
Sorting by distance from min & max:
Enter '=(B2-MIN(B:B)) + (MAX(C:C)-C2)' and drag downwards. This is the distance from the worst x plus the distance from the worst y, so a larger total is better; then use =RANK.AVG() on the results with the larger values ranking first.
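For reference, a minimal pandas sketch of both approaches (the column names and the three-row frame are my assumptions, not part of the question):

```python
import pandas as pd

# Hypothetical frame mirroring the data above (first three rows only).
df = pd.DataFrame(
    {"x": [0.953882755, 0.757267676, 1.0],          # bigger is better
     "y": [0.926422663, 0.926967001, 1.01607838]},  # smaller is better
    index=["a", "b", "c"],
)

# Method 1: average the two ranks (x descending, y ascending), then re-rank.
avg_rank = (df["x"].rank(ascending=False) + df["y"].rank(ascending=True)) / 2
df["rank_by_avg"] = avg_rank.rank()

# Method 2: distance from the worst value on each axis; larger total is better.
score = (df["x"] - df["x"].min()) + (df["y"].max() - df["y"])
df["rank_by_dist"] = score.rank(ascending=False)

print(df)
```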
Hope this helps.
Definition of H Index used in this algorithm
Suppose a relational expression is represented as y = F(x1, x2, ..., xn), where F returns an integer greater than 0, and the function finds the maximum value y satisfying the condition that at least y of its arguments are not less than y. Hence, the H-index of any node i is defined as
H(i) = F(k_j1, k_j2, ..., k_jki)

where k_j1, k_j2, ..., k_jki are the degrees of the neighboring nodes of node i (there are ki of them, ki being the degree of node i).
Now I want to find the H-index of the nodes of the following graph using the algorithm given below:
Graph: (figure omitted)
Code (written in Python with NetworkX):
def hindex(g, n):
    nd = {}
    h = 0
    # print(len(list(g.neighbors(n))))
    for v in g.neighbors(n):
        # nd[v] = len(list(g.neighbors(v)))
        nd[v] = g.degree(v)
    snd = sorted(nd.values(), reverse=True)
    for i in range(0, len(snd)):
        h = i
        if snd[i] < i:
            break
    # print("H index of " + str(n) + " : " + str(h))
    return h
Problem:
This algorithm returns the wrong values for nodes 1, 5, 8 and 9.
Actual values:
Nodes 1-6: H-index = 2
Nodes 7-9: H-index = 1
But for nodes 1 and 5 I am getting 1, and for nodes 8 and 9 I am getting 0.
Any leads on where I am going wrong will be highly appreciated!
Try this:
def hindex(g, n):
    sorted_neighbor_degrees = sorted((g.degree(v) for v in g.neighbors(n)), reverse=True)
    h = 0
    for i in range(1, len(sorted_neighbor_degrees) + 1):
        if sorted_neighbor_degrees[i - 1] < i:
            break
        h = i
    return h
There's no need for the intermediate dictionary: just build a non-increasing list of neighbor degrees and calculate the h-index as usual. The bug in your version is that the loop mixes 0-based indexing with the 1-based definition of the h-index, so h ends up off by one (and 0 for a node with a single neighbor).
The reason for 'i - 1' is just that our arrays are 0-indexed, while h-index is based on rankings (i.e. the k largest values) which are 1-indexed.
From the definition of h-index: For a non-increasing function f, h(f) is max i >= 0 such that f(i) >= i. This is, equivalently, the min i >= 1 such that f(i) < i, minus 1. Here, f(i) is equal to sorted_neighbor_degrees[i - 1]. There are of course many other ways (with different time and space requirements) to calculate h.
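As a quick sanity check (this toy graph and the expected values are mine, not the graph from the question):

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4)])

for n in sorted(g.nodes):
    print(n, hindex(g, n))
# 1 2   <- neighbors 2 and 3 have degrees 2 and 3: two neighbors with degree >= 2
# 2 2
# 3 2
# 4 1
```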
I am trying to multiply the following:
A batch of matrices N x M x D
A batch of vectors N x D x 1
To get a result: N x M x 1
as if I were doing N separate matrix-vector products, each (M x D) times (D x 1).
I can't seem to find the correct function in PyTorch.
torch.bmm, as far as I can tell, only works for a batch of vectors and a single matrix. If I have to use torch.einsum then so be it, but I'd rather not!
It's pretty straightforward and intuitive with einsum:
torch.einsum('ijk, ikl->ijl', mats, vecs)
But your operation is just:
mats @ vecs
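A minimal shape check (the tensor names follow the answer; the sizes are arbitrary):

```python
import torch

N, M, D = 4, 3, 5
mats = torch.randn(N, M, D)
vecs = torch.randn(N, D, 1)

out_einsum = torch.einsum('ijk, ikl->ijl', mats, vecs)
out_matmul = mats @ vecs  # torch.matmul broadcasts over the leading batch dim

print(out_matmul.shape)                        # torch.Size([4, 3, 1])
print(torch.allclose(out_einsum, out_matmul))  # True
```

torch.bmm(mats, vecs) gives the same result here, since both operands are 3-D batches.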
I have a custom (discrete) probability distribution defined roughly in the form f(x) / sum(f(x') for x' in X), where X is a given discrete set and 0 <= x <= 1.
I have been trying to implement it in Python 3.8.2, and the problem is that the numerator and denominator both come out really small, so Python's floating-point representation rounds them to 0.0.
After calculating these probabilities, I need to sample a random element from an array, whose each index may be selected with the corresponding probability in the distribution. So if my distribution is [p1,p2,p3,p4], and my array is [a1,a2,a3,a4], then probability of selecting a2 is p2 and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use np.random.beta() in this case? The only differences between the beta distribution and my actual distribution are the normalization constant and that the domain is restricted to a few points.
Note: the probability mass function defined above actually comes from Bayes' theorem, and f(x) = x^s * (1-x)^t, where s and t are fixed numbers for a given iteration (writing t for the second exponent so it does not clash with the function name f). So the exact problem is that when s or t becomes really large, everything underflows to 0.
You could well compute things by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say

f(x) = x^s * (1-x)^t

so

logf(x) = s*log(x) + t*log(1-x)

and you want to compute, say,

p = f(x) / Sum{ y in X | f(y) }

so

p = exp( logf(x) - log( Sum{ y in X | f(y) } ) )
  = exp( logf(x) - log( Sum{ y in X | exp(logf(y)) } ) )
The only difficulty is in computing the second term, but this is the common logsumexp problem (scipy, for example, provides scipy.special.logsumexp). On the other hand, computing logsumexp is easy enough to do by hand.
We want

S = log( Sum{ i | exp(l[i]) } )

If L is the maximum of the l[i], then

S = log( exp(L) * Sum{ i | exp(l[i]-L) } )
  = L + log( Sum{ i | exp(l[i]-L) } )
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
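In code, this basic version is just a few lines (a sketch; l is a 1-D array of log-values, as in the text above):

```python
import numpy as np

def logsumexp_basic(l):
    # S = L + log(sum(exp(l[i] - L))), with L = max(l) so no term overflows
    L = np.max(l)
    return L + np.log(np.sum(np.exp(l - L)))
```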
This may however lose a little accuracy. A refinement would be to recognize the set A of indices where

l[i] >= L - eps (eps a user-set parameter, e.g. 1)

and then compute

N = Sum{ i in A | exp(l[i]-L) }
B = log1p( Sum{ i not in A | exp(l[i]-L) } / N )
S = L + log(N) + B
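Putting it all together for the sampling part of the question, here is a minimal sketch (the names xs, s, t are mine, and it uses scipy.special.logsumexp rather than the hand-rolled version):

```python
import numpy as np
from scipy.special import logsumexp

def sample_from_custom(xs, s, t, size=1, rng=None):
    # Sample from p(x) proportional to x**s * (1-x)**t over the discrete set xs.
    rng = np.random.default_rng() if rng is None else rng
    xs = np.asarray(xs, dtype=float)
    logf = s * np.log(xs) + t * np.log1p(-xs)  # log f(x); assumes 0 < x < 1
    p = np.exp(logf - logsumexp(logf))         # normalized, never all-zero
    return rng.choice(xs, size=size, p=p)

# Even with huge exponents the probabilities stay well-defined:
print(sample_from_custom(np.linspace(0.01, 0.99, 50), s=5000, t=3000, size=5))
```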
I want to create an array A = [1, 1, 2, 2, 2, 5, 5, 5, ...] with numbers from [a, b] such that:

1. a histogram where the Y-axis is the frequency of each number in the array and the X-axis is [a, b] resembles a bell curve;
2. the sum of frequency(i) * i over all i in [a, b] is approximately equal to a large number K.
Many functions are available in Python, like numpy.random.normal or scipy.stats.truncnorm, but I am not able to fully understand their use and how they could help me create such an array.
The first point is easy. For the second point, I'm assuming you want the "integral" of freq * x to be close to K (making each individual x * freq(x) ~ K is mathematically impossible). You can do that by adjusting the sample size.
First step: bell-curve-shaped integer numbers between a and b; use scipy.stats.truncnorm. From the docs:
Notes

The standard form of this distribution is a standard normal truncated to the range [a, b] --- notice that a and b are defined over the domain of the standard normal. To convert clip values for a specific mean and standard deviation, use:

a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
Take a normal in the -3, 3 range, so the curve is nice. Adjust mean and standard deviation so -3, 3 becomes a, b:
import numpy as np
from scipy.stats import truncnorm

a, b = 10, 200
loc = (a + b) / 2    # centre of [a, b]
scale = (b - a) / 6  # so that -3, 3 maps onto a, b
f = truncnorm(-3, 3, loc=loc, scale=scale)
Now, since frequency is proportional to the probability density function, sum(freq(i) * i) ~ n * sum(pdf(i) * i). Therefore n = K / sum(pdf(i) * i), which can be obtained as:
K = 200000
i = np.arange(a, b + 1)
n = int(K / i.dot(f.pdf(i)))
Now generate integer random samples and check the result:
samples = f.rvs(size=n).astype(int)  # note: np.int is removed in newer numpy; use int

import matplotlib.pyplot as plt
plt.hist(samples, bins=20)
plt.show()

print(np.histogram(samples, bins=b - a + 1)[0].dot(np.arange(a, b + 1)))
>> 200315
I am trying to fit the data below to the form given under "Formula". I am most interested in 'c' (I know that c ≈ 1/8 and b ≈ 3) but would like to extract all of these values from the data.
Formula:
y = a*(x-b)**c
Values.txt:
# "values.txt"
2.000000e+00 6.058411e-04
2.200000e+00 5.335520e-04
2.400000e+00 3.509583e-03
2.600000e+00 1.655943e-03
2.800000e+00 1.995418e-03
3.000000e+00 9.437851e-04
3.200000e+00 5.516159e-04
3.400000e+00 6.765981e-04
3.600000e+00 3.860859e-04
3.800000e+00 2.942881e-04
4.000000e+00 5.039975e-04
4.200000e+00 3.962199e-04
4.400000e+00 4.659717e-04
4.600000e+00 2.892683e-04
4.800000e+00 2.248839e-04
5.000000e+00 2.536980e-04
I have tried using the following commands in gnuplot, however I am not getting meaningful results:
f(x) = a*(x-b)**c
b = 3
c = 1./8    # note: plain 1/8 is integer division in gnuplot and evaluates to 0
fit f(x) "values.txt" via a,b,c
Does anyone know the best way to extract these values? I would rather not provide initial guesses for 'b' & 'c' if possible.
Thanks,
J
The main problem with your fitting function is finding b. You can express your equation as a linear function in log(x-b), after which the fitting is trivial:
b = 3
f(x) = c0 + c1 * x
fit f(x) "values.txt" using (log($1-b)):(log($2)) via c0, c1
a = exp(c0)   # since log(y) = log(a) + c*log(x-b)
c = c1
As you see, you need to provide b but do not need initial guesses for the other parameters because it's a trivial linear fit.
Now, I would suggest that you provide a series of values of b and check how good the fit is for each one. gnuplot reports the error in each fitted parameter, so you can plot the overall error (error_c0 + error_c1) as a function of b and find the value for which it is minimum. Near the optimum, the curve of error_c0 + error_c1 vs b should be roughly quadratic, with its minimum at b_opt. Then run the fit as in the code above with b = b_opt and read off a and c.
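If it helps, the same scan can be sketched in Python with numpy.polyfit instead of gnuplot (same linearization; "values.txt" as above; the range of b and the way the errors are combined are my choices):

```python
import numpy as np

x, y = np.loadtxt("values.txt", unpack=True)

best = None
for b in np.linspace(0.0, 3.5, 71):
    mask = x > b + 1e-9            # log(x - b) is only defined for x > b
    if mask.sum() < 5:             # need enough points for a meaningful fit
        continue
    coeffs, cov = np.polyfit(np.log(x[mask] - b), np.log(y[mask]), 1, cov=True)
    err = np.sqrt(cov[0, 0]) + np.sqrt(cov[1, 1])  # combined parameter error
    if best is None or err < best[0]:
        best = (err, b, coeffs)

err, b_opt, (c, log_a) = best
print(f"b ~ {b_opt:.2f}, c ~ {c:.3f}, a ~ {np.exp(log_a):.4g}")
```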