Gini Coefficient in Julia: Efficient and Accurate Code

I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

G = 1 - ( Σ_{i=1}^{n} f(y_i) (S_{i-1} + S_i) ) / S_n

where

S_i = Σ_{j=1}^{i} f(y_j) y_j,  with S_0 = 0
Here's a simplified version of the code I'm using for this:
# Takes an array where the first column is the value of wages
# (y_i in the formula), and the second column is the probability
# of the wage value (f(y_i) in the formula).
function gini(wagedistarray)
    # First calculate S values in formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    # Now calculate value to subtract from 1 in gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    # Final step of gini calculation
    return giniwages = 1 - (Gwages/Swages[length(Swages)])
end

wagedistarray = zeros(10000,2)
Swages = zeros(length(wagedistarray[:,1]))
for i in 1:length(wagedistarray[:,1])
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end

@time result = gini(wagedistarray)
It gives a value near zero, which is what you'd expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 seconds. Any ideas for improvement?

Try this:
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:nrows
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    return 1 - (Gwages/Swages[length(Swages)])
end

wagedistarray = zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end

@time result = gini(wagedistarray)
Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
Time after: 0.134799301 seconds (507260 bytes allocated)
Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)
The primary problem is that Swages was a global variable (it wasn't living in the function), which is not only poor coding practice but, more importantly, a performance killer in Julia. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks for its length; that was generating some extra "garbage". The second run is faster because the very first call to a function includes compilation time.
You can crank performance even higher using @inbounds, i.e.
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    @inbounds for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    Gwages = Swages[1]*wagedistarray[1,2]
    @inbounds for i in 2:nrows
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    return 1 - (Gwages/Swages[length(Swages)])
end
which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated).
Finally, check out this version, which is actually faster than all of the above and is also, I think, the most accurate:
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
             sum(wagedistarray[2:end,2] .*
                 (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end
This has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit came from replacing the O(n^2) double for loop with the O(n) cumsum.

IainDunning has already provided a good answer, with code that is fast enough for practical purposes (the function gini2). If one enjoys performance tweaking, one can gain an additional speedup of a factor of about 20 by avoiding temporary arrays (gini3). The following code compares the performance of the two implementations:
using TimeIt

wagedistarray = zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end

wages = wagedistarray[:,1]
wagefrequencies = wagedistarray[:,2]

# original code
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
             sum(wagedistarray[2:end,2] .*
                 (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end

# new code
function gini3(wages, wagefrequencies)
    Swages_previous = wages[1]*wagefrequencies[1]
    Gwages = Swages_previous*wagefrequencies[1]
    @inbounds for i = 2:length(wages)
        freq = wagefrequencies[i]
        Swages_current = Swages_previous + wages[i]*freq
        Gwages += freq * (Swages_current+Swages_previous)
        Swages_previous = Swages_current
    end
    return 1.0 - Gwages/Swages_previous
end

result = gini2(wagedistarray) # warming up JIT
println("result with gini2: $result, time:")
@timeit result = gini2(wagedistarray)

result = gini3(wages, wagefrequencies) # warming up JIT
println("result with gini3: $result, time:")
@timeit result = gini3(wages, wagefrequencies)
The output is:
result with gini2: 0.0, time:
1000 loops, best of 3: 321.57 µs per loop
result with gini3: -1.4210854715202004e-14, time:
10000 loops, best of 3: 16.24 µs per loop
gini3 is somewhat less accurate than gini2 due to its sequential summation; one would have to use a variant of pairwise summation to increase accuracy (see the sketch below).
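For reference, here is a minimal sketch of pairwise summation (in Python for illustration; the function name and block size are arbitrary choices, not from the answer above):

def pairwise_sum(x, lo=0, hi=None, block=128):
    # Recursively split the range in half and sum the halves separately;
    # rounding error then grows roughly like O(log n) instead of O(n).
    if hi is None:
        hi = len(x)
    if hi - lo <= block:
        total = 0.0
        for i in range(lo, hi):  # small blocks are summed sequentially
            total += x[i]
        return total
    mid = (lo + hi) // 2
    return pairwise_sum(x, lo, mid, block) + pairwise_sum(x, mid, hi, block)

(Julia's built-in sum already uses a pairwise scheme internally for arrays.)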

Related

Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippet of code, which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to its closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD), where D is the dimension of the data points, so naturally I expected the running time to change proportionally when D increases or decreases. To my surprise, the time barely changes with D; for example, when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to call np.linalg.norm with the axis parameter directly (a sketch of this change follows the timings below); the performance is the following:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
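The edited loop isn't shown in the post; presumably it looks something like this (an assumed reconstruction based on the description above):

for i in range(n):
    # One vectorized NumPy call over all K centroids per point,
    # instead of K Python-level calls through apply_along_axis.
    distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
    membership[i] = np.argmin(distances)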
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times, and each call internally loops over the 250 rows, calling the target Python function (and hence np.linalg.norm) once per row; it is not vectorized. In the end, np.linalg.norm is called 20_000 * 250 = 5_000_000 times. The thing is, each call to a Numpy function typically costs about 1 µs of fixed overhead; on my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (on types and values), allocations, function calls, conversions, etc. At 4-5 µs per call, those 5_000_000 calls alone account for 20-25 seconds, which is most of the measured runtime.
There are two simple ways to reduce this overhead: vectorization, or a JIT compiler like Numba. The latter is often more efficient, as it avoids creating expensive big temporary arrays (a sketch of the vectorized option follows; the Numba version comes after it).
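For comparison, a fully vectorized variant might look like the following sketch (not from the original answer). Note that it materializes an (n, K, D) temporary, about 1.2 GB for the sizes above, which illustrates why the JIT route can be preferable:

# Broadcasting removes all Python-level loops, at the cost of a large
# (n, K, D) temporary array holding all pairwise differences.
diff = points[:, None, :] - centroids[None, :, :]
distances = np.sqrt((diff * diff).sum(axis=2))  # shape (n, K)
membership = np.argmin(distances, axis=1)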
Here is a much faster implementation, using Numba:
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)

n = 20000
D = 30
K = 250

points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)

compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays for centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j,k] - points[i,k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D, since the Numpy overheads are the main bottleneck in that case and the JIT-compiled implementation can remove them almost completely.
It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which runs over the K rows of centroids - points[i] (by the way, there is yet another loop here, in the broadcasting of that subtraction, but we don't need to count all of them for big-O considerations).
And the inner one is the loop over the D columns that occurs inside norm.
The inner one is obviously the most important to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough values of D, what we really see is mostly constant overhead (times N×K, since it sits inside a double loop). Your inefficient outer loops drive most of the cost, which then looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not quite as bad, but almost: it still calls Python code several times per row, so it is not vectorization.
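Roughly speaking, apply_along_axis behaves like the following simplified sketch (not NumPy's actual implementation):

def apply_along_axis_rough(f, arr):
    # Simplified axis=1 case for a 2-D array: a Python-level loop,
    # one call to f per row, so the per-call NumPy overhead is paid
    # K times for every point.
    return np.array([f(row) for row in arr])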
But, well, I bet that with D big enough, you'll see that it is O(NKD).
Edit:
Here is what I get when I increase D (with a smaller n, so that it remains computable in realistic time). (Plot omitted: runtime versus D, roughly a straight line with a positive intercept.)
You can see that it looks really linear (affine, to be accurate, since it doesn't pass through 0, which is why it didn't look linear to you). As explained above, most of the inner cost of the for/along_axis double loop is constant overhead when D is small; the part proportional to D only begins to show once that overhead becomes negligible.

Deciding if all intervals are overlapping

I'm working on a problem where n people are standing on a line, and each person knows their own position and speed. I'm asked to find the minimal time for all the people to gather at any one spot.
Basically, I binary-search on the minimal time; for a candidate time, I compute the interval of positions each person can reach within that time. If all the intervals overlap, there is a spot that everyone can reach.
I have a solution, but it exceeded the time limit because of my bad way of checking whether the intervals overlap. My current solution runs too slowly, and I'm hoping for a better one.
my code:
people = int(input())
# first item in peoplel[i] is the position of each person,
# the second item is the speed of each person
peoplel = [list(map(int, input().split())) for _ in range(people)]

def good(time):
    # first item, second item = the range of distance a person can go to
    return checkoverlap([[i[0] - time * i[1], i[0] + time * i[1]] for i in peoplel])

def checkoverlap(l):
    for i in range(len(l) - 1):
        seg1 = l[i]
        for i1 in range(i + 1, len(l)):
            seg2 = l[i1]
            if seg2[0] <= seg1[0] <= seg2[1] or seg1[0] <= seg2[0] <= seg1[1]:
                continue
            elif seg2[0] <= seg1[1] <= seg2[1] or seg1[0] <= seg2[1] <= seg1[1]:
                continue
            return False
    return True
(This is my first time asking a question, so please let me know if anything is wrong.)
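For context, a minimal sketch of the binary-search driver described above (it is not shown in the original post; the upper bound and precision are arbitrary assumptions):

def min_meeting_time(eps=1e-6):
    lo, hi = 0.0, 1e9  # hi must be large enough that good(hi) is True
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if good(mid):   # feasible: everyone can meet, try a smaller time
            hi = mid
        else:           # infeasible: allow more time
            lo = mid
    return hi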
One does simply go linear
A while after I finished the answer below, I found a simplification that removes the need for sorting, and thus reduces the complexity of checking whether all the intervals overlap to O(N).
If we look at the steps that are being done after the initial sort we can see that we are basically checking
if max(lower_bounds) < min(upper_bounds):
    return True
else:
    return False
And since both min and max are linear and need no sorting, we can simplify the algorithm to:
Creating an array of lower bounds - one pass.
Creating an array of upper bounds - one pass.
Doing the comparison mentioned above - two passes over the new arrays.
All of this could be done in a single pass to optimize further (and to avoid some unnecessary memory allocation), but the three-step version is clearer for the explanation's purpose; a compact sketch follows below.
Since the reasoning about correctness and timing was done in the previous iteration of this answer, I will skip it this time. I'm keeping the section below, since it nicely shows the thought process behind the optimization.
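A minimal sketch of the linear check, using the question's (position, speed) input format (using <= so that intervals touching at a single point still count as a valid meeting spot):

def all_overlap(people, time):
    # Person (p, v) can reach any spot in [p - v*time, p + v*time].
    # All intervals share a common point iff the largest lower bound
    # does not exceed the smallest upper bound.
    lower = [p - v * time for p, v in people]
    upper = [p + v * time for p, v in people]
    return max(lower) <= min(upper)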
One sort to rule them all
Disclaimer: This section was made obsolete, performance-wise, by the one above. However, since it is what allowed me to figure out the linear solution, I'm keeping it here.
As the title says, sorting is a rather straightforward way of going about this. It requires a slightly different data structure: instead of holding every interval as (min, max), I opted for holding every interval as (min, index), (max, index).
This allows me to sort these entries by their min and max values. What follows is a single linear pass over the sorted array. We also create a helper array of False values, representing the fact that at the beginning all the intervals are closed.
Now comes the pass over the array:
Since the array is sorted, we first encounter the min of each interval. In that case, we increase the openInterval counter and set the interval's flag to True. The interval is now open; until we close it, the person can still arrive at the party, since we are within his (or her) range.
We go along the array. As long as we keep opening intervals, everything is OK, and if we manage to open all the intervals at once, we have our party destination where all the social distancing collapses. If this happens, we return True.
If we close any of the intervals first, we have found our party breaker who can't make it anymore. (Or we can argue that the party breakers are those who didn't bother to arrive yet when someone already has to leave.) We return False.
The resulting complexity is O(N log N), caused by the initial sort, since the pass itself is linear in nature. This is quite a bit better than the original O(N^2) caused by the "check all intervals pairwise" approach.
The code:
import numpy as np
import cProfile, pstats, io

# random data for a speed test. Not that useful for checking the correctness though.
testSize = 10000
x = np.random.randint(0, 10000, testSize)
y = np.random.randint(1, 100, testSize)
peopleTest = [x for x in zip(x, y)]

# Just a basic example to help with the reasoning about the correctness
peoplel = [(1, 2), (3, 1), (8, 1)]
# first item in people[i] is the position of each person, the second item is the speed of each person

def checkIntervals(people, time):
    a = [(x[0] - x[1] * time, idx) for idx, x in enumerate(people)]
    b = [(x[0] + x[1] * time, idx) for idx, x in enumerate(people)]
    checks = [False for x in range(len(people))]
    openCount = 0
    intervals = [x for x in sorted(a + b, key=lambda x: x[0])]
    for i in intervals:
        if not checks[i[1]]:
            checks[i[1]] = True
            openCount += 1
            if openCount == len(people):
                return True
        else:
            return False

def good(time, people):
    # first item, second item = the range of distance a person can go to
    return checkoverlap([[i[0] - time * i[1], i[0] + time * i[1]] for i in people])

def checkoverlap(l):
    for i in range(len(l) - 1):
        seg1 = l[i]
        for i1 in range(i + 1, len(l)):
            seg2 = l[i1]
            if seg2[0] <= seg1[0] <= seg2[1] or seg1[0] <= seg2[0] <= seg1[1]:
                continue
            elif seg2[0] <= seg1[1] <= seg2[1] or seg1[0] <= seg2[1] <= seg1[1]:
                continue
            return False
    return True

pr = cProfile.Profile()
pr.enable()
print(checkIntervals(peopleTest, 10000))
print(good(10000, peopleTest))
pr.disable()

s = io.StringIO()
sortby = "cumulative"
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())
The profiling stats for a pass over the test array with 10K random values:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 8.933 8.933 (good)
1 8.925 8.925 8.926 8.926 (checkoverlap)
1 0.003 0.003 0.023 0.023 (checkIntervals)
1 0.008 0.008 0.010 0.010 {built-in method builtins.sorted}

How to increase for loop performance in Python

I have a for loop in which a series of operations is performed, but I discovered that one method in particular takes about 44% of the time of each iteration. In order to speed up this loop, what should I use: asyncio, multiprocessing? My idea is to have the next iteration of the loop begin while huge_method from the current iteration is still running, so that both iterations run at the same time. Any suggestions on how I can do this?
Example of what I mean:
for i in range(len(some_list)):
    x = some_list[i]['model']
    y = some_other_list[i]['prediction']
    result = huge_method(x, y) # This is what's taking up most of the time in this loop
    # some more code...
    list_of_results.append(result)
Note: nothing calculated during an iteration depends on anything calculated in a previous iteration. Each iteration appends a result to a list; maintaining the order of the results is important, but I can work around it otherwise. What I'm referring to as huge_method here is a method from a 3rd-party library, not code I would like to modify. (A sketch of one possible parallel approach follows.)
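Since the iterations are independent, one hedged sketch of the parallel idea (not from the original post) uses concurrent.futures. It assumes huge_method and its arguments are picklable and that the work is CPU-bound; executor.map returns results in input order, so ordering is preserved:

from concurrent.futures import ProcessPoolExecutor

def work(args):
    x, y = args
    return huge_method(x, y)  # the 3rd-party call from the question

pairs = [(some_list[i]['model'], some_other_list[i]['prediction'])
         for i in range(len(some_list))]

with ProcessPoolExecutor() as executor:
    # map() yields results in the same order as the inputs,
    # even though the calls run in parallel across processes.
    list_of_results = list(executor.map(work, pairs))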
Edit: For clarity, here is the actual code I'm working with:
for ii_day in range(len(prediction_indices)):
    model_idx = prediction_indices[ii_day]["model_idx"]
    prediction_idx = prediction_indices[ii_day]["prediction_idx"]
    # Fit model
    regression_period_signal = historical_signal_levels[model_idx, :]
    regression_period_price_change = historical_price_moves[model_idx]
    try:
        # This is the huge_method that takes half the time of an iteration
        rolling_regression_model = LinearRegression().fit(regression_period_signal, regression_period_price_change)
        self.coef_list.append(rolling_regression_model.coef_)
        # Calculate model error
        predictions = rolling_regression_model.predict(regression_period_signal)
        forecast_horizon_model_error = np.sqrt(
            mean_squared_error(regression_period_price_change, predictions))
        # Predictions
        forecast_distance = 1
        current_research = historical_signal_levels[prediction_idx, :]
        forecast_price_change = rolling_regression_model.predict(current_research)
        # Calculate drift and volatility
        volatility = ((1 + forecast_horizon_model_error) * (forecast_distance ** -0.5)) - 1
        # Kelly recommended optimum
        if volatility < 0:
            raise ZeroDivisionError("Volatility needs to be positive value.")
        if volatility == 0:
            volatility = 0.01
        kelly_recommended_optimum = forecast_price_change / volatility ** 2
        rule_recommended_allocation = self.kelly_fraction * kelly_recommended_optimum
    except:
        rule_recommended_allocation = np.zeros(len(prediction_idx))
    # Apply the calculated allocation to the dataframe.
    price_research_series.loc[prediction_idx, position_key] = rule_recommended_allocation

Why does the time taken by an iterative algorithm to find the sum of a list not increase uniformly with size?

I wanted to see how drastic the difference in time complexity is between the iterative and recursive approaches to summing an array, so I plotted 'time' versus 'size of the list' for a pretty decent range of sizes (up to 995). What I got was pretty much what I expected, except for one thing.
(The graph was attached as an image in the original post.)
What's confusing me are the bumps that the green line suddenly takes at certain values before coming back down. Why does that happen?
Here is the code I had written:
import matplotlib.pyplot as plt
from time import time

def sum_rec(lst):  # Sums recursively
    if len(lst) == 0:
        return 0
    return lst[0] + sum_rec(lst[1:])

def sum_iter(lst):  # Sums iteratively
    Sum = 0
    for i in range(len(lst)):
        Sum += lst[i]
    return Sum

def check_time(lst):  # Returns the time taken for both algorithms
    start = time()
    Sum = sum_iter(lst)
    end = time()
    t_iter = end - start

    start = time()
    Sum = sum_rec(lst)
    end = time()
    t_rec = end - start
    return t_iter, t_rec

N = [n for n in range(995)]
T1 = []  # for iterative function
T2 = []  # for recursive function

for n in N:  # values on the x-axis
    lst = [i for i in range(n)]
    t_iter, t_rec = check_time(lst)
    T1.append(t_iter)
    T2.append(t_rec)

plt.plot(N, T1)
plt.plot(N, T2)  # Both plotted on graph
plt.show()
I'd say both runtimes grow with the list size, but the recursive one has a much higher cost per element, which causes the steeper slope. (In fact, lst[1:] copies the rest of the list on every call, so the recursive version does quadratic total work on top of the function-call overhead.)
Other than that:
(1) You're mixing up the two plots. The iterative one stays grounded while the recursive one increases. One possible explanation is that recursive calls create stack frames and therefore cost more than loop iterations.
(2) You should increase the size of the array, as small sizes are more likely to cause spikes due to locality of reference.
(3) You should repeat the experiment over multiple runs, so that random spikes caused by some other process hogging the resources get averaged out; a timeit sketch follows below.
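A minimal sketch of that kind of repetition using the standard library's timeit (the list size, loop count, and repeat count are arbitrary choices):

import timeit

def sum_iter(lst):
    total = 0
    for v in lst:
        total += v
    return total

lst = list(range(995))
# repeat() runs the timing several times; taking the minimum filters
# out spikes caused by other processes competing for the CPU.
times = timeit.repeat(lambda: sum_iter(lst), number=1000, repeat=5)
print(min(times) / 1000, "seconds per call")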

What is the time complexity for repeatedly doubling a string?

Consider the following piece of C++ code:
string s = "a";
for (int i = 0; i < n; i++) {
    s = s + s; // Concatenate s with itself.
}
Usually, when analyzing the time complexity of a piece of code, we would determine how much work the loop body does and multiply it by the number of times the loop runs. However, in this case, the amount of work done per iteration varies from iteration to iteration, since the string being built up gets longer and longer.
How would you analyze this code to get the big-O time complexity?
The time complexity of this function is Θ(2^n). To see why this is, let's look at what the function does, then see how to analyze it.
For starters, let's trace through the loop for n = 3. Before iteration 0, the string s is the string "a". Iteration 0 doubles the length of s to make s = "aa". Iteration 1 doubles the length of s to make s = "aaaa". Iteration 2 then doubles the length of s to make s = "aaaaaaaa".
If you'll notice, after k iterations of the loop, the length of the string s is 2^k. This means that each iteration of the loop will take longer and longer to complete, because it will take more and more work to concatenate the string s with itself. Specifically, the kth iteration of the loop will take time Θ(2^k) to complete, because the loop iteration constructs a string of size 2^(k+1).
One way that we could analyze this function would be to multiply the worst-case time complexity of the inner loop by the number of loop iterations. Since each loop iteration takes time O(2^n) to finish and there are n loop iterations, we would get that this code takes time O(n · 2^n) to finish.
However, it turns out that this analysis is not very good, and in fact will overestimate the time complexity of this code. It is indeed true that this code runs in time O(n · 2^n), but remember that big-O notation gives an upper bound on the runtime of a piece of code. This means that the growth rate of this code's runtime is no greater than the growth rate of n · 2^n, but it doesn't mean that this is a precise bound. In fact, if we look at the code more precisely, we can get a better bound.
Let's begin by trying to do some better accounting for the work done. The work in this loop can be split apart into two smaller pieces:
The work done in the header of the loop, which increments i and tests whether the loop is done.
The work done in the body of the loop, which concatenates the string with itself.
Here, when accounting for the work in these two spots, we will account for the total amount of work done across all iterations, not just in one iteration.
Let's look at the first of these - the work done by the loop header. This will run exactly n times. Each time, this part of the code will do only O(1) work incrementing i, testing it against n, and deciding whether to continue with the loop. Therefore, the total work done here is Θ(n).
Now let's look at the loop body. As we saw before, iteration k creates a string of length 2^(k+1), which takes time roughly 2^(k+1). If we sum this up across all iterations, we get that the work done is (roughly speaking)
2^1 + 2^2 + 2^3 + ... + 2^(n+1).
So what is this sum? Previously, we got a bound of O(n · 2^n) by noting that
2^1 + 2^2 + 2^3 + ... + 2^(n+1)
< 2^(n+1) + 2^(n+1) + 2^(n+1) + ... + 2^(n+1)
= n · 2^(n+1) = 2(n · 2^n) = Θ(n · 2^n).
However, this is a very weak upper bound. If we're more observant, we can recognize the original sum as the sum of a geometric series, where a = 2 and r = 2. Given this, the sum of these terms can be worked out to be exactly
2^(n+2) - 2 = 4(2^n) - 2 = Θ(2^n).
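For reference, the geometric-series step written out (a standard identity: first term a = 2, ratio r = 2, and n + 1 terms):

\sum_{k=1}^{n+1} 2^k = \frac{2\,(2^{n+1} - 1)}{2 - 1} = 2^{n+2} - 2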
In other words, the total work done by the body of the loop, across all iterations, is Θ(2^n).
The total work done by the loop is given by the work done in the loop maintenance plus the work done in the body of the loop. This works out to Θ(2^n) + Θ(n) = Θ(2^n). Therefore, the total work done by the loop is Θ(2^n). This grows very quickly, but nowhere near as rapidly as n · 2^n, which is what our original analysis gave us.
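As a quick sanity check on this analysis, here is a sketch (in Python, standing in for the C++ loop) that counts the characters written while doubling. The exact constant depends on what you count, but the ratio to 2^n settles to a constant, consistent with Θ(2^n) rather than Θ(n · 2^n):

def doubling_work(n):
    # Count characters written while repeatedly doubling the string.
    length, work = 1, 0
    for _ in range(n):
        work += 2 * length  # s + s writes a new string of twice the length
        length *= 2
    return work

for n in (5, 10, 20):
    print(n, doubling_work(n) / 2 ** n)  # ratio approaches 2, i.e. Θ(2^n)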
In short, when analyzing a loop, you can always get a conservative upper bound by multiplying the number of iterations of the loop by the maximum work done on any one iteration of that loop. However, doing a more precise analysis can often give you a much better bound.
Hope this helps!
