I've been working on code to calculate the distances between 33 3D points and work out the shortest route between them. The initial code took in all 33 points, paired them consecutively, calculated the distances between the pairs using math.sqrt, and summed them all up to get a final distance.
My problem is that with the sheer number of permutations of a list of 33 points (33 factorial!), the code is going to need to be at its absolute best to find the answer within a human lifetime (assuming I can use as many CPUs as I can get my hands on to increase the sheer computational power).
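To put that in perspective, here is a rough back-of-the-envelope calculation (using the timing reported further down of roughly 50 million permutations in 11 seconds under pypy3; treat the numbers as order-of-magnitude only):

import math
perms = math.factorial(33)              # about 8.68e36 orderings
rate = 50_000_000 / 11                  # roughly 4.5 million permutations per second per core
seconds = perms / rate
print(seconds / (3600 * 24 * 365.25))   # on the order of 6e22 core-years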
I've designed a simple web server that hands out an integer, which the client converts to a list; the client then performs a set number of lexicographical permutations from that point and sends back the shortest distance found in that block. This part is fine, but I have concerns over the code that does the distance calculations.
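For anyone curious what "convert it to a list" involves, one way to do that conversion (not my actual server code, just a sketch using the factorial number system, with a made-up helper name) is:

import math

def nth_permutation(items, index):
    # Decode a lexicographic permutation index into the permutation itself.
    items = list(items)
    result = []
    for i in range(len(items), 0, -1):
        digit, index = divmod(index, math.factorial(i - 1))
        result.append(items.pop(digit))
    return result

# e.g. the worker assigned block 3 of 50-million-permutation chunks would start from:
start = nth_permutation(range(33), 3 * 50_000_000)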
I've put together a test version of my code so I could change things and see whether they made the execution time faster or slower. This code starts at the beginning of the permutation list (0 to 32) in order and performs 50 million lexicographical iterations on it, checking the distance of the points at every iteration. The code is detailed below.
import json
import datetime
import math
def next_lexicographic_permutation(x):
    i = len(x) - 2
    while i >= 0:
        if x[i] < x[i+1]:
            break
        else:
            i -= 1
    if i < 0:
        return False
    j = len(x) - 1
    while j > i:
        if x[j] > x[i]:
            break
        else:
            j -= 1
    x[i], x[j] = x[j], x[i]
    reverse(x, i + 1)
    return x

def reverse(arr, i):
    if i > len(arr) - 1:
        return
    j = len(arr) - 1
    while i < j:
        arr[i], arr[j] = arr[j], arr[i]
        i += 1
        j -= 1
# ip for initial permutation
ip = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
lookup = '{"0":{"name":"van Maanen\'s Star","x":-6.3125,"y":-11.6875,"z":-4.125},\
"1":{"name":"Wolf 124","x":-7.25,"y":-27.1562,"z":-19.0938},\
"2":{"name":"Midgcut","x":-14.625,"y":10.3438,"z":13.1562},\
"3":{"name":"PSPF-LF 2","x":-4.40625,"y":-17.1562,"z":-15.3438},\
"4":{"name":"Wolf 629","x":-4.0625,"y":7.6875,"z":20.0938},\
"5":{"name":"LHS 3531","x":1.4375,"y":-11.1875,"z":16.7812},\
"6":{"name":"Stein 2051","x":-9.46875,"y":2.4375,"z":-15.375},\
"7":{"name":"Wolf 25","x":-11.0625,"y":-20.4688,"z":-7.125},\
"8":{"name":"Wolf 1481","x":5.1875,"y":13.375,"z":13.5625},\
"9":{"name":"Wolf 562","x":1.46875,"y":12.8438,"z":15.5625},\
"10":{"name":"LP 532-81","x":-1.5625,"y":-27.375,"z":-32.3125},\
"11":{"name":"LP 525-39","x":-19.7188,"y":-31.125,"z":-9.09375},\
"12":{"name":"LP 804-27","x":3.3125,"y":17.8438,"z":43.2812},\
"13":{"name":"Ross 671","x":-17.5312,"y":-13.8438,"z":0.625},\
"14":{"name":"LHS 340","x":20.4688,"y":8.25,"z":12.5},\
"15":{"name":"Haghole","x":-5.875,"y":0.90625,"z":23.8438},\
"16":{"name":"Trepin","x":26.375,"y":10.5625,"z":9.78125},\
"17":{"name":"Kokary","x":3.5,"y":-10.3125,"z":-11.4375},\
"18":{"name":"Akkadia","x":-1.75,"y":-33.9062,"z":-32.9688},\
"19":{"name":"Hill Pa Hsi","x":29.4688,"y":-1.6875,"z":25.375},\
"20":{"name":"Luyten 145-141","x":13.4375,"y":-0.8125,"z":6.65625},\
"21":{"name":"WISE 0855-0714","x":6.53125,"y":-2.15625,"z":2.03125},\
"22":{"name":"Alpha Centauri","x":3.03125,"y":-0.09375,"z":3.15625},\
"23":{"name":"LHS 450","x":-12.4062,"y":7.8125,"z":-1.875},\
"24":{"name":"LP 245-10","x":-18.9688,"y":-13.875,"z":-24.2812},\
"25":{"name":"Epsilon Indi","x":3.125,"y":-8.875,"z":7.125},\
"26":{"name":"Barnard\'s Star","x":-3.03125,"y":1.375,"z":4.9375},\
"27":{"name":"Epsilon Eridani","x":1.9375,"y":-7.75,"z":-6.84375},\
"28":{"name":"Narenses","x":-1.15625,"y":-11.0312,"z":21.875},\
"29":{"name":"Wolf 359","x":3.875,"y":6.46875,"z":-1.90625},\
"30":{"name":"LAWD 26","x":20.9062,"y":-7.5,"z":3.75},\
"31":{"name":"Avik","x":13.9688,"y":-4.59375,"z":-6.0},\
"32":{"name":"George Pantazis","x":-12.0938,"y":-16.0,"z":-14.2188}}'
lookup = json.loads(lookup)
lowest_total = 9999
# create a 2D array for the distances; call it b to keep the code looking clean.
b = [[0 for i in range(33)] for j in range(33)]
for x in range(33):
    for y in range(33):
        if x == y:
            continue
        else:
            b[x][y] = math.sqrt(((lookup[str(x)]["x"] - lookup[str(y)]['x']) ** 2) + ((lookup[str(x)]['y'] - lookup[str(y)]['y']) ** 2) + ((lookup[str(x)]['z'] - lookup[str(y)]['z']) ** 2))
# begin timer
start_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
start = datetime.datetime.now()
print("[{}] Start".format(start_date))
# main iteration loop
for x in range(50_000_000):
    distance = b[ip[0]][ip[1]] + b[ip[1]][ip[2]] + b[ip[2]][ip[3]] +\
               b[ip[3]][ip[4]] + b[ip[4]][ip[5]] + b[ip[5]][ip[6]] +\
               b[ip[6]][ip[7]] + b[ip[7]][ip[8]] + b[ip[8]][ip[9]] +\
               b[ip[9]][ip[10]] + b[ip[10]][ip[11]] + b[ip[11]][ip[12]] +\
               b[ip[12]][ip[13]] + b[ip[13]][ip[14]] + b[ip[14]][ip[15]] +\
               b[ip[15]][ip[16]] + b[ip[16]][ip[17]] + b[ip[17]][ip[18]] +\
               b[ip[18]][ip[19]] + b[ip[19]][ip[20]] + b[ip[20]][ip[21]] +\
               b[ip[21]][ip[22]] + b[ip[22]][ip[23]] + b[ip[23]][ip[24]] +\
               b[ip[24]][ip[25]] + b[ip[25]][ip[26]] + b[ip[26]][ip[27]] +\
               b[ip[27]][ip[28]] + b[ip[28]][ip[29]] + b[ip[29]][ip[30]] +\
               b[ip[30]][ip[31]] + b[ip[31]][ip[32]]
    if distance < lowest_total:
        lowest_total = distance
    ip = next_lexicographic_permutation(ip)
# end timer
finish_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
finish = datetime.datetime.now()
print("[{}] Finish".format(finish_date))
diff = finish - start
print("Time taken => {}".format(diff))
print("Lowest distance => {}".format(lowest_total))
This is the result of a lot of work to make things faster. I was initially using string look-ups to find the distances, with a dict keyed by strings like "1-2", but I very quickly found that this was slow. I then moved on to hashed versions of the "1-2" key and the speed increased, but the fastest approach I have found so far is using a 2D array and looking the values up from there.
I have also found that manually constructing the distance calculation saved time over having a for x in range(32): loop adding the distances up and incrementing a variable to get the total.
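For reference, the loop version I mean is something along these lines (a sketch of the slower alternative, not code from the listing above):

distance = 0.0
for k in range(32):
    distance += b[ip[k]][ip[k + 1]]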
Another great speed up was using pypy3 instead of python3 to execute it.
This usually takes 11 seconds to complete using pypy3.
Running 50 million iterations of the distance calculation on its own takes 5.2 seconds.
Running the next_lexicographic_permutation function 50 million times on its own takes 6 seconds.
I can't think of any way to make this faster, but I believe there may be optimizations to be made in the next_lexicographic_permutation function. From what I've read about this, the main bottleneck seems to be the swapping of positions in the array:
x[i], x[j] = x[j], x[i]
Edit: added clarification that "lifetime" means a human lifetime.
The brute-force approach of calculating all the distances is going to be slower than a partitioning approach. Here is a similar question for the 3D case.
I tried the following code to find the ranges of a dataframe not within the ranges of another dataframe. However, it takes more than a day to run on the large files because, in the last two for-loops, every row of one dataframe is compared against every row of the other. Each of my 24 dataframes has around 10^8 rows. Is there any efficient alternative to the following approach?
Please refer to this thread for a better understanding of my I/O: Return the range of a dataframe not within a range of another dataframe
My approach:
I created tuple pairs from (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) initially, in order to apply the range() function. After that, I added a condition checking whether the df1 ranges fall within the df2 ranges or not. The edge case here was df1['first.start'] == df1['first.end']. I collected the filtered indices from the iterations and then passed them to df1.
df2_lst = []
for i, j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)

df1_lst = []
for i, j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)

def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True  # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2

##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i], lst[i+1]

df1_lst2 = list(chunks(df1_lst, 2))
df2_lst2 = list(chunks(df2_lst, 2))

indices = []
for idx, i in enumerate(df1_lst2):  # main list
    x, y = i
    for j in df2_lst2:  # filter list
        m, n = j
        if (x != y) & (range_subset(range(x, y), range(m, n))):  # checking if the main list exists in the filter range or not
            indices.append(idx)  # collecting the filtered indices

df1.iloc[indices]
If n and m are the number of rows in df1 and df2, any algorithm needs to make at least n * m comparisons to check every range in df1 against every range in df2. The problem with your code as posted is that (a) it has too many intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcasting, which uses highly optimized C loops under the hood, it will be a lot faster.
The downside of numpy broadcasting is memory: it will create a comparison matrix of n * m bytes, and the size of your problem may run your computer out of memory. We can mitigate that by chunking df1 to trade performance for lower memory usage.
import numpy as np
import pandas as pd

# Sample data
def random_dataframe(size):
    a = np.random.randint(1, 100, 2*size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })

n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)

# ---------------------------
# Prepare the Start and End of df2 for comparison.
# [:, None] raises the array by one dimension, which is necessary
# for array broadcasting
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []

while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset+chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset+chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset += chunk_size

mask = np.hstack(mask)

# ---------------------------
# If memory is not a concern, use the following code. However, this
# may run slower than the chunking approach due to the increased size of
# the array broadcasting operation. Profile your code to find out.
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]
s1 = df1['first.start'].to_numpy()
e1 = df1['first.end'].to_numpy()
mask = ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
The chunking code took 30s on my computer. To access the result:
df1[mask] # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2
By tweaking the comparison, you can check for overlapping ranges too.
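For instance, an overlap test (rather than the containment test above) only needs two comparisons per pair. A sketch using the same s1/e1/s2/e2 arrays as the un-chunked version:

# Two ranges overlap when each starts no later than the other ends.
overlap_mask = ((s2 <= e1) & (s1 <= e2)).any(axis=0)
df1[overlap_mask]   # ranges in df1 that overlap at least one range in df2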
I'm trying to simulate what the crunch command does in Linux, with the difference that it yields the words instead of writing them to a file, and I came up with something like this:
def wordlist(chars, min, max = None):
    if max is None:  # means the user wants only a single length
        max = min
    length = len(chars)
    for n in range(min, max + 1):
        indexes = [0] * n
        for _ in range(length ** n):  # the number of chars to the power of the places to fill gives the number of words in the wordlist
            for m in range(1, len(indexes) + 1):  # the carrying system, as if indexes were the digits of a number instead of a list
                if indexes[-m] == length:
                    indexes[-m] = 0
                    indexes[-m - 1] += 1
            yield ''.join(chars[i] for i in indexes)
            indexes[-1] += 1
It's a bit rough and not very readable, and probably not very performant either. Without using any external module like itertools, does anyone have a better idea?
EDIT:
After a bit of struggling, I improved the math behind it, coming up with something like this:
def wordlist(chars, min, max = None):
    if max is None:
        max = min
    if min <= 0 or max <= 0:
        return
    base = len(chars)
    for n in range(min, max + 1):
        for m in range(base ** n):
            yield ''.join(chars[m // base ** (n - v - 1) % base] for v in range(n))
Anyway, I measured the time taken by each of the two functions and, while this new one is much more readable and prettier, the first one is still faster. I'm still waiting for better ideas from you.
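As a quick sanity check, both versions produce the same output for a small alphabet, for example:

print(list(wordlist("ab", 1, 2)))
# ['a', 'b', 'aa', 'ab', 'ba', 'bb']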
I have the following simple code:
function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    # minsofar = n
    Threads.@threads for i in 1:N
    # for i in 1:N
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < minsofar[Threads.threadid()]
                minsofar[Threads.threadid()] = dist
            end
        end
    end
    return minimum(minsofar)
    # return minsofar
end

function ave_min(n, N)
    ITER = 10
    strings = random_strings2(n, N)
    new_min = find_min(strings, n, N)
    avesofar = new_min
    # print("New min ", new_min, ". New ave ", avesofar, "\n")
    total = avesofar
    for i in 1:ITER-1
        strings = random_strings2(n, N)
        new_min = find_min(strings, n, N)
        avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
        print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
    end
    return avesofar
end
N = 2^16
n = 99
print("Overall average ", ave_min(n, N), "\n")
When I run it on an AMD 8350 on Linux, the CPU usage is around 430% (instead of close to 800%).
Is it possible to make the parallelisation work more efficiently?
Also, I noticed a new very impressive looking package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, is it possible to speed up the code this way too?
Can the code be vectorized using LoopVectorization.jl?
(I am completely new to Julia)
The parallelization of your code seems to be correct.
Most likely you are running it in Atom or another IDE. By default, Atom uses only half of the cores (more exactly, only the physical, not the logical, cores).
E.g., running in Atom on my machine:
julia> Threads.nthreads()
4
What you need to do is to explicitly set JULIA_NUM_THREADS
Windows command line (still assuming 8 logical cores)
set JULIA_NUM_THREADS=8
Linux command line
export JULIA_NUM_THREADS=8
After doing that, your code uses 100% of all my cores.
EDIT
After discussion: you can get the time down to around 20% of the single-threaded time by using Distributed instead of Threads, since this avoids memory sharing.
The code will look more or less like this:
using Distributed
addprocs(8)

@everywhere function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    return @distributed (min) for i in 1:N-1
        minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
    end
end

### ... the rest of the code remains unchanged
Suppose I have some binary mask mask. (e.g. 0b101011011101)
Is there an efficient method of computing all integers k such that k & mask == k? (where & is the bitwise AND operator) (alternatively, k & ~mask == 0)
If mask has m ones, then there are exactly 2^m numbers that satisfy this property, so it seems like there should be some kind of process that is O(2^m). Enumerating the integers less than the mask is wasteful (though easy to eliminate values that do not apply).
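For reference, the wasteful enumeration I mean looks like this (fine for tiny masks, but it scans every value up to mask instead of just the 2^m matches):

def submasks_naive(mask):
    # brute force: test every integer up to mask
    return [k for k in range(mask + 1) if k & mask == k]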
I figured it out... you can identify all the single-bit patterns as follows, since the least significant 1 bit of any integer k is cleared when calculating k & (k-1):
def onebits(x):
    while x > 0:
        # find least significant 1 bit
        xprev = x
        x &= x-1
        yield x ^ xprev
and then I can use the ruler function to XOR in various combinations of 1 bits to emulate which bits of a counter are toggled each time:
def maskcount(mask):
    maskbits = []
    m = 0
    for ls1 in onebits(mask):
        m ^= ls1
        maskbits.append(m)

    # ruler function modified from
    # http://lua-users.org/wiki/LuaCoroutinesVersusPythonGenerators
    def ruler(k):
        for i in range(k):
            yield i
            for x in ruler(i): yield x

    x = 0
    yield x
    for k in ruler(len(maskbits)):
        x ^= maskbits[k]
        yield x
which looks like this:
>>> for x in maskcount(0xc05):
... print format(x, '#016b')
0b00000000000000
0b00000000000001
0b00000000000100
0b00000000000101
0b00010000000000
0b00010000000001
0b00010000000100
0b00010000000101
0b00100000000000
0b00100000000001
0b00100000000100
0b00100000000101
0b00110000000000
0b00110000000001
0b00110000000100
0b00110000000101
An easy way to solve the problem is to find the bits that are set in mask, and then simply count with i, replacing the bits of i with the corresponding bits from the mask.
def codes(mask):
    bits = filter(None, (mask & (1 << i) for i in xrange(mask.bit_length())))
    for i in xrange(1 << len(bits)):
        yield sum(b for j, b in enumerate(bits) if (i >> j) & 1)
print list(codes(39))
That gives you O(log(N)) work per iteration (where N is the number of bits set in mask).
It's possible to be more efficient, and do O(1) work per iteration by counting using gray codes. With gray code counting, only a single bit changes each iteration so it's possible to efficiently update the current value, v. Obviously this is much harder to understand than the simple solution above.
def codes(mask):
    bits = filter(None, (mask & (1 << i) for i in xrange(mask.bit_length())))
    blt = dict((1 << i, b) for i, b in enumerate(bits))
    p, v = 0, 0
    for i in xrange(1 << len(bits)):
        n = i ^ (i >> 1)
        v ^= blt.get(p^n, 0)
        p = n
        yield v
print list(codes(39))
A disadvantage of using gray codes is that the results are not returned in numeric order. But luckily that wasn't a condition in the question!
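For example, tracing the Gray-code version by hand, codes(39) should come out in toggling order rather than numeric order, something like:

>>> print list(codes(39))
[0, 1, 3, 2, 6, 7, 5, 4, 36, 37, 39, 38, 34, 35, 33, 32]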