Improving the performance of multiple subsets on a large Dataframe - python-3.x

I have a dataframe containing 6.3 million records and 111 columns. For this example I've limited the dataframe to 26 columns (A-Z). On this dataframe I am trying to run an analysis in which I take different combinations of the columns (5 columns per combination), subset the dataframe on each combination, count the number of occurrences for each combination, and finally, if this count exceeds a certain threshold, store the combination. The code is already optimized with an efficient way of running the individual subsets, using Numba. But the overall script I have still takes quite some time (7-8 hours). This is because if you use, for example, 90 columns (which is my actual number) to make combinations of 5, you get 43,949,268 different combinations. In my case I also use shifted versions of some columns (the value of the day before). So for this example I've limited it to 20 queries (columns A-J twice, once unshifted and once shifted).
The columns used are stored in a list, which is converted to numbers because it otherwise gets too big using long strings. The names in the list correspond to the keys of a dictionary containing the subset conditions.
Here is the full code example:
import pandas as pd
import numpy as np
import numba as nb
import time
from itertools import combinations
# Numba preparation
@nb.njit('int64(bool_[::1],bool_[::1],bool_[::1],bool_[::1],bool_[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    for i in nb.prange(n):
        subsetLength += cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i]
    return subsetLength
# Example Dataframe
np.random.seed(101)
bigDF = pd.DataFrame(np.random.randint(0,11,size=(6300000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
# Example query list
queryList = ['A_shift0','B_shift0','C_shift0','D_shift0','E_shift0','F_shift0','G_shift0','H_shift0','I_shift0','J_shift0','A_shift1','B_shift1','C_shift1','D_shift1','E_shift1','F_shift1','G_shift1','H_shift1','I_shift1','J_shift1']
# Convert list to numbers for creation combinations
listToNum = list(range(len(queryList)))
# Generate 15504 combinations of the 20 queries without repetition
queryCombinations = combinations(listToNum,5)
# Example query dict
queryDict = {
    'query_A_shift0': ((bigDF.A >= 1) & (bigDF.A < 3)),
    'query_B_shift0': ((bigDF.B >= 3) & (bigDF.B < 5)),
    'query_C_shift0': ((bigDF.C >= 5) & (bigDF.C < 7)),
    'query_D_shift0': ((bigDF.D >= 7) & (bigDF.D < 9)),
    'query_E_shift0': ((bigDF.E >= 9) & (bigDF.E < 11)),
    'query_F_shift0': ((bigDF.F >= 1) & (bigDF.F < 3)),
    'query_G_shift0': ((bigDF.G >= 3) & (bigDF.G < 5)),
    'query_H_shift0': ((bigDF.H >= 5) & (bigDF.H < 7)),
    'query_I_shift0': ((bigDF.I >= 7) & (bigDF.I < 9)),
    'query_J_shift0': ((bigDF.J >= 7) & (bigDF.J < 11)),
    'query_A_shift1': ((bigDF.A.shift(1) >= 1) & (bigDF.A.shift(1) < 3)),
    'query_B_shift1': ((bigDF.B.shift(1) >= 3) & (bigDF.B.shift(1) < 5)),
    'query_C_shift1': ((bigDF.C.shift(1) >= 5) & (bigDF.C.shift(1) < 7)),
    'query_D_shift1': ((bigDF.D.shift(1) >= 7) & (bigDF.D.shift(1) < 9)),
    'query_E_shift1': ((bigDF.E.shift(1) >= 9) & (bigDF.E.shift(1) < 11)),
    'query_F_shift1': ((bigDF.F.shift(1) >= 1) & (bigDF.F.shift(1) < 3)),
    'query_G_shift1': ((bigDF.G.shift(1) >= 3) & (bigDF.G.shift(1) < 5)),
    'query_H_shift1': ((bigDF.H.shift(1) >= 5) & (bigDF.H.shift(1) < 7)),
    'query_I_shift1': ((bigDF.I.shift(1) >= 7) & (bigDF.I.shift(1) < 9)),
    'query_J_shift1': ((bigDF.J.shift(1) >= 7) & (bigDF.J.shift(1) < 11))
}
totalCountDict = {'queryStrings': [],'totalCounts': []}
# Loop through all query combinations and count subset lengths
start = time.time()
for combi in list(queryCombinations):
    tempList = list(combi)
    queryOne = str(queryList[tempList[0]])
    queryTwo = str(queryList[tempList[1]])
    queryThree = str(queryList[tempList[2]])
    queryFour = str(queryList[tempList[3]])
    queryFive = str(queryList[tempList[4]])
    queryString = '-'.join(map(str, tempList))
    count = computeSubsetLength5(queryDict["query_" + queryOne].to_numpy(), queryDict["query_" + queryTwo].to_numpy(), queryDict["query_" + queryThree].to_numpy(), queryDict["query_" + queryFour].to_numpy(), queryDict["query_" + queryFive].to_numpy())
    if count > 1300:
        totalCountDict['queryStrings'].append(queryString)
        totalCountDict['totalCounts'].append(count)
print(len(totalCountDict['totalCounts']))
stop = time.time()
print("Loop time:", stop - start)
This currently takes about 20 seconds on my 2020 Intel MacBook Pro for the 15,504 combinations. Any thoughts on how this could be improved? I have tried using multiprocessing, but since I am already using Numba for the individual subsets this did not work well together. Am I using an inefficient way of doing multiple subsets with a list, a dictionary and a for loop, or is 7-8 hours realistic for doing 44 million subsets on a dataframe of 6.3 million records?

One solution to speed this code up by a large factor is to pack the bits of the boolean arrays stored in queryDict. Indeed, computeSubsetLength5 is likely memory bound (I thought the speed-up provided in my previous answer would be sufficient for your needs).
Here is the function to pack the bits of a boolean array:
@nb.njit('uint64[::1](bool_[::1])')
def toPackedArray(cond):
    n = len(cond)
    res = np.empty((n+63)//64, dtype=np.uint64)
    for i in range(n//64):
        tmp = np.uint64(0)
        for j in range(64):
            tmp |= nb.types.uint64(cond[i*64+j]) << j
        res[i] = tmp
    # Remainder: pack the last, partially filled 64-bit word
    if n % 64 > 0:
        tmp = np.uint64(0)
        for j in range(n - (n % 64), n):
            tmp |= nb.types.uint64(cond[j]) << (j % 64)
        res[len(res)-1] = tmp
    return res
Note that the ends of the arrays are padded with zeros, which does not affect the following computation (it may not be the case if you plan to use the packed arrays in another context).
This function is called once for each array like this:
'query_A_shift0': toPackedArray((((bigDF.A >= 1) & (bigDF.A < 3))).to_numpy()),
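If the boolean queryDict from the question is already built, the whole dictionary can also be packed in one pass. This is just a sketch; packedQueryDict is a hypothetical name reused in the sketches further below:
# Pack every boolean query once up front (hypothetical dictionary name).
packedQueryDict = {name: toPackedArray(cond.to_numpy()) for name, cond in queryDict.items()}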
Once packed, the arrays can be processed much more efficiently by working directly on 64-bit integers (64 boolean values are handled per integer at once). Here is the resulting code:
# See: https://en.wikipedia.org/wiki/Hamming_weight
@nb.njit('uint64(uint64)', inline='always')
def popcount64c(x):
    m1 = 0x5555555555555555
    m2 = 0x3333333333333333
    m4 = 0x0f0f0f0f0f0f0f0f
    h01 = 0x0101010101010101
    x -= (x >> 1) & m1
    x = (x & m2) + ((x >> 2) & m2)
    x = (x + (x >> 4)) & m4
    return (x * h01) >> 56
# Numba preparation
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    for i in nb.prange(n):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength
popcount64c counts the number of bits set to 1 in each 64-bit chunk.
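As a quick sanity check (a sketch with made-up data, not part of the original answer), the packed pipeline can be compared against a plain NumPy reduction on the unpacked boolean arrays:
# Five random boolean conditions, packed, then counted both ways.
conds = [np.random.rand(6_300_000) < 0.2 for _ in range(5)]
packed = [toPackedArray(c) for c in conds]
expected = int(np.sum(conds[0] & conds[1] & conds[2] & conds[3] & conds[4]))
assert computeSubsetLength5(*packed) == expected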
Here are results on my 6-core i5-9600KF machine:
Reference implementation: 13.41 s
Proposed implementation: 0.38 s
The proposed implementation is 35 times faster than the (already optimized) Numba reference. The reason it is much more than the 8x you would expect from the data-size reduction alone is that packing shrinks each array from one byte per row to one bit per row (roughly 0.8 MB instead of 6.3 MB here), so the data should now fit in the last-level cache of your processor, which is often much faster than the RAM (about 5 times faster on my machine).
If you want to optimize this code further, you can work on smaller chunks that fit in the L2 cache and use threads in the combinatoric loop rather than in the still memory-bound computeSubsetLength5 function. This should give you a significant speed-up (I expect at least 2x).
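A rough sketch of the threading part of that idea, under two assumptions not in the original answer: the packed arrays live in the hypothetical packedQueryDict above, and the kernel is recompiled serially with nogil=True so the worker threads are not serialized by the GIL (the L2-sized chunking is not shown):
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Serial, GIL-releasing variant of the packed kernel (hypothetical recompilation).
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1],uint64[::1])', nogil=True)
def computeSubsetLength5Serial(cond1, cond2, cond3, cond4, cond5):
    subsetLength = np.uint64(0)
    for i in range(len(cond1)):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength

def countCombination(combi):
    packed = [packedQueryDict["query_" + queryList[k]] for k in combi]
    return combi, computeSubsetLength5Serial(packed[0], packed[1], packed[2], packed[3], packed[4])

matches = []
with ThreadPoolExecutor() as pool:
    for combi, count in pool.map(countCombination, combinations(range(len(queryList)), 5)):
        if count > 1300:
            matches.append(('-'.join(map(str, combi)), count))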
The biggest optimisation to apply then probably comes from the overall algorithm. The same logical AND operations are computed over and over. Pre-computing some of them on the fly, while keeping only the most useful ones, should significantly speed the algorithm up (I expect another 2x).
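A minimal sketch of that idea under the same assumptions (hypothetical packedQueryDict, plus a 4-argument variant of the kernel): AND the two lowest-indexed packed queries once per pair, then reuse that intermediate for every combination that shares the pair.
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength4(pairAnd, cond3, cond4, cond5):
    subsetLength = 0
    for i in nb.prange(len(pairAnd)):
        subsetLength += popcount64c(pairAnd[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength

nQueries = len(queryList)
for a, b in combinations(range(nQueries), 2):
    # Each 5-combination is visited exactly once: (a, b) are its two smallest indices.
    pairAnd = packedQueryDict["query_" + queryList[a]] & packedQueryDict["query_" + queryList[b]]
    for c, d, e in combinations(range(b + 1, nQueries), 3):
        count = computeSubsetLength4(pairAnd,
                                     packedQueryDict["query_" + queryList[c]],
                                     packedQueryDict["query_" + queryList[d]],
                                     packedQueryDict["query_" + queryList[e]])
        # ...apply the same threshold test as in the original loop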
I am pretty sure there are many other optimisations that can be applied to the overall algorithm. Brute force is often sufficient to solve a problem, but it is hardly a requirement.

Related

I'm looking for ways to make this lexicographical code faster

I've been working on code to calculate the distances between 33 3D points and determine the shortest route between them. The initial code took all 33 points, paired them consecutively, calculated the distances between the pairs using math.sqrt, and summed them all up to get a final distance.
My problem is that with the sheer number of permutations of a list of 33 points (33 factorial!) the code needs to be at its absolute best to find the answer within a human lifetime (assuming I can use as many CPUs as I can get my hands on to increase the sheer computational power).
I've designed a simple web server to hand out an integer, convert it to a list, have the code perform a set number of lexicographical permutations from that point, and send back the resulting shortest distance of that block. This part is fine, but I have concerns about the code that does the distance calculations.
I've put together a test version of my code so I could change things and see whether they made the execution time faster or slower. This code starts at the beginning of the permutation list (0 to 32) in order and performs 50 million lexicographical iterations on it, checking the distance of the points at every iteration. The code is detailed below.
import json
import datetime
import math
def next_lexicographic_permutation(x):
    i = len(x) - 2
    while i >= 0:
        if x[i] < x[i+1]:
            break
        else:
            i -= 1
    if i < 0:
        return False
    j = len(x) - 1
    while j > i:
        if x[j] > x[i]:
            break
        else:
            j -= 1
    x[i], x[j] = x[j], x[i]
    reverse(x, i + 1)
    return x

def reverse(arr, i):
    if i > len(arr) - 1:
        return
    j = len(arr) - 1
    while i < j:
        arr[i], arr[j] = arr[j], arr[i]
        i += 1
        j -= 1
# ip for initial permutation
ip = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
lookup = '{"0":{"name":"van Maanen\'s Star","x":-6.3125,"y":-11.6875,"z":-4.125},\
"1":{"name":"Wolf 124","x":-7.25,"y":-27.1562,"z":-19.0938},\
"2":{"name":"Midgcut","x":-14.625,"y":10.3438,"z":13.1562},\
"3":{"name":"PSPF-LF 2","x":-4.40625,"y":-17.1562,"z":-15.3438},\
"4":{"name":"Wolf 629","x":-4.0625,"y":7.6875,"z":20.0938},\
"5":{"name":"LHS 3531","x":1.4375,"y":-11.1875,"z":16.7812},\
"6":{"name":"Stein 2051","x":-9.46875,"y":2.4375,"z":-15.375},\
"7":{"name":"Wolf 25","x":-11.0625,"y":-20.4688,"z":-7.125},\
"8":{"name":"Wolf 1481","x":5.1875,"y":13.375,"z":13.5625},\
"9":{"name":"Wolf 562","x":1.46875,"y":12.8438,"z":15.5625},\
"10":{"name":"LP 532-81","x":-1.5625,"y":-27.375,"z":-32.3125},\
"11":{"name":"LP 525-39","x":-19.7188,"y":-31.125,"z":-9.09375},\
"12":{"name":"LP 804-27","x":3.3125,"y":17.8438,"z":43.2812},\
"13":{"name":"Ross 671","x":-17.5312,"y":-13.8438,"z":0.625},\
"14":{"name":"LHS 340","x":20.4688,"y":8.25,"z":12.5},\
"15":{"name":"Haghole","x":-5.875,"y":0.90625,"z":23.8438},\
"16":{"name":"Trepin","x":26.375,"y":10.5625,"z":9.78125},\
"17":{"name":"Kokary","x":3.5,"y":-10.3125,"z":-11.4375},\
"18":{"name":"Akkadia","x":-1.75,"y":-33.9062,"z":-32.9688},\
"19":{"name":"Hill Pa Hsi","x":29.4688,"y":-1.6875,"z":25.375},\
"20":{"name":"Luyten 145-141","x":13.4375,"y":-0.8125,"z":6.65625},\
"21":{"name":"WISE 0855-0714","x":6.53125,"y":-2.15625,"z":2.03125},\
"22":{"name":"Alpha Centauri","x":3.03125,"y":-0.09375,"z":3.15625},\
"23":{"name":"LHS 450","x":-12.4062,"y":7.8125,"z":-1.875},\
"24":{"name":"LP 245-10","x":-18.9688,"y":-13.875,"z":-24.2812},\
"25":{"name":"Epsilon Indi","x":3.125,"y":-8.875,"z":7.125},\
"26":{"name":"Barnard\'s Star","x":-3.03125,"y":1.375,"z":4.9375},\
"27":{"name":"Epsilon Eridani","x":1.9375,"y":-7.75,"z":-6.84375},\
"28":{"name":"Narenses","x":-1.15625,"y":-11.0312,"z":21.875},\
"29":{"name":"Wolf 359","x":3.875,"y":6.46875,"z":-1.90625},\
"30":{"name":"LAWD 26","x":20.9062,"y":-7.5,"z":3.75},\
"31":{"name":"Avik","x":13.9688,"y":-4.59375,"z":-6.0},\
"32":{"name":"George Pantazis","x":-12.0938,"y":-16.0,"z":-14.2188}}'
lookup = json.loads(lookup)
lowest_total = 9999
# create 2D array for the distances and called it b to keep code looking clean.
b = [[0 for i in range(33)] for j in range(33)]
for x in range(33):
    for y in range(33):
        if x == y:
            continue
        else:
            b[x][y] = math.sqrt(((lookup[str(x)]["x"] - lookup[str(y)]['x']) ** 2) + ((lookup[str(x)]['y'] - lookup[str(y)]['y']) ** 2) + ((lookup[str(x)]['z'] - lookup[str(y)]['z']) ** 2))
# begin timer
start_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
start = datetime.datetime.now()
print("[{}] Start".format(start_date))
# main iteration loop
for x in range(50_000_000):
    distance = b[ip[0]][ip[1]] + b[ip[1]][ip[2]] + b[ip[2]][ip[3]] +\
               b[ip[3]][ip[4]] + b[ip[4]][ip[5]] + b[ip[5]][ip[6]] +\
               b[ip[6]][ip[7]] + b[ip[7]][ip[8]] + b[ip[8]][ip[9]] +\
               b[ip[9]][ip[10]] + b[ip[10]][ip[11]] + b[ip[11]][ip[12]] +\
               b[ip[12]][ip[13]] + b[ip[13]][ip[14]] + b[ip[14]][ip[15]] +\
               b[ip[15]][ip[16]] + b[ip[16]][ip[17]] + b[ip[17]][ip[18]] +\
               b[ip[18]][ip[19]] + b[ip[19]][ip[20]] + b[ip[20]][ip[21]] +\
               b[ip[21]][ip[22]] + b[ip[22]][ip[23]] + b[ip[23]][ip[24]] +\
               b[ip[24]][ip[25]] + b[ip[25]][ip[26]] + b[ip[26]][ip[27]] +\
               b[ip[27]][ip[28]] + b[ip[28]][ip[29]] + b[ip[29]][ip[30]] +\
               b[ip[30]][ip[31]] + b[ip[31]][ip[32]]
    if distance < lowest_total:
        lowest_total = distance
    ip = next_lexicographic_permutation(ip)
# end timer
finish_date = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
finish = datetime.datetime.now()
print("[{}] Finish".format(finish_date))
diff = finish - start
print("Time taken => {}".format(diff))
print("Lowest distance => {}".format(lowest_total))
This is the result of a lot of work to make things faster. I was initially using string look-ups to find the distance to be calculated, with a dict having keys like "1-2", but very quickly found out that it was very slow. I then moved on to hashed versions of the "1-2" key and the speed increased, but the fastest way I have found so far is using a 2D array and looking up the values from there.
I have also found that manually writing out the distance calculation saved time over having a for x in range(32): loop adding up the distances and incrementing a variable to get the total.
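For reference, the loop version being compared against presumably looked roughly like this (a hypothetical reconstruction, not from the original post):
# Slower variant: accumulate the 32 leg distances in a plain Python loop.
distance = 0.0
for k in range(32):
    distance += b[ip[k]][ip[k + 1]]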
Another great speed up was using pypy3 instead of python3 to execute it.
This usually takes 11 seconds to complete using pypy3
running 50 million of the distance calculation on its own takes 5.2 seconds
running 50 million of the next_lexicographic_permutation function on its own takes 6 seconds
I can't think of any further way to make this faster, but I believe there may be optimizations to be made in the next_lexicographic_permutation function. From what I've read, the main bottleneck seems to be the swapping of positions in the array:
x[i], x[j] = x[j], x[i]
Edit: clarified that "lifetime" means a human lifetime.
The brute-force approach of calculating all the distances is going to be slower than a partitioning approach. Here is a similar question for the 3D case.

Efficient solution of dataframe range filtering based on another ranges

I tried the following code to find the ranges of a dataframe that are not within the ranges of another dataframe. However, it takes more than a day to compute for the large files because, in the last two for-loops, it compares every row against every row. Each of my 24 dataframes has around 10^8 rows. Is there any efficient alternative to the following approach?
Please refer to this thread for a better understanding of my I/O: Return the range of a dataframe not within a range of another dataframe
My approach:
I created tuple pairs from (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) initially, in order to apply the range() function. After that, I checked whether each df1 range is within one of the df2 ranges or not. Here the edge case was df1['first.start'] == df1['first.end']. I collected the filtered indices from the iterations and then passed them into df1.
df2_lst = []
for i, j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)

df1_lst = []
for i, j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)

def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True   # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2

##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i], lst[i+1]

df1_lst2 = list(chunks(df1_lst, 2))
df2_lst2 = list(chunks(df2_lst, 2))

indices = []
for idx, i in enumerate(df1_lst2):  # main list
    x, y = i
    for j in df2_lst2:  # filter list
        m, n = j
        if (x != y) & (range_subset(range(x, y), range(m, n))):  # checking if the main range exists in the filter range or not
            indices.append(idx)  # collecting the filtered indices

df1.iloc[indices]
If n and m are the numbers of rows in df1 and df2, any algorithm needs to make at least n * m comparisons to check every range in df1 against every range in df2. The problem with your code as posted is that (a) it has too many intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcasting, which uses highly optimized C loops under the hood, it will be a lot faster.
The downside with numpy broadcasting is memory: it will create a comparison matrix of n * m bytes, and the size of your problem may run your computer out of memory. We can mitigate that by chunking df1 to trade performance for lower memory usage.
# Sample data
def random_dataframe(size):
    a = np.random.randint(1, 100, 2*size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })
n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)
# ---------------------------
# Prepare the Start and End time of df2 for comparison
# [:, None] raise the array by one dimension, which is necessary
# for array broadcasting
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]
# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset+chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset+chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset += chunk_size
mask = np.hstack(mask)
# ---------------------------
# If memory is not a concern, use the following code. However, this
# may run slower than the chunking approach due to increased size of
# the array broadcasting operation. Profile your code to find out.
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]
s1 = df1['first.start'].to_numpy()
e1 = df1['first.end'].to_numpy()
mask = ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
The chunking code took 30s on my computer. To access the result:
df1[mask] # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2
By tweaking the comparison, you can check for overlapping ranges too.
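For example, a minimal sketch of the overlap variant, reusing the unchunked s1, e1, s2 and e2 arrays from the last snippet: a range in df1 overlaps some range in df2 when it starts before that range ends and ends after that range starts.
# Overlap instead of full containment.
overlap_mask = ((s1 <= e2) & (e1 >= s2)).any(axis=0)
df1[overlap_mask]    # ranges in df1 that overlap at least one range in df2
df1[~overlap_mask]   # ranges in df1 that overlap none of them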

Faster way to simulate the crunch command behaviour on Python3.8

I'm trying to simulate what the crunch command does in Linux, with the difference that it yields the words instead of writing them to a file, and I came up with something like this:
def wordlist(chars, min, max=None):
    if max is None:  # Means that the user wants only a single length
        max = min
    length = len(chars)
    for n in range(min, max + 1):
        indexes = [0] * n
        for _ in range(length ** n):  # The number of chars to the power of the places to fill gives the number of words in the wordlist
            for m in range(1, len(indexes) + 1):  # This is the carrying system, as if indexes were a number instead of a list
                if indexes[-m] == length:
                    indexes[-m] = 0
                    indexes[-m - 1] += 1
            yield ''.join(chars[i] for i in indexes)
            indexes[-1] += 1
It's a bit crude, not very readable, and maybe not very performant either. Without using any external module like itertools, does someone have a better idea?
EDIT:
After a bit of struggling I have improved the math behind coming up to something like this:
def wordlist(chars, min, max=None):
    if max is None:
        max = min
    if min <= 0 or max <= 0:
        return
    base = len(chars)
    for n in range(min, max + 1):
        for m in range(base ** n):
            yield ''.join(chars[m // base ** (n - v - 1) % base] for v in range(n))
Anyway, I measured the time taken by each of the two functions and, while this new one is much more readable and prettier, the first one is still faster. I'm still waiting for better ideas from you.
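One direction worth measuring (a sketch, not a definitive implementation): an odometer-style generator that increments a list of digit indexes in place, which avoids recomputing the base ** (n - v - 1) powers for every character of every word.
def wordlist_odometer(chars, min_len, max_len=None):
    # Increment a list of digit indexes in place, like a car odometer.
    if max_len is None:
        max_len = min_len
    if min_len <= 0 or max_len <= 0:
        return
    base = len(chars)
    for n in range(min_len, max_len + 1):
        indexes = [0] * n
        while True:
            yield ''.join(chars[i] for i in indexes)
            # Carry from the rightmost digit leftwards.
            pos = n - 1
            while pos >= 0 and indexes[pos] == base - 1:
                indexes[pos] = 0
                pos -= 1
            if pos < 0:
                break
            indexes[pos] += 1

# e.g. list(wordlist_odometer('ab', 1, 2)) -> ['a', 'b', 'aa', 'ab', 'ba', 'bb']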

How to get near optimal parallel efficiency for this simple Julia code?

I have the following simple code:
function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    # minsofar = n
    Threads.@threads for i in 1:N
    # for i in 1:N
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < minsofar[Threads.threadid()]
                minsofar[Threads.threadid()] = dist
            end
        end
    end
    return minimum(minsofar)
    # return minsofar
end

function ave_min(n, N)
    ITER = 10
    strings = random_strings2(n, N)
    new_min = find_min(strings, n, N)
    avesofar = new_min
    # print("New min ", new_min, ". New ave ", avesofar, "\n")
    total = avesofar
    for i in 1:ITER-1
        strings = random_strings2(n, N)
        new_min = find_min(strings, n, N)
        avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
        print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
    end
    return avesofar
end

N = 2^16
n = 99
print("Overall average ", ave_min(n, N), "\n")
When I run it on an AMD 8350 in linux the CPU usage is around 430% (instead of close to 800%).
Is it possible to make the parallelisation work more efficiently?
Also, I noticed a new very impressive looking package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, is it possible to speed up the code this way too?
Can the code be vectorized using LoopVectorization.jl?
(I am completely new to Julia)
The parallelization of your code seems to be correct.
Most likely you are running it in Atom or another IDE. By default Atom uses only half of the cores (more precisely, only the physical cores, not the logical ones).
E.g. running in Atom on my machine:
julia> Threads.nthreads()
4
What you need to do is to explicitly set JULIA_NUM_THREADS
Windows command line (still assuming 8 logical cores)
set JULIA_NUM_THREADS=8
Linux command line
export JULIA_NUM_THREADS=8
After doing that, your code uses 100% of all my cores.
EDIT
After discussion: you can get the time down to around 20% of the single-threaded time by using Distributed instead of Threads, since this avoids memory sharing.
The code will look more or less like this:
using Distributed
addprocs(8)

@everywhere function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    return @distributed (min) for i in 1:N-1
        minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
    end
end

### ... the rest of the code remains unchanged

efficiently generating all integers within a binary mask

Suppose I have some binary mask mask. (e.g. 0b101011011101)
Is there an efficient method of computing all integers k such that k & mask == k? (where & is the bitwise AND operator) (alternatively, k & ~mask == 0)
If mask has m ones, then there are exactly 2^m numbers that satisfy this property, so it seems like there should be some kind of process that is O(2^m). Enumerating all the integers less than the mask is wasteful (though it is easy to eliminate values that do not apply).
I figured it out... you can identify all the single-bit patterns as follows, since the least significant 1 bit of any integer k is cleared when calculating k & (k-1):
def onebits(x):
    while x > 0:
        # find least significant 1 bit
        xprev = x
        x &= x - 1
        yield x ^ xprev
and then I can use the ruler function to XOR in various combinations of 1 bits to emulate which bits of a counter are toggled each time:
def maskcount(mask):
    maskbits = []
    m = 0
    for ls1 in onebits(mask):
        m ^= ls1
        maskbits.append(m)

    # ruler function modified from
    # http://lua-users.org/wiki/LuaCoroutinesVersusPythonGenerators
    def ruler(k):
        for i in range(k):
            yield i
            for x in ruler(i): yield x

    x = 0
    yield x
    for k in ruler(len(maskbits)):
        x ^= maskbits[k]
        yield x
which looks like this:
>>> for x in maskcount(0xc05):
... print format(x, '#016b')
0b00000000000000
0b00000000000001
0b00000000000100
0b00000000000101
0b00010000000000
0b00010000000001
0b00010000000100
0b00010000000101
0b00100000000000
0b00100000000001
0b00100000000100
0b00100000000101
0b00110000000000
0b00110000000001
0b00110000000100
0b00110000000101
An easy way to solve the problem is to find the bits that are set in mask, and then simply count with i, but then replacing the bits of i with corresponding bits from the mask.
def codes(mask):
    bits = filter(None, (mask & (1 << i) for i in xrange(mask.bit_length())))
    for i in xrange(1 << len(bits)):
        yield sum(b for j, b in enumerate(bits) if (i >> j) & 1)

print list(codes(39))
That gives you O(m) work per iteration, where m is the number of bits set in mask.
It's possible to be more efficient and do O(1) work per iteration by counting using Gray codes. With Gray code counting, only a single bit changes each iteration, so it's possible to efficiently update the current value, v. Obviously this is harder to understand than the simple solution above.
def codes(mask):
    bits = filter(None, (mask & (1 << i) for i in xrange(mask.bit_length())))
    blt = dict((1 << i, b) for i, b in enumerate(bits))
    p, v = 0, 0
    for i in xrange(1 << len(bits)):
        n = i ^ (i >> 1)
        v ^= blt.get(p^n, 0)
        p = n
        yield v

print list(codes(39))
A disadvantage of using gray codes is that the results are not returned in numeric order. But luckily that wasn't a condition in the question!
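For completeness, a common bit-twiddling idiom (not from the original answers) enumerates the same set with O(1) work per value, in descending numeric order:
def submasks(mask):
    # Classic (k - 1) & mask trick: subtracting 1 and re-masking only ever
    # touches bit positions that are set in mask.
    k = mask
    while True:
        yield k
        if k == 0:
            return
        k = (k - 1) & mask

print(sorted(submasks(39)))  # same values as codes(39), ascending after sorting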
