How to keep track of unique ranges in a 2D list of ranges? - python-3.x

I want to build a list of probability ranges for characters in a string, so I can do Arithmetic Coding on them.
Here's an example of what I'm looking to accomplish (from the tutorial/overview here):
a 30% [0.00, 0.30)
b 15% [0.30, 0.45)
c 25% [0.45, 0.70)
d 10% [0.70, 0.80)
e 20% [0.80, 1.00)
Expressed Pythonically / in the way my implementation does, this would look like:
[(0.00, 0.30), (0.30, 0.45), (0.45, 0.70), (0.70, 0.80), (0.80, 1.00)]
With an associated list of characters which match up with that list.
The ranges must be unique, and they must not collide. (Note that the upper bound of each range is actually an infinitely long list of 9's, because a given range's upper bound itself is occupied by the next range in the list.)
Here's my current implementation:
from decimal import Decimal, getcontext
getcontext().prec = 2
def _nodupsfreq(string):
"""list deduplicator"""
l, res = [], []
for ch in string:
if ch not in l:
l.append(ch)
res.append(string.count(ch))
return l, res
def getprobs(string):
"""return a set of probability ranges for a string"""
k, v = _nodupsfreq(string)
rs = [(0, 0)] # need a 0th element for first iteration (messy)
x = [] # construct the keys ensuring they match
for i in range(len(v)):
y = 0 if i == 0 else i - 1 # this is the reason for the 0th element
lower = rs[y][1]
upper = Decimal(lower) + Decimal(v[i] / len(string))
res = (lower, upper)
rs.append(res)
x.append(k[i])
return rs[1:], x # more messiness because of the first item
def probs_as_dict(string):
"""get a list of probability ranges as a dictionary"""
l, k = getprobs(string)
d = {}
for i in range(len(k)):
d[k[i]] = (float(l[i][0]), float(l[i][1]))
return d
m = "BILL GATES"
__import__("pprint").pprint(probs_as_dict(m))
In theory, it does what it says on the tin, but in practice the only way it keeps the ranges "unique" is by basing the range in the next iteration on the upper bound of the range in the last iteration, which is demonstrably flimsy, shown by the results:
{
' ': (0.1, 0.2),
'A': (0.2, 0.3),
'B': (0.0, 0.1), # occupied here!
'E': (0.3, 0.4), # occupied here!
'G': (0.3, 0.4), # junk
'I': (0.0, 0.1), # junk
'L': (0.1, 0.3),
'S': (0.5, 0.6),
'T': (0.4, 0.5)
}
Overlapping ranges of the same and different lengths.
Surely, I could hack away even more at my implementation and get it to be better at eliminating dupe ranges, or I could select a better method for generating probability ranges from a string in the first place.
Is there a better way to express the probability of a given character in a string, a better way to ensure unique and noncolliding multidimensional elements in a collection, or how should I fix my code?

To calculate the length of the probability range of a character, you can simply divide the number of occurrences of the character by the total length of the string.
These range lengths that you obtain this way are enough to describe the ranges, because you know that the first range starts at 0.0, and each successive range starts where the previous ends.
This way, there's no need to explicitly save the range boundaries, eliminating the possibility of collisions. If you need the boundaries in your computations, you can easily calculate them with a function.
So, in you example, you would just save an array of reals, like this:
[30.0, 15.0, 25.0, 10.0, 20.0]

Related

Probabilistic random selection [duplicate]

I needed to write a weighted version of random.choice (each element in the list has a different probability for being selected). This is what I came up with:
def weightedChoice(choices):
"""Like random.choice, but each element can have a different chance of
being selected.
choices can be any iterable containing iterables with two items each.
Technically, they can have more than two items, the rest will just be
ignored. The first item is the thing being chosen, the second item is
its weight. The weights can be any numeric values, what matters is the
relative differences between them.
"""
space = {}
current = 0
for choice, weight in choices:
if weight > 0:
space[current] = choice
current += weight
rand = random.uniform(0, current)
for key in sorted(space.keys() + [current]):
if rand < key:
return choice
choice = space[key]
return None
This function seems overly complex to me, and ugly. I'm hoping everyone here can offer some suggestions on improving it or alternate ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.
Since version 1.7.0, NumPy has a choice function that supports probability distributions.
from numpy.random import choice
draw = choice(list_of_candidates, number_of_items_to_pick,
p=probability_distribution)
Note that probability_distribution is a sequence in the same order of list_of_candidates. You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.
Since Python 3.6 there is a method choices from the random module.
In [1]: import random
In [2]: random.choices(
...: population=[['a','b'], ['b','a'], ['c','b']],
...: weights=[0.2, 0.2, 0.6],
...: k=10
...: )
Out[2]:
[['c', 'b'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['c', 'b']]
Note that random.choices will sample with replacement, per the docs:
Return a k sized list of elements chosen from the population with replacement.
Note for completeness of answer:
When a sampling unit is drawn from a finite population and is returned
to that population, after its characteristic(s) have been recorded,
before the next unit is drawn, the sampling is said to be "with
replacement". It basically means each element may be chosen more than
once.
If you need to sample without replacement, then as #ronan-paixão's brilliant answer states, you can use numpy.choice, whose replace argument controls such behaviour.
def weighted_choice(choices):
total = sum(w for c, w in choices)
r = random.uniform(0, total)
upto = 0
for c, w in choices:
if upto + w >= r:
return c
upto += w
assert False, "Shouldn't get here"
Arrange the weights into a
cumulative distribution.
Use random.random() to pick a random
float 0.0 <= x < total.
Search the
distribution using bisect.bisect as
shown in the example at http://docs.python.org/dev/library/bisect.html#other-examples.
from random import random
from bisect import bisect
def weighted_choice(choices):
values, weights = zip(*choices)
total = 0
cum_weights = []
for w in weights:
total += w
cum_weights.append(total)
x = random() * total
i = bisect(cum_weights, x)
return values[i]
>>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)])
'WHITE'
If you need to make more than one choice, split this into two functions, one to build the cumulative weights and another to bisect to a random point.
If you don't mind using numpy, you can use numpy.random.choice.
For example:
import numpy
items = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05]
elems = [i[0] for i in items]
probs = [i[1] for i in items]
trials = 1000
results = [0] * len(items)
for i in range(trials):
res = numpy.random.choice(items, p=probs) #This is where the item is selected!
results[items.index(res)] += 1
results = [r / float(trials) for r in results]
print "item\texpected\tactual"
for i in range(len(probs)):
print "%s\t%0.4f\t%0.4f" % (items[i], probs[i], results[i])
If you know how many selections you need to make in advance, you can do it without a loop like this:
numpy.random.choice(items, trials, p=probs)
As of Python v3.6, random.choices could be used to return a list of elements of specified size from the given population with optional weights.
random.choices(population, weights=None, *, cum_weights=None, k=1)
population : list containing unique observations. (If empty, raises IndexError)
weights : More precisely relative weights required to make selections.
cum_weights : cumulative weights required to make selections.
k : size(len) of the list to be outputted. (Default len()=1)
Few Caveats:
1) It makes use of weighted sampling with replacement so the drawn items would be later replaced. The values in the weights sequence in itself do not matter, but their relative ratio does.
Unlike np.random.choice which can only take on probabilities as weights and also which must ensure summation of individual probabilities upto 1 criteria, there are no such regulations here. As long as they belong to numeric types (int/float/fraction except Decimal type) , these would still perform.
>>> import random
# weights being integers
>>> random.choices(["white", "green", "red"], [12, 12, 4], k=10)
['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white']
# weights being floats
>>> random.choices(["white", "green", "red"], [.12, .12, .04], k=10)
['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green']
# weights being fractions
>>> random.choices(["white", "green", "red"], [12/100, 12/100, 4/100], k=10)
['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green']
2) If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence.
Specifying both weights and cum_weights raises a TypeError.
>>> random.choices(["white", "green", "red"], k=10)
['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green']
3) cum_weights are typically a result of itertools.accumulate function which are really handy in such situations.
From the documentation linked:
Internally, the relative weights are converted to cumulative weights
before making selections, so supplying the cumulative weights saves
work.
So, either supplying weights=[12, 12, 4] or cum_weights=[12, 24, 28] for our contrived case produces the same outcome and the latter seems to be more faster / efficient.
Crude, but may be sufficient:
import random
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))
Does it work?
# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]
# initialize tally dict
tally = dict.fromkeys(choices, 0)
# tally up 1000 weighted choices
for i in xrange(1000):
tally[weighted_choice(choices)] += 1
print tally.items()
Prints:
[('WHITE', 904), ('GREEN', 22), ('RED', 74)]
Assumes that all weights are integers. They don't have to add up to 100, I just did that to make the test results easier to interpret. (If weights are floating point numbers, multiply them all by 10 repeatedly until all weights >= 1.)
weights = [.6, .2, .001, .199]
while any(w < 1.0 for w in weights):
weights = [w*10 for w in weights]
weights = map(int, weights)
If you have a weighted dictionary instead of a list you can write this
items = { "a": 10, "b": 5, "c": 1 }
random.choice([k for k in items for dummy in range(items[k])])
Note that [k for k in items for dummy in range(items[k])] produces this list ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b']
Here's is the version that is being included in the standard library for Python 3.6:
import itertools as _itertools
import bisect as _bisect
class Random36(random.Random):
"Show the code included in the Python 3.6 version of the Random class"
def choices(self, population, weights=None, *, cum_weights=None, k=1):
"""Return a k sized list of population elements chosen with replacement.
If the relative weights or cumulative weights are not specified,
the selections are made with equal probability.
"""
random = self.random
if cum_weights is None:
if weights is None:
_int = int
total = len(population)
return [population[_int(random() * total)] for i in range(k)]
cum_weights = list(_itertools.accumulate(weights))
elif weights is not None:
raise TypeError('Cannot specify both weights and cumulative weights')
if len(cum_weights) != len(population):
raise ValueError('The number of weights does not match the population')
bisect = _bisect.bisect
total = cum_weights[-1]
return [population[bisect(cum_weights, random() * total)] for i in range(k)]
Source: https://hg.python.org/cpython/file/tip/Lib/random.py#l340
A very basic and easy approach for a weighted choice is the following:
np.random.choice(['A', 'B', 'C'], p=[0.3, 0.4, 0.3])
import numpy as np
w=np.array([ 0.4, 0.8, 1.6, 0.8, 0.4])
np.random.choice(w, p=w/sum(w))
I'm probably too late to contribute anything useful, but here's a simple, short, and very efficient snippet:
def choose_index(probabilies):
cmf = probabilies[0]
choice = random.random()
for k in xrange(len(probabilies)):
if choice <= cmf:
return k
else:
cmf += probabilies[k+1]
No need to sort your probabilities or create a vector with your cmf, and it terminates once it finds its choice. Memory: O(1), time: O(N), with average running time ~ N/2.
If you have weights, simply add one line:
def choose_index(weights):
probabilities = weights / sum(weights)
cmf = probabilies[0]
choice = random.random()
for k in xrange(len(probabilies)):
if choice <= cmf:
return k
else:
cmf += probabilies[k+1]
If your list of weighted choices is relatively static, and you want frequent sampling, you can do one O(N) preprocessing step, and then do the selection in O(1), using the functions in this related answer.
# run only when `choices` changes.
preprocessed_data = prep(weight for _,weight in choices)
# O(1) selection
value = choices[sample(preprocessed_data)][0]
If you happen to have Python 3, and are afraid of installing numpy or writing your own loops, you could do:
import itertools, bisect, random
def weighted_choice(choices):
weights = list(zip(*choices))[1]
return choices[bisect.bisect(list(itertools.accumulate(weights)),
random.uniform(0, sum(weights)))][0]
Because you can build anything out of a bag of plumbing adaptors! Although... I must admit that Ned's answer, while slightly longer, is easier to understand.
I looked the pointed other thread and came up with this variation in my coding style, this returns the index of choice for purpose of tallying, but it is simple to return the string ( commented return alternative):
import random
import bisect
try:
range = xrange
except:
pass
def weighted_choice(choices):
total, cumulative = 0, []
for c,w in choices:
total += w
cumulative.append((total, c))
r = random.uniform(0, total)
# return index
return bisect.bisect(cumulative, (r,))
# return item string
#return choices[bisect.bisect(cumulative, (r,))][0]
# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]
tally = [0 for item in choices]
n = 100000
# tally up n weighted choices
for i in range(n):
tally[weighted_choice(choices)] += 1
print([t/sum(tally)*100 for t in tally])
A general solution:
import random
def weighted_choice(choices, weights):
total = sum(weights)
treshold = random.uniform(0, total)
for k, weight in enumerate(weights):
total -= weight
if total < treshold:
return choices[k]
Here is another version of weighted_choice that uses numpy. Pass in the weights vector and it will return an array of 0's containing a 1 indicating which bin was chosen. The code defaults to just making a single draw but you can pass in the number of draws to be made and the counts per bin drawn will be returned.
If the weights vector does not sum to 1, it will be normalized so that it does.
import numpy as np
def weighted_choice(weights, n=1):
if np.sum(weights)!=1:
weights = weights/np.sum(weights)
draws = np.random.random_sample(size=n)
weights = np.cumsum(weights)
weights = np.insert(weights,0,0.0)
counts = np.histogram(draws, bins=weights)
return(counts[0])
It depends on how many times you want to sample the distribution.
Suppose you want to sample the distribution K times. Then, the time complexity using np.random.choice() each time is O(K(n + log(n))) when n is the number of items in the distribution.
In my case, I needed to sample the same distribution multiple times of the order of 10^3 where n is of the order of 10^6. I used the below code, which precomputes the cumulative distribution and samples it in O(log(n)). Overall time complexity is O(n+K*log(n)).
import numpy as np
n,k = 10**6,10**3
# Create dummy distribution
a = np.array([i+1 for i in range(n)])
p = np.array([1.0/n]*n)
cfd = p.cumsum()
for _ in range(k):
x = np.random.uniform()
idx = cfd.searchsorted(x, side='right')
sampled_element = a[idx]
There is lecture on this by Sebastien Thurn in the free Udacity course AI for Robotics. Basically he makes a circular array of the indexed weights using the mod operator %, sets a variable beta to 0, randomly chooses an index,
for loops through N where N is the number of indices and in the for loop firstly increments beta by the formula:
beta = beta + uniform sample from {0...2* Weight_max}
and then nested in the for loop, a while loop per below:
while w[index] < beta:
beta = beta - w[index]
index = index + 1
select p[index]
Then on to the next index to resample based on the probabilities (or normalized probability in the case presented in the course).
On Udacity find Lesson 8, video number 21 of Artificial Intelligence for Robotics where he is lecturing on particle filters.
Another way of doing this, assuming we have weights at the same index as the elements in the element array.
import numpy as np
weights = [0.1, 0.3, 0.5] #weights for the item at index 0,1,2
# sum of weights should be <=1, you can also divide each weight by sum of all weights to standardise it to <=1 constraint.
trials = 1 #number of trials
num_item = 1 #number of items that can be picked in each trial
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# gives number of times an item was selected at a particular index
# this assumes selection with replacement
# one possible output
# selected_item_arr
# array([[0, 0, 1]])
# say if trials = 5, the the possible output could be
# selected_item_arr
# array([[1, 0, 0],
# [0, 0, 1],
# [0, 0, 1],
# [0, 1, 0],
# [0, 0, 1]])
Now let's assume, we have to sample out 3 items in 1 trial. You can assume that there are three balls R,G,B present in large quantity in ratio of their weights given by weight array, the following could be possible outcome:
num_item = 3
trials = 1
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# selected_item_arr can give output like :
# array([[1, 0, 2]])
you can also think number of items to be selected as number of binomial/ multinomial trials within a set. So, the above example can be still work as
num_binomial_trial = 5
weights = [0.1,0.9] #say an unfair coin weights for H/T
num_experiment_set = 1
selected_item_arr = np.random.multinomial(num_binomial_trial, weights, num_experiment_set)
# possible output
# selected_item_arr
# array([[1, 4]])
# i.e H came 1 time and T came 4 times in 5 binomial trials. And one set contains 5 binomial trails.
let's say you have
items = [11, 23, 43, 91]
probability = [0.2, 0.3, 0.4, 0.1]
and you have function which generates a random number between [0, 1) (we can use random.random() here).
so now take the prefix sum of probability
prefix_probability=[0.2,0.5,0.9,1]
now we can just take a random number between 0-1 and use binary search to find where that number belongs in prefix_probability. that index will be your answer
Code will go something like this
return items[bisect.bisect(prefix_probability,random.random())]
One way is to randomize on the total of all the weights and then use the values as the limit points for each var. Here is a crude implementation as a generator.
def rand_weighted(weights):
"""
Generator which uses the weights to generate a
weighted random values
"""
sum_weights = sum(weights.values())
cum_weights = {}
current_weight = 0
for key, value in sorted(weights.iteritems()):
current_weight += value
cum_weights[key] = current_weight
while True:
sel = int(random.uniform(0, 1) * sum_weights)
for key, value in sorted(cum_weights.iteritems()):
if sel < value:
break
yield key
Using numpy
def choice(items, weights):
return items[np.argmin((np.cumsum(weights) / sum(weights)) < np.random.rand())]
I needed to do something like this really fast really simple, from searching for ideas i finally built this template. The idea is receive the weighted values in a form of a json from the api, which here is simulated by the dict.
Then translate it into a list in which each value repeats proportionally to it's weight, and just use random.choice to select a value from the list.
I tried it running with 10, 100 and 1000 iterations. The distribution seems pretty solid.
def weighted_choice(weighted_dict):
"""Input example: dict(apples=60, oranges=30, pineapples=10)"""
weight_list = []
for key in weighted_dict.keys():
weight_list += [key] * weighted_dict[key]
return random.choice(weight_list)
I didn't love the syntax of any of those. I really wanted to just specify what the items were and what the weighting of each was. I realize I could have used random.choices but instead I quickly wrote the class below.
import random, string
from numpy import cumsum
class randomChoiceWithProportions:
'''
Accepts a dictionary of choices as keys and weights as values. Example if you want a unfair dice:
choiceWeightDic = {"1":0.16666666666666666, "2": 0.16666666666666666, "3": 0.16666666666666666
, "4": 0.16666666666666666, "5": .06666666666666666, "6": 0.26666666666666666}
dice = randomChoiceWithProportions(choiceWeightDic)
samples = []
for i in range(100000):
samples.append(dice.sample())
# Should be close to .26666
samples.count("6")/len(samples)
# Should be close to .16666
samples.count("1")/len(samples)
'''
def __init__(self, choiceWeightDic):
self.choiceWeightDic = choiceWeightDic
weightSum = sum(self.choiceWeightDic.values())
assert weightSum == 1, 'Weights sum to ' + str(weightSum) + ', not 1.'
self.valWeightDict = self._compute_valWeights()
def _compute_valWeights(self):
valWeights = list(cumsum(list(self.choiceWeightDic.values())))
valWeightDict = dict(zip(list(self.choiceWeightDic.keys()), valWeights))
return valWeightDict
def sample(self):
num = random.uniform(0,1)
for key, val in self.valWeightDict.items():
if val >= num:
return key
Provide random.choice() with a pre-weighted list:
Solution & Test:
import random
options = ['a', 'b', 'c', 'd']
weights = [1, 2, 5, 2]
weighted_options = [[opt]*wgt for opt, wgt in zip(options, weights)]
weighted_options = [opt for sublist in weighted_options for opt in sublist]
print(weighted_options)
# test
counts = {c: 0 for c in options}
for x in range(10000):
counts[random.choice(weighted_options)] += 1
for opt, wgt in zip(options, weights):
wgt_r = counts[opt] / 10000 * sum(weights)
print(opt, counts[opt], wgt, wgt_r)
Output:
['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd']
a 1025 1 1.025
b 1948 2 1.948
c 5019 5 5.019
d 2008 2 2.008
In case you don't define in advance how many items you want to pick (so, you don't do something like k=10) and you just have probabilities, you can do the below. Note that your probabilities do not need to add up to 1, they can be independent of each other:
soup_items = ['pepper', 'onion', 'tomato', 'celery']
items_probability = [0.2, 0.3, 0.9, 0.1]
selected_items = [item for item,p in zip(soup_items,items_probability) if random.random()<p]
print(selected_items)
>>>['pepper','tomato']
Step-1: Generate CDF F in which you're interesting
Step-2: Generate u.r.v. u
Step-3: Evaluate z=F^{-1}(u)
This modeling is described in course of probability theory or stochastic processes. This is applicable just because you have easy CDF.

Foobar Lucky Triple

I am trying to solve the following problem:
Write a function solution(l) that takes a list of positive integers l and counts the number of "lucky triples" of (li, lj, lk) where the list indices meet the requirement i < j < k. The length of l is between 2 and 2000 inclusive. A "lucky triple" is a tuple (x, y, z) where x divides y and y divides z, such as (1, 2, 4). The elements of l are between 1 and 999999 inclusive. The solution fits within a signed 32-bit integer. Some of the lists are purposely generated without any access codes to throw off spies, so if no triples are found, return 0.
For example, [1, 2, 3, 4, 5, 6] has the triples: [1, 2, 4], [1, 2, 6], [1, 3, 6], making the solution 3 total.
My solution only passes the first two tests; I am trying to understand what it is wrong with my approach rather then the actual solution. Below is my function for reference:
def my_solution(l):
from itertools import combinations
if 2<len(l)<=2000:
l = list(combinations(l, 3))
l= [value for value in l if value[1]%value[0]==0 and value[2]%value[1]==0]
#l= [value for value in l if (value[1]/value[0]).is_integer() and (value[2]/value[1]).is_integer()]
if len(l)<0xffffffff:
l= len(l)
return l
else:
return 0
If you do nested iteration of the full list and remaining list, then compare the two items to check if they are divisors... the result counts as the beginning and middle numbers of a 'triple',
then on the second round it will calculate the third... All you need to do is track which ones pass the divisor test along the way.
For Example
def my_solution(l):
row1, row2 = [[0] * len(l) for i in range(2)] # Tracks which indices pass modulus
for i1, first in enumerate(l):
for i2 in range(i1+1, len(l)): # iterate the remaining portion of the list
middle = l[i2]
if not middle % first: # check for matches
row1[i2] += 1 # increment the index in the tracker lists..
row2[i1] += 1 # for each matching pair
result = sum([row1[i] * row2[i] for i in range(len(l))])
# the final answer will be the sum of the products for each pair of values.
return result

Combination of elements in a list leading to a sum K

I have been struggling with a programming problem lately, and I would appreciate any help that I could get.
Basically, the input is a list of numbers (both positive and negative, also, note that the numbers can repeat within the list), and I want to find the combinations of the numbers that lead upto a sum K.
For example,
List - [1,2,3,6,5,-2]
Required sum - 8
Output - Many combinations like : [5,3], [5,2,1], [6,3,1,-2]... and so on
I do understand that there are solutions available to this problem using module functions like “itertools.combinations” or using recursion - subset of sum (works efficiently only for positive numbers on the list), but I am looking for an efficient solution in python3 for lists upto a 100 numbers.
Any and every help would be appreciated.
You are trying to compute the sum over all subsets, not combinations.
This will never be efficient for 100 numbers (atleast in no way I know, either in C++ or in Python), because you need to actually compute all sums, and then filter.
This takes O(n*2^n) time using a method called Dynamic Programming - Sum over Subsets (DP-SOS). The following is the most efficient code possible:
a, k = [1, 2, 3, 6, 5, -2], 8
dp = [0] * (1 << len(a))
for idx, val in enumerate(a):
dp[1 << idx] = a[idx]
for i in range(len(a)):
for mask in range(len(dp)):
if mask & (1 << i):
dp[mask] += dp[mask^(1<<i)]
answer = []
for pos, sum in enumerate(dp):
if sum == k:
answer.append([val for idx, val in enumerate(a)
if (1 << idx) & pos])
Using itertools seems unnecessary to me, though you might get an answer in almost the same (or slightly longer) time, because the bitwise operators even in python are pretty fast.
Nevertheless, using itertools:
from itertools import chain, combinations
a, k = [1, 2, 3, 6, 5, -2], 8
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
answer = list(filter(lambda x: sum(x) == k, powerset(a)))
If you want to scale to 100 numbers, there might not be an exact solution, you would need to use heuristics and not compute the answer explicitly, because if the array is [8, 0, 0, ...(100 times)] then there are 2^99 subsets which you can anyways never compute explicitly or store.

Roll of different amount along a single axis in a 3D matrix [duplicate]

I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print np.array([np.roll(row, x) for row,x in zip(A, r)])
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure you can do it using advanced indexing, whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use module operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous - np.lib.stride_tricks.as_strided. The idea/trick would be to get a sliced portion starting from the first column until the second last one and concatenate at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need of actually rolling back. That's the whole idea!
Now, in terms of actual implementation we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hoods. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
# Concatenate with sliced to cover all rolls
a_ext = np.concatenate((a,a[:,:-1]),axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = a.shape[1]
return viewW(a_ext,(1,n))[np.arange(len(r)), (n-r)%n,0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
Benchmarking
# #seberg's solution
def advindexing_roll(A, r):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
return A[rows, column_indices]
Let's do some benchmarking on an array with large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# #seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want more general solution (dealing with any shape and with any axis), I modified #seberg's solution:
def indep_roll(arr, shifts, axis=1):
"""Apply an independent roll for each dimensions of a single axis.
Parameters
----------
arr : np.ndarray
Array of any shape.
shifts : np.ndarray
How many shifting to use for each dimension. Shape: `(arr.shape[axis],)`.
axis : int
Axis along which elements are shifted.
"""
arr = np.swapaxes(arr,axis,-1)
all_idcs = np.ogrid[[slice(0,n) for n in arr.shape]]
# Convert to a positive shift
shifts[shifts < 0] += arr.shape[-1]
all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
result = arr[tuple(all_idcs)]
arr = np.swapaxes(result,-1,axis)
return arr
I implement a pure numpy.lib.stride_tricks.as_strided solution as follows
from numpy.lib.stride_tricks import as_strided
def custom_roll(arr, r_tup):
m = np.asarray(r_tup)
arr_roll = arr[:, [*range(arr.shape[1]),*range(arr.shape[1]-1)]].copy() #need `copy`
strd_0, strd_1 = arr_roll.strides
n = arr.shape[1]
result = as_strided(arr_roll, (*arr.shape, n), (strd_0 ,strd_1, strd_1))
return result[np.arange(arr.shape[0]), (n-m)%n]
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
Out[789]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
By using a fast fourrier transform we can apply a transformation in the frequency domain and then use the inverse fast fourrier transform to obtain the row shift.
So this is a pure numpy solution that take only one line:
import numpy as np
from numpy.fft import fft, ifft
# The row shift function using the fast fourrier transform
# rshift(A,r) where A is a 2D array, r the row shift vector
def rshift(A,r):
return np.real(ifft(fft(A,axis=1)*np.exp(2*1j*np.pi/A.shape[1]*r[:,None]*np.r_[0:A.shape[1]][None,:]),axis=1).round())
This will apply a left shift, but we can simply negate the exponential exponant to turn the function into a right shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like that:
# Example:
A = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
r = np.array([1,-1,3])
print(rshift(A,r))
Building on divakar's excellent answer, you can apply this logic to 3D array easily (which was the problematic that brought me here in the first place). Here's an example - basically flatten your data, roll it & reshape it after::
def applyroll_30(cube, threshold=25, offset=500):
flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
rolled_cube = triggered_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
return rolled_cube
def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
""" Calculates the number of position along time axis we need to shift
elements in order to trig the data.
We return a 1D numpy array of shape (X*Y, time) elements
"""
# armax(...) finds the position in the cube (3d) where we are above threshold
roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
# ensure we don't have index out of bound
roll_matrix[roll_matrix>cube_flattened.shape[1]] = cube_flattened.shape[1]
return roll_matrix
def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
# Concatenate with sliced to cover all rolls
# otherwise we shift in the wrong direction for my application
roll_matrix_flattened = -1 * roll_matrix_flattened
a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
# Get sliding windows; use advanced-indexing to select appropriate ones
n = cube_flattened.shape[1]
result = viewW(a_ext,(1,n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
result = result.reshape(cube_shape)
return result
Divakar's answer doesn't do justice to how much more efficient this is on large cube of data. I've timed it on a 400x400x2000 data formatted as int8. An equivalent for-loop does ~5.5seconds, Seberg's answer ~3.0seconds and strided_indexing.... ~0.5second.

How can I compare two arrays with different sizes but with some floats that are approximate? [Python3]

How can I compare two arrays with different sizes but with some floats that are approximate? For example:
# I have two arrays
a = np.array( [-2.83, -2.54, ..., 0.05, ..., 2.54, 2.83] )
b = np.array( [-3.0, -2.9, -2.8, ..., -0.1, 0.0, 0.1, ..., 2.9, 3.0] )
# wherein len( b ) > len( a )
What I need is the index where (considering those two values from both lists)
math.isclose( -2.54, -2.5, rel_tol=1e-1) == True
The answer that I need is something like
list_of_index_of_b = [1, 5, ..., -2]
Here list_of_index_of_b is a list with the "coordinates" where that specific element of b is approximate to some element of a. Not all ellements of a have an approximate in b. Also:
len(list_of_index_of_b) == len(a)
You can use broadcasting. This creates an array of the pairwise differences between every element in a and b which you can then check against the specified tolerance.
Of course this is computationally inefficient from a complexity standpoint since you construct an array of size |a|*|b| and compare every pairwise distance against the tolerance, even if once of the differences is already small enough. That said, if one of |a| or |b| is relatively small, then then this approach can be quite fast since it is pure numpy and requires no loops.
a = np.array([1,5,6,7])
b = np.array([1.1,2,3,4.8,4.9,5,8])
rtol = 0.15
diff = a - b[:,None]
mask2d = (1/np.abs(b))*np.abs(a - b[:,None]) < rtol
mask = np.any(mask2d,axis=1)
This can be combined to a single line:
indices = np.where(np.any((1/np.abs(b))*np.abs(a-b[:,None]) < rtol,axis=1))

Resources