Rust: BTreeMap equality+inequality range query - pagination

I want to call range on a BTreeMap, where the keys are tuples like (a,b). Say we have:
(1, 2) => "a"
(1, 3) => "b"
(1, 4) => "c"
(2, 1) => "d"
(2, 2) => "e"
(2, 3) => "f"
The particular requirement is that I want all the entries that have a specific value for the first field, but a range on the second field, i.e. I want all entries where a = 1 AND 1 < b <= 4. The RangeBounds value in that case is not too complicated: it would be (Excluded((1, 1)), Included((1, 4))). If I have an unbounded range, say a = 1 AND b > 3, we would have the following RangeBounds: (Excluded((1, 3)), Included((1, i64::max_value()))).
The problem arises when the type inside the tuple does not have a maximum value, for instance a string (CStr specifically). Is there a way to solve that problem? It would be useful to be able to use Unbounded inside the tuple, but I don't think that's possible. The less interesting solution would be to have multiple layers of data structures (for instance a hashmap for the first field, where keys map to... a BTreeMap). Any thoughts?

If the first field of your tuple is an integer type, then you can use an exclusive bound on the next integer value, paired with an empty CStr. (I'm assuming that <&CStr>::default() is the "smallest" value in &CStr's total order.)
let range = my_btree_map.range((Excluded((1, some_cstr)), Excluded((2, <&CStr>::default()))));
If the first field is of a type for which it is difficult or impossible to obtain the "next greater value", then a combination of range and take_while will give the correct results, though with a little overhead.
let range = my_btree_map
    .range((Excluded((1, some_cstr)), Unbounded))
    .take_while(|&((i, _), _)| *i == 1);
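The take_while idea carries over to ordered maps in other languages too. As a rough Python sketch (assuming the third-party sortedcontainers package, whose SortedDict.irange supports exclusive bounds), querying a == 1 AND b > 2 looks like this:

from itertools import takewhile
from sortedcontainers import SortedDict  # third-party: pip install sortedcontainers

d = SortedDict({(1, 2): "a", (1, 3): "b", (1, 4): "c",
                (2, 1): "d", (2, 2): "e", (2, 3): "f"})

# Start just past (1, 2) with no upper bound, then stop as soon as
# the first field of the key changes.
keys = takewhile(lambda k: k[0] == 1,
                 d.irange((1, 2), inclusive=(False, True)))
print([(k, d[k]) for k in keys])  # [((1, 3), 'b'), ((1, 4), 'c')]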

Related

Conditionally use parts of a nested for loop

I've searched extensively for an answer, but can't seem to find one. Therefore, for the first time, I am posting a question here.
I have a function that uses many parameters to perform a calculation. Based on user input, I want to iterate through possible values for some (or all) of the parameters. If I wanted to iterate through all of the parameters, I might do something like this:
for i in range(low1, high1):
    for j in range(low2, high2):
        for k in range(low3, high3):
            for m in range(low4, high4):
                doFunction(i, j, k, m)
If I only wanted to iterate the 1st and 4th parameter, I might do this:
for i in range(low1, high1):
    for m in range(low4, high4):
        doFunction(i, user_input_j, user_input_k, m)
My actual code has almost 15 nested for-loops with 15 different parameters - each of which could be iterable (or not). So, it isn't scalable for me to use what I have and code a unique block of for-loops for each combination of a parameter being iterable or not. If I did that, I'd have 2^15 different blocks of code.
I could do something like this:
if use_static_j == True:
    low2 = -999
    high2 = -998  # dummy range that runs the j loop exactly once
for i in range(low1, high1):
    for j in range(low2, high2):
        for k in range(low3, high3):
            for m in range(low4, high4):
                j1 = j if use_static_j == False else user_input_j
                doFunction(i, j1, k, m)
I'd just like to know if there is a better way. Perhaps using filter(), map(), or list comprehension... (which I don't have a clear enough understanding of yet)
As suggested in the comments, you could build an array of the parameters and then call the function with each of the values in the array. The easiest way to build the array is using recursion over a list defining the ranges for each parameter. In this code I've assumed a list of tuples consisting of start, stop and scale parameters (so for example the third element in the list produces [3, 2.8, 2.6, 2.4, 2.2]). To use a static value you would use a tuple (static, static+1, 1).
def build_param_array(ranges):
    r = ranges[0]
    if len(ranges) == 1:
        return [[p * r[2]] for p in range(r[0], r[1], -1 if r[1] < r[0] else 1)]
    res = []
    for p in range(r[0], r[1], -1 if r[1] < r[0] else 1):
        pa = build_param_array(ranges[1:])
        for a in pa:
            res.append([p * r[2]] + a)
    return res
# range = (start, stop, scale)
ranges = [(1, 5, 1),
          (0, 10, .1),
          (15, 10, .2)]

params = build_param_array(ranges)
for p in params:
    doFunction(*p)
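If the recursion feels heavyweight, the standard library's itertools.product builds the same cross product of parameter values, and it does so lazily, so the full parameter array never has to sit in memory. A sketch under the same (start, stop, scale) convention (calling the question's doFunction); a static parameter is simply a range that yields one value:

from itertools import product

def param_iter(ranges):
    # One scaled value list per (start, stop, scale) tuple, counting
    # down when stop < start, just like build_param_array.
    axes = [[p * scale for p in range(start, stop, -1 if stop < start else 1)]
            for start, stop, scale in ranges]
    return product(*axes)  # lazily yields every combination

ranges = [(1, 5, 1), (0, 10, .1), (15, 10, .2)]
for p in param_iter(ranges):
    doFunction(*p)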

Duplicate tuples inside a tuple

I am working on a project for university and I am stuck.
There is a tuple made out of several positions, these positions being represented as tuples.
So, let's call this tuple "positions".
positions = ((2, 1), (2, 2), (1, 1), (2, 1))
This is an example of what positions could look like.
I am supposed to check whether any position (small tuple) is repeated inside the tuple holding all positions (big tuple); if one is, the function should return False.
In this example, there is a position that is repeated.
I tried using for loops. I really have no clue on how else to do it.
def positions_func(positions):
    for i in range(len(positions)):
        for j in range(len(positions)):
            if positions[i] == positions[:j]:
                return False
The error I get is "tuple index out of range", which proves I am doing it wrong.
Two easy ways, depending on what you need to do next one may be better than the other.
One, turn the big tuple into a set and compare their lengths:
if len(positions) != len(set(positions)):
    print("There were duplicates.")
Or with collections.Counter, e.g. if you need to know which one was duplicate:
from collections import Counter

counts = Counter(positions)
for item, count in counts.most_common():
    print(item, "occurred", count, "times.")
    if count > 1:
        print("(so there was a duplicate)")
I think it's happening because of the colon you added with the j index: positions[:j] is a slice of the tuple, not a single element.
def positions_func(positions):
    for i in range(len(positions)):
        for j in range(len(positions)):
            if positions[i] == positions[j] and i != j:
                return False
    return True
Try the above code. It compares every pair of positions and returns False if any two are identical (and True otherwise).
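For comparison, the set-length idea from the first answer fits the question's required shape (False when any position repeats) in two lines. A minimal sketch:

def positions_func(positions):
    # Duplicates collapse when the tuple becomes a set, so any
    # shrinkage in length means some position repeated.
    return len(positions) == len(set(positions))

positions = ((2, 1), (2, 2), (1, 1), (2, 1))
print(positions_func(positions))  # False: (2, 1) appears twice

This is also O(n) rather than the O(n^2) of the nested loops.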

Setting an array element with a list

I'd like to create a numpy array with 3 columns (not really), the last of which will hold lists of variable length (really).
import random
import numpy

N = 2
A = numpy.empty((N, 3))
for i in range(N):
    a = random.uniform(0, 1/2)
    b = random.uniform(1/2, 1)
    c = []
    A[i,] = [a, b, c]
Over the course of execution I will then append or remove items from the lists. I used numpy.empty to initialize the array since this is supposed to give an object type; even so, I'm getting the "setting an array element with a sequence" error. I know I am; that's what I want to do.
Previous questions on this topic seem to be about avoiding the error; I need to circumvent the error. The real array has 1M+ rows, otherwise I'd consider a dictionary. Ideas?
Initialize A with
A = numpy.empty((N, 3), dtype=object)
per the numpy.empty docs. This is more logical than A = numpy.empty((N, 3)).astype(object), which first creates an array of floats (the default data type) and only then casts it to object type.
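To illustrate, here is a sketch of the question's loop with the object dtype; assigning each cell individually sidesteps any ambiguity about how numpy should broadcast the mixed-type row:

import random
import numpy

N = 2
A = numpy.empty((N, 3), dtype=object)  # each cell holds an arbitrary Python object
for i in range(N):
    A[i, 0] = random.uniform(0, 1/2)
    A[i, 1] = random.uniform(1/2, 1)
    A[i, 2] = []                       # a fresh, independent list per row

A[0, 2].append("item")                 # the lists can grow or shrink in place later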

Find distinct values for each column in an RDD in PySpark

I have an RDD that is both very long (a few billion rows) and decently wide (a few hundred columns). I want to create sets of the unique values in each column (these sets don't need to be parallelized, as they will contain no more than 500 unique values per column).
Here is what I have so far:
data = sc.parallelize([["a", "one", "x"], ["b", "one", "y"], ["a", "two", "x"], ["c", "two", "x"]])
num_columns = len(data.first())
empty_sets = [set() for index in xrange(num_columns)]
d2 = data.aggregate((empty_sets), (lambda a, b: a.add(b)), (lambda x, y: x.union(y)))
What I am doing here is trying to initialize a list of empty sets, one for each column in my RDD. For the first part of the aggregation, I want to iterate row by row through data, adding the value in column n to the nth set in my list of sets. If the value already exists, it doesn't do anything. Afterwards, it performs the union of the sets so only distinct values are returned across all partitions.
When I try to run this code, I get the following error:
AttributeError: 'list' object has no attribute 'add'
I believe the issue is that I am not accurately making it clear that I am iterating through the list of sets (empty_sets) and that I am iterating through the columns of each row in data. I believe in (lambda a, b: a.add(b)) that a is empty_sets and b is data.first() (the entire row, not a single value). This obviously doesn't work, and isn't my intended aggregation.
How can I iterate through my list of sets, and through each row of my dataframe, to add each value to its corresponding set object?
The desired output would look like:
[set(['a', 'b', 'c']), set(['one', 'two']), set(['x', 'y'])]
P.S. I've looked at this example here, which is extremely similar to my use case (it's where I got the idea to use aggregate in the first place). However, I find the code very difficult to convert into PySpark, and I'm very unclear on what the case and zip code is doing.
There are two problems. One, your combiner functions assume each row is a single set, but you're operating on a list of sets. Two, add doesn't return anything (try a = set(); b = a.add('1'); print b), so your first combiner function returns a list of Nones. To fix this, make your first combiner function non-anonymous and have both of them loop over the lists of sets:
def set_plus_row(sets, row):
    for i in range(len(sets)):
        sets[i].add(row[i])
    return sets

unique_values_per_column = data.aggregate(
    empty_sets,
    set_plus_row,  # can't be a lambda b/c add doesn't return anything
    lambda x, y: [a.union(b) for a, b in zip(x, y)]
)
I'm not sure what zip does in Scala, but in Python, it takes two lists and puts each corresponding element together into tuples (try x = [1, 2, 3]; y = ['a', 'b', 'c']; print zip(x, y);) so you can loop over two lists simultaneously.
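If you want to convince yourself before running this on billions of rows, the same two functions can be exercised locally: aggregate essentially folds the sequence op over each partition and then folds the combine op over the partial results. A sketch (no Spark needed):

from functools import reduce

rows = [["a", "one", "x"], ["b", "one", "y"],
        ["a", "two", "x"], ["c", "two", "x"]]

def set_plus_row(sets, row):
    for i in range(len(sets)):
        sets[i].add(row[i])
    return sets

def union_sets(x, y):
    return [a.union(b) for a, b in zip(x, y)]

# Pretend the four rows landed in two partitions of two rows each.
part1 = reduce(set_plus_row, rows[:2], [set() for _ in range(3)])
part2 = reduce(set_plus_row, rows[2:], [set() for _ in range(3)])
print(union_sets(part1, part2))
# [set(['a', 'b', 'c']), set(['one', 'two']), set(['x', 'y'])]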

SML - Find same elements in a string

Sorry but I need your help again!
I need a function which takes as input a
list of (string*string)
and returns as output a
list of (int*int)
The function should transform the list so that every occurrence of the same string is replaced by the same number.
For example if I insert:
changState([("s0","l0"),("l0","s1"),("s1","l1"),("l1","s0")]);
the output should be:
val it = [(0,1),(1,2),(2,3),(3,0)]
Is there someone who has an idea to solve this problem?
I would really appreciate that. Thanks a lot!
Here is one way:
structure StringKey =
struct
  type ord_key = string
  val compare = String.compare
end

structure Map = RedBlackMapFn(StringKey)

fun changState l =
  let fun lp(l, map, max) =
        case l
          of nil => nil
           | (s1,s2)::l' =>
             case (Map.find(map, s1), Map.find(map, s2))
               of (SOME i1, SOME i2) => (i1, i2)::lp(l', map, max)
                | (NONE, SOME i) => (max, i)::lp(l', Map.insert(map, s1, max), max+1)
                | (SOME i, NONE) => (i, max)::lp(l', Map.insert(map, s2, max), max+1)
                | (NONE, NONE) =>
                  if s1 <> s2
                  then (max, max+1)::lp(l', Map.insert(Map.insert(map, s1, max), s2, max+1), max+2)
                  else (max, max)::lp(l', Map.insert(map, s1, max), max+1)
  in lp(l, Map.empty, 0) end
Here lp takes the list of string pairs, a map which relates strings to integers, and a variable max, which keeps track of the next unused number. In each iteration, we look up the two strings in the map. If a string is found, we return its integer; otherwise, we use the next available one. In the last case, where neither string exists in the map, we need to check whether the strings are equal: if your input were [("s0", "s0")], we would expect [(0, 0)]. If they are equal, we map them to the same number; otherwise we create distinct ones.
If you are not familiar with functors, the first 5 lines are creating a structure that satisfies the ORD_KEY signature. You can find more details in the documentation at
http://www.smlnj.org/doc/smlnj-lib/Manual/ord-map.html
http://www.smlnj.org/doc/smlnj-lib/Manual/ord-key.html#ORD_KEY:SIG:SPEC
By applying the RedBlackMapFn functor to the StringKey structure, it creates a new map structure (implemented as a red-black tree) that maps strings to values of type 'a.
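If SML functors are unfamiliar, the renumbering idea itself is small enough to sketch in Python, where a dict plays the role of the red-black map and the next unused number is simply the dict's current size:

def chang_state(pairs):
    ids = {}  # maps each string to the number it was assigned
    def num(s):
        if s not in ids:
            ids[s] = len(ids)  # next unused number
        return ids[s]
    return [(num(a), num(b)) for a, b in pairs]

print(chang_state([("s0", "l0"), ("l0", "s1"), ("s1", "l1"), ("l1", "s0")]))
# [(0, 1), (1, 2), (2, 3), (3, 0)]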
