How to keep test instance IDs apart in Django? - python-3.x

There's a bug in the following code:
def test_should_filter_foobar_by_foo_id(self) -> None:
    first_foo = Foo.objects.create(name="a")
    second_foo = Foo.objects.create(name="b")
    first_foobar = FooBar.objects.create(foo=first_foo)
    second_foobar = FooBar.objects.create(foo=second_foo)
    results = FooBarFilter(data={"foo": first_foobar.id})
    self.assertListEqual([first_foobar], list(results.qs.order_by("id")))
The test passes for the wrong reason: when starting from a newly initialized DB, the Foo.id and FooBar.id sequences both start at 1. first_foobar.id should be first_foo.id. This is very similar to real-world tests I've written, where the sequences running in lockstep have caused misleading test implementations.
Is there a (cheap, simple) way to do something similar to Django's reset_sequences to jolt all the sequences out of lockstep? Whether randomly or by adding 1 million to the first sequence, 2 million to the next etc., or some other method. A few suboptimal options come to mind:
- Create dummy instances at the start of the test to move the sequences out of lockstep.
  - Prohibitively expensive.
  - Error prone, because we'd have to keep track of how many instances of each model we create.
  - Adds irrelevant complexity to the tests.
- Change the current values of each sequence as part of test setup (see the sketch after this list).
  - Error prone, because we'd have to make sure to change all the relevant sequences as we add more models to the tests.
  - Adds irrelevant complexity to the tests.
- Swap around the order in which we create the instances in the test, so that first_foo has ID 2 and second_foo has ID 1.
  - Adds irrelevant complexity to the tests.
  - Once we have three or more models in the test, at least two of them will be in lockstep, so we'd need to complement this with some other technique.
- Modify the IDs of each instance after saving, thereby bypassing the sequences entirely.
  - Again, adds irrelevant complexity to the tests.
  - Error prone, since it would now be easy to accidentally end up with ID collisions.
- Change the current values of each sequence as part of the template DB. I'm not sure how to do this, nor how expensive it would be.
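For the "change the sequences as part of test setup" option, a rough sketch of what that could look like (assuming PostgreSQL and the Foo/FooBar models from the test above; the test-case name and the million-sized offsets are just placeholders):

from django.db import connection
from django.test import TestCase


class FooBarFilterTests(TestCase):
    def setUp(self) -> None:
        # Push each model's ID sequence to a different starting point so that
        # Foo.id and FooBar.id can no longer coincide by accident.
        with connection.cursor() as cursor:
            for offset, model in enumerate([Foo, FooBar], start=1):
                cursor.execute(
                    "SELECT setval(pg_get_serial_sequence(%s, 'id'), %s)",
                    [model._meta.db_table, offset * 1_000_000],
                )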

Related

How to check internal size parameters that control example generation

In PropEr, there's an internal variable called Size that represents the size of a generated example.
For instance, when we have two variables and would like to make them proportional to each other, PropEr lets you write the following test:
prop_profile2() ->
    ?FORALL(Profile,
            [{name, string()},
             {age, pos_integer()},
             {bio, ?SIZED(Size, resize(Size*35, string()))}],
            begin
                NameLen = to_range(10, length(proplists:get_value(name, Profile))),
                BioLen = to_range(300, length(proplists:get_value(bio, Profile))),
                aggregate([{name, NameLen}, {bio, BioLen}], true)
            end).
In this test, the internal variable Size holds the internal size of string() (the string value generator), so what ?SIZED(Size, resize(Size*35, string())) does here is make this part 35 times larger than the string() next to the name atom.
I tried to do something similar with Hypothesis, but what I could come up with was the following:
from hypothesis.strategies import DrawFn, composite, integers, text

@composite
def profiles(draw: DrawFn):
    name = draw(text(max_size=10))
    name_len = len(name)
    age = draw(integers(min_value=1, max_value=150))
    bio_len = 35 * name_len
    bio = draw(text(min_size=bio_len, max_size=bio_len))
    return Profile(name, age, bio)
Are there any other smarter ways to have proportional sizes among multiple variables?
If the code you're testing actually requires these proportions, your test looks good, though I think a direct translation of your PropEr code would have bio = draw(text(min_size=bio_len, max_size=max(bio_len, 300))). As-is, you'd never be able to find bugs that need a bio whose length isn't exactly 35 times the name's length - short bios, and so on.
This points to a more fundamental question: why do you want proportional inputs in the first place?
Hypothesis style is to just express the limits of your allowed input directly - builds(name=text(max_size=10), age=integers(1, 150), bio=text(max_size=300)) - and let the framework give you diverse and error-inducing inputs from whatever weird edge case it finds. (Note: if you're checking the distribution, do so over 10,000+ examples - each run of 100 won't look very diverse.)
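A runnable sketch of that style, assuming your Profile class accepts these keyword arguments (the test name and body are placeholders):

from hypothesis import given, strategies as st

profile_strategy = st.builds(
    Profile,
    name=st.text(max_size=10),
    age=st.integers(min_value=1, max_value=150),
    bio=st.text(max_size=300),
)

@given(profile_strategy)
def test_profile_handling(profile):
    ...  # exercise the code under test with whatever Profile it hands you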
PropEr style often adds further constraints on the inputs, in order to produce more realistic data or guide generation to particular areas of interest. I think this is a mistake: my goal is not to generate realistic data, it's to maximize the probability that I find a bug - ignoring part of the input space can only hurt - and then to minimize the expected time to do so (a topic too large for this answer).

Recursive methods on CUDD

This is a follow-up to a suggestion by @DCTLib in the post below.
Cudd_PrintMinterm, accessing the individual minterms in the sum of products
I've been pursuing part (b) of the suggestion and will share some pseudo-code in a separate post.
Meanwhile, in his part (b) suggestion, @DCTLib posted a link to https://github.com/VerifiableRobotics/slugs/blob/master/src/BFAbstractionLibrary/BFCudd.cpp. I've been trying to read this program. There is a recursive function in the classic Somenzi paper, Binary Decision Diagrams, which describes an algorithm to compute the number of satisfying assignments (below, Fig. 7). I've been trying to compare the two, slugs and Fig. 7, but I'm having a hard time seeing any similarities. Then again, C is mostly inscrutable to me. Do you know if the slugs BFCudd code is based on Somenzi's Fig. 7, @DCTLib?
Thanks,
Gui
It's not exactly the same algorithm.
There are two main differences:
First, the "SatHowMany" function does not take a cube of variables to consider for counting; rather, it considers all variables. The fact that "recurse_getNofSatisfyingAssignments" supports cubes manifests itself in the function potentially returning NaN (not a number) if a variable is found in the BDD that does not appear in the cube. The rest of the differences seem to stem from this support.
Second, SatHowMany returns the number of satisfying assignments to all n variables for a node. This leads, for instance, to the division by 2 in line -4. "recurse_getNofSatisfyingAssignments" only returns the number of assignments for the remaining variables to be considered.
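To make the "all n variables" convention concrete, here is a hypothetical Python sketch of that counting recursion (not the CUDD code; a toy Node class stands in for DdNode*, and complemented edges are ignored):

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Node:
    var: int                      # position in the variable ordering
    low: Union["Node", bool]      # child for var = 0
    high: Union["Node", bool]     # child for var = 1

def sat_how_many(node: Union[Node, bool], n_vars: int, table=None) -> float:
    """Count assignments to all n_vars variables that satisfy the BDD."""
    if table is None:
        table = {}                # the cache the answer calls a "table"
    if node is True:
        return 2.0 ** n_vars
    if node is False:
        return 0.0
    if node not in table:
        # Each child's count still ranges over all variables, including the
        # one tested here, hence the division by 2 discussed above.
        table[node] = (sat_how_many(node.low, n_vars, table)
                       + sat_how_many(node.high, n_vars, table)) / 2.0
    return table[node]

For example, for f = x0 AND x1 over two variables, sat_how_many(Node(0, False, Node(1, False, True)), 2) returns 1.0.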
Both algorithms cache information - in "SatHowMany", it's called a table, in "recurse_getNofSatisfyingAssignments" it's called a buffer. Note that in line 24 of "recurse_getNofSatisfyingAssignments", there is a constant string thrown. This means that either the function does not work, or the code is never reached. Most likely it's the latter.
Function "SatHowMany" seems to assume that it gets a BDD node - it cannot be a pointer to a complemented BDD node. Function "recurse_getNofSatisfyingAssignments" works correctly with complemented nodes, as a DdNode* may store a pointer to a complemented node.
Due to the support for cubes, "recurse_getNofSatisfyingAssignments" supports flexible variable ordering (hence the lookup of "cuddI", which gives, for a variable, its position in the current BDD variable ordering). For function SatHowMany, the variable ordering does not make a difference.

How to define a strategy in hypothesis to generate a pair of similar recursive objects

I am new to hypothesis and I am looking for a way to generate a pair of similar recursive objects.
My strategy for a single object is similar to this example in the hypothesis documentation.
I want to test a function which takes a pair of recursive objects A and B and the side effect of this function should be that A==B.
My first approach would be to write a test which gets two independent objects, like:
@given(my_objects(), my_objects())
def test_is_equal(a, b):
    my_function(a, b)
    assert a == b
But the downside is that Hypothesis does not know that there is a dependency between these two objects, so they can be completely different. That is a valid test and I want to test that too.
But I also want to test complex recursive objects which are only slightly different.
And maybe Hypothesis would be able to shrink a pair of very different objects where the test fails down to a pair of only slightly different objects that fails in the same way.
This one is tricky - to be honest I'd start by writing the same test you already have above, and just turn up the max_examples setting a long way. Then I'd probably write some traditional unit tests, because getting specific data distributions out of Hypothesis is explicitly unsupported (i.e. we try to break everything that assumes a particular distribution, using some combination of heuristics and a bit of feedback).
How would I actually generate similar recursive structures, though? I'd use a @composite strategy to build them at the same time, and for each element or subtree I'd draw a boolean and, if True, draw a different element or subtree to use in the second object. Note that this will give you a strategy for a tuple of two objects and you'll need to unpack it inside the test; that's unavoidable if you want them to be related.
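For example, a hypothetical sketch of that idea, using nested tuples as a stand-in for your recursive object:

from hypothesis import strategies as st

@st.composite
def similar_trees(draw, depth=3):
    """Draw a tree plus a slightly mutated copy of it."""

    def tree(d):
        if d == 0 or draw(st.booleans()):
            return draw(st.integers())        # leaf value
        return (tree(d - 1), tree(d - 1))     # internal node

    def mutate(node, d):
        # For each element/subtree, draw a boolean; if True, replace it.
        if draw(st.booleans()):
            return tree(d)
        if isinstance(node, tuple):
            return (mutate(node[0], d - 1), mutate(node[1], d - 1))
        return node

    first = tree(depth)
    return first, mutate(first, depth)

In the test you'd then write @given(similar_trees()) and unpack the tuple, e.g. a, b = pair.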
Seriously, do try just cranking up max_examples on the naive approach first though; running Hypothesis for ~an hour is amazingly effective and I would even expect it to shrink the output fairly well.

How to implement efficient string interning in F#?

What is the best way to implement a custom string type in F# for interning strings? I have to read large CSV files into memory. Given that most of the columns are categorical, values repeat, and it makes sense to create a new string the first time it is encountered and only refer to it on subsequent occurrences, to save memory.
In C# I do this by creating a global intern pool (a concurrent dictionary) and, before setting a value, looking it up in the dictionary. If it already exists, I just point to the string already in the dictionary; if not, I add it to the dictionary and set the value to the string just added.
I'm new to F# and wondering what the best way to do this is. I will be using the new string type in records, named tuples, etc., and it will have to work with concurrent processes.
Edit:
String.Intern uses the intern pool. My understanding is that it is not very efficient for large pools and is not garbage collected, i.e. any and all interned strings remain in the intern pool for the lifetime of the app. Imagine an application where you read a file, perform some operations and write data. The intern-pool solution will probably work there. Now imagine you have to do the same 100 times, and the strings in each file have little in common. If the memory is allocated on the heap, then after processing each file we can force the garbage collector to clear the unnecessary strings.
I should have mentioned I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#)
The memoisation pattern is slightly different from what I am looking for. We are not caching calculated results - we are ensuring each string object is created no more than once, and all subsequent creations of the same string are just references to the original. Using a dictionary is one way to do this, and using String.Intern is another.
Sorry if I am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
open System
open System.Text

let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?
To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.
As for rolling your own: I did that some years ago, probably to avoid the problem you mentioned with the strings not being garbage collected, but I never tested whether that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and structures filled, will have the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding onto the dictionaries.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.
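To illustrate the per-column dictionary idea in a language-agnostic way, here is a minimal Python sketch of the same shape (the F# version would use a Dictionary per column and live only for the duration of the load):

import csv

def read_deduplicated(path):
    rows = []
    pools = {}                          # one intern pool per column index
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            deduped = []
            for i, value in enumerate(row):
                pool = pools.setdefault(i, {})
                deduped.append(pool.setdefault(value, value))
            rows.append(deduped)
    # The pools go out of scope here, so only the strings actually referenced
    # by the rows stay alive afterwards.
    return rows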

Can I repeatedly create & destroy random generator/distribution objects (with a 'dice' class)?

I'm trying out the random number generation from the new <random> library in C++11 for a simple dice class. I'm not really grasping what actually happens, but the reference shows an easy example:
#include <random>

std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(1,6);
int dice_roll = distribution(generator);
I read somewhere that with the "old" way you should ideally only seed once in your application (e.g. in the main function). However, I'd like an easily reusable dice class. So would it be okay to use this code block in a dice::roll() method, even though multiple dice objects are instantiated and destroyed multiple times in an application?
Currently I've made the generator a class member, and the last two lines are in the dice::roll() method. It looks okay, but before I compute statistics I thought I'd ask here...
Think of instantiating a pseudo-random number generator (PRNG) as digging a well - it's the overhead you have to go through to get access to water. Drawing pseudo-random numbers from it is like dipping into the well. Most people wouldn't dig a new well every time they want a drink of water, so why incur the unnecessary overhead of multiple instantiations to get additional pseudo-random numbers?
Beyond the unnecessary overhead, there's a statistical risk. The underlying implementations of PRNGs are deterministic functions that update some internally maintained state to generate the next value. The functions are very carefully crafted to give a sequence of uncorrelated (but not independent!) values. However, if the state of two or more PRNGs is initialized identically via seeding, they will produce the exact same sequences. If the seeding is based on the clock (a common default), PRNGs initialized within the same tick of the clock will produce identical results. If your statistical results have independence as a requirement then you're hosed.
Unless you really know what you're doing and are trying to use correlation induction strategies for variance reduction, best practice is to use a single instantiation of a PRNG and keep going back to it for additional values.
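For what it's worth, here is a hypothetical sketch of that advice in Python (the structure carries over directly to C++, with std::default_random_engine as the shared engine and the distribution created inside roll()):

import random

_shared_rng = random.Random()           # constructed and seeded exactly once

class Dice:
    def __init__(self, sides: int = 6, rng: random.Random = _shared_rng):
        self.sides = sides
        self.rng = rng                  # every Dice reuses the same generator

    def roll(self) -> int:
        return self.rng.randint(1, self.sides)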

Resources