How to check internal size parameters that control example generation - python-hypothesis

In PropEr, there's an internal variable called Size that represents the size of generated example.
For instance, when we have two variables and would like to make them proportional to each other, PropEr lets you write the following test:
prop_profile2() ->
    ?FORALL(Profile, [{name, string()},
                      {age, pos_integer()},
                      {bio, ?SIZED(Size, resize(Size*35, string()))}],
        begin
            NameLen = to_range(10, length(proplists:get_value(name, Profile))),
            BioLen = to_range(300, length(proplists:get_value(bio, Profile))),
            aggregate([{name, NameLen}, {bio, BioLen}], true)
        end).
In this test, the internal variable Size holds the internal size used by string() (the string value generator), so ?SIZED(Size, resize(Size*35, string())) makes the bio string 35 times larger than the string() generated next to the name atom.
I tried to do something similar with Hypothesis, but what I could come up with was the following:
from hypothesis.strategies import DrawFn, composite, integers, text

@composite
def profiles(draw: DrawFn):
    name = draw(text(max_size=10))
    name_len = len(name)
    age = draw(integers(min_value=1, max_value=150))
    bio_len = 35 * name_len
    bio = draw(text(min_size=bio_len, max_size=bio_len))
    return Profile(name, age, bio)
Are there any other smarter ways to have proportional sizes among multiple variables?

If the code you're testing actually requires these proportions, your test looks good, though I think a direct translation of your PropEr code would have bio = draw(text(min_size=bio_len, max_size=max(bio_len, 300)))? As-is, you'd never be able to find bugs with even-length bios, or short bios, and so on.
This points to a more fundamental question: why do you want proportional inputs in the first place?
Hypothesis style is to just express the limits of your allowed input directly - builds(name=text(max_size=10), age=integers(1, 150), bio=text(max_size=300)) - and let the framework give you diverse and error-inducing inputs from whatever weird edge cases it finds. (Note: if you're checking the distribution, do so over 10,000+ examples - each run of 100 won't look very diverse.)
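For concreteness, a minimal sketch of that style might look like the following; the Profile dataclass is a hypothetical stand-in for the question's Profile type, and the bounds are the ones from the answer above:

from dataclasses import dataclass
from hypothesis import given
from hypothesis.strategies import builds, integers, text

@dataclass
class Profile:  # hypothetical stand-in for the question's Profile type
    name: str
    age: int
    bio: str

# Express only the real limits of each field and let Hypothesis pick the sizes.
profiles = builds(
    Profile,
    name=text(max_size=10),
    age=integers(min_value=1, max_value=150),
    bio=text(max_size=300),
)

@given(profiles)
def test_profile(profile):
    # Replace the assertion with whatever property your code under test should satisfy.
    assert len(profile.bio) <= 300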
PropEr style often adds further constraints on the inputs, in order to produce more realistic data or guide generation to particular areas of interest. I think this is a mistake: my goal is not to generate realistic data, it's to maximize the probability that I find a bug - ignoring part of the input space can only hurt - and then to minimize the expected time to do so (a topic too large for this answer).

Related

How to keep test instance IDs apart in Django?

There's a bug in the following code:
def test_should_filter_foobar_by_foo_id(self) -> None:
    first_foo = Foo(name="a")
    second_foo = Foo(name="b")
    first_foobar = FooBar(foo=first_foo)
    second_foobar = FooBar(foo=second_foo)
    results = FooBarFilter(data={"foo": first_foobar.id})
    self.assertListEqual([first_foobar], results.qs.order_by("id"))
The test passes for the wrong reason: when starting from a newly initialized DB, the Foo.id and FooBar.id sequences both start at 1. first_foobar.id should be first_foo.id. This is very similar to real-world tests I've written, where the sequences running in lockstep have caused misleading test implementations.
Is there a (cheap, simple) way to do something similar to Django's reset_sequences to jolt all the sequences out of lockstep? Whether randomly, or by adding 1 million to the first sequence, 2 million to the next, etc., or some other method. A few suboptimal options come to mind:
- Create dummy instances at the start of the test to move the sequences out of lockstep.
  - Prohibitively expensive.
  - Error prone, because we'd have to keep track of how many instances of each model we create.
  - Adds irrelevant complexity to the tests.
- Change the current values of each sequence as part of test setup (see the sketch after this list).
  - Error prone, because we'd have to make sure to change all the relevant sequences as we add more models to the tests.
  - Adds irrelevant complexity to the tests.
- Swap around the order in which we create the instances in the test, so that first_foo has ID 2 and second_foo has ID 1.
  - Adds irrelevant complexity to the tests.
  - Once we have three or more models in the test, at least two of them will be in lockstep, and we'd need to complement this with some other technique.
- Modify the IDs of each instance after saving, thereby bypassing the sequences entirely.
  - Again, adds irrelevant complexity to the tests.
  - Error prone, since now it would be easy to accidentally end up with ID collisions.
- Change the current values of each sequence as part of the template DB. I'm not sure how to do this, and I'm not sure how expensive it would be.
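As a rough illustration of the "change the sequence values in test setup" option, here is a minimal sketch assuming PostgreSQL; the helper name, table names, and offsets are hypothetical and would need to match your app's actual db_table values:

from django.db import connection
from django.test import TestCase

def offset_sequences(starts):
    # Restart each table's ID sequence at a different value so they can't run in lockstep.
    with connection.cursor() as cursor:
        for table, start in starts.items():
            # pg_get_serial_sequence resolves the sequence backing the table's "id" column.
            cursor.execute(
                f"SELECT setval(pg_get_serial_sequence('{table}', 'id'), %s, false)",
                [start],
            )

class FooBarFilterTests(TestCase):
    def setUp(self) -> None:
        # Hypothetical table names; adjust to the models used in the test.
        offset_sequences({"myapp_foo": 1_000_000, "myapp_foobar": 2_000_000})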

OpenNLP doccat trainer always results in "1 outcome patterns"

I am evaluating OpenNLP for use as a document categorizer. I have a sanitized training corpus with roughly 4k files, in about 150 categories. The documents have many shared, mostly irrelevant words - but many of those words become relevant in n-grams, so I'm using the following parameters:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 20000);
params.put(TrainingParameters.CUTOFF_PARAM, 10);
DoccatFactory dcFactory = new DoccatFactory(new FeatureGenerator[] { new NGramFeatureGenerator(3, 10) });
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
Some of these categories apply to documents that are almost completely identical (think boiler-plate legal documents, with maybe only names and addresses different between document instances) - and will be mostly identical to documents in the test set. However, no matter how I tweak these params, I can't break out of the "1 outcome patterns" result. When running a test, every document in the test set is tagged with "Category A."
I did manage to effect a single minor change in output, by moving from the BagOfWordsFeatureGenerator to the NGramFeatureGenerator, and from maxent to Naive Bayes; before the change, every document in the test set was assigned "Category A", but after the change, all the documents were assigned to "Category B." Other than that, I can't seem to move the dial at all.
I've tried fiddling with iterations, cutoff, ngram sizes, using maxent instead of bayes, etc; but all to no avail.
Example code from tutorials that I've found on the interweb uses much smaller training sets with fewer iterations, and is able to perform at least some rudimentary differentiation.
Usually in such a situation - bewildering lack of expected behavior - the engineer has forgotten to flip some simple switch, or has some fatal lack of fundamental understanding. I am eminently capable of both those failures. Also, I have no Data Science training, although I have read a couple of O'Reilly books on the subject. So the problem could be procedural. Is the training set too small? Is the number of iterations off by an order of magnitude? Would a different algo be a better fit? I'm utterly surprised that no tweaks have even slightly moved the dial away from the "1 outcome" outcome.
Any response appreciated.
Well, the answer to this one did not come from the direction in which the question was asked. It turns out that there was a code sample in the OpenNLP documentation that was wrong, and no amount of parameter tuning would have solved it. I've submitted a jira to the project so it should be resolved; but for those who make their way here before then, here's the rundown:
Documentation (wrong):
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
Should be something like:
String inputText = ... // sanitized document to be classified
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" "));
String category = myCategorizer.getBestCategory(outcomes);
DocumentCategorizerME.categorize() needs an array. Since this is an obviously self-documenting bug the second you run the code, I had assumed the required array parameter should be an array of documents in string form; instead, it needs an array of tokens from a single document.

Unit conversion errors for value objects in DDD

As you might know, DDD literature suggests that we should treat "numeric quantities with some unit" as value objects, not as primitive types (ints, BigDecimal). Some examples of such value objects are money, distance, or file size. I agree with the big picture.
However, there is something I cannot understand: namely, conversion errors when representing something in one unit, converting it to another unit, and back. This process might lose some information. Take file size, for example. Let's say I have a file whose size is 3.67 MB and I convert that to another instance of FileSize whose unit is GB by dividing 3.67 by 1024. Now I have a FileSize of (approximately) 0.00358398437 GB. If I now try to convert it back to MB, the result is not 3.67 MB. If, however, I don't use a value object but instead only use the primitive value "sizeInBytes" (long), I cannot lose information to conversion errors.
I must have missed something. Is my example just plain stupid? Or is it acceptable to lose some info when converting from one unit to another? Or should FileSize always also carry the exact file size in bytes (along with the approximate size in the given unit)?
Thanks in advance!
What you are describing is more an implementation problem of your concrete example than a problem with the approach. The idea of using value objects to represent amounts with a unit is to avoid mistakes like adding liters to kilometers or doing 10 cm + 10 km = 20 cm. Value objects, when implemented correctly, enforce that operations between different units are done correctly.
Now, how you implement these value objects in your programming language is a different problem. But for your concrete example, I would say that the value object should internally have a long field with the size in bytes, no matter what unit you use to initialize the object. In this case, the unit is used to convert the initialization value to the right number of bytes and for display purposes, but when you have to add two FileSizes, you can add the internal amounts in bytes.
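As a rough sketch of that idea (the class, method names, and binary unit factors here are illustrative, not from any particular DDD library), the value object can keep the exact byte count internally and only convert at the edges:

from dataclasses import dataclass

# Illustrative binary unit factors (1 MB = 1024 * 1024 bytes).
_BYTES_PER_UNIT = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

@dataclass(frozen=True)
class FileSize:
    size_in_bytes: int  # the exact value; conversions never replace it

    @classmethod
    def of(cls, amount: float, unit: str) -> "FileSize":
        # Convert once, at construction, into the exact internal representation.
        return cls(round(amount * _BYTES_PER_UNIT[unit]))

    def to(self, unit: str) -> float:
        # For display/reporting only; the exact byte count stays untouched.
        return self.size_in_bytes / _BYTES_PER_UNIT[unit]

    def __add__(self, other: "FileSize") -> "FileSize":
        # Arithmetic is always done on the exact internal representation.
        return FileSize(self.size_in_bytes + other.size_in_bytes)

size = FileSize.of(3.67, "MB")
print(size.to("GB"))  # approximate, for display
print(size.to("MB"))  # back to (very nearly) 3.67; the only rounding happened at construction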
we should treat "numeric quantities with some unit" as value objects, not as primitive types (ints, BigDecimal).
Yes, that's right. More generally, we're encouraged to encapsulate data structures (an integer alone is a trivially simple data structure) behind domain specific abstractions. This is one good way to leverage type checking - it gives the checker the hints that it needs to detect a category of dumb mistakes.
Namely, conversion errors when representing something in one unit, converting it to another unit, and back. This process might lose some information.
That's right. More generally: rounding loses information.
If I don't use a value object but instead only use the primitive value "sizeInBytes" (long), I cannot lose information to conversion errors.
So look carefully at that: if you perform the same sequence of conversions you described using primitive data structures, you would end up with the same rounding error (after all, that's where the rounding error came from: the abstraction of the measurement defers the calculation to its internal general purpose representation).
The thing that saves you from the error is not discarding the original exact answer.
What domain modeling is telling you to do is make explicit which values are "exact" and which have "rounding errors".
(Note that in some domains, they aren't even "errors"; many domains have explicit rules about how rounding is supposed to happen. Sadly, they are rarely the rounding rules defined by IEEE-754, so you can't just lean on the general purpose floating point type.)
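For example (the half-up rule here is just one rule a domain might mandate, not something from the question), Python's decimal module lets you state the rounding rule explicitly instead of inheriting the binary floating-point default:

from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

amount = Decimal("2.5")

# The default for floats and for IEEE-754 is round-half-to-even ("banker's rounding")...
print(round(2.5))                                               # 2
print(amount.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # 2

# ...whereas a domain rule such as "always round halves up" must be stated explicitly:
print(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))    # 3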
DDD will also encourage you to track precisely which values are for display/reporting, and which are to be used in later calculations.
Reading this, I think you're misunderstanding what DDD is. The first D in DDD stands for Domain - a domain is a sphere of knowledge. The way you represent a sphere of knowledge, aka a domain, is entirely based on the business domain you're attempting to represent, and will differ from one business domain to another.
So...
Domain A: Business User that has X amount of storage space
I upload X file
file X uses 3.67 MB
You have used 1% of your allocated space.
You have 97 MB space remaining
Domain B: Sys Admin - total space is Y amount of storage space
Users have uploaded 3.67 MB
That user has used 1% of their space
That user has 97 MB space remaining
There is 1000 GB total space remaining to allocate to all users / total space remaining.
aka the sysadmin has one domain - the total disk - while the user has their allocated space (a subset); they have different domains of knowledge about the same space.
Also note: DDD is really about sectioning a domain, or sphere of knowledge, according to the specific users of sub-sections of a system - not about the facts of a system. aka facts are different from knowledge.
I hope this makes some sense!

ID3 Implementation Clarification

I am trying to implement the ID3 algorithm, and am looking at the pseudo-code (from the linked source).
I am confused by the bit where it says:
If Examples_vi is empty, create a leaf node with label = most common value of Target_Attribute in Examples.
Unless I am missing out on something, shouldn't this be the most common class?
That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
1) Unless I am missing out on something, shouldn't this be the most common class?
You're correct, and the text also says the same. Look at the function description at the top:
Target_Attribute is the attribute whose value is to be predicted by the tree
so the value of Target_Attribute is the class/label.
2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Yes, but not among all samples in your whole dataset - rather, among the samples that reached this point in the tree/recursion. (The ID3 function is recursive, so the current Examples is actually the Examples_vi of the caller.)
3) Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
No, picking a random class (with equal chances for each class) is not the same. Inputs often have an unbalanced class distribution (often called the prior distribution in many texts), so you may have 99% positive examples and only 1% negative. So whenever you really have no information whatsoever to decide on the outcome of some input, it makes sense to predict the most probable class, so that you have the highest probability of being correct. This maximizes your classifier's accuracy on unseen data, but only under the assumption that the class distribution in your training data is the same as in the unseen data.
This explanation holds, with the same reasoning, for the base case when Attributes is empty (see line 4 in your pseudocode): whenever we have no information, we just report the most common class of the data at hand.
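A minimal sketch of that logic in Python (the data layout - a list of (features, label) pairs plus a dict of attribute domains - is an assumption for illustration, not taken from the pseudocode):

import math
from collections import Counter

def most_common_label(examples):
    # Majority class of the examples that reached this node (not of the whole dataset).
    return Counter(label for _, label in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def choose_best_attribute(examples, attributes):
    # Pick the attribute with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for value in attributes[attr]:
            subset = [(f, l) for f, l in examples if f[attr] == value]
            if subset:
                remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(attributes, key=gain)

def id3(examples, attributes):
    # examples: list of (features_dict, label); attributes: dict attr -> set of possible values.
    labels = {label for _, label in examples}
    if len(labels) == 1:                # all examples agree: leaf with that class
        return next(iter(labels))
    if not attributes:                  # no attributes left (line 4 base case): majority class
        return most_common_label(examples)

    best = choose_best_attribute(examples, attributes)
    remaining = {a: v for a, v in attributes.items() if a != best}
    tree = {best: {}}
    for value in attributes[best]:      # every possible value, not just those observed here
        subset = [(f, l) for f, l in examples if f[best] == value]
        if not subset:                  # Examples_vi is empty: majority class of *these* examples
            tree[best][value] = most_common_label(examples)
        else:
            tree[best][value] = id3(subset, remaining)
    return tree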
If you have never implemented ID3 yourself but still want to understand the processing details, I suggest you read this paper:
Building Decision Trees in Python
and here is the source code from the paper:
decision tree source code
The paper walks through an example; you can substitute the example from your book by replacing the "data" file with one in the same format. You can then just debug it (with some breakpoints) in Eclipse to check the attribute values while the algorithm runs.
Go over it and you will understand ID3 better.

How is integer overflow exploitable?

Does anyone have a detailed explanation of how integer overflows can be exploited? I have been reading a lot about the concept, and I understand what an integer overflow is, and I understand buffer overflows, but I don't understand how one could reliably modify memory, or alter application flow, by making an integer larger than the storage defined for it...
It is definitely exploitable, but depends on the situation of course.
Old versions of SSH had an integer overflow which could be exploited remotely. The exploit caused the SSH daemon to create a hashtable of size zero and overwrite memory when it tried to store values in it.
More details on the ssh integer overflow: http://www.kb.cert.org/vuls/id/945216
More details on integer overflow: http://projects.webappsec.org/w/page/13246946/Integer%20Overflows
I used APL/370 in the late '60s on an IBM 360/40. APL is a language in which essentially everything is a multidimensional array, and there are amazing operators for manipulating arrays, including reshaping from N dimensions to M dimensions, etc.
Unsurprisingly, an array of N dimensions had index bounds of 1..k, with a different positive k for each axis, and k was legally always less than 2^31 (positive values in a 32-bit signed machine word). Now, an array of N dimensions has a location assigned in memory. Attempts to access an array slot using an index too large for an axis were checked against the array's upper bound by APL. And of course this applied to an array of N dimensions where N == 1.
APL didn't check whether you did something incredibly stupid with the RHO (array reshape) operator. APL only allowed a maximum of 64 dimensions. So you could make an array of 1-64 dimensions, and APL would do it if the array dimensions were all less than 2^31. Or you could try to make an array of 65 dimensions. In this case, APL goofed, and surprisingly gave back a 64-dimension array, but failed to check the axis sizes.
(This is in effect where the "integer overflow" occurred.) It meant you could create an array with axis sizes of 2^31 or more... but, being interpreted as signed integers, they were treated as negative numbers.
The right RHO operator incantation applied to such an array could reduce the dimensionality to 1, with an upper bound of, get this, "-1". Call this array a "wormhole" (you'll see why in a moment). Such a wormhole array has a place in memory, just like any other array. But all array accesses are checked against the upper bound... and the array bound check turned out to be done by an unsigned compare in APL. So you could access WORMHOLE[1], WORMHOLE[2], ... WORMHOLE[2^32-2] without objection. In effect, you could access the entire machine's memory.
APL also had an array assignment operation, in which you could fill an array with a value.
WORMHOLE[]<-0 thus zeroed all of memory.
I only did this once, as it erased the memory containing my APL workspace, the APL interpreter, and obviously the critical part of APL that enabled timesharing (in those days it wasn't protected from users)... The terminal room went from its normal state of being mechanically very noisy (we had 2741 Selectric APL terminals) to dead silent in about 2 seconds.
Through the glass into the computer room I could see the operator look up, startled, at the lights on the 370 as they all went out. Lots of running around ensued.
While it was funny at the time, I kept my mouth shut.
With some care, one could obviously have tampered with the OS in arbitrary ways.
It depends on how the variable is used. If you never make any security decisions based on integers you have added with input integers (where an adversary could provoke an overflow), then I can't think of how you would get in trouble (but this kind of stuff can be subtle).
Then again, I have seen plenty of code like this that doesn't validate user input (although this example is contrived):
int pricePerWidgetInCents = 3199;
int numberOfWidgetsToBuy = int.Parse(/* some user input string */);
int totalCostOfWidgetsSoldInCents = pricePerWidgetInCents * numberOfWidgetsToBuy; // KA-BOOM!
// potentially much later
int orderSubtotal = whatever + totalCostOfWidgetsSoldInCents;
Everything is hunky-dory until the day you sell 671,299 widgets for -$21,474,817.95. Boss might be upset.
A common case would be code that guards against buffer overflow by asking for the number of inputs that will be provided and then trying to enforce that limit. Consider a situation where I claim to be providing 2^30+10 integers. The receiving system allocates a buffer of 4*(2^30+10) = 40 bytes (!), because the size computation wraps around in 32-bit arithmetic. Since the memory allocation succeeded, I'm allowed to continue. The input count check won't stop me when I send my 11th input, since 11 < 2^30+10. Yet I will overflow the actually allocated buffer.
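To see the arithmetic behind that tiny allocation, here is a small sketch that simulates 32-bit unsigned wraparound (Python's own integers don't overflow, so the mask stands in for a C-style 32-bit size type):

MASK_32 = 0xFFFFFFFF                 # simulate a 32-bit unsigned size

claimed_count = 2**30 + 10           # attacker claims this many 4-byte integers
alloc_size = (4 * claimed_count) & MASK_32

print(alloc_size)                    # 40 -- the allocation "succeeds", but the buffer is tiny
print(11 < claimed_count)            # True -- the count check never stops the 11th input,
                                     # yet the 11th 4-byte write is already past the 40-byte buffer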
I just wanted to sum up everything I have found out about my original question.
The reason things were confusing to me was that I know how buffer overflows work and can understand how you can easily exploit them. An integer overflow is a different case - you can't exploit the integer overflow itself to inject arbitrary code and force a change in the flow of an application.
However, it is possible to overflow an integer which is used - for example - to index an array, and thereby access arbitrary parts of memory. From there, it could be possible to use that mis-indexed array to overwrite memory and alter the execution of the application to your malicious intent.
Hope this helps.
