Randomized (as in QuickCheck) versus deterministic (as in SmallCheck) property checking - haskell

I know of two approaches to property checking:
The randomized approach (as in QuickCheck) requires you to define a generator of random values for your types, then checks your invariants against a large number of randomly generated cases. For example, for a vector space ℤⁿ (defined, say, as [Word]), it would generate vectors of any length and direction.
The deterministic approach (as in SmallCheck) works the same way, except that it generates values deterministically, from small and simple to larger and more complex, covering (as I understand it) a small part of the domain exhaustively. For example, in the same case as above, it would generate the zero vector, then all vectors of length <= 1, then all vectors of length <= 2, and so on, eventually covering all interesting values (such as unit vectors) for which there may conceivably be unforeseen corner cases that break an invariant.
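To make the contrast concrete, here is a minimal sketch in Python rather than Haskell, using only the standard library. It does not use the QuickCheck or SmallCheck APIs; the names random_cases, small_cases, check and the deliberately buggy remove_first are all made up for illustration. The random generator samples lists of arbitrary length and content, while the enumerative one walks every list up to a given depth, smallest first.

```python
import itertools
import random


def remove_first(x, xs):
    """Deliberately buggy: removes every occurrence of x, not just the first."""
    return [y for y in xs if y != x]


def prop_remove_shrinks_by_one(xs):
    """Removing the head of a non-empty list should drop exactly one item."""
    if not xs:
        return True
    return len(remove_first(xs[0], xs)) == len(xs) - 1


def random_cases(n_cases, max_len=20, max_val=1000, seed=0):
    """Randomized style: sample lists of random length and random content."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        length = rng.randint(0, max_len)
        yield [rng.randint(0, max_val) for _ in range(length)]


def small_cases(depth):
    """Enumerative style: every list of length <= depth with elements in 0..depth."""
    for length in range(depth + 1):
        for combo in itertools.product(range(depth + 1), repeat=length):
            yield list(combo)


def check(prop, cases):
    """Return the first counterexample, or None if every case passes."""
    for case in cases:
        if not prop(case):
            return case
    return None


# The enumerative check reaches the smallest failing input as soon as depth allows.
print(check(prop_remove_shrinks_by_one, small_cases(2)))  # [0, 0]
```

The bug only shows up when the list contains a duplicate of its head. Enumeration at depth 2 is guaranteed to reach [0, 0]; the random generator may or may not produce such a duplicate, depending on how its value range and sample count are chosen.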
Did I get it correct? What are the benefits and downsides of either approach? What are the preferred use cases for either? Maybe, for best results, both approaches should be combined, say, covering some moderately sized "base" with a deterministic checker and then a random sample with a randomized one?
I'm looking for some solid, motivated best practice to use daily and advance from now on.
P.S. This question is not intended to provoke discussion. I am asking specifically what benefits either approach to property checking has matter-of-factly, not which is nicer, easier to use or better by any other measure that can be considered a matter of taste. An answer would provide either a critique of the problem posed, a mindful, mathematical observation of the nature of either method of property checking, or a case / experience report that justifies one of the approaches as preferable for a certain class of situations. In the end, I intend this question to generate a set of criteria that help determine which approach to testing will catch more programming mistakes. Put this way, I don't see how this question could be considered subjective in any way other than good, as outlined in the rules. There is a companion reddit thread that offers a channel for more liberal communication.

Related

Failing to see the difference between these two lines of thought in dynamic programming

People describe the difference between tabulation and memoization in dynamic programming in many ways, but I will summarize what is normally said.
Memoization is where we add caching to a function so that recursive calls take less computation. It is typically used on recursive functions for a top-down solution that starts with the initial problem and then recursively calls itself on smaller problems.
Tabulation uses a table to keep track of subproblem results and works in a bottom-up manner, solving the smallest subproblems before larger ones in an iterative manner.
So my question is: what's the difference? Sometimes I look at different situations and the line is very blurred. Also, when memoization is said to work "top down", that really just refers to the stack-based nature of the recursion; it still descends to the base case (the bottom) and then uses those results to build up to the final result, so how is that really different from tabulation going bottom up until it's done? Or is it simply that tabulation approaches don't involve recursion, and whether a dynamic programming solution uses recursion IS what differentiates the two methods? If someone knowledgeable could offer their thoughts, it would be much appreciated.
You're right that they're just two implementation methods for the same computation. A recursive formulation with memoization will fill in the memo cache with the same entries that an explicit tabular formulation will put in its table.
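As a small illustration of that point, here is a sketch using Fibonacci numbers as a stand-in problem (my choice of example, not something from the question). Both functions return the collection of solved subproblems so the two can be compared directly; fib_top_down and fib_bottom_up are illustrative names.

```python
def fib_top_down(n):
    """Memoized recursion: returns the cache filled while computing fib(n)."""
    cache = {}

    def fib(k):
        if k not in cache:
            # Base cases 0 and 1 are stored in the cache as well.
            cache[k] = k if k < 2 else fib(k - 1) + fib(k - 2)
        return cache[k]

    fib(n)
    return cache


def fib_bottom_up(n):
    """Explicit table: fill entries 0..n in increasing order (assumes n >= 1)."""
    table = {0: 0, 1: 1}
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table


# For n >= 2 the memo cache and the table hold exactly the same entries.
print(fib_top_down(10) == fib_bottom_up(10))  # True
print(fib_top_down(10)[10])                   # 55
```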
Explicit tabular formulations are strictly less useful, however. This is because they need more information about the problem in advance. They start by enumerating all possibly useful base cases and putting those in the table. (So what's "possibly useful?" That's the rub!) Then they enumerate the new "layer" of all possible problem versions that can be solved with the base cases. Then a layer of others that can be solved with those, etc. etc. This continues until the "top level" problem turns up in a layer.
For the kinds of problems typically seen in textbooks and coding interviews, determining all useful base cases is deliberately easy. The problem parameters are 2 or 3 "dense" natural numbers, so the table of solutions can be a 2d or 3d array with all elements containing useful values. In many of these, you can prove that the current layer only depends on a few (possibly just one) of the previous layers, so all the rest can be discarded, which saves memory.
Practical problems aren't often so nice. The parameter sets aren't small or aren't natural numbers, or - even when they are - they're sparse so that filling in all entries of an array would be a waste.
In these cases, memoization is the only reasonable choice. The top-down recursion determines the sub-problems (on down to the base cases) that need solving as they occur. Sparseness doesn't matter because the memo cache can store parameter sets as explicit keys. When the current layer doesn't need more than K previous ones, various strategies can still be applied to discard the others.
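Here is a minimal sketch of the sparse case. The recurrence count(n) = count(n // 2) + count(n // 3) with count(0) = 1 is a toy problem I picked for illustration; the point is only that the subproblems reachable from a large n are few and scattered, so a dict keyed by the parameter works where a dense array would not.

```python
def count(n, cache):
    """Toy recurrence: count(0) = 1, count(n) = count(n // 2) + count(n // 3)."""
    if n == 0:
        return 1
    if n not in cache:
        # Only the subproblems actually reached by the recursion get stored.
        cache[n] = count(n // 2, cache) + count(n // 3, cache)
    return cache[n]


cache = {}
big = 10**12
result = count(big, cache)

# A dense table would need about 10**12 slots; the memo holds far fewer keys,
# because every key has the form big // (2**a * 3**b).
print(len(cache) < 2000)  # True
```

A bottom-up version would have to know in advance which values of the form n // (2**a * 3**b) are needed, which is exactly the extra information the top-down recursion discovers on its own.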

Testing for heteroskedasticity and autocorrelation in large unbalanced panel data

I want to test for heteroskedasticity and autocorrelation in a large unbalanced panel dataset.
I do so using the following code:
* Heteroskedasticity test
// iterated GLS with only heteroskedasticity produces
// maximum-likelihood parameter estimates
xtgls adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT, igls panels(heteroskedastic)
estimates store hetero
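* Autocorrelation
* xtserial is a community-contributed command: locate and install it once
findit xtserial
net sj 3-2 st0039
net install st0039
* then run the serial correlation test
xtserial adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT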
Although I run this on a high-performance computing center, the iterative method makes this procedure take more than 15 hours.
What is the most efficient program to perform these tests using Stata?
This question is borderline off-topic and quite broad, but I suspect it is still of
considerable interest to new users. As such, I will try here to consolidate our
conversation in the comments into an answer.
I strongly advise in the future to refrain from using highly subjective
words such as 'best', which can mean different things to different people. Or
terms like 'efficient', which can have a different meaning in a different context.
It is also difficult to provide specific advice regarding the use of commands
when we know nothing about what you are trying to do.
In my view, the 'best' choice is the one that gets the job done as accurately
as possible given the available data. Speed is an important consideration nowadays, but accuracy is still the most fundamental one. As you continue to use Stata, you will see that it has a considerable number of commands, often with overlapping functionality. Depending on the use case, opting for one implementation over another can sometimes be 'better', in the sense that it may be more practical or faster in achieving the desired end result.
Case in point, your comment in your previous post where the noconstant option is unavailable in rreg. In that particular context you can get a reasonably good alternative using regress with the vce(robust) option. In fact, this alternative may often be adequate for several use cases.
In this particular example, xtgls will be considerably faster if the igls
option is not used. This will be especially true with larger and more 'difficult' datasets. In cases where MLE is necessary, the iterate option will allow you to specify a fixed number of iterations, which could speed things up but can be a recipe for disaster if you don't know what you are doing and is thus not recommended. This option is usually used for other purposes. However, is xtgls the only command you could use? Read here why this may in fact not necessarily be the case.
Regarding speed, Stata in general is slow, at least when the ado language is used. This is because it is an interpreted language. The only realistic option for speed gains here is through parallelisation if you have Stata MP. Even in this case, whether any gains are achieved will depend on a number of factors,
including which command you use.
Finally, xtserial is a community-contributed command, something which you
fail to make clear in your question. It is customary and useful to provide this
information right from the start, so others know that you do not refer to an
official, built-in command.

Hypothesis search tree

I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? What does the search tree that Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations in which many different values are tested for each field. I want to make sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
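To see why the DFS worry would matter if it were true, here is a rough standard-library sketch. It is emphatically not Hypothesis's actual algorithm and does not use its API; FIELDS, dfs_examples and random_examples are hypothetical names. It simply compares how many distinct values each field receives when a fixed budget of examples is spent on an exhaustive depth-first enumeration versus independent random draws per field.

```python
import itertools
import random

# Hypothetical object with three fields, each taking ten possible values.
FIELDS = {"a": range(10), "b": range(10), "c": range(10)}


def dfs_examples(limit):
    """First `limit` combinations in depth-first (lexicographic) order."""
    names = list(FIELDS)
    combos = itertools.product(*(FIELDS[name] for name in names))
    return [dict(zip(names, combo)) for combo in itertools.islice(combos, limit)]


def random_examples(limit, seed=0):
    """`limit` examples where every field is drawn independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in FIELDS.items()}
        for _ in range(limit)
    ]


def distinct_values_per_field(examples):
    """How many different values each field took across the examples."""
    return {name: len({ex[name] for ex in examples}) for name in FIELDS}


# With a budget of 50, depth-first order never moves field "a" off its first value.
print(distinct_values_per_field(dfs_examples(50)))  # {'a': 1, 'b': 5, 'c': 10}
print(distinct_values_per_field(random_examples(50)))
```

Even the naive random version spreads the budget across all fields, which is the property the question asks for; Hypothesis layers its own heuristics on top of random generation, as described above.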

Introductory reading on classifiers that are not "yes/no" naive Bayes

I want to manually implement a classifier for certain short strings of words, getting a "goodness" rank for each of them. I have made a naive Bayesian classifier which is basically spam-filter-like and scores strings based on previous "good"/"bad" ratings. So far so good.
Now, there are two problems that I want to solve (by properly understanding things)...
The question is: what would be good introductory material for the problems below, not of the "cookbook" variety but more systematic, and yet ideally shorter than a university statistics course :) A set of articles that is shorter than a book, or a good book, ideally aimed at programmers.
The problems are:
First, in my system there are actually 3 types of user feedback - "good", "bad", and "neutral". Most items are neutral, and right now I simply don't include them in the ranking. I am wondering how these things are properly handled (I still need to obtain a single "goodness probability" per item, so if I calculate the probabilities of good and bad separately, are there any pitfalls or proper methods for combining them?).
Then, I want to remove the naive part from my classifier (i.e. take relations between words into account), so some different classifier may be in order. Or, I could add all pairs-triples-etc. of words as features, since the strings are short - this feels like a hack, but then again my CS/maths background is rusty enough and/or insufficient to say whether this is a valid technique.
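For reference, the "pairs-triples-etc." idea mentioned above amounts to word n-gram features. A minimal sketch, assuming whitespace tokenization and lowercase normalization (both my assumptions, and ngram_features is an illustrative name):

```python
def ngram_features(text, max_n=3):
    """Return all word n-grams of length 1..max_n, in order of increasing n."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            # Join n consecutive words into a single feature string.
            features.append(" ".join(words[start:start + n]))
    return features


print(ngram_features("Cheap pills now"))
# ['cheap', 'pills', 'now', 'cheap pills', 'pills now', 'cheap pills now']
```

Each n-gram can then be counted by the existing classifier exactly like a single word; whether the independence assumption is still tolerable with overlapping features is part of what the question asks.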

What's the real difference between 3GLs and 4GLs?

I can't find anything that is useful to determine if a language is third generation or fourth. All I find is open statements like "higher level" and "closer to English" and some sources say that they are domain specific languages like SQL and others say that they can be general purpose. I'm really confused.
If 2GLs are the Assembly languages and 5GLs are the inference languages like Prolog, how do you determine if a programming language is a 3GL or a 4GL?
Most use of the terms was pure marketing -- "Oh, you're still using a third generation language? That's so last week!"
Behind that, there was a tiny bit of technical meaning though (at least for a little while, though many "4GLs" ignored it). The basic difference was supposed to be that third generation languages allowed you to manipulate only individual data items, whereas fourth generation languages allowed you to manipulate groups of items as a group rather than individually.
Two obvious examples of this are SQL and APL. In SQL, you mostly work with sets. The result of a query is a set (not exactly a mathematical set, but at least somewhat similar). You can use and manipulate that set as a whole, merge it with other sets, etc. Until or unless you're exposing it to the outside world (e.g., with a cursor) you don't have to deal with the individual records/rows/tuples that make up that set.
In APL you get somewhat the same idea, except you're working with arrays instead of sets. To get an idea of what this means, let's assume you wanted to "rotate" an array so the currently-first element was moved to the end, and each other element was shifted ahead a spot. To do that in a typical 3GL (Fortran, Pascal, C, etc.) you'd write a loop that worked with the individual elements in the array. In APL, however, you have a single operator that will do that to the array as a whole, all in one operation. Even operations that work with individual items are generally trivial to apply to an entire array at once with the / operator, so (for example) the sum of all the elements in an array named a could be computed with +/a (or maybe that was /+a -- it's been a long time since I wrote any APL).
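To make the contrast tangible without writing APL, here is a sketch in Python. The function names are mine, and the whole-collection versions only approximate the APL style using slicing and functools.reduce; they are not a claim about how APL implements these operations.

```python
from functools import reduce
import operator


def rotate_with_loop(items):
    """3GL style: move elements one at a time with an explicit loop."""
    if not items:
        return []
    result = []
    for i in range(1, len(items)):
        result.append(items[i])
    result.append(items[0])  # the former first element goes to the end
    return result


def rotate_whole(items):
    """Array style: express the rotation on the sequence as a whole."""
    return items[1:] + items[:1]


def sum_by_reduction(items):
    """Apply + between all elements, in the spirit of APL's reduction."""
    return reduce(operator.add, items, 0)


a = [1, 2, 3, 4]
print(rotate_with_loop(a))   # [2, 3, 4, 1]
print(rotate_whole(a))       # [2, 3, 4, 1]
print(sum_by_reduction(a))   # 10
```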
There are some pretty serious problems with the general idea of the distinction involved there though. One is that it placed a heavy emphasis on syntax -- obviously the actions involved required things like loops internally, so the distinction amounted to a syntax for an implicit loop. Another problem was that in quite a few cases you ended up with something of a blend of the two generations -- e.g., most BASICs being able to treat a string as a single thing, but requiring loops for similar operations on arrays/matrices. Finally, there was a little bit of a problem with relevance: in a few special cases (like SQL), being able to work with a group/set/array/whatever of data as a whole really made a big difference, but for the most part it did little to let people think and work with higher level abstractions (as was at least apparently the original intent).
That combined with a move toward languages that blurred the distinction between what was built in, and what was part of a library (or whatever). In C++, most ML-family languages, etc., it's trivial to write a function with arbitrary actions (including but not limited to loops) and attach it to an operator that's essentially indistinguishable from one that's built into the language.
It was a catchy phrase with a meaning most couldn't explain and even fewer cared about -- a prime candidate for being turned into pure marketspeak, usually translated roughly as: "you should pay me a lot for my slow, ugly, buggy CRUD application generator."
"Language generations" were a hot buzzword in the 1980s and early 1990s. They were always ill-defined, and little used in actual academic discourse.
The terms had little meaning at the time, and none now.
