SemVer: Do different results for the same seed warrant a major change? - backwards-compatibility

Say I have written a piece of software (in R, for didactic purposes) which is following the Semantic Versioning Specification. This is the content of version 1.0.0 of the software:
funk <- function(x) {
  jitter(x)
}
Which works so that
set.seed(1)
print(funk(0))
yields
[1] -0.009379653
Now suppose I change my function to this:
funk <- function(x) {
  unrelated_random_stuff <- sample(1:10)
  jitter(x)
}
And now, set.seed(1); print(funk(0)) yields
[1] -0.01176102
According to SemVer, does this constitute a major change? I.e., if I publish the software with these changes, should it be 2.0.0? I'm inclined to think so, since this technically changes results from scripts based on version 1.0.0, but I am not sure this qualifies as "breaking backwards compatibility" since we're talking about randomly-generated numbers.

If your customers are inclined to take a dependency on the output value, then yes, you probably want to bump the major version number. Even if this is library code, it's possible someone is using it for fuzz testing, where it's critically important to yield reproducible results in order to track down and fix bugs, as well as to ensure the fix does not regress in the future.

Related

HTTP Restful Semantic Versioning

Currently, I use semantic versioning for an API.
Versioning works like this:
MAJOR version when you make incompatible API changes
MINOR version when you add functionality in a backwards compatible manner
PATCH version when you make backwards compatible bug fixes
Should I increment the PATCH if I only update the documentation (Swagger, internal documentation, YAML, ...) to add an example or correct a description attached to the API?
Thanks for your help ;)
Should I increment the PATCH if I only update the documentation (Swagger, internal documentation, YAML, ...) to add an example or correct a description attached to the API?
Depends on the example/correction. Does it represent a break from the previous understanding of the use of your API? Here's a very contrived example for discussion:
API: int plus(int a, int b)
Documentation: int plus(int a, int b) sums a + b.
The above was released as 1.0.0; then someone reviews the code and points out that, on overflow, the function returns 0.
Updated documentation: int plus(int a, int b) sums a + b where a < 32767 and b < 32767, otherwise, returns 0.
So whether this is a breaking change depends on the language and its behavior when a + b overflows. Some languages throw an exception or segfault; in others, it's common to simply return a modulo result of some kind. Let's say it's C; in that case, this documentation change is likely a breaking change (as it probably would be in most programming languages).
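For concreteness, here is a rough sketch of that corrected contract in Haskell, with Int16 standing in for the example's 16-bit int; the return-0-on-overflow rule is taken from the review finding above, and everything else is made up for illustration:
import Data.Int (Int16)

-- Sketch of the corrected contract: sums a + b, but returns 0 when the
-- true sum does not fit in a 16-bit signed integer.
plus :: Int16 -> Int16 -> Int16
plus a b
  | total > 32767 || total < -32768 = 0             -- documented overflow behavior
  | otherwise                       = fromIntegral total
  where
    total = fromIntegral a + fromIntegral b :: Int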
The point here is that initial documentation is often not much better than a superficial restatement of the API itself. Subsequent complaints (bug reports) from customers often drive the next round of documentation changes, when customers are surprised by the results. So yes, even though the original intent of the developer hasn't changed, the documentation does represent a breaking change with regard to the expected results.
Had the documentation been changed to match exactly what the customers expected/witnessed in use, then no, it's not a breaking change.
The documentation is part of the feature set. Back-filling missing documentation is usually a feature addition, so you'd bump the minor version. A minor correction would be a patch.

How do I determine reasonable package dependency bounds when releasing a Haskell library?

When releasing a library on Hackage, how can I determine reasonable bounds for my dependencies?
It's a very brief question - not sure what additional information I can provide.
It would also be helpful to know if this is handled differently depending on whether stack or cabal is used.
Essentially my question relates to the cabal constraints, which are currently set as:
library
  hs-source-dirs:   src
  default-language: Haskell2010
  exposed-modules:  Data.ByteUnits
  build-depends:    base >=4.9 && <4.10
                  , safe == 0.3.15
I don't think the == is a good idea.
This is a tricky question, since there are different opinions in the community on best practices, and there are trade-offs between how easy the bounds are to figure out and how many dependency versions you remain compatible with. As I see it, there are basically three approaches you can take:
Look at the version of the dependencies you're currently using, e.g. safe-0.3.15. Assume that the package is following the PVP and will not release a breaking change before version 0.4, and add this: safe >= 0.3.15 && < 0.4 (a concrete sketch of the resulting stanza follows this list).
The above is nice, but limits a lot of potentially valid build plans. You can spend time testing against other versions of the dependency. For example, if you test against, say, 0.2.12 and 0.4.3 and they both seem to work, you may want to expand to safe >= 0.2.12 && < 0.5.
NOTE: A common mistake that crops up is that in a future version of your package, you forget to check compatibility with older versions, and it turns out you're using a new feature introduced in, say, safe-0.4.1, making the old bounds invalid. There's unfortunately not much in the way of automated tooling to check for this.
Just forget all of it: no version bounds at all, and make it the responsibility of the consumer of the package to ensure compatibility in a build plan. This has the downside that it's possible to create invalid build plans, but the upside that your bounds won't eliminate potentially good ones. (This is basically a false positive vs false negative tradeoff.)
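Applied to the stanza from the question, approach (1) would replace the exact == constraint along these lines (a sketch; the base bound is left exactly as in the question, and the safe bound assumes the package follows the PVP):
library
  hs-source-dirs:   src
  default-language: Haskell2010
  exposed-modules:  Data.ByteUnits
  build-depends:    base >=4.9 && <4.10
                  , safe >= 0.3.15 && < 0.4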
The Stackage project runs nightly builds that can often let you know when your package is broken by new versions of dependencies, and make it easier for users to consume your package by providing pre-built snapshots that are known to work. This especially helps with case (3), and a little bit with the loose lower bounds in (2).
You may also want to consider using a Travis configuration that tests against old Stackage snapshots, e.g. https://github.com/commercialhaskell/stack/blob/master/doc/travis-complex.yml
I assume you're aware of the Haskell Package Versioning Policy (PVP). This provides some guidance, both implicitly in the meaning it assigns to the first three components of the version ("A.B.C") plus some explicit advice on Cabal version ranges.
Roughly speaking, future versions with the same "A.B" will not have introduced any breaking changes (including introducing orphan instances that might change the behavior of other code), but might have added new bindings, types, etc. Provided you have used only qualified imports or explicit import lists:
import qualified Something as S
import Something (foo, bar)
you can safely write a dependency of the form:
something >= 1.2.0 && < 1.6
where the assumption would be that you've tested 1.2.0 through 1.5.6, say, and you're confident that it'll continue to run with all future 1.5.xs (non-breaking changes) but could conceivably break on a future 1.6.
If you have imported a package unqualified (which you might very well do if you are re-exporting a big chunk of its API), you'll want a variant of:
the-package >= 1.2.0 && < 1.5.4 -- tested up to 1.5.3 API
the-package >= 1.5.3 && < 1.5.4 -- actually, this API precisely
There is also a caveat (see the PVP) if you define an orphan instance.
Finally, when importing some simple, stable packages where you've imported only the most obviously stable components, you could probably make the assumption that:
the-package >= 1.2.0 && < 2
is going to be pretty safe.
Looking at the Cabal file for a big, complex, well-written package might give you some sense of what's done in practice. The lens package, for example, extensively uses dependencies of the form:
array >= 0.3.0.2 && < 0.6
but has occasional dependencies like:
free >= 4 && < 6
(In many cases, these broader dependencies are on packages written by the same author, and he can obviously ensure that he doesn't break his own packages, so can be a little more lax.)
The purpose of the bounds is to ensure the version of the dependency you use has the feature(s) that you need. There is some earliest version X that introduces all those features, so you need a lower bound that is at least X. It's possible that a required feature is removed from a later version Y, in which case you would need to specify an upper bound that is less than Y:
build-depends: foo >= X && < Y
Ideally, a feature you need never gets removed, in which case you can drop the upper bound. This means the upper bound is only needed if you know your feature disappears from a later version. Otherwise, assume that foo >= X is sufficient until you have evidence to the contrary.
foo == X should rarely be used; it is basically short for foo >= X && <= X, and states that you are using a feature that is only in version X; it wasn't in earlier versions, and it was removed in a later version. If you find yourself in such a situation, it would probably be better to try to rewrite your code to not rely on that feature anymore, so that you can return to using foo >= Z (by relaxing the requirement for version X exactly, you may be able to get by with an even earlier version Z < X of foo).
A “foolproof” answer would be: allow exactly those versions that you're sure will work successfully! If you've only ever compiled your project with safe-0.3.15, then technically speaking you don't know whether it'll also work with any other version, so the exact constraint that cabal offers is right. If you want compatibility with other versions, test them by successively going backwards. This can be done most easily by completely disabling the constraint in the .cabal file and then doing
$ cabal configure --constraint='safe==XYZ' && cabal test
for each version XYZ = 0.3.14, etc.
Practically speaking, that's a bit of a paranoid approach. In particular, it's good etiquette for packages to follow the Package Versioning Policy, which demands that new minor versions should never break any builds. I.e., if 0.3.15 works, then 0.3.16 etc. should at any rate work too. So the conservative constraint if you've only checked 0.3.15 would actually be safe >=0.3.15 && <0.4. Probably, safe >=0.3 && <0.4 would be safe† too. The PVP also requires that you don't use looser major-version bounds than you can confirm to work, i.e. it mandates the <0.4 constraint.
Often, this is still needlessly strict. It depends on how tightly you work with some package. In particular, sometimes you'll need to explicitly depend on a package just for some extra configuration function of a type used by a more important dependency. In such a case, I tend not to give any bounds at all for the ancillary package. For an extreme example, if I depend on diagrams-lib, there's no good reason to give any bounds on diagrams-core, because that is coupled to diagrams-lib anyway.
I also don't usually bother with bounds for very stable and standard packages like containers. The exception is of course base.
†Did you have to pick the safe package as an example?

Convert string version information to integer for easy comparison

I am in the process of designing a program that runs end-user scripts written in Lua. The program defines an interface that specifies how many parameters are passed to which function, their order, and so on. At some point in the future the interface may change, so I would like to pass the host program's version information to a script. My common practice is to use the MAJOR.MINOR.PATCH format.
The goal is to keep the representation as simple as possible, so that the information can be processed using arithmetic operations, like so:
if program.version < SCRIPT_SUPPORTS_VERSION then
  error("This version is not supported, please upgrade...")
end
Common sense dictates using integers written in base sixteen, where each byte represents a part of the version format, like so:
-- Version 1.3.2
-- Compatible with versions >=1.2.0
--
local program = {}
program.version = 0x010302
local SCRIPT_SUPPORTS_VERSION = 0x010200
As I mentioned before, this is the first thing that came to my head when thinking about the problem, and it seems to work. On the other hand, hexadecimal numbers may seem scary to some.
I would like to know your opinion; please propose alternative solutions.
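The packing scheme above is language-agnostic; sketched here in Haskell (the helper name is made up), it is just positional base-256 encoding, which is what keeps plain numeric comparison working:
-- Pack MAJOR.MINOR.PATCH into a single integer, one byte per component,
-- so that ordinary numeric comparison orders versions correctly
-- (as long as every component stays below 256).
packVersion :: Int -> Int -> Int -> Int
packVersion major minor patch = major * 65536 + minor * 256 + patch
-- packVersion 1 3 2  ==  0x010302  ==  66306
-- packVersion 1 2 0  ==  0x010200  ==  66048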

Algorithm to detect if a file(or string) have been patched

This question is about string algorithms, not version control or management tools.
I learnt the diff algorithm and tried to implement one. That is, given string A and string B, the diff calculates a sequence of actions that can convert A into B.
I wonder if it is possible, given a string S and a sequence of actions that a diff algorithm can produce, to tell whether S is (a) the original string A, (b) the patched string B, or (c) an unrelated string. And what if S is known to be one of A and B?
Actually, what I'm really doing is researching a method that can tell whether a patch has been applied (at the source code level or binary level). I tried googling for some time, but didn't find anything useful.
It's pretty complicated, but it can be done, on some level.
Essentially, you parse the source into tokens and then build the abstract syntax tree. Once that is done, you must build a diff tool that can do semantic differential analysis between abstract syntax trees. SemanticMerge, for example, does that.
Once that is done, you have the semantic difference between the two pieces of source code, and then you need to define what exactly constitutes a patch.
Some of the rules could be:
1) Variable content was changed
2) An if check was added
The bottom line is, differentiating between a patch and new functionality is not an easy task. The most reliable way is probably to check the binary file version numbers and understand the versioning scheme.
E.g., only the minor version is updated if patches are applied.

Is there an object-identity-based, thread-safe memoization library somewhere?

I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp with Hashable and Data.Map with Data.HashMap and add the appropriate imports, you get a hash-based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.
The one detail I can see that one probably ought to worry about, but which they don't, is thread safety: their code is apparently not thread-safe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform :: Node -> Node
the details of which need not concern us (or at least I think so; but again, I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given and the annotations of its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function
recursiveTransform :: Node -> Node
whose output 'looks the same' as the data structure you would get by doing:
recursiveTransform (Node originalChildren annotations) =
  myTransform (Node recursivelyTransformedChildren annotations)
  where
    recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if, say, the Nodes were numbered before I got them, or if I could otherwise change the definition of a Node. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.
If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
data Foo = Foo { ... }
expensive :: Foo -> Bar
then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.
data Foo = Foo { ..., memo :: Bar }
To make it easier to use, add a smart constructor to tie the knot.
makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo
Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.
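As a minimal, self-contained illustration of the same knot-tying trick (the Bar type and the doubling "expensive" computation are made up purely for this example):
newtype Bar = Bar Int deriving Show

-- 'payload' is ordinary data; 'memo' caches the expensive result, and
-- laziness ensures it is computed at most once, on first use.
data Foo = Foo { payload :: Int, memo :: Bar }

expensive :: Foo -> Bar
expensive f = Bar (payload f * 2)   -- stand-in for a genuinely expensive function

-- The smart constructor ties the knot: 'memo' refers to the very Foo being
-- built. Note that 'expensive' must not force the memo field, or this loops.
makeFoo :: Int -> Foo
makeFoo n = let foo = Foo { payload = n, memo = expensive foo } in foo

-- memo (makeFoo 21) ==> Bar 42, computed on first demand and shared thereafter.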
It seems that stable-memo would be just what you needed (although I'm not sure if it can handle multiple threads):
Whereas most memo combinators memoize based on equality, stable-memo does it based on whether the exact same argument has been passed to the function before (that is, is the same argument in memory).
stable-memo only evaluates keys to WHNF.
This can be more suitable for recursive functions over graphs with cycles.
stable-memo doesn't retain the keys it has seen so far, which allows them to be garbage collected if they will no longer be used. Finalizers are put in place to remove the corresponding entries from the memo table if this happens.
Data.StableMemo.Weak provides an alternative set of combinators that also avoid retaining the results of the function, only reusing results if they have not yet been garbage collected.
There is no type class constraint on the function's argument.
stable-memo will not work for arguments which happen to have the same value but are not the same heap object. This rules out many candidates for memoization, such as the most common example, the naive Fibonacci implementation whose domain is machine Ints; it can still be made to work for some domains, though, such as the lazy naturals.
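Assuming stable-memo's combinator has the usual shape memo :: (a -> b) -> a -> b, the question's recursiveTransform might be sketched like this (Annotation and the body of myTransform are placeholders, since their details aren't given above):
import Data.StableMemo (memo)   -- assumed interface: memo :: (a -> b) -> a -> b

type Annotation = String        -- placeholder; the real annotation type is opaque here

data Node = Node [Node] [Annotation]

-- Placeholder for the per-node rewrite described in the question.
myTransform :: Node -> Node
myTransform (Node children anns) = Node children anns

-- Because lookups are keyed on object identity, a shared subgraph is
-- transformed only once, so the output keeps the input's structural sharing.
recursiveTransform :: Node -> Node
recursiveTransform = go
  where
    go = memo (\(Node children anns) ->
                 myTransform (Node (map go children) anns))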
Ekmett just uploaded a library that handles this and more (produced at HacPhi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom type interning library that works over arbitrary data structures (including recursive ones). It uses WeakPtrs internally to maintain the table. However, it uses Ints to index the values to avoid structural equality checks, which means packing them into the data type, when what you want are apparently actually StableNames. So I realize this answers a related question, but requires modifying your data type, which you want to avoid...
