partial functions vs input verification - haskell

I really love using total functions. That said, sometimes I'm not sure what the best approach is for guaranteeing that. Lets say that I'm writing a function similar to chunksOf from the split package, where I want to split up a list into sublists of a given size. Now I'd really rather say that the input for sublist size needs to be a positive int (so excluding 0). As I see it I have several options:
1) all-out: make a newtype for PositiveInt, hide the constructor, and only expose safe functions for creating a PositiveInt (perhaps returning a Maybe or some union of Positive | Negative | Zero or what have you). This seems like it could be a huge hassle.
2) what the split package does: just return an infinite list of size-0 sublists if the size <= 0. This seems like you risk bugs not getting caught, and worse: those bugs just infinitely hanging your program with no indication of what went wrong.
3) what most other languages do: error when the input is <= 0. I really prefer total functions though...
4) return an Either or Maybe to cover the case that the input might have been <= 0. Similar to #1, it seems like using this could just be a hassle.
This seems similar to this post, but this has more to do with error conditions than just being as precise about types as possible. I'm looking for thoughts on how to decide what the best approach for a case like this is. I'm probably most inclined towards doing #1, and just dealing with the added overhead, but I'm concerned that I'll be kicking myself down the road. Is this a decision that needs to be made on a case-by-case basis, or is there a general strategy that consistently works best?

Related

Unclear why functions from Data.Ratio are not exposed and how to work around

I am implementing an algorithm using Data.Ratio (convergents of continued fractions).
However, I encounter two obstacles:
The algorithm starts with the fraction 1%0 - but this throws a zero denominator exception.
I would like to pattern match the constructor a :% b
I was exploring on hackage. An in particular the source seems to be using exactly these features (e.g. defining infinity = 1 :% 0, or pattern matching for numerator).
As beginner, I am also confused where it is determined that (%), numerator and such are exposed to me, but not infinity and (:%).
I have already made a dirty workaround using a tuple of integers, but it seems silly to reinvent the wheel about something so trivial.
Also would be nice to learn how read the source which functions are exposed.
They aren't exported precisely to prevent people from doing stuff like this. See, the type
data Ratio a = a:%a
contains too many values. In particular, e.g. 2/6 and 3/9 are actually the same number in ℚ and both represented by 1:%3. Thus, 2:%6 is in fact an illegal value, and so is, sure enough, 1:%0. Or it might be legal but all functions know how to treat them so 2:%6 is for all observable means equal to 1:%3 – I don't in fact know which of these options GHC chooses, but at any rate it's an implementation detail and could change in future releases without notice.
If the library authors themselves use such values for e.g. optimisation tricks that's one thing – they have after all full control over any algorithmic details and any undefined behaviour that could arise. But if users got to construct such values, it would result in brittle code.
So – if you find yourself starting an algorithm with 1/0, then you should indeed not use Ratio at all there but simply store numerator and denominator in a plain tuple, which has no such issues, and only make the final result a Ratio with %.

Why would more array accesses perform better?

I'm taking a course on coursera that uses minizinc. In one of the assignments, I was spinning my wheels forever because my model was not performing well enough on a hidden test case. I finally solved it by changing the following types of accesses in my model
from
constraint sum(neg1,neg2 in party where neg1 < neg2)(joint[neg1,neg2]) >= m;
to
constraint sum(i,j in 1..u where i < j)(joint[party[i],party[j]]) >= m;
I dont know what I'm missing, but why would these two perform any differently from eachother? It seems like they should perform similarly with the former being maybe slightly faster, but the performance difference was dramatic. I'm guessing there is some sort of optimization that the former misses out on? Or, am I really missing something and do those lines actually result in different behavior? My intention is to sum the strength of every element in raid.
Misc. Details:
party is an array of enum vars
party's index set is 1..real_u
every element in party should be unique except for a dummy variable.
solver was Gecode
verification of my model was done on a coursera server so I don't know what optimization level their compiler used.
edit: Since minizinc(mz) is a declarative language, I'm realizing that "array accesses" in mz don't necessarily have a direct corollary in an imperative language. However, to me, these two lines mean the same thing semantically. So I guess my question is more "Why are the above lines different semantically in mz?"
edit2: I had to change the example in question, I was toting the line of violating coursera's honor code.
The difference stems from the way in which the where-clause "a < b" is evaluated. When "a" and "b" are parameters, then the compiler can already exclude the irrelevant parts of the sum during compilation. If "a" or "b" is a variable, then this can usually not be decided during compile time and the solver will receive a more complex constraint.
In this case the solver would have gotten a sum over "array[int] of var opt int", meaning that some variables in an array might not actually be present. For most solvers this is rewritten to a sum where every variable is multiplied by a boolean variable, which is true iff the variable is present. You can understand how this is less efficient than an normal sum without multiplications.

Excel: Avoiding the use of IF formula - Insights

Supposed I have
=IF(A1>0, 100, 50)
which I can also write as
=(A1>0)*(100) + (A1<=0)*(50)
I am interested in their performances. So my question is, who will win in the racing tracks?
Anyway I can think of two drawbacks (not related to performance).
The first drawback I had in mind is the latter form cannot return a text.
The second drawback is error handling, supposed I have =IFERROR(A1/B1, 0), I cannot substitute it with =(B1=0)*(0) + (B1<>0)*(A1/B1) which will still return a DIV/0 error.
In terms of performance drawback, I have one tickling my mind:
The latter form computes both IfTrue part and IfFalse part, which seems to be a drawback unless IF also compute both parts which I'm not sure about.
In terms of performance benefits:
The IF formula needs to consider for number or text outcome. The latter form simply return a number which makes it more light weighted.
Thanks!

Time Complexity for index and drop of first item in Data.Sequence

I was recently working on an implementation of calculating moving average from a stream of input, using Data.Sequence. I figured I could get the whole operation to be O(n) by using a deque.
My first attempt was (in my opinion) a bit more straightforward to read, but not a true a deque. It looked like:
let newsequence = (|>) sequence n
...
let dropFrontTotal = fromIntegral (newtotal - index newsequence 0)
let newsequence' = drop 1 newsequence.
...
According to the hackage docs for Data.Sequence, index should take O(log(min(i,n-i))) while drop should also take O(log(min(i,n-i))).
Here's my question:
If I do drop 1 someSequence, doesn't this mean a time complexity of O(log(min(1, (length someSequence)))), which in this case means: O(log(1))?
If so, isn't O(log(1)) effectively constant?
I had the same question for index someSequence 0: shouldn't that operation end up being O(log(0))?
Ultimately, I had enough doubts about my understanding that I resorted to using Criterion to benchmark the two implementations to prove that the index/drop version is slower (and the amount it's slower by grows with the input). The informal results on my machine can be seen at the linked gist.
I still don't really understand how to calculate time complexity for these operations, though, and I would appreciate any clarification anyone can provide.
What you suggest looks correct to me.
As a minor caveat remember that these are amortized complexity bounds, so a single operation could require more than constant time, but a long chain of operations will only require a constant times the number of the chain.
If you use criterion to benchmark and "reset" the state at every computation, you might see non-constant time costs, because the "reset" is preventing the amortization. It really depends on how you perform the test. If you start from a sequence an perform a long chain of operations on that, it should be OK. If you repeat many times a single operation using the same operands, then it could be not OK.
Further, I guess bounds such as O(log(...)) should actually be read as O(log(1 + ...)) -- you can't realistically have O(log(1)) = O(0) or, worse O(log(0))= O(-inf) as a complexity bound.

Functional alternative to caching known "answers"

I think the best way to form this question is with an example...so, the actual reason I decided to ask about this is because of because of Problem 55 on Project Euler. In the problem, it asks to find the number of Lychrel numbers below 10,000. In an imperative language, I would get the list of numbers leading up to the final palindrome, and push those numbers to a list outside of my function. I would then check each incoming number to see if it was a part of that list, and if so, simply stop the test and conclude that the number is NOT a Lychrel number. I would do the same thing with non-lychrel numbers and their preceding numbers.
I've done this before and it has worked out nicely. However, it seems like a big hassle to actually implement this in Haskell without adding a bunch of extra arguments to my functions to hold the predecessors, and an absolute parent function to hold all of the numbers that I need to store.
I'm just wondering if there is some kind of tool that I'm missing here, or if there are any standards as a way to do this? I've read that Haskell kind of "naturally caches" (for example, if I wanted to define odd numbers as odds = filter odd [1..], I could refer to that whenever I wanted to, but it seems to get complicated when I need to dynamically add elements to a list.
Any suggestions on how to tackle this?
Thanks.
PS: I'm not asking for an answer to the Project Euler problem, I just want to get to know Haskell a bit better!
I believe you're looking for memoizing. There are a number of ways to do this. One fairly simple way is with the MemoTrie package. Alternatively if you know your input domain is a bounded set of numbers (e.g. [0,10000)) you can create an Array where the values are the results of your computation, and then you can just index into the array with your input. The Array approach won't work for you though because, even though your input numbers are below 10,000, subsequent iterations can trivially grow larger than 10,000.
That said, when I solved Problem 55 in Haskell, I didn't bother doing any memoization whatsoever. It turned out to just be fast enough to run (up to) 50 iterations on all input numbers. In fact, running that right now takes 0.2s to complete on my machine.

Resources