Convert String to Int checking for overflow - haskell

When I tried to convert a very long integer to Int, I was surprised that no error was thrown:
Prelude> read "123456789012345678901234567890" :: Int
-4362896299872285998
readMaybe from the Text.Read module gives the same result.
Two questions:
Which function should I call to perform a safe conversion?
How can the most type safe language on Earth allow such unsafe things?
Update 1:
This is my attempt to write a version of read that checks bounds:
{-# LANGUAGE ScopedTypeVariables #-}
parseIntegral :: forall a . (Integral a, Bounded a) => String -> Maybe a
parseIntegral s = integerToIntegral (read s :: Integer)
  where
    integerToIntegral n | n < fromIntegral (minBound :: a) = Nothing
    integerToIntegral n | n > fromIntegral (maxBound :: a) = Nothing
    integerToIntegral n = Just $ fromInteger n
Is it the best I can do?

Background: why unchecked overflow is actually wonderful
Haskell 98 leaves overflow behavior explicitly unspecified, which is good for implementers and bad for everyone else. Haskell 2010 discusses it in two sections—in the section inherited from Haskell 98, it's left explicitly unspecified, whereas in the sections on Data.Int and Data.Word, it is specified. This inconsistency will hopefully be resolved eventually.
GHC is kind enough to specify it explicitly:
All arithmetic is performed modulo 2^n, where n is the number of bits in the type.
This is an extremely useful specification. In particular, it guarantees that Int, Word, Int64, Word32, etc., form rings, and even principal ideal rings, under addition and multiplication. This means that arithmetic will always work right—you can transform expressions and equations in lots of different ways without breaking things. Throwing exceptions on overflow would break all these properties, making it much more difficult to write and reason about programs. The only times you really need to be careful are when you use comparison operators like < and compare—fixed width integers do not form ordered groups, so these operators are a bit touchy.
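For example, in GHCi (on a 64-bit machine, where Int is 64 bits):
Prelude> let a = maxBound :: Int
Prelude> a + 1
-9223372036854775808
Prelude> (a + 1) - 1 == a
True
The addition overflows, yet subtracting 1 recovers a exactly, as the ring laws promise.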
Why it makes sense not to check reads
Reading an integer involves many multiplications and additions. It also needs to be fast. Checking to make sure the read is "valid" is not so easy to do quickly. In particular, while it's easy to find out whether an addition has overflowed, it is not easy to find out whether a multiplication has. The only sensible ways I can think of to perform a checked read for Int are
Read as an Integer, check, then convert (see the sketch after this list). Integer arithmetic is significantly more expensive than Int arithmetic. For smaller things, like Int16, the read can be done with Int, checking for Int16 overflow, then narrowed. This is cheaper, but still not free.
Compare the number in decimal to maxBound (or, for a negative number, minBound) while reading. This seems more likely to be reasonably efficient, but there will still be some cost. As the first section of this answer explains, there is nothing inherently wrong with overflow, so it's not clear that throwing an error is actually better than giving an answer modulo 2^n.
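A minimal sketch of the first option (the name readBounded is just for illustration), using readMaybe so that malformed input is handled too, and toIntegralSized from Data.Bits (available since base 4.8), which performs exactly this kind of bounds-checked narrowing:
import Data.Bits (Bits, toIntegralSized)
import Text.Read (readMaybe)
-- Parse as Integer (which cannot overflow), then narrow with a bounds check.
readBounded :: (Integral a, Bits a) => String -> Maybe a
readBounded s = toIntegralSized =<< (readMaybe s :: Maybe Integer)
-- readBounded "123456789012345678901234567890" :: Maybe Int  ==>  Nothing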

If isn't "unsafe", in that the behaviour of the problem isn't undefined. (It's perfectly defined, just probably not what you wanted.) For example, unsafeWriteAray is unsafe, in that if you make a mistake with it, it writes data into arbitrary memory locations, either causing your application to segfault, or merely making corrupt its own memory, causing it behave in arbitrary undefined ways.
So much for splitting hairs. If you want to deal with such huge numbers, Integer is the only way to go. But you probably knew that already.
As for why there's no overflow check... Sometimes you actually want a number to overflow. (E.g., you might convert to Word8 without explicitly ANDing out the bottom 8 bits.) At any rate, every possible arithmetic operation can potentially overflow (e.g., maxBound + 1 = minBound, and that's just normal addition.) Do you really want every single arithmetic operation to have an overflow check, slowing your program down at every step?
You get the exact same behaviour in C, C++ or C#. I guess the difference is, in C# we have the checked keyword, which allows you to automatically check for overflow. Maybe somebody has a Haskell package for doing checked arithmetic… For now, it's probably simpler to just implement this one check yourself.

Related

How can I add constraints to a type in a data type in Haskell? [duplicate]

In many articles about Haskell, they say it allows some checks to be performed at compile time instead of run time. So, I want to implement the simplest check possible: allow a function to be called only on integers greater than zero. How can I do it?
module Positive (toPositive, getPositive, Positive) where
newtype Positive = Positive { unPositive :: Int }
toPositive :: Int -> Maybe Positive
toPositive n = if (n <= 0) then Nothing else Just (Positive n)
-- We can't export unPositive, because unPositive can be used
-- to update the field. Trivially renaming it to getPositive
-- ensures that getPositive can only be used to access the field
getPositive :: Positive -> Int
getPositive = unPositive
The above module doesn't export the constructor, so the only way to build a value of type Positive is to supply toPositive with a positive integer, which you can then unwrap using getPositive to access the actual value.
You can then write a function that only accepts positive integers using:
positiveInputsOnly :: Positive -> ...
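A hypothetical usage sketch (the numbers and IO here are only for illustration):
-- Validate once at the boundary, then compute freely.
main :: IO ()
main = case toPositive 5 of
  Nothing -> putStrLn "not positive"
  Just p  -> print (getPositive p * 2)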
Haskell can perform some checks at compile time that other languages perform at runtime. Your question seems to imply you are hoping for arbitrary checks to be lifted to compile time, which isn't possible without a large potential for proof obligations (which could mean you, the programmer, would need to prove the property is true for all uses).
In the below, I don't feel like I'm saying anything more than what pigworker touched on while mentioning the very cool sounding Inch tool. Hopefully the additional words on each topic will clarify some of the solution space for you.
What People Mean (when speaking of Haskell's static guarantees)
Typically when I hear people talk about the static guarantees provided by Haskell, they are talking about Hindley-Milner style static type checking. This means one type cannot be confused for another; any such misuse is caught at compile time (e.g. let x = "5" in x + 1 is invalid). Obviously, this only scratches the surface, and we can discuss some more aspects of static checking in Haskell.
Smart Constructors: Check once at runtime, ensure safety via types
Gabriel's solution is to have a type, Positive, that can only be positive. Building positive values still requires a check at runtime but once you have a positive there are no checks required by consuming functions - the static (compile time) type checking can be leveraged from here.
This is a good solution for many, many problems. I recommended the same thing when discussing golden numbers. Nevertheless, I don't think this is what you are fishing for.
Exact Representations
dflemstr commented that you can use a type, Word, which is unable to represent negative numbers (a slightly different issue than representing positives). In this manner you really don't need to use a guarded constructor (as above) because there is no inhabitant of the type that violates your invariant.
A more common example of using proper representations is non-empty lists. If you want a type that can never be empty then you could just make a non-empty list type:
data NonEmptyList a = Single a | Cons a (NonEmptyList a)
This is in contrast to the traditional list definition using Nil instead of Single a.
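For example (safeHead is an illustrative name, not from the original), taking the first element is total with this representation; no Maybe is needed:
safeHead :: NonEmptyList a -> a
safeHead (Single x) = x
safeHead (Cons x _) = x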
Going back to the positive example, you could use a form of Peano numbers:
data NonNegative = One | S NonNegative  -- note: with One as the base case, this actually encodes the strictly positive naturals
Or use GADTs to build unsigned binary numbers (and you can add Num and other instances, allowing functions like +):
{-# LANGUAGE GADTs #-}

data Zero
data NonZero

data Binary a where
  I :: Binary a -> Binary NonZero
  O :: Binary a -> Binary a
  Z :: Binary Zero
  N :: Binary NonZero

instance Show (Binary a) where
  show (I x) = "1" ++ show x
  show (O x) = "0" ++ show x
  show Z     = "0"
  show N     = "1"
External Proofs
While not part of the Haskell universe, it is possible to generate Haskell using alternate systems (such as Coq) that allow richer properties to be stated and proven. In this manner the Haskell code can simply omit checks like x > 0 but the fact that x will always be greater than 0 will be a static guarantee (again: the safety is not due to Haskell).
From what pigworker said, I would classify Inch in this category. Haskell has not grown sufficiently to perform your desired tasks, but tools to generate Haskell (in this case, very thin layers over Haskell) continue to make progress.
Research on More Descriptive Static Properties
The research community that works with Haskell is wonderful. While too immature for general use, people have developed tools to do things like statically check function partiality and contracts. If you look around you'll find it's a rich field.
I would be failing in my duty as his supervisor if I failed to plug Adam Gundry's Inch preprocessor, which manages integer constraints for Haskell.
Smart constructors and abstraction barriers are all very well, but they push too much testing to run time and don't allow for the possibility that you might actually know what you're doing in a way that checks out statically, with no need for Maybe padding. (A pedant writes. The author of another answer appears to suggest that 0 is positive, which some might consider contentious. Of course, the truth is that we have uses for a variety of lower bounds, 0 and 1 both occurring often. We also have some use for upper bounds.)
In the tradition of Xi's DML, Adam's preprocessor adds an extra layer of precision on top of what Haskell natively offers but the resulting code erases to Haskell as is. It would be great if what he's done could be better integrated with GHC, in coordination with the work on type level natural numbers that Iavor Diatchki has been doing. We're keen to figure out what's possible.
To return to the general point, Haskell is currently not sufficiently dependently typed to allow the construction of subtypes by comprehension (e.g., elements of Integer greater than 0), but you can often refactor the types to a more indexed version which admits static constraint. Currently, the singleton type construction is the cleanest of the available unpleasant ways to achieve this. You'd need a kind of "static" integers, then inhabitants of kind Integer -> * capture properties of particular integers such as "having a dynamic representation" (that's the singleton construction, giving each static thing a unique dynamic counterpart) but also more specific things like "being positive".
Inch represents an imagining of what it would be like if you didn't need to bother with the singleton construction in order to work with some reasonably well behaved subsets of the integers. Dependently typed programming is often possible in Haskell, but is currently more complicated than necessary. The appropriate sentiment toward this situation is embarrassment, and I for one feel it most keenly.
I know that this was answered a long time ago and I already provided an answer of my own, but I wanted to draw attention to a new solution that became available in the interim: Liquid Haskell, which you can read an introduction to here.
In this case, you can specify that a given value must be positive by writing:
{-@ myValue :: {v: Int | v > 0} @-}
myValue = 5
Similarly, you can specify that a function f requires only positive arguments like this:
{-@ f :: {v: Int | v > 0} -> Int @-}
Liquid Haskell will verify at compile-time that the given constraints are satisfied.
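A hypothetical sketch of how these interact (assuming the f and myValue annotated above are in scope):
ok :: Int
ok = f myValue       -- accepted: myValue is known to satisfy v > 0
-- bad :: Int
-- bad = f 0         -- rejected at compile time by the refinement checker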
This, or actually the similar desire for a type of natural numbers (including 0), is a common complaint about Haskell's numeric class hierarchy, which makes it impossible to provide a really clean solution to this.
Why? Look at the definition of Num:
class (Eq a, Show a) => Num a where
  (+) :: a -> a -> a
  (*) :: a -> a -> a
  (-) :: a -> a -> a
  negate :: a -> a
  abs :: a -> a
  signum :: a -> a
  fromInteger :: Integer -> a
Unless you revert to using error (which is a bad practice), there is no way you can provide definitions for (-), negate and fromInteger.
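To make that concrete, here is a hypothetical Nat wrapper; the only way to complete its Num instance is to fall back on error:
newtype Nat = Nat Integer deriving (Eq, Show)  -- invariant: the Integer is >= 0

instance Num Nat where
  Nat a + Nat b = Nat (a + b)
  Nat a * Nat b = Nat (a * b)
  Nat a - Nat b
    | b <= a    = Nat (a - b)
    | otherwise = error "Nat: negative difference"  -- the unavoidable hole
  negate (Nat 0) = Nat 0
  negate _       = error "Nat: negate"
  abs            = id
  signum (Nat 0) = Nat 0
  signum _       = Nat 1
  fromInteger n
    | n >= 0    = Nat n
    | otherwise = error "Nat: negative literal"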
Type-level natural numbers are planned for GHC 7.6.1: https://ghc.haskell.org/trac/ghc/ticket/4385
Using this feature, it's trivial to write a "natural number" type, and it gives performance you could never achieve with, e.g., a manually written Peano number type.

What type to give to a function that measures a data structure? Int, Integer, Integral?

I have a data structure, such as an expression tree or graph. I want to add some "measure" functions, such as depth and size.
How best to type these functions?
I see the following three variants as of roughly equal usefulness:
depth :: Expr -> Int
depth :: Expr -> Integer
depth :: Num a => Expr -> a
I have the following considerations in mind:
I'm looking at base and fgl as examples, and they consistently use Int, but Data.List also has functions such as genericLength that are polymorphic in return type, and I am thinking that maybe the addition of these generic functions is a reflection of a modernizing trend that I probably should respect and reinforce.
A similar movement of thought is noticeable in some widely used libraries providing a comprehensive set of functions with the same functionality when there are several probable choices of a return type to be desired by the user (e.g. xml-conduit offers parsers that accept both lazy and strict kinds of either ByteString or Text).
Integer is a nicer type overall than Int, and I sometimes find that I need to cast the length of a list to an Integer, say because an algorithm that operates in Integer needs to take this length into account.
Making functions return any Integral means these functions are made polymorphic, which may carry a performance penalty. I don't know all the particulars well, but, as I understand it, there may be some run-time cost, and polymorphic things are harder to memoize.
What is the accepted best practice? Which part of it is due to legacy and compatibility considerations? (I.e. if Data.List was designed today, what type would functions such as length have?) Did I miss any pros and cons?
Short answer: As a general rule use Int, and if you need to convert it to something else, use fromIntegral. (If you find yourself doing the conversion a lot, define fi = fromIntegral to save typing or else create your own wrapper.)
The main consideration is performance. You want to write the algorithm so it uses an efficient integer type internally. Provided Int is big enough for whatever calculation you're doing (the standard guarantees a signed 30-bit integer, but even on 32-bit platforms using GHC, it's a signed 32-bit integer), you can assume it will be a high-speed integer type on the platform, particularly in comparison to Integer (which has boxing and bignum calculation overhead that can't be optimized away). Note that the performance differences can be substantial. Simple counting algorithms will often be 5-10x faster using Ints compared to Integers.
While you could give your function a different signature:
depth :: Expr -> Integer
depth :: (Num a) => Expr -> a
and actually implement it under the hood using the efficient Int type, converting at the end, making the conversion implicit strikes me as poor practice. Particularly if this is a library function, making it clear that Int is being used internally by making it part of the signature strikes me as more sensible.
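To illustrate, a sketch with a toy Expr type (the type and its definition here are invented for the example):
data Expr = Leaf | Node Expr Expr

-- Int in the signature makes the internal representation choice explicit.
depth :: Expr -> Int
depth Leaf       = 0
depth (Node l r) = 1 + max (depth l) (depth r)

-- A call site that needs an Integer converts explicitly:
-- fromIntegral (depth e) :: Integer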
With respect to your listed considerations:
First, the generic* functions in Data.List aren't modern. In particular, genericLength was available in GHC 0.29, released July, 1996. At some prior point, length had been defined in terms of genericLength, as simply:
length :: [a] -> Int
length = genericLength
but in GHC 0.29, this definition was commented out with an #ifdef USE_REPORT_PRELUDE, and several hand-optimized variants of length were defined independently. The other generic* functions weren't in 0.29, but they were already around by GHC 4.02 (1998).
Most importantly, when the Prelude version of length was generalized from lists to Foldables, which is a fairly recent development (since GHC 7.10?), nobody cared enough to do anything with genericLength. I also don't think I've ever seen these functions used "in the wild" in any serious Haskell code. For the most part, you can think of them as deprecated.
Second, the use of lazy/strict and ByteString/Text variants in libraries represents a somewhat different situation. In particular, an xml-conduit user will normally be making the decision between lazy and strict variants, and between ByteString and Text types, based on considerations about the data being processed and the construction of the algorithms that are far-reaching and pervade the entire type system of a given program. If the only way to use xml-conduit with a lazy Text type was to convert it piecemeal to strict ByteStrings, pass it to the library, and then pull it back out and convert it back to a lazy Text type, no one would accept that complexity. In contrast, a monomorphic Int-based definition of depth works fine, because all you need is fromIntegral . depth to adapt it to any numeric context.
Third, as noted above, Integer is only a "nicer" type from the standpoint of having arbitrary precision in situations where you don't care about performance. For things like depth and count in any practical setting, performance is likely to be more important than unlimited precision.
Fourth, I don't think either the runtime cost or failure-to-memoize should be serious considerations in choosing between polymorphic or non-polymorphic versions here. In most situations, GHC will generate a specialized version of the polymorphic function in a context where memoization is no problem.
On this basis, I suspect if Data.List was designed today, it would still use Ints.
I agree with all the points in K. A. Buhr's great answer, but here are a couple more:
You should use a return type of Integer if you expect to support an expression tree that somehow doesn't fit in memory (which seems interesting, but unlikely). If I saw Expr -> Integer I would go looking in the code or docs to try to understand how or why the codomain might be so large.
Re. performance of Integer: normal machine-word arithmetic will be used if the number is not larger than the max width of a machine word. Simplifying, the type is basically:
data Integer = SmallInteger Int | LargeInteger ByteArray
K. A. Buhr mentions that there is an unavoidable performance penalty which is that this value cannot be unboxed (that is it will always have a heap representation, and will be read from and written to memory), and that does sound right to me.
In contrast, functions on Int (or Word) are often unboxed, so that in core you will see types that look like Int# -> Int# -> Int#. You can think of an Int# as only existing in a machine register. This is what you want your numeric code to look like if you care about performance.
Re. polymorphic versions: designing libraries around concrete numeric inputs and polymorphic numeric outputs probably works okay, in terms of convenient type inference. We already have this to a certain degree, in that numeric literals are overloaded. There are certainly times when literals (e.g. also string literals when -XOverloadedStrings) need to be given type signatures, and so I'd expect that if base were designed to be more polymorphic that you would run into more occasions where this would be required (but fewer uses of fromIntegral).
Another option you haven't mentioned is using Word to express that the depth is non-negative. This is more useful for inputs, but even then is often not worth it: Word will still overflow, and negative literals are valid (although GHC will issue a warning); to a certain extent it's just moving where the bug occurs.
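For example, in GHCi (on a 64-bit machine, where Word is 64 bits):
Prelude> (0 :: Word) - 1
18446744073709551615
The bug hasn't gone away; it has merely turned from a negative depth into an absurdly large one.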

Primitive types as expanded classes in Haskell?

In Eiffel one is allowed to use an expanded class, which doesn't allocate from the heap. From a developer's perspective one rarely has to think about conversion from Int to Float, as it is automatic. My question is this: why did Haskell not choose a similar approach to modeling Num? Specifically, let's consider the Int instance. Here is the rationale for my question:
[1..3] = [1,2,3]
[1..3.5] = [1.0,2.0,3.0,4.0] -- rounds up
The second list was something that I was not expecting, because there are infinitely many real numbers between any two integers. Of course, once we test the sequence, it is clear that it is returning whole-valued floats, with the upper bound rounded up. One of the reasons these conversions are needed is to allow us to compute, for example, the mean of a set of Integers.
In Eiffel the number type hierarchy is a bit more programmer friendly and the conversion happens as needed: for example creating a sequence can still be a set of Ints that result in a floating point mean. This has a readability advantage.
Is there a reason that expanded classes were not implemented in Haskell? Any references would greatly help.
@ony: on the point about parallel strategies: won't we face the same issue when using primitives? The manual does discourage using primitives, and that makes sense to me; in general, wherever we could use primitives, we should probably use the abstract type instead. The issue I faced when trying to take the mean of some numbers is the missing Fractional Int instance, and why 5/3 does not promote to a floating point result instead of my having to create a floating point array to achieve the same thing. There must be a reason why the Fractional instances of Int and Integer are not defined; that could help me understand the rationale better.
@leftroundabout: the question is not about expanded classes per se but about the convenience that such a feature can offer, although that feature alone is not sufficient to handle the type promotion from an int to a float, for example, as mentioned in my response to @ony. Let's take the classic example of a mean and try to define it as:
-- what I would like to write, with type [Int] -> Double:
mean xs = sum xs / length xs            -- not valid Haskell
-- what actually works:
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
I would have liked it if I did not have to call fromIntegral to get the mean function to work, and that ties in to the missing Fractional Int. Although the explanation seems to make sense (it has to), what I don't understand is: if I am clear that I expect a Double, and I state it in my type signature, is that not sufficient to perform the appropriate conversion?
[a..b] is shorthand for enumFromTo a b, a method of the Enum typeclass. It begins at a and succs until the first time b is exceeded.
[a,b..c] is shorthand for enumFromThenTo a b c, which is similar to enumFromTo except that instead of succ-ing it adds the difference b-a each time. By default this difference is computed by round-tripping through Int, so fractional differences may or may not be respected. That said, Double works as you'd expect:
Prelude> [0.0, 0.5.. 10]
[0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5,7.0,7.5,8.0,8.5,9.0,9.5,10.0]
[a..] is shorthand for enumFrom a which just succs forever.
[a,b..] is shorthand for enumFromThen a b which just adds (b-a) forever.
As for the behaviour, @J.Abrahamson already replied: that's the definition of enumFromThenTo.
As for design...
Actually, GHC has Float#, which is an unboxed type (it can be allocated anywhere, but its value is strict).
Since Haskell is a lazy language, it assumes that most values are not required immediately; they are only computed when actually demanded by a primitive with strict arguments.
Consider length [2..10]. In this case, without optimisation, Haskell may even avoid generating the numbers and simply build up the list structure (without the values). A probably more useful example is takeWhile (<100) [x*(x-1) | x <- [2..]].
But you shouldn't think there is unavoidable overhead here, since you are writing in a language that abstracts all of that away (apart from strictness annotations). The Haskell compiler has to take this on as its own work: when the compiler can tell that every element of a list will be demanded (reduced to normal form), and it decides to process the list within one chain of stack returns, it can allocate the list on the stack.
Also, with this approach you can get more out of your code by using multiple CPU cores. Imagine that, using Strategies, your list is processed on different cores; they then need to share the common data on the heap (not on the stack).

convert one data type to another

I am familiar with imperative languages and their features, so I wonder: how can I convert one data type to another?
For example, in C++:
static_cast<T>(x)
and in C:
(data_type) value
Use an explicit coercion function.
For example, fromIntegral will convert any Integral type (Int, Integer, Word*, etc.) into any numeric type.
You can use Hoogle to find the actual function that suits your need by its type.
Haskell's type system is very different from, and quite a bit smarter than, those of C, C++, or Java. To understand why you are not going to get the answer you expect, it will help to compare the two.
For C and friends, the type is a way of describing the layout of data in memory. The compiler does make a few sanity checks, trying to ensure that memory is not corrupted, but in the end it's all bytes and you can call them whatever you want. This is even more true of pointers, which are always laid out the same in memory but can reference anything or (frighteningly) nothing.
In Haskell, types are a language that one writes to the compiler with. As a programmer you have no control over how the compiler represents data, and because Haskell is lazy, a lot of the data in your program may be no more than a promise to produce a value on demand (called a thunk in GHC and Hugs). While a C compiler can be instructed to treat data differently, there is no equivalent way to tell a Haskell compiler to treat one type like another in general.
As mentioned in other answers, there are some types with obvious ways to convert one to another. Any of the numerical types like Double or Rational (in general, any instance of the Num class) can be made from an Integer, but we need to use a function specifically designed to make this happen. In a sense this is not a 'cast' but an actual function, in the same way that \x -> x > 0 is a function for converting numbers into booleans.
I'll make one last guess as to why you might be asking a question like this. When I was just starting with Haskell, I wrote a lot of functions like:
area :: Double -> Double -> Double -- find the area of a rectangle
area x y = x * y
I would then find myself dumping fromInteger calls all over the place to get my data into the correct type for the function. Having come from a C background, I wrote all my functions with monomorphic types. The trick to not needing to cast from one type to another is to write functions that work with different types. Haskell type classes are a huge shift for OOP programmers, and so they often get ignored the first couple of tries, but they are what make the otherwise very strict Haskell type system usable. If you can relax your type signatures (e.g. area :: Num a => a -> a -> a), you will find yourself wishing for that cast function much less often.
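For instance, the relaxed signature mentioned above lets the same definition serve Int, Integer, and Double callers alike:
area :: Num a => a -> a -> a
area x y = x * y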
There are many different functions that convert different data types.
The examples would be:
fromIntegral - to convert between say Int and Double
pack / unpack - to convert between ByteString and String
read - to convert from String to Int
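For illustration, a few of these in action (pack and unpack here are the ones from Data.ByteString.Char8):
import qualified Data.ByteString.Char8 as BS

d :: Double
d = fromIntegral (3 :: Int)   -- Int -> Double

bs :: BS.ByteString
bs = BS.pack "hello"          -- String -> ByteString

s :: String
s = BS.unpack bs              -- ByteString -> String

n :: Int
n = read "42"                 -- String -> Int (partial; readMaybe is safer)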

Using low bitsize integral types like `Int8` and what they are for

Recently I've learned that computation is performed on machine words, which on most contemporary processors and OSes are either 32-bit or 64-bit. So what are the benefits of using smaller bit-size values like Int16, Int8, and Word8? What are they exactly for? Is it storage reduction only?
I'm writing a complex calculation program which consists of several modules but is interfaced by only a single function, which returns a Word64 value, so the whole program results in a Word64 value. I'm interested in the answer to this question because inside this program I found myself using a lot of different integral types, like Word16 and Word8, to represent small entities, and seeing that they quite often got converted with fromIntegral got me thinking: was I making a mistake there, and what is the exact benefit of these types that I was blindly attracted to without understanding it? Did it make sense to use the other integral types at all and eventually convert them with fromIntegral, or should I have just used Word64 everywhere?
In GHC, the fixed-size integral types all take up a full machine word, so there's no space savings to be had. Using machine-word-sized types (i.e. Int and Word) will probably be faster than the fixed-size types in most cases, but using a fixed-size integral type will be faster than doing explicit wrap-around.
You should choose the appropriate type for the range of values you're using. maxBound :: Word8 is 255, 255 + 1 :: Word8 is 0 — and if you're dealing with octets, that's exactly what you want. (For instance, ByteStrings are defined as storing Word8s.)
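For example, in GHCi:
Prelude> import Data.Word
Prelude Data.Word> maxBound :: Word8
255
Prelude Data.Word> 255 + 1 :: Word8
0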
If you just have some integers that don't need a specific number of bits, and the calculations you're doing aren't going to overflow, just use Int or Word (or even Integer). Fixed-size types are less common than the regular integral types because, most of the time, you don't need a specific size.
So, don't use them for performance; use them if you're looking for their specific semantics: fixed-size integral types with defined overflow behaviour.
These smaller types give you a memory reduction only when you store them in unboxed arrays or similar. There, each will take as many bits as indicated by the type suffix.
In general use, they all take exactly as much storage as an Int or Word, the main difference is that the values are automatically narrowed to the appropriate bit size when using fixed-width types, and there are (still) more optimisations (in the form of rewrite rules mainly) for Int and Word than for Int8 etc., so some operations will be slower using those.
Concerning the question whether to use Word64 throughout or to use smaller types, that depends. On a 64-bit system, when compiling with optimisations, the performance of Word and Word64 should mostly be the same, since where it matters both should be unpacked and the work done on the raw machine Word#. But there probably still are a few rules for Word that have no Word64 counterpart yet, so perhaps there is a difference after all. On a 32-bit system, most operations on Word64 are implemented via C calls, so there, operations on Word64 are much slower than operations on Word.
So depending on what is more important, simplicity of code or performance on different systems, either
use Word64 throughout: simple code, good performance on 64-bit systems
use Word as long as your values are guaranteed to fit into 32 bits and transform to Word64 at the latest safe moment: more complicated code, but better performance on 32-bit systems.

Resources