Automatic conversion between String and Data.Text in haskell - string

As Nikita Volkov mentioned in his question Data.Text vs String I also wondered why I have to deal with the different String implementations type String = [Char] and Data.Text in haskell. In my code I use the pack and unpack functions really often.
My question: Is there a way to have an automatic conversion between both string types so that I can avoid writing pack and unpack so often?
In other programming languages like Python or JavaScript there is for example an automatic conversion between integers and floats if it is needed. Can I reach something like this also in haskell? I know, that the mentioned languages are weakly typed, but I heard that C++ has a similar feature.
Note: I already know the language extension {-# LANGUAGE OverloadedStrings #-}. But as I understand this language extensions just applies to strings defined as "...". I want to have an automatic conversion for strings which I got from other functions or I have as arguments in function definitions.
Extended question: Haskell. Text or Bytestring covers also the difference between Data.Text and Data.ByteString. Is there a way to have an automatic conversion between the three strings String, Data.Text and Data.ByteString?

No.
Haskell doesn't have implicit coercions for technical, philosophical, and almost religious reasons.
As a comment, converting between these representations isn't free and most people don't like the idea that you have hidden and potentially expensive computations lurking around. Additionally, with strings as lazy lists, coercing them to a Text value might not terminate.
We can convert literals to Texts automatically with OverloadedStrings by desugaring a string literal "foo" to fromString "foo" and fromString for Text just calls pack.
The question might be to ask why you're coercing so much? Is there some why do you need to unpack Text values so often? If you constantly changing them to strings it defeats the purpose a bit.

Almost Yes: Data.String.Conversions
Haskell libraries make use of different types, so there are many situations in which there is no choice but to heavily use conversion, distasteful as it is - rewriting libraries doesn't count as a real choice.
I see two concrete problems, either of which being potentially a significant problem for Haskell adoption :
coding ends up requiring specific implementation knowledge of the libraries you want to use.This is a big issue for a high-level language
performance on simple tasks is bad - which is a big issue for a generalist language.
Abstracting from the specific types
In my experience, the first problem is the time spent guessing the package name holding the right function for plumbing between libraries that basically operate on the same data.
To that problem there is a really handy solution : the Data.String.Conversions package, provided you are comfortable with UTF-8 as your default encoding.
This package provides a single cs conversion function between a number of different types.
String
Data.ByteString.ByteString
Data.ByteString.Lazy.ByteString
Data.Text.Text
Data.Text.Lazy.Text
So you just import Data.String.Conversions, and use cs which will infer the right version of the conversion function according to input and output types.
Example:
import Data.Aeson (decode)
import Data.Text (Text)
import Data.ByteString.Lazy (ByteString)
import Data.String.Conversions (cs)
decodeTextStoredJson' :: T.Text -> MyStructure
decodeTextStoredJson' x = decode (cs x) :: Maybe MyStructure
NB : In GHCi you generally do not have a context that gives the target type so you direct the conversion by explicitly stating the type of the result, like for read
let z = cs x :: ByteString
Performance and the cry for a "true" solution
I am not aware of any true solution as of yet - but we can already guess the direction
it is legitimate to require conversion because the data does not change ;
best performance is achieved by not converting data from one type to another for administrative purposes ;
coercion is evil - coercitive, even.
So the direction must be to make these types not different, i.e. to reconcile them under (or over) an archtype from which they would all derive, allowing composition of functions using different derivations, without the need to convert.
Nota : I absolutely cannot evaluate the feasability / potential drawbacks of this idea. There may be some very sound stoppers.

Related

A version of "read" that doesn't require quoted strings

In my app I'm doing a lot of conversions from Text to various datatypes, often just to Text itself, but sometimes to other datatypes.
I also rarely do conversions from other string types, e.g. String and ByteString.
Interestingly, Readable.fromText does the job for me, at least for Integer and Text. However I also now need UTCTime, which Readable.fromText doesn't have an instance for (but which I could write myself).
I was thinking that Readable.fromText was a Text analogy of Text.Read.readEither for [Char], however I've realised that Readable.fromText is actually subtlety different, in that readEither for text isn't just pure, but instead expects the input string to be quoted. This isn't the case however for reading integers however, who don't expect quotes.
I understand that this is because show shows strings with quotes, so for read to be consistent it needs to require quotes.
However this is not the behaviour I want. I'm looking for a typeclass where reading strings to strings is basically the id function.
Readable seems to do this, but it's misleadingly named, as its behaviour is not entirely analogous to read on [Char]. Is there another typeclass that has this behaviour also? Or am I best of just extending Readable, perhaps with newtypes or alternatively PRs?
The what
Just use Data.Text and Data.Text.Read directly
With signed decimal or just decimal you get a simple and yet expressive minimalistic parser function. It's directly usable:
type Reader a = Text -> Either String (a, Text)
decimal :: Integral a => Reader a
signed :: Num a => Reader a -> Reader a
Or you cook up your own runReader :: Reader a -> M a combinator for some M to possibly handle non-empty leftover and deal with the Left case.
For turning a String -> Text, all you have to do is use pack
The why
Disclaimer: The matter of parsing data the right way is answered differently depending on who you ask.
I belong to the school that believes typeclasses are a poor fit for parsing mainly for two reasons.
Typeclasses limit you to one instance per type
You can easily have two different time formats in the data. Now you might tell yourself that you only have one use case, but what if you depend on another library that itself or transitively introduces another instance Readable UTCTime? Now you have to use newtypes for no reason other than be able to select a particular implementation, which is not nice!
Code transparency
You cannot make any inference as to what parser behavior you get from a typename alone. And for the most part haddock instance documentation often does not exist because it is often assumed the behavior be obvious.
Consider for example: What will instance Readable Int64 do?
Will it assume an ASCII encoded numeric representation? Or some binary representation?
If binary, which endianness is going to be assumed?
What representation of signedness is expected? In ASCII case perhaps a minus? Or maybe with a space? Or if binary, is it going to be one-complement? Two-complement?
How will it handle overflow?
Code transparency on call-sites
But the intransparency extends to call-sites as well. Consider the following example
do fieldA <- fromText
fieldB <- fromText
fieldB <- fromText
pure T{..}
What exactly does this do? Which parsers will be invoked? You will have to know the types of fieldA, fieldB and fieldB to answer that question. Now in simple code that might seem obvious, but you might easily forget if you look at the same code 2 weeks from now. Or you have more elaborate code, where the types involves are inferred non-locally. It becomes hard to follow which instance this will end up selecting (and the instance can make a huge difference, especially if you start newtyping for different formats. Say you cannot make any inference from a field name fooTimestamp because it might perhaps be UnixTime or UTCTime)
And much worse: If you refactor and alter one of the field types data declaration from one type to another - say a time field from Word64 to UTCTime - this might silently and unexpectedly switch out to a different parser, leading to a bug. Yuk!
On the topic of Show/Read
By the way, the reason why show/read behave they way they do for Prelude instances and deriving-generated instances can be discovered in the Haskell Report 2010.
On the topic of show it says
The result of show is a syntactically correct Haskell expression
containing only constants [...]
And equivalently for read
The result of show is readable by read if all component types are readable.
(This is true for all instances defined in the Prelude but may not be true
for user-defined instances.) [...]
So show for a string foo produces "foo" because that is the syntactically valid Haskell literal representing the string value of foo, and read will read that back, acting as a kind of eval

Escaping monad IO

One of the things I like best about Haskell is how the compiler locates the side effects via the IO monad in function signatures. However, it seems easy to bypass this type check by importing 2 GHC primitives :
{-# LANGUAGE MagicHash #-}
import GHC.Magic(runRW#)
import GHC.Types(IO(..))
hiddenPrint :: ()
hiddenPrint = case putStrLn "Hello !" of
IO sideEffect -> case runRW# sideEffect of
_ -> ()
hiddenPrint is of type unit, but it does trigger a side effect when called (it prints Hello). Is there a way to forbid those hidden IOs (other than trusting no one imports GHC's primitives) ?
This is the purpose of Safe Haskell. If you add {-# language Safe #-} to the top of your source file, you will only be allowed to import modules that are either inferred safe or labeled {-# language Trustworthy #-}. This also imposes some mild restrictions on overlapping instances.
There are many ways in which you can break purity in Haskell. However, you have to go out of your way to find them. Here are a few:
Importing GHC internal modules to access low-level primitives
Using the FFI to import a C function as a pure one, when it is not pure
Using unsafePerformIO
Using unsafeCoerce
Using Ptr a and dereference pointers not pointing to valid data
Declaring your own Typeable instances, and lying so that you can abuse cast. (No longer possible in recent GHC)
Using the unsafe array operations
All these, and others, are not meant to be used regularly. They are there so that, if one is really, really sure, can tell the compiler "I know what I am doing, don't get in my way". Doing so, you take the burden of the proof -- proving that what you are doing is safe.

How to build Abstract Syntax Trees from grammar specification in Haskell?

I'm working on a project which involves optimizing certain constructs in a very small subset of Java, formalized in BNF.
If I were to do this in Java, I would use a combination of JTB and JavaCC which builds an AST. Visitors are then used to manipulate the tree. But, given the vast libraries for parsing in Haskell (parsec, happy, alex etc), I'm a bit confused in chossing the appropriate library.
So, simply put, when a language is specified in BNF, which library offers the easiest means to build an AST? And what is the best way to go about modifying this tree in idiomatic Haskell?
Well in Haskell there are 2 main ways of parsing something, parse combinators or a parser generator. Since you already have a BNF I'd suggest the latter.
A good one is alex. GHC's parser IIRC is written using this so you'd be in good company.
Next you'll have a big honking stack of data declarations to parse into:
data JavaClass = {
className :: Name,
interfaces :: [Name],
contents :: [ClassContents],
...
}
data ClassContents = M Method
| F Field
| IC InnerClass
and for expressions and whatever else you need. Finally you'll combine these into something like
data TopLevel = JC JavaClass
| WhateverOtherForms
| YouWillParse
Once you have this you'll have the entire AST represented as one TopLevel or a list of them depending on how many you classes/files you parse.
To proceed from here depends on what you want to do. There are a number of libraries such as syb (scrap your boilerplate) that let you write very concise tree traversals and modifications. lens is also an option. At a minimum check out Data.Traversable and Data.Foldable.
To modify the tree, you can do something as simple as
ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerContents c = c{contents = filter isClass $ contents c}
-- ^^^ that is called a record update
where isClass (IC _) = True
isClass _ = False
and then you could potentially use something like syb to write
everywhere (mkT ignoreInnerClass) toplevel
which will traverse everything and apply ignoreInnerClass to all JavaClasses. This is possible to do in lens and many other libraries too, but syb is very easy to read.
I've never used bnfc-meta (suggested by #phg), but I would strongly recommend you look into BNFC (on hackage: http://hackage.haskell.org/package/BNFC). The basic approach is that you write your grammar in an annotated BNF style, and it will automatically generate an AST, parser, and pretty-printer for the grammar.
How suitable BNFC is depends upon the complexity of your grammar. If it's not context-free, you'll likely have a difficult time making any progress (I did make some success hacking up context-sensitive extensions, but that code's likely bit-rotted by now). The other downside is that your AST will very directly reflect the grammar specification. But since you already have a BNF specification, adding the necessary annotations for BNFC should be rather straightforward, so it's probably the fastest way to get a usable AST. Even if you decide to go a different route, you might be able to take the generated data types as a starting point for a hand-written version.
Alex + Happy.
There are many approaches to modify/investigate the parsed terms (ASTs). The keyword to search for is "datatype-generic" programming. But beware: it is a complex topic ...
http://people.cs.uu.nl/andres/Rec/MutualRec.pdf
http://www.cs.uu.nl/wiki/GenericProgramming/Multirec
It has a generic implementation of the zipper available here:
http://hackage.haskell.org/packages/archive/zipper/0.3/doc/html/Generics-MultiRec-Zipper.html
Also checkout https://github.com/pascalh/Astview
You might also check out the Haskell Compiler Series which is nice as an introduction to using alex and happy to parse a subset of Java: http://bjbell.wordpress.com/haskell-compiler-series/.
Since your grammar can be expressed in BNF, it is in the class of grammars that are efficiently parseable with a shift-reduce parser (LALR grammars). Such efficient parsers can be generated by the parser generator yacc/bison (C,C++), or its Haskell equivalent "Happy".
That's why I would use "Happy" in your case. It takes grammar rules in BNF form and generates a parser from it directly. The resulting parser will accept the language that is described by your grammar rules and produce an AST (abstract syntax tree). The Happy user guide is quite nice and gets you started quickly:
http://www.haskell.org/happy/doc/html/
To transform the resulting AST, generic programming is a good idea. Here is a classical explanation on how to do this in Haskell in a practical fashion, from scratch:
http://research.microsoft.com/en-us/um/people/simonpj/papers/hmap/
I have used exactly this to build a compiler for a small domain specific language, and it was a simple and concise solution.

Has anyone ever compiled a list of the imports needed to avoid the "not polymorphic enough" definitions in Haskell's standard libraries?

I have been using Haskell for quite a while. The more I use it, the more I fall in love with the language. I simply cannot believe I have spent almost 15 years of my life using other languages.
However, I am slowly but steadily growing fed up with Haskell's standard libraries. My main pet peeve is the "not polymorphic enough" definitions (Prelude.map, Control.Monad.forM_, etc.). I have a lot of Haskell source code files whose first lines look like
{-# LANGUAGE NoMonomorphismRestriction #-}
module Whatever where
import Control.Monad.Error hiding (forM_, mapM_)
import Control.Monad.State hiding (forM_, mapM_)
import Data.Foldable (forM_, mapM_)
{- ... -}
In order to avoid constantly hoogling which definitions I should hide, I would like to have a single or a small amount of source code files that wrap this import boilerplate into manageable units.
So...
Has anyone else tried doing this before?
If the answer to the previous question is "Yes", have they posted the resulting boilerplate-wrapping source code files?
It is not as clear cut as you imagine it to be. I will list all the disadvantages I can think of off the top of my head:
First, there is no limit to how general these functions can get. For example, right now I am writing a library for indexed types that subsumes ordinary types. Every function you mentioned has a more general indexed equivalent. Do I expect everybody to switch to my library for everything? No.
Here's another example. The mapM function defines a higher order functor that satisfies the functor laws in the Kleisli category:
mapM return = return
mapM (f >=> g) = mapM f >=> mapM g
So I could argue that your traversable generalization is the wrong one and instead we should generalize it as just being an instance of a higher order functor class.
Also, check out the category-extras package for some examples of these higher order classes and functions which subsume all your examples.
There is also the issue of performance. Many of thesr more specialized functions have really finely tuned implementations that dramatically help performance. Sometimes classes expose ways to admit more performant versions, but sometimes they don't.
There is also the issue of typeclass overload. I actually prefer to minimize use of typeclasses unless they have sound laws derived from theory rather than convenience. Also, typeclasses generally play poorly with the monomorphism restriction and I enjoy writing functions without signatures for my application code.
There is also the issue of taste. A lot of people simply don't agree what is the best Haskell style. We still can't even agree on the Prelude. Speaking of which, there have been many attempts to write new Preludes, but nobody can agree on what is best so we all default back to the Haskell98 one anyway.
However, I think the overall spirit of improving things is good and the worst enemy of progress is satisfaction, but don't assume there will be a clear-cut right way to do everything.

Why does Haskell lack a literal Data.Map constructor syntax?

I'm still new to Haskell (learning it on and off). I'm wondering why Haskell doesn't have a literal Data.Map constructor syntax, like the Map/Hash constructor syntax in Clojure or Ruby. Is there a reason? I thought that since Haskell does have a literal constructor syntax for Data.List, there should be one for Data.Map.
This question is not meant to be critical at all. I would just like to learn more about Haskell through the answers.
Unlike Clojure and Ruby, Haskell's finite maps are provided as libraries. This has tradeoffs: for example, as you noticed, there's no built-in syntax for finite maps; however, because it's a library, we can (and do) have many alternative implementations, and you as a programmer can choose the one that's most appropriate for your uses.
In addition to the answers already given (the "historical accident" one notwithstanding), I think there's also something to be said for the use of Data.Map in Haskell compared to Hash in Ruby or similar things; map-like objects in other languages tend to see a lot more use for general ad-hoc storage.
Whereas in Haskell you'd whip up a data definition in no time at all, creating a class in other languages tends to be somewhat heavy-weight, and so we find that even for data with a well-known structure, we'll just use a Hash or dict or similar. The fact that we have a direct syntax for doing so makes it all the more attractive an option.
Contrast to Lisp: using MAKE-HASH-TABLE and then repeatedly SETFing it is relatively annoying (similar to using Data.Map), so everything gets thrown into nested lists, instead—because it's what's convenient.
Similarly, I'm happy that the most convenient choice for storing data is creating new types to suit, and then I leave Data.Map to when I'm actually constructing a map or hash-table as an intrinsic component. There are a few cases where I think the syntax would be nice (usually just for smaller throw-away programs), but in general I don't miss it.
Actually I'm not sure why nobody has pointed it out in an answer (there's only sam boosalis' comment) but with OverloadedLists you can pretty much get literal syntax for Map and Set:
{-# LANGUAGE OverloadedLists #-}
import Data.Map
import Data.Set
foo :: Map Int Int
foo = [(1,2)]
bar :: Set Int
bar = [1]
from there it's only one more step to get even nicer looking maps, for example:
a =: b = (a,b)
ages :: Map String Int
ages = [ "erik" =: 30
, "john" =: 45
, "peter" =: 21 ]
Although I personally prefer explicit over implicit, so unless I'm building a DSL, I'd still stick to fromList and (foo, bar) — Haskell is about the big wins not the small ones.
Haskell has special syntax for lists because in a lazy functional language they more or less take the place of loop control structures in imperative languages. So they're much more important than Map in the grand scheme.
Also, I know you were referring to [1,2,3] when you said "list syntax", but I wanted to add that list constructor syntax could almost be implemented in haskell-98, in that type constructors can be infix when they start with :, e.g.
data Pair = Int :-- Int
So the list constructor : is just a slight special case of this general syntax rule, which is pretty elegant. Some people miss that.
Haskell does have a Map constructor, but it is "hidden" (like a private method in an object oriented paradigm). You are encouraged to use "public" constructors, such as empty, singleton or fromList.
However, if you inspect the code, available at https://hackage.haskell.org/package/containers-0.4.0.0/docs/src/Data-Map.html , you get the following definition
data Map k a = Tip
| Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a)
You can use the Tip and Bin constructors, but that is not recommended.

Resources