Referential transparency and mmap in Haskell

I was hoping to use System.INotify and System.IO.MMap together in order to watch for file modifications and then quickly perform diffs for sending patches over a network. However, the documentation for System.IO.MMap carries a couple of warnings about referential transparency. It states:
It is only safe to mmap a file if you know you are the sole user. Otherwise referential transparency may be or may be not compromised. Sadly semantics differ much between operating systems.
The values that System.IO.MMap returns are IO ByteString; surely when I use such a value with putStr I should expect a possibly different result each time? I assume the author means that the value could change during an IO operation such as putStr, causing a crash?
Edit: Come to think of it, I guess the answer to this part of the question is somewhat obvious...
If the value changes at any time after it is bound, that would be problematic:
do
  v <- mappedValue :: IO ByteString
  BS.putStr v  -- assuming: import qualified Data.ByteString as BS
  BS.putStr v  -- expects the same value of v everywhere
(End of edit.)
Shouldn't it be possible to acquire some kind of lock on the mapped region or on the file?
Alternatively, would it be possible to write a function copy :: IO ByteString -> IO ByteString that takes a snapshot of the file in its current state in a safe way?

I think the author means that the value can change even inside a lifted function that can view it as a plain ByteString (no IO).
The memory-mapped file is a region of memory. It doesn't make much sense to copy its contents back and forth, for performance reasons (otherwise one could just do plain old stream-based I/O), so the ByteString you are getting is live.
If you want a snapshot, just use stream-based I/O. That's what reading a file does: it creates a snapshot of the file in memory! An alternative might be the ForeignPtr interface, which does not carry the referential-transparency warning. I'm not familiar with ForeignPtr, so I cannot guarantee it will work, but it looks promising and I would investigate it.
You can also try calling map id on your ByteString, but it is not guaranteed that you will get a copy distinct from the original.
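For reference, here is a minimal sketch of such a snapshot using Data.ByteString.copy, which (unlike map id) does guarantee fresh storage. Note that it only narrows the race window rather than eliminating it, since the mapping can still change mid-copy:

import Control.Exception (evaluate)
import qualified Data.ByteString as B

-- The 'copy' function asked about above: force the bytes into fresh
-- storage while we still trust the mapping.
snapshot :: IO B.ByteString -> IO B.ByteString
snapshot mapped = do
  bs <- mapped
  evaluate (B.copy bs)  -- B.copy allocates new storage and copies into it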
Mandatory file locking, especially on Linux, is a mess that is better avoided. Advisory file locking is OK, except nobody is using it, so it effectively does not exist.

Related

Most efficient way of converting a Data.ByteString.Lazy to a CStringLen

I need to encode some data to JSON and then push it to syslog using hsyslog. The types of the two relevant functions are:
Aeson.encode :: a -> Data.ByteString.Lazy.ByteString

System.Posix.Syslog.syslog :: Maybe Facility
                           -> Priority
                           -> CStringLen
                           -> IO ()
What's the most efficient way (in speed and memory) to convert a Lazy.ByteString to a CStringLen? I found Data.ByteString.Unsafe, but it works only with strict ByteString, not Lazy.ByteString.
Shall I just use unsafeUseAsCStringLen . Data.String.Conv.toS and call it a day? Will it do the right thing with respect to efficiency?
I guess I would use Data.ByteString.Lazy.toStrict in place of toS, to avoid the additional package dependency.
Anyway, you won't find anything more efficient than:
unsafeUseAsCStringLen (toStrict lbs) $ \cstrlen -> ...
In general, toStrict is an "expensive" operation, because a lazy ByteString is generally made up of a number of "chunks", each a strict ByteString, not necessarily materialised in memory yet. The toStrict function must force all the chunks into memory and copy them into a single contiguous block, as a strict ByteString requires, before the no-copy unsafeUseAsCStringLen is applied.
However, toStrict handles a lazy ByteString that consists of a single chunk optimally without any copying.
In practice, aeson uses an efficient Data.ByteString.Builder to create the JSON, and if the JSON is reasonably small (less than 4k, I think), it will build a single-chunk lazy ByteString. In this case, toStrict is zero-copy, and unsafeUseAsCStringLen is zero copy, and the entire operation is basically free.
But note that, in your application, where you are passing the string to the syslogger, fretting about the efficiency of this operation is crazy. My guess would be that you'd need thousands of copy operations to even make a dent in the performance of the overall action.
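Putting the pieces together, a minimal sketch, assuming the hsyslog version whose syslog signature is quoted in the question:

import Data.Aeson (ToJSON, encode)
import Data.ByteString.Lazy (toStrict)
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import System.Posix.Syslog (Priority (Info), syslog)

-- Encode to JSON, flatten the chunks (usually zero-copy for small
-- payloads), and hand the buffer to syslog without further copying.
logJSON :: ToJSON a => a -> IO ()
logJSON x =
  unsafeUseAsCStringLen (toStrict (encode x)) $ \cstrlen ->
    syslog Nothing Info cstrlen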

What is the analogue of ConcurrentHashMap in Haskell?

Update: please bear in mind that I've just started learning Haskell.
Let's say we're writing an application with the following general functionality:
when starting, it gathers some data from an external source; the data is a set of complex structures containing lists, arrays, ints, strings, etc.;
when running, the application serves a web API (servlets) that provides access to the data.
Now, if the application were written in Java, we could use a static ConcurrentHashMap to store the data (represented as Java classes). During startup the app would fill the map with data, and then servlets could access it, providing an API to clients.
If the application were written in Erlang, we could use ETS/DETS to store the data (as native Erlang structures).
Now the question: what is the proper Haskell way for implementing such design?
It shouldn't be a DB; it should be some sort of lightweight in-memory store that can hold complex structures (native Haskell structures) and that is accessible from different threads (servlets, to speak in Java-world terms). In Haskell there are no static global vars as in Java, and no ETS/OTP as in Erlang, so how do you do this the right way (without using external solutions like Redis)?
Thanks
Update: another important part of the question: since Haskell doesn't (?) have 'global static' variables, what would be the right way to implement this globally accessible data-keeping object (say, an "stm-containers" map)? Should I initialize it somewhere in the 'main' function and then just pass it to every REST API handler? Or is there another, more correct way?
It's not clear from your question whether the client API will provide ways of mutating the data.
If not (i.e., the API is only about querying), then any immutable data structure will suffice, since one beauty of immutable data is that it can be accessed safely from multiple threads, with the certainty that it can't change out from under you. There's no need for the overhead of locks or other concurrency strategies. You simply construct the immutable data during initialisation and then query it. For this, consider a package like "unordered-containers".
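A minimal sketch of the query-only case, assuming hypothetical Key and Record types and the HashMap from "unordered-containers":

import qualified Data.HashMap.Strict as HM

type Key = String     -- hypothetical stand-ins for the real types
type Record = String

-- Built once at startup; being immutable, the table can be handed to
-- any number of server threads and queried without locks.
loadTable :: IO (HM.HashMap Key Record)
loadTable = pure (HM.fromList [("greeting", "hello")])

lookupRecord :: HM.HashMap Key Record -> Key -> Maybe Record
lookupRecord table k = HM.lookup k table

Building the table in main and passing it to each handler, as the question's update suggests, is exactly the idiomatic approach.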
If your API will also be mutating the data, then you will need mutable data structures that are optimised for concurrency. "stm-containers" is one package that provides them.
First off, I'm going to assume you mean it needs to be available to multiple threads, not multiple processes. (The difference being that threads share memory, processes do not.) If that assumption is wrong, much of your question doesn't make sense.
So, the first important point: Haskell has mutable data structures. They can easily be shared between threads. Here's a small example:
import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
    v <- newMVar 0 :: IO (MVar Int)
    _ <- forkIO . forever $ do
        x <- takeMVar v
        putMVar v $! x + 1
    forM_ [1..10] $ \_ -> do
        x <- readMVar v
        threadDelay 100
        print x
Note the use of ($!) when putting the value in the MVar. MVars don't enforce that their contents are evaluated. There's some subtlety in making sure everything works properly. You will get lots of space leaks until you understand Haskell's evaluation model. That's part of why this sort of thing is usually done in a library that handles all those details.
Given this, the first pass approach is to just store a map of some sort in an MVar. Unless it's under a lot of contention, that actually has pretty good performance properties.
When it is under contention, you have a good fallback secondary approach, especially when using a hash map. That's striping. Instead of storing one map in one MVar, use N maps in N MVars. The first step in a lookup is using the hash to determine which of the N MVars to look in.
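A minimal sketch of striping, assuming hypothetical names and Data.Map stripes guarded by MVars:

import Control.Concurrent.MVar
import Data.Hashable (Hashable, hash)
import qualified Data.Map.Strict as Map
import qualified Data.Vector as V

-- N independent maps, each behind its own MVar, so operations on
-- different stripes never contend.
newtype Striped k v = Striped (V.Vector (MVar (Map.Map k v)))

newStriped :: Int -> IO (Striped k v)
newStriped n = Striped <$> V.replicateM n (newMVar Map.empty)

-- Step one of every operation: hash the key to pick a stripe.
stripeFor :: Hashable k => Striped k v -> k -> MVar (Map.Map k v)
stripeFor (Striped ss) k = ss V.! (hash k `mod` V.length ss)

insert :: (Hashable k, Ord k) => k -> v -> Striped k v -> IO ()
insert k v s = modifyMVar_ (stripeFor s k) (\m -> pure $! Map.insert k v m)

lookup' :: (Hashable k, Ord k) => k -> Striped k v -> IO (Maybe v)
lookup' k s = Map.lookup k <$> readMVar (stripeFor s k)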
There are fancy lock-free algorithms, which could be implemented using finer-grained mutable values. But in general, they are a lot of engineering effort for a few percent improvement in performance that doesn't really matter in most use cases.

Should I check for a change before writing an IORef?

There's code that reads an IORef and, based on some conditions and calculations, creates a new value, which it then writes back to that IORef. But there is a chance the value didn't change at all; the new value may be identical to the old.
What are the considerations regarding whether to check to see if the value is different before writing the IORef, or just write the IORef regardless?
Does writeIORef check to see if the value is changed before setting?
By checking first, could you avoid the write and save a little on performance possibly?
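For concreteness, the check-first variant being asked about would look something like this hypothetical helper:

import Control.Monad (unless)
import Data.IORef

-- Read, compare, and only write when the new value actually differs.
writeIORefIfChanged :: Eq a => IORef a -> a -> IO ()
writeIORefIfChanged ref new = do
  old <- readIORef ref
  unless (old == new) $ writeIORef ref new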
Does writeIORef check to see if the value is changed before setting?
No. writeIORef wraps writeSTRef, which is defined as
-- | Write a new value into an 'STRef'
writeSTRef :: STRef s a -> a -> ST s ()
writeSTRef (STRef var#) val = ST $ \s1# ->
    case writeMutVar# var# val s1# of { s2# ->
    (# s2#, () #) }
By checking first, could you avoid the write and save a little on performance possibly?
What are the considerations regarding whether to check to see if the value is different before writing the IORef, or just write the IORef regardless?
This is really contingent on the algorithm in question. What are you trying to optimize for? What is the frequency/ratio of reads to writes? What kind of data are you storing? How is it packed? What is the cost of an equality comparison for the data in question?
There exists a whole host of factors to be taken into account when determining whether or not you want to destructively update cells in-place: some algorithm specific, some depending on cache locality, others depending on the structure and form of the code GHC generates. As such, it's exceedingly difficult to answer your question.
A quote from Donald Knuth:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Unless you're at the stage where you're trying to eke out every modicum of performance from some well-understood implementation, you're probably better off picking the path that is
simplest to implement
easiest to reason about
and getting on with it. If you are at the stage where you'd like to tweak your program, I'd suggest learning to read GHC's human-readable generated output (Core), as you'd then be in the position to make these sorts of decisions (on a very granular level) on a per-program basis.
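For example, a common way to look at the Core is to compile with GHC's dump flags (module name hypothetical):

ghc -O2 -ddump-simpl -dsuppress-all Main.hs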

Why does Haskell not have an I Monad (for input only, unlike the IO monad)?

Conceptually, it seems that a computation that performs output is very different from one that performs input only. The latter is, in one sense, much purer.
I, for one, would like to have a way to separate the input only parts of my programme from the ones that might actually write something out.
So, why is there no input only Monad?
Any reason why it wouldn't work to have an I monad (and an O Monad, which could be combined into the IO Monad)?
Edit: I mostly meant input as reading files, not interacting with the user. This is also my use case, where I can assume that input files do not change during the execution of the programme (otherwise, it's fine to get undefined behaviour).
I disagree with bdonlan's answer. It's true that neither input nor output are more "pure" but they are quite different. It's quite valid to critique IO as the single "sin bin" where all effects get crammed together, and it does make ensuring certain properties harder. For example, if you have many functions that you know only read from certain memory locations, and which could never cause those locations to be altered, it would be nice to know that you can reorder their execution. Or if you have a program that uses forkIO and MVars, it would be nice to know, based on its type, that it isn't also reading /etc/passwd.
Furthermore, one can compose monadic effects in fashions besides just stacked transformers. You can't do this with all monads (just free monads), but for a case like this that's all you really need. The iospec package, for example, provides a pure specification of IO. It doesn't separate reading and writing, but it does separate them from, e.g., STM, MVars, forkIO, and so forth.
http://hackage.haskell.org/package/IOSpec
The key ideas for how you can combine the different monads cleanly are described in the Data Types à la Carte paper (great reading, very influential, can't recommend it enough).
The 'input' side of the IO monad is just as much output as it is input. If you consume a line of input, the fact that you consumed that input is communicated to the outside, and it also has to be recorded as impure state (i.e., you don't consume the same line again later); it's just as much an output operation as a putStrLn. Additionally, input operations must be ordered with respect to output operations; this again limits how much you can separate the two.
If you want a pure read-only monad, you should probably use the reader monad instead.
That said, you seem to be a bit confused about what combining monads can do. While you can indeed combine two monads (assuming one is a monad transformer) and get some kind of hybrid semantics, you have to be able to run the result. That is, even if you could define an IT (OT Identity) r, how do you run it? You have no root IO monad in this case, so main must be a pure function. Which would mean you'd have main = runIdentity . runOT . runIT $ .... Which is nonsense, since you're getting impure effects from a pure context.
In other words, the type of the IO monad has to be fixed. It can't be a user-selectable transformed type, because its type is nailed down into main. Sure, you could call it I (O Identity), but you don't gain anything; O (I Identity) would be a useless type, as would be I [] or O Maybe, because you'd never be able to run any of these.
Of course, if IO is left as the fundamental IO monad type, you could define routines like:
runI :: I Identity r -> IO r
This works, but again, you can't have anything underneath this I monad very easily, and you're not gaining much from this complexity. What would it even mean to have an Output monad transformed over a List base monad, anyway?
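For illustration, here is a sketch of that restricted-interface idea as a plain newtype over IO rather than a transformer: hide the constructor and export only input operations (all names hypothetical):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

newtype I a = I { runInput :: IO a }
  deriving (Functor, Applicative, Monad)

-- The only exported ways to build an 'I' action are readers:
getLineI :: I String
getLineI = I getLine

readFileI :: FilePath -> I String
readFileI = I . readFile

This gives you a type that promises "input only" without changing how main works: an I action is still run from IO, via runInput.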
When you obtain input, you cause side-effects that changes both the state of the outside world (the input is consumed) and your program (the input is used). When you output, you cause side-effects that only change the state of the outside world (output is produced); the act of outputting itself does not change the state of your program. So you might actually say that O is more "pure" than I.
Except that output does actually change the execution state of your program (It won't repeat the same output operation over and over; it has to have some sort of state change in order to move on). It all depends on how you look at it. But it's so much easier to lump the dirtiness of input and output into the same monad. Any useful program will both input and output. You can categorize the operations you use into one or the other, but I'm not seeing a convincing reason to employ the type system for the task.
Either you're messing with the outside world or you're not.
Short answer: IO is not I/O.
Other folks have longer answers if you like.
I think the division between pure and impure code is somewhat arbitrary. It depends on where you put the barrier. Haskell's designers decided to clearly separate pure functional part of the language from the rest.
So we have the IO monad, which incorporates all possible effects (as different as disk reads/writes, networking, and memory access), and the language enforces a clear division by means of the return type. This induces a way of thinking that divides everything into pure and impure.
If information security is a concern, it would be quite natural to separate reading and writing. But for Haskell's initial goal, being a standard lazy pure functional language, that was overkill.

How do laziness and I/O work together in Haskell?

I'm trying to get a deeper understanding of laziness in Haskell.
I was imagining the following snippet today:
data Image = Image { name :: String, pixels :: String }

image :: String -> IO Image
image path = Image path <$> readFile path
The appeal here is that I could simply create an Image instance and pass it around; if I need the image data it would be read lazily - if not, the time and memory cost of reading the file would be avoided:
main :: IO ()
main = do
    image <- image "file"
    print $ length $ pixels image
But is that how it actually works? How is laziness compatible with IO? Will readFile be called regardless of whether I access pixels image or will the runtime leave that thunk unevaluated if I never refer to it?
If the image is indeed read lazily, then isn't it possible I/O actions could occur out of order? For example, what if immediately after calling image I delete the file? Now the putStrLn call will find nothing when it tries to read.
How is laziness compatible with I/O?
Short answer: It isn't.
Long answer: IO actions are strictly sequenced, for pretty much the reasons you're thinking of. Any pure computations done with the results can be lazy, of course; for instance if you read in a file, do some processing, and then print out some of the results, it's likely that any processing not needed by the output won't be evaluated. However, the entire file will be read, even parts you never use. If you want lazy I/O, you have roughly two options:
Roll your own explicit lazy-loading routines and such, like you would in any strict language. Seems annoying, granted, but on the other hand Haskell makes a fine strict, imperative language. If you want to try something new and interesting, try looking at Iteratees.
Cheat like a cheating cheater. Functions such as hGetContents will do lazy, on-demand I/O for you, no questions asked. What's the catch? It (technically) breaks referential transparency. Pure code can indirectly cause side effects, and funny things can happen involving ordering of side effects if your code is really convoluted. hGetContents and friends are implemented using unsafeInterleaveIO, which is... exactly what it says on the tin. It's nowhere near as likely to blow up in your face as using unsafePerformIO, but consider yourself warned.
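Applied to the question's Image example, a strict sketch (swapping the lazy Prelude readFile for bytestring's eager one) makes the read happen at a well-defined point:

import qualified Data.ByteString.Char8 as BS

data Image = Image { name :: String, pixels :: BS.ByteString }

-- BS.readFile reads the entire file right here, at a predictable
-- point in the IO sequence, before the Image is returned.
image :: FilePath -> IO Image
image path = Image path <$> BS.readFile path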
Lazy I/O breaks Haskell's purity. The results from readFile are indeed produced lazily, on demand. The order in which I/O actions occur is not fixed, so yes, they could occur "out of order". The problem of deleting the file before pulling the pixels is real. In short, lazy I/O is a great convenience, but it's a tool with very sharp edges.
Real World Haskell has a lengthy treatment of lazy I/O and goes over some of the pitfalls.