Mixing ByteString parsing and network IO in Haskell

Background
I'm trying to write a client for a binary network protocol.
All network operations are carried out over a single TCP connection, so in that sense
the input from the server is a continuous stream of bytes.
At the application layer, however, the server conceptually sends a packet on the
stream, and the client keeps reading until it knows the packet has been received
in its entirety, before sending a response of its own.
A lot of the effort needed to make this work involves parsing and generating
binary data, for which I'm using the Data.Serialize module.
The problem
The server sends me a "packet" on the TCP stream.
The packet is not necessarily terminated by a newline, nor is it of a predetermined
size.
It does consist of a predetermined number of fields, and fields generally begin
with a 4 byte number describing the length of that field.
With some help from Data.Serialize, I already have the code to parse a ByteString
version of this packet into a more manageable type.
I'd love to be able to write some code with these properties:
1. The parsing is only defined once, preferably in my Serialize instance(s). I'd rather not do extra parsing in the IO monad to read the correct number of bytes.
2. When I try to parse a given packet and not all the bytes have arrived yet, lazy IO will just wait for the extra bytes to arrive.
3. Conversely, when I try to parse a given packet and all its bytes have arrived, the IO does not block anymore. That is, I want to read just enough of the stream from the server to parse my type and form a response to send back. If the IO blocks even after enough bytes have arrived to parse my type, then the client and server will become deadlocked, each waiting for more data from the other.

After I send my own response, I can repeat the process by parsing the next type of packet I expect from the server.
So in brief,
is it possible to leverage my current ByteString parsing code in combination
with lazy IO to read exactly the right number of bytes off the network?
What I've Tried
I tried to use lazy ByteStrings in combination with my Data.Serialize instance, like so:
import Network
import System.IO
import qualified Data.ByteString.Lazy as L
import Data.Serialize
data MyType
instance Serialize MyType
main = withSocketsDo $ do
  h <- connectTo server port
  hSetBuffering h NoBuffering
  inputStream <- L.hGetContents h
  let Right parsed = decodeLazy inputStream :: Either String MyType
  -- Then use parsed to form my own response, then wait for the server reply...
This seems to fail mostly on point 3 above: it stays blocked even after a sufficient
number of bytes have arrived to parse MyType. I strongly suspect this is because
ByteStrings are read a given block size at a time, and L.hGetContents is waiting for the rest of this block to arrive. While reading a whole block at a time makes for efficient reads from disk, it seems to be getting in my way when I want to read just enough bytes to parse my data.

Something is wrong with your parser; it is too eager. Most likely it needs the next byte after the message for some reason. hGetContents from bytestring doesn't block waiting for a whole chunk; it uses hGetSome internally.
I created a simple test case. The server sends "hello" every second:
import Control.Concurrent
import System.IO
import Network
port :: Int
port = 1234
main :: IO ()
main = withSocketsDo $ do
  s <- listenOn $ PortNumber $ fromIntegral port
  (h, _, _) <- accept s
  let loop :: Int -> IO ()
      loop 0 = return ()
      loop i = do
        hPutStr h "hello"
        threadDelay 1000000
        loop $ i - 1
  loop 5
  sClose s
The client reads the whole contents lazily:
import qualified Data.ByteString.Lazy.Char8 as BSL
import System.IO
import Network

port :: Int
port = 1234

main :: IO ()
main = withSocketsDo $ do
  h <- connectTo "localhost" $ PortNumber $ fromIntegral port
  bs <- BSL.hGetContents h
  BSL.putStrLn bs
  hClose h
If you try to run both of them, you'll see the client print "hello" every second. So the network subsystem is OK; the issue is somewhere else, most likely in your parser.
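To get property 3 without relying on lazy IO at all, cereal's incremental interface can drive the reads directly: the parsing stays defined once in the Serialize instance, and IO blocks only while the parser still wants input. A sketch (hGetSerialized is a made-up name, and the 4096 chunk size is arbitrary):

```haskell
import qualified Data.ByteString as B
import Data.Serialize
import System.IO

-- Feed network chunks to cereal's incremental parser until it either
-- finishes or fails. hGetSome returns as soon as *any* bytes are available,
-- so we never wait for a full block once the parser is satisfied.
hGetSerialized :: Serialize a => Handle -> IO (Either String a)
hGetSerialized h = go (runGetPartial get)
  where
    go k = do
      chunk <- B.hGetSome h 4096
      case k chunk of
        Fail err _leftover -> return (Left err)
        Done v _leftover   -> return (Right v)  -- _leftover belongs to the next packet
        Partial k'         -> go k'
```

In a real client the `_leftover` bytes in `Done` must be carried over to the next call, since hGetSome may read past the packet boundary.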

Related

HTTP manager shared state corruption when sending wrong length for stream

Given a shared HTTP manager, it seems that if the requestBody is a requestBodySource and the wrong length is supplied for the request body, then subsequent requests on the same HTTP manager will fail for about 20 seconds. There seems to be something about the interaction of shared state and GivesPopper that is causing this issue. Here is sample code that reproduces it: we use requestb.in for sending an upload with the wrong length, and then try to read another valid URL on requestb.in.
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit.Binary (sourceFile)
import Network.HTTP.Conduit
import Network.HTTP.Types
import qualified Data.ByteString.Lazy as LBS
import System.IO
import Control.Monad.Trans.Resource (runResourceT)
import Control.Concurrent.Async (async,waitCatch)
import Control.Exception (displayException)
main :: IO ()
main = do
  {- Set up a ResourceT region with an available HTTP manager. -}
  httpmgr <- newManager tlsManagerSettings
  httpmgr2 <- newManager tlsManagerSettings
  let file = "out" -- some byte contents with length > 1
  lenb <- System.IO.withFile file ReadMode hFileSize
  let inbytes = sourceFile file
  initReq <- parseUrl "http://requestb.in/saxbx3sa"
  putreq <- async $ runResourceT $ do
    let req = initReq { method = "POST",
                        -- let us send the wrong length in requestBodySource
                        requestBody = requestBodySource (fromIntegral $ lenb - 1) inbytes }
    resp <- httpLbs req httpmgr
    return (statusCode . responseStatus $ resp)
  putreqRes <- waitCatch putreq
  case putreqRes of
    Left e  -> print $ displayException e
    Right r -> print r
  getreq <- async $ runResourceT $ do
    -- Let us do a GET on a different resource to see if it works
    initReq <- parseUrl "http://requestb.in/1l15sz21"
    let req = initReq { method = "GET" }
    resp <- httpLbs req httpmgr
    return (statusCode . responseStatus $ resp)
  getreqRes <- waitCatch getreq
  case getreqRes of
    Left e  -> print $ displayException e
    Right r -> print r
Output - first bad upload goes through as HTTP 200, and subsequent GET request immediately causes HTTP 400 error:
*Main> main
200
"StatusCodeException (Status {statusCode = 400, statusMessage = \"Bad Request\"})
[(\"Date\",\"Wed, 29 Jun 2016 11:54:59 GMT\"),(\"Content-Type\",\"text/html\"),
(\"Content-Length\",\"177\"),(\"Connection\",\"close\"),(\"Server\",\"-nginx\"),
(\"CF-RAY\",\"-\"),(\"X-Response-Body-Start\",\"<html>\\r\\n<head><title>400 Bad
Request</title></head>\\r\\n<body bgcolor=\\\"white\\\">\\r\\n<center><h1>400 Bad
Request</h1></center>\\r\\n<hr><center>cloudflare-
nginx</center>\\r\\n</body>\\r\\n</html>\\r\\n\"),(\"X-Request-URL\",\"GET
http://requestb.in:80/saxbx3sa\")] (CJ {expose = []})"
Using a different http manager instead for GET request will return HTTP 200. So, shared state in http manager seems to be the problem here.
Has anyone else observed it? I went through github issues for HTTP Manager but haven't seen this reported. On wrong streaming length, the behavior shouldn't be to corrupt the HTTP manager as seems to happen here.
I have also simulated a source file for requestBodySource where the length is correct, but the source aborts midway due to an injected failure (to simulate network issues). In that case, there are no errors. So it seems there is just one case, sending the wrong length without any failures, that causes some kind of shared state to become corrupt here, which gets released within about 25 seconds.
If anyone has any insights into what is going on here, that will be very helpful. I have a workaround of enforcing the right streaming length. However, I would like to understand what is going on so that I can avoid running into this situation in production.
This is an issue with http-client as reported here. It leaves it up to the caller to make sure that the passed content length is correct. It is the shared connection to the server that seems to be in the bad state. Depending on the actual length vs expected length, the beginning of the next request might be treated as the end of the previous request body, causing the next request to be misinterpreted by the server.
This has been fixed and merged into the trunk through a pull request. The solution was to add a simple length validation.

How do I use Network.Connection.connectionGet in a blocking manner like Data.ByteString.Lazy.hGet?

I am in the midst of porting the amqp package from using GHC.IO.Handle to using Network.Connection.Connection. The motivation for this is to gain transparent SSL/TLS support to allow for encrypted AMQP communications from Haskell.
The trouble is that my Connection-based implementation doesn't work. I ran into some fairly surprising (to me) differences when I packet-traced the alternate implementations.
It became evident that Network.Connection.connectionGet and GHC.IO.Handle.hGet are very different (non-blocking vs blocking):
http://hackage.haskell.org/package/connection-0.1.3.1/docs/Network-Connection.html#v:connectionGet
http://hackage.haskell.org/package/bytestring-0.10.4.0/docs/Data-ByteString-Lazy.html#v:hGet
Network.Connection.connectionGet acts like GHC.IO.Handle.hGetNonBlocking.
I was replacing GHC.IO.Handle.hGet with Network.Connection.connectionGet thinking it was a drop-in replacement which it isn't.
How do I use Network.Connection.connectionGet in a blocking manner like Data.ByteString.Lazy.hGet?
I wouldn't call this a blocking vs non-blocking difference; that framing suggests async IO, which is a different concept from what is happening in these APIs.
The difference is that hGet, when you ask for x bytes, will keep reading and waiting until it gets that number of bytes from the connection OR the connection is closed, whereas connectionGet returns whatever bytes it can currently read from the connection buffer, which will be less than or equal to the requested count x.
You can make connectionGet behave like hGet using simple recursion, as shown below (the remaining byte count is x minus what has been read so far). Newer versions of the connection package also provide connectionGetExact, which does exactly this.
import qualified Network.Connection as Conn
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

-- ... intervening code

-- Read exactly x bytes, looping until connectionGet has delivered them all
-- (or the connection reports EOF with an empty result).
connectionGet' :: Conn.Connection -> Int -> IO BC.ByteString
connectionGet' conn x = do
  bs <- Conn.connectionGet conn x
  let remaining = x - BS.length bs
  if BS.length bs == 0 || remaining == 0
    then return bs
    else do
      next <- connectionGet' conn remaining
      return $ BC.append bs next
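The same accumulate-until-satisfied pattern can be stated independently of the connection package; in this sketch the `Int -> IO ByteString` argument stands in for `Conn.connectionGet conn` (or any primitive that may return a short read):

```haskell
import qualified Data.ByteString as BS

-- Turn a short-read primitive into an exact read: keep asking for the
-- remaining byte count until it is satisfied, or the source reports EOF
-- by returning an empty chunk.
readExact :: (Int -> IO BS.ByteString) -> Int -> IO BS.ByteString
readExact readSome = go []
  where
    go acc 0 = return (BS.concat (reverse acc))
    go acc k = do
      bs <- readSome k                         -- may deliver fewer than k bytes
      if BS.null bs
        then return (BS.concat (reverse acc))  -- EOF: return what we have
        else go (bs : acc) (k - BS.length bs)
```

Accumulating chunks in a reversed list and concatenating once at the end avoids the quadratic cost of repeated appends.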

Game server in Haskell

I'm using Network and Gloss for a game server in Haskell. It works fine, except that the client has to close for the server to receive the data it sent. I bet it's a case of laziness...
Minimalist server:
import Network
import System.IO
main = do
  sock <- listenOn (PortNumber (fromIntegral 12345))
  loop sock

loop sock = do
  (hIn, _, _) <- accept sock
  str <- hGetContents hIn
  print str
  loop sock
Minimalist client:
import Network
import System.IO
import Graphics.Gloss.Interface.IO.Game
main = playIO
  (InWindow "Test Multi" (500, 500) (500, 500))
  white
  60
  Nothing
  draw
  (\_ x -> return x)
  advance

draw Nothing  = return blank
draw (Just x) = return (Text (show x))

advance _ Nothing = do
  hOut <- connectTo "000.000.0.0" (PortNumber (fromIntegral 12345))
  hSetBuffering hOut NoBuffering
  hPutStr hOut "Hello!"
  return (Just hOut)
advance _ x = return x
I start the server, wait 10 seconds, then start the client, wait 15 seconds, see that nothing happens on the server, close the client, and then see "Hello!" suddenly appear on the server.
I would like "Hello!" to appear while the client is running, in an advance call, otherwise I can't make a multiplayer game (sob)!
However, if I change the client's code to
main = loop Nothing

loop x = do
  x' <- advance 0 x
  getLine

the server immediately shows "Hello!" while the client is waiting for my input.
I tried, as suggested in another question, to use bang patterns and hClose:
-- ...
!str <- hGetContents hIn
hClose hIn
-- ...
which makes the output appear immediately, without the client closing. That's great. But I plan to use bytestrings because the data I send to the server is serialized, so I import qualified Data.ByteString as B and change hGetContents to B.hGetContents, which makes the problem reappear.
The problem was indeed a case of laziness. hGetContents lazily reads the entire contents of the Handle, so it only finishes when the handle is closed, i.e. when the client aborts the connection. Instead, I used hGetLine, which returns the content each time it encounters a \n, which I use as an end-of-message tag.
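For binary data, where a payload byte may happen to equal \n, a length-prefixed frame is a more robust end-of-message marker than a newline. A hedged sketch (sendFrame/recvFrame are made-up names) using a 4-byte big-endian length header; note that strict B.hGet, unlike the lazy variant, blocks until exactly n bytes have been read or EOF is reached:

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR)
import System.IO

-- Write one frame: a 4-byte big-endian length header, then the payload.
sendFrame :: Handle -> B.ByteString -> IO ()
sendFrame h payload = do
  let n   = B.length payload
      hdr = B.pack [fromIntegral (n `shiftR` s) | s <- [24, 16, 8, 0]]
  B.hPut h (hdr `B.append` payload)
  hFlush h

-- Read one frame: decode the 4-byte header, then read exactly that many bytes.
recvFrame :: Handle -> IO B.ByteString
recvFrame h = do
  hdr <- B.hGet h 4
  let n = B.foldl' (\acc w -> acc * 256 + fromIntegral w) 0 hdr
  B.hGet h n
```

Because the receiver always knows how many bytes remain, neither side ever over-reads into the next message, so the request/response lockstep cannot deadlock.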
I might be completely wrong, but isn't the problem hGetContents? Surely that should wait till the entire contents sent through your socket have arrived before the next line (print...) starts. hGetContents is designed to give you all the contents sent until the socket is closed. Something like hGetLine could terminate straight away and you can leave the socket open to send more data through later. Your client could then use a hPutStrLn instead of hPutStr.
The handle defaults to line-buffered output, which is why hPutStr (which doesn't write a line ending) doesn't output anything until the buffer is flushed. There are two ways you can solve this:
a) Call hFlush stdout manually any time you want to flush the output.
b) Use hSetBuffering to set the buffering to NoBuffering
All of those functions are found in the System.IO module.
Edit: Never mind, I just saw where you did that in the client. I retract my answer with apologies.
Probably you need to disable Nagle's algorithm.
Try this code (where sock is your connected Socket):
import Network.Socket
setSocketOption sock NoDelay 1

Why is Haskell/unpack messing with my bytes?

I've built a tiny UDP/protobuf transmitter and receiver. I've spent the morning trying to track down why the protobuf decoding was producing errors, only to find that it was the transmitter (Spoke.hs) which was sending incorrect data.
The code used unpack to turn Lazy.ByteStrings into Strings that the Network package will send. I found unpack in Hoogle. It may not be the function I'm looking for, but its description looks suitable: "O(n) Converts a ByteString to a String."
Spoke.hs produces the following output:
chris@gigabyte:~/Dropbox/haskell-workspace/hub/dist/build/spoke$ ./spoke
45
45
["a","8","4a","6f","68","6e","20","44","6f","65","10","d2","9","1a","10","6a","64","6f","65","40","65","78","61","6d","70","6c","65","2e","63","6f","6d","22","c","a","8","35","35","35","2d","34","33","32","31","10","1"]
While wireshark shows me that the data in the packet is:
0a:08:4a:6f:68:6e:20:44:6f:65:10:c3:92:09:1a:10:6a:64:6f:65:40:65:78:61:6d:70:6c:65:2e:63:6f:6d:22:0c:0a:08:35:35:35:2d:34:33:32:31:10
The length (45) is the same from Spoke.hs and Wireshark.
Wireshark is missing the last byte (value 0x01), and a run of bytes in the middle is different (and one byte longer in Wireshark):
"65","10","d2","9" in Spoke.hs vs 65:10:c3:92:09 in Wireshark.
As 0x10 is DLE, it struck me that there's probably some escaping going on, but I don't know why.
I have many years of trust in Wireshark and only a few tens of hours of Haskell experience, so I've assumed that it's the code that's at fault.
Any suggestions appreciated.
-- Spoke.hs:
module Main where
import Data.Bits
import Network.Socket -- hiding (send, sendTo, recv, recvFrom)
-- import Network.Socket.ByteString
import Network.BSD
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ProtocolBuffers.Header (defaultValue, uFromString)
import Text.ProtocolBuffers.WireMessage (messageGet, messagePut)
import Data.Char (ord, intToDigit)
import Numeric
import Data.Sequence ((><), fromList)
import AddressBookProtos.AddressBook
import AddressBookProtos.Person
import AddressBookProtos.Person.PhoneNumber
import AddressBookProtos.Person.PhoneType
data UDPHandle =
  UDPHandle { udpSocket  :: Socket,
              udpAddress :: SockAddr }

opensocket :: HostName      -- ^ Remote hostname, or localhost
           -> String        -- ^ Port number or name
           -> IO UDPHandle  -- ^ Handle to use for sending
opensocket hostname port = do
  -- Look up the hostname and port. Either raises an exception
  -- or returns a nonempty list. The first element in that list
  -- is supposed to be the best option.
  addrinfos <- getAddrInfo Nothing (Just hostname) (Just port)
  let serveraddr = head addrinfos
  -- Establish a socket for communication
  sock <- socket (addrFamily serveraddr) Datagram defaultProtocol
  -- Save the socket and server address in a handle
  return $ UDPHandle sock (addrAddress serveraddr)

john = Person {
    AddressBookProtos.Person.id = 1234,
    name  = uFromString "John Doe",
    email = Just $ uFromString "jdoe@example.com",
    phone = fromList [
      PhoneNumber {
        number = uFromString "555-4321",
        type'  = Just HOME
      }
    ]
  }

johnStr = B.unpack (messagePut john)

charToHex x = showIntAtBase 16 intToDigit (ord x) ""

main :: IO ()
main = do
  udpHandle <- opensocket "localhost" "4567"
  sent <- sendTo (udpSocket udpHandle) johnStr (udpAddress udpHandle)
  putStrLn $ show $ length johnStr
  putStrLn $ show sent
  putStrLn $ show $ map charToHex johnStr
  return ()
The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I would expect some byte difference between ByteString and String because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.
So you're probably falling foul of Unicode conversion here, or at least something's interpreting it as Unicode when the underlying data really isn't and that seldom ends well.
I think you'll want toString and fromString from utf8-string instead of unpack and pack. This blog post was very helpful for me.
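A minimal, self-contained demonstration (not the original code) of why the byte count changed: one Char with code point 0xD2, pushed through a UTF-8 encoding layer, comes out as the two bytes c3 92, exactly the substitution visible in the Wireshark capture:

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import System.IO

-- Write a String through a UTF-8-encoded Handle, then read back the raw
-- bytes that actually hit the file (standing in for the wire).
utf8Bytes :: String -> IO [Word8]
utf8Bytes s = do
  (path, h) <- openTempFile "." "utf8demo"
  hSetEncoding h utf8
  hPutStr h s
  hClose h
  bytes <- B.readFile path
  return (B.unpack bytes)
```

`utf8Bytes "\xd2"` yields [195, 146], i.e. c3:92; a Char8-based unpack/pack would have kept the single byte 0xd2. Any Char at or above 0x80 grows under UTF-8 encoding, which is why the captured packet was one byte longer than the ByteString.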

How can I make file I/O more transactional?

I'm writing CGI scripts in Haskell. When the user hits ‘submit’, a Haskell program runs on the server, updating (i.e. reading in, processing, overwriting) a status file. Reading then overwriting sometimes causes issues with lazy IO, as we may be able to generate a large output prefix before we've finished reading the input. Worse, users sometimes bounce on the submit button and two instances of the process run concurrently, fighting over the same file!
What's a good way to implement
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
where the function (‘update’) computes the new file contents from the old file contents? It is not safe to presume that ‘update’ is strict, but it may be presumed that it is total (robustness to partial update functions is a bonus). Transactions may be attempted concurrently, but no transaction should be able to update if the file has been written by anyone else since it was read. It's ok for a transaction to abort in case of competition for file access. We may assume a source of systemwide-unique temporary filenames.
My current attempt writes to a temporary file, then uses a system copy command to overwrite. That seems to deal with the lazy IO problems, but it doesn't strike me as safe from races. Is there a tried and tested formula that we could just bottle?
The most idiomatic unixy way to do this is with flock:
http://hackage.haskell.org/package/flock
http://swoolley.org/man.cgi/2/flock
Here is a rough first cut that relies on the atomicity of the underlying mkdir. It seems to fulfill the specification, but I'm not sure how robust or fast it is:
import Control.DeepSeq
import Control.Exception
import System.Directory
import System.IO
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
transactionalUpdate file upd = bracket acquire release update
  where
    acquire = do
      let lockName = file ++ ".lock"
      createDirectory lockName
      return lockName
    release = removeDirectory
    update _ = nonTransactionalUpdate file upd

nonTransactionalUpdate :: FilePath -> (String -> String) -> IO ()
nonTransactionalUpdate file upd = do
  h <- openFile file ReadMode
  s <- upd `fmap` hGetContents h
  s `deepseq` hClose h
  h' <- openFile file WriteMode
  hPutStr h' s
  hClose h'
I tested this by adding the following main and throwing a threadDelay in the middle of nonTransactionalUpdate:
main = do
  [n] <- getArgs
  transactionalUpdate "foo.txt" ((show n ++ "\n") ++)
  putStrLn $ "successfully updated " ++ show n
Then I compiled and ran a bunch of instances with this script:
#!/bin/bash
rm foo.txt
touch foo.txt
for i in {1..50}
do
./SO $i &
done
Each process printed a successful update message if and only if its number ended up in foo.txt; all the others printed the expected SO: foo.txt.notveryunique: createDirectory: already exists (File exists).
Update: You actually do not want to use unique names here; it must be a consistent name across the competing processes. I've updated the code accordingly.
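For the overwrite step itself, a tried and tested unixy formula is write-temp-then-rename: rename is atomic when source and target live on the same filesystem, so readers see either the old or the new contents, never a torn mixture. A sketch (atomicWriteFile is a made-up name); note this only addresses torn writes, so the mkdir lock above is still needed to serialize the read-modify-write:

```haskell
import System.Directory (renameFile)
import System.FilePath (takeDirectory)
import System.IO

-- Write the new contents to a temp file in the target's own directory
-- (keeping both on one filesystem so the rename is atomic), then rename
-- it over the original.
atomicWriteFile :: FilePath -> String -> IO ()
atomicWriteFile path contents = do
  (tmpPath, h) <- openTempFile (takeDirectory path) "update.tmp"
  hPutStr h contents
  hClose h
  renameFile tmpPath path
```

This also sidesteps the lazy IO problem: the old file is never opened for writing while its contents may still be lazily pending.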
