Why is Haskell/unpack messing with my bytes?


I've built a tiny UDP/protobuf transmitter and receiver. I've spent the morning trying to track down why the protobuf decoding was producing errors, only to find that it was the transmitter (Spoke.hs) which was sending incorrect data.
The code used unpack to turn Lazy.ByteStrings into Strings that the Network package will send. I found unpack in Hoogle. It may not be the function I'm looking for, but its description looks suitable: "O(n) Converts a ByteString to a String."
Spoke.hs produces the following output:
chris@gigabyte:~/Dropbox/haskell-workspace/hub/dist/build/spoke$ ./spoke
45
45
["a","8","4a","6f","68","6e","20","44","6f","65","10","d2","9","1a","10","6a","64","6f","65","40","65","78","61","6d","70","6c","65","2e","63","6f","6d","22","c","a","8","35","35","35","2d","34","33","32","31","10","1"]
While wireshark shows me that the data in the packet is:
0a:08:4a:6f:68:6e:20:44:6f:65:10:c3:92:09:1a:10:6a:64:6f:65:40:65:78:61:6d:70:6c:65:2e:63:6f:6d:22:0c:0a:08:35:35:35:2d:34:33:32:31:10
The length (45) is the same from Spoke.hs and Wireshark.
Wireshark is missing the last byte (value 0x01), and a run of bytes in the middle differs (and is one byte longer in Wireshark):
"65","10","d2","9" in Spoke.hs vs 65:10:c3:92:09 in Wireshark.
As 0x10 is DLE, it struck me that there's probably some escaping going on, but I don't know why.
I have many years of trust in Wireshark and only a few tens of hours of Haskell experience, so I've assumed that it's the code that's at fault.
Any suggestions appreciated.
-- Spoke.hs:
module Main where
import Data.Bits
import Network.Socket -- hiding (send, sendTo, recv, recvFrom)
-- import Network.Socket.ByteString
import Network.BSD
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ProtocolBuffers.Header (defaultValue, uFromString)
import Text.ProtocolBuffers.WireMessage (messageGet, messagePut)
import Data.Char (ord, intToDigit)
import Numeric
import Data.Sequence ((><), fromList)
import AddressBookProtos.AddressBook
import AddressBookProtos.Person
import AddressBookProtos.Person.PhoneNumber
import AddressBookProtos.Person.PhoneType
data UDPHandle =
    UDPHandle {udpSocket :: Socket,
               udpAddress :: SockAddr}
opensocket :: HostName         -- ^ Remote hostname, or localhost
           -> String           -- ^ Port number or name
           -> IO UDPHandle     -- ^ Handle to use for logging
opensocket hostname port =
    do -- Look up the hostname and port. Either raises an exception
       -- or returns a nonempty list. First element in that list
       -- is supposed to be the best option.
       addrinfos <- getAddrInfo Nothing (Just hostname) (Just port)
       let serveraddr = head addrinfos
       -- Establish a socket for communication
       sock <- socket (addrFamily serveraddr) Datagram defaultProtocol
       -- Save off the socket, and server address in a handle
       return $ UDPHandle sock (addrAddress serveraddr)
john = Person {
    AddressBookProtos.Person.id = 1234,
    name = uFromString "John Doe",
    email = Just $ uFromString "jdoe@example.com",
    phone = fromList [
        PhoneNumber {
            number = uFromString "555-4321",
            type' = Just HOME
        }
    ]
}
johnStr = B.unpack (messagePut john)
charToHex x = showIntAtBase 16 intToDigit (ord x) ""
main::IO()
main =
    do udpHandle <- opensocket "localhost" "4567"
       sent <- sendTo (udpSocket udpHandle) johnStr (udpAddress udpHandle)
       putStrLn $ show $ length johnStr
       putStrLn $ show sent
       putStrLn $ show $ map charToHex johnStr
       return ()

The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I would expect some byte difference between ByteString and String because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.
So you're probably falling foul of Unicode conversion here, or at least something's interpreting it as Unicode when the underlying data really isn't and that seldom ends well.
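The byte difference in the question is consistent with a String round trip followed by UTF-8 encoding. Below is a minimal sketch, assuming the String-based sendTo encodes its argument as UTF-8 (this matches the Wireshark trace but is not confirmed here). The text package is used only to perform that encoding, and the names raw, asString, reencoded and hex are made up for illustration:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Numeric (showHex)

-- Three of the bytes from the question: 0x10, 0xd2, 0x09.
raw :: BS.ByteString
raw = BS.pack [0x10, 0xd2, 0x09]

-- Char8.unpack turns every byte into the Char with the same code point,
-- so 0xd2 becomes the character U+00D2.
asString :: String
asString = BC.unpack raw

-- Assumption: the String is UTF-8 encoded before it goes on the wire.
-- U+00D2 needs two bytes in UTF-8 (0xc3 0x92).
reencoded :: BS.ByteString
reencoded = TE.encodeUtf8 (T.pack asString)

-- Render each byte as hex, in the same style as charToHex in Spoke.hs.
hex :: BS.ByteString -> [String]
hex = map (\w -> showHex w "") . BS.unpack

main :: IO ()
main = do
  print (hex raw)        -- ["10","d2","9"]
  print (hex reencoded)  -- ["10","c3","92","9"]
```

If that is what is happening, it would also explain the missing last byte: the String has 45 characters, the encoded form has 46 bytes, and only 45 are sent. Sending the bytes through the ByteString-based functions in Network.Socket.ByteString (the import commented out in Spoke.hs) would avoid the String round trip altogether.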

I think you'll want toString and fromString from utf8-string instead of unpack and pack. This blog post was very helpful for me.

Related

How to save, append and read a List of tuple including Lists into a File using Data.Serialize and ByteString

Hello, I am having problems reading back a list of tuples of lists after saving and appending it to a file.
Saving to the file works without problems.
I am saving to the file with
import qualified Data.ByteString as BS
import qualified Data.Serialize as S (decode, encode)
import Data.Either
toFile path = do
    let a = take 1000 [100..] :: [Float]
    let b = take 100 [1..] :: [Float]
    BS.appendFile path $ S.encode (a,b)
and reading with
fromFile path = do
    bstr<-BS.readFile path
    let d = S.decode bstr :: Either String ([Float],[Float])
    return (Right d)
but reading from that file with fromFile only gives me one element, although I appended to the file multiple times.
Since I'm appending to the file, it should contain multiple elements, so I'm probably missing something like a map in my fromFile function, but I couldn't work out how to do it.
I appreciate any help or other solutions; using Data.Serialize and ByteString is not a must. Another possibility I thought of is JSON files with Data.Aeson, if I can't get it to work with Serialize.
Edit :
I realized that I made a mistake in the decoding type in fromFile
let d = S.decode bstr :: Either String ([Float],[Float])
it should be like this
let d = S.decode bstr :: Either String [([Float],[Float])]

The Problem In Brief
The default format used by serialize (or binary) encoding isn't trivially append-able.
The Problem (Longer)
You say you appended:
S.encode (a,b)
to the same file "multiple times". So the format of the file is now:
[ 64 bit length field | # floats encoded | 64 bit length field | # floats encoded ]
Repeated however many times you appended to the file. That is, each append will add new length fields and list of floats while leaving the old values in place.
After that you returned to read the file and decode some floats using, morally, S.decode <$> BS.readFile path. This will decode the first two lists of floats by first reading the length field (of the first time you wrote to the file) then the following floats and the second length field followed by its related floats. After reading the stated length worth of floats the decoder will stop.
It should now be clear that just because you appended more data does not make your encoding or decoding script look for any additional data. The default format used by serialize (or binary) encoding isn't trivially append-able.
Solutions
You mentioned switching to Aeson, but using JSON to encode instead of binary won't help you. Decoding two appended JSON strings like { "first": [1], "second": [2]}{ "first": [3], "second": [4]} is logically the same as your current problem. You have some unknown number of interleaved chunks of lists - just write a decoder to keep trying:
import qualified Data.ByteString as BS
import qualified Data.Serialize as S
fromFile :: FilePath -> IO (Either String ([Float],[Float]))
fromFile path = do
    bstr <- BS.readFile path
    let d = S.runGet getMultiChunks bstr :: Either String ([Float],[Float])
    return d
getMultiChunks :: S.Get ([Float],[Float])
getMultiChunks = go ([], [])
  where
    go (l,r) = do
        b <- S.isEmpty
        -- No bytes left: return everything accumulated so far.
        if b then pure (l,r)
             else do (lNext, rNext) <- S.get
                     go (l ++ lNext, r ++ rNext) -- inefficient
So we've written our own getter (untested) that checks whether any bytes remain and, if so, decodes another pair of lists of floats. Each time it decodes a new chunk, it appends that chunk to the lists accumulated so far (which is inefficient; use something like a dlist if you want it to be respectable).
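A self-contained variant of the same idea, as a sketch rather than a drop-in replacement: it collects each appended chunk as one list element instead of concatenating as it goes, and it builds the "file" contents in memory so it can be run without touching disk. The names Chunk, getChunks and fileContents are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import Data.Serialize (Get, encode, get, runGet)
import Data.Serialize.Get (isEmpty)

-- One appended record: the pair of float lists from the question.
type Chunk = ([Float], [Float])

-- Keep decoding chunks until the input is exhausted.
getChunks :: Get [Chunk]
getChunks = do
  done <- isEmpty
  if done
    then pure []
    else (:) <$> get <*> getChunks

-- Stand-in for a file that was appended to twice.
fileContents :: BS.ByteString
fileContents =
  encode (([1, 2], [3]) :: Chunk) <> encode (([4], [5, 6]) :: Chunk)

main :: IO ()
main = print (runGet getChunks fileContents)
```

Keeping the chunks separate preserves the boundaries between appends; if one flat pair of lists is wanted, the chunks can be combined afterwards with unzip and concat.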

how to read String from file and convert to Data.ByteString for use in Data.Serialize.decode

I need to analyze a log file that contains lines like:
2018-07-11T17:25:14.07565; [ZMQ] info; Sending message to; ROne (Addr {_unAddr = "tcp://127.0.0.1:10002"}); ## MSG ##; "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\t127.0.0.1\NUL\NUL\NUL\NUL\NUL\NUL'\DC3\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NAKtcp://127.0.0.1:10003\NUL\NUL\NUL\NUL\NUL\NUL\NUL#h_\214\169\213n\RS\212o\tu|\191\"\207GvP\224\167\222V*n6\140\236q\von0\148\240\nV\157\206\225\251A\240\DC4\US\228\140\253\255L\242\163\&0\134\\\241'\r\229W{WD\218\b\143\213go\194\161o\STXM\154X\195R\128\134eN=\144\153\129\242\133f\149\\:A\213\158C\DLE\b\NUL\NUL\NUL\NUL\NUL\NUL\NULG\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\t127.0.0.1\NUL\NUL\NUL\NUL\NUL\NUL'\DLE\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NAKtcp://127.0.0.1:10000\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL;\154\202\NUL"
The parts with \NUL were written using
Prelude.putStr (show msg)
where msg is a Data.ByteString created by an instance of Data.Serialize
I need to:
1. read the entire String line
2. select the ByteString portion
3. turn that portion into a ByteString
4. then give it to Data.Serialize.decode
It is not clear to me how to do step 3 : turn a String representation of a ByteString into a real ByteString.
My initial try was to give #2 to Data.ByteString.Char8.pack. But, when trying to decode
decode (BSC8.pack l) :: Either String SignedRPC
I get:
Left "too few bytes\nFrom:\tdemandInput\n\n"

Perhaps:
Data.ByteString Text.Read> readMaybe "\"\\NUL\"" :: Maybe ByteString
Just "\NUL"
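This works because show on a ByteString produces a Haskell string literal and the Read instance parses such a literal back, whereas Char8.pack keeps the backslash escapes as literal characters. A small sketch of the difference (the sample bytes are invented, not taken from the log, and the names sample, logged, viaRead and viaPack are hypothetical):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Text.Read (readMaybe)

-- Some binary data, including NUL bytes and bytes above 127.
sample :: BS.ByteString
sample = BS.pack [0, 0, 9, 49, 50, 55, 214, 169]

-- What ends up in the log: the escaped, quoted form produced by show.
logged :: String
logged = show sample

-- Reading the literal back recovers the original bytes.
viaRead :: Maybe BS.ByteString
viaRead = readMaybe logged

-- Packing the escaped text yields the characters of the escapes themselves.
viaPack :: BS.ByteString
viaPack = BC.pack logged

main :: IO ()
main = do
  putStrLn logged
  print (viaRead == Just sample)  -- True
  print (BS.length sample, BS.length viaPack)
```

Note that the string handed to readMaybe has to include the surrounding double quotes, exactly as show wrote them.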

How do I use Network.Connection.connectionGet in a blocking manner like Data.ByteString.Lazy.hGet?

I am in the midst of porting the amqp package from using GHC.IO.Handle to using Network.Connection.Connection. The motivation for this is to gain transparent SSL/TLS support to allow for encrypted AMQP communications from Haskell.
The trouble is that my Connection-based implementation doesn't work. I ran into some fairly surprising (to me) differences when I packet-traced the two implementations.
It became evident that Network.Connection.connectionGet and GHC.IO.Handle.hGet are very different (non-blocking vs blocking):
http://hackage.haskell.org/package/connection-0.1.3.1/docs/Network-Connection.html#v:connectionGet
http://hackage.haskell.org/package/bytestring-0.10.4.0/docs/Data-ByteString-Lazy.html#v:hGet
Network.Connection.connectionGet acts like GHC.IO.Handle.hGetNonBlocking.
I was replacing GHC.IO.Handle.hGet with Network.Connection.connectionGet thinking it was a drop-in replacement which it isn't.
How do I use Network.Connection.connectionGet in a blocking manner like Data.ByteString.Lazy.hGet?

I wouldn't call this a blocking vs non-blocking distinction, as that framing leads to the async IO concept, which is a different concept from what is happening in these APIs.
The difference is that when you ask hGet for x bytes, it waits until it has read that many bytes from the connection OR the connection is closed, whereas connectionGet returns whatever bytes it can read from the connection buffer, so the number of bytes returned is less than or equal to the requested x.
You can make connectionGet behave like hGet using a simple recursion as shown below. NOTE: below has been verified to work :)
import qualified Network.Connection as Conn
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
-- ... intervening code
connectionGet' :: Conn.Connection -> Int -> IO BC.ByteString
connectionGet' conn x = do
    bs <- Conn.connectionGet conn x
    -- Number of bytes still missing after this read.
    let diff = x - BS.length bs
    if BS.length bs == 0 || diff == 0
        then do
            return bs
        else do
            next <- connectionGet' conn diff
            return $ BC.append bs next
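The same loop can be written against any "read at most n bytes" action, which makes it easy to try out without a network connection. This is a sketch under that assumption: getExactly and mkSource are hypothetical names, and mkSource is a stand-in source that hands out at most 3 bytes per call. With a real connection, the first argument would be Conn.connectionGet conn.

```haskell
import qualified Data.ByteString as BS
import Data.IORef (atomicModifyIORef', newIORef)

-- Keep calling a "get at most n bytes" action until n bytes have arrived
-- or the source returns an empty chunk.
getExactly :: (Int -> IO BS.ByteString) -> Int -> IO BS.ByteString
getExactly getSome = go []
  where
    go acc n
      | n <= 0 = pure (BS.concat (reverse acc))
      | otherwise = do
          chunk <- getSome n
          if BS.null chunk
            then pure (BS.concat (reverse acc))
            else go (chunk : acc) (n - BS.length chunk)

-- Stand-in for a connection: returns at most 3 bytes per call.
mkSource :: BS.ByteString -> IO (Int -> IO BS.ByteString)
mkSource input = do
  ref <- newIORef input
  pure $ \n -> atomicModifyIORef' ref $ \rest ->
    let (out, rest') = BS.splitAt (min n 3) rest
    in (rest', out)

main :: IO ()
main = do
  src <- mkSource (BS.pack [1 .. 10])
  bs <- getExactly src 8
  print (BS.unpack bs)  -- [1,2,3,4,5,6,7,8]
```

Collecting the chunks in a list and concatenating once at the end avoids repeatedly copying the accumulated bytes on every short read.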

Mixing ByteString parsing and network IO in Haskell

Background
I'm trying to write a client for a binary network protocol.
All network operations are carried out over a single TCP connection, so in that sense
the input from the server is a continuous stream of bytes.
At the application layer, however, the server conceptually sends a packet on the
stream, and the client keeps reading until it knows the packet has been received
in its entirety, before sending a response of its own.
A lot of the effort needed to make this work involves parsing and generating
binary data, for which I'm using the Data.Serialize module.
The problem
The server sends me a "packet" on the TCP stream.
The packet is not necessarily terminated by a newline, nor is it of a predetermined
size.
It does consist of a predetermined number of fields, and fields generally begin
with a 4 byte number describing the length of that field.
With some help from Data.Serialize, I already have the code to parse a ByteString
version of this packet into a more manageable type.
I'd love to be able to write some code with these properties:
1. The parsing is only defined once, preferably in my Serialize instance(s).
   I'd rather not do extra parsing in the IO monad to read the correct number of bytes.
2. When I try to parse a given packet and not all the bytes have arrived yet, lazy
   IO will just wait for the extra bytes to arrive.
3. Conversely, when I try to parse a given packet and all its bytes have arrived
   IO does not block anymore. That is, I want to read just enough of the stream
   from the server to parse my type and form a response to send back. If the IO
   blocks even after enough bytes have arrived to parse my type, then the client
   and server will become deadlocked, each waiting for more data from the other.
4. After I send my own response, I can repeat the process by parsing the next type
   of packet I expect from the server.
So in brief,
is it possible to leverage my current ByteString parsing code in combination
with lazy IO to read exactly the right number of bytes off the network?
What I've Tried
I tried to use lazy ByteStrings in combination with my Data.Serialize instance, like
so:
import Network
import System.IO
import qualified Data.ByteString.Lazy as L
import Data.Serialize
data MyType
instance Serialize MyType
main = withSocketsDo $ do
    h <- connectTo server port
    hSetBuffering h NoBuffering
    inputStream <- L.hGetContents h
    let Right parsed = decodeLazy inputStream :: Either String MyType
    -- Then use parsed to form my own response, then wait for the server reply...
This seems to fail mostly on point 3 above: it stays blocked even after a sufficient
number of bytes have arrived to parse MyType. I strongly suspect this is because
ByteStrings are read with a given block size at a time, and L.hGetContents is
waiting for the rest of this block to arrive. While this property of reading an
efficient blocksize is helpful for making efficient reads from disk, it seems to be
getting in my way for reading just enough bytes to parse my data.

Something is wrong with your parser: it is too eager. Most likely it needs the next byte after the message for some reason. hGetContents from bytestring doesn't block waiting for the whole chunk. It uses hGetSome internally.
I created a simple test case. The server sends "hello" every second:
import Control.Concurrent
import System.IO
import Network
port :: Int
port = 1234
main :: IO ()
main = withSocketsDo $ do
    s <- listenOn $ PortNumber $ fromIntegral port
    (h, _, _) <- accept s
    let loop :: Int -> IO ()
        loop 0 = return ()
        loop i = do
            hPutStr h "hello"
            threadDelay 1000000
            loop $ i - 1
    loop 5
    sClose s
The client reads the whole contents lazily:
import qualified Data.ByteString.Lazy as BSL
import System.IO
import Network
port :: Int
port = 1234
main :: IO ()
main = withSocketsDo $ do
    h <- connectTo "localhost" $ PortNumber $ fromIntegral port
    bs <- BSL.hGetContents h
    BSL.putStrLn bs
    hClose h
If you run both of them, you'll see the client printing "hello" every second. So the network subsystem is OK; the issue is somewhere else, most likely in your parser.
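If the parser does turn out to need input beyond the end of the message, one option is to drive it incrementally with runGetPartial from Data.Serialize.Get, feeding it one chunk at a time and stopping as soon as it reports Done. The sketch below assumes a field is a 4-byte big-endian length followed by that many bytes (the question does not state the byte order). The names getField, feed and chunkSource are hypothetical, and chunkSource simulates the network with a fixed list of chunks. With a real handle, the chunk action could be something like BS.hGetSome h 4096.

```haskell
import qualified Data.ByteString as BS
import Data.IORef (atomicModifyIORef', newIORef)
import Data.Serialize.Get
  (Get, Result (..), getByteString, getWord32be, runGetPartial)

-- One length-prefixed field: 4-byte big-endian length, then the payload.
getField :: Get BS.ByteString
getField = do
  n <- getWord32be
  getByteString (fromIntegral n)

-- Feed chunks to the parser until it finishes or fails. Returns the parsed
-- value together with any bytes that were read but not consumed.
feed :: Get a -> IO BS.ByteString -> IO (Either String (a, BS.ByteString))
feed parser nextChunk = nextChunk >>= go . runGetPartial parser
  where
    go (Done x rest) = pure (Right (x, rest))
    go (Fail msg _)  = pure (Left msg)
    go (Partial k)   = nextChunk >>= go . k

-- Simulated network: hands out the given chunks one by one, then empty.
chunkSource :: [BS.ByteString] -> IO (IO BS.ByteString)
chunkSource chunks = do
  ref <- newIORef chunks
  pure $ atomicModifyIORef' ref $ \cs -> case cs of
    []         -> ([], BS.empty)
    (c : rest) -> (rest, c)

main :: IO ()
main = do
  next <- chunkSource (map BS.pack [[0, 0], [0, 3, 104], [105, 33, 99]])
  result <- feed getField next
  print result
```

The leftover bytes returned alongside the value belong to the next packet, so a real client would pass them to the next parse instead of discarding them.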

Haskell IO russian symbols

I am trying to process a file which contains Russian symbols. When reading the file, and after writing some text to it, I get something like:
\160\192\231\229\240\225\224\233\228\230\224\237
How can I get normal symbols?

If you are getting strings with backslashes and numbers in, then it sounds like you might be calling "print" when you want to call "putStr".
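A short sketch of the difference, using an invented sample string: print goes through show, which escapes every non-ASCII character as a decimal code point, while putStrLn writes the characters themselves. Setting the encoding of stdout to UTF-8 is an extra step added here so that the example does not depend on the locale.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- Sample text; any non-ASCII string behaves the same way.
name :: String
name = "Вася"

main :: IO ()
main = do
  hSetEncoding stdout utf8
  print name     -- uses show: "\1042\1072\1089\1103"
  putStrLn name  -- writes the characters as they are
```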

If you deal with Unicode, you might try the utf8-string package:
import System.IO hiding (hPutStr, hPutStrLn, hGetLine, hGetContents, putStrLn)
import System.IO.UTF8
import Codec.Binary.UTF8.String (utf8Encode)
main = System.IO.UTF8.putStrLn "Вася Пупкин"
However, it didn't work well in my Windows CLI: the output was garbled because of the codepage. I expect it to work fine on Unix-like systems if your locale is set correctly. Writing to a file should be successful on all systems.
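For file IO specifically, a sketch that relies only on base is to set the encoding on the handle explicitly, so the result does not depend on the locale or codepage. The file name names.txt is a placeholder chosen for this example.

```haskell
import System.IO

main :: IO ()
main = do
  -- Write the text as UTF-8 regardless of the system locale.
  withFile "names.txt" WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStrLn h "Вася Пупкин"
  -- Read it back with the same encoding and compare.
  withFile "names.txt" ReadMode $ \h -> do
    hSetEncoding h utf8
    line <- hGetLine h
    print (line == "Вася Пупкин")  -- True
```

The same hSetEncoding call works on stdout, although whether a terminal then displays the characters correctly still depends on the terminal itself.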
UPDATE:
An example on encoding package usage.

I have got success.
{-# LANGUAGE ImplicitParams #-}
import Network.HTTP
import Text.HTML.TagSoup
import Data.Encoding
import Data.Encoding.CP1251
import Data.Encoding.UTF8
openURL x = do
    x <- simpleHTTP (getRequest x)
    fmap (decodeString CP1251) (getResponseBody x)
main :: IO ()
main = do
    tags <- fmap parseTags $ openURL "http://www.trade.su/search?ext=1"
    let TagText r = partitions (~== "<input type=checkbox>") tags !! 1 !! 4
    appendFile "out" r
