Haskell IO russian symbols - haskell

I an trying to process a file which contains russian symbols. When reading and after writing some text to the file I get something like:
\160\192\231\229\240\225\224\233\228\230\224\237
How can I get normal symbols?

If you are getting strings with backslashes and numbers in, then it sounds like you might be calling "print" when you want to call "putStr".

If you deal with Unicode, you might try utf8-string package
import System.IO hiding (hPutStr, hPutStrLn, hGetLine, hGetContents, putStrLn)
import System.IO.UTF8
import Codec.Binary.UTF8.String (utf8Encode)
main = System.IO.UTF8.putStrLn "Вася Пупкин"
However it didn't work well in my windows CLI garbling the output because of codepage. I expect it to work fine on other Unix-like systems if your locale is set correctly. However writing to file should be successfull on all systems.
UPDATE:
An example on encoding package usage.

I have got success.
{-# LANGUAGE ImplicitParams #-}
import Network.HTTP
import Text.HTML.TagSoup
import Data.Encoding
import Data.Encoding.CP1251
import Data.Encoding.UTF8
openURL x = do
x <- simpleHTTP (getRequest x)
fmap (decodeString CP1251) (getResponseBody x)
main :: IO ()
main = do
tags <- fmap parseTags $ openURL "http://www.trade.su/search?ext=1"
let TagText r = partitions (~== "<input type=checkbox>") tags !! 1 !! 4
appendFile "out" r

Related

Write surrogate pairs to file using Haskell

This is the code I have:
import qualified System.IO as IO
writeSurrogate :: IO ()
writeSurrogate = do
IO.writeFile "/home/sibi/surrogate.txt" ['\xD800']
Executing the above code gives error:
text-tests: /home/sibi/surrogate.txt: commitBuffer: invalid argument (invalid character)
The reason being is that it is prevented by the GHC itself as they are surrogate code points: https://github.com/ghc/ghc/blob/21f0f56164f50844c2150c62f950983b2376f8b6/libraries/base/GHC/IO/Encoding/Failure.hs#L114
I want to write some test files which needs to have that data. Right now, I'm using Python to achieve what I want - But I would love to know if there is an way (workaround using Haskell) to achieve this.
Sure, just write the bytes you want:
import Data.ByteString as BS
main = BS.writeFile "surrogate.txt" (pack [0xd8, 0x00])

Read large lines in huge file without buffering

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file in memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory. I read later that eventually loads the whole file.
I also tried using pipes-text with folds and view lines:
s <- Pipes.sum $
folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s
to just count the number of lines and it seems to be doing some wonky stuff "hGetChunk: invalid argument (invalid byte sequence)" and it takes 11 minutes where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1GB)
I'm really open to any suggestions, can't find much searching except for newbie readLine how-tos.
Thanks!
The following code uses Conduit, and will:
UTF8-decode standard input
Run the lineC combinator as long as there is more data available
For each line, simply yield the value 1 and discard the line content, without ever read the entire line into memory at once
Sum up the 1s yielded and print it
You can replace the yield 1 code with something which will do processing on the individual lines.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit
main :: IO ()
main = (runConduit
$ stdinC
.| decodeUtf8C
.| peekForeverE (lineC (yield (1 :: Int)))
.| sumC) >>= print
This is probably easiest as a fold over the decoded text stream
{-#LANGUAGE BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT
main = do
n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
print n
It takes about 14% longer than wc -l for the file I produced which was just long lines of commas and digits. IO should properly be done with Pipes.ByteString as the documentation says, the rest is conveniences of various sorts.
You can map an attoparsec parser over each line, distinguished by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases and this might not be a great idea over a 1 gigabyte chunk of text. If there is a repeated figure on each line (e.g. word separated numbers) you can use Pipes.Attoparsec.parsed to stream them.

Data.ByteString.Lazy.Internal.ByteString to string?

Trying to write a module which returns the external IP address of my computer.
Using Network.Wreq get function, then applying a lense to obtain responseBody, the type I end up with is Data.ByteString.Lazy.Internal.ByteString. As I want to filter out the trailing "\n" of the result body, I want to use this for a regular expression subsequently.
Problem: That seemingly very specific ByteString type is not accepted by regex library and I found no way to convert it to a String.
Here is my feeble attempt so far (not compiling).
{-# LANGUAGE OverloadedStrings #-}
module ExtIp (getExtIp) where
import Network.Wreq
import Control.Lens
import Data.BytesString.Lazy
import Text.Regex.Posix
getExtIp :: IO String
getExtIp = do
r <- get "http://myexternalip.com/raw"
let body = r ^. responseBody
let addr = body =~ "[^\n]*\n"
return (addr)
So my question is obviously: How to convert that funny special ByteString to a String? Explaining how I can approach such a problem myself is also appreciated. I tried to use unpack and toString but have no idea what to import to get those functions if they exist.
Being a very sporadic haskell user, I also wonder if someone could show me the idiomatic haskell way of defining such a function. The version I show here does not account for possible runtime errors/exceptions, after all.
Short answer: Use unpack from Data.ByteString.Lazy.Char8
Longer answer:
In general when you want to convert a ByteString (of any variety) to a String or Text you have to specify an encoding - e.g. UTF-8 or Latin1, etc.
When retrieving an HTML page the encoding you are suppose to use may appear in the Content-type header or in the response body itself as a <meta ...> tag.
Alternatively you can just guess at what the encoding of the body is.
In your case I presume you are accessing a site like http://whatsmyip.org and you only need to parse out your IP address. So without examining the headers or looking through the HTML, a safe encoding to use would be Latin1.
To convert ByteStrings to Text via an encoding, have a look at the functions in Data.Text.Encoding
For instance, the decodeLatin1 function.
I simply do not understand why you insist on using Strings, when you have already a ByteString at hand that is the faster/more efficient implementation.
Importing regex gives you almost no benefit - for parsing an ip-address I would use attoparsec which works great with ByteStrings.
Here is a version that does not use regex but returns a String - note I did not compile it for I have no haskell setup where I am right now.
{-# LANGUAGE OverloadedStrings #-}
module ExtIp (getExtIp) where
import Network.Wreq
import Control.Lens
import Data.ByteString.Lazy.Char8 as Char8
import Data.Char (isSpace)
getExtIp :: IO String
getExtIp = do
r <- get "http://myexternalip.com/raw"
return $ Char8.unpack $ trim (r ^. responseBody)
where trim = Char8.reverse . (Char8.dropWhile isSpace) . Char8.reverse . (Char8.dropWhile isSpace)

Adding the possibility to write a AST-file to my (rail-)compiler

I'm writing rail-compiler (rail is an esoteric language) in Haskell and I get some problems within the main-function of my mainmodule.
1) I want my program to ask wheter I want to run the compiling-pipeline or simply stop after the lexer and write the AST to a file so another compiler can deal with my AST (Abstract Synatx Tree). Here is my program:
module Main (
main -- main function to run the program
)
where
-- imports --
import InterfaceDT as IDT
import qualified Testing as Test
import qualified Preprocessor as PreProc
import qualified Lexer
import qualified SyntacticalAnalysis as SynAna
import qualified SemanticalAnalysis as SemAna
import qualified IntermediateCode as InterCode
import qualified CodeOptimization as CodeOpt
import qualified Backend
-- functions --
main :: IO()
main = do putStr "Enter inputfile (path): "
inputfile <- getLine
input <- readFile inputfile
putStr "Enter outputfile (path): "
outputfile <- getLine
input <- readFile inputfile
putStr "Only create AST (True/False): "
onlyAST <- getLine
when (onlyAST=="True") do putStrLn "Building AST..."
writeFile outputfile ((Lexer.process . PreProc.process) input)
when (onlyAST=="False") do putStrLn ("Compiling "++inputfile++" to "++outputfile)
writeFile outputfile ((Backend.process . CodeOpt.process . InterCode.process . SemAna.process . SynAna.process . Lexer.process . PreProc.process) input)
I get an error in Line 21 (input <- readFile inputfile) caused by the <-. Why?
How should I do it?
2) Next thing is that I want to refactor the program in that way, that I can call it from the terminal with parameters like runhaskell Main(AST) (in that way it should just create the AST) or like runhaskell Main.hs (in that way it should do the whole pipeline).
I hope for your help!
For your error in (1), your program doesn't look syntactically incorrect at line 21 to me. However an error at <- would happen if that line were indented differently from the previous one. I suspect that you are having an indentation error due to mixing tabs and spaces in a way that looks correct in your editor but disagrees with Haskell's interpretation of tabs. The simplest recommendation is to always use spaces and never tabs.
You also have an extra copy of that line later, which you might want to remove.
I also suspect you may need to use hFlush stdin after your putStr's, for them to work as prompts.
For (2), I'd suggest using a library for proper command line argument and option parsing, such as System.Console.GetOpt which is included with GHC, or one of the fancier ones which you can find on Hackage.

Why is Haskell/unpack messing with my bytes?

I've built a tiny UDP/protobuf transmitter and receiver. I've spent the morning trying to track down why the protobuf decoding was producing errors, only to find that it was the transmitter (Spoke.hs) which was sending incorrect data.
The code used unpack to turn Lazy.ByteStrings into Strings that the Network package will send. I found unpack in Hoogle. It may not be the function I'm looking for, but its description looks suitable: "O(n) Converts a ByteString to a String."
Spoke.hs produces the following output:
chris#gigabyte:~/Dropbox/haskell-workspace/hub/dist/build/spoke$ ./spoke
45
45
["a","8","4a","6f","68","6e","20","44","6f","65","10","d2","9","1a","10","6a","64","6f","65","40","65","78","61","6d","70","6c","65","2e","63","6f","6d","22","c","a","8","35","35","35","2d","34","33","32","31","10","1"]
While wireshark shows me that the data in the packet is:
0a:08:4a:6f:68:6e:20:44:6f:65:10:c3:92:09:1a:10:6a:64:6f:65:40:65:78:61:6d:70:6c:65:2e:63:6f:6d:22:0c:0a:08:35:35:35:2d:34:33:32:31:10
The length (45) is the same from Spoke.hs and Wireshark.
Wireshark is missing the last byte (value Ox01) and a stream of central values is different (and one byte larger in Wireshark).
"65","10","d2","9" in Spoke.hs vs 65:10:c3:92:09 in Wireshark.
As 0x10 is DLE, it struck me that there's probably some escaping going on, but I don't know why.
I have many years of trust in Wireshark and only a few tens of hours of Haskell experience, so I've assumed that it's the code that's at fault.
Any suggestions appreciated.
-- Spoke.hs:
module Main where
import Data.Bits
import Network.Socket -- hiding (send, sendTo, recv, recvFrom)
-- import Network.Socket.ByteString
import Network.BSD
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ProtocolBuffers.Header (defaultValue, uFromString)
import Text.ProtocolBuffers.WireMessage (messageGet, messagePut)
import Data.Char (ord, intToDigit)
import Numeric
import Data.Sequence ((><), fromList)
import AddressBookProtos.AddressBook
import AddressBookProtos.Person
import AddressBookProtos.Person.PhoneNumber
import AddressBookProtos.Person.PhoneType
data UDPHandle =
UDPHandle {udpSocket :: Socket,
udpAddress :: SockAddr}
opensocket :: HostName -- ^ Remote hostname, or localhost
-> String -- ^ Port number or name
-> IO UDPHandle -- ^ Handle to use for logging
opensocket hostname port =
do -- Look up the hostname and port. Either raises an exception
-- or returns a nonempty list. First element in that list
-- is supposed to be the best option.
addrinfos <- getAddrInfo Nothing (Just hostname) (Just port)
let serveraddr = head addrinfos
-- Establish a socket for communication
sock <- socket (addrFamily serveraddr) Datagram defaultProtocol
-- Save off the socket, and server address in a handle
return $ UDPHandle sock (addrAddress serveraddr)
john = Person {
AddressBookProtos.Person.id = 1234,
name = uFromString "John Doe",
email = Just $ uFromString "jdoe#example.com",
phone = fromList [
PhoneNumber {
number = uFromString "555-4321",
type' = Just HOME
}
]
}
johnStr = B.unpack (messagePut john)
charToHex x = showIntAtBase 16 intToDigit (ord x) ""
main::IO()
main =
do udpHandle <- opensocket "localhost" "4567"
sent <- sendTo (udpSocket udpHandle) johnStr (udpAddress udpHandle)
putStrLn $ show $ length johnStr
putStrLn $ show sent
putStrLn $ show $ map charToHex johnStr
return ()
The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I would expect some byte difference between ByteString and String because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.
So you're probably falling foul of Unicode conversion here, or at least something's interpreting it as Unicode when the underlying data really isn't and that seldom ends well.
I think you'll want toString and fromString from utf8-string instead of unpack and pack. This blog post was very helpful for me.

Resources