Write surrogate pairs to file using Haskell

This is the code I have:
import qualified System.IO as IO

writeSurrogate :: IO ()
writeSurrogate = do
    IO.writeFile "/home/sibi/surrogate.txt" ['\xD800']
Executing the above code gives an error:
text-tests: /home/sibi/surrogate.txt: commitBuffer: invalid argument (invalid character)
The reason is that GHC itself prevents it, as these are surrogate code points: https://github.com/ghc/ghc/blob/21f0f56164f50844c2150c62f950983b2376f8b6/libraries/base/GHC/IO/Encoding/Failure.hs#L114
I want to write some test files which need to have that data. Right now I'm using Python to achieve what I want, but I would love to know if there is a way (a workaround using Haskell) to achieve this.

Sure, just write the bytes you want:
import qualified Data.ByteString as BS

main :: IO ()
main = BS.writeFile "surrogate.txt" (BS.pack [0xd8, 0x00])
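If the test file is instead supposed to contain the surrogate the way a UTF-8-style encoder would emit it (an assumption about the intended format; Python's errors='surrogatepass' produces exactly these bytes for U+D800), the same trick works with the three-byte sequence:
import qualified Data.ByteString as BS

-- 0xED 0xA0 0x80 is the CESU-8 / "surrogatepass" encoding of U+D800
main :: IO ()
main = BS.writeFile "surrogate.txt" (BS.pack [0xED, 0xA0, 0x80])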

Related

Read large lines in huge file without buffering

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file in memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory. I later read that it eventually loads the whole file.
I also tried using pipes-text with folds and view lines:
s <- Pipes.sum $
    folds (\i _ -> i + 1) 0 id (view Text.lines (Text.fromHandle handle))
print s
to just count the number of lines, and it seems to be doing some wonky stuff ("hGetChunk: invalid argument (invalid byte sequence)") and it takes 11 minutes, whereas wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1GB.)
I'm really open to any suggestions, can't find much searching except for newbie readLine how-tos.
Thanks!
The following code uses Conduit, and will:
UTF8-decode standard input
Run the lineC combinator as long as there is more data available
For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
Sum up the 1s yielded and print it
You can replace the yield 1 code with something which will do processing on the individual lines.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit
main :: IO ()
main = (runConduit
      $ stdinC
     .| decodeUtf8C
     .| peekForeverE (lineC (yield (1 :: Int)))
     .| sumC) >>= print
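As a variant of the snippet above (my own sketch, not part of the original answer): replace the yield 1 with lengthCE from conduit-combinators to print each line's character count, while still streaming each line in chunks.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = runConduit
     $ stdinC
    .| decodeUtf8C
       -- for each line, stream its chunks through a length fold and yield the total
    .| peekForeverE (lineC lengthCE >>= \n -> yield (n :: Int))
    .| mapM_C print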
This is probably easiest as a fold over the decoded text stream:
{-# LANGUAGE BangPatterns #-}
import Pipes
import Control.Monad (void)
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT

main :: IO ()
main = do
    n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
    print n
It takes about 14% longer than wc -l for the file I produced, which was just long lines of commas and digits. IO should properly be done with Pipes.ByteString, as the documentation says; the rest is conveniences of various sorts.
You can map an attoparsec parser over each line, distinguished by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and this might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. separator-delimited numbers) you can use Pipes.Attoparsec.parsed to stream them, as in the sketch below.
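A rough sketch of that (my own code, not from the answer above; it assumes the comma-and-digit format described in the question, and the number parser is made up for illustration). It sums every number on stdin without ever holding a whole line:
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Attoparsec as PA
import qualified Data.Attoparsec.ByteString.Char8 as A

-- one decimal number followed by any run of separators
number :: A.Parser Int
number = A.decimal <* A.skipWhile (\c -> c == ',' || c == '\n' || c == '\r')

main :: IO ()
main = do
    -- stream parsed numbers straight into a strict fold
    (total, end) <- P.fold' (+) 0 id (PA.parsed number PB.stdin)
    case end of
        Left (err, _) -> print err   -- leftover input that failed to parse
        Right ()      -> return ()
    print (total :: Int)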

Data.ByteString.Lazy.Internal.ByteString to string?

Trying to write a module which returns the external IP address of my computer.
Using the Network.Wreq get function and then applying a lens to obtain responseBody, the type I end up with is Data.ByteString.Lazy.Internal.ByteString. As I want to filter out the trailing "\n" of the result body, I want to use this with a regular expression subsequently.
Problem: That seemingly very specific ByteString type is not accepted by the regex library, and I found no way to convert it to a String.
Here is my feeble attempt so far (not compiling).
{-# LANGUAGE OverloadedStrings #-}
module ExtIp (getExtIp) where

import Network.Wreq
import Control.Lens
import Data.ByteString.Lazy
import Text.Regex.Posix

getExtIp :: IO String
getExtIp = do
    r <- get "http://myexternalip.com/raw"
    let body = r ^. responseBody
    let addr = body =~ "[^\n]*\n"
    return addr
So my question is obviously: how do I convert that funny, special ByteString to a String? Explaining how I can approach such a problem myself would also be appreciated. I tried to use unpack and toString but have no idea what to import to get those functions, if they exist.
Being a very sporadic Haskell user, I also wonder if someone could show me the idiomatic Haskell way of defining such a function. The version I show here does not account for possible runtime errors/exceptions, after all.
Short answer: Use unpack from Data.ByteString.Lazy.Char8
Longer answer:
In general, when you want to convert a ByteString (of any variety) to a String or Text you have to specify an encoding, e.g. UTF-8, Latin-1, etc.
When retrieving an HTML page, the encoding you are supposed to use may appear in the Content-type header or in the response body itself as a <meta ...> tag.
Alternatively you can just guess at what the encoding of the body is.
In your case I presume you are accessing a site like http://whatsmyip.org and you only need to parse out your IP address. So without examining the headers or looking through the HTML, a safe encoding to use would be Latin1.
To convert ByteStrings to Text via an encoding, have a look at the functions in Data.Text.Encoding.
For instance, the decodeLatin1 function.
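A minimal sketch of that, using the lazy variants since wreq's responseBody is a lazy ByteString (the helper name bodyToString is mine):
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- decode the body as Latin-1, then unpack to String
bodyToString :: BL.ByteString -> String
bodyToString = TL.unpack . TLE.decodeLatin1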
I simply do not understand why you insist on using Strings when you already have a ByteString at hand, which is the faster/more efficient representation.
Importing regex gives you almost no benefit; for parsing an IP address I would use attoparsec, which works great with ByteStrings.
Here is a version that does not use regex but returns a String. Note that I did not compile it, as I have no Haskell setup where I am right now.
{-# LANGUAGE OverloadedStrings #-}
module ExtIp (getExtIp) where

import Network.Wreq
import Control.Lens
import qualified Data.ByteString.Lazy.Char8 as Char8
import Data.Char (isSpace)

getExtIp :: IO String
getExtIp = do
    r <- get "http://myexternalip.com/raw"
    return $ Char8.unpack $ trim (r ^. responseBody)
  where
    trim = Char8.reverse . Char8.dropWhile isSpace . Char8.reverse . Char8.dropWhile isSpace

How to get a String from a Lazy.Builder?

I need to manipulate the binary encoding, as '0's and '1's, of simple strings given as input, using 7-bit ASCII.
For the encoding I have used the function Data.ByteString.Lazy.Builder.string7 :: String -> Builder
However, I have not found a way to convert the resulting Builder object back into a string of '0' and '1'. Is it possible? Is there another way?
Subsidiary question: And if I wanted it in hexadecimal form as text?
There's an unpackChars function in Data.ByteString.Lazy.Internal. There's also a non-lazy counterpart in Data.ByteString.Internal.
import qualified Data.ByteString.Lazy.Builder as Build
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString.Lazy.Internal as BSI

-- ghci> BSI.unpackChars $ Build.toLazyByteString $ Build.string7 "010101"
-- "010101"
You can also use map (chr . fromIntegral) . BS.unpack instead of unpackChars, but unpackChars is probably faster.
Alternatively, as Michael Snoyman commented below, you could use Data.ByteString.Char8 or its lazy version and you'll get the right conversions to begin with.
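A small sketch of that alternative (the helper name builderToString is mine):
import qualified Data.ByteString.Lazy.Builder as Build
import qualified Data.ByteString.Lazy.Char8 as BLC

-- build the lazy ByteString and unpack it via the Char8 module,
-- avoiding the Internal modules entirely
builderToString :: Build.Builder -> String
builderToString = BLC.unpack . Build.toLazyByteString

-- ghci> builderToString (Build.string7 "010101")
-- "010101"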

Adding the possibility to write an AST file to my (Rail) compiler

I'm writing a Rail compiler (Rail is an esoteric language) in Haskell, and I'm running into some problems in the main function of my main module.
1) I want my program to ask whether I want to run the compilation pipeline or simply stop after the lexer and write the AST to a file, so another compiler can deal with my AST (Abstract Syntax Tree). Here is my program:
module Main (
        main -- main function to run the program
    )
    where

-- imports --
import InterfaceDT as IDT
import qualified Testing as Test
import qualified Preprocessor as PreProc
import qualified Lexer
import qualified SyntacticalAnalysis as SynAna
import qualified SemanticalAnalysis as SemAna
import qualified IntermediateCode as InterCode
import qualified CodeOptimization as CodeOpt
import qualified Backend

-- functions --
main :: IO()
main = do putStr "Enter inputfile (path): "
          inputfile <- getLine
          input <- readFile inputfile
          putStr "Enter outputfile (path): "
          outputfile <- getLine
          input <- readFile inputfile
          putStr "Only create AST (True/False): "
          onlyAST <- getLine
          when (onlyAST=="True") do putStrLn "Building AST..."
                                    writeFile outputfile ((Lexer.process . PreProc.process) input)
          when (onlyAST=="False") do putStrLn ("Compiling "++inputfile++" to "++outputfile)
                                     writeFile outputfile ((Backend.process . CodeOpt.process . InterCode.process . SemAna.process . SynAna.process . Lexer.process . PreProc.process) input)
I get an error in line 21 (input <- readFile inputfile) caused by the <-. Why?
How should I do it?
2) Next, I want to refactor the program so that I can call it from the terminal with parameters, like runhaskell Main(AST) (so that it just creates the AST) or runhaskell Main.hs (so that it runs the whole pipeline).
I hope for your help!
For your error in (1), your program doesn't look syntactically incorrect at line 21 to me. However an error at <- would happen if that line were indented differently from the previous one. I suspect that you are having an indentation error due to mixing tabs and spaces in a way that looks correct in your editor but disagrees with Haskell's interpretation of tabs. The simplest recommendation is to always use spaces and never tabs.
You also have an extra copy of that line later, which you might want to remove.
I also suspect you may need to use hFlush stdout after your putStrs, for them to work as prompts.
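For example, a small prompt helper (the name prompt is mine):
import System.IO (hFlush, stdout)

prompt :: String -> IO String
prompt msg = do
    putStr msg
    hFlush stdout   -- flush so the prompt appears before getLine blocks
    getLine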
For (2), I'd suggest using a library for proper command line argument and option parsing, such as System.Console.GetOpt which is included with GHC, or one of the fancier ones which you can find on Hackage.
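A rough sketch of (2) with GetOpt; the flag name --ast-only, the option layout, and the usage string are my assumptions, and the two branches just print what they would do:
module Main (main) where

import System.Console.GetOpt
import System.Environment (getArgs)

data Flag = AstOnly deriving (Eq, Show)

options :: [OptDescr Flag]
options =
    [ Option ['a'] ["ast-only"] (NoArg AstOnly) "stop after the lexer and write the AST"
    ]

main :: IO ()
main = do
    argv <- getArgs
    case getOpt Permute options argv of
        (flags, files, []) ->
            if AstOnly `elem` flags
                then putStrLn ("would write the AST for: " ++ show files)
                else putStrLn ("would run the full pipeline on: " ++ show files)
        (_, _, errs) ->
            ioError (userError (concat errs ++ usageInfo header options))
  where
    header = "Usage: Main [--ast-only] input output"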

Haskell IO Russian symbols

I am trying to process a file which contains Russian symbols. When reading, and after writing some text to the file, I get something like:
\160\192\231\229\240\225\224\233\228\230\224\237
How can I get normal symbols?
If you are getting strings with backslashes and numbers in them, then it sounds like you might be calling "print" when you want to call "putStr".
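A small illustration of the difference (my own example):
main :: IO ()
main = do
    print    "Вася"   -- show escapes non-ASCII: "\1042\1072\1089\1103"
    putStrLn "Вася"   -- writes the characters themselves (given a suitable locale)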
If you deal with Unicode, you might try the utf8-string package:
import System.IO hiding (hPutStr, hPutStrLn, hGetLine, hGetContents, putStrLn)
import System.IO.UTF8
import Codec.Binary.UTF8.String (utf8Encode)
main = System.IO.UTF8.putStrLn "Вася Пупкин"
However, it didn't work well in my Windows CLI, garbling the output because of the codepage. I expect it to work fine on Unix-like systems if your locale is set correctly. However, writing to a file should be successful on all systems.
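A minimal sketch using only base, setting the handle's encoding explicitly before writing (the file name out.txt is mine):
import System.IO

main :: IO ()
main = withFile "out.txt" WriteMode $ \h -> do
    hSetEncoding h utf8          -- write UTF-8 regardless of the system locale
    hPutStrLn h "Вася Пупкин"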
UPDATE:
An example of using the encoding package.
I got it to work:
{-# LANGUAGE ImplicitParams #-}
import Network.HTTP
import Text.HTML.TagSoup
import Data.Encoding
import Data.Encoding.CP1251
import Data.Encoding.UTF8
openURL x = do
    x <- simpleHTTP (getRequest x)
    fmap (decodeString CP1251) (getResponseBody x)

main :: IO ()
main = do
    tags <- fmap parseTags $ openURL "http://www.trade.su/search?ext=1"
    let TagText r = partitions (~== "<input type=checkbox>") tags !! 1 !! 4
    appendFile "out" r
