Is there a function f :: Text -> Maybe ByteString such that forall x:
f (decodeLatin1 x) == Just x
Note, decodeLatin1 has the signature:
decodeLatin1 :: ByteString -> Text
I'm concerned that encodeUtf8 is not what I want: I suspect it just dumps the text's UTF-8 encoding out as a ByteString, rather than reversing the changes that decodeLatin1 made on the way in to characters in the upper half of the character set.
I understand that f has to return a Maybe, because in general there are Unicode characters that aren't in the Latin-1 character set, but I at least want this to round trip: if we start with a ByteString, we should get back to it.
DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested it.
I think you can do it with the witch library. It is a general-purpose type-converter library with a fair amount of type safety. It has a type class called TryFrom for conversions between types that might fail.
Luckily, witch provides conversions from/to encodings too: there is an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert from Text to a latin1-encoded ByteString. So I think (not tested!!) this should work:
{-# LANGUAGE TypeApplications #-}

import Witch (tryInto, ISO_8859_1)
import Data.Tagged (Tagged (unTagged))
import Data.Text (Text)
import Data.ByteString (ByteString)

f :: Text -> Maybe ByteString
f s = case tryInto @(ISO_8859_1 ByteString) s of
  Left _err -> Nothing
  Right bs  -> Just (unTagged bs)
Notice that tryInto returns an Either (TryFromException source target) target, so if you want to handle the error you can keep the Either instead of collapsing it to Maybe. Up to you.
Also, the witch docs point out that this conversion goes via the String type, so there is probably an out-of-the-box solution that doesn't require depending on the witch package. I don't know of one, and looking at the source code hasn't helped.
Edit:
Having read the witch source code, apparently this should work:
import Data.Text (Text)
import qualified Data.Text as T
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C
import Data.Char (isLatin1)

f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
  where
    str = T.unpack t
    allCharsAreLatin = all isLatin1 str
The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.
import Control.Monad (guard)
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS

latin1EncodeText :: T.Text -> Maybe BS.ByteString
-- the guard yields Nothing as soon as any character falls outside the Latin-1 range
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (< '\256') t)
It's possible to avoid the intermediate String, but you should probably make sure this is your bottleneck before trying for that.
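If profiling ever does show that the intermediate String matters, one possible direction is to fold the Text straight into a Builder. This is an untested sketch assuming the Data.ByteString.Builder API (bytestring >= 0.10); the name latin1EncodeText' is mine:
import Control.Monad (guard)
import Data.Char (ord)
import qualified Data.Text as T
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL

latin1EncodeText' :: T.Text -> Maybe BS.ByteString
latin1EncodeText' t =
  BL.toStrict (B.toLazyByteString (T.foldl' step mempty t))
    <$ guard (T.all (< '\256') t)
  where
    -- the guard guarantees every character fits in a single byte
    step acc c = acc <> B.word8 (fromIntegral (ord c))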
Related
The C language provides a very handy way of updating the nth element of an array: array[n] = new_value. My understanding of the Data.ByteString type is that it provides functionality very similar to a C array of uint8_t: access via index :: ByteString -> Int -> Word8. It appears that the opposite operation, updating a value, is not that easy.
My initial approach was to use the take, drop and singleton functions, concatenated in the following way:
updateValue :: ByteString -> Int -> Word8 -> ByteString
updateValue bs n value = concat [take n bs, singleton value, drop (n+1) bs]
(this is a very naive implementation as it does not handle edge cases)
Coming from a C background, it feels a bit too heavyweight to call four functions to update one value. In theory, though, the complexity of the operation is not that bad:
take is O(1)
drop is O(1)
singleton is O(1)
concat is O(n), but I'm not sure whether n here is the total length of all the concatenated ByteStrings or just the number of pieces, which in our case is 3.
My second approach was to ask Hoogle for a function with a similar type signature: ByteString -> Int -> a -> ByteString, but nothing appropriate appeared.
Am I missing something very obvious, or is it really that complex to update a value?
I would like to note that I understand that a ByteString is immutable and that changing any of its elements will result in a new ByteString instance.
EDIT:
A possible solution that I found while reading about the Control.Lens library uses ix together with the .~ (set) operator. The following is an excerpt from GHCi with module names omitted:
> import Data.ByteString
> import Control.Lens
> let clock = pack [116, 105, 99, 107]
> clock
"tick"
> let clock2 = clock & ix 1 .~ 111
> clock2
"tock"
One solution is to convert the ByteString to a Storable Vector, then modify that:
import Data.ByteString (ByteString)
import Data.Vector.Storable (modify)
import Data.Vector.Storable.ByteString -- provided by the "spool" package
import Data.Vector.Storable.Mutable (write)
import Data.Word (Word8)
updateAt :: Int -> Word8 -> ByteString -> ByteString
updateAt n x = vectorToByteString . modify inner . byteStringToVector
  where
    inner v = write v n x
See the documentation for spool and vector.
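As a quick sanity check, a hypothetical GHCi session using updateAt (untested; it mirrors the lens example from the question, where byte 111 is 'o'):
> import qualified Data.ByteString.Char8 as C
> updateAt 1 111 (C.pack "tick")
"tock"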
I am a bit confused by this code from Real World Haskell:
import qualified Data.ByteString.Lazy.Char8 as L8
import qualified Data.ByteString.Lazy as L
matchHeader :: L.ByteString -> L.ByteString -> Maybe L.ByteString
matchHeader prefix str
| prefix `L8.isPrefixOf` str
= Just (L8.dropWhile isSpace (L.drop (L.length prefix) str))
| otherwise
= Nothing
It seems L and L8 can be used interchangeably in parts of this function; it compiles fine if I replace L with L8, in particular for the types L.ByteString and L8.ByteString. I saw on Hackage that they link to the same source. Does that mean Data.ByteString.Lazy.Char8.ByteString is the same as Data.ByteString.Lazy.ByteString? And why is L8.isPrefixOf used here rather than L.isPrefixOf?
That's funny, I've used all the ByteString variants but never noticed (until you mentioned it) that the Char8 and Word8 versions are internally the same data type.
Once it was mentioned, though, I had to go and look at the code.... The following lines in Data/ByteString/Lazy/Char8.hs show that not only are the data types the same, but many of the functions are re-exported identically:
-- Functions transparently exported
import Data.ByteString.Lazy
(fromChunks, toChunks, fromStrict, toStrict
,empty,null,length,tail,init,append,reverse,transpose,cycle
,concat,take,drop,splitAt,intercalate,isPrefixOf,group,inits,tails,copy
,hGetContents, hGet, hPut, getContents
,hGetNonBlocking, hPutNonBlocking
,putStr, hPutStr, interact)
So it would seem that most of Data.ByteString.(Lazy.)?Char8 is just a convenience wrapper around Data.ByteString(.Lazy)?. This also explains to me why show has always created stringy output for Word8 ByteStrings.
Of course some things do differ, as you can see when you try to create a ByteString (where B is a Word8-based ByteString module and B8 its Char8 counterpart):
B.pack "abcd" -- This fails: this pack expects [Word8]
B.pack [65, 66, 67, 68] -- output is "ABCD"
B8.pack "abcd" -- This works
According to the documentation, both Lazy.ByteString and Lazy.Char8.ByteString are a space-efficient representation of a Word8 vector, supporting many efficient operations. So internally they are the same, and you can use them interchangeably.
But Lazy.Char8.ByteString additionally has these characteristics:
All Chars will be truncated to 8 bits (so be careful!)
The Char8 interface to bytestrings provides an instance of IsString for the ByteString type, enabling you to use string literals and have them implicitly packed to ByteStrings (you need to enable the OverloadedStrings extension for this). Both points are sketched below.
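Here is a small sketch of both points; the module aliases and the example character are mine, and the truncated byte is simply the low 8 bits of the code point:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

-- '\955' (lambda, U+03BB) gets truncated to its low 8 bits by the Char8 interface.
truncated :: B.ByteString
truncated = B8.pack "\955"   -- a single byte, 0xBB

-- With OverloadedStrings, a string literal is packed implicitly via IsString.
literal :: B.ByteString
literal = "hello"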
I've been suggested csv-conduit as a good Haskell package to work with CSV files. I want to learn how it works, but the documentation is too terse for a newbie Haskell programmer.
Is there a way for me to figure out how it works by trial-and-error in GHCi?
More specifically, should I load modules and files from GHCi or should I write a simple HS file to load them and then move around interactively?
I mentioned csv-conduit, but I'm open to using any CSV package. I just need to get my hands on one and fool around with it until I feel at ease (much like I would do in IDLE).
Take a look at the following function:
readCSVFile :: (MonadResource m, CSV ByteString a) => CSVSettings -> FilePath -> m [a]
It's relatively simple to call: we just need a CSVSettings, such as defCSVSettings, and a FilePath (aka String), e.g. "file.csv".
Thus, after the call, we get a result constrained by (MonadResource m, CSV ByteString a). We can resolve these constraints one at a time to figure out appropriate concrete types. We are performing IO in this operation, so for MonadResource m, m should just be ResourceT IO, which happens to be an instance of MonadBaseControl IO as required by runResourceT. This is a conduit-specific thing.
For CSV ByteString a, we need to find what instances of CSV exist. To do so, go to http://hackage.haskell.org/packages/archive/csv-conduit/0.2.1.1/doc/html/Data-CSV-Conduit.html#t:CSV (where the documentation for the package is, in my opinion, somewhat obnoxiously all stuffed into the type class...) and click on Instances to see which instances of the form CSV ByteString a are available. The two options are CSV ByteString ByteString and CSV ByteString Text.
Of these two, Text is preferable because it handles Unicode, and a CSV file is unlikely to contain binary data. ByteString is more or less similar to [Word8], while Text is more similar to [Char], which is probably what you want. Hence, a should be Text (although ByteString will still work).
This means the result of the function call is ResourceT IO [Row Text]. We can't do much with this, but because ResourceT is a monad transformer, we can easily "pop" off the monad transformation layer with the function runResourceT. Thus,
readFile :: FilePath -> IO [Row Text]
readFile = runResourceT . readCSVFile defCSVSettings
which is easily usable within, say, main to get at the [Row Text] which you can then iterate over with a map or a fold to get your hands on the individual rows.
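For example, a minimal (untested) sketch of such a main. Note that this readFile shadows Prelude.readFile, so in a real module you would hide the Prelude version or pick another name, and "input.csv" is just a placeholder path:
main :: IO ()
main = do
  rows <- readFile "input.csv"
  mapM_ print rows   -- or map/fold over the rows however you like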
To run this sort of thing in GHCi you absolutely have to spell out the type. The reason is that the resulting class instance does not depend on any of the parameters; thus, for any CSVSettings and FilePath, readCSVFile could return any number of different types, as long as m is an instance of MonadResource and CSV ByteString a holds. So we have to point out explicitly to GHCi which type we want.
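Concretely, something along these lines should do it in GHCi (a hypothetical session based on the signature above, with runResourceT from the resourcet package):
> import Data.CSV.Conduit
> import Data.Text (Text)
> import Control.Monad.Trans.Resource (runResourceT)
> runResourceT (readCSVFile defCSVSettings "file.csv") :: IO [Row Text]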
Have you tried Text.CSV? It might be more appropriate if you're just starting out with Haskell, as it's much simpler.
As for exploring new modules, you can just load it into GHCi, there's no need to write an additional file.
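For example, a minimal sketch with Text.CSV from the csv package (from memory, so treat the details as approximate; "test.csv" is a placeholder file):
import Text.CSV (parseCSVFromFile)

main :: IO ()
main = do
  result <- parseCSVFromFile "test.csv"
  case result of
    Left err      -> print err
    Right records -> mapM_ print records   -- each record is a list of String fields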
This works with the latest version of the csv-conduit package (version 0.6.3). Note the type signature on readCsv, without which I couldn't get it to compile.
{-# LANGUAGE OverloadedStrings #-}
import Data.CSV.Conduit
import Data.Text (Text)
import qualified Data.Vector as V
import qualified Data.ByteString as B
csvset :: Char -> CSVSettings
csvset c = CSVSettings {csvSep = c, csvQuoteChar = Just '"'}
readCsv :: String -> Char -> IO (V.Vector (Row Text))
readCsv fp del = readCSVFile (csvset del) fp
main = readCsv "C:\\mydir\\myfile.csv" ';'
I'm new to Haskell and I'm trying to use a pure SHA1 implementation in my app (Data.Digest.Pure.SHA) with a JSON library (AttoJSON).
AttoJSON uses Data.ByteString.Char8 bytestrings, SHA uses Data.ByteString.Lazy bytestrings, and some of my string literals in my app are [Char].
Haskell Prime's wiki page on Char types seems to indicate this is something still being worked out in the Haskell language/Prelude.
And this blog post on Unicode support lists a few libraries, but it's a couple of years old.
What is the current best way to convert between these types, and what are some of the tradeoffs?
Thanks!
Here's what I have, without using ByteString's internal functions.
import Data.ByteString as S (ByteString, unpack)
import Data.ByteString.Char8 as C8 (pack)
import Data.Char (chr)
strToBS :: String -> S.ByteString
strToBS = C8.pack
bsToStr :: S.ByteString -> String
bsToStr = map (chr . fromEnum) . S.unpack
S.unpack on a ByteString gives us [Word8]; we then map (chr . fromEnum) over it, which converts each Word8 to the corresponding Char. Composing these together gives us the function we want!
For conversion between Char8 and Word8 you should be able to use toEnum/fromEnum conversions, as they represent the same data.
For Char and Strings you might be able to get away with Data.ByteString.Char8.pack/unpack or some sort of combination of map, toEnum and fromEnum, but that throws out data if you're using anything other than ASCII.
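For instance, a tiny sketch of those conversions (the helper names are mine):
import Data.Word (Word8)

word8ToChar :: Word8 -> Char
word8ToChar = toEnum . fromEnum

charToWord8 :: Char -> Word8
charToWord8 = toEnum . fromEnum   -- errors above 255; use fromIntegral . fromEnum to truncate instead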
For strings which could contain more than just ASCII a popular choice is UTF8 encoding. I like the utf8-string package for this:
http://hackage.haskell.org/packages/archive/utf8-string/0.3.6/doc/html/Codec-Binary-UTF8-String.html
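As an illustration, a small sketch using utf8-string's Codec.Binary.UTF8.String encode/decode to go through strict ByteStrings (helper names are mine):
import Codec.Binary.UTF8.String (encode, decode)
import qualified Data.ByteString as B

stringToUtf8 :: String -> B.ByteString
stringToUtf8 = B.pack . encode     -- encode :: String -> [Word8]

utf8ToString :: B.ByteString -> String
utf8ToString = decode . B.unpack   -- decode :: [Word8] -> String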
Char8 and normal bytestrings are the same thing, just with different interfaces depending on which module you import. Mainly you want to convert between strict and lazy bytestrings, for which you use toChunks and fromChunks.
To put chars into bytestrings, use pack.
Also note that if your chars include code points that have multibyte representations in UTF-8, then there will be problems.
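To make the strict/lazy and Char packing points concrete, a small sketch (helper names are mine; newer bytestring versions also offer fromStrict/toStrict directly):
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L

strictToLazy :: S.ByteString -> L.ByteString
strictToLazy s = L.fromChunks [s]

lazyToStrict :: L.ByteString -> S.ByteString
lazyToStrict = S.concat . L.toChunks

packChars :: String -> S.ByteString
packChars = S8.pack   -- truncates code points above 255, per the caveat above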
Note : This answers the question in a very specific case (calling functions on hard-coded strings).
This may seem a minor problem because conversion functions exist as detailed in previous answers.
But I wanted a method to reduce administrative code, i.e. the code that you have to write just to get functions working together.
The solution to reducing type-handling code for strings is to use the OverloadedStrings pragma and import the relevant module(s)
{-# LANGUAGE OverloadedStrings #-}
module Dummy where
import Data.ByteString.Lazy.Char8 (ByteString, append)
bslHandling :: ByteString -> ByteString
bslHandling = (append myWord8List)
myWord8List = "I look like a String, but I'm actually a ByteString"
Note: myWord8List's type is inferred by the compiler.
If you do not use it in bslHandling, then the above declaration will yield the classic [Char] type.
It does not solve the problem of passing from one specific type to another
Hope it helps
Maybe you want to do this:
import Data.ByteString.Internal (unpackBytes)
import Data.ByteString.Char8 (pack)
import GHC.Word (Word8)
strToWord8s :: String -> [Word8]
strToWord8s = unpackBytes . pack
Assuming that Char and Word8 are the same,
import Data.Word ( Word8 )
import Unsafe.Coerce ( unsafeCoerce )
toWord8 :: Char -> Word8
toWord8 = unsafeCoerce
strToWord8 :: String -> [Word8]
strToWord8 = map toWord8
I wish to compress my application's network traffic.
According to the (latest?) "Haskell Popularity Rankings", zlib seems to be a pretty popular solution. zlib's interface uses ByteStrings:
compress :: ByteString -> ByteString
decompress :: ByteString -> ByteString
I am using regular Strings, which are also the data types used by read, show, and Network.Socket:
sendTo :: Socket -> String -> SockAddr -> IO Int
recvFrom :: Socket -> Int -> IO (String, Int, SockAddr)
So to compress my strings, I need some way to convert a String to a ByteString and vice-versa.
With hoogle's help, I found:
Data.ByteString.Char8 pack :: String -> ByteString
Trying to use it:
Prelude Codec.Compression.Zlib Data.ByteString.Char8> compress (pack "boo")
<interactive>:1:10:
Couldn't match expected type `Data.ByteString.Lazy.Internal.ByteString'
against inferred type `ByteString'
In the first argument of `compress', namely `(pack "boo")'
In the expression: compress (pack "boo")
In the definition of `it': it = compress (pack "boo")
It fails because (?) there are different types of ByteString?
So basically:
Are there several types of ByteString? What types, and why?
What's "the" way to convert Strings to ByteStrings?
Btw, I found that it does work with Data.ByteString.Lazy.Char8's ByteString, but I'm still intrigued.
There are two kinds of bytestrings: strict (defined in Data.ByteString.Internal) and lazy (defined in Data.ByteString.Lazy.Internal). zlib uses lazy bytestrings, as you've discovered.
The function you're looking for is:
import Data.ByteString as BS
import Data.ByteString.Lazy as LBS
lazyToStrictBS :: LBS.ByteString -> BS.ByteString
lazyToStrictBS x = BS.concat $ LBS.toChunks x
I expect it can be written more concisely without the x (i.e. point-free), but I'm new to Haskell.
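Indeed it can; a point-free variant of the same function (assuming the imports above):
lazyToStrictBS' :: LBS.ByteString -> BS.ByteString
lazyToStrictBS' = BS.concat . LBS.toChunks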
A more efficient approach might be to switch to a fully bytestring-based stack:
network-bytestring for bytestring sockets
lazy bytestrings for compression (see the sketch below)
binary or bytestring-show to replace Show/Read
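And for the compression side specifically, a minimal sketch of the String-based route from the question (untested; it goes through the lossy Char8 pack, so it only round-trips for 8-bit characters):
import Codec.Compression.Zlib (compress, decompress)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as L8

compressString :: String -> L.ByteString
compressString = compress . L8.pack

decompressToString :: L.ByteString -> String
decompressToString = L8.unpack . decompress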