How to write a zip file using Haskell LibZip?

I'm trying to figure out a dead-simple task using LibZip in Haskell: how do I open an archive foo.zip, decompress it, recompress it, and save it to a new archive bar.zip? With the Zip library, this is easy:
{-# LANGUAGE OverloadedStrings #-}
import Codec.Archive.Zip (toArchive, fromArchive)
import qualified Data.ByteString.Lazy as B
import System.Environment
saveZipAs :: FilePath -> FilePath -> IO ()
saveZipAs source dest = do
  arch <- fmap toArchive $ B.readFile source
  putStrLn "Archive info: " >> print arch
  B.writeFile dest $ fromArchive arch
LibZip, on the other hand, provides no clear way to do this (that I can see). It only seems to be able to instantiate a zip file with withArchive (which is an issue in and of itself, because a file you want to open might not be on disk), and I don't see a way to do any kind of "save as" operation, nor to extract the compressed bytes as a ByteString or otherwise (as in Zip). LibZip is supposedly faster than Zip, so I want to at least give it a try, but it seems much more obscure (and also impure, carrying around an IO everywhere it goes, where it is really only needed at the beginning and the end, if ever). Can anyone give me some tips?
Side note: it really boggles the mind how people can spend such huge amounts of time writing a library, only to document it so poorly that no one can use it. Library writers, please don't do this!

Your link points to an old version of the library, and the latest version seems to have Haddock compilation bugs.
Here are file reading functions in a newer version:
http://hackage.haskell.org/package/LibZip-0.10.2/docs/Codec-Archive-LibZip.html#g:3
The reverse process seems to be addFile/sourceBuffer and related functions.
Here is full source code of zip repacking:
import Codec.Archive.LibZip
import Codec.Archive.LibZip.Types
main = readZip "foo.zip" >>= writeZip "bar.zip"
readZip :: FilePath -> IO [(FilePath, ZipSource)]
readZip zipName = withArchive [] zipName $ do
  nn <- fileNames []
  ss <- mapM (\n -> sourceFile n 0 (-1)) nn
  return $ zip nn ss
writeZip :: FilePath -> [(FilePath, ZipSource)] -> IO ()
writeZip zipName zipContent = withArchive [CreateFlag] zipName $ do
  mapM_ (uncurry addFile) zipContent
A few refactorings are still possible: liftM2 zip can be used in readZip, and function composition (.) in writeZip.

Related

Data.Binary encodeFile does not seem to be thread safe - corrupted file?

I'm trying to reproduce a situation from a real-world application in which a binary file written via encodeFile ended up corrupted, with a file size of 0, after a hard reboot.
Although I've not been able to reproduce this behavior exactly, I have managed to produce a corrupted(?) file with the code below.
When we first run it (some text is garbled because multiple threads print at once):
"New valid file written"
Example "hmm" [0]
"Testing..."
"Donenn"oo
tt een#no~ouGugHghCh I bDby-ytSteTesAs
R
CTCa~al#lllSSttaacckk ((ffrroomm HHaassCCaallllStS#at~caGkcH)kC:)I
:D
- Fe IreNrrIorSroH,r- ,5c ~ac#la
llelde #da~ tGa HtsC rIscDr/-cMS/aTMiAanRi.Tnh~.s#h:s5:35:31:51 5i ni nm amiani:nM:aMiani
n
"d"ideiien#ig~n.Gg.H..C..I..D..-..S."T.
Command "cabal v2-repl app" exited unexpectedly
After a few runs eventually we get an error of:
*** Exception: not enough bytes
CallStack (from HasCallStack):
error, called at src/Main.hs:53:15 in main:Main
What is the cause of this error? Is it just the case that encodeFile is not safe when used from multiple threads? That would be kind of odd, as there is no mention of threads on https://hackage.haskell.org/package/binary-0.10.0.0/docs/Data-Binary.html.
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE LambdaCase #-}
module Main where
import System.PosixCompat.Files
import System.Process
import System.Process.Internals
import System.Posix.Signals
import System.Posix.Process
import Control.Concurrent
import Control.Monad
import Data.Binary
import GHC.Generics (Generic)
import Control.Exception
data Example = Example String [Int] deriving (Generic, Show)
instance Binary Example
main :: IO ()
main = do
  checkFile
  encodeFile "output.txt" $ Example "hmm" [0]
  checkFile
  print "New valid file written"
  decodeFileOrFail "output.txt" >>= \case
    Right v@(Example s z) -> print v
    Left (e,e') -> do
      error $ e'
      rip
  print "Testing..."
  forM_ [1..3] (const $ forkIO $ catch (do
      checkFile
      somethingIO
      checkFile) (\e -> do
      print (e :: SomeException)
      rip
    )
    )
  print "Done"
checkFile :: IO ()
checkFile = do
  fileExist "output.txt" >>= \case
    True -> do
      x <- getFileSize "output.txt"
      if x == 0
        then rip
        else pure ()
      decodeFileOrFail "output.txt" >>= \case
        Right (Example s z) -> pure ()
        Left (e,e') -> do
          error $ e'
          rip
    False -> pure ()
rip :: IO ()
rip = do
  print "dieing......."
  getProcessID >>= signalProcess sigKILL
somethingIO :: IO ()
somethingIO = do
  let v = 10 :: Int
  decodeFileOrFail "output.txt" >>= \case
    Right (Example s z) -> encodeFile "output.txt" $ z ++ [v]
    Left (e,e') -> do
      error $ e'
      rip
getFileSize :: String -> IO Int
getFileSize path = getFileStatus path >>= return . fromIntegral . fileSize
With a cabal file of:
cabal-version: 1.12
name: HaskellNixCabalStarter
version: 0.1.0.0
author: HaskellNixCabalStarter
maintainer: HaskellNixCabalStarter
license: MIT
build-type: Simple
executable app
  main-is: Main.hs
  other-modules:
      Paths_HaskellNixCabalStarter
  hs-source-dirs:
      src
  build-depends:
      base >=4.12 && <4.13
    , binary
    , process
    , random
    , unix
    , unix-compat
  default-language: Haskell2010
There's nothing particularly mysterious going on here. Reading and writing files simply aren't atomic operations, and this is biting you. If you have one thread writing output.txt and another reading output.txt, it is completely normal and expected for the reader to occasionally see only part of the file that the writer would eventually produce.
This is not particularly special to the binary package, nor even to the language -- this is, to a first approximation, true of nearly every library and language that deals with a filesystem. Guaranteeing atomicity of the appropriate kind is quite hard, indeed; but many, many engineering years have gone into providing this kind of thing for databases, so if that's a need for you, you might consider using one of them.
Alternately, a significantly simpler solution is to have a single thread that is responsible for reading and writing the appropriate file, and to communicate with it via one of Haskell's excellent inter-thread communication tools.
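A minimal sketch of that single-owner idea, assuming the file simply holds a [Int] encoded with Data.Binary. The Request type and the startOwner, appendValue and readValues helpers are hypothetical names introduced for illustration; only the owner thread ever opens the file, and other threads talk to it over a Chan:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever)
import Data.Binary (decodeFile, encodeFile)

-- | Requests understood by the owner thread (hypothetical type).
data Request
  = Append Int (MVar ())   -- append a value, then signal completion
  | ReadAll (MVar [Int])   -- reply with the current contents

-- | Start the only thread that is allowed to touch the file.
-- The file is (re)initialised with an empty list.
startOwner :: FilePath -> IO (Chan Request)
startOwner path = do
  encodeFile path ([] :: [Int])
  requests <- newChan
  _ <- forkIO $ forever $ do
    req <- readChan requests
    case req of
      Append v done -> do
        xs <- decodeFile path
        encodeFile path (xs ++ [v])
        putMVar done ()
      ReadAll reply -> decodeFile path >>= putMVar reply
  return requests

-- | Ask the owner to append a value and wait until it has done so.
appendValue :: Chan Request -> Int -> IO ()
appendValue requests v = do
  done <- newEmptyMVar
  writeChan requests (Append v done)
  takeMVar done

-- | Ask the owner for the current contents of the file.
readValues :: Chan Request -> IO [Int]
readValues requests = do
  reply <- newEmptyMVar
  writeChan requests (ReadAll reply)
  takeMVar reply
```

Any number of threads can call appendValue and readValues concurrently; the requests are processed one at a time, so no reader ever sees a half-written file. The sketch does no error handling: if the owner thread dies, callers block forever.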
Some OSs do offer an atomic file-rename operation. In such a situation, one could also consider writing to a temporary file, then using an atomic rename to overwrite the filename you actually care about. (Thanks for this suggestion go to a commenter, whom I will leave anonymous because they chose to delete their comment.)
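The write-then-rename approach might look roughly like this. It is a sketch under assumptions: atomicEncodeFile and roundTrip are hypothetical helper names, the ".tmp" suffix is an arbitrary choice, and whether renameFile from System.Directory is actually atomic depends on the platform and on both paths living on the same filesystem:

```haskell
import Data.Binary (Binary, decodeFile, encodeFile)
import System.Directory (renameFile)

-- | Encode to a sibling temporary file, then rename it over the target.
-- Readers see either the old file or the complete new one.
atomicEncodeFile :: Binary a => FilePath -> a -> IO ()
atomicEncodeFile path value = do
  -- Assumed naming scheme; concurrent writers would need unique temp names.
  let tmp = path ++ ".tmp"
  encodeFile tmp value
  renameFile tmp path

-- | Write a list atomically and read it back.
roundTrip :: FilePath -> [Int] -> IO [Int]
roundTrip path xs = atomicEncodeFile path xs >> decodeFile path
```

This only addresses readers observing partial writes; it does not by itself serialize several writers that each read, modify and write the file.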

Haskell Conduit: having a Sink return a value based on the values from upstream

I've been trying to use the Conduit library to do some simple I/O involving files, but I'm having a hard time.
I have a text file containing nothing but a few digits such as 1234. I have a function that reads the file using readFile (no conduits), and returns Maybe Int (Nothing is returned when the file actually doesn't exist). I'm trying to write a version of this function that uses conduits, and I just can't figure it out.
Here is what I have:
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Functor
import System.Directory
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
myFile :: FilePath
myFile = "numberFile"
withoutConduit :: IO (Maybe Int)
withoutConduit = do
  doesExist <- doesFileExist myFile
  if doesExist
    then Just . read <$> readFile myFile
    else return Nothing
withConduit :: IO (Maybe Int)
withConduit = do
  doesExist <- doesFileExist myFile
  if doesExist
    then runResourceT $ source $$ conduit =$ sink
    else return Nothing
  where
    source :: Source (ResourceT IO) B.ByteString
    source = CB.sourceFile myFile
    conduit :: Conduit B.ByteString (ResourceT IO) T.Text
    conduit = CT.decodeUtf8
    sink :: Sink T.Text (ResourceT IO) (Maybe Int)
    sink = awaitForever $ \txt -> let num = read . T.unpack $ txt :: Int
                                  in -- I don't know what to do here...
Could someone please help me complete the sink function?
Thanks!
This isn't really a good example of where conduit provides a lot of value, at least not the way you're looking at it right now. Specifically, you're trying to use the read function, which requires that the entire value be in memory. Additionally, your current error handling is a bit loose: you'll just get a read: no parse error if there's anything unexpected in the content.
However, there is a way we can play with this in conduit and be meaningful: by parsing the ByteString byte-by-byte ourselves and avoiding the read function. Fortunately, this pattern falls into a standard left fold, which the conduit-combinators package provides a perfect function for (element-wise left fold in a conduit, aka foldlCE):
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Word8
import qualified Data.ByteString as S
sinkInt :: Monad m => Consumer S.ByteString m Int
sinkInt =
    foldlCE go 0
  where
    go total w
      | _0 <= w && w <= _9 =
          total * 10 + (fromIntegral $ w - _0)
      | otherwise = error $ "Invalid byte: " ++ show w
main :: IO ()
main = do
  x <- yieldMany ["1234", "5678"] $$ sinkInt
  print x
There are plenty of caveats that go along with this: it will simply throw an exception if there are unexpected bytes, and it doesn't handle integer overflow at all (though fixing that is just a matter of replacing Int with Integer). It's important to note that, since the in-memory string representation of a valid 32- or 64-bit int is always going to be tiny, conduit is overkill for this problem, though I hope that this code gives some guidance on how to generally write conduit code.
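To illustrate the two caveats, here is the same left fold written without conduit, over a strict ByteString using foldl' from Data.ByteString. This is a sketch, not the conduit-combinators API: parseDigits and example are hypothetical names, the accumulator is an Integer so it cannot overflow, and an invalid byte yields Nothing instead of an exception:

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import Data.Word (Word8)

-- | Fold ASCII digits into an Integer; Nothing on any non-digit byte.
-- Like the fold above, an empty input yields 0.
parseDigits :: S.ByteString -> Maybe Integer
parseDigits = S.foldl' step (Just 0)
  where
    step :: Maybe Integer -> Word8 -> Maybe Integer
    step (Just total) w
      | w >= 48 && w <= 57 =              -- 48 is '0', 57 is '9'
          Just (total * 10 + fromIntegral (w - 48))
    step _ _ = Nothing                    -- invalid byte, or already failed

-- | Example input built from a String literal.
example :: Maybe Integer
example = parseDigits (S8.pack "1234")
```

The same step function could be handed to a conduit fold; the pure version just makes the accumulator and failure handling easier to see.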

Is there something better than unsafePerformIO for this....?

I've so far avoided ever needing unsafePerformIO, but this might have to change today.... I would like to see if the community agrees, or if someone has a better solution.
I have a library which needs to use some config data stored in a bunch of files. This data is guaranteed static (during the run), but needs to be in files that can (on very rare occasions) be edited by an end user who cannot compile Haskell programs. (The details are unimportant, but think of "/etc/mime.types" as a pretty good approximation. It is a large, almost static data file used throughout many programs.)
If this weren't a library I would just use the IO monad.... But because it is a library which is called throughout my code, it literally forces a bubbling up of the IO monad through pretty much everything I have written in multiple modules! Although I need to do a one-time read of the data files, this low-level call is effectively pure, so this is a pretty unacceptable outcome.
FYI, I plan to also wrap the call in unsafeInterleaveIO, so that only files that are needed will be loaded. My code will look something like this....
dataDir="<path to files>"
datafiles::[FilePath]
datafiles =
  unsafePerformIO $
    unsafeInterleaveIO $
      map (dataDir </>)
        <$> filter (not . ("." `isPrefixOf`))
        <$> getDirectoryContents dataDir
fileData::[String]
fileData = unsafePerformIO $ unsafeInterleaveIO $ sequence $ readFile <$> datafiles
Given that the data read is referentially transparent, I am pretty sure that unsafePerformIO is safe (this has been discussed in many places, such as "Use of unsafePerformIO appropriate?"). Still, though, if there is a better way, I would love to hear about it.
UPDATE-
In response to Anupam's comment....
There are two reasons why I can't break up the lib into IO and non IO parts.
First, the amount of data is large, and I don't want to read it all into memory at once. Remember that IO is always read strictly.... This is the reason that I need to put in the unsafeInterleaveIO call, to make it lazy. IMHO, once you use unsafeInterleaveIO, you might as well use unsafePerformIO, as the risk is already there.
Second, breaking out the IO specific parts just substitutes the bubbling up of the IO monad with the bubbling up of the IO read code, as well as the passing around of the data (I might actually choose to pass around the data using the state monad anyway, so it really isn't an improvement to substitute the IO monad for the state monad everywhere). This wouldn't be so bad if the low level function itself wasn't effectively pure (ie- think of my /etc/mime.types example above, and imagine a Haskell extensionToMimeType function, which is basically pure, but needs to get the database data from the file.... Suddenly everything from low to high in the stack needs to call or pass through a readMimeData::IO String. Why should each main even need to care about the library choice of a submodule many levels deep?).
I agree with Anupam Jain, you would be better off reading these data files at a somewhat higher level, in IO, and then passing the data in them through the rest of your program purely.
You could, for example, put the functions that need the results of fileData into Reader [String], so that they can just ask for the results as needed (or some Reader Config, where Config holds these strings and whatever else you need).
A sketch of what I'm suggesting follows:
import Control.Monad.Reader
type AppResult = String
fileData :: IO [String]
fileData = undefined -- read the files
myApp :: String -> Reader [String] AppResult
myApp s = do
  files <- ask
  return undefined -- do whatever with s and config
main = do
  config <- fileData
  return $ runReader (myApp "test") config
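A slightly more concrete version of the Reader idea, using the question's /etc/mime.types analogy. This is a sketch under assumptions: MimeDb, parseMimeDb and describe are hypothetical names, the file format is assumed to be "type ext1 ext2 ..." per line with '#' comments, and Reader is taken from Control.Monad.Trans.Reader in the transformers package:

```haskell
import Control.Monad.Trans.Reader (Reader, asks, runReader)
import qualified Data.Map as Map

-- | Extension-to-MIME-type table, built once from the file contents.
type MimeDb = Map.Map String String

-- | Parse lines such as "text/html html htm"; '#' lines are skipped.
parseMimeDb :: String -> MimeDb
parseMimeDb contents = Map.fromList
  [ (ext, mime)
  | l <- lines contents
  , take 1 l /= "#"
  , (mime : exts) <- [words l]
  , ext <- exts
  ]

-- | The low-level lookup stays pure; it only reads the environment.
extensionToMimeType :: String -> Reader MimeDb (Maybe String)
extensionToMimeType ext = asks (Map.lookup ext)

-- | A higher-level function that uses the lookup without mentioning IO.
describe :: String -> Reader MimeDb String
describe ext = do
  mime <- extensionToMimeType ext
  return (ext ++ ": " ++ maybe "unknown" id mime)
```

Only main needs IO: it would read the file once, for example with parseMimeDb <$> readFile "/etc/mime.types", and then call runReader (describe "htm") on the resulting table.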
I gather that you don't want to read all the data at once, because that would be costly. And maybe you don't really know up-front what files you will need to load, so loading all of them at the start would be wasteful.
Here's an attempt at a solution. It requires you to work inside a free monad and relegate the side-effecting operations to an interpreter. Some preliminary imports:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString as B
import Data.Monoid
import Data.List
import Data.Functor.Compose
import Control.Applicative
import Control.Monad
import Control.Monad.Free
import System.IO
We define a functor for the free monad. It will offer a value p to the interpreter and continue the computation after receiving a value b:
type LazyLoad p b = Compose ((,) p) ((->) b)
A convenience function to request the loading of a file:
lazyLoad :: FilePath -> Free (LazyLoad FilePath B.ByteString) B.ByteString
lazyLoad path = liftF $ Compose (path,id)
A dummy interpreter function that reads "file contents" from stdin:
interpret :: Free (LazyLoad FilePath B.ByteString) a -> IO a
interpret = iterM $ \(Compose (path,next)) -> do
  putStrLn $ "Enter the contents for file " <> path <> ":"
  B.hGetLine stdin >>= next
Some silly example functions:
someComp :: B.ByteString -> B.ByteString
someComp b = "[" <> b <> "]"
takesAwhile :: Int
takesAwhile = foldl' (+) 0 $ take 400000000 $ intersperse (negate 1) $ repeat 1
An example program:
main :: IO ()
main = do
  r <- interpret $ do
    r1 <- someComp <$> lazyLoad "file1"
    r2 <- return takesAwhile
    if (r2 == 1)
      then return r1
      else someComp <$> lazyLoad "file2"
  putStrLn . show $ r
When executed, this program will request a line, spend some time computing takesAwhile and only then request another line.
If you want to allow different kinds of "requests", this solution could be extended with something like Data types à la carte so that each function only needs to know about the precise effects it requires.
If you are content with allowing only one type of request, you could also use Clients and Servers from Pipes.Core instead of the free monad.

Can I read n files lazily as a single IO operation in Haskell?

How can I read multiple files as a single ByteString lazily with constant memory?
readFiles :: [FilePath] -> IO ByteString
I currently have the following implementation but from what I have seen from profiling as well as my understanding I will end with n-1 of the files in memory.
readFiles = foldl1 joinIOStrings . map ByteString.readFile
  where joinIOStrings ml mr = do
          l <- ml
          r <- mr
          return $ l `ByteString.append` r
I understand that the flaw here is that I am applying the IO actions then rewrapping them so what I think I need is a way to replace the foldl1 joinIOStrings without applying them.
How can I read multiple files as a single ByteString lazily with constant memory?
If you want constant memory usage, you need Data.ByteString.Lazy. A strict ByteString cannot be read lazily, and would require O(sum of filesizes) memory.
For a not too large number of files, simply reading them all (D.B.L.readFile reads lazily) and concatenating the results is good,
import qualified Data.ByteString.Lazy as L
readFiles :: [FilePath] -> IO L.ByteString
readFiles = fmap L.concat . mapM L.readFile
The mapM L.readFile will open the files, but only read the contents of each file when it is demanded.
If the number of files is large, so that the limit of open file handles allowed by the OS for a single process could be exhausted, you need something more complicated. You can cook up your own lazy version of mapM,
import System.IO.Unsafe (unsafeInterleaveIO)
mapM_lazy :: [IO a] -> IO [a]
mapM_lazy [] = return []
mapM_lazy (x:xs) = do
  r <- x
  rs <- unsafeInterleaveIO (mapM_lazy xs)
  return (r:rs)
so that each file will only be opened when its contents are needed, when previously read files can already be closed. There's a slight possibility that that still runs into resource limits, since the time of closing the handles is not guaranteed.
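Putting the pieces together, a version of readFiles built on that lazy mapM might look like the following sketch. readFilesLazy is a hypothetical name; the first file is opened immediately, and each later file is opened only when the concatenated result is consumed that far:

```haskell
import qualified Data.ByteString.Lazy as L
import System.IO.Unsafe (unsafeInterleaveIO)

-- | Run the first action now and defer the rest until the list is demanded.
mapM_lazy :: [IO a] -> IO [a]
mapM_lazy [] = return []
mapM_lazy (x:xs) = do
  r <- x
  rs <- unsafeInterleaveIO (mapM_lazy xs)
  return (r:rs)

-- | Concatenate files lazily, opening each one only when it is reached.
readFilesLazy :: [FilePath] -> IO L.ByteString
readFilesLazy = fmap L.concat . mapM_lazy . map L.readFile
```

As with any lazy I/O, errors such as a missing file surface only when the corresponding part of the result is forced, which may be far from the call to readFilesLazy.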
Or you can use your favourite iteratee, enumerator, conduit or whatever package that solves the problem in a systematic way. Each of them has its own advantages and disadvantages with respect to the others and, if coded correctly, eliminates the possibility of accidentally hitting the resource limit.
I assume that you are using lazy byte strings (from Data.ByteString.Lazy). There are probably other ways to do this, but one option is to simply use concat :: [ByteString] -> ByteString:
import Control.Monad
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString
readFiles :: [FilePath] -> IO ByteString
readFiles = fmap ByteString.concat . mapM ByteString.readFile
(Note: I don't have time to test the code, but reading the documentation says that this should work)

What is the haskell way to copy a directory

I find myself doing more and more scripting in haskell. But there are some cases where I'm really not sure of how to do it "right".
e.g. copy a directory recursively (a la unix cp -r).
Since I mostly use linux and Mac Os I usually cheat:
import System.Cmd
import System.Exit
copyDir :: FilePath -> FilePath -> IO ExitCode
copyDir src dest = system $ "cp -r " ++ src ++ " " ++ dest
But what is the recommended way to copy a directory in a platform independent fashion?
I didn't find anything suitable on hackage.
This is the rather naive implementation I use so far:
import System.Directory
import System.FilePath((</>))
import Control.Applicative((<$>))
import Control.Exception(throw)
import Control.Monad(when,forM_)
copyDir :: FilePath -> FilePath -> IO ()
copyDir src dst = do
  whenM (not <$> doesDirectoryExist src) $
    throw (userError "source does not exist")
  whenM (doesFileOrDirectoryExist dst) $
    throw (userError "destination already exists")
  createDirectory dst
  content <- getDirectoryContents src
  let xs = filter (`notElem` [".", ".."]) content
  forM_ xs $ \name -> do
    let srcPath = src </> name
    let dstPath = dst </> name
    isDirectory <- doesDirectoryExist srcPath
    if isDirectory
      then copyDir srcPath dstPath
      else copyFile srcPath dstPath
  where
    doesFileOrDirectoryExist x = orM [doesDirectoryExist x, doesFileExist x]
    orM xs = or <$> sequence xs
    whenM s r = s >>= flip when r
Any suggestions of what really is the way to do it?
I updated this with the suggestions of hammar and FUZxxl.
...but still it feels kind of clumsy to me for such a common task!
It's possible to use the Shelly library in order to do this, see cp_r:
cp_r "sourcedir" "targetdir"
Shelly first tries to use native cp -r if available. If not, it falls back to a native Haskell IO implementation.
For further details on the type semantics of cp_r, see this post, which I wrote to describe how to use cp_r with String and/or Text.
Shelly is not platform independent, since it relies on the Unix package, which is not supported under Windows.
I couldn't find anything that does this on Hackage.
Your code looks pretty good to me. Some comments:
dstExists <- doesDirectoryExist dst
This does not take into account that a file with the destination name might exist.
if or [not srcExists, dstExists] then print "cannot copy"
You might want to throw an exception or return a status instead of printing directly from this function.
paths <- forM xs $ \name -> do
  [...]
  return ()
Since you're not using paths for anything, you can change this to
forM_ xs $ \name -> do
  [...]
The filesystem-trees package provides the means for a very simple implementation:
import System.File.Tree (getDirectory, copyTo_)
copyDirectory :: FilePath -> FilePath -> IO ()
copyDirectory source target = getDirectory source >>= copyTo_ target
The MissingH package provides recursive directory traversals, which you might be able to use to simplify your code.
I assume that the copyDirRecur function in Path.IO, with its variants to include or exclude symlinks, may be a newer and maintained solution. It requires converting the file path to Path x Dir, which is done with parseRelDir or parseAbsDir, but I think having a more precise data type than FilePath is worthwhile to avoid hard-to-track errors at run time.
There are also some functions for copying files and directories in the core Haskell library Cabal modules, specifically Distribution.Simple.Utils in package Cabal. copyDirectoryRecursive is one, and there are other functions near this one in that module.
