How to take lazy bytestring from zip archive without heap overflow

I want to take the first five bytes from the first file in a zip archive. I use the zip-archive package for decompression:
import qualified Data.ByteString.Lazy as L
import Data.Maybe
import System.Environment (getArgs)
import Codec.Archive.Zip
main = do
    f:_ <- getArgs
    print . L.take 5 . fromEntry . head . zEntries . toArchive =<< L.readFile f
This code works for small archives, but I get a heap overflow with big ones. For example, running
./zip-arch test.zip +RTS -p -hy -M100M
on this archive gives this heap profile:

Consider calling out to unzip. It's not super haskelly but it does the job. Perhaps all the haters out there should spend more time fixing or replacing broken libraries like zip-archive and less time on stackoverflow.
Standard disclaimer: no error checking is present, this may leak handles, and lazy I/O is lazy.
import System.Environment (getArgs)
import System.IO (hSetBinaryMode)
import System.Process (StdStream(..), createProcess, proc, close_fds, std_out)
import qualified Data.ByteString.Lazy as L
unzipLBS :: FilePath -> IO L.ByteString
unzipLBS file = do
    -- "unzip -p" writes the extracted bytes to stdout
    let args = proc "unzip" ["-p", file]
        args' = args { std_out = CreatePipe, close_fds = True }
    (_, Just hOut, _, _) <- createProcess args'
    hSetBinaryMode hOut True
    L.hGetContents hOut
main :: IO ()
main = do
    f:_ <- getArgs
    print . L.take 5 =<< unzipLBS f
Seems to work:
$ runghc -Wall unzip.hs ~/Downloads/test.zip
Chunk ",+\227F\149" Empty
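The disclaimer above mentions leaked handles and lazy I/O. A minimal sketch of one way to sidestep both, assuming a version of the process package that provides withCreateProcess: read only the bytes that are needed, strictly, and let withCreateProcess close the pipe and clean up the child. The names firstBytesFrom and firstBytes are hypothetical helpers, not part of any library.

```haskell
import qualified Data.ByteString as B
import System.Environment (getArgs)
import System.IO (hSetBinaryMode)
import System.Process
  (CreateProcess (..), StdStream (..), proc, withCreateProcess)

-- | Run a command and strictly read at most n bytes of its stdout.
-- withCreateProcess closes the pipe and cleans up the child process
-- when the callback returns or throws. (Hypothetical helper.)
firstBytesFrom :: Int -> CreateProcess -> IO B.ByteString
firstBytesFrom n cmd =
  withCreateProcess cmd { std_out = CreatePipe } $ \_ mOut _ _ ->
    case mOut of
      Nothing   -> ioError (userError "no stdout pipe")
      Just hOut -> do
        hSetBinaryMode hOut True
        B.hGet hOut n

-- | Same idea as unzipLBS, but strict and bounded. (Hypothetical helper.)
firstBytes :: Int -> FilePath -> IO B.ByteString
firstBytes n file = firstBytesFrom n (proc "unzip" ["-p", file])

main :: IO ()
main = do
  f:_ <- getArgs
  print =<< firstBytes 5 f
```

Because the read is strict and bounded, nothing depends on when a lazy bytestring happens to be forced, and the child does not outlive the call.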

I've read the explanation by the zip-archive author and decided to make the recommended repairs. I ended up with a new library, zip-conduit. Its main feature is constant memory usage without lazy IO. To take the first five bytes from the first file in the zip archive you can write:
import System.Environment
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Codec.Archive.Zip
main = do
    f:_ <- getArgs
    res <- withArchive f $ do
        name:_ <- fileNames
        source <- getSource name
        runResourceT $ source $$ CB.take 5
    print res

Related

how to list the functions exported by a Haskell module from an .hs script?

I am aware of this thread and the agreed-upon ghci :browse command, but I am looking for something similar to run from a script.hs file:
Say I have a module that I can import into my script.hs. How do I then view the list of functions I have just gained access to?
What I've settled on for now
Adapting this thread that suggests the now-deprecated ghc-mod command-line program, I am calling the terminal command ghc -e ':browse <module, e.g. Data.List>' from my script.hs using Shelly.
My full script:
#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
import Safe (headDef)
import Shelly
import System.Environment (getArgs)
import qualified Data.Text as T
mdl :: IO String
mdl = getArgs >>= return . headDef "Data.List"
runShelly :: String -> IO ()
runShelly mdl = shelly $ silently $ do
    out <- run "ghc" ["-e", T.pack (":browse " ++ mdl)]
    let lns = T.lines out
    liftIO $ mapM_ (putStrLn . T.unpack) $ lns
main :: IO ()
main = mdl >>= runShelly
This way I can pass the module name on the command line as <script> <module> and get back the functions, one per line. It defaults to Data.List if I pass no arguments.
So that's a solution, but surely there must be handier introspection facilities than this?
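For comparison, the same ghc -e approach can be sketched with only the process package instead of Shelly and safe. This is not a different introspection facility, just a smaller version of the script above; browseModule is a hypothetical name, and it assumes ghc is on the PATH.

```haskell
import System.Environment (getArgs)
import System.Process (readProcess)

-- | Ask GHC to :browse a module and return its output line by line.
-- (Hypothetical helper; relies on a ghc executable being on the PATH.)
browseModule :: String -> IO [String]
browseModule m = lines <$> readProcess "ghc" ["-e", ":browse " ++ m] ""

main :: IO ()
main = do
  args <- getArgs
  -- Default to Data.List when no module name is given.
  let m = case args of
            (x:_) -> x
            []    -> "Data.List"
  mapM_ putStrLn =<< browseModule m
```

readProcess throws an exception if ghc exits with a non-zero status, so an unknown module name fails loudly rather than printing nothing.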

How to (efficiently) follow / tail a file with Haskell, including detecting file rotation? (tail -F)

In essence I wish to know how to approach implementing tail -F Linux command functionality in Haskell. My goal is to follow a log file, such as a web server log file, and compute various real time statistics by parsing the input as it comes in. Ideally with no interruptions if the log file is rotated with logrotate or similar service.
I'm somewhat at loss on how to even approach the problem and what should I take into consideration in terms of performance in presence of lazy I/O. Would any of the streaming libraries be relevant here?

This is a partial answer, as it doesn't handle file truncation by logrotate. It avoids lazy I/O and uses the bytestring, streaming, streaming-bytestring and hinotify packages.
Some preliminary imports:
{-# language OverloadedStrings #-}
module Main where
import qualified Data.ByteString
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import qualified Data.ByteString.Streaming as B
import Streaming
import qualified Streaming.Prelude as S
import Control.Concurrent.QSem
import System.INotify
import System.IO (withFile,IOMode(ReadMode))
import System.Environment (getArgs)
Here's the "tailing" function:
tailing :: FilePath -> (B.ByteString IO () -> IO r) -> IO r
tailing filepath continuation = withINotify $ \i -> do
    sem <- newQSem 1
    addWatch i [Modify] filepath (\_ -> signalQSem sem)
    withFile filepath ReadMode (\h -> continuation (handleToStream sem h))
  where
    handleToStream sem h = B.concat . Streaming.repeats $ do
        lift (waitQSem sem)
        readWithoutClosing h
    -- Can't use B.fromHandle here because annoyingly it closes handle on EOF
    -- instead of just returning, and this causes problems on new appends.
    readWithoutClosing h = do
        c <- lift (Data.ByteString.hGetSome h defaultChunkSize)
        if Data.ByteString.null c
            then return ()
            else do B.chunk c
                    readWithoutClosing h
It takes a file path and a callback that consumes a streaming bytestring.
The idea is that, each time before reading from the handle until EOF, we decrement a semaphore, which is only increased by the callback that is invoked when the file is modified.
We can test the function like this:
main :: IO ()
main = do
    filepath : _ <- getArgs
    tailing filepath B.stdout
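The answer above leaves truncation unhandled. One possible way to detect it, sketched under the assumption that the log is rotated by copying and truncating in place: compare the handle's read position with the current file size, and rewind when the file has become shorter. The names wasTruncated and rewindIfTruncated are hypothetical; the idea would be to call rewindIfTruncated on the handle just before readWithoutClosing. Rotation by rename-and-recreate is not covered, since that requires reopening the path.

```haskell
import Control.Monad (when)
import System.IO (Handle, SeekMode (AbsoluteSeek), hFileSize, hSeek, hTell)

-- | True when the file is now shorter than the position already read,
-- which is how an in-place truncation appears to a reader.
wasTruncated :: Integer -> Integer -> Bool
wasTruncated readPos fileSize = fileSize < readPos

-- | Rewind the handle to the start if the file shrank below the read
-- position; report whether a rewind happened. (Hypothetical helper.)
rewindIfTruncated :: Handle -> IO Bool
rewindIfTruncated h = do
  pos  <- hTell h
  size <- hFileSize h
  let truncated = wasTruncated pos size
  when truncated (hSeek h AbsoluteSeek 0)
  pure truncated

main :: IO ()
main = print (wasTruncated 100 10, wasTruncated 10 100)
-- prints (True,False)
```

A file that is truncated and then grows past the old read position between two checks would go unnoticed by this test, so it is a heuristic rather than a guarantee.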

How to pipe output from an IO action into a process in haskell

I want to create a process and write some text from my haskell program into the process's stdin periodically (from an IO action).
The following works correctly in GHCi but doesn't work correctly when built and run. In GHCi everything works perfectly and the value from the IO action is fed in periodically. When built and run, however, it seems to pause for arbitrarily long periods of time when writing to the stdin of the process.
I've used CreateProcess (from System.Process) to create the handle and tried hPutStrLn (buffering set to NoBuffering; LineBuffering didn't work either).
So I'm trying the process-streaming package and pipes but can't seem to get anything to work at all.
The real question is this: How do I create a process from Haskell and write to it periodically?
Minimal example that exhibits this behavior:
import System.Process
import System.IO (hSetBuffering, BufferMode(..))
import Data.IORef
import qualified Data.Text as T -- from the text package
import qualified Data.Text.IO as TIO
import Control.Concurrent.Timer -- from the timers package
import Control.Concurrent.Suspend -- from the suspend package
main = do
    (Just hin, _,_,_) <- createProcess_ "bgProcess" $
        (System.Process.proc "grep" ["10"]) { std_in = CreatePipe }
    ref <- newIORef 0 :: IO (IORef Int)
    flip repeatedTimer (msDelay 1000) $ do
        x <- atomicModifyIORef' ref $ \x -> (x + 1, x)
        hSetBuffering hin NoBuffering
        TIO.hPutStrLn hin $ T.pack $ show x
Any help will be greatly appreciated.

This is a pipes Producer that emits a sequence of numbers with a one-second delay between them:
{-# language NumDecimals #-}
import Control.Concurrent
import Pipes
import qualified Data.ByteString.Char8 as Bytes
periodic :: Producer Bytes.ByteString IO ()
periodic = go 0
  where
    go n = do
        d <- liftIO (pure (Bytes.pack (show n ++ "\n"))) -- put your IO action here
        Pipes.yield d
        liftIO (threadDelay 1e6)
        go (succ n)
And, using process-streaming, we can feed the producer to an external process like this:
import System.Process.Streaming
main :: IO ()
main = do
    executeInteractive (shell "grep 10"){ std_in = CreatePipe } (feedProducer periodic)
I used executeInteractive, which sets std_in automatically to NoBuffering.
Also, if you pipe std_out and want to process each match immediately, be sure to pass the --line-buffered option to grep (or use the stdbuf command) to ensure that matches are immediately available at the output.

What about using threadDelay, e.g.:
import Control.Monad (forever)
import Control.Concurrent (threadDelay)
...
forever $ do
    x <- atomicModifyIORef' ref $ \x -> (x + 1, x)
    hSetBuffering hin NoBuffering
    TIO.hPutStrLn hin $ T.pack $ show x
    threadDelay 1000000 -- 1 sec
Spawn this off in another thread if you need to do other work at the same time.
You can remove the need for the IORef with:
loop h x = do
    hSetBuffering h NoBuffering
    TIO.hPutStrLn h $ T.pack $ show x
    threadDelay 1000000
    loop h (x+1)
And, of course, you only need to do the hSetBuffering once - e.g. do it just before you enter the loop.
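Putting those pieces together, a minimal complete sketch might look like the following. It assumes only the process package; feedLines is a hypothetical helper, the list is kept finite so the program terminates, and --line-buffered is passed to grep so that matches show up as they occur.

```haskell
import Control.Concurrent (threadDelay)
import System.IO (BufferMode (..), Handle, hClose, hPutStrLn, hSetBuffering)
import System.Process
  (CreateProcess (..), StdStream (..), createProcess, proc, waitForProcess)

-- | Write each value on its own line, pausing delayMicros microseconds
-- after every write. (Hypothetical helper.)
feedLines :: Handle -> Int -> [Int] -> IO ()
feedLines h delayMicros = mapM_ $ \x -> do
  hPutStrLn h (show x)
  threadDelay delayMicros

main :: IO ()
main = do
  (Just hin, _, _, ph) <-
    createProcess (proc "grep" ["--line-buffered", "10"]) { std_in = CreatePipe }
  hSetBuffering hin NoBuffering      -- set once, before the loop
  feedLines hin 1000000 [0 .. 20]    -- one number per second
  hClose hin                         -- signals end of input to grep
  _ <- waitForProcess ph
  pure ()
```

Closing the handle and waiting for the process at the end keeps the child from being left running when main returns.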

Idiomatic io-streams directory traversal

I was discussing some code on Reddit, and it made me curious about how this would be implemented in io-streams. Consider the following code which traverses a directory structure and prints out all of the filenames:
import Control.Exception (bracket)
import qualified Data.Foldable as F
import Data.Streaming.Filesystem (FileType (..), closeDirStream,
                                  getFileType, openDirStream,
                                  readDirStream)
import System.Environment (getArgs)
import System.FilePath ((</>))
printFiles :: FilePath -> IO ()
printFiles dir = bracket
    (openDirStream dir)
    closeDirStream
    loop
  where
    loop ds = do
        mfp <- readDirStream ds
        F.forM_ mfp $ \fp' -> do
            let fp = dir </> fp'
            ftype <- getFileType fp
            case ftype of
                FTFile -> putStrLn fp
                FTFileSym -> putStrLn fp
                FTDirectory -> printFiles fp
                _ -> return ()
            loop ds
main :: IO ()
main = getArgs >>= mapM_ printFiles
Instead of simply printing the files, suppose we wanted to create some kind of streaming filepath representation. I know how this would work in enumerator, conduit, and pipes. However, since the intermediate steps require acquisition of a scarce resource (the DirStream), I'm not sure what the implementation would be for io-streams. Can someone provide an example of how that would be done?
For comparison, here's the conduit implementation, which is made possible via bracketP and MonadResource. And here's how the conduit code would be used to implement the same file printing program as above:
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (($$))
import Data.Conduit.Filesystem (sourceDirectoryDeep)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)
main :: IO ()
main =
    getArgs >>= runResourceT . mapM_ eachRoot
  where
    -- False means don't traverse dir symlinks
    eachRoot root = sourceDirectoryDeep False root
                 $$ CL.mapM_ (liftIO . putStrLn)

Typical style would be to do something like this:
traverseDirectory :: RawFilePath -> (InputStream RawFilePath -> IO a) -> IO a
i.e. a standard "with-" function, with the obvious implementation.
Edit: added a working example implementation: https://gist.github.com/gregorycollins/00c51e7e33cf1f9c8cc0
It's not exactly complicated but it's also not as trivial as I had first suggested.
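As a rough illustration of the "with-" shape only, here is a sketch for a single directory level, without the recursion that the linked gist handles. It assumes the io-streams and streaming-commons packages and uses FilePath rather than RawFilePath; withDirEntries is a hypothetical name, not the function from the gist.

```haskell
import Control.Exception (bracket)
import Data.Streaming.Filesystem (closeDirStream, openDirStream, readDirStream)
import System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- | Expose the entries of one directory as an InputStream. The DirStream
-- is only valid inside the callback; bracket closes it afterwards, even
-- if the callback throws. (Hypothetical helper, non-recursive.)
withDirEntries :: FilePath -> (InputStream FilePath -> IO a) -> IO a
withDirEntries dir k =
  bracket (openDirStream dir) closeDirStream $ \ds ->
    Streams.makeInputStream (readDirStream ds) >>= k

main :: IO ()
main = withDirEntries "." $ \is ->
  Streams.mapM_ putStrLn is >>= Streams.skipToEof
```

The callback style is what keeps the scarce resource scoped: the stream must not escape the callback, because the underlying DirStream is closed as soon as it returns.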

Why is main executable busy when calling external executable with Shelly.Background

I am trying to write a program with Shelly to compile Delphi projects in parallel. I expected the program to block while waiting for the Delphi compilers to return, but it starts to max out one CPU core after compiling two projects. I can't work out what it's so busy doing. Please help. Thanks.
ps: I am quite new to Haskell, if I'm not implementing this the right way, pointers are appreciated.
{-# LANGUAGE OverloadedStrings, ExtendedDefaultRules #-}
{-# OPTIONS_GHC -fno-warn-type-defaults #-}
import Shelly
import Shelly.Background
import Control.Monad
import Control.Arrow
import System.IO.Temp (withSystemTempDirectory)
import System.Directory (getCurrentDirectory)
import System.FilePath (splitFileName)
import Data.Text.Lazy (Text, pack)
default (Int, Text)
dcc32 = command "dcc32" ["-RC:\\Program Files\\Borland\\BDS\\4.0\\Lib", "-Q", "-H", "-W", "-B"]
compile project = liftIO $ withSystemTempDirectory "TempDCU_" compile'
  where
    compile' tmpDir = shellyNoDir $ silently $
        chdir dir (dcc32 [toTextIgnore file, pack $ "-N" ++ tmpDir])
    (dir, file) = mapTuple (fromText . pack) $ splitFileName project
    mapTuple = join (***)
compilePooled n projects = shellyNoDir $ jobs n (\job -> mapM (background job . compile) projects)
projectList = [
    "C:\\Path\\to\\project1.dpr",
    "C:\\Path\\to\\project2.dpr",
    "C:\\Path\\to\\project3.dpr",
    "C:\\Path\\to\\project4.dpr"
    ]
main = do
    output <- compilePooled 2 projectList
    shellyNoDir $ mapM getBgResult output >>= mapM_ inspect
