How to override function in Codec.Archive.Tar - haskell

Haskell noob here. I have a question specifically regarding how to use an existing library that may lead to some more fundamental aspects of the proper use of Haskell.
I'm learning Haskell and have a small project in mind to work on while I learn. The script will need to find all the tarballs in a given directory and unpack them in parallel. At this point, I'm working on the basic functionality of unpacking. So, using the Codec.Archive.Tar package, how can I override its behavior regarding tarballs with fully qualified paths?
Here's some example code:
module Main where
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Compression.GZip as GZip
import Control.Monad (liftM, unless)
import qualified Data.ByteString.Lazy as BS
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.Exit (exitWith, ExitCode(..))
import System.FilePath.Posix (takeExtension)
searchPath = "/home/someuser/tarball/dir"
exit = exitWith ExitSuccess
die = exitWith (ExitFailure 1)
processFile :: String -> IO ()
processFile file = do
putStrLn $ "Unpacking " ++ file ++ " to " ++ searchPath
Tar.unpack searchPath . Tar.read . GZip.decompress =<< BS.readFile filePath
where filePath = searchPath ++ "/" ++ file
main = do
dirExists <- doesDirectoryExist searchPath
unless dirExists $ (putStrLn $ "Error: Search path not found: " ++ searchPath) >> die
files <- targetFiles `liftM` getDirectoryContents searchPath
mapM_ processFile files
exit
where targetFiles = filter (\f -> f /= "." && f /= ".." && takeExtension f == ".tgz")
When I run this in a directory with tarballs that were packed with:
tar czvPf myfile.tgz /tarball_testing/myfile
I get the following output:
Unpacking myfile.tgz to /tarball_testing
unpacker.hs: Absolute file name in tar archive: "/tarball_testing/myfile"
The second line is the issue. Reading the docs for Codec.Archive.Tar I don't see a way to disable this functionality (not interested in discussions of why I want to use full paths in tarballs, or the relative security implications of doing so).
The first thing that comes to mind is that I somehow need to override the function but that doesn't "feel" like the way a pro Haskeller would do it. Can I get a pointer in the right direction?

You cannot monkey patch or otherwise override a function from a Haskell module, and therefore no workaround will let you avoid the safety measures of the library. What you can do, however, is use the functionality in Codec.Archive.Tar to modify the tar entry paths before unpacking so that they won't be absolute any more. Specifically, there is a mapEntriesNoFail function with type
mapEntriesNoFail :: (Entry -> Entry) -> Entries e -> Entries e
Entries is the type of the argument to Tar.unpack, while Entry is the type of an individual entry. Thanks to mapEntriesNoFail, our problem becomes writing an Entry -> Entry function to adjust the paths. For that, first we will need some extra imports:
import qualified Codec.Archive.Tar.Entry as Tar
import System.FilePath.Posix (takeExtension, dropDrive, hasTrailingPathSeparator)
import Data.Either (either)
The function can look like this:
dropDriveFromEntry :: Tar.Entry -> Tar.Entry
dropDriveFromEntry entry =
either (error "Resulting tar path is somehow too long")
(\tp -> entry { Tar.entryTarPath = tp })
drivelessTarPath
where
tarPath = Tar.entryTarPath entry
path = Tar.fromTarPath tarPath
toTarPath' p = Tar.toTarPath (hasTrailingPathSeparator p) p
drivelessTarPath = toTarPath' $ dropDrive path
This may seem a little long-winded; however, the hoops we jump through are there to ensure the resulting tar paths are sane. You can read about the gory details of tar handling on the Codec.Archive.Tar.Entry documentation. The key function in this definition is dropDrive, which makes an absolute path relative (in Linux, it strips the leading slash of an absolute path).
It is worth spending a few words on the use of either. toTarPath produces a value of type Either String TarPath to account for the possibility of failure. Specifically, the conversion to a tar path fails if the provided path is too long. In our case, however, the path cannot be too long, as it is a path which already was in a tar file, perhaps with a removed leading slash. That being so, it is good enough to eliminate the Either wrapping with either, passing an error instead of the function to handle the (impossible) Left case.
With dropDriveFromEntry in hand, we just have to map it over the entries before unpacking. The relevant line of your program would become:
Tar.unpack searchPath . Tar.mapEntriesNoFail dropDriveFromEntry
. Tar.read . GZip.decompress =<< BS.readFile filePath
Note that if there were relevant errors to be accounted for in dropDriveFromEntry, we would make it return Either String TarPath, and then use mapEntries instead of mapEntriesNoFail.
With these changes, the entry in your tar file will be extracted to /home/someuser/tarball/dir/tarball_testing/myfile. If that is not what you intended, you can modify dropDriveFromEntry so that it performs whatever extra path processing you need.
P.S.: Regarding the alternate title of your question, and considering the sensible little program you have shown us, I do not think you should be worried :)

Related

main: Prelude.!!: index too large

I am trying to write a program that will copy a file and allow us to rename it, like a cp command in unix.
import System.IO
import System.Environment
import qualified Data.Text as T
import qualified Data.Text.IO as TI
main :: IO ()
main = do
args <- getArgs
let source = args !! 0
let dest = args !! 1
input <- TI.readFile source
TI.writeFile dest input
I tried this and got main: Prelude.!!: index too large
As the error says, this means that the index is too large, so that means that getArgs returns a list that contains less than two elements.
Using (!!) is however often not a good idea: there is no guarantee that the element exists, and it runs in 𝓞(k) to obtain the k-th element. You can work with:
main :: IO ()
main = do
args <- getArgs
case args of
(src:dst:_) -> do
input <- TI.readFile source
TI.writeFile dest input
_ -> putStrLn "You should provide a source and destination"
Also loading the entire content in a string is not a good idea, since the file can be larger than all (available) memory, thus crashing the system. Often copying is done through two file handlers: one that reads the source file, and one that writes to the target file, with only a small amount of memory used. Some file systems can also make copying more effective. For example by using two references to the same file, and only effectively making a copy if one of the two files is modified.

Recursively search directories for all files matching name criteria in Haskell

I'm relatively inexperienced in Haskell and I wanted to improve, so for a learning project of mine I have the following requirements:
I want to search starting from a specified top directory, not necessarily an absolute path.
I want to find all files of a given extension, say .md.
I want to not search hidden directories, say toplevel/.excluded.
I want to be able to ignore hidden files like gedit produces .filename.md.swp.
I want to end up with a complete list of files as the result of my function.
I searched all over SO. Here's what I have so far:
import qualified System.FilePath.Find as SFF
import qualified Filesystem.Path.CurrentOS as FP
srcFolderName = "src"
outFolderName = "output"
resFolderName = "res"
ffNotHidden :: SFF.FindClause Bool
ffNotHidden = SFF.fileName SFF./~? ".?*"
ffIsMD :: SFF.FindClause Bool
ffIsMD = SFF.extension SFF.==? ".md" SFF.&&? SFF.fileName SFF./~? ".?*"
findMarkdownSources :: FilePath -> IO [FilePath]
findMarkdownSources filePath = do
paths <- SFF.find ffNotHidden ffIsMD filePath
return paths
This doesn't work. printf-style debugging in "findMarkdownSources", I can verify that filePath is correct e.g. "/home/user/testdata" (print includes the ", in case that tells you something). The list paths is always empty. I'm absolutely certain there are markdown files in the directory I have specified (find /path/to/dir -name "*.md" finds them).
I therefore have some specific questions.
Is there a reason (filters incorrect) for example, why this code should not work?
There are a number of ways to do this in haskell. It seems there are at least six packages (fileman, system.directory, system.filepath.find) dedicated to this. Here's some questions where something like this is answered:
Streaming recursive descent of a directory in Haskell
Is there some directory walker in Haskell?
avoid recursion into specifc folder using filemanip
Each one has about three unique ways to achieve what I want to achieve, so, we're nearly at 10 ways to do it...
Is there a specific way I should be doing this? If so why? If it helps, once I have my file list, I'm going to walk the entire thing, open and parse each file.
If it helps, I'm reasonably comfortable with basic haskell, but you'll need to slow down if we start getting too heavy with monads and applicative functors (I don't use haskell enough for this to stay in my head). I find the haskell docs on hackage incomprehensible, though.
so, we're nearly at 10 ways to do it...
Here's yet another way to do it, using functions from the directory, filepath and extra packages, but not too much monad wizardry:
import Control.Monad (foldM)
import System.Directory (doesDirectoryExist, listDirectory) -- from "directory"
import System.FilePath ((</>), FilePath) -- from "filepath"
import Control.Monad.Extra (partitionM) -- from the "extra" package
traverseDir :: (FilePath -> Bool) -> (b -> FilePath -> IO b) -> b -> FilePath -> IO b
traverseDir validDir transition =
let go state dirPath =
do names <- listDirectory dirPath
let paths = map (dirPath </>) names
(dirPaths, filePaths) <- partitionM doesDirectoryExist paths
state' <- foldM transition state filePaths -- process current dir
foldM go state' (filter validDir dirPaths) -- process subdirs
in go
The idea is that the user passes a FilePath -> Bool function to filter unwanted directories; also an initial state b and a transition function b -> FilePath -> IO b that processes file names, updates the b state and possibly has some side effects. Notice that the type of the state is chosen by the caller, who might put useful things there.
If we only want to print file names as they are produced, we can do something like this:
traverseDir (\_ -> True) (\() path -> print path) () "/tmp/somedir"
We are using () as a dummy state because we don't really need it here.
If we want to accumulate the files into a list, we can do it like this:
traverseDir (\_ -> True) (\fs f -> pure (f : fs)) [] "/tmp/somedir"
And what if we want to filter some files? We would need to tweak the transition function we pass to traverseDir so that it ignores them.
I tested you code on my machine, and it seems to work fine. Here is some example data:
$ find test/data
test/data
test/data/look-a-md-file.md
test/data/another-dir
test/data/another-dir/shown.md
test/data/.not-shown.md
test/data/also-not-shown.md.bkp
test/data/.hidden
test/data/some-dir
test/data/some-dir/shown.md
test/data/some-dir/.ahother-hidden
test/data/some-dir/.ahother-hidden/im-hidden.md
Running your function will result in:
ghci> findMarkdownSources "test"
["test/data/another-dir/shown.md","test/data/look-a-md-file.md","test/data/some-dir/shown.md"]
I've tested this with an absolute path, and it also works. Are you sure you have passed a valid path? You'll get an empty list if that is the case (although you also get a warning).
Note that your code could be simplified as follows:
module Traversals.FileManip where
import Data.List (isPrefixOf)
import System.FilePath.Find (always, extension, fileName, find, (&&?),
(/~?), (==?))
findMdSources :: FilePath -> IO [FilePath]
findMdSources fp = find isVisible (isMdFile &&? isVisible) fp
where
isMdFile = extension ==? ".md"
isVisible = fileName /~? ".?*"
And you can even remove the fp parameter, but I'm leaving it here for the sake of clarity.
I prefer to import explicitly so that I know where each function comes from (since I don't know of any Haskell IDE with advanced symbol navigation).
However, note that this solution uses uses unsafe interleave IO, which is not recommended.
So regarding your questions 2 and 3, I would recommend a streaming solution, like pipes or conduits. Sticking to these kind of solutions will reduce your options (just like sticking to pure functional programming languages reduced my options for programming languages ;)). Here you have an example on how pipes can be used to walk a directory.
Here is the code in case you want to try this out.

How to delete unbuildable goal?

I would like to remove files that no longer have source but without cleaning.
Is there support for partially cleaning an incremental build? In this case, I guess I could compare against set of source files that were consumed in previous builds and define how to clean those that are gone.
main = shakeArgs shakeOptions { shakeVerbosity = Diagnostic } $ do
want [".build"]
phony ".build" $ do
files <- getDirectoryFiles "." ["//*.txt"]
let goals = map (-<.> "") files
need goals
"*" %> \out -> do
Stdout o <- cmd $ "sort " ++ (out ++ ".txt")
writeFile' out o
Using shakeArgsPrune you can define a function that gets passed the live files afterwards. You can then write something like:
import Development.Shake
import Development.Shake.FilePath
import Development.Shake.Util
import System.Directory.Extra
import Data.List
import System.IO
pruner :: [FilePath] -> IO ()
pruner live = do
present <- listFilesRecursive "output"
mapM_ removeFile $ map toStandard present \\ map toStandard live
main :: IO ()
main = shakeArgsPrune shakeOptions pruner $ do
... rules go here ...
This deletes all files in output that are not generated and up-to-date according to the build system as it stands. For a complete example see
http://neilmitchell.blogspot.co.uk/2015/04/cleaning-stale-files-with-shake.html.
The shakeArgsPrune function is only available in shake-0.15.1 and above, but is based on the shakeLiveFiles feature which has been available for longer and can be used directly if you so desire.

Flexible number of arguments to haskell program

I am using the System.FilePath.Find module of filemanip to recursively find all files I need to process (here I will be using just printing to console as the action to perform, in order not confuse things). Now, this code:
import System.Environment (getArgs)
import System.FilePath (FilePath)
import System.Directory (doesDirectoryExist, getDirectoryContents,doesFileExist)
import Control.Monad
import System.FilePath.Find (find,always,fileType,(==?),FileType(..),(&&?),extension)
main= do
[dbFile,input]<- getArgs
files <- findFiles input
mapM_ putStrLn files
return ()
searchExtension :: String
searchExtension = ".hs"
findFiles :: FilePath -> IO [String]
findFiles = find (always) ( fileType ==? RegularFile &&? extension ==? searchExtension)
works well with this call
./myprog tet .
In this case, the get argument is ignored (will be the output database file later) and the second argument is searched recursively for matching files. It also allows me to specify just a single file, which is just perfect!
BUT, I would like to be able to specify
./myprog tet path1 path2 path4 file1
but this of course fails in the pattern matching:
./myprog tet . .
myprogt: user error (Pattern match failure in do expression at myprog.hs:11:9-22)
Now, how do I make this program more flexible, so that I can take more than two arguments?
Sorry for asking this, actually, but my Haskell knowledge is limited but increasing for every new thing I have to do in my first project.
Well, you can use a different pattern like:
(dbFile:inputs) <- getArgs
where dbFile will match the first argument passed while inputs will match any number of file names (even 0. If you want at least one path name use inputs#(_:_) instead of the simple inputs).
Then you can use mapM to call findFiles for each path in inputs:
files <- mapM findFiles input
mapM_ putStrLn $ concat files
Instead of mapM you could modify findFiles to accept a [FilePath] argument instead of a simple FilePath.
Note that to parse command arguments you could consider using some module like getopt. You should also read this page about argument handling.

How can I make file I/O more transactional?

I'm writing CGI scripts in Haskell. When the user hits ‘submit’, a Haskell program runs on the server, updating (i.e. reading in, processing, overwriting) a status file. Reading then overwriting sometimes causes issues with lazy IO, as we may be able to generate a large output prefix before we've finished reading the input. Worse, users sometimes bounce on the submit button and two instances of the process run concurrently, fighting over the same file!
What's a good way to implement
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
where the function (‘update’) computes the new file contents from the old file contents? It is not safe to presume that ‘update’ is strict, but it may be presumed that it is total (robustness to partial update functions is a bonus). Transactions may be attempted concurrently, but no transaction should be able to update if the file has been written by anyone else since it was read. It's ok for a transaction to abort in case of competition for file access. We may assume a source of systemwide-unique temporary filenames.
My current attempt writes to a temporary file, then uses a system copy command to overwrite. That seems to deal with the lazy IO problems, but it doesn't strike me as safe from races. Is there a tried and tested formula that we could just bottle?
The most idiomatic unixy way to do this is with flock:
http://hackage.haskell.org/package/flock
http://swoolley.org/man.cgi/2/flock
Here is a rough first cut that relies on the atomicity of the underlying mkdir. It seems to fulfill the specification, but I'm not sure how robust or fast it is:
import Control.DeepSeq
import Control.Exception
import System.Directory
import System.IO
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
transactionalUpdate file upd = bracket acquire release update
where
acquire = do
let lockName = file ++ ".lock"
createDirectory lockName
return lockName
release = removeDirectory
update _ = nonTransactionalUpdate file upd
nonTransactionalUpdate :: FilePath -> (String -> String) -> IO ()
nonTransactionalUpdate file upd = do
h <- openFile file ReadMode
s <- upd `fmap` hGetContents h
s `deepseq` hClose h
h <- openFile file WriteMode
hPutStr h s
hClose h
I tested this by adding the following main and throwing a threadDelay in the middle of nonTransactionalUpdate:
main = do
[n] <- getArgs
transactionalUpdate "foo.txt" ((show n ++ "\n") ++)
putStrLn $ "successfully updated " ++ show n
Then I compiled and ran a bunch of instances with this script:
#!/bin/bash
rm foo.txt
touch foo.txt
for i in {1..50}
do
./SO $i &
done
A process that printed a successful update message if and only if the corresponding number was in foo.txt; all the others printed the expected SO: foo.txt.notveryunique: createDirectory: already exists (File exists).
Update: You actually do not want to use unique names here; it must be a consistent name across the competing processes. I've updated the code accordingly.

Resources