Storing metadata for rules - haskell

I have a function that rebuilds a target whenever the associated command changes:
target :: FilePath -> [FilePath] -> String -> Rules ()
target dst deps cline = do
    let dcmd = dst <.> "x"
    dcmd %> \out -> do
        alwaysRerun
        writeFileChanged out cline
        return ()
    dst %> \out -> do
        c <- readFile' dcmd
        need deps
        () <- cmd $ "../dumpdeps/dumpdeps " ++ out ++ " " ++ c
        needMakefileDependencies $ out <.> "d"
        return ()
I would prefer not to touch the filesystem for this task. Is there any way to store the associated command line and trigger the final rule when that command changes?

I'd personally do this using the file system. File systems have great debuggers (ls and cat) and a large array of manipulation tools (echo, rm, touch). They also tend to be very fast for small files, living mostly in the cache. If you avoid the files, you tend to have less visibility of what's happening.
That said, there can be very good reasons to avoid the files. The closest thing Shake has to your pattern above is to use an Oracle. Note that doesn't quite match the pattern you are doing, as it assumes the cline can be computed from dst, which might be possible in your case or might not.
If Oracle doesn't suit you, you can define your own Rule instance which can operate much the same way as a file, but store the values in the Shake database instead.
(If either of those cause any difficulties, let me know which is the relevant one to your question, and I'll expand the right one.)
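A minimal sketch of the Oracle route, assuming the command line can be looked up from the target name. The CmdLine key type, the Target table, and the commandFor and targets helpers are hypothetical names introduced for illustration; addOracle and RuleResult come from Shake itself.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving, TypeFamilies #-}
import Development.Shake
import Development.Shake.Classes (Binary, Hashable, NFData)
import Development.Shake.FilePath ((<.>))

-- Hypothetical oracle key: "what is the command line for this target?"
newtype CmdLine = CmdLine FilePath
  deriving (Show, Eq, Hashable, Binary, NFData)

type instance RuleResult CmdLine = String

-- Hypothetical table of targets: (output file, dependencies, command line).
type Target = (FilePath, [FilePath], String)

-- Pure lookup used by the oracle; returns "" for an unknown target.
commandFor :: [Target] -> FilePath -> String
commandFor ts dst = head ([c | (d, _, c) <- ts, d == dst] ++ [""])

targets :: [Target] -> Rules ()
targets ts = do
  -- The answer lives in the Shake database rather than in a file. Rules
  -- that ask the oracle rerun when the answer differs from the last build.
  askCmd <- addOracle $ \(CmdLine dst) -> return (commandFor ts dst)
  mapM_ (rule askCmd) ts
  where
    rule askCmd (dst, deps, _) = dst %> \out -> do
      c <- askCmd (CmdLine out)
      need deps
      () <- cmd $ "../dumpdeps/dumpdeps " ++ out ++ " " ++ c
      needMakefileDependencies $ out <.> "d"
```

The trade-off mentioned above still applies: the stored command line is no longer visible with ls and cat.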

Related

Recursively search directories for all files matching name criteria in Haskell

I'm relatively inexperienced in Haskell and I wanted to improve, so for a learning project of mine I have the following requirements:
I want to search starting from a specified top directory, not necessarily an absolute path.
I want to find all files of a given extension, say .md.
I want to not search hidden directories, say toplevel/.excluded.
I want to be able to ignore hidden files like gedit produces .filename.md.swp.
I want to end up with a complete list of files as the result of my function.
I searched all over SO. Here's what I have so far:
import qualified System.FilePath.Find as SFF
import qualified Filesystem.Path.CurrentOS as FP
srcFolderName = "src"
outFolderName = "output"
resFolderName = "res"
ffNotHidden :: SFF.FindClause Bool
ffNotHidden = SFF.fileName SFF./~? ".?*"
ffIsMD :: SFF.FindClause Bool
ffIsMD = SFF.extension SFF.==? ".md" SFF.&&? SFF.fileName SFF./~? ".?*"
findMarkdownSources :: FilePath -> IO [FilePath]
findMarkdownSources filePath = do
    paths <- SFF.find ffNotHidden ffIsMD filePath
    return paths
This doesn't work. Using printf-style debugging in "findMarkdownSources", I can verify that filePath is correct, e.g. "/home/user/testdata" (print includes the ", in case that tells you something). The list paths is always empty. I'm absolutely certain there are markdown files in the directory I have specified (find /path/to/dir -name "*.md" finds them).
I therefore have some specific questions.
Is there a reason (filters incorrect) for example, why this code should not work?
There are a number of ways to do this in Haskell. It seems there are at least six packages (fileman, system.directory, system.filepath.find) dedicated to this. Here are some questions where something like this is answered:
Streaming recursive descent of a directory in Haskell
Is there some directory walker in Haskell?
avoid recursion into specifc folder using filemanip
Each one has about three unique ways to achieve what I want to achieve, so, we're nearly at 10 ways to do it...
Is there a specific way I should be doing this? If so why? If it helps, once I have my file list, I'm going to walk the entire thing, open and parse each file.
If it helps, I'm reasonably comfortable with basic haskell, but you'll need to slow down if we start getting too heavy with monads and applicative functors (I don't use haskell enough for this to stay in my head). I find the haskell docs on hackage incomprehensible, though.
so, we're nearly at 10 ways to do it...
Here's yet another way to do it, using functions from the directory, filepath and extra packages, but not too much monad wizardry:
import Control.Monad (foldM)
import System.Directory (doesDirectoryExist, listDirectory) -- from "directory"
import System.FilePath ((</>), FilePath) -- from "filepath"
import Control.Monad.Extra (partitionM) -- from the "extra" package
traverseDir :: (FilePath -> Bool) -> (b -> FilePath -> IO b) -> b -> FilePath -> IO b
traverseDir validDir transition =
    let go state dirPath =
            do names <- listDirectory dirPath
               let paths = map (dirPath </>) names
               (dirPaths, filePaths) <- partitionM doesDirectoryExist paths
               state' <- foldM transition state filePaths -- process current dir
               foldM go state' (filter validDir dirPaths) -- process subdirs
     in go
The idea is that the user passes a FilePath -> Bool function to filter unwanted directories; also an initial state b and a transition function b -> FilePath -> IO b that processes file names, updates the b state and possibly has some side effects. Notice that the type of the state is chosen by the caller, who might put useful things there.
If we only want to print file names as they are produced, we can do something like this:
traverseDir (\_ -> True) (\() path -> print path) () "/tmp/somedir"
We are using () as a dummy state because we don't really need it here.
If we want to accumulate the files into a list, we can do it like this:
traverseDir (\_ -> True) (\fs f -> pure (f : fs)) [] "/tmp/somedir"
And what if we want to filter some files? We would need to tweak the transition function we pass to traverseDir so that it ignores them.
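One possible way to do that filtering is sketched here. It repeats traverseDir in a form that needs only the directory and filepath packages, so the block stands alone; isHidden, isVisibleMarkdown and findMarkdown are hypothetical helper names, not part of any library.

```haskell
import Control.Monad (foldM)
import Data.List (isPrefixOf)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>), takeExtension, takeFileName)

-- Same shape as traverseDir above, without the "extra" dependency.
traverseDir :: (FilePath -> Bool) -> (b -> FilePath -> IO b) -> b -> FilePath -> IO b
traverseDir validDir transition = go
  where
    go state dirPath = do
      names <- listDirectory dirPath
      let paths = map (dirPath </>) names
      isDir <- mapM doesDirectoryExist paths
      let dirPaths  = [p | (p, True)  <- zip paths isDir]
          filePaths = [p | (p, False) <- zip paths isDir]
      state' <- foldM transition state filePaths    -- process current dir
      foldM go state' (filter validDir dirPaths)    -- process subdirs

-- A path is hidden when its last component starts with a dot.
isHidden :: FilePath -> Bool
isHidden = ("." `isPrefixOf`) . takeFileName

isVisibleMarkdown :: FilePath -> Bool
isVisibleMarkdown p = takeExtension p == ".md" && not (isHidden p)

-- Skip hidden directories; the transition function keeps only wanted files.
findMarkdown :: FilePath -> IO [FilePath]
findMarkdown = traverseDir (not . isHidden) step []
  where
    step acc p
      | isVisibleMarkdown p = pure (p : acc)
      | otherwise           = pure acc
```

The directory filter is applied to subdirectories only, so the starting directory is always searched, even if its own name starts with a dot.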
I tested your code on my machine, and it seems to work fine. Here is some example data:
$ find test/data
test/data
test/data/look-a-md-file.md
test/data/another-dir
test/data/another-dir/shown.md
test/data/.not-shown.md
test/data/also-not-shown.md.bkp
test/data/.hidden
test/data/some-dir
test/data/some-dir/shown.md
test/data/some-dir/.ahother-hidden
test/data/some-dir/.ahother-hidden/im-hidden.md
Running your function will result in:
ghci> findMarkdownSources "test"
["test/data/another-dir/shown.md","test/data/look-a-md-file.md","test/data/some-dir/shown.md"]
I've tested this with an absolute path, and it also works. Are you sure you have passed a valid path? You'll get an empty list if the path is not valid (although you also get a warning).
Note that your code could be simplified as follows:
module Traversals.FileManip where
import Data.List (isPrefixOf)
import System.FilePath.Find (always, extension, fileName, find, (&&?),
                             (/~?), (==?))
findMdSources :: FilePath -> IO [FilePath]
findMdSources fp = find isVisible (isMdFile &&? isVisible) fp
    where
      isMdFile = extension ==? ".md"
      isVisible = fileName /~? ".?*"
And you can even remove the fp parameter, but I'm leaving it here for the sake of clarity.
I prefer to import explicitly so that I know where each function comes from (since I don't know of any Haskell IDE with advanced symbol navigation).
However, note that this solution uses unsafeInterleaveIO, which is not recommended.
So regarding your questions 2 and 3, I would recommend a streaming solution, like pipes or conduits. Sticking to this kind of solution will reduce your options (just like sticking to pure functional programming languages reduced my options for programming languages ;)). Here you have an example on how pipes can be used to walk a directory.
Here is the code in case you want to try this out.
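As a rough illustration of the streaming style, here is a sketch that uses conduit rather than pipes. isVisibleMarkdown and printMarkdown are hypothetical names; sourceDirectoryDeep, filterC and mapM_C come from the conduit package. Note one assumption that differs from the filemanip version: this walks into hidden directories and only filters their files out afterwards.

```haskell
import Conduit
import Data.List (isPrefixOf)
import System.FilePath (splitDirectories, takeExtension)

-- True for *.md files with no hidden component anywhere in the path.
isVisibleMarkdown :: FilePath -> Bool
isVisibleMarkdown p =
  takeExtension p == ".md" && not (any hidden (splitDirectories p))
  where
    hidden part = "." `isPrefixOf` part && part `notElem` [".", ".."]

-- Stream every file under the root and print the ones that pass the filter.
-- Paths are produced one at a time instead of being collected into a list.
printMarkdown :: FilePath -> IO ()
printMarkdown root = runConduitRes $
  sourceDirectoryDeep False root      -- False: do not follow symlinks
    .| filterC isVisibleMarkdown
    .| mapM_C (liftIO . putStrLn)
```

Replacing the last stage with sinkList would give back a complete list when that is what the caller needs.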

Conduit Exception

I couldn't figure out how to make sourceDirectory and catchC work.
src = (sourceDirectory "/does/not/exist/input.txt" $$ C.print) `catchC` \e ->
    yield (pack $ "Could not read input file: " ++ show (e :: IOException))
The idea is that I use sourceDirectory to walk a directory tree and in case of failure I want the program to continue and not stop.
The catchC function works on individual components of a pipeline, like sourceDirectory "somedir" (in other words, things of type ConduitM). You've applied it to a completely run pipeline, which is just a normal action, and therefore catchC won't work. Your choices are:
Apply catchC to the individual component, e.g. (sourceDirectory "foo" `catchC` handler) $$ printC
Use a non-conduit-specific catch function (such as from safe-exceptions), e.g. (sourceDirectory "foo" $$ printC) `catch` handler.
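Both choices can be sketched as follows, written against the current conduit API (runConduitRes and .| instead of $$). listDirGuarded and listDirOrEmpty are hypothetical names, and the second variant uses catch from Control.Exception in base rather than safe-exceptions.

```haskell
import Conduit
import Control.Exception (IOException, catch)

-- Choice 1: catchC wraps a single component, so the pipeline itself keeps
-- running and the sink simply receives no file names on failure.
listDirGuarded :: FilePath -> IO [FilePath]
listDirGuarded dir = runConduitRes $
  (sourceDirectory dir `catchC` \e ->
      liftIO (putStrLn ("Could not read directory: " ++ show (e :: IOException))))
    .| sinkList

-- Choice 2: an ordinary catch around the fully run pipeline, which is
-- just a normal IO action at that point.
listDirOrEmpty :: FilePath -> IO [FilePath]
listDirOrEmpty dir =
  runConduitRes (sourceDirectory dir .| sinkList) `catch` \e -> do
    putStrLn ("Could not read directory: " ++ show (e :: IOException))
    pure []
```

In both variants the program continues after a missing directory instead of stopping.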
Also, a recommendation for the future: it's a good idea to include the compiler error when some code won't build.

Unsafe IO Or: Haskeline and Directories

DISCLAIMER: I am somewhat new to Haskell.
I am writing an interpreter, or, in this context, a REPL. For that purpose I am using haskeline, which is nice for REPLs. It has the capability of storing the command line history within a file, which is also nice.
One problem I came across while working with it, though, is that it does not seem to expand "~" to the home directory, which means that I have to retrieve the home directory manually.
I could do it like this (and currently do):
-- | returns a fresh settings variable
addSettings :: Env -> Settings IO
addSettings env = Settings { historyFile = Just getDir
                           , complete = completeWord Nothing " \t" $
                                        return . completionSearch env
                           , autoAddHistory = True
                           }
    where
        getDir :: FilePath
        getDir = unsafePerformIO getHomeDirectory ++ "/.zepto_history"
But that uses unsafePerformIO, which makes me cringe. Do you know of a good and clean workaround that does not involve rewriting the whole function? This can be a haskeline feature I do not know of or something I just did not see.
Telling me there is no way around rewriting and rethinking it all is fine, too.
EDIT:
I know unsafePerformIO is bad, that's why it makes me cringe. If you are new to Haskell and reading this question right now: Just pretend it is not there.
A better approach would be to generate the Settings object inside IO, instead of the other way around, so to speak:
addSettings :: Env -> IO (Settings IO)
addSettings env = do
    getDir <- fmap (++ "/.zepto_history") getHomeDirectory
    return $ Settings
        { historyFile = Just getDir
        , complete = completeWord Nothing " \t" $ return . completionSearch env
        , autoAddHistory = True
        }
This will no doubt require some changes in your current software, but this would be considered the "right" way to go about this.
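The change on the calling side is usually small. A minimal sketch, assuming the REPL is started from IO: mkSettings and repl are hypothetical names, the completion function is left at haskeline's default because Env and completionSearch are not shown here, and the echo in the loop is only a placeholder for real evaluation.

```haskell
import System.Console.Haskeline
import System.Directory (getHomeDirectory)
import System.FilePath ((</>))

-- Build the settings inside IO, so no unsafePerformIO is needed.
mkSettings :: IO (Settings IO)
mkSettings = do
  home <- getHomeDirectory
  pure defaultSettings { historyFile = Just (home </> ".zepto_history") }

-- Fetch the settings first, then hand them to runInputT.
repl :: IO ()
repl = do
  settings <- mkSettings
  runInputT settings loop
  where
    loop :: InputT IO ()
    loop = do
      input <- getInputLine "zepto> "
      case input of
        Nothing   -> pure ()                     -- EOF ends the session
        Just line -> outputStrLn line >> loop    -- placeholder: echo the input
```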

Recover the source file name in a Shake rule

I am writing a build system for a static website which works like this:
for every file src/123-some-title.txt produce a file out/123.html
My problem is that when writing the rule for out/*.html I have no direct way to recover the source file name (src/123-some-title.txt) from the target file name (out/123.html).
Of course I could read the src/ directory again and search for a file that starts with 123, but is there a nicer way to do this with Shake?
The first thing to mention is that if you call getDirectoryFiles multiple times with the same arguments it will only calculate once, in the same way that if you call need multiple times on the same file it will only build once. One approach would be:
"out/*.fwd" %> \out -> do
    res <- getDirectoryFiles "src" ["*.txt"]
    -- keep the source files whose name starts with the target's id, e.g. "123-"
    let match = [x | x <- res, (takeBaseName out ++ "-") `isPrefixOf` takeBaseName x]
    when (length match /= 1) $ error "fail, because wrong number of matches"
    writeFileChanged out $ head match
"out/*.html" %> \out -> do
    src <- readFile' (out -<.> "fwd")
    txt <- readFile' ("src" </> src)
    ...
Here the idea is that the file out/123.fwd contains the contents 123-some-title.txt. By using writeFileChanged we only change the .fwd file when the relevant part of the directory changes.
If you want to avoid the .fwd files, you can use the Oracle mechanism. If you want to avoid a linear scan of the getDirectoryFiles result you can use the newCache function. In practice, neither is likely to be problematic, and going with the files is probably simplest.
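For the newCache variant, a minimal sketch might look like this. sourceId, sourceIndex and rules are hypothetical names, and copying the text to the output is only a placeholder for the real HTML generation.

```haskell
import Development.Shake
import Development.Shake.FilePath (takeBaseName, (</>))
import qualified Data.Map as Map

-- "123-some-title.txt" -> "123"
sourceId :: FilePath -> String
sourceId = takeWhile (/= '-') . takeBaseName

rules :: Rules ()
rules = do
  -- Build the id -> source file map once per run, instead of scanning the
  -- directory listing separately in every out/*.html rule.
  sourceIndex <- newCache $ \() -> do
    files <- getDirectoryFiles "src" ["*.txt"]
    return (Map.fromList [(sourceId f, f) | f <- files])
  "out/*.html" %> \out -> do
    index <- sourceIndex ()
    case Map.lookup (takeBaseName out) index of
      Nothing  -> error ("no source file for " ++ out)
      Just src -> do
        txt <- readFile' ("src" </> src)
        writeFile' out txt   -- placeholder for the real HTML generation
```

Unlike the .fwd version, this sketch does not report an error when two source files share the same id; Map.fromList silently keeps the last one.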

How can I make file I/O more transactional?

I'm writing CGI scripts in Haskell. When the user hits ‘submit’, a Haskell program runs on the server, updating (i.e. reading in, processing, overwriting) a status file. Reading then overwriting sometimes causes issues with lazy IO, as we may be able to generate a large output prefix before we've finished reading the input. Worse, users sometimes bounce on the submit button and two instances of the process run concurrently, fighting over the same file!
What's a good way to implement
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
where the function (‘update’) computes the new file contents from the old file contents? It is not safe to presume that ‘update’ is strict, but it may be presumed that it is total (robustness to partial update functions is a bonus). Transactions may be attempted concurrently, but no transaction should be able to update if the file has been written by anyone else since it was read. It's ok for a transaction to abort in case of competition for file access. We may assume a source of systemwide-unique temporary filenames.
My current attempt writes to a temporary file, then uses a system copy command to overwrite. That seems to deal with the lazy IO problems, but it doesn't strike me as safe from races. Is there a tried and tested formula that we could just bottle?
The most idiomatic unixy way to do this is with flock:
http://hackage.haskell.org/package/flock
http://swoolley.org/man.cgi/2/flock
Here is a rough first cut that relies on the atomicity of the underlying mkdir. It seems to fulfill the specification, but I'm not sure how robust or fast it is:
import Control.DeepSeq
import Control.Exception
import System.Directory
import System.IO
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
transactionalUpdate file upd = bracket acquire release update
    where
      acquire = do
        let lockName = file ++ ".lock"
        createDirectory lockName
        return lockName
      release = removeDirectory
      update _ = nonTransactionalUpdate file upd
nonTransactionalUpdate :: FilePath -> (String -> String) -> IO ()
nonTransactionalUpdate file upd = do
    h <- openFile file ReadMode
    s <- upd `fmap` hGetContents h
    s `deepseq` hClose h
    h <- openFile file WriteMode
    hPutStr h s
    hClose h
I tested this by adding the following main and throwing a threadDelay in the middle of nonTransactionalUpdate:
main = do
    [n] <- getArgs
    transactionalUpdate "foo.txt" ((show n ++ "\n") ++)
    putStrLn $ "successfully updated " ++ show n
Then I compiled and ran a bunch of instances with this script:
#!/bin/bash
rm foo.txt
touch foo.txt
for i in {1..50}
do
    ./SO $i &
done
A process printed a successful update message if and only if the corresponding number was in foo.txt; all the others printed the expected SO: foo.txt.notveryunique: createDirectory: already exists (File exists).
Update: You actually do not want to use unique names here; it must be a consistent name across the competing processes. I've updated the code accordingly.
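A possible refinement, sketched under the assumption that the lock directory, the temporary file and the target all live on the same filesystem: write the new contents to a temporary file and rename it over the original, so a reader never sees a half-written file. atomicUpdate and the ".tmp" suffix are hypothetical choices; the fixed temporary name is only safe because the lock holder is the sole writer.

```haskell
import Control.Exception (bracket_, evaluate)
import System.Directory (createDirectory, removeDirectory, renameFile)

atomicUpdate :: FilePath -> (String -> String) -> IO ()
atomicUpdate file upd =
  -- createDirectory throws if the lock already exists, which aborts this
  -- transaction; the lock is removed again even if the update fails.
  bracket_ (createDirectory lock) (removeDirectory lock) $ do
    old <- readFile file
    _ <- evaluate (length old)   -- read all input so the lazy handle closes
    let new = upd old
    _ <- evaluate (length new)   -- force the output before writing anything
    writeFile tmp new
    renameFile tmp file          -- replace the old contents in one step
  where
    lock = file ++ ".lock"
    tmp  = file ++ ".tmp"
```

If the update function throws while its output is being forced, the original file is left untouched, because nothing has been written at that point.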
