I am writing a build system for a static website which works like this:
for every file src/123-some-title.txt produce a file out/123.html
My problem is that when writing the rule for out/*.html I have no direct way to recover the source file name (src/123-some-title.txt) from the target file name (out/123.html).
Of course I could read the src/ directory again and search for a file that starts with 123, but is there a nicer way to do this with Shake?
The first thing to mention is that if you call getDirectoryFiles multiple times with the same arguments it will only calculate once, in the same way that if you call need multiple times on the same file it will only build once. One approach would be:
"out/*.fwd" *> \out -> do
res <- getDirectoryFiles "src" ["*.txt"]
let match = [(takeBaseName out ++ "-") `isPrefixOf` takeBaseName x | x <- res]
when (length match /= 1) $ error "fail, because wrong number of matches"
writeFileChanged out $ head match
"out/*.html" *> \out -> do
src <- readFile' (out -<.> "fwd")
txt <- readFile' ("src" </> src)
...
Here the idea is that the file out/123.fwd contains the text 123-some-title.txt. By using writeFileChanged we only change the .fwd file when the relevant part of the directory listing changes.
If you want to avoid the .fwd files, you can use the Oracle mechanism. If you want to avoid a linear scan of the getDirectoryFiles result you can use the newCache function. In practice, neither is likely to be problematic, and going with the files is probably simplest.
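For illustration, a minimal sketch of the newCache variant (the index-building code is my assumption, not part of the answer above; it keys each source file by its numeric prefix):

import Development.Shake
import Development.Shake.FilePath
import qualified Data.Map as Map

htmlRules :: Rules ()
htmlRules = do
    -- the cached action runs at most once per build; every rule shares its result
    askIndex <- newCache $ \() -> do
        files <- getDirectoryFiles "src" ["*.txt"]
        -- e.g. "123-some-title.txt" is keyed by "123"
        return $ Map.fromList [(takeWhile (/= '-') f, f) | f <- files]
    "out/*.html" *> \out -> do
        index <- askIndex ()
        case Map.lookup (takeBaseName out) index of
            Nothing  -> error $ "no source file for " ++ out
            Just src -> do
                txt <- readFile' ("src" </> src)
                writeFileChanged out txt -- stand-in for the real conversion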
Related
I am currently going through the book Real World Haskell. One exercise asks the reader to implement file name matching with **, which behaves like * but also descends into subdirectories all the way down the file system. Below is a fragment of my code with comments (there is a lot of duplication at the moment); further down you can find additional information about it. I think the posted fragment is sufficient for the problem, so there is no need to list the whole program here.
case splitFileName pat of
    ("", baseName) -> do -- just the file name passed
        curDir <- getCurrentDirectory
        if searchSubDirs baseName -- check if the file name has `**` in it
            then do
                contents <- getDirectoryContents curDir
                subDirs <- filterM doesDirectoryExist contents
                let properSubDirs = filter (`notElem` [".", ".."]) subDirs
                subDirsNames <- forM properSubDirs $ \dir ->
                    namesMatching (curDir </> dir </> baseName) -- call the function recursively on subdirectories
                curDirNames <- listMatches curDir baseName -- list matches in the current directory
                return (curDirNames ++ concat subDirsNames) -- concatenate the results into a single list
            else listMatches curDir baseName
    (dirName, baseName) -> do -- full path passed
        if searchSubDirs baseName
            then do
                contents <- getDirectoryContents dirName
                subDirs <- filterM doesDirectoryExist contents
                let properSubDirs = filter (`notElem` [".", ".."]) subDirs
                subDirsNames <- forM properSubDirs $ \dir ->
                    namesMatching (dirName </> dir </> baseName) -- call the function recursively on subdirectories
                curDirNames <- listMatches dirName baseName -- list matches in the passed directory
                return (curDirNames ++ concat subDirsNames) -- concatenate the results into a single list
Additional information:
pat is the pattern I'm looking for (e.g. *.txt or C:\\A\[a-z].*).
splitFileName is a function which splits a file path into the directory path and the file name. The first element of the tuple will be empty if we specify just a file name in pat.
searchSubDirs returns True if the file name has ** in it.
listMatches returns a list of the file names in a directory that match the pattern, substituting ** for * (plausible sketches of both helpers follow this list).
namesMatching is the name of the function whose excerpt I posted.
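For concreteness, here are plausible definitions of the two helpers, consistent with the descriptions above (the real implementations are not in the excerpt, so treat these as assumptions; matchesGlob is the glob matcher developed earlier in the book's chapter):

import Data.List (isInfixOf)
import System.Directory (getDirectoryContents)

searchSubDirs :: String -> Bool
searchSubDirs name = "**" `isInfixOf` name

listMatches :: FilePath -> String -> IO [String]
listMatches dirName pat = do
    names <- getDirectoryContents dirName
    -- substitute ** for * before matching, as described above
    let collapse ('*':'*':rest) = '*' : collapse rest
        collapse (c:rest)       = c : collapse rest
        collapse []             = []
    return [n | n <- names, n `notElem` [".", ".."], n `matchesGlob` collapse pat]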
Why doesn't it work?
When I pass just a file name, the program searches for it only in the current directory and the first level of subdirectories. When I pass a full path, it searches only in the specified directory. It looks like the (dirName, baseName) branch doesn't recurse properly. I've been looking at the code for some time now and I can't figure out where the problem is.
Note
If any more information is needed, please let me know in the comments and I'll add whatever is necessary to the question.
Here's an issue:
contents <- getDirectoryContents dirName
subDirs <- filterM doesDirectoryExist contents
getDirectoryContents only returns the leaf names of the directories, so you have to prepend dirName (along with a /) to the elements of contents before calling doesDirectoryExist.
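That is, something along these lines (using </> from System.FilePath):

contents <- getDirectoryContents dirName
-- test existence relative to dirName, but keep the leaf names for the recursive calls
subDirs <- filterM (\d -> doesDirectoryExist (dirName </> d)) contents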
I have a function that rebuilds a target whenever the associated command changes:
target :: FilePath -> [FilePath] -> String -> Rules ()
target dst deps cline = do
    let dcmd = dst <.> "x"
    dcmd %> \out -> do
        alwaysRerun
        writeFileChanged out cline
        return ()
    dst %> \out -> do
        c <- readFile' dcmd
        need deps
        () <- cmd $ "../dumpdeps/dumpdeps " ++ out ++ " " ++ c
        needMakefileDependencies $ out <.> "d"
        return ()
I would prefer not to touch the filesystem for this task, is there any way to store the associated command line and trigger the final rule when that command changes?
I'd personally do this using the file system. File systems have great debuggers (ls and cat) and a large array of manipulation tools (echo, rm, touch). They also tend to be very fast for small files, living mostly in the cache. If you avoid the files, you tend to have less visibility of what's happening.
That said, there can be very good reasons to avoid the files. The closest thing Shake has to your pattern above is to use an Oracle. Note that it doesn't quite match the pattern you are using, as it assumes that cline can be computed from dst, which may or may not be possible in your case.
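For reference, a minimal sketch of the Oracle version, assuming the command lines are known when the rules are constructed (TargetCmd is a made-up key type, and on Shake before 0.16.1 the RuleResult instance should be dropped):

{-# LANGUAGE GeneralizedNewtypeDeriving, TypeFamilies #-}
import Development.Shake
import Development.Shake.Classes
import Development.Shake.FilePath
import Development.Shake.Util (needMakefileDependencies)
import Control.Monad (forM_)

newtype TargetCmd = TargetCmd FilePath
    deriving (Show, Eq, Hashable, Binary, NFData)
type instance RuleResult TargetCmd = String

-- One oracle answers for all targets; its answer lives in the Shake database,
-- and when it changes, every rule that asked it is rerun. No .x file on disk.
targets :: [(FilePath, ([FilePath], String))] -> Rules ()
targets ts = do
    askCmd <- addOracle $ \(TargetCmd t) ->
        maybe (error $ "no command for " ++ t) (return . snd) (lookup t ts)
    forM_ ts $ \(dst, (deps, _)) ->
        dst %> \out -> do
            c <- askCmd (TargetCmd out)
            need deps
            () <- cmd $ "../dumpdeps/dumpdeps " ++ out ++ " " ++ c
            needMakefileDependencies $ out <.> "d"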
If Oracle doesn't suit you, you can define your own Rule instance which can operate much the same way as a file, but store the values in the Shake database instead.
(If either of those cause any difficulties, let me know which is the relevant one to your question, and I'll expand the right one.)
I am trying to write a list into a file and later on I want to read the file contents into the list as well.
So I have a list like this ["ABC","DEF"]
I have tried things like
hPrint fileHandle listName
This just prints ["ABC","DEF"] into the file, quotes included.
I have tried unlines, but that prints it as "ABC\nDEF\n".
In both cases I can't read the contents back into a proper list: the output file has quotes in it, and because of that, when I read it back I get something like ["["ABC","DEF"]""], i.e. a single string in a list.
Since that didn't work, I tried to write the list line by line by mapping a write function over it, k = map (\x -> hPrint fileSLC x) fieldsBefore, but it does nothing: the file stays blank. I think that if I write each element on its own line, I will be able to read them back with lines later on.
I know whatever I am doing is wrong, but this is only the second time I am writing Haskell; last time I just wrote a very small file-reading program. Moving from imperative to functional is not that easy. :(
Try using hPutStr with unlines instead of hPrint. hPrint internally calls show, which causes Strings to be quoted and escaped.
hPutStr fileHandle (unlines listName)
Alternatively, use mapM_ or forM_ (the underscore variants, which discard the results). A verbose example is:
forM_ listName $ \string ->
    hPutStrLn fileHandle string
This can be simplified ("eta-contracted", in lambda-calculus terminology) to
forM_ listName (hPutStrLn fileHandle)
As you have seen, when you read from a file, you get a String. In order to convert this String into a list, you will need to parse it.
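If you control both the writer and the reader, the simplest round-trip is one element per line with unlines/lines (this assumes the strings themselves contain no newlines):

writeStrings :: FilePath -> [String] -> IO ()
writeStrings path xs = writeFile path (unlines xs)

readStrings :: FilePath -> IO [String]
readStrings path = fmap lines (readFile path)

Alternatively, write show xs to the file and parse it back with read, which also round-trips exactly.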
For k = map (\x -> hPrint fileSLC x) fieldsBefore to do anything, you need mapM_ (or mapM) instead of map: map only builds a list of unperformed IO actions, which is why the file stayed blank. Use hPutStrLn rather than hPrint for the same quoting reason as above.
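For example, to write one element per line without the quoting:

mapM_ (hPutStrLn fileSLC) fieldsBefore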
Haskell noob here. I have a question specifically regarding how to use an existing library that may lead to some more fundamental aspects of the proper use of Haskell.
I'm learning Haskell and have a small project in mind to work on while I learn. The script will need to find all the tarballs in a given directory and unpack them in parallel. At this point, I'm working on the basic functionality of unpacking. So, using the Codec.Archive.Tar package, how can I override its behavior regarding tarballs with fully qualified paths?
Here's some example code:
module Main where
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Compression.GZip as GZip
import Control.Monad (liftM, unless)
import qualified Data.ByteString.Lazy as BS
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.Exit (exitWith, ExitCode(..))
import System.FilePath.Posix (takeExtension)
searchPath = "/home/someuser/tarball/dir"
exit = exitWith ExitSuccess
die = exitWith (ExitFailure 1)
processFile :: String -> IO ()
processFile file = do
    putStrLn $ "Unpacking " ++ file ++ " to " ++ searchPath
    Tar.unpack searchPath . Tar.read . GZip.decompress =<< BS.readFile filePath
  where filePath = searchPath ++ "/" ++ file

main = do
    dirExists <- doesDirectoryExist searchPath
    unless dirExists $ (putStrLn $ "Error: Search path not found: " ++ searchPath) >> die
    files <- targetFiles `liftM` getDirectoryContents searchPath
    mapM_ processFile files
    exit
  where targetFiles = filter (\f -> f /= "." && f /= ".." && takeExtension f == ".tgz")
When I run this in a directory with tarballs that were packed with:
tar czvPf myfile.tgz /tarball_testing/myfile
I get the following output:
Unpacking myfile.tgz to /tarball_testing
unpacker.hs: Absolute file name in tar archive: "/tarball_testing/myfile"
The second line is the issue. Reading the docs for Codec.Archive.Tar I don't see a way to disable this functionality (not interested in discussions of why I want to use full paths in tarballs, or the relative security implications of doing so).
The first thing that comes to mind is that I somehow need to override the function but that doesn't "feel" like the way a pro Haskeller would do it. Can I get a pointer in the right direction?
You cannot monkey patch or otherwise override a function from a Haskell module, and therefore no workaround will let you avoid the safety measures of the library. What you can do, however, is use the functionality in Codec.Archive.Tar to modify the tar entry paths before unpacking so that they won't be absolute any more. Specifically, there is a mapEntriesNoFail function with type
mapEntriesNoFail :: (Entry -> Entry) -> Entries e -> Entries e
Entries is the type of the argument to Tar.unpack, while Entry is the type of an individual entry. Thanks to mapEntriesNoFail, our problem becomes writing an Entry -> Entry function to adjust the paths. For that, first we will need some extra imports:
import qualified Codec.Archive.Tar.Entry as Tar
import System.FilePath.Posix (takeExtension, dropDrive, hasTrailingPathSeparator)
import Data.Either (either)
The function can look like this:
dropDriveFromEntry :: Tar.Entry -> Tar.Entry
dropDriveFromEntry entry =
    either (error "Resulting tar path is somehow too long")
           (\tp -> entry { Tar.entryTarPath = tp })
           drivelessTarPath
  where
    tarPath = Tar.entryTarPath entry
    path = Tar.fromTarPath tarPath
    toTarPath' p = Tar.toTarPath (hasTrailingPathSeparator p) p
    drivelessTarPath = toTarPath' $ dropDrive path
This may seem a little long-winded; however, the hoops we jump through are there to ensure the resulting tar paths are sane. You can read about the gory details of tar handling on the Codec.Archive.Tar.Entry documentation. The key function in this definition is dropDrive, which makes an absolute path relative (in Linux, it strips the leading slash of an absolute path).
It is worth spending a few words on the use of either. toTarPath produces a value of type Either String TarPath to account for the possibility of failure. Specifically, the conversion to a tar path fails if the provided path is too long. In our case, however, the path cannot be too long, as it is a path which already was in a tar file, perhaps with a removed leading slash. That being so, it is good enough to eliminate the Either wrapping with either, passing an error instead of the function to handle the (impossible) Left case.
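For intuition, the two key conversions behave roughly like this (illustrative GHCi session):

ghci> dropDrive "/tarball_testing/myfile"
"tarball_testing/myfile"
ghci> either error Tar.fromTarPath (Tar.toTarPath False "tarball_testing/myfile")
"tarball_testing/myfile"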
With dropDriveFromEntry in hand, we just have to map it over the entries before unpacking. The relevant line of your program would become:
Tar.unpack searchPath . Tar.mapEntriesNoFail dropDriveFromEntry
    . Tar.read . GZip.decompress =<< BS.readFile filePath
Note that if there were relevant errors to be accounted for in dropDriveFromEntry, we would make it return Either String TarPath, and then use mapEntries instead of mapEntriesNoFail.
With these changes, the entry in your tar file will be extracted to /home/someuser/tarball/dir/tarball_testing/myfile. If that is not what you intended, you can modify dropDriveFromEntry so that it performs whatever extra path processing you need.
P.S.: Regarding the alternate title of your question, and considering the sensible little program you have shown us, I do not think you should be worried :)
I'm writing CGI scripts in Haskell. When the user hits ‘submit’, a Haskell program runs on the server, updating (i.e. reading in, processing, overwriting) a status file. Reading then overwriting sometimes causes issues with lazy IO, as we may be able to generate a large output prefix before we've finished reading the input. Worse, users sometimes bounce on the submit button and two instances of the process run concurrently, fighting over the same file!
What's a good way to implement
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
where the function (‘update’) computes the new file contents from the old file contents? It is not safe to presume that ‘update’ is strict, but it may be presumed that it is total (robustness to partial update functions is a bonus). Transactions may be attempted concurrently, but no transaction should be able to update if the file has been written by anyone else since it was read. It's ok for a transaction to abort in case of competition for file access. We may assume a source of systemwide-unique temporary filenames.
My current attempt writes to a temporary file, then uses a system copy command to overwrite. That seems to deal with the lazy IO problems, but it doesn't strike me as safe from races. Is there a tried and tested formula that we could just bottle?
The most idiomatic unixy way to do this is with flock:
http://hackage.haskell.org/package/flock
http://swoolley.org/man.cgi/2/flock
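If you would rather not add a dependency, fcntl-style advisory locks from the unix package give the same shape. A rough sketch (my substitution, not the flock package's API, and using the openFd signature from unix before 2.8):

import Control.Exception (bracket)
import System.IO (SeekMode (AbsoluteSeek))
import System.Posix.IO

-- Hold an exclusive advisory lock on the whole file while running the action.
withWriteLock :: FilePath -> IO a -> IO a
withWriteLock path act =
    bracket (openFd path ReadWrite (Just 0o644) defaultFileFlags) closeFd $ \fd -> do
        waitToSetLock fd (WriteLock, AbsoluteSeek, 0, 0) -- blocks until acquired
        act -- the lock is released when the descriptor is closed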
Here is a rough first cut that relies on the atomicity of the underlying mkdir. It seems to fulfill the specification, but I'm not sure how robust or fast it is:
import Control.DeepSeq
import Control.Exception
import System.Directory
import System.IO
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
transactionalUpdate file upd = bracket acquire release update
  where
    acquire = do
        let lockName = file ++ ".lock"
        createDirectory lockName
        return lockName
    release = removeDirectory
    update _ = nonTransactionalUpdate file upd

nonTransactionalUpdate :: FilePath -> (String -> String) -> IO ()
nonTransactionalUpdate file upd = do
    h <- openFile file ReadMode
    s <- upd `fmap` hGetContents h
    s `deepseq` hClose h
    h <- openFile file WriteMode
    hPutStr h s
    hClose h
I tested this by adding the following main and throwing a threadDelay in the middle of nonTransactionalUpdate:
-- also needs: import System.Environment (getArgs)
main :: IO ()
main = do
    [n] <- getArgs
    transactionalUpdate "foo.txt" ((show n ++ "\n") ++)
    putStrLn $ "successfully updated " ++ show n
Then I compiled and ran a bunch of instances with this script:
#!/bin/bash
rm foo.txt
touch foo.txt
for i in {1..50}
do
./SO $i &
done
A process printed a successful update message if and only if its number was in foo.txt; all the others printed the expected SO: foo.txt.notveryunique: createDirectory: already exists (File exists). (The lock name in that message comes from an earlier version of the code; see the update below.)
Update: You actually do not want to use unique names here; it must be a consistent name across the competing processes. I've updated the code accordingly.