How can I make file I/O more transactional?

I'm writing CGI scripts in Haskell. When the user hits ‘submit’, a Haskell program runs on the server, updating (i.e. reading in, processing, overwriting) a status file. Reading then overwriting sometimes causes issues with lazy IO, as we may be able to generate a large output prefix before we've finished reading the input. Worse, users sometimes bounce on the submit button and two instances of the process run concurrently, fighting over the same file!
What's a good way to implement
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
where the function (‘update’) computes the new file contents from the old file contents? It is not safe to presume that ‘update’ is strict, but it may be presumed that it is total (robustness to partial update functions is a bonus). Transactions may be attempted concurrently, but no transaction should be able to update if the file has been written by anyone else since it was read. It's ok for a transaction to abort in case of competition for file access. We may assume a source of systemwide-unique temporary filenames.
My current attempt writes to a temporary file, then uses a system copy command to overwrite. That seems to deal with the lazy IO problems, but it doesn't strike me as safe from races. Is there a tried and tested formula that we could just bottle?

The most idiomatic unixy way to do this is with flock:
http://hackage.haskell.org/package/flock
http://swoolley.org/man.cgi/2/flock
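As a rough sketch of the locking approach: the version below does not use the flock package itself. It assumes a recent GHC whose base library provides hTryLock and hUnlock in GHC.IO.Handle.Lock, which take an advisory lock on a handle. The name lockedUpdate and the ".lock" and ".tmp" suffixes are made up for illustration. It reads the old contents strictly, writes the new contents to a temporary file, and renames it over the original, aborting if another process holds the lock.

```haskell
import Control.Exception (bracket, evaluate, finally)
import GHC.IO.Handle.Lock (LockMode (ExclusiveLock), hTryLock, hUnlock)
import System.Directory (renameFile)
import System.IO (IOMode (AppendMode), hClose, openFile)

-- | Try to update the file under an exclusive advisory lock.
-- Returns False (without touching the file) if another process holds the lock.
lockedUpdate :: FilePath -> (String -> String) -> IO Bool
lockedUpdate file upd =
  bracket (openFile lockPath AppendMode) hClose $ \lockH -> do
    gotLock <- hTryLock lockH ExclusiveLock
    if gotLock
      then update `finally` hUnlock lockH
      else pure False
  where
    lockPath = file ++ ".lock"  -- assumed name, shared by all competing processes
    tmpPath  = file ++ ".tmp"   -- assumed name; only written while the lock is held
    update = do
      old <- readFile file
      _ <- evaluate (length old)  -- force the whole lazy read before writing anything
      writeFile tmpPath (upd old)
      renameFile tmpPath file     -- replaces the old file in one step on POSIX systems
      pure True
```

Because the new contents go to the temporary file first, an exception thrown by a partial update function leaves the original file untouched. The lock is advisory, so it only protects against other processes that take the same lock.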

Here is a rough first cut that relies on the atomicity of the underlying mkdir. It seems to fulfill the specification, but I'm not sure how robust or fast it is:
import Control.DeepSeq
import Control.Exception
import System.Directory
import System.IO
transactionalUpdate :: FilePath -> (String -> String) -> IO ()
transactionalUpdate file upd = bracket acquire release update
  where
    acquire = do
      let lockName = file ++ ".lock"
      createDirectory lockName
      return lockName
    release = removeDirectory
    update _ = nonTransactionalUpdate file upd
nonTransactionalUpdate :: FilePath -> (String -> String) -> IO ()
nonTransactionalUpdate file upd = do
  h <- openFile file ReadMode
  s <- upd `fmap` hGetContents h
  s `deepseq` hClose h
  h <- openFile file WriteMode
  hPutStr h s
  hClose h
I tested this by adding the following main and throwing a threadDelay in the middle of nonTransactionalUpdate:
-- requires: import System.Environment (getArgs)
main = do
  [n] <- getArgs
  transactionalUpdate "foo.txt" ((show n ++ "\n") ++)
  putStrLn $ "successfully updated " ++ show n
Then I compiled and ran a bunch of instances with this script:
#!/bin/bash
rm foo.txt
touch foo.txt
for i in {1..50}
do
  ./SO $i &
done
Each process printed a successful update message if and only if the corresponding number was in foo.txt; all the others printed the expected SO: foo.txt.notveryunique: createDirectory: already exists (File exists).
Update: You actually do not want to use unique names here; it must be a consistent name across the competing processes. I've updated the code accordingly.

Related

main: Prelude.!!: index too large

I am trying to write a program that will copy a file and allow us to rename it, like a cp command in unix.
import System.IO
import System.Environment
import qualified Data.Text as T
import qualified Data.Text.IO as TI
main :: IO ()
main = do
  args <- getArgs
  let source = args !! 0
  let dest = args !! 1
  input <- TI.readFile source
  TI.writeFile dest input
I tried this and got main: Prelude.!!: index too large

As the error says, the index is too large, which means that getArgs returned a list with fewer than two elements.
Using (!!) is however often not a good idea: there is no guarantee that the element exists, and it runs in 𝓞(k) to obtain the k-th element. You can work with:
main :: IO ()
main = do
  args <- getArgs
  case args of
    (src:dst:_) -> do
      input <- TI.readFile src
      TI.writeFile dst input
    _ -> putStrLn "You should provide a source and destination"
Also, loading the entire content into memory is not a good idea, since the file can be larger than the available memory, which can crash the program. Copying is usually done through two file handles: one that reads the source file and one that writes to the target file, with only a small amount of memory used. Some file systems can also make copying more efficient, for example by using two references to the same file and only making an actual copy when one of the two files is modified.
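A minimal sketch of the two-handle approach, assuming the bytestring package: it copies in fixed-size chunks so that only one chunk is in memory at a time. The name copyInChunks and the 64 KiB chunk size are arbitrary choices for illustration.

```haskell
import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode, WriteMode), withBinaryFile)

-- | Copy src to dst through two handles, one chunk at a time.
copyInChunks :: FilePath -> FilePath -> IO ()
copyInChunks src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      let loop = do
            chunk <- BS.hGetSome hIn 65536  -- read at most 64 KiB
            if BS.null chunk                -- an empty chunk means end of file
              then pure ()
              else BS.hPut hOut chunk >> loop
      in loop
```

In practice, copyFile from System.Directory already does this job, so a hand-written loop is mostly useful when the data needs processing along the way.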

Editable default string in Haskell's terminal input

I want to be able to prompt the user for input (let's say a FilePath), but also to offer a mutable/interactive string as a default, so instead of having the user type the full path, I can prompt with:
C:\Users\John\project\test
and have them be able to backspace 4 times and enter final to yield C:\Users\John\project\final, rather than type the entire path.
However printing a default string with putStr or System.IO.hPutStr stdout does print this default to the terminal, but does not allow me to alter any of it. E.g.
import System.IO
main = do
  hSetBuffering stdout NoBuffering
  putStr "C:\\Users\\John\\project\\test"
  l <- getLine
  doSomethingWith l
I suspect Data.Text.IO's interact may be able to do what I want but I could not get it to work.
Any suggestions would be greatly appreciated.

getLine doesn’t offer any facility for line editing. For this you can use a library like haskeline instead, for example:
import System.Console.Haskeline
main :: IO ()
main = do
  runInputT defaultSettings $ do
    mInput <- getInputLineWithInitial "Enter path: "
      ("C:\\Users\\John\\project\\test", "")
    case mInput of
      Nothing -> do
        outputStrLn "No entry."
      Just input -> do
        outputStrLn $ "Entry: " ++ show input
An alternative is to invoke the program with a wrapper that provides line editing, such as rlwrap. For building a more complex fullscreen text UI, there is also brick, which provides a simple text editing component in Brick.Widgets.Edit.

Haskell: Why does this function keep asking for user input and not terminating

I'm learning some Haskell and I came across this small program
reverseLines :: String -> String
reverseLines input =
  unlines (map reverse (lines input))
main :: IO ()
main = interact reverseLines
This program will keep asking the user for more input and reverse the input and print it on the screen.
Most of this is straightforward, but one thing I can't wrap my head around is why this program keeps running and asking the user for more input, whereas if I replace reverseLines with a function that simply returns some string, it does not.
This program will stop after one execution:
foo input = "Stops"
main :: IO ()
main = interact foo
Why?

If you look at the source of interact you see this:
interact f = do s <- getContents
                putStr (f s)
See the getContents? This is where the magic starts: it reads everything until EOF.
In Haskell this is lazy IO, which can be bad but here is almost magical. The string is read lazily and passed to your reverseLines, which can only produce output once it has seen a \n character (because of lines), so your program behaves like a kind of REPL.
In the second program you don't consume any of the lazy string at all, so it stops as soon as possible.
As I wrote in the comments you can play with this by either passing content into the program using a file (or echo) and pipes on the terminal:
echo "Hello World\nBye Bye" | runhaskell LazyIO.hs
or using CTRL-D to pass in the EOF yourself.
To get a feeling for it I would play with the functions more - what happens if you use something that needs to see the complete input first (try reverse without the maps)? What happens with words instead of lines, ...?
Have fun!
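The laziness can also be observed without a terminal. In this small sketch the unread part of the input is modelled by an error value (an assumption for illustration, standing in for text the user has not typed yet); the names firstOutput and constantOutput are made up.

```haskell
reverseLines :: String -> String
reverseLines input = unlines (map reverse (lines input))

foo :: String -> String
foo _ = "Stops"

-- Only the first line has "arrived"; forcing the rest would raise the error.
firstOutput :: String
firstOutput = take 4 (reverseLines ("abc\n" ++ error "rest of input not typed yet"))
-- evaluates to "cba\n": the first reversed line is available
-- as soon as its newline has been seen

-- foo never inspects its argument, so no input is read at all.
constantOutput :: String
constantOutput = foo (error "input never read")
-- evaluates to "Stops"
```

Replacing reverseLines with plain reverse would force the error, because reversing the whole input needs to see its end first.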

How to delete unbuildable goal?

I would like to remove files that no longer have source but without cleaning.
Is there support for partially cleaning an incremental build? In this case, I guess I could compare against the set of source files that were consumed in previous builds and define how to clean the outputs of those that are gone.
main = shakeArgs shakeOptions { shakeVerbosity = Diagnostic } $ do
    want [".build"]
    phony ".build" $ do
        files <- getDirectoryFiles "." ["//*.txt"]
        let goals = map (-<.> "") files
        need goals
    "*" %> \out -> do
        Stdout o <- cmd $ "sort " ++ (out ++ ".txt")
        writeFile' out o

Using shakeArgsPrune you can define a function that gets passed the live files afterwards. You can then write something like:
import Development.Shake
import Development.Shake.FilePath
import Development.Shake.Util
import System.Directory.Extra
import Data.List
import System.IO
pruner :: [FilePath] -> IO ()
pruner live = do
    present <- listFilesRecursive "output"
    mapM_ removeFile $ map toStandard present \\ map toStandard live
main :: IO ()
main = shakeArgsPrune shakeOptions pruner $ do
    ... rules go here ...
This deletes all files in output that are not generated and up-to-date according to the build system as it stands. For a complete example see
http://neilmitchell.blogspot.co.uk/2015/04/cleaning-stale-files-with-shake.html.
The shakeArgsPrune function is only available in shake-0.15.1 and above, but is based on the shakeLiveFiles feature which has been available for longer and can be used directly if you so desire.

How to override function in Codec.Archive.Tar

Haskell noob here. I have a question specifically regarding how to use an existing library that may lead to some more fundamental aspects of the proper use of Haskell.
I'm learning Haskell and have a small project in mind to work on while I learn. The script will need to find all the tarballs in a given directory and unpack them in parallel. At this point, I'm working on the basic functionality of unpacking. So, using the Codec.Archive.Tar package, how can I override its behavior regarding tarballs with fully qualified paths?
Here's some example code:
module Main where
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Compression.GZip as GZip
import Control.Monad (liftM, unless)
import qualified Data.ByteString.Lazy as BS
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.Exit (exitWith, ExitCode(..))
import System.FilePath.Posix (takeExtension)
searchPath = "/home/someuser/tarball/dir"
exit = exitWith ExitSuccess
die = exitWith (ExitFailure 1)
processFile :: String -> IO ()
processFile file = do
    putStrLn $ "Unpacking " ++ file ++ " to " ++ searchPath
    Tar.unpack searchPath . Tar.read . GZip.decompress =<< BS.readFile filePath
  where filePath = searchPath ++ "/" ++ file
main = do
    dirExists <- doesDirectoryExist searchPath
    unless dirExists $ (putStrLn $ "Error: Search path not found: " ++ searchPath) >> die
    files <- targetFiles `liftM` getDirectoryContents searchPath
    mapM_ processFile files
    exit
  where targetFiles = filter (\f -> f /= "." && f /= ".." && takeExtension f == ".tgz")
When I run this in a directory with tarballs that were packed with:
tar czvPf myfile.tgz /tarball_testing/myfile
I get the following output:
Unpacking myfile.tgz to /tarball_testing
unpacker.hs: Absolute file name in tar archive: "/tarball_testing/myfile"
The second line is the issue. Reading the docs for Codec.Archive.Tar I don't see a way to disable this functionality (not interested in discussions of why I want to use full paths in tarballs, or the relative security implications of doing so).
The first thing that comes to mind is that I somehow need to override the function but that doesn't "feel" like the way a pro Haskeller would do it. Can I get a pointer in the right direction?

You cannot monkey patch or otherwise override a function from a Haskell module, and therefore no workaround will let you avoid the safety measures of the library. What you can do, however, is use the functionality in Codec.Archive.Tar to modify the tar entry paths before unpacking so that they won't be absolute any more. Specifically, there is a mapEntriesNoFail function with type
mapEntriesNoFail :: (Entry -> Entry) -> Entries e -> Entries e
Entries is the type of the argument to Tar.unpack, while Entry is the type of an individual entry. Thanks to mapEntriesNoFail, our problem becomes writing an Entry -> Entry function to adjust the paths. For that, first we will need some extra imports:
import qualified Codec.Archive.Tar.Entry as Tar
import System.FilePath.Posix (takeExtension, dropDrive, hasTrailingPathSeparator)
import Data.Either (either)
The function can look like this:
dropDriveFromEntry :: Tar.Entry -> Tar.Entry
dropDriveFromEntry entry =
    either (error "Resulting tar path is somehow too long")
        (\tp -> entry { Tar.entryTarPath = tp })
        drivelessTarPath
  where
    tarPath = Tar.entryTarPath entry
    path = Tar.fromTarPath tarPath
    toTarPath' p = Tar.toTarPath (hasTrailingPathSeparator p) p
    drivelessTarPath = toTarPath' $ dropDrive path
This may seem a little long-winded; however, the hoops we jump through are there to ensure the resulting tar paths are sane. You can read about the gory details of tar handling on the Codec.Archive.Tar.Entry documentation. The key function in this definition is dropDrive, which makes an absolute path relative (in Linux, it strips the leading slash of an absolute path).
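To see what dropDrive does on its own, here is a small sketch using System.FilePath.Posix from the filepath package; the list name examples and the sample paths are chosen only for illustration.

```haskell
import System.FilePath.Posix (dropDrive)

-- Each sample path paired with the result of dropDrive.
examples :: [(FilePath, FilePath)]
examples =
  [ (p, dropDrive p)
  | p <- ["/tarball_testing/myfile", "relative/path", "/"]
  ]
-- [ ("/tarball_testing/myfile", "tarball_testing/myfile")
-- , ("relative/path", "relative/path")
-- , ("/", "") ]
```

Relative paths pass through unchanged, so applying the function to every entry is harmless for archives that mix absolute and relative names.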
It is worth spending a few words on the use of either. toTarPath produces a value of type Either String TarPath to account for the possibility of failure. Specifically, the conversion to a tar path fails if the provided path is too long. In our case, however, the path cannot be too long, as it is a path which already was in a tar file, perhaps with a removed leading slash. That being so, it is good enough to eliminate the Either wrapping with either, passing an error instead of the function to handle the (impossible) Left case.
With dropDriveFromEntry in hand, we just have to map it over the entries before unpacking. The relevant line of your program would become:
Tar.unpack searchPath . Tar.mapEntriesNoFail dropDriveFromEntry
    . Tar.read . GZip.decompress =<< BS.readFile filePath
Note that if there were relevant errors to be accounted for in dropDriveFromEntry, we would make it return Either String Tar.Entry, and then use mapEntries instead of mapEntriesNoFail.
With these changes, the entry in your tar file will be extracted to /home/someuser/tarball/dir/tarball_testing/myfile. If that is not what you intended, you can modify dropDriveFromEntry so that it performs whatever extra path processing you need.
P.S.: Regarding the alternate title of your question, and considering the sensible little program you have shown us, I do not think you should be worried :)
