How to properly match file names in subdirectories? - haskell

I am currently going through the book Real World Haskell and one exercise from this book asks the reader to implement file name matching with the use of **, which is the same as *, but also looks in subdirectories all the way down in the file system. Below is a fragment of my code with comments (there is a lot of duplication at the moment) and further down you can find additional info about the code. I think that the posted code is sufficient for the problem and there is no need to list the whole program here.
case splitFileName pat of
  ("", baseName) -> do -- just the file name passed
    curDir <- getCurrentDirectory
    if searchSubDirs baseName -- check if file name has `**` in it
      then do
        contents <- getDirectoryContents curDir
        subDirs <- filterM doesDirectoryExist contents
        let properSubDirs = filter (`notElem` [".", ".."]) subDirs
        subDirsNames <- forM properSubDirs $ \dir -> do
          namesMatching (curDir </> dir </> baseName) -- call the function recursively on subdirectories
        curDirNames <- listMatches curDir baseName -- list matches in the current directory
        return (curDirNames ++ (concat subDirsNames)) -- concatenate results into a single list
      else listMatches curDir baseName
  (dirName, baseName) -> do -- full path passed
    if searchSubDirs baseName
      then do
        contents <- getDirectoryContents dirName
        subDirs <- filterM doesDirectoryExist contents
        let properSubDirs = filter (`notElem` [".", ".."]) subDirs
        subDirsNames <- forM properSubDirs $ \dir -> do
          namesMatching (dirName </> dir </> baseName) -- call the function recursively on subdirectories
        curDirNames <- listMatches dirName baseName -- list matches in the passed directory
        return (curDirNames ++ (concat subDirsNames)) -- concatenate results into a single list
      else listMatches dirName baseName
Additional information:
pat is the pattern I'm looking for (e.g. *.txt or C:\\A\[a-z].*).
splitFileName is a function which splits a file path into the directory path and the file name. The first element of the tuple will be empty if we specify just a file name in pat.
searchSubDirs returns True if the file name has ** in it.
listMatches returns a list of file names that match the pattern in the directory, substituting ** for *.
namesMatching is the name of the function whose excerpt I posted.
Why doesn't it work?
When I pass just the file name, the program searches for it only in the current directory and first level of subdirectories. When I pass a full path, it searches only in the specified directory. It looks like case (dirName, baseName) doesn't properly recurse. I've been looking at the code for some time now and I can't figure out where the problem is.
Note
If any more information is needed, please let me know in the comments and I'll add whatever is necessary to the question.

Here's an issue:
contents <- getDirectoryContents dirName
subDirs <- filterM doesDirectoryExist contents
getDirectoryContents only returns the leaf names of the directories, so you have to prepend dirName (along with a /) to the elements of contents before calling doesDirectoryExist.
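A minimal sketch of that fix, assuming only the directory and filepath packages. The helper name subDirsOf is hypothetical and not part of the asker's program. It also suggests why the bare-file-name case appeared to work one level down: there the leaf names happen to resolve against the current working directory.

```haskell
import Control.Monad (filterM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- | List the immediate subdirectories of a directory as paths that include
-- the parent. 'subDirsOf' is a hypothetical helper name.
subDirsOf :: FilePath -> IO [FilePath]
subDirsOf dir = do
  contents <- getDirectoryContents dir
  -- Drop the special entries while we still have bare leaf names.
  let names = filter (`notElem` [".", ".."]) contents
  -- Prepend the parent before testing, so the check does not depend on
  -- the current working directory.
  filterM doesDirectoryExist (map (dir </>) names)
```

Because the results already carry the parent directory, the recursive call could then be written as namesMatching (dir </> baseName) for each returned dir.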

Related

shake build: read filepath list from config file

I have a config.cfg file in which the variable file_list is a list of relative paths to files:
file_list = file1 dir1/file2 ../dir2/file3
How do I read this variable in to get a file_list::[FilePath]?
I tried to follow the Development.Shake.Config API documentation without success. I need something like this:
file_list <- getConfig "file_list"
let fl = ??? file_list
need fl
P.S. I am a Haskell beginner.
The type of file_list is Maybe String, and the type of fl needs to be [FilePath], so the question becomes how to write a function to transform between the two. One option is:
let fl = words (fromMaybe "" file_list)
The fromMaybe function replaces Nothing with "" - so you now have a String. The words function splits a string on the spaces, to produce [String]. In Haskell FilePath is a synonym for String so it all works out.
If instead you want to error out if the key is missing, you can do:
Just file_list <- getConfig "file_list"
let fl = words file_list
need fl
Now you are asserting and unwrapping the Maybe in file_list, so if it is Nothing you get a runtime crash, and if it is Just you get it without the Just wrapper, so can simply use words.
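The first option can be isolated as a pure function, which makes it easy to try in GHCi without Shake. This is only a sketch; the name parseFileList is hypothetical.

```haskell
import Data.Maybe (fromMaybe)

-- | Turn an optional, space-separated config value into a list of paths.
-- A missing key (Nothing) becomes the empty list.
parseFileList :: Maybe String -> [FilePath]
parseFileList = words . fromMaybe ""
```

Note that words splits on any whitespace, so this approach assumes the paths themselves contain no spaces.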

Flexible number of arguments to haskell program

I am using the System.FilePath.Find module of filemanip to recursively find all files I need to process (here the action is just printing to the console, in order not to confuse things). Now, this code:
import System.Environment (getArgs)
import System.FilePath (FilePath)
import System.Directory (doesDirectoryExist, getDirectoryContents,doesFileExist)
import Control.Monad
import System.FilePath.Find (find,always,fileType,(==?),FileType(..),(&&?),extension)
main = do
  [dbFile,input] <- getArgs
  files <- findFiles input
  mapM_ putStrLn files
  return ()
searchExtension :: String
searchExtension = ".hs"
findFiles :: FilePath -> IO [String]
findFiles = find (always) ( fileType ==? RegularFile &&? extension ==? searchExtension)
works well with this call
./myprog tet .
In this case, the first argument is ignored (it will be the output database file later) and the second argument is searched recursively for matching files. It also allows me to specify just a single file, which is just perfect!
BUT, I would like to be able to specify
./myprog tet path1 path2 path4 file1
but this of course fails in the pattern matching:
./myprog tet . .
myprogt: user error (Pattern match failure in do expression at myprog.hs:11:9-22)
Now, how do I make this program more flexible, so that I can take more than two arguments?
Sorry for asking this; my Haskell knowledge is limited, but it grows with every new thing I have to do in my first project.
Well, you can use a different pattern like:
(dbFile:inputs) <- getArgs
where dbFile will match the first argument passed while inputs will match any number of file names (even zero). If you want at least one path name, use inputs@(_:_) instead of the plain inputs.
Then you can use mapM to call findFiles for each path in inputs:
files <- mapM findFiles inputs
mapM_ putStrLn $ concat files
Instead of mapM you could modify findFiles to accept a [FilePath] argument instead of a simple FilePath.
Note that to parse command-line arguments you could consider using a module such as System.Console.GetOpt. You should also read this page about argument handling.
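As a rough sketch of the pattern described above, the argument check can be pulled out into a pure function so that a wrong argument count produces a usage message instead of a pattern match failure. The name splitArgs and the usage text are hypothetical.

```haskell
-- | Split the argument list into the database file and one or more inputs.
-- 'splitArgs' is a hypothetical helper, not part of the original program.
splitArgs :: [String] -> Either String (FilePath, [FilePath])
splitArgs (dbFile : inputs@(_ : _)) = Right (dbFile, inputs)
splitArgs _ = Left "usage: myprog DBFILE PATH [PATH...]"
```

In main, one could then match on the result of splitArgs applied to getArgs, printing the message in the Left case and running mapM findFiles inputs in the Right case.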

Haskell write a list in file and read later

I am trying to write a list into a file and later on I want to read the file contents into the list as well.
So I have a list like this ["ABC","DEF"]
I have tried things like
hPrint fileHandle listName
This just prints into file "["ABC","DEF"]"
I have tried unlines, but that prints like "ABC\nDEF\n"
Now, in both cases, I can't read the contents back into a proper list. The output file has quotes, so when I read it I get something like ["["ABC","DEF"]""], i.e. a single string in a list.
As I was not succeeding with this, I tried to write the list line by line. I tried to map a writing function over the list, k = map (\x -> hPrint fileSLC x) fieldsBefore, but it does nothing and the file is blank. I think that if I write every element on a separate line, I will be able to read it back with (lines src) later on.
I know whatever I am doing is wrong, but this is only the second time I am writing Haskell code; last time I just wrote a very small file reading program. Moving from imperative to functional is not that easy. :(
Try using hPutStr and unlines instead of hPrint. hPrint internally calls show, which causes Strings to be quoted and escaped.
hPutStr fileHandle (unlines listName)
Alternatively, use mapM_ or forM_. A verbose example is:
forM_ listName $ \string ->
  hPutStrLn fileHandle string
This can be simplified ("eta-contracted", in lambda-calculus terminology) to
forM_ listName (hPutStrLn fileHandle)
As you have seen, when you read from a file, you get a String. In order to convert this String into a list, you will need to parse it.
For k = map (\x -> hPrint fileSLC x) fieldsBefore to work, you need to use mapM or mapM_ instead of map.
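A small sketch of the full round trip, assuming one list element per line. The helper names saveList and loadList are hypothetical.

```haskell
import System.IO (IOMode (WriteMode), hPutStr, withFile)

-- | Write one list element per line. 'saveList' is a hypothetical helper.
saveList :: FilePath -> [String] -> IO ()
saveList path xs = withFile path WriteMode $ \h -> hPutStr h (unlines xs)

-- | Read the file back and split it into lines again.
loadList :: FilePath -> IO [String]
loadList path = lines <$> readFile path
```

This format only round-trips when no element contains a newline. If elements may contain arbitrary characters, writing show listName and parsing it back with read at type [String] is another option.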

Recover the source file name in a Shake rule

I am writing a build system for a static website which works like this:
for every file src/123-some-title.txt produce a file out/123.html
My problem is that when writing the rule for out/*.html I have no direct way to recover the source file name (src/123-some-title.txt) from the target file name (out/123.html).
Of course I could read the src/ directory again and search for a file that starts with 123, but is there a nicer way to do this with Shake?
The first thing to mention is that if you call getDirectoryFiles multiple times with the same arguments it will only calculate once, in the same way that if you call need multiple times on the same file it will only build once. One approach would be:
"out/*.fwd" *> \out -> do
  res <- getDirectoryFiles "src" ["*.txt"]
  let match = [x | x <- res, (takeBaseName out ++ "-") `isPrefixOf` takeBaseName x]
  when (length match /= 1) $ error "fail, because wrong number of matches"
  writeFileChanged out $ head match
"out/*.html" *> \out -> do
  src <- readFile' (out -<.> "fwd")
  txt <- readFile' ("src" </> src)
  ...
Here the idea is that the file out/123.fwd contains the text 123-some-title.txt. By using writeFileChanged we only change the .fwd file when the relevant part of the directory changes.
If you want to avoid the .fwd files, you can use the Oracle mechanism. If you want to avoid a linear scan of the getDirectoryFiles result you can use the newCache function. In practice, neither is likely to be problematic, and going with the files is probably simplest.
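The matching step in the .fwd rule can be sketched as a pure function, independent of Shake, which makes the prefix logic easy to check in isolation. The name sourceFor is hypothetical; only Data.List and System.FilePath are assumed.

```haskell
import Data.List (isPrefixOf)
import System.FilePath (takeBaseName)

-- | Find the single candidate whose base name starts with the target's base
-- name followed by "-". 'sourceFor' is a hypothetical helper.
sourceFor :: FilePath -> [FilePath] -> Either String FilePath
sourceFor out candidates =
  case [x | x <- candidates, (takeBaseName out ++ "-") `isPrefixOf` takeBaseName x] of
    [x] -> Right x
    ms -> Left ("expected exactly one match, found " ++ show (length ms))
```

Requiring the "-" after the number keeps a target such as out/12.fwd from matching 123-some-title.txt.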

How to override function in Codec.Archive.Tar

Haskell noob here. I have a question specifically regarding how to use an existing library that may lead to some more fundamental aspects of the proper use of Haskell.
I'm learning Haskell and have a small project in mind to work on while I learn. The script will need to find all the tarballs in a given directory and unpack them in parallel. At this point, I'm working on the basic functionality of unpacking. So, using the Codec.Archive.Tar package, how can I override its behavior regarding tarballs with fully qualified paths?
Here's some example code:
module Main where
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Compression.GZip as GZip
import Control.Monad (liftM, unless)
import qualified Data.ByteString.Lazy as BS
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.Exit (exitWith, ExitCode(..))
import System.FilePath.Posix (takeExtension)
searchPath = "/home/someuser/tarball/dir"
exit = exitWith ExitSuccess
die = exitWith (ExitFailure 1)
processFile :: String -> IO ()
processFile file = do
  putStrLn $ "Unpacking " ++ file ++ " to " ++ searchPath
  Tar.unpack searchPath . Tar.read . GZip.decompress =<< BS.readFile filePath
  where filePath = searchPath ++ "/" ++ file
main = do
  dirExists <- doesDirectoryExist searchPath
  unless dirExists $ (putStrLn $ "Error: Search path not found: " ++ searchPath) >> die
  files <- targetFiles `liftM` getDirectoryContents searchPath
  mapM_ processFile files
  exit
  where targetFiles = filter (\f -> f /= "." && f /= ".." && takeExtension f == ".tgz")
When I run this in a directory with tarballs that were packed with:
tar czvPf myfile.tgz /tarball_testing/myfile
I get the following output:
Unpacking myfile.tgz to /tarball_testing
unpacker.hs: Absolute file name in tar archive: "/tarball_testing/myfile"
The second line is the issue. Reading the docs for Codec.Archive.Tar I don't see a way to disable this functionality (not interested in discussions of why I want to use full paths in tarballs, or the relative security implications of doing so).
The first thing that comes to mind is that I somehow need to override the function but that doesn't "feel" like the way a pro Haskeller would do it. Can I get a pointer in the right direction?
You cannot monkey patch or otherwise override a function from a Haskell module, and therefore no workaround will let you avoid the safety measures of the library. What you can do, however, is use the functionality in Codec.Archive.Tar to modify the tar entry paths before unpacking so that they won't be absolute any more. Specifically, there is a mapEntriesNoFail function with type
mapEntriesNoFail :: (Entry -> Entry) -> Entries e -> Entries e
Entries is the type of the argument to Tar.unpack, while Entry is the type of an individual entry. Thanks to mapEntriesNoFail, our problem becomes writing an Entry -> Entry function to adjust the paths. For that, first we will need some extra imports:
import qualified Codec.Archive.Tar.Entry as Tar
import System.FilePath.Posix (takeExtension, dropDrive, hasTrailingPathSeparator)
import Data.Either (either)
The function can look like this:
dropDriveFromEntry :: Tar.Entry -> Tar.Entry
dropDriveFromEntry entry =
  either (error "Resulting tar path is somehow too long")
         (\tp -> entry { Tar.entryTarPath = tp })
         drivelessTarPath
  where
    tarPath = Tar.entryTarPath entry
    path = Tar.fromTarPath tarPath
    toTarPath' p = Tar.toTarPath (hasTrailingPathSeparator p) p
    drivelessTarPath = toTarPath' $ dropDrive path
This may seem a little long-winded; however, the hoops we jump through are there to ensure the resulting tar paths are sane. You can read about the gory details of tar handling in the Codec.Archive.Tar.Entry documentation. The key function in this definition is dropDrive, which makes an absolute path relative (on Linux, it strips the leading slash of an absolute path).
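To see what dropDrive does on its own, here is a small sketch that needs only the filepath package and not the tar library. The wrapper name relativize is hypothetical.

```haskell
import System.FilePath.Posix (dropDrive, isAbsolute)

-- | Make a POSIX path relative by stripping its leading slash.
-- Relative paths are returned unchanged. 'relativize' is a hypothetical name.
relativize :: FilePath -> FilePath
relativize path
  | isAbsolute path = dropDrive path
  | otherwise = path
```

For example, relativize "/tarball_testing/myfile" gives "tarball_testing/myfile", which is the form the unpacking step accepts.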
It is worth spending a few words on the use of either. toTarPath produces a value of type Either String TarPath to account for the possibility of failure. Specifically, the conversion to a tar path fails if the provided path is too long. In our case, however, the path cannot be too long, as it is a path which already was in a tar file, perhaps with a removed leading slash. That being so, it is good enough to eliminate the Either wrapping with either, passing an error instead of the function to handle the (impossible) Left case.
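The same use of either can be shown with a toy conversion in place of toTarPath. Both names here, checkLength and unsafeChecked, are hypothetical stand-ins and not part of the tar library.

```haskell
-- | Hypothetical stand-in for a conversion that fails on overly long input.
checkLength :: Int -> String -> Either String String
checkLength limit s
  | length s > limit = Left ("path too long: " ++ s)
  | otherwise = Right s

-- | Collapse the Either, treating the Left case as a programming error.
unsafeChecked :: Int -> String -> String
unsafeChecked limit = either error id . checkLength limit
```

This is only reasonable when the Left case is known not to occur, as argued above; otherwise the Either should be propagated to the caller.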
With dropDriveFromEntry in hand, we just have to map it over the entries before unpacking. The relevant line of your program would become:
Tar.unpack searchPath . Tar.mapEntriesNoFail dropDriveFromEntry
  . Tar.read . GZip.decompress =<< BS.readFile filePath
Note that if there were relevant errors to be accounted for in dropDriveFromEntry, we would make it return Either String Entry, and then use mapEntries instead of mapEntriesNoFail.
With these changes, the entry in your tar file will be extracted to /home/someuser/tarball/dir/tarball_testing/myfile. If that is not what you intended, you can modify dropDriveFromEntry so that it performs whatever extra path processing you need.
P.S.: Regarding the alternate title of your question, and considering the sensible little program you have shown us, I do not think you should be worried :)
