getAllFiles (but not symlinks) - haskell

I have a directory traversal function in Haskell, but I want it to ignore symlinks. I figured out how to filter out the files alone, albeit with a slightly inelegant secondary filterM. But after some diagnosis I realize that I'm failing to filter symlinked directories.
I'd like to be able to write something like this:
-- Lazily return (normal) files from rootdir
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
    nodes <- pathWalkLazy root
    -- get file paths from each node
    let files = [dir </> file | (dir, _, files) <- nodes,
                                file <- files,
                                not . pathIsSymbolicLink dir]
    normalFiles <- filterM (liftM not . pathIsSymbolicLink) files
    return normalFiles
However, all the variations I have tried get some version of the "Couldn't match expected type ‘Bool’ with actual type ‘IO Bool’" message (without the filter clause in the comprehension it works, but fails to filter those linked dirs).
Various hints at ways I might completely restructure the function are in partial form at online resources, but I'm pretty sure that every such variation will run into some similar issue. The list comprehension would certainly be the most straightforward way... if I could just somehow exclude those dirs that are links.
Followup: Unfortunately, the solution kindly provided by ChrisB behaves (almost?!) identically to my existing version. I defined three functions and ran them within a test program:
-- XXX: debugging
files <- getAllFilesRaw rootdir
putStrLn ("getAllFilesRaw: " ++ show (length files))
files' <- getAllFilesNoSymFiles rootdir
putStrLn ("getAllFilesNoSymFiles: " ++ show (length files'))
files'' <- getAllFilesNoSymDirs rootdir
putStrLn ("getAllFilesNoSymDirs: " ++ show (length files''))
The first is my version with the normalFiles filter removed. The second is my original version (minus the type error in the listcomp). The final one is ChrisB's suggestion.
Running that, then also the system find utility:
% find $CONDA_PREFIX -type f | wc -l
449667
% find -L $CONDA_PREFIX -type f | wc -l
501153
% haskell/find-dups $CONDA_PREFIX
getAllFilesRaw : 501153
getAllFilesNoSymFiles: 464553
getAllFilesNoSymDirs: 464420
Moreover, this question came up because, for my own self-education, I've implemented the same application in a bunch of languages: Python, Golang, Rust, Julia, TypeScript, Bash and, except for this glitch, Haskell; others are planned. The programs actually do something more with the files, but that's not the point of this question.
The point of this is that ALL other languages report the same number as the system find tool. Moreover, the specific issue is things like this:
% ls -l /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo
lrwxrwxrwx 1 dmertz dmertz 17 Apr 29 2020 /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo -> ../share/terminfo
There are about 16k examples here (on my system currently), but looking at some in the other version of the tool, I see specifically that all the other languages are excluding the contents of that symlink directory.

EDIT:
Instead of just fixing a Bool / IO Bool issue, we now want to match find's behavior.
After looking at the documentation, this seems quite hard to implement with reasonable performance using the PathWalk library, so I just hand-rolled it (using do-notation, as requested in the comments).
In my quick and dirty tests the results match those of find:
import System.FilePath
import System.Directory

getAllFiles' :: FilePath -> IO [FilePath]
getAllFiles' path = do
  isSymlink <- pathIsSymbolicLink path
  if isSymlink
    -- if this is a symlink, return the empty list.
    -- even if this was the original root. (matches find's behavior)
    then return []
    else do
      isFile <- doesFileExist path
      if isFile
        then return [path] -- if this is a file, return it
        else do
          -- if it's not a file, we assume it to be a directory
          dirContents <- listDirectory path
          -- run this function recursively on all the children
          -- and accumulate the results
          fmap concat $ mapM (getAllFiles' . (path </>)) dirContents
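To check the symlinked-directory case from the question (lib/terminfo -> ../share/terminfo) in isolation, here is a minimal sketch that builds a tiny tree in a scratch directory and walks it. The names walkNoSymlinks, demo and the scratch directory name are made up for illustration, and it assumes a version of the directory package that provides createDirectoryLink and removePathForcibly.

```haskell
import Control.Monad (forM)
import System.Directory
  ( createDirectoryIfMissing, createDirectoryLink, doesFileExist
  , getTemporaryDirectory, listDirectory, pathIsSymbolicLink
  , removePathForcibly )
import System.FilePath ((</>))

-- | Same traversal as the hand-rolled function, repeated here under a
-- hypothetical name so that the sketch is self-contained.
walkNoSymlinks :: FilePath -> IO [FilePath]
walkNoSymlinks path = do
  isSymlink <- pathIsSymbolicLink path
  if isSymlink
    then return []                      -- never report or follow symlinks
    else do
      isFile <- doesFileExist path
      if isFile
        then return [path]
        else do
          children <- listDirectory path
          concat <$> forM children (walkNoSymlinks . (path </>))

-- | Create share/terminfo/x and a link lib/terminfo -> share/terminfo,
-- walk the tree, clean up, and return what was found.
demo :: IO [FilePath]
demo = do
  tmp <- getTemporaryDirectory
  let root   = tmp </> "walk-no-symlinks-demo"   -- hypothetical scratch dir
      target = root </> "share" </> "terminfo"
  removePathForcibly root                        -- start from a clean slate
  createDirectoryIfMissing True target
  createDirectoryIfMissing True (root </> "lib")
  writeFile (target </> "x") "data"
  createDirectoryLink target (root </> "lib" </> "terminfo")
  files <- walkNoSymlinks root
  removePathForcibly root
  return files
```

If the link were followed, the file x would be reported twice; with the symlink check it should be reported once, which is the behavior of find without -L.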
Original Answer solving the IO Bool / Bool issue
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = pathWalkLazy root
  -- remove dirs that are symlinks
  >>= filterM (\(dir, _, _) -> fmap not $ pathIsSymbolicLink dir)
  -- flatten to list of files
  >>= return . concat . map (\(dir, _, files) -> map (\f -> dir </> f) files)
  -- remove files that are symlinks
  >>= filterM (fmap not . pathIsSymbolicLink)
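The type error in the question comes from the fact that a list-comprehension guard (like filter) needs a plain Bool, while pathIsSymbolicLink returns IO Bool. A minimal sketch of the pattern used above, with the hypothetical helper name dropSymlinks:

```haskell
import Control.Monad (filterM)
import System.Directory (pathIsSymbolicLink)

-- | Keep only the paths that are not symbolic links.
-- The predicate runs in IO, so 'filterM' is used instead of 'filter';
-- 'fmap not' negates the Bool inside the IO action.
dropSymlinks :: [FilePath] -> IO [FilePath]
dropSymlinks = filterM (fmap not . pathIsSymbolicLink)
```

Note that pathIsSymbolicLink throws an exception for paths that do not exist, so this sketch assumes every path in the list is present on disk.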

Related

mapM on IO produces infinite output

This is bizarre behavior, even for Haskell. Look at the code segments below:
import System.Directory
import System.FilePath

-- This spins infinitely
loadCtx :: FilePath -> IO ()
loadCtx dir = do
  lsfiles <- listDirectory dir
  let files = mapM (dir </>) lsfiles
  putStrLn $ "Files " ++ show files

-- This does what I'd expect, prepending the dir path to each file
loadCtx dir = do
  lsfiles <- listDirectory dir
  let files = map (dir </>) lsfiles
  putStrLn $ "Files " ++ show files
Both definitions are accepted by the typechecker but give completely different behavior. What is the output of the first mapM? It looks like an infinite loop on reading some files. Also is it possible to compose the listDirectory do-arrow line with the map (dir </>) that prepends the path, in one line?
What is the output of the first mapM? It looks like an infinite loop on reading some files.
It is not an infinite loop -- merely a very, very long one.
You are not using mapM for IO; you are using mapM in the nondeterminism monad. Here is the type of mapM, specialized to that monad:
Traversable t => (a -> [b]) -> t a -> [t b]
Read this in the following way:
First, give me a way to turn an element of a container (type a) into a nondeterministic choice between many possible replacement elements (type [b]).
Then give me a containerful of elements (type t a).
I will give you a nondeterministic choice between containers with replacement elements in them (type [t b]). (And, this part is not in the type, but: the way I will do this is by taking all possible combinations; for each position in the container, I'll try each possible b, and give you every which way of making one choice for each position in the container.)
For example, if we were to define the function f :: Int -> [Char] for which f n chose nondeterministically between the first n letters of the alphabet, then we could see this kind of interaction:
> f 3
"abc"
> f 5
"abcde"
> f 2
"ab"
> mapM f [3,5,2]
["aaa","aab","aba","abb","aca","acb","ada","adb","aea","aeb","baa","bab","bba","bbb","bca","bcb","bda","bdb","bea","beb","caa","cab","cba","cbb","cca","ccb","cda","cdb","cea","ceb"]
In each result, the first letter is one of the first three in the alphabet (a, b, or c); the second is from the first five, and the third from the first two. What's more, we get every list which has this property.
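The question does not define f, so here is one possible definition that reproduces the session above (a sketch; the name choices is made up for illustration):

```haskell
-- | Nondeterministically choose one of the first n lowercase letters.
f :: Int -> [Char]
f n = take n ['a' .. 'z']

-- | Every way of picking one letter per position: 3 * 5 * 2 = 30 strings,
-- from "aaa" to "ceb".
choices :: [String]
choices = mapM f [3, 5, 2]
```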
Now let's think about what that means for your code. You have written
mapM (dir </>) lsfiles
and so what you will get back is a collection of lists. Each list in the collection will be exactly as long as lsfiles is. Let's focus on one of the lists in the collection; call it cs.
The first element of cs will be drawn from dir </> filename, where filename is the first element of lsfiles; that is, it will be one of the characters in dir, or a slash, or one of the characters in filename. The second element of cs will be similar: one of the characters of dir, or a slash, or one of the characters from the second filename in lsfiles. I guess you can see where this is going... there's an awful lot of possibilities here. =)
Also is it possible to compose the listDirectory do-arrow line with the map (dir </>) that prepends the path, in one line?
Yes:
loadCtx dir = do
  files <- map (dir </>) <$> listDirectory dir
  putStrLn $ "Files " ++ show files
Well according to the documentation,
type FilePath = String
That is,
type FilePath = [Char]
So in this line,
let files = mapM (dir </>) lsfiles
you have that the argument of mapM, which is (dir </>), is of type FilePath -> FilePath. Now look at the type of mapM,
mapM :: (Traversable t, Monad m) => (a -> m b) -> t a -> m (t b)
                                     ^^^^^^^^
So the type a -> m b is instantiated to FilePath -> FilePath, which is FilePath -> [Char]. So you're performing a monadic mapping using the list monad, which is the "nondeterminism" monad in this case for values of type Char.
To complement Jorge's answer, here's an exponential blowup, demonstrated:
> map ("XY" </>) ["a","b","c"]
["XY\\a","XY\\b","XY\\c"]
> mapM ("XY" </>) ["a","b","c"]
["XXX","XXY","XX\\","XXc","XYX","XYY","XY\\","XYc","X\\X","X\\Y","X\\\\",
"X\\c","XbX","XbY","Xb\\","Xbc","YXX","YXY","YX\\","YXc","YYX","YYY","YY\\","YYc",
"Y\\X","Y\\Y","Y\\\\","Y\\c","YbX","YbY","Yb\\","Ybc","\\XX","\\XY","\\X\\",
"\\Xc","\\YX","\\YY","\\Y\\","\\Yc","\\\\X","\\\\Y","\\\\\\","\\\\c","\\bX",
"\\bY","\\b\\","\\bc","aXX","aXY","aX\\","aXc","aYX","aYY","aY\\","aYc","a\\X",
"a\\Y","a\\\\","a\\c","abX","abY","ab\\","abc"]
Indeed, mapM f = sequence . map f, and sequence in the list monad computes the Cartesian product of a list of lists, ["XY\\a","XY\\b","XY\\c"] in this case, so we get 4*4*4 = 64 combinations. (Ouch!)
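Since the result is a Cartesian product, its size can be computed without enumerating it. A small sketch (blowupSize is a hypothetical name) that shows why the original program only looks like an infinite loop:

```haskell
import System.FilePath ((</>))

-- | Number of lists that 'mapM (dir </>) names' produces in the list monad:
-- the product of the lengths of the joined paths, one factor per file name.
blowupSize :: FilePath -> [FilePath] -> Integer
blowupSize dir names =
  product [fromIntegral (length (dir </> name)) | name <- names]
```

For the example above, blowupSize "XY" ["a","b","c"] is 64. A directory of twenty files whose joined paths are 30 characters each would already give 30^20 lists.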

Turtle: dealing with non-utf8 input

In my path to learning Pipes, I've run into problems when dealing with non-utf8 files. That is why I took a detour into the Turtle library to try to understand how to solve the problem there, at a higher level of abstraction.
The exercise I want to do is quite simple: count the total number of lines in all regular files reachable from a given directory. This is readily implemented by the following shell command:
find $FPATH -type f -print | xargs cat | wc -l
I've come up with the following solution:
import qualified Control.Foldl as F
import qualified Turtle as T

-- | Returns true iff the file path is not a symlink.
noSymLink :: T.FilePath -> IO Bool
noSymLink fPath = (not . T.isSymbolicLink) <$> T.stat fPath

-- | Shell that outputs the regular files in the given directory.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
  fInFPath <- T.lsif noSymLink fPath
  st <- T.stat fInFPath
  if T.isRegularFile st
    then return fInFPath
    else T.empty

-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
  file <- regularFilesIn fPath
  T.input file

-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
  count <- T.fold (inputDir fPath) F.length
  print count
This solution gives the correct result, as long as there are no non-utf8 files in the directory. If this is not the case, the program will raise an exception like the following one:
*** Exception: test/resources/php_ext_syslog.h: hGetLine: invalid argument (invalid byte sequence)
Which is to be expected since:
$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1
I was wondering how to solve the problem of reading different encodings into Text, so that the program can deal with this. For the problem at hand I guess I could avoid the conversion to Text, but I'd rather know how to do this, since you could imagine a situation in which, for instance, I would like to make a set with all the words under a certain directory.
EDIT
For what it's worth, so far the only solution I could come up with is the following:
-- Imports inferred from the qualifiers used below (in addition to the
-- Turtle import from the first listing).
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.List.NonEmpty as NE
import Data.Text.Encoding (Decoding (..), streamDecodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes as TB

mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where
    gMDecodeByteString :: (ByteString -> Decoding)
                       -> T.Shell ByteString
                       -> T.Shell T.Text
    gMDecodeByteString f bss = do
      bs <- bss
      let Some res bs' g = f bs
      if BS.null bs'
        then return res
        else gMDecodeByteString g bss

inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
  file <- regularFilesIn fPath
  text <- mDecodeByteString (TB.input file)
  T.select (NE.toList $ T.textToLines text)

-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
  count <- T.fold (inputDir' fPath) T.countLines
  print count
The problem is that this counts one extra line per file, but at least it can decode non-utf8 ByteStrings.
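Independently of Turtle, the two decoding options can be compared on a few bytes using only the text and bytestring packages. This is a sketch, not a fix for the extra-line problem; the names latin1Bytes, decodedLeniently and decodedAsLatin1 are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as Text
import Data.Text.Encoding (decodeLatin1, decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | "café" encoded as ISO-8859-1; the final byte 0xE9 is not valid UTF-8 here.
latin1Bytes :: BS.ByteString
latin1Bytes = BS.pack [0x63, 0x61, 0x66, 0xE9]

-- | Lenient UTF-8 decoding does not throw: the valid prefix is kept and the
-- invalid input is replaced by the Unicode replacement character.
decodedLeniently :: Text.Text
decodedLeniently = decodeUtf8With lenientDecode latin1Bytes

-- | Decoding with the charset that 'file' reported recovers the original text.
decodedAsLatin1 :: Text.Text
decodedAsLatin1 = decodeLatin1 latin1Bytes
```

Lenient decoding is enough for counting lines, since line breaks are unaffected. For something like collecting the set of words under a directory, the replaced characters would distort the result, so decoding with the file's actual encoding is the safer choice when it is known.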

How to grep result of ls in Turtle

I'm playing with Turtle and I'm faced with the following problem.
I want to do something like (in shell)
ls | grep 'foo'
My attempt using Turtle is
grep (prefix "foo") (ls ".") & view
But I got the following message
Couldn't match type ‘Turtle.FilePath’ with ‘Text’
Expected type: Shell Text
Actual type: Shell Turtle.FilePath
In the second argument of ‘grep’, namely ‘(ls ".")’
In the first argument of ‘(&)’, namely
‘grep (prefix "foo") (ls ".")’
I understand ls returns FilePath whereas grep works on Text, so what can I do?
Update
There are obviously solutions which involve converting back and forth between FilePath and Text. That's beyond the simplicity I would expect from a shell-like program.
Someone mentioned the find function, which could somehow solve the problem.
However, find is the equivalent of the find shell command, and I was just trying to do ls | grep "foo". I'm not trying to solve a real-life problem (if I were, I would switch to bash instead) but trying to combine simple bricks as I would in bash. Unfortunately, it doesn't seem that the bricks in Turtle are that easy to combine :-(.
Instead of grep, we can use match, in combination with the MonadPlus instance of Shell for filtering:
filterByPattern :: MonadPlus m => Pattern x -> FilePath -> m FilePath
filterByPattern somepattern somepath =
    case match somepattern (either id id (toText somepath)) of
        [] -> mzero
        _  -> return somepath

greppedls :: FilePath -> Pattern x -> Shell FilePath
greppedls somepath somepattern =
    ls somepath >>= filterByPattern somepattern
Edit: Instead of using the unnecessarily general MonadPlus, here's an implementation that filters using the turtle-specific combinator select:
filterByPattern :: Pattern x -> FilePath -> Shell FilePath
filterByPattern somepattern somepath =
    case match somepattern (either id id (toText somepath)) of
        [] -> select []          -- no matches, so filter this path
        _  -> select [somepath]  -- let this path pass
A value foo :: Shell a is a bit like a "list of as". If we have a function genlist :: a -> Shell b that for each a generates a (perhaps empty!) list of bs, we can obtain a list of bs using the (>>=) operator: foo >>= genlist.
Edit#2: The standard turtle function find already filters files using a pattern. It is recursive and searches in subdirectories.
To convert from FilePath into Text you use:
fp :: Format r (FilePath -> r)
Here is an example:
format fp ("usr" </> "lib")
There are a couple of issues about this, so Gabriel updated the tutorial a few days ago:
https://github.com/Gabriel439/Haskell-Turtle-Library/commit/a2fff2acf912cc7adb2e02671340822feb0e9172
To answer your (updated) question, the best I can come up with is:
format fp <$> ls "." & grep (has "foo") & view
& is playing the role of |.
As a personal note, it is of course not as short as ls | grep 'foo' but still quite elegant given that Haskell is a typed language.
The literal answer is this one-liner:
example =
    view (ls "." & fmap (format fp) & grep (prefix "foo") & fmap fromText)
The idiomatic answer is to use the find utility.
Try to use repr
repr :: Show a => a -> Text

Check if two directories are on the same filesystem in haskell

If I have two directories A and B, how do I tell whether they are on the same filesystem (e.g. on the same hard drive) in Haskell on OS X and Linux?
I checked System.Directory and System.FilePath.Posix, which don't seem to have anything for doing this.
The getFileStatus and deviceID functions from the unix package should help you with that.
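A minimal sketch of that suggestion, assuming a POSIX system with the unix package available (sameFilesystem is a hypothetical name):

```haskell
import System.Posix.Files (deviceID, getFileStatus)

-- | True when both paths live on the same device, i.e. the same mounted
-- filesystem. 'getFileStatus' follows symlinks, so a link is judged by
-- the device of its target.
sameFilesystem :: FilePath -> FilePath -> IO Bool
sameFilesystem a b = do
  devA <- deviceID <$> getFileStatus a
  devB <- deviceID <$> getFileStatus b
  return (devA == devB)
```

This avoids spawning an external process and does not depend on which flavor of the stat command is installed.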
One way would be to use the stat utility and write a wrapper for it yourself. stat can report the device number for a file. I tested the following code on Linux and it works for different disks (but I'm not sure about Mac OS):
import Control.Applicative ((<$>))
import System.Process
statDeviceID :: FilePath -> IO String
statDeviceID fp = readProcess "stat" ["--printf=%d", fp] ""
-- for mac which has a different version of stat
-- statDeviceID fp = readProcess "stat" ["-f", "%d", fp] ""
checkSameDevice :: [FilePath] -> IO Bool
checkSameDevice xs = (\x -> all (== head x) x) <$> (sequence $ map statDeviceID xs)
paths = ["/mnt/Books", "/home/sibi"]
main = checkSameDevice paths >>= print
In ghci:
λ> main
False -- False since /mnt is a different hard disk

What is the haskell way to copy a directory

I find myself doing more and more scripting in haskell. But there are some cases where I'm really not sure of how to do it "right".
e.g. copy a directory recursively (a la unix cp -r).
Since I mostly use linux and Mac Os I usually cheat:
import System.Cmd
import System.Exit
copyDir :: FilePath -> FilePath -> IO ExitCode
copyDir src dest = system $ "cp -r " ++ src ++ " " ++ dest
But what is the recommended way to copy a directory in a platform independent fashion?
I didn't find anything suitable on hackage.
This is the rather naive implementation I use so far:
import System.Directory
import System.FilePath((</>))
import Control.Applicative((<$>))
import Control.Exception(throw)
import Control.Monad(when,forM_)

copyDir :: FilePath -> FilePath -> IO ()
copyDir src dst = do
  whenM (not <$> doesDirectoryExist src) $
    throw (userError "source does not exist")
  whenM (doesFileOrDirectoryExist dst) $
    throw (userError "destination already exists")

  createDirectory dst
  content <- getDirectoryContents src
  let xs = filter (`notElem` [".", ".."]) content
  forM_ xs $ \name -> do
    let srcPath = src </> name
    let dstPath = dst </> name
    isDirectory <- doesDirectoryExist srcPath
    if isDirectory
      then copyDir srcPath dstPath
      else copyFile srcPath dstPath

  where
    doesFileOrDirectoryExist x = orM [doesDirectoryExist x, doesFileExist x]
    orM xs = or <$> sequence xs
    whenM s r = s >>= flip when r
Any suggestions of what really is the way to do it?
I updated this with the suggestions of hammar and FUZxxl.
...but still it feels kind of clumsy to me for such a common task!
It's possible to use the Shelly library in order to do this, see cp_r:
cp_r "sourcedir" "targetdir"
Shelly first tries to use native cp -r if available. If not, it falls back to a native Haskell IO implementation.
For further details on the type semantics of cp_r, see this post I wrote, which describes how to use cp_r with String and/or Text.
Shelly is not platform independent, since it relies on the Unix package, which is not supported under Windows.
I couldn't find anything that does this on Hackage.
Your code looks pretty good to me. Some comments:
dstExists <- doesDirectoryExist dst
This does not take into account that a file with the destination name might exist.
if or [not srcExists, dstExists] then print "cannot copy"
You might want to throw an exception or return a status instead of printing directly from this function.
paths <- forM xs $ \name -> do
    [...]
    return ()
Since you're not using paths for anything, you can change this to
forM_ xs $ \name -> do
    [...]
The filesystem-trees package provides the means for a very simple implementation:
import System.File.Tree (getDirectory, copyTo_)
copyDirectory :: FilePath -> FilePath -> IO ()
copyDirectory source target = getDirectory source >>= copyTo_ target
The MissingH package provides recursive directory traversals, which you might be able to use to simplify your code.
I assume that copyDirRecur in Path.IO, which has variants to include or exclude symlinks, may be a newer and maintained solution. It requires converting the FilePath to Path x Dir, which is done with parseRelDir or parseAbsDir respectively, but I think having a more precise data type than FilePath is worthwhile to avoid hard-to-track errors at run time.
There are also some functions for copying files and directories in the core Haskell library Cabal modules, specifically Distribution.Simple.Utils in package Cabal. copyDirectoryRecursive is one, and there are other functions near this one in that module.

Resources