Turtle: dealing with non-utf8 input

While learning Pipes, I ran into problems when dealing with non-utf8 files. That is why I took a detour into the Turtle library, to try to understand how to solve the problem there, at a higher level of abstraction.
The exercise I want to do is quite simple: count the total number of lines in all regular files reachable from a given directory. This is readily implemented by the following shell command:
find $FPATH -type f -print | xargs cat | wc -l
I've come up with the following solution:
import qualified Control.Foldl as F
import qualified Turtle as T
-- | Returns true iff the file path is not a symlink.
noSymLink :: T.FilePath -> IO Bool
noSymLink fPath = (not . T.isSymbolicLink) <$> T.stat fPath
-- | Shell that outputs the regular files in the given directory.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
    fInFPath <- T.lsif noSymLink fPath
    st <- T.stat fInFPath
    if T.isRegularFile st
        then return fInFPath
        else T.empty
-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
    file <- regularFilesIn fPath
    T.input file
-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
    count <- T.fold (inputDir fPath) F.length
    print count
This solution gives the correct result, as long as there are no non-utf8 files in the directory. If this is not the case, the program will raise an exception like the following one:
*** Exception: test/resources/php_ext_syslog.h: hGetLine: invalid argument (invalid byte sequence)
Which is to be expected since:
$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1
I was wondering how to solve the problem of reading different encodings into Text, so that the program can deal with this. For the problem at hand I guess I could avoid the conversion to Text, but I'd rather know how to do this, since you could imagine a situation in which, for instance, I would like to make a set with all the words under a certain directory.
EDIT
For what it's worth, so far the only solution I could come up with is the following:
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.List.NonEmpty as NE
import Data.Text.Encoding (Decoding (Some), streamDecodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes as TB
mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where
    gMDecodeByteString :: (ByteString -> Decoding)
                       -> T.Shell ByteString
                       -> T.Shell T.Text
    gMDecodeByteString f bss = do
        bs <- bss
        let Some res bs' g = f bs
        if BS.null bs'
            then return res
            else gMDecodeByteString g bss
inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
    file <- regularFilesIn fPath
    text <- mDecodeByteString (TB.input file)
    T.select (NE.toList $ T.textToLines text)
-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
    count <- T.fold (inputDir' fPath) T.countLines
    print count
The problem is that this counts one extra line per file, but at least it can decode non-utf8 ByteStrings.
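One possible direction, given only as a minimal sketch: read each file as a strict ByteString, try a strict UTF-8 decode, and fall back to ISO-8859-1 when that fails. The fallback encoding is an assumption (it matches the file shown above, but other files may use other encodings), and the names decodeWithFallback and countLinesIn are hypothetical. Only the bytestring and text packages are used; wiring this into a Turtle Shell is left out.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as Text
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Decode strictly as UTF-8; if the bytes are not valid UTF-8, reinterpret
-- them as ISO-8859-1. Assumption: Latin-1 is an acceptable fallback. Latin-1
-- decoding cannot fail, so this function is total.
decodeWithFallback :: BS.ByteString -> Text.Text
decodeWithFallback bytes =
  case decodeUtf8' bytes of
    Right txt -> txt
    Left _    -> decodeLatin1 bytes

-- | Count the lines in the raw contents of one file.
countLinesIn :: BS.ByteString -> Int
countLinesIn = length . Text.lines . decodeWithFallback
```

Because Text.lines does not produce a trailing empty line for a file that ends in a newline, this avoids the extra line per file. It still differs from wc -l for a file whose last line has no terminating newline: that line is counted here but not by wc.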

Related

Getting IO based on the contents of an IO stream

I have a situation where I am trying to concatenate the contents of two text files, A and B. The complication is that the location of B is specified in the contents of A. I've created a function (minimal example below), which reads A, opens B, and then just tries to stick them both together, but frankly this method seems too easy to be correct and I have a feeling that it may not be the best approach. It compiles but I'm unable to test it since it can't find the second file (presumably something to do with paths, but I've not yet figured out what). Any advice appreciated.
getIOFromIO :: IO String -> IO String
getIOFromIO orig = do
    origContents <- orig
    moreIO <- readFile origContents
    return (origContents ++ " " ++ moreIO)

The function getIOFromIO should work fine, provided you pass it an IO action that reads the first file, like:
getIOFromIO (readFile "foo.tmp")
and provided the entire contents of foo.tmp, including any preceding or trailing whitespace (like a trailing newline) are part of the desired filename.
The following self-contained example demonstrates its use:
setup :: String -> String -> IO ()
setup file1 file2 = do
    writeFile file1 file2 -- put name of file2 in file1
    writeFile file2 $ "body\n"
-- unmodified from your question
getIOFromIO :: IO String -> IO String
getIOFromIO orig = do
    origContents <- orig
    moreIO <- readFile origContents
    return (origContents ++ " " ++ moreIO)
main = do
    setup "foo.tmp" "bar.tmp"
    txt <- getIOFromIO (readFile "foo.tmp")
    print txt
It should generate the output:
"bar.tmp body\n"
 ^^^^^^^ ^^^^^^
    |      `- contents of second file (bar.tmp)
    |
    `- contents of first file (foo.tmp)
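If the first file might end with a newline or carry other surrounding whitespace, one option is to trim its contents before using them as a file name. The following is a minimal sketch of that idea; trim and getIOFromIOTrimmed are hypothetical names, not part of the question's code, and whether trimming is appropriate depends on how the first file is produced.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | Drop leading and trailing whitespace (hypothetical helper).
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Like getIOFromIO, but uses the trimmed contents as the file name.
-- The returned string still starts with the untrimmed original contents.
getIOFromIOTrimmed :: IO String -> IO String
getIOFromIOTrimmed orig = do
    origContents <- orig
    moreIO <- readFile (trim origContents)
    return (origContents ++ " " ++ moreIO)
```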

Reading first line of each file getting aborted at binary files

I am trying to read the first line of each file in the current directory:
import System.IO(IOMode(ReadMode), withFile, hGetLine)
import System.Directory (getDirectoryContents, doesFileExist, getFileSize)
import System.FilePath ((</>))
import Control.Monad(filterM)
readFirstLine :: FilePath -> IO String
readFirstLine fp = withFile fp ReadMode System.IO.hGetLine
getAbsoluteDirContents :: String -> IO [FilePath]
getAbsoluteDirContents dir = do
    contents <- getDirectoryContents dir
    return $ map (dir </>) contents
main :: IO ()
main = do
    -- get a list of all files & dirs
    contents <- getAbsoluteDirContents "."
    -- filter out dirs
    files <- filterM doesFileExist contents
    -- read first line of each file
    d <- mapM readFirstLine files
    print d
It compiles and runs, but aborts with the following error when it reaches a binary file:
mysrcfile: ./aBinaryFile: hGetLine: invalid argument (invalid byte sequence)
I want to detect and skip such files and go on to the next file.

A binary file is a file that contains byte sequences that can not be decoded to a valid string. But a binary file is not different from a text file if you do not inspect its content.
It might be better to use an "It's Easier to Ask Forgiveness than Permission (EAFP)" approach: we try to read the first line, and if that fails, we skip that file.
import Control.Exception(catch, IOException)
import System.IO(IOMode(ReadMode), withFile, hGetLine)
readFirstLine :: FilePath -> IO (Maybe String)
readFirstLine fp = withFile fp ReadMode $
    \h -> (catch (fmap Just (hGetLine h))
        ((const :: a -> IOException -> a) (return Nothing)))
For a FilePath this returns an IO (Maybe String). If we run the IO (Maybe String), it will return a Just x, with x the first line, if the line could be read, and Nothing if an IOException was encountered.
We can then make use of catMaybes :: [Maybe a] -> [a] to obtain the Just xs:
import Data.Maybe(catMaybes)
main :: IO ()
main = do
    -- get a list of all files & dirs
    contents <- getAbsoluteDirContents "."
    -- filter out dirs
    files <- filterM doesFileExist contents
    -- read first line of each file
    d <- mapM readFirstLine files
    print (catMaybes d)
Or you can make use of mapMaybeM :: Monad m => (a -> m (Maybe b)) -> [a] -> m [b] from the extra package, which will automate that work for you.
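To make the behaviour of mapMaybeM concrete without pulling in the extra package, here is a local stand-in with the same type. The primed name mapMaybeM' is hypothetical, and the library's own definition may be written differently; this sketch only combines mapM and catMaybes as in the main function above.

```haskell
import Data.Maybe (catMaybes)

-- | Run the action on every element and keep only the Just results.
-- Local stand-in for illustration; not the definition from the extra package.
mapMaybeM' :: Monad m => (a -> m (Maybe b)) -> [a] -> m [b]
mapMaybeM' f xs = catMaybes <$> mapM f xs
```

With it, the last two lines of main would collapse into a single call such as mapMaybeM' readFirstLine files >>= print.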

How do I read from a file and add the numbers in the text file in Haskell

I'm new to Haskell and IO is still a bit confusing. I have a text file that I want to read; I then want to add up the numbers in it and write the result to another text file. The file looks like the following:
2
3
The numbers are separated by a newline character. I know how to read a file's contents and then write them to another file, but I don't know how to manipulate them, or whether I have to convert the information to an Int.
module Main where
import System.Environment
-- | Reads the first line of the input file and writes it to the output file,
-- e.g. src "src.txt", dst "des.txt"
copyFirstLine :: FilePath -- ^ path to input file
              -> FilePath -- ^ path to output file
              -> IO ()
copyFirstLine src dst = do
    contect <- readFile src
    let (fst :rest) = (lines contect)
    writeFile dst fst
main = do
    [src,dst] <- getArgs
    copyFirstLine src dst
Thanks in advance.

I'm not sure what you mean by 'manipulate', but I will assume you need integer arithmetic. Manipulating the contents as a string would not be difficult either.
If you search Hoogle for the signature String -> Int, you will find read.
-- | Reads every line of the input file as an Int, adds 1 to each, and
-- writes the results to the output file, e.g. src "src.txt", dst "des.txt"
eachPlusOne :: FilePath -- ^ path to input file
            -> FilePath -- ^ path to output file
            -> IO ()
eachPlusOne src dst = do
    contect <- readFile src
    let lns = lines contect :: [String]
        ints = map ((1+) . read) lns :: [Int]
        outs = unlines . map show $ ints :: String
    writeFile dst outs
If you are using a sufficiently recent version of GHC, you can use readMaybe (from Text.Read), which is preferable because it returns Nothing on malformed input instead of throwing an error.
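Since the question asks for the sum rather than a per-line increment, the following minimal sketch combines lines, readMaybe and sum. The names sumLines and sumFile are hypothetical, and the choice to reject the whole input when any line fails to parse (including blank lines) is an assumption about the desired behaviour.

```haskell
import Text.Read (readMaybe)

-- | Parse every line as an Int and add them up.
-- Returns Nothing if any line is not a valid Int.
sumLines :: String -> Maybe Int
sumLines = fmap sum . traverse readMaybe . lines

-- | Read the numbers from src and write their sum to dst.
sumFile :: FilePath -> FilePath -> IO ()
sumFile src dst = do
    contents <- readFile src
    case sumLines contents of
        Just total -> writeFile dst (show total ++ "\n")
        Nothing    -> putStrLn "input contains a line that is not an integer"
```

For the two-line file from the question, sumLines yields Just 5.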

Better way to check for file type in Haskell

For an app which uses a directory walker, I need to know whether an entry is accessible and whether it is a real file, and I need to distinguish between file and directory entries.
I want to:
skip all soft-links, pipes and other special files.
only access files which can be read and could be written to.
only list directories which can be entered and listed.
So all files can be read and could be manipulated and reside in directories which allow that.
This is what I came up with:
fileType :: FilePath -> IO Int
fileType f = do
    -- skip any symbolic link
    l <- getSymbolicLinkStatus f
    if isSymbolicLink l
        then return 0 -- link
        else do
            s <- getFileStatus f
            if isRegularFile s
                then do
                    -- files need read and write
                    facc <- fileAccess f True True False
                    if facc
                        then return 1
                        else return 0 -- file but not RW
                else if isDirectory s
                    then do
                        -- dirs need read and execute
                        dacc <- fileAccess f True False True
                        if dacc
                            then return 2
                            else return 0 -- dir but not RX
                    else return 0 -- not a file or dir
But I am pretty unsure about the implementation and want to ask if there is something I could do to make this more concise.
For example I have a feeling that I could at least move "return" somewhere at the top. But trying this I could not get the types right.
P.S.: It is fine for me to return Int 0 1 2 (instead of a special datatype) but I don't mind if that is changed.

Instead of using Int to keep track of the different file types, you can use a sum type to represent them:
data FileType = SymbolicLink  -- Symbolic link
              | FileRead      -- File with Read Permission
              | DirRead       -- Directory with Read Permission
              | DirNoRead     -- Directory with No Read Permission
              | FileNoRead    -- File with No Read Permission
              | NotFileAndDir -- Neither File nor directory
              deriving (Show)
One pattern I can see in your code is the series of nested monadic ifs that check a condition and then return an appropriate result. You can check whether the standard library offers such an abstraction, and if it doesn't, you can write it yourself:
bdef :: (Monad m) => m Bool -> m a -> m a -> m a
bdef mb t f = mb >>= \x -> if x then t else f
In the bdef function, if mb yields True the result is the first branch (t); otherwise it is the second branch (f). Note that the monad doesn't need to be IO; it can be any monad. Once this is defined, what remains is to define the main function:
filetype :: FilePath -> IO FileType
filetype f = sym
  where
    sym = bdef (isSymbolicLink <$> getSymbolicLinkStatus f)
               (return SymbolicLink) reg
    reg = bdef (isRegularFile <$> fStatus)
               (bdef checkfRead (return FileRead) (return FileNoRead)) dir
    dir = bdef (isDirectory <$> fStatus)
               (bdef checkDRead (return DirRead) (return DirNoRead))
               (return NotFileAndDir)
    checkfRead = fileAccess f True True False
    checkDRead = fileAccess f True False True
    fStatus = getFileStatus f
Sample ghci demo:
λ> filetype "/home/sibi/test.hs"
FileRead
λ> filetype "/home/sibi"
DirRead
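To illustrate the remark that bdef works in any monad, here is a small sketch that uses it in the Maybe monad, with no file system involved. The sign function is a hypothetical example, not part of the original answer; bdef is repeated so the block stands alone.

```haskell
-- | Monadic if-then-else, as defined above.
bdef :: Monad m => m Bool -> m a -> m a -> m a
bdef mb t f = mb >>= \x -> if x then t else f

-- | Hypothetical example: classify a number that may be missing.
-- A Nothing input short-circuits the whole chain to Nothing.
sign :: Maybe Int -> Maybe String
sign mn =
    bdef ((< 0) <$> mn) (Just "negative") $
        bdef ((== 0) <$> mn) (Just "zero") (Just "positive")
```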

After reading the other answers and comments (many thanks!) I want to answer my own question and show you what I came up with:
data FileType = Skip | File | Dir
getFileType :: FilePath -> IO FileType
getFileType f = getSymbolicLinkStatus f >>= testIt
  where
    testIt s
        | isSymbolicLink s = return Skip
        | isRegularFile s  = useWhen (fileAccess f True True False) File
        | isDirectory s    = useWhen (fileAccess f True False True) Dir
        | otherwise        = return Skip
    useWhen p t = p >>= \b -> if b then return t else return Skip
What I did was:
First creating the type I really needed (Skip, File, Dir).
Then I found that I really just need to get the fileStatus once (and this should not follow a symbolic link).
After this it was trivial to see that this ends up with multiple cases and I used guards for that.
While trying "ifM" (bdef from the comment) I saw that the universal form is just too universal (now), but it is nice to have a helper to make the function more readable.
ifM actually is more something like "when" in that case.
Removing do with bind in some places because do had only one action left.

Read array-string into variable

I have a text file containing data like this:
13.
13.
[(1,2),(2,3),(4,5)].
And I want to read this into 3 variables in Haskell. The standard functions read this as strings. Assuming I get rid of the dot at the end myself, is there any built-in parser function that will make an Integer out of "13" and a [(Integer,Integer)] list out of [(1,2),(2,3),(4,5)]?

Yes, it's called read:
let i = read "13" :: Integer
let ts = read "[(1,2),(2,3),(4,5)]" :: [(Integer, Integer)]

The example text file you gave has trailing spaces as well as the full stop, so merely cutting the last character doesn't work. Let's take just the digits, using:
import Data.Char (isDigit)
Why not have a data type to store the stuff from the file:
data MyStuff = MyStuff { firstNum    :: Int
                       , secondNum   :: Int
                       , intPairList :: [(Integer, Integer)]
                       }
    deriving (Show, Read)
Now we need to read the file, and then turn it into individual lines:
getMyStuff :: FilePath -> IO MyStuff
getMyStuff filename = do
    rawdata <- readFile filename
    let [i1,i2,list] = lines rawdata
    return $ MyStuff (read $ takeWhile isDigit i1) (read $ takeWhile isDigit i2) (read $ init list)
The read function works with any data type that has a Read instance, and automatically produces data of the right type.
> getMyStuff "data.txt" >>= print
MyStuff {firstNum = 13, secondNum = 13, intPairList = [(1,2),(2,3),(4,5)]}
A better way
I'd be inclined to save myself a fair bit of work, and just write that data directly, so
writeMyStuff :: FilePath -> MyStuff -> IO ()
writeMyStuff filename somedata = writeFile filename (show somedata)
readMyStuff :: FilePath -> IO MyStuff
readMyStuff filename = fmap read (readFile filename)
(The fmap just applies the pure function read to the output of the readFile.)
> writeMyStuff "test.txt" MyStuff {firstNum=12,secondNum=42, intPairList=[(1,2),(3,4)]}
> readMyStuff "test.txt" >>= print
MyStuff {firstNum = 12, secondNum = 42, intPairList = [(1,2),(3,4)]}
You're far less likely to make little parsing or printing errors if you let the compiler sort it all out for you, it's less code, and simpler.

Haskell's strong types require you to know what you're getting. So if we forgo all error checking and optimization and assume that the file is always in the right format, you can do something like this:
data Entry = Number Integer
           | List [(Integer, Integer)]
parseYourFile :: FilePath -> IO [Entry]
parseYourFile p = do
    content <- readFile p
    return $ parseYourFormat content
-- `data` is a reserved word, so the argument is named `contents`.
parseYourFormat :: String -> [Entry]
parseYourFormat contents = map parseEntry $ lines contents
parseEntry :: String -> Entry
parseEntry line = if head line == '['
    then List $ read core
    else Number $ read core
  where core = init line
Or you could write a proper parser for it using one of the many combinator frameworks.
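Short of a full parser, a middle ground is to strip the trailing full stop and any trailing whitespace, then parse with readMaybe so that malformed lines yield Nothing instead of a runtime error. This is only a sketch under the assumption that every line has the form shown in the question; parseDotted is a hypothetical name.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import Text.Read (readMaybe)

-- | Remove trailing dots and whitespace, then parse the remainder with the
-- Read instance of the requested type. Returns Nothing on malformed input.
parseDotted :: Read a => String -> Maybe a
parseDotted = readMaybe . dropWhileEnd (\c -> c == '.' || isSpace c)
```

The result type is chosen by annotation, for example parseDotted "13. " :: Maybe Integer, or parseDotted applied to the third line with the type Maybe [(Integer, Integer)].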
