Haskell remove trailing and leading whitespace from a file - string

Let's say I have a file
mary had a little lamb
It's fleece was white as snow
Everywhere
the child went
The lamb, the lamb was sure to go, yeah
How would I read the file as a string, and remove the trailing and leading whitespace? It could be spaces or tabs. It would print like this after removing whitespace:
mary had a little lamb
It's fleece was white as snow
Everywhere
the child went
The lamb, the lamb was sure to go, yeah
Here's what I have currently:
import Data.Text as T
readTheFile = do
  handle <- openFile "mary.txt" ReadMode
  contents <- hGetContents handle
  putStrLn contents
  hClose handle
  return(contents)
main :: IO ()
main = do
  file <- readTheFile
  file2 <- (T.strip file)
  return()

Your code suggests a few misunderstandings about Haskell so let's go through your code before getting to the solution.
import Data.Text as T
You're using Text, great! I suggest you also use the IO operations that read and write Text values instead of the ones in the Prelude, which work on Strings (linked lists of characters). That is, import Data.Text.IO as T.
readTheFile = do
  handle <- openFile "mary.txt" ReadMode
  contents <- hGetContents handle
  putStrLn contents
  hClose handle
  return(contents)
Oh, hey, the use of hGetContents and manually opening and closing a file can be error prone. Consider readTheFile = T.readFile "mary.txt".
main :: IO ()
main = do
  file <- readTheFile
  file2 <- (T.strip file)
  return()
Two issues here.
Issue one: you have used strip as though it were an IO action, but it isn't. I suggest you learn more about IO and binding (do notation) versus let-bound variables. strip computes a new value of type Text, and presumably you want to do something useful with that value, like write it.
Issue two: stripping the whole file is different from stripping each line one at a time. I suggest you read mathk's answer.
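To make the second issue concrete, here is a minimal sketch comparing the two approaches. The sample text and the names sample, stripWhole, and stripEachLine are made up for illustration and are not part of the original code.

```haskell
import qualified Data.Text as T

-- Hypothetical sample input: two lines with leading/trailing spaces and tabs.
sample :: T.Text
sample = T.pack "  mary had \n\ta little lamb\t\n"

-- Strips only the very start and the very end of the whole text.
stripWhole :: T.Text -> T.Text
stripWhole = T.strip

-- Strips every line individually, then joins the lines again.
stripEachLine :: T.Text -> T.Text
stripEachLine = T.unlines . map T.strip . T.lines

main :: IO ()
main = do
  print (stripWhole sample)     -- interior line edges keep their whitespace
  print (stripEachLine sample)  -- every line is trimmed
```

With stripWhole, the space after "had" and the tab before "a" survive, because they are in the middle of the text rather than at its ends.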
So in the end I think you want:
-- Qualified imports are accessed via `T.someSymbol`
import qualified Data.Text.IO as T
import qualified Data.Text as T
-- Not really needed as a separate function unless you want to also
-- put the stripping here too.
readTheFile :: IO T.Text
readTheFile = T.readFile "mary.txt"
-- First read, then strip each line, then write a new file.
main :: IO ()
main =
  do file <- readTheFile
     let strippedFile = T.unlines $ map T.strip $ T.lines file
     T.writeFile "newfile.txt" (T.strip strippedFile)

Here is a possible solution for what you are looking for:
import qualified Data.Text as T
import qualified Data.Text.IO as T -- provides T.readFile and T.putStr
main = do
  trimmedFile <- (T.unlines . map T.strip . T.lines) <$> T.readFile "mary.txt"
  T.putStr trimmedFile
strip from Data.Text is doing the job.

Read the file, or process it one line at a time, then apply this to each line:
> intercalate " " . words $ " The lamb, the lamb was sure to go, yeah "
"The lamb, the lamb was sure to go, yeah"
However, unwords is simpler than intercalate " " and, being part of the Prelude, does not have to be imported.
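A small sketch of the unwords approach follows. One caveat worth knowing: unwords . words also squeezes every interior run of whitespace down to a single space. If interior spacing must be preserved, trimming only the ends with dropWhile and dropWhileEnd is an alternative. The names trimLine and trimEnds are made up for this example.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Trim using only Prelude functions.
-- Caveat: interior runs of whitespace collapse to single spaces.
trimLine :: String -> String
trimLine = unwords . words

-- Trim only the two ends; interior whitespace is left untouched.
trimEnds :: String -> String
trimEnds = dropWhileEnd isSpace . dropWhile isSpace

main :: IO ()
main = do
  putStrLn (trimLine "  The lamb,   the lamb\twas sure to go  ")
  putStrLn (trimEnds "  The lamb,   the lamb\twas sure to go  ")
```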

Related

Case conversion of large dictionary word set

I'm trying to match words from a dictionary, case-insensitively. My initial approach
looks like so:
read dict; convert all words to lowercase, store in set.
check new word for membership in set
Is there a better (more efficient) way to achieve this? I'm new to Haskell.
import System.IO
import Data.Text (toLower, pack, unpack)
import Data.Set (fromList, member)
main = do
  let path = "/usr/share/dict/american-english"
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- hGetContents h
  let mySet = (fromList . map (unpack . toLower . pack) . lines) contents
  putStrLn $ show $ member "acadia" mySet
I would just work with Text directly instead of converting to/from Strings.
Data.Text.IO contains versions of hGetContents, readFile, etc. for reading Text from files, and Data.Text has lines for Text.
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Set as S
main = do
  let path = "/usr/share/dict/american-english"
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- T.hGetContents h
  let mySet = (S.fromList . map T.toLower . T.lines) contents
  putStrLn $ show $ S.member "acadia" mySet
By using T.toLower and T.lines we avoid explicit pack/unpack calls.
mySet is now a set of Text values rather than of Strings. By using
the OverloadedStrings pragma the literal "acadia" will be interpreted
as a Text value.
Yes, what you propose is reasonable. A few remarks, mostly unrelated to the main question:
It would be more efficient to restrict yourself to using only Text and not String.
Prefer the toCaseFold function to toLower; it is intended for case-insensitive comparison.
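As a rough sketch of that suggestion, the set can store case-folded words, and each query is folded the same way before the lookup. The helper names mkDict and memberCI and the inline sample words are made up for illustration; a real program would read the dictionary file instead.

```haskell
import qualified Data.Set as S
import qualified Data.Text as T

-- Build a set of case-folded words from dictionary text (one word per line).
mkDict :: T.Text -> S.Set T.Text
mkDict = S.fromList . map T.toCaseFold . T.lines

-- Case-insensitive membership: fold the query exactly as the dictionary was folded.
memberCI :: T.Text -> S.Set T.Text -> Bool
memberCI w = S.member (T.toCaseFold w)

main :: IO ()
main = do
  let dict = mkDict (T.pack "Acadia\nLamb\nzebra\n")  -- hypothetical sample words
  print (memberCI (T.pack "ACADIA") dict)
  print (memberCI (T.pack "snow") dict)
```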
Even though you found my first answer helpful, let me propose another approach...
A boggle solver I wrote simply reads in the entire dictionary as a single ByteString, and to look up words performs a binary search on that ByteString.
The dictionary must already be in sorted order and normalized to lower case, but usually this is not a problem since the dictionary is static
and known in advance.
Of course, when you compute (lo+hi)/2 while performing the binary search, you might land in the middle of a word, so you simply back up to the beginning of the current word.
The main advantage of this is that loading the dictionary is extremely fast and it is memory efficient. Moreover, the search algorithm has good memory locality. I haven't measured it, but I wouldn't be surprised if creating a Data.Set will more than double the size of the raw data.
The code is available here: https://github.com/erantapaa/hoggle
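The idea can be sketched roughly as follows. This is an illustration written for this answer's description, not code taken from the linked repository. It assumes the dictionary is newline-separated and sorted in byte order (for example with LC_ALL=C sort); the names lineStart and memberSorted are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B

-- Walk back from offset i to the first byte of the line that contains it.
lineStart :: B.ByteString -> Int -> Int
lineStart bs i
  | i <= 0                     = 0
  | B.index bs (i - 1) == '\n' = i
  | otherwise                  = lineStart bs (i - 1)

-- Binary search over byte offsets of a sorted, newline-separated word list.
-- lo is always the start of a line; hi is a line start or the end of input.
memberSorted :: B.ByteString -> B.ByteString -> Bool
memberSorted dict w = go 0 (B.length dict)
  where
    go lo hi
      | lo >= hi  = False
      | otherwise =
          let mid   = (lo + hi) `div` 2
              start = lineStart dict mid          -- back up to the word start
              word  = B.takeWhile (/= '\n') (B.drop start dict)
              next  = start + B.length word + 1   -- first byte after this line
          in case compare w word of
               EQ -> True
               LT -> go lo start
               GT -> go next hi

main :: IO ()
main = do
  let dict = B.pack "apple\nbanana\ncherry\n"  -- hypothetical sample dictionary
  print (map (memberSorted dict . B.pack) ["banana", "cherry", "berry"])
```

No per-word structure is built: the lookup works directly on the bytes as loaded, which is where the fast loading and low memory use come from.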

Read file with UTF-8 in Haskell as IO String

I have the following code, which works fine unless the file has UTF-8 characters:
module Main where
import Ref
main = do
  text <- getLine
  theInput <- readFile text
  writeFile ("a"++text) (unlist . proc . lines $ theInput)
With UTF-8 characters I get this:
hGetContents: invalid argument (invalid byte sequence)
Since the file I'm working with has UTF-8 characters, I would like to handle this exception in order to reuse the functions imported from Ref if possible.
Is there a way to read a UTF-8 file as IO String so I can reuse the functions from Ref? What modifications should I make to my code? Thanks in advance.
I attach the functions declarations from my Ref module:
unlist :: [String] -> String
proc :: [String] -> [String]
from prelude:
lines :: String -> [String]
This can be done with just GHC's basic (but extended from the standard) System.IO module, although you'll then have to use more functions:
module Main where
import Ref
import System.IO
main = do
  text <- getLine
  inputHandle <- openFile text ReadMode
  hSetEncoding inputHandle utf8
  theInput <- hGetContents inputHandle
  outputHandle <- openFile ("a"++text) WriteMode
  hSetEncoding outputHandle utf8
  hPutStr outputHandle (unlist . proc . lines $ theInput)
  hClose outputHandle -- I guess this one is optional in this case.
Thanks for the answers, but I found the solution by myself.
Actually, the file I was working with has this encoding:
ISO-8859 text, with CR line terminators
So for my Haskell code to work with that file, it needs to have this encoding instead:
UTF-8 Unicode text, with CR line terminators
You can check a file's encoding with the file utility like this:
$ file filename
To change the file's encoding, follow the instructions from this link!
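The conversion can also be done from Haskell itself with System.IO. The sketch below assumes the input is specifically ISO-8859-1 (Latin-1); the file utility only reports "ISO-8859", so that is an assumption, and other ISO-8859 variants would need a different encoding. The function name latin1ToUtf8 and the sample file names are made up for illustration.

```haskell
import System.IO

-- Copy src to dst, decoding it as Latin-1 and re-encoding it as UTF-8.
latin1ToUtf8 :: FilePath -> FilePath -> IO ()
latin1ToUtf8 src dst =
  withFile src ReadMode $ \hIn ->
    withFile dst WriteMode $ \hOut -> do
      hSetEncoding hIn latin1
      hSetEncoding hOut utf8
      -- hPutStr consumes the lazy contents while both handles are still open.
      hGetContents hIn >>= hPutStr hOut

main :: IO ()
main = do
  -- Create a small Latin-1 sample file (hypothetical name) to convert.
  withFile "sample-latin1.txt" WriteMode $ \h -> do
    hSetEncoding h latin1
    hPutStrLn h "caf\233"  -- \233 is e with an acute accent
  latin1ToUtf8 "sample-latin1.txt" "sample-utf8.txt"
```

Alternatively, calling hSetEncoding with latin1 on the input handle of the original program would let it read the file without converting it first.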
Use System.IO.Encoding.
The lack of Unicode support is a well-known problem with the standard Haskell IO library.
{-# LANGUAGE ImplicitParams #-}
module Main where
import Prelude hiding (readFile, getLine, writeFile)
import System.IO.Encoding
import Data.Encoding.UTF8
import Ref
main = do
  let ?enc = UTF8
  text <- getLine
  theInput <- readFile text
  writeFile ("a" ++ text) (unlist . proc . lines $ theInput)

Haskell - loop over user input

I have a program in Haskell that has to read an arbitrary number of lines of input from the user, and when the user is finished, the accumulated input has to be sent to a function.
In an imperative programming language this would look like this:
content = ''
while True:
    line = readLine()
    if line == 'q':
        break
    content += line
func(content)
I find this incredibly difficult to do in Haskell, so I would like to know if there's a Haskell equivalent.
The Haskell equivalent to iteration is recursion. You would also need to work in the IO monad, if you have to read lines of input. The general picture is:
import Control.Monad
main = do
  line <- getLine
  unless (line == "q") $ do
    -- process line
    main
If you just want to accumulate all read lines in content, you don't have to do that. Just use getContents which will retrieve (lazily) all user input. Just stop when you see the 'q'. In quite idiomatic Haskell, all reading could be done in a single line of code:
main = mapM_ process . takeWhile (/= "q") . lines =<< getContents
  where process line = do -- whatever you like, e.g.
          putStrLn line
If you read the first line of code from right to left, it says:
get everything that the user will provide as input (never fear, this is lazy);
split it in lines as it comes;
only take lines as long as they're not equal to "q", stop when you see such a line;
and call process for each line.
If you haven't figured it out already, you need to carefully read a Haskell tutorial!
It's reasonably simple in Haskell. The trickiest part is that you want to accumulate the sequence of user inputs. In an imperative language you use a loop to do this, whereas in Haskell the canonical way is to use a recursive helper function. It would look something like this:
getUserLines :: IO String -- optional type signature
getUserLines = go ""
  where go contents = do
          line <- getLine
          if line == "q"
            then return contents
            else go (contents ++ line ++ "\n") -- add a newline
This is actually a definition of an IO action which returns a String. Since it is an IO action, you access the returned string using the <- syntax rather than the = assignment syntax. If you want a quick overview, I recommend reading The IO Monad For People Who Simply Don't Care.
You can use this function at the GHCi prompt like this:
>>> str <- getUserLines
Hello<Enter> -- user input
World<Enter> -- user input
q<Enter> -- user input
>>> putStrLn str
Hello -- program output
World -- program output
Using pipes-4.0, which is coming out this weekend:
import Pipes
import qualified Pipes.Prelude as P
f :: [String] -> IO ()
f = undefined -- placeholder: your own function goes here
main = do
  contents <- P.toListM (P.stdinLn >-> P.takeWhile (/= "q"))
  f contents
That loads all the lines into memory. However, you can also process each line as it is being generated, too:
f :: String -> IO ()
main = runEffect $
  for (P.stdinLn >-> P.takeWhile (/= "q")) $ \str -> do
    lift (f str)
That will stream the input and never load more than one line into memory.
You could do something like
import Control.Applicative ((<$>))
input <- unlines . takeWhile (/= "q") . lines <$> getContents
Then input would be what the user wrote up until (but not including) the q.
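As a complete program, that fragment might look like the sketch below. The pure part is pulled out into a helper (the name beforeQuit is made up here) so it can be tried without any IO, and putStr merely stands in for the question's func.

```haskell
-- Pure part: keep everything before the first line that is exactly "q".
beforeQuit :: String -> String
beforeQuit = unlines . takeWhile (/= "q") . lines

main :: IO ()
main = do
  input <- beforeQuit <$> getContents
  putStr input  -- stand-in for the question's func(content)
```

Because getContents is lazy, the program stops consuming standard input once the "q" line has been seen.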

Haskell: why does this code fail?

When I try to run this code...
module Main where
import qualified Data.Text.Lazy.IO as LTIO
import qualified Data.Text.Lazy as LT
import System.IO (IOMode(..), withFile)
getFirstLine :: FilePath -> IO String
getFirstLine path =
  withFile path ReadMode (\f -> do
    contents <- LTIO.hGetContents f
    return ("-- "++(LT.unpack . head $ LT.lines contents)++" --"))
main::IO()
main = do
  firstLine <- getFirstLine "/tmp/foo.csv"
  print firstLine
I get
"-- *** Exception: Prelude.head: empty list
... where I would expect it to print the first line of "/tmp/foo.csv". Could you please explain why? Ultimately, I'm trying to figure out how to create a lazy list of Texts from file input.
As Daniel Lyons mentions in a comment, this is due to IO and laziness interacting.
Imagine, if you will:
withFile opens the file, to file handle f.
Thunk using contents of f is returned.
withFile closes the file.
Thunk is evaluated. There are no contents in a closed file.
This trap is mentioned on the HaskellWiki / Maintaining laziness page.
To fix, you can either read the whole file contents within withFile (possibly by forcing it with seq) or lazily close the file instead of using withFile.
I think it's like this: withFile closes the file after executing the function. hGetContents reads the contents lazily (lazy IO), and by the time it needs to read the stuff, the file is closed.
Instead of using withFile, try just using openFile and not closing the handle yourself. hGetContents puts the handle into a semi-closed state, and it is closed once the contents have been read completely. Or better, just read the contents directly using readFile.
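Another way to sidestep the problem, if only the first line is needed, is to read it strictly while the handle is still open. The sketch below swaps the lazy Data.Text.Lazy.IO functions for the strict hGetLine from Data.Text.IO; that substitution, the helper name decorate, and the sample file name are choices made for this example, not part of the original code.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (IOMode(..), withFile)

-- Pure formatting step, kept separate from the IO.
decorate :: T.Text -> String
decorate line = "-- " ++ T.unpack line ++ " --"

-- Strict read: the line is in memory before withFile closes the handle.
-- Note: hGetLine throws an end-of-file error if the file is empty.
getFirstLine :: FilePath -> IO String
getFirstLine path = withFile path ReadMode $ \h ->
  decorate <$> TIO.hGetLine h

main :: IO ()
main = do
  writeFile "foo-sample.csv" "id,name\n1,mary\n"  -- hypothetical sample file
  getFirstLine "foo-sample.csv" >>= putStrLn
```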

Better data stream reading in Haskell

I am trying to parse an input stream where the first line tells me how many lines of data there are. I'm ending up with the following code, and it works, but I think there is a better way. Is there?
main = do
  numCases <- getLine
  proc $ read numCases
proc :: Integer -> IO ()
proc numCases
  | numCases == 0 = return ()
  | otherwise = do
      str <- getLine
      putStrLn $ findNextPalin str
      proc (numCases - 1)
Note: The code solves the Sphere problem https://www.spoj.pl/problems/PALIN/ but I didn't think posting the rest of the code would impact the discussion of what to do here.
Use replicate and sequence_.
main, proc :: IO ()
main = do numCases <- getLine
          sequence_ $ replicate (read numCases) proc
proc = do str <- getLine
          putStrLn $ findNextPalin str
sequence_ takes a list of actions, and runs them one after the other, in sequence. (Then it throws away the results; if you were interested in the return values from the actions, you'd use sequence.)
replicate n x makes a list of length n, with each element being x. So we use it to build up the list of actions we want to run.
Dave Hinton's answer is correct, but as an aside here's another way of writing the same code:
import Control.Applicative
main = (sequence_ . proc) =<< (read <$> getLine)
proc x = replicate x (putStrLn =<< (findNextPalin <$> getLine))
Just to remind everyone that do blocks aren't necessary! Note that in the above, both =<< and <$> stand in for plain old function application. If you ignore both operators, the code reads exactly the same as similarly-structured pure functions would. I've added some gratuitous parentheses to make things more explicit.
Their purpose is that <$> applies a regular function inside a monad, while =<< does the same but then compresses an extra layer of the monad (e.g., turning IO (IO a) into IO a).
The interesting part of looking at code this way is that you can mostly ignore where the monads and such are; typically there's very few ways to place the "function application" operators to make the types work.
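The difference between the two operators can be seen in a small sketch. It uses Maybe instead of IO so the results can be printed directly, and the function halve is made up for this example.

```haskell
-- A function whose result is itself wrapped in Maybe.
halve :: Int -> Maybe Int
halve n
  | even n    = Just (n `div` 2)
  | otherwise = Nothing

-- <$> applies the function inside the Maybe, leaving two layers.
nested :: Maybe (Maybe Int)
nested = halve <$> Just 8

-- =<< applies it and flattens the two layers into one.
flat :: Maybe Int
flat = halve =<< Just 8

main :: IO ()
main = do
  print nested  -- Just (Just 4)
  print flat    -- Just 4
```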
You (and the previous answers) should work harder to divide up the IO from the logic. Make main gather the input and separately (purely, if possible) do the work.
import Control.Monad -- not needed, but cleans some things up
main = do
  numCases <- liftM read getLine
  lines <- replicateM numCases getLine
  let results = map findNextPalin lines
  mapM_ putStrLn results
When solving SPOJ problems in Haskell, try not to use standard strings at all. ByteStrings are much faster, and I've found you can usually ignore the number of tests and just run a map over everything but the first line, like so:
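{-# OPTIONS_GHC -O2 -optc-O2 #-}
import qualified Data.ByteString.Lazy.Char8 as BS
main :: IO ()
main = do
  -- The first line (the number of test cases) is ignored.
  (_:ls) <- BS.lines `fmap` BS.getContents
  -- Assumes findNextPalin :: BS.ByteString -> BS.ByteString
  mapM_ (BS.putStrLn . findNextPalin) ls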
The SPOJ page in the Haskell Wiki gives a lot of good pointers about how to read Ints from ByteStrings, as well as how to deal with large quantities of input. It'll help you avoid exceeding the time limit.
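As one illustration of reading Ints from ByteStrings, here is a minimal sketch using readInt from Data.ByteString.Lazy.Char8. It is not taken from the wiki page; the helper name readInts is made up, and summing the numbers is only a placeholder for real work.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Maybe (mapMaybe)

-- Parse every line that starts with a decimal integer; other lines are skipped.
readInts :: BS.ByteString -> [Int]
readInts = mapMaybe (fmap fst . BS.readInt) . BS.lines

main :: IO ()
main = do
  input <- BS.getContents
  print (sum (readInts input))  -- placeholder for the real per-case work
```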

Resources