I'm trying to write a simple web crawler in Haskell, just for practice. To my own astonishment, neither the web request itself nor parsing the web site was complicated at all.
I wrote the program in a purely functional style with a recursive function, but after only some forty or fifty web requests the program eats up all the memory.
So I tried to do the task with dynamic programming, but here I'm totally stuck: I have no idea where to begin. This tiny program produces so many errors that I can't figure out where to start.
This is my current concept:
scanPage :: String -> IO (String,String,[String])
scanPage url = ....
crawler :: String -> IO [(String, Int)]
crawler startUrl = runST $ do
    toVisit <- newSTRef [startUrl] :: ST s (STRef s [String])
    visited <- newSTRef [] :: ST s (STRef s [String])
    result <- newSTRef [] :: ST s (STRef s [(String, Int)])
    -- Iterate over urls to visit
    while (liftM not $ liftM null $ readSTRef toVisit) $ do
        url <- fmap (head) (readSTRef toVisit)
        (moreUrls, value_a, value_b) <- scanPage url
        -- Mark page as visited
        vis <- readSTRef visited
        writeSTRef visited (url : vis)
        -- Add Results
        res <- readSTRef result
        writeSTRef result ((value_a, value_b) : res)
        -- Extend urls to visit
        nextUrls <- readSTRef toVisit
        writeSTRef toVisit (nextUrls ++ (moreUrls \\ vis))
    -- End of while
    return =<< readSTRef result
main = do
    putStrLn =<< fmap show (crawler "http://starturl.com")
I already wrote a lot of programs like this with arrays, which are much more convenient, as I can simply write or read from or to array elements. So I thought I could use mutable vectors for these lists, but they can't grow (at least in the same instance) or shrink. So I ended up with simple lists in STRef.
The first line I can't get to work is the line with the while command. I wrote my own while function like this
while :: (Monad m) => m Bool -> m a -> m ()
while cond action = do
    c <- cond
    when c $ do
        action
        while cond action
because I couldn't find any other while command. I googled for many days for mutable vectors, but was not able to find a single tutorial or even an example that I could use here. Can anyone tell me how to write a syntactically correct crawler function? Yes, a purely functional solution would be nicer and more "haskellish", but I still consider myself a beginner, and all this monad stuff is still a bit strange to me. I'm willing to learn, but a hint or even an example would be really awesome.
EDIT:
Here comes some pseudocode of my messy code.
toVisitList = startURL
visitedList = []
resultList = []
while (length toVisitList /= 0) {
    url = head toVisitList -- Get the 1st element
    toVisitList -= url -- Remove this url from list
    visitedList += url -- Append url to visitedList
    (moreUrls, val_a, val_b) = scanPage url
    resultList += (val_a, val_b) -- append the result
    toVisitList += (moreUrls - visitedList)
}
return resultList
EDIT:
I still haven't a clue how to turn this pseudocode into real code, especially the while statement. Any hints appreciated.
The natural data structure for your toVisitList is a queue. There are a few implementations of queues around, but for this purpose, the simplest thing is to just use Data.Sequence.Seq. This lets you add things to the end with |> or <>, and to view the beginning with viewl. Consider something like
crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))
crawlOnce toVisitList visitedList uses viewl to look at the front of the queue of URLs to visit. If the queue is empty, it returns Left visitedList. Otherwise, it visits the first URL, appends it to the visited list, adds the newly discovered URLs to the queue, and wraps the new queue and visited list up in Right.
There are several reasonable variations. For instance, you could go for a type like ExceptT [Url] (StateT (Seq Url, [Url]) IO) a that "throws" its final result.
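A minimal sketch of that loop, under a few assumptions that are not part of the signature above: the page fetcher is passed in as a parameter (here called fetch), the monad is generalised from IO to any Monad m so the loop can be tried without network access, Url is taken to be String, and the driver crawl is an illustrative name.

```haskell
import Data.Sequence (Seq, ViewL (..), viewl, (><))
import qualified Data.Sequence as Seq

type Url = String -- assumption: URLs are plain strings

-- One step of the crawl. Left means the queue is empty and we are done;
-- Right carries the updated queue and visited list.
crawlOnce :: Monad m
          => (Url -> m [Url])   -- hypothetical fetcher: the links found on a page
          -> Seq Url -> [Url]
          -> m (Either [Url] (Seq Url, [Url]))
crawlOnce fetch toVisit visited =
  case viewl toVisit of
    EmptyL -> return (Left visited)
    url :< rest
      -- A URL can be queued twice before it is visited, so check again here.
      | url `elem` visited -> return (Right (rest, visited))
      | otherwise -> do
          found <- fetch url
          let visited' = visited ++ [url]
              new      = filter (`notElem` visited') found
          return (Right (rest >< Seq.fromList new, visited'))

-- Repeat crawlOnce until it reports Left. This plays the role of the while loop.
crawl :: Monad m => (Url -> m [Url]) -> Url -> m [Url]
crawl fetch start = go (Seq.singleton start) []
  where
    go queue visited =
      crawlOnce fetch queue visited >>= either return (uncurry go)
```

The visited list is kept as a plain list to match the signature above; for a large crawl a Data.Set would make the membership tests cheaper.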
Related
I am trying to make a function that takes a list of strings and executes putStrLn or print (I think they are basically equivalent; please correct me if I am wrong, as I'm still new to Haskell) on every element, so that each one is printed on my terminal. I was experimenting with the map function and with lambda/anonymous functions; I already know how to do this recursively but wanted to try a non-recursive version. map returned a list of type [IO ()], which was not what I was going for, and my attempts at lambda functions did not go according to plan. The basic code was:
test :: [String] -> something
test x = map (\a->putStrLn a) x -- output for this function would have to be [IO()]
Not entirely sure what the output of the function was supposed to be either which also gave me issues.
I was thinking of making a temp :: String variable and have each String appended to temp and then putStrLn temp but was not sure how to do that entirely. I thought using where would be viable but I still ran into issues. I know how to do this in languages like Java and C but I am still quite new to Haskell. Any help would be appreciated.
There is a special version of map that works with monadic functions, it's called mapM:
test :: [String] -> IO [()]
test x = mapM putStrLn x
Note that this way the return type of test is a list of units - that's because each call to putStrLn returns a unit, so result of applying it to each element in a list would be a list of units. If you'd rather not deal with this silliness and have the return type be a plain unit, use the special version mapM_:
test :: [String] -> IO ()
test x = mapM_ putStrLn x
I was thinking of making a temp :: String variable and have each String appended to temp and then putStrLn temp
Good idea. A pattern of "render the message" then a separate "emit the message" is often nice to have long term.
test xs = let temp = unlines (map show xs)
in putStrLn temp
Or just
test xs = putStrLn (unlines (show <$> xs))
Or
test = putStrLn . unlines . map show
Not entirely sure what the output of the function was supposed to be either which also gave me issues.
Well you made a list of IO actions:
test :: [String] -> [IO ()]
test x = map (\a->putStrLn a) x
So with this list of IO actions when do you want to execute them? Now? Just once? The first one many times the rest never? In what order?
Presumably you want to execute them all now, in order. Let's also eta reduce (\a -> putStrLn a) to just putStrLn since that means the same thing:
test :: [String] -> IO ()
test x = sequence_ (map putStrLn x)
I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files so I thought that I could make this run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
    ex <- mapM exec xs
    as <- ex `usingIO` parList rdeepseq
    return as
where 'exec' is defined as:
exec = runX . process
threadscope shows only one thread ever being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]
data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRect] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show)-- Left Top Width Height
instance NFData CV_scene where
    rnf = foldScene reduceScene
        where reduceScene l t = rnf (seq t l)
instance NFData CV_layer where
    rnf = foldLayer reduceLayer
        where reduceLayer d s = rnf (seq s d)
instance NFData Rect where
    rnf = foldRect reduceRect
        where reduceRect l t w h = rnf [l,t,w,h]
type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
    -- A. Parse the file found at each location via exec.
    ex <- mapM exec xs
    -- B. Force the lazy parsing in parallel.
    as <- ex `usingIO` parList rdeepseq
    return as
Note that line A does not happen in parallel, which you might think is okay since it will just set up the parsing thunks, which are then forced in parallel in B. This is a fair assumption, and a clever use of laziness, but your results call it into question.
I suspect that the implementation of exec forces most of the parsing before line B is even reached, so that the deep seq doesn't do much. That fits pretty well with my experience of parsing, and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First try separating the parsing of the file from the IO and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
contents <- mapM readFile xs
as <- map exec' contents `usingIO` parList rdeepseq
with exec' the pure portion of exec that runs after the file has been read from disk, so it takes the file contents rather than a path:
exec' :: String -> [CV_scene]
Also, you may not even need rdeepseq after this change.
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads which act more like services, rather than for pure computations. But if for some reason you can't get the strategies-based approach to work, this might be worth a try.
{-# LANGUAGE TupleSections #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.STM (atomically) -- from the stm package
import Control.Concurrent.STM.TChan (TChan, newTChanIO, readTChan, writeTChan)
pExec'' :: [FilePath] -> IO [[CV_scene]]
pExec'' xs = do
    -- A. create [(FilePath, TChan [CV_scene])]
    tcx <- mapM (\x -> (x,) <$> newTChanIO) xs
    -- B. do the reading/parsing in separate threads
    mapM_ (forkIO . exec'') tcx
    -- C. Collect the results, in the same order as xs
    mapM (atomically . readTChan . snd) tcx
exec'' :: (FilePath, TChan [CV_scene]) -> IO ()
exec'' (x, tch) = do
    -- D. The original exec function
    cv <- exec x
    -- E. Put on the channel FIFO buffer
    atomically $ writeTChan tch cv
Good luck!
Consider the two following variations:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = go []
  where
    go :: [String] -> IO [String]
    go l = do {
        inp <- getLine;
        if (inp == "") then
            return l;
        else go (inp:l);
    }
myReadListOrdinary :: IO [String]
myReadListOrdinary = do
    inp <- getLine
    if inp == "" then
        return []
    else
        do
            moreInps <- myReadListOrdinary
            return (inp:moreInps)
In ordinary programming languages, one would know that the tail recursive variant is a better choice.
However, going through this answer, it is apparent that haskell's implementation of recursion is not similar to that of using the recursion stack repeatedly.
But because in this case the program in question involves actions, and a strict monad, I am not sure if the same reasoning applies. In fact, I think in the IO case, the tail recursive form is indeed better. I am not sure how to correctly reason about this.
EDIT: David Young pointed out that the outermost call here is to (>>=). Even in that case, does one of these styles have an advantage over the other?
FWIW, I'd go for existing monadic combinators and focus on readability/conciseness. Using unfoldM :: Monad m => m (Maybe a) -> m [a]:
import Control.Monad (liftM, mfilter)
import Control.Monad.Loops (unfoldM)
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM go
  where
    go :: IO (Maybe String)
    go = do
        line <- getLine
        return $ case line of
            "" -> Nothing
            s -> Just s
Or using MonadPlus instance of Maybe, with mfilter :: MonadPlus m => (a -> Bool) -> m a -> m a:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM (liftM (mfilter (/= "") . Just) getLine)
Another, more versatile option, might be to use LoopT.
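For readers without the monad-loops package, here is a rough base-only sketch of what an unfoldM-style combinator does. The names unfoldM', nextLine and readUntilBlank are made up for this illustration, and nextLine is a hypothetical stand-in for getLine that pops lines from an IORef so the loop can be tried without a terminal. It is not claimed to be how the library implements unfoldM.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- Run the step repeatedly, collecting results until it yields Nothing.
unfoldM' :: Monad m => m (Maybe a) -> m [a]
unfoldM' step = go
  where
    go = step >>= maybe (return []) (\x -> fmap (x :) go)

-- Hypothetical stand-in for getLine: pops the next line from an IORef,
-- returning "" once the input is exhausted.
nextLine :: IORef [String] -> IO String
nextLine ref = atomicModifyIORef' ref pop
  where
    pop []       = ([], "")
    pop (l : ls) = (ls, l)

-- Same shape as the getLine version: stop at the first blank line.
readUntilBlank :: IORef [String] -> IO [String]
readUntilBlank ref = unfoldM' (nonBlank <$> nextLine ref)
  where
    nonBlank "" = Nothing
    nonBlank s  = Just s
```

Note that in IO this version still returns only after the whole list has been read, so it behaves like the other strict variants with respect to interactivity.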
That’s really not how I would write it, but it’s clear enough what you’re doing. (By the way, if you want to be able to efficiently insert arbitrary output from any function in the chain, without using monads, you might try a Data.ByteString.Builder.)
Your first implementation is very similar to a left fold, and your second very similar to a right fold or map. (You might try actually writing them as such!) The second one has several advantages for I/O. One of the most important, for handling input and output, is that it can be interactive.
You’ll notice that the first builds the entire list from the outside in: in order to determine what the first element of the list is, the program needs to compute the entire structure to get to the innermost thunk, which is return l. The program generates the entire data structure first, then starts to process it. That’s useful when you’re reducing a list, because tail-recursive functions and strict left folds are efficient.
With the second, the outermost thunk contains the head and tail of the list, so you can grab the tail, then call the thunk to generate the second list. This can work with infinite lists, and it can produce and return partial results.
Here’s a contrived example: a program that reads in one integer per line and prints the sums so far.
main :: IO ()
main = interact( display . compute 0 . parse . lines )
  where parse :: [String] -> [Int]
        parse [] = []
        parse (x:xs) = (read x):(parse xs)
        compute :: Int -> [Int] -> [Int]
        compute _ [] = []
        compute accum (x:xs) = let accum' = accum + x
                               in accum':(compute accum' xs)
        display = unlines . map show
If you run this interactively, you’ll get something like:
$ 1
1
$ 2
3
$ 3
6
$ 4
10
But you could also write compute tail-recursively, with an accumulating parameter:
main :: IO ()
main = interact( display . compute [] . parse . lines )
  where parse :: [String] -> [Int]
        parse = map read
        compute :: [Int] -> [Int] -> [Int]
        compute xs [] = reverse xs
        compute [] (y:ys) = compute [y] ys
        compute (x:xs) (y:ys) = compute (x+y:x:xs) ys
        display = unlines . map show
This is an artificial example, but strict left folds are a common pattern. If, however, you write either compute or parse with an accumulating parameter, this is what you get when you try to run interactively, and hit EOF (control-D on Unix, control-Z on Windows) after the number 4:
$ 1
$ 2
$ 3
$ 4
1
3
6
10
This left-folded version needs to compute the entire data structure before it can read any of it. That can’t ever work on an infinite list (When would you reach the base case? How would you even reverse an infinite list if you did?) and an application that can’t respond to user input until it quits is a deal-breaker.
On the other hand, the tail-recursive version can be strict in its accumulating parameter, and will run more efficiently, especially when it’s not being consumed immediately. It doesn’t need to keep any thunks or context around other than its parameters, and it can even re-use the same stack frame. A strict accumulating function, such as Data.List.foldl', is a great choice whenever you’re reducing a list to a value, not building an eagerly-evaluated list of output. Functions such as sum, product or any can’t return any useful intermediate value. They inherently have to finish the computation first, then return the final result.
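To make the contrast concrete, here is a small sketch using only base. The names total and runningSums are chosen for this illustration: total reduces with a strict left fold and can only answer once the input ends, while runningSums is built with scanl1 and yields each partial sum as soon as the corresponding input element is available.

```haskell
import Data.List (foldl')

-- Reducing to a single value: a strict left fold runs in constant space,
-- but has no useful result until the whole list has been consumed.
total :: [Int] -> Int
total = foldl' (+) 0

-- Producing a list of outputs: scanl1 emits each running sum lazily,
-- so it also works on an infinite input.
runningSums :: [Int] -> [Int]
runningSums = scanl1 (+)
```

For example, take 3 (runningSums [1 ..]) terminates, whereas total [1 ..] never would.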
I am currently working on a program in Haskell that analyzes a website and tries to find all links (href) that belong to this website. I was already able to extract all the links of the main site, but I am struggling with the recursion, since I want to follow the links I already found and do the same process again.
This is what i have already:
parseHtml = fmap LB.unpack . simpleHttp
filterFunc x y = -- damn long line with a lot of filters
main :: IO ()
main = do
    let site = "https://stackoverflow.com/"
    url <- parseHtml site
    let links = filterFunc site url
    mapM_ print $ take 5 $ links
And this is my output so far:
"https://stackoverflow.com/company/about"
"https://stackoverflow.com/company/work-here"
"https://stackoverflow.com/help"
"https://stackoverflow.com/jobs/directory/developer-jobs"
"https://stackoverflow.com/questions/44691577/stream-versus-iterators-in-set"
I just need a hint on how to further proceed and how to visit the already found links again. Should I work with fold?
Link finding is essentially a graph traversal problem, which can be tricky in Haskell because of functional purity: it's hard to explicitly mark nodes (links) as visited or not through the use of an external history table.
Your typical traversal algorithm might look something like this:
function traverse(current_node) {
    if (current_node.is_visited) {
        return some_data;
    } else {
        current_node.is_visited = true; // Hard in Haskell!
        accumulated_data = ...;
        for (child in current_node.children()) {
            accumulated_data += traverse(child); // Recursion happens here.
        }
        return accumulated_data;
    }
}
Because there is not an easy, direct way to mark a node as visited or not, we can try other solutions. For instance, we might consider something of the sort:
traverse :: ([URL], Data) -> URL -> ([URL], Data)
traverse (history, datum) current = let ... in ([new_history], accumulated_data)
The idea here is as follows: we keep an explicit list of URLs that we have visited. This allows us to quickly return from the current node (URL) if it appears in our history list (perhaps a Set for optimization? :)). In this case, each subsequent call to a child node using traverse would get the new_history list, effectively keeping track of which URLs are visited and which are unvisited.
One possible way to implement this is using a fold function such as foldl:
foldl :: Foldable t => (b -> a -> b) -> b -> t a -> b
Here type t a might be [URL], that denotes the children of the current link, and our traverse function conveniently has the type signature (b -> a -> b), where type b = ([URL], Data) and type a = URL.
Can you take it from here and figure out how to combine traverse and foldl?
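For readers who want to check their attempt, one possible shape is sketched below on a pure, in-memory link graph rather than real HTTP. Everything here is an assumption made for illustration: the graph is a Data.Map from URL to child URLs, Data is simplified to a page count, and the function is called visit rather than traverse because traverse already exists in the Prelude.

```haskell
import Data.List (foldl')
import qualified Data.Map as Map

type URL = String
type Graph = Map.Map URL [URL] -- assumption: links are known up front

-- Same shape as traverse above: take (history, data) and the current URL,
-- return the updated (history, data).
visit :: Graph -> ([URL], Int) -> URL -> ([URL], Int)
visit graph (history, count) current
  | current `elem` history = (history, count) -- already seen: nothing to do
  | otherwise =
      let children = Map.findWithDefault [] current graph
      -- Fold over the children, threading the updated history through each call.
      in foldl' (visit graph) (current : history, count + 1) children
```

Because the history returned by one child is the history passed to the next, a page reachable by several paths is only counted once, and cycles terminate.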
Simply move your link-visiting logic into a separate function which takes a link as a parameter, and then recurse on the links, as you intuited.
Depending on what you want to ultimately do with the links, you can for instance simply fold the links with your function.
For example, slightly modifying your code:
parseHtml = fmap LB.unpack . simpleHttp
filterFunc x y = -- damn long line with a lot of filters
visitLink :: String -> IO ()
visitLink site = do
    url <- parseHtml site
    let links = filterFunc site url
    mapM_ print $ take 5 $ links -- or whatever you want to do on your links
    mapM_ visitLink links -- the recursive call
main :: IO ()
main = visitLink "https://stackoverflow.com/"
If, rather than printing the links as you go, you would rather return them, tweak the return type of the visitLink function (for instance String -> IO [String]) and change the last line of visitLink suitably (for instance fmap join $ mapM visitLink links).
As mentioned in another answer, keep in mind that with such simple code you may visit the same link infinitely many times. Consider storing the links you visit in a suitable data structure (such as a set) that you pass to visitLink.
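A minimal sketch of that last suggestion, assuming the page fetching and filtering are bundled into one function passed in as fetchLinks (a hypothetical parameter standing in for parseHtml plus filterFunc), and generalising from IO to any Monad m so it can be exercised without network access:

```haskell
import Control.Monad (foldM)
import qualified Data.Set as Set

type Url = String

-- Visit a link unless it is already in the visited set, then recurse on the
-- links found there. The set is threaded through every recursive call.
visitAll :: Monad m
         => (Url -> m [Url])   -- hypothetical: fetch a page and extract its links
         -> Set.Set Url        -- links visited so far
         -> Url
         -> m (Set.Set Url)
visitAll fetchLinks visited url
  | url `Set.member` visited = return visited
  | otherwise = do
      links <- fetchLinks url
      foldM (visitAll fetchLinks) (Set.insert url visited) links
```

Starting it with Set.empty and the root URL returns every reachable link exactly once; the recursion is depth-first, so a very deep site may still warrant an explicit queue.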
You have a sequence of actions that prefer to be executed in chunks due to some high-fixed overhead like packet headers or making connections. The limit is that sometimes the next action depends on the result of a previous one in which case, all pending actions are executed at once.
Example:
mySession :: Session IO ()
mySession = do
    a <- readit -- nothing happens yet
    b <- readit -- nothing happens yet
    c <- readit -- nothing happens yet
    if a -- all three readits execute because we need a
        then write "a"
        else write "..."
    if b || c -- b and c already available
        ...
This reminds me of so many Haskell concepts but I can't put my finger on it.
Of course, you could do something obvious like:
[a,b,c] <- batch([readit, readit, readit])
But I'd like to hide the fact of chunking from the user for slickness purposes.
Not sure if Session is the right word. Maybe you can suggest a better one? (Packet, Batch, Chunk and Deferred come to mind.)
Update
I think there was a really good answer last night that I read on my phone but when I came back to look for it today it was gone. Was I dreaming?
I don't think you can do exactly what you want, since what you describe exploits haskell's lazy evaluation to have the evaluation of a force the actions that compute b and c, and there's no way to seq on unspecified values.
What I could do was hack together a monad transformer that delayed actions sequenced via >> so that they could be executed all together:
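import Control.Monad (ap)
data Session m a = Session { pending :: [ m () ], final :: m a }
runSession :: Monad m => Session m a -> m a
runSession (Session ms ma) = foldr (flip (>>)) (return ()) ms >> ma
-- Functor and Applicative are superclasses of Monad since GHC 7.10
instance Monad m => Functor (Session m) where
    fmap f (Session ms ma) = Session ms (fmap f ma)
instance Monad m => Applicative (Session m) where
    pure = Session [] . return
    (<*>) = ap
instance Monad m => Monad (Session m) where
    return = pure
    s >>= f = Session [] $ runSession s >>= (runSession . f)
    (Session ms ma) >> (Session ms' ma') =
        Session (ms' ++ (ma >> return ()) : ms) ma'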
This violates some monad laws, but lets you do something like:
liftIO :: IO a -> Session IO a
liftIO = Session []
exampleSession :: Session IO Int
exampleSession = do
    liftIO $ putStrLn "one"
    liftIO $ putStrLn "two"
    liftIO $ putStrLn "three"
    liftIO $ putStrLn "four"
    trace "five" $ return 5
and get
ghci> runSession exampleSession
five
one
two
three
four
5
ghci> length (pending exampleSession)
4
This is very similar to what Haxl does.
For more info:
Open sourcing haxl - Facebook Code Blog
ICFP 2014 talk
You could use the unsafeInterleaveIO function. It is a dangerous function that can introduce bugs to your program if not used carefully, but it does what you're asking for.
You can insert it into your example code like this:
lazyReadits :: IO [a]
lazyReadits = unsafeInterleaveIO $ do
    a <- readit
    r <- lazyReadits
    return (a:r)
unsafeInterleaveIO makes the action as a whole lazy, but once it starts evaluating it will evaluate as if it had been strict. This means in my above example: readit will run as soon as something tests whether the returned list is empty or not. If I'd used mapM unsafeInterleaveIO (replicate 3 readit) instead, then readit would only be run when the actual elements of the list are evaluated, which would make the contents of the list depend on the order in which its elements are inspected, which is one example of how unsafeInterleaveIO can introduce bugs.
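The deferred behaviour can be observed with a small experiment. In this sketch readit is a hypothetical stand-in that bumps a counter in an IORef and returns the new value (the real readit from the question is not specified), so the counter shows how many reads have actually run.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical stand-in for readit: increments a counter and returns it,
-- so the counter records how many reads have been performed.
readit :: IORef Int -> IO Int
readit ref = modifyIORef' ref (+ 1) >> readIORef ref

-- Same structure as lazyReadits above, with the counter passed explicitly.
-- Each cons cell of the result is only produced when something demands it.
lazyReadits :: IORef Int -> IO [Int]
lazyReadits ref = unsafeInterleaveIO $ do
  a <- readit ref
  r <- lazyReadits ref
  return (a : r)
```

Binding the result of lazyReadits performs no reads at all; inspecting the first three elements of the list performs exactly three, in list order. That is the batching-on-demand effect asked for, with the caveat stated above that evaluation order now decides when the side effects happen.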