I'm having a little trouble with HXT: I am trying to locate all the nodes in a document that match some criteria, and I'm trying to combine the lenses/XPaths as predicates in an OR-like fashion, using Control.Arrow.<+>, as this guide suggests. However, when I try to "run" the arrow on my document, I am getting duplicate results. Is there an easy way to remove the duplicates, or to combine the tests in a more meaningful way? Here is my code:
run :: App -> IO ()
run a = do
    inputContents <- readFile (input a)
    let doc = readString [withParseHTML yes, withWarnings no] inputContents
    links <- runX . xshow $ doc >>> indentDoc //> cssLinks
    mapM_ putStrLn links
cssLinks = links >>> (rels <+> hrefs <+> types)
  where
    links = hasName "link"
    rels  = hasAttrValue "rel" (isInfixOf "stylesheet")
    hrefs = hasAttrValue "href" (endswith ".css")
    types = hasAttrValue "type" (== "text/css")
Yet every time I run this (on any web page), I get duplicated results / nodes. I noticed that <+> is part of the ArrowPlus typeclass, which mimics a monoid, and ArrowXml is a subclass of both ArrowList and ArrowTree, which gives me a lot to work with. Would I have to construct ArrowIf predicates? Any help with this would be wonderful :)
You can run the arrow to get its result as a list of trees ([XmlTree]), remove the duplicates with Data.List.nub, and only then render the string representation.
import qualified Data.List as L
import "hxt" Text.XML.HXT.DOM.ShowXml as SX -- package-qualified import, needs the PackageImports extension
...
    links <- runX $ doc >>> indentDoc //> cssLinks
    -- first remove duplicates (L.nub), then apply SX.xshow
    putStrLn (SX.xshow . L.nub $ links)
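To see where the duplicates come from, it can help to model a list arrow as a plain function a -> [b]. The sketch below is only that model, not HXT itself: Link, isA, plus and the sample data are hypothetical names, and plus stands in for <+> under the assumption that <+> appends the results of both branches. A node that passes several filters is then emitted once per filter.

```haskell
import Data.List (isInfixOf, isSuffixOf, nub)

-- Hypothetical stand-in for a <link> element: rel, href and type attributes.
data Link = Link { rel :: String, href :: String, typ :: String }
  deriving (Eq, Show)

-- A "list arrow" modelled as a function returning zero or more results.
type Filter a = a -> [a]

-- Keep the input if the predicate holds, otherwise produce nothing.
isA :: (a -> Bool) -> Filter a
isA p x = [x | p x]

-- Model of <+>: run both filters on the same input and append the results.
plus :: Filter a -> Filter a -> Filter a
plus f g x = f x ++ g x

rels, hrefs, types :: Filter Link
rels  = isA (isInfixOf "stylesheet" . rel)
hrefs = isA (isSuffixOf ".css" . href)
types = isA ((== "text/css") . typ)

-- Emits a link once per matching filter, hence the duplicates.
withDuplicates :: [Link] -> [Link]
withDuplicates = concatMap (rels `plus` hrefs `plus` types)

-- Fix 1: drop duplicates after the fact.
deduplicated :: [Link] -> [Link]
deduplicated = nub . withDuplicates

-- Fix 2: OR the predicates so each link is tested (and emitted) once.
singleTest :: [Link] -> [Link]
singleTest = concatMap (isA matches)
  where
    matches l = isInfixOf "stylesheet" (rel l)
             || isSuffixOf ".css" (href l)
             || typ l == "text/css"

sample :: [Link]
sample =
  [ Link "stylesheet" "main.css" "text/css"   -- matches all three filters
  , Link "icon" "favicon.ico" "image/x-icon"  -- matches none
  ]
```

The two fixes are not quite equivalent: nub also collapses two genuinely identical links that appear at different places in the document, while the single combined test keeps both. Within HXT, the orElse combinator from ArrowIf is one candidate for the second style, since it only tries its right-hand arrow when the left one produces nothing; that mapping is a pointer to check against the HXT documentation rather than tested code.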
I'm trying to code a simple web crawler in Haskell just for practice. To my own astonishment, neither the web request itself nor parsing the web site was at all complicated.
I coded the program in a purely functional style with a recursive function, but after only some forty or fifty web requests the program eats up all the memory.
So I tried to do the task with dynamic programming, but here I'm totally stuck: I have no idea where to begin. This tiny program produces so many errors that I can't figure out where to start.
This is my current concept:
scanPage :: String -> IO (String,String,[String])
scanPage url = ....
crawler :: String -> IO [(String, Int)]
crawler startUrl = runST $ do
    toVisit <- newSTRef [startUrl] :: ST s (STRef s [String])
    visited <- newSTRef [] :: ST s (STRef s [String])
    result <- newSTRef [] :: ST s (STRef s [(String, Int)])
    -- Iterate over urls to visit
    while (liftM not $ liftM null $ readSTRef toVisit) $ do
        url <- fmap (head) (readSTRef toVisit)
        (moreUrls, value_a, value_b) <- scanPage url
        -- Mark page as visited
        vis <- readSTRef visited
        writeSTRef visited (url : vis)
        -- Add Results
        res <- readSTRef result
        writeSTRef result ((value_a, value_b) : res)
        -- Extend urls to visit
        nextUrls <- readSTRef toVisit
        writeSTRef toVisit (nextUrls ++ (moreUrls \\ vis))
    -- End of while
    return =<< readSTRef result
main = do
    putStrLn =<< fmap show (crawler "http://starturl.com")
I already wrote a lot of programs like this with arrays, which are much more convenient, as I can simply write or read from or to array elements. So I thought I could use mutable vectors for these lists, but they can't grow (at least in the same instance) or shrink. So I ended up with simple lists in STRef.
The first line I can't get to work is the line with the while command. I wrote my own while function like this
while :: (Monad m) => m Bool -> m a -> m ()
while cond action = do
    c <- cond
    when c $ do
        action
        while cond action
because I couldn't find any other while command. I googled for many days for mutable vectors, but was not able to find a single tutorial or even an example that I could use here. Please, can anyone tell me how to write a syntactically correct crawler function? Yes, a purely functional solution would be nicer and more "haskellish", but I still consider myself a beginner and all this monad stuff is still a bit strange to me. I'm willing to learn, but a hint or even an example would be really awesome.
EDIT:
Here comes some pseudocode of my messy code.
toVisitList = startURL
visitedList = []
resultList = []
while (length toVisitList /= 0) {
    url = head toVisitList -- Get the 1st element
    toVisitList -= url -- Remove this url from list
    visitedList += url -- Append url to visitedList
    (moreUrls, val_a, val_b) = scanPage url
    resultList += (val_a, val_b) -- append the result
    toVisitList += (moreUrls - visitedList)
}
return resultList
EDIT:
I still don't have any clue how to put this pseudocode into real code, especially the while statement. Any hints appreciated.
The natural data structure for your toVisitList is a queue. There are a few implementations of queues around, but for this purpose, the simplest thing is to just use Data.Sequence.Seq. This lets you add things to the end with |> or <>, and to view the beginning with viewl. Consider something like
crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))
crawlOnce toVisitList visitedList uses viewl to look at the front of the queue of URLs to visit. If it's empty, it returns Left visitedList. Otherwise, it visits the first URL, appends it to the visited list, and adds the newly discovered URLs to the queue of URLs to visit, then wraps both up in Right.
There are several reasonable variations. For instance, you could go for a type like ExceptT [Url] (StateT (Seq Url, [Url]) IO) a that "throws" its final result.
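A minimal sketch of the queue-based idea follows, under a few assumptions that are not in the original: Url is just String, the page scanner is passed in as a parameter (scan) so the step function works in any monad, and fakeScan is a hypothetical in-memory "web" used instead of real HTTP requests. The visited list is built in reverse and flipped at the end.

```haskell
import Data.Sequence (Seq, ViewL (..), viewl, (><))
import qualified Data.Sequence as Seq
import qualified Data.Set as Set

type Url = String

-- One step of the crawl: Left means "queue empty, here is everything
-- visited"; Right carries the updated queue and visited list.
crawlOnce :: Monad m
          => (Url -> m [Url])   -- hypothetical page scanner
          -> Seq Url -> [Url]
          -> m (Either [Url] (Seq Url, [Url]))
crawlOnce scan toVisit visited =
  case viewl toVisit of
    EmptyL -> pure (Left (reverse visited))
    url :< rest
      | url `elem` visited -> pure (Right (rest, visited))  -- already seen, skip
      | otherwise -> do
          found <- scan url
          let seen = Set.fromList (url : visited)
              new  = filter (`Set.notMember` seen) found
          pure (Right (rest >< Seq.fromList new, url : visited))

-- Repeat crawlOnce until the queue is empty.
crawl :: Monad m => (Url -> m [Url]) -> Url -> m [Url]
crawl scan start = go (Seq.singleton start) []
  where
    go queue visited = do
      step <- crawlOnce scan queue visited
      case step of
        Left done                -> pure done
        Right (queue', visited') -> go queue' visited'

-- A tiny in-memory "web" so the sketch runs without network access.
fakeScan :: Url -> IO [Url]
fakeScan "a" = pure ["b", "c"]
fakeScan "b" = pure ["a", "c"]
fakeScan "c" = pure ["d"]
fakeScan _   = pure []
```

Calling crawl fakeScan "a" visits the pages breadth-first. For a long crawl, keeping visited as a Set rather than a list would avoid the linear membership check on every step; the list is kept here only to stay close to the types suggested above.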
Starting out with Haskell and Yesod, probably getting a bit too far with Yesod relative to Haskell :)
I build entities using Persistent via
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Game
    title String
    company String
    UniqueTitle title
    deriving Show
Tag
    label String
    description String Maybe
    UniqueLabel label
    deriving Show
GameTag
    gameId GameId
    tagId TagId
    UniqueGameTag gameId tagId
|]
-- Yesod related code ...
In main I have
main :: IO ()
main = do
    let taggings = fromFile :: [(Game, Tag)] -- fromFile code not included
    runStderrLoggingT $ withSqlitePool ":inmemory:" 10 $ \pool -> liftIO $ do
        runResourceT $ flip runSqlPool pool $ do
            runMigration migrateAll
            let (g, t) = head taggings
            gid <- insert g
            tid <- insert t
            insert (GameTag gid tid)
        warp 3000 $ App pool
Doing this I get the first relation into the database, and by selecting elements from the list I can add more 'by hand', but I can't figure out how to get all the relations into the database by somehow iterating over taggings. How do I define a function that I can map over taggings :: [(Game, Tag)] and that inserts the game tags of the type GameTag constructed by Persistent?
The main trick here isn't in pulling out the function, that is easy:
f (g, t) = do
    gid <- insert g
    tid <- insert t
    insert (GameTag gid tid)
The trick is knowing how to use this.... Standard map won't work alone, because the function is defined in a monad (you can use it, it will just give you a list of actions back without running them).
map f taggings -- returns just a list, type [ResourceT IO a], doesn't run anything
Here are two ways to actually run the actions from within main.
sequence (map f taggings) --sequentially runs the actions in the list
or, the more readable
forM taggings f
or, slightly more verbosely,
forM taggings $ \tagging -> do
    f tagging
You might also want to look at mapM. You should also learn about forM_ and sequence_, which discard the (often useless) return values.
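The difference between building actions and running them can be seen without a database. In this sketch insertPair is a hypothetical stand-in for the Persistent insert (it records the pair in an IORef and returns a made-up row id), and the game and tag names are invented sample data.

```haskell
import Control.Monad (forM, forM_)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical stand-in for a database insert: log the pair and return
-- the number of rows logged so far as a fake id.
insertPair :: IORef [(String, String)] -> (String, String) -> IO Int
insertPair logRef pair = do
  modifyIORef logRef (pair :)
  length <$> readIORef logRef

demo :: IO (Int, Int, [Int], Int)
demo = do
  logRef <- newIORef []
  let taggings = [("Doom", "fps"), ("Myst", "puzzle")]
      -- map only builds a list of actions; none of them has run yet.
      pending  = map (insertPair logRef) taggings
  before <- length <$> readIORef logRef
  -- forM runs each action in order and collects the results.
  ids <- forM taggings (insertPair logRef)
  -- forM_ runs them as well, but discards the results.
  forM_ taggings (insertPair logRef)
  after <- length <$> readIORef logRef
  pure (length pending, before, ids, after)
```

Here demo reports two pending actions but zero logged rows after the map, which is the "list of actions without running them" behaviour described above. In the Yesod program the same pattern would be a forM_ taggings f inside the runSqlPool block.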
I am trying to create dynamic links using the Heist templating system. The problem is that the links are appearing as text rather than being interpreted as HTML. Is there a specific method to create dynamic lists like this with Heist?
The function where the link is constructed:
renderCategories :: Monad m => Db.Category -> I.Splice m
renderCategories (Db.Category catid catname catdesc) =
    I.runChildrenWithText [ ("categoryId", T.concat $ ["<a href='http://localhost:8000/thread_home?cateid=", T.pack . show $ catid, "'>", T.pack . show $ catid, "</a>"])
                          , ("categoryName", catname)
                          , ("categoryDesc", catdesc)]
The tag appears as the literal text "<a href='http://localhost:8000/thread_home?cateid=1'>1</a>" on the webpage, and the page source shows it escaped as follows:
&lt;a href='http://localhost:8000/thread_home?cateid=1'&gt;1&lt;/a&gt;
I figure that I need to have it print the actual < and > but I am not sure how to achieve this.
I am currently using runChildrenWithText to populate this Heist template, and changing to runChildrenWith requires splices instead of text. Rather than attempting that, I am hoping there is some way to use runChildrenWithText without the '<' and '>' being converted to '&lt;' and '&gt;'.
Any help is appreciated!
EDIT
I am trying to manually create the link using:
renderCategories :: Monad m => Db.Category -> I.Splice m
renderCategories (Db.Category catid catname catdesc) =
    I.runChildrenWith [ ("categoryId", return $ X.Element "a"[("href", "http://localhost")] $ X.TextNode (T.pack $ show catid))]
However I am encountering two errors:
Couldn't match type `X.Node' with `[X.Node]'
Expected type: I.Splice m
Actual type: heist-0.11.1:Heist.Types.HeistT m m X.Node
In the expression:
return
$ X.Element "a" [("href", "http://localhost")]
$ X.TextNode (T.pack $ show catid)
and
Couldn't match expected type `[X.Node]' with actual type `X.Node'
In the return type of a call of `X.TextNode'
In the second argument of `($)', namely
`X.TextNode (T.pack $ show catid)'
I do not really understand these errors at the moment and any help is appreciated.
Working function for both returning the link and normal text:
renderCategories :: Monad m => Db.Category -> I.Splice m
renderCategories (Db.Category catid catname catdesc) =
    I.runChildrenWith [ ("categoryId", return $ [X.Element "a" [("href", T.concat $ ["http://localhost:8000/thread_home?cateid=", T.pack $ show catid] )] [X.TextNode (T.pack $ show catid)] ] )
                      , ("categoryName", I.textSplice catname)
                      , ("categoryDesc", I.textSplice catdesc)]
The behavior you are seeing is exactly what is intended. You are having problems because you're using runChildrenWithText, a higher-level function designed for situations where you are returning text nodes. It is meant for when you want that actual text on your page, so markup characters in it are escaped; what you are seeing is the correct way to achieve that.
A splice is a computation that returns a list of nodes.
type Splice n = HeistT n n [Node]
Node is a representation of the DOM as Haskell types, so if you want to return a link, you should do something like this:
return $ [Element "a" [("href", "http://localhost")] [TextNode (T.pack $ show catid)]]
To use this kind of a splice, you'll need to use runChildrenWith instead of runChildrenWithText.
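The distinction between a text node and an element node can be illustrated with a toy model. The Node type, escape and render below are hypothetical and only mirror the shape of the Element and TextNode constructors used above; they are not the xmlhtml or Heist implementation.

```haskell
-- A toy DOM: text, or an element with a name, attributes and children.
data Node
  = TextNode String
  | Element String [(String, String)] [Node]

-- Escape the characters that would otherwise be read as markup.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- Render a node: text is escaped, elements become real tags.
render :: Node -> String
render (TextNode t) = escape t
render (Element name attrs children) =
  "<" ++ name ++ concatMap attr attrs ++ ">"
    ++ concatMap render children
    ++ "</" ++ name ++ ">"
  where
    attr (k, v) = " " ++ k ++ "='" ++ escape v ++ "'"

-- Markup smuggled in as text comes out escaped ...
asText :: String
asText = render (TextNode "<a href='/x'>1</a>")

-- ... while the same link built as an element comes out as a tag.
asElement :: String
asElement = render (Element "a" [("href", "/x")] [TextNode "1"])
```

In this model asText renders with &lt; and &gt; in place of the angle brackets, while asElement renders as an actual anchor tag, which is the same split as between a text splice and a splice that returns Element nodes.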
If this manual creation of Nodes seems ugly to you, there's also a more convenient option. If you import the module Text.Blaze.Renderer.XmlHtml, you'll find functions there that let you generate Node trees using blaze-html syntax.
All the examples I've seen so far using the Haskell XML Toolbox, HXT, use runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? It seems to be a pure operation to me, so I don't understand why I'm forced to be inside IO.
You can use HXT's xread along with runLA to parse an XML string outside of IO.
xread has the following type:
xread :: ArrowXml a => a String XmlTree
This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.
runLA is like runX, but for things of type LA:
runLA :: LA a b -> a -> [b]
LA is an instance of ArrowXml.
To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:
{-# LANGUAGE Arrows #-}
module Main where
import qualified Data.Map as M
import Text.XML.HXT.Arrow
classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
        cls <- getAttrValue "class" -< div
        val <- deep getText -< div
        returnA -< (cls, val)
getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
  where lookupValues cs m = map (flip M.lookup m) cs
xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
      \<div class='c3'>123</div><div class='c4'>234</div></div>"
values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml
main = print values
classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.
It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.
A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can: it gives you more control over the configuration of the parser, error messages, etc.
Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:
classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
  where pairs = getAttrValue "class" &&& deep getText
Travis Brown's answer was very helpful. I just want to add my own solution here, which I think is a bit more general (using the same functions, just ignoring the problem-specific issues).
I was previously unpickling with:
upIO :: XmlPickler a => String -> IO [a]
upIO str = runX $ readString [] str >>> arrL (maybeToList . unpickleDoc xpickle)
which I was able to change to this:
upPure :: XmlPickler a => String -> [a]
upPure str = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle)) str
I completely agree with him that doing this gives you less control over the configuration of the parser etc, which is unfortunate.
I'm trying to come to terms with Haskell's XML Toolbox (HXT) and I'm hitting a wall somewhere, because I don't seem to fully grasp arrows as a computational tool.
Here's my problem, which I hoped to illustrate a little better using a GHCi session:
> let parse p = runLA (xread >>> p) "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> :t parse
parse :: LA XmlTree b -> [b]
So parse is a small helper function that applies whatever arrow I give it to the trivial XML document
<root>
<a>foo</a>
<b>bar</b>
<c>baz</c>
</root>
I define another helper function, this time to extract the text below a node with a given name:
> let extract s = getChildren >>> isElem >>> hasName s >>> getChildren >>> getText
> :t extract
extract :: (ArrowXml cat) =>
    String -> cat (Data.Tree.NTree.TypeDefs.NTree XNode) String
> parse (extract "a" &&& extract "b") -- extract two nodes' content.
[("foo","bar")]
With the help of this function, it's easy to use the &&& combinator to pair up the text of two different nodes, and then, say, pass it to a constructor, like this:
> parse (extract "a" &&& extract "b" >>^ arr (\(a,b) -> (b,a)))
[("bar","foo")]
Now comes the part I don't understand: I want to left-factor! extract calls getChildren on the root-node twice. Instead, I'd like it to only call it once! So I first get the child of the root node
> let extract' s = hasName s >>> getChildren >>> getText
> :t extract'
extract' :: (ArrowXml cat) => String -> cat XmlTree String
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[]
Note, that I've tried to re-order the calls to, say, isElem, etc. in order to find out if that's the issue. But as it stands, I just don't have any idea why this isn't working. There is an arrow 'tutorial' on the Haskell wiki and the way I understood it, it should be possible to do what I want to do that way — namely use &&& in order to pair up the results of two computations.
It does work, too, but only at the start of the arrow chain, not midway through, when I already have some results that I want to keep 'shared.' I have the feeling that I just can't wrap my head around a difference in ideas between normal function composition and arrow notation. I'd be very appreciative of any pointers! (Even if it is just to some generic arrow tutorial that goes a little more in-depth than the one on the Haskell wiki.)
Thank you!
If you convert the arrow to (and then from) a deterministic version this works as expected:
> let extract' s = unlistA >>> hasName s >>> getChildren >>> getText
> parse (listA (getChildren >>> isElem) >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This isn't really satisfactory, though, and I can't remember off the top of my head why (&&&) behaves this way with a nondeterministic arrow (I'd personally use the proc/do notation for anything much more complicated than this).
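One plausible reading of the empty result can be checked with a small model. It assumes that a list arrow behaves like a function a -> [b] and that &&& hands the same single input to both sides and pairs up every combination of their results. Tree, children, both and the other names are hypothetical stand-ins, not HXT's own definitions.

```haskell
import Control.Monad ((>=>))

-- A toy XML tree: named elements with children, or text.
data Tree = Node String [Tree] | Text String

children :: Tree -> [Tree]
children (Node _ cs) = cs
children (Text _)    = []

hasName :: String -> Tree -> [Tree]
hasName n t@(Node m _) | n == m = [t]
hasName _ _                     = []

getText :: Tree -> [String]
getText (Text s) = [s]
getText _        = []

-- Model of (&&&): both sides see the SAME input, and every combination
-- of their results is paired up.
both :: (a -> [b]) -> (a -> [c]) -> a -> [(b, c)]
both f g x = [(b, c) | b <- f x, c <- g x]

doc :: Tree
doc = Node "root"
  [ Node "a" [Text "foo"], Node "b" [Text "bar"], Node "c" [Text "baz"] ]

-- Starts from the parent, so each side can look at all the children.
extract :: String -> Tree -> [String]
extract n = children >=> hasName n >=> children >=> getText

-- Starts from one child, so it can only ever match that child.
extract' :: String -> Tree -> [String]
extract' n = hasName n >=> children >=> getText

fromRoot :: [(String, String)]
fromRoot = both (extract "a") (extract "b") doc

perChild :: [(String, String)]
perChild = (children >=> both (extract' "a") (extract' "b")) doc
```

In this model fromRoot yields the expected pair, while perChild is empty: once the children have been split apart, each one is fed to the pair on its own, and no single child is named both "a" and "b", so one side is always empty. Under the same reading, collecting the children with listA first and calling unlistA inside each branch works because both branches then receive the whole list again.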
UPDATE: There seems to be something weird going on here with runLA and xread. If you use runX and readString everything works as expected:
> let xml = "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> let parse p = runX (readString [] xml >>> p)
> let extract' s = getChildren >>> hasName s >>> getChildren >>> getText
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This means you have to run the parser in the IO monad, but there are advantages to using runX anyway (better error messages, etc.).