I'm trying to come to terms with the Haskell XML Toolbox (HXT) and I'm hitting a wall somewhere, because I don't seem to fully grasp arrows as a computational tool.
Here's my problem, which I hope to illustrate a little better using a GHCi session:
> let parse p = runLA (xread >>> p) "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> :t parse
parse :: LA XmlTree b -> [b]
So parse is a small helper function that applies whatever arrow I give it to the trivial XML document
<root>
<a>foo</a>
<b>bar</b>
<c>baz</c>
</root>
I define another helper function, this time to extract the text below a node with a given name:
> let extract s = getChildren >>> isElem >>> hasName s >>> getChildren >>> getText
> :t extract
extract :: (ArrowXml cat) =>
String -> cat (Data.Tree.NTree.TypeDefs.NTree XNode) String
> parse (extract "a" &&& extract "b") -- extract two nodes' content.
[("foo","bar")]
With the help of this function, it's easy to use the &&& combinator to pair up the text of two different nodes, and then, say, pass it to a constructor, like this:
> parse (extract "a" &&& extract "b" >>^ arr (\(a,b) -> (b,a)))
[("bar","foo")]
Now comes the part I don't understand: I want to left-factor! extract calls getChildren on the root node twice. Instead, I'd like it to be called only once! So I first get the children of the root node and then apply both extractions to them:
> let extract' s = hasName s >>> getChildren >>> getText
> :t extract'
extract' :: (ArrowXml cat) => String -> cat XmlTree String
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[]
Note that I've tried to re-order the calls to, say, isElem, etc. in order to find out if that's the issue. But as it stands, I just don't have any idea why this isn't working. There is an arrow 'tutorial' on the Haskell wiki and the way I understood it, it should be possible to do what I want to do that way, namely use &&& in order to pair up the results of two computations.
It does work, too, but only at the start of the arrow chain, not midway through, when I already have some results that I want to keep 'shared.' I have the feeling that I just can't wrap my head around a difference in ideas between normal function composition and arrow notation. I'd be very appreciative of any pointers! (Even if it is just to some generic arrow tutorial that goes a little more in-depth than the one on the Haskell wiki.)
Thank you!
If you convert the arrow to (and then from) a deterministic version this works as expected:
> let extract' s = unlistA >>> hasName s >>> getChildren >>> getText
> parse (listA (getChildren >>> isElem) >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This isn't really satisfactory, though, and I can't remember off the top of my head why (&&&) behaves this way with a nondeterministic arrow (I'd personally use the proc/do notation for anything much more complicated than this).
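One possible reading of the empty result, sketched with a toy model rather than HXT itself: if a list arrow is treated as a function from one input to a list of outputs, then f &&& g pairs every result of f with every result of g for the same input. The sketch below uses Kleisli [] from base as a stand-in for LA; the Node type and the names children, named and text are hypothetical substitutes for XmlTree, getChildren, hasName and getText. After children, each child is fed to the pair separately, and no single child is named both "a" and "b", so every pairing is empty.

```haskell
import Control.Arrow (Kleisli (..), (&&&), (>>>))

-- A toy stand-in for an XML tree (not an HXT type).
data Node = Elem String [Node] | Text String

-- Assumption: a list arrow behaves like a function a -> [b].
type ListArrow a b = Kleisli [] a b

children :: ListArrow Node Node
children = Kleisli $ \n -> case n of
  Elem _ cs -> cs
  Text _    -> []

named :: String -> ListArrow Node Node
named s = Kleisli $ \n -> case n of
  Elem t _ | t == s -> [n]
  _                 -> []

text :: ListArrow Node String
text = Kleisli $ \n -> case n of
  Text t -> [t]
  _      -> []

doc :: Node
doc = Elem "root"
  [Elem "a" [Text "foo"], Elem "b" [Text "bar"], Elem "c" [Text "baz"]]

extract, extract' :: String -> ListArrow Node String
extract  s = children >>> named s >>> children >>> text
extract' s = named s >>> children >>> text

-- Both branches receive the root, so each finds its own child.
pairedAtRoot :: [(String, String)]
pairedAtRoot = runKleisli (extract "a" &&& extract "b") doc

-- Both branches receive the same single child, so one of them is always empty.
pairedPerChild :: [(String, String)]
pairedPerChild = runKleisli (children >>> (extract' "a" &&& extract' "b")) doc

main :: IO ()
main = do
  print pairedAtRoot   -- [("foo","bar")]
  print pairedPerChild -- []
```

Under this reading, the listA/unlistA version works because listA hands the whole list of children to both branches at once, so each branch can pick out its own element.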
UPDATE: There seems to be something weird going on here with runLA and xread. If you use runX and readString everything works as expected:
> let xml = "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> let parse p = runX (readString [] xml >>> p)
> let extract' s = getChildren >>> hasName s >>> getChildren >>> getText
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This means you have to run the parser in the IO monad, but there are advantages to using runX anyway (better error messages, etc.).
Related
The Haskell IO system is super hard for me to understand, so I have a question: how do I read from standard input into a list? I know that there are the functions getLine :: IO String and interact, but I do not know how to convert the input to a list so I can use it in these three functions:
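-- Version 1: explicit recursion
powerset [] = [[]]
powerset (x:xs) = xss ++ map (x:) xss
  where xss = powerset xs
main = print $ powerset [1,2]

-- Version 2: filterM in the list monad
import Control.Monad (filterM)
p = filterM (const [True,False])
main = print $ p [1,2]

-- Version 3: the library function
import Data.List (subsequences)
main = print $ subsequences [1,2]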
I want to be able to write 1 2 3 and pass these values to the function. Can you tell or show me how to do it?
Extra question
Haskell is full of magic, so I was wondering if it is possible to use input directly in the function like this:
main = subsequences(some input magic here)
You may write:
main = readLn >>= print . subsequences
You will need to nail down the type to be read, for example by having a monomorphic subsequences or by annotating readLn. In ghci:
Data.List> (readLn :: IO [Integer]) >>= print . subsequences
[1,2,3]
[[],[1],[2],[1,2],[3],[1,3],[2,3],[1,2,3]]
(I typed in the first and second lines -- both followed by enter -- and the third line was the result.)
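Note that readLn expects Haskell list syntax such as [1,2,3]. For input typed as 1 2 3, a minimal sketch could split the line on whitespace first. The helper name parseInts is hypothetical, and the use of read is an assumption that the input is well formed: it fails at run time on anything that is not an integer.

```haskell
import Data.List (subsequences)

-- | Parse whitespace-separated integers, e.g. "1 2 3" (hypothetical helper).
-- 'read' fails at run time on anything that is not an integer.
parseInts :: String -> [Integer]
parseInts = map read . words

main :: IO ()
main = do
  line <- getLine                        -- read one line from standard input
  print (subsequences (parseInts line))  -- "1 2 3" gives all 8 subsequences
```

The same thing can be written in one line in the style of the answer above: main = getLine >>= print . subsequences . parseInts.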
For more details, you may enjoy one of the excellent resources below:
The IO Monad for People who Simply Don't Care
You Could Have Invented Monads (And Maybe You Already Have)
All About Monads
I'm having a little trouble with HXT: I am trying to locate all the nodes in a document that match some criteria, and I'm trying to combine the lenses/XPaths as predicates in an OR-like fashion, using Control.Arrow.<+>, as this guide suggests. However, when I try to "run" the arrow on my document, I am getting duplicate results. Is there an easy way to remove the duplicates, or to combine the tests in a more meaningful way? Here is my code:
run :: App -> IO ()
run a = do
  inputContents <- readFile (input a)
  let doc = readString [withParseHTML yes, withWarnings no] inputContents
  links <- runX . xshow $ doc >>> indentDoc //> cssLinks
  mapM_ putStrLn links

cssLinks = links >>> (rels <+> hrefs <+> types)
  where
    links = hasName "link"
    rels  = hasAttrValue "rel" (isInfixOf "stylesheet")
    hrefs = hasAttrValue "href" (endswith ".css")
    types = hasAttrValue "type" (== "text/css")
Yet every time I run this (on any web page), I get duplicated results / nodes. I noticed that <+> is part of the ArrowPlus typeclass, which mimics a monoid, and ArrowXml is a subclass of both ArrowList and ArrowTree, which gives me a lot to work with. Would I have to construct ArrowIf predicates? Any help with this would be wonderful :)
You may get the arrow result as a [XmlTree], then apply List.nub, then get the string rep.
import qualified Data.List as L
import qualified Text.XML.HXT.DOM.ShowXml as SX
...
links <- runX $ doc >>> indentDoc //> cssLinks
-- first remove duplicates (L.nub), then apply SX.xshow
putStrLn (SX.xshow . L.nub $ links)
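To see where the duplicates may come from, here is a small model that does not use HXT at all. It assumes a list arrow behaves like a function a -> [b], with Kleisli [] from base standing in for the HXT arrow; the Link record and the keepIf helper are hypothetical. In this model <+> simply concatenates the results of its operands, so a node that passes all three tests comes out three times, while a single filter with a Boolean OR yields it once.

```haskell
import Control.Arrow (Kleisli (..), (<+>))
import Data.List (isInfixOf, isSuffixOf, nub)

-- A toy stand-in for a <link> element (not an HXT type).
data Link = Link { rel :: String, href :: String, typ :: String }
  deriving (Eq, Show)

-- A filter in the list-arrow style: keep the input or produce nothing.
keepIf :: (a -> Bool) -> Kleisli [] a a
keepIf p = Kleisli $ \x -> [x | p x]

isRel, isHref, isType :: Link -> Bool
isRel  = isInfixOf "stylesheet" . rel
isHref = isSuffixOf ".css" . href
isType = (== "text/css") . typ

-- (<+>) concatenates the results of its operands: one copy per matching test.
viaPlus :: Link -> [Link]
viaPlus = runKleisli (keepIf isRel <+> keepIf isHref <+> keepIf isType)

-- A single filter with a Boolean OR yields each node at most once.
viaOr :: Link -> [Link]
viaOr = runKleisli (keepIf (\l -> isRel l || isHref l || isType l))

main :: IO ()
main = do
  let l = Link "stylesheet" "main.css" "text/css"
  print (length (viaPlus l))       -- 3
  print (length (nub (viaPlus l))) -- 1
  print (length (viaOr l))         -- 1
```

HXT's ArrowIf class has an orElse combinator that may serve the same purpose as the Boolean OR here, since it only tries its second operand when the first produces nothing; it is worth checking its documentation before relying on that.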
I'd like to save a huge list A to a text file. writeFile seems to only save the list at the very end of the calculation of A, which crashes because my memory is insufficient to store the whole list.
I have tried this using
writeFile "test.txt" $ show mylistA
Now I have tried saving the elements of the list, as they are calculated using:
[appendFile "test2.txt" (show x)|x<-mylistA]
But it doesn't work because:
No instance for (Show (IO ())) arising from a use of `print'
Possible fix: add an instance declaration for (Show (IO ()))
In a stmt of an interactive GHCi command: print it
Can you help me fix this, or give me a solution which saves my huge list A to a text file?
Thank you
The problem is that your list has the type [IO ()], or "a list of IO actions". Since the IO is on the "inside" of our type, we can't execute this in the IO monad. What we want instead is IO (). So a list comprehension isn't going to hack it here.
We could use a function to turn [IO ()] -> IO [()] but this case lends itself to a much more concise combinator.
Instead we can use a simple predefined combinator called mapM_. In the Haskell Prelude, the M means it's monadic and the _ means that it discards the results and returns m (), in our case IO (). Using it is trivial in this case:
[appendFile "test2.txt" (show x)|x<-mylistA]
becomes
mapM_ (\x -> appendFile "test2.txt" (show x)) mylistA
or, point-free:
mapM_ (appendFile "test2.txt" . show) mylistA
This will unfold to something like
appendFile "test2.txt" (show firstItem) >>
appendFile "test2.txt" (show secondItem) >>
...
So we don't ever have the whole list in memory.
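As a variation on the same idea, and not something the answer above prescribes, the file could be opened once and each element written through the handle, which avoids reopening the file for every element as appendFile does. This is a minimal sketch: the helper name writeItems is hypothetical, and unlike the appendFile version it puts each element on its own line.

```haskell
import System.IO (IOMode (WriteMode), hPrint, withFile)

-- | Write each element on its own line as it is produced (hypothetical helper).
writeItems :: Show a => FilePath -> [a] -> IO ()
writeItems path xs = withFile path WriteMode $ \h -> mapM_ (hPrint h) xs

main :: IO ()
main = writeItems "test2.txt" [1 .. 100000 :: Integer]
```

As with mapM_ and appendFile, elements can be garbage-collected after they are written, provided nothing else in the program holds on to the list.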
You can use the function sequence from Control.Monad to take a (lazily generated) list of IO actions and execute them one at a time
>>> import Control.Monad
Now you can do
>>> let myList = [1, 2, 3]
>>> sequence [print x | x <- myList]
1
2
3
[(),(),()]
Note that you get a list of all the return values at the end. If you want to discard the return value, just use sequence_ instead of sequence.
>>> sequence_ [print x | x <- myList]
1
2
3
I just wanted to expand on jozefg's answer by mentioning forM_, the flipped version of mapM_. Using forM_ you get something that looks like a foreach loop:
-- Read this as "for each `x` in `mylistA`, do X"
forM_ mylistA $ \x -> do
  appendFile "test2.txt" (show x)
I am trying to get the information out of an XML file into a lookup table.
So far I have been reading about which libraries might be available and how to use them.
I went with hxt and hashtables.
Here is the file :
<?xml version="1.0" encoding="UTF-8" ?>
<tables>
<table name="nametest1">
test1
</table>
<table name="nametest2">
test2
</table>
</tables>
I would like to have the following pairs:
nametest1, test1
nametest2, test2
etc...
-- | We get the xml into a hash
getTables :: IO (H.HashTable String String)
getTables = do
  confPath <- getEnv "ENCODINGS_XML_PATH"
  doc <- runX $ readDocument [withValidate no] confPath
  -- this is the part I don't have
  -- I get the whole hashtable create and insert process
  -- It is getting the XML info that is blocking me
  where -- I think I might use the following, so I shamelessly took them from the net
    atTag tag = deep (isElem >>> hasName tag)
    text = getChildren >>> getText
I saw many examples of how to do similar things but I can't figure out how to get the name attribute at each node.
Cheers,
rakwatt
Here is an example that reads a file with the name of test.xml and just prints out the (name,text) pairs:
import Text.XML.HXT.Core
-- | Gets the name attribute and the content of the selected items as a pair
getAttrAndText :: (ArrowXml a) => a XmlTree (String, String)
getAttrAndText =
  getAttrValue "name"   -- Get the value of the "name" attribute
  &&& deep getText      -- and pair it with the text of the node

-- | Gets all "table" items under a root tables item
getTableItem :: (ArrowXml a) => a XmlTree XmlTree
getTableItem =
  deep (hasName "tables")  -- Find a tag <tables> anywhere in the document
  >>> getChildren          -- Get all children of that tag
  >>> hasName "table"      -- Filter those that have the tag <table>
  >>> hasAttr "name"       -- Filter those that have an attribute name

-- | The main function
main :: IO ()
main = (print =<<) $ runX $                    -- Print the result
  readDocument [withValidate no] "test.xml"    -- Read the document
  >>> getTableItem                             -- Get all table items
  >>> getAttrAndText                           -- Get the attribute 'name' and the text of those nodes
The construction of the pairs happens in getAttrAndText. The rest of the functions just open the file and select all <table> tags that are immediate children of a <tables> tag. You may still want to strip leading and trailing whitespace from the text.
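The whitespace stripping mentioned above could be sketched with functions from base alone. The helper name trim is hypothetical, and the sample pairs only imitate what the pipeline might return for the file in the question.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | Remove whitespace from both ends of a string (hypothetical helper).
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

main :: IO ()
main =
  -- fmap on a pair only touches the second component, i.e. the text.
  print (map (fmap trim) [("nametest1", "\ntest1\n"), ("nametest2", "\ntest2\n")])
```

Inside the arrow pipeline, the same function could presumably be appended as >>> arr (fmap trim) after getAttrAndText.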
All the examples I've seen so far using the Haskell XML toolkit, HXT, use runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? It seems to be a pure operation to me, and I don't understand why I'm forced to be inside IO.
You can use HXT's xread along with runLA to parse an XML string outside of IO.
xread has the following type:
xread :: ArrowXml a => a String XmlTree
This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.
runLA is like runX, but for things of type LA:
runLA :: LA a b -> a -> [b]
LA is an instance of ArrowXml.
To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:
{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
import Text.XML.HXT.Core -- named Text.XML.HXT.Arrow in older HXT releases

classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText -< div
      returnA -< (cls, val)

getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
  where lookupValues cs m = map (flip M.lookup m) cs

xml :: String
xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
      \<div class='c3'>123</div><div class='c4'>234</div></div>"

values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml

main :: IO ()
main = print values
classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.
It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.
A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can: it gives you more control over the configuration of the parser, error messages, etc.
Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:
classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
  where pairs = getAttrValue "class" &&& deep getText
Travis Brown's answer was very helpful. I just want to add my own solution here, which I think is a bit more general (using the same functions, just ignoring the problem-specific issues).
I was previously unpickling with:
upIO :: XmlPickler a => String -> IO [a]
upIO str = runX $ readString [] str >>> arrL (maybeToList . unpickleDoc xpickle)
which I was able to change to this:
upPure :: XmlPickler a => String -> [a]
upPure str = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle)) str
I completely agree with him that doing this gives you less control over the configuration of the parser etc, which is unfortunate.