put xml into a hash table - haskell

I am trying to get the information out of an XML file into a lookup table.
So far I have been reading about which libraries are available and how to use them.
I went with hxt and hashtables.
Here is the file :
<?xml version="1.0" encoding="UTF-8" ?>
<tables>
<table name="nametest1">
test1
</table>
<table name="nametest2">
test2
</table>
</tables>
I would like to have the following pairs:
nametest1, test1
nametest2, test2
etc...
-- | We get the xml into a hash
getTables :: IO (H.HashTable String String)
getTables = do
    confPath <- getEnv "ENCODINGS_XML_PATH"
    doc <- runX $ readDocument [withValidate no] confPath
    -- this is the part I don't have
    -- I get the whole hashtable create and insert process
    -- It is getting the xml info that is blocking
  where -- I think I might use the following so I shamelessly took them from the net
    atTag tag = deep (isElem >>> hasName tag)
    text = getChildren >>> getText
I saw many examples of how to do similar things but I can't figure out how to get the name attribute at each node.
Cheers,
rakwatt

Here is an example that reads a file named test.xml and just prints out the (name, text) pairs:
import Text.XML.HXT.Core
-- | Gets the name attribute and the content of the selected items as a pair
getAttrAndText :: (ArrowXml a) => a XmlTree (String, String)
getAttrAndText =
    getAttrValue "name"  -- Get the value of the name attribute
    &&& deep getText     -- And pair it with the text of the node
-- | Gets all "table" items under a root tables item
getTableItem :: (ArrowXml a) => a XmlTree XmlTree
getTableItem =
    deep (hasName "tables")  -- Find a tag <tables> anywhere in the document
    >>> getChildren          -- Get all children of that tag
    >>> hasName "table"      -- Filter those that have the tag <table>
    >>> hasAttr "name"       -- Filter those that have an attribute name
-- | The main function
main :: IO ()
main = (print =<<) $ runX $                    -- Print the result
    readDocument [withValidate no] "test.xml"  -- Read the document
    >>> getTableItem                           -- Get all table items
    >>> getAttrAndText                         -- Get the attribute 'name' and the text of those nodes
The construction of the pairs happens in getAttrAndText. The rest of the functions just open the file and select all <table> tags that are immediate children of the <tables> tag. You might still want to strip leading and trailing whitespace from the text.
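The whitespace stripping and the final lookup table can be sketched as follows. This is only a sketch under assumptions: it uses Data.Map from containers instead of the hashtables package the question mentions, the helper names trim and toLookup are made up for illustration, and the input list imitates what runX would return for the sample file.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import qualified Data.Map as M

-- | Drop leading and trailing whitespace (hypothetical helper).
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Build a lookup table from (name, text) pairs, trimming each text value.
-- Data.Map is an assumption here; the question asked for a HashTable.
toLookup :: [(String, String)] -> M.Map String String
toLookup pairs = M.fromList [(name, trim txt) | (name, txt) <- pairs]

main :: IO ()
main = do
  -- Pairs shaped like the ones the HXT arrows produce for the sample file.
  let pairs = [("nametest1", "\ntest1\n"), ("nametest2", "\ntest2\n")]
  print (M.toList (toLookup pairs))
  -- prints [("nametest1","test1"),("nametest2","test2")]
```

The same list of pairs could instead be inserted one by one into a mutable hash table inside IO; the trimming step stays the same.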

Related

Logical OR in HXT without duplicating results

I'm having a little trouble with HXT: I am trying to locate all the nodes in a document that match some criteria, and I'm trying to combine the lenses/XPaths as predicates in an OR-like fashion, using Control.Arrow.<+>, as this guide suggests. However, when I try to "run" the arrow on my document, I am getting duplicate results. Is there an easy way to remove the duplicates, or to combine the tests in a more meaningful way? Here is my code:
run :: App -> IO ()
run a = do
    inputContents <- readFile (input a)
    let doc = readString [withParseHTML yes, withWarnings no] inputContents
    links <- runX . xshow $ doc >>> indentDoc //> cssLinks
    mapM_ putStrLn links
cssLinks = links >>> (rels <+> hrefs <+> types)
  where
    links = hasName "link"
    rels = hasAttrValue "rel" (isInfixOf "stylesheet")
    hrefs = hasAttrValue "href" (endswith ".css")
    types = hasAttrValue "type" (== "text/css")
Yet every time I run this (on any web page), I get duplicated results / nodes. I noticed that <+> is part of the ArrowPlus typeclass, which mimics a monoid, and ArrowXML is an instance of both ArrowList and ArrowTree, which gives me a lot to work with. Would I have to construct ArrowIf predicates? Any help with this would be wonderful :)
You can get the arrow result as a [XmlTree], then apply List.nub to remove duplicates, and only then build the string representation.
{-# LANGUAGE PackageImports #-}
import qualified Data.List as L
import "hxt" Text.XML.HXT.DOM.ShowXml as SX
...
links <- runX $ doc >>> indentDoc //> cssLinks
-- first remove duplicates (L.nub), then apply SX.xshow
putStrLn (SX.xshow . L.nub $ links)
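Another option, not mentioned in the answer above, is to avoid producing the duplicates in the first place. The sketch below assumes that orElse from Control.Arrow.ArrowIf (re-exported by Text.XML.HXT.Core) is an acceptable replacement for <+>: it only runs its right-hand arrow when the left-hand one yields nothing, so a node that satisfies several predicates comes out once. The sample markup and the cssHrefs helper are made up for illustration, and isSuffixOf stands in for the question's endswith.

```haskell
import Data.List (isInfixOf, isSuffixOf)
import Text.XML.HXT.Core

-- Hypothetical sample input, not taken from the question.
sample :: String
sample =
  "<head>\
  \<link rel='stylesheet' type='text/css' href='a.css'/>\
  \<link rel='icon' href='favicon.ico'/>\
  \</head>"

-- A <link> is kept once if any predicate matches: orElse tries the
-- right-hand arrow only when the left-hand one produced no result.
cssLinks :: ArrowXml a => a XmlTree XmlTree
cssLinks = hasName "link" >>> (rels `orElse` (hrefs `orElse` types))
  where
    rels  = hasAttrValue "rel" ("stylesheet" `isInfixOf`)
    hrefs = hasAttrValue "href" (".css" `isSuffixOf`)
    types = hasAttrValue "type" (== "text/css")

-- href values of the matching links, parsed without IO via xread and runLA.
cssHrefs :: String -> [String]
cssHrefs = runLA (xread >>> deep cssLinks >>> getAttrValue "href")

main :: IO ()
main = print (cssHrefs sample)
```

With <+> in place of orElse, the first link in the sample would match all three predicates and be emitted three times.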

Snap: rendering table with compiled splices

Well, this is another obstacle with compiled splices that I struggle to overcome.
I had some trouble when I first did it with the original interpreted splices, and now I am stuck translating it to the compiled variant.
The original problem is described here: rendering a table
So I need to build this kind of structure:
map (\doc -> fmap (\f -> splice $ f at doc) fields ) documents
which is just an outer and an inner loop, the outer one rendering rows and the inner one rendering cells. It is all described in the original link.
It is currently implemented in interpreted mode like this:
tableRawS raw = mapSplices (\f -> tableCellS $ T.pack $ at f raw) (tableFields table)
  where tableCellS cell = runChildrenWithText $ "table-cell" ## cell
tableBodyS = mapSplices (\d -> runChildrenWith $ raws d) documents
  where docID d = T.pack $ show $ valueAt "_id" d
        raws d = do "table-raw" ## tableRawS d
                    "raw-id" ## textSplice $ docID d
I've been struggling for a few days now with no result. It is probably due to my lack of understanding of the compiled splices API. Please help!
EDIT
I did not provide enough details about my problem. The best approximation to the above interpreted variant that I managed to get is this:
tableBody = manyWithSplices runChildren tableRaw $ lift documents
tableRaw doc = do "table-raw" ## manyWithSplices runChildren rawCell $ lift labels
                    where rawCell label = "table-cell" ## pureSplice . textSplice $ at label doc
                  "table-rawid" ## pureSplice . textSplice $ at "_id" doc
  where oid = T.pack . show . valueAt "_id"
It does not work! The problem is translating the row-rendering function, tableRaw. So I will give more details.
The data I am feeding into tableBody comes as a list of Map-like structures:
document = ["_id" =: 12345, "name" =: "pretty", "model" =: "cool", "size" =: "big"]
So to pull the data out for a row I need to map over the list of labels
map (\l -> at l document) labels
I can do this with the interpreted variant without a problem. The interpreted example of tableRawS is parameterised by a document but maps over [labels]. I cannot achieve this in the compiled variant. I need both the document and the list of labels present in the tableRaw function so I can bind different cell splices out of the document. This is a real pain. I don't know how to do it. Whatever I do, I end up with tableRaw mapping over the document instead of mapping over the list of labels.
Basically, I get [document] from the database with the above-mentioned document structure, and I get labels from the database, where
labels = ["name", "model", "size"] -- table fields
I get data from the document with "at :: Label -> Value"
Having all that, how do I render this template:
<table class="table" data-title=${table-name}>
  <thead>
    <tr>
      <th>#</th>
      <table-head>
        <th> <table-hcell/> </th>
      </table-head>
    </tr>
  </thead>
  <tbody>
    <table-body>
      <tr data-oid=${table-rowid}>
        <td> <input type="checkbox"/> </td>
        <table-raw>
          <td> <table-rcell/> </td>
        </table-raw>
      </tr>
    </table-body>
  </tbody>
Suppose this is the structure that we want to render as a table:
data Table = Table
    { caption :: T.Text
    , rows :: [[Int]]
    }
A simple template could be something like
<body>
  <mytable>
    <table>
      <caption><mycaptiontext/></caption>
      <myrow>
        <tr>
          <mydata>
            <td><mydatavalue/></td>
          </mydata>
        </tr>
      </myrow>
    </table>
  </mytable>
</body>
Where mytable, mycaptiontext, myrow, mydata and mydatavalue are the tags that will be bound to splices. mytable in particular will be bound to a top-level splice.
There are three levels to consider: the table as a whole, the rows, and the data inside the rows. For each level, we are going to define a function which takes a runtime action and returns a compiled splice.
For the "row data" level, the runtime action carries the list of values in a row:
dataSplice :: Monad m => RuntimeSplice m [Int] -> C.Splice m
dataSplice = C.manyWithSplices C.runChildren splicefuncs
  where
    splicefuncs = do
        "mydatavalue" ## (C.pureSplice . C.textSplice $ T.pack . show)
For the "rows" level, the runtime action carries the full list of rows:
rowSplice :: Monad m => RuntimeSplice m [[Int]] -> C.Splice m
rowSplice = C.manyWithSplices C.runChildren splicefuncs
  where
    splicefuncs = do
        "mydata" ## dataSplice
Notice how we use dataSplice in the definition.
For the "whole table" level, the runtime action carries a Table:
tableSplice :: Monad m => RuntimeSplice m Table -> C.Splice m
tableSplice = C.withSplices C.runChildren splicefuncs
  where
    splicefuncs = do
        "mycaptiontext" ## (C.pureSplice . C.textSplice $ caption)
        "myrow" ## (rowSplice . liftM rows)
Notice how we use liftM here, transforming a RuntimeSplice m Table into a RuntimeSplice m [[Int]] so that the type fits with what rowSplice expects.
Now suppose we have a runtime action that produces the table data:
runtime :: Monad m => RuntimeSplice m Table
runtime = return $ Table "This is the caption" $
    [ [3, 5, 6], [8, 3, 7] ]
(This one is trivial. A more complex action could fetch the data from a database, for example. Remember that, in Snap, RuntimeSplices have access to all the Snaplet machinery.)
We can combine tableSplice with runtime and register the result as a top-level splice in the Heist configuration, bound to the mytable tag. Like this:
let heistConfig = mempty {
        hcCompiledSplices = do
            "mytable" ## (tableSplice runtime)
            -- ... other top-level splices here ...
    }
This should work.
More examples of rendering nested structures with compiled Heist can be found in this tutorial.
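To connect this with the question's [document] and labels, one possible approach (an assumption on my part, not something the answer states) is to flatten each document into a row of cell values inside the runtime action, so that the row-level splice receives a plain list of rows again. The sketch below shows only that pure step. Document is a toy Data.Map stand-in for the question's document type, and toRows is a hypothetical helper.

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- Toy stand-in for the question's document type (an assumption).
type Document = M.Map String String

-- | One row per document and one cell per label, in label order.
-- A missing field becomes an empty cell.
toRows :: [String] -> [Document] -> [[String]]
toRows labels docs =
  [ [fromMaybe "" (M.lookup label doc) | label <- labels] | doc <- docs ]

main :: IO ()
main = do
  let labels = ["name", "model", "size"]
      docs =
        [ M.fromList
            [("_id", "12345"), ("name", "pretty"), ("model", "cool"), ("size", "big")]
        ]
  print (toRows labels docs)
  -- prints [["pretty","cool","big"]]
```

The resulting rows hold text rather than Int values, so dataSplice and rowSplice would need their element type adjusted accordingly.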

HXT Parsing to list

I recently posted about using HXT pickles for parsing. After some reading I decided to use regular HXT instead.
However, I am unable to create lists, i.e. I have an XML document:
<meta>
<sampleQuery>sample1</sampleQuery>
<sampleQuery>sample2</sampleQuery>
</meta>
and a parsing function
parseMeta =
    proc x -> do
        meta <- deep (isElem >>> hasName "meta") -< x
        sampleQueries <- getText <<< getChildren <<< deep (hasName "sampleQuery") -< meta
        returnA -< Meta sampleQueries
sampleQueries should have the type [String] (["sample1", "sample2"] in this case) but I am unable to achieve this.
Arrow notation seems like overkill here.
import Text.XML.HXT.Core
xml = unlines
    [ "<meta>"
    , "<sampleQuery>sample1</sampleQuery>"
    , "<sampleQuery>sample2</sampleQuery>"
    , "</meta>"
    ]
queries = hasName "meta" /> hasName "sampleQuery" /> getText
main = runX (readString [] xml /> queries) >>= print
This will print ["sample1","sample2"], as expected.
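If the arrow notation from the question is kept, the missing piece is probably listA, which collects all results of an arrow into a single list. The following is a sketch under assumptions: the Meta type is a guess at the question's constructor, the input is a single-line copy of the sample document, and xread with runLA is used only to keep the example free of IO.

```haskell
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core

-- Hypothetical definition; the question only shows the constructor name.
newtype Meta = Meta [String] deriving (Show, Eq)

parseMeta :: ArrowXml a => a XmlTree Meta
parseMeta = proc x -> do
  meta <- deep (isElem >>> hasName "meta") -< x
  -- listA gathers every sampleQuery text into one [String]
  -- instead of producing one result per element.
  sampleQueries <- listA (getChildren >>> hasName "sampleQuery" /> getText) -< meta
  returnA -< Meta sampleQueries

xml :: String
xml = "<meta><sampleQuery>sample1</sampleQuery><sampleQuery>sample2</sampleQuery></meta>"

parsed :: [Meta]
parsed = runLA (xread >>> parseMeta) xml

main :: IO ()
main = print parsed
```

Without listA, the proc block would yield one Meta per sampleQuery, and the field would have type String rather than [String].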

Running Haskell HXT outside of IO?

All the examples I've seen so far using the Haskell XML Toolbox, HXT, use runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? Parsing seems to be a pure operation to me, so I don't understand why I'm forced to be inside IO.
You can use HXT's xread along with runLA to parse an XML string outside of IO.
xread has the following type:
xread :: ArrowXml a => a String XmlTree
This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.
runLA is like runX, but for things of type LA:
runLA :: LA a b -> a -> [b]
LA is an instance of ArrowXml.
To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:
{-# LANGUAGE Arrows #-}
module Main where
import qualified Data.Map as M
import Text.XML.HXT.Core -- this module was named Text.XML.HXT.Arrow before HXT 9
classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
        cls <- getAttrValue "class" -< div
        val <- deep getText -< div
        returnA -< (cls, val)
getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
  where lookupValues cs m = map (flip M.lookup m) cs
xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
      \<div class='c3'>123</div><div class='c4'>234</div></div>"
values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml
main = print values
classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.
It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.
A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can: it gives you more control over the configuration of the parser, error messages, etc.
Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:
classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
  where pairs = getAttrValue "class" &&& deep getText
Travis Brown's answer was very helpful. I just want to add my own solution here, which I think is a bit more general (using the same functions, just ignoring the problem-specific issues).
I was previously unpickling with:
upIO :: XmlPickler a => String -> IO [a]
upIO str = runX $ readString [] str >>> arrL (maybeToList . unpickleDoc xpickle)
which I was able to change to this:
upPure :: XmlPickler a => String -> [a]
upPure str = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle)) str
I completely agree with him that doing this gives you less control over the configuration of the parser etc, which is unfortunate.

HXT: Left-Factoring Nondeterministic Arrows?

I'm trying to come to terms with Haskell's XML Toolbox (HXT) and I'm hitting a wall somewhere, because I don't seem to fully grasp arrows as a computational tool.
Here's my problem, which I hoped to illustrate a little better using a GHCi session:
> let parse p = runLA (xread >>> p) "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> :t parse
parse :: LA XmlTree b -> [b]
So parse is a small helper function that applies whatever arrow I give it to the trivial XML document
<root>
<a>foo</a>
<b>bar</b>
<c>baz</c>
</root>
I define another helper function, this time to extract the text below a node with a given name:
> let extract s = getChildren >>> isElem >>> hasName s >>> getChildren >>> getText
> :t extract
extract :: (ArrowXml cat) =>
           String -> cat (Data.Tree.NTree.TypeDefs.NTree XNode) String
> parse (extract "a" &&& extract "b") -- extract two nodes' content.
[("foo","bar")]
With the help of this function, it's easy to use the &&& combinator to pair up the text of two different nodes, and then, say, pass it to a constructor, like this:
> parse (extract "a" &&& extract "b" >>^ arr (\(a,b) -> (b,a)))
[("bar","foo")]
Now comes the part I don't understand: I want to left-factor! extract calls getChildren on the root-node twice. Instead, I'd like it to only call it once! So I first get the child of the root node
> let extract' s = hasName s >>> getChildren >>> getText
> :t extract'
extract' :: (ArrowXml cat) => String -> cat XmlTree String
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[]
Note that I've tried to re-order the calls to, say, isElem, etc. in order to find out if that's the issue. But as it stands, I just don't have any idea why this isn't working. There is an arrow 'tutorial' on the Haskell wiki and the way I understood it, it should be possible to do what I want to do that way, namely use &&& in order to pair up the results of two computations.
It does work, too, but only at the start of the arrow chain, not midway through, when I already have some results that I want to keep 'shared'. I have the feeling that I'm just not able to wrap my head around the difference in ideas between normal function composition and arrow notation. I'd be very appreciative of any pointers! (Even if it is just to some generic arrow tutorial that goes a little more in-depth than the one on the Haskell wiki.)
Thank you!
If you convert the arrow to (and then from) a deterministic version this works as expected:
> let extract' s = unlistA >>> hasName s >>> getChildren >>> getText
> parse (listA (getChildren >>> isElem) >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This isn't really satisfactory, though, and I can't remember off the top of my head why (&&&) behaves this way with a nondeterministic arrow (I'd personally use the proc/do notation for anything much more complicated than this).
UPDATE: There seems to be something weird going on here with runLA and xread. If you use runX and readString everything works as expected:
> let xml = "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> let parse p = runX (readString [] xml >>> p)
> let extract' s = getChildren >>> hasName s >>> getChildren >>> getText
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This means you have to run the parser in the IO monad, but there are advantages to using runX anyway (better error messages, etc.).
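One way to see why the left-factored version returns [] is a toy model that uses only base. It assumes that a list arrow such as LA behaves like a function a -> [b], and that &&& runs both branches on the same input and pairs every result of one with every result of the other. Kleisli [] from Control.Arrow has that behaviour. The Node type and the helper names below are made up for illustration.

```haskell
import Control.Arrow (Kleisli (..), (&&&), (>>>))

-- Toy stand-in for a child element: (tag name, text content).
type Node = (String, String)

-- Nondeterministic arrows modelled as functions a -> [b].
type ListArrow = Kleisli []

-- The three children of <root> from the question.
rootChildren :: ListArrow () Node
rootChildren = Kleisli (const [("a", "foo"), ("b", "bar"), ("c", "baz")])

-- Text of a node if its tag matches, otherwise no result.
textOf :: String -> ListArrow Node String
textOf name = Kleisli (\(tag, txt) -> [txt | tag == name])

-- Left-factored: both branches see the SAME child, and no single child
-- is both <a> and <b>, so every pairing is empty.
perChild :: [(String, String)]
perChild = runKleisli (rootChildren >>> (textOf "a" &&& textOf "b")) ()

-- Unfactored: each branch walks the children on its own.
perRoot :: [(String, String)]
perRoot = runKleisli ((rootChildren >>> textOf "a") &&& (rootChildren >>> textOf "b")) ()

main :: IO ()
main = do
  print perChild  -- []
  print perRoot   -- [("foo","bar")]
```

Read this way, the empty result is what &&& is expected to give, and listA works because it hands both branches the whole list of children at once. The runX variant above also differs in two respects, as far as I can tell: readString wraps the document in an extra root node, and its extract' starts with getChildren again, so it is effectively the unfactored version.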
