HXT Parsing to list

I recently posted about using HXT pickles for parsing. After some reading I decided to use regular HXT instead.
However, I am unable to create lists, i.e. I have an XML document:
<meta>
<sampleQuery>sample1</sampleQuery>
<sampleQuery>sample2</sampleQuery>
</meta>
and a parsing function
parseMeta =
  proc x -> do
    meta <- deep (isElem >>> hasName "meta") -< x
    sampleQueries <- getText <<< getChildren <<< deep (hasName "sampleQuery") -< meta
    returnA -< Meta sampleQueries
sampleQueries should have the type [String] (["sample1", "sample2"] in this case) but I am unable to achieve this.

Arrow notation seems like overkill here.
import Text.XML.HXT.Core
xml = unlines
  [ "<meta>"
  , "<sampleQuery>sample1</sampleQuery>"
  , "<sampleQuery>sample2</sampleQuery>"
  , "</meta>"
  ]
queries = hasName "meta" /> hasName "sampleQuery" /> getText
main = runX (readString [] xml /> queries) >>= print
This will print ["sample1","sample2"], as expected.
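If you would rather keep the arrow notation from the question, the results can be collected with listA, which turns an arrow yielding many values into one yielding a single list. A minimal sketch, assuming a hypothetical Meta type that simply wraps [String] (the question does not show its definition) and using withValidate no to skip DTD validation:

```haskell
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core

-- Hypothetical type; the question does not show the real definition of Meta.
newtype Meta = Meta [String] deriving (Show, Eq)

xml :: String
xml = unlines
  [ "<meta>"
  , "<sampleQuery>sample1</sampleQuery>"
  , "<sampleQuery>sample2</sampleQuery>"
  , "</meta>"
  ]

parseMeta :: ArrowXml a => a XmlTree Meta
parseMeta = proc x -> do
  meta <- deep (isElem >>> hasName "meta") -< x
  -- listA gathers every text result of the inner arrow into one list
  sampleQueries <- listA (deep (hasName "sampleQuery") >>> getChildren >>> getText) -< meta
  returnA -< Meta sampleQueries

main :: IO ()
main = runX (readString [withValidate no] xml >>> parseMeta) >>= print
```

Without listA the proc block runs once per text node, so each run sees a single String; with it, the block runs once per meta element and sees all of them together.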

Related

Catching and halting on syntax exceptions during parsing with HXT

Suppose an XML based language where the node attribute animal is illegal. Consider:
{-# LANGUAGE Arrows, RankNTypes #-}
module Lib ( parseXml ) where
import Control.Arrow
import Text.XML.HXT.Core
parseXml = runLA (xread >>> isElem >>> myParser) content
  where
    content = unlines
      [ "<pet animal='cat'>felix</pet>"
      , "<pet>milo</pet>"
      , "<pet animal='rat'>tom</pet>" ]
myParser :: (ArrowXml a) => a XmlTree [String]
myParser = proc xml -> do
  x <- isElem -< xml
  pets <- (getText <<< getChildren <<< neg (hasAttr "animal")) >. id -< x
  returnA -< pets
The result of evaluating parseXml is:
parseXml
[[],["milo"],[]]
This is not what is intended. The parser has silently ignored the 1st and 3rd XML nodes because they do not conform to myParser, specifically to neg (hasAttr "animal"). I'd like it to halt parsing at the 1st XML node instead of silently skipping it.
How can I change this code so that HXT throws an error whenever it encounters our syntax violation, namely an XML node with an "animal" attribute? That is, when attempting to parse the 1st XML node it returns a Left value, e.g.
parseXml
Left (ParseError "'animal' attribute is not permitted")

Try err and its neighbors with when.
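As an alternative that does not rely on HXT's error arrows at all: because runLA is pure, the input can be checked for offending nodes first and the outcome wrapped in Either. This is a plain pre-check swapped in for err/when, not the approach suggested above; parsePets and the error string are illustrative names, not part of HXT.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: reject the whole input if any top-level element
-- carries the forbidden 'animal' attribute, otherwise return the pet names.
parsePets :: String -> Either String [String]
parsePets src
  | null offenders = Right names
  | otherwise      = Left "'animal' attribute is not permitted"
  where
    elems     = xread >>> isElem
    -- every top-level element that has an 'animal' attribute
    offenders = runLA (elems >>> hasAttr "animal") src
    -- text content of every top-level element
    names     = runLA (elems >>> getChildren >>> getText) src
```

With the three pet nodes from the question this takes the Left branch; the caller decides how to report it.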

Logical OR in HXT without duplicating results

I'm having a little trouble with HXT: I am trying to locate all the nodes in a document that match some criteria, and I'm trying to combine the lenses/XPaths as predicates in an OR-like fashion, using Control.Arrow.<+>, as this guide suggests. However, when I try to "run" the arrow on my document, I am getting duplicate results. Is there an easy way to remove the duplicates, or to combine the tests in a more meaningful way? Here is my code:
run :: App -> IO ()
run a = do
  inputContents <- readFile (input a)
  let doc = readString [withParseHTML yes, withWarnings no] inputContents
  links <- runX . xshow $ doc >>> indentDoc //> cssLinks
  mapM_ putStrLn links
cssLinks = links >>> (rels <+> hrefs <+> types)
  where
    links = hasName "link"
    rels = hasAttrValue "rel" (isInfixOf "stylesheet")
    hrefs = hasAttrValue "href" (endswith ".css")
    types = hasAttrValue "type" (== "text/css")
Yet every time I run this (on any web page), I get duplicated results / nodes. I noticed that <+> is part of the ArrowPlus typeclass, which mimics a monoid, and ArrowXml is an instance of both ArrowList and ArrowTree, which gives me a lot to work with. Would I have to construct ArrowIf predicates? Any help with this would be wonderful :)

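You can get the arrow result as a [XmlTree], remove the duplicates with Data.List.nub, and then render the string representation.
import qualified Data.List as L
import qualified "hxt" Text.XML.HXT.DOM.ShowXml as SX -- package-qualified import, needs the PackageImports extension
...
links <- runX $ doc >>> indentDoc //> cssLinks
-- remove duplicates first (L.nub), then render with SX.xshow
putStrLn (SX.xshow . L.nub $ links)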
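The duplicates come from <+> itself: it concatenates the results of both sides, so a link that satisfies all three predicates is emitted three times. Another way to avoid that is orElse from Control.Arrow.ArrowIf, which only tries the right-hand arrow when the left-hand one yields nothing. A sketch under that assumption, with Data.List.isSuffixOf standing in for the question's endswith:

```haskell
import Data.List (isInfixOf, isSuffixOf)
import Text.XML.HXT.Core

-- Each <link> element is emitted at most once: orElse stops at the first
-- predicate that succeeds instead of collecting the results of all of them.
cssLinks :: ArrowXml a => a XmlTree XmlTree
cssLinks = hasName "link" >>> (rels `orElse` hrefs `orElse` types)
  where
    rels  = hasAttrValue "rel"  (isInfixOf "stylesheet")
    hrefs = hasAttrValue "href" (isSuffixOf ".css")
    types = hasAttrValue "type" (== "text/css")
```

This keeps the original pipeline (doc >>> indentDoc //> cssLinks) unchanged and needs no post-processing of the result list.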

Test if it exists a node HXT

Is there a way to test a node (an attribute value) and use the result in an if-condition?
For example:
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core
import System.Environment -- for getArgs
import Data.List.Split (splitOn)
data Class = Class { name :: String }
  deriving (Show, Eq)
main = do
  [src] <- getArgs
  teams <- runX (readDocument [withValidate no] src >>> getClass)
  print teams
-- Test
test = if (True) then getAttrValue "rdf:about" else getAttrValue "rdf:ID"
atTag tag = deep (isElem >>> hasName tag)
getClass = atTag "owl:Class" >>>
  proc l -> do
    className <- test -< l
    returnA -< Class { name = splitOn "#" className !! 1 }
In this example, I would like to test for an attribute: if it exists, take the then-branch, otherwise the else-branch.
I looked at the XmlArrow API and there are functions that seem suited to this (for example isAttr or hasAttr), but they don't return a Boolean.
I have thought of other ways to solve it, but I think there must be a simpler solution.
Can someone give me a hint, please?

You can use the functions of the module Control.Arrow.ArrowIf of the hxt package. Here you find the function ifA, a lifted version of the if-else-statement. For example the code
if (True) then getAttrValue "rdf:about" else getAttrValue "rdf:ID"
should be written as
ifA (constA True) (getAttrValue "rdf:about") (getAttrValue "rdf:ID")
Depending on what you want to achieve, you may prefer functions derived from ifA, such as guards.
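For the case in the question, the condition can be the attribute test itself. ifA treats its first argument as true when it produces at least one result, so hasAttr works directly as a predicate (and, by the same rule, constA False would still select the then-branch). A small sketch; attrOr is a hypothetical helper name:

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: value of attribute 'primary' if the node has it,
-- otherwise the value of attribute 'fallback'.
attrOr :: ArrowXml a => String -> String -> a XmlTree String
attrOr primary fallback =
  ifA (hasAttr primary) (getAttrValue primary) (getAttrValue fallback)

-- The selector from the question, expressed with the helper.
test :: ArrowXml a => a XmlTree String
test = attrOr "rdf:about" "rdf:ID"
```

Note that getAttrValue returns an empty string when the attribute is missing, so if neither attribute is present the result is "" rather than a failure.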

put xml into a hash table

I am trying to get the information out of an XML file into a lookup table.
So far I have been reading about which libraries are available and how to use them.
I went with hxt and hashtables.
Here is the file:
<?xml version="1.0" encoding="UTF-8" ?>
<tables>
<table name="nametest1">
test1
</table>
<table name="nametest2">
test2
</table>
</tables>
I would like to have the following pairs:
nametest1, test1
nametest2, test2
etc...
-- | We get the xml into a hash
getTables :: IO (H.HashTable String String)
getTables = do
  confPath <- getEnv "ENCODINGS_XML_PATH"
  doc <- runX $ readDocument [withValidate no] confPath
  -- this is the part I don't have
  -- I understand the whole hashtable create and insert process
  -- It is getting the xml info that is blocking me
  where -- I think I might use the following so I shamelessly took them from the net
    atTag tag = deep (isElem >>> hasName tag)
    text = getChildren >>> getText
I saw many examples of how to do similar things but I can't figure out how to get the name attribute at each node.
Cheers,
rakwatt

Here is an example that reads a file with the name of test.xml and just prints out the (name,text) pairs:
import Text.XML.HXT.Core
-- | Gets the name attribute and the content of the selected items as a pair
getAttrAndText :: (ArrowXml a) => a XmlTree (String, String)
getAttrAndText =
  getAttrValue "name" -- Get the value of the attribute 'name'
  &&& deep getText -- and pair it with the text of the node
-- | Gets all "table" items under a root tables item
getTableItem :: (ArrowXml a) => a XmlTree XmlTree
getTableItem =
  deep (hasName "tables") -- Find a tag <tables> anywhere in the document
  >>> getChildren -- Get all children of that tag
  >>> hasName "table" -- Filter those that have the tag <table>
  >>> hasAttr "name" -- Filter those that have an attribute name
-- | The main function
main = (print =<<) $ runX $ -- Print the result
  readDocument [withValidate no] "test.xml" -- Read the document
  >>> getTableItem -- Get all table items
  >>> getAttrAndText -- Get the attribute 'name' and the text of those nodes
The construction of the pairs happens in getAttrAndText. The rest of the functions just open the file and select all <table> tags that are immediate children of a <tables> tag. You might still want to strip leading and trailing whitespace from the text.
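To go from the pairs to an actual lookup table, here is a sketch that trims the text and collects everything into a Data.Map from the containers package. That is a deliberate swap for the mutable hashtables package mentioned in the question; trim, tablePairs and loadTables are illustrative names.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import qualified Data.Map as Map
import Text.XML.HXT.Core

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- (name, trimmed text) for each <table name="..."> directly under <tables>.
tablePairs :: ArrowXml a => a XmlTree (String, String)
tablePairs =
  deep (hasName "tables") >>> getChildren >>> hasName "table" >>> hasAttr "name"
    >>> (getAttrValue "name" &&& (deep getText >>^ trim))

-- Hypothetical loader: read the file and build an immutable lookup table.
-- If a name occurs more than once, Map.fromList keeps the last value.
loadTables :: FilePath -> IO (Map.Map String String)
loadTables path =
  Map.fromList <$> runX (readDocument [withValidate no] path >>> tablePairs)
```

A lookup is then Map.lookup "nametest1" <$> loadTables "test.xml". The same list of pairs could just as well be inserted into an H.HashTable one by one.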

HXT: Left-Factoring Nondeterministic Arrows?

I'm trying to come to terms with Haskell's XML Toolbox (HXT) and I'm hitting a wall somewhere, because I don't seem to fully grasp arrows as a computational tool.
Here's my problem, which I hoped to illustrate a little better using a GHCi session:
> let parse p = runLA (xread >>> p) "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> :t parse
parse :: LA XmlTree b -> [b]
So parse is a small helper function that applies whatever arrow I give it to the trivial XML document
<root>
<a>foo</a>
<b>bar</b>
<c>baz</c>
</root>
I define another helper function, this time to extract the text below a node with a given name:
> let extract s = getChildren >>> isElem >>> hasName s >>> getChildren >>> getText
> :t extract
extract :: (ArrowXml cat) =>
String -> cat (Data.Tree.NTree.TypeDefs.NTree XNode) String
> parse (extract "a" &&& extract "b") -- extract two nodes' content.
[("foo","bar")]
With the help of this function, it's easy to use the &&& combinator to pair up the text of two different nodes, and then, say, pass it to a constructor, like this:
> parse (extract "a" &&& extract "b" >>> arr (\(a,b) -> (b,a)))
[("bar","foo")]
Now comes the part I don't understand: I want to left-factor! extract calls getChildren on the root-node twice. Instead, I'd like it to only call it once! So I first get the child of the root node
> let extract' s = hasName s >>> getChildren >>> getText
> :t extract'
extract' :: (ArrowXml cat) => String -> cat XmlTree String
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[]
Note, that I've tried to re-order the calls to, say, isElem, etc. in order to find out if that's the issue. But as it stands, I just don't have any idea why this isn't working. There is an arrow 'tutorial' on the Haskell wiki and the way I understood it, it should be possible to do what I want to do that way — namely use &&& in order to pair up the results of two computations.
It does work, too, but only at the start of the arrow chain, not midway through, when I already have some results that I want to keep 'shared'. I have the feeling that I just can't wrap my head around a difference in ideas between normal function composition and arrow notation. I'd be very appreciative of any pointers! (Even if it is just to some generic arrow tutorial that goes a little more in-depth than the one on the Haskell wiki.)
Thank you!

If you convert the arrow to (and then from) a deterministic version this works as expected:
> let extract' s = unlistA >>> hasName s >>> getChildren >>> getText
> parse (listA (getChildren >>> isElem) >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This isn't really satisfactory, though, and I can't remember off the top of my head why (&&&) behaves this way with a nondeterministic arrow (I'd personally use the proc/do notation for anything much more complicated than this).
UPDATE: There seems to be something weird going on here with runLA and xread. If you use runX and readString everything works as expected:
> let xml = "<root><a>foo</a><b>bar</b><c>baz</c></root>"
> let parse p = runX (readString [] xml >>> p)
> let extract' s = getChildren >>> hasName s >>> getChildren >>> getText
> parse (getChildren >>> isElem >>> (extract' "a" &&& extract' "b"))
[("foo","bar")]
This means you have to run the parser in the IO monad, but there are advantages to using runX anyway (better error messages, etc.).
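A likely explanation for the empty result in the question: for list arrows, f &&& g pairs every result of f with every result of g on the same input node. After getChildren >>> isElem that input is a single child, which is never both <a> and <b>, so one side is always empty and so is the product. The readString version succeeds because its extract' calls getChildren again and the document root sits one level above <root>. A small sketch contrasting the two cases; perChild and factored are illustrative names:

```haskell
import Text.XML.HXT.Core

doc :: String
doc = "<root><a>foo</a><b>bar</b><c>baz</c></root>"

-- Text below the current node, provided the node is an element named s.
extract' :: ArrowXml cat => String -> cat XmlTree String
extract' s = hasName s >>> getChildren >>> getText

-- (&&&) runs once per child; no child matches both names, so this is empty.
perChild :: [(String, String)]
perChild =
  runLA (xread >>> getChildren >>> isElem >>> (extract' "a" &&& extract' "b")) doc

-- Left-factored: collect the children once with listA, then let each side
-- of (&&&) search the whole list via unlistA.
factored :: [(String, String)]
factored =
  runLA (xread >>> listA (getChildren >>> isElem)
           >>> ((unlistA >>> extract' "a") &&& (unlistA >>> extract' "b"))) doc
```

Here perChild reproduces the [] from the question and factored gives [("foo","bar")], matching the listA/unlistA version above.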
