Suppose an XML based language where the node attribute animal is illegal. Consider:
{-# LANGUAGE Arrows, RankNTypes #-}
module Lib ( parseXml ) where
import Control.Arrow
import Text.XML.HXT.Core
parseXml = runLA (xread >>> isElem >>> myParser) content
where
content = unlines
[ "<pet animal='cat'>felix</pet>"
, "<pet>milo</pet>"
, "<pet animal='rat'>tom</pet>" ]
myParser :: (ArrowXml a) => a XmlTree [String]
myParser = proc xml -> do
x <- isElem -< xml
pets <- (getText <<< getChildren <<< neg (hasAttr "animal")) >. id -< x
returnA -< pets
The result of evaluating parseXml is:
parseXml
[[],["milo"],[]]
Which is not what is intended. The parser has silently ignored the 1st and 3rd XML nodes since they do not conform to the myParser, specifically neg (hasAttr "animal"), but I'd like the behaviour to not silently ignore them, but instead halt parsing at the 1st XML node.
How can I change this code such that HXT throw an error if it ever encounters our syntax violation that an XML node can never have an "animal" attribute? That is, when attempting to parse the 1st XML node it returns a Left value e.g.
parseXml
Left (ParseError "'animal' attribute is not permitted")
Try err and its neighbors with when.
Related
I have no experience in Haskell. I'm trying to parse many .json files to a data structure in Haskell using aeson. However, by reasons beyond my control, I need to store the name of the file from where the data was parsed as one of the fields in my data. A simple example of what I have so far is:
data Observation = Observation { id :: Integer
, value :: Integer
, filename :: String}
instance FromJSON Observation where
parseJson (Object v) =
Observation <$> (read <$> v .: "id")
<*> v .: "value"
<*> ????
My question is: what is a smart way to be able to serialize my data when parsing a json file having access to the name of the file?
What comes in my mind is to define another data like NotNamedObservation, initialize it and then having a function that converts NotNamedObservation -> String -> Observation (where String is the filename) but that sounds like a very poor approach.
Thanks.
Just make your instance a function from file path to Observation:
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import qualified Data.ByteString.Lazy as LBS
import System.Environment
data Observation = Observation { ident :: Integer
, value :: Integer
, filename :: FilePath
} deriving (Show)
instance FromJSON (FilePath -> Observation) where
parseJSON (Object v) =
do i <- read <$> v .: "id"
l <- v .: "value"
pure $ Observation i l
main :: IO ()
main = do
files <- getArgs
fileContents <- traverse LBS.readFile files
print fileContents
let fs = map (maybe (error "Invalid json") id . decode) fileContents
jsons :: [Observation]
jsons = zipWith ($) fs files
print jsons
When you don't control the data definition and you have strict requirements about the format to parse, it's better to write the (de)serializer explicitly.
If external information is required to fully construct values, avoid the FromJSON/ToJSON type classes, and just write standalone parsers.
aeson's deriving mechanism is more suited to applications that talk to themselves (and thus only care about round-tripping between parseJSON and toJSON), or where there is flexibility either in defining the JSON format or the Haskell types.
If you still have to use these classes for some reason, one option is of course to just put undefined in those missing fields. To rely on the type system more, you can also parameterize types by a "phase" (that assumes again you can tweak the data type), which is a type constructor that wraps some fields.
Functor functors
One way to keep that style compatible with regular records is to use this HKD/defunctionalization design pattern.
data Observation' p = Observation
{ id :: Integer
, value :: Integer
, filename :: p String }
-- This is isomorphic to the original Observation data type
type Observation = Observation Identity
-- When we don't have the filename available, we keep the field empty with Proxy
instance FromJSON (Observation' Proxy) where
...
mkObservation :: FileName -> Observation' Proxy -> Observation
I'm constantly admitting that I'm bad at lens, but isn't learning by examples is a good thing? I want to take HTML, parse it with taggy-lens and then remove all the script elements from inside. Here's my attempt:
#!/usr/bin/env stack
-- stack --resolver lts-7.1 --install-ghc runghc --package text --package lens --package taggy-lens --package string-class --package classy-prelude
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude
import Control.Lens hiding (children, element)
import Data.String.Class (toText, fromText, toString)
import Data.Text (Text)
import Text.Taggy.Lens
import qualified Text.Taggy.Lens as Taggy
import qualified Text.Taggy.Renderer as Renderer
somehtmlSmall :: Text
somehtmlSmall =
"<!doctype html><html><body>\
\<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div>\
\</body></html>"
renderWithoutScriptTag :: Text
renderWithoutScriptTag =
let mArticle :: Maybe Taggy.Element
mArticle =
(fromText somehtmlSmall) ^? html .
allAttributed (ix "id" . only "article")
mArticleFiltered =
fmap
(\el ->
el ^.. to universe . traverse .
filtered (\n -> n ^. name /= "script"))
mArticle
in maybe "" (toText . concatMap Renderer.render) mArticleFiltered
main :: IO ()
main = print renderWithoutScriptTag
Mark this file as executable and just run it, and you'll see:
➜ tmp ./scraping-question.hs
"<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div><div>first</div><div>second</div><div>third</div>"
So, this didn't work. I would like to:
have a working solution
understand the working solution
Would be especially thankful, if you'd help me realize what's wrong with mine. Thanks!
The root of your problem is universe, which flattens the DOM tree into a list. If you look again at the output, you will see the filtering is working fine but the tree structure is lost -- and so you get the unmodified article element (with all children still within) followed by the children nodes minus the script element.
One Control.Lens.Plated combinator that can do what you want is transform, which transforms "every element in the tree, in a bottom-up manner":
transform :: Plated a => (a -> a) -> a -> a
In particular, you can use it to filter the children nodes recursively:
renderWithoutScriptTag :: Text
renderWithoutScriptTag =
let mArticle :: Maybe Taggy.Element
mArticle =
(fromText somehtmlSmall) ^? html .
allAttributed (ix "id" . only "article")
mArticleFiltered =
fmap
(transform (children %~ filter (\n ->
n ^? element . name /= Just "script")))
mArticle
in maybe "" (toText . Renderer.render) mArticleFiltered
I have a Haskell source file which using Unicode syntax:
{-# LANGUAGE UnicodeSyntax #-}
succ' :: Int → Int
succ' = succ
main :: IO ()
main = print $ succ' 1
This parses and runs fine with GHC. Additionally, stylish-haskell and hlint (both based on haskell-src-exts) can read this file without any trouble. However, when I try to parse it myself using haskell-src-exts:
import Language.Haskell.Exts (parseModule)
main = do
x <- readFile "test.hs"
print $ parseModule x
I get the error message:
ParseFailed (SrcLoc {srcFilename = "<unknown>.hs", srcLine = 6, srcColumn = 1}) "TypeOperators is not enabled"
However, providing UnicodeSyntax explicitly in the extensions list or using parseFile works just fine:
import Language.Haskell.Exts
main = do
x <- readFile "test.hs"
print $ parseModuleWithMode defaultParseMode
{ extensions = [UnicodeSyntax]
} x
parseFile "test.hs" >>= print
Any idea why the first approach fails?
From a cursory glance at the source, it doesn't look like parseModule extracts language pragmas from the source before parsing (parseFile does do that by calling getExtensions). By the time parsing has begun, it is already too late to enable unicode syntax.
I'm looking to have my Haskell program read settings from an external file, to avoid recompiling for minor changes. Being familiar with YAML, I thought it would be a good choice. Now I have to put the two pieces together. Google hasn't been very helpful so far.
A little example code dealing with reading and deconstructing YAML from a file would be very much appreciated.
If I'm interested in what packages are available, I go to hackage, look at the complete package list, and then just search-in-page for the keyword. Doing that brings up these choices (along with a few other less compelling ones):
yaml: http://hackage.haskell.org/package/yaml
HsSyck: http://hackage.haskell.org/package/HsSyck
and a wrapper around HsSyck called yaml-light: http://hackage.haskell.org/package/yaml-light
Both yaml and HsSyck look updated relatively recently, and appear to be used by other packages in widespread use. You can see this by checking the reverse deps:
http://packdeps.haskellers.com/reverse/HsSyck
http://packdeps.haskellers.com/reverse/yaml
Of the two, yaml has more deps, but that is because it is part of the yesod ecosystem. One library that depends on HsSyck is yst, which I happen to know is actively maintained, so that indicates to me that HsSyck is fine too.
The next step in making my choice would be to browse through the documentation of both libraries and see which had the more appealing api for my purposes.
Of the two, it looks like HsSyck exposes more structure but not much else, while yaml goes via the json encodings provided by aeson. This indicates to me that the former is probably more powerful while the latter is more convenient.
A simple example:
First you need a test.yml file:
db: /db.sql
limit: 100
Reading YAML in Haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics
import Data.Yaml
data Config = Config { db :: String
, limit :: Int
} deriving (Show, Generic)
instance FromJSON Config
main :: IO ()
main = do
file <- decodeFile "test.yml" :: IO (Maybe Config)
putStrLn (maybe "Error" show file)
With yamlparse-applicative you can describe your YAML parser in a static analysis-friendly way, so you can get a description of the YAML format from a parser for free. I'm going to use Matthias Braun's example format for this one:
{-# LANGUAGE ApplicativeDo, RecordWildCards, OverloadedStrings #-}
import Data.Yaml
import Data.Aeson.Types (parse)
import YamlParse.Applicative
import Data.Map (Map)
import qualified Data.Text.IO as T
data MyType = MyType
{ stringsToStrings :: Map String String
, mapOfLists :: Map String [String]
} deriving Show
parseMyType :: YamlParser MyType
parseMyType = unnamedObjectParser $ do
stringsToStrings <- requiredField' "strings_to_strings"
mapOfLists <- requiredField' "map_of_lists"
pure MyType{..}
main :: IO ()
main = do
T.putStrLn $ prettyParserDoc parseMyType
yaml <- decodeFileThrow "config/example.yaml"
print $ parse (implementParser parseMyType) yaml
Note that main is able to print out a schema before even seeing an instance:
strings_to_strings: # required
<key>: <string>
map_of_lists: # required
<key>: - <string>
Success
(MyType
{ stringsToStrings = fromList
[ ("key_one","val_one")
, ("key_two","val_two")
]
, mapOfLists = fromList
[ ("key_one",["val_one","val_two","val_three"])
, ("key_two",["val_four","val_five"])
]
})
Here's how to parse specific objects from your YAML file using the yaml library.
Let's parse parts of this file, config/example.yaml:
# A map from strings to strings
strings_to_strings:
key_one: val_one
key_two: val_two
# A map from strings to list of strings
map_of_lists:
key_one:
- val_one
- val_two
- val_three
key_two:
- val_four
- val_five
# We won't parse this
not_for: us
This module parses strings_to_strings and map_of_lists individually and puts them into a custom record, MyType:
{-# Language OverloadedStrings, LambdaCase #-}
module YamlTests where
import Data.Yaml ( decodeFileEither
, (.:)
, parseEither
, prettyPrintParseException
)
import Data.Map ( Map )
import Control.Applicative ( (<$>) )
import System.FilePath ( FilePath
, (</>)
)
data MyType = MyType {stringsToStrings :: Map String String,
mapOfLists :: Map String [String]} deriving Show
type ErrorMsg = String
readMyType :: FilePath -> IO (Either ErrorMsg MyType)
readMyType file =
(\case
(Right yamlObj) -> do
stringsToStrings <- parseEither (.: "strings_to_strings") yamlObj
mapOfLists <- parseEither (.: "map_of_lists") yamlObj
return $ MyType stringsToStrings mapOfLists
(Left exception) -> Left $ prettyPrintParseException exception
)
<$> decodeFileEither file
yamlTest = do
parsedValue <- readMyType $ "config" </> "example.yaml"
print parsedValue
Run yamlTest to see the parsing result.
All the examples I've seen so far using the Haskell XML toolkit, HXT, uses runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? Seems to be a pure operation to me, don't understand why I'm forced to be inside IO.
You can use HXT's xread along with runLA to parse an XML string outside of IO.
xread has the following type:
xread :: ArrowXml a => a String XmlTree
This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.
runLA is like runX, but for things of type LA:
runLA :: LA a b -> a -> [b]
LA is an instance of ArrowXml.
To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:
{-# LANGUAGE Arrows #-}
module Main where
import qualified Data.Map as M
import Text.XML.HXT.Arrow
classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
where
divs = getChildren >>> hasName "div"
pairs = proc div -> do
cls <- getAttrValue "class" -< div
val <- deep getText -< div
returnA -< (cls, val)
getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
where lookupValues cs m = map (flip M.lookup m) cs
xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
\<div class='c3'>123</div><div class='c4'>234</div></div>"
values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml
main = print values
classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.
It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.
A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can: it gives you more control over the configuration of the parser, error messages, etc.
Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:
classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
where pairs = getAttrValue "class" &&& deep getText
Travis Brown's answer was very helpful. I just want to add my own solution here, which I think is a bit more general (using the same functions, just ignoring the problem-specific issues).
I was previously unpickling with:
upIO :: XmlPickler a => String -> IO [a]
upIO str = runX $ readString [] str >>> arrL (maybeToList . unpickleDoc xpickle)
which I was able to change to this:
upPure :: XmlPickler a => String -> [a]
upPure str = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle)) str
I completely agree with him that doing this gives you less control over the configuration of the parser etc, which is unfortunate.