I have a directory of xml files in a source directory that I want to turn into a directory of html files in a destination directory. It seems I can use getDirectoryFiles to get files from a directory, but that is an Action, and want needs not an Action [FilePath] but just a [FilePath]. How can I do something like want ["dest/*.html"] in Shake?
If I understand you correctly, you can do it like this.
First of all, write a rule that creates the html file from xml.
main = shakeArgs shakeOptions $ do
    "dest/*.html" %> \out -> do
        let src = "source" </> dropDirectory1 out -<.> "xml"
        -- todo: generate out (HTML) from src (XML)
Then use action to add an action that is run on every build. It lists the XML sources and needs the corresponding HTML outputs.
main = shakeArgs shakeOptions $ do
    action $ do
        srcs <- getDirectoryFiles "source" ["*.xml"]
        need ["dest" </> src -<.> "html" | src <- srcs]

    "dest/*.html" %> \out -> do
        let src = "source" </> dropDirectory1 out -<.> "xml"
        -- todo: translate HTML(out) from XML(src)
For your information, want is defined as: want xs = action $ need xs.
getDirectoryFiles returns a result in the Action monad, but want returns a result in the Rules monad. You probably meant need instead, which is basically the same as want but in Action. Then just use >>= or do blocks like you would with any other monad:
do
    directoryFiles <- getDirectoryFiles path patterns
    need directoryFiles
If you want to end up back in Rules after that, then wrap the entire do block in a call to action.
I have a problem using scalpel to capture a block of tags.
Given the following HTML snippet stored in testS :: String
<body>
<h2>Apple</h2>
<p>I Like Apple</p>
<p>Do you like Apple?</p>
<h2>Banana</h2>
<p>I Like Banana</p>
<p>Do you like Banana?</p>
<h2>Carrot</h2>
<p>I Like Carrot</p>
<p>Do you like Carrot?</p>
</body>
I want to parse each group of one h2 and two p tags as a single Block record.
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import Text.HTML.Scalpel

data Block = B String String String
    deriving Show

block :: Scraper String Block
block = do
    h  <- text $ "h2"
    pa <- text $ "p"
    pb <- text $ "p"
    return $ B h pa pb

blocks :: Scraper String [Block]
blocks = chroot "body" $ replicateM 3 block
But the result of scraping is not what I want. It looks like it keeps capturing the first block repeatedly and never consumes it.
λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
Expected output:
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
How can I make it work?
First, I apologize for proposing a solution without testing or knowing anything about scalpel (such arrogance). Let me make it up to you; here's my totally rewritten attempt.
First, this monstrosity works.
blocks :: Scraper String [Block]
blocks = chroot "body" $ do
    hs <- texts "h2"
    ps <- texts "p"
    return $ combine hs ps
  where
    combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
    combine _ _ = []
I call it a monstrosity because it erases the structure of the document with the two texts calls and then recreates it in the assumed order via combine. This probably isn't such a big deal in practice though, since most pages will be structured by combining tags via <div>.
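To make that fragility concrete, here is a small sketch of combine on its own, with no scalpel involved. The Eq instance and the sample lists are additions for illustration only. The second call shows how a heading with a single paragraph shifts the pairing for everything after it.

```haskell
-- Standalone sketch of the pairing step; the sample data is made up.
data Block = B String String String
  deriving (Show, Eq)

-- Pair each heading with the next two paragraphs, in document order.
combine :: [String] -> [String] -> [Block]
combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
combine _ _ = []

main :: IO ()
main = do
  -- Well-formed input: exactly two paragraphs per heading.
  print $ combine ["Apple", "Banana"]
                  ["I Like Apple", "Do you like Apple?", "I Like Banana", "Do you like Banana?"]
  -- Apple has only one paragraph here, so Banana's first paragraph is
  -- attributed to Apple and the Banana block is dropped entirely.
  print $ combine ["Apple", "Banana"]
                  ["I Like Apple", "I Like Banana", "Do you like Banana?"]
```

The function never fails; it silently produces misaligned or missing blocks, which is why a parse that follows the document structure is preferable when the page offers one.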
So, if we were to have a different page:
testS' :: String
testS' = unlines [ "<body>",
                   "<div>",
                   "  <h2>Apple</h2>",
                   "  <p>I Like Apple</p>",
                   "  <p>Do you like Apple?</p>",
                   "</div>",
                   "",
                   "<div>",
                   "  <h2>Banana</h2>",
                   "  <p>I Like Banana</p>",
                   "  <p>Do you like Banana?</p>",
                   "",
                   "</div>",
                   "<div>",
                   "  <h2>Carrot</h2>",
                   "  <p>I Like Carrot</p>",
                   "  <p>Do you like Carrot?</p>",
                   "</div>",
                   "</body>"
                 ]
Then we can parse via:
block' :: Scraper String Block
block' = do
    h <- text $ "h2"
    [pa,pb] <- texts $ "p"
    return $ B h pa pb

blocks' :: Scraper String [Block]
blocks' = chroots ("body" // "div") $ block'
Yielding,
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
Edit: re >>= and combine
My combine, above, is a local where definition. What you see there is what you get. It's unrelated to the function used in >>=, which incidentally is also a locally defined function with a slightly different name, combined. Even if they had the same name, however, it wouldn't matter, since each is only in scope within its respective function.
As for >>=, and just going by the observed behavior, each scrape starts from the beginning of the currently selected tags. So in your block definition, chroot "body" returns all tags in the body, text "h2" matches the first <h2>, and the next two text "p" calls both match the first <p>. So the bind is acting like an "and": given the scalpel context of a bunch of tags, match an <h2> and a <p> and (redundantly) a <p>. Notice that in my <div>-based parse I could use texts (note the "s") to get the two <p> I was expecting.
Finally, this behavior clicked for me when I saw it was based on tag soup (and, at the same time, why they named it tag soup). Each of these scrapes is like dipping a spoon into an unordered soup of tags. The selector makes the soup; the scraper is your spoon. Hope that helps.
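A minimal sketch of that behavior, assuming scalpel is installed. The sample string and the name firstVsAll are made up for illustration. Going by the behavior described above, both text "p" calls should yield the first paragraph, while the single texts "p" call yields every paragraph.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Made-up sample input for illustration.
sample :: String
sample = "<body><h2>Apple</h2><p>one</p><p>two</p></body>"

-- Two separate `text "p"` calls versus a single `texts "p"` call.
-- Each call runs against the same selected tags; nothing is consumed.
firstVsAll :: Scraper String (String, String, [String])
firstVsAll = chroot "body" $ do
  a  <- text "p"   -- first <p>
  b  <- text "p"   -- first <p> again
  ps <- texts "p"  -- every <p>
  return (a, b, ps)

main :: IO ()
main = print (scrapeStringLike sample firstVsAll)
```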
This is now supported in version 0.6.0 of scalpel through the use of SerialScrapers. SerialScrapers allow you to focus on one child of the current root at a time and expose APIs to move the focus and execute Scrapers on the currently focused node.
Adapting the example code in the documentation to your HTML gives:
-- Copyright 2019 Google LLC.
-- SPDX-License-Identifier: Apache-2.0
-- Chroot to the body tag and start a SerialScraper context with inSerial.
-- This will allow for focusing each child of body.
--
-- Many applies the subsequent logic repeatedly until it no longer matches
-- and returns the results as a list.
chroot "body" $ inSerial $ many $ do
    -- Move the focus forward until text can be extracted from an h2 tag.
    title <- seekNext $ text "h2"
    -- Create a new SerialScraper context that contains just the tags between
    -- the current focus and the next h2 tag. Then until the end of this new
    -- context, move the focus forward to the next p tag and extract its text.
    ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
    return (title, ps)
Which would return:
[
  ("Apple", ["I Like Apple", "Do you like Apple?"]),
  ("Banana", ["I Like Banana", "Do you like Banana?"]),
  ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
]
In a yesod application, I want to create URL attributes for a graph that will be rendered by graphviz, and I want to use interpolation. Ideally,
graphToDot nonClusteredParams { fmtNode = \ (n,l) ->
    [ URL [whamlet| @{MyRoute ...} |] ]
  } g
Of course, the types don't match:
The URL attribute takes pure Text, but whamlet is monadic (it produces a widget).
When I replace it with shamlet, the type is fine, but it cannot interpolate: URL interpolation used, but no URL renderer provided
Is there an easy way to solve this?
This works: get the render function (in the monad), and apply (in pure code)
render <- getUrlRender
let d = graphToDot ...
          [ URL $ render $ MyRoute ... ]
I found this here, where a similar problem is solved: https://github.com/yesodweb/yesod/wiki/Using-type-safe-urls-from-inside-javascript
I would like to extract the text contents from the HTML page below, namely all the paragraphs from the <div>.
I use the xml-conduit package for html parsing and came up with the following code:
getWebPageContents :: Url -> IO [T.Text]
getWebPageContents u = do
    cursor <- cursorFor u
    return $ cursor $// filter &/ content

filter = element "div" >=> attributeIs "id" "article-body-blocks" &// element "p"
This will return most of the text, but not the text inside links (e.g. "front page of today's Daily Mirror").
Could anyone help?
You need to filter to all the descendants of the p tags, not just the children. You probably just need to replace &/ content with &// content.
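A minimal sketch of the difference between the two axes, assuming xml-conduit is installed. It parses a made-up XML string with Text.XML.parseText_ instead of fetching an HTML page, so the sample markup and the names childText and descendantText are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor (Cursor, content, element, fromDocument, ($//), (&/), (&//))

-- Made-up sample: a paragraph whose text is partly wrapped in a link.
sample :: TL.Text
sample = "<div><p>Read the <a href=\"#\">front page</a> today</p></div>"

cursor :: Cursor
cursor = fromDocument (parseText_ def sample)

-- Only text nodes that are direct children of <p>: the link text is missed.
childText :: [T.Text]
childText = cursor $// element "p" &/ content

-- Text nodes anywhere below <p>: the link text is included.
descendantText :: [T.Text]
descendantText = cursor $// element "p" &// content

main :: IO ()
main = do
  print childText
  print descendantText
```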
I have been following the yesod tutorial and I am stuck on how to build a unit test involving parameters in a view that also hit a database. Backtracking a little, I followed the Echo.hs example:
getEchoR :: Text -> Handler Html
getEchoR theText = do
    defaultLayout $ do
        $(widgetFile "echo")
Here is the corresponding test. Note that I have to convert the parameter to Text using Data.Text.pack.
yit "Echo some text" $ do
    get $ EchoR $ pack "Hello"
    statusIs 200
Now I have the model defined like so:
Tag
    name Text
    type Text
The handler that renders it takes a TagId as its parameter:
getTagR :: TagId -> Handler Html
getTagR tagId = do
    tag <- runDB $ get404 tagId
    defaultLayout $ do
        setTitle $ toHtml $ tagName tag
        $(widgetFile "tag")
This is where the test fails.
yit "Get a tag" $ do
    -- tagId is undefined
    get $ TagR tagId
    statusIs 200
I am not sure how to define the tagId. It doesn't work with a String, Text, or Num, and I can't figure out how to generate one, as I can't find any example code in the various Database.Persist tutorials. Or better yet, is there some other way to call the get method?
You want to use the Key data constructor to construct an ID value, which takes a PersistValue as a parameter. A simple example of creating one is:
Key $ PersistInt64 5
Another option is to call get with a textual URL, e.g. get ("/tag/5" :: Text).
Since times have changed, I'll leave this note here to say that these days one would use something like:
fromBackendKey 5
See the docs for fromBackendKey.
I have a Happstack program that dynamically converts Markdown documents to HTML using Text.Pandoc:
import qualified Text.Pandoc as Pandoc
...
return $ toResponse $ Pandoc.writeHtml Pandoc.def contents
I.e. Pandoc is returning a Text.Blaze.Html.Html value. (This has a ToMessage instance which means it can be used as a response to a request.)
How do I insert a custom CSS stylesheet into Pandoc's output? What if I want to customise the HTML e.g. by wrapping the <body> contents with some other elements?
When Pandoc's "standalone mode" option is enabled, it uses a template to format the output.
The template and its substitution variables can be set in the writerTemplate and writerVariables members of WriterOptions.
The command line tool has a default set of templates it uses. You can see the default template for a format using e.g. pandoc -D html.
When using the library, the default is to use an empty template. You can get the default template programmatically using getDefaultTemplate.
Here's some example code:
import Text.Blaze.Html.Renderer.String
import Text.Pandoc
getHtmlOpts = do
    template <- either (error . show) id
        `fmap` getDefaultTemplate Nothing "html"
    return $ def
        { writerStandalone = True
        , writerTemplate = template
        , writerVariables = [
            ("css", "/path/to/style.css"),
            ("header-includes",
             "<style>p { background-color: magenta; }</style>")]
        }

main = do
    opts <- getHtmlOpts
    putStrLn $ renderHtml $ writeHtml opts $ readMarkdown def "..."
You can also write your own template, call it for instance template.html and use the --template template.html option when calling pandoc from the command-line.
The documentation is at https://pandoc.org/MANUAL.html#templates, and the default template (for inspiration) is at https://raw.githubusercontent.com/jgm/pandoc-templates/master/default.html5.
Pandoc, when run from the command line, takes some arguments that allow you to insert something into the <head> tag (-H), before content (-B) and after content (-A). I don't know about Happstack, but surely there must be a way to pass these parameters to Pandoc.writeHtml