I would like to extract the text contents from the HTML page below: all the paragraphs from the <div>.
I use the xml-conduit package for HTML parsing and came up with the following code:
getWebPageContents :: Url -> IO [T.Text]
getWebPageContents u = do
    cursor <- cursorFor u
    return $ cursor $// filter &/ content
  where
    filter = element "div" >=> attributeIs "id" "article-body-blocks" &// element "p"
This returns most of the text, but not the text inside links ("front page of today's Daily Mirror").
Could anyone help?
You need to filter to all the descendants of the p tags, not just the children. You probably just need to replace &/ content with &// content.
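For reference, here is the asker's function with that one change applied; a sketch assuming the same imports, Url type, and cursorFor as in the question:
getWebPageContents :: Url -> IO [T.Text]
getWebPageContents u = do
    cursor <- cursorFor u
    -- &// content collects text from every descendant of the matched <p> tags,
    -- so the text inside <a> links is included as well.
    return $ cursor $// filter &// content
  where
    filter = element "div" >=> attributeIs "id" "article-body-blocks" &// element "p"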
I have a custom field with some HTML code in it:
<h1>A H1 Heading</h1>
<h2>A H2 Heading</h2>
<b>Rich Text</b><br>
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd<br>
<ol><li>numbered list</li><li>fgdsfsd f sa</li></ol>Another List<br>
<ul><li>bulleted</li></ul>
I also have another non-stored field where I want to display the plain text version of the above using REGEXP_REPLACE, while preserving the carriage returns/line breaks, maybe even converting <br> and <br/> to \r\n
However, the patterns seem to be different in NetSuite fields compared to using ?replace(...) in FreeMarker, and I'm terrible at remembering regexp patterns :)
Assuming the HTML text is stored in custitem_htmltext, what expression could I use as the default value of the NetSuite Text Area custom field to display the HTML code above as:
A H1 Heading
A H2 Heading
Rich Text
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd
etc...
I understand the bulleted or numbered lists will look crap.
My current non-working formula is:
REGEXP_REPLACE({custitem_htmltext},'<[^<>]*>','')
I've also tried:
REGEXP_REPLACE({custitem_htmltext},'<[^>]+>','') - didn't work
When you use a Text Area type of custom field and input HTML, NetSuite seems to change the control characters ('<' and '>') to HTML entities ('&lt;' and '&gt;'). You can see this if you input the HTML and then change the field type to Long Text.
If you change both fields to Long Text, and re-input the data and formula, the REGEXP_REPLACE() should work as expected.
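If you would rather keep the Text Area field as-is, an untested alternative is to match the entity-encoded brackets directly; the pattern below is only a guess based on the encoding described above, not something NetSuite documents:
REGEXP_REPLACE({custitem_htmltext}, '&lt;[^&]+&gt;', '')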
From what I have learned recently, NetSuite encodes the data by default, so < becomes &lt; and > becomes &gt;.
Try using triple handlebars e.g. {{{custitem_htmltext}}}
https://docs.celigo.com/hc/en-us/articles/360038856752-Handlebars-syntax
This should stop the default behaviour and allow you to use the field in a formula/saved search.
I have a problem using scalpel to capture blocks of tags.
Given the following HTML snippet stored in testS :: String:
<body>
<h2>Apple</h2>
<p>I Like Apple</p>
<p>Do you like Apple?</p>
<h2>Banana</h2>
<p>I Like Banana</p>
<p>Do you like Banana?</p>
<h2>Carrot</h2>
<p>I Like Carrot</p>
<p>Do you like Carrot?</p>
</body>
I want to parse each block of one h2 and two p tags as a single record Block.
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import Text.HTML.Scalpel

data Block = B String String String
    deriving Show

block :: Scraper String Block
block = do
    h  <- text $ "h2"
    pa <- text $ "p"
    pb <- text $ "p"
    return $ B h pa pb

blocks :: Scraper String [Block]
blocks = chroot "body" $ replicateM 3 block
But the result of scraping is not what I want; it looks like it keeps capturing the first block repeatedly and never consumes it.
λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
Expected output:
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
How to make it work?
First, I apologize for proposing a solution without testing or knowing anything about scalpel (such arrogance). Let me make it up to you; here's my totally rewritten attempt.
First, this monstrosity works.
blocks :: Scraper String [Block]
blocks = chroot "body" $ do
    hs <- texts "h2"
    ps <- texts "p"
    return $ combine hs ps
  where
    combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
    combine _ _ = []
I call it a monstrosity because it erases the structure of the document with the two texts calls and then recreates it, in the assumed order, via combine. This probably isn't such a big deal in practice, though, since most pages group related tags inside a <div>.
So, if we were to have a different page:
testS' :: String
testS' = unlines [ "<body>",
"<div>",
" <h2>Apple</h2>",
" <p>I Like Apple</p>",
" <p>Do you like Apple?</p>",
"</div>",
"",
"<div>",
" <h2>Banana</h2>",
" <p>I Like Banana</p>",
" <p>Do you like Banana?</p>",
"",
"</div>",
"<div>",
" <h2>Carrot</h2>",
" <p>I Like Carrot</p>",
" <p>Do you like Carrot?</p>",
"</div>",
"</body>"
]
Then we can parse via:
block' :: Scraper String Block
block' = do
    h <- text $ "h2"
    [pa, pb] <- texts $ "p"
    return $ B h pa pb

blocks' :: Scraper String [Block]
blocks' = chroots ("body" // "div") $ block'
Yielding,
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
Edit: re >>= and combine
My combine, above, is a local where definition. What you see there is what you get. It's unrelated to the function used in >>=, which incidentally is also a locally defined function with a slightly different name, combined. Even if they had the same name, however, it wouldn't matter, since each is only in scope within its respective function.
As for the >>=, and just going by the observed behavior, each scrape starts from the beginning of the currently selected tags. So in your block definition, chroot "body" selects all the tags in the body, text "h2" matches the first <h2>, and the next two text "p" calls both match the first <p>. The bind is acting like an "and": given the scalpel context of a bunch of tags, match an <h2> and a <p> and (redundantly) a <p>. Notice that in my <div>-based parse I could use texts (note the "s") to get the two <p> tags I was expecting.
Finally, this behavior clicked for me when I saw it was based on tagsoup (and, simultaneously, why they named it tag soup). Each of these scrapes is like dipping a spoon into an unordered soup of tags. The selector makes the soup, the scraper is your spoon. Hope that helps.
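An untested sketch of that behavior, reusing testS and the imports from the question: inside a single chroot, each scraper runs against the same selection of tags, so both text "p" calls below should return the first <p>.
firstTwoPs :: Maybe (String, String)
firstTwoPs = scrapeStringLike testS $ chroot "body" $ do
    pa <- text "p"
    pb <- text "p"
    return (pa, pb)
-- Expected: Just ("I Like Apple", "I Like Apple")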
This is now supported in version 0.6.0 of scalpel through the use of SerialScrapers. SerialScrapers allow you to focus on one child of the current root at a time and expose APIs to move the focus and execute Scrapers on the currently focused node.
Adapting the example code in the documentation to your HTML gives:
-- Copyright 2019 Google LLC.
-- SPDX-License-Identifier: Apache-2.0
-- Chroot to the body tag and start a SerialScraper context with inSerial.
-- This will allow for focusing each child of body.
--
-- Many applies the subsequent logic repeatedly until it no longer matches
-- and returns the results as a list.
chroot "body" $ inSerial $ many $ do
-- Move the focus forward until text can be extracted from an h2 tag.
title <- seekNext $ text "h2"
-- Create a new SerialScraper context that contains just the tags between
-- the current focus and the next h2 tag. Then until the end of this new
-- context, move the focus forward to the next p tag and extract its text.
ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
return (title, ps)
Which would return:
[
  ("Apple",  ["I Like Apple",  "Do you like Apple?"]),
  ("Banana", ["I Like Banana", "Do you like Banana?"]),
  ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
]
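Here is one way the snippet above could be wrapped into a small, self-contained program; an untested sketch in which testS is recreated inline and headingBlocks is just a name chosen here:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import Text.HTML.Scalpel

testS :: String
testS = unlines
    [ "<body>"
    , "<h2>Apple</h2>",  "<p>I Like Apple</p>",  "<p>Do you like Apple?</p>"
    , "<h2>Banana</h2>", "<p>I Like Banana</p>", "<p>Do you like Banana?</p>"
    , "<h2>Carrot</h2>", "<p>I Like Carrot</p>", "<p>Do you like Carrot?</p>"
    , "</body>"
    ]

headingBlocks :: Scraper String [(String, [String])]
headingBlocks = chroot "body" $ inSerial $ many $ do
    title <- seekNext $ text "h2"
    ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
    return (title, ps)

main :: IO ()
main = mapM_ print (concat (scrapeStringLike testS headingBlocks))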
I want to fetch a website looking like this¹: naked unicode between tags
<html>
한국어
</html>
I'm currently using
openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)
But when inspecting the string, 한국어 is displayed as \237\149\156\234\181\173\236\150\180, i.e. 9 characters.
And what I want would be 3 escaped characters like \u???\u???\u???
I tried Text.pack on page, where page <- openURL "url", but it's already too late at that point.
¹ This is what I see if I let Firefox show me the source, and the page info says UTF-8.
You need to decode the text, for example with Data.Text.Encoding.decodeUtf8 from text or Codec.Binary.UTF8.String.decodeString from utf8-string.
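For example, with the utf8-string route this could look like the following; a sketch reusing the openURL from the question (getResponseBody hands back a String in which each Char is really one raw byte, and decodeString reinterprets that byte sequence as UTF-8):
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Codec.Binary.UTF8.String (decodeString)

openURL :: String -> IO String
openURL x = decodeString <$> (getResponseBody =<< simpleHTTP (getRequest x))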
In a Yesod application, I want to create URL attributes for a graph that will be rendered by graphviz, and I want to use interpolation. Ideally:
graphToDot nonClusteredParams { fmtNode = \ (n, l) ->
    [ URL [whamlet| @{MyRoute ...} |] ]
    } g
Of course, the types don't match:
the URL attribute takes pure Text, but whamlet is monadic (a widget)
when I replace it with shamlet, the type is fine, but it cannot interpolate: URL interpolation used, but no URL renderer provided
Is there an easy way to solve this?
This works: get the render function (in the monad), and apply it (in pure code):
render <- getUrlRender
let d = graphToDot ...
[ URL $ render $ MyRoute ... ]
I found this here, where a similar problem is solved: https://github.com/yesodweb/yesod/wiki/Using-type-safe-urls-from-inside-javascript
In Yesod, I've got a form that populates the type
data Field = Field Text Text Text
    deriving Show
When I write the Hamlet HTML to display it, I'm running into the problem that the Field value is wrapped up in a Maybe (a Maybe Field). So in Hamlet I'm trying to do the following, as shown here:
(Snippet in postHomeR function)
let fieldData = case result of
        FormSuccess res -> Just res
        _ -> Nothing
(In the hamlet file)
<ul>
    $maybe (Field one two three) <- fieldData
    <li>#{show one}
However, when compiling, there is a "Not in scope: one" error. Why is the variable one not created/populated correctly?
You need to indent the <li> so that it's inside the $maybe block. As it sits now, it's a sibling to the $maybe, and thus the variables bound by $maybe are not in scope.
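Applied to the snippet above, the Hamlet becomes:
<ul>
    $maybe (Field one two three) <- fieldData
        <li>#{show one}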