Groovy XmlParser / XmlSlurper: node.localText() position?


I have a follow-up to this question: Groovy XmlSlurper get value of the node without children.
It explains that to get the local inner text of an (HTML) node, without also recursively collecting the text of nested child nodes, one has to use #localText() instead of #text().
For instance, here is a slightly extended example from the original question:
<html>
    <body>
        <div>
            Text I would like to get1.
            <a>extra stuff</a>
            Text I would like to get2.
            <a>link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a>extra stuff</a>
            Text I would like to get2.
            <a>link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>
with the solution applied:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0].localText()
would return:
[Text I would like to get1., Text I would like to get2., Text I would like to get3.]
However, when parsing the <span> part in this example
println htmlParsed.body.span[0].localText()
the output is
[Text I would like to get2., Text I would like to get3.]
The problem I am facing now is that it's apparently not possible to pinpoint the location ("between which child nodes") of the texts. I would have expected the second invocation to yield
[, Text I would like to get2., Text I would like to get3.]
This would have made it clear: Position 0 (before child 0) is empty, position 1 (between child 0 and 1) is "Text I would like to get2.", and position 2 (between child 1 and 2) is "Text I would like to get3." But given the API works as it does, there is apparently no way to determine whether the text returned at index 0 is actually positioned at index 0 or at any other index, and the same is true for all the other indices.
I have tried it with both XmlSlurper and XmlParser, yielding the same results.
If I'm not mistaken, a consequence is that it is also impossible to completely recreate the original HTML document from the parser's output, because this "text index" information is lost.
My question is: Is there any way to find out those text positions? An answer requiring me to change the parser would also be acceptable.
UPDATE / SOLUTION:
For further reference, here's Will P's answer, applied to the original code:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlParser(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0].children().collect {it in String ? it : null}
This yields:
[Text I would like to get1., null, Text I would like to get2., null, Text I would like to get3.]
Note that this requires XmlParser rather than XmlSlurper: with XmlParser, node.children() returns the text pieces as String entries in between the child Node entries.

I don't know jsoup, and I hope it doesn't interfere with the solution, but with a plain XmlParser you can get a list from children() that contains the raw strings:
html = '''<html>
    <body>
        <div>
            Text I would like to get1.
            <a>extra stuff</a>
            Text I would like to get2.
            <a>link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a>extra stuff</a>
            Text I would like to get2.
            <a>link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>'''
def root = new XmlParser().parseText(html)
// children() returns text (String) and elements (Node) in document order
root.body.div[0].children().with {
    assert get(0).trim() == 'Text I would like to get1.'
    assert get(0).getClass() == String
    assert get(1).name() == 'a'
    assert get(1).getClass() == Node
    // trim() keeps the check independent of surrounding whitespace
    assert get(2).trim() == 'Text I would like to get2.'
}
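
For comparison outside Groovy, the same "text position" idea can be tried with Python's standard-library xml.etree.ElementTree, which keeps the text before the first child in .text and the text after each child in that child's .tail. This is only a sketch: the helper name mixed_content is made up for illustration, and the snippet is a cut-down copy of the <span> from the question.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the <span> part of the question's example.
SNIPPET = """<span><a>extra stuff</a>
Text I would like to get2.
<a>link to example</a>
Text I would like to get3.
</span>"""


def mixed_content(element):
    """Return the element's direct content in document order.

    Text runs come back as stripped strings ('' when empty) and child
    elements as None, so the list index shows where each text sits.
    (Hypothetical helper, not part of any library.)
    """
    parts = [(element.text or "").strip()]        # text before the first child
    for child in element:
        parts.append(None)                        # placeholder for the child element
        parts.append((child.tail or "").strip())  # text that follows this child
    return parts


span = ET.fromstring(SNIPPET)
print(mixed_content(span))
```

The empty string at index 0 is the "position 0 is empty" marker the question asks for.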

Related

Find if text exist inside a nested Div, if yes print out the whole string, Selenium Python

I'm very new to Selenium (3.141.0) and Python 3, and I have a problem I can't figure out.
The HTML looks similar to this:
<div class='a'>
    <div>
        <p><b>ABC</b></p>
        <p><b>ABC#123</b></p>
        <p><b>XYZ</b></p>
    </div>
</div>
I want Selenium to check whether # exists inside that div. (I cannot target only the paragraph element, because sometimes the text I want to extract is inside a different element, but it is always inside that <div class='a'>.) If # exists, print the whole <p><b>ABC#123</b></p> (or sometimes <div>ABC#123</div>).

To find an element by the text it contains, you need an XPath. From what you are describing, it looks like you want the locator
//div[@class='a']//*[contains(text(),'#')]
^ a DIV with class 'a'
                 ^ that has a descendant element whose own text contains '#'
The code would look something like
from selenium.webdriver.common.by import By

for e in driver.find_elements(By.XPATH, "//div[@class='a']//*[contains(text(),'#')]"):
    print(e.get_attribute('outerHTML'))
and it will print all instances of <b>ABC#123</b>, <div>ABC#123</div>, or <p>ABC#123</p>, whichever exists
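
If you want to check without a browser which elements such a text test picks out, here is a rough sketch using Python's standard-library xml.etree.ElementTree as a stand-in for Selenium. It is not Selenium code: the helper name elements_with_own_text is made up, and the HTML is the question's snippet with the closing tags tidied so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's markup, made well-formed so ElementTree can parse it.
HTML = """<div class="a">
  <div>
    <p><b>ABC</b></p>
    <p><b>ABC#123</b></p>
    <p><b>XYZ</b></p>
  </div>
</div>"""


def elements_with_own_text(root, needle):
    """Return descendants of root whose own leading text contains needle.

    el.text is the text directly inside the element before its first
    child, so wrapper elements like <p> do not match on behalf of <b>.
    (Hypothetical helper, for illustration only.)
    """
    return [el for el in root.iter()
            if el is not root and needle in (el.text or "")]


root = ET.fromstring(HTML)
matches = elements_with_own_text(root, "#")
print([ET.tostring(el, encoding="unicode").strip() for el in matches])
```

Only the <b> element matches here, because the '#' sits in its own text and not in the text of the surrounding <p>.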

Scrapy parse is returning an empty array, regardless of yield

I am brand new to Scrapy, and I could use a hint here. I realize that there are quite a few similar questions, but none of them seem to fix my problem. I have the following code written for a simple web scraper:
import scrapy
from ScriptScraper.items import ScriptItem

class ScriptScraper(scrapy.Spider):
    name = "script_scraper"
    allowed_domains = ["https://proplay.ws"]
    start_urls = ["https://proplay.ws/dramas/"]

    def parse(self, response):
        for column in response.xpath('//div[@class="content-column one_fourth"]'):
            text = column.xpath('//p/b/text()').extract()
            item = ScriptItem()
            item['url'] = "test"
            item['title'] = text
            yield item
I will want to do some more involved scraping later, but right now, I'm just trying to get the scraper to return anything at all. The HTML for the site I'm trying to scrape looks like this:
<div class="content-column one_fourth">
::before
<p>
<b>
All dramas
<br>
(in alphabetical
<br>
order):
</b>
</p>
...
</div>
and I am running the following command in the Terminal:
scrapy parse --spider=script_scraper -c parse_ITEM -d 2 https://proplay.ws/dramas/
According to my understanding of Scrapy, the code I have written should be yielding the text "All dramas"; however, it is yielding an empty array instead. Can anyone give me a hint as to why this is not producing the expected yield? Again, I apologize for the repetitive question.

Your XPath expressions do not select the data you want. If you want the first row of the first column, the expression should be:
item = {}
item['text'] = response.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()').extract()[0]
extract() returns all matches for the expression as a list. If you want only the first match, use extract()[0] or extract_first().
Note also that inside the loop, column.xpath('//p/b/text()') searches the whole document, because the expression starts with //. Start it with a dot (.//p/b/text()) to search relative to column.
See https://devhints.io/xpath for more on XPath.
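
To see the per-column lookup in isolation, here is a small sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors, so it runs without a crawl. The markup is a simplified, well-formed version of the question's HTML; the second column is invented purely to show that each lookup stays inside its own column.

```python
import xml.etree.ElementTree as ET

# Simplified page: <br> is written as <br/> so the snippet is valid XML.
# The second column is made-up sample data.
PAGE = """<body>
<div class="content-column one_fourth">
  <p><b>All dramas<br/>(in alphabetical<br/>order):</b></p>
</div>
<div class="content-column one_fourth">
  <p><b>Second column</b></p>
</div>
</body>"""

root = ET.fromstring(PAGE)
titles = []
for column in root.findall(".//div[@class='content-column one_fourth']"):
    bold = column.find("./p/b")  # relative path: only this column is searched
    # itertext() yields every text run inside <b>, including those after <br/>
    runs = [t.strip() for t in bold.itertext() if t.strip()]
    titles.append(runs)

print(titles)
```

The <br> tags split the heading into several text runs, which is why a text() query on that <b> returns more than one string.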

Extracting contents of nested <p> Tags with Beautiful Soup

I am having a hard time making use of Beautiful Soup for my use case. There are many similar, but not always identical, nested p tags whose contents I want to extract. Examples:
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
I need to save the string of the span tag as well as the strings inside the p tag, regardless of styling, plus the referencequote where applicable. So from the examples above I would like to extract:
example = 20, text = 'normal string', reference = []
example = 21, text = 'this text belongs together', reference = []
example = 22, text = 'some text that might continue', reference = ['a reference text']
example = 23, text = 'more text', reference = []
example = 24, text = 'text with two references', reference = ['first', 'second']
What I tried is to collect all items with the "example" class and then loop through the contents of each one's parent.
for span in bs.find_all("span", {"class": "example"}):
    references = []
    for item in span.parent.contents:
        if (type(item) == NavigableString):
            text= item
        elif (item['class'][0]) == 'verse':
            number= int(item.string)
        elif (item['class']) == 'referencequote':
            references.append(item.string)
        else:
            #how to handle <strong> tags?
    verses.append(MyClassObject(n=number, t=text, r=references))
My approach is very error-prone, and there might be even more tags like <strong> or <em> that I am ignoring right now. The get_text() method unfortunately gives back something like '22 some text a reference text that might continue'.
There must be an elegant way to extract this information. Could you give me some ideas for other approaches? Thanks in advance!

Try this.
from simplified_scrapy.core.regex_helper import replaceReg
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
'''
html = replaceReg(html,"<[/]*strong>","") # Pretreatment
doc = SimplifiedDoc(html)
ps = doc.ps
for p in ps:
    text = ''.join(p.spans.nextText())
    text = replaceReg(text,"[()]+","") # Remove ()
    span = p.span # Get first span
    spans = span.getNexts(tag="span").text # Get references
    print (span["class"], span.text, text, spans)
Result:
example 20 normal string []
example 21 this text belongs together []
example 22 some text that might continue ['a reference text']
example 23 more text []
example 24 text with two references. ['first', 'second']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

A different approach I found, without regex and maybe more robust to different spans that might come up:
# bsItem is assumed to be the enclosing <p> tag
for s in bsItem.select('span'):
    if s['class'][0] == 'example':
        # do whatever needed with the content of this span
        s.extract()
    elif s['class'][0] == 'referencequote':
        # do whatever needed with the content of this span
        s.extract()
    # check for all spans with a class where you want the text excluded
# finally get all the text that is left once the spans are removed
text = bsItem.text.replace(' ()', '')
Maybe this approach is of interest to someone reading this :)
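
Another way to look at the problem is as "mixed content": each <p> holds text pieces in between its child tags. The sketch below uses Python's standard-library xml.etree.ElementTree rather than Beautiful Soup, just to show the splitting logic. The function name parse_paragraph and the <root> wrapper are made up for this example, and it assumes the markup is well-formed, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Two paragraphs from the question, wrapped in an assumed <root> element.
HTML = """<root>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
</root>"""


def parse_paragraph(p):
    """Split one <p> into (number, text, references). Hypothetical helper."""
    number, references, pieces = None, [], [p.text or ""]
    for child in p:
        cls = child.get("class")
        if child.tag == "span" and cls == "example":
            number = int(child.text)
        elif child.tag == "span" and cls == "referencequote":
            references.append(child.text)
        else:
            # Styling tags such as <strong> or <em>: keep their text
            pieces.append("".join(child.itertext()))
        pieces.append(child.tail or "")  # text that follows this child
    text = "".join(pieces).replace("()", "")  # drop the now-empty parentheses
    return number, " ".join(text.split()), references


results = [parse_paragraph(p) for p in ET.fromstring(HTML).iter("p")]
for row in results:
    print(row)
```

Any tag that is not one of the two known span classes falls into the else branch, so new styling tags need no extra handling.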

Extracting div attributes based on text

I am using Python 3.6 with bs4 for this task.
My div tags look like this:
<div class="Portfolio" portfolio_no="345">VBHIKE324</div>
<div class="Portfolio" portfolio_no="567">SCHF54TYS</div>
I need to extract portfolio_no, i.e. 345. The value is dynamic and changes across div tags, whereas the text remains the same.
for data in soup.find_all('div', class_='Portfolio', text='VBHIKE324'):
    print(data)
This outputs None, whereas I am looking for output like 345.

Here you go
for data in soup.find_all('div', {'class':'Portfolio'}):
    print(data['portfolio_no'])
If you want the portfolio_no for the one with text VBHIKE324 then you can do something like this
for data in soup.find_all('div', {'class':'Portfolio'}):
    if data.text == 'VBHIKE324':
        print(data['portfolio_no'])
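
If you need several lookups, it may be simpler to build a text-to-attribute mapping once. Here is a minimal sketch of that idea using Python's standard-library xml.etree.ElementTree instead of bs4, with the two divs from the question wrapped in an assumed <root> element. The variable name lookup is just for illustration.

```python
import xml.etree.ElementTree as ET

# The question's two divs, wrapped in an assumed <root> so they parse as XML.
HTML = """<root>
<div class="Portfolio" portfolio_no="345">VBHIKE324</div>
<div class="Portfolio" portfolio_no="567">SCHF54TYS</div>
</root>"""

root = ET.fromstring(HTML)

# Map each div's text to its portfolio_no attribute.
lookup = {
    div.text.strip(): div.get("portfolio_no")
    for div in root.iter("div")
    if div.get("class") == "Portfolio"
}

print(lookup["VBHIKE324"])
```

Note that attribute values come back as strings, so convert with int() if you need a number.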

changing the font color in a computed field using javascript

How can I change the font color of only "Hello" in "Hello World" using JavaScript or some other method?
I tried the following code,
var s= session.getCommonUserName()
s.fontcolor("green")
"Hello"+" "+ s.toUpperCase()
where I tried to change the color of just the username. But it failed.

I wouldn't bother to send down unformatted HTML to the client and then let the client do the JavaScript work. You create a computed field and give it the data type HTML (that keeps HTML you create intact) and use SSJS. So no JS needs to execute at the client side:
var cu = session.getCommonUserName();
return "Hello"+" <span style=\"color : green\">"+ cu.toUpperCase()+"</span>";
Don't forget to cross your t, dot your i and finish a statement with a semicolon :-)

If you want to do it with client-side JavaScript, then you must do something like this:
dojo.style("html_element_id", "color", "green");
So in your case you can have something like:
<p><span id="span1">Hello</span> World.</p>
Or you can do it directly if you don't need to change it with CJS:
<p><span style="color:green">Hello</span> World</p>

One way to do it is to wrap your 'hello' in an HTML span and then change the color of that span.
<span id='myspan'>hello</span> world
JavaScript code:
document.getElementById('myspan').style.color='green';

Went old school on this one...
Say you want to put your formatted text in a div
<div id="test">
</div>
Then you need the following JavaScript to do so:
const div = document.getElementById("test");
const hello = document.createElement("span");
hello.innerHTML = "Hello";
hello.style.color = "green";
div.appendChild(hello);
div.appendChild(document.createTextNode(" world!"));
