obtain en-US title tag text - python-3.x

I'm trying to obtain the text in only the title#lang=en-US elements in an XML file.
This code obtains all the title text for all languages.
entries = root.xpath('//prefix:new-item', namespaces={'prefix': 'http://mynamespace'})
for entry in entries:
    all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
    for title in all_titles:
        print(title.text)
I tried this code to get the title#lang=en-US text, but it does not work.
all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
for title in all_titles:
    test = title.xpath("#lang='en-US'")
    print(test)
How do I obtain the text for only the English-language items?

The expression
//prefix:title[lang('en')]
will select all the English-language titles. Specifically:
title elements that have an xml:lang attribute identifying the title as English, for example <title xml:lang="en-US"> or <title xml:lang="en-GB">
title elements within some container that identifies all the contents as English, for example <section xml:lang="en-US"><title/></section>.
If you specifically want only US English titles, excluding other forms of English, then you can use the predicate [lang('en-US')].
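For reference, here is a minimal lxml sketch of that predicate; the sample document below is made up, with the namespace URI and titles mirroring the placeholders used in the question:

from lxml import etree

# made-up sample document; the namespace URI matches the question's placeholder
doc = b"""<root xmlns="http://mynamespace">
  <new-item>
    <title xml:lang="en-US">US English title</title>
    <title xml:lang="en-GB">British English title</title>
    <title xml:lang="de-DE">German title</title>
  </new-item>
</root>"""

root = etree.fromstring(doc)
ns = {'prefix': 'http://mynamespace'}

# lang('en') would match en, en-US and en-GB; lang('en-US') matches only the US title
for title in root.xpath("//prefix:title[lang('en-US')]", namespaces=ns):
    print(title.text)  # US English title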

Related

Extract text only from the parent tag with Requests-HTML

I'd like to extract text only from the parent tag using Requests-HTML.
If we have html like this
<td>
 <a href="#">There are some links.</a> The text that we are looking for.
</td>
then
html.find('td', first=True).text
results in
>>> There are some links. The text that we are looking for.
You can use an xpath expression, which is directly supported by the library
from requests_html import HTML
doc = """<td>
There are some links/ The text that we are looking for.
<td>"""
html = HTML(html=doc)
# the list will contain all the whitespaces "between" <a> tags
text_list = html.xpath('//td/text()')
# join the list and strip the whitespaces
print(''.join(text_list).strip()) # The text that we are looking for.
The expression //td/text() selects only the direct text children of the td nodes (//td//text() would select all descendant text content, including the link text).
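To see the difference on the snippet above:

print(html.xpath('//td/text()'))   # only the text directly under <td>; the link text is excluded
print(html.xpath('//td//text()'))  # every descendant text node, including the link text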

How to remove <strong> </strong> from the result

I am trying to get the list of college names from an online dataset table (search result). The college names are in between the <strong> and </strong> tags, and I am not sure how to remove those from the result.
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges = geo_table.findAll('strong')
Colleges
I am thinking that the problem is that I am extracting the wrong part, because <strong> only marks the line as bold. Where shall I find the college name?
This is a sample output:
href="?s=IL+MA+PA&p=14.0802+14.0801+14.3901&l=91+92+93+94&id=211440"
To fetch the href value, you need to find_all the <a> tags and then iterate over them, reading the href attribute of each. To fetch the college name, find the <strong> tag inside each link and get its text value.
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges = geo_table.findAll('a')
for college in Colleges:
    print('href : ' + college['href'])
    print('college Name : ' + college.find('strong').text)
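If it helps, here is a self-contained sketch of the same approach; the table fragment below is made up to mimic the structure described in the question:

from bs4 import BeautifulSoup

# made-up fragment standing in for the live results table
sample = """
<table id="ctl00_cphCollegeNavBody_ucResultsMain_tblResults">
  <tr><td><a href="?id=211440"><strong>Example College</strong></a></td></tr>
  <tr><td><a href="?id=211441"><strong>Another College</strong></a></td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
for college in geo_table.findAll('a'):
    print('href : ' + college['href'])                      # link target
    print('college Name : ' + college.find('strong').text)  # name without the <strong> tags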

Netsuite Custom Field with REGEXP_REPLACE to strip HTML code except carriage return

I have a custom field with some HTML code in it:
<h1>A H1 Heading</h1>
<h2>A H2 Heading</h2>
<b>Rich Text</b><br>
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd<br>
<ol><li>numbered list</li><li>fgdsfsd f sa</li></ol>Another List<br>
<ul><li>bulleted</li></ul>
I also have another non-stored field where I want to display the plain text version of the above using REGEXP_REPLACE, while preserving the carriage returns/line breaks, maybe even converting <br> and <br/> to \r\n
However, the patterns etc... seem to be different in NetSuite fields compared to using ?replace(...) in FreeMarker... and I'm terrible with remembering regexp patterns :)
Assuming the HTML text is stored in custitem_htmltext, what expression could I use as the default value of the NetSuite Text Area custom field to display the HTML code above as:
A H1 Heading
A H2 Heading
Rich Text
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd
etc...
I understand the bulleted or numbered lists will look crap.
My current non-working formula is:
REGEXP_REPLACE({custitem_htmltext},'<[^<>]*>','')
I've also tried:
REGEXP_REPLACE({custitem_htmltext},'<[^>]+>','') - didn't work
When you use a Text Area type of custom field and input HTML, NetSuite seems to change the control characters ('<' and '>') to HTML entities ('&lt;' and '&gt;'). You can see this if you input the HTML and then change the field type to Long Text.
If you change both fields to Long Text, and re-input the data and formula, the REGEXP_REPLACE() should work as expected.
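As a quick illustration (plain Python here, just to show the regex behaviour, not NetSuite itself), a tag-stripping pattern only matches while the value still contains literal angle brackets:

import re

raw = '<b>Rich Text</b><br>line two'
encoded = '&lt;b&gt;Rich Text&lt;/b&gt;&lt;br&gt;line two'  # roughly what an entity-encoded field value looks like

strip_tags = re.compile(r'<[^>]+>')
print(strip_tags.sub('', raw))      # Rich Textline two -- the pattern strips literal tags
print(strip_tags.sub('', encoded))  # unchanged -- the pattern never matches the entities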
From what I have learned recently, NetSuite encodes the data by default, converting < to &lt; and > to &gt;.
Try using triple handlebars e.g. {{{custitem_htmltext}}}
https://docs.celigo.com/hc/en-us/articles/360038856752-Handlebars-syntax
This should stop the default behaviour and allow you to use the value in a formula/saved search.

Groovy XmlParser / XmlSlurper: node.localText() position?

I have a follow-up question for this question: Groovy XmlSlurper get value of the node without children.
It explains that in order to get the local inner text of a (HTML) node without recursively get the nested text of potential inner child nodes as well, one has to use #localText() instead of #text().
For instance, a slightly enhanced example from the original question:
<html>
  <body>
    <div>
      Text I would like to get1.
      <a href="http://example.com">extra stuff</a>
      Text I would like to get2.
      <a href="http://example.com">link to example</a>
      Text I would like to get3.
    </div>
    <span>
      <a href="http://example.com">extra stuff</a>
      Text I would like to get2.
      <a href="http://example.com">link to example</a>
      Text I would like to get3.
    </span>
  </body>
</html>
with the solution applied:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0].localText()
would return:
[Text I would like to get1., Text I would like to get2., Text I would like to get3.]
However, when parsing the <span> part in this example
println htmlParsed.body.span[0].localText()
the output is
[Text I would like to get2., Text I would like to get3.]
The problem I am facing now is that it's apparently not possible to pinpoint the location ("between which child nodes") of the texts. I would have expected the second invocation to yield
[, Text I would like to get2., Text I would like to get3.]
This would have made it clear: Position 0 (before child 0) is empty, position 1 (between child 0 and 1) is "Text I would like to get2.", and position 2 (between child 1 and 2) is "Text I would like to get3." But given the API works as it does, there is apparently no way to determine whether the text returned at index 0 is actually positioned at index 0 or at any other index, and the same is true for all the other indices.
I have tried it with both XmlSlurper and XmlParser, yielding the same results.
If I'm not mistaken here, it's as a consequence also impossible to completely recreate an original HTML document using the information from the parser because this "text index" information is lost.
My question is: Is there any way to find out those text positions? An answer requiring me to change the parser would also be acceptable.
UPDATE / SOLUTION:
For further reference, here's Will P's answer, applied to the original code:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlParser(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0].children().collect {it in String ? it : null}
This yields:
[Text I would like to get1., null, Text I would like to get2., null, Text I would like to get3.]
One has to use XmlParser instead of XmlSlurper with node.children().
I don't know tagsoup, and I hope it is not interfering with the solution, but with a pure XmlParser you can get a list from children() which contains the raw strings:
html = '''<html>
<body>
<div>
Text I would like to get1.
<a href="http://example.com">extra stuff</a>
Text I would like to get2.
<a href="http://example.com">link to example</a>
Text I would like to get3.
</div>
<span>
<a href="http://example.com">extra stuff</a>
Text I would like to get2.
<a href="http://example.com">link to example</a>
Text I would like to get3.
</span>
</body>
</html>'''
def root = new XmlParser().parseText html
root.body.div[0].children().with {
    assert get(0).trim() == 'Text I would like to get1.'
    assert get(0).getClass() == String
    assert get(1).name() == 'a'
    assert get(1).getClass() == Node
    assert get(2) == '''
Text I would like to get2.
'''
}

In Watir, how to get the full text, from a portion of text?

I have a portion of HTML that looks similar to:
<table><tbody><tr>
<td><div> Text Goes Here </div></td>
<td> ... rest of table
There are no IDs, no Titles, no descriptors of any kind to easily identify the div that contains the text.
When an error occurs on the page, the error is inserted into the location where "Text Goes Here" is at (no text is present unless an error occurs). Each error contains the word "valid".
Examples: "The form must contain a valid name" or "Invalid date range selected"
I currently have the Watir code looking like this:
if browser.frame(:index => 0).text.includes? "valid"
msg = # need to get full text of message
return msg
else
return true
end
Is there any way to get the full text in a situation like this?
Basically: return the full text of the element that contains the text "valid" ?
Using: Watir 2.0.4 , Webdriver 0.4.1
Given the structure you provided, since divs are so often used I would be inclined to look for the table cell using a regular expression as Dave shows in his answer. Unless you have a lot of nested tables, it is more likely to return just the text you want.
Also, if 'valid' may appear elsewhere, then you might want to provide a slightly larger sample of the text to look for.
browser.cell(:text => /valid/).text
Try this
return browser.div(:text => /valid/).text
or
return browser.table.div(:text => /valid/).text
If 'valid' is not found, it should return nil.
