Using rvest with drake: "external pointer is not valid" error

When I first run the code below, everything is fine. But when I change something in the html_file %>% ... command, for example commenting out tolower(), I get the following error:
Error: target title failed.
diagnose(title)$error$message:
external pointer is not valid
diagnose(title)$error$calls:
1. └─html_file %>% html_nodes("h2") %>% html_text()
Code:
library(rvest)
library(drake)

some_string <- '
<div class="main">
  <h2>A</h2>
  <div class="route">X</div>
</div>
'

html_file <- read_html(some_string)

title <- html_file %>%
  html_nodes("h2") %>%
  html_text()

plan <- drake_plan(
  html_file = read_html(some_string),
  title = html_file %>%
    html_nodes("h2") %>%
    html_text() %>%
    tolower()
)

make(plan)
I found two possible solutions but I'm not enthusiastic about them.
1. Join both steps in drake_plan into one.
2. Use xml2::write_html() and xml2::read_html() as suggested here.
Is there a better way to solve it?
P.S. The issue was already discussed here, on the RStudio forum, and on GitHub.

By default, drake saves targets as RDS files (other options here). So https://github.com/tidyverse/rvest/issues/181#issuecomment-395064636, which you brought up, is exactly the problem: an xml_document holds an external pointer, which does not survive serialization. I like (1) because plain text is compatible with RDS. Speaking broadly, it is up to the user to choose good targets compatible with drake's data storage system. See https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets for a discussion and links to similar issues. But if you want to go with (2), you can return the file path to your HTML file from within a dynamic file.
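For completeness, here is a minimal sketch of option (2) as a dynamic file, assuming a drake version where format = "file" is available: the target writes the document to disk and returns the path (which serializes cleanly), and the downstream target re-reads the file.

library(drake)
library(rvest)
library(xml2)

plan <- drake_plan(
  html_file = target({
      xml2::write_html(read_html(some_string), "page.html")
      "page.html"  # return the path; drake tracks the file itself
    },
    format = "file"
  ),
  title = read_html(html_file) %>%  # html_file is just a path here
    html_nodes("h2") %>%
    html_text() %>%
    tolower()
)

make(plan)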

Related

Checking status/event of icon - Selenium, Python 3.9

With COVID and online schooling, it's hard to keep up with my kid keeping up! I'm only a little familiar with Python and less familiar with web stuff and Selenium, but I wanted to make it easier to check whether he's finishing his assignments each day by writing a script that (1) goes to the class webpages, (2) looks for the 'Overdue' text in the outer HTML, and (3) does something (e.g., prints 'There is an overdue assignment') if the find succeeds.
I've completed part 1 successfully and know how to do part 3, but I can't figure out part 2.
Using Inspect Element, I found what I think is the relevant part of the pages:
<i class="icon-minimize" aria-label="This assignment is overdue" title="This assignment is overdue"></i>
And I've tried the following code variations:
overDue = driver.find_element_by_tag_name("This assignment is overdue")
overDue = driver.find_element_by_name("This assignment is overdue")
And I've tried to copy the CSS Selector and use
overDue = driver.find_element_by_class_name("i.icon-minimize:nth-child(1)")
I also tried XPath, but I forget now exactly what my code was. Something like:
overDue = driver.find_element_by_xpath(//*[., text()="This assignment is overdue"])
But all of these return a NoSuchElement exception. Is there something wrong with my syntax or am I using the wrong methods?
Thanks.
overDue = driver.find_element_by_xpath('//*[@title="This assignment is overdue"]')
The text you see is the title attribute; use @title to validate the value of the title attribute. The XPath syntax is
//tagname[@attribute="attributevalue"]
As for your other attempts: the tag name is i, so find_element_by_tag_name expects
driver.find_element_by_tag_name("i")
There is no name attribute on the element, so you cannot use
driver.find_element_by_name
The class is 'icon-minimize'; you should not include the tag there:
driver.find_element_by_class_name("icon-minimize")
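Putting this together with your part (3), a short sketch; the URL and driver setup are placeholders, and NoSuchElementException is caught so a missing icon just means nothing is overdue:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com/class-page")  # hypothetical class page URL

try:
    # Locate the icon by the title attribute seen in the inspected HTML.
    driver.find_element_by_xpath('//*[@title="This assignment is overdue"]')
    print("There is an overdue assignment")
except NoSuchElementException:
    print("No overdue assignments found")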

Extracting labels from owl ontologies when the label isn't in the ontology but can be found at the URI

Please bear with me as I am new to semantic technologies.
I am trying to use the package rdflib to extract labels from classes in ontologies. However, some ontologies don't contain the labels themselves but instead reference classes from other ontologies by their URIs. How does one extract the labels from the URIs of the external ontologies?
The intuition behind my attempts centers on identifying classes that don't contain labels locally (if that is the right way of putting it) and then "following" their URIs to the external ontologies to extract the labels. However, the way I have implemented it does not work.
import rdflib

g = rdflib.Graph()

# I have no trouble extracting labels from this ontology:
# g.load("http://purl.obolibrary.org/obo/po.owl#")

# However, this ontology contains no labels locally:
g.load("http://www.bioassayontology.org/bao/bao_complete.owl#")

owlClass = rdflib.namespace.OWL.Class
rdfType = rdflib.namespace.RDF.type

for s in g.subjects(predicate=rdfType, object=owlClass):
    # Where the label is present...
    if g.label(s) != '':
        # Do something with the label...
        print(g.label(s))
    # This is what I have added to try to follow the URI to the external ontology.
    elif g.label(s) == '':
        g2 = rdflib.Graph()
        g2.parse(location=s)
        # Do something with the label...
        print(g.label(s))
Am I taking completely the wrong approach? All help is appreciated! Thank you.
I think you can be much more efficient than this. You are doing a web request, a remote ontology download, and a search every time you encounter a URI without a label in http://www.bioassayontology.org/bao/bao_complete.owl, which is most of them, and that is a very large number. So your script will take forever and hammer the web servers delivering those remote ontologies.
Looking at http://www.bioassayontology.org/bao/bao_complete.owl, I see that most of the URIs without labels there are from OBO, and perhaps a couple of other ontologies, but mostly OBO.
What you should do is download OBO once and load it with rdflib. Then, if you run your script above on the joined (union) graph of http://www.bioassayontology.org/bao/bao_complete.owl and OBO, you'll have all of OBO's content at your fingertips, and g.label(s) will find a much higher proportion of labels.
Perhaps there are a couple of other source ontologies providing labels for http://www.bioassayontology.org/bao/bao_complete.owl that you may need as well, but from my quick browsing I see only OBO.
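A minimal sketch of that union-graph idea; obo_labels.owl is a placeholder for whichever OBO file(s) you download, so adjust it to the ontologies that actually provide your labels:

import rdflib

g = rdflib.Graph()
# Parse BAO and the locally downloaded label-providing ontology into the
# same graph, so label lookups see the union of both and stay local.
g.parse("http://www.bioassayontology.org/bao/bao_complete.owl")
g.parse("obo_labels.owl")  # hypothetical local copy of the OBO ontology

for s in g.subjects(predicate=rdflib.namespace.RDF.type,
                    object=rdflib.namespace.OWL.Class):
    label = g.label(s)
    if label:
        print(s, label)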

How to get only reply content (not quoted content) using Selenium

I want to know how to get some content without including the quoted content.
https://forumd.hkgolden.com/view.aspx?type=BW&message=7219211
The following picture is the example
I want to get only "唔提冇咩人記得", but with the following code I get both the reply and the quote:
content = driver_blank.find_element_by_xpath('/html/body/form/div[5]/div/div/div[2]/div[1]/div[5]/table[24]/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[1]/td/div')
print(content.text)
This is the HTML containing the content I want to capture:
<div class="ContentGrid">
<blockquote><div style="color: #0000A0;"><blockquote><div style="color: #0000A0;">腦魔都俾你地bam咗啦<img data-icons=":~(" src="/faces/cry.gif" alt=":~("></div></blockquote><br>珠。。。。。</div></blockquote><br>唔提冇咩人記得
<br><br><br>
</div>
Can anyone help me? Thanks~~~
Can this be solved with XPath's not(starts-with()) method?
Use the lines of code below to extract only the text node's content:
element = driver.find_element_by_css_selector('div.ContentGrid')
text = driver.execute_script("return arguments[0].childNodes[3].textContent", element)
print(text)
Selenium won't allow you to locate a text node directly, though you can use some JavaScript code to make it happen.
Code explanation:
arguments[0].childNodes[3] is the fourth child node of your context node, div.ContentGrid. The preceding child nodes are a whitespace text node, the blockquote, and a br (tried with the HTML code you shared), which is why index 3 is used.
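If you'd rather not hard-code the index, here is a variant under the same assumptions that joins all direct text-node children, which skips the blockquote and the br elements by construction:

element = driver.find_element_by_css_selector('div.ContentGrid')
text = driver.execute_script("""
    // Keep only the direct text-node children of div.ContentGrid,
    // dropping the quoted blockquote and the <br> elements.
    return Array.from(arguments[0].childNodes)
        .filter(node => node.nodeType === Node.TEXT_NODE)
        .map(node => node.textContent)
        .join('')
        .trim();
""", element)
print(text)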

Extracting text and ignoring "b" tag

Trying to extract the text from the following HTML using:
response.css("span[class = 'summary content']::text").extract()
<span class="summary content">With its multiple cleaning modes, the <b>LG Hom-Bot Square</b> gives the user a terrific amount of control over how it operates. Its remote is convenient, easy to use, and well-designed.</span>
But this gives me:
Out[1]:
['With its multiple cleaning modes, the ',
' gives the user a terrific amount of control over how it operates. Its remote is convenient, easy to use, and well-designed.']
with "LG Hom-Bot Square" missing.
How can I just ignore the b tag?
I usually work around this using a join. Note the added space before ::text: it selects the text nodes of all descendants, so the text inside <b> is included as well (your original selector only returned the span's own text nodes):
summary = response.css("span[class = 'summary content'] ::text").extract()
"".join(summary)
In that case you are not so much ignoring <b> as merging its text back in, but the result will be what you want.
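If you prefer a single expression, XPath's string() function also concatenates all descendant text nodes; a sketch against the HTML above:

summary = response.xpath('string(//span[@class="summary content"])').get()
# "With its multiple cleaning modes, the LG Hom-Bot Square gives the user ..."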

Is there a way to get ProxyHTMLURLMap to match more than once per tag attribute?

I have a problem that seems to be caused by resources being called with img tags that look like this:
<img
class="alignnone size-full"
title="some title"
src="https://new.url.com/some.jpeg" alt="" width="612" height="408"
srcset="https://new.url.com/some.jpeg 612w, https://old.url.com/some-300x200.jpg 300w"
sizes="(max-width: 612px) 100vw, 612px">
ProxyHTMLURLMap successfully replaces the first URL within the attribute "srcset" but never more than the first.
I don't see anything in the manual that addresses this; any help is much appreciated.
I am interested in any open source Linux compatible solutions even if outside Apache.
Thanks!
I found a limited workaround for this issue.
Since each ProxyHTMLURLMap can replace only one matched occurrence, we need to add more directives like these:
ProxyHTMLURLMap "https://old.url.com/" "https://new.url.com/" Rl
ProxyHTMLURLMap " https://old.url.com/" " https://new.url.com/" Rl
ProxyHTMLURLMap ", https://old.url.com/" ", https://new.url.com/" Rl
ProxyHTMLURLMap "w, https://old.url.com/" "w, https://new.url.com/" Rl
These four directives can replace up to four instances of https://old.url.com.
The R flag is needed to process the patterns as regular expressions.
The l flag is needed to avoid stopping after the first (second, third) match occurs.
It seems to work for me.
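Since you said you're open to other solutions: a sketch using Apache's mod_substitute instead, which replaces every occurrence in the response body rather than one per attribute. This assumes the module is enabled and that responses reach the filter uncompressed as text/html:

<Location "/">
    AddOutputFilterByType SUBSTITUTE text/html
    # n = treat the pattern as a fixed string rather than a regex
    Substitute "s|https://old.url.com/|https://new.url.com/|n"
</Location>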
