html_nodes("table") returns {xml_nodeset (0)} - rvest

I am trying to scrape a table from the following webpage.
https://s3-us-west-2.amazonaws.com/campaign-zero-use-of-force/index.html
I am using the following code:
library(rvest)
page <- "https://s3-us-west-2.amazonaws.com/campaign-zero-use-of-force/index.html"
pagina <- read_html(page)
pagina %>% html_nodes("table")
I expected the script to return a table on the webpage but it returned {xml_nodeset (0)}. I don't know why it's not seeing a table on the webpage. Thank you.

Related

Solving recaptcha with anticaptcha using Python

I am trying to solve a reCAPTCHA using the anti-captcha API, but I am unable to figure out how to submit the response.
Here is what I am trying to do:
from python_anticaptcha import AnticaptchaClient, NoCaptchaTaskProxylessTask
driver.switch_to.frame(driver.find_element_by_xpath('//iframe'))
site_key = '6Ldd2doaAAAAAFhvJxqgQ0OKnYEld82b9FKDBnRE'
api_key = 'api_keys'
url = 'https://coinsniper.net/register'
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
job.join()
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML='{}';".format(job.get_solution_response()))
driver.refresh()
The above snippet only refreshes the same page; it does not redirect to the input URL.
Then I saw that there is a variable in a script on the same page, and I tried executing that variable as well to submit the form, like this:
driver.execute_script("var captchaSubmitEl = document.getElementById('captcha-submit');")
driver.refresh()
That also fails. The webpage is here.
My second try was with this URL, which loads the reCAPTCHA of the same page.
This time I tried with a different site_key and url, which were extracted as below:
url_key = driver.find_element_by_xpath('//*[@id="captcha-submit"]/div/div/iframe').get_attribute('src')
site_key = re.search('k=([^&]+)',url_key).group(1)
url = 'https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMABhLJ2Rn0V78AZ5gFAg%3D%3D&hash=7F23E8F8FB0B33347C06D1347938C1&cid=.z5o-mMJuvaX_CLxOMBRebJsY6NgZvUv87bLMft~A_st0Fkvl~3jcaTr1R64GU7xO.WZFYNq5P3.UNuLWFa32.Pe6GGuIV7Y5w-RaMu0K3&t=fe&referer=https%3A%2F%2Fcoinsniper.net%2Fregister&s=33682'
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
job.join()
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML='{}';".format(job.get_solution_response()))
driver.refresh()
Neither of the above approaches works, and I don't know why. I have been looking for a solution for the past three days and have not found a single one that works in my case.
Can anyone look into this and let me know what is wrong with the code?
After you receive a response from anti-captcha, you should set it on this element:
<input type="hidden" class="mtcaptcha-verifiedtoken" name="mtcaptcha-verifiedtoken" id="mtcaptcha-verifiedtoken-1" readonly="readonly" value="">
Fill in all the other fields in the UI and click the Register button.
You should not refresh the page.
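As a rough sketch of that flow (the form field names and the Register button locator below are assumptions for illustration, not taken from the actual page), the token can be injected with execute_script and the form submitted without a refresh:
token = job.get_solution_response()
# Set the anti-captcha token on the hidden mtcaptcha field shown above
driver.execute_script(
    "document.getElementById('mtcaptcha-verifiedtoken-1').value = arguments[0];", token)
# Fill in the remaining form fields (names here are guesses for illustration)
driver.find_element_by_name("username").send_keys("my_username")
driver.find_element_by_name("email").send_keys("me@example.com")
driver.find_element_by_name("password").send_keys("my_password")
# Click the Register button instead of refreshing the page
driver.find_element_by_xpath("//button[contains(., 'Register')]").click()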

Selenium Stale Element Reference Errors (Seems Random)?

I know there have been several questions asked regarding stale elements, but I can't seem to resolve these.
My site is private, so unfortunately I can't share it, but it seems to always throw the error somewhere within the for-loop below. This loop is meant to get the text of each row in a table (the number of rows varies). I've added WebDriverWait commands, and a very similar for-loop earlier in my code does the same thing on another table on the site and works perfectly. I've also tried moving the link-click command and the table, body, and tableText definitions inside the loop so they are redefined at every iteration.
Once the code stops and the error message displays (stale element reference: element is not attached to the page document (Session info: chrome=89.0.4389.128)), if I manually run everything line-by-line, it all seems to work and correctly grabs the text.
Any ideas? Thanks!
link = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "*link address*")))
link.click()
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "TableId")))
body = table.find_element(By.CLASS_NAME, "*table body class*")
tableText = body.find_elements(By.TAG_NAME, "tr")
rows = len(tableText)
approvedSigs = [None]*rows
for i in range(1, rows+1):
    approvedSigs[i-1] = (tableText[i-1].text)
    approvedSigs[i-1] = approvedSigs[i-1].lstrip()
    approvedSigs[i-1] = approvedSigs[i-1][9:]
    approvedSigs[i-1] = approvedSigs[i-1].replace("\n"," ")
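One common way to narrow down this kind of intermittent staleness (just a sketch, reusing the same TableId and the placeholder class from the question) is to re-locate the table and its rows on every iteration and grab each row's text immediately, so a DOM refresh between iterations cannot leave you holding a detached element:
approvedSigs = []
rows = len(tableText)
for i in range(rows):
    # Re-query the table, body and rows on every pass instead of reusing old WebElements
    table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "TableId")))
    body = table.find_element(By.CLASS_NAME, "*table body class*")
    row_text = body.find_elements(By.TAG_NAME, "tr")[i].text
    approvedSigs.append(row_text.lstrip()[9:].replace("\n", " "))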

Finding Data with cheerio inside classes

I'm trying to scrape a webpage, but I'm unable to scrape a certain part that is inside a <ul>, then a <li>, then a <b>. So far I have tried these lines of code with no luck:
const totalSurfaceHelper = $('.section-icon-features').find('.icon-feature').find('.icon-f icon-f-stotal');
dwelling.TotalSurface = totalSurfaceHelper.find('b').text();
dwelling.totalSurface = $('.icon-f icon-f-stotal .icon-feature > b').text();

Web Scraping Location Data with BeautifulSoup

I am trying to scrape a webpage for address data (the highlighted street address shown in this image: 1) using the find() function of the BeautifulSoup library. Most online tutorials only provide examples where the data can be easily pinpointed to a certain class; however, on this particular site the street address is an element nested within a larger class="dataCol col02 inlineEditWrite", and I'm not sure how to get at it with the find() function.
What would be the arguments to find() to get the street address in this example? Any help would be greatly appreciated.
This should get you started: it finds every div element with the class "dataCol col02 inlineEditWrite", then searches for td elements within each one and prints the first td element's text:
divTags = soup.find_all("div", {"class": "dataCol col02 inlineEditWrite"})
for divTag in divTags:
    tdTags = divTag.find_all("td")
    print(tdTags[0].text)
The above example assumes you want to print the first td element from every div with the class "dataCol col02 inlineEditWrite"; otherwise:
divTag = soup.find("div", {"class": "dataCol col02 inlineEditWrite"})
tdTags = divTag.find_all("td")
print(tdTags[0].text)
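If it helps, the same lookup can also be written as a single CSS selector with select_one (a sketch; it assumes the address sits in the first td of that div):
# Grab the first td inside the div whose classes are dataCol, col02 and inlineEditWrite
address_cell = soup.select_one("div.dataCol.col02.inlineEditWrite td")
if address_cell is not None:
    print(address_cell.get_text(strip=True))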

Selenium scrapes only one result and ignores other related results

I am new to Selenium. Searching a website, I get 10 results per page. The results are shown as list items (li tags) on the page, and each item contains the same attributes. When my conditions are met, I go to another related web page and get the desired content. However, as my code keeps looping over the list items, it fails to find the same attributes for the remaining ones. Here is my code:
p_url = "https://www.linkedin.com/vsearch/f?keywords=BARCO%2BNV%2Bkortrijk&pt=people&page_num=5"
driver.get(p_url)
time.sleep(5)
results = driver.find_element_by_id("results-container")
employees = results.find_elements_by_tag_name('li')
#emp_list = []
#for i in range(len(employees)):
# emp_list.append(employees[i])
for emp in employees:
    try:
        main_emp = emp.find_element_by_css_selector("a.title.main-headline")
        name = emp.find_element_by_css_selector("a.title.main-headline").text
        href = main_emp.get_attribute("href")
        if name != "LinkedIn Member":
            location = emp.find_element_by_class_name("demographic").text
            href = main_emp.get_attribute("href")
            print(href)
            print(location)
            driver.get(href)
            exp = driver.find_element_by_id("background-experience")
            amkk = exp.find_elements_by_class_name("editable-item")
            for amk in amkk:
                him = amk.find_element_by_tag_name("header").text
                him2 = amk.find_element_by_class_name("experience-date-locale").text
                if '\n' in him:
                    a = him.split('\n')
                    print(a[0])
                    print(a[1])
                    print(him2)
    except Exception as exc:
        print(exc)
        continue
In this code, the line main_emp = emp.find_element_by_css_selector("a.title.main-headline") stops working after it works the first time. As a result I get the error Message: stale element reference: element is not attached to the page document
From Stack Overflow questions I saw some say the content is removed from the DOM structure, and in another post someone suggested filling a list with the results. Here is what I have tried:
emp_list = []
for i in range(len(employees)):
    emp_list.append(employees[i])
However, it also did not work out.
How can I overcome this?
The selector you are using is wrong. You are getting the results using the results-container id. This works fine, but collecting the elements from it does not: it returns more elements than just the employees (I'm not quite sure why).
If you change your selectors to this single selector, you will get just the employees and no other unwanted elements.
employees = results.find_elements_by_css_selector("ol[id='results']>li")
Edit
Since you are opening the employee pages and losing the list of elements, you might want to try opening each employee in a new tab, performing your actions there, and closing the tab afterwards.
Example:
for emp in employees:
    try:
        main_emp = emp.find_element_by_css_selector("a.title.main-headline")
        # Do stuff you need...
        # Open employee in new tab (make sure Keys is imported)
        main_emp.send_keys(Keys.CONTROL + 't')
        # Focus on new tab
        driver.switch_to_window(driver.window_handles[1])
        # Do stuff inside the employee page
        # Close the tab you opened
        driver.close()
        # Switch back to the first tab
        driver.switch_to_window(driver.window_handles[0])
    except Exception as exc:
        # Keep the exception handling from your original loop
        print(exc)
        continue
Note: For OSX you should use main_emp.send_keys(Keys.COMMAND + 't')
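An alternative sketch (not part of the answer above) that sidesteps staleness entirely: collect every profile href from the results list first, then visit them one by one with driver.get, so nothing from the results page is reused after navigating away:
# Gather all profile links up front, before any navigation happens
hrefs = []
for emp in results.find_elements_by_css_selector("ol[id='results']>li"):
    try:
        link = emp.find_element_by_css_selector("a.title.main-headline")
        if link.text != "LinkedIn Member":
            hrefs.append(link.get_attribute("href"))
    except Exception as exc:
        print(exc)
# Visit each profile separately; the results page elements are no longer needed
for href in hrefs:
    driver.get(href)
    # ...scrape the profile page here, as in the original loop...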
