Finding Data with cheerio inside classes - node.js

I'm trying to scrape a webpage, but I'm unable to scrape a certain part that is inside a <ul>, then a <li>, then a <b>. So far I have tried these lines of code with no luck:
const totalSurfaceHelper = $('.section-icon-features').find('.icon-feature').find('.icon-f icon-f-stotal');
dwelling.TotalSurface = totalSurfaceHelper.find('b').text();
dwelling.totalSurface = $('.icon-f icon-f-stotal .icon-feature > b').text();

Solving recaptcha with anticaptcha using Python

I am trying to solve a reCAPTCHA using the anti-captcha API, but I am unable to figure out how to submit the response.
Here is what I am trying to do:
driver.switch_to.frame(driver.find_element_by_xpath('//iframe'))
site_key = '6Ldd2doaAAAAAFhvJxqgQ0OKnYEld82b9FKDBnRE'
api_key = 'api_keys'
url = 'https://coinsniper.net/register'
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
job.join()
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML='{}';".format(job.get_solution_response()))
driver.refresh()
The above code snippet only refreshes the same page instead of redirecting to the input URL.
Then I saw that there is a variable in a script on the same page, and I tried to execute that variable too, to submit the form like this:
driver.execute_script("var captchaSubmitEl = document.getElementById('captcha-submit');")
driver.refresh()
This also fails. The webpage is here.
On my second try, I used this URL, which loads the reCAPTCHA of the same page, but this time with a different site_key and url, extracted as below:
url_key = driver.find_element_by_xpath('//*[@id="captcha-submit"]/div/div/iframe').get_attribute('src')
site_key = re.search('k=([^&]+)',url_key).group(1)
url = 'https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMABhLJ2Rn0V78AZ5gFAg%3D%3D&hash=7F23E8F8FB0B33347C06D1347938C1&cid=.z5o-mMJuvaX_CLxOMBRebJsY6NgZvUv87bLMft~A_st0Fkvl~3jcaTr1R64GU7xO.WZFYNq5P3.UNuLWFa32.Pe6GGuIV7Y5w-RaMu0K3&t=fe&referer=https%3A%2F%2Fcoinsniper.net%2Fregister&s=33682'
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
job.join()
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML='{}';".format(job.get_solution_response()))
driver.refresh()
Neither of the above approaches works, and I don't know why. I have been searching for a solution for the past 3 days and have not found a single one that works in my case.
Can anyone look into this and let me know what is wrong with this code?
After you receive a response from anti-captcha, you should set it on this element:
<input type="hidden" class="mtcaptcha-verifiedtoken" name="mtcaptcha-verifiedtoken" id="mtcaptcha-verifiedtoken-1" readonly="readonly" value="">
Fill in all the other fields in the UI and click the Register button.
You should not refresh the page.
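A minimal sketch of that flow, assuming the hidden input keeps the id shown in the HTML above (the helper name set_token_script is made up here for illustration):

```python
def set_token_script(token):
    # Build the JS snippet that writes the solved token into the
    # hidden mtcaptcha input (id taken from the HTML element above).
    return (
        "document.getElementById('mtcaptcha-verifiedtoken-1')"
        ".value = '{}';".format(token)
    )

print(set_token_script('TOKEN123'))
```

You would then run it with driver.execute_script(set_token_script(job.get_solution_response())), fill in the remaining fields, and click Register — without calling driver.refresh().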

Returning hrefs using Selenium

I'm working with html loosely structured like this:
...
<div class='TL-dsdf2323...'>
<a href='/link1/'>
(more stuff)
</a>
<a href='/link2/'>
(more stuff)
</a>
</div>
...
I want to be able to return all of the hrefs contained within this particular div. So far it seems like I am able to locate the proper div
div = driver.find_elements_by_xpath("//div[starts-with(@class, 'TL')]")
This is where I'm hitting a wall though. I've gone through other posts and tried several options such as
links = div.find_elements_by_xpath("//a[starts-with(@href,'/link')]")
and
div.find_element_by_partial_link_text('/link')
but I keep returning empty lists. Any idea where I'm going wrong here?
Edit:
Here's a picture of the actual HTML. I simplified the div class name from ThumbnailLayout to TL and the href /listing to /link.
As @mr_mooo_cow pointed out in a comment, a delay was needed in order to extract the links. Here is the final working code:
a_tags = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[starts-with(@href, '/listing')]")))
links = []
for link in a_tags:
    links.append(link.get_attribute('href'))
Can you try something like this:
links = driver.find_elements_by_xpath("//div[starts-with(@class, 'TL')]//a[starts-with(@href, '/link')]")
This anchors the a search under the matching div in a single expression, instead of calling find_elements on your div variable (which is a list, not an element). I haven't tested this so let me know if it doesn't work.
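One more scoping detail: an XPath that starts with // searches the whole document even when called on an element, while a relative .// query stays inside it. A minimal sketch of that XPath logic using lxml, with sample HTML adapted from the question (no live page is available here):

```python
from lxml import html

snippet = """
<div class="TL-dsdf2323">
  <a href="/link1/">first</a>
  <a href="/link2/">second</a>
</div>
"""
tree = html.fromstring(snippet)
# Take the first matching div, then keep the query relative to it
# with ".//" so it does not escape back to the document root.
div = tree.xpath("//div[starts-with(@class, 'TL')]")[0]
hrefs = [a.get("href") for a in div.xpath(".//a[starts-with(@href, '/link')]")]
print(hrefs)  # ['/link1/', '/link2/']
```

In Selenium the same idea is div.find_elements_by_xpath(".//a[starts-with(@href, '/link')]") after grabbing a single div element.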

I am doing some web scraping and during that I've run into an error which states that "'NoneType' object is not subscriptable"

I am using bs4 for web scraping. This is the HTML code that I am scraping.
items is a list of multiple div tags, i.e. <div class="list_item odd" itemscope=""...>,
and the tag that I really want from each element in items is:
<p class="cert-runtime-genre">
<img title="R" alt="Certificate R" class="absmiddle certimage" src="https://m...>
<time datetime="PT119M">119 min</time>
-
<span>Drama</span>
<span class="ghost">|</span>
<span>War</span>
</p>
The main class of this list is saved in items. From that I want to scrape the img tag and then access the title attribute, so that I can save all the certifications of the movies (R, PG, etc.) in a database. But when I apply a loop to items, it gives an error that the item is not subscriptable. I tried list comprehensions, a simple for loop, and calling the items elements through a predefined integer array; nothing works and it still gives the same error. (items is not None and is subscriptable, i.e. it is a list.) But when I call it with a direct integer it works fine, i.e. items[0] or items[1] etc., and gives the correct result for each corresponding element in the items list. The error line is below:
cert = [item.find(class_ = "absmiddle certimage")["title"] for item in items] or
cert = [item.find("img",{"class": "absmiddle certimage"})["title"] for item in items]
and this is what works fine: cert = items[0].find(class_ = "absmiddle certimage")["title"]
Any suggestion will be appreciated.
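No answer is attached here, but the usual cause of that error is that at least one element in items has no matching img tag, so find() returns None and the ["title"] subscript fails on it (direct indexes like items[0] just happen to hit items that do have the tag). A minimal sketch of a guard, with sample HTML adapted from the snippet above:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="list_item odd"><p class="cert-runtime-genre">
  <img title="R" alt="Certificate R" class="absmiddle certimage" src="x.png">
  <time datetime="PT119M">119 min</time>
</p></div>
<div class="list_item even"><p class="cert-runtime-genre">
  <time datetime="PT90M">90 min</time>
</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = soup.find_all("div", class_="list_item")

certs = []
for item in items:
    img = item.find("img", class_="certimage")
    # find() returns None when the tag is missing, so guard before ["title"]
    certs.append(img["title"] if img is not None else None)
print(certs)  # ['R', None]
```

The second div above deliberately has no img tag, which is exactly the case that crashes an unguarded list comprehension.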

Scrape info from a span title

My html looks like this:
<h3>Current Guide Price <span title="92"> 92
</span></h3>
The info I am trying to get is the 92.
Here is another HTML page where I need to get the same data:
<h3>Current Guide Price <span title="4,161"> 4,161
</span></h3>
I would need to get the 4,161 from this page.
here is the link to the page for reference:
http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=1613
What I have tried:
/h3/span[@title="92"]@title
/h3/span[@title="92"]/text()
/div[@class="stats"]/h3/span[@title="4,161"]@title
Since the info I need is in the actual span tag, it is hard to grab the data in a dynamic way that I can use across many different pages.
from lxml import html
import requests

baseUrl = 'http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=2355'
page = requests.get(baseUrl)
tree = html.fromstring(page.content)
price = tree.xpath('//h3/span')
price2 = tree.xpath('//h3/span/@title')
for p in price:
    print(p.text.strip())
for p2 in price2:
    print(p2)
The output is 92 in both cases.
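That last expression is already the dynamic one: //h3/span/@title never hard-codes the value, so the same query works for both pages. A small self-contained check with lxml, using inline HTML modeled on the two snippets above instead of live requests:

```python
from lxml import html

# Inline stand-ins for the two pages quoted in the question
pages = [
    '<h3>Current Guide Price <span title="92"> 92</span></h3>',
    '<h3>Current Guide Price <span title="4,161"> 4,161</span></h3>',
]
prices = []
for content in pages:
    tree = html.fromstring(content)
    # Same expression on every page: no hard-coded title value
    prices.append(tree.xpath('//h3/span/@title')[0])
print(prices)  # ['92', '4,161']
```

The @title attribute is also the safer source than the text node here, since the text carries stray whitespace.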

How to get href values from a class - Python - Selenium

<a class="link__f5415c25" href="/profiles/people/1515754-andrea-jung" title="Andrea Jung">
I have the above HTML element and tried using
driver.find_elements_by_class_name('link__f5415c25')
and
driver.get_attribute('href')
but it doesn't work at all. I expected to extract the values in href.
How can I do that? Thanks!
You have to first locate the element, then retrieve the attribute href, like so:
href = driver.find_element_by_class_name('link__f5415c25').get_attribute('href')
if there are multiple links associated with that class name, you can try something like:
eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)
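The collection loop above can also be written as a list comprehension. A runnable sketch using hypothetical stand-in objects (FakeElement and the second href are made up here, since no live driver is available):

```python
class FakeElement:
    """Stand-in for a Selenium WebElement (illustration only)."""
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None

# eList would come from driver.find_elements_by_class_name('link__f5415c25')
eList = [
    FakeElement('/profiles/people/1515754-andrea-jung'),
    FakeElement('/profiles/people/0000000-some-other'),
]
hrefList = [e.get_attribute('href') for e in eList]
print(hrefList)
```

With real WebElements the comprehension line is identical; only the source of eList changes.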
