Scrapy + Splash: not rendering full-page JavaScript data - python-3.x

I am just exploring Scrapy with Splash, and I am trying to scrape all the product (pants) data, with product id, name and price, from the e-commerce site Gap. However, I don't see all of the dynamically loaded product data when I check the page in the Splash web UI (only 16 items load for every request, and I have no clue why).
I tried the following options, but with no luck:
Increasing the wait time up to 20 seconds
Starting the Docker container with "--disable-private-mode"
Using a Lua script for page scrolling
Using splash:set_viewport_full()
lua_script2 = """ function main(splash)
local num_scrolls = 10
local scroll_delay = 2.0
local scroll_to = splash:jsfunc("window.scrollTo")
local get_body_height = splash:jsfunc(
"function() {return document.body.scrollHeight;}"
)
assert(splash:go(splash.args.url))
splash:wait(splash.args.wait)
for _ = 1, num_scrolls do
scroll_to(0, get_body_height())
splash:wait(scroll_delay)
end
return splash:html()
end"""
yield SplashRequest(
    url,
    self.parse_product_contents,
    endpoint='execute',
    args={
        'lua_source': lua_script2,
        'wait': 5,
    }
)
Can anyone please shed some light on this behavior?
P.S.: I am using the Scrapy framework, and I am able to parse the product information (item id, name and price) from render.html, but render.html contains information for only 16 items.

I updated the script to the following:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0
    splash:set_viewport_size(1980, 8020)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    -- splash:set_viewport_full()
    splash:wait(10)
    splash:runjs("jQuery('span.icon-x').click();")
    splash:wait(1)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    splash:wait(30)
    return {
        png = splash:png(),
        html = splash:html(),
        har = splash:har()
    }
end
I ran it in my local Splash instance; the PNG output doesn't come out right, but the HTML does contain the last product.
The only issue was that the page won't scroll while the email-subscribe popup is showing, so I added a line to close it.
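For completeness, here is a minimal sketch (not the exact spider) of how the updated script can be wired back into the Scrapy side, assuming the scrapy-splash middleware is enabled as before and the updated Lua source is kept in lua_script2; when a Lua script returns a table, scrapy-splash exposes the decoded JSON on response.data, so the scrolled HTML can be read from the 'html' key.
import scrapy
from scrapy_splash import SplashRequest

class GapPantsSpider(scrapy.Spider):  # hypothetical spider name
    name = 'gap_pants'
    start_url = 'https://www.gap.com/...'  # placeholder listing URL

    def start_requests(self):
        yield SplashRequest(
            self.start_url,
            self.parse_product_contents,
            endpoint='execute',
            args={'lua_source': lua_script2, 'wait': 5},  # lua_script2 holds the updated script above
        )

    def parse_product_contents(self, response):
        # With endpoint='execute' and a Lua table return value, scrapy-splash
        # yields a SplashJsonResponse; the returned table is response.data.
        html_after_scroll = response.data['html']
        selector = scrapy.Selector(text=html_after_scroll)
        # ... extract product id, name and price from `selector` here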

Related

Selenium anticaptcha submission

I am trying to fill in a reCAPTCHA using the anti-captcha API, but I am unable to figure out how to submit the response.
Here is what I am trying to do:
driver.switch_to.frame(driver.find_element_by_xpath('//iframe'))
url_key = driver.find_element_by_xpath('//*[@id="captcha-submit"]/div/div/iframe').get_attribute('src')
#site_key = re.search('k=([^&]+)',url_key).group(1)
site_key = '6Ldd2doaAAAAAFhvJxqgQ0OKnYEld82b9FKDBnRE'
api_key = 'api_keys'
url = driver.current_url
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
job.join()
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML='{}';".format(job.get_solution_response()))
driver.refresh()
The above code snippet only refreshes the same page and does not redirect to the input URL.
Then I saw that there is a variable in a script on the same page, and I tried to execute that variable as well to submit the form:
driver.execute_script("var captchaSubmitEl = document.getElementById('captcha-submit');")
driver.refresh()
This also fails. The webpage is here.
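One hedged idea (a sketch only, not verified against that page): after injecting the solved token, submit the form that contains the widget via JavaScript instead of calling driver.refresh(), since a refresh discards the injected g-recaptcha-response value. The generic 'form' selector below is an assumption about the page structure.
# Sketch: inject the anti-captcha solution, then submit the surrounding form.
# The 'form' selector is an assumption; adjust it to the page's actual markup.
token = job.get_solution_response()
driver.execute_script(
    "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];",
    token,
)
driver.execute_script(
    "var form = document.querySelector('form'); if (form) { form.submit(); }"
)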

Algolia search python client to upload indexes behind a firewall

I am building a search with the Algolia Search product for our Moodle site.
To update the index on Algolia's side we want to use an automated process (uploading the index data).
I decided to start with the Python client (Python 3.9.x). Since we are working inside a corporate network, we are behind firewalls and there is a problem accessing Algolia's servers.
I'm testing two methods:
Searching within already existing indices: index.search
Updating all indexes: index.replace_all_objects
Error message: 'Unreachable hosts'
Here are my code snippets:
# Search:
searchAppID = constants.ALGOLIA_APP_ID
searchKeyHash = constants.ALGOLIA_SEARCH_KEY
config = SearchConfig(searchAppID, searchKeyHash)
config.connect_timeout = 2
config.read_timeout = 5
config.write_timeout = 30
client = SearchClient.create_with_config(config)
index = client.init_index('myIndex')
result = index.search('SearchString')

# Function to update the index using a JSON file:
def replace_IdxObj(client, index, fileName):
    try:
        if index.exists():
            with open(fileName, encoding="utf8") as f:
                data = json.load(f)
            result = index.replace_all_objects(data, {'autoGenerateObjectIDIfNotExist': True})
            return result
        else:
            print('No Index found! Cancel operation... \n')
            return None
    except Exception as err:
        print("Error: " + str(err))
When I test my code outside of the company's network, everything works.
I also have an SSL certificate that can be used to work with external resources, so I used it in the following test (run from the company's network, behind the firewall):
response = requests.get('https://google.com/', verify=("C:\\Program Files\\Common Files\\SSL\\certs\\cert.cer"))
print(response)
This test works just fine (I get a 200 response).
Please advise: is there any way to make the Python process used by Algolia's Python client use the certificate that I have? Any alternatives?
Many thanks!
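A minimal sketch of one possible workaround, under the assumption that the Algolia Python client (v3.x) performs its HTTP calls through the requests library: requests honors the REQUESTS_CA_BUNDLE environment variable, so pointing it at the corporate certificate before creating the client may let the existing code reach Algolia's hosts.
import os

# Assumption: the Algolia client's transport uses `requests`, which reads this
# environment variable to decide which CA bundle to trust.
os.environ['REQUESTS_CA_BUNDLE'] = r"C:\Program Files\Common Files\SSL\certs\cert.cer"

from algoliasearch.search_client import SearchClient

client = SearchClient.create(searchAppID, searchKeyHash)
index = client.init_index('myIndex')
result = index.search('SearchString')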

Node has gone away in WWW::Mechanize::Chrome

WWW::Mechanize::Chrome somehow loses the node when calling a different method, such as ->find_all_links():
No node with given id found
-32000 at /usr/local/share/perl/5.20.2/Chrome/DevToolsProtocol/Target.pm line 490
Node 447 has gone away in the meantime, could not resolve at /usr/local/share/perl/5.20.2/WWW/Mechanize/Chrome/Node.pm line 206.
No node with given id found
When I do $mech->get($url) again before fetching all the links, it works, but I don't want to re-get URLs every time we call a method, and we need to submit forms later on anyway.
my $mech = WWW::Mechanize::Chrome->new(
    headless         => 1,
    launch_exe       => '/usr/bin/google-chrome',
    launch_arg       => ["--headless", "--no-sandbox"],
    separate_session => 1,
);

&parse_content;

sub parse_content {
    $mech->get($url);
    my $content = $mech->content;
    # parse some content
}

# $mech->get($url);
# fetching the links would work when getting the URL again,
# but this is not the way it should work
my @links = $mech->find_all_links();
Can anybody help?

Login to a website then open it in browser

I am trying to write Python 3 code that logs in to a website and then opens it in a web browser, so that I can take a screenshot of it.
Looking online, I found that I could do webbrowser.open('example.com').
This opens the website, but it cannot log in.
Then I found that it is possible to log in to a website using the requests library, or urllib.
The problem with both is that they do not seem to provide an option for opening a web page.
So how is it possible to log in to a web page and then display it, so that a screenshot of that page can be taken?
Thanks
Have you considered Selenium? It drives a browser natively, as a user would, and its Python client is pretty easy to use.
Here is one of my latest works with Selenium. It is a script that scrapes multiple pages from a certain website and saves their data into a CSV file:
import os
import time
import csv
from selenium import webdriver

cols = [
    'ies', 'campus', 'curso', 'grau_turno', 'modalidade',
    'classificacao', 'nome', 'inscricao', 'nota'
]

codigos = [
    96518, 96519, 96520, 96521, 96522, 96523, 96524, 96525, 96527, 96528
]

if not os.path.exists('arquivos_csv'):
    os.makedirs('arquivos_csv')

options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values.automatic_downloads': 1,
    'profile.managed_default_content_settings.images': 2
}
options.add_experimental_option('prefs', prefs)

# Here you choose a webdriver ("the browser")
browser = webdriver.Chrome('chromedriver', chrome_options=options)

for codigo in codigos:
    time.sleep(0.1)

    # Here is where I set the URL
    browser.get(f'http://www.sisu.mec.gov.br/selecionados?co_oferta={codigo}')

    with open(f'arquivos_csv/sisu_resultados_usp_final.csv', 'a') as file:
        dw = csv.DictWriter(file, fieldnames=cols, lineterminator='\n')
        dw.writeheader()

        ies = browser.find_element_by_xpath('//div[@class="nome_ies_p"]').text.strip()
        campus = browser.find_element_by_xpath('//div[@class="nome_campus_p"]').text.strip()
        curso = browser.find_element_by_xpath('//div[@class="nome_curso_p"]').text.strip()
        grau_turno = browser.find_element_by_xpath('//div[@class="grau_turno_p"]').text.strip()

        tabelas = browser.find_elements_by_xpath('//table[@class="resultado_selecionados"]')
        for t in tabelas:
            modalidade = t.find_element_by_xpath('tbody//tr//th[@colspan="4"]').text.strip()
            aprovados = t.find_elements_by_xpath('tbody//tr')
            for a in aprovados[2:]:
                linha = a.find_elements_by_class_name('no_candidato')
                classificacao = linha[0].text.strip()
                nome = linha[1].text.strip()
                inscricao = linha[2].text.strip()
                nota = linha[3].text.strip().replace(',', '.')
                dw.writerow({
                    'ies': ies, 'campus': campus, 'curso': curso,
                    'grau_turno': grau_turno, 'modalidade': modalidade,
                    'classificacao': classificacao, 'nome': nome,
                    'inscricao': inscricao, 'nota': nota
                })

browser.quit()
In short, you set the preferences, choose a webdriver (I recommend Chrome), point it to the URL, and that's it. The browser opens automatically and starts executing your instructions.
I have tested using it to log in and it works fine, but I never tried to take a screenshot. In theory it should work.
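A minimal sketch of the login-plus-screenshot part, assuming a simple username/password form; the URL, field names and button selector below are placeholders, not taken from any real site. Selenium's save_screenshot() writes the current page to a PNG file.
# Sketch: log in with Selenium, then save a screenshot of the resulting page.
# The URL, field names and submit-button selector are assumptions/placeholders.
from selenium import webdriver

browser = webdriver.Chrome('chromedriver')
browser.get('https://example.com/login')

browser.find_element_by_name('username').send_keys('my_user')
browser.find_element_by_name('password').send_keys('my_password')
browser.find_element_by_xpath('//button[@type="submit"]').click()

browser.save_screenshot('after_login.png')  # writes a PNG of the visible page
browser.quit()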

Only one image getting uploaded multiple times

I have been using the mechanize gem to scrape data from Craigslist. I have a piece of code that uploads multiple images to Craigslist; all the file paths are correct, but only a single image gets uploaded, multiple times. What is the reason?
unless pic_url_arry.blank?
  unless page.links_with(:text => 'Use classic image uploader').first.blank?
    page = page.links_with(:text => 'Use classic image uploader').first.click
  end
  puts "After classic image uploader"
  form = page.form_with(class: "add")
  # build the full file path before setting it, e.g. file = File.join(APP_ROOT, 'tmp', 'image.jpg')
  i = 0
  pic_url_arry = pic_url_arry.shuffle
  pic_url_arry.each do |p|
    form.file_uploads.first.file_name = p
    i += 1
    page = form.submit
    puts "******#{p.inspect}*******"
    puts "******#{page.inspect}*******"
  end unless pic_url_arry.blank?
  # check whether the upload succeeded by comparing the number of files with the number of imgbox elements on the page
  check_image_uploaded = page.at('figure.imgbox').count
  if check_image_uploaded.to_i == i.to_i
    # upload failure: craigslist or network error
  end
end
And the pic array has values such as ["/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/1/images__4_.jpg", "/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/2/mona200.jpg", "/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/3/images__1_.jpg"].
The form holding the file field is set only once, so it keeps uploading whichever image it picked up first. The updated code is:
unless pic_url_arry.blank?
  unless page.links_with(:text => 'Use classic image uploader').first.blank?
    page = page.links_with(:text => 'Use classic image uploader').first.click
  end
  puts "After classic image uploader"
  form = page.form_with(class: "add")
  # build the full file path before setting it, e.g. file = File.join(APP_ROOT, 'tmp', 'image.jpg')
  i = 0
  pic_url_arry = pic_url_arry.shuffle
  pic_url_arry.each do |p|
    form.file_uploads.first.file_name = p
    i += 1
    page = form.submit
    form = page.form_with(class: "add")
    puts "******#{p.inspect}*******"
    puts "******#{page.inspect}*******"
  end unless pic_url_arry.blank?
  # check whether the upload succeeded by comparing the number of files with the number of imgbox elements on the page
  check_image_uploaded = page.at('figure.imgbox').count
  if check_image_uploaded.to_i == i.to_i
    # upload failure: craigslist or network error
  end
end
Hoping this will solve the problem.
