I am using a proxy API to test and scrape movies from IMDb, but all I get is [scrapy.core.engine] DEBUG: Crawled (200) - python-3.x

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('hiding the key')

class MoviesSpider(CrawlSpider):
    name = "movies"

    def start_requests(self):
        urls = ["https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc"]
        for link in urls:
            yield scrapy.Request(client.scrapyGet(url=link, render=True), callback=self.parse_item)

    rules = (Rule(LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a'), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {'link': response.url}
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.imdb.com%2Fsearch%2Ftitle%2F%3Fgenres%3Ddrama%26groups%3Dtop_250%26sort%3Duser_rating%2Cdesc&api_key=8ccf268c7e3c965da0777f5594598b9d&render=true&scraper_sdk=python> (referer: None)
2023-02-19 17:48:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.scraperapi.com/?url=https%3A%2F%2Fwww.imdb.com%2Fsearch%2Ftitle%2F%3Fgenres%3Ddrama%26groups%3Dtop_250%26sort%3Duser_rating%2Cdesc&api_key=8ccf268c7e3c965da0777f5594598b9d&render=true&scraper_sdk=python>
{'link': 'https://api.scraperapi.com/?url=https%3A%2F%2Fwww.imdb.com%2Fsearch%2Ftitle%2F%3Fgenres%3Ddrama%26groups%3Dtop_250%26sort%3Duser_rating%2Cdesc&api_key=8ccf268c7e3c965da0777f5594598b9d&render=true&scraper_sdk=python'}
I was expecting to get {'link': name of the link} instead.

You can try using a standard spider instead of the CrawlSpider. If all you are trying to scrape is the URLs, this is the way to go anyway, since you only make a single request instead of one per link.
class MoviesSpider(scrapy.Spider):
    name = "movies"

    def start_requests(self):
        urls = ["https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating"]
        for link in urls:
            yield scrapy.Request(client.scrapyGet(url=link, render=True))

    def parse(self, response):
        for link in response.xpath("//h3/a/@href").getall():
            full_link = response.urljoin(link)
            yield {'link': full_link}
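If the yielded links should point back at imdb.com rather than at the proxy URL in response.url, one option is to carry the original IMDb URL through meta and join against it instead. This is a rough sketch built on the code above; the lister-item-header XPath comes from the question's rule, while passing the original URL through meta is my assumption about how you want the links joined:

import scrapy
from urllib.parse import urljoin
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('hiding the key')

class MoviesSpider(scrapy.Spider):
    name = "movies"

    def start_requests(self):
        urls = ["https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating"]
        for link in urls:
            # keep the original IMDb URL so parse() can build absolute links against it
            yield scrapy.Request(client.scrapyGet(url=link, render=True), meta={'original_url': link})

    def parse(self, response):
        base = response.meta['original_url']
        for href in response.xpath('//h3[@class="lister-item-header"]/a/@href').getall():
            # join against imdb.com instead of the api.scraperapi.com host in response.url
            yield {'link': urljoin(base, href)}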

Related

Getting 401 response from scrapy Request

I am trying to extract table data from this page. After looking at the network tool, I figured out that an API call could provide the required table data, so I tried to mimic the request with Python Scrapy. Here are the code and the response message.
In [27]: url
Out[27]: 'https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1'
In [28]: headers
Out[28]: {'X-XSRF-TOKEN': 'eyJpdiI6Ims2ZVJxT3pRRUplSCtLZXRVZXA3cXc9PSIsInZhbHVlIjoiaDJaQ0hhVWQwUU9zMEQ2S1FqVEVxR3hPYTJYRzd3d0VWWkZzMUhYQmRPSGVoaWVtTnBNUXZzdkJhTngvS2xNLyIsIm1hYyI6Ijc3MzY1N2M4ZDljMWQ4MDY4OTA5ZGQwNmUzYThiNDNkMDNlZDUyZmQ1Mjc4ZTU0MzkwMjA3ZDFmMDAwMTdkYTMifQ=='}
In [29]: fetch(scrapy.Request(url,headers=headers))
2021-03-03 12:12:55 [scrapy.core.engine] DEBUG: Crawled (401) <GET https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1> (referer: None)
Is there anything I am missing in the headers or somewhere else?
When you visit https://www.barchart.com/stocks/quotes/MSFT/competitors you get a response header with set-cookie=laravel-token... and some other cookies. I tried all the cookies and laravel-token is the one used for auth. You also need the x-xsrf-token that you've already extracted.
To solve your problem in Scrapy: first make sure you have cookies enabled in settings.py.
Then send a request to https://www.barchart.com/stocks/quotes/MSFT/competitors. In the parse method of that request, send the next request to the API URL shown above. Scrapy will then handle the cookies automatically.
Here's an example spider that worked for me (I extracted the XSRF token quite sloppily; you probably have a better way):
import re
from urllib.parse import unquote

import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.barchart.com/stocks/quotes/MSFT/competitors',
        )

    def parse(self, response):
        # pull the XSRF token out of the Set-Cookie response headers
        for set_cookie in response.headers.getlist('Set-Cookie'):
            try:
                xsrf_token = re.findall(r'XSRF-TOKEN=(\w+==);', unquote(set_cookie.decode('utf-8')))[0]
            except IndexError:
                pass
        yield scrapy.Request(
            url='https://www.barchart.com/proxies/core-api/v1/quotes/get?'
                'symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symb'
                'ol,symbolName,weightedAlpha,lastPrice,priceChange,percen'
                'tChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime'
                ',symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&'
                'orderDir=desc&meta=field.shortName,field.type,field.desc'
                'ription&hasOptions=true&page=1&limit=100&raw=1',
            callback=self.parse_data,
            headers={
                'x-xsrf-token': xsrf_token
            }
        )

    def parse_data(self, response):
        pass
Output
2021-03-03 12:26:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barchart.com/stocks/quotes/MSFT/competitors> (referer: None)
2021-03-03 12:26:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1> (referer: https://www.barchart.com/stocks/quotes/MSFT/competitors)

Authenticated spider pagination. 302 redirect. reqvalidation.aspx - page not found

I have a Scrapy spider that can log into ancestry.com successfully. I then use that authenticated session to request a new link and can scrape its first page successfully. The issue happens when I try to go to the second page: I get a 302 redirect debug message and this URL: https://secure.ancestry.com/error/reqvalidation.aspx?aspxerrorpath=http%3a%2f%2fsearch.ancestry.com%2ferror%2fPageNotFound&msg=&ti=0
I followed the documentation and some recommendations here to get this far. Do I need a session token for each page? If so, how do I go about doing that?
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import FormRequest
from scrapy.loader import ItemLoader
from ..items import AncItem

class AncestrySpider(CrawlSpider):
    name = 'ancestry'

    def start_requests(self):
        return [
            FormRequest(
                'https://www.ancestry.com/account/signin?returnUrl=https%3A%2F%2Fwww.ancestry.com',
                formdata={"username": "foo", "password": "bar"},
                callback=self.after_login
            )
        ]

    def after_login(self, response):
        if "authentication failed".encode() in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url='https://www.ancestry.com/search/collections/nypl/?name=_Wang&count=50&name_x=_1',
                           callback=self.parse)

    def parse(self, response):
        all_products = response.xpath("//tr[@class='tblrow record']")
        for product in all_products:
            loader = ItemLoader(item=AncItem(), selector=product, response=response)
            loader.add_css('Name', '.srchHit')
            loader.add_css('Arrival_Date', 'td:nth-child(3)')
            loader.add_css('Birth_Year', 'td:nth-child(4)')
            loader.add_css('Port_of_Departure', 'td:nth-child(5)')
            loader.add_css('Ethnicity_Nationality', 'td:nth-child(6)')
            loader.add_css('Ship_Name', 'td:nth-child(7)')
            yield loader.load_item()
        next_page = response.xpath('//a[@class="ancBtn sml green icon iconArrowRight"]').extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
I tried adding some request header information. I tried adding the cookie information to the request headers, but that did not work. I've also tried using only the user agents that are listed in the POST packets.
Right now I only get 50 results. I should be getting hundreds after crawling all the pages.
Found the solution. It had nothing to do with authentication to the website. I needed a different approach to pagination: I resorted to using the page URL for pagination instead of following the "next page" button link.
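For illustration only, here is a sketch of that approach. The page query parameter and the max_pages attribute are assumptions, not taken from the original spider, and the actual ancestry.com pagination parameter may be named differently:

    def parse(self, response):
        for product in response.xpath("//tr[@class='tblrow record']"):
            ...  # load and yield items as in the original parse method
        # build the next page URL directly instead of following the "next page" button
        current_page = response.meta.get('page', 1)
        if current_page < self.max_pages:  # max_pages is an assumed spider attribute
            next_page = current_page + 1
            next_url = ('https://www.ancestry.com/search/collections/nypl/'
                        f'?name=_Wang&count=50&name_x=_1&page={next_page}')
            yield scrapy.Request(next_url, callback=self.parse, meta={'page': next_page})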

Scrapy crawl class skips links and doesn't return response body

Right now I am trying to scrape this webpage: http://search.siemens.com/en/?q=iot
For that I need to extract the links and parse them, which I just learned should be possible with the CrawlSpider class. However, my implementation doesn't seem to work. For testing purposes I am trying to return the response body from each website. Unfortunately, the spider only opens every third link or so and doesn't give me the response body back.
Any ideas what I am doing wrong?
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiemensCrawlSSpider(CrawlSpider):
    name = 'siemens_crawl_s'
    allowed_domains = ['search.siemens.com/en/?q=iot']
    start_urls = ['http://search.siemens.com/en/?q=iot']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//dl[@id="search-resultlist"]/dt/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield response.body
Setting LOG_LEVEL = 'DEBUG' in settings.py, you can see some requests being filtered due to the allowed_domains parameter:
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siemens.com': <GET https://www.siemens.com/global/en/home/products/software/mindsphere-iot.html>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.industry.siemens.com.cn': <GET https://www.industry.siemens.com.cn/automation/cn/zh/pc-based-automation/industrial-iot/iok2k/Pages/iot.aspx>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'w3.siemens.com': <GET https://w3.siemens.com/mcms/pc-based-automation/en/industrial-iot>
2019-05-10 00:38:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'new.siemens.com': <GET https://new.siemens.com/global/en/products/services/iot-siemens.html>
You can try with allowed_domains = ['siemens.com', 'siemens.com.cn'], or don't set allowed_domains at all.
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains
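For reference, a minimal sketch of the spider with that change applied. The only other adjustment, yielding a dict instead of the raw bytes, is mine, since Scrapy will not accept a bytes object from a callback:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiemensCrawlSSpider(CrawlSpider):
    name = 'siemens_crawl_s'
    # keep only registrable domains here; paths and query strings do not belong in allowed_domains
    allowed_domains = ['siemens.com', 'siemens.com.cn']
    start_urls = ['http://search.siemens.com/en/?q=iot']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//dl[@id="search-resultlist"]/dt/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # yield a dict rather than raw bytes so Scrapy accepts the output
        yield {'url': response.url, 'body': response.text}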

Scrapy Callback function not being called by Request

I am learning Python Scrapy and I am struggling to isolate why the callback function of a request is not being executed. The workflow scrapes a website. If an item page is found, the program tests whether the web page has an active login session; if not, it calls a function to log in. The issue is that the session expires over time, so my spider needs to log in again after a certain period. Any help or guidance would be appreciated.
The function I am trying to call is the following:
def retrylogin_parse(self, response):
    # FUNCTION NOT CALLED
    self.logger.debug("Re-Login attempted for url " + self.login_url)
    return [FormRequest.from_response(response, formid='login-form',
                                      formdata={'login[username]': self.username,
                                                'login[password]': self.password},
                                      clickdata={"type": "submit"}, callback=self.after_relogin)
            ]
I am trying to determine why retrylogin_parse is not called by the following line of code:
yield Request(self.login_url, dont_filter=True, callback=self.retrylogin_parse)
Here is the Code:
import ....

class MySpider(CrawlSpider):
    name = "bot-help"
    allowed_domains = ['www.somewebsite.com']
    start_urls = ["https://www.somewebsite.com/category/subcategory.html"]
    reloginCurrentUrl = ""
    username = 'username'
    password = '1234'
    login_msg = "Welcome"
    login_url = "https://www.somewebsite.com/login/"

    rules = (
        Rule(LinkExtractor(allow=('html')), callback='item_page'),
    )

    def item_page(self, response):
        image_item = Item()
        self.logger.info("item_page Called")
        str1 = response.xpath("//p[@class='welcome-msg']/text()").extract_first()
        self.logger.info("Testing if Response is still logged in")
        self.logger.debug("Message: " + str1)
        if (str1.find(self.login_msg) == -1):
            self.logger.error("Session Lost! Must Login")
            self.reloginCurrentUrl = response.url
            image_item['manu_product_url'] = response.url
            self.logger.debug("reloginCurrentUrl: " + self.reloginCurrentUrl)
            # HERE IS WHERE I WANT TO RE-LOGIN
            x = self.start_relogin(response)
            self.logger.debug("Relogin Request completed")
            return
        else:
            self.logger.info("Login Session is alive")
            self.logger.info("worked")
            # SCRAPE DATA.....
            yield image_item

    def __init__(self, **kwargs):
        CrawlSpider.__init__(self, **kwargs)

    def start_relogin(self, response):
        self.logger.debug("start_relogin function called")
        x = response.url
        self.logger.debug("Login Url: " + self.login_url)
        yield Request(self.login_url, dont_filter=True, callback=self.retrylogin_parse)

    def retrylogin_parse(self, response):
        # FUNCTION NOT CALLED
        self.logger.debug("Re-Login attempted for url " + self.login_url)
        return [FormRequest.from_response(response, formid='login-form', formdata={'login[username]': self.username, 'login[password]': self.password}, clickdata={"type": "submit"}, callback=self.after_relogin)]

    def after_relogin(self, response):
        self.logger.info("Post Re-Login Attempted")
        str1 = response.xpath("//p[@class='welcome-msg']/text()").extract_first()
        if (str1.find(self.login_msg) == -1):
            self.logger.info("Re-Login failed")
            return
        else:
            self.logger.info("Re-Login successful will continue to parse")
            return [Request(url=self.reloginCurrentUrl)]
Here is the Debug Output:
DEBUG: Crawled (200) <GET https://www.somewebsite.com/robots.txt> (referer: None)
2018-10-30 23:49:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.somewebsite.com/category/subcategory.html> (referer: None)
2018-10-30 23:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.somewebsite.com/category/abc-46467.html> (referer: https://www.somewebsite.com/category/subcategory.html)
2018-10-30 23:49:35 [foagroupbot-help] INFO: item_page Called
2018-10-30 23:49:35 [foagroupbot-help] INFO: Testing if Response is still logged in
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: Message: Login Please!
2018-10-30 23:49:35 [foagroupbot-help] ERROR: Session Lost! Must Login
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: reloginCurrentUrl: https://www.somewebsite.com/category/abc-46467.html
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: Relogin Request completed
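One thing the debug output does show: "Relogin Request completed" is logged, but no request to the login URL is ever crawled. start_relogin is a generator function, so x = self.start_relogin(response) only creates a generator object; the Request inside it is never handed to Scrapy's engine, which is consistent with retrylogin_parse never firing. A minimal sketch of the relevant part of item_page that yields the re-login requests instead of just calling the function (my suggestion, not code from the original post):

    def item_page(self, response):
        str1 = response.xpath("//p[@class='welcome-msg']/text()").extract_first()
        if str1.find(self.login_msg) == -1:
            self.logger.error("Session Lost! Must Login")
            self.reloginCurrentUrl = response.url
            # yield from consumes the start_relogin generator, so its Request
            # is actually scheduled and retrylogin_parse gets called
            yield from self.start_relogin(response)
            return
        # ... otherwise scrape the item as in the original method ...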

how to get status code other than 200 from scrapy-splash

I am trying to get the request status code with Scrapy and scrapy-splash; below is the spider code.
import scrapy

class Exp10itSpider(scrapy.Spider):
    name = "exp10it"

    def start_requests(self):
        urls = [
            'http://192.168.8.240:8000/xxxx'
        ]
        for url in urls:
            # yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})
            # yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
            yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True,
                                                        'splash': {
                                                            'args': {
                                                                'html': 1,
                                                                'png': 1,
                                                            }
                                                        }
                                                        })

    def parse(self, response):
        input("start .........")
        print("status code is:\n")
        input(response.status)
My start URL http://192.168.8.240:8000/xxxx returns a 404 status code. There are three ways of making the request above:
The first is:
yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})
The second is:
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
The third is:
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True,
                                            'splash': {
                                                'args': {
                                                    'html': 1,
                                                    'png': 1,
                                                }
                                            }
                                            })
Only the second way, yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True}), gets the right 404 status code; the first and the third both get status code 200. That is to say, once I try to use scrapy-splash I cannot get the correct 404 status code. Can you help me?
As the scrapy-splash documentation suggests, you have to pass magic_response=True to SplashRequest to achieve this:
meta['splash']['http_status_from_error_code'] - set response.status to HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. http_status_from_error_code option is False by default if you use raw meta API; SplashRequest sets it to True by default.
EDIT:
I was able to get it to work only with the execute endpoint, though. Here is a sample spider that tests the HTTP status code using httpbin.org:
# -*- coding: utf-8 -*-
import scrapy
import scrapy_splash

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    lua_script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))

      return {
        html = splash:html(),
        png = splash:png(),
      }
    end
    """

    def start_requests(self):
        yield scrapy_splash.SplashRequest(
            'https://httpbin.org/status/402', self.parse,
            endpoint='execute',
            magic_response=True,
            meta={'handle_httpstatus_all': True},
            args={'lua_source': self.lua_script})

    def parse(self, response):
        pass
It passes the HTTP 402 status code to Scrapy, as can be seen from the output:
...
2017-10-23 08:41:31 [scrapy.core.engine] DEBUG: Crawled (402) <GET https://httpbin.org/status/402 via http://localhost:8050/execute> (referer: None)
...
You can experiment with other HTTP status codes as well.
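For example, the same spider could loop over several httpbin status endpoints to confirm that each code is surfaced. This is a small variation on the code above, not part of the original answer:

    def start_requests(self):
        # try a few different status codes against httpbin.org
        for code in (402, 404, 503):
            yield scrapy_splash.SplashRequest(
                f'https://httpbin.org/status/{code}', self.parse,
                endpoint='execute',
                magic_response=True,
                meta={'handle_httpstatus_all': True},
                args={'lua_source': self.lua_script})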
