Scrapy: using multiple image fields, store results - python-3.x

I am using Scrapy to capture several images and save them in separate fields. Because of dependencies with other systems, I do not want all the image results (url, path, checksum) stored in one field. The image URLs are in:
image_url1
image_url2
The results (url, path, checksum) are stored in:
images1
images2
I finally have it working where it downloads 2 pictures.
The results for image_url1 are stored in images1, but the results for image_url2 are not stored in images2, and I don't know how to make clear that they should be. If I run the following code, it puts both results (url, path, checksum) together in one field (the results of image_url1 and image_url2 one after the other, separated by a space). That field cannot be inserted into MySQL, so it fails.
class GooutImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_url1']:
            yield scrapy.Request(image_url)
        for image_url in item['image_url2']:
            yield scrapy.Request(image_url)
I already made the following edit in settings:
IMAGES_URLS_FIELD = 'image_url1'
IMAGES_RESULT_FIELD = 'images1'
I can't find anything about how to work with multiple image fields.
*** Edit/solution after feedback
After the suggestion to do this in item_completed, I came up with the following:
class GooutImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if item['image_url1']:
            for image_url in item['image_url1']:
                yield scrapy.Request(image_url)
        if item['image_url2']:
            for image_url in item['image_url2']:
                yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        for download_status, result in results:
            if result['url'] == item['image_url1'][0]:
                item['images1'] = result['path']
            if result['url'] == item['image_url2'][0]:
                item['images2'] = result['path']
        return item
Don't know if this is the best way to do it, but it works. Feedback is appreciated. Thanks!

The item_completed() method is what does this storing, so that's what you'll need to override.
You will need to know where to store the data though, so you'll probably want to add that information to the meta dict of the request you create in get_media_requests().
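For illustration, here is a minimal sketch of how that routing could be wired up. The field names (image_url1/images1, image_url2/images2) are taken from the question; because the result dicts handed to item_completed() normally only contain url, path and checksum, this sketch rebuilds the URL-to-field mapping from the item itself rather than reading it back from the request meta.
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class GooutImagesPipeline(ImagesPipeline):
    # Map each URL field to the field that should receive its results.
    FIELD_MAP = {'image_url1': 'images1', 'image_url2': 'images2'}

    def get_media_requests(self, item, info):
        for url_field in self.FIELD_MAP:
            for image_url in item.get(url_field) or []:
                yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Rebuild the url -> target-field mapping from the item, then route
        # each successful download to the matching results field.
        url_to_field = {
            url: result_field
            for url_field, result_field in self.FIELD_MAP.items()
            for url in item.get(url_field) or []
        }
        for ok, result in results:
            if ok:
                item[url_to_field[result['url']]] = result['path']
        return item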

Related

How can I resolve this selenium stale reference error when scraping page 2 of a website?

I'm using Selenium to scrape Linkedin for jobs but I'm getting the stale reference error.
I've tried refresh, wait, WebDriverWait, and a try/catch block.
It always fails on page 2.
I'm aware it could be a DOM issue and have run through a few of the answers to that but none of them seem to work for me.
def scroll_to(self, job_list_item):
    """Just a function that will scroll to the list item in the column."""
    self.driver.execute_script("arguments[0].scrollIntoView();", job_list_item)
    job_list_item.click()
    time.sleep(self.delay)

def get_position_data(self, job):
    """Gets the position data for a posting.

    Parameters
    ----------
    job : Selenium webelement

    Returns
    -------
    list of strings : [position, company, location, details]
    """
    # This is where the error is!
    [position, company, location] = job.text.split('\n')[:3]
    details = self.driver.find_element_by_id("job-details").text
    return [position, company, location, details]

def wait_for_element_ready(self, by, text):
    try:
        WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((by, text)))
    except TimeoutException:
        logging.debug("wait_for_element_ready TimeoutException")
        pass

logging.info("Begin linkedin keyword search")
self.search_linkedin(keywords, location)
self.wait()
# scrape pages, only do first 8 pages since after that the data isn't
# well suited for me anyways:
for page in range(2, 3):
    jobs = self.driver.find_elements_by_class_name("occludable-update")
    # jobs = self.driver.find_elements_by_css_selector(".occludable-update.ember-view")
    # WebDriverWait(self.driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'occludable-update')))
    for job in jobs:
        self.scroll_to(job)
        # job.click
        [position, company, location, details] = self.get_position_data(job)
        # do something with the data...
        data = (position, company, location, details)
        # logging.info(f"Added to DB: {position}, {company}, {location}")
        writer.writerow(data)
    # go to next page:
    bot.driver.find_element_by_xpath(f"//button[@aria-label='Page {page}']").click()
    bot.wait()
logging.info("Done scraping.")
logging.info("Closing DB connection.")
f.close()
bot.close_session()
I expect that when job_list_item.click() is performed, the page reloads; since you are looping over jobs, which is a list of WebElements, those references become stale. You return to the page, but your jobs list is already stale.
To prevent a stale element reference, I usually avoid reusing an element across loop iterations or keeping it stored in a variable, especially if the page may change.
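As a rough sketch of what that can look like here (untested against LinkedIn; the class name, the scroll_to/get_position_data helpers, writer and self.wait come from the question's own code), re-locate the cards by index on every iteration instead of holding on to the WebElements found before the click:
for page in range(2, 3):
    # Count the cards once, then re-find the list fresh on every iteration so
    # no reference outlives a DOM re-render triggered by the click/scroll.
    num_jobs = len(self.driver.find_elements_by_class_name("occludable-update"))
    for index in range(num_jobs):
        jobs = self.driver.find_elements_by_class_name("occludable-update")
        if index >= len(jobs):
            break
        job = jobs[index]
        self.scroll_to(job)
        position, company, location, details = self.get_position_data(job)
        writer.writerow((position, company, location, details))
    # go to next page:
    self.driver.find_element_by_xpath(f"//button[@aria-label='Page {page}']").click()
    self.wait()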

Can not get client.command parameter to parse API response by key value in discord.py

I'm building a command onto an existing bot that will search an API and take a baseball player's name as a parameter to query a json response with. I've gotten everything to work correctly in test, only for the life of me I can't figure out how to restrict the results to only those that include the query parameter that is passed when the command is invoked within discord.
For example: a user will type !card Adam Dunn and only the value "Adam Dunn" for the key "name" will return. Currently, the entire first page of results is being sent no matter what is typed for the parameter, and with my embed logic running, each result gets a separate embed, which isn't ideal.
I've only included the pertinent lines of code and not included the massive embed of the results for readability's sake.
It's got to be something glaringly simple, but I think I've just been staring at it for too long to see it. Any help would be greatly appreciated, thank you!
Here is the code I'm currently working with:
async def card(ctx, *, player_name: str):
    async with ctx.channel.typing():
        async with aiohttp.ClientSession() as cs:
            async with cs.get("https://website.items.json") as r:
                data = await r.json()
                listings = data["items"]
                for k in listings:
                    if player_name == k["name"]:
                        print()
I hope I understood you right. If the user does not give a player_name, you will just keep searching for nothing, and you want to stop if no player_name is given. If that is the case:
Set the default value of player_name to None (player_name: str = None), then check at the beginning of your code whether it was provided.
async def card(ctx, *, player_name: str = None):
    if not player_name:
        return await ctx.send('You must enter a player name')
    # if there is a name do this
    async with ctx.channel.typing():
        async with aiohttp.ClientSession() as cs:
            async with cs.get("https://theshownation.com/mlb20/apis/items.json") as r:
                data = await r.json()
                listings = data["items"]
                for k in listings:
                    if player_name == k["name"]:
                        print()
Update:
I'm an idiot. It works as expected, but because the player_name I was searching for wasn't on the first page of results, it wasn't showing. When using a player_name that is on the first page of the API results, it works just fine.
This is a pagination issue, not a key value issue.

Scrapy crawl saved links out of csv or array

import scrapy

class LinkSpider(scrapy.Spider):
    name = "articlelink"
    allow_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    BASE_URL = 'https://www.topart-online.com/de/'

    # scraping cards of specific category
    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            yield {
                'links': a.xpath('@href').get()
            }
        next_page_url = response.xpath('//a[@class="page-link"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
This is my spider, which crawls all pages of that category and saves all the product links into a CSV file when I run "scrapy crawl articlelink -o filename.csv" on my server.
Now I have to crawl all the links in my CSV file for specific information that isn't contained in the product-link card.
How do I start?
So I'm glad you've had a look at how to scrape the next pages.
Now, with regard to the productean number, this is where yielding a plain dictionary becomes cumbersome. In almost all Scrapy scripts you will be better off using items. They are suited to grabbing elements from different pages, which is exactly what you want to do.
Scrapy extracts data from HTML, and the mechanism it suggests for holding that data is what it calls items. Scrapy accepts a few different ways to put data into some form of object. You can use a bog-standard dictionary, but for data that requires modification or isn't that clean (almost anything other than a very structured data set from the website) you should use items at the least. Items provide a dictionary-like object.
To use the items mechanism in our spider script, we have to instantiate the items class to create an items object. We then populate that items dictionary with the data we want, and in your particular case we share this items dictionary across functions to keep adding data from a different page.
In addition to this we have to declare the item fields, as they are called; they are the keys of the items dictionary. We do this in items.py, located in the project folder.
Code Example
items.py
import scrapy

class TopartItem(scrapy.Item):
    title = scrapy.Field()
    links = scrapy.Field()
    ItemSKU = scrapy.Field()
    Delivery_Status = scrapy.Field()
    ItemEAN = scrapy.Field()
spider script
import scrapy
from ..items import TopartItem

class LinkSpider(scrapy.Spider):
    name = "link"
    allow_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    custom_settings = {'FEED_EXPORT_FIELDS': ['title', 'links', 'ItemSKU', 'ItemEAN', 'Delivery_Status']}

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            items = TopartItem()
            link = a.xpath('@href')
            items['title'] = a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip()
            items['links'] = link.get()
            items['ItemSKU'] = a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip()
            items['Delivery_Status'] = a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/', '')
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})
        last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
        last_page_number = int(last_pagination_link.split('=')[-1])
        for i in range(2, last_page_number + 1):
            url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
            yield response.follow(url=url, callback=self.parse)

    def parse_item(self, response):
        items = response.meta['items']
        items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
        yield items
Explanation
First, within items.py, we create a class called TopartItem. Because we inherit from scrapy.Item, we can create Field objects for our items object. Any field we want, we give a name and create with scrapy.Field().
Within the spider script we have to import this TopartItem class. from ..items is a relative import, which means: from the parent directory of the spider script, take something from items.py.
Now the code should look slightly familiar. First, to create our specific items dictionary (called items) within the for loop, we use items = TopartItem().
Adding to the items dictionary works like any other Python dictionary; the keys of the items dictionary are the fields we created in items.py.
The variable link is the link to the specific page. We then grab the data we want, which you've seen before.
When we populate our items dictionary we also need the productean number from the individual pages. We get it by following the link; the callback is the function we want to receive the HTML of the individual page.
meta={'items': items} is the way we transfer our items dictionary to the parse_item function. We create a meta dictionary with a key called items whose value is the items dictionary we just created.
We then create the function parse_item. To get access to that items dictionary we go through response.meta, which holds the meta dictionary we created when making the request in the previous function. response.meta['items'] is how we access our items dictionary, which we again call items.
Now we can add the productean number to the items dictionary that already holds the data from the previous function. We then finally yield that items dictionary to tell Scrapy we are done adding data for this particular card.
To summarise the workflow: in the parse function we loop over the cards, extract the four pieces of data for each, then have Scrapy follow the card's link and add the fifth piece of data before moving on to the next card in the original HTML document.
Additional Information
Note: try to use get() if you want to grab only one piece of data and getall() for more than one, instead of extract_first() and extract(). If you look at the Scrapy docs, they recommend this: get() is a bit more concise, and with extract() you weren't always sure whether you would get a string or a list back, whereas getall() always gives you a list.
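A quick illustration of the difference, using the pagination selector from the spider above (inside a parse callback):
links = response.xpath('//a[@class="page-link"]/@href')
first_link = links.get()      # first match as a string, or None if nothing matched
all_links = links.getall()    # every match, always returned as a list of strings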
I recommend that you look up other examples of items in other Scrapy scripts, for example by searching GitHub or other websites. I would also recommend, once you understand the workflow, reading the items page in the docs carefully. It's clear but not very example-friendly; I think it becomes more understandable once you've written scripts with items a few times.
Updated next page links
I've replaced the code you had for next_page with a more robust way of getting all the data.
last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
last_page_number = int(last_pagination_link.split('=')[-1])
for i in range(2, last_page_number + 1):
    url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
    yield response.follow(url=url, callback=self.parse)
Here we grab the last pagination link and take the page number from it. I did this in case there are categories with more than 3 pages.
We then loop and build the URL for each iteration. The URL contains what's called an f-string (f''), which lets us plant a variable inside the string, so the loop can insert the page number (or anything else) into the URL. We start by planting 2, which gives us the link to the 2nd page, and we go up to last_page_number + 1 because range() stops one short of its end value, so this makes sure the last page is included. Each generated URL (the 3rd page, and so on) is then followed with the parse callback as well.

Using multiple parsers while scraping a page

I have searched some of the questions regarding this topic, but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site depending on the product I want to search. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do with this parse_item is branch by the item's category (laptop, tablet, etc.):
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is "'parse_laptop' is not defined". I have already tried putting parse_laptop above parse_item and I still get the same error.
You need to refer to a method, not a plain function name, so change this:
yield response.follow(urlProducto, callback = parse_laptop)
to this:
yield response.follow(urlProducto, callback = self.parse_laptop)
That is the request, and here's your function: def parse_laptop(self, response):. As you have probably noticed, your parse_laptop function requires the self object, so please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.
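Put differently, both callbacks live on the spider class, so the callback has to be referenced through self. A trimmed sketch of how the two methods sit together (the spider name is hypothetical; the category XPath and URL handling mirror the question):
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"  # hypothetical name

    def parse_item(self, response):
        urlProducto = response.request.url
        category = response.xpath(
            'normalize-space(//span[contains(@class, "a-list-item")]//a/text())'
        ).extract_first()
        if category and 'Laptop' in category:
            # parse_laptop is a method of this class, so reference it via self.
            # dont_filter=True because we re-request the URL we are already on,
            # which the duplicate filter would otherwise drop.
            yield response.follow(urlProducto, callback=self.parse_laptop, dont_filter=True)

    def parse_laptop(self, response):
        # laptop-specific parsing goes here
        pass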

Going through specified items in a page using scrapy

I'm running into some trouble trying to enter and analyze several items within a page.
I have a certain page which contains items; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domain = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start URL, I have some items whose links all follow, in XPath, the path below (within the startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i), callback=self.parse_item)
I'd like to find a way to access each one of these links (the ones tied to result_{number}) and then scrape the contents of that item.
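A rough sketch of how this is usually wired up, assuming the XPath from the snippet above and a plain scrapy.Spider (the domain, start URL and maxId are placeholders carried over from the question): extract the href behind each result_{i} anchor and follow it, since Request/response.follow expect a URL rather than an XPath expression.
import scrapy

class ItemSpider(scrapy.Spider):
    name = 'spider'
    maxId = 20
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']

    def parse(self, response):
        for i in range(self.maxId):
            # Pull the href of the anchor that wraps the result_{i} heading.
            href = response.xpath(
                f'//*[@id="result_{i}"]/div/div/div/div[2]/div[1]/div[1]/a/@href'
            ).get()
            if href:
                yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # scrape the fields of the individual item page here
        pass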
