Scrapy not extracting data from a certain xpath - python-3.x

I'm trying to extract some data from an amazon product page.
What I'm looking for is getting the images from the products. For example:
https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1
By using the XPath
//script[contains(., "ImageBlockATF")]/text()
I get the part of the source code that contains the URLs, but two matches pop up in the Chrome XPath Helper.
By trying things out with XPaths I ended up using this:
//*[contains(@type, "text/javascript") and contains(.,"ImageBlockATF") and not(contains(.,"jQuery"))]
Which gives me exclusively the data I need.
The problem I'm having is that for certain products (it can happen between two different pairs of shoes) sometimes I can extract the data and other times nothing comes out. I extract by doing:
imagenesString = response.xpath('//*[contains(@type, "text/javascript") and contains(.,"ImageBlockATF") and not(contains(.,"jQuery"))]').extract()
If I use the Chrome XPath Helper, the data always appears with the XPath above, but in the program itself it sometimes appears and sometimes doesn't. I know the script the console reads is sometimes different from the one that appears on the site, but I'm struggling with this one because it works only intermittently. Any ideas on what could be going on?
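For reference, one way to pull the URLs out of the captured script is a plain regex over its text. A minimal sketch, where the ".jpg" pattern is an assumption about how the image URLs are embedded:

import re

script = response.xpath(
    '//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") '
    'and not(contains(., "jQuery"))]/text()'
).extract_first()

# Assumption: the product images appear in the script as plain ".jpg" URLs.
imagenes = re.findall(r'https://[^"\']+?\.jpg', script or '')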

I think I found your problem: it's a captcha.
Follow these steps to reproduce:
1. Run scrapy shell (the URL must be quoted, otherwise the shell treats the & as a control operator):
scrapy shell "https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1"
2. View the response the way Scrapy sees it:
view(response)
When executing this I sometimes got a captcha.
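If that is what you are hitting, a rough workaround is to detect the captcha page in the spider and re-schedule the request. A minimal sketch, assuming the interstitial mentions "captcha" somewhere in its body:

def parse(self, response):
    # Assumption: Amazon's captcha interstitial contains the word "captcha".
    if b'captcha' in response.body.lower():
        # Retry the same URL; dont_filter bypasses the duplicate filter.
        yield response.request.replace(dont_filter=True)
        return
    imagenes = response.xpath(
        '//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") '
        'and not(contains(., "jQuery"))]'
    ).extract()
    yield {'imagenes': imagenes}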
Hope this points you in the right direction.
Cheers

Related

JMETER - WebDriver Sampler - Groovy - Dynamic Name (3 Level)

Thank you for your prompt responses. I have tried the code below, but it looks like it is not picking up the values because there are 3 levels of variables. Can you please advise? Thanks.
1st level: xpath=(//input[@type='text'])[7]
2nd level (neither of these works):
//li[contains(@id, 'cascader-menu')]/span
or
//li[contains(@id, 'cascader-menu')]/span[1]
I don't think anyone can come up with a proper unique locator from the screenshot alone, without access to the full DOM or the application.
From what I can see so far you need the following element:
//input[@placeholder='Select...']
However, it may or may not work depending on:
whether there is another input element matching this query; if there is more than one, the action goes to the first match
whether the element is visible
whether it can be interacted with (i.e. not covered by a modal window and not disabled)
the phase of the moon
You can test your expressions using your browser's developer tools; given that you get a match there, Selenium will also be able to find the element and hopefully work with it.
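In Python/Selenium terms (the WebDriver Sampler exposes the same WebDriver API through Groovy), those checks look roughly like this, with driver assumed to be a live WebDriver instance:

# All matches for the query; the action normally goes to the first one.
candidates = driver.find_elements_by_xpath("//input[@placeholder='Select...']")
if candidates:
    element = candidates[0]
    # Visible and enabled; the click may still fail if a modal covers it.
    if element.is_displayed() and element.is_enabled():
        element.click()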
If you are going to keep ignoring my suggestions to get familiar with DOM and XPath concepts, I can only suggest recording the scenario using the Selenium IDE or the JMeter Chrome Extension and hoping it can be replayed without modifications.

Python selenium "Timeout Exception" error

I am trying to click on the Financials link of the following URL using Selenium and Python.
https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/
Initially I used the following code:
link = driver.find_element_by_link_text('Financials')
link.click()
Sometimes this works and sometimes it doesn't, and I get an "Element is not clickable at point (X, Y)" error. I have added code to maximise the window in case the link was being overlapped by something.
It seems the error occurs because the webpage doesn't always load in time. To overcome this I have been trying to use Selenium's expected conditions and wait libraries. I came up with the following, but it isn't working; I just get a TimeoutException.
link = wait(driver, 60).until(EC.element_to_be_clickable((By.XPATH,'//*[@id="zbCenter"]/div/span/table[3]/tbody/tr/td/table/tbody/tr/td[8]/nobr/a/b')))
link.click()
I think XPath is probably the best choice here, or perhaps a class name, as there is no ID. I'm not sure whether it fails because the link is inside a table, but it seems odd to me that sometimes it works without having to wait at all.
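One variation I still want to try is waiting on the link text instead of the long positional XPath. A sketch, assuming the usual Selenium wait imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

link = wait(driver, 60).until(
    EC.element_to_be_clickable((By.LINK_TEXT, 'Financials'))
)
link.click()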
I have tried Jacob's approach. The problem is I want it to be dynamic so it will work for other companies. Also, when I first land on the summary page the URL has other things at the end so I can't just append /financials to the URL.
This is the URL it gives me: https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/?type_recherche=rapide&mots=DG
I might have found a way around this:
link = driver.current_url.split('?')[0]
How do I then access this list item and append the string 'financials/' to it?
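For what it's worth, split('?')[0] already returns that first list item as a plain string, so appending is simple concatenation:

base = driver.current_url.split('?')[0]   # '.../DOLLAR-GENERAL-CORPORATIO-5699818/'
driver.get(base + 'financials/')          # base already ends with '/'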
I was looking for a solution when I noticed that clicking on the Financials tab takes you to a new URL. In this case I think the simplest solution is just to use the .get() method with that URL.
i.e.
driver.get('https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/financials/')
This will always take you directly to the financials page! Hope this helps.

Python - make a search and retrieve a set amount of images from a search engine

I would like to get images from a search engine, to run some automated tests without the need to go online and pick them by hand.
I found an old example from 5 years ago (ajax.googleapis.com/ajax/services/search/images), which sadly does not work anymore. What is the current way to do this in Python 3? Ideally I would like to pass a string with the search term and retrieve a set amount of images at full size.
I don't really mind which search engine is used; I just want to be sure it will stay supported for the time being. I would also like to avoid Selenium; I plan to run this without any UI or browser, entirely from the terminal.
Have you heard of Pixabay? There is a nice Python wrapper for working with it as well.
Found a pretty good solution using BeautifulSoup.
It does not work on Google, since I get a 403, but by faking the headers in the request it is sometimes possible to get data. I will have to experiment with other sites.
So far the workflow is to do the search in the browser so I can get the URL to pass to BeautifulSoup. Once I have the URL in code, I replace the query part with a variable so I can set it programmatically.
Then I parse the BeautifulSoup output to extract the links to the images, and retrieve them using requests.
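A stripped-down sketch of that workflow; the search URL is only an example and the img filtering is generic, so it will need adapting per site:

import requests
from bs4 import BeautifulSoup

def fetch_images(query, count):
    # Example search URL taken from the browser; the query is the variable part.
    url = 'https://www.bing.com/images/search?q=' + query
    # Fake a browser User-Agent to avoid 403 responses.
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    # Keep only absolute http(s) image links; most engines serve thumbnails here.
    links = [img['src'] for img in soup.find_all('img', src=True)
             if img['src'].startswith('http')]
    return [requests.get(src, headers=headers).content for src in links[:count]]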
I wish there were a public API that also exposed parameters like picture size, but I found nothing that currently works.

Find Elements with dynamic IDs and XPaths

I apologize if this is a duplicate of an existing question, but nothing I read seemed to do the trick.
I am trying to automate the process of adding my hours for my job. This entails using Selenium to mimic the steps I go through to enter the hours.
The problem is that, as I navigate through the process, I have run into an element with a dynamic ID and XPath (and maybe other things; I am not very proficient in HTML).
I need to select the "Day" button on the "View" drop down. The highlighted HTML corresponds to that button. I have already checked and both the ID and Xpath change every time I create a new session. I usually do the following to find my elements:
elem = driver.find_element_by_xpath('xpath')
Below is the xpath I currently see:
//*[#id="ab5378a9418345a2a57ad12f066127a6"]
To further complicate things, the xpath for the "Week" selection is the following:
//*[#id="741015164c5547fbb5403c03c46636d3"]
I tried to figure out how to use contains() in the XPath, but even so, the two are not different enough to differentiate by @id. The only constant difference I see each time is that
data-automation-label="Day"
is present on the day element and
data-automation-label="Week"
is present on the week element.
Does anyone have experience finding elements when a problem like this occurs? I am working in Python 3.6 on a Windows 7 computer.
Again, I apologize if this is a duplicate but I tried very hard to find an answer before coming here for help.
Thanks in advance!
You can use either of the two selectors below:
XPATH
//div[@data-automation-label="Day"]
CSS
div[data-automation-label="Day"]
When you use an identifier, your main focus should be finding something that is unique to that object. It really doesn't matter whether it is a name, an ID, or anything else; use whatever you think will work best. Here, data-automation-label implies that by itself.
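For example, with the Python bindings used elsewhere in this thread, where driver is assumed to be a live WebDriver instance:

# XPath variant
day = driver.find_element_by_xpath('//div[@data-automation-label="Day"]')
# CSS variant (equivalent)
day = driver.find_element_by_css_selector('div[data-automation-label="Day"]')
day.click()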

Search Algorithm for a web application that needs to look for a specific value

I'm developing a webapp that will need to download the HTML from a website and then iterate through the code trying to find a specific but ever-changing value (in our case, the price of the product).
For this, I was thinking of asking the user (during installation and setup) to provide the system with a few lines of HTML from the page (the ones containing the price); from then on, every time we need to fetch the price, we would search for those lines and read off the price.
Now, I believe this is a horrible and slow way of doing it, but since there are no rules and the HTML can be totally different from one website to another (even the same website might change), I couldn't find a better way.
One improvement I thought of was to record, on the first full pass, the line at which we find the value; on subsequent fetches we would then start the search a few lines before the expected location. Any thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and can post it if needed; I'm simply thinking there must be a better, faster way of doing this.
This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to work out CSS selectors (which can map to jQuery selectors) that targeted the elements containing the values I was looking for. In my case I was able to narrow the full document down to just the elements that contained what I was after, but if you can't get exactly what you want, you could combine this with some extra logic, like a test of whether the text looks like a price (via regex) or a check of what it sits next to.
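As a concrete illustration of that approach, a minimal sketch; the selector would be stored per site, and the regex is just one plausible price pattern:

import re
from bs4 import BeautifulSoup

PRICE_RE = re.compile(r'\$\s*\d+(?:[.,]\d{2})?')  # one plausible price pattern

def find_price(html, css_selector):
    soup = BeautifulSoup(html, 'html.parser')
    for node in soup.select(css_selector):        # stored per-site selector
        match = PRICE_RE.search(node.get_text())
        if match:
            return match.group()
    return None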
