I'm having some issues trying to get data from a dropdown button, and none of the answers on the site (or at least the ones I found) help me.
The website I'm trying to scrape is Amazon; for example, a search for 'Nike Shoes'.
When I enter a product that falls into 'Nike Shoes', I may get a product like this:
https://www.amazon.com/NIKE-Flex-2017-Running-Shoes/dp/B072LGTJKQ/ref=sr_1_1_sspa?ie=UTF8&qid=1546518735&sr=8-1-spons&keywords=nike+shoes&psc=1
Here the size and the color come with the page, so scraping is simple.
The problem comes when I get this type of product:
https://www.amazon.com/NIKE-Lebron-Soldier-Mid-Top-Basketball/dp/B07KJJ52S4/ref=sr_1_3?ie=UTF8&qid=1546518445&sr=8-3&keywords=nike+shoes
Here I have to select a size, and maybe a color, and the price changes if I select different sizes.
My question is: is there a way to, for example, access every "shoe size" so I can at least check the price for that size?
If the page had some sort of list with the sizes within the source code it wouldn't be that hard, but the page changes when I select the size and no "list" of shoe sizes appears on the source (also the URL doesn't change).
Most e-commerce websites deal with variants by embedding JSON into the HTML and loading the appropriate selection with JavaScript. So once you have scraped the HTML, you most likely have all of the variant data.
In your case you'd have the shoe sizes, their prices, etc. embedded in the HTML body. If you search for a unique enough variant name, you can see some JSON in the body:
Now you need to:
Identify where the JSON part is:
It usually is somewhere in <script> tags or as data-<something> attribute of any tag.
Extract json part:
If it's embedded into JavaScript directly, you can extract it with a regex:
import re
script = response.xpath('//script/text()').extract_first()
# capture everything from the first { to the last }
data = re.findall(r'(\{.+\})', script, re.DOTALL)
Load the JSON as a dict and parse the tree, e.g.:
import json
d = json.loads(data[0])
d['products'][0]
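The three steps above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the sample HTML, the variable name `variantData`, and the `size`/`price` keys are made up for illustration, and the real page will use different names. Instead of a regex that captures the braces, it locates the assignment and lets `json.JSONDecoder.raw_decode` read exactly one JSON value, which copes with nested objects.

```python
import json
import re

# Hypothetical page source; real pages differ in variable names and keys.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
  var variantData = {"products": [
    {"size": "8", "price": "89.99"},
    {"size": "9", "price": "94.50"}
  ]};
</script>
</body></html>
"""


def extract_embedded_json(html, marker):
    """Return the first JSON value assigned to `marker` in `html`, or None."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it, so nested braces and the trailing ';' are not a problem.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_embedded_json(SAMPLE_HTML, "variantData")
# Map each size to its price (keys are assumptions of this sample).
prices = {p["size"]: p["price"] for p in data["products"]}
print(prices)
```

In a Scrapy callback the `html` argument would be `response.text`; the marker has to be found by inspecting the page source first.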
Related
I'm writing a script to automate some data retrieval/entry and need to iterate over an unknown number of entries in a table on a web page. I created a sample you can see here:
So far my script logs into this ERP system, navigates to the page in the screenshot, and then waits for the StandardGrid to load with this:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class='StandardGrid']")))
The StandardGrid is where the links I want to iterate over are housed on the web page. This is where I get lost.
I've tried several different find_elements_by_XYZ but can't figure out for the life of me how to get it to work. I've tried using ChroPath with Inspect in the browser to identify every way I could think of to get this to work.
The best I could come up with is that, in this case, the main table that contains the data I want to iterate over has an XPath of "//table[@class='StandardGrid']"
Therefore, I tried the following:
my_list = driver.find_elements_by_xpath("//a[@class='StandardGrid']")
for item in my_list:
    print(item)
But nothing prints out.
The table header of the column I want to iterate over and click on all the links of has the tag of <th xpath="1">Operation</th>
In this screenshot, the first URL has a tag of
'<a href="../Modules/Engineering/ProcessRoutings/PartOperationForm.aspx?Do=Update&Part_Operation_Key=8355805&Part_Key=2920988&Part_No=WP112789+Rev%2E%2D&Revision=-&Image=&Operation=Polish" onmouseover="status='Go To Detail Form'; return true;" onmouseout="status='';" style="" xpath="1">
Polish
</a>'
With the data I'm using there are hundreds of possible links like the one above, so a proper dynamic solution is needed.
Easiest way to get links directly:
# assumes the driver and the usual WebDriverWait/EC/By imports from the question
wait = WebDriverWait(driver, 20)
operations = wait.until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table.StandardGrid a[onmouseover*="Go To Detail Form"]')))
for operation in operations:
    print(operation.text, operation.get_attribute('href'))
I am trying to search an HTML response for tags loaded by the user, so the number of tags is variable.
I attempted to search the HTML by loading the txt file and splitting it into lines, but I didn't succeed.
For example this is the Response (link):
/><div class="cw cx"><span class="cy cz">Rockstar Games Social Club</span></div><div class="da"></div></div></a></a></h3></td></tr></table></div></div><div class="co">
And this is the user made tag file that gets loaded (link):
ABC
Gr
Rockstar Games Social Club
Now I want to check the response for each tag, and if I find one I save it in a named txt file.
For that I need to assign each line to a string.
But as I load a varying number of lines, I didn't find a solution.
If I had a fixed number, I would use a delimiter, for example |, and then use data[0], data[1], etc.
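A varying number of lines does not need one variable per line; a list holds them all. The following is a minimal sketch in Python (the question does not name a language, so that choice is an assumption), using the sample response and tags shown above. The helper names `load_tags` and `find_tags` are hypothetical.

```python
def load_tags(lines):
    """Strip whitespace and drop empty lines; works for any number of tags."""
    return [line.strip() for line in lines if line.strip()]


def find_tags(response_text, tags):
    """Return the tags that occur anywhere in the response text."""
    return [tag for tag in tags if tag in response_text]


# Sample data copied from the question. In practice the lines would come from
# iterating over the opened tag file instead of this hard-coded list.
response = '<div class="cw cx"><span class="cy cz">Rockstar Games Social Club</span></div>'
tag_lines = ["ABC\n", "Gr\n", "Rockstar Games Social Club\n"]

found = find_tags(response, load_tags(tag_lines))
print(found)
```

Each entry in `found` could then be written to its own file. The match is a plain case-sensitive substring test, so a short tag such as "Gr" would also match inside longer words if they appeared in the response.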
I am looking to extract data from all the "Reactions" in the webpage, http://www.genome.jp/dbget-bin/www_bget?cpd:C10453
The code when executed should get data from fields Name, formula, reaction, pathway. Next it should open all the 3 reactions and collect the data of fields Name, definition, reaction class.
I tried using Beautiful soup but did not get how to extract data as there is no specific class for the fields in HTML.
I assume you have inspected the element on the webpage and noticed that the reaction table row has the class td21. Assuming that every page is structured like this and you use BS3 or BS4, you should be able to do something like
# get all td elements with class td21, take the first, take every link in it
links = soup.find_all("td", class_="td21")[0].find_all("a")
to get the link elements (warning: the syntax varies between BS3 and BS4; this is the BS4 form). Have a look at the references for further information.
With the links you got, you can start new http requests by extracting the href-attribute for each link and start parsing your results again with BS.
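The link-following step can be sketched as below. To keep the example runnable without extra packages, it swaps in the standard library's `html.parser` in place of BeautifulSoup; the logic (first `td` with class `td21`, every `a` inside it, `href` resolved against the page URL) is the same. The sample markup and the link targets in it are invented placeholders, not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstCellLinks(HTMLParser):
    """Collect href values of <a> tags inside the first <td class="td21">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # <td> nesting depth while inside the target cell
        self.done = False   # True once the first matching cell has closed
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            if self.depth:
                self.depth += 1
            elif not self.done and "td21" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True


def reaction_urls(html, base_url):
    """Return absolute URLs for the links in the first td21 cell."""
    parser = FirstCellLinks()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


# Hypothetical markup standing in for the real page.
SAMPLE = """
<table><tr><th>Reaction</th>
<td class="td21"><a href="/dbget-bin/www_bget?rn:EXAMPLE1">EXAMPLE1</a>
<a href="/dbget-bin/www_bget?rn:EXAMPLE2">EXAMPLE2</a></td></tr>
<tr><td class="td21"><a href="/other">other</a></td></tr></table>
"""
urls = reaction_urls(SAMPLE, "http://www.genome.jp/dbget-bin/www_bget?cpd:C10453")
print(urls)
```

Each URL in `urls` would then be fetched with a new HTTP request and its response parsed the same way to read the Name, Definition, and Reaction class fields.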
References:
how-to-find-elements-by-class
searching-by-css-class
I'm working on a Drupal 8 project and having issues trying to access one particular image of an image multifield, e.g. {{field_images}}.
I have a View with Exposed Filters which returns a list of nodes based on selected taxonomy terms. One of those taxonomy terms has values that may match the title field of images contained within field_images.
The view displays a teaser list of content based on the filters, which then links to the full node content page.
Normally, the node list returned in the view will only display the first image from the multifield, displayed in the template like this {{ content.field_images.0}} or by simply limiting the images returned in the view itself.
However, in the case that an exposed filter value in the view, say "bathroom" is selected, I want to be able to find an image from field_images that has "bathroom" in the title and display that instead.
Is there any way using either twig logic or views that I can achieve this?
Thanks so much for any assistance anyone may have!
I would like to add a field to a list which displays an image but acts as a hyperlink. In other words, like the "Hyperlink or Picture" column, but "Hyperlink AND Picture" instead.
Where the two fields you input would be the URL to the image to display, and the URL of the hyperlink.
This must be possible. I notice that the Type (in a document library) column does just that, and also includes the views that are currently being used (in the case of a folder).
Is it possible to duplicate the computed Type field in a document library to read two other fields in the list (which will act as the image url, and the redirect link)? What would the CAML be?
Thanks in advance if anyone could offer any insight.
Arnhem
This can be done but you would need to develop a custom field type. As you have found, SharePoint's default rendering for pictures is without the hyperlink. You need to change how the rendering behaves in Display mode in your own custom field. Check Patterns in Custom Field Rendering for more info.
There are also several examples of creating custom field types on the web. The MSDN articles give a lot of detail about how it all works but don't let that put you off as it's not too tricky.