Python extract a browser dropdown list with selenium - python-3.x

Dears,
I'm new to using Selenium in Python for web scraping.
At the moment I have a simple example (image attached) where I would like to extract all the countries from the dropdown list "Select Country".
I wrote the following code:
driver = webdriver.Chrome(path)
driver.get(website)
wait = 20
countriesdropdown = driver.find_element_by_xpath('//*[@id="dropdown"]/ul/li/a')
print(countriesdropdown)
but I get output that I don't understand:
<selenium.webdriver.remote.webelement.WebElement (session="379a6b651a4829939ee2907a649d7655", element="3942d4ab-bb74-407a-a673-886d11fe49e9")>
Could you please tell me the best way to do this, and how to learn more about web scraping using Selenium in Python?
Thanks,
Merle-Dog

There are a couple of problems here.
driver.find_element_by_xpath('//*[@id="dropdown"]/ul/li/a') returns a single element, not a list, and printing a WebElement shows its object representation rather than its text.
To get a list of web elements, use driver.find_elements_by_xpath('//*[@id="dropdown"]/ul/li/a').
Once you have the list, iterate over its elements and read the text of each one.
Like this:
countries = driver.find_elements_by_xpath('//*[@id="dropdown"]/ul/li/a')
for country in countries:
    print(country.text)
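Selenium 4 deprecated the find_elements_by_* helpers and later releases removed them, so on a current install the same lookup goes through find_elements with a locator strategy. A minimal sketch, assuming the dropdown markup from the question; the helper name dropdown_texts is hypothetical, and the string "xpath" is simply the value of By.XPATH:

```python
def dropdown_texts(driver, xpath='//*[@id="dropdown"]/ul/li/a'):
    """Return the text of every element matched by the XPath (hypothetical helper)."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    elements = driver.find_elements("xpath", xpath)
    # .text can be empty for elements that are not displayed, so drop blanks
    return [el.text.strip() for el in elements if el.text.strip()]
```

With a live driver this would be called as print(dropdown_texts(driver)). If the list only appears after clicking the dropdown, the click has to happen first, otherwise the texts may come back empty.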

Related

Need To make list in selenium to find elements by xpath

driver.find_element_by_xpath('//*[@id="PolarisTextField10"]').send_keys(img.value)
Every time I add media, the number in PolarisTextField[10] changes.
I need to write something like a list:
driver.find_element_by_xpath('//[@id="PolarisTextField["10,11,etc"]"]').send_keys(img.value)
Sorry, but I am a beginner.
Try the solution below, which uses find_elements_by_xpath to get all elements whose id starts with PolarisTextField.
inputs = len(driver.find_elements_by_xpath("//*[starts-with(@id,'PolarisTextField')]"))
for inputIndex in range(inputs):
    driver.find_elements_by_xpath("//*[starts-with(@id,'PolarisTextField')]")[inputIndex].send_keys("text goes here")

Python Selenium Loop Through Table Elements

I'm writing a script to automate some data retrieval/entry and need to iterate over an unknown number of entries in a table on a web page. I created a sample you can see here:
So far my script logs into this ERP system, navigates to the page in the screenshot, and then waits for the StandardGrid to load with this:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class='StandardGrid']")))
The StandardGrid is where the links that I want to iterate over are housed on the web page. This is where I get lost.
I've tried several different find_elements_by_XYZ but can't figure out for the life of me how to get it to work. I've tried using ChroPath with Inspect in the browser to identify every way I could think of to get this to work.
The best I could come up with is that, in this case, the main table that contains the data I want to iterate over has an XPath of "//table[@class='StandardGrid']"
Therefore, I tried the following:
my_list = driver.find_elements_by_xpath("//a[@class='StandardGrid']")
for item in my_list:
    print(item)
But nothing prints out.
The table header of the column I want to iterate over and click on all the links of has the tag of <th xpath="1">Operation</th>
In this screenshot, the first URL has a tag of
'<a href="../Modules/Engineering/ProcessRoutings/PartOperationForm.aspx?Do=Update&Part_Operation_Key=8355805&Part_Key=2920988&Part_No=WP112789+Rev%2E%2D&Revision=-&Image=&Operation=Polish" onmouseover="status='Go To Detail Form'; return true;" onmouseout="status='';" style="" xpath="1">
Polish
</a>'
With the data I'm using there are hundreds of possible links like the one above, so a proper dynamic solution is needed.
The easiest way is to select the links directly:
wait = WebDriverWait(driver, 20)
operations = wait.until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table.StandardGrid a[onmouseover*="Go To Detail Form"]')))
for operation in operations:
    print(operation.text, operation.get_attribute('href'))
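With hundreds of rows, another option is to read the page source once and pull the links out offline instead of querying the browser per element. This is a different technique from the answer above: a rough sketch using only the standard library's html.parser, assuming the links are plain <a href> tags inside a table whose class list contains StandardGrid. The names GridLinkParser and grid_links are made up for illustration.

```python
from html.parser import HTMLParser


class GridLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags inside table.StandardGrid."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (text, href) pairs
        self._table_depth = 0  # > 0 while inside the target table
        self._href = None      # href of the <a> currently being read
        self._text = []        # text chunks of that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            # Count nested tables too, so the matching </table> closes the grid
            if self._table_depth or "StandardGrid" in classes:
                self._table_depth += 1
        elif tag == "a" and self._table_depth and attrs.get("href"):
            self._href, self._text = attrs["href"], []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1


def grid_links(html):
    """Return every (text, href) pair found inside the StandardGrid table."""
    parser = GridLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

It would be called as grid_links(driver.page_source) once the grid has loaded. The hrefs come back exactly as written in the markup, so relative ones such as ../Modules/... still need resolving against the page URL, whereas get_attribute('href') in Selenium returns the resolved URL.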

Scraping Dropdown prompts

I'm having some issues trying to get data from a dropdown button, and none of the answers on the site (or at least the ones I found) help me.
The website I'm trying to scrape is Amazon, for example for 'Nike Shoes'.
When I enter a product that falls into 'Nike Shoes', I may get a product like this:
https://www.amazon.com/NIKE-Flex-2017-Running-Shoes/dp/B072LGTJKQ/ref=sr_1_1_sspa?ie=UTF8&qid=1546518735&sr=8-1-spons&keywords=nike+shoes&psc=1
Where the size and the color comes with the page. So scraping is simple.
The problem comes when I get this type of products:
https://www.amazon.com/NIKE-Lebron-Soldier-Mid-Top-Basketball/dp/B07KJJ52S4/ref=sr_1_3?ie=UTF8&qid=1546518445&sr=8-3&keywords=nike+shoes
Where I have to select a size, and maybe a color, and also the price changes if I select different sizes.
My question is: is there a way to, for example, access every "shoe size" so I can at least check the price for that size?
If the page had some sort of list with the sizes within the source code it wouldn't be that hard, but the page changes when I select the size and no "list" of shoe sizes appears on the source (also the URL doesn't change).
Most e-commerce websites handle variants by embedding JSON in the HTML and loading the appropriate selection with JavaScript. So once you have scraped the HTML, you most likely have all of the variant data.
In your case the shoe sizes, their prices, etc. would be embedded in the HTML body. If you search for a sufficiently unique variant name, you can see some JSON in the body:
Now you need to:
Identify where the JSON part is:
It is usually somewhere in <script> tags or in a data-<something> attribute of some tag.
Extract the JSON part:
If it's embedded directly in JavaScript, you can extract it with a regex:
import re
script = response.xpath('//script/text()').extract_first()
# capture everything between { and }; the non-greedy match stops at the first },
# so widen the pattern if the object is nested
data = re.findall(r'(\{.+?\})', script)
Load the JSON as a dict and walk the tree, e.g.:
import json
d = json.loads(data[0])
d['products'][0]
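A regex struggles once the embedded object contains nested braces. One alternative, sketched here with the standard library only, is to find where the object starts and let json.JSONDecoder.raw_decode read exactly one JSON value from that position. The script text, the "var data =" marker and the helper name first_json_object are invented for illustration and are not Amazon's real markup.

```python
import json


def first_json_object(text, marker):
    """Decode the first JSON object that follows `marker` in `text`."""
    start = text.index(marker) + len(marker)
    brace = text.index("{", start)
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever JavaScript follows it
    obj, _end = json.JSONDecoder().raw_decode(text, brace)
    return obj


# Made-up script body standing in for the text of a <script> tag
script = 'var data = {"products": [{"size": "9", "price": 89.5}]}; init(data);'
d = first_json_object(script, "var data =")
print(d["products"][0])
```

This only works when the embedded text is strict JSON; a JavaScript object literal with unquoted keys or trailing commas would still raise json.JSONDecodeError.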

Python: Web scraping connecting multiple links from same page

I am looking to extract data from all the "Reactions" in the webpage, http://www.genome.jp/dbget-bin/www_bget?cpd:C10453
The code when executed should get data from fields Name, formula, reaction, pathway. Next it should open all the 3 reactions and collect the data of fields Name, definition, reaction class.
I tried using Beautiful soup but did not get how to extract data as there is no specific class for the fields in HTML.
I assume you have inspected the element on the web page and noticed that the reaction table row has the class td21. Assuming that every page is structured like this and you use BS3 or BS4, you should be able to do something like
# get all elements with class td21, take the first, take every link in it
links = soup.find_all("td", class_="td21")[0].find_all("a")
to get the link elements (warning: syntax varies between BS3 and BS4; the line above is BS4). Have a look at the references for further information.
With the links you have, you can start new HTTP requests by extracting the href attribute of each link, and parse the results again with BS.
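The href values on a page are often relative, so they need to be resolved against the page URL before they can be requested. A small sketch of that step using urllib.parse.urljoin; the helper name absolute_links and the example hrefs are hypothetical, not taken from the real page:

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs found on page_url, dropping duplicates but keeping order."""
    seen, result = set(), []
    for href in hrefs:
        url = urljoin(page_url, href)  # leaves absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


page_url = "http://www.genome.jp/dbget-bin/www_bget?cpd:C10453"
# Hypothetical hrefs, standing in for [a["href"] for a in links]
print(absolute_links(page_url, ["/entry/EXAMPLE1", "/entry/EXAMPLE1"]))
```

Each resulting URL can then be fetched and parsed with BS in the same way as the first page.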
References:
how-to-find-elements-by-class
searching-by-css-class

SharePoint Rest call is not returning all fields

Goal: Have python program pull data from SharePoint so we can store on database.
Issue: I am able to connect to SharePoint and return data, but I am not getting all of the fields I can see on the UI page. The UI page I am looking at is for the list used in the REST call, but it is a custom view.
Update: Using the renderashtml I was at least able to see some of the data points I am looking for. I would hope there is a better solution than this
Code:
import sharepy
connection = sharepy.connect("https://{site}.sharepoint.com")
r = connection.get("https://{site}.sharepoint.com/{page}/_api/web/Lists/getbytitle('{list_name}')/items")
print(r.content)
print(r.json())
#I have also tried
https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')
#I was able to return data as html
https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')/renderashtml
Research: I have taken a look at the REST documentation for SharePoint, and I am under the impression that you cannot return data from a view. The solution I saw was to first hit the view, generate a list of its columns, and use that to build a query against the list. I have tried that, and those fields are not available when I pull the list, although they are in the view.
https://social.msdn.microsoft.com/forums/sharepoint/en-US/a5815727-925b-4ac5-8a46-b0979a910ebb/query-listitems-by-view-through-rest-api
https://msdn.microsoft.com/en-us/library/office/dn531433.aspx#bk_View
Are you trying to get the data from known fields, or discover the names of the fields?
Can you get the desired data by listing the fields in a select?
_api/web/lists/getbytitle('Documents')/items?$select=Title,Created,DateOfBirth
or to get all of the fields:
_api/web/lists/getbytitle('Documents')/items?$select=*
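To try different field lists without hand-editing the URL each time, the $select query can be assembled in code. A minimal sketch, assuming the URL layout shown above; the helper name items_url, the site URL and the field names are placeholders, and list titles containing single quotes are not handled:

```python
from urllib.parse import quote


def items_url(web_url, list_title, fields=("*",)):
    """Build a list-items REST URL with a $select clause."""
    title = quote(list_title)   # e.g. spaces become %20
    select = ",".join(fields)   # internal field names, comma-separated
    return (f"{web_url.rstrip('/')}/_api/web/lists/"
            f"getbytitle('{title}')/items?$select={select}")


print(items_url("https://example.sharepoint.com/sites/demo",
                "Documents", ["Title", "Created", "DateOfBirth"]))
```

The result can be passed to connection.get(...) from the question in place of the hard-coded URL.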
