Python Selenium Loop Through Table Elements - python-3.x

I'm writing a script to automate some data retrieval/entry and need to iterate over an unknown number of entries in a table on a web page. I created a sample you can see here:
So far my script logs into this ERP system, navigates to the page in the screenshot, and then waits for the StandardGrid to load with this:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class='StandardGrid']")))
The StandardGrid is where the links I want to iterate over are housed on the web page. It's here where I get lost.
I've tried several different find_elements_by_XYZ but can't figure out for the life of me how to get it to work. I've tried using ChroPath with Inspect in the browser to identify every way I could think of to get this to work.
The best I could come up with is this: the main table that contains the data I want to iterate over has an XPath of "//table[@class='StandardGrid']"
Therefore, I tried the following:
my_list = driver.find_elements_by_xpath("//a[@class='StandardGrid']")
for item in my_list:
    print(item)
But nothing prints out.
The header of the column whose links I want to iterate over and click has the tag <th xpath="1">Operation</th>
In this screenshot, the first URL has a tag of
'<a href="../Modules/Engineering/ProcessRoutings/PartOperationForm.aspx?Do=Update&Part_Operation_Key=8355805&Part_Key=2920988&Part_No=WP112789+Rev%2E%2D&Revision=-&Image=&Operation=Polish" onmouseover="status='Go To Detail Form'; return true;" onmouseout="status='';" style="" xpath="1">
Polish
</a>'
With the data I'm using there are hundreds of possible links like the one above, so a proper dynamic solution is needed.

The easiest way is to get the links directly:
operations = wait.until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'table.StandardGrid a[onmouseover*="Go To Detail Form"]')))
for operation in operations:
    print(operation.text, operation.get_attribute('href'))

Related

Python extract a browser dropdown list with selenium

Dears,
I'm new to using Selenium in Python for web scraping.
At the moment I have a simple example (image attached) where I would like to extract all the countries from the dropdown list "Select Country".
I wrote the following code:
driver = webdriver.Chrome(path)
driver.get(website)
wait = 20
countriesdropdown = driver.find_element_by_xpath('//*[@id="dropdown"]/ul/li/a')
print(countriesdropdown)
but I receive something in the output that I don't understand:
<selenium.webdriver.remote.webelement.WebElement (session="379a6b651a4829939ee2907a649d7655", element="3942d4ab-bb74-407a-a673-886d11fe49e9")>
Could you please help me with the best way to do this, and to learn more about web scraping with Selenium in Python?
thanks,
Merle-Dog
There are several problems here.
driver.find_element_by_xpath('//*[@id="dropdown"]/ul/li/a') returns a single element, not a list as you expected.
To get a list of web elements you should use driver.find_elements_by_xpath('//*[@id="dropdown"]/ul/li/a')
Once you have the list, you iterate over its elements and get their text.
Like this:
countries = driver.find_elements_by_xpath('//*[@id="dropdown"]/ul/li/a')
for country in countries:
    print(country.text)

Python: Web scraping connecting multiple links from same page

I am looking to extract data from all the "Reactions" in the webpage, http://www.genome.jp/dbget-bin/www_bget?cpd:C10453
When executed, the code should get data from the fields Name, Formula, Reaction, and Pathway. Next it should open all 3 reactions and collect the data from the fields Name, Definition, and Reaction class.
I tried using Beautiful Soup but could not work out how to extract the data, as there is no specific class for the fields in the HTML.
I assume you have inspected the element on the web page and noticed that the reaction table cell has the class td21. Assuming every page is structured like this, and that you use BS3 or BS4, you should be able to do something like
# get all elements with class td21, take the first, then take every link in it
links = soup.find_all("td", class_="td21")[0].find_all("a")
to get the link elements (warning: the syntax varies between BS3 and BS4!). Have a look at the references for further information.
With the links you got, you can start new HTTP requests by extracting the href attribute of each link, and then parse the results with BS again.
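Since the hrefs on that site are relative, they have to be resolved against the page URL before requesting them. A minimal sketch of the follow-the-links step, using the standard library's `urljoin` (the `requests` usage and field names are assumptions, not from the question):

```python
from urllib.parse import urljoin

def absolute_links(base_url, anchors):
    """Resolve each anchor's href attribute against the page it came from,
    skipping anchors without an href."""
    return [urljoin(base_url, a.get("href")) for a in anchors if a.get("href")]

# Usage sketch with BS4 and requests:
# import requests
# from bs4 import BeautifulSoup
# base_url = "http://www.genome.jp/dbget-bin/www_bget?cpd:C10453"
# soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
# anchors = soup.find_all("td", class_="td21")[0].find_all("a")
# for url in absolute_links(base_url, anchors):
#     detail = BeautifulSoup(requests.get(url).text, "html.parser")
#     ...  # parse Name, Definition, Reaction class from each detail page
```

BS4 `Tag` objects support `.get("href")`, so the helper works directly on the result of `find_all("a")`.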
References:
how-to-find-elements-by-class
searching-by-css-class

SharePoint Rest call is not returning all fields

Goal: Have python program pull data from SharePoint so we can store on database.
Issue: I am able to connect to SharePoint and return data, but I am not getting all of the fields I can see on the UI page. The UI page I am hitting is in the list returned by the REST call, but it is a custom view.
Update: Using renderashtml I was at least able to see some of the data points I am looking for. I would hope there is a better solution than this.
Code:
import sharepy
connection = sharepy.connect("https://{site}.sharepoint.com")
r = connection.get("https://{site}.sharepoint.com/{page}/_api/web/Lists/getbytitle('{list_name}')/items")
print(r.content)
print(r.json())
#I have also tried
https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')
#I was able to return data as html
https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')/renderashtml
Research: I have taken a look at the REST documentation for SharePoint, and I am under the impression you cannot return data from a view. The suggested solution was to first hit the view, generate a list of columns from it, and use that to build a query against the list. I have tried that, but those fields are not available when I pull the list, even though they are in the view.
https://social.msdn.microsoft.com/forums/sharepoint/en-US/a5815727-925b-4ac5-8a46-b0979a910ebb/query-listitems-by-view-through-rest-api
https://msdn.microsoft.com/en-us/library/office/dn531433.aspx#bk_View
Are you trying to get the data from known fields, or discover the names of the fields?
Can you get the desired data by listing the fields in a select?
_api/web/lists/getbytitle('Documents')/items?$select=Title,Created,DateOfBirth
or to get all of the fields:
_api/web/lists/getbytitle('Documents')/items?$select=*
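Tying this back to the sharepy code in the question, a small helper can build the items URL with an optional $select clause. The site/page/list names below are placeholders, matching the question's own `{site}`/`{page}` convention:

```python
def items_url(site, page, list_title, fields=None):
    """Build the SharePoint REST URL for a list's items, optionally
    restricting the returned columns with a $select clause."""
    url = "https://{0}.sharepoint.com/{1}/_api/web/Lists/getbytitle('{2}')/items".format(
        site, page, list_title)
    if fields:
        url += "?$select=" + ",".join(fields)
    return url

# Usage sketch with the sharepy connection from the question:
# r = connection.get(items_url("{site}", "{page}", "{list_name}",
#                              ["Title", "Created", "DateOfBirth"]))
# print(r.json())
```

Internal column names often differ from the display names shown in the view, so if a $select field comes back empty, check the field's internal name via `_api/web/lists/getbytitle('{list_name}')/fields`.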

Re-using a list in Node.Js

I'm developing a web scraper in Node JS and I'm trying to figure out the best approach to combine data from one list, with data on another list.
For example:
Step 1: Scrape data from website A
In this step, I scrape some data using cheerio/request and store this in a list, which is then displayed on the screen in a jQuery data table. The data table has a checkbox next to each scraped row of data and the value I have assigned to each checkbox is a "URL".
Step 2: Scrape again based on URLs chosen in checkboxes
In step 2, the app will scrape another website based on the URLs selected in step 1 and will retrieve some data values from these URLs that are scraped.
My dilemma
I wish to use some data values that were scraped in step 1 along with some data values scraped in step 2. However currently in my app, the data from Step 1 has been lost because it's not being saved anywhere.
Since this is a sort of dynamic search whereby a user will search for data, scrape it, and then not necessarily want to see it again, I think saving the data into a database would be overkill? So I'm wondering if I should save the list data from step 1 into a session variable and then link the two lists up again using the URL (in the checkbox) as the key?
Thanks for your help!
Anthony
If you don't want to save this data, you can pass it back as additional inputs in your form: render one hidden input per item, with its value set to JSON.stringify(item) by your template, e.g.
<input type="hidden" value="..." /> <!-- value filled with JSON.stringify(item) -->
I don't think using a database to store the scraped content would be overkill.
The ideal points to note in this process would be:
Use a document store like MongoDB to dump your JSON data directly. I propose MongoDB because you will find more resources to refer to.
Initially open up a DB connection in Node; your scraping daemon can then reuse it each time it parses an HTTP result with cheerio and dumps the data to the DB.
Once you get the output of the HTTP request to your target URL, the cheerio parsing steps should take more time than dumping the data to the DB, so the database is unlikely to be the bottleneck.
I have followed a similar approach for scraping movie data from IMDb, and you can find the corresponding code in this repo if you are interested. I also used cheerio, and it's a good choice in my opinion.

OpenERP: Complex interactive search across several objects/tables - how?

Are Wizards & Transient Models the right way to implement complex interactive search across several tables (objects) and to visualize data aggregated from several objects?
“Complex interactive search” means a search panel with several input fields for entering search criteria, where input fields are dynamically filtered according to user access rules and data from other fields.
“Across several tables” means data stored in different objects (osv.Model) -- i.e. not only in different tables, but in different python 'objects' (osv.Model).
My current implementation using wizards & transients has the following drawbacks:
no way of filtering / grouping search results. Search results are handled as a one2many field, displayed as a Tree view inside the Form view of the parent search wizard ==> hence no way of grouping results inside the form view of a wizard. Frankly, I'd be happy if there were a way of making a complex search over multiple objects from a Tree view, as we do with a simple search on a single object.
Two page reloads happen when changing input parameters and making another search. First the page is refreshed with the old results inside, and then after a second or two (without showing "Loading…") the page is refreshed again with the new results. This doesn't happen if I search with the same parameters several times in a row -- then only one refresh happens. This double refresh is very user-unfriendly, as it creates the impression that the search failed (the old results are displayed). It probably happens because on every search I create a new wizard instance (to keep a search history before it's cleared by autovacuum) and return its id as 'res_id'. However, I create the copy of the previous wizard by passing an empty list as the default search-result ids, so I wouldn't expect the old results to show up -- nor would I expect the second refresh to happen.
results from all previous searches are preserved on the client side and merely hidden on the page instead of being removed completely ===> this makes the page heavy after a lot of searches. E.g.:
<div class="oe_view_manager oe_view_manager_inline" style="display: none;">
<div class="oe_view_manager oe_view_manager_inline" style="display: none;">
<div class="oe_view_manager oe_view_manager_inline" style="display: none;">
<div class="oe_view_manager oe_view_manager_inline">
Here: 3 old result sets plus the currently active (visible) one. I think I can get rid of those old results using JavaScript. Still, I wonder what the purpose of such preserve-and-hide behavior is?
