Python: Web scraping connecting multiple links from same page

I am looking to extract data from all the "Reactions" in the webpage, http://www.genome.jp/dbget-bin/www_bget?cpd:C10453
The code when executed should get data from fields Name, formula, reaction, pathway. Next it should open all the 3 reactions and collect the data of fields Name, definition, reaction class.
I tried using Beautiful Soup but could not work out how to extract the data, as the fields have no specific class in the HTML.

I assume you have inspected the element on the webpage and noticed that the reaction table cell has the class td21. Assuming that every page is structured like this and you use BS3 or BS4, you should be able to do something like
# get all td elements with class td21, take the first, take every link in it
links = soup.find_all("td", class_="td21")[0].find_all("a")
to get the link elements (warning: the syntax shown is for BS4 and differs in BS3!). Have a look at the references for further information.
With the links you got, you can start new http requests by extracting the href-attribute for each link and start parsing your results again with BS.
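As a rough illustration of the "extract the href and request again" step, here is a minimal sketch. It deliberately uses only the standard library (html.parser and urllib.parse) instead of BeautifulSoup, so it runs without installing anything. The HTML fragment, the reaction IDs in it, and the names CellLinkCollector and reaction_urls are made up for the example; only the td21 class and the page URL come from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CellLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside the first <td class="td21">."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while inside the target cell (counts nested <td>)
        self.done = False  # True once the first matching cell has been closed
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            if self.depth:
                self.depth += 1
            elif not self.done and "td21" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True


def reaction_urls(page_url, html):
    """Return absolute URLs for the links found in the first td21 cell."""
    parser = CellLinkCollector()
    parser.feed(html)
    # hrefs on the page may be relative, so resolve them against the page URL
    return [urljoin(page_url, href) for href in parser.hrefs]


# Made-up fragment that only imitates the layout described in the answer
sample = """
<table>
<tr><td class="td21"><a href="/dbget-bin/www_bget?rn:R00001">R00001</a>
<a href="/dbget-bin/www_bget?rn:R00002">R00002</a></td></tr>
<tr><td class="td21"><a href="/other">other</a></td></tr>
</table>
"""
urls = reaction_urls("http://www.genome.jp/dbget-bin/www_bget?cpd:C10453", sample)
```

Each URL in urls can then be fetched (for example with requests.get) and parsed the same way to read the fields of the reaction page.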
References:
how-to-find-elements-by-class
searching-by-css-class

Related

Python Selenium Loop Through Table Elements

I'm writing a script to automate some data retrieval/entry and need to iterate over an unknown number of entries in a table on a web page. I created a sample you can see here:
So far my script logs into this ERP system, navigates to the page in the screenshot, and then waits for the StandardGrid to load with this:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class='StandardGrid']")))
The StandardGrid is where the links I want to iterate over are housed on the web page. This is where I get lost.
I've tried several different find_elements_by_XYZ but can't figure out for the life of me how to get it to work. I've tried using ChroPath with Inspect in the browser to identify every way I could think of to get this to work.
The best I could come up with is that, in this case, the main table that contains the data I want to iterate over has an XPath of "//table[@class='StandardGrid']"
Therefore, I tried the following:
my_list = driver.find_elements_by_xpath("//a[@class='StandardGrid']")
for item in my_list:
    print(item)
But nothing prints out.
The table header of the column I want to iterate over and click on all the links of has the tag of <th xpath="1">Operation</th>
In this screenshot, the first URL has a tag of
'<a href="../Modules/Engineering/ProcessRoutings/PartOperationForm.aspx?Do=Update&Part_Operation_Key=8355805&Part_Key=2920988&Part_No=WP112789+Rev%2E%2D&Revision=-&Image=&Operation=Polish" onmouseover="status='Go To Detail Form'; return true;" onmouseout="status='';" style="" xpath="1">
Polish
</a>'
With the data I'm using there are hundreds of possible links like the one above, so a proper dynamic solution is needed.

Easiest way to get the links directly:
wait = WebDriverWait(driver, 20)
# wait until every detail-form link inside the grid is visible
operations = wait.until(
    EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, 'table.StandardGrid a[onmouseover*="Go To Detail Form"]')))
for operation in operations:
    print(operation.text, operation.get_attribute('href'))

Scraping Dropdown prompts

I'm having some issues trying to get data from a dropdown button, and none of the answers on the site (or at least the ones I found) help me.
The website I'm trying to scrape is Amazon, searching for, say, 'Nike Shoes'.
When I enter a product that falls into 'Nike Shoes', I may get a product like this:
https://www.amazon.com/NIKE-Flex-2017-Running-Shoes/dp/B072LGTJKQ/ref=sr_1_1_sspa?ie=UTF8&qid=1546518735&sr=8-1-spons&keywords=nike+shoes&psc=1
Where the size and the color comes with the page. So scraping is simple.
The problem comes when I get this type of products:
https://www.amazon.com/NIKE-Lebron-Soldier-Mid-Top-Basketball/dp/B07KJJ52S4/ref=sr_1_3?ie=UTF8&qid=1546518445&sr=8-3&keywords=nike+shoes
Where I have to select a size, and maybe a color, and also the price changes if I select different sizes.
My question is: is there a way to, for example, access every "shoe size" so I can at least check the price for that size?
If the page had some sort of list with the sizes within the source code it wouldn't be that hard, but the page changes when I select the size and no "list" of shoe sizes appears on the source (also the URL doesn't change).

Most e-commerce websites handle variants by embedding JSON in the HTML and loading the appropriate selection with JavaScript. So once you have scraped the HTML, you most likely already have all of the variant data.
In your case the shoe sizes, their prices, etc. would be embedded in the HTML body. If you search the page source for a sufficiently unique variant name, you can see some JSON in the body:
Now you need to:
Identify where the JSON part is:
It is usually somewhere in <script> tags or in a data-<something> attribute of some tag.
Extract the JSON part:
If it's embedded directly in JavaScript, you can extract it with a regex:
import re
script = response.xpath('//script/text()').extract_first()
# capture everything between { and } (non-greedy, so it stops at the first closing brace)
data = re.findall(r'(\{.+?\})', script)
Load the json as dict and parse the tree, e.g.:
import json
d = json.loads(data[0])
d['products'][0]
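A non-greedy pattern such as \{.+?\} stops at the first closing brace, so it cuts nested objects short. One possible alternative, sketched here under the assumption that the script text contains a valid JSON object literal, is to let the json module find where each object ends via JSONDecoder.raw_decode. The sample script string, its keys (products, size, price), and the helper name extract_json_objects are invented for the example; real pages will use different names.

```python
import json


def extract_json_objects(text):
    """Yield each JSON object that can be decoded starting at a '{' in text."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # This brace does not start valid JSON (e.g. a JS function body);
            # move on to the next one.
            pos = text.find("{", pos + 1)
            continue
        yield obj
        pos = text.find("{", end)


# Hypothetical script body imitating embedded variant data
script = (
    'var data = {"products": [{"size": "9", "price": 89.99}, '
    '{"size": "10", "price": 94.5}]}; init(data);'
)
variants = next(extract_json_objects(script))
prices = {p["size"]: p["price"] for p in variants["products"]}
```

Here prices maps each size to its price, which is the per-size lookup the question asks about.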

SharePoint Rest call is not returning all fields

Goal: Have a Python program pull data from SharePoint so we can store it in a database.
Issue: I am able to connect to SharePoint and return data, but I am not getting all of the fields I can see when viewing the UI page. The UI page I am viewing belongs to the list used in the REST call, but it is a custom view.
Update: Using renderashtml I was at least able to see some of the data points I am looking for. I would hope there is a better solution than this.
Code:
import sharepy
connection = sharepy.connect("https://{site}.sharepoint.com")
r = connection.get("https://{site}.sharepoint.com/{page}/_api/web/Lists/getbytitle('{list_name}')/items")
print(r.content)
print(r.json())
# I have also tried:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')
# I was able to return data as HTML from:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')/renderashtml
Research: I have taken a look at the REST documentation for SharePoint, and I am under the impression that you cannot return data from a view. The solution I saw was to first hit the view, generate a list of columns from it, and use that to build a query against the list. I have tried that, and those fields are not available when I pull the list but are in the view.
https://social.msdn.microsoft.com/forums/sharepoint/en-US/a5815727-925b-4ac5-8a46-b0979a910ebb/query-listitems-by-view-through-rest-api
https://msdn.microsoft.com/en-us/library/office/dn531433.aspx#bk_View

Are you trying to get the data from known fields, or discover the names of the fields?
Can you get the desired data by listing the fields in a select?
_api/web/lists/getbytitle('Documents')/items?$select=Title,Created,DateOfBirth
or to get all of the fields:
_api/web/lists/getbytitle('Documents')/items?$select=*
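To make the $select suggestion concrete, here is a small sketch that builds the items URL from a list of field names. The helper name items_url and the example site URL are hypothetical; the field names are the ones used in the answer. Note that the names in $select are typically the internal field names, which may differ from the column captions shown in a view.

```python
from urllib.parse import quote


def items_url(site_url, list_title, fields=None):
    """Build a list-items REST URL that asks only for the named fields.

    With no fields given, fall back to $select=* as in the answer.
    """
    select = ",".join(fields) if fields else "*"
    title = quote(list_title)  # percent-encode spaces etc. in the list title
    return f"{site_url}/_api/web/lists/getbytitle('{title}')/items?$select={select}"


# Hypothetical site URL; field names taken from the answer
url = items_url(
    "https://example.sharepoint.com/sites/demo",
    "Documents",
    ["Title", "Created", "DateOfBirth"],
)
```

The resulting url can be passed to the same connection.get(...) call used in the question.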

Youtube search using the google API

I am using the google API and I wanted to do a youtube search.
I request a search with https://www.googleapis.com/youtube/v3/search?part=snippet&q=[word that I searched for]&key=my_key to get the items array. I noticed, though, that the items array does not return the results in the same order as when you search YouTube yourself.
Ex: if I search for the word 'bad', the first result in the items array is "Young Lex ft AwKarin - BAD ( Official Music Video Clip )", while if I search as a YouTube user it's "Michael Jackson - Bad (Shortened Version)".
Perhaps they are not ordered and I have to order them using a property or something I have missed.
So my question is how can I make the items array return as the first item the first result that would have appeared in the youtube search.
edit: I have tried adding chart=MostPopular leaving to default videoCategoryId, but it still showed the same first result.

It seems this is simply how YouTube behaves. The only way to make the API results match the YouTube site is to apply the same sort filter to both, for example upload date or view count.
Here is an example request ordered by viewCount:
https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.search.list?part=snippet&maxResults=10&order=viewCount&q=spider&_h=1&
and this is for the YouTube site
https://www.youtube.com/results?q=spider&sp=CAM%253D
Hope this helps.
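For reference, a minimal sketch of building the same kind of request against the search endpoint from the question, with the order parameter set explicitly. The function name search_url and the placeholder key MY_KEY are assumptions for the example; part, q, order, maxResults, and key are parameters of the search request.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def search_url(query, api_key, order="relevance", max_results=10):
    """Build a search request URL with an explicit result ordering."""
    params = {
        "part": "snippet",
        "q": query,
        "order": order,          # e.g. "relevance", "date", "viewCount"
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# MY_KEY is a placeholder, not a real API key
url = search_url("spider", "MY_KEY", order="viewCount")
```

Changing order while keeping the same q makes it easy to compare the API's first item with what the site shows under the matching sort filter.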

Re-using a list in Node.Js

I'm developing a web scraper in Node JS and I'm trying to figure out the best approach to combine data from one list, with data on another list.
For example:
Step 1: Scrape data from website A
In this step, I scrape some data using cheerio/request and store this in a list, which is then displayed on the screen in a jQuery data table. The data table has a checkbox next to each scraped row of data and the value I have assigned to each checkbox is a "URL".
Step 2: Scrape again based on URLs chosen in checkboxes
In step 2, the app will scrape another website based on the URLs selected in step 1 and will retrieve some data values from these URLs that are scraped.
My dilemma
I wish to use some data values that were scraped in step 1 along with some data values scraped in step 2. However currently in my app, the data from Step 1 has been lost because it's not being saved anywhere.
Since this is a sort of dynamic search, whereby a user will search for data, scrape it, and then not necessarily want to see it again, I think saving the data into a database would be overkill. So I'm wondering if I should save the list data from step 1 in a session variable and then link the two sets together again using the URL (in the checkbox) as the key?
Thanks for your help!
Anthony

If you don't want to persist this data, pass it along as additional inputs of your form, for example:
<input type="hidden" value=JSON.stringify(item) />

I don't think using a database to store the scraped content would be overkill.
The points to note in this process would be:
Use a document store like MongoDB to dump your JSON data directly. I propose MongoDB because there are plenty of resources to refer to.
Open a DB connection in Node once at startup, and have your scraping daemon reuse it each time it parses an HTTP result with cheerio and dumps it to the DB.
Once you get the response from the HTTP request to your target URL, the cheerio parsing step should take more time than dumping the data to the DB.
I have followed a similar approach for scraping movie data from IMDb, and you can find the corresponding code in this repo if you are interested. I also used cheerio, and it's a good choice in my opinion.
