Web scraping of commercial websites in Python - python-3.x

I'm a chemical engineering student. For a project, I'd like to set up a web scraper in Python that can capture certain attributes of different products. For example, I'd like to go to Target and scrape info such as material, weight, and a photo. So far I've tried using the lxml library; below is the code I've tried, unsuccessfully. I'm not interested in web scraping per se, but in the data I can collect from these websites to perform my calculations. I've also found that I'll probably need a web crawler to point the scraper towards the websites I need. Anyway, can you point me to a source that teaches this stuff for dummies? I've looked online but so far nothing has really worked.
from lxml import html
import requests

page = requests.get('https://www.target.com/p/delta-children-skylar-4-in-1-convertible-crib/-/A-52936884#lnk=sametab')
tree = html.fromstring(page.content)
# XPath selects attributes with @id; the '#id' form is CSS syntax and matches nothing
data = tree.xpath('//div[@id="product-attributes"]/text()')
Thanks in advance!!!

Related

Web Scraping data from a confusing website

I'm trying to pull data from this website: https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml
I've used Beautiful Soup in the past for simple things like pulling the bestsellers list from Amazon, so I'm a bit familiar with it, but this website is very confusing and I'm looking for suggestions on how to get started here.
Essentially, I would like to loop through all the states in the 'State' filter and, for every state, loop through every RTO in the 'RTO' filter. Based on this, I want to download the table data.
I know I haven't added any code with this question - I need your help to understand how to get started on this project, since I have no idea how to navigate this website.
Thanks for your help!
EDIT:
This is how I'm getting the data:
from bs4 import BeautifulSoup
import requests

# Fetch the page and parse the raw HTML
page = requests.get("https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This is me currently trying to find the relevant sections of the data but I'm lost here:
# Grab the element at index 4 of the document's children (fragile: assumes a fixed layout)
html = list(soup.children)[4]
list(html.children)  # exploratory: inspect the children in an interactive session
data = html.find_all('table')
print(data[0].prettify())
Not sure where to go from here.
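As a hedged starting point (not a full solution): the sketch below assumes the State and RTO filters are rendered as ordinary <select> elements in the initial HTML. If they turn out to be built dynamically with JavaScript, you would need a browser-automation tool like Selenium instead.

from bs4 import BeautifulSoup
import requests

page = requests.get("https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml")
soup = BeautifulSoup(page.content, "html.parser")

# List every <select> element and its options, to locate the State/RTO filters
for select in soup.find_all("select"):
    print("filter:", select.get("id"))
    for option in select.find_all("option"):
        print("   ", option.get("value"), option.get_text(strip=True))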

Can someone guide me on how to collect a list of URL addresses in a tab using Python?

I'm trying to collect a list of "https://..." URLs and store them in a CSV file. I can do this manually, e.g. in Excel, by copying the URLs from the website of interest and pasting them one by one, but that's tedious and would take a lot of time.
Can someone suggest a faster way?
If you just need the addresses quickly from one page, you could run this JavaScript snippet in the console of your browser: Array.from(document.links).forEach(link => console.log(link.href)) - this will output all of the links on that page. (document.links is an HTMLCollection, which has no forEach of its own, hence the Array.from wrapper.)
If you want to use Python to scrape the page, I would suggest taking a look at this question on Stack Overflow, which uses the Beautiful Soup library.
If there is dynamic content loaded on the page with JavaScript, it's probably better to use something like Selenium; see this relevant Stack Overflow question.
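For the Python route, here is a minimal sketch along those lines (the URL is a placeholder; it assumes a static page where the links are present in the HTML):

import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder for the page you want links from
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Write every href on the page into a one-column CSV file
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    for a in soup.find_all("a", href=True):
        writer.writerow([a["href"]])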

Beautiful soup different output for UK and US sites

I have mainly used this site to find solutions so far, but I am struggling to find out why I get different soup objects for the US and UK versions of the same site, even though the pages look pretty much the same when using inspect element or the developer tools in the browser.
I am in the UK, if that is possibly a factor. When parsing eBay US (.com) I get the desired result with regard to the tag names, but when using eBay UK a lot of the HTML tag names etc. seem to have changed.
The following code is an example of how I create the soup object and find listing elements:
from bs4 import BeautifulSoup
import requests

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcomputer+keyboard.TRS0&_nkw=computer+keyboard&_sacat=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

for listing in soup.findAll('li', {'class': 's-item'}):
    try:
        link = listing.find('a', {'class': 's-item__link'})
        name = listing.find("h3", {"class": "s-item__title"}).get_text()
        price = listing.find("span", {"class": "s-item__price"}).get_text()
        print(link.get('href'))
        print(name)
        print(price + "\n")
    except AttributeError:
        # skip listings that are missing any of the fields
        pass
>>>https://www.ebay.com/itm/USB-WIRED-STYLISH-SLIM-QWERTY-KEYBOARD-UK-LAYOUT-FOR-PC-DESKTOP-COMPUTER-LAPTOP/392095538686?epid=2298009317&hash=item5b4ab71dfe:g:Zp0AAOSwowBbZw7U
>>>USB WIRED STYLISH SLIM QWERTY KEYBOARD UK LAYOUT FOR PC DESKTOP COMPUTER LAPTOP
>>>$7.15
So an example of the issue I am having:
If I am using the US site (if you change the above URL to .com) and want to find the listing titles, I can use findAll('li', {'class': 's-item__title'}) on the soup object.
However, if I am using the UK site (the URL above), I can only find the titles using findAll('li', {'class': 'lvtitle'}). The same goes for retrieving the list of listings: for the US soup object I can simply use 's-item', but this is not the case for the UK soup object.
I'm pretty new to programming so apologies for my poor explanation.
EDIT: The above code has been edited to show a working script. When I run the script against eBay US I get the correct result (link, name, and price of each listing); if I run the same script with the eBay UK URL it returns no results. So it does not seem to be due to a mistake in the script itself: the soup object is different for me, but not for others, it seems.
"even though they are pretty much the same when inspecting the HTML on the websites"
A programming lesson you learn fairly early: "pretty much the same" != "the same". In software, the difference between a program running and failing can be one character out of a million.
You are using CSS selectors to target various elements on the page. CSS does the styling of the pages. However, what do you notice about the two websites? The styling is very different, and thus at least some of the CSS is different. To a certain level, these are different websites, and they will need separate ways to scrape them (it could be as small as making the target CSS a variable, as sketched below, or as large as completely separate programs that merely share functions).
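As a rough sketch of the "target CSS as a variable" idea (the US selectors come from the question; the UK listing selector is a placeholder to fill in after inspecting the UK page):

# Hypothetical per-site selector table; class names change, so verify them
SELECTORS = {
    "ebay.com":   {"listing": ("li", "s-item"), "title": ("h3", "s-item__title")},
    "ebay.co.uk": {"listing": ("li", "TODO"),   "title": ("li", "lvtitle")},
}

def find_titles(soup, site):
    # Look up the tag/class pair for this site and extract the title text
    tag, cls = SELECTORS[site]["title"]
    return [el.get_text() for el in soup.find_all(tag, {"class": cls})]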
I am a bit perplexed that you cannot use s-item__title for both; I see it in the CSS of both the US and UK eBay sites. Check that you are doing it properly, perhaps by posting your code (you must post code) in a new question specifically about this.
Companies like eBay are not really pleased with people scraping their websites and probably take measures to defeat such attempts. Changing up the CSS so that scrapers do not have consistent targets is certainly one method they might use.
I recently created a project to fetch data from different websites using BeautifulSoup, and eBay was one of them. I can tell you from experience that fetching data from eBay is a struggle; it behaves in unexpected ways and will give you unexpected results.
One thing you can do is go to that URL, right-click to inspect the page, and look at the HTML layout to see what results you are getting and how you can work around them (maybe by changing the query parameters in the URL). I know you have already done that, but the HTML on their web page is really big, and there are probably some small differences you didn't catch. A good idea is to compare the HTML from the US and UK outputs, since there could be tag differences between the two; based on the tags in the UK website, you can change your findAll calls.
Another (more formal) way to fetch data is the eBay API; here is the link to a quick start guide for the US site: https://developer.ebay.com/tools/quick-start
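For reference, a rough sketch of what a search against the eBay Browse API looks like (the OAuth token is a placeholder you obtain through the developer portal; check the quick start guide above for the exact auth flow, and verify the response field names against the API docs):

import requests

resp = requests.get(
    "https://api.ebay.com/buy/browse/v1/item_summary/search",
    params={"q": "computer keyboard", "limit": 5},
    headers={
        "Authorization": "Bearer OAUTH_TOKEN",   # placeholder token
        "X-EBAY-C-MARKETPLACE-ID": "EBAY_GB",    # UK marketplace
    },
)
for item in resp.json().get("itemSummaries", []):
    print(item.get("title"), item.get("price", {}).get("value"))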

Unable to extract data using Import.io from Amazon web page where data is loaded into the page via Ajax

Does anyone know how to extract data from a webpage using Import.io where the data is loaded into the page via Ajax?
I am unable to extract data from the pages mentioned below.
There is no issue with extracting the first page's data, but how do I move on to extract data from the second page?
URL is given below.
<http://www.amazon.com/gp/aag/main?ie=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=ATVPDKIKX0DER&orderID=&seller=A13JB7253Q5S1B>
The data on that page is deployed using an interesting mix of technologies; it relies heavily on server-side code and JavaScript. That type of page can be a challenge; however, there are always methods to get the data. For example, some sellers have a page like this:
http://www.amazon.co.uk/gp/node/index.html?ie=UTF8&marketplaceID=ATVPDKIKX0DER&me=A2WO1PQ2OIOIGM&merchant=A2WO1PQ2OIOIGM
Which is very easy to extract data from, even using the magic algorithm - https://magic.import.io/?site=http:%2F%2Fwww.amazon.co.uk%2Fgp%2Fnode%2Findex.html%3Fie%3DUTF8%26marketplaceID%3DA1F83G8C2ARO7P%26me%3DA2WO1PQ2OIOIGM%26merchant%3DA2WO1PQ2OIOIGM
I had to take off the redirect=true from the URLs before it would work - just an FYI.
Other times some stores don't have such a URL; it's a bit of a pain, and their URLs can be tough to figure out.
We do help some of our enterprise customers build bespoke APIs when the data is very important to them, so do feel free to get in touch. I imagine a larger-scale workaround would be to create a dataset/API based on the categories you are interested in and then filter that larger dataset down (in Python, or CSV-style) by seller name. That would probably work!
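For example, if that larger dataset were exported as a CSV, the filtering step could be as small as this (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("category_dataset.csv")      # hypothetical export of the large dataset
sellers = df[df["seller"] == "SellerName"]    # hypothetical column and seller name
sellers.to_csv("filtered_by_seller.csv", index=False)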
I managed to get a static dataset but no API. You can find that dataset at the following GUID: c7c63f1c-7081-4d4a-ad91-afe9789a6620
Thanks

Export google search to a spreadsheet

Is it possible for me to create a list of google search results from a specific query and export it into excel? For example, I'd like to google orthodontists in Florida and be able to export the business name, phone number and address to an excel spreadsheet. I've done a lot of searching but I can't find any solutions. I'm looking for someone to point me in the right direction. Any help is appreciated, thanks.
An API is an Application Programming Interface: a way for your software to interact with the software on a server. Google offers the Custom Search JSON API, which you can use for 100 free queries per day. Other search engines may have more generous free APIs. With a search API you can write code that downloads text containing all the relevant data. You can read more about search engine APIs here.
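As a small illustration, a Custom Search query is a single HTTP request once you have created an API key and a search engine ID in the Google developer console (both values below are placeholders):

import requests

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": "API_KEY", "cx": "ENGINE_ID", "q": "orthodontists in Florida"},
)
# Each result has a title and a link you can write to a spreadsheet
for item in resp.json().get("items", []):
    print(item["title"], item["link"])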
Another way to collect data from Google is to scrape their results page. This means that you use code to download the HTML, and from that HTML you collect the relevant pieces (wikipedia link). With a programming language like Python, many people use the Beautiful Soup library for scraping. With code you can then take the relevant parts of the HTML and put them into a format like CSV that is readable by Excel. With Python there are also ways to write to Excel directly (link), as sketched below.
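As a small example of the direct-to-Excel route, using the openpyxl library with made-up rows standing in for whatever you scrape:

from openpyxl import Workbook

# Hypothetical scraped rows: header first, then one row per business
rows = [
    ("Business", "Phone", "Address"),
    ("Example Orthodontics", "555-0100", "Miami, FL"),
]

wb = Workbook()
ws = wb.active
for row in rows:
    ws.append(row)  # each tuple becomes one spreadsheet row
wb.save("results.xlsx")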
Finally, here is a link from 2007 that says with Google Spreadsheets you can import HTML.
Update: here is the MS Excel version.
The following web app, https://www.resultstoexcel.com/, lets you download Google search results to a CSV file (a format Microsoft Excel can read) for free.
If you have any problem viewing the downloaded results correctly in MS Excel, please read the FAQ section, where you will find how to open the file using the correct column separators.
Where are the results coming from?
A Google search on the topic turns up many companies that offer online access to Google search results through an Application Programming Interface (API). This web app uses the SERPSBOT API.
