How to scrape hrefs disguised with unicode (e.g \u003ca href=\)

I'm trying to scrape relative paths contained in hrefs, but they aren't showing up in anything but the main soup pull. If I try to pull hrefs or links specifically, the ones I'm looking for don't show up, even though I know they are there:
\u003ca href=\"/model/ford-1200\"
\u003ca href=\"/model/ford-1300\"
\u003ca href=\"/model/ford-1400\"
Is there a way to create a list of the 20 or so "\u003ca href"s on the page? I'm looking for just the part in the quotes (e.g. /model/ford-1200, /model/ford-1300, /model/ford-1400), collected into a list.
Is this a use case where I'll need to bite the bullet and learn JavaScript scraping?

It turned out the fix had nothing to do with handling the Unicode escapes or using re. The problem was that the content had not loaded far enough to be scraped. I solved this by using Selenium, which allowed me to scrape the hrefs as I normally would.
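The fix that worked for the poster was Selenium. For the narrower case described in the question, where the escaped markup is already visible in the raw response text, a regular expression over that text may be enough. The following is only a sketch under that assumption; the function name extract_escaped_hrefs and the sample string are made up for illustration.

```python
import re

# Matches the literal text \u003ca href=\"...\" and captures what sits
# between the escaped quotes. The backslashes are real characters in the
# page source, so each one is escaped in the pattern.
ESCAPED_HREF = re.compile(r'\\u003ca href=\\"(.*?)\\"')


def extract_escaped_hrefs(raw_text):
    """Return every path found inside an escaped <a href=...> in raw_text."""
    return ESCAPED_HREF.findall(raw_text)


# Made-up sample in the same shape as the lines shown in the question.
sample = r'\u003ca href=\"/model/ford-1200\" x \u003ca href=\"/model/ford-1300\"'
paths = extract_escaped_hrefs(sample)
print(paths)
```

If the paths never appear in the downloaded text at all, this approach cannot help and a browser-driven tool such as Selenium is the more likely route.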

Related

Selenium Python and Beautiful Soup not finding the correct element with href?

I am working on a web scraper that navigates to a page with a list of links, then uses Beautiful Soup to retrieve each href. However, when I use Selenium again to locate those elements and click them, it cannot find them.
This is what I tried, with the href stored in the variable link:
a = self.drive.find_element(By.XPATH, "//a[@href=link]")
Then I tried the contains function:
a = self.drive.find_element(By.XPATH, "//*[contains(@href,link)]")
But now the seven different links always resolve to the same element. The links are long and differ only by a number. Does that affect how contains works? For instance:
...1953102711/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
...1981395917/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
What can I do either to find the element with an exact match, or to avoid the repeated match when using contains?
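One likely cause, judging only from the two lines in the question: the word link sits inside the XPath string, so XPath reads it as a child element named link rather than as the Python variable. In XPath 1.0, contains() returns true when its second argument is an empty string, which would explain why every lookup lands on the same first element. A minimal sketch of building the expression from the variable instead follows; href_xpath is a hypothetical helper, not part of Selenium, and the sample links are made up.

```python
def href_xpath(link):
    """Return an XPath matching <a> elements whose href equals link exactly."""
    # XPath 1.0 string literals have no escape syntax, so pick whichever
    # quote character does not occur in the link itself.
    if '"' not in link:
        literal = f'"{link}"'
    elif "'" not in link:
        literal = f"'{link}'"
    else:
        raise ValueError("link contains both quote types; use concat() instead")
    return f"//a[@href={literal}]"


# Two made-up links that differ only by a number give distinct expressions.
first = href_xpath("/item/1953102711/?ref=abc")
second = href_xpath("/item/1981395917/?ref=abc")
print(first)
print(second)

# With Selenium this would be used roughly as:
#   a = self.drive.find_element(By.XPATH, href_xpath(link))
```

Note that the comparison is against the href attribute as written in the page source, so the value taken from Beautiful Soup should be passed through unchanged.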

Can someone guide me how to collect a list of url address in the tab using python?

I'm trying to collect a list of "https://..." addresses and store them in a CSV file. I can do it manually, for example by copying the URLs from the website of interest and pasting them into Excel one by one, but that is tedious and would take a lot of time.
Can someone suggest a faster way?

If you just need the addresses quickly from one page, you could run this JavaScript snippet in your browser's console: Array.from(document.links).forEach(link => console.log(link.href)). It prints every link on the page. (document.links is an HTMLCollection, which has no forEach of its own, hence the Array.from.)
If you want to use Python to scrape the page, I would suggest taking a look at this question on Stack Overflow, which uses the Beautiful Soup library.
If the page loads dynamic content with JavaScript, it's probably better to use something like Selenium (relevant Stack Overflow question).
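For the Python route, a rough sketch of collecting the links from one page and writing them to a CSV file is shown below. It uses only the standard library as a stand-in for Beautiful Soup; the names LinkCollector, extract_links and write_links_csv, and the sample page, are assumptions for illustration. It sees only links present in the downloaded HTML, not ones added later by JavaScript.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative paths such as /a into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_text, base_url):
    """Return all link URLs found in html_text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


def write_links_csv(links, path):
    """Write one URL per row under a 'url' header."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url"])
        writer.writerows([link] for link in links)


# Made-up page content; in practice html_text would come from an HTTP request.
page = '<a href="/a">A</a><a href="https://x.org/b">B</a><a>no href</a>'
links = extract_links(page, "https://example.com/page")
print(links)
```

Calling write_links_csv(links, "links.csv") would then produce the file to open in Excel.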

Beautiful soup different output for UK and US sites

I have mainly used this site to find solutions so far, but I am struggling to work out why I get different soup objects for the US and UK versions of the same site, even though they look pretty much the same in inspect element or the developer tools.
I am in the UK, if that is possibly a factor. When parsing eBay US (.com) I get the desired result with regard to the tag names, but when using eBay UK a lot of the HTML tag and class names seem to have changed.
The following code is an example of how I create the soup object and find listing elements:
from bs4 import BeautifulSoup
import requests

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcomputer+keyboard.TRS0&_nkw=computer+keyboard&_sacat=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# Each search result is expected to be an <li class="s-item"> element.
for listing in soup.findAll('li', {'class': 's-item'}):
    try:
        link = listing.find('a', {'class': 's-item__link'})
        name = listing.find("h3", {"class": "s-item__title"}).get_text()
        price = listing.find("span", {"class": "s-item__price"}).get_text()
        print(link.get('href'))
        print(name)
        print(price + "\n")
    except AttributeError:
        # find() returned None for a missing element; skip this listing.
        pass
>>>https://www.ebay.com/itm/USB-WIRED-STYLISH-SLIM-QWERTY-KEYBOARD-UK-LAYOUT-FOR-PC-DESKTOP-COMPUTER-LAPTOP/392095538686?epid=2298009317&hash=item5b4ab71dfe:g:Zp0AAOSwowBbZw7U
>>>USB WIRED STYLISH SLIM QWERTY KEYBOARD UK LAYOUT FOR PC DESKTOP COMPUTER LAPTOP
>>>$7.15
So an example of the issue I am having:
If I was using the US site (if you change the above URL to .com) and want to find the listing titles I can use findAll('li', {'class': 's-item__title'}) from the soup object
However, if I am using the UK site (above URL) I can only find the titles using findAll('li', {'class': 'lvtitle'}). The same applies if I want to retrieve the list of listings: for the US soup object I can simply use 's-item', but this is not the case for the UK soup object.
I'm pretty new to programming so apologies for my poor explanation.
EDIT: The above code has been edited to show a working script. When I run it on eBay US I get the correct result (link, name and price of each listing). If I run the same script with the eBay UK URL it returns no results. So it does not seem to be due to a mistake in the script itself; the soup object is different for me, but apparently not for others.

"even though they are pretty much the same when inspecting the HTML on the websites"
This is a programming lesson you learn fairly early: "pretty much the same" != "the same". In software, the difference between a program running and failing can be one character out of a million.
You are using CSS class names to target various elements on the page. CSS does the styling of the pages. However, what do you notice about the two websites? The styling is very different, so at least some of the CSS is different. To a certain extent these are different websites and will need separate ways to scrape them (it could be as small as making the target CSS class a variable or as large as completely separate programs that only share some functions).
I am a bit perplexed that you cannot use s-item__title for both. I see it in the CSS of both the USA and UK eBay sites. Check that you are doing it properly, perhaps by posting your code (you must post code) in a new question specifically asking about this.
Companies like eBay are not really pleased with people scraping their websites and probably take measures to defeat such attempts. Changing up the CSS so that scrapers do not have consistent targets is certainly one method they might use to prevent it from occurring.

I recently built a project that fetches data from different websites with BeautifulSoup, and eBay was one of them. I can tell you from experience that fetching data from eBay is a struggle: it behaves in unexpected ways and gives unexpected results.
One thing you can do is open that URL, right-click to inspect the page, and look at the HTML layout to see what you are actually getting and how you can work around it (maybe by changing the query parameters in the URL). I know you have already done that, but the HTML of their pages is really big and there are probably some small differences you didn't catch. It may be a good idea to compare the HTML from the US and UK outputs, as there could be tag differences between the two; based on the tags in the UK page you can then change your findAll calls.
Another, more formal way to fetch the data is the eBay API. Here is the link to a quick start guide for the US website: https://developer.ebay.com/tools/quick-start
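The comparison of the US and UK HTML suggested above can be sketched as a diff of the class names each page uses. This is only an illustration using the standard library; ClassCollector, class_names and compare_classes are hypothetical names, and the two HTML strings are made-up stand-ins for res.text from each site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def class_names(html_text):
    collector = ClassCollector()
    collector.feed(html_text)
    return collector.classes


def compare_classes(html_a, html_b):
    """Report which class names occur in only one document or in both."""
    a, b = class_names(html_a), class_names(html_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Made-up fragments standing in for the two downloaded pages.
us_html = '<li class="s-item"><h3 class="s-item__title">x</h3></li>'
uk_html = '<li class="lvresult"><h3 class="lvtitle">x</h3></li>'
report = compare_classes(us_html, uk_html)
print(report)
```

Running this on the text that requests actually returns, rather than on what the browser inspector shows, is the point: the two can differ.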

Getting empty list when web scraping morningstar

I am trying to iterate through symbols for different mutual funds and use them to scrape some info from their Morningstar profiles. The URL is the following:
https://www.morningstar.com/funds/xnas/ZVGIX/quote.html
In the example above, ZVGIX is the symbol. I have tried using xpath to find the data I need, however that returns empty lists. The code I used is below:
import requests
from lxml import html

# symbols is the list of fund symbols to look up, e.g. ['ZVGIX']
for item in symbols:
    url = 'https://www.morningstar.com/funds/xnas/'+item+'/quote.html'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(totalAssets)
According to
Blank List returned when using XPath with Morningstar Key Ratios
and
Web scraping, getting empty list
that is due to the fact that the page content is downloaded in stages. The answer to the first link suggests using Selenium and chromedriver, but that is impractical given the amount of data that I am interested in scraping. The answer to the second suggests there may be a way to load the content with further requests, but it does not explain how to formulate those requests. So, how can I apply that solution to my case?
Edit: The code above returns [], in case that was not clear.

In case anyone else ends up here: eventually I solved my problem by analyzing the network requests when loading the desired pages. Following those links led to super simple html pages that held different parts of the original page. So rather than scraping from 1 page, I ended up scraping from around 5 pages for each fund.
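The answer does not list the fragment pages it found, so the following is only a sketch of the general shape: one request per fragment per fund, keyed by a name. The URLs in FRAGMENT_TEMPLATES are hypothetical placeholders, not real Morningstar endpoints; the real ones are whatever appears in the browser's network tab while the fund page loads.

```python
from urllib.request import urlopen

# Hypothetical placeholders. Replace with the request URLs observed in the
# browser's network tab; {symbol} marks where the fund symbol goes.
FRAGMENT_TEMPLATES = {
    "assets": "https://example.com/fragments/assets?symbol={symbol}",
    "fees": "https://example.com/fragments/fees?symbol={symbol}",
}


def fetch_text(url):
    """Download one URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_fragments(symbol, templates=FRAGMENT_TEMPLATES, fetch=fetch_text):
    """Return {fragment name: HTML text} for one fund symbol."""
    return {
        name: fetch(template.format(symbol=symbol))
        for name, template in templates.items()
    }


# Offline demonstration with a fake fetch function that echoes the URL.
def fake_fetch(url):
    return f"<p>{url}</p>"


pages = fetch_fragments("ZVGIX", fetch=fake_fetch)
print(pages["assets"])
```

Each returned fragment can then be parsed with the same XPath approach as before, since these simpler pages contain the data directly.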

Extract only content with Scrapy to create WordClouds

I am looking for a smart way to extract only the main content from a range of different webpages. When building word clouds, I always run into the problem of having to define a list of stopwords (e.g. "links", "contact", ...) so that only the actual content is shown. I would like to avoid creating a new stopword list each time I scrape a new website.
My idea is that certain HTML tags tend to hold more of the content than others. Is that a good way of filtering in the preprocessing, or do you have any other ideas?
Thank you for your help.
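The tag-based idea in the question could look roughly like the sketch below: keep text inside tags that usually hold body content and ignore text inside navigation-style tags. The two tag sets are assumptions to tune per site, the names MainTextParser and main_text are made up, and the standard-library parser is used here in place of Scrapy selectors. It assumes tags are properly closed.

```python
from html.parser import HTMLParser

# Assumed tag sets; adjust them to the sites being scraped.
CONTENT_TAGS = {"p", "article"}
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Keep text inside CONTENT_TAGS unless it is nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._content_depth = 0  # open content tags around the current text
        self._skip_depth = 0     # open skip tags around the current text
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in CONTENT_TAGS:
            self._content_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in CONTENT_TAGS and self._content_depth:
            self._content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._content_depth and not self._skip_depth:
            self.chunks.append(text)


def main_text(html_text):
    """Return the filtered text of a page as one space-joined string."""
    parser = MainTextParser()
    parser.feed(html_text)
    return " ".join(parser.chunks)


# Made-up page with navigation, body content and a footer.
page = (
    "<nav><p>links contact</p></nav>"
    "<article><p>Real <b>content</b> here.</p></article>"
    "<footer>contact</footer>"
)
text = main_text(page)
print(text)
```

The resulting string can be passed to the word cloud step in place of the full page text.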
