Obtaining data-pids using Beautiful Soup - python-3.x

I am attempting to scrape the following website using Beautiful Soup in Python 3.
https://www.pgatour.com/competition/2017/safeway-open/leaderboard.html
Each player has an associated data-pid number on a div.
As the div's class is not constant and changes with each player, I am having trouble extracting it.
I tried the following after parsing the HTML, but without luck:
soup.find_all('div',{'class','leaderboard-item'})
Essentially, the output should simply be a list of the numbers within the data-pids. Would very much appreciate any help.

You can use the requests library:
import requests
# JSON feed found in the Network tab while the leaderboard page loads
data = requests.get('https://statdata.pgatour.com/r/464/2017/player_stats.json').json()
pids = [player['pid'] for player in data['tournament']['players']]
I couldn't find a way to parse this page with Beautiful Soup. I found the JSON link above using the Network tab of the Chrome developer tools.
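If the data-pid attributes were present in the HTML that requests receives (apparently not the case for this leaderboard, hence the JSON route above), you could select on the attribute itself rather than on the changing class. Here is a minimal sketch of that idea. It uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs, and the markup and pid values are made up for illustration. In Beautiful Soup, the equivalent filter would be soup.find_all('div', attrs={'data-pid': True}).

```python
from html.parser import HTMLParser

# Hypothetical markup, shaped like the leaderboard rows described in the question.
SAMPLE_HTML = """
<div class="leaderboard-item odd" data-pid="29221"></div>
<div class="leaderboard-item even" data-pid="33448"></div>
<div class="footer"></div>
"""


class DataPidCollector(HTMLParser):
    """Collect the data-pid value of every <div> that carries one."""

    def __init__(self):
        super().__init__()
        self.pids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "data-pid" and value is not None:
                self.pids.append(value)


def collect_pids(html):
    parser = DataPidCollector()
    parser.feed(html)
    parser.close()
    return parser.pids


print(collect_pids(SAMPLE_HTML))  # ['29221', '33448']
```

The class name plays no part in the match, so it does not matter that it differs from player to player.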

Related

I can't extract Instagram hashtags of a post with bs4

I wanted to extract hashtags from a specific post (given its URL) using BeautifulSoup4. First I fetch the page using requests, then I tried find_all() to get every hashtag, but there seems to be a hidden problem.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with the problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the webpage are not there yet when you send a GET request.
One way to figure out whether the webpage needs JavaScript to populate the information you want is to simply save the HTML to a file:
import requests
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
# Write the raw response to disk so it can be inspected in a browser
with open('dump.html', 'w', encoding='utf-8') as file:
    file.write(r.text)
and then open that file in a web browser.
If the file you open does not contain the information you want to scrape, then that information is most likely populated afterwards by JavaScript.
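The same check can be done without opening a browser, by searching the raw response for text you expect to see on the page. This is only a rough sketch: the function name and the sample HTML are invented for illustration, and a real response body would come from r.text.

```python
def missing_snippets(html, expected):
    """Return the expected snippets that do not appear in the raw HTML."""
    return [snippet for snippet in expected if snippet not in html]


# Hypothetical response body: only an empty container and a script tag are served.
raw_html = (
    "<html><body><div id='react-root'></div>"
    "<script src='app.js'></script></body></html>"
)

# '#sunset' stands in for a hashtag that is visible in the browser.
print(missing_snippets(raw_html, ["#sunset", "react-root"]))  # ['#sunset']
```

Anything reported as missing is a hint that the content is added by scripts after the initial request.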
To get around this you can render the JavaScript using either of the following:
1) A web driver (like Selenium) that simulates a user visiting those pages in a web browser.
2) requests-HTML, a relatively new package that lets you render JavaScript on a page and has many other features that are useful for web scraping.
More people work with Selenium, which makes debugging easier than with requests-HTML. But if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests, so picking it up should not be very difficult.

Python Webscraping with BeautifulSoup : It shows 0s instead of real values

I'm trying to extract the 'like' counts from a music chart website called Melon. The counts are visible in the browser and in the dev tools.
But in the page source, the tags that should hold the like counts contain just 0s.
So when I run my BeautifulSoup code, it only shows 0 values.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.melon.com/chart/#params%5Bidx%5D=51',
                    headers={'User-Agent': 'Chrome 77.0.3865.120'}).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('button > span.cnt')
How can I get the real values shown on the website instead of 0s?
I'm really shy about my coding and English skills, but I would really like to learn how to make a data analysis automation program. So I hope you can help a poor learner :)
Thanks!
As previously mentioned by Pavan, the webpage loads its content dynamically. To account for this, you can use Selenium, a browser automation tool. You can still pass the rendered HTML to BeautifulSoup afterwards if you wish to keep using soup selectors.
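A side observation about the URL in the question, separate from the dynamic loading itself: everything after the # is a URL fragment, which stays in the browser for the page's scripts and is not part of the HTTP request. So requests only ever asks for the base chart page. A small sketch with the standard library shows the split:

```python
from urllib.parse import unquote, urldefrag

url = "https://www.melon.com/chart/#params%5Bidx%5D=51"

# urldefrag separates the part that is actually requested from the fragment.
requested, fragment = urldefrag(url)

print(requested)          # https://www.melon.com/chart/
print(unquote(fragment))  # params[idx]=51
```

This does not fix the zero counts; it only shows that the params part never reaches the server, which fits the explanation that a script fills in the values afterwards.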

Beautiful Soup or Selenium?

I am fairly new to programming and I need a technical explanation for the questions below.
First of all, while I know my way around both "Beautiful Soup" and "Selenium" to a modest degree, I would like answers from experienced users, which are really hard to pull off the web or from texts.
I am able to get data from a website by opening the page via Selenium and then passing page_source to Beautiful Soup for parsing. Beautiful Soup on its own does not give me the HTML of the page; instead, I get the source code of the whole website, which does not include the desired HTML of the particular page, even though the link points directly to that page!
1) Is there a way of getting the page_source without Selenium, using only Beautiful Soup?
2) Can I use Selenium without opening the page in question? (Is there an equivalent to .get('http..') that will not physically open the link? I find this to be a nightmare when dealing with > 300 links!)
3) Is there another, more efficient, Pythonic way of doing this?
The code I am currently working with:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import os
from selenium.webdriver import chrome
driver = webdriver.Chrome(executable_path=r'C:chromedriver.exe')
url= "https.."
driver.get(url)
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source,"lxml")
print(soup.text)
Thank you all in advance.
The API approach, recommended in the comments above, is essentially to hijack the API calls made by the web page. If you go through the Network tab of your browser and find the request that fetches the data you are looking for, you can mimic the same request in Python.
Curl converter is a simple tool for this, with screenshots of what I mean.
Once you know the request being made, you can copy its headers so that the server thinks the request comes from the website itself.
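As a rough sketch of what mimicking a request can look like, the snippet below builds (but does not send) a GET request with copied query parameters and headers. The endpoint, parameters and header values are placeholders, not taken from any real site, and the helper name is made up. It uses urllib from the standard library so it runs as-is.

```python
import urllib.request
from urllib.parse import urlencode


def build_api_request(endpoint, params, headers):
    """Build (but do not send) a GET request that copies a browser's API call."""
    url = f"{endpoint}?{urlencode(params)}"
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values; copy the real ones from the Network tab.
request = build_api_request(
    "https://example.com/api/items",
    {"page": 1, "limit": 50},
    {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://example.com/items",
        "Accept": "application/json",
    },
)

print(request.full_url)
print(request.header_items())
```

Sending it would be a matter of urllib.request.urlopen(request). With requests, the same idea is requests.get(endpoint, params=params, headers=headers).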

Web scraping with Beautiful Soup (Not capturing all Information)

I have used the Beautiful Soup package a few times, but this is the first time it doesn't return all the information I need. How do I get the full webpage? I need to extract all the publications and the hyperlinks to the papers.
from bs4 import BeautifulSoup
import requests
url = 'https://openreview.net/group?id=ICLR.cc/2018/Conference'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')
There are other HTTP requests that are filling in the webpage.
A good way of seeing these is using the inspector provided in a web browser.
In Chrome, you can see these requests under the 'Network' tab in the inspector.
The requests are as follows:
GET https://openreview.net/notes?invitation=ICLR.cc%2F2018%2FConference%2F-%2FBlind_Submission&details=replyCount&offset=0&limit=1000
GET https://openreview.net/notes?invitation=ICLR.cc%2F2018%2FConference%2F-%2FWithdrawn_Submission&noDetails=true&offset=0&limit=1000
GET https://openreview.net/notes?invitation=ICLR.cc%2F2018%2FConference%2F-%2FAcceptance_Decision&noDetails=true&offset=0&limit=1000
It appears that each one returns JSON text with the information you are looking for (the publications and the hyperlinks to the papers),
so you can make a separate request for each of these URLs and access the returned JSON in the following manner:
import json
import requests
# new_url is one of the three URLs listed above
source = requests.get(new_url).text
# json.loads returns a Python dictionary
data = json.loads(source)
for publication in data['notes']:
    publication_info = publication['_bibtex']
    # Pull the link out of the url={...} field of the BibTeX entry
    url = publication_info.split('\nurl={')[1].split('}')[0]
The element containing the URL for each publication is rather difficult to parse, since it has characters that are not allowed in dictionary names (e.g. '#'),
but this solution should work.
Note that I have not tested it, so there might be some errors, but the underlying logic should be correct.
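For what it's worth, here is a small offline sketch of the two steps: building one of the request URLs listed above, and pulling the url field out of a BibTeX string with a regular expression instead of chained splits. The sample JSON is invented and follows the layout assumed in the code above, so the real response may be structured differently. The helper names are made up as well.

```python
import json
import re
from urllib.parse import urlencode


def notes_url(invitation, **extra):
    """Build a notes URL like the three requests listed above."""
    query = {"invitation": invitation, **extra, "offset": 0, "limit": 1000}
    return "https://openreview.net/notes?" + urlencode(query)


def bibtex_url(bibtex):
    """Return the url={...} field of a BibTeX entry, or None if it has none."""
    match = re.search(r"url=\{([^}]*)\}", bibtex)
    return match.group(1) if match else None


# Made-up response body; the real JSON layout may differ.
sample = json.loads(
    '{"notes": [{"_bibtex": "@misc{x,\\ntitle={A Paper},'
    '\\nurl={https://openreview.net/forum?id=abc123},\\n}"}]}'
)

print(notes_url("ICLR.cc/2018/Conference/-/Blind_Submission", details="replyCount"))
print([bibtex_url(note["_bibtex"]) for note in sample["notes"]])
```

Returning None for entries without a url field avoids the IndexError that the split-based version would raise on them.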
Alternatively:
You can use Splash, which renders JavaScript-based pages. You can run Splash in Docker quite easily and make HTTP requests to the Splash container, which returns HTML that looks just like the webpage as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up: you don't need to modify the Docker image at all, so no previous knowledge of Docker is required. A single line starts a local Splash server:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests in your Python code to route through Splash instead.
For example, http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
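A small helper can do this rewriting for you. The sketch below assumes Splash is listening on localhost:8050 as in the docker command above; the function name is made up. It percent-encodes the target URL, which matters once the target has a query string of its own, and optionally passes Splash's wait argument to give scripts time to finish.

```python
from urllib.parse import urlencode


def splash_render_url(target, splash="http://localhost:8050", wait=None):
    """Return the Splash render.html URL that fetches `target` after rendering."""
    params = {"url": target}
    if wait is not None:
        # Seconds Splash should wait after the page loads.
        params["wait"] = wait
    return f"{splash}/render.html?{urlencode(params)}"


print(splash_render_url("http://example.com/"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com%2F
```

The result can be passed to requests.get() like any other URL, and the returned HTML can be handed to Beautiful Soup as usual.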

Scrape all Text on a Webpage that is buried within Tags in Python 3

I need to scrape a webpage (https://www304.americanexpress.com/credit-card/compare) but I am running into an issue -- the text that I need on the front page is absolutely buried within many different formatting tags.
I know how to scrape a regular page using Beautiful Soup but this is not giving me what I want (i.e. text is missing, some tags make it through...)
import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ['https://www304.americanexpress.com/credit-card/compare']
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True))for s in soup.findAll('p')]
        for item in text:
            print (''.join([element.text for element in soup.body.find_all(lambda tag: tag != 'script', recursive=False)]))
Is there a special way to scrape this particular webpage?
This is just a regular webpage. For instance, <span class="card-offer-des"> contains the text "after you use your new Card to make $1,000 in purchases within the first 3 months." I also tried turning off JavaScript in the browser, and the text is still there, as it should be.
So I don't really see what the problem is. Also, I would suggest learning lxml and XPath. Once you know how they work, it's actually easier to get the text you want.
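If the complaint is that some tags (such as script contents) make it into the output while nested text goes missing, one option is to walk the whole document, keep every text node, and skip script and style. Here is a minimal sketch using the standard library's html.parser, with made-up markup. With Beautiful Soup, a comparable result should come from removing those tags with decompose() and then calling soup.get_text().

```python
from html.parser import HTMLParser

# Made-up markup imitating text nested inside several formatting tags.
SAMPLE_HTML = """
<body>
  <script>var tracking = "ignore me";</script>
  <div><p><span class="card-offer-des">after you use your new Card</span>
  <b>to make <i>$1,000</i> in purchases</b></p></div>
  <style>.x { color: red; }</style>
</body>
"""


class VisibleText(HTMLParser):
    """Gather text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(visible_text(SAMPLE_HTML))
# after you use your new Card to make $1,000 in purchases
```

Because the text is collected node by node, it does not matter how deeply it is buried in formatting tags.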
Inside the loop over the strings you pull from the site, the code you should try in Python is:
if "what-have-you" not in StringPulledFromSite:
    continue
# [your code to save to the filesystem]
And the string you should aim for would be something like:
((<span class=\") && (/>))
You should try to find both parts, and be specific enough that you can easily tell them apart from other matches. Once you've found both, save the string, test it, and then save the text.

Resources