I have just started using Python 3 and Beautiful Soup, and have obtained most of the visible information on the page using find() or find_all().
One of the pages I am looking at dynamically loads an item code (productPLU). It is hidden on the page you normally see, but it is contained within the HTML source, as below:
<div class="product ng-scope" ng-controller="ProductListItemController" ng-init="productPLU = '384353'">
Is it possible to scrape the productPLU information?
If so, what syntax would I use?
Thanks in advance.
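Since the ng-init attribute is part of the static HTML rather than text rendered later, one possible approach is to locate the div with Beautiful Soup and pull the value out of the attribute with a regular expression. A minimal sketch against the snippet above (the inline HTML is a stand-in for the real page):

```python
import re
from bs4 import BeautifulSoup

html = """<div class="product ng-scope" ng-controller="ProductListItemController"
     ng-init="productPLU = '384353'"></div>"""

soup = BeautifulSoup(html, 'html.parser')
# The code lives in the div's ng-init attribute, not in its visible text.
div = soup.find('div', attrs={'ng-controller': 'ProductListItemController'})
match = re.search(r"productPLU\s*=\s*'(\d+)'", div['ng-init'])
product_plu = match.group(1) if match else None
print(product_plu)  # 384353
```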
I wanted to extract hashtags from a specific post (given its URL) using BeautifulSoup 4. First I fetch the page using requests, and I have tried find_all() to get every hashtag, but it seems there is a hidden problem.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me find the problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the webpage are not there when you send a plain GET request.
One way to figure out whether the webpage you are scraping needs JavaScript to populate the info you want is to simply save the HTML into a file:
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
with open('dump.html', 'w+') as file:
    file.write(r.text)
and then open that file in a web browser.
If the file you open does not contain the information you want to scrape, then it is likely being populated automatically with JavaScript.
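The same check can be done in code: search the raw response for a piece of text you expect to see on the rendered page. The marker string and both sample sources below are made up for illustration:

```python
def needs_javascript(page_source: str, marker: str) -> bool:
    """True if the marker text is absent from the raw HTML,
    suggesting the content is filled in client-side."""
    return marker not in page_source

# A page whose content already ships in the raw HTML...
static_page = '<html><body><a href="/tags/cat/">#cat</a></body></html>'
# ...versus a JavaScript shell that only ships an empty mount point.
dynamic_page = '<html><body><div id="react-root"></div></body></html>'

print(needs_javascript(static_page, '#cat'))   # False
print(needs_javascript(dynamic_page, '#cat'))  # True
```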
To get around this, you can render the JavaScript using either of the following:
A web driver (like Selenium), which simulates a user visiting those pages in a real browser
requests-HTML, a somewhat newer package that lets you render the JavaScript on a page, and has many other features useful for web scraping
More people work with Selenium, which makes debugging easier than with requests-HTML. But if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests, and picking it up should not be difficult.
Sorry for bothering you with my request. I have started to get acquainted with web scraping using the BeautifulSoup library. Because I have to download some data from the OECD's websites, I wanted to try some web-scraping approaches. More specifically, I want to download a .csv file from the following page:
https://goingdigital.oecd.org/en/indicator/50/
As you can see, the data can easily be downloaded by clicking on 'Download data'. However, because I will have to deal with a recursive download loop, I tried to download it directly from the Python console. By inspecting the page, I found the download URL, which I have reported in the following picture:
Hence, I wrote the following code:
from bs4 import BeautifulSoup
import requests

url = 'https://goingdigital.oecd.org/en/indicator/50/'
response = requests.get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
containers = html_soup.find_all('div', {'class': 'css-cqestz e12cimw51'})
print(type(containers))
print(len(containers))
d = []
for a in containers[0].find_all('a', href=True):
    print(a['href'])
    d.append(a['href'])
The containers object is composed of three elements, since there are three divs with the specified class. The first one (the one I selected in the loop) should be the one containing the URL I am interested in. However, I get no result. Conversely, when I select the third element of containers, I get the following output:
https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://twitter.com/intent/tweet?text=OECD%20Going%20Digital%20Toolkit&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
mailto:?subject=OECD%20Going%20Digital%20Toolkit%3A%20Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet&body=Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet%0A%0Ahttps%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
By the way, I guess this download could be related to the following thread. Thank you in advance!
When you pull data from a website, you should first check whether the content you are looking for is in the page source. If it is not in the page source, you should try web scraping with Selenium.
When I examined the site you mentioned, I could not find the link in the page source, which shows that the link you want is created dynamically on this page.
I'm trying to extract 'like' count values on a music-chart website named Melon. The count values are visible in the browser and in the dev tools, like this.
But in the page source there are just 0s instead of the like counts, in the tags that should hold them, like this.
So when I run my BeautifulSoup code, it just shows 0 values.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.melon.com/chart/#params%5Bidx%5D=51',
headers={'User-Agent': 'Chrome 77.0.3865.120'}).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('button > span.cnt')
How can I get the real values shown on the website instead of 0s?
I'm really shy about my coding and English skills, but I would really like to learn how to make a data-analysis automation program. So I hope you can help a poor learner :)
Thanks!
As previously mentioned by Pavan, the webpage loads its content dynamically. To account for this, you can use Selenium, which is used for browser automation. You can still pass the HTML to BeautifulSoup afterwards if you want to keep using soup selectors.
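Once the page source comes from a rendered browser session, the selector from the question works as-is. A minimal sketch of just the soup step, using an invented stand-in for what driver.page_source would return after rendering (the markup and numbers are made up to match the question's selector):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after Selenium has rendered the page.
rendered = '''
<button><span class="cnt">184,557</span></button>
<button><span class="cnt">92,310</span></button>
'''

soup = BeautifulSoup(rendered, 'html.parser')
# Same CSS selector as in the question: span.cnt directly inside a button.
counts = [span.get_text() for span in soup.select('button > span.cnt')]
print(counts)  # ['184,557', '92,310']
```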
I am trying to scrape the results after searching for 'Jaipur Pink Panthers' on Google, or by directly visiting the prokabaddi website. The target is to scrape the table that pops up when you click on any match, giving the total score spread for the entire match.
I have tried using Beautiful Soup and Selenium, but I end up reading nothing with the div class values. Any help in this regard is highly appreciated.
What I have tried as of now is as follows (PS: I am absolutely new to Python):
Attempt1:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(driver.page_source,"lxml")
for item in soup.select('.sipk-lb-playerName'):
    [elem.extract() for elem in soup("span")]
    print(item.text)
driver.quit()
Attempt2:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='sipk-lb-playerName')
Little Background
Websites such as this one are built to make the user's life easy by sending only the content you need at that point in time.
As you move around the website and click on something, the remaining data is sent back to you. So it basically works like a demand-based interaction between you and the server.
What is the issue in your code?
In your first approach, you get an empty div list even though you can see that element in the HTML source. The reason is that the element is only listed after you click the Player tab on the web page; clicking generates new HTML content at that point, and that is why you then see it.
How to do it?
You need to simulate the clicking of that button before sending the HTML source to BeautifulSoup. So, first find the button using the find_element_by_id() method, then click it.
element = driver.find_element_by_id('player_Btn')
element.click()
Now, you have the updated html source in your driver object. Just send this to BeautifulSoup constructor.
soup = BeautifulSoup(driver.page_source, 'html.parser')
You do not need the lxml parser for this; the built-in html.parser is enough. Now you can look for the specific class and get all the names (which I have done here).
soup.find_all('div', attrs={'class': 'sipk-lb-playerName'})
Voila! You can store the returned list and get only the names formatted as you want.
I am attempting to scrape the following website using Beautiful Soup in Python 3.
https://www.pgatour.com/competition/2017/safeway-open/leaderboard.html
Each player has a data-pid number associated with them, and the xpath looks like so:
Since the class is not constant, and changes with each player, I am having trouble extracting the div.
I have tried to use this after parsing the html, but without luck.
soup.find_all('div', {'class': 'leaderboard-item'})
Essentially, the output should simply be a list of the numbers within the data-pids. Would very much appreciate any help.
You can use the requests lib:
import requests

data = requests.get('https://statdata.pgatour.com/r/464/2017/player_stats.json').json()
pids = [player['pid'] for player in data['tournament']['players']]
I couldn't find a way to parse this with Beautiful Soup. I found the JSON link above using Chrome developer tools, in the Network tab.
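The extraction step can be exercised offline against a stub of the JSON shape the snippet above relies on (the 'tournament'/'players'/'pid' field names come from that code; the pid values here are invented):

```python
# Minimal stand-in for the player_stats.json payload.
sample = {
    'tournament': {
        'players': [
            {'pid': '20229'},
            {'pid': '31646'},
        ]
    }
}

# Same list comprehension as in the answer, applied to the stub.
pids = [player['pid'] for player in sample['tournament']['players']]
print(pids)  # ['20229', '31646']
```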