I'm trying to extract 'like' count values from a music-chart website named Melon. The counts are visible in the browser and in dev tools, but in the page source the tags that should hold the like counts just contain 0s.
So when I run my BeautifulSoup code, it only shows 0 values.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.melon.com/chart/#params%5Bidx%5D=51',
                    headers={'User-Agent': 'Chrome 77.0.3865.120'}).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('button > span.cnt')
How can I get the real values shown on the website instead of 0s?
I'm not confident about my coding and English skills, but I really would like to learn how to build a data-analysis automation program, so I hope you can help a beginner :)
Thanks!
As previously mentioned by Pavan, the webpage loads its content dynamically. To account for this, you can use Selenium, which is a browser-automation tool. You can still pass the rendered HTML to BeautifulSoup afterwards if you want to keep using soup selectors.
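A minimal sketch of that approach, reusing the URL and the 'button > span.cnt' selector from your question (the ChromeDriver setup and the fixed sleep are assumptions on my part, so adjust them to your environment):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.melon.com/chart/#params%5Bidx%5D=51')
time.sleep(3)  # crude wait so the like counts have time to load

# Hand the rendered page over to BeautifulSoup and reuse the same selector
soup = BeautifulSoup(driver.page_source, 'html.parser')
for cnt in soup.select('button > span.cnt'):
    print(cnt.get_text(strip=True))

driver.quit()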
I am scraping the text from https://www.basketball-reference.com/players/p/parsoch01.html, but I cannot scrape the content located below the "Total" table on the page. I want to get the numbers from the "Total" and "Advanced" tables, but the code returns nothing.
It seems that the page loads additional information as the user scrolls down.
I ran the code below and succeeded in getting the data from the player's profile section and the "Per Game" table, but I cannot get values from the "Total" table.
from lxml import html
import urllib

playerURL = urllib.urlopen("https://www.basketball-reference.com/players/p/parsoch01.html")
playerPage = html.parse(playerURL)

# Use xpath to parse points per game.
ppg = playerPage.xpath('//tr[@id="per_game.2019"]//td[@data-stat="pts_per_g"]//text()')[0]  # this succeeds and returns the value
total = playerPage.xpath('//tr[@id="totals.2019"]//td[@data-stat="fga"]//text()') // I expect 182 to be returned, but nothing is returned.
Is there any way to get data from the lower part of this page?
It's because the content you want to extract from that site is inside HTML comments. BeautifulSoup doesn't parse commented-out markup as elements, so to get the result you need to strip the comment markers first so that BeautifulSoup can access it. The following script does exactly that:
import requests
from bs4 import BeautifulSoup

URL = "https://www.basketball-reference.com/players/p/parsoch01.html"
r = requests.get(URL).text

# Strip the comment markers from the html so that BeautifulSoup can access the commented-out tables
comment = r.replace("-->", "").replace("<!--", "")

soup = BeautifulSoup(comment, "lxml")
total = soup.select_one("[id='totals.2019'] > [data-stat='fga']").text
print(total)
Output:
182
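For what it's worth, a sketch of an alternative that avoids the string replacement by pulling the comment nodes out with bs4's Comment type and re-parsing them (same URL and selector as above; this is just an illustration, not the only way to do it):

import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.basketball-reference.com/players/p/parsoch01.html"
soup = BeautifulSoup(requests.get(URL).text, "lxml")

# Collect every HTML comment and parse its contents as markup
for node in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(node, "lxml")
    cell = inner.select_one("[id='totals.2019'] > [data-stat='fga']")
    if cell:
        print(cell.text)  # 182 at the time the question was asked
        break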
Open your web browser's console and test the xpath to see whether it finds the element you're looking for.
$x("//tr[@id='totals.2019']//td[@data-stat='fga']//text()")
returns an Array object.
$x("//tr[@id='totals.2019']//td[@data-stat='fga']//text()")[0]
accesses the value you want.
Also: comments in Python start with '#', not '//'.
I am working on scraping the actual data of a graph from a site, but this data is in JavaScript code and stored in a list. Please tell me how to scrape this data using Python.
See the linked image of the HTML page: it shows a script tag, and inside this tag there is a column[] list where the data is stored.
Please send me a solution to this problem.
This is my Python code:
from bs4 import BeautifulSoup
import urllib.request

# This is not the original url; the post links an image of the html page instead.
urlpage = 'http://www.stockgraph.com/'

page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
script = soup.find('script', attrs={'class': 'col-md-9 col-md-push-3'})
print(script)
The code above opens the URL and finds the script tag, but I can't scrape the JavaScript code itself. My data is inside the script tag, stored in a list, so how do I scrape it? Please tell me the solution.
To get you off in the right direction, I will try to guide you through what you need to do.
First, you need something to read your webpage, like urllib:
import urllib.request

response = urllib.request.urlopen("http://google.com")
page_source = response.read()
You will then need to parse this source using another module, like BeautifulSoup.
Follow some documentation to get you started on scraping your website:
https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
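Since your data lives inside a script tag rather than in the HTML itself, one common approach is to grab the script text and pull the array out with a regular expression. A rough sketch, assuming the array really is assigned to a variable named column as described in your post (the variable name, the url, and the surrounding markup are assumptions, so adjust the pattern to match your page):

import json
import re
import urllib.request

from bs4 import BeautifulSoup

page = urllib.request.urlopen('http://www.stockgraph.com/')  # placeholder url from the question
soup = BeautifulSoup(page, 'html.parser')

for script in soup.find_all('script'):
    text = script.string or ''
    # Look for something like: column = [ ... ];  (variable name assumed from the question)
    match = re.search(r'column\s*=\s*(\[.*?\])', text, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # works only if the array is valid JSON
        print(data)
        break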
I am trying to scrape the results after searching for 'Jaipur Pink Panthers' on Google or directly visiting the prokabaddi website. The target is to scrape the table that pops up when you click on any match, which provides the total score spread for the entire match.
I have tried using Beautiful Soup and Selenium, but I end up reading nothing from the div class values. Any help in this regard is highly appreciated.
What I have tried so far is as follows (PS: I am absolutely new to Python):
Attempt 1:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')

soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.select('.sipk-lb-playerName'):
    [elem.extract() for elem in soup("span")]
    print(item.text)

driver.quit()
Attempt 2:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='sipk-lb-playerName')
Little Background
Websites such as these are built to make the user's life easy by sending only the content you need at that point in time.
As you move around the website and click on things, the remaining data is sent to you, so it basically works as a demand-based interaction between you and the server.
What is the issue in your code?
In your first approach, you get an empty div list even though you can see that element in the html source. The reason is that the element only gets listed after you click the Player tab on the web page; the new html content is generated at that point, and that is why you see it in the browser.
How to do it?
You need to simulate clicking that button before sending the html source to BeautifulSoup. So, first find the button using the find_element_by_id() method, then click it.
element = driver.find_element_by_id('player_Btn')
element.click()
Now, you have the updated html source in your driver object. Just send this to BeautifulSoup constructor.
soup = BeautifulSoup(driver.page_source)
You do not need an lxml parser for this. Now, you can look for the specific class and get all the names (which I have done here).
soup.findAll('div',attrs={'class':'sipk-lb-playerName'})
Voila! You can store the returned list and get only the names formatted as you want.
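Putting those pieces together, a minimal sketch of the full flow (the 'player_Btn' id and the 'sipk-lb-playerName' class come from the answer above; the short sleep is my own addition as a simple way to let the new content load, and it assumes an older Selenium release where find_element_by_id is still available; newer versions spell it find_element(By.ID, ...)):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')

# Simulate the click that loads the player listing
driver.find_element_by_id('player_Btn').click()
time.sleep(2)  # crude wait for the new content to be rendered

soup = BeautifulSoup(driver.page_source, 'html.parser')
names = [div.get_text(strip=True)
         for div in soup.findAll('div', attrs={'class': 'sipk-lb-playerName'})]
print(names)

driver.quit()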
I am attempting to scrape the following website using Beautiful Soup in Python 3.
https://www.pgatour.com/competition/2017/safeway-open/leaderboard.html
Each player has an associated data-pid number in the markup. As the class is not constant and changes with each player, I am having trouble extracting the div.
I have tried to use this after parsing the html, but without luck:
soup.find_all('div', {'class': 'leaderboard-item'})
Essentially, the output should simply be a list of the numbers within the data-pid attributes. I would very much appreciate any help.
You can use the requests lib:
import requests
json = requests.get('https://statdata.pgatour.com/r/464/2017/player_stats.json').json()
pids = [player['pid'] for player in json['tournament']['players']]
I couldn't find a way to parse it using Beautiful Soup. I found the JSON link above using Chrome developer tools, in the Network tab.
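If you still want to pull the IDs out of the downloaded HTML instead, a sketch of what that could look like with BeautifulSoup (this assumes the leaderboard divs are actually present in the HTML you download and carry a data-pid attribute, which may not hold if the page builds them client-side):

import requests
from bs4 import BeautifulSoup

url = 'https://www.pgatour.com/competition/2017/safeway-open/leaderboard.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Match on the attribute itself instead of the ever-changing class names
pids = [div['data-pid'] for div in soup.find_all('div', attrs={'data-pid': True})]
print(pids)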
I need to scrape a webpage (https://www304.americanexpress.com/credit-card/compare) but I am running into an issue -- the text that I need on the front page is absolutely buried within many different formatting tags.
I know how to scrape a regular page using Beautiful Soup but this is not giving me what I want (i.e. text is missing, some tags make it through...)
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(''.join([element.text for element in soup.body.find_all(lambda tag: tag != 'script', recursive=False)]))
Is there a special way to scrape this particular webpage?
This is just a regular webpage. For instance, <span class="card-offer-des"> contains the text "after you use your new Card to make $1,000 in purchases within the first 3 months." I also tried turning off JavaScript in the browser, and the text is still there, as it should be.
So I don't really see what the problem is. Also, I would suggest that you try to learn lxml and xpath; once you know how they work, it's actually easier to get the text you want.
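For example, a small sketch of what that could look like with lxml (the card-offer-des class comes from the snippet above; the exact markup of the page is an assumption, so the xpath may need tweaking):

import requests
from lxml import html

page = requests.get('https://www304.americanexpress.com/credit-card/compare')
tree = html.fromstring(page.content)

# Grab the text of every span with the card-offer-des class
offers = tree.xpath('//span[@class="card-offer-des"]/text()')
for offer in offers:
    print(offer.strip())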
The code you should try with Python is something like:
if "what-have-you" not in StringPulledFromSite:
    continue
# your code to save to the filesystem
And the strings you should aim for would be an opening marker like <span class=" and a closing marker like />. You should try to find both (and be specific, so that you can easily tell them apart). Once you've found both, save the string between them, test it, and save the text.
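As a concrete (if crude) sketch of that string-searching idea, assuming the page source is already in a string and using the card-offer-des span from the other answer as the marker (the exact markers are assumptions about the page's markup):

import requests

page_source = requests.get('https://www304.americanexpress.com/credit-card/compare').text

start_marker = '<span class="card-offer-des">'
end_marker = '</span>'

start = page_source.find(start_marker)
if start != -1:
    start += len(start_marker)
    end = page_source.find(end_marker, start)
    print(page_source[start:end].strip())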