I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup
url = 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)
    except requests.RequestException:
        pass
print(res)
soup = BeautifulSoup(res.text, 'lxml')
songs = soup.find_all('meta', {'property': 'music:song'})
print(songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL inside the content attribute as a string, so that I can use that URL later in my program.
Can someone please help me?
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but each element of that list is a Tag, which behaves like a dictionary when you access its attributes.
As explained in this StackOverflow post, you need to index not only songs[0] but also the attribute within it (the attribute name acts as the key of a key-value pair, the chief way to get data out of a dictionary).
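For example (a small sketch reusing the songs list from the question):
song_url = songs[0]['content']  # dictionary-style lookup of the content attribute
print(song_url)  # -> https://gaana.com/song/o-saathi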
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you might consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn XPath expressions, which are sort of like regex for XML/HTML; but for advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
I have very little experience but a lot of persistence. One of my hobbies is football, and I help out at a local team. I'm a big fan of statistics, and unfortunately the only way to collect data from the local football federation is by web scraping. I have seen that Python with the beautifulsoup package can help, but I cannot identify the tags, as I believe these are in a table.
My goal is to automatically collect the information to build up a database with players, fixtures, teams, referees, and so on, and to build up stats such as how many times a player has been in the starting line-up, when teams are more likely to score a goal, and when they are more likely to concede one.
I have two links for reference.
The first is for the general fixtures of a given group.
The second is for the details of a given match.
Any clues on where to start with the parsing would be great.
You should first get into web scraping using Python libraries like requests to contact the page you want. Then, depending on the page, use bs4 (BeautifulSoup) to find what you are looking for inside the page. Then transform and refine that information into variables (lists or DataFrames, based on your needs) and finally save those variables into a dataset. Here is an example:
import requests
from bs4 import BeautifulSoup  # assuming you already have bs4 installed
page = requests.get('http://www.DESTINATIONSITE.com/BLAHBLAH')  # requests needs the http:// scheme
soup = BeautifulSoup(page.text, 'html.parser')
soup_str = soup.text  # the page's text content is already a str
Up to this point, you have the text of the entire page at hand as one string. Now you want to use regular expressions, or any other Python code, to separate the information you want from soup_str and store it in variable(s).
For the rest of the process, if you want to save that data, I suggest you look into libraries like pandas.
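For instance, here is a minimal sketch of that saving step, assuming you have already refined the scraped text into rows (the column names are invented for illustration):
import pandas as pd
rows = [
    {'player': 'Player A', 'minute': 12},  # hypothetical refined data
    {'player': 'Player B', 'minute': 57},
]
df = pd.DataFrame(rows)
df.to_csv('scraped_data.csv', index=False)  # persist the dataset for later use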
I have made some progress:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
# send request
url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
# read acta
acta_text = []
acta_text_element = soup.find_all(class_='acta-table')
for item in acta_text_element:
    acta_text.append(item.text)
When I print the items, I get many \n characters.
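One possible cleanup (a sketch reusing acta_text_element from the snippet above) is to let get_text collapse the whitespace:
acta_text = []
for item in acta_text_element:
    # separator=' ' joins the text fragments with spaces; strip=True drops the \n padding
    acta_text.append(item.get_text(separator=' ', strip=True))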
Mahdi_J, thanks for the initial response. I have already started on the approach:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c,"html.parser")
What I need is some support on where to start with the parsing. From the URL I need to collect the players of both teams, the goals, and the bookings, so I can save them elsewhere. What I'm having trouble with is how to parse them. I have very little skill and need a starting point.
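As one possible starting point (a sketch, assuming the players, goals, and bookings really live in plain HTML tables on that page), pandas can lift every table out directly:
import pandas as pd
tables = pd.read_html(r.text)  # parses each <table> on the page into a DataFrame
for t in tables:
    print(t.head())  # inspect each table to locate players, goals, bookings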
I'm trying to scrape some data using the beautifulsoup and requests libraries in Python 3.7. For each of the items (article tags) on this webpage there is a YouTube link. After finding all the instances of article, I can successfully extract the headlines. This code also successfully finds instances of the youtube-player class inside each article, except at index 7, where the output is None.
from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')
for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)
However, from the source (the output of BeautifulSoup), if I directly search for youtube-player, I get all the instances correctly.
links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)
How can I improve my code to get all the youtube-player instances within the article loop?
You can use the zip() built-in function to tie titles and YouTube links together.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))
Prints:
Git: Difference between “add -A”, “add -u”, “add .”, and “add *” https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015 https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
EDIT: It seems that when you use html.parser, BeautifulSoup doesn't recognize the YouTube link in one place; use lxml or html5lib instead:
import requests
from bs4 import BeautifulSoup
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")
for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src': ''}
    print('{:<75}{}'.format(title.text, player['src']))
(Disclaimer: I'm a newbie, so I'm sorry if this problem is really obvious.)
Hello,
I built a little script to first find certain parts of the HTML markup within a local file and then display that information without the HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
titleResults = soup.find_all('span', attrs={'class': 'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far. But I want to do this for all values of titleResults, not only the first one, and I can't call get_text on a whole list.
What would be the best way to accomplish this? The number of values in titleResults is always changing, since the local HTML file is only a sample.
Thank you in advance!
P.S. I already looked at this related thread, but sadly it wasn't enough to understand or solve the problem:
BeautifulSoup get_text from find_all
find_all returns a list, so you can simply iterate over it:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
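Equivalently, if you want to keep the stripped strings around for later use, a list comprehension does the same thing in one line:
allStripped = [result.get_text() for result in titleResults]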
The following code is used to print on screen the a tags of html_doc, a variable that contains HTML code:
from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
The following code returns the same result:
for a in BeautifulSoup(html_doc, "html.parser").find_all("a"):
    print(a.prettify())
What is the difference between using SoupStrainer and the find_all() function?
Could we use both SoupStrainer and find_all() together?
I found the following but cannot understand what it does:
BeautifulSoup(response, parse_only=SoupStrainer("a", href=True)).find_all("a")
find_all searches the parsed tree and returns a list of matching Tag objects; SoupStrainer limits what gets parsed into the soup in the first place.
To iterate over each of the anchors in the page and perform some action on each one, you'd use find_all. If you want only the anchors parsed and everything else on the page discarded, you'd use SoupStrainer. Using them together just makes things a little more efficient.
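For example, the combined form from the question might look like this (a sketch; the URL is a placeholder):
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://www.example.com')  # placeholder URL
only_links = SoupStrainer('a', href=True)  # parse nothing but <a> tags that have an href
soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_links)
for a in soup.find_all('a'):  # search the already-reduced soup
    print(a['href'])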
I'm new to Python so I'm sorry if this is a newbie question.
I'm trying to build a program involving web scraping, and I've noticed that Python 3 seems to have significantly fewer web-scraping modules than the Python 2.x series.
Beautiful Soup, mechanize, and scrapy -- the three modules recommended to me -- all seem to be incompatible with Python 3.
I'm wondering if anyone on this forum has a good option for webscraping using python 3.
Any suggestions would be greatly appreciated.
Thanks,
Will
lxml.html works on Python 3 and gets you HTML parsing, at least.
BeautifulSoup 4, which is in the works, should support Python 3 (I've done some work on this).
I'm kind of new to this, but I found BeautifulSoup 4 to be really good, and I'm learning and using it with the requests and lxml modules. The requests module is for getting the URL, and lxml is for parsing (you can also use the built-in html.parser, but lxml is faster, I think).
Simple usage is:
import requests
from bs4 import BeautifulSoup
url = 'someUrl'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
Next, a simple example of how to get the hrefs from the HTML:
links = set()
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        links.add(link['href'])  # store the href value itself, so the set holds unique strings
Then you will get a set of the unique links from your URL.
Another example of how you can parse specific parts of the HTML, e.g. if you wish to parse all <p> tags that have a class of testClass:
list_of_p = []
for p in soup.find_all('p', {'class': 'testClass'}):
    for item in p:  # iterates over the children of each <p> tag
        list_of_p.append(item)
And there is much more you can do with it, just as easily.