(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script to first find certain parts of the HTML markup within a local file and then display the information without the HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
titleResults = soup.find_all('span', attrs={'class':'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far. But I want to do this for all values of titleResults, not only the first value. But I can't process an array with get_text.
Which way would be best to accomplish this? The number of values for titleResults is always changing since the local html file is only a sample.
Thank you in advance!
P.S. I already looked up this related thread but it is not enough for understanding or solving the problem sadly:
BeautifulSoup get_text from find_all
find_all returns a list (a ResultSet of Tag objects), so loop over it and call get_text() on each element:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
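If you would rather collect the stripped strings than print them one at a time, a list comprehension does the same loop in one line. This is only a minimal sketch: the inline sample_html string is made up to stand in for the local file, and the built-in "html.parser" is used in place of lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the local HTML file.
sample_html = """
<div><span class="caption-subject">First title</span></div>
<div><span class="caption-subject"> Second title </span></div>
<div><span class="other">Ignored</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
title_results = soup.find_all("span", attrs={"class": "caption-subject"})

# Works for any number of matches, including zero.
titles = [result.get_text(strip=True) for result in title_results]
print(titles)
```

Because the comprehension simply walks whatever find_all returned, it does not matter how many matches the real file contains.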
Related
I am fairly new to using python to collect data from the web. I am interested in writing a script that collects data from an xml webpage. Here is the address:
https://www.w3schools.com/xml/guestbook.asp
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("/guestbook/guest/fname")
print(guest)
I am not certain why this is returning an empty list. I've tried numerous syntax in the xpath statement, so I'm losing confidence my overall structure is correct.
For context, I want to write something that will parse the entire xml webpage and return a csv that can be used within other programs. I'm starting with the basics to make sure I understand how the various packages work. Thank you for any help.
This should do it
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("//guestbook/guest/fname")
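for i in guest:
    print(i.text)
In the XPath, you need a double slash (//) at the beginning. A single leading slash anchors the path at the document root, while // matches guestbook at any depth. Also, xpath() returns a list of elements. The text of each element can be read from its .text attribute.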
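To see the difference between the two paths without hitting the network, here is a small sketch that runs both against an inline document. The sample_page bytes and the names in it are invented for illustration, and the real page may be structured differently.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page content.
sample_page = b"""<html><body>
<guestbook>
  <guest><fname>Ada</fname><lname>Example</lname></guest>
  <guest><fname>Linus</fname><lname>Sample</lname></guest>
</guestbook>
</body></html>"""

tree = html.fromstring(sample_page)

# Absolute path: only matches if <guestbook> is the document root.
absolute = tree.xpath("/guestbook/guest/fname")
# Descendant path: matches <guestbook> anywhere in the tree.
anywhere = tree.xpath("//guestbook/guest/fname")

first_names = [element.text for element in anywhere]
print(absolute)
print(first_names)
```

Here the root element is html, so the absolute path finds nothing, while the // version finds both fname elements.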
I am trying to scrape a web site using python and beautiful soup. The goal is to build a csv file, with the relevant information(location, unit size, rent...)
I am not 100% sure what the problem is, but I think it has to do with the structure of the class: "result matches_criteria_and_filters first_listing highlighted"
First part of the code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
c = r.content
After that I would need the class= result matches_criteria_and_filters first_listing highlighted. Here I am not able to do it.
Solutions that I found in other threads were not working.
soup.select("result.matches_criteria_and_filters.first_listing.highlighted")
Another possibility I found is to search for the classes separately, but it did not work.
soup.find_all(attrs={'class': 'result'})
soup.find_all(attrs={'class': 'matches_criteria_and_filters'})
Everything I tried gave empty results or None.
First try getting the parent div with code similar to the following:
soup = BeautifulSoup('yourhtml', 'lxml')
results_div = soup.find('div', {'id':'results'})
# now iterate through all child divs
# and do whatever you want with each one
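A rough sketch of that parent-then-children approach follows. The markup is invented for illustration (the real page may look different or may build its listings with JavaScript), and the "address" class is a hypothetical name. It also shows that a CSS selector for several classes needs a leading dot on each class name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
sample_html = """
<div id="results">
  <div class="result matches_criteria_and_filters first_listing highlighted">
    <span class="address">Address A</span>
  </div>
  <div class="result matches_criteria_and_filters">
    <span class="address">Address B</span>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results_div = soup.find("div", {"id": "results"})

# Direct child divs only (recursive=False skips deeper descendants).
children = results_div.find_all("div", recursive=False)
addresses = [child.find("span", class_="address").get_text() for child in children]

# A CSS selector needs a leading dot for every class name.
highlighted = soup.select("div.result.first_listing.highlighted")

print(addresses)
print(len(highlighted))
```

If results_div comes back as None on the real page, the container is not in the downloaded HTML at all, and no selector will find its children.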
I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup
url= 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        # retry with the url variable, not the literal string 'url'
        res = requests.get(url)
    except requests.RequestException:
        pass
print(res)
soup = BeautifulSoup(res.text,'lxml')
songs = soup.find_all('meta',{'property':'music:song'})
print (songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL within content as a string so that I can use it further in my program.
Someone please Help me.
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case each item in that list is a Tag, whose attributes you look up the same way you would look up a key in a dictionary.
As explained in this StackOverflow post, you need to query not only songs[0] but also the attribute name on that tag, that is songs[0]['content'] (the name and its value together are called a key-value pair, and looking up by key is the chief way to get data out of a dictionary).
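As a small sketch of that lookup, the snippet below parses the meta tag from the sample output in the question instead of fetching the live page. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The tag from the sample output in the question.
sample_html = '<meta content="https://gaana.com/song/o-saathi" property="music:song"/>'

soup = BeautifulSoup(sample_html, "html.parser")
songs = soup.find_all("meta", {"property": "music:song"})

song_url = songs[0]["content"]      # raises KeyError if the attribute is missing
safe_url = songs[0].get("content")  # returns None instead of raising
all_urls = [song["content"] for song in songs]
print(song_url)
```

The value is an ordinary Python string, so it can be passed straight to requests.get or stored in a list.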
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn Python-variety Xpaths, which are sort of like regex for XML/HTML; but for advanced scraping it's probably the last best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
The following code is used to print on screen the tags of the html_doc, which is a variable that contains html code:
from bs4 import BeautifulSoup, SoupStrainer
only_a_tags = SoupStrainer("a")
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
The following code returns the same result:
print(BeautifulSoup(html_doc, "html.parser").find_all("a").prettify())
What is the difference between using the SoupStrainer and the find_all() function?
Could we use both SoupStrainer and find_all()?
I found the following but can not understand what it does:
BeautifulSoup(response,parse_only=SoupStrainer("a",href=True)).find_all("a")
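find_all returns a ResultSet, a list of the matching Tag objects (not navigable strings), found by searching a tree that has already been parsed in full. Because the result is a list, prettify() has to be called on each element rather than on the result itself. SoupStrainer works earlier, at parse time: it limits the soup to the elements that match its parameters, and everything else is never added to the tree.
To iterate over each of the anchors in the page and perform some action on each one, you'd use find_all. If you want to discard everything from the page except the anchors while parsing, you'd use SoupStrainer. Using them together, as in your last snippet, just makes it a little more efficient: the strainer keeps only the <a> tags that have an href, and find_all then returns them as a list.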
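The difference is easier to see side by side. In this sketch html_doc is a made-up document, and "html.parser" is used because parse_only is supported by it (html5lib does not support it).

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document standing in for html_doc.
html_doc = """
<html><body>
<p>Intro <a href="/one">one</a></p>
<p><a name="anchor-only">no href</a> and <a href="/two">two</a></p>
</body></html>
"""

# find_all: parse everything, then search the finished tree.
full_soup = BeautifulSoup(html_doc, "html.parser")
all_links = full_soup.find_all("a")

# SoupStrainer: only matching tags are put into the tree at all.
only_links_with_href = SoupStrainer("a", href=True)
small_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links_with_href)
hrefs = [a["href"] for a in small_soup.find_all("a")]

print(len(all_links))
print(small_soup.find("p"))
print(hrefs)
```

The full soup still contains the paragraphs and all three anchors, while the strained soup holds only the two anchors that carry an href.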
I am a design researcher. I have several .txt files which contain 75-100 quotations to which I have given various tags like so:
<q 69_A F exercises positive> Well I think it’s very good. I thought that the exercises that Rosy did was very good. I looked at it a few times. I listened and I paid attention but I didn’t really do it on the regular. I didn’t do the exercises on a regular basis. </q>
I am trying to list all the tags ("69_a", "exercises", "positive") by using BeautifulSoup. But instead of giving me an output which looks like this:
69_a
exercises
positive
It is giving me an output which looks like this:
q
q
q
q
Finished...
Can you please help me fix this? I have a lot of qualitative data that I want to put through this. The objective is to export all the quotes to a .xlsx file and sort using pivot tables.
from bs4 import BeautifulSoup
file_object = open('Angela_Q_2.txt', 'r')
soup = BeautifulSoup(file_object.read(), "lxml")
tag = soup.findAll('name')
for tag in soup.findAll(True):
    print(tag.name)
print('Finished')
What you want to list are called attributes, not tags. To access a tag's attributes, use its .attrs dictionary.
Use it as shown below:
from bs4 import BeautifulSoup
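contents = '<q tag1 tag2>Quote1</q>some other text<q tag1 tag3>quote2</q>'
# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(contents, "html.parser")
for tag in soup.findAll('q'):
    print(tag.attrs)
    print(tag.contents)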
print('Finished')
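Since the goal is one row per quotation, it may help to pull out just the attribute names together with the quote text. The sketch below assumes a shortened, made-up version of one tagged quotation. Note that "html.parser" lower-cases attribute names, so 69_A comes out as 69_a.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of one tagged quotation.
contents = "<q 69_A F exercises positive> Well I think it's very good. </q>"

soup = BeautifulSoup(contents, "html.parser")

rows = []
for tag in soup.find_all("q"):
    labels = list(tag.attrs)  # attribute names only, in document order
    quote = tag.get_text(strip=True)
    rows.append((labels, quote))

print(rows)
```

Each entry in rows pairs the list of labels with its quotation, which is a convenient shape to hand to the csv module or a spreadsheet library later.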