Beautiful soup: List all Attributes - python-3.x

I am a design researcher. I have several .txt files which contain 75-100 quotations to which I have given various tags like so:
<q 69_A F exercises positive> Well I think it’s very good. I thought that the exercises that Rosy did was very good. I looked at it a few times. I listened and I paid attention but I didn’t really do it on the regular. I didn’t do the exercises on a regular basis. </q>
I am trying to list all the tags ("69_a" "exercises" "positive") using BeautifulSoup. But instead of giving me output that looks like this:
69_a
exercises
positive
It is giving me an output which looks like this:
q
q
q
q
Finished...
Can you please help me fix this? I have a lot of qualitative data that I want to put through this. The objective is to export all the quotes to a .xlsx file and sort using pivot tables.
from bs4 import BeautifulSoup
file_object = open('Angela_Q_2.txt', 'r')
soup = BeautifulSoup(file_object.read(), "lxml")
tag = soup.findAll('name')
for tag in soup.findAll(True):
    print(tag.name)
print('Finished')

What you want to list are called attributes, not tags. To access a tag's attributes, use its .attrs property, which holds them as a dictionary.
Use it as shown below:
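from bs4 import BeautifulSoup
contents = '<q tag1 tag2>Quote1</q>some other text<q tag1 tag3>quote2</q>'
# Name a parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(contents, 'html.parser')
for tag in soup.findAll('q'):
    print(tag.attrs)     # dictionary of the tag's attributes
    print(tag.contents)  # list of the tag's children (here, the quote text)
print('Finished')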
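Applied to a quotation like the one in the question, this could look roughly like the sketch below. The helper name list_quote_tags and the shortened sample string are made up for illustration. Note that html.parser lowercases attribute names and gives value-less attributes an empty string, so the keys of .attrs are the "tags" you are after.

```python
from bs4 import BeautifulSoup

def list_quote_tags(markup):
    """Return (attribute_names, quote_text) for every <q> element (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for q in soup.find_all("q"):
        # The keys of .attrs are the attribute names written inside <q ...>
        rows.append((list(q.attrs), q.get_text(strip=True)))
    return rows

# Shortened stand-in for one line of the .txt file
sample = "<q 69_A F exercises positive> Well I think it's very good. </q>"
rows = list_quote_tags(sample)
for names, text in rows:
    print(names, text)
```

Each row pairs the attribute names with the quote text, which is a convenient shape to write out to a spreadsheet later.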

Related

Why does find_next_sibling in bs4 work on one line of code but not another, very similar, line of code?

I'm writing a simple web scraper to get data from the Texas Commission on Environmental Quality (TCEQ) website. The info I need is inside 'td' tags. I'm scraping the appropriate 'td' by referencing the preceding 'th', which all have the same text used to ID. I'm using find_next_sibling to scrape the data into a variable.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
###This one works
report = soup.find("th", text="Incident Tracking Number:").find_next_sibling("td").text
###This one doesn't
owner = soup.find("th", text="Name of Owner or Operator:").find_next_sibling("td").text
I'm getting this error: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'. This code has several lines like the two above, and, like them, some of them work and some of them don't. I've looked into the HTML to see if there's another tag, but I'm not seeing it if it's there. Please and thank you for any help!
When using the text parameter, you should make sure you provide the text exactly. In your case, there's a space at the end.
soup.find('th', text='Name of Owner or Operator: ').find_next_sibling('td').text
This prints:
\n \n \n \n \n PHILLIPS 66 COMPANY\n \n \n
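If you would rather not depend on the exact whitespace in each header, one option is to pass a function that compares the stripped text. The sketch below uses a small made-up table (the real page's markup is an assumption here) and a hypothetical helper, cell_after, which also returns None instead of raising when a label is missing.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page; note the trailing space inside the second <th>
html = """
<table>
  <tr><th>Incident Tracking Number:</th><td>000000</td></tr>
  <tr><th>Name of Owner or Operator: </th><td>
      PHILLIPS 66 COMPANY
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def cell_after(soup, label):
    """Text of the <td> following the <th> whose stripped text equals label (hypothetical helper)."""
    th = soup.find("th", string=lambda s: s is not None and s.strip() == label)
    if th is None:
        return None  # label not found: avoid the AttributeError on None
    td = th.find_next_sibling("td")
    return td.get_text(strip=True) if td is not None else None

report = cell_after(soup, "Incident Tracking Number:")
owner = cell_after(soup, "Name of Owner or Operator:")
missing = cell_after(soup, "No Such Label:")
print(report, owner, missing)
```

Using get_text(strip=True) also removes the surrounding newlines and spaces from the cell value.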

Processing all values of an array with get_text

(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script in order to first find certain parts of HTML markup within a local file and then display the information without HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
titleResults = soup.find_all('span', attrs={'class':'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far. But I want to do this for all values of titleResults, not only the first value. But I can't process an array with get_text.
Which way would be best to accomplish this? The number of values for titleResults is always changing since the local html file is only a sample.
Thank you in advance!
P.S. I already looked up this related thread but it is not enough for understanding or solving the problem sadly:
BeautifulSoup get_text from find_all
find_all returns a list, so loop over it and call get_text on each element:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
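The same loop can be written as a list comprehension if you want to keep the stripped strings for later use. This is a minimal sketch with made-up sample markup in place of the local file.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for testdatapython.html
html = (
    '<span class="caption-subject">First</span>'
    '<span class="other">ignored</span>'
    '<span class="caption-subject bold"> Second </span>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches any span whose class list contains "caption-subject"
titles = [span.get_text(strip=True)
          for span in soup.find_all("span", class_="caption-subject")]
print(titles)
```

This works for any number of matches, including zero, in which case titles is simply an empty list.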

Class consists of four parts separated by spaces

I am trying to scrape a web site using python and beautiful soup. The goal is to build a csv file, with the relevant information(location, unit size, rent...)
I am not 100% sure what the problem is, but I think it has to do with the structure of the class: "result matches_criteria_and_filters first_listing highlighted"
First part of the code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
c=r.content
After that I need the elements with class="result matches_criteria_and_filters first_listing highlighted", and this is where I am stuck.
Solutions that I found in other threads did not work.
soup.select("result.matches_criteria_and_filters.first_listing.highlighted")
Another possibility I found is to search for the classes separately, but it did not work either.
soup.find_all(attrs={'class': 'result'})
soup.find_all(attrs={'class': 'matches_criteria_and_filters'})
Everything I tried returned empty results or None.
First try getting the parent div with code similar to the following:
soup = BeautifulSoup('yourhtml', 'lxml')
results_div = soup.find('div', {'id': 'results'})
# now iterate through all child divs,
# then do whatever you want to do with each of them
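On the class itself: a class attribute with spaces is a list of four separate classes, and a CSS selector needs a leading dot for each one (the select call in the question is missing the first dot). The sketch below uses made-up markup, including the results container, since the real page structure is an assumption. If the listings on the live page are filled in by JavaScript, requests will not see them and both approaches will still come back empty.

```python
from bs4 import BeautifulSoup

# Made-up markup; the id "results" and the div contents are assumptions
html = """
<div id="results">
  <div class="result matches_criteria_and_filters first_listing highlighted">Unit A</div>
  <div class="result matches_criteria_and_filters">Unit B</div>
  <div class="ad">Sponsored</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every div that carries the class "result", whatever other classes it has
all_results = [d.get_text() for d in soup.select("div.result")]

# Only divs that carry all three of these classes (order does not matter)
highlighted = [d.get_text() for d in soup.select("div.result.first_listing.highlighted")]

# find_all with class_ also matches a single class out of several
by_find = [d.get_text() for d in soup.find_all("div", class_="result")]
print(all_results, highlighted, by_find)
```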

Python BeautifulSoup

I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup
url= 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while(res.status_code!=200):
    try:
        res = requests.get(url)
    except:
        pass
print (res)
soup = BeautifulSoup(res.text,'lxml')
songs = soup.find_all('meta',{'property':'music:song'})
print (songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL within content as a string so that I can use it further in my program.
Can someone please help me?
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case each element of that list is a Tag, which behaves like a dictionary of its attributes.
As explained in this StackOverflow post, you need to index not only the list (songs[0]) but also the attribute name within that dictionary-like object, i.e. songs[0]['content'] (a key and its value together are called a key-value pair, and keys are the chief way to get data out of a dictionary).
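A minimal sketch of that lookup, using the meta tag from the sample output as the input so it runs without a network request:

```python
from bs4 import BeautifulSoup

# The tag printed in the question, used here as stand-in page content
html = '<meta content="https://gaana.com/song/o-saathi" property="music:song"/>'
soup = BeautifulSoup(html, "html.parser")
songs = soup.find_all("meta", {"property": "music:song"})

# Index the Tag like a dictionary to read one attribute as a plain string
first_url = songs[0]["content"]

# .get returns None instead of raising KeyError when the attribute is absent
song_urls = [m.get("content") for m in songs if m.has_attr("content")]
print(first_url, song_urls)
```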
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn Python-variety Xpaths, which are sort of like regex for XML/HTML; but for advanced scraping it's probably the last best option short of Selenium, and it returns cleaner data than bs4.
Good luck!

Getting all tags with multiple attributes with SoupStrainer and BeautifulSoup

I'm trying to get all the occurrences of the 'td' tag when the class attribute has one of a few different values.
I know how to do this with BeautifulSoup after the fact but due to the amount of time it takes I'm trying to speed it up by selectively parsing each page with SoupStrainer. I at first tried the below but it doesn't seem to work.
strainer = SoupStrainer('td', attrs={'class': ['Value_One', 'Value_Two']})
soup = BeautifulSoup(foo.content, "lxml", parse_only=strainer)
Does anybody know of a way to make this work (it doesn't have to involve SoupStrainer or even Beautiful Soup)?
Depending on what you mean, you might be able to use scrapy, which lets you formulate XPath expressions such as the one used here. It takes advantage of the fact that the two class values are similar (both contain "Value"). Many other ways of making selections are available.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('.//td[contains(@class,"Value")]/text()').extract()
['value one', 'value two']
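If you would rather stay with Beautiful Soup, a similar selection can be sketched with CSS selectors. This does not give the partial-parsing speedup that SoupStrainer is meant for; it only shows the matching. The table below is made up, with cell text chosen to mirror the output above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for temp.htm
html = """
<table><tr>
  <td class="Value_One">value one</td>
  <td class="Other">skip me</td>
  <td class="Value_Two">value two</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class names, combined with a comma (logical OR)
exact = [td.get_text() for td in soup.select("td.Value_One, td.Value_Two")]

# Substring match on the class attribute, like contains(@class, "Value")
substring = [td.get_text() for td in soup.select('td[class*="Value"]')]
print(exact, substring)
```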

Resources