I'm working through this tutorial:
http://programminghistorian.org/lessons/intro-to-beautiful-soup
When I run the following code I get this error:
AttributeError: 'NoneType' object has no attribute 'decompose'
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print(link)
I can't understand why I'm getting this error. I'm not sure what soup.p.a does either; I Googled it but nothing came up...
Make sure that you have an HTML file named 43rd-congress.html in your working directory, and that it contains the lines mentioned in the tutorial. The error you get most probably means the program was not able to find an "a" tag nested within a "p" tag in the 43rd-congress.html file in your working directory.
soup.p.a targets the first "a" tag that is nested within a "p" tag and passes it to the assigned variable (final_link in this case). The decompose function then removes the element stored in final_link from the original BeautifulSoup object soup.
For example, consider this file, which is very similar to the one on the site you mentioned:
<p align="left">
<a href="google.com">
<b>Search Again</b>
</a>
</p>
<a>Hello</a>
<a>Yahoo</a>
When you save the above markup as 43rd-congress.html in your working directory and run your code, you will see the output
<a>Hello</a>
<a>Yahoo</a>
The "a" tag enclosed within the "p" tag is completely deleted from the "soup" object by action of the program.
After running BeautifulSoup on output from an internal process, I sometimes see that my output has an outer p tag, for example <p>...</p>, in the case of incorrect XML. I know that using string operations or soup.find("p").decode_contents() I can fetch only the internal HTML tags.
soup = bs(''.join(my_output), "lxml")
return soup.find("body").decode_contents()
My question is that I do not want BeautifulSoup to add a <p> tag when the XML is incorrect; the output should be returned as is. Please let me know if it's possible.
Please note that I need to use the lxml parser only, because that has worked for most of my data sets.
Edit:
I know that I can use string manipulation or soup.find("p").decode_contents() to leave out the p tag; what I am looking for is for BeautifulSoup itself not to add the p tag and to return the XML as is.
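One thing to experiment with (a sketch, not something from the question): the "xml" feature of BeautifulSoup also uses lxml under the hood, but as an XML parser, so it does not wrap fragments in <html>, <body>, or <p> tags:
from bs4 import BeautifulSoup

fragment = "<item>no wrapping tags added</item>"

# the "lxml" HTML parser wraps fragments in <html><body>... (and may add <p>);
# the "xml" feature parses the same string with lxml's XML parser and keeps it as is
soup = BeautifulSoup(fragment, "xml")
print(soup.decode_contents())  # <item>no wrapping tags added</item>
Whether this fits depends on the input really being XML rather than HTML.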
I'm writing a simple web scraper to get data from the Texas Commission on Environmental Quality (TCEQ) website. The info I need is inside 'td' tags. I'm selecting the appropriate 'td' by referencing the preceding 'th', all of which have consistent text I can use to identify them. I'm using find_next_sibling to scrape the data into a variable.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
###This one works
report = soup.find("th", text="Incident Tracking Number:").find_next_sibling("td").text
###This one doesn't
owner = soup.find("th", text="Name of Owner or Operator:").find_next_sibling("td").text
I'm getting this error: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'. This code has several lines like the two above, and, like them, some of them work and some of them don't. I've looked into the HTML to see if there's another tag, but I'm not seeing it if it's there. Please and thank you for any help!
When using the text parameter, you should make sure you provide the text exactly. In your case, there's a space at the end:
soup.find('th', text='Name of Owner or Operator: ').find_next_sibling('td').text
This prints:
\n \n \n \n \n PHILLIPS 66 COMPANY\n \n \n
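If depending on that exact trailing space feels brittle, a hedged alternative (not the answer's own code) is to match the th text with a regular expression and strip the whitespace from the result with get_text(strip=True):
import re
import requests
from bs4 import BeautifulSoup

URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

# the regex matches the label regardless of surrounding whitespace,
# and get_text(strip=True) drops the newlines and spaces around the value
th = soup.find("th", text=re.compile("Name of Owner or Operator"))
if th is not None:
    owner = th.find_next_sibling("td").get_text(strip=True)
    print(owner)  # PHILLIPS 66 COMPANY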
I'm having multiple issues trying to scrape a website where the CSS classes are all the same. I'm still learning about the soup.find method and what I can do with it. The issue is that several elements on the page share the class <span class="list-quest", and when I use soup.find(class_='list-quest'), for example, I only get the first result at the top of the page. Is there a way to get the exact line I want, possibly by using the text Born [dd-mm-yyyy]:? Sadly, I don't know how to make Python find it by a specific keyword like that.
<span class="list-quest">Born [dd-mm-yyyy]:</span>
By using a regex on the text attribute:
Regex:
Born \d{2}-\d{2}-\d{4}:
Python code:
from bs4 import BeautifulSoup
import re
text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text,features='html.parser')
tag = soup.find('span',attrs={'class':'list-quest'} , text=re.compile(r'Born \d{2}-\d{2}-\d{4}'))
print(tag.text)
You might, with bs4 4.7.1+, be able to use :contains:
item = soup.select_one('span.list-quest:contains("Born ")')
if item is not None: print(item.text)
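A side note beyond both answers: newer soupsieve releases deprecate :contains() in favour of :-soup-contains(), so the same lookup can also be written as:
from bs4 import BeautifulSoup

text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text, features='html.parser')

# :-soup-contains is the non-deprecated spelling of :contains in soupsieve
item = soup.select_one('span.list-quest:-soup-contains("Born ")')
if item is not None:
    print(item.text)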
(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script that first finds certain parts of HTML markup within a local file and then displays the information without the HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
titleResults = soup.find_all('span', attrs={'class':'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far, but I want to do it for all values of titleResults, not only the first one. However, I can't call get_text on the whole list.
What would be the best way to accomplish this? The number of values in titleResults changes every time, since the local HTML file is only a sample.
Thank you in advance!
P.S. I already looked up this related thread but it is not enough for understanding or solving the problem sadly:
BeautifulSoup get_text from find_all
find_all returns a list, so iterate over it and call get_text on each element:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
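If you'd rather collect the stripped strings than print them, a short sketch of the same idea as a list comprehension (same file and class as in your script):
from bs4 import BeautifulSoup

with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")

titleResults = soup.find_all('span', attrs={'class':'caption-subject'})

# one stripped string per matching span, in document order
allStripped = [result.get_text(strip=True) for result in titleResults]
print(allStripped)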
I am trying to scrape a web site using Python and Beautiful Soup. The goal is to build a CSV file with the relevant information (location, unit size, rent, ...).
I am not 100% sure what the problem is, but I think it has to do with the structure of the class "result matches_criteria_and_filters first_listing highlighted".
First part of the code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
c = r.content
After that I would need the elements with the class "result matches_criteria_and_filters first_listing highlighted", and this is where I am stuck.
Solutions that I found in other threads did not work:
soup.select("result.matches_criteria_and_filters.first_listing.highlighted")
Another possibility I found was to separate the classes, but that did not work either:
soup.find_all(attrs={'class': 'result'})
soup.find_all(attrs={'class': 'matches_criteria_and_filters'})
Everything I tried gave empty or None objects.
First try getting the parent div with code similar to the following:
soup = BeautifulSoup('yourhtml', 'lxml')
results_div = soup.find('div', {'id':'results'})
# now iterate through all the child divs
Then do whatever you want with the child divs.
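For example, a minimal sketch of that iteration (assuming the listings are rendered server-side inside the div with id "results"; if the page fills them in with JavaScript, requests will never see them):
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
soup = BeautifulSoup(r.content, 'lxml')

results_div = soup.find('div', {'id': 'results'})
if results_div is not None:
    # walk only the direct child divs of the results container
    for child in results_div.find_all('div', recursive=False):
        print(child.get('class'), child.get_text(strip=True))
else:
    print("No div with id 'results' in the downloaded HTML")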