I'm having multiple issues trying to scrape a website where the CSS classes are all the same. I'm still learning about the soup.find method and what I can do with it. The issue is that several elements on the page use <span class="list-quest", and when I use soup.find(class_='list-quest'), for example, I only get the first element from the top of the page that uses that class. Is there a way to get the exact element I want, possibly by matching its text, such as Born [dd-mm-yyyy]? Sadly I do not know how to make Python search by a specific keyword like that.
<span class="list-quest">Born [dd-mm-yyyy]:</span>
By using a regex on the text attribute:
Regex:
Born \d{2}-\d{2}-\d{4}:
Python code:
from bs4 import BeautifulSoup
import re

text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text, features='html.parser')

# Match the span whose text looks like "Born dd-mm-yyyy:"
tag = soup.find('span', attrs={'class': 'list-quest'},
                text=re.compile(r'Born \d{2}-\d{2}-\d{4}'))
print(tag.text)
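Side note: in bs4 4.4.0+ the text argument is also exposed under the name string, so, continuing the snippet above (soup and re already set up), the same search can be written as:
tag = soup.find('span', attrs={'class': 'list-quest'},
                string=re.compile(r'Born \d{2}-\d{2}-\d{4}'))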
With bs4 4.7.1+ you might be able to use the :contains pseudo-class:
item = soup.select_one('span.list-quest:contains("Born ")')
if item is not None:
    print(item.text)
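With Soup Sieve 2.1+ (the selector engine bundled with recent bs4 releases), plain :contains is deprecated in favor of the vendor-prefixed form:
item = soup.select_one('span.list-quest:-soup-contains("Born ")')
if item is not None:
    print(item.text)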
I have HTML like this:
<div class="event__scores fontBold">
<span>1</span>
-
<span>2</span>
</div>
I find this element as follows:
current_score = match.find_element(By.XPATH, '//div[contains(@class, "event__scores")]')
print(current_score.get_attribute('innerHTML'))
I can't understand what I need to do to get the text like 1 - 2 without using bs4 or something like that.
I know I can use bs4 like this:
spans = soup.find_all('span')
result = ' - '.join([e.get_text() for e in spans])
But I want to know whether I can get a similar result using only Selenium.
Consider using an Explicit Wait instead of find, as the element might not have loaded yet by the time you attempt to find it. Check out the How to use Selenium to test web applications using AJAX technology article for more details.
current_score = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "event__scores")]')))
Also, you're looking at the wrong property; you should be using innerText, not innerHTML:
print(current_score.get_attribute('innerText'))
or simply retrieve the WebElement.text property:
print(current_score.text)
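Putting it together, a minimal sketch, assuming a Chrome driver and the event__scores markup from the question (the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/match')  # placeholder URL for the match page

# Wait up to 10 seconds for the score container to appear in the DOM
current_score = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "event__scores")]')))
print(current_score.text)  # e.g. "1 - 2"

driver.quit()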
(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script in order to first find certain parts of HTML markup within a local file and then display the information without HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")

titleResults = soup.find_all('span', attrs={'class': 'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far. But I want to do this for all values of titleResults, not only the first one, and I can't call get_text on the list itself.
Which way would be best to accomplish this? The number of values in titleResults is always changing, since the local HTML file is only a sample.
Thank you in advance!
P.S. I already looked up this related thread, but sadly it was not enough to understand or solve the problem:
BeautifulSoup get_text from find_all
find_all returns a list, so iterate over it:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
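Or, if you want all the strings at once, the same thing as a list comprehension:

allStripped = [result.get_text() for result in titleResults]
print(allStripped)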
I am trying to scrape a web site using Python and Beautiful Soup. The goal is to build a CSV file with the relevant information (location, unit size, rent, ...).
I am not 100% sure what the problem is, but I think it has to do with the structure of the class: "result matches_criteria_and_filters first_listing highlighted".
First part of the code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
c = r.content
After that I would need the elements with class="result matches_criteria_and_filters first_listing highlighted", but I am not able to get them.
Solutions that I found in other threads did not work:
soup.select("result.matches_criteria_and_filters.first_listing.highlighted")
Another possibility I found is to separate the classes, but that did not work either.
soup.find_all(attrs={'class': 'result'})
soup.find_all(attrs={'class': 'matches_criteria_and_filters'})
Everything I tried gave empty lists or None objects.
First try getting the parent div with code similar to the following:
soup = BeautifulSoup('yourhtml', 'lxml')
results_div = soup.find('div', {'id':'results'})
# now iterate through all child divs
Then do whatever you want to do with the child divs.
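For example, a minimal sketch, assuming the results container on the live page really has id="results" (inspect the page to confirm):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
soup = BeautifulSoup(r.content, 'lxml')

results_div = soup.find('div', {'id': 'results'})
if results_div is not None:
    # walk only the direct child divs of the container
    for child in results_div.find_all('div', recursive=False):
        print(child.get_text(strip=True))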
I would like to search for a particular string in a scraped HTML page and perform some action if the string is present.
find = soup.find('word')
print(find)
But this gives None even though the word is present in the page. I also tried:
find = soup.find_all('word')
print(find)
And it gives only [].
What the find method does is search for a tag. So when you do soup.find('word') you're asking BeautifulSoup to find <word></word> tags, which I think is not what you want.
There are several ways to do what you're asking. You can use the re module to search with a regular expression, like this:
import re
is_present = bool(re.search('word', response.text))
But you can avoid importing extra modules, since you use Scrapy, which has built-in methods for working with regular expressions. Just use the re method on a selector:
is_present = bool(response.xpath('//body').re('word'))
Try find = soup.findAll(text="word")
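Note that text="word" only matches text nodes whose entire content equals "word"; to match the word anywhere inside a text node, pass a compiled regex instead. A minimal sketch:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>a word inside a paragraph</p>', 'html.parser')

print(soup.find_all(text='word'))              # [] -- exact match only
print(soup.find_all(text=re.compile('word')))  # ['a word inside a paragraph']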
I'm doing this tutorial:
http://programminghistorian.org/lessons/intro-to-beautiful-soup
When I run the following code I get this error:
AttributeError: 'NoneType' object has no attribute 'decompose'
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    print(link)
I can't understand why I'm getting this error. I'm not sure what soup.p.a is doing either. Googled it but nothing came up...
Make sure that you have an HTML file named 43rd-congress.html in your working directory, and that it contains the lines mentioned in the tutorial. The error you get is most probably because the program was not able to find an "a" tag nested within a "p" tag in the 43rd-congress.html file in your working directory.
soup.p.a targets the first "a" tag that is nested within a "p" tag and passes it to the assigned variable (final_link in this case). The decompose function then removes the element stored in final_link from the original BeautifulSoup object soup.
For example, consider this file, which is very similar to the one on the site you mentioned:
<p align="left">
<a href="google.com">
<b>Search Again</b>
</a>
</p>
<a href="https://example.com">Hello</a>
<a href="https://www.yahoo.com">Yahoo</a>
When you save the above markup as 43rd-congress.html in your working directory and run your code, you will see output like:
<a href="https://example.com">Hello</a>
<a href="https://www.yahoo.com">Yahoo</a>
The "a" tag enclosed within the "p" tag is completely deleted from the "soup" object by action of the program.