Trouble scraping a specific element when several share the same class name using BeautifulSoup (Python) - python-3.x

How can I extract only the text with the status information, i.e. "Semi-Furnished, Available immediately for Family"?
Both the details and the status information live in a div with class="proDetailsRowElm", so I end up with both the details and the status in my list.
Could you please help me get only the status information?
HTML CODE
<div class="proDetailsRowElm">
<label>Details:</label>
<div class="proDetailsRow__list">
<span class="proDetailsRow__item">3 Bathroom</span>
<span class="proDetailsRow__item">3 Balcony</span>
</div>
<a class='stop-propagation underline font-type-4 view-details-link' href="javascript:void(0);" onclick="stopPage=true;window.open('/propertyDetails/3-BHK-1800-Sq-ft-Multistorey-Apartment-FOR-Rent-Kadubeesanahalli-in-Bangalore&id=4d423330363332363633', '_blank');callDetailPropertData('30632663');addViewedPropertyToCookie('30632663',1);detailViewTrack('30632663');clicktrack('1', 'propertyId=30632663,'+'2', 'div'+',sessionId='+sessionId ,'Rent','Kadubeesanahalli','Agent','91','Bangalore' ,'','', 'N','35,000','','3','Multistorey Apartment','','','8','','',false,'','',''); trackPropertyPosition('1', '2', '30632663', 'div')"></a>
</div>
<div class="proDetailsRowElm">
<label>Status:</label>
Semi-Furnished,
Available immediately for Family
</div>
Python code
property_status_list = soup.find_all('div', class_='proDetailsRowElm')
# Print the text of every matching div (details and status alike)
for element in property_status_list:
    print(element.text)
Output of the above code
Details:
3 Bathroom
3 Balcony
Status:
Semi-Furnished,
Available immediately for Family
Required output
Status:
Semi-Furnished,
Available immediately for Family

I'm by no means a BeautifulSoup expert, but you might be able to use next_sibling:
property_status_list = soup.find_all('div', class_='proDetailsRowElm')
for property_status in property_status_list:
    try:
        # Only the status row contains a <label> whose text is 'Status:'
        k = property_status.find('label', text='Status:').next_sibling
        print(repr(k))
    except AttributeError:
        # find() returned None, so this row has no 'Status:' label
        pass
Returns:
'\nSemi-Furnished,\nAvailable immediately for Family\n'
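The returned sibling still carries the newlines from the markup. A minimal sketch of the same next_sibling idea with whitespace cleanup is shown here, assuming a trimmed copy of the question's HTML; the names html, statuses and row are illustrative, and string= is the newer spelling of the text= argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: the real page has more rows)
html = """
<div class="proDetailsRowElm">
  <label>Details:</label>
  <div class="proDetailsRow__list">
    <span class="proDetailsRow__item">3 Bathroom</span>
  </div>
</div>
<div class="proDetailsRowElm">
  <label>Status:</label>
  Semi-Furnished,
  Available immediately for Family
</div>
"""

soup = BeautifulSoup(html, "html.parser")

statuses = []
for row in soup.find_all("div", class_="proDetailsRowElm"):
    label = row.find("label", string="Status:")
    if label is None:
        continue  # this row has no 'Status:' label
    raw = label.next_sibling or ""
    # Collapse newlines and repeated spaces into single spaces
    statuses.append(" ".join(str(raw).split()))

print(statuses)
```

Checking for None instead of catching an exception keeps unrelated errors visible.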

Related

How do I retrieve text from a text node in Selenium

So, essentially I want to get the text from the site and print it onto console.
This is the HTML snippet:
<div class="inc-vat">
<p class="price">
<span class="smaller currency-symbol">£</span>
1,500.00
<span class="vat-text"> inc. vat</span>
</p>
</div>
Here is an image of the DOM properties:
How would I go about retrieving the '1,500.00'? I have tried self.browser.find_element_by_xpath('//*[@id="main-content"]/div/div[3]/div[1]/div[1]/text()'), but that throws an error saying: The result of the xpath expression is: [object Text]. It should be an element. I have also tried other approaches such as .text, but they either print only the '£' symbol, print a blank, or throw the same error.
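You can use the CSS selector below:
p.price
Sample code:
# Note: Selenium 4 removed the find_element_by_* helpers; the equivalent there is
# driver.find_element(By.CSS_SELECTOR, "p.price") with By imported from selenium.webdriver.common.by
elem = driver.find_element_by_css_selector("p.price").text.split(' ')[1]
print(elem)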
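Splitting on a single space depends on exactly how the browser renders the whitespace around the price. A sketch that is less sensitive to this pulls the number out of the element text with a regular expression; extract_price is a hypothetical helper, and sample_text is only an assumption about what .text returns for the p.price element.

```python
import re


def extract_price(element_text):
    """Return the first number such as 1,500.00 found in the text, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", element_text)
    return match.group(0) if match else None


# Assumed shape of the text Selenium returns for <p class="price">
sample_text = "£ 1,500.00 inc. vat"
price = extract_price(sample_text)
print(price)
```

In the Selenium code above, the argument would be the .text of the p.price element instead of sample_text.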

Why does attribute splitting happen in BeautifulSoup?

I am trying to get the class attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this is happening, and whether there is a more correct way to do it.
The other answer is correct; however, if you want a multi-valued attribute returned as a single string, try re-parsing the parent element with the XML parser.
from bs4 import BeautifulSoup
data='''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''
soup=BeautifulSoup(data,'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
parentdiv=span_autogoal.find_parent('div')
data=str(parentdiv)
soup=BeautifulSoup(data,'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, in <div class="detailMS__incidentRow incidentRow--away odd">, the class attribute is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
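If a plain string is preferred, two options can be sketched on a shortened copy of the question's markup: join the list, or ask Beautiful Soup not to split the attribute by passing multi_valued_attributes=None to the constructor (documented for recent Beautiful Soup 4 releases; older installs may not accept it). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question
html = (
    '<div class="detailMS__incidentRow incidentRow--away odd">'
    '<span class=" note-name">(Autogoal)</span>'
    "</div>"
)

# Default behaviour: the class attribute is split into a list
soup = BeautifulSoup(html, "html.parser")
as_list = soup.div["class"]
joined = " ".join(as_list)

# With multi_valued_attributes=None the attribute is kept as one string
raw_soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
as_string = raw_soup.div["class"]

print(as_list)
print(joined)
print(as_string)
```

Either route avoids re-parsing the element with a second parser.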

When scraping, I get HTML with an "encoded" part; is it possible to get its value?

One of the final steps in my project is to get the price of a product. I have everything I need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need is the value that comes right after
==">
I don't know whether the encoded part is some kind of protection, but the closest I have got is this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS: I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if parsing returns no price text, can the other methods get it without a different parser?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
I suppose there are ways to do this with BeautifulSoup; anyway, here is one approach using the third-party regex module (the standard re module rejects the variable-length lookbehind used below):
import regex
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means a digit, \. denotes a literal period, and the two combined in square brackets followed by + mean we want one or more digits/periods.
The parenthesized groups before and after it are lookarounds that specify what has to precede and follow a potential match.
(?<=data\-price[\=\"\w]+\>) means that any match must be preceded by data-price...> where ... is one or more of the characters A-Z, a-z, 0-9, _, = and ".
Finally, (?=\<\/div) means that any match must be followed by </div.
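The same idea can be sketched with only the standard re module by replacing the lookbehind with a capture group, since re does not accept variable-width lookbehinds. The source_code string here is a copy of the snippet from the question, and the pattern is an illustrative rewrite, not the expression from the answer above.

```python
import re

# Copy of the HTML posted in the question
source_code = '''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

# Match the data-price attribute, then capture the number before </div>
pattern = re.compile(r'data-price="[^"]*">\s*([\d.]+)\s*</div>')

prices = pattern.findall(source_code)
as_floats = [float(p) for p in prices]

print(prices)
print(as_floats)
```

With a single capture group, findall returns just the captured numbers, so no lookaround is needed.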
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. Since you are already able to get <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own, all you need now is the price, and for that you can use get_text(), a built-in method of BeautifulSoup tags.
This method returns the text between the opening and closing tags.
Syntax of get_text(): tag_name.get_text()
Solution to your problem:
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Search for all the div tags with class "encoded"
a = soup.find_all('div', {'class': 'encoded'})
# Use a list comprehension to get the price text out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)

How can I get a block tag with BeautifulSoup in Python 3?

I have to scrape a page and have this in the HTML code:
<div id="agentPhone" class="displayNone text large padding bck light_grey"></div>
</div>
In the page's DOM (element inspector) I have this:
<div id="agentPhone" class="displayNone text large padding bck light_grey" style="display: block;"><span>τηλ:</span> <span class="text color bold ">6908511284</span></div>
</div>
I am trying to get 6908... but I can't manage it.
Is there a way to do this?
The problem is not the block style. The number is loaded by a POST request that fires after you click the button, so it is not in the HTML that BeautifulSoup receives. You cannot get the number with BeautifulSoup alone, but you can with Selenium.

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the page source and print its location, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string. I want to see the tag around that string.
Consider this context, for example (*):
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (instead of the child text nodes) and text="1" to indicate that the element's text content must equal "1" (or any other text you want to search for). In Beautiful Soup 4.4.0 and later the same argument is also available as string=.
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
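A variation, sketched under the assumption that substring matches are also of interest: search the text nodes themselves with string= and step up to the enclosing tag via .parent. The sample markup is a made-up variant of the context above (the id "other" and the value "12" are illustrative).

```python
import re

from bs4 import BeautifulSoup

html = """<root>
<div id="yeahboy">1</div>
<div id="other">12</div>
<div><span class="nested">1</span></div>
</root>"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: text nodes equal to "1", mapped to their enclosing tags
exact = [node.parent for node in soup.find_all(string="1")]

# Substring match: any text node that contains "1"
partial = [node.parent for node in soup.find_all(string=re.compile("1"))]

for tag in exact:
    print(tag)
for tag in partial:
    print(tag.name, tag.attrs)
```

Working from the text node outward also gives access to further ancestors through tag.parent when more context around the string is needed.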
*) For OP: in future questions, try to give a context like the example above. That makes your question more concrete and easier to answer, as people don't have to create a context on their own, which may turn out not to be relevant to the situation you actually have.

Resources