how to get soup.find_all to work in BeautifulSoup? - python-3.x

I'm trying to scrape a page listing the names of attorneys using BeautifulSoup.
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Following is an example of each attorney's names that are nested in HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)

You need to change your soup.find_all call to look for div, since the class belongs to the div and not to the a tag:
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)

class="col person-name-position" is an attribute of the div element, not of the a element, so search for the div first and then take the a inside it:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
    a_element = entry.find("a")
    #...
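To make the loop above concrete, here is a minimal sketch that runs against the HTML fragment from the question instead of the live site. The closing div tags and the names list are assumptions added for illustration; the real page may be structured differently around this fragment.

```python
from bs4 import BeautifulSoup

# Sample markup based on the fragment in the question (closing tags assumed).
html = '''
<div class="person-info search-person-info people-search-person-info">
  <div class="col person-name-position">
    <a href="https://www.foxrothschild.com/richard-s-caputo/">
      Richard S. Caputo
    </a>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

names = []
for entry in soup.find_all('div', class_='col person-name-position'):
    a_element = entry.find('a')
    if a_element is not None:  # skip blocks that have no link
        names.append(a_element.get_text(strip=True))

print(names)

# The same lookup as a CSS selector: an <a> directly inside a div
# that carries both classes.
names_css = [a.get_text(strip=True)
             for a in soup.select('div.col.person-name-position > a')]
print(names_css)
```

Passing strip=True to get_text() removes the newlines and indentation that surround the name in the markup.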

Related

Getting the ID if i know the specific span text

my brain crashed.
I'm trying to get the ID of a span whose text matches a specific value using BeautifulSoup. I need a number from the ID, but the ID changes every time I search for a new product, while the text (CORRECT) stays the same. Once I have the number, 11 in this case, I can use it in another part of the code to scrape the information I need.
Example:
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>
I've been reading the documentation but never seem to get it right, or even remotely close. I know how to pull the text (CORRECT) if I know the ID, but not the reverse.
Use find_all() to get the span items with the required text, then read each one's id attribute and split() the value on -:
from bs4 import BeautifulSoup
html='''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('span',string='CORRECT'):  # string= is the current name of the older text= argument
    print(item['id'].split('-')[2])
It will print:
11
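I prefer to use :contains to target the innerText by a specified value. It is available in bs4 4.7.1+; newer Soup Sieve releases rename it to :-soup-contains and deprecate the old spelling, so prefer that name if your version supports it.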
from bs4 import BeautifulSoup as bs
html = '''
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''
soup = bs(html, 'lxml')
target = soup.select_one('span:contains("CORRECT")[id]')
if target is None:
    print("Not found")
else:
    print(target['id'].split('-')[2])
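Both answers rely on the number being the third piece of the id after splitting on -. If that position is not guaranteed, one alternative is to pull out the first run of digits with a regular expression. This is only a sketch: the helper name id_number_for_text is made up for illustration, and it assumes the id contains a single number.

```python
import re
from bs4 import BeautifulSoup

html = '''<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''


def id_number_for_text(markup, wanted):
    """Return the first run of digits in the id of the span whose text is `wanted`.

    Hypothetical helper; returns None when no such span or number exists.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    # id=True restricts the search to spans that actually have an id attribute.
    span = soup.find('span', string=wanted, id=True)
    if span is None:
        return None
    match = re.search(r'\d+', span['id'])
    return int(match.group()) if match else None


print(id_number_for_text(html, 'CORRECT'))
print(id_number_for_text(html, 'MISSING'))
```

Returning None for a missing span keeps the caller from hitting a TypeError when the product text is not on the page.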

Extract content wherever a header tag is followed by a div tag, using BeautifulSoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution from the question linked below, but there the header tag is inside the div tag, whereas in my case the div tag comes after the header tag:
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
import bs4
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content of the h3 tag and all the content inside the div tag wherever this combination occurs.
You could use CSS selector h3:has(+div) - this will select all <h3> which have div immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is a P tag, not a DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
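If the :has() selector is not available in your installation, roughly the same pairing can be done with plain navigation. The sketch below is an alternative to the selector approach, not a replacement for it, and the pairs list is a name chosen here for illustration. It checks that the very next sibling tag of each h3 is a div.

```python
from bs4 import BeautifulSoup

data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is a P tag, not a DIV</p>
'''

soup = BeautifulSoup(data, 'html.parser')

pairs = []
for h3 in soup.find_all('h3'):
    # With no arguments, find_next_sibling() returns the next sibling tag,
    # skipping the whitespace text nodes that sit between elements.
    nxt = h3.find_next_sibling()
    if nxt is not None and nxt.name == 'div':
        pairs.append((h3.get_text(strip=True),
                      nxt.get_text(separator=',', strip=True)))

print(pairs)
```

Note the difference from h3.find_next_sibling('div'), which would also accept a div that appears later among the siblings rather than immediately after the header.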
Further reading:
CSS Selectors reference

When Scraping got html with "encoded" part, is it possible to get it

One of the final steps in my project is to get the price of a product. I have everything I need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need is the value that comes after the
==">
I don't know whether the encoded part is some kind of protection, but the closest I get is this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS. I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if I get no price when parsing the text, can the other methods get it without a different parser?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
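I suppose there are ways to do this using BeautifulSoup; anyway, here is one approach using the third-party regex module (the lookbehind below has variable length, which the standard re module does not support).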
import regex
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means a digit, \. denotes the period, and the two combined in square brackets with the + mean we want a run of at least one digit or period
The parenthesized groups before and after it specify what has to precede and follow a potential match
(?<=data\-price[\=\"\w]+\>) means that before any potential match there must be data-price...> where ... is at least one of the characters =, " or \w (letters, digits and the underscore)
Finally, (?=\<\/div) means that any match must be followed by </div
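The same extraction can also be sketched with the standard-library re module by swapping the lookbehind and lookahead for a capture group. This is a variation on the answer above rather than the answerer's own pattern, and source_code here is simply the HTML from the question pasted into a string.

```python
import re

source_code = '''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

# The capture group (...) takes over the job of the lookbehind/lookahead pair:
# findall() returns only the captured part of each match.
pattern = re.compile(r'data-price="[^"]*">([\d.]+)</div')

prices = pattern.findall(source_code)
print(prices)
print([float(p) for p in prices])
```

Because the surrounding context is matched normally instead of asserted, the pattern has no variable-length lookbehind and works with re.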
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. Since you can already get <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own, all you need now is the price, and for that you can use get_text(), a built-in BeautifulSoup method.
It returns the text between the opening and closing tags.
Syntax of get_text(): tag_name.get_text()
Solution to your problem:
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Find all the div tags with class "encoded"
a = soup.find_all('div', {'class': 'encoded'})
# Use a list comprehension to get the price text out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)
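Building on the get_text() answer, here is a small sketch that keeps each price together with its product name by walking the div.price blocks one at a time. The dictionary name price_by_product is an assumption for illustration, and the markup is the fragment from the question.

```python
from bs4 import BeautifulSoup

data = '''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

soup = BeautifulSoup(data, 'html.parser')

price_by_product = {}
# class_='price' matches the class token "price", so the outer
# <div class="prices"> wrapper is not picked up.
for block in soup.find_all('div', class_='price'):
    name = block.find('span')
    price = block.find('div', class_='encoded')
    if name is None or price is None:
        continue
    text = price.get_text(strip=True)
    if text:  # an empty div means the price text is not in this HTML
        price_by_product[name.get_text(strip=True)] = float(text)

print(price_by_product)
```

If the downloaded HTML really contains an empty div, as described in the question, the price text is not in the response at all, and switching parsers will not recover it.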

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text of one span among several span tags with similar attributes, using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name=nv and a data-value attribute, but the votes span has those too. I just want to extract the gross dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the span with class="text-muted" containing Gross:.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
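The label-then-next-span idea extends to every labelled value in the paragraph. The sketch below collects them into a dictionary; the name values and the choice to strip the trailing colon from each label are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

values = {}
for label in soup.find_all('span', class_='text-muted'):
    # "name" must go through attrs= because the name keyword of the
    # find methods already means the tag name.
    value = label.find_next_sibling('span', attrs={'name': 'nv'})
    if value is not None:
        key = label.get_text(strip=True).rstrip(':')
        values[key] = int(value['data-value'].replace(',', ''))

print(values)
```

This assumes every label is followed by its own value span; a label without one would pick up the next value in the paragraph instead.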

Beautiful Soup extract tag attributes, then find_all with multiple attributes

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in which looks like this:
<div class="title" style="visibility: visible">
From this, i'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div', {'class': 'title', 'style': 'visibility: visible'})
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and style....sometimes more....
Is this possible?
Really appreciate any direction on this.
Many thanks,
You can read a tag's attributes from its .attrs dictionary. Use find_all instead of find if you want more than one div in the content.
code:
from bs4 import BeautifulSoup
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser') #parse content to BeautifulSoup Module
div_content = dict(soup.find("div").attrs)
print("div_content : {0}".format(div_content)) #div content
print("style_content : {0}".format(div_content.get("style"))) # style attribute
print("class_content : {0}".format(div_content.get("class")[0])) # class attribute
output:
div_content : {'class': ['title'], 'style': 'visibility: visible'}
style_content : visibility: visible
class_content : title
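The second half of the question, reusing the extracted attributes in a find_all call, could be sketched as follows. The helper attrs_as_strings and the sample divs are hypothetical. It joins multi-valued attributes such as class back into a single string, so the match is against the exact attribute value.

```python
from bs4 import BeautifulSoup

data = '''
<div class="title" style="visibility: visible">first</div>
<div class="title">no style</div>
<div class="title" style="visibility: hidden">hidden</div>
<div class="title" style="visibility: visible">second</div>
'''

soup = BeautifulSoup(data, 'html.parser')
template = soup.find('div')  # the tag whose attributes we want to reuse


def attrs_as_strings(tag):
    """Copy tag.attrs, turning list values (e.g. class) into plain strings."""
    return {key: ' '.join(value) if isinstance(value, list) else value
            for key, value in tag.attrs.items()}


criteria = attrs_as_strings(template)
matches = soup.find_all(template.name, attrs=criteria)

print(criteria)
print([m.get_text() for m in matches])
```

Nothing is hardcoded, so the same code works whether the template tag has one attribute or several. For a tag with several classes, the joined string only matches elements that list the same classes in the same order.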
