Extract content wherever a div tag follows a header tag, using BeautifulSoup - python-3.x

I am trying to extract header tags and div tags when they appear together.
For example:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution provided in the question linked below.
There the header tag is inside the div tag,
but my requirement is a div tag after the header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content of the h3 tag and all the content inside the div tag wherever this combination occurs.

You could use the CSS selector h3:has(+div), which selects every <h3> that has a <div> immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is a P tag, not a DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference
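The same pairing can also be sketched with plain sibling navigation instead of the :has() selector. This is only an illustration: the helper name header_div_pairs and the shortened sample markup are assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for illustration)
data = """<h3>header</h3>
<div>some text here
<ul><li>list</li><li>list</li></ul>
</div>
<h3>not selected</h3>
<p>paragraph</p>
"""

def header_div_pairs(markup):
    """Return (header text, div text) for every <h3> directly followed by a <div>."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for h3 in soup.find_all("h3"):
        # True matches any tag, so whitespace between the tags is skipped
        nxt = h3.find_next_sibling(True)
        if nxt is not None and nxt.name == "div":
            pairs.append((h3.get_text(strip=True),
                          nxt.get_text(separator=",", strip=True)))
    return pairs

pairs = header_div_pairs(data)
print(pairs)
```

Returning a list of tuples rather than printing inside the loop makes it easier to feed the result into later processing.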

Related

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tag of 2 different classes in my result.
I am using following command to scrape the data -
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
I have one set of information in class 'col-s-12' and another set of information in class 'search-page-text clearfix row'.
I want to find the children of only the div tag with class 'col-s-12'. When I run the code below, it looks for children of both div tags, since I have not specified anywhere which class I want to search:
'''
for div in result:
    print(div)
    prod_name = div.find("a" , recursive=False)[0] #should come from 'col-s-12' only
    prod_info = div.find("a" , recursive=False)[0] # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can search for the first <a> tag under the tag with class="col-s-12" and then use .find_next('a') to find the next <a> tag.
Note: the .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12 > a')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**
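Since .select() takes only CSS selectors, the two classes from the question can be written as selectors directly. The sketch below is a minimal illustration with placeholder link text (name, info), which is assumed rather than taken from the real page.

```python
from bs4 import BeautifulSoup

# Placeholder markup: the link texts are assumptions for illustration
txt = """<div class="col-s-12"><a>name</a></div>
<div class="search-page-text clearfix row"><a>info</a></div>"""

soup = BeautifulSoup(txt, "html.parser")

# A comma means "either selector"; chained dots mean "all of these classes"
both = soup.select("div.col-s-12, div.search-page-text.clearfix.row")

# One selector per variable, so each value comes from the intended div only
prod_name = soup.select_one("div.col-s-12 > a").get_text(strip=True)
prod_info = soup.select_one("div.search-page-text.clearfix.row > a").get_text(strip=True)

print(len(both), prod_name, prod_info)
```

Unlike .find_next('a'), this does not depend on the two divs appearing in a particular order in the document.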

BSound: Conditionally Extract href text

Is there a way to use a regular expression to conditionally grab "hrefs"? For example, below I want only the text (TUBB1 and TUBB2) of the two hrefs that match:
href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:*"
and just the text of the href target
href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
My final goal is to have a list such as [("TUBB1", "TUBB2"), "P04690"]
Below is the HTML block I have gotten to with the text I want to extract.
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
P04690
Based on the comments, here is one possible solution to select the required elements:
from bs4 import BeautifulSoup
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
P04690'''
soup = BeautifulSoup(html, 'html.parser')
# select all text from elements where href begins with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"
part_1 = tuple(s.text for s in soup.select('[href^="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"]'))
# select text from first element where href begins with "http://www.uniprot.org/uniprot/"
part_2 = soup.select_one('[href^="http://www.uniprot.org/uniprot/"]').text
# combine parts and print them:
print([part_1, part_2])
Prints:
[('TUBB1', 'TUBB2'), 'P04690']
I don't think it's sexy, but I guess this will do.
# i is assumed to be the parent element (or soup) that contains the links
z = i.find_all('a')
for j in z:
    if "_gene_name" in j['href']:
        print(j.text)
    if "/pdb/protein" in j['href']:
        print(j.text)
Output:
TUBB1
TUBB2
P04690
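The question asks specifically about regular expressions, and find_all accepts a compiled pattern as an attribute filter. The following is a minimal sketch of that route. The two pattern variables are names introduced here for illustration, and the markup is a shortened version of the question's block, with the UniProt link taken from the href quoted in the question.

```python
import re
from bs4 import BeautifulSoup

html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>'''

soup = BeautifulSoup(html, "html.parser")

# Hypothetical pattern names; a pattern is matched against the href with search()
gene_pattern = re.compile(r"rcsb_gene_name\.value:")
uniprot_pattern = re.compile(r"^http://www\.uniprot\.org/uniprot/")

genes = tuple(a.get_text(strip=True) for a in soup.find_all("a", href=gene_pattern))
uniprot = soup.find("a", href=uniprot_pattern)

# Guard against a missing link instead of failing on None
result = [genes, uniprot.get_text(strip=True) if uniprot else None]
print(result)
```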

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left the markup as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into an ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.find_all('li'):
    previous_element = li.find_previous()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
<br/>
<p>
<ul>
<li>
stuff1
</li>
<li>
stuff2
</li>
</ul>
<div>
</div>
<ul>
<li>
stuff3
</li>
<li>
stuff4
</li>
<li>
stuff5
</li>
</ul>
</p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
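A variation on the same idea is to check each li's parent and then pull its following sibling li elements into the new ul. This is only a sketch under stated assumptions: the function name wrap_orphan_lis and the sample markup are made up for illustration, and whitespace between items is left where it is.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def wrap_orphan_lis(markup):
    """Wrap each run of sibling <li> tags that has no <ul>/<ol> parent in a new <ul>."""
    soup = BeautifulSoup(markup, "html.parser")
    for li in soup.find_all("li"):
        if li.parent is not None and li.parent.name in ("ul", "ol"):
            continue  # already inside a list, including the ones created below
        ul = soup.new_tag("ul")
        li.wrap(ul)
        node = ul.next_sibling
        while node is not None:
            following = node.next_sibling  # remember it before node is moved
            if isinstance(node, Tag) and node.name == "li":
                ul.append(node)
            elif isinstance(node, Tag) or str(node).strip():
                break  # any other tag or real text ends the run
            node = following
    return soup

# Sample markup (an assumption for illustration)
fixed = wrap_orphan_lis("<li>a</li>\n<li>b</li>\n<div></div>\n<li>c</li>")
groups = [[li.get_text() for li in ul.find_all("li")] for ul in fixed.find_all("ul")]
print(groups)
```

Checking the parent rather than the previous element means li tags that already sit in a ul or ol are never moved.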
First, you have to decide which parser you are going to use. Different parsers treat malformed HTML differently.
The following BeautifulSoup methods will help you accomplish what you require:
new_tag() - creates a new ul tag.
append() - appends the newly created ul tag somewhere in the soup tree.
extract() - extracts the li tags one by one (so that we can append them to the ul tag).
decompose() - removes any unwanted tags from the tree, which may be formed as a result of the parser's interpretation of the malformed HTML.
My Solution
Let's create a soup object using html5lib parser and see what we get
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. Here I want to remove the second, empty p, add a new ul tag, and move all the li tags inside it.
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
second_p=soup.find_all('p')[1]
second_p.decompose()
ul_tag=soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
<head>
</head>
<body>
<p>
<br/>
</p>
<ul>
<li>
stuff
</li>
<li>
stuff
</li>
</ul>
</body>
</html>

how to get soup.find_all to work in BeautifulSoup?

I'm trying to scrape information from a page listing the names of attorneys using BeautifulSoup
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
The following is an example of how each attorney's name is nested in the HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)
You need to change your soup.find_all call to search for div, since the class belongs to the div and not to the a:
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)
class="col person-name-position" is an attribute of the div element, not of the a element, so you need to search for the div and then look up the a inside it:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
    a_element = entry.find("a")
    #...
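To make the difference concrete without fetching the live page, here is a small sketch against a static copy of the snippet from the question. The closing div tags are added as an assumption so the fragment is complete, and the CSS selector is one possible way to reach the names, not the only one.

```python
from bs4 import BeautifulSoup

# Static copy of the question's snippet; closing </div> tags are assumed
html = """
<div class="person-info search-person-info people-search-person-info">
  <div class="col person-name-position">
    <a href="https://www.foxrothschild.com/richard-s-caputo/">
      Richard S. Caputo
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The class sits on the div, so filtering <a> tags by it finds nothing
wrong = soup.find_all("a", class_="col person-name-position")

# Select the <a> that is a direct child of the div carrying the class
links = soup.select("div.person-name-position > a")
names = [a.get_text(strip=True) for a in links]
urls = [a["href"] for a in links]

print(wrong, names, urls)
```

Matching on the single class person-name-position also avoids depending on the exact order of the classes in the attribute, which the string 'col person-name-position' does.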

Extracting Text Within Tags Inside HTML Comments with BeautifulSoup

I want to extract the text within the list elements inside a comment, without the list tags, but I can't do it with the code below.
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<!--
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
for numbers in soup.findAll(text=lambda text:isinstance(text, Comment)):
    print(numbers.extract())
Result is:
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
Desired result :
10
20
30
Try the approach below. It parses the text of each comment as HTML in its own right and then searches that for the li tags, which gives the result you want.
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<!--
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument
for item in soup.find_all(string=lambda text: isinstance(text, Comment)):
    data = BeautifulSoup(item, "html.parser")
    for number in data.find_all("li"):
        print(number.text)
Output:
10
20
30
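Look for all "li" tags and print just the text. Note that tags inside a comment are not part of the parsed tree, so soup here has to be the result of parsing the comment text itself:
for tag in soup.find_all("li"):
    print(tag.text)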
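The two answers can be combined into one small helper that returns the values instead of printing them. This is a sketch only: the function name li_texts_in_comments is hypothetical and the markup is a shortened version of the question's example.

```python
from bs4 import BeautifulSoup, Comment

html = """
<html><body>
<!--
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
-->
</body></html>
"""

def li_texts_in_comments(markup):
    """Collect the text of every <li> found inside HTML comments."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # A comment is plain text to the parser, so parse it a second time
        inner = BeautifulSoup(comment, "html.parser")
        texts.extend(li.get_text(strip=True) for li in inner.find_all("li"))
    return texts

# Searching the outer soup directly finds no <li>: they exist only as comment text
outer_lis = BeautifulSoup(html, "html.parser").find_all("li")
numbers = li_texts_in_comments(html)
print(outer_lis, numbers)
```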
