Extracting Text Within Tags Inside HTML Comments with BeautifulSoup - python-3.x

I want to extract the text within the list elements inside a comment, without the list tags, but I can't do it with the code below:
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<!--
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
for numbers in soup.findAll(text=lambda text: isinstance(text, Comment)):
    print(numbers.extract())
Result is:
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
Desired result :
10
20
30

Try the below approach. It will fetch you the result you wish to get.
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<!--
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all(text=lambda text: isinstance(text, Comment)):
    data = BeautifulSoup(item, "html.parser")
    for number in data.find_all("li"):
        print(number.text)
Output:
10
20
30

Note that soup.find_all("li") finds nothing here, because the list only exists inside the comment. Once the comment text has been re-parsed into its own soup (data above), look for all "li" and print just the text:
for tag in data.find_all("li"):
    print(tag.text)
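An alternative that skips the second parse: a Comment is a string subclass, so a regular expression can pull the <li> contents straight out of it. A minimal sketch, assuming the comment markup stays as simple as in the question:

```python
import re
from bs4 import BeautifulSoup, Comment

html = "<html><body><!--<ul><li>10</li><li>20</li><li>30</li></ul>--></body></html>"
soup = BeautifulSoup(html, 'html.parser')
numbers = []
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    # pull the text between <li> and </li> straight out of the comment string
    numbers.extend(re.findall(r'<li>(.*?)</li>', comment))
print(numbers)  # → ['10', '20', '30']
```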

Related

lxml: get all the HTML of an XPath-selected element

So I have a given piece of HTML, and I want to select a part of it using XPath and lxml:
from lxml import etree
example_html = '''
<div>
<span>
<p>abc</p>
<p>def</p>
</span>
</div>
'''
htmlparser = etree.HTMLParser()
tree = etree.fromstring(example_html, htmlparser)
el = tree.xpath('//div')
el is now obviously this: [<Element div at 0x7f61154c3f88>]. What I want to do is something like el[0].get_html() to get:
<span>
<p>abc</p>
<p>def</p>
</span>
Is this possible? And if not with lxml is there another library?
You can try with BeautifulSoup,
from bs4 import BeautifulSoup
example_html = '''
<div>
<span>
<p>abc</p>
<p>def</p>
</span>
</div>
'''
soup = BeautifulSoup(example_html, 'html.parser')
print(soup.span)
<span>
<p>abc</p>
<p>def</p>
</span>

find tags that contain partial strings in text

<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
soup.findAll('div', text = re.compile('456'))
Only returns div b, no others.
soup.findAll('div', text = re.compile('45'))
Only returns div b, no others.
How to return other DIVs actually partially matches the specific string?
The answer to your question is similar to this answer. All you have to do is tweak the lambda function a bit. Here is the full code:
from bs4 import BeautifulSoup
html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html,'html5lib')
divs = soup.find_all("div", text=lambda text: text and '456' in text)
Output:
>>> divs
[<div id="a">123 456</div>, <div id="b">456</div>, <div id="c">123 456 789</div>]
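A side note: in bs4 4.4+ the text= keyword is also available as string= (recent releases deprecate text=), and the same lambda works unchanged. A sketch using the built-in html.parser instead of html5lib:

```python
from bs4 import BeautifulSoup

html = '''
<div id="a">123 456</div>
<div id="b">456</div>
<div id="c">123 456 789</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# a tag matches when its .string satisfies the predicate
divs = soup.find_all('div', string=lambda s: s and '456' in s)
print([div['id'] for div in divs])  # → ['a', 'b', 'c']
```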

BeautifulSoup: Conditionally Extract href text

Is there a way to use a regular expression to conditionally grab "hrefs"? For example, below I only want the text (TUBB1 and TUBB2) of two of the hrefs:
href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:*"
and just the text of the href target
href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
My final goal is to have a list such as [("TUBB1", "TUBB2"), "P04690"].
Below is the HTML block I have gotten to with the text I want to extract.
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
P04690
Based on the comments, here is one possible solution to select the required elements:
from bs4 import BeautifulSoup
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
P04690'''
soup = BeautifulSoup(html, 'html.parser')
# select all text from elements where href begins with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"
part_1 = tuple(s.text for s in soup.select('[href^="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"]'))
# select text from first element where href begins with "http://www.uniprot.org/uniprot/"
part_2 = soup.select_one('[href^="http://www.uniprot.org/uniprot/"]').text
# combine parts and print them:
print([part_1, part_2])
Prints:
[('TUBB1', 'TUBB2'), 'P04690']
I don't think it's sexy, but I guess this will do (searching the soup for all the <a> tags):
z = soup.find_all('a')
for j in z:
    if "_gene_name" in j['href']:
        print(j.text)
    if "/pdb/protein" in j['href']:
        print(j.text)
Output:
TUBB1
TUBB2
P04690
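Since the question specifically asked about regular expressions: find_all() also accepts a compiled pattern for href, applying re.search to each attribute value. A minimal sketch over a trimmed version of the question's markup (the UniProt anchor is taken from the question's second snippet):

```python
import re
from bs4 import BeautifulSoup

html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>'''
soup = BeautifulSoup(html, 'html.parser')
# href= accepts a compiled pattern; re.search is applied to each href value
genes = tuple(a.text for a in soup.find_all('a', href=re.compile(r'rcsb_gene_name\.value:')))
accession = soup.find('a', href=re.compile(r'uniprot\.org/uniprot/')).text
print([genes, accession])  # → [('TUBB1', 'TUBB2'), 'P04690']
```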

extract content wherever we have a div tag followed by a header tag, using BeautifulSoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution provided in the link below; there the header tag is inside the div tag, but my requirement is a div tag after the header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content from the h3 tag and all the content inside the div tag wherever this combination exists.
You could use the CSS selector h3:has(+div), which selects every <h3> that has a <div> immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Beacause this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference
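If the installed soupsieve is too old to support :has(), the same pairing can be done by walking siblings manually. A sketch that skips whitespace text nodes and keeps only <h3> elements immediately followed by a <div>:

```python
from bs4 import BeautifulSoup

data = '''<h3>header</h3>
<div>some text here
<ul><li>list</li><li>list</li><li>list</li></ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is a P tag, not a DIV</p>'''
soup = BeautifulSoup(data, 'html.parser')
pairs = []
for h3 in soup.find_all('h3'):
    nxt = h3.next_sibling
    # skip the whitespace-only text nodes between tags
    while isinstance(nxt, str) and not nxt.strip():
        nxt = nxt.next_sibling
    if getattr(nxt, 'name', None) == 'div':
        pairs.append((h3.text, nxt.get_text(separator=',', strip=True)))
print(pairs)  # → [('header', 'some text here,list,list,list')]
```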

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into an ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
    previous_element = li.findPrevious()
    # if <li> already wrapped in <ul>, do nothing
    if previous_element and previous_element.name == 'ul':
        continue
    # if <li> is the first element of a <li> group, wrap it in a new <ul>
    if not previous_element or previous_element.name != 'li':
        ulgroup += 1
        ul = soup.new_tag("ul")
        li.wrap(ul)
        uls.append(ul)
    # append rest of <li> group to previously created <ul>
    elif ulgroup > 0:
        uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
<br/>
<p>
<ul>
<li>
stuff1
</li>
<li>
stuff2
</li>
</ul>
<div>
</div>
<ul>
<li>
stuff3
</li>
<li>
stuff4
</li>
<li>
stuff5
</li>
</ul>
</p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use. Different parsers treat malformed html differently.
The following BeautifulSoup methods will help you accomplish what you require
new_tag() - create a new ul tag
append() - To append the newly created ul tag somewhere in the soup tree.
extract() - To extract the li tags one by one (which we can append to the ul tag)
decompose() - To remove any unwanted tags from the tree, which may appear as a result of the parser's interpretation of the malformed html.
My Solution
Let's create a soup object using html5lib parser and see what we get
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. Here, I want to remove the second, empty p, add a new ul tag, and move all the li tags inside it:
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
second_p=soup.find_all('p')[1]
second_p.decompose()
ul_tag=soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
    ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
<head>
</head>
<body>
<p>
<br/>
</p>
<ul>
<li>
stuff
</li>
<li>
stuff
</li>
</ul>
</body>
</html>
