BSound: Conditionally Extract href text - python-3.x

Is there a way to use a regular expression to conditionally grab "hrefs"? For exampe, below I only want the text (TUBB1 and TUBB2) of only two hrefs:
href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:*"
and just the text of the href target
href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
My final goal is to have a list such as [("TUBB1,"TUBB2"),P04960]
Below is the HTML block I have gotten to with the text I want to extract.
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
P04690

Based on the comments, here is one possible solution to select the required elements:
from bs4 import BeautifulSoup
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
P04690'''
soup = BeautifulSoup(html, 'html.parser')
# select all text from elements where href begins with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"
part_1 = tuple(s.text for s in soup.select('[href^="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"]'))
# select text from first element where href begins with "http://www.uniprot.org/uniprot/"
part_2 = soup.select_one('[href^="http://www.uniprot.org/uniprot/"]').text
# combine parts and print them:
print([part_1, part_2])
Prints:
[('TUBB1', 'TUBB2'), 'P04690']

I don't think its sexy, but I guess this will do.
z=i.find_all('a')
for j in z:
if "_gene_name" in j['href']:
print(j.text)
if "/pdb/protein" in j['href']:
print(j.text)
Output:
TUBB1
TUBB2
P04690

Related

How to get data from a tag if it's present in HTML else Empty String if the tag is not present in web scraping Python

Picture contains HTML code for the situation
case 1:
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
case 2:
<li>
<a> some text: </a>
</li>
I want to scrape values for identifiers if it's present, else I want to put an empty string if there is no identifier in that particular case.
I am using scrapy or you can help me with BeautifulSoup as well and will really appreciate your help
It's a little bit unclear what do you want exactly, because your screenshot is little bit different than your example in your question. I suppose you want to search text "some text:" and then get next value inside <strong> (or empty string if there isn't any):
from bs4 import BeautifulSoup
txt = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''
soup = BeautifulSoup(txt, 'html.parser')
for t in soup.find_all(lambda t: t.contents[0].strip() == 'some text:'):
identifier = t.parent.find('strong')
identifier = identifier.get_text(strip=True) if identifier else ''
print('Found:', identifier)
Prints:
Found: 'identifier:''random words'
Found:

extract content wherever we have div tag followed by hearder tag by using beautifulsoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried solution provided in below link.
here the header tag inside div tag...
but my requirement is div tag after header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
also i tried something like this but not worked
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup..find_all({"h3", "div"})
I need content from H3 tag and all the content inside div tag where ever these two combination exists.
You could use CSS selector h3:has(+div) - this will select all <h3> which have div immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Beacause this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
print('Header:')
print(h3.text)
print('Next <div>:')
print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference

How to fix missing ul tags in html list snippet with Python and Beautiful Soup

If I have a snippet of html like this:
<p><br><p>
<li>stuff</li>
<li>stuff</li>
Is there a way to clean this and add the missing ul/ol tags using beautiful soup, or another python library?
I tried soup.prettify() but it left as is.
It doesn't seem like there's a built-in method which wraps groups of li elements into an ul. However, you can simply loop over the li elements, identify the first element of each li group and wrap it in ul tags. The next elements in the group are appended to the previously created ul:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
ulgroup = 0
uls = []
for li in soup.findAll('li'):
previous_element = li.findPrevious()
# if <li> already wrapped in <ul>, do nothing
if previous_element and previous_element.name == 'ul':
continue
# if <li> is the first element of a <li> group, wrap it in a new <ul>
if not previous_element or previous_element.name != 'li':
ulgroup += 1
ul = soup.new_tag("ul")
li.wrap(ul)
uls.append(ul)
# append rest of <li> group to previously created <ul>
elif ulgroup > 0:
uls[ulgroup-1].append(li)
print(soup.prettify())
For example, the following input:
html = '''
<p><br><p>
<li>stuff1</li>
<li>stuff2</li>
<div></div>
<li>stuff3</li>
<li>stuff4</li>
<li>stuff5</li>
'''
outputs:
<p>
<br/>
<p>
<ul>
<li>
stuff1
</li>
<li>
stuff2
</li>
</ul>
<div>
</div>
<ul>
<li>
stuff3
</li>
<li>
stuff4
</li>
<li>
stuff5
</li>
</ul>
</p>
</p>
Demo: https://repl.it/#glhr/55619920-fixing-uls
First, you have to decide which parser you are going to use. Different parsers treat malformed html differently.
The following BeautifulSoup methods will help you accomplish what you require
new_tag() - create a new ul tag
append() - To append the newly created ul tag somewhere in the soup tree.
extract() - To extract the li tags one by one (which we can append to the ul tag)
decompose() - To remove any unwanted tags from the tree. Which may be formed as a result of the parser's interpretation of the malformed html.
My Solution
Let's create a soup object using html5lib parser and see what we get
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
print(soup)
Outputs:
<html><head></head><body><p><br/></p><p>
</p><li>stuff</li>
<li>stuff</li>
</body></html>
The next step may vary according to what you want to accomplish. I want to remove the second empty p. Add a new ul tag and get all the li tags inside it.
from bs4 import BeautifulSoup
html="""
<p><br><p>
<li>stuff</li>
<li>stuff</li>
"""
soup=BeautifulSoup(html,'html5lib')
second_p=soup.find_all('p')[1]
second_p.decompose()
ul_tag=soup.new_tag('ul')
soup.find('body').append(ul_tag)
for li_tag in soup.find_all('li'):
ul_tag.append(li_tag.extract())
print(soup.prettify())
Outputs:
<html>
<head>
</head>
<body>
<p>
<br/>
</p>
<ul>
<li>
stuff
</li>
<li>
stuff
</li>
</ul>
</body>
</html>

I want to extract the href tag using web scraping in python for a website

I want to get the text in the href that is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course where the code below is part of a tag
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>
To find all href-
soup = BeautifulSoup(<HTML content>)
attrs = {'class': ''}
a_tags = soup.find_all("a",)
href_links = list(map(lambda x: x["href"],a_tags))
You can find the HTML content by making a get request to the desired page.
Mention attributes such as class_name in attrs to to tell the program where to look.

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from the span tag within multiple span tags with similar names. Using python, beautifulsoup to parse the website.
Just cannot uniquely identify that specific gross-amount span element.
The span tag has name=nv and a data value but the other one has that too. I just wanna extract the gross numerical dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
Want the text from second span under span class= text muted Gross.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701

Resources