Scrape a span text from multiple span elements of same name within a p tag in a website - python-3.x

I want to scrape the text of one specific span from a p tag that contains several span tags with the same name. I am using Python and BeautifulSoup to parse the website.
I cannot uniquely identify the gross-amount span element.
That span has name="nv" and a data-value attribute, but so does the votes span. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text of the span that follows the <span class="text-muted">Gross:</span> label.

Find the <span> tag whose text is 'Gross:'. Once you have that tag, ask for the next <span> tag (which holds the amount) and get its text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
# string= is the current name of the older text= argument
gross_value = soup.find('span', string='Gross:').find_next('span').text
print(gross_value)
Output:
$69.65M
Or, if you want the data-value instead, change that last line to:
gross_value = soup.find('span', string='Gross:').find_next('span')['data-value']
print(gross_value)
Output:
69,645,701
And finally, if you need the value as an integer instead of a string, so you can aggregate it later:
gross_value = int(soup.find('span', string='Gross:').find_next('span')['data-value'].replace(',', ''))
print(gross_value)
Output:
69645701
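When a page has many such <p> blocks and some of them lack a gross figure, the same lookup can be wrapped in a small helper. This is only a sketch: extract_gross is a hypothetical name, and the second <p> block (without a Gross label) is invented here to show the missing case.

```python
from bs4 import BeautifulSoup


def extract_gross(p_tag):
    """Return the gross as an int, or None if the block has no 'Gross:' label."""
    label = p_tag.find('span', string='Gross:')
    if label is None:
        return None
    # The value is the next sibling span carrying name="nv".
    value_span = label.find_next_sibling('span', attrs={'name': 'nv'})
    if value_span is None:
        return None
    return int(value_span['data-value'].replace(',', ''))


# The second <p> is a made-up block without a gross figure.
html = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1500">1,500</span>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')
gross_values = [extract_gross(p) for p in soup.find_all('p', class_='sort-num_votes-visible')]
print(gross_values)
```

Using find_next_sibling keeps the search inside the same <p>, so a block without a gross figure cannot accidentally pick up the value from a later block.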

Related

How to find all the span tag inside of an element in selenium python?

<div id="textelem" class="random">
<span class="a">
TEXT 1
</span>
<span>
<span>TEXT 2 </span>
</span>
<span>TEXT 3</span>
</div>
Python: TargetElem = self.wait.until(EC.presence_of_element_located((By.ID, "textelem")))
I want to get all the text inside the span tags of the TargetElem element. How can I get all the span elements inside TargetElem and loop through them to build a single string of the collected text? Thank you.
Simply use .text:
TargetElem = self.wait.until(EC.presence_of_element_located((By.ID, "textelem")))
print(TargetElem.text)
I don't think you actually need a loop: textelem is the id of the div and all the span tags are inside that div, so .text on the div should return their combined text.

How to get data from a tag if it's present in HTML else Empty String if the tag is not present in web scraping Python

Picture contains HTML code for the situation
case 1:
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
case 2:
<li>
<a> some text: </a>
</li>
I want to scrape the identifier value if it is present; otherwise I want an empty string for that case.
I am using Scrapy, but a BeautifulSoup solution would also help and I would really appreciate it.
It's a little unclear what you want exactly, because your screenshot differs slightly from the example in your question. I assume you want to search for the text "some text:" and then get the value inside the following <strong> (or an empty string if there isn't one):
from bs4 import BeautifulSoup
txt = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''
soup = BeautifulSoup(txt, 'html.parser')
for t in soup.find_all(lambda t: t.contents[0].strip() == 'some text:'):
    identifier = t.parent.find('strong')
    identifier = identifier.get_text(strip=True) if identifier else ''
    print('Found:', identifier)
Prints:
Found: 'identifier:''random words'
Found:
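If the label text does not matter and every <li> should yield one value, a simpler variant is to walk the list items directly. This is a sketch under that assumption; the wrapping <ul> and the identifiers list are added here for illustration.

```python
from bs4 import BeautifulSoup

# The <ul> wrapper is an assumption; the <li> items come from the question.
html = '''
<ul>
<li><a> some text: </a><strong> 'identifier:''random words' </strong></li>
<li><a> some text: </a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

identifiers = []
for li in soup.find_all('li'):
    strong = li.find('strong')  # None when the item has no <strong>
    identifiers.append(strong.get_text(strip=True) if strong else '')

print(identifiers)
```

Each list item contributes exactly one entry, so the result stays aligned with the items even when some identifiers are missing.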

BSoup: Conditionally Extract href text

Is there a way to use a regular expression to conditionally grab hrefs? For example, below I want the text (TUBB1 and TUBB2) of only two hrefs:
href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:*"
and just the text of the href target
href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
My final goal is to have a list such as [("TUBB1", "TUBB2"), "P04690"]
Below is the HTML block I have gotten to with the text I want to extract.
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
P04690
Based on the comments, here is one possible solution to select the required elements:
from bs4 import BeautifulSoup
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
P04690
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>'''
soup = BeautifulSoup(html, 'html.parser')
# select all text from elements where href begins with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"
part_1 = tuple(s.text for s in soup.select('[href^="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"]'))
# select text from first element where href begins with "http://www.uniprot.org/uniprot/"
part_2 = soup.select_one('[href^="http://www.uniprot.org/uniprot/"]').text
# combine parts and print them:
print([part_1, part_2])
Prints:
[('TUBB1', 'TUBB2'), 'P04690']
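Since the question asks about regular expressions, the same selection can also be sketched with compiled patterns passed as the href filter. The uniprot anchor in this snippet is assembled from the href quoted in the question, and the pattern names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# The last anchor is built from the href fragment quoted in the question.
html = '''
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
'''
soup = BeautifulSoup(html, 'html.parser')

gene_pattern = re.compile(r'rcsb_gene_name\.value:')
uniprot_pattern = re.compile(r'^https?://www\.uniprot\.org/uniprot/')

# A compiled pattern given as an attribute filter is matched with search().
genes = tuple(a.get_text(strip=True) for a in soup.find_all('a', href=gene_pattern))
uniprot_link = soup.find('a', href=uniprot_pattern)
accession = uniprot_link.get_text(strip=True) if uniprot_link else None

result = [genes, accession]
print(result)
```

The None check on uniprot_link avoids an AttributeError when the page has no UniProt link.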
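It isn't elegant, but I guess this will do (i is assumed to be the already-parsed element that contains the links):
z = i.find_all('a')
for j in z:
    # gene-name links
    if "_gene_name" in j['href']:
        print(j.text)
    # protein links
    if "/pdb/protein" in j['href']:
        print(j.text)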
Output:
TUBB1
TUBB2
P04690

Getting the ID if i know the specific span text

My brain crashed.
I'm trying to get the ID of a span when its text matches, using BeautifulSoup. I need a number from the ID, but the ID changes every time I search for a new product, while the text (CORRECT) stays the same. Once I have the number, 11 in this case, I can use it in another part of the code to scrape the information I need.
Example:
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>
I have been reading the documentation but never get it right, or even remotely close. I know how to pull the text (CORRECT) if I know the ID, but not the reverse.
Use find_all() to get the span items with the required text, then read the id attribute and split() its value on -:
from bs4 import BeautifulSoup
html='''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('span', string='CORRECT'):
    print(item['id'].split('-')[2])
It will print:
11
I prefer to use :contains to match on an element's text. It is available in bs4 4.7.1+; newer Soup Sieve releases rename it to :-soup-contains and deprecate the old spelling.
from bs4 import BeautifulSoup as bs
html = '''
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''
soup = bs(html, 'lxml')
target = soup.select_one('span:contains("CORRECT")[id]')
if target is None:
    print("Not found")
else:
    print(target['id'].split('-')[2])
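Both answers rely on the number being the third dash-separated piece of the id. If that position is not guaranteed, a regular expression that grabs the first run of digits is one alternative. This is a sketch; id_number_for_text is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

html = '''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup = BeautifulSoup(html, 'html.parser')


def id_number_for_text(doc, text):
    """Return the first number in the id of the span whose text equals `text`."""
    span = doc.find('span', id=True, string=text)  # id=True: span must have an id
    if span is None:
        return None
    match = re.search(r'\d+', span['id'])
    return int(match.group()) if match else None


print(id_number_for_text(soup, 'CORRECT'))
```

The helper returns None both when no span matches and when the id contains no digits, so the caller can handle both cases with one check.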

Why does attribute splitting happen in BeautifulSoup?

I am trying to get the class attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this happens, and whether there is a more correct way to do it.
The other answer is correct. However, if you want a multi-valued attribute returned as a single string, try re-parsing the parent element with the xml parser:
from bs4 import BeautifulSoup
data='''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''
soup=BeautifulSoup(data,'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
parentdiv=span_autogoal.find_parent('div')
data=str(parentdiv)
soup=BeautifulSoup(data,'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, the class attribute in <div class="detailMS__incidentRow incidentRow--away odd"> is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
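If a plain string is preferred without re-parsing, Beautiful Soup also accepts multi_valued_attributes=None in the constructor, which turns off the list splitting. A minimal sketch on a shortened version of the markup from the question:

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
html = ('<div class="detailMS__incidentRow incidentRow--away odd">'
        '<span class="note-name">(Autogoal)</span></div>')

# Default behaviour: class is split into a list of values.
as_list = BeautifulSoup(html, 'html.parser').div['class']

# With multi_valued_attributes=None every attribute stays a single string.
as_string = BeautifulSoup(html, 'html.parser', multi_valued_attributes=None).div['class']

print(as_list)
print(as_string)
```

Note that the setting applies to the whole document, so any later code that expects class to be a list needs to be adjusted.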
