Finding out if data-sold-out="false" in html using beautifulsoup - python-3.x

data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0" /></a>
<a class="" data-images="
This is the HTML code, and I'm trying to find whether data-sold-out is "false" or "true" so I can then do something with it. How can I find out what data-sold-out is equal to and return it? I am using Python and Beautiful Soup.
Any help appreciated.

You are trying to find any tags with data-sold-out="false" or data-sold-out="true", right?
I think you can do this
from bs4 import BeautifulSoup

all_html = BeautifulSoup('''<a data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0" /></a>
<a class="" data-images="">''', 'html.parser')
a_tag = all_html.find_all(attrs={"data-sold-out": "false"})
Then you can extract any attribute inside them like this:
for item in a_tag:
    print(item['data-style-name'])
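If the goal is to read the attribute's value and branch on it, here is a minimal sketch reusing the all_html soup from above (the in-stock/sold-out handling is an assumption about what you want to do with the value):
# iterate over every tag that has a data-sold-out attribute at all
for tag in all_html.find_all(attrs={"data-sold-out": True}):
    sold_out = tag.get("data-sold-out")        # the value is the string "false" or "true"
    if sold_out == "false":
        print("in stock:", tag.get("data-style-name"))
    else:
        print("sold out:", tag.get("data-style-name"))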

Related

Selenium Can't Find Element Returning None or []

I'm having trouble accessing an element; here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[@class="somethingcss xxx"]')
and I'm also trying another method like this:
desc = driver.find_elements_by_class_name('somethingcss xxx')
The element I'm trying to find looks like this:
<div data-testid="descContainer">
<div class="abc1123">
<h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
<p data-id="paragraphxx" class="somethingcss xxx">sometext here
<br>text
<br>
<br>text
<br> and several text with
<br> tag below
</p>
</div>
<!--and another div tag below-->
I want to extract the p tag inside div class="abc1123", but it doesn't return any result; it only returns [] when I try to get_attribute or extract it to text.
When I extract another element with another class using this method, it works perfectly.
Does anyone know why I can't access these elements?
Try the following CSS selector to locate the p tag.
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
Or use get_attribute("textContent"):
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))

I want to extract the href tag using web scraping in python for a website

I want to get the text in the href, which is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course, where the code below is part of the tag:
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>
To find all hrefs:
from bs4 import BeautifulSoup

soup = BeautifulSoup(<HTML content>, 'html.parser')
attrs = {'class': ''}                        # fill in attribute values here to narrow the search
a_tags = soup.find_all("a", attrs=attrs)     # or soup.find_all("a") to match every <a> tag
href_links = list(map(lambda x: x["href"], a_tags))
You can find the HTML content by making a get request to the desired page.
Mention attributes such as the class name in attrs to tell the program where to look.
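A more complete sketch of the same idea, fetching the page first (the requests dependency and the exact URL are assumptions, not part of the original answer):
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://lecturenotes.in/")            # the page you want to scrape
soup = BeautifulSoup(resp.text, "html.parser")

# href=True skips <a> tags that have no href attribute at all
href_links = [a["href"] for a in soup.find_all("a", href=True)]
print(href_links)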

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span tag within multiple span tags that have similar names, using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name=nv and a data value, but the other one has that too. I just want to extract the gross numerical dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the second span, the one that follows the span class="text-muted" Gross: label.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as integers instead of strings, so you can aggregate them in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
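One small note: text= is an older alias that recent BeautifulSoup versions replace with string=, so the same lookup can also be written as (identical behaviour):
gross_value = soup.find('span', string='Gross:').find_next('span').text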

How to get substring from string using xpath 1.0 in lxml

This is the example HTML.
<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>
I am in a situation where I need to extract
Chamber of Secrets
Prisoners in Azkabahn
I am using lxml 4.2.1 in Python, which uses XPath 1.0.
I have tried to extract using XPath
'substring-after(//a/@href,"HarryPotter:")'
which returns only "Chamber of Secrets".
and with XPath
'//a/@href[substring-after(.,"HarryPotter:")]'
which returns
'HarryPotter:Chamber of Secrets'
'HarryPotter:Prisoners in Azkabahn'
I have researched this and learned some new things, but didn't find a fix for my problem.
I have tried different XPath expressions using substring-after.
In my research, I learned that it could also be accomplished with regex, so I tried that and failed.
I found that it is easy to manipulate a string in XPath 2.0 and above using regex, but we can also use regex in XPath 1.0 via XSLT extensions.
Can this be done with the substring-after function? If yes, what is the XPath; if not, what is the best approach to get the desired output?
And how can we get the desired output using regex in XPath while sticking to lxml?
Try this approach to get both text values:
from lxml import html
raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>"""
source = html.fromstring(raw_source)
for link in source.xpath('//a'):
    print(link.xpath('substring-after(@href, "HarryPotter:")'))
If you want to use substring-after() and substring-before() together, here is an example:
from lxml import html
f_html = """<html><body><table><tbody><tr><td class="df9" width="20%">
<a class="nodec1" href="javascript:reqDl(1254);" onmouseout="status='';" onmouseover="return dspSt();">
<u>
2014-2
</u>
</a>
</td></tr></tbody></table></body></html>"""
tree_html = html.fromstring(f_html)
deal_id = tree_html.xpath("//td/a/@href")
print(tree_html.xpath('substring-after(//td/a/@href, "javascript:reqDl(")'))
print(tree_html.xpath('substring-before(//td/a/@href, ")")'))
print(tree_html.xpath('substring-after(substring-before(//td/a/@href, ")"), "javascript:reqDl(")'))
Result:
1254);
javascript:reqDl(1254
1254
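If regex really is preferred over substring-after(), one simple option with lxml is to let XPath collect the @href values and do the pattern matching in Python; a minimal sketch of that idea (the pattern is an assumption based on the example hrefs, and raw_source is the HTML string from the first example above):
import re
from lxml import html

source = html.fromstring(raw_source)
for href in source.xpath('//a/@href'):
    match = re.search(r'HarryPotter:(.*)', href)    # capture everything after the prefix
    if match:
        print(match.group(1))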

How to extract value from href in python?

Hi developers. I am facing a problem extracting an href value in Python.
There is a "View Answers" button; after clicking it I am taken to another link, and I want to extract the data that is present at that link.
<div class="col-md-11 col-xs-12">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic- dr">
<div class="hover-div">
<h2 itemprop="name">i need a good Orthopedic dr</h2>
</div>
</a>
<div class="thread-details">
<a href="https://www.marham.pk/forum/thread/4471/i-need-a-good-orthopedic-dr">
<p class="pull-left"><span class="glyphicon glyphicon-comment"></span> View Answers (<span itemprop="answerCount">1</span>) </p>
</a>
</div>
</div>
I need to extract this href tag.
You can use data scraping in Python for this.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import bs4 as bs
import urllib.request

# download the page's raw HTML
sauce = urllib.request.urlopen("Your URL WILL GO HERE").read()
# parse it with Beautiful Soup
soup = bs.BeautifulSoup(sauce, 'html5lib')
print(soup)
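The snippet above only prints the whole parsed page; to actually pull the href out of the markup shown in the question, a minimal follow-on sketch would be (the thread-details class comes from the question's HTML, the rest is an assumption):
# each "View Answers" link sits inside a div with class "thread-details"
for div in soup.find_all('div', class_='thread-details'):
    link = div.find('a', href=True)
    if link:
        print(link['href'])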
