How to filter HTML nodes which have text in it from a html page

I am new to web scraping and have run into an issue.
I am using BeautifulSoup to scrape a webpage, and I want to get only the nodes that have text in them.
I tried this using the get_text() method:
import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('FAQ3.html'), "html.parser")
body = soup.find('body')
for i in body:
    # skip comments and bare text nodes, which are not tags
    if type(i) != bs4.element.Comment and type(i) != bs4.element.NavigableString:
        if i.get_text():
            print(i)
But get_text() returns text for a node even if only one of its children has text in it.
Sample HTML:
<div>
    <div id="header">
        <script src="./FAQ3_files/header-home.js"></script>
    </div>
    <div>
        <div>
            this node contain text
        </div>
    </div>
</div>
When checking the topmost div, it returns the whole node because the innermost div has text in it.
How can I iterate over all nodes and keep only the nodes that actually have text in them?

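I used a depth-first search for this; it solved my use case:
def get_text_bs4(self, soup, leaf):
    # Depth-first walk; leaf maps each text to the first tag found holding it
    if soup.name is not None:  # text nodes report name None, so only tags pass
        # .string is None unless the tag has exactly one child, so mixed content is skipped
        if soup.string is not None and soup.name != 'script':
            if soup.text not in leaf:
                leaf[soup.text] = soup
        for child in soup.children:
            self.get_text_bs4(child, leaf)
    return leaf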
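
For comparison, here is a minimal standalone sketch of a related approach. It is not the code from the answer above. Instead of relying on .string, it checks whether a tag has a non-blank text node as a direct child. The function name tags_with_own_text and the skip list are assumptions made for illustration; the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = """<div>
<div id="header">
<script src="./FAQ3_files/header-home.js"></script>
</div>
<div>
<div>
this node contain text
</div>
</div>
</div>"""


def tags_with_own_text(root, skip=("script", "style")):
    """Return tags that have at least one non-blank text node as a direct child."""
    found = []
    for tag in root.find_all(True):  # True matches every tag, in document order
        if tag.name in skip:
            continue
        for child in tag.children:
            # Comment is a subclass of NavigableString, so exclude it explicitly
            if (isinstance(child, NavigableString)
                    and not isinstance(child, Comment)
                    and child.strip()):
                found.append(tag)
                break
    return found


soup = BeautifulSoup(html, "html.parser")
for tag in tags_with_own_text(soup):
    print(tag.name, repr(tag.get_text(strip=True)))
```

With the sample HTML, only the innermost div is reported, because the outer divs contain nothing but whitespace between their child tags.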

Related

How do I retrieve text from a text node in Selenium

So, essentially I want to get the text from the site and print it onto console.
This is the HTML snippet:
<div class="inc-vat">
    <p class="price">
        <span class="smaller currency-symbol">£</span>
        1,500.00
        <span class="vat-text"> inc. vat</span>
    </p>
</div>
How would I go about retrieving the '1,500.00'? I have tried self.browser.find_element_by_xpath('//*[@id="main-content"]/div/div[3]/div[1]/div[1]/text()'), but that throws an error saying: The result of the xpath expression is: [object Text]. It should be an element. I have also tried other approaches such as .text, but they either print only the '£' symbol, print a blank, or throw the same error.

You can use the following CSS selector:
p.price
Sample code:
elem = driver.find_element_by_css_selector("p.price").text.split(' ')[1]
print(elem)
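
Splitting on a single space depends on exactly how the browser renders the whitespace around the number. As a hedged alternative, the sketch below pulls the number out of the element's text with a regular expression. The helper name extract_price and the sample strings are assumptions; the real value of .text may differ from page to page.

```python
import re


def extract_price(element_text):
    """Return the first number such as '1,500.00' found in element_text, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", element_text)
    return match.group(0) if match else None


# Assumed examples of what .text might return for the p.price element
print(extract_price("£ 1,500.00 inc. vat"))
print(extract_price("£\n1,500.00\ninc. vat"))
```

In practice the argument would be the string returned by the .text call shown in the sample code above.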

Selenium Can't Find Element Returning None or []

I'm having trouble accessing an element. Here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[@class="somethingcss xxx"]')
I also tried another method:
desc = driver.find_elements_by_class_name('somethingcss xxx')
The element I am trying to find looks like this:
<div data-testid="descContainer">
    <div class="abc1123">
        <h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
        <p data-id="paragraphxx" class="somethingcss xxx">sometext here
            <br>text
            <br>
            <br>text
            <br> and several text with
            <br> tag below
        </p>
    </div>
    <!--and another div tag below-->
I want to extract the p tag inside div class="abc1123", but it returns no result: I only get [] when I try get_attribute or try to extract the text.
When I extract another element with a different class using this method, it works perfectly.
Does anyone know why I can't access these elements?

Try the following CSS selector to locate the p tag:
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
Or use get_attribute("textContent"):
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))
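
One possible reason the class-name lookup in the question finds nothing is that a class-name locator expects a single class, while somethingcss xxx is two classes separated by a space. The sketch below uses a hypothetical helper, css_for_classes, to build a CSS selector that requires every listed class. Whether this alone fixes the original page is an assumption; the element may also be loaded later by JavaScript.

```python
def css_for_classes(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class in class_attr."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = css_for_classes("p", "somethingcss xxx")
print(selector)

# The selector could then be passed to Selenium in the same style as the answer:
# driver.find_element_by_css_selector(selector)
```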

I want to extract the href tag using web scraping in python for a website

I want to get the href value, which is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course, where the code below is part of a tag:
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>

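To find all href values:
from bs4 import BeautifulSoup

# html_content is a placeholder for the page's HTML as a string
soup = BeautifulSoup(html_content, "html.parser")
attrs = {'class': ''}  # optional filter; fill in and pass as attrs=attrs to find_all
a_tags = soup.find_all("a", href=True)  # href=True skips <a> tags without an href
href_links = [a["href"] for a in a_tags]
You can get the HTML content by making a GET request to the desired page.
Set attributes such as the class name in attrs to tell the program where to look.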
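
If the div from the question sits inside an a tag, another option is to locate the div by its class and walk up to the enclosing link. The sketch below is only an illustration: the surrounding a tag and its example.com href are assumptions, since the question does not show the full markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> wrapper and its href are assumed for illustration
html = """
<a href="https://example.com/course/programming-in-c">
  <div class="subject-content withripple">
    <h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4>
    <span class="course">Course: B.TECH</span>
  </div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for div in soup.find_all("div", class_="subject-content"):
    anchor = div.find_parent("a")  # nearest enclosing <a>, or None if there is none
    if anchor is not None and anchor.has_attr("href"):
        links.append(anchor["href"])
print(links)
```

Matching on class_="subject-content" works even though the div carries two classes, because BeautifulSoup treats class as a multi-valued attribute.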

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list except those whose style attribute has the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
    if not x.has_attr('style'):
        print(x)
Essentially, return me all items in list where style is not equal to: "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?

Just check if the specific style is present in the Tag's style. Style is not considered a multi-valued attribute and the entire string inside quotes is the value of style attribute. Using x.get("style",'') instead of x['style'] also handles cases in which there is no style attribute and avoids KeyError.
for x in test:
    if 'text-align: center' not in x.get("style",''):
        print(x)
You can also use list comprehension to skip a few lines.
test=[x for x in soup.find_all("p") if 'text-align: center' not in x.get("style",'')]
print(test)
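
The substring check above only matches when the style is written exactly as "text-align: center", with that spacing and case. A hedged sketch of a more tolerant variant follows: it splits the style string into properties before comparing. The helper parse_style and the sample paragraphs are assumptions for illustration, and the parsing is deliberately naive (it does not handle semicolons inside quoted values or url(...)).

```python
from bs4 import BeautifulSoup

html = """
<p style="color:green">kept</p>
<p style="text-align:center">dropped (no space after the colon)</p>
<p style="TEXT-ALIGN: center; color: red">dropped (upper case, extra rule)</p>
<p>kept (no style attribute)</p>
"""


def parse_style(style):
    """Split an inline style string into a {property: value} dict, lower-cased."""
    rules = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            rules[prop.strip().lower()] = value.strip().lower()
    return rules


soup = BeautifulSoup(html, "html.parser")
kept = [p for p in soup.find_all("p")
        if parse_style(p.get("style", "")).get("text-align") != "center"]
print([p.get_text() for p in kept])
```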

If you wanted to consider a different approach you could use the :not selector
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the source code and print the location of each one, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string. I want to see the tag around that string.

Consider this context, for example (*):
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (rather than child text nodes) and text="1" to indicate that the element's text content must equal "1" (or any other text you want to search for):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output:
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: in future questions, try to give some context, like the example above. That makes your question more concrete and easier to answer, because people don't have to invent context of their own, which may turn out not to be relevant to your actual situation.
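
As a follow-up sketch to the find_all() approach above: in Beautiful Soup 4.4.0 and later the same argument is also available under the name string, and a compiled regular expression can be passed for partial matches. The partial-match variant below is an addition for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: tags whose .string equals "1"
exact = soup.find_all(True, string="1")
print([tag.name for tag in exact])

# Partial match: text nodes containing "1"; .parent is the tag around each one
partial = [node.parent for node in soup.find_all(string=re.compile("1"))]
print([tag.name for tag in partial])
```

Printing each tag instead of tag.name shows the full surrounding element, which is what the question asks for.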
