beautifulsoup get value of attribute using get_attr method - python-3.x

I'd like to print all items in the list except those whose style attribute has the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
    if not x.has_attr('style'):
        print(x)
Essentially, return me all items in list where style is not equal to: "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?

Just check if the specific style is present in the Tag's style. Style is not considered a multi-valued attribute and the entire string inside quotes is the value of style attribute. Using x.get("style",'') instead of x['style'] also handles cases in which there is no style attribute and avoids KeyError.
for x in test:
    if 'text-align: center' not in x.get("style", ''):
        print(x)
You can also use list comprehension to skip a few lines.
test=[x for x in soup.find_all("p") if 'text-align: center' not in x.get("style",'')]
print(test)
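One caveat with the substring check above: it depends on the exact spacing inside the attribute, so a tag written with style="text-align:center" would slip through. A minimal sketch of a more tolerant comparison follows. The helper names parse_style and is_centered are hypothetical (not part of BeautifulSoup), and the parsing is deliberately naive: it assumes simple declarations with no colons or semicolons inside values.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict (naive)."""
    declarations = {}
    for part in style.split(";"):
        if ":" not in part:
            continue  # skip empty or malformed fragments
        prop, _, value = part.partition(":")
        declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def is_centered(style):
    """True if the style string declares text-align: center, ignoring spacing."""
    return parse_style(style).get("text-align") == "center"


print(is_centered("text-align: center"))
print(is_centered("text-align:center;"))
print(is_centered("color:green"))
print(is_centered(""))
```

With this in place, the loop condition could become: if not is_centered(x.get("style", "")).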

If you want to consider a different approach, you could use the CSS :not() selector:
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

Related

How to scrape nested text between tags using BeautifulSoup?

I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text" but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.
Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. In the case there is no tag (bare content), it will be None.
So you can just use a simple list comprehension to filter out all the tag elements.
from bs4 import BeautifulSoup
html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)
This will print:
['some td text']
after some cleaning of newlines and empty strings.
If you wanted to join up the content afterwards, you could use join:
print('\n'.join(text))
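As an alternative sketch of the same idea, find_all() can be asked for text nodes only (string=True) and limited to direct children (recursive=False), which avoids checking c.name by hand. This assumes a reasonably recent bs4 release, where the string argument is available; the variable names td and direct_text are just for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<td>
<span>some span text</span>
some td text
</td>
'''

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

# string=True matches text nodes; recursive=False keeps only direct children of <td>,
# so the text inside the nested <span> is not visited.
direct_text = [s.strip() for s in td.find_all(string=True, recursive=False) if s.strip()]
print(direct_text)
```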

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tag of 2 different classes in my result.
I am using following command to scrape the data -
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
Now, I have a specific set of information in class 'col-s-12' and another set of information in class 'search-page-text clearfix row'
Now, I want to find children of only div tag with class - 'col-s-12'. When I am running below code, then it looks for children of both the div tags, since I have not specified anywhere which class I want to search
'''
for div in result:
    print(div)
    prod_name = div.find("a" , recursive=False)[0] #should come from 'col-s-12' only
    prod_info = div.find("a" , recursive=False)[0] # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can search for first <a> tag under tag with class="col-s-12" and then use .find_next('a') to search next <a> tag.
Note: .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
<a> This is what I want or variable **prod_name** </a>
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12 > a')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**

Find all elements by partially matched tag in Python ElementTree using XPath

I'm trying to find all heading elements in an XHTML ElementTree, and I was wondering if there is any way to do this with XPath.
<body>
<h1>title</h1>
<h2>heading 1</h2>
<p>text</p>
<h3>heading 2</h3>
<p>text</p>
<h2>heading 3</h2>
<p>text</p>
</body>
My aim is to get all the heading elements in order, and the naive solution doesn't work:
for element in tree.iterfind("h*"):
    foo(element)
Because they should be ordered, I cannot iterate through each heading element individually
headings = {f"h{n}" for n in range(1, 6+1)}
for heading in headings:
    for element in tree.iterfind(heading):
        foo(element)
(but for element in filter(lambda el: el.tag in headings, tree.iter()) works)
and I can't use a regex directly because it breaks on comments (whose tag is not a string)
import re
pattern = re.compile("^h[1-6]$")
is_heading = lambda el: pattern.match(el.tag)
for element in filter(is_heading, tree.iter()):
    foo(element)
(but is_heading = lambda el: isinstance(el.tag, str) and pattern.match(el.tag) works)
None of the solutions are particularly elegant, so I was wondering if there was a better way of finding all heading elements in order using xpath?
Like this:
//*[self::h1 or self::h2 or self::h3]
If you can use lxml, you can use the union operator |...
from lxml import etree
xml = """
<body>
<h1>title</h1>
<h2>heading 1</h2>
<p>text</p>
<h3>heading 2</h3>
<p>text</p>
<h2>heading 3</h2>
<p>text</p>
</body>
"""
tree = etree.fromstring(xml)
for elm in tree.xpath("//h1|//h2|//h3"):
    print(elm.text)
printed output...
title
heading 1
heading 2
heading 3
lxml would also allow you to use the self:: axis, as mentioned in another answer, if you prefer.
Another method.
from simplified_scrapy import SimplifiedDoc,req,utils
html ='''
<body>
<h1>title</h1>
<h2>heading 1</h2>
<p>text</p>
<h3>heading 2</h3>
<p>text</p>
<h2>heading 3</h2>
<p>text</p>
</body>'''
doc = SimplifiedDoc(html)
hs = doc.getElementsByReg('h[1-9]')
print(hs.text)
Result:
['title', 'heading 1', 'heading 2', 'heading 3']
This XPath should work too:
'//*[starts-with(name(), "h") and not(translate(substring(name(),string-length(name())), "0123456789", ""))]'
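If you are limited to the standard-library xml.etree.ElementTree, note that it supports only a subset of XPath, which as far as I know includes neither the self:: axis nor the | union operator. A minimal sketch under that assumption is to walk the tree with iter(), which visits elements in document order, and filter on a set of tag names. The name HEADING_TAGS is only illustrative.

```python
import xml.etree.ElementTree as ET

xml = """
<body>
<h1>title</h1>
<h2>heading 1</h2>
<p>text</p>
<h3>heading 2</h3>
<p>text</p>
<h2>heading 3</h2>
<p>text</p>
</body>
"""

HEADING_TAGS = {f"h{n}" for n in range(1, 7)}  # h1 .. h6

root = ET.fromstring(xml)

# iter() walks the whole tree in document order, so the result stays ordered.
# A set membership test is also safe for nodes whose tag is not a string.
headings = [el.text for el in root.iter() if el.tag in HEADING_TAGS]
print(headings)
```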

how to extract href attribute of ‘a’ element using id= instead of class name

I have the following:
<div id="header-author" class="some random class">
<a id="author-text" class="some random class" href="/page?id=232">
<span class="some random class">
Hello there
</span>
</a>
and I want to extract only the href attribute of the element with id="author-text".
I can't use the class to extract it, because the class is also used by other elements whose href links I do not want to extract.
I have tried this:
soupeddata = BeautifulSoup(my_html_code, "html.parser")
my_data = soupeddata.find_all("a", id= "author-text")
for x in my_data:
    my_href = x.get("href")
    print(my_href)
Thank you in advance and will be sure to upvote/accept answer!
Use this:
my_data = soupeddata.find_all('a', attrs = {'id': 'author-text'})
You can also pass class attribute inside the dict.
From the BeautifulSoup documentation:
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them
into a dictionary and passing the dictionary into find_all() as
the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Using XPath, select node without text sibling

I want to extract some HTML elements with python3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?
Note that the XML tree model in lxml (as well as in the standard module xml.etree) has the concept of a tail: a text node that directly follows an element (its following sibling) is stored as the tail of that element. So your XPath correctly returns only the span element, but in the tree model that element carries a tail holding the text 'foo', and the tail is serialized along with it.
As a workaround, assuming that you don't want to use the tree model further, simply clear the tail before printing:
>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>
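The tail behaviour can also be seen with the standard module xml.etree, which the answer mentions shares this tree model. The sketch below is illustrative only: it uses a compact, well-formed snippet rather than the full HTML page, and it clears the tail on a shallow copy so the original tree is left untouched.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring('<span class="foo"><span class="bar">bar</span> foo </span>')
bar = root.find("span[@class='bar']")

# The text after </span> belongs to the inner element as its tail.
print(repr(bar.tail))

# Work on a shallow copy so that clearing the tail does not alter the tree.
clone = copy.copy(bar)
clone.tail = None
serialized = ET.tostring(clone, encoding="unicode")
print(serialized)
```

Copying first is a design choice for cases where the tree is still needed afterwards; if it is not, assigning to bars[0].tail directly, as shown above, is simpler.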
