How to extract the href attribute of an ‘a’ element using id instead of class name - python-3.x

I have the following:
<div id="header-author" class="some random class">
<a id="author-text" class="some random class" href="/page?id=232">
<span class="some random class">
Hello there
</span>
</a>
and I want to extract only the href attribute of the element with id="author-text".
I can't use the class to extract it because the same class is used by other elements that have href links I do not want to extract.
I have tried this:
soupeddata = BeautifulSoup(my_html_code, "html.parser")
my_data = soupeddata.find_all("a", id="author-text")
for x in my_data:
    my_href = x.get("href")
    print(my_href)
Thank you in advance; I will be sure to upvote/accept the answer!

Use this:
my_data = soupeddata.find_all('a', attrs = {'id': 'author-text'})
You can also pass the class attribute inside the dict.
From the BeautifulSoup documentation:
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them
into a dictionary and passing the dictionary into find_all() as
the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
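Putting the pieces together with the markup from the question, a minimal sketch of the attrs-dict approach might look like this:
from bs4 import BeautifulSoup

my_html_code = '''
<div id="header-author" class="some random class">
  <a id="author-text" class="some random class" href="/page?id=232">
    <span class="some random class">Hello there</span>
  </a>
</div>
'''

soupeddata = BeautifulSoup(my_html_code, "html.parser")

# Match on the id attribute via the attrs dict, then read the href
for a in soupeddata.find_all("a", attrs={"id": "author-text"}):
    print(a.get("href"))  # -> /page?id=232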

Related

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tag of 2 different classes in my result.
I am using following command to scrape the data -
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
Now, I have a specific set of information in class 'col-s-12' and another set in class 'search-page-text clearfix row'.
I want to find the children of only the div tag with class 'col-s-12'. When I run the code below, it looks for the children of both div tags, since I have not specified anywhere which class I want to search:
'''
for div in result:
    print(div)
    prod_name = div.find("a", recursive=False)  # should come from 'col-s-12' only
    prod_info = div.find("a", recursive=False)  # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
This is what I want or variable **prod_name**
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can select the tag with class="col-s-12" and then use .find_next('a') to find the next <a> tag.
Note: .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
This is what I want or variable **prod_name**
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**
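As a side note on the .select() point above: if you really do want both classes in one query, .select() takes a single comma-separated CSS selector rather than a dict. A small sketch, reusing the soup object and the class names from the question:
result = soup.select('div.col-s-12, div.search-page-text.clearfix.row')
for div in result:
    print(div.get('class'))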

I want to extract the href tag using web scraping in python for a website

I want to get the href, which is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course; the code below is part of the tag:
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>
To find all hrefs:
soup = BeautifulSoup(<HTML content>, "html.parser")
attrs = {'class': ''}  # fill in the class (or other attributes) you want to match
a_tags = soup.find_all("a", attrs=attrs)
href_links = list(map(lambda x: x["href"], a_tags))
You can get the HTML content by making a GET request to the desired page.
Mention attributes such as the class name in attrs to tell the program where to look.
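A minimal end-to-end sketch, assuming the requests library is available; the URL comes from the question, and the live page's markup may require different filters:
import requests
from bs4 import BeautifulSoup

url = "https://lecturenotes.in/course/all/btech/electrical-engineering"  # URL from the question
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the href of every <a> tag that actually carries one
a_tags = soup.find_all("a", href=True)
href_links = [a["href"] for a in a_tags]
print(href_links)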

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list, but not containing the style tag = the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
    if not x.has_attr('style'):
        print(x)
Essentially, return me all items in list where style is not equal to: "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?
Just check if the specific style is present in the Tag's style attribute. style is not considered a multi-valued attribute, so the entire string inside the quotes is the value of the style attribute. Using x.get("style", '') instead of x['style'] also handles tags that have no style attribute at all and avoids a KeyError.
for x in test:
    if 'text-align: center' not in x.get("style", ''):
        print(x)
You can also use a list comprehension to skip a few lines.
test = [x for x in soup.find_all("p") if 'text-align: center' not in x.get("style", '')]
print(test)
If you wanted to consider a different approach, you could use the :not selector:
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

Beautiful Soup extract tag attributes, then find_all with multiple attributes

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in, which looks like this:
<div class="title" style="visibility: visible">
From this, i'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div', {'class': 'title', 'style': 'visibility: visible'})
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and style....sometimes more....
Is this possible?
Really appreciate any direction on this.
Many thanks,
Also, you can use the find_all method if you want more than one div in the content.
code:
from bs4 import BeautifulSoup
import json
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser')  # parse the content with BeautifulSoup
div_content = dict(soup.find("div").attrs)
print("div_content : {0}".format(div_content)) #div content
print("style_content : {0}".format(div_content.get("style"))) # style attribute
print("class_content : {0}".format(div_content.get("class")[0])) # class attribute
output:
div_content : {u'style': u'visibility: visible', u'class': [u'title']}
style_content : visibility: visible
class_content : title
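To then feed those extracted attributes back into find_all (the second half of the question), one possible sketch, reusing soup and div_content from the snippet above; the dict name query is just illustrative:
# Build a find_all query from the attributes we just extracted.
# class comes back as a list, so join it back into a single string.
query = {
    "class": " ".join(div_content.get("class", [])),
    "style": div_content.get("style", ""),
}
# Keep only the attributes the tag actually had, so the query adapts
# to tags that have a class but no style (or vice versa).
query = {k: v for k, v in query.items() if v}

matching_divs = soup.find_all("div", attrs=query)
print(matching_divs)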

How do I iterate through elements in Selenium and Python?

I am trying to get to the texts inside the span tags by iterating through the li list of this HTML:
<ol class="KambiBC-event-result__score-list">
<li class="KambiBC-event-result__match">
<span class="KambiBC-event-result__points">1</span>
<span class="KambiBC-event-result__points">1</span>
</li>
</ol>
but I am getting the error
AttributeError: 'list' object has no attribute
'find_element_by_class_name'
on my code:
meci = driver.find_elements_by_class_name('KambiBC-event-result__match')
for items in meci:
    scor = meci.find_element_by_class_name('KambiBC-event-result__points')
    print(scor.text)
You are not using items inside the loop. Your loop should be:
meci = driver.find_elements_by_class_name('KambiBC-event-result__match')
for items in meci:
    scor = items.find_element_by_class_name('KambiBC-event-result__points')
    print(scor.text)
meci.find_element_by_class_name should be items.find_element_by_class_name
To answer your second comment, add ":nth-child(2)" to the end of the class name and look it up as a CSS selector.
The selector '.KambiBC-event-result__points:nth-child(2)' will only match the second child.
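Since :nth-child() is CSS selector syntax, it needs a CSS-selector lookup rather than a class-name lookup; a small sketch, assuming the older Selenium API used in the question:
meci = driver.find_elements_by_class_name('KambiBC-event-result__match')
for items in meci:
    # second <span class="KambiBC-event-result__points"> inside this match
    scor = items.find_element_by_css_selector(
        '.KambiBC-event-result__points:nth-child(2)')
    print(scor.text)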
