I want to extract the href tag using web scraping in python for a website - python-3.x

I want to get the href value, https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course, from the <a> tag that contains the code below:
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>

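To find all hrefs:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")  # html_content: the page's HTML as a string
a_tags = soup.find_all("a", href=True)  # href=True skips <a> tags that have no href
href_links = [a["href"] for a in a_tags]
You can get the HTML content by making a GET request to the desired page.
To tell the program where to look, pass attributes such as the class name, e.g. soup.find_all("a", attrs={"class": "some-class"}, href=True), where "some-class" is a placeholder.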
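For the snippet in the question, the href sits on the enclosing <a> tag rather than on the div itself. A minimal sketch, assuming the div is wrapped in an <a> whose href is the URL from the question (the wrapper markup is reconstructed for illustration and is not copied from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> wrapper is an assumption; only the div comes from the question.
html = '''
<a href="https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&amp;utm_medium=web&amp;utm_campaign=course">
  <div class="subject-content withripple">
    <h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4>
    <span class="course">Course: B.TECH</span>
  </div>
</a>
'''

soup = BeautifulSoup(html, "html.parser")

# Locate the div by one of its classes, then walk up to the nearest enclosing <a>.
div = soup.find("div", class_="subject-content")
anchor = div.find_parent("a") if div else None
href = anchor.get("href") if anchor else None
print(href)
```

If the div is not actually nested inside an <a> on the real page, find_parent returns None and href stays None, so the result should be checked before use.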

Related

Extracting URL basis text of innerhtml Python

I have multiple websites and I want to get the "Contact Us" URL for each of them. The URLs are not necessarily contained in the same class on every website. However, the innerHTML of the link on each website contains the word "contact".
Is there a way to extract a URL from a webpage if the innerHTML contains a specific word?
For example, for the HTML below, I want to extract the URL if the innerHTML contains the word "contact" (case insensitive).
HTML = {
<a class="" style="COLOR: #000000; TEXT-DECORATION: none" href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target=
"_parent">
<font size="2">
<strong>Contact Us</strong>
</font>
</a>
}
Required output:
'http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E'
This is the code I have so far, but it doesn't seem to work:
link=[]
driver.get(main_url)
elements = driver.find_elements_by_xpath("//a").get_attribute('href') # the href is not always contained in a tag
for el in elements:
    if 'contact'.casefold() in str(el.text):
        link.append(el.get_attribute('href'))
Any help is greatly appreciated.
Try this:
import requests
from bs4 import BeautifulSoup
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
contact_links = []
for a in soup.find_all("a"):
    # compare in lower case so "Contact", "CONTACT", etc. all match
    if 'contact' in a.text.lower():
        contact_links.append(a.get('href'))
The output for the url you have mentioned is :-
<font face="Verdana" size="1">Get more details</font>
Try the following code:
link=[]
elements = driver.find_elements_by_xpath("//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') , 'contact')]")
for el in elements:
    link.append(el.get_attribute("href"))
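The same case-insensitive match can be sketched with BeautifulSoup alone, using the HTML from the question. This is only an illustration: the second "About" link is a made-up distractor, and a real page would be fetched first rather than pasted in as a string.

```python
import re
from bs4 import BeautifulSoup

html = '''
<a class="" style="COLOR: #000000; TEXT-DECORATION: none" href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target="_parent">
<font size="2"><strong>Contact Us</strong></font>
</a>
<a href="/about">About</a>
'''

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"contact", re.IGNORECASE)

# get_text() flattens nested tags such as <font> and <strong>,
# so the match works even when the word is not a direct child of <a>.
contact_urls = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if pattern.search(a.get_text())
]
print(contact_urls)
```

Filtering on href=True avoids a KeyError for anchors that carry no href attribute.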

Extract content wherever a div tag follows a header tag, using BeautifulSoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution provided in the link below.
There, the header tag is inside the div tag,
but my requirement is a div tag after the header tag.
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content of the h3 tag and all the content inside the div tag wherever this combination exists.
You could use CSS selector h3:has(+div) - this will select all <h3> which have div immediately after it:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
Further reading:
CSS Selectors reference
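If the :has() selector is not available in your installed version, the same pairing can be sketched by walking siblings by hand. The helper name next_tag_sibling is hypothetical, and the sample data repeats the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is P tag, not DIV</p>
'''

def next_tag_sibling(node):
    """Return the first following sibling that is a tag, skipping text and comments."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None

soup = BeautifulSoup(data, "html.parser")
pairs = []
for h3 in soup.find_all("h3"):
    sibling = next_tag_sibling(h3)
    # Keep the header only when the very next tag is a <div>.
    if sibling is not None and sibling.name == "div":
        pairs.append((h3.get_text(strip=True),
                      sibling.get_text(separator=",", strip=True)))
print(pairs)
```

Checking the immediate tag sibling matters here: find_next_sibling('div') on its own would also pair a header with a div that appears several elements later.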

Python: Beautiful soup to get text

I'm trying to get the link in each <a href> and also the text in the next <td scope = "raw">.
I've tried:
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
records = []
for link in soup.find_all('a'):
    Name = link.text
    Links = link.get('href')
    records.append((Name, Links))
However, this gives me eps8453.htm as the text, since that is the text inside the <a href> tag. Is there any way to get the text, i.e. "10-K", from the <td scope = "raw"> tag next to the <a href> tag?
Please help!
Use find_next to get the <td> tag that follows each <a> tag inside the table.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
records = []
# restrict the search to links inside the table with class "tableFile"
for link in soup.find('table', class_='tableFile').find_all('a'):
    Name = link.text
    Links = link.get('href')
    text = link.find_next('td').contents[0]  # first child of the next <td>
    print(Name, text)
    records.append((Name, Links, text))
Output:
eps8453.htm 10-K
ex31-1.htm EX-31.1
ex31-2.htm EX-31.2
ex32-1.htm EX-32.1
yu-logo.jpg GRAPHIC
yu_sig.jpg GRAPHIC
0001171520-19-000171.txt
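A slightly more defensive variant can be sketched against a small inline table. The markup below is hypothetical and only mimics the layout assumed in the answer (a link cell followed by a type cell); it uses get_text instead of contents[0] so an empty cell yields an empty string.

```python
from bs4 import BeautifulSoup

# Hypothetical table for illustration; the real page's columns may differ.
html = '''
<table class="tableFile">
  <tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>
  <tr><td>1</td><td>FORM 10-K</td><td><a href="/Archives/eps8453.htm">eps8453.htm</a></td><td>10-K</td></tr>
  <tr><td>2</td><td>EXHIBIT 31.1</td><td><a href="/Archives/ex31-1.htm">ex31-1.htm</a></td><td>EX-31.1</td></tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="tableFile")

records = []
if table is not None:
    for link in table.find_all("a", href=True):
        # Step from the link's own cell to the cell right after it in the same row.
        cell = link.find_parent("td")
        type_cell = cell.find_next_sibling("td") if cell else None
        doc_type = type_cell.get_text(strip=True) if type_cell else ""
        records.append((link.get_text(strip=True), link["href"], doc_type))
print(records)
```

Using find_next_sibling on the link's cell keeps the lookup inside the same row, whereas find_next('td') can run on into the following row when the link is in the last cell.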
 

Finding out if data-sold-out="false" in html using beautifulsoup

data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0" /></a>
<a class="" data-images="
This is the HTML code, and I'm trying to find out whether data-sold-out is "false" or "true" so I can then do something with it. How can I find out what data-sold-out is equal to and return it? I am using Python and Beautiful Soup.
Any help appreciated.
Are you trying to find tags with data-sold-out="false" or data-sold-out="true"?
I think you can do this:
from bs4 import BeautifulSoup as bs
all_html = bs('''<a data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0" /></a>
<a class="" data-images="">''', 'html.parser')
a_tag = all_html.find_all(attrs={"data-sold-out": "false"})
Then you can extract any attribute from the matching tags like this:
for item in a_tag:
    print(item['data-style-name'])
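To read the value itself rather than filter on one fixed value, a sketch along these lines may help. The second "Black" entry and its id are invented for the example, and the attribute is assumed to hold the strings "true" or "false".

```python
from bs4 import BeautifulSoup

# Sample markup: the "Gold" tag follows the question; the "Black" tag is hypothetical.
html = '''
<a data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null"></a>
<a data-style-name="Black" data-style-id="20317" data-sold-out="true" data-description="null"></a>
<a class="" data-images=""></a>
'''

soup = BeautifulSoup(html, "html.parser")

availability = {}
# attrs={"data-sold-out": True} matches any tag that has the attribute, whatever its value.
for tag in soup.find_all("a", attrs={"data-sold-out": True}):
    # Attribute values are strings, so convert "true"/"false" to a real boolean.
    sold_out = tag["data-sold-out"].strip().lower() == "true"
    availability[tag["data-style-name"]] = sold_out
print(availability)
```

The resulting dictionary maps each style name to a boolean, which is easier to branch on than the raw attribute string.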

Beautiful Soup extract tag attributes, then find_all with multiple attributes

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in which looks like this:
<div class="title" style="visibility: visible">
From this, I'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div', {'class': 'title', 'style': 'visibility: visible'})
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and a style, sometimes more.
Is this possible?
Really appreciate any direction on this.
Many thanks,
Also, you can use find_all method if you want more than one div in the content
code:
from bs4 import BeautifulSoup
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser')  # parse the content with BeautifulSoup
div_content = dict(soup.find("div").attrs)  # all attributes of the first div
print("div_content : {0}".format(div_content))  # every attribute
print("style_content : {0}".format(div_content.get("style")))  # style attribute
print("class_content : {0}".format(div_content.get("class")[0]))  # first class name
output:
div_content : {u'style': u'visibility: visible', u'class': [u'title']}
style_content : visibility: visible
class_content : title
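The extracted attributes can then be fed back into find_all without hardcoding them. A minimal sketch, assuming the first div on the page is the one to use as a template (the sample markup and the variable names are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="title" style="visibility: visible">First</div>
<div class="title">No style</div>
<div class="title" style="visibility: visible">Second</div>
<div class="other" style="visibility: visible">Other</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Take whatever attributes the template tag happens to have.
template = soup.find("div")
search_attrs = {}
for name, value in template.attrs.items():
    # Multi-valued attributes such as class come back as lists; join them into
    # one string. With several classes this matches only the same class string.
    search_attrs[name] = " ".join(value) if isinstance(value, list) else value

matches = soup.find_all(template.name, attrs=search_attrs)
texts = [m.get_text(strip=True) for m in matches]
print(texts)
```

Because the attribute dictionary is built from the tag at run time, the same code works whether the template has only a class, a class and a style, or more attributes.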
