Python: Beautiful soup to get text - python-3.x

I'm trying to get the link from each <a href> and also the text in the next <td scope = "raw">.
I've tried:
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
records = []
for link in soup.find_all('a'):
    Name = link.text
    Links = link.get('href')
    records.append((Name, Links))
However, this gives me eps8453.htm as the text, since that is the text inside the <a href> tag. Is there any way to get the text (i.e. "10-K") from the <td scope = "raw"> tag that follows the <a href> tag?
Please help!

Use find_next('td') on each <a> tag inside the table to get the <td> tag that follows it.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/1491829/0001171520-19-000171-index.htm"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
records = []
for link in soup.find('table', class_='tableFile').find_all('a'):
    Name = link.text
    Links = link.get('href')
    text = link.find_next('td').contents[0]
    print(Name, text)
    records.append((Name, Links, text))
Output:
eps8453.htm 10-K
ex31-1.htm EX-31.1
ex31-2.htm EX-31.2
ex32-1.htm EX-32.1
yu-logo.jpg GRAPHIC
yu_sig.jpg GRAPHIC
0001171520-19-000171.txt
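The same pattern can be tried offline. The sketch below runs against a small hand-written table that only imitates the layout of the filing index: the markup, the hrefs, and the helper name link_descriptions are assumptions for illustration, not a copy of the SEC page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the filing index table (assumed layout).
SAMPLE_HTML = """
<table class="tableFile">
  <tr><th>Document</th><th>Type</th></tr>
  <tr><td scope="row"><a href="/docs/eps8453.htm">eps8453.htm</a></td><td scope="row">10-K</td></tr>
  <tr><td scope="row"><a href="/docs/ex31-1.htm">ex31-1.htm</a></td><td scope="row">EX-31.1</td></tr>
</table>
"""


def link_descriptions(html):
    """Return (link text, href, text of the next <td>) for each link in the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="tableFile")
    records = []
    for link in table.find_all("a"):
        # find_next() walks forward in document order, so this is the first
        # <td> that opens after the <a> tag.
        cell = link.find_next("td")
        text = cell.get_text(strip=True) if cell is not None else ""
        records.append((link.get_text(strip=True), link.get("href"), text))
    return records


records = link_descriptions(SAMPLE_HTML)
for record in records:
    print(record)
```

Using get_text(strip=True) instead of contents[0] is a deliberate choice here: it returns an empty string for an empty cell rather than raising an IndexError.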
 

Related

How to scrape nested text between tags using BeautifulSoup?

I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text" but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.
Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. In the case there is no tag (bare content), it will be None.
So you can just use a simple list comprehension to filter out all the tag elements.
from bs4 import BeautifulSoup
html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)
This will print:
['some td text']
after some cleaning of newlines and empty strings.
If you wanted to join up the content afterwards, you could use join:
print('\n'.join(text))
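An alternative sketch that should give the same result is to ask Beautiful Soup only for the text nodes that are direct children of the td. The variable names direct_strings and own_text are just illustrative.

```python
from bs4 import BeautifulSoup

html = """
<td>
<span>some span text</span>
some td text
</td>
"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# string=True matches text nodes; recursive=False restricts the search to
# direct children, so the text inside the <span> is skipped.
direct_strings = td.find_all(string=True, recursive=False)

# Drop the whitespace-only nodes and trim the rest.
own_text = [s.strip() for s in direct_strings if s.strip()]
print(own_text)
```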

Extracting URL basis text of innerhtml Python

I have multiple websites and I want to get the "Contact Us" URL for each of them. The URLs are not necessarily in the same class on every website. However, the innerHTML of the link always contains the word "contact".
Is there a way to extract a URL from a webpage if the innerHTML contains a specific word?
For example, for the HTML below, I want to extract the URL if the innerHTML contains the word "contact" (case insensitive).
HTML = {
<a class="" style="COLOR: #000000; TEXT-DECORATION: none" href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target=
"_parent">
<font size="2">
<strong>Contact Us</strong>
</font>
</a>
}
Required output:
'http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E'
This is the code I have so far, but it doesn't seem to work:
link=[]
driver.get(main_url)
elements = driver.find_elements_by_xpath("//a").get_attribute('href') # the href is not always contained in a tag
for el in elements:
    if 'contact'.casefold() in str(el.text):
        link.append(el.get_attribute('href'))
Any help is greatly appreciated,
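Try this:
import requests
from bs4 import BeautifulSoup
# url and headers are assumed to be defined already
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
contact_links = []
for a in soup.find_all("a"):
    # match "contact" case-insensitively in the link's text
    if 'contact' in a.text.lower():
        contact_links.append(a.get('href'))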
The output for the url you have mentioned is :-
<font face="Verdana" size="1">Get more details</font>
Try following code:
link=[]
elements = driver.find_elements_by_xpath("//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') , 'contact')]")
for el in elements:
    link.append(el.get_attribute("href"))
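For pages that do not need a browser, the same case-insensitive match can be sketched with Beautiful Soup alone. The helper name contact_urls and the example.com URLs in the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def contact_urls(html):
    """Return the hrefs of all <a> tags whose visible text contains 'contact'."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        # get_text() also collects text from nested tags such as <font>/<strong>.
        if "contact" in a.get_text().casefold():
            urls.append(a["href"])
    return urls


# Hypothetical sample markup, loosely modelled on the question's snippet.
sample = """
<a href="http://www.example.com/about">About</a>
<a class="" href="http://www.example.com/contact-page" target="_parent">
  <font size="2"><strong>Contact Us</strong></font>
</a>
"""

urls = contact_urls(sample)
print(urls)
```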

Extract content wherever a div tag follows a header tag using BeautifulSoup

I am trying to extract div tags and header tags when they are together.
ex:
<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
I tried the solution from the question linked below, but there the header tag is inside the div tag, whereas in my case the div tag comes after the header tag:
Scraping text in h3 and div tags using beautifulSoup, Python
I also tried something like this, but it did not work:
soup = bs4.BeautifulSoup(page, 'lxml')
found = soup.find_all({"h3", "div"})
I need the content of the h3 tag and all the content inside the div tag wherever this combination occurs.
You could use the CSS selector h3:has(+div), which selects every <h3> that is immediately followed by a <div>:
data = '''<h3>header</h3>
<div>some text here
<ul>
<li>list</li>
<li>list</li>
<li>list</li>
</ul>
</div>
<h3>This header is not selected</h3>
<p>Because this is P tag, not DIV</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for h3 in soup.select('h3:has(+div)'):
    print('Header:')
    print(h3.text)
    print('Next <div>:')
    print(h3.find_next_sibling('div').get_text(separator=",", strip=True))
Prints:
Header:
header
Next <div>:
some text here,list,list,list
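If the :has() selector is not available in your setup, a rough equivalent is to look at the tag that directly follows each <h3> and keep the pair only when that tag is a <div>. This is a sketch on shortened sample markup; the name pairs is illustrative.

```python
from bs4 import BeautifulSoup

data = """<h3>header</h3>
<div>some text here
<ul><li>list</li><li>list</li></ul>
</div>
<h3>not followed by a div</h3>
<p>paragraph</p>
"""

soup = BeautifulSoup(data, "html.parser")
pairs = []
for h3 in soup.find_all("h3"):
    # Passing True matches any tag, so this returns the next sibling element
    # and skips the whitespace-only text nodes between elements.
    nxt = h3.find_next_sibling(True)
    if nxt is not None and nxt.name == "div":
        pairs.append(
            (h3.get_text(strip=True), nxt.get_text(separator=",", strip=True))
        )
print(pairs)
```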
Further reading:
CSS Selectors reference

I want to extract the href tag using web scraping in python for a website

I want to get the text in the href that is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course where the code below is part of a tag
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>
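To find all hrefs:
from bs4 import BeautifulSoup
# html_content is the page's HTML as a string
soup = BeautifulSoup(html_content, 'html.parser')
attrs = {}  # e.g. {'class': 'some-class'} to narrow the search
# href=True keeps only <a> tags that actually have an href
a_tags = soup.find_all("a", attrs=attrs, href=True)
href_links = [a["href"] for a in a_tags]
You can get the HTML content by making a GET request to the desired page.
Put attributes such as the class name in attrs to tell the program where to look.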
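Since the question says the div is part of an <a> tag, another option is to start from the div and climb to the enclosing link. The sketch below assumes that wrapping; the surrounding <a> and its example.com URL are placeholders, not the real page markup.

```python
from bs4 import BeautifulSoup

# Assumed markup: a shortened version of the question's <div>, wrapped in an
# <a> tag with a placeholder URL.
html = """
<a href="https://example.com/course/programming-in-c">
  <div class="subject-content withripple">
    <h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4>
    <span class="course">Course: B.TECH</span>
  </div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
hrefs = []
for div in soup.find_all("div", class_="subject-content"):
    # find_parent("a") climbs the tree to the nearest enclosing <a>, if any.
    anchor = div.find_parent("a")
    if anchor is not None and anchor.has_attr("href"):
        hrefs.append(anchor["href"])
print(hrefs)
```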

Beautifulsoup placing </li> in the wrong place because there is no closing tag [duplicate]

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in the Beautiful Soup documentation, html5lib parses the document the way a web browser does (as does lxml in this case). It tries to fix your document tree by adding or closing tags where needed.
For your example I used lxml as the parser, with the following code:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as stated above).
