Beautiful Soup extract tag attributes, then find_all with multiple attributes - python-3.x

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in which looks like this:
<div class="title" style="visibility: visible">
From this, i'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div),{'class':'title,'style''visibility: visible'}
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and style....sometimes more....
Is this possible?
Really appreciate any direction on this.
Many thanks,

Also, you can use find_all method if you want more than one div in the content
code:
from bs4 import BeautifulSoup
import json
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser') #parse content to BeautifulSoup Module
div_content = dict(soup.find("div").attrs)
print("div_content : {0}".format(div_content)) #div content
print("style_content : {0}".format(div_content.get("style"))) # style attribute
print("class_content : {0}".format(div_content.get("class")[0])) # class attribute
output:
div_content : {u'style': u'visibility: visible', u'class': [u'title']}
style_content : visibility: visible
class_content : title

Related

Python - Web Scraping :How to access div tag of 1 class when I am getting data for div tags for multiple classes

I want div tag of 2 different classes in my result.
I am using following command to scrape the data -
'''
result = soup.select('div', {'class' : ['col-s-12', 'search-page-text clearfix row'] })
'''
Now, I have specific set of information in class 'col-s-12' and another set of information n class 'search-page-text clearfix row'
Now, I want to find children of only div tag with class - 'col-s-12'. When I am running below code, then it looks for children of both the div tags, since I have not specified anywhere which class I want to search
'''
for div in result:
print(div)
prod_name = div.find("a" , recursive=False)[0] #should come from 'col-s-12' only
prod_info = div.find("a" , recursive=False)[0] # should come from 'search-page-text clearfix row' only
'''
Example -
'''
<div class = 'col-s-12'>
This is what I want or variable **prod_name**
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>
'''
You can search for first <a> tag under tag with class="col-s-12" and then use .find_next('a') to search next <a> tag.
Note: .select() method accepts only CSS selectors, not dictionaries.
For example:
txt = '''<div class = 'col-s-12'>
This is what I want or variable **prod_name**
</div>
<div class = 'search-page-text clearfix row'>
<a> This should be stored in variable **prod_info** </a>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
prod_name = soup.select_one('.col-s-12 > a')
prod_info = prod_name.find_next('a')
print(prod_name.get_text(strip=True))
print(prod_info.get_text(strip=True))
Prints:
This is what I want or variable **prod_name**
This should be stored in variable **prod_info**

Getting the ID if i know the specific span text

my brain crashed.
I'm trying to get the ID of a span if specific text matches using BeautifulSoup, this because i need a number from the ID but the ID changes every time when searching for a new product but the product (CORRECT). Purpose of this is because when i have the number, 11 in this case, i can add it in another part of the code to scrape the information i need.
Example:
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>
Been reading documentation but i never seem to get right or not even remotely close. I'm aware how to pull the text (CORRECT) if i know the ID but not reversed.
Find_all() span items with required text and then get the id attribute and split() the attribute value with -
from bs4 import BeautifulSoup
html='''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('span',text='CORRECT'):
print(item['id'].split('-')[2])
It will print:
11
I prefer to use :contains to target the innerText by a specified value. Available for bs4 4.7.1+
from bs4 import BeautifulSoup as bs
html = '''
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''
soup = bs(html, 'lxml')
target = soup.select_one('span:contains("CORRECT")[id]')
if target is None:
print("Not found")
else:
print(target['id'].split('-')[2])

Beautifulsoup filter "find_all" results, limited to .jpeg file via Regex

I would like to acquire some pictures from a forum. The find_all results gives me most what I want, which are jpeg files. However It also gives me few gif files which I do not desire. Another problem is that the gif file is an attachment, not a valid link, and it causes trouble when I save files.
soup_imgs = soup.find(name='div', attrs={'class':'t_msgfont'}).find_all('img', alt="")
for i in soup_imgs:
src = i['src']
print(src)
I tried to avoid that gif files in my find_all selections search, but useless, both jpeg and gif files are in the same section. What should I do to filter my result then? Please give me some help, chief. I am pretty amateur with coding. Playing with Python is just a hobby of mine.
You can filter it via regular expression.Please refer the following example.Hope this helps.
import re
from bs4 import BeautifulSoup
data='''<html>
<body>
<h2>List of images</h2>
<div class="t_msgfont">
<img src="img_chania.jpeg" alt="" width="460" height="345">
<img src="wrongname.gif" alt="">
<img src="img_girl.jpeg" alt="" width="500" height="600">
</div>
</body>
</html>'''
soup=BeautifulSoup(data, "html.parser")
soup_imgs = soup.find('div', attrs={'class':'t_msgfont'}).find_all('img', alt="" ,src=re.compile(".jpeg"))
for i in soup_imgs:
src = i['src']
print(src)
Try the following which I suspect you can shorten. It uses the ends with operator ($) to specify that the src attributes value of the child img elements ends with .jpg (edited to jpg from jpeg in light of OP's comment that it is actually jpg)
srcs = [item['src'] for item in soup.select("div.t_msgfont img[alt=''][src$='.jpg']")]
Have a look at shortening the selector(I can't without seeing the HTML in question), you may well get away with something like
srcs = [item['src'] for item in soup.select(".t_msgfont [alt=''][src$='.jpg']")]
or even
srcs = [item['src'] for item in soup.select(".t_msgfont [src$='.jpg']")]
I would suggest you to use requests-html to find the image resources in the page.
It's pretty simple compared to BeautifulSoup + requests.
Here's the code to do it.
from requests_html import HTMLSession
session = HTMLSession()
resp = session.get(url)
for i in resp.html.absolute_links:
if i.endswith('.jpeg'):
print(i)

I want to extract the href tag using web scraping in python for a website

I want to get the text in the href that is https://lecturenotes.in/course/all/btech/electrical-engineering?utm_source=megamenu&utm_medium=web&utm_campaign=course where the code below is part of a tag
<div class="subject-content withripple"><span class="subject-action" data-type="subscribe" data-toggle="tooltip" data-placement="top" title="" data-original-title="Subscribe"></span><div class="clearfix"></div><span class="short-name text-uppercase">C</span><h4 class="text-truncate text-capitalize mb-0" title="Programming In C">Programming In C</h4><span class="course">Course: B.TECH</span><div class="ripple-container"></div></div>
To find all href-
soup = BeautifulSoup(<HTML content>)
attrs = {'class': ''}
a_tags = soup.find_all("a",)
href_links = list(map(lambda x: x["href"],a_tags))
You can find the HTML content by making a get request to the desired page.
Mention attributes such as class_name in attrs to to tell the program where to look.

how to get soup.find_all to work in BeautifulSoup?

I'm trying to scrape information a page consisting names of attorneys using BeaurifulSoup
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Following is an example of each attorney's names that are nested in HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)
You need to change your soup.find_all to div since the class goes with div and not a
page=requests.get("https://www.foxrothschild.com/people/search%5Bname%5D=&search%5Bkeywod%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=")
#insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)
class="col person-name-position" is a property of a div object, so you need to use:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
a_element = entry.find("a")
#...

Resources