Beautifulsoup filter "find_all" results, limited to .jpeg file via Regex - python-3.x

I would like to download some pictures from a forum. The find_all results give me mostly what I want, which are jpeg files. However, they also include a few gif files that I do not want. Another problem is that the gif file is an attachment, not a valid link, and it causes trouble when I save the files.
soup_imgs = soup.find(name='div', attrs={'class':'t_msgfont'}).find_all('img', alt="")
for i in soup_imgs:
    src = i['src']
    print(src)
I tried to exclude the gif files in my find_all search, but it did not work, because the jpeg and gif files are in the same section. What should I do to filter my results? Please give me some help. I am an amateur at coding; playing with Python is just a hobby of mine.

You can filter it with a regular expression. Please refer to the following example. Hope this helps.
import re
from bs4 import BeautifulSoup
data='''<html>
<body>
<h2>List of images</h2>
<div class="t_msgfont">
<img src="img_chania.jpeg" alt="" width="460" height="345">
<img src="wrongname.gif" alt="">
<img src="img_girl.jpeg" alt="" width="500" height="600">
</div>
</body>
</html>'''
soup=BeautifulSoup(data, "html.parser")
# Escape the dot and anchor the pattern at the end so only names ending in .jpeg match
soup_imgs = soup.find('div', attrs={'class':'t_msgfont'}).find_all('img', alt="", src=re.compile(r"\.jpeg$"))
for i in soup_imgs:
    src = i['src']
    print(src)
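As a variation on the regex filter, here is a rough sketch that checks only the path part of each URL, so an upper-case extension or a query string after the file name does not break the match. The helper name `is_jpeg` and the sample list are made up for illustration.

```python
import os
from urllib.parse import urlparse

def is_jpeg(src):
    """Return True if src points at a .jpg or .jpeg file (hypothetical helper)."""
    if not src:  # attribute missing or empty
        return False
    path = urlparse(src).path        # drop any ?query or #fragment
    ext = os.path.splitext(path)[1]  # e.g. ".jpeg"
    return ext.lower() in (".jpg", ".jpeg")

# Made-up src values, similar to the ones in the example above
sources = ["img_chania.jpeg", "wrongname.gif", "IMG_GIRL.JPG?size=large"]
print([s for s in sources if is_jpeg(s)])
```

Beautiful Soup accepts a function as an attribute filter and calls it with the attribute value, so a helper like this could be passed as `src=is_jpeg` in the `find_all` call instead of the compiled pattern.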

Try the following, which I suspect you can shorten. It uses the ends-with attribute operator ($=) to specify that the src attribute value of the child img elements ends with .jpg (edited from jpeg to jpg in light of the OP's comment that the files are actually jpg).
srcs = [item['src'] for item in soup.select("div.t_msgfont img[alt=''][src$='.jpg']")]
Have a look at shortening the selector (I can't without seeing the HTML in question); you may well get away with something like
srcs = [item['src'] for item in soup.select(".t_msgfont [alt=''][src$='.jpg']")]
or even
srcs = [item['src'] for item in soup.select(".t_msgfont [src$='.jpg']")]
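If the page mixes .jpg and .jpeg names, a selector list (two selectors separated by a comma) should cover both. The markup below is made up for illustration, so treat this as a sketch rather than something tested against the real forum page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the forum post
html = '''<div class="t_msgfont">
<img src="a.jpg" alt="">
<img src="b.gif" alt="">
<img src="c.jpeg" alt="">
</div>'''

soup = BeautifulSoup(html, "html.parser")

# A comma in a CSS selector means "match either of these"
selector = ".t_msgfont img[src$='.jpg'], .t_msgfont img[src$='.jpeg']"
srcs = [img["src"] for img in soup.select(selector)]
print(srcs)
```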

I would suggest using requests-html to find the image resources in the page.
It's pretty simple compared to BeautifulSoup + requests.
Here's the code to do it.
from requests_html import HTMLSession
session = HTMLSession()
resp = session.get(url)  # url is the address of the forum page
for i in resp.html.absolute_links:
    if i.endswith('.jpeg'):
        print(i)
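When staying with BeautifulSoup, the src values may be relative paths. A minimal sketch of turning them into absolute URLs with the standard library is shown here; the page address and the src values are hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://forum.example.com/thread/123"  # hypothetical page address
srcs = ["images/a.jpeg", "/static/b.jpeg", "https://cdn.example.com/c.jpeg"]

# urljoin resolves each src against the page it was found on;
# values that are already absolute are returned unchanged
absolute = [urljoin(page_url, s) for s in srcs]
for link in absolute:
    print(link)
```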

Related

Remove embedded image data from HTML with BeautifulSoup

I would like to use BS4 to remove embedded images to save space, but to leave the tag. For example remove the base64 data but leave <img class="blah" src="data:image/jpeg;base64,<DELETED>
I can do this to remove everything including the tag:
tags=soup.findAll('img')
for match in tags:
    match.decompose()
This removes everything, but I would like to keep the tag reference without the actual binary source.
Is that possible?
Python3
from bs4 import BeautifulSoup
markup = """
<div>
<p>Take the red pill</p>
<img src="data:image/png;base64, iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Follow the white rabbit" />
</div>
"""
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.img
tag['src'] = "data:image/jpeg;base64,"
print(tag)
Outputs
<img alt="Follow the white rabbit" src="data:image/jpeg;base64,"/>
Here is how I managed to do it. Easy really?
for match in tags:
    match['src']='deleted'
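Combining the two answers, a possible sketch is to loop over every img tag and blank only the embedded ones, keeping the "data:...;base64," prefix and leaving ordinary image links alone. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: one embedded image and one ordinary link
markup = '''<div>
<img class="blah" src="data:image/png;base64,iVBORw0KGgo=" alt="embedded">
<img src="photo.jpg" alt="linked">
</div>'''

soup = BeautifulSoup(markup, "html.parser")
for img in soup.find_all("img", src=True):
    src = img["src"]
    if src.startswith("data:"):
        # keep "data:image/png;base64," and drop everything after the comma
        img["src"] = src.split(",", 1)[0] + ","

print(soup)
```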

When Scraping got html with "encoded" part, is it possible to get it

One of the final steps in my project is to get the price of a product. I have everything I need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need is the value that comes after
==">
I don't know if the encoded part is some kind of protection, but the closest I get is this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS. I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if I get no price when parsing the text, can the other methods get it without a different parser?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
I suppose there are ways to do this using beautifulsoup, anyway here is one approach using regex
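# The third-party regex module is used here because the standard re module
# does not support variable-width lookbehind
import regex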
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means digits, \. denotes the period, and the two combined in square brackets with the + mean we want at least one digit or period
The lookaround groups before and after further specify what has to precede and follow a potential match
(?<=data\-price[\=\"\w]+\>) means that any potential match must be preceded by data-price...> where ... is one or more of the characters =, " or a word character (letters, digits, underscore)
Finally, (?=\<\/div) means that any match must be followed by </div
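If installing the regex module is not an option, a capture group does a similar job with the standard re module. This is only a sketch of that alternative, not the answer's original method; source_code here is the snippet from the question.

```python
import re

# The HTML posted in the question
source_code = '''<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>'''

# Match the whole data-price attribute, but capture only the digits after ">"
pattern = re.compile(r'data-price="[^"]*">([\d.]+)</div')
prices = [float(p) for p in pattern.findall(source_code)]
print(prices)
```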
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. As you are able to get this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own that means you only need prices now and for that you can use get_text() which is an inbuilt function present in BeautifulSoup.
This method returns the text between the opening and closing tags.
Syntax of get_text(): tag_name.get_text()
Solution to your problem:
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Searching for all the div tags with class:encoded
a = soup.findAll ('div', {'class' : 'encoded'})
# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)

how to get soup.find_all to work in BeautifulSoup?

I'm trying to scrape information from a page containing the names of attorneys using BeautifulSoup
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Following is an example of each attorney's names that are nested in HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)
You need to change your soup.find_all to div since the class goes with div and not a
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=")
#insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)
class="col person-name-position" is an attribute of the div element, not of the a element, so you need to use:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
    a_element = entry.find("a")
    #...
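To go from the div to the name itself, one option is a CSS selector that picks the a element inside the div and then strips the surrounding whitespace. The sketch below runs on the HTML fragment from the question (with closing tags added), not on the live page.

```python
from bs4 import BeautifulSoup

# Fragment from the question, closed so that it parses cleanly
html = '''<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Select every link nested inside a div that carries the person-name-position class
names = [a.get_text(strip=True) for a in soup.select("div.person-name-position a")]
print(names)
```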

Beautiful Soup extract tag attributes, then find_all with multiple attributes

I am trying to extract the same information which appears numerous times on the same page. I am able to find the tag that it fits in which looks like this:
<div class="title" style="visibility: visible">
From this, i'd like to extract:
class="title"
AND
style="visibility: visible"
Then do a:
find_all('div', {'class': 'title', 'style': 'visibility: visible'})
This is going to happen in numerous instances, so I can't hardcode it. Sometimes the tag will have a class, sometimes a class and style....sometimes more....
Is this possible?
Really appreciate any direction on this.
Many thanks,
Also, you can use find_all method if you want more than one div in the content
code:
from bs4 import BeautifulSoup
import json
data = """<div class="title" style="visibility: visible"> </div>"""
soup = BeautifulSoup(data, 'html.parser') #parse content to BeautifulSoup Module
div_content = dict(soup.find("div").attrs)
print("div_content : {0}".format(div_content)) #div content
print("style_content : {0}".format(div_content.get("style"))) # style attribute
print("class_content : {0}".format(div_content.get("class")[0])) # class attribute
output:
div_content : {u'style': u'visibility: visible', u'class': [u'title']}
style_content : visibility: visible
class_content : title
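The extracted attributes can then be fed back into find_all without hardcoding them. The following is a sketch under the assumption that the first div serves as the template tag; the variable names and the extra divs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: two divs share the template's attributes, one does not
data = '''<div class="title" style="visibility: visible">first</div>
<div class="title">no style</div>
<div class="title" style="visibility: visible">second</div>'''

soup = BeautifulSoup(data, "html.parser")
template = soup.find("div")  # the tag whose attributes we want to reuse

# Multi-valued attributes such as class come back as lists; join them so
# each filter value is a plain string
filters = {key: " ".join(val) if isinstance(val, list) else val
           for key, val in template.attrs.items()}

matches = soup.find_all(template.name, attrs=filters)
print([m.get_text() for m in matches])
```

Because the filter is built from whatever attributes the template tag has, the same code should work whether the tag carries only a class, a class and a style, or more.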

Search entire text for images

I have a problem with a project.
I need to search a string for images.
I want to get the source of the image and modify the html form of the img tag.
For example the image form is:
and I want to change it to:
<div class="col-md-3">
<hr class="visible-sm visible-xs tall" />
<a class="img-thumbnail lightbox pull-left" href="upload/uploader/up_164.jpg" data-plugin-options='{"type":"image"}' title="Image title">
<img class="img-responsive" width="215" src="upload/uploader/up_164.jpg"><span class="zoom"><i class="fa fa-search"></i>
</span></a>
I have done some part of this.
I can find the image, change the form of the html but cannot loop this for all images found in the string.
My code goes like
Using the following function I get the string between two strings
// Get substring between
function GetBetween($var1="",$var2="",$pool){
    $temp1 = strpos($pool,$var1)+strlen($var1);
    $result = substr($pool,$temp1,strlen($pool));
    $dd=strpos($result,$var2);
    if($dd == 0){
        $dd = strlen($result);
    }
    return substr($result,0,$dd);
}
And then I get the image tag from the string
$imageFile = GetBetween("img","/>",$newText);
The next was to filter the source of the image:
$imageSource = GetBetween('src="','"',$imageFile);
And for the last part I call str_replace to do the job:
$newText = str_replace('oldform', 'newform', $newText);
The problem is that when there is more than one image, I cannot loop this process.
Thank you in advance.
The best, simplest and safest way to read an XML or HTML document is to use a parser.
I think it will also save you a lot of time.
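The question is about PHP, but the parser-based loop can be illustrated in Python with BeautifulSoup, in line with the other examples on this page. This is a rough sketch only: the input string and the second file name are made up, and the wrapper is reduced to the a element from the target markup.

```python
from bs4 import BeautifulSoup

# Made-up input text containing two images
text = ('<p>One <img src="upload/uploader/up_164.jpg"/> '
        'two <img src="upload/uploader/up_165.jpg"/></p>')

soup = BeautifulSoup(text, "html.parser")

# find_all returns every image, so the rewrite runs once per img tag
for img in soup.find_all("img", src=True):
    img["class"] = "img-responsive"
    img["width"] = "215"
    link = soup.new_tag("a", attrs={
        "href": img["src"],
        "class": "img-thumbnail lightbox pull-left",
    })
    img.wrap(link)  # puts the new a element around the image in place

print(soup)
```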

Resources