Remove embedded image data from HTML with BeautifulSoup - python-3.x

I would like to use BS4 to remove embedded images to save space, but leave the tag in place. For example, remove the base64 data but leave <img class="blah" src="data:image/jpeg;base64,<DELETED>">
I can do this to remove everything, including the tag:
tags = soup.find_all('img')
for match in tags:
    match.decompose()
This removes everything, but I would like to keep the tag itself without the actual binary data.
Is that possible?

Python3
from bs4 import BeautifulSoup
markup = """
<div>
<p>Take the red pill</p>
<img src="data:image/png;base64, iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Follow the white rabbit" />
</div>
"""
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.img
tag['src'] = "data:image/jpeg;base64,"
print(tag)
Outputs
<img alt="Follow the white rabbit" src="data:image/jpeg;base64,"/>

Here is how I managed to do it. Easy, really:
for match in tags:
    match['src'] = 'deleted'
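The two approaches above can be combined into one loop that only touches embedded images. What follows is a minimal sketch, not a definitive implementation: it assumes every embedded image uses a data: URI containing a comma, and the sample markup and the example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one embedded image and one ordinary linked image
html = '''
<div>
<img class="blah" src="data:image/png;base64,iVBORw0KGgo=" alt="embedded" />
<img src="https://example.com/photo.jpg" alt="linked" />
</div>
'''

soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    src = img.get("src", "")
    # Only rewrite data: URIs; ordinary links are left untouched
    if src.startswith("data:") and "," in src:
        # Keep the header up to the first comma, drop the base64 payload
        header, _, _payload = src.partition(",")
        img["src"] = header + ","

cleaned = [img["src"] for img in soup.find_all("img")]
print(cleaned)
```

Keeping the header means the original MIME type (png here) survives, and images referenced by URL are not modified.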

Related

When Scraping got html with "encoded" part, is it possible to get it

One of the final steps in my project is to get the price of a product. I have everything I need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need is the value that comes after the
==">
I don't know if the encoded part is some kind of protection, but the closest I get is this being returned: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS. I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if parsing the text gives me no price, can the other methods get it without a different parser?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
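I suppose there are ways to do this with BeautifulSoup; anyway, here is one approach using a regular expression.
# Third-party 'regex' module (pip install regex); the standard 're' module rejects the variable-length lookbehind used below
import regex
# Assume 'source_code' is the source code posted in the question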
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means a digit, \. denotes the period, and the two combined in square brackets with the + mean we want one or more digits/periods in a row
The parenthesized groups before and after it (a lookbehind and a lookahead) specify what has to precede and follow a potential match
(?<=data\-price[\=\"\w]+\>) means any potential match must be preceded by data-price...> where ... is one or more of the characters =, " and the word characters A-Za-z0-9_
Finally, (?=\<\/div) means any match must be followed by </div
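If installing an extra module is not an option, the same idea can be sketched with the standard library's re module by replacing the lookbehind with a capture group. This is an alternative formulation, not the answer's original pattern; source_code below is a shortened copy of the markup from the question.

```python
import re

# Shortened copy of the markup posted in the question
source_code = '''
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

# Match the data-price attribute, then capture the digits/periods up to </div
pattern = re.compile(r'data-price="[^"]*">([\d.]+)</div')

# findall returns only the captured group, i.e. the visible price text
prices = [float(p) for p in pattern.findall(source_code)]
print(prices)
```

With a single capture group, findall returns just the captured price, so no lookaround is needed.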
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. Since you can already locate <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own, all you need now is the price, and for that you can use get_text(), a built-in BeautifulSoup method.
It returns all the text found between the opening and closing tags.
Syntax: tag.get_text()
Solution to your problem:
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Search for all div tags with class "encoded"
a = soup.find_all('div', {'class': 'encoded'})
# Use a list comprehension to pull the price text out of each tag
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)

Beautifulsoup filter "find_all" results, limited to .jpeg file via Regex

I would like to download some pictures from a forum. The find_all results give me most of what I want, which are jpeg files. However, they also include a few gif files that I do not want. Another problem is that the gif file is an attachment, not a valid link, and it causes trouble when I save the files.
soup_imgs = soup.find(name='div', attrs={'class':'t_msgfont'}).find_all('img', alt="")
for i in soup_imgs:
    src = i['src']
    print(src)
I tried to exclude the gif files in my find_all search, but without success: both jpeg and gif files are in the same section. How should I filter my results? Please give me some help, chief. I am an amateur at coding; playing with Python is just a hobby of mine.
You can filter it with a regular expression. Please refer to the following example. Hope this helps.
import re
from bs4 import BeautifulSoup
data='''<html>
<body>
<h2>List of images</h2>
<div class="t_msgfont">
<img src="img_chania.jpeg" alt="" width="460" height="345">
<img src="wrongname.gif" alt="">
<img src="img_girl.jpeg" alt="" width="500" height="600">
</div>
</body>
</html>'''
soup=BeautifulSoup(data, "html.parser")
# Escape the dot and anchor the pattern so only src values ending in .jpeg match
soup_imgs = soup.find('div', attrs={'class':'t_msgfont'}).find_all('img', alt="", src=re.compile(r"\.jpeg$"))
for i in soup_imgs:
    src = i['src']
    print(src)
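The extension check can also be written as a plain predicate, which helps when src values carry query strings or mixed-case extensions. The sketch below uses only the standard library; is_jpeg and the candidate list are hypothetical names and data invented for illustration.

```python
import posixpath
from urllib.parse import urlsplit

def is_jpeg(src):
    """Return True when the path part of src ends in .jpg or .jpeg."""
    if not src:
        return False
    # Ignore any ?query or #fragment and look only at the path
    path = urlsplit(src).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in {".jpg", ".jpeg"}

# Hypothetical src values, including an attachment-style link
candidates = [
    "img_chania.jpeg",
    "wrongname.gif",
    "photos/IMG_girl.JPG?size=large",
    "attachment.php?aid=12",
    None,
]
kept = [s for s in candidates if is_jpeg(s)]
print(kept)
```

BeautifulSoup accepts a function as an attribute filter, so a predicate like this could be passed as find_all('img', src=is_jpeg) in place of the compiled pattern.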
Try the following, which I suspect you can shorten. It uses the ends-with attribute selector ($=) to require that the src attribute of the child img elements ends with .jpg (edited from jpeg to jpg in light of the OP's comment that the files are actually jpg).
srcs = [item['src'] for item in soup.select("div.t_msgfont img[alt=''][src$='.jpg']")]
Have a look at shortening the selector (I can't without seeing the HTML in question); you may well get away with something like
srcs = [item['src'] for item in soup.select(".t_msgfont [alt=''][src$='.jpg']")]
or even
srcs = [item['src'] for item in soup.select(".t_msgfont [src$='.jpg']")]
I would suggest using requests-html to find the image resources in the page.
It's pretty simple compared to BeautifulSoup + requests.
Here's the code to do it.
from requests_html import HTMLSession
session = HTMLSession()
# url is the address of the forum page you want to scan
resp = session.get(url)
# absolute_links is the set of links found on the page, as absolute URLs
for i in resp.html.absolute_links:
    if i.endswith('.jpeg'):
        print(i)

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span tag among several span tags with similar names, using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify the specific gross-amount span element.
The span tag has name=nv and a data-value attribute, but the other one has those too. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the text-muted span labelled Gross.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
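On a real listing page some entries may have no Gross: label at all, in which case the chained calls above would raise an AttributeError. Here is a minimal sketch of a guarded version; gross_for is a hypothetical helper, and the second p block without a gross figure is invented sample data.

```python
from bs4 import BeautifulSoup

# First block is from the question; the second (no gross figure) is made up
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1500">1,500</span>
</p>'''

def gross_for(block):
    """Return the gross amount in dollars as an int, or None if absent."""
    label = block.find("span", string="Gross:")
    if label is None:
        return None
    # Stay inside the same <p>: only look at following siblings of the label
    value = label.find_next_sibling("span", attrs={"name": "nv"})
    if value is None:
        return None
    return int(value["data-value"].replace(",", ""))

soup = BeautifulSoup(html, "html.parser")
results = [gross_for(p) for p in soup.find_all("p", class_="sort-num_votes-visible")]
print(results)
```

Using find_next_sibling rather than find_next keeps the search within one entry, so a missing value cannot be filled in from the next entry on the page.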

How to scrape a string from the div tag using Selenium and Python?

I have source code like the code below. I'm trying to scrape out the '11 tigers' string. I'm new to XPath; can anyone suggest how to get it using Selenium or Beautiful Soup? I'm thinking of driver.find_element_by_xpath or soup.find_all.
source:
<div class="count-box fixed_when_handheld s-vgLeft0_5 s-vgPullBottom1 s-vgRight0_5 u-colorGray6 u-fontSize18 u-fontWeight200" style="display: block;">
<div class="label-container u-floatLeft">11 tigers</div>
<div class="u-floatRight">
<div class="hide_when_tablet hide_when_desktop s-vgLeft0_5 s-vgRight0_5 u-textAlignCenter">
<div class="js-show-handheld-filters c-button c-button--md c-button--blue s-vgRight1">
Filter
</div>
<div class="js-save-handheld-filters c-button c-button--md c-button--transparent">
Save
</div>
</div>
</div>
<div class="cb"></div>
</div>
You can use the same CSS selector, .count-box .label-container, for both BS and Selenium.
BS:
from bs4 import BeautifulSoup
page = BeautifulSoup(yourhtml, "html.parser")
# if you need the first one
label = page.select_one(".count-box .label-container").text
# if you need all
labels = page.select(".count-box .label-container")
for label in labels:
    print(label.text)
Selenium:
from selenium.webdriver.common.by import By
# find_elements(By...) replaces find_elements_by_css_selector, which Selenium 4 removed
labels = driver.find_elements(By.CSS_SELECTOR, ".count-box .label-container")
for label in labels:
    print(label.text)
A variant of the answer given by Sers.
page = BeautifulSoup(html_text, "lxml")
# first one: find the outer count-box div, then the label-container div inside it
label = page.find('div', {'class': 'count-box'}).find('div', {'class': 'label-container'}).text
# for all
labels = page.find_all('div', {'class': 'label-container'})
for label in labels:
    print(label.text)
Use the lxml parser, as it's faster. You need to install it explicitly via pip install lxml.
To extract the text 11 tigers, you can use either of the following solutions:
Using css_selector:
from selenium.webdriver.common.by import By
my_text = driver.find_element(By.CSS_SELECTOR, "div.count-box>div.label-container.u-floatLeft").get_attribute("innerHTML")
Using xpath:
my_text = driver.find_element(By.XPATH, "//div[contains(@class, 'count-box')]/div[@class='label-container u-floatLeft']").get_attribute("innerHTML")

Search entire text for images

I have a problem with a project.
I need to search a string for images.
I want to get the source of the image and modify the html form of the img tag.
For example the image form is:
and I want to change it to:
<div class="col-md-3">
<hr class="visible-sm visible-xs tall" />
<a class="img-thumbnail lightbox pull-left" href="upload/uploader/up_164.jpg" data-plugin-options='{"type":"image"}' title="Image title">
<img class="img-responsive" width="215" src="upload/uploader/up_164.jpg"><span class="zoom"><i class="fa fa-search"></i>
</span></a>
I have done some part of this.
I can find the image, change the form of the html but cannot loop this for all images found in the string.
My code goes like this.
Using the following function, I get the string between two strings:
// Get the substring of $pool between $var1 and $var2
function GetBetween($var1, $var2, $pool) {
    $temp1 = strpos($pool, $var1) + strlen($var1);
    $result = substr($pool, $temp1);
    $dd = strpos($result, $var2);
    // strpos returns false when $var2 is missing; fall back to the rest of the string
    if ($dd === false) {
        $dd = strlen($result);
    }
    return substr($result, 0, $dd);
}
And then I get the image tag from the string:
$imageFile = GetBetween("img", "/>", $newText);
The next step was to extract the source of the image:
$imageSource = GetBetween('src="','\"',$imageFile);
And for the last part I call str_replace to do the job:
$newText = str_replace('oldform', 'newform', $newText);
The problem is that when there is more than one image, I cannot loop this process.
Thank you in advance.
The simplest and safest way to read an XML or HTML document is to use a parser rather than string functions.
I think it will also save you a lot of time.
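To illustrate the parser-based approach, here is a rough sketch in Python with BeautifulSoup, the library used elsewhere on this page, rather than PHP. The input string and the second file name (up_165.jpg) are made up; the class names and attributes are copied from the target markup in the question, and only the a/img part of that markup is reproduced.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a text containing two plain img tags
html = (
    '<p>Intro</p><img src="upload/uploader/up_164.jpg" />'
    '<p>More</p><img src="upload/uploader/up_165.jpg" />'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns every img, so the loop covers all images in the text
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue
    # Build the lightbox link from the question's target markup
    link = soup.new_tag("a", href=src, title="Image title")
    link["class"] = ["img-thumbnail", "lightbox", "pull-left"]
    img["class"] = ["img-responsive"]
    img["width"] = "215"
    # Put the img inside the new link, in the img's original position
    img.wrap(link)

new_text = str(soup)
print(new_text)
```

Because the parser hands back each img as an object, there is no need to track string offsets, and the same loop works for any number of images. PHP's DOM extension could be used along similar lines.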
