PIL and skimage taking too long to load 56000 URL's - python-3.x

My dataframe data3 has 56,000 rows, with an image thumbnail URL as one of its column values. I am evaluating whether each of those images is low contrast or not. I let the code below run for 9 hours, but there was still no result and the kernel was still busy. Can you please let me know what's wrong?
P.S. I tried the code with a subset of my dataframe (100 rows) and it ran successfully in 3 seconds. By that standard, 56,000 rows should take about 30 minutes. Is there a memory overrun happening with temp files or something?
Maybe I need to introduce a try block here to catch any exceptions (even though no error is showing)? I'm not sure how to do that.
from PIL import Image
import urllib.request
import skimage.exposure  # import the submodule explicitly; "import skimage" alone may not expose it

def f(row):
    URL = row['ThumbnailURL']
    # URL = 'http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167'
    with urllib.request.urlopen(URL) as url:
        with open('temp.jpg', 'wb') as out:  # renamed from f to avoid shadowing the function name
            out.write(url.read())
    tutu = Image.open('temp.jpg')
    val = skimage.exposure.is_low_contrast(tutu, fraction_threshold=0.4,
                                           lower_percentile=1, upper_percentile=99,
                                           method='linear')
    return val

data3['lowcontornot'] = data3.apply(f, axis=1)

The solution below avoids saving temporary images to disk, which reduces the overhead:
def f(row):
    url = row['ThumbnailURL']
    img = Image.open(BytesIO(requests.get(url).content))
    return is_low_contrast(img, fraction_threshold=0.4, lower_percentile=1,
                           upper_percentile=99, method='linear')
To give this code a try, you need to include the following imports:
import requests
from io import BytesIO
from PIL import Image
from skimage.exposure import is_low_contrast
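Before a 56,000-row run, the thresholds above can be sanity-checked locally on synthetic images, with no downloads involved. A minimal sketch (the image sizes and pixel values are arbitrary choices for illustration):

```python
import numpy as np
from skimage.exposure import is_low_contrast

flat = np.full((64, 64), 128, dtype=np.uint8)   # nearly uniform image
contrasty = np.zeros((64, 64), dtype=np.uint8)
contrasty[:, 32:] = 255                         # half black, half white

print(is_low_contrast(flat, fraction_threshold=0.4,
                      lower_percentile=1, upper_percentile=99))       # True
print(is_low_contrast(contrasty, fraction_threshold=0.4,
                      lower_percentile=1, upper_percentile=99))       # False
```

As for the try block the question mentions: wrapping the per-row download in try/except and returning e.g. None on failure would keep one bad URL from aborting the whole apply.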

Related

boto3 read bucket files using concurrency method

I am trying to read bucket files without saving them to disk first:
import boto3
import botocore
from io import StringIO
import pandas as pd
s3 = boto3.resource('s3', config=botocore.config.Config(signature_version=botocore.UNSIGNED))
bucket = s3.Bucket('deutsche-boerse-xetra-pds')
objects = bucket.objects.filter(Prefix=date)
for obj in objects:
    file = pd.read_csv(StringIO(bucket.Object(key=obj.key).get().get('Body').read().decode('utf-8')))
This code works quite well. However, I would like to use concurrency (Python asyncio) to speed up the reading process. I searched the documentation, but I could only find something for the download function, not for the get function.
Do you have any suggestion?
Thanks in advance.
I found a solution that works with multiprocessing, since my final goal was to reduce the processing time. The code is as follows:
import multiprocessing
from typing import List

def generate_bucket():
    s3_resource = boto3.resource('s3', config=botocore.config.Config(signature_version=botocore.UNSIGNED))
    xetra_bucket = s3_resource.Bucket('deutsche-boerse-xetra-pds')
    return s3_resource, xetra_bucket

def read_csv(key):
    s3_local, bucket_local = generate_bucket()
    return pd.read_csv(StringIO(bucket_local.Object(key=key).get().get('Body').read().decode('utf-8')))

def import_raw_data(date: List[str]) -> pd.DataFrame:
    s3_local, bucket_local = generate_bucket()
    keys = [obj.key for obj in bucket_local.objects.filter(Prefix=date[0])]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        return pd.concat(p.map(read_csv, keys))
For me it works, but I am sure there is room to improve this code. I'm open to suggestions.
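One possible refinement: since each S3 GET is I/O-bound rather than CPU-bound, a thread pool would avoid the process-spawning and pickling overhead of multiprocessing. A sketch of the pattern, where fetch is only a stand-in for the bucket_local.Object(...).get() call above (the stub body is illustrative, not real S3 code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # placeholder for: bucket_local.Object(key=key).get()['Body'].read()
    return "data-for-" + key

keys = ["a.csv", "b.csv", "c.csv"]
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order, so results line up with keys
    results = list(pool.map(fetch, keys))
print(results)  # -> ['data-for-a.csv', 'data-for-b.csv', 'data-for-c.csv']
```

With real S3 objects, fetch would return a DataFrame and the results could be fed straight into pd.concat, just as in the multiprocessing version.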

Python Pillow ImageChops.difference always None

I'm trying to compare screenshots of 2 interactive maps. The screenshots are taken with Selenium and using Pillow to compare.
...
from selenium.webdriver.common.by import By
from selenium import webdriver
from io import BytesIO
from PIL import ImageChops, Image
...
png_bytes1 = driver.find_element(By.CSS_SELECTOR, "body").screenshot_as_png
png_bytes2 = driver2.find_element(By.CSS_SELECTOR, "body").screenshot_as_png
img1 = Image.open(BytesIO(png_bytes1))
img2 = Image.open(BytesIO(png_bytes2))
diff = ImageChops.difference(img1, img2)
print(diff.getbbox())
But diff is always blank: diff.show() shows nothing and diff.getbbox() prints None. I manually used img1.show() and img2.show() to confirm the screenshots themselves look correct. What am I doing wrong, and is there a better way of doing it?
Update: It works if I first save these images as JPEGs. Does anyone have an idea why?
It seems ImageChops.difference() only works as expected when both inputs are in a directly comparable mode. Screenshots opened from PNG bytes are PngImageFile objects in RGBA mode (with an extra alpha channel), so they need to be converted first with converted_img1 = img1.convert('RGB').
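The convert('RGB') fix can be verified without Selenium by diffing two synthetic RGBA images; the sizes, colors, and patch position below are arbitrary:

```python
from PIL import Image, ImageChops

img1 = Image.new("RGBA", (50, 50), (255, 0, 0, 255))   # solid red
img2 = img1.copy()
img2.paste((0, 255, 0, 255), (10, 10, 20, 20))         # change a 10x10 patch

diff = ImageChops.difference(img1.convert("RGB"), img2.convert("RGB"))
print(diff.getbbox())   # -> (10, 10, 20, 20), the bounding box of the change
```

getbbox() returns None only when the difference image is entirely black, i.e. when the two converted images are pixel-identical.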

Threading/Async in Requests-html

I have a large number of links I need to scrape from a website: ~70 base links, and from them over 700 further links to be scraped. Without threading/async this takes about 2-3 hours, so I decided to try threads/async to speed it up.
My problem is that I need to render some JavaScript in order to get the links in the first place. I have been using requests-html to do this, as its html.render() method is very reliable. However, when I try to run this using threading or async, I run into a host of problems. I tried AsyncHTMLSession due to this GitHub PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me to that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession

links = (...)  # tuple of links to scrape
n = 5
batch = [links[i:i + n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []
    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')
    return tmp_next

pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
I was able to fix this with some help from the learnpython subreddit. It turns out requests-html probably uses threads internally, so threading on top of its threads causes problems; simply using a multiprocessing pool instead works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)

Opening Image from Website

I was trying to make a simple program to pull an image from the website xkcd.com, and I seem to be running into a problem where it raises AttributeError: 'list' object has no attribute 'show'. Does anyone know how to fix this?
import requests
from lxml import html
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
tree = html.fromstring(r.content)
final = tree.xpath('//*[@id="comic"]/img')
final.show()
Your call to requests.get is retrieving the actual image, i.e. the raw bytes of the PNG. There is no HTML to parse or search with XPath.
Note here, the content is bytes:
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
print(r.content)
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xe4\x00\x00\x01#\x08\x03\x00\x00\x00M\x7f\xe4\xc6\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f
Here you see that you can save the results directly to disk.
import requests
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
with open("myimage.png", "wb") as f:
    f.write(r.content)
[Edit] And to show the image (you will need to install Pillow):
import requests
from PIL import Image
import io
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
img = Image.open(io.BytesIO(r.content))
img.show()
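If the original goal was to locate the <img> element, note that xpath() returns a list of elements (hence the 'list' object has no attribute 'show' error), and you would need to fetch the comic's HTML page rather than the image URL. A minimal sketch using an inline HTML snippet that mimics the xkcd page structure (the snippet is illustrative, not the real page source):

```python
from lxml import html

page = ('<html><body><div id="comic">'
        '<img src="//imgs.xkcd.com/comics/self_driving_issues.png"/>'
        '</div></body></html>')
tree = html.fromstring(page)
imgs = tree.xpath('//*[@id="comic"]/img')   # a list of <img> elements
src = imgs[0].get("src")                    # take the first match, read its src
print("http:" + src)                        # protocol-relative URL -> full URL
```

The resulting URL is what you would then pass to requests.get and Image.open as in the answer above.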

Storing matplotlib images in S3 with S3.Object().put() on boto3 1.5.36

Amongst other things I am drawing plots using matplotlib, which I would like to immediately store as S3 objects.
According to the answers provided in this and this other question and the fine manual, I need S3.Object.put() to move my data into AWS and the procedure should be along the lines of
from matplotlib import pyplot as plt
import numpy as np
import boto3
import io
# plot something
fig, ax = plt.subplots()
x = np.linspace(0, 3*np.pi, 500)
a = ax.plot(x, np.sin(x**2))
# get image data, cf. https://stackoverflow.com/a/45099838/1129682
buf = io.BytesIO()
fig.savefig(buf, format="png")
buf.seek(0)
image = buf.read()
# put the image into S3
s3 = boto3.resource('s3', aws_access_key_id=awskey, aws_secret_access_key=awssecret)
s3.Object(mybucket, mykey).put(ACL='public-read', Body=image)
However, I end up with a new S3 object with content-length zero.
The following gives me a new S3 object with content-length 6.
s3.Object(mybucket, mykey).put(ACL='public-read', Body="foobar")
When I put the next line, I end up with content in the S3 object, but it's not a usable image:
s3.Object(mybucket, mykey).put(ACL='public-read', Body=str(image))
I can make it work by going through an actual file, like this:
with open("/tmp/iamstupid", "wb") as fh:
    fig.savefig(fh, format="png")
s3.Bucket(mybucket).upload_file("/tmp/iamstupid", mykey)
So that route works; I am just unable to use the put interface correctly. What am I doing wrong? How can I achieve my goal using S3.Object.put()?
I was able to resolve it. I found the answer in this question. It's a Python 3 thing. From what I understand, Python 3 "usually" works with unicode strings, and if you want raw bytes you have to be explicit about it. The correct usage therefore is
s3.Object(mybucket, mykey).put(ACL='public-read', Body=bytes(image))
I find this a bit strange, since buf.read() is already supposed to return an object of type bytes, but I stopped wondering because it works now.
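For reference, the whole in-memory path can be checked locally by inspecting the buffer before calling put(); buf.getvalue() also sidesteps the seek()/read() pair. Everything below runs without AWS credentials (the put() call is left commented out):

```python
import io
import matplotlib
matplotlib.use("Agg")          # headless backend, no display needed
from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 3 * np.pi, 500)
ax.plot(x, np.sin(x ** 2))

buf = io.BytesIO()
fig.savefig(buf, format="png")
image = buf.getvalue()         # entire buffer as bytes, no seek(0) required

print(type(image))             # <class 'bytes'>
print(image[:4])               # b'\x89PNG' -- the PNG magic number
# s3.Object(mybucket, mykey).put(ACL='public-read', Body=image)
```

If image[:4] is b'\x89PNG' here but the uploaded object is still empty or corrupt, the problem lies in the put() call rather than in the figure rendering.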