How do I open images stored in GCP in Google datalab? - python-3.x

I have been trying to open an image stored in a GCP bucket from my Datalab notebook. When I use Image.open(), it fails with "No such file or directory: 'images/00001.jpeg'".
My code is:
nama_bucket = storage.Bucket("sample_bucket")
for obj in nama_bucket.objects():
    Image.open(obj.key)
I just need to open the images stored in the bucket and view them. Thanks for the help!

I was able to reproduce the issue and get the same error as you (No such file or directory).
I will describe the workaround I used to solve it. First, though, there are a few issues that I can see in the code snippet provided:
Class IPython.display.Image has no method 'open'.
You will need to wrap the Image constructor in a display() call.
With the Storage APIs for Google Cloud Datalab, what resolved the issue for me was using the url parameter instead of the filename: obj.key is the object's name inside the bucket, not a path on the notebook's local filesystem, so it cannot be opened as a local file.
Here is the solution that worked for me:
import google.datalab.storage as storage
from IPython.display import Image
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    display(Image(url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)))
Let me know if it helps!
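The URL pattern above works as written only when object names contain URL-safe characters and, as far as I know, only for objects that are publicly readable, since the request is unauthenticated. Here is a minimal sketch of building the URL with percent-encoding. The helper name public_url and the sample names are made up for illustration:

```python
from urllib.parse import quote


def public_url(bucket_name, key):
    """Build the storage.googleapis.com URL for an object (hypothetical helper)."""
    # Percent-encode the object name; '/' is left untouched (quote's default safe character).
    return 'https://storage.googleapis.com/{}/{}'.format(bucket_name, quote(key))


print(public_url('my-bucket', 'images/my photo 01.jpeg'))
```

Keys without special characters come out unchanged, so the helper can be used for every object in the loop.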
EDIT 1:
As you mentioned that you're using PIL and would like your images to be handled by it, here's a way to achieve that (I have tested it and it worked well for me):
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
    display(img)
Notice that this way you will not need to use IPython.display.Image at all.
EDIT 2:
Indeed, the error "cannot identify image file <_io.BytesIO object at 0x7f8f33bdbdb0>" appears because you have a directory in your bucket. To solve this issue, it's important to understand how Google Cloud Storage sub-directories work.
Here's how I organized the files in my bucket to replicate your situation:
my-bucket/
    img/
        test-file-1.png
        test-file-2.png
        test-file-3.jpeg
    test-file-4.png
gsutil creates the illusion of a hierarchical file tree by applying a variety of rules so that naming works the way users would expect. In fact, test-files 1-3 just happen to have '/' in their names, and there is no actual 'img' directory.
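To make the flat namespace concrete, here is a small sketch that needs no cloud access. The list of keys is hypothetical and mirrors the layout above. The 'img/' entry stands for the zero-byte placeholder object that is typically created when a folder is made through the Cloud Console, which is the kind of object PIL cannot identify as an image:

```python
# Hypothetical object names, as a bucket listing would return them: one flat list of strings.
keys = [
    'img/',
    'img/test-file-1.png',
    'img/test-file-2.png',
    'img/test-file-3.jpeg',
    'test-file-4.png',
]


def is_folder_placeholder(key):
    """A 'directory' is just an object whose name ends with '/'."""
    return key.endswith('/')


placeholders = [k for k in keys if is_folder_placeholder(k)]
real_objects = [k for k in keys if not is_folder_placeholder(k)]
print(placeholders)
print(real_objects)
```

Skipping keys that end with '/' is one simple way to avoid handing a placeholder to Image.open().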
You can still list all images from your bucket. With the structure I mentioned above, this can be achieved, for example, by checking each file's extension:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image (endswith accepts a tuple of suffixes)
    if obj.key.lower().endswith(('.jpg', '.jpeg', '.png')):
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
If you need to get only the images "stored in a particular sub-directory" of your bucket, you will also need to check the files by name:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
folder = '<name-of-the-directory>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image AND that it has the required sub-directory in its name
    if obj.key.lower().endswith(('.jpg', '.jpeg', '.png')) and folder in obj.key:
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
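The two checks above can be folded into one small helper. This is only a sketch: the name is_image_in_folder is hypothetical, and it matches the folder as a name prefix rather than with `folder in obj.key`, on the assumption that you want objects under that prefix and not any key that merely contains the same text:

```python
import os

IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def is_image_in_folder(key, folder=None):
    """Return True if key has an image extension and, when folder is given, starts with 'folder/'."""
    extension = os.path.splitext(key)[1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return False
    if folder is None:
        return True
    # Normalise 'img' and 'img/' to the same prefix before comparing.
    return key.startswith(folder.rstrip('/') + '/')


print(is_image_in_folder('img/test-file-1.png', 'img'))
print(is_image_in_folder('not-img/test-file-1.png', 'img'))
```

Inside the loop this would be called as is_image_in_folder(obj.key, folder). A key such as 'not-img/test-file-1.png' is rejected here, whereas the substring test would accept it.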

Related

Saving (.svg) images using Scrapy

I'm using Scrapy and I want to save some of the .svg images from a webpage locally on my computer. The URLs for these images have the structure '__.com/svg/4/8/3/1425.svg' (each is a full working URL, https included).
I've defined the item in my items.py file:
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I've added the following to my settings:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True
In the main parse function I'm calling:
imageItem = ImageItem()
imageItem['image_urls'] = [url]
yield imageItem
But it doesn't save the images. I've followed the documentation and tried numerous things, but I keep getting the following error:
StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>
During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>
Am I missing something? Can anyone help? I am fully stumped.
Gallaecio was right! Scrapy was having an issue with the .svg file type. I changed the ImagesPipeline to the FilesPipeline and it works!
For anyone stuck, the documentation is here.
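For reference, the switch described above might look roughly like this in settings.py. FilesPipeline, FILES_STORE, and the file_urls/files item fields are Scrapy's documented defaults. The store path is simply reused from the question:

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '../Data/Silks'  # path reused from the question
MEDIA_ALLOW_REDIRECTS = True

# The item then needs 'file_urls' and 'files' fields instead of
# 'image_urls' and 'images', because FilesPipeline reads its URLs from 'file_urls'.
```

FilesPipeline stores the downloaded bytes as they are and does not hand them to PIL, which is why the .svg files no longer trigger UnidentifiedImageError.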
Python Imaging Library (PIL), which is used by the ImagesPipeline, does not support vector images.
If you still want to benefit from the ImagesPipeline capabilities and not switch to the more general FilesPipeline, you can do something along these lines:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPM
from io import BytesIO
class SvgCompatibleImagesPipeline(ImagesPipeline):
    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            # Render the SVG to PNG in memory and swap it into the response body
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())
        else:
            res = response
        return super().get_images(res, request, info, item=item)
This will replace the SVG image in the response body by a PNG version of it, which can be further processed by the regular ImagesPipeline.
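One caveat with the startswith('<svg') check is that many SVG files begin with an XML declaration, a comment, or a DOCTYPE before the root element. A possible, more tolerant test is sketched below. The helper name looks_like_svg is hypothetical and not part of Scrapy, and the regular expression only covers the common prolog shapes:

```python
import re

BOM = b"\xef\xbb\xbf"

# Optional XML declaration, comments and DOCTYPE that may precede the root element.
_PROLOG = re.compile(
    rb"\s*(?:<\?xml[^>]*\?>\s*)?(?:<!--.*?-->\s*)*(?:<!DOCTYPE[^>]*>\s*)?",
    re.DOTALL,
)


def looks_like_svg(body: bytes) -> bool:
    """Return True if the first element after any XML prolog is <svg."""
    if body.startswith(BOM):
        body = body[len(BOM):]
    # Every part of the pattern is optional, so match() always succeeds.
    rest = body[_PROLOG.match(body).end():]
    return rest[:4].lower() == b"<svg"


print(looks_like_svg(b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"></svg>'))
```

In get_images this could stand in for the text check, for example as looks_like_svg(response.body).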

How to download an image from the internet using google colab jupyter

I need to download an image using a URL. I managed to obtain the URLs of the images I need to download, but now I'm lost on how to download them to my local computer. I'm using Google Colab/Jupyter. Thank you!
Here's my code so far:
from bs4 import BeautifulSoup
import requests
import json
import urllib.request
#use Globe API to get data
#input userid - plan: have program read userids from csv or excel file
userid = xxxxxxxx
#use Globe API to get data
source = requests.get('https://api.globe.gov/search/v1/measurement/protocol/measureddate/userid/?protocols=land_covers&startdate=2020-05-04&enddate=2020-07-16&userid=' + str(userid) +'&geojson=FALSE&sample=FALSE').text
#set up BeautifulSoup4
soup = BeautifulSoup(source, 'lxml')
#Isolate the Json data and put it into a string called "paragraph"
body = soup.find('body')
paragraph = body.p.text
#load the string into a python object
data = json.loads(paragraph)
#pick out the needed information and store them
for landcover in data['results']:
    siteId = landcover['siteId']
    measuredDate = landcover['measuredDate']
    latitude = landcover['latitude']
    longitude = landcover['longitude']
    protocol = landcover['protocol']
    DownURL = landcover['data']['landcoversDownwardPhotoUrl']
    #Here is where I want to download the url contained in 'DownURL'
Try:
from google.colab import files as FILE
import os
img_data = requests.get(DownURL).content
with open('image_name.jpg', 'wb') as handler:
    handler.write(img_data)
FILE.download('image_name.jpg')
os.remove('image_name.jpg') # to save up space
To avoid reusing the same file name on every loop iteration, you can generate a random name or use a counter variable that increments on each iteration.
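As a sketch of the counter idea, the file name can be derived from the URL itself and prefixed with the loop index. The helper name filename_for and the example URLs are made up for illustration:

```python
import os
from urllib.parse import urlparse


def filename_for(url, index):
    """Build a local file name from the URL's last path segment, prefixed with a counter."""
    # urlparse(...).path drops the query string; fall back to 'image' if the path has no file name.
    name = os.path.basename(urlparse(url).path) or 'image'
    return '{:04d}_{}'.format(index, name)


print(filename_for('https://example.com/photos/2020/07/abc.jpg?size=large', 3))
```

In the loop from the question, the index could come from enumerate(data['results']), and the returned name would replace the fixed 'image_name.jpg'.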

How to upload only the filename (not the entire directory name + filename) to a Google storage bucket

I have found a way to upload multiple CSV files that satisfy certain criteria to a Google Cloud Storage bucket at once. The problem is that when they upload to the bucket, the entire path name of each file is uploaded with it. I want to upload only the actual file name.
I have tried using os.path.basename, but it doesn't work. Is there another way to obtain just the basename before the file gets uploaded, or a way to simply rename the file before it gets uploaded?
import glob
import os
from pathlib import Path
from os import listdir
from google.cloud import storage
GOOGLE_APPLICATION_CREDENTIALS = "O:\My Creds\creds.json"
for file in glob.glob("O:\Team Drives\AU_A\Raw_Dauts\Dynamets\**\*.csv", recursive = True):
    filename = os.path.basename(file) # thought this would work but doesn't
    storage_client = storage.Client.from_service_account_json(GOOGLE_APPLICATION_CREDENTIALS)
    bucket = storage_client.get_bucket('bukcetang81')
    blob = bucket.blob("Dynamic_datasets/" +filename)
    blob.upload_from_filename(filename)
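I'd suggest not passing the same "filename" value to both calls. Give bucket.blob() the name the object should have in the bucket, and give upload_from_filename() the full local path of the file:
for file in glob.glob(r"O:\Team Drives\AU_A\Raw_Dauts\Dynamets\**\*.csv", recursive = True):
    source_filename = file  # full local path, needed to read the file
    destination_filename = "Dynamic_datasets/" + os.path.basename(file)  # object name in the bucket
    storage_client = storage.Client.from_service_account_json(GOOGLE_APPLICATION_CREDENTIALS)
    bucket = storage_client.get_bucket('bukcetang81')
    blob = bucket.blob(destination_filename)
    blob.upload_from_filename(source_filename)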
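Here is a small sketch of deriving the destination name from a local path without touching the cloud. The helper name blob_name_for and the sample path are hypothetical. pathlib.PureWindowsPath is used so that backslash-separated paths like the ones in the question are split the same way on any operating system (os.path.basename only treats backslashes as separators when running on Windows):

```python
from pathlib import PureWindowsPath


def blob_name_for(local_path, prefix="Dynamic_datasets/"):
    """Return the object name to use in the bucket: prefix + bare file name."""
    return prefix + PureWindowsPath(local_path).name


print(blob_name_for(r"O:\Team Drives\AU_A\Raw_Dauts\Dynamets\2020\report.csv"))
```

The result would be passed to bucket.blob(), while the untouched local path goes to upload_from_filename().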

Is there a way for google cloud storage client to point to 'file object' on cloud storage to be then used by lxml?

With the Google Cloud Storage client, I could not read a Storage file as a file object, as required by lxml.etree.parse. I could read the Cloud Storage file as a blob, but that did not work well with lxml.
I am trying to convert XML files using an XSLT file. I want to have a Google Cloud Function (in Python 3.7) that is triggered as soon as the XML file is uploaded to Cloud Storage. I have tried this code with the files stored locally, and it works. However, I need a way to get this working with Cloud Storage as well.
----Using local files (Working Code):
import lxml.etree as ET
filename = "C:\\GCP\\Files\\Profile.xml"
xsltfile = "C:\\GCP\\Files\\Transform.xslt"
outpath = "C:\\GCP\\Files\\Output\\Output.json"
dom = ET.parse(filename)
xslt = ET.parse(xsltfile)
transform = ET.XSLT(xslt)
newdom = transform(dom)
xdom = str(newdom)
text_file = open(outpath, "w")
text_file.write(xdom)
text_file.close()
----Using Cloud storage(not working):
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml=xmlblob.download_as_string()
xmldom = ET.parse(inputxml)
Error: failed to load external entity
The error is expected as I am passing an XML string instead of a File Object as expected by ET.parse
How can I pass a file object from Cloud storage to make this work?
The lxml.etree.parse() function treats a plain string (or bytes) argument as a filename or URL, not as XML content. If you want to pass it file contents instead, you need to wrap them in a StringIO or BytesIO (in this case, the latter, since download_as_string() returns bytes):
from io import BytesIO
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml = xmlblob.download_as_string()
xmldom = ET.parse(BytesIO(inputxml))
See the lxml documentation here: https://lxml.de/parsing.html.
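The same file-like wrapping can be tried without lxml or a bucket. The sketch below swaps in the standard library's xml.etree.ElementTree as a stand-in (its parse() also accepts file-like objects), and the XML bytes are a made-up substitute for what download_as_string() would return:

```python
from io import BytesIO
import xml.etree.ElementTree as ET

# Stand-in for the bytes returned by the blob download.
inputxml = b"<profile><name>Ada</name></profile>"

# parse() reads from the file-like object instead of treating the data as a filename.
tree = ET.parse(BytesIO(inputxml))
root = tree.getroot()
print(root.tag, root.find("name").text)

# When the whole document is already in memory, fromstring() parses it without a wrapper.
root_from_bytes = ET.fromstring(inputxml)
print(root_from_bytes.tag)
```

lxml.etree offers fromstring() as well, so either form should carry over to the XSLT code in the question.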

Opening Image from Website

I was trying to make a simple program to pull an image from the website xkcd.com, and I seem to be running into a problem where it returns "'list' object has no attribute 'show'". Does anyone know how to fix this?
import requests
from lxml import html
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
tree = html.fromstring(r.content)
final = tree.xpath("""//*[@id="comic"]/img""")
final.show()
Your call to requests.get is retrieving the actual image, that is, the raw bytes of the PNG. There is no HTML to parse or search with XPath.
Note here, the content is bytes:
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
print(r.content)
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xe4\x00\x00\x01#\x08\x03\x00\x00\x00M\x7f\xe4\xc6\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f
Here you see that you can save the results directly to disk.
import requests
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
with open("myimage.png", "wb") as f:
    f.write(r.content)
[Edit] And to show the image (you will need to install Pillow):
import requests
from PIL import Image
import io
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
img = Image.open(io.BytesIO(r.content))
img.show()
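If the goal is to start from the comic page rather than from the image URL, the image address has to be extracted from the HTML first. The sketch below does that with the standard library only. The sample HTML is a hypothetical stand-in for the page, and the parser picks the first img anywhere inside the element with id="comic", which is slightly looser than the XPath in the question:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ComicImageFinder(HTMLParser):
    """Record the src of the first <img> inside the element with id="comic"."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # nesting depth inside the id="comic" element, 0 when outside
        self.src = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self._depth:
            if tag == "img" and self.src is None:
                self.src = attributes.get("src")
            if tag not in VOID_TAGS:
                self._depth += 1
        elif attributes.get("id") == "comic":
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1


# Hypothetical sample markup; the real page may differ.
SAMPLE_HTML = """
<html><body>
<div id="middleContainer">
  <div id="comic">
    <img src="//imgs.xkcd.com/comics/self_driving_issues.png" alt="Self-Driving Issues"/>
  </div>
  <img src="//example.com/other.png"/>
</div>
</body></html>
"""

finder = ComicImageFinder()
finder.feed(SAMPLE_HTML)
# urljoin resolves protocol-relative addresses against the page URL.
image_url = urljoin("https://xkcd.com/", finder.src)
print(image_url)
```

The resulting image_url can then be passed to requests.get and opened with PIL as shown above.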
