Opening Image from Website - python-3.x

I was trying to make a simple program to pull an image from the website xkcd.com, and I keep running into a problem where it returns 'list' object has no attribute 'show'. Does anyone know how to fix this?
import requests
from lxml import html
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
tree = html.fromstring(r.content)
final = tree.xpath("""//*[@id="comic"]/img""")
final.show()

Your call to requests.get is retrieving the actual image, i.e. the raw bytes of the PNG. There is no HTML to parse or search with XPath.
Note here that the content is bytes:
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
print(r.content)
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xe4\x00\x00\x01#\x08\x03\x00\x00\x00M\x7f\xe4\xc6\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f
Since the response body already is the image, you can save it directly to disk:
import requests
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
with open("myimage.png", "wb") as f:
    f.write(r.content)
[Edit] And to show the image without saving it first (you will need to install Pillow):
import requests
from PIL import Image
import io
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
img = Image.open(io.BytesIO(r.content))
img.show()
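As a side note, if the goal was to start from the comic page itself (as the original XPath suggests), a minimal sketch along those lines could look like this; the comic URL and page structure are assumptions, not verified against xkcd:
import io
from urllib.parse import urljoin

import requests
from lxml import html
from PIL import Image

# Fetch the comic page (HTML), not the image file itself
page = requests.get("https://xkcd.com/1958/")  # hypothetical comic URL
tree = html.fromstring(page.content)

# xpath() returns a list of matching elements, so index into it
matches = tree.xpath('//*[@id="comic"]/img')
if matches:
    img_url = urljoin(page.url, matches[0].get("src"))  # resolves a protocol-relative src
    img = Image.open(io.BytesIO(requests.get(img_url).content))
    img.show()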

Related

Saving (.svg) images using Scrapy

I'm using Scrapy and I want to save some of the .svg images from the webpage locally on my computer. The URLs for these images have the structure '__.com/svg/4/8/3/1425.svg' (and are full working URLs, https included).
I've defined the item in my items.py file:
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I've added the following to my settings:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True
In the main parse function I'm calling:
imageItem = ImageItem()
imageItem['image_urls'] = [url]
yield imageItem
But it doesn't save the images. I've followed the documentation and tried numerous things but keep getting the following error:
StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>
During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>
Am I missing something? Can anyone help? I am fully stumped.
Gallaecio was right! Scrapy was having an issue with the .svg file type. I changed the ImagesPipeline to the FilesPipeline and it works!
For anyone stuck, the documentation is here.
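For reference, a minimal sketch of the FilesPipeline setup (the field names and settings are the standard Scrapy ones; the item class name and store path are just examples):
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True

# items.py
import scrapy

class FileItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

# in the spider's parse method:
#     item = FileItem()
#     item['file_urls'] = [url]
#     yield item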
Python Imaging Library (PIL), which is used by the ImagesPipeline, does not support vector images.
If you still want to benefit from the ImagesPipeline capabilities rather than switch to the more general FilesPipeline, you can do something along these lines:
from io import BytesIO

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPM


class SvgCompatibleImagesPipeline(ImagesPipeline):

    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline.
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            # Render the SVG to PNG in memory and swap it into the response body
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())
        else:
            res = response
        return super().get_images(res, request, info, item=item)
This will replace the SVG image in the response body with a PNG version of it, which can then be processed further by the regular ImagesPipeline.
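To actually use this subclass, point ITEM_PIPELINES at it instead of the stock ImagesPipeline; the module path below is an assumption, adjust it to wherever the class lives in your project:
# settings.py (module path is hypothetical)
ITEM_PIPELINES = {
    'myproject.pipelines.SvgCompatibleImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True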

How to download an image from the internet using google colab jupyter

I need to download an image using a URL. I managed to obtain the URLs of the images I need to download, but now I'm lost on how to download them to my local computer. I'm using Google Colab / Jupyter. Thank you!
Here's my code so far:
from bs4 import BeautifulSoup
import requests
import json
import urllib.request
#input userid - plan: have program read userids from csv or excel file
userid = xxxxxxxx
#use Globe API to get data
source = requests.get('https://api.globe.gov/search/v1/measurement/protocol/measureddate/userid/?protocols=land_covers&startdate=2020-05-04&enddate=2020-07-16&userid=' + str(userid) +'&geojson=FALSE&sample=FALSE').text
#set up BeautifulSoup4
soup = BeautifulSoup(source, 'lxml')
#Isolate the Json data and put it into a string called "paragraph"
body = soup.find('body')
paragraph = body.p.text
#load the string into a python object
data = json.loads(paragraph)
#pick out the needed information and store them
for landcover in data['results']:
    siteId = landcover['siteId']
    measuredDate = landcover['measuredDate']
    latitude = landcover['latitude']
    longitude = landcover['longitude']
    protocol = landcover['protocol']
    DownURL = landcover['data']['landcoversDownwardPhotoUrl']
    #Here is where I want to download the url contained in 'DownURL'
Try:
from google.colab import files as FILE
import os

img_data = requests.get(DownURL).content
with open('image_name.jpg', 'wb') as handler:
    handler.write(img_data)

FILE.download('image_name.jpg')
os.remove('image_name.jpg') # to save up space
If you don't want to hard-code an image name, you can generate a random file name, or use a counter variable that increments on each loop iteration.
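For example, a counter-based variant of the loop might look like this (a sketch reusing data and DownURL from the question's code, not something specific to the Globe API):
from google.colab import files as FILE
import requests

counter = 0
for landcover in data['results']:
    DownURL = landcover['data']['landcoversDownwardPhotoUrl']
    img_data = requests.get(DownURL).content
    filename = 'image_{}.jpg'.format(counter)  # one file per photo
    with open(filename, 'wb') as handler:
        handler.write(img_data)
    FILE.download(filename)  # triggers a browser download from Colab
    counter += 1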

How do I open images stored in GCP in Google datalab?

I have been trying to open an image that I stored in a GCP bucket from my Datalab notebook. When I use Image.open() it says something like "No such file or directory: 'images/00001.jpeg'".
My code is:
nama_bucket = storage.Bucket("sample_bucket")
for obj in nama_bucket.objects():
    Image.open(obj.key)
I just need to open the images stored in the bucket and view them. Thanks for the help!
I was able to reproduce the issue and get the same error as you (No such file or directory).
I will describe the workaround I used to solve it. However, there are a few issues that I can see in the code snippet provided:
Class IPython.display.Image has no method 'open'.
You will need to wrap the Image constructor in a display() method.
With Storage APIs for Google Cloud Datalab, what resolved the issue for me was using the url parameter instead of the filename.
Here is the solution that worked for me:
import google.datalab.storage as storage
from IPython.display import Image
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    display(Image(url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)))
Let me know if it helps!
EDIT 1:
As you mentioned that you're using PIL and would like your images to be handled by it, here's a way to achieve that (I have tested it and it worked well for me):
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
    display(img)
Notice that this way you will not need to use IPython.display.Image at all.
EDIT 2:
Indeed, the error "cannot identify image file <_io.BytesIO object at 0x7f8f33bdbdb0>" is appearing because you have a directory in your bucket. In order to solve this issue, it's important to understand how Google Cloud Storage sub-directories work.
Here's how I organized the files in my bucket to replicate your situation:
my-bucket/
    img/
        test-file-1.png
        test-file-2.png
        test-file-3.jpeg
    test-file-4.png
Even though gsutil creates the illusion of a hierarchical file tree by applying a variety of naming rules to make things work the way users would expect, in reality test-files 1-3 simply have 'img/' in their object names; there is no actual 'img' directory.
You can still list all the images in your bucket. With the structure mentioned above, this can be done, for example, by checking the file extension:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image
    if obj.key[-3:].lower() in ('jpg', 'png') or obj.key[-4:].lower() in ('jpeg',):
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
If you need to get only the images "stored in a particular sub-directory" of your bucket, you will also need to check the files by name:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
folder = '<name-of-the-directory>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image AND that it has the required sub-directory in its name
    if (obj.key[-3:].lower() in ('jpg', 'png') or obj.key[-4:].lower() in ('jpeg',)) and folder in obj.key:
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
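As a side note (not part of the original answer), the extension check can also be written with str.endswith, which accepts a tuple of suffixes and avoids the slicing; is_image_key below is a hypothetical helper:
def is_image_key(key, folder=None):
    # True if the object key looks like an image, optionally restricted to keys containing 'folder'
    has_image_ext = key.lower().endswith(('.jpg', '.jpeg', '.png'))
    in_folder = folder is None or folder in key
    return has_image_ext and in_folder

print(is_image_key('img/test-file-3.jpeg'))        # True
print(is_image_key('notes.txt'))                   # False
print(is_image_key('img/test-file-1.png', 'img'))  # True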

How can I scrape data that is not present in the page source?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
data = open('scrapeFile','r')
html = data.read()
data.close()
soup = BeautifulSoup(html,features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)

file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
I have successfully got the list of links using this code. But when I want to scrape the data from those links, their HTML pages don't contain the data in the source code, which makes extracting it tough. I have used the Selenium driver, but it doesn't work well for me.
I want to scrape the data from the link below, which contains data in HTML sections covering customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract these with name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
Here is the link. If someone finds a solution, please share it with me.
Using the developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 which returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
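Equivalently, if you prefer the requests library (same endpoint as above):
import requests

contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)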
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source; you have to wait until the page loads completely. Notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely

soup = BeautifulSoup(wd.page_source, 'lxml')

props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)
Note: the code will return the construction details in 2 separate lists.
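If the fixed time.sleep(8) feels fragile, an explicit wait is a common alternative. A sketch using the same driver setup as above, assuming the 'row' divs are a reasonable signal that the page has rendered:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")

# Wait up to 20 seconds for at least one 'row' div instead of sleeping blindly
WebDriverWait(wd, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.row"))
)
soup = BeautifulSoup(wd.page_source, 'lxml')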

find() in Beautifulsoup returns None

I'm very new to programming in general and I'm trying to write my own little torrent leecher. I'm using BeautifulSoup in order to extract the title and the magnet link of a torrent file. However, find() keeps returning None no matter what I do. The page is correct. I've also tested with find_next_sibling and read all the similar questions, but to no avail. Since there are no errors, I have no idea what my mistake is.
Any help would be much appreciated. Below is my code:
import urllib3
from bs4 import BeautifulSoup
print("Please enter the movie name: \n")
search_string = input("")
search_string = search_string.strip()
open_page = ('https://www.yify-torrent.org/search/' + search_string + '/s-1/all/all/') # get link - creates a search string with input value
print(open_page)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
manager = urllib3.PoolManager(10)
page_content = manager.urlopen('GET',open_page)
soup = BeautifulSoup(page_content,'html.parser')
magnet = soup.find('a', attrs={'class': 'movielink'}, href=True)
print(magnet)
Check out the following script, which does exactly what you want to achieve. I used the requests library instead of urllib3. The main mistake you made is that you looked for the magnet link in the wrong place: you need to go one layer deeper to dig out that link. Also, try using quote instead of string manipulation to fit your search query into the url.
Give this a shot:
import requests
from urllib.parse import urljoin
from urllib.parse import quote
from bs4 import BeautifulSoup
keyword = 'The Last Of The Mohicans'
url = 'https://www.yify-torrent.org/search/'
base = f"{url}{quote(keyword)}{'/p-1/all/all/'}"
res = requests.get(base)
soup = BeautifulSoup(res.text,'html.parser')
tlink = urljoin(url,soup.select_one(".img-item .movielink").get("href"))
req = requests.get(tlink)
sauce = BeautifulSoup(req.text,"html.parser")
title = sauce.select_one("h1[itemprop='name']").text
magnet = sauce.select_one("a#dm").get("href")
print(f"{title}\n{magnet}")
