Saving (.svg) images using Scrapy

I'm using Scrapy and I want to save some of the .svg images from the webpage locally on my computer. The URLs for these images have the structure '__.com/svg/4/8/3/1425.svg' (each is a full working URL, https included).
I've defined the item in my items.py file:
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I've added the following to my settings:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True
In the main parse function I'm calling:
imageItem = ImageItem()
imageItem['image_urls'] = [url]
yield imageItem
But it doesn't save the images. I've followed the documentation and tried numerous things, but I keep getting the following error:
StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>
During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>
Am I missing something? Can anyone help? I am fully stumped.

Gallaecio was right! Scrapy was having an issue with the .svg file type. I changed the ImagesPipeline to the FilesPipeline and it works!
For anyone stuck, the documentation is here.
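For reference, a minimal sketch of that switch, assuming Scrapy's stock FilesPipeline and its default file_urls/files item fields (the field and setting names below are Scrapy defaults, not from the original post):
import scrapy

# settings.py -- enable the generic files pipeline instead of the images pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True

# items.py -- FilesPipeline looks for these default field names
class FileItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

# in the spider's parse method
fileItem = FileItem()
fileItem['file_urls'] = [url]
yield fileItem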

The Python Imaging Library (PIL), which is used by the ImagesPipeline, does not support vector images.
If you still want to benefit from the ImagesPipeline capabilities and not switch to the more general FilesPipeline, you can do something along these lines:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPM
from io import BytesIO

class SvgCompatibleImagesPipeline(ImagesPipeline):
    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline.
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            # Rasterise the SVG to a PNG in memory so PIL can handle it
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())
        else:
            res = response
        return super().get_images(res, request, info, item=item)
This will replace the SVG image in the response body with a PNG version of it, which can then be processed by the regular ImagesPipeline.
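To wire the subclass in, point ITEM_PIPELINES at it instead of the stock pipeline; a hedged sketch, assuming the class lives in myproject/pipelines.py and that svglib and reportlab are installed (pip install svglib reportlab):
# settings.py -- the module path here is an assumption about your project layout
ITEM_PIPELINES = {
    'myproject.pipelines.SvgCompatibleImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'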

Related

Passing base64 .docx to docx.Document results in BadZipFile exception

I'm writing an Azure function in Python 3.9 that needs to accept a base64 string created from a known .docx file which will serve as a template. My code will decode the base64, pass it to a BytesIO instance, and pass that to docx.Document(). However, I'm receiving an exception BadZipFile: File is not a zip file.
Below is a slimmed down version of my code. It fails on document = Document(bytesIODoc). I'm beginning to think it's an encoding/decoding issue, but I don't know nearly enough about it to get to the solution.
from docx import Document
from io import BytesIO
import base64
var = {
    'template': 'Some_base64_from_docx_file',
    'data': {'some': 'data'}
}
run_stuff = ParseBody(body=var)
output = run_stuff.run()
class ParseBody():
    def __init__(self, body):
        self.template = str(body['template'])
        self.contents = body['data']

    def _decode_template(self):
        b64Doc = base64.b64decode(self.template)
        bytesIODoc = BytesIO(b64Doc)
        document = Document(bytesIODoc)

    def run(self):
        self.document = self._decode_template()
I've also tried the following change to _decode_template and am getting the same exception. This is running base64.decodebytes() on the b64Doc object and passing that to BytesIO instead of directly passing b64Doc.
def _decode_template(self):
    b64Doc = base64.b64decode(self.template)
    bytesDoc = base64.decodebytes(b64Doc)
    bytesIODoc = BytesIO(bytesDoc)
I have successfully tried the following on the same exact .docx file to be sure that this is possible. I can open the document in Python, base64 encode it, decode into bytes, pass that to a BytesIO instance, and pass that to docx.Document successfully.
file = r'WordTemplate.docx'
doc = open(file, 'rb').read()
b64Doc = base64.b64encode(doc)
bytesDoc = base64.decodebytes(b64Doc)
bytesIODoc = BytesIO(bytesDoc)
newDoc = Document(bytesIODoc)
I've tried countless other solutions to no avail, each of which led me further from a resolution. This is the closest I've gotten. Any help is greatly appreciated!
The answer to the question linked below actually helped me resolve my own issue. How to generate a DOCX in Python and save it in memory?
All I had to do was change document = Document(bytesIODoc) to the following:
document = Document()
document.save(bytesIODoc)

Why is Scrapy not returning the entire HTML code?

I am trying to convert my Selenium web scraper to Scrapy because Selenium is not mainly intended for web scraping.
I just started writing it and have already hit a roadblock. My code is below.
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path

max_price = "110000"
min_price = "65000"
region_code = "5E430"

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
            "%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
        no_of_pages = response.xpath('//span[@class = "pagination-pageInfo"]').getall()
        with open(Path(work_path, "test.txt"), 'wb') as f:
            f.write(response.body)
        with open(Path(work_path, "extract.txt"), 'wb') as g:
            g.write(no_of_pages)
        self.log('Saved file test.txt')

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
My roadblock is that response.body does not contain the element sought by the xpath expression //span[@class = "pagination-pageInfo"], but the website does have it. I am way out of my depth with the inner workings of websites and am not a programmer by profession, unfortunately. Would anyone help me understand what is happening, please?
You first have to understand that there is a big difference between what you see in your browser and what the server actually sends you.
Apart from the HTML, the server usually also sends JavaScript code that modifies the HTML itself at runtime.
For example, the first GET you make to a page may return an empty table plus some JavaScript code. That code is then in charge of hitting a database and filling the table. If you try to scrape such a site with Scrapy alone, it will fail to get the table because Scrapy does not have a JavaScript engine able to execute that code.
This is your case here, and will be your case for most of the pages you will try to crawl.
You need something to render the code in the page. The best option for Scrapy is Splash:
https://github.com/scrapinghub/splash
It is a headless, scriptable browser that you can use through a Scrapy plugin. It's maintained by Scrapinghub (the creators of Scrapy), so it works well with Scrapy.
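As a rough sketch of what that looks like with the scrapy-splash plugin (the settings below are the plugin's documented defaults; the spider snippet is illustrative and assumes a Splash instance is running, e.g. via docker run -p 8050:8050 scrapinghub/splash):
# settings.py -- wire scrapy-splash to the running Splash instance
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider -- yield a SplashRequest so the page (JavaScript included) is
# rendered before parse() sees the response
from scrapy_splash import SplashRequest

def start_requests(self):
    yield SplashRequest(url=url, callback=self.parse, args={'wait': 2})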

How do I open images stored in GCP in Google datalab?

I have been trying to open an image that I stored in a GCP bucket in my Datalab notebook. When I use Image.open() it says something like "No such file or directory: 'images/00001.jpeg'".
My code is:
nama_bucket = storage.Bucket("sample_bucket")
for obj in nama_bucket.objects():
    Image.open(obj.key)
I just need to open the images stored in the bucket and view it. Thanks for the help!
I was able to reproduce the issue and get the same error as you (No such file or directory).
I will describe the workaround I used to solve it. However, there are a few issues that I can see in the code snippet provided:
Class IPython.display.Image has no method 'open'.
You will need to wrap the Image constructor in a display() call.
With Storage APIs for Google Cloud Datalab, what resolved the issue for me was using the url parameter instead of the filename.
Here is the solution that worked for me:
import google.datalab.storage as storage
from IPython.display import Image

bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    display(Image(url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)))
Let me know if it helps!
EDIT 1:
As you mentioned that you're using the PIL and would like your images to be handled by it, here's the way to achieve that (I have tested it and it worked well for me):
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO

bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
    display(img)
Notice that this way you will not need to use IPython.display.Image at all.
EDIT 2:
Indeed, the error cannot identify image file <_io.BytesIO object at 0x7f8f33bdbdb0> is appearing because you have a directory in your bucket. In order to solve this issue it's important to understand how Google Cloud Storage sub-directories work.
Here's how I organized the files in my bucket to replicate your situation:
my-bucket/
    img/
        test-file-1.png
        test-file-2.png
        test-file-3.jpeg
        test-file-4.png
Even though gsutil creates the illusion of a hierarchical file tree by applying a variety of rules to make naming work the way users would expect, in fact the test files just happen to have '/'s in their names; there is no actual 'img' directory.
You can still list all images from your bucket. With the structure mentioned above, this can be achieved, for example, by checking the file's extension:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO

bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image
    if obj.key[-3:].lower() in ('jpg', 'png') or obj.key[-4:].lower() in ('jpeg',):
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
If you need to get only the images "stored in a particular sub-directory" of your bucket, you will also need to check the files by name:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO

bucket_name = '<my-bucket-name>'
folder = '<name-of-the-directory>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
    # Check that the object is an image AND that it has the required sub-directory in its name
    if (obj.key[-3:].lower() in ('jpg', 'png') or obj.key[-4:].lower() in ('jpeg',)) and folder in obj.key:
        url = 'https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
        display(img)
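As an aside, a slightly sturdier way to detect image files than slicing the last three or four characters of the key is os.path.splitext; a small self-contained sketch (standard library only) that could replace the extension check above:
import os

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def is_image(key):
    # 'img/test-file-3.jpeg' -> ('img/test-file-3', '.jpeg')
    _, ext = os.path.splitext(key)
    return ext.lower() in IMAGE_EXTENSIONS

print(is_image('img/test-file-3.jpeg'))  # True
print(is_image('img/'))                  # False, so bare "directory" keys are skipped too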

How to download bulk amount of images from google or any website

Actually, I need to do a project on machine learning, and for that I want a lot of images for training. I searched for a way to do this, but failed.
Can anyone help me solve this? Thanks in advance.
I used Google Images to download images using Selenium. It is just a basic approach.
from selenium import webdriver
import time
import urllib.request
import os
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome("path\\to\\the\\webdriverFile")
browser.get("https://www.google.com")

search = browser.find_element_by_name('q')
search.send_keys(key_words, Keys.ENTER)  # use required key_words to download images

elem = browser.find_element_by_link_text('Images')
elem.get_attribute('href')
elem.click()

# Scroll down to load more image results
value = 0
for i in range(20):
    browser.execute_script("scrollBy(" + str(value) + ",+1000);")
    value += 1000
    time.sleep(3)

elem1 = browser.find_element_by_id('islmp')
sub = elem1.find_elements_by_tag_name("img")

try:
    os.mkdir('downloads')
except FileExistsError:
    pass

count = 0
for i in sub:
    src = i.get_attribute('src')
    try:
        if src != None:
            src = str(src)
            print(src)
            count += 1
            urllib.request.urlretrieve(src,
                os.path.join('downloads', 'image' + str(count) + '.jpg'))
        else:
            raise TypeError
    except TypeError:
        print('fail')
    if count == required_images_number:  # use number as required
        break
Check this for a detailed explanation.
Download the driver here.
My tip to you is: use an image-search API. This is my favourite: the Bing Image Search API.
The following text is from Send search queries using the REST API and Python.
Running the quickstart
To get started, set subscription_key to a valid subscription key for the Bing API service.
Python
subscription_key = None
assert subscription_key
Next, verify that the search_url endpoint is correct. At this writing, only one endpoint is used for Bing search APIs. If you encounter authorization errors, double-check this value against the Bing search endpoint in your Azure dashboard.
Python
search_url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
Set search_term to look for images of puppies.
Python
search_term = "puppies"
The following block uses the requests library in Python to call out to the Bing search APIs and return the results as a JSON object. Observe that we pass in the API key via the headers dictionary and the search term via the params dictionary. To see the full list of options that can be used to filter search results, refer to the REST API documentation.
Python
import requests
headers = {"Ocp-Apim-Subscription-Key" : subscription_key}
params = {"q": search_term, "license": "public", "imageType": "photo"}
response = requests.get(search_url, headers=headers, params=params)
response.raise_for_status()
search_results = response.json()
The search_results object contains the actual images along with rich metadata such as related items. For example, the following line of code can extract the thumbnail URLs for the first 16 results.
Python
thumbnail_urls = [img["thumbnailUrl"] for img in search_results["value"][:16]]
Then use the PIL library to download the thumbnail images and the matplotlib library to render them in a 4x4 grid.
Python
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO

f, axes = plt.subplots(4, 4)
for i in range(4):
    for j in range(4):
        image_data = requests.get(thumbnail_urls[i + 4 * j])
        image_data.raise_for_status()
        image = Image.open(BytesIO(image_data.content))
        axes[i][j].imshow(image)
        axes[i][j].axis("off")
plt.show()
Sample JSON response
Responses from the Bing Image Search API are returned as JSON. This sample response has been truncated to show a single result.
JSON
{
  "_type": "Images",
  "instrumentation": {
    "_type": "ResponseInstrumentation"
  },
  "readLink": "images\/search?q=tropical ocean",
  "webSearchUrl": "https:\/\/www.bing.com\/images\/search?q=tropical ocean&FORM=OIIARP",
  "totalEstimatedMatches": 842,
  "nextOffset": 47,
  "value": [
    {
      "webSearchUrl": "https:\/\/www.bing.com\/images\/search?view=detailv2&FORM=OIIRPO&q=tropical+ocean&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&simid=608027248313960152",
      "name": "My Life in the Ocean | The greatest WordPress.com site in ...",
      "thumbnailUrl": "https:\/\/tse3.mm.bing.net\/th?id=OIP.fmwSKKmKpmZtJiBDps1kLAHaEo&pid=Api",
      "datePublished": "2017-11-03T08:51:00.0000000Z",
      "contentUrl": "https:\/\/mylifeintheocean.files.wordpress.com\/2012\/11\/tropical-ocean-wallpaper-1920x12003.jpg",
      "hostPageUrl": "https:\/\/mylifeintheocean.wordpress.com\/",
      "contentSize": "897388 B",
      "encodingFormat": "jpeg",
      "hostPageDisplayUrl": "https:\/\/mylifeintheocean.wordpress.com",
      "width": 1920,
      "height": 1200,
      "thumbnail": {
        "width": 474,
        "height": 296
      },
      "imageInsightsToken": "ccid_fmwSKKmK*mid_8607ACDACB243BDEA7E1EF78127DA931E680E3A5*simid_608027248313960152*thid_OIP.fmwSKKmKpmZtJiBDps1kLAHaEo",
      "insightsMetadata": {
        "recipeSourcesCount": 0,
        "bestRepresentativeQuery": {
          "text": "Tropical Beaches Desktop Wallpaper",
          "displayText": "Tropical Beaches Desktop Wallpaper",
          "webSearchUrl": "https:\/\/www.bing.com\/images\/search?q=Tropical+Beaches+Desktop+Wallpaper&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&FORM=IDBQDM"
        },
        "pagesIncludingCount": 115,
        "availableSizesCount": 44
      },
      "imageId": "8607ACDACB243BDEA7E1EF78127DA931E680E3A5",
      "accentColor": "0050B2"
    }
  ]
}
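If you want the full-resolution files rather than the thumbnails, each hit in search_results["value"] also carries a contentUrl (visible in the sample above); a hedged sketch that saves the first few to disk, with file naming and error handling that are purely illustrative:
import os
import requests

os.makedirs('downloads', exist_ok=True)
for i, img in enumerate(search_results["value"][:16]):
    try:
        resp = requests.get(img["contentUrl"], timeout=10)
        resp.raise_for_status()
        # encodingFormat is e.g. "jpeg"; fall back to "jpg" if it is missing
        ext = '.' + img.get("encodingFormat", "jpg")
        with open(os.path.join('downloads', 'image{}{}'.format(i, ext)), 'wb') as f:
            f.write(resp.content)
    except requests.RequestException:
        print('failed:', img["contentUrl"])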

Opening Image from Website

I was trying to make a simple program to pull an image from the website xkcd.com, and I seem to be running into a problem where it returns 'list' object has no attribute 'show'. Does anyone know how to fix this?
import requests
from lxml import html
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
tree = html.fromstring(r.content)
final = tree.xpath("""//*[@id="comic"]/img""")
final.show()
Your call to requests.get is retrieving the actual image, i.e. the raw bytes of the PNG. There is no HTML to parse or search with xpath.
Note here, the content is bytes:
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
print(r.content)
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xe4\x00\x00\x01#\x08\x03\x00\x00\x00M\x7f\xe4\xc6\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f
Here you see that you can save the results directly to disk.
import requests

r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
with open("myimage.png", "wb") as f:
    f.write(r.content)
[Edit] And to show the image (you will need to install Pillow):
import requests
from PIL import Image
import io
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
img = Image.open(io.BytesIO(r.content))
img.show()
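If the goal was to start from the comic page rather than the image URL itself, the two steps can be combined; a sketch that follows the question's own xpath (note that xpath() returns a list, hence the [0], and that xkcd's img src is protocol-relative):
import io
import requests
from lxml import html
from PIL import Image

# Fetch the HTML page (not the .png) so there is actually markup to parse
page = requests.get("https://xkcd.com/")
tree = html.fromstring(page.content)

# xpath() returns a list of matches; take the first src attribute
img_src = tree.xpath('//*[@id="comic"]/img/@src')[0]
img_url = "https:" + img_src if img_src.startswith("//") else img_src

r = requests.get(img_url)
Image.open(io.BytesIO(r.content)).show()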
