Downloading an image from the web and saving - python-3.x

I am trying to download an image from Wikipedia and save it to a file locally (using Python 3.9.x). Following this link I tried:
import urllib.request
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, when I try to open this file (Mac OS) I get an error: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
I did some more search and came across this article which suggests modifying the User-Agent. Following that I modified the above code as follows:
import urllib.request
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0')]
urllib.request.install_opener(opener)
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, modifying the User-Agent did NOT help and I still get the same error while trying to open the file: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
Another piece of information: the downloaded file (that does not open) is 235 KB. But if I download the image manually (Right Click -> Save Image As...) it is 455 KB.
I was wondering what else am I missing? Thank you!

The problem is that you're downloading the web page as if it were a .jpg. The link you used does not point to the photo itself, but to a Wikipedia page that contains the photograph.
That's why the manually saved photo is 455 KB while the file you downloaded is only 235 KB.
Instead of this :
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
Use this :
http = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Abacus_4.jpg/800px-Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
It's best to first open the photo you want with your browser's "Open image in new tab" option, and then copy the URL from there.
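If you'd rather resolve the direct URL programmatically, note that the #/media/ fragment of the article URL already carries the file title, which can then be fed to the MediaWiki imageinfo API to look up the real upload.wikimedia.org URL. A minimal sketch of the extraction step, assuming this URL shape (the helper name is mine):

```python
from urllib.parse import unquote, urlsplit

def direct_media_title(page_url):
    """Pull the 'File:...' title out of a Wikipedia '#/media/' fragment URL."""
    fragment = urlsplit(page_url).fragment      # e.g. '/media/File:Abacus_4.jpg'
    if fragment.startswith('/media/'):
        return unquote(fragment[len('/media/'):])
    return None

title = direct_media_title(
    'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg')
print(title)  # File:Abacus_4.jpg
```

The resulting title can then be passed to the API (action=query&prop=imageinfo&iiprop=url) to obtain the direct image URL.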

Related

Google Maps API using urllib.request cannot save jpg file

I am using the Google Maps Static API and would like to save the image as a JPG file.
Saving a PNG with urllib.request.urlretrieve(url, 'map_46_6.png') works fine. However, urllib.request.urlretrieve(url, 'map_46_6.jpg') does not: opening the file gives the error « Not a JPG file: starts with 0x89 0x50 ». Manually changing the extension to .png resolves it.
The following is the code :
import urllib.request
url = 'http://maps.googleapis.com/maps/api/staticmap?scale=2&center=46.257632,6.108669&zoom=12&size=400x400&maptype=satellite&key=xxxxx'
urllib.request.urlretrieve(url, 'map_46_6.jpg')
As this code is part of a previously built pipeline, I would need the JPG files for the next steps.
My question is: is there a setting in urllib, Google Maps, or anything else that could cause this error? Thank you very much in advance!
I have found a solution. If you want JPG, you need to request the format explicitly with &format=jpg, like the following:
import urllib.request
url = 'https://maps.googleapis.com/maps/api/staticmap?scale=2&center=46.257632,6.108669&zoom=16&size=400x400&maptype=satellite&format=jpg&key=xxxx'
urllib.request.urlretrieve(url, 'map_46_6.jpg')
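The « 0x89 0x50 » in the error message is the start of the PNG file signature; JPEG files begin with 0xFF 0xD8 0xFF instead. A quick way to check what urlretrieve actually saved, as a sketch (the helper name is mine):

```python
def sniff_image_format(first_bytes):
    """Guess an image format from its leading magic bytes."""
    if first_bytes.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if first_bytes.startswith(b'\xff\xd8\xff'):
        return 'jpg'
    return 'unknown'

# e.g.: with open('map_46_6.jpg', 'rb') as f: print(sniff_image_format(f.read(8)))
```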

Downloading with node modifies excel files and causes data loss

I am trying to create a script in Node.js that downloads an Excel file. My code first makes an http.get request to the URL and then writes the response to a file using response.pipe and createWriteStream. My code is as follows:
const fs = require("fs");
const http = require("http");

let url = "http://www.functionalglycomics.org:80/glycomics/HFileServlet?operation=downloadRawFile&fileType=DAT&sideMenu=no&objId=1002183";

http.get(url, response => {
  let file = fs.createWriteStream('file.xls');
  let stream = response.pipe(file);
});
If you download the file at the following URL using Firefox, it downloads properly, opens fine, and Excel gives no errors.
http://www.functionalglycomics.org:80/glycomics/HFileServlet?operation=downloadRawFile&fileType=DAT&sideMenu=no&objId=1002183
Note: the download link above will not work in Chrome, due to a known issue with filenames that contain a comma. For that reason I cannot use puppeteer for this.
However, if I download the file with the script above and try to open it in Excel, it reports "data may have been lost" five times, but then still opens the file.
My question is therefore, what is causing this data loss when downloading using nodejs?
Update
Some data about versions:
Node:v12.13.1
Excel: Office 2019
OS: Windows 10 latest
Update 2
Based on the comments below from jarmod, I tried using wget in Windows PowerShell. It downloads the file too, but produces the same Excel error.
I posted this as an issue on the Node.js GitHub repository. @Hakerh400 provided a good description of what is happening there; briefly, the Windows NTFS file system has a feature called ADS (Alternate Data Streams) that keeps track of which files were downloaded from the internet, so that security prompts can be shown. You can read more about it in @Hakerh400's comment here.
The proposed workaround is to add this Zone.Identifier ADS to the file once the download has finished, for example by waiting for the stream's finish event:
http.get(url, response => {
  let file = fs.createWriteStream('file.xls');
  let stream = response.pipe(file);
  stream.on('finish', () => {
    fs.writeFileSync(
      'file.xls:Zone.Identifier',
      `[ZoneTransfer]\r\nZoneId=3\r\nHostUrl=${url}`,
    );
  });
});
Note: this workaround lets you open the Excel file in "Protected View" without any complaints. However, if you click "Enable Editing" in Excel's security prompt, the "File Error: data may have been lost" message still pops up (Excel 2019). There is no actual data loss in the sheets or cells, though.
I hope this answer helps anyone who faces anything similar.

Using Python3 Sharepy to download an excel file from a shared 0365 Commercial sharepoint results in corrupt file

Similar to a previous question that wasn't fully answered, here, I am attempting to use Python 3 and Sharepy to download an Excel file, manipulate it with pandas, and re-upload it back to SharePoint.
The issue may be that I don't know where the true Excel file is stored; I only have a share link that can be given to other people with access. Downloading that link yields the HTML of Excel Online, not the Excel file I intended. Any tips?
import sharepy

server = 'https://mycompany365.sharepoint.com'
user = 'first.m.last@mycompany.com'
password = '1234Password1234'

# Copy/paste the file link from SharePoint below.  # <-- partially works
site = "https://mycompany.sharepoint.com/:x:/r/sites/Sales/Shared%20Documents/General/My_File.xlsx?d=wb182f80code74bd586b225codebeb1c&csf=1&e=CodeeT"

s = sharepy.connect(server, user, password)

# Download the file to the same folder as the Python script, saved as My_File.xlsx.
r = s.getfile(site, filename='My_File.xlsx')
print("Script Complete")
My site URL resolves to an HTML page pointed at the correct online file, but not the true Excel file. How do I find where the true file is?
After some brute force, I found that deleting the random share token at the end of the "share link" opens the original file:
# Copy/Paste file link from sharepoint below. Before...
site = "https://mycompany.sharepoint.com/:x:/r/sites/Sales/Shared%20Documents/General/My_File.xlsx?d=wb182f80code74bd586b225codebeb1c&csf=1&e=CodeeT"
# Copy/Paste file link from sharepoint below. After.
site = "https://mycompany.sharepoint.com/:x:/r/sites/Sales/Shared%20Documents/General/My_File.xlsx"
Thanks for your answer, Arthur. It helped me resolve my issue of corrupted files. I was using this URL and was getting corrupted files after downloading:
url = 'https://mycompany.sharepoint.com/sites/MyProject/Retainer%20Query/Forms/AllItems.aspx?id=%2Fsites%2FSnowfig%2FRetainer%20Query%2Fimage001.png'
I changed the url to this
url = 'https://mycompany.sharepoint.com/sites/MyProject/Retainer%20Query/image001.png'
And now I can open these files.
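The same fix can be applied mechanically: in both examples above, the share token lives entirely in the URL's query string, so stripping the query is enough. A small sketch using only the standard library (the helper name is mine):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_share_token(url):
    """Drop the query string (?d=...&csf=...&e=...) from a SharePoint share link."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

site = strip_share_token(
    "https://mycompany.sharepoint.com/:x:/r/sites/Sales/Shared%20Documents"
    "/General/My_File.xlsx?d=wb182f80code74bd586b225codebeb1c&csf=1&e=CodeeT")
print(site)  # the same URL with no query string
```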

In python 3, requests.get().content works to download images, but not for this type of url

I've been using different versions of a web scraper to download anime images from a number of websites I like, using BeautifulSoup, urllib, and requests.
When I have the image link, I use requests.get(url).content and write the file to a directory on my computer. This has worked for other sites, but not on this new one: the program runs fine, but the file is not written correctly and I am unable to view it with any image viewer. Here is my code without all of the HTML parsing, just the URL-to-image-download section:
import os
import requests

img_data = requests.get("https://cs.sankakucomplex.com/data/ba/bc/babc83a0361198bb43a9b367273b3ef7.jpg?e=1510027320&m=euskBFzOAk-YJJjfbP-26A").content
completename = os.path.join('C:\\', 'Users', 'jesse', '.spyder-py3', 'Image_scraper', 'sankaku', 'testtesttest.jpg')
with open(completename, 'wb') as handler:
    handler.write(img_data)
I'm fairly certain that the issue comes from this site's different URL structure. Notice that after the ".jpg?" there is more URL information, which the other sites I scraped did not have. I'm open to using urllib or another library; I've only been learning to use Python with HTML for the last 2-3 weeks. Any ideas or suggestions are appreciated~
thank you
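One thing worth ruling out with URLs like this: the e= and m= parameters look like a signed, expiring token, and once it expires the server may return an HTML error page, which then gets written to testtesttest.jpg as if it were image data. A hedged sketch of a guard to put before the write (the helper name is mine; the check assumes the server sets Content-Type honestly):

```python
def should_save(status_code, content_type):
    """Only write the file when the response is OK and really an image."""
    return 200 <= status_code < 300 and (content_type or "").startswith("image/")

# With requests, something along these lines (untested against the live site):
#   resp = requests.get(img_url, headers={"User-Agent": "Mozilla/5.0"})
#   if should_save(resp.status_code, resp.headers.get("Content-Type")):
#       with open(completename, "wb") as handler:
#           handler.write(resp.content)
```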

Magento: "Image does not exist"

I'm importing a CSV file in Magento (version 1.9).
I receive the error: 'Image does not exist'.
I've tried to do everything I could find on the internet.
The template I'm using for upload is the default template taken from my export folder.
I've added the / before the image name and I've also saved the file as UTF-8 format.
Any advice would help.
Use the advanced profiler:
System > Import/Export > Dataflow – Profiles
You only need to include the attributes that are required, which is just the SKU, plus the appropriate image attributes, plus labels if you want to go all out.
When you are creating your new profile, enter the following settings:
Now you can hit save! With the profile complete, we just need to create the folder media/import. This is where you will store all the images awaiting import.
When uploading images, they need to be inside a folder called media/import. Once saved to that folder, you can reference them relatively: if your image is at media/import/test.jpg, reference it in your CSV as /test.jpg. It's as easy as that.
Please check this link for more information
Import products using csv
In the default import, first move all the images into the media/import folder, then use '/imagename' in the CSV and import.
Also give the import folder 777 permissions.
Let me know if you have any questions.
Check three points before uploading a CSV file in Magento:
Create a media > import folder and place all the images inside it.
The import folder should have 777 permissions.
The image path should be of the form /desert-002.jpg.
It may be an issue with the image path in the CSV: if a path in the CSV is abg/test.jpg, the path on disk is ../media/import/abg/test.jpg. Also check the case of the image extension: if your image's extension is JPG and you write jpg in the CSV, it will report that the image does not exist.
Your file template must look like this:
sku,image
product-001,/product_image.jpg
This file must exist: yourdocroot/media/import/product_image.jpg
For more detail, please read this method:
Mage_Catalog_Model_Convert_Adapter_Product::saveImageDataRow
You will see these lines:
$imageFile = trim($importData['_media_image']);
$imageFile = ltrim($imageFile, DS);
$imageFilePath = Mage::getBaseDir('media') . DS . 'import' . DS . $imageFile;
I hope this helps!
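Since most of the answers above come down to the same checklist (file present under media/import, leading slash in the CSV, extension case matching), a small pre-flight script can catch the problem before Magento does. A sketch, assuming a CSV with an image column (the helper name is mine; the case-sensitive file check matters on Linux servers):

```python
import csv
import os

def missing_images(csv_path, media_import_dir):
    """List image paths referenced in the CSV that are absent from media/import."""
    missing = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            image = (row.get('image') or '').lstrip('/')
            if image and not os.path.isfile(os.path.join(media_import_dir, image)):
                missing.append(image)
    return missing
```

Running missing_images('import.csv', 'media/import') before the upload lists anything Magento would reject with "Image does not exist".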
