Downloading with node modifies excel files and causes data loss - node.js

I am trying to create a script in Node.js which will download an Excel file. My approach is to make an http.get request to the URL and then write the response to a file using response.pipe and createWriteStream. My code is as follows:
const fs = require("fs");
const http = require("http");

let url = "http://www.functionalglycomics.org:80/glycomics/HFileServlet?operation=downloadRawFile&fileType=DAT&sideMenu=no&objId=1002183";

http.get(url, response => {
  let file = fs.createWriteStream('file.xls');
  let stream = response.pipe(file);
});
If you download the following file using Firefox, it downloads correctly; opening it works fine and Excel gives no errors.
http://www.functionalglycomics.org:80/glycomics/HFileServlet?operation=downloadRawFile&fileType=DAT&sideMenu=no&objId=1002183
Note: the download link above will not work in Chrome due to an issue with the filename containing a comma. Therefore I cannot use Puppeteer for this.
However, if I use my script above to download the file and then try to open it in Excel, it gives me an error stating "data may have been lost" five times, but eventually still opens the file.
My question is therefore: what is causing this data loss when downloading using Node.js?
Update
Some data about versions:
Node:v12.13.1
Excel: Office 2019
OS: Windows 10 latest
Update 2
Based on the comments below from jarmod, I tried using wget in Windows PowerShell. It downloads the file too, but produces the same Excel error.

I posted this as an issue on the Node.js GitHub repository. Hakerh400 provided a good description of what is happening there, but briefly: on the Windows NTFS file system there is a feature called ADS (Alternate Data Streams) which keeps track of which files were downloaded from the internet, so that security warnings can be shown. You can read more about it in Hakerh400's comment here.
The workaround proposed is to add this Zone.Identifier ADS to the file after the download is complete, as in the following example:
http.get(url, response => {
  let file = fs.createWriteStream('file.xls');
  response.pipe(file);
  // Write the ADS once the download has finished, not while it is in progress.
  file.on('finish', () => {
    fs.writeFileSync(
      'file.xls:Zone.Identifier',
      `[ZoneTransfer]\r\nZoneId=3\r\nHostUrl=${url}`,
    );
  });
});
Note: this workaround allows you to open the Excel file in "Protected View" without any warnings. However, if you click "Enable Editing" in the security prompt, the "File Error: data may have been lost" message still pops up (Excel 2019). There is, however, no real data loss in terms of the sheets or the data in the cells.
I hope this answer helps anyone who faces anything similar.

Related

Downloading an image from the web and saving

I am trying to download an image from Wikipedia and save it to a file locally (using Python 3.9.x). Following this link I tried:
import urllib.request
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, when I try to open this file (Mac OS) I get an error: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
I did some more search and came across this article which suggests modifying the User-Agent. Following that I modified the above code as follows:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, modifying the User-Agent did NOT help and I still get the same error while trying to open the file: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
Another piece of information: the downloaded file (that does not open) is 235 KB. But if I download the image manually (Right Click -> Save Image As...) it is 455 KB.
I was wondering what else I might be missing. Thank you!
The problem is that you're downloading the web page, not the image itself, and saving it with a .jpg extension.
The link you used is not a direct link to the photo but to a web page that contains the photograph.
That's why the photo is 455 KB while the file you're downloading is 235 KB.
Instead of this :
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
Use this :
http = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Abacus_4.jpg/800px-Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
It is better to first open any photo you want to use with the "Open image in new tab" option in your browser, and then copy the URL from there.

Autorenaming duplicate filename downloads in chrome/puppeteer/ubuntu

I'm downloading PDF files using headful Chromium & Puppeteer. I call a JavaScript function in the browser context and the download starts. The file name comes as-is from the server. Issue: many of the files I download into a directory have the same server-supplied name, and instead of auto-suffixing an index like (1) to the file, Chrome overwrites the existing one.
Since the file is downloaded by calling a JS function (which I have inspected as well), I don't have access to the PDF URL. The download is triggered by the function call, so I have no control over the file names.
I have a list of the file names, but that in no way helps with changing a filename on the fly if a file with the same name already exists on the machine.
Config: Ubuntu 18.04, Puppeteer 1.18.1
I know it's either a config issue with the Nautilus file manager or with Chrome. Is it possible to configure either of these two?
I cannot see an option within Node.js to rename the file before it's downloaded. A workaround is to download each file into a temp folder, then move it to the required folder while checking whether it already exists, renaming it if so. But that adds a lot of overhead. It would be great to have Chrome or Nautilus do the task.
Function which triggers the download:
await page.evaluate(
  (doc_index, arg1, arg2) =>
    openDocument(String(doc_index), String(arg1), String(arg2), 'ABC', '', '', 'XYZ'),
  doc_index, arg1, arg2
);
Expected behaviour: When the above function is called and pdf starts downloading in the set folder, if a pdf of the same name exists, the new pdf should be renamed to something like pdf_name.pdf(1) or the like.

Backup Google Drive to .zip file with file conversion

I keep going in circles on this topic, and can't find an automated method that works for mass data on a Google Drive. Here is the goal I'm looking to achieve:
My company uses an unlimited Google Drive to store shared documents, and we are looking to backup the contents automatically. But we can't have the data stored in a backup with google documents like ".gdoc" and ".gsheet"... we need to have the documents backed up in Microsoft/Open-Office format (".docx" and ".xlsx").
We currently use Google's Takeout page to zip all the contents of the Drive and save it on our Linux server (That has redundant storage). And it does zip and export the files to the correct formats.
Here: https://takeout.google.com/settings/takeout
Now, that works... but it requires a bit of manual work on our part, and babysitting the zip, download, and upload processes is becoming wasteful. I have searched and read that the Google API for Takeout is unavailable through gscript. So that seems to be out of the question.
Using Google scripts, I have been able to convert single files, but I can't, for instance, convert a folder of ".gsheet" files to ".xlsx" format. Maybe copying and converting all the Google files into a new folder on the Drive could be possible. With access to the Drive and the converted "backup", we could then back up the collection of converted files via the server...
So here is the gist of it all:
Can you mass-convert all of a google drive and/or a specific folder on the drive from ".gdoc" to ".docx", and ".gsheet" to ".xlsx". Can this be done with gscript?
If not able via the method in question one, is anyone familiar with a Linux or Mac app that could do such a directory conversion? (I don't believe one exists, because of Google's proprietary file types.)
I'm stuck in a bit of a hole, and any insight to this problem could help. I really wish Google would allow users to convert and export drive folders via a script selection.
#1) Can you mass-convert all of a google drive and/or a specific folder on the drive from ".gdoc" to ".docx", and ".gsheet" to ".xlsx". Can this be done with gscript?
You can try this:
How To Automatically Convert Files in Google Apps Script
Converting file in Google App Script into blob
var documentId = DocumentApp.getActiveDocument().getId();

function getBlob(documentId) {
  var file = Drive.Files.get(documentId);
  var url = file.exportLinks['application/vnd.openxmlformats-officedocument.wordprocessingml.document'];
  var oauthToken = ScriptApp.getOAuthToken();
  var response = UrlFetchApp.fetch(url, {
    headers: {
      'Authorization': 'Bearer ' + oauthToken
    }
  });
  return response.getBlob();
}
Saving file as docx in Drive
function saveFile(blob) {
  var file = {
    title: 'Converted_into_MS_Word.docx',
    mimeType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  };
  file = Drive.Files.insert(file, blob);
  Logger.log('ID: %s, File size (bytes): %s', file.id, file.fileSize);
  return file;
}
Time-driven triggers
A time-driven trigger (also called a clock trigger) is similar to a cron job in Unix. Time-driven triggers let scripts execute at a particular time or on a recurring interval, as frequently as every minute or as infrequently as once per month. (Note that an add-on can use a time-driven trigger once per hour at most.) The time may be slightly randomized — for example, if you create a recurring 9 a.m. trigger, Apps Script chooses a time between 9 a.m. and 10 a.m., then keeps that timing consistent from day to day so that 24 hours elapse before the trigger fires again.
function createTimeDrivenTriggers() {
  // Trigger every 6 hours.
  ScriptApp.newTrigger('myFunction')
      .timeBased()
      .everyHours(6)
      .create();

  // Trigger every Monday at 09:00.
  ScriptApp.newTrigger('myFunction')
      .timeBased()
      .onWeekDay(ScriptApp.WeekDay.MONDAY)
      .atHour(9)
      .create();
}
Process:
List all file IDs inside a folder
Convert the files
Attach the code to a time-driven trigger
2) If not able via the method in question one, is anyone familiar with a Linux or Mac app that could do such a directory conversion? (I don't believe one exists, because of Google's proprietary file types.)
If you want to save it locally, try setting up a cron job and using Download Files:
The Drive API allows you to download files that are stored in Google Drive. Also, you can download exported versions of Google Documents (Documents, Spreadsheets, Presentations, etc.) in formats that your app can handle. Drive also supports providing users direct access to a file via the URL in the webViewLink property.
Depending on the type of download you'd like to perform — a file, a Google Document, or a content link — you'll use one of the following URLs:
Download a file — files.get with alt=media file resource
Download and export a Google Doc — files.export
Link a user to a file — webContentLink from the file resource
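As a rough sketch, the first two download types map to request URLs like the following (assuming the Drive REST API v3 endpoints; the file ID is illustrative, and webContentLink is returned in the file resource itself rather than constructed):

```javascript
// Build a Drive v3 URL for downloading a file's raw content.
function mediaDownloadUrl(fileId) {
  return `https://www.googleapis.com/drive/v3/files/${fileId}?alt=media`;
}

// Build a Drive v3 URL for exporting a Google Doc to another format,
// e.g. 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'.
function exportUrl(fileId, mimeType) {
  return `https://www.googleapis.com/drive/v3/files/${fileId}/export` +
         `?mimeType=${encodeURIComponent(mimeType)}`;
}
```

Both requests still need an OAuth bearer token in the Authorization header, as in the Apps Script example above.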
Sample Code :
$fileId = '0BwwA4oUTeiV1UVNwOHItT0xfa2M';
$content = $driveService->files->get($fileId, array(
    'alt' => 'media'
));
Hope this helps and answers all your questions.

Error when opening XLSX file made by tealeg xlsx in Go language on Google App Engine

I'm using https://github.com/tealeg/xlsx to generate xlsx file in Go language. The application is running on Google App Engine.
var file *xlsx.File
var sheet *xlsx.Sheet
var row *xlsx.Row
var cell *xlsx.Cell
var err error

file = xlsx.NewFile()
sheet, err = file.AddSheet("TEST")
if err != nil {
    fmt.Printf(err.Error())
}
row = sheet.AddRow()
cell = row.AddCell()
cell.Value = "I am a cell!"

w.Header().Set("Content-Type", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
w.Header().Add("Content-Disposition", "attachment; filename=Test.xlsx")
file.Write(w)
fmt.Fprint(w, nil)
The variable w is http.ResponseWriter.
I have tested this code and the browser downloaded the xlsx file successfully, and I can open it with LibreOffice on 64-bit Linux. However, when I tried to open the file with Microsoft Excel 2010 on 32-bit Windows 7, it gave me the following error message:
Excel found unreadable content in 'Test.xlsx'. Do you want to recover the contents of this workbook? If you trust the source of this workbook, click Yes.
Once I clicked Yes, it showed the content "I am a cell!" correctly. Then I clicked the Enable Editing button and it gave me the message:
Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.
How can I make Microsoft Excel open xlsx files generated by tealeg/xlsx without any error message?
Your last line is completely unnecessary, and it corrupts the Excel file (XLSX is a zip archive):
fmt.Fprint(w, nil)
This prints the bytes of the "&lt;nil&gt;" string to the web response after you have already written the content of the Excel file; I believe it was just left there by accident. Remove it.
Also, File.Write() returns an error; check that too, for example:
if err = file.Write(w); err != nil {
    // Handle error, e.g. log it and send back an error response
}
If the error still persists, it's not App Engine related. I suggest first trying the Excel generation locally, saving the output to a file using File.Save(), e.g.:
if err = file.Save("MyXLSXFile.xlsx"); err != nil {
    fmt.Println("Failed to save file:", err)
}
// If there is no error, try opening "MyXLSXFile.xlsx" on your computer.
Also note the comment from the home page of the library:
The support for writing XLSX files is currently extremely minimal. It will expand slowly, but in the meantime patches are welcome!

Excel Object in asp.net

I have a serious issue.
I am using an Excel object to open an Excel file, and it works fine on my PC.
However, when I deploy the application as a website, run the page, and upload the file, it gives the error: "'C:\Documents and Settings\Administrator\Desktop\Work\SABRE MSO Mapping Request Template.xlsx' could not be found. Check the spelling of the file name, and verify that the file location is correct. If you are trying to open the file from your list of most recently used files, make sure that the file has not been renamed, moved, or deleted."
I think it is using the server path, but I want to open the client's Excel file before saving it to the server.
Please help.
Have you tried the Server.MapPath() method? Do you have the proper permissions set up to access the folder?
Are you passing the complete file path to Excel when opening the file? Please try this:
if (fileUpload.HasFile)
{
    string fileName = "PATH_RELATIVE_TO_YOUR_SITE" + "FILE_NAME";
    fileUpload.PostedFile.SaveAs(fileName);
    // NOW open excel using fileName;
}
You also need write permissions on the path (folder) you are writing the file to.
