Why can't I download from S3 using wget? - python-3.x

When I put https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv into a browser, I can download a file no problem. But when I say,
wget.download('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv', out='data/')
I get a 404 error. Is there something wrong with the format of that URL?
This is not a duplicate of HTTP Error 404: Not Found when using wget to download a link. wget works fine with other files. This appears to be something specific to S3, as explained below.

The root cause is a bug in S3, as described here: https://stackoverflow.com/a/38285197/4323
One workaround is to use the requests library instead:
r = requests.get('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv')
This works fine. You can inspect r.text or write it to a file. For the most efficient way, see https://stackoverflow.com/a/39217788/4323
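For example, a minimal streaming sketch (the data/ output path is carried over from the original wget call and is only an assumption about where you want the file):

import requests

url = 'https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv'

# Stream the response so the whole CSV is never held in memory at once
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('data/fhv_tripdata_2015-01.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)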

Related

use getObject() from forge-api npm, how to make the return result as a download link?

I am using the forge-api getObject() to download an Excel file from a BIM360 hub. I set up an Express server in the backend and make the call from the frontend.
I can get the object result (shown as a screenshot in the original post).
So my question is:
How can I correctly turn that result into a download link? I can download the Excel file, but it cannot be opened...
My code looks like this (the backend and frontend snippets appear in the original post):
I think all you need to modify in your backend code is to return content.body instead of content.
See e.g. https://github.com/Autodesk-Forge/forge-derivatives-explorer/blob/master/routes/data.management.js#L296
It might even be better if you generated a pre-signed URL for the file and passed that to the client. In that case, the file would not be downloaded to your server first and then to the client, but directly to the client in a single step.
https://forge.autodesk.com/en/docs/data/v2/reference/http/buckets-:bucketKey-objects-:objectName-signed-POST/
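The signed-URL step is just one HTTP call, so it can be scripted in any language; here is a rough sketch in Python with requests (the 2-legged token, bucket key, and object name below are placeholders, and the endpoint is the one from the docs linked above):

import requests

# Placeholders: a valid 2-legged OAuth token plus your bucket key and object name
TOKEN = '<2-legged-access-token>'
BUCKET_KEY = '<bucket-key>'
OBJECT_NAME = '<object-name>'

resp = requests.post(
    f'https://developer.api.autodesk.com/oss/v2/buckets/{BUCKET_KEY}/objects/{OBJECT_NAME}/signed',
    headers={'Authorization': f'Bearer {TOKEN}', 'Content-Type': 'application/json'},
    json={'minutesExpiration': 60},  # how long the signed URL stays valid
)
resp.raise_for_status()
signed_url = resp.json()['signedUrl']  # pass this URL to the client for a direct download
print(signed_url)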

How to download via URL from DBFS in Azure Databricks

In the documentation here it is mentioned that I am supposed to be able to download a file from the Databricks File System (DBFS) from a URL like:
https://<your-region>.azuredatabricks.net?o=######/files/my-stuff/my-file.txt
But when I try to download it from the URL with my own "o=" parameter similar to this:
https://westeurope.azuredatabricks.net/?o=1234567890123456/files/my-stuff/my-file.txt
it only gives the following error:
HTTP ERROR: 500
Problem accessing /. Reason:
java.lang.NumberFormatException: For input string:
"1234567890123456/files/my-stuff/my-file.txt"
Am I using the wrong URL or is the documentation wrong?
I already found a similar question that was answered, but that one does not seem to match the Azure Databricks documentation and might be for Databricks on AWS:
Databricks: Download a dbfs:/FileStore File to my Local Machine?
Thanks in advance for your help
The URL should be:
https://westeurope.azuredatabricks.net/files/my-stuff/my-file.txt?o=1234567890123456
Note that the file must be in the FileStore folder.
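If you would rather script the download than fetch it through the browser, here is a rough sketch against the DBFS REST API (the workspace URL and personal access token are placeholders; the read endpoint returns at most 1 MB per call, hence the loop):

import base64
import requests

HOST = 'https://westeurope.azuredatabricks.net'  # placeholder workspace URL
TOKEN = '<personal-access-token>'                # placeholder token
CHUNK = 1024 * 1024                              # /dbfs/read returns at most 1 MB per call

def download_dbfs_file(dbfs_path, local_path):
    headers = {'Authorization': f'Bearer {TOKEN}'}
    offset = 0
    with open(local_path, 'wb') as f:
        while True:
            resp = requests.get(
                f'{HOST}/api/2.0/dbfs/read',
                headers=headers,
                params={'path': dbfs_path, 'offset': offset, 'length': CHUNK},
            )
            resp.raise_for_status()
            payload = resp.json()
            if payload['bytes_read'] == 0:
                break
            f.write(base64.b64decode(payload['data']))
            offset += payload['bytes_read']

download_dbfs_file('/FileStore/my-stuff/my-file.txt', 'my-file.txt')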
As a side note, I've been working on something called DBFS Explorer to help with tasks like this, if you would like to give it a try:
https://datathirst.net/projects/dbfs-explorer/

How to download and write a jar file in Node.js?

So I'm working on a Minecraft launcher (because why not, good experience), and I'm stuck when it comes to downloading the libraries.
I have a valid jar URL here. When you download it in the browser, it works fine. But when you download it with Node.js, 7-Zip gives this error when trying to open it:
An attempt was made to move the file pointer before the beginning of the file.
I'm using a module called snekfetch, but I've also tried it with request. Both items gave the same issue. Here's my current test code:
request.get('https://libraries.minecraft.net/com/google/code/gson/gson/2.8.0/gson-2.8.0.jar').then(r => {
    fs.writeFileSync('./mything.jar', r.body);
});
Am I doing something wrong to download the jar file?
Okay, so now that I've seen this answer, I need to modify the question. I've gotten it to work using pipes, but I need inline-code because this is a for loop that's downloading (hence my usage of writeFileSync, and in my actual code I use await for the request). Is it even possible to download and write without piping?
It turns out this is an issue with the snekfetch library. Switching to snekfetch v3 fixed it.
You can check out the status of the issue here.

Downloading file from Dropbox API for use in Python Environment with Apache Tika on Heroku

I'm trying to use Dropbox as a cloud-based file receptacle for an app/script. The script, written in Python, needs to take PDFs from Dropbox and use the tika-python wrapper to convert them to strings.
I'm able to connect to the Dropbox API and use the files_download_to_file() method to download the PDFs to disk, and then use the tika from_file() method to pull that download file from the disk to process. Example:
# Download ex.pdf to local disk
dbx.files_download_to_file('/my_local_path/ex_on_disk.pdf', '/my_dropbox_path/ex.pdf')
from tika import parser
parsed = parser.from_file('ex_on_disk.pdf')
The problem is that I'm planning on running this app on something like Heroku. I don't think I'm able to save anything locally and then access it again. I'm not sure how to get something from the Dropbox API that can be directly referenced by the tika wrapper to run the same as above. I think the PHP SDK has a file_get_contents and a file_put_contents set of methods but it doesn't appear to have a companion in the Python SDK.
I've tried using the shareable links in place of a filename but that hasn't worked. Any ideas? I know there's also the files_download method which downloads the FileMetadata object but I have no idea what to do with this and am having trouble finding more about it.
TLDR; How can I reference a file on Dropbox with a filename string such as 'example.pdf' to be used in another function that is trying to read a file from disk, without saving that Dropbox file to disk?
I figured it out. I used the files_download method to get the raw bytes and then used tika's from_buffer method instead:
md, response = dbx.files_download(path)  # md is the FileMetadata, response is a requests.Response
file_contents = response.content         # the PDF as bytes, never written to disk
parsed = parser.from_buffer(file_contents)
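Putting it together, a minimal end-to-end sketch (the access token is a placeholder; the Dropbox path is the one from the question):

import dropbox
from tika import parser

dbx = dropbox.Dropbox('<ACCESS_TOKEN>')  # placeholder access token

# files_download returns (FileMetadata, requests.Response); the PDF bytes live in
# response.content, so nothing is ever written to local disk
md, response = dbx.files_download('/my_dropbox_path/ex.pdf')
parsed = parser.from_buffer(response.content)
print(parsed['content'])  # the extracted text (may be None for image-only PDFs)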

AWS S3 serving gzipped files but not readable

I'm using AWS S3 to host a static webpage; almost all assets are gzipped before being uploaded.
During the upload the "content-encoding" header is correctly set to "gzip" (and this also reflects when actually loading the file from AWS).
The thing is, the files can't be read and are still in gzip format although the correct headers are set...
The files are uploaded using the npm package s3-deploy (the original post includes screenshots of the request headers and of the still-compressed file contents shown in the browser).
If I upload the file manually and set the content-encoding header to "gzip" it works perfectly. Sadly, I have a couple hundred files to upload for every deployment and cannot do this manually every time (I hope that's understandable ;) ).
Has anyone an idea of what's going on here? Anyone worked with s3-deploy and can help?
I use my own bash script for S3 deployments; you can try something like this:
webpath='path'
BUCKET='BUCKETNAME'
for file in "$webpath"/js/*.gz; do
    aws s3 cp "$file" "s3://$BUCKET/js/" --content-encoding 'gzip' --region='eu-west-1'
done
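If you would rather script the upload in Python instead of bash, here is a rough boto3 equivalent (the bucket name, local path, key prefix, and content type are assumptions):

import glob
import boto3

s3 = boto3.client('s3', region_name='eu-west-1')

# Upload every pre-gzipped asset with the Content-Encoding header set,
# so browsers transparently decompress it
for path in glob.glob('path/js/*.gz'):
    key = 'js/' + path.split('/')[-1]
    s3.upload_file(
        path,
        'BUCKETNAME',
        key,
        ExtraArgs={
            'ContentEncoding': 'gzip',
            'ContentType': 'application/javascript',
        },
    )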
