How to download via URL from DBFS in Azure Databricks

In the documentation here it is mentioned that I am supposed to be able to download a file from the Databricks File System (DBFS) via a URL like:
https://<your-region>.azuredatabricks.net?o=######/files/my-stuff/my-file.txt
But when I try to download it from the URL with my own "o=" parameter similar to this:
https://westeurope.azuredatabricks.net/?o=1234567890123456/files/my-stuff/my-file.txt
it only gives the following error:
HTTP ERROR: 500
Problem accessing /. Reason:
java.lang.NumberFormatException: For input string:
"1234567890123456/files/my-stuff/my-file.txt"
Am I using the wrong URL or is the documentation wrong?
I already found a similar question that was answered, but that one does not seem to fit the Azure Databricks documentation and might only apply to AWS Databricks:
Databricks: Download a dbfs:/FileStore File to my Local Machine?
Thanks in advance for your help

The URL should be:
https://westeurope.azuredatabricks.net/files/my-stuff/my-file.txt?o=1234567890123456
Note that the file must be in the FileStore folder.
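If you need to fetch the file from a script rather than the browser, a minimal sketch using the DBFS REST API's read endpoint could look like the following (the workspace URL, token and path are placeholders, and the endpoint returns at most 1 MB per call):
import base64
import requests

workspace = "https://westeurope.azuredatabricks.net"
token = "<personal-access-token>"

# Read the file contents; the API returns them base64-encoded.
resp = requests.get(
    f"{workspace}/api/2.0/dbfs/read",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/FileStore/my-stuff/my-file.txt", "offset": 0, "length": 1024 * 1024},
)
resp.raise_for_status()

with open("my-file.txt", "wb") as f:
    f.write(base64.b64decode(resp.json()["data"]))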
As a side note, I've been working on a tool called DBFS Explorer to help with things like this, if you would like to give it a try:
https://datathirst.net/projects/dbfs-explorer/

Related

pd.read_parquet produces: OSError: Passed non-file path

I'm trying to retrieve a DataFrame with pd.read_parquet and I get the following error:
OSError: Passed non-file path: my/path
I access the .parquet file in GCS with a path with prefix gs://
For some unknown reason, the OSError shows the path without the prefix gs://
I did suspect it was related to credentials, however from my Mac I can use gsutil with no problem. I can access and download the parquet file via the Google Cloud Platform. I just can't read the .parquet file directly in the gs:// path. Any ideas why not?
Arrow does not currently natively support loading data from GCS. There is some progress in implementing this; the tasks are tracked in JIRA.
As a workaround you should be able to use the fsspec GCS filesystem implementation to access objects (I would try installing it and using 'gcs://' instead of 'gs://'), or open the file directly with the gcsfs API and pass it through.
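For example, a minimal sketch of the gcsfs route (bucket and object names are placeholders, and it assumes your default Google credentials are picked up):
import gcsfs
import pandas as pd

# Open the object as a file-like handle and let pandas read the parquet data.
fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/my/path/file.parquet", "rb") as f:
    df = pd.read_parquet(f)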

Azcopy interprets source as local and adds current path when it is a gcloud storage https url

We want to copy files from Google Storage to Azure Storage.
We were following this guide: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-google-cloud
We run this command:
azcopy copy 'https://storage.googleapis.com/telia-ddi-delivery-plaace/activity_daily_al1_20min/' 'https://plaacedatalakegen2.blob.core.windows.net/teliamovement?<SASKEY>' --recursive=true
And get this resulting error:
INFO: Scanning...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
failed to perform copy command due to error: cannot start job due to error: cannot scan the path /Users/peder/Downloads/https:/storage.googleapis.com/telia-ddi-delivery-plaace/activity_daily_al1_20min, please verify that it is a valid.
It seems to us that azcopy interprets the source as a local path and therefore prepends the directory we run it from, which is /Users/peder/Downloads/. But we are unable to find any argument to indicate that the source is a web location, and our command has the same form as the one in the guide:
azcopy copy 'https://storage.cloud.google.com/mybucket/mydirectory' 'https://mystorageaccount.blob.core.windows.net/mycontainer/mydirectory' --recursive=true
What we have tried:
We are doing this on a Mac in Terminal, but we also tested PowerShell for Mac.
We have tried single and double quotes.
We copied the Azure Storage URL with SAS key from the console to ensure that it has the correct syntax.
We tried cp instead of copy as the help page for azcopy used that.
Is there anything wrong with our command? Or can it be that azcopy has been changed since the guide was written?
I also created an issue for this on the Azure Documentation git page: https://github.com/MicrosoftDocs/azure-docs/issues/78890
The reason you're running into this issue is that the host storage.cloud.google.com is hardcoded in azcopy's source code as the Google Cloud Storage endpoint. From this link:
const gcpHostPattern = "^storage.cloud.google.com"
const invalidGCPURLErrorMessage = "Invalid GCP URL"
const gcpEssentialHostPart = "google.com"
Since you're using storage.googleapis.com instead of storage.cloud.google.com, azcopy does not recognize the source as a valid Google Cloud Storage endpoint and treats the value as a path in your local file system.
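Based on that, a likely fix (untested here) is to keep the rest of the command as-is and only switch the source host to the one azcopy expects:
azcopy copy 'https://storage.cloud.google.com/telia-ddi-delivery-plaace/activity_daily_al1_20min/' 'https://plaacedatalakegen2.blob.core.windows.net/teliamovement?<SASKEY>' --recursive=true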

Display image in Databricks notebook error

I am working on creating a Databricks notebook template with a company logo. Using the code below to display the image throws an error.
Code:
%md
<img src ='/test/image/MyImage.jpg'>
Error:
HTTP ERROR 403: Invalid or missing CSRF token
Please guide me.
You either need to store the image somewhere else and refer to it with a full URL, for example on your company website.
Another way is to upload the file to the /FileStore directory on DBFS; you can then refer to it using a /files/ URL, which is supported in both HTML and Markdown (see docs):
%md
![my_test_image](files/image.jpg)
You can upload the image using databricks-cli, or via the UI (if you have the DBFS File Browser enabled). (Another option is the DBFS REST API, but it's cumbersome.)
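For illustration, a rough sketch of the REST API route (the workspace URL and token are placeholders, and the put endpoint only accepts small files, up to about 1 MB of contents per call):
import base64
import requests

workspace = "https://<your-region>.azuredatabricks.net"
token = "<personal-access-token>"

# Base64-encode the image and upload it into /FileStore via the DBFS put endpoint.
with open("MyImage.jpg", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{workspace}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={"path": "/FileStore/image.jpg", "contents": contents, "overwrite": True},
)
resp.raise_for_status()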

Azure Blob Using Python

I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in a blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the step of downloading the file locally. Is there a way in which I could achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating path is not found.
Any help is appreciated. Thanks!
The file_path in create_blob_from_path is the path of a local file, which looks like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a Blob URL.
You could download the file to a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
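A hedged sketch of that approach with the same legacy SDK (the account credentials and source URL are placeholders):
import requests
from azure.storage.blob import BlockBlobService, ContentSettings

block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# Fetch the CSV into memory instead of saving it to disk first.
csv_bytes = requests.get('https://example.com/export/FlightStats.csv').content

# Upload the bytes straight to the container.
block_blob_service.create_blob_from_bytes(
    'containername',
    'FlightStats',
    csv_bytes,
    content_settings=ContentSettings(content_type='text/csv'),
)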
The other answer uses the legacy Azure SDK for Python.
If this is a fresh implementation, I recommend using a Gen2 storage account (instead of Gen1 or Blob storage).
For a Gen2 storage account, see the example here:
from azure.storage.filedatalake import DataLakeFileClient

data = b"abc"
file = DataLakeFileClient.from_connection_string(
    "my_connection_string", file_system_name="myfilesystem", file_path="myfile")
file.create_file()  # the path must exist before data can be appended to it
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you'll have to keep track of the offset on the client side.
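For example, a small sketch of what that offset bookkeeping looks like (same placeholder names as above):
from azure.storage.filedatalake import DataLakeFileClient

chunks = [b"first,row\n", b"second,row\n", b"third,row\n"]

file = DataLakeFileClient.from_connection_string(
    "my_connection_string", file_system_name="myfilesystem", file_path="myfile.csv")
file.create_file()

# Each append must state where it starts, so the client tracks the running offset.
offset = 0
for chunk in chunks:
    file.append_data(chunk, offset=offset, length=len(chunk))
    offset += len(chunk)

# A single flush commits everything appended so far.
file.flush_data(offset)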

Why can't I download from S3 using wget?

When I put https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv into a browser, I can download a file no problem. But when I say,
wget.download('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv', out='data/')
I get a 404 error. Is there something wrong with the format of that URL?
This is not a duplicate of HTTP Error 404: Not Found when using wget to download a link. wget works fine with other files. This appears to be something specific to S3 which is explained below.
The root cause is a bug in S3, as described here: https://stackoverflow.com/a/38285197/4323
One workaround is to use the requests library instead:
import requests

r = requests.get('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv')
This works fine. You can inspect r.text or write it to a file. For the most efficient way, see https://stackoverflow.com/a/39217788/4323
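For example, a hedged sketch of streaming the response straight to disk instead of holding it all in memory (the local path under data/ mirrors the wget call above):
import requests

url = 'https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    # Write the body in chunks so large files never have to fit in memory.
    with open('data/fhv_tripdata_2015-01.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)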
