MainThread: Vaex: Error while Opening Azure Data Lake Parquet file

I tried to open a Parquet file on Azure Data Lake Storage Gen2 with Vaex, using a generated SAS URL (with the expiry datetime and token embedded in the URL):
vaex.open(sas_url)
and I got the error
ERROR:MainThread:vaex:error opening 'the path which was also the sas_url(can't post it for security reasons)'
ValueError: Do not know how to open (can't publicize the sas url) , no handler for https is known
How do I get Vaex to read the file, or is there another Azure storage option that works better with Vaex?

I finally found a solution! Vaex can read files in Azure blob storage with this:
import vaex
import adlfs

storage_account = "..."
account_key = "..."
container = "..."
object_path = "..."

# Build an fsspec-compatible filesystem for the storage account
fs = adlfs.AzureBlobFileSystem(account_name=storage_account, account_key=account_key)

# Pass the filesystem to vaex.open together with an abfs:// path
df = vaex.open(f"abfs://{container}/{object_path}", fs=fs)
For more details, see https://github.com/vaexio/vaex/issues/1272, where I found the solution.
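If you only have a SAS token rather than the account key (as in the original question), adlfs also exposes a sas_token argument; a minimal sketch under that assumption, reusing the placeholder variables above:
import vaex
import adlfs

# Assumes the SAS token grants read access to the container
fs = adlfs.AzureBlobFileSystem(account_name=storage_account, sas_token=sas_token)
df = vaex.open(f"abfs://{container}/{object_path}", fs=fs)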

Vaex cannot read data from an https source, which is why you are getting the error "no handler for https is known".
Also, as per the documentation, Vaex supports data input from Amazon S3 buckets and Google Cloud Storage.
Cloud support:
Amazon Web Services S3
Google Cloud Storage
Other cloud storage options
The documentation mentions that other cloud storage options are also supported, but there is no documented example anywhere of fetching data from an Azure storage account, let alone with a SAS URL.
Also see the API documentation for the Vaex library for more information.
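For comparison, the documented pattern for the officially supported clouds looks roughly like the sketch below; bucket and object names are placeholders, and the exact fs_options depend on your credentials.
import vaex

# Amazon S3: anonymous access shown here; credentials can also be passed via fs_options
df_s3 = vaex.open("s3://some-bucket/path/to/file.parquet", fs_options={"anon": True})

# Google Cloud Storage
df_gcs = vaex.open("gs://some-bucket/path/to/file.parquet")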

Related

Combining google.cloud.storage.blob.Blob and google.appengine.api.images.get_serving_url

I've got images stored as Blobs in Google Cloud Storage and I can see amazing capabilities offered by get_serving_url(), which requires a blob_key as its first parameter. But I cannot see any way to get a key from the Blob - am I mixing up multiple things called Blobs?
I'm using Python3 on GAE.
Try the code below, but you'll need to first enable the Bundled Services API:
from google.appengine.ext import blobstore

# Build the Cloud Storage file name; your_file_name is a placeholder for "<bucket_name>/<object_name>"
blobstore_filename = f"/gs/{your_file_name}"

# Now create the blob_key from the Cloud Storage file name
blob_key = blobstore.create_gs_key(blobstore_filename)
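From there, the blob key can be passed to get_serving_url as the question intended; a short sketch, assuming the images bundled service is enabled as well:
from google.appengine.api import images

# Generate a serving URL for the image backed by the Cloud Storage object
serving_url = images.get_serving_url(blob_key)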

Connection error while creating linked service (Synapse) Salesforce

When I tried to create a linked service (Salesforce) with the AutoResolveIntegrationRuntime type, I got this error:
ERROR [HY000] [Microsoft][Salesforce] (22) Error parsing XML response from Salesforce: not well-formed (invalid token) at line 1 ERROR [HY000] [Microsoft][Salesforce] (22) Error parsing XML response from Salesforce: not well-formed (invalid token) at line 1 Activity ID: af3f6fd8-172b-4327-bac2-187863960c02.
I verified that all the credentials are correct and it shows on salesforce user login history that the login attempt was successful, but the linked service setup is throwing an error.
As per the updated official documentation:
XML format is supported for the following connectors: Amazon S3,
Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage
Gen1, Azure Data Lake Storage Gen2, Azure Files, File System, FTP,
Google Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP. It is
supported as source but not sink.
Even if it is working elsewhere, you can take a reference from the files and environment for which it works fine and correct the XML file to make it well formed.
You can also check your URL and username once again, and refer to the linked service properties for Salesforce.

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from Azure Data Lake Storage Gen2 using Dask. The container holds only Parquet files, but I only have the account name, account endpoint, and a SAS token.
When I use the Azure SDK's FileSystemClient, I can navigate easily with only those values.
from azure.storage.filedatalake import FileSystemClient

azure_file_system_client = FileSystemClient(
    account_url=endpoint,
    file_system_name="container-name",
    credential=sas_token,
)
When I try to do the same with abfs in Dask, using adlfs as the backend, as below:
import dask.dataframe as dd

ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options = {'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}
ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options,
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
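One thing that may explain the error: the value passed as connection_string is a URL, while adlfs appears to expect a full Azure connection string (semicolon-separated key=value pairs), hence "missing required connection details". A minimal sketch of an alternative, assuming adlfs accepts account_name and sas_token in storage_options (ACCOUNT_NAME, CONTAINER_NAME and sas_token as defined in the question):
import dask.dataframe as dd

storage_options = {
    "account_name": ACCOUNT_NAME,  # the storage account name only, not the full endpoint
    "sas_token": sas_token,        # the raw SAS query string
}

ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options,
)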

Azure ML studio export data Azure Storage V2

I already posted my problem here and they suggested that I post it here.
I am trying to export data from Azure ML to Azure Storage but I have this error:
Error writing to cloud storage: The remote server returned an error: (400) Bad Request.. Please check the url. . ( Error 0151 )
My blob storage configuration is Storage V2 / Standard with Require secure transfer enabled.
If I set Require secure transfer to disabled, the export works fine.
How can I export data to my blob storage with Require secure transfer enabled?
According to the official tutorial Export to Azure Blob Storage, there are two authentication types for exporting data to Azure Blob Storage: SAS and Account. They are described as follows.
For Authentication type, choose Public (SAS URL) if you know that the storage supports access via a SAS URL.
A SAS URL is a special type of URL that can be generated by using an Azure storage utility, and is available for only a limited time. It contains all the information that is needed for authentication and download.
For URI, type or paste the full URI that defines the account and the public blob.
For private accounts, choose Account, and provide the account name and the account key, so that the experiment can write to the storage account.
Account name: Type or paste the name of the account where you want to save the data. For example, if the full URL of the storage account is http://myshared.blob.core.windows.net, you would type myshared.
Account key: Paste the storage access key that is associated with the account.
I tried a simple module combination, shown in the figure and Python code below, to reproduce the issue you got.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    dataframe1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    return dataframe1,
When I tried the Account authentication type with my Blob Storage V2 account, I got the same issue as yours, error code Error 0151 as below, by clicking the View error log button under the View output log link.
Error 0151
There was an error writing to cloud storage. Please check the URL.
This error in Azure Machine Learning occurs when the module tries to write data to cloud storage but the URL is unavailable or invalid.
Resolution
Check the URL and verify that it is writable.
Exception Messages
Error writing to cloud storage (possibly a bad url).
Error writing to cloud storage: {0}. Please check the url.
Based on the error description above, the error seems to be caused by the Export Data module generating an incorrect blob URL with SAS from the account information. I suspect the module code is old and not compatible with the new V2 storage API or API version. You can report it at feedback.azure.com.
However, when I switched to the SAS authentication type and entered a blob URL with a SAS query string for my container, generated via the Azure Storage Explorer tool as below, it worked fine.
Fig 1: Right click on the container of your Blob Storage account, and click the Get Shared Access Signature
Fig 2: Enable the permission Write (recommended to use UTC timezone) and click Create button
Fig 3: Copy the Query string value, and build a blob url with a container SAS query string like https://<account name>.blob.core.windows.net/<container name>/<blob name><query string>
Note: The blob must not already exist in the container, otherwise an Error 0057 will be raised.
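For reference, assembling the URL from Fig 3 is plain string concatenation; a minimal sketch with placeholder values:
account_name = "<account name>"
container_name = "<container name>"
blob_name = "<blob name>"        # must not already exist (see the note above)
query_string = "?..."            # the SAS query string copied from Azure Storage Explorer

# Blob URL with a container SAS query string, as described in Fig 3
blob_url = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}{query_string}"
)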

Cannot read .json from a google cloud bucket

I have a folder structure within a bucket of google cloud storage
bucket_name = 'logs'
json_location = '/logs/files/2018/file.json'
I try to read this json file in a Jupyter notebook using this code:
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "logs/files/2018/file.json"

def download_blob(source_blob_name, bucket_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))
Then calling the function
download_blob('file.json', 'logs', 'file.json')
And I get this error
DefaultCredentialsError: File /logs/files/2018/file.json was not found.
I have looked at all the similar questions asked on Stack Overflow and cannot find a solution.
The json file is present and can be opened or downloaded at the json_location on Google Cloud Storage.
There are two different json files to consider here:
1) The json file used for authenticating to GCP.
2) The json you want to download from a bucket to your local machine.
For the first one: if you are accessing your Jupyter server remotely, the json most probably does not exist on that remote machine but on your local machine. If this is your scenario, upload the json to the Jupyter server. Executing ls -l /logs/files/2018/file.json on the remote machine can help verify it is there. Then, os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "JSON_PATH_ON_JUPYTER_SERVER" should work.
On the other hand, I executed your code and got:
>>> download_blob('static/upload_files_CS.png', 'bucketrsantiago', 'file2.json')
Blob static/upload_files_CS.png downloaded to file2.json.
The file gs://bucketrsantiago/static/upload_files_CS.png was downloaded to my local machine with the name file2.json. This helps to clarify that the only problem is regarding the authentication json file.
GOOGLE_APPLICATION_CREDENTIALS is supposed to point to a file on the local disk where you are running jupyter. You need the credentials in order to call GCS, so you can't fetch them from GCS.
In fact, you are best off not messing around with credentials at all in your program, and leaving that to the client library. Don't touch GOOGLE_APPLICATION_CREDENTIALS in your application. Instead:
If you are running on GCE, just make sure your GCE instance has a service account with the right scopes and permissions. Applications running on that instance will automatically have the permissions of that service account.
If you are running locally, install the Google Cloud SDK and run gcloud auth application-default login. Your program will then automatically use whichever account you log in as (see the sketch below).
Complete instructions here
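A minimal sketch of relying on Application Default Credentials, assuming gcloud auth application-default login has been run (or the code runs on GCE with a suitable service account), and that the object lives at files/2018/file.json inside the logs bucket:
from google.cloud import storage

# No GOOGLE_APPLICATION_CREDENTIALS needed: the client picks up
# Application Default Credentials automatically.
client = storage.Client()
bucket = client.get_bucket("logs")
blob = bucket.blob("files/2018/file.json")
blob.download_to_filename("file.json")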
