Cannot list blobs in Azure container - python-3.x

I would like to list the blobs (the files) in an Azure container. To do so, I reproduced exactly the code snippet given as an example in the official documentation (see here). Here is what my code looks like:
from azure.storage.blob import BlobServiceClient, ContainerClient
from azure.identity import ClientSecretCredential

token_credential = ClientSecretCredential(tenant_id='WWW',
                                          client_id='XXX',
                                          client_secret='YYY')
service = BlobServiceClient("ZZZ", credential=token_credential)
container_client = service.get_container_client(container='AAA')
print(container_client.container_name)
blob_list = container_client.list_blobs()
for blob in blob_list:
    print(blob.name + '\n')
All the lines in this example run fine except the last one, which throws the following error:
---------------------------------------------------------------------------
StorageErrorException Traceback (most recent call last)
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/storage/blob/_list_blobs_helper.py in _get_next_cb(self, continuation_token)
75 cls=return_context_and_deserialized,
---> 76 use_location=self.location_mode)
77 except StorageErrorException as error:
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/storage/blob/_generated/operations/_container_operations.py in list_blob_flat_segment(self, prefix, marker, maxresults, include, timeout, request_id, cls, **kwargs)
1215 map_error(status_code=response.status_code, response=response, error_map=error_map)
-> 1216 raise models.StorageErrorException(response, self._deserialize)
1217
StorageErrorException: (InvalidQueryParameterValue) Value for one of the query parameters specified in the request URI is invalid.
RequestId:39f9e5c3-201f-0114-551d-efab6d000000
Time:2021-01-20T11:13:03.6566856Z
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-6-a064d97987b5> in <module>
----> 1 for blob in blob_list:
2 print(blob.name + '\n')
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/core/paging.py in __next__(self)
127 if self._page_iterator is None:
128 self._page_iterator = itertools.chain.from_iterable(self.by_page())
--> 129 return next(self._page_iterator)
130
131 next = __next__ # Python 2 compatibility.
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/core/paging.py in __next__(self)
74 raise StopIteration("End of paging")
75 try:
---> 76 self._response = self._get_next(self.continuation_token)
77 except AzureError as error:
78 if not error.continuation_token:
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/storage/blob/_list_blobs_helper.py in _get_next_cb(self, continuation_token)
76 use_location=self.location_mode)
77 except StorageErrorException as error:
---> 78 process_storage_error(error)
79
80 def _extract_data_cb(self, get_next_return):
/anaconda/envs/azureml_py36_automl/lib/python3.6/site-packages/azure/storage/blob/_shared/response_handlers.py in process_storage_error(storage_error)
92 error_body = ContentDecodePolicy.deserialize_from_http_generics(storage_error.response)
93 if error_body:
---> 94 for info in error_body.iter():
95 if info.tag.lower() == 'code':
96 error_code = info.text
AttributeError: 'dict' object has no attribute 'iter'
What am I doing wrong?
For info, I'm using Python 3.6.9 with azure-storage-blob==12.5.0 and azure-identity==1.4.1.

The package azure-storage-blob is used to access Azure Blob Storage. The account URL used in the script should look like https://{StorageAccountName}.blob.core.windows.net. The URL https://{StorageAccountName}.dfs.core.windows.net is the endpoint of Azure Data Lake Storage Gen2. If you want to list files stored in Azure Data Lake Storage Gen2, you need to use the package azure-storage-file-datalake instead; for details on how to use that package, please refer to the official samples and the sketch below.
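A minimal sketch of listing paths with azure-storage-file-datalake, reusing the placeholder credentials and container name from the question (the account URL placeholder follows the dfs endpoint pattern mentioned above and is not taken from the original post):

```python
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

token_credential = ClientSecretCredential(tenant_id='WWW',
                                          client_id='XXX',
                                          client_secret='YYY')

# The account URL uses the dfs endpoint, not the blob endpoint.
service = DataLakeServiceClient(account_url="https://{StorageAccountName}.dfs.core.windows.net",
                                credential=token_credential)

# In ADLS Gen2 terms, the container is a "file system".
file_system_client = service.get_file_system_client(file_system='AAA')
for path in file_system_client.get_paths():
    print(path.name)
```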

Related

Creating Spark database in Azure Databricks with location to ADLS Gen 2 using ABFS driver throws an exception

I am creating a database in Azure Databricks using an abfss location in the CREATE DATABASE statement, and it throws an exception.
Authentication to ADLS - Session Scoped Access Key Authentication as below
spark.conf.set("fs.azure.account.key.formula1supportdl.dfs.core.windows.net",
               dbutils.secrets.get(scope="databricks-support-scope", key="formula1supportdl-account-key"))
Access Method to ADLS - abfs driver as below
%sql
CREATE DATABASE f1_demo_abfss
LOCATION 'abfss://demo@formula1supportdl.dfs.core.windows.net/'
The error message is shown in the stack trace at the end of this post.
Additional information
Access to the same storage account/container works perfectly fine if I use cluster-scoped authentication, i.e. when the Spark configuration is added to the cluster configuration instead of being set in the notebook.
Specifying the location for a table as above works perfectly fine too; the problem is only with the CREATE DATABASE statement.
An example CREATE TABLE with the same location works perfectly fine (see the sketch below).
This is only a problem when trying to access the location in a CREATE DATABASE statement using session-scoped authentication. As I said above, cluster-scoped authentication works perfectly fine, and CREATE TABLE with the same location works perfectly fine.
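The original CREATE TABLE example was a screenshot, so the sketch below is a hypothetical reconstruction (the table name, columns and sub-path are made up), written through spark.sql so it matches the notebook code above:

```python
# Hypothetical reconstruction of the working CREATE TABLE; the table name,
# columns and sub-path are invented for illustration. `spark` is the ambient
# SparkSession in a Databricks notebook, and the session-scoped access key
# configuration shown earlier is assumed to have been set already.
spark.sql("""
    CREATE TABLE f1_demo_circuits (circuitId INT, name STRING)
    LOCATION 'abfss://demo@formula1supportdl.dfs.core.windows.net/circuits'
""")
```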
Any help would be greatly appreciated!
Stack trace as requested by Pradeep
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<command-2302823602205408> in <cell line: 1>()
5 display(df)
6 return df
----> 7 _sqldf = ____databricks_percent_sql()
8 finally:
9 del ____databricks_percent_sql
<command-2302823602205408> in ____databricks_percent_sql()
2 def ____databricks_percent_sql():
3 import base64
----> 4 df = spark.sql(base64.standard_b64decode("Q1JFQVRFIERBVEFCQVNFIGYxX2RlbW9fYWJmc3MKTE9DQVRJT04gJ2FiZnNzOi8vZGVtb0Bmb3JtdWxhMXN1cHBvcnRkbC5kZnMuY29yZS53aW5kb3dzLm5ldC8n").decode())
5 display(df)
6 return df
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
46 start = time.perf_counter()
47 try:
---> 48 res = func(*args, **kwargs)
49 logger.log_success(
50 module_name, class_name, function_name, time.perf_counter() - start, signature
/databricks/spark/python/pyspark/sql/session.py in sql(self, sqlQuery, **kwargs)
1117 sqlQuery = formatter.format(sqlQuery, **kwargs)
1118 try:
-> 1119 return DataFrame(self._jsparkSession.sql(sqlQuery), self)
1120 finally:
1121 if len(kwargs) > 0:
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
200 # Hide where the exception came from that shows a non-Pythonic
201 # JVM exception message.
--> 202 raise converted from None
203 else:
204 raise
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.contracts.exceptions.KeyProviderException Failure to initialize configuration)
This has also been raised with Microsoft support here:
https://learn.microsoft.com/en-us/answers/questions/1180145/creating-spark-database-in-azure-databricks-with-l

Tweepy pagination KeyError: 0

I tried using Tweepy's pagination based on the code provided in its documentation:
```
import tweepy
auth = tweepy.AppAuthHandler("Consumer Key here", "Consumer Secret here")
api = tweepy.API(auth)
for status in tweepy.Cursor(api.search_tweets, "Tweepy",
                            count=100).items(250):
    print(status.id)
```
However, I get the following error:
```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16136/3940301818.py in <module>
----> 1 for status in tweepy.Cursor(api.search_tweets, "Tweepy",
2 count=100).items(250):
3 print(status.id)
C:\ProgramData\Anaconda3\lib\site-packages\tweepy\cursor.py in __next__(self)
84
85 def __next__(self):
---> 86 return self.next()
87
88 def next(self):
C:\ProgramData\Anaconda3\lib\site-packages\tweepy\cursor.py in next(self)
290 self.page_index += 1
291 self.num_tweets += 1
--> 292 return self.current_page[self.page_index]
293
294 def prev(self):
KeyError: 0
```
Can someone explain and rectify the error please?
With the current version of Tweepy 4.8.0, the AuthHandler syntax has changed.
Update Tweepy:
pip install Tweepy -U
and the following should work:
import tweepy

auth = tweepy.OAuth2AppHandler("Consumer Key here", "Consumer Secret here")
api = tweepy.API(auth)
for status in tweepy.Cursor(api.search_tweets, "Tweepy",
                            count=100).items(250):
    print(status.id)
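If you are using Twitter API v2 credentials instead, Tweepy 4.x also exposes the v2 endpoints through tweepy.Client and tweepy.Paginator; a rough equivalent (assuming you have a bearer token, which is not part of the original question) would be:

```python
import tweepy

# Assumption: a Twitter API v2 bearer token is available.
client = tweepy.Client(bearer_token="Bearer Token here")

# Paginator plays the same role for Client methods that Cursor does for API methods.
for tweet in tweepy.Paginator(client.search_recent_tweets, "Tweepy",
                              max_results=100).flatten(limit=250):
    print(tweet.id)
```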

InvalidConfigError: Invalid client secrets file while saving PyDrive credentials

I use a Colaboratory notebook and tried to automate the GoogleAuth process while using the PyDrive library.
I tried the way dano proposed here: https://stackoverflow.com/a/24542604/10131744
Nevertheless, I get an error linked to the client secret.
Here is my code:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
#gauth.credentials = GoogleCredentials.get_application_default()
gauth.LoadCredentialsFile("mycreds.txt")
if gauth.credentials is None:
    # Authenticate if they're not there
    gauth.LocalWebserverAuth()
elif gauth.access_token_expired:
    # Refresh them if expired
    gauth.Refresh()
else:
    # Initialize the saved creds
    gauth.Authorize()
gauth.SaveCredentialsFile("mycreds.txt")
drive = GoogleDrive(gauth)
And here is the message I get:
/usr/local/lib/python3.6/dist-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access mycreds.txt: No such file or directory
warnings.warn(_MISSING_FILE_MESSAGE.format(filename))
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/oauth2client/clientsecrets.py in _loadfile(filename)
120 try:
--> 121 with open(filename, 'r') as fp:
122 obj = json.load(fp)
FileNotFoundError: [Errno 2] No such file or directory: 'client_secrets.json'
During handling of the above exception, another exception occurred:
InvalidClientSecretsError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pydrive/auth.py in LoadClientConfigFile(self, client_config_file)
385 try:
--> 386 client_type, client_info = clientsecrets.loadfile(client_config_file)
387 except clientsecrets.InvalidClientSecretsError as error:
/usr/local/lib/python3.6/dist-packages/oauth2client/clientsecrets.py in loadfile(filename, cache)
164 if not cache:
--> 165 return _loadfile(filename)
166
/usr/local/lib/python3.6/dist-packages/oauth2client/clientsecrets.py in _loadfile(filename)
124 raise InvalidClientSecretsError('Error opening file', exc.filename,
--> 125 exc.strerror, exc.errno)
126 return _validate_clientsecrets(obj)
InvalidClientSecretsError: ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
During handling of the above exception, another exception occurred:
InvalidConfigError Traceback (most recent call last)
<ipython-input-9-370983bd3c5e> in <module>()
13 if gauth.credentials is None:
14 # Authenticate if they're not there
---> 15 gauth.LocalWebserverAuth()
16 elif gauth.access_token_expired:
17 # Refresh them if expired
/usr/local/lib/python3.6/dist-packages/pydrive/auth.py in _decorated(self, *args, **kwargs)
111 self.LoadCredentials()
112 if self.flow is None:
--> 113 self.GetFlow()
114 if self.credentials is None:
115 code = decoratee(self, *args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pydrive/auth.py in GetFlow(self)
441 if not all(config in self.client_config \
442 for config in self.CLIENT_CONFIGS_LIST):
--> 443 self.LoadClientConfig()
444 constructor_kwargs = {
445 'redirect_uri': self.client_config['redirect_uri'],
/usr/local/lib/python3.6/dist-packages/pydrive/auth.py in LoadClientConfig(self, backend)
364 raise InvalidConfigError('Please specify client config backend')
365 if backend == 'file':
--> 366 self.LoadClientConfigFile()
367 elif backend == 'settings':
368 self.LoadClientConfigSettings()
/usr/local/lib/python3.6/dist-packages/pydrive/auth.py in LoadClientConfigFile(self, client_config_file)
386 client_type, client_info = clientsecrets.loadfile(client_config_file)
387 except clientsecrets.InvalidClientSecretsError as error:
--> 388 raise InvalidConfigError('Invalid client secrets file %s' % error)
389 if not client_type in (clientsecrets.TYPE_WEB,
390 clientsecrets.TYPE_INSTALLED):
InvalidConfigError: Invalid client secrets file ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
I've tried to add a client_secrets.json file according to this answer: https://stackoverflow.com/a/33426759/10131744
But either I did something wrong or the .json file is not in the right place; in any case, it doesn't work. (A sketch of how the file's location can be made explicit is below.)
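For what it's worth, PyDrive looks for client_secrets.json in the current working directory by default (the file name comes from GoogleAuth.DEFAULT_SETTINGS); a sketch of pointing it at an explicit path, where the path itself is only a guess, would be:

```python
from pydrive.auth import GoogleAuth

# Hypothetical path: adjust to wherever the downloaded client_secrets.json
# actually lives in the Colab file system.
GoogleAuth.DEFAULT_SETTINGS['client_config_file'] = '/content/client_secrets.json'

gauth = GoogleAuth()  # the rest of the flow stays the same as above
```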
Thanks a lot for your help.

Connect to S3 accelerate endpoint with boto3

I want to download a file into a Python file object from an S3 bucket that has transfer acceleration enabled. I came across a few resources suggesting either overriding the endpoint_url with "s3-accelerate.amazonaws.com", or using the use_accelerate_endpoint attribute, or both.
I have tried both, and several variations, but the same error was returned every time. One of the scripts I tried is:
from botocore.config import Config
import boto3
from io import BytesIO

session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    aws_access_key_id=<MY_KEY_ID>,
    aws_secret_access_key=<MY_KEY>,
    region_name="us-west-2",
    config=Config(s3={"use_accelerate_endpoint": True,
                      "addressing_style": "path"}))
input = BytesIO()
s3.download_fileobj(<MY_BUCKET>, <MY_KEY>, input)
Returns the following error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-61-92b89b45f215> in <module>()
11 "addressing_style": "path"}))
12 input = BytesIO()
---> 13 s3.download_fileobj(bucket, filename, input)
14
15
~/Project/venv/lib/python3.5/site-packages/boto3/s3/inject.py in download_fileobj(self, Bucket, Key, Fileobj, ExtraArgs, Callback, Config)
568 bucket=Bucket, key=Key, fileobj=Fileobj,
569 extra_args=ExtraArgs, subscribers=subscribers)
--> 570 return future.result()
571
572
~/Project//venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
71 # however if a KeyboardInterrupt is raised we want want to exit
72 # out of this and propogate the exception.
---> 73 return self._coordinator.result()
74 except KeyboardInterrupt as e:
75 self.cancel()
~/Project/venv/lib/python3.5/site-packages/s3transfer/futures.py in result(self)
231 # final result.
232 if self._exception:
--> 233 raise self._exception
234 return self._result
235
~/Project/venv/lib/python3.5/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/Project/venv/lib/python3.5/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future)
347 Bucket=transfer_future.meta.call_args.bucket,
348 Key=transfer_future.meta.call_args.key,
--> 349 **transfer_future.meta.call_args.extra_args
350 )
351 transfer_future.meta.provide_transfer_size(
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
310 "%s() only accepts keyword arguments." % py_operation_name)
311 # The "self" in this scope is referring to the BaseClient.
--> 312 return self._make_api_call(operation_name, kwargs)
313
314 _api_call.__name__ = str(py_operation_name)
~/Project/venv/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
603 error_code = parsed_response.get("Error", {}).get("Code")
604 error_class = self.exceptions.from_code(error_code)
--> 605 raise error_class(parsed_response, operation_name)
606 else:
607 return parsed_response
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
When I run the same script with "use_accelerate_endpoint": False it works fine.
However, it returned the same error when:
I overwrite the endpoint_url with "s3-accelerate.amazonaws.com"
I define "addressing_style": "virtual"
When running
s3.get_bucket_accelerate_configuration(Bucket=<MY_BUCKET>)
I get {..., 'Status': 'Enabled'} as expected.
Any idea what is wrong with that code and what I should change to properly query the accelerate endpoint of that bucket?
Using python3.5 with boto3==1.4.7, botocore==1.7.43 on Ubuntu 17.04.
EDIT:
I have also tried a similar script for uploads:
from botocore.config import Config
import boto3
from io import BytesIO

session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    aws_access_key_id=<MY_KEY_ID>,
    aws_secret_access_key=<MY_KEY>,
    region_name="us-west-2",
    config=Config(s3={"use_accelerate_endpoint": True,
                      "addressing_style": "virtual"}))
output = BytesIO()
output.seek(0)
s3.upload_fileobj(output, <MY_BUCKET>, <MY_KEY>)
This works without the use_accelerate_endpoint option (so my keys are fine), but returns this error when it is set to True:
ClientError: An error occurred (SignatureDoesNotMatch) when calling the PutObject operation: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I have tried both addressing_style options here as well (virtual and path)
Using boto3==1.4.7 and botocore==1.7.43.
Here is one way to retrieve an object from a bucket with transfer acceleration enabled.
import boto3
from botocore.config import Config
from io import BytesIO

config = Config(s3={"use_accelerate_endpoint": True})
s3_resource = boto3.resource("s3",
                             aws_access_key_id=<MY_KEY_ID>,
                             aws_secret_access_key=<MY_KEY>,
                             region_name="us-west-2",
                             config=config)
s3_client = s3_resource.meta.client
file_object = BytesIO()
s3_client.download_fileobj(<MY_BUCKET>, <MY_KEY>, file_object)
Note that the client sends a HEAD request to the accelerated endpoint before the GET.
The canonical request for it looks somewhat like the following:
CanonicalRequest:
HEAD
/<MY_KEY>
host:<MY_BUCKET>.s3-accelerate.amazonaws.com
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20200520T204128Z
host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Some reasons why the HEAD request can fail include:
Object with given key doesn't exist or has strict access control enabled
Invalid credentials
Transfer acceleration isn't enabled
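To narrow down which of these applies, one option (a sketch reusing the placeholders from the snippets above, not something from the original answer) is to issue the HEAD request directly and inspect the error code:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

s3_client = boto3.client("s3",
                         aws_access_key_id="<MY_KEY_ID>",
                         aws_secret_access_key="<MY_KEY>",
                         region_name="us-west-2",
                         config=Config(s3={"use_accelerate_endpoint": True}))
try:
    # HeadObject is the call the transfer manager makes first to learn the object size.
    s3_client.head_object(Bucket="<MY_BUCKET>", Key="<MY_KEY>")
except ClientError as error:
    # A 403 here reproduces the problem without involving download_fileobj.
    print(error.response["Error"])
```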

How do you use dask + distributed for NFS files?

Working from Matthew Rocklin's post on distributed data frames with Dask, I'm trying to distribute some summary statistics calculations across my cluster. Setting up the cluster with dcluster ... works fine. Inside a notebook,
import dask.dataframe as dd
from distributed import Executor, progress
e = Executor('...:8786')
df = dd.read_csv(...)
The file I'm reading is on an NFS mount that all the worker machines have access to. At this point I can look at df.head() for example and everything looks correct. From the blog post, I think I should be able to do this:
df_future = e.persist(df)
progress(df_future)
# ... wait for everything to load ...
df_future.head()
But that throws an error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-26-8d59adace8bf> in <module>()
----> 1 fraudf.head()
/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
358
359 if compute:
--> 360 result = result.compute()
361 return result
362
/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
35
36 def compute(self, **kwargs):
---> 37 return compute(self, **kwargs)[0]
38
39 #classmethod
/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
108 for opt, val in groups.items()])
109 keys = [var._keys() for var in variables]
--> 110 results = get(dsk, keys, **kwargs)
111
112 results_iter = iter(results)
/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
55 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
56 cache=cache, queue=queue, get_id=_thread_get_id,
---> 57 **kwargs)
58
59 return results
/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
479 _execute_task(task, data) # Re-execute locally
480 else:
--> 481 raise(remote_exception(res, tb))
482 state['cache'][key] = res
483 finish_task(dsk, key, state, results, keyorder.get)
AttributeError: 'Future' object has no attribute 'head'
Traceback
---------
File "/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/async.py", line 264, in execute_task
result = _execute_task(task, data)
File "/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/async.py", line 246, in _execute_task
return func(*args2)
File "/work/analytics2/analytics/python/envs/analytics/lib/python3.5/site-packages/dask/dataframe/core.py", line 354, in <lambda>
dsk = {(name, 0): (lambda x, n: x.head(n=n), (self._name, 0), n)}
What's the right approach to distributing a data frame when it comes from a normal file system instead of HDFS?
Dask is trying to use the single-machine scheduler, which is the default if you create a dataframe using the normal dask library. Switch the default to use your cluster with the following lines:
import dask
dask.set_options(get=e.get)
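As an aside for newer releases of dask and distributed (this is an assumption about later versions, not part of the original answer): Executor has since been renamed to Client, and creating a Client registers it as the default scheduler, so the set_options call is no longer needed. A rough sketch:

```python
import dask.dataframe as dd
from dask.distributed import Client, progress

# Creating the Client makes it the default scheduler for subsequent dask calls.
client = Client('...:8786')   # same placeholder scheduler address as above

df = dd.read_csv(...)          # same placeholder path as in the question
df = client.persist(df)        # load the partitions into cluster memory
progress(df)
print(df.head())               # now computed on the cluster
```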
