Getting StringIndexOutOfBoundsException while mounting a storage container in Azure Databricks

When I try to mount a storage container in Azure Databricks, I get the error below. The container is not currently mounted, which is why I am mounting it:
scope = 'myscope'
storage_account_name = dbutils.secrets.get(scope=scope, key="AccessKey")
storage_account_access_key = dbutils.secrets.get(scope=scope, key="Accessprimary-key")
cont = "reference"
configs = {f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": storage_account_access_key}
dbutils.fs.mount(
    source = f"wasbs://{cont}#{storage_account_name}.blob.core.windows.net/",
    mount_point = f"/mnt/{cont}",
    extra_configs = configs
)
Can someone help me fix this issue? Basically, I want to mount the container. The scope and secrets are already set, and they work perfectly when I unmount any of the existing containers.
Error :
ExecutionError Traceback (most recent call last)
<command-3711978919081072> in <module>
32
33
---> 34 dbutils.fs.mount(
35 source = f"wasbs://{cont}#{storage_account_name}.blob.core.windows.net/",
36 mount_point = f"/mnt/{cont}",
/databricks/python_shell/dbruntime/dbutils.py in f_with_exception_handling(*args, **kwargs)
387 exc.__context__ = None
388 exc.__cause__ = None
--> 389 raise exc
390
391 return f_with_exception_handling
ExecutionError: An error occurred while calling o388.mount.
: shaded.databricks.org.apache.hadoop.fs.azure.AzureException:
java.lang.StringIndexOutOfBoundsException: String index out of range: 11

I think that you have a problem with this line:
storage_account_name = dbutils.secrets.get(scope=scope, key="AccessKey")
This variable should contain the storage account name, but it looks like you're pulling in the access key for that storage account instead.
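A minimal sketch of the corrected setup, assuming the account name is stored in its own secret (the key name "AccountName" here is hypothetical; you could just as well hard-code the name):
# "AccountName" is a hypothetical secret key holding the storage account name
storage_account_name = dbutils.secrets.get(scope=scope, key="AccountName")
storage_account_access_key = dbutils.secrets.get(scope=scope, key="Accessprimary-key")
cont = "reference"
configs = {f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": storage_account_access_key}
dbutils.fs.mount(
    # note the container@account form of the wasbs URI
    source = f"wasbs://{cont}@{storage_account_name}.blob.core.windows.net/",
    mount_point = f"/mnt/{cont}",
    extra_configs = configs
)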

Related

AttributeError: 'ImageCollection' object has no attribute 'tag' Docker sdk 2.0

I want to build, tag, and push a Docker image into an ECR repository using Python 3, from an EC2 instance.
I'm using docker and boto3:
docker==4.0.1
Here I build the image, and it turns out OK:
response = self._get_docker_client().images.build(path=build_info.context,
                                                  dockerfile=build_info.dockerfile,
                                                  tag=tag,
                                                  buildargs=build_info.arguments,
                                                  encoding=DOCKER_ENCODING,
                                                  quiet=False,
                                                  forcerm=True)
Here it fails because it can't find the attribute "tag":
self._get_docker_client().images.tag(repository=repo, tag=tag, force=True)
Another way to get the same error: I try to give the method a target image id taken from the build response. In my IDE (IntelliJ) I see two different methods for tagging, one on "ImageApiMixin" and the other on "Image", so I tried a different approach:
for i in response[1]:
    print(i)
    if 'Successfully built' in i.get('stream', ''):
        print('JECN BUILD data[stream]')
        print(i['stream'])
        image = i['stream'].strip().split()[-1]
        print('JECN BUILD image')
        print(image)
        self._get_docker_client().images.tag(self, image=image, repository='999335850108.dkr.ecr.us-east-2.amazonaws.com/jenkins/alpine', tag='3.13.1', force=True)
In both cases I get the same error (I hard-coded some values in the last attempt):
'ImageCollection' object has no attribute 'tag'
amazon-ebs: Process Process-1:2:
amazon-ebs: Traceback (most recent call last):
amazon-ebs: File "/root/.local/lib/python3.7/site-packages/bulldocker/services/build_service.py", line 25, in perform
amazon-ebs: build_info.image_id = self.__build(build_info)
amazon-ebs: File "/root/.local/lib/python3.7/site-packages/bulldocker/services/build_service.py", line 114, in __build
amazon-ebs: self._get_docker_client().images.tag(self, image=image, repository='999335850108.dkr.ecr.us-east-2.amazonaws.com/jenkins/alpine', tag='3.13.1', force=True)
amazon-ebs: AttributeError: 'ImageCollection' object has no attribute 'tag'
I still don't get why the library is confused here. When I look into ImageCollection, it only appears inside the scope used by the docker client and models library, and I have really run out of ideas.
Here is how I build my Docker client:
def get_ecr_docker_client():
    print('JECN GETTING DOCKER CLIENT')
    access_key_id, secret_access_key, aws_external_id = get_param_from_store()
    aws_region = DEFAULT_AWS_REGION
    print(os.environ)
    print(access_key_id)
    print(secret_access_key)
    docker_client = docker.from_env()
    client = boto3.client(
        service_name='sts',
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name=aws_region,
    )
    assumed_role_object = client.assume_role(
        RoleArn="arn:aws:iam::999335850108:role/adl-pre-ops-jenkins-role",
        RoleSessionName="AssumeRoleSession1",
        ExternalId=aws_external_id
    )
    credentials = assumed_role_object['Credentials']
    ecr_client = boto3.client(
        service_name='ecr',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken'],
        region_name=aws_region
    )
    ecr_credentials = \
        (ecr_client.get_authorization_token(registryIds=[ECR_OPERATIONAL_REGISTRY_ID]))['authorizationData'][0]
    ecr_username = 'AWS'
    decoded = base64.b64decode(ecr_credentials['authorizationToken'])
    ecr_password = (base64.b64decode(ecr_credentials['authorizationToken']).replace(b'AWS:', b'').decode('utf-8'))
    ecr_url = ecr_credentials['proxyEndpoint']
    docker_client.login(
        username=ecr_username, password=ecr_password, registry=ecr_url)
    return docker_client
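For what it's worth, in docker-py 4.x the tag() method lives on the Image model, not on the ImageCollection returned by client.images. A minimal sketch of that usage (the ECR repository and tag values are taken from the question; the local tag is a placeholder):
import docker

client = docker.from_env()

# images.build() returns an (Image, build_log) tuple in docker-py 3.x/4.x
image, build_log = client.images.build(path='.', tag='local/alpine:latest')

# tag() is called on the Image object itself, not on client.images
image.tag(repository='999335850108.dkr.ecr.us-east-2.amazonaws.com/jenkins/alpine', tag='3.13.1')

# Alternatively, look the image up first and then tag it
image = client.images.get('local/alpine:latest')
image.tag(repository='999335850108.dkr.ecr.us-east-2.amazonaws.com/jenkins/alpine', tag='3.13.1')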

List files in azure file storage and get the most recent file

I have files in Azure File Storage, so I listed the files by upload date, and now I want to select the most recently uploaded file.
To do this, I created a function that should return the list of files. However, when I look at the output, it returns only one file and the others are missing.
Here is my code:
file_service = FileService(account_name='', account_key='')
generator = list(file_service.list_directories_and_files(''))

def list_files_in(generator, file_service):
    list_files = []
    for file_or_dir in generator:
        file_in = file_service.get_file_properties(share_name='', directory_name="", file_name=file_or_dir.name)
        file_date = file_in.properties.last_modified.date()
        list_tuple = (file_date, file_or_dir.name)
        list_files.append(list_tuple)
        return list_files
To get the latest files in Azure storage you need to write that logic yourself; the code below will give you what you need:
For Files
result = file_service.list_directories_and_files(share_name, directory_name)
for file_or_dir in result:
    if isinstance(file_or_dir, File):
        # use file_or_dir.name here; there is no separate file_name variable
        file = file_service.get_file_properties(share_name, directory_name, file_or_dir.name, timeout=None, snapshot=None)
        print(file_or_dir.name, file.properties.last_modified)
For Blob
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str="<your_connection_string>", container_name="<your_container_name>")
for blob in container.list_blobs():
    print(f'{blob.name} : {blob.last_modified}')
You can get the keys and account details from the Azure portal for account access. Here, blob.last_modified gives you the last-modified timestamp of each blob, so the latest item is the one with the greatest value.
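To then actually pick the most recent item, a minimal sketch (reusing the container client from the blob snippet above; the Files variant assumes a list of (last_modified, name) tuples like the one built in the question's function):
# Blob Storage: the most recent blob is the one with the greatest last_modified
latest_blob = max(container.list_blobs(), key=lambda b: b.last_modified)
print(latest_blob.name, latest_blob.last_modified)

# Azure Files: given (last_modified, name) tuples, max() compares the dates first
latest_date, latest_name = max(list_files)
print(latest_name, latest_date)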

BatchErrorException: The specified operation is not valid for the current state of the resource

Scenario:
I have a job created in azure batch against a pool. Now, I created another pool and want to point my job to the newly created pool.
I used the azure-batch SDK to write the following piece of code:
import azure.batch.batch_service_client as batch
batch_service_client = batch.BatchServiceClient(credentials, batch_url = account_url)
job_id="LinuxTrainingJob"
pool_id="linux-e6a63ad4-9e52-4b9a-8b09-2a0249802981"
pool_info = batch.models.PoolInformation(pool_id=pool_id)
job_patch_param = batch.models.JobPatchParameter(pool_info=pool_info)
batch_service_client.job.patch(job_id, job_patch_param)
This gives me the following error
BatchErrorException Traceback (most recent call last)
<ipython-input-104-ada32b24d6a0> in <module>
2 pool_info = batch.models.PoolInformation(pool_id=pool_id)
3 job_patch_param = batch.models.JobPatchParameter(pool_info=pool_info)
----> 4 batch_service_client.job.patch(job_id, job_patch_param)
~/anaconda3/lib/python3.8/site-packages/azure/batch/operations/job_operations.py in patch(self, job_id, job_patch_parameter, job_patch_options, custom_headers, raw, **operation_config)
452
453 if response.status_code not in [200]:
--> 454 raise models.BatchErrorException(self._deserialize, response)
455
456 if raw:
BatchErrorException: {'additional_properties': {}, 'lang': 'en-US', 'value': 'The specified operation is not valid for the current state of the resource.\nRequestId:46074112-9a99-4569-a078-30a7f4ad2b91\nTime:2020-10-06T17:52:43.6924378Z'}
The credentials are set above and are working properly as I was able to create pool and jobs using the same client.
Environment details
azure-batch==9.0.0
python 3.8.3
Ubuntu 18.04
To assign a job to another pool, you must first call the disableJob API to drain currently running tasks from the pool. Then you can call updateJob to assign a new poolId to run on. Once it is updated, you can call enableJob to continue the job's execution.
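A minimal sketch of that sequence with the azure-batch Python SDK objects already defined in the question (disable_tasks of 'requeue' is just one option; 'terminate' and 'wait' are also valid, and job.patch() is used here because the question already calls it):
# Drain the job from its current pool; 'requeue' puts running tasks back in the queue
batch_service_client.job.disable(job_id, disable_tasks=batch.models.DisableJobOption.requeue)

# Point the job at the new pool while it is disabled
pool_info = batch.models.PoolInformation(pool_id=pool_id)
batch_service_client.job.patch(job_id, batch.models.JobPatchParameter(pool_info=pool_info))

# Resume execution on the new pool
batch_service_client.job.enable(job_id)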

Use Python to create multiple containers with specific content

I have a single container with around 200k images in my blob storage. I want to write a script in Python that copies batches of 20k of these images to new containers called something like imageset1, imageset2, ..., imageset20 (the last container will have fewer than 20k images in it, which is fine).
I have the following so far:
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj

with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')
        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        # Here is where I want to chunk up the container contents into batches of 20k
        # Then I want to write the above to a set of new containers using, I think, something like this...
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
It's the chunking up of the container contents and writing the results out to new containers that I don't know how to do. Can anyone help?
Here is my sample code to do what you need; it works against my container.
from azure.storage.blob.baseblobservice import BaseBlobService

account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<the source container name>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

blobs = blob_service.list_blobs(container_name)

# The target container index starts with 1
container_index = 1
# The blob number in each new container, such as 3 in my testing
num_per_container = 3
count = 0
# The prefix of the new container name
prefix_of_new_container = 'imageset'
flag_of_new_container = False

for blob in blobs:
    if flag_of_new_container == False:
        flag_of_new_container = blob_service.create_container("%s%d" % (prefix_of_new_container, container_index))
    print(blob.name, "%s%d" % (prefix_of_new_container, container_index))
    blob_service.copy_blob("%s%d" % (prefix_of_new_container, container_index), blob.name, "https://%s.blob.core.windows.net/%s/%s" % (account_name, container_name, blob.name))
    count += 1
    if count == num_per_container:
        container_index += 1
        count = 0
        flag_of_new_container = False
Note: I use BaseBlobService only because it's enough for your needs, even for AppendBlob or PageBlob. You can also use BlockBlobService instead.
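If you are on the newer azure-storage-blob v12 SDK instead of the older SDK used above, a rough sketch of the same batching loop might look like this (the connection string and container names are placeholders; start_copy_from_url performs a server-side copy just like copy_blob):
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your_connection_string>")
source_container = "<the source container name>"
prefix_of_new_container = "imageset"
num_per_container = 20000

container_index = 1
count = 0
target = service.create_container(f"{prefix_of_new_container}{container_index}")

for blob in service.get_container_client(source_container).list_blobs():
    source_url = f"https://{service.account_name}.blob.core.windows.net/{source_container}/{blob.name}"
    # server-side (asynchronous) copy of each blob into the current target container
    target.get_blob_client(blob.name).start_copy_from_url(source_url)
    count += 1
    if count == num_per_container:
        container_index += 1
        count = 0
        target = service.create_container(f"{prefix_of_new_container}{container_index}")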

Problems with Azure Databricks opening a file on the Blob Storage

With Azure Databricks I'm able to list the files in the blob storage and get them in an array. But when I try to open one of the files I get an error, probably due to the special syntax.
storage_account_name = "tesb"
storage_container_name = "rttracking-in"
storage_account_access_key = "xyz"
file_location = "wasbs://rttracking-in"
file_type = "xml"
spark.conf.set(
    "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
    storage_account_access_key)
xmlfiles = dbutils.fs.ls("wasbs://"+storage_container_name+"#"+storage_account_name+".blob.core.windows.net/")
import pandas as pd
import xml.etree.ElementTree as ET
import re
import os
firstfile = xmlfiles[0].path
root = ET.parse(firstfile).getroot()
The error is
IOError: [Errno 2] No such file or directory: u'wasbs://rttracking-in#tstoweuyptoesb.blob.core.windows.net/rtTracking_00001.xml'
My guess is that ET.parse() does not know about the Spark context in which you have set up the connection to the storage account. Alternatively, you can try to mount the storage; then you can access the files through native paths, as if they were local.
See here: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-an-azure-blob-storage-container
This should work then:
root = ET.parse("/mnt/<mount-name>/...")
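A minimal sketch of such a mount, following the linked documentation and reusing the variables from the question (the mount point name is a placeholder):
dbutils.fs.mount(
    source = "wasbs://" + storage_container_name + "@" + storage_account_name + ".blob.core.windows.net",
    mount_point = "/mnt/rttracking-in",
    extra_configs = {"fs.azure.account.key." + storage_account_name + ".blob.core.windows.net": storage_account_access_key})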
I mounted the Storage, and then this does the trick:
firstfile = xmlfiles[0].path.replace('dbfs:','/dbfs')
root = ET.parse(firstfile).getroot()
