Azure Blob Storage not listing blobs - azure

I am having trouble listing blobs from a specific container.
I am using the official sample code, in Python, to list them:
from azure.storage.blob import BlockBlobService

account_name = 'xxxx'
account_key = 'xxxx'
container_name = 'yyyyyy'

block_blob_service = BlockBlobService(account_name=account_name,
                                      account_key=account_key)

print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
I have received the error:
raise AzureException(ex.args[0])
AzureException: can only concatenate str (not "tuple") to str
The versions of the Azure Storage-related packages installed are:
azure-mgmt-storage 2.0.0
azure-storage-blob 1.4.0
azure-storage-common 1.4.0

I tried to run the same code as yours with my account, and it works fine without any issue. Then, based on the error information, I tried to reproduce the error, as below.
Test 1. When I ran the code '123' + ('A', 'B') in Python 3.7, I got the same error.
Test 2. When I ran the same code in Python 3.6, the error message was different.
Test 3. In Python 2 (on WSL), the error is the same as in Python 3.7.
So I guess you were running your code on Python 3.7 or Python 2, and the issue was caused by using the + operator to concatenate a string with a tuple somewhere else in your code. Please check your code carefully, or update your post with more debugging details (the line number and the code on it) to help with the analysis.
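For illustration, here is a minimal reproduction of that message in a Python 3.7 interpreter session (environment assumed):
>>> '123' + ('A', 'B')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "tuple") to str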

python-docx is not able to be imported even though the library is installed

System:
MacBook Pro (13-inch, 2020, Two Thunderbolt 3 ports)
Processor 1.4 GHz Quad-Core Intel Core i5
Memory 16GB
OS macOS Monterey version 12.6.1
I'm still fairly new to Python and just learned about the docx library.
I get the following error in Visual Studio Code about the docx library.
Import "docx" could not be resolved.
When I check to see the version installed I get the following:
pip3 show python-docx
Name: python-docx
Version: 0.8.11
I am able to create Word documents with one Python script even with the import issue. However, I've tried to create one using a table and this is what is causing me issues. I'm not sure if the import issue is the root cause.
When I run my script I get the following:
python3 test.py
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/docx/styles/styles.py:139: UserWarning: style lookup by style_id is deprecated. Use style name as key instead.
return self._get_style_id_from_style(self[style_name], style_type)
The code causing the error:
# Add review question table
table = document.add_table(rows=1, cols=3)
table.style = 'TableGrid'
In researching I found I may need to import the following:
from docx.oxml.table import CT_TableStyle
And add the following to that section:
# Add review question table
table = document.add_table(rows=1, cols=3)
style = CT_TableStyle()
style.name = 'TableGrid'
table.style = style
I now get the following warning:
Import "docx.oxml.table" could not be resolved
And the following error when running the script:
line 2, in <module>
from docx.oxml.table import CT_TableStyle
ImportError: cannot import name 'CT_TableStyle' from 'docx.oxml.table' (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/docx/oxml/table.py)
I also created a virtual environment and still have the same issues. If you need additional details, just let me know what to provide.
Kind regards,
Marcus
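For what it's worth, here is a minimal sketch based on the deprecation warning quoted above, which suggests looking the style up by its name rather than by style_id: in python-docx the built-in grid style is named 'Table Grid' (with a space), while 'TableGrid' is its style_id. This is an assumption about the intended style, not a confirmed fix for the import warning:
# Hypothetical fix: reference the style by its name ('Table Grid'), not its style_id ('TableGrid')
from docx import Document

document = Document()

# Add review question table
table = document.add_table(rows=1, cols=3)
table.style = 'Table Grid'

document.save('review_questions.docx')  # output file name is just an example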

.jpg file not loading in databricks from blob storage (Azure data lake)

I have the .jpg pictures in the data lake in my blob storage. I am trying to load the pictures and display them for testing purposes but it seems like they can't be loaded properly. I tried a few solutions but none of them showed the pictures.
path = 'abfss://dev@storage.dfs.core.windows.net/username/project_name/test.jpg'
spark.read.format("image").load(path)                    # Method 1
display(spark.read.format("binaryFile").load(path))      # Method 2
Method 1 returned the output below, which looks like a binary payload (the jpg converted to binary), so I tried method 2, but that did not load anything either.
Out[51]: DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>]
For method 2, I got this error when I ran it.
SecurityException: Your administrator has forbidden Scala UDFs from being run on this cluster
I cannot install the libraries very easily. It needs to be reviewed and approved by the administrators first so please suggest something with spark and/or python libraries if possible. Thanks
Edit:
I added these lines and it looks like the image has been read, but it cannot be displayed for some reason. I am not sure what's going on. The goal is to read and eventually decode the pictures, but that cannot happen until a picture is loaded properly.
df = spark.read.format("image").load(path)
df.show()
df.printSchema()
I tried to reproduce the same in my environment by loading the dataset into Databricks, and got the results below:
Mount an Azure Data Lake Storage Gen2 Account in Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "xxxxxxxxx",
"fs.azure.account.oauth2.client.secret": "xxxxxxxxx",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/xxxxxxxxx/oauth2/v2.0/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<folder_name>",
mount_point = "/mnt/<folder_name>",
extra_configs = configs)
Or
Mount storage account with databricks using access key:
dbutils.fs.mount(
    source = "wasbs://<container>@<Storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/df1/",
    extra_configs = {"fs.azure.account.key.vamblob.blob.core.windows.net": "<access_key>"})
Now, using the code below, I got these results.
sample_img_dir= "/mnt/df1/"
image_df = spark.read.format("image").load(sample_img_dir)
display(image_df)
Update: alternatively, set the storage account access key directly in the Spark configuration:
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", "<access_key>")
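With the account key set this way, the images can presumably also be loaded directly over abfss without mounting; the container, account, and folder names below are placeholders:
# Read images straight from ADLS Gen2 using the abfss path (no mount required)
image_df = spark.read.format("image").load("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder_name>/")
display(image_df)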
Reference:
Mounting Azure Data Lake Storage Gen2 Account in Databricks By Ron L'Esteve.
Finally, after spending hours on this, I found a solution which was pretty straightforward but drove me crazy. I was on the right path but needed to read and display the pictures using the "binaryFile" format in Spark. Here is what worked for me.
## importing libraries
import io
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

## Directory
path_dir = 'abfss://dev@container.dfs.core.windows.net/username/projectname/a.jpg'

## Reading the images
df = spark.read.format("binaryFile").load(path_dir)

## Selecting the path and content
df = df.select('path', 'content')

## Taking out the image
image_list = df.select("content").rdd.flatMap(lambda x: x).collect()
image = mpimg.imread(io.BytesIO(image_list[0]), format='jpg')
plt.imshow(image)
It looks like binaryFile is the right format, at least in this case, and the code above was able to decode the image successfully.

Read outlook emails in databricks

I would like to read emails from Microsoft Outlook using Python and run the script on a Databricks cluster.
I'm using win32com on my local machine and am able to read emails. However, when I try to install the same package on Databricks, it throws an error saying:
DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pywin32, --disable-pip-version-check) exited with code 1. ERROR: Could not find a version that satisfies the requirement pywin32
ERROR: No matching distribution found for pywin32
The sample code is as follows:
import win32com.client
import pandas as pd

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI").Folders

emails_list = [
    'xyz@outlook.com'
]

subjects = []
categories = []
body_content = []
names = []

for id, name in enumerate(emails_list):
    folder = outlook(name)
    #print('Accessing email - ' , folder)
    inbox = folder.Folders("Inbox")
    message = inbox.Items
    message = message.GetFirst()
    body_content.append(message.Body)
    subjects.append(message.Subject)
    categories.append(message.Categories)
    names.append(name)

df = pd.DataFrame(list(zip(names, subjects, categories, body_content)),
                  columns=['names', 'subjects', 'categories', 'body_content'])
df.head(3)
Databricks clusters run Linux (specifically, Ubuntu), so you can't use a COM library that is designed for Windows. Potentially you can access your emails in Office 365 using the IMAP protocol, or something like that (see docs). Python has a built-in imaplib library that could be used for that purpose, for example, as in the following article.
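A minimal sketch of that IMAP approach, assuming IMAP access is enabled for the mailbox and that outlook.office365.com is the correct host (both are assumptions, and the credentials are placeholders):
import imaplib
import email

# Connect to the Outlook / Office 365 IMAP endpoint (host assumed)
mail = imaplib.IMAP4_SSL("outlook.office365.com")
mail.login("xyz@outlook.com", "<password or app password>")
mail.select("Inbox")

# Fetch one message, roughly mirroring message.GetFirst() in the code above
status, data = mail.search(None, "ALL")
first_id = data[0].split()[0]
status, msg_data = mail.fetch(first_id, "(RFC822)")
message = email.message_from_bytes(msg_data[0][1])
print(message["Subject"])

mail.logout()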

how to create subtitles from aws transcribe

I'm using the AWS SDK for Python (boto3) and want to set the subtitle output format (i.e. SRT). When I use this code, I get the error below, which says the Subtitles parameter is not valid, but according to the AWS documentation I should be able to pass values in this parameter.
import time
import boto3

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
transcribe = boto3.client('transcribe', aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY, region_name=region_name)

job_name = "kateri1234"
job_uri = "s3://transcribe-upload12/english.mp4"

transcribe.start_transcription_job(TranscriptionJobName=job_name,
                                   Media={'MediaFileUri': job_uri},
                                   MediaFormat='mp4',
                                   LanguageCode='en-US',
                                   Subtitles={'Formats': ['vtt']},
                                   OutputBucketName="transcribe-output12")

while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)
The error I get is: Unknown parameter in input: "Subtitles", must be one of: TranscriptionJobName, LanguageCode, MediaSampleRateHertz, MediaFormat, Media, OutputBucketName, OutputEncryptionKMSKeyId, Settings
I referred to the AWS documentation.
I have faced a similar issue, and after some research, I have found out it is because of my boto3 and botocore versions.
I have upgraded these two packages, and it worked. My requirements.txt for these two packages:
boto3==1.20.0
botocore==1.23.54
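For example, the upgrade can be done in place with pip (the pins match the requirements above):
pip install --upgrade boto3==1.20.0 botocore==1.23.54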
P.S.: Remember to check that these two new versions are compatible with your other Python packages, especially if you are using other AWS libraries like awsebcli. To make sure everything works together, run this command after upgrading these two libraries to check for errors:
pip check

reading a csv file from azure blob storage with PySpark

I'm trying to do a machine learning project using a PySpark HDInsight cluster on Microsoft Azure. To operate on my cluster I use a Jupyter notebook. Also, I have my data (a csv file) stored in Azure Blob storage.
According to the documentation the syntax of the path to my file is:
path = 'wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
However, when I try to read the csv file with the following command:
csvFile = spark.read.csv(path, header=True, inferSchema=True)
I get the following error:
'java.net.URISyntaxException: Illegal character in scheme name at index 4: wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
Any ideas on how to fix this?
It is either (unencrypted):
wasb://...
or (encrypted):
wasbs://...
not
wasb[s]://...
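So, applying that to the path from the question (using wasbs, assuming the storage endpoint requires encryption), the read would presumably look like:
path = 'wasbs://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
csvFile = spark.read.csv(path, header=True, inferSchema=True)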
