I'm using the Azure CV module to process images; so far I have only used local images or images freely available on the web. But now I need to use the images I have stored in a storage account container.
I don't see how to do this in the documentation. E.g., this code allows using local images:
import os
import sys
import requests
# If you are using a Jupyter notebook, uncomment the following line.
# %matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
# Add your Computer Vision subscription key and endpoint to your environment variables.
if 'COMPUTER_VISION_SUBSCRIPTION_KEY' in os.environ:
    subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
else:
    print("\nSet the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.\n**Restart your shell or IDE for changes to take effect.**")
    sys.exit()
if 'COMPUTER_VISION_ENDPOINT' in os.environ:
    endpoint = os.environ['COMPUTER_VISION_ENDPOINT']
analyze_url = endpoint + "vision/v3.0/analyze"
# Set image_path to the local path of an image that you want to analyze.
# Sample images are here, if needed:
# https://github.com/Azure-Samples/cognitive-services-sample-data-files/tree/master/ComputerVision/Images
image_path = "C:/Documents/ImageToAnalyze.jpg"
# Read the image into a byte array
image_data = open(image_path, "rb").read()
headers = {'Ocp-Apim-Subscription-Key': subscription_key,
           'Content-Type': 'application/octet-stream'}
params = {'visualFeatures': 'Categories,Description,Color'}
response = requests.post(
    analyze_url, headers=headers, params=params, data=image_data)
response.raise_for_status()
# The 'analysis' object contains various fields that describe the image. The most
# relevant caption for the image is obtained from the 'description' property.
analysis = response.json()
print(analysis)
image_caption = analysis["description"]["captions"][0]["text"].capitalize()
# Display the image and overlay it with the caption.
image = Image.open(BytesIO(image_data))
plt.imshow(image)
plt.axis("off")
_ = plt.title(image_caption, size="x-large", y=-0.1)
plt.show()
This other one uses images from the web:
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))
remote_image_url = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/landmark.jpg"
'''
Describe an Image - remote
This example describes the contents of an image with the confidence score.
'''
print("===== Describe an image - remote =====")
# Call API
description_results = computervision_client.describe_image(remote_image_url)
# Get the captions (descriptions) from the response, with confidence level
print("Description of remote image: ")
if (len(description_results.captions) == 0):
    print("No description detected.")
else:
    for caption in description_results.captions:
        print("'{}' with confidence {:.2f}%".format(caption.text, caption.confidence * 100))
And this one reads data from a storage container:
from azure.storage.blob import BlobClient
blob = BlobClient.from_connection_string(conn_str="my_connection_string", container_name="my_container", blob_name="my_blob")
with open("./BlockDestination.txt", "wb") as my_blob:
blob_data = blob.download_blob()
blob_data.readinto(my_blob)
But I don't see how to make the connection between the storage container and the Computer Vision service.
Two simple options:
Not recommended: Set your blob container to "public" and simply use the full blob URLs as you would any other public URL.
Recommended: Construct SAS tokens for your files in blob storage. Append them to the full blob URL to create a "temporary private download link" that can be used to download the file as if it were public. You can also build the link outside of the CV service if you face any issues there.
A full blob URL with a SAS token should look something like this:
https://storagesamples.blob.core.windows.net/sample-container/blob1.txt?se=2019-08-03&sp=rw&sv=2018-11-09&sr=b&skoid=<skoid>&sktid=<sktid>&skt=2019-08-02T22%3A32%3A01Z&ske=2019-08-03T00%3A00%3A00Z&sks=b&skv=2018-11-09&sig=<signature>
https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/storage/azure-storage-blob/samples/blob_samples_authentication.py#L110
# Instantiate a BlobServiceClient using a connection string
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(self.connection_string)
# [START create_sas_token]
# Create a SAS token to use to authenticate a new client
from datetime import datetime, timedelta
from azure.storage.blob import ResourceTypes, AccountSasPermissions, generate_account_sas
sas_token = generate_account_sas(
    blob_service_client.account_name,
    account_key=blob_service_client.credential.account_key,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1)
)
# [END create_sas_token]
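If you only need access to a single blob, a blob-level SAS can also be generated and appended to the blob URL; a minimal sketch, assuming placeholder account, key, container and blob names:

from datetime import datetime, timedelta
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Placeholder values - replace with your own
account_name = "mystorageaccount"
account_key = "<storage-account-key>"
container_name = "my_container"
blob_name = "ImageToAnalyze.jpg"

# Read-only token valid for one hour
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# The resulting URL can be passed to the Computer Vision APIs like any public image URL
blob_url_with_sas = "https://{}.blob.core.windows.net/{}/{}?{}".format(
    account_name, container_name, blob_name, sas_token)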
If you check the sample:
from azure.storage.blob import BlobClient
blob = BlobClient.from_connection_string(conn_str="my_connection_string", container_name="my_container", blob_name="my_blob")
with open("./BlockDestination.txt", "wb") as my_blob:
blob_data = blob.download_blob()
blob_data.readinto(my_blob)
all you need to do is get a byte array with the blob contents rather than reading a local file. Instead of
# Read the image into a byte array
image_data = open(image_path, "rb").read()
you should
# Read the blob contents into a byte array
image_data = blob.download_blob().readall()
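Putting it together, here is a rough sketch of the whole flow, reusing the environment variables from your local-image example (the connection string, container and blob names are placeholders):

import os
import requests
from azure.storage.blob import BlobClient

# Placeholder connection string / names - replace with your own
blob = BlobClient.from_connection_string(conn_str="<my_connection_string>",
                                         container_name="my_container",
                                         blob_name="ImageToAnalyze.jpg")

# Download the blob contents into memory as a byte array
image_data = blob.download_blob().readall()

# Post the bytes to the Computer Vision analyze endpoint, exactly as for a local file
subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
endpoint = os.environ['COMPUTER_VISION_ENDPOINT']
analyze_url = endpoint + "vision/v3.0/analyze"
headers = {'Ocp-Apim-Subscription-Key': subscription_key,
           'Content-Type': 'application/octet-stream'}
params = {'visualFeatures': 'Categories,Description,Color'}
response = requests.post(analyze_url, headers=headers, params=params, data=image_data)
response.raise_for_status()
print(response.json())

This avoids SAS tokens entirely, because the image bytes are sent in the request body rather than fetched by the service from a URL.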
Related
I am trying to read a PDF in AWS Lambda. The PDF is stored in an S3 bucket. I need to extract the text from the PDF and translate it into any required language. I am able to run my code in my notebook, but when I run it on Lambda I get this error message in my CloudWatch logs: Task timed out after 3.01 seconds.
import fitz
import base64
from io import BytesIO
from PIL import Image
import boto3
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client_textract = boto3.client('textract')
    translate_client = boto3.client('translate')
    try:
        print("Inside handler")
        s3_bucket = "my_bucket"
        pdf_file_name = 'sample.pdf'
        pdf_file = s3.get_object(Bucket=s3_bucket, Key=pdf_file_name)
        file_content = pdf_file['Body'].read()
        print("Before reading ")
        with fitz.open(stream=file_content, filetype="pdf") as doc:
Try to extend the timeout, which by default is set at 3 sec.
If that does not help, try to increase the allocated memory.
Also, you may consider pushing
s3 = boto3.client('s3')
client_textract = boto3.client('textract')
translate_client = boto3.client('translate')
out of your handler. Put it right after the imports; the function will run more efficiently on frequent invocations, as sketched below.
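For illustration, a minimal sketch of that restructuring (the bucket, key and returned payload are placeholders; the timeout and memory themselves are raised in the Lambda configuration, not in code):

import boto3
import fitz

# Created once per execution environment and reused across invocations
s3 = boto3.client('s3')
client_textract = boto3.client('textract')
translate_client = boto3.client('translate')

def lambda_handler(event, context):
    # Placeholder bucket/key - replace with your own
    pdf_file = s3.get_object(Bucket="my_bucket", Key="sample.pdf")
    file_content = pdf_file['Body'].read()
    with fitz.open(stream=file_content, filetype="pdf") as doc:
        text = "".join(page.get_text() for page in doc)
    return {"statusCode": 200, "body": text[:1000]}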
The API receives the file, then tries to create a unique blob name.
Then I upload to the blob in chunks of 4 MB. Each chunk takes about 8 seconds; is this normal? My upload speed is 110 Mbps. I tried uploading a 50 MB file and it took almost 2 minutes. I don't know if the azure-storage-blob version is related to this; I'm using azure-storage-blob==12.14.1.
import os
import time
import uuid
from azure.storage.blob import BlobClient, BlobBlock, BlobServiceClient
from flask import request

@catalog_api.route("/catalog", methods=['POST'])
def catalog():
    file = request.files['file']
    url_bucket, file_name, file_type = upload_to_blob(file)

def upload_to_blob(self, file):
    file_name = file.filename
    file_type = file.content_type
    blob_client = self.generate_blob_client(file_name)
    blob_url = self.upload_chunks(blob_client, file)
    return blob_url, file_name, file_type

def generate_blob_client(self, file_name: str):
    blob_service_client = BlobServiceClient.from_connection_string(self.connection_string)
    container_client = blob_service_client.get_container_client(self.container_name)
    for _ in range(self.max_blob_name_tries):
        blob_name = self.generate_blob_name(file_name)
        blob_client = container_client.get_blob_client(blob_name)
        if not blob_client.exists():
            return blob_client
    raise Exception("Couldn't create the blob")

def upload_chunks(self, blob_client: BlobClient, file):
    block_list = []
    chunk_size = self.chunk_size
    while True:
        read_data = file.read(chunk_size)
        if not read_data:
            print("uploaded")
            break
        print("uploading")
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=read_data)
        block_list.append(BlobBlock(block_id=blk_id))
    blob_client.commit_block_list(block_list)
    return blob_client.url
I tried this in my environment and got the results below:
Uploading a 50 MB file to a blob storage account with a chunk size of 4*1024*1024, from my local environment to the storage account, took 45 secs.
Code:
import uuid
from azure.storage.blob import BlobBlock, BlobServiceClient
import time
connection_string="<storage account connection string >"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client("file.pdf")
start=time.time()
#upload data
block_list=[]
chunk_size=4*1024*1024
with open("C:\\file.pdf",'rb') as f:
while True:
read_data = f.read(chunk_size)
if not read_data:
break # done
blk_id = str(uuid.uuid4())
blob_client.stage_block(block_id=blk_id,data=read_data)
block_list.append(BlobBlock(block_id=blk_id))
blob_client.commit_block_list(block_list)
end=time.time()
print("Time taken to upload blob:", end - start, "secs")
In the above code, I record the start and end times around the upload and use end - start to measure how long the file takes to upload to Blob Storage.
Make sure your internet speed is good; I also tried with some other internet connections, and it took at most 78 secs.
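As a side note, a sketch of an alternative I did not time here: upload_blob with max_concurrency lets the SDK split the file into blocks and upload several of them in parallel, which is usually faster than staging blocks one at a time (same placeholder connection string and file as above):

from azure.storage.blob import BlobServiceClient

connection_string = "<storage account connection string >"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_container_client('test').get_blob_client("file.pdf")

with open("C:\\file.pdf", 'rb') as f:
    # The SDK chunks the stream and uploads up to 4 blocks concurrently
    blob_client.upload_blob(f, overwrite=True, max_concurrency=4)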
I am trying to take a screenshot of a URL, but it captures the login gateway instead because of restricted access. So I tried adding an ID and password to open the link, but it does not work for some reason. Could you help?
import requests
import urllib.parse
import getpass

BASE = 'https://mini.s-shot.ru/1024x0/JPEG/1024/Z100/?'  # we can modify size, format, zoom as needed
url = 'https://mail.google.com/mail/'  # or whatever link you need
url = urllib.parse.quote_plus(url)
print(url)
Id = "XXXXXX"
key = getpass.getpass('Password :: ')
path = 'target1.jpg'
response = requests.get(BASE + url + Id + key, stream=True)
if response.status_code == 200:
    with open(path, 'wb') as file:
        for chunk in response:
            file.write(chunk)
Thanks!
I'm new to AWS. How do I set the path of my bucket and access a file in that bucket?
Is there anything I need to change with the prefix?
import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role
role = get_execution_role()
region = boto3.Session().region_name
bucket='ltfs1' # Replace with your s3 bucket name
prefix = 'sagemaker/ltfs1' # Used as part of the path in the bucket where you store data
# bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket) # The URL to access the bucket
I'm using the above code, but it shows a "file not found" error.
If the file you are accessing is in the root directory of your s3 bucket, you can access the file like this:
import pandas as pd
bucket='ltfs1'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
training_data = pd.read_csv(data_location)
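If the file instead sits under a prefix (a "folder" in the S3 console), include the prefix in the key; a minimal sketch, assuming a hypothetical data.csv under the sagemaker/ltfs1 prefix:

import pandas as pd

bucket = 'ltfs1'
prefix = 'sagemaker/ltfs1'
data_key = 'data.csv'

# The prefix is simply part of the object key in the s3:// path
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)
training_data = pd.read_csv(data_location)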
You need to use "sage.session.s3_input" to specify the location of s3 bucket where the training data is present.
Below is sample code:
import sagemaker as sage
from sagemaker import get_execution_role
role = get_execution_role()
sess = sage.Session()
bucket= 'dev.xxxx.sagemaker'
prefix="EstimatorName"
s3_training_file_location = "s3://{}/csv".format(bucket)
data_location_config = sage.session.s3_input(s3_data=s3_training_file_location, content_type="csv")
output_path="s3://{}/{}".format(bucket,prefix)
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/CustomEstimator:latest'.format(account, region)
print(image)
# xxxxxx.dkr.ecr.us-east-1.amazonaws.com/CustomEstimator:latest
tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.c4.2xlarge',
                                base_job_name='CustomJobName',
                                code_location=output_path,
                                output_path=output_path,
                                sagemaker_session=sess)
tree.fit(data_location_config)
import boto3
import cv2
import numpy as np
s3 = boto3.resource('s3')
vid = (s3.Object('bucketname', 'video.blob').get()['Body'].read())
cap = cv2.VideoCapture(vid)
This is my code. I have a video file in an S3 bucket. I want to do some processing on it with OpenCV, and I don't want to download it. So I'm trying to store that video file in vid. Now the problem is that type(vid) is bytes, which results in this error on line 6: TypeError: an integer is required (got type bytes). I tried converting it into an integer or a string but was unable to.
On an attempt to convert the bytes to an integer: I referred to this and was getting length issues. This is just a sample video file; the actual file I want to process will be huge when converted to a bytes object.
On an attempt to get the object as a string and then convert it to an integer: I referred to this. Even this doesn't seem to work for me.
If anyone can help me solve this issue, I will be grateful. Please comment if anything is unclear about my issue and I'll try to provide more details.
If streaming the video from a URL is an acceptable solution, I think that is the easiest approach. You just need to generate a URL to read the video from.
import boto3
import cv2
s3_client = boto3.client('s3')
bucket = 'bucketname'
key = 'video.blob'
url = s3_client.generate_presigned_url('get_object',
                                       Params={'Bucket': bucket, 'Key': key},
                                       ExpiresIn=600)  # this url will be available for 600 seconds
cap = cv2.VideoCapture(url)
ret, frame = cap.read()
You should see that you are able to read and process frames from that URL.
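For example, a minimal frame-reading loop over that capture could look like this (the grayscale conversion is just a placeholder for your own processing):

while True:
    ret, frame = cap.read()
    if not ret:
        break  # stream ended or no more frames
    # Placeholder processing step - replace with your own OpenCV logic
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
cap.release()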
Refer to the useful code snippets below to perform various operations on an S3 bucket.
import boto3
s3 = boto3.resource('s3', region_name='us-east-2')
Listing buckets in S3:
for bucket in s3.buckets.all():
    print(bucket.name)
Creating a bucket in S3:
my_bucket = s3.create_bucket(Bucket='Bucket Name', CreateBucketConfiguration={
    'LocationConstraint': 'us-east-2'
})
Listing objects inside a bucket:
my_bucket = s3.Bucket('Bucket Name')
for file in my_bucket.objects.all():
    print(file.key)
Uploading a file from the current directory:
import os
print(os.getcwd())
fileName = "B01.jpg"
bucketName = "Bucket Name"
s3.meta.client.upload_file(fileName, bucketName, 'test2.txt')
Reading an image/video from a bucket:
import matplotlib.pyplot as plt
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('Bucket Name')  # bucket name
object = bucket.Object('maisie_williams.jpg')  # image name
object.download_file('B01.jpg')  # download the image with this name
img = plt.imread('B01.jpg')  # read the downloaded image
imgplot = plt.imshow(img)  # plot the image
plt.show()
Reading from one bucket and then dumping to another:
import boto3
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('Bucket Name')  # source bucket name
object = bucket.Object('maisie_williams.jpg')  # image name
object.download_file('B01.jpg')
fileName = "B01.jpg"
bucketName = "Bucket Name"  # destination bucket name
s3.meta.client.upload_file(fileName, bucketName, 'testz.jpg')
If you have access keys, then you can probably do the following:
import boto3
import pandas as pd

keys = pd.read_csv('accessKeys.csv')
# Creating a session for S3 buckets
session = boto3.session.Session(aws_access_key_id=keys['Access key ID'][0],
                                aws_secret_access_key=keys['Secret access key'][0])
s3 = session.resource('s3')
buck = s3.Bucket('Bucket Name')