I'm new to AWS. How do I set the path of my bucket and access a file in that bucket?
Is there anything I need to change with the prefix?
import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role
role = get_execution_role()
region = boto3.Session().region_name
bucket='ltfs1' # Replace with your s3 bucket name
prefix = 'sagemaker/ltfs1' # Used as part of the path in the bucket where you store data
# bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket) # The URL to access the bucket
I'm using the above code, but it shows a file-not-found error.
If the file you are accessing is in the root directory of your S3 bucket, you can access the file like this:
import pandas as pd
bucket='ltfs1'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
training_data = pd.read_csv(data_location)
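If the file instead lives under a prefix (such as the sagemaker/ltfs1 prefix from the question), include the prefix in the key. A minimal sketch, assuming a file named data.csv under that prefix (pandas needs the s3fs package installed to read s3:// paths directly):
import pandas as pd

bucket = 'ltfs1'
prefix = 'sagemaker/ltfs1'   # folder-like path inside the bucket
data_key = 'data.csv'        # file name under the prefix

# resolves to s3://ltfs1/sagemaker/ltfs1/data.csv
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)
training_data = pd.read_csv(data_location)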
You need to use sage.session.s3_input to specify the location of the S3 bucket where the training data is present.
Below is sample code:
import sagemaker as sage
from sagemaker import get_execution_role
role = get_execution_role()
sess = sage.Session()
bucket= 'dev.xxxx.sagemaker'
prefix="EstimatorName"
s3_training_file_location = "s3://{}/csv".format(bucket)
data_location_config = sage.session.s3_input(s3_data=s3_training_file_location, content_type="csv")
output_path="s3://{}/{}".format(bucket,prefix)
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/CustomEstimator:latest'.format(account, region)
print(image)
# xxxxxx.dkr.ecr.us-east-1.amazonaws.com/CustomEstimator:latest
tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.c4.2xlarge',
                                base_job_name='CustomJobName',
                                code_location=output_path,
                                output_path=output_path,
                                sagemaker_session=sess)
tree.fit(data_location_config)
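As a side note, if you are on v2 of the SageMaker Python SDK, sage.session.s3_input has been replaced; to the best of my knowledge the equivalent is sagemaker.inputs.TrainingInput. A rough sketch of the same channel definition under that assumption:
from sagemaker.inputs import TrainingInput

# SageMaker Python SDK v2 spelling of the same training channel
data_location_config = TrainingInput(
    s3_data="s3://dev.xxxx.sagemaker/csv",
    content_type="csv"
)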
I am trying to read a PDF in AWS Lambda. The PDF is stored in an S3 bucket. I need to extract the text from the PDF and translate it into any required language. I am able to run my code in my notebook, but when I run it on Lambda I get this error message in my CloudWatch logs: task timed out after 3.01 seconds.
import fitz
import base64
from io import BytesIO
from PIL import Image
import boto3
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client_textract = boto3.client('textract')
    translate_client = boto3.client('translate')
    try:
        print("Inside handler")
        s3_bucket = "my_bucket"
        pdf_file_name = 'sample.pdf'
        pdf_file = s3.get_object(Bucket=s3_bucket, Key=pdf_file_name)
        file_content = pdf_file['Body'].read()
        print("Before reading ")
        with fitz.open(stream=file_content, filetype="pdf") as doc:
Try extending the timeout, which by default is set to 3 seconds.
If that does not help, try increasing the allocated memory.
Also, you may consider moving
s3 = boto3.client('s3')
client_textract = boto3.client('textract')
translate_client = boto3.client('translate')
out of your handler and putting these lines right after the imports; the function will run more efficiently on frequent invocations.
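For illustration, a sketch of the handler restructured that way; the text-extraction and translation calls at the end are only an assumption about what the truncated part of your code does (the bucket, key, and target language are placeholders):
import boto3
import fitz

# Clients created once per container, outside the handler,
# so warm invocations reuse them instead of recreating them.
s3 = boto3.client('s3')
client_textract = boto3.client('textract')
translate_client = boto3.client('translate')

def lambda_handler(event, context):
    s3_bucket = "my_bucket"
    pdf_file_name = 'sample.pdf'

    pdf_file = s3.get_object(Bucket=s3_bucket, Key=pdf_file_name)
    file_content = pdf_file['Body'].read()

    # Extract the text with PyMuPDF
    with fitz.open(stream=file_content, filetype="pdf") as doc:
        text = "".join(page.get_text() for page in doc)

    # Translate the extracted text (Amazon Translate limits the text size per call,
    # so a real implementation would chunk it)
    result = translate_client.translate_text(
        Text=text[:5000],
        SourceLanguageCode='auto',
        TargetLanguageCode='es'
    )
    return result['TranslatedText']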
I am new to Python and AWS Glue.
I am trying to merge a few Excel files from an S3 source bucket and generate one output file (CSV) in a target S3 bucket. I am able to read and generate the output file with the merged data, but the only problem is that the header from each file is repeated.
Can someone help me debug this and remove the repeating headers?
Below is my code:
import pandas as pd
import glob
import xlrd
import openpyxl
import boto3
import io
import json
import os
from io import StringIO
import numpy as np
s3 = boto3.resource('s3')
bucket = s3.Bucket('test bucket')
prefix_objs = bucket.objects.filter(Prefix='source/file')
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    print(key)
    temp = pd.read_excel(obj.get()['Body'], encoding='utf8')
    prefix_df.append(temp)
bucket = 'test bucket'
csv_buffer = StringIO()
for current_df in prefix_df:
    current_df.to_csv(csv_buffer, index=None)
    print(current_df)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'merge.csv').put(Body=csv_buffer.getvalue())
Please help!
Regards,
Vijay
Change this line to add the header parameter:
temp = pd.read_excel(obj.get()['Body'], encoding='utf8')
to
temp = pd.read_excel(obj.get()['Body'], encoding='utf8', header=1)
or
temp = pd.read_excel(obj.get()['Body'], encoding='utf8', skiprows=1)
You need to test the header value, because sometimes the header does not start in the first row.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
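For what it's worth, another common way to avoid the repeated headers entirely is to concatenate the frames and write the CSV once. A sketch reusing the bucket and prefix from the question (note that newer pandas versions no longer accept the encoding argument of read_excel):
import io
from io import StringIO

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('test bucket')

# Read every source file into its own DataFrame (header row parsed per file)
frames = []
for obj in bucket.objects.filter(Prefix='source/file'):
    body = io.BytesIO(obj.get()['Body'].read())
    frames.append(pd.read_excel(body))

# Concatenate once and write once, so the header appears a single time in the output
merged = pd.concat(frames, ignore_index=True)

csv_buffer = StringIO()
merged.to_csv(csv_buffer, index=False)
s3.Object('test bucket', 'merge.csv').put(Body=csv_buffer.getvalue())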
I'm using the Azure CV module to process images. So far I have only used local images or images freely available on the web, but now I need to use the images I have stored in a storage account container.
I don't see how to do this in the documentation. E.g., this code allows using local images:
import os
import sys
import requests
# If you are using a Jupyter notebook, uncomment the following line.
# %matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
# Add your Computer Vision subscription key and endpoint to your environment variables.
if 'COMPUTER_VISION_SUBSCRIPTION_KEY' in os.environ:
    subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
else:
    print("\nSet the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.\n**Restart your shell or IDE for changes to take effect.**")
    sys.exit()
if 'COMPUTER_VISION_ENDPOINT' in os.environ:
    endpoint = os.environ['COMPUTER_VISION_ENDPOINT']
analyze_url = endpoint + "vision/v3.0/analyze"
# Set image_path to the local path of an image that you want to analyze.
# Sample images are here, if needed:
# https://github.com/Azure-Samples/cognitive-services-sample-data-files/tree/master/ComputerVision/Images
image_path = "C:/Documents/ImageToAnalyze.jpg"
# Read the image into a byte array
image_data = open(image_path, "rb").read()
headers = {'Ocp-Apim-Subscription-Key': subscription_key,
           'Content-Type': 'application/octet-stream'}
params = {'visualFeatures': 'Categories,Description,Color'}
response = requests.post(
    analyze_url, headers=headers, params=params, data=image_data)
response.raise_for_status()
# The 'analysis' object contains various fields that describe the image. The most
# relevant caption for the image is obtained from the 'description' property.
analysis = response.json()
print(analysis)
image_caption = analysis["description"]["captions"][0]["text"].capitalize()
# Display the image and overlay it with the caption.
image = Image.open(BytesIO(image_data))
plt.imshow(image)
plt.axis("off")
_ = plt.title(image_caption, size="x-large", y=-0.1)
plt.show()
This other one uses images from the web:
computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))
remote_image_url = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/landmark.jpg"
'''
Describe an Image - remote
This example describes the contents of an image with the confidence score.
'''
print("===== Describe an image - remote =====")
# Call API
description_results = computervision_client.describe_image(remote_image_url )
# Get the captions (descriptions) from the response, with confidence level
print("Description of remote image: ")
if (len(description_results.captions) == 0):
    print("No description detected.")
else:
    for caption in description_results.captions:
        print("'{}' with confidence {:.2f}%".format(caption.text, caption.confidence * 100))
And this other one reads data from a storage container:
from azure.storage.blob import BlobClient
blob = BlobClient.from_connection_string(conn_str="my_connection_string", container_name="my_container", blob_name="my_blob")
with open("./BlockDestination.txt", "wb") as my_blob:
blob_data = blob.download_blob()
blob_data.readinto(my_blob)
But I don't see how to make the connection between the storage container and the CV service.
Two simple options:
Not recommended: Set your blob container to "public" and simply use the full blob urls as you would use any other public URL.
Recommended: Construct SAS tokens for your files in blob storage. Append them to the full blob URL to create a "temporary private download link" which can be used to download the file as if it was public. You can also build the link outside of the CV service if you face any issues there.
A full blob URL with a SAS token should look something like this:
https://storagesamples.blob.core.windows.net/sample-container/blob1.txt?se=2019-08-03&sp=rw&sv=2018-11-09&sr=b&skoid=<skoid>&sktid=<sktid>&skt=2019-08-02T22%3A32%3A01Z&ske=2019-08-03T00%3A00%3A00Z&sks=b&skv=2018-11-09&sig=<signature>
https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/storage/azure-storage-blob/samples/blob_samples_authentication.py#L110
# Instantiate a BlobServiceClient using a connection string
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(self.connection_string)
# [START create_sas_token]
# Create a SAS token to use to authenticate a new client
from datetime import datetime, timedelta
from azure.storage.blob import ResourceTypes, AccountSasPermissions, generate_account_sas
sas_token = generate_account_sas(
    blob_service_client.account_name,
    account_key=blob_service_client.credential.account_key,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1)
)
# [END create_sas_token]
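To tie the two pieces together, here is a rough sketch (endpoint and subscription_key are assumed to be set as in the question, and the container/blob names are placeholders): generate a blob-level SAS with generate_blob_sas, append it to the blob URL, and pass that URL to the Computer Vision client as if it were a public image URL.
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

blob_service_client = BlobServiceClient.from_connection_string("my_connection_string")

# Blob-level SAS token, read-only, valid for one hour
sas_token = generate_blob_sas(
    account_name=blob_service_client.account_name,
    container_name="my_container",
    blob_name="my_image.jpg",
    account_key=blob_service_client.credential.account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# Full blob URL + SAS token acts like a temporary public download link
sas_url = "https://{}.blob.core.windows.net/{}/{}?{}".format(
    blob_service_client.account_name, "my_container", "my_image.jpg", sas_token)

computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))
description_results = computervision_client.describe_image(sas_url)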
If you check the sample:
from azure.storage.blob import BlobClient
blob = BlobClient.from_connection_string(conn_str="my_connection_string", container_name="my_container", blob_name="my_blob")
with open("./BlockDestination.txt", "wb") as my_blob:
blob_data = blob.download_blob()
blob_data.readinto(my_blob)
all you need to do is get the image bytes from the downloaded blob rather than from a local file. Instead of
# Read the image into a byte array
image_data = open(image_path, "rb").read()
you can use
# Read the blob content into a byte array (readall() returns bytes in azure-storage-blob v12)
image_data = blob.download_blob().readall()
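Putting that together with the REST call from the question's first snippet (analyze_url, headers, and params as defined there; the blob name is a placeholder), a minimal sketch:
import requests
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str="my_connection_string",
                                         container_name="my_container",
                                         blob_name="my_image.jpg")

# Download the blob straight into memory as bytes -- no local file needed
image_data = blob.download_blob().readall()

response = requests.post(analyze_url, headers=headers, params=params, data=image_data)
response.raise_for_status()
analysis = response.json()
print(analysis)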
I want to download files from a particular S3 bucket based on the file's last-modified date.
I have researched how to connect with boto3, and there is plenty of code and documentation available for downloading files without any conditions. I made some pseudocode:
def download_file_s3(bucket_name, modified_date):
    # connect to resource s3
    s3 = boto3.resource('s3', aws_access_key_id='demo', aws_secret_access_key='demo')
    # connect to the desired bucket
    my_bucket = s3.Bucket(bucket_name)
    # Get files
    for file in my_bucket.objects.all():
I want to complete this function: basically, given a modified date, the function should return the files in the S3 bucket for that particular modified date.
Here is a better solution, a function which does this automatically. Just pass in the bucket name and the download path.
from boto3.session import Session
from datetime import date, timedelta
import boto3
import re
def Download_pdf_specifc_date_subfolder(bucket_name, download_path):
    ACCESS_KEY = 'XYZ'
    SECRET_KEY = 'ABC'
    Bucket_name = bucket_name

    # code to create a session
    session = Session(aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)
    s3 = session.resource('s3')
    bucket = s3.Bucket(Bucket_name)

    # code to get yesterday's date
    yesterday = date.today() - timedelta(days=1)
    x = yesterday.strftime('%Y-%m-%d')
    print(x)

    # code to collect the files which need to be downloaded
    files_to_downloaded = []

    # code to take all the files from s3 under a specific bucket
    for fileObject in bucket.objects.all():
        file_name = str(fileObject.key)
        last_modified = str(fileObject.last_modified)
        last_modified = last_modified.split()
        if last_modified[0] == x:
            # Enter the specific prefix in the regex in place of Airports to filter only that particular subfolder
            if re.findall(r"Airports/[a-zA-Z]+", file_name):
                files_to_downloaded.append(file_name)

    # code to download into a specific folder
    for fileObject in bucket.objects.all():
        file_name = str(fileObject.key)
        if file_name in files_to_downloaded:
            print(file_name)
            d_path = download_path + file_name
            print(d_path)
            bucket.download_file(file_name, d_path)
Download_pdf_specifc_date_subfolder(bucket_name,download_path)
Ultimately, the function downloads the matching files into the specified folder.
Here is my test code; it will print the last_modified datetime of objects whose datetime is after the one I set.
import boto3
from datetime import datetime
from datetime import timezone
s3 = boto3.resource('s3')
response = s3.Bucket('<bucket name>').objects.all()
for item in response:
    obj = s3.Object(item.bucket_name, item.key)
    if obj.last_modified > datetime(2019, 8, 1, 0, 0, 0, tzinfo=timezone.utc):
        print(obj.last_modified)
If you have a specific date, then
import boto3
from datetime import datetime, timezone
s3 = boto3.resource('s3')
response = s3.Bucket('<bucket name>').objects.all()
date = '20190827' # input('Insert Date as a form YYYYmmdd')
for item in response:
    obj = s3.Object(item.bucket_name, item.key)
    if obj.last_modified.strftime('%Y%m%d') == date:
        print(obj.last_modified)
will give the results as follows.
2019-08-27 07:13:04+00:00
2019-08-27 07:13:36+00:00
2019-08-27 07:13:39+00:00
I edited this answer to download all files after a certain timestamp and then write the current time to a file for use in the next iteration. You can easily adapt this to only download files of a specific date, month, year, yesterday, etc.
import os
import boto3
import datetime
import pandas as pd
### Load AWS Key, Secret and Region
# ....
###
# Open file to read last download time and update file with current time
latesttime_file = "latest request.txt"
with open(latesttime_file, 'r') as f:
    latest_download = pd.to_datetime(f.read(), utc=True)
with open(latesttime_file, 'w') as f:
    f.write(str(datetime.datetime.utcnow()))
# Initialize S3 client
s3_client = boto3.client('s3',
                         region_name=AWS_REGION,
                         aws_access_key_id=AWS_KEY_ID,
                         aws_secret_access_key=AWS_SECRET)
def download_dir(prefix, local, bucket, timestamp, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - timestamp: only download objects modified after this datetime
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            t = i.get('LastModified')
            if k[-1] != '/':
                if t > timestamp:
                    keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)
download_dir(<prefix or ''>, <local folder to download to>, <bucketname>, latest_download)
import boto3
import cv2
import numpy as np
s3 = boto3.resource('s3')
vid = (s3.Object('bucketname', 'video.blob').get()['Body'].read())
cap = cv2.VideoCapture(vid)
This is my code. I have a video file in an S3 bucket. I want to do some processing on it with OpenCV, and I don't want to download it, so I'm trying to store that video file in vid. The problem is that type(vid) is bytes, which results in this error on line 6: TypeError: an integer is required (got type bytes). I tried converting it into an integer or a string but was unable to.
On an attempt to convert the bytes to an integer, I referred to this and was getting length issues. This is just a sample video file; the actual file I want to process will be huge when converted to a bytes object.
On an attempt to get the object as a string and then convert it to an integer, I referred to this. Even this doesn't seem to work for me.
If anyone can help me solve this issue, I will be grateful. Please comment if anything is unclear to you regarding my issue and I'll try to provide more details.
If streaming the video from a URL is an acceptable solution, I think that is the easiest approach. You just need to generate a URL to read the video from.
import boto3
import cv2
s3_client = boto3.client('s3')
bucket = 'bucketname'
key = 'video.blob'
url = s3_client.generate_presigned_url('get_object',
                                       Params={'Bucket': bucket, 'Key': key},
                                       ExpiresIn=600)  # this URL will be valid for 600 seconds
cap = cv2.VideoCapture(url)
ret, frame = cap.read()
You should see that you are able to read and process frames from that url.
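For example, a small read loop over the stream (just a sketch; the per-frame processing is up to you):
import cv2

cap = cv2.VideoCapture(url)  # url generated as in the snippet above
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # end of stream or read error
    # process the frame here, e.g. convert it to grayscale
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
cap.release()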
Refer to the useful code snippets below to perform various operations on an S3 bucket.
import boto3
s3 = boto3.resource('s3', region_name='us-east-2')
For listing buckets in S3:
for bucket in s3.buckets.all():
    print(bucket.name)
Bucket creation in S3:
my_bucket = s3.create_bucket(Bucket='Bucket Name',
                             CreateBucketConfiguration={
                                 'LocationConstraint': 'us-east-2'
                             })
Listing objects inside a bucket:
my_bucket = s3.Bucket('Bucket Name')
for file in my_bucket.objects.all():
    print(file.key)
Uploading a file from the current directory:
import os
print(os.getcwd())
fileName = "B01.jpg"
bucketName = "Bucket Name"
# upload_file takes the file path directly, so there is no need to open the file first
s3.meta.client.upload_file(fileName, bucketName, 'test2.txt')
Reading an image/video from a bucket:
import matplotlib.pyplot as plt
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('Bucket Name')  # bucket name
object = bucket.Object('maisie_williams.jpg')  # image name
object.download_file('B01.jpg')  # download the image with this name
img = plt.imread('B01.jpg')  # read the downloaded image
imgplot = plt.imshow(img)  # plot the image
plt.show()
Reading from one bucket and then uploading to another:
import boto3
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('Bucket Name')  # source bucket name
object = bucket.Object('maisie_williams.jpg')  # image name
object.download_file('B01.jpg')
fileName = "B01.jpg"
bucketName = "Bucket Name"
s3.meta.client.upload_file(fileName, bucketName, 'testz.jpg')
If you have access keys, then you can probably do the following:
import pandas as pd

keys = pd.read_csv('accessKeys.csv')
# creating a session for S3 buckets
session = boto3.session.Session(aws_access_key_id=keys['Access key ID'][0],
                                aws_secret_access_key=keys['Secret access key'][0])
s3 = session.resource('s3')
buck = s3.Bucket('Bucket Name')
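From there, buck can be used like any other bucket resource, for example (a sketch with placeholder key and file names):
# list the objects in the bucket
for obj in buck.objects.all():
    print(obj.key)

# download one of them to a local file
buck.download_file('some/key/in/bucket.jpg', 'local_copy.jpg')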