get aws lambda to start transcribe job with filename without extension - python-3.x

I have a lambda function which will start a transcribe job when an object is put into the s3 bucket. I am having trouble getting setting the transcribe job to be the file name without the extension; also the file is not putting into the correct prefix folder in S3 bucket for some reason, here's what I have:
import json
import boto3
import time
import os
from urllib.request import urlopen
transcribe = boto3.client('transcribe')
def lambda_handler(event, context):
if event:
file_obj = event["Records"][0]
bucket_name = str(file_obj['s3']['bucket']['name'])
file_name = str(file_obj['s3']['object']['key'])
s3_uri = create_uri(bucket_name, file_name)
job_name = filename
print(os.path.splitext(file_name)[0])
transcribe.start_transcription_job(TranscriptionJobName = job_name,
Media = {'MediaFileUri': s3_uri},
MediaFormat = 'mp3',
LanguageCode = "en-US",
OutputBucketName = "sbox-digirepo-transcribe-us-east-1",
Settings={
# 'VocabularyName': 'string',
'ShowSpeakerLabels': True,
'MaxSpeakerLabels': 2,
'ChannelIdentification': False
})
while Ture:
status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED", "FAILED"]:
break
print("Transcription in progress")
time.sleep(5)
s3.put_object(Bucket = bucket_name, Key="output/{}.json".format(job_name), Body=load_)
return {
'statusCode': 200,
'body': json.dumps('Transcription job created!')
}
def create_uri(bucket_name, file_name):
return "s3://"+bucket_name+"/"+file_name
the error i get is
[ERROR] BadRequestException: An error occurred (BadRequestException) when calling the StartTranscriptionJob operation: 1 validation error detected: Value 'input/7800533A.mp3' at 'transcriptionJobName' failed to satisfy constraint: Member must satisfy regular expression pattern: ^[0-9a-zA-Z._-]+
so my desired output should have the TranscriptionJobName value to be 7800533A for this case, and the result OutputBucketName to be in s3bucket/output. any help is appreciated, thanks in advance.

The TranscriptionJobName argument is a friendly name for your job and is pretty limited based on the regex. You're passing it the full object key, which contains the prefix input/, but / is a disallowed character in the job name. You could just split out the file name portion in your code:
job_name = file_name.split('/')[-1]
I put a full example of uploading media and starting an AWS Transcribe job on GitHub that puts all this in context.

Related

How can I scrape PDFs within a Lambda function and save them to an S3 bucket?

I'm trying to develop a simple lambda function that will scrape a pdf and save it to an s3 bucket given the url and the desired filename as input data. I keep receiving the error "Read-only file system,' and I'm not sure if I have to change the bucket permissions or if there is something else I am missing. I am new to S3 and Lambda and would appreciate any help.
This is my code:
import urllib.request
import json
import boto3
def lambda_handler(event, context):
s3 = boto3.client('s3')
url = event['url']
filename = event['filename'] + ".pdf"
response = urllib.request.urlopen(url)
file = open(filename, 'w')
file.write(response.read())
s3.upload_fileobj(response.read(), 'sasbreports', filename)
file.close()
This was my event file:
{
"url": "https://purpose-cms-preprod01.s3.amazonaws.com/wp-content/uploads/2022/03/09205150/FY21-NIKE-Impact-Report_SASB-Summary.pdf",
"filename": "nike"
}
When I tested the function, I received this error:
{
"errorMessage": "[Errno 30] Read-only file system: 'nike.pdf.pdf'",
"errorType": "OSError",
"requestId": "de0b23d3-1e62-482c-bdf8-e27e82251941",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 15, in lambda_handler\n file = open(filename + \".pdf\", 'w')\n"
]
}
AWS Lambda has limited space in /tmp, the sole writable location.
Writing into this space can be dangerous without a proper disk management since this storage is kept alive across multiple executions. It can lead to a saturation or unexpected file share with previous requests.
Instead of saving locally the PDF, write it directly to S3, without involving file system this way:
import urllib.request
import json
import boto3
def lambda_handler(event, context):
s3 = boto3.client('s3')
url = event['url']
filename = event['filename']
response = urllib.request.urlopen(url)
s3.upload_fileobj(response.read(), 'sasbreports', filename)
BTW: The .pdf appending should be removed according your use case.
AWS Lambda functions can only write to the /tmp/ directory. All other directories are Read-Only.
Also, there is a default limit of 512MB for storage in /tmp/, so make sure you delete the files after upload it to S3 for situations where the Lambda environment is re-used for future executions.

Phone book search on AWS Lambda and S3

I want to make a serverless application in AWS Lambda for phone book searches.
What I've done:
Created a bucket and uploaded a CSV file to it.
Created a role with full access to the bucket.
Created a Lambda function
Created API Gateway with GET and POST methods
The Lambda function contains the following code:
import boto3
import json
s3 = boto3.client('s3')
resp = s3.select_object_content(
Bucket='namebbacket',
Key='sample_data.csv',
ExpressionType='SQL',
Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
OutputSerialization = {'CSV': {}},
)
for event in resp['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
print(records)
elif 'Stats' in event:
statsDetails = event['Stats']['Details']
print("Stats details bytesScanned: ")
print(statsDetails['BytesScanned'])
print("Stats details bytesProcessed: ")
print(statsDetails['BytesProcessed'])
print("Stats details bytesReturned: ")
print(statsDetails['BytesReturned'])
When I access the Invoke URL, I get the following error:
{errorMessage = Handler 'lambda_handler' missing on module 'lambda_function', errorType = Runtime.HandlerNotFound}
CSV structure: Name, PhoneNumber, City, Occupation
How to solve this problem?
Please refer to this documentation topic to learn how to write a Lambda function in Python. You are missing the Handler. See: AWS Lambda function handler in Python
Wecome to S.O. #smac2020 links you to the right place AWS Lambda function handler in Python. In short, AWS Lambda needs to know where to find your code, hence the "handler". Though a better way to think about it might be "entry-point."
Here is a close approximation of your function, refactored for use on AWS Lambda:
import json
import boto3
def function_to_be_called(event, context):
# TODO implement
s3 = boto3.client('s3')
resp = s3.select_object_content(
Bucket='stack-exchange',
Key='48836509/dogs.csv',
ExpressionType='SQL',
Expression="SELECT * FROM s3object s where s.\"breen_name\" = 'pug'",
InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
OutputSerialization = {'CSV': {}},
)
for event in resp['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
return {
'statusCode': 200,
'body': json.dumps('Hello from Lambda!'),
'pugInfo': records
}
This function produces the following result:
Response
{
"statusCode": 200,
"body": "\"Hello from Lambda!\"",
"currentWorkdingDirectory": "/var/task",
"currentdirlist": [
"lambda_function.py"
],
"pugInfo": "1,pug,toy\r\n"
}
The "entry point" for this function is in a Python file called lambda_function.py and the function function_to_be_called. Together these are the "handler." We can see this in the Console:
or using the API through Boto3
import boto3
awslambda = boto3.client('lambda')
awslambda.get_function_configuration('s3SelectFunction')
Which returns:
{'CodeSha256': 'mFVVlakisUIIsLstQsJUpeBIeww4QhJjl7wJaXqsJ+Q=',
'CodeSize': 565,
'Description': '',
'FunctionArn': 'arn:aws:lambda:us-east-1:***********:function:s3SelectFunction',
'FunctionName': 's3SelectFunction',
'Handler': 'lambda_function.function_to_be_called',
'LastModified': '2021-03-10T00:57:48.651+0000',
'MemorySize': 128,
'ResponseMetadata': ...
'Version': '$LATEST'}

Getting Error While Inserting JSON data to DynamoDB using Python

HI,
I am trying to put the json data into AWS dynamodb table using AWS Lambda , however i am getting an error like below. My json file is uploaded to S3 bucket.
Parameter validation failed:
Invalid type for parameter Item, value:
{
"IPList": [
"10.1.0.36",
"10.1.0.27"
],
"TimeStamp": "2020-04-22 11:43:13",
"IPCount": 2,
"LoadBalancerName": "internal-ALB-1447121364.us-west-2.elb.amazonaws.com"
}
, type: <class 'str'>, valid types: <class 'dict'>: ParamValidationError
Below i my python script:-
import boto3
import json
s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
def lambda_handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
json_file_name = event['Records'][0]['s3']['object']['key']
json_object = s3_client.get_object(Bucket=bucket,Key=json_file_name)
jsonFileReader = json_object['Body'].read()
jsonDict = json.loads(jsonFileReader)
table = dynamodb.Table('test')
table.put_item(Item=jsonDict)
return 'Hello'
Below is my json content
"{\"IPList\": [\"10.1.0.36\", \"10.1.0.27\"], \"TimeStamp\": \"2020-04-22 11:43:13\",
\"IPCount\": 2, \"LoadBalancerName\": \"internal-ALB-1447121364.us-west-2.elb.amazonaws.com\"}"
Can someone help me, how can insert data to dynamodb.
json.loads(jsonFileReader) returns string, but table.put_item() expects dict. Use json.load() instead.

Writing string to S3 with boto3: "'dict' object has no attribute 'put'"

In an AWS lambda, I am using boto3 to put a string into an S3 file:
import boto3
s3 = boto3.client('s3')
data = s3.get_object(Bucket=XXX, Key=YYY)
data.put('Body', 'hello')
I am told this:
[ERROR] AttributeError: 'dict' object has no attribute 'put'
The same happens with data.put('hello') which is the method recommended by the top answers at How to write a file or data to an S3 object using boto3 and with data.put_object: 'dict' object has no attribute 'put_object'.
What am I doing wrong?
On the opposite, reading works great (with data.get('Body').read().decode('utf-8')).
put_object is a method of the s3 object, not the data object.
Here is a full working example with Python 3.7:
import json
import boto3
s3 = boto3.client('s3')
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
bucket = 'mybucket'
key = 'id.txt'
id = None
# Write id to S3
s3.put_object(Body='Hello!', Bucket=bucket, Key=key)
# Read id from S3
data = s3.get_object(Bucket=bucket, Key=key)
id = data.get('Body').read().decode('utf-8')
logger.info("Id:" + id)
return {
'statusCode': 200,
'body': json.dumps('Id:' + id)
}

aws firehose lambda function invocation gives wrong output strcuture format

When i insert a data object to aws firhose stream using a put operation it works fine .As lambda function is enabled on my firehose stream .hence a lambda function is invoked but gives me a output structure response error :
"errorMessage":"Invalid output structure: Please check your function and make sure the processed records contain valid result status of Dropped, Ok, or ProcessingFailed."
so now i have created my lambda function like this way to make the correct output strcuture :
import base64
import json
print('Loading function')
def lambda_handler(event, context):
output=[]
print('event'+str(event))
for record in event['records']:
payload = base64.b64decode(record['data'])
print('payload'+str(payload))
payload=base64.b64encode(payload)
output_record={
'recordId':record['recordId'],
'result': 'Ok',
'data': base64.b64encode(json.dumps('hello'))
}
output.append(output_record)
return { 'records': output }
Now i am getting the follwing eror on encoding the 'data' field as
"errorMessage": "a bytes-like object is required, not 'str'",
and if i change the 'hello' to bytes like b'hello' then i get the following error :
"errorMessage": "Object of type bytes is not JSON serializable",
import json
import base64
import gzip
import io
import zlib
def lambda_handler(event, context):
output = []
for record in event['records']:
payload = base64.b64decode(record['data']).decode('utf-8')
output_record = {
'recordId': record['recordId'],
'result': 'Ok',
'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8')
}
output.append(output_record)
return {'records': output}

Resources