Scenario:
I want to pass the S3 source file location and the S3 output file location as input parameters in my workflow.
Workflow: AWS Step Functions calls -> a Lambda function, and the Lambda function calls -> the Glue job.
I want to pass the parameters from the Step Function -> Lambda function -> Glue job, where the Glue job does some transformation on the S3 input file and writes its output to the S3 output file.
Below are the Step Function, the Lambda function, and the Glue job respectively, along with the input JSON that is passed to the Step Function.
1: Input (parameters passed):
{
"S3InputFileLocation": "s3://bucket_name/sourcefile.csv",
"S3OutputFileLocation": "s3://bucket_name/FinalOutputfile.csv"
}
2: Step Function / state machine (which calls the Lambda with the above input parameters):
{
"StartAt":"AWSStepFunctionInitiator",
"States":{
"AWSStepFunctionInitiator": {
"Type":"Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:AWSLambdaFunction",
"InputPath": "$",
"End": true
}
}
}
3: Lambda function (i.e. AWSLambdaFunction invoked above, which in turn calls the AWSGlueJob below):
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            'S3InputFileLocation': event["S3InputFileLocation"],
            'S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
4: AWS Glue Job Script:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
print('AWS Glue Job started')
args = getResolvedOptions(sys.argv, ['AWSGlueJob','S3InputFileLocation', 'S3OutputFileLocation'])
S3InputFileLocation= args['S3InputFileLocation']
S3OutputFileLocation= args['S3OutputFileLocation']
glueContext = GlueContext(SparkContext.getOrCreate())
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': [S3InputFileLocation]}, format="csv")
datasink = glueContext.write_dynamic_frame.from_options(frame=dfnew, connection_type="s3", connection_options={"path": S3OutputFileLocation}, format="csv", transformation_ctx="datasink")
The above Step Function and the corresponding workflow execute without any compilation or runtime errors, and I can see the parameters being passed successfully from the Step Function to the Lambda function. However, none of my print statements in the Glue job are getting logged in CloudWatch, which suggests there is some issue when the Lambda function calls the Glue job. Kindly help me figure out whether there is an issue in the way I am invoking Glue from Lambda.
Hi,
Maybe it is already solved, but these two links help:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
And to give a precise answer to your question: add '--' to the argument names here:
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            '--S3InputFileLocation': event["S3InputFileLocation"],
            '--S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
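On the Glue side, getResolvedOptions then resolves those same names without the leading '--'; a minimal sketch based on the job script above:
import sys
from awsglue.utils import getResolvedOptions

# Job parameters passed as '--S3InputFileLocation' / '--S3OutputFileLocation'
# are looked up here without the '--' prefix
args = getResolvedOptions(sys.argv, ['S3InputFileLocation', 'S3OutputFileLocation'])
input_path = args['S3InputFileLocation']
output_path = args['S3OutputFileLocation']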
I have written the AWS Lambda function below to export a DynamoDB table to an S3 bucket. But when I execute the code, I get the following error:
'dynamodb.ServiceResource' object has no attribute 'export_table_to_point_in_time'
import boto3
import datetime

def lambda_handler(event, context):
    client = boto3.resource('dynamodb', endpoint_url="http://localhost:8000")
    response = client.export_table_to_point_in_time(
        TableArn='table arn string',
        ExportTime=datetime(2015, 1, 1),
        S3Bucket='my-bucket',
        S3BucketOwner='string',
        ExportFormat='DYNAMODB_JSON'
    )
    print("Response :", response)
Boto3 version: 1.24.82
ExportTableToPointInTime is not available on DynamoDB Local, so if you are trying to do it locally (assumed from the localhost endpoint), you cannot.
Moreover, the Resource interface does not have that API. You must use the low-level Client:
import boto3
dynamodb = boto3.client('dynamodb')
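A minimal sketch of the export call with the low-level client, assuming point-in-time recovery is enabled on the table and using placeholder values for the ARN, bucket and timestamp:
import datetime
import boto3

# Low-level client against the real AWS endpoint (not DynamoDB Local)
dynamodb = boto3.client('dynamodb')

response = dynamodb.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:us-east-1:123456789012:table/my-table',  # placeholder ARN
    ExportTime=datetime.datetime(2022, 10, 1),                          # placeholder point in time
    S3Bucket='my-bucket',
    ExportFormat='DYNAMODB_JSON'
)
print(response['ExportDescription']['ExportStatus'])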
I want to make a serverless application in AWS Lambda for phone book searches.
What I've done:
Created a bucket and uploaded a CSV file to it.
Created a role with full access to the bucket.
Created a Lambda function
Created API Gateway with GET and POST methods
The Lambda function contains the following code:
import boto3
import json
s3 = boto3.client('s3')
resp = s3.select_object_content(
    Bucket='namebbacket',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])
When I access the Invoke URL, I get the following error:
{errorMessage = Handler 'lambda_handler' missing on module 'lambda_function', errorType = Runtime.HandlerNotFound}
CSV structure: Name, PhoneNumber, City, Occupation
How can I solve this problem?
Please refer to this documentation topic to learn how to write a Lambda function in Python. You are missing the Handler. See: AWS Lambda function handler in Python
Welcome to S.O. #smac2020 links you to the right place: AWS Lambda function handler in Python. In short, AWS Lambda needs to know where to find your code, hence the "handler". Though a better way to think about it might be "entry point."
Here is a close approximation of your function, refactored for use on AWS Lambda:
import json
import boto3

def function_to_be_called(event, context):
    # TODO implement
    s3 = boto3.client('s3')
    resp = s3.select_object_content(
        Bucket='stack-exchange',
        Key='48836509/dogs.csv',
        ExpressionType='SQL',
        Expression="SELECT * FROM s3object s where s.\"breen_name\" = 'pug'",
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
        OutputSerialization={'CSV': {}},
    )
    for event in resp['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!'),
        'pugInfo': records
    }
This function produces the following result:
Response
{
"statusCode": 200,
"body": "\"Hello from Lambda!\"",
"currentWorkdingDirectory": "/var/task",
"currentdirlist": [
"lambda_function.py"
],
"pugInfo": "1,pug,toy\r\n"
}
The "entry point" for this function is in a Python file called lambda_function.py and the function function_to_be_called. Together these are the "handler." We can see this in the Console:
or using the API through Boto3
import boto3

awslambda = boto3.client('lambda')
awslambda.get_function_configuration(FunctionName='s3SelectFunction')
Which returns:
{'CodeSha256': 'mFVVlakisUIIsLstQsJUpeBIeww4QhJjl7wJaXqsJ+Q=',
'CodeSize': 565,
'Description': '',
'FunctionArn': 'arn:aws:lambda:us-east-1:***********:function:s3SelectFunction',
'FunctionName': 's3SelectFunction',
'Handler': 'lambda_function.function_to_be_called',
'LastModified': '2021-03-10T00:57:48.651+0000',
'MemorySize': 128,
'ResponseMetadata': ...
'Version': '$LATEST'}
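If the configured handler is wrong, it can also be corrected through the same API; a minimal sketch, assuming the function name used above:
import boto3

awslambda = boto3.client('lambda')
# Handler format is <module name without .py>.<function name>
awslambda.update_function_configuration(
    FunctionName='s3SelectFunction',
    Handler='lambda_function.function_to_be_called'
)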
I have a Lambda function that starts a transcription job when an object is put into the S3 bucket. I am having trouble setting the transcription job name to the file name without the extension; also, the output file is not being put into the correct prefix folder in the S3 bucket for some reason. Here's what I have:
import json
import boto3
import time
import os
from urllib.request import urlopen

transcribe = boto3.client('transcribe')

def lambda_handler(event, context):
    if event:
        file_obj = event["Records"][0]
        bucket_name = str(file_obj['s3']['bucket']['name'])
        file_name = str(file_obj['s3']['object']['key'])
        s3_uri = create_uri(bucket_name, file_name)
        job_name = file_name
        print(os.path.splitext(file_name)[0])
        transcribe.start_transcription_job(TranscriptionJobName=job_name,
                                           Media={'MediaFileUri': s3_uri},
                                           MediaFormat='mp3',
                                           LanguageCode="en-US",
                                           OutputBucketName="sbox-digirepo-transcribe-us-east-1",
                                           Settings={
                                               # 'VocabularyName': 'string',
                                               'ShowSpeakerLabels': True,
                                               'MaxSpeakerLabels': 2,
                                               'ChannelIdentification': False
                                           })
        while True:
            status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
            if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED", "FAILED"]:
                break
            print("Transcription in progress")
            time.sleep(5)
        s3.put_object(Bucket=bucket_name, Key="output/{}.json".format(job_name), Body=load_)
    return {
        'statusCode': 200,
        'body': json.dumps('Transcription job created!')
    }

def create_uri(bucket_name, file_name):
    return "s3://" + bucket_name + "/" + file_name
The error I get is:
[ERROR] BadRequestException: An error occurred (BadRequestException) when calling the StartTranscriptionJob operation: 1 validation error detected: Value 'input/7800533A.mp3' at 'transcriptionJobName' failed to satisfy constraint: Member must satisfy regular expression pattern: ^[0-9a-zA-Z._-]+
So my desired output should have the TranscriptionJobName value be 7800533A in this case, and the result should end up under s3bucket/output. Any help is appreciated, thanks in advance.
The TranscriptionJobName argument is a friendly name for your job and is pretty limited based on the regex. You're passing it the full object key, which contains the prefix input/, but / is a disallowed character in the job name. You could just split out the file name portion in your code:
job_name = file_name.split('/')[-1]
I put a full example of uploading media and starting an AWS Transcribe job on GitHub that puts all this in context.
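As a rough sketch, using the example key from your error message, the job name could be derived like this:
import os

file_name = "input/7800533A.mp3"            # example object key from the S3 event
base_name = os.path.basename(file_name)     # "7800533A.mp3"
job_name = os.path.splitext(base_name)[0]   # "7800533A" -- satisfies ^[0-9a-zA-Z._-]+
For the output location, note that StartTranscriptionJob also accepts an OutputKey parameter alongside OutputBucketName (assuming your boto3 version exposes it), which you could use to have Transcribe write the result under a prefix such as output/.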
import json
import base64
from google.cloud import bigquery
import ast
import pandas as pd
import sys
import pandas_gbq

def process_data(data):
    #msg = str(data)
    df = pd.DataFrame({"Data": data}, index=[0])
    df['time'] = pd.datetime.now()
    lst = list(df)
    df[lst] = df[lst].astype(str)
    pandas_gbq.to_gbq(df, 'datasetid.tableid', project_id='project_id', if_exists='append')

def receive_messages(project_id, subscription_name):
    """Receives messages from a pull subscription."""
    # [START pubsub_subscriber_async_pull]
    # [START pubsub_quickstart_subscriber]
    import time
    from google.cloud import pubsub_v1

    # TODO project_id = "Your Google Cloud Project ID"
    # TODO subscription_name = "Your Pub/Sub subscription name"
    subscriber = pubsub_v1.SubscriberClient()
    # The `subscription_path` method creates a fully qualified identifier
    # in the form `projects/{project_id}/subscriptions/{subscription_name}`
    subscription_path = subscriber.subscription_path(
        project_id, subscription_name)

    def callback(message):
        #print('Received message: {}'.format(message))
        process_data(message)
        message.ack()

    subscriber.subscribe(subscription_path, callback=callback)

    # The subscriber is non-blocking. We must keep the main thread from
    # exiting to allow it to process messages asynchronously in the background.
    # print('Listening for messages on {}'.format(subscription_path))
    while True:
        time.sleep(60)
    # [END pubsub_subscriber_async_pull]
    # [END pubsub_quickstart_subscriber]

receive_messages(project-id, sub-id)
I'm streaming real-time data from Pub/Sub to BigQuery using Cloud Functions.
Here is the error I get:
Deployment failure:
Function failed on loading user code. Error message: Error: function load attempt timed out.
Your code is in a while True loop. Cloud Functions considers that your code has crashed because it does not return. Then your function is killed.
Redesign so that Pub/Sub is calling your Cloud Function using events (triggers). Follow this guide on how to implement a correct design:
Google Cloud Pub/Sub Triggers
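As a rough sketch of that design, reusing your process_data logic in a Pub/Sub-triggered function (the function name pubsub_to_bq is illustrative; the dataset, table and project IDs are the placeholders from your code):
import base64
import pandas as pd
import pandas_gbq

def pubsub_to_bq(event, context):
    # Background Cloud Function deployed with --trigger-topic: it is invoked
    # once per Pub/Sub message, so no while True loop is needed.
    data = base64.b64decode(event['data']).decode('utf-8')
    df = pd.DataFrame({"Data": data}, index=[0])
    df['time'] = pd.Timestamp.now()
    df = df.astype(str)
    pandas_gbq.to_gbq(df, 'datasetid.tableid',
                      project_id='project_id', if_exists='append')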
I have a Cloud Function that is triggered by Cloud Pub/Sub. I want the same function to trigger Dataflow using the Python SDK. Here is my code:
import base64

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))
I deploy the function this way:
gcloud beta functions deploy hello_pubsub --runtime python37 --trigger-topic topic1
You have to embed your pipeline Python code with your function. When your function is called, you simply call the pipeline's main function, which executes the pipeline defined in your file.
If you developed and tested your pipeline in Cloud Shell and already ran it as a Dataflow pipeline, your code should have this structure:
def run(argv=None, save_main_session=True):
    # Parse the arguments
    # Set the options
    # Build the pipeline in the variable p
    # Perform your transforms in the pipeline
    # Run the pipeline
    result = p.run()
    # Wait for the end of the pipeline
    result.wait_until_finish()
Then call this function with the correct arguments, in particular runner=DataflowRunner, so that the Python code launches the pipeline on the Dataflow service.
Finally, delete the result.wait_until_finish() at the end, because your function won't live for the whole duration of the Dataflow job.
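For instance, a rough sketch of how hello_pubsub could hand the arguments to the run() function sketched above (the project, region and bucket names are placeholders; the options shown are the usual Dataflow pipeline options):
def hello_pubsub(event, context):
    # Launch the pipeline on the Dataflow service rather than locally.
    # run() is the pipeline entry point from the skeleton above.
    run(argv=[
        '--runner=DataflowRunner',
        '--project=my-project',                 # placeholder project ID
        '--region=europe-west1',                # placeholder region
        '--temp_location=gs://my-bucket/tmp',   # placeholder bucket
        '--job_name=pubsub-triggered-job',
    ])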
You can also use a template if you want.
You can use Cloud Dataflow templates to launch your job. You will need to code the following steps:
Retrieve credentials
Generate Dataflow service instance
Get GCP PROJECT_ID
Generate template body
Execute template
Here is an example using your base code (feel free to split it into multiple methods to reduce the amount of code inside the hello_pubsub method).
from googleapiclient.discovery import build
import base64
import google.auth
import os

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'

    credentials, _ = google.auth.default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    gcp_project = os.environ["GCLOUD_PROJECT"]

    template_path = "gs://template_file_path_on_storage/"
    template_body = {
        "parameters": {
            "keyA": "valueA",
            "keyB": "valueB",
        },
        "environment": {
            "envVariable": "value"
        }
    }

    request = service.projects().templates().launch(projectId=gcp_project, gcsPath=template_path, body=template_body)
    response = request.execute()
    print(response)
In the template_body variable, the parameters values are the arguments that will be sent to your pipeline, and the environment values are used by the Dataflow service (service account, workers and network configuration).
LaunchTemplateParameters documentation
RuntimeEnvironment documentation