I have a Cloud Function that is triggered by Cloud Pub/Sub. I want the same function to trigger a Dataflow pipeline using the Python SDK. Here is my code:
import base64

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))
I deploy the function this way:
gcloud beta functions deploy hello_pubsub --runtime python37 --trigger-topic topic1
You have to embed your pipeline's Python code with your function. When your function is called, you simply call the pipeline's main function, which executes the pipeline defined in your file.
If you developed and tested your pipeline in Cloud Shell and already ran it on the Dataflow service, your code should have this structure:
def run(argv=None, save_main_session=True):
    # Parse arguments
    # Set options
    # Build the pipeline in the p variable
    # Perform your transforms on the pipeline
    # Run the pipeline
    result = p.run()
    # Wait for the pipeline to finish
    result.wait_until_finish()
Then call this function with the correct arguments, especially runner=DataflowRunner, so that the Python code submits the pipeline to the Dataflow service.
At the end, remove result.wait_until_finish(), because your function won't live for the whole duration of the Dataflow job.
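For example, a minimal sketch of the function submitting the job this way (the my_pipeline module, project, region, and bucket path are placeholders for your own pipeline code and settings):
import base64
from my_pipeline import run  # hypothetical module containing the run() shown above

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))
    # Submit the pipeline to the Dataflow service and return without waiting for it.
    run(argv=[
        '--runner=DataflowRunner',
        '--project=my-project-id',             # placeholder
        '--region=us-central1',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
    ])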
You can also use a template if you want.
You can use Cloud Dataflow templates to launch your job. You will need to code the following steps:
Retrieve credentials
Generate Dataflow service instance
Get GCP PROJECT_ID
Generate template body
Execute template
Here is an example using your base code (feel free to split it into multiple methods to reduce the code inside the hello_pubsub method).
from googleapiclient.discovery import build
import base64
import google.auth
import os

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'

    # Retrieve credentials
    credentials, _ = google.auth.default()
    # Generate the Dataflow service instance
    service = build('dataflow', 'v1b3', credentials=credentials)
    # Get the GCP project ID
    gcp_project = os.environ["GCLOUD_PROJECT"]

    # Generate the template body
    template_path = "gs://template_file_path_on_storage/"
    template_body = {
        "parameters": {
            "keyA": "valueA",
            "keyB": "valueB",
        },
        "environment": {
            "envVariable": "value"
        }
    }

    # Execute the template
    request = service.projects().templates().launch(projectId=gcp_project, gcsPath=template_path, body=template_body)
    response = request.execute()
    print(response)
In the template_body variable, the parameters values are the arguments that will be passed to your pipeline, and the environment values are used by the Dataflow service (service account, workers, and network configuration).
LaunchTemplateParameters documentation
RuntimeEnvironment documentation
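For reference, a more complete template_body could look like this (field names come from the LaunchTemplateParameters and RuntimeEnvironment documentation linked above; all values are placeholders):
template_body = {
    "jobName": "my-dataflow-job",                    # placeholder
    "parameters": {
        "inputFile": "gs://my-bucket/input.csv",     # pipeline parameter, placeholder
        "outputTable": "my_dataset.my_table"         # pipeline parameter, placeholder
    },
    "environment": {
        "tempLocation": "gs://my-bucket/tmp",        # placeholder
        "serviceAccountEmail": "dataflow-sa@my-project.iam.gserviceaccount.com",  # placeholder
        "machineType": "n1-standard-1",
        "maxWorkers": 3,
        "network": "default"
    }
}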
I am trying to mock a cloud storage response while testing a cloud function in a local testing environment.
1. I am starting the server - as advised by the docs - using a subprocess:
subprocess.Popen(
    [
        "functions-framework",
        "--target",
        "my_target",
        "--port",
        "8080",
        "--source",
        f"{os.getcwd()}/src/main.py",
    ],
    cwd=os.path.dirname(__file__),
    stdout=subprocess.PIPE,
)
2. I then ping the local server from my test function:
res = requests.get(
    f"{SETTINGS.BASE_URL}:8080/test",
)
The server is up and running and it works just fine...
The cloud function talks to a Cloud Storage bucket that serves as a "database". However, I do not want to call it; I want to mock the bucket's behaviour instead. I have tried many different things, such as patching the function, using MagicMock, etc. Nothing works. It always hits the real, live Cloud Storage bucket. I think I am not quite getting how to patch the Cloud Storage behaviour in the subprocess that the server is using.
Any ideas?
I tried e.g.:
def test_return_from_cs(cls):
    mock_client = MagicMock(spec=google.cloud.storage.Client)
    # Create a mock bucket
    mock_bucket = MagicMock(spec=google.cloud.storage.Bucket)
    # Create a mock blob
    mock_blob = MagicMock(spec=google.cloud.storage.Blob)
    # Set the return value of the mock client's get_bucket method to be the mock bucket
    mock_client.get_bucket.return_value = mock_bucket
    # Set the return value of the mock bucket's get_blob method to be the mock blob
    mock_bucket.get_blob.return_value = mock_blob
    # Set the return value of the mock blob's download_as_string method to be the string "test_data"
    mock_blob.download_as_string.return_value = "test_data"

    import subprocess
    import os

    proc = subprocess.Popen(
        [
            "functions-framework",
            "--target",
            "my_target",
            "--port",
            "9090",
            "--source",
            f"{os.getcwd()}/src/main.py",
        ],
        cwd=os.path.dirname(__file__),
        stdout=subprocess.PIPE,
    )
    res = requests.get(
        f"{SETTINGS.BASE_URL}:9090/test",
    )
    assert res.text == "test_data"
    proc.terminate()
I also tried:
@patch("src.my_module.main.storage.Client", autospec=True)
def test_return_from_cs(cls, clientMock):
    blob_mock = (
        clientMock().get_bucket.return_value.get_blob.return_value
    )  # split this up for readability
    blob_mock.download_as_string.return_value = "test"
    res = cls.session.get(
        f"{SETTINGS.BASE_URL}:{SETTINGS.PORT}/test",
    )
    assert res.text == "test"
In the second attempt above the server is already running and imported from a conftest.py.
I guess I do not understand how to mock the call to CS within the subprocess...
It always pings the actual cloud storage bucket...
We are using a simple Python Azure Function to forward a JSON payload to an Event Hub. We have configured the Event Hub as the function's output binding. Our requirement is to verify an API key that comes as part of the header; if the request header doesn't have the API key or it doesn't match ours, we want to skip the function's output binding. How do we achieve this?
The current code looks like this
import logging
import azure.functions as func
import json

def main(req: func.HttpRequest) -> str:
    logging.info('Send an output')
    try:
        if req.headers.get("MYAPIKEY") == APIKEY:
            body = req.get_json()
            return json.dumps(body)
    except:
        func.HttpResponse("Function failed")
The Event Hub output binding requires at least one output per function call.
If you are using the return-value binding, try using a collector-style output binding instead (IAsyncCollector in C#, or a named func.Out parameter in Python), which you can choose not to set.
You can check this GitHub discussion, which shows how to use the function out method.
Here is another GitHub discussion with a related issue.
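A rough sketch of that approach in Python, assuming an Event Hub output binding named "output" in function.json and a hypothetical EXPECTED_APIKEY constant, would be:
import json
import logging
import azure.functions as func

EXPECTED_APIKEY = "my-secret-key"  # hypothetical; load it from app settings in practice

def main(req: func.HttpRequest, output: func.Out[str]) -> func.HttpResponse:
    if req.headers.get("MYAPIKEY") == EXPECTED_APIKEY:
        # Setting the binding forwards the payload to the Event Hub.
        output.set(json.dumps(req.get_json()))
        return func.HttpResponse("Payload forwarded", status_code=202)
    # Key missing or wrong: skip the output binding entirely.
    logging.warning("Invalid or missing API key; skipping Event Hub output.")
    return func.HttpResponse("Unauthorized", status_code=401)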
Scenario:
I want to pass the S3 source file location and the S3 output file location as input parameters in my workflow.
Workflow: AWS Step Function calls -> Lambda function, and the Lambda function calls -> the Glue job. I want to pass the parameters from Step Function -> Lambda function -> Glue job, where the Glue job does some transformation on the S3 input file and writes its output to the S3 output file.
Below are the Step Function, Lambda function, and Glue job respectively, along with the input JSON that is passed to the Step Function as input.
1:Input (Parameters passed) :
{
    "S3InputFileLocation": "s3://bucket_name/sourcefile.csv",
    "S3OutputFileLocation": "s3://bucket_name/FinalOutputfile.csv"
}
2: Step Function / state machine (which calls the Lambda function with the above input parameters):
{
    "StartAt": "AWSStepFunctionInitiator",
    "States": {
        "AWSStepFunctionInitiator": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:xxxxxx:function:AWSLambdaFunction",
            "InputPath": "$",
            "End": true
        }
    }
}
3: Lambda Function (i.e. the AWSLambdaFunction invoked above, which in turn calls the AWSGlueJob below):
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            'S3InputFileLocation': event["S3InputFileLocation"],
            'S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
4: AWS Glue Job Script:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

print('AWS Glue Job started')

args = getResolvedOptions(sys.argv, ['AWSGlueJob', 'S3InputFileLocation', 'S3OutputFileLocation'])
S3InputFileLocation = args['S3InputFileLocation']
S3OutputFileLocation = args['S3OutputFileLocation']

glueContext = GlueContext(SparkContext.getOrCreate())
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': [S3InputFileLocation]}, format="csv")
datasink = glueContext.write_dynamic_frame.from_options(frame=dfnew, connection_type="s3", connection_options={"path": S3OutputFileLocation}, format="csv", transformation_ctx="datasink")
The above Step Function and the corresponding workflow execute without any compilation or runtime errors, and I do see the parameters successfully passed from the Step Function to the Lambda function. But none of my print statements in the Glue job are getting logged in CloudWatch, which means there is some issue when the Lambda function calls the Glue job. Kindly help me figure out if there is an issue in the way I am invoking Glue from Lambda.
Hi,
Maybe it is already solved, but these two links help:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
And to give a precise answer to your question: add '--' to the argument names here:
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            '--S3InputFileLocation': event["S3InputFileLocation"],
            '--S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
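On the Glue side, getResolvedOptions looks the arguments up by name without the leading dashes, so the job script shown above keeps working unchanged. A minimal sketch:
import sys
from awsglue.utils import getResolvedOptions

# Passed on the job run as '--S3InputFileLocation=...' and '--S3OutputFileLocation=...',
# but resolved here by name without the dashes.
args = getResolvedOptions(sys.argv, ['S3InputFileLocation', 'S3OutputFileLocation'])
print(args['S3InputFileLocation'], args['S3OutputFileLocation'])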
I'm following a tutorial on setting up AWS API Gateway with a Lambda Function to create a restful API. I have the following code:
import json

def lambda_handler(event, context):
    # 1. Parse query string parameters
    transactionId = event['queryStringParameters']['transactionid']
    transactionType = event['queryStringParameters']['type']
    transactionAmounts = event['queryStringParameters']['amount']

    # 2. Construct the body of the response object
    transactionResponse = {}
    # returning values originally passed in then add separate field at the bottom
    transactionResponse['transactionid'] = transactionId
    transactionResponse['type'] = transactionType
    transactionResponse['amount'] = transactionAmounts
    transactionResponse['message'] = 'hello from lambda land'

    # 3. Construct http response object
    responseObject = {}
    responseObject['StatusCode'] = 200
    responseObject['headers'] = {}
    responseObject['headers']['Content-Type'] = 'application/json'
    responseObject['body'] = json.dumps(transactionResponse)

    # 4. Return the response object
    return responseObject
When I link the API Gateway to this function and try to call it using query parameters I get the error:
{
"message":"Internal server error"
}
When I test the lambda function it returns the error:
{
    "errorMessage": "'transactionid'",
    "errorType": "KeyError",
    "stackTrace": [
        "  File \"/var/task/lambda_function.py\", line 5, in lambda_handler\n    transactionId = event['queryStringParameters']['transactionid']\n"
    ]
}
Does anybody have any idea of what's going on here/how to get it to work?
I recommend adding a couple of diagnostics, as follows:
import json

def lambda_handler(event, context):
    print('event:', json.dumps(event))
    print('queryStringParameters:', json.dumps(event['queryStringParameters']))
    transactionId = event['queryStringParameters']['transactionid']
    transactionType = event['queryStringParameters']['type']
    transactionAmounts = event['queryStringParameters']['amount']
    # remainder of code ...
That way you can see what is in event and event['queryStringParameters'] to be sure that it matches what you expected to see. These will be logged in CloudWatch Logs (and you can see them in the AWS Lambda console if you are testing events using the console).
In your case, it turns out that your test event included transactionId while your code expected to see transactionid (different capitalization). Hence the KeyError exception.
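If you also want the function to fail more gracefully, a small defensive sketch (response body is illustrative) can use .get() and return a 400 instead of letting the KeyError bubble up:
import json

def lambda_handler(event, context):
    # Tolerate a missing or differently-cased key instead of raising KeyError.
    params = event.get('queryStringParameters') or {}
    transactionId = params.get('transactionid')
    if transactionId is None:
        return {
            'statusCode': 400,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'message': 'missing transactionid query parameter'})
        }
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'transactionid': transactionId})
    }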
Just remove ['queryStringParameters']. The print event line shows the event is only an array, not a key-value pair. I happen to be following the same tutorial. I'm still on the API Gateway part, so I'll update once mine is completed.
When you test from the Lambda console there is no queryStringParameters in the event, but it is there when the function is called from the API Gateway. You can also test from the API Gateway, where queryStringParameters is required, to check the values that are passed.
The problem is not your code. It is the Lambda function integration setting. Do not enable the Lambda function integration setting; you can still attach the Lambda function without it. Leave it unchecked.
It's because of the typo in responseObject['StatusCode'] = 200.
'StatusCode' should be 'statusCode'.
I got the same issue, and it was that.
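For reference, a minimal sketch of the corrected response (API Gateway's Lambda proxy integration expects the lowercase statusCode key):
import json

def lambda_handler(event, context):
    transactionResponse = {'message': 'hello from lambda land'}
    # 'statusCode' must be lowercase; 'StatusCode' makes API Gateway return "Internal server error".
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(transactionResponse)
    }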
I've set up a Python script that takes certain BigQuery tables from one dataset, cleans them with a SQL query, and adds the cleaned tables to a new dataset. This script works correctly. I want to set it up as a cloud function that triggers at midnight every day.
I've also used cloud scheduler to send a message to a pubsub topic at midnight every day. I've verified that this works correctly. I am new to pubsub but I followed the tutorial in the documentation and managed to setup a test cloud function that prints out hello world when it gets a push notification from pubsub.
However, my issue is that when I try to combine the two and automate my script, I get a log message saying that the execution crashed:
Function execution took 1119 ms, finished with status: 'crash'
To help you understand what I'm doing, here is the code in my main.py:
# Global libraries
import base64

# Local libraries
from scripts.one_minute_tables import helper

def one_minute_tables(event, context):
    # Log out the message that triggered the function
    print("""This Function was triggered by messageId {} published at {}
    """.format(context.event_id, context.timestamp))

    # Get the message from the event data
    name = base64.b64decode(event['data']).decode('utf-8')

    # If it's the message for the daily midnight schedule, execute function
    if name == 'midnight':
        helper.format_tables('raw_data','table1')
    else:
        pass
For the sake of convenience, this is a simplified version of my python script:
# Global libraries
from google.cloud import bigquery
import os

# Login to bigquery by providing credentials
credential_path = 'secret.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

def format_tables(dataset, list_of_tables):
    # Initialize the client
    client = bigquery.Client()

    # Loop through the list of tables
    for table in list_of_tables:
        # Create the query object
        script = f"""
            SELECT *
            FROM {dataset}.{table}
        """

        # Call the API
        query = client.query(script)

        # Wait for job to finish
        results = query.result()

        # Print
        print('Data cleaned and updated in table: {}.{}'.format(dataset, table))
This is my folder structure:
And my requirements.txt file has only one entry in it: google-cloud-bigquery==1.24.0
I'd appreciate your help in figuring out what I need to fix to run this script with the pubsub trigger without getting a log message that says the execution crashed.
EDIT: Based on the comments, this is the log of the function crash
{
    "textPayload": "Function execution took 1078 ms, finished with status: 'crash'",
    "insertId": "000000-689fdf20-aee2-4900-b5a1-91c34d7c1448",
    "resource": {
        "type": "cloud_function",
        "labels": {
            "function_name": "one_minute_tables",
            "region": "us-central1",
            "project_id": "PROJECT_ID"
        }
    },
    "timestamp": "2020-05-15T16:53:53.672758031Z",
    "severity": "DEBUG",
    "labels": {
        "execution_id": "x883cqs07f2w"
    },
    "logName": "projects/PROJECT_ID/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
    "trace": "projects/PROJECT_ID/traces/f391b48a469cbbaeccad5d04b4a704a0",
    "receiveTimestamp": "2020-05-15T16:53:53.871051291Z"
}
The problem comes from the list_of_tables parameter. You call your function like this:
if name == 'midnight':
    helper.format_tables('raw_data','table1')
and then iterate over the 'table1' parameter, character by character.
Do this instead, and it should work:
if name == 'midnight':
    helper.format_tables('raw_data', ['table1'])
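The reason is that iterating over a plain string yields its individual characters, so the loop builds one query per character instead of one per table. A quick illustration:
for table in 'table1':
    print(table)    # prints 't', 'a', 'b', 'l', 'e', '1' one at a time

for table in ['table1']:
    print(table)    # prints 'table1', one query per table as intended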