I've set up a Python script that takes certain BigQuery tables from one dataset, cleans them with a SQL query, and adds the cleaned tables to a new dataset. This script works correctly. I want to set this up as a Cloud Function that triggers at midnight every day.
I've also used Cloud Scheduler to send a message to a Pub/Sub topic at midnight every day. I've verified that this works correctly. I am new to Pub/Sub, but I followed the tutorial in the documentation and managed to set up a test Cloud Function that prints out hello world when it gets a push notification from Pub/Sub.
However, when I try to combine the two and automate my script, I get a log message that the execution crashed:
Function execution took 1119 ms, finished with status: 'crash'
To help you understand what I'm doing, here is the code in my main.py:
# Global libraries
import base64
# Local libraries
from scripts.one_minute_tables import helper
def one_minute_tables(event, context):
    # Log out the message that triggered the function
    print("""This Function was triggered by messageId {} published at {}
    """.format(context.event_id, context.timestamp))
    # Get the message from the event data
    name = base64.b64decode(event['data']).decode('utf-8')
    # If it's the message for the daily midnight schedule, execute function
    if name == 'midnight':
        helper.format_tables('raw_data','table1')
    else:
        pass
For the sake of convenience, this is a simplified version of my python script:
# Global libraries
from google.cloud import bigquery
import os
# Login to bigquery by providing credentials
credential_path = 'secret.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path
def format_tables(dataset, list_of_tables):
    # Initialize the client
    client = bigquery.Client()
    # Loop through the list of tables
    for table in list_of_tables:
        # Create the query object
        script = f"""
            SELECT *
            FROM {dataset}.{table}
        """
        # Call the API
        query = client.query(script)
        # Wait for job to finish
        results = query.result()
        # Print
        print('Data cleaned and updated in table: {}.{}'.format(dataset, table))
This is my folder structure:
And my requirements.txt file has only one entry in it: google-cloud-bigquery==1.24.0
I'd appreciate your help in figuring out what I need to fix to run this script with the pubsub trigger without getting a log message that says the execution crashed.
EDIT: Based on the comments, this is the log of the function crash
{
  "textPayload": "Function execution took 1078 ms, finished with status: 'crash'",
  "insertId": "000000-689fdf20-aee2-4900-b5a1-91c34d7c1448",
  "resource": {
    "type": "cloud_function",
    "labels": {
      "function_name": "one_minute_tables",
      "region": "us-central1",
      "project_id": "PROJECT_ID"
    }
  },
  "timestamp": "2020-05-15T16:53:53.672758031Z",
  "severity": "DEBUG",
  "labels": {
    "execution_id": "x883cqs07f2w"
  },
  "logName": "projects/PROJECT_ID/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
  "trace": "projects/PROJECT_ID/traces/f391b48a469cbbaeccad5d04b4a704a0",
  "receiveTimestamp": "2020-05-15T16:53:53.871051291Z"
}
The problem comes from the list_of_tables argument. You call your function like this:
    if name == 'midnight':
        helper.format_tables('raw_data','table1')
format_tables then iterates over the string 'table1' character by character, so the queries are built against tables named 't', 'a', 'b', and so on instead of 'table1'.
Pass a list instead and it should work:
    if name == 'midnight':
        helper.format_tables('raw_data',['table1'])
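To make the difference concrete, here is a minimal sketch that mirrors the loop in format_tables without touching BigQuery. The names tables_seen and normalize_tables are hypothetical helpers introduced only for illustration; the second one is an optional guard you could add if you want the function to accept a single table name as well.

```python
def tables_seen(list_of_tables):
    # Mirrors "for table in list_of_tables" and records what each iteration sees
    return [table for table in list_of_tables]


def normalize_tables(list_of_tables):
    # Hypothetical guard: wrap a bare string so it is treated as one table name
    if isinstance(list_of_tables, str):
        return [list_of_tables]
    return list(list_of_tables)


print(tables_seen('table1'))       # a string is iterated character by character
print(tables_seen(['table1']))     # a list yields the whole table name
print(normalize_tables('table1'))  # the guard turns the string into a one-item list
```

Whether to add such a guard or simply always pass a list is a design choice; passing a list keeps the function signature honest.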
I'm working on a project in which I have two actions - delete and write, and I need to make them "atomic", i.e., to be performed without any other process reading from the database after the data was deleted but not yet written.
The code under the Concept Demonstration rubric demonstrates the desired outcome with the use of multiprocessing.Process and multiprocessing.Pipe objects:
Concept Demonstration:
import time
from datetime import datetime
import boto3
from multiprocessing import Process, Pipe
from boto3.dynamodb.conditions import Key, Attr
def write_and_del_dynamodb(lock_pipe):
    _, lck_snd = lock_pipe
    dynamodb = boto3.client("dynamodb")
    add_params = {
        "TableName": "Testing",
        "Item": {
            "stream_url": {"S": "stream_url_test"},
            "added_time_utc": {"S": str(datetime.utcnow())}
        },
        # ConditionExpression is a top-level put_item parameter, not part of the item
        "ConditionExpression": 'attribute_not_exists(stream_url)'
    }
    del_params = {
        "TableName": "Testing",
        "Key": {'stream_url': {"S": 'stream_url_test'}}
    }
    lck_snd.send('LOCK_DOWN')
    print('''
================ ATOMIC WRITE-DELETE-WRITE PROCESS ====================
''')
    print('Putting an item...')
    dynamodb.put_item(**add_params)
    print('Sleeping for 1s...')
    time.sleep(1)
    print("Deleting an item...")
    dynamodb.delete_item(**del_params)
    print('Sleeping for 3s...')
    time.sleep(3)
    print('Putting an item...')
    dynamodb.put_item(**add_params)
    print('''
================ ATOMIC WRITE-DELETE-WRITE PROCESS ====================
''')
    lck_snd.send('LOCK_UP')
def read_dynamodb(lock_pipe):
    lck_rcv, _ = lock_pipe
    dynamodb = boto3.client("dynamodb")
    params = {
        "TableName": "Testing",
        "Key": {'stream_url': {"S": 'stream_url_test'}}
    }
    # Block until the writer signals that the atomic section is finished
    while lck_rcv.recv() != 'LOCK_UP':
        pass
    db_response = dynamodb.get_item(**params)
    items = db_response.get('Item')
    print(f'''
================ READ PROCESS ====================
Data read:
{items}
================ READ PROCESS ====================
''')
if __name__ == '__main__':
    lock_pipe_receive, lock_pipe_send = Pipe()
    Process(name="write_and_del_dynamodb", target=write_and_del_dynamodb, args=((lock_pipe_receive, lock_pipe_send),)).start()
    time.sleep(1)
    Process(name="read_dynamodb", target=read_dynamodb, args=((lock_pipe_receive, lock_pipe_send),)).start()
In the code above, two processes share the same resource, i.e., the DynamoDB table Testing.
The working principle:
The first process starts by sending 'LOCK_DOWN' on the lock_pipe_send end to signify the start of an atomic procedure.
The first process writes some data to DynamoDB.
The first process deletes the data from DynamoDB.
The first process writes the data to DynamoDB again.
The first process sends 'LOCK_UP' on the lock_pipe_send end to signal to the second process that the data may be read.
The second process (which was waiting on the while lck_rcv.recv() != 'LOCK_UP' condition) receives the LOCK_UP message and reads the data from DynamoDB.
My question:
How can I implement this behavior in a distributed manner, i.e., when the two processes run on different physical machines in AWS, as shown in the following image?
Thanks in advance.
I have a Cloud Function that is triggered by Cloud Pub/Sub. I want the same function to trigger Dataflow using the Python SDK. Here is my code:
import base64
def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))
I deploy the function this way:
gcloud beta functions deploy hello_pubsub --runtime python37 --trigger-topic topic1
You have to embed your pipeline's Python code with your function. When your function is called, you simply call the pipeline's main Python function, which executes the pipeline in your file.
If you developed and tried your pipeline in Cloud Shell and have already run it on Dataflow, your code should have this structure:
def run(argv=None, save_main_session=True):
    # Parse argument
    # Set options
    # Start Pipeline in p variable
    # Perform your transform in Pipeline
    # Run your Pipeline
    result = p.run()
    # Wait the end of the pipeline
    result.wait_until_finish()
Thus, call this function with the correct arguments, especially runner=DataflowRunner, so that the Python code launches the pipeline on the Dataflow service.
Remove the result.wait_until_finish() call at the end, because your function won't live for the whole duration of the Dataflow job.
You can also use a template if you want.
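As a rough sketch of the "call run() with the correct arguments" step, the helper below builds the argv list that a pipeline's run(argv) function would typically parse. The function name build_dataflow_argv and all the values passed to it are assumptions for illustration; the flags themselves (--runner, --project, --region, --temp_location, --job_name) are standard Beam pipeline options, but check which ones your own pipeline expects.

```python
def build_dataflow_argv(project, region, temp_location, job_name):
    """Return command-line style pipeline options for launching on Dataflow."""
    return [
        "--runner=DataflowRunner",          # submit to the Dataflow service instead of running locally
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",  # a gs:// path Dataflow can use for temp files
        f"--job_name={job_name}",
    ]


# Hypothetical values; inside the Cloud Function you would then call
# your pipeline's own entry point, e.g. run(argv)
argv = build_dataflow_argv("my-project", "us-central1", "gs://my-bucket/tmp", "midnight-job")
print(argv)
```

Keeping the argument construction in its own function makes it easy to derive values such as the job name from the Pub/Sub message.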
You can use Cloud Dataflow templates to launch your job. You will need to code the following steps:
Retrieve credentials
Generate Dataflow service instance
Get GCP PROJECT_ID
Generate template body
Execute template
Here is an example using your base code (feel free to split into multiple methods to reduce code inside hello_pubsub method).
from googleapiclient.discovery import build
import base64
import google.auth
import os
def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    # Retrieve credentials and generate the Dataflow service instance
    credentials, _ = google.auth.default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    # Get the GCP project ID
    gcp_project = os.environ["GCLOUD_PROJECT"]
    # Generate the template body
    template_path = "gs://template_file_path_on_storage/"
    template_body = {
        "parameters": {
            "keyA": "valueA",
            "keyB": "valueB",
        },
        "environment": {
            "envVariable": "value"
        }
    }
    # Execute the template
    request = service.projects().templates().launch(projectId=gcp_project, gcsPath=template_path, body=template_body)
    response = request.execute()
    print(response)
In the template_body variable, the parameters values are the arguments that will be sent to your pipeline, and the environment values are used by the Dataflow service (service account, workers, and network configuration).
LaunchTemplateParameters documentation
RuntimeEnvironment documentation
I have Pub/Sub subscribe logic wrapped inside a subscribe method that is called once during service initialization for every subscription:
def subscribe(self,
              callback: typing.Callable,
              subscription_name: str,
              topic_name: str,
              project_name: str = None) -> typing.Optional[SubscriberClient]:
    """Subscribes to Pub/Sub topic and return subscriber client
    :param callback: subscription callback method
    :param subscription_name: name of the subscription
    :param topic_name: name of the topic
    :param project_name: optional project name. Uses default project if not set
    :return: subscriber client or None if testing
    """
    project = project_name if project_name else self.pubsub_project_id
    self.logger.info('Subscribing to project `{}`, topic `{}`'.format(project, topic_name))
    project_path = self.pubsub_subscriber.project_path(project)
    topic_path = self.pubsub_subscriber.topic_path(project, topic_name)
    subscription_path = self.pubsub_subscriber.subscription_path(project, subscription_name)
    # check if there is an existing subscription, if not, create it
    if subscription_path not in [s.name for s in self.pubsub_subscriber.list_subscriptions(project_path)]:
        self.logger.info('Creating new subscription `{}`, topic `{}`'.format(subscription_name, topic_name))
        self.pubsub_subscriber.create_subscription(subscription_path, topic_path)
    # subscribe to the topic
    self.pubsub_subscriber.subscribe(
        subscription_path, callback=callback,
        scheduler=self.thread_scheduler
    )
    return self.pubsub_subscriber
This method is called like this:
self.subscribe_client = self.subscribe(
    callback=self.pubsub_callback,
    subscription_name='subscription_topic',
    topic_name='topic'
)
The callback method does a bunch of stuff, sends 2 emails, then acknowledges the message:
def pubsub_callback(self, data: gcloud_pubsub_subscriber.Message):
    self.logger.debug('Processing pub sub message')
    try:
        self.do_something_with_message(data)
        self.logger.debug('Acknowledging the message')
        data.ack()
        self.logger.debug('Acknowledged')
        return
    except:
        self.logger.warning({
            "message": "Failed to process Pub/Sub message",
            "request_size": data.size,
            "data": data.data
        }, exc_info=True)
        self.logger.debug('Acknowledging the message 2')
        data.ack()
When I push something to the subscription, the callback runs and prints all the debug messages, including Acknowledged. The message, however, stays in Pub/Sub, the callback gets called again, and each retry takes exponentially longer. The question is: what could cause the message to stay in Pub/Sub even after ack is called?
I have several such subscriptions, and all of them work as expected. The ack deadline is not the cause: the callback finishes almost immediately, and I experimented with the ack deadline anyway; nothing helped.
When I process these messages from a locally running app connected to the same Pub/Sub subscription, it completes just fine and the acknowledgement takes the message out of the queue as expected.
So the problem manifests only in the deployed service (running inside a Kubernetes pod).
The callback executes, but ack seemingly does nothing.
Acking messages from a script running locally (...and doing the exact same stuff) or through the GCP UI works as expected.
Any ideas?
Acknowledgements are best-effort in Pub/Sub, so it's possible but unusual for messages to be redelivered.
If you are consistently receiving duplicates, it might be due to duplicate publishes of the same message contents. As far as Pub/Sub is concerned, these are different messages and will be assigned different message IDs. Check the Pub/Sub-provided message IDs to ensure that you are actually receiving the same message multiple times.
There is an edge case in dealing with large backlogs of small messages with streaming pull (which is what the Python client library uses). If you are running multiple clients subscribing on the same subscription, this edge case may be relevant.
You can also check your subscription's Stackdriver metrics to see:
if its acks are being sent successfully (subscription/ack_message_count)
if its backlog is decreasing (subscription/backlog_bytes)
if your subscriber is missing the ack deadline (subscription/streaming_pull_ack_message_operation_count filtered by response_code != "success")
If you're not missing the ack deadline and your backlog is remaining steady, you should contact Google Cloud support with your project name, subscription name, and a sample of the duplicate message IDs. They will be able to investigate why these duplicates are happening.
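To tell redelivery apart from duplicate publishes, you could log the message ID and payload in your callback and then group them, roughly as in the sketch below. The function name classify_duplicates and the (message_id, data) record format are assumptions made for this example, not part of the Pub/Sub client library.

```python
from collections import Counter


def classify_duplicates(records):
    """records: iterable of (message_id, data) pairs logged by the callback.

    Returns (redelivered_ids, republished_payloads):
    - IDs seen more than once mean Pub/Sub redelivered the same message.
    - Payloads seen under several different IDs mean the publisher sent them more than once.
    """
    records = list(records)
    id_counts = Counter(message_id for message_id, _ in records)
    ids_by_payload = {}
    for message_id, data in records:
        ids_by_payload.setdefault(data, set()).add(message_id)
    redelivered = sorted(mid for mid, count in id_counts.items() if count > 1)
    republished = sorted(data for data, ids in ids_by_payload.items() if len(ids) > 1)
    return redelivered, republished


# Hypothetical log: message "1" arrived twice, payload b"b" was published twice
sample = [("1", b"a"), ("1", b"a"), ("2", b"b"), ("3", b"b")]
print(classify_duplicates(sample))
```

If only the first list is non-empty, the acks are not taking effect; if only the second is, look at the publisher instead.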
I did some additional testing and I finally found the problem.
TL;DR: I was using the same google.cloud.pubsub_v1.subscriber.scheduler.ThreadScheduler for all subscriptions.
Here are the snippets of the code I used to test it. This is the broken version:
server.py
import concurrent.futures.thread
import os
import time
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler
def create_subscription(project_id, topic_name, subscription_name):
    """Create a new pull subscription on the given topic."""
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path(project_id, topic_name)
    subscription_path = subscriber.subscription_path(
        project_id, subscription_name)
    subscription = subscriber.create_subscription(
        subscription_path, topic_path)
    print('Subscription created: {}'.format(subscription))
def receive_messages(project_id, subscription_name, t_scheduler):
    """Receives messages from a pull subscription."""
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(
        project_id, subscription_name)
    def callback(message):
        print('Received message: {}'.format(message.data))
        message.ack()
    subscriber.subscribe(subscription_path, callback=callback, scheduler=t_scheduler)
    print('Listening for messages on {}'.format(subscription_path))
project_id = os.getenv("PUBSUB_PROJECT_ID")
publisher = pubsub_v1.PublisherClient()
project_path = publisher.project_path(project_id)
# Create both topics
try:
    topics = [topic.name.split('/')[-1] for topic in publisher.list_topics(project_path)]
    if 'topic_a' not in topics:
        publisher.create_topic(publisher.topic_path(project_id, 'topic_a'))
    if 'topic_b' not in topics:
        publisher.create_topic(publisher.topic_path(project_id, 'topic_b'))
except AlreadyExists:
    print('Topics already exists')
# Create subscriptions on both topics
sub_client = pubsub_v1.SubscriberClient()
project_path = sub_client.project_path(project_id)
try:
    subs = [sub.name.split('/')[-1] for sub in sub_client.list_subscriptions(project_path)]
    if 'topic_a_sub' not in subs:
        create_subscription(project_id, 'topic_a', 'topic_a_sub')
    if 'topic_b_sub' not in subs:
        create_subscription(project_id, 'topic_b', 'topic_b_sub')
except AlreadyExists:
    print('Subscriptions already exists')
# The same scheduler instance is shared by both subscriptions (this is the broken part)
scheduler = ThreadScheduler(concurrent.futures.thread.ThreadPoolExecutor(10))
receive_messages(project_id, 'topic_a_sub', scheduler)
receive_messages(project_id, 'topic_b_sub', scheduler)
while True:
    time.sleep(60)
client.py
import datetime
import os
import random
import sys
from time import sleep
from google.cloud import pubsub_v1
def publish_messages(pid, topic_name):
    """Publishes multiple messages to a Pub/Sub topic."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(pid, topic_name)
    for n in range(1, 10):
        data = '[{} - {}] Message number {}'.format(datetime.datetime.now().isoformat(), topic_name, n)
        data = data.encode('utf-8')
        publisher.publish(topic_path, data=data)
        sleep(random.randint(10, 50) / 10.0)
project_id = os.getenv("PUBSUB_PROJECT_ID")
publish_messages(project_id, sys.argv[1])
I connected to Cloud Pub/Sub, and the server created the topics and subscriptions. Then I ran the client script multiple times in parallel for both topics. After a short while, once I changed the server code to instantiate a new thread scheduler inside the receive_messages scope, the server cleaned up both topics and functioned as expected.
The confusing thing is that in either case, the server printed out the received message for all the messages.
I am going to post this to https://github.com/googleapis/google-cloud-python/issues
When using the Python client API for the Google Cloud Scheduler I always get the above error message for some reason. I also tried to start the parent path without the slash but got the same result.
Any hint is much appreciated!
import os
from google.cloud import scheduler_v1
def gcloudscheduler(data, context):
    current_folder = os.path.dirname(os.path.abspath(__file__))
    abs_auth_path = os.path.join(current_folder, 'auth.json')
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = abs_auth_path
    response = scheduler_v1.CloudSchedulerClient().create_job(data["parent"], data["job"])
    print(response)
I used the following parameters:
{
  "job": {
    "pubsub_target": {
      "topic_name": "trade-tests",
      "attributes": {
        "attrKey": "attrValue"
      }
    },
    "schedule": "* * * * *"
  },
  "parent": "/projects/my-project-id/locations/europe-west1"
}
The problem was actually not the parent parameter but the incorrect format of the topic name. It should have been projects/my-project-id/topics/trade-tests, even though the error message says it should start with a slash. This is in line with the API docs here and here.
The problem was just that the error message didn't say which resource name it was complaining about.
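For illustration, here is a small sketch that builds the request with fully qualified resource names, so that the topic is never passed as a bare name. The helper names (topic_resource_name, location_resource_name, build_job_request) are hypothetical; only the projects/.../topics/... and projects/.../locations/... formats come from the API.

```python
def topic_resource_name(project_id, topic):
    # Pub/Sub topics are addressed as projects/PROJECT_ID/topics/TOPIC
    return f"projects/{project_id}/topics/{topic}"


def location_resource_name(project_id, location):
    # Cloud Scheduler parents are addressed as projects/PROJECT_ID/locations/LOCATION
    return f"projects/{project_id}/locations/{location}"


def build_job_request(project_id, location, topic, schedule):
    """Build the data dict that gcloudscheduler(data, context) reads from."""
    return {
        "parent": location_resource_name(project_id, location),
        "job": {
            "pubsub_target": {
                "topic_name": topic_resource_name(project_id, topic),
                "attributes": {"attrKey": "attrValue"},
            },
            "schedule": schedule,
        },
    }


request = build_job_request("my-project-id", "europe-west1", "trade-tests", "* * * * *")
print(request["job"]["pubsub_target"]["topic_name"])
```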
I have an HTTP triggered Consumption plan Azure Function that I want to keep warm by POSTing an empty payload to it regularly.
I am doing this with a Scheduled Function with this configuration:
__init__.py
import os
import datetime
import logging
import azure.functions as func
import urllib.parse, urllib.request, urllib.error
def main(mytimer: func.TimerRequest) -> None:
    try:
        url = f"https://FUNCTIONNAME.azurewebsites.net/api/predictor?code={os.environ['CODE']}"
        request = urllib.request.Request(url, {})
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as e:
        message = e.read().decode()
        if message == "expected outcome":
            pass
        else:
            logging.info(f"Error: {message}")
function.json
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "mytimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */9 5-17 * * 1-5"
    }
  ]
}
When I inspect my logs they are filled with HTML. Here is a snippet of the HTML:
...
<h1>Server Error</h1>
...
<h2>502 - Web server received an invalid response while acting as a gateway or proxy server.</h2>
<h3>There is a problem with the page you are looking for, and it cannot be displayed. When the Web server (while acting as a gateway or proxy) contacted the upstream content server, it received an invalid response from the content server.</h3>
Running the logic of __init__.py locally works fine. What might be wrong here?
Hmm... That is weird. It looks like the request could not be routed to the correct instance, I guess.
BTW, I believe you could simply put the timer-triggered function in the same function app as the one you want to keep warm. The timer function doesn't really have to do anything either.
Also, you might want to take a look at Azure Functions Premium, which supports pre-warmed instances. Note that this is still in preview.