I'm working on a project in which I have two actions - delete and write, and I need to make them "atomic", i.e., to be performed without any other process reading from the database after the data was deleted but not yet written.
The code under the Concept Demonstration rubric demonstrates the desired outcome with the use of multiprocessing.Process and multiprocessing.Pipe objects:
Concept Demonstration:
import time
from datetime import datetime
import boto3
from multiprocessing import Process, Pipe
from boto3.dynamodb.conditions import Key, Attr

def write_and_del_dynamodb(lock_pipe):
    _, lck_snd = lock_pipe
    dynamodb = boto3.client("dynamodb")
    add_params = {
        "TableName": "Testing",
        "Item": {
            "stream_url": {"S": "stream_url_test"},
            "added_time_utc": {"S": str(datetime.utcnow())},
        },
        # Only write if no item with this key exists yet
        "ConditionExpression": 'attribute_not_exists(stream_url)'
    }
    del_params = {
        "TableName": "Testing",
        "Key": {'stream_url': {"S": 'stream_url_test'}}
    }
    # Signal the start of the protected section
    lck_snd.send('LOCK_DOWN')
    print('''
================ ATOMIC WRITE-DELETE-WRITE PROCESS ====================
''')
    print('Putting an item...')
    dynamodb.put_item(**add_params)
    print('Sleeping for 1s...')
    time.sleep(1)
    print("Deleting an item...")
    dynamodb.delete_item(**del_params)
    print('Sleeping for 3s...')
    time.sleep(3)
    print('Putting an item...')
    dynamodb.put_item(**add_params)
    print('''
================ ATOMIC WRITE-DELETE-WRITE PROCESS ====================
''')
    # Signal that the data may be read again
    lck_snd.send('LOCK_UP')

def read_dynamodb(lock_pipe):
    lck_rcv, _ = lock_pipe
    dynamodb = boto3.client("dynamodb")
    params = {
        "TableName": "Testing",
        "Key": {'stream_url': {"S": 'stream_url_test'}}
    }
    # Block until the writer reports that it is done
    while lck_rcv.recv() != 'LOCK_UP':
        pass
    db_response = dynamodb.get_item(**params)
    items = db_response.get('Item')
    print(f'''
================ READ PROCESS ====================
Data read:
{items}
================ READ PROCESS ====================
''')

if __name__ == '__main__':
    lock_pipe_receive, lock_pipe_send = Pipe()
    Process(name="write_and_del_dynamodb", target=write_and_del_dynamodb, args=((lock_pipe_receive, lock_pipe_send),)).start()
    time.sleep(1)
    Process(name="read_dynamodb", target=read_dynamodb, args=((lock_pipe_receive, lock_pipe_send),)).start()
In the code above, two processes share the same resource, i.e., the DynamoDB table Testing.
The working principle:
The first process starts by sending 'LOCK_DOWN' on the lock_pipe_send end to signify the start of an atomic procedure.
The first process writes some data to DynamoDB.
The first process deletes the data from DynamoDB.
The first process writes the data to DynamoDB again.
The first process sends the 'LOCK_UP' on the lock_pipe_send end to signal the second process that the data may be read.
The second process (which was waiting on the while lck_rcv.recv() != 'LOCK_UP' condition) receives the LOCK_UP message, and reads the data from the dynamoDB.
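The handshake described in these steps can be tried without AWS. The sketch below is only an illustration under assumptions: an in-memory dict (`fake_table`, a hypothetical stand-in for the Testing table) replaces DynamoDB, and threads replace the two processes so that it runs anywhere. The name `run_handshake_demo` is made up for this example.

```python
import threading
from multiprocessing import Pipe


def run_handshake_demo():
    """Run the LOCK_DOWN / LOCK_UP handshake against an in-memory stand-in table."""
    fake_table = {}  # hypothetical stand-in for the DynamoDB table "Testing"
    lck_rcv, lck_snd = Pipe()
    seen = {}

    def writer():
        lck_snd.send('LOCK_DOWN')               # start of the protected section
        fake_table['stream_url_test'] = 'v1'    # write
        del fake_table['stream_url_test']       # delete
        fake_table['stream_url_test'] = 'v2'    # write again
        lck_snd.send('LOCK_UP')                 # the reader may proceed now

    def reader():
        # Block until the writer reports that the protected section is over
        while lck_rcv.recv() != 'LOCK_UP':
            pass
        seen['item'] = fake_table.get('stream_url_test')

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen['item']


if __name__ == '__main__':
    print(run_handshake_demo())
```

The reader never observes the table between the delete and the second write, because it does not touch the table until 'LOCK_UP' arrives.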
My question:
How can I implement this behavior in a distributed manner, i.e., when the two processes run on different physical machines in AWS, as shown in the following image?
Thanks in advance.
enter image description here
I am trying to call a stored procedure from a Groovy script using the ExecuteScript processor (I am using Groovy because I want to capture the response of the stored procedure).
However, the flow files get stuck, and they only pass through when I restart the processor.
The same code works without any issue in another environment.
Below is the code I am using to call the stored procedure:
import org.apache.commons.io.IOUtils
import org.apache.nifi.controller.ControllerService
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.*
import groovy.sql.OutParameter
import groovy.sql.Sql
import java.sql.ResultSet
import java.sql.Clob
try{
def lookup = context.controllerServiceLookup
def dbServiceName = ConncationPool.value
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find {
cs -> lookup.getControllerServiceName(cs) == dbServiceName
}
def conn = lookup.getControllerService(dbcpServiceId).getConnection();
sql = Sql.newInstance(conn);
def flowFile = session.get()
if(!flowFile) return
attr1= flowFile.getAttribute('attr1')
attr2= flowFile.getAttribute('attr2')
attr3= flowFile.getAttribute('attr3')
def data = []
String sqlString ="""{call procedure_name(?,?,?,?)}""";
def OUT_JSON
def parametersList = [attr1,attr2,attr3,Sql.VARCHAR];
sql.call(sqlString, parametersList) {out_json_response ->
OUT_JSON = out_json_response
};
def attrMap = ['out_json_response':String.valueOf(OUT_JSON),'Conn':String.valueOf(conn)]
flowFile = session.putAllAttributes(flowFile, attrMap)
conn.close()
sql.close();
session.transfer(flowFile, REL_SUCCESS)
}
catch (e){
if (conn != null) conn.close();
if (sql != null) sql.close();
log.error('Scripting error', e)
flowFile = session.putAttribute(flowFile, "error", e.getMessage())
session.transfer(flowFile, REL_FAILURE)
} finally {
if (conn != null) conn.close();
if (sql != null) sql.close();
}
Can you please help me solve this issue? Has anyone faced the same problem?
I cannot see the run schedule in your screenshot.
So first go to the configuration by right clicking on the processor.
There you'll find a tab named Scheduling. Click on it.
Now, check if the Scheduling Strategy is marked as CRON driven or Timer driven.
There you can also check the Run Schedule.
Since you mentioned that your script runs after a restart or every 15 minutes, I suspect that your Run Schedule is set to run every ~15 minutes.
Verify that, and if that is the case, stop the processor and change the configuration to:
Scheduling Strategy: Timer Driven
Run Schedule: 0 sec
NOTE: If the initial flow was not developed by you, check before making any changes that changing the scheduling of the processor will not have any undesired effect, or at least find out why it was set to run every 15 minutes.
We are sending Avro data encoded with (azure.schemaregistry.encoder.avroencoder) to Event-Hub using a standalone python job and we can deserialize using the same decoder using another standalone python consumer. The schema registry is also supplied to the Avro encoder in this case.
This is the standalone producer I use:
import os
from azure.eventhub import EventHubProducerClient, EventData
from azure.schemaregistry import SchemaRegistryClient
from azure.schemaregistry.encoder.avroencoder import AvroEncoder
from azure.identity import DefaultAzureCredential
os.environ["AZURE_CLIENT_ID"] = ''
os.environ["AZURE_TENANT_ID"] = ''
os.environ["AZURE_CLIENT_SECRET"] = ''
token_credential = DefaultAzureCredential()
fully_qualified_namespace = ""
group_name = "testSchemaReg"
eventhub_connection_str = ""
eventhub_name = ""
definition = """
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}"""
schema_registry_client = SchemaRegistryClient(fully_qualified_namespace, token_credential)
avro_encoder = AvroEncoder(client=schema_registry_client, group_name=group_name, auto_register=True)
eventhub_producer = EventHubProducerClient.from_connection_string(
conn_str=eventhub_connection_str,
eventhub_name=eventhub_name
)
with eventhub_producer, avro_encoder:
    event_data_batch = eventhub_producer.create_batch()
    dict_content = {"name": "Bob", "favorite_number": 7, "favorite_color": "red"}
    event_data = avro_encoder.encode(dict_content, schema=definition, message_type=EventData)
    event_data_batch.add(event_data)
    eventhub_producer.send_batch(event_data_batch)
I was able to deserialize using the standalone consumer:
async def on_event(partition_context, event):
    print("Received the event: \"{}\" from the partition with ID: \"{}\"".format(event.body_as_str(encoding='UTF-8'),
                                                                                 partition_context.partition_id))
    print("message type is :")
    print(type(event))
    dec = avro_encoder.decode(event)
    print("decoded msg:\n")
    print(dec)
    await partition_context.update_checkpoint(event)

async def main():
    client = EventHubConsumerClient.from_connection_string(
        "connection str"
        "topic name",
        consumer_group="$Default",
        eventhub_name="")
    async with client:
        await client.receive(on_event=on_event, starting_position="-1")
As a next step, I replaced the standalone Python consumer with a PySpark consumer running in a Synapse notebook. Below are the problems I faced:
The from_avro function in Spark is not able to deserialize the Avro message encoded with the Azure encoder.
As a workaround, I tried creating a UDF that makes use of the Azure encoder, but the Azure encoder expects the event to be of type EventData, whereas when Spark reads the data using the Event Hubs API we get the data as a byte array.
#udf
def decode(row_msg):
    encoder = AvroEncoder(client=schema_registry_client)
    encoder.decode(bytes(row_msg))
I don't see any proper documentation on a deserializer that can be used with Spark or any other distributed system.
All the examples use standalone clients. Is there a connector that we can use with Spark/Flink?
Answering my own question: the Azure Event Hubs schema registry doesn't support Spark or any other distributed system.
They are working on adding this support to Spark:
https://github.com/Azure/azure-event-hubs-spark/pull/573
Because the avro schema is not part of the payload ("the data"), the from_avro function in spark will not be able to deserialize the message. This should be expected.
In order to decode, you also need to pass the associated content_type value on the EventData object into the decode method. This content_type value holds the schema ID that will be used to retrieve the schema used for deserialization. You can set content_type along with content in the MessageContent dict. This sample should be helpful.
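As a rough sketch of that idea, the helpers below pair raw payload bytes with a content type before handing them to a decoder. This is an illustration under assumptions, not the library's documented recipe: `to_message_content` and `decode_row` are hypothetical names, the decoder is passed in as a plain callable, and how the content type is obtained from a Spark row is not covered here and has to be checked against the connector in use.

```python
from typing import Any, Callable, Dict


def to_message_content(body: Any, content_type: str) -> Dict[str, Any]:
    """Pair the raw payload with its content type (the value that carries the schema ID)."""
    return {"content": bytes(body), "content_type": content_type}


def decode_row(decode: Callable[[Dict[str, Any]], Any], body: Any, content_type: str) -> Any:
    """Decode one record with the supplied decoder callable.

    `decode` is an assumption of this sketch: any callable that accepts a dict
    with "content" and "content_type" keys.
    """
    return decode(to_message_content(body, content_type))
```

With the encoder from the question, the call would look like `decode_row(avro_encoder.decode, row_body, row_content_type)`, where `row_body` and `row_content_type` are placeholders for values taken from the incoming record.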
We currently don't have a connector to be used with Spark/Flink. However, if this is something you're interested in, please feel free to file a feature-request issue here: https://github.com/Azure/azure-sdk-for-python/issues.
I've set up a Python script that takes certain BigQuery tables from one dataset, cleans them with a SQL query, and adds the cleaned tables to a new dataset. This script works correctly. I want to set this up as a Cloud Function that triggers at midnight every day.
I've also used Cloud Scheduler to send a message to a Pub/Sub topic at midnight every day. I've verified that this works correctly. I am new to Pub/Sub, but I followed the tutorial in the documentation and managed to set up a test Cloud Function that prints out hello world when it gets a push notification from Pub/Sub.
However, my issue is that when I try to combine the two and automate my script - I get a log message that the execution crashed:
Function execution took 1119 ms, finished with status: 'crash'
To help you understand what I'm doing, here is the code in my main.py:
# Global libraries
import base64
# Local libraries
from scripts.one_minute_tables import helper

def one_minute_tables(event, context):
    # Log out the message that triggered the function
    print("""This Function was triggered by messageId {} published at {}
    """.format(context.event_id, context.timestamp))
    # Get the message from the event data
    name = base64.b64decode(event['data']).decode('utf-8')
    # If it's the message for the daily midnight schedule, execute function
    if name == 'midnight':
        helper.format_tables('raw_data','table1')
    else:
        pass
For the sake of convenience, this is a simplified version of my python script:
# Global libraries
from google.cloud import bigquery
import os
# Login to bigquery by providing credentials
credential_path = 'secret.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

def format_tables(dataset, list_of_tables):
    # Initialize the client
    client = bigquery.Client()
    # Loop through the list of tables
    for table in list_of_tables:
        # Create the query object
        script = f"""
        SELECT *
        FROM {dataset}.{table}
        """
        # Call the API
        query = client.query(script)
        # Wait for job to finish
        results = query.result()
        # Print
        print('Data cleaned and updated in table: {}.{}'.format(dataset, table))
This is my folder structure:
And my requirements.txt file has only one entry in it: google-cloud-bigquery==1.24.0
I'd appreciate your help in figuring out what I need to fix to run this script with the pubsub trigger without getting a log message that says the execution crashed.
EDIT: Based on the comments, this is the log of the function crash
{
"textPayload": "Function execution took 1078 ms, finished with status: 'crash'",
"insertId": "000000-689fdf20-aee2-4900-b5a1-91c34d7c1448",
"resource": {
"type": "cloud_function",
"labels": {
"function_name": "one_minute_tables",
"region": "us-central1",
"project_id": "PROJECT_ID"
}
},
"timestamp": "2020-05-15T16:53:53.672758031Z",
"severity": "DEBUG",
"labels": {
"execution_id": "x883cqs07f2w"
},
"logName": "projects/PROJECT_ID/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
"trace": "projects/PROJECT_ID/traces/f391b48a469cbbaeccad5d04b4a704a0",
"receiveTimestamp": "2020-05-15T16:53:53.871051291Z"
}
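The log entry above carries no traceback. One way to surface the underlying exception is to call the entry point locally with a hand-built event. The helpers below are a minimal sketch under assumptions: `make_pubsub_event`, `make_context` and `decode_name` are hypothetical names, and the context object only mimics the two attributes that main.py reads.

```python
import base64
from types import SimpleNamespace


def make_pubsub_event(message: str) -> dict:
    """Build an event dict shaped like the one main.py decodes."""
    return {'data': base64.b64encode(message.encode('utf-8')).decode('ascii')}


def make_context(event_id: str = 'local-test', timestamp: str = '1970-01-01T00:00:00Z'):
    """Stand-in for the context argument; only event_id and timestamp are provided."""
    return SimpleNamespace(event_id=event_id, timestamp=timestamp)


def decode_name(event: dict) -> str:
    """Same decoding step as in main.py, isolated so it can be checked on its own."""
    return base64.b64decode(event['data']).decode('utf-8')
```

Running `one_minute_tables(make_pubsub_event('midnight'), make_context())` from a local Python session would then raise the real exception in the terminal instead of only logging 'crash'.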
The problem comes from the list_of_tables argument. You call your function like this:
if name == 'midnight':
    helper.format_tables('raw_data','table1')
and then you iterate over the string 'table1', so the loop runs once per character ('t', 'a', 'b', ...) instead of once per table.
Pass a list instead, and it should work:
if name == 'midnight':
    helper.format_tables('raw_data',['table1'])
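The difference is easy to see in isolation. The snippet below is only an illustration: `tables_to_query` is a hypothetical helper that mirrors the loop in format_tables without touching BigQuery.

```python
def tables_to_query(dataset, list_of_tables):
    """Return the fully qualified names the loop in format_tables would visit."""
    return [f"{dataset}.{table}" for table in list_of_tables]


# Passing a bare string iterates over its characters
print(tables_to_query('raw_data', 'table1'))
# Passing a list iterates over table names
print(tables_to_query('raw_data', ['table1']))
```

With the string argument, the queries would target names such as raw_data.t and raw_data.a, which do not exist.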
I have Pub/Sub subscribe logic wrapped inside a subscribe method that is being called once during service initialization for every subscription:
def subscribe(self,
callback: typing.Callable,
subscription_name: str,
topic_name: str,
project_name: str = None) -> typing.Optional[SubscriberClient]:
"""Subscribes to Pub/Sub topic and return subscriber client
:param callback: subscription callback method
:param subscription_name: name of the subscription
:param topic_name: name of the topic
:param project_name: optional project name. Uses default project if not set
:return: subscriber client or None if testing
"""
project = project_name if project_name else self.pubsub_project_id
self.logger.info('Subscribing to project `{}`, topic `{}`'.format(project, topic_name))
project_path = self.pubsub_subscriber.project_path(project)
topic_path = self.pubsub_subscriber.topic_path(project, topic_name)
subscription_path = self.pubsub_subscriber.subscription_path(project, subscription_name)
# check if there is an existing subscription, if not, create it
if subscription_path not in [s.name for s in self.pubsub_subscriber.list_subscriptions(project_path)]:
self.logger.info('Creating new subscription `{}`, topic `{}`'.format(subscription_name, topic_name))
self.pubsub_subscriber.create_subscription(subscription_path, topic_path)
# subscribe to the topic
self.pubsub_subscriber.subscribe(
subscription_path, callback=callback,
scheduler=self.thread_scheduler
)
return self.pubsub_subscriber
This method is called like this:
self.subscribe_client = self.subscribe(
callback=self.pubsub_callback,
subscription_name='subscription_topic',
topic_name='topic'
)
The callback method does a bunch of stuff, sends 2 emails then acknowledges the message
def pubsub_callback(self, data: gcloud_pubsub_subscriber.Message):
    self.logger.debug('Processing pub sub message')
    try:
        self.do_something_with_message(data)
        self.logger.debug('Acknowledging the message')
        data.ack()
        self.logger.debug('Acknowledged')
        return
    except:
        self.logger.warning({
            "message": "Failed to process Pub/Sub message",
            "request_size": data.size,
            "data": data.data
        }, exc_info=True)
    self.logger.debug('Acknowledging the message 2')
    data.ack()
When I push something to the subscription, the callback runs and prints all the debug messages, including Acknowledged. The message, however, stays in Pub/Sub, the callback gets called again, and it takes exponentially longer after each retry. The question is: what could cause the message to stay in Pub/Sub even after ack is called?
I have several such subscriptions, and all of them work as expected. The ack deadline does not seem to be the cause: the callback finishes almost immediately, and I experimented with the ack deadline anyway; nothing helped.
When I process these messages from a locally running app connected to the same Pub/Sub subscription, it completes just fine and the ack removes the message from the queue as expected.
So the problem manifests only in the deployed service (running inside a Kubernetes pod).
The callback executes, but ack seemingly does nothing.
Acking messages from a script running locally (doing exactly the same thing) or through the GCP UI works as expected.
Any ideas?
Acknowledgements are best-effort in Pub/Sub, so it's possible but unusual for messages to be redelivered.
If you are consistently receiving duplicates, it might be due to duplicate publishes of the same message contents. As far as Pub/Sub is concerned, these are different messages and will be assigned different message IDs. Check the Pub/Sub-provided message IDs to ensure that you are actually receiving the same message multiple times.
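One way to run that check is to count deliveries per message ID inside the callback. The class below is a minimal sketch, not part of the client library: `DeliveryTracker` is a hypothetical name, and it only assumes that each message object exposes a `message_id` attribute.

```python
import threading
from collections import Counter


class DeliveryTracker:
    """Count how often each Pub/Sub message ID reaches the callback."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()  # callbacks may run on several threads

    def wrap(self, callback):
        """Return a callback that records the message ID, then calls the original."""
        def tracked(message):
            with self._lock:
                self._counts[message.message_id] += 1
            return callback(message)
        return tracked

    def redelivered(self):
        """Message IDs that were delivered more than once, with their counts."""
        with self._lock:
            return {mid: n for mid, n in self._counts.items() if n > 1}
```

If `redelivered()` stays empty while the handler still sees repeated payloads, the repeats are separate publishes rather than redeliveries of the same message.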
There is an edge case in dealing with large backlogs of small messages with streaming pull (which is what the Python client library uses). If you are running multiple clients subscribing on the same subscription, this edge case may be relevant.
You can also check your subscription's Stackdriver metrics to see:
if its acks are being sent successfully (subscription/ack_message_count)
if its backlog is decreasing (subscription/backlog_bytes)
if your subscriber is missing the ack deadline (subscription/streaming_pull_ack_message_operation_count filtered by response_code != "success")
If you're not missing the ack deadline and your backlog is remaining steady, you should contact Google Cloud support with your project name, subscription name, and a sample of the duplicate message IDs. They will be able to investigate why these duplicates are happening.
I did some additional testing and I finally found the problem.
TL;DR: I was using the same google.cloud.pubsub_v1.subscriber.scheduler.ThreadScheduler for all subscriptions.
Here are the snippets of the code I used to test it. This is the broken version:
server.py
import concurrent.futures.thread
import os
import time
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler
def create_subscription(project_id, topic_name, subscription_name):
"""Create a new pull subscription on the given topic."""
subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, topic_name)
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
subscription = subscriber.create_subscription(
subscription_path, topic_path)
print('Subscription created: {}'.format(subscription))
def receive_messages(project_id, subscription_name, t_scheduler):
"""Receives messages from a pull subscription."""
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
def callback(message):
print('Received message: {}'.format(message.data))
message.ack()
subscriber.subscribe(subscription_path, callback=callback, scheduler=t_scheduler)
print('Listening for messages on {}'.format(subscription_path))
project_id = os.getenv("PUBSUB_PROJECT_ID")
publisher = pubsub_v1.PublisherClient()
project_path = publisher.project_path(project_id)
# Create both topics
try:
topics = [topic.name.split('/')[-1] for topic in publisher.list_topics(project_path)]
if 'topic_a' not in topics:
publisher.create_topic(publisher.topic_path(project_id, 'topic_a'))
if 'topic_b' not in topics:
publisher.create_topic(publisher.topic_path(project_id, 'topic_b'))
except AlreadyExists:
print('Topics already exists')
# Create subscriptions on both topics
sub_client = pubsub_v1.SubscriberClient()
project_path = sub_client.project_path(project_id)
try:
subs = [sub.name.split('/')[-1] for sub in sub_client.list_subscriptions(project_path)]
if 'topic_a_sub' not in subs:
create_subscription(project_id, 'topic_a', 'topic_a_sub')
if 'topic_b_sub' not in subs:
create_subscription(project_id, 'topic_b', 'topic_b_sub')
except AlreadyExists:
print('Subscriptions already exists')
scheduler = ThreadScheduler(concurrent.futures.thread.ThreadPoolExecutor(10))
receive_messages(project_id, 'topic_a_sub', scheduler)
receive_messages(project_id, 'topic_b_sub', scheduler)
while True:
time.sleep(60)
client.py
import datetime
import os
import random
import sys
from time import sleep
from google.cloud import pubsub_v1
def publish_messages(pid, topic_name):
"""Publishes multiple messages to a Pub/Sub topic."""
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(pid, topic_name)
for n in range(1, 10):
data = '[{} - {}] Message number {}'.format(datetime.datetime.now().isoformat(), topic_name, n)
data = data.encode('utf-8')
publisher.publish(topic_path, data=data)
sleep(random.randint(10, 50) / 10.0)
project_id = os.getenv("PUBSUB_PROJECT_ID")
publish_messages(project_id, sys.argv[1])
I connected to Cloud Pub/Sub, and the server created the topics and subscriptions. Then I ran the client script multiple times in parallel for both topics. After a short while, once I changed the server code to instantiate a new thread scheduler inside the receive_messages scope, the server cleaned up both topics and functioned as expected.
The confusing thing is that in either case the server printed out the received message for all the messages.
I am going to post this to https://github.com/googleapis/google-cloud-python/issues
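The fix described above, one scheduler per subscription instead of a shared one, can be sketched as follows. This is an illustration under assumptions: `subscribe_each` is a hypothetical helper, and the scheduler is produced by a factory passed in by the caller, so the sketch itself does not import the Pub/Sub library.

```python
def subscribe_each(subscriber, subscription_paths, callback, scheduler_factory):
    """Subscribe to every path, giving each subscription its own scheduler.

    `scheduler_factory` is called once per subscription, so no scheduler
    instance is shared between two subscriptions.
    """
    futures = {}
    for path in subscription_paths:
        futures[path] = subscriber.subscribe(
            path, callback=callback, scheduler=scheduler_factory()
        )
    return futures
```

With the names from server.py, the factory could be `lambda: ThreadScheduler(concurrent.futures.thread.ThreadPoolExecutor(10))`, which builds a fresh scheduler and thread pool for each subscription.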
I import data from Excel. The import code lives in BootStrap.groovy, and my import method is called when the application starts.
The scenario is that I have about 8000 related records to import at once if they are not already in my database. Also, when I deploy to Tomcat 6, the import blocks other apps from deploying until it finishes. So I want to run the script in a separate thread, without affecting performance and without blocking other apps from deployment.
Code excerpt:
class BootStrap {
def grailsApplication
def sessionFactory
def excelService
def importStateLgaArea(){
String fileName = grailsApplication.mainContext.servletContext.getRealPath('filename.xlsx')
ExcelImporter importer = new ExcelImporter(fileName)
def listState = importer.getStateLgaTerritoryList() //get the map,form excel
log.info "List form excel:${listState}"
def checkPreviousImport = Area.findByName('Osusu')
if(!checkPreviousImport) {
int i = 0
int j = 0 // update cases
def beforeTime = System.currentTimeMillis()
for(row in listState){
def state = State.findByName(row['state'])
if(!state) {
// log.info "Saving State:${row['state']}"
row['state'] = row['state'].toString().toLowerCase().capitalize()
// log.info "after capitalized" + row['state']
state = new State(name:row['state'])
if(!state.save(flush:true)){
log.info "${state.errors}"
break;
}
}
}
}
For importing large amounts of data, I suggest considering Spring Batch. It is easy to integrate into Grails. You can try this plugin or integrate it manually.