Hazelcast Jet. 'onComplete' event at sink? - hazelcast-jet

In a typical pipeline scenario, say I have a bounded stream, where I read from a file. Is there a way in Jet where I can subscribe to an "OnComplete" event, which will be triggered once the stream is written to sink?
I don't seem to find such an option. I want something like below.
p.readFrom(fileSource)
.writeTo(Sinks.logger())
.onComplete(doSomething());
Edit :
Reference for the comment.
BatchStage stage = p.readFrom(source).map(transform);
stage.map(enrich)
.writeTo(Sinks.filesBuilder(folder1).build())
.onComplete(doSomething1());
stage.map(enrich2)
.aggregate(...)
.writeTo(Sinks.filesBuilder(folder2).build())
.onComplete(doSomething2());

After you submit a job:
JetInstance jet = ...
Job job = jet.newJob(p);
You can retrieve a CompletableFuture associated with the job and wait for its completion. Alternatively, you can call
job.join();
which is equivalent to
job.getFuture().join()
It will wait until the job is completed.

Related

'Delay until' finish time of 'Queue a new build' not working in Azure Logic App

I'm triggering an Azure Logic App from an https webhook for a docker image in Azure Container Registry.
The workflow is roughly:
When a HTTP request is received
Queue a new build
Delay until
FinishTime of Queue a new build
See: Workflow image
The Delay until action doesn't work in that the queueried FinishTime is 0001-01-01T00:00:00.
It complains about the wrong format, so I manually added a Z after the FinishTime keyword.
Now the time stamp is in the right format, however, the timestamp 0001-01-01T00:00:00Z obviously doesn't make sense and subsequent steps are executed without delay.
Anything that I am missing?
edit: Queue a new build queues an Azure pipeline build. I.e. the FinishTime property comes from the pipeline.
You need to set a timestamp in future, the timestamp 0001-01-01T00:00:00Z you set to the "Delay until" action is not a future time. If you set a timestamp as 2020-04-02T07:30:00Z, the "Delay until" action will take effect.
Update:
I don't think the "Delay until" can do what you expect, but maybe you can refer to the operations below. Just add a "Condition" action to judge if the FinishTime is greater than current time.
The expression in the "Condition" is:
sub(ticks(variables('FinishTime')), ticks(utcNow()))
In a word, if the FinishTime is greater than current time --> do the "Delay until" aciton. If the FinishTime is less than current time --> do anything else which you want.(By the way you need to pay attention to the time zone of your timestamp, maybe you need to convert all of the time zone to UTC)
I've been in touch with an Azure support engineer, who has confirmed that the Delay until action should work as I intended to use it, however, that the FinishTime property will not hold a value that I can use.
In the meantime, I have found a workaround, where I'm using some logic and quite a few additional steps. Inconvenient but at least it does what I want.
Here are the most important steps that are executed after the workflow gets triggered from a webhook (docker base image update in Azure Container Registry).
Essentially, I'm initializing the following variables and queing a new build:
buildStatusCompleted: String value containing the target value completed
jarsBuildStatus: String value containing the initial value notStarted
jarsBuildResult: String value containing the default value failed
Then, I'm using an Until action to monitor when the jarsBuildStatus's value is switching to completed.
In the Until action, I'm repeating the following steps until jarsBuildStatus changes its value to buildStatusCompleted:
Delay for 15 seconds
HTTP request to Azure DevOps build, authenticating with personal access token
Parse JSON body of previous raw HTTP output for status and result keywords
Set jarsBuildStatus = status
After breaking out of the Until action (loop), the jarsBuildResult is set to the parsed result.
All these steps are part of a larger build orchestration workflow, where I'm repeating the given steps multiple times for several different Azure DevOps build pipelines.
The final action in the workflow is sending all the status, result and other relevant data as a build summary to Azure DevOps.
To me, this is only a workaround and I'll leave this question open to see if others have suggestions as well or in case the Azure support engineers can give more insight into the Delay until action.
Here's an image of the final workflow (at least, the part where I implemented the Delay until action):
edit: Turns out, I can simplify the workflow because there's a dedicated Azure DevOps action in the Logic App called Send an HTTP request to Azure DevOps, which omits the need for manual authentication (Azure support engineer pointed this out).
The workflow now looks like this:
That is, I can query the build status directly and set the jarsBuildStatus as
#{body('Send_an_HTTP_request_to_Azure_DevOps:_jar''s')['status']}
The code snippet above is automagically converted to a value for the Set variable action. Thus, no need to use an additional Parse JSON action.

Using If-Condition ADF V2

I have one Copy activity and two stored Proc Activity and i want to basically update the status of my pipeline as Failed in Logtable if any of these activities failed with error message details. Below is the flow of my pipeline
I wanted to use If-Condition activity and need help in setting the expression for it. For Copy activity i can use the below expression, but not sure about getting the status of stored Proc activity
#or(equals(activity('Copy Data Source1').output.executionDetails[0].status, 'Failed'), <expression to get the status of Stored Proc>)
If the above expression is true then i want to have one common stored proc activity that i will set in Add If True Activity to log the error details
Let me know if this possible.
I think you have overcomplicated this.
A much easier way to do that is to leverage a Failure path for required activities. Furthermore, SP would not be executed when Copy Data fails, therefore checking the status of execution of SP doesn't really make sense.
My pipeline would look like this:

Azure function CosmosDbTrigger (Start from the beginning option)

I have a azure function with cosmos db trigger which makes some calculations and write results to db. If something goes wrong i want to have a possibility to start from the first item or specific item make calculations again. Is it possible? Thanks
public static void Run([CosmosDBTrigger(
databaseName: "db",
collectionName: "collection",
ConnectionStringSetting = "DocDbConnStr",
CreateLeaseCollectionIfNotExists = true,
LeaseCollectionName = "leases")]IReadOnlyList<Document> input, TraceWriter log)
{
...
}
Right now, the StartFromBeginning option is not exposed to the Cosmos DB Trigger. The default behavior is to start receiving changes from the moment the Function starts running, leases/checkpoints will be generated in case the Host/Runtime shutsdown so when the Host/Runtime is back up it will pickup from the last checkpointed item.
The Trigger does not implement dead-lettering or error handling as it might generate infinite-loops / unexpected billing / multiple processing of the same batch if the error is not related to the batch itself (for example, you process the documents and then send an email and the email fails, the entire batch would be re-processed for an error not related to the feed itself), so we recommend users to implement their own try/catch or error handling logic inside the Function's code. It's the same approach as the Event Hub Trigger.
That being said, we are in the process of exposing several new options on the Trigger and there is a contributor working on an advanced retrying mechanism.
As #Matias Quaranta and #Pankaj Rawat say in the comments, the accept answer is old and is no longer true. You can use StartFromTheBeginning as a C# attribute within your azure function's parameter list like so:
[FunctionName(nameof(MyAzureFunction))]
public async Task RunAsync([CosmosDBTrigger(
databaseName: "myCosmosDbName",
collectionName: "myCollectionName",
ConnectionStringSetting = "cosmosConnectionString",
LeaseCollectionName = "leases",
CreateLeaseCollectionIfNotExists = true,
MaxItemsPerInvocation = 1000,
StartFromBeginning = true)]IReadOnlyList<Document> documents)
{
....
}
Please change the accepted answer.
The current offsets (positions in Cosmos DB change feed) are managed by clients, Azure Functions runtime in this case.
Functions store the offsets in lease collection (it's called leases in your example).
To restart from a specific item, you would have to make a snapshot of documents in leases collection at some point, and then restore your current collection to that snapshot when needed.
I am not familiar with a tool that automates that for you, other than generic tools working with Cosmos DB collections.
Check startFromBeginning option available in Function v2. Unfortunately, I'm still using V1 and not able to verify.
When set, it tells the Trigger to start reading changes from the beginning of the history of the collection instead of the current time. This only works the first time the Trigger starts, as in subsequent runs, the checkpoints are already stored. Setting this to true when there are leases already created has no effect.

Right way to delete and then reindex ES documents

I have a python3 script that attempts to reindex certain documents in an existing ElasticSearch index. I can't update the documents because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
index='%s_*' % base_index_name,
doc_type='type_a',
conflicts='proceed',
wait_for_completion=True,
refresh=True,
body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are 2 timeout parameters that can be passed to delete_by_query - search_timeout and timeout, but search_timeout defaults to no timeout (which is I think what I want), and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish? Or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?
You should set wait_for_completion to False. In this case you'll get task details and will be able to track task progress using corresponding API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api
Just to explain more in the form of codebase explained by Random for the newbee in ES/python like me:
ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': dict()}}
task_id = ES.delete_by_query(index='index_name', doc_type='sample_doc', wait_for_completion=False, body=query, ignore=[400, 404])
response_task = ES.tasks.get(task_id) # check if the task is completed
isCompleted = response_task["completed"] # if complete key is true it means task is completed
One can write custom definition to check if the task is completed in some interval using while loop.
I have used python 3.x and ElasticSearch 6.x
You can use the 'request_timeout' global param. This will reset the Connections timeout settings, as mentioned here
For example -
es.delete_by_query(index=<index_name>, body=<query>,request_timeout=300)
Or set it at connection level, for example
es = Elasticsearch(**(get_es_connection_parms()),timeout=60)

Implementing Multithreading in Oracle Procedures

I am working on Oracle 10gR2.
And here is my problem -
I have a procedure, lets call it *proc_parent* (inside a package) which is supposed to call another procedure, lets call it *user_creation*. I have to call *user_creation* inside a loop, which is reading some columns from a table - and these column values are passed as parameters to the *user_creation* procedure.
The code is like this:
FOR i IN (SELECT community_id,
password,
username
FROM customer
WHERE community_id IS NOT NULL
AND created_by = 'SRC_GLOB'
)
LOOP
user_creation (i.community_id,i.password,i.username);
END LOOP;
COMMIT;
user_Creation procedure is invoking a web service for some business logic, and then based on the response updates a table.
I need to find a way by which I can use multi-threading here, so that I can run multiple instances of this procedure to speed up things. I know I can use *DBMS_SCHEDULER* and probably *DBMS_ALERT* but I am not able to figure out, how to use them inside a loop.
Can someone guide me in the right direction?
Thanks,
Ankur
what you can do is submit lots of jobs in the same time. See Example 28-2 Creating a Set of Lightweight Jobs in a Single Transaction
This fills a pl/sql table with all jobs you want to submit in one tx, all at the same time. As soon as they are submitted (enabled) they will start running, as many as the system can handle, or as many as are allowed by a resource manager plan.
The overhead that the Lightweight jobs have is very ... minimal/light.
I would like to close this question. DBMS_SCHEDULER as well as DBMS_JOB (though DBMS_SCHEDULER is preferred) can be used inside the loop to submit and execute the job.
For instance, here's a sample code, using DBMS_JOB which can be invoked inside a loop:
...
FOR i IN (SELECT community_id,
password,
username
FROM customer
WHERE community_id IS NOT NULL
AND created_by = 'SRC_GLOB'
)
LOOP
DBMS_JOB.SUBMIT(JOB => jobnum,
WHAT => 'BEGIN user_creation (i.community_id,i.password,i.username); END;'
COMMIT;
END LOOP;
Using a commit after SUBMIT will kick off the job (and hence the procedure) in parallel.

Resources