aggregate logstash filter with "multiple pipelines"

I would like to let httpd access_log entries be processed by two different logstash filters.
One of them is the "aggregate" filter, which is known to only work properly with a single worker thread. However, the other filter (let's call it "otherfilter") should be allowed to work with several worker threads, so that there is no loss of performance.
To accomplish this I would like to use the "multiple pipeline" feature of logstash. Basically one pipeline should read the data ("input pipeline") and distribute it to two other pipelines on which the two mentioned filters operate (let's call them "aggregate pipeline" and "otherfilter pipeline").
First tests have shown that the results of the aggregate filter are not correct if the input pipeline is set up to work with more than one thread. That is, when aggregating over an interval of 60 seconds, an event counter sometimes shows more and sometimes fewer events than actually occurred. The problem seems to be that events arrive "out of order" at the aggregate filter, and thus the intervals (whose start and end are determined based on the timestamp field) are incorrect.
So I am wondering whether what I want to achieve is feasible at all with "multiple pipelines"?

You can break up a single pipeline into multiple pipelines, but since you want to use the aggregate filter you need to make sure that everything that happens before the event enters the aggregate filter is running with only one worker.
For example, say you break up your pipeline into pipeline A, which is your input, pipeline B, which is your aggregate filter, and pipeline C, which is your other filter.
This will only work if:
Pipeline A is running with only one worker.
Pipeline B is running with only one worker.
Pipeline C runs after pipeline B and doesn't rely on the order of the events.
If your input pipeline is running with more than one worker you can't guarantee the order of the events when they enter your aggregate pipeline. So basically your input and your aggregate should be in the same pipeline, and then you can direct the output to the other filter pipeline that runs with more than one worker.
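For illustration, a minimal sketch of that layout using pipeline-to-pipeline communication could look like the following (the file paths, the virtual address name "otherfilter", and the grok/aggregate settings are placeholders, not taken from the question):

# pipelines.yml
- pipeline.id: input_and_aggregate
  pipeline.workers: 1
  path.config: "/etc/logstash/conf.d/input_and_aggregate.conf"
- pipeline.id: otherfilter
  pipeline.workers: 4
  path.config: "/etc/logstash/conf.d/otherfilter.conf"

# input_and_aggregate.conf -- single worker, so input and aggregation stay ordered
input { file { path => "/var/log/httpd/access_log" } }
filter {
  grok { match => { "message" => "%{HTTPD_COMBINEDLOG}" } }
  aggregate {
    task_id => "%{clientip}"
    code => "map['count'] ||= 0; map['count'] += 1"
    timeout => 60
    push_map_as_event_on_timeout => true
  }
}
output { pipeline { send_to => ["otherfilter"] } }

# otherfilter.conf -- may run with several workers, order no longer matters here
input { pipeline { address => "otherfilter" } }
filter {
  # "otherfilter" logic goes here
}
output { elasticsearch { hosts => ["localhost:9200"] } }

Here the single-worker pipeline does both the input and the aggregation and only then hands the events to the multi-worker pipeline, which is exactly the constraint described above.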

Related

Data Factory - Foreach activity: run in parallel but sequentially

I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.
My problem is that two notebooks have dependencies on each other.
That is, one notebook has to run before the other because of that dependency. I know that the ForEach activity can be executed sequentially or by batch. But the problem is that when running sequentially it will run one by one, and as I have several partitions it will take a long time.
What I wanted is to run sequentially but by batch. In other words, I have a notebook that will run with the ES, UK and DK partitions, and I wanted it to run these partitions of this notebook in parallel, wait for the full execution of this notebook, and only then start running the other notebook with the same partitions. If I set it to batch, it doesn't wait for full execution; it starts running the other notebook randomly.
I get the order of the notebooks through a config table, in which I specify which order they should run in, and then I have a notebook that defines my final JSON with that order.
Config Table:

sPath   TableSource   TableDest    order
path1   dbo.table1    dbo.table1   1
path2   dbo.table2    dbo.table2   2
This is my pipeline:
and I wanted the execution to be by batch and sequential, but it is not possible to select Sequential and a Batch count at the same time.
Can anyone please help me in achieving this?
Thank you!
I have tried to reproduce this. For running the ForEach sequentially and batchwise, we need to have two pipelines: one nested inside the other. The outer pipeline is used for running sequentially and the inner pipeline for running batchwise. Below are the steps:
Took a sample config table similar to the one in the question (screenshot omitted).
pipeline1 is considered the outer pipeline. In it, a Lookup activity is used to select only the sortorder field data in increasing order. The sortorder value will be passed as a parameter to the child pipeline sequentially.
select distinct sortorder from config_table order by sortorder
A ForEach activity is added after the Lookup activity. We use this for the sequential run. Thus, Sequential is checked and in the Items text box the output of the Lookup activity is given.
@activity('Lookup1').output.value
Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity. The pipeline parameter pp_sortorder is added in the child pipeline pipeline2.
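The value handed to pp_sortorder can then reference the current item of the outer ForEach (the column name sortorder comes from the Lookup query above):
@item().sortorder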
In pipeline2, a Lookup activity is added with a dataset referring to the config table, filtered by the sortorder value received from pipeline1:
select * from config_table where sortorder=@{pipeline().parameters.pp_sortorder}
After the Lookup, a ForEach is added; in Items, the Lookup activity output is given and a Batch count of 5 is set here (the Batch count can be increased as per requirement).
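Assuming the child Lookup activity keeps the default name Lookup1 (the name is an assumption), the Items expression of this inner ForEach is analogous to the outer one:
@activity('Lookup1').output.value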
A Stored Procedure activity is added inside the ForEach activity to check the parallel processing.
After setting all this up, pipeline1 is executed. The Execute Pipeline activity of pipeline1 runs sequentially and the Stored Procedure activities of pipeline2 run simultaneously.
pipeline1 output status: the second Execute Pipeline run starts only once the first one has ended.
pipeline2 output status: all the Stored Procedure activities start executing simultaneously.

Am I using switch/case wrong here to control?

I am trying to check the logs and, depending on the last log, run a different step in the transformation. Am I supposed to use some other steps or am I making another mistake here?
For example, if the query returns 1 I want Execute SQL script to run, for 2 I want Execute SQL script 2 to run, and for 3 I want the transformation to abort. But it keeps running all the steps even if only one value is returned from the CONTROL step.
The transformation looks like this
And the switch/case step looks like this
It looks like it's correctly configured, but keep in mind that in a transformation all steps are initiated at the beginning of the transformation, waiting to receive streaming data from the previous step. So the Abort and Execute SQL script steps are going to be started as soon as the transformation is started; if they don't need data from the previous step to run, they are going to run right at the beginning.
If you want the scripts to be executed depending on the result of the CONTROL output, you'll need to use a job, which runs its steps (actions) sequentially:
A transformation runs the CONTROL step and afterwards you put a "Copy rows to result" step to make the data produced by the CONTROL step available to the job.
After the transformation, you use a "Simple evaluation" action in the job to determine which script to run (or whether to abort). Jobs also have an "Execute SQL Script" action, so you can put it afterwards.
I'm supposing your CONTROL step only produces one row; if the output is more than one row, the "Simple evaluation" action won't do the job, and you'll have to design a transformation that is executed for each row of the previous transformation, running what you need.

How to re-try an ADF pipeline execution until conditions are met

An ADF pipeline needs to be executed on a daily basis, let's say at 03:00 AM.
But prior to execution we also need to check whether the data sources are available.
Data is provided by an external agent; it periodically loads the corresponding data into each source table and lets us know when this process is completed using a flag table: if data source 1 is ready, it sets its flag to 1.
I can't find a way to implement this logic with ADF.
We would need something that, for instance, at 03:00 would trigger an 'element' that checks the flags; if the flags are not up, it doesn't launch the pipeline. After, let's say, 10 minutes, it checks the flags again, and it keeps doing this for at most X times OR until the flags are up.
If the flags are up, launch the pipeline execution and stop trying to launch the pipeline any further.
How would you do it?
The logic per se is not complicated in any way, but I wouldn't know where to implement it. Should I develop an Azure Function that launches the pipeline, or is there a way to achieve it with an out-of-the-box ADF activity?
There is an Until iteration activity where you can check your clause.
Example:
Your Azure Function (AF) checks the flag and returns 0 or 1.
Build an ADF pipeline with an Until activity where you check the output of the AF (if it's 1, do something). Inside the Until activity you can have your processing step. For example, you have a variable flag that is 0 before the Until activity. In your Until you check whether it's 1: if it is, do your processing step; if it's not, put a Wait activity of 10 minutes or so.
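As a minimal sketch, assuming a String pipeline variable named flag that starts at '0' and is updated by a Set Variable activity from the Azure Function's response (the variable and activity names are assumptions), the Until expression could be:
@equals(variables('flag'), '1')
Inside the Until you would then chain the Azure Function activity, the Set Variable activity and a Wait activity of 600 seconds. The Until activity's Timeout property can serve as the overall cap corresponding to "at most X times OR until the flags are up".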
So you have the ability in ADF to iterate until a condition is satisfied.
Hope that this will help you :)

Bull queue: Ensure unique job within time period by using partial timestamp in jobId

I need to ensure the same job added to queue isn't duplicated within a certain period of time.
Is it worth including partial timestamps (i.e. D/M/Y-HH:M) in my unique jobId strings, so it processes only if not in the same Minute?
It would still duplicate if one job was added at 12:01 and the other at 12:09 – or does Bull have a much better way of doing this?
Bull is designed to support idempotence by ignoring jobs that are added with existing job ids. Be careful not to enable options such as removeOnComplete, since the job will be removed after completion and will not be considered the next time you add a job.
In your case, where you want to make sure that no new jobs are added during a given timespan, just make sure that all the job ids during that timespan are the same, for example, as you wrote in your comment, by removing the last 4 digits of your UNIX timestamp.
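As a hedged sketch of that idea in code (the queue name, the key, and the 10-minute window are illustrative assumptions):

import Queue from 'bull';

const queue = new Queue('notifications');

// Jobs that fall into the same time window get the same jobId, so Bull
// silently ignores the later duplicates.
function windowedJobId(key: string, windowMs = 10 * 60 * 1000): string {
  const bucket = Math.floor(Date.now() / windowMs);
  return `${key}:${bucket}`;
}

async function enqueue(userId: number): Promise<void> {
  await queue.add({ userId }, { jobId: windowedJobId(`notify-${userId}`) });
}

Note that, as you observed, any fixed window still has a boundary: a job added just before the cut-over and one added just after it will both be accepted.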
I feel you should use Bull's API to check whether the job is running or not, and then decide whether to add the job to the queue (patch on the producer).
You can also decide to check whether a similar job is already running when you are running the job (inside the process function) and do an early return instead of executing the job (patch on the consumer).
You can use the Queue getJobs function to do so:
getJobs(types: string[], start?: number, end?: number, asc?: boolean):Promise<Job[]>
"Returns a promise that will return an array of job instances of the given types. Optional parameters for range and ordering are provided."
From documentation:
https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#queuegetjobs
The Job item should provide enough data so you can find the one you are looking for.
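A sketch of that producer-side check (the queue name and the equality test on job.data.userId are hypothetical, adjust them to your payload):

import Queue from 'bull';

const queue = new Queue('notifications');

// Only add the job if no equivalent job is already waiting, delayed or active.
async function addIfAbsent(data: { userId: number }) {
  const jobs = await queue.getJobs(['active', 'waiting', 'delayed']);
  if (jobs.some((job) => job.data && job.data.userId === data.userId)) {
    return null; // an equivalent job already exists, so skip adding it
  }
  return queue.add(data);
}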

Workflow System with Azure Table Storage

I have a system where we need to run a simple workflow.
Example:
On Jan 1st 08:15 trigger task A for object Z
When triggered then run some code (implementation details not important)
Schedule task B for object Z to run at Jan 3rd 10:25 (and so on)
The workflow itself is simple, but I need to run 500,000+ instances and that's the tricky part.
I know Windows Workflow Foundation and for that very same reason I have chosen not to use that.
My initial design would be to use Azure Table Storage and I would really appreciate some feedback on the design.
The system will consist of two tables
Table "Jobs"
PartitionKey: ObjectId
Rowkey: ProcessOn (UTC Ticks in reverse so that newest are on top)
Attributes: State (Pending, Processed, Error, Skipped), etc...
Table "Timetable"
PartitionKey: YYYYMMDD
Rowkey: YYYYMMDDHHMM_<GUID>
Attributes: Job_PartitionKey, Job_RowKey
The idea is that the Jobs table will have the complete history of jobs per object and the Timetable will have a list of all jobs to run in the future.
Some assumptions:
A job will never span more than one Object
There will only ever be one pending job per Object
The "job" is very lightweight e.g. posting a message to a queue
The system must be able to perform these tasks:
Execute pending jobs
Query for all records in "Timetable" with "PartitionKey <= today" and "RowKey <= now"
For each record (in parallel)
Lookup job in Jobs table via PartitionKey and RowKey
If "not exists" or State != Pending then skip
Execute "logic". If fails => log and maybe do some retry logic
Submit "Next run date in Timetable"
Submit "Update State = Processed" and "New Job Record (next run)" as a single transaction
When all are finished => Delete all processed Timetable records
Concern: only two of the three record modifications are in a transaction. Could this be overcome in any way?
Stop workflow
Stop/pause workflow for Object Z
Query the top 1 job in the Jobs table by PartitionKey
If any AND State == Pending then update to "Cancelled"
(No need to bother cleaning Timetable it will clean itself up "when time comes")
Start workflow
Create Pending record in Jobs table
Create record in Timetable
In terms of "executing the thing" I would
be using a Azure Function or Scheduler-thing to execute the pending jobs every 5 minutes or so.
Any comments or suggestions would be highly appreciated.
Thanks!
How about using Service Bus instead? The BrokeredMessage class has a property called ScheduledEnqueueTimeUtc. You can just schedule when you want your jobs to run via the ScheduledEnqueueTimeUtc property, and then fuggedabouddit. You can then have a triggered webjob that monitors the Service Bus messaging queue, and will be triggered very near when the job message is enqueued. I'm a big fan of relying on existing services to minimize the coding needed.
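As a rough sketch of that approach using the current JavaScript SDK (@azure/service-bus) instead of the older BrokeredMessage API (the queue name, connection string variable and message shape are assumptions):

import { ServiceBusClient } from "@azure/service-bus";

const client = new ServiceBusClient(process.env.SERVICE_BUS_CONNECTION_STRING!);
const sender = client.createSender("workflow-jobs");

// Schedule "run task X for object Z" so the message only becomes visible
// in the queue at the job's due time.
async function scheduleTask(objectId: string, task: string, runAt: Date): Promise<void> {
  await sender.scheduleMessages(
    { body: { objectId, task }, messageId: `${objectId}:${task}` },
    runAt
  );
}

The triggered consumer (WebJob or Azure Function) then simply processes messages as they become visible, which removes the need for the Timetable table and the 5-minute polling loop.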
