Kettle Table Input Thread - multithreading

When I run a transformation in Kettle that has a Table Input step, and I look at the process list of my database, I see more than one process running against the MySQL database of my Table Input step.
So my question is: is Kettle using threads to run the step, or is it something else?

Kettle follows the dataflow programming model. That means each step of a transformation runs on its own thread, independently of the others.
Each thread waits for data from its input step(s), processes it, and delivers it to its output step(s). The data is grouped in packets of about 1000 rows to optimize speed.
Processing all the steps at the same time has many advantages, and it explains some otherwise strange behavior: for example, the row counts in the execution results table (at the bottom) are often multiples of 1000, and there is the auto-lock issue, where a transformation whose Table Input reads a table that one of its Table Output steps truncates deadlocks, each step waiting for the other to finish.
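To picture the model, here is a minimal, purely illustrative Scala sketch (not Kettle's actual API) of two "steps" connected by a bounded row buffer, similar to the row sets Kettle places between steps; the buffer size and names are assumptions for illustration:

import java.util.concurrent.ArrayBlockingQueue

// Illustrative only: a "Table input" thread produces rows into a bounded
// buffer while a "Table output" thread consumes them, so both run at once.
object DataflowSketch extends App {
  val rowSet = new ArrayBlockingQueue[Option[Int]](1000) // bounded, like a Kettle row set

  val input = new Thread(() => {
    (1 to 5000).foreach(row => rowSet.put(Some(row))) // produce rows
    rowSet.put(None)                                  // signal end of stream
  })

  val output = new Thread(() => {
    Iterator.continually(rowSet.take())
      .takeWhile(_.isDefined)
      .foreach(row => println(s"writing row ${row.get}")) // consume rows
  })

  input.start(); output.start()
  input.join(); output.join()
}

Because both threads run at once, the consumer starts writing long before the producer has finished, which is exactly why you see several database connections active at the same time.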

Related

Is it bad to run cron jobs to poll from a huge table of scheduled job records?

I have a table that a cron job polls every minute to send out messages to other services. The records in the table are essentially activities that are scheduled to run at a certain time. The cron job simply checks which of those activities are ready to be run and sends a message for each such activity through SQS to the other services.
When the cron job finds an activity that is ready to run, that record is marked as done after a message is sent through SQS. There is an API which allows other services to check whether a scheduled activity has already been done, so keeping a history of those done records is needed.
My concern, however, is whether a design like this is scalable in the long run. There are around 200k scheduled activities a day, or even more on some days. Since I keep the records by marking them as done after they are completed, I'm worried that the table will eventually grow huge, with tens of millions of rows, and become a problem for a cron job that runs this frequently.
Even with a properly indexed table, is my concern valid? If so, how else could I design this if I have to persist those scheduled activities for a cron job (or something similar) to poll and check when they are ready to run?
I'm using Postgres database.
As long as the number of rows that the cron job's query has to fetch stays constant and you can use an index, the size of the table won't matter.
Index scans are O(n) with respect to the number of rows scanned and O(log(n)) with respect to the table size. To be more specific, increasing the table size by a factor between 10 and 200 (a smaller index key leads to better fan-out) will make an index scan use one more block, and that block is normally cached.
If the table gets large, you might still want to consider partitioning, but mostly so that you can get rid of old data efficiently.
With the right index, the cron job should have no serious problem. You can use a partial/filtered index to keep the size of the index small, like:
create index on jobs (id) where status <> 'done';
For the index to be used, the query's WHERE clause has to match the index's WHERE clause.
I used (id) just because an empty column list is not allowed, so something has to be there. Based on your comment, schedule_dt might be a better choice. If you include all the columns you select, you can get an index-only scan; if you don't, the query will still use the index, it just has to visit the table to fetch the remaining columns for those specific rows. I suspect the index-only scan attempt won't be worth it to you, as the pages you need probably won't be marked all-visible, given that modifications were made to neighboring tuples just one minute ago.
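As a concrete illustration, here is a minimal Scala/JDBC sketch of the minute-by-minute poller. The jobs(id, schedule_dt, status, payload) schema, the connection string, and the sendToSqs helper are all hypothetical placeholders; the point is that the query repeats the index predicate (status <> 'done') so Postgres can use the partial index:

import java.sql.DriverManager

object JobPoller extends App {
  // Hypothetical connection details.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/jobsdb", "app", "secret")

  // The WHERE clause repeats the partial index's predicate so the index applies.
  val due = conn.prepareStatement(
    "SELECT id, payload FROM jobs WHERE status <> 'done' AND schedule_dt <= now()")
  val markDone = conn.prepareStatement("UPDATE jobs SET status = 'done' WHERE id = ?")

  val rs = due.executeQuery()
  while (rs.next()) {
    sendToSqs(rs.getString("payload")) // hypothetical: publish via the AWS SDK
    markDone.setLong(1, rs.getLong("id"))
    markDone.executeUpdate()
  }
  conn.close()

  def sendToSqs(message: String): Unit =
    println(s"would send to SQS: $message") // stub for illustration
}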
However, it does seem a bit odd to mark a job as done when it has only been scheduled, rather than actually done.
There is an API which allows other services to check whether a scheduled activity has already been done.
A table that increases in size without bound is likely to present management problems apart from the cron job. Surely the services aren't going to have to look back months in order to do this, are they? Could you delete 'done' jobs after a few days? What if a service tries to look up a job and rather than finding it 'done', it just doesn't find it at all?
I don't think the cron job is inherently a problem, but it would be cleaner not to have it. Why doesn't whoever inserts the job just invoke SQS in real time?

Run VoltDB stored procedures at regular interval from VoltDB

Is there any way to execute VoltDB stored procedures at a regular interval, or to schedule a stored procedure to run at a specific time?
I am exploring VoltDB to shift our product from an RDBMS to VoltDB. Our product is written in Java.
Most of the queries can be migrated into VoltDB stored procedures. But in our product we have a cron job in Oracle which executes at a regular interval, and I do not find such a feature in VoltDB.
I know VoltDB stored procedures can be called from the application at a regular interval, but our product is deployed in active-active mode; in that case every application instance would call the stored procedure at the interval, which is not a good solution. Otherwise we would have to develop some mechanism to run the procedure from one instance only.
So it would be good if VoltDB offered a cron-job-like feature.
I work at VoltDB. There isn't currently a feature like this in VoltDB, something like DBMS_JOB in Oracle.
You could certainly use a cron job on one of the servers in your cluster, or on some other server within your network, that invokes sqlcmd to run a script, or echoes individual SQL statements or "exec procedure" commands through sqlcmd to the database. Making cron jobs highly available is a general problem. You might find these other discussions helpful:
How to convert Linux cron jobs to "the Amazon way"?
https://www.reddit.com/r/linuxadmin/comments/3j3bz4/run_cronjob_only_on_one_node_in_cluster/
You could also look into something like rcron.
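If you would rather keep the scheduling inside your application, a minimal sketch like the following could run on a single designated node and invoke a procedure periodically through the VoltDB Java client. The "PurgeExpired" procedure name and the host name are hypothetical placeholders; making sure only one instance runs this still needs cron/rcron or a leader-election mechanism, as discussed above:

import java.util.concurrent.{Executors, TimeUnit}
import org.voltdb.client.ClientFactory

object VoltScheduler extends App {
  val client = ClientFactory.createClient()
  client.createConnection("voltdb-host-1") // connect to any node of the cluster

  val task: Runnable = () => {
    // "PurgeExpired" is a hypothetical procedure standing in for your cron logic.
    val response = client.callProcedure("PurgeExpired")
    println(s"status: ${response.getStatusString}")
  }

  // Run the procedure once a minute from this one instance only.
  Executors.newSingleThreadScheduledExecutor()
    .scheduleAtFixedRate(task, 0, 1, TimeUnit.MINUTES)
}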
One thing to be careful of when converting from an RDBMS to VoltDB is that VoltDB is optimized for processing many small transactions in parallel across many partitions. While the architecture of serialized execution per partition excels for many operational and streaming workloads, it is not designed to perform bulk operations on many rows at a time, especially transactions that need to perform writes on many rows that may be in different partitions within one transaction.
If you have a periodic job that does something like "process all the new rows that meet some criteria" you may find this transaction is slow and every time it runs it could delay other parts of the workload, especially if many rows have accumulated. It would be more the "VoltDB Way" to replace a simple INSERT statement that you may be using to ingest data (to be processed later by a scheduled job) with a procedure that inserts and immediately processes the row of data. You might even need a procedure that checks for other records and processes small sets of rows as a group, for example stitching together segments of data that go together but may have arrived out of order. By operating on fewer records at a time within one partition at a time, this type of procedure would be more scalable and would keep the data closer to your desired finished state in real time, rather than always having some data waiting to be processed.
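If you do adopt that per-row approach, the client side of the pattern might look like this sketch, where "InsertAndProcess" is a hypothetical single-partition procedure and the parameters are made up for illustration:

import org.voltdb.client.ClientFactory

object IngestSketch extends App {
  val client = ClientFactory.createClient()
  client.createConnection("voltdb-host-1")

  // Instead of a bare INSERT (e.g. via an @AdHoc statement), call a
  // hypothetical procedure that inserts the event and immediately does the
  // follow-up processing for that key, all in one single-partition transaction.
  client.callProcedure("InsertAndProcess", "device-42", System.currentTimeMillis(), 98.6)
  client.close()
}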

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one-day window). I realize other answers on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, a day-long aggregation sounds viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented in the Spark Streaming programming guide: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
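In code, the invertible form looks roughly like this sketch, assuming a simple word-count stream; the socket source and checkpoint path are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object DailyWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DailyWindow")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // required for the invertible form

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b, // reduce rows entering the window
        (a: Long, b: Long) => a - b, // "inverse reduce" rows leaving the window
        Minutes(24 * 60),            // window length: one day
        Minutes(5))                  // slide interval

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Without the inverse function, Spark would have to re-reduce the entire day of data on every slide; with it, each slide only touches the rows entering and leaving the window.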
After researching this on the MapR forums, it seems that it does indeed use a constant amount of memory, making a daily window possible, assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation as a batch job may only take 20 minutes. Running a day-long window means you're using those cluster resources permanently, rather than just for 20 minutes a day. So stand-alone batch aggregations are far more resource-efficient.
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all of it has arrived. A one-day streaming window is only a good fit if you literally want an analysis of the last 24 hours of data, regardless of its content.

What is "Serial Single Threaded" type of Transformation Engine in Kettle 6.0.1.0?

I am very new to the Kettle tool and found a transformation property where the "Transformation Engine Type" can be changed. Can someone help me understand what "Transformation Engine Type" means, and how the transformation's behavior changes if "Serial Single Threaded" is selected?
By default, PDI transformations launch all steps in parallel. So, if you have a transformation with 4 steps,
Table input --> Dimension lookup --> Calculator --> Table output
Each step will process rows as they arrive. Table input sends the first block of a few thousand rows to Dimension lookup, and the lookups start immediately. If you have a large volume of data you will have 4 threads continuously doing some work, and rows of data are passed from one thread to the next.
This is the normal behaviour and it's one of the engine's strengths.
However, you may be in a situation where you have a very large transformation, with dozens of steps, each doing very little work. In that case, the overhead of parallelising the execution doesn't pay off, and you end up with many threads waiting for CPU time. There you may be better off choosing the single-threaded execution model, in which all steps run in the same thread and data is processed serially, as in the sketch below.
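Purely as an illustration of the serial model (this is not Kettle's API; the step functions are hypothetical stand-ins for configured steps), a single thread pushes each row through every step in order:

object SerialSketch extends App {
  type Row = Map[String, Any]

  // Hypothetical stand-ins for configured Kettle steps.
  val dimensionLookup: Row => Row = row => row + ("dimKey" -> 42)
  val calculator: Row => Row     = row => row + ("total" -> 100)
  val steps = List(dimensionLookup, calculator)

  val tableInput: Seq[Row] = Seq(Map("id" -> 1), Map("id" -> 2))

  // One thread drives all steps for each row: no buffers, no context switches.
  tableInput.foreach { row =>
    val out = steps.foldLeft(row)((r, step) => step(r))
    println(s"Table output writes: $out")
  }
}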
Which one is better depends a lot on your specific use case, and there's no substitute for actually trying both and comparing their speeds.

How to react on specific event with spark streaming

I'm new to Spark Streaming and have the following situation:
Multiple (health) devices send their data to my service; every event contains at least (userId, timestamp, pulse, bloodPressure).
In the DB I have a per-user threshold for pulse and bloodPressure.
Use case:
I would like to build a sliding window with Spark Streaming which calculates the average pulse and blood pressure per user, say over 10 minutes.
After the 10 minutes I would like to check in the DB whether the values exceed the user's thresholds and, if so, execute an action, e.g. call a REST service to send an alarm.
Could somebody tell me if this is generally possible with Spark, and if yes, point me in the right direction?
This is definitely possible, though Spark is not necessarily the best tool for it. It depends on the volume of input you expect: if you have hundreds of thousands of devices each sending one event every second, Spark could be justified. Anyway, it's not up to me to validate your architectural choices, but keep in mind that resorting to Spark for these use cases makes sense only if the volume of data cannot be handled by a single machine.
Also, if the latency of the alert is important and a second or two makes a difference, Spark is not the best tool: a processor on a single machine can achieve lower latencies, or you can use something more streaming-oriented, like Apache Flink.
As general advice, if you want to do it in Spark, you just need to create a source (I don't know where your data comes from), load the thresholds into a broadcast variable (assuming they are constant over time) and write the windowing logic. To make the REST call, use foreachRDD as the output sink and implement the call logic there, as in the sketch below.
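Here is a hedged sketch of that recipe; the socket source, the CSV event format, the threshold values and the alarm action are all placeholder assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object VitalsAlerts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("VitalsAlerts")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Hypothetical: per-user (maxPulse, maxBloodPressure) loaded once from the DB.
    val thresholds = ssc.sparkContext.broadcast(
      Map("user-1" -> (120.0, 140.0))) // stub for illustration

    // Placeholder source: lines of "userId,timestamp,pulse,bloodPressure".
    val averages = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(userId, _, pulse, bp) = line.split(",")
        (userId, (pulse.toDouble, bp.toDouble, 1L))
      }
      // Sum pulses, blood pressures and counts over a 10-minute tumbling window...
      .reduceByKeyAndWindow(
        (a: (Double, Double, Long), b: (Double, Double, Long)) =>
          (a._1 + b._1, a._2 + b._2, a._3 + b._3),
        Minutes(10), Minutes(10))
      // ...then divide by the count to get per-user averages.
      .mapValues { case (pulseSum, bpSum, n) => (pulseSum / n, bpSum / n) }

    averages.foreachRDD { rdd =>
      rdd.foreach { case (userId, (avgPulse, avgBp)) =>
        thresholds.value.get(userId).foreach { case (maxPulse, maxBp) =>
          if (avgPulse > maxPulse || avgBp > maxBp)
            println(s"ALARM for $userId") // hypothetical: call your REST service here
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}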
