Jdbc_streaming filter plugin performance - logstash

Recently I have used Jdbc_streaming filter plugin of logstash, it is very helpful plugin which allows me to connect with my database on the fly and perform checks against my events.
But are there any drawbacks or pitfall of using this filter.
I mean I have the following queries :
For example , I am firing select query against each of my events.
Is it a good idea to query my database for each event. I mean what if I am processing a syslog event of a server which is continuously sending me data, in that case for each event I will be triggering a select query on my database so how will my database will react in terms of load and response time.
What about the number of connections, how they are managed.
How this will behave if I join multiple tables.
I hope I am able to convey my question.
I just want to understand , how exactly it is working in back end and does querying my database at massive speed will degrade my database performance.

I am not sure whether this answer is correct or not.
But as per my experience , logstash works in sequential manner for the above plugin.
It creates only single connection to RDS and query's the DB for each record.
So there is no connection overhead, but then it degrades the performance by many folds.
This answer is just from my experience, it might be possible that this can be a completely wrong answer. Any edits or answers are welcome.

Related

Redis and Postgresql synchronization (online users status)

In an NodeJS application I have to maintain a "who was online in the last N minutes" state. Since there is potentially thousands of online users - for performance reasons - I decided to not update my Postgresql user table for this task.
I choosed to use Redis to manage the online status. It's very easy and efficient.
But now I want to make complex queries to the user table, sorted by the online status.
I was thinking of creating a online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
Following the table filling, will the next query referencing the online table take a big hit caused by the new indexes creation or loading?
Does anyone know a better solution?
I had to solve almost this exact same issue, but I took a different approach because I Didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (Zero MQ in my case) but any queueing system should work, or a stream processing facility like Amazon Kinesis (The alternative I looked at.) I then inserted the data in batches into a second table (not the users table). I don't delete or update that table, only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last online data and the users table without bogging down the database or creating many updates on the user tables. It has the side effect of giving us a lot of other useful data.
One thing to note that I have though about when thinking of other solutions to this problem is that your users table in transactional data(OLTP) while the latest online information is really analytics data (OLAP), so if you have a data warehouse, data lake, big data, or whatever term of the week you want to use for storing this type of data and querying against it that may be a better solution.

Robot's Tracker Threads and Display

Application: The purposed application has an tcp server able to handle several connections with the robots.
I choosed to work with database/ no files, so i'm using a sqlite db to save information about the robots and their full history, models of robots, tasks, etc...
The robots send us several data like odometry, tasks information, and so on...
I create a thread for every new robot's connection to handle the messages and update the informations of the robots on the database. Now lets start talk about my problems:
The application got to show information about the robots in realtime, and I was thinking about using QSqlQueryModel, set the right query and the show it on a QTableView but then I got to some problems/ solutions to think about:
Problem number 1: There are informations to show on the QTableView that are not on the database: I have the current consumption on the database and the actual charge on the database in capacity, but I want to show also on my table the remaining battery time, how can I add that column with the right behaviour (math implemented) in my TableView.
Problem number 2: I will be receiving messages each second for each robot, so, updating the db and the the gui(loading the query) may not be the best solution when I have a big number of robots connected? Is it better to update the table, and only update the db each minute or something like this? If I use this method I cant work with the table with the QSqlQueryModel to update the tables, so what is the approach that you recommend me to use?
Thanks
SancheZ
I have run into similar problem before; my conclusion was QSqlQueryModel is not the best option for display purposes. You may want some processing on query results, or you may want to create, remove, change display data based on the result for a fancier gui. I think best is to implement your own delegates and override the view related methods - setData, setEditor
This way you have the control over all your columns and direct union of raw data and its display equivalent (i.e. EditData, UserData).
Yes, it is better if you update your view real-time and run a batch execute at lower frequency to update the big data. In general app is the middle layer and db is a bottom layer for data monitoring, unless you use db in memory shared cache.
EDIT: One important point, you cannot run updates in multiple threads (you can, but sqlite blocks the thread until it gets the lock) so it is best to run update from a single thread

cassandra trigger execution sequence

I want to use cassandra tigger to import my data to elasticsearch for searching.
Considering the data consistency, I hope they execute atomically.
So I want to know the trigger of the execution sequence, together with "write commitlog","memtable", "index" atomically, or the trigger is completely asynchronous?
Triggers are run before anything you listed above. The intent is to capture a mutation before it is persisted in the database. This is to potentially enhance data as it is received. What you have outlined about could have some edge failure conditions with data indexed in ES and not persisted to the database.
Have you looked at the DataStax search product? It has a much deeper integration with Cassandra that avoids these problems.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients hpoll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
X
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (sp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually setup to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
Unlike Postgres and SQL Server (or object stores like CouchDb), MySQL does not emit database events. However there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both Memory and MEMORY and MyISAM have only full-table locks so your trigger on an InnoDB table might hit a bottle neck when performing an insert into a MEMORY and MyISAM table. You may also require InnoDB for the changes table if you're using a ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of you changes table to check if there's something to perform. This feature wont work for InnoDB tables.
These articles describes in more depth some of alternative ways to implement queues in MySQL and even avoid polling!
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you

Resources