Enforcing Logstash events to be output in order - logstash

I have a pipeline like this:
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/logstash-core/lib/jars/postgresql-42.2.6.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_user => "${JDBC_USER}"
    jdbc_password => "${JDBC_PASSWORD}"
    statement =>
      # Select everything from the my_table and
      # add a fake record to indicate the end of the results.
      "SELECT * FROM my_table
       UNION ALL
       SELECT 'END_OF_QUERY_RESULTS' AS some_key;
      "
  }
}
filter {
  ruby {
    init => "
      require 'net/http'
      require 'json'
    "
    code => '
      if event.get("some_key") == "END_OF_QUERY_RESULTS"
        uri = URI.parse(ENV["MY_URL"])
        response = Net::HTTP.get_response(uri)
        result = JSON.parse(response.body)
        if response.code == "202"
          puts "Success!"
        else
          puts "ERROR: Couldn\'t start processing."
        end
        event.cancel()
      end
    '
  }
}
output {
  mongodb {
    bulk => true
    bulk_interval => 2
    collection => "${MONGO_DB_COLLECTION}"
    database => "${MONGO_DB_NAME}"
    generateId => true
    uri => "mongodb://${MONGO_DB_HOST}:${MONGO_DB_PORT}/${MONGO_DB_NAME}"
  }
}
I simply grab all the data from a PostgreSQL table into a MongoDB collection.
What I'm trying to achieve is: I want to call an API after ALL the data has been loaded into the MongoDB collection.
What I tried:
I tried the above approach of adding a fake record at the end of the SQL query results to use as a flag indicating the last event. The problem with this approach is that Logstash does not maintain the order of events, so the event carrying the 'END_OF_QUERY_RESULTS' string can reach the filter before it is actually the last one.
Setting pipeline.workers: 1 and pipeline.ordered: true doesn't seem to work either.
I also tried sleeping for a while in the Ruby filter; that works, but I can't really know how long I should sleep.
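For reference, a minimal sketch of how the two settings mentioned above would be applied in logstash.yml (the file placement here is an assumption; they can also be set per pipeline in pipelines.yml):

# logstash.yml
pipeline.workers: 1
pipeline.ordered: true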

Related

Logstash [aggregate filter] to pass data between events

I am currently working on a project with the Elastic stack for a log monitoring system. The logs I have to load are in a specific format, so I have to write my own Logstash scripts to read them. In particular, there is one type of log where the date is at the start of the file and the timestamp on each of the other lines has no date. My goal is to extract the date from the first line and add it to all the following lines. After some research I found that the aggregate filter can help, but I can't get it to work. Here is my config file:
input
{
  file {
    path => "F:/ELK/data/testFile.txt"
    #path => "F:/ELK/data/*/request/*"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}
filter
{
  mutate {
    add_field => { "taskId" => "all" }
  }
  grok
  {
    match => {"message" => "-- %{NOTSPACE} %{NOTSPACE}: %{DAY}, %{MONTH:month} %{MONTHDAY:day}, %{YEAR:year}%{GREEDYDATA}"}
    tag_on_failure => ["not_date_line"]
  }
  if "not_date_line" not in [tags]
  {
    mutate {
      replace => {'taskId' => "%{day}/%{month}/%{year}"}
      remove_field => ["day","month","year"]
    }
    aggregate
    {
      task_id => "%{taskId}"
      code => "map['taskId'] = event.get('taskId')"
      map_action => "create"
    }
  }
  else
  {
    dissect
    {
      mapping => { message => "%{sequence_index} %{time} %{pid} %{puid} %{stack_level} %{operation} %{params} %{op_type} %{form_event} %{op_duration}"}
    }
    aggregate {
      task_id => "%{taskId}"
      code => "event.set('taskId', map['taskId'])"
      map_action => "update"
      timeout => 0
    }
    mutate
    {
      strip => ["op_duration"]
      replace => {"time" => "%{taskId}-%{time}"}
    }
  }
  mutate
  {
    remove_field => ['@timestamp','host','@version','path','message','tags']
  }
}
output
{
  stdout{}
}
The script reads the date correctly, but it does not replace the value in the other events:
{
"taskId" => "22/October/2020"
}
{
"pid" => "45",
"sequence_index" => "10853799",
"op_type" => "1",
"time" => "all-16:23:29:629",
"params" => "90",
"stack_level" => "0",
"op_duration" => "",
"operation" => "10",
"form_event" => "0",
"taskId" => "all",
"puid" => "1724"
}
I am using only one worker to ensure the order of the events is kept intact. If you know of any other way to achieve this, I'm open to suggestions, thank you!
For the lines which have a date you are setting the taskId to "%{day}/%{month}/%{year}", for the rest of the lines you are setting it to "all". The aggregate filter will not aggregate across events with different task ids.
I suggest you use a constant taskId and store the date in some other field, then in a single aggregate filter you can use something like
code => '
  date = event.get("date")
  if date
    @date = date
  else
    event.set("date", @date)
  end
'
@date is an instance variable, so its scope is limited to that aggregate filter, but it is preserved across events. It is not shared with other aggregate filters (that would require a class variable or a global variable).
Note that you require event order to be preserved, so you should set pipeline.workers to 1.
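Putting that suggestion together with the filter structure from the question, a minimal sketch of what the single aggregate filter could look like (the constant "all" task id reuses the mutate add_field from the question, and the "date" field name is an assumption):

aggregate {
  # taskId is the constant "all" added by the mutate filter above
  task_id => "%{taskId}"
  code => '
    date = event.get("date")
    if date
      @date = date
    else
      event.set("date", @date)
    end
  '
}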
Thanks to @Badger and another post he answered on the Elastic forum, I found a solution using a single ruby filter and an instance variable. I couldn't get it to work with the aggregate filter, but that is not an issue for me.
ruby
{
  init => '@date = ""'
  code => "
    event.set('date', @date) unless @date.empty?
    @date = event.get('date') unless event.get('date').empty?
  "
}

Sync MongoDb to ElasticSearch

I am looking for a way to sync collections in MongoDB with Elasticsearch (ES). The goal is to have MongoDB as the primary data source and use Elasticsearch as a full-text search engine. (The business logic of my project is written in Python.)
Several approaches are available online.
Mongo-connect
River plugin
logstash-input-mongodb (logstash plugin) see similar question
Transporter
However, most of the suggestions are several years old and I could not find any solution that supports the current version of ES (ES 7.4.0). Is anyone using such a construct? Do you have any suggestions?
I thought about dropping MongoDB as the primary data source and just using ES for storing and searching, though I have read that ES should not be used as a primary data source.
Edit
Thank you @gurdeep.sabarwal. I followed your approach. However, I have not managed to sync MongoDB to ES. My configuration looks like this:
input {
  jdbc {
    # jdbc_driver_library => "/usr/share/logstash/mongodb-driver-3.11.0-source.jar"
    jdbc_driver_library => "/usr/share/logstash/mongojdbc1.5.jar"
    # jdbc_driver_library => "/usr/share/logstash/mongodb-driver-3.11.1.jar"
    # jdbc_driver_class => "mongodb.jdbc.MongoDriver"
    # jdbc_driver_class => "Java::com.mongodb.MongoClient"
    jdbc_driver_class => "Java::com.dbschema.MongoJdbcDriver"
    jdbc_driver_class => "com.dbschema.MongoJdbcDriver"
    # jdbc_driver_class => ""
    jdbc_connection_string => "jdbc:mongodb://<myserver>:27017/<mydb>"
    jdbc_user => "user"
    jdbc_password => "pw"
    statement => "db.getCollection('mycollection').find({})"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "myindex"
  }
}
This brings me a bit closer to my goal. However, I get the following error:
Error: Java::com.dbschema.MongoJdbcDriver not loaded. Are you sure you've included the correct jdbc driver in :jdbc_driver_library?
Exception: LogStash::ConfigurationError
Since it did not work, I also tried the commented-out versions, but did not succeed.
Download https://dbschema.com/jdbc-drivers/MongoDbJdbcDriver.zip
Unzip it and copy all the files to the path ~/logstash-7.4.2/logstash-core/lib/jars/
Modify the config file (mongo-logstash.conf) as shown below
Run: ~/logstash-7.4.2/bin/logstash -f mongo-logstash.conf
Success, please try it!
P.S.: this is my first answer on Stack Overflow :-)
input {
  jdbc {
    # NOT THIS # jdbc_driver_class => "Java::mongodb.jdbc.MongoDriver"
    jdbc_driver_class => "com.dbschema.MongoJdbcDriver"
    jdbc_driver_library => "mongojdbc1.5.jar"
    jdbc_user => "" # no user and pwd
    jdbc_password => ""
    jdbc_connection_string => "jdbc:mongodb://127.0.0.1:27017/db1"
    statement => "db.t1.find()"
  }
}
output {
  #stdout { codec => dots }
  stdout { }
}
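The stdout output above only proves that the driver loads and the query runs. To actually index into Elasticsearch, the output block can be swapped for the elasticsearch output already used in the question, for example (host and index name are the placeholders from the question):

output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "myindex"
  }
}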
For the ELK stack, I have implemented this using the 1st and 2nd approaches. While doing research I came across multiple approaches, so you can pick any of them.
My personal choice is the 1st or 2nd, because they give you lots of options for customization.
If you need code let me know, I can share a snippet of it; I don't want to make the answer too long!
1. Use the DbSchema JDBC jar (https://dbschema.com) to stream data from MongoDB to Elasticsearch.
a. The DbSchema JDBC jar is open source.
b. You can write native MongoDB queries or aggregation queries directly in Logstash.
Your pipeline may look like the following:
input {
  jdbc {
    jdbc_user => "user"
    jdbc_password => "pass"
    jdbc_driver_class => "Java::com.dbschema.MongoJdbcDriver"
    jdbc_driver_library => "mongojdbc1.2.jar"
    jdbc_connection_string => "jdbc:mongodb://user:pass@host1:27060/cdcsmb"
    statement => "db.product.find()"
  }
}
output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => "localhost:9200"
    index => "target_index"
    document_type => "document_type"
    document_id => "%{id}"
  }
}
2. Use the UnityJDBC jar (http://unityjdbc.com) to stream data from MongoDB to Elasticsearch.
a. You have to pay for the UnityJDBC jar.
b. You can write SQL-format queries in Logstash to get data from MongoDB.
Your pipeline may look like the following:
input {
  jdbc {
    jdbc_user => "user"
    jdbc_password => "pass"
    jdbc_driver_class => "Java::mongodb.jdbc.MongoDriver"
    jdbc_driver_library => "mongodb_unityjdbc_full.jar"
    jdbc_connection_string => "jdbc:mongodb://user:pass@host1:27060/cdcsmb"
    statement => "SELECT * FROM employee WHERE status = 'active'"
  }
}
output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => "localhost:9200"
    index => "target_index"
    document_type => "document_type"
    document_id => "%{id}"
  }
}
3. Use the logstash-input-mongodb plugin (https://github.com/phutchins/logstash-input-mongodb) to stream data from MongoDB to Elasticsearch.
a. It is open source, more or less.
b. You get very few options for customization; it will dump the entire collection, and you cannot write queries or aggregation queries.
4. You can write your own program in Python or Java that connects to MongoDB and indexes the data in Elasticsearch, and then use cron to schedule it (see the sketch after this list).
5. You can use the Node.js Mongoosastic npm package (https://www.npmjs.com/package/mongoosastic); the only overhead is that it commits changes to both MongoDB and ES to keep them in sync.
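As a rough illustration of approach 4, a minimal Python sketch using the pymongo and elasticsearch client libraries. The database, collection, index and host names are placeholders, and it assumes the documents are JSON-serializable once the _id is converted:

# Rough sketch of approach 4: copy a MongoDB collection into an Elasticsearch index.
# All names (my_db, my_collection, my_index, hosts) are placeholders.
from pymongo import MongoClient
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

mongo = MongoClient("mongodb://localhost:27017/")
es = Elasticsearch(["http://localhost:9200"])

def actions():
    for doc in mongo["my_db"]["my_collection"].find({}):
        doc_id = str(doc.pop("_id"))  # reuse the Mongo _id so re-runs update instead of duplicating
        yield {"_index": "my_index", "_id": doc_id, "_source": doc}

# Index everything in batches; schedule this script with cron to keep ES up to date.
bulk(es, actions())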
Monstache seems like a good option too, as it supports the latest versions of both Elasticsearch and MongoDB: https://github.com/rwynn/monstache

how can i use an array of values inside where clause in knex?

How can I use an array of values to compare against inside the where clause, like in the code below? 'frompersons' is an array of names coming from the first query's response, and I want to get their info from the 'chatterusers' table. But how can I use this array inside the next where clause?
return knex('frndrqst').where({ toperson: toperson })
  .select('fromperson')
  .then(frompersons => {
    db.select('*').from('chatterusers')
      .where('name', '=', frompersons)
      .then(data => {
        res.json(data);
      })
      .catch(err => res.json("Unable to load frndrqsts !!!"))
  })
  .catch(err => res.json("Unable to load frndrqsts !!!"))
// get the list of names from the 'fromperson' table
var subquery = knex.select('name').from('fromperson');
// get all information from the 'chatterusers' table where the name matches
return knex.select('*').from('chatterusers')
  .whereIn('name', subquery)
Output:
select * from chatterusers where name in (select name from fromperson)
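If the values are already in a JavaScript array (as in the question, where they come from the first query), whereIn also accepts a plain array. A hedged sketch, assuming both queries go through the same knex instance; note that .select('fromperson') returns row objects, so the names are pulled out with map first:

return knex('frndrqst')
  .where({ toperson: toperson })
  .select('fromperson')
  .then(rows => {
    const names = rows.map(row => row.fromperson); // plain array of names
    return knex('chatterusers')
      .select('*')
      .whereIn('name', names);
  })
  .then(data => res.json(data))
  .catch(err => res.json("Unable to load frndrqsts !!!"));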

Knex.js Migrations copy existing data to other table

Is it possible, with a Knex.js migration, to copy data from one table to another?
The use case is as follows:
I have a table A, which I want to split into two new tables B and C. Ideally, I would loop over the rows in A to create the appropriate rows in B and C and fill them with the right information.
Can this be done inside a migration file? Apart from this question, I feel this way of doing migrations in Node.js is quite complex (e.g. compared to ActiveRecord). Is there any better, more managed way to do such migrations? Or is this the industry standard?
There's nothing special about the query builder object passed in to your up and down functions inside the migration file. You can use it like you would use any other instance of a query builder in your app, that is run any queries you want as part of the migration.
Here's an extremely simple example. Given you have a table table_a with 4 columns that you want to split across two new tables:
// Starting promise chain with Promise.resolve() for readability only
exports.up = function(knex, Promise) {
  return Promise.resolve()
    .then(() => knex.schema.createTable('table_b', t => {
      t.string('col_a')
      t.string('col_b')
    }))
    .then(() => knex.schema.createTable('table_c', t => {
      t.string('col_c')
      t.string('col_d')
    }))
    .then(() => knex('table_a').select('col_a', 'col_b'))
    .then((rows) => knex('table_b').insert(rows))
    .then(() => knex('table_a').select('col_c', 'col_d'))
    .then((rows) => knex('table_c').insert(rows))
    .then(() => knex.schema.dropTableIfExists('table_a'))
};

exports.down = function(knex, Promise) {
  return Promise.resolve()
    .then(() => knex.schema.createTable('table_a', t => {
      t.string('col_a')
      t.string('col_b')
      t.string('col_c')
      t.string('col_d')
    }))
    .then(() => knex('table_b').select('col_a', 'col_b'))
    .then((rows) => knex('table_a').insert(rows))
    .then(() => knex('table_c').select('col_c', 'col_d'))
    .then((rows) => knex('table_a').insert(rows))
    .then(() => knex.schema.dropTableIfExists('table_b'))
    .then(() => knex.schema.dropTableIfExists('table_c'))
};
In this case, you could also just keep table_a and, instead of creating a third table, drop two columns and rename the table. Be mindful, however, that splitting your table like this will get messy if it already has relationships to other tables in the DB.
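A sketch of that alternative, keeping table_a, moving col_c/col_d into a new table and then renaming what is left; this assumes the same column names as the example above:

exports.up = function(knex, Promise) {
  return Promise.resolve()
    .then(() => knex.schema.createTable('table_c', t => {
      t.string('col_c')
      t.string('col_d')
    }))
    .then(() => knex('table_a').select('col_c', 'col_d'))
    .then((rows) => knex('table_c').insert(rows))
    // drop the split-off columns and keep the rest in place
    .then(() => knex.schema.table('table_a', t => {
      t.dropColumn('col_c')
      t.dropColumn('col_d')
    }))
    .then(() => knex.schema.renameTable('table_a', 'table_b'))
};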
My understanding is that migrations deal only with performing CRUD operations in tables.
knex allows you to call a function after the migrations are finished:
knex.migrate.latest()
  .then(function() {
    return knex.seed.run();
  })
  .then(function() {
    // migrations are finished
  });
So you can add your code in either a seed file or simply as a function as shown.
Note that this function is called only after migrations complete, which means your table A still has to be present (can't be deleted).
Here's the relevant documentation

Elasticsearch/Logstash duplicating output when use schedule

I'm using Elasticsearch with Logstash.
I want to update indexes when the database changes, so I decided to use the Logstash schedule option. But every minute the output is appended with the database table's records again.
Example: the contract table has 2 rows. After the first minute the total is 2; one minute later the total output is 4.
How can I solve this?
Here is my config file. The command is bin/logstash -f contract.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/resource"
    jdbc_user => "postgres"
    jdbc_validate_connection => true
    jdbc_driver_library => "/var/www/html/iltodgeree/logstash/postgres.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * FROM contracts;"
    schedule => "* * * * *"
    codec => "json"
  }
}
output {
  elasticsearch {
    index => "resource_contracts"
    document_type => "metadata"
    hosts => "localhost:9200"
  }
}
You need to modify your output by specifying the document_id setting and use the ID field from your contracts table. That way, you'll never get duplicates.
output {
  elasticsearch {
    index => "resource_contracts"
    document_type => "metadata"
    document_id => "%{ID_FIELD}"
    hosts => "localhost:9200"
  }
}
Also if you have an update timestamp in your contracts table, you can modify the SQL statement in your input like below in order to only copy the records that changed recently:
statement => "SELECT * FROM contracts WHERE timestamp > :sql_last_value;"
