Integrate Elasticsearch with PostgreSQL while using Sails.js with Waterline ORM - node.js

I am trying to integrate Elasticsearch with Sails.js and my database isn't MongoDB: I use PostgreSQL, so this post doesn't help.
I have installed Elasticsearch on my Ubuntu box and now it's running successfully. I also installed this package on my Sails project, but I cannot create indexes on my existing models.
How can I define indexes on my models, and how can I search using Elasticsearch inside my Models?
Which hooks do I need to define inside my models?

Here you can find a pretty straightforward package, sails-elastic. It operates on configs taken directly from Elasticsearch itself.
See also the Elasticsearch docs, and index creation in particular.
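If you prefer to wire the indexing up yourself rather than rely on sails-elastic's adapters, a minimal sketch of the "hooks" part of the question could use Waterline's model lifecycle callbacks together with the official elasticsearch npm client. The index and type names, the attributes and the client setup below are assumptions for illustration, not part of either package:

// api/models/Customer.js -- hand-rolled mirroring into Elasticsearch (sketch)
var elasticsearch = require('elasticsearch');
// In a real app you would put the client in a service or config file, not the model.
var client = new elasticsearch.Client({ host: 'localhost:9200' });

module.exports = {
  attributes: {
    name: 'string',
    email: 'string'
  },

  // Waterline lifecycle callback: index the record after it is created.
  afterCreate: function (record, cb) {
    client.index({
      index: 'khurrambaig',   // assumed index name
      type: 'customer',       // assumed type name
      id: String(record.id),
      body: record
    }, function (err) { cb(err); });
  },

  // Re-index the record after updates; afterDestroy could call client.delete similarly.
  afterUpdate: function (record, cb) {
    client.index({
      index: 'khurrambaig',
      type: 'customer',
      id: String(record.id),
      body: record
    }, function (err) { cb(err); });
  }
};

Note that this is the dual-write approach discussed further down: every create and update now also hits Elasticsearch, which is exactly what the Logstash-based answer below avoids.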

There are lots of approaches to this issue. The recommended way is to use Logstash (from Elastic), which I describe in detail below.
I'll list most of the approaches I know here:
Using Logstash
curl https://download.elastic.co/logstash/logstash/logstash-2.3.2.tar.gz > logstash.tar.gz
tar -xzf logstash.tar.gz
cd logstash-2.3.2
Install the jdbc input plugin:
bin/logstash-plugin install logstash-input-jdbc
Then download the PostgreSQL JDBC driver:
curl https://jdbc.postgresql.org/download/postgresql-9.4.1208.jre7.jar > postgresql-9.4.1208.jre7.jar
Now create a configuration file for Logstash, input.conf, that uses the jdbc input:
input {
  jdbc {
    jdbc_driver_library => "/Users/khurrambaig/Downloads/logstash-2.3.2/postgresql-9.4.1208.jre7.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/khurrambaig"
    jdbc_user => "khurrambaig"
    jdbc_password => ""
    schedule => "* * * * *"
    statement => 'SELECT * FROM customer WHERE "updatedAt" > :sql_last_value'
    type => "customer"
  }
  jdbc {
    jdbc_driver_library => "/Users/khurrambaig/Downloads/logstash-2.3.2/postgresql-9.4.1208.jre7.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/khurrambaig"
    jdbc_user => "khurrambaig"
    jdbc_password => ""
    schedule => "* * * * *"
    statement => 'SELECT * FROM employee WHERE "updatedAt" > :sql_last_value'
    type => "employee"
  }
  # add more jdbc inputs to suit your needs
}
output {
  elasticsearch {
    index => "khurrambaig"
    document_type => "%{type}"  # <- use the type from each input
    document_id => "%{id}"      # <- to avoid duplicates
    hosts => "localhost:9200"
  }
}
Now run logstash using the above file:
bin/logstash -f input.conf
For every model (table) that you want to insert as a document type into an index (the database, khurrambaig here), use an appropriate SQL statement (SELECT * FROM employee WHERE "updatedAt" > :sql_last_value here). I use sql_last_value so that only updated data is pulled in. Logstash also supports scheduling and much more; here I run the query every minute. For more details, refer to the Logstash documentation.
To see the documents that have been inserted into the index for a particular type:
curl -XGET 'http://localhost:9200/khurrambaig/customer/_search?pretty=true'
This will list all the documents under the customer type in my case. Look into the Elasticsearch search API and use it directly, or use the official Node.js client (a sketch follows below).
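For instance, with the official elasticsearch npm client (the query, index and type names below are assumptions matching the example above), a search from your Node.js/Sails code could look like this:

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Find customers whose name matches "john" in the khurrambaig index (assumed names).
client.search({
  index: 'khurrambaig',
  type: 'customer',
  body: {
    query: { match: { name: 'john' } }
  }
}, function (err, resp) {
  if (err) { return console.error(err); }
  console.log(resp.hits.hits);  // the matching documents
});

You can call this from a Sails controller action or a service and merge the hits with whatever Waterline gives you.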
Using elasticsearch-jdbc
https://github.com/jprante/elasticsearch-jdbc
You can read its readme. It's quite straightforward. But this doesn't provide scheduling and many of the things that are provided by logstash.
Using sails-elastic
You need to use multiple adapters, as described in its README.
But this isn't recommended, because it will slow down your requests: for every create, update and delete you will be calling two databases, Elasticsearch and PostgreSQL.
With Logstash, indexing of documents is independent of your requests. This approach is used by many, including Wikipedia. You also stay independent of the framework: today you are using Sails, tomorrow you might use something else, but you don't need to change anything on the Logstash side as long as you still use PostgreSQL. (Even if you change databases, inputs for many of them are available, and to move from one SQL RDBMS to another you only need to swap the JDBC driver.)
There's also ZomboDB, but it currently works only with pre-2.0 Elasticsearch (support for ES > 2.0 is coming).

Related

Triggering dependent resources in an iteration loop

I'm using Puppet to set up workstations and I want to modify the default (NTUSER.DAT) HKLM registry before the user logs on, which involves loading and unloading the hive. I have written some PowerShell scripts to facilitate the load/unload. Although I have three distinct actions, it appears that Puppet is trying to unload the hive before the registry module can make all the changes. I believe I need to add some dependencies using subscribe and refreshonly.
This question is very similar to this one, with the exception that my data is in Hiera, therefore I want to iterate over the data.
$temp_hive_name = $base_windows::temp_hive_name

# LOAD REGISTRY HIVE
exec { 'load_registry_hive' :
  command   => template('base_windows/Load-RegHive.ps1.erb'),
  unless    => template('base_windows/Test-HiveLoadState.ps1.erb'),
  provider  => powershell,
  logoutput => true,
}

# MODIFY REGISTRY, ITERATING OVER HIERA DATA
$base_windows::registry.each | $key, $value | {
  registry::value { "registry_${key}" :
    key   => "${value['key']}\\${temp_hive_name}\\${value['subkey']}",
    type  => $value['type'],
    data  => $value['data'],
    value => $value['value'],
  }
}

# UNLOAD REGISTRY HIVE
exec { 'unload_registry_hive' :
  command   => template('base_windows/Unload-RegHive.ps1.erb'),
  onlyif    => template('base_windows/Test-HiveLoadState.ps1.erb'),
  provider  => powershell,
  logoutput => true,
}
This works fine when there are one or two Hiera entries.
I guess I could put the load/unload exec resources into an .each loop and add subscribe and refreshonly; however, it seems rather inefficient to do that for each item.
If anyone has any ideas, I'd be grateful if you could share them.
T.I.A.
I believe I need to add some dependencies using subscribe and refreshonly.
I'm not so sure that you need to add dependencies: without explicit dependencies, resources are applied in the relative order in which they appear in the manifest. Moreover, refreshonly does not declare a dependency, and subscribe is probably not appropriate for this particular task. And although refreshonly works in conjunction with dependencies, it is probably not appropriate here either, because notify / subscribe is not the right mechanism for this.
In a general sense, the key issues are these:
the hive must be loaded before you can attempt to sync any registry entries, so you cannot know whether any given registry resource is out of sync without loading the hive first;
if the hive is loaded then it must also be unloaded;
but the hive must not be unloaded before all the registry entries are synced.
You cannot make Exec['load_registry_hive'] refreshonly because there is no resource that would signal it. You can, however, check whether $base_windows::registry has any elements as a precondition for doing any of the work. If it does, then you definitely need to load the hive.
You can set up explicit dependencies, and I'm generally inclined to do that, as it protects against surprises when a resource is affected by dependency edges that are not apparent at the point of its declaration. So I would suggest this:
$temp_hive_name = $base_windows::temp_hive_name

if ! $base_windows::registry.empty() {
  # LOAD REGISTRY HIVE
  exec { 'load_registry_hive' :
    command   => template('base_windows/Load-RegHive.ps1.erb'),
    unless    => template('base_windows/Test-HiveLoadState.ps1.erb'),
    provider  => powershell,
    logoutput => true,
  }

  # MODIFY REGISTRY, ITERATING OVER HIERA DATA
  $base_windows::registry.each | $key, $value | {
    registry::value { "registry_${key}" :
      key     => "${value['key']}\\${temp_hive_name}\\${value['subkey']}",
      type    => $value['type'],
      data    => $value['data'],
      value   => $value['value'],
      require => Exec['load_registry_hive'],
      before  => Exec['unload_registry_hive'],
    }
  }

  # UNLOAD REGISTRY HIVE
  exec { 'unload_registry_hive' :
    command   => template('base_windows/Unload-RegHive.ps1.erb'),
    onlyif    => template('base_windows/Test-HiveLoadState.ps1.erb'),
    provider  => powershell,
    logoutput => true,
  }
}
Note that you will necessarily both load and unload the hive on each Puppet run, because you cannot determine whether any entries need to be updated without doing so.

logstash - add only first time value

Here's what I want; it's a bit the opposite of incremental data.
Some of the data are logs with a specific token, and I want to keep (or show in Elasticsearch) only the first submitted data, the oldest record for each token.
I want to ignore any newer log with the same token.
How can I do that? Should it be done in Logstash or in Elasticsearch?
Thanks
Update 2016-05-31
We can look at this from a different perspective, but globally what I want is the table shown in the picture, without the red lines: I want those to be ignored by Logstash, or not shown in ES queries.
I know it could be done if I were able to add a flag to the lines I want to delete, but that's not possible; the only thing that tells us they can be removed is that a key first-AAA has already been logged before.
At logging time, we don't have this information.
You can achieve this using the elasticsearch filter. The filter checks in ES whether the record already exists, and if it does, we ask Logstash to simply drop the line.
Note that I'm assuming the Id field (AAA) is used as the document _id and is also present in the document as the Id field. Feel free to change whatever needs changing, but this will work.
input {
  ...
}
filter {
  elasticsearch {
    hosts  => ["localhost:9200"]
    query  => "_type:your_type AND _id:%{[Id]}"
    fields => {"Id" => "found"}
  }
  if [found] {
    drop {}
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    ...
  }
}

Conditionally creating fields depending on filtering results in logstash influxdb output

I'm using Logstash to collect sar metrics from the server and store them in InfluxDB.
Metrics from different sources (CPU, memory, network) should be inserted into different series in InfluxDB. Of course, the number and names of fields in those series depend on the type of metric source.
This is my config file: https://github.com/evgygor/test/blob/master/logstash.conf
For each [type] of metric I have to configure a separate influxdb output. In this example I configured two types of metrics, but I plan to use this for SAR metrics, JMX metrics, and CSV metrics from JMeter, which means I would need to configure an appropriate output for each of them (tens of outputs).
Questions:
How can I achieve the desired configuration?
Is there any option to use conditionals inside the plugin? Example:
if [type]=="system.cpu" {
  data_points => {
    "time" => "%{time}"
    "user" => "%{user}"
  }
}
else {
  data_points => {
    "time" => "%{time}"
    "kbtotalmemory" => "%{kbtotalmemory}"
    "kbmemfree" => "%{kbmemfree}"
    "kbmemused" => "%{kbmemused}"
  }
}
Is there any flag to tell the influxdb plugin to use the field names/data types from the input by default?
Is there any flag/ability to define a default data type?
Is there any way to make the field name "time" reserved, with data type integer?
Thanks a lot.
I cooked up a nice solution.
This fork permits creating fields on the fly, according to the field names and data types that arrive at the output plugin.
I added two configuration parameters:
# This setting removes the need to use the data_points and coerce_values configuration
# to create an appropriate insert into InfluxDB. It should be used together with the
# fields_to_skip configuration. It sets the data point (column) names and values from
# the fields of the event that arrives at the plugin.
config :use_event_fields_for_data_points, :validate => :boolean, :default => true

# An array of keys to exclude from further processing. By default the event that arrives
# at the output plugin contains the keys "@version" and "@timestamp", and it can contain
# other fields, such as "command", added by the exec input plugin. Of course we don't need
# those fields to be processed and inserted into InfluxDB when
# use_event_fields_for_data_points is true. We don't delete the keys from the event itself;
# we create a new Hash from the event and then delete the unwanted keys from it.
config :fields_to_skip, :validate => :array, :default => []
This is my example config file: I'm retrieving a different number of fields with different names for CPU, memory and disks, but I don't need a different configuration per data type as in the master branch. I create the relevant field names and data types at the filter stage and simply skip the unwanted fields in the output plugin.
https://github.com/evgygor/logstash-output-influxdb

logstash output for arangodb

Has somebody already found an output package for Logstash to ArangoDB? I see that there is one for Elasticsearch, which is probably quite similar, and maybe also one for MongoDB. But unfortunately I haven't found one for ArangoDB so far, and the public Logstash documentation doesn't help me, as I'm not familiar with Ruby.
I gave it a try and found that the generic Logstash http output plugin is able to connect to ArangoDB. I wrote a blog article about using ArangoDB as a Logstash output with this plugin.
Compared with a dedicated ArangoDB plugin, this has the advantage that it is already available and is maintained as one of the standard Logstash plugins.
It works fine:
bin/logstash -e 'input { stdin {codec => "json" } } output { http { http_method => "post" url => "http://127.0.0.1:8529/_db/rest/_api/document?collection=rest" format => "json" headers => [ "Authorization", "Basic cm9vdDpwYXNzd29yZA==" ] } }'

Logstash Dynamically assign template

I have read that it is possible to assign dynamic names to the indexes like this:
elasticsearch {
  cluster => "logstash"
  index   => "logstash-%{clientid}-%{+YYYY.MM.dd}"
}
What I am wondering is if it is possible to assign the template dynamically as well:
elasticsearch {
  cluster  => "logstash"
  template => "/etc/logstash/conf.d/%{clientid}-template.json"
}
Also where does the variable %{clientid} come from?
Thanks!
After some testing and feedback from other users (thanks, Ben Lim), it seems this is not possible so far.
The closest thing would be to do something like this:
if [type] == "redis-input" {
  elasticsearch {
    cluster       => "logstash"
    index         => "%{type}-logstash-%{+YYYY.MM.dd}"
    template      => "/etc/logstash/conf.d/elasticsearch-template.json"
    template_name => "redis"
  }
} else if [type] == "syslog" {
  elasticsearch {
    cluster       => "logstash"
    index         => "%{type}-logstash-%{+YYYY.MM.dd}"
    template      => "/etc/logstash/conf.d/syslog-template.json"
    template_name => "syslog"
  }
}
Full disclosure: I am a Logstash developer at Elastic
You cannot dynamically assign a template because templates are uploaded only once, at Logstash initialization. Without the flow of traffic, deterministic variable completion does not happen. Since there is no traffic flow during initialization, there is nothing there which can "fill in the blank" for %{clientid}.
It is also important to remember that Elasticsearch index templates are only used when a new index is created, so templates are not uploaded every time a document reaches the Elasticsearch output block in Logstash (imagine how much slower that would be). If you intend to have multiple templates, they need to be uploaded to Elasticsearch before any data gets sent there. You can do this with a script of your own making, using curl and Elasticsearch API calls. This also permits you to update templates without having to restart Logstash: run the script any time before index rollover, and when the new indices get created they'll have the new template settings.
Logstash can send data to a dynamically configured index name, just as you have above. If there is no template present, Elasticsearch will create a best-guess mapping rather than the one you wanted. Templates can, and ought to, be completely independent of Logstash. The template handling in the output was added to improve the out-of-the-box experience for brand new users. The default template is less than ideal for advanced use cases, and Logstash is not a good tool for template management if you have more than one index template.
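As a sketch of the "upload your templates before Logstash starts" advice above, here is one way to do it with the official elasticsearch Node.js client instead of curl. The template names and file paths are taken from the example configuration above and are only illustrative:

// upload-templates.js -- pre-load index templates before starting Logstash (sketch)
var fs = require('fs');
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Template name -> template file on disk (paths reused from the example config).
var templates = {
  redis:  '/etc/logstash/conf.d/elasticsearch-template.json',
  syslog: '/etc/logstash/conf.d/syslog-template.json'
};

Object.keys(templates).forEach(function (name) {
  client.indices.putTemplate({
    name: name,
    body: JSON.parse(fs.readFileSync(templates[name], 'utf8'))
  }, function (err) {
    if (err) { return console.error('Failed to upload template ' + name, err); }
    console.log('Uploaded template ' + name);
  });
});

Run it whenever the template files change; the next indices that roll over will pick up the new settings.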
