How can I do multiple input streams in Logstash?

My use case is:
I need a timestamp from database A
Then do a query on database B
Then output the results of that Database B query
Which means a Logstash config file that (theoretically) would look something like this:
input {
  jdbc {
    # get the timestamp
  }
  jdbc {
    # do the SQL that gets lots of data with the timestamp above
  }
}
output {
  elasticsearch {
    # spew the data from that second jdbc query
  }
}
That doesn't work, of course, but it gives the idea of the use case. How can I solve this scenario?

Use a jdbc_streaming filter, which is designed for exactly this use-case.
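For illustration, here is a minimal sketch of what that can look like: the jdbc input polls database A for the watermark, and the jdbc_streaming filter runs the big query against database B for each event, parameterised with that value. The connection strings, credentials, queries, and field names below are placeholders, not taken from the question.

input {
  jdbc {
    # poll database A for the timestamp (placeholder connection details)
    jdbc_driver_library => "/path/to/jdbc-driver.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://db-a:5432/dba"
    jdbc_user => "user"
    jdbc_password => "secret"
    statement => "SELECT MAX(updated_at) AS last_ts FROM some_table"
    schedule => "*/5 * * * *"
  }
}
filter {
  jdbc_streaming {
    # for each event, run the big query against database B using that timestamp
    jdbc_driver_library => "/path/to/jdbc-driver.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://db-b:5432/dbb"
    jdbc_user => "user"
    jdbc_password => "secret"
    statement => "SELECT * FROM big_table WHERE updated_at > :ts"
    parameters => { "ts" => "last_ts" }
    target => "rows"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my-index"
  }
}

If you need one Elasticsearch document per row rather than one event carrying an array, a split filter on the rows field after the jdbc_streaming filter is a common follow-up step.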

Related

JMeter: Connect to PostgreSQL in JSR223 using Groovy and then compare values from multiple tables in DB with API response

Sorry for the long post, but I really need some guidance here. I need to compare values from an API response with the values from multiple tables in the DB.
Currently, I am doing it as follows:
I use a JDBC Connection Configuration to connect to the Postgres DB and then use the JDBC Sampler to execute queries. I use it three times to query 3 different tables. I store this data in variables (let's call them DBVariables). Please see this image for the current JMeter setup: https://i.stack.imgur.com/GZJyF.png
In a JSR223 Assertion, I have written code that takes data from the various DBVariables and compares it against the API response.
However, my issue is that the API response can have an array of records with nested arrays inside each (please see the API response sample below), and these array elements can appear in any order. This is where I have issues.
I was wondering what would be the most efficient way of writing this JSR223 Assertion to validate that all data elements returned by the API are the same as what is in the DB.
I am very new to Groovy, but I think if I can query the DB inside the JSR223 Assertion (instead of using the JDBC Sampler), then the comparison can be done by storing the API response in one map and the DB response in another, sorting both, and comparing the items.
My questions are:
How can I connect to PostgreSQL using Groovy and then execute query statements? I have not done that before and was hoping someone could provide sample code.
How can I store the API response and the DB responses in maps, sort them, and compare them in Groovy?
The API response is of the following type:
{
  "data": {
    "response": {
      "employeeList": [
        {
          "employeeNumber": "11102",
          "addressList": [
            {
              "addrType": "Home",
              "street_1": "123 Any street"
            },
            {
              "addrType": "Alternate",
              "street_1": "123 Any street"
            }
          ],
          "departmentList": [
            {
              "deptName": "IT"
            },
            {
              "deptName": "Finance"
            },
            {
              "deptName": "IT"
            }
          ]
        },
        {
          "employeeNumber": "11103",
          "addressList": [
            {
              "addrType": "Home",
              "street_1": "123 Any street"
            },
            {
              "addrType": "Alternate",
              "street_1": "123 Any street"
            }
          ],
          "departmentList": [
            {
              "deptName": "IT"
            },
            {
              "deptName": "Finance"
            },
            {
              "deptName": "IT"
            }
          ]
        }
      ]
    }
  }
}
Have you seen the Working with a relational database chapter of the Groovy documentation? Alternatively, you can obtain a Connection instance from the JDBC Connection Configuration element like:
def connection = org.apache.jmeter.protocol.jdbc.config.DataSourceElement.getConnection('your-pool-name')
With regards to "sort", there is the DefaultGroovyMethods class, which provides a sort() function for any "sortable" entity. With regards to "compare", we don't know what the object from the database looks like, hence we cannot provide a comprehensive solution.
Maybe an easier option would be converting the response from the JDBC Sampler to JSON using JsonBuilder, and once you have 2 JSON structures, use a library like JSONassert, which doesn't care about order and depth.
You haven't asked, but if you're "very new to groovy", maybe it's worth extracting individual values from the API using the JSON Extractor, doing the same for the database with the JDBC elements, and comparing individual JMeter Variables using a Response Assertion?
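To make the first approach concrete, here is a rough sketch of what a JSR223 Assertion could look like when it queries Postgres directly with groovy.sql.Sql and compares sorted structures. The connection details, table and column names are made up for illustration; prev and AssertionResult are the standard JSR223 Assertion bindings.

import groovy.sql.Sql
import groovy.json.JsonSlurper

// Hypothetical connection details and schema; adjust to your environment.
// The Postgres JDBC driver jar must be on JMeter's classpath.
def sql = Sql.newInstance('jdbc:postgresql://localhost:5432/hrdb', 'user', 'secret', 'org.postgresql.Driver')
try {
    // Build a map of employeeNumber -> list of address maps from the DB
    def dbAddresses = [:].withDefault { [] }
    sql.eachRow('SELECT employee_number, addr_type, street_1 FROM employee_address') { row ->
        dbAddresses[row.employee_number as String] << [addrType: row.addr_type, street_1: row.street_1]
    }

    // Parse the API response and compare per employee, sorting both sides to ignore order
    def api = new JsonSlurper().parseText(prev.getResponseDataAsString())
    api.data.response.employeeList.each { emp ->
        def expected = dbAddresses[emp.employeeNumber].sort { it.addrType }
        def actual = emp.addressList.collect { [addrType: it.addrType, street_1: it.street_1] }.sort { it.addrType }
        if (expected != actual) {
            AssertionResult.setFailure(true)
            AssertionResult.setFailureMessage("Address mismatch for employee ${emp.employeeNumber}")
        }
    }
} finally {
    sql.close()
}

The departmentList comparison would follow the same pattern, and the same idea applies if you keep the JDBC Sampler and read its result variables instead of querying from the script.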

Using Logstash Aggregate Filter plugin to process data which may or may not be sequenced

Hello all!
I am trying to use the Aggregate filter plugin of Logstash v7.7 to correlate and combine data from two different CSV file inputs that represent API data calls. The idea is to produce a record showing a combined picture. As you can expect, the data may or may not arrive in the right sequence.
Here is an example:
/data/incoming/source_1/*.csv
StartTime, AckTime, Operation, RefData1, RefData2, OpSpecificData1
231313232,44343545,Register,ref-data-1a,ref-data-2a,op-specific-data-1
979898999,75758383,Register,ref-data-1b,ref-data-2b,op-specific-data-2
354656466,98554321,Cancel,ref-data-1c,ref-data-2c,op-specific-data-2
/data/incoming/source_2/*.csv
FinishTime,Operation,RefData1, RefData2, FinishSpecificData
67657657575,Cancel,ref-data-1c,ref-data-2c,FinishSpecific-Data-1
68445590877,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
55443444313,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
I have a single pipeline that is receiving both these CSVs and I am able to process and write them as individual records to a single index. However, the idea is to combine records from the two sources into one record, each representing a superset of Operation-related information.
Unfortunately, despite several attempts I have been unable to figure out how to achieve this via the Aggregate filter plugin. My primary question is whether this is a suitable use of this specific plugin. And if so, any suggestions would be welcome!
At the moment, I have this:
input {
  file {
    path => ['/data/incoming/source_1/*.csv']
    tags => ["source1"]
  }
  file {
    path => ['/data/incoming/source_2/*.csv']
    tags => ["source2"]
  }
}
filter {
  # use the tags to do some source 1 and 2 related massaging, calculations, etc
  aggregate {
    task_id => "%{Operation}_%{RefData1}_%{RefData1}"
    code => "
      map['source_files'] ||= []
      map['source_files'] << { 'source_file' => event.get('path') }
    "
    push_map_as_event_on_timeout => true
    timeout => 600 # assuming this is the most far apart they will arrive
  }
  ...
}
output {
  elasticsearch { ... }
}
And other such variations. However, I keep getting individual records written to the index and am unable to get one combined record. Yet again, as you can see from the data set, there's no guarantee of the sequencing of records - so I am wondering if the filter is the right tool for the job to begin with? :-\
Or is it just me not being able to use it right! ;-)
In either case, any inputs/ comments/ suggestions welcome. Thanks!
PS: This message is being cross-posted over from Elastic forums. I am providing a link there just in case some answers pop up there too.
The answer is to use Elasticsearch in upsert mode. Please see the specifics here.
First, I recommend ensuring that the information reaches you in order so that the filter can handle it better. Second, you could set the following options in your pipeline.yml: pipeline.workers: 1 and pipeline.ordered: true, thus guaranteeing the order of processing.
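As a rough sketch of the upsert suggestion above (the document_id key and index name are assumptions, not taken from the original answer), both sources could write to the same document, so whichever record arrives second simply fills in the remaining fields:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "operations"
    # one document per operation, shared by both sources
    document_id => "%{Operation}_%{RefData1}_%{RefData2}"
    action => "update"
    doc_as_upsert => true
  }
}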

How To Be Sure All Documents Are Written To Elasticsearch Using The Elasticsearch-Hadoop Connector In Spark Streaming

I am writing a DStream to Elasticsearch using the Elasticsearch-Hadoop connector. Here is the link where you can find the connector:
https://www.elastic.co/guide/en/elasticsearch/hadoop/5.6/spark.html
I need to process the window, write all the documents to ES using the "JavaEsSpark.saveToEs" method, make sure all the documents are written, and then commit the offsets to Kafka. Since JavaEsSpark.saveToEs inserts documents in batch mode, I cannot keep track of my documents.
My basic code is below. Does anyone have any suggestions?
dstream.foreachRDD((items, time) -> {
    JavaEsSpark.saveToEs(items, "myindex/mytype");
    // wait until all the documents are written
    // do something else, then return (actually the job is committing Kafka offsets)
});
You can encapsulate your function in a Try (this is a Scala example):
Try {
  rdd.saveToEs(AppSettings.Elastic.Resource, configuration)
} match {
  case Failure(f) =>
    logger.error(s"SaveToEs failed: $f") // or whatever you want
  case _ =>
}
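If the goal is to commit the Kafka offsets only after a successful write, one possible sketch (assuming the spark-streaming-kafka-0-10 direct stream API, a stream variable named stream whose record values are JSON strings, and the same index as in the question) is to commit inside the same foreachRDD once the save returns without throwing:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import org.elasticsearch.spark._
import scala.util.{Failure, Success, Try}

stream.foreachRDD { rdd =>
  // capture the offset ranges before transforming the RDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val docs = rdd.map(_.value) // assumes the Kafka record values are JSON documents

  Try(docs.saveJsonToEs("myindex/mytype")) match {
    case Success(_) =>
      // saveJsonToEs is a blocking action that throws on failure,
      // so reaching this point means the batch has been written
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    case Failure(f) =>
      // skip the commit so the batch is reprocessed after a restart
      println(s"SaveToEs failed: $f")
  }
}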

logstash - add only first time value

Here's what I want; it's a bit the opposite of incremental data.
Some of the data are logs with a specific token, and I want to be able to keep (or to show in Elasticsearch) only the first submitted data, the oldest information for each token.
I want to ignore any new log with the same token.
How can I do that? Is it done in Logstash or Elasticsearch?
Thanks
Update 2016-05-31
I think we can look at this from a different perspective, but globally what I want is the table like in the picture, without the red lines: I want them to be ignored by Logstash, or not displayed in ES queries.
I know it could be done if I were able to add a flag to the lines I want to delete, but that's not possible; the only thing that tells us they can be removed is that we already have a key first-AAA that has been logged before.
At logging time, we don't have this information.
You can achieve this using the elasticsearch filter. The filter checks in ES whether the record already exists, and if it does, we ask Logstash to simply drop the event.
Note that I'm making the assumption that the Id field (AAA) is used as the document _id and is also present in the document as the Id field. Feel free to change whatever needs to change, but this will work.
input {
  ...
}
filter {
  elasticsearch {
    hosts => ["localhost:9200"]
    query => "_type:your_type AND _id:%{[Id]}"
    fields => {"Id" => "found"}
  }
  if [found] {
    drop {}
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    ...
  }
}

Trouble migrating database tables with groovy sql

I want to select data from one table and insert it into another in batches, using different SQL connections. The two tables are set up exactly the same. At the moment I have:
destination.withBatch(1000) { stmt ->
    source.eachRow(selectQuery) {
        String insertString = """
            INSERT INTO dest_table
            VALUES (
                ${it[0]},
                ${it[1]});
        """
        try {
            stmt.addBatch(insertString)
        }
        catch (Exception e) {
            println insertString
        }
    }
}
Something seems to happen to the data types in this process, because it gets very unhappy inserting a string like 'a:string' because of the colon.
I could do '${it[0]}' to force it to be treated as a String, but this will cause problems when I get to other data types.
Furthermore, my error handling is definitely not working correctly. I want it to print out the inserts that it was unable to execute and then carry on gracefully.
Thanks
It's likely that Groovy SQL is creating a prepared statement from your SQL string, and that anything with a colon in it is parsed as a parameter placeholder.
So I'd suggest going with the flow and binding your data values separately, instead of putting them inline in the SQL statement. This will also probably improve performance, as the prepared statements can be cached by the database.
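A minimal sketch of that approach, assuming the two-column layout from the question, uses the withBatch variant that takes the INSERT statement up front and binds each row's values as parameters:

// destination, source and selectQuery are the groovy.sql.Sql objects and query from the question.
// Placeholders bind the values, so colons inside the data are never parsed as parameters.
def insertSql = 'INSERT INTO dest_table VALUES (?, ?)'

destination.withBatch(1000, insertSql) { ps ->
    source.eachRow(selectQuery) { row ->
        try {
            ps.addBatch([row[0], row[1]])
        } catch (Exception e) {
            // log the offending row and keep going
            println "Failed to batch row ${[row[0], row[1]]}: ${e.message}"
        }
    }
}

One caveat: with JDBC batching, a bad row often only fails when the whole batch is flushed, so the per-row try/catch may not catch everything; if you need row-level reporting you may have to shrink the batch size or handle BatchUpdateException when the batch executes.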
