Logstash - sequential loading of data from different sources

Please tell me: if I have an input section in Logstash where I read data from two different sources, how can I make the data be read in order, so that the data from source-1 is loaded first, and only after it has finished loading is the data from source-2 loaded?
input {
jdbc {
# settings for source-1
}
jdbc {
# settings for source-2
}
}
Right now the data is read from both sources at the same time.
I tried to set this config for the pipeline in pipelines.yml:
pipeline.workers: 1
pipeline.ordered: true
but this does not help; the data is still read at the same time.
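For reference, the full pipelines.yml entry I tried looks roughly like this (the pipeline id and config path are placeholders):
- pipeline.id: jdbc-pipeline
  path.config: "/etc/logstash/conf.d/jdbc.conf"
  pipeline.workers: 1
  pipeline.ordered: true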

Related

How To Be Sure All Documents Are Written To Elasticsearch When Using The Elasticsearch-Hadoop Connector In Spark Streaming

I am writing a DStream to Elasticsearch using the Elasticsearch-Hadoop connector. Here is the link where you can find the connector:
https://www.elastic.co/guide/en/elasticsearch/hadoop/5.6/spark.html
I need to process the window, write all the documents to ES using the "JavaEsSpark.saveToEs" method, make sure all the documents are written, and then commit the offsets to Kafka. Since JavaEsSpark.saveToEs inserts documents in batch mode, I cannot keep track of my documents.
My basic code is below. Does anyone have any suggestions?
dstream.foreachRDD((items, time) -> {
JavaEsSpark.saveToEs(items,"myindex/mytype");
//wait until all the documents are written
//then do something else and return (actually the job is committing Kafka offsets)
});
You can encapsulate your call in a Try (this is a Scala example):
Try {
rdd.saveToEs(AppSettings.Elastic.Resource, configuration)
} match {
case Failure(f) =>
logger.error(s"SaveToEs failed: $f") //or whatever you want
case _ =>
}
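Building on that, a minimal Scala sketch of the whole batch loop might look like the following. It assumes a direct Kafka stream from spark-streaming-kafka-0-10 and the logger from the snippet above, and it only commits the offsets when saveToEs returns without throwing:
import scala.util.{Failure, Success, Try}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import org.elasticsearch.spark.rdd.EsSpark

stream.foreachRDD { rdd =>
  // capture the Kafka offset ranges of this batch before doing anything else
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  Try(EsSpark.saveToEs(rdd, "myindex/mytype")) match {
    case Success(_) =>
      // saveToEs returned, so the batch reached ES; commit the offsets for this batch
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    case Failure(f) =>
      // do not commit, so the batch can be reprocessed after a restart
      logger.error(s"SaveToEs failed: $f")
  }
}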

structured streaming metrics performance?

After trying some methods for monitoring Structured Streaming performance and input/output metrics, I see that a reliable way is to attach a StreamingQueryListener and output the StreamingQueryProgress to get the input/output numbers.
Besides the Spark UI, is there any better way to monitor Structured Streaming performance?
What's the best way to output the queryProgress into a file or Kafka?
What's an efficient way to compare performance (speed, input/output records) between Spark Streaming and Spark Structured Streaming?
One of the ways is to use the ELK stack.
The Spark application can sink its JMX metrics to Logstash, which aggregates the data and sends it to Elasticsearch for indexing.
Kibana can then display the data from Elasticsearch with its visualization capabilities.
1) You need to either include the spark-sql-kafka dependency in build.sbt
(at the moment, I am using Spark 2.2.0 with spark-sql-kafka 0.10),
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
or include --packages when doing spark-submit:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
2) In order for the Spark application to output JMX metrics, all the lines related to the JMX sink need to be uncommented in the metrics.properties file.
Then, during spark-submit, point to the file like below:
--files=metrics.properties --conf spark.metrics.conf=metrics.properties
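For reference, the key line to uncomment in metrics.properties (it ships commented out in Spark's metrics.properties.template) is:
# enable the JMX sink for all instances (master, worker, driver, executor)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink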
3) Install ElasticSearch, Logstash, Kibana.
If you are on Windows, you can start the ELK stack like below:
C:\elasticsearch> .\bin\elasticsearch.bat
C:\logstash> bin\logstash.bat -f .\jmx.conf
C:\kibana> .\bin\kibana.bat
In jmx.conf, the jmx path and the polling frequency need to be configured:
input {
jmx {
path => "C:/logstash/config/jmx/"
polling_frequency => 15
type => "jmx"
nb_thread => 2
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
}
In the jmx path folder, a JSON file needs to be created to list the object_names and attributes that you want Logstash to retrieve.
(Logstash reads this JSON file based on the polling_frequency, so any later update to the file while the Spark applications are running will be picked up; there is no need to restart Logstash.)
You can list the available object_names and attributes in jconsole after you submit the Spark application.
The sample file is as follows:
{
"host" : "localhost",
"port" : 9000,
"alias" : "spark.jmx.sample1",
"queries" : [
{
"object_name" : "kafka.consumer:type=consumer-metrics,client-id=*",
"attributes" : ["incoming-byte-rate","outgoing-byte-rate"],
"object_alias" : "byteRate"
},
{
"object_name" : "metrics:name=local-1528139073457.driver.spark.streaming.e6c4b9da-a7d1-479f-b28f-ba2b9b1397d0.inputRate-total",
"attrivutes" : ["Value"],
"object_alias" : "somethingTeste1"
}
]}
4) Finally, you can access Kibana via http://localhost:5601
and set up the index pattern first. (You should see the date of the data index.)
Then go to the visualization page to create the metrics with the object_names and attributes that you listed in the jmx JSON file.

Elasticsearch, Logstash, Kibana and Grok: How do I break apart the message?

I created a filter to break apart our log files and am having the following issue. I'm not able to figure out how to save the parts of the "message" to their own field or tag or whatever you call it. I'm 3 days new to logstash and have had zero luck with finding someone here who knows it.
So, for example, let's say this is your log line in a log file:
2017-12-05 [user:edjm1971] msg:This is a message from the system.
What you want to do is get the value of the user and set that into some index mapping so you can search for all logs written by that user. Also, you should see the pieces of the message as their own fields in Kibana.
My pipeline.conf file for Logstash looks like this:
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} [sid:%{USERNAME:sid} msg:%{DATA:message}"
}
add_tag => [ "foo_tag", "some_user_value_from_sid_above" ]
}
Now when I run the logger to create logs, the data gets over to ES and I can see it in Kibana, but I don't see foo_tag at all with the sid value.
How exactly do I use this to create the new tag that gets stored into ES so I can see the data I want from the message?
Note: using regex tools, the pattern appears to parse the log format fine, and the Logstash log does not spit out errors when processing.
Also, the Logstash mapping is using some auto-defined mapping, as the path value is nil.
I'm not clear on how to create a mapping for this either.
Guidance is greatly appreciated.
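For what it's worth, a grok sketch that matches the sample line above and captures the user into its own searchable field could look roughly like this (the field names are illustrative, and the date is captured with a custom pattern because the sample line has no time part):
filter {
  grok {
    match => {
      "message" => "(?<log_date>%{YEAR}-%{MONTHNUM}-%{MONTHDAY}) \[user:%{USERNAME:user}\] msg:%{GREEDYDATA:msg}"
    }
  }
}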

How to access available fields of @metadata in Logstash

A sample Logstash instance is running and getting input data from a Filebeat running on another machine in the same network. I need to process some metadata of the files forwarded by Filebeat, for example the modified date of the input file. I found that this information may be available in the @metadata field, and I can access some of its fields like this:
%{[@metadata][type]}
%{[@metadata][beat]}
but I don't know how to access all the kinds of data stored in this field, so that I can extract my own data.
You can add the following configuration to your logstash.conf file:
output {
stdout {
codec => rubydebug {
metadata => true
}
}
}
https://www.elastic.co/blog/logstash-metadata
But this field does not contain the metadata of the input file.
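If the field you are looking for does show up under [@metadata] in the rubydebug output, a mutate sketch like the following (the target field name is illustrative) can copy it into a regular field so that it is kept in the output:
filter {
  mutate {
    add_field => { "beat_name" => "%{[@metadata][beat]}" }
  }
}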

Conditional creating fields depends on filtering results in logstash influxdb output

I'm using Logstash to collect sar metrics from the server and store them in InfluxDB.
Metrics from different sources (CPU, memory, network) should be inserted into different series in InfluxDB. Of course, the number and names of the fields in those series depend on the type of metric source.
This is my config file: https://github.com/evgygor/test/blob/master/logstash.conf
For each [type] of metric I have to configure a separate influxdb output. In this example I configured two types of metrics, but I'm planning to use it for sar metrics, JMX metrics, and CSV metrics from JMeter, which means I need to configure the appropriate output for each of them (tens of outputs).
Questions:
How can I elaborate the desired configuration?
Is there any option to use conditionals inside the plugin? Example:
if [type]=="system.cpu" {
data_points => {
"time" => "%{time}"
"user" => "%{user}"
}
}
else {
data_points => {
"time" => "%{time}"
"kbtotalmemory" => "%{kbtotalmemory}"
"kbmemfree" => "%{kbmemfree}"
"kbmemused" => "%{kbmemused}"
}
}
Is there any flag to tell the influxdb plugin to use the field names/data types from the input by default?
Is there any flag/ability to define a default datatype?
Is there any ability to reserve the field name "time" with datatype integer?
Thanks a lot.
I cooked up a nice solution.
This fork permits creating fields on the fly, according to the field names and datatypes that arrive at the output plugin.
I added 2 configuration parameters:
# This setting removes the need to use the data_points and coerce_values configuration
# to create an appropriate insert into InfluxDB. It should be used together with the fields_to_skip configuration.
# It takes the data point (column) names and values from the fields of the event that arrives at the plugin.
config :use_event_fields_for_data_points, :validate => :boolean, :default => true

# The array of keys to exclude from further processing.
# By default, the event that arrives at the output plugin contains the keys "@version" and "@timestamp",
# and can contain other fields, for example "command", added by the EXEC input plugin.
# Of course we don't need those fields to be processed and inserted into InfluxDB when
# use_event_fields_for_data_points is true.
# We don't delete the keys from the event itself; we create a new Hash from the event and then delete
# the unwanted keys from it.
config :fields_to_skip, :validate => :array, :default => []
This is my example config file: I'm retrieving a different number of fields with different names from CPU, memory, and disks, but I don't need a different configuration per data type as in the master branch. I'm creating the relevant field names and datatypes at the filter stage and just skipping the unwanted fields in the output plugin.
https://github.com/evgygor/logstash-output-influxdb
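As an illustration, with this fork the output section could be reduced to something like the following sketch (the host, database name, and the list of skipped fields are assumptions based on the description above):
output {
  influxdb {
    host => "localhost"
    db => "metrics"
    use_event_fields_for_data_points => true
    fields_to_skip => ["@version", "@timestamp", "command"]
  }
}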
