After trying several methods for monitoring Structured Streaming performance and its input/output metrics, I have found that a reliable way is to attach a StreamingQueryListener and log the StreamingQueryProgress to get the input/output numbers (roughly as sketched below).
Besides the Spark UI, is there a better way to monitor Structured Streaming performance?
What's the best way to output the query progress to a file or to Kafka?
What's an efficient way to compare performance (speed, input/output record counts) between Spark Streaming and Spark Structured Streaming?
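For context, here is a minimal sketch of the listener approach mentioned above; the output file path is a placeholder, and it assumes the Spark 2.x Structured Streaming API:
import java.io.{FileWriter, PrintWriter}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("monitoring-sample").getOrCreate()

// append each progress update (as JSON) to a local file; a Kafka producer
// could be used here instead if the progress should go to a topic
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val out = new PrintWriter(new FileWriter("/tmp/query-progress.json", true))
    try out.println(event.progress.json) finally out.close()
  }
})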
One way is to use the ELK stack.
The Spark application can expose its metrics over JMX to Logstash, which aggregates the data and sends it to Elasticsearch for indexing.
Kibana can then display the data from Elasticsearch with its visualization capabilities.
1) You need to either include the spark-sql-kafka dependency in build.sbt
(at the moment, I am using Spark 2.2.0 with spark-sql-kafka-0-10),
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
or include --packages when doing spark-submit:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
2) In order for the Spark application to expose its metrics over JMX, all the JMX-related lines need to be uncommented in the file metrics.properties (see the snippet below).
Then, during spark-submit, point to the file like this:
--files=metrics.properties --conf spark.metrics.conf=metrics.properties
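For reference, this is roughly what the relevant uncommented line looks like; it is a minimal sketch based on Spark's metrics.properties.template rather than the full file:
# enable the JMX sink for all instances (driver, executors, ...)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink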
3) Install Elasticsearch, Logstash, and Kibana.
If you are on Windows, you can start the ELK stack like this:
C:\elasticsearch> .\bin\elasticsearch.bat
C:\logstash> bin\logstash.bat -f .\jmx.conf
C:\kibana> .\bin\kibana.bat
In jmx.conf, the JMX path and the polling frequency need to be configured:
input {
jmx {
path => "C:/logstash/config/jmx/"
polling_frequency => 15
type => "jmx"
nb_thread => 2
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
}
In the JMX path folder, a JSON file needs to be created that lists the object_names and attributes you want Logstash to retrieve.
(Logstash re-reads this JSON file at every polling_frequency interval, so any update made to the file while the Spark applications are running will be picked up; there is no need to restart Logstash.)
You can find the available object_names and attributes in JConsole after you submit the Spark applications.
A sample file is as follows:
{
"host" : "localhost",
"port" : 9000,
"alias" : "spark.jmx.sample1",
"queries" : [
{
"object_name" : "kafka.consumer:type=consumer-metrics,client-id=*",
"attributes" : ["incoming-byte-rate","outgoing-byte-rate"],
"object_alias" : "byteRate"
},
{
"object_name" : "metrics:name=local-1528139073457.driver.spark.streaming.e6c4b9da-a7d1-479f-b28f-ba2b9b1397d0.inputRate-total",
"attrivutes" : ["Value"],
"object_alias" : "somethingTeste1"
}
]}
4) Finally, access Kibana via http://localhost:5601.
Set up the index pattern first (you should see the date-stamped data index).
Then go to the Visualize page to create metrics from the object_names and attributes that you listed in the JMX JSON file.
Related
I am writing a DStream to Elasticsearch using the Elasticsearch-Hadoop connector. You can find the connector at this link:
https://www.elastic.co/guide/en/elasticsearch/hadoop/5.6/spark.html
I need to process the window, write all the documents to ES using the JavaEsSpark.saveToEs method, make sure all the documents are written, and then commit the offsets to Kafka. Since JavaEsSpark.saveToEs inserts documents in batch mode, I cannot keep track of my documents.
My basic code is below. Any suggestions?
dstream.foreachRDD((items, time) -> {
JavaEsSpark.saveToEs(items,"myindex/mytype");
// wait until all the documents are written
// do something else, then return (actually the job is committing Kafka offsets)
});
You can encapsulate your call in a Try (this is a Scala example):
import scala.util.{Failure, Try}

Try {
rdd.saveToEs(AppSettings.Elastic.Resource, configuration)
} match {
case Failure(f) =>
logger.error(s"SaveToEs failed: $f") //or whatever you want
case _ =>
}
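To connect this back to committing the Kafka offsets only after a successful write, here is a rough, untested sketch; it assumes a direct stream created with spark-streaming-kafka-0-10 (named stream here), a logger, and the elasticsearch-spark Scala API:
import scala.util.{Failure, Success, Try}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

stream.foreachRDD { rdd =>
  // capture the offset ranges of this micro-batch before writing
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  Try(rdd.map(_.value).saveToEs("myindex/mytype")) match {
    case Success(_) =>
      // saveToEs is a blocking action: if it returns without throwing,
      // the batch has been handed to ES, so the offsets can be committed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    case Failure(f) =>
      logger.error(s"saveToEs failed, offsets not committed: $f")
  }
}
The key point is that the exception (or absence of one) from saveToEs is what tells you whether it is safe to commit.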
I'm setting up my log server. I'm forwarding logs with Fluentd to Kafka and then storing them in Cassandra for later use. For this I'm using the Kafka-Cassandra sink connector. I have to store the data chronologically, for which I need to add a timestamp to my messages in Cassandra. How can this be done?
The Datamountaineer connector uses KCQL, which I think doesn't support inserting a timestamp into a log.
My connector configuration is as follows:
name=cassandra-sink
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=test_AF1
connect.cassandra.kcql=INSERT INTO test_event1 SELECT now() as id, message as msg FROM test_AF1 TIMESTAMP=sys_time()
connect.cassandra.port=9042
connect.cassandra.contact.points=localhost
connect.cassandra.key.space=demo
Kafka Connect's Single Message Transform can do this. Here's an example:
{
"connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
"topics": "test_AF1",
…
"transforms": "addTS",
"transforms.addTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.addTS.timestamp.field": "op_ts"
}
This adds a field to the message payload called op_ts with the timestamp of the Kafka message.
I don't know how this interacts with KCQL; you might want to check out the other two Cassandra sinks that I'm aware of:
https://www.confluent.io/hub/confluentinc/kafka-connect-cassandra
https://www.confluent.io/hub/datastax/kafka-connect-dse
We have around 50 servers from which we get log4j logs. The folders that log4j writes to are mounted on a machine running Logstash, which pushes these logs into Elasticsearch. It creates an index in Elasticsearch called logstash-2018.06.25 where it stores all the log information. Now I have to delete the old logs. I have read on the internet that delete-by-query is not a good approach, and that we should instead use Curator (Elasticsearch), which can delete a whole index. How can I configure my Logstash so that it creates an index based on the date?
So it will create one index per day:
the 25-Jun-2018 index would be created on 25-Jun-2018, and
similarly the 26-Jun-2018 index would be created on 26-Jun-2018.
This way I would be able to drop the indices for older days, and with this approach Elasticsearch should also perform faster.
How should I configure my Logstash to achieve this?
In the elasticsearch output plugin you can configure the index name as follows:
output {
elasticsearch {
...
index => "logstash-%{+YYYY.MM.dd}"
...
}
}
I am new to Solr, so please bear with me. I want to save data from Spark to Solr, which I have done using SolrCloud mode through Spark, but I want to define the schema file and solrconfig file for a particular core.
I do not want to use the DSE process. I have saved the data using the following:
val writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "sample")
in.write.format("solr").options(writeToSolrOpts).save()
where sample is the collection name and zkhost is the ZooKeeper host and port.
I want to set up Kafka Connect to move data from a Kafka topic to Cassandra.
The problem is simple: say I have a demo topic in Kafka with JSON data like
{"id":"1", "name":"Alex", "clicks":2}
I would like to automatically push it into a Cassandra table with the columns id, name, clicks.
I'm looking into kafka-connect-cassandra, but the only example I can find reads from Cassandra and writes to another Cassandra table with Kafka in the middle.
My question is: how can I make it read from Kafka and not Cassandra?
I'm looking for an open-source connector with an example of doing that.
The example you are referring to showcases both the source and sink features of the connector together. If your use case is to push data from a Kafka topic to a Cassandra table, then all you need is a sink. Follow these steps:
Create your own sink properties file, using this as an example, and save it as my-sink.properties (see the sketch after these steps).
Go to the installation's home directory and execute the command CLASSPATH=<<path-to-connector-jar>> ./bin/connect-standalone connect-standalone.properties my-sink.properties
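For illustration, a minimal my-sink.properties could look something like this; it is only a sketch that reuses the connect.cassandra.* settings shown elsewhere in this thread, with the table and keyspace names as placeholders:
name=cassandra-sink
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=demo
connect.cassandra.kcql=INSERT INTO your_table SELECT * FROM demo
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=your_keyspace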
If you are interested in an example with more detailed steps, see https://github.com/yaravind/kafka-connect-jenkins#standalone-mode (full disclosure: I maintain that connector for Jenkins).
I had the same issue, and I followed what's on https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/. I'm using the DataMountaineer driver (http://docs.datamountaineer.com/en/latest/cassandra-sink.html), and setting it up in distributed mode.
Once you have it set up, your JSON configuration for the Cassandra connector (uploaded via the Kafka Connect REST API) should look something like:
{
"name": "cassandra.sink.yourConfigName",
"config": {
"connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
"tasks.max": "1",
"topics": "<your topic>",
"connect.cassandra.sink.kcql": "INSERT INTO <your_table> SELECT * FROM <your_kafka_topic>;",
"connect.cassandra.contact.points": "<cassandra nodes>",
"connect.cassandra.port": "<cassandra port>",
"connect.cassandra.key.space": "<cassandra keyspace>",
"connect.cassandra.username": "cassandra",
"connect.cassandra.password": "cassandra"
}}
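To upload it, a curl call against the Connect REST API along these lines can be used; the file name and the default worker port 8083 are assumptions and may differ in your setup:
curl -X POST -H "Content-Type: application/json" \
     --data @cassandra-sink.json \
     http://localhost:8083/connectors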