How to enable Kafka Producer Metrics in Spark? - apache-spark

We are using Kafka 0.10 with Spark 2.1, and I found that our producer publishes messages very slowly. I can only reach around 1k/s even after giving 8 cores to the Spark executors, while other posts say they can reach millions of messages per second easily.
I tried tuning linger.ms and batch.size to find out why. However, linger.ms=0 looks optimal for me and batch.size doesn't have much effect. I am sending 160k events per iteration. It looks like I have to enable the Kafka producer metrics to know what exactly is happening, but it doesn't seem easy to enable them in a Spark executor.
Could anyone shed some light on this?
My code looks like this:
private def publishMessagesAttempt(producer: KafkaProducer[String, String], topic: String, messages: Iterable[(String, String)], producerMaxDelay: Long,
    individualMessageMaxDelay: Long, logger: (String, Boolean) => Unit = KafkaClusterUtils.DEFAULT_LOGGER): Iterable[(String, String)] = {
  val futureMessages = messages.map(message => (message, producer.send(new ProducerRecord[String, String](topic, message._1, message._2))))
  val messageSentTime = System.currentTimeMillis
  val awaitedResults = futureMessages.map { case (message, future) =>
    val waitFor = Math.max(producerMaxDelay - (System.currentTimeMillis - messageSentTime), individualMessageMaxDelay)
    val failed = Try(future.get(waitFor, TimeUnit.MILLISECONDS)) match {
      case Success(_) => false
      case Failure(f) =>
        logger(s"Error happened when publish to Kafka: ${f.getStackTraceString}", true)
        true
    }
    (message, failed)
  }
  awaitedResults.filter(_._2).map(_._1)
}

I finally found the answer.
1. KafkaProducer has a metrics() method that returns the producer's metrics. Simply printing them should be enough.
Code like this should work:
public class MetricsProducerReporter implements Runnable {

    private final Producer<String, StockPrice> producer;
    private final Logger logger =
            LoggerFactory.getLogger(MetricsProducerReporter.class);

    // Used to filter just the metrics we want
    private final Set<String> metricsNameFilter = Sets.set(
            "record-queue-time-avg", "record-send-rate", "records-per-request-avg",
            "request-size-max", "network-io-rate", "record-queue-time-avg",
            "incoming-byte-rate", "batch-size-avg", "response-rate", "requests-in-flight"
    );

    public MetricsProducerReporter(
            final Producer<String, StockPrice> producer) {
        this.producer = producer;
    }

    @Override
    public void run() {
        while (true) {
            final Map<MetricName, ? extends Metric> metrics
                    = producer.metrics();
            displayMetrics(metrics);
            try {
                Thread.sleep(3_000);
            } catch (InterruptedException e) {
                logger.warn("metrics interrupted");
                Thread.interrupted();
                break;
            }
        }
    }

    // Minimal displayMetrics implementation (not shown in the original answer):
    // log only the metrics listed in metricsNameFilter.
    private void displayMetrics(Map<MetricName, ? extends Metric> metrics) {
        metrics.forEach((name, metric) -> {
            if (metricsNameFilter.contains(name.name())) {
                logger.info("{} : {}", name.name(), metric.value());
            }
        });
    }
}
2. My code was slow because Scala's map is not parallel by default. I will have to use messages.par.map() to achieve the parallelism.
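For reference, a minimal sketch of that change (assuming messages, producer, and topic are the same values as in publishMessagesAttempt above); .par turns the collection into a parallel collection, so the send calls are issued from several threads:

// Sketch only: issue the sends from a parallel collection instead of a sequential map.
val futureMessages = messages.par.map { message =>
  (message, producer.send(new ProducerRecord[String, String](topic, message._1, message._2)))
}.seq  // convert back to a sequential collection before awaiting the futures

The rest of the method (waiting on the futures and collecting failures) can stay as it is.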

Related

How can I consume Kafka in parallel with Spark Streaming? I set concurrentJobs but got an error [duplicate]

The Kafka documentation describes an approach to this:
One Consumer Per Thread: A simple option is to give each thread its own consumer instance.
My code:
public class KafkaConsumerRunner implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final CloudKafkaConsumer consumer;
    private final String topicName;

    public KafkaConsumerRunner(CloudKafkaConsumer consumer, String topicName) {
        this.consumer = consumer;
        this.topicName = topicName;
    }

    @Override
    public void run() {
        try {
            this.consumer.subscribe(topicName);
            ConsumerRecords<String, String> records;
            while (!closed.get()) {
                synchronized (consumer) {
                    records = consumer.poll(100);
                }
                for (ConsumerRecord<String, String> tmp : records) {
                    System.out.println(tmp.value());
                }
            }
        } catch (WakeupException e) {
            // Ignore exception if closing
            System.out.println(e);
            //if (!closed.get()) throw e;
        }
    }

    // Shutdown hook which can be called from a separate thread
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }

    public static void main(String[] args) {
        CloudKafkaConsumer kafkaConsumer = KafkaConsumerBuilder.builder()
                .withBootstrapServers("172.31.1.159:9092")
                .withGroupId("test")
                .build();
        ExecutorService executorService = Executors.newFixedThreadPool(5);
        executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log"));
        executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log.info"));
        executorService.shutdown();
    }
}
but it doesn't work and throws an exception:
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Furthermore, I read the source of Flink (an open-source platform for distributed stream and batch data processing). Flink's multi-threaded consumer is similar to mine:
long pollTimeout = Long.parseLong(flinkKafkaConsumer.properties.getProperty(KEY_POLL_TIMEOUT, Long.toString(DEFAULT_POLL_TIMEOUT)));

pollLoop: while (running) {
    ConsumerRecords<byte[], byte[]> records;
    //noinspection SynchronizeOnNonFinalField
    synchronized (flinkKafkaConsumer.consumer) {
        try {
            records = flinkKafkaConsumer.consumer.poll(pollTimeout);
        } catch (WakeupException we) {
            if (running) {
                throw we;
            }
            // leave loop
            continue;
        }
    }
Flink code of the multi-threaded consumer
What's wrong?
The Kafka consumer is not thread-safe. As you pointed out in your question, the documentation states that
A simple option is to give each thread its own consumer instance
But in your code, you have the same consumer instance wrapped by different KafkaConsumerRunner instances, so multiple threads are accessing the same consumer instance. The Kafka documentation clearly states:
The Kafka consumer is NOT thread-safe. All network I/O happens in the
thread of the application making the call. It is the responsibility of
the user to ensure that multi-threaded access is properly
synchronized. Un-synchronized access will result in
ConcurrentModificationException.
That's exactly the exception you received.
The exception is thrown on your call to subscribe: this.consumer.subscribe(topicName);
Move that call into a synchronized block like this:
@Override
public void run() {
    try {
        synchronized (consumer) {
            this.consumer.subscribe(topicName);
        }
        ConsumerRecords<String, String> records;
        while (!closed.get()) {
            synchronized (consumer) {
                records = consumer.poll(100);
            }
            for (ConsumerRecord<String, String> tmp : records) {
                System.out.println(tmp.value());
            }
        }
    } catch (WakeupException e) {
        // Ignore exception if closing
        System.out.println(e);
        //if (!closed.get()) throw e;
    }
}
This may not be your case, but if you are merging the processing of data from several topics, then you can read from multiple topics with the same consumer. If not, it is preferable to create separate jobs, each consuming its own topic.
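For reference, here is a minimal sketch of the one-consumer-per-thread pattern the documentation recommends, written in Scala against the plain kafka-clients API rather than the CloudKafkaConsumer wrapper used above (class and parameter names are illustrative). Each runnable creates and owns its own consumer, so no synchronization is needed:

// Sketch only: one KafkaConsumer per thread, never shared across threads.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

class OwnConsumerRunner(bootstrapServers: String, groupId: String, topic: String) extends Runnable {
  override def run(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Created inside run(), so the consumer belongs to exactly one thread.
    val consumer = new KafkaConsumer[String, String](props)
    try {
      consumer.subscribe(Collections.singletonList(topic))
      while (true) {
        consumer.poll(100).asScala.foreach(record => println(record.value()))
      }
    } finally {
      consumer.close()
    }
  }
}

Each OwnConsumerRunner can then be submitted to an ExecutorService exactly like KafkaConsumerRunner above.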

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and below is what we want to achieve:
1. Group the data by deviceId. This is done, but we are not sure whether we can get a Dataset.
2. Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to a Dataset, and applied the aggregation logic on the Dataset using the Spark SQLContext APIs.
When doing load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process crashed.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far that actually runs on the worker nodes, but we have yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");

        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));

        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };

        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); // every 60 seconds one RDD is created
        machineDataRDD.print();

        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >>" + data.next());
                        }
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The grouping code below gives us an Iterable of Strings per device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping using foreachPartition, so that the code execution gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change our strategy and not try to build a Dataset within the foreachPartition loop.
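For illustration, a rough Scala sketch of that revised strategy, written against the Scala DStream API (the Java version uses the same foreachRDD/foreachPartition calls); parseSpeed is a hypothetical helper that extracts data.speed from the JSON string, analogous to the getMap helper used above. The per-device aggregation is done with plain collections inside foreachPartition, so nothing from SparkSession or SQLContext is needed on the executors:

// Sketch only: aggregate per device with plain Scala collections inside foreachPartition.
groupedData.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { case (deviceId, jsonRecords) =>
      val speeds = jsonRecords.map(parseSpeed)   // parseSpeed: String => Double, assumed helper
      if (speeds.nonEmpty) {
        val avgSpeed = speeds.sum / speeds.size
        // Replace the println with the real aggregation logic or an external sink.
        println(s"$deviceId -> average speed over window: $avgSpeed")
      }
    }
  }
}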

Spark Streaming: Custom Receiver kryo registration with Google Pubsub

I'm using Spark 2.0.2 with Kryo serialization.
I'm attempting to implement a custom receiver for ingesting messages from Google PubSub into Spark Streaming:
class PubSubReceiver(project: String, topic: String, subscription: String)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  val projectFullName = ProjectName.create(project)
  val topicName = TopicName.create(project, topic)
  val subscriptionName = SubscriptionName.create(project, subscription)
  val subscriber = Subscriber.defaultBuilder(subscriptionName, new receiver).build

  def onStart() {
    new Thread() {
      override def run() {
        subscriber.startAsync()
        //ensure subscriber is running as well as spark receiver
        while (subscriber.isRunning && !isStopped()) {
          logger.info(s"${subscriber.getSubscriptionName} receiver running")
          //sleep 10s
          Thread.sleep(10000)
        }
        logger.info(s"${subscriber.getSubscriptionName} receiver stopping")
      }
    }.start()
  }

  def onStop(): Unit = {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  private class receiver extends MessageReceiver {
    override def receiveMessage(message: PubsubMessage, consumer: AckReplyConsumer): Unit = {
      store(ArrayBuffer(message.getData.toByteArray), message.getAttributesMap)
    }
  }
}
However, when running a Spark job that utilizes this receiver, it seems that I have to serialize the job itself, which doesn't seem correct (the Spark context would then be serialized).
object PubSubStreamingIngestionJob extends App {
  //... setup
  lazy val ssc = new StreamingContext(spark.sparkContext, batchInterval)

  // the stream
  lazy val pubsubUnionStream =
    ssc.receiverStream(new PubSubReceiver(projectName, topicName, subscriptionName))

  pubsubUnionStream.map( messageBytes => ...business logic... )

  ssc.start()
  ssc.awaitTermination()
}
The following error is thrown:
java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: com.c2fo.atlas.jobs.streaming.gcp.PubSubStreamingIngestionJob
Note: To register this class use: kryo.register(com.mycompany.package.PubSubStreamingIngestionJob.class);
Serialization trace:
classes (sun.misc.Launcher$AppClassLoader)
contextClassLoader (java.lang.Thread)
threads (java.lang.ThreadGroup)
parent (java.lang.ThreadGroup)
group (java.util.concurrent.Executors$DefaultThreadFactory)
val$backingThreadFactory (com.google.common.util.concurrent.ThreadFactoryBuilder$1)
threadFactory (java.util.concurrent.ScheduledThreadPoolExecutor)
e (java.util.concurrent.Executors$DelegatedScheduledExecutorService)
executor (com.google.cloud.pubsub.spi.v1.Subscriber)
subscriber (com.mycompany.package.PubSubReceiver)
array (scala.collection.mutable.WrappedArray$ofRef)
Is there a better way of implementing this?
The issue was that the Subscriber instance needed to be thread-local to prevent the entire closure from being serialized.
package org.apache.spark.streaming.gcp

import com.c2fo.atlas.util.LazyLogging
import com.google.cloud.pubsub.spi.v1._
import com.google.iam.v1.ProjectName
import com.google.pubsub.v1._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

import scala.collection.mutable.ArrayBuffer

class PubSubReceiver(project: String, topic: String, subscription: String)
  extends Receiver[PubsubMessage](StorageLevel.MEMORY_AND_DISK_2) with LazyLogging {

  val projectFullName = ProjectName.create(project)
  val topicName = TopicName.create(project, topic)
  val subscriptionName = SubscriptionName.create(project, subscription)

  def onStart() {
    new Thread() {
      // crucial change below
      val subscriber = Subscriber.defaultBuilder(subscriptionName, new receiver).build

      override def run() {
        subscriber.startAsync()
        //ensure subscriber is running as well as spark receiver
        while (subscriber.isRunning && !isStopped()) {
          logger.info(s"${subscriber.getSubscriptionName} receiver running")
          //sleep 10s
          Thread.sleep(10000)
        }
        logger.info(s"${subscriber.getSubscriptionName} receiver stopping")
      }
    }.start()
  }

  def onStop(): Unit = {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  class receiver extends MessageReceiver {
    override def receiveMessage(message: PubsubMessage, consumer: AckReplyConsumer): Unit = {
      store(ArrayBuffer(message), message.getAttributesMap)
    }
  }
}

Spark Streaming: Average of all the time

I wrote a Spark Streaming application which receives temperature values and calculates the all-time average temperature. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key). For state tracking I use the StatCounter class, which tracks the temperature values as doubles and re-calculates the average on each batch via the StatCounter.mean method. Here is my program:
EDITED MY WHOLE CODE: NOW USING StatCounter
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));
streamingContext.checkpoint("hdfs://server:8020/spark-history/checkpointing");

JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);

JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
    public SensorData call(String json) throws Exception {
        ObjectMapper om = new ObjectMapper();
        return (SensorData) om.readValue(json, SensorData.class);
    }
});

JavaPairDStream<String, Float> temperatureDStream = sensorDStream.mapToPair(new PairFunction<SensorData, String, Float>() {
    public Tuple2<String, Float> call(SensorData sensorData) throws Exception {
        return new Tuple2<String, Float>(sensorData.getIdSensor(), sensorData.getValTemp());
    }
});

JavaPairDStream<String, StatCounter> statCounterDStream = temperatureDStream.updateStateByKey(new Function2<List<Float>, Optional<StatCounter>, Optional<StatCounter>>() {
    public Optional<StatCounter> call(List<Float> newTemperatures, Optional<StatCounter> statsYet) throws Exception {
        StatCounter stats = statsYet.or(new StatCounter());
        for (float temp : newTemperatures) {
            stats.merge(temp);
        }
        return Optional.of(stats);
    }
});

JavaPairDStream<String, Double> avgTemperatureDStream = statCounterDStream.mapToPair(new PairFunction<Tuple2<String, StatCounter>, String, Double>() {
    public Tuple2<String, Double> call(Tuple2<String, StatCounter> statCounterTuple) throws Exception {
        String key = statCounterTuple._1();
        double avgValue = statCounterTuple._2().mean();
        return new Tuple2<String, Double>(key, avgValue);
    }
});

avgTemperatureDStream.print();
This seems to work fine. But now to the question:
I just found an example online which also shows how to calculate a average of all time here: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
They use AtomicLongs etc. for storing the "stateful values" and update them in a foreachRDD method.
My question now is: what is the better solution for an all-time stateful calculation in Spark Streaming? Are there any advantages or disadvantages of using one approach over the other? Thank you!
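For context, here is a rough Scala sketch of that alternative pattern from the linked example; note that it tracks a single global average rather than one per device, and all names are illustrative. Each batch is reduced on the executors, and only the small per-batch result updates driver-side atomics inside foreachRDD:

// Sketch only: driver-side running totals updated in foreachRDD.
import java.util.concurrent.atomic.{AtomicLong, DoubleAdder}

val totalSum = new DoubleAdder()
val totalCount = new AtomicLong(0L)

temperatureDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val (batchSum, batchCount) = rdd
      .map { case (_, temp) => (temp.toDouble, 1L) }
      .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    totalSum.add(batchSum)
    totalCount.addAndGet(batchCount)
    println(s"All-time average temperature: ${totalSum.sum() / totalCount.get()}")
  }
}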

Measure duration of executing combineByKey function in Spark

I want to measure the time that the execution of the combineByKey function needs. With the code below I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning), independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true? Or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true;    // or false
int partitionsCount = 100;      // between 1 and 200 I can't see any difference in the duration!

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);

long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // measured time is always the same, independent of file size (~20 ms with / ~350 ms without partitioning)

// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);

sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
    public Float call(Float sumYet, String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        float value = Float.valueOf(data[2]);
        sumYet += value;
        return sumYet;
    }
};

// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
    public Float call(Float a, Float b) throws Exception {
        a += b;
        return a;
    }
};

private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
    if (partitioning) {
        return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
    } else {
        return pairRDD;
    }
}

private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
    return input.mapToPair(new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
            String[] data = debsDataSet.split(",");
            int houseId = Integer.valueOf(data[6]);
            return new Tuple2<Integer, String>(houseId, debsDataSet);
        }
    });
}
The web UI provides you with details on the jobs/stages that your application has run. It details the time for each of them, and you can also see various metrics such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080. Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID], to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means it is not applied to your RDD immediately, as opposed to actions (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structures when partitioning or not.
If there is a difference, you will see it when the action runs (takeSample here).
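To make the measurement meaningful, put the timer around a span that includes an action, so the shuffle and combine work actually execute inside it. A minimal sketch in Scala (the Java version is analogous; createCombiner, mergeValue and mergeCombiners stand for Scala equivalents of the combine functions above):

// Sketch only: combineByKey is lazy, so time an action that forces it to run.
val start = System.currentTimeMillis()
val consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners)
consumptionRDD.count()                                   // the action triggers the actual computation
val elapsed = System.currentTimeMillis() - start
println(s"combineByKey + count took $elapsed ms")        // this should now grow with input size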
