Recently I was implementing a Kryo serializer for a Kafka Streams DSL application. Kryo is not thread-safe by default, and serialization methods were throwing exceptions most likely caused by unsynchronized access from multiple threads. Adding synchronization solved the problem, but raised some questions.
What is the threading model of a Kafka Streams application with respect to the different processing objects? Which objects are shared between threads and which are used by a single thread only? Is it safe to keep unsynchronized local state (fields, not state stores) in these objects?
I'm especially interested in Processor / Transformer and Serializer / Deserializer objects.
I've seen this answer, but it's still not clear to me. Note I'm not trying to share any state between threads, but rather to avoid having such shared state.
Both the DSL and the PAPI require a supplier (i.e. a factory) for Processors / Transformers, and the Serde interface is also a factory, so I assumed that a single instance is created per thread or per task. This assumption seems to be false, but at the same time it seems very odd to accept a factory, create multiple instances, and then access them from multiple threads at the same time.
My Serde implementation basically looks like this (a new Kryo instance is created per Serializer / Deserializer instance):
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

import com.esotericsoftware.kryo.Kryo;

public class MySerde<T> implements Serde<T> {
    @Override
    public Serializer<T> serializer() {
        final Kryo kryo = new Kryo();
        return (topic, data) -> { /* use kryo instance */ };
    }

    @Override
    public Deserializer<T> deserializer() {
        final Kryo kryo = new Kryo();
        return (topic, data) -> { /* use kryo instance */ };
    }
}
The serdes were invoked for reading and writing the repartition topic:
stream
    .groupByKey(Grouped.with(/* set serdes here */))
    .windowedBy(...)
    .aggregate(...)
I've put up a small demo application to detect concurrent access to serdes and transformers: https://github.com/sukhinin/kafka-streams-threading.
Serializers and deserializers are accessed concurrently from multiple threads and thus must be thread-safe.
Transformers are accessed from a single thread at a time.
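Given that finding, one way to make the Kryo-backed serde above safe for concurrent use is to keep one Kryo instance per thread via a ThreadLocal, so no synchronization is needed on the hot path. This is only a minimal sketch (the class name is mine, and depending on your Kryo version you may need to register classes or relax the registration requirement):

import java.io.ByteArrayOutputStream;

import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class ThreadSafeKryoSerde<T> implements Serde<T> {

    // One Kryo instance per thread: concurrent calls never share a Kryo object.
    private static final ThreadLocal<Kryo> KRYO = ThreadLocal.withInitial(() -> {
        Kryo kryo = new Kryo();
        // Depending on the Kryo version, register your classes here or call
        // kryo.setRegistrationRequired(false).
        return kryo;
    });

    @Override
    public Serializer<T> serializer() {
        return (topic, data) -> {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Output output = new Output(bytes)) {
                KRYO.get().writeClassAndObject(output, data);
            }
            return bytes.toByteArray();
        };
    }

    @Override
    @SuppressWarnings("unchecked")
    public Deserializer<T> deserializer() {
        return (topic, data) -> {
            if (data == null) {
                return null;
            }
            try (Input input = new Input(data)) {
                return (T) KRYO.get().readClassAndObject(input);
            }
        };
    }
}

Pooling Kryo instances would also work, but a ThreadLocal is usually the simplest fit for Kafka Streams, where each stream thread drives its own tasks.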
Still I would be grateful to hear from someone with more Kafka Streams experience.
To my understanding, Spark works like this:
For standard variables, the Driver sends them together with the lambda (or rather, the closure) to the executors once for each task that uses them.
For broadcast variables, the Driver sends them to the executors only once, the first time they are used.
Is there any advantage to using a broadcast variable instead of a standard variable when we know it will be used only once, so that there would be only one transfer even in the case of a standard variable?
Example (Java):
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SparkDriver {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("J", "Java");
        dictionary.put("S", "Spark");

        SparkConf conf = new SparkConf()
                .setAppName("Try BV")
                .setMaster("local");

        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            final Broadcast<Map<String, String>> dictionaryBroadcast = context.broadcast(dictionary);

            context.textFile(inputPath)
                   .map(line -> { // just one transformation using the broadcast variable
                       Map<String, String> d = dictionaryBroadcast.value();
                       String[] words = line.split(" ");
                       StringBuffer sb = new StringBuffer();
                       for (String w : words)
                           sb.append(d.get(w)).append(" ");
                       return sb.toString();
                   })
                   .saveAsTextFile(outputPath); // just one action!
        }
    }
}
There are several advantages to using broadcast variables, even if you use each one only once:
You avoid several serialization problems. When you serialize an anonymous inner class that uses a field of the enclosing class, the whole enclosing class must be serialized with it. Spark and other frameworks have workarounds that partially mitigate this problem, but sometimes the ClosureCleaner doesn't do the trick. You can avoid the NotSerializableExceptions with tricks such as copying a class member variable into a local variable before the closure, or turning the anonymous inner class into a named class whose constructor takes only the required fields.
If you use a broadcast variable you don't even have to think about that: the broadcast mechanism serializes only the required value. I suggest reading the not-serializable-exception question and its first answer to get a better grasp of the concept.
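To make this serialization point concrete, here is a hypothetical sketch (class and method names are made up) of the "copy a member variable into a local variable" trick; the same capture rule applies to Java lambdas that reference an instance field:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;

// Note: WordExpander deliberately does NOT implement Serializable.
public class WordExpander {

    private final Map<String, String> dictionary = new HashMap<>();

    public JavaRDD<String> expandBad(JavaRDD<String> lines) {
        // Referencing the instance field inside the lambda captures `this`,
        // so Spark would have to serialize the whole WordExpander object
        // and rejects the job with a "Task not serializable" error.
        return lines.map(line -> dictionary.get(line));
    }

    public JavaRDD<String> expandGood(JavaRDD<String> lines) {
        // Copying the field into a local variable first means only the
        // (serializable) HashMap is captured and shipped with the task.
        final Map<String, String> localDict = dictionary;
        return lines.map(line -> localDict.get(line));
    }
}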
The serialization performance of the closure is, most of the time, worse than that of a specialized serialization method. As the official Spark documentation on data-serialization says:
> Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Searching the Spark classes in the official Spark repo, I saw that the closure is serialized through the variable SparkEnv.get.closureSerializer. The only assignment of that variable is the one at line 306 of the SparkEnv class, which uses the standard and inefficient JavaSerializer.
In that case, if you serialize a big object you lose some performance due to network bandwidth. This could also explain why the official documentation suggests switching to broadcast variables for tasks larger than about 20 KiB.
There is only one copy per machine, which is an advantage when more than one executor runs on the same physical machine.
> Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
The distribution algorithm is probably a lot more efficient. Given the immutability of a broadcast variable, it is not difficult to imagine distributing it with a peer-to-peer algorithm instead of a centralized one: for example, as soon as the driver has finished sending the broadcast variable to the first executor it sends it to the second, while in parallel the first executor sends the data to a third, and so on (the BitTorrent page on Wikipedia has a nice illustration of this idea).
I haven't dug into Spark's implementation, but as the documentation of broadcast variables says:
> Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Surely a more efficient algorithm than the trivial centralized one can be designed by exploiting the immutability of the broadcast variable.
Long story short: using a closure and using a broadcast variable are not the same thing. If the object you are sending is big, use a broadcast variable.
Please refer to this excellent article: https://www.mikulskibartosz.name/broadcast-variables-and-broadcast-joins-in-apache-spark/ I could rewrite it here, but it serves the purpose well and answers your question.
In summary:
A broadcast variable is an Apache Spark feature that lets us send a read-only copy of a variable to every worker node in the Spark cluster.
The broadcast variable is useful only when we want to:
Reuse the same variable across multiple stages of the Spark job (a small sketch of this case follows below)
Speed up joins via a small table that is broadcast to all worker nodes, not to all executors
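Here is that small sketch: a self-contained example (class name and paths are placeholders, reusing the dictionary idea from the question) where one broadcast value backs two actions, so the map is shipped to each worker once rather than once per task:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastReuse {
    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("J", "Java");
        dictionary.put("S", "Spark");

        SparkConf conf = new SparkConf().setAppName("Broadcast reuse").setMaster("local");
        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            Broadcast<Map<String, String>> dict = context.broadcast(dictionary);
            JavaRDD<String> lines = context.textFile(args[0]);

            // First action: expand abbreviations and write them out.
            lines.map(line -> dict.value().getOrDefault(line.trim(), line))
                 .saveAsTextFile(args[1]);

            // Second action: the same broadcast value is reused without
            // shipping the dictionary with the tasks again.
            long known = lines.filter(line -> dict.value().containsKey(line.trim())).count();
            System.out.println("Lines that are known abbreviations: " + known);
        }
    }
}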
I want to use hazelcast-jet-kafka in my app because the number of Kafka partitions in my case is limited. As I understand it, Jet's parallelism doesn't depend on the number of Kafka partitions; it would be nice to find an explanation of how jet-kafka achieves this independence from the number of Kafka partitions.
But my question is how I can handle events in Jet when my event handler cannot be made serializable.
For example, I've found one solution: use a map sink and add a local event listener to that map. But to me this seems like a crutch, because I don't need to store these events in a map. Is it possible to set the map size to zero in such a scheme?
I also see a new type of sink in the docs, Observable, which seems to be what I want, but an Observable listener cannot be limited to local entries only, so it is not suitable for me.
Could you help me find the right solution? Or is hazelcast-jet-kafka not a good choice in this case?
> it would be nice to find an explanation of how jet-kafka achieves independence from the number of Kafka partitions.
One Jet thread can handle any number of partitions, so it's easy to achieve this independence. Jet just distributes all the partitions fairly among all the Kafka connector threads.
> But my question is how I can handle events in Jet when my event handler cannot be made serializable.
Hazelcast Jet doesn't require your event handler to be serializable. If you need a stateful handler, you have to supply a function that creates the state object. The function must be serializable, but the state doesn't have to be. If you just want a stateless mapping function, it must be serializable, but usually there's no problem with that.
If you are getting an error that says a function is non-serializable, this can be due to a common pitfall of capturing more state than you actually need in the lambda. You should show your code in that case.
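For illustration, here is a minimal sketch of the stateful-handler approach, assuming Hazelcast Jet 4.x APIs (Pipeline, KafkaSources.kafka, mapStateful); MyEventHandler is a made-up, non-serializable class standing in for your handler, and only the supplier that creates it has to be serializable:

import java.util.Map;
import java.util.Properties;

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

public class NonSerializableHandlerJob {

    // Hypothetical handler: may hold arbitrary non-serializable resources.
    static class MyEventHandler {
        String handle(Map.Entry<String, String> event) {
            return "handled: " + event.getValue();
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(KafkaSources.<String, String>kafka(props, "events"))
                .withoutTimestamps()
                // The constructor reference is a serializable supplier; the
                // handler itself is created on the member and never serialized.
                .mapStateful(MyEventHandler::new, (handler, event) -> handler.handle(event))
                .writeTo(Sinks.logger());

        JetInstance jet = Jet.bootstrappedInstance();
        jet.newJob(pipeline).join();
    }
}

Note that non-keyed mapStateful runs on a single processor; if your handler is really a per-item service rather than evolving state, mapUsingService with ServiceFactories.nonSharedService is an alternative worth looking at.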
When configuring Spark to use Kafka, as described here:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
the Spark folks state the following:
"interceptor.classes: Kafka source always read keys and values as byte arrays. It’s not safe to use ConsumerInterceptor as it may break the query."
Then I saw the following here: https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L307
val otherUnsupportedConfigs = Seq(
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, // committing correctly requires new APIs in Source
  ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG) // interceptors can modify payload, so not safe

otherUnsupportedConfigs.foreach { c =>
  if (params.contains(s"kafka.$c")) {
    throw new IllegalArgumentException(s"Kafka option '$c' is not supported")
  }
}
Yes, it is not safe in general, but when one knows what one is doing it can be used safely. Hence I wonder why it is blocked by design, and whether there is a workaround.
We have a front layer which just receives messages and writes to the Kafka topics for back-end processing. We send the messages at a very high rate; per day we process 1 billion messages. We have a thread pool which accepts the messages and writes to the Kafka producer instance. Here I have created only one producer (single instance) which is shared among multiple threads.
Recently, I have been observing that 90% of the threads are in the blocked state. I found out that Kafka was sending the data sequentially: there was a synchronized block in the producer.send() method of the Kafka producer client:
def send(messages: KeyedMessage[K,V]*) {
  lock synchronized {
    if (hasShutdown.get)
      throw new ProducerClosedException
    recordStats(messages)
    sync match {
      case true => eventHandler.handle(messages)
      case false => asyncSend(messages)
    }
  }
}
The documentation says that we don't need to create multiple producer instances; one instance can be shared in a multi-threaded environment. But how can we do that? Or would it be better to create a pool of producer instances?
The reason why it is recommended to share the producer client across threads is that it leads to better batching, as messages are batched at the partition level. Better batching leads to better compression (when enabled) and better throughput. You can consider tuning parameters like buffer.memory, linger.ms and batch.size to optimize throughput.
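For reference, a hedged sketch of that kind of tuning with the modern Java producer; the broker address, topic and numeric values are placeholders to be adjusted against your own measurements:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SharedProducerFactory {

    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batching / buffering knobs mentioned above (placeholder values).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);              // batch.size
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                      // linger.ms
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128L * 1024 * 1024);  // buffer.memory
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");            // bigger batches compress better

        // One instance, shared by all threads of the front layer.
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        KafkaProducer<String, String> producer = create();
        // send() is asynchronous in the modern Java client: calling threads only
        // hand the record to the producer's buffer, and the I/O thread batches it.
        producer.send(new ProducerRecord<>("events", "key", "value"));
        producer.flush();
        producer.close();
    }
}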
Once this is done, you can consider adding multiple producers.
Also, consider increasing the number of partitions for the topic if its incoming rate is quite high.
I am new to Spark Streaming and Big Data in general. I am trying to understand the structure of a project in Spark. I want to create a main class, let's say "driver", with M machines, where each machine keeps an array of counters and their values. On a single machine, outside Spark, I would create a class for the machines and a class for the counters and do the computations I want. But I am wondering whether the same applies in Spark. Would the same project, but in Spark, have the structure I am quoting below?
class Driver {
  var num: Int = 100
  var machines: Array[Machine] = new Array[Machine](num)
  // split incoming DStream and fill machines' queues
}

class Machine {
  var counters = new Queue[(Int, Int)]() // e.g. counter with id 1 and value 25
  def fillCounters: Unit = { ... } // function to fill the queue of counters
}
In general, you can picture a Spark application as having a driver part, which runs all the coordination tasks and constructs the graph of operations to apply to your data (you will find mentions of the directed acyclic graph, or DAG, in the theoretical parts of tutorials on Spark and distributed computation), and an executor part, which results in many copies of your code being sent to the nodes of the cluster to run over the data.
The main idea is that the driver extracts the part of your application's code that needs to run locally with the data on the nodes, serializes it, sends it over the network to each executor, launches it, manages it and collects the results.
The Spark framework hides these details for ease of use, so the application you develop looks like a single-threaded application.
A developer can separate the contexts that run on the driver and on the executors, but this is not very common in tutorials (again, for simplicity).
So, to answer the actual question above:
you do not need to design your application in the way you demonstrated above, unless you really want to.
Just follow the official Spark tutorial to get a viable solution, and split it afterwards according to the execution contexts.
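For example, here is a minimal sketch in that spirit (in Java, with a socket source standing in for your real input and a made-up line format: each line carries a counter id and a value). Spark distributes the work itself, so there is no need for an explicit array of Machine objects in the driver:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class CounterStreaming {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("Counters").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Each line is assumed to look like "counterId value".
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        JavaPairDStream<Integer, Integer> counters = lines
                .mapToPair(line -> {
                    String[] parts = line.split(" ");
                    return new Tuple2<>(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
                })
                // Spark splits this work across the cluster; no explicit
                // Array[Machine] is needed in the driver.
                .reduceByKey(Integer::sum);

        counters.print();

        ssc.start();
        ssc.awaitTermination();
    }
}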
There is a good post summarizing a lot of Spark tutorials, videos and talks; you can find it here on SO.