When configuring Spark to use Kafka, as described here:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
the Spark folks state the following:
"interceptor.classes: Kafka source always read keys and values as byte arrays. It’s not safe to use ConsumerInterceptor as it may break the query."
Then I saw the following here: https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L307
val otherUnsupportedConfigs = Seq(
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,   // committing correctly requires new APIs in Source
  ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG)  // interceptors can modify payload, so not safe

otherUnsupportedConfigs.foreach { c =>
  if (params.contains(s"kafka.$c")) {
    throw new IllegalArgumentException(s"Kafka option '$c' is not supported")
  }
}
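For reference, this check fires on any consumer option passed with the kafka. prefix, roughly like this (a minimal sketch; the broker and topic names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-interceptor-question").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // made-up broker
  .option("subscribe", "my-topic")                     // made-up topic
  // .option("kafka.interceptor.classes", "com.example.MyInterceptor") // rejected with IllegalArgumentException
  .load()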
Granted, it is not safe in general, but when you know what you are doing it can be used safely. So I wonder why it is blocked by design, and whether there is a workaround?
To my understanding, Spark works like this:
For standard variables, the driver sends them to the executors together with the lambda (or rather, the closure), once for each task that uses them.
For broadcast variables, the driver sends them to the executors only once, the first time they are used.
Is there any advantage to using a broadcast variable instead of a standard variable when we know it will be used only once, so that there would be only one transfer even with a standard variable?
Example (Java):
public class SparkDriver {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        Map<String,String> dictionary = new HashMap<>();
        dictionary.put("J", "Java");
        dictionary.put("S", "Spark");

        SparkConf conf = new SparkConf()
            .setAppName("Try BV")
            .setMaster("local");

        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            final Broadcast<Map<String,String>> dictionaryBroadcast = context.broadcast(dictionary);

            context.textFile(inputPath)
                .map(line -> { // just one transformation using the BV
                    Map<String,String> d = dictionaryBroadcast.value();
                    String[] words = line.split(" ");
                    StringBuffer sb = new StringBuffer();
                    for (String w : words)
                        sb.append(d.get(w)).append(" ");
                    return sb.toString();
                })
                .saveAsTextFile(outputPath); // just one action!
        }
    }
}
There are several advantages to using broadcast variables, even if you use each one only once:
You avoid several serialization problems. When you serialize an anonymous inner class that uses a field of its enclosing class, the whole enclosing class has to be serialized. Spark and other frameworks have workarounds (the ClosureCleaner) that partially mitigate this, but sometimes they don't do the trick. You could avoid the NotSerializableExceptions with tricks such as copying a class member variable into a local variable inside the closure, or turning the anonymous inner class into a named class whose constructor takes only the required fields.
If you use a broadcast variable you don't even have to think about that: only the required variable is serialized. I suggest reading the not-serializable-exception question and its first answer to get a better grasp of the concept.
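As a minimal sketch of the difference (the class and field names below are made up): capturing a field of the enclosing class drags the whole class into the closure, while a broadcast variable ships only the value itself:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Pipeline(sc: SparkContext) {                // not Serializable
  val dictionary = Map("J" -> "Java", "S" -> "Spark")

  def badRun(words: RDD[String]): RDD[String] =
    words.map(w => dictionary.getOrElse(w, w))    // captures `this` -> risks a Task not serializable error

  def goodRun(words: RDD[String]): RDD[String] = {
    val bcast = sc.broadcast(dictionary)          // only the map is shipped, once per executor
    words.map(w => bcast.value.getOrElse(w, w))
  }
}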
Serializing the closure usually performs worse than a specialized serialization mechanism. As the official Spark documentation on data serialization says:
Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Searching through the Spark classes in the official repo, I saw that the closure is serialized via SparkEnv.get.closureSerializer. The only assignment of that variable, at line 306 of the SparkEnv class, uses the standard and inefficient JavaSerializer.
In that case, if you serialize a big object you lose some performance to network bandwidth. This may also explain why the official docs suggest switching to broadcast variables for tasks larger than 20 KiB.
There is only one copy per machine, so when several executors run on the same physical machine there is an advantage.
> Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
The distribution algorithm is probably a lot more efficient. Given the immutability of a broadcast variable, it is not hard to imagine distributing it with a p2p algorithm instead of a centralized one: for example, as soon as the driver has finished sending the broadcast variable to the first executor, that executor can forward it to a third one while the driver sends it to the second, and so on (the BitTorrent page on Wikipedia has a nice picture of the idea).
I haven't dug into Spark's implementation, but as the documentation of broadcast variables says:
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Surely a more efficient algorithm than the trivial centralized one can be designed by exploiting the immutability of the broadcast variable.
Long story short: using a closure and using a broadcast variable are not the same thing. If the object you are sending is big, use a broadcast variable.
Please refer to this excellent article: https://www.mikulskibartosz.name/broadcast-variables-and-broadcast-joins-in-apache-spark/ I could rewrite it here, but it serves the purpose well and answers your question.
In summary:
A broadcast variable is an Apache Spark feature that lets us send a read-only copy of a variable to every worker node in the Spark cluster.
The broadcast variable is useful only when we want to:
Reuse the same variable across multiple stages of the Spark job
Speed up joins via a small table that is broadcast to all worker nodes (not to every executor separately)
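As a small sketch of the second point (the table and column names are made up; broadcast() is the hint from org.apache.spark.sql.functions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(("J", 1), ("S", 2)).toDF("code", "count")            // stands in for the large table
val dict   = Seq(("J", "Java"), ("S", "Spark")).toDF("code", "name")  // small lookup table

// one read-only copy of dict is shipped to each worker node instead of shuffling events
events.join(broadcast(dict), Seq("code")).show()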
This is a question about Spark error handling.
Specifically, handling errors on writes to the target data store.
The Situation
I'm writing into a non-transactional data store that (in my case) does not support idempotent inserts, and I want to implement error handling for write failures to avoid inserting data multiple times.
So the scenario I'd like to handle is:
the dataframe / DAG is created
everything executes, data is read successfully and persisted within the Spark job (possibly in memory)
the write to the target throws an exception / fails midway / the target is unavailable
In this scenario, Spark would retry the write, without the ability to roll back (due to the nature of the custom target data store), and thus potentially duplicate the data.
The Question.
What is the proper approach in Spark to handle such cases?
Option 1.
Is there a way to add an exception handler at the task level? For a specific task?
Option 2.
I could set max retries to 1 so that the whole app fails, and cleanup could be done externally, but I would like to do better than that :)
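For reference, by "max retries" I mean configuration roughly like the following (a sketch, assuming YARN; other cluster managers use different app-attempt settings):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1")    // fail the job on the first task failure instead of retrying
  .set("spark.yarn.maxAppAttempts", "1") // do not re-submit the whole application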
Option 3.
As an alternative, we could add an extra column to the dataframe, one that would be computed at runtime and be unique across retries (so we could, again, clean it all up externally later). The question then is: what would be the way to compute a column literal at runtime of the Spark job (and not during DAG creation)?
So...
Given the question, what options are there?
If it's any of the three proposed, how can they be implemented?
Would appreciate very much any help on this matter!
Thanks in advance...
I would implement the error handling at the driver level, but this comes at the cost of additional queries (which you would need anyway). Basically you need to check whether:
your (new) data contains duplicates
your target table/datastore already contains this data
and only write if neither is the case:
val df: DataFrame = ???
df.cache() // if that fits into memory

val uniqueKeys = Seq("ID")

// a key occurring more than once means the new data itself contains duplicates
val noDuplicates = df.groupBy(uniqueKeys.head, uniqueKeys.tail: _*).count().where($"count" > 1).isEmpty

// the left-semi join keeps the target rows whose key also appears in the new data;
// an empty result means none of the new data is already in the target
val notAlreadyExistsInTarget = spark.table(<targettable>).join(df, uniqueKeys, "leftsemi").isEmpty

if (noDuplicates && notAlreadyExistsInTarget) {
  df.write // persist to your datastore
} else {
  throw new Exception("df contains duplicates / already exists")
}
I have the following producer code:
var kafka = require('kafka-node');
var KafkaProducer = kafka.Producer;
var KeyedMessage = kafka.KeyedMessage;

var jsonRequest = JSON.stringify(request.object);

// I have to define the client every time this endpoint is hit.
var client = new kafka.Client();
var producerKafka = new KafkaProducer(client);
var km = new KeyedMessage('key', 'message');

var payloads = [
    { topic: 'collect-response', messages: jsonRequest, partition: 0 }
];

producerKafka.on('ready', function () {
    producerKafka.send(payloads, function (err, data) {
        console.log(data);
    });
});

producerKafka.on('error', function (err) {});
Now, my task is to avoid duplication of the messages being written here.
This section of the Kafka FAQ should be useful:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data
production
and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon.
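As a small sketch of the "primary key" approach from the FAQ, include a UUID with each message and deduplicate downstream (a sketch using the plain Kafka client from Scala rather than your kafka-node setup; broker, topic and payload are illustrative):

import java.util.{Properties, UUID}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // illustrative broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// the UUID travels with the message as its key; the consumer keeps the IDs it has
// already processed (e.g. in a DB table) and skips any message whose ID it has seen
val messageId = UUID.randomUUID().toString
producer.send(new ProducerRecord[String, String]("collect-response", messageId, """{"some":"payload"}"""))
producer.close()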
I am trying to load a huge amount of data from Spark into HBase. I am using the saveAsNewAPIHadoopDataset method.
I am creating an ImmutableBytesWritable and a Put and saving them as required, as below.
dataframe.rdd.mapPartitions { rows =>   // work on the underlying RDD so the result is a pair RDD
  rows.map { eachRow =>
    val rowKey = Seq(eachRow.getAs[String]("uniqueId"), eachRow.getAs[String]("authTime")).mkString(",")
    val put = new Put(Bytes.toBytes(rowKey))
    val fields = eachRow.schema.fields
    for (i <- 0 until fields.length) {
      // userCF is the column family, defined elsewhere as a byte array
      put.addColumn(userCF, Bytes.toBytes(fields(i).name), Bytes.toBytes(String.valueOf(eachRow.get(i))))
    }
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
  }
}.saveAsNewAPIHadoopDataset(job.getConfiguration)
My data is about 30 GB and sits in HDFS in 60 files.
When I submit the same job with 10 files at a time, everything goes fine.
But when I submit everything at once, it gives this error. The error is really frustrating and I have tried everything within my means. I really wonder what makes it run successfully when the data is 5 GB and what makes it fail when it is 30 GB.
Has anyone faced this kind of issue?
That's because ImmutableBytesWritable is not serializable. When there is a shuffle, Apache Spark tries to serialize it to send it to another node. The same would happen if you tried to take some records or collect them on the driver.
There are only two approaches, actually.
Do not use it if you're shuffling. If you just need to put each record from disk into the database, then it looks like shuffling is not required; make sure it is not. If you need to preprocess your data before it goes to the database, keep it in some other serializable format and convert it to the required type only when you save it.
Use another serializer. Apache Spark ships with Kryo (make sure you're using Spark 2.0.0+; Kryo has been updated there and it fixes a few nasty concurrency bugs). To use it, you have to configure it; it's not hard, but it requires a bit of code (see the sketch below).
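A minimal sketch of that second approach, assuming the classic SparkConf-based setup (the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

val conf = new SparkConf()
  .setAppName("spark-to-hbase")                                           // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // use Kryo instead of Java serialization
  // registering the classes that cross the shuffle lets Kryo avoid writing full class names
  .registerKryoClasses(Array(classOf[ImmutableBytesWritable]))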
I am looking to process elements from a queue (Kafka or Amazon Kinesis) and to have multiple operations performed on each element, for example:
Write that to HDFS cluster
Invoke a rest API
Trigger a notification on Slack.
For each of these operations I am expecting exactly-once semantics. Is this achievable in Apache Spark, and how?
You will need to manage unique keys manually, but given that approach it is possible when using
KafkaUtils.createDirectStream
From the Spark docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html :
Approach 2: Direct Approach (No Receivers)
each record is received by Spark Streaming
effectively exactly once despite failures.
And here is the idempotency requirement, e.g. saving a unique key per message in Postgres:
In order to achieve
exactly-once semantics for output of your results, your output
operation that saves the data to an external data store must be either
idempotent, or an atomic transaction that saves results and offsets
(see Semantics of output operations in the main programming guide for
further information).
Here is an idea of the kind of code you would need to manage the unique keys (from http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ ):
stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // make sure connection pool is set up on the executor before writing
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.foreach { case (key, msg) =>
      DB.autoCommit { implicit session =>
        // the unique key for idempotency is just the text of the message itself, for example purposes
        sql"insert into idem_data(msg) values (${msg})".update.apply
      }
    }
  }
}
A unique per-message ID would need to be managed.
Exactly once is a side effect of at-least-once processing semantics when the operations are idempotent. In your case, if all 3 operations are idempotent, then you can get exactly-once semantics. The other way to get exactly-once semantics is to wrap all 3 operations and the Kafka offset storage in one transaction, which is not feasible here.
https://pkghosh.wordpress.com/2016/05/18/exactly-once-stream-processing-semantics-not-exactly/