I have a case where Kafka producers send data twice a day. These producers read all the data from the database/files and send it to Kafka, so the same messages are sent every day, which results in duplicates. I need to deduplicate the messages and write them to persistent storage using Spark Streaming. What is the best way to remove the duplicate messages in this case?
Each duplicate message is a JSON string in which only the timestamp field is updated.
Note: I can't change the Kafka producer to send only new data/messages; it is already installed on the client machine and was written by someone else.
For deduplication, you need to store information somewhere about what has already been processed (for example, the unique IDs of the messages).
To store this information you can use:
Spark checkpoints. Pros: works out of the box. Cons: if you update the source code of your app, you need to clear the checkpoints, and as a result you lose that information. This can work if the deduplication requirements are not strict.
Any database. For example, if you are running in a Hadoop environment, you can use HBase. For every message you do a 'get' to check that it hasn't been processed before, and mark it in the database once it has really been written (see the sketch below).
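If it helps, here is a hedged sketch of the HBase variant in Scala; the table name (processed_messages), column family and qualifier are placeholder assumptions, and checkAndPut is used so the "was it seen before?" check and the "mark as seen" write happen as one atomic step:

    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("processed_messages"))

    // Returns true the first time a message id is seen: the Put is applied only if the
    // "d:seen" cell does not exist yet (expected value = null), so duplicates return false.
    def markIfNew(messageId: String): Boolean =
      table.checkAndPut(
        Bytes.toBytes(messageId),   // row key = unique message id
        Bytes.toBytes("d"),         // column family
        Bytes.toBytes("seen"),      // qualifier
        null,                       // expected current value: null means "cell must be absent"
        new Put(Bytes.toBytes(messageId))
          .addColumn(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(true)))

You would then keep only the messages for which markIfNew returns true before writing them to your persistent storage.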
You can change the topic configuration to compact mode. With compaction, a record with the same key is overwritten/updated in the Kafka log, so you only get the latest value for each key from Kafka.
You can read more about compaction here.
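If you go this route, here is a hedged sketch of switching an existing topic to compaction with the Kafka AdminClient (Kafka 2.3+); the broker address and topic name are placeholders, and note that compaction only helps if the producer already sets a stable record key:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
    import org.apache.kafka.common.config.ConfigResource

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker
    val admin = AdminClient.create(props)

    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic")   // assumed topic
    val ops: java.util.Collection[AlterConfigOp] = Collections.singletonList(
      new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET))

    // switch the topic's cleanup policy from delete to compact
    admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
    admin.close()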
You could try to use mapWithState. Check my answer.
A much simpler approach would be to solve this on the Kafka side. Have a look at Kafka's log compaction feature. It will deduplicate the records for you, provided the records have the same unique key.
https://kafka.apache.org/documentation/#compaction
You can use a key-value datastore where the key is a combination of the fields excluding the timestamp field, and the value is the actual JSON.
As you poll the records, build the key-value pair and write it to a datastore that either handles UPSERTs (insert + update), or check whether the key already exists in the datastore and drop the message if it does:
    // pseudocode for the check-then-write variant against a generic KV datastore
    if (datastore.exists(key)) {
      // the key was seen before: drop the duplicate message
    } else {
      // first occurrence: remember the key and persist the message
      datastore.put(key, json)
    }
I suggest you look at HBase (which handles UPSERTs) and Redis (an in-memory KV datastore used for lookups).
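For the Redis variant, a minimal sketch assuming the Jedis client; SETNX stores the value only if the key does not exist yet, so the existence check and the write are a single atomic call (host, port and key layout are placeholders):

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("localhost", 6379)

    // returns true exactly once per key: 1 means "stored now", 0 means "already there, drop it"
    def isFirstOccurrence(key: String, json: String): Boolean =
      jedis.setnx(key, json).longValue() == 1L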
Have you looked into this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
You can try using the dropDuplicates() method.
If more than one column is needed to determine the duplicates, you can pass them using dropDuplicates(String[] colNames).
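A minimal sketch of that idea, assuming the messages carry "id" and "payload" fields (the timestamp is deliberately left out of the key, since it differs between duplicates); broker and topic names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("streaming-dedup").getOrCreate()
    import spark.implicits._

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
      .option("subscribe", "input-topic")                     // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    val deduped = messages
      .withColumn("id", get_json_object($"json", "$.id"))
      .withColumn("payload", get_json_object($"json", "$.payload"))
      .dropDuplicates("id", "payload")   // without a watermark the dedup state grows unbounded

    deduped.writeStream.format("console").outputMode("append").start().awaitTermination()

Note that without a watermark Spark keeps the deduplication state forever; if you have a time column you can trust, adding withWatermark and including that column in the dropDuplicates columns keeps the state bounded, as described in the linked guide.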
In Spark we have the mapPartitions function, which can be used to do some initialization for a group of entries, like a DB operation.
Now I want to do the same thing in Flink. After some research I found that I can use a RichMapFunction for this, but it has the drawback that the operation can only be done in the open method, which runs once at the start of the streaming job. I will explain my use case, which should clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted. This list of users is dynamic and is available in a DB. I want to look up the current users every 10 minutes so that I filter and store the data for only those users. In Spark (mapPartitions) the user lookup is done for every group, and there I had configured it to fetch the users from the DB every 10 minutes. But with Flink, using a RichMapFunction, I can do that only in the open function when my job starts.
How can I do this kind of operation in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but it seems the easiest one would be to use the broadcast state pattern here.
The idea is to define a custom DataSource that periodically queries the data from the SQL table (or, even better, use CDC), use that table stream as broadcast state, and connect it with the actual user stream.
Inside the ProcessFunction for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every user you receive and decide what to do with it.
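A minimal sketch of this pattern with the Flink Scala DataStream API; the two sources are stubbed with fromElements for illustration, and in your case the event stream would come from Kafka and the user stream from the custom source polling the DB every 10 minutes:

    import org.apache.flink.api.common.state.MapStateDescriptor
    import org.apache.flink.api.common.typeinfo.BasicTypeInfo
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    case class UserEvent(userId: String, payload: String)

    object BroadcastFilterJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // stub sources for illustration; replace with the Kafka source and the DB-polling source
        val events = env.fromElements(UserEvent("u1", "click"), UserEvent("u2", "view"))
        val allowedUsers = env.fromElements("u1")

        val allowedDescriptor = new MapStateDescriptor[String, java.lang.Boolean](
          "allowedUsers", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.BOOLEAN_TYPE_INFO)

        val filtered = events
          .connect(allowedUsers.broadcast(allowedDescriptor))
          .process(new BroadcastProcessFunction[UserEvent, String, UserEvent] {

            override def processElement(
                event: UserEvent,
                ctx: BroadcastProcessFunction[UserEvent, String, UserEvent]#ReadOnlyContext,
                out: Collector[UserEvent]): Unit = {
              // forward the event only if its user is present in the broadcast state
              if (ctx.getBroadcastState(allowedDescriptor).contains(event.userId)) {
                out.collect(event)
              }
            }

            override def processBroadcastElement(
                userId: String,
                ctx: BroadcastProcessFunction[UserEvent, String, UserEvent]#Context,
                out: Collector[UserEvent]): Unit = {
              // every refresh from the DB side updates the broadcast state on all subtasks
              ctx.getBroadcastState(allowedDescriptor).put(userId, java.lang.Boolean.TRUE)
            }
          })

        filtered.print()
        env.execute("broadcast-filter")
      }
    }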
I'm using Structured Streaming in Spark but I'm struggling to understand what data is kept in memory. Currently I'm running Spark 2.4.7, whose Structured Streaming Programming Guide says:
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
I understand this as: Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinitely.
I understand the concept and why it is good. For example, when I want to aggregate based on event time I can use withWatermark to tell Spark which column is the event time, specify how late I am willing to receive data, and let Spark throw away everything older than that.
However, let's say I want to aggregate on something that is not event time. I have a use case where each message in Kafka contains an array of datapoints. So I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message ID (getting max, min, avg, etc.). So my question is: will Spark keep all the "old" data, since that is how Structured Streaming works, which will lead to OOM issues? And is the only way to prevent this to add a "fictional" withWatermark on, for example, the time I received the message, and include this in my groupBy as well?
And in the other use case, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along, and I only care about the current "batch". Will Spark in that case also keep all old messages, forcing me to add a "fictional" withWatermark along with a groupBy (including the message ID in the groupBy and taking, for example, the max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message separately, but then I lose all the good things about Structured Streaming.
Yes, watermarking is necessary to bound the result table, and you need to add the event time to the groupBy.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that?
Also, watermarking is "strictly" required only if you have an aggregation or a join, to avoid late events being missed in the aggregation/join (and affecting the output). It is not needed for events that only need to be transformed and passed along, since the output is not affected by late events there; but if you want very late events to be dropped, you might still want to add watermarking. Some links to refer to (and a sketch after them):
https://medium.com/@ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba
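For the "fictional watermark" approach from the question, a minimal sketch assuming a message layout of {"messageId": "...", "points": [...]} and placeholder broker/topic names; the watermark on the Kafka ingestion timestamp plus the time window in the groupBy is what lets Spark evict old aggregation state:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("bounded-aggregation").getOrCreate()
    import spark.implicits._

    // assumed message layout: {"messageId": "...", "points": [1.0, 2.0, ...]}
    val schema = new StructType()
      .add("messageId", StringType)
      .add("points", ArrayType(DoubleType))

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
      .option("subscribe", "datapoints")                     // assumed topic
      .load()

    val exploded = raw
      .select($"timestamp", from_json($"value".cast("string"), schema).as("msg"))
      .select($"timestamp", $"msg.messageId".as("messageId"),
              explode_outer($"msg.points").as("point"))

    // the watermark bounds the state; the window ties each messageId group to a bounded interval
    val aggregated = exploded
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"messageId")
      .agg(min($"point"), max($"point"), avg($"point"))

    aggregated.writeStream.outputMode("append").format("console").start().awaitTermination()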
https://github.com/microsoft/FluidFramework/blob/release/0.30/server/routerlicious/packages/lambdas-driver/src/kafka-service/README.md#L81
(source code of the project)
I found that there are two ways to manage the Kafka service:
DocumentLambda and KafkaRunner.
They are very similar, and I want to know more about the differences,
and the reason for, or the history behind, why it is like this.
We use a fixed number of Kafka partitions, so a partition is shared by multiple documents. The DocumentLambda is responsible for routing the messages inside a partition to the corresponding lambda handler. It contains a HashMap whose key is "tenantId/documentId". For every incoming message, it looks up those fields to determine the lambda associated with that message.
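A schematic illustration only (in Scala, not the actual routerlicious code) of that routing idea: a per-partition map keyed by "tenantId/documentId" that creates a document lambda on first sight and reuses it for later messages:

    import scala.collection.mutable

    trait Lambda { def handle(message: String): Unit }

    class DocumentRouter(createLambda: String => Lambda) {
      private val lambdas = mutable.HashMap.empty[String, Lambda]

      // route every message of the partition to the lambda owning its document
      def route(tenantId: String, documentId: String, message: String): Unit = {
        val key = s"$tenantId/$documentId"
        lambdas.getOrElseUpdate(key, createLambda(key)).handle(message)
      }
    }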
I would like to write data from a Spark stream to Kafka.
I know that I can use KafkaUtils to read from Kafka,
but KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and its sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt an approach like that sample, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task straightforwardly, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory / sink, but the essential operation remains the same: you still request a producer object for each partition and use it to send the partition's records (see the sketch below).
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
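A minimal sketch of that per-partition producer pattern (the broker address and output topic are placeholder assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    def writeToKafka(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // one producer per partition, created on the executor that owns the partition
          val props = new Properties()
          props.put("bootstrap.servers", "localhost:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          records.foreach(record => producer.send(new ProducerRecord[String, String]("output-topic", record)))
          producer.close()
        }
      }

Wrapping the producer creation in a lazily initialized, per-JVM singleton (the factory/sink idea above) avoids paying the connection cost on every micro-batch, but the sending logic stays the same.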
I am using Kafka 0.8.2 to receive data from AdExchange, and I use Spark Streaming 1.4.1 to store the data in MongoDB.
My problem is when I restart my Spark Streaming job, for instance to deploy a new version, fix a bug, or add new features: it continues reading from the latest Kafka offset at that time, so I lose the data AdX pushed to Kafka while the job was restarting.
I tried something like auto.offset.reset -> smallest, but then it consumes from 0 to the end, so the data was huge and duplicated in the DB.
I also tried setting a specific group.id and consumer.id for Spark, but it is the same.
How can I save the latest offset Spark consumed to ZooKeeper or Kafka, and then read back from that offset after a restart?
One of the overloads of the createDirectStream function takes a map whose keys are the partition ids and whose values are the offsets from which you start consuming.
Just look at the API here: http://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html
The map I was talking about is usually called fromOffsets.
You can insert data into the map:
    startOffsetsMap.put(TopicAndPartition(topicName, partitionId), startOffset)
And use it when you create the direct stream:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
streamingContext, kafkaParams, startOffsetsMap, messageHandler(_))
After each iteration you can get the processed offsets using:
rdd.asInstanceOf[HasOffsetRanges].offsetRanges
You would be able to use this data to construct the fromOffsets map in the next iteration.
You can see the full code and usage at the end of the page here: https://spark.apache.org/docs/latest/streaming-kafka-integration.html
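Putting the pieces together, a minimal end-to-end sketch for the Kafka 0.8 direct API (topic, group and broker names are placeholders): load the offsets from your external store into fromOffsets on startup, and capture offsetRanges after each batch so you can persist them back:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("adx-to-mongo")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    // offsets previously saved in your external store (ZK, the DB, ...)
    val fromOffsets: Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("adx-events", 0) -> 0L)

    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write the rdd contents to MongoDB here ...
      // then persist (topic, partition, untilOffset) so the next run can resume from there
      offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()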
To add to Michael Kopaniov's answer, if you really want to use ZK as the place you store and load your map of offsets from, you can.
However, because your results are not being output to ZK, you will not get reliable semantics unless your output operation is idempotent (which it sounds like it isn't).
If it's possible to store your results in the same document in mongo alongside the offsets in a single atomic action, that might be better for you.
For more detail, see https://www.youtube.com/watch?v=fXnNEq1v3VA
Here's some code you can use to store offsets in ZK http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/
And here's some code you can use to use the offset when you call KafkaUtils.createDirectStream:
http://geeks.aretotally.in/spark-streaming-direct-api-reusing-offset-from-zookeeper/
I haven't figured this out 100% yet, but your best bet is probably to set up JavaStreamingContext.checkpoint().
See https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing for an example.
According to some blog entries (https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md) there are some caveats, but it almost feels like they involve certain fringe cases that are only alluded to rather than actually explained.
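For what it's worth, a minimal sketch of checkpoint-based recovery (shown with the Scala StreamingContext rather than the JavaStreamingContext mentioned above; the checkpoint directory is a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/adx"   // assumed location

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      // ... set up the Kafka direct stream and the output here ...
      ssc.checkpoint(checkpointDir)
      ssc
    }

    // on restart the context, including the consumed offsets of the direct stream,
    // is rebuilt from the checkpoint if one exists; otherwise createContext() runs
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()

The main caveat from the linked post is that checkpoints are tied to the compiled application, so upgrading the job's code typically invalidates them.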