Spark Streaming with results sent to another topic using commitAsync - apache-spark

I am using the strategy described in the Spark Streaming documentation for committing offsets to Kafka itself. My flow looks like this:
Topic A --> Spark Stream [foreachRDD: process -> send to topic B] --> commit offset to topic A
JavaInputDStream<ConsumerRecord<String, Request>> kafkaStream = KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Request>Subscribe(inputTopics, kafkaParams)
);
kafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    rdd.foreachPartition(consumerRecords -> {
        OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
        System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
        consumerRecords.forEachRemaining(record -> doProcess(record));
    });
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
So let's say the RDD gets 10 events from topic A, and while processing each of them I send a new event to topic B. Now suppose that one of those responses fails. I don't want to commit that particular offset to topic A. Topics A and B have the same number of partitions N, so each RDD partition should be consuming from the same partition. What would be the best strategy to keep processing? How can I reset the stream to retry those events from topic A until they succeed? I know I can't just keep processing that partition, because committing later offsets would move past the failed record and it would never be processed again.
I don't know if it is possible for the stream/RDD to keep retrying the same messages for that partition only while the other partitions keep working. If I throw an exception from that particular RDD, what happens to my job? Would it fail? Would I need to restart it manually? With regular consumers you can retry/recover, but I am not sure what happens with Streaming.
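One approach I am considering (just a sketch, reusing the same kafkaStream and doProcess as above) is to let any failure propagate out of foreachPartition and skip commitAsync for that batch. Because the direct stream keeps its position in memory while the application is running, the uncommitted records would only be reprocessed after a restart, when the stream resumes from the last committed offsets:
kafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    try {
        rdd.foreachPartition(consumerRecords ->
                consumerRecords.forEachRemaining(record -> {
                    // doProcess should throw (e.g. a RuntimeException) if sending to topic B fails,
                    // so the task and therefore the whole batch fails instead of silently continuing
                    doProcess(record);
                }));
        // reached only if every partition finished without an exception
        ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
    } catch (Exception e) {
        // batch failed: skip the commit so these offsets stay uncommitted
        // and are re-read after the application is restarted
        e.printStackTrace();
    }
});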

This is what I came up with: it takes the input data and then sends a request to the output topic. The producer has to be created inside the foreachPartition block, otherwise Spark would try to serialize it and ship it to all the workers. Notice the response is sent asynchronously; this means I am using at-least-once semantics in this system.
kafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    rdd.foreachPartition(partition -> {
        OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
        System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
        // Print statements in this section are shown in the executor's stdout logs
        KafkaProducer<String, Response> producer = new KafkaProducer<>(producerConfig(o.partition()));
        partition.forEachRemaining(record -> {
            System.out.println("request: " + record.value());
            Response data = new Response …
            // As a debugging technique, users can write to DBFS to verify that records are being written out
            // dbutils.fs.put("/tmp/test_kafka_output", data, true)
            ProducerRecord<String, Response> message = new ProducerRecord<>(outputTopic, null, data);
            Future<RecordMetadata> result = producer.send(message);
            try {
                RecordMetadata metadata = result.get();
                System.out.println(String.format("offset='%d' partition='%d' topic='%s' timestamp='%d'",
                        metadata.offset(), metadata.partition(), metadata.topic(), metadata.timestamp()));
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        });
        producer.close();
    });
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});

Related

Sending messages from a file to Kinesis using Spark

We have files containing millions of lines. I need to read each line from the file and send it to Kinesis. I am trying the following code:
KinesisAsyncClient kinesisClient = KinesisClientUtil.createKinesisAsyncClient(
        KinesisAsyncClient.builder().region(region));
rdd.map(line -> {
    PutRecordRequest request = PutRecordRequest.builder()
            .streamName("MyTestStream")
            .data(SdkBytes.fromByteArray(line.getBytes()))
            .build();
    try {
        kinesisClient.putRecord(request).get();
    } catch (InterruptedException e) {
        LOG.info("Interrupted, assuming shutdown.");
    } catch (ExecutionException e) {
        LOG.error("Exception while sending data to Kinesis. Will try again next cycle.", e);
    }
    return null;
});
I am getting this error message:
object not serializable (class: software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient
It seems KinesisAsyncClient is not the right object to use. Which other object can be used? NOTE: We don't want to use Spark (Structured) Streaming for this use case because the files arrive only a few times a day; it doesn't make sense to keep a Streaming app running.
Is there a better way to send messages to Kinesis via Spark? NOTE: We want to use Spark so that messages can be sent in a distributed fashion; sending each message sequentially takes too long.
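The serialization error happens because the lambda passed to map captures the KinesisAsyncClient that was built on the driver. A common workaround (a sketch only; the region, stream name, and partition-key choice below are assumptions) is to create the client inside foreachPartition so it lives entirely on each executor:
import java.util.UUID;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

rdd.foreachPartition(lines -> {
    // built on the executor, so nothing non-serializable has to cross the wire
    KinesisAsyncClient client = KinesisAsyncClient.builder()
            .region(Region.US_EAST_1)                    // assumption: adjust to your region
            .build();
    try {
        while (lines.hasNext()) {
            String line = lines.next();
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("MyTestStream")
                    .partitionKey(UUID.randomUUID().toString())  // PutRecord requires a partition key
                    .data(SdkBytes.fromUtf8String(line))
                    .build();
            client.putRecord(request).join();            // block so failures surface in this task
        }
    } finally {
        client.close();
    }
});
For throughput it may also be worth grouping lines and using the batch PutRecords call instead of one PutRecord per line.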

Sending data in batches using Spring Integration?

I am trying to read a file from GCP based on a notification received, as per the flow defined below:
Notification -> file reader, which deserializes the data into a collection and sends it for routing.
I am deserializing the data into a collection of objects and sending it to a router for further processing. As I don't have control over the file size, I am thinking of some way of batching the reader step.
Currently, the file-reader service activator returns the whole collection of deserialized objects.
Issue:
If I receive a larger file, e.g. with 200k records, I want to send the data to the header value router in batches rather than as a single collection of 200k objects.
If I convert the file reader into a splitter and add an aggregator after it (Notification -> file-reader -> aggregator -> router), I would still need to return the collection of all the objects rather than an iterator.
I don't want to load all the records into one collection.
Updated approach:
public <S> Collection<S> readData(DataInfo dataInfo, Class<S> clazz) {
    Resource gcpResource = context.getResource("classpath://data.json");
    var tempDataSet = new HashSet<S>();
    AtomicInteger pivot = new AtomicInteger();
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gcpResource.getInputStream()))) {
        bufferedReader.lines().map(dataStr -> {
            try {
                return deserializeData(dataStr, clazz);
            } catch (JsonProcessingException ex) {
                throw new CustomException("PARSER-1001", "Error occurred while parsing", ex);
            }
        }).forEach(data -> {
            tempDataSet.add(data);
            if (BATCH_SIZE == pivot.incrementAndGet()) {
                // When tempDataSet reaches BATCH_SIZE, send the batch to the routing channel and reset the pivot
                var message = MessageBuilder.withPayload(new HashSet<>(tempDataSet))
                        .setHeader(AppConstants.EVENT_HEADER_KEY, eventType)
                        .build();
                routingChannel.send(message);
                pivot.set(0);
                tempDataSet.clear();
            }
        });
        return tempDataSet;
    } catch (Exception ex) {
        throw new CustomException("PARSER-1002", "Error occurred while parsing", ex);
    }
}
If the batch size is 100 and we receive 1010 objects, then 11 batches would be created: 10 with 100 objects each and the last one with 10 objects.
If I use a splitter and pass the stream to it, will it wait for the whole stream to finish and then send the collected collection, or can we achieve something close to the previous approach with it?
Not sure what the question is, but I would go with a FileSplitter + Aggregator solution. The first one is exactly for the streaming file-reading use case. The second one lets you buffer incoming messages until they meet some condition, and then it emits a single message downstream. That message indeed could have a collection as its payload.
Here are their docs for your consideration:
https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregator
https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-splitter
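Roughly, the Java DSL version of that idea could look like the sketch below (the channel names, the batch size, and the deserializeLine placeholder are assumptions, not taken from your code):
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.file.dsl.Files;

@Configuration
public class BatchingFlowConfig {

    private static final int BATCH_SIZE = 100;

    @Bean
    public IntegrationFlow fileBatchingFlow() {
        // "fileChannel" is assumed to carry the File/InputStream resolved from the GCP notification
        return IntegrationFlows.from("fileChannel")
                // FileSplitter streams the file and emits one message per line,
                // so the whole file is never loaded into memory
                .split(Files.splitter().markers(false))
                .transform(String.class, this::deserializeLine)
                // the aggregator buffers messages and releases a collection of
                // BATCH_SIZE items (or a partial group when the timeout fires)
                .aggregate(a -> a
                        .correlationExpression("'batch'")
                        .releaseStrategy(group -> group.size() >= BATCH_SIZE)
                        .groupTimeout(5000)
                        .sendPartialResultOnExpiry(true)
                        .expireGroupsUponCompletion(true))
                .channel("routingChannel")
                .get();
    }

    private Object deserializeLine(String line) {
        // placeholder: plug in the existing Jackson-based deserialization here
        return line;
    }
}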

Kafka Producer Recovery Strategy when consuming a parquet file

I have a use case where I have to read parquet files and publish the records to a Kafka topic.
I read the parquet files using:
spark.read.schema(//path to parquet file )
Then I sort this dataframe based on the timestamp (a use-case-specific requirement to preserve order).
Finally, I do the following (binaryFiles below is the dataframe containing the sorted records from the parquet file):
binaryFiles.coalesce(1).foreachPartition(partition => {
  val producer = new KafkaProducer[String, Array[Byte]](properties)
  partition.foreach(file => {
    try {
      val producerRecord = new ProducerRecord[String, Array[Byte]](
        targetTopic, file.getAs[Integer](2), file.getAs[String](0), file.getAs[Array[Byte]](1))
      producer.send(producerRecord, new Callback {
        override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
          if (e != null) {
            println("Error while producing: " + e)
            producer.close()
          }
        }
      })
      producer.flush()
    } catch {
      case unknown: Throwable => println("Exception obtained with record. Key: " + unknown)
    }
  })
  producer.close()
  println("Closing the producer for this partition")
})
While designing the failover strategy for this job, one scenario I am trying to handle is the node that runs the Kafka producer going down.
When the job is restarted, it will read the parquet file from the start and publish all the records to the topic again.
How can we overcome this and implement some sort of checkpointing like the one Spark Streaming provides?
PS: I cannot use Spark Structured Streaming as that does not preserve the order of the messages.
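Since the records are already sorted by timestamp, one possibility (a rough sketch only; the column name, the helper methods, and the store backing the watermark are all assumptions) is to persist a "last acknowledged timestamp" each time Kafka acknowledges a record, and filter on it when the job restarts:
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// readLastPublishedTimestamp()/saveLastPublishedTimestamp() are hypothetical helpers backed by a
// small durable store (a DB table, a file on HDFS, ZooKeeper, ...), updated in the producer
// callback only after the broker acknowledges a record.
long lastPublishedTs = readLastPublishedTimestamp();

// Everything at or below the watermark was already published before the crash, so skip it on
// restart and feed only the remainder into the same coalesce(1).foreachPartition publish loop.
Dataset<Row> remaining = binaryFiles.filter(col("timestamp").gt(lastPublishedTs));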

io.grpc.StatusRuntimeException: ABORTED when using Grpc client in Spark

I'm trying to run a Spark (2.2) job that fetches data from a server using gRPC (1.1.2) client calls. I get this error when I run the code through Spark; running the same job on a small set works fine. From what I have researched, I understand the ABORTED message is caused by some concurrency issue, so I'm guessing the client is unable to create more than a certain number of stubs, but I'm not sure how to proceed. Also, I know for a fact that the gRPC server handles a large number of requests well and I'm well below the number of requests it can handle. Any ideas?
Adding more information as requested:
My client CatalogGrpcClient has these methods to handle channels and the request:
private List<ManagedChannel> getChannels() {
    return IntStream.range(0, numChannels)
            .mapToObj(x -> ManagedChannelBuilder.forAddress(channelHost, channelPort).usePlaintext(true).build())
            .collect(Collectors.toList());
}

private ManagedChannel getChannel() {
    return channels.get(ThreadLocalRandom.current().nextInt(channels.size()));
}

private ListingRequest populateRequest(ListingRequest.Builder req, String requestId) {
    return req.setClientSendTs(System.currentTimeMillis())
            .setRequestId(StringUtils.defaultIfBlank(req.getRequestId(), requestId))
            .setSchemaVersion(StringUtils.defaultIfBlank(req.getSchemaVersion(), schema))
            .build();
}

private List<ListingResponse> getGrpcListingWithRetry(ListingRequest.Builder request,
                                                      String requestIdStr,
                                                      int retryLimit,
                                                      int sleepBetweenRetry) {
    int retryCount = 0;
    while (retryCount < retryLimit) {
        try {
            return StreamSupport.stream(
                    Spliterators.spliteratorUnknownSize(
                            CatalogServiceGrpc.newBlockingStub(getChannel())
                                    .getListings(populateRequest(request, requestIdStr)),
                            Spliterator.ORDERED),
                    false)
                    .collect(Collectors.toList());
        } catch (Exception e) {
            System.out.println("Exception " + e.getCause().getMessage());
            retryCount = retryCount + 1;
            try {
                Thread.sleep(sleepBetweenRetry);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
    throw new StatusRuntimeException(Status.ABORTED);
}
I use the method getCatalogListingData in the method extract, which maps the result to a case class in the Spark job:
def extract(itemIds: List[Long], validAspects: Broadcast[Array[String]]): List[ItemDetailModel] = {
  var itemsDetails = List[ItemDetailModel]()
  val client = new CatalogGrpcClient()
  implicit val formats = DefaultFormats
  val listings = client.getCatalogListingData(itemIds.map(x => x.asInstanceOf[java.lang.Long]).asJava).asScala
  ...
  ...
  itemsDetails
}
Here's the Spark code that calls extract. itemsMissingDetails is a dataframe with a column "item", which is a list of unique item ids. The zipWithIndex and the following map are there so that I pass 50 item ids in each request to the gRPC service.
itemsMissingDetails
  .rdd
  .zipWithIndex
  .map(x => (x._2 / 50, List(x._1.getLong(0))))
  .reduceByKey(_ ++ _)
  .flatMap(items => extract(items._2, validAspects))
  .toDF
  .write
  .format("csv")
  .option("header", true)
  .option("sep", "\t")
  .option("escapeQuotes", false)
  .save(path)
The ABORTED error is actually thrown by my client after a long time (~30 min to 1 hour). When I start this job, it gets the info I need from the gRPC service for a few thousand items on every worker. After this, the job hangs (on each worker) and, after a really long wait (~30 min to 1 hour), it either fails with the above exception or proceeds further. I haven't been able to reproduce the StatusRuntimeException consistently.
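Two things that might be worth checking (a sketch based on the code above, not a confirmed fix): blocking stubs without a deadline will wait on a stalled call indefinitely, and the ManagedChannels built in getChannels() are never shut down, so every extract() call on every task leaks connections and threads:
// 1) Put a deadline on each blocking call so a stalled request fails fast with
//    DEADLINE_EXCEEDED instead of hanging the task (30s is an assumed value):
CatalogServiceGrpc.CatalogServiceBlockingStub stub =
        CatalogServiceGrpc.newBlockingStub(getChannel())
                .withDeadlineAfter(30, TimeUnit.SECONDS);

// 2) Shut the channels down once the client is finished with them, e.g. in a close()
//    method on CatalogGrpcClient that extract() calls before returning:
for (ManagedChannel channel : channels) {
    channel.shutdown();
    try {
        channel.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
    }
}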

How to set the Spark streaming receiver frequency?

My requirement is to process hourly stock market data, i.e. get the data from the source once per streaming interval and process it via a DStream.
I have implemented a custom receiver to scrape/monitor the website by implementing the onStart() and onStop() methods, and it works.
Challenges encountered:
The receiver thread fetches the data continuously, i.e. multiple times per interval.
Unable to coordinate the receiver with the DStream execution interval.
Options I tried:
Making the receiver thread sleep for a few seconds (equal to the streaming interval). In this case the data is no longer the latest when it is processed.
class CustomReceiver(interval: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    new Thread("Website Scrapper") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
  }

  /** Create a socket connection and receive data until the receiver is stopped */
  private def receive() {
    println("Entering receive: " + new Date())
    try {
      while (!isStopped) {
        val scriptsLTP = StockMarket.getLiveStockData()
        for ((script, ltp) <- scriptsLTP) {
          store(script + "," + ltp)
        }
        println("sent data")
        println("going to sleep: " + new Date())
        Thread.sleep(3600 * 1000)
        println("awaken from sleep: " + new Date())
      }
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case t: Throwable =>
        restart("Error receiving data", t)
    }
    println("Exiting receive: " + new Date())
  }
}
How can I keep the Spark Streaming receiver in sync with the DStream processing?
This use case doesn't seem like a good fit for Spark Streaming. The interval is long enough to treat it as a regular batch job instead; that way we make better use of the cluster resources.
I would rewrite it as a Spark job: parallelize the target tickers, use mapPartitions to turn the executors into distributed web scrapers, and then process the results as intended.
Then schedule the Spark job to run each hour, at the exact times wanted, with cron or a more advanced alternative such as Chronos.
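A minimal sketch of that batch job (in Java; the per-ticker StockMarket.getLiveStockData(ticker) lookup, the ticker list, and the output path are assumptions):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HourlyStockScrapeJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hourly-stock-scrape"));

        // assumption: the tickers to scrape are known up front
        List<String> tickers = Arrays.asList("TICKER1", "TICKER2", "TICKER3");

        JavaRDD<String> quotes = sc.parallelize(tickers, 4)
                .mapPartitions(part -> {
                    // each executor scrapes its own slice of the ticker list
                    List<String> out = new ArrayList<>();
                    while (part.hasNext()) {
                        String ticker = part.next();
                        out.add(ticker + "," + StockMarket.getLiveStockData(ticker)); // hypothetical per-ticker lookup
                    }
                    return out.iterator();
                });

        quotes.saveAsTextFile("/data/stock/" + System.currentTimeMillis());
        sc.stop();
    }
}
A crontab entry along the lines of 0 * * * * spark-submit ... HourlyStockScrapeJob then takes care of the hourly schedule.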
