We have files containing millions of lines. I need to read each line from a file and send it to Kinesis. I am trying the following code:
KinesisAsyncClient kinesisClient = KinesisClientUtil.createKinesisAsyncClient(
KinesisAsyncClient.builder().region(region));
rdd.map(line -> {
PutRecordRequest request = PutRecordRequest.builder()
.streamName("MyTestStream")
.data(SdkBytes.fromByteArray(line.getBytes()))
.build();
try {
kinesisClient.putRecord(request).get();
} catch (InterruptedException e) {
LOG.info("Interrupted, assuming shutdown.");
} catch (ExecutionException e) {
LOG.error("Exception while sending data to Kinesis. Will try again next cycle.", e);
}
return null;
}
);
I am getting this error message:
object not serializable (class: software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient
It seems KinesisAsyncClient is not the right object to use. Which other object can be used? NOTE: We don't want to use Spark (Structured) Streaming for this use case because the files arrive only a few times a day, so it doesn't make sense to keep a streaming app running.
Is there a better way to send messages to Kinesis via Spark? NOTE: We want to use Spark so that messages can be sent in a distributed fashion. Sending each message sequentially is taking too long.
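The error above comes from Spark serializing the closure passed to map, which drags the captured client along with it. The following is a minimal sketch of that mechanism using only the JDK: FakeClient and SerializableConsumer are hypothetical stand-ins (not Spark or AWS SDK types), and the closure is serialized by hand with ObjectOutputStream. It suggests why building the client inside the per-partition function (for example inside foreachPartition) would avoid the problem, assuming the client is cheap enough to create once per partition.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

class ClosureSerializationSketch {
    // Hypothetical stand-in for a non-serializable client such as KinesisAsyncClient.
    static class FakeClient {
        void send(String line) { /* no-op in this sketch */ }
    }

    // A serializable function type, similar in spirit to Spark's Java function interfaces.
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    // Returns true if Java serialization accepts the object, false if it hits a non-serializable field.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Problematic shape: the client is created outside the closure and captured by it.
    static SerializableConsumer<String> capturingClient() {
        FakeClient client = new FakeClient();
        return line -> client.send(line);
    }

    // Safer shape: the closure captures nothing and creates the client where it runs.
    static SerializableConsumer<Iterable<String>> clientPerPartition() {
        return partition -> {
            FakeClient client = new FakeClient();
            for (String line : partition) {
                client.send(line);
            }
        };
    }
}
```

Here canSerialize(capturingClient()) fails on the captured FakeClient, while canSerialize(clientPerPartition()) succeeds because nothing non-serializable is captured.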
I am trying to read a file from GCP based on a notification received, as per the flow defined below:
File reader - Deserialises the data into a collection and sends it for routing.
I am deserializing the data into a collection of objects and sending it to the router for further processing. As I don't have control over the file size, I am thinking of batching the reader process.
Currently, the file-reader service activator returns the whole collection of deserialised objects.
Issue:
If I receive a larger file, e.g. one with 200k records, I want to send it to the header value router in batches rather than as a single collection of 200k objects.
If I convert the file-reader into a splitter and add an aggregator after it (Notification -> file-reader -> aggregator -> router), I would still need to return the collection of all the objects, not an iterator.
I don't want to load all the records into a collection.
Updated approach:
public <S> Collection<S> readData(DataInfo dataInfo, Class<S> clazz) {
Resource gcpResource = context.getResource("classpath://data.json");
var tempDataSet = new HashSet<S>();
AtomicInteger pivot = new AtomicInteger();
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gcpResource.getInputStream()))) {
bufferedReader.lines().map((dataStr) -> {
try {
var data = deserializeData(dataStr, clazz);
return data;
} catch (JsonProcessingException ex) {
throw new CustomException("PARSER-1001", "Error occurred while parsing", ex);
}
}).forEach(data -> {
// Add first, then check, so the record that completes a batch is not dropped
tempDataSet.add(data);
if (pivot.incrementAndGet() == BATCH_SIZE) {
// Once tempDataSet holds BATCH_SIZE items, send a copy to the routing channel and reset the pivot
var message = MessageBuilder.withPayload(new HashSet<>(tempDataSet))
.setHeader(AppConstants.EVENT_HEADER_KEY, eventType)
.build();
routingChannel.send(message);
pivot.set(0);
tempDataSet.clear();
}
});
return tempDataSet;
} catch (Exception ex) {
throw new CustomException("PARSER-1002", "Error occurred while parsing", ex);
}
}
If the batch size is 100 and we receive 1010 objects, 11 batches would be created: 10 with 100 objects each and a last one with 10.
If I use a splitter and pass the stream to it, will it wait for the whole stream to finish and then send the collected collection, or can we achieve something close to the previous approach with it?
I'm not sure what the question is, but I would go with the FileSplitter + Aggregator solution. The first is built exactly for the streaming file-reading use case. The second lets you buffer incoming messages until they meet some condition and then emit a single message downstream. That message can indeed carry a collection as its payload.
Here are the docs for your consideration:
https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregator
https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-splitter
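To make the buffering idea concrete, here is a minimal plain-Java sketch of "read lazily, emit a message per N records, flush the remainder at the end". It is not Spring Integration code: forEachBatch and the sink callback are hypothetical names, and the sink stands in for whatever sends a batch downstream (for example a routing channel).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchingSketch {
    /**
     * Pulls items one at a time from the iterator and hands them to the sink
     * in batches of at most batchSize. Returns the number of batches emitted.
     * Only one batch is held in memory at any time.
     */
    static <T> int forEachBatch(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        List<T> batch = new ArrayList<>(batchSize);
        int emitted = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh batch, the sink owns the old one
                emitted++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, partially filled batch
            emitted++;
        }
        return emitted;
    }
}
```

With a BufferedReader, the iterator could come from bufferedReader.lines().iterator(). For 1010 records and a batch size of 100 this emits 11 batches, the last one holding 10 records.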
I am using Cassandra batches to write to Cassandra nodes.
The batch size is 20.
List<List<SimpleStatement>> lists = Lists.partition(simpleStatementList,20);
List<ListenableFuture<CompletionStage<AsyncResultSet>>> futures = new ArrayList<>();
try{
lists.forEach(list -> {
BatchStatementBuilder batchStatementBuilder =
BatchStatement.builder(BatchType.LOGGED);
list.forEach(batchStatementBuilder::addStatement);
futures.add(executorService.submit(() ->
session.executeAsync(batchStatementBuilder.build())));
});
} catch (Exception e) {
LOG.error("Error", e);
}finally {
Futures.allAsList(futures).get(1, TimeUnit.HOURS);
futures.forEach(val -> {
try {
val.get().whenCompleteAsync(
(resultSet, error) -> {
if (error != null) {
LOG.info("Fail to write Cassandra");
}
});
} catch (Exception e) {
throw new RuntimeException(e);
}
});
executorService.shutdown();
}
I want to know which batches failed when I write using the async method.
Thanks in advance
Within your code, you can catch the exception returned from the asynchronous request and handle it whichever way you want. For example, log the contents of the lists object.
As a side note, CQL batches are not an optimisation. In fact, they are expensive and will perform worse than separate asynchronous statements since they put pressure on the coordinator.
Given that you are bulk-inserting into the same table, this is bad practice and is not recommended in Cassandra. CQL batches are for achieving atomicity when updating the same record in multiple tables. Cheers!
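One way to know which request failed is to attach the failure handler while the batch (or statement) is still in scope, so the handler can record it. The sketch below uses only java.util.concurrent: collectFailures is a hypothetical helper, and the executeAsync parameter is a stand-in for a call that returns a CompletionStage, as session.executeAsync(...) does in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

class FailedBatchTrackerSketch {
    /**
     * Submits every batch through executeAsync, waits for all of them,
     * and returns the batches whose stage completed exceptionally.
     */
    static <B> List<B> collectFailures(List<B> batches, Function<B, CompletionStage<?>> executeAsync) {
        ConcurrentLinkedQueue<B> failed = new ConcurrentLinkedQueue<>();
        List<CompletableFuture<?>> pending = new ArrayList<>();
        for (B batch : batches) {
            CompletableFuture<?> tracked = executeAsync.apply(batch)
                    .handle((result, error) -> {
                        if (error != null) {
                            failed.add(batch); // the batch is captured here, so we know what failed
                        }
                        return null;           // swallow the error so allOf below always completes normally
                    })
                    .toCompletableFuture();
            pending.add(tracked);
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return new ArrayList<>(failed);
    }
}
```

Because the call is already asynchronous, this sketch does not wrap it in an extra executor submission. The returned list could then be logged or retried.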
I have a Spark Streaming job (2.1.1 with Cloudera 5.12) with Kafka as input and HDFS as output (in Parquet format).
The problem is that I'm getting LeaseExpiredException randomly (not in every mini-batch):
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/qoe_fixe/data_tv/tmp/cleanData/_temporary/0/_temporary/attempt_20180629132202_0215_m_000000_0/year=2018/month=6/day=29/hour=11/source=LYO2/part-00000-c6f21a40-4088-4d97-ae0c-24fa463550ab.snappy.parquet (inode 135532024): File does not exist. Holder DFSClient_attempt_20180629132202_0215_m_000000_0_-1048963677_900 does not have any open files.
I'm using the Dataset API for writing to HDFS:
if (!InputWithDatePartition.rdd.isEmpty() ) InputWithDatePartition.repartition(1).write.partitionBy("year", "month", "day","hour","source").mode("append").parquet(cleanPath)
My job fails after a few hours because of this error.
Two jobs writing to the same directory share the same _temporary folder.
So when the first job finishes, this code is executed (FileOutputCommitter class):
public void cleanupJob(JobContext context) throws IOException {
if (hasOutputPath()) {
Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
FileSystem fs = pendingJobAttemptsPath
.getFileSystem(context.getConfiguration());
// if job allow repeatable commit and pendingJobAttemptsPath could be
// deleted by previous AM, we should tolerate FileNotFoundException in
// this case.
try {
fs.delete(pendingJobAttemptsPath, true);
} catch (FileNotFoundException e) {
if (!isCommitJobRepeatable(context)) {
throw e;
}
}
} else {
LOG.warn("Output Path is null in cleanupJob()");
}
}
It deletes pendingJobAttemptsPath (_temporary) while the second job is still running.
This may be helpful:
Multiple spark jobs appending parquet data to same base path with partitioning
I am using the strategy described in the Spark Streaming documentation for committing offsets to Kafka itself. My flow is like so:
Topic A --> Spark Stream [ foreachRdd process -> send to topic b] commit offset to topic A
JavaInputDStream<ConsumerRecord<String, Request>> kafkaStream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, Request>Subscribe(inputTopics, kafkaParams)
);
kafkaStream.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
rdd.foreachPartition(
consumerRecords -> {
OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
consumerRecords.forEachRemaining(record -> doProcess(record));
});
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
}
);
So let's say the RDD gets 10 events from topic A, and while processing each of them I send a new event to topic B. Now suppose that one of those responses fails. I don't want to commit that particular offset to topic A. Topics A and B have the same number of partitions N, so each RDD should be consuming from the same partition. What would be the best strategy to keep processing? How can I reset the stream so that it retries those events from topic A until it succeeds? I know I can't continue processing that partition without committing, because that would automatically move the offset and the failed record would not be processed again.
I don't know if it is possible for the stream/RDD to keep retrying the same messages for that partition only, while the other partitions/RDDs keep working. If I throw an exception from that particular RDD, what would happen to my job? Would it fail? Would I need to restart it manually? With regular consumers you could retry/recover, but I am not sure what happens with Streaming.
This is what I came up with: it takes the input data and then sends a request to the output topic. The producer has to be created inside the foreach loop, otherwise Spark will try to serialize it and send it to all the workers. Notice the response is sent asynchronously. This means that I am using at-least-once semantics in this system.
kafkaStream.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
rdd.foreachPartition(
partition -> {
OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
// Print statements in this section are shown in the executor's stdout logs
KafkaProducer<String, MLMIOutput> producer = new KafkaProducer(producerConfig(o.partition()));
partition.forEachRemaining(record -> {
System.out.println("request: "+record.value());
Response data = new Response …
// As a debugging technique, users can write to DBFS to verify that records are being written out
// dbutils.fs.put("/tmp/test_kafka_output",data,true)
ProducerRecord<String, Response> message = new ProducerRecord(outputTopic, null, data);
Future<RecordMetadata> result = producer.send(message);
try {
RecordMetadata metadata = result.get();
System.out.println(String.format("offset='%d' partition='%d' topic='%s' timestamp='%d'",
metadata.offset(), metadata.partition(), metadata.topic(), metadata.timestamp()));
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
});
producer.close();
});
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
}
);
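The at-least-once idea can be reduced to one rule: commit only after every send has been acknowledged, and let a failure propagate instead of swallowing it. Below is a minimal JDK-only sketch of that rule. sendAllThenCommit is a hypothetical helper, and its send and commit parameters are stand-ins for producer.send(...) and the offset commit; it does not use the Kafka or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class CommitAfterSendSketch {
    /**
     * Sends every record, waits for all acknowledgements, and only then runs commit.
     * If any send fails, join() throws a CompletionException and commit is never called,
     * so the input range stays uncommitted and can be processed again.
     */
    static <R> void sendAllThenCommit(List<R> records,
                                      Function<R, CompletableFuture<?>> send,
                                      Runnable commit) {
        List<CompletableFuture<?>> acks = new ArrayList<>();
        for (R rec : records) {
            acks.add(send.apply(rec)); // fire all sends without blocking on each one
        }
        CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join();
        commit.run(); // reached only if every send succeeded
    }
}
```

Records that were sent before the failure may be sent again on reprocessing, which is the duplicate-delivery cost of at-least-once semantics.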
I'm trying to run a Spark (2.2) job to get some data from the server using gRPC (1.1.2) client calls. I get this error when I run this code through Spark. Running the same job for a small set works fine. From what I researched, I understand that the ABORTED message is caused by concurrency issues, so I'm guessing the client is unable to create more than a certain number of stubs, but I'm not sure how to proceed. Also, I know for a fact that the gRPC server works well with a large number of requests, and I'm well below the number of requests it can handle. Any ideas?
Adding more information as requested:
My client CatalogGrpcClient has these methods to handle channels and the request:
private List<ManagedChannel> getChannels() {
return IntStream.range(0, numChannels).mapToObj(x ->
ManagedChannelBuilder.forAddress(channelHost, channelPort).usePlaintext(true).build()
).collect(Collectors.toList());
}
private ManagedChannel getChannel() {
return channels.get(ThreadLocalRandom.current().nextInt(channels.size()));
}
private ListingRequest populateRequest(ListingRequest.Builder req, String requestId) {
return req.setClientSendTs(System.currentTimeMillis())
.setRequestId(StringUtils.defaultIfBlank(req.getRequestId(), requestId))
.setSchemaVersion(StringUtils.defaultIfBlank(req.getSchemaVersion(), schema))
.build();
}
private List<ListingResponse> getGrpcListingWithRetry(ListingRequest.Builder request,
String requestIdStr,
int retryLimit,
int sleepBetweenRetry) {
int retryCount = 0;
while (retryCount < retryLimit) {
try {
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(CatalogServiceGrpc.newBlockingStub(getChannel()).getListings(populateRequest(request, requestIdStr)), Spliterator.ORDERED), false).collect(Collectors.toList());
} catch (Exception e) {
System.out.println("Exception " + (e.getCause() != null ? e.getCause().getMessage() : e.getMessage())); // getCause() may be null
retryCount = retryCount + 1;
try {
Thread.sleep(sleepBetweenRetry);
} catch (InterruptedException e1) {
e1.printStackTrace();
}
}
}
throw new StatusRuntimeException(Status.ABORTED);
}
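The retry loop above ends by throwing a fresh ABORTED status, which hides whatever the last real failure was. The following JDK-only sketch keeps the same retry shape but preserves the last exception as the cause. RetrySketch and withRetry are hypothetical names, and the Supplier stands in for the blocking stub call.

```java
import java.util.function.Supplier;

class RetrySketch {
    /**
     * Calls the supplier up to retryLimit times, sleeping between attempts.
     * If every attempt fails, throws IllegalStateException with the last failure as its cause.
     */
    static <T> T withRetry(Supplier<T> call, int retryLimit, long sleepMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= retryLimit; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                // getCause() can be null, so fall back to the exception itself
                Throwable root = e.getCause() != null ? e.getCause() : e;
                System.err.println("Attempt " + attempt + " failed: " + root);
                if (attempt < retryLimit) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // restore the flag and stop retrying
                        break;
                    }
                }
            }
        }
        throw new IllegalStateException("Gave up after " + retryLimit + " attempts", last);
    }
}
```

Keeping the cause makes it possible to tell a server-side status apart from a client-side timeout or connection problem when the job eventually fails.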
I use the method getCatalogListingData in the method extract, which is used to map to a case class in the Spark job:
def extract(itemIds: List[Long], validAspects: Broadcast[Array[String]]): List[ItemDetailModel] = {
var itemsDetails = List[ItemDetailModel]()
val client = new CatalogGrpcClient()
implicit val formats = DefaultFormats
val listings = client.getCatalogListingData(itemIds.map(x => x.asInstanceOf[java.lang.Long]).asJava).asScala
...
...
itemsDetails
}
Here's the Spark code which calls extract. itemsMissingDetails is a DataFrame with a column "item" which is a list of unique item ids. The zipWithIndex and the following map are there so that I pass 50 item ids in each request to the gRPC service.
itemsMissingDetails
.rdd
.zipWithIndex
.map(x => (x._2 / 50, List(x._1.getLong(0))))
.reduceByKey(_ ++ _)
.flatMap(items => extract(items._2, validAspects))
.toDF
.write
.format("csv")
.option("header",true)
.option("sep", "\t")
.option("escapeQuotes", false)
.save(path)
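The grouping step in that pipeline (index each element, key it by index / 50, merge the lists per key) can be illustrated on a plain in-memory list. This is a JDK-only sketch of the same arithmetic, not Spark code, and ChunkByIndexSketch and chunk are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChunkByIndexSketch {
    /**
     * Groups items by (position / chunkSize), mirroring zipWithIndex followed by
     * a map to (index / 50, item) and a merge of the items that share a key.
     */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        Map<Long, List<T>> byKey = new TreeMap<>(); // TreeMap keeps the chunks in key order
        for (int i = 0; i < items.size(); i++) {
            long key = i / chunkSize;               // integer division assigns 0..49 to key 0, 50..99 to key 1, ...
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(items.get(i));
        }
        return new ArrayList<>(byKey.values());
    }
}
```

For 120 ids and a chunk size of 50 this yields three chunks of 50, 50 and 20 ids, so each chunk corresponds to one request.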
The ABORTED error is actually thrown by my client after a long time (~30 min to 1 hour). When I start this job, it gets the info I need from the gRPC service for a few thousand items on every worker. After this, the job hangs (on each worker) and, after a really long wait (~30 min to 1 hour), either fails with the above exception or proceeds further. I haven't been able to reproduce the StatusRuntimeException consistently.