Spark Streaming using Azure Event Hubs

We use spark-streaming-eventhubs to read data from the IoTHub in our Spark Streaming application. I am trying to understand whether offset and lease management are handled completely by this library.
In this blog post it says:
...by design each EventHubsReceiver instance only handles one Event Hubs partition. Each receiver instance requires one CPU core to run, and you need to leave some CPU cores to process the received data. So if you set partition count to N in Event Hubs parameters, you need to make sure you assign 2xN CPU cores to the streaming application.
So does it mean that the library will automatically create one receiver per partition and also manage leases?
Will it automatically write checkpoints into the checkpoint location?
It also says we need 2xN CPU cores assigned to the streaming application. So one would need 8 CPU cores if there are 4 partitions in the IoTHub. Is that really correct? Would it then make sense to create applications that can handle multiple use cases and output to multiple locations, instead of one streaming application per use case / location?
In the latter case, e.g. with 3 applications reading from the same IoTHub (4 partitions), we would need 24 cores, which is expensive...
Thanks!

Yes, if you are referring to this library: we support lease management and replaying from an offset completely within the library (i.e., the library creates one receiver per partition and manages leases and the checkpoint location per partition).
We apologize for the CPU story in that article; it is no longer applicable. To give some background: there were two separate EventHubs-to-Spark adapter efforts, and after a while we recognized that and consolidated them. The article you are looking at refers to the deprecated EventHubs effort and is outdated. We will get that fixed.

Related

Clustered app - only one server at a time reads from kafka, what am I missing?

I have a clustered application built around Spring tooling, using Kafka as the message layer for the fabric. At a high level, its architecture is a master process that parcels out work to slave processes running on separate hardware/VMs.
Master
|_______________
|       |       |
slave1  slave2  slave3
What I expect to happen is, if I throw 100 messages at Kafka, each of the slaves (three in this example) will pick up a proportionate number of messages and execute a proportionate amount of the work (about 1/3rd in this example).
What really happens is that one slave picks up all of the messages and executes all of the work. It is indeterminate which slave will pick up the messages, but it is guaranteed that once a slave starts picking up messages, the others will not until that slave has finished its work.
To me, it looks like the read from Kafka is pulling all of the messages from the queue, rather than one at a time. This leads me to believe I missed a configuration either in Kafka or in Spring Kafka.

I think you are missing a conceptual understanding of what Apache Kafka is and how it works.
First of all, there are no queues. Messages are stored in a topic, and every subscriber can get the same message. However, there is the concept of a consumer group: regardless of the number of subscribers, only one of them will read a given message if they all share the same consumer group.
There is another feature in Kafka called partitions. With it you can distribute your messages into different partitions explicitly, or they will be assigned automatically (evenly by default). Partitions serve a second purpose as well: when several subscribers to the same topic are in the same consumer group, the partitions are distributed between them. So, you may want to reconsider your logic in favor of the built-in features of Apache Kafka.
There is nothing to do from the Spring Kafka perspective, though. You only need to configure your topic with a reasonable number of partitions and provide the same consumer group for all your "slaves".
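To make the partition point concrete, here is a minimal sketch of how partitions could be spread over the members of one consumer group. It is an illustration only, not Kafka's actual assignor code: `assign_partitions` is a hypothetical helper using a simple round-robin rule. Assuming the topic in the question has a single partition, it shows why one slave ends up with all of the messages.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Hypothetical helper for illustration; real Kafka assignors differ in detail.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


slaves = ["slave1", "slave2", "slave3"]

# One partition: only one group member can own it, the others stay idle.
one_partition = assign_partitions([0], slaves)

# Six partitions: each slave owns two and receives a share of the messages.
six_partitions = assign_partitions(range(6), slaves)
```

With one partition the result is `{"slave1": [0], "slave2": [], "slave3": []}`; with six, every slave owns two partitions. A partition is the unit of parallelism within a group, so the topic needs at least as many partitions as there are slaves that should work at the same time.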

Apache Nifi multiple processor to same relationship

As shown in the image below, I have added 2 processors to the same relationship of another processor, and it distributes the flow files equally between the two tail processors. Is this the expected behavior? If yes, on what basis is the partitioning done?

Sending the same relationship to multiple processors does not partition the flow files; it sends all of them to both processors. You typically do this when you want to send the same data to multiple destinations (e.g. HDFS and Kafka).
If you want to improve the concurrency of PutAzureBlobStorage, use a single instance of the processor and increase its concurrent tasks in the scheduling tab of the processor.

If you want to distribute the load across a NiFi cluster, there are different approaches:
Use Kafka to send messages (tasks) across the cluster.
Use Site-to-Site.
Push data using a processor that listens for incoming connections (HandleHttpRequest, ListenSyslog, and ListenUDP) behind a load balancer.
More information:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Spark: Writing to DynamoDB, limited write capacity

My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost, how can I limit the Spark application to write at a regulated speed?
Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?
I already have auto-scaling enabled but don't want to increase it any further.
Please suggest other ways of handling this.
EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.

Bucket scheduler
The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design for ensuring that an application does not breach API limits. I have used this design successfully in very similar situations. You may find that someone has already written a library you can use for this purpose.
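A minimal sketch of such a token bucket, in plain Python, might look like the following. The class name `TokenBucket` and its parameters are illustrative assumptions, not part of Spark or the DynamoDB SDK; the `clock` and `sleep` arguments exist only so the limiter can be tested without real waiting.

```python
import time


class TokenBucket:
    """Illustrative rate limiter: allows `rate` operations per second on average,
    with bursts of up to `capacity` operations."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        if n > self.capacity:
            raise ValueError("cannot acquire more tokens than the bucket holds")
        while True:
            self._refill()
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep roughly as long as it takes for the missing tokens to accrue.
            self.sleep((n - self.tokens) / self.rate)
```

In a Spark job, one would presumably create a bucket inside `foreachPartition` and call `bucket.acquire()` before each write. Each partition then has its own bucket, so on a multi-node cluster the per-bucket rate would need to be the total write budget divided by the number of partitions writing concurrently.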
DynamoDB retry
Another (less attractive) option would be to increase the retry count on your DynamoDB connection. When a write fails because the provisioned throughput was exceeded, you can essentially instruct your DynamoDB SDK to keep retrying for as long as you like. Details in this answer. This option may appeal if you want a 'quick and dirty' solution.
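The retry idea can be sketched generically as retrying with exponential backoff. The example below does not use the DynamoDB SDK's own retry settings: `ThrottledError` is a hypothetical stand-in for the SDK's "throughput exceeded" error, and `write` is whatever function performs a single write.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the SDK's 'provisioned throughput exceeded' error."""


def write_with_retry(write, item, max_attempts=8, base_delay=0.05,
                     max_delay=5.0, sleep=time.sleep):
    """Call write(item), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return write(item)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on each attempt, then pick a random
            # delay below it so that parallel writers do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Unlike the token bucket, this approach only reacts after DynamoDB has already rejected a write, so the table still sees the excess request rate.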

Message is consumed twice

There is a topic with 8 partitions in a Kafka cluster.
I implemented an application that consumes the topic with KafkaMessageDrivenChannelAdapter, with concurrency set to 8 and KafkaTopicOffsetManager as the offsetManager.
When I start one application instance everything works. But when I start two application instances, I find that each message is consumed twice. Do you know why, and how to solve it? Do I need to change to highLevelConsumer?

You have to distribute the partitions across instances with that adapter.
We are working on upgrading to the Kafka 0.9 Java client, which supports consumer groups.
The first milestone for the core project is available.
We need to work on releasing a milestone of spring-integration-kafka 2.0 that uses this new client.

Can a spring kafka consumer run on multiple machines for the same group?

Kafka says that the offset is managed by consumers and that there should be as many consumers as there are partitions for the same group.
Spring Integration says that the number of consumer streams in the high-level consumer is the number of partitions for the same group.
So, can the Spring Kafka consumer code run on multiple servers for the same group? If yes, how are the offsets kept from conflicting between servers?

According to the Kafka documentation (http://kafka.apache.org/documentation.html#introduction), when a consumer group is used, each message is consumed by exactly one consumer in the group. Each consumer can run on its own machine. Two consumers can also run on the same machine; in that case, each consumer can be a separate process.
One group can contain multiple consumers. Partitions are distributed among all the consumers in a group by an assignment algorithm. The number of consumers can be larger or smaller than the number of partitions.
Offsets can be managed with the aid of ZooKeeper, but not all functions have been implemented in every client so far.
As for your use case, Kafka is in fact best thought of as an "at-least-once" delivery system. Kafka can provide at-most-once delivery by disabling retries on the producer or by committing the offset before processing a batch of messages. It is very difficult to implement an "exactly-once" delivery system, which requires cooperation with the destination storage system. But Kafka exposes the offset, so it may be possible. For more details, please see http://kafka.apache.org/documentation.html#semantics, http://ben.kirw.in/2014/11/28/kafka-patterns/, https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o and so on.
In my personal experience, I spent a lot of time trying to make my Kafka system behave as an exactly-once delivery system, but when the server goes down, some messages can still be consumed twice. However, my testing was done on a standalone Kafka server, whereas production always uses a Kafka cluster. So I think it can perhaps be considered an exactly-once system.
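The difference between committing the offset before or after processing can be illustrated with a small simulation. This is a sketch in plain Python with no Kafka client involved; `consume` is a hypothetical function that models a single consumer which crashes once at a chosen offset and then restarts from the last committed offset.

```python
def consume(messages, commit_first, crash_at=None):
    """Simulate one consumer that may crash once, then restarts.

    Returns the list of messages that were actually processed.
    """
    committed = 0        # offset the consumer resumes from after a restart
    processed = []
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_first:
            committed = offset + 1                  # commit, then process
            if offset == crash_at and not crashed:
                crashed = True
                continue                            # crash before processing: message is lost
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])      # process, then commit
            if offset == crash_at and not crashed:
                crashed = True
                continue                            # crash before commit: message is replayed
            committed = offset + 1
    return processed


msgs = ["a", "b", "c"]
at_most_once = consume(msgs, commit_first=True, crash_at=1)
at_least_once = consume(msgs, commit_first=False, crash_at=1)
```

Here `at_most_once` is `["a", "c"]` (message "b" is lost) and `at_least_once` is `["a", "b", "b", "c"]` (message "b" is processed twice), which matches the duplicates observed when the server goes down.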
