Apache BookKeeper delete log - apache-pulsar

In Apache BookKeeper, how do we delete a log entry in a ledger? If the ledger is immutable and entries cannot be deleted, how does Pulsar delete expired messages from bookies?

BookKeeper doesn't provide interfaces to delete individual entries in a ledger; it only provides methods to delete ledgers. Once a ledger is deleted, bookies garbage-collect its entries in the background to reclaim disk space.
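For illustration, here is a minimal sketch with the BookKeeper Java client showing that the API works at ledger granularity, not entry granularity; the ZooKeeper address, digest type, and password are placeholders.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerLifecycle {
    public static void main(String[] args) throws Exception {
        // Connect to the BookKeeper cluster via its ZooKeeper ensemble (address is illustrative).
        BookKeeper bk = new BookKeeper("localhost:2181");

        // Create a ledger and append a few entries. Entries are immutable:
        // there is no API to delete or modify an individual entry.
        LedgerHandle lh = bk.createLedger(BookKeeper.DigestType.MAC, "password".getBytes());
        for (int i = 0; i < 3; i++) {
            lh.addEntry(("entry-" + i).getBytes());
        }
        long ledgerId = lh.getId();
        lh.close();

        // The only way to reclaim space is to delete the whole ledger;
        // bookies garbage-collect its entries afterwards.
        bk.deleteLedger(ledgerId);
        bk.close();
    }
}
```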
A Pulsar topic partition is composed of multiple ledgers. At any given time, the Pulsar broker writes to one ledger per partition. After that ledger reaches a given size or age, the broker closes it and opens a new ledger to write to. Pulsar keeps the list of ledgers as part of the topic partition's metadata. Once all the messages in a ledger have been consumed or have expired, the broker deletes that ledger.
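As a hedged example of how that expiry is configured, the sketch below uses the Pulsar Java admin client to set a message TTL and a retention policy on a namespace; the admin URL, namespace, and values are placeholders, and the equivalent pulsar-admin CLI commands achieve the same thing.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class ExpirySettings {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder broker admin URL
                .build();

        String namespace = "public/default";               // placeholder namespace

        // Messages older than one hour are marked as expired even if never acknowledged...
        admin.namespaces().setNamespaceMessageTTL(namespace, 3600);

        // ...and acknowledged messages are retained for at most 60 minutes / 512 MB.
        // Once every entry in a ledger is acknowledged or expired and falls outside
        // the retention window, the broker deletes that ledger.
        admin.namespaces().setRetention(namespace, new RetentionPolicies(60, 512));

        admin.close();
    }
}
```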
The following links are useful for understanding this:
[1] How a segment-based architecture delivers better performance, scalability, and resilience
[2] Pulsar FAQ

Related

How do multiple Kafka consumers in the same consumer group read messages from one partition of a topic?

I would like to know how consumers in the same consumer group read messages from a topic that has only one partition.
For example, I have 3 consumers in one consumer group, and that group is polling messages from Topic A, which has a single partition. If 1000 messages arrive one by one in Topic A, how would they be delivered to the 3 consumers?
Would 3 messages be delivered to the 3 consumers in parallel, and once each consumer finishes processing, would the next message be delivered? In other words, will they receive messages in parallel?
Or will just one consumer fetch all of those messages, since there is only one partition?
Please also suggest the best architectural approach for the above scenario.
Thanks,
I want to process multiple messages in parallel from one topic that has one partition, with 4 consumers.
I am using Kafka with NodeJS microservices and the kafkajs package.
In your scenario, only one consumer of that consumer group will read the data, most probably the first one you started. I'm not 100% sure as I never tried it out, but I assume the additional consumers will just idle without workload.
This question is essentially the same as yours.
If you want consumers to work in parallel, you cannot avoid having multiple partitions; that is the main purpose of the whole partitioning concept.
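To make that concrete, here is a minimal Java consumer sketch (broker address, group id, and topic name are placeholders). If you start three copies against a single-partition topic, the group coordinator assigns the partition to one of them and the other two sit idle; recreating the topic with four partitions gives each of your four consumers its own partition and real parallelism. The same behavior applies with kafkajs, since partition assignment is part of the consumer-group protocol, not a feature of a particular client library.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "topic-a-workers");           // same group for every instance
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Start this program 3 times: because Topic A has a single partition,
        // the group assigns that partition to exactly one instance; the other
        // two stay idle until the owner leaves the group.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-a"));          // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```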

Event Hub -- how to prevent duplicate handling when consumers scale out

When we have multiple consumers of Event Hub (or any messaging service, for that matter), how do we make sure that no message is processed twice, especially when a consumer auto-scales out to multiple instances?
I know we could keep track of the last message processed, but then again, between checking whether a message was processed and actually processing it, another instance could have processed it already (race condition?).
So, how do we solve that in a scalable way?
[UPDATE]
I am aware there is a recommendation to have at least as many partitions as there are consumers, but what should we do when a single consumer cannot keep up with the messages directed to it and needs to scale out to multiple instances?
Each processor takes a lease on a partition; see the docs:
An event processor instance typically owns and processes events from one or more partitions. Ownership of partitions is evenly distributed among all the active event processor instances associated with an event hub and consumer group combination.
So scaling out doesn't result in duplicate message processing because a new processor cannot take a lease on a partition that is already being handled by another processor.
Then, regarding your comment:
i am aware there is a recommendation to have at least as many partitions as there are consumers
It is the other way around: it is recommended to have as many consumers as you have partitions. If you have more consumers than partitions the consumers will compete with each other to obtain a lock on a partition.
Now, regarding duplicate messages: since Event Hub guarantees at-least-once delivery, there isn't much you can do to prevent this. There aren't many scalable services that offer at-most-once delivery; I know Azure Service Bus queues offer it if you really need it.
The question may arise what can cause duplicate message processing. Well, when processing messages the processor does some checkpointing: once in a while it stores its position within a partition's event sequence (remember, a partition is bound to a single processor). When a processor instance crashes between two checkpoints, a new instance resumes processing messages from the position of the last checkpoint. That may very well lead to older messages being processed again.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group.
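As a rough sketch of what that looks like with the Azure Java SDK (azure-messaging-eventhubs plus the blob checkpoint store), the example below checkpoints after handling each event; the connection strings, container name, and consumer group are placeholders, and this is only one reasonable way to wire it up.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class ProcessorSketch {
    public static void main(String[] args) throws InterruptedException {
        // Blob container used to persist partition leases and checkpoints (placeholders).
        BlobContainerAsyncClient blobContainer = new BlobContainerClientBuilder()
                .connectionString("<storage-connection-string>")
                .containerName("checkpoints")
                .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString("<event-hub-connection-string>", "<event-hub-name>")
                .consumerGroup("$Default")
                .checkpointStore(new BlobCheckpointStore(blobContainer))
                .processEvent(eventContext -> {
                    // Handle the event, then record progress. If this instance crashes
                    // before updateCheckpoint(), the next owner of the partition resumes
                    // from the previous checkpoint and re-delivers these events.
                    System.out.println(eventContext.getEventData().getBodyAsString());
                    eventContext.updateCheckpoint();
                })
                .processError(errorContext ->
                        System.err.println("Error: " + errorContext.getThrowable()))
                .buildEventProcessorClient();

        processor.start();
        Thread.sleep(60_000);   // process for a minute, then shut down
        processor.stop();
    }
}
```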
So, that means you need to make sure your processing logic is idempotent. How you do that is up to you, as I don't know your use case.
One option is to track each individual message to see whether it has already been processed. If you do not have a unique ID to check on, you could generate a hash of the whole message and compare against that.
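A minimal sketch of that hash-based de-duplication idea is below; the in-memory set is only there to illustrate the check, and a real deployment would keep the "seen" keys in a shared durable store (database, cache, ...) so that all instances agree.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupingHandler {
    // Placeholder for a shared, durable store; an in-memory set only illustrates the idea.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public void handle(String messageBody) {
        String key = hash(messageBody);
        // add() returns false if the hash was already present: skip the duplicate.
        if (!seen.add(key)) {
            return;
        }
        process(messageBody);
    }

    private static String hash(String body) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(body.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private void process(String body) {
        System.out.println("processing: " + body);
    }
}
```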

Where is topic-partition-broker and segment-topic info saved in Pulsar?

I have read the documentation: https://pulsar.apache.org/docs/en/concepts-architecture-overview/
I am not able to figure out:
Where does Pulsar store which broker is assigned to a topic partition?
How does Pulsar know which segment belongs to which topic partition? When a new segment is finally closed and persisted on a bookie, how does Pulsar record which topic the segment belongs to?
When a read request comes in, how do brokers figure out which bookie to query for a topic's data?
Are all these stored in zookeeper?
Yes, all of this metadata is stored in ZooKeeper.
Topic load balancing is done by a leader-elected service, and its data is stored under /loadbalance/bundle-data.
Topic-to-ledger/segment info is stored under the path /managed-ledgers.
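For example, you can inspect those paths with the plain ZooKeeper Java client; the connect string is a placeholder, and the exact layout under the roots depends on your Pulsar version and deployment.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PulsarMetadataBrowser {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect string is a placeholder; use the ZooKeeper quorum your cluster uses.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Broker/bundle ownership data used by the load manager.
        List<String> bundles = zk.getChildren("/loadbalance/bundle-data", false);
        System.out.println("bundle-data children: " + bundles);

        // Managed-ledger metadata: tenant/namespace/persistent/topic -> list of ledgers.
        List<String> tenants = zk.getChildren("/managed-ledgers", false);
        System.out.println("managed-ledgers children: " + tenants);

        zk.close();
    }
}
```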

What happens when 2 servers disconnect from each other and then reconnect?

We have 2 servers, each with a peer, an orderer, and a Kafka broker. They are connected to the same channel, both have a chaincode installed and instantiated, and the endorsement policy is one organization or the other.
Imagine that the internet goes down and they disconnect:
Would both work individually?
Can they write new transactions to the ledger?
What would happen with the newly submitted blocks in the ledger when the internet is up and running again? How do these new blocks synchronize?
Thanks
EDIT1:
See image for clarification:
How would the network synchronize? If both write to the ledger during the disconnection, what happens to those newly generated blocks? Does one get invalidated, or are both valid?
Once disconnected, the peers won't receive keep-alives from the channel peers and will keep logging this if you have debug logging enabled.
A peer won't lose any configuration even though it got disconnected from the network. The discovery service in Fabric takes care of finding the peers configured in the channel, so once the connection resumes, the peer will automatically re-synchronize with the other peers via gossip messages.
The peers can then read from and write to the ledger as usual.
There are multiple things to consider here:
1) When you use a Kafka-based orderer, you will have to cluster the Kafka brokers if you expect them to be part of the same ordering service. Kafka is used to distribute the messages to the ordering nodes. If your Kafka brokers are not in a cluster, then you will have separate ordering services. Recall that Kafka also requires ZooKeeper. ZooKeeper has a 2f+1 fault tolerance model, so if you want to tolerate failure of a single node (failure includes communication issues), you will need at least 3 ZooKeeper nodes, and they should be deployed on separate hosts. For Kafka, you will want at least 2 brokers and would need to set the minimum ISRs (in-sync replicas) to 2. Ideally you'd have 4 Kafka brokers. (A sketch of these replication settings follows this list.)
2) In order for transactions to be processed, enough peers to satisfy the endorsement policy as well as the ordering service must be available / accessible. Peers which cannot connect to the ordering service will catch up once they can reestablish connectivity.
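To make the replication numbers in point 1) concrete, the sketch below shows what a single-partition topic with replication factor 3 and min.insync.replicas=2 looks like when created through the Kafka AdminClient. In a real Fabric deployment the orderer creates its channel topics itself, so this is purely illustrative; the broker addresses and topic name are placeholders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class OrdererTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka0:9092,kafka1:9092,kafka2:9092,kafka3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One single-partition topic per channel, replicated across brokers.
            // min.insync.replicas=2 means a write is only acknowledged once at least
            // two replicas have it, so losing one broker neither loses data nor
            // halts the ordering service.
            NewTopic channelTopic = new NewTopic("mychannel", 1, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(channelTopic)).all().get();
        }
    }
}
```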

Mentioning orderer name while using kafka configuration

I am using the orderer in Kafka mode. While invoking chaincode, I need to supply the orderer name. But then what is the use of Kafka for selecting an orderer if I need to supply the orderer name on my own?
I'll note that the client can initialize a channel in memory that has a record of multiple orderers, and the SDK should provide the option of sending your transaction via a random orderer. While one organization's client may communicate with one orderer, another organization might prefer to have its client set up to use a different orderer (or group of orderers, perhaps running on the organization's own servers).
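As a hedged sketch of that with the Hyperledger Fabric Java SDK (fabric-sdk-java 1.x), the snippet below registers two orderers on one channel handle; the client is assumed to be already configured with a crypto suite and an enrolled user context, and all names and URLs are placeholders.

```java
import java.util.Collection;
import org.hyperledger.fabric.sdk.Channel;
import org.hyperledger.fabric.sdk.HFClient;
import org.hyperledger.fabric.sdk.ProposalResponse;

public class MultiOrdererSketch {
    // The HFClient passed in must already have its crypto suite and user context set.
    public Channel buildChannel(HFClient client) throws Exception {
        Channel channel = client.newChannel("mychannel");
        // Register more than one orderer; the SDK can then pick among them when
        // submitting transactions instead of being tied to a single named orderer.
        channel.addOrderer(client.newOrderer("orderer0", "grpc://orderer0.example.com:7050"));
        channel.addOrderer(client.newOrderer("orderer1", "grpc://orderer1.example.com:7050"));
        channel.addPeer(client.newPeer("peer0", "grpc://peer0.org1.example.com:7051"));
        channel.initialize();
        return channel;
    }

    public void submit(Channel channel, Collection<ProposalResponse> endorsements) {
        // Hands the endorsed transaction to the channel's orderers; each orderer relays
        // it to the channel's Kafka partition and then consumes it back in order.
        channel.sendTransaction(endorsements);
    }
}
```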
Where Kafka comes in is that it's a way to provide crash fault tolerance to channels with high throughput and a setup of multiple orderers, by helping keep track of transactions and thus allowing proper sequencing of blocks. Specifically, when the client sends the transaction to an orderer, the orderer relays it to a partition that the Kafka cluster maintains, and the orderers then consume/read from this partition to package transactions into blocks (orderers are both producers and consumers in this setup). Kafka keeps all the orderers in sync by maintaining a stream of transactions that is used by all of them.
The full technical solution is outlined in https://docs.google.com/document/d/19JihmW-8blTzN99lAubOfseLUZqdrB6sBR0HsRgCAnY/edit; the image referenced below is from page 11 of that document.
From the readthedocs page (https://hyperledger-fabric.readthedocs.io/en/release-1.2/kafka.html):
Each channel maps to a separate single-partition topic in Kafka. When an OSN receives transactions via the Broadcast RPC, it checks to make sure that the broadcasting client has permissions to write on the channel, then relays (i.e. produces) those transactions to the appropriate partition in Kafka. This partition is also consumed by the OSN which groups the received transactions into blocks locally, persists them in its local ledger, and serves them to receiving clients via the Deliver RPC. For low-level details, refer to the document that describes how we came to this design — Figure 8 is a schematic representation of the process described above.
