Where is topic-partition-broker and segment-topic info saved in Pulsar? - apache-pulsar

Read the documentation: https://pulsar.apache.org/docs/en/concepts-architecture-overview/
I am not able to figure out:
Where does Pulsar store which broker is assigned to a topic partition?
How does Pulsar know which segment belongs to which topic partition? Whenever a new segment is closed and persisted to a bookie, how does Pulsar record which topic that segment belongs to?
When a read request comes in, how do brokers figure out which bookie to query for a topic's data?
Is all of this stored in ZooKeeper?

Yes, all of this metadata is stored in ZooKeeper.
Topic load balancing is done by a leader-elected service, and its data is stored under /loadbalance/bundle-data.
Topic-to-ledger/segment mappings are stored under the path /managed-ledgers.
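If you want to see this for yourself, you can point a plain ZooKeeper client at Pulsar's metadata ZooKeeper and list those paths. Below is a minimal sketch in Java; the connect string is an assumption, and the exact znode layout can differ between Pulsar versions.

```java
import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class ListPulsarMetadata {
  public static void main(String[] args) throws Exception {
    // Connect string for Pulsar's metadata ZooKeeper is an assumption here.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
    try {
      // Topic-to-ledger (segment) metadata lives under /managed-ledgers,
      // typically organised as tenant/namespace/persistent/topic.
      List<String> managedLedgers = zk.getChildren("/managed-ledgers", false);
      System.out.println("managed-ledgers children: " + managedLedgers);

      // Load-balance data used for broker-to-bundle assignment.
      List<String> bundleData = zk.getChildren("/loadbalance/bundle-data", false);
      System.out.println("bundle-data children: " + bundleData);
    } finally {
      zk.close();
    }
  }
}
```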

Related

How to implement Change Data Capture (CDC) using Apache Spark and Kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka libraries spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata that is stored in RDS,
but the RDS metadata can be added to or changed.
If I read and load the RDS table data into the application once, it would become stale when joining with the streaming data.
I understand I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement it?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change is available on a Kafka topic. From here it can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
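Once the RDS table is flowing into a Kafka topic via one of the two approaches above, Spark can consume that topic as a second stream and join it with the main stream. The sketch below shows the shape of this in Java with Spark Structured Streaming; the broker address, topic names and column names are assumptions, and in a real job you would likely keep only the latest metadata row per key and add watermarks to bound the join state.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CdcJoinSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("cdc-join-sketch")
        .getOrCreate();

    // Business events arriving from Kafka (topic name is an assumption).
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(key AS STRING) AS customer_id",
                    "CAST(value AS STRING) AS payload");

    // RDS changes captured by Kafka Connect / Debezium into another topic
    // (topic name "rds.metadata" is an assumption).
    Dataset<Row> metadata = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "rds.metadata")
        .load()
        .selectExpr("CAST(key AS STRING) AS customer_id",
                    "CAST(value AS STRING) AS meta");

    // Stream-stream inner equi-join (supported since Spark 2.3). Without
    // watermarks the join state grows unbounded, so add them in production.
    Dataset<Row> joined = events.join(metadata, "customer_id");

    joined.writeStream()
        .format("console")
        .start()
        .awaitTermination();
  }
}
```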

Apache BookKeeper delete log

In Apache BookKeeper, how do we delete a log entry in a ledger? If the ledger is immutable and entries cannot be deleted, how does Pulsar delete expired messages from bookies?
BookKeeper doesn't provide interfaces to delete individual entries in a ledger. It only provides methods to delete ledgers. Once a ledger is deleted, bookies will garbage-collect its entries in the background to reclaim disk space.
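To make that concrete, here is a minimal sketch of the ledger-level deletion using the BookKeeper client API in Java; the ZooKeeper connect string and the ledger id are assumptions for illustration.

```java
import org.apache.bookkeeper.client.BookKeeper;

public class DeleteLedgerSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the BookKeeper cluster via its ZooKeeper ensemble
    // (the connect string is an assumption for this example).
    BookKeeper bk = new BookKeeper("zk-host:2181");
    try {
      long ledgerId = 42L; // hypothetical ledger id
      // Deletes the whole ledger; bookies reclaim its entries in the
      // background. There is no call to delete a single entry.
      bk.deleteLedger(ledgerId);
    } finally {
      bk.close();
    }
  }
}
```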
A Pulsar topic partition is comprised of multiple ledgers. At any given time, the Pulsar broker writes to one ledger per partition. After the ledger reaches a given size, or after a certain period of time, the broker closes the ledger it is writing and opens a new one. Pulsar keeps the list of ledgers as part of the metadata of a topic partition. Once all the messages in a ledger have been consumed or have expired, the ledger is deleted by the Pulsar broker.
The following links are useful for understanding this:
[1] how a segment-based architecture delivers better performance, scalability, and resilience
[2] pulsar faq

how to delete a consumer group C belonging to topic T when using the new consumer API

Newer versions of Apache Kafka store consumer group data internally instead of in ZooKeeper,
which means the kafka-consumer-groups command-line utility is no longer useful for this, since the related info is not stored in ZooKeeper anymore.
Could you please advise:
how do I delete a consumer group C belonging to topic T when using the new consumer API?
Note that deletion of a group is only available when the group metadata is stored in ZooKeeper. When using the new consumer API (where the broker handles coordination of partition handling and rebalance), the group is deleted when the last committed offset for that group expires.
./kafka-consumer-groups.sh --zookeeper <zookeeper-host:port> --delete --group <group-name>
If none of the members of the consumer group commit offsets for 24 hours (with default settings), the group will expire and be deleted automatically.
Alternatively, you can reset the offsets, or set them to any value you want, using the new --reset-offsets option of bin/kafka-consumer-groups.
See Kafka 0.11 how to reset offsets
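If you are able to move to a newer client and broker (kafka-clients 2.0+ against brokers 1.1+), the AdminClient API also exposes group deletion programmatically. A minimal sketch, assuming the broker address and the group name "C":

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteGroupSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Broker address is an assumption for this example.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // Deletes the group "C"; this fails if the group still has active members.
      admin.deleteConsumerGroups(Collections.singleton("C")).all().get();
    }
  }
}
```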

Worker Queue option in Kafka

We are developing an application which will receive time-series sensor data as byte arrays from a set of devices via UDP. This data needs to be parsed and stored in a Cassandra database.
We have been using RabbitMQ as the message broker, with work-queue-based consumers that parse the data and push it into Cassandra. Because of increasing traffic, we are concerned about RabbitMQ performance and are planning to move to Kafka. Our understanding is that the same pattern can be implemented using a consumer group in Kafka. Is our understanding correct?
With Apache Kafka, you can scale a topic relatively easily. To process more data in the same amount of time, you'll need to:
Run multiple consumers in the same consumer group, so you can consume multiple messages at the same time. You are limited by the number of partitions of the topic.
Increase the number of partitions for the topic, and increase the number of consumers.
Increase the number of brokers if you still need to process more data.
I would approach scalability in the order described above, but Kafka can handle a lot. In a setup with 2 brokers, 4 partitions per topic and 2 consumers (each consumer using one thread per partition), where each consumer decodes JSON into a Java object, enriches it and stores it in Cassandra, it can handle 30k messages/s (data is batched into batches of 200 insert statements).
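The work-queue equivalent in Kafka is exactly the consumer-group mechanism: every worker that subscribes with the same group.id gets a disjoint subset of the topic's partitions. A minimal sketch of one such worker is below, assuming a recent kafka-clients version; the broker address, topic and group name are assumptions, and you would start as many copies as you have partitions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorWorker {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address, topic and group name are assumptions for this example.
    props.put("bootstrap.servers", "broker:9092");
    props.put("group.id", "sensor-workers"); // all workers share this group id
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singleton("sensor-data"));
      while (true) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, byte[]> record : records) {
          // Parse the byte[] payload and write it to Cassandra here.
        }
      }
    }
  }
}
```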

Aggregate separate Flume streams in Spark

I am researching the ability to do some "realtime" logprocessing in our setup and I have a question on how to proceed.
So the current setup (or as we intend to do it) is as follow:
Server A generates logfiles through Rsyslog to a folder per customer.
Server B generates logfiles through Rsyslog to a folder per customer.
Both server A and B generate up to 15 logfiles (1 per customer) in a folder per customer, the structure looks like this:
/var/log/CUSTOMER/logfile.log
On server C we have a Flume sink running that listens to Rsyslog tcp messages from server A and server B. Currently for testing we only have 1 flume sink for 1 customer, but I think we will need 1 flume sink per customer.
This Flume sink then forwards these loglines to a Spark application that should aggregate the results per customer.
Now my question is: how can I make sure that Spark (Streaming) will aggregate the results per customer? Let's say each customer has its own Flume sink; how can I make sure Spark aggregates each Flume stream separately and doesn't mix two or more Flume streams together?
Or is Kafka more suitable for this kind of scenario?
Any insights would be appreciated.
You can use Kafka with the customer id as the partition key. The basic idea in Kafka is that a message can have both a key and a value. Kafka guarantees that all messages with the same key go to the same partition (Spark Streaming understands the concept of partitions in Kafka and lets you have a separate node handling every partition). If you want, you can use Flume's Kafka sink to write the messages to Kafka.
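Here is a minimal sketch of producing with the customer id as the record key, in Java; the broker address, topic name and customer id are assumptions. The same idea applies if Flume's Kafka sink populates the message key for you.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CustomerLogProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address and topic name are assumptions for this example.
    props.put("bootstrap.servers", "broker:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      String customerId = "CUSTOMER";   // e.g. taken from the /var/log/CUSTOMER/ path
      String logLine = "raw syslog line ...";

      // Using the customer id as the record key means Kafka's default
      // partitioner sends all of this customer's lines to the same partition,
      // so they are never interleaved with another customer's stream.
      producer.send(new ProducerRecord<>("customer-logs", customerId, logLine));
    }
  }
}
```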
