I am curious about how compression works in Pulsar. The public doc states: "You can compress messages published by producers during transportation."
Does this mean the client compresses the data and the broker decompresses it on arrival, so the data is persisted and later consumed uncompressed? Or does the compression happen end-to-end, with decompression on the consumer side?
Compression and decompression are done on the client and are transparent to the broker. The message is stored compressed in the ledgers. The compression details (algorithm used, etc.) are part of the message metadata.
The same principles apply to batching and encryption.
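For illustration, here is a minimal sketch with the Pulsar Java client showing where compression is configured (the service URL and topic name are placeholders):

```java
import org.apache.pulsar.client.api.CompressionType;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class CompressedProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker address
                .build();

        // Compression is a producer-side setting: the client compresses each
        // payload before sending, the broker stores it compressed as-is, and
        // the consumer decompresses transparently using the message metadata.
        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")                    // placeholder topic
                .compressionType(CompressionType.LZ4) // also ZLIB, ZSTD, SNAPPY
                .create();

        producer.send("hello pulsar".getBytes());

        producer.close();
        client.close();
    }
}
```

No broker-side configuration is needed; a consumer reading this topic decompresses automatically.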
Related
I found that non-persistent messages are sometimes lost even though my Pulsar client is up and running.
The non-persistent messages are lost when the throughput is high (more than 1,000 messages within a very short period of time, which I personally don't think is high).
If I increase the receiverQueueSize parameter or switch to persistent messages, the problem goes away.
I checked the Pulsar source code (I am not sure it is the latest version):
https://github.com/apache/pulsar/blob/35f0e13fc3385b54e88ddd8e62e44146cf3b060d/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/nonpersistent/NonPersistentDispatcherMultipleConsumers.java#L185
and I think that Pulsar simply drops those non-persistent messages if no consumer is available to handle the newly arrived ones.
"No consumer" here means
no consumer subscribes to the topic,
OR all consumers are busy processing previously received messages.
Is my understanding correct?
The Pulsar broker does not do any buffering of messages for non-persistent topics, so if consumers are not connected, or are connected but not keeping up with the producers, the messages are simply discarded.
This is done because any in-memory buffering would be very limited anyway and not sufficient to change any of the semantics.
Non-persistent topics are really designed for use cases where data loss is acceptable (e.g., sensor data that updates every second, where you only care about the latest value). For all other cases, a persistent topic is the way to go.
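As a sketch of the two workarounds mentioned above (broker address, topic, and subscription names are placeholders), the consumer can be given a larger receive queue, or the topic URI can simply be changed to persistent://, in which case the broker buffers instead of discarding:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class NonPersistentConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker address
                .build();

        // A larger receiver queue lets the consumer absorb short bursts that
        // would otherwise be dropped by the broker on a non-persistent topic.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("non-persistent://public/default/sensor-updates") // placeholder
                .subscriptionName("my-sub")
                .receiverQueueSize(10_000) // default is 1000; raise it for bursts
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            // process the message here ...
            consumer.acknowledge(msg);
        }
    }
}
```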
I'm working on a NestJS project that receives data from SAP MII and then sends it to EventHub. Unfortunately, EventHub supports a maximum message size of 1 MB (https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas), and in my case SAP MII sometimes returns 4 MB+ that I still need to send to EventHub.
I have a few ideas in mind, but I'm not sure if there's a better approach, or even whether there's a way to raise the EventHub size limit.
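One possible approach, sketched here purely for illustration, is chunking: split each oversized payload into sub-1 MB pieces tagged with a correlation ID and sequence number, and reassemble them on the consuming side. A minimal sketch (in Java; the class name and the 900 KB threshold are assumptions, chosen to leave headroom for event metadata):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PayloadChunker {
    // Stay safely under the 1 MB limit, leaving headroom for event
    // metadata and encoding overhead (the exact threshold is an assumption).
    private static final int MAX_CHUNK_BYTES = 900 * 1024;

    /** Splits a payload into pieces that each fit within the size limit. */
    static List<byte[]> split(byte[] payload) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < payload.length; offset += MAX_CHUNK_BYTES) {
            int end = Math.min(offset + MAX_CHUNK_BYTES, payload.length);
            chunks.add(Arrays.copyOfRange(payload, offset, end));
        }
        return chunks;
    }
}
```

Each chunk would be sent as its own event carrying the correlation ID, its index, and the total chunk count, so the downstream consumer can reassemble the original payload.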
I'm working with Node.js microservices; so far they communicate through Kafka Consumers/Producers. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to db), but I'm not sure if the current approach could be improved using Kafka Streams or if I should continue using Consumers.
The Streams API is a higher-level abstraction that sits on top of the Consumer/Producer APIs. It allows you to filter and transform messages, and to build a topology of processing steps.
For what you're describing, if you're just picking up messages and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API and simply not use the other features.
build a Logger MS which must record all the messages and do some processing (parse and save to db)
I would suggest using something like the Streams API or a Node.js Producer + Consumer to parse and write back to Kafka.
From your parsed/filtered/sanitized messages, you can then run a Kafka Connect cluster to sink the data into a DB.
could be improved using Kafka Streams or if I should continue using Consumers
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a Consumer.
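For example, a minimal Streams topology for the logging use case might look like this (topic names, application ID, and the parse step are placeholders, not from the question):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LoggerTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "logger-ms");       // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> messages = builder.stream("input-topic");  // placeholder

        messages
            .peek((key, value) -> System.out.println("logged: " + value)) // record every message
            .mapValues(LoggerTopology::parse)                             // placeholder parse step
            .to("parsed-topic"); // Kafka Connect can then sink this topic to the DB

        new KafkaStreams(builder.build(), props).start();
    }

    static String parse(String raw) {
        return raw.trim(); // stand-in for real parsing logic
    }
}
```

Writing the parsed records back to a topic and letting a Kafka Connect sink handle the database write keeps the DB logic out of the service, per the suggestion above.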
I'm curious whether it is an absolute must that a Spark Streaming application be brought down gracefully, or whether it runs the risk of causing duplicate data via the write-ahead log. In the scenario below I outline a sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
1. Spark queue receiver pulls a batch of messages from the queue.
2. Spark queue receiver stores the batch of messages into the write-ahead log.
3. Spark application is terminated before an ack is sent to the queue.
4. Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1, because they were not acknowledged as received.
...
Is my understanding correct about how custom receivers should be implemented and the duplication problems that come with them? And is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
For further information on the Direct API and how to use it, check out this blog post by Databricks.
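As a sketch of the Direct API with the Kafka 0.8 connector this answer refers to (broker address and topic name are placeholders; later Spark versions use a different, 0.10-based API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectStreamExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("direct-example");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092"); // placeholder brokers

        Set<String> topics = new HashSet<>();
        topics.add("events"); // placeholder topic

        // No receiver and no write-ahead log: the driver computes offset ranges
        // per batch and tracks them in its checkpoints.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // Records arrive in Spark exactly once; end-to-end exactly-once still
        // depends on the output operation being idempotent or transactional.
        stream.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```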
I want to create a server which relays an HTTP stream with about 1 minute of latency.
In other words, I have a server A streaming audio at http://a.crappyserver.com:8000/stream.mp3. How can I create another stream at, say, http://b.crappyserver.com:8080/stream.mp3, which carries the same audio with about a 1-minute lag?
UPDATE: I can only use an Arch Linux server to do so.
You could put the streaming audio into a circular queue which holds a minute of data. Just allocate storage and keep track of publish and consume index values.
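A minimal sketch of that idea (in Java; the 128 kbit/s bitrate and all names are assumptions): fixed storage slightly larger than the delay, with monotonically increasing publish/consume counters.

```java
public class DelayBuffer {
    private static final int BYTES_PER_SECOND = 16_000;           // 128 kbit/s MP3 (assumed)
    private static final int DELAY_BYTES = 60 * BYTES_PER_SECOND; // one minute of lag
    private final byte[] ring = new byte[70 * BYTES_PER_SECOND];  // 10 s of slack beyond the delay

    private long published = 0; // total bytes ever written
    private long consumed = 0;  // total bytes ever read

    /** Feed bytes in as they arrive from the upstream stream. */
    public synchronized void publish(byte[] data, int len) {
        for (int i = 0; i < len; i++) {
            ring[(int) (published++ % ring.length)] = data[i];
        }
    }

    /**
     * Read delayed bytes into out; returns 0 until a full minute has been
     * buffered. A consumer lagging by more than the 10 s of slack would be
     * served bytes that have already been overwritten, so a real relay
     * would detect that and resynchronize.
     */
    public synchronized int consume(byte[] out) {
        long available = published - DELAY_BYTES - consumed;
        if (available <= 0) {
            return 0; // still filling the one-minute delay window
        }
        int n = (int) Math.min(out.length, available);
        for (int i = 0; i < n; i++) {
            out[i] = ring[(int) (consumed++ % ring.length)];
        }
        return n;
    }
}
```

A fetcher thread would call publish() with bytes read from the upstream stream, while the relay's HTTP handler calls consume() to fill its response body.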