Apache Pulsar Java client taking too much memory (OOM) - apache-pulsar

I wrote a simple Apache Pulsar client with Spring Boot - a Pulsar producer initialized as a bean that is used in the REST controller to publish incoming API messages to Pulsar, and a consumer that consumes messages, prints some values to the console, and acknowledges them.
As of now the application is very simple, but the moment this Spring Boot app loads I see a memory spike, at times leading to OOM. Is there any specific configuration to be used when using the Pulsar client with Spring Boot?
The code is mostly the one found in the Pulsar docs.

I am answering this to document the issue: do not use a receive loop to consume messages; instead register a MessageListener on the consumer via
consumer.messageListener(new MyConsumer())
or
consumer.messageListener((consumer, msg) -> { /* do something */ })
The docs didn't mention this, but I found it while browsing the consumer API.
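For reference, here is a minimal sketch of a listener-based consumer (assuming a broker at pulsar://localhost:6650; the topic and subscription names are made up for the example):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ListenerConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")    // assumed broker URL
                .build();

        // The listener is registered on the builder; Pulsar invokes it on its own
        // listener threads, so no receive() loop (and no busy waiting) is needed.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")                        // hypothetical topic
                .subscriptionName("my-subscription")      // hypothetical subscription
                .subscriptionType(SubscriptionType.Shared)
                .messageListener((c, msg) -> {
                    try {
                        System.out.println("Received: " + new String(msg.getData()));
                        c.acknowledge(msg);
                    } catch (Exception e) {
                        c.negativeAcknowledge(msg);       // redeliver on failure
                    }
                })
                .subscribe();

        // Keep the application running; the listener threads do the work.
    }
}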

Related

Query a Kafka Topic with nodejs

I'm a bit puzzled. Is there really no NodeJS API to query Kafka topics (e.g. as with Kafka Streams and the Java API)? Am I missing something?
Just to get this straight: only being notified of the latest event/record of a topic is not enough for me. I want to query and process the topics' records - and then maybe store an aggregate to another topic.
Thanks for your thoughts on whether this is possible with Node.js and a library only.
Here is what worked for me and what most people use.
Limited solution
If you are stubborn and want to insist on a node.js library to wrap things up for you: https://nodefluent.github.io/kafka-streams/docs/
As of today they offer:
easy access streams
merge streams
split streams
Full blown solution
The easiest way (as of today - there are rumors Confluent is creating more libraries, including one for Node.js) to query Kafka is via the REST API. It is part of ksqlDB, and ksqlDB is part of the Confluent Platform, which also ships with ZooKeeper and a Kafka instance, which you probably already have. If you wonder how to install it:
It spins up in a minute with the Docker Compose file.
Run docker-compose up -d
See the ports and services running with docker ps
Start requesting the status from the REST API by simply sending a GET request to http://0.0.0.0:8088/. It will return service information.
{
  "KsqlServerInfo": {
    "version": "6.2.0",
    "kafkaClusterId": "uOXfee3zQ76vCKBbREw1yg",
    "ksqlServiceId": "default_",
    "serverStatus": "RUNNING"
  }
}
Hope this saves some of you the initial research. And... if we are lucky there will be a wrapper library soon.
Then create a stream out of your topic and voila - you are ready to query your topic (through the stream) with the REST API. Since the REST API supports HTTP/2, you can also get continuous updates on freshly arriving records in the stream: use push queries for this. Pull queries close the connection once the result has been delivered.
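As a rough illustration, here is a Java sketch that creates a stream and runs a push query over the REST API (the stream name, columns, and topic are made up for the example; adjust to your schema and server address):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlDbQuery {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1) Create a stream over an existing topic (hypothetical "pageviews" topic).
        String createStream = "{\"ksql\": \"CREATE STREAM pageviews_stream "
                + "(viewtime BIGINT, userid VARCHAR, pageid VARCHAR) "
                + "WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');\", "
                + "\"streamsProperties\": {}}";
        HttpRequest create = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/ksql"))
                .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(createStream))
                .build();
        System.out.println(client.send(create, HttpResponse.BodyHandlers.ofString()).body());

        // 2) Push query: EMIT CHANGES keeps the response open and streams new rows
        //    as they arrive (a pull query without EMIT CHANGES returns once and closes).
        String pushQuery = "{\"ksql\": \"SELECT * FROM pageviews_stream EMIT CHANGES;\", "
                + "\"streamsProperties\": {}}";
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/query"))
                .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(pushQuery))
                .build();
        client.send(query, HttpResponse.BodyHandlers.ofLines())
              .body()
              .forEach(System.out::println);   // runs until the process is stopped
    }
}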

How does Spring Cloud Stream prevent the application’s instances from receiving duplicate messages?

Spring Cloud Stream is based on at-least-once delivery. This means that in some rare cases a duplicate message can arrive at an endpoint.
Does Spring Cloud Stream keep a buffer of already received messages?
The Idempotent Receiver pattern in the Enterprise Integration Patterns book suggests:
Design a receiver to be an Idempotent Receiver, one that can safely receive the same message multiple times.
Does Spring Cloud Stream control duplicate messages in consumers?
Update:
A paragraph from Spring Cloud Stream says :
4.5.1. Durability
Consistent with the opinionated application model of Spring Cloud Stream, consumer group subscriptions are durable. That is, a binder implementation ensures that group subscriptions are persistent and that, once at least one subscription for a group has been created, the group receives messages, even if they are sent while all applications in the group are stopped.
Anonymous subscriptions are non-durable by nature. For some binder implementations (such as RabbitMQ), it is possible to have non-durable group subscriptions.
In general, it is preferable to always specify a consumer group when binding an application to a given destination. When scaling up a Spring Cloud Stream application, you must specify a consumer group for each of its input bindings. Doing so prevents the application’s instances from receiving duplicate messages (unless that behavior is desired, which is unusual).
I think your assumptions about the responsibility of the spring-cloud-stream framework are incorrect.
Spring-cloud-stream, in a nutshell, is a framework responsible for connecting and adapting producers/consumers provided by the developer to the message broker(s) exposed by the spring-cloud-stream binder (e.g., Kafka, Rabbit, Kinesis, etc.).
So connecting to a broker, receiving a message from the broker, deserialising it, invoking user code, serialising the result and sending it back to the broker is in the scope of framework responsibility. You can look at it as purely infrastructure.
What you're describing is more of an application concern, since the actual receiver is something the user develops as part of the spring-cloud-stream development experience; hence responsibility for idempotence resides with that user.
Also, on top of that, most brokers already handle idempotency (in a way) by ensuring that a particular message is delivered only once. That said, if someone sends an identical message to such a broker, it has no idea that it is a duplicate, so the requirement for idempotency and/or deduplication is still valid. But as you can see, it is not as straightforward given the number of factors in play: your understanding of idempotence could be different from mine, hence our approaches could be different as well.
One last thing (partially to prove my last point): "can safely receive the same message multiple times" - that is all it states, but what does safely really mean to you vs. me vs. some other person?
If you are concerned about a case where the application receives and processes a message from the broker but crashes before it acknowledges the message, that can happen. Spring Cloud Stream App Starters provide support for auto-configuration of a persistent message metadata store which backs Spring Integration's IdempotentReceiverInterceptor. An example of this is in the SFTP source app starter. By default, the SFTP source uses an in-memory metadata store, so it would not survive a restart, but it can be customized to use a persistent store.
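If you do want to guard against redelivery at the application level, here is a minimal sketch of an idempotent receiver for the functional binding model (the binding name and the eventId header are assumptions for the example; in production the set of processed IDs would live in a persistent store such as Redis or a database):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class IdempotentConsumerConfig {

    // In-memory record of processed IDs; does not survive a restart.
    private final Set<String> processedIds = Collections.newSetFromMap(new ConcurrentHashMap<>());

    @Bean
    public Consumer<Message<String>> orders() {
        return message -> {
            String eventId = (String) message.getHeaders().get("eventId"); // hypothetical header
            if (eventId == null || !processedIds.add(eventId)) {
                // Already seen (or no ID at all): skip, so redelivery does no harm.
                return;
            }
            // ... actual business processing goes here ...
            System.out.println("Processing " + message.getPayload());
        };
    }
}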

JMS Problems Spring Batch With Partitioned Jobs On JBoss 5.2 EAP

We are using Spring Batch and partitioned jobs extensively in our project. Occasionally we see problems with partitioned jobs getting "hung" because of what appears to be lost messages. The remote partitions all complete, but the parent step stays in STARTED. Our configuration uses one connection factory for reading messages from the queues (inbound gateway) and a different, clustered connection factory to send out the partition messages (outbound gateway). The reason for this is that JBoss Messaging doesn't uniformly distribute messages around the cluster, and the client connection factory provides that functionality.
Red Hat came in and frankly threw mud at Spring and the configuration. The following are excerpts from their report:
The Spring JMSTemplate code employs several anti-patterns, like creating a new connection, session, and producer just to send a message, then closing the connection. Also, when receiving a message it can create a consumer each time, receive the message, then close the consumer. This can result in poor performance under load. The use of anti-patterns not only results in poor performance, but can deplete operating system resources such as threads and file handles, since some of the connection resources are released asynchronously. Moreover, with non-durable topic subscribers you can end up losing messages, since any messages received between the closing of the last and the opening of the next consumer will be lost. The one place where it may be acceptable to use the Spring JMSTemplate is inside the application server using the JCA managed connection factory (normally at "java:/JmsXA"), and that only works when you're sending messages.
The JCA managed connection factory caches connections so they will not actually be created each time. However, using the JCA managed connection factory will not resolve the issue with consumers, since they are not cached.
In summary, the Spring JMSTemplate is not safe to use apart from the very specific use case of using it inside the application server with the JCA managed connection factory (java:/JmsXA), and only in that case to send messages (do not use it to consume messages).
Using it from a JMS client application outside the application server is never safe, and using it with a standard connection factory (e.g. "ConnectionFactory", "ClusteredConnectionFactory", "jms/RemoteConnectionFactory", etc.) is never safe; likewise, using it to receive messages is never safe. To safely receive messages using Spring, consider the use of MessageListenerContainers [7] with Message-Driven POJOs [8].
Finally, note that the issues encountered are based on JMS anti-patterns and are thus not a problem specific to JBoss EAP. For example, see a similar discussion with regard to ActiveMQ [9].
Red Hat does not support using the Spring JMSTemplate with JBoss Messaging apart from the one acceptable use case of sending messages via the JCA managed connection factory.
RECOMMENDATIONS
● As to Spring JMS, as a rule, use JCA managed connection factories configured in JBoss EAP. Do not use the Spring-configured connection factories. Use a JNDI template to pull the connection factories into Spring from JBoss. This will get rid of most of the Spring JMS problems.
● Use standard JMS instead of Spring JMS for the batch job. Spring is a non-standard (and probably sub-standard) implementation of JMS. Standard JMS uses a pool of a few senders to send the message and closes the session after the message is sent. On the listener side, standard JMS uses a pool of workers listening to a distributed Queue or Topic. Each web server has the JMS listener deployed as a singleton and uses a standard Java observer to notify any caller that is expecting a callback.
The JMS connection factories are configured in JBoss and loaded via JNDI.
Can you provide your feedback on their assessment?
To avoid the overhead of creating new connections/sessions per send, you need to wrap the provider's connection factory in a CachingConnectionFactory. It reuses the same connection for sends and caches sessions, producers, and consumers.
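A minimal sketch of that wiring (bean names are illustrative, and providerConnectionFactory stands for the vendor's ConnectionFactory, e.g. obtained via a JNDI lookup):

import javax.jms.ConnectionFactory;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;

@Configuration
public class JmsConfig {

    @Bean
    public CachingConnectionFactory cachingConnectionFactory(ConnectionFactory providerConnectionFactory) {
        // Reuses one underlying connection and caches sessions/producers, so
        // JmsTemplate no longer opens and closes them on every send.
        CachingConnectionFactory ccf = new CachingConnectionFactory(providerConnectionFactory);
        ccf.setSessionCacheSize(10);
        return ccf;
    }

    @Bean
    public JmsTemplate jmsTemplate(CachingConnectionFactory cachingConnectionFactory) {
        return new JmsTemplate(cachingConnectionFactory);
    }
}

For the receiving side, the report's own recommendation applies: use a message listener container (for example DefaultMessageListenerContainer or @JmsListener) rather than JmsTemplate receive calls.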

Apache Camel - Browse Exchanges of a SEDA queue

I'm working on a small app which uses Apache Camel with JMX active.
Very simply put, I have a route using SEDA component - just 1 consumer - which in a nutshell creates its own thread and queues incoming Exchanges if the route is busy.
Basically I'd like to monitor/browse/visualize the Exchanges that are waiting in the SEDA queue. I've tried Hawtio and JConsole with JMX, but they only provide the number of total and currently inflight exchanges on that given route. They don't mention the number of Exchanges waiting to be processed.
I've also tried the Browse component, which keeps track of all exchanges being passed to the browse endpoint; however, it keeps all the exchanges, as opposed to just the "queued" ones.
I'm wondering if there is something out-of-the-box in Camel which allows me to do this or if I overlooked something in Hawtio or JConsole.
Thanks in advance.
You can see on the SedaEndpoint MBean how many messages are in the queue. You can find it in the endpoints tree in Hawtio, or in plain JMX as well.
@ManagedAttribute(description = "Current queue size")
public int getCurrentQueueSize() {
    return queue.size();
}
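The same attribute can also be read in-process from the endpoint, which is handy for quick checks or custom metrics. A minimal sketch (the route, endpoint name, and delay are made up for the example):

import org.apache.camel.CamelContext;
import org.apache.camel.ProducerTemplate;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.seda.SedaEndpoint;
import org.apache.camel.impl.DefaultCamelContext;

public class SedaQueueDepth {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("seda:orders")
                    .delay(1000)                 // simulate a slow consumer so exchanges pile up
                    .log("processed ${body}");
            }
        });
        context.start();

        ProducerTemplate template = context.createProducerTemplate();
        for (int i = 0; i < 100; i++) {
            template.sendBody("seda:orders", "msg-" + i);
        }

        // Same value the SedaEndpoint MBean exposes as CurrentQueueSize.
        SedaEndpoint endpoint = context.getEndpoint("seda:orders", SedaEndpoint.class);
        System.out.println("Waiting in queue: " + endpoint.getCurrentQueueSize());
    }
}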

Using Apache Kafka for log aggregation

I am learning Apache Kafka from their quickstart tutorial: http://kafka.apache.org/documentation.html#quickstart. Up to now, I have done the setup as follows: a producer node, where a web server is running at port 8888, and a Kafka server (broker), consumer, and ZooKeeper instance on another node. I have tested the default console/file producer and consumer with 3 partitions. The setup works, and I am able to see the messages I sent in the order they were created (within each partition).
Now, I want to send the logs generated by the web server to the Kafka broker. These messages will be processed by a consumer later. Currently I am using syslog-ng to capture server logs to a text file. I have come up with 3 rough ideas on how to implement the producer to use Kafka for log aggregation:
Producer Implementations
First Kind:
Listen to the TCP port of syslog-ng. Fetch each message and send it to the Kafka server. Here we have two middle processes: the producer and syslog-ng.
Second Kind: Use syslog-ng itself as the producer. We would need to find a way to send messages to the Kafka server instead of writing to a file. Syslog-ng, the producer, is the only middle process.
Third Kind: Configure the web server itself as the producer.
Am I correct in my thinking? In the last case we don't have any middle process, but I am concerned its implementation will affect server performance. Can anyone let me know the best way of using Apache Kafka (if the above 3 are not good) and guide me through the appropriate server configuration?
P.S.: I am using node.js for my web server
Thanks,
Sarath
Since you specify that you wish to send the logs generated to kafka broker, it indeed looks as if executing a process to listen and resend messages mainly creates another point of failure with no additional value (unless you need a specific syslog-ng capability).
Syslog-ng can send messages to external applications using:
http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-3.4-guides/en/syslog-ng-ose-v3.4-guide-admin/html/configuring-destinations-program.html. I don't know if there are other ways to do that.
For the third option, I am not sure if Kafka can easily be integrated into Node.js, as it requires a C++ producer, and when I last looked for one I was not able to find it. However, an easy alternative could be to have Kafka read the log file created by the server and send those logs (using the console producer provided with Kafka). This is usually a good approach, as it completely removes dependencies between Kafka and the web server (embedding the producer would require error handling, configuration, etc.). It requires the use of tail --follow, and it works very well for us. If you want more details on that, I can include them as well. You would still need to supervise the Kafka execution to make sure messages are not lost (and provide a recovery option to re-send messages that failed while the broker was unavailable). But the good thing about this method is that there are no dependencies between the tools.
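For illustration, the tail-to-console-producer pipeline can be as simple as the following one-liner (the log path, broker address, and topic name are made up for the example): tail --follow=name /var/log/nginx/access.log | kafka-console-producer.sh --broker-list localhost:9092 --topic weblogs. Using --follow=name (optionally with --retry) keeps tail following the file by name across log rotation, and the console producer ships each new line to the topic as a separate message.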
Hope it helps...
Eran
