What is the default garbage collector in Storm, and why?
Can someone please explain what happens to tuples in memory after they are acknowledged by a Bolt?
Since Apache Storm is a JVM-based project, garbage collection is handled entirely by the JVM running each Storm worker process, using whatever GC policy that JVM is configured with.
I might be wrong, but it looks to me like you are mixing two things here: JVM garbage collection and Storm's acknowledgement process.
Here is how acknowledgement in Apache Storm works:
Apache Storm spouts keep messages (events) in their output queues until they are acknowledged. The acknowledgement occurs only after the topology has successfully processed a message (event). If an acknowledgement arrives for a message (event) within a reasonable amount of time, the spout clears that message from its output queue. If an acknowledgement does not arrive within a predefined period (30 seconds by default), the spout replays the message through the topology again.
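Once the spout drops an acknowledged message from its output queue, the tuple is just an ordinary Java object with no remaining references, so it is reclaimed by whatever garbage collector the worker JVM runs with. On the bolt side the contract looks roughly like this (a minimal sketch against the Storm 2.x API; the class and field names are illustrative only):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AckingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchor the emitted tuple to the input so the tuple tree stays tracked downstream.
        collector.emit(input, new Values(input.getString(0).toUpperCase()));
        // Ack tells the acker (and ultimately the spout) this bolt is done with the tuple;
        // the spout can then clear it from its pending queue and the JVM GC can reclaim it.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}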
Recommended reading: Guaranteeing Message Processing.
We are developing an application that requires messages with the same key to be processed strictly in sequence. In addition, for performance/throughput reasons, we need to introduce parallel processing.
Parallelizing is easy - we can have a single thread receiving the messages, calculating a hash on the key, and using hash % number of workers to distribute each message to a particular blocking queue with a worker on the other side. This guarantees that messages with the same key are dispatched to the same worker, so ordering is guaranteed - as long as the receiver gets the messages in order.
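A minimal sketch of that dispatch scheme (Java, with hypothetical names, just to illustrate the hashing and per-worker queues):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class KeyedDispatcher {
    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    public KeyedDispatcher(int workers) {
        for (int i = 0; i < workers; i++) {
            queues.add(new ArrayBlockingQueue<>(10_000));
        }
    }

    // One receiver thread calls this for every incoming message: same key -> same queue,
    // so each worker sees its keys in arrival order.
    public void dispatch(String key, String message) throws InterruptedException {
        int worker = Math.floorMod(key.hashCode(), queues.size()); // floorMod avoids negative indexes
        queues.get(worker).put(message); // blocks if that worker falls behind
    }

    public BlockingQueue<String> queueFor(int worker) {
        return queues.get(worker);
    }
}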
The questions are:
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
If we increase them, are we still guaranteed ordering?
The Pulsar documentation is not clear...
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
It might, depending on various factors.
IoThreads: this is the thread pool used to manage the TCP connections with brokers. If you're producing/consuming across many topics, you'll most likely be interacting with multiple brokers and thus have multiple TCP connections open. Increasing the ioThreads count might remove the "single thread bottleneck", though it is only effective if such a bottleneck is actually present (most of the time it is not). You can check the CPU utilization in your consumer process, across all threads, to see whether any thread is approaching 100% of a single CPU core.
ListenerThreads: this is the thread pool used when you consume through a message listener. Typically this is the thread pool the application uses to process the messages (unless it hands them off to a different thread). It might make sense to increase the thread count here if the application's processing is reaching the 1 CPU core limit.
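For reference, both knobs are set on the client builder; a sketch with the Java client (URL and values are placeholders):

import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class ClientSetup {
    public static PulsarClient build() throws PulsarClientException {
        // Both pools default to 1 thread; raise them only if profiling shows a single
        // IO thread or listener thread pegged at one CPU core.
        return PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder URL
                .ioThreads(4)                          // TCP connection handling
                .listenerThreads(4)                    // MessageListener callbacks
                .build();
    }
}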
If we increase them, are we still guaranteed ordering?
Yes.
IO threads: 1 TCP connection is always mapped to 1 IO thread
ListenerThreads: 1 Consumer is assigned to 1 listener thread
You may also want to look at using the new key-shared subscription type that was introduced in Pulsar 2.4. Per the documentation,
Messages are delivered in a distribution across consumers, and messages with the same key or same ordering key are delivered to only one consumer.
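In the Java client that subscription type is selected on the consumer builder; a sketch (topic and subscription names are placeholders):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedConsumer {
    public static Consumer<byte[]> subscribe(PulsarClient client) throws PulsarClientException {
        // Key_Shared lets several consumers share one subscription while all messages
        // with the same key keep going to the same consumer, preserving per-key order.
        return client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-subscription")
                .subscriptionType(SubscriptionType.Key_Shared)
                .messageListener((consumer, msg) -> {
                    try {
                        // process the message on the listener thread, then ack
                        consumer.acknowledge(msg);
                    } catch (PulsarClientException e) {
                        consumer.negativeAcknowledge(msg);
                    }
                })
                .subscribe();
    }
}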
I have a (Posix) server that acts as a proxy for many clients to another upstream server. Messages typically flow down from the upstream server, are matched against client interests, and are then pushed out to the subset of clients interested in that traffic (maintaining the FIFO order from the upstream server). Currently, this proxy server is single-threaded, using an event loop (e.g. select, epoll, etc.), but now I'd like to make it multithreaded so that the proxy can more fully utilize an entire machine and achieve much higher throughput.
My high-level design is to have a pool of N worker pthreads (where N is some small multiple of the number of cores on the machine), each running its own event loop. Each client connection will be assigned to a specific worker thread, which is then responsible for servicing all of that client's I/O and timeout needs for the duration of that client connection. I also intend to have a single dedicated thread that pulls messages in from the upstream server. Once a message is read in, its contents can be considered constant / unchanging until it is no longer needed and reclaimed. The workers never alter the message contents -- they just pass them along to their clients as needed.
My first question is: should the matching of client interests preferably be done by the producer thread or the worker threads?
In the former approach, for each worker thread, the producer could check the interests (e.g. group membership) of the worker's clients. If the message matched any clients, it could push the message onto a dedicated queue for that worker. This approach requires some kind of synchronization between the producer and each worker about their clients' rarely changing interests.
In the latter approach, the producer just pushes every message onto some kind of queue shared by all of the worker threads. Then each worker thread checks ALL of the messages for a match against their clients' interests and processes each message that matches. This is a twist on the usual SPMC problem where a consumer is usually assumed to unilaterally take an element for themselves, rather than all consumers needing to do some processing on every element. This approach distributes the matching work across multiple threads, which seems desirable, but I worry it may cause more contention between the threads depending on how we implement their synchronization.
In both approaches, when a message is no longer needed by any worker thread, it then needs to be reclaimed. So, some tracking needs to be done to know when no worker thread needs a message any longer.
My second question is: what is a good way of tracking whether a message is still needed by any of the worker threads?
A simple way to do this would be to assign to each message a count of how many worker threads still need to process the message when it is first produced. Then, when each worker is done processing a message it would decrement the count in a thread-safe manner and if/when the count went to zero we would know it could be reclaimed.
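That counter approach, sketched in Java for brevity (in the Posix/C version an atomic decrement and a free or pool-return would play the same roles; all names here are hypothetical):

import java.util.concurrent.atomic.AtomicInteger;

final class CountedMessage {
    final byte[] payload;
    private final AtomicInteger remaining; // workers that still need this message

    CountedMessage(byte[] payload, int interestedWorkers) {
        this.payload = payload;
        this.remaining = new AtomicInteger(interestedWorkers);
    }

    // Each worker calls this once it has finished with the message.
    void release() {
        if (remaining.decrementAndGet() == 0) {
            reclaim(); // last worker out reclaims it (free() / return to a pool in C)
        }
    }

    private void reclaim() {
        // placeholder for pool return or other cleanup
    }
}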
Another way to do this would be to assign 64-bit sequence numbers to the messages as they come in; each thread could then track and record the highest sequence number up through which it has processed. We could then reclaim all messages with sequence numbers less than or equal to the minimum processed sequence number across all of the worker threads.
The latter approach seems like it could more easily allow for a lazy reclamation process with less cross-thread synchronization. That is, you could have a "clean-up" thread that only runs periodically, computing the minimum across the worker threads, with much less inter-thread synchronization being necessary. For example, if we assume that reads and writes of a 64-bit integer are atomic and a worker's fully processed sequence number is always monotonically increasing, then the "clean-up" thread can just periodically read the workers' fully processed counts (maybe with some memory barrier) and compute the minimum.
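A sketch of that lazy, sequence-number-based reclamation (again in Java with hypothetical names; each worker only publishes a monotonically increasing counter, and a cleanup pass takes the minimum):

import java.util.concurrent.atomic.AtomicLongArray;

final class SequenceReclaimer {
    private final AtomicLongArray processedUpTo; // one slot per worker, only ever increases

    SequenceReclaimer(int workers) {
        processedUpTo = new AtomicLongArray(workers);
    }

    // Worker `id` records that it is done with every message up to `seq`.
    void markProcessed(int id, long seq) {
        processedUpTo.lazySet(id, seq); // ordered store; a stale read only delays reclamation
    }

    // Cleanup thread: every message with a sequence number <= the returned value can be reclaimed.
    long reclaimableUpTo() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < processedUpTo.length(); i++) {
            min = Math.min(min, processedUpTo.get(i));
        }
        return min;
    }
}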
Third question: what is the best way for workers to realize that they have new work to do in their queue(s)?
Each worker thread is going to be managing its own event loop of client file descriptors and timeouts. Is it best for each worker thread to just have its own pipe to which signal data can be written by the producer to poke it into action (see the sketch below)? Or should workers just periodically check their queue(s) for new work? Are there better ways to do this?
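The pipe option would look roughly like this (a sketch in Java, with a Selector standing in for epoll; in the Posix version you would register a pipe or eventfd with the worker's epoll set, and the names here are hypothetical):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class WorkerWakeup {
    private final Selector selector = Selector.open();
    private final Pipe pipe = Pipe.open();

    WorkerWakeup() throws IOException {
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);
        // ... the worker's client sockets would be registered with the same selector ...
    }

    // Producer side: poke the worker after enqueuing work for it.
    void notifyWorker() throws IOException {
        pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
    }

    // Worker side: block until client I/O or a poke arrives, then drain the pipe.
    void waitForWork() throws IOException {
        selector.select();
        ByteBuffer drain = ByteBuffer.allocate(64);
        while (pipe.source().read(drain) > 0) {
            drain.clear();
        }
        selector.selectedKeys().clear();
    }
}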
Last question: what kind of data structure and synchronization should I use for the queue(s) between the producer and the consumer?
I'm aware of lock-free data structures but I don't have a good feel for whether they'd be preferable in my situation or if I should instead just go with a simple mutex for operations that affect the queue. Also, in the shared queue approach, I'm not entirely sure how a worker thread should track "where" it is in processing the queue.
Any insights would be greatly appreciated! Thanks!
Based on your problem description, matching of client interests needs to be done for each client for each message anyway, so the work in matching is the same whichever type of thread it occurs in. That suggests the matching should be done in the client threads to improve concurrency. Synchronization overhead should not be a major issue if the "producer" thread ensures the messages are flushed to main memory (technically, "synchronize memory with respect to other threads") before their availability is made known to the other threads, as the client threads can all read the information from main memory simultaneously without synchronizing with each other. The client threads will not be able to modify messages, but they should not need to.
Message reclamation is probably better done by tracking the current message number of each thread rather than by having a message specific counter, as a message specific counter presents a concurrency bottleneck.
I don't think you need formal queueing mechanisms. The "producer" thread can simply keep a volatile variable updated which contains the number of the most recent message that has been flushed to main memory, and the client threads can check the variable when they are free to do work, sleeping if no work is available. You could get more sophisticated on the thread management, but the additional efficiency improvement would likely be minor.
I don't think you need sophisticated data structures for this. You need volatile variables for the number of the latest message that is available for processing and for the number of the most recent message that has been processed by each client thread. You need to flush the messages themselves to main memory. You need some way of finding the messages in main memory from the message number, perhaps using a circular buffer of pointers, or of messages if the messages are all of the same length. You don't really need much else with respect to the data to be communicated between the threads.
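In rough Java terms (names and sizes are placeholders, and reclamation of old slots is left out), that scheme is just:

// A circular buffer of messages indexed by sequence number, a volatile
// "latest published" counter written only by the producer, and workers that
// read everything up to that counter.
final class MessageRing {
    private static final int SIZE = 1 << 16;     // must exceed the maximum number of in-flight messages
    private final byte[][] slots = new byte[SIZE][];
    private volatile long latestPublished = -1;  // written by the producer only

    // Producer: store the message, then publish its sequence number.
    void publish(long seq, byte[] message) {
        slots[(int) (seq & (SIZE - 1))] = message;
        latestPublished = seq; // the volatile write makes the payload visible to the workers
    }

    // Workers: read this when free; everything up to the returned value is safe to process.
    long available() {
        return latestPublished; // volatile read
    }

    byte[] get(long seq) {
        return slots[(int) (seq & (SIZE - 1))];
    }
}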
I am currently working on putting together a NodeJS system that is responsible for receiving a large number of events where the order in which those events are processed is highly critical. It is also important that the application scales and can handle a Rabbit consumer falling over, so I have multiple consumers reading off a queue which is bound to a direct exchange, with 'noAck' set to false and a prefetch count of 1 for each consumer.
This is ensuring that my messages are being processed in order; however, both consumers are processing events concurrently, whereas my desired outcome is:
Consumer A          Consumer B
----------          ----------
process event 1
...
acknowledge
                    process event 2
                    ...
                    acknowledge
process event 3
...
..and so on.
I realise that this reduces the efficiency of my nodes but the guarantee that events are fully processed in order is much more critical to me.
Any help would be greatly appreciated.
This is ensuring that my messages are being processed in order
No. Whenever you have concurrent consumers, setting basic_qos=1 won't prevent other consumers from processing messages out of order.
If you need ordered processing you must have 1 consumer per queue. If you want parallelism, the best you could do is to partition the message stream into several queues, using something like the Consistent Hash Exchange. Of course messages will lose order at partitioning time, but then one consumer per queue will assure ordered processing on each queue.
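A sketch of that setup (shown with the RabbitMQ Java client for illustration; the amqplib calls in Node follow the same shape, and the broker needs the rabbitmq_consistent_hash_exchange plugin enabled):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class PartitionedSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = new ConnectionFactory().newConnection();
             Channel ch = conn.createChannel()) {

            ch.exchangeDeclare("events", "x-consistent-hash", true);

            // One queue per partition; for this exchange type the binding key is a numeric weight.
            for (int i = 0; i < 4; i++) {
                String queue = "events.partition." + i;
                ch.queueDeclare(queue, true, false, false, null);
                ch.queueBind(queue, "events", "1");
            }

            // Publish with the ordering key as the routing key: all messages with the same
            // key hash to the same queue, where a single consumer with basic_qos=1
            // processes them strictly in order.
            ch.basicPublish("events", "customer-42", null,
                    "event payload".getBytes(StandardCharsets.UTF_8));
        }
    }
}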
Let me put my question as simply as possible. Mine is network router software built in Erlang, but in a particular scenario I am observing very high memory growth as reported by the VM.
I have one process which receives a binary packet from some other process over a socket.
This process parses the binary packet and passes it to a gen_server (handle_cast is called).
The gen_server stores some information in an ETS table and sends the packet to the peer server.
When the peer server responds, the ETS entry is deleted and the gen_server responds back to the first process.
If the first process (which sent the packet to the gen_server) times out after 5 seconds waiting for the gen_server's response, it deletes the ETS entry in the gen_server and exits.
Now I am observing high memory growth when lots of events get timed out (due to unavailability of the peer server), and from what I have researched it is the "**binary**" and "**processes_used**" categories reported by erlang:memory that are using most of the memory.
The same is not true when events are processed successfully.
The memory can basically be lost in only three places:
The state of your gen_server
look at your state and find out whether there is anything big or growing in there
Your processes' mailboxes
make sure there is always a way to drain unmatched messages: for a gen_server that is the handle_info callback, and in plain receives an Any -> clause.
if the mailbox only fills up temporarily, it's probably because the receiving process is too slow for the rate at which messages are produced. This is a common problem with asynchronous communication. If it's only temporary bursts that don't break anything, this could be intended.
In this case you can either optimize the receiving process
or fix your protocol to use fewer messages
if you have multiple functions that receive messages, make sure all receiving parts are called regularly. Don't forget the Any -> clauses.
Be aware that while you are processing inside a gen_server callback no messages will be received, so if a callback takes more time than necessary, asynchronous messages might pile up (e.g. random message arrival plus fixed processing time builds an unboundedly growing queue; for details see queueing theory).
In your ETS table
maybe the information in the ETS table is not completely removed? Did you forget to remove something in certain cases?
trigger GC by hand and see what happens with memory.
[garbage_collect(Pid) || Pid <- processes()]
Most likely you're leaving processes running that have references to binaries. If a process dies, all memory related to that process will be cleaned up (including any binaries that belonged only to that process).
If you still have leaking binaries, it means you have some long running process (server, singleton etc) that keeps references to binaries, either in its process state or by non-tail recursive functions. Make sure you clean up your state once process communication times out or they die. Also, check that you don't leave references to binaries on the heap by using non-tail recursive calls.
I've got a RabbitMQ queue that might, at times, hold a considerable amount of data to process.
As far as I understand, using channel.consume will try to force the messages into the Node program, even if it is reaching its RAM limit (and it will eventually crash).
What is the best way to ensure workers get only as many tasks to process as they are capable of handling?
I'm thinking about using a chain of (transform) streams together with channel.get (which gets just one message). If the first stream's buffer is full, we simply stop getting messages.
I believe what you want is to specify the consumer prefetch.
This indicates to RabbitMQ how many unacknowledged messages it may "push" to the consumer at a time.
An example is provided here
channel.prefetch(1);
This is the lowest value you can provide, and it should ensure the lowest memory consumption for your Node program.
This is based on your description; if my understanding is correct, I'd also recommend renaming your question (parallel processing relates more to multiple consumers on a single queue, not to a single consumer getting all the messages).