Handling large amounts of arbitrarily scheduled tasks in node

Handling large amounts of arbitrarily scheduled tasks in node - node.js

Premise: I have a calendar-like system that allows the creation/deletion of 'events' at a scheduled time in the future. The end goal is to perform an action (send message/reminder) prior to & at the start of the event. I've done a bit of searching & have narrowed down to what seems to be my two most viable choices
Unix Cron Jobs
Bree
I'm not quite sure which will best suit my end goal though, and additionally, it feels like there must be some additional established ways to do things like this that I just don't have proper knowledge of, or that I'm entirely skipping over.
My questions:
If, theoretically, the system were to be handling an arbitrarily large amount of 'events', all for arbitrary times in the future, which of these options is more practical system-resource-wise? Is my concern in this regard even valid?
Is there any foreseeable problem with filling up a crontab with a large volume of jobs - or, in bree's case, scheduling a large amount of jobs?
Is there a better idea I've just completely missed so far?
This mainly stems from bree's use of node 'worker threads'. I'm very unfamiliar with this concept
and concerned that since a 'worker thread' is spawned per every job, I could very quickly tie up all of my available threads and grind... something, to a halt. This, however, sounds somewhat silly & possibly wrong(possibly indicative of my complete lack of knowledge here), & thus, my question.
Thanks, Stark.

For a calendar-like system, it seems you could query your database to find all events occuring in the next hour, then create a setTimeout() for each one of those. Then, an hour later, do the same thing again. Then, upon any server restart, do the same thing again. You don't really need to worry about events that aren't imminent. They can just sit in the database until shortly before their time. You will just need an efficient way to query the database to find events that are imminent and user a timer for them.
WorkerThreads are fairly heavy weight items in nodejs as they create a whole separate heap and a whole new instance of a V8 interpreter. You would definitely not want a separate WorkerThread for each event.
I should add that timers in nodejs are very lightweight items and it is not problem to have lots of them. They are just stored in a sorted linked list and only the insertion of a new timer takes a little bit more time (to do an insertion sort as it is added to the list) as the list gets longer. There is no continuous run-time overhead because there are lots of timers. The event loop, then just checks the first item in the linked list to see if it's time yet for the next timer to fire. If so, it removes it from the head of the list and calls its callback. If not, it goes about the rest of the event loop work items and will check the first item in the list again the next through the event loop.

Related

Conceptual approach of threads in Delphi

Over 2 years ago, Remy Lebeau gave me invaluable tips on threads in Delphi. His answers were very useful to me and I feel like I made great progress thanks to him. This post can be found here.
Today, I now face a "conceptual problem" about threads. This is not really about code, this is about the approach one should choose for a certain problem. I know we are not supposed to ask for personal opinions, I am merely asking if, on a technical point a view, one of these approach must be avoided or if they are both viable.
My application has a list of unique product numbers (named SKU) in a database. Querying an API with theses SKUS, I get back a JSON file containing details about these products. This JSON file is processed and results are displayed on screen, and saved in database. So, at one step, a download process is involved and it is executed in a worker thread.
I see two different approaches possible for this whole procedure :
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. A Tstringlist is then built and, for each element of the list, a thread is launched, downloads the JSON, sends back the result to the main thread and terminates.
This can be pictured like this :
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. Instead of sending SKU numbers one after another to the worker thread, the whole list is sent, and the worker thread iterates through the list, sending back results for displaying and saving to the main thread (via a synchronize event). So we only have one worker thread working the whole list before terminating.
This can be pictured like this :
I have coded these two different approaches and they both work... with each their downsides that I have experienced.
I am not a professional developer, this is a hobby and, before working my way further down a path or another for "polishing", I would like to know if, on a technical point of view and according to your knowledge and experience, one of the approaches I depicted should be avoided and why.
Thanks for your time
Mathias

Another thing to consider in this case is latency to your API that is producing the JSON. For example, if it takes 30 msec to go back and forth to the server, and 0.01 msec to create the JSON on the server, then querying a single JSON record per request, even if each request is in a different thread, does not make much sense. In that case, it would make sense to do fewer requests to the server, returning more data on each request, and partition the results up among different threads.
The other thing is that threads are not a solution to every problem. I would question why you need to break each sku into a single thread. how long is each individual thread running and how much processing is each thread doing? In general, creating lots of threads, for each thread to work for a fraction of a msec does not make sense. You want the threads to be alive for as long as possible, processing as much data as they can for the job. You don't want the computer to be using as much time creating/destroying threads as actually doing useful work.

Designing concurrency in a Python program

I'm designing a large-scale project, and I think I see a way I could drastically improve performance by taking advantage of multiple cores. However, I have zero experience with multiprocessing, and I'm a little concerned that my ideas might not be good ones.
Idea
The program is a video game that procedurally generates massive amounts of content. Since there's far too much to generate all at once, the program instead tries to generate what it needs as or slightly before it needs it, and expends a large amount of effort trying to predict what it will need in the near future and how near that future is. The entire program, therefore, is built around a task scheduler, which gets passed function objects with bits of metadata attached to help determine what order they should be processed in and calls them in that order.
Motivation
It seems to be like it ought to be easy to make these functions execute concurrently in their own processes. But looking at the documentation for the multiprocessing modules makes me reconsider- there doesn't seem to be any simple way to share large data structures between threads. I can't help but imagine this is intentional.
Questions
So I suppose the fundamental questions I need to know the answers to are thus:
Is there any practical way to allow multiple threads to access the same list/dict/etc... for both reading and writing at the same time? Can I just launch multiple instances of my star generator, give it access to the dict that holds all the stars, and have new objects appear to just pop into existence in the dict from the perspective of other threads (that is, I wouldn't have to explicitly grab the star from the process that made it; I'd just pull it out of the dict as if the main thread had put it there itself).
If not, is there any practical way to allow multiple threads to read the same data structure at the same time, but feed their resultant data back to a main thread to be rolled into that same data structure safely?
Would this design work even if I ensured that no two concurrent functions tried to access the same data structure at the same time, either for reading or for writing?
Can data structures be inherently shared between processes at all, or do I always explicitly have to send data from one process to another as I would with processes communicating over a TCP stream? I know there are objects that abstract away that sort of thing, but I'm asking if it can be done away with entirely; have the object each thread is looking at actually be the same block of memory.
How flexible are the objects that the modules provide to abstract away the communication between processes? Can I use them as a drop-in replacement for data structures used in existing code and not notice any differences? If I do such a thing, would it cause an unmanageable amount of overhead?
Sorry for my naivete, but I don't have a formal computer science education (at least, not yet) and I've never worked with concurrent systems before. Is the idea I'm trying to implement here even remotely practical, or would any solution that allows me to transparently execute arbitrary functions concurrently cause so much overhead that I'd be better off doing everything in one thread?
Example
For maximum clarity, here's an example of how I imagine the system would work:
The UI module has been instructed by the player to move the view over to a certain area of space. It informs the content management module of this, and asks it to make sure that all of the stars the player can currently click on are fully generated and ready to be clicked on.
The content management module checks and sees that a couple of the stars the UI is saying the player could potentially try to interact with have not, in fact, had the details that would show upon click generated yet. It produces a number of Task objects containing the methods of those stars that, when called, will generate the necessary data. It also adds some metadata to these task objects, assuming (possibly based on further information collected from the UI module) that it will be 0.1 seconds before the player tries to click anything, and that stars whose icons are closest to the cursor have the greatest chance of being clicked on and should therefore be requested for a time slightly sooner than the stars further from the cursor. It then adds these objects to the scheduler queue.
The scheduler quickly sorts its queue by how soon each task needs to be done, then pops the first task object off the queue, makes a new process from the function it contains, and then thinks no more about that process, instead just popping another task off the queue and stuffing it into a process too, then the next one, then the next one...
Meanwhile, the new process executes, stores the data it generates on the star object it is a method of, and terminates when it gets to the return statement.
The UI then registers that the player has indeed clicked on a star now, and looks up the data it needs to display on the star object whose representative sprite has been clicked. If the data is there, it displays it; if it isn't, the UI displays a message asking the player to wait and continues repeatedly trying to access the necessary attributes of the star object until it succeeds.

Even though your problem seems very complicated, there is a very easy solution. You can hide away all the complicated stuff of sharing you objects across processes using a proxy.
The basic idea is that you create some manager that manages all your objects that should be shared across processes. This manager then creates its own process where it waits that some other process instructs it to change the object. But enough said. It looks like this:
import multiprocessing as m
manager = m.Manager()
starsdict = manager.dict()
process = Process(target=yourfunction, args=(starsdict,))
process.run()
The object stored in starsdict is not the real dict. instead it sends all changes and requests, you do with it, to its manager. This is called a "proxy", it has almost exactly the same API as the object it mimics. These proxies are pickleable, so you can pass as arguments to functions in new processes (like shown above) or send them through queues.
You can read more about this in the documentation.
I don't know how proxies react if two processes are accessing them simultaneously. Since they're made for parallelism I guess they should be safe, even though I heard they're not. It would be best if you test this yourself or look for it in the documentation.

Threading the right choice?

Im new to threads, therefore im not sure if threads are the right way to approach this.
My program needs to perform a calculation a couple of times, same logik behind it, but with different parameters. The longer the calculation, the closer it will be to the perfect answer. The calculation duration cant be measured beforehanded (from a few seconds to a couple of minutes)
The user wants to have the results in an order (from calculation 1 to X) at certain times. He is satisfied with not the perfect solution as long as it he gets a result. Once he has a solution, he is not interested in the one before (example: he has a not perfect answer from calculation 1 and demands now answer from calculation 2; even if there is a better answer now for calculation 1, he is not interested in it)
Is threading the right way to do this?

Threading sounds like a good approach for this, as you can perform your long-running computation on a background thread while keeping your UI responsive.
In order to satisfy your requirement of having results in an order, you may need a way of stopping threads that are no longer needed. Either abort them (may be extreme), or just signal them to stop and/or return the current result.
Note you may want the threads to periodically check back in with the UI to report progress (% complete), check for any abort requests, etc. Although this depends entirely upon your application and is not necessarily required.

How to use node.js for a queue processing app

What are the best practices when using node.js for a queue processing application?

My main concern there would be that Node processes can handle thousands of items at once, but that a rogue unhandled error in any of them could bring down the whole process.
I'd be looking for a queue/driver combination that allowed a two-phase commit (wrong terminology I think?), i.e:
Get the next appropriate item from the queue (which then blocks that item from being consumed elsewhere)
Once each item is handed over to the downstream service/database/filesystem you can then tell the queue that the item has been processed
I'd also want repeatably unique identifiers so that you can reliably detect if an item comes down the pipe twice. In a theoretical system it might not happen, but in a practical environment the capability to deal with it will make your life easier.

check out http://learnboost.github.com/kue/ i have used it for a couple of pet projects and works quite good, you can look at their source and check what practices they have take care of

Returning LOTS of items from a MongoDB via Node.js

I'm returning A LOT (500k+) documents from a MongoDB collection in Node.js. It's not for display on a website, but rather for data some number crunching. If I grab ALL of those documents, the system freezes. Is there a better way to grab it all?
I'm thinking pagination might work?
Edit: This is already outside the main node.js server event loop, so "the system freezes" does not mean "incoming requests are not being processed"

After learning more about your situation, I have some ideas:
Do as much as you can in a Map/Reduce function in Mongo - perhaps if you throw less data at Node that might be the solution.
Perhaps this much data is eating all your memory on your system. Your "freeze" could be V8 stopping the system to do a garbage collection (see this SO question). You could Use V8 flag --trace-gc to log GCs & prove this hypothesis. (thanks to another SO answer about V8 and Garbage collection
Pagination, like you suggested may help. Perhaps even splitting up your data even further into worker queues (create one worker task with references to records 1-10, another with references to records 11-20, etc). Depending on your calculation
Perhaps pre-processing your data - ie: somehow returning much smaller data for each record. Or not using an ORM for this particular calculation, if you're using one now. Making sure each record has only the data you need in it means less data to transfer and less memory your app needs.

I would put your big fetch+process task on a worker queue, background process, or forking mechanism (there are a lot of different options here).
That way you do your calculations outside of your main event loop and keep that free to process other requests. While you should be doing your Mongo lookup in a callback, the calculations themselves may take up time, thus "freezing" node - you're not giving it a break to process other requests.

Since you don't need them all at the same time (that's what I've deduced from you asking about pagination), perhaps it's better to separate those 500k stuff into smaller chunks to be processed at the nextTick?
You could also use something like Kue to queue the chunks and process them later (thus not everything in the same time).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string