Threading the right choice? - multithreading

Im new to threads, therefore im not sure if threads are the right way to approach this.
My program needs to perform a calculation a couple of times, same logik behind it, but with different parameters. The longer the calculation, the closer it will be to the perfect answer. The calculation duration cant be measured beforehanded (from a few seconds to a couple of minutes)
The user wants to have the results in an order (from calculation 1 to X) at certain times. He is satisfied with not the perfect solution as long as it he gets a result. Once he has a solution, he is not interested in the one before (example: he has a not perfect answer from calculation 1 and demands now answer from calculation 2; even if there is a better answer now for calculation 1, he is not interested in it)
Is threading the right way to do this?

Threading sounds like a good approach for this, as you can perform your long-running computation on a background thread while keeping your UI responsive.
In order to satisfy your requirement of having results in an order, you may need a way of stopping threads that are no longer needed. Either abort them (may be extreme), or just signal them to stop and/or return the current result.
Note you may want the threads to periodically check back in with the UI to report progress (% complete), check for any abort requests, etc. Although this depends entirely upon your application and is not necessarily required.

Related

Handling large amounts of arbitrarily scheduled tasks in node

Premise: I have a calendar-like system that allows the creation/deletion of 'events' at a scheduled time in the future. The end goal is to perform an action (send message/reminder) prior to & at the start of the event. I've done a bit of searching & have narrowed down to what seems to be my two most viable choices
Unix Cron Jobs
Bree
I'm not quite sure which will best suit my end goal though, and additionally, it feels like there must be some additional established ways to do things like this that I just don't have proper knowledge of, or that I'm entirely skipping over.
My questions:
If, theoretically, the system were to be handling an arbitrarily large amount of 'events', all for arbitrary times in the future, which of these options is more practical system-resource-wise? Is my concern in this regard even valid?
Is there any foreseeable problem with filling up a crontab with a large volume of jobs - or, in bree's case, scheduling a large amount of jobs?
Is there a better idea I've just completely missed so far?
This mainly stems from bree's use of node 'worker threads'. I'm very unfamiliar with this concept
and concerned that since a 'worker thread' is spawned per every job, I could very quickly tie up all of my available threads and grind... something, to a halt. This, however, sounds somewhat silly & possibly wrong(possibly indicative of my complete lack of knowledge here), & thus, my question.
Thanks, Stark.
For a calendar-like system, it seems you could query your database to find all events occuring in the next hour, then create a setTimeout() for each one of those. Then, an hour later, do the same thing again. Then, upon any server restart, do the same thing again. You don't really need to worry about events that aren't imminent. They can just sit in the database until shortly before their time. You will just need an efficient way to query the database to find events that are imminent and user a timer for them.
WorkerThreads are fairly heavy weight items in nodejs as they create a whole separate heap and a whole new instance of a V8 interpreter. You would definitely not want a separate WorkerThread for each event.
I should add that timers in nodejs are very lightweight items and it is not problem to have lots of them. They are just stored in a sorted linked list and only the insertion of a new timer takes a little bit more time (to do an insertion sort as it is added to the list) as the list gets longer. There is no continuous run-time overhead because there are lots of timers. The event loop, then just checks the first item in the linked list to see if it's time yet for the next timer to fire. If so, it removes it from the head of the list and calls its callback. If not, it goes about the rest of the event loop work items and will check the first item in the list again the next through the event loop.

Conceptual approach of threads in Delphi

Over 2 years ago, Remy Lebeau gave me invaluable tips on threads in Delphi. His answers were very useful to me and I feel like I made great progress thanks to him. This post can be found here.
Today, I now face a "conceptual problem" about threads. This is not really about code, this is about the approach one should choose for a certain problem. I know we are not supposed to ask for personal opinions, I am merely asking if, on a technical point a view, one of these approach must be avoided or if they are both viable.
My application has a list of unique product numbers (named SKU) in a database. Querying an API with theses SKUS, I get back a JSON file containing details about these products. This JSON file is processed and results are displayed on screen, and saved in database. So, at one step, a download process is involved and it is executed in a worker thread.
I see two different approaches possible for this whole procedure :
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. A Tstringlist is then built and, for each element of the list, a thread is launched, downloads the JSON, sends back the result to the main thread and terminates.
This can be pictured like this :
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. Instead of sending SKU numbers one after another to the worker thread, the whole list is sent, and the worker thread iterates through the list, sending back results for displaying and saving to the main thread (via a synchronize event). So we only have one worker thread working the whole list before terminating.
This can be pictured like this :
I have coded these two different approaches and they both work... with each their downsides that I have experienced.
I am not a professional developer, this is a hobby and, before working my way further down a path or another for "polishing", I would like to know if, on a technical point of view and according to your knowledge and experience, one of the approaches I depicted should be avoided and why.
Thanks for your time
Mathias
Another thing to consider in this case is latency to your API that is producing the JSON. For example, if it takes 30 msec to go back and forth to the server, and 0.01 msec to create the JSON on the server, then querying a single JSON record per request, even if each request is in a different thread, does not make much sense. In that case, it would make sense to do fewer requests to the server, returning more data on each request, and partition the results up among different threads.
The other thing is that threads are not a solution to every problem. I would question why you need to break each sku into a single thread. how long is each individual thread running and how much processing is each thread doing? In general, creating lots of threads, for each thread to work for a fraction of a msec does not make sense. You want the threads to be alive for as long as possible, processing as much data as they can for the job. You don't want the computer to be using as much time creating/destroying threads as actually doing useful work.

Explain surprising number of timers and finding their origin

I have a CPU leak somewhere and am trying to find out the origin (the RAM increases a bit too but not as fast).
I have collected some data through dotnet-counters and found out that the number of timers in the given process keeps increasing. For instance, when 46 % of the CPU was used, it was reporting around 450 timers (see following data). Note that this count only increases, slowly but surely like my CPU usage. And also note that it's pretty much at a time where it was idle, not having customers online.
This leads to 2 questions:
1) Is it normal? Should I continue on this track? I find it odd to have this many timers.
2) In my own code, I never user the Timer class, I use the Stopwatch class once but that's it.
So what would be the best way to find the class/library that uses those timers?
1) There was indeed something fishy and it is definitely not normal.
I had a class spawning hundreds of threads and keeping them alive to listen to incoming calls.
Long story short, that many timers is not something normal.
2) The Timer was coming from somewhere deep in a library. I couldn't find an easy way to find them.
Best option is to use dotnet-trace collect. https://www.speedscope.app/ helped me a lot and it is possible to generate de traces for speedscope directly from dotnet-trace.

Estimating WCET of a task on Linux

I want to approximate the Worst Case Execution Time (WCET) for a set of tasks on linux. Most professional tools are either expensive (1000s $), or don't support my processor architecture.
Since, I don't need a tight bound, my line of thought is that I :
disable frequency scaling
disbale unnecesary background services and tasks
set the program affinity to run on a specified core
run the program for 50,000 times with various inputs
Profiling it and storing the total number of cycles it had completed to
execute.
Given the largest clock cycle count and knowing the core frequency, I can get an estimate
Is this is a sound Practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40) tasks in parallel with each randomly assigned a core and do the same thing for 50,000 times.
Once I get the estimate, a 10% safe margin will be added to account for unforseeble interference and untested path. This 10% margin has been suggested in the paper "Approximation of Worst Case Execution time in Preepmtive Multitasking Systems" by Corti, Brega and Gross
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points that show that in a few cases some runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there wasn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.

Progress bar and multiple threads, decoupling GUI and logic - which design pattern would be the best?

I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU intensive calculations) and graph operations (drawing, updating) are done in seperate threads.
Graph can be scrolled - in this case new data portions need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per serie, one for dataset update and one for graph update).
I don't want to create multiple progress bars. Instead, I'd like to have single progress bar that inform about global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in a right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data and B) 2 of the steps scaled extremely well with increasing number of cpus, 2 steps ran in 2 threads and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of cpus. Based on run-time data from my own 4 core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now given this information, you can effectively know what percentage of the total execution time each step is supposed to take. Now if you say step1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items, you could probably do:
for (int i = 0; i < numItems; i++){
if (i % (numItems / percentageOfTotalForThisStep) == 0) emitProgressEvent();
.. do the actual processing ..
}
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based, if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need to have some way of updating its owning Operation of its current progress in whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever mechanism your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
Obviously there's a ton of other considerations/work that needs be done in actually implementing this, but I hope this gives you the gist of it.
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions I first try to look at similar or analogous problems in other application, and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.

Resources