libevent / epoll number of worker threads? - multithreading

I am following this example. Line#37 says that number of worker threads should be equal of number of cpu cores. Why is that so?
If there are 10k connections and my system has 8 cores, does that mean 8 worker threads will be processing 10k connections? Why shouldn't I increase this number?

Context Switching
For an OS to context switch between threads takes a little bit of time. Having a lot of threads, each one doing comparatively little work, means that the context switch time starts becoming a significant portion of the overall runtime of the application.
For example, it could take an OS about 10 microseconds to do a context switch; if the thread does only 15 microseconds worth of work before going back to sleep then 40% of the runtime is just context switching!
This is inefficient, and that sort of inefficiency really starts to show up when you're up-scaling as your hardware, power and cooling costs go through the roof. Having few threads means that the OS doesn't have to switch contexts anything like as much.
So in your case if your requirement is for the computer to handle 10,000 connections and you have 8 cores then the efficiency sweet spot will be 1250 connections per core.
More Clients Per Thread
In the case of a server handling client requests it comes down to how much work is involved in processing each client. If that is a small amount of work, then each thread needs to handle requests from a number of clients so that the application can handle a lot of clients without having a lot of threads.
In a network server this means getting familiar with the the select() or epoll() system call. When called these will both put the thread to sleep until one of the mentioned file descriptors becomes ready in some way. However if there's no other threads pestering the OS for runtime the OS won't necessarily need to perform a context switch; the thread can just sit there dozing until there's something to do (at least that's my understanding of what OSes do. Everyone, correct me if I'm wrong!). When some data turns up from a client it can resume a lot faster.
And this of course makes the thread's source code a lot more complicated. You can't do a blocking read of data from the clients for instance; being told by epoll() that a file descriptor has become ready for reading does not mean that all the data you're expecting to receive from the client can be read immediately. And if the thread stalls due to a bug that affects more than one client. But that's the price paid for attaining the highest possible efficiency.
And it's not necessarily the case that you would want just 8 threads to go with your 8 cores and 10,000 connections. If there's something that your thread has to do for each connection every time it handles a single connection then that's an overhead that would need to be minimised (by having more threads and fewer connections per thread). [The select() system call is like that, which is why epoll() got invented.] You have to balance that overhead up against the overhead of context switching.
10,000 file descriptors is a lot (too many?) for a single process in Linux, so you might have to have several processes instead of several threads. And then there's the small matter of whether the hardware is fundamentally able to support 10,000 within whatever response time / connection requirements your system has. If it doesn't then you're forced to distribute your application across two or more servers, and that can start getting really complicated!
Understanding exactly how many clients to handle per thread depends on what the processing is doing, whether there's harddisk activity involved, etc. So there's no one single answer; it's different for different applications, and also for the same application on different machines. Tuning the clients / thread to achieve peak efficiency is a really hard job. This is where profiling tools like dtrace on Solaris, ftrace on Linux, (especially when used like this, which I've used a lot on Linux on x86 hardware) etc. can help because they allow you to understand at a very fine scale precisely what runtime is involved in your thread handling a request from a client.
Outfits like Google are of course very keen on efficiency; they get through a lot of electricity. I gather that when Google choose a CPU, hard drive, memory, etc. to put into their famously home grown servers they measure performance in terms of "Searches per Watt". Obviously you have to be a pretty big outfit before you get that fastidious about things, but that's the way things go ultimately.
Other Efficiencies
Handling things like TCP network connections can take up a lot of CPU time in it's own right, and it can be difficult to understand whereabouts in a system all your CPU runtime has gone. For network connections things like TCP offload in the smarter NICs can have a real benefit, because that frees the CPU from the burden of doing things like the error correction calculations.
TCP offload mirrors what we do in the high speed large scale real time embedded signal processing world. The (weird) interconnects that we use require zero CPU time to run them. So all of the CPU time is dedicated to processing data, and specialised hardware looks after moving data around. That brings about some quite astonishing efficiencies, so one can build a system with more modest, lower cost, less power hungry CPUs.
Language can have a radical effect on efficiency too; Things like Ruby, PHP, Perl are all very well and good, but everyone who has used them initially but has then grown rapidly ended up going to something more efficient like Java/Scala, C++, etc.

Your question is even better than you think! :-P
If you do networking with libevent, it can do non-blocking I/O on sockets. One thread could do this (using one core), but that would under-utilize the CPU.
But if you do “heavy” file I/O, then there is no non-blocking interface to the kernel. (Many systems have nothing to do that at all, others have some half-baked stuff going on in that field, but non-portable. –Libevent doesn’t use that.) – If file I/O is bottle-necking your program/test, then a higher number of threads will make sense! If a hard-disk is used, and the i/o-scheduler is reordering requests to avoid disk-head-moves, etc. it will depend on how much requests the scheduler takes into account to do its job the best. 100 pending requests might work better then 8.
Why shouldn't you increase the thread number?
If non-blocking I/O is done: all cores are working with thread-count = core-count. More threads only means more thread-switching with no gain.
For blocking I/O: you should increase it!

Related

Java Threads and number of Cores

Is it recommended that the number of threads in a java application should be less than the number of cpu cores?
If so why is this the case and what are the implications of using threads greater than the number of cpu cores ?
You will probably not get any definitive answer on the question of knowing, generally speaking, how many threads an app should have, in relation to the number of core(s) the underlying computer has.
One may also argue that, at the time of PaaS software design and/or elastic clusters, the notion of a fixed number of cores for any given process might be overrated.
Still, the first part of your question :
Is it recommended that the number of threads in a java application should be less than the number of cpu cores?
This has a definitive answer, which is a "no" (once more : as a general rule). And the reason why, shortly, it that all created threads are not typically running (and maybe more importantly runable) at once, meaning there is an opportunity to optimize here.
As a support to this discussion, I'll oppose two ways of creating apps, you could call it "classical" versus "reactive", although this is not a generally acceptable division. Yet, let's use this as a support.
Classical application design
I label as classical applications that rely mostly on "blocking" calls and/or "thread per request" pattern. Consider the traditional way I/Os are done (socket communication like HTTP or Database connection, hard drive based file reading, ...) : your app thread calls some kind of read or write method, which usually triggers an OS level call, that blocks your app thread, fills some device buffer at the OS level (say, read from a disk). Once the buffer has received enough data, the OS signals your java app and thread to resume activity, and the read method returns with the data from the buffer.
The whole time the OS is working (usually just a tiny fraction of a second, but still some large amount of time compared to your typical GHz CPU speed), your Java thread is in state BLOCKED_WAITING, waiting for the OS to signal it can resume. This happens all the time. A code profiler tool, like JProfiler, or YourKit, can help you measure this time. If you do so, you'll notice that in many apps doing I/O, this is a significant part of the so-called "wall time" or "clock time" that is spent... waiting.
So we have one thread waiting, meaning it is not using any CPU time. It can be scheduled out, and the OS is free to give CPU time to anybody else.
Suppose this is a one core CPU, then NOW would be a good time to have another thread to feed the CPU. Meaning having two or more threads could be a good design to maximize CPU usage even on a single core CPU, and get the most out of your hardware.
Most "classical" web applications are typically subject to this type of CPU underuse if you follow the rule of "one thread per CPU core", because Socket communications (or more typically : the time spent waiting for a response to your SQL queries) will incur so much blocking.
If you raise the number of threads your app has, then even if one or two long running requests remain waiting, other faster requests will have runnable threads to run them, and you'll get better CPU usage, and better performance (number of concurrent requests). That is... untill something else reaches saturation (too many heavy requests on your DB, too many simultaneous hard drive reads/writes...)
Reactive app design
Recognizing this typical behavior of apps, and using different sets of OS features, some application frameworks now use non blocking patterns (even for I/O) to mitigate the above issues. Examples in the Java ecosystem are NIO based networking stacks like Netty, or actor pattern implementations like Akka.
In a typical "reactive" app, one usually abandons the "thread per request" pattern that we have in classical apps (meaning one thread is responsible for handling everything from start to finish of a given user request, and waiting when need be for external resources to become available), in favor of a vastly more modular, and non-blocking approach.
Threads are given more technical-grained bits of work to do, and each thread will hand-off work to one another and callbacks to hear back when work they depend upon is done. This "handing of" of units of work means each thread can quickly grab new units of work it is able to handle. Meaning one of two things : you achieve higher CPU usage with far fewer threads in your app (because each can grab work more efficiently, instead of just sitting "waiting") ; or you can instantiate many many more threads because they'll mostly be waiting (not saturating the CPUs), and the dynamical hand-off will still allow for good CPU usage.
Conclusion
Anyway, you don't design the number of threads solely based on the number of available cores. The nature of your implementation and work dictates the number of optimal threads to create.
On a classical app-design philosophy, the two numbers are more closely related than on a reactive one, but still, we have different profiles :
a very simple server app can accomodate many more threads than CPU cores, because it will allow for better throughput (the limit being, say, the output network bandwidth).
a SQL heavy app, should be scalled to the point where your app server will saturate the SQL backend. As your app server will be mostly waiting for your SQL server, then this is the limit
a mixed application consisting of some SQL heavy work, and some lightweight work, will need precision tuning, because you don't want the stuck threads (those blocked waiting for the DB) starving the light requests that would be served more rapidly
a compute intensive program (say, a cryptography service) will probably benefit from a number of threads close to the number CPU cores (if your algorithm is implemented in a classical way), because creating more threads than you are able to run is pointless. In an actor based implementation, creating more threads could actually be a win.

About multithreading download disadvantages

I have a question about multithreading download, as you know downloading using several threads improve performance of application, however there are some measures to respect: like the number of threads, the available bandwidth and some more, but I don't really understand, why the performance of application might be degraded by using many threads for example, or how can the bandwidth,quality of server affect the performance of multithreaded application? , what are the cases in which monothread download is faster than multithread?
Thanks for your replies.
I assume you're referring to download managers.
First, I'm sceptical of how much "performance" benefit a download manager really provides. But more importantly, any benefit they do provide is not due to multi-threading. The performance constraint of a download is the bandwidth of the connection. And this is why I'm sceptical of the benefits:
A 1 Mbps connection will download at 1 Mbps.
Splitting the file into 4 segments means you download each segment at 256 Kbps and 4 * 256 Kbps = 1 Mbps.
You may get some improvement if a server throttles each download segment.
You may get a small benefit if one of the segments gets timed out: the others downloading mean your connection doesn't sit idle during the time-out wait.
You might also speed up a download by 'drowning out' anything else trying to use the connection. (Not that I'd really call this a benefit though.)
The real benefit of a download manager is in automatically restarting downloads efficiently (i.e. not re-starting from scratch if possible).
So what is the point of multi-threading?
Let's first dispel a myth: Multi-threading does not speed anything up. If a routine requires X clock-cycles to run: it will take X clock-cycles; whether on 1 thread or many threads.
What multi-threading does do: it allows tasks to run concurrently (at the same time).
The ability to do different things at the same time means:
A slow task (combining various segments of a large download) can be done on a different thread without interfering with other threads that need to react quickly (such as the user interface).
Concurrent tasks can also use more available resources (multiple CPUs) more efficiently. Note (in answer to the last part of your question) if you only have one CPU then your threads are "time-sliced" by the operating system so it's not truly concurrent. But the time slices are very small, so previous benefit still applies.
When is single-threaded faster than multi-threaded?
Well, pretty much always in cases where CPU is not the bottle-neck. In the case of download: As mentioned before, the bottle-neck is the bandwidth between the two end-points of the connection. Many threads actually means you have to do more work (managing and coordinating the different threads).
The most efficient approach for download is 2 threads: one for the UI, and the other for the download so that any pauses/dealys don't stall the user interface.
However, more generally even when you have CPU intensive work that could theoretically benefit from multiple threads doing different work concurrently, it's very easy to make mistakes in implementation that actually slow down your application.
Ideally your multiple tasks should not share data. Because if they do, then you risk race-condition or concurrency bugs.
When they do have to share data, you need to synchronise the work in some way to avoid the above mentioned bugs. (There are many techniques to choose from depending on your needs and I won't go into detail here.)
However if your synchronisation is poorly planned you risk introducing a number of problems that can significantly slow down your application. These include:
Bottle-necking through a shared resource to make your multiple threads unable to run concurrently in any case.
High lock contention where task spend more time waiting than working.
Even deadlocking which can totally block some tasks.
First of all, what Multi-Threading download does, it creates multiple threads, which download the file from different starting positions, which tries to utilize maximum power of your internet connection. (This will download fast in multi-core processor case, which is explained below).
We might feel threads running parallel, but actually, they are supposed to run turn by turn. For example thread t1 runs for 0.25 sec, then thread t2 runs for 0.241 sec, and then t1,.., t2,, t1. So, they share CPU Bursts.
So why Multi-threads improve performance?
Answer: If you have multi-core processor, threads can be managed to run parallel using multiple processors, which improves performance. (This is how download accelerators like IDMan do the magic!).
Why Multi-threads degrade performance?
Answer: If you have single-core processor, all threads will run turn by turn, as a single process, sharing CPU Bursts. In this case, monothread download is faster than multithread, because if you use multi-threads, some time is wasted while switching from one thread to another. (Download accelerators like IDMan will not really download fast in this case despite multi-threaded download).
See this picture having dual core processor, and how threads are managed.
I hope this helps! :-)

Would handling each TCP connection in a separate thread improve latency?

I have an FTP server, implemented on top of QTcpServer and QTcpSocket.
I take advantage of the signals and slots mechanism to support multiple TCP connections simultaneously, even though I have a single thread. My code returns as soon as possible to the event loop, it doesn't block (no wait functions), and it doesn't use nested event loops anywhere. That way I already have cooperative multitasking, like Win3.1 applications had.
But a lot of other FTP servers are multithreaded. Now I'm wondering if using a separate thread for handling each TCP connection would improve performance, and especially latency.
On one hand, threads add to latency because you need to start a new thread for each new connection, but on the other, with my cooperative multitasking, other TCP connections have to wait until I've returned to the main loop before their readyRead()/bytesWritten() signals can be handled.
In your current system and ignoring file I/O time one processor is always doing something useful if there's something useful to be done, and waiting ready-to-go if there's nothing useful to be done. If this were a single processor (single core) system you would have maximized throughput. This is often a very good design -- particularly for an FTP server where you don't usually have a human waiting on a packet-by-packet basis.
You have also minimized average latency (for a single processor system.) What you do not have is consistent latency. Measuring your system's performance is likely to show a lot of jitter -- a lot of variation in the time it takes to handle a packet. Again because this is FTP and not real-time process control or human interaction, jitter may not be a problem.
Now, however consider that there is probably more than one processor available on your system and that it may be possible to overlap I/O time and processing time.
To take full advantage of a multi-processor(core) system you need some concurrency.
This normally translates to using multiple threads, but it may be possible to achieve concurrency via asynchronous (non-blocking) file reads and writes.
However, adding multiple threads to a program opens up a huge can-of-worms.
If you do decide to go the MT route, I'd suggest that you consider depending on a thread-aware I/O library. QT may provide that for you (I'm not sure.) If not, take a look at boost::asio (or ACE for an older, but still solid solution). You'll discover that using the MT capabilities of such a library involves a considerable investment in learning time; however as it turns out the time to add on multithreading "by-hand" and get it right is even worse.
So I'd say stay with your existing solution unless you are worried about unused Processor cycles and/or jitter in which case start learning QT's multithreading support or boost::asio.
Do you need to start a new thread for each new connection? Could you not just have a pool of threads that acts on requests as and when they arrive. This should reduce some of the latency. I have to say that in general a multi-threaded FTP server should be more responsive that a single-threaded one. Is it possible to have an event based FTP server?

Why is threading used for sockets?

Ever since I discovered sockets, I've been using the nonblocking variants, since I didn't want to bother with learning about threading. Since then I've gathered a lot more experience with threading, and I'm starting to ask myself.. Why would you ever use it for sockets?
A big premise of threading seems to be that they only make sense if they get to work on their own set of data. Once you have two threads working on the same set of data, you will have situations such as:
if(!hashmap.hasKey("bar"))
{
dostuff // <-- meanwhile another thread inserts "bar" into hashmap
hashmap[bar] = "foo"; // <-- our premise that the key didn't exist
// (likely to avoid overwriting something) is now invalid
}
Now imagine hashmap to map remote IPs to passwords. You can see where I'm going. I mean, sure, the likelihood of such thread-interaction going wrong is pretty small, but it's still existent, and to keep one's program secure, you have to account for every eventuality. This will significantly increase the effort going into design, as compared to simple, single-threaded workflow.
I can completely see how threading is great for working on separate sets of data, or for programs that are explicitly optimized to use threading. But for the "general" case, where the programmer is only concerned with shipping a working and secure program, I can not find any reason to use threading over polling.
But seeing as the "separate thread" approach is extremely widespread, maybe I'm overlooking something. Enlighten me! :)
There are two common reasons for using threads with sockets, one good and one not-so-good:
The good reason: Because your computer has more than one CPU core, and you want to make use of the additional cores. A single-threaded program can only use a single core, so with a heavy workload you'd have one core pinned at 100%, and the other cores sitting unused and going to waste.
The not-so-good reason: You want to use blocking I/O to simplify your program's logic -- in particular, you want to avoid dealing with partial reads and partial writes, and keep each socket's context/state on the stack of the thread it's associated with. But you also want to be able to handle multiple clients at once, without slow client A causing an I/O call to block and hold off the handling of fast client B.
The reason the second reason is not-so-good is that while having one thread per socket seems to simplify the program's design, in practice it usually complicates it. It introduces the possibility of race conditions and deadlocks, and makes it difficult to safely access shared data (as you mentioned). Worse, if you stick with blocking I/O, it becomes very difficult to shut the program down cleanly (or in any other way effect a thread's behavior from anywhere other than the thread's socket), because the thread is typically blocked in an I/O call (possibly indefinitely) with no reliable way to wake it up. (Signals don't work reliably in multithreaded programs, and going back to non-blocking I/O means you lose the simplified program structure you were hoping for)
In short, I agree with cib -- multithreaded servers can be problematic and therefore should generally be avoided unless you absolutely need to make use of multiple cores -- and even then it might be better to use multiple processes rather than multiple threads, for safety's sake.
The biggest advantage of threads is to prevent the accumulated lag time from processing requests. When polling you use a loop to service every socket with a state change. For a handful of clients, this is not very noticeable, however it could lead to significant delays when dealing with significantly large number of clients.
Assuming that each transaction requires some pre-processing and post processing (depending on the protocol this may be trivial amount of processing, or it could be relatively significant as is the case with BEEP or SOAP). The combined time to pre-process/post-process requests could lead to a backlog of pending requests.
For illustration purposes imagine that the pre-processing, processing, and post-processing stage of a request each consumes 1 microsecond so that the total request takes 3 microseconds to complete. In a single threaded environment the system would become overwhelmed if incoming requests exceed 334 requests per second (since it would take 1.002 seconds to service all requests received within a 1 second period of time) leading to a time deficit of 0.002 seconds each second. However if the system were using threads, then it would be theoretically possible to only require 0.336 seconds * (0.334 for shared data access + 0.001 pre-processing + 0.001 post processing) of processing time to complete all of the requests received in a 1 second time period.
Although theoretically possible to process all requests in 0.336 seconds, this would require each request to have it's own thread. More reasonably would be to multiple the combined pre/post processing time (0.668 seconds) by the number of requests and divide by the number of configured threads. For example, using the same 334 incoming requests and processing time, theoritically 2 threads would complete all requests in 0.668 seconds (0.668 / 2 + 0.334), 4 threads in 0.501 seconds, and 8 threads in 0.418 seconds.
If the highest request volume your daemon receives is relatively low, then a single threaded implementation with non-blocking I/O is sufficient, however if you expect occasionally bursts of high volume of requests then it is worth considering a multi-threaded model.
I've written more than a handful of UNIX daemons which have relatively low throughput and I've used a single-threaded for the simplicity. However, when I wrote a custom netflow receiver for an ISP, I used a threaded model for the daemon and it was able to handle peak times of Internet usage with minimal bumps in system load average.

Multithreading in .NET 4.0 and performance

I've been toying around with the Parallel library in .NET 4.0. Recently, I developed a custom ORM for some unusual read/write operations one of our large systems has to use. This allows me to decorate an object with attributes and have reflection figure out what columns it has to pull from the database, as well as what XML it has to output on writes.
Since I envision this wrapper to be reused in many projects, I'd like to squeeze as much speed out of it as possible. This library will mostly be used in .NET web applications. I'm testing the framework using a throwaway console application to poke at the classes I've created.
I've now learned a lesson of the overhead that multithreading comes with. Multithreading causes it to run slower. From reading around, it seems like it's intuitive to people who've been doing it for a long time, but it's actually counter-intuitive to me: how can running a method 30 times at the same time be slower than running it 30 times sequentially?
I don't think I'm causing problems by multiple threads having to fight over the same shared object (though I'm not good enough at it yet to tell for sure or not), so I assume the slowdown is coming from the overhead of spawning all those threads and the runtime keeping them all straight. So:
Though I'm doing it mainly as a learning exercise, is this pessimization? For trivial, non-IO tasks, is multithreading overkill? My main goal is speed, not responsiveness of the UI or anything.
Would running the same multithreading code in IIS cause it to speed up because of already-created threads in the thread pool, whereas right now I'm using a console app, which I assume would be single-threaded until I told it otherwise? I'm about to run some tests, but I figure there's some base knowledge I'm missing to know why it would be one way or the other. My console app is also running on my desktop with two cores, whereas a server for a web app would have more, so I might have to use that as a variable as well.
Thread's don't actually all run concurrently.
On a desktop machine I'm presuming you have a dual core CPU, (maybe a quad at most). This means only 2/4 threads can be running at the same time.
If you have spawned 30 threads, the OS is going to have to context switch between those 30 threads to keep them all running. Context switches are quite costly, so hence the slowdown.
As a basic suggestion, I'd aim for 1 thread per CPU if you are trying to optimise calculations. Any more than this and you're not really doing any extra work, you are just swapping threads in an out on the same CPU. Try to think of your computer as having a limited number of workers inside, you can't do more work concurrently than the number of workers you have available.
Some of the new features in the .net 4.0 parallel task library allow you to do things that account for scalability in the number of threads. For example you can create a bunch of tasks and the task parallel library will internally figure out how many CPUs you have available, and optimise the number of threads is creates/uses so as not to overload the CPUs, so you could create 30 tasks, but on a dual core machine the TP library would still only create 2 threads, and queue the . Obviously, this will scale very nicely when you get to run it on a bigger machine. Or you can use something like ThreadPool.QueueUserWorkItem(...) to queue up a bunch of tasks, and the pool will automatically manage how many threads is uses to perform those tasks.
Yes there is a lot of overhead to thread creation, but if you are using the .net thread pool, (or the parallel task library in 4.0) .net will be managing your thread creation, and you may actually find it creates less threads than the number of tasks you have created. It will internally swap your tasks around on the available threads. If you actually want to control explicit creation of actual threads you would need to use the Thread class.
[Some cpu's can do clever stuff with threads and can have multiple Threads running per CPU - see hyperthreading - but check out your task manager, I'd be very surprised if you have more than 4-8 virtual CPUs on today's desktops]
There are so many issues with this that it pays to understand what is happening under the covers. I would highly recommend the "Concurrent Programming on Windows" book by Joe Duffy and the "Java Concurrency in Practice" book. The latter talks about processor architecture at the level you need to understand it when writing multithreaded code. One issue you are going to hit that's going to hurt your code is caching, or more likely the lack of it.
As has been stated there is an overhead to scheduling and running threads, but you may find that there is a larger overhead when you share data across threads. That data may be flushed from the processor cache into main memory, and that will cause serious slow downs to your code.
This is the sort of low-level stuff that managed environments are supposed to protect us from, however, when writing highly parallel code, this is exactly the sort of issue you have to deal with.
A colleague of mine recorded a screencast about the performance issue with Parallel.For and Parallel.ForEach which may help:
http://rocksolidknowledge.com/ScreenCasts.mvc/Watch?video=ParallelLoops.wmv
You're speaking of an ORM, so I presume some amount of I/O is going on. If this is the case, the overhead of thread creation and context switching is going to be comparatively non-existent.
Most likely, you're experiencing I/O contention: it can be slower (particularly on rotational hard drives, but also on other storage devices) to read the same set of data if you read it out of order than if you read it in-order. So, if you're executing 30 database queries, it's possible they'll run faster sequentially than in parallel if they're all backed by the same I/O device and the queries aren't in cache. Running them in parallel may cause the system to have a bunch of I/O read requests almost simultaneously, which may cause the OS to read little bits of each in turn - causing your drive head to jump back and forth, wasting precious milliseconds.
But that's just a guess; it's not possible to really determine what's causing your slowdown without knowing more.
Although thread creation is "extremely expensive" when compared to say adding two numbers, it's not usually something you'll easily overdo. If your operations are extremely short (say, a millisecond or less), using a thread-pool rather than new threads will noticeably save time. Generally though, if your operations are that short, you should reconsider the granularity of parallelism anyhow; perhaps you're better off splitting the computation into bigger chunks: for instance, by having a fairly low number of worker tasks which handle entire batches of smaller work-items at a time rather than each item separately.

Resources