C# TPL Tasks - How many at one time - multithreading

I'm learning how to use the TPL for parellizing an application I have. The application processes ZIP files, exctracting all of the files held within them and importing the contents into a database. There may be several thousand zip files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.

This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disc I/O is going to be your bottle neck so I think that you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads but I'm not sure how much control you have (if nay) over the parallel for, foreach as far as how parallelism goes on at once, which could choke your process and actually slow it down.

Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.

I would have thought that this would depend on if the process is limited by CPU or disk. If the process is limited by disk I'd thought that it might be a bad idea to kick off too many threads since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.

I have to disagree with certain statements here guys.
First of all, I do not see any difference between ThreadPool and Tasks in coordination or control. Especially when tasks runs on ThreadPool and you have easy control over tasks, exceptions are nicely propagated to the caller during await or awaiting on Tasks.WhenAll(tasks) etc.
Second, I/O wont have to be the only bottleneck here, depending on data and level of compression the ZIPping is going to take msot likely more time than reading the file from the disc.
It can be thought of in many ways, but I would best go for something like number of CPU cores or little less.
Loading file paths to ConcurrentQueue and then allowing running tasks to dequeue filepaths, load files, zip them, save them.
From there you can tweak the number of cores and play with load balancing.
I do not know if ZIP supports file partitioning during compression, but in some advanced/complex cases it could be good idea especially on large files...
WOW, it is 6 years old question, bummer! I have not noticed...:)

Related

Managing the TPL Queue

I've got a service that runs scans of various servers. The networks in question can be huge (hundreds of thousands of network nodes).
The current version of the software is using a queueing/threading architecture designed by us which works but isn't as efficient as it could be (not least of which because jobs can spawn children which isn't handled well)
V2 is coming up and I'm considering using the TPL. It seems like it should be ideally suited.
I've seen this question, the answer to which implies there's no limit to the tasks TPL can handle. In my simple tests (Spin up 100,000 tasks and give them to TPL), TPL barfed fairly early on with an Out-Of-Memory exception (fair enough - especially on my dev box).
The Scans take a variable length of time but 5 mins/task is a good average.
As you can imagine, scans for huge networks can take a considerable length of time, even on beefy servers.
I've already got a framework in place which allows the scan jobs (stored in a Db) to be split between multiple scan servers, but the question is how exactly I should pass work to the TPL on a specific server.
Can I monitor the size of TPL's queue and (say) top it up if it falls below a couple of hundred entries? Is there a downside to doing this?
I also need to handle the situation where a scan needs to be paused. This is seems easier to do by not giving the work to TPL than by cancelling/resetting tasks which may already be partially processed.
All of the initial tasks can be run in any order. Children must be run after the parent has started executing but since the parent spawns them, this shouldn't ever be a problem. Children can be run in any order. Because of this, I'm currently envisioning that child tasks be written back to the Db not spawned directly into TPL. This would allow other servers to "work steal" if required.
Has anyone had any experience with using the TPL in this way? Are there any considerations I need to be aware of?
TPL is about starting small units of work and running them in parallel. It is not about monitoring, pausing, or throttling this work.
You should see TPL as a low-level tool to start "work" and to synchronize threads.
Key point: TPL tasks != logical tasks. Logical tasks are in your case scan-tasks ("scan an ip-range from x to y"). Such a task should not correspond to a physical task "System.Threading.Task" because the two are different concepts.
You need to schedule, orchestrate, monitor and pause the logical tasks yourself because TPL does not understand them and cannot be made to.
Now the more practical concerns:
TPL can certainly start 100k tasks without OOM. The OOM happened because your tasks' code exhausted memory.
Scanning networks sounds like a great case for asynchronous code because while you are scanning you are likely to wait on results while having a great degree of parallelism. You probably don't want to have 500 threads in your process all waiting for a network packet to arrive. Asynchronous tasks fit well with the TPL because every task you run becomes purely CPU-bound and small. That is the sweet spot for TPL.

Multithreading in .NET 4.0 and performance

I've been toying around with the Parallel library in .NET 4.0. Recently, I developed a custom ORM for some unusual read/write operations one of our large systems has to use. This allows me to decorate an object with attributes and have reflection figure out what columns it has to pull from the database, as well as what XML it has to output on writes.
Since I envision this wrapper to be reused in many projects, I'd like to squeeze as much speed out of it as possible. This library will mostly be used in .NET web applications. I'm testing the framework using a throwaway console application to poke at the classes I've created.
I've now learned a lesson of the overhead that multithreading comes with. Multithreading causes it to run slower. From reading around, it seems like it's intuitive to people who've been doing it for a long time, but it's actually counter-intuitive to me: how can running a method 30 times at the same time be slower than running it 30 times sequentially?
I don't think I'm causing problems by multiple threads having to fight over the same shared object (though I'm not good enough at it yet to tell for sure or not), so I assume the slowdown is coming from the overhead of spawning all those threads and the runtime keeping them all straight. So:
Though I'm doing it mainly as a learning exercise, is this pessimization? For trivial, non-IO tasks, is multithreading overkill? My main goal is speed, not responsiveness of the UI or anything.
Would running the same multithreading code in IIS cause it to speed up because of already-created threads in the thread pool, whereas right now I'm using a console app, which I assume would be single-threaded until I told it otherwise? I'm about to run some tests, but I figure there's some base knowledge I'm missing to know why it would be one way or the other. My console app is also running on my desktop with two cores, whereas a server for a web app would have more, so I might have to use that as a variable as well.
Thread's don't actually all run concurrently.
On a desktop machine I'm presuming you have a dual core CPU, (maybe a quad at most). This means only 2/4 threads can be running at the same time.
If you have spawned 30 threads, the OS is going to have to context switch between those 30 threads to keep them all running. Context switches are quite costly, so hence the slowdown.
As a basic suggestion, I'd aim for 1 thread per CPU if you are trying to optimise calculations. Any more than this and you're not really doing any extra work, you are just swapping threads in an out on the same CPU. Try to think of your computer as having a limited number of workers inside, you can't do more work concurrently than the number of workers you have available.
Some of the new features in the .net 4.0 parallel task library allow you to do things that account for scalability in the number of threads. For example you can create a bunch of tasks and the task parallel library will internally figure out how many CPUs you have available, and optimise the number of threads is creates/uses so as not to overload the CPUs, so you could create 30 tasks, but on a dual core machine the TP library would still only create 2 threads, and queue the . Obviously, this will scale very nicely when you get to run it on a bigger machine. Or you can use something like ThreadPool.QueueUserWorkItem(...) to queue up a bunch of tasks, and the pool will automatically manage how many threads is uses to perform those tasks.
Yes there is a lot of overhead to thread creation, but if you are using the .net thread pool, (or the parallel task library in 4.0) .net will be managing your thread creation, and you may actually find it creates less threads than the number of tasks you have created. It will internally swap your tasks around on the available threads. If you actually want to control explicit creation of actual threads you would need to use the Thread class.
[Some cpu's can do clever stuff with threads and can have multiple Threads running per CPU - see hyperthreading - but check out your task manager, I'd be very surprised if you have more than 4-8 virtual CPUs on today's desktops]
There are so many issues with this that it pays to understand what is happening under the covers. I would highly recommend the "Concurrent Programming on Windows" book by Joe Duffy and the "Java Concurrency in Practice" book. The latter talks about processor architecture at the level you need to understand it when writing multithreaded code. One issue you are going to hit that's going to hurt your code is caching, or more likely the lack of it.
As has been stated there is an overhead to scheduling and running threads, but you may find that there is a larger overhead when you share data across threads. That data may be flushed from the processor cache into main memory, and that will cause serious slow downs to your code.
This is the sort of low-level stuff that managed environments are supposed to protect us from, however, when writing highly parallel code, this is exactly the sort of issue you have to deal with.
A colleague of mine recorded a screencast about the performance issue with Parallel.For and Parallel.ForEach which may help:
http://rocksolidknowledge.com/ScreenCasts.mvc/Watch?video=ParallelLoops.wmv
You're speaking of an ORM, so I presume some amount of I/O is going on. If this is the case, the overhead of thread creation and context switching is going to be comparatively non-existent.
Most likely, you're experiencing I/O contention: it can be slower (particularly on rotational hard drives, but also on other storage devices) to read the same set of data if you read it out of order than if you read it in-order. So, if you're executing 30 database queries, it's possible they'll run faster sequentially than in parallel if they're all backed by the same I/O device and the queries aren't in cache. Running them in parallel may cause the system to have a bunch of I/O read requests almost simultaneously, which may cause the OS to read little bits of each in turn - causing your drive head to jump back and forth, wasting precious milliseconds.
But that's just a guess; it's not possible to really determine what's causing your slowdown without knowing more.
Although thread creation is "extremely expensive" when compared to say adding two numbers, it's not usually something you'll easily overdo. If your operations are extremely short (say, a millisecond or less), using a thread-pool rather than new threads will noticeably save time. Generally though, if your operations are that short, you should reconsider the granularity of parallelism anyhow; perhaps you're better off splitting the computation into bigger chunks: for instance, by having a fairly low number of worker tasks which handle entire batches of smaller work-items at a time rather than each item separately.

What kinds of applications need to be multi-threaded?

What are some concrete examples of applications that need to be multi-threaded, or don't need to be, but are much better that way?
Answers would be best if in the form of one application per post that way the most applicable will float to the top.
There is no hard and fast answer, but most of the time you will not see any advantage for systems where the workflow/calculation is sequential. If however the problem can be broken down into tasks that can be run in parallel (or the problem itself is massively parallel [as some mathematics or analytical problems are]), you can see large improvements.
If your target hardware is single processor/core, you're unlikely to see any improvement with multi-threaded solutions (as there is only one thread at a time run anyway!)
Writing multi-threaded code is often harder as you may have to invest time in creating thread management logic.
Some examples
Image processing can often be done in parallel (e.g. split the image into 4 and do the work in 1/4 of the time) but it depends upon the algorithm being run to see if that makes sense.
Rendering of animation (from 3DMax,etc.) is massively parallel as each frame can be rendered independently to others -- meaning that 10's or 100's of computers can be chained together to help out.
GUI programming often helps to have at least two threads when doing something slow, e.g. processing large number of files - this allows the interface to remain responsive whilst the worker does the hard work (in C# the BackgroundWorker is an example of this)
GUI's are an interesting area as the "responsiveness" of the interface can be maintained without multi-threading if the worker algorithm keeps the main GUI "alive" by giving it time, in Windows API terms (before .NET, etc) this could be achieved by a primitive loop and no need for threading:
MSG msg;
while(GetMessage(&msg, hwnd, 0, 0))
{
TranslateMessage(&msg);
DispatchMessage(&msg);
// do some stuff here and then release, the loop will come back
// almost immediately (unless the user has quit)
}
Servers are typically multi-threaded (web servers, radius servers, email servers, any server): you usually want to be able to handle multiple requests simultaneously. If you do not want to wait for a request to end before you start to handle a new request, then you mainly have two options:
Run a process with multiple threads
Run multiple processes
Launching a process is usually more resource-intensive than lauching a thread (or picking one in a thread-pool), so servers are usually multi-threaded. Moreover, threads can communicate directly since they share the same memory space.
The problem with multiple threads is that they are usually harder to code right than multiple processes.
There are really three classes of reasons that multithreading would be applied:
Execution Concurrency to improve compute performance: If you have a problem that can be broken down into pieces and you also have more than one execution unit (processor core) available then dispatching the pieces into separate threads is the path to being able to simultaneously use two or more cores at once.
Concurrency of CPU and IO Operations: This is similar in thinking to the first one but in this case the objective is to keep the CPU busy AND also IO operations (ie: disk I/O) moving in parallel rather than alternating between them.
Program Design and Responsiveness: Many types of programs can take advantage of threading as a program design benefit to make the program more responsive to the user. For example the program can be interacting via the GUI and also doing something in the background.
Concrete Examples:
Microsoft Word: Edit document while the background grammar and spell checker works to add all the green and red squiggle underlines.
Microsoft Excel: Automatic background recalculations after cell edits
Web Browser: Dispatch multiple threads to load each of the several HTML references in parallel during a single page load. Speeds page loads and maximizes TCP/IP data throughput.
These days, the answer should be Any application that can be.
The speed of execution for a single thread pretty much peaked years ago - processors have been getting faster by adding cores, not by increasing clock speeds. There have been some architectural improvements that make better use of the available clock cycles, but really, the future is taking advantage of threading.
There is a ton of research going on into finding ways of parallelizing activities that we traditionally wouldn't think of parallelizing. Even something as simple as finding a substring within a string can be parallelized.
Basically there are two reasons to multi-thread:
To be able to do processing tasks in parallel. This only applies if you have multiple cores/processors, otherwise on a single core/processor computer you will slow the task down compared to the version without threads.
I/O whether that be networked I/O or file I/O. Normally if you call a blocking I/O call, the process has to wait for the call to complete. Since the processor/memory are several orders of magnitude quicker than a disk drive (and a network is even slower) it means the processor will be waiting a long time. The computer will be working on other things but your application will not be making any progress. However if you have multiple threads, the computer will schedule your application and the other threads can execute. One common use is a GUI application. Then while the application is doing I/O the GUI thread can keep refreshing the screen without looking like the app is frozen or not responding. Even on a single processor putting I/O in a different thread will tend to speed up the application.
The single threaded alternative to 2 is to use asynchronous calls where they return immediately and you keep controlling your program. Then you have to see when the I/O completes and manage using it. It is often simpler just to use a thread to do the I/O using the synchronous calls as they tend to be easier.
The reason to use threads instead of separate processes is because threads should be able to share data easier than multiple processes. And sometimes switching between threads is less expensive than switching between processes.
As another note, for #1 Python threads won't work because in Python only one python instruction can be executed at a time (known as the GIL or Global Interpreter Lock). I use that as an example but you need to check around your language. In python if you want to do parallel calculations, you need to do separate processes.
Many GUI frameworks are multi-threaded. This allows you to have a more responsive interface. For example, you can click on a "Cancel" button at any time while a long calculation is running.
Note that there are other solutions for this (for example the program can pause the calculation every half-a-second to check whether you clicked on the Cancel button or not), but they do not offer the same level of responsiveness (the GUI might seem to freeze for a few seconds while a file is being read or a calculation being done).
All the answers so far are focusing on the fact that multi-threading or multi-processing are necessary to make the best use of modern hardware.
There is however also the fact that multithreading can make life much easier for the programmer. At work I program software to control manufacturing and testing equipment, where a single machine often consists of several positions that work in parallel. Using multiple threads for that kind of software is a natural fit, as the parallel threads model the physical reality quite well. The threads do mostly not need to exchange any data, so the need to synchronize threads is rare, and many of the reasons for multithreading being difficult do therefore not apply.
Edit:
This is not really about a performance improvement, as the (maybe 5, maybe 10) threads are all mostly sleeping. It is however a huge improvement for the program structure when the various parallel processes can be coded as sequences of actions that do not know of each other. I have very bad memories from the times of 16 bit Windows, when I would create a state machine for each machine position, make sure that nothing would take longer than a few milliseconds, and constantly pass the control to the next state machine. When there were hardware events that needed to be serviced on time, and also computations that took a while (like FFT), then things would get ugly real fast.
Not directly answering your question, I believe in the very near future, almost every application will need to be multithreaded. The CPU performance is not growing that fast these days, which is compensated for by the increasing number of cores. Thus, if we will want our applications to stay on the top performance-wise, we'll need to find ways to utilize all your computer's CPUs and keep them busy, which is quite a hard job.
This can be done via telling your programs what to do instead of telling them exactly how. Now, this is a topic I personally find very interesting recently. Some functional languages, like F#, are able to parallelize many tasks quite easily. Well, not THAT easily, but still without the necessary infrastructure needed in more procedural-style environments.
Please take this as additional information to think about, not an attempt to answer your question.
The kind of applications that need to be threaded are the ones where you want to do more than one thing at once. Other than that no application needs to be multi-threaded.
Applications with a large workload which can be easily made parallel. The difficulty of taking your application and doing that should not be underestimated. It is easy when your data you're manipulating is not dependent upon other data but v. hard to schedule the cross thread work when there is a dependency.
Some examples I've done which are good multithreaded candidates..
running scenarios (eg stock derivative pricing, statistics)
bulk updating data files (eg adding a value / entry to 10,000 records)
other mathematical processes
E.g., you want your programs to be multithreaded when you want to utilize multiple cores and/or CPUs, even when the programs don't necessarily do many things at the same time.
EDIT: using multiple processes is the same thing. Which technique to use depends on the platform and how you are going to do communications within your program, etc.
Although frivolous, games, in general are becomming more and more threaded every year. At work our game uses around 10 threads doing physics, AI, animation, redering, network and IO.
Just want to add that caution must be taken with treads if your sharing any resources as this can lead to some very strange behavior, and your code not working correctly or even the threads locking each other out.
mutex will help you there as you can use mutex locks for protected code regions, a example of protected code regions would be reading or writing to shared memory between threads.
just my 2 cents worth.
The main purpose of multithreading is to separate time domains. So the uses are everywhere where you want several things to happen in their own distinctly separate time domains.
HERE IS A PERFECT USE CASE
If you like affiliate marketing multi-threading is essential. Kick the entire process off via a multi-threaded application.
Download merchant files via FTP, unzipping the files, enumerating through each file performing cleanup like EOL terminators from Unix to PC CRLF then slam each into SQL Server via Bulk Inserts then when all threads are complete create the full text search indexes for a environmental instance to be live tomorrow and your done. All automated to kick off at say 11:00 pm.
BOOM! Fast as lightening. Heck you have so much time left you can even download merchant images locally for the products you download, save the images as webp and set the product urls to use local images.
Yep I did it. Wrote it in C#. Works like a charm. Purchase a AMD Ryzen Threadripper 64-core with 256gb memory and fast drives like nvme, get lunch come back and see it all done or just stay around and watch all cores peg to 95%+, listen to the pc's fans kick, warm up the room and the look outside as the neighbors lights flicker from the power drain as you get shit done.
Future would be to push processing to GPU's as well.
Ok well I am pushing it a little bit with the neighbors lights flickering but all else was absolutely true. :)

Does multithreading make sense for IO-bound operations?

When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert to another format, and then save, disk operations can be performed concurrently with the image manipulation. My question is when the only operations performed are disk operations, whether concurrently queuing and responding to disk operations is better.
Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads...IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless...and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has much to do with the OS scheduler as it has more to do with the physical aspects of the disk(s) your writing to.
That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with a another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.
In most cases, using multi-thread for disk IO will not benefit efficiency. Let's imagine 2 circumstances:
Lock-Free File: We can split the file for each thread by giving them different IO offset. For instance, a 1024B bytes file is split into n pieces and each thread writes the 1024/n respectively. This will cause a lot of verbose disk head movement because of the different offset.
Lock File: Actually lock the IO operation for each critical section. This will cause a lot of verbose thread switches and it turns out that only one thread can write the file simultaneously.
Correct me if I' wrong.
No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OS's have to cope with multiple processes anyway I doubt that there's an added overhead.
I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.
I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then split the set in two parts and make the copies in parallel... the second option will be sensibly slower.

Threads or asynch?

How do you make your application multithreaded ?
Do you use asynch functions ?
or do you spawn a new thread ?
I think that asynch functions are already spawning a thread so if your job is doing just some file reading, being lazy and just spawning your job on a thread would just "waste" ressources...
So is there some kind of design when using thread or asynch functions ?
If you are talking about .Net, then don't forget the ThreadPool. The thread pool is also what asynch functions often use. Spawning to much threads can actually hurt your performance. A thread pool is designed to spawn just enough threads to do the work the fastest. So do use a thread pool instead of spwaning your own threads, unless the thread pool doesn't meet your needs.
PS: And keep an eye out on the Parallel Extensions from Microsoft
Spawning threads is only going to waste resources if you start spawning tons of them, one or two extra threads isn't going to effect the platforms proformance, infact System currently has over 70 threads for me, and msn is using 32 (I really have no idea how a messenger can use that many threads, exspecialy when its minimised and not really doing anything...)
Useualy a good time to spawn a thread is when something will take a long time, but you need to keep doing something else.
eg say a calculation will take 30 seconds. The best thing to do is spawn a new thread for the calculation, so that you can continue to update the screen, and handle any user input because users will hate it if your app freezes untill its finished doing the calculation.
On the other hand, creating threads to do something that can be done almost instantly is nearly pointless, since the overhead of creating (or even just passing work to an existing thread using a thread pool) will be higher than just doing the job in the first place.
Sometimes you can break your app into a couple of seprate parts which run in their own threads. For example in games the updates/physics etc may be one thread, while grahpics are another, sound/music is a third, and networking is another. The problem here is you really have to think about how these parts will interact or else you may have worse proformance, bugs that happen seemingly "randomly", or it may even deadlock.
I'll second Fire Lancer's answer - creating your own threads is an excellent way to process big tasks or to handle a task that would otherwise be "blocking" to the rest of synchronous app, but you have to have a clear understanding of the problem that you must solve and develope in a way that clearly defines the task of a thread, and limits the scope of what it does.
For an example I recently worked on - a Java console app runs periodically to capture data by essentially screen-scraping urls, parsing the document with DOM, extracting data and storing it in a database.
As a single threaded application, it, as you would expect, took an age, averaging around 1 url a second for a 50kb page. Not too bad, but when you scale out to needing to processes thousands of urls in a batch, it's no good.
Profiling the app showed that most of the time the active thread was idle - it was waiting for I/O operations - opening of a socket to the remote URL, opening a connection to the database etc. It's this sort of situation that can easily be improved with multithreading. Rewriting to be multi-threaded and with just 5 threads instead of one, even on a single core cpu, gave an increase in throughput of over 20 times.
In this example, each "worker" thread was explicitly limited to what it did - open the remote a remote url, parse the data, store it in the db. All the "high level" processing - generating the list of urls to parse, working out which next, handling errors, all remained with the control of the main thread.
The use of threads makes you think more about the way your application needs threading and can in the long run make it easier to improve / control your performance.
Async methods are faster to use but they are a bit magic - a lot of things happen to make them possible - so it's probable that at some point you will need something that they can't give you. Then you can try and roll some custom threading code.
It all depends on your needs.
The answer is "it depends".
It depends on what you're trying to achieve. I'm going to assume that you're aiming for more performance.
The simplest solution is to find another way to improve your performance. Run a profiler. Look for hot spots. Reduce unnecessary IO.
The next solution is to break your program into multiple processes, each of which can run in their own address space. This is easiest because there is no chance of the individual processes messing each other up.
The next solution is to use threads. At this point you're opening a major can of worms, so start small, and only multi-thread the critical path of the code.
The next solution is to use asynch IO. Generally only recommended for people writing some of very heavily loaded server, and even then I would rather re-use one of the existing frameworks that abstract away the details e.g. the C++ framework ICE, or an EJB server under java.
Note that each of these solutions has multiple sub-solutions - there are different breeds of threads and different kinds of asynch IO, each with slightly different performance characteristics, but again, it's generally best to let the framework handle it for you.

Resources