How I profile multithreading problems?

How I profile multithreading problems? - multithreading

This is the first time I am trying to profile a multi-threaded program.
I suspect the problem is it waiting for something, but I have no clue what, the program never reaches 100% of CPU, GPU, RAM or I/O use.
Until recently, I've only worked on projects with single-threading, or where the threads were very simple (example: usually an extra thread just to ensure the UI is not locked while the program works, or once I made a game engine with a separate thread to handle .XM and .IT files music, so that the main thread could do everything, while the other thread in another core could take care of decoding those files).
This program has several threads, and they don't do parallel work on the same tasks, each thread has its own completely separate purpose (for example one thread is dedicated to handling all sound-related API calls to the OS).
I downloaded Microsoft performance tools, there is a blog by an ex-Valve employee that explains that they work to do this, but although I even managed to make some profiles and whatnot, I don't really understood what I am seeing, it is only a bunch of pretty graphs to me (except the CPU use graph, that I already knew from doing sample-based profiling on single-threaded apps), so, how I find why the program is waiting on something? Or how I find what is it waiting for? How I find what thread is blocking the others?

I look at is as an alternation between two things:
a) measuring overall time, for which all you need is some kind of timer, and
b) finding speedups, which does not mean measuring, in spite of what a lot of people have been told.
Each time you find a speedup, you time the results and do it again.
That's the alternation.
To find speedups, the method I and many people use is random pausing.
The idea is, you get the program running under a debugger and manually interrupt it, several times.
Each time, you examine the state of every thread, including the call stack.
It is very crude, and it is very effective.
The reason this works is that the only way the program can go faster is if it is doing an activity that you can remove, and if that saves a certain fraction of time, you are at least that likely to see it on every pause.
This works whether it is doing I/O, waiting for something, or computing.
It sees things that profilers do not expose, because they make summaries from which speedups can easily hide.

Performance Wizard in Visual Studio Performance and Diagnostics Hub has "Resource contention data" profiling regime which allows to analyze concurrency contention among threads, i.e. how the overall performance of a program is impacted by threads waiting on other threads. Please refer to this blog post for more details.
PerfView is an extremely powerful profiling tool which allows one to analyze the impact of service threads and tasks to the overall performance of the program. Here is the PerfView Tutorial available.

Related

Can code running in a background thread be faster than in the main VCL thread in Delphi?

If anybody has had a lot of experience timing code running on the main VCL thread vs a background thread, I'd like to get an opinion. I have some code that does some heavy string processing running in my Delphi 6 application on the main thread. Each time I run an operation, the time for each operation hovers around 50 ms on a single thread on my i5 Quad core. What makes me really suspicious is that the same code running on an old Pentium 4 that I have, shows the same time for the operation when usually I see code running about 4 times slower on the Pentium 4 than the Quad Core. I am beginning to wonder if the code might be consuming significantly less time than 50 ms but that there's something about the main VCL thread, perhaps Windows message handling or executing Windows API calls, that is creating an artificial "floor" for the operation. Note, an operation is triggered by an incoming request on a socket if that matters, but the time measurement does not take place until the data is fully received.
Before I undertake the work of moving all the code on to a background thread for testing, I am wondering if anyone has any general knowledge in this area? What have your experiences been with code running on and off the main VCL thread? Note, the timing measurements are being done when there is absolutely no user triggered activity going on during the tests.
I'm also wondering if raising the priority of the thread to just below real-time would do any good. I've never seen much improvement in my run times when experimenting with those flags.
-- roschler

Given all threads have the same priority, as they normally do, there can't be a difference, for the following reasons. If you're seeing a difference, re-evaluate the code (make sure you run the same thing in both VCL and background threads) and make sure you time it properly:
The compiler generates the exact same code, it doesn't care if the code is going to run in the main thread or in a background thread. In fact you can put the whole code in a procedure and call that from both your worker thread's Execute() and from the main VCL thread.
For the CPU all cores, and all threads, are equal. Unless it's actually a Hyper Threading CPU, where not all cores are real, but then see the next bullet.
Even if not all CPU cores are equal, your thread will very unlikely run on the same core, the operating system is free to move it around at will (and does actually schedule your thread to run on different cores at different times).
Messaging overhead doesn't matter for the main VCL thread, because unless you're calling Application.ProcessMessages() manually, the message pump is simply stopped while your procedure does it's work. The message pump is passive, your thread needs to request messages from the queue, but since the thread is busy doing your work, it's not requesting any messages so no overhead there.
There's just one place where threads are not equal, and this can change the perceived speed of execution: It's the operating system that schedules threads to execution units (cores), and for the operating system threads have different priorities. You can tell the OS a certain thread needs to be treated differently using the SetThreadPriority() API (which is used by the TThread.Priority property).

Without simple source code to reproduce the issue, and how you are timing your threads, it will be difficult to understand what occurs in your software.
Sounds definitively like either:
An Architecture issue - how are your threads defined?
A measurement issue - how are you timing your threads?
A typical scaling issue of both the memory manager and the RTL string-related implementation.
About the latest point, consider this:
The current memory manager (FastMM4) is not scaling well on multi-core CPU; try with a per-thread memory manager, like our experimental SynScaleMM - note e.g. that the Free Pascal Compiler team has written a new scaling MM from scratch recently, to avoid such issue;
Try changing the string process implementation to avoid memory allocation (use static buffers), and string reference-counting (every string reference counting access produces a LOCK DEC/INC which do not scale so well on multi-code CPU - use per-thread char-level process, using e.g. PChar on static buffers instead of string).
I'm sure that without string operations, you'll find that all threads are equivalent.
In short: neither the current Delphi MM, neither the current string implementation scales well on multi-core CPU. You just found out a known issue of the current RTL. Read this SO question.

When your code has control of the VCL thread, for instance if it is in one method and doesn't call out to any VCL controls or call Application.ProcessMessages, then the run time will not be affected just because it's in the main VCL thread.
There is no overhead, since you "own" the whole processing power of the thread when you are in your own code.
I would suggest that you use a profiling tool to find where the actual bottleneck is.

Performance can't be assessed statically. For that you need to get AQTime, or some other performance profiler for Delphi. I use AQtime, and I love it, but I'm aware it's considered expensive.
Your code will not magically get faster just because you moved it to a background thread. If anything, your all-inclusive-time until you see results in your UI might get a little slower, if you have to send a lot of data from the background thread to the foreground thread via some synchronization mechanisms.
If however you could execute parts of your algorithm in parallel, that is, split your work so that you have 2 or more worker threads processing your data, and you have a quad core processor, then your total time to do a fixed load of work, could decrease. That doesn't mean the code would run any faster, but depending on a lot of factors, you might achieve a slight benefit from multithreading, up to the number of cores in your computer. It's never ever going to be a 2x performance boost, to use two threads instead of one, but you might get 20%-40% better performance, in your more-than-one-threaded parallel solutions, depending on how scalable your heap is under multithreaded loads, and how IO/memory/cache bound your workload is.
As for raising thread priorities, generally all you will do there is upset the delicate balance of your Windows system's performance. By raising the priorities you will achieve (sometimes) a nominal, but unrepeatable and non-guaranteeable increase in performance. Depending on the other things you do in your code, and your data sources, playing with priorities of threads can introduce subtle problems. See Dining Philosophers problem for more.
Your best bet for optimizing the speed of string operations is to first test it and find out exactly where it is using most of its time. Is it heap operations? Memory Copy and move operations? Without a profiler, even with advice from other people, you will still be comitting a cardinal sin of programming; premature optimization. Be results oriented. Be science based. Measure. Understand. Then decide.
Having said that, I've seen a lot of horrible code in my time, and there is one killer thing that people do that totally kills their threaded app performance; Using TThread.Synchronize too much.
Here's a pathological (Extreme) case, that sadly, occurs in the wild fairly frequently:
procedure TMyThread.Execute;
begin
while not Terminated do
Synchronize(DoWork);
end;
The problem here is that 100% of the work is really done in the foreground, other than the "if terminated" check, which executes in the thread context. To make the above code even worse, add a non-interruptible sleep.
For fast background thread code, use Synchronize sparingly or not at all, and make sure the code it calls is simple and executes quickly, or better yet, use TThread.Queue or PostMessage if you could really live with queueing main thread activity.

Are you asking if a background thread would be faster? If your background thread would run the same code as the main thread and there's nothing else going on in the main thread, you don't stand to gain anything with a background thread. Threads should be used to split and distribute processing loads that would otherwise contend with one another and/or block one another when running in the main thread. Since you seem to be dealing with a case where your main thread is otherwise idle, simply spawning a thread to run slow code will not help.
Threads aren't magic, they can't speed up slow code or eliminate processing bottlenecks in a particular segment not related to contention on the main thread. Make sure your code isn't doing something you don't know about and that your timing methodology is correct.
My first hunch would be that your interaction with the socket is affecting your timing in a way you haven't detected... (I know you said you're sure that's not involved - but maybe check again...)

Pseudo real time threading

So I have built a small application that has a physics engine and a display. The display is attached to a controller which handles the physics engine(well, actually a view model that handles the controller, but details).
Currently the controller is a delegate that gets activated by a begin-invoke and deactivated by a cancellation token, and then reaped by an endinvoke. Inside the lambda brushes PropertyChanged(hooked into INotifyPropertyChanged) which keeps the UI up to date.
From what I understand the BeginInvoke method activates a task rather than another thread(which on my computers does activate another thread, but this isn't a guarantee from the reading I have done,it's up to the thread pool how it wants to get the task completed), which is fine from all the testing I have done. The lambda doesn't complete until a CancellationToken is killed. It has a sleep and an update(so it is sort of simulating a real-time physics engine...it's crude, but I don't need real precision on the real time, just enough to get a feel)
The question I have is, will this work on other computers, or should I switch over to explicit threads that I start and cancel? The scenario I am thinking of is on a 1 core processor, is it possible the second task will get massively less processor time and thereby make my acceptably inaccurate model into something unacceptably inaccurate(i.e. waiting for milliseconds before switching rather than microseconds?). Or is their some better way of doing this that I haven't come up with?

In my experience, using the threadpool in the way you described will pretty much guarantee reasonably optimal performance on most computers, without you having to go to the trouble to figure out how to divvy up the threads.
A thread is not the same thing as a core; you will still get multiple threads on a single-core machine, and those threads will each take part of the processing load. You won't get the "deadlock" condition you describe, unless you do something unusual with the threads, like give one of them real-time priority.
That said, microseconds is not a lot of time for context switching between threads, so YMMV. You'll have to try it, and see how well it works; there may be some tweaking required.

Can a multi-threaded program ever be deterministic?

Normally it is said that multi threaded programs are non-deterministic, meaning that if it crashes it will be next to impossible to recreate the error that caused the condition. One doesn't ever really know what thread is going to run next, and when it will be preempted again.
Of course this has to do with the OS thread scheduling algorithm and the fact that one doesn't know what thread is going to be run next, and how long it will effectively run.
Program execution order also plays a role as well, etc...
But what if you had the algorithm used for thread scheduling and what if you could know when what thread is running, could a multi threaded program then become "deterministic", as in, you'll be able to reproduce a crash?

Knowing the algorithm will not actually allow you to predict what will happen when. All kinds of delays that happen in the execution of a program or thread are dependent on environmental conditions such as: available memory, swapping, incoming interrupts, other busy tasks, etc.
If you were to map your multi-threaded program to a sequential execution, and your threads in themselves behave deterministically, then your whole program could be deterministic and 'concurrency' issues could be made reproducible. Of course, at that point they would not be concurrency issues any more.
If you would like to learn more, http://en.wikipedia.org/wiki/Process_calculus is very interesting reading.

My opinion is: technically no (but mathematically yes). You can write deterministic threading algorithm, but it will be extremely hard to predict state of the application after some sensible amount of time that you can treat it is non-deterministic.

There are some tools (in development) that will try to create race-conditions in a somewhat predictable manner but this is about forward-looking testing, not about reconstructing a 'bug in the wild'.
CHESS is an example.

It would be possible to run a program on a virtual multi-threaded machine where the allocation of virtual cycles to each thread was done via some entirely deterministic process, possibly using a pseudo-random generator (which could be seeded with a constant before each program run). Another, possibly more interesting, possibility would be to have a virtual machine which would alternate between running threads in 'splatter' mode (where almost any variable they touch would have its value become 'unknown' to other threads) and 'cleanup' mode (where results of operations with known operands would be visible and known to other threads). I would expect the situation would probably be somewhat analogous to hardware simulation: if the output of every gate is regarded as "unknown" between its minimum and maximum propagation times, but the simulation works anyway, that's a good indication the design is robust, but there are many useful designs which could not be constructed to work in such simulations (the states would be essentially guaranteed to evolve into a valid combination, though one could not guarantee which one). Still, it might be an interesting avenue of exploration, since large parts of many programs could be written to work correctly even in a 'splatter mode' VM.

I don't think it is practicable. To enforce a specific thread interleaving we require to place locks on shared variables, forcing the threads to access them in a specific order. This would cause severe performance degradation.
Replaying concurrency bugs is usually handled by record&replay systems. Since the recording of such large amounts of information also degrades performance, the most recent systems do partial logging and later complete the thread interleavings using SMT solving. I believe that the most recent advance in this type of systems is Symbiosis (published in this year's PLDI conference). Tou can find open source implementations in this URL:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html

This is actually a valid requirement in many systems today which want to execute tasks parallelly but also want some determinism from time to time.
For example, a mobile company would want to process subscription events of multiple users parallelly but would want to execute events of a single user one at a time.
One solution is to of course write everything to get executed on a single thread. Another solution is deterministic threading. I have written a simple library in Java that can be used to achieve the behavior I have described in the above example. Take a look at this- https://github.com/mukulbansal93/deterministic-threading.
Now, having said that, the actual allocation of CPU to a thread or process is in the hands of the OS. So, it is possible that the threads get the CPU cycles in a different order every time you run the same program. So, you cannot achieve the determinism in the order the threads are allocated CPU cycles. However, by delegating tasks effectively amongst threads such that sequential tasks are assigned to a single thread, you can achieve determinism in overall task execution.
Also, to answer your question about the simulation of a crash. All modern CPU scheduling algorithms are free from starvation. So, each and every thread is bound to get guaranteed CPU cycles. Now, it is possible that your crash was a result of the execution of a certain sequence of threads on a single CPU. There is no way to rerun that same execution order or rather the same CPU cycle allocation order. However, the combination of modern CPU scheduling algorithms being starvation-free and Murphy's law will help you simulate the error if you run your code enough times.
PS, the definition of enough times is quite vague and depends on a lot of factors like execution cycles need by the entire program, number of threads, etc. Mathematically speaking, a crude way to calculate the probability of simulating the same error caused by the same execution sequence is on a single processor is-
1/Number of ways to execute all atomic operations of all defined threads
For instance, a program with 2 threads with 2 atomic instructions each can be allocated CPU cycles in 4 different ways on a single processor. So probability would be 1/4.

Lots of crashes in multithreaded programs have nothing to do with the multithreading itself (or the associated resource contention).

Normally it is said that multi threaded programs are non-deterministic, meaning that if it crashes it will be next to impossible to recreate the error that caused the condition.
I disagree with this entirely, sure multi-threaded programs are non-deterministic, but then so are single-threaded ones, considering user input, message pumps, mouse/keyboard handling, and many other factors. A multi-threaded program usually makes it more difficult to reproduce the error, but definitely not impossible. For whatever reasons, program execution is not completely random, there is some sort of repeatability (but not predictability), I can usually reproduce multi-threaded bugs rather quickly in my apps, but then I have lots of verbose logging in my apps, for the end users' actions.
As an aside, if you are getting crashes, can't you also get crash logs, with call stack info? That will greatly aid in the debugging process.

Multithreading in .NET 4.0 and performance

I've been toying around with the Parallel library in .NET 4.0. Recently, I developed a custom ORM for some unusual read/write operations one of our large systems has to use. This allows me to decorate an object with attributes and have reflection figure out what columns it has to pull from the database, as well as what XML it has to output on writes.
Since I envision this wrapper to be reused in many projects, I'd like to squeeze as much speed out of it as possible. This library will mostly be used in .NET web applications. I'm testing the framework using a throwaway console application to poke at the classes I've created.
I've now learned a lesson of the overhead that multithreading comes with. Multithreading causes it to run slower. From reading around, it seems like it's intuitive to people who've been doing it for a long time, but it's actually counter-intuitive to me: how can running a method 30 times at the same time be slower than running it 30 times sequentially?
I don't think I'm causing problems by multiple threads having to fight over the same shared object (though I'm not good enough at it yet to tell for sure or not), so I assume the slowdown is coming from the overhead of spawning all those threads and the runtime keeping them all straight. So:
Though I'm doing it mainly as a learning exercise, is this pessimization? For trivial, non-IO tasks, is multithreading overkill? My main goal is speed, not responsiveness of the UI or anything.
Would running the same multithreading code in IIS cause it to speed up because of already-created threads in the thread pool, whereas right now I'm using a console app, which I assume would be single-threaded until I told it otherwise? I'm about to run some tests, but I figure there's some base knowledge I'm missing to know why it would be one way or the other. My console app is also running on my desktop with two cores, whereas a server for a web app would have more, so I might have to use that as a variable as well.

Thread's don't actually all run concurrently.
On a desktop machine I'm presuming you have a dual core CPU, (maybe a quad at most). This means only 2/4 threads can be running at the same time.
If you have spawned 30 threads, the OS is going to have to context switch between those 30 threads to keep them all running. Context switches are quite costly, so hence the slowdown.
As a basic suggestion, I'd aim for 1 thread per CPU if you are trying to optimise calculations. Any more than this and you're not really doing any extra work, you are just swapping threads in an out on the same CPU. Try to think of your computer as having a limited number of workers inside, you can't do more work concurrently than the number of workers you have available.
Some of the new features in the .net 4.0 parallel task library allow you to do things that account for scalability in the number of threads. For example you can create a bunch of tasks and the task parallel library will internally figure out how many CPUs you have available, and optimise the number of threads is creates/uses so as not to overload the CPUs, so you could create 30 tasks, but on a dual core machine the TP library would still only create 2 threads, and queue the . Obviously, this will scale very nicely when you get to run it on a bigger machine. Or you can use something like ThreadPool.QueueUserWorkItem(...) to queue up a bunch of tasks, and the pool will automatically manage how many threads is uses to perform those tasks.
Yes there is a lot of overhead to thread creation, but if you are using the .net thread pool, (or the parallel task library in 4.0) .net will be managing your thread creation, and you may actually find it creates less threads than the number of tasks you have created. It will internally swap your tasks around on the available threads. If you actually want to control explicit creation of actual threads you would need to use the Thread class.
[Some cpu's can do clever stuff with threads and can have multiple Threads running per CPU - see hyperthreading - but check out your task manager, I'd be very surprised if you have more than 4-8 virtual CPUs on today's desktops]

There are so many issues with this that it pays to understand what is happening under the covers. I would highly recommend the "Concurrent Programming on Windows" book by Joe Duffy and the "Java Concurrency in Practice" book. The latter talks about processor architecture at the level you need to understand it when writing multithreaded code. One issue you are going to hit that's going to hurt your code is caching, or more likely the lack of it.
As has been stated there is an overhead to scheduling and running threads, but you may find that there is a larger overhead when you share data across threads. That data may be flushed from the processor cache into main memory, and that will cause serious slow downs to your code.
This is the sort of low-level stuff that managed environments are supposed to protect us from, however, when writing highly parallel code, this is exactly the sort of issue you have to deal with.
A colleague of mine recorded a screencast about the performance issue with Parallel.For and Parallel.ForEach which may help:
http://rocksolidknowledge.com/ScreenCasts.mvc/Watch?video=ParallelLoops.wmv

You're speaking of an ORM, so I presume some amount of I/O is going on. If this is the case, the overhead of thread creation and context switching is going to be comparatively non-existent.
Most likely, you're experiencing I/O contention: it can be slower (particularly on rotational hard drives, but also on other storage devices) to read the same set of data if you read it out of order than if you read it in-order. So, if you're executing 30 database queries, it's possible they'll run faster sequentially than in parallel if they're all backed by the same I/O device and the queries aren't in cache. Running them in parallel may cause the system to have a bunch of I/O read requests almost simultaneously, which may cause the OS to read little bits of each in turn - causing your drive head to jump back and forth, wasting precious milliseconds.
But that's just a guess; it's not possible to really determine what's causing your slowdown without knowing more.
Although thread creation is "extremely expensive" when compared to say adding two numbers, it's not usually something you'll easily overdo. If your operations are extremely short (say, a millisecond or less), using a thread-pool rather than new threads will noticeably save time. Generally though, if your operations are that short, you should reconsider the granularity of parallelism anyhow; perhaps you're better off splitting the computation into bigger chunks: for instance, by having a fairly low number of worker tasks which handle entire batches of smaller work-items at a time rather than each item separately.

What kinds of applications need to be multi-threaded?

What are some concrete examples of applications that need to be multi-threaded, or don't need to be, but are much better that way?
Answers would be best if in the form of one application per post that way the most applicable will float to the top.

There is no hard and fast answer, but most of the time you will not see any advantage for systems where the workflow/calculation is sequential. If however the problem can be broken down into tasks that can be run in parallel (or the problem itself is massively parallel [as some mathematics or analytical problems are]), you can see large improvements.
If your target hardware is single processor/core, you're unlikely to see any improvement with multi-threaded solutions (as there is only one thread at a time run anyway!)
Writing multi-threaded code is often harder as you may have to invest time in creating thread management logic.
Some examples
Image processing can often be done in parallel (e.g. split the image into 4 and do the work in 1/4 of the time) but it depends upon the algorithm being run to see if that makes sense.
Rendering of animation (from 3DMax,etc.) is massively parallel as each frame can be rendered independently to others -- meaning that 10's or 100's of computers can be chained together to help out.
GUI programming often helps to have at least two threads when doing something slow, e.g. processing large number of files - this allows the interface to remain responsive whilst the worker does the hard work (in C# the BackgroundWorker is an example of this)
GUI's are an interesting area as the "responsiveness" of the interface can be maintained without multi-threading if the worker algorithm keeps the main GUI "alive" by giving it time, in Windows API terms (before .NET, etc) this could be achieved by a primitive loop and no need for threading:
MSG msg;
while(GetMessage(&msg, hwnd, 0, 0))
{
TranslateMessage(&msg);
DispatchMessage(&msg);
// do some stuff here and then release, the loop will come back
// almost immediately (unless the user has quit)
}

Servers are typically multi-threaded (web servers, radius servers, email servers, any server): you usually want to be able to handle multiple requests simultaneously. If you do not want to wait for a request to end before you start to handle a new request, then you mainly have two options:
Run a process with multiple threads
Run multiple processes
Launching a process is usually more resource-intensive than lauching a thread (or picking one in a thread-pool), so servers are usually multi-threaded. Moreover, threads can communicate directly since they share the same memory space.
The problem with multiple threads is that they are usually harder to code right than multiple processes.

There are really three classes of reasons that multithreading would be applied:
Execution Concurrency to improve compute performance: If you have a problem that can be broken down into pieces and you also have more than one execution unit (processor core) available then dispatching the pieces into separate threads is the path to being able to simultaneously use two or more cores at once.
Concurrency of CPU and IO Operations: This is similar in thinking to the first one but in this case the objective is to keep the CPU busy AND also IO operations (ie: disk I/O) moving in parallel rather than alternating between them.
Program Design and Responsiveness: Many types of programs can take advantage of threading as a program design benefit to make the program more responsive to the user. For example the program can be interacting via the GUI and also doing something in the background.
Concrete Examples:
Microsoft Word: Edit document while the background grammar and spell checker works to add all the green and red squiggle underlines.
Microsoft Excel: Automatic background recalculations after cell edits
Web Browser: Dispatch multiple threads to load each of the several HTML references in parallel during a single page load. Speeds page loads and maximizes TCP/IP data throughput.

These days, the answer should be Any application that can be.
The speed of execution for a single thread pretty much peaked years ago - processors have been getting faster by adding cores, not by increasing clock speeds. There have been some architectural improvements that make better use of the available clock cycles, but really, the future is taking advantage of threading.
There is a ton of research going on into finding ways of parallelizing activities that we traditionally wouldn't think of parallelizing. Even something as simple as finding a substring within a string can be parallelized.

Basically there are two reasons to multi-thread:
To be able to do processing tasks in parallel. This only applies if you have multiple cores/processors, otherwise on a single core/processor computer you will slow the task down compared to the version without threads.
I/O whether that be networked I/O or file I/O. Normally if you call a blocking I/O call, the process has to wait for the call to complete. Since the processor/memory are several orders of magnitude quicker than a disk drive (and a network is even slower) it means the processor will be waiting a long time. The computer will be working on other things but your application will not be making any progress. However if you have multiple threads, the computer will schedule your application and the other threads can execute. One common use is a GUI application. Then while the application is doing I/O the GUI thread can keep refreshing the screen without looking like the app is frozen or not responding. Even on a single processor putting I/O in a different thread will tend to speed up the application.
The single threaded alternative to 2 is to use asynchronous calls where they return immediately and you keep controlling your program. Then you have to see when the I/O completes and manage using it. It is often simpler just to use a thread to do the I/O using the synchronous calls as they tend to be easier.
The reason to use threads instead of separate processes is because threads should be able to share data easier than multiple processes. And sometimes switching between threads is less expensive than switching between processes.
As another note, for #1 Python threads won't work because in Python only one python instruction can be executed at a time (known as the GIL or Global Interpreter Lock). I use that as an example but you need to check around your language. In python if you want to do parallel calculations, you need to do separate processes.

Many GUI frameworks are multi-threaded. This allows you to have a more responsive interface. For example, you can click on a "Cancel" button at any time while a long calculation is running.
Note that there are other solutions for this (for example the program can pause the calculation every half-a-second to check whether you clicked on the Cancel button or not), but they do not offer the same level of responsiveness (the GUI might seem to freeze for a few seconds while a file is being read or a calculation being done).

All the answers so far are focusing on the fact that multi-threading or multi-processing are necessary to make the best use of modern hardware.
There is however also the fact that multithreading can make life much easier for the programmer. At work I program software to control manufacturing and testing equipment, where a single machine often consists of several positions that work in parallel. Using multiple threads for that kind of software is a natural fit, as the parallel threads model the physical reality quite well. The threads do mostly not need to exchange any data, so the need to synchronize threads is rare, and many of the reasons for multithreading being difficult do therefore not apply.
Edit:
This is not really about a performance improvement, as the (maybe 5, maybe 10) threads are all mostly sleeping. It is however a huge improvement for the program structure when the various parallel processes can be coded as sequences of actions that do not know of each other. I have very bad memories from the times of 16 bit Windows, when I would create a state machine for each machine position, make sure that nothing would take longer than a few milliseconds, and constantly pass the control to the next state machine. When there were hardware events that needed to be serviced on time, and also computations that took a while (like FFT), then things would get ugly real fast.

Not directly answering your question, I believe in the very near future, almost every application will need to be multithreaded. The CPU performance is not growing that fast these days, which is compensated for by the increasing number of cores. Thus, if we will want our applications to stay on the top performance-wise, we'll need to find ways to utilize all your computer's CPUs and keep them busy, which is quite a hard job.
This can be done via telling your programs what to do instead of telling them exactly how. Now, this is a topic I personally find very interesting recently. Some functional languages, like F#, are able to parallelize many tasks quite easily. Well, not THAT easily, but still without the necessary infrastructure needed in more procedural-style environments.
Please take this as additional information to think about, not an attempt to answer your question.

The kind of applications that need to be threaded are the ones where you want to do more than one thing at once. Other than that no application needs to be multi-threaded.

Applications with a large workload which can be easily made parallel. The difficulty of taking your application and doing that should not be underestimated. It is easy when your data you're manipulating is not dependent upon other data but v. hard to schedule the cross thread work when there is a dependency.
Some examples I've done which are good multithreaded candidates..
running scenarios (eg stock derivative pricing, statistics)
bulk updating data files (eg adding a value / entry to 10,000 records)
other mathematical processes

E.g., you want your programs to be multithreaded when you want to utilize multiple cores and/or CPUs, even when the programs don't necessarily do many things at the same time.
EDIT: using multiple processes is the same thing. Which technique to use depends on the platform and how you are going to do communications within your program, etc.

Although frivolous, games, in general are becomming more and more threaded every year. At work our game uses around 10 threads doing physics, AI, animation, redering, network and IO.

Just want to add that caution must be taken with treads if your sharing any resources as this can lead to some very strange behavior, and your code not working correctly or even the threads locking each other out.
mutex will help you there as you can use mutex locks for protected code regions, a example of protected code regions would be reading or writing to shared memory between threads.
just my 2 cents worth.

The main purpose of multithreading is to separate time domains. So the uses are everywhere where you want several things to happen in their own distinctly separate time domains.

HERE IS A PERFECT USE CASE
If you like affiliate marketing multi-threading is essential. Kick the entire process off via a multi-threaded application.
Download merchant files via FTP, unzipping the files, enumerating through each file performing cleanup like EOL terminators from Unix to PC CRLF then slam each into SQL Server via Bulk Inserts then when all threads are complete create the full text search indexes for a environmental instance to be live tomorrow and your done. All automated to kick off at say 11:00 pm.
BOOM! Fast as lightening. Heck you have so much time left you can even download merchant images locally for the products you download, save the images as webp and set the product urls to use local images.
Yep I did it. Wrote it in C#. Works like a charm. Purchase a AMD Ryzen Threadripper 64-core with 256gb memory and fast drives like nvme, get lunch come back and see it all done or just stay around and watch all cores peg to 95%+, listen to the pc's fans kick, warm up the room and the look outside as the neighbors lights flicker from the power drain as you get shit done.
Future would be to push processing to GPU's as well.
Ok well I am pushing it a little bit with the neighbors lights flickering but all else was absolutely true. :)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string