Memory addressing in assembly / multitasking

Memory addressing in assembly / multitasking - multithreading

I understand how programs in machine code can load values from memory in to registers, perform jumps, or store values in registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next. Is this done with timed interrupts? If so, then how can the values in registers be preserved for a thread. Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control. In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?

That's not how it works. All modern operating systems virtualize the available memory. Giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component in a machine that does this is the MMU, nowadays built in the processor itself. Another core feature of this virtualization is that it isolates processes. One misbehaving one cannot bring another one down with it.
Yes, a clock tick interrupt is used to interrupt the currently running code. Processor state is simply saved on the stack. The operating system scheduler then checks if any other thread is ready to run and has a high enough priority to get first in line. Some extra code ensures that everybody gets a fair share. Then it just a matter of setting the MMU to resume execution on the other thread. If no thread is ready to run then the CPU gets physically turned off with the HALT instruction. To be woken again by the next clock interrupt.
This is ten-thousand foot view, it is well covered in any book about operating system design.

A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various different addressing modes were more about local versus remote memory. Relative addressing was for memory addresses close to the current address while absolute was more expensive but could address anything. With modern virtual memory systems, these distinctions may be no longer necessary.
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. This is taken care of by the compiler normally. Again, modern virtual memory systems make make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB), a new context is loaded, the registers and other context is loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory or in can utilize processor specific operations to optimize this process.
Or, rather than timed interrupts, does the thread simply choose a good time to give up control.
The thread can yield control -- put itself back at the end of the run queue. It can also wait for some IO or sleep. Thread libraries then put the thread in wait queues and switch to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks. It waits for the lock in a wait queue. Once the lock is available, the thread is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting -- either on IO or a sleep. It can ask to yield but typically it is doing so by [again] sleeping or waiting on IO.

I probably walked into this question quite late, but then, it may be of use to some other programmers. First - the theory.
The modern day operating system will virtualize the memory, and to do so, it maintains, within its system memory area, a series of page pointers. Each page is of a fixed size (usually 4K), and when any program seeks some memory, its allocated memory addresses that are virtualized using the memory page pointer. Its approximates the behaviour of "segment" registers in the prior generation of the processors.
Now when the scheduler decides to get another process running, it may or may not keep the previous process in memory. If it keeps it in memory, then all that the scheduler does is to save the entire register snapshot (now, including YMM registers - this bit was a complex issue earlier as there are no single instructions that saved the entire context : read up on XSAVE), and this has a fixed format (available in Intel SW manual). This is stored in the memory space of the scheduler itself, along with the information on the memory pages that were being used.
If however, the scheduler needs to "dump" the current process context that is about to go to sleep to the hard disk - this situation usually arises when the process that is waking up needs extraordinary amount of memory, then the scheduler writes the memory page files in the disk blocks (called pagefile - reserved area of memory - also the source of "old grandmother wisdom" that pagefile must be equal to size of real memory) and the scheduler preserves the memory page pointer addresses as offsets in the pagefile. When it wakes up, the scheduler reads from pagefile the offset address, allocates real memory and populates the memory page pointers, and then loads the contents from the disk blocks.
Now, to answer your specific questions :
1. Do u need to use only relative addressing, or you can use absolute?
And. You may use either - whatever u perceive to be as absolute is also relative as the memory page pointer relativizes that address in an invisible format. There is no really absolute memory address anywhere (including the io device memories) except the kernel of the operating system itself. To test this, u may unassemble any .EXE program, to see that the entry point is always CALL 0010 which clearly implies that each thread gets a different "0010" to start the execution.
How do threads get life and what if it surrenders the unused slice.
Ans. The threads usually get a slice - modern systems have 20ms as the usual standard - but this is sometimes changed in special purpose compilation for servers that do not have many hardware interrupts to deal with - in order of their position on the process queue. A thread usually surrenders its slice by calling function sleep(), which is a formal (and very nice way) to surrender your balance part of the time slice. Most libraries implementing asynchronous reads, or interrupt actions, call sleep() internally, but in many instances, top level programs also call sleep() - e.g. to create a time gap. An invocation to sleep will certainly change the process context - the CPU actually is not given the liberty to sleep using NOP.
The other method is to wait for an IO to complete, and this is handled differently. The program on asking for an IO process, will cede its time slice, and the process scheduler flags this thread to be in "WAITING FOR AN IO" state - and this thread will not be given a time slice by the processor till its intended IO is completed, or timed out. This feature helps programmers as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.

Related

Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?

Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.). It seems that if the operation on the kernel-level thread is all through system calls, the answer to my question will be true. However, I've searched a lot to find whether the management of kernel-level thread is all through system call but find nothing. And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time. So it seems impossible for programmers to write some explicit system calls to manage threads. I'm appreciative of any ideas.

Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler which finds out that the current task has to wait while the kernel fetches data from from swap space or from a file (if the virtual memory is part of a memory mapped file), or has to wait for kernel to free up RAM by sending other pages to swap space (if virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so in user-space task switching is faster; but only if kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads); doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spend doing other work).
And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For examples; most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority thread" possibilities); and most CPUs have thermal management (where scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.

Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived cpu throughput is based on pipelining, speculative execution and fetching. The syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching, is a store and instruction barrier, and may flush the write fifo.
So, the processor slows down to its ‘spec’ speed around the syscall, accelerating back up until the syscall return, whereupon it does about the exact same thing.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and not conciliatory enough apologies from not embarrassed enough cpu product managers. Google spectre as an example, then follow the associated links.
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and (with the possible exception of the iax32 / 80286 ) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
So no, putting your junk in the kernel just means you have increased the attack surface for no valid reason. Try to think in terms of exposing minimum risk, it leads to better software and better sleep.

Context switch between kernel threads vs user threads

Copy pasted from this link:
Thread switching does not require Kernel mode privileges.
User level threads are fast to create and manage.
Kernel threads are generally slower to create and manage than the user threads.
Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.
I never came across these points while reading standard operating systems reference books. Though these points sound logical, I wanted to know how they reflect in Linux. To be precise :
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
Can someone link me to Linux source code line (say on github) handling context switch.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?

Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Let's imagine a thread needs to read data from a file, but the file isn't cached in memory and disk drives are slow so the thread has to wait; and for simplicity let's also assume that the kernel is monolithic.
For kernel threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes, sets the thread to "blocked waiting for IO" and switches to a different thread (that may belong to a completely different process, depending on global thread priorities). The kernel returns to the user-space of whatever thread it switch to.
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do the for (currently blocked) thread and unblocks that thread. At this point the kernel might decide to switch to the "now unblocked" thread; and the kernel returns to the user-space of the "now unblocked" thread.
For user threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes but can't take care of that because some fool decided to make everything worse by doing thread switching in user space, so the kernel returns to user-space with "IO request has been queued" status.
after the pointless extra overhead of switching back to user-space; the user-space scheduler does the thread switch that the kernel could have done. At this point the user-space scheduler will either tell kernel it has nothing to do and you'll have more pointless extra overhead switching back to kernel; or user-space scheduler will do a thread switch to another thread in the same process (which may be the wrong thread because a thread in a different process is higher priority).
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread; but the kernel isn't able to do the thread switch to unblock the thread because some fool decided to make everything worse by doing thread switching in user space. Now we've got a problem - how does kernel inform the user-space scheduler that the IO has finished? To solve this (without any "user-space scheduler running zero threads constantly polls kernel" insanity) you have to have some kind of "kernel puts notification of IO completion on some kind of queue and (if the process was idle) wakes the process up" which (on its own) will be more expensive than just doing the thread switch in the kernel. Of course if the process wasn't idle then code in user-space is going to have to poll its notification queue to find out if/when the "notification of IO completion" arrives, and that's going to increase latency and overhead. In any case, after lots of stupid pointless and avoidable overhead; the user-space scheduler can do the thread switch.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
The actual low-level context switch code typically begins with something like:
save whichever registers are "caller preserved" according to the calling conventions on the stack
save the current stack top in some kind of "thread info structure" belonging to the old thread
load a new stack top from some kind of "thread info structure" belonging to the new thread
pop whichever registers are "caller preserved" according to the calling conventions
return
However:
usually (for modern CPUs) there's a relatively large amount of "SIMD register state" (e.g. for 80x86 with support for AVX-512 I think it's over 4 KiB of of stuff). CPU manufacturers often have mechanisms to avoid saving parts of that state if it wasn't changed, and to (optionally) postpone the loading of (pieces of) that state until its actually used (and avoid it completely if its not actually used). All of that requires kernel.
if it's a task switch and not just used for thread switches you might need some kind of "if virtual address space needs to change { change virtual address space }" on top of that
normally you want to keep track of statistics, like how much CPU time a thread has used. This requires some kind of "thread_info.time_used += now() - time_at_last_thread_switch;"; which gets difficulty/ugly when "process switching" is separated from "thread switching".
normally there's other state (e.g. pointer to thread local storage, special registers for performance monitoring and/or debugging, ...) that may need to be saved/loaded during thread switches. Often this state is not directly accessible in user code.
normally you also want to set a timer to expire when the thread has used too much time; either because you're doing some kind of "time multiplexing" (e.g. round-robin scheduler) or because its a cooperating scheduler where you need to have some kind of "terminate this task after 5 seconds of not responding in case it goes into an infinite loop forever" safe-guard.
this is just the low level task/thread switching in isolation. There is almost always higher level code to select a task to switch to, handle "thread used too much CPU time", etc.
Can someone link me to Linux source code line (say on github) handling context switch
Someone probably can't. It's not one line; it's many lines of assembly for each different architecture, plus extra higher-level code (for timers, support routines, the "select a task to switch to" code, for exception handlers to support "lazy SIMD state load", ...); which probably all adds up to something like 10 thousand lines of code spread across 50 files.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Yes; often you're already in kernel code when you find out that a thread switch is needed.
Rarely/sometimes (mostly only due to communication between threads belonging to the same process - e.g. 2 or more threads in the same process trying to acquire the same mutex/semaphore at the same time; or threads sending data to each other and waiting for data from each other to arrive) kernel isn't involved; and in some cases (which are almost always massive design failures - e.g. extreme lock contention problems, failure to use "worker thread pools" to limit the number of threads needed, etc) it's possible for this to be the dominant cause of thread switches, and therefore possible that doing thread switches in user space can be beneficial (e.g. as a work-around for the massive design failures).

Don't limit yourself to Linux or even UNIX, they are neither the first nor last word on systems or programming models. The synchronous execution model dates back to the early days of computing, and are not particularly well suited to larger scale concurrent and reactive programming.
Golang, for example, employs a great many lightweight user threads -- goroutines -- and multiplexes them on a smaller set of heavyweight kernel threads to produce a more compelling concurrency paradigm. Some other programming systems take similar approaches.

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following pictures are picked from Memory Reordering Caught in the Act:
Below is recorded:
I think the recorded instructions can also cause issue on a single CPU, because both r1 and r2 aren't 1.

A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows it's own reordering, and can do clever tricks to pretend it's not. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for this. Imagine if CPU #1 is waiting for a write to variableA, then it reads from variableB. If CPU#2 wrotes to variableB, then variableA like the code says, no problems occur. If CPU#2 reorders to write to variableA first, then CPU#1 doesn't know and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes this not happen)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1 and be interrupted by process 2 and execute that between the two instructions. But if interrupted between them, it knows it has to abort both of them since it knows about it's own reordering, and knows it's in a dangerous state. So it will either do them in order, or do both before switching to process 2, or do neither before switching to process 2. All of which avoid the reordering problem.

There are multiple effects at work, but they are modeled as just one effect. Makes it easier to reason about them. Yes, a modern core already re-orders instructions by itself. But it maintains logical flow between them, if two instructions have an inter-dependency between them then they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon, it would be next to impossible to write a program if that wasn't the case. But that same guarantee cannot be provided by the memory controller. It has the un-enviable job of giving multiple processors access to the same shared memory.
First is the prefetcher, it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes. Ensures the core won't stall waiting for the read to complete. With the problem that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the read instruction executes. To an outside observer it looks like the instruction executed early.
And the store buffer, it takes the data of a write instruction and writes it lazily to memory. Later, after the instruction executed. Ensures the core won't stall waiting on the memory bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.

Process & thread scheduling overhead

There are a few things I don't quite understand when it come to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is ran. Is that assumption correct? Are there any ball park numbers how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by procces or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking) my thread will get swapped out until data is available then a hardware interrupt will be triggered and the data will be moved to the RAM (either by the CPU or by the NIC if it supports DMA) . Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle he incoming data?
I'm asking primarily about Linux, but I would imagine the info would also be applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.

I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is ran. Is that assumption correct? Are there any ball park numbers how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
Does the OS schedule by proces or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, threads have to run through a very large number of RAM accesses because switching threads requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread that's waiting for such a thing is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is long-treasured optimizations that use a lot of CPU to try to avoid disk accesses can be worse than just doing the disk access on some modern machines!

Thread context switch Vs. process context switch

Could any one tell me what is exactly done in both situations? What is the main cost each of them?

The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch.
Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A more fuzzy cost is that a context switch messes with the processors cacheing mechanisms. Basically, when you context switch, all of the memory addresses that the processor "remembers" in its cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor's Translation Lookaside Buffer (TLB) or equivalent gets flushed making memory accesses much more expensive for a while. This does not happen during a thread switch.

Process context switching involves switching the memory address space. This includes memory addresses, mappings, page tables, and kernel resources—a relatively expensive operation. On some architectures, it even means flushing various processor caches that aren't sharable across address spaces. For example, x86 has to flush the TLB and some ARM processors have to flush the entirety of the L1 cache!
Thread switching is context switching from one thread to another in the same process (switching from thread to thread across processes is just process switching).Switching processor state (such as the program counter and register contents) is generally very efficient.

First of all, operating system brings outgoing thread in a kernel mode if it is not already there, because thread switch can be performed only between threads, that runs in kernel mode. Then the scheduler is invoked to make a decision about thread to which will be performed switching. After decision is made, kernel saves part of the thread context that is located in CPU (CPU registers) into the dedicated place in memory (frequently on the top of the kernel stack of outgoing thread). Then the kernel performs switch from kernel stack of outgoing thread on to kernel stack of the incoming thread. After that, kernel loads previously stored context of incoming thread from memory into CPU registers. And finally returns control back into user mode, but in user mode of the new thread.
In the case when OS has determined that incoming thread runs in another process, kernel performs one additional step: sets new active virtual address space.
The main cost in both scenarios is related to a cache pollution. In most cases, the working set used by the outgoing thread will differ significantly from working set which is used by the incoming thread. As a result, the incoming thread will start its life with avalanche of cache misses, thus flushing old and useless data from the caches and loading the new data from memory. The same is true for TLB (Translation Look Aside Buffer, which is on the CPU). In the case of reset of virtual address space (threads run in different processes) the penalty is even worse, because reset of virtual address space leads to the flushing of the entire TLB, even if new thread actually needs to load only few new entries. As a result, the new thread will start its time quantum with lots TLB misses and frequent page walking. Direct cost of threads switch is also not negligible (from ~250 and up to ~1500-2000 cycles) and depends on the CPU complexity, states of both threads and sets of registers which they actually use.
P.S.: Good post about context switch overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

process switching: it is a transition between two memory resident of process in a multiprogramming environment;
context switching: it is a changing context from an executing program to an interrupt service routine (ISR).

In Thread Context Switching, the virtual memory space remains the same while it is not in the case of Process Context Switch. Also, Process Context Switch is costlier than Thread Context Switch.

I think main difference is when calling switch_mm() which handles memory descriptors of old and new task. In the case of threads, the virtual memory address space is unchanged (threads share virtual memory), so very little has to be done, and therefore less costly.

Though thread context switching needs to change the execution context (registers, stack pointers, program counters), they don't need to change address space as processes context switches do. There's an additional cost when you switch address space, more memory access (paging, segmentation, etc) and you have to flush TLB when entering or exiting a new process...

In short, the thread context switch does not assign a brand new set of memory and pid, it uses the same as the parent since it is running within the same process. A process one spawns a new process and thus assigns new mem and pid.
There is a loooooot more to it. They have written books on it.
As for cost, a process context switch >>>> thread as you have to reset all of the stack counters etc.

Assuming that The CPU the OS runs has got Some High Latency Devices Attached,
It makes sense to run another thread Of the Process's Address Space, while the high latency device responds back.
But, if the High Latency Device is responding faster than the time to need do set up of table + translation of Virtual To Physical memories for a NEW Process, then it is questionable if a switch is essential at all.
Also, HOT cache(data needed for running the process/thread is reachable in less time) is better choice.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string