I have searched through the various posts and answers regarding a perpetually blocked finalizer thread. Several seem useful, but I need additional clarity.
In my application API calls are coming from a Web Service to an NT Service. I am working in a test environment and can reproduce this "effect" over and over. In my test I will follow this general procedure:
Run a load test in Visual Studio (10 users) to make a pattern of API calls. (My time for the load test is set to 99 hours, so that I can run it continuously and simply abort the test when I want to stop the load, and check on the conditions of the process.)
Start Perfmon and monitor key statistics, private bytes, virtual bytes, threads, handles, #time in GC, etc....
Stop the load test.
Attach Windbg to the process. Check to see if something is going wrong. (FinalizeQueue, etc.)
If nothing is wrong, detach and restart the load test.
I'll follow this general procedure and alter the test to exercise various API calls, to see what those APIs do to the memory profile of the application.
As I was going through this, I noticed that occasionally, when I stop the load test, I get a "thread was being aborted" exception. This is also rarely seen in the logs from customer environments; it seems to happen when an API call is made and the client then disconnects for whatever reason.
However, after getting this exception, I notice that the finalizer thread of my process is hung. (The NT service, not w3wp.)
Regardless of how long the process is left unburdened and unused, this is the top of the output from !finalizequeue:
0:002> !finalizequeue
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
generation 0 has 15 finalizable objects (0c4b3db8->0c4b3df4)
generation 1 has 9 finalizable objects (0c4b3d94->0c4b3db8)
generation 2 has 771 finalizable objects (0c4b3188->0c4b3d94)
Ready for finalization 178 objects (0c4b3df4->0c4b40bc)
I can add to the objects "Ready For Finalization" by making API calls, but the Finalizer thread never seems to move on and empty the list.
The finalizer thread shown above is thread 002. Here is the call-stack:
0:002> !clrstack
OS Thread Id: 0x29bc (2)
Child SP IP Call Site
03eaf790 77b4f29c [DebuggerU2MCatchHandlerFrame: 03eaf790]
0:002> kb 2000
# ChildEBP RetAddr Args to Child
00 03eae910 773a2080 00000001 03eaead8 00000001 ntdll!NtWaitForMultipleObjects+0xc
01 03eaeaa4 77047845 00000001 03eaead8 00000000 KERNELBASE!WaitForMultipleObjectsEx+0xf0
02 03eaeaf0 770475f5 05b72fb8 00000000 ffffffff combase!MTAThreadWaitForCall+0xd5 [d:\rs1\onecore\com\combase\dcomrem\channelb.cxx # 7290]
03 03eaeb24 77018457 03eaee5c 042c15b8 05b72fb8 combase!MTAThreadDispatchCrossApartmentCall+0xb5 [d:\rs1\onecore\com\combase\dcomrem\chancont.cxx # 227]
04 (Inline) -------- -------- -------- -------- combase!CSyncClientCall::SwitchAptAndDispatchCall+0x38a [d:\rs1\onecore\com\combase\dcomrem\channelb.cxx # 6050]
05 03eaec40 76fbe16b 03eaee5c 03eaee34 03eaee5c combase!CSyncClientCall::SendReceive2+0x457 [d:\rs1\onecore\com\combase\dcomrem\channelb.cxx # 5764]
06 (Inline) -------- -------- -------- -------- combase!SyncClientCallRetryContext::SendReceiveWithRetry+0x29 [d:\rs1\onecore\com\combase\dcomrem\callctrl.cxx # 1734]
07 (Inline) -------- -------- -------- -------- combase!CSyncClientCall::SendReceiveInRetryContext+0x29 [d:\rs1\onecore\com\combase\dcomrem\callctrl.cxx # 632]
08 03eaec9c 77017daa 05b72fb8 03eaee5c 03eaee34 combase!DefaultSendReceive+0x8b [d:\rs1\onecore\com\combase\dcomrem\callctrl.cxx # 590]
09 03eaee10 76f72fa5 03eaee5c 03eaee34 03eaf3d0 combase!CSyncClientCall::SendReceive+0x68a [d:\rs1\onecore\com\combase\dcomrem\ctxchnl.cxx # 767]
0a (Inline) -------- -------- -------- -------- combase!CClientChannel::SendReceive+0x7c [d:\rs1\onecore\com\combase\dcomrem\ctxchnl.cxx # 702]
0b 03eaee3c 76e15eea 05b59e04 03eaef48 17147876 combase!NdrExtpProxySendReceive+0xd5 [d:\rs1\onecore\com\combase\ndr\ndrole\proxy.cxx # 1965]
0c 03eaf358 76f73b30 76f5bd70 76f84096 03eaf390 rpcrt4!NdrClientCall2+0x53a
0d 03eaf378 7706313f 03eaf390 00000008 03eaf420 combase!ObjectStublessClient+0x70 [d:\rs1\onecore\com\combase\ndr\ndrole\i386\stblsclt.cxx # 217]
0e 03eaf388 77026d85 05b59e04 03eaf3d0 011d0940 combase!ObjectStubless+0xf [d:\rs1\onecore\com\combase\ndr\ndrole\i386\stubless.asm # 171]
0f 03eaf420 77026f30 011d0930 73b1a9e0 03eaf4e4 combase!CObjectContext::InternalContextCallback+0x255 [d:\rs1\onecore\com\combase\dcomrem\context.cxx # 4401]
10 03eaf474 73b1a88b 011d0940 73b1a9e0 03eaf4e4 combase!CObjectContext::ContextCallback+0xc0 [d:\rs1\onecore\com\combase\dcomrem\context.cxx # 4305]
11 03eaf574 73b1a962 73b689d0 03eaf60c b42b9326 clr!CtxEntry::EnterContext+0x252
12 03eaf5ac 73b1a9a3 73b689d0 03eaf60c 00000000 clr!RCW::EnterContext+0x3a
13 03eaf5d0 73b1eed3 03eaf60c b42b936a 740ecf60 clr!RCWCleanupList::ReleaseRCWListInCorrectCtx+0xbc
14 03eaf62c 73b6118f b42b90f6 03eaf790 00000000 clr!RCWCleanupList::CleanupAllWrappers+0x119
15 03eaf67c 73b61568 03eaf790 73b60f00 00000001 clr!SyncBlockCache::CleanupSyncBlocks+0xd0
16 03eaf68c 73b60fa9 b42b9012 03eaf790 73b60f00 clr!Thread::DoExtraWorkForFinalizer+0x75
17 03eaf6bc 73a7b4c9 03eaf7dc 011a6248 03eaf7dc clr!FinalizerThread::FinalizerThreadWorker+0xba
18 03eaf6d0 73a7b533 b42b91fe 03eaf7dc 00000000 clr!ManagedThreadBase_DispatchInner+0x71
19 03eaf774 73a7b600 b42b915a 73b7a760 73b60f00 clr!ManagedThreadBase_DispatchMiddle+0x7e
1a 03eaf7d0 73b7a758 00000001 00000000 011b3120 clr!ManagedThreadBase_DispatchOuter+0x5b
1b 03eaf7f8 73b7a81f b42b9ebe 73b7a760 00000000 clr!ManagedThreadBase::FinalizerBase+0x33
1c 03eaf834 73af15a1 00000000 00000000 00000000 clr!FinalizerThread::FinalizerThreadStart+0xd4
1d 03eaf8d0 753c62c4 011ae320 753c62a0 c455bdb0 clr!Thread::intermediateThreadProc+0x55
1e 03eaf8e4 77b41f69 011ae320 9d323ee5 00000000 kernel32!BaseThreadInitThunk+0x24
1f 03eaf92c 77b41f34 ffffffff 77b6361e 00000000 ntdll!__RtlUserThreadStart+0x2f
20 03eaf93c 00000000 73af1550 011ae320 00000000 ntdll!_RtlUserThreadStart+0x1b
By repeating this several times, I have correlated getting the "thread was being aborted" exception with getting a hung finalizer showing this call-stack each time. Can anyone clarify what the finalizer is waiting on, and point out anything else that might help identify a solution?
Thanks! Feel free to ask questions if you need more information.
Edited Additions follow:
My supervisor sent me the code for System.ComponentModel.Component after reading the analysis by Hans, and I need to try and get clarification on a certain point. (Even if the code that I was sent was the wrong version somehow.)
Here is the finalizer for System.ComponentModel.Component:
~Component() {
Dispose(false);
}
Here is Dispose(bool):
protected virtual void Dispose(bool disposing) {
    // Note: disposing is false when called from the finalizer,
    // so the entire body below is skipped on the finalizer thread.
    if (disposing) {
        lock(this) {
            if (site != null && site.Container != null) {
                site.Container.Remove(this);
            }
            if (events != null) {
                EventHandler handler = (EventHandler)events[EventDisposed];
                if (handler != null) handler(this, EventArgs.Empty);
            }
        }
    }
}
Hopefully I am not making a dumb blunder here, but if the finalizer thread only calls Dispose(false), then it cannot possibly block on anything. If that is true, then am I barking up the wrong tree? Should I be looking for some other class?
How can I use the dump file that I have to determine the actual object type that the finalizer is hanging on?
2nd Edit:
I ran my load test and monitored Finalization Survivors until I saw the counter start to rise, then stopped the load test. The result was that Finalization Survivors was "stuck" at 88, even after I induced a GC a few times, and it never went down. I ran iisreset and restarted SQL Server, but the statistic remained stuck at 88.
I took a dump of the process, and captured the finalizequeue output, and while I was puzzling about that, I noticed that perfmon suddenly registered a drop in finalization survivors about 15 minutes after I took the dump.
I took another dump, and recaptured the finalizequeue output and compared it to the previous. Most of it was the same with the following differences:
Before After Diff Type
4 0 -4 System.Transactions.SafeIUnknown
6 0 -6 OurCompany.Xpedite.LogOutResponseContent
20 0 -20 System.Net.Sockets.OverlappedCache
8 2 -6 System.Security.SafeBSTRHandle
21 3 -18 System.Net.SafeNativeOverlapped
6 4 -2 Microsoft.Win32.SafeHandles.SafeFileHandle
8 9 1 System.Threading.ThreadPoolWorkQueueThreadLocals
19 1 -18 System.Net.Sockets.OverlappedAsyncResult
7 2 -5 System.Data.SqlClient.SqlDataAdapter
7 2 -5 System.Data.DataSet
13 7 -6 OurCompany.Services.Messaging.Message
79 13 -66 Microsoft.Win32.SafeHandles.SafeAccessTokenHandle
6 4 -2 System.IO.FileStream
24 3 -21 System.Data.SqlClient.SqlConnection
40 22 -18 Microsoft.Win32.SafeHandles.SafeWaitHandle
17 3 -14 System.Data.SqlClient.SqlCommand
14 4 -10 System.Data.DataColumn
7 2 -5 System.Data.DataTable
21 20 -1 System.Threading.Thread
73 68 -5 System.Threading.ReaderWriterLock
9/16/2022, Additional information leading to a cause and possible solution:
By examining the call-stack of the finalizer thread, I found that frame 3's variables have an entry called the pOXIDEntry, which contains the thread number of the target thread that the finalizer is waiting on. To illustrate (from a debugger screenshot that is not reproduced here): the finalizer is thread 2 (OSID 382c), and in the vars passed to MTAThreadDispatchCrossApartmentCall it is targeting thread 5 (OSID 2e30). You will also note that this target thread is marked (mysteriously) as being an STA, rather than an MTA as all of the other threadpool workers are.
When a COM object that is created on an STA thread is cleaned up, marshalling is done to the target thread as mentioned here: https://learn.microsoft.com/en-us/troubleshoot/windows/win32/descriptions-workings-ole-threading-models
The problem in this case is that the code has erroneously set Thread.CurrentThread.ApartmentState = ApartmentState.STA for some unknown reason (this is under investigation).
When that line of code is executed, the thread is indeed marked as being an STA, but it does not automatically get a message pump to process messages that are sent to it. As such it will never receive a message sent to it, and hence never reply.
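For contrast, here is a minimal sketch (plain Win32 C, not our application's code) of what a correctly functioning STA thread looks like: it initializes the apartment and then runs a message pump so that marshaled calls can be dispatched to it. The worker threads in our case were marked STA but never ran such a loop.

#include <windows.h>
#include <objbase.h>

/* Minimal sketch of a well-behaved STA thread: without the GetMessage
 * loop below, calls marshaled to this thread (including the finalizer's
 * RCW cleanup call) would wait forever, exactly as seen above. */
DWORD WINAPI sta_thread(LPVOID arg)
{
    MSG msg;

    CoInitializeEx(NULL, COINIT_APARTMENTTHREADED); /* join/create an STA */

    while (GetMessage(&msg, NULL, 0, 0) > 0) {      /* the message pump */
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    CoUninitialize();
    return 0;
}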
The issue of the blocked finalizer does not always happen because, usually, all COM objects are explicitly disposed on the API thread that created them. So in virtually all normal processing, the finalizer thread is not involved in the cleanup of these COM objects.
The one rare exception is when a Thread.Abort occurs on one of these threads, which can be imposed on the thread by code in the WSE2 client library. Depending on exactly when this happens, a COM object can be left for the finalizer to clean up. Only in this case does the finalizer see the object, see that it was created on an STA thread, and attempt to marshal the cleanup call to that thread. When that happens, the worker thread that was erroneously marked as STA never replies, and hence the finalizer blocks at that point.
This issue is now resolved.
Ultimately, the source of this issue is that calls were being made in one of our Win32 client libraries to CoInitialize using the STA model. This class of issue was identified by Microsoft prior to 2006, and is detailed in depth in a blog post by Microsoft Principal Developer Tess Ferrandez at this address: https://www.tessferrandez.com/blog/2006/03/26/net-memory-leak-unblock-my-finalizer.html
As detailed in the post, COM can flip a ThreadPool worker thread over to the STA model, but threadpool threads were not designed to support that threading model.
The application worked most of the time because in almost all cases the COM objects were forcefully disposed on the API thread that created them, and the finalizer was not involved in the cleanup. The rare exception was when the WSE2 client library forced a Thread.Abort on one of our API threads at just such a moment that a forceful dispose became impossible; in those cases the finalizer was involved.
The fix for this issue was to search throughout our client library for calls to CoInitialize(nil) and convert those calls, and the underlying code, to use the multithreaded model: essentially, change them to CoInitializeEx(nil, COINIT_MULTITHREADED);
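For clarity, here is a minimal sketch of that change in C (library_com_init is an illustrative name, not our real routine):

#include <objbase.h>

/* Illustrative initialization routine; the real calls live in our
 * Win32 client library. */
HRESULT library_com_init(void)
{
    /* Before the fix (creates/joins a single-threaded apartment):
     *     return CoInitialize(NULL);
     */
    return CoInitializeEx(NULL, COINIT_MULTITHREADED); /* the fix */
}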
After initializing to the MTA version of COM, there was no longer any need for the finalizer to do any marshalling in these Thread.Abort cases, and the issue was resolved.
I'm having issues with data corruption in an Ethernet driver (for the ST MAC 10/100/1000) that I'm working with. The driver runs on an Allwinner A20 (ARM Cortex-A7).
Some details:
The driver holds a ring of Rx sk_buffs (allocated with __netdev_alloc_skb_ip_align).
The data (rx_skbuff[i]->data) of each sk_buff is mapped for DMA using dma_map_single.
The mapping succeeds (verified with dma_mapping_error).
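For reference, here is a condensed sketch (in C) of that init-time allocate-and-map pattern, loosely based on the 4.19 stmmac driver; it is illustrative rather than a verbatim copy, and the function name is mine:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

/* Sketch: allocate one RX ring entry and map its data for the device. */
static int init_rx_ring_entry(struct stmmac_priv *priv,
                              struct stmmac_rx_queue *rx_q,
                              int i, int bfsize)
{
    struct sk_buff *skb;

    skb = __netdev_alloc_skb_ip_align(priv->dev, bfsize, GFP_KERNEL);
    if (!skb)
        return -ENOMEM;
    rx_q->rx_skbuff[i] = skb;

    /* Map skb->data for the device; the CPU must not touch the buffer
     * until dma_sync_single_for_cpu() (or an unmap) hands it back. */
    rx_q->rx_skbuff_dma[i] = dma_map_single(priv->device, skb->data,
                                            bfsize, DMA_FROM_DEVICE);
    if (dma_mapping_error(priv->device, rx_q->rx_skbuff_dma[i])) {
        dev_kfree_skb_any(skb);
        rx_q->rx_skbuff[i] = NULL;
        return -ENOMEM;
    }
    return 0;
}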
The problem:
After a while (minutes, hours, days... very random), kernel panics due to data corruption.
Debugging (EDITED):
After digging a bit more, I found that sometimes, after a while, one of the sk_buff structures gets corrupted, and this may lead the driver to do things it should not and thus cause the kernel panic.
After some more digging, I found that the corruption occurs after skb_copy_to_linear_data (which is essentially a memcpy). Keep in mind that the corruption doesn't occur after every call to skb_copy_to_linear_data, but when it does occur, it is always right after such a call.
When the corruption occurs, it doesn't happen on the rx_q->rx_skbuff of the current entry (rx_q->rx_skbuff[entry]). For example, if we perform the skb_copy_to_linear_data on rx_q->rx_skbuff[X], the corrupted sk_buff structure will be rx_q->rx_skbuff[Y] (where X != Y).
It seems that the physical address of the skb->data (allocated right before the skb_copy_to_linear_data call) is the same as the physical address of rx_q->rx_skbuff[Y]->end. The first thing I thought was that maybe the driver doesn't know rx_q->rx_skbuff[Y] has been released, but when this collision occurs I see that rx_q->rx_skbuff[Y]->users is 1.
How could that be? Any ideas?
Not sure if this is a problem or not, but here is something I've noticed: the rx sk_buffs are allocated in two places. Once, when the driver is initialized, it calls __netdev_alloc_skb_ip_align with GFP_KERNEL; and again, if the rx_skbuff has already been freed, it calls netdev_alloc_skb_ip_align (which internally uses GFP_ATOMIC). Shouldn't these skb allocations use GFP_DMA?
Code:
Here is part of the code where the corruption occurs.
The full code of the driver is from linux kernel mainline 4.19, and it can be found here.
I've pasted here only the part between lines 3451-3474.
Does anyone see a wrong behavior here regarding the use of the DMA-API?
skb = netdev_alloc_skb_ip_align(priv->dev, frame_len);
if (unlikely(!skb)) {
        if (net_ratelimit())
                dev_warn(priv->device, "packet dropped\n");
        priv->dev->stats.rx_dropped++;
        continue;
}

dma_sync_single_for_cpu(priv->device, rx_q->rx_skbuff_dma[entry],
                        frame_len, DMA_FROM_DEVICE);

// Here I check if data has been corrupted (the answer is ALWAYS NO).
debug_check_data_corruption();

skb_copy_to_linear_data(skb, rx_q->rx_skbuff[entry]->data, frame_len);

// Here I check again if data has been corrupted (the answer is SOMETIMES YES).
debug_check_data_corruption();

skb_put(skb, frame_len);

dma_sync_single_for_device(priv->device, rx_q->rx_skbuff_dma[entry],
                           frame_len, DMA_FROM_DEVICE);
Some last notes:
I tried running the kernel with CONFIG_DMA_API_DEBUG enabled. It is not always triggered, but when I catch the corruption myself (with my debug function), sometimes I can see that /sys/kernel/debug/dma-api/num_errors has been increased, and sometimes I also get this log: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000006879f902] [size=61 bytes]
I've also enabled CONFIG_DEBUG_KMEMLEAK, and right after I catch the data corruption event, I get this log: kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak). I still don't understand what clues it offers, although the leak does seem to come from the same part of the code I've pasted here (__netdev_alloc_skb is called from __netdev_alloc_skb_ip_align). This is what /sys/kernel/debug/kmemleak displays:
unreferenced object 0xe9ea52c0 (size 192):
comm "softirq", pid 0, jiffies 6171209 (age 32709.360s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 40 4d 2d ea ............@M-.
00 00 00 00 00 00 00 00 d4 83 7c b3 7a 87 7c b3 ..........|.z.|.
backtrace:
[<045ac811>] __netdev_alloc_skb+0x9f/0xdc
[<4f2b009a>] stmmac_napi_poll+0x89b/0xfc4
[<1dd85c70>] net_rx_action+0xd3/0x28c
[<1c60fabb>] __do_softirq+0xd5/0x27c
[<9e007b1d>] irq_exit+0x8f/0xc0
[<beb36a07>] __handle_domain_irq+0x49/0x84
[<67c17c88>] gic_handle_irq+0x39/0x68
[<e8f5dc30>] __irq_svc+0x65/0x94
[<075bc7c7>] down_read+0x8/0x3c
[<075bc7c7>] down_read+0x8/0x3c
[<790c6556>] get_user_pages_unlocked+0x49/0x13c
[<544d56e3>] get_futex_key+0x77/0x2e0
[<1fd5d0e9>] futex_wait_setup+0x3f/0x144
[<8bc86dff>] futex_wait+0xa1/0x198
[<b362fbc0>] do_futex+0xd3/0x9a8
[<46f336be>] sys_futex+0xcd/0x138
I'm looking for ways to learn which syscalls or which subsystems a process or thread spends time waiting in, i.e. blocked and not scheduled to run on a CPU.
Specifically if I have some unknown process, or a process where all we know is "it's slow" I'd like to be able to learn things like:
"it spends 80% of its time in sys_write() on fd 13 which is /some/file"
"it's spending a lot of time waiting to read() from a network socket"
"it's sleeping in epoll_wait() for activity on fds [4,5,6] which are [file /boo], [socket 10.1.1.:42], [notifyfd blah]"
In other words when my program is not running on the CPU what is it doing?
This is astonishingly hard to answer with perf, because it does not appear to have any way to record the duration of a syscall from sys_enter to sys_exit, or otherwise keep track of how long an event lasts, presumably due to its sampling nature.
I'm aware of some experimental work with eBPF for Linux 4.6 and above that may help, with Brendan Gregg's off-cpu work. But in the sad world of operations and support a 4.6 kernel is a rare unicorn to be treasured.
What are the real world options?
Do ftrace, systemtap etc offer any insights here?
You can use strace. First, you might want to get a high-level summary of the costs of each type of system call. You can obtain this summary by running strace -c. For example, one possible output is the following:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
90.07 0.000263 26 10 getdents
3.42 0.000010 0 1572 read
3.42 0.000010 0 762 1 stat
3.08 0.000009 0 1574 6 open
0.00 0.000000 0 11 write
0.00 0.000000 0 1569 close
0.00 0.000000 0 48 fstat
The % time value is with respect to overall kernel time, not overall execution time (kernel+user). This summary tells you what the most expensive system calls are. However, if you need to determine which specific instances of system calls are most expensive and what arguments are passed to them, you can run strace -i -T. The -i option shows the instruction addresses of the instructions that performed the system calls, and the -T option shows the time spent in each system call. An output might look like this:
[00007f97f1b37367] open("myfile", O_RDONLY|O_CLOEXEC) = 3 <0.000020>
[00007f97f1b372f4] fstat(3, {st_mode=S_IFREG|0644, st_size=159776, ...}) = 0 <0.000018>
[00007f97f1b374ba] mmap(NULL, 159776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f97f1d19000 <0.000019>
[00007f97f1b37467] close(3) = 0 <0.000018>
The first column shows instruction addresses, the second column shows system calls with their arguments, the third column shows the returned value, and the last column shows the time spent in that system call. This list is ordered by the dynamic occurrence of the system calls. You can filter this output using either grep or the -e option. The instruction addresses can help you locate where in the source code these system calls are made. For example, if a long sequence of system calls has the same address, then there is a good chance that you have a loop somewhere in the code that contains the system call. If your executable binary is not a PIE, the dynamic addresses are the same as the static addresses shown by objdump. But even with PIE, the relative order of the dynamic addresses is the same. I don't know if there is an easy way to map these system calls to source code lines.
If you want to find out things like "it spends 80% of its time in sys_write() on fd 13 which is /some/file" then you need to write a script that first extracts the return values of all open calls and the corresponding file name arguments and then sum up the times of all sys_write calls whose fd argument is equal to some value.
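As a rough illustration of such a script, here is a deliberately naive C post-processor (entirely my own sketch). It assumes a trace produced with something like strace -T -e trace=open,write -o trace.log ./prog (single process, no -f or -i prefixes), remembers which fd each open() returned, and sums the write() time per fd:

#include <stdio.h>
#include <string.h>

#define MAX_FDS 1024

/* Naive strace post-processor: feed it the trace on stdin. */
int main(void)
{
    static char fdname[MAX_FDS][256];
    static double fdtime[MAX_FDS];
    char line[4096];

    while (fgets(line, sizeof line, stdin)) {
        char path[256];
        int fd;
        double t;

        /* open("path", ...) = fd <time>: remember the fd -> path mapping */
        if (sscanf(line, "open(\"%255[^\"]", path) == 1) {
            char *eq = strrchr(line, '=');
            if (eq && sscanf(eq, "= %d", &fd) == 1 && fd >= 0 && fd < MAX_FDS)
                snprintf(fdname[fd], sizeof fdname[fd], "%s", path);
        /* write(fd, ...) = n <time>: accumulate time per fd */
        } else if (sscanf(line, "write(%d,", &fd) == 1 && fd >= 0 && fd < MAX_FDS) {
            char *lt = strrchr(line, '<');
            if (lt && sscanf(lt, "<%lf>", &t) == 1)
                fdtime[fd] += t;
        }
    }
    for (int fd = 0; fd < MAX_FDS; fd++)
        if (fdtime[fd] > 0.0)
            printf("fd %d (%s): %.6f s in write()\n",
                   fd, fdname[fd][0] ? fdname[fd] : "?", fdtime[fd]);
    return 0;
}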
I received the following messages in /var/log/message:
Sep 6 04:23:30 localhost kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Sep 6 04:23:30 localhost kernel: mptbase: ioc0: SMART data received, ASC/ASCQ = 5dh/00h
Sep 6 04:26:01 localhost kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Sep 6 04:26:01 localhost kernel: mptbase: ioc0: SMART data received, ASC/ASCQ = 5dh/05h
This message is repeated frequently, and it has been going on for two weeks, but the server seems OK; I haven't noticed any service failure.
What do these messages mean?
So 5Dh is for "Informational Exceptions". The 05h is "Access Times exceeding limits", which doesn't sound too worrisome unless performance is your main concern. The 00h is the generic "Failure Prediction Threshold Exceeded" code (see the SPC-2 table below), so the drive itself is predicting its own failure. If you want to be really proactive, go ahead and replace the drive.
5Dh reference
From the bowels of Google's cache:
From: "Elliott, Robert (Hou)" <Robert.Elliott#COMPAQ.com>
To: "'t10#symbios.com'" <t10#aztec.co.lsil.com>
Subject: ASC/ASCQ 5Dh and SMART disk drives
Date: Tue, 6 Jul 1999 10:07:37 -0500
Extracted-To: T10_Reflector
* From the T10 Reflector (t10#symbios.com), posted by:
* "Elliott, Robert (Hou)" <Robert.Elliott#COMPAQ.com>
*
ASC code 5Dh is used for Informational Exceptions. Disk drives following
the "SMART" (non)standard use ASCQs from 10-7Fh to report detailed failure
prediction information. However, SPC-2 Table C.1 only defines those ASCQs
for RBC devices. RBC Table 18 defines the meaning of each code in that
region.
How should we make these codes legal for SBC devices? Ralph doesn't want
to just add SBC to the list of standards that use those codes, since it
doesn't define their meaning. A reader wouldn't know to refer to RBC for
the definitions. The codes are too disk-specific for SPC-2 itself.
If an SBC-2 project is started, it could certainly go there. Gene noted
that the table could be added to the ISO version of SBC, since that is
still open.
Background:
SPC-2 revision 11 lists these ASC/ASCQ assignments in its annex
(table C.1):
5D 00 Failure Prediction Threshold Exceeded (all devices)
5D 01 Media Failure Prediction Threshold Exceeded (MMC-2, RBC)
5D 02 Logical Unit Failure Prediction Threshold Exceeded (MMC-2)
5D FF Failure Prediction Threshold Exceeded (False) (all devices)
5D nn Detailed Failure Prediction Information (nn=10h-7Fh)(RBC)
RBC defines the ASCQs in this manner:
Value Meaning
upper nibble:
0 General Hard Drive Failure
1h Hardware impending failure
2h Controller impending failure
3h Data Channel impending failure
4h Servo impending failure
5h Spindle impending failure
6h Firmware impending failure
7h Reserved
8h-Fh Vendor-specific in SPC-2
lower nibble:
0 General Hard Drive Failure
1h Drive Error threshold exceeding limits.
2h Data Error Rate exceeding limits.
3h Seek Error Rate exceeding limits.
4h LBA reassignment exceeding limits.
5h Access Times exceeding limits.
6h Start Unit Times exceeding limits.
7h Channel parametrics indicate impending failure
8h Controller detected impending failure.
9h Throughput performance
Ah Seek time performance
Bh Spin-up retry count
Ch Drive calibration retry count
Dh-Eh Reserved.
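Putting the quoted tables together, here is a small illustrative C decoder for ASC 5Dh ASCQ values (decode_5d is my own sketch; only part of each table is included):

#include <stdio.h>

static const char *upper_nibble[] = {
    "General hard drive failure",   "Hardware impending failure",
    "Controller impending failure", "Data channel impending failure",
    "Servo impending failure",      "Spindle impending failure",
    "Firmware impending failure",
};
static const char *lower_nibble[] = {
    "General hard drive failure",
    "Drive error threshold exceeding limits",
    "Data error rate exceeding limits",
    "Seek error rate exceeding limits",
    "LBA reassignment exceeding limits",
    "Access times exceeding limits",
    "Start unit times exceeding limits",
};

static void decode_5d(unsigned ascq)
{
    unsigned hi = (ascq >> 4) & 0xF, lo = ascq & 0xF;

    if (ascq == 0x00) {  /* generic code from SPC-2 table C.1 */
        printf("5Dh/00h: Failure Prediction Threshold Exceeded\n");
        return;
    }
    printf("5Dh/%02Xh: %s / %s\n", ascq,
           hi < 7 ? upper_nibble[hi] : "reserved or vendor-specific",
           lo < 7 ? lower_nibble[lo] : "(see the full RBC table)");
}

int main(void)
{
    decode_5d(0x00);  /* first message from the log */
    decode_5d(0x05);  /* second message: access times exceeding limits */
    return 0;
}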
I have a .dll that is statically linked to the MFC. During normal use of the .dll, it will create a worker thread using AfxBeginThread. Inside the function for that thread I create two arrays:
CByteArray ReadBuffer;
ReadBuffer.SetSize(92);
CByteArray PacketBuffer;
PacketBuffer.SetSize(46);
Those buffers will change size (typically to much larger) during program execution. The problem is, when the program exits, the thread's function seems to terminate without ever getting a chance to free the memory allocated by those arrays. I have an ExitInstance() overload in the .dll to do other cleanup, but by the time it is reached I already get these messages:
The thread 'Win32 Thread' (0x2208) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1e34) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1ff8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1fc0) has exited with code 0 (0x0).
These show the threads being terminated in the middle of execution, without being given time to call any destructors or do any cleanup.
I tried to create my own CWinThread object with an overloaded ExitInstance() function, but again those threads exit before that function is called.
After the .dll's close function is called and the memory cleaned up, I get this:
Detected memory leaks!
Dumping objects ->
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\array_b.cpp(110) : {258} normal block at 0x023686E8, 2048 bytes long.
Data: < > C6 F9 1D C0 E2 0C 00 00 00 00 AA 00 C0 11 E0 11
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\array_b.cpp(110) : {201} normal block at 0x02366F00, 64 bytes long.
Data: <b # > 62 F9 1D C0 81 09 00 00 00 00 7F 00 40 11 E0 10
Object dump complete.
These show leaks caused by the ReadBuffer and PacketBuffer from above. I can't find a way to properly close the threads and clean up their memory before the program exits. I have a way to gracefully terminate the threads, but I can't find a spot to execute it before the program terminates.
I am not sure this is even a big issue, since the program is terminating, but I have always thought .dlls should clean up all of their own memory, just to be safe.
I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?
Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.
The speed depends on the TLS implementation.
Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.
For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.
Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.
If the scheduler does not support any of these methods, the compiler/library has to do the following:
Get the current thread id.
Take a semaphore.
Look up the pointer to the TLS block by thread id (this may use a map or similar).
Release the semaphore.
Return that pointer.
Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the thread id, and taking and releasing the semaphore.
The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and as such allocating a new TLS block and modifying the data structure). A sketch of this slow path follows.
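Sketched in C with POSIX threads (names and structure invented for illustration):

#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 128

static pthread_mutex_t tls_lock = PTHREAD_MUTEX_INITIALIZER;
static struct { pthread_t tid; void *block; } tls_table[MAX_THREADS];
static int tls_used;

/* Emulated TLS access, following the steps listed above. */
void *slow_tls_get(size_t block_size)
{
    pthread_t self = pthread_self();          /* 1. get current thread id */
    void *block = NULL;

    pthread_mutex_lock(&tls_lock);            /* 2. take the semaphore    */
    for (int i = 0; i < tls_used; i++) {      /* 3. look up by thread id  */
        if (pthread_equal(tls_table[i].tid, self)) {
            block = tls_table[i].block;
            break;
        }
    }
    if (!block && tls_used < MAX_THREADS) {   /* first access: allocate   */
        block = calloc(1, block_size);
        tls_table[tls_used].tid = self;
        tls_table[tls_used].block = block;
        tls_used++;
    }
    pthread_mutex_unlock(&tls_lock);          /* 4. release the semaphore */
    return block;                             /* 5. return the pointer    */
}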
Unfortunately it's not uncommon to see the slow TLS implementation in practice.
Thread locals in D are really fast. Here are my tests.
64 bit Ubuntu, core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64
// this loop takes 0m0.630s
void main(){
int a; // register allocated
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread-local accesses.
Thread locals are accessed using the %fs register - so there are only a couple of processor instructions involved:
Disassembling with objdump -d:
- this is local variable in %ecx register (loop counter in %eax):
8: 31 c9 xor %ecx,%ecx
a: b8 00 ca 9a 3b mov $0x3b9aca00,%eax
f: 83 c1 09 add $0x9,%ecx
12: ff c8 dec %eax
14: 85 c0 test %eax,%eax
16: 75 f7 jne f <_Dmain+0xf>
- this is thread local, %fs register is used for indirection, %edx is loop counter:
6: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
b: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
12: 00 00
14: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # 1b <_Dmain+0x1b>
1b: 83 04 08 09 addl $0x9,(%rax,%rcx,1)
1f: ff ca dec %edx
21: 85 d2 test %edx,%edx
23: 75 e6 jne b <_Dmain+0xb>
Maybe the compiler could be even more clever and cache the thread local in a register before the loop and write it back to the thread local at the end (it would be interesting to compare with the gdc compiler), but even now things look very good IMHO.
One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.
To see what kind of code is generated for tls, compile and obj2asm this code:
__thread int x;
int foo() { return x; }
TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.
I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.
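The caching suggested above looks roughly like this in C (using the same __thread qualifier as in the snippet above; a sketch, not D):

/* Sketch: hoist the thread-local access out of the hot loop. */
__thread int counter;

void hot_loop(int n)
{
    int local = counter;        /* one TLS access before the loop   */
    for (int i = 0; i < n; i++)
        local += 9;             /* plain register arithmetic inside */
    counter = local;            /* one TLS access to write back     */
}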
I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.
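A sketch of that fixed-format scheme in C (hypothetical types; register save/restore elided):

/* Each task carries a pointer to its fixed-format TLS block. */
struct tls_block { int error_code; /* ... other per-thread fields ... */ };

struct task {
    /* saved CPU registers would live here */
    struct tls_block *tls;
};

/* What running code dereferences for "thread-local" access. */
static struct tls_block *current_tls;

/* On a task switch the scheduler swaps the TLS pointer along with
 * the registers, so TLS access costs one extra indirection. */
static void context_switch(struct task *from, struct task *to)
{
    /* save 'from' registers, restore 'to' registers (elided) */
    (void)from;
    current_tls = to->tls;
}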
Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.
I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.
If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to replace an underlying implementation.
In this example, I've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread specific data.
This way you have a minimum number of TLS indices and related queries.
In the "best case", you'd have just 1 TLS pointer pointing to one private heap block per thread.
In a nutshell: Don't point to single objects, instead point to thread specific, heap memory/containers holding object pointers to achieve better performance.
Don't forget to free memory if it isn't used again.
I do this by wrapping a thread in a class (like Java does) and handling TLS in the constructor and destructor.
Furthermore, I store frequently used data like thread handles and IDs as class members.
usage:
for type*:
tl_ptr<type>
for const type*:
tl_ptr<const type>
for type* const:
const tl_ptr<type>
const type* const:
const tl_ptr<const type>
#include <windows.h>
#include <cassert>

template<typename T>
class tl_ptr {
protected:
    DWORD index;  // one Win32 TLS slot per tl_ptr instance

public:
    tl_ptr(void) : index(TlsAlloc()) {
        assert(index != TLS_OUT_OF_INDEXES);
        set(NULL);
    }
    void set(T* ptr) {
        TlsSetValue(index, (LPVOID)ptr);
    }
    T* get(void) const {
        return (T*)TlsGetValue(index);
    }
    tl_ptr& operator=(T* ptr) {
        set(ptr);
        return *this;
    }
    tl_ptr& operator=(const tl_ptr& other) {
        set(other.get());
        return *this;
    }
    T& operator*(void) const {
        return *get();
    }
    T* operator->(void) const {
        return get();
    }
    ~tl_ptr() {
        TlsFree(index);
    }
};
We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try and improve on this.
I'm pleased to say that we now have a small API that offers a > 50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread-id, and a > 65% reduction if the calling thread has already obtained its thread-id (perhaps for some other earlier processing step).
The new function ( get_thread_private_ptr() ) always returns a pointer to a struct we use internally to hold all sorts, so we only need one per thread.
All in all I think the Win32 TLS support is poorly crafted really.
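The post above doesn't include source, but a helper with that shape might look like this hypothetical Win32 C sketch: one TLS slot holding a pointer to a single lazily allocated per-thread struct, with the thread id cached inside it:

#include <windows.h>
#include <stdlib.h>

struct thread_private {
    DWORD thread_id;   /* cached so later steps don't re-query it */
    /* ... everything else the thread needs ... */
};

static DWORD g_tls_index = TLS_OUT_OF_INDEXES;

/* Call once at startup (e.g., DLL_PROCESS_ATTACH). */
void tls_init(void)
{
    g_tls_index = TlsAlloc();
}

struct thread_private *get_thread_private_ptr(void)
{
    struct thread_private *p = TlsGetValue(g_tls_index);

    if (!p) {  /* first access on this thread: allocate its block */
        p = calloc(1, sizeof *p);
        if (p) {
            p->thread_id = GetCurrentThreadId();
            TlsSetValue(g_tls_index, p);
        }
    }
    return p;
}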