Is there any lock-free malloc under BSD/BSD-like license? - malloc

Like NBmalloc: http://www.cse.chalmers.se/research/group/dcs/nbmalloc.html
Or Maged M. Michael's malloc: http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
Thanks!!

Jemalloc is BSD/BSD-like licensed. http://www.canonware.com/jemalloc/
There is also SuperMalloc, which is dual-licensed GPLv3/MIT: https://github.com/kuszmaul/SuperMalloc
Can you clarify what you mean by "lock-free"?
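For context, "lock-free" here usually means that the allocation paths synchronize with atomic instructions (compare-and-swap, fetch-and-add) instead of mutexes, so a stalled or preempted thread can never block the others. A minimal illustrative sketch of that idea using C11 atomics follows (a fixed-arena bump allocator, not a drop-in replacement for any of the allocators above):
/* Illustration only: threads reserve space from a fixed arena with a
   single atomic fetch-add, so no thread ever blocks another. There is
   no free() and the arena size is fixed; it merely shows what
   "lock-free" means in this context. */
#include <stdatomic.h>
#include <stddef.h>

static unsigned char arena[1 << 20];      /* 1 MiB arena */
static atomic_size_t next_offset;         /* bump pointer */

void *lockfree_bump_alloc(size_t n)
{
    size_t align = _Alignof(max_align_t);
    n = (n + align - 1) & ~(align - 1);   /* round up to max alignment */
    size_t old = atomic_fetch_add(&next_offset, n);   /* lock-free reservation */
    if (old + n > sizeof arena)
        return NULL;                      /* arena exhausted */
    return arena + old;
}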


2 Questions about the RISC-V Privileged Spec v1.7

Page 16, Table 3.1:
Base field in mcpuid: RV32I RV32E RV64I RV128I
What is "RV32E"?
Is there a "E" extension?
ECALL (page 30) says nothing about the behavior of the pc.
Meanwhile, mepc (page 28) and mbadaddr (page 29) state that "mepc will point to the beginning of the instruction". I think ECALL should set mepc to the end of the causing instruction so that an ERET would go to the next instruction. Is that right?
As answered by CliffordVienna, RV32E ("embedded") is a new base ISA which uses 16 registers and makes some of the counter registers optional.
I would not recommend implementing an RV32E core, as it is probably an unnecessary over-optimization in core size that limits your ability to use a large body of RV*I code. But if performance is not needed, and you really need the core to be a tad smaller, and the core is not connected to a memory hierarchy that would dominate the area/power anyway, and you are willing to deal with the tool-chain headaches... then maybe an RV32E core is appropriate.
ECALL is treated like an exception, and will redirect the PC to the appropriate trap handler based on the current privilege level. MEPC will be set to the current PC of the ecall instruction.
You can verify this behavior by analyzing the Berkeley RV64G Rocket processor (https://github.com/ucb-bar/rocket/blob/master/src/main/scala/csr.scala), or by looking at the Spike ISA simulator (starting here: https://github.com/riscv/riscv-isa-sim/blob/master/riscv/insns/scall.h). Careful: as of 2015 Jun 27 the code is still in flux regarding the Privileged Spec.
If we look at how Spike handles eret ("sret": https://github.com/riscv/riscv-isa-sim/blob/master/riscv/insns/sret.h) for example, we have to be a bit careful. The PC is set to "mepc", but it's the trap handler's job to advance the PC by 4. We can see that done, for example, by the proxy kernel in some of the handler functions here (https://github.com/riscv/riscv-pk/blob/master/pk/handlers.c).
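To make the mepc handling concrete, here is a rough sketch of what such a syscall trap handler ends up doing. It is modeled loosely on the riscv-pk handlers linked above; the trapframe layout and function name are illustrative, not the actual pk code:
/* Sketch: on an ECALL, the saved exception PC (mepc/sepc) points at the
   ecall instruction itself, so the handler must advance it past the
   4-byte instruction before returning; otherwise ERET/SRET would
   re-execute the ecall forever. */
typedef struct {
    unsigned long gpr[32];   /* saved general-purpose registers */
    unsigned long epc;       /* saved exception PC (copy of mepc/sepc) */
} trapframe_t;

void handle_syscall(trapframe_t *tf)
{
    /* ... dispatch on the syscall number (a7), put the result in a0 ... */
    tf->epc += 4;            /* skip over the ecall so ERET resumes after it */
}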
A draft of the RV32E (embedded) spec can be found here (via isa-dev mailing list):
https://lists.riscv.org/lists/arc/isa-dev/2015-06/msg00022/rv32e.pdf
It's RV32I with 16 instead of 32 registers and without the counter instructions.

Self-hosted WCF service maxing CPU load

I am looking into an issue at work with a Windows service that is taking 100% CPU on a machine with 16 CPUs.
The service is hosting a self-hosted .NET WCF service.
I have received a crash dump, which I have loaded into WinDbg in order to look for clues.
So what I have tried:
!threads :
ThreadCount: 646
UnstartedThread: 0
BackgroundThread: 643
PendingThread: 0
DeadThread: 2
Hosted Runtime: no
642 of the threads were thread pool workers, as follows:
8 29 2a34 000000002068b510 3029220 Preemptive 0000000000000000:0000000000000000 0000000000563f50 0 MTA (Threadpool Worker)
~29s -> !CLRStack
000000003c66eb70 00000000770512fa [GCFrame: 000000003c66eb70]
000000003c66ec40 00000000770512fa [GCFrame: 000000003c66ec40]
000000003c66ec78 00000000770512fa [HelperMethodFrame: 000000003c66ec78] System.Threading.Monitor.Enter(System.Object)
000000003c66ed70 000007fef7af1c9c System.Threading.TimerQueueTimer.Fire()
000000003c66ede0 000007fef7a6c2f3 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
000000003c66ee30 000007fef7a6c92a System.Threading.ThreadPoolWorkQueue.Dispatch()
000000003c66f388 000007fef8d57d33 [DebuggerU2MCatchHandlerFrame: 000000003c66f388]
~29s -> K
000000003c66e858 000007fefd7010dc ntdll!NtWaitForSingleObject+0xa
000000003c66e860 000007fef8d049bf KERNELBASE!WaitForSingleObjectEx+0x79
000000003c66e900 000007fef8d04977 clr!CLREventBase::WaitEx+0x16c
000000003c66e940 000007fef8d048f8 clr!CLREventBase::WaitEx+0x103
000000003c66e9a0 000007fef8e9c5de clr!CLREventBase::WaitEx+0x70
000000003c66ea30 000007fef8dc5a34 clr!WKS::GCHeap::WaitUntilGCComplete+0x2b
000000003c66ea60 000007fef8d0c4f4 clr!Thread::RareDisablePreemptiveGC+0x176
000000003c66eaf0 000007fef8dd1f3d clr!GCCoop::GCCoop+0x3d
000000003c66eb20 000007fef8e898cf clr!AwareLock::Contention+0x137
000000003c66ebe0 000007fef7af1c9c clr!JITutil_MonContention+0xaf
000000003c66ed70 000007fef7a6c2f3 mscorlib_ni+0x521c9c
000000003c66ede0 000007fef7a6c92a mscorlib_ni+0x49c2f3
000000003c66ee30 000007fef8d57d33 mscorlib_ni+0x49c92a
000000003c66eef0 000007fef8d556e6 clr!CallDescrWorkerInternal+0x83
000000003c66ef30 000007fef8d557af clr!CallDescrWorkerWithHandler+0x4a
000000003c66ef70 000007fef8eda2c9 clr!MethodDescCallSite::CallTargetWorker+0x2e6
000000003c66f120 000007fef8ee51b0 clr!QueueUserWorkItemManagedCallback+0x2a
000000003c66f200 000007fef8ee513e clr!DebuggerU2MCatchHandlerFrame::DebuggerU2MCatchHandlerFrame+0xa0
000000003c66f240 000007fef8ee50b5 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x38e
000000003c66f340 000007fef8ee51eb clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x2bd
000000003c66f3d0 000007fef8eda224 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x23b
000000003c66f430 000007fef8ee6baf clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xb4
000000003c66f5c0 000007fef8ee6ab3 clr!ThreadpoolMgr::ExecuteWorkRequest+0x4c
000000003c66f5f0 000007fef8eda8a6 clr!ThreadpoolMgr::WorkerThreadStart+0xf3
000000003c66f6b0 0000000076c9652d clr!Thread::intermediateThreadProc+0x7d
000000003c66f7f0 000000007702c541 kernel32!BaseThreadInitThunk+0xd
000000003c66f820 0000000000000000 ntdll!RtlUserThreadStart+0x1d
I'm having a hard time interpreting the stack traces since they don't hit any of my application code.
Are they all just idle threadworkers, waiting for work?
Threads with WaitForSingleObject are not critical, since they are waiting and not consuming CPU time. But be aware that your dump is only a snapshot and you might have had bad luck when taking the snapshot.
For a performance analysis with WinDbg you'd need several dumps during high CPU and compare them. If they all have similar stack traces, that's fine and you can conclude something. If they are all very different, it's almost useless.
The command !runaway seems more interesting here, since it lists the CPU time consumed per thread, so you can identify the one(s) which are on high CPU. Again: having two snapshots that you can compare is helpful, because the main thread may still have consumed more CPU time in total than some short-lived 100% threads.
If you can't use a performance profiler, SysInternals Procdump can generate a series of dumps (-n) for you on high CPU (-c). Use -s to set the time between dumps. For .NET, don't forget -ma for full memory.
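For example, an illustrative invocation (the process name and thresholds are placeholders; adjust them to your service):
procdump -ma -c 90 -s 5 -n 3 MyWcfService.exe
This writes three full dumps of the hypothetical MyWcfService.exe process whenever CPU usage exceeds 90%, spaced by the -s interval, which you can then compare in WinDbg.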
Other than that, 646 threads sounds a lot to me. The OS itself could be quite busy scheduling them.
Sounds like the issue could be related to GC. Since this is a self-hosted service, it will use the Workstation GC by default, unless you enable the server GC manually:
http://msdn.microsoft.com/en-us/library/ms229357(v=vs.110).aspx
Have you tried that to see if it makes any difference?
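For reference, server GC is normally switched on in the service's app.config (a minimal sketch; see the MSDN link above for the gcServer element):
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>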
PerfView from Microsoft may be helpful. From the link:
http://blogs.msdn.com/b/dotnet/archive/2012/10/09/improving-your-app-s-performance-with-perfview.aspx
"Late last year, Vance Morrison, who is currently an architect on the .NET Framework Performance team, released PerfView, which is a new performance tool for .NET developers. PerfView helps you discover and investigate performance hotspots in .NET Framework apps, and enables you to deliver consistently high-performance apps to your customers.
Using PerfView, you can perform complex CPU performance analyses to solve hard-to-detect performance problems. PerfView's revolutionary grouping and folding features are what makes it possible to grasp and solve these difficult problems."
Use WPRUI.exe to capture a trace and analyze the CPU usage with WPA.exe.
Microsoft explained how to analyze the created trace in the following video:
Defrag Tools: #42 - WPT - CPU Analysis
http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-42-WPT-CPU-Analysis
Collect ETW traces with PerfView and follow the big % numbers.
Try running ~*e !clrstack in WinDbg to get the call stacks of all threads, then look for repeated code.

C: Why the casting of malloc, and an explanation?

Ok, I asked this the other day. But the answers I received made me realize I was not ready to question it yet until I did some hard research.
So here I am yet again to retry this....
In examples of malloc I have seen something like this...
#include <stdio.h>
#include <stdlib.h>   /* malloc/free live here */

int main(void)
{
    int *ptr_doe;
    ptr_doe = (int *) malloc(sizeof(int));
    free(ptr_doe);
    return 0;
}
I read that it was not necessary to have the:
(this *) malloc(sizeof(int));
And that only
(int *) malloc(sizeof(\\this));
is necessary. Is the cast in front of the malloc call ever necessary?
And how do we know how much memory we need to allocate, and what the hell is this:
malloc(10 * sizeof(int));
Is it multiplying 4 bytes by 10? And when is it necessary to use malloc? How does it work internally? Thanks for any help, guys.
1.
Is the cast in front of the malloc call ever necessary?
If you use a correct (standard-conforming) C compiler, no. See the answers to Do I cast the result of malloc?
2.
how do we know how much memory we need to allocate
It depends on your needs.
3.
malloc(10 * sizeof(int));
means to allocate a block of memory big enough to store 10 int values.
4.
is it multiplying 4 bytes by 10?
If sizeof(int) equals 4, yes.
5.
when is it necessary to use malloc?
When you can only know at runtime how much memory your program needs.
6.
How does it work internally?
Briefly, malloc() will ask the operating system for enough memory, record some bookkeeping information it needs, and return a pointer to the start of that memory block.
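Putting it together, here is a small idiomatic sketch of allocating room for 10 ints at runtime, checking the result, and releasing the memory again (no cast is needed in C):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t count = 10;                            /* often only known at runtime */
    int *values = malloc(count * sizeof *values); /* 10 * sizeof(int) bytes */
    if (values == NULL) {                         /* malloc can fail */
        perror("malloc");
        return 1;
    }
    for (size_t i = 0; i < count; i++)
        values[i] = (int)i;
    printf("first: %d, last: %d\n", values[0], values[count - 1]);
    free(values);                                 /* give the memory back */
    return 0;
}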

Using perf to monitor raw event counters

I am trying to measure certain hardware events on an (Intel Xeon) machine with multiple (physical) processors. Specifically, I wish to know how many requests are issued for reading 'offcore' data.
I found the OFFCORE_REQUESTS hardware event in Intel's documentation; it gives the event descriptor 0xB0 and, for data demands, the additional mask 0x01.
Would it then be correct to tell perf to record the event 0xB1 (i.e. 0xB0 | 0x01) and to call it as:
perf record -e r0B1 ./mytestapp someargs
Or is this incorrect?
Because perf report shows no output for events entered like this.
The perf documentation is rather sparse in this area, apart from a tutorial entry which does not say which event it was or how it was encoded (though that one works for me)...
Any help is greatly appreciated.
Ok, so I guess I figured it out.
For the Intel machine I use, the format is as follows:
<umask><eventselector>, where both are hexadecimal values. The leading zeros of the umask can be dropped, but not those of the event selector.
So for the event 0xB0 with the mask 0x01 I can call:
perf record -e r1B0 ./mytestapp someargs
I could not manage to find the exact parsing of it in the perf kernel code (any kernel hacker here?), but I found these sources:
A description of the use of perf with raw events in the c't magazine 13/03 (subscription required), which describes some raw events along with their descriptions from the Intel Architecture Software Developer's Manual (Vol. 3B)
A patch on the kernel mailing list, discussing the proper way to document it. It noted that the pattern above "... was x86 specific and incomplete at that"
(Updated) The man page of newer versions shows an example on Intel machines: man perf-list
Update:
As pointed out in the comments (thank you!), the libpfm translator can be used to obtain the proper event descriptor. The website linked in the comments (Bojan Nikolic: How to monitor the full range of CPU performance events), discovered by user 'osgx', explains it in further detail.
It seems you can use as well:
perf record -e cpu/event=0xB1,umask=0x1/u ./mytestapp someargs
I don't know where this syntax is documented.
You can probably use the other arguments (edge, inv, cmask) as well.
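If you prefer to program against the raw event directly instead of going through the perf command line, the same event/umask pair can be passed to the perf_event_open syscall as a PERF_TYPE_RAW config value. A sketch (assumes Linux with the usual headers; error handling kept minimal):
/* Count the raw event 0xB0 with umask 0x01 (perf's r1B0, i.e.
   cpu/event=0xB0,umask=0x01/) for the current process. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = (0x01 << 8) | 0xB0;   /* umask in bits 15:8, event select in bits 7:0 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code you want to measure here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("raw r1B0 count: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}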
There are several libraries which can be helpful to work with raw PMU events.
perf's own wiki https://perf.wiki.kernel.org/index.php/Tutorial#Events recommends the perf list --help man page for info about raw event encoding. Modern perf versions will also list raw events as part of the perf list output ("... if linked against the libpfm4 library, provides some short description of the events."). perf list --details will also print the raw ids and masks of events.
Bojan Nikolic has a blog article, "How to monitor the full range of CPU performance events", about using the libpfm4 (perfmon2) library to encode raw events for perf with the help of the showevtinfo and check_events tools, which are provided with the same library.
There is also the perf Python wrapper ocperf, which accepts Intel's event names. It was written by Andi Kleen (Intel Open Source Technology Center) as part of the pmu-tools set of utilities (LWN post from 2013; event lists from Intel at https://download.01.org/perfmon/). There is a demo of ocperf (2011) at http://halobates.de/modern-pmus-yokohama.pdf:
ocperf
•Perf wrapper to support Intel specific events
•Allows symbolic events and some additional events
ocperf record -a -e offcore_response.any_data.remote_dram_0 sleep 10
The PAPI library also has a tool to explore raw events with some descriptions: papi_native_avail.

Initial Sequence Number generation in Linux TCP stack

What is the procedure for generating Initial Sequence Numbers (ISNs) in the Linux TCP/IP stack? I know the procedure for ISN generation in Linux kernels 2.4 to 2.6, described on pages 7 & 8 of Embedding Covert Channels into TCP/IP. I have searched for similar procedures in later kernels, but to my dismay I couldn't find any. I understand that many details may not be available for obvious reasons related to security. As I'm investigating the possibility of implementing a similar steganography scheme (as described in the link) in later Linux kernels, I am badly in need of some information. Any help is appreciated.
Read my answer here:
Most efficient way to manipulate ISN numbers in TCP headers
This algorithm is used in the latest kernel's TCP stack (3.5).
EDIT: See the image below for all concerned kernel versions.
EDIT 2: See the kernel sources for older versions of the function secure_tcp_sequence_number:
Kernel 2.4.22
Kernel 2.6.30
Kernel 2.6.39
Kernel 3.0
Kernel 3.1 (MD5 replaced half-MD4)
After more careful reading of RFC 6528 and RFC 1948, I came to the conclusion that the algorithm for generating Initial Sequence Numbers (ISNs) as specified in RFC 1948:
ISN = M + F(LocalIP, LocalPort, RemoteIP, RemotePort, Secretkey)
has not changed. Instead, the algorithm proposed by S. M. Bellovin in RFC 1948 was formally specified and moved onto the Standards Track (as per RFC 2119) in RFC 6528, which was written jointly by S. M. Bellovin and F. Gont. As the now obsolete RFC 1948 can't be used in any documentation, RFC 6528 has replaced it.
But as pointed out in the answer to my original question, MD5 replaced half-MD4 as the hash function from kernel 3.1 onward. This is fully consistent with RFC 6528, which gives the flexibility to change F(), the pseudo-random function.
(Please have a look at the links for more details).
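In code-like terms, the RFC 1948 / RFC 6528 scheme boils down to the sketch below. keyed_hash() and microsecond_clock() are placeholders, not kernel functions: the former stands in for F() (implemented over the years with half-MD4, then MD5), the latter for the monotonically increasing component M.
#include <stdint.h>

/* Placeholders for F() and M -- illustrative only. */
extern uint32_t keyed_hash(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport,
                           const uint8_t *secret);
extern uint64_t microsecond_clock(void);

uint32_t generate_isn(uint32_t saddr, uint32_t daddr,
                      uint16_t sport, uint16_t dport,
                      const uint8_t *secret)
{
    /* ISN = M + F(LocalIP, LocalPort, RemoteIP, RemotePort, Secretkey) */
    uint32_t m = (uint32_t)microsecond_clock();                 /* clock component M */
    uint32_t f = keyed_hash(saddr, daddr, sport, dport, secret); /* keyed PRF F() */
    return m + f;                                               /* wraps modulo 2^32 */
}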
An update to the previous answers to reflect the current state (2019, Linux kernel 5.1.3), because the algorithm used to generate ISNs in the kernel was changed again to be more secure.
secure_tcp_seq now uses SipHash (an add-rotate-xor construction) to generate the ISN based on an initial secret key and the source and destination IP addresses and ports. I suppose this change was related to the hash collision vulnerabilities that resulted in hash flooding attacks.
