Slower performance of kernel-level memcpy in 4.4.x kernel on PowerPC architecture - linux

I have a board with a PowerPC processor and some DSP cores.
I was running a Wind River release based on the 2.6.34 Linux kernel on it. Recently I built an OS based on the 4.4.106 kernel for my board and booted it up. With this I have a problem booting up my DSP cores.
On debugging, I figured out that memcpy() (the kernel-space memcpy, not the user-space libc memcpy) in the 4.4.106 kernel takes slightly more time than in 2.6.34.
I then replaced the memcpy implementation in arch/powerpc/lib/memcpy_64.S with the older one from 2.6.x (which is a bad idea, I know), but I still did not see any performance improvement.
Can anyone tell me what other factors could be influencing the performance of memcpy in kernel space?
I could think of the preemption factor, which I am testing right now. I have disabled all configuration options related to kernel stats. Is there anything else I should look into to get the needed performance improvement?
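For anyone trying to reproduce this, a minimal sketch of a module that times the kernel's memcpy() in isolation, assuming a 4.4-era kernel (the buffer size, iteration count, and module name are arbitrary choices, not anything from the original setup):

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>
#include <linux/string.h>

#define BUF_SIZE (1 << 20)   /* 1 MiB test buffer (arbitrary) */
#define ITERATIONS 100

static int __init memcpy_bench_init(void)
{
	void *src, *dst;
	u64 start, elapsed;
	int i;

	src = kmalloc(BUF_SIZE, GFP_KERNEL);
	dst = kmalloc(BUF_SIZE, GFP_KERNEL);
	if (!src || !dst) {
		kfree(src);
		kfree(dst);
		return -ENOMEM;
	}
	memset(src, 0xa5, BUF_SIZE);

	/* Preemption and interrupts can still perturb this; average over
	 * many runs, or wrap with local_irq_save() to cross-check. */
	start = ktime_get_ns();
	for (i = 0; i < ITERATIONS; i++)
		memcpy(dst, src, BUF_SIZE);
	elapsed = ktime_get_ns() - start;

	pr_info("memcpy_bench: %d copies of %d bytes took %llu ns\n",
		ITERATIONS, BUF_SIZE, elapsed);

	kfree(src);
	kfree(dst);
	return 0;
}

static void __exit memcpy_bench_exit(void)
{
}

module_init(memcpy_bench_init);
module_exit(memcpy_bench_exit);
MODULE_LICENSE("GPL");
```

Building this against both kernel trees and comparing the numbers would separate the memcpy routine itself from everything around it (cache state, preemption, config options).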

Related

How does the kernel know how many cores there are?

I am wondering how the Linux kernel is made aware of all the available cores on the system. For the purposes of the scheduler, I'd assume the kernel has to know how many cores there are. Who provides the kernel with info about all the cores on the system?
Who provides the kernel with info about all the cores on the system?
It depends on the system.
For 80x86 PCs, the firmware constructs tables (ACPI tables nowadays) which provide a list of CPUs, and the kernel parses those tables.
For small embedded systems (with no firmware), the number of CPUs might be a compile-time constant or provided by the boot loader somehow (e.g. via a "flattened device tree").
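Whichever way the kernel discovered them, it exposes the result back to userspace. As a quick sanity check, a minimal sketch using POSIX sysconf() on glibc/Linux:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Cores the kernel discovered (via ACPI tables, device tree, etc.)
	 * as reported back to userspace by glibc. */
	long online = sysconf(_SC_NPROCESSORS_ONLN);
	long configured = sysconf(_SC_NPROCESSORS_CONF);

	printf("online CPUs: %ld, configured CPUs: %ld\n", online, configured);
	return 0;
}
```

The same information is also visible under /sys/devices/system/cpu/.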

Difference between user-space driver and kernel driver [duplicate]

I have been reading "Linux Device Drivers" by Jonathan Corbet. There are some things I want to know:
What are the main differences between a user-space driver and a kernel driver?
What are the limitations of both of them?
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?
What are the main differences between a user-space driver and a kernel driver?
User space drivers run in user space. Kernel drivers run in kernel space.
What are the limitations of both of them?
The kernel driver can do anything the kernel can, so you could say it has no limitations. But kernel drivers are much harder to "prove correct" and debug. It's all too easy to introduce race conditions, or use a kernel function in the wrong context or with the wrong locking. Things will appear to work for a while, but cause problems (including crashing the whole system) down the road. Drivers must also be wary of all input (both from the device and from userspace), because invalid data can sometimes cause crashes.
A user-space driver usually needs a small shim in the kernel to do its bidding. Usually, that shim provides a simpler API. For example, the FUSE layer lets people write file systems in any language. They can be mounted, read/written, then unmounted. The shim must also protect the kernel against all invalid input.
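To make the FUSE case concrete, here is a minimal sketch of a one-file filesystem against the libfuse 3 API (the file name and contents are made up for the example; build with pkg-config fuse3):

```c
/* hello_fuse.c — build:
 *   gcc hello_fuse.c -o hello_fuse $(pkg-config fuse3 --cflags --libs)
 * run:
 *   ./hello_fuse /tmp/mnt     (then: cat /tmp/mnt/hello)
 */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from a user-space driver!\n";

static int hello_getattr(const char *path, struct stat *st,
			 struct fuse_file_info *fi)
{
	(void)fi;
	memset(st, 0, sizeof(*st));
	if (strcmp(path, "/") == 0) {
		st->st_mode = S_IFDIR | 0755;	/* the mount point itself */
		st->st_nlink = 2;
	} else if (strcmp(path, hello_path) == 0) {
		st->st_mode = S_IFREG | 0444;	/* one read-only file */
		st->st_nlink = 1;
		st->st_size = strlen(hello_str);
	} else {
		return -ENOENT;
	}
	return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
			 off_t off, struct fuse_file_info *fi,
			 enum fuse_readdir_flags flags)
{
	(void)off; (void)fi; (void)flags;
	if (strcmp(path, "/") != 0)
		return -ENOENT;
	fill(buf, ".", NULL, 0, 0);
	fill(buf, "..", NULL, 0, 0);
	fill(buf, hello_path + 1, NULL, 0, 0);	/* "hello" */
	return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t off,
		      struct fuse_file_info *fi)
{
	size_t len = strlen(hello_str);
	(void)fi;
	if (strcmp(path, hello_path) != 0)
		return -ENOENT;
	if ((size_t)off >= len)
		return 0;
	if (off + size > len)
		size = len - off;
	memcpy(buf, hello_str + off, size);
	return (int)size;
}

static const struct fuse_operations hello_ops = {
	.getattr = hello_getattr,
	.readdir = hello_readdir,
	.read    = hello_read,
};

int main(int argc, char *argv[])
{
	/* fuse_main talks to the in-kernel FUSE shim on our behalf. */
	return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Everything interesting happens in ordinary user-space code; the kernel shim only routes VFS calls to these callbacks.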
User-space drivers have lots of limitations. For example, the kernel reserves some memory for use during emergencies, but that is not available to user-space. During memory pressure, the kernel will kill random user-space programs, but never kill kernel threads. User-space programs may be swapped out, which could lead to your device being unavailable for several seconds. (Kernel code cannot be swapped out.) Running code in user-space requires several context switches. These waste a "lot" of CPU time. If your device is a 300 baud modem, nobody will notice. But if it's a gigabit Ethernet card, and every packet has to go to your user-space driver before it gets to the real user, the system will have major bottlenecks.
User space programs are also "harder" to use because you have to install that user-space software, which often has many library dependencies. Kernel modules "just work".
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?
The question is "Does this complexity really need to be in the kernel?"
I used to work for a company that made USB dongles that talked a particular protocol. We could have written a full kernel driver, but instead just wrote our program on top of libUSB.
The advantages: The program was portable between Linux, Mac, Win. No worrying about our code vs the GPL.
The disadvantages: If the device needed to send data to the PC and get a response quickly, there was no guarantee that would happen. For example, if we needed a real-time control loop on the PC, it would be harder to have bounded response times. (Maybe not entirely impossible on Linux.)
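For illustration, the user-space approach described above might look roughly like this with libusb-1.0 (the VID/PID and endpoint below are hypothetical placeholders, not the real product's values):

```c
/* usb_hello.c — build:
 *   gcc usb_hello.c -o usb_hello $(pkg-config libusb-1.0 --cflags --libs)
 */
#include <libusb-1.0/libusb.h>
#include <stdio.h>

#define VENDOR_ID  0x1234	/* hypothetical */
#define PRODUCT_ID 0x5678	/* hypothetical */
#define EP_OUT     0x01		/* bulk OUT endpoint (hypothetical) */

int main(void)
{
	libusb_context *ctx = NULL;
	libusb_device_handle *dev;
	unsigned char msg[] = "hello device";
	int transferred, rc;

	if (libusb_init(&ctx) != 0)
		return 1;

	dev = libusb_open_device_with_vid_pid(ctx, VENDOR_ID, PRODUCT_ID);
	if (!dev) {
		fprintf(stderr, "device not found\n");
		libusb_exit(ctx);
		return 1;
	}

	if (libusb_claim_interface(dev, 0) == 0) {
		/* One bulk transfer; the kernel's generic USB stack is the
		 * only kernel code involved — no custom driver needed. */
		rc = libusb_bulk_transfer(dev, EP_OUT, msg, sizeof(msg) - 1,
					  &transferred, 1000 /* ms */);
		printf("rc=%d, sent %d bytes\n", rc, transferred);
		libusb_release_interface(dev, 0);
	}

	libusb_close(dev);
	libusb_exit(ctx);
	return 0;
}
```

The same source builds on Linux, macOS, and Windows, which is exactly the portability advantage mentioned above.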
If there is a way to do it in userspace, I would try that first. Only if there are significant performance bottlenecks, or significant complexity in keeping it in userspace, would you move it. Even then, consider the "shim" approach, and/or the "emulator" approach (where your kernel module makes your device look like a serial port or a block device).
On the other hand, if there are already several kernel modules similar to what you want, then start there.

Control over memory virtualization on Linux and Windows

This is a semi-theoretical question.
Can I specify the virtualization mode for memory (pure segmentation / segmentation+paging / just paging) while compiling for Windows (e.g., MSVS12 C++) and for Linux (e.g. g++)?
I have read all the MSVS linker and compiler options, and found no point of control there.
For g++, the manual is rather too complex for such a question.
The source of this question is this - link
I know from theory and practice that this should either be possible or be restricted by OS policy at some level, because the Core i7 supports all three modes I mentioned above.
Practical background:
The piece of code that creates lots of data is here, in the function Init - and it exhausted my memory when I wanted to have over 2-3G of primes on the heap.
Intel x86 CPUs always use some form of segmentation that can't be turned off. In 64-bit mode, code segmentation is limited, but it's still there. Paging is required for both Windows and Linux to work on Intel CPUs (though Linux doesn't use paging on certain other CPU architectures). Paging is also required to enable 64-bit mode on Intel CPUs.
So in other words, on Windows and Linux the OS always uses segmentation and paging, and so does any application run on them, though this is largely transparent. It's not possible to be "compiled+linked for 'segmentation without paging'" as you said in the answer you linked. Maybe the book you referenced is referring to ancient 16-bit versions of Windows (3.1 or earlier), which could run in a mode that supported 80286 CPUs, which didn't have paging. Though even then, that normally didn't make any difference in how you compiled and linked your applications.
What you are describing is not a function of a compiler, or even a linker.
When you run your program, you get the memory model that is already running on the system. Your compiled code does not care about the underlying memory model.
However, your program itself can change the memory model if it starts running in an unprotected processor mode.
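One way to see how transparent this is: regardless of compiler or linker options, every Linux process gets a paged virtual address space, and it can dump the layout the kernel set up for it. A minimal sketch (Linux-specific, since it reads /proc):

```c
#include <stdio.h>

int main(void)
{
	/* Each line of /proc/self/maps is one page-backed virtual
	 * memory region (text, heap, stack, mapped libraries...). */
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}
```

Nothing in the compiler flags changes what this prints, which is the point: the memory model belongs to the OS and the CPU mode, not to the binary.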

What <4GB workloads would have worse performance in the Linux x32 ABI than x64?

There is a relatively new Linux ABI referred to as x32, where the x86-64 processor runs in 32-bit mode, so pointers are still only 32-bits, but the 64-bit architecture specific registers are still used. So you're still limited to 4GB max memory use as in normal 32-bit, but your pointers use up less cache space than they do in 64-bit, you can do 64-bit arithmetic efficiently, and you get access to more registers (16) than you would in vanilla 32-bit (8).
Assuming you have a workload that fits nicely within 4GB, is there any way the performance of x32 could be worse than on x86-64?
It seems to me that if you don't need the extra memory space nothing is lost -- you should always get the same perf (when you already fit in cache) or better (when the pointer space savings lets you fit more in cache). But it wouldn't surprise me if there are paging/TLB/etc. details that I don't know about.
Certainly if you have a multithreaded program, the fact that data structures are smaller on x32 might cause cache-line fighting (false sharing) between threads -- different objects might get allocated on the same cache line in x32 mode and different cache lines in x86_64 mode. If two threads modify those objects independently, the cache ping-ponging could severely slow down the x32 code. Of course, this kind of cache effect could happen regardless of pointer size, but if the code has been tuned assuming 64-bit pointers, going to 32-bit pointers could de-tune things.
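That ping-ponging effect is easy to demonstrate in isolation. A minimal sketch of false sharing -- not x32-specific; the 64-byte cache-line size and iteration count are assumptions for the example (build with gcc -O2 -pthread, then time it with and without -DPAD):

```c
#include <pthread.h>
#include <stdio.h>

/* Without PAD, the two counters share a cache line and the threads
 * ping-pong it between cores; with PAD, each sits in its own line. */
struct counters {
	volatile long a;
#ifdef PAD
	char pad[64];	/* assume 64-byte cache lines */
#endif
	volatile long b;
};

static struct counters c;

static void *bump_a(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000L; i++)
		c.a++;
	return NULL;
}

static void *bump_b(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000L; i++)
		c.b++;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("a=%ld b=%ld\n", c.a, c.b);
	return 0;
}
```

The padded version is typically several times faster, even though it does exactly the same arithmetic.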
In x32 the processor is actually executing in "long mode", the same mode as for x86_64. That is, addresses as seen by the processor when doing addressing are still 64 bits; the x32 ABI just makes sure that all addresses are small enough to fit into 32 bits. As a result, in some cases there is slight overhead when pointers have to be zero-extended from 32 bits to 64.
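If your toolchain has x32 support, a quick way to see what the ABI changes (and what it doesn't) is to build the same file three ways with gcc; note that -mx32 also needs the x32 runtime libraries installed:

```c
#include <stdio.h>

/* Build and run three ways and compare:
 *   gcc -m32  sizes.c && ./a.out   -> 4-byte pointers, 8 GP registers
 *   gcc -mx32 sizes.c && ./a.out   -> 4-byte pointers, 16 GP registers
 *   gcc -m64  sizes.c && ./a.out   -> 8-byte pointers, 16 GP registers
 */
int main(void)
{
	printf("sizeof(void *) = %zu\n", sizeof(void *));
	printf("sizeof(long)   = %zu\n", sizeof(long));
	printf("sizeof(size_t) = %zu\n", sizeof(size_t));
	return 0;
}
```

x32 is ILP32, so all three sizes print 4 there, while the generated code still uses the full x86_64 register set.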
Also, needing x86/x86-64/x32 libraries in RAM, which I suppose is what one will end up with in practice (unless you're talking about some embedded or other tightly controlled system rather than a general-purpose computer), may eat up some of the benefit of x32.

What is the Maximum Java Heap Space for SuSE Linux

This question is related to Java Refuses To Start - Could Not Reserve Enough Space for Object Heap and should be easy enough to figure out. However, my searches haven't yielded anything useful.
Essentially we have two 32-bit OSes (Red Hat & SuSE) on different machines with the same hardware. Both use the same JVM, both executing the same command line. Red Hat works perfectly fine but SuSE reports there isn't enough memory.
We just need to know if this is a limitation of the version of SuSE we're using or if it's something else.
'cat /proc/version' gives us:
Linux version 2.6.5-7.244-bigsmp (geeko@buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Mon Dec 12 18:32:25 UTC 2005
'uname -a' gives us the following on BOTH types of machines:
UTC 2005 i686 i686 i386 GNU/Linux
The JVM memory limit is related to the largest free contiguous block available, not the amount of free memory. The limit varies from about 1.4 GB to a bit over 2.0 GB, and depends on where your operating system puts various things in memory. I don't know the particulars of where Red Hat or SuSE load things into memory, but it could be that SuSE is mapping some library to an address in the middle of the address space, where Red Hat might map it at the end (speculating).
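If you want to test that theory directly, a rough probe of the largest single allocation a process can get (an approximation only: Linux overcommit makes malloc optimistic, but on a 32-bit system the address-space limit still dominates):

```c
#include <stdio.h>
#include <stdlib.h>

/* Binary-search the largest single malloc() that succeeds, to ~1 MiB
 * resolution. On 32-bit this approximates the largest contiguous free
 * region of the address space — the same thing that bounds -Xmx. */
int main(void)
{
	size_t lo = 0;
	size_t hi = (size_t)-1;		/* SIZE_MAX */

	while (lo + 1024 * 1024 < hi) {
		size_t mid = lo + (hi - lo) / 2;
		void *p = malloc(mid);

		if (p) {
			free(p);
			lo = mid;	/* mid bytes were available */
		} else {
			hi = mid;	/* mid bytes were not */
		}
	}
	printf("largest single malloc: about %zu MiB\n", lo / (1024 * 1024));
	return 0;
}
```

Running this on both the Red Hat and SuSE boxes would show quickly whether the contiguous-block theory explains the difference.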
And remember that your actual memory usage in Java is more than what you specify for -Xmx. The other memory settings also affect the size of your process (like permgen). So it could also be that the perm space on SuSE has a larger default than on Red Hat.
Also, depending on the memory allocation profile of your application, you might get away with a smaller heap size and different garbage collection options. There are some details here (http://java.sun.com/performance/reference/whitepapers/tuning.html) and in other places. For example, if you allocate a lot of small, temporary objects, you'll want different GC settings than if you have a lot of big, long-lived objects.
Regarding the linked question, why not just use Red Hat? That might be a simplistic solution, but I guarantee it's going to fix your problem faster than delving deep into the arcane world of Java tuning and OS memory management :P
Firstly, you are crazy to be running a 32-bit OS when you have this much address-space pressure. Migrate to a 64-bit JVM on 64-bit Linux. How much time have you already wasted trying to diagnose this problem, which you must have suspected from the outset would go away with the larger address space of a 64-bit system?
Secondly, it's well known that of all the Linux vendors Red Hat has the most kernel engineers on staff and makes some serious tweaks to the kernels in their RHEL products. These often include patches for large workloads like yours (well, it's a large workload for a 32-bit system; it's nothing special on 64-bit). So there's some chance the reason ultimately is that RHEL has other customers doing the same crazy stuff as you, and you're benefiting from work they did to support those customers.
Finally though, since I suspect you're going to insist on trying to find a way to do this on 32-bit SuSE, I will point out that Linux offers a variety of address-space trade-offs on 32-bit x86, and it's possible (but not certain) that your SuSE systems just have a different trade-off selected. If you can find the configuration of the running kernels (often in /boot/config....) then you can compare settings like HIGHMEM.
The conventional option until a few years ago was a 2:2 split: userspace is limited to 2 GiB of address space, which is easy to program for and has decent efficiency, but in this scenario you obviously can't have your requested heap, since it would leave no space for the program text, stack, etc. More recently the trend has been a 3:1 split (similar to the Windows /3GB switch), which expands userspace address space at the cost of cramming the OS kernel itself into less space, which potentially causes its own problems. This might work, but it would be very cramped, so I also wouldn't be surprised if it didn't work for your jobs. Finally, newer Linux kernels also offer an option where you get a full 4 GiB of 32-bit userspace, which might be enough to make your jobs run reliably, at a significant performance cost, since then userspace and kernel addresses obviously can't coexist.
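If you want to check which split a given kernel uses, a crude probe is to print some user-space addresses: on a 3:1 split they stay below 0xC0000000, on a 2:2 split below 0x80000000 (a rough sketch; exact addresses will vary from run to run):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int local;
	void *heap = malloc(1);

	/* The stack usually sits near the top of the user address range,
	 * so its address hints at where the kernel split begins. */
	printf("stack ~ %p\n", (void *)&local);
	printf("heap  ~ %p\n", heap);
	free(heap);
	return 0;
}
```

Comparing the output on the Red Hat and SuSE machines would reveal whether they were built with different splits.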
To try this you'd need a new kernel. You may be able to just install one provided by SuSE (see if they offer others to choose from, e.g. a "PAE" option) or you may have to compile your own, in which case it probably invalidates your support contract.
But really, you should just go with option 1, switch to a 64-bit JVM and put your feet up.
