Did any of the Motorola 68k processors have performance counters?

Did any of the Motorola 68k processors have performance counters? - 68000

Did any of the Motorola 68k processors have performance counters or anything similar that could be used for cycle-level code timing?

Nothing like this up to at least 68060 CPUs.
From the other side, manual cycle counting for 68000 (the first CPU in the family) is relatively easy.

For manual cycle-counting (which is pretty much your only option here...), many cycle-count tables exist. For example, the famous YACHT.TXT file, a copy of which can be found at http://nemesis.hacking-cult.org/MegaDrive/Documentation/Yacht.txt (and many other places)

Related

Moving threads across CPUs with clock_gettime(CLOCK_MONOTONIC)

I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerforamnceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?

OP:
I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerforamnceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
I'm not sure what you mean by "erratic" or "unstable." If you mean drift, or variance between cores, these concerns are probably based on computers that shipped 12-15 years ago (XP and Win2000-based OS). From Microsoft:
QPC is available on Windows XP and Windows 2000 and works well on most systems. However, some hardware systems' BIOS didn't indicate the hardware CPU characteristics correctly (a non-invariant TSC), and some multi-core or multi-processor systems used processors with TSCs that couldn't be synchronized across cores. Systems with flawed firmware that run these versions of Windows might not provide the same QPC reading on different cores if they used the TSC as the basis for QPC.
That has pretty much become a non-issue for most current hardware (see link, below).
OP:
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Well, any high-frequency clock has to come from somewhere. On a Windows box (since you asked about QPC), where is that value going to come from? Adding an additional layer to any call to QueryPerformanceCounter essentially guarantees less precision, as there will be more CPU instructions in the mix between "now" and "now as reported back to you by the OS" (and, although extremely unlikely, there would be a small increase in the possibility of preemption, contributing further loss of precision).
This applies equally to any Intel-based Linux/BSD/whatever boxes, as they have to run with the same hardware characteristics. In the Intel-architecture world, the highest frequency you'll be able to get is going to be a value based on RDTSC and the operating system's best effort to keep any TSC values as close as possible across cores or processors (at least in the absence of specialized hardware).
This is why any benchmarking should be an average performance expectation based on a large number of data points.
Microsoft actually has a pretty good document outlining the implementation characteristics of high-frequency timers, including HPET and ACPI power management times, and touches briefly on multi-clock and virtualization: http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx.
OP:
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?
For most of the world, yes. I work on server-based code where microseconds count (high frequency market data), but most people who think a few hundred microseconds are going to make or break their program are kidding themselves. Do you know how long it takes to serve a web page? The CPU itself gets bored with all that waiting, even with thousands of users.

Outliers during Performance Evaluation

I am trying to do some performance measurements using Intels RDTSC, and it is quite
odd the variations I get during different testruns. In most cases my benchmark in C
needs 3000000 Mio cycles, however, exactly the same execution can in some cases take
5000000, almost double as much. I tried to have no intense workloads running in parallel
so that I get good performance estimations. Anyone an idea where this huge timing
variations can come from? I know that interrupts and stuff can happening, but I did not expect
such huge variations in timing!
PS.: I am running it on a Pentium processor with Linux running on it.
Thanks for feedback,
John

I think the answer is in:
I tried to have no intense workloads
running in parallel
You have insufficient control over this in a modern OS.

According to this Wikipedia article, the RDTSC (time stamp counter) cannot be used reliably for benchmarking on multi-core systems. There is no promise that all cores have the same value in the time stamp register.
On Linux, it is better to use the POSIX clock_gettime function.

You have to take the cache of most modern processors into account. Maybe another process evicts your program's cache content in the case where you measured the long running time.
As Henk pointed out, lots of stuff happen in a modern OS that you can't control that much.

Highly concurrent multi-threaded application requires hardware

I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.

UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)

Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.

Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.

Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64

Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.

MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.

I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)

Is there any advantage for developing on a 64 bit OS?

I'm not sure I understand it properly: does a 64 bit OS run/compile code faster than a 32 bit OS on the same system?
We're using 64 bit OSs where I am and it seems to only cause compatibility issues with legacy and proprietary software. (We're running Ubuntu 9.04 Jaunty amd64)

I will restrict this answer to x86-32 (IA-32) vs x86-64 (AMD64), as I believe that's the question you're actually asking.
At the processor level, there are a few advantages. First and most obvious is the expansion of the per-process virtual memory to a much wider range of 48 bits. (64 is allowed in the architecture but not required, if memory serves.) That enables applications to use a lot more of the system's memory available to them, as well as opening up a lot of space for things like memory mapped files that operate on virtual memory that isn't linked to real memory. It also opens up a lot of space for the OS in question to work, as it doesn't have to share your 4 GB limit for its data. In short, applications and the OS can make better use of your machine's resources.
Additionally, the AMD64 architecture addresses one of the biggest problems of IA-32, which is the utter lack of registers. In fact it doubles the available registers, which is a huge win for some types of code. (Actually it's a win for almost ANY code, but some applications suffer from the increased memory cost of 64 bits and it evens out.)
On the Windows side, MS has taken it as an opportunity to break a whole bunch of historical compability problems. It's not a clean break from the old world, but it's a start. I don't believe Linux suffers from the same problems to begin with, and I don't have much perspective to offer on their 64 bit advantages.

As a general rule, developing--or using--a 64-bit operating system, in any context, will be slower than the same 32-bit operating system. Because all pointers are suddenly twice as large, you are far more likely to blow the cache, and can fit less data in RAM. That slows down your application considerably. You normally would only use 64-bit systems when your applications need to address more than 2 to 3 GB of data simultaneously--something very common in scientific computing and some database situations, but otherwise extremely rare. This is why Apple does not advocate unconditionally compiling PowerPC applications in 64-bit mode, for example: the cost due to cache-misses and lack of memory are high enough that going 64-bit only makes sense when you truly can take advantage of the 64-bit space.
But x86 v. AMD64, which is what you're really asking about (since you're discussing Ubuntu), is a very special beast. AMD64 not only extends all pointers to 64-bit; it fixes many, many deficiencies in the x86 architecture, doubling the number of GPRs, simplifying the instructions to be more friendly to modern CPU designs, and more. Because of this, on AMD64 platforms only, you will frequently see a substantial performance boost by going to 64-bit.
There is one other area where, in software development, it makes sense to go to 64-bit: you need to run lots of VMs. Running a couple of VMs can easily blow you past the 3 GB memory barrier of the operating system, making using them very painful. (It will work due to a technology called PAE, or Paged Addressing Extensions, that Intel invented to bridge the gap between 32-bit systems and 64-bit systems, but the result is slow, painful to work with as a developer, and not very well supported on Windows.) Going to a 64-bit OS can provide tremendous benefits.

(As the commentators note, this answer is somewhat generic, some of these points do not apply to intel/amd chips.)
The answer is: it varies, for a few reasons:
With larger-width instructions, you're going to get more expressiveness (either a greater variety of instructions or a greater capacity to encode data into those instructions directly), which can mean a reduced number of instructions flowing through the machine, which is generally a win: so ++64bit here.
But sometimes larger instructions might take more cycles to decode and execute, because they may be more complex. So a possible --64bit here.
Also, you need to transfer these instructions to and from the CPU: 64 bit instructions are twice as big as 32 bit instructions, which means more traffic to and from memory and the caches. CPUs are structured to ameliorate a lot of this cost, but it is a slight --64bit here.
More registers are usually available in wider instruction sets, which causes less data traffic to and from the stack and or memory. So ++64bit here.
And as everyone's no doubt going to mention, you have the ability to address more memory.
(Nearly forgot this one) the native "long" or "int" size may go up, depending on architecture, meaning data structures based on these get larger. Larger = more memory to move around, which means more possible waiting on data moving: --64bit if you're not careful.
Depending on your architecture, a lot of other concerns may apply too. You can rest assured that the processor and compiler vendors are working their butts off to reduce the "--"s above and increase the "++"s.

I have this 5GByte database that needs converting. On a 64-bit system, I just put all data in collections. In the 32-bit system, I had to think about the order in which to load and convert. The problem is not run-time, it is engineering time. Switching to 64 bit saves weeks of development time.
The compatability issues: that's no bug, that's a feature. It shows you who has written clean software.

There are also some security advantages to using 64-bit operating systems. There have been some buffer overflow exploits that circumvent address space layout randomization by brute force. On a 64-bit OS, there are simply too many addresses for this kind of attack to be successful.

It will speed up compilation if your compile process is memory-bound and you use your 64bit OS to increase the amount of memory usable by your system.

I expect it to be slightly slower, I had that experience with FC10. I don't have real reasons, but it is definitely not the sizeof(pointer) issue. (*)
My own hunch is that it simply is a matter of less optimized drivers or tweaked chipsets.
Also NTFS-3g was funny under 64-bit, while it worked under 32-bit (same distro, same kernel same partition, it just "hung" in some circumstances)
(*) most compiling is disk bound, not CPU bound. Moreover there are other improvements in the x86_64 architecture that cancel out that fact (better PIC, more regs, SSE2 default on, 686 cmov default on) . Unless your app does nothing than randomly moving small blocks around.

When should I write a Linux kernel module?

Some people want to move code from user space to kernel space in Linux for some reason. A lot of times the reason seems to be that the code should have particularly high priority or simply "kernel space is faster".
This seems strange to me. When should I consider writing a kernel module? Are there a set of criterias?
How can I motivate keeping code in user space that (I believe) belong there?

Rule of thumb: try your absolute best to keep your code in user-space. If you don't think you can, spend as much time researching alternatives to kernel code as you would writing the code (ie: a long time), and then try again to implement it in user-space. If you still can't, research more to ensure you're making the right choice, then very cautiously move into the kernel. As others have said, there are very few circumstances that dictate writing kernel modules and debugging kernel code can be quite hellish, so steer clear at all costs.
As far as concrete conditions you should check for when considering writing kernel-mode code, here are a few: Does it need access to extremely low-level resources, such as interrupts? Is your code defining a new interface/driver for hardware that cannot be built on top of currently exported functionality? Does your code require access to data structures or primitives that are not exported out of kernel space? Are you writing something that will be primarily used by other kernel subsystems, such as a scheduler or VM system (even here it isn't entirely necessary that the subsystem be kernel-mode: Mach has strong support for user-mode virtual memory pagers, so it can definitely be done)?

There are very limited reasons to put stuff into the kernel. If you're writing device drivers it's ok. Any standard application: never.
The drawbacks are huge. Debugging gets harder, errors become more frequent and hard to find. You might compromise security and stability. You might have to adapt to kernel changes more frequently. It becomes impossible to port to other UNIX OSs.
The closest I've ever come to the kernel was a custom filesystem (with mysql in the background) and even for that we used FUSE (where the U stands for userspace).

I'm not sure the question is the right way around. There should be a good reason to move things to kernel space. If there aren't any reasons, don't do it.
For one thing, debugging is made harder, and the effect of bugs is far worse (crash/panic instead of simple coredump).

Basically, I agree with rpj. Code has to be in user-space, unless it's REALLY necessary.
But, to emphasize your question, which condition?
Some people claims that driver has to be in the kernel, which is not true. Some drivers are not timing sensitive, in fact lots of drivers are like that.
For example, the framer, RTC timer, i2c devices, etc. Those drivers can be easily moved to user space. There are even some file-systems that are written in user-space.
You should move to kernel space where the overhead, eg. user-kernel swap, becomes unacceptable for your code to work properly.
But there are lots of way to deal with this. For example, the /dev/mem provides a good way to access your physical memory, just like you do it from the kernel space.
When people talk about going to RTOS, I'm usually skeptical.
These days, the processor is so powerful, that most of the time, the real-time aspect becomes negligible.
But even, let's say, you're dealing with SONET, and you need to do a protection switching within 50ms (actually even less, since the 50ms constrains applies to the whole ring), you still can do the switching very fast, IF your hardware supports it.
Lots of framer these days can give you a hardware support that reduces the amount of writes that you need to do. Your job is basically responds to the interrupt as quickly as possible. And Linux is not bad at all. The interrupt latency I got was less 1ms, even if I have tons of other interrupts running (eg. IDE, ethernet, etc.).
And if that's still not enough, then maybe your hardware design is wrong. Some things are better left on the hardware. And when I said hardware, I mean ASIC, FPGA, Network Processor, or other advanced logic.

Code running in the kernel accesses memory, peripherals, system functions in ways that are different from userspace code and thus has the ability to be more efficient. Not to mention the reduced security restrictions for kernel code. However, all this usually comes at a cost, such as increasing the possibility of opening the kernel up to security threats, locking up the OS, complicating the debugging, and so forth.

If your people want really high priority, determinism, low latency etc, the right way to go is to use some real-time version of Linux (or other OS).
Also look at the preemptible kernel options etc. Exactly what you should do depends on the requirements, but to put the code in kernel modules is not likely the right solution, unless you are interfacing some hardware directly.

Another reason to not move code into kernel space is that when you use it in production or commercial situations, you will have to publish that code due to the GPL agreement. A situation that many software companies don't want to come into. :)

As general rule. Think on what you want to know and if that is something you would see in an operating system development book or class then it has a good chance to belong into the kernel. If not, keep it out of the kernel. If you have a very good reason to break that rule, be sure, you will have enough knowledge to know it by yourself or you will be working with someone that has that knowledge.
Yes, might sound harsh, but this exactly what I meant, if you don't know, then be almost sure the answer is no, don't do it in the kernel. Moving your development to kernel space opens a giant can of worms that you must be sure to be able to handle.

If you just need lower latency, higher throughput, etc., it is probably cheaper to buy a faster computer than to develop kernel code.
Kernel modules may be faster (due to less context switches, less system call overhead, and less interruptions), and certainly do run at very high priority. If you want to export a small amount of fairly simple code into kernel space, this might be OK. That is, if a small piece of code is found to be crucial to performance, and is the sort of code that would benefit from being placed in kernel mode, then it may be justified to place it there.
But moving large parts of your program into kernel space should be avoided unless all other options are completely exhausted. Aside from the difficulty of doing so, the performance benefit is not likely to be very large.

If you're asking such a question, then you shouldn't go to the kernel layer. Basically just wondering means you don't need to. The time of the context switch is so negligible that it doesn't matter anyway these days.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string