Why do register renaming, when we can increase the number of registers in the architecture?

In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and register mapping for resolving name dependencies?

Lots of reasons.
First, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change the architecture. At best, existing binaries would not benefit from the new registers; at worst they won't run at all without some kind of JIT compilation.
Second, there is the problem of encoding. Adding new registers means increasing the number of bits dedicated to encoding the registers, probably increasing the instruction size, with effects on the cache and elsewhere.
Third, there is the issue of the size of the visible state. Context swapping would have to save all the visible registers, taking more time and more space (and thus having an effect on the cache, hence costing more time again).
Fourth, dynamic renaming can be applied at places where static renaming and register allocation are impossible, or at least hard to do; and where they are possible, they take more instructions, thus increasing cache pressure.
In conclusion there is a sweet spot, usually considered to be 16 or 32 registers for the integer/general-purpose case. For floating-point and vector registers, there are arguments for considering more registers (ISTR that Fujitsu was at one time using 128 or 256 floating-point registers for its own extended SPARC).
Related question on electronics.se.
An additional note: the Mill architecture takes another approach to statically scheduled processors and avoids some of the drawbacks, apparently changing the trade-off. But AFAIK it is not yet known whether there will ever be available silicon for it.

Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if we had that many architectural registers, vs. 3 or 4 bits for actual x86 machine code.
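As a toy illustration of that (my example, not from the links below): an out-of-order core can overlap these two chains, while each chain's multiplies must run back to back:

    #include <stdio.h>

    int main(void)
    {
        /* two long, independent dependency chains: each multiply depends on
           the previous one in its own chain, but the chains are independent
           of each other, so OoO exec can run them in parallel */
        double a = 1.0, b = 1.0;
        for (long i = 0; i < 100000000; i++) {
            a *= 1.0000001;   /* chain 1 */
            b *= 0.9999999;   /* chain 2 */
        }
        printf("%f %f\n", a, b);   /* keep both results live */
        return 0;
    }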
Related:
http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
this Q&A about how superscalar execution works.

Register identifier encoding space will be a problem. Indeed, many more registers have been tried. For example, SPARC has register windows: 72 to 640 registers, of which 32 are visible at one time.
Instead, consider this from Computer Organization and Design: RISC-V Edition:
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order and superscalar, rather than with renaming and providing lots of general-purpose registers.


Why can't we have a safe ISA?

According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and huge efforts to fix them.
But the root of C/C++'s memory vulnerability can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher software level, like Java (JVM), but this leads to a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is: why can't we implement the safety at the hardware level? If the CPU had a safe ISA, which ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could get rid of the performance decline of software safety checking. If anyone professional in microelectronics can tell me: is this idea realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. E.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing to in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment the way pages do, which is what makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (E.g. a page is 4k in size, always starting at a 4k alignment boundary.)
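A rough sketch of the difference (my illustration, not drawn from any real design):

    #include <stdint.h>

    #define PAGE_SHIFT 12   /* 4k pages */

    /* power-of-2 pages: a TLB hit test is a single equality on the high
       bits, which maps naturally onto content-addressable memory */
    int tlb_entry_hits(uint64_t vaddr, uint64_t entry_vpn)
    {
        return (vaddr >> PAGE_SHIFT) == entry_vpn;
    }

    /* an arbitrary object range [base, base+size) would instead need two
       full-width magnitude compares per entry,
           vaddr >= base && vaddr - base < size,
       which is much more expensive to replicate across every CAM entry */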
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency on very important critical paths such as L1d hit load-use latency, even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build into a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory; it doesn't make sense to try to build that into a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
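For a sense of what such an instruction would fold away, this is the software check it would replace (a hypothetical sketch; the function and its shape are mine, not a real extension):

    #include <stdlib.h>

    /* the compare-and-branch here is what a hypothetical bounded-load
       instruction could absorb into its addressing mode */
    long checked_load(const long *base, unsigned long index, unsigned long length)
    {
        if (index >= length)
            abort();              /* out of bounds: trap instead of loading */
        return base[index];
    }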
Intel's MPX ISA extension to x86 was an attempt to let software set up bounds ranges, but it's been mostly abandoned due to lower performance than pure-software bounds checks. Intel even dropped it from their recent CPUs (not present in 10th-gen CPUs using 10nm lithography, or later).
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being defined within an instruction set.

How do modern CPUs handle cross-page unaligned access?

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I get that I might run into problems with UMA, ranging from performance degradation to CPU faults. And I have read about posix_memalign and cache lines.
What I cannot find is how modern systems/hardware handle the situation when my request crosses page boundaries.
Here is an example:
I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not one after another in memory).
Then I do movq (offset + $0xffc), %rax: I request 8 bytes starting at the 4092nd byte, meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.
Physical memory:

---|---- page 1 ----|---- page 2 ----|-->
   |         ... 4b | 4b ...         |-->
I need 8 bytes that are split at the page boundaries.
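In C, the situation can be reproduced roughly like this (a sketch assuming Linux, mmap, and 4KB pages):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* two fresh pages; their physical frames need not be adjacent */
        uint8_t *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        uint64_t v;
        memcpy(&v, p + 4092, 8);   /* 4 bytes from each page */
        printf("%llu\n", (unsigned long long)v);
        return 0;
    }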
How do the MMUs on x86-64 and ARM handle this? Are there any mechanisms in kernel MM that somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?
I mean, to complete such a request the MMU has to translate one virtual address into two physical addresses. How does it handle such requests?
Should I care about such things if I'm a software programmer and why?
I'm reading a lot of links from Google, SO, Drepper's cpumemory.pdf, and Gorman's Linux VMM book at the moment. But it's an ocean of information. It would be great if you could at least provide me with some pointers or keywords that I could use.
I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":
An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.
So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses[1], which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.
Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.
I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:
[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.
so I'd be surprised if it wasn't broadly the same (i.e. one translation per aligned access).
Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.
[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations to optimise this into larger aligned accesses where appropriate.
So the architecture doesn't really matter, other than that x86 has traditionally tolerated unaligned accesses, where MIPS and ARM traditionally generate a data abort rather than trying to just make it work.
Where it doesn't matter is that all processors have a fixed number of pins, a fixed-size (maximum) data bus, and a fixed-size (maximum) address bus. "Modern processors" tend to have data buses more than 8 bits wide, but the unit of addressing is still the 8-bit byte, so the opportunity for unaligned access exists. Anything larger than one byte in a particular transfer has the opportunity of being unaligned, if the architecture allows it.
Transfers are typically in some number of bytes and/or bus widths. On an ARM AMBA/AXI bus, for example, the length field is in units of bus widths, 32 or 64 bits (4 or 8 bytes). And no, it is not going to be in units of 4Kbytes...
(yes this is elementary I assume you understand all of this).
Whether it is 16 bits or 128 bits, the penalty for unaligned access comes from additional bus cycles, which these days is an extra bus clock per transfer. So for an ARM 16-bit unaligned transfer (which ARM will support on its newer cores without faulting) that means you need to read 128 bits instead of 64; 64 bits to get 16 is not a penalty, as 64 is the smallest size for a bus transfer. Each transfer, whether a single width of the data bus or multiple widths, has multiple clock cycles associated with it; let's say there are 6 clock cycles to do an aligned 16-bit read, then ideally it is 7 cycles to do an unaligned one. Seems small, but it does add up.
Caches help a lot because the DRAM side of the cache will be set up to use multiples of the bus width and will always do aligned accesses for cache fetches and evictions. Non-cached accesses will follow the same pain, except that the DRAM side's overhead is not handfuls of clocks but dozens to hundreds of clocks.
For random access, a single 16-bit read that not only spans a bus-width boundary but also happens to cross a cache-line boundary will not just incur the one additional clock on the processor side; worst case it can incur an additional cache-line fetch, which is dozens to hundreds of additional clock cycles. If you were walking through an array of things that happen to not be aligned (structures/unions may be an example, depending on the compiler and code), that additional cache-line fetch would have happened anyway; but if the array of things is a little over on one or both ends, then you might still incur one or two more cache-line fetches that you would have avoided had the array been aligned.
That is really the key on reads: before or after an aligned area, you might have to incur one extra transfer for each side you spill into.
Writes are both good and bad. Random reads are slower because the transaction has to stall until the answer comes back. For a random write the memory controller has all the information it needs: the address, data, byte mask, transfer type, etc. So it is fire and forget; the processor has done its job and can call the transaction complete from its perspective and move on. Naturally, gang too many of these up, or do a read on something just written, and then the processor stalls due to the completion of a prior write in addition to the current transaction.
An unaligned 16-bit write, for example, not only incurs an additional read cycle: assuming a 32- or 64-bit-wide bus, the write would be one byte per location, so you have to do a read-modify-write on whatever that closest memory is (cache or DRAM). Depending on how the processor and then the memory controller implement it, it can be two individual read-modify-write transactions (unlikely, since that incurs twice the overhead), or a double-width read, modify both parts, and a double-width write, incurring two additional clocks over and above the overhead; the overhead is doubled as well. If it had been an aligned bus-width write then no read-modify-write is required; you save the read. Now if this read-modify-write is in the cache then it is pretty fast, but still noticeable, up to a few clocks, depending on what is queued up that you have to wait on.
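As a conceptual sketch of that read-modify-write, here is an unaligned 16-bit store on a 64-bit little-endian bus (the toy mem[] array and bus_read64/bus_write64 helpers are mine, standing in for the memory controller, not a real API):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* toy "memory" so the sketch is self-contained; a real memory
       controller would be issuing bus transactions instead */
    static uint8_t mem[32];

    static uint64_t bus_read64(uint64_t addr)          /* aligned 64-bit read */
    {
        uint64_t v;
        memcpy(&v, &mem[addr], 8);
        return v;
    }

    static void bus_write64(uint64_t addr, uint64_t v) /* aligned 64-bit write */
    {
        memcpy(&mem[addr], &v, 8);
    }

    static void unaligned_store16(uint64_t addr, uint16_t value)
    {
        uint64_t base = addr & ~7ull;
        unsigned off  = (unsigned)(addr & 7);

        if (off < 7) {                          /* stays within one bus word */
            uint64_t w = bus_read64(base);      /* read                      */
            w &= ~(0xFFFFull << (off * 8));     /* modify the two byte lanes */
            w |= (uint64_t)value << (off * 8);
            bus_write64(base, w);               /* write                     */
        } else {                                /* spills into the next word */
            uint64_t lo = bus_read64(base);     /* two read-modify-writes    */
            uint64_t hi = bus_read64(base + 8);
            lo = (lo & 0x00FFFFFFFFFFFFFFull) | ((uint64_t)(value & 0xFF) << 56);
            hi = (hi & ~0xFFull) | (uint64_t)(value >> 8);
            bus_write64(base, lo);
            bus_write64(base + 8, hi);
        }
    }

    int main(void)
    {
        unaligned_store16(7, 0xBEEF);           /* crosses the 8-byte boundary */
        printf("%02x %02x\n", mem[7], mem[8]);  /* prints "ef be" */
        return 0;
    }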
I am also most familiar with ARM. ARM traditionally would punish an unaligned access with an abort; you could turn that off, and you would instead get a rotation of the bus rather than a spill-over, which would make for some nice freebie endian swaps. The more modern ARM cores will tolerate and implement an unaligned transfer. Understand that, for example, a store multiple of, say, 4 or more registers against a non-64-bit-aligned address is not considered an unaligned access, even though it is a 128-bit write to an address that is neither 64- nor 128-bit aligned. What the processor does in that case is break it into 3 writes: an aligned 32-bit write, an aligned 64-bit write, and an aligned 32-bit write. The memory controller does not have to deal with the unaligned stuff. That is for legal things like store multiple. The core I am familiar with won't do a write length of more than 2 anyway; an 8-register store multiple is not a single length-4 write, it is 2 separate length-2 writes. But a load multiple of 8 registers, so long as it is aligned on a 64-bit address, is a single length-4 transaction. I am pretty sure that, since there is no masking on the bus side for a read (everything is in units of bus width), there is no reason to break, say, a 4-register load multiple on an address that is not 64-bit aligned into 3 transactions; simply do a length-3 read. When the processor reads a single byte, you can't tell that from the bus; all you see is a 64-bit read, AFAIK. The processor strips the byte lane out. If the processor/bus does care, be it ARM, x86, MIPS, etc., then sure, you will hopefully see separate transfers.
Does everyone do this? No; older processors (thinking of neither ARM nor x86) would put more of the burden on the memory controller. I don't know what modern x86 and MIPS and such do.
Your malloc example: first off, you are not going to see single bus transfers of 4Kbytes; that 4k will be broken up into digestible bits anyway. Before anything else, the access has to do one or more bus cycles against the memory management unit to find the physical address and other properties (those answers can be cached to make them faster, but sometimes they have to go all the way out to slow DRAM). So for that example the only transfer that matters is an unaligned transfer that splits the 4k boundary, say a 16-bit transfer. For the MMU system to work at all, the only way for that to be supported is for it to be turned into two separate 8-bit transfers that happen in those physical address spaces, and yes, that literally doubles everything: the MMU lookup cycles, the cache/DRAM bus cycles, and so on. Other than that boundary, there is nothing special about your 8k being split; the bulk of your cycles will be within one of the two 4k pages, so it looks like any other random access, with of course repetitive/sequential accesses gaining the benefit of caching.
The short answer is that no matter what platform you are on, either 1) the platform will abort an unaligned transfer, or 2) somewhere in the path there will be one or more additional cycles (up to dozens/hundreds) as a result of the unaligned access compared to an aligned access.
It doesn't matter whether the physical pages are adjacent or not. Modern CPUs use caches. Data is transferred to/from DRAM a full cache-line at a time. Thus, DRAM will never see a multi-byte read or write that crosses a 64B boundary, let alone a page boundary.
Stores that cross a page boundary are still slow (on modern x86). I assume the hardware handles the page-split case by detecting it at some later pipeline stage and triggering a re-do that does two TLB checks. IDK if Intel designs insert extra uops into the pipeline to handle it, or what the impact is (on latency, throughput of page-splits, throughput of all memory accesses, throughput of other, e.g. non-memory, uops).
Normally there's no penalty at all for unaligned accesses within a cache-line (since about Nehalem), and a small penalty for cache-line splits that aren't page-splits. An even split is apparently cheaper than others. (e.g. a 16B load that takes 8B from one cache line and 8B from another).
Anyway, DRAM will never see an unaligned access directly. AFAIK, no sane modern design has only write-through caches, so DRAM only sees writes when a cache-line is flushed, at which point the fact that one unaligned access dirtied two cache lines is not available. Caches don't even record which bytes are dirty; they just burst-write the whole 64B to the next level down (or last-level to DRAM) when needed.
There are probably some CPU designs that don't work this way, but Intel's and AMD's designs do.
Caveat: loads/stores to uncacheable memory regions might produce smaller stores, but probably still only within a single cache line. (On x86, this probably applies to MOVNT non-temporal stores that use write-combining store buffers but otherwise bypass the cache.)
Uncacheable unaligned stores that cross a page boundary are probably still split into separate stores (because each part needs a separate TLB translation).
Caveat 2: I didn't fact-check this. I'm certain about the whole-cache-line aligned access to DRAM for "normal" loads/stores to "normal" memory regions, though.

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a four-core machine is not going to execute your parallel program four times as quickly - in part because of the overhead and coordination described in the previous sections. However, the design of the computer hardware also limits its ability to scale. You can expect a significant improvement in performance, but it won't be 100 percent per additional core, and there will almost certainly be a point at which adding additional cores or CPUs doesn't improve the performance at all.
I read the paragraph above from a book. But I don't get the last sentence.
So, where is the point at which adding additional cores or CPUs doesn't improve the performance at all?
If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
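A small numerical sketch of that effect (the serial fraction and the overhead coefficient below are made-up values, purely illustrative):

    #include <stdio.h>

    /* speedup for p processors with serial fraction s and a parallel
       overhead assumed to grow linearly with p: 1 / (s + (1-s)/p + k*p) */
    double speedup(double s, double k, int p)
    {
        return 1.0 / (s + (1.0 - s) / p + k * p);
    }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 2)
            printf("p=%4d  speedup=%6.2f\n", p, speedup(0.05, 0.001, p));
        /* the speedup rises, flattens, and eventually falls once the
           overhead term dominates */
        return 0;
    }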
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with @Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems, and using all those cores is a major concern.
Finally, to answer your question: at what point does adding processors no longer increase execution speed? That is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, is a growing niche within the software engineering 'profession'.
@High Performance Mark is right. This happens when you are trying to solve a fixed-size problem in the fastest possible way, so that Amdahl's law applies. It does not (usually) happen when you are trying to solve a problem in a fixed time. In the latter case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greater accuracy.
In this situation, Gustafson's law applies.
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup ψ(n,p) for p processors on a problem of size n is

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))
Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; it increases the synchronization overhead, etc., for shared-memory algorithms); if we continue adding more processors, at some point the communication time increase will be larger than the corresponding computation time decrease.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).
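For instance, under the illustrative assumption that the overhead grows linearly with p, i.e. κ(n,p) = c(n)·p, the optimum has a simple closed form:

    #include <math.h>

    /* with kappa(n,p) = c*p, the parallel time is
           T(p) = sigma + phi/p + c*p,
       and setting dT/dp = -phi/p^2 + c = 0 gives the minimizing p */
    double optimal_processors(double phi, double c)
    {
        return sqrt(phi / c);   /* round to a nearby integer in practice */
    }

Other overhead models (logarithmic, quadratic, ...) give different optima, so the assumed shape of κ(n,p) matters.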
The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores beyond small numbers. Few applications benefit from 4 cores, fewer from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.
If we're talking x86, that architecture is more or less at its limits. At 3 GHz, electricity travels 10 cm (actually somewhat less) per clock cycle, the die is about 1 cm square, and components have to be able to switch states within that single cycle (1/3,000,000,000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end, the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core, which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
With more common programs - office suites, IE, etc. - the extra processing power available will hardly be noticed. Some programs may have smaller parts written to take advantage of multiple cores, and if you know which ones you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light, which is 300,000,000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz, but this was with special coolers and, even then, it was slower (processed less) than an Intel i7 running at a significantly lower frequency.
It heavily depends on your program's architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only works sequentially, adding cores would not improve its performance at all. It might improve other things though, like framework-internal processing (if you're using a framework).
So the more parallel processing is allowed in your program, the better it scales with more cores. But if your program has limits on parallel processing (by design or by the nature of the data), it will not scale indefinitely. It takes a lot of effort to make a program run on hundreds of cores, mainly because of growing overhead, resource locking, and required data coordination. The most powerful supercomputers are indeed massively multi-core, but writing programs that can utilize them is a significant effort, and they can only show their power on inherently parallel tasks.

Determining Opcode Cycle Count for a CPU

I was wondering where one would go about getting CPU opcode cycle counts for various machines. An example of what I'm talking about can be seen at this link:
https://web.archive.org/web/20150217051448/http://www.obelisk.demon.co.uk/6502/reference.html
If you examine the MAME source code, especially under src\emu\cpu, you'll see that most of the CPU models keep track of the cycle count in a similar way. My question is: where does one go about getting this information, or reverse engineering it if it's not available? I've never seen any 'official' ASM programmer's guide contain cycle count info. My initial guess is that a small program is put into the real hardware's boot ROM, and if the CPU has an opcode equivalent to RDTSC, something like this is done:
RDTSC
// opcode of your choosing
RDTSC
But what would you do if such support weren't available? I know that for older hardware the MAME team has access to nothing but the ROMs and scattered documentation.
Up through about the Pentium, cycle counts were easy to find for Intel and AMD processors (and most competitors). Starting with the Pentium Pro and AMD K5, however, the CPU went to a dynamic execution model, in which instructions can be executed out of order. In this case, the time taken to execute an instruction depends heavily upon the data it uses, and whether (for example) it depends on data from a previous instruction (in which case, it has to wait for that instruction to complete before it can execute).
There are also constraints on things like how many instructions can be decoded per cycle (e.g. at least one, plus two more as long as they're "simple") and how many can be retired per cycle (usually around three or four).
As a result, on a modern CPU it's almost meaningless to talk about the cycles for a given instruction in isolation. Meaningful results require a stream of instructions, so you look not only at that instruction, but what comes before and after it. An instruction that's a serious bottleneck in one instruction stream might be essentially free in another stream (e.g. if you have one multiplication mixed in with a lot of adds, the multiplication might be almost free -- but if it's surrounded by a lot of other multiplications, it might be relatively expensive).
The RDTSC approach in the accepted answer should include a serializing instruction, to ensure that all previous instructions have retired before getting the count. This adds overhead to the count, but you can simply "count" zero instructions and subtract that value from the measured instructions.
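A sketch of that measurement on x86-64 with GCC/Clang intrinsics, using LFENCE as the serializing fence (the timed loop is an arbitrary stand-in for the opcode being measured):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_lfence */

    static inline uint64_t rdtsc_fenced(void)
    {
        _mm_lfence();            /* wait for earlier instructions to finish */
        uint64_t t = __rdtsc();
        _mm_lfence();            /* keep later work from starting early */
        return t;
    }

    int main(void)
    {
        /* first "measure" zero instructions, then subtract that overhead */
        uint64_t t0 = rdtsc_fenced();
        uint64_t t1 = rdtsc_fenced();
        uint64_t overhead = t1 - t0;

        volatile uint64_t x = 1;
        t0 = rdtsc_fenced();
        for (int i = 0; i < 1000; i++)
            x *= 3;              /* the instruction stream being timed */
        t1 = rdtsc_fenced();

        printf("~%.2f cycles per iteration\n",
               (double)(t1 - t0 - overhead) / 1000.0);
        return 0;
    }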
Some pdf manuals that cover this very well.
http://www.agner.org/optimize/#manuals

Do you expect that future CPU generations are not cache coherent?

I'm designing a program, and I found that assuming implicit cache coherency makes the design much, much easier. For example, my single-writer (always the same thread), multiple-reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPUs. But I want this program to generate income for at least the next ten years (a short time for software), so I wonder if you think this could be a problem for future CPU architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (or rather, a lack of huge performance losses) on current-generation CPUs. Granted, it's more work in the design, but it is often required.
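For example, isolating per-thread data onto separate cache lines (a minimal C11 sketch; the 64-byte line size and the thread count are assumptions):

    #include <stdalign.h>

    #define NUM_THREADS 8      /* arbitrary for the sketch */

    /* give each thread's hot counter its own 64-byte cache line, so writes
       by one thread don't keep invalidating the line the others are using */
    struct padded_counter {
        alignas(64) long value;
    };

    static struct padded_counter counters[NUM_THREADS];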
We are already there. Computers claim cache coherency, but at the same time they have a temporary store buffer for writes, reads can be completed from this buffer instead of the cache (i.e. the store buffer has effectively become an incoherent cache), and invalidate requests are also queued, allowing the processor to temporarily use cache lines it knows are stale.
x86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques, and others yet to be devised, come into use. Even Itanium, failed as it is, uses many of these ideas, so expect Intel to migrate them into x86 over time.
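The store buffer is the one you can already observe from ordinary code on x86. This is the classic store-buffer litmus test (my C11 sketch, not from the answer); run it enough times and both r1 and r2 can come out 0, because each thread's store can still be sitting in its store buffer when the other thread's load executes:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x, y;
    int r1, r2;

    void *t1(void *arg)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg)
    {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < 100000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            pthread_t a, b;
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)   /* impossible on a sequentially
                                         consistent machine */
                printf("store-buffer reordering at iteration %d\n", i);
        }
        return 0;
    }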
As for avoiding locks, etc.: it is always hard to gauge people's level of expertise over the Internet, so either you are misguided about what you think might work, or you are on the cutting edge of lock-free programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
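For reference, this is what the single-writer/multiple-reader publish looks like when you state the ordering explicitly with C11 release/acquire instead of relying on coherence alone (a minimal sketch of the concepts above, not the OP's code):

    #include <stdatomic.h>

    typedef struct {
        int        payload;
        atomic_int ready;
    } slot_t;

    /* single writer: fill in the data, then publish with a release store */
    void publish(slot_t *s, int value)
    {
        s->payload = value;
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }

    /* readers: an acquire load of the flag guarantees that the payload
       written before the release store is visible */
    int try_read(slot_t *s, int *out)
    {
        if (atomic_load_explicit(&s->ready, memory_order_acquire)) {
            *out = s->payload;
            return 1;
        }
        return 0;
    }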
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (you can't just pass data in a method call, you must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank, the better, because then we're working in a world of network message routing, where data is simply not available, rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper by reputed authors in the computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" -By Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community's) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application fails, it will be you who's not making any money.
Anyway, I suspect things are not going to change that dramatically, because of all the legacy code that was written for single-threaded execution, but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important; what is important is the memory model of the platform you are working on.
You are developing the application in some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application to some other platform, and that platform's memory model is different, you will have to adjust for it; but in such a case you are the one doing the port, and you have complete control over how you do it. This is in fact true even for current platforms: e.g. the PowerPC memory model allows many more reorderings to happen than the x86 one does.

Resources