How does register size affect processor performance?

I've been flying around the internet today trying to wrap my head around this topic. So here's what I understood so far. So the bigger the register size the bigger the instructions a processor can handle?
Quote:
The size of the registers, which is sometimes called the word size, indicates the amount of data with which the computer can work at any given time.
Question 1:
How would this be explained in terms of dealing with RAM? Why would a 32-bit processor be less adept or slower at processing information in this case?
Also, the term addressing. So while a 64-bit processor can "address" 2^64 different locations in RAM, a 32-bit processor can only deal with 2^32.
Question 2:
What does addressing mean? And why would the ability to address more locations be more helpful?
Question 3:
How are these 2 points, 1)Number of addressable locations and 2)Instruction size, related?
I hope my questions aren't confusing. It would be nice if references and examples to RAM as well as comparisons between 32 and 64-bits would be given in the explanations.

As chux already stated, there can be a lot of different bus widths in a computer system. That said, I assume you're talking about usual PC architectures here. Now, to your questions:
Performance difference between 32 and 64 bit systems
The hardware can usually operate on bigger numbers than a 32 bit system, so it can e.g. sum two 64 bit numbers in one operation, while a 32 bit system would need at least two (plus some operations to combine the results). This means software that does lots of operations on big numbers will probably be faster on a 64 bit system, but software that doesn't need big numbers will not be faster.
A 64 bit processor usually fetches bigger blocks of data from memory than a 32 bit one. If the data bus is 64 bits wide instead of 32, it'll fetch twice as many bytes per transfer as the 32 bit system.
This is actually a negative point of 64 bit systems: since you have more addressable memory, you also need more memory for pointers, so 64 bit applications will also use a little more memory than the same application compiled for a 32 bit system.
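The "at least two operations plus some to combine the results" point above can be sketched in Python. This is only an illustration under stated assumptions: the function name is made up here, and real 32 bit hardware would use a carry flag (e.g. an add followed by an add-with-carry) rather than a 33 bit intermediate.

```python
MASK32 = 0xFFFFFFFF  # all ones in the low 32 bits

def add64_with_32bit_ops(a, b):
    """Add two unsigned 64-bit values using only 32-bit-wide pieces.

    Hypothetical helper for illustration; it mimics what a 32-bit
    machine has to do when it adds 64-bit numbers.
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Step 1: add the low halves. Python keeps the 33rd bit, which
    # plays the role of the hardware carry flag.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    lo = lo_sum & MASK32

    # Step 2: add the high halves plus the carry, wrapping at 32 bits
    # so the overall result wraps like a 64-bit register would.
    hi = (a_hi + b_hi + carry) & MASK32

    # Step 3: combine the two halves into one 64-bit result.
    return (hi << 32) | lo


print(hex(add64_with_32bit_ops(0xFFFFFFFF, 1)))  # carry moves into the high half
```

A 64 bit processor does the same work with a single add instruction, which is where the advantage for "big number" code comes from.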
Memory addressing
The memory address is a number that uniquely identifies a position in memory, where data is stored. With a 32 bit number, you can address 2^32 positions, which is roughly 4 GB. This is why 32 bit PCs cannot use more than 4 GB of memory (they actually can, with some restrictions. See PAE). Using 64 bit numbers means the computer can now address 2^64 positions, which means it could, in principle, use up to 16 exbibytes of memory. In practice, other limits prevent a PC from having all that memory.
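The arithmetic behind those figures is easy to check. The sketch below assumes byte-addressable memory (one byte per address); the helper names are illustrative only.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

def human(n):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"

for bits in (32, 36, 64):  # 36 bits is the physical address width PAE allows
    print(bits, "bit addresses ->", human(addressable_bytes(bits)))
```

With these assumptions, 32 bits gives 4 GiB, 36 bits gives 64 GiB and 64 bits gives 16 EiB.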
Addressable locations vs instruction size
Since lots of instructions have to reference a memory position, some of them will have to be bigger, so they have room for memory addresses.
Bigger instructions usually mean bigger software code, but this is not a problem in most cases, because the difference isn't that big, and because most of a program's size is usually made up of data rather than code.
Disclaimer: Not all I said is valid for every software/architecture. There are a lot of details that may have more impact on performance and memory usage than the points I wrote here.

The bit width of a processor's registers, its addressing range and the width of the internal/external bus between the processor and RAM are independent.
A 32-bit processor commonly handles 32-bit addresses, but it may only handle 24, or maybe 64. Many possibilities have occurred.
Addressing would be the maximum range, from 0 to N-1, of unique addresses that could be generated. Whether there truly are N locations of memory is another matter.
The width of the bus between CPU and RAM dramatically affects performance. This width, independent of CPU register size and RAM size, throttles throughput.
Addressing range and register size tend to correlate. Units with wider registers usually have a wider address range. There is no rule that forces these 2 to be the same.
I suggest reviewing CPU architectures, microcontrollers, and the theoretical Turing machine.

Related

How do modern cpus handle crosspage unaligned access?

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I get that I might run into problems with UMA ranging from performance degradation to a CPU fault. And I read about posix_memalign and cache lines.
What I cannot find is how the modern systems/hardware handle the situation when my request exceeds page boundaries?
Here is an example:
I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not one after another in memory).
movq (offset + $0xffc), %rax: I request 8 bytes starting at byte offset 4092 (0xffc), meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.
Physical memory:
---|---------------|---------------|-->
|... 4b| | |4b ...|-->
I need 8 bytes that are split at the page boundaries.
How do the MMUs on x86-64 and ARM handle this? Are there any mechanisms in kernel MM to somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?
I mean, to complete such a request the MMU has to translate one virtual address to two physical addresses. How does it handle such a request?
Should I care about such things if I'm a software programmer and why?
I'm reading a lot of links from google, SO, drepper's cpumemory.pdf and gorman's Linux VMM book at the moment. But it's an ocean of information. It would be great if you at least provide me with some pointers or keywords that I could use.

I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":
An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.
So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses [1], which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.
Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.
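To make the "one translation per page touched" idea concrete, here is a small Python sketch that splits an access into per-page pieces. It assumes 4 KiB pages as in the question; the function name is made up for illustration, and real hardware does this in the load/store pipeline rather than in software.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as in the question

def split_by_page(addr, size, page_size=PAGE_SIZE):
    """Return (page_number, offset_in_page, length) for each page touched."""
    pieces = []
    while size > 0:
        page, offset = divmod(addr, page_size)
        # Take as many bytes as fit before the end of the current page.
        length = min(size, page_size - offset)
        pieces.append((page, offset, length))
        addr += length
        size -= length
    return pieces

# The question's example: 8 bytes starting at offset 0xffc.
print(split_by_page(0xFFC, 8))
# An 8-byte access well inside a page needs only one translation.
print(split_by_page(0x100, 8))
```

Each tuple in the result corresponds to one address the MMU has to translate, and each one can fault independently.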
I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:
[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.
so I'd be surprised if it wasn't broadly the same (i.e. one translation per aligned access).
Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.
[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations actually optimise this into larger aligned accesses where appropriate.

So the architecture doesn't really matter, other than that x86 has traditionally not directly told you not to do it, whereas MIPS and ARM traditionally generate a data abort rather than trying to just make it work.
Where it doesn't matter is that all processors have a fixed number of pins, a fixed size (maximum) data bus and a fixed size (max) address bus. "Modern processors" tend to have data buses more than 8 bits wide, but the unit of addressing is still an 8 bit byte, so the opportunity for unaligned accesses exists. Anything larger than one byte in a particular transfer has the opportunity of being unaligned if the architecture allows.
Transfers are typically in some units of bytes and/or bus widths. On an ARM amba/axi bus for example the length field is in units of bus widths, 32 or 64 bits, 4 or 8 bytes. And no it is not going to be in units of 4Kbytes....
(yes this is elementary I assume you understand all of this).
Whether it is 16 bits or 128 bits, the penalty for unaligned comes from the additional bus cycles which these days is an extra bus clock per. So for an ARM 16 bit unaligned transfer (which arm will support on its newer cores without faulting) that means you need to read 128 bits instead of 64, 64 bits to get 16 is not a penalty as 64 is the smallest size for a bus transfer. Each transfer whether it is a single width of the data bus or multiple has multiple clock cycles associated with it, lets say there are 6 clock cycles to do an aligned 16 bit read, then ideally it is 7 cycles to do an unaligned 16 bit. Seems small but it does add up.
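The "128 bits instead of 64" point can be sketched by counting how many bus-width units an access touches. This assumes a 64 bit (8 byte) data bus as in the paragraph above; the function name is illustrative, and actual clock counts depend on the bus protocol and core.

```python
def bus_beats(addr, size, bus_bytes=8):
    """Number of bus-width units an access of `size` bytes at `addr` touches.

    Assumes a hypothetical bus that always transfers whole, aligned
    bus-width units (8 bytes = 64 bits by default).
    """
    first = addr // bus_bytes
    last = (addr + size - 1) // bus_bytes
    return last - first + 1

# A 16-bit read that fits inside one 64-bit unit: one beat.
print(bus_beats(6, 2))
# A 16-bit read straddling two units: two beats, i.e. 128 bits read.
print(bus_beats(7, 2))
```

Each extra beat is the extra bus clock the answer refers to.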
Caches help a lot because the DRAM side of the cache will be set up to use multiples of the bus width and will always do aligned accesses for cache fetches and evictions. Non-cached accesses will follow the same pain, except the DRAM side is not handfuls of clocks but dozens to hundreds of clocks of overhead.
For random access, a single 16 bit read that not only spans a bus width boundary but also happens to cross a cache line boundary will not just incur the one additional clock on the processor side, but worst case it can incur an additional cache line fetch, which is dozens to hundreds of additional clock cycles. If you were walking through an array of things that happen to not be aligned (structures/unions may be an example depending on the compiler and code), that additional cache line fetch would have happened anyway. If the array of things is a little over on one or both ends then you might still incur one or two more cache line fetches that you would have avoided had the array been aligned.
That is really the key to this on reads: before or after an aligned area, you might have to incur an extra transfer for each side you spill into.
Writes are both good and bad. Random reads are slower because the transaction has to stall until the answer comes back. For a random write the memory controller has all the information it needs: it has the address, data, byte mask, transfer type, etc. So it is fire and forget; the processor has done its job and can call the transaction complete from its perspective and move on. Naturally, gang too many of these up, or do a read on something just written, and then the processor stalls due to the completion of a prior write in addition to the current transaction.
An unaligned 16 bit write, for example, does not only incur the additional read cycle: assuming a 32 or 64 bit wide bus, that would be one byte per location, so you have to do a read-modify-write on whatever that closest memory is (cache or DRAM). So depending on how the processor and then the memory controller implement it, it can be two individual read-modify-write transactions (unlikely since that incurs twice the overhead), or a double width read, modify both parts, and a double width write, incurring two additional clocks over and above the overhead; the overhead is doubled as well. If it had been an aligned bus width write then no read-modify-write is required, you save the read. Now if this read-modify-write is in the cache then that is pretty fast, but still noticeable, up to a few clocks depending on what is queued up that you have to wait on.
I am also most familiar with ARM. ARM traditionally would punish an unaligned access with an abort. You could turn that off, and you would instead get a rotation of the bus rather than it spilling over, which would make for some nice freebie endian swaps. The more modern ARM cores will tolerate and implement an unaligned transfer. Understand, for example, that a store multiple of say 4 or more registers against a non-64-bit-aligned address is not considered an unaligned access, even though it is a 128 bit write to an address that is neither 64 nor 128 bit aligned. What the processor does in that case is break it into 3 writes: an aligned 32 bit write, an aligned 64 bit write and an aligned 32 bit write. The memory controller does not have to deal with the unaligned stuff. That is for legal things like store multiple. The core I am familiar with won't do a write length of more than 2 anyway; an 8 register store multiple is not a single length of 4 write, it is 2 separate length of two writes. But a load multiple of 8 registers, so long as it is aligned on a 64 bit address, is a single length of 4 transaction. I am pretty sure that since there is no masking on the bus side for a read, and everything is in units of bus width, there is no reason to break say a 4 register load multiple on an address that is not 64 bit aligned into 3 transactions; simply do a length of 3 read. When the processor reads a single byte you can't tell that from the bus; all you see is a 64 bit read AFAIK. The processor strips the byte lane out. If the processor/bus does care, be it ARM, x86, MIPS, etc, then sure, you will hopefully see separate transfers.
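The 32/64/32 split described above can be reproduced with a simple greedy rule: at each step, emit the largest naturally aligned chunk that fits. This is only a sketch of that rule under the assumption of a 64 bit bus and power-of-two chunk sizes; it is not taken from any ARM documentation, and real cores may split differently.

```python
def aligned_chunks(addr, size, bus_bytes=8):
    """Split a write into naturally aligned chunks no wider than the bus.

    Hypothetical model: `bus_bytes` must be a power of two. Returns a
    list of (address, chunk_size_in_bytes) pairs.
    """
    chunks = []
    while size > 0:
        chunk = bus_bytes
        # Shrink until the chunk is aligned at addr and fits the remainder.
        while chunk > 1 and (addr % chunk != 0 or chunk > size):
            chunk //= 2
        chunks.append((addr, chunk))
        addr += chunk
        size -= chunk
    return chunks

# Four 32-bit registers (16 bytes) stored at an address that is
# 32-bit aligned but not 64-bit aligned.
print(aligned_chunks(4, 16))
# The same store at a 64-bit aligned address.
print(aligned_chunks(8, 16))
```

Under these assumptions the first call yields a 4 byte, an 8 byte and a 4 byte write, matching the three writes described in the answer, while the aligned case needs only two full-width writes.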
Does everyone do this? No, older processors (not thinking of an ARM nor x86) would put more burden on the memory controller. I don't know what modern x86 and MIPS and such do.
Your malloc example. First off, you are not going to see single bus transfers of 4Kbytes; that 4k will be broken up into digestible bits anyway. It also has to do one to many bus cycles against the memory management unit to find the physical address and other properties anyway (those answers can get cached to make them faster, but sometimes they have to go all the way out to slow DRAM). So for that example the only transfer that matters is an unaligned transfer that splits the 4k boundary, say a 16 bit transfer. For the MMU system to work at all, the only way for that to be supported is that it has to be turned into two separate 8 bit transfers that happen in those physical address spaces, and yes, that literally doubles everything: the MMU lookup cycles, the cache/DRAM bus cycles, etc. Other than that boundary there is nothing special about your 8k being split. The bulk of your cycles will be within one of the two 4k pages, so it looks like any other random access, with of course repetitive/sequential accesses gaining the benefit of caching.
The short answer is that no matter what platform you are on, either 1) the platform will abort an unaligned transfer, or 2) somewhere in the path there is an additional one or more (dozens/hundreds) clock cycles as a result of the unaligned access compared to an aligned access.

It doesn't matter whether the physical pages are adjacent or not. Modern CPUs use caches. Data is transferred to/from DRAM a full cache-line at a time. Thus, DRAM will never see a multi-byte read or write that crosses a 64B boundary, let alone a page boundary.
Stores that cross a page boundary are still slow (on modern x86). I assume the hardware handles the page-split case by detecting it at some later pipeline stage, and triggering a re-do that does two TLB checks. IDK if Intel designs insert extra uops into the pipeline to handle it, or what. (i.e. impact on latency, throughput of page-splits, throughput of all memory accesses, throughput of other (e.g. non-memory) uops).
Normally there's no penalty at all for unaligned accesses within a cache-line (since about Nehalem), and a small penalty for cache-line splits that aren't page-splits. An even split is apparently cheaper than others. (e.g. a 16B load that takes 8B from one cache line and 8B from another).
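The three cost classes mentioned above (within a line, cache-line split, page split) can be told apart with simple address arithmetic. The sketch below assumes 64 byte cache lines and 4 KiB pages, as used elsewhere in this thread; the function name and return strings are made up for illustration.

```python
CACHE_LINE = 64   # assumption: 64-byte cache lines
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def classify_access(addr, size):
    """Classify a load/store by the boundaries it crosses."""
    last = addr + size - 1  # address of the last byte touched
    if addr // PAGE_SIZE != last // PAGE_SIZE:
        return "page split"
    if addr // CACHE_LINE != last // CACHE_LINE:
        return "cache-line split"
    return "within one cache line"

print(classify_access(3, 8))       # unaligned, but inside one line
print(classify_access(60, 8))      # crosses a 64-byte line boundary
print(classify_access(0xFFC, 8))   # crosses a 4 KiB page boundary
```

A page split is always also a cache-line split, which is why the page check comes first.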
Anyway, DRAM will never see an unaligned access directly. AFAIK, no sane modern design has only write-through caches, so DRAM only sees writes when a cache-line is flushed, at which point the fact that one unaligned access dirtied two cache lines is not available. Caches don't even record which bytes are dirty; they just burst-write the whole 64B to the next level down (or last-level to DRAM) when needed.
There are probably some CPU designs that don't work this way, but Intel and AMD's designs both work this way.
Caveat: loads/stores to uncacheable memory regions might produce smaller stores, but probably still only within a single cache-line. (On x86, this probably applies to MOVNT non-temporal stores that use write-combining store buffers but otherwise bypass the cache).
Uncacheable unaligned stores that cross a page boundary are probably still split into separate stores (because each part needs a separate TLB translation).
Caveat 2: I didn't fact-check this. I'm certain about the whole-cache-line aligned access to DRAM for "normal" loads/stores to "normal" memory regions, though.

What is the difference between a 32-bit and 64-bit processor?

I have been trying to read up on 32-bit and 64-bit processors (http://en.wikipedia.org/wiki/32-bit_processing). My understanding is that a 32-bit processor (like x86) has registers 32-bits wide. I'm not sure what that means. So it has special "memory spaces" that can store integer values up to 2^32?
I don't want to sound stupid, but I have no idea about processors. I'm assuming 64-bits is, in general, better than 32-bits. Although my computer now (one year old, Win 7, Intel Atom) has a 32-bit processor.

All calculations take place in the registers. When you're adding (or subtracting, or whatever) variables together in your code, they get loaded from memory into the registers (if they're not already there, but while you can declare an infinite number of variables, the number of registers is limited). So, having larger registers allows you to perform "larger" calculations in the same time. Not that this size-difference matters so much in practice when it comes to regular programs (since at least I rarely manipulate values larger than 2^32), but that is how it works.
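As a rough illustration of what register width means for arithmetic, the Python sketch below models an unsigned add that wraps at the register size. The function name is made up here; real CPUs also set carry/overflow flags, which this ignores.

```python
def add_in_register(a, b, bits):
    """Unsigned add as it would land in a register `bits` wide."""
    mask = (1 << bits) - 1  # largest value the register can hold
    return (a + b) & mask

a, b = 4_000_000_000, 1_000_000_000
print(add_in_register(a, b, 32))  # wraps: the true sum does not fit in 32 bits
print(add_in_register(a, b, 64))  # fits comfortably in 64 bits
```

Note that the largest unsigned value a 32 bit register can hold is 2^32 - 1, not 2^32.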
Also, certain registers are used as pointers into your memory space and hence limit the maximum amount of memory that you can reference. A 32-bit processor can only reference 2^32 bytes (which is about 4 GB of data). A 64-bit processor can obviously manage a whole lot more.
There are other consequences as well, but these are the two that come to mind.

First, 32-bit and 64-bit are called architectures.
The architecture indicates how much data a microprocessor will process within one instruction cycle, i.e. fetch-decode-execute.
In one second there might be thousands to billions of instruction cycles depending upon a processor design.
32-bit means that a microprocessor can execute 4 bytes of data in one instruction cycle while 64-bit means that a microprocessor can execute 8 bytes of data in one instruction cycle.
Since microprocessor needs to talk to other parts of computer to get and send data i.e. memory, data bus and video controller etc. so they must also support 64-bit data transfer theoretically. However, for practical reasons such as compatibility and cost, the other parts might still talk to microprocessor in 32 bits. This happened in original IBM PC where its microprocessor 8088 was capable of 16-bit execution while it talked to other parts of computer in 8 bits for the reason of cost and compatibility with existing parts.
Imagine that on a 32 bit computer you need to write 'a' as 'A', i.e. in CAPSLOCK. The operation only requires 2 bytes, while the computer will read 4 bytes of data, resulting in overhead. This overhead increases to 6 bytes on a 64 bit computer. So, 64 bit computers are not necessarily faster all the time.
Remember that 64 bit Windows can run on a microprocessor only if it supports 64-bit execution.

The processor fetches data from memory, i.e. RAM, by giving its address to the MAR (Memory Address Register). Selector electronics then find that address in the memory bank, retrieve the data and put it in the MDR (Memory Data Register). This data is then stored in one of the registers in the processor for further processing. That's why the size of the data bus determines the size of the registers in the processor. Now, if my processor has 32 bit registers, it can fetch only 4 bytes of data at a time. If the data size exceeds 32 bits, it would require two fetch cycles to have the data in it. This slows down a 32 bit machine compared to a 64 bit one, which would complete the operation in ONE fetch cycle only. So, obviously, for smaller data it makes no difference if the processors are clocked at the same speed.
Again, with 64 bit processor and 64 bit OS, my instructions will be of 64 bit size always... which unnecessarily uses up more memory space.

32bit processors can address a memory bank with a 32 bit address width. So you can have 2^32 memory cells and therefore a limited amount of addressable memory (~ 4GB). Even when you add another memory bank to your machine it cannot be addressed. 64bit machines, by contrast, can address up to 2^64 memory cells.

This answer is probably 9 years too late, but I feel that the above answers don't adequately answer the question.
The terms 32-bit and 64-bit are not well defined or regulated by any standards body. They are merely intuitive concepts. A 32-bit or 64-bit CPU generally refers to the native word size of the CPU's instruction set architecture (ISA). So what is an ISA and what is a word size?
ISA and word size
An ISA is the set of machine instructions / assembly mnemonics used by the CPU. They are the lowest level of software, which directly tells the hardware what to do. Example:
ADD r2,r1,r3 # add instruction in ARM architecture to do r2 = r1 + r3
# r1, r2, r3 refer to values stored in register r1, r2, r3
# using ARM since Intel isn't the best when learning about ISA
The old definition of word size would be the number of bits the CPU can compute in one instruction cycle. In modern context the word size is the default size of the registers or size of the registers the basic instruction acts upon (I know I kept a lot of ambiguity in this definition, but it's an intuitive concept across multiple architectures which don't completely match with each other). Example:
ADD16 r2,r1,r3 # perform addition half-word wise (assuming 32 bit word size)
ADD r2,r1,r3 # default add instruction works in terms of the word size
Example bit-ness of a Pentium Pro CPU with PAE
First, various word sizes in general purpose instructions:
Arithmetic, logical instructions: 32 bit (Note that this violates old concept of word size since multiply and divide takes more than one cycle)
Branch, jump instructions: 32 bit for indirect addressing, 16-bit for immediate (Again Intel isn't a great example because of CISC ISA and there is enough complexity here)
Move, load, store: 32 bit for indirect, 16 bit for immediate (These instructions may take several cycles, so old definition of word size does not hold)
Second, bus and memory access sizes in hardware architecture:
Logical address size before virtual address translation: 32 bit
Virtual address size: 64-bit
Physical address size post translation: 36 bit (system bus address bus)
System bus data bus size: 256 bit
So from all the above sizes, most people intuitively called this a 32-bit CPU (despite no clear consensus on ALU word size and address bit size).
An interesting point to note here is that in the olden days (70s and 80s) there were CPU architectures whose ALU word size was very different from its memory access size. Also note that we haven't even dealt with the quirks in non-general purpose instructions.
Note on Intel x86_64
Contrary to popular belief, x86_64 is not a 64-bit architecture in the truest sense of the word. It is a 32 bit architecture which supports extension instructions which can do 64 bit operations. It also supports a 64-bit logical address size. Intel themselves call this ISA IA32e (IA32 extended, with IA32 being their 32-bit ISA).
References
ARM instruction examples
Intel addressing modes

From here:
The main difference between 32-bit processors and 64-bit processors is
the speed they operate. 64-bit processors can come in dual core, quad
core, and six core versions for home computing (with eight core
versions coming soon). Multiple cores allow for increase processing
power and faster computer operation. Software programs that require
many calculations to function operate faster on the multi-core 64-bit
processors, for the most part. It is important to note that 64-bit
computers can still use 32-bit based software programs, even when the
Windows operating system is a 64-bit version.
Another big difference between 32-bit processors and 64-bit processors
is the maximum amount of memory (RAM) that is supported. 32-bit
computers support a maximum of 3-4GB of memory, whereas a 64-bit
computer can support memory amounts over 4 GB. This is important for
software programs that are used for graphical design, engineering
design or video editing, where many calculations are performed to
render images, drawings, and video footage.
One thing to note is that 3D graphic programs and games do not benefit
much, if at all, from switching to a 64-bit computer, unless the
program is a 64-bit program. A 32-bit processor is adequate for any
program written for a 32-bit processor. In the case of computer games,
you'll get a lot more performance by upgrading the video card instead
of getting a 64-bit processor.
In the end, 64-bit processors are becoming more and more commonplace
in home computers. Most manufacturers build computers with 64-bit
processors due to cheaper prices and because more users are now using
64-bit operating systems and programs. Computer parts retailers are
offering fewer and fewer 32-bit processors and soon may not offer any
at all.

32-bit and 64-bit basically refer to the register size; registers are the fastest type of memory and are closest to the CPU. A 64-bit register can store more data for addressing and transmission than a 32-bit register, but there are other factors on which the speed of a processor is measured, such as the number of cores, cache memory, architecture etc.
Reference: Difference between 32-bit processor and 64-bit processor

From what is the meaning of 32 bit or 64 bit
process??
by kenshin123 :
The virtual addresses of a process are the mappings of an address
table that correspond to real physical memory on the system. For
reasons of efficiency and security, the kernel creates an abstraction
for a process that gives it the illusion of having its own address
space. This abstraction is called a virtual address space. It's just a
table of pointers to physical memory.
So a 32-bit process is given about 2^32 or 4GB of address space. What
this means under the hood is that the process is given a 32-bit page
table. In addition, this page table has a 32-bit VAS that maps to 4GB
of memory on the system.
So yes, a 64-bit process has a 64-bit VAS. Does that make sense?
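The "table of pointers to physical memory" idea can be sketched as a toy single-level page table. This is a deliberately simplified model with made-up frame numbers and 4 KiB pages; real page tables are multi-level structures walked by hardware, with permission bits and more.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def translate(vaddr, page_table):
    """Translate a virtual address using a toy single-level page table.

    `page_table` maps virtual page numbers to physical frame numbers.
    A missing entry stands in for a page fault.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + offset

# Two virtual pages that are adjacent in the process's address space
# but live in non-adjacent physical frames (numbers are hypothetical).
page_table = {0: 7, 1: 3}

print(hex(translate(0x0FFC, page_table)))  # last bytes of virtual page 0
print(hex(translate(0x1000, page_table)))  # first byte of virtual page 1
```

The process only ever sees the contiguous virtual addresses; where the frames actually sit in physical memory is hidden by the table.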

There are 8 bits in a byte, so if it's 32 bit you are processing 4 bytes of data at a time, at whatever GHz or MHz your CPU is clocked at. So if there is a 64 bit CPU and a 32 bit CPU clocked at the same speed, the 64 bit CPU would be faster.

32 bit processors process 32 bits of data at a time, at whatever GHz the processor is clocked at, and 64 bit processors process 64 bits of data at a time, at whatever speed your PC has. Also, 32 bit processors work with up to 4GB of RAM.

What are 16, 32 and 64-bit architectures?

What do 16-bit, 32-bit and 64-bit architectures mean in case of Microprocessors and/or Operating Systems?
In case of Microprocessors, does it mean maximum size of General Purpose Registers or size of Integer or number of Address-lines or number of Data Bus lines or what?
What do we mean by saying "DOS is a 16-bit OS", "Windows is a 32-bit OS", etc...?

My original answer is below, if you want to understand the comments.
New Answer
As you say, there are a variety of measures. Luckily for many CPUs a lot of the measures are the same, so there is no confusion. Let's look at some data (Sorry for image upload, I couldn't see a good way to do a table in markdown).
As you can see, many columns are good candidates. However, I would argue that the size of the general purpose registers (green) is the most commonly understood answer.
When a processor is very varied in size for different registers, it will often be described in more detail, eg the Motorola 68k being described as a 16/32bit chip.
Others have argued it is the instruction bus width (yellow) which also matches in the table. However, in today's world of pipelining I would argue this is a much less relevant measure for most applications than the size of the general purpose registers.
Original answer
Different people can mean different things, because as you say there are several measures. So for example someone talking about memory addressing might mean something different to someone talking about integer arithmetic. However, I'll try and define what i think is the common understanding.
My take is that for a CPU it means "The size of the typical register used for standard operations" or "the size of the data bus" (the two are normally equivalent).
I justify this with the following logic. The Z80 has an 8bit accumulator and an 8 bit databus, while having 16bit memory addressing registers (IX, IY, SP, PC), and a 16bit memory address bus. And the Z80 is called an 8bit microprocessor. This means people must normally mean the main integer arithmetic size, or databus size, not the memory addressing size.
It is not the size of instructions, as the Z80 (again) had 1,2 and 3 byte instructions, though of course the multi-byte were read in multiple reads. In the other direction, the 8086 is a 16bit microprocessor and can read 8 or 16bit instructions. So I would have to disagree with the answers that say it is instruction size.
For Operating systems, I would define it as "the code is compiled to run on a CPU of that size", so a 32bit OS has code compiled to run on a 32 bit CPU (as per the definition above).

How many bits a CPU "is" means what its instruction word length is.
On a 32 bit CPU, the word length of such an instruction is 32 bits, meaning that this is the width that the CPU can handle as instructions or data, often resulting in a bus with that width.
For a similar reason, registers have the size of the CPU's word length, but you often have larger registers for different purposes.
Take the PDP-8 computer as an example. This was a 12 bit computer. Each instruction was 12 bit long. To handle data of the same width, the accumulator was also 12 bit.
But what makes the 12-bit computer a 12 bit machine, was its instruction word length. It had twelve switches on the front panel with which it could be programmed, instruction by instruction.
This is a good example to break out of the 8/16/32 bit focus.
The bit count is also typically the size of the address bus. It therefore usually tells the maximum addressable memory.
There's a good explanation of this at Wikipedia:
In computer architecture, 32-bit integers, memory addresses, or other data units are those that are at most 32 bits (4 octets) wide. Also, 32-bit CPU and ALU architectures are those that are based on registers, address buses, or data buses of that size. 32-bit is also a term given to a generation of computers in which 32-bit processors were the norm.
Now let's talk about OS.
With OSes, this is far less bound to the actual "bitty-ness" of the CPU. It usually reflects how opcodes are assembled (for which word length of the CPU), how registers are addressed (you can't load a 32-bit value into a 16-bit register), and how memory is addressed. Think of it as the completed, compiled program. It is stored as binary instructions and therefore has to fit into the CPU's word length. Task-wise, it has to be able to address the whole memory, otherwise it couldn't do proper memory management.
What it comes down to is this: whether a program is 32 or 64 bit (an OS is essentially a program here) is determined by how its binary instructions are stored and how registers and memory are addressed. All in all, this applies to all kinds of programs, not just OSes. That's why you have programs compiled for 32 bit or 64 bit.

The difference comes down to the width of the data a general-purpose register can operate on. A 16-bit register operates on 2 bytes at a time, a 64-bit register on 8. You can often increase the throughput of a processor by handling more data per clock cycle.

The definitions are marketing terms more than precise technical terms.
In fuzzy technical terms they are more related to architecturally visible widths than to any real implementation register or bus width. For instance the 68008 was classed as a 32-bit CPU, but had 16-bit registers in the silicon and only an 8-bit data bus and 20-odd address bits.

http://en.wikipedia.org/wiki/64-bit#64-bit_data_models the data models mean bitness for the language.
The "OS is x-bit" phrase usually means that the OS was written for x-bit cpu mode, that is, 64-bit Windows uses long mode on x86-64, where registers are 64 bits and address space is 64-bits large and there are other distinct differences from 32-bits mode, where typically registers are 32-bits wide and address space is 32-bits large. On x86 a major difference between 32 and 64 bits modes is presence of segmentation in 32-bits for historical compatibility.
Usually the OS is written with CPU bitness in mind, x86-64 being a notable example of decades of backwards compatibility - you can have everything from 16-bit real-mode programs through 32-bits protected-mode programs to 64-bits long-mode programs.
Plus there are different ways to virtualise, so your program may run as if in 32-bits mode, but in reality it is executed by a non-x86 core at all.

When we talk about 2^n-bit architectures in computer science, we are basically talking about the size of the registers, the address bus, or the data bus. The basic idea behind the term is that 2^n bits can be used at once to address or transport data.

As far as I know, technically, it's the width of the integer pathways. I've heard of 16bit chips that have 32bit addressing. However, in reality, it is the address width. sizeof(void*) is 16bit on a 16bit chip, 32bit on a 32bit, and 64bit on a 64bit.
This leads to problems because C and C++ allow conversions between void* and integral types, and it's safe only if the integral type is large enough (at least the same size as the pointer). This led to all sorts of unsafe stuff along the lines of
void* p = something;
int i = (int)p;
Which will horrifically crash and burn on 64bit code (works on 32bit) because void* is now twice as big as int.
In most languages, you have to work hard to care about the width of the system you're working on.

Processor, OS : 32bit, 64 bit

I am new to programming and come from a non-CS background (no formal degree). I mostly program winforms using C#.
I am confused about 32 bit and 64 bit. I have heard about 32-bit OSes and 32-bit processors, and that these determine the maximum memory a program can have. How does this affect the speed of a program? There are a lot more questions which keep coming to mind.
I tried to go through some Computer Organization and Architecture books. But, either I am too dumb to understand what is written in there or the writers assume that the reader has some CS background.
Can someone explain these things to me in plain, simple English, or point me to something which does that?
EDIT: I have read things like In 32-bit mode, they can access up to 4GB memory; in 64-bit mode, they can access much much more....I want to know WHY to all such things.
BOUNTY: Answers below are really good....esp one by Martin. But, I am looking at a thorough explanation, but in plain simple English.

Let's try to answer this question by looking at people versus computers; hopefully this will shed some light on things for you:
Things to Keep In Mind
As amazing as they are, computers are very, very dumb.
Memory
People have memory (with the exception, arguably, of husbands and politicians.) People store information in their memory for later use.
With a question (e.g, "What is your phone number?") a person is able to retrieve information to give an answer (e.g., "867-5309")
All modern computers have memory, and store information in their memory for later use.
Because computers are dumb, they can only be asked a very specific question to retrieve information: "What is the value at X in your memory?"
In the question above, X is known as an address, which can also be called a pointer.
So here we have a fundamental difference between people and computers: To recall information from memory, computers need to be given an address, whereas people do not. (Well in a sense one could say "your phone number" is an address because it gives different information than "your birthday", but that's another conversation.)
Numbers
People use the decimal number system. That means for every digit in a decimal number, the digit can be one of 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. People have ten options per digit.
All modern computers use the binary number system. That means for every digit in a binary number, the digit can only be either 1 or 0. Computers have two options per digit.
In computer jargon, a single binary digit is called a bit, short for binary digit.
Addresses
Every address in a computer is a binary number.
Every address in a computer has a maximum number of digits (or bits) that it can have. This is mostly because the computer's hardware is inflexible (also known as fixed) and needs to know ahead of time that an address will only be so long.
Terms like "32-bit" and "64-bit" are talking about the longest address for which a computer can store and retrieve information. In English "32-bit" in this sense means "This computer expects instructions about its memory to have addresses no more than 32 binary digits long."
As you can imagine, the more bits a computer can handle the longer the address it can look up and therefore the more memory it can manage at one time.
32-bit v. 64-bit Addressing
For an inflexible (fixed) number of digits (e.g. 2 decimal digits) the possible numbers you can represent is called the range (e.g. 00 to 99, or 100 unique numbers). Adding an additional decimal digit multiplies the range by 10 (e.g. 3 decimal digits -> 000 to 999, or 1000 unique numbers).
This applies to computers, too, but because they are binary machines instead of decimal machines, adding an additional binary digit (bit) only increases the range by a factor of 2.
Addressing Ranges:
1-bit addressing lets you talk about 2 unique addresses (0 and 1).
2-bit addressing lets you talk about 4 unique addresses (00, 01, 10, and 11).
3-bit addressing lets you talk about 8 unique addresses (000, 001, 010, 011, 100, 101, 110, and 111).
and after a long while... 32-bit addressing lets you talk about 4,294,967,296 unique addresses.
and after an even longer while... 64-bit addressing lets you talk about 18,446,744,073,709,551,616 unique addresses. That's a LOT of memory!
Implications
What all this means is that a 64-bit computer can store and retrieve much more information than a 32-bit computer. For most users this really doesn't mean a whole lot because things like browsing the web, checking email and playing Solitaire all work comfortably within the confines of 32-bit addressing. Where the 64-bit benefit will really shine is in areas where you have a lot of data the computer will have to churn through. Digital signal processing, gigapixel photography and advanced 3D gaming are all areas where their massive amounts of data processing would see a big boost in a 64-bit environment.

It really all comes down to wires.
In digital circuits, only 0's and 1's (usually low voltage and high voltage) can be transmitted from one element (CPU) to another element (memory chip). If I have only 1 wire, I can only send either a 1 or a 0 over the wire per clock cycle. This means I can only address 2 bytes (assuming byte addressing, and that entire addresses are transmitted in just 1 cycle for speed!).
If I have 2 wires, I can address 4 bytes. Because I can send: (0, 0), (0, 1), (1, 0), or (1, 1) over the two wires. So basically it's 2 to the power of # of wires.
So if I have 32 wires, I can address 4 GB, and if I have 64 wires, I can address a lot more.
There are other tricks that engineers can do to address a larger address space than the wires allow for. E.g. splitting up the address into two parts and sending one half in the first cycle and the second half on the next cycle. But that means that your memory interface will be half as fast.
Edited my comments into here (unedited) ;) And making it a wiki if anyone has anything interesting to add as well.
Like other comments have mentioned, 2^32 (2 to the power of 32) = 4294967296, which is 4 GB. And 2^64 is 18,446,744,073,709,551,616. To dig in further (and you probably read this in Hennessy & Patterson), processors contain registers that they use as "scratch space" for storing the results of their computations. A CPU only knows how to do simple arithmetic and how to move data around. Naturally, these registers are the same width in bits as the "#-bits" of the architecture, so a 32-bit CPU's registers will be 32 bits wide, and a 64-bit CPU's registers will be 64 bits wide.
There will be exceptions to this when it comes to floating point (to handle double precision) or SIMD instructions (single-instruction, multiple-data commands). The CPU loads and saves the data to and from the main memory (the RAM). Since the CPU also uses these registers to compute memory addresses (physical and virtual), the amount of memory that it can address is also determined by the width of its registers. There are some CPUs that handle address computation with special extended registers, but those I would call "afterthoughts" added after engineers realized they needed them.
At the moment 64-bits is quite a lot for addressing real physical memory. Most 64-bit CPUs will omit quite a few wires when it comes to wiring up the CPU to the memory due to practicality. It won't make sense to use up precious motherboard real estate to run wires that will always have 0's. Not to mention in order to have the max amount of RAM with today's DIMM density would require 4 billion dimm slots :)
Other than the increased amount of memory, 64-bit processors offer faster computation for integer numbers larger than 2^32. Previously programmers (or compilers, which are also programmed by programmers ;) would have to simulate having a 64-bit register by taking up two 32-bit registers and handling any overflow situations. But on 64-bit CPUs it would be handled by the CPU itself.
The drawback is that a 64-bit CPU (with everything else equal) would consume more power than a 32-bit CPU just due to (roughly) twice the amount of circuitry needed. However, in reality you will never get an equal comparison because newer CPUs will be manufactured in newer silicon processes that have less power leakage, allow you to cram more circuitry into the same die size, etc. But 64-bit programs also tend to consume more memory, up to twice as much for pointer-heavy data. What was once considered "ugly" about x86's variable instruction length is actually an advantage now compared to architectures that use a fixed instruction size.

Many modern processors can run in two modes: 32-bit mode, and 64-bit mode. In 32-bit mode, they can access up to 4GB memory; in 64-bit mode, they can access much much more. Older processors only support 32-bit mode.
Operating systems choose to use the processors in one of these modes: at installation time, a choice is made whether to operate the processor in 32-bit mode or in 64-bit mode. Even though the processor could operate in 64-bit mode, switching from 32-bit to 64-bit would require a reinstallation of the system. Older systems only support 32-bit mode.
Applications can also be written in (or compiled for) 32-bit or 64-bit mode. Compatibility here is more tricky, as the processor, when run in 64-bit mode, can still support 32-bit applications as an emulation feature. So on a 64-bit operating system, you can run either 32-bit applications or 64-bit applications. On a 32-bit operating system, you can run only 32-bit applications.
Again, choosing the size is primarily a matter of the amount of main memory you want to access. 32-bit applications are often restricted to 2GB on many systems, since the system needs some address space for itself.
From a performance (speed) point of view, there is no significant difference. 64-bit applications may be a bit slower because they use 64-bit pointers, so they need more memory accesses for a given operation. At the same time, they may also be a bit faster, since they can perform 64-bit integer operations as one instruction, whereas 32-bit processors need to emulate them with multiple instructions. However, those 64-bit integer operations are fairly uncommon.
One also may wonder what the cost is of running a 32-bit application on a 64-bit processor: on AMD64 and Intel64 processors, this emulation mode is mostly in hardware, so there is no real performance loss over running the 32-bit application natively. This is significantly different on Itanium, where 32-bit (x86) applications are emulated very poorly.

Let me tell you the story of Binville, a small town in the middle of nowhere. Binville had one road leading to it. Every person either coming to or leaving Binville had to drive on this road. But as you approached the town, there was a fork. You could either go left or go right.
In fact, every road had a fork in it, except for the roads leading up to the homes themselves. Those roads simply ended at the house. None of the roads had names; they didn't need names thanks to an ingenious addressing scheme created by the Binville Planning Commission. Here's a map of Binville, showing the roads and the houses:
              ------- []  00
             /
       ------
      /      \
     /        ------- []  01
-----
     \        ------- []  10
      \      /
       ------
             \
              ------- []  11
As you can see, each house has a two-digit address. That address alone is enough to a) uniquely identify each house (there are no repeats) and b) tell you how to get there. It's easy to get around town, you see. Each fork is labeled with a zero or one, which the Planning Commission calls the Binville Intersection Tracer, or bit for short. As you approach the first fork, look at the first bit of the address. If it's a zero, go left; if it's a one, go right. Then look at the second digit when you get to the second fork, going left or right as appropriate.
Let's say you want to visit your friend who lives in Binville. She says she lives in house 10. When you get to Binville's first fork, go right (1). Then at the second fork, go left (0). You're there!
Binville existed like this for several years but word started to get around about its idyllic setting, great park system, and generous health care. (After all, if you don't have to spend money on street signs, you can use it on better things.) But there was a problem. With only two bits, the addressing scheme was limited to four houses!
So the Planning Commission put their heads together and came up with a plan: they would add a bit to each address, thereby doubling the number of houses. To implement the plan, they would build a new fork at the edge of town and everyone would get new addresses. Here's the new map, showing the new fork leading into town and the new part of Binville:
                    ------- []  000
                   /
             ------
            /      \
           /        ------- []  001
      -----                          Old Binville
     /     \        ------- []  010
    /       \      /
   /         ------
  /                \
 /                  ------- []  011
--
 \                  -------     100
  \                /
   \         ------
    \       /      \
     \     /        ------- []  101
      -----                          New Binville (some homes not built yet)
           \        -------     110
            \      /
             ------
                   \
                    -------     111
Did you notice that everyone in the original part of Binville simply added a zero to the front of their address? The new bit represents the new intersection that was built. When the number of bits is increased by one, the number of addresses doubles. The citizens always knew the maximum size of their town: all they had to do was compute the value of two raised to the power of the number of bits. With three bits, they could have 2^3 = 8 houses.
A few years went by and Binville was once again filled to capacity. More people wanted to move in, so another bit was added (along with the requisite intersection), doubling the size of the town to sixteen houses. Then another bit, and another, and another... Binville's addresses were soon at sixteen bits, able to accommodate up to 2^16 (65,536) houses, but it wasn't enough. The people kept coming and coming!
So the Planning Commission decided to solve the problem once and for all: they would jump all the way to thirty-two bits. With sufficient addresses for over four billion homes (2^32), surely that would be enough!
And it was... for about twenty-five years, when Binville was no longer a small town in the middle of nowhere. It was now a major metropolis. In fact, it was getting to be as big as a whole nation with billions of residents. But the parks were still nice and everyone had great health care, so the population kept growing.
Faced with the ever-increasing population, the Planning Commission once again put their heads together and proposed another expansion of the city. This time they would use 64 bits. Do you know how many homes could fit within the Binville city limits now? That's right: 18,446,744,073,709,551,616. That number is so big, we could populate about two billion Earths and give everyone their own address.
Using 64 bits wasn't a panacea for all their addressing problems. The addresses take twice as much space to write as the old 32-bit addresses did. Worse, some citizens hadn't yet updated their addresses to use the new 64-bit format, so they were forced into a walled-off section of the city reserved specifically for those still using 32-bit addresses. But that was OK: the people using 32 bits had access to more than enough of the city to suit their needs. They didn't feel the need to change just yet.
Will 64 bits be enough? Who knows at this time, but citizens of Binville are waiting for the announcement of 128-bit addresses...

Martin's answer is mostly correct and detailed.
I thought I would just mention that all the memory limits are per-application virtual memory limits, not limits for the actual physical memory in the computer. In fact it's possible to work with more than 4 GB of memory in a single application even on 32-bit systems; it just requires more work, since it can't all be accessible using pointers at one time.
Another thing that was not mentioned is that the difference between a traditional x86 processor and x86-64 is not only in the pointer size, but also in the instruction set. While the pointers are larger and consume more memory (8 bytes instead of 4), this is compensated by a larger register set (16 general purpose registers instead of 8), so the performance can actually be better for code that does computational work.

Martin's answer is excellent. Just to add some additional points... since you mention .NET, you should note that the CLI/JIT has some differences between x86 and x64, with different optimisations (tail-call, for example), and some subtle different behaviour of advanced things like volatile. This can all have an impact on your code.
Additionally, not all code works on x64. Anything that uses DirectX or certain COM features may struggle. Not really a performance feature, but important to know.
(I removed "DirectX" - I might be talking rubbish there... but simply: you need to check that anything you depend upon is stable on your target platform)

Think of a generic computer's memory as a massive bingo card with billions of squares. To address any individual square on the board there is a scheme to label each row and column: B-5, I-12, O-52, etc.
If there are enough squares on the card eventually you will run out of letters so you will need to start reusing more letters and writing larger numbers to continue to be able to uniquely address each square.
Before you know it the announcer is spouting annoyingly huge numbers and letter combinations to let you know which square to mark on your 10 billion square card.
BAZC500000, IAAA12000000, OAAAAAA523111221
The bit count of the computer specifies its limit of the complexity of the letters and numbers to address any specific square.
32-bits means that if the card is any bigger than 2^32 squares, the computer does not have enough wires and transistors to uniquely physically address every square, which it needs to do to read a value from or write a new value to the specified memory location.
64-bit computers can individually address a massive 2^64 squares.. but to do so each square needs a lot more letters and numbers to make sure each square has its own unique address. This is why 64-bit computers need more memory.
Other common examples of addressing limits are local telephone numbers. They are usually 7 digits, 111-2222, or reformatted as a number, 1,112,222.. what happens when there are more than 9,999,999 people who want their own telephone numbers? You add area codes and country codes and your phone number goes from 7 digits to 10 to 11, taking up more space.
If you are familiar with the impending IPv4 shortage, it's the same problem.. IPv4 addresses are 32 bits, meaning there are only 2^32 (~4 billion) unique IP addresses possible, and there are many more people than that alive today.
There is overhead in all schemes I mentioned (computers, phone numbers, IPv4 addresses) where certain portions are reserved for organizational purposes so the usable space is much less.
The performance promise for the 64-bit world is that instead of sending 4 bytes at a time (ABCD) a 64-bit computer can send 8 bytes at a time (ABCDEFGH), so the alphabet is transferred between different areas of memory up to twice as fast as on a 32-bit computer. There is also a benefit for some applications that just run faster when they have more memory they can use.
In the real world 64-bit desktop processors by Intel et al. are not really true 64-bit processors and are still limited to 32 bits for several types of operations, so in the real world the performance difference between 32-bit and 64-bit applications is marginal. 64-bit mode gives you more hardware registers to work with, which does improve performance, but addressing more memory on a "fake" 64-bit processor can also hurt performance in some areas, so it's usually a wash. In the future we will be seeing more performance improvements when desktop processors become fully 64-bit.

I don't think I've seen much of the word 'register' in the previous answers. A digital computer is a bunch of registers, with logic for arithmetic and memory to store data and programs.
But first ... digital computers use a binary representation of numbers because the binary digits ('bits') 0 and 1 are easily represented by the two states (on/off) of a switch. Early computers used electromechanical switches; modern computers use transistors because they're smaller and faster. Much smaller, and much faster.
Inside the CPU, the switches are grouped together in registers of a finite length, and operations are typically performed on entire registers: For example, add this register to that, and so on. As you would expect, a 32-bit CPU has registers 32 bits long. I'm simplifying here, but bear with me.
It makes sense to organise the computer memory as a series of 'locations', each holding the same number of bits as a CPU register: for example, load this register from that memory location. Actually, if we think of memory as bytes, that's just a convenient fraction of a register, and we might load a register from a series of memory locations (1, 2, 4, 8).
As transistors get smaller, additional logic for more complex arithmetic can be implemented in the limited space of a computer chip. CPU real estate is always at a premium.
But with improvements in chip fabrication, more transistors can be reliably made on only slightly larger chips. Registers can be longer and the paths between them can be wider.
When the registers which hold the addresses of memory locations are longer, they address larger memories and data can be manipulated in larger chunks. In combination with the more complex arithmetic logic, things get done faster.
And isn't that what we're all after?

to explain WHY 32 bit mode can only access 4GB of RAM:
Maximum accessible memory space = 2^n bytes, where n is the word length of the architecture. So in a 32-bit architecture, the maximum accessible memory space is 2^32 = 4294967296 bytes = 4GB of RAM.
A 64-bit architecture would be able to access 2^64 = LOTS of memory.
Just noticed Tchens comments going over this. Anyways, without a CS background, yes computer organization and architecture books are going to be difficult to understand at best.

The processor uses base-2 to store numbers. Base 2 was probably chosen because it's the "simplest" of all bases: for example the base-2 multiplication table has only 4 cells while base "10" multiplication table has a 100 cells.
Before 2003, common PC processors were only "32-bit-capable".
That means that the processor's native numerical operations were for 32-bit numbers.
You can still do numerical operations for larger numbers, but those would have to be performed by programs executed by the processor, and not be the "primitive actions" (commands in machine-language) supported by the processor like those for 32-bit-integers (at the time)
32 bits were chosen because CPU engineers are fond of powers of 2, and 16-bits weren't enough
Why weren't 16 bits enough? With 16 bits you can represent integers in the range of 0-65535
65535 = 1111111111111111 in binary (= 2^0+2^1+2^2+...+2^15 = 2^16-1)
65535 is not enough because for example, a Hospital management software needs to be able to count more than 65535 patients
Usually people consider the size of the computer's memory when discussing how big its integers should be. 65535 is definitely not enough. Computers have way more RAM than that, and it doesn't matter if you count in "Bytes" or bits
32 bits was considered enough for a while. In 2003 AMD Introduced the first 64-bit-capable "x86" processor. Intel soon followed.
Actually 16 bit was considered enough a long while ago.
It is common practice for lots of hardware and software to be backward-compatible. In this case it means the 64-bit-capable CPUs can also run every software the 32-bit-capable CPUs can.
Backward compatibility is strived for as a business strategy. More users will want to upgrade to the better processor if it can also do everything the previous one could.
In CPUs, backward compatibility means that the new actions the CPU supports are added to the previous machine language. For example, the previous machine language may have had some specification like "all opcodes starting in 1111 are reserved for future use".
In theory this kind of CPU backward compatibility wouldn't have been necessary, as all software could have just been recompiled to the new, incompatible machine language. However that's not the case because of corporate strategies and political or economic systems. In a utopian "open source" world, backward compatibility of machine languages would probably not be a concern.
The backward compatibility of x86-64 (the common 64-bit CPUs' machine language) comes in the form of a "compatibility mode". This means that any program wishing to make use of the new CPU capabilities needs to notify the CPU (through the OS) that it should run in "64-bit mode". Then it can use the great new 64-bit capabilities of the CPU.
Therefore, for a program to use the CPU's 64-bit capabilities: The CPU, the OS, and the program, all have to "support 64-bits".
64-bits is enough to give every person in the world several unique numbers. It's probably big enough for most current computing endeavors. It's probably unlikely that future CPUs will shift further to 128 bits. But if they do, that's definitely enough for everything I can imagine, and therefore a 256-bits transition won't be necessary.
I hope this helps.

It's worth noting that certain applications (e.g. multimedia encoding/decoding and rendering) can gain a significant (2x) performance boost when written to fully utilize 64-bit.
See 32-bit vs. 64-bit benchmarks for Ubuntu and Windows Vista

For a non-CS person: 64-bit will work better for calculations (of all kinds), and it will also allow you to have more RAM.
Also if you have limited RAM (in VPS for example or small-RAM dedicated server) - choose 32 bit, services there will eat less RAM.

This is a very simple explanation, given that everything above is quite detailed.
32-bit refers to the registers. Registers are places to store data, and all programs operate by manipulating these things. Assembly operates directly on them (and hence why people are often excited to program in assembly).
32-bit means the basic set of registers can hold 32 bits of information. 64-bit means, unsurprisingly, 64 bits of info.
Why can this make programs faster? Because you can do larger operations faster. It will only make certain types of programs faster, by the way. Games, typically, can take great advantage of optimising per processor, because of their math-heavy operations (and hence register use).
But amusingly, as tchen mentioned, there are many other 'things' that let you perform larger operations anyway. SSE, SSE2, etc. will have 64-bit registers and 128-bit registers, even on a '32 bit' system.
The increased ability to address memory speaks directly to the increase in basic register size, based on (I imagine) Windows' specific memory-addressing system.
Hope that helps a little. other posters are much more accurate than me, I am just trying to explain very simply (it helps that I know very little :)

I have a wonderful answer for this question, but it doesn't all fit within this answer block.... The simple answer is that for your program to get a byte out of memory, it needs an address. In 32-bit CPUs, the memory address of each byte is stored in a 32-bit (unsigned) integer, which has a maximum value of about 4.29 billion, enough to address 4 GB. When you use a 64-bit processor, the memory address is a 64-bit integer, which gives you about 1.84467441 × 10^19 possible memory addresses. This really should suffice if you are new to programming. You should really be focusing more on learning how to program than on the internal workings of your processor and why you can't access more than 4 GB of RAM on your 32-bit CPU.

Simple answer to explain addressable memory range with 32 bit processors is:
Let's assume you are only allowed to construct 3-digit numbers, so the maximum number you can go up to is 999. The range of numbers is (0 - 999). You have just 1000 numbers to use.
But if you are allowed to have 6-digit numbers, then the maximum number you can construct is 999999. Now the range is (0 - 999999). So now you have 1 million numbers to use.
Similarly, the more bits you are allowed to have in a processor, the larger the set of addresses (the numbers in the previous example) you can construct and eventually use to store data etc.
Anything simpler than this would be interesting to read!
-AD.

What are the advantages of a 64-bit processor?

Obviously, a 64-bit processor has a 64-bit address space, so you have more than 4 GB of RAM at your disposal. Does compiling the same program as 64-bit and running on a 64-bit CPU have any other advantages that might actually benefit programs that aren't enormous memory hogs?
I'm asking about CPUs in general, and Intel-compatible CPUs in particular.

There's a great article on Wikipedia about the differences and benefits of 64bit Intel/AMD cpus over their 32 bit versions. It should have all the information you need.
Some of the key differences are:
16 general purpose registers instead of 8
Additional SSE registers
A no-execute (NX) bit to help prevent buffer overrun attacks from running injected code

The main advantage of a 64-bit CPU is the ability to have 64-bit pointer types that allow virtual address ranges greater than 4GB in size. On a 32-bit CPU, the pointer size is (typically) 32 bits wide, allowing a pointer to refer to one of 2^32 (4,294,967,296) discrete addresses. This allows a program to make a data structure in memory up to 4GB in size and resolve any data item in it by simply de-referencing a pointer. Reality is slightly more complex than this, but for the purposes of this discussion it's a good enough view.
A 64-bit CPU has 64-bit pointer types that can refer to any address within a space of 2^64 (18,446,744,073,709,551,616) discrete addresses, or 16 exabytes (strictly, 16 EiB). A process on a CPU like this can (theoretically) construct and logically address any part of a data structure up to 16 exabytes in size by simply de-referencing a pointer (looking up data at an address held in the pointer).
This allows a process on a 64-bit CPU to work with a larger set of data (constrained by physical memory) than a process on a 32 bit CPU could. From the point of view of most users of 64-bit systems, the principal advantage is the ability for applications to work with larger data sets in memory.
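If you want to see which pointer width your own environment uses, the following Python sketch reports it. One assumption to be aware of: it reports the pointer size of the Python interpreter build that runs it, which can be 32 bits even on a 64-bit CPU if a 32-bit build is installed.

```python
import struct

# "P" is the struct format code for a native pointer (void *).
pointer_bytes = struct.calcsize("P")
pointer_bits = pointer_bytes * 8

# Number of distinct byte addresses a pointer of this width can name.
addressable = 2 ** pointer_bits

print(f"pointer width: {pointer_bits} bits")
print(f"addressable bytes: {addressable:,}")

# The "16 exabytes" figure for 64-bit pointers, using 2^60 bytes per unit:
print(2**64 // 2**60)  # 16
```

Note that each pointer also costs `pointer_bytes` of storage, which is why pointer-heavy programs use somewhat more memory when built as 64-bit.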
Aside from that, you may get a native 64-bit integer type. A native 64-bit integer makes arithmetic or logical operations on 64-bit types such as C's long long faster than when each one is implemented as two 32-bit operations. Floating point arithmetic is unlikely to be significantly affected, as FPUs on most modern 32-bit CPUs natively support 64-bit double floating point types.
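As a rough illustration of what "two 32-bit operations" means, here is a Python sketch that adds two unsigned 64-bit values using only 32-bit-wide pieces. The function name and masks are made up for this example; Python integers are arbitrary precision, so the masking merely stands in for a fixed 32-bit register width, and a real 32-bit CPU would use its carry flag (for example an add followed by an add-with-carry) rather than a shift.

```python
MASK32 = 0xFFFFFFFF


def add64_with_32bit_ops(a: int, b: int) -> int:
    """Add two unsigned 64-bit values using only 32-bit-wide additions.

    Hypothetical helper: the result wraps around modulo 2^64, like a
    fixed-width 64-bit register would.
    """
    # Split each operand into a low and a high 32-bit half.
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # First operation: add the low halves and note whether they overflowed.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32          # 1 if the low half overflowed 32 bits
    lo = lo_sum & MASK32

    # Second operation: add the high halves plus the carry.
    hi = (a_hi + b_hi + carry) & MASK32  # carry out of the high half is dropped

    # Combine the two halves back into one 64-bit result.
    return (hi << 32) | lo


print(hex(add64_with_32bit_ops(0xFFFFFFFF, 1)))  # 0x100000000
```

A 64-bit CPU performs the same addition in a single instruction, which is where the speedup for 64-bit integer work comes from.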
Any other performance advantages or enhanced feature sets are a function of specific chip implementations, rather than something inherent to a system having a 64 bit ALU.

This article may be helpful:
http://www.softwaretipsandtricks.com/windowsxp/articles/581/1/The-difference-between-64-and-32-bit-processors
This one is a bit off-topic, but might help if you plan to use Ubuntu:
http://ubuntuforums.org/showthread.php?t=368607
And this pdf below contains a detailed technical specification:
http://www.plmworld.org/access/tech_showcase/pdf/Advantage%20of%2064bit%20WS%20for%20NX.pdf

Slight correction. On 32-bit Windows, the limit is about 3GB of RAM. I believe the remaining 1GB of address space is reserved for hardware. You can still install 4GB, but only about 3GB will be accessible.
Personally I think anyone who hasn't happily lived with 16K on an 8-bit OS in a former life should be careful about casting aspersions against some of today's software starting to become porcine. The truth is that as our resources become more plentiful, so do our expectations. The day is not long off when 3GB will start to seem ridiculously small. Until that day, stick with your 32-bit OS and be happy.

About a 1-3% speed increase due to instruction-level parallelism for 32-bit calculations.

Just wanted to add a little bit of information on the pros and cons of 64-bit CPUs. https://blogs.msdn.microsoft.com/joshwil/2006/07/18/should-i-choose-to-take-advantage-of-64-bit/

The main difference between 32-bit processors and 64-bit processors is the speed at which they operate. 64-bit processors can come in dual core, quad core, and six core versions for home computing (with eight core versions coming soon). Multiple cores allow for increased processing power and faster computer operation. Software programs that require many calculations to function operate faster on the multi-core 64-bit processors, for the most part. It is important to note that 64-bit computers can still use 32-bit based software programs, even when the Windows operating system is a 64-bit version.
Another big difference between 32-bit processors and 64-bit processors is the maximum amount of memory (RAM) that is supported. 32-bit computers support a maximum of 3-4GB of memory, whereas a 64-bit computer can support memory amounts over 4 GB. This is important for software programs that are used for graphical design, engineering design or video editing, where many calculations are performed to render images, drawings, and video footage.
One thing to note is that 3D graphic programs and games do not benefit much, if at all, from switching to a 64-bit computer, unless the program is a 64-bit program. A 32-bit processor is adequate for any program written for a 32-bit processor. In the case of computer games, you'll get a lot more performance by upgrading the video card instead of getting a 64-bit processor.
In the end, 64-bit processors are becoming more and more commonplace in home computers. Most manufacturers build computers with 64-bit processors due to cheaper prices and because more users are now using 64-bit operating systems and programs. Computer parts retailers are offering fewer and fewer 32-bit processors and soon may not offer any at all.
