Thursday, November 27, 2008

Central Processing Unit

"CPU" redirects here. For other uses, see CPU (disambiguation).

A Central Processing Unit (CPU) is a machine that can execute computer programs. This broad definition can easily be applied to many early computers that existed long before the term "CPU" ever came into widespread usage. The term itself and its initialism have been in use in the computer industry at least since the early 1960s (Weik 1961). The form, design and implementation of CPUs have changed dramatically since the earliest examples, but their fundamental operation has remained much the same.

Early CPUs were custom-designed as a part of a larger, sometimes one-of-a-kind, computer. However, this costly method of designing custom CPUs for a particular application has largely given way to the development of mass-produced processors that are suited for one or many purposes. This standardization trend generally began in the era of discrete transistor mainframes and minicomputers and has rapidly accelerated with the popularization of the integrated circuit (IC). The IC has allowed increasingly complex CPUs to be designed and manufactured to tolerances on the order of nanometers. Both the miniaturization and standardization of CPUs have increased the presence of these digital devices in modern life far beyond the limited application of dedicated computing machines. Modern microprocessors appear in everything from automobiles to cell phones to children's toys.

History of CPUs

Main article: History of general purpose CPUs

EDVAC, one of the first electronic stored program computers.

Prior to the advent of machines that resemble today's CPUs, computers such as the ENIAC had to be physically rewired in order to perform different tasks. These machines are often referred to as "fixed-program computers," since they had to be physically reconfigured in order to run a different program. Since the term "CPU" is generally defined as a software (computer program) execution device, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer.

The idea of a stored-program computer was already present during ENIAC's design, but was initially omitted so the machine could be finished sooner. On June 30, 1945, before ENIAC was even completed, mathematician John von Neumann distributed the paper entitled "First Draft of a Report on the EDVAC." It outlined the design of a stored-program computer that would eventually be completed in August 1949 (von Neumann 1945). EDVAC was designed to perform a certain number of instructions (or operations) of various types. These instructions could be combined to create useful programs for the EDVAC to run. Significantly, the programs written for EDVAC were stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, which was the large amount of time and effort it took to reconfigure the computer to perform a new task. With von Neumann's design, the program, or software, that EDVAC ran could be changed simply by changing the contents of the computer's memory.

While von Neumann is most often credited with the design of the stored-program computer because of his design of EDVAC, others before him such as Konrad Zuse had suggested similar ideas. Additionally, the so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also utilized a stored-program design using punched paper tape rather than electronic memory. The key difference between the von Neumann and Harvard architectures is that the latter separates the storage and treatment of CPU instructions and data, while the former uses the same memory space for both. Most modern CPUs are primarily von Neumann in design, but elements of the Harvard architecture are commonly seen as well.

Being digital devices, all CPUs deal with discrete states and therefore require some kind of switching elements to differentiate between and change these states. Prior to commercial acceptance of the transistor, electrical relays and vacuum tubes (thermionic valves) were commonly used as switching elements. Although these had distinct speed advantages over earlier, purely mechanical designs, they were unreliable for various reasons. For example, building direct current sequential logic circuits out of relays requires additional hardware to cope with the problem of contact bounce. While vacuum tubes do not suffer from contact bounce, they must heat up before becoming fully operational and eventually stop functioning altogether. Usually, when a tube failed, the CPU would have to be diagnosed to locate the failing component so it could be replaced. Therefore, early electronic (vacuum tube based) computers were generally faster but less reliable than electromechanical (relay based) computers.

Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the (slower, but earlier) Harvard Mark I failed very rarely (Weik 1961:238). In the end, tube based CPUs became dominant because the significant speed advantages afforded generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared to modern microelectronic designs (see below for a discussion of clock rate). Clock signal frequencies ranging from 100 kHz to 4 MHz were very common at this time, limited largely by the speed of the switching devices they were built with.

Discrete transistor and IC CPUs
CPU, core memory, and external bus interface of a DEC PDP-8/I, made of medium-scale integrated circuits.

The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices. The first such improvement came with the advent of the transistor. Transistorized CPUs during the 1950s and 1960s no longer had to be built out of bulky, unreliable, and fragile switching elements like vacuum tubes and electrical relays. With this improvement more complex and reliable CPUs were built onto one or several printed circuit boards containing discrete (individual) components.

During this period, a method of manufacturing many transistors in a compact space gained popularity. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor-based die, or "chip." At first only very basic non-specialized digital circuits such as NOR gates were miniaturized into ICs. CPUs based upon these "building block" ICs are generally referred to as "small-scale integration" (SSI) devices. SSI ICs, such as the ones used in the Apollo guidance computer, usually contained transistor counts numbering in multiples of ten. To build an entire CPU out of SSI ICs required thousands of individual chips, but still consumed much less space and power than earlier discrete transistor designs. As microelectronic technology advanced, an increasing number of transistors were placed on ICs, thus decreasing the quantity of individual ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds, and then thousands.

In 1964 IBM introduced its System/360 computer architecture, which was used in a series of computers that could run the same programs at different speeds and levels of performance. This was significant at a time when most electronic computers were incompatible with one another, even those made by the same manufacturer. To facilitate this improvement, IBM utilized the concept of a microprogram (often called "microcode"), which still sees widespread usage in modern CPUs (Amdahl et al. 1964). The System/360 architecture was so popular that it dominated the mainframe computer market for decades and left a legacy that is still continued by similar modern computers like the IBM zSeries. In the same year (1964), Digital Equipment Corporation (DEC) introduced another influential computer aimed at the scientific and research markets, the PDP-8. DEC would later introduce the extremely popular PDP-11 line that originally was built with SSI ICs but was eventually implemented with LSI components once these became practical. In stark contrast with its SSI and MSI predecessors, the first LSI implementation of the PDP-11 contained a CPU composed of only four LSI integrated circuits (Digital Equipment Corporation 1975).

Transistor-based computers had several distinct advantages over their predecessors. Aside from facilitating increased reliability and lower power consumption, transistors also allowed CPUs to operate at much higher speeds because of the short switching time of a transistor in comparison to a tube or relay. Thanks to both the increased reliability as well as the dramatically increased speed of the switching elements (which were almost exclusively transistors by this time), CPU clock rates in the tens of megahertz were obtained during this period. Additionally, while discrete transistor and IC CPUs were in heavy usage, new high-performance designs like SIMD (Single Instruction Multiple Data) vector processors began to appear. These early experimental designs later gave rise to the era of specialized supercomputers like those made by Cray Inc.

Microprocessors

Main article: Microprocessor

The integrated circuit from an Intel 8742, an 8-bit microcontroller that includes a CPU running at 12 MHz, 128 bytes of RAM, 2048 bytes of EPROM, and I/O on the same chip.
Intel 80486DX2 microprocessor in a ceramic PGA package.

The introduction of the microprocessor in the 1970s significantly affected the design and implementation of CPUs. Since the introduction of the first commercially available microprocessor (the Intel 4004) in 1971 and the first widely used microprocessor (the Intel 8080) in 1974, this class of CPUs has almost completely overtaken all other central processing unit implementation methods. Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced instruction set compatible microprocessors that were backward-compatible with their older hardware and software. Combined with the advent and eventual vast success of the now ubiquitous personal computer, the term "CPU" is now applied almost exclusively to microprocessors.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards. Microprocessors, on the other hand, are CPUs manufactured on a very small number of ICs; usually just one. The overall smaller CPU size as a result of being implemented on a single die means faster switching time because of physical factors like decreased gate parasitic capacitance. This has allowed synchronous microprocessors to have clock rates ranging from tens of megahertz to several gigahertz. Additionally, as the ability to construct exceedingly small transistors on an IC has increased, the complexity and number of transistors in a single CPU has increased dramatically. This widely observed trend is described by Moore's law, which has proven to be a fairly accurate predictor of the growth of CPU (and other IC) complexity to date.

While the complexity, size, construction, and general form of CPUs have changed drastically over the past sixty years, it is notable that the basic design and function has not changed much at all. Almost all common CPUs today can be very accurately described as von Neumann stored-program machines. As the aforementioned Moore's law continues to hold true, concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates is causing the effects of phenomena like electromigration and subthreshold leakage to become much more significant. These newer concerns are among the many factors causing researchers to investigate new methods of computing such as the quantum computer, as well as to expand the usage of parallelism and other methods that extend the usefulness of the classical von Neumann model.

CPU operation

The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. The program is represented by a series of numbers that are kept in some kind of computer memory. There are four steps that nearly all von Neumann CPUs use in their operation: fetch, decode, execute, and writeback.

The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The location in program memory is determined by a program counter (PC), which stores a number that identifies the current position in the program. In other words, the program counter keeps track of the CPU's place in the current program. After an instruction is fetched, the PC is incremented by the length of the instruction word in terms of memory units. Often the instruction to be fetched must be retrieved from relatively slow memory, causing the CPU to stall while waiting for the instruction to be returned. This issue is largely addressed in modern processors by caches and pipeline architectures (see below).

The instruction that the CPU fetches from memory is used to determine what the CPU is to do. In the decode step, the instruction is broken up into parts that have significance to other portions of the CPU. The way in which the numerical instruction value is interpreted is defined by the CPU's instruction set architecture (ISA). Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as operands for an addition operation. Such operands may be given as a constant value (called an immediate value), or as a place to locate a value: a register or a memory address, as determined by some addressing mode. In older designs the portions of the CPU responsible for instruction decoding were unchangeable hardware devices. However, in more abstract and complicated CPUs and ISAs, a microprogram is often used to assist in translating instructions into various configuration signals for the CPU. This microprogram is sometimes rewritable so that it can be modified to change the way the CPU decodes instructions even after it has been manufactured.

After the fetch and decode steps, the execute step is performed. During this step, various portions of the CPU are connected so they can perform the desired operation. If, for instance, an addition operation was requested, an arithmetic logic unit (ALU) will be connected to a set of inputs and a set of outputs. The inputs provide the numbers to be added, and the outputs will contain the final sum. The ALU contains the circuitry to perform simple arithmetic and logical operations on the inputs (like addition and bitwise operations). If the addition operation produces a result too large for the CPU to handle, an arithmetic overflow flag in a flags register may also be set.

The final step, writeback, simply "writes back" the results of the execute step to some form of memory. Very often the results are written to some internal CPU register for quick access by subsequent instructions. In other cases results may be written to slower, but cheaper and larger, main memory. Some types of instructions manipulate the program counter rather than directly produce result data. These are generally called "jumps" and facilitate behavior like loops, conditional program execution (through the use of a conditional jump), and functions in programs. Many instructions will also change the state of digits in a "flags" register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations. For example, one type of "compare" instruction considers two values and sets a number in the flags register according to which one is greater. This flag could then be used by a later jump instruction to determine program flow.

After the execution of the instruction and writeback of the resulting data, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter. If the completed instruction was a jump, the program counter will be modified to contain the address of the instruction that was jumped to, and program execution continues normally. In more complex CPUs than the one described here, multiple instructions can be fetched, decoded, and executed simultaneously. This section describes what is generally referred to as the "Classic RISC pipeline," which in fact is quite common among the simple CPUs used in many electronic devices (often called microcontrollers). It largely ignores the important role of CPU cache, and therefore the memory access stage of the pipeline.
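
To make these four steps concrete, here is a minimal sketch in C of a toy von Neumann machine with a unified memory and a handful of registers. The encoding, opcodes, and register count are invented purely for illustration and correspond to no real ISA.

#include <stdint.h>
#include <stdio.h>

/* Toy machine: 256 bytes of unified program/data memory, 4 registers.
   Each instruction is 3 bytes: opcode, operand a, operand b (invented). */
enum { OP_HALT = 0, OP_ADD = 1, OP_JMP = 2 };

int main(void) {
    uint8_t mem[256] = {
        OP_ADD, 0, 1,   /* add r1 into r0 */
        OP_HALT, 0, 0,
    };
    uint32_t reg[4] = { 2, 3, 0, 0 };
    uint8_t pc = 0;

    for (;;) {
        /* fetch: read the instruction at the program counter */
        uint8_t opcode = mem[pc];
        uint8_t a = mem[pc + 1], b = mem[pc + 2];
        pc += 3;                        /* advance PC by the instruction length */

        /* decode, execute, writeback */
        if (opcode == OP_ADD) {
            reg[a] = reg[a] + reg[b];   /* execute: ALU add; writeback to register */
        } else if (opcode == OP_JMP) {
            pc = a;                     /* jump: overwrite the program counter */
        } else {                        /* OP_HALT */
            break;
        }
    }
    printf("r0 = %u\n", reg[0]);        /* prints 5 */
    return 0;
}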

Design and implementation

Main article: CPU design

Prerequisites
Computer architecture
Digital circuits

Integer range

The way a CPU represents numbers is a design choice that affects the most basic ways in which the device functions. Some early digital computers used an electrical model of the common decimal (base ten) numeral system to represent numbers internally. A few other computers have used more exotic numeral systems like ternary (base three). Nearly all modern CPUs represent numbers in binary form, with each digit being represented by some two-valued physical quantity such as a "high" or "low" voltage.
MOS 6502 microprocessor in a dual in-line package, an extremely popular 8-bit design.

Related to number representation is the size and precision of numbers that a CPU can represent. In the case of a binary CPU, a bit refers to one significant place in the numbers a CPU deals with. The number of bits (or numeral places) a CPU uses to represent numbers is often called "word size", "bit width", "data path width", or "integer precision" when dealing with strictly integer numbers (as opposed to floating point). This number differs between architectures, and often within different parts of the very same CPU. For example, an 8-bit CPU deals with a range of numbers that can be represented by eight binary digits (each digit having two possible values), that is, 2^8 or 256 discrete numbers. In effect, integer size sets a hardware limit on the range of integers the software run by the CPU can utilize.
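
As a small, hedged illustration of that hardware limit, using C's fixed-width types to stand in for registers: an 8-bit value can represent only 256 distinct numbers, so arithmetic past the top of the range wraps around, and the same power-of-two arithmetic governs the address-space figures discussed next.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t counter = 255;     /* largest value an 8-bit register can hold */
    counter = counter + 1;     /* overflows: wraps around to 0 (mod 2^8) */
    printf("%u\n", (unsigned)counter);  /* prints 0 */

    /* A 32-bit byte address reaches 2^32 locations, i.e. 4 GiB */
    uint64_t addressable = (uint64_t)1 << 32;
    printf("%llu bytes\n", (unsigned long long)addressable); /* 4294967296 */
    return 0;
}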

Integer range can also affect the number of locations in memory the CPU can address (locate). For example, if a binary CPU uses 32 bits to represent a memory address, and each memory address represents one octet (8 bits), the maximum quantity of memory that CPU can address is 2^32 octets, or 4 GiB. This is a very simple view of CPU address space, and many designs use more complex addressing methods like paging in order to locate more memory than their integer range would allow with a flat address space.

Higher levels of integer range require more structures to deal with the additional digits, and therefore more complexity, size, power usage, and general expense. It is not at all uncommon, therefore, to see 4- or 8-bit microcontrollers used in modern applications, even though CPUs with much higher range (such as 16, 32, 64, even 128-bit) are available. The simpler microcontrollers are usually cheaper, use less power, and therefore dissipate less heat, all of which can be major design considerations for electronic devices. However, in higher-end applications, the benefits afforded by the extra range (most often the additional address space) are more significant and often affect design choices. To gain some of the advantages afforded by both lower and higher bit lengths, many CPUs are designed with different bit widths for different portions of the device. For example, the IBM System/370 used a CPU that was primarily 32 bit, but it used 128-bit precision inside its floating point units to facilitate greater accuracy and range in floating point numbers (Amdahl et al. 1964). Many later CPU designs use similar mixed bit width, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.


Clock rate

Main article: Clock rate

Most CPUs, and indeed most sequential logic devices, are synchronous in nature. That is, they are designed and operate on assumptions about a synchronization signal. This signal, known as a clock signal, usually takes the form of a periodic square wave. By calculating the maximum time that electrical signals can move in various branches of a CPU's many circuits, the designers can select an appropriate period for the clock signal.

This period must be longer than the amount of time it takes for a signal to move, or propagate, in the worst-case scenario. In setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the "edges" of the rising and falling clock signal. This has the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective. However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see below).
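
A hedged numerical sketch of that selection: assuming, purely for illustration, a worst-case propagation delay of 10 ns through the slowest path, the clock period must be at least 10 ns, which caps the clock rate at 100 MHz.

#include <stdio.h>

int main(void) {
    /* Assumed worst-case propagation delay of the slowest circuit path */
    double worst_case_delay_ns = 10.0;

    /* The clock period must be no shorter than the worst-case delay,
       so the maximum clock frequency is its reciprocal. */
    double max_clock_hz = 1.0e9 / worst_case_delay_ns;
    printf("max clock rate: %.0f MHz\n", max_clock_hz / 1.0e6); /* 100 MHz */
    return 0;
}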

However, architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs. For example, a clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to require multiple identical clock signals to be provided in order to avoid delaying a single signal significantly enough to cause the CPU to malfunction. Another major issue as clock rates increase dramatically is the amount of heat that is dissipated by the CPU. The constantly changing clock causes many components to switch regardless of whether they are being used at that time. In general, a component that is switching uses more energy than an element in a static state. Therefore, as clock rate increases, so does heat dissipation, causing the CPU to require more effective cooling solutions.

One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components (effectively disabling them). However, this is often regarded as difficult to implement and therefore does not see common usage outside of very low-power designs. Another method of addressing some of the problems with a global clock signal is the removal of the clock signal altogether. While removing the global clock signal makes the design process considerably more complex in many ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in comparison with similar synchronous designs. While somewhat uncommon, entire asynchronous CPUs have been built without utilizing a global clock signal. Two notable examples of this are the ARM compliant AMULET and the MIPS R3000 compatible MiniMIPS. Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous, such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains. While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they do at least excel in simpler math operations. This, combined with their excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers (Garside et al. 1999).

Parallelism

Main article: Parallel computing

Model of a subscalar CPU. Notice that it takes fifteen cycles to complete three instructions.

The description of the basic operation of a CPU offered in the previous section describes the simplest form that a CPU can take. This type of CPU, usually referred to as subscalar, operates on and executes one instruction on one or two pieces of data at a time.

This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. Even adding a second execution unit (see below) does not improve performance much; rather than one pathway being hung up, now two pathways are hung up and the number of unused transistors is increased. This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock). However, the performance is nearly always subscalar (less than one instruction per cycle).

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously. The methodologies differ both in the ways in which they are implemented and in the relative effectiveness they afford in increasing the CPU's performance for an application.

Instruction level parallelism

Main articles: Instruction pipelining and Superscalar

Basic five-stage pipeline. In the best case scenario, this pipeline can sustain a completion rate of one instruction per cycle.

One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining, and is utilized in almost all modern general-purpose CPUs. Pipelining allows more than one instruction to be executed at any given time by breaking down the execution pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired.
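
The assembly-line analogy can be made quantitative. A minimal sketch, assuming an idealized pipeline with no stalls: S stages complete N instructions in S + N - 1 cycles rather than the S * N a strictly serial design would need, approaching one instruction per cycle as N grows. The stage and instruction counts below are illustrative.

#include <stdio.h>

int main(void) {
    const int stages = 5;          /* e.g. fetch, decode, execute, memory, writeback */
    const int instructions = 100;  /* illustrative program length */

    /* Unpipelined: each instruction occupies the whole datapath in turn. */
    int serial_cycles = stages * instructions;

    /* Ideally pipelined (no stalls): one instruction completes per cycle
       once the pipeline is full, after an initial fill of (stages - 1). */
    int pipelined_cycles = stages + instructions - 1;

    printf("serial: %d cycles, pipelined: %d cycles\n",
           serial_cycles, pipelined_cycles);   /* 500 vs 104 */
    return 0;
}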

Pipelining does, however, introduce the possibility for a situation where the result of the previous operation is needed to complete the next operation; a condition often termed data dependency conflict. To cope with this, additional care must be taken to check for these sorts of conditions and delay a portion of the instruction pipeline if this occurs. Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones (though not very significantly so). A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (an instruction spending more than one clock cycle in a stage).
Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.

Further improvement upon the idea of instruction pipelining led to the development of a method that decreases the idle time of CPU components even further. Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units (Huynh 2003). In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so, they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle.

Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline be filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache. It also makes hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes. Out-of-order execution somewhat rearranges the order in which instructions are executed to reduce delays due to data dependencies.
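
As one classic branch-prediction scheme (a textbook example, not necessarily what any particular CPU implements): a two-bit saturating counter predicts "taken" once a branch has recently been taken, and requires two consecutive mispredictions to reverse its prediction. A minimal sketch in C:

#include <stdbool.h>
#include <stdio.h>

/* Two-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken. */
static int state = 2;               /* start at "weakly taken" (arbitrary choice) */

static bool predict(void) { return state >= 2; }

static void train(bool taken) {     /* move toward the observed outcome */
    if (taken && state < 3) state++;
    if (!taken && state > 0) state--;
}

int main(void) {
    /* A loop branch: taken 9 times, then falls through once. */
    bool outcomes[10] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };
    int correct = 0;
    for (int i = 0; i < 10; i++) {
        if (predict() == outcomes[i]) correct++;
        train(outcomes[i]);
    }
    printf("correct predictions: %d/10\n", correct);  /* 9/10 */
    return 0;
}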

In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs which could accept one instruction per clock each, but its FPU could not accept one instruction per clock. Thus the P5 was integer superscalar but not floating point superscalar. Intel's successor to the Pentium architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance.

Both simple pipelining and superscalar design increase a CPU's ILP by allowing a single processor to complete execution of instructions at rates surpassing one instruction per cycle (IPC). Most modern CPU designs are at least somewhat superscalar, and nearly all general purpose CPUs designed in the last decade are superscalar. In later years some of the emphasis in designing high-ILP computers has been moved out of the CPU's hardware and into its software interface, or ISA. The strategy of the very long instruction word (VLIW) causes some ILP to become implied directly by the software, reducing the amount of work the CPU must perform to boost ILP and thereby reducing the design's complexity.

Thread level parallelism

Another strategy of achieving performance is to execute multiple programs or threads in parallel. This area of research is known as parallel computing. In Flynn's taxonomy, this strategy is known as Multiple Instructions-Multiple Data or MIMD.

One technology used for this purpose was multiprocessing (MP). The initial flavor of this technology is known as symmetric multiprocessing (SMP), where a small number of CPUs share a coherent view of their memory system. In this scheme, each CPU has additional hardware to maintain a constantly up-to-date view of memory. By avoiding stale views of memory, the CPUs can cooperate on the same program and programs can migrate from one CPU to another. To increase the number of cooperating CPUs beyond a handful, schemes such as non-uniform memory access (NUMA) and directory-based coherence protocols were introduced in the 1990s. SMP systems are limited to a small number of CPUs while NUMA systems have been built with thousands of processors. Initially, multiprocessing was built using multiple discrete CPUs and boards to implement the interconnect between the processors. When the processors and their interconnect are all implemented on a single silicon chip, the technology is known as a multi-core microprocessor.

It was later recognized that finer-grain parallelism existed within a single program. A single program might have several threads (or functions) that could be executed separately or in parallel. Some of the earliest examples of this technology implemented input/output processing such as direct memory access as a separate thread from the computation thread. A more general approach to this technology was introduced in the 1970s when systems were designed to run multiple computation threads in parallel. This technology is known as multi-threading (MT). This approach is considered more cost-effective than multiprocessing, as only a small number of components within a CPU are replicated to support MT, as opposed to the entire CPU in the case of MP. In MT, the execution units and the memory system including the caches are shared among multiple threads. The downside of MT is that the hardware support for multithreading is more visible to software than that of MP, and thus supervisor software like operating systems has to undergo larger changes to support MT. One type of MT that was implemented is known as block multithreading, where one thread is executed until it is stalled waiting for data to return from external memory. In this scheme, the CPU quickly switches to another thread which is ready to run, the switch often done in one CPU clock cycle. Another type of MT is known as simultaneous multithreading, where instructions of multiple threads are executed in parallel within one CPU clock cycle.

For several decades from the 1970s to early 2000s, the focus in designing high performance general purpose CPUs was largely on achieving high ILP through technologies such as pipelining, caches, superscalar execution, out-of-order execution, etc. This trend culminated in large, power-hungry CPUs such as the Intel Pentium 4. By the early 2000s, CPU designers were thwarted from achieving higher performance from ILP techniques due to the growing disparity between CPU operating frequencies and main memory operating frequencies, as well as escalating CPU power dissipation owing to more esoteric ILP techniques.

CPU designers then borrowed ideas from commercial computing markets such as transaction processing, where the aggregate performance of multiple programs, also known as throughput computing, was more important than the performance of a single thread or program.

This reversal of emphasis is evidenced by the proliferation of dual and multiple core CMP (chip-level multiprocessing) designs and notably, Intel's newer designs resembling its less superscalar P6 architecture. Late designs in several processor families exhibit CMP, including the x86-64 Opteron and Athlon 64 X2, the SPARC UltraSPARC T1, IBM POWER4 and POWER5, as well as several video game console CPUs like the Xbox 360's triple-core PowerPC design, and the PS3's 8-core Cell microprocessor.

Data parallelism

Main articles: Vector processor and SIMD

A less common but increasingly important paradigm of CPUs (and indeed, computing in general) deals with data parallelism. The processors discussed earlier are all referred to as some type of scalar device. As the name implies, vector processors deal with multiple pieces of data in the context of one instruction. This contrasts with scalar processors, which deal with one piece of data for every instruction. Using Flynn's taxonomy, these two schemes of dealing with data are generally referred to as SISD (single instruction, single data) and SIMD (single instruction, multiple data), respectively. The great utility in creating CPUs that deal with vectors of data lies in optimizing tasks that tend to require the same operation (for example, a sum or a dot product) to be performed on a large set of data. Some classic examples of these types of tasks are multimedia applications (images, video, and sound), as well as many types of scientific and engineering tasks. Whereas a scalar CPU must complete the entire process of fetching, decoding, and executing each instruction and value in a set of data, a vector CPU can perform a single operation on a comparatively large set of data with one instruction. Of course, this is only possible when the application tends to require many steps which apply one operation to a large set of data.

Most early vector CPUs, such as the Cray-1, were associated almost exclusively with scientific research and cryptography applications. However, as multimedia has largely shifted to digital media, the need for some form of SIMD in general-purpose CPUs has become significant. Shortly after floating point execution units became commonplace in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose CPUs. Some of these early SIMD specifications, like Intel's MMX, were integer-only. This proved to be a significant impediment for some software developers, since many of the applications that benefit from SIMD primarily deal with floating point numbers. Progressively, these early designs were refined and remade into some of the common, modern SIMD specifications, which are usually associated with one ISA. Some notable modern examples are Intel's SSE and the PowerPC-related AltiVec (also known as VMX).
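
To show the contrast in code, here is a hedged C sketch using Intel's SSE intrinsics: a single addps instruction adds four single-precision floats at once, where the scalar loop performs four separate additions. It assumes a compiler targeting an x86 processor with SSE support.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics; requires an x86 target */

int main(void) {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float scalar_sum[4], simd_sum[4];

    /* Scalar (SISD): one addition per instruction, four instructions. */
    for (int i = 0; i < 4; i++)
        scalar_sum[i] = a[i] + b[i];

    /* SIMD: one addps instruction adds all four lanes at once. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(simd_sum, _mm_add_ps(va, vb));

    for (int i = 0; i < 4; i++)
        printf("%g %g\n", scalar_sum[i], simd_sum[i]);
    return 0;
}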

Burton Smith

Burton J. Smith is one of the world's leading computer architects. From 1988 until 1999, Smith was the chief scientist and a member of the board of directors of Tera Computer Company. Smith received the 1991 Eckert-Mauchly Award and the 2003 Seymour Cray Computer Engineering Award for computer architecture.

In December 2005, Smith was hired by Microsoft as a Technical Fellow.

Burroughs large systems descriptors

Descriptors are an architectural feature of Burroughs large systems, including the current (as of 2006) Unisys ClearPath/MCP systems. Apart from being stack- and tag-based, these systems are notably descriptor-based. Descriptors are the means of referencing data that does not reside on the stack, such as arrays and objects. Descriptors are also used for string data, as in compilers and commercial applications.

Details

Descriptors describe data blocks. Each descriptor contains a 20-bit address field referencing the data block and a 20-bit length field giving the block's length. A three-bit field encodes the size of the data elements: 4-, 6-, 8- or 48-bit.

The first computer with this architecture was the B5000. In that implementation, the meaning of the various status bits was as follows (a sketch of the layout appears after the list):

* Bit 47 — The presence bit (P-Bit)
* Bit 46 — The copy bit
* Bit 45 — The indexed bit
* Bit 44 — The paged bit
* Bit 43 — The read only bit
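
A hedged sketch of this layout as a C structure, purely illustrative: bit-field ordering is compiler-dependent, and C has no native 48-bit word, so this models the fields rather than the real hardware encoding.

#include <stdint.h>
#include <stdio.h>

/* Illustrative model of a B5000-style descriptor using C bit-fields.
   Field names follow the text; the exact packing is an assumption. */
typedef struct {
    uint32_t address   : 20;  /* memory (or disk) address of the data block */
    uint32_t length    : 20;  /* length of the block */
    uint32_t size      : 3;   /* element size: 4-, 6-, 8- or 48-bit data */
    uint32_t read_only : 1;   /* bit 43 */
    uint32_t paged     : 1;   /* bit 44 */
    uint32_t indexed   : 1;   /* bit 45 */
    uint32_t copy      : 1;   /* bit 46 */
    uint32_t presence  : 1;   /* bit 47: 1 = block is present in memory */
} descriptor;

int main(void) {
    descriptor d = { .address = 0x12345, .length = 100, .presence = 1 };
    printf("present=%u address=%05x length=%u\n",
           d.presence, d.address, d.length);
    return 0;
}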

In later implementations these status bits evolved to keep up with growing memory sizes and the insights gained.

Bit 47 is probably the most interesting bit in the system – it is the way the architecture implements virtual memory. Virtual memory was originally developed for the Atlas project at the University of Manchester in the late 1950s. Keen to see it used in commercial applications, the Atlas team invited engineers from several computer companies to a seminar, including those from Burroughs and IBM. The Burroughs engineers saw the significance of virtual memory and put it into the B5000. The IBM engineers weren't interested, and IBM did not "invent" virtual memory for another ten years.

When a descriptor is referenced, the hardware checks bit 47. If it is 1, the data is present in memory at the location indicated in the address field. If bit 47 is 0, the data block is not present and an interrupt (p-bit interrupt) is raised and MCP code entered to make the block present. In this case, if the address field is 0, the data block has not been allocated (init p-bit) and the MCP searches for a free block the size of which is given in the length field.

Usage in compilers

In ALGOL, the bounds of an array were completely dynamic and could be taken from values computed at run time, unlike in Pascal, where the size of arrays was fixed at compile time. This was the main weakness of Pascal as defined in its standard, but it was removed in many commercial implementations of Pascal, notably the Burroughs implementations (both the University of Tasmania version by Arthur Sale and Roy Freak, and the Burroughs Slice implementation by Matt Miller et al.).

Note that in a program in the Burroughs environment, an array is not allocated when it is declared, but only when it is touched for the first time – thus arrays can be declared and the overhead of allocating them avoided if they are not used.

Also note that low-level memory allocation system calls, such as the malloc family of calls in C and Unix, are not needed – arrays are automatically allocated as used. This saves the programmer the great burden of filling programs with the error-prone activity of memory management, which is crucial in mainframe applications.

When porting programs in lower-level languages such as C, the C memory structure is dealt with by doing its own memory allocation within a large allocated B5000 block – thus the security of the rest of the B5000 system cannot be compromised by errant C programs. In fact, many buffer overruns in apparently otherwise working C programs have been caught when ported to the B5000 architecture. C, like Pascal, was also implemented using the Slice compiler system (using a common code generator and optimizer for all languages). The C compiler, run-time system, POSIX interfaces, and a port of many Unix tools were done by Steve Bartels. An Eiffel compiler was also developed using Slice.

For object-oriented programs which require more dynamic creation of objects than the B5000 architecture directly supports, objects are best allocated within a single B5000 block. Such object allocation is higher level than C's malloc and is best implemented with a modern efficient garbage collector.

The last p-bit scenario is when bit 47 is 0, indicating that the data is not in memory, but the address is non-zero, indicating that the data has been allocated and in this case the address represents a disk address in the virtual memory area on disk. In this case a p-bit interrupt is raised and it is noted as an 'other' p-bit.
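
Putting the three cases together, here is a hedged C sketch of the check performed on every descriptor reference. The printed messages stand in for the hardware interrupt and MCP actions described above, and the simplified structure is invented for illustration.

#include <stdint.h>
#include <stdio.h>

/* Simplified descriptor for this sketch only. */
struct desc {
    unsigned presence;  /* bit 47 */
    uint32_t address;   /* memory address, disk address, or 0 */
    uint32_t length;
};

static void reference(struct desc *d) {
    if (d->presence) {
        /* Bit 47 = 1: the block is in memory at d->address; proceed. */
        printf("access memory at %u\n", d->address);
    } else if (d->address == 0) {
        /* Init p-bit: never allocated; the MCP finds a free block of
           the size given in the length field. */
        printf("init p-bit interrupt: allocate %u-word block\n", d->length);
    } else {
        /* 'Other' p-bit: the block was rolled out; the address field
           holds a disk address in the virtual memory area. */
        printf("other p-bit interrupt: reload from disk address %u\n",
               d->address);
    }
}

int main(void) {
    struct desc fresh  = { 0, 0, 100 };     /* init p-bit case */
    struct desc rolled = { 0, 4242, 100 };  /* 'other' p-bit case */
    reference(&fresh);
    reference(&rolled);
    return 0;
}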

Thus the B5000 had a virtual memory system integrated into the hardware – a virtual memory system that has to this day been unsurpassed, since all other systems must build virtual memory on top of lower-level hardware. ALGOL and the B5000 also represented a significant advance on the low-level, error-prone, and programmer intensive 'malloc' mechanisms of later systems.

Integration in memory architecture

The address field in the B5000 was only 20 bits, which meant that only 1 Meg words (6MB) of memory could be addressed by descriptors. This was a significant restriction of the architecture. To overcome this, two solutions were implemented:

1. Swapper – this solution actually implemented another layer on top of memory management, moving large clusters of related data in and out of memory at once.

2. ASN – this solution allowed physically more memory to be configured in a system, divided into separately addressable chunks. This architecture became known as ASN (Address Space Number) memory. Memory was logically divided into two areas, allocating low memory addresses to a Global address space for the operating system and support software and high memory addresses to several parallel Local address spaces for individual programs. Address spaces were numbered, zero indicating Global, 1..n indicating the local address spaces. Programs sharing data were automatically placed in the same address space.

No program code modifications were necessary for these features to be utilized. Both solutions could even be combined, but eventually the MCP memory requirements and program data sharing requirements outgrew the maximum size of the address spaces themselves.

With the advent of the A Series in the early 1980s, the meaning of this field was changed to contain the address of a master descriptor, which meant that 1 Meg data blocks could be allocated, but that the machine memory could be greatly expanded to gigabytes or perhaps terabytes. This architecture was named ASD (Advanced Segment Descriptors) memory. This required a new common microcode specification, referred to as Beta. The main visionary behind ASD memory is Bill McClintock. Later the 3-bit memory tag was increased to a 4-bit specification, allowing the segment descriptor to grow from 20 to 23 bits in size, allowing even more memory to be addressed simultaneously. This microcode specification became known as level Gamma.

Memory management

Another significant advantage was realized for virtual memory. In the B5000 design, if a data block were rolled out, all descriptors referencing that block needed to be found in order to update the presence bit and address. With the master descriptor, only the presence bit in the master descriptor needs changing. Also the MCP can move blocks around in memory for compaction and only needs to change the address in the master descriptor.

A difference between the B5000 and most other systems is that the others mainly used paged virtual memory, that is, pages are swapped in and out in fixed-size chunks regardless of the structure of the information in them. B5000 virtual memory works with variable-size segments, as described by the descriptors.

When the memory is filled to a certain capacity, an OS process called the 'Working Set Sheriff' is invoked to either compact memory or start moving segments out of memory. It chooses code segments first, since these cannot change and can be reloaded from the original in the code file, so do not need writing out, and then data segments which are written out to the virtual memory file.

P-bit interrupts are also useful to measure system performance. For first-time allocations, 'init p-bits' indicate a potential performance problem in a program, for example if a procedure allocating an array is continually called. Reloading blocks from virtual memory on disk can significantly degrade system performance and is not the fault of any specific task. This is why many of today's computers may gain increased system performance by adding memory. On B5000 machines, 'other p-bits' indicate a system problem, which can be solved either by better balancing the computing load across the day, or by adding more memory.

Thus the Burroughs large systems architecture helps optimization of both individual tasks and the system as a whole.

Buffer overflow protection

The last and maybe most important point to note about descriptors is how they affect the complementary notions of system security and program correctness. One of the best tools a hacker has to compromise operating systems of today is the buffer overflow. This is particularly easy to do in C, and with a little more effort in assemblers. C, in particular, uses the most primitive and error-prone way to mark the end of strings: a null byte as an end-of-string sentinel in the data stream itself.

Pointers are implemented on the B5000 by indexed descriptors. During indexing operations, pointers are checked at each increment to make sure that neither the source nor the destination block is out of bounds. During a scan or replace operation, the mechanisms used to read or copy large blocks of memory, both source and destination are checked at each word increment for a valid memory tag. Each memory segment is bounded by tag 3 words, which would make such an operation fail. Each memory segment containing integrity-sensitive data, such as program code, is stored in tag 3 words, making an uncontrolled read – let alone modification – impossible. Thus a significant source of program errors can be detected early, before software goes into production, and a more significant class of attacks on system security is not possible.
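
A hedged C rendering of the effect of that indexing check (on the B5000 the comparison is wired into the hardware, not compiled into the program; the names below are invented):

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t length;    /* block length from the descriptor */
    int32_t *base;      /* block address */
} indexed_descriptor;

/* On the B5000 this comparison happens on every indexing operation;
   an out-of-bounds index raises an interrupt instead of silently
   reading or overwriting a neighboring block. */
static int32_t checked_read(indexed_descriptor *d, uint32_t index) {
    if (index >= d->length) {
        fprintf(stderr, "bounds violation: index %u, length %u\n",
                index, d->length);
        return 0;   /* the hardware would raise an interrupt here */
    }
    return d->base[index];
}

int main(void) {
    int32_t block[4] = { 10, 20, 30, 40 };
    indexed_descriptor d = { 4, block };
    printf("%d\n", checked_read(&d, 2));  /* in bounds: 30 */
    checked_read(&d, 9);                  /* caught: bounds violation */
    return 0;
}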

Bridging model

In computer science, a bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine; in other words, it is intended to provide a common level of understanding between hardware and software engineers.

A successful bridging model is one which can be efficiently implemented in reality and efficiently targeted by programmers; in particular, it should be possible for a compiler to produce good code from a typical high-level language. The term was introduced by Leslie Valiant's 1990 paper A Bridging Model for Parallel Computation, which argued that the strength of the von Neumann model was largely responsible for the success of computing as a whole. The paper goes on to develop the bulk-synchronous parallel model as an analogous model for parallel computing.

Application-specific instruction-set processor

An application-specific instruction-set processor (ASIP) is a component used in System-on-a-Chip design. The instruction set of an ASIP is tailored to benefit a specific application. This specialization of the core provides a tradeoff between the flexibility of a general purpose CPU and the performance of an ASIC.

Some ASIPs have a configurable instruction set. Usually, these cores are divided into two parts: static logic which defines a minimum ISA, and configurable logic which can be used to design new instructions. The configurable logic can be programmed either in the field, in a similar fashion to an FPGA, or during chip synthesis.

Anti machine

In computer science, anti machine refers to the basic machine paradigm for reconfigurable computing that is the counterpart of the von Neumann machine. The difference between an anti machine and a von Neumann machine is that the anti machine is data-stream-driven and is therefore sequenced by data counters. The von Neumann machine, in contrast, is instruction-stream-driven and is therefore controlled by a program counter. Another key difference is that the anti machine usually has multiple data counters, whereas the von Neumann machine can have only a single program counter. The data counters are located within auto-sequencing memory blocks, which are programmed from flowware sources. An anti machine does not have a central processing unit, but rather one or several data path units, also known as DPUs.

Auto-Sequencing Memory

Auto-sequencing memory (ASM) is an essential part of the anti machine paradigm. It takes over the sequencing role that traditionally belongs to the instruction sequencer co-located with the datapath; that block, including the datapath unit (for example, an ALU) and the instruction sequencer, is traditionally called the CPU. The ASM is a key component of the paradigm shift from instruction-stream-based computing to the reconfigurable computing paradigm.

ASM use is a fundamental issue because in homogeneous reconfigurable computing systems there is no instruction fetch at run time: in a reconfigurable array (an rDPU array or an FPGA), the datapath units (DPUs) are connected to form a pipe network, where execution is transport-triggered – that is, set off by the arrival of data items coming along with the data streams driven through the array. For this reason instruction sequencers are not needed here. But a machine paradigm needs a sequencing mechanism, so data counters are used.

An auto-sequencing memory block (ASM block) is a RAM unit that includes a data address generator, whose data counter (a data pointer) is used as a data address register for the implementation of a data stream. The direct memory access (DMA) unit is an example of such an address generator for an ASM. Another example is the generic address generator (GAG), a generalization of the DMA. In reconfigurable computing systems, ASMs play an important role in achieving massive speed-up by minimizing or avoiding memory cycle overhead for complex address computations.

Generic Address Generator

The Generic Address Generator (GAG) is a generalization of the direct memory access (DMA) method for transferring blocks of data or data streams between memory and a processing resource without the need to individually address each data item by a CPU instruction. The GAG is a reconfigurable address generator, and also a highly efficient implementation of the data sequencer for auto-sequencing memory (ASM) blocks. At run time, after having been configured for a particular addressing pattern, the GAG does not need any memory cycles (except for fetching or storing the data item itself), even for highly complex address computations. Depending on the application, using a GAG instead of the addressing features of a classical CPU can yield speed-up factors of one order of magnitude or more. Without needing any memory cycles for address computation, the GAG methodology supports, for instance with a 2-dimensional address space, a wide variety of generic address sequences: plain video scans; slanted, sheared, triangular, or rotated video scans; shuffle or butterfly addressing patterns; spiral- or zigzag-shaped address sequences; and much more. It can also be used to optimize storage schemes for image processing and massively parallel computing.
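
A hedged C sketch of the idea, with an invented parameterization: an address generator configured once with a base, strides, and limits then emits a two-dimensional address sequence by stepping two nested data counters, with no further per-item address computation by the processor. A shear of zero gives a plain video (row-major) scan; a nonzero shear gives a slanted scan, as in the sequences named above.

#include <stdio.h>

/* Hypothetical generic address generator: configured once, then steps
   two nested data counters to emit a 2-D address sequence. */
typedef struct {
    int base, width, height;
    int row_stride, shear;
} gag_config;

static void emit_scan(const gag_config *g) {
    for (int y = 0; y < g->height; y++)          /* outer data counter */
        for (int x = 0; x < g->width; x++)       /* inner data counter */
            printf("%d ", g->base + y * g->row_stride + x + y * g->shear);
    printf("\n");
}

int main(void) {
    gag_config video   = { 0, 4, 3, 16, 0 };  /* plain row-major scan */
    gag_config slanted = { 0, 4, 3, 16, 1 };  /* sheared by one per row */
    emit_scan(&video);    /* 0 1 2 3 16 17 18 19 32 33 34 35 */
    emit_scan(&slanted);  /* 0 1 2 3 17 18 19 20 34 35 36 37 */
    return 0;
}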

The GAG is also an important ingredient of the anti machine methodology. Note that the use of auto-sequencing memory is a fundamental issue distinguishing the anti machine paradigm of reconfigurable computing from the von Neumann machine paradigm.

Data Counters

In an anti machine, data counters are used by reconfigurable computing systems in place of a program counter. A computing machine paradigm needs a sequencing mechanism: the instruction-stream-driven von Neumann architecture uses a program counter for sequencing the instructions according to software programming sources. Because the anti machine is data-stream-driven, it instead uses data counters, which are programmed from flowware sources. In the anti machine model, the data counters are part of address generators such as DMA or GAG units, located in auto-sequencing memory blocks for data. Instead of a CPU, the anti machine uses DPUs.

Addressing mode

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how machine language instructions in that architecture identify the operand (or operands) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

In computer programming, addressing modes are primarily of interest to compiler writers and to those who write code directly in assembly language.

Caveats

Note that there is no generally accepted way of naming the various addressing modes. In particular, different authors and computer manufacturers may give different names to the same addressing mode, or the same names to different addressing modes. Furthermore, an addressing mode which, in one given architecture, is treated as a single addressing mode may represent functionality that, in another architecture, is covered by two or more addressing modes. For example, some complex instruction set computer (CISC) architectures, such as the Digital Equipment Corporation (DEC) VAX, treat registers and literal/immediate constants as just another addressing mode. Others, such as the IBM System/390 and most reduced instruction set computer (RISC) designs, encode this information within the instruction. Thus, the latter machines have three distinct instruction codes for copying one register to another, copying a literal constant into a register, and copying the contents of a memory location into a register, while the VAX has only a single "MOV" instruction.

The term "addressing mode" is itself subject to different interpretations: either "memory address calculation mode" or "operand accessing mode". Under the first interpretation instructions that do not read from memory or write to memory (such as "add literal to register") are considered not to have an "addressing mode". The second interpretation allows for machines such as VAX which use operand mode bits to allow for a literal operand. Only the first interpretation applies to instructions such as "load effective address".

The addressing modes listed below are divided into code addressing and data addressing. Most computer architectures maintain this distinction, but there are, or have been, some architectures which allow (almost) all addressing modes to be used in any context.

The instructions shown below are purely representative in order to illustrate the addressing modes, and do not necessarily reflect the mnemonics used by any particular computer.

How many addressing modes?

Different computer architectures vary greatly as to the number of addressing modes they provide. At the cost of a few extra instructions, and perhaps an extra register, it is normally possible to use the simpler addressing modes instead of the more complicated ones. It has proven much easier to design pipelined CPUs if the only addressing modes available are simple ones.

Most RISC machines have only about five simple addressing modes, while CISC machines such as the DEC VAX supermini have over a dozen addressing modes, some of which are quite complicated. The IBM System/360 mainframe had only three addressing modes; a few more have been added for the System/390.

When there are only a few addressing modes, the particular addressing mode required is usually encoded within the instruction code (e.g. IBM System/390, most RISC). But when there are lots of addressing modes, a specific field is often set aside in the instruction to specify the addressing mode. The DEC VAX allowed multiple memory operands for almost all instructions, and so reserved the first few bits of each operand specifier to indicate the addressing mode for that particular operand.

Even on a computer with many addressing modes, measurements of actual programs indicate that the simple addressing modes listed below account for some 90% or more of all addressing modes used. Since most such measurements are based on code generated from high-level languages by compilers, this reflects to some extent the limitations of the compilers being used.

Useful side effect

Some processors, such as Intel x86 and the IBM/390, have a Load effective address instruction. This performs a calculation of the effective operand address, but instead of acting on that memory location, it loads the address that would have been accessed into a register. This can be useful when passing the address of an array element to a subroutine. It may also be a slightly sneaky way of doing more calculation than normal in one instruction; for example, using such an instruction with the addressing mode "base+index+offset" allows one to add two registers and a constant together in one instruction.
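
In C terms, taking the address of an array element does the same job: the address arithmetic happens but no memory is read. A compiler targeting x86 will often emit LEA for patterns like the sketch below, though what actually gets emitted depends on the compiler; the names here are illustrative.

    #include <stdio.h>

    static void subroutine(double *elem) {
        printf("%p\n", (void *)elem);
    }

    int main(void) {
        double a[100];
        int i = 7;
        double *p = &a[i];  /* base + index*8: address computed, memory untouched */
        subroutine(p);      /* pass the address of an array element */

        int x = 5, y = 6;
        int z = x + y + 10; /* base+index+offset shape: a compiler may fold
                               this into a single LEA on x86 */
        printf("%d\n", z);
        return 0;
    }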

Simple addressing modes for code

Absolute

+------+------------------------------+
| jump |           address            |
+------+------------------------------+

(Effective address = address)

The effective address for an absolute instruction address is the address parameter itself with no modifications.

PC-relative

+------+-----+-----+----------------+
|jumpEQ| reg1| reg2|     offset     | jump relative if reg1=reg2
+------+-----+-----+----------------+

(Effective address = next instruction address + offset, offset may be negative)

The effective address for a PC-relative instruction address is the offset parameter added to the address of the next instruction. This offset is usually signed to allow reference to code both before and after the instruction.

This is particularly useful in connection with conditional jumps, because typical jumps are to nearby instructions (in a high-level language most if or while statements are reasonably short). Measurements of actual programs suggest that an 8 or 10 bit offset is large enough for some 90% of conditional jumps.

Another advantage of program-relative addressing is that the code may be position-independent, i.e. it can be loaded anywhere in memory without the need to adjust any addresses.
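
A rough model of the calculation with plain integers (illustrative only; real instruction encodings vary):

    #include <stdint.h>
    #include <stdio.h>

    /* Effective address of a PC-relative branch: the signed offset
       is added to the address of the *next* instruction. */
    static uint32_t pc_relative_target(uint32_t next_pc, int16_t offset) {
        return next_pc + (int32_t)offset;
    }

    int main(void) {
        /* A 4-byte branch at 0x1000 jumping back 8 bytes. */
        printf("0x%x\n", pc_relative_target(0x1004, -8));  /* prints 0xffc */
        return 0;
    }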

Register indirect

+-------+-----+
|jumpVia| reg |
+-------+-----+

(Effective address = contents of register 'reg')

The effective address for a register indirect instruction is the address held in the specified register. For example, (A7) on the Motorola 68000 uses the address held in address register A7.

The effect is to transfer control to the instruction whose address is in the specified register.

Many RISC machines have a subroutine call instruction that places the return address in an address register -- the register indirect addressing mode is used to return from that subroutine call.
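
In C, a call through a function pointer corresponds to this mode: the target address sits in a variable (typically a register once compiled), and control transfers to whatever address it holds. A minimal sketch:

    #include <stdio.h>

    static void subroutine(void) { puts("hello"); }

    int main(void) {
        void (*target)(void) = subroutine;  /* the "register" holding a code address */
        target();  /* transfer control via the stored value, not a fixed address */
        return 0;
    }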

skip

+------+-----+-----+
|skipEQ| reg1| reg2| skip the following instruction if reg1=reg2
+------+-----+-----+

(Effective address = next instruction address + 1)

Skip addressing may be considered a special kind of PC-relative addressing mode with a fixed "+1" offset.

Unlike all other conditional branches, a "skip" instruction never needs to flush the instruction pipeline.

The "condition code" used in the ARM architecture can be considered a kind of skip addressing mode.

Simple addressing modes for data

Register

+------+-----+-----+-----+
| mul  | reg1| reg2| reg3| reg1 := reg2 * reg3;
+------+-----+-----+-----+

This "addressing mode" does not have an effective address and is not considered to be an addressing mode on some computers.

In this example, all the operands are in registers, and the result is placed in a register.

Base plus offset, and variations

+------+-----+-----+----------------+
| load | reg | base|     offset     |
+------+-----+-----+----------------+

(Effective address = offset + contents of specified base register)

The offset is usually a signed 16-bit value (though the 80386 expanded it to 32 bits).

If the offset is zero, this becomes an example of register indirect addressing; the effective address is just the value in the base register.

On many RISC machines, register 0 is fixed at the value zero. If register 0 is used as the base register, this becomes an example of absolute addressing. However, only a small portion of memory can be accessed (64 kilobytes, if the offset is 16 bits).

The 16-bit offset may seem very small in relation to the size of current computer memories (which is why the 80386 expanded it to 32 bits). It could be worse: IBM System/360 mainframes only have an unsigned 12-bit offset. However, the principle of locality of reference applies: over a short time span, most of the data items a program wants to access are fairly close to each other.

This addressing mode is closely related to the indexed absolute addressing mode.

Example 1: Within a subroutine a programmer will mainly be interested in the parameters and the local variables, which will rarely exceed 64 KB, for which one base register (the frame pointer) suffices. If this routine is a class method in an object-oriented language, then a second base register is needed which points at the attributes for the current object (this or self in some high level languages).

Example 2: If the base register contains the address of a composite type (a record or structure), the offset can be used to select a field from that record (most records/structures are less than 32 kB in size).
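
In C, member access through a pointer maps naturally onto base plus offset: the pointer supplies the base and the field's position supplies the constant offset. A sketch (actual offsets depend on the ABI):

    #include <stddef.h>
    #include <stdio.h>

    struct point { int x; int y; };

    static int get_y(const struct point *p) {
        return p->y;  /* load reg, base = p, offset = offsetof(struct point, y) */
    }

    int main(void) {
        struct point pt = { 3, 4 };
        printf("%d (offset %zu)\n", get_y(&pt), offsetof(struct point, y));
        return 0;
    }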

Immediate/literal

+------+-----+-----+----------------+
| add  | reg1| reg2|    constant    | reg1 := reg2 + constant;
+------+-----+-----+----------------+

This "addressing mode" does not have an effective address, and is not considered to be an addressing mode on some computers.

The constant might be signed or unsigned. For example, on the Motorola 68000, move.l #$FEEDABBA, D0 moves the immediate hexadecimal value $FEEDABBA into register D0.

Instead of using an operand from memory, the value of the operand is held within the instruction itself. On the DEC VAX machine, the literal operand sizes could be 6, 8, 16, or 32 bits long.

Andrew Tanenbaum showed that 98% of all the constants in a program would fit in 13 bits (see RISC design philosophy).

Implicit

+-----------------+
| clear carry bit |
+-----------------+

The implied addressing mode, also called the implicit addressing mode, does not explicitly specify an effective address for the source or for the destination (or sometimes both).

Instead, the source (if any) or destination effective address (or sometimes both) is implied by the opcode.

Implied addressing was quite common on older computers (up to the mid-1970s). Such computers typically had only a single register in which arithmetic could be performed -- the accumulator. Such accumulator machines implicitly reference that accumulator in almost every instruction. For example, the operation a := b + c can be done using the sequence load b; add c; store a -- the destination (the accumulator) is implied in every "load" and "add" instruction; the source (the accumulator) is implied in every "store" instruction.
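
A toy accumulator machine in C makes the implied operand visible; the opcode names are invented for illustration:

    #include <stdio.h>

    static int acc;  /* the single implied register: the accumulator */

    static void load(int m)  { acc = m; }    /* destination (acc) implied */
    static void add(int m)   { acc += m; }   /* destination (acc) implied */
    static int  store(void)  { return acc; } /* source (acc) implied      */

    int main(void) {
        int b = 2, c = 3;
        load(b);
        add(c);
        int a = store();
        printf("%d\n", a);  /* prints 5 */
        return 0;
    }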

Later computers generally had more than one general purpose register or RAM location which could be the source or destination or both for arithmetic -- and so later computers need some other addressing mode to specify the source and destination of arithmetic.

Many computers (such as x86 and AVR) have one special-purpose register called the stack pointer which is implicitly incremented or decremented when pushing or popping data from the stack, and the source or destination effective address is (implicitly) the address stored in that stack pointer.

Most 32-bit computers (such as ARM and PowerPC) have more than one register which could be used as a stack pointer -- and so use the "register autoincrement indirect" addressing mode to specify which of those registers should be used when pushing or popping data from a stack.

Some current computer architectures (e.g. IBM/390 and Intel Pentium) contain some instructions with implicit operands in order to maintain backwards compatibility with earlier designs.

On many computers, instructions that flip the user/system mode bit, the interrupt-enable bit, etc. implicitly specify the special register that holds those bits. This simplifies the hardware necessary to trap those instructions in order to meet the Popek and Goldberg virtualization requirements -- on such a system, the trap logic does not need to look at any operand (or at the final effective address), but only at the opcode.

A few CPUs have been designed where every operand is always implicitly specified in every instruction -- zero-operand CPUs.

Other addressing modes for code or data

Absolute/Direct

+------+-----+--------------------------------------+
| load | reg |                address               |
+------+-----+--------------------------------------+

(Effective address = address as given in instruction)

This requires space in an instruction for quite a large address. It is often available on CISC machines which have variable-length instructions, such as x86.

Some RISC machines have a special Load Upper Literal instruction which places a 16-bit constant in the top half of a register. An OR literal instruction can be used to insert a 16-bit constant in the lower half of that register, so that a full 32-bit address can then be used via the register-indirect addressing mode, which itself is provided as "base-plus-offset" with an offset of 0.
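
The two-instruction sequence amounts to a shift and an OR, as in this sketch (on MIPS the corresponding instructions are commonly known as LUI and ORI; the helper name here is invented):

    #include <stdint.h>
    #include <stdio.h>

    /* Build a 32-bit constant from two 16-bit halves, the way a
       Load Upper Literal followed by an OR literal would. */
    static uint32_t build32(uint16_t upper, uint16_t lower) {
        uint32_t r = (uint32_t)upper << 16;  /* load upper literal */
        r |= lower;                          /* OR in the low half */
        return r;
    }

    int main(void) {
        printf("0x%08x\n", build32(0x1234, 0x5678));  /* prints 0x12345678 */
        return 0;
    }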

Indexed absolute

+------+-----+-----+--------------------------------+
| load | reg |index|             address            |
+------+-----+-----+--------------------------------+

(Effective address = address + contents of specified index register)

This also requires space in an instruction for quite a large address. The address could be the start of an array or vector, and the index could select the particular array element required. The processor may scale the index register to allow for the size of each array element.

Note that this is more or less the same as base-plus-offset addressing mode, except that the offset in this case is large enough to address any memory location.

Example 1: Within a subroutine, a programmer may define a string as a local constant or a static variable. The address of the string is stored in the literal address field of the instruction. The offset -- which character of the string to use on this iteration of a loop -- is stored in the index register.

Example 2: A programmer may define several large arrays as globals or as class variables. The start of the array is stored in the literal address field (perhaps modified at program-load time by a relocating loader) of the instruction that references it. The offset -- which item from the array to use on this iteration of a loop -- is stored in the index register. Often the instructions in a loop re-use the same register for the loop counter and the offsets of several arrays.

Base plus index

+------+-----+-----+-----+
| load | reg | base|index|
+------+-----+-----+-----+

(Effective address = contents of specified base register + contents of specified index register)

The base register could contain the start address of an array or vector, and the index could select the particular array element required. The processor may scale the index register to allow for the size of each array element. This could be used for accessing elements of an array passed as a parameter.

Base plus index plus offset

+------+-----+-----+-----+----------------+
| load | reg | base|index|     offset     |
+------+-----+-----+-----+----------------+

(Effective address = offset + contents of specified base register + contents of specified index register)

The base register could contain the start address of an array or vector of records, the index could select the particular record required, and the offset could select a field within that record. The processor may scale the index register to allow for the size of each array element.
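
In C, indexing an array of structures and then selecting a field uses all three parts at once: the array pointer is the base, the loop variable is the (scaled) index, and the field position is the constant offset. A sketch:

    #include <stdio.h>

    struct item { int id; int price; };

    static int total_price(const struct item *items, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += items[i].price;  /* base = items, index = i, offset = price */
        return sum;
    }

    int main(void) {
        struct item stock[3] = { {1, 10}, {2, 20}, {3, 30} };
        printf("%d\n", total_price(stock, 3));  /* prints 60 */
        return 0;
    }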

Scaled

+------+-----+-----+-----+
| load | reg | base|index|
+------+-----+-----+-----+

(Effective address = contents of specified base register + scaled contents of specified index register)

The base register could contain the start address of an array or vector, and the index could contain the number of the particular array element required.

This addressing mode dynamically scales the value in the index register to allow for the size of each array element, e.g. if the array elements are double precision floating-point numbers occupying 8 bytes each then the value in the index register is multiplied by 8 before being used in the effective address calculation. The scale factor is normally restricted to being a power of two, so that shifting rather than multiplication can be used.
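
Because the scale factor is a power of two, the multiplication reduces to a left shift, as in this sketch of the address arithmetic (plain integers standing in for registers):

    #include <stdint.h>
    #include <stdio.h>

    /* Effective address = base + (index << log2(element size)).
       For 8-byte doubles the shift amount is 3. */
    static uintptr_t scaled_ea(uintptr_t base, uintptr_t index, unsigned log2size) {
        return base + (index << log2size);
    }

    int main(void) {
        /* Element 5 of an array of doubles starting at 0x1000. */
        printf("0x%lx\n", (unsigned long)scaled_ea(0x1000, 5, 3));  /* 0x1028 */
        return 0;
    }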

Register indirect

+------+-----+-----+
| load | reg | base|
+------+-----+-----+

(Effective address = contents of base register)

A few computers have this as a distinct addressing mode. Many computers just use base plus offset with an offset value of 0. For example, (A7) on the Motorola 68000.

Register autoincrement indirect

+------+-----+-------+
| load | reg | base  |
+------+-----+-------+

(Effective address = contents of base register)

After determining the effective address, the value in the base register is incremented by the size of the data item that is to be accessed. For example, (A7)+ would access the content at the address held in address register A7, then increment A7 by the size of the accessed item (usually one word).

Within a loop, this addressing mode can be used to step through all the elements of an array or vector. A stack can be implemented by using this mode in conjunction with the next addressing mode (autodecrement).
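
C's pointer idioms are the direct analogue: *p++ reads through the pointer and then advances it by the size of the pointed-to type, and *--sp gives the matching autodecrement push. A sketch:

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        int sum = 0;

        /* Step through the array: read via the pointer, then advance
           it by sizeof(int) -- register autoincrement indirect. */
        for (int *p = a; p < a + 4; )
            sum += *p++;

        /* A small downward-growing stack built from the two modes. */
        int stack[8];
        int *sp = stack + 8;    /* stack pointer starts past the end */
        *--sp = sum;            /* push: autodecrement, then store   */
        printf("%d\n", *sp++);  /* pop: load, then autoincrement; prints 10 */
        return 0;
    }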

In high-level languages it is often thought to be a good idea that functions which return a result should not have side effects (lack of side effects makes program understanding and validation much easier). This addressing mode has a side effect in that the base register is altered. If the subsequent memory access causes an error (e.g. page fault, bus error, address error) leading to an interrupt, then restarting the instruction becomes much more problematic since one or more registers may need to be set back to the state they were in before the instruction originally started.

There have been at least two computer architectures which have had implementation problems with regard to recovery from interrupts when this addressing mode is used:

* Motorola 68000 (addresses are represented in 24 bits). Could have one or two autoincrement register operands. The 68010+ resolved the problem by saving the processor's internal state on bus or address errors.
* DEC VAX. Could have up to 6 autoincrement register operands. Each operand access could cause two page faults (if operands happened to straddle a page boundary). Of course the instruction itself could be over 50 bytes long and might straddle a page boundary as well!

Autodecrement register indirect

+------+-----+-----+
| load | reg | base|
+------+-----+-----+

(Effective address = new contents of base register)

Before determining the effective address, the value in the base register is decremented by the size of the data item which is to be accessed.

Within a loop, this addressing mode can be used to step backwards through all the elements of an array or vector. A stack can be implemented by using this mode in conjunction with the previous addressing mode (autoincrement).

See the discussion of side-effects under the autoincrement addressing mode.

Memory indirect

Any of the addressing modes mentioned in this article could have an extra bit to indicate indirect addressing, i.e. the address calculated using some mode is in fact the address of a location (typically a complete word) which contains the actual effective address.

Indirect addressing may be used for code or data. It can make implementation of pointers or references or handles very much easier, and can also make it easier to call subroutines which are not otherwise addressable. Indirect addressing does carry a performance penalty due to the extra memory access involved.
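
In C, the extra memory access is simply an extra dereference: the first address yields not the data but another address. A sketch:

    #include <stdio.h>

    int main(void) {
        int value = 42;
        int *indirect = &value;    /* memory word holding the real address */
        int **handle = &indirect;  /* the address the instruction names    */
        printf("%d\n", **handle);  /* two memory accesses to reach 42      */
        return 0;
    }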

Some early minicomputers (e.g. DEC PDP-8, Data General Nova) had only a few registers and only a limited addressing range (8 bits). Hence the use of memory indirect addressing was almost the only way of referring to any significant amount of memory.

Obsolete addressing modes

The addressing modes listed here were used in the 1950–1980 time frame, but are no longer available on most current computers. This list is by no means complete; there have been many other interesting and peculiar addressing modes used from time to time, e.g. absolute-plus-logical-OR of two or three index registers.

Multi-level memory indirect

If the word size is larger than the address then the word referenced for memory-indirect addressing could itself have an indirect flag set to indicate another memory indirect cycle. Care is needed to ensure that a chain of indirect addresses does not refer to itself; if it did, you could get an infinite loop while trying to resolve an address.
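
A sketch of the resolution loop over a toy memory, with a hypothetical indirect flag in the top bit of each word and a depth limit standing in for the care the text describes:

    #include <stdint.h>
    #include <stdio.h>

    #define INDIRECT_BIT 0x80000000u  /* hypothetical flag in each word */

    /* Follow a chain of indirect words in a toy memory, refusing to
       loop forever if the chain refers to itself. */
    static uint32_t resolve(const uint32_t *mem, uint32_t addr) {
        for (int depth = 0; depth < 16; depth++) {
            uint32_t word = mem[addr];
            if (!(word & INDIRECT_BIT))
                return addr;              /* final effective address */
            addr = word & ~INDIRECT_BIT;  /* follow another level    */
        }
        return addr;  /* give up: probably a self-referential chain */
    }

    int main(void) {
        uint32_t mem[4] = { INDIRECT_BIT | 2, 0, INDIRECT_BIT | 3, 0 };
        printf("%u\n", resolve(mem, 0));  /* 0 -> 2 -> 3, prints 3 */
        return 0;
    }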

The DEC PDP-10 computer with 18-bit addresses and 36-bit words allowed multi-level indirect addressing with the possibility of using an index register at each stage as well.

Memory-mapped registers

On some computers, the registers were regarded as occupying the first 8 or 16 words of memory (e.g. ICL 1900, DEC PDP-10). This meant that there was no need for a separate "Add register to register" instruction — you could just use the "add memory to register" instruction.

In the case of early models of the PDP-10, which did not have any cache memory, you could actually load a tight inner loop into the first few words of memory (the fast registers in fact), and have it run much faster than it would have run in magnetic core memory.

Later models of the DEC PDP-11 series mapped the registers onto addresses in the input/output area, but this was primarily intended to allow remote diagnostics. Confusingly, the 16-bit registers were mapped onto consecutive 8-bit byte addresses.

Memory indirect, autoincrement

On some early minicomputers (e.g. DEC PDP-8, Data General Nova), there were typically 16 special memory locations. When accessed via memory indirect addressing, 8 would automatically increment after use and 8 would automatically decrement after use. This made it very easy to step through memory in loops without using any registers.

Zero page

The Motorola 6800 family and the MOS Technology 6502 family of processors were register-poor series of CISC microprocessors. Arithmetic and logical instructions were mostly performed against values in memory as opposed to internal registers. As a result, instructions were generally required to include a two-byte (16-bit) memory address. Given that opcodes on these processors were only one byte (8 bits) in length, memory addresses could make up a significant part of code size.

Designers of these processors included a partial remedy known as "zero page" addressing. The initial 256 bytes of memory ($0000 - $00FF; a.k.a. page "0") could be accessed using a one-byte absolute or indexed memory address. This reduced instruction execution time by one clock cycle and instruction length by one byte. By storing often-used data in this region, programs could be made smaller and faster.

As a result, the zero page was used much like a register file. On many systems, however, this resulted in high utilization of the zero page memory area by the operating system and user programs, which limited its use since free space was scarce.

Direct page

The zero page address mode was enhanced in several descendants of the MOS Technology 6502, including the WDC 65816 and the MOS Technology 65CE02. The new mode, known as "direct page" addressing, added the ability to move the 256 byte zero page memory window from the start of memory (offset address $0000) to a new location within the first 64KB of memory.

The MOS 65CE02 allowed the direct page to be moved to any 256 byte boundary within the first 64KB of memory by storing an 8-bit offset value in the new B (block) register. The WDC 65816 went a step further and allowed the direct page to be moved to any location within the first 64KB of memory by storing a 16-bit offset value in the new D (direct) register.

As a result, a greater number of programs were able to utilize the enhanced direct page addressing mode versus legacy processors that only included the zero page addressing mode.

Scaled index with bounds checking

This is similar to scaled index addressing, except that the instruction has two extra operands (typically constants), and the hardware would check that the index value was between these bounds.

Another variation uses vector descriptors to hold the bounds; this makes it easy to implement dynamically allocated arrays and still have full bounds checking.

Register indirect to byte within word

The DEC PDP-10 computer used 36-bit words. It had a special addressing mode which allowed memory to be treated as a sequence of bytes (bytes could be any size from 1 bit to 36 bits). A one-word sequence descriptor held the current word address within the sequence, a bit position within a word, and the size of each byte.

Instructions existed to load and store bytes via this descriptor, and to increment the descriptor to point at the next byte (bytes were not split across word boundaries). Much DEC software used five 7-bit bytes per word (plain ASCII characters), with 1 bit unused per word. Implementations of C had to use four 9-bit bytes per word, since C assumes that you can access every bit of memory by accessing consecutive bytes.

Index next instruction

The Elliott 503, the Elliott 803, and the Apollo Guidance Computer used only absolute addressing, and did not have any index registers. Thus, indirect jumps, or jumps through registers, were not supported in the instruction set. Instead, such a machine could be instructed to add the contents of the current memory word to the next instruction. Adding a small value to the next instruction to be executed could, for example, change a JUMP 0 into a JUMP 20, thus creating the effect of an indirect jump via self-modifying code.

Address space

In computing, an address space defines a range of discrete addresses, each of which may correspond to a physical or virtual memory register, a network host, peripheral device, disk sector or other logical or physical entity.

A memory address identifies a physical location in computer memory, somewhat similar to a street address in a town. The address points to the location where data is stored, just like your address points to where you live. In the analogy of a person's address, the address space would be an area of locations, such as a neighborhood, town, city, or country. Two addresses may be numerically the same but refer to different locations, if they belong to different address spaces. This is similar to your address being, say, "32, Main Street", while another person may reside in "32, Main Street" in a different town from yours.

Example address spaces:

* House numbers in street addresses
* Street addresses in towns
* Main memory (physical memory)
* Virtual memory
* I/O port space
* Network
o IP addresses in particular
* The cylinder-head-sector scheme for hard drives

Specific examples for the Linux kernel:

* Kernel virtual address space
* User virtual address space, accessed by the kernel through copy_to_user(), copy_from_user() and similar functions
* I/O memory, accessed through readb(), writel(), memcpy_toio(), etc.

Address translation

In general, things in one address space are physically in a different location than things in another address space. For example, "house number 101 South" on one particular southward street is completely different from any house number (not just the 101st house) on a different southward street.

However, sometimes different address spaces overlap (some physical location exists in both address spaces). When overlapping address spaces are not aligned, translation is necessary. For example, virtual-to-physical address translation is necessary to translate addresses in the virtual memory address space to addresses in physical address space -- one physical address, and one or more numerically different virtual addresses, all refer to the same physical byte of RAM.
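
A toy single-level page-table lookup shows the mechanics; the page size, table layout, and names are illustrative, not any real MMU's format:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12  /* 4 KB pages (illustrative) */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    /* page_table[vpn] = physical frame number; a real MMU also carries
       valid and protection bits, which this toy version omits. */
    static uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* byte within page    */
        return (page_table[vpn] << PAGE_SHIFT) | offset;
    }

    int main(void) {
        uint32_t pt[4] = { 7, 3, 9, 1 };  /* virtual page i -> frame pt[i] */
        printf("0x%x\n", translate(pt, 0x1abc));  /* page 1 -> frame 3: 0x3abc */
        return 0;
    }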

Memory models

Many programmers prefer to use a flat memory model, in which there is no distinction between code space, data space, and virtual memory -- in other words, numerically identical pointers refer to exactly the same byte of RAM in all three address spaces.

Unfortunately, many early computers did not support a flat memory model -- in particular, Harvard architecture machines force program storage to be completely separate from data storage. Many modern DSPs (such as the Motorola 56000) have 3 separate storage areas -- program storage, coefficient storage, and data storage. Some commonly-used instructions fetch from all three areas simultaneously -- fewer storage areas (even if there were the same or more total bytes of storage) would make those instructions run slower.

Memory models in x86 architecture

Early x86 computers used addresses based on a combination of two numbers: a memory segment, and an offset within that segment. Some segments were implicitly treated as code segments (dedicated to instructions), stack segments, or normal data segments. Although the usages were different, the segments did not have different memory protections reflecting this.
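
In real mode, the two numbers combine as segment times 16 plus offset; a sketch of that calculation (protected-mode segmentation works differently, through descriptor tables):

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode x86: linear address = (segment << 4) + offset. */
    static uint32_t linear_address(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void) {
        /* The same linear address can be named by many segment:offset pairs. */
        printf("0x%05x\n", linear_address(0xB800, 0x0000));  /* 0xb8000 */
        printf("0x%05x\n", linear_address(0xB000, 0x8000));  /* 0xb8000 */
        return 0;
    }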

Now, many programmers prefer to use a flat memory model, in which all segments (segment registers) are generally set to zero, and only offsets are variable.

Abstraction layer

This is about the concept in computer science. For the concept in grouping, see Principle of abstraction.

An abstraction layer (or abstraction level) is a way of hiding the implementation details of a particular set of functionality. Software models that use layers of abstraction include the OSI seven-layer model for computer network protocols, the OpenGL graphics drawing library, and the byte stream input/output (I/O) model originated by Unix and adopted by MS-DOS, Linux, and most other modern operating systems.

In the Unix operating system, most types of input and output operations are considered to be streams of bytes being read from a device or being written to a device. This stream of bytes model is used for file I/O, socket I/O, and terminal I/O in order to provide device independence. In order to read and write to a device at the application level, the program calls a function to open the device which may be a real device such as a terminal or a virtual device such as a network port or a file in a file system. The device's physical characteristics are mediated by the operating system which in turn presents an abstract interface that allows the programmer to read and write bytes from/to the device. The operating system then performs the actual transformation needed to read and write the stream of bytes to the device.
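
A minimal POSIX sketch of the model: the same read and write calls apply whether the descriptor names a file, a pipe, or a terminal (error handling abbreviated; the file path is arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Open a device or file; the kernel hides which it is. */
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0)
            return 1;

        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);     /* just a stream of bytes */
        if (n > 0)
            write(STDOUT_FILENO, buf, (size_t)n);  /* same model for output  */
        close(fd);
        return 0;
    }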

Most graphics libraries such as OpenGL provide an abstract graphical device model as an interface. The library is responsible for translating the commands provided by the programmer into the specific device commands needed to draw the graphical elements and objects. The specific device commands for a plotter are different from the device commands for a CRT monitor but the graphics library hides the implementation and device dependent details by providing an abstract interface which provides a set of primitives that are generally useful for drawing graphical objects.

In computer science, an abstraction level is a generalization of a model or algorithm, away from any specific implementation. These generalizations arise from broad similarities that are best encapsulated by models that express similarities present in various specific implementations. The simplification provided by a good abstraction layer allows for easy reuse by distilling a useful concept or metaphor so that situations where it may be accurately applied can be quickly recognized.

A good abstraction will generalize that which can be made abstract, while allowing specificity where the abstraction breaks down and its successful application requires customization to each unique requirement or problem.

Frequently, abstraction layers can be composed into a hierarchy of abstraction levels. The ISO OSI networking model comprises seven abstraction layers. Each layer of the model encapsulates and addresses a different part of the needs of digital communications, thereby reducing the complexity of the associated engineering solutions.

A famous aphorism, often attributed to Butler Lampson (who himself credited David Wheeler), goes: All problems in computer science can be solved by another level of indirection; this is often deliberately misquoted with "abstraction" substituted for "indirection". Kevlin Henney's corollary to this is, "...except for the problem of too many layers of indirection."