Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (13 page)

Read Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture Online

Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

BOOK: Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

13.55Mb size Format: txt, pdf, ePub

Read Book Download Book

will not necessarily be completed at the close of each clock cycle, as was the

case with the single-cycle processor. Instead, a new instruction will be com-

pleted at the close of only those clock cycles in which the write stage has been working on an instruction. Any clock cycle with an empty write stage will add

no new instructions to the “Completed Instructions” box, and any clock cycle

with an active write stage will add one new instruction to the box. Of course,

this means that when the pipeline first starts to work on a program, there will

be a few clock cycles—three to be exact—during which no instructions are

completed. But once the fourth clock cycle starts, the first instruction enters

the write stage and the pipeline can then begin completing new instructions

on each clock cycle, which, because each clock cycle is 1 ns, translates into a

completion rate of one instruction per nanosecond.

Shrinking Program Execution Time

Note that the total execution time for each individual instruction is not

changed by pipelining. It still takes an instruction 4 ns to make it all the way through the processor; that 4 ns can be split up into four clock cycles of 1 ns

each, or it can cover one longer clock cycle, but it’s still the same 4 ns. Thus pipelining doesn’t speed up instruction execution time, but it does speed up

program execution time
(the number of nanoseconds that it takes to execute an entire program) by increasing the number of instructions finished per unit

Pipelined Execution

of time. Just like pipelining our hypothetical SUV assembly line allowed us to

fill the Army’s orders in a shorter span of time, even though each individual

SUV still spent a total of five hours in the assembly line, so does pipelining

allow a processor to execute programs in a shorter amount of time, even

though each individual instruction still spends the same amount of time

traveling through the CPU. Pipelining makes more efficient use of the CPU’s

existing resources by putting all of its units to work simultaneously, thereby

allowing it to do more total work each nanosecond.

The Speedup from Pipelining

In general, the speedup in completion rate versus a single-cycle implementa-

tion that’s gained from pipelining is ideally equal to the number of pipeline

stages. A four-stage pipeline yields a fourfold speedup in the completion rate

versus a single-cycle pipeline, a five-stage pipeline yields a fivefold speedup, a twelve-stage pipeline yields a twelvefold speedup, and so on. This speedup is

possible because the more pipeline stages there are in a processor, the more

instructions the processor can work on simultaneously, and the more instruc-

tions it can complete in a given period of time. So the more finely you can slice those four phases of the instruction’s lifecycle, the more of the hardware that’s used to implement those phases you can put to work at any given moment.

To return to our assembly line analogy, let’s say that each crew is made

up of six workers, and that each of the hour-long tasks that each crew per-

forms can be readily subdivided into two shorter, 30-minute tasks. So we can

double our factory’s throughput by splitting each crew into two smaller, more

specialized crews of three workers each, and then having each smaller crew

perform one of the shorter tasks on one SUV per 30 minutes.

Stage 1:
Build the chassis.

Crew 1a:
Fit the parts of the chassis together and spot-weld the joints.

Crew 1b:
Fully weld all the parts of the chassis.

Stage 2:
Drop the engine into the chassis.

Crew 2a:
Place the engine into the chassis and mount it in place.

Crew 2b:
Connect the engine to the moving parts of the car.

Stage 3:
Put the doors, a hood, and coverings on the chassis.

Crew 3a:
Put the doors and hood on the chassis.

Crew 3b:
Put the other coverings on the chassis.

Stage 4:
Attach the wheels.

Crew 4a:
Attach the two front wheels.

Crew 4b:
Attach the two rear wheels.

Stage 5:
Paint the SUV.

Crew 5a:
Paint the sides of the SUV.

Crew 5b:
Paint the top of the SUV.

Chapter 3

After the modifications described here, the 10 smaller crews in our

factory now have a collective total of 10 SUVs in progress during the course

of any given 30-minute period. Furthermore, our factory can now complete

a new SUV every 30 minutes, a tenfold improvement over our first factory’s

completion rate of one SUV every five hours. So by pipelining our assembly

line even more deeply, we’ve put even more of its workers to work con-

currently, thereby increasing the number of SUVs that can be worked on

simultaneously and increasing the number of SUVs that can be completed

within a given period of time.

Deepening the pipeline of the four-stage processor works on similar

principles and has a similar effect on completion rates. Just as the five

stages in our SUV assembly line could be broken down further into a

longer sequence of more specialized stages, the execution process that

each instruction goes through can be broken down into a series of much

more than just four discrete stages. By breaking the processor’s four-stage

pipeline down into a longer series of shorter, more specialized stages,

even more of the processor’s specialized hardware can work simultaneously

on more instructions and thereby increase the number of instructions that

the pipeline completes each nanosecond.

We first moved from a single-cycle processor to a pipelined processor

by taking the 4 ns time period that the instruction spent traveling through

the processor and slicing it into four discrete pipeline stages of 1 ns each in

length. These four discrete pipeline stages corresponded to the four phases

of an instruction’s lifecycle. A processor’s pipeline stages aren’t always going to correspond exactly to the four phases of a processor’s lifecycle, though.

Some processors have a five-stage pipeline, some have a six-stage pipeline,

and many have pipelines deeper than 10 or 20 stages. In such cases, the CPU

designer must slice up the instruction’s lifecycle into the desired number of

stages in such a way that all the stages are equal in length.

Now let’s take that 4 ns execution process and slice it into eight discrete

stages. Because all eight pipeline stages must be of exactly the same duration

for pipelining to work, the eight pipeline stages must each be 4 ns ÷ 8 = 0.5 ns in length. Since we’re presently working with an idealized example, let’s

pretend that splitting up the processor’s four-phase lifecycle into eight

equally long (0.5 ns) pipeline stages is a trivial matter, and that the results

look like what you see in Figure 3-8. (In reality, this task is not trivial and

involves a number of trade-offs. As a concession to that reality, I’ve chosen

to use the eight stages of a real-world pipeline—the MIPS pipeline—in

Figure 3-8, instead of just splitting each of the four traditional stages in two.) Because pipelining requires that each pipeline stage take exactly one

clock cycle to complete, the clock cycle can now be shortened to 0.5 ns

in order to fit the lengths of the eight pipeline stages. At the bottom of

Figure 3-8, you can see the impact that this increased number of pipeline

stages has on the number of instructions completed per unit time.

Pipelined Execution

0.5 1 1.5 2 2.5 3 3.5 4 4.5

Stored

Instructions

CPU

Fetch1

Fetch2

Decode

Execute

Data1

Data2

Tag Ch.

Write

Completed

Instructions

4.5 5 5.5 6 6.5 7 7.5 8 8.5

Stored

Instructions

CPU

Fetch1

Fetch2

Decode

Execute

Data1

Data2

Tag Ch.

Write

Completed

Instructions

Figure 3-8: An eight-stage pipeline

Chapter 3

The single-cycle processor can complete one instruction every 4 ns, for

a completion rate of 0.25 instructions/ns, and the four-stage pipelined pro-

cessor can complete one instruction every nanosecond for a completion rate

of one instructions/ns. The eight-stage processor depicted in Figure 3-8

improves on both of these by completing one instruction every 0.5 ns, for a

completion rate of two instructions/ns. Note that because each instruction

still takes 4 ns to execute, the first 4 ns of the eight-stage processor are still dedicated to filling up the pipeline. But once the pipeline is full, the processor can begin completing instructions twice as fast as the four-stage processor

and eight times as fast as the single-stage processor.

This eightfold increase in completion rate versus a single-cycle design

means that the eight-stage processor can execute programs much faster than

either a single-cycle or a four-stage processor. But does the eightfold increase in completion rate translate into an eightfold increase in processor performance? Not exactly.

Program Execution Time and Completion Rate

If the program that the single-cycle processor in Figure 3-6 is running

consisted of only the four instructions depicted, that program would have

a program execution time of 16 ns, or 4 instructions ÷ 0.25 instructions/ns.

If the program consisted of, say, seven instructions, it would have a program

execution time of 7 instructions ÷ 0.25 instructions/ns = 28 ns. In general,

a program’s execution time is equal to the total number of instructions in

the program divided by the processor’s instruction completion rate (number

of instructions completed per nanosecond), as in the following equation:

program execution time = number of instructions in program / instruction

completion rate

Most of the time, when I talk about processor
performance
in this book,

I’m talking about program execution time. One processor performs better

than another if it executes all of a program’s instructions in a shorter

amount of time, so reducing program execution time is the key to increasing

processor performance.

From the preceding equation, it should be clear that program execution

time can be reduced in one of two ways: by a reduction in the number of

instructions per program or by an increase in the processor’s completion

rate. For now, let’s assume that the number of instructions in a program is

fixed and that there’s nothing that can be done about this term of the equa-

tion. As such, our focus in this chapter will be on increasing instruction

completion rates.

In the case of a non-pipelined, single-cycle processor, the instruction

completion rate (
x
instructions per 1 ns) is simply the inverse of the instruction execution time (
y
ns per 1 instruction), where
x
and
y
have different numerical values. Because the relationship between completion rate and

instruction execution time is simple and direct in a single-cycle processor,

Pipelined Execution

an
n
fold improvement in one is an
n
fold improvement in the other. So improving the performance of a single-cycle processor is really about

reducing instruction execution times.

With pipelined processors, the relationship between instruction execution

time and completion rate is more complex. As discussed previously, pipelined

processors allow you to increase the processor’s completion rate without

altering the instruction execution time. Of course, a reduction in instruction

execution time still translates into a completion rate improvement, but the

reverse is not necessarily true. In fact, as you’ll learn later on, pipelining’s improvements to completion rate often come at the price of
increased
instruction execution times. This means that for pipelining to improve performance,

the processor’s completion rate must be as high as possible over the course

of a program’s execution.

The Relationship Between Completion Rate and Program Execution Time

If you look at the “Completed Instructions” box of the four-stage processor

back in Figure 3-7, you’ll see that a total of five instructions have been com-

pleted at the start of the ninth nanosecond. In contrast, the non-pipelined

processor illustrated in Figure 3-6 sports two completed instructions at the

start of the ninth nanosecond. Five completed instructions in the span of 8 ns

is obviously not a fourfold improvement over two completed instructions in

the same time period, so what gives?

Remember that it took the pipelined processor 4 ns initially to fill up

with instructions; the pipelined processor did not complete its first instruc-