1. This post deals with one such processor

 

1.   i        Introduction:

Tomasulo’s algorithm
is a dynamic scheduling technique that allows out-of-order execution of
instructions in a processor while at the same time minimizing hazards caused
due to dependencies. This technique is widely used in implementing dynamic
scheduling in modern day processors. This post deals with one such processor architecture,
Intel’s Core microarchitecture, and compares the dynamic scheduling technique
used in it with Tomasulo’s approach.

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!


order now

Dynamic scheduling in Intel’s Core
microarchitecture has been implemented by taking Tomasulo’s algorithm as the
basis. The pipeline of the Core microarchitecture contains the following three
stages.

· An in-order issue front end which
supplies instructions to execution core in program order. In this stage the
most likely instructions are pre-fetched and are decoded into micro-operations.

· An out-of-order execution core which
has the capability to re-order the micro-operations in a way to take care that
no CPU cycles are lost.

· An in-order retirement unit which
performs the task of updating the in the order of the original program once the
micro-operations have finished executing.

It was developed by Robert Tomasulo at
IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating
point unit. Robert Tomasulo received the Eckert-Mauchly Award in 1997 for his
work on the algorithm.

Figure
1.1: Tomasulo’s Algorithm: The Picture

2.          
Implementation Concepts:

The following are the concepts necessary to the
implementation of Tomasulo’s Algorithm:

1)           
Common data bus

2)           
Instruction order

3)           
Register renaming

4)           
Exceptions

2.1        
Common Data Bus:

 

The
Common Data Bus (CDB) connects reservation stations directly to functional
units. According to Tomasulo it “preserves precedence while encouraging
concurrency”. This has two important effects:

1)             
Functional units can access the result of any operation
without involving a floating point-register, allowing multiple units waiting on
a result to proceed without waiting to resolve contention for access to
register file read ports.

2)             
Hazard Detection and control execution are distributed. The
reservation stations control when an instruction can execute, rather than a
single dedicated hazard unit.

 

2.2        
Instruction Order:

Instructions are issued sequentially so that the
effects of a sequence of instructions, such as exceptions raised by these
instructions, occur in the same order as they would on an in-order processor,
regardless of the fact that they are being executed out-of-order (i.e.
non-sequentially).

2.3        
Register Renaming:

 

Tomasulo’s Algorithm uses register renaming to
correctly perform out-of-order execution. All general-purpose and reservation
station registers hold either a real value or a placeholder value. If a real
value is unavailable to a destination register during the issue stage, a
placeholder value is initially used. The placeholder value is a tag indicating
which reservation station will produce the real value. When the unit finishes
and broadcasts the result on the CDB, the placeholder will be replaced with the
real value.

 

Each functional unit has a single reservation station.
Reservation stations hold information needed to execute a single instruction,
including the operation and the operands. The functional unit begins processing
when it is free and when all source operands needed for an instruction are
real.

 

2.4        
Exceptions:

Practically speaking, there
may be exceptions for which not enough status information about an exception is
available, in which case the processor may raise a special exception, called an
“imprecise” exception. Imprecise exceptions cannot occur in non-Out of
Order Execution implementations, as processor state is changed only in program
order.

Programs that experience
“precise” exceptions, where the specific instruction that took the
exception can be determined, can restart or re-execute at the point of the
exception. However, those that experience “imprecise” exceptions
generally cannot restart or re-execute, as the system cannot determine the
specific instruction that took the exception.

3.          
Instruction Lifecycle:

 

­­ The three
stages listed below are the stages through which each instruction passes from
the time it is issued to the time its execution is complete.

Reservation Station Components:

Op

Operation to perform in the unit (e.g., + or
–)

Qj, Qk

Reservation stations producing source
registers (value to be written)
u  
Qj,Qk=0 => ready
u  
Store buffers only have Qi for RS producing result

Vj, Vk

Value of Source operands
u  
Store buffers has V field, result to be stored

A

Used to hold the
memory address information for a load or store

Busy

Indicates
reservation station or FU is busy

Register result
status

Indicates which
functional unit will write each register, if one exists. Blank when no
pending instructions that will write that register.

 

3.1       Stage 1: issue:

In
the issue stage, instructions are issued for execution if all operands and
reservation stations are ready or else they are stalled. Registers are renamed
in this step, eliminating WAR and WAW hazards.

v  Retrieve the next instruction from the head of the instruction
queue. If the instruction operands are currently in the registers, then

·        
If a matching functional
unit is available, issue the instruction.

·        
Else, as there is no
available functional unit, stall the instruction until a station or buffer is
free.

v  Otherwise, we can assume the operands are not in the registers,
and so use virtual values. The functional unit must calculate the real value to
keep track of the functional units that produce the operand.

Pseudocode:

Instruction state

Wait until

Action or bookkeeping

Issue

Station
r empty

if
(RegisterStatrs.Qi¦0) {
                                RSr.Qj ?
RegisterStatrs.Qi }
else {      RSr.Vj ? Regsrs;
                RSr.Qj ? 0;         }
if
(RegisterStatrt.Qi¦0) {
                RSr.Qk ?
RegisterStatrt.Qi;             }
else {      RSr.Vk ? Regsrt; RSr.Qk ? 0;      }
RSr.Busy ? yes;    RegisterStatrd.Q ? r;

Load
or Store

Buffer
r empty

if
(RegisterStatrs.Qi¦0) {
                RSr.Qj ?
RegisterStatrs.Qi;}
else {      RSr.Vj ? Regsrs;
                RSr.Qj ? 0;                          }
RSr.A ? imm;    RSr.Busy ? yes;

Load only

 

RegisterStatrt.Qi
? r;

Store only            

 

if
(RegisterStatrt.Qi¦0) {
                RSr.Qk ?
RegisterStatrs.Qi;  }
else {      RSr.Vk ? Regsrt;
                RSr.Qk ? 0    };

3.2       Stage 2: Execute:

In the execute stage, the instruction operations are carried
out. Instructions are delayed in this step until all of their operands are
available, eliminating RAW hazards. Program correctness is maintained through
effective address calculation to prevent hazards through memory.

v  If one or more of the operands is
not yet available then: wait for operand to become available on the CDB.

v  When all operands are available,
then: if the instruction is a load or store

·        
Compute
the effective address when the base register is available, and place it in the
load/store buffer:

·        
If
the instruction is a load then: execute as soon as the memory unit is available

·        
Else,
if the instruction is a store then: wait for the value to be stored before
sending it to the memory unit

v  Else, the instruction is an arithmetic logic unit (ALU) operation then: execute the instruction at the
corresponding functional unit.

Pseudocode:

Instruction state

Wait until

Action or book keeping

FP operation

(RSr.Qj = 0) and (RSr.Qk = 0)

Compute result: operands are in Vj
and Vk

Load/store step 1

RSr.Qj = 0 & r is head of
load-store queue

RSr.A ? RSr.Vj + RSr.A;

Load step 2

Load step 1 complete

Read from MemRSr.A

3.3       Stage
3: Write Result:

            In
the write Result stage, ALU operations results are written back to registers
and store operations are written back to memory.

v  If the instruction was
an ALU operation

·        
If the result is available, then: write it on the CDB and from
there into the registers and any reservation stations waiting for this result

v  Else, if the
instruction was a store then: write the data to memory during this step.

 

Pseudocode:

Instruction state

Wait until

Action or bookkeeping

FP operation or load       

Execution complete at r & CDB
available  

?x(if (RegisterStatx.Qi = r) {
                                regsx
? result;
                                RegisterStatx.Qi
= 0  });
                ?x(if (RSx.Qj = r) {
                                RSx.Vj
? result;
                                RSx.Qj
? 0;   });
                ?x(if (RSx.Qk = r) {
                                RSx.Vk
? result;
                                RSx.Qk
? 0;});
                RSr.Busy
? no;

Store

Execution complete at r &
RSr.Qk = 0

MemRSr.A ? RSr.Vk;   RSr.Busy ? no;

 

4.          
Algorithm Improvements:

The concepts of reservation stations, register renaming, and the common
data bus in Tomasulo’s algorithm presents significant advancements in the
design of high-performance computers. Reservation stations take on the
responsibility of waiting for operands in the presence of data dependencies and
other inconsistencies such as varying storage access time and circuit speeds,
thus freeing up the functional units. This improvement overcomes long floating
point delays and memory accesses. In particular the algorithm is more tolerant
of cache misses. Additionally, programmers are freed from implementing
optimized code. This is a result of the common data bus and reservation station
working together to preserve dependencies as well as encouraging concurrency.

By tracking operands for instructions in the reservation stations and
register renaming in hardware the algorithm minimizes read-after-write (RAW)
and eliminates write-after-write (WAW) and Write-after-Read (WAR) computer
architecture hazards. This improves performance by reducing wasted time that
would otherwise be required for stalls.

An equally important improvement in the algorithm is the design is not
limited to a specific pipeline structure. This improvement allows the algorithm
to be more widely adopted by multiple-issue processors. Additionally, the
algorithm is easily extended to enable branch speculation.

 

5.          
Simulation/Output:

 

Simulator’s output/ values are obtained from online Tomasulo
algorithm simulator http://nathantypanski.github.io/tomasulo-simulator. This Simulator simulates Tomasulo’s algorithm for a
floating-point MIPS-like instruction pipeline, demonstrating out-of-order
execution. Simulation Results are as under:

 

 

6.          
Summary:

 

Tomasulo’s Algorithm, so called because it was first
described in a paper by R.M. Tomasulo, was first used in the IBM System/360
Model 91 Floating-point Unit. Just like the Scoreboard mechanism originally
designed for the CDC 6600, Tomasulo’s Algorithm was designed to control the
flow of data between a set of programmable floating-point registers and a group
of parallel arithmetic units. Both are designed to ensure that constraints
imposed by instruction dependencies are satisfied. Tomasulo’s Algorithm is
different in that it systematically attempts to minimize delays between the
production of a result by one operation and the start of a subsequent operation
which requires that result as an input.

The algorithm works by using
additional registers, known as Reservation Stations, at the inputs to the
arithmetic units and a system of tags which direct result operands to where
they are next needed, rather than necessarily to where they would have gone
when the instructions which produced them were issued. This is essentially
a register renaming mechanism, and Tomasulo’s Algorithm was one of
the first examples of this mechanism, now used in a variety of high performance
processors. Tomasulo’s Algorithm itself is still used today in processors such
as the PowerPC 620.

7.          
References:

 

1)         
https://en.wikipedia.org/wiki/Tomasulo_algorithm

2)         
http://nathantypanski.github.io/tomasulo-simulator        \
Online Simulator

3)         
https://www.cc.gatech.edu/~milos/Teaching/CS6290F07/4_Tomasulo.pdf

4)         
www.cs.ucf.edu/~lboloni/Teaching/EEL5708_2005/slides/Tomasulo.ppt

5)         
www.info425.ece.mcgill.ca/tutorials/T06-Tomasulo.pdf

6)         

Начало

7)         
https://people.eecs.berkeley.edu/~pattrsn/252F96/Lecture04.pdf

8)       
 https://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/tomasulo.html

9)         
https://www.youtube.com/watch?v=CsCei-44jHs            \ Video Tutorial

10)        
https://courses.cs.washington.edu/courses/cse471/04au/reorderBufNew2.pdf

11)        
www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html

12)      
T. Arons and A. Pnueli. Verifying Tomasulo’s algorithm
by refinement. Technical report, Welzmann Institute, 1998