Super scalar processors

Overview

So far, with Scalar pipelined processors, the focus has been on reducing the CPI (Cycles Per Instruction). With Super Scalar the focus shifts to increasing IPC (Instructions Per Cycle), thereby improving processor performance

There are 2 approaches that help achieve this

Super Scalar                                   | VLIW
Dynamic multiple instruction issue             | Static (at code compilation) multiple instruction issue
Issues multiple instructions per cycle         | Packs multiple operations in 1 instruction
Issue width decided dynamically by hardware    | Issue width decided statically at compile time
A micro architecture approach                  | An Instruction Set Architecture approach

The super scalar micro architecture aims to improve performance by increasing the Instructions Per Cycle (IPC); the other levers in the performance equation are:

  • Reducing the instruction count
  • Increasing the clock frequency

To raise IPC it uses the following techniques

  • Multiple issue – more than one instruction per cycle
  • Out of order execution – an instruction executes as soon as it is ready, instead of strictly following program order
  • Register renaming – eliminates false dependencies (Write After Read, Write after Write)
  • Dispatch buffers – holding on to instructions waiting for them to be ready
  • Dynamic scheduling – by tracking operand readiness

Pipelining in Super Scalar

  • Single unified/master pipeline – not optimal in Super Scalar where multiple instructions are issued in 1 cycle since
    • bubbles get introduced i.e. every instruction does not use every stage
    • pipeline is underutilized due to bubbles and dependencies
    • increased structural hazards due to multiple instructions competing for same resources
  • Parallel pipelines are optimal for Super Scalar and can be of following types
    • Temporal – multiple sequential pipelines processing different cycles
    • Spatial – duplicate hardware pipelines processing instructions in parallel
    • Hybrid – combines both temporal and spatial
    • Note: hardware cost significantly increases with such parallel pipelines with more registers, ports and so on
  • Diversified Pipelines
    • Multiple specialized execution units process specific instruction types
    • After fetch and decode, the RD stage dispatches each instruction to one of the specialized execution units based on its instruction type
    • Benefits of diversified pipelines
      • efficient hardware design since each execution unit is specialized for its type of instruction
      • each instruction type incurs only its required latency since its execution unit is specialized for it
      • if inter instruction dependencies are resolved before dispatching to specific execution unit then stalls in execution will not happen
    • Example: Motorola 88110 has 10 pipelines: 2 IU, 1 BU, 1 FPU, 1 Load/Store, 2 GU, etc., each with its own register read/write ports
  • Dynamic Pipelines
    • A parallel pipeline that allows out of order execution of instructions is a Dynamic pipeline
      • With out of order execution stalling is reduced and Instructions per cycle are increased
    • In order to support such out of order of execution the multi entry buffers are required
    • Multi entry buffers
      • In a scalar pipeline every pair of adjacent stages has a single entry buffer. The output of the instruction at stage i is saved to the buffer, which is then retrieved by stage i+1
        • in case of a dependency, the buffer that latches the stage i output is stalled until the dependency is resolved
      • A multi entry buffer latches the output of more than 1 instruction, and can hold an instruction's output for multiple cycles until its dependencies are resolved
        • It is like a multi-port RAM where each buffer entry is independent and can be accessed via its own input/output
        • Each buffer entry can also be indexed with a tag, like an associative cache: when an instruction becomes ready this is announced, and any buffer entry associated with that instruction is reached
      • The first such buffer is the dispatch (multi entry) buffer, which is responsible for dispatching instructions to specific execution units
      • At the end there is a re-order buffer since with out of order execution the processor still needs to complete all instructions in program order

Flow of execution in Super Scalar Pipeline

  1. Parallel fetch
  2. Parallel decode
  3. Multiple Instruction Issue
    • Register renaming
    • Speculative branch processing
  4. Parallel instruction execution
  5. Reordering
    • sequential instruction execution consistency
    • sequential exception processing consistency

Instruction Fetch

  • In a super scalar pipeline the Fetch aims to fetch multiple instructions from Instruction (I) cache in 1 machine cycle
  • The structure of the I cache is like an array (see below); say Fetch reads 1 row of the array at a time
    • Fetch width = row width = width of a cache line in bytes
    • So if instructions are 4 bytes each, the cache is divided into lines, with each line split into 4-byte blocks
  • The main impediments to solve in fetching row width are
    • misalignment of instructions on multiple array rows
      • solution 1: compiler at compile time organizes the instructions cache such that misalignment of instruction does not happen
      • solution 2: hardware like a T hardware in RS/6000 processors which works as follows
        • Example: sub-array sa0 holds i0, sa1 holds i1, sa2 holds i2, and so on
        • Suppose the fetch group is 4 instructions at a time
        • Suppose the PC points to i2, so fetch needs to cross sub-array boundaries to fetch 4 instructions
        • The T hardware detects that i2 is in sa2, so it adjusts the addresses presented to sa0 and sa1 to fetch their instructions from the next row
    • existence of branch/control instruction in the fetch group
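As a rough illustration of the misalignment problem, here is a minimal Python sketch (the function name and the row width of 4 are assumptions, not from a specific processor): without alignment hardware like the T logic, a fetch group stops at the row boundary.

```python
# Sketch: effect of fetch-group alignment, assuming a cache row holds
# ROW instructions and fetch can read at most one row per cycle.

ROW = 4  # instructions per cache row (fetch width); an assumed value

def instructions_fetched(pc, row=ROW):
    """Without alignment hardware, fetch stops at the row boundary:
    if the PC points into the middle of a row, fewer than `row`
    instructions are returned this cycle."""
    offset = pc % row            # position of the PC within its row
    return row - offset          # instructions available up to the boundary

# PC aligned to a row start -> full fetch group
print(instructions_fetched(8))   # 4
# PC pointing at the 3rd instruction of a row (like i2 in the example)
print(instructions_fetched(2))   # 2
```

The T hardware recovers the lost slots by redirecting the sub-arrays before the boundary to the next row, so the full group is fetched in one cycle.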

Instruction decode

  • The P6 microarchitecture implements a CISC instruction set and has super scalar pipelines
  • With CISC, decoding complex instructions in parallel becomes challenging. This is solved as follows
    • 3 decoders in parallel that convert CISC instructions into RISC like instructions or micro operations
      • 1 generalized decoder that decodes complex instructions and generates up to 4 micro operations
      • 2 simple decoders for simple instructions that generate 1 micro operation each
      • The decoded micro operations are put into a reorder buffer while waiting for dispatch and issue
    • In addition it employs a pre-decoder that identifies the opcode, start/end, prefixes and so on, simplifying the work of the decoders
      • since CISC has variable length instructions, the decoder needs to determine the size of each instruction (which can range from 1-15 bytes)
      • hence the decoder needs to find the start and end of each instruction
  • Below is the flow of decoding
    • predecode buffer – marks instruction begin/end
    • instruction buffer gets all instructions
    • simple instructions (like add) go to partial/simple decoder
    • complex instructions (like move) go to the full decoder since they expand into multiple micro operations

Dispatch and Issue

  • Dispatch's objective is to maximize the number of instructions in a fetch packet (in super scalar, fetch brings in multiple instructions) that are sent on to the execute stage
    • Dispatch = is the entry into the dispatch buffer or stage
    • Issue = exit from dispatch buffer
    • Dispatch buffer is a temporary storage of instructions until dependencies (resource/structural, data, control dependencies) are resolved
    • With the Dispatch buffer, fetch/decode do not stall unless the buffer is full
    • Dispatch buffer is also called Reservation Station or Shelving Buffer
    • Dispatch and Issue take care of
      • Renaming of registers
      • Allotment of reservation stations
      • Allotment of reorder buffer in case of out of order execution
      • Insertion of instructions into reservation stations
    • Used in case of Super Scalar architectures
  • In case of CISC, individual micro operations for each instruction will enter into Dispatch stage while in case of RISC single instructions are going to enter
  • In order to dispatch an instruction the following conditions need to be met
    • Free reservation station
    • Free renaming register
    • Free re-order buffer
    • if any of the above are not met, the instruction is stalled in the decode stage
  • In order to issue an instruction the following conditions need to be met
    • all operands are available
    • corresponding execution unit is ready
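The dispatch and issue conditions above can be sketched as simple predicates; this is only an illustration, and the structure names are assumptions:

```python
# Sketch of the dispatch and issue checks listed above; all names
# are illustrative, not from a specific processor.

def can_dispatch(free_rs, free_rename_regs, free_rob_entries):
    """An instruction may leave decode only if every structure it
    needs has a free slot; otherwise it stalls in the decode stage."""
    return free_rs > 0 and free_rename_regs > 0 and free_rob_entries > 0

def can_issue(operands_ready, execution_unit_free):
    """An instruction may leave the reservation station only when all
    source operands are available and its execution unit is ready."""
    return all(operands_ready) and execution_unit_free

print(can_dispatch(2, 1, 4))            # True
print(can_dispatch(0, 1, 4))            # False: no free reservation station
print(can_issue([True, True], True))    # True
print(can_issue([True, False], True))   # False: an operand is still pending
```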

Execution

Dynamic execution core

  • By the time an instruction enters the reservation stations for issue, false and control dependencies are resolved. The only remaining ones are true dependencies
  • In above example there are 2 reservations stations for ALU instructions, 3 for Load/Store (memory) instructions and so on
  • ALU instruction takes 1 cycle to execute, load/store takes 2 cycles and so on
  • After execution instructions enter into completion buffer or complete stage
  • After the complete stage the instructions go into the retire stage, where the architectural registers are updated
  • The moment an instruction leaves its execution unit, it is said to finish

Complete

  • After execution finishes, the instruction enters the Re-order buffer in the complete stage, which ensures all instructions complete in program order
  • Re-order buffer controls the backend of the pipeline

Re-order buffer

  • circular buffer where
    • head pointer points towards the first valid entry
    • tail pointer points towards the first available location
  • instructions enter into reservation station and at the same time enter into the re-order buffer
  • in re-order buffer the instructions are sequentially maintained
  • example: 4 instructions a, b, c, d
    • a and c have finished execution while b and d have not
    • so a finishes execution and can complete, then retire
    • but c must wait for b to finish
  • x – is in execute
  • f – finished execution
  • i – is in issue or in reservation station
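A minimal sketch of this in-order retirement rule, assuming a simple list of ROB entries kept in program order:

```python
from collections import deque

# Sketch of in-order retirement from a re-order buffer. Entries are kept
# in program order; an entry retires only when it has finished AND it is
# at the head, so c must wait for b even though c finished first.

rob = deque([{"inst": "A", "finished": True},
             {"inst": "B", "finished": False},
             {"inst": "C", "finished": True},
             {"inst": "D", "finished": False}])

def retire(rob):
    """Retire from the head while the head entry has finished execution."""
    retired = []
    while rob and rob[0]["finished"]:
        retired.append(rob.popleft()["inst"])
    return retired

print(retire(rob))   # ['A'] -- C cannot retire past the unfinished B
```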

Structure of re-order buffer

  • Busy – whether entry is being used or not
  • Issued – whether the instruction has been issued for execution or not
  • Finished -instruction has finished executing
  • Inst Addr – address of the instruction
  • Rename Reg – which physical register the architectural register has been renamed to
  • Speculative – whether instruction is speculative based on branch prediction module
  • Valid – whether the branch prediction was correct or not; this is updated once the actual branch instruction is executed

Note: In case of speculative branch prediction

  • all instructions are executed but they are kept in re-order buffer until actual branch executed
  • once the actual branch executes, if the prediction was incorrect then all instructions that were executed due to the speculative branch prediction are flushed from the re-order buffer

Example with branch instruction

Retire

  • In this stage architectural registers are going to be updated

Details – Dispatch and Issue

Dispatch Policy

  • Dispatch uses the following techniques for each kind of dependency
    • False dependencies – Write after Read (WAR), Write after Write (WAW)
      • Register renaming – hardware in Dispatch buffer to do register renaming
    • Control dependencies – branching
      • speculative predict branch condition outcome as taken or not taken
      • Uses re-order buffer hardware for this
    • Control, data dependencies
      • Reservation stations/Shelving buffers – use the shelving (dispatch) buffer to make dependent (child) instructions wait while their parent instruction(s) are executed
  • Policies to support above
    • In order
      • a -> b (dep on a) -> c -> d (dep on b)
      • Cycle 1 -> a
      • Cycle 2 -> b, c (c has no dependency)
      • Cycle 3 -> d
    • Out of order – for some instructions like ALU
      • a -> b (dep on a) -> c -> d (dep on b)
      • Cycle 1 -> a, c
      • Cycle 2 -> b, d
    • Fixed Window
      • A window is the number of instructions considered for dispatch in 1 cycle
      • With a fixed window of 4, dispatch keeps working on the same 4 instructions in the current window until all 4 have been dispatched
    • Gliding window
      • Here, with a window size of 4, every dispatch cycle the window takes on additional instructions to maintain a window of 4
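The gliding window refill can be sketched as follows (an illustration with assumed names; readiness and dependency checks are omitted, and a fixed window would by contrast refill only after all entries had dispatched):

```python
# Sketch of a gliding (sliding) dispatch window: after some instructions
# dispatch, the window is topped up from the instruction stream so it
# always holds `width` candidates.

def glide(window, stream, dispatched, width=4):
    """Remove dispatched instructions and refill the window to `width`."""
    window = [i for i in window if i not in dispatched]
    while len(window) < width and stream:
        window.append(stream.pop(0))
    return window, stream

window, stream = ["a", "b", "c", "d"], ["e", "f", "g"]
# suppose b and c dispatch this cycle
window, stream = glide(window, stream, {"b", "c"})
print(window)   # ['a', 'd', 'e', 'f'] -- topped back up to 4
```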
(Figures: in order execution vs out of order execution)

Based on the dispatch policies above, typical dispatch rates (instructions per cycle) are as follows

  • CISC – 2
  • RISC – 3/4
  • Power (has 10 EU) – 6

Reservation stations

Reservation stations are characterized by 1) type, 2) capacity, 3) number of ports. There are the following types of reservation stations

  • Centralized
    • Common to all types of execution units (ALU, Memory, FPU and so on)
    • So same reservation station will be used for all types of instructions
    • Pros: Enables optimized utilization of reservation station
    • Cons: Complexity of ports due to instructions needing to go to different execution units
    • Uses the DRIS (Dynamic Renaming Instruction Shelving) architecture, combining into one hardware structure 1) reservation station 2) register renaming 3) reorder buffer
    • Can have up to 20-32 reservation station entries depending on the number of execution units
  • Distributed
    • Different reservation stations for each execution unit say 2 x ALU, 1 x Memory, 1 x FPU and so on will each have 1 reservation station
    • Pros: Simplicity of ports since each reservation unit connected to its execution unit
    • Cons: Potential under utilization of a reservation station depending on number of instructions of a particular type
    • Can have up to 2-4 reservation station entries per execution unit
  • Clustered
    • rarely used; it has mainly been proposed theoretically
    • similar kinds of execution units (like ALU and MEM) could share a common reservation station

Reservation station operations

  • Dispatching
    • entering reservation station and re-order buffer
    • register renaming
  • Waiting
    • waiting for operands
  • Issuing
    • when the operands are available the instruction goes for execution

Dispatch vs Issue

Operand Fetch Policies

The operand fetch (read) happens during the Dispatch and Issue stages as per one of the following policies

Dispatch bound policy

Fetch the value of the source operands at the point of entry into the reservation stations

Issue bound policy

Fetch the value of the source operands at the point of exit from the reservation stations

Issue Selection rule

  • Instruction is eligible for execution
    • no dependency due to which instruction has to wait
    • example: add R0, R1, R2
      • R1, R2 both available
      • ALU is available
  • Register renamed for a false dependency
    • example: (WAR dependency)
      • add R1, R2, R3
      • xor R2, R0, R6
    • in the above example, register R2 in the xor instruction is renamed
    • after which the false dependency is resolved and the instruction can be issued
  • Speculative branching prediction resolves control dependency
    • after doing branch prediction the instruction can be executed
  • In case multiple instructions are ready for execution then
    • older instruction is executed before younger instruction

Issue order

  • Example
    • ld r1, [r2]
    • add r2, r1, r1
    • fmul f1,f2,f3
    • st r6, [r12]
    • xor r9, r10, r12
    • bne x1
  • In order
    • take the last entry in reservation station and issue
    • in the above example, instructions issue one after the other as they become ready to execute
    • not useful for super scalar architectures
  • Out of order
    • issue those instructions that are ready for execution, i.e. that don't have a data or resource dependency
    • 2 types
      • Partially out of order
        • when only selected instructions, based on reservation station or instruction type, are issued out of order
      • Out of order
        • any instruction when it is ready to execute
        • in above example – instructions 1,3,5,6
  • 3 policies based on above
    • Straight forward
      • Individual/Distributed reservation stations per execution unit
      • the last instruction in the reservation station, if ready, is sent for execution
    • Enhanced
      • Individual/Distributed reservation stations per execution unit
      • Out of order issue of instructions as soon as any instruction is ready
    • Advanced
      • Centralized reservation stations
      • Multiple instructions can be issued
      • Out of order issue of instructions as soon as any instruction is ready
        • checks the type of instruction to send to correct execution unit

Issue Rate

  • 1 or more instructions in a cycle
  • clustered reservation stations achieve a lower issue rate than centralized reservation stations
  • Issue rate > Dispatch rate
    • since many instructions are already waiting in the reservation stations, more can become ready to issue in a cycle than are newly dispatched

Availability of operands

  • instructions held in a reservation station are issued only when all their operands are available
    • operands are fetched from the register file
  • to check if operands are available, each register's content needs to be checked for validity
  • Policy 1 – Issue bound – the reservation station stores the tags of the operands and destination register, and the operand check is done at the time of issue
    • a scoreboard tells whether each register's contents are valid. example
      • reg 0 – 1 (valid)
      • reg 1 – 0 (invalid)
      • reg 8 – 1
    • Example: add R6, R0, R8
      • since R0, R8 are valid the instruction will get issued
      • when instruction is issued R6 is set as invalid
      • once instruction execution is completed R6 will be set as valid
        • this ensures a subsequent instruction with a dependency on, say, R6 will not be issued until R6 is valid
    • Figure below
      • Rs1, Rs2 (source register) to be checked for validity
      • Rd (destination register) will be set as invalid
  • Policy 2 – Dispatch bound – operand values enter the reservation station together with their validity at dispatch time
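The issue-bound scoreboard check can be sketched like this (register names and values are illustrative):

```python
# Sketch of the issue-bound validity check: a scoreboard marks whether
# each register's content is valid. Sources must be valid to issue; the
# destination is marked invalid until the instruction finishes.

scoreboard = {"R0": True, "R1": False, "R6": True, "R8": True}

def try_issue(rd, rs1, rs2, sb):
    """Issue only if both source registers are valid; then reserve rd."""
    if sb[rs1] and sb[rs2]:
        sb[rd] = False          # rd stays invalid while in flight
        return True
    return False

def finish(rd, sb):
    """On completion, the destination becomes valid again."""
    sb[rd] = True

print(try_issue("R6", "R0", "R8", scoreboard))  # True: R0 and R8 valid
print(scoreboard["R6"])                          # False while in flight
finish("R6", scoreboard)
print(scoreboard["R6"])                          # True after completion
```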

(Check example in class of 12-Oct 2025)

Register Renaming

  • to remove false dependencies, register renaming is done at the Issue stage
  • Types
    • Static
      • done by compiler in scalar
    • Dynamic
      • done by hardware in super scalar
      • Example – false dependency
        • add R0, R1, R2
        • sub R1, R3, R4
        • in this case R1 is renamed to a physical register when issuing to the execution unit
  • Renaming implementation
    • scope – are we going to rename for all instructions?
      • partial
      • full – all instructions will rename destination register
    • layout – how registers will be laid out
      • type of rename buffers
        • Architectural register file (ARF) – the architectural registers; the compiler knows only about these
        • Rename register file (RRF) – the actual physical registers
    • When does register renaming happen?

Types of layout

Merged ARF and RRF

  • example: registers 0-39, of which registers 0-31 are the architectural registers

Separate ARF and RRF

Example – add R5, R0, R1

  • R0 is busy hence mapped to P1
  • R1 is free hence left as is
  • R5 is free hence left as is
  • instruction becomes – add R5, P1, R1
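As a complement, a minimal sketch of the common destination-renaming scheme (full renaming: every destination gets a fresh physical register), applied to the earlier WAR example (add R1, R2, R3 followed by xor R2, R0, R6); the physical register names are assumptions:

```python
# Sketch of dynamic renaming with a map from architectural to physical
# registers. Only the destination gets a fresh physical register; sources
# read whatever mapping is currently in effect.

rename_map = {}                 # architectural -> physical
free_list = ["P1", "P2", "P3"]  # free physical (rename) registers

def rename(dst, src1, src2):
    s1 = rename_map.get(src1, src1)   # use the existing mapping if any
    s2 = rename_map.get(src2, src2)
    phys = free_list.pop(0)           # fresh register breaks WAR/WAW
    rename_map[dst] = phys
    return phys, s1, s2

# add R1, R2, R3  followed by  xor R2, R0, R6 (WAR on R2)
print(rename("R1", "R2", "R3"))   # ('P1', 'R2', 'R3')
print(rename("R2", "R0", "R6"))   # ('P2', 'R0', 'R6') -- WAR removed
```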

ARF and ROB – ?

ARF and DIBS – ?

Method to access rename buffers

  • Red – true dependency
  • Blue – false dependency
  • Black – false dependency

(See class on 19-10-2025 for explanation of this example)

Register renaming workflow

Above for Dispatch bound

Below for Issue bound

Tomasulo Algorithm

  • 2 instructions dispatched at a time

Branch Prediction Module

Branch prediction happens during Instruction fetch

Instruction Processing

  • Instruction Flow
    • Branch Instruction Processing
  • Register Data Flow
    • ALU Instruction Processing
  • Memory Data Flow
    • Load/Store Instruction Processing

Instruction Flow Techniques

  • Maximize supply of instructions
  • Pipeline
    • Front End
      • Fetch
      • Decode
      • Dispatch and Issue
    • Backend
      • Execution Unit
      • Complete
      • Retire
    • Branch Execution Unit is part of the Backend of the pipeline
  • Example: BEQ PC+offset
    • overhead of 3 cycles, since the branch needs to go through Fetch -> Decode -> Dispatch + Issue

Branch Penalty

  • BTA – Branch Target Address -> depending on how the address is specified, the penalty differs
    • PC Relative – BEQ PC+offset
      • In case of unconditional branch – JMP PC+offset
        • Bypass from Decode
        • Hence Penalty is 1
    • Register Indirect – BEQ [R0]
      • In case of unconditional branch
        • Bypass from Dispatch (since requires to know register value)
        • Hence Penalty is 2
    • Register Indirect + offset – BEQ [R0 + offset]
      • In case of unconditional branch
        • Bypass from Branch Execution Unit (branch address needs calculation)
        • Hence Penalty is 3
  • BC – Branch Condition
    • CC (Condition Code or Flag) Register
      • Bypass from Dispatch (since requires to read Flag register)
      • hence Penalty is 2
    • GP value Comparison
      • BEQ R0, R1, offset
      • Bypass from Branch Execution Unit (condition needs to be evaluated)
      • hence Penalty is 3
  • Branch Prediction
    • To remove the penalties we need to go for Branch Prediction
    • Why?
      • branches reduce the number of instructions available for issue, since they create a control dependency
    • Branch Prediction Unit is in the frontend of the pipeline along with Fetch unit
      • This is because if it were in the backend of the pipeline, the penalties would still exist
    • Types
      • Static branch prediction
        • policy implemented in processor
      • Dynamic branch prediction
        • 2 parts
          • Branch Target (Address) Speculation
          • Branch Condition Speculation

Dynamic Branch Prediction – Branch Target Speculation

  • Branch prediction is done in fetch (front end of pipeline) and validated at finish (back end of pipeline)
  • Branch Target Buffer (or Branch Target Address Cache)
    • For a branch instruction address (BIA), if there is an entry in the BTB giving the Branch Target Address (BTA), then fetch goes directly to the BTA
    • Example:
      • ld R0, [R1]+4
      • ld R2, [R1] +4
      • add R3, R0, R2
      • st R3, [R4]+4
      • dec R5
      • BEQ X1
    • The first time, the branch will suffer the penalty
    • After the first time the target address is known and an entry exists in the BTB, so fetch will go directly to the BTA
      • With this there is no penalty
      • For a loop running iterations 0-31 there will be penalties on the 0th and 31st iterations, but for the remaining iterations there will be no penalties

Dynamic Branch Prediction – Branch Condition Speculation

1 bit predictor – remember last taken/not taken branch

  • If last time branch was taken then it will be predicted as taken next time
  • Like a 1 bit counter – 0 or 1 state
    • when counter starts it starts with not taken (0)
  • behavior
    • 0 – the branch was not taken; prediction is 'not taken'
      • it will predict not taken for the next cycle
    • 1 – the branch was taken; prediction is 'taken'
      • it will predict taken for the next cycle
    • the transition between 0 and 1 depends on the actual branch outcome being taken or not taken
      • all state transitions happen after the branch executes
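A minimal sketch of the 1-bit predictor described above:

```python
# Sketch of a 1-bit predictor: predict whatever the branch did last
# time; update the bit after the branch actually executes.

class OneBitPredictor:
    def __init__(self):
        self.state = 0          # starts at 0 = not taken

    def predict(self):
        return self.state == 1  # True means "predict taken"

    def update(self, taken):
        self.state = 1 if taken else 0

p = OneBitPredictor()
print(p.predict())   # False: initial prediction is not taken
p.update(True)       # branch actually taken
print(p.predict())   # True: next prediction follows the last outcome
```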

2 bit predictor

  • Has 4 states instead of 2, allowing more information about tendencies
  • It works like a up-down saturating counter
  • See below state diagram of the 4 states. and prediction will be based on
    • 00, 01 -> not taken decision
    • 10, 11 -> taken decision
  • Predictors never start in an extreme state, so they start from either the 01 or 10 state
  • The different states of the predictor are
    • 00 – strongly not taken
    • 01 – weakly not taken -> may start here
    • 10 – weakly taken -> or here
    • 11 – strongly taken
  • See example below for how it works
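A minimal sketch of the 2-bit up-down saturating counter (the weakly-taken starting state is one of the two allowed starts):

```python
# Sketch of the 2-bit saturating counter: states 0..3, where 0/1
# predict not taken and 2/3 predict taken. It starts in a weak state,
# never an extreme one.

class TwoBitPredictor:
    def __init__(self, state=2):     # 2 = weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2       # 10 or 11 -> predict taken

    def update(self, taken):
        # saturate at 0 and 3; move one step toward the actual outcome
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
print(p.predict())   # True (weakly taken)
p.update(False)      # one not-taken outcome -> weakly not taken (01)
print(p.predict())   # False
p.update(True)       # flips back to weakly taken (10)
print(p.predict())   # True
```

Because two consecutive mispredictions are needed to flip a strong state, a single anomalous outcome (e.g. a loop exit) does not destroy a well-established prediction.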

Branch History Table

  • Table of predictors with a row for each branch, each holding a 2-bit predictor
    • there will be multiple predictors, one per tracked branch
    • the table can be associative, indexed by the Branch Instruction Address (BIA)
      • this introduces latency when executing instructions, since the address must be searched for in the table
    • the alternative is a direct access table, indexed by part of the branch address
  • Flow
    • Associative search is done in BHT
    • If finds BIA then checks Branch History (2 bit predictor) for taken or not taken
    • if taken then will go to BTA
  • Generally in predictors direct access is used instead of associative
  • if predictor predicts
    • correctly – execution continues and the BHT entry is moved to the next stronger state
    • incorrectly – the pipeline is flushed and the BHT entry is moved back toward the weaker state
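A sketch of the direct access variant: a small table of 2-bit counters indexed by the low bits of the branch address (the table size and index function are assumptions):

```python
# Sketch of a direct-access BHT: instead of an associative search, low
# bits of the branch instruction address index a table of 2-bit
# saturating counters.

BHT_BITS = 4
bht = [1] * (1 << BHT_BITS)          # all entries start weakly not taken

def index(branch_addr):
    # drop the byte offset, keep BHT_BITS low bits as the table index
    return (branch_addr >> 2) & ((1 << BHT_BITS) - 1)

def predict(branch_addr):
    return bht[index(branch_addr)] >= 2

def update(branch_addr, taken):
    i = index(branch_addr)
    bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

addr = 0x40_0080
print(predict(addr))   # False initially (weakly not taken)
update(addr, True)
print(predict(addr))   # True after one taken outcome (counter now 2)
```

Note that two different branches whose addresses share the same low bits alias to the same counter; this is the price of avoiding the associative search.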

Branch prediction accuracy

  • 2:1 – a branch is roughly twice as likely to be taken as not taken
  • 1 bit predictor – 79.7% – 96%
  • 2 bit predictor – 83.4% -94.5%
  • 3 and 4 bit predictor – 93%-98%
  • No improvement beyond 4 bit predictor

Branch misprediction recovery

  • There are 2 interacting engines
    • Leading engine – does speculation at the front end of pipeline (fetch stage)
    • Trailing engine – does validation at the back end of the pipeline (finish stage)
  • They both work together to ensure branch misprediction recovery happens by updating validity of branch prediction in the re-order buffer

Notes

  • In the case of multiple in-flight branches, each set of instructions sent for execution under a branch prediction is given a tag, so that it is known under which branch prediction they are executing
    • if a branch prediction turns out invalid, then all instructions tagged with that branch and its sub-branch predictions are marked invalid in the re-order buffer
  • In case of a branch misprediction the incorrect path is terminated and fetch begins from the new (valid) path

Correlated Branches

  • The result of one branch affects the result of another branch
  • Example
    • if (d== 0)
      • d =1;
    • if (d==1)
    • So for example if ->
      • B1 is Taken then
        • B2 is Taken or Not Taken
      • B1 is Not Taken then
        • B2 is Not Taken
    • so in this example B2 is correlated to B1
    • A predictor for B2 will have to look at both B2's and B1's history to predict
  • Plain 2-4 bit predictors cannot be used for correlated branches
  • For predicting correlated branches the solution is as follows
    • there will be multiple Branch History Tables (BHTs), let's say 4
    • the recent branch history selects which BHT row is used to predict the branch
  • Example below
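A hedged sketch of such a correlating predictor: a 2-bit global history of the last two branch outcomes selects one of 4 BHTs of 2-bit counters, so B2's prediction can depend on what B1 did (all sizes and names are assumptions):

```python
# Sketch of a correlating (two-level) predictor. The global history of
# the last HIST_BITS branch outcomes picks which BHT to consult, so the
# same branch can learn different behaviors under different histories.

HIST_BITS = 2
tables = [[1] * 16 for _ in range(1 << HIST_BITS)]  # 4 BHTs, weakly not taken
history = 0                                          # last 2 outcomes as bits

def predict(branch_index):
    return tables[history][branch_index] >= 2

def update(branch_index, taken):
    global history
    ctr = tables[history][branch_index]
    tables[history][branch_index] = min(3, ctr + 1) if taken else max(0, ctr - 1)
    history = ((history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# B1 at index 0, B2 at index 1 (illustrative indices)
update(0, True)      # B1 taken: history becomes 01
print(predict(1))    # False: B2 not yet trained under this history
update(1, True)      # train "B2 taken when B1 was taken"
```

After training, the counter for B2 under the "B1 taken" history is strengthened independently of B2's counter under other histories, which is exactly what the d == 0 / d == 1 example requires.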

Instruction status

The following could be the instruction status

  • Waiting execution
  • In execution
  • Finished execution
  • Speculative
    • in fetch, branch prediction is done
    • if the branch prediction is correct, then move on to the next instructions
    • if the branch prediction is incorrect, then flush the pipeline