
Processor pipeline and pipeline design

Introduction to Pipelining

  • Pipelining
    • Breaking up instruction execution into multiple stages so that multiple instructions can be executed at the same time in different stages
    • It is vertical scaling of the processor and is done in scalar processors
    • Deep pipelining can have an adverse impact because
      • simple instructions are forced to go through every stage
      • branch instructions cause stalls in the pipeline
      • memory operations slow down the pipeline
    • However, balanced pipelining can significantly increase throughput while keeping each instruction's execution latency the same
    • For example:
      • non pipelined – 2 instructions x 5 machine cycles each = 10 machine cycles for 2 instructions
      • 5 stage pipelined – 6 cycles for 2 instructions (5 cycles for the first, then 1 cycle for each additional instruction)
  • How an instruction (ADD R1, R2, R3) executes in a 5 stage pipeline – Fetch, Decode, Execute, Memory, Write Back
    • Fetch
      • Gets PC i.e. address of the next instruction
      • Gets Instruction Bits from the Instruction Cache using the address
      • Writes Instruction Bits to IF/DE register
      • Writes PC + 4 to IF/DE register
    • Decode
      • Gets Instruction Bits from IF/DE register
      • Decodes Op Code and generates CONTROL signal
      • Gets operands from Register File
      • Writes CONTROL signal, operands and PC + 4 to DE/EX register
    • Execute
      • ALU gets CONTROL signal and operands
      • Calculates ALU result
      • Saves ALU result, CONTROL signal, PC+4 to EX/MEM register
    • Memory
      • Passes ALU result, CONTROL signal to MEM/WB register
    • WriteBack
      • Reads CONTROL signal
      • Writes ALU result to destination register
  • Pipeline performance
    • non pipelined = n x (IF + DE + EX + MEM + WB)
    • pipelined = max (IF, DE, EX, MEM, WB) x (n + k – 1), where k = number of stages (n + 4 for this 5 stage pipeline)
  • Pipeline idealism
    • uniform sub operations
    • repetition of identical operations
    • repetition of independent operations
  • Pipeline realism
    • Uniform sub operations will never happen, so
      • do stage quantization to get uniform stages
      • minimize internal fragmentation (so faster stages spend little time waiting for slower ones)
    • Repetition of identical operations will never happen, so
      • unify instruction types i.e. coalesce instruction types
    • Repetition of independent operations will never happen, so
      • resolve hazards via dependency (data, control, resource) detection and resolution
  • Hardware requirements for pipelining
    • Pipeline registers
    • Separate memory for data and instructions
    • Multiple register read/write ports
    • Forwarding hardware
    • Hazard detection unit
  • Example processor pipeline
    • MIPS – 5 stage pipeline with simple addressing
    • Amdahl 470V/7 – 12 stage pipeline with complex addressing
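The throughput comparison above can be sketched as a quick calculation; the stage latencies below are illustrative placeholders, not values from these notes:

```python
# Pipeline timing formulas: a non-pipelined machine runs each
# instruction through every stage before starting the next one, while a
# pipelined machine overlaps them, paying only a one-time fill cost.

def non_pipelined_time(n, stage_times):
    # n instructions, each taking the sum of all stage latencies
    return n * sum(stage_times)

def pipelined_time(n, stage_times):
    # cycle time is set by the slowest stage; total cycles = n + k - 1
    k = len(stage_times)
    return max(stage_times) * (n + k - 1)

# 5 equal stages of 1 cycle each, 2 instructions (the example above):
stages = [1, 1, 1, 1, 1]
print(non_pipelined_time(2, stages))  # 10
print(pipelined_time(2, stages))      # 6
```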

RISC Scalar pipeline

Designing a pipeline

  1. Select an instruction set architecture (ISA)
  2. For every instruction in ISA identify subtasks to be performed
  3. Logical layout of the pipeline
  4. Build forwarding or bypassing to remove hazards
  5. Do timing of the pipeline operations
  6. Dependency resolution
  7. # of pipeline stages

Performance of the pipeline

There are 2 factors to consider:

  • Cost of pipeline
    • a latch or buffer is placed between the stages of a pipeline
    • there is 1 latch per pipeline stage
    • Hence, Cost = G (gate cost of the non pipelined system) + [K (number of stages) x L (cost of each latch)]
  • Computation rate
    • Computation rate = 1 / T (latency of non pipelined system)
    • In case of pipelined systems
      • computation rate = 1 / ((T / K stages) + S (latch delay))
  • Based on the above, maximizing the performance/cost ratio gives
    • Optimal number of stages K opt = SQRT (GT / LS)
    • where
      • G = number of gates in the non pipelined system
      • T = latency of the non pipelined system
      • L = cost of each latch (in gates)
      • S = delay of each latch
  • Example
    • Given T = 100ns, G = 1000 units, L=50units, S=5ns
    • K opt = SQRT ((1000 x 100) / (50 x 5)) = SQRT (400) = 20
    • Pipeline stages = 20
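The K opt example can be checked with a few lines (same numbers as above):

```python
import math

def k_opt(G, T, L, S):
    # Optimal stage count from the cost/performance trade-off:
    # K_opt = sqrt(G*T / (L*S))
    return math.sqrt((G * T) / (L * S))

# T = 100ns, G = 1000 gate units, L = 50 units per latch, S = 5ns delay
print(k_opt(G=1000, T=100, L=50, S=5))  # 20.0
```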

Example of designing pipeline with basic ISA

Step 1: ISA

  1. ALU
    • FX – fixed point instructions
    • FP – floating point instructions
  2. Load and store
  3. Branching

Step 2: Sub tasks of each instruction of ISA

| Operation | ALU-FX | FPU | Load | Store | BR – Unconditional | BR – Conditional |
|---|---|---|---|---|---|---|
| Example | Add R1,R2,R3 | | LD R4,[R1+16] | SR [R1+8], R6 | JMP [PC+OF] | JNC [PC+OF] |
| Fetch (IF) | Fetch | Fetch | Fetch | Fetch | Fetch | Fetch |
| Decode (DE) | Decode | Decode | Decode | Decode | Decode | Decode |
| Operand Fetch (OF) | Read Register | Read Register | Read Register, Generate Addr, Read Memory | Read Register | Read Register | Read Register |
| Execute (EX) | Perform ALU | Perform ALU | | | | Evaluate Cond |
| Write back (OS) | Write Register | Write Register | Write Register | Generate Addr, Write Memory | Generate Addr, Write PC | Generate Addr, Write PC |

Step 3: Logical layout

  • Declaration of the pipeline to be implemented
  • Specifications of subtasks to be performed and sequence

2 types

  1. Sequential Operation
    • Single – multi functional pipeline
      • same (scalar) pipeline used for all types of instructions
      • different (super scalar) pipelines for FP and the rest of the instructions
    • Multiple pipelines – dedicated pipeline for each instruction type (super scalar)
  2. Recycled
    • ….

Logical layout of a single pipeline

  1. evaluate the sub tasks present
  2. where commonality is present, share the same pipeline stage
  3. merge stages wherever possible

Example: Using sub-tasks in section above (Step 2)

Based on the commonalities between the sub-tasks above, this becomes a 6 stage pipeline:

  • Fetch – Same for all
  • Decode – Same for all
  • Operand Fetch – Read Register
  • Execute – Add or Generate Mem Address (Load, Store, Branch)
  • Memory – Read (Load), Write (Store), Update PC (Branch)
  • WriteBack – Write Register (Add, Load)

Logical layout of single pipeline

Step 4: Removing hazards

Types of hazards

  1. Resource (Structural) hazard – waiting for a resource to become available, e.g. 2 FP instructions and only 1 FPU
  2. Data hazard – an instruction is ready to execute before the data it requires is ready
  3. Control hazard – a branching decision is needed before the branch condition is evaluated

Note: Any hazard can be solved by waiting!

Resolution: Structural Hazard

  • For example: Single memory is used for instruction and data
  • Possible resolutions
    • Stall – introduce time delays or stall until stage becomes available
      • low cost, simple
      • impacts performance
      • use for rare cases
    • Resource replication – split instruction and data into separate memories (Harvard Architecture)
      • good performance
      • increases cost
      • useful for cheap resources
  • Some rules to resolve structural hazards
    • each instruction should use a resource once
    • use the resource in the same pipeline stage
    • use the resource for 1 cycle only
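As a rough illustration of the stall-vs-replicate trade-off, here is a toy model of the single-memory example: it assumes every load/store steals exactly one fetch cycle when instructions and data share one memory, and that split (Harvard) caches remove all such stalls. The instruction mix is made up.

```python
# Structural hazard on a shared instruction/data memory: each MEM-stage
# access blocks that cycle's instruction fetch, costing one stall.

def total_cycles(is_mem_op, split_memory):
    n = len(is_mem_op)
    base = n + 5 - 1                           # ideal 5-stage fill + drain
    if split_memory:
        return base                            # Harvard: no contention
    stalls = sum(1 for op in is_mem_op if op)  # one stall per memory access
    return base + stalls

program = [False, True, False, True, True]     # 3 loads/stores out of 5
print(total_cycles(program, split_memory=True))   # 9
print(total_cycles(program, split_memory=False))  # 12
```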

Resolution: Data dependency and hazards

  • Possible resolutions
    • Stall the dependent instruction – this is pipeline interlocking, with hardware detecting the hazard
    • Bypassing/forwarding paths – the producer passes its output to the consumer without waiting for it to be written to the register file
      • Example:
        • ADD R1, R2, R3
        • MUL R4, R1, R3
        • During execution the ALU sends the output of ADD straight back to the waiting MUL so that it can perform its operation
      • Example
        • AND R1, R3, R4
        • NEG R5
        • ADD R6, R1, R1
        • During execution a second forwarding path from MEM to the ALU is used so that the 3rd instruction does not stall
    • Out of order execution (used in super scalar pipeline) – executes instructions when operands are ready rather than in instruction order
    • Branch prediction – speculate branch outcome

RAW (Read after write hazard)

Circuitry with bypassing
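The forwarding decision starts with detecting the RAW dependency; a minimal sketch, using made-up (dest, src1, src2) tuples for the ADD/MUL example above:

```python
# RAW hazard: the consumer reads a register that the producer writes.
# When detected, the result can be forwarded instead of stalling.

def raw_hazard(producer, consumer):
    dest = producer[0]           # register written by the earlier instruction
    return dest in consumer[1:]  # registers read by the later instruction

add_instr = ("R1", "R2", "R3")   # ADD R1, R2, R3
mul_instr = ("R4", "R1", "R3")   # MUL R4, R1, R3 reads R1

print(raw_hazard(add_instr, mul_instr))  # True -> forward ADD's result
```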

Step 5: Performance of scalar pipeline

Ways to improve performance

  • Bypassing and forwarding
  • Instruction re-scheduling or reordering
  • Branch optimizations – prediction
  • Deep pipelining – increases clock frequency but may increase penalties, so it needs to be balanced

Let's say we have

  • instruction cache
  • data cache
  • a program has following split of instructions
    • ALU – 40%
    • Loads – 25%
    • Store – 15%
    • Branches – 20%
      • 33% – unconditional – included in branch penalty
      • 33% – conditional taken – included in branch penalty
      • 33% – conditional not taken – not included in branch penalty
| | ALU | Load | Store | BR – Uncond | BR – Cond Taken | BR – Cond Not Taken |
|---|---|---|---|---|---|---|
| Instruction mix | 40% | 25% | 15% | 33% of 20% | 33% of 20% | 33% of 20% |
| No bypass – RAW distance (penalty cycles) | 3 | 3 | – | 4 | 4 | 0 |
| No bypass penalty | .4 x 3 = 1.2 | .25 x 3 = .75 | 0 | .2 x .66 x 4 = .528 (uncond + taken) | | 0 |
| Bypass – RAW distance (penalty cycles) | 0 | 1 | – | 4 | 4 | 0 |
| Bypass penalty | 0 | .25 x 1 = .25 | 0 | .528 (uncond + taken) | | 0 |
| Reordering penalty (75% of loads reordered) | 0 | .25 x .25 x 1 = .0625 | 0 | .528 (uncond + taken) | | 0 |

  • Total penalty, no bypass = 1.2 + .75 + .528 = 2.478
  • Total penalty, with bypass = .25 + .528 = 0.778
  • Total penalty, with bypass + reordering = .0625 + .528 = .5905

Cycles per instruction (CPI) = 1 (base cycle for every instruction) + total penalty
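The penalty arithmetic above can be reproduced with a small helper; the function and parameter names are my own:

```python
# Total stall penalty per instruction for the given mix. RAW distances
# are the stall cycles per hazard; 66% of the 20% branches (uncond +
# cond taken) pay the 4-cycle branch penalty.

def penalty(alu_dist, load_dist, load_not_reordered=1.0):
    alu = 0.40 * alu_dist                          # 40% ALU instructions
    load = 0.25 * load_not_reordered * load_dist   # 25% loads
    branch = 0.20 * 0.66 * 4                       # branch penalty share
    return alu + load + branch

print(round(penalty(3, 3), 4))  # no bypass: 2.478
print(round(penalty(0, 1), 4))  # with bypass: 0.778
print(round(penalty(0, 1, load_not_reordered=0.25), 4))  # + reorder: 0.5905
# CPI = 1 + total penalty, e.g. 1 + 0.778 = 1.778 with bypassing
```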