Introduction to Pipelining
- Pipelining
- Breaking up instruction execution into multiple stages so that multiple instructions can be executed at the same time in different stages
- It is a way of scaling performance vertically within a single processor, and is applied even in scalar (single-issue) processors
- Deep pipelining could lead to adverse impact due to
- simple instructions forced to go through each stage
- branch instructions causing stall in the pipeline
- memory operations slowing down the pipeline stage
- However, a balanced pipeline can significantly increase throughput while keeping each instruction's execution/operational latency the same
- For example:
- non pipelined – 2 instructions x 5 machine cycles each = 10 machine cycles for 2 instructions
- 5 stage pipelined – 6 cycles for 2 instructions
- How an instruction (ADD R1, R2, R3) executes in a 5 stage pipeline – Fetch, Decode, Execute, Memory, Write Back
- Fetch
- Gets the PC, i.e. the address of the instruction to fetch
- Gets the Instruction Bits from the Instruction Cache using that address
- Writes the Instruction Bits to the IF/DE register
- Writes PC + 4 to the IF/DE register
- Decode
- Gets Instruction Bits from IF/DE register
- Decodes the Op Code and generates the CONTROL signal
- Gets the operands from the Register File
- Writes CONTROL signal, operands and PC + 4 to DE/EX register
- Execute
- ALU gets CONTROL signal and operands
- Calculates ALU result
- Saves ALU result, CONTROL signal, PC+4 to EX/MEM register
- Memory
- Passes ALU result, CONTROL signal to MEM/WB register
- WriteBack
- Reads CONTROL signal
- Writes ALU result to destination register
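The walk-through above can be written out as a toy trace, with each inter-stage register modelled as a dict. This is a minimal sketch; the register values and dict field names are made up for illustration.

```python
# Toy trace of ADD R1, R2, R3 moving through the 5-stage pipeline.
# Each inter-stage register (IF/DE, DE/EX, EX/MEM, MEM/WB) is a dict.

regfile = {2: 7, 3: 5}          # assumed contents: R2 = 7, R3 = 5
pc = 0x1000

# Fetch: read the instruction bits, latch them plus PC + 4 into IF/DE
if_de = {"bits": "ADD R1,R2,R3", "pc4": pc + 4}

# Decode: generate control, read operands, latch into DE/EX
de_ex = {"ctrl": "ALU_ADD", "a": regfile[2], "b": regfile[3],
         "dest": 1, "pc4": if_de["pc4"]}

# Execute: the ALU computes the result, latched into EX/MEM
ex_mem = {"ctrl": de_ex["ctrl"], "alu": de_ex["a"] + de_ex["b"],
          "dest": de_ex["dest"]}

# Memory: ADD touches no memory, so the values just pass to MEM/WB
mem_wb = dict(ex_mem)

# Write-back: control says write the ALU result to the destination register
regfile[mem_wb["dest"]] = mem_wb["alu"]
print(regfile[1])   # 12
```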
- Pipeline performance
- non pipelined = n x (IF + DE + EX + MEM + WB)
- pipelined = max (IF, DE, EX, MEM, WB) x (n + k − 1), where k is the number of stages: the first instruction takes k cycles to fill the pipeline, and each of the remaining n − 1 finishes one cycle later
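These two formulas can be checked against the 2-instruction example above. A small sketch, assuming every stage takes one cycle; the pipeline fill accounts for the extra k − 1 cycles:

```python
# Pipeline timing formulas (times in machine cycles).

def non_pipelined_cycles(n_instructions, stage_times):
    """Each instruction runs all stages to completion before the next starts."""
    return n_instructions * sum(stage_times)

def pipelined_cycles(n_instructions, stage_times):
    """The slowest stage sets the cycle time; the first instruction needs
    k cycles, each later one completes 1 cycle after its predecessor."""
    k = len(stage_times)
    return max(stage_times) * (n_instructions + k - 1)

stages = [1, 1, 1, 1, 1]                # IF, DE, EX, MEM, WB
print(non_pipelined_cycles(2, stages))  # 10 cycles for 2 instructions
print(pipelined_cycles(2, stages))      # 6 cycles for 2 instructions
```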
- Pipeline idealism
- uniform sub operations
- repetition of identical operations
- repetition of independent operations
- Pipeline realism
- Uniform sub operations never happen in practice, so
- do stage quantization to get (near) uniform stages
- minimize internal fragmentation (so that faster stages spend as little time as possible waiting for slower ones)
- Repetition of identical operations never happens, so
- unify instruction types, i.e. coalesce instruction types so they share stages
- Repetition of independent operations never happens, so
- resolve data and resource hazards via dependency (data, control, resource) detection and resolution
- Hardware requirements for pipelining
- Pipeline registers
- Separate memory for data and instructions
- Multiple register read/write ports
- Forwarding hardware
- Hazard detection unit
- Example processor pipeline
- MIPS – 5 stage pipeline with simple addressing
- Amdahl 470 – 12 stage pipeline with complex addressing
RISC Scalar pipeline
Designing a pipeline
- Select an Instruction Set Architecture (ISA)
- For every instruction in the ISA, identify the subtasks to be performed
- Logical layout of the pipeline
- Build forwarding or bypassing to remove hazards
- Do timing of the pipeline operations
- Dependency resolution
- # of pipeline stages
Performance of the pipeline
There are 2 factors to consider:
- Cost of pipeline
- between stages of a pipeline a latch (buffer) is placed
- there is 1 latch per pipeline stage
- Hence, Cost = G (gate cost of the non-pipelined system) + [K (number of stages) x L (gate cost of one latch)]
- Computation rate
- non-pipelined: Computation rate = 1 / T (latency of the non-pipelined system)
- In case of pipelined systems
- computation rate = 1 / [(T / K) + S], where S is the delay added by each latch
- Based on the above
- K opt (obtained by differentiating the cost/performance ratio) = SQRT (GT / LS)
- where
- G = number of gates in the non-pipelined system
- T = latency of the non-pipelined system
- L = gate cost of one latch
- S = delay of one latch
- Example
- Given T = 100ns, G = 1000 gates, L = 50 gates per latch, S = 5ns
- K opt = SQRT (1000 x 100 / (50 x 5)) = SQRT (400) = 20
- Pipeline stages = 20
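The worked example can be reproduced directly from the formula. A minimal sketch; the parameter names mirror the symbols above:

```python
import math

# Optimal pipeline depth from the cost/performance model:
# K_opt = sqrt(G*T / (L*S))

def k_opt(G, T, L, S):
    """G: gate cost, T: unpipelined latency (ns),
    L: gate cost of one latch, S: latch delay (ns)."""
    return math.sqrt(G * T / (L * S))

print(k_opt(G=1000, T=100, L=50, S=5))   # 20.0
```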
Example of designing pipeline with basic ISA
Step 1: ISA
- ALU
- FX – fixed point instructions
- FP – floating point instructions
- Load and store
- Branching
Step 2: Sub tasks of each instruction of ISA
| Operation | | ALU-FX | FPU | Load | Store | BR – Unconditional | BR – Conditional |
|---|---|---|---|---|---|---|---|
| Example | | ADD R1,R2,R3 | | LD R4,[R1+16] | ST [R1+8],R6 | JMP [PC+OF] | JNC [PC+OF] |
| Fetch | IF | Fetch | Fetch | Fetch | Fetch | Fetch | Fetch |
| Decode | DE | Decode | Decode | Decode | Decode | Decode | Decode |
| Operand Fetch | OF | Read Register | Read Register | Read Register, Generate Addr, Read Memory | Read Register | Read Register | Read Register |
| Execute | EX | Perform ALU | Perform ALU | | | | Evaluate Cond |
| Write back | OS | Write Register | Write Register | Write Register | Generate Addr, Write Memory | Generate Addr, Write PC | Generate Addr, Write PC |
Step 3: Logical layout
- Declaration of the pipeline to be implemented
- Specifications of subtasks to be performed and sequence
2 types
- Sequential Operation
- Single – multi functional pipeline
- same (scalar) pipeline used for any type of instructions
- different (super scalar) pipeline for FP and rest of instructions
- Multiple pipelines – a dedicated pipeline for each instruction type (super scalar)
- Single – multi functional pipeline
- Recycled
- ….
Logical layout of a single pipeline
- evaluate the sub tasks present
- where commonality is present, share the same pipeline stage
- merge stages as much as possible
Example: Using sub-tasks in section above (Step 2)
Below are the commonalities

Based on the above it becomes a 6 stage pipeline like below
- Fetch – Same for all
- Decode – Same for all
- Operand Fetch – Read Register
- Execute – Add or Generate Mem Address (Load, Store, Branch)
- Memory – Read (Load), Write (Store), Update PC (Branch)
- WriteBack – Write Register (Add, Load)
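The merged 6-stage layout above can be written down as a table in code. A sketch; the stage-action strings are informal labels for illustration, not part of any real ISA:

```python
# Which work each instruction class does in each stage of the merged
# 6-stage pipeline (None = the stage just passes the values through).

STAGES = ["Fetch", "Decode", "OperandFetch", "Execute", "Memory", "WriteBack"]

PIPELINE = {
    "ADD":    ["fetch", "decode", "read regs", "ALU add",      None,        "write reg"],
    "LOAD":   ["fetch", "decode", "read regs", "gen mem addr", "read mem",  "write reg"],
    "STORE":  ["fetch", "decode", "read regs", "gen mem addr", "write mem", None],
    "BRANCH": ["fetch", "decode", "read regs", "gen target",   "update PC", None],
}

for op, actions in PIPELINE.items():
    used = sum(a is not None for a in actions)
    print(f"{op:6s} does work in {used} of {len(STAGES)} stages")
```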

Logical layout of single pipeline

Step 4: Removing hazards
Types of hazards
- Resource (Structural) hazard – waiting for a resource to become available, say 2 FP instructions and only 1 FPU
- Data hazard – an instruction is ready to execute before the data it requires is ready
- Control hazard – a branching decision is needed before the branch condition is evaluated
Note: Any hazard can be solved by waiting!
Resolution: Structural Hazard
- For example: Single memory is used for instruction and data
- Possible resolutions
- Stall – introduce time delays or stall until stage becomes available
- low cost, simple
- impacts performance
- use for rare cases
- Resource replication – split instructions and data into separate memories (Harvard Architecture)
- good performance
- increases cost
- useful for cheap resources
- Some rules to resolve structural hazards
- each instruction should use a resource once
- use the resource in the same pipeline stage
- use the resource for 1 cycle only
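The three rules above can be checked mechanically with a small reservation-table sketch. The `USAGE` map and resource names here are illustrative assumptions, not a real design:

```python
# Structural-hazard check: record which pipeline stage each instruction
# class uses each resource in, then verify no resource is needed by two
# instructions in the same cycle when they issue some cycles apart.

# instruction class -> {stage_index: resource used in that stage}
USAGE = {
    "LOAD": {0: "imem", 4: "dmem"},   # Fetch uses I-mem, Memory stage uses D-mem
    "ADD":  {0: "imem"},
}

def conflicts(first, second, gap):
    """Resources both instructions would touch in the same cycle,
    if `second` issues `gap` cycles after `first`."""
    hits = []
    for s1, r1 in USAGE[first].items():
        for s2, r2 in USAGE[second].items():
            if r1 == r2 and s1 == s2 + gap:   # same resource, same absolute cycle
                hits.append(r1)
    return hits

# With split I-mem/D-mem (Harvard style), back-to-back issue is conflict-free.
print(conflicts("LOAD", "ADD", gap=1))   # []
```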
Resolution: Data dependency and hazards
- Possible resolutions
- Stall the dependent instruction – this is pipeline interlocking, with hardware detecting the hazard
- Bypassing/forwarding paths – the producer passes its output to the consumer without waiting to write to the register file
- Example:
- ADD R1, R2, R3
- MUL R4, R1, R3
- During execution the ALU sends the output of ADD straight back to the waiting MUL instruction so that it can perform its operation
- Example
- AND R1, R3, R4
- NEG R5
- ADD R6, R1, R1
- During execution a second forwarding path, from the MEM stage back to the ALU, is created so that the 3rd instruction does not stall
- Out of order execution (used in super scalar pipeline) – executes instructions when operands are ready rather than in instruction order
- Branch prediction – speculate branch outcome
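RAW-hazard detection over the ADD/MUL example above can be sketched as a scan for earlier writers of each source register. Encoding instructions as `(dest, src1, src2)` tuples is an assumption for illustration:

```python
# RAW (read-after-write) hazard detection over register numbers.

def raw_hazards(program):
    """Yield (producer_idx, consumer_idx, reg) for each read-after-write."""
    for j, (_, s1, s2) in enumerate(program):
        for i in range(j):
            dest = program[i][0]
            if dest in (s1, s2):
                yield (i, j, dest)

prog = [
    (1, 2, 3),   # ADD R1, R2, R3  -- writes R1
    (4, 1, 3),   # MUL R4, R1, R3  -- reads R1 one instruction later
]
for i, j, r in raw_hazards(prog):
    # distance 1 needs an EX->EX forwarding path; without bypassing the
    # consumer would stall until the producer's write-back.
    print(f"I{j} reads R{r} written by I{i}, distance {j - i}")
```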
RAW (Read after write hazard)




Circuitry with bypassing

Step 5: Performance of scalar pipeline
Ways to improve performance
- By passing and forwarding
- Instruction re-scheduling or reordering
- Branch optimizations – prediction
- Deep pipelining – raises clock frequency but may increase hazard penalties, so depth needs to be balanced
Let's say we have
- instruction cache
- data cache
- a program has following split of instructions
- ALU – 40%
- Loads – 25%
- Store – 15%
- Branches – 20%
- 33% – unconditional – included in branch penalty
- 33% – conditional taken – included in branch penalty
- 33% – conditional not taken – not included in branch penalty
| | ALU | Load | Store | BR – Uncond | BR – Cond Taken | BR – Cond Not Taken |
|---|---|---|---|---|---|---|
| Mix | 40% | 25% | 15% | 33% of 20% | 33% of 20% | 33% of 20% |
| No bypass – penalty cycles (RAW distance) | 3 | 3 | 0 | 4 | 4 | 0 |
| No bypass – penalty | .4 x 3 = 1.2 | .25 x 3 = .75 | 0 | .2 x .66 x 4 = .528 (uncond + taken together) | | 0 |
| Total penalty, no bypass = 2.478 | | | | | | |
| Bypass – penalty cycles (RAW distance) | 0 | 1 | 0 | 4 | 4 | 0 |
| Bypass – penalty | 0 | .25 x 1 = .25 | 0 | .2 x .66 x 4 = .528 | | 0 |
| Total penalty, with bypass = 0.778 | | | | | | |
| With instruction re-scheduling/reordering (75% of load stalls removed) | 0 | .25 x .25 x 1 = .0625 | 0 | .528 | | 0 |
| Total penalty, with reordering = 0.5905 | | | | | | |
Cycles per instruction = 1 (1 cycle for all instructions) + total penalty
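The penalty arithmetic above can be reproduced in a few lines. A sketch using the instruction mix from this section; `0.66` stands for the unconditional plus taken-conditional share of branches:

```python
# Pipeline penalty / CPI arithmetic for the given instruction mix.

MIX = {"alu": 0.40, "load": 0.25, "store": 0.15, "branch": 0.20}
branch_penalty = MIX["branch"] * 0.66 * 4              # = 0.528 cycles

no_bypass   = MIX["alu"] * 3 + MIX["load"] * 3 + branch_penalty
bypass      = MIX["load"] * 1 + branch_penalty
rescheduled = MIX["load"] * 0.25 * 1 + branch_penalty  # 75% of load stalls hidden

for label, p in [("no bypass", no_bypass),
                 ("bypass", bypass),
                 ("rescheduled", rescheduled)]:
    print(f"{label:12s} total penalty = {p:.4f}, CPI = {1 + p:.4f}")
```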