Processor and its micro architecture for high performance

What is a processor?

A processor is a silicon chip whose function is to efficiently perform a set of arithmetic and logical operations. It is the brain of any computer system. Some of its key characteristics:

  • works with hardware components like memory to efficiently complete given operations
  • they are classified by
    • Speed of execution, measured in “Hertz”, representing cycles per second. General purpose modern day processors have speeds in Gigahertz i.e. billions of cycles per second
      • 1 cycle is 1 execution cycle of a processor in which it executes a set of instructions
      • An internal clock sends electric signals to the processor, each signal being equal to 1 cycle. The time interval between clock cycles is generally in nanoseconds (e.g. 0.25 nanoseconds for a 4 GHz clock)
    • Size of instructions, measured in “Bits”, representing the width of data it can handle in one cycle. General purpose modern day processors can handle 64 bits per cycle
      • processors have an Arithmetic Logic Unit (ALU) for execution, and the size of instruction the ALU can process defines the processor’s capacity

How to measure processor performance for a program?

Time per program = Instruction Count (number of instructions per program) x Cycles per Instruction (average) x Cycle Time (time per machine cycle). The lower the time per program, the higher the performance.

Note: Cycle Time = time per machine cycle = number of clock cycles per machine cycle x time per clock cycle

All 3 parameters are interdependent, implying that adjusting one will impact the others, potentially leading to no real gain in performance. Hence, to improve performance a delicate balance has to be achieved between all 3 parameters.
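The relation above can be sketched as a quick calculation; the numbers used here (1 billion instructions, an average CPI of 1.5, a 4 GHz clock) are purely illustrative, not taken from the text:

```python
def execution_time(instruction_count, cpi, cycle_time_ns):
    """Time per program = Instruction Count x average CPI x Cycle Time."""
    return instruction_count * cpi * cycle_time_ns

# Illustrative numbers: 1 billion instructions, average 1.5 cycles each,
# and a 4 GHz clock (0.25 ns per cycle).
t_ns = execution_time(1_000_000_000, 1.5, 0.25)
print(t_ns / 1e9, "seconds")  # 0.375 seconds
```

Halving any one factor halves the total time, which is why the three parameters are the natural levers for performance work.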

Types of processors

  • General purpose processor (GPP) – general purpose CPU used to execute the main operations
  • Graphic Processing Unit (GPU) – optimized for rendering images and large blocks of visual data
  • Digital Signal Processor (DSP) – optimized for processing audio and video signal data
  • Field Programmable Gate Array (FPGA) – re-programmable at the hardware level after manufacturing; commonly used for prototyping
  • Application Specific Integrated Circuit (ASIC) – designed for specific application use

Note: The above list is in descending order of flexibility and ascending order of performance

How does a processor execute instructions?

Processor Block Diagram

The processor executes instructions in a Fetch-Decode-Execute cycle as follows

  1. Fetch – the program address generator (BIU) fetches the instruction to execute
    • 1.1 – stores the instruction in the instruction register
    • 1.2 – increments the program counter (PC) to the next instruction
  2. Decode – the Control Unit (EU) decodes the instruction to identify the Operator and Operand
  3. Decode – the Control Unit (EU) fetches the Operand and sends it to the ALU (EU)
  4. Execute – the ALU (EU) executes the instruction
  5. Execute – the ALU (EU) stores the result in a register
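The steps above can be sketched as a loop in Python. This is a minimal sketch using a hypothetical toy instruction set; the opcode names and the accumulator model are assumptions for illustration, not a real ISA:

```python
# Fetch-decode-execute sketch for a hypothetical accumulator machine.
program = [("LOAD", 5), ("ADD", 3), ("MUL", 2)]  # instruction memory
pc = 0    # program counter
acc = 0   # accumulator register (stands in for the result register)

while pc < len(program):
    instruction = program[pc]     # 1. Fetch the instruction at the PC
    pc += 1                       # 1.2 increment the PC
    op, operand = instruction     # 2. Decode into operator and operand
    if op == "LOAD":              # 3-5. the "ALU" executes and stores result
        acc = operand
    elif op == "ADD":
        acc += operand
    elif op == "MUL":
        acc *= operand

print(acc)  # (5 + 3) * 2 = 16
```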

Processor micro architecture considerations

Any digital system design has 2 important elements – 1) specification – what does it do? 2) implementation – how does it do it?

Processor micro architecture defines its implementation design. A processor micro architecture needs to take into consideration performance, power, cost, and reliability. Here the focus is on micro architecture considerations for achieving high performance.

Instruction Set Architecture

Instruction Set Architecture (ISA) is the contract (specification) between hardware and software or machine and program. A processor’s micro architecture (implementation) defines how the instructions (part of ISA) will be executed at speed and is impacted by the complexity of the ISA.

An instruction consists of an operation (a.k.a. operator) and operand(s). An ISA defines a set of instructions, also called assembly instructions, that together uniquely define an assembly language.

Given the above,

  • ISA enables independent development of hardware (processor and other resources) and software. Software built against a specific ISA is guaranteed to execute on any processor and hardware developed for that ISA, whatever its micro architecture.
  • ISAs evolve very slowly since there is significant cost in recompiling software for a new ISA. Due to this, there is more focus on innovating the micro architecture of processors

Most modern day processors are built on one of the following ISAs. The major differentiation between these ISAs is their complexity.

The complexity of an ISA is defined by its Dynamic Static Interface (DSI). Dynamic refers to tasks and optimizations done at runtime by the hardware (processor and other resources); static refers to tasks and optimizations done by the software at compile time. In other words, ISA complexity is higher when its instructions are designed to do more in a single instruction, thereby leaving it to the hardware to execute the instruction in the most efficient manner.

| Dimension | RISC | CISC |
| --- | --- | --- |
| Number of instruction types | Lower | Higher |
| Complexity of instructions | Lower | Higher |
| Orthogonality | Instructions are generic and can operate on any register and use any addressing mode (location of operands) | Instruction-specific use of registers and addressing modes |
| Instruction length | Fixed length | Variable length |
| Instruction execution | Operations on registers only; the only memory operations are load and store | Operations can be done on registers or memory depending on the instruction |
| Software code | More code due to simpler instructions | Less code (like macros in C) due to complex instructions |
| Size of chip (silicon usage) | Simpler micro architecture hence smaller chip | Complex micro architecture hence larger chip |
| Example | ARM family (Apple devices) | Intel x86 family (most Windows based devices) |
| Pros | 1 instruction ≈ 1 clock cycle; faster decoding of instructions; allows pipelining for better performance | Small software code size; hardware does the performance optimization |
| Cons | Large software code size | 1 instruction takes multiple clock cycles; harder to pipeline |

Architecture of data access from memory

There are 3 types of architecture based on the number of memory and data buses

| | Von Neumann | Harvard | Super Harvard |
| --- | --- | --- | --- |
| Description | Both program and data are stored in the same memory; uses time division multiplexing (time slots for different data transmission signals) to access program instructions or data | Program and data are stored in separate memories; program and data can be fetched simultaneously via their respective address and data buses | Harvard + a cache to store frequently used program instructions or data |
| Pros | Low cost | High performance due to the possibility to pipeline | Further improved performance since the cache enables faster access to instructions and data |
| Cons | Low performance; chances of memory corruption since both program and data are stored in the same memory | High cost | Higher cost |

How does a processor store data in memory?

There are 2 ways in which a processor can store data in memory

Suppose a word (32 bits = 4 bytes) has to be stored in memory, then

  • Little endian
    • stores the lowest (least significant) byte 0 at the lowest memory address
    • and the highest (most significant) byte 3 at the highest memory address
  • Big endian
    • stores the lowest (least significant) byte 0 at the highest memory address
    • and the highest (most significant) byte 3 at the lowest memory address
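Python’s struct module makes the two byte orders visible; a small sketch, where the value 0x0A0B0C0D is just an arbitrary example word:

```python
import struct

value = 0x0A0B0C0D  # 32-bit word; byte 0 (0x0D) is the least significant

little = struct.pack("<I", value)  # "<" = little-endian packing
big = struct.pack(">I", value)     # ">" = big-endian packing

print(little.hex())  # 0d0c0b0a -> least significant byte at lowest address
print(big.hex())     # 0a0b0c0d -> most significant byte at lowest address
```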

How can processor speed be increased?

As seen in the section above, processor performance is measured as

Time per program = Instruction Count (number of instructions per program) x Cycles per Instruction (average) x Cycle Time (time per machine cycle). The lower the time per program, the higher the performance.

Reducing any or all of the 3 parameters will improve performance. There are techniques to reduce instruction count, like removing duplicate instructions, or choosing RISC (higher but simpler instruction count) vs CISC (lower but complex instruction count) architecture. Similarly, cycle time can be reduced by reducing signal latency. Such optimizations should always be done if their cost is not very high.

Let us focus on methods to reduce Cycles per Instruction (CPI)

  • Parallelism – instructions are fetched and executed in batches in parallel, allowing many instructions per cycle and thereby reducing the average number of clock cycles per instruction
  • Out of order execution – the processor does not strictly follow the program instruction sequence. If an instruction needs to wait for some data, the processor jumps ahead and executes another independent instruction
  • Branch prediction – when the processor encounters a branch (an if-then-else like decision), instead of waiting for the actual outcome it predicts the outcome and moves ahead with the execution. The Branch Prediction Unit (BPU) is responsible for this

However, there are constraints on improving performance

  • Power – doing too many operations in parallel heats up the processor, and with limits to cooling capacity the processor can be damaged or even melt
  • Memory – processor speed has increased at a much higher rate than memory access speed, so memory access becomes the bottleneck
  • Instruction level parallelism – instructions often have dependencies, which limits how much parallel execution can be done

Parallelism

There are 2 types of parallelism

  • Functional level
    • parallelism achieved by executing an instruction or set of instructions in parallel
    • done via Instruction Level Parallelism or Thread Level Parallelism
    • it is irregular since it depends on the extent to which instructions can be parallelized
  • Data level
    • parallelism achieved by executing the same instruction on different sets of data
    • example: brightening an image, where a single instruction operates on multiple pixels (data)
    • it is regular since it depends on the number of parallel processing units available to execute the same instruction
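The brightening example can be sketched as data-level parallelism in miniature; plain Python stands in for the parallel lanes real SIMD hardware would use, and the pixel values are made up for illustration:

```python
# The same "brighten" operation applied across many pixels (data-level
# parallelism).  On SIMD hardware each pixel would go to a parallel lane;
# the comprehension below models only the per-element operation.
pixels = [10, 20, 30, 250]

def brighten(value, amount=40):
    return min(value + amount, 255)  # clamp to the 8-bit maximum

print([brighten(p) for p in pixels])  # [50, 60, 70, 255]
```

Because every element undergoes the identical operation, the work divides evenly across however many processing units are available, which is why data-level parallelism is called regular.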

Data Level Parallelism

Flynn’s taxonomy classifies a computer architecture by its capability to process instruction and data streams (SISD, SIMD, MISD and MIMD)

We will look at Instruction Level Parallelism in detail in the next section

Amdahl’s law – Measuring performance in case of parallelism

We can measure the relative speedup of a processor (S) when it is parallelized using the following, referred to as Amdahl’s law:

S = 1 / T, where the total workload is normalized to 1 and T is the total time spent

T = (1 – f) + (f / N) where

  • f = fraction of the program that can be executed in parallel
  • N = number of parallel processing units

This states that once parallel processing units are increased to the extent that f/N approaches 0, performance is limited by the fraction of the program that is sequential. This is referred to as the Sequential Bottleneck.
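Amdahl’s law as defined above translates directly to code; a small sketch, with the values of f and N chosen purely for illustration:

```python
def amdahl_speedup(f, n):
    """Relative speedup S = 1 / ((1 - f) + f / n) for a program where
    fraction f runs in parallel on n processing units."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.9, adding units saturates near 1 / (1 - f) = 10:
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91
```

The sequential 10% caps the speedup at 10 no matter how many units are added, which is the Sequential Bottleneck in numbers.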

Instruction level parallelism

Types of ILP processors

All are pipelined processors with the following characteristics

  • Scalar
    • can issue 1 instruction per machine cycle
    • has a single execution unit
  • Super Scalar
    • can issue multiple instructions per machine cycle
    • has multiple execution units
    • has a dispatch unit (hardware) responsible for managing the instruction execution in different execution units
  • VLIW
    • Same as Super Scalar except that
      • the software compiler is responsible for breaking down the instructions so that they can be executed in different execution units

Pipelining

Pipelining is a technique to improve processor performance where instruction execution (fetch-decode-execute) is divided into multiple stages, so that multiple instructions can be in different stages of execution in parallel. In other words it is vertical scaling of the processor.

Deep pipelines (Super pipelined), i.e. with many stages, can help reduce cycle time, improving processor performance: the overall clock frequency increases since each stage executes very quickly. However, deep pipelines start having a negative effect on performance in the following ways

  1. when the number of stages between fetch and execute becomes large, there can be a higher number of penalty cycles when instructions don’t need to go through each stage
  2. for instructions requiring memory operations, latency is high, so the increased clock frequency has a negative impact on performance because the pipeline stalls waiting for the memory operation
  3. for primitive (basic) instructions, ALU latency is higher since the instructions unnecessarily go through multiple stages, and hence need multiple clock cycles to complete

Phase of a pipeline

  • Filling – when instructions one by one are filled in each stage of the pipeline
  • Full – when all pipeline stages are full with instructions
  • Draining – as each instruction execution completes the pipeline stages start becoming empty
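The fill and drain phases explain the ideal cycle count of a pipeline; a minimal sketch, assuming no stalls and one instruction completing per cycle once the pipeline is full:

```python
def pipeline_cycles(n_instructions, n_stages):
    """Ideal pipelined cycle count: n_stages cycles to fill the pipeline,
    then one more instruction completes every cycle (no stalls assumed)."""
    return n_stages + (n_instructions - 1)

# 10 instructions on a 5-stage pipeline: 14 cycles, vs 50 unpipelined.
print(pipeline_cycles(10, 5))  # 14
```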

Amdahl’s law – Performance of pipelining

Amdahl’s law, used above for measuring speedup with multiple processing units, can also be applied to pipelining. We can measure the relative speedup of a processor (S) when it is pipelined using the following:

S = 1 / T, where the total workload is normalized to 1 and T is the total time spent

T = (1 – g) + (g / N) where

  • g = fraction of program that can be executed in parallel in different stages of the pipeline
  • N = number of pipeline stages

However, this does not account for the stalls or penalty cycles during which only 1 instruction can be executed in the pipeline. Hence, in the real world there will be multiple fill-full-drain cycles impacting overall performance.

Based on this, a generalization of Amdahl’s law is as follows:

S = 1 / T, where the total workload is normalized to 1 and T is the total time spent

T = g1/1 + g2/2 + … + gN/N where

  • g1 = fraction of the program that can be executed 1 instruction at a time
  • g2 = fraction of the program that can be executed 2 instructions at a time
  • gN = fraction of the program that can be executed N instructions at a time
  • N = number of pipeline stages
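The generalized form can be sketched as follows; the fractions in the example are illustrative only:

```python
def generalized_speedup(fractions):
    """Generalized Amdahl's law: fractions[i] is the fraction of the
    program executed (i + 1) instructions at a time; they must sum to 1.
    T = g1/1 + g2/2 + ... + gN/N and S = 1 / T."""
    t = sum(g / (i + 1) for i, g in enumerate(fractions))
    return 1.0 / t

# Example: 20% serial, 30% at 2 instructions at a time, 50% at 4-wide.
print(round(generalized_speedup([0.2, 0.3, 0.0, 0.5]), 2))  # 2.11
```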

Dependencies and impact on parallelism

  • Data dependency
    • when an instruction needs as input the output of another instruction
    • Data dependency can be of 2 types
      • TRUE – instructions must be executed sequentially
        • load r1, a
        • add r2, r1, r1
      • FALSE – the dependency can be removed by “register renaming”
        • add r1, r2, r3
        • mul r2, r4, r5
  • Control dependency
    • when an instruction branches or jumps to another instruction, in which case a sequence must be followed
    • control dependencies are the hardest to solve, although techniques like branch prediction are used to avoid stalling parallel instruction execution
  • Resource dependency
    • when multiple instructions can, say, only be executed on a Floating Point Unit (FPU) while there is only 1 FPU
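The FALSE dependency above (mul overwrites r2 while add still reads it) disappears under register renaming. A minimal sketch of the idea; the tuple encoding and the physical register names (p0, p1, …) are assumptions for illustration:

```python
def rename(instructions):
    """Rename destination registers to fresh physical registers so that
    false (write-after-read / write-after-write) dependencies disappear.
    instructions: list of (op, dest, src1, src2) tuples."""
    mapping = {}                              # architectural -> physical
    fresh = (f"p{i}" for i in range(1000))    # supply of fresh registers
    renamed = []
    for op, dest, src1, src2 in instructions:
        s1 = mapping.get(src1, src1)          # read sources via current map
        s2 = mapping.get(src2, src2)
        d = next(fresh)                       # every write gets a new register
        mapping[dest] = d
        renamed.append((op, d, s1, s2))
    return renamed

# The FALSE-dependency pair: mul writes r2, which add still reads.
prog = [("add", "r1", "r2", "r3"), ("mul", "r2", "r4", "r5")]
print(rename(prog))  # [('add', 'p0', 'r2', 'r3'), ('mul', 'p1', 'r4', 'r5')]
```

After renaming, the two instructions share no registers and can execute in parallel.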

Performance Benchmarking

There have been multiple attempts at benchmarking the limits of ILP that can be achieved. The ones that have influenced further studies and benchmarking are

  • Flynn’s Bottleneck
    • states that, due to control dependencies, a maximum ILP of about 2 can be achieved
  • Fisher’s Optimism
    • made some ideal-world assumptions and used a super scalar processor to achieve a high degree of ILP
  • SPEC (Standard Performance Evaluation Corporation)
    • Non profit organization that performs and publishes benchmarking results of different computer systems like CPUs, Embedded Systems and so on

Parameters for classification of ILP processors

Jouppi defined parameters for classifying an ILP (pipelined) processor. The following are Jouppi’s classifications

  • Operational latency (OL)
    • machine cycles taken by an instruction to execute and have its result passed to the next instruction
  • Machine Parallelism (MP)
    • number of parallel instructions that can be executed in a pipelined processor
  • Issue latency (IL)
    • machine cycles taken between issuing 2 instructions
  • Issue Parallelism (IP)
    • number of instructions that can be issued in one machine cycle

Base pipeline

IF → DE → EX → WB
  • OL=1
  • MP=4
  • IL=1
  • IP=1

Super pipelined

IF → DE → EX → WB (each stage divided into 3 minor cycles)
  • OL = 1 (3 minor cycles)
  • MP = 12
  • IL = 1 minor cycle
  • IP = 3

Super scalar

IF → DE → EX → WB
IF → DE → EX → WB
IF → DE → EX → WB
  • OL=1
  • MP=12
  • IL=1
  • IP=3

Super scalar – super pipelined

IF → DE → EX → WB (each stage divided into 3 minor cycles)
IF → DE → EX → WB (each stage divided into 3 minor cycles)
IF → DE → EX → WB (each stage divided into 3 minor cycles)
  • OL=1(3 minor cycles)
  • MP=36
  • IL=1 minor cycle
  • IP=9

VLIW

IF → DE → EX → WB
         EX → WB
         EX → WB
  • OL=1
  • MP=3
  • IL=1
  • IP=1