Computer Architecture
CPU Design, Memory Systems & Instruction Processing
1. CPU Components
Control Unit (CU)
The "brain" of the CPU that directs all operations.
Functions:
- Fetches instructions from memory
- Decodes instructions
- Controls data flow between components
- Generates timing signals
- Manages instruction sequencing
Types:
- Hardwired: Fixed logic gates, fast but inflexible
- Microprogrammed: Uses microcode, flexible but slower
Arithmetic Logic Unit (ALU)
Performs all arithmetic and logical operations.
Arithmetic Operations:
- Addition (+)
- Subtraction (-)
- Multiplication (×)
- Division (÷)
- Increment/Decrement
Logical Operations:
- AND, OR, NOT, XOR
- Shift (left, right)
- Rotate
- Comparison
- Complement
CPU Registers
Program Counter (PC)
Holds address of the next instruction to execute
Instruction Register (IR)
Holds the current instruction being executed
Memory Address Register (MAR)
Holds address for memory read/write operations
Memory Data Register (MDR/MBR)
Holds data being transferred to/from memory
Accumulator (ACC)
Stores intermediate arithmetic/logical results
Stack Pointer (SP)
Points to top of the stack in memory
Status/Flag Register (PSW)
Contains condition flags: Zero (Z), Carry (C), Sign (S), Overflow (V)
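As a rough sketch of how these flags are typically derived (not tied to any specific ISA), the C snippet below computes Z, C, S, and V after an 8-bit addition using the usual definitions: zero result, unsigned carry-out, sign bit, and signed overflow.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: derive Z, C, S, V after an 8-bit addition. */
typedef struct { int Z, C, S, V; } Flags;

static uint8_t add8(uint8_t a, uint8_t b, Flags *f) {
    uint16_t wide = (uint16_t)a + (uint16_t)b;   /* keep the carry-out */
    uint8_t  r    = (uint8_t)wide;

    f->Z = (r == 0);                    /* Zero: result is all zeros */
    f->C = (wide > 0xFF);               /* Carry: unsigned overflow  */
    f->S = (r >> 7) & 1;                /* Sign: MSB of the result   */
    /* Overflow: operands share a sign but the result's sign differs */
    f->V = (~(a ^ b) & (a ^ r) & 0x80) != 0;
    return r;
}

int main(void) {
    Flags f;
    uint8_t r = add8(0x7F, 0x01, &f);   /* 127 + 1 overflows signed 8-bit */
    printf("result=0x%02X Z=%d C=%d S=%d V=%d\n", r, f.Z, f.C, f.S, f.V);
    return 0;
}
```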
2. Instruction Cycle
Fetch-Decode-Execute Cycle
Fetch
Read instruction from memory address in PC
MAR ← PC; MDR ← Memory[MAR]; IR ← MDR; PC ← PC + 1
Decode
Interpret opcode and determine operation
CU decodes IR, identifies operands
Execute
Perform the operation using ALU
ALU performs operation, results stored
Store (Write-back)
Write results to register or memory
Results → Register/Memory
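The cycle can be illustrated with a tiny software simulator. The sketch below assumes a made-up one-address (accumulator) machine with 8-bit instructions, where the upper 4 bits are the opcode and the lower 4 bits are a memory address; the opcode values are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 1-address machine: upper 4 bits = opcode, lower 4 bits = address. */
enum { OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3, OP_HALT = 0xF };

int main(void) {
    uint8_t mem[16] = {
        (OP_LOAD  << 4) | 14,   /* ACC <- mem[14]       */
        (OP_ADD   << 4) | 15,   /* ACC <- ACC + mem[15] */
        (OP_STORE << 4) | 13,   /* mem[13] <- ACC       */
        (OP_HALT  << 4),
    };
    mem[14] = 5; mem[15] = 7;

    uint8_t pc = 0, ir = 0, mar = 0, mdr = 0, acc = 0;
    for (;;) {
        /* Fetch: MAR <- PC; MDR <- mem[MAR]; IR <- MDR; PC <- PC + 1 */
        mar = pc; mdr = mem[mar]; ir = mdr; pc++;
        /* Decode: split IR into opcode and operand address */
        uint8_t op = ir >> 4, addr = ir & 0x0F;
        /* Execute / write-back */
        if      (op == OP_LOAD)  acc = mem[addr];
        else if (op == OP_ADD)   acc = (uint8_t)(acc + mem[addr]);
        else if (op == OP_STORE) mem[addr] = acc;
        else if (op == OP_HALT)  break;
    }
    printf("mem[13] = %u\n", mem[13]);   /* prints 12 */
    return 0;
}
```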
Instruction Format
Zero-address (Stack)
Operations use stack; PUSH, POP, ADD
One-address (Accumulator)
One operand + accumulator; LOAD X, ADD X
Two-address
Two operands; ADD R1, R2 (R1 ← R1 + R2)
Three-address
Dest + two sources; ADD R1, R2, R3
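To make the formats concrete, here is a sketch that packs and unpacks a hypothetical fixed 32-bit three-address instruction; the field widths and opcode value are assumptions for illustration, not a real ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed 32-bit three-address format (not a real ISA):
   [31:24] opcode | [23:19] rd | [18:14] rs1 | [13:9] rs2 | [8:0] unused */
static uint32_t encode3(uint8_t opcode, uint8_t rd, uint8_t rs1, uint8_t rs2) {
    return ((uint32_t)opcode << 24) |
           ((uint32_t)(rd  & 0x1F) << 19) |
           ((uint32_t)(rs1 & 0x1F) << 14) |
           ((uint32_t)(rs2 & 0x1F) <<  9);
}

int main(void) {
    uint32_t word = encode3(0x20, 1, 2, 3);   /* e.g., ADD R1, R2, R3 */
    printf("encoded: 0x%08X\n", (unsigned)word);
    printf("opcode=%u rd=%u rs1=%u rs2=%u\n",
           (unsigned)(word >> 24) & 0xFF, (unsigned)(word >> 19) & 0x1F,
           (unsigned)(word >> 14) & 0x1F, (unsigned)(word >>  9) & 0x1F);
    return 0;
}
```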
Addressing Modes
- Immediate: Operand is part of the instruction
- Direct: Instruction holds the operand's memory address
- Indirect: Instruction holds the address of the operand's address
- Register: Operand is in a register
- Register Indirect: Register holds the operand's memory address
- Indexed/Base: Effective address = base register + offset (arrays)
- Relative: Effective address = PC + offset (branches)
3. Memory Hierarchy
Memory Pyramid
From fastest/smallest/most expensive to slowest/largest/cheapest: Registers → Cache (L1/L2/L3) → Main Memory (RAM) → SSD → HDD/Tape.
Memory Types
RAM (Volatile)
SRAM (Static)
- Uses flip-flops
- Faster, more expensive
- Used for cache
DRAM (Dynamic)
- Uses capacitors
- Needs refresh
- Used for main memory
ROM (Non-Volatile)
ROM: Read-only, factory programmed
PROM: One-time programmable
EPROM: UV erasable
EEPROM: Electrically erasable
Flash: Block erasable, SSDs/USB
Memory Performance
Key Metrics:
- Access Time: Time to read/write
- Cycle Time: Minimum time between accesses
- Bandwidth: Data transfer rate
- Latency: Delay before transfer begins
Typical Access Times:
- Registers: < 1 ns
- L1 Cache: 1-2 ns
- L2 Cache: 3-10 ns
- RAM: 50-100 ns
- SSD: 50-100 μs
- HDD: 5-10 ms
4. Cache Memory
Cache Concepts
Cache Hit
Data found in cache (fast access)
Cache Miss
Data not in cache (fetch from memory)
Hit Rate
h = Hits / Total Accesses
Miss Rate
m = 1 - h
Average Memory Access Time (AMAT)
AMAT = Hit_Time + (Miss_Rate × Miss_Penalty)
Example: 2ns + (0.05 × 100ns) = 2 + 5 = 7ns
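The AMAT formula and the worked example above translate directly into code; a minimal sketch:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (single-level cache). */
static double amat_ns(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
    return hit_time_ns + miss_rate * miss_penalty_ns;
}

int main(void) {
    /* The worked example above: 2 ns hit, 5% miss rate, 100 ns penalty. */
    printf("AMAT = %.1f ns\n", amat_ns(2.0, 0.05, 100.0));   /* 7.0 ns */
    return 0;
}
```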
Cache Mapping Techniques
Direct Mapping
Each memory block maps to exactly one cache line.
Cache Line = (Memory Block Number) MOD (Number of Cache Lines)
- Pros: Simple, fast
- Cons: High conflict misses
Fully Associative
Any block can go to any cache line.
- Pros: No conflict misses
- Cons: Complex hardware (parallel tag search); expensive for large caches
Set-Associative (n-way)
Cache divided into sets; block maps to a set, can go in any line within set.
Set = (Memory Block Number) MOD (Number of Sets), as shown in the address-breakdown sketch below
- 2-way: 2 lines per set
- 4-way: 4 lines per set
- Balances simplicity and hit rate
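A sketch of how an address is split for cache lookup, assuming 32-bit addresses, 64-byte blocks, and 256 sets (arbitrary illustrative parameters); direct mapping is just the 1-way case where sets equal lines.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative address breakdown for a set-associative cache. */
#define BLOCK_SIZE 64u     /* bytes per block   -> 6 offset bits */
#define NUM_SETS   256u    /* sets in the cache -> 8 index bits  */

int main(void) {
    uint32_t addr   = 0x1234ABCD;
    uint32_t offset = addr % BLOCK_SIZE;    /* byte within the block     */
    uint32_t block  = addr / BLOCK_SIZE;    /* memory block number       */
    uint32_t set    = block % NUM_SETS;     /* (block) MOD (sets)        */
    uint32_t tag    = block / NUM_SETS;     /* identifies which block it is */

    printf("addr=0x%08X  tag=0x%X  set=%u  offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
    return 0;
}
```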
Replacement Policies
LRU (Least Recently Used)
Replace block not used for longest time
FIFO (First In First Out)
Replace oldest block
Random
Replace randomly selected block
LFU (Least Frequently Used)
Replace least accessed block
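A minimal sketch of LRU bookkeeping for a single 4-way set, using per-line timestamps; real hardware typically uses cheaper approximations such as pseudo-LRU bits.

```c
#include <stdio.h>

/* Minimal LRU bookkeeping for one 4-way set (illustrative only). */
#define WAYS 4

static unsigned last_used[WAYS];   /* per-line "last touched" time */
static unsigned now = 0;

static void touch(int way) { last_used[way] = ++now; }

static int lru_victim(void) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim]) victim = w;
    return victim;                 /* line with the oldest timestamp */
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);  /* fill the set     */
    touch(0); touch(2);                      /* reuse lines 0, 2 */
    printf("evict way %d\n", lru_victim());  /* way 1 is LRU     */
    return 0;
}
```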
Write Policies
Write-Through
Write to both cache and memory simultaneously
- Pros: Simple, consistent
- Cons: Slower writes
Write-Back
Write to cache only; memory updated on eviction
- Pros: Faster writes
- Cons: More complex (dirty bit needed)
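A sketch of write-back behavior for a single cache line, showing how the dirty bit defers the memory update until eviction; the line structure and 16-word memory are illustrative, not a real cache controller.

```c
#include <stdbool.h>
#include <stdio.h>

/* Write-back: writes set a dirty bit; memory is updated only on eviction.
   Write-through would instead update memory on every write. */
typedef struct { unsigned data; bool valid, dirty; } Line;

static unsigned memory[16];

static void write_hit(Line *l, unsigned value) {
    l->data  = value;
    l->dirty = true;                       /* memory is now stale */
}

static void evict(Line *l, unsigned block_addr) {
    if (l->valid && l->dirty)
        memory[block_addr] = l->data;      /* write back on eviction only */
    l->valid = l->dirty = false;
}

int main(void) {
    Line l = { .data = 0, .valid = true, .dirty = false };
    write_hit(&l, 42);                     /* cached copy changes, memory does not */
    printf("before eviction: memory[3] = %u\n", memory[3]);   /* 0  */
    evict(&l, 3);
    printf("after  eviction: memory[3] = %u\n", memory[3]);   /* 42 */
    return 0;
}
```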
5. Pipelining
Pipeline Concept
Pipelining overlaps execution of multiple instructions, like an assembly line.
Classic 5-Stage Pipeline (MIPS):
Fetch (IF)
Decode (ID)
Execute (EX)
Memory (MEM)
Write-back (WB)
Pipeline Performance
Speedup: S = n / (1 + (n-1)/k) ≈ k (for large n)
Throughput: 1 instruction per cycle (ideal)
Latency: k cycles for one instruction
Where n = number of instructions, k = pipeline stages
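Plugging numbers into the speedup formula (rewritten as S = nk / (k + n - 1), the same expression as above) shows how it approaches k for large n; a small sketch:

```c
#include <stdio.h>

/* Ideal pipeline speedup: S = n*k / (k + n - 1), approaching k as n grows. */
static double speedup(double n, double k) {
    return (n * k) / (k + n - 1.0);
}

int main(void) {
    double k = 5.0;   /* classic 5-stage pipeline */
    printf("n=10      -> S = %.2f\n", speedup(10.0, k));     /* ~3.57 */
    printf("n=1000    -> S = %.2f\n", speedup(1000.0, k));   /* ~4.98 */
    printf("n=1000000 -> S = %.2f\n", speedup(1e6, k));      /* ~5.00 */
    return 0;
}
```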
Pipeline Hazards
Structural Hazards
Hardware resource conflict (e.g., one memory port)
Solution: Duplicate resources, separate I/D caches
Data Hazards
Instruction depends on result of previous instruction
RAW (Read After Write): Most common
WAR (Write After Read): Anti-dependency
WAW (Write After Write): Output dependency
Solutions: Forwarding/bypassing, stalling, reordering
Control Hazards
Branch instruction changes program flow
Solutions:
- Stall (wait for branch resolution)
- Branch prediction (static/dynamic)
- Delayed branch (execute delay slot)
- Speculative execution
Branch Prediction
Static Prediction
- Always predict not taken
- Always predict taken
- Backward taken, forward not taken (BTFNT)
Dynamic Prediction
- 1-bit predictor
- 2-bit saturating counter
- Branch History Table (BHT)
- Branch Target Buffer (BTB)
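A sketch of a single 2-bit saturating counter predictor; a real Branch History Table indexes many of these by branch address, and the outcome pattern below is made up to mimic a short loop.

```c
#include <stdio.h>

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken. */
static int counter = 1;   /* start weakly not-taken */

static int predict(void) { return counter >= 2; }   /* 1 = predict taken */

static void update(int taken) {
    if (taken  && counter < 3) counter++;   /* saturate at strongly taken     */
    if (!taken && counter > 0) counter--;   /* saturate at strongly not taken */
}

int main(void) {
    /* Loop-style pattern: taken three times, then not taken, repeated. */
    int outcomes[] = {1, 1, 1, 0, 1, 1, 1, 0};
    int correct = 0, n = (int)(sizeof outcomes / sizeof outcomes[0]);
    for (int i = 0; i < n; i++) {
        correct += (predict() == outcomes[i]);
        update(outcomes[i]);
    }
    printf("correct predictions: %d / %d\n", correct, n);
    return 0;
}
```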
6. RISC vs CISC
| Feature | RISC | CISC |
|---|---|---|
| Instruction Set | Small, simple | Large, complex |
| Instruction Length | Fixed (32-bit) | Variable (1-15 bytes) |
| Execution | 1 cycle per instruction | Multiple cycles |
| Addressing Modes | Few (Load/Store) | Many |
| Registers | Many (32+) | Few (8-16) |
| Control Unit | Hardwired | Microprogrammed |
| Memory Access | Load/Store only | Direct memory operations |
| Examples | ARM, MIPS, SPARC | x86, x64, VAX |
Modern Reality
Modern processors blur the lines between RISC and CISC. Intel x86 processors decode complex CISC instructions into simpler micro-operations (RISC-like) internally for pipelining efficiency.
7. Bus Architecture
Bus Types
Data Bus
Carries actual data
Width = word size (32-bit, 64-bit)
Bidirectional
Address Bus
Specifies memory/I/O location
Width determines addressable memory
Unidirectional (CPU → Memory)
Control Bus
Carries control signals
Read/Write, Clock, Interrupt
Bidirectional
Bus Standards
Internal Buses:
- FSB (Front Side Bus) - Legacy
- QPI (QuickPath) - Intel
- HyperTransport - AMD
I/O Buses:
- PCIe (PCI Express)
- USB (Universal Serial Bus)
- SATA (Serial ATA)
- NVMe (Non-Volatile Memory Express)
8. Key Takeaways for CpE Students
Essential Formulas & Concepts
Performance
- CPU Time = IC × CPI × T
- CPI = Cycles / Instruction
- MIPS = IC / (CPU Time × 10⁶)
- Speedup = Time_old / Time_new
Cache
- AMAT = Hit_Time + Miss_Rate × Miss_Penalty
- Hit Rate + Miss Rate = 1
- n-way set-associative: n lines per set
Memory
- Address bits = log₂(Memory Size)
- 32-bit address → 4 GB addressable
- 64-bit address → 16 EB addressable
Pipeline
- Ideal CPI = 1
- Speedup ≈ k (number of stages)
- Hazards reduce throughput
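A quick sketch that plugs made-up numbers (IC = 2×10⁹ instructions, CPI = 1.5, 2 GHz clock, 4 GB memory) into the performance and memory formulas above:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double ic  = 2e9;             /* instruction count (assumed)   */
    double cpi = 1.5;             /* cycles per instruction        */
    double t   = 1.0 / 2e9;       /* clock period for a 2 GHz CPU  */

    double cpu_time  = ic * cpi * t;                     /* CPU Time = IC x CPI x T */
    double mips      = ic / (cpu_time * 1e6);            /* MIPS                    */
    double addr_bits = log2(4.0 * 1024 * 1024 * 1024);   /* 4 GB -> 32 bits         */

    printf("CPU time  = %.2f s\n", cpu_time);   /* 1.50 s */
    printf("MIPS      = %.0f\n", mips);         /* ~1333  */
    printf("addr bits = %.0f\n", addr_bits);    /* 32     */
    return 0;
}
```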
Key Architecture Differences
Von Neumann: Single memory for instructions and data (bottleneck)
Harvard: Separate instruction and data memories (faster, used in embedded)
Modified Harvard: Separate caches, shared main memory (modern CPUs)