root.system / 0x05 / cpu

Bits in a loop.
A machine that runs.

You have bits, you have encodings, you have gates. Wire them into a circuit that fetches a bit pattern from memory, decides what it means, and acts on it. Then put that whole thing on a clock and let it loop. That's a CPU. This page builds one, conceptually, from the parts you already have.

Right now, inside your machine, something is running a loop.

It hasnt stopped since you pressed the power button.

It wont stop until you shut down.

Billions of times a second, the same three steps, over and over, no rest and no variation.

Fetch. Decode. Execute.

You think your CPU is doing a thousand things at once. Running your browser. Playing music. Checking for updates. It feels like juggling.

It isnt.

Your CPU has exactly one job. Grab the next instruction. Work out what it means. Do it. Then grab the next one. The illusion of doing everything comes from doing this one thing fast enough that you cant see the seams.

On page four you built logic gates out of transistors. AND, OR, NOT, XOR. Tiny circuits that compute.

A CPU is what happens when you wire those gates into a loop and put them on a clock.

Thats it. Thats the entire machine.

The bits from page two are the instructions. The gates from page four do the work. The clock makes it all move forward.

Everything your computer has ever done, every frame, every keystroke, every transaction it has ever checked, is this one loop, turning.

It has been turning the whole time youve been reading this.

Lets watch it turn.

Beginner// level 01

Fetch, decode, execute

Your CPU has one job. One. It reads an instruction, it executes it, it moves to the next one. That is the entire CPU.

But here is what nobody tells you about that one job. That instruction is not English. It is not Python. It is not even assembly. It is this: 10110000 01100001. Raw binary. Two bytes. Sixteen switches, on and off.

Your CPU reads those 16 bits, decodes them through logic gates, and in a single clock tick executes the operation they describe. Do you see what is happening? Everything connects. Transistors built the logic gates. Logic gates built the circuits. Circuits decode the binary. Binary carries the instruction. The instruction changes the state.

All of that, in one clock cycle, at four billion cycles per second. And it has been doing this since the moment you powered on. Without stopping. Without resting. Without ever asking what it all means.

A CPU has exactly one job: it sits in a loop. Each tick of its clock, it does three things:

  1. Fetch the next instruction (a bit pattern) from memory.
  2. Decode it: figure out what those bits mean.
  3. Execute the operation, possibly updating memory or a register.

That's the whole machine. Every program (your browser, this page, the OS scheduler) is a sequence of those instructions, fetched one by one, billions of times a second.

What's an instruction?

An instruction is just a byte (or a few bytes) where different bits mean different things. Same logic as the ASCII table on page 2, except instead of "this byte is the letter A," the convention is "this byte is the operation ADD, on registers 0 and 1."

A real x86 instruction can be 1 to 15 bytes; RISC-V is a clean 4 bytes; ARM Thumb is 2. They all share the same shape: opcode + operands, packed into bits.

Step through a program

// step through a program
program counter: 0x00
memory
0x0000100101LOAD r0, 5
0x0100101111LOAD r1, 7
0x0201100001ADD r0, r1
0x0310100000PRT r0
0x0400000000HALT
cpu
fetchPC = 0x00reading from memory…
decodeopcode: 001 = LOADdst: 00 = r0 · imm: 101 = 5
executer0 ← 5r0 = 00000000
registers
r000000000(0)
r100000000(0)
r200000000(0)
r300000000(0)
event log
press step to begin.

the exact program from the toy-CPU code below: LOAD r0 5, LOAD r1 7, ADD, print, halt. step through it one instruction at a time and watch fetch, decode, and execute light up in turn.

Decoding an instruction is just bit shifts and masks, the same bitwise operators from the binary page. (ins >> 5) & 0b111 extracts the opcode; (ins >> 3) & 0b11 extracts the register. The CPU reads instructions exactly the way you learned to read bits. ← see: binary

bitsfieldmeaning
7..5opcode011 = ADD, 001 = LOAD, …
4..3dest regwhich register receives the result
2..0src reg / immediatethe other operand
// remember from page 1?
Decoding an instruction is just bit shifts and masks: exactly the bitwise operators from the binary page. The CPU pulls fields out of a byte the same way you tested individual bits in (x >> 3) & 1.

A 6-instruction CPU you can read in 5 minutes

Here's a complete software simulation of a tiny CPU. It has 4 registers, 6 opcodes, and a single byte per instruction. The whole loop fits on one screen, and every concept is real. Every line maps to something a real chip does.

Rust• • •
// A toy CPU with 4 registers and 6 ops.
// Each instruction is a single byte:
//   bits 7..5 = opcode (3 bits)
//   bits 4..3 = destination register (2 bits)
//   bits 2..0 = source register OR small immediate

const HALT: u8 = 0b000;
const LOAD: u8 = 0b001; // dst = imm
const MOV : u8 = 0b010; // dst = src
const ADD : u8 = 0b011; // dst = dst + src
const SUB : u8 = 0b100; // dst = dst - src
const PRT : u8 = 0b101; // print dst

fn run(program: &[u8]) {
    let mut reg = [0u8; 4];
    let mut pc  = 0usize;

    loop {
        // FETCH
        let ins = program[pc];
        pc += 1;

        // DECODE: pure bit-shifts, the same trick from page 1.
        let op  = (ins >> 5) & 0b111;
        let dst = ((ins >> 3) & 0b11) as usize;
        let src = (ins & 0b111) as usize;

        // EXECUTE
        match op {
            x if x == HALT => break,
            x if x == LOAD => reg[dst] = src as u8,
            x if x == MOV  => reg[dst] = reg[src & 0b11],
            x if x == ADD  => reg[dst] = reg[dst].wrapping_add(reg[src & 0b11]),
            x if x == SUB  => reg[dst] = reg[dst].wrapping_sub(reg[src & 0b11]),
            x if x == PRT  => println!("r{} = {}", dst, reg[dst]),
            _ => panic!("bad opcode"),
        }
    }
}

fn main() {
    // r0 = 5; r1 = 7; r0 = r0 + r1; print r0; halt
    let program = [
        0b001_00_101, // LOAD r0, 5
        0b001_01_111, // LOAD r1, 7
        0b011_00_001, // ADD  r0, r1
        0b101_00_000, // PRT  r0   → 12
        0b000_00_000, // HALT
    ];
    run(&program);
}
C• • •
// Same toy CPU in C.
#include <stdio.h>
#include <stdint.h>

#define HALT 0
#define LOAD 1
#define MOV  2
#define ADD  3
#define SUB  4
#define PRT  5

void run(const uint8_t *program) {
    uint8_t reg[4] = {0};
    size_t pc = 0;

    for (;;) {
        // FETCH
        uint8_t ins = program[pc++];

        // DECODE
        uint8_t op  = (ins >> 5) & 0x7;
        uint8_t dst = (ins >> 3) & 0x3;
        uint8_t src = ins & 0x7;

        // EXECUTE
        switch (op) {
            case HALT: return;
            case LOAD: reg[dst] = src; break;
            case MOV : reg[dst] = reg[src & 0x3]; break;
            case ADD : reg[dst] += reg[src & 0x3]; break;
            case SUB : reg[dst] -= reg[src & 0x3]; break;
            case PRT : printf("r%u = %u\n", dst, reg[dst]); break;
        }
    }
}

int main(void) {
    uint8_t program[] = {
        0b00100101, // LOAD r0, 5
        0b00101111, // LOAD r1, 7
        0b01100001, // ADD  r0, r1
        0b10100000, // PRT  r0   → 12
        0b00000000, // HALT
    };
    run(program);
    return 0;
}
// takeaway
The fetch-decode-execute loop is the entire machine. Every other CPU feature (caches, pipelines, predictors) is just an optimization on top of those three steps.
Intermediate// level 02

Registers, ALU, control unit: the parts of a CPU

Open up the toy simulator above and ask: where does that loop live in silicon? Each piece maps to a physical block of gates from the previous page.

block 01
Registers
A bank of flip-flops, exactly the SR-latch + clock circuits from the logic-gates page. 32 flip-flops side by side = one 32-bit register.
block 02
ALU
The arithmetic-logic unit. The full adder you built from XOR + AND + OR is right here, repeated 32 times for a 32-bit add.
block 03
Control unit
Reads the opcode field and decides which wires to enable. Pure combinational logic: a tree of NAND gates routing bits to the right block.
block 04
Program counter
A register that holds the address of the next instruction. After each fetch, an adder bumps it by the instruction size.

The ALU is the full adder you built on the logic gates page, repeated 64 times side by side for a 64-bit processor. Every ADD instruction your program executes is those XOR and AND gates firing in sequence. ← see: logic gates

The clock: what makes it all step forward

A CPU is a sea of combinational gates plus banks of flip-flops. Combinational gates have no memory; their output is a pure function of input. So nothing changes on its own. What drives the loop forward is the clock: a square-wave signal toggling at a fixed rate (3 GHz = 3 billion times a second). On every rising edge, the flip-flops latch their inputs and the next state begins.

"Faster CPU" mostly means "shorter clock period," which only works if every signal can propagate through every gate before the next tick. Get that wrong and you read garbage out of the latches.

Memory: where the program actually lives

Registers are tiny and fast (a few hundred bytes, accessed in 1 cycle). Real programs need more, gigabytes more. So the CPU is wired to a separate chip called RAM, addressed by a number. Each instruction fetch is really:

PCaddress busRAMdata businstructionregisterdecode

Compared to a register access, RAM is slow. Hundreds of cycles. The whole subject of computer architecture is, mostly, about hiding that latency.

Every value the CPU reads from memory is binary bytes at an address. That address is a hex number, 0x7fff5fbff8a4, a pointer: a number that means somewhere. The CPU follows millions of these every second. ← see: pointers

// the loop, restated
PC tells RAM which bits to send. The control unit decodes them. The ALU crunches numbers. The registers remember results. The clock keeps everything in lockstep. That's it. That's a CPU.
Advanced// level 03

Pipelining, caches & branch prediction

Why a 3 GHz CPU runs more than 3 billion ops/s

Naively, fetch-decode-execute takes at least three clock cycles per instruction. So a 3 GHz CPU should top out at 1 billion instructions per second. In practice, modern CPUs sustain several instructions per cycle. How?

Pipelining: an assembly line for instructions

The trick: while instruction N is executing, instruction N+1 can be decoding, and N+2 can be fetching. Each pipeline stage runs in parallel, like a factory assembly line. A 5-stage pipeline finishes (ideally) one instruction per cycle.

Real CPUs go further. Modern x86 cores have 14 to 20 pipeline stages and issue 4+ instructions per cycle (superscalar, with multiple ALUs).

Caches: bridging the RAM gap

RAM is far away (electrically) and slow (hundreds of cycles). To hide that, the CPU keeps recently-used bytes in small, fast SRAM banks right next to the cores: caches. Most programs touch the same bytes repeatedly, so this works extraordinarily well.

levelsizelatencywhere
L132-64 KB~4 cyclesper-core, split into instruction and data
L2256 KB to 1 MB~12 cyclesper-core
L38-64 MB~40 cyclesshared between cores
RAMGBs~200+ cyclesoff-chip

This is why cache-friendly code matters. A linear walk over an array hits L1 on every access. A pointer-chase through a linked list, where every node sits in a different cache line, pays the full RAM tax. Same algorithmic complexity, 50× difference in wall time.

Branch prediction

Pipelining fights one big enemy: the branch. When the CPU hits an if, it doesn't yet know which way to go, but the pipeline needs to keep fetching something. So the CPU guesses, using a tiny on-chip lookup table that tracks the recent history of every branch. Right guess: full speed. Wrong guess: flush the pipeline and start over (10 to 20 cycle penalty).

Modern predictors hit 95%+. They're so good that data layout can change performance dramatically. Sort an array first, and the same hot loop runs several times faster, because the same branch now goes the same way for many iterations in a row.

Rust• • •
// The classic branch-prediction demo.
// Sorting the array makes the branch predictable, so the same
// loop runs ~3-6× faster on real hardware.
use std::time::Instant;

fn main() {
    let mut data: Vec<i32> = (0..32_768).map(|i| (i * 1103515245 + 12345) % 256).collect();

    // Run once UNSORTED, once SORTED.
    for label in ["unsorted", "sorted"] {
        if label == "sorted" { data.sort(); }

        let t = Instant::now();
        let mut sum: u64 = 0;
        for _ in 0..1_000 {
            for &x in &data {
                if x >= 128 {        // ← the unpredictable branch
                    sum += x as u64;
                }
            }
        }
        println!("{label:>9}: sum={sum}  in {:.2?}", t.elapsed());
    }
}
C• • •
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const int*)a - *(const int*)b;
}

int main(void) {
    enum { N = 32768 };
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = (int)((i * 1103515245u + 12345u) % 256u);

    for (int phase = 0; phase < 2; phase++) {
        if (phase == 1) qsort(data, N, sizeof *data, cmp);

        clock_t t = clock();
        unsigned long long sum = 0;
        for (int k = 0; k < 1000; k++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)        // ← the unpredictable branch
                    sum += (unsigned)data[i];

        double secs = (double)(clock() - t) / CLOCKS_PER_SEC;
        printf("%s: sum=%llu  in %.2fs\n",
               phase ? "  sorted" : "unsorted", sum, secs);
    }
    free(data);
    return 0;
}
// the famous side channel
Branch prediction speculates ahead, and the speculative path leaves traces in the cache even when it's discarded. That's Spectre: a 2018 vulnerability that turned a performance optimization into a way to read protected memory. Every CPU shipped before mid-2018 was affected. Mitigations cost real performance.

The CPU inside Bitcoin miners

Every Bitcoin miner on Earth is a CPU running one function, over and over, as fast as possible, forever. That function is SHA-256. And SHA-256 is pure CPU work. Here is what a miner is actually computing, one round of the 64 that make up a single hash:

Rust• • •
// One SHA-256 round: pure CPU operations
// on registers a through h.
fn sha256_round(
    a: u32, b: u32, c: u32, d: u32,
    e: u32, f: u32, g: u32, h: u32,
    k: u32, w: u32,
) -> (u32, u32, u32, u32, u32, u32, u32, u32) {
    // Every line below is an ALU operation.
    let s1 = e.rotate_right(6)
           ^ e.rotate_right(11)
           ^ e.rotate_right(25);

    let ch = (e & f) ^ (!e & g); // AND, XOR, NOT

    let temp1 = h
        .wrapping_add(s1)
        .wrapping_add(ch)
        .wrapping_add(k)
        .wrapping_add(w);

    let s0 = a.rotate_right(2)
           ^ a.rotate_right(13)
           ^ a.rotate_right(22);

    let maj = (a & b) ^ (a & c) ^ (b & c);

    let temp2 = s0.wrapping_add(maj);

    (
        temp1.wrapping_add(temp2), // new a
        a, b, c,
        d.wrapping_add(temp1),     // new e
        e, f, g,
    )
}
C• • •
#include <stdint.h>

static inline uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

void sha256_round(
    uint32_t *a, uint32_t *b,
    uint32_t *c, uint32_t *d,
    uint32_t *e, uint32_t *f,
    uint32_t *g, uint32_t *h,
    uint32_t k,  uint32_t w
) {
    uint32_t s1  = rotr(*e,6) ^ rotr(*e,11) ^ rotr(*e,25);
    uint32_t ch  = (*e & *f) ^ (~*e & *g);
    uint32_t t1  = *h + s1 + ch + k + w;
    uint32_t s0  = rotr(*a,2) ^ rotr(*a,13) ^ rotr(*a,22);
    uint32_t maj = (*a & *b) ^ (*a & *c) ^ (*b & *c);
    uint32_t t2  = s0 + maj;

    *h = *g; *g = *f; *f = *e;
    *e = *d + t1;
    *d = *c; *c = *b; *b = *a;
    *a = t1 + t2;
}

Every operation in those functions is a single CPU instruction:

  • rotate_right / rotr: a barrel shifter in the ALU.
  • &, ^, !: AND, XOR, NOT gates firing.
  • wrapping_add: the full adder circuit from the logic gates page.

SHA-256 runs 64 of these rounds per hash attempt. Bitcoin miners attempt trillions of hashes. Each attempt is 64 rounds; each round is dozens of ALU operations; each ALU operation is logic gates; each gate is transistors. A modern Bitcoin ASIC runs tens of trillions of SHA-256 hashes per second.

All of it is the fetch-decode-execute loop you learned on this page, running as fast as physics allows. This is what secures hundreds of billions of dollars. Not cryptographic magic. Not financial engineering. Just a CPU, doing one job, very very fast.

SHA-256 is a CPU workload: 64 rounds of rotations, AND, XOR, NOT, and additions per hash, run trillions of times per second by mining hardware to find a hash with enough leading zeros. ← see: hashing

The full stack, again

// from electron to executable, with the loop on top
  1. Electrons gated by transistors form logic gates.
  2. Gates form adders, latches, multiplexers.
  3. Those compose into registers, ALUs, control units.
  4. A clock drives them through fetch-decode-execute.
  5. Bits in memory encode instructions (this page) and data.
  6. Data bits encode numbers (page 1) and characters (page 2).
  7. Sequences of instructions become a program.
  8. Run a few billion of them per second and you get the experience of using a computer.

Where to dig in next

You've seen the whole stack now. Natural deep-dives:

  • RISC-V: a clean, open ISA you can read in an afternoon.
  • nand2tetris: build a CPU from NAND gates and run software on it.
  • Agner Fog's microarchitecture manuals, the canonical reference on how real x86 cores actually work.
  • Compiler Explorer (godbolt.org): see what bytes your high-level code actually becomes.

CPU across ScrapyBytes

The same ideas surface all over ScrapyBytes. Here is where this page connects to the rest of the curriculum, and how to follow each thread.

Logic Gates

The CPU is millions of logic gates wired into a machine. The ALU is the adder from the gates page. Every instruction is gates switching.

scrapybytes.vercel.app/logic-gates
Binary

Instructions, addresses, and data are all binary, and decoding an instruction is bit shifts and masks. The CPU runs on the binary page, clocked billions of times a second.

scrapybytes.vercel.app/binary
Memory

The CPU does nothing without memory. It fetches instructions and data by address, and caches hide the latency. Stack and heap from the memory page are where it reads and writes.

scrapybytes.vercel.app/memory
Operating System

The OS decides which program gets the CPU and when. Kernel mode, interrupts, and context switches all sit on top of the fetch-decode-execute loop on this page.

scrapybytes.vercel.app/operating-system
Pointers

The program counter is a pointer: a number naming the next instruction's address. The CPU spends its life following pointers through memory.

scrapybytes.vercel.app/pointers
Compile vs Runtime

The compiler turns your code into the exact instructions this page executes. Compile time produces them; the CPU is runtime. The boundary on that page is this CPU's front door.

scrapybytes.vercel.app/compile-vs-runtime
Variables

A variable in a register is the fastest variable there is. The CPU page is where let x = 42 finally becomes a value the silicon holds.

scrapybytes.vercel.app/variables
Hashing

SHA-256 is so common that CPUs ship dedicated instructions for it (SHA-NI, ARMv8 Crypto). Each round is one trip through the fetch-decode-execute loop on this page.

scrapybytes.vercel.app/hashing
Number Systems

Instruction addresses are hex, register values binary, immediates decimal in source: the same number in three masks. The number systems page is the CPU's notation.

scrapybytes.vercel.app/number-systems
ASCII

The CPU processes a character as its ASCII integer, 72 for H, with no idea it is a letter. The ASCII page is the meaning the silicon never sees.

scrapybytes.vercel.app/ascii
Networking

Your network card runs its own fetch-decode-execute loop alongside the main CPU, two processors sharing one machine. The networking page has a CPU of its own.

scrapybytes.vercel.app/networking
Blockchain

Mining is the fetch-decode-execute loop pushed to its physical limit: an ASIC runs SHA-256 trillions of times a second. The blockchain page is why mining costs electricity.

scrapybytes.vercel.app/blockchain
next up / 0x06
Where state lives between cycles: memory
memory