Technical Specifications

💬 Technical Questions? Join #powercommons:matrix.org to discuss ISA specifications, architectural decisions, and implementation details.


A2O Architecture: From Embedded to Server-Class ISA Compliance

Heritage and Current Architecture

The A2O processor core is a significant piece of open-source Power architecture heritage. It belongs to IBM's A2 family, whose in-order member was developed for the Blue Gene/Q supercomputer program. The A2 family comprises two variants:

  • A2I (In-order): A 4-way simultaneously multithreaded (SMT4) in-order execution core
  • A2O (Out-of-order): A 2-way simultaneously multithreaded (SMT2) out-of-order execution core with advanced speculation and dynamic scheduling

Both cores were designed to the Power ISA v2.06 Embedded specification and optimized for highly parallel computing environments where power consumption, thread density, and deterministic behavior were paramount. Blue Gene/Q systems built on the A2 design ranked among the world's most energy-efficient supercomputers of their generation.

Embedded Architecture Foundation

As Book III-E (Embedded) implementations, the A2 cores feature:

  • Software-loaded TLB with hardware support for radix tree lookups via “indirect” entries
  • Embedded Hypervisor model using MSR[GS] (Guest State) for virtualization
  • Multiple SRR pairs (SRR0/1, CSRR0/1, DSRR0/1, MCSRR0/1) for nested interrupt handling
  • Implementation-specific debug facilities (DAC, DBCR, IAC registers)
  • Implementation-specific performance monitoring (custom PMU counters and controls)
  • No VMX/VSX support (Vector/SIMD extensions)

This embedded-focused design made the A2 ideal for embedded, real-time, and specialized HPC workloads but created barriers for running modern general-purpose software stacks.


The Compliance Gap: v2.06 to v3.1C

The journey from Power ISA v2.06 Embedded to v3.0C/v3.1C compliance represents a fundamental architectural transformation, not merely an incremental update. This upgrade bridges the gap between embedded and server-class implementations, enabling compatibility with mainstream Linux distributions and modern virtualization frameworks.

Two Compliance Targets

SFFS (Scalar Fixed-point and Floating-point Subset)

  • Core instruction set compliance without vector extensions
  • Sufficient for custom software stacks and embedded Linux
  • Smaller implementation footprint
  • Approximately 40+ new scalar instructions

LCS (Linux Compliancy Subset)

  • Full compliance including VMX/VSX vector extensions
  • Required for modern glibc (depends on VMX for optimized string operations)
  • Enables compatibility with Fedora, Ubuntu, RHEL, and other mainstream distributions
  • Adds 400+ vector/scalar instructions on top of SFFS requirements
  • Essential for running unmodified Linux distribution binaries

Book I: User-Mode Instruction Set Evolution

The Book I (user-mode) changes represent the most visible transformation, adding powerful new capabilities:

New Instruction Categories

Atomic Operations: Quadword load/store atomic (lqarx/stqcx.) enable lock-free algorithms on 128-bit data structures, critical for modern concurrent programming.
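
The retry pattern behind lqarx/stqcx. can be sketched with a toy model. This is a minimal single-location simulation, not hardware behavior: the class ReservationModel, the interfere method, and the helper atomic_update are hypothetical names introduced only for illustration.

```python
class ReservationModel:
    """Toy single-location model of load-reserve / store-conditional (hypothetical)."""

    MASK128 = (1 << 128) - 1

    def __init__(self, value=0):
        self.value = value & self.MASK128
        self.reserved = False

    def lqarx(self):
        # Load the quadword and establish a reservation.
        self.reserved = True
        return self.value

    def stqcx(self, new_value):
        # Store only if the reservation is still held; report success or failure.
        if not self.reserved:
            return False
        self.value = new_value & self.MASK128
        self.reserved = False
        return True

    def interfere(self, new_value):
        # Models a store from another thread, which cancels the reservation.
        self.value = new_value & self.MASK128
        self.reserved = False


def atomic_update(mem, fn, max_retries=100):
    """Reload and recompute until the conditional store succeeds."""
    for _ in range(max_retries):
        old = mem.lqarx()
        if mem.stqcx(fn(old)):
            return old
    raise RuntimeError("reservation repeatedly lost")


def pack(pointer, version):
    # A 64-bit pointer plus a 64-bit version tag in one 128-bit value.
    return (version << 64) | pointer


mem = ReservationModel(pack(0x1000, 7))
previous = atomic_update(mem, lambda v: pack(0x2000, (v >> 64) + 1))
```

Updating a pointer and its version tag in one 128-bit store is a common way to avoid the ABA problem in lock-free structures, which is why the quadword width matters here.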

Bit Manipulation: New count-trailing-zeros instructions (cnttzw, cnttzd) and a combined sign-extend-and-shift instruction (extswsli) speed up bit scanning, bit-field processing, and index arithmetic of the kind found in cryptographic and compression code.
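
As a reference for what these three instructions compute, here is a small behavioral sketch. It models results only, under the assumption that register values are plain 64-bit unsigned integers; it says nothing about encodings or condition-register side effects.

```python
MASK64 = (1 << 64) - 1


def cnttzw(rs):
    """Count trailing zero bits in the low 32 bits of rs (32 if that word is zero)."""
    word = rs & 0xFFFFFFFF
    if word == 0:
        return 32
    # word & -word isolates the lowest set bit.
    return (word & -word).bit_length() - 1


def cnttzd(rs):
    """Count trailing zero bits in the 64-bit value rs (64 if it is zero)."""
    dword = rs & MASK64
    if dword == 0:
        return 64
    return (dword & -dword).bit_length() - 1


def extswsli(rs, sh):
    """Sign-extend the low 32 bits of rs to 64 bits, then shift left by sh."""
    if not 0 <= sh <= 63:
        raise ValueError("shift amount must be in 0..63")
    word = rs & 0xFFFFFFFF
    if word & 0x80000000:
        word -= 1 << 32
    return (word << sh) & MASK64
```

A typical use of extswsli is turning a signed 32-bit array index into a 64-bit byte offset in one step, for example `extswsli(index, 3)` for an array of 8-byte elements.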

Prefixed Instructions (v3.1): An 8-byte (64-bit) instruction encoding carries immediate operands of up to 34 bits, which gives PC-relative addressing a range of ±8 GiB. A single prefixed instruction can replace a multi-instruction address-generation sequence, which simplifies position-independent code generation.
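
The reach of a 34-bit signed displacement follows directly from its width. The sketch below computes that range and shows the displacement split into an 18-bit upper part and a 16-bit lower part. The split reflects my reading of the v3.1 prefixed D-forms (upper bits in the prefix word, lower bits in the suffix word) and should be treated as an assumption; the function names are hypothetical.

```python
IMM_BITS = 34
D0_BITS = 18  # assumed: upper part, carried in the prefix word
D1_BITS = 16  # assumed: lower part, carried in the suffix word


def imm34_range():
    """Inclusive range of a 34-bit two's-complement displacement, in bytes."""
    half = 1 << (IMM_BITS - 1)
    return -half, half - 1


def split_imm34(disp):
    """Split a signed displacement into (d0, d1) unsigned fields."""
    lo, hi = imm34_range()
    if not lo <= disp <= hi:
        raise ValueError("displacement does not fit in 34 bits")
    raw = disp & ((1 << IMM_BITS) - 1)
    return raw >> D1_BITS, raw & ((1 << D1_BITS) - 1)


def join_imm34(d0, d1):
    """Rebuild the signed displacement from its two fields."""
    raw = (d0 << D1_BITS) | d1
    if raw & (1 << (IMM_BITS - 1)):
        raw -= 1 << IMM_BITS
    return raw


GIB = 1 << 30
low, high = imm34_range()
```

With these definitions `low` equals `-8 * GIB` and `high` equals `8 * GIB - 1`, so the reachable window is 16 GiB centered on the current instruction.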

Message Synchronization: Architected message-passing primitives improve inter-thread communication efficiency.

VMX/VSX Vector Extensions

For LCS compliance, implementing the complete VMX (Vector Multimedia Extension) and VSX (Vector-Scalar Extension) instruction sets represents the largest single implementation effort:

  • 128-bit SIMD operations: 32 vector registers (VR0-VR31) for parallel data processing
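  • Vector-scalar registers: 64 VSX registers (VSR0-VSR63), each 128 bits wide; VSR0-31 overlay the FPRs and VSR32-63 overlay the VRs, so 64-bit scalar floating-point and vector operations share one register file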
  • Fused multiply-add operations: High-throughput floating-point computation
  • Permute and shuffle: Flexible data reorganization within vectors
  • Load/store vectors: Efficient memory access for parallel data
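
The register overlay can be stated as a simple index mapping. This is a sketch of the aliasing rule only (VSR0-31 share storage with the FPRs, VSR32-63 with the VRs); the function names are hypothetical.

```python
def vsr_alias(n):
    """Return which legacy register file VSR n overlays, and the index within it."""
    if not 0 <= n <= 63:
        raise ValueError("VSR index must be in 0..63")
    if n < 32:
        # The FPR occupies the first doubleword of the 128-bit VSR.
        return ("FPR", n)
    return ("VR", n - 32)


def vsr_index(kind, index):
    """Inverse mapping: the VSR number behind an FPR or VR."""
    if not 0 <= index <= 31:
        raise ValueError("register index must be in 0..31")
    if kind == "FPR":
        return index
    if kind == "VR":
        return index + 32
    raise ValueError("kind must be 'FPR' or 'VR'")
```

One consequence for the implementation is that a single 64-entry, 128-bit register file can back the FPRs, the VRs, and the VSRs, rather than three separate files.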

Modern glibc depends on VMX for optimized implementations of memcpy, memset, strcmp, and other fundamental library functions, making VMX/VSX essential for running unmodified Linux distributions.

See: WS1 - Instruction Set Updates | WS7 - VMX/VSX Implementation | WS8 - Prefixed Instructions


Book III: Privileged Architecture Transformation

The Book III changes represent a complete rearchitecture from embedded (Book III-E) to server (Book III-S) privileged facilities. This transformation touches every aspect of privileged operation.

Hypervisor Model: From Embedded to Server

Current (Embedded Hypervisor):

  • MSR[GS] bit distinguishes guest from hypervisor
  • Flat privilege model with limited partitioning
  • Suitable for single-guest environments

Target (Server Hypervisor):

  • MSR[HV] bit defines hypervisor mode
  • LPCR (Logical Partitioning Control Register) for per-partition configuration
  • HRMOR (Hypervisor Real Mode Offset Register) for hypervisor address space isolation
  • PCR (Processor Compatibility Register) for ISA compatibility modes
  • Per-thread partition IDs enabling true KVM support
  • Two-level memory translation (guest virtual → guest real → host real)
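
How MSR[HV] combines with the problem-state bit MSR[PR] to select a privilege state can be sketched as follows. The bit positions are the ones I understand the 64-bit server MSR to use (HV at IBM bit 3, PR at IBM bit 49); treat them, and the omission of any secure-state bit, as assumptions of this sketch.

```python
MSR_HV = 1 << 60  # assumed: IBM bit 3 of the 64-bit MSR
MSR_PR = 1 << 14  # assumed: IBM bit 49 of the 64-bit MSR


def privilege_state(msr):
    """Classify a server-model MSR value into one of three privilege states."""
    if msr & MSR_PR:
        # Problem (user) state, regardless of HV.
        return "problem"
    if msr & MSR_HV:
        return "hypervisor"
    # HV=0, PR=0: a guest operating system kernel.
    return "privileged"
```

Contrast this with the embedded model described above, where MSR[GS] marks guest state: in the server model the hypervisor is identified positively by HV=1 and a guest kernel is simply privileged state with HV=0.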

This transformation enables the A2O to run modern KVM-based virtualization stacks, supporting multiple isolated virtual machines with hardware-assisted memory protection.

See: WS2 - Hypervisor & Virtualization

Interrupt Architecture Consolidation

Current: Multiple save/restore register pairs (SRR0/1, CSRR0/1, DSRR0/1, MCSRR0/1) for nested interrupt contexts, typical of embedded designs.

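Target: The server model keeps SRR0/1 for interrupts directed to the operating system and adds HSRR0/1 (Hypervisor SRR) for interrupts directed to the hypervisor, replacing the separate critical, debug, and machine-check pairs. Interrupt handling is further streamlined by: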

  • System Call Vectored (scv) instruction for low-latency system calls
  • Return from scv (rfscv) with reduced context switching overhead
  • Simplified interrupt priority and masking

This consolidation reduces hardware complexity while improving interrupt latency for hypervisor and operating system transitions.
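
The register pairs named in this section can be tabulated to make the consolidation concrete. The table below is a sketch: the class labels ("base", "os", and so on) are hypothetical, the return instructions are the ones I understand each pair to use, and the presence of SRR0/1 alongside HSRR0/1 in the server model is my reading of Book III-S.

```python
SAVE_RESTORE = {
    "embedded": {
        "base": ("SRR0", "SRR1", "rfi"),
        "critical": ("CSRR0", "CSRR1", "rfci"),
        "debug": ("DSRR0", "DSRR1", "rfdi"),
        "machine_check": ("MCSRR0", "MCSRR1", "rfmci"),
    },
    "server": {
        "os": ("SRR0", "SRR1", "rfid"),
        "hypervisor": ("HSRR0", "HSRR1", "hrfid"),
    },
}


def save_restore(model, interrupt_class):
    """Look up (address register, state register, return instruction)."""
    return SAVE_RESTORE[model][interrupt_class]


def pair_count(model):
    """Number of distinct save/restore register pairs in a model."""
    return len({(a, s) for a, s, _ in SAVE_RESTORE[model].values()})
```

Note that scv does not appear in the table: as I understand it, scv places the return address in LR and the old MSR in CTR, and rfscv restores from them, which is part of why it is cheaper than an interrupt that goes through SRR0/1.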

See: WS3 - Interrupt Architecture

MMU: From Software TLB to Hardware Radix Translation

The MMU transformation represents one of the most significant architectural changes:

Current State:

  • Software-loaded TLB (operating system manages TLB entries explicitly)
  • Radix tree support via “indirect” TLB entries (hardware walks radix trees for indirect entries)
  • Hybrid approach with software control

Target State:

  • Full hardware radix tree page table walker
  • Page-walk cache for translation caching (reduces memory accesses for address translation)
  • Two-level translation for LPAR (Logical Partitioning):
    • Level 1: Guest virtual → Guest real (managed by guest OS)
    • Level 2: Guest real → Host real (managed by hypervisor)
  • Guest-real addressing mode for hypervisor efficiency
  • Nested page table support for KVM
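
The two-level scheme can be illustrated with a toy model in which each level is a flat page map. Real radix translation walks a multi-level tree at each stage; the dictionaries, the 64 KB page size, and every name below are simplifying assumptions chosen to show only the composition of the two levels.

```python
PAGE_SHIFT = 16  # assumed 64 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when a page has no mapping at one of the two levels."""


def translate(table, addr, stage):
    """Translate addr through one flat page map (page number -> frame number)."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in table:
        raise TranslationFault(f"{stage}: no mapping for page {page:#x}")
    return (table[page] << PAGE_SHIFT) | offset


def nested_translate(guest_table, host_table, guest_virtual):
    """Level 1 (guest OS) followed by level 2 (hypervisor)."""
    guest_real = translate(guest_table, guest_virtual, "level 1 (guest)")
    return translate(host_table, guest_real, "level 2 (host)")


def worst_case_walk_accesses(guest_levels, host_levels):
    """Memory reads for one uncached nested walk with multi-level trees.

    Every guest-level table pointer is a guest real address that itself
    needs a host walk, and so does the final guest real address. Process
    and partition table lookups are ignored here.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1
```

For two 4-level trees `worst_case_walk_accesses(4, 4)` gives 24 memory reads for a single uncached translation, which is the motivation for the page-walk cache listed above.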

This MMU upgrade enables:

  • Modern Linux memory management compatibility
  • Efficient large page support (64 KB, 2 MB, 1 GB pages)
  • Hardware-accelerated address translation
  • Industry-standard LPAR implementation for cloud and virtualization workloads
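
The relationship between the radix tree shape and the large page sizes can be sketched by splitting an effective address into per-level indices. The shape used here (13, 9, 9, and 9 index bits over a 4 KB base page, 52 address bits in total) is the layout I understand Linux to configure; the architecture lets software choose the shape, so treat these constants as assumptions.

```python
LEVEL_BITS = (13, 9, 9, 9)  # assumed index width per level, root first
PAGE_SHIFT = 12             # assumed 4 KB base page


def radix_indices(ea):
    """Split an effective address into per-level table indices and a page offset."""
    total = sum(LEVEL_BITS) + PAGE_SHIFT
    ea &= (1 << total) - 1
    offset = ea & ((1 << PAGE_SHIFT) - 1)
    shift = total
    indices = []
    for bits in LEVEL_BITS:
        shift -= bits
        indices.append((ea >> shift) & ((1 << bits) - 1))
    return indices, offset


def leaf_page_size(level):
    """Bytes mapped by a leaf entry found at the given level (0 = root)."""
    remaining = sum(LEVEL_BITS[level + 1:])
    return 1 << (PAGE_SHIFT + remaining)
```

A leaf at the last level maps 4 KB, a leaf one level up maps 2 MB, and a leaf two levels up maps 1 GB. A 64 KB base page would correspond to a (13, 9, 9, 5) shape with a 16-bit page offset under the same assumptions.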

See: WS4 - Storage Management (MMU)

Debug Facilities Modernization

Replace: Implementation-specific registers (DAC, DBCR, IAC)

With: Architected debug facilities:

  • DAWR/DAWRX (Data Address Watchpoint Register and its Extension register) for data breakpoints
  • CIABR (Completed Instruction Address Breakpoint Register) for instruction breakpoints
  • Standardized debug event handling
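
What a data watchpoint has to decide on each access can be modeled in a few lines. This is a deliberately simplified sketch: the Watchpoint class and its fields are hypothetical stand-ins for the address held in DAWR and the range and read/write controls held in DAWRX, and it omits privilege filtering and the exact field encodings, which are defined in the ISA.

```python
from dataclasses import dataclass


@dataclass
class Watchpoint:
    base: int          # start address, treated as doubleword aligned
    doublewords: int   # number of 8-byte units watched
    on_read: bool
    on_write: bool

    def matches(self, addr, size, is_write):
        """True if an access of `size` bytes at `addr` should raise a debug event."""
        if is_write and not self.on_write:
            return False
        if not is_write and not self.on_read:
            return False
        start = self.base & ~0x7
        end = start + 8 * self.doublewords
        # Half-open interval overlap test between the access and the watched range.
        return addr < end and addr + size > start


wp = Watchpoint(base=0x1000, doublewords=2, on_read=False, on_write=True)
```

In this example a 4-byte store to 0x1008 matches, while a load from the same address or a store to 0x1010 does not.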

See: WS5 - Debug Facilities

Performance Monitor Unit (PMU) Standardization

Replace: Custom implementation-specific PMU counters

With: Architected PMU:

  • MMCR0-2 (Monitor Mode Control Registers) for event selection and configuration
  • PMC1-6 (Performance Monitor Counters) for standardized event counting
  • SIER (Sampled Instruction Event Register) for precise event attribution
  • Compatibility with standard Linux perf tools and profilers
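
Sampling with the architected counters relies on counter overflow. The sketch below assumes the commonly described behavior that a PMC is 32 bits wide and signals overflow when its most significant bit becomes set, so software preloads the counter to overflow after a chosen number of events. The function names are hypothetical.

```python
PMC_OVERFLOW = 0x8000_0000  # assumed: overflow is signalled when bit 31 becomes set


def pmc_start_value(period):
    """Initial PMC value so that overflow occurs after `period` events."""
    if not 0 < period <= PMC_OVERFLOW:
        raise ValueError("period must be in 1..2**31")
    return PMC_OVERFLOW - period


def has_overflowed(pmc):
    """True once the counter's most significant bit is set."""
    return bool(pmc & PMC_OVERFLOW)


start = pmc_start_value(100_000)
```

This is the mechanism a profiler uses to take one sample every N events: it programs the event in the MMCRs, preloads the PMC with `pmc_start_value(N)`, and reloads it in the performance monitor interrupt handler.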

See: WS6 - Performance Monitor Unit


Implementation Strategy and Priorities

Phase 1: SFFS Compliance Foundation

  1. Book I scalar instructions (40+ new instructions)
  2. Basic Book III-S structure (MSR[HV], LPCR, HSRR0/1)
  3. Simplified interrupt architecture
  4. Debug and PMU standardization
  5. Custom Linux kernel support

Phase 2: MMU and Hypervisor

  1. Hardware radix page table walker
  2. Page-walk cache implementation
  3. Two-level translation for LPAR
  4. Full KVM support with per-thread partitioning
  5. Guest-real addressing mode

Phase 3: LCS Full Compliance

  1. VMX instruction set (128-bit SIMD)
  2. VSX instruction set (64-bit scalar + vector extensions)
  3. Vector register file (32 × 128-bit VRs)
  4. Mainstream Linux distribution compatibility (Fedora, Ubuntu, RHEL)

Phase 4: v3.1 Advanced Features

  1. Prefixed instruction support (64-bit encoding)
  2. PC-relative addressing infrastructure
  3. Extended immediate operands (34-bit)

See: Workstream Summary for detailed timeline and dependencies


Instruction Set Summary

Total Changes

  • New Instructions: 463+ (40 scalar + 400 VMX/VSX + 18 prefixed + 5 other)
  • Removed Instructions: 38+ (15 user-mode + 7 interrupt + 15 MMU + 1 hypervisor)
  • Net Growth: +425 instructions

By Workstream

Workstream          New Instructions    Removed Instructions
WS1 - User Mode     40+                 15+
WS2 - Hypervisor    0                   1
WS3 - Interrupts    3                   7
WS4 - MMU           2                   15+
WS7 - VMX/VSX       400+                0
WS8 - Prefixed      18                  0
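
The per-workstream figures can be checked against the totals with a short tally. The numbers are taken from the table above, with the "+" suffixes dropped so that each value is treated as a lower bound.

```python
# (new, removed) per workstream, lower bounds from the table
WORKSTREAMS = {
    "WS1 - User Mode": (40, 15),
    "WS2 - Hypervisor": (0, 1),
    "WS3 - Interrupts": (3, 7),
    "WS4 - MMU": (2, 15),
    "WS7 - VMX/VSX": (400, 0),
    "WS8 - Prefixed": (18, 0),
}


def totals(workstreams):
    """Return (new, removed, net growth) summed over all workstreams."""
    new = sum(n for n, _ in workstreams.values())
    removed = sum(r for _, r in workstreams.values())
    return new, removed, new - removed


new_total, removed_total, net = totals(WORKSTREAMS)
```

The sums agree with the stated totals: 463 new, 38 removed, net growth of 425. The "5 other" new instructions in the summary correspond to the 3 interrupt and 2 MMU additions.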

The Value Proposition

This comprehensive upgrade transforms the A2O from a specialized embedded processor into a server-class, general-purpose Power architecture core capable of:

  • Running unmodified Linux distributions (Ubuntu, Fedora, RHEL, Debian)
  • Hosting KVM virtual machines with hardware-assisted isolation
  • Supporting modern development toolchains (glibc, GCC, LLVM)
  • Enabling cloud and datacenter workloads with industry-standard LPAR
  • Maintaining open-source heritage while achieving commercial viability

The A2O upgrade represents a unique opportunity in the open-source hardware ecosystem: transforming a proven, high-efficiency supercomputer core into a fully capable, ISA-compliant processor suitable for general-purpose computing. By bridging the embedded-to-server gap, PowerCommons enables the A2O to serve markets ranging from sovereign computing initiatives to academic research, embedded systems to cloud infrastructure—all while remaining fully open source.

This is not merely a compliance exercise; it is the revival of a world-class processor architecture for the next generation of open, auditable, and trustworthy computing systems.


Resource Requirements

Engineering Team

  • Core RTL Engineers: 6-8
  • Verification Engineers: 4-5
  • Integration Engineers: 2-3
  • Total: 12-16 engineers

Timeline

  • Aggressive: 18 months (high risk)
  • Nominal: 20-22 months (recommended)
  • Conservative: 24 months (low risk)

Infrastructure

  • High-performance simulation servers
  • FPGA prototyping boards (VCU-118 recommended)
  • Compliance test licenses
  • Linux distribution access

Compliance and Verification

The A2O upgrade targets full compliance with Power ISA v3.0C or v3.1C specifications through comprehensive testing:

Test Strategy

  1. Unit tests during development (WS1-WS8)
  2. Integration tests for cross-workstream validation (WS9)
  3. ISA compliance suite execution (WS9, weeks 36-52)
  4. Linux Test Project for OS-level validation
  5. glibc test suite for ABI compliance
  6. KVM selftests for hypervisor validation

See: WS9 - Integration & Verification | Compliance Testing Strategy



For implementation questions and collaboration opportunities, join us on Matrix: #powercommons:matrix.org