Technical Specifications

💬 Technical Questions? Join #powercommons:matrix.org to discuss ISA specifications, architectural decisions, and implementation details.

A2O Architecture: From Embedded to Server-Class ISA Compliance

Heritage and Current Architecture

The A2O processor core is a significant piece of open-source Power architecture heritage. It belongs to IBM's A2 core family, whose in-order member (A2I) powered the Blue Gene/Q supercomputer; the A2O is the out-of-order follow-on to that design. The A2 family comprises two variants:

A2I (In-order): A 4-way simultaneously multithreaded (SMT4) in-order execution core
A2O (Out-of-order): A 2-way simultaneously multithreaded (SMT2) out-of-order execution core with advanced speculation and dynamic scheduling

Both cores were designed to the Power ISA v2.06 Embedded specification, optimized for high-efficiency parallel computing environments where power consumption, thread density, and deterministic behavior were paramount. The in-order A2 design powered Blue Gene/Q systems that ranked among the world’s most energy-efficient supercomputers, demonstrating strong performance per watt.

Embedded Architecture Foundation

As Book III-E (Embedded) implementations, the A2 cores feature:

Software-loaded TLB with hardware support for radix tree lookups via “indirect” entries
Embedded Hypervisor model using MSR[GS] (Guest State) for virtualization
Multiple SRR pairs (SRR0/1, CSRR0/1, DSRR0/1, MCSRR0/1) for nested interrupt handling
Implementation-specific debug facilities (DAC, DBCR, IAC registers)
Implementation-specific performance monitoring (custom PMU counters and controls)
No VMX/VSX support (Vector/SIMD extensions)

This design suited embedded, real-time, and specialized HPC workloads, but it created barriers for running modern general-purpose software stacks.

The Compliance Gap: v2.06 to v3.1C

The journey from Power ISA v2.06 Embedded to v3.0C/v3.1C compliance represents a fundamental architectural transformation, not merely an incremental update. This upgrade bridges the gap between embedded and server-class implementations, enabling compatibility with mainstream Linux distributions and modern virtualization frameworks.

Two Compliance Targets

SFFS (Scalar Fixed-point and Floating-point Subset)

Core instruction set compliance without vector extensions
Sufficient for custom software stacks and embedded Linux
Smaller implementation footprint
Approximately 40+ new scalar instructions

LCS (Linux Compliancy Subset)

Full compliance including VMX/VSX vector extensions
Required for modern glibc (depends on VMX for optimized string operations)
Enables compatibility with Fedora, Ubuntu, RHEL, and other mainstream distributions
Adds 400+ vector/scalar instructions on top of SFFS requirements
Essential for running unmodified Linux distribution binaries

Book I: User-Mode Instruction Set Evolution

The Book I (user-mode) changes represent the most visible transformation, adding powerful new capabilities:

New Instruction Categories

Atomic Operations: Quadword load-and-reserve and store-conditional instructions (lqarx/stqcx.) enable lock-free algorithms on 128-bit data structures, critical for modern concurrent programming.
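
As a rough illustration of the reserve/conditional-store pattern, the following Python sketch models a single reservation in software. It is a toy model, not the hardware behavior: the class ReservationMemory and the helper atomic_add128 are hypothetical names introduced here, and real reservations have granule and multi-thread rules that this sketch ignores.

```python
class ReservationMemory:
    """Toy single-reservation model of load-and-reserve / store-conditional."""

    def __init__(self):
        self.mem = {}            # address -> 128-bit integer value
        self.reservation = None  # address currently reserved, or None

    def load_reserve(self, addr):
        # Models lqarx: read the quadword and establish a reservation.
        self.reservation = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        # An ordinary store (e.g. from another thread) clears a matching reservation.
        self.mem[addr] = value
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):
        # Models stqcx.: the store happens only if the reservation is still held.
        ok = self.reservation == addr
        if ok:
            self.mem[addr] = value & ((1 << 128) - 1)
        self.reservation = None
        return ok


def atomic_add128(m, addr, delta):
    # Retry loop: the usual shape of a load-reserve / store-conditional sequence.
    while True:
        old = m.load_reserve(addr)
        if m.store_conditional(addr, old + delta):
            return old
```

The retry loop is the point of the design: software never takes a lock, it simply repeats the sequence whenever the conditional store reports that the reservation was lost.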

Bit Manipulation: New bit permutation and manipulation instructions (cnttzw, cnttzd, extswsli) improve performance for cryptographic operations, compression algorithms, and bit field processing.
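
To make the semantics of the three named instructions concrete, here is a small Python reference model. It is a sketch for reading and testing purposes, assuming register values are passed as unsigned 64-bit integers; the function names simply mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1


def cnttzw(rs):
    """Count trailing zeros in the low 32 bits of rs (32 if they are all zero)."""
    word = rs & 0xFFFFFFFF
    if word == 0:
        return 32
    # word & -word isolates the lowest set bit; its bit index is the answer.
    return (word & -word).bit_length() - 1


def cnttzd(rs):
    """Count trailing zeros in the 64-bit value rs (64 if it is zero)."""
    dword = rs & MASK64
    if dword == 0:
        return 64
    return (dword & -dword).bit_length() - 1


def extswsli(rs, sh):
    """Sign-extend the low 32 bits of rs to 64 bits, then shift left by sh."""
    if not 0 <= sh <= 63:
        raise ValueError("shift amount must be in 0..63")
    word = rs & 0xFFFFFFFF
    if word & 0x80000000:
        word -= 1 << 32  # interpret as a negative 32-bit value
    return (word << sh) & MASK64
```

For example, extswsli(0xFFFFFFFF, 2) treats the input as -1, shifts it to -4, and returns the 64-bit pattern 0xFFFFFFFFFFFFFFFC.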

Prefixed Instructions (v3.1): A 64-bit instruction encoding (a 32-bit prefix followed by a 32-bit suffix) carries immediate operands up to 34 bits and enables PC-relative addressing with a range of ±8 GiB, which shortens the instruction sequences needed for position-independent code.
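
The reach of a 34-bit signed immediate can be checked with a few lines of Python. The sketch below also splits the immediate into an 18-bit high part and a 16-bit low part, which is how the value is assumed to be divided between prefix and suffix words here; the helper names are hypothetical and the sketch does not produce full instruction encodings.

```python
IMM_BITS = 34


def imm34_range():
    # Signed two's-complement range of a 34-bit immediate.
    return -(1 << (IMM_BITS - 1)), (1 << (IMM_BITS - 1)) - 1


def split_imm34(value):
    """Split a signed 34-bit value into (18-bit high field, 16-bit low field)."""
    lo, hi = imm34_range()
    if not lo <= value <= hi:
        raise ValueError("value does not fit in 34 signed bits")
    raw = value & ((1 << IMM_BITS) - 1)
    return raw >> 16, raw & 0xFFFF


def join_imm34(d0, d1):
    """Reassemble and sign-extend the two fields back into a Python int."""
    raw = (d0 << 16) | d1
    if raw & (1 << (IMM_BITS - 1)):
        raw -= 1 << IMM_BITS
    return raw


GIB = 1 << 30
low, high = imm34_range()
reach_gib = -low // GIB  # magnitude of the most negative displacement, in GiB
```

Here reach_gib evaluates to 8, so a PC-relative displacement built this way reaches 8 GiB backward and just under 8 GiB forward.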

Message Synchronization: Architected message-passing primitives improve inter-thread communication efficiency.

VMX/VSX Vector Extensions

For LCS compliance, implementing the complete VMX (Vector Multimedia Extension) and VSX (Vector-Scalar Extension) instruction sets represents the largest single implementation effort:

128-bit SIMD operations: 32 vector registers (VR0-VR31) for parallel data processing
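VSX register file: 64 vector-scalar registers of 128 bits each (VSR0-VSR63); VSR0-31 overlay the FPRs and VSR32-63 overlay the VRs, so 64-bit scalar floating-point and vector operations share one register file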
Fused multiply-add operations: High-throughput floating-point computation
Permute and shuffle: Flexible data reorganization within vectors
Load/store vectors: Efficient memory access for parallel data
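
The register overlay can be expressed as a simple index mapping. The following sketch assumes the usual arrangement in which the lower 32 VSX registers alias the FPRs and the upper 32 alias the VRs; vsr_alias is a hypothetical helper name used only for illustration.

```python
def vsr_alias(n):
    """Return which legacy register a VSX register number aliases."""
    if not 0 <= n <= 63:
        raise ValueError("VSX register numbers run from 0 to 63")
    if n < 32:
        # FPR n occupies one 64-bit doubleword of VSR n.
        return ("FPR", n)
    # VSR 32..63 are the same storage as VR 0..31.
    return ("VR", n - 32)
```

A register-file implementation that follows this mapping needs 64 entries of 128 bits in total, rather than separate FPR, VR, and VSR arrays.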

Modern glibc depends on VMX for optimized implementations of memcpy, memset, strcmp, and other fundamental library functions, making VMX/VSX essential for running unmodified Linux distributions.

See: WS1 - Instruction Set Updates | WS7 - VMX/VSX Implementation | WS8 - Prefixed Instructions

Book III: Privileged Architecture Transformation

The Book III changes represent a complete rearchitecture from embedded (Book III-E) to server (Book III-S) privileged facilities. This transformation touches every aspect of privileged operation.

Hypervisor Model: From Embedded to Server

Current (Embedded Hypervisor):

MSR[GS] bit distinguishes guest from hypervisor
Flat privilege model with limited partitioning
Suitable for single-guest environments

Target (Server Hypervisor):

MSR[HV] bit defines hypervisor mode
LPCR (Logical Partitioning Control Register) for per-partition configuration
HRMOR (Hypervisor Real Mode Offset Register) for hypervisor address space isolation
PCR (Processor Compatibility Register) for ISA compatibility modes
Per-thread partition IDs enabling true KVM support
Two-level memory translation (guest virtual → guest real → host real)
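
As a minimal sketch of how the server model distinguishes privilege states, the Python below decodes MSR[HV] and MSR[PR]. The bit positions are stated as assumptions (HV as bit 3 and PR as bit 49 in IBM bit numbering of the 64-bit MSR, i.e. shifts of 60 and 14), and privilege_state is a hypothetical helper.

```python
MSR_HV = 1 << 60  # assumed: MSR bit 3 in IBM (big-endian) bit numbering
MSR_PR = 1 << 14  # assumed: MSR bit 49 in IBM (big-endian) bit numbering


def privilege_state(msr):
    """Classify an MSR value into problem, hypervisor, or privileged state."""
    if msr & MSR_PR:
        # PR=1 is problem (user) state regardless of HV.
        return "problem"
    if msr & MSR_HV:
        return "hypervisor"
    return "privileged"
```

In this model a guest operating system runs with HV=0 and PR=0, while the hypervisor runs with HV=1 and PR=0.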

This transformation enables the A2O to run modern KVM-based virtualization stacks, supporting multiple isolated virtual machines with hardware-assisted memory protection.

See: WS2 - Hypervisor & Virtualization

Interrupt Architecture Consolidation

Current: Multiple save/restore register pairs (SRR0/1, CSRR0/1, DSRR0/1, MCSRR0/1) for nested interrupt contexts, typical of embedded designs.

Target: Server interrupt model built on SRR0/1 plus HSRR0/1 (Hypervisor SRR), with streamlined interrupt handling:

System Call Vectored (scv) instruction for low-latency system calls
Return from scv (rfscv) with reduced context switching overhead
Simplified interrupt priority and masking

This consolidation reduces hardware complexity while improving interrupt latency for hypervisor and operating system transitions.

See: WS3 - Interrupt Architecture

MMU: From Software TLB to Hardware Radix Translation

The MMU transformation represents one of the most significant architectural changes:

Current State:

Software-loaded TLB (operating system manages TLB entries explicitly)
Radix tree support via “indirect” TLB entries (hardware walks radix trees for indirect entries)
Hybrid approach with software control

Target State:

Full hardware radix tree page table walker
Page-walk cache for translation caching (reduces memory accesses for address translation)
Two-level translation for LPAR (Logical Partitioning):
- Level 1: Guest virtual → Guest real (managed by guest OS)
- Level 2: Guest real → Host real (managed by hypervisor)
Guest-real addressing mode for hypervisor efficiency
Nested page table support for KVM
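
The two-level scheme can be illustrated with a toy model in which each level is a flat page map. This is a sketch only: it assumes 64 KiB pages, replaces the radix trees with Python dictionaries, and uses hypothetical function names. The last helper gives the commonly cited worst-case count of memory accesses for an uncached nested walk, counting tree levels only.

```python
PAGE_SHIFT = 16                    # assumption: 64 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, addr):
    """Translate one level: look up the page number, keep the page offset."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in table:
        raise LookupError(f"no mapping for page {page:#x}")
    return (table[page] << PAGE_SHIFT) | offset


def nested_translate(guest_table, host_table, guest_va):
    # Level 1: guest virtual -> guest real (table owned by the guest OS).
    guest_real = translate(guest_table, guest_va)
    # Level 2: guest real -> host real (table owned by the hypervisor).
    return translate(host_table, guest_real)


def worst_case_walk_accesses(guest_levels, host_levels):
    # Every guest-level pointer is itself a guest-real address that needs a
    # host walk, as does the final guest-real address.
    return (guest_levels + 1) * (host_levels + 1) - 1
```

With four levels in each tree, worst_case_walk_accesses(4, 4) gives 24 memory accesses for a single uncached translation, which is the cost a page-walk cache is meant to reduce.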

This MMU upgrade enables:

Modern Linux memory management compatibility
Efficient large page support (64KB, 2MB, 1GB pages)
Hardware-accelerated address translation
Industry-standard LPAR implementation for cloud and virtualization workloads

See: WS4 - Storage Management (MMU)

Debug Facilities Modernization

Replace: Implementation-specific registers (DAC, DBCR, IAC)

With: Architected debug facilities:

DAWR/DAWRX (Data Address Watchpoint Register) for data breakpoints
CIABR (Completed Instruction Address Breakpoint Register) for instruction breakpoints
Standardized debug event handling

See: WS5 - Debug Facilities

Performance Monitor Unit (PMU) Standardization

Replace: Custom implementation-specific PMU counters

With: Architected PMU:

MMCR0-2 (Monitor Mode Control Registers) for event selection and configuration
PMC1-6 (Performance Monitor Counters) for standardized event counting
SIER (Sampled Instruction Event Register) for precise event attribution
Compatibility with standard Linux perf tools and profilers

See: WS6 - Performance Monitor Unit

Implementation Strategy and Priorities

Phase 1: SFFS Compliance Foundation

Book I scalar instructions (40+ new instructions)
Basic Book III-S structure (MSR[HV], LPCR, HSRR0/1)
Simplified interrupt architecture
Debug and PMU standardization
Custom Linux kernel support

Phase 2: MMU and Hypervisor

Hardware radix page table walker
Page-walk cache implementation
Two-level translation for LPAR
Full KVM support with per-thread partitioning
Guest-real addressing mode

Phase 3: LCS Full Compliance

VMX instruction set (128-bit SIMD)
VSX instruction set (64-bit scalar + vector extensions)
Vector register file (32 × 128-bit VRs)
Mainstream Linux distribution compatibility (Fedora, Ubuntu, RHEL)

Phase 4: v3.1 Advanced Features

Prefixed instruction support (64-bit encoding)
PC-relative addressing infrastructure
Extended immediate operands (34-bit)

See: Workstream Summary for detailed timeline and dependencies

Instruction Set Summary

Total Changes

New Instructions: 463+ (40 scalar + 400 VMX/VSX + 18 prefixed + 5 other)
Removed Instructions: 38+ (15 user-mode + 7 interrupt + 15 MMU + 1 hypervisor)
Net Growth: +425 instructions

By Workstream

Workstream	New Instructions	Removed Instructions
WS1 - User Mode	40+	15+
WS2 - Hypervisor	0	1
WS3 - Interrupts	3	7
WS4 - MMU	2	15+
WS7 - VMX/VSX	400+	0
WS8 - Prefixed	18	0
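
The per-workstream figures can be cross-checked against the totals with a short script. The numbers below are the lower bounds from the table (entries marked "+" are treated as their base value), so the results are lower bounds as well.

```python
# (new, removed) instruction counts per workstream, taken from the table.
workstreams = {
    "WS1 - User Mode": (40, 15),
    "WS2 - Hypervisor": (0, 1),
    "WS3 - Interrupts": (3, 7),
    "WS4 - MMU": (2, 15),
    "WS7 - VMX/VSX": (400, 0),
    "WS8 - Prefixed": (18, 0),
}

new_total = sum(new for new, _ in workstreams.values())
removed_total = sum(removed for _, removed in workstreams.values())
net_growth = new_total - removed_total
```

The sums come to 463 new and 38 removed instructions, for a net growth of 425, matching the totals stated above.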

The Value Proposition

This comprehensive upgrade transforms the A2O from a specialized embedded processor into a server-class, general-purpose Power architecture core capable of:

Running unmodified Linux distributions (Ubuntu, Fedora, RHEL, Debian)
Hosting KVM virtual machines with hardware-assisted isolation
Supporting modern development toolchains (glibc, GCC, LLVM)
Enabling cloud and datacenter workloads with industry-standard LPAR
Maintaining open-source heritage while achieving commercial viability

The A2O upgrade represents a unique opportunity in the open-source hardware ecosystem: transforming a proven, high-efficiency supercomputer core into a fully capable, ISA-compliant processor suitable for general-purpose computing. By bridging the embedded-to-server gap, PowerCommons enables the A2O to serve markets ranging from sovereign computing initiatives to academic research, embedded systems to cloud infrastructure—all while remaining fully open source.

This is not merely a compliance exercise; it is the revival of a world-class processor architecture for the next generation of open, auditable, and trustworthy computing systems.

Resource Requirements

Engineering Team

Core RTL Engineers: 6-8
Verification Engineers: 4-5
Integration Engineers: 2-3
Total: 12-16 engineers

Timeline

Aggressive: 18 months (high risk)
Nominal: 20-22 months (recommended)
Conservative: 24 months (low risk)

Infrastructure

High-performance simulation servers
FPGA prototyping boards (VCU118 recommended)
Compliance test licenses
Linux distribution access

Compliance and Verification

The A2O upgrade targets full compliance with Power ISA v3.0C or v3.1C specifications through comprehensive testing:

Test Strategy

Unit tests during development (WS1-WS8)
Integration tests for cross-workstream validation (WS9)
ISA compliance suite execution (WS9, weeks 36-52)
Linux Test Project for OS-level validation
glibc test suite for ABI compliance
KVM selftests for hypervisor validation

See: WS9 - Integration & Verification | Compliance Testing Strategy

Workstream Summary - Full task breakdown and dependencies
A2O Revival Project - Restoring current A2O functionality
Compliance Testing Strategy - Verification approach
Individual workstream pages (WS1-WS9) - Detailed task lists

For implementation questions and collaboration opportunities, join us on Matrix: #powercommons:matrix.org


PowerCommons - OpenPOWER of the People, By the People, for the People