From ebbf584756744dc82e1ffe858b89f3779969329a Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Tue, 17 Mar 2026 21:43:05 +0000 Subject: [PATCH 1/4] docs: add GemminiBall, Verilator, and build system documentation - Add GemminiBall Architecture guide: instruction routing, funct7 encoding, configuration and execution paths - Add Verilator Simulation and CI guide: clock handling improvements, memory configuration, test execution - Add Development Workflow and Build System guide: Nix setup, bbdev tool usage, common tasks - Update .order.json to include new Guide section - Sync both English and Chinese documentation Co-authored-by: Shiroha --- content/.order.json | 6 +- .../Development Workflow and Build System.md | 262 ++++++++++++++++++ content/en/Guide/GemminiBall Architecture.md | 153 ++++++++++ .../en/Guide/Verilator Simulation and CI.md | 170 ++++++++++++ .../Development Workflow and Build System.md | 262 ++++++++++++++++++ content/zh/Guide/GemminiBall Architecture.md | 153 ++++++++++ .../zh/Guide/Verilator Simulation and CI.md | 170 ++++++++++++ 7 files changed, 1175 insertions(+), 1 deletion(-) create mode 100644 content/en/Guide/Development Workflow and Build System.md create mode 100644 content/en/Guide/GemminiBall Architecture.md create mode 100644 content/en/Guide/Verilator Simulation and CI.md create mode 100644 content/zh/Guide/Development Workflow and Build System.md create mode 100644 content/zh/Guide/GemminiBall Architecture.md create mode 100644 content/zh/Guide/Verilator Simulation and CI.md diff --git a/content/.order.json b/content/.order.json index 9dda683..8be315b 100644 --- a/content/.order.json +++ b/content/.order.json @@ -3,5 +3,9 @@ "Overview.md", "Tutorial", "Building Your Own Hardware Designs.md", - "Buckyball ISA.md" + "Buckyball ISA.md", + "Guide", + "GemminiBall Architecture.md", + "Verilator Simulation and CI.md", + "Development Workflow and Build System.md" ] \ No newline at end of file diff --git a/content/en/Guide/Development 
Workflow and Build System.md b/content/en/Guide/Development Workflow and Build System.md new file mode 100644 index 0000000..58ea9da --- /dev/null +++ b/content/en/Guide/Development Workflow and Build System.md @@ -0,0 +1,262 @@ +# Development Workflow and Build System + +## Overview + +Buckyball provides a streamlined development environment using Nix Flakes, with tools like `bbdev` for managing hardware simulation, compilation, and testing. This guide covers the build system, common workflows, and troubleshooting. + +## Initial Setup + +### Using Nix Flakes + +Nix Flakes provide a reproducible development environment with all required tools: + +```bash +# Install Nix (if not already installed) +curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh + +# Update the flake lock file (the Determinate installer enables flakes by default) +nix flake update +``` + +### Entering the Development Environment + +```bash +nix develop +``` + +This command sets up: +- Scala/Chisel RTL development tools +- Verilator for hardware simulation +- C/C++ compiler for software +- Test frameworks and dependencies +- Pre-commit hooks + +### Full Repository Initialization + +The `build-all.sh` script automates the complete setup process: + +```bash +cd buckyball +./scripts/nix/build-all.sh +``` + +**Setup Steps** (can be skipped with `--skip N`): + +1. Install bbdev +2. Compile the compiler toolchain +3. Pre-compile RTL sources +4. Install bebop (simulation framework) +5. Pre-compile test workloads +6. Build waveform-mcp module +7. Install pre-commit hooks + +**Options:** + +```bash +# Skip specific steps +./scripts/nix/build-all.sh --skip 2 --skip 5 + +# Verbose output +./scripts/nix/build-all.sh --verbose + +# Install dependencies in Nix store (default) +./scripts/nix/build-all.sh --install-in-nix +``` + +## bbdev Tool + +`bbdev` is the primary interface for hardware simulation and build management in Buckyball. 
+ +### Basic Usage + +```bash +bbdev [options] +``` + +### Verilator Simulation + +Run RTL simulations using the Verilator backend: + +```bash +# Basic simulation run +bbdev verilator --run '' + +# Common simulation options +bbdev verilator --run \ + '--jobs 16 \ + --binary \ + --config sims.verilator.BuckyballToyVerilatorConfig \ + --batch' +``` + +**Workload Examples:** + +- `ctest_vecunit_matmul_ones_singlecore-baremetal`: Single-core matrix multiplication test +- `ctest_toy_add-baremetal`: Toy vector add test +- Model tests: `ModelTest-` (e.g., `ModelTest-LeNet`) + +### Simulation Configuration + +Available configurations in `sims/verilator/`: + +- `BuckyballToyVerilatorConfig`: Single-core configuration for unit testing +- Custom configs: Define in Scala configuration files + +## Code Organization + +### Directory Structure + +``` +buckyball/ +├── arch/ # RTL design (Chisel/Scala) +│ ├── src/main/scala/ +│ │ ├── examples/ # Reference designs +│ │ ├── framework/ # Core framework +│ │ └── sims/ # Simulation harness +│ └── tests/ # RTL unit tests +├── bb-tests/ # Software workloads and tests +│ ├── workloads/src/ # Test applications +│ │ ├── ModelTest/ # ML model inference tests +│ │ ├── OpTest/ # Operation tests +│ │ └── custom/ # User-defined workloads +│ └── sardine/ # Test framework +├── compiler/ # MLIR-based compiler +├── frontend/ # Software framework +├── backend/ # System support libraries +└── scripts/ # Build and utility scripts +``` + +### Key Subsystems + +| Subsystem | Purpose | +|-----------|---------| +| `framework.balldomain` | Accelerator module (Ball) framework | +| `framework.top.GlobalConfig` | System-wide configuration | +| `sims.verilator` | Verilator simulation harness | +| `bb-tests.sardine` | Test orchestration | +| `bbAgent` | Software agent/orchestration | + +## Common Development Tasks + +### Modifying RTL + +1. Edit Chisel files in `arch/src/main/scala/` +2. 
Recompile the RTL and run the unit tests: + ```bash + cd arch + mill arch.test # Run unit tests + ``` +3. Verify with Verilator simulation + +### Adding a Custom Test Workload + +1. Create source files in `bb-tests/workloads/src//` +2. Add a `CMakeLists.txt` if needed +3. Update the test configuration in `sardine/` if you use the test framework +4. Run with `bbdev verilator --run '--binary ...'` + +### Debugging Simulations + +**Enable waveform capture:** + +```bash +bbdev verilator --run '--vcd out.vcd --binary --batch' +``` + +**Open waveforms:** + +```bash +gtkwave out.vcd & +``` + +**Trace specific signals:** +- Add `dontTouch` or `debug` annotations in Chisel +- Verify signal names in generated RTL + +## Testing and Validation + +### Unit Tests + +RTL unit tests: + +```bash +cd arch +mill arch.test +``` + +### Integration Tests + +Full system tests using Verilator: + +```bash +./scripts/nix/test-suite.sh # Run full test suite (if available) +``` + +### Coverage-Driven Testing + +Enable coverage tracking (if supported): + +```bash +bbdev verilator --run '--coverage --binary --batch' +``` + +## Continuous Integration + +### Pre-commit Hooks + +Installed hooks validate: +- Code formatting +- Lint checks (Scala, C++) +- CMake syntax + +```bash +# Manual pre-commit check +pre-commit run --all-files +``` + +### CI Workflows + +GitHub Actions workflows (`.github/workflows/`): +- `test.yml`: Runs Verilator tests on push +- `lint.yml`: Code quality checks +- `build.yml`: Compiler and RTL builds + +## Troubleshooting + +### Build Issues + +| Problem | Solution | +|---------|----------| +| `nix develop` fails | Update flake: `nix flake update` | +| Scala compilation errors | Check the installed `mill` version: `mill --version` | +| Missing dependencies | Run `./scripts/nix/build-all.sh` again | + +### Simulation Issues + +| Problem | Solution | +|---------|----------| +| Test timeout | Reduce workload size or increase timeout | +| Out-of-memory | Check workload data size, reduce test scale | +| Incorrect results | 
Enable waveform capture for debugging | + +### Environment Issues + +```bash +# Reset environment +rm -rf .mill-cache +nix flake update +nix develop --impure +``` + +## Recent Changes + +- **bbdev updates**: Improved tool performance and added new simulation options +- **Build system**: Enhanced CMake support for workload compilation +- **Nix setup**: Added Yosys and OpenSTA for optional synthesis flows +- **Test framework**: Updated sardine framework for new simulation semantics + +## See Also + +- [Verilator Simulation and CI](Verilator%20Simulation%20and%20CI.md) +- [GemminiBall Architecture](GemminiBall%20Architecture.md) +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) diff --git a/content/en/Guide/GemminiBall Architecture.md b/content/en/Guide/GemminiBall Architecture.md new file mode 100644 index 0000000..0771b45 --- /dev/null +++ b/content/en/Guide/GemminiBall Architecture.md @@ -0,0 +1,153 @@ +# GemminiBall Architecture + +## Overview + +GemminiBall is a specialized Ball (accelerator module) in Buckyball that implements systolic array-based matrix multiplication operations. It integrates Gemmini-style compute semantics with the Buckyball Blink interface for instruction dispatch and result handling. 
+ +## Architecture Components + +### Core Modules + +- **GemminiBall**: Main instruction router and execution controller +- **GemminiExCtrl**: Execution unit controller for non-loop instructions (CONFIG, PRELOAD, COMPUTE, FLUSH) +- **LoopMatmulUnroller**: Handles blocked matrix multiplication loops +- **LoopConvUnroller**: Handles convolutional computation loops +- **LoopCmdEncoder**: Encodes loop commands for execution + +### Configuration Registers + +GemminiBall maintains configuration state via: + +- **loopWsConfig**: Stores loop parameters for matrix multiplication + - `max_i`, `max_j`, `max_k`: Loop iteration bounds + - DRAM addresses for matrices A, B, C, D + - Stride parameters for memory access patterns + +- **loopConvConfig**: Stores convolution-specific parameters + +## Instruction Routing by funct7 + +GemminiBall dispatches instructions based on the `funct7` field: + +| funct7 | Operation | Type | +|--------|-----------|------| +| 0x02 | CONFIG | ExUnit | +| 0x03 | FLUSH | ExUnit | +| 0x35 | PRELOAD | ExUnit | +| 0x42 | COMPUTE_PRELOADED | ExUnit | +| 0x43 | COMPUTE_ACCUMULATED | ExUnit | +| 0x50–0x56 | Loop Config | Configuration | +| 0x57 | Loop Trigger (Matrix) | Control | +| 0x60–0x68 | Loop Config (Conv) | Configuration | +| 0x69 | Loop Trigger (Conv) | Control | + +### Execution Paths + +**ExUnit Path** (CONFIG, PRELOAD, COMPUTE, FLUSH): +- Routed directly to GemminiExCtrl +- Produces standard responses with full latency + +**Config Path** (Loop configuration): +- Immediate response (single cycle) +- Stores configuration registers +- Includes ROB tracking for metadata association + +## Instruction Format + +### ExUnit Instructions + +ExUnit instructions follow the standard Blink command format with Gemmini semantics: + +``` +Field | Bits | Description +----------|---------|----------------------------------- +funct7 | [6:0] | Operation selector +rs2/cmd | [63:0] | Operand/configuration data +rs1 | [31:0] | Address/register file pointer +``` + +### 
Loop Configuration Instructions + +Each loop configuration instruction is selected by `funct7` and packs its parameters into the `rs2` operand: + +``` +Instruction: funct7 | rs2_data (special) +funct7 0x50: max_i [47:32], max_j [31:16], max_k [15:0] +funct7 0x51: dram_addr_a [38:0] +funct7 0x52: dram_addr_b [38:0] +funct7 0x53: dram_addr_d [38:0] +funct7 0x54: dram_addr_c [38:0] +funct7 0x55: stride_a [31:0], stride_b [63:32] +funct7 0x56: stride_d [31:0], stride_c [63:32] +``` + +## Register Tracking + +GemminiBall tracks ROB IDs in `rob_id_reg` so that metadata stays associated with an instruction across the configuration and execution stages. This enables: + +- Correct result routing to the ReOrder Buffer (ROB) +- Sub-operation tracking for pipelined configurations +- Coherent state management for blocked operations + +## Usage Example + +### Matrix Multiplication Sequence + +```scala +// 1. Configure loop parameters (M=64, N=64, K=64) +gemmini_loop_config_i(64, 64, 64) + +// 2. Set DRAM addresses +gemmini_dram_addr_a(0x0) +gemmini_dram_addr_b(0x10000) +gemmini_dram_addr_c(0x20000) +gemmini_dram_addr_d(0x20000) + +// 3. Set stride parameters +gemmini_stride_a_b(1024, 1024) +gemmini_stride_d_c(1024, 1024) + +// 4. Trigger loop execution +gemmini_loop_trigger() + +// 5. 
(Optional) Flush results +gemmini_flush() +``` + +## Integration with Buckyball + +### Blink Interface + +GemminiBall implements `BlinkIO` for command receipt and response: + +- **cmdReq**: Command request (instruction + operands) +- **cmdResp**: Response (result + metadata) +- **status**: Current execution status + +### ReOrder Buffer (ROB) + +Results include: +- `rob_id`: Original instruction ID for out-of-order execution +- `is_sub`: Indicates sub-operation status +- `sub_rob_id`: Secondary ROB tracking for composed operations + +## Recent Enhancements + +### funct7 Encoding Update (Latest) + +Recent commits refactored the funct7 encoding scheme to: +- Separate ExUnit instructions (immediate response paths) from loop control +- Align with updated DISA (Domain-Specific ISA) specification +- Support new bank enable tracing for debug and profiling + +### Instruction Tracing + +Bank enable support enables: +- Memory access pattern visualization +- Performance profiling per memory bank +- Debugging of data dependencies + +## See Also + +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) +- [Building Your Own Hardware Designs](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) diff --git a/content/en/Guide/Verilator Simulation and CI.md b/content/en/Guide/Verilator Simulation and CI.md new file mode 100644 index 0000000..0bf2a73 --- /dev/null +++ b/content/en/Guide/Verilator Simulation and CI.md @@ -0,0 +1,170 @@ +# Verilator Simulation and CI + +## Overview + +Verilator is the primary hardware simulation tool in Buckyball, used for rapid RTL verification and continuous integration testing. Recent updates improve clock handling, timing robustness, and CI configuration consistency. 
+ +## Setup and Configuration + +### Installation + +Verilator is included in the Nix development environment: + +```bash +nix develop +``` + +### Project Configuration + +Verilator projects in Buckyball are configured via: + +- `BBSimHarness`: System-level simulation harness with clock and reset management +- `sims/verilator/` directory: Simulation-specific configurations and test runners + +## Clock and Timing Improvements + +### Rising-Edge Detection (mmio_tick) + +Recent updates to `BBSimHarness` refine the `mmio_tick` signal for rising-edge detection: + +```scala +// Debounce and rising-edge detection for wFire signal +val wFire_r = RegNext(wFire) +val wFire_rising = wFire && !wFire_r +``` + +This prevents spurious trigger detection and aligns with MMIO peripheral behavior. + +### Clock Edge Handling + +The simulation harness now: +1. Maintains explicit rising-edge and falling-edge cycle tracking +2. Debounces write-fire signals to prevent duplicate MMIO operations +3. Aligns clock phases with expected rising-edge semantics + +## Memory Section Handling + +### BBSimHarness Configuration + +Memory sections are correctly handled in simulation via: + +- **Linker script**: Defines DRAM, code, and stack regions +- **Memory mapping**: Ensures virtual-to-physical address translation matches hardware +- **Initialization**: Pre-loads test code and data into simulated memory + +### Example Linker Configuration + +Memory sections typically include: + +``` +DRAM: 0x80000000 – 0x9FFFFFFF (512 MB default, configurable) +CODE: 0x80000000 – 0x800FFFFF (1 MB default) +STACK: 0x9FFF0000 – 0x9FFFFFFF +``` + +## Running Verilator Simulation + +### Basic Test + +```bash +bbdev verilator --run \ + '--jobs 16 \ + --binary ctest_vecunit_matmul_ones_singlecore-baremetal \ + --config sims.verilator.BuckyballToyVerilatorConfig \ + --batch' +``` + +### Parameters + +| Option | Meaning | +|--------|---------| +| `--jobs 16` | Use 16 parallel compile jobs | +| `--binary` | Workload binary 
name | +| `--config` | Simulation configuration class | +| `--batch` | Non-interactive mode | + +### Available Configurations + +- `BuckyballToyVerilatorConfig`: Single-core toy configuration for unit testing +- `BuckyballFullVerilatorConfig`: Multi-core full system (if available) + +## CI Pipeline Updates + +### Workflow Configuration + +Recent CI updates in `.github/workflows/test.yml`: + +1. **Verilator Setup**: Installs and caches compiled simulator +2. **Test Matrix**: Runs subset of tests in parallel +3. **Timing**: Enforces clock edge semantics for deterministic results +4. **Debugging**: Captures waveforms on failure (conditional) + +### Example Workflow Step + +```yaml +- name: Run Verilator Tests + run: | + nix develop --command bash -c \ + 'bbdev verilator --run "--batch --jobs 16 --binary "' +``` + +## Debugging and Analysis + +### Waveform Capture + +To generate VCD waveforms for debugging: + +```bash +bbdev verilator --run '--batch --vcd out.vcd --binary ' +``` + +### Inspection + +- Open VCD files in GTKWave, Verdi, or similar tools +- Trace signal behavior across clock cycles +- Correlate with instruction commit log + +## Performance Considerations + +### Simulation Speed + +- Single-core toybox: ~100K cycles/second on modern CPU +- Multi-core systems: Speed degrades with core count +- Typical small test: 1–5 seconds simulation time + +### Resource Usage + +- Memory: ~500 MB–2 GB depending on design size +- Disk: Compiled simulator (~200 MB) +- CPU: Scales with `--jobs` parameter + +## Troubleshooting + +### Common Issues + +| Issue | Solution | +|-------|----------| +| Memory access out of bounds | Verify linker script and workload base address | +| Simulation hangs | Check for deadlock in memory controller or DMA | +| Incorrect clock edge | Verify rising-edge detection in BBSimHarness | +| Test timeouts | Increase timeout or reduce test complexity | + +### Debug Output + +Enable verbose simulation output: + +```bash +bbdev verilator --run '--batch 
--verbose --binary ' +``` + +## Recent Changes + +- **Rising-edge detection**: Improved `mmio_tick` debouncing and clock phase handling +- **Memory layout**: Corrected section handling in BBSimHarness linker script +- **CI configuration**: Unified Verilator settings across test workflows +- **Coverage support**: Enhanced signal handling for coverage-driven verification (if enabled) + +## See Also + +- [Building Your Own Hardware Designs](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) diff --git a/content/zh/Guide/Development Workflow and Build System.md b/content/zh/Guide/Development Workflow and Build System.md new file mode 100644 index 0000000..67ece66 --- /dev/null +++ b/content/zh/Guide/Development Workflow and Build System.md @@ -0,0 +1,262 @@ +# 开发工作流和构建系统 + +## 概述 + +Buckyball 使用 Nix Flakes 提供简化的开发环境,使用 `bbdev` 等工具来管理硬件模拟、编译和测试。本指南涵盖构建系统、常见工作流和故障排除。 + +## 初始设置 + +### 使用 Nix Flakes + +Nix Flakes 使用所有必需的工具提供可重复的开发环境: + +```bash +# 安装 Nix(如果尚未安装) +curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh + +# 更新 flake 锁文件(Determinate 安装程序默认已启用 Flakes) +nix flake update +``` + +### 进入开发环境 + +```bash +nix develop +``` + +此命令设置: +- Scala/Chisel RTL 开发工具 +- 硬件模拟用的 Verilator +- C/C++ 编译器 +- 测试框架和依赖项 +- 预提交钩子 + +### 完整存储库初始化 + +`build-all.sh` 脚本自动化完整的设置过程: + +```bash +cd buckyball +./scripts/nix/build-all.sh +``` + +**设置步骤**(可以用 `--skip N` 跳过): + +1. 安装 bbdev +2. 编译编译器工具链 +3. 预编译 RTL 源 +4. 安装 bebop(模拟框架) +5. 预编译测试工作负载 +6. 构建 waveform-mcp 模块 +7. 
安装预提交钩子 + +**选项:** + +```bash +# 跳过特定步骤 +./scripts/nix/build-all.sh --skip 2 --skip 5 + +# 详细输出 +./scripts/nix/build-all.sh --verbose + +# 在 Nix 存储中安装依赖项(默认) +./scripts/nix/build-all.sh --install-in-nix +``` + +## bbdev 工具 + +`bbdev` 是 Buckyball 中硬件模拟和构建管理的主要接口。 + +### 基本用法 + +```bash +bbdev [options] +``` + +### Verilator 模拟 + +使用 Verilator 后端运行 RTL 模拟: + +```bash +# 基本模拟运行 +bbdev verilator --run '' + +# 常见模拟选项 +bbdev verilator --run \ + '--jobs 16 \ + --binary \ + --config sims.verilator.BuckyballToyVerilatorConfig \ + --batch' +``` + +**工作负载示例:** + +- `ctest_vecunit_matmul_ones_singlecore-baremetal`: 单核矩阵乘法测试 +- `ctest_toy_add-baremetal`: 玩具向量加法测试 +- 模型测试: `ModelTest-`(例如 `ModelTest-LeNet`) + +### 模拟配置 + +`sims/verilator/` 中的可用配置: + +- `BuckyballToyVerilatorConfig`: 用于单元测试的单核配置 +- 自定义配置: 在 Scala 配置文件中定义 + +## 代码组织 + +### 目录结构 + +``` +buckyball/ +├── arch/ # RTL 设计(Chisel/Scala) +│ ├── src/main/scala/ +│ │ ├── examples/ # 参考设计 +│ │ ├── framework/ # 核心框架 +│ │ └── sims/ # 模拟框架 +│ └── tests/ # RTL 单元测试 +├── bb-tests/ # 软件工作负载和测试 +│ ├── workloads/src/ # 测试应用程序 +│ │ ├── ModelTest/ # 机器学习模型推理测试 +│ │ ├── OpTest/ # 操作测试 +│ │ └── custom/ # 用户定义的工作负载 +│ └── sardine/ # 测试框架 +├── compiler/ # 基于 MLIR 的编译器 +├── frontend/ # 软件框架 +├── backend/ # 系统支持库 +└── scripts/ # 构建和实用脚本 +``` + +### 关键子系统 + +| 子系统 | 用途 | +|--------|------| +| `framework.balldomain` | 加速器模块(Ball)框架 | +| `framework.top.GlobalConfig` | 系统级配置 | +| `sims.verilator` | Verilator 模拟框架 | +| `bb-tests.sardine` | 测试编排 | +| `bbAgent` | 软件代理/编排 | + +## 常见开发任务 + +### 修改 RTL + +1. 编辑 `arch/src/main/scala/` 中的 Chisel 文件 +2. 重建模拟: + ```bash + cd arch + mill arch.test # 运行单元测试 + ``` +3. 使用 Verilator 模拟验证 + +### 添加自定义测试工作负载 + +1. 在 `bb-tests/workloads/src//` 中创建源文件 +2. 如果需要,添加 CMakeLists.txt +3. 如果使用测试框架,在 `sardine/` 中更新测试配置 +4. 
使用 `bbdev verilator --run '--binary ...'` 运行 + +### 调试模拟 + +**启用波形捕获:** + +```bash +bbdev verilator --run '--vcd out.vcd --binary --batch' +``` + +**打开波形:** + +```bash +gtkwave out.vcd & +``` + +**追踪特定信号:** +- 在 Chisel 中添加 `dontTouch` 或 `debug` 注解 +- 验证生成 RTL 中的信号名称 + +## 测试和验证 + +### 单元测试 + +RTL 单元测试: + +```bash +cd arch +mill arch.test +``` + +### 集成测试 + +使用 Verilator 的完整系统测试: + +```bash +./scripts/nix/test-suite.sh # 运行完整测试套件(如果可用) +``` + +### 覆盖率驱动测试 + +启用覆盖率追踪(如果支持): + +```bash +bbdev verilator --run '--coverage --binary --batch' +``` + +## 持续集成 + +### 预提交钩子 + +安装的钩子验证: +- 代码格式化 +- Lint 检查(Scala、C++) +- CMake 语法 + +```bash +# 手动预提交检查 +pre-commit run --all-files +``` + +### CI 工作流 + +GitHub Actions 工作流(`.github/workflows/`): +- `test.yml`: 在推送时运行 Verilator 测试 +- `lint.yml`: 代码质量检查 +- `build.yml`: 编译器和 RTL 构建 + +## 故障排除 + +### 构建问题 + +| 问题 | 解决方案 | +|------|---------| +| `nix develop` 失败 | 更新 flake:`nix flake update` | +| Scala 编译错误 | 确保 `mill` 是最新的:`mill --version` | +| 缺少依赖项 | 再次运行 `./scripts/nix/build-all.sh` | + +### 模拟问题 + +| 问题 | 解决方案 | +|------|---------| +| 测试超时 | 减少工作负载大小或增加超时 | +| 内存不足 | 检查工作负载数据大小,减小测试规模 | +| 结果不正确 | 启用波形捕获用于调试 | + +### 环境问题 + +```bash +# 重置环境 +rm -rf .mill-cache +nix flake update +nix develop --impure +``` + +## 最近的变化 + +- **bbdev 更新**: 改进了工具性能并添加了新的模拟选项 +- **构建系统**: 增强了对工作负载编译的 CMake 支持 +- **Nix 设置**: 为可选合成流程添加了 Yosys 和 OpenSTA +- **测试框架**: 为新的模拟语义更新了 sardine 框架 + +## 参见 + +- [Verilator 模拟和 CI](Verilator%20Simulation%20and%20CI.md) +- [GemminiBall 架构](GemminiBall%20Architecture.md) +- [Buckyball ISA 文档](../Overview/Buckyball%20ISA.md) diff --git a/content/zh/Guide/GemminiBall Architecture.md b/content/zh/Guide/GemminiBall Architecture.md new file mode 100644 index 0000000..24b8702 --- /dev/null +++ b/content/zh/Guide/GemminiBall Architecture.md @@ -0,0 +1,153 @@ +# GemminiBall 架构 + +## 概述 + +GemminiBall 是 Buckyball 中的一个专用 Ball(加速器模块),实现了基于脉动阵列的矩阵乘法操作。它将 Gemmini 风格的计算语义与 Buckyball Blink 接口结合,用于指令分发和结果处理。 + +## 架构组件 + +### 核心模块 + +- 
**GemminiBall**: 主指令路由器和执行控制器 +- **GemminiExCtrl**: 非循环指令执行单元控制器(CONFIG、PRELOAD、COMPUTE、FLUSH) +- **LoopMatmulUnroller**: 处理分块矩阵乘法循环 +- **LoopConvUnroller**: 处理卷积计算循环 +- **LoopCmdEncoder**: 编码循环命令用于执行 + +### 配置寄存器 + +GemminiBall 通过以下配置寄存器维护状态: + +- **loopWsConfig**: 存储矩阵乘法的循环参数 + - `max_i`、`max_j`、`max_k`:循环迭代边界 + - 矩阵 A、B、C、D 的 DRAM 地址 + - 内存访问模式的步长参数 + +- **loopConvConfig**: 卷积特定的参数 + +## 按 funct7 指令路由 + +GemminiBall 根据 `funct7` 字段分发指令: + +| funct7 | 操作 | 类型 | +|--------|------|------| +| 0x02 | CONFIG | ExUnit | +| 0x03 | FLUSH | ExUnit | +| 0x35 | PRELOAD | ExUnit | +| 0x42 | COMPUTE_PRELOADED | ExUnit | +| 0x43 | COMPUTE_ACCUMULATED | ExUnit | +| 0x50–0x56 | 循环配置 | 配置 | +| 0x57 | 循环触发(矩阵) | 控制 | +| 0x60–0x68 | 循环配置(卷积) | 配置 | +| 0x69 | 循环触发(卷积) | 控制 | + +### 执行路径 + +**ExUnit 路径**(CONFIG、PRELOAD、COMPUTE、FLUSH): +- 直接路由到 GemminiExCtrl +- 输出完整延迟的标准响应 + +**配置路径**(循环配置): +- 立即响应(单周期) +- 存储配置寄存器 +- 包含 ROB 元数据追踪 + +## 指令格式 + +### ExUnit 指令 + +ExUnit 指令遵循标准 Blink 命令格式,具有 Gemmini 语义: + +``` +字段 | 位 | 描述 +----------|-------|----------------------------------- +funct7 | [6:0] | 操作选择器 +rs2/cmd | [63:0] | 操作数/配置数据 +rs1 | [31:0] | 地址/寄存器文件指针 +``` + +### 循环配置指令 + +循环配置使用立即数模式,操作数编码如下: + +``` +指令:funct7 | rs2_data(特殊) +funct7 0x50: max_i [47:32], max_j [31:16], max_k [15:0] +funct7 0x51: dram_addr_a [38:0] +funct7 0x52: dram_addr_b [38:0] +funct7 0x53: dram_addr_d [38:0] +funct7 0x54: dram_addr_c [38:0] +funct7 0x55: stride_a [31:0], stride_b [63:32] +funct7 0x56: stride_d [31:0], stride_c [63:32] +``` + +## 寄存器追踪 + +GemminiBall 通过 `rob_id_reg` 追踪 ROB ID 以维护配置和执行阶段之间的元数据关联。这使得: + +- 正确的结果路由到重排序缓冲区 +- 分块操作的子操作追踪 +- 管道化配置的一致状态管理 + +## 使用示例 + +### 矩阵乘法序列 + +```scala +// 1. 配置循环参数 (M=64, N=64, K=64) +gemmini_loop_config_i(64, 64, 64) + +// 2. 设置 DRAM 地址 +gemmini_dram_addr_a(0x0) +gemmini_dram_addr_b(0x10000) +gemmini_dram_addr_c(0x20000) +gemmini_dram_addr_d(0x20000) + +// 3. 设置步长参数 +gemmini_stride_a_b(1024, 1024) +gemmini_stride_d_c(1024, 1024) + +// 4. 
触发循环执行 +gemmini_loop_trigger() + +// 5. (可选)刷新结果 +gemmini_flush() +``` + +## 与 Buckyball 的集成 + +### Blink 接口 + +GemminiBall 实现 `BlinkIO` 用于命令接收和响应: + +- **cmdReq**: 命令请求(指令 + 操作数) +- **cmdResp**: 响应(结果 + 元数据) +- **status**: 当前执行状态 + +### 重排序缓冲区(ROB) + +结果包括: +- `rob_id`: 原始指令 ID,用于乱序执行 +- `is_sub`: 表示子操作状态 +- `sub_rob_id`: 复合操作的二级 ROB 追踪 + +## 最近的改进 + +### funct7 编码更新(最新) + +最近的提交重构了 funct7 编码方案以: +- 将 ExUnit 指令(立即响应路径)与循环控制分离 +- 与更新的 DISA(领域特定 ISA)规范对齐 +- 支持用于调试和分析的新增银行使能追踪 + +### 指令追踪 + +银行使能支持: +- 内存访问模式可视化 +- 每个内存银行的性能分析 +- 数据依赖关系的调试 + +## 参见 + +- [Buckyball ISA 文档](../Overview/Buckyball%20ISA.md) +- [构建自己的硬件设计](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) diff --git a/content/zh/Guide/Verilator Simulation and CI.md b/content/zh/Guide/Verilator Simulation and CI.md new file mode 100644 index 0000000..cf4f0f3 --- /dev/null +++ b/content/zh/Guide/Verilator Simulation and CI.md @@ -0,0 +1,170 @@ +# Verilator 模拟和 CI + +## 概述 + +Verilator 是 Buckyball 中的主要硬件模拟工具,用于快速 RTL 验证和持续集成测试。最近的更新改进了时钟处理、时序鲁棒性和 CI 配置一致性。 + +## 设置和配置 + +### 安装 + +Verilator 包含在 Nix 开发环境中: + +```bash +nix develop +``` + +### 项目配置 + +Buckyball 中的 Verilator 项目通过以下方式配置: + +- `BBSimHarness`: 系统级模拟框架,具有时钟和复位管理 +- `sims/verilator/` 目录: 模拟特定配置和测试运行器 + +## 时钟和时序改进 + +### 上升沿检测(mmio_tick) + +`BBSimHarness` 的最近更新完善了 `mmio_tick` 信号的上升沿检测: + +```scala +// 对 wFire 信号的去抖和上升沿检测 +val wFire_r = RegNext(wFire) +val wFire_rising = wFire && !wFire_r +``` + +这防止了虚假触发检测,并与 MMIO 外设行为一致。 + +### 时钟边沿处理 + +模拟框架现在: +1. 维护显式的上升沿和下降沿周期追踪 +2. 对写入-触发信号进行去抖,以防止重复的 MMIO 操作 +3. 
将时钟相位与预期的上升沿语义对齐 + +## 内存段处理 + +### BBSimHarness 配置 + +内存段在模拟中通过以下方式正确处理: + +- **链接脚本**: 定义 DRAM、代码和堆栈区域 +- **内存映射**: 确保虚拟到物理地址转换与硬件匹配 +- **初始化**: 将测试代码和数据预加载到模拟内存中 + +### 示例链接器配置 + +内存段通常包括: + +``` +DRAM: 0x80000000 – 0x9FFFFFFF (512 MB 默认,可配置) +CODE: 0x80000000 – 0x800FFFFF (1 MB 默认) +STACK: 0x9FFF0000 – 0x9FFFFFFF +``` + +## 运行 Verilator 模拟 + +### 基本测试 + +```bash +bbdev verilator --run \ + '--jobs 16 \ + --binary ctest_vecunit_matmul_ones_singlecore-baremetal \ + --config sims.verilator.BuckyballToyVerilatorConfig \ + --batch' +``` + +### 参数 + +| 选项 | 含义 | +|------|------| +| `--jobs 16` | 使用 16 个并行编译作业 | +| `--binary` | 工作负载二进制名称 | +| `--config` | 模拟配置类 | +| `--batch` | 非交互模式 | + +### 可用配置 + +- `BuckyballToyVerilatorConfig`: 用于单元测试的单核玩具配置 +- `BuckyballFullVerilatorConfig`: 多核完整系统(如果可用) + +## CI 流水线更新 + +### 工作流配置 + +`.github/workflows/test.yml` 中的最近 CI 更新: + +1. **Verilator 设置**: 安装和缓存编译的模拟器 +2. **测试矩阵**: 并行运行测试子集 +3. **时序**: 强制执行时钟边沿语义以获得确定性结果 +4. **调试**: 失败时捕获波形(条件) + +### 工作流步骤示例 + +```yaml +- name: Run Verilator Tests + run: | + nix develop --command bash -c \ + 'bbdev verilator --run "--batch --jobs 16 --binary "' +``` + +## 调试和分析 + +### 波形捕获 + +要生成 VCD 波形用于调试: + +```bash +bbdev verilator --run '--batch --vcd out.vcd --binary ' +``` + +### 检查 + +- 在 GTKWave、Verdi 或类似工具中打开 VCD 文件 +- 跟踪跨时钟周期的信号行为 +- 与指令提交日志相关联 + +## 性能考虑 + +### 模拟速度 + +- 单核玩具盒: 现代 CPU 上约 100K 周期/秒 +- 多核系统: 随着核心数增加而下降 +- 典型小测试: 1–5 秒模拟时间 + +### 资源使用 + +- 内存: ~500 MB–2 GB(取决于设计大小) +- 磁盘: 编译的模拟器(~200 MB) +- CPU: 随 `--jobs` 参数扩展 + +## 故障排除 + +### 常见问题 + +| 问题 | 解决方案 | +|------|---------| +| 内存访问越界 | 验证链接脚本和工作负载基址 | +| 模拟挂起 | 检查内存控制器或 DMA 中的死锁 | +| 不正确的时钟边沿 | 验证 BBSimHarness 中的上升沿检测 | +| 测试超时 | 增加超时或减少测试复杂性 | + +### 调试输出 + +启用详细模拟输出: + +```bash +bbdev verilator --run '--batch --verbose --binary ' +``` + +## 最近的变化 + +- **上升沿检测**: 改进 `mmio_tick` 去抖和时钟相位处理 +- **内存布局**: 修正 BBSimHarness 链接脚本中的段处理 +- **CI 配置**: 统一测试工作流中的 Verilator 设置 +- **覆盖率支持**: 增强覆盖率驱动验证的信号处理(如果启用) + +## 参见 + +- 
[构建自己的硬件设计](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) +- [Buckyball ISA 文档](../Overview/Buckyball%20ISA.md) From b981eb5d033d43d797ee686de01440f65c6b3b20 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Tue, 17 Mar 2026 21:52:56 +0000 Subject: [PATCH 2/4] docs: add Goban multi-core architecture documentation and sync recent GemminiBall enhancements - Add comprehensive Goban documentation (en and zh) covering: - Multi-core BBTile architecture and synchronization primitives - BarrierUnit hardware synchronization support - SPMD programming model with examples - Configuration variants (1-tile, 2-tile) - Test workloads (barrier tests) - Update GemminiBall Architecture documentation (en and zh): - Document funct7 encoding with bank enable bits [6:4] - Add new instructions: IM2COL, TRANSPOSE, RELU, QUANT, DEQUANT, MATMUL_WARP16 - Clarify loop configuration and execution paths - Reference recent loop unroller enhancements Addresses missing Goban documentation reported in PR comments. 
Co-authored-by: Shiroha --- content/en/Guide/GemminiBall Architecture.md | 57 +++-- .../en/Guide/Goban Multi-Core Architecture.md | 225 ++++++++++++++++++ content/zh/Guide/GemminiBall Architecture.md | 97 +++++--- .../zh/Guide/Goban Multi-Core Architecture.md | 225 ++++++++++++++++++ 4 files changed, 546 insertions(+), 58 deletions(-) create mode 100644 content/en/Guide/Goban Multi-Core Architecture.md create mode 100644 content/zh/Guide/Goban Multi-Core Architecture.md diff --git a/content/en/Guide/GemminiBall Architecture.md b/content/en/Guide/GemminiBall Architecture.md index 0771b45..507e1fa 100644 --- a/content/en/Guide/GemminiBall Architecture.md +++ b/content/en/Guide/GemminiBall Architecture.md @@ -27,19 +27,29 @@ GemminiBall maintains configuration state via: ## Instruction Routing by funct7 -GemminiBall dispatches instructions based on the `funct7` field: - -| funct7 | Operation | Type | -|--------|-----------|------| -| 0x02 | CONFIG | ExUnit | -| 0x03 | FLUSH | ExUnit | -| 0x35 | PRELOAD | ExUnit | -| 0x42 | COMPUTE_PRELOADED | ExUnit | -| 0x43 | COMPUTE_ACCUMULATED | ExUnit | -| 0x50–0x56 | Loop Config | Configuration | -| 0x57 | Loop Trigger (Matrix) | Control | -| 0x60–0x68 | Loop Config (Conv) | Configuration | -| 0x69 | Loop Trigger (Conv) | Control | +GemminiBall dispatches instructions based on the `funct7` field. 
The field is partitioned as: + +- **Bits [6:4]**: Bank enable field (encodes memory access type: 000/001/010/011/100 for varying access patterns; 101/110/111 for extended opcodes) +- **Bits [3:0]**: Operation code + +| funct7 | Bits [6:4] | Operation | Type | +|--------|-----------|-----------|------| +| 0x02 | 000 | CONFIG | ExUnit | +| 0x03 | 000 | FLUSH | ExUnit | +| 0x04 | 000 | BDB_COUNTER | Debug | +| 0x30 | 011 | IM2COL | Compute | +| 0x31 | 011 | TRANSPOSE | Compute | +| 0x32 | 011 | RELU | Compute | +| 0x33 | 011 | QUANT | Compute | +| 0x34 | 011 | DEQUANT | Compute | +| 0x35 | 011 | PRELOAD | ExUnit | +| 0x36 | 011 | BDB_BACKDOOR | Debug | +| 0x40 | 100 | MATMUL_WARP16 | Compute | +| 0x41 | 100 | SYSTOLIC | Compute | +| 0x42 | 100 | COMPUTE_PRELOADED | ExUnit | +| 0x43 | 100 | COMPUTE_ACCUMULATED | ExUnit | +| 0x50–0x57 | 101 | Loop WS Config (0x50–0x56) / Loop Trigger (Matrix, 0x57) | Configuration / Control | +| 0x60–0x69 | 110 | Loop Conv Config (0x60–0x68) / Loop Trigger (Conv, 0x69) | Configuration / Control | ### Execution Paths @@ -135,19 +145,28 @@ Results include: ### funct7 Encoding Update (Latest) -Recent commits refactored the funct7 encoding scheme to: -- Separate ExUnit instructions (immediate response paths) from loop control +Recent commits (March 2026) updated the funct7 encoding scheme to: +- Encode bank enable bits in [6:4] for memory access pattern tracking +- Support new operations: IM2COL, TRANSPOSE, RELU, QUANT, DEQUANT, and MATMUL_WARP16 - Align with updated DISA (Domain-Specific ISA) specification -- Support new bank enable tracing for debug and profiling +- Enable instruction tracing with bank access visualization ### Instruction Tracing Bank enable support enables: -- Memory access pattern visualization -- Performance profiling per memory bank -- Debugging of data dependencies +- Memory access pattern visualization per bank +- Performance profiling of memory operations +- Debugging of data dependencies and bank conflicts + +### Loop Unrollers + +Recent GemminiBall enhancements add: 
+- **LoopMatmulUnroller**: Blocked matrix multiplication with configurable bounds +- **LoopConvUnroller**: Convolutional loop unrolling with flexible address generation +- Both support arbitrary loop nesting and strided memory access ## See Also +- [Goban Multi-Core Architecture](Goban%20Multi-Core%20Architecture.md) — Multi-core configurations using GemminiBall - [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) - [Building Your Own Hardware Designs](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) diff --git a/content/en/Guide/Goban Multi-Core Architecture.md b/content/en/Guide/Goban Multi-Core Architecture.md new file mode 100644 index 0000000..f930bf8 --- /dev/null +++ b/content/en/Guide/Goban Multi-Core Architecture.md @@ -0,0 +1,225 @@ +# Goban Multi-Core Architecture + +## Overview + +Goban is a multi-core BBTile configuration in Buckyball that enables parallel execution of SPMD (Single Program Multiple Data) workloads. Each BBTile contains multiple Rocket cores, where each core is paired with its own BuckyballAccelerator. All accelerators within a tile share a single SharedMemBackend and BarrierUnit for synchronization. 
+ +## Architecture Overview + +### Tile Structure + +``` +┌─────────────────────────────────────┐ +│ BBTile (Goban) │ +├─────────────────────────────────────┤ +│ Core 0 │ Core 1 │ ...│ Core N-1 │ +│ Rocket + │ Rocket + │ │ Rocket + │ +│ Accel │ Accel │ │ Accel │ +├──────────┴──────────┴────┴──────────┤ +│ SharedMemBackend + BarrierUnit │ +└─────────────────────────────────────┘ +``` + +### Configuration Variants + +**BuckyballGobanConfig** +- 1 BBTile × 4 cores +- 4 Rocket cores + 4 BuckyballAccelerators +- Single SharedMem + BarrierUnit + +**BuckyballGoban2TileConfig** +- 2 BBTiles × 4 cores = 8 total cores +- 8 Rocket cores + 8 BuckyballAccelerators +- Per-tile SharedMem + BarrierUnit + +## Core Components + +### Multi-Core Execution + +Each core executes the same program independently with access to: +- Local register file +- Private instruction cache +- Paired BuckyballAccelerator for hardware operations +- Shared memory for inter-core communication +- Hart ID (hardware thread ID) via CSR `mhartid` + +### Barrier Unit + +The BarrierUnit provides hardware-level synchronization via the `bb_barrier()` intrinsic: + +- Stalls all cores in the tile until all reach the barrier +- Single-cycle synchronization overhead +- Supports multiple sequential barriers in same program +- Essential for SPMD algorithm coordination + +### Shared Memory Backend + +SharedMemBackend manages memory operations across all cores in the tile: + +- Arbitrates memory requests from multiple cores +- Maintains coherency for shared data structures +- Handles memory-mapped I/O (MMIO) for inter-tile communication +- Supports atomic operations for synchronization primitives + +## Programming Model + +### SPMD Execution Pattern + +```c +#include "goban.h" + +int main(void) { + int cid = bb_get_core_id(); // Get hart ID [0, nCores-1] + + // Phase 1: Per-core computation + int local_result = compute(cid, input_data); + + // Phase 2: Synchronization + bb_barrier(); + + // Phase 3: Shared result 
processing + if (cid == 0) { + process_all_results(local_result); + } + + bb_barrier(); // Ensure all cores reach exit + + return 0; +} +``` + +### Core Identification + +```c +static inline int bb_get_core_id(void) { + int hartid; + asm volatile("csrr %0, mhartid" : "=r"(hartid)); + return hartid; +} +``` + +Returns hart ID in range `[0, nCores-1]`. In Goban configurations, this maps directly to core index within the tile. + +## Test Workloads + +### barrier_test.c + +Smoke test for multi-core barrier synchronization: + +1. Each core sets `arrived[cid] = 1` +2. All cores execute `bb_barrier()` +3. Each core verifies all `arrived[]` flags are set +4. Repeat with `bb_barrier()` a second time +5. Core 0 prints final result + +Correctness check: simulation must not hang and all cores must reach completion. + +### barrier_mvin_test.c + +Combines barrier synchronization with accelerator operations (mvins): + +- Tests that memory barrier coordination works with hardware acceleration +- Verifies data coherency across cores +- Validates BarrierUnit blocking during in-flight accelerator operations + +## Integration with Buckyball + +### System Bus + +Goban uses a 128-bit system bus (vs. toy's narrower bus) to accommodate higher memory bandwidth requirements for multi-core workloads. 
+ +### Configuration in build.sc + +Goban is defined as a configuration target in `arch/src/main/scala/examples/goban/CustomConfigs.scala`: + +```scala +object GobanConfig { + val nCores: Int = 4 + + def apply(): GlobalConfig = { + val base = GlobalConfig() + base.copy(top = base.top.copy(nCores = nCores)) + } +} + +class BuckyballGobanConfig + extends Config( + new WithNBBTiles(1, buckyballConfig = GobanConfig()) ++ + new chipyard.config.WithSystemBusWidth(128) ++ + new chipyard.config.AbstractConfig + ) +``` + +### Running Goban Workloads + +```bash +# Simulate with Goban config (1 tile, 4 cores) +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGobanVerilatorConfig \ + --batch' + +# Simulate with Goban2Tile config (2 tiles, 8 cores) +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGoban2TileVerilatorConfig \ + --batch' +``` + +## Design Considerations + +### Scalability + +Goban supports configurations with 1 or more BBTiles: +- Each tile operates independently +- Tiles can be scaled by instantiating multiple `WithNBBTiles` configurations +- Memory bandwidth grows with system bus width + +### Synchronization Overhead + +BarrierUnit provides hardware acceleration for barrier operations: +- No busy-waiting required +- Single-cycle barrier (after all cores arrive) +- Efficient for bulk synchronization at algorithm phase boundaries + +### Data Layout + +For optimal performance with multi-core memory access: +- Use bank-striped layouts to distribute load +- Align data structures to cache line boundaries +- Avoid false sharing in shared arrays + +## Performance Profiling + +Use instruction tracing and bank enable signals (from GemminiBall tracing enhancements) to profile: + +- Per-core instruction flow +- Memory access patterns across cores +- Barrier stall time +- Accelerator utilization per core + +## Troubleshooting + +### Simulation Hangs at Barrier + +- Check that all 
cores reach the barrier with correct hart IDs +- Verify `nCores` matches barrier array size in test program +- Ensure BarrierUnit is not deadlocked on memory operations + +### Inconsistent Shared Data + +- Add volatile keyword to shared variables +- Insert barriers before and after shared memory access +- Check for cache coherency issues (review memory backend logs) + +### Performance Issues + +- Profile barrier stall time with waveform traces +- Verify load balance across cores (use trace data) +- Consider increasing system bus width for memory-bound workloads + +## See Also + +- [Development Workflow and Build System](Development%20Workflow%20and%20Build%20System.md) — Building and simulating Goban configs +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) — RISC-V + Blink ISA details +- [GemminiBall Architecture](GemminiBall%20Architecture.md) — Accelerator operations in Goban diff --git a/content/zh/Guide/GemminiBall Architecture.md b/content/zh/Guide/GemminiBall Architecture.md index 24b8702..462d2b8 100644 --- a/content/zh/Guide/GemminiBall Architecture.md +++ b/content/zh/Guide/GemminiBall Architecture.md @@ -27,19 +27,29 @@ GemminiBall 通过以下配置寄存器维护状态: ## 按 funct7 指令路由 -GemminiBall 根据 `funct7` 字段分发指令: - -| funct7 | 操作 | 类型 | -|--------|------|------| -| 0x02 | CONFIG | ExUnit | -| 0x03 | FLUSH | ExUnit | -| 0x35 | PRELOAD | ExUnit | -| 0x42 | COMPUTE_PRELOADED | ExUnit | -| 0x43 | COMPUTE_ACCUMULATED | ExUnit | -| 0x50–0x56 | 循环配置 | 配置 | -| 0x57 | 循环触发(矩阵) | 控制 | -| 0x60–0x68 | 循环配置(卷积) | 配置 | -| 0x69 | 循环触发(卷积) | 控制 | +GemminiBall 根据 `funct7` 字段分发指令。该字段分区为: + +- **位 [6:4]**:银行使能字段(编码内存访问类型:000/001/010/011/100 用于不同访问模式;101/110/111 用于扩展操作码) +- **位 [3:0]**:操作码 + +| funct7 | 位 [6:4] | 操作 | 类型 | +|--------|-----------|------|------| +| 0x02 | 000 | CONFIG | ExUnit | +| 0x03 | 000 | FLUSH | ExUnit | +| 0x04 | 000 | BDB_COUNTER | 调试 | +| 0x30 | 011 | IM2COL | 计算 | +| 0x31 | 011 | TRANSPOSE | 计算 | +| 0x32 | 011 | RELU | 计算 | +| 0x33 | 011 | QUANT | 计算 
| +| 0x34 | 011 | DEQUANT | 计算 | +| 0x35 | 011 | PRELOAD | ExUnit | +| 0x36 | 011 | BDB_BACKDOOR | 调试 | +| 0x40 | 100 | MATMUL_WARP16 | 计算 | +| 0x41 | 100 | SYSTOLIC | 计算 | +| 0x42 | 100 | COMPUTE_PRELOADED | ExUnit | +| 0x43 | 100 | COMPUTE_ACCUMULATED | ExUnit | +| 0x50–0x57 | 101 | Loop WS Config / Loop Trigger (Matrix) | 配置 | +| 0x60–0x69 | 110 | Loop Conv Config / Loop Trigger (Conv) | 配置 | ### 执行路径 @@ -50,28 +60,28 @@ GemminiBall 根据 `funct7` 字段分发指令: **配置路径**(循环配置): - 立即响应(单周期) - 存储配置寄存器 -- 包含 ROB 元数据追踪 +- 为元数据关联包含 ROB 追踪 ## 指令格式 ### ExUnit 指令 -ExUnit 指令遵循标准 Blink 命令格式,具有 Gemmini 语义: +ExUnit 指令遵循标准 Blink 命令格式和 Gemmini 语义: ``` -字段 | 位 | 描述 -----------|-------|----------------------------------- -funct7 | [6:0] | 操作选择器 -rs2/cmd | [63:0] | 操作数/配置数据 -rs1 | [31:0] | 地址/寄存器文件指针 +字段 | 位 | 描述 +--------|-------|----------------------------------- +funct7 | [6:0] | 操作选择器 +rs2/cmd | [63:0]| 操作数/配置数据 +rs1 | [31:0]| 地址/寄存器文件指针 ``` ### 循环配置指令 -循环配置使用立即数模式,操作数编码如下: +循环配置使用立即数模式与操作数编码: ``` -指令:funct7 | rs2_data(特殊) +指令: funct7 | rs2_data (特殊) funct7 0x50: max_i [47:32], max_j [31:16], max_k [15:0] funct7 0x51: dram_addr_a [38:0] funct7 0x52: dram_addr_b [38:0] @@ -83,11 +93,11 @@ funct7 0x56: stride_d [31:0], stride_c [63:32] ## 寄存器追踪 -GemminiBall 通过 `rob_id_reg` 追踪 ROB ID 以维护配置和执行阶段之间的元数据关联。这使得: +GemminiBall 通过 `rob_id_reg` 追踪 ROB ID,在配置和执行阶段之间维护元数据关联。这实现了: - 正确的结果路由到重排序缓冲区 -- 分块操作的子操作追踪 -- 管道化配置的一致状态管理 +- 管道化配置的子操作追踪 +- 分块操作的一致状态管理 ## 使用示例 @@ -127,27 +137,36 @@ GemminiBall 实现 `BlinkIO` 用于命令接收和响应: ### 重排序缓冲区(ROB) 结果包括: -- `rob_id`: 原始指令 ID,用于乱序执行 -- `is_sub`: 表示子操作状态 -- `sub_rob_id`: 复合操作的二级 ROB 追踪 +- `rob_id`: 用于乱序执行的原始指令 ID +- `is_sub`: 指示子操作状态 +- `sub_rob_id`: 用于组合操作的辅助 ROB 追踪 -## 最近的改进 +## 最近的增强 ### funct7 编码更新(最新) -最近的提交重构了 funct7 编码方案以: -- 将 ExUnit 指令(立即响应路径)与循环控制分离 -- 与更新的 DISA(领域特定 ISA)规范对齐 -- 支持用于调试和分析的新增银行使能追踪 +最近的提交(2026 年 3 月)更新了 funct7 编码方案以: +- 在 [6:4] 中编码银行使能位用于内存访问模式追踪 +- 支持新操作:IM2COL、TRANSPOSE、RELU、QUANT、DEQUANT 和 MATMUL_WARP16 +- 与更新的 DISA(特定领域 ISA)规范对齐 +- 
启用带银行访问可视化的指令追踪 ### 指令追踪 银行使能支持: -- 内存访问模式可视化 -- 每个内存银行的性能分析 -- 数据依赖关系的调试 +- 按银行的内存访问模式可视化 +- 内存操作的性能分析 +- 数据依赖性和银行冲突的调试 -## 参见 +### 循环展开器 -- [Buckyball ISA 文档](../Overview/Buckyball%20ISA.md) -- [构建自己的硬件设计](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) +最近的 GemminiBall 增强添加了: +- **LoopMatmulUnroller**: 具有可配置边界的分块矩阵乘法 +- **LoopConvUnroller**: 具有灵活地址生成的卷积循环展开 +- 两者都支持任意循环嵌套和步长内存访问 + +## 相关文档 + +- [Goban 多核架构](Goban%20Multi-Core%20Architecture.md) — 使用 GemminiBall 的多核配置 +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) +- [Building Your Own Hardware Designs](../Tutorial/Building%20Your%20Own%20Hardware%20Designs.md) diff --git a/content/zh/Guide/Goban Multi-Core Architecture.md b/content/zh/Guide/Goban Multi-Core Architecture.md new file mode 100644 index 0000000..5bf1cd1 --- /dev/null +++ b/content/zh/Guide/Goban Multi-Core Architecture.md @@ -0,0 +1,225 @@ +# Goban 多核架构 + +## 概述 + +Goban 是 Buckyball 中的一个多核 BBTile 配置,支持 SPMD(单程序多数据)工作负载的并行执行。每个 BBTile 包含多个 Rocket 核,每个核都配置有自己的 BuckyballAccelerator。所有加速器在一个瓦片内共享单个 SharedMemBackend 和 BarrierUnit 以进行同步。 + +## 架构概览 + +### 瓦片结构 + +``` +┌─────────────────────────────────────┐ +│ BBTile (Goban) │ +├─────────────────────────────────────┤ +│ 核心0 │ 核心1 │ ...│ 核心N-1 │ +│Rocket+│Rocket+│ │Rocket+ │ +│加速 │加速 │ │加速 │ +├──────┴───────┴────┴────────────────┤ +│ SharedMemBackend + BarrierUnit │ +└─────────────────────────────────────┘ +``` + +### 配置变体 + +**BuckyballGobanConfig** +- 1 个 BBTile × 4 核 +- 4 个 Rocket 核 + 4 个 BuckyballAccelerator +- 单个 SharedMem + BarrierUnit + +**BuckyballGoban2TileConfig** +- 2 个 BBTile × 4 核 = 8 个核心 +- 8 个 Rocket 核 + 8 个 BuckyballAccelerator +- 每瓦片的 SharedMem + BarrierUnit + +## 核心组件 + +### 多核执行 + +每个核心独立执行相同程序,访问: +- 本地寄存器文件 +- 私有指令缓存 +- 配套的 BuckyballAccelerator 用于硬件操作 +- 用于核间通信的共享内存 +- 通过 CSR `mhartid` 获取 Hart ID(硬件线程 ID) + +### 屏障单元 + +BarrierUnit 通过 `bb_barrier()` 内置函数提供硬件级同步: + +- 在所有核到达屏障前,暂停瓦片内所有核 +- 单周期同步开销 +- 支持同一程序中的多个连续屏障 +- 对于 SPMD 算法协调至关重要 + +### 共享内存后端 + 
+SharedMemBackend 管理来自瓦片内所有核的内存操作: + +- 仲裁来自多个核的内存请求 +- 为共享数据结构维护一致性 +- 处理内存映射 I/O (MMIO) 用于瓦片间通信 +- 支持同步原语的原子操作 + +## 编程模型 + +### SPMD 执行模式 + +```c +#include "goban.h" + +int main(void) { + int cid = bb_get_core_id(); // 获取 hart ID [0, nCores-1] + + // 阶段 1:每核计算 + int local_result = compute(cid, input_data); + + // 阶段 2:同步 + bb_barrier(); + + // 阶段 3:共享结果处理 + if (cid == 0) { + process_all_results(local_result); + } + + bb_barrier(); // 确保所有核到达退出 + + return 0; +} +``` + +### 核心标识 + +```c +static inline int bb_get_core_id(void) { + int hartid; + asm volatile("csrr %0, mhartid" : "=r"(hartid)); + return hartid; +} +``` + +返回 hart ID,范围为 `[0, nCores-1]`。在 Goban 配置中,这直接映射到瓦片内的核心索引。 + +## 测试工作负载 + +### barrier_test.c + +多核屏障同步冒烟测试: + +1. 每个核设置 `arrived[cid] = 1` +2. 所有核执行 `bb_barrier()` +3. 每个核验证所有 `arrived[]` 标志已设置 +4. 再用 `bb_barrier()` 重复一次 +5. 核心 0 打印最终结果 + +正确性检查:模拟不能挂起,所有核必须完成。 + +### barrier_mvin_test.c + +结合屏障同步和加速器操作(mvins): + +- 测试内存屏障协调与硬件加速的配合 +- 验证跨核的数据一致性 +- 验证 BarrierUnit 在飞行中加速器操作期间的阻塞 + +## 与 Buckyball 的集成 + +### 系统总线 + +Goban 使用 128 位系统总线(相对于 toy 的较窄总线),以适应多核工作负载的更高内存带宽需求。 + +### build.sc 中的配置 + +Goban 在 `arch/src/main/scala/examples/goban/CustomConfigs.scala` 中定义为配置目标: + +```scala +object GobanConfig { + val nCores: Int = 4 + + def apply(): GlobalConfig = { + val base = GlobalConfig() + base.copy(top = base.top.copy(nCores = nCores)) + } +} + +class BuckyballGobanConfig + extends Config( + new WithNBBTiles(1, buckyballConfig = GobanConfig()) ++ + new chipyard.config.WithSystemBusWidth(128) ++ + new chipyard.config.AbstractConfig + ) +``` + +### 运行 Goban 工作负载 + +```bash +# 使用 Goban 配置(1 瓦片,4 核)进行模拟 +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGobanVerilatorConfig \ + --batch' + +# 使用 Goban2Tile 配置(2 瓦片,8 核)进行模拟 +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGoban2TileVerilatorConfig \ + --batch' +``` + +## 设计考虑 + +### 可扩展性 + +Goban 支持具有 1 个或更多 BBTile 
的配置: +- 每个瓦片独立运行 +- 通过实例化多个 `WithNBBTiles` 配置可扩展瓦片 +- 内存带宽随系统总线宽度增长 + +### 同步开销 + +BarrierUnit 为屏障操作提供硬件加速: +- 无需忙等待 +- 单周期屏障(所有核到达后) +- 对于算法阶段边界的批量同步效率高 + +### 数据布局 + +为与多核内存访问性能最优化: +- 使用银行条纹布局分散负载 +- 将数据结构对齐到缓存行边界 +- 避免共享数组中的伪共享 + +## 性能分析 + +使用指令追踪和银行使能信号(来自 GemminiBall 追踪增强)分析: + +- 每核指令流 +- 跨核的内存访问模式 +- 屏障停滞时间 +- 每核的加速器利用率 + +## 故障排查 + +### 模拟在屏障处挂起 + +- 检查所有核是否以正确的 hart ID 到达屏障 +- 验证 `nCores` 是否与测试程序中的屏障数组大小匹配 +- 确保 BarrierUnit 未在内存操作上死锁 + +### 共享数据不一致 + +- 向共享变量添加 volatile 关键字 +- 在共享内存访问前后插入屏障 +- 检查缓存一致性问题(查看内存后端日志) + +### 性能问题 + +- 使用波形追踪分析屏障停滞时间 +- 使用追踪数据验证跨核的负载均衡 +- 考虑为内存绑定的工作负载增加系统总线宽度 + +## 相关文档 + +- [Development Workflow and Build System](Development%20Workflow%20and%20Build%20System.md) — 构建和模拟 Goban 配置 +- [Buckyball ISA Documentation](../Overview/Buckyball%20ISA.md) — RISC-V + Blink ISA 详情 +- [GemminiBall Architecture](GemminiBall%20Architecture.md) — Goban 中的加速器操作 From 401b4213b0a9ddd3d3daf113c7e3c83ce29e9004 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Tue, 17 Mar 2026 21:53:56 +0000 Subject: [PATCH 3/4] docs: update Development Workflow guide with Goban multi-core simulation info - Add Goban configuration options (BuckyballGobanVerilatorConfig, BuckyballGoban2TileVerilatorConfig) - Include example Goban barrier test command - Add reference link to Goban Multi-Core Architecture documentation - Update both en and zh versions Co-authored-by: Shiroha --- .../Development Workflow and Build System.md | 16 ++++++++++++++++ .../Development Workflow and Build System.md | 16 ++++++++++++++++ 2 files changed, 32 insertions(+) diff --git a/content/en/Guide/Development Workflow and Build System.md b/content/en/Guide/Development Workflow and Build System.md index 58ea9da..25845b3 100644 --- a/content/en/Guide/Development Workflow and Build System.md +++ b/content/en/Guide/Development Workflow and Build System.md @@ -100,8 +100,24 @@ bbdev verilator --run \ Available configurations in `sims/verilator/`: - `BuckyballToyVerilatorConfig`: Single-core 
configuration for unit testing +- `BuckyballGobanVerilatorConfig`: Multi-core configuration (1 tile, 4 cores), each core paired with its own accelerator +- `BuckyballGoban2TileVerilatorConfig`: Multi-tile configuration (2 tiles, 8 cores) for SPMD workloads - Custom configs: Define in Scala configuration files +**Multi-core Simulation:** + +For Goban-based configurations, workloads must implement SPMD patterns with hardware barrier synchronization: + +```bash +# Run Goban multi-core test +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGobanVerilatorConfig \ + --batch' +``` + +See [Goban Multi-Core Architecture](Goban%20Multi-Core%20Architecture.md) for detailed multi-core programming guidance. + ## Code Organization ### Directory Structure diff --git a/content/zh/Guide/Development Workflow and Build System.md b/content/zh/Guide/Development Workflow and Build System.md index 67ece66..31493ef 100644 --- a/content/zh/Guide/Development Workflow and Build System.md +++ b/content/zh/Guide/Development Workflow and Build System.md @@ -100,8 +100,24 @@ bbdev verilator --run \ `sims/verilator/` 中的可用配置: - `BuckyballToyVerilatorConfig`: 用于单元测试的单核配置 +- `BuckyballGobanVerilatorConfig`: 多核配置(1 瓦片,4 核),每个核配有独立的加速器 +- `BuckyballGoban2TileVerilatorConfig`: 多瓦片配置(2 瓦片,8 核)用于 SPMD 工作负载 - 自定义配置: 在 Scala 配置文件中定义 +**多核模拟:** + +对于基于 Goban 的配置,工作负载必须实现具有硬件屏障同步的 SPMD 模式: + +```bash +# 运行 Goban 多核测试 +bbdev verilator --run \ + '--binary barrier_test-baremetal \ + --config sims.verilator.BuckyballGobanVerilatorConfig \ + --batch' +``` + +有关详细的多核编程指南,请参阅 [Goban 多核架构](Goban%20Multi-Core%20Architecture.md)。 + ## 代码组织 ### 目录结构 From b7a619207c798847c9745209416cc2a9d26b06fc Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Tue, 17 Mar 2026 21:54:41 +0000 Subject: [PATCH 4/4] docs: enhance Overview with architecture section and Goban references - Add Architecture & Key Concepts section in both en and zh Overview - Document core components: Ball, Blink, BBTile, BarrierUnit -
Reference Toy and Goban configuration models - Add Goban barrier test example to Quick Start - Link to detailed architecture guides (ISA, GemminiBall, Goban) Co-authored-by: Shiroha --- content/en/Overview/Overview.md | 28 +++++++++++++++++++++++----- content/zh/Overview/Overview.md | 25 +++++++++++++++++++++++++ 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/content/en/Overview/Overview.md b/content/en/Overview/Overview.md index c832f3f..e3ae98c 100644 --- a/content/en/Overview/Overview.md +++ b/content/en/Overview/Overview.md @@ -39,17 +39,35 @@ Run Verilator simulation test to verify installation: bbdev verilator --run '--jobs 16 --binary ctest_vecunit_matmul_ones_singlecore-baremetal --config sims.verilator.BuckyballToyVerilatorConfig --batch' ``` +For multi-core testing, try the Goban configuration: +```bash +bbdev verilator --run '--binary barrier_test-baremetal --config sims.verilator.BuckyballGobanVerilatorConfig --batch' +``` + + +## Architecture & Key Concepts - +- **Ball**: Customizable accelerator module (e.g., GemminiBall for matrix operations) +- **Blink**: Standard interface for Ball instruction dispatch and result handling +- **BBTile**: Tile containing Rocket cores paired with accelerators and shared memory +- **BarrierUnit**: Hardware synchronization primitive for multi-core workloads +### Configuration Models + +- **Toy**: Single-core reference configuration for development and testing +- **Goban**: Multi-core configuration supporting SPMD parallel workloads with hardware barriers + +For detailed architecture information, see: +- [Buckyball ISA Documentation](Buckyball%20ISA.md) +- [Goban Multi-Core Architecture](../Guide/Goban%20Multi-Core%20Architecture.md) +- [GemminiBall Architecture](../Guide/GemminiBall%20Architecture.md) ## Tutorial + You can start to learn ball and blink from [here](https://github.com/DangoSys/buckyball/blob/main/docs/bb-note/src/tutorial/tutorial.md) ## Additional Resources diff --git 
a/content/zh/Overview/Overview.md b/content/zh/Overview/Overview.md index 57e3100..bc0b7ee 100644 --- a/content/zh/Overview/Overview.md +++ b/content/zh/Overview/Overview.md @@ -38,6 +38,11 @@ nix develop bbdev verilator --run '--jobs 16 --binary ctest_vecunit_matmul_ones_singlecore-baremetal --config sims.verilator.BuckyballToyVerilatorConfig --batch' ``` +对于多核测试,尝试 Goban 配置: +```bash +bbdev verilator --run '--binary barrier_test-baremetal --config sims.verilator.BuckyballGobanVerilatorConfig --batch' +``` + +## 架构与核心概念 + +Buckyball 的模块化架构支持灵活的硬件加速器设计: + +### 核心组件 + +- **Ball**: 可定制的加速器模块(例如,用于矩阵运算的 GemminiBall) +- **Blink**: Ball 指令分发和结果处理的标准接口 +- **BBTile**: 包含与加速器和共享内存配对的 Rocket 核的瓦片 +- **BarrierUnit**: 用于多核工作负载的硬件同步原语 + +### 配置模型 + +- **Toy**: 用于开发和测试的单核参考配置 +- **Goban**: 支持 SPMD 并行工作负载和硬件屏障的多核配置 + +有关详细的架构信息,请参阅: +- [Buckyball ISA Documentation](Buckyball%20ISA.md) +- [Goban 多核架构](../Guide/Goban%20Multi-Core%20Architecture.md) +- [GemminiBall 架构](../Guide/GemminiBall%20Architecture.md) ## 教程 您可以从[这里](https://github.com/DangoSys/buckyball/blob/main/docs/bb-note/src/tutorial/tutorial.md)开始学习 ball 和 blink。