A lightweight, educational implementation of Gated Recurrent Units (GRUs) using only binary operations! Instead of floating-point arithmetic, this project explores what happens when we constrain neural networks to work with just 1s and 0s.
Traditional neural networks use floating-point numbers and backpropagation. But what if we tried something completely different? This project implements:
- Binary-only operations: All weights are single bits (0 or 1)
- GRU architecture: A proven recurrent neural network design
- Evolutionary training: We evolve the network rather than using gradients
- Blazing fast: Operations on bits are incredibly efficient
Think of it as neural networks meets genetic algorithms, all running on pure binary logic!
The network consists of stacked binary GRU layers, where each gate (update, reset, and candidate) operates entirely on bit matrices.
**Multiple Training Strategies**
- Genetic Algorithm: Explores diverse solutions through crossover and mutation
- Simulated Annealing: Gradually refines solutions with temperature cooling
- Evolution Strategy: Self-adaptive mutation for efficient local search
**Optimized Performance**
- 64-bit word operations instead of byte-level processing
- Hardware-accelerated bit counting (`popcount`)
- Vectorized bitwise operations
- Cache-friendly matrix layouts
**Ready-to-Use Toy Tasks**
- Echo: Learn to copy sequences
- Delayed Echo: Remember inputs from N steps ago
- Parity: Count bits and track odd/even state
- XOR: Compute temporal XOR operations
- Counter: Maintain a binary counter
# You'll need:
# - C++20 compatible compiler (GCC 10+, Clang 11+, or MSVC 2019+)
# - CMake 3.23 or higher

# Clone the repository
git clone https://github.com/yourusername/bitrnn.git
cd bitrnn
# Build with CMake
mkdir build && cd build
cmake ..
make
# Run it!
./bitrnn

# Use the genetic algorithm (default)
./bitrnn 0
# Or try simulated annealing (often works well)
./bitrnn 1
# Or the evolution strategy
./bitrnn 2
# Compare all three methods
./bitrnn --benchmark

A standard GRU has three gates that control information flow:
z = σ(Wz·x + Uz·h) # Update gate
r = σ(Wr·x + Ur·h) # Reset gate
h̃ = tanh(Wh·x + Uh·(r⊙h)) # Candidate hidden state
h = (1-z)⊙h + z⊙h̃ # New hidden state
In our binary version, we replace:
- Matrix multiplication → Binary matmul with majority voting
- Sigmoid activation → Threshold function
- Tanh activation → Threshold function
- Element-wise multiply → Bitwise AND
- Addition → Bitwise OR (or threshold addition)
// Traditional: C = A × B (with floats)
// Binary: Count AND bits, threshold at majority
for (i, j):
sum = popcount(row_i(A) AND col_j(B))
C[i][j] = (sum >= threshold) ? 1 : 0

This is incredibly fast because:
- We process 64 bits at a time using `uint64_t`
- Modern CPUs have native `popcount` instructions
- No floating-point operations needed!
Since we can't use backpropagation (no gradients in binary!), we evolve the network:
Genetic Algorithm:
- Create a population of networks
- Breed the best performers (crossover)
- Randomly flip bits (mutation)
- Repeat for many generations
Simulated Annealing:
- Start with high "temperature" (accept many changes)
- Randomly flip bits and test fitness
- Gradually cool down (become more selective)
- Accept worse solutions probabilistically to escape local optima
Evolution Strategy:
- Keep the best network
- Create offspring by flipping bits
- Adapt mutation rate based on success (1/5 rule)
// Easy: Echo task (just copy the input)
EchoTask::generate(sequences, targets, 20, 4, 4);
StackedBitGRU model(4, {8, 4});
// Medium: Delayed echo (remember from 2 steps ago)
DelayedEchoTask::generate(sequences, targets, 30, 6, 4, 2);
StackedBitGRU model(4, {16, 8, 4});
// Hard: Parity tracking (count all 1s seen)
ParityTask::generate(sequences, targets, 30, 6, 4);
StackedBitGRU model(4, {16, 8, 1});
// Very hard: Binary counter
CounterTask::generate(sequences, targets, 20, 8, 4);
StackedBitGRU model(1, {32, 16, 4});

// Genetic Algorithm - tune population and mutation
GeneticAlgorithm ga(
100, // population size
0.01, // mutation rate (1%)
0.15 // elite ratio (keep top 15%)
);
ga.train(model, sequences, targets, 200); // 200 generations
// Simulated Annealing - tune temperature schedule
SimulatedAnnealing sa(
2.0f, // initial temperature
0.9995f, // cooling rate
10 // bits to flip per iteration
);
sa.train(model, sequences, targets, 10000);
// Evolution Strategy - tune initial mutation size
OnePlusOneES es(15); // start with 15 bit flips
es.train(model, sequences, targets, 10000);

| Strategy | Best For | Speed | Quality | Tuning Difficulty |
|---|---|---|---|---|
| Genetic Algorithm | Complex problems, exploration | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Simulated Annealing | General purpose, good default | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
| Evolution Strategy | Quick prototyping, simple problems | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐ |
Recommendation: Start with Simulated Annealing (strategy 1). It's reliable and doesn't need much tuning!
- Start small: Use fewer training samples and shorter sequences initially
- Increase hidden size: More bits = more expressiveness (but slower)
- More iterations: Binary networks need many evaluations to converge
- Match task to architecture: Deeper networks for harder tasks
Problem: Network outputs all zeros
- Solution: This happens with unlucky initialization. Just restart training!
Problem: Training is too slow
- Solution: Reduce population size (GA) or number of flips per iteration (SA/ES)
Problem: Can't learn the task
- Solution: Increase hidden layer size or add more layers. Some tasks need more capacity!
bitrnn/
├── bitmatrix.h # Core binary matrix operations
├── fast_bitmatrix.h # Optimized 64-bit word operations
├── bitgru.h # Binary GRU cell implementation
├── stacked_bitgru.h # Multi-layer GRU network
├── helper.h # Matrix utilities (matmul, transpose, etc.)
├── evolutionary_trainer.h # Original simple trainer
├── optimized_evolutionary_trainer.h # Advanced training strategies
├── toy_datasets.h # Pre-built sequence tasks
├── bitrnn.cpp # Main entry point
└── CMakeLists.txt # Build configuration
- Architecture search: How many layers? How wide?
- Task difficulty: At what complexity do binary networks fail?
- Hybrid approaches: Can we combine training strategies?
- Quantization: Compare to traditional networks quantized to 1-bit
- Memory efficiency: How much can we compress?
This project is open source and available under the MIT License.
- Inspired by BinaryNet and XNOR-Net research
- GRU architecture from Cho et al. (2014)
- Evolution strategies from Rechenberg and Schwefel
- Learning to forget: Continual prediction with LSTM
- Empirical Evaluation of Gated Recurrent Neural Networks
- BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning
- No gradient information: Can't use advanced optimization tricks
- Local optima: Evolutionary methods can get stuck
- Sample inefficient: Needs many evaluations compared to gradient descent
- Limited capacity: Binary representations are fundamentally constrained
But that's part of the fun! It's about exploring what's possible with extreme constraints.
Happy bit-flipping! 🎲
If you use this project in work or teaching, I'd love to hear about it! Open an issue or drop me a message.
