Problem
The original train.py selected the FlashAttention-3 (FA3) kernel repo by GPU compute capability:
varunneal/flash-attention-3 for Hopper (SM 9.0)
kernels-community/flash-attn3 for others
The V4 rewrite hardcoded varunneal/flash-attention-3 for all GPUs. Fixed in 4f3fbf7.
Note: neither repo supports Blackwell yet, so those GPUs fall through to FlexAttention.