Problem
The original train.py selected the FlashAttention-3 (FA3) kernel repo by GPU compute capability:
varunneal/flash-attention-3 for Hopper (SM 9.0)
kernels-community/flash-attn3 for others
The V4 rewrite hardcoded varunneal/flash-attention-3 for all GPUs. Fixed in 4f3fbf7.
Note: neither repo supports Blackwell yet, so those GPUs fall through to FlexAttention.