# r/LocalLLaMA Technical Insights

Technical blog articles distilled from high-scoring r/LocalLLaMA posts (2025). Generated by Nemotron 9B running locally on vLLM.

**Disclaimer:** These articles are based on unverified community information from Reddit. Numbers, benchmarks, and claims are self-reported by the original posters. Always verify before relying on any data.

## Articles

### vLLM & Inference

| # | Article | Reddit Score |
|---|---------|--------------|
| 00 | Don't Waste Electricity Running vLLM — Use This Patch | 303 |
| 05 | Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM | 50 |
| 06 | DeepSeek Open-Sources nano-vLLM | 621 |
| 07 | GH200 Desktop: vLLM Tuning Notes (TP vs PP, max-num-seqs) | 648 |
| 08 | Megakernel Doubles Batch-1 Inference Speed | 73 |

### Quantization & FP8/NVFP4

| # | Article | Reddit Score |
|---|---------|--------------|
| 03 | Software FP8: 3x Speedup Without Hardware Support | 266 |
| 04 | 8+ Hours Benchmarking Every MoE Backend for Qwen3.5-397B NVFP4 | 223 |
| 11 | NVIDIA NVFP4: 4-bit Pretraining Matches FP8 Accuracy | 808 |

### Hardware & Multi-GPU

| # | Article | Reddit Score |
|---|---------|--------------|
| 02 | Dual-GPU Boosts Speed Despite Common Wisdom (5090 vs H100) | 161 |
| 09 | Patched P2P Driver Enables Multi-5090 Systems | 86 |
| 12 | Qwen3-30B FP8 on RTX Pro 6000 Blackwell: 88.4 tok/s | 96 |
| 13 | RTX Pro 6000 vLLM Benchmark: 120B Model Analysis | 173 |

### KV Cache & Optimization

| # | Article | Reddit Score |
|---|---------|--------------|
| 01 | KV Cache RAM Swap is ~10x Faster Than Recomputation | 220 |
| 14 | LMCache: Reuse Non-Prefix KV Cache, 3x RAG Speedup | 127 |

### Setup Guides

| # | Article | Reddit Score |
|---|---------|--------------|
| 10 | Qwen3-Next 80B FP8 on WSL2 + vLLM + Docker (Blackwell) | 86 |

## How This Was Made

1. Downloaded 90K r/LocalLLaMA posts via Arctic Shift
2. Filtered to 485 high-quality technical posts (score >= 20, posted in 2025 or later, tech-signal detection)
3. Selected the 15 posts most relevant to the vLLM / Blackwell / FP8 stack
4. Generated articles with Nemotron 9B Japanese on local vLLM (RTX 5090)
5. Pipeline orchestrated by Claude Code (Opus 4.6)
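The filtering step above could be sketched roughly as follows. This is a minimal illustration, not the repo's actual pipeline: the field names (`score`, `created_utc`, `title`, `selftext`) follow Reddit's standard post JSON, and the keyword check is a simplified stand-in for whatever tech-signal detection was really used.

```python
from datetime import datetime, timezone

# Hypothetical keyword list standing in for the real tech-signal detector.
TECH_KEYWORDS = {"vllm", "fp8", "nvfp4", "quantization", "benchmark", "kv cache"}

def is_high_quality(post: dict, min_score: int = 20, min_year: int = 2025) -> bool:
    """Apply the score / recency / tech-signal filters described above."""
    if post.get("score", 0) < min_score:
        return False
    created = datetime.fromtimestamp(post.get("created_utc", 0), tz=timezone.utc)
    if created.year < min_year:
        return False
    text = (post.get("title", "") + " " + post.get("selftext", "")).lower()
    return any(kw in text for kw in TECH_KEYWORDS)

# Toy data: one post passes, one fails the score cutoff, one is too old.
posts = [
    {"score": 303, "created_utc": 1757000000, "title": "vLLM patch", "selftext": ""},
    {"score": 5, "created_utc": 1757000000, "title": "vLLM question", "selftext": ""},
    {"score": 100, "created_utc": 1500000000, "title": "FP8 tricks", "selftext": ""},
]
filtered = [p for p in posts if is_high_quality(p)]
```

Running this on the toy data keeps only the first post, mirroring how 90K raw posts were cut down to a small high-signal set.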

## Contributing

Found an error? Have additional context? Issues and corrections are welcome!

## License

Articles are derivative of Reddit posts (user-generated content) and are shared for educational purposes.
