Experimental jank multiGPU inference that's 2x faster than native somehow (#2100)

Alex "mcmonkey" Goodwin 2023-05-17 06:41:09 -07:00 committed by GitHub
parent fd743a0207
commit 1f50dbe352
4 changed files with 10 additions and 3 deletions

@@ -107,6 +107,8 @@ This is the performance:
Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
```
You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, e.g. `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and offload the remaining 20 layers to the CPU.
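A minimal sketch of this cumulative interpretation, assuming each `--pre_layer` value marks the exclusive upper layer boundary for the corresponding GPU and everything past the last boundary stays on the CPU (hypothetical helper, not the actual GPTQ fork code):
```python
# Hypothetical illustration of how cumulative --pre_layer values could map
# transformer layers to devices; not the real loader implementation.

def assign_layer_devices(pre_layer, total_layers):
    """Return a device label for each layer index.

    pre_layer is a list of cumulative layer counts, one entry per GPU,
    e.g. [30, 60] -> layers 0-29 on cuda:0, layers 30-59 on cuda:1.
    Layers beyond the last boundary are offloaded to the CPU.
    """
    devices = []
    for layer in range(total_layers):
        for gpu_index, boundary in enumerate(pre_layer):
            if layer < boundary:
                devices.append(f"cuda:{gpu_index}")
                break
        else:
            devices.append("cpu")
    return devices


if __name__ == "__main__":
    # LLaMA-30B has 60 transformer layers; --pre_layer 20 40 leaves 20 on the CPU.
    mapping = assign_layer_devices([20, 40], total_layers=60)
    print(mapping[0], mapping[20], mapping[59])  # cuda:0 cuda:1 cpu
```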
## Using LoRAs in 4-bit mode
At the moment, this feature is not officially supported by the relevant libraries, but a patch exists and is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit