Experimental jank multiGPU inference that's 2x faster than native somehow (#2100)

Alex "mcmonkey" Goodwin 2023-05-17 06:41:09 -07:00 committed by GitHub
parent fd743a0207
commit 1f50dbe352
4 changed files with 10 additions and 3 deletions

@@ -107,6 +107,8 @@ This is the performance:
Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
```
You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, e.g. `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and offload the remaining 20 layers to the CPU.
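A minimal sketch of this cumulative interpretation, assuming each `--pre_layer` value marks the exclusive upper layer boundary for the corresponding GPU and everything past the last boundary stays on the CPU (hypothetical helper, not the actual GPTQ fork code):
```python
# Hypothetical illustration of how cumulative --pre_layer values could map
# transformer layers to devices; not the real loader implementation.

def assign_layer_devices(pre_layer, total_layers):
    """Return a device label for each layer index.

    pre_layer is a list of cumulative layer counts, one entry per GPU,
    e.g. [30, 60] -> layers 0-29 on cuda:0, layers 30-59 on cuda:1.
    Layers beyond the last boundary are offloaded to the CPU.
    """
    devices = []
    for layer in range(total_layers):
        for gpu_index, boundary in enumerate(pre_layer):
            if layer < boundary:
                devices.append(f"cuda:{gpu_index}")
                break
        else:
            devices.append("cpu")
    return devices


if __name__ == "__main__":
    # LLaMA-30B has 60 transformer layers; --pre_layer 20 40 leaves 20 on the CPU.
    mapping = assign_layer_devices([20, 40], total_layers=60)
    print(mapping[0], mapping[20], mapping[59])  # cuda:0 cuda:1 cpu
```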
## Using LoRAs in 4-bit mode
At the moment, this feature is not officially supported by the relevant libraries, but a patch exists and is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit