Experimental jank multiGPU inference that's 2x faster than native somehow (#2100)
parent fd743a0207
commit 1f50dbe352
4 changed files with 10 additions and 3 deletions
@@ -107,6 +107,8 @@ This is the performance:
Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
```
You can also use multiple GPUs with `pre_layer` when using the oobabooga fork of GPTQ. For example, `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and offload the remaining 20 layers to the CPU. A sketch of both launch commands follows below.
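
As a concrete illustration, the commands below show how these splits might look in a typical launch; the model directory name and the `--wbits`/`--groupsize` values are assumptions for a common 4-bit GPTQ quantization and may differ in your setup.

```
# Split the 60 layers of a LLaMA-30B model evenly across two GPUs
# (model name, wbits, and groupsize are illustrative)
python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --pre_layer 30 60

# 20 layers on GPU-0, 20 on GPU-1, remaining 20 offloaded to the CPU
python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --pre_layer 20 40
```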
## Using LoRAs in 4-bit mode
At the moment, this feature is not officially supported by the relevant libraries, but a patch exists and is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit