Add gpu_split param to ExLlama

Adapted from code created by Ph0rk0z. Thank you Ph0rk0z.
oobabooga 2023-06-16 20:49:36 -03:00
parent cb9be5db1c
commit 5f392122fd
6 changed files with 20 additions and 4 deletions


@@ -267,6 +267,12 @@ Optionally, you can use the following command-line flags:
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
#### ExLlama
| Flag | Description |
|------------------|-------------|
| `--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. `20,7,7` |
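The flag value is a plain comma-separated list, so it can be turned into per-device allocations with a one-line parse. The helper below is a hypothetical sketch of that step (`parse_gpu_split` is not a function from the actual ExLlama loader):

```python
def parse_gpu_split(value):
    """Parse a --gpu-split value like "20,7,7" into per-device VRAM
    allocations in GB (hypothetical helper, not ExLlama's own code)."""
    return [float(part) for part in value.split(",") if part.strip()]

# "20,7,7" -> first GPU may use 20 GB, the second and third 7 GB each
allocations = parse_gpu_split("20,7,7")
print(allocations)
```

The list is positional: the first number applies to device 0, the second to device 1, and so on.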
#### GPTQ-for-LLaMa
| Flag | Description |