Add gpu_split param to ExLlama

Adapted from code created by Ph0rk0z. Thank you Ph0rk0z.
oobabooga 2023-06-16 20:49:36 -03:00
parent cb9be5db1c
commit 5f392122fd
6 changed files with 20 additions and 4 deletions


@@ -267,6 +267,12 @@ Optionally, you can use the following command-line flags:
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
#### ExLlama
| Flag | Description |
|------------------|-------------|
| `--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. `20,7,7` |
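The flag value is a plain comma-separated list, so it can be turned into per-device allocations with a one-line parse. The helper below is a hypothetical sketch of that step (`parse_gpu_split` is not a function from the actual ExLlama loader):

```python
def parse_gpu_split(value):
    """Parse a --gpu-split value like "20,7,7" into per-device VRAM
    allocations in GB (hypothetical helper, not ExLlama's own code)."""
    return [float(part) for part in value.split(",") if part.strip()]

# "20,7,7" -> first GPU may use 20 GB, the second and third 7 GB each
allocations = parse_gpu_split("20,7,7")
print(allocations)
```

The list is positional: the first number applies to device 0, the second to device 1, and so on.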
#### GPTQ-for-LLaMa
| Flag | Description |