Add the --disable_exllama option for AutoGPTQ (#3545 from clefever/disable-exllama)

This commit is contained in:
oobabooga 2023-08-14 15:15:55 -03:00 committed by GitHub
commit ccfc02a28d
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
7 changed files with 8 additions and 2 deletions

View file

@ -270,6 +270,7 @@ Optionally, you can use the following command-line flags:
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--no_use_cuda_fp16` | This can make models faster on some systems. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
| `--disable_exllama` | Disable ExLlama kernel, which can improve inference speed on some systems. |
#### ExLlama