AutoGPTQ: Add UI and command line support for disabling fused attention and fused MLP (#2648)

This commit is contained in:
Tom Jobbins 2023-06-16 03:59:54 +01:00 committed by GitHub
parent 909d8c6ae3
commit 646b0c889f
5 changed files with 11 additions and 3 deletions


@ -249,8 +249,10 @@ Optionally, you can use the following command-line flags:
| Flag | Description |
|------------------|-------------|
| `--triton` | Use Triton. |
| `--no_inject_fused_attention` | Disable the use of fused attention, which will use less VRAM at the cost of slower inference. |
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--desc_act` | For models that don't have a quantize_config.json, this flag determines whether desc_act is set in BaseQuantizeConfig. |
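As a rough sketch of how the two new `no_*` flags behave, the snippet below parses them with `argparse` and inverts them into positive loader options. The option names in `load_options` are assumptions for illustration, not taken from the repository's actual loading code:

```python
import argparse

# Hypothetical sketch of the flag handling: the flag names match the table
# above, but the mapping into loader options is an assumed illustration.
parser = argparse.ArgumentParser()
parser.add_argument('--triton', action='store_true')
parser.add_argument('--no_inject_fused_attention', action='store_true')
parser.add_argument('--no_inject_fused_mlp', action='store_true')

args = parser.parse_args(['--triton', '--no_inject_fused_mlp'])

# The "no_*" flags disable a feature, so they invert into the
# positive options a model loader would expect.
load_options = {
    'use_triton': args.triton,
    'inject_fused_attention': not args.no_inject_fused_attention,
    'inject_fused_mlp': not args.no_inject_fused_mlp,
}
print(load_options)
```

Passing `--no_inject_fused_mlp` here leaves fused attention enabled but turns fused MLP off, trading inference speed for lower VRAM use, as described in the table.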
#### GPTQ-for-LLaMa