AutoGPTQ: Add UI and command line support for disabling fused attention and fused MLP (#2648)

This commit is contained in:
Tom Jobbins 2023-06-16 03:59:54 +01:00 committed by GitHub
parent 909d8c6ae3
commit 646b0c889f
5 changed files with 11 additions and 3 deletions


@ -249,8 +249,10 @@ Optionally, you can use the following command-line flags:
| Flag | Description |
|------------------|-------------|
| `--triton` | Use Triton. |
| `--no_inject_fused_attention` | Disable the use of fused attention, which will use less VRAM at the cost of slower inference. |
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--desc_act` | For models that don't have a quantize_config.json, this flag determines whether desc_act is set in BaseQuantizeConfig. |
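As a rough sketch of how the two new `no_*` flags behave, the snippet below parses them with `argparse` and inverts them into positive loader options. The option names in `load_options` are assumptions for illustration, not taken from the repository's actual loading code:

```python
import argparse

# Hypothetical sketch of the flag handling: the flag names match the table
# above, but the mapping into loader options is an assumed illustration.
parser = argparse.ArgumentParser()
parser.add_argument('--triton', action='store_true')
parser.add_argument('--no_inject_fused_attention', action='store_true')
parser.add_argument('--no_inject_fused_mlp', action='store_true')

args = parser.parse_args(['--triton', '--no_inject_fused_mlp'])

# The "no_*" flags disable a feature, so they invert into the
# positive options a model loader would expect.
load_options = {
    'use_triton': args.triton,
    'inject_fused_attention': not args.no_inject_fused_attention,
    'inject_fused_mlp': not args.no_inject_fused_mlp,
}
print(load_options)
```

Passing `--no_inject_fused_mlp` here leaves fused attention enabled but turns fused MLP off, trading inference speed for lower VRAM use, as described in the table.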
#### GPTQ-for-LLaMa