AutoGPTQ: Add --disable_exllamav2 flag (Mixtral CPU offloading needs this)

2023-12-15 06:46:13 -08:00
commit 3bbf6c601d
parent 7de10f4c8e
7 changed files with 16 additions and 4 deletions
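
As a rough illustration of what a flag like this does, the sketch below defines `--disable_exllamav2` next to the existing `--disable_exllama` and turns both into loader keyword arguments. The helpers `build_parser` and `loader_kwargs` are hypothetical names, not part of this commit, and the assumption that the loader takes keyword arguments with the same names as the flags is not confirmed by the diff shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-alone parser; the real project defines its flags elsewhere.
    parser = argparse.ArgumentParser()
    parser.add_argument("--disable_exllama", action="store_true",
                        help="Disable ExLlama kernel.")
    parser.add_argument("--disable_exllamav2", action="store_true",
                        help="Disable ExLlamav2 kernel.")
    return parser


def loader_kwargs(args: argparse.Namespace) -> dict:
    # Assumption: the model loader accepts keyword arguments
    # named after the command-line flags.
    return {
        "disable_exllama": args.disable_exllama,
        "disable_exllamav2": args.disable_exllamav2,
    }


if __name__ == "__main__":
    args = build_parser().parse_args(["--disable_exllamav2"])
    print(loader_kwargs(args))
```

Both flags default to off, so the kernels stay enabled unless a user opts out explicitly, which matches the "Disable ..." wording in the README table.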
--- a/README.md
+++ b/README.md
@@ -285,6 +285,7 @@ List of command-line flags
 | `--no_use_cuda_fp16`           | This can make models faster on some systems. |
 | `--desc_act`                   | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
 | `--disable_exllama`            | Disable ExLlama kernel, which can improve inference speed on some systems. |
+| `--disable_exllamav2`          | Disable ExLlamav2 kernel. |

 #### GPTQ-for-LLaMa