Reorganize model loading UI completely (#2720)

2023-06-16 19:00:37 -03:00
commit 7ef6a50e84
parent 57be2eecdf
16 changed files with 365 additions and 243 deletions
--- a/README.md
+++ b/README.md
@@ -211,6 +211,12 @@ Optionally, you can use the following command-line flags:
 | `--extensions EXTENSIONS [EXTENSIONS ...]` | The list of extensions to load. If you want to load more than one extension, write the names separated by spaces. |
 | `--verbose`                                | Print the prompts to the terminal. |

+#### Model loader
+
+| Flag                                       | Description |
+|--------------------------------------------|-------------|
+| `--loader LOADER`                          | Choose the model loader manually, otherwise, it will get autodetected. Valid options: autogptq, gptq-for-llama, transformers, llamacpp, rwkv, flexgen |
+
 #### Accelerate/transformers

 | Flag                                        | Description |
@@ -265,7 +271,6 @@ Optionally, you can use the following command-line flags:

 | Flag                      | Description |
 |---------------------------|-------------|
-| `--gptq-for-llama` | Use GPTQ-for-LLaMa to load the GPTQ model instead of AutoGPTQ. |
 | `--wbits WBITS`           | Load a pre-quantized model with specified precision in bits. 2, 3, 4 and 8 are supported. |
 | `--model_type MODEL_TYPE` | Model type of pre-quantized model. Currently LLaMA, OPT, and GPT-J are supported. |
 | `--groupsize GROUPSIZE`   | Group size. |
@@ -280,7 +285,6 @@ Optionally, you can use the following command-line flags:

 | Flag             | Description |
 |------------------|-------------|
-| `--flexgen`                       | Enable the use of FlexGen offloading. |
 | `--percent PERCENT [PERCENT ...]` | FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
 | `--compress-weight`               | FlexGen: Whether to compress weight (default: False).|
 | `--pin-weight [PIN_WEIGHT]`       | FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |