Remove GGML support

2023-09-11 07:30:56 -07:00
commit ed86878f02
parent cc7b7ba153
15 changed files with 24 additions and 123 deletions
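
Since this commit drops GGML loading, it can help to check which single-file models in `models` are still GGUF. The sketch below is not part of the commit. It assumes that a GGUF file begins with the four ASCII bytes `GGUF`, as the GGUF format specifies. The helper names `is_gguf` and `find_non_gguf`, and the `.bin`/`.gguf` suffix filter, are hypothetical choices made for illustration.

```python
from pathlib import Path

# First four bytes of a GGUF file (assumption stated above: the GGUF magic).
GGUF_MAGIC = b"GGUF"


def is_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def find_non_gguf(models_dir="models"):
    """Return sorted names of single-file models in `models_dir` that are not GGUF.

    Only files ending in .bin or .gguf are inspected; this filter is an
    illustrative assumption, not something the web UI itself enforces.
    """
    root = Path(models_dir)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix in {".bin", ".gguf"} and not is_gguf(p)
    )
```

Calling `find_non_gguf("models")` from the repository root would list candidate files that likely need to be converted or re-downloaded in GGUF format.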
--- a/README.md
+++ b/README.md
@@ -158,7 +158,7 @@ text-generation-webui
 │   │   └── tokenizer.model
 ```

-* GGML/GGUF models are a single file and should be placed directly into `models`. Example:
+* GGUF models are a single file and should be placed directly into `models`. Example:

 ```
 text-generation-webui
@@ -260,7 +260,7 @@ Optionally, you can use the following command-line flags:
 | `--quant_type QUANT_TYPE`                   | quant_type for 4-bit. Valid options: nf4, fp4. |
 | `--use_double_quant`                        | use_double_quant for 4-bit. |

-#### GGML/GGUF (for llama.cpp and ctransformers)
+#### GGUF (for llama.cpp and ctransformers)

 | Flag        | Description |
 |-------------|-------------|
@@ -279,8 +279,6 @@ Optionally, you can use the following command-line flags:
 | `--cache-capacity CACHE_CAPACITY`   | Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
 | `--tensor_split TENSOR_SPLIT`  | Split the model across multiple GPUs, comma-separated list of proportions, e.g. 18,17 |
 | `--llama_cpp_seed SEED`        | Seed for llama-cpp models. Default 0 (random). |
-| `--n_gqa N_GQA`                | GGML only (not used by GGUF): Grouped-Query Attention. Must be 8 for llama-2 70b. |
-| `--rms_norm_eps RMS_NORM_EPS`  | GGML only (not used by GGUF): 5e-6 is a good value for llama-2 models. |
 | `--cpu`                        | Use the CPU version of llama-cpp-python instead of the GPU-accelerated version. |
 |`--cfg-cache`                   | llamacpp_HF: Create an additional cache for CFG negative prompts. |