Bump llama-cpp-python, +tensor_split by @shouyiwang, +mul_mat_q (#3610)

2023-08-18 12:03:34 -03:00
commit 7cba000421
parent 4b69f4f6ae
8 changed files with 31 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -261,7 +261,9 @@ Optionally, you can use the following command-line flags:
 |-------------|-------------|
 | `--no-mmap` | Prevent mmap from being used. |
 | `--mlock`   | Force the system to keep the model in RAM. |
+| `--mul_mat_q` | Activate new mulmat kernels. |
 | `--cache-capacity CACHE_CAPACITY`   | Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
+| `--tensor_split TENSOR_SPLIT` | Split the model across multiple GPUs, comma-separated list of proportions, e.g. 18,17 |
 | `--llama_cpp_seed SEED` | Seed for llama-cpp models. Default 0 (random). |
 | `--n_gqa N_GQA`         | grouped-query attention. Must be 8 for llama-2 70b. |
 | `--rms_norm_eps RMS_NORM_EPS`  | 5e-6 is a good value for llama-2 models. |
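
The `--tensor_split` row in the hunk above describes its value as a comma-separated list of proportions such as `18,17`. As a rough illustration only, the sketch below shows one way such a string could be converted into a list of floats before being passed to a model loader. The helper name `parse_tensor_split` is hypothetical and is not taken from this commit, and treating an empty or missing value as "no split requested" is an assumption.

```python
from typing import List, Optional


def parse_tensor_split(value: Optional[str]) -> Optional[List[float]]:
    """Convert a string like "18,17" into [18.0, 17.0].

    Hypothetical helper for illustration; returns None when no split
    was requested (value is None or contains only whitespace).
    """
    if value is None or value.strip() == "":
        return None
    # Split on commas and convert each proportion to a float.
    # A malformed entry (e.g. "18,abc") raises ValueError.
    return [float(part) for part in value.strip().split(",")]


if __name__ == "__main__":
    print(parse_tensor_split("18,17"))  # [18.0, 17.0]
    print(parse_tensor_split(""))       # None
```

Returning `None` for an empty value lets the caller fall back to its default behaviour instead of receiving an empty list.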