Bump llama-cpp-python to 0.2.18 (#4611)
This commit is contained in:
parent
61f429563e
commit
923c8e25fb
17 changed files with 92 additions and 174 deletions
|
@ -21,7 +21,7 @@ Options:
|
|||
* **alpha_value**: Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.
|
||||
* **rope_freq_base**: Originally another way to write "alpha_value", it ended up becoming a necessary parameter for some models like CodeLlama, which was fine-tuned with this set to 1000000 and hence needs to be loaded with it set to 1000000 as well.
|
||||
* **compress_pos_emb**: The first and original context-length extension method, discovered by [kaiokendev](https://kaiokendev.github.io/til). When set to 2, the context length is doubled, 3 and it's tripled, etc. It should only be used for models that have been fine-tuned with this parameter set to different than 1. For models that have not been tuned to have greater context length, alpha_value will lead to a smaller accuracy loss.
|
||||
* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower. Note: this parameter has a different interpretation in the llama.cpp loader (see below).
|
||||
* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower.
|
||||
* **load-in-8bit**: Load the model in 8-bit precision using bitsandbytes. The 8-bit kernel in that library has been optimized for training and not inference, so load-in-8bit is slower than load-in-4bit (but more accurate).
|
||||
* **bf16**: Use bfloat16 precision instead of float16 (the default). Only applies when quantization is not used.
|
||||
* **auto-devices**: When checked, the backend will try to guess a reasonable value for "gpu-memory" to allow you to load a model with CPU offloading. I recommend just setting "gpu-memory" manually instead. This parameter is also needed for loading GPTQ models, in which case it needs to be checked before loading the model.
|
||||
|
@ -97,7 +97,6 @@ Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
|
|||
* **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
|
||||
* **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (no idea what this means, never used it).
|
||||
* **numa**: May improve performance on certain multi-cpu systems.
|
||||
* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
|
||||
* **tensor_split**: For multi-gpu only. Sets the amount of memory to allocate per GPU.
|
||||
* **Seed**: The seed for the llama.cpp random number generator. Not very useful as it can only be set once (that I'm aware).
|
||||
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue