Add --load-in-4bit parameter (#2320)
parent 63ce5f9c28
commit 361451ba60
6 changed files with 61 additions and 22 deletions
11
README.md
@@ -214,13 +214,22 @@ Optionally, you can use the following command-line flags:
 | `--cpu-memory CPU_MEMORY` | Maximum CPU memory in GiB to allocate for offloaded weights. Same as above.|
 | `--disk` | If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
 | `--disk-cache-dir DISK_CACHE_DIR` | Directory to save the disk cache to. Defaults to `cache/`. |
-| `--load-in-8bit` | Load the model with 8-bit precision.|
+| `--load-in-8bit` | Load the model with 8-bit precision (using bitsandbytes).|
 | `--bf16` | Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU. |
 | `--no-cache` | Set `use_cache` to False while generating text. This reduces the VRAM usage a bit with a performance cost. |
 | `--xformers` | Use xformer's memory efficient attention. This should increase your tokens/s. |
 | `--sdp-attention` | Use torch 2.0's sdp attention. |
 | `--trust-remote-code` | Set trust_remote_code=True while loading a model. Necessary for ChatGLM. |
 
+#### Accelerate 4-bit
+
+| Flag | Description |
+|---------------------------------------------|-------------|
+| `--load-in-4bit` | Load the model with 4-bit precision (using bitsandbytes). |
+| `--compute_dtype COMPUTE_DTYPE` | compute dtype for 4-bit. Valid options: bfloat16, float16, float32. |
+| `--quant_type QUANT_TYPE` | quant_type for 4-bit. Valid options: nf4, fp4. |
+| `--use_double_quant` | use_double_quant for 4-bit. |
+
 #### llama.cpp
 
 | Flag | Description |
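For readers who want to see how the four flags documented in this hunk might fit together, here is a minimal sketch. It is not taken from the commit: the other changed files are not shown here, so the helper names `build_parser` and `quantization_kwargs` are hypothetical, and the defaults (`float16`, `nf4`) are assumptions. The dictionary keys mirror the `bnb_4bit_*` keyword names of the `BitsAndBytesConfig` class in Hugging Face Transformers; the compute dtype is kept as a plain string and would still need converting to a torch dtype by the caller.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags added in this README hunk."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-in-4bit", action="store_true",
                        help="Load the model with 4-bit precision (using bitsandbytes).")
    # Default values below are assumptions, not confirmed by the diff.
    parser.add_argument("--compute_dtype", type=str, default="float16",
                        choices=["bfloat16", "float16", "float32"],
                        help="compute dtype for 4-bit.")
    parser.add_argument("--quant_type", type=str, default="nf4",
                        choices=["nf4", "fp4"],
                        help="quant_type for 4-bit.")
    parser.add_argument("--use_double_quant", action="store_true",
                        help="use_double_quant for 4-bit.")
    return parser


def quantization_kwargs(args):
    """Map parsed flags to BitsAndBytesConfig-style keyword arguments.

    Returns an empty dict when --load-in-4bit is not set, so the caller
    can load the model without any quantization settings.
    """
    if not args.load_in_4bit:
        return {}
    return {
        "load_in_4bit": True,
        "bnb_4bit_compute_dtype": args.compute_dtype,  # still a string here
        "bnb_4bit_quant_type": args.quant_type,
        "bnb_4bit_use_double_quant": args.use_double_quant,
    }


if __name__ == "__main__":
    example = ["--load-in-4bit", "--compute_dtype", "bfloat16", "--use_double_quant"]
    print(quantization_kwargs(build_parser().parse_args(example)))
```

Under these assumptions, the three tuning flags only take effect when `--load-in-4bit` is also passed, which matches how the README groups them under a single "Accelerate 4-bit" heading.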