Add --load-in-4bit parameter (#2320)

oobabooga 2023-05-25 01:14:13 -03:00 committed by GitHub
parent 63ce5f9c28
commit 361451ba60
6 changed files with 61 additions and 22 deletions

@@ -214,13 +214,22 @@ Optionally, you can use the following command-line flags:
| `--cpu-memory CPU_MEMORY` | Maximum CPU memory in GiB to allocate for offloaded weights. Same as above.|
| `--disk` | If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
| `--disk-cache-dir DISK_CACHE_DIR` | Directory to save the disk cache to. Defaults to `cache/`. |
-| `--load-in-8bit` | Load the model with 8-bit precision.|
+| `--load-in-8bit` | Load the model with 8-bit precision (using bitsandbytes).|
| `--bf16` | Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU. |
| `--no-cache` | Set `use_cache` to False while generating text. This reduces the VRAM usage a bit with a performance cost. |
| `--xformers` | Use xformer's memory efficient attention. This should increase your tokens/s. |
| `--sdp-attention` | Use torch 2.0's sdp attention. |
| `--trust-remote-code` | Set trust_remote_code=True while loading a model. Necessary for ChatGLM. |
+#### Accelerate 4-bit
+| Flag | Description |
+|---------------------------------------------|-------------|
+| `--load-in-4bit` | Load the model with 4-bit precision (using bitsandbytes). |
+| `--compute_dtype COMPUTE_DTYPE` | compute dtype for 4-bit. Valid options: bfloat16, float16, float32. |
+| `--quant_type QUANT_TYPE` | quant_type for 4-bit. Valid options: nf4, fp4. |
+| `--use_double_quant` | use_double_quant for 4-bit. |
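
The four 4-bit flags above are easiest to read as a single quantization configuration. The following is a minimal sketch, not the code from this commit: it assumes the flags are parsed with `argparse`, and the helper names `build_parser` and `quantization_kwargs` are hypothetical. The default values (`float16`, `nf4`) are also assumptions, since the table does not state them. The dictionary keys mirror the parameter names of `BitsAndBytesConfig` in the `transformers` library, so the result could plausibly be passed on as `BitsAndBytesConfig(**kwargs)`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare only the 4-bit flags listed in the table (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-in-4bit", action="store_true",
                        help="Load the model with 4-bit precision (using bitsandbytes).")
    # Defaults below are assumptions; the table does not specify them.
    parser.add_argument("--compute_dtype", type=str, default="float16",
                        choices=["bfloat16", "float16", "float32"],
                        help="Compute dtype for 4-bit.")
    parser.add_argument("--quant_type", type=str, default="nf4",
                        choices=["nf4", "fp4"],
                        help="Quantization type for 4-bit.")
    parser.add_argument("--use_double_quant", action="store_true",
                        help="Enable double quantization for 4-bit.")
    return parser


def quantization_kwargs(args: argparse.Namespace) -> dict:
    """Map parsed flags to keyword arguments named after BitsAndBytesConfig
    parameters. Returns an empty dict when --load-in-4bit is not set."""
    if not args.load_in_4bit:
        return {}
    return {
        "load_in_4bit": True,
        # Kept as a string here so the sketch runs without torch installed.
        "bnb_4bit_compute_dtype": args.compute_dtype,
        "bnb_4bit_quant_type": args.quant_type,
        "bnb_4bit_use_double_quant": args.use_double_quant,
    }


if __name__ == "__main__":
    # Example invocation: 4-bit loading with bfloat16 compute and double quantization.
    example = build_parser().parse_args(
        ["--load-in-4bit", "--compute_dtype", "bfloat16", "--use_double_quant"]
    )
    print(quantization_kwargs(example))
```

In this sketch the other three flags have no effect unless `--load-in-4bit` is also given, which matches how the table groups them under the 4-bit heading.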
#### llama.cpp
| Flag | Description |