From aa83fc21d40a455a2c74005da20847e0a5a95ab1 Mon Sep 17 00:00:00 2001
From: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date: Thu, 1 Jun 2023 12:14:27 -0300
Subject: [PATCH] Update Low-VRAM-guide.md

---
 docs/Low-VRAM-guide.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/Low-VRAM-guide.md b/docs/Low-VRAM-guide.md
index 1dc86f9..7814ecb 100644
--- a/docs/Low-VRAM-guide.md
+++ b/docs/Low-VRAM-guide.md
@@ -1,4 +1,4 @@
-If you GPU is not large enough to fit a model, try these in the following order:
+If your GPU is not large enough to fit a 16-bit model, try these in the following order:
 
 ### Load the model in 8-bit mode
 
@@ -6,7 +6,11 @@ If you GPU is not large enough to fit a model, try these in the following order:
 python server.py --load-in-8bit
 ```
 
-This reduces the memory usage by half with no noticeable loss in quality. Only newer GPUs support 8-bit mode.
+### Load the model in 4-bit mode
+
+```
+python server.py --load-in-4bit
+```
 
 ### Split the model across your GPU and CPU
 
@@ -34,8 +38,6 @@ python server.py --auto-devices --gpu-memory 3500MiB
 ...
 ```
 
-Additionally, you can also set the `--no-cache` value to reduce the GPU usage while generating text at a performance cost. This may allow you to set a higher value for `--gpu-memory`, resulting in a net performance gain.
-
 ### Send layers to a disk cache
 
 As a desperate last measure, you can split the model across your GPU, CPU, and disk: