Revert "Remove GPTQ-for-LLaMa monkey patch support"

This reverts commit e3d3565b2a.
2023-08-10 08:39:41 -07:00
commit c7f52bbdc1
parent 16e2b117b4
6 changed files with 103 additions and 0 deletions
--- a/docs/GPTQ-models-(4-bit-mode).md
+++ b/docs/GPTQ-models-(4-bit-mode).md
@@ -157,4 +157,31 @@ Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)

 You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, eg `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, or `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and 20 layers offloaded to CPU.

+### Using LoRAs with GPTQ-for-LLaMa
+
+This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
+
+To use it:
+
+1. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder:
+
+```
+cd text-generation-webui/repositories
+git clone https://github.com/johnsmith0031/alpaca_lora_4bit
+```
+
+⚠️  I have tested it with the following commit specifically: `2f704b93c961bf202937b10aac9322b092afdce0`
+
+2. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command:
+
+```
+pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
+```
+
+3. Start the UI with the `--monkey-patch` flag:
+
+```
+python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
+```
+
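The first two steps in the restored section can be sketched as one script. This is a minimal sketch, not part of the commit: the `git checkout` step is an assumption drawn from the tested-commit warning (the docs only name the commit, they do not say to check it out), and `run_step` and `setup_monkey_patch` are hypothetical helper names. It assumes it is run from the directory that contains `text-generation-webui`, and it only prints the commands unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch of steps 1 and 2, with the clone pinned to the commit the docs say was tested.
# By default the commands are only printed; set RUN=1 to execute them.
set -euo pipefail

TESTED_COMMIT="2f704b93c961bf202937b10aac9322b092afdce0"
PATCH_DIR="text-generation-webui/repositories/alpaca_lora_4bit"

# Hypothetical helper: echo a command, and run it only when RUN=1.
run_step() {
    printf '%s\n' "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

setup_monkey_patch() {
    # Step 1: clone the monkey patch into the repositories folder.
    run_step git clone https://github.com/johnsmith0031/alpaca_lora_4bit "$PATCH_DIR"
    # Assumption: pin the checkout to the tested commit.
    run_step git -C "$PATCH_DIR" checkout "$TESTED_COMMIT"
    # Step 2: install the lora_4bit branch of the sterlind fork.
    run_step pip install "git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit"
}

setup_monkey_patch
```

Step 3 stays as documented: start `server.py` with `--monkey-patch` once both installs have succeeded.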