Remove GPTQ-for-LLaMa monkey patch support

AutoGPTQ will be the preferred GPTQ LoRa loader in the future.
2023-08-09 23:59:04 -05:00 · 2023-08-09 23:59:04 -05:00 · e3d3565b2a
commit e3d3565b2a
parent bee73cedbd
6 changed files with 0 additions and 103 deletions
--- a/docs/GPTQ-models-(4-bit-mode).md
+++ b/docs/GPTQ-models-(4-bit-mode).md
@ -198,31 +198,4 @@ Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)

 You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, eg `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, or `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and 20 layers offloaded to CPU.

-### Using LoRAs with GPTQ-for-LLaMa
-
-This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
-
-To use it:
-
-1. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder:
-
-```
-cd text-generation-webui/repositories
-git clone https://github.com/johnsmith0031/alpaca_lora_4bit
-```
-
-⚠️  I have tested it with the following commit specifically: `2f704b93c961bf202937b10aac9322b092afdce0`
-
-2. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command:
-
-```
-pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
-```
-
-3. Start the UI with the `--monkey-patch` flag:
-
-```
-python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
-```
-

--- a/docs/LoRA.md
+++ b/docs/LoRA.md
@ -11,7 +11,6 @@ This is the current state of LoRA integration in the web UI:
 | Transformers | Full support in 16-bit, `--load-in-8bit`, `--load-in-4bit`, and CPU modes. |
 | ExLlama | Single LoRA support. Fast to remove the LoRA afterwards. |
 | AutoGPTQ | Single LoRA support. Removing the LoRA requires reloading the entire model.|
-| GPTQ-for-LLaMa | Full support with the [monkey patch](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#using-loras-with-gptq-for-llama). |

 ## Downloading a LoRA

--- a/docs/Training-LoRAs.md
+++ b/docs/Training-LoRAs.md
@ -131,14 +131,6 @@ So, in effect, Loss is a balancing game: you want to get it low enough that it u

 Note: if you see Loss start at or suddenly jump to exactly `0`, it is likely something has gone wrong in your training process (eg model corruption).

-## Note: 4-Bit Monkeypatch
-
-The [4-bit LoRA monkeypatch](GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode) works for training, but has side effects:
- VRAM usage is higher currently. You can reduce the `Micro Batch Size` to `1` to compensate.
- Models do funky things. LoRAs apply themselves, or refuse to apply, or spontaneously error out, or etc. It can be helpful to reload base model or restart the WebUI between training/usage to minimize chances of anything going haywire.
- Loading or working with multiple LoRAs at the same time doesn't currently work.
- Generally, recognize and treat the monkeypatch as the dirty temporary hack it is - it works, but isn't very stable. It will get better in time when everything is merged upstream for full official support.
-
 ## Legacy notes

 LoRA training was contributed by [mcmonkey4eva](https://github.com/mcmonkey4eva) in PR [#570](https://github.com/oobabooga/text-generation-webui/pull/570).