Lora Trainer improvements, part 6 - slightly better raw text inputs (#2108)

Alex "mcmonkey" Goodwin 2023-05-19 08:58:54 -07:00 committed by GitHub
parent 511470a89b
commit 50c70e28f0
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 35 additions and 13 deletions


@ -75,6 +75,13 @@ So for example if a dataset has `"instruction": "answer my question"`, then the
If you have different sets of key inputs, you can make your own format file to match it. This format-file is designed to be as simple as possible to enable easy editing to match your needs.
## Raw Text File Settings
When using raw text files as your dataset, the text is automatically split into chunks based on your `Cutoff Length`. You get a few basic options to configure how the chunks are made.
- `Overlap Length` is how much to overlap chunks by. Overlapping chunks helps prevent the model from learning strange mid-sentence cuts, and instead lets it learn continuous sentences that flow from earlier text.
- `Prefer Newline Cut Length` sets a maximum distance in characters to shift the chunk cut towards newlines. Doing this helps prevent chunks from starting or ending mid-sentence, so the model does not learn to cut off sentences at random.
- `Hard Cut String` sets a string that indicates there must be a hard cut without overlap. This defaults to `\n\n\n`, meaning 3 newlines. No trained chunk will ever contain this string. This allows you to insert unrelated sections of text in the same text file, but still ensure the model won't be taught to randomly change the subject.
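
To make the interaction between these three settings concrete, here is a minimal sketch of a chunker that applies them. It is an illustration only, not the trainer's actual code: the function name `split_raw_text` and its parameter names are hypothetical, and it measures every length in characters for simplicity, which may differ from how the trainer measures `Cutoff Length` and `Overlap Length`.

```python
def split_raw_text(text, cutoff_len=256, overlap_len=128,
                   newline_favor_len=128, hard_cut_string="\n\n\n"):
    """Hypothetical, character-based illustration of the raw text settings."""
    if not hard_cut_string:
        raise ValueError("hard_cut_string must not be empty")
    if not 0 <= overlap_len < cutoff_len:
        raise ValueError("overlap_len must be non-negative and below cutoff_len")

    chunks = []
    # Hard cuts come first: each section is chunked on its own, so no chunk
    # spans a hard cut and no overlap is carried across one.
    for section in text.split(hard_cut_string):
        section = section.strip()
        start = 0
        while start < len(section):
            end = min(start + cutoff_len, len(section))
            if end < len(section):
                # Prefer cutting at a newline, but only within
                # newline_favor_len characters of the planned cut, and never
                # so early that the next chunk would fail to move forward.
                lowest = max(start + overlap_len + 1, end - newline_favor_len)
                newline_at = section.rfind("\n", lowest, end)
                if newline_at != -1:
                    end = newline_at
            chunk = section[start:end].strip()
            if chunk:
                chunks.append(chunk)
            if end >= len(section):
                break
            # The next chunk starts overlap_len characters before this cut.
            start = end - overlap_len
    return chunks


sample = "first topic line one\nfirst topic line two\n\n\nsecond topic"
for chunk in split_raw_text(sample, cutoff_len=30, overlap_len=10,
                            newline_favor_len=15):
    print(repr(chunk))
```

In this sketch the hard cut is applied before anything else, which is why the overlap can never leak text from one section into the next.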
## Parameters
The basic purpose and function of each parameter is documented on-page in the WebUI, so read through them in the UI to understand your options.