Updated documentation and added conversion utility #34319
Conversation
Thanks, docs LGTM! Once @ArthurZucker has reviewed, we can merge 🙂
Nice PR!
@@ -893,3 +894,37 @@ def train_new_from_iterator(
            kwargs["additional_special_tokens"] = additional_special_tokens

        return self.__class__(tokenizer_object=tokenizer, **kwargs)


def convert_tiktoken_to_fast(encoding: Any, output_dir: str):
very nice!
We should do:
from tiktoken import get_encoding
# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
in this function directly IMO! Also let's maybe place this under the integration folder! 🤗
I wouldn't like to load the encoding inside the function, to allow for custom encodings like https://github.com/openai/tiktoken?tab=readme-ov-file#extending-tiktoken
We could allow `Encoding | str` as input and load the encoding with `get_encoding` if a str is passed.
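A minimal sketch of that suggestion (the signature and body here are illustrative, not the PR's actual implementation; only the str/Encoding dispatch is the point, and the conversion logic itself is elided):

```python
def convert_tiktoken_to_fast(encoding, output_dir: str):
    """Sketch of the suggested dispatch; conversion body elided."""
    if isinstance(encoding, str):
        # Resolve encoding names like "gpt2" lazily, so custom
        # tiktoken.Encoding objects can still be passed in directly
        # (preserving the "extending tiktoken" use case).
        from tiktoken import get_encoding

        encoding = get_encoding(encoding)
    # ... existing conversion logic would operate on `encoding` here ...
    return encoding
```

Deferring the `tiktoken` import to the str branch also keeps the module importable when tiktoken is not installed.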
Moved it to integration/tiktoken.py
LGTM! I was not pinged, so I did not come back to it. Could you run `make fixup` and rebase to make sure you are up to date?
@ArthurZucker could not properly run
Thanks 🤗
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
* Updated documentation and added conversion utility
* Update docs/source/en/tiktoken.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tiktoken.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Moved util function to integration folder + allow for str
* Update formatting
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Updated formatting
* style changes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
tokenizer = TikTokenConverter(
    vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
).tokenizer()
To actually work, the tokenizer pipeline needs the ByteLevel and Split pre-tokenizers (with the pattern string), the ByteLevel post-processor, and the special tokens from the original encoding. It looks like the `.converted()` method on the original TikTokenConverter already adds these:

def converted(self) -> Tokenizer:

but the `convert_tiktoken_to_fast` wrapper just uses the `.tokenizer()` method, which saves only the actual merges. This isn't documented, so unless you read the source code and/or inspect the actual tokenizers file, you won't realize the pipeline is incomplete until Unicode tokens or special tokens start breaking.
It took me a couple of hours to figure out what went wrong (while porting the Fish Speech 1.5 tokenizer from tiktoken). Is there a reason `convert_tiktoken_to_fast` doesn't just use the `.converted()` method? If that isn't possible, could the docs at least explicitly call out the remaining steps needed to set up the BPE pipeline?
What does this PR do?
Documentation improvement on tiktoken integration + tiktoken conversion function.
Fixes #34221
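For illustration, here is a simplified, self-contained sketch of the kind of rank-table-to-merges recovery a tiktoken conversion utility has to perform (toy data only; the real converter in this PR also handles byte-to-unicode mapping and special tokens, and `ranks_to_merges` is a hypothetical name):

```python
def ranks_to_merges(bpe_ranks):
    """Recover BPE merge rules from a tiktoken-style rank table.

    A token longer than one byte must have been produced by merging
    two tokens that are themselves in the table; we enumerate the
    split points whose halves exist and order the merges by rank.
    Simplified: assumes each multi-byte token has one valid split.
    """
    merges = []
    for token, rank in bpe_ranks.items():
        if len(token) == 1:
            continue
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in bpe_ranks and right in bpe_ranks:
                merges.append((rank, left, right))
    merges.sort()
    return [(left, right) for _, left, right in merges]


toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(ranks_to_merges(toy_ranks))  # [(b'a', b'b'), (b'ab', b'c')]
```

The merges file plus the vocab derived from the same table is what the fast (tokenizers-backed) BPE model consumes.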
Who can review?
@ArthurZucker
@stevhliu