
Updated documentation and added conversion utility #34319

Merged

Conversation

@ViktorooReps (Contributor) commented on Oct 22, 2024

What does this PR do?

This PR improves the documentation for the tiktoken integration and adds a tiktoken conversion utility.

Fixes #34221
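
For context, a minimal usage sketch of the new utility (the import path follows the integrations module added later in this PR; exact re-exports may differ):

from tiktoken import get_encoding
from transformers.integrations.tiktoken import convert_tiktoken_to_fast

# Load a named encoding; a custom tiktoken Encoding works as well.
encoding = get_encoding("gpt2")

# Write the converted fast-tokenizer files into the output directory.
convert_tiktoken_to_fast(encoding, "converted_tokenizer")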

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker
@stevhliu

@stevhliu (Member) left a comment

Thanks, docs LGTM! Once @ArthurZucker has reviewed, we can merge 🙂

@ArthurZucker (Collaborator) left a comment

Nice PR!

@@ -893,3 +894,37 @@ def train_new_from_iterator(
kwargs["additional_special_tokens"] = additional_special_tokens

return self.__class__(tokenizer_object=tokenizer, **kwargs)


def convert_tiktoken_to_fast(encoding: Any, output_dir: str):
@ArthurZucker (Collaborator) commented on the diff:

Very nice!
We should do:

from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")

in this function directly IMO!

Also, let's maybe place this under the integrations folder! 🤗

@ViktorooReps (Contributor, Author) replied:

I'd rather not load the encoding inside the function, so that custom encodings (see https://github.com/openai/tiktoken?tab=readme-ov-file#extending-tiktoken) remain supported.

We could accept Encoding | str as input and load the encoding with get_encoding when a str is passed, roughly as sketched below.
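
A minimal sketch of that dispatch (illustrative only, not the merged implementation):

from typing import Union

from tiktoken import Encoding, get_encoding

def convert_tiktoken_to_fast(encoding: Union[Encoding, str], output_dir: str):
    # Accept either a ready-made Encoding or the name of a built-in one.
    if isinstance(encoding, str):
        encoding = get_encoding(encoding)
    ...  # continue with the Encoding object as before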

@ViktorooReps (Contributor, Author) replied:

Moved it to integrations/tiktoken.py.

@ArthurZucker (Collaborator) left a comment

LGTM! I was not pinged, so I did not come back to it. Could you run make fixup and rebase to make sure you are up to date?

@ViktorooReps force-pushed the tiktoken_improved_documentation branch from 5b012f7 to f77a51a on November 25, 2024 at 10:26.
ViktorooReps and others added 5 commits on November 25, 2024 at 11:32 (co-authored by Steven Liu and Arthur Zucker).
@ViktorooReps force-pushed the tiktoken_improved_documentation branch from f77a51a to 326fcdc on November 25, 2024 at 10:33.
@ViktorooReps (Contributor, Author) commented on Nov 25, 2024:

@ArthurZucker I could not properly run make fixup due to some dependency problems, but I think it should all be good now.

@ArthurZucker (Collaborator) left a comment

Thanks 🤗

@ArthurZucker merged commit 95c10fe into huggingface:main on Nov 25, 2024. 9 checks passed.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Updated documentation and added conversion utility

* Update docs/source/en/tiktoken.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/tiktoken.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Moved util function to integration folder + allow for str

* Update formatting

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Updated formatting

* style changes

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request on Dec 6, 2024 (same commit message as above).

tokenizer = TikTokenConverter(
vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
).tokenizer()
@EndlessReform commented on Dec 7, 2024:

To actually work, the tokenizer pipeline needs the ByteLevel and Split pre-tokenizers (with the pattern string), the ByteLevel post-processor, and the special tokens from the original encoding. It looks like the .converted() method on the original TikTokenConverter already adds these:

def converted(self) -> Tokenizer:

but the convert_tiktoken_to_fast wrapper only uses the .tokenizer() method, which saves just the bare BPE model (vocabulary and merges). This isn't documented, so unless you read the source code and/or inspect the resulting tokenizers file, you won't realize the pipeline is incomplete until Unicode tokens or special tokens start breaking.

It took me a couple of hours to figure out what went wrong (while porting the Fish Speech 1.5 tokenizer from tiktoken). Is there a reason convert_tiktoken_to_fast doesn't just use the .converted() method? If that isn't possible, could the docs at least explicitly call out the remaining steps needed to set up the full BPE pipeline?
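
For anyone hitting the same issue, here is a rough sketch of the missing wiring, modeled on what TikTokenConverter.converted() sets up (the exact arguments are assumptions from the current source and may differ between versions):

from tokenizers import Regex, decoders, pre_tokenizers, processors

# Assumes `tokenizer` is the bare BPE Tokenizer returned by
# TikTokenConverter(...).tokenizer() and `encoding` is the original
# tiktoken Encoding, as in the snippet above.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        # Split on tiktoken's own regex first, then byte-level encode each piece.
        pre_tokenizers.Split(Regex(encoding._pat_str), behavior="isolated", invert=False),
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ]
)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
# Register the special tokens so they are matched before BPE runs.
tokenizer.add_special_tokens(list(encoding._special_tokens))

With these pieces in place, byte-level round-tripping and special-token matching should behave like the .converted() output.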

Successfully merging this pull request may close these issues: Documentation improvement on tiktoken integration (#34221).