Updated documentation and added conversion utility #34319
Conversation
Thanks, docs LGTM! Once @ArthurZucker has reviewed, we can merge 🙂
Nice PR!
@@ -893,3 +894,37 @@ def train_new_from_iterator(
            kwargs["additional_special_tokens"] = additional_special_tokens

        return self.__class__(tokenizer_object=tokenizer, **kwargs)


def convert_tiktoken_to_fast(encoding: Any, output_dir: str):
very nice!
We should do:
from tiktoken import get_encoding
# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
in this function directly IMO! Also let's maybe place this under the integration folder! 🤗
I wouldn't like to load the encoding inside the function, to allow for custom encodings like https://github.com/openai/tiktoken?tab=readme-ov-file#extending-tiktoken
We could allow `Encoding | str` as input and load the encoding with `get_encoding` if a str is passed.
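A minimal sketch of that suggestion (the signature and body here are illustrative, not the PR's actual implementation; only the str/Encoding dispatch is the point, and the conversion logic itself is elided):

```python
def convert_tiktoken_to_fast(encoding, output_dir: str):
    """Sketch of the suggested dispatch; conversion body elided."""
    if isinstance(encoding, str):
        # Resolve encoding names like "gpt2" lazily, so custom
        # tiktoken.Encoding objects can still be passed in directly
        # (preserving the "extending tiktoken" use case).
        from tiktoken import get_encoding

        encoding = get_encoding(encoding)
    # ... existing conversion logic would operate on `encoding` here ...
    return encoding
```

Deferring the `tiktoken` import to the str branch also keeps the module importable when tiktoken is not installed.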
Moved it to integration/tiktoken.py
LGTM! I was not pinged, so I did not come back to it. Could you run `make fixup` and rebase to make sure you are up to date?
@ArthurZucker could not properly run
Thanks 🤗
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
* Updated documentation and added conversion utility
* Update docs/source/en/tiktoken.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tiktoken.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Moved util function to integration folder + allow for str
* Update formatting
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Updated formatting
* style changes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
tokenizer = TikTokenConverter(
    vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
).tokenizer()
To actually work, the tokenizer pipeline needs the ByteLevel and Split pre-tokenizers (with the pattern string), the ByteLevel post-processor, and the special tokens from the original encoding. It looks like the `.converted()` method on the original TikTokenConverter already adds these:

def converted(self) -> Tokenizer:

but the `convert_tiktoken_to_fast` wrapper just uses the `.tokenizer()` method, which saves only the actual merges. This isn't documented, so unless you read the source code and/or inspect the actual tokenizers file, you won't realize the pipeline is incomplete until Unicode tokens or special tokens start breaking.
It took me a couple of hours to figure out what went wrong (while porting the Fish Speech 1.5 tokenizer from tiktoken). Is there a reason `convert_tiktoken_to_fast` doesn't just use the `.converted()` method? If that isn't possible, could the docs at least explicitly call out the remaining steps needed to set up the BPE pipeline?
What does this PR do?
Documentation improvement on tiktoken integration + tiktoken conversion function.
Fixes #34221
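For illustration, here is a simplified, self-contained sketch of the kind of rank-table-to-merges recovery a tiktoken conversion utility has to perform (toy data only; the real converter in this PR also handles byte-to-unicode mapping and special tokens, and `ranks_to_merges` is a hypothetical name):

```python
def ranks_to_merges(bpe_ranks):
    """Recover BPE merge rules from a tiktoken-style rank table.

    A token longer than one byte must have been produced by merging
    two tokens that are themselves in the table; we enumerate the
    split points whose halves exist and order the merges by rank.
    Simplified: assumes each multi-byte token has one valid split.
    """
    merges = []
    for token, rank in bpe_ranks.items():
        if len(token) == 1:
            continue
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in bpe_ranks and right in bpe_ranks:
                merges.append((rank, left, right))
    merges.sort()
    return [(left, right) for _, left, right in merges]


toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(ranks_to_merges(toy_ranks))  # [(b'a', b'b'), (b'ab', b'c')]
```

The merges file plus the vocab derived from the same table is what the fast (tokenizers-backed) BPE model consumes.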
Who can review?
@ArthurZucker
@stevhliu