Fix convert_tokens_to_string when decoder is None #34569

dszeto · 2024-11-01T23:38:23Z

What does this PR do?

In convert_tokens_to_string in src/transformers/tokenization_utils_fast.py, self.backend_tokenizer.decoder can be None when the tokenizer was trained with tokenizers.trainers.WordLevelTrainer. This fix adds a check and falls back to joining the tokens with a space when no decoder is set. This follows the default behavior of the Rust implementation of Tokenizers.
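For readers skimming the thread, the fallback described above can be sketched roughly as follows. This is a standalone illustration, not the actual diff: the free function, the `Decoder` protocol, and the `ConcatDecoder` stand-in are hypothetical names used only so the snippet runs without transformers or tokenizers installed.

```python
from typing import List, Optional, Protocol


class Decoder(Protocol):
    """Hypothetical stand-in for the backend tokenizer's decoder interface."""

    def decode(self, tokens: List[str]) -> str: ...


def convert_tokens_to_string(decoder: Optional[Decoder], tokens: List[str]) -> str:
    """Decode tokens with the decoder if one is set, else join them with spaces."""
    if decoder is None:
        # No decoder configured (e.g. a WordLevel-trained tokenizer):
        # fall back to a plain space join instead of raising AttributeError.
        return " ".join(tokens)
    return decoder.decode(tokens)


class ConcatDecoder:
    """Toy decoder for illustration only: concatenates tokens without separators."""

    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)


print(convert_tokens_to_string(None, ["hello", "world"]))
print(convert_tokens_to_string(ConcatDecoder(), ["he", "llo"]))
```

In the real method the decoder would be read from self.backend_tokenizer.decoder rather than passed as an argument; the point is only that the None case is handled before .decode is called.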

Original question posted on https://discuss.huggingface.co/t/pretrainedtokenizerfast-convert-tokens-to-string-always-assumes-the-presence-of-decoder/114978

All existing tokenizer tests are passing. Style changes were made by make fixup.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker

AlonKellner-Jounce · 2024-11-11T17:36:34Z

I want to emphasize my support for this PR with an easily reproducible error (and a real use case) that it fixes. Running vLLM with a Hugging Face model whose PreTrainedTokenizerFast tokenizer has no decoder, for example:

python3 -m vllm.entrypoints.openai.api_server --model jounce/dummy-phi3

currently fails like so:

ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 122, in decode_sequence_inplace
ERROR 11-11 09:00:29 engine.py:158]     read_offset) = detokenize_incrementally(
ERROR 11-11 09:00:29 engine.py:158]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 301, in detokenize_incrementally
ERROR 11-11 09:00:29 engine.py:158]     prefix_text = tokenizer.convert_tokens_to_string(
ERROR 11-11 09:00:29 engine.py:158]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_fast.py", line 641, in convert_tokens_to_string
ERROR 11-11 09:00:29 engine.py:158]     return self.backend_tokenizer.decoder.decode(tokens)
ERROR 11-11 09:00:29 engine.py:158]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158] AttributeError: 'NoneType' object has no attribute 'decode'

Great PR, tnx :)

dszeto · 2024-11-13T06:22:58Z

Hey @ArthurZucker, it would be awesome if you could take a quick look at this.

Please let me know if someone else should review.

ArthurZucker

There are quite a few unrelated changes that made their way here, can you revert them so we can merge? 🤗

HuggingFaceDocBuilderDev · 2024-11-25T13:52:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* Fix convert_tokens_to_string when decoder is None
* revert unrelated changs

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>

dszeto force-pushed the convert-tokens-to-string-wordlevel branch from 1dd4100 to bf0e086 on November 11, 2024 17:38

Fix convert_tokens_to_string when decoder is None

07ec57f

dszeto force-pushed the convert-tokens-to-string-wordlevel branch from bf0e086 to 07ec57f on November 12, 2024 00:50

ArthurZucker approved these changes Nov 25, 2024


revert unrelated changs

50c4ea0

ArthurZucker merged commit 74db22f into huggingface:main Nov 25, 2024
24 checks passed
