Fix convert_tokens_to_string when decoder is None #34569
Conversation
I want to emphasize my support for this PR with a simple, reproducible error (and a real use case) that this PR fixes:

```
python3 -m vllm.entrypoints.openai.api_server --model jounce/dummy-phi3
```

Currently fails like so:

```
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 122, in decode_sequence_inplace
ERROR 11-11 09:00:29 engine.py:158]     read_offset) = detokenize_incrementally(
ERROR 11-11 09:00:29 engine.py:158]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 301, in detokenize_incrementally
ERROR 11-11 09:00:29 engine.py:158]     prefix_text = tokenizer.convert_tokens_to_string(
ERROR 11-11 09:00:29 engine.py:158]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_fast.py", line 641, in convert_tokens_to_string
ERROR 11-11 09:00:29 engine.py:158]     return self.backend_tokenizer.decoder.decode(tokens)
ERROR 11-11 09:00:29 engine.py:158]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158] AttributeError: 'NoneType' object has no attribute 'decode'
```

Great PR, thanks :)
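For reference, the underlying failure can also be reproduced without vLLM. The snippet below is a sketch, not code from this PR; it assumes a fast tokenizer built from a `WordLevel` model, which leaves the backend decoder unset:

```python
# Hypothetical standalone reproduction (not part of this PR): a tokenizer
# trained with WordLevelTrainer has no decoder, so backend_tokenizer.decoder
# is None and convert_tokens_to_string raises before this fix.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(["hello world"], trainer=WordLevelTrainer(special_tokens=["[UNK]"]))

fast = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]")
print(fast.backend_tokenizer.decoder)               # None
fast.convert_tokens_to_string(["hello", "world"])   # AttributeError before this fix
```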
Force-pushed from 1dd4100 to bf0e086
Force-pushed from bf0e086 to 07ec57f
Hey @ArthurZucker, would be awesome if you could take a quick look at this. Please let me know if someone else should review.
There are quite a few unrelated changes that made their way here, can you revert them so we can merge? 🤗
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
* Fix convert_tokens_to_string when decoder is None
* Revert unrelated changes

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
What does this PR do?
In `convert_tokens_to_string` of `src/transformers/tokenization_utils_fast.py`, `self.backend_tokenizer.decoder` can be `None` when the tokenizer is trained by `tokenizers.trainers.WordLevelTrainer`. This fix adds a check and falls back to joining tokens with a space when no decoder is found, following the default behavior of the Rust implementation of Tokenizers.

Original question posted on https://discuss.huggingface.co/t/pretrainedtokenizerfast-convert-tokens-to-string-always-assumes-the-presence-of-decoder/114978
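For illustration, the guarded fallback looks roughly like the sketch below. This is a paraphrase of the described change, not the verbatim diff from the PR:

```python
# Sketch of the described fix: fall back to a plain space join when no
# decoder is configured, matching the default in the Rust Tokenizers library.
def convert_tokens_to_string(self, tokens):
    decoder = self.backend_tokenizer.decoder
    if decoder is None:
        return " ".join(tokens)
    return decoder.decode(tokens)
```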
All existing tokenizer tests are passing. Style changes were made by `make fixup`.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker