Speculative sampling does not maintain probability distribution of main model #32867
Comments
It looks like that change (#30778) was made to save time, since if the token were sampled with the correct probability then the procedure would also have to re-sample some of the time. I think you are right: that would have to be reverted to restore correctness, unless "assisted decoding" is meant to behave differently from speculative sampling (#26565 (comment)) and not have the correctness property (Appendix A.1). @gante Also, I suspect (but haven't done the math) that correctness would technically only hold if the probabilities here were adjusted by temp/min_p/etc. (they seem to just be temperature-1 probabilities right now): transformers/src/transformers/generation/utils.py Lines 4126 to 4130 in 52cb403
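For reference, a minimal sketch of the accept/resample step described in the speculative sampling paper (Appendix A.1); the array names are illustrative and this is not the transformers implementation:

```python
import numpy as np

rng = np.random.default_rng()

def speculative_accept(draft_token, p_main, q_draft):
    """One accept/resample step: the output follows p_main exactly,
    provided draft_token was *sampled* from q_draft."""
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p_main[draft_token] / q_draft[draft_token]):
        return draft_token
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = np.maximum(p_main - q_draft, 0.0)
    return rng.choice(len(p_main), p=residual / residual.sum())
```

The rejection branch is the "re-sample some of the time" cost mentioned above; skipping it is what breaks the guarantee that outputs follow the main model's distribution.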
I agree that forcing greedy decoding on the assistant has to be reverted -- empirically, we can check that sampling with an assistant model has much smaller entropy than sampling without one (and they should be the same). It was a rushed decision on my end; I will open a PR to revert it. However, I disagree with some of your statements on why it must be done 🤗 You wrote:
We have to distinguish two phases of text generation: producing the distribution for the next token, and selecting the next token given that distribution. Under both sampling and greedy decoding, the distribution for the next token is the same. Greedy decoding does not set the temperature to 0; it simply takes the argmax of the distribution instead of sampling. The token-level probability properties of speculative decoding still hold with those next-token distributions, even when the assistant decodes greedily. However, speculative decoding assumes we are sampling from the next-token distribution of the assistant model, which is simply not happening at the moment and results in quasi-deterministic outputs from speculative decoding.
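To make the two phases concrete (illustrative tensors, not code from the library): the distribution is produced the same way in both modes, and only the selection rule differs.

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.1])       # next-token logits, identical in both modes
probs = torch.softmax(logits, dim=-1)        # phase 1: produce the distribution

greedy_token = torch.argmax(probs)           # phase 2a: greedy decoding takes the argmax
sampled_token = torch.multinomial(probs, 1)  # phase 2b: sampling draws from the same distribution
```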
@gante Thanks for opening the PR. I'm not quite sure what you mean in the second part
I was definitely a bit mathematically imprecise earlier: while of course greedy decoding doesn't involve dividing the logits by 0 before the softmax, the output of the softmax function in the limit as the temperature goes to 0 is exactly the one-hot distribution that greedy decoding picks from. This was meant in the context that, if greedy decoding is used in the assistant model, the assistant probability used here should effectively be 1 for the selected token: transformers/src/transformers/generation/utils.py Line 4130 in 52cb403
Not always: in the two-token vocabulary case, if the speculative model outputs … Also note that if …
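For completeness, the temperature limit alluded to above, written out here (not quoted from the thread), assuming the argmax of the logits $z$ is unique:

$$\lim_{T \to 0^{+}} \mathrm{softmax}(z/T)_i = \lim_{T \to 0^{+}} \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} = \begin{cases} 1 & \text{if } i = \arg\max_j z_j \\ 0 & \text{otherwise} \end{cases}$$

That is, greedy decoding selects from the one-hot distribution that temperature-0 sampling would produce, which is why treating the greedily selected assistant token as having probability 1 is the consistent choice in the acceptance ratio.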
@dmelcer9 agreed with what you wrote 🤗
I was imprecise here indeed. It is missing ", if we were sampling from that distribution to get the next token from the assistant" :)
… like huggingface#32867) (huggingface#34553) * Update test_utils.py * formatting * Update test_utils.py * formatting * formatting * Update test_utils.py * formatting * Update test_utils.py * formatting * format * comments at standard positions
System Info
transformers version: 4.44.0
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
In the speculative sampling procedure:
transformers/src/transformers/generation/utils.py
Line 4130 in 52cb403
The probability ratio is calculated by dividing the main model's probability for the candidate token by the output probability of the assistant model.
However, the speculative model is always used greedily:
transformers/src/transformers/generation/candidate_generator.py
Line 158 in 52cb403
This is equivalent to setting the temperature to zero, so the output probability of the assistant model should always be 1 (for the selected token).
As a more concrete example, if the assistant model outputs [0.51, 0.49], then as long as the main model outputs [x >= 0.51, y <= 0.49], this will lead to the first token always being sampled by the procedure. This is evident when you use a model as its own assistant, at least for the first 5 tokens from the speculative model (there is still some randomness from the extra token generated by the main model but not the assistant).
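A small self-contained simulation of the example above makes the collapse visible (the main-model distribution [0.6, 0.4] is a hypothetical choice satisfying x >= 0.51; this is not transformers code):

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.51, 0.49])   # assistant distribution from the example above
p = np.array([0.60, 0.40])   # hypothetical main-model distribution with p[0] >= 0.51

def buggy_step():
    # Behavior being reported: the draft token is chosen greedily, but the
    # acceptance ratio still uses the soft assistant probability q[token].
    token = int(np.argmax(q))
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(2, p=residual / residual.sum()))

samples = np.array([buggy_step() for _ in range(100_000)])
print((samples == 0).mean())   # prints ~1.0 instead of the expected p[0] = 0.60
```

Sampling the draft token from q instead (with the same acceptance rule) brings the observed frequency of the first token back to roughly 0.60.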
Expected behavior
Assisted decoding should use a correct sampling method.