🔴 🚨 Resizing tokens embeddings: initialize from old embeddings' normal distribution. #33325
Conversation
This gist shows GPT2 model generation before and after adding new tokens: https://colab.research.google.com/gist/abuelnasr0/7cc8decff72ccdbcf5073ad8259ee360/embedding_resize.ipynb
Nice! Thanks @abuelnasr0. cc @Rocketknight1 in case you have some bandwidth to do a first review?
Yes, I'm happy to take it, it seems like a great PR!
My overall impression is that this is a great addition that we should definitely merge, because the existing behaviour is highly undesirable, as mentioned in the article. My issues are basically nits:
- We have a mild preference for avoiding "math variable" naming, though I don't want to bloat the code either. Maybe replace `mu` and `cov` with `mean_embedding` and `covariance` or `covariance_matrix`?
- It'd be great if we could add a small test for this. There is a `test_resize_tokens_embedding` in `tests/test_modeling_common.py`, and the test could either be appended to that, or added as a new test just below it. The test could check that new tokens are relatively close to the mean of the old tokens (rough sketch below), though please set the tolerances quite loose - it'll be a real pain if the test becomes flaky because an outlier value is sampled!
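For illustration, a minimal sketch of the kind of check meant here (the model choice, free-standing form, and tolerances are assumptions, not the test that was merged):

```python
# Hypothetical sketch: added embedding rows should land near the mean of the
# old rows once resize_token_embeddings samples them from the old distribution.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
old_weight = model.get_input_embeddings().weight.detach().clone()
old_num_tokens = old_weight.shape[0]

model.resize_token_embeddings(old_num_tokens + 10)
new_rows = model.get_input_embeddings().weight[old_num_tokens:].detach()

# Deliberately loose tolerance: the rows are sampled, so a tight check
# would turn flaky whenever an outlier value is drawn.
torch.testing.assert_close(new_rows.mean(dim=0), old_weight.mean(dim=0), atol=1e-1, rtol=0.0)
```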
Thanks @LysandreJik
LGTM! One final extra-minor nit, but I'm happy with it now. cc @LysandreJik for final review
src/transformers/modeling_utils.py (outdated)
```diff
@@ -2162,8 +2162,24 @@ def _get_resized_embeddings(
             dtype=old_embeddings.weight.dtype,
         )

-        # initialize all new embeddings (in particular added tokens)
-        self._init_weights(new_embeddings)
+        # initialize new embeddings (in particular added tokens) if `new_num_tokens` is larger
```
One thing we need to make sure of is that this doesn't have issues with deepspeed and multi-node setups! (For example, the mean would require an all-gather, I think. My intuition tells me this computation has to be done on GPU 0.)
That is absolutely right, thanks for mentioning that. I have added support for Deepspeed and tested it manually. I will add test cases for it tomorrow!
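For reference, a minimal sketch of the gathering this requires under ZeRO-3 (illustrative only; `old_embeddings` stands for the `nn.Embedding` being resized, and this is not the exact code from the PR):

```python
import deepspeed
import torch

# Under ZeRO-3 the weight is partitioned across ranks, so gather the full
# tensor before computing the statistics the new rows are sampled from.
with deepspeed.zero.GatheredParameters([old_embeddings.weight], modifier_rank=None):
    old_weight = old_embeddings.weight.to(torch.float32)
    mean_embedding = old_weight.mean(dim=0)
    centered = old_weight - mean_embedding
    covariance = centered.T @ centered / old_weight.shape[0]
```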
@ArthurZucker I have added tests for deepspeed, but they fail because the test env doesn't have deepspeed installed. What could be a solution for this? Should I add the tests to `transformers/tests/deepspeed/test_deepspeed.py`, for example?
@abuelnasr0 when tests depend on a library like deepspeed, tag them with the `@require_deepspeed` decorator so the test runner knows how to handle them! You can search the codebase for other examples where it's used and just copy the imports/decorator from there.
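A minimal example of that pattern (the test class and body here are illustrative):

```python
import unittest

from transformers.testing_utils import require_deepspeed, require_torch_gpu


class ResizeEmbeddingsDeepspeedTest(unittest.TestCase):
    @require_deepspeed  # skipped automatically when deepspeed is absent
    @require_torch_gpu
    def test_resize_embeddings_under_zero3(self):
        ...
```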
@Rocketknight1 Thank you for your response. It turns out that I was importing Deepspeed without checking if it was available. I fixed it.
@Rocketknight1 @ArthurZucker The code is ready for review now. Could you check it and see if the deepspeed tests are all right?
Sorry for the delay!
This is a breaking change, so can you add 🔴 🔴 🔴 🔴?
@ArthurZucker I have added 🔴 to the PR title. Is that what you meant? Or should I add a `logger.warning()` to the code describing the new change in initializing embedding weights?
let's make sure we warn users about it!
I didn't see this comment the first time I opened the PR! I will add a warning for users describing the new changes.
Looks great to me, it's a breaking change so let's make sure we warn users about it!
Okay, looks great! Given how big of a change this is, let's add a flag to `_get_resized_embeddings`, something like `multivariate_resizing`. WDYT?
@ArthurZucker Sorry for the delay. Last week was very busy for me.
Bias for linear lm_heads is not included in the article, but I have come to the conclusion that initializing with zero is the best solution.
If the new bias equals zero, then exp(0) = 1, and the proof in the article remains the same.
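For concreteness, that step in LaTeX (notation is mine, assuming logits of the form embedding-hidden dot product plus bias):

```latex
z_i = e_i^\top h + b_i, \qquad b_{\text{new}} = 0
\;\Longrightarrow\;
\exp(z_{\text{new}}) = \exp(e_{\text{new}}^\top h)\exp(0) = \exp(e_{\text{new}}^\top h)
```

so the softmax over the expanded vocabulary reduces to the bias-free case analyzed in the article.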
@abuelnasr0 I haven't dived into the math, but that seems counter-intuitive to me. Most models do not have a bias on the output head, but if they do, I would guess that the bias is usually large and negative for most tokens (I can't test this, though, because I can't find any examples of models with bias on the output head!) If the average bias is large and negative, then initializing new tokens with the mean embedding but a zero bias will mean that logits for the new token will be very large, right? UPDATE: I finally found an example of an old model with a bias on the lm_head (…)
@Rocketknight1 That's right. After considering the new value of the bias, this link contains a proof that should replace the "Averaging bounds the KL-divergence" part of the article: https://imgur.com/a/ZZQ3Mwk
I have initialized the new bias like this: `new_lm_head.bias.data.normal_(mean=bias_mean, std=bias_std * 1e-5)`. I have multiplied the std by 1e-5 just because the article initializes the embeddings by multiplying the covariance by 1e-5. I don't really know why, maybe to make the new embeddings a lot closer to the mean.
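For context, a sketch of where `bias_mean` and `bias_std` would come from (variable names are assumptions, not the merged code):

```python
import torch

# Statistics of the existing bias entries, in float32 for stability.
old_bias = old_lm_head.bias.data[:old_num_tokens].to(torch.float32)
bias_mean = old_bias.mean().item()
bias_std = old_bias.std().item()

# Sample the added bias entries tightly around the old mean.
new_lm_head.bias.data[old_num_tokens:].normal_(mean=bias_mean, std=bias_std * 1e-5)
```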
Thanks, this is getting into good shape! Let's help our users a little bit more, and make the code more readable!
"As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html." | ||
) | ||
added_num_tokens = new_num_tokens - old_num_tokens | ||
if is_deepspeed_zero3_enabled() and not is_quantized: |
All of this can be re-used, no? As a `self.init_tensor` which checks if deepspeed is available, computes the covariance if not given, and uses None otherwise.
I have introduced three functions:
- `self._init_added_embeddings_weights_with_mean()`
- `self._init_added_lm_head_weights_with_mean()` (which uses `self._init_added_embeddings_weights_with_mean()`)
- `self._init_added_lm_head_bias_with_mean()`

This will improve code usability for our case. What do you think? I am open to any other change. (A rough sketch of the first helper follows below.)
Also, I think that `mean_resizing` is more user-friendly and explains the whole point of the new resizing technique.
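A rough sketch of what the embeddings helper could look like (signature and details are assumptions, not the merged implementation):

```python
import torch


def _init_added_embeddings_weights_with_mean(
    self, old_embeddings, new_embeddings, old_num_tokens, added_num_tokens
):
    # Mean and covariance of the existing rows, in float32 for stability.
    old_weight = old_embeddings.weight.data[:old_num_tokens].to(torch.float32)
    mean = old_weight.mean(dim=0)
    centered = old_weight - mean
    covariance = centered.T @ centered / old_num_tokens

    # Sample added rows from N(mean, cov * 1e-5), following the article.
    # (The real code also checks that the covariance is positive definite
    # and falls back to the mean alone when it is not.)
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=1e-5 * covariance)
    new_embeddings.weight.data[old_num_tokens:] = dist.sample(
        torch.Size((added_num_tokens,))
    ).to(old_embeddings.weight.dtype)
```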
Cleaner! Thanks a lot for this update. The only thing left is to fix the related failing tests (the torch examples, for example) 🤗
@ArthurZucker Thanks for your review.
All good for me! 🤗
Congrats on the PR @abuelnasr0, and thank you!
@Rocketknight1 Thank you for the reviews and the help.
…l distribution. (huggingface#33325)

* intilize new embeddings from normal distrib
* Fix typo in comments
* Fix typo in comments
* Fix style
* Fix variables naming
* Add tests
* Fix style
* code consistency nit
* Add deepspeed support
* Add deepspeed support
* Conver embeddings weights to float32 before computations
* Add deepspeed tests
* Cover when vocab_size is smaller than embedding_size
* Style fix
* Add tests for vocab_size smaller than hiddin_size
* Style fix
* Nits in tests
* Nits in tests
* Check for deepspeed before importing it
* Increase vocab_size for positive definite covariance matrix test
* Add warning
* Add multivariate_resizing flag and implement resizing for lm_heads
* Fix typo
* Fix wrong bias indexing
* Fix bias is zero check
* remove multivariate_resizing flag from tests
* Intialize bias from old bias normal distribution
* Fixup
* Code usability
* Use mean_resizing instead of multivariate_resizing
* Fix up
* Fix comments and docs
This commit causes breaking changes we need to fix; for now we will pin the version, but we will fix it shortly. huggingface/transformers#33325
What does this PR do?
This PR initializes new token embeddings from a normal distribution with the old embeddings' mean and covariance, as described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. Thanks to this approach, you can now add new tokens to your model without affecting its generation accuracy.
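A minimal before/after of the user-facing behaviour (model and token are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<my_new_token>"])
# New rows are now sampled from the old embeddings' mean/covariance instead
# of the model's default initializer; pass mean_resizing=False to keep the
# previous behaviour.
model.resize_token_embeddings(len(tokenizer))
```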
Fixes #32948
Who can review?
cc: @ArthurZucker @LysandreJik