🔴 🚨 Resizing tokens embeddings: initialize from old embeddings' normal distribution. #33325
Conversation
This gist shows GPT2 model generation before and after adding new tokens: https://colab.research.google.com/gist/abuelnasr0/7cc8decff72ccdbcf5073ad8259ee360/embedding_resize.ipynb
Nice! Thanks @abuelnasr0. cc @Rocketknight1 in case you have some bandwidth to do a first review?
Yes, I'm happy to take it, it seems like a great PR!
My overall impression is that this is a great addition that we should definitely merge, because the existing behaviour is highly undesirable, as mentioned in the article. My issues are basically nits:
- We have a mild preference for avoiding "math variable" naming, though I don't want to bloat the code either. Maybe replace `mu` and `cov` with `mean_embedding` and `covariance` or `covariance_matrix`?
- It'd be great if we could add a small test for this. There is a `test_resize_tokens_embedding` in `tests/test_modeling_common.py`, and the test could either be appended to that, or added as a new test just below it. The test could check that new tokens are relatively close to the mean of the old tokens (rough sketch below), though please set the tolerances quite loose - it'll be a real pain if the test becomes flaky because an outlier value is sampled!
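For illustration, a minimal sketch of the kind of check meant here (the model choice, free-standing form, and tolerances are assumptions, not the test that was merged):

```python
# Hypothetical sketch: added embedding rows should land near the mean of the
# old rows once resize_token_embeddings samples them from the old distribution.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
old_weight = model.get_input_embeddings().weight.detach().clone()
old_num_tokens = old_weight.shape[0]

model.resize_token_embeddings(old_num_tokens + 10)
new_rows = model.get_input_embeddings().weight[old_num_tokens:].detach()

# Deliberately loose tolerance: the rows are sampled, so a tight check
# would turn flaky whenever an outlier value is drawn.
torch.testing.assert_close(new_rows.mean(dim=0), old_weight.mean(dim=0), atol=1e-1, rtol=0.0)
```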
Thanks @LysandreJik
LGTM! One final extra-minor nit, but I'm happy with it now. cc @LysandreJik for final review
src/transformers/modeling_utils.py (outdated)
```diff
@@ -2162,8 +2162,24 @@ def _get_resized_embeddings(
             dtype=old_embeddings.weight.dtype,
         )

-        # initialize all new embeddings (in particular added tokens)
-        self._init_weights(new_embeddings)
+        # initialize new embeddings (in particular added tokens) if `new_num_tokens` is larger
```
One thing we need to make sure of is that this doesn't have issues with deepspeed and multi-node setups! (For example, the mean would require an all-gather, I think. My intuition tells me this computation has to be done on GPU 0.)
That is absolutely right, thanks for mentioning that. I have added support for Deepspeed and tested it manually. I will add test cases for it tomorrow!
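For reference, a minimal sketch of the gathering this requires under ZeRO-3 (illustrative only; `old_embeddings` stands for the `nn.Embedding` being resized, and this is not the exact code from the PR):

```python
import deepspeed
import torch

# Under ZeRO-3 the weight is partitioned across ranks, so gather the full
# tensor before computing the statistics the new rows are sampled from.
with deepspeed.zero.GatheredParameters([old_embeddings.weight], modifier_rank=None):
    old_weight = old_embeddings.weight.to(torch.float32)
    mean_embedding = old_weight.mean(dim=0)
    centered = old_weight - mean_embedding
    covariance = centered.T @ centered / old_weight.shape[0]
```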
@ArthurZucker I have added tests for deepspeed, but they fail because the test env doesn't have deepspeed installed. What could be a solution for this? Should I add the tests to `transformers/tests/deepspeed/test_deepspeed.py`, for example?
@abuelnasr0 when tests depend on a library like deepspeed, tag them with the `@require_deepspeed` decorator so the test runner knows how to handle them! You can search the codebase for other examples where it's used and just copy the imports/decorator from there.
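A minimal example of that pattern (the test class and body here are illustrative):

```python
import unittest

from transformers.testing_utils import require_deepspeed, require_torch_gpu


class ResizeEmbeddingsDeepspeedTest(unittest.TestCase):
    @require_deepspeed  # skipped automatically when deepspeed is absent
    @require_torch_gpu
    def test_resize_embeddings_under_zero3(self):
        ...
```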
@Rocketknight1 Thank you for your response. It turns out that I was importing Deepspeed without checking if it was available. I fixed it.
@Rocketknight1 @ArthurZucker The code is ready for review now. Could you check it and see if the deepspeed tests are all right?
Sorry for the delay!
This is a breaking change, so can you add 🔴 🔴 🔴 🔴?
@ArthurZucker I have added 🔴 to the PR title. Is that what you meant? Or should I add a `logger.warning()` to the code describing the new change in initializing embedding weights?
let's make sure we warn users about it!
I didn't see this comment the first time I opened the PR! I will add a warning for users describing the new changes.
Looks great to me, it's a breaking change so let's make sure we warn users about it!
Okay, looks great! Given how big of a change this is, let's add a flag to `_get_resized_embeddings`, something like `multivariate_resizing`. WDYT?
@ArthurZucker Sorry for the delay. Last week was very busy for me.
Bias for linear lm_heads is not included in the article, but I have come to the conclusion that initializing with zero is the best solution.
If the new bias equals zero, then exp(0) = 1, and the proof in the article remains the same.
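For concreteness, that step in LaTeX (notation is mine, assuming logits of the form embedding-hidden dot product plus bias):

```latex
z_i = e_i^\top h + b_i, \qquad b_{\text{new}} = 0
\;\Longrightarrow\;
\exp(z_{\text{new}}) = \exp(e_{\text{new}}^\top h)\exp(0) = \exp(e_{\text{new}}^\top h)
```

so the softmax over the expanded vocabulary reduces to the bias-free case analyzed in the article.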
@abuelnasr0 I haven't dived into the math, but that seems counter-intuitive to me. Most models do not have a bias on the output head, but if they do, I would guess that the bias is usually large and negative for most tokens (I can't test this, though, because I can't find any examples of models with bias on the output head!) If the average bias is large and negative, then initializing new tokens with the mean embedding but a zero bias will mean that logits for the new token will be very large, right? UPDATE: I finally found an example of an old model with a bias on the lm_head (…)
@Rocketknight1 That's right. After considering the new value of the bias, this link contains a proof that should replace the "Averaging bounds the KL-divergence" part of the article: https://imgur.com/a/ZZQ3Mwk
I have initialized the new bias like this: `new_lm_head.bias.data.normal_(mean=bias_mean, std=bias_std * 1e-5)`. I have multiplied the std by 1e-5 just because the article initializes the embeddings by multiplying the covariance by 1e-5. I don't really know why, maybe to make the new embeddings a lot closer to the mean.
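For context, a sketch of where `bias_mean` and `bias_std` would come from (variable names are assumptions, not the merged code):

```python
import torch

# Statistics of the existing bias entries, in float32 for stability.
old_bias = old_lm_head.bias.data[:old_num_tokens].to(torch.float32)
bias_mean = old_bias.mean().item()
bias_std = old_bias.std().item()

# Sample the added bias entries tightly around the old mean.
new_lm_head.bias.data[old_num_tokens:].normal_(mean=bias_mean, std=bias_std * 1e-5)
```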
Thanks, this is getting into good shape! Let's help our users a little bit more, and make the code more readable!
"As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html." | ||
) | ||
added_num_tokens = new_num_tokens - old_num_tokens | ||
if is_deepspeed_zero3_enabled() and not is_quantized: |
All of this can be re-used, no? As a `self.init_tensor` which checks if deepspeed is available, computes the covariance if not given, and uses None otherwise.
I have introduced three functions:
- `self._init_added_embeddings_weights_with_mean()`
- `self._init_added_lm_head_weights_with_mean()` (which uses `self._init_added_embeddings_weights_with_mean()`)
- `self._init_added_lm_head_bias_with_mean()`

This will improve code usability for our case. What do you think? I am open to any other change. (A rough sketch of the first helper follows below.)
Also, I think that `mean_resizing` is more user-friendly and explains the whole point of the new resizing technique.
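A rough sketch of what the embeddings helper could look like (signature and details are assumptions, not the merged implementation):

```python
import torch


def _init_added_embeddings_weights_with_mean(
    self, old_embeddings, new_embeddings, old_num_tokens, added_num_tokens
):
    # Mean and covariance of the existing rows, in float32 for stability.
    old_weight = old_embeddings.weight.data[:old_num_tokens].to(torch.float32)
    mean = old_weight.mean(dim=0)
    centered = old_weight - mean
    covariance = centered.T @ centered / old_num_tokens

    # Sample added rows from N(mean, cov * 1e-5), following the article.
    # (The real code also checks that the covariance is positive definite
    # and falls back to the mean alone when it is not.)
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=1e-5 * covariance)
    new_embeddings.weight.data[old_num_tokens:] = dist.sample(
        torch.Size((added_num_tokens,))
    ).to(old_embeddings.weight.dtype)
```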
Cleaner! Thanks a lot for this update. The only thing left is to fix the related failing tests (the torch examples, for example) 🤗
@ArthurZucker Thanks for your review.
All good for me! 🤗
Congrats on the PR @abuelnasr0, and thank you!
@Rocketknight1 Thank you for the reviews and the help.
…l distribution. (huggingface#33325)

* intilize new embeddings from normal distrib
* Fix typo in comments
* Fix typo in comments
* Fix style
* Fix variables naming
* Add tests
* Fix style
* code consistency nit
* Add deepspeed support
* Add deepspeed support
* Conver embeddings weights to float32 before computations
* Add deepspeed tests
* Cover when vocab_size is smaller than embedding_size
* Style fix
* Add tests for vocab_size smaller than hiddin_size
* Style fix
* Nits in tests
* Nits in tests
* Check for deepspeed before importing it
* Increase vocab_size for positive definite covariance matrix test
* Add warning
* Add multivariate_resizing flag and implement resizing for lm_heads
* Fix typo
* Fix wrong bias indexing
* Fix bias is zero check
* remove multivariate_resizing flag from tests
* Intialize bias from old bias normal distribution
* Fixup
* Code usability
* Use mean_resizing instead of multivariate_resizing
* Fix up
* Fix comments and docs
This commit causes breaking changes we need to fix; for now we will pin the version, but we will fix it shortly. huggingface/transformers#33325
What does this PR do?
This PR initializes new token embeddings from a normal distribution with the old embeddings' mean and covariance, as described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. Thanks to this approach, you can now add new tokens to your model without affecting its generation accuracy.
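A minimal before/after of the user-facing behaviour (model and token are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<my_new_token>"])
# New rows are now sampled from the old embeddings' mean/covariance instead
# of the model's default initializer; pass mean_resizing=False to keep the
# previous behaviour.
model.resize_token_embeddings(len(tokenizer))
```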
Fixes #32948
Who can review?
cc: @ArthurZucker @LysandreJik