
Auto compile when static cache #34247

Merged: ArthurZucker merged 12 commits into main from generate-compile on Nov 22, 2024

Conversation

@ArthurZucker (Collaborator) commented on Oct 18, 2024

What does this PR do?

Adds automatic compilation of the forward pass when the static cache is used.
This can be tested with:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

device = "cuda"
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

sequence = "Hey what's the plan"

inputs = tokenizer.encode(sequence, return_tensors='pt').to(device)
model.generation_config.temperature = 1.0
model.generation_config.top_p = 1.0

# Baseline: default (dynamic) cache
t0 = time.time()
out = model.generate(inputs, do_sample=False, max_new_tokens=500)
out = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
dt = time.time() - t0
print(f'dt: {dt}', out)

# First static-cache run: pays the compilation/warmup cost
t0 = time.time()
out = model.generate(inputs, do_sample=False, max_new_tokens=500, cache_implementation="static")
out = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
dt = time.time() - t0
print(f'dt: {dt}', out)

# Second static-cache run: reuses the compiled forward
t0 = time.time()
out = model.generate(inputs, do_sample=False, max_new_tokens=500, cache_implementation="static")
out = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
dt = time.time() - t0
print(f'dt: {dt}', out)

This gives ~15 s with the dynamic cache, ~30 s for the first static-cache generate (compilation and warmup), then ~4 s for the next one.
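For intuition, here is a rough sketch of what the automatic compilation amounts to; this is illustrative only (the helper name and compile flags are assumptions), not the actual implementation in generate():

import torch

def get_decoding_forward(model, using_static_cache: bool):
    # Hypothetical helper: only compile the decoding forward when the cache
    # has static shapes, so torch.compile sees stable tensor shapes and the
    # compiled graph can be reused across decoding steps and generate() calls.
    if not using_static_cache:
        return model.forward
    if getattr(model, "_compiled_forward", None) is None:
        model._compiled_forward = torch.compile(
            model.forward, mode="reduce-overhead", fullgraph=True
        )
    return model._compiled_forward

The first static-cache call pays the compilation cost (the ~30 s above); later calls reuse the cached graph, which is where the ~4 s runs come from.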

@ArthurZucker requested a review from gante on October 18, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker marked this pull request as ready for review on November 21, 2024
@Cyrilvallez (Member) left a comment:

In principle this LGTM; in practice, however, I am not seeing any speedup (rather degraded performance) until the number of new tokens is quite high, on the order of ~2500-3000 (quick test with Llama 3.1 8B).
Not sure whether this comes only from compilation time and warmup, or from some graph breaks somewhere.
Did you compare performance a bit?
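As an aside, one quick way to check whether the regression comes from graph breaks rather than just compilation time and warmup is PyTorch's dynamo tooling. A sketch, assuming PyTorch 2.1+ and reusing the model and inputs from the snippet above:

# Option 1: log graph breaks for a whole run (script name is a placeholder)
#   TORCH_LOGS="graph_breaks" python your_benchmark.py
# Option 2: inspect a single forward call programmatically
import torch

explanation = torch._dynamo.explain(model.forward)(inputs)
print(explanation.graph_count, explanation.graph_break_count)
print(explanation.break_reasons)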

src/transformers/generation/utils.py (review comment outdated, resolved)
ArthurZucker and others added 2 commits November 22, 2024 13:07
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
@ArthurZucker changed the title from "Draft compile decoding" to "Auto compile when static cache" on Nov 22, 2024
@ArthurZucker merged commit 597efd2 into main on Nov 22, 2024
21 of 25 checks passed
@ArthurZucker deleted the generate-compile branch on November 22, 2024
@@ -3222,6 +3223,16 @@ def _sample(
unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=input_ids.device)
model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)

def model_forward(model, *args, **kwargs):
return model.forward(*args, **kwargs)

@ArthurZucker This PR breaks some tests on PEFT :(

I checked exactly why, and found that it has nothing to do with the compilation. The sole reason is that on the 2nd iteration we use this function, which effectively calls:

self.forward(**model_inputs, return_dict=True)

whereas on the first iteration (and before the PR), we would call:

self(**model_inputs, return_dict=True)

Is there any specific reason why forward is used? Using __call__ looks correct to me.
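For context, a minimal standalone illustration (a toy module, not transformers code) of why this matters: forward hooks and other machinery dispatched through __call__, which libraries such as PEFT and accelerate depend on, only run when the module is invoked via __call__, so calling forward directly bypasses them:

import torch
from torch import nn

class Toy(nn.Module):
    def forward(self, x):
        return x + 1

m = Toy()
# A forward hook that rescales the output, standing in for adapter/bookkeeping logic.
m.register_forward_hook(lambda module, args, output: output * 10)

x = torch.tensor([1.0])
print(m(x))          # tensor([20.]) -> hook ran via __call__
print(m.forward(x))  # tensor([2.])  -> hook bypassed

This is why switching from self(...) to self.forward(...) can break downstream libraries even though the compilation itself is not involved.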


Should be solved with #34923 🤗


Nice, thanks for the quick reply.

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request on Dec 5, 2024:

* generate with compile
* nits
* simple
* generate with compile
* nits
* simple
* safe
* style
* Update src/transformers/generation/utils.py

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>

* remove TOKENIZER forked warning

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request on Dec 6, 2024