
allow unused input parameters passthrough when chunking in asr pipelines #33889

Merged

Conversation

VictorAtIfInsurance
Contributor

@VictorAtIfInsurance VictorAtIfInsurance commented Oct 2, 2024

What does this PR do?

Harmonizes the behaviour for which input data is passed through automatic speech recognition pipelines.

Today you can get different return dictionary keys when using the ASR pipeline with and without the call parameter chunk_length_s. With this parameter you only get a dictionary with the transformed input data, while omitting it lets you retrieve any unused input data back in the resulting dictionary. This inconsistency can be confusing when you need to keep track of which inputs correspond to which outputs, and it is not obvious to a user why it happens.

The following code snippet highlights these differences:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import numpy as np


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

data = {"raw": np.zeros(100), "sampling_rate": 16_000, "id": 1}

chunk_output = pipe(data.copy(), chunk_length_s=30)
non_chunk_output = pipe(data.copy())

print(chunk_output)
print(non_chunk_output)
assert chunk_output.keys() == non_chunk_output.keys()

This prints the following, and the assert fails:

{'text': ' you'}
{'text': ' you', 'id': [1]}

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik
Member

cc @Rocketknight1 for pipelines and @ylacombe for ASR

@ylacombe
Contributor

ylacombe commented Oct 4, 2024

Hey @VictorAtIfInsurance, thanks for this PR!
Could you please provide some motivation for the PR, and code snippets if necessary?

From a quick look, I think we'd rather raise an error when some parameters are unused, rather than just passing them through. Especially because we want to avoid silent errors or mis-usages. For example, let's say I misspelled the much-used return_timestamps as retun_timestamps. With your PR, there would be no warnings and no errors, and the resulting dictionary would not be as expected!
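To make the concern concrete, here is a minimal, illustrative sketch (not the transformers implementation; `run_pipeline_permissive`, `run_pipeline_strict`, and `KNOWN_PARAMS` are hypothetical stand-ins) of permissive passthrough versus strict validation of unknown kwargs:

```python
# Hypothetical set of parameters the pipeline actually understands.
KNOWN_PARAMS = {"chunk_length_s", "return_timestamps"}

def run_pipeline_permissive(audio, **kwargs):
    # Unknown keys are silently carried into the output: a typo like
    # "retun_timestamps" goes unnoticed and timestamps are never produced.
    extra = {k: v for k, v in kwargs.items() if k not in KNOWN_PARAMS}
    # ... run the model on `audio` with the known kwargs ...
    return {"text": "...", **extra}

def run_pipeline_strict(audio, **kwargs):
    # Unknown keys raise immediately, so the typo is caught at call time.
    unknown = set(kwargs) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return {"text": "..."}
```

The permissive variant returns `{'text': '...', 'retun_timestamps': True}` for the misspelled call, while the strict variant raises a ValueError.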

@VictorAtIfInsurance
Contributor Author

VictorAtIfInsurance commented Oct 4, 2024

Hi,

I should've marked this as WIP. The test is not set up to properly test the intention of the PR; it should cover unused inputs, not any unused parameter as it is written now.

One of the reasons for this PR is the inconsistent behaviour when passing chunk_length_s as a parameter to the pipeline.

E.g. I would've expected the outputs from these two different pipe calls to have the same structure, i.e. that the assert would pass:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import numpy as np


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

data = {"raw": np.zeros(100), "sampling_rate": 16_000, "id": 1}

chunk_output = pipe(data.copy(), chunk_length_s=30)
non_chunk_output = pipe(data.copy())

print(chunk_output)
print(non_chunk_output)
assert chunk_output.keys() == non_chunk_output.keys()

But printing the outputs gives

{'text': ' you'}
{'text': ' you', 'id': [1]}

I.e. the unused parameter in the input "survives" when not chunking but is removed when chunking.

The behaviour of the non-chunking pipeline call comes from this line, so I would expect something similar for chunking as well:

yield {"is_last": True, **processed, **extra}
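As an illustrative sketch (a simplified stand-in, not the actual transformers `chunk_iter`), the proposed fix amounts to the chunking branch also forwarding the unused input keys (`extra`) alongside each chunk, mirroring the yield above:

```python
def chunk_iter(raw, chunk_len, extra):
    # Walk over the raw audio in fixed-size chunks, forwarding any unused
    # input keys (`extra`) with every chunk so they survive into the output.
    n = len(raw)
    for start in range(0, n, chunk_len):
        chunk = raw[start:start + chunk_len]
        is_last = start + chunk_len >= n
        yield {"is_last": is_last, "chunk": chunk, **extra}
```

With this shape, the chunked and non-chunked code paths carry the same extra keys, so the output dictionaries end up with the same structure.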

The behaviour of the non-chunking pipeline is especially useful when you generate your input data through a generator: you can call the pipeline directly with the generator and still retrieve the metadata it returns, without the extra step of first fetching the next item and calling the pipe separately. See example:

def generate_data():
    for i in range(10):
        yield {"raw": np.zeros(100), "sampling_rate": 16_000, "id": i}

print("Non chunking")
for output in pipe(generate_data()):
    print(output)

print()
print("Chunking")
for output in pipe(generate_data(), chunk_length_s=30):
    print(output)

which prints

Non chunking
{'text': ' you', 'id': [0]}
{'text': ' you', 'id': [1]}
{'text': ' you', 'id': [2]}
{'text': ' you', 'id': [3]}
{'text': ' you', 'id': [4]}
{'text': ' you', 'id': [5]}
{'text': ' you', 'id': [6]}
{'text': ' you', 'id': [7]}
{'text': ' you', 'id': [8]}
{'text': ' you', 'id': [9]}

Chunking
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}
{'text': ' you'}

In these instances it would've been super useful to pass along the unused inputs.

I hope this makes sense and that you agree with the intended behaviour. If so, I will update the tests accordingly. Sorry for posting unfinished code for review.

@VictorAtIfInsurance VictorAtIfInsurance changed the title allow unused parameter passthrough when chunking in asr pipelines [WIP] allow unused parameter passthrough when chunking in asr pipelines Oct 4, 2024
@VictorAtIfInsurance VictorAtIfInsurance changed the title [WIP] allow unused parameter passthrough when chunking in asr pipelines [WIP] allow unused input parameters passthrough when chunking in asr pipelines Oct 4, 2024
@VictorAtIfInsurance VictorAtIfInsurance changed the title [WIP] allow unused input parameters passthrough when chunking in asr pipelines allow unused input parameters passthrough when chunking in asr pipelines Oct 7, 2024
@VictorAtIfInsurance VictorAtIfInsurance force-pushed the feature/parameter-passthrough branch from 2ebdb16 to ac1116b Compare October 7, 2024 13:59
@VictorAtIfInsurance
Contributor Author

@ylacombe

I've now added some code examples explaining my issue, and updated the test accordingly:)

Contributor

@ylacombe ylacombe left a comment


Hey @VictorAtIfInsurance, this actually makes sense, especially since we pass extra parameters when we don't chunk.

I'm pretty sure it doesn't align with the pipeline philosophy though, and we could maybe ask @Rocketknight1's opinion on this!

@Rocketknight1
Member

I think keeping the output consistent with different input kwargs definitely fits with our philosophy!

@VictorAtIfInsurance we're actually working on a project to standardize the inputs and outputs for our pipeline classes, to match an inference spec that will be common across lots of projects like TGI / transformers.js / etc. See this PR.

If this PR makes the pipeline always compliant with the Hub output spec, then I'm definitely in favour of it! Can you:

  1. Rebase on the latest version of main to make sure your PR is doing the spec compliance tests
  2. Add a compliance test when chunk_length_s is passed as well?

Other than that this PR looks good to me.

@VictorAtIfInsurance
Contributor Author

VictorAtIfInsurance commented Oct 11, 2024

Hi,

Thanks for the input.

I rebased and added a check against AutomaticSpeechRecognitionOutput using the compare_pipeline_output_to_hub_spec test utility method.

As I change the output of the pipeline call, the test fails locally, which is to be expected since the output doesn't follow the spec's structure. So this is probably not the way forward. Then again, the non-chunk version of the ASR pipeline has the same problem on main; see the second assert in test_pipeline_output_on_unused_kwargs in #34087.

What would be the correct way to handle invalid kwargs with the new standardized inputs for pipelines? Raise an exception, as in test_fail_on_invalid_input_kwargs in the same PR (#34087)?

Or maybe a warning would be better in that case, since raising could break backwards compatibility for users?

What would be the best next step here, @ylacombe? I think consistency is important, but maybe my proposed change is not the way to go if it deviates from the standardized approach.
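To make the spec discussion concrete, here is a minimal, illustrative sketch of the kind of check a hub-spec compliance test performs. `AsrOutputSpec` and `unexpected_keys` are hypothetical stand-ins, not the actual AutomaticSpeechRecognitionOutput class or the compare_pipeline_output_to_hub_spec utility:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class AsrOutputSpec:
    # Illustrative spec: only these keys are allowed in a pipeline output.
    text: str
    chunks: Optional[list] = None

def unexpected_keys(output: dict, spec) -> list:
    # Return output keys that the spec does not allow, sorted for stability.
    allowed = {f.name for f in fields(spec)}
    return sorted(set(output) - allowed)
```

Under such a check, `{'text': ' you'}` passes, while `{'text': ' you', 'id': [1]}` fails with `['id']` as the unexpected key, which is exactly why the passthrough output conflicts with a strict spec.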

Contributor

@ylacombe ylacombe left a comment


Hey @VictorAtIfInsurance, indeed it is not compatible with AutomaticSpeechRecognitionOutput, which you can find here.

I think keeping the output consistent with different input kwargs definitely fits with our philosophy!

@Rocketknight1, I'm not sure I follow. AutomaticSpeechRecognitionOutput is really constraining, and some models (say, Whisper) might return many more output fields! How can we reconcile both?

@Rocketknight1
Member

@ylacombe It's okay for models to return many output fields, but we're trying to get the pipelines to have standardized inputs/outputs, so that they can fit into a standard inference API.

However, in this case, the pipeline obviously has compliance issues already, as mentioned in #34087. I didn't realize this, I'm sorry! I don't think we should delay this PR because of issues in main, so for now, I think we can skip the compliance checks I suggested, and I'll work on making the outputs compliant later in a separate PR.

@VictorAtIfInsurance
Contributor Author

VictorAtIfInsurance commented Oct 16, 2024

@ylacombe I've removed the compliance check in the test, making the test pass. Let me know if I should update anything more here.

@VictorAtIfInsurance VictorAtIfInsurance force-pushed the feature/parameter-passthrough branch from fd8d15f to ec1d73b Compare October 16, 2024 13:20
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@VictorAtIfInsurance
Contributor Author

Rebased on main, as some tests failed due to missing files.

@VictorAtIfInsurance VictorAtIfInsurance force-pushed the feature/parameter-passthrough branch 2 times, most recently from 1fd7168 to 9cdf43d Compare October 25, 2024 08:58
@VictorAtIfInsurance
Contributor Author

@Rocketknight1 Could we rerun the workflow? The last run failed because I was missing some files from main, but it should be fine now that I've rebased.

@VictorAtIfInsurance
Contributor Author

@ylacombe are we ready to merge? :)

Contributor

@ylacombe ylacombe left a comment


LGTM, let's rebase on main?

@VictorAtIfInsurance VictorAtIfInsurance force-pushed the feature/parameter-passthrough branch from 9cdf43d to 3835ade Compare October 31, 2024 13:18
@VictorAtIfInsurance
Contributor Author

Sure, though there were no merge conflicts.

@ylacombe ylacombe requested a review from ArthurZucker October 31, 2024 15:52
Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice thanks!

@VictorAtIfInsurance
Contributor Author

@ylacombe would you mind merging this PR? I do not have write access to do it myself.

@ArthurZucker ArthurZucker merged commit a0f4f31 into huggingface:main Nov 25, 2024
26 checks passed
@ArthurZucker
Collaborator

Merged, sorry for the wait!

@VictorAtIfInsurance VictorAtIfInsurance deleted the feature/parameter-passthrough branch November 26, 2024 10:01
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…nes (huggingface#33889)

* allow unused parameter passthrough when chunking in asr pipelines

* format code

* format

* run fixup

* update tests

* update parameters to pipline in test

* updates parametrs in tests

* change spelling in gitignore

* revert .gitignore to main

* add git ignore of devcontainer folder

* assert asr output follows expected inference output type

* run fixup

* Remove .devcontainer from .gitignore

* remove compliance check
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
…nes (huggingface#33889)