Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline #34562
Conversation
Hey, thanks for changing the format to be consistent with the API inference endpoint.
Actually I have a PR for updating chat templates to load images/videos directly when calling apply_chat_template (#34275). Yes, my initial idea was to use the format from TGI, but I realized that since we use the same function to load from path/URL/PIL, it would be easier to simplify the format.
Indeed, that would make things easier if we use the same format, especially since the pipeline will be used by the API endpoint. But I find the format a bit complicated myself for URLs, with unnecessary nesting. Also, the audio format from OpenAI is completely different according to this post.
From what I see, the serving libraries don't have a uniform format yet, except for the current URL format in some cases. This is a good chance for us to set a new standard and encourage developers/users to use our templates when they upload models to the Hub.
WDYT if we remove the extra nesting but keep the keys to define "url" or "path"? That is what I had in mind before simplifying it further in the linked PR.
Otherwise the format looks a bit weird to me, as follows:
- URL:
  `content: [{"type": "image_url", "image_url": {"url": "https://..." or "data:base64-image"}}]`
- PIL Image:
  `content: [{"type": "image", "image": {"image": PIL.Image}}]`
- Path (but not here yet?):
  `content: [{"type": "image_path", "image_path": {"path": "my-local-path.png"}}]`
Also cc'ing @Rocketknight1 to loop you in and get your opinion on templates.
Thanks for the review! Agreed that the Transformers chat template looks cleaner and easier to use, and nice to see it's getting even better with your PR.
@zucchini-nlp For the formats of content keys with PIL images vs. image paths etc., my preference is always to be close to the OpenAI API, and then add extra features on top of that if we need.
Thanks, I think in that case we can remove the extra nesting in dicts, which will still be close to the OpenAI format. And yeah, we can consider passing load-time params through the template dict. Currently we don't have any options except for the number of frames when loading videos or the sampling rate for audios, which are usually the same for all videos/audios in the conversation. @yonigozlan wdyt? We can update the PR with these in mind.
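As a rough illustration of the load-time-params idea (the `num_frames` and `sampling_rate` keys here are hypothetical, not an agreed-on API):

```python
# Hypothetical sketch: per-item load-time options carried inside the content
# dicts themselves, instead of one pipeline-wide setting for all videos/audios.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "clip.mp4", "num_frames": 8},           # hypothetical key
            {"type": "audio", "path": "speech.wav", "sampling_rate": 16000},  # hypothetical key
            {"type": "text", "text": "Describe what you see and hear."},
        ],
    }
]
```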
Looking again at the OpenAI API doc, it seems that the only way to pass in images is through "image_url", even if the image is local, as shown here. I also agree that the simplified transformers format, using only "image" whether we have a URL, a PIL image, or a path, makes more sense and is more user-friendly. So I don't think we need to add support for anything other than the OpenAI/TGI nested "image_url" format in this pipeline. It seems to me that we should keep the nesting in that specific case, because the goal is to be fully compatible with the OpenAI/TGI format.
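For reference, the usual workaround in the OpenAI format for a local file is to inline it as a base64 data URL; a minimal sketch:

```python
import base64

# Read a local image and wrap it in the nested OpenAI-style "image_url" entry
# as a data URL, since that format only accepts URLs.
with open("my-local-path.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

entry = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{encoded}"},
}
```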
Just to sync after the offsite so that we have the same formats as #34275, do I remember correctly that we will:
I agree the OpenAI format is kinda standard for most users and we use it in TGI/endpoints etc., but sadly it is very inconvenient and was made to support only URLs.
Sounds good to me @zucchini-nlp!
Or are we explicitly typing?
@yonigozlan no, let's keep only the explicit typing and the OpenAI format (which will be available only through pipelines for now).
Ok I see, I made some changes to reflect this in this PR then :)
```python
if isinstance(content, dict):
    if content.get("type") == "image":
        if any(key in content for key in ["image", "url", "path"]):
```
let's also have one more type here - base64
Done!
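With "base64" added, the check presumably looks something like this (a sketch, not the exact diff):

```python
content = {"type": "image", "base64": "iVBORw0KGgo..."}  # example entry

# "base64" joins the set of accepted image keys.
if isinstance(content, dict) and content.get("type") == "image":
    if any(key in content for key in ["image", "url", "path", "base64"]):
        print("recognized image entry")
```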
Thanks, LGTM! I'll change the chat template later to support the same format, so we can call apply_chat_template internally in pipelines.
LGTM, I think we need to mention this in the documentation of the pipeline as well! Quite a big win IMO
```python
if isinstance(content, dict):
    if content.get("type") == "image":
        if any(key in content for key in ["image", "url", "path"]):
```
this can be more efficient, a single for loop should be enough!
Just learned about the for...else Python structure, makes it cleaner indeed, thanks!
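For readers unfamiliar with it, Python's for...else runs the else block only when the loop finishes without hitting break; a sketch of how the key lookup above could use it:

```python
content = {"type": "image", "url": "https://example.com/cat.png"}  # example entry

# The else branch runs only if no break happened, i.e. none of the keys matched.
for key in ("image", "url", "path", "base64"):
    if key in content:
        image_source = content[key]
        break
else:
    raise ValueError(f"No image found in {content}")
```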
Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline (huggingface#34562)
* add support for openai api image_url input
* change continue to elif
* Explicitly add support for OpenAI/TGI chat format
* rewrite content to transformers chat format and add tests
* Add support for typing of image type in chat templates
* add base64 to possible image types
* refactor nesting
What does this PR do?
The Hugging Face Inference API uses an OpenAI-like chat template for "image_url" inputs: https://huggingface.co/docs/api-inference/tasks/image-text-to-text#using-the-api
This PR adds support for such image inputs in chats in the image-text-to-text pipeline.
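A minimal usage sketch (the checkpoint name is just an example; any chat-capable image-text-to-text model should work):

```python
from transformers import pipeline

# Example checkpoint; substitute any chat-capable image-text-to-text model.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            # OpenAI/TGI-style nested "image_url" entry, now accepted by the pipeline.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]

print(pipe(text=messages))
```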
Who can review?
Cc @zucchini-nlp @NielsRogge