
Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline #34562

Conversation

yonigozlan
Member

What does this PR do?

The Hugging Face Inference API uses an OpenAI-like chat format for "image_url" inputs: https://huggingface.co/docs/api-inference/tasks/image-text-to-text#using-the-api

This PR adds support for such image inputs in chat messages passed to the image-text-to-text pipeline, as sketched below.
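
For illustration, a minimal sketch of the kind of chat input this enables (the checkpoint and image URL are placeholders, not part of this PR):

    from transformers import pipeline

    # Any image-text-to-text model works here; this checkpoint is just an example.
    pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},  # placeholder URL
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    outputs = pipe(text=messages)
    print(outputs)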

Who can review?

Cc @zucchini-nlp @NielsRogge

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment

Hey, thanks for changing the format to be consistent with the API inference endpoint.

Actually, I have a PR for updating chat templates to load images/videos directly when calling apply_chat_template (#34275). My initial idea was to use the format from TGI, but since we use the same function to load from path/URL/PIL, I realized it would be easier to simplify the format.

Indeed, using the same format would make things easier, since the pipeline will be used by the API endpoint. But I find the URL format a bit complicated myself, with unnecessary nesting. Also, the audio format from OpenAI is completely different, according to this post.

From what I see, the serving libraries don't have a uniform format yet, except for the current URL format in some cases. That is a good chance for us to set a new standard and encourage developers/users to use our templates when they upload models on the Hub.

WDYT if we remove the extra nesting but leave the keys to define "url" or "path"? That is what I had in mind before simplifying it further in the linked PR.

Otherwise the format looks a bit weird to me, as follows:

  • URL: content: [{"type": "image_url", "image_url": {"url": "https://… or data:base64-image"}}]
  • PIL Image: content: [{"type": "image", "image": {"image": PIL.Image}}]
  • Path (but not here yet?): content: [{"type": "image_path", "image_path": {"path": "my-local-path.png"}}]

Also cc'ing @Rocketknight1 to get your opinion on templates.

@yonigozlan
Member Author

Thanks for the review! Agreed that the Transformers chat template looks cleaner and easier to use, and it's nice to see it getting even better with your PR.
However, the pipelines look like the one place where we need to support both the transformers and TGI chat template formats, so I can improve this PR to do so.

@Rocketknight1
Member

Rocketknight1 commented Nov 4, 2024

@zucchini-nlp I think modifications inside apply_chat_template to do image loading are a good idea! We should ensure that the input to Jinja doesn't change, though.

For the formats of content keys with PIL images vs. image paths etc., my preference is always to stay close to the OpenAI API, and then add extra features on top of that if we need them.

@zucchini-nlp
Member

Thanks, I think in that case we can remove the extra nesting in the dicts, which will still be close to the OpenAI format. And yes, we can consider passing load-time params through the template dict; currently we don't have any options except for the number of frames when loading videos or the sampling rate for audio, which are usually the same for all videos/audios in the conversation.

@yonigozlan WDYT? We can update the PR with these in mind.
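
For illustration only, such load-time params could ride along in the content entry; the key names here are assumptions, not the final #34275 API:

    # Hypothetical: per-entry load-time options carried in the template dict.
    {"type": "video", "video": "clip.mp4", "num_frames": 8}
    {"type": "audio", "audio": "speech.wav", "sampling_rate": 16000}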

@yonigozlan
Member Author

> WDYT if we remove the extra nesting but leave the keys to define "url" or "path"? That is what I had in mind before simplifying it further in the linked PR.
>
> Otherwise the format looks a bit weird to me, as follows:
>
> URL: content: [{"type": "image_url", "image_url": {"url": "https://… or data:base64-image"}}]
> PIL Image: content: [{"type": "image", "image": {"image": PIL.Image}}]
> Path (but not here yet?): content: [{"type": "image_path", "image_path": {"path": "my-local-path.png"}}]

Looking again at the OpenAI API doc, it seems that the only way to pass in images is through "image_url", even if the image is local, as shown here.
Same for TGI and the Inference API.

I also agree that the simplified transformers format, using only "image" whether we have a URL, PIL image, or path, makes more sense and is more user-friendly. So I don't think we need to support anything other than the OpenAI/TGI nested "image_url" format in this pipeline. It seems to me that we should keep the nesting in that specific case, because the goal is to be fully compatible with the OpenAI/TGI format; a sketch of the rewrite follows.
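
As a rough sketch (the helper name is hypothetical, not the merged code), the pipeline can rewrite the nested OpenAI/TGI entries into the flat transformers format internally:

    def to_transformers_content(entry):
        # Hypothetical helper: flatten one OpenAI/TGI-style content entry
        # into the transformers chat format discussed above.
        if isinstance(entry, dict) and entry.get("type") == "image_url":
            return {"type": "image", "image": entry["image_url"]["url"]}
        return entry

    message = {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
    message["content"] = [to_transformers_content(e) for e in message["content"]]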

@yonigozlan force-pushed the add-imsupport-openai-chat-in-image-text-to-text-pipeline branch from 2238e81 to bf1fa1b on November 5, 2024 17:08
@yonigozlan requested review from ArthurZucker and removed the request for ArthurZucker on November 5, 2024 17:08
@yonigozlan force-pushed the branch from 149f913 to 1b35f13 on November 5, 2024 17:43
@zucchini-nlp
Member

Just to sync after the offsite, so we have the same formats as #34275, do I remember correctly that we will:

  1. use an unnested dict, but with typing, so each image might be a URL, path, base64 string, or just a PIL image/array/tensor;
  2. support the OpenAI format, which only covers URLs, with a bit of preprocessing, but only as a possible format for pipelines/endpoints, while still keeping the general chat template format with unnested dicts?

I agree the OpenAI format is kind of a standard for most users, and we use it in TGI/endpoints etc., but sadly it is very inconvenient and is made to support only URLs.

@yonigozlan
Member Author

Sounds good to me @zucchini-nlp!
Just to make sure about 1., we would still have:

 {
     "role": "user",
     "content": [
         {
             "type": "image",
             "image": image, # where image can be a url, path, base64, or PIL/array/tensor
         },
         {"type": "text", "text": "Describe this image."},
     ],
 },

Or are we explicitly typing "image"?

@zucchini-nlp
Member

@yonigozlan no, let's keep the explicit typing only and OpenAI format (which will be available only through pipelines for now)

@yonigozlan force-pushed the branch from 04aa698 to b30adf8 on November 18, 2024 19:38
@yonigozlan
Member Author

> @yonigozlan no, let's keep the explicit typing only and OpenAI format (which will be available only through pipelines for now)

OK, I see. I made some changes to reflect this in the PR then :)

    if isinstance(content, dict):
        if content.get("type") == "image":
            if any(key in content for key in ["image", "url", "path"]):
Member
let's also have one more type here - base64
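
For context, a hedged example of what a base64 entry could look like under this key scheme (field layout assumed from the discussion, not quoted from the merged code):

    import base64

    # Encode a local file so it can be passed inline instead of via URL/path.
    with open("my-local-image.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    entry = {"type": "image", "base64": encoded}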

Member Author

Done!

@zucchini-nlp (Member) left a comment

Thanks, LGTM! I'll change the chat template later to support the same format, so we can call apply_chat_template internally in pipelines.

@ArthurZucker (Collaborator) left a comment

LGTM! I think we need to mention this in the pipeline documentation as well. Quite a big win IMO.

    if isinstance(content, dict):
        if content.get("type") == "image":
            if any(key in content for key in ["image", "url", "path"]):
Collaborator

This can be more efficient; a single for loop should be enough!

Member Author
Just learned about the for...else Python structure; it does make this cleaner, thanks!
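
For reference, a sketch of how for...else can replace the any() check plus a second lookup (surrounding pipeline logic is assumed, not quoted from the merged diff):

    for key in ("image", "url", "path", "base64"):
        if key in content:
            image = content[key]
            break
    else:
        # The loop finished without break: no image-bearing key was found.
        raise ValueError(f"Unsupported content: {content}")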

@yonigozlan merged commit b99ca4d into huggingface:main on Nov 19, 2024
26 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…o-text pipeline (huggingface#34562)

* add support for openai api image_url input

* change continue to elif

* Explicitely add support for OpenAI/TGI chat format

* rewrite content to transformers chat format and add tests

* Add support for typing of image type in chat templates

* add base64 to possible image types

* refactor nesting
BernardZach pushed a commit with the same message to innovationcore/transformers on Dec 6, 2024 (huggingface#34562).