
Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline #34562

Conversation

yonigozlan
Member

What does this PR do?

The Hugging Face Inference API uses an OpenAI-like chat format for "image_url" inputs: https://huggingface.co/docs/api-inference/tasks/image-text-to-text#using-the-api

This PR adds support for such image inputs in chat messages passed to the image-text-to-text pipeline, as sketched below.
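
For illustration, a minimal sketch of the kind of chat input this enables (the checkpoint and image URL are placeholders, not part of this PR):

    from transformers import pipeline

    # Any image-text-to-text model works here; this checkpoint is just an example.
    pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},  # placeholder URL
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    outputs = pipe(text=messages)
    print(outputs)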

Who can review?

Cc @zucchini-nlp @NielsRogge

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment

Hey, thanks for changing the format to be consistent with the API inference endpoint.

Actually, I have a PR for updating chat templates to load images/videos directly when calling apply_chat_template (#34275). My initial idea was to use the format from TGI, but since we use the same function to load from path/URL/PIL, I realized it would be easier to simplify the format.

Indeed, using the same format would make things easier, since the pipeline will be used by the API endpoint. But I find the URL format a bit complicated myself, with unnecessary nesting. Also, the audio format from OpenAI is completely different, according to this post.

From what I see, the serving libraries don't have a uniform format yet, except for the current URL format in some cases. That is a good chance for us to set a new standard and encourage developers/users to use our templates when they upload models on the Hub.

WDYT if we remove the extra nesting but leave the keys to define "url" or "path"? That is what I had in mind before simplifying it further in the linked PR.

Otherwise the format looks a bit weird to me, as follows:

  • URL: content: [{"type": "image_url", "image_url": {"url": "https://… or data:base64-image"}}]
  • PIL Image: content: [{"type": "image", "image": {"image": PIL.Image}}]
  • Path (but not here yet?): content: [{"type": "image_path", "image_path": {"path": "my-local-path.png"}}]

Also cc'ing @Rocketknight1 to get your opinion on templates.

@yonigozlan
Member Author

Thanks for the review! Agreed that the Transformers chat template looks cleaner and easier to use, and it's nice to see it getting even better with your PR.
However, the pipelines look like the one place where we need to support both the transformers and TGI chat template formats, so I can improve this PR to do so.

@Rocketknight1
Member

Rocketknight1 commented Nov 4, 2024

@zucchini-nlp I think modifications inside apply_chat_template to do image loading are a good idea! We should ensure that the input to Jinja doesn't change, though.

For the formats of content keys with PIL images vs. image paths etc., my preference is always to stay close to the OpenAI API, and then add extra features on top of that if we need them.

@zucchini-nlp
Member

Thanks, I think in that case we can remove the extra nesting in the dicts, which will still be close to the OpenAI format. And yes, we can consider passing load-time params through the template dict; currently we don't have any options except for the number of frames when loading videos or the sampling rate for audio, which are usually the same for all videos/audios in the conversation.

@yonigozlan WDYT? We can update the PR with these in mind.
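
For illustration only, such load-time params could ride along in the content entry; the key names here are assumptions, not the final #34275 API:

    # Hypothetical: per-entry load-time options carried in the template dict.
    {"type": "video", "video": "clip.mp4", "num_frames": 8}
    {"type": "audio", "audio": "speech.wav", "sampling_rate": 16000}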

@yonigozlan
Member Author

> WDYT if we remove the extra nesting but leave the keys to define "url" or "path"? That is what I had in mind before simplifying it further in the linked PR.
>
> Otherwise the format looks a bit weird to me, as follows:
>
> URL: content: [{"type": "image_url", "image_url": {"url": "https://… or data:base64-image"}}]
> PIL Image: content: [{"type": "image", "image": {"image": PIL.Image}}]
> Path (but not here yet?): content: [{"type": "image_path", "image_path": {"path": "my-local-path.png"}}]

Looking again at the OpenAI API doc, it seems that the only way to pass in images is through "image_url", even if the image is local, as shown here.
Same for TGI and the Inference API.

I also agree that the simplified transformers format, using only "image" whether we have a URL, PIL image, or path, makes more sense and is more user-friendly. So I don't think we need to support anything other than the OpenAI/TGI nested "image_url" format in this pipeline. It seems to me that we should keep the nesting in that specific case, because the goal is to be fully compatible with the OpenAI/TGI format; a sketch of the rewrite follows.
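
As a rough sketch (the helper name is hypothetical, not the merged code), the pipeline can rewrite the nested OpenAI/TGI entries into the flat transformers format internally:

    def to_transformers_content(entry):
        # Hypothetical helper: flatten one OpenAI/TGI-style content entry
        # into the transformers chat format discussed above.
        if isinstance(entry, dict) and entry.get("type") == "image_url":
            return {"type": "image", "image": entry["image_url"]["url"]}
        return entry

    message = {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
    message["content"] = [to_transformers_content(e) for e in message["content"]]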

@yonigozlan force-pushed the add-imsupport-openai-chat-in-image-text-to-text-pipeline branch from 2238e81 to bf1fa1b on November 5, 2024 17:08
@yonigozlan requested review from ArthurZucker and removed the request for ArthurZucker on November 5, 2024 17:08
@yonigozlan force-pushed the branch from 149f913 to 1b35f13 on November 5, 2024 17:43
@zucchini-nlp
Member

Just to sync after the offsite, so we have the same formats as #34275, do I remember correctly that we will:

  1. use an unnested dict, but with typing, so each image might be a URL, path, base64 string, or just a PIL image/array/tensor;
  2. support the OpenAI format, which only covers URLs, with a bit of preprocessing, but only as a possible format for pipelines/endpoints, while still keeping the general chat template format with unnested dicts?

I agree the OpenAI format is kind of a standard for most users, and we use it in TGI/endpoints etc., but sadly it is very inconvenient and is made to support only URLs.

@yonigozlan
Member Author

Sounds good to me @zucchini-nlp!
Just to make sure about 1., we would still have:

 {
     "role": "user",
     "content": [
         {
             "type": "image",
             "image": image, # where image can be a url, path, base64, or PIL/array/tensor
         },
         {"type": "text", "text": "Describe this image."},
     ],
 },

Or are we explicitly typing "image"?

@zucchini-nlp
Member

@yonigozlan no, let's keep the explicit typing only and OpenAI format (which will be available only through pipelines for now)

@yonigozlan force-pushed the branch from 04aa698 to b30adf8 on November 18, 2024 19:38
@yonigozlan
Member Author

> @yonigozlan no, let's keep the explicit typing only and OpenAI format (which will be available only through pipelines for now)

OK, I see. I made some changes to reflect this in the PR then :)

    if isinstance(content, dict):
        if content.get("type") == "image":
            if any(key in content for key in ["image", "url", "path"]):
Member
let's also have one more type here - base64
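
For context, a hedged example of what a base64 entry could look like under this key scheme (field layout assumed from the discussion, not quoted from the merged code):

    import base64

    # Encode a local file so it can be passed inline instead of via URL/path.
    with open("my-local-image.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    entry = {"type": "image", "base64": encoded}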

Member Author

Done!

@zucchini-nlp (Member) left a comment

Thanks, LGTM! I'll change the chat template later to support the same format, so we can call apply_chat_template internally in pipelines.

@ArthurZucker (Collaborator) left a comment

LGTM! I think we need to mention this in the pipeline documentation as well. Quite a big win IMO.

    if isinstance(content, dict):
        if content.get("type") == "image":
            if any(key in content for key in ["image", "url", "path"]):
Collaborator

This can be more efficient; a single for loop should be enough!

Member Author
Just learned about the for...else Python structure; it does make this cleaner, thanks!
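
For reference, a sketch of how for...else can replace the any() check plus a second lookup (surrounding pipeline logic is assumed, not quoted from the merged diff):

    for key in ("image", "url", "path", "base64"):
        if key in content:
            image = content[key]
            break
    else:
        # The loop finished without break: no image-bearing key was found.
        raise ValueError(f"Unsupported content: {content}")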

@yonigozlan merged commit b99ca4d into huggingface:main on Nov 19, 2024
26 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…o-text pipeline (huggingface#34562)

* add support for openai api image_url input

* change continue to elif

* Explicitely add support for OpenAI/TGI chat format

* rewrite content to transformers chat format and add tests

* Add support for typing of image type in chat templates

* add base64 to possible image types

* refactor nesting
BernardZach pushed a commit with the same message to innovationcore/transformers on Dec 6, 2024 (huggingface#34562).