Image Analysis (Vision)
Some models can process images and take them into account when generating their answers. This works because these models are multimodal, meaning they can understand text and visual content simultaneously. You can use this to extract text from documents, describe what is in an image, or analyze visual data.
The more context and detail you provide, the better the response, because the model understands precisely what you expect. See our Prompt Engineering Guide to learn how to write great prompts.
In addition to text files, you can also upload images (JPG, PNG) to the chat and let the model analyze them. This capability is called “vision”. The following models support it:
- GPT-4.1
- GPT-4.1 mini
- GPT-4.1 nano
- GPT-4o
- GPT-4o mini
- o1
- o3
- Claude 3.5 Sonnet
- Claude 3.7 Sonnet
- Claude Sonnet 4
- Gemini 2.5 Pro
- Gemini 2.5 Flash
Image analysis is limited to images uploaded directly in the chat; it is not yet available for images embedded in uploaded PDFs or presentations.
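
If you are curious what happens behind the scenes when an image is sent to a vision-capable model, the sketch below shows the general pattern: the image is encoded and passed alongside the text prompt in a single multimodal request. It uses the OpenAI Python SDK and GPT-4o purely as an illustration; the file name and prompt are placeholders, and the chat handles all of this for you when you upload an image.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local image as a base64 data URL so it can be embedded in the request.
# "invoice.png" is a placeholder file name for this example.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Send the text prompt and the image together as one multimodal message.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the total amount and the due date from this invoice."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```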