A few models are capable of processing images and taking them into account for their answer generation. This works because these models have multimodal capabilities, meaning they can understand both text and visual content simultaneously. You can use this to extract text from documents, describe what’s in images, or analyze visual data.